Datacraft API 

Core Classes 

class datacraft.DataSpec(raw_spec)

Class representing a DataSpec object

generator(iterations, **kwargs)

Creates a generator that will produce records or render the template for each record

Parameters

iterations (int) – number of iterations to execute
**kwargs –

Keyword Arguments

processor – (RecordProcessor): For any Record Level transformations such as templating or formatters
output – (OutputHandlerInterface): For any field or record level output
data_dir (str) – path to the data directory with csv files and such
enforce_schema (bool) – If schema validation should be applied where possible

Yields

Records or rendered template strings

Examples

>>> import datacraft
>>> raw_spec = {'name': ['bob', 'bobby', 'robert', 'bobo']}
>>> spec = datacraft.parse_spec(raw_spec)
>>> template = 'Name: {{ name }}'
>>> processor = datacraft.outputs.processor(template=template)
>>> generator = spec.generator(
...     iterations=4,
...     processor=processor)
>>> record = next(generator)
>>> print(record)
Name: bob

Return type: Generator
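
The generator contract described above (one record per iteration, produced lazily) can be sketched without datacraft itself. The names ListSupplier and generate_records below are hypothetical stand-ins, not part of the library, and the sketch leaves out the processor and output handling that the real generator accepts.

```python
from typing import Any, Dict, Iterator, List


class ListSupplier:
    """Hypothetical supplier that walks a list by iteration number."""

    def __init__(self, values: List[Any]):
        self.values = values

    def next(self, iteration: int) -> Any:
        return self.values[iteration % len(self.values)]


def generate_records(suppliers: Dict[str, ListSupplier], iterations: int) -> Iterator[Dict[str, Any]]:
    # Records are built lazily, one per iteration, as the caller pulls them
    for iteration in range(iterations):
        yield {field: supplier.next(iteration) for field, supplier in suppliers.items()}


records = generate_records({"name": ListSupplier(["bob", "bobby", "robert", "bobo"])}, iterations=4)
first = next(records)
remaining = list(records)
```

Because the records come from a generator, nothing is produced until next() is called or the generator is iterated.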

get(*args, **kwargs): Return the value for key if key is in the dictionary, else default.

items() → a set-like object providing a view on D's items

keys() → a set-like object providing a view on D's keys

pop(k[, d]) → v: remove the specified key and return the corresponding value. If the key is not found, d is returned if given; otherwise KeyError is raised

abstract to_pandas(iterations)

Converts iterations number of records into a pandas DataFrame

Parameters: iterations (int) – number of iterations to run / records to generate
Returns: DataFrame with records as rows

values() → an object providing a view on D's values

class datacraft.ValueSupplierInterface

Interface for Classes that supply values

abstract next(iteration)

Produces the next value for the given iteration

Parameters: iteration – current iteration
Returns: the next value
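
As a rough illustration of this contract, the sketch below defines a stand-alone class with a next(iteration) method. It deliberately does not import datacraft; the name CyclingSupplier is hypothetical, and a real implementation would presumably subclass datacraft.ValueSupplierInterface.

```python
from typing import Any, Sequence


class CyclingSupplier:
    """Hypothetical stand-in for a class implementing next(iteration)."""

    def __init__(self, values: Sequence[Any]):
        if not values:
            raise ValueError("values must not be empty")
        self.values = list(values)

    def next(self, iteration: int) -> Any:
        # Wrap around so any non-negative iteration maps to a value
        return self.values[iteration % len(self.values)]


supplier = CyclingSupplier(["bob", "bobby", "robert"])
first_four = [supplier.next(i) for i in range(4)]
```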

class datacraft.Loader

Parent object for loading value suppliers from specs

abstract get(key)

Retrieve the value supplier for the given field or ref key

Parameters: key (str) – key for the field or ref name
Return type: ValueSupplierInterface
Returns: the Value Supplier for the given key
Raises: SpecException – if key not found

abstract get_from_spec(field_spec)

Retrieve the value supplier for the given field spec

Parameters: field_spec (Any) – dictionary spec or literal values
Return type: ValueSupplierInterface
Returns: the Value Supplier for the given spec
Raises: SpecException – if unable to resolve the spec with an appropriate handler for the type

abstract get_ref(key)

returns the spec for the ref with the provided key

Parameters: key (str) – key to lookup ref by
Return type: dict
Returns: Ref for key

abstract property spec: get the preprocessed field specs for this loader

class datacraft.Distribution

Interface Class for a numeric distribution such as a Uniform or Gaussian distribution

abstract next_value()

get the next value for this distribution

Return type: float
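
A minimal sketch of a class honoring this interface follows. UniformDistribution and its seed parameter are hypothetical and defined here only for illustration; the sketch does not subclass datacraft.Distribution, though a real implementation presumably would.

```python
import random


class UniformDistribution:
    """Hypothetical stand-in exposing the next_value() method described above."""

    def __init__(self, start: float, end: float, seed: int = 0):
        self.start = start
        self.end = end
        # A private Random instance keeps the sketch reproducible
        self._rng = random.Random(seed)

    def next_value(self) -> float:
        return self._rng.uniform(self.start, self.end)


dist = UniformDistribution(start=10, end=100)
samples = [dist.next_value() for _ in range(5)]
```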

class datacraft.CasterInterface

Interface for Classes that cast objects to different types

abstract cast(value)

casts the value according to the specified type

Parameters: value (Any) – to cast
Return type: Any
Returns: the cast form of the value
Raises: SpecException – when unable to cast value
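
The cast contract can be sketched as follows. IntCaster and CastError are hypothetical names; CastError merely stands in for datacraft.SpecException so that the sketch runs without the library installed.

```python
from typing import Any


class CastError(Exception):
    """Hypothetical error type standing in for datacraft.SpecException."""


class IntCaster:
    """Hypothetical caster that converts values to int."""

    def cast(self, value: Any) -> Any:
        try:
            # Go through float so strings such as "42.9" are accepted
            return int(float(value))
        except (TypeError, ValueError) as err:
            raise CastError(f"unable to cast {value!r} to int") from err


caster = IntCaster()
as_int = caster.cast("42.9")
```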

class datacraft.RecordProcessor

A Class that takes in a generated record and returns it formatted as a string for output

abstract process(record)

Processes the given record into the appropriate output string

Parameters: record (Union[list, dict]) – generated record for current iteration
Return type: str
Returns: The formatted record
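
For illustration, a processor that renders each record as a JSON string might look like the sketch below. JsonRecordProcessor is a hypothetical name and the class does not inherit from datacraft.RecordProcessor here, so that the example stays self-contained.

```python
import json
from typing import Union


class JsonRecordProcessor:
    """Hypothetical processor that turns a record into one JSON line."""

    def process(self, record: Union[list, dict]) -> str:
        # sort_keys keeps the output stable regardless of field order
        return json.dumps(record, sort_keys=True)


processor = JsonRecordProcessor()
line = processor.process({"name": "bob", "age": 22})
```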

class datacraft.OutputHandlerInterface

Interface for handling generated output values

abstract finished_iterations(): This is called when all iterations have been completed

abstract finished_record(iteration, group_name, exclude_internal=False)

This is called whenever all of the fields for a record have been generated for one iteration

Parameters

iteration (int) – iteration we are on
group_name (str) – group this record is a part of
exclude_internal (bool) – if internal fields should be excluded from output record

abstract handle(key, value)

This is called each time a new value is generated for a given field

Parameters

key (str) – the field name
value (Any) – the new value for the field
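
The three callbacks suggest a call order of handle for each field, then finished_record once per iteration, then finished_iterations at the end. The sketch below assumes that order; CollectingHandler is a hypothetical class that does not depend on datacraft and ignores exclude_internal.

```python
from typing import Any, Dict, List


class CollectingHandler:
    """Hypothetical handler that gathers field values into per-iteration records."""

    def __init__(self):
        self.current: Dict[str, Any] = {}
        self.records: List[Dict[str, Any]] = []
        self.done = False

    def handle(self, key: str, value: Any):
        # Called once per generated field value
        self.current[key] = value

    def finished_record(self, iteration: int, group_name: str, exclude_internal: bool = False):
        # Snapshot the fields seen so far and start a fresh record
        self.records.append(dict(self.current))
        self.current.clear()

    def finished_iterations(self):
        self.done = True


handler = CollectingHandler()
for iteration, name in enumerate(["ann", "bob"]):
    handler.handle("name", name)
    handler.handle("id", iteration)
    handler.finished_record(iteration, group_name="people")
handler.finished_iterations()
```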

class datacraft.ResettableIterator

Iterator class that can be reset to the beginning of the iteration

abstract reset(): This will reset the iterator to the initial state for another full round of iteration

preprocessors

Functions to modify specs before the data generation process. If there is a customization you want to apply to every data spec, or an extension you added that requires modifications to specs before they are run, this is where you would register that pre-processor.

>>> @datacraft.registry.preprocessors('custom-preprocessing')
... def _preprocess_spec_to_some_end(raw_spec: dict) -> dict:
...     # return spec with any modification
...     return raw_spec

logging

Custom logging setup. Can override or modify the default logging behavior.

>>> import logging
>>> @datacraft.registry.logging('denoise')
... def _customize_logging(loglevel: str):
...     logging.getLogger('too.verbose.module').setLevel(logging.ERROR)

formats

Registered formats for output, selected when using the --format <format name> argument. Unlike other registered functions, this one is called directly to perform the required formatting function. The return value from the formatter is the new value that will be written to the configured output (default is console).

>>> @datacraft.registry.formats('custom_format')
... def _format_custom(record: dict) -> str:
...     # write to database or some other custom output, return something to write out or print to console
...     return str(record)

distribution

Different numeric distributions, normal, uniform, etc. These are used for more nuanced count values. The built-in distributions are uniform and normal.

>>> @datacraft.registry.distribution('hyperbolic_inverse_haversine')
... def _hyperbolic_inverse_haversine(mean, stddev, **kwargs):
...     # return a datacraft.Distribution, args can be custom for the defined distribution
...     raise NotImplementedError('placeholder: build and return the Distribution here')

defaults

Default values. Different types have different default values for some configs. This provides a mechanism to override them or to register other custom defaults. Read a default from the registry with datacraft.registries.get_default('var_key'); datacraft.registries.all_defaults() gives a mapping of all registered default keys and values.

>>> @datacraft.registry.defaults('special_sauce_ingredient')
... def _default_special_sauce_ingredient():
...     # return the default value (e.g. onions)
...     return 'onions'

casters

Cast or alter values in simple ways. These are all the valid forms of altering generated values after they are created outside of the ValueSupplier types. Use datacraft.registries.registered_casters() to get a list of all the currently registered ones.

>>> @datacraft.registry.casters('reverse')
... def _cast_reverse_strings():
...     # return a datacraft.CasterInterface
...     raise NotImplementedError('placeholder: build and return the CasterInterface here')

Datacraft Errors 

class datacraft.SpecException: A SpecException indicates that there is a fatal flaw with the configuration or data associated with a Data Spec or one of the described Field Specs. Common errors include undefined or misspelled references, missing or invalid configuration parameters, and invalid or missing data definitions.

class datacraft.SupplierException: A SupplierException indicates that there is a fatal flaw with the configuration or data associated with a Data Spec during run time.

class datacraft.ResourceError: A ResourceError indicates that an underlying resource such as a schema file was not able to be found or loaded.

Suppliers Module 

Factory like module for core supplier related functions.

datacraft.suppliers.alter(supplier, **kwargs)

Covers multiple suppliers that alter values if configured to do so through kwargs: cast, buffer, and decorate

Parameters

supplier – to alter if configured to do so

Keyword Arguments

cast (str) – caster to apply
prefix (str) – prefix to prepend to value, default is ''
suffix (str) – suffix to append to value, default is ''
quote (str) – string to both append and prepend to value, default is ''
buffer (bool) – if the values should be buffered
buffer_size (int) – size of buffer to use

Return type

ValueSupplierInterface

Returns

supplier with alterations

datacraft.suppliers.array_supplier(wrapped, **kwargs)

Wraps an existing supplier and always returns an array/list of elements, uses count config to determine number of items in the list

Parameters

wrapped (ValueSupplierInterface) – the underlying supplier

Keyword Arguments

count – constant, list, or weighted map
data – alias for count
count_dist – distribution in named param function style format

Return type

ValueSupplierInterface

Returns

The value supplier

Examples

>>> import datacraft
>>> pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"])
>>> returns_mostly_two = datacraft.suppliers.array_supplier(pet_supplier, count_dist="normal(mean=2, stddev=1)")
>>> pet_array = returns_mostly_two.next(0)

datacraft.suppliers.buffered(wrapped, **kwargs)

Creates a Value Supplier that buffers the results of the wrapped supplier allowing the retrieval

Parameters: wrapped (ValueSupplierInterface) – the Value Supplier to buffer values for
Keyword Arguments: buffer_size – number of produced values to buffer
Return type: ValueSupplierInterface
Returns: a buffered value supplier

datacraft.suppliers.calculate(suppliers_map, formula)

Creates a calculate supplier

Parameters

suppliers_map (Dict[str, ValueSupplierInterface]) – map of name to supplier of values for that name
formula (str) – to evaluate, should reference keys in suppliers_map

Return type

ValueSupplierInterface

Returns

supplier with calculated values

datacraft.suppliers.cast(supplier, cast_to)

Provides a cast supplier from explicit cast

Parameters

supplier (ValueSupplierInterface) – to cast results of
cast_to (str) – type to cast values to

Return type

ValueSupplierInterface

Returns

the casting supplier

datacraft.suppliers.character_class(data, **kwargs)

Creates a character class supplier for the given data

Parameters

data – set of characters to supply as values

Keyword Arguments

join_with (str) – string to join characters with, default is ''
exclude (str) – set of characters to exclude from returned values
mean (float) – mean number of characters to produce
stddev (float) – standard deviation from the mean
count (int) – number of elements in list to use
count_dist (str) – count distribution to use
min (int) – minimum number of characters to return
max (int) – maximum number of characters to return

Returns

supplier for characters

datacraft.suppliers.combine(to_combine, join_with=None, as_list=None)

Creates a value supplier that will combine the outputs of the provided suppliers in order. The default is to join the values with an empty string. Provide the join_with config param to specify a different string to join the values with. Set as_list to true, if the values should be returned as a list and not joined

Parameters

to_combine – list of suppliers to combine in order of combination
as_list (Optional[bool]) – if the results should be returned as a list
join_with (Optional[str]) – value to use to join the values

Returns

supplier that combines the values of the provided suppliers

Examples

>>> import datacraft
>>> pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"], sample=True)
>>> job_supplier = datacraft.suppliers.values(["breeder", "trainer", "fighter", "wrestler"], sample=True)
>>> interesting_jobs = datacraft.suppliers.combine([pet_supplier, job_supplier], join_with=' ')
>>> next_career = interesting_jobs.next(0)
>>> next_career
'pig wrestler'
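
The join and as_list behavior described for combine can be sketched in plain Python. ListSupplier and CombineSupplier are hypothetical stand-ins written for this illustration; they do not reproduce the library's internals.

```python
from typing import Any, List, Optional


class ListSupplier:
    """Hypothetical supplier that walks a list by iteration number."""

    def __init__(self, values: List[Any]):
        self.values = values

    def next(self, iteration: int) -> Any:
        return self.values[iteration % len(self.values)]


class CombineSupplier:
    """Hypothetical supplier that merges the outputs of several suppliers in order."""

    def __init__(self, suppliers: List[ListSupplier], join_with: Optional[str] = None, as_list: bool = False):
        self.suppliers = suppliers
        # An empty string is the join default described above
        self.join_with = "" if join_with is None else join_with
        self.as_list = as_list

    def next(self, iteration: int) -> Any:
        values = [str(supplier.next(iteration)) for supplier in self.suppliers]
        return values if self.as_list else self.join_with.join(values)


pets = ListSupplier(["dog", "cat"])
jobs = ListSupplier(["trainer", "breeder"])
joined = CombineSupplier([pets, jobs], join_with=" ").next(0)
listed = CombineSupplier([pets, jobs], as_list=True).next(1)
```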

datacraft.suppliers.constant(data)

Creates value supplier for the single value

Parameters: data (Any) – constant data to return on every iteration
Return type: ValueSupplierInterface
Returns: value supplier for the single value

Examples

>>> import datacraft
>>> single_int_supplier = datacraft.suppliers.constant(42)
>>> single_str_supplier = datacraft.suppliers.constant("42")
>>> single_float_supplier = datacraft.suppliers.constant(42.42)

datacraft.suppliers.count_supplier(**kwargs)

Creates a count supplier from the config if the count param is defined; otherwise uses the default of 1

Optionally, count or count_dist can be specified.

Valid data for counts:

integer, e.g. 1, 7, 99
list of integers: [1, 7, 99], [1], [1, 2, 1, 2, 3]
weighted map, where keys are numeric strings: {"1": 0.6, "2": 0.4}

count_dist will be interpreted as a distribution, e.g. uniform(start=10, end=100)

Keyword Arguments

count – constant, list, or weighted map
data – alias for count
count_dist (str) – distribution in named param function style format

Return type

ValueSupplierInterface

Returns

a count supplier

Examples

>>> import datacraft
>>> counts = datacraft.suppliers.count_supplier(count_dist="uniform(start=10, end=100)")
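
To make the three valid count forms concrete, the sketch below resolves each of them to an integer. The function resolve_count is hypothetical, and how datacraft itself chooses from a list is not stated here; this sketch simply picks a list element at random.

```python
import random
from typing import Dict, List, Union

CountSpec = Union[int, List[int], Dict[str, float]]


def resolve_count(count: CountSpec, rng: random.Random) -> int:
    """Hypothetical helper turning a count config into a concrete integer."""
    if isinstance(count, int):
        return count
    if isinstance(count, list):
        # Assumption: any element of the list is an acceptable count
        return rng.choice(count)
    if isinstance(count, dict):
        keys = list(count.keys())
        weights = list(count.values())
        # Keys are numeric strings, so convert the chosen key to int
        return int(rng.choices(keys, weights=weights, k=1)[0])
    raise ValueError(f"unsupported count: {count!r}")


rng = random.Random(0)
constant = resolve_count(7, rng)
from_list = resolve_count([1, 7, 99], rng)
from_weights = resolve_count({"1": 0.6, "2": 0.4}, rng)
```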

datacraft.suppliers.csv(csv_path, **kwargs)

Creates a csv supplier

Parameters

csv_path – path to csv file to supply data from

Keyword Arguments

column (int) – 1 based column number, default is 1
sample (bool) – if the values for the column should be sampled, if supported
count – constant, list, or weighted map
count_dist – distribution in named param function style format
delimiter (str) – how items are separated, default is ','
quotechar (str) – string used to quote values, default is '"'
headers (bool) – if the CSV file has a header row
sample_rows (bool) – if sampling should happen at a row level, not valid if buffering is set to true

Returns

supplier for csv field

datacraft.suppliers.date(**kwargs)

Creates a supplier that provides date values according to the specified format and ranges

Can use one of center_date or (start, end, offset, duration_days) etc.

Parameters

**kwargs –

Keyword Arguments

format (str) – Format string for dates
center_date (str) – Date matching format to center dates around
stddev_days (float) – Standard deviation in days from center date
start (str) – start date string
end (str) – end date string
offset (int) – number of days to shift the duration; positive is back, negative is forward
duration_days (str) – number of days after start, default is 30
date_format_string (str) – format for parsing dates

Return type

ValueSupplierInterface

Returns

supplier for dates

datacraft.suppliers.decorated(supplier, **kwargs)

Creates a decorated supplier around the provided one

Parameters

supplier (ValueSupplierInterface) – the supplier to alter
**kwargs –

Keyword Arguments

prefix (str) – prefix to prepend to value, default is ''
suffix (str) – suffix to append to value, default is ''
quote (str) – string to both append and prepend to value, default is ''

Return type

ValueSupplierInterface

Returns

the decorated supplier
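
The prefix, suffix, and quote options can be pictured with the sketch below. DecoratedSupplier and ListSupplier are hypothetical classes, and the order in which quote wraps the prefixed and suffixed value is an assumption of this sketch rather than documented behavior.

```python
from typing import Any, List


class ListSupplier:
    """Hypothetical supplier that walks a list by iteration number."""

    def __init__(self, values: List[Any]):
        self.values = values

    def next(self, iteration: int) -> Any:
        return self.values[iteration % len(self.values)]


class DecoratedSupplier:
    """Hypothetical wrapper that adds text around each wrapped value."""

    def __init__(self, wrapped: ListSupplier, prefix: str = "", suffix: str = "", quote: str = ""):
        self.wrapped = wrapped
        self.prefix = prefix
        self.suffix = suffix
        self.quote = quote

    def next(self, iteration: int) -> str:
        value = self.wrapped.next(iteration)
        # Assumption: quote is applied outermost, around prefix and suffix
        return f"{self.quote}{self.prefix}{value}{self.suffix}{self.quote}"


nums = ListSupplier([1, 2, 3, 4, 5])
prefixed = DecoratedSupplier(nums, prefix="you are number ").next(0)
quoted = DecoratedSupplier(nums, quote='"').next(0)
```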

Examples

>>> import datacraft
>>> nums = datacraft.suppliers.values([1, 2, 3, 4, 5])
>>> prefix_supplier = datacraft.suppliers.decorated(nums, prefix='you are number ')
>>> prefix_supplier.next(0)
'you are number 1'
>>> suffix_supplier = datacraft.suppliers.decorated(nums, suffix=' more minutes')
>>> suffix_supplier.next(0)
'1 more minutes'
>>> quoted_supplier = datacraft.suppliers.decorated(nums, quote='"')
>>> quoted_supplier.next(0)
'"1"'

datacraft.suppliers.distribution_supplier(distribution)

creates a ValueSupplier that uses the given distribution to generate values

Parameters: distribution (Distribution) – to use
Return type: ValueSupplierInterface
Returns: the value supplier

datacraft.suppliers.from_list_of_suppliers(supplier_list, modulate_iteration=True)

Returns a supplier that rotates through the provided suppliers incrementally

Parameters

supplier_list (List[ValueSupplierInterface]) – to rotate through
modulate_iteration (bool) – if the iteration number should be moded by the index of the supplier

Return type

ValueSupplierInterface

Returns

a supplier for these suppliers

Examples

>>> import datacraft
>>> nice_pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"])
>>> mean_pet_supplier = datacraft.suppliers.values(["alligator", "cobra", "mongoose", "killer bee"])
>>> pet_supplier = datacraft.suppliers.from_list_of_suppliers([nice_pet_supplier, mean_pet_supplier])
>>> pet_supplier.next(0)
'dog'
>>> pet_supplier.next(1)
'alligator'
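
One plausible reading of the rotation shown in the example is sketched below: the iteration selects which supplier takes its turn and, when modulate_iteration is true, each wrapped supplier sees its own 0, 1, 2, ... sequence. That interpretation is an assumption, and RotatingSupplier and ListSupplier are hypothetical names.

```python
from typing import Any, List


class ListSupplier:
    """Hypothetical supplier that walks a list by iteration number."""

    def __init__(self, values: List[Any]):
        self.values = values

    def next(self, iteration: int) -> Any:
        return self.values[iteration % len(self.values)]


class RotatingSupplier:
    """Hypothetical supplier that takes turns between the wrapped suppliers."""

    def __init__(self, suppliers: List[ListSupplier], modulate_iteration: bool = True):
        self.suppliers = suppliers
        self.modulate_iteration = modulate_iteration

    def next(self, iteration: int) -> Any:
        index = iteration % len(self.suppliers)
        # Assumption: when modulating, rescale the iteration for the chosen supplier
        inner = iteration // len(self.suppliers) if self.modulate_iteration else iteration
        return self.suppliers[index].next(inner)


nice = ListSupplier(["dog", "cat", "hamster"])
mean = ListSupplier(["alligator", "cobra", "mongoose"])
rotation = RotatingSupplier([nice, mean])
picks = [rotation.next(i) for i in range(4)]
```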

datacraft.suppliers.geo_lat(**kwargs)

configures geo latitude type

Keyword Arguments

precision (int) – number of digits after decimal place
start_lat (int) – minimum value for latitude
end_lat (int) – maximum value for latitude
bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]

Return type

ValueSupplierInterface

Returns

supplier for geo.lat type

datacraft.suppliers.geo_long(**kwargs)

configures geo longitude type

Keyword Arguments

precision (int) – number of digits after decimal place
start_long (int) – minimum value for longitude
end_long (int) – maximum value for longitude
bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]

Return type

ValueSupplierInterface

Returns

supplier for geo.long type

datacraft.suppliers.geo_pair(**kwargs)

Creates geo pair supplier

Keyword Arguments

precision (int) – number of digits after decimal place
lat_first (bool) – if latitude should be populated before longitude
start_lat (int) – minimum value for latitude
end_lat (int) – maximum value for latitude
start_long (int) – minimum value for longitude
end_long (int) – maximum value for longitude
bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]
as_list (bool) – if the values should be returned as a list
join_with (str) – if the values should be joined with the provided string

Returns

supplier for geo.pair type

datacraft.suppliers.ip_precise(cidr, sample=False)

Creates a value supplier that produces precise ip address from the given cidr

Parameters

cidr (str) – notation specifying ip range
sample (bool) – if the ip addresses should be sampled from the available set

Return type

ValueSupplierInterface

Returns

supplier for precise ip addresses

Examples

>>> import datacraft
>>> ips = datacraft.suppliers.ip_precise(cidr="192.168.0.0/22", sample=False)
>>> ips.next(0)
'192.168.0.0'
>>> ips.next(1)
'192.168.0.1'
>>> ips = datacraft.suppliers.ip_precise(cidr="192.168.0.0/22", sample=True)
>>> ips.next(0)
'192.168.0.127'
>>> ips.next(1)
'192.168.0.196'
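
The sequential and sampled modes can be approximated with the standard library ipaddress module. PreciseIpSupplier and its seed parameter are hypothetical, and this sketch makes no claim about how datacraft enumerates the range internally.

```python
import ipaddress
import random


class PreciseIpSupplier:
    """Hypothetical supplier producing addresses from a CIDR block."""

    def __init__(self, cidr: str, sample: bool = False, seed: int = 0):
        self.network = ipaddress.ip_network(cidr)
        self.sample = sample
        self._rng = random.Random(seed)

    def next(self, iteration: int) -> str:
        size = self.network.num_addresses
        # Sequential mode walks the block in order; sample mode picks any index
        index = self._rng.randrange(size) if self.sample else iteration % size
        return str(self.network[index])


ordered = PreciseIpSupplier("192.168.0.0/22")
sampled = PreciseIpSupplier("192.168.0.0/22", sample=True)
first_two = [ordered.next(0), ordered.next(1)]
random_ip = sampled.next(0)
```

A /22 block holds 1024 addresses, so in sequential mode iteration 256 lands on the first address of the second /24.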

datacraft.suppliers.ip_supplier(**kwargs)

Creates a value supplier for IPv4 addresses

Keyword Arguments

base (str) – base of ip address, e.g. "192", "10.", "100.100", "192.168.", "10.10.10"
cidr (str) – cidr to use, only one of /8, /16, or /24, e.g. "192.168.0.0/24", "10.0.0.0/16", "100.0.0.0/8"

Return type

ValueSupplierInterface

Returns

supplier for ip addresses

Raises

SpecException – if neither base nor cidr is provided

Examples

>>> import datacraft
>>> ips = datacraft.suppliers.ip_supplier(base="192.168.1")
>>> ips.next(0)
'192.168.1.144'

datacraft.suppliers.list_count_sampler(data, **kwargs)

Samples N elements from data list based on config. If count is provided, each iteration exactly count elements will be returned. If only min is provided, between min and the total number of elements will be provided. If only max is provided, between one and max elements will be returned. Specifying both min and max will provide a sample containing a number of elements in this range.

Parameters

data (list) – list to select subset from

Keyword Arguments

count – number of elements in list to use
count_dist – count distribution to use
min – minimum number of values to return
max – maximum number of values to return
join_with – value to join values with, default is None

Return type

ValueSupplierInterface

Returns

the supplier

Examples

>>> import datacraft
>>> pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
>>> pet_supplier = datacraft.suppliers.list_count_sampler(pet_list, min=2, max=5)
>>> pet_supplier.next(0)
['rabbit', 'cat', 'pig', 'cat']
>>> pet_supplier = datacraft.suppliers.list_count_sampler(pet_list, count_dist="normal(mean=2,stddev=1,min=1,max=3)")
>>> pet_supplier.next(0)
['pig', 'horse']
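
The count, min, and max rules described above can be sketched as follows. The function sample_elements and the parameter names min_count and max_count are hypothetical (renamed to avoid shadowing Python builtins). Sampling with replacement is an assumption, chosen because the first example output above repeats 'cat'.

```python
import random
from typing import Any, List, Optional


def sample_elements(
    data: List[Any],
    rng: random.Random,
    count: Optional[int] = None,
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
) -> List[Any]:
    """Hypothetical helper applying the count/min/max rules to a list."""
    if count is None:
        # Only min given: up to len(data); only max given: from one
        low = 1 if min_count is None else min_count
        high = len(data) if max_count is None else max_count
        count = rng.randint(low, high)
    # choices() samples with replacement, so an element may repeat
    return rng.choices(data, k=count)


rng = random.Random(0)
pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
exactly_three = sample_elements(pet_list, rng, count=3)
two_to_five = sample_elements(pet_list, rng, min_count=2, max_count=5)
```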

datacraft.suppliers.list_stats_sampler(data, **kwargs)

sample from list (or string) with stats based params

Parameters

data (Union[str, list]) – list to select subset from

Keyword Arguments

mean (float) – mean number of items/characters to produce
stddev (float) – standard deviation from the mean
count (int) – number of elements in list/characters to use
count_dist (str) – count distribution to use
min (int) – minimum number of items/characters to return
max (int) – maximum number of items/characters to return

Return type

ValueSupplierInterface

Returns

the supplier

Examples

>>> import datacraft
>>> pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
>>> pet_supplier = datacraft.suppliers.list_stats_sampler(pet_list, mean=2, stddev=1)
>>> new_pets = pet_supplier.next(0)

>>> char_config = {"min": 2, "mean": 4, "max": 8}
>>> char_supplier = datacraft.suppliers.list_stats_sampler("#!@#$%^&*()_-~", **char_config)
>>> two_to_eight_chars = char_supplier.next(0)

datacraft.suppliers.list_values(data, **kwargs)

creates a Value supplier for the list of provided data

Parameters

data (list) – for the supplier

Keyword Arguments

as_list (bool) – if data should be returned as a list
sample (bool) – if the data should be sampled instead of iterated through incrementally
count – constant, list, or weighted map
count_dist (str) – distribution in named param function style format

Return type

ValueSupplierInterface

Returns

the ValueSupplierInterface for the data list

datacraft.suppliers.mac_address(delimiter=None)

Creates a value supplier that produces mac addresses

Parameters: delimiter (Optional[str]) – how mac address pieces are separated, default is ':'
Return type: ValueSupplierInterface
Returns: supplier for mac addresses

Examples

>>> import datacraft
>>> macs = datacraft.suppliers.mac_address()
>>> macs.next(0)
'1E:D4:0F:59:41:FA'
>>> macs = datacraft.suppliers.mac_address('-')
>>> macs.next(0)
'4D-93-36-59-BD-09'

datacraft.suppliers.random_range(start, end, precision=None, count=1)

Creates a random range supplier for the start and end parameters with the given precision (number of decimal places)

Parameters

start (Union[str, int, float]) – of range
end (Union[str, int, float]) – of range
precision (Union[str, int, float, None]) – number of decimal points to keep
count (Union[int, List[int], Dict[str, float], Distribution]) – number of elements to return, default is one

Return type

ValueSupplierInterface

Returns

the value supplier for the range

Examples

>>> import datacraft
>>> num_supplier = datacraft.suppliers.random_range(5, 25, precision=3)
>>> # should be between 5 and 25 with 3 decimal places
>>> num_supplier.next(0)
8.377

datacraft.suppliers.range_supplier(start, end, step=1, **kwargs)

Creates a Value Supplier for given range of data

Parameters

start (Union[int, float]) – start of range
end (Union[int, float]) – end of range
step (Union[int, float]) – of range values

Keyword Arguments

precision (int) – Number of decimal places to use, in case of floating point range

Returns

supplier to supply ranges of values with

datacraft.suppliers.resettable(iterator)

Wraps a ResettableIterator to supply values from

Parameters: iterator (ResettableIterator) – iterator with reset() method
Returns: supplier to supply generated values with

datacraft.suppliers.select_list_subset(data, **kwargs)

Creates a supplier that selects elements from the data list based on the supplier kwargs

Parameters

data (list) – list of data values to supply values from

Keyword Arguments

mean (float) – mean number of values to include in list
stddev (float) – standard deviation from the mean

Returns

supplier to supply subsets of data list

datacraft.suppliers.templated(supplier_map, template_str)

Creates a supplier that populates the template string from the supplier map

Parameters

supplier_map (Dict[str, ValueSupplierInterface]) – map of field name -> value supplier for it
template_str – templated string to populate

Return type

ValueSupplierInterface

Returns

value supplier for template

Examples

>>> from datacraft import suppliers
>>> char_to_num_supplier = { 'char': suppliers.values(['a', 'b', 'c']), 'num': suppliers.values([1, 2, 3]) }
>>> letter_number_template = 'letter {{ char }}, number {{ num }}'
>>> supplier = suppliers.templated(char_to_num_supplier, letter_number_template)
>>> supplier.next(0)
'letter a, number 1'
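
The idea of filling a template from a supplier map can be sketched with a regular expression in place of a full template engine. This is a simplification: TemplatedSupplier, ListSupplier, and the PLACEHOLDER pattern are hypothetical, and the sketch handles only simple {{ name }} placeholders.

```python
import re
from typing import Any, Dict, List

# Matches placeholders such as {{ char }} or {{num}}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


class ListSupplier:
    """Hypothetical supplier that walks a list by iteration number."""

    def __init__(self, values: List[Any]):
        self.values = values

    def next(self, iteration: int) -> Any:
        return self.values[iteration % len(self.values)]


class TemplatedSupplier:
    """Hypothetical supplier that fills a template from named suppliers."""

    def __init__(self, supplier_map: Dict[str, ListSupplier], template_str: str):
        self.supplier_map = supplier_map
        self.template_str = template_str

    def next(self, iteration: int) -> str:
        values = {name: supplier.next(iteration) for name, supplier in self.supplier_map.items()}
        # Replace each placeholder with the value supplied for that name
        return PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), self.template_str)


supplier_map = {"char": ListSupplier(["a", "b", "c"]), "num": ListSupplier([1, 2, 3])}
templated = TemplatedSupplier(supplier_map, "letter {{ char }}, number {{ num }}")
rendered = templated.next(0)
```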

datacraft.suppliers.unicode_range(data, **kwargs)

Creates a unicode supplier for single or multiple unicode ranges

Parameters

data – list of unicode ranges to sample from

Keyword Arguments

mean (float) – mean number of values to produce
stddev (float) – standard deviation from the mean
count (int) – number of unicode characters to produce
count_dist (str) – count distribution to use
min (int) – minimum number of characters to return
max (int) – maximum number of characters to return
as_list (bool) – if the results should be returned as a list
join_with (str) – value to join values with, default is ''

Returns

supplier to supply subsets of data list

datacraft.suppliers.uuid(variant=None)

Creates a UUID Value Supplier

Parameters: variant (Optional[int]) – of uuid to use, default is 4
Return type: ValueSupplierInterface
Returns: supplier to supply uuids with

datacraft.suppliers.values(spec, **kwargs)

Based on the data, returns the appropriate values supplier. The data can be a spec, a constant, a list, or a dict, or just the raw data itself

Parameters

spec (Any) – to load values from, or raw data itself
**kwargs – extra kwargs to add to config

Keyword Arguments

as_list (bool) – if data should be returned as a list
sample (bool) – if the data should be sampled instead of iterated through incrementally
count – constant, list, or weighted map
count_dist (str) – distribution in named param function style format

Return type

ValueSupplierInterface

Returns

the values supplier for the spec

Examples

>>> import datacraft
>>> raw_spec = {"type": "values", "data": [1,2,3,5,8,13]}
>>> fib_supplier = datacraft.suppliers.values(raw_spec)
>>> fib_supplier = datacraft.suppliers.values([1,2,3,5,8,13])
>>> fib_supplier.next(0)
1
>>> weights =  {"1": 0.1, "2": 0.2, "3": 0.1, "4": 0.2, "5": 0.1, "6": 0.2, "7": 0.1}
>>> mostly_even_supplier = datacraft.suppliers.values(weights)
>>> mostly_even_supplier.next(0)
'4'

datacraft.suppliers.weighted_values(data, config=None)

Creates a weighted value supplier from the data, which is a mapping of value to the weight it should represent.

Parameters

data (dict) – for the supplier
config (Optional[dict]) – optional config (Default value = None)

Return type

ValueSupplierInterface

Returns

the supplier

Raises

SpecException – if data is empty

Examples

>>> import datacraft
>>> pets = {"dog": 0.5, "cat": 0.2, "bunny": 0.1, "hamster": 0.1, "pig": 0.05, "snake": 0.04, "_NULL_": 0.01}
>>> weighted_pet_supplier = datacraft.suppliers.weighted_values(pets)
>>> most_likely_a_dog = weighted_pet_supplier.next(0)
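
Weighted selection of this kind can be approximated with random.choices from the standard library. WeightedSupplier and its seed parameter are hypothetical, and ValueError stands in for the SpecException the library raises on empty data.

```python
import random
from collections import Counter
from typing import Any, Dict


class WeightedSupplier:
    """Hypothetical supplier drawing values in proportion to their weights."""

    def __init__(self, data: Dict[Any, float], seed: int = 0):
        if not data:
            raise ValueError("data must not be empty")
        self.choices = list(data.keys())
        self.weights = list(data.values())
        self._rng = random.Random(seed)

    def next(self, iteration: int) -> Any:
        # The iteration number is ignored; every call is an independent weighted draw
        return self._rng.choices(self.choices, weights=self.weights, k=1)[0]


pets = {"dog": 0.5, "cat": 0.2, "bunny": 0.1, "hamster": 0.1, "pig": 0.05, "snake": 0.05}
weighted = WeightedSupplier(pets)
tally = Counter(weighted.next(i) for i in range(10_000))
```

Over many draws the tally should roughly mirror the weights, with "dog" appearing about half the time.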

Builder Module 

Module for parsing and helper functions for specs

Examples

>>> import datacraft
>>> raw_spec = {
...     'name': {'type': 'values', 'data': ['ann', 'bob', 'carl']},
...     'age': {'type': 'rand_int_range', 'data': [22, 47]}
... }
>>> spec = datacraft.parse_spec(raw_spec)
>>> type(spec)
DataSpec

datacraft.builder.entries(raw_spec, iterations, **kwargs)

Creates n entries from the provided spec

Parameters

raw_spec (Dict[str, Dict]) – to create generator for
iterations (int) – number of iterations before max

Keyword Arguments

processor – (RecordProcessor): For any Record Level transformations such as templating or formatters
output – (OutputHandlerInterface): For any field or record level output
data_dir (str) – path to the data directory with csv files and such
enforce_schema (bool) – If schema validation should be applied where possible

Return type

List[dict]

Returns

the list of N entries

datacraft.builder.generator(raw_spec, iterations, **kwargs)

Creates a generator for the raw spec for the specified iterations

Parameters

raw_spec (Dict[str, Dict]) – to create generator for
iterations (int) – number of iterations before max

Keyword Arguments

processor – (RecordProcessor): For any Record Level transformations such as templating or formatters
output – (OutputHandlerInterface): For any field or record level output
data_dir (str) – path to the data directory with csv files and such
enforce_schema (bool) – If schema validation should be applied where possible

Yields

Records or rendered template strings

Return type

Generator

Returns

the generator for the provided spec

datacraft.builder.parse_spec(raw_spec)

Parses the raw spec into a DataSpec object. Takes in specs that may contain shorthand specifications.

Parameters: raw_spec (dict) – raw dictionary that conforms to JSON spec format
Return type: DataSpec
Returns: the fully parsed and loaded spec

Outputs Module 

Module holds output related classes and functions

class datacraft.outputs.WriterInterface

Interface for classes that write the generated values out

abstract write(value)

Write the value to the configured output destination

Parameters: value – to write

datacraft.outputs.file_name_engine(prefix, extension)

creates a templating engine that will produce file names based on the count

Parameters

prefix (str) – prefix for file name
extension (str) – suffix for file name

Return type

RecordProcessor

Returns

template engine for producing file names

datacraft.outputs.get_writer(outdir=None, outfile=None, overwrite=False, **kwargs)

creates the appropriate output writer from the given args and params

If no output directory is specified/configured will write to stdout

Parameters

outdir (Optional[str]) – Directory to write output to
outfile (Optional[str]) – If a specific file should be used for the output, default is to construct the name from kwargs
overwrite (bool) – Should existing files with the same name be overwritten

Keyword Arguments

outfile_prefix – the prefix of the output files i.e. test-data-
extension – to append to the file name prefix i.e. .csv
suppress_output – if output to stdout should be suppressed, only valid if outdir is None

Return type

WriterInterface

Returns

The configured Writer

Examples

>>> import datacraft
>>> csv_writer = datacraft.outputs.get_writer('./output', outfile_prefix='test-data-', extension='.csv')

datacraft.outputs.incrementing_file_writer(outdir, engine)

Creates a WriterInterface that increments the count in the file name once records_per_file have been written

Parameters

outdir (str) – output directory
engine (RecordProcessor) – to generate file names with

Return type

WriterInterface

Returns

a Writer that increments a count in the file name

datacraft.outputs.processor(template=None, format_name=None)

Configures the record level processor for either the template or for the format_name

Parameters

template (Union[str, Path, None]) – path to template or template as string
format_name (Optional[str]) – one of the valid registered formatter names

Return type

Optional[RecordProcessor]

Returns

RecordProcessor if a valid template or format_name is provided, None otherwise

Raises

SpecException – when format_name is not registered or if both template and format_name are specified

Examples

>>> import datacraft
>>> engine = datacraft.outputs.processor(template='/path/to/template.jinja')
>>> engine = datacraft.outputs.processor(template='Inline: {{ variable }}')
>>> formatter = datacraft.outputs.processor(format_name='json')
>>> formatter = datacraft.outputs.processor(format_name='my_custom_registered_format')

datacraft.outputs.record_level(record_processor, writer, records_per_file=1)

Creates an OutputHandler for record level events

Parameters

record_processor (RecordProcessor) – to process the records into strings
writer (WriterInterface) – to write the processed records
records_per_file (int) – number of records to accumulate before writing

Return type

OutputHandlerInterface

Returns

OutputHandlerInterface

datacraft.outputs.single_field(writer, output_key)

Creates an OutputHandler for field level events

Parameters

writer (WriterInterface) – to write the processed records
output_key (bool) – if the key should be output along with the value

Returns

OutputHandlerInterface

datacraft.outputs.single_file_writer(outdir, outname, overwrite)

Creates a Writer for a single output file

Parameters

outdir (str) – output directory
outname (str) – output file name
overwrite (bool) – if existing output files should be overwritten

Return type

WriterInterface

Returns

Writer for a single file

datacraft.outputs.stdout_writer()

Creates a WriterInterface that writes results to stdout

Return type: WriterInterface
Returns: writer that writes to stdout

datacraft.outputs.suppress_output_writer()

Returns a writer that suppresses the output to stdout

Return type: WriterInterface

Template Engines 

Handles loading and creating the templating engine

datacraft.template_engines.for_file(template_file)

Loads the templating engine for the template file specified

Parameters: template_file (Union[str, Path]) – to fill in, string or Path
Return type: RecordProcessor
Returns: the templating engine

datacraft.template_engines.string(template)

Returns a template engine for processing templates as strings

Return type: RecordProcessor

Spec Formatters 

Module with functions that handle formatting specs in an orderly and consistent structure i.e:

{
  "type": "<type name>",
  "data": "data stuff",
  "refs": "refs pointers",
  "config": {
    "key": "value..."
  }
}

References

JSON Custom formatting https://stackoverflow.com/questions/13249415/how-to-implement-custom-indentation-when-pretty-printing-with-the-json-module

YAML custom formatting from https://til.simonwillison.net/python/style-yaml-dump via: https://stackoverflow.com/a/8641732 and https://stackoverflow.com/a/16782282

datacraft.spec_formatters.format_json(raw_spec)

Formats the raw_spec as ordered dictionary in JSON

Parameters: raw_spec (dict) – to format
Return type: str
Returns: the ordered and formatted JSON string
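
The key ordering shown in the module description (type, data, refs, config) can be sketched with the standard json module. The names KEY_ORDER, order_field_spec, and format_spec_as_json are hypothetical, and the sketch does not reproduce the custom indentation that the real formatter applies.

```python
import json
from typing import Any, Dict

KEY_ORDER = ("type", "data", "refs", "config")


def order_field_spec(field_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper placing the well-known keys first."""
    ordered = {key: field_spec[key] for key in KEY_ORDER if key in field_spec}
    # Any keys outside the well-known four keep their original relative order
    for key, value in field_spec.items():
        ordered.setdefault(key, value)
    return ordered


def format_spec_as_json(raw_spec: Dict[str, Any]) -> str:
    ordered = {
        name: order_field_spec(spec) if isinstance(spec, dict) else spec
        for name, spec in raw_spec.items()
    }
    return json.dumps(ordered, indent=2)


formatted = format_spec_as_json({
    "age": {"config": {"cast": "int"}, "data": [22, 47], "type": "rand_int_range"},
})
```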

datacraft.spec_formatters.format_yaml(raw_spec)

Formats the raw_spec as ordered dictionary in YAML

Parameters: raw_spec (dict) – to format
Return type: str
Returns: the ordered and formatted YAML string