Datacraft API

The Datacraft API can be used to generate data in a similar way to the command line tooling. Data Specs are defined as dictionaries and follow the same JSON-based format and schemas. Most of the time you can copy a JSON spec from a file, assign it to a variable, and it will generate the same data as the command line datacraft tool.

Example:

import datacraft

spec = {
    "id": {"type": "uuid"},
    "timestamp": {"type": "date.iso.millis"},
    "handle": {"type": "cc-word", "config": { "min": 4, "max": 8, "prefix": "@" } }
}

print(*datacraft.entries(spec, 3), sep='\n')
# {'id': '40bf8be1-23d2-4e93-9b8b-b37103c4b18c', 'timestamp': '2050-12-03T20:40:03.709', 'handle': '@WPNn'}
# {'id': '3bb5789e-10d1-4ae3-ae61-e0682dad8ecf', 'timestamp': '2050-11-20T02:57:48.131', 'handle': '@kl1KUdtT'}
# {'id': '474a439a-8582-46a2-84d6-58bfbfa10bca', 'timestamp': '2050-11-29T18:08:44.971', 'handle': '@XDvquPI'}

# or if you prefer a generator
for record in datacraft.generator(spec, 3_000_000):
    pass

There are functions that can be helpful for listing the registered types, as well as for printing examples of using them with the API.

import datacraft

# List all registered types:
datacraft.registered_types()
# ['calculate', 'char_class', 'cc-ascii', 'cc-lower', '...', 'uuid', 'values', 'replace', 'regex_replace']

# Print API usage for a specific type or types
print(datacraft.type_usage('char_class', 'replace', '...'))
# Example Output
"""
-------------------------------------
replace | API Example:

import datacraft

spec = {
 "field": {
   "type": "values",
   "data": ["foo", "bar", "baz"]
 },
 "replacement": {
   "type": "replace",
   "data": {"ba": "fi"},
   "ref": "field"
 }
}

print(*datacraft.entries(spec, 3), sep='\n')

{'field': 'foo', 'replacement': 'foo'}
{'field': 'bar', 'replacement': 'fir'}
{'field': 'baz', 'replacement': 'fiz'}
"""

Core Classes

class datacraft.DataSpec(raw_spec)

Class representing a DataSpec object

abstract generator(iterations, **kwargs)

Creates a generator that will produce records or render the template for each record

Parameters:
  • iterations (int) – number of iterations to execute

  • **kwargs

Keyword Arguments:
  • processor – (RecordProcessor): For any record-level transformations such as templating or formatters

  • output – (OutputHandlerInterface): For any field or record level output

  • data_dir (str) – path to the data directory with csv files and such

  • enforce_schema (bool) – If schema validation should be applied where possible

Yields:

Records or rendered template strings

Return type:

Generator

Examples

>>> import datacraft
>>> raw_spec = {'name': ['bob', 'bobby', 'robert', 'bobo']}
>>> spec = datacraft.parse_spec(raw_spec)
>>> template = 'Name: {{ name }}'
>>> processor = datacraft.outputs.processor(template=template)
>>> generator = spec.generator(
...     iterations=4,
...     processor=processor)
>>> record = next(generator)
>>> print(record)
Name: bob
get(*args, **kwargs)

Return the value for key if key is in the dictionary, else default.

items() – a set-like object providing a view on D's items
keys() – a set-like object providing a view on D's keys
pop(k[, d]) – remove specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.

abstract to_pandas(iterations)

Converts iterations number of records into a pandas DataFrame

Parameters:

iterations (int) – number of iterations to run / records to generate

Returns:

DataFrame with records as rows

values() – an object providing a view on D's values
class datacraft.ValueSupplierInterface

Interface for Classes that supply values

abstract next(iteration)

Produces the next value for the given iteration

Parameters:

iteration – current iteration

Returns:

the next value
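Since the interface only requires a next(iteration) method, a custom supplier is easy to sketch. The class below is illustrative only (plain Python, not datacraft code): it rotates through a fixed list, driven by the iteration number so results stay deterministic.

```python
# Illustrative sketch of the ValueSupplierInterface contract
# (not datacraft code): rotate through a fixed list of values.
class RotatingSupplier:
    def __init__(self, data):
        self.data = data

    def next(self, iteration):
        # the iteration number determines which value is produced
        return self.data[iteration % len(self.data)]

supplier = RotatingSupplier(['a', 'b', 'c'])
print([supplier.next(i) for i in range(4)])  # ['a', 'b', 'c', 'a']
```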

class datacraft.Loader

Parent object for loading value suppliers from specs

abstract get(key)

Retrieve the value supplier for the given field or ref key

Parameters:

key (str) – key for the field or ref name

Return type:

ValueSupplierInterface

Returns:

the Value Supplier for the given key

Raises:

SpecException if key not found

abstract get_from_spec(field_spec)

Retrieve the value supplier for the given field spec

Parameters:

field_spec (Any) – dictionary spec or literal values

Return type:

ValueSupplierInterface

Returns:

the Value Supplier for the given spec

Raises:

SpecException if unable to resolve the spec with appropriate handler for the type

abstract get_ref(key)

returns the spec for the ref with the provided key

Parameters:

key (str) – key to lookup ref by

Return type:

dict

Returns:

Ref for key

abstract property spec

get the preprocessed field specs for this loader

class datacraft.Distribution

Interface Class for a numeric distribution such as a Uniform or Gaussian distribution

abstract next_value()

get the next value for this distribution

Return type:

float
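A concrete distribution only needs to implement next_value(). A minimal uniform sketch (illustrative only, not the library's built-in implementation):

```python
import random

# Illustrative sketch of the Distribution contract (not datacraft
# code): a uniform distribution over [start, end].
class UniformDistribution:
    def __init__(self, start, end):
        self.start = start
        self.end = end

    def next_value(self):
        # each call draws an independent value from the range
        return random.uniform(self.start, self.end)

dist = UniformDistribution(10, 100)
value = dist.next_value()  # a float between 10 and 100
```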

class datacraft.CasterInterface

Interface for Classes that cast objects to different types

abstract cast(value)

casts the value according to the specified type

Parameters:

value (Any) – to cast

Return type:

Any

Returns:

the cast form of the value

Raises:

SpecException when unable to cast value
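A caster implements a single cast(value) method. The hypothetical example below casts values to int, raising ValueError where a real caster would raise SpecException (illustrative only, not datacraft code):

```python
# Illustrative sketch of the CasterInterface contract (not datacraft
# code): cast values to int, raising on failure.
class IntCaster:
    def cast(self, value):
        try:
            return int(value)
        except (TypeError, ValueError) as err:
            # a real caster would raise datacraft.SpecException here
            raise ValueError(f'unable to cast {value!r} to int') from err

caster = IntCaster()
print(caster.cast('42'))  # 42
```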

class datacraft.RecordProcessor

A Class that takes in a generated record and returns it formatted as a string for output

abstract process(record)

Processes the given record into the appropriate output string

Parameters:

record (Union[list, dict]) – generated record for current iteration

Return type:

str

Returns:

The formatted record
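A processor only needs a process(record) method that returns a string. For example, a hypothetical processor that emits each record as a JSON line (illustrative only, not datacraft code):

```python
import json

# Illustrative sketch of the RecordProcessor contract (not datacraft
# code): format each generated record as a single JSON line.
class JsonLineProcessor:
    def process(self, record):
        # sort_keys keeps the output stable across runs
        return json.dumps(record, sort_keys=True)

processor = JsonLineProcessor()
print(processor.process({'name': 'bob', 'age': 42}))  # {"age": 42, "name": "bob"}
```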

class datacraft.OutputHandlerInterface

Interface for handling generated output values

abstract finished_iterations()

This is called when all iterations have been completed

abstract finished_record(iteration, group_name, exclude_internal=False)

This is called whenever all of the fields for a record have been generated for one iteration

Parameters:
  • iteration (int) – iteration we are on

  • group_name (str) – group this record is a part of

  • exclude_internal (bool) – if internal fields should be excluded from the output record

abstract handle(key, value)

This is called each time a new value is generated for a given field

Parameters:
  • key (str) – the field name

  • value (Any) – the new value for the field
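The call sequence is handle() for each generated field, finished_record() once per iteration, and finished_iterations() after the last record. An in-memory handler sketching that sequence (illustrative only, not datacraft code):

```python
# Illustrative sketch of the OutputHandlerInterface call sequence
# (not datacraft code): handle() per field, finished_record() per
# iteration, finished_iterations() once at the end.
class CollectingHandler:
    def __init__(self):
        self.current = {}
        self.records = []

    def handle(self, key, value):
        # called each time a new value is generated for a field
        self.current[key] = value

    def finished_record(self, iteration, group_name, exclude_internal=False):
        # called once all fields for this iteration are generated
        self.records.append(self.current)
        self.current = {}

    def finished_iterations(self):
        pass

handler = CollectingHandler()
handler.handle('id', 1)
handler.handle('name', 'bob')
handler.finished_record(0, 'default')
print(handler.records)  # [{'id': 1, 'name': 'bob'}]
```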

class datacraft.ResettableIterator

Iterator class that can be reset to the beginning of the iteration

abstract reset()

This will reset the iterator to the initial state for another full round of iteration
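A resettable iterator behaves like a normal iterator but can be rewound for another full pass. A list-backed sketch (illustrative only, not datacraft code):

```python
# Illustrative sketch of the ResettableIterator contract (not
# datacraft code): iterate a list, then reset() back to the start.
class ListResettableIterator:
    def __init__(self, data):
        self.data = data
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.index >= len(self.data):
            raise StopIteration
        value = self.data[self.index]
        self.index += 1
        return value

    def reset(self):
        # back to the initial state for another full round of iteration
        self.index = 0

it = ListResettableIterator([1, 2, 3])
first_pass = list(it)   # [1, 2, 3]
it.reset()
second_pass = list(it)  # [1, 2, 3] again after reset
```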

Registry Decorators

class datacraft.registries.Registry

Catalogue registry for types, preprocessors, logging configuration, and others

types

Types for field specs, registered functions for creating ValueSupplierInterface that will supply values for the given type

>>> @datacraft.registry.types('special_sauce')
... def _handle_special_type(field_spec: dict, loader: datacraft.Loader) -> ValueSupplierInterface:
...    # return ValueSupplierInterface from spec config
schemas

Schemas for field spec types, used to validate that the spec for a given type conforms to the schema for it

>>> @datacraft.registry.schemas('special_sauce')
... def _special_sauce_schema() -> dict:
...    # return JSON schema validating specs with type: special_sauce
usage

Usage for field spec types, used to provide command line help and examples

>>> @datacraft.registry.usage('special_sauce')
... def _special_sauce_usage() -> Union[str, dict]:
...    # return string describing how to use special_sauce
...    # or a dictionary with {"cli": "cli usage example", "api": "api usage example"}
preprocessors

Functions to modify specs before the data generation process. If there is a customization you want to apply to every data spec, or an extension you added requires modifications to specs before they are run, this is where you would register that pre-processor.

>>> @datacraft.registry.preprocessors('custom-preprocessing')
... def _preprocess_spec_to_some_end(raw_spec: dict) -> dict:
...    # return spec with any modification
logging

Custom logging setup. Can override or modify the default logging behavior.

>>> @datacraft.registry.logging('denoise')
... def _customize_logging(loglevel: str):
...     logging.getLogger('too.verbose.module').level = logging.ERROR
formats

Registered formats for output, used with the --format <format name> option. Unlike other registered functions, this one is called directly to perform the required formatting. The return value from the formatter is the value that will be written to the configured output (default is console).

>>> @datacraft.registry.formats('custom_format')
... def _format_custom(record: dict) -> str:
...     # write to database or some other custom output, return something to write out or print to console
distribution

Different numeric distributions, e.g. normal, uniform. These are used for more nuanced count values. The built-in distributions are uniform and normal.

>>> @datacraft.registry.distribution('hyperbolic_inverse_haversine')
... def _hyperbolic_inverse_haversine(mean, stddev, **kwargs):
...     # return a datacraft.Distribution, args can be custom for the defined distribution
defaults

Default values. Different types have different default values for some configs. This provides a mechanism to override or to register other custom defaults. Read a default from the registry with datacraft.registries.get_default('var_key'); datacraft.registries.all_defaults() will give a mapping of all registered default keys and values.

>>> @datacraft.registry.defaults('special_sauce_ingredient')
... def _default_special_sauce_ingredient():
...     # return the default value (i.e. onions)
casters

Cast or alter values in simple ways. These are all the valid forms of altering generated values after they are created outside of the ValueSupplier types. Use datacraft.registries.registered_casters() to get a list of all the currently registered ones.

>>> @datacraft.registry.casters('reverse')
... def _cast_reverse_strings() -> datacraft.CasterInterface:
...     # return a datacraft.CasterInterface
analyzers

Used by the Data Spec inference tool chain to analyze the list of values for a given field to try to determine an appropriate Field Spec that can be used to approximate the data values present

>>> @datacraft.registry.num_analyzers('custom')
... def _special_value_analyzer() -> datacraft.ValueListAnalyzer:
...     # return a datacraft.ValueListAnalyzer

Datacraft Errors

class datacraft.SpecException

A SpecException indicates that there is a fatal flaw with the configuration or data associated with a Data Spec or one of the described Field Specs. Common errors include undefined or misspelled references, missing or invalid configuration parameters, and invalid or missing data definitions.

class datacraft.SupplierException

A SupplierException indicates that there is a fatal flaw with the configuration or data associated with a Data Spec during run time.

class datacraft.ResourceError

A ResourceError indicates that an underlying resource, such as a schema file, was not able to be found or loaded.

Suppliers Module

Factory like module for core supplier related functions.

datacraft.suppliers.alter(supplier, **kwargs)

Covers multiple suppliers that alter values if configured to do so through kwargs: cast, buffer, and decorate

Parameters:

supplier – to alter if configured to do so

Keyword Arguments:
  • cast (str) – caster to apply

  • prefix (str) – prefix to prepend to value, default is ‘’

  • suffix (str) – suffix to append to value, default is ‘’

  • quote (str) – string to both append and prepend to value, default is ‘’

  • buffer (bool) – if the values should be buffered

  • buffer_size (int) – size of buffer to use

Return type:

ValueSupplierInterface

Returns:

supplier with alterations

datacraft.suppliers.array_supplier(wrapped, **kwargs)

Wraps an existing supplier and always returns an array/list of elements, uses count config to determine number of items in the list

Parameters:

wrapped (ValueSupplierInterface) – the underlying supplier

Keyword Arguments:
  • count – constant, list, or weighted map

  • data – alias for count

  • count_dist – distribution in named param function style format

Return type:

ValueSupplierInterface

Returns:

The value supplier

Examples

>>> import datacraft
>>> pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"])
>>> returns_mostly_two = datacraft.suppliers.array_supplier(pet_supplier, count_dist="normal(mean=2, stddev=1)")
>>> pet_array = returns_mostly_two.next(0)
datacraft.suppliers.buffered(wrapped, **kwargs)

Creates a Value Supplier that buffers the results of the wrapped supplier, allowing the retrieval of previously generated values

Parameters:

wrapped (ValueSupplierInterface) – the Value Supplier to buffer values for

Keyword Arguments:

buffer_size – number of produced values to buffer

Return type:

ValueSupplierInterface

Returns:

a buffered value supplier

datacraft.suppliers.calculate(suppliers_map, formula)

Creates a calculate supplier

Parameters:
  • suppliers_map (Dict[str, ValueSupplierInterface]) – map of name to supplier of values for that name

  • formula (str) – to evaluate, should reference keys in suppliers_map

Return type:

ValueSupplierInterface

Returns:

supplier with calculated values

datacraft.suppliers.cast(supplier, cast_to)

Provides a cast supplier from explicit cast

Parameters:
  • supplier – supplier to get values from

  • cast_to (str) – the type to cast the values to

Return type:

ValueSupplierInterface

Returns:

the casting supplier

datacraft.suppliers.character_class(data, **kwargs)

Creates a character class supplier for the given data

Parameters:

data – set of characters to supply as values

Keyword Arguments:
  • join_with (str) – string to join characters with, default is ‘’

  • exclude (str) – set of characters to exclude from returned values

  • escape (str) – set of characters to escape, i.e. " -> \" for example

  • escape_str (str) – string to use for escaping, default is \

  • mean (float) – mean number of characters to produce

  • stddev (float) – standard deviation from the mean

  • count (int) – number of elements in list to use

  • count_dist (str) – count distribution to use

  • min (int) – minimum number of characters to return

  • max (int) – maximum number of characters to return

Returns:

supplier for characters

datacraft.suppliers.combine(to_combine, join_with=None, as_list=None)

Creates a value supplier that will combine the outputs of the provided suppliers in order. The default is to join the values with an empty string. Provide the join_with config param to specify a different string to join the values with. Set as_list to true, if the values should be returned as a list and not joined

Parameters:
  • to_combine – list of suppliers to combine in order of combination

  • as_list (Optional[bool]) – if the results should be returned as a list

  • join_with (Optional[str]) – value to use to join the values

Returns:

supplier for the combined values

Examples

>>> import datacraft
>>> pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"], sample=True)
>>> job_supplier = datacraft.suppliers.values(["breeder", "trainer", "fighter", "wrestler"], sample=True)
>>> interesting_jobs = datacraft.suppliers.combine([pet_supplier, job_supplier], join_with=' ')
>>> next_career = interesting_jobs.next(0)
>>> next_career
'pig wrestler'

datacraft.suppliers.constant(data)

Creates value supplier for the single value

Parameters:

data (Any) – constant data to return on every iteration

Return type:

ValueSupplierInterface

Returns:

value supplier for the single value

Examples

>>> import datacraft
>>> single_int_supplier = datacraft.suppliers.constant(42)
>>> single_str_supplier = datacraft.suppliers.constant("42")
>>> single_float_supplier = datacraft.suppliers.constant(42.42)
datacraft.suppliers.count_supplier(**kwargs)

creates a count supplier from the config. If the count param is defined it is used, otherwise the default of 1 is used

optionally can specify count or count_dist.

valid data for counts:
  • integer i.e. 1, 7, 99

  • list of integers: [1, 7, 99], [1], [1, 2, 1, 2, 3]

  • weighted map, where keys are numeric strings: {“1”: 0.6, “2”: 0.4}
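The weighted-map form can be read as a discrete distribution over counts. A pure-Python sketch of that interpretation (illustrative only, not datacraft internals):

```python
import random

# Illustrative only: sample a count from a weighted map where keys
# are numeric strings, e.g. {"1": 0.6, "2": 0.4}; not datacraft
# internals.
def sample_count(weighted_map):
    counts = [int(k) for k in weighted_map]
    weights = list(weighted_map.values())
    # random.choices performs a weighted draw; k=1 gives one count
    return random.choices(counts, weights=weights, k=1)[0]

count = sample_count({"1": 0.6, "2": 0.4})  # 1 about 60% of the time
```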

count_dist will be interpreted as a distribution, e.g. "uniform(start=10, end=100)".

Keyword Arguments:
  • count – constant, list, or weighted map

  • data – alias for count

  • count_dist (str) – distribution in named param function style format

Return type:

ValueSupplierInterface

Returns:

a count supplier

Examples

>>> import datacraft
>>> counts = datacraft.suppliers.count_supplier(count_dist="uniform(start=10, end=100)")
datacraft.suppliers.csv(csv_path, **kwargs)

Creates a csv supplier

Parameters:

csv_path – path to csv file to supply data from

Keyword Arguments:
  • column (int) – 1-based column number, default is 1

  • sample (bool) – if the values for the column should be sampled, if supported

  • count – constant, list, or weighted map

  • count_dist – distribution in named param function style format

  • delimiter (str) – how items are separated, default is ‘,’

  • quotechar (str) – string used to quote values, default is '"'

  • headers (bool) – if the CSV file has a header row

  • sample_rows (bool) – if sampling should happen at a row level, not valid if buffering is set to true

Returns:

supplier for csv field

datacraft.suppliers.cut(supplier, start=0, end=None)

Trim output of given supplier from start to end, if length permits

Parameters:
  • supplier (ValueSupplierInterface) – to get output from

  • start (int) – where in output string to cut from (inclusive)

  • end (Optional[int]) – where to end cut (exclusive)

Returns:

The shortened version of the output string

datacraft.suppliers.date(**kwargs)

Creates a supplier that provides date values according to specified format and ranges

Can use one of center_date or (start, end, offset, duration_days) etc.

Parameters:

**kwargs

Keyword Arguments:
  • format (str) – Format string for dates

  • center_date (str) – Date matching format to center dates around

  • stddev_days (float) – Standard deviation in days from center date

  • start (str) – start date string

  • end (str) – end date string

  • offset (int) – number of days to shift the duration; positive shifts back, negative shifts forward

  • duration_days (int) – number of days after start, default is 30

Return type:

ValueSupplierInterface

Returns:

supplier for dates

datacraft.suppliers.decorated(supplier, **kwargs)

Creates a decorated supplier around the provided one

Parameters:

supplier – the supplier whose values will be decorated

Keyword Arguments:
  • prefix (str) – prefix to prepend to value, default is ‘’

  • suffix (str) – suffix to append to value, default is ‘’

  • quote (str) – string to both append and prepend to value, default is ‘’

Return type:

ValueSupplierInterface

Returns:

the decorated supplier

Examples

>>> import datacraft
>>> nums = datacraft.suppliers.values([1, 2, 3, 4, 5])
>>> prefix_supplier = datacraft.suppliers.decorated(nums, prefix='you are number ')
>>> prefix_supplier.next(0)
'you are number 1'
>>> suffix_supplier = datacraft.suppliers.decorated(nums, suffix=' more minutes')
>>> suffix_supplier.next(0)
'1 more minutes'
>>> quoted_supplier = datacraft.suppliers.decorated(nums, quote='"')
>>> quoted_supplier.next(0)
'"1"'
datacraft.suppliers.distribution_supplier(distribution)

creates a ValueSupplier that uses the given distribution to generate values

Parameters:

distribution (Distribution) – to use

Return type:

ValueSupplierInterface

Returns:

the value supplier

datacraft.suppliers.epoch_date(as_millis=False, **kwargs)

Creates a supplier that provides epoch dates

Can use one of center_date or (start, end, offset, duration_days) etc.

Parameters:

as_millis (bool) – if the timestamp should be millis since epoch, default is seconds

Keyword Arguments:
  • format (str) – Format string for date args used, required if any provided

  • center_date (str) – Date matching format to center dates around

  • stddev_days (float) – Standard deviation in days from center date

  • start (str) – start date string

  • end (str) – end date string

  • offset (int) – number of days to shift the duration; positive shifts back, negative shifts forward

  • duration_days (int) – number of days after start, default is 30

Return type:

ValueSupplierInterface

Returns:

supplier for dates

datacraft.suppliers.from_list_of_suppliers(supplier_list, modulate_iteration=True)

Returns a supplier that rotates through the provided suppliers incrementally

Parameters:
  • supplier_list (List[ValueSupplierInterface]) – to rotate through

  • modulate_iteration (bool) – if the iteration number should be modded by the index of the supplier

Return type:

ValueSupplierInterface

Returns:

a supplier for these suppliers

Examples

>>> import datacraft
>>> nice_pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"])
>>> mean_pet_supplier = datacraft.suppliers.values(["alligator", "cobra", "mongoose", "killer bee"])
>>> pet_supplier = datacraft.suppliers.from_list_of_suppliers([nice_pet_supplier, mean_pet_supplier])
>>> pet_supplier.next(0)
'dog'
>>> pet_supplier.next(1)
'alligator'
datacraft.suppliers.geo_lat(**kwargs)

configures geo latitude type

Keyword Arguments:
  • precision (int) – number of digits after decimal place

  • start_lat (int) – minimum value for latitude

  • end_lat (int) – maximum value for latitude

  • bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]

Return type:

ValueSupplierInterface

Returns:

supplier for geo.lat type

datacraft.suppliers.geo_long(**kwargs)

configures geo longitude type

Keyword Arguments:
  • precision (int) – number of digits after decimal place

  • start_long (int) – minimum value for longitude

  • end_long (int) – maximum value for longitude

  • bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]

Return type:

ValueSupplierInterface

Returns:

supplier for geo.long type

datacraft.suppliers.geo_pair(**kwargs)

Creates geo pair supplier

Keyword Arguments:
  • precision (int) – number of digits after decimal place

  • lat_first (bool) – if latitude should be populated before longitude

  • start_lat (int) – minimum value for latitude

  • end_lat (int) – maximum value for latitude

  • start_long (int) – minimum value for longitude

  • end_long (int) – maximum value for longitude

  • bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]

  • as_list (bool) – if the values should be returned as a list

  • join_with (str) – if the values should be joined with the provided string

Returns:

supplier for geo.pair type

datacraft.suppliers.ip_precise(cidr, sample=False)

Creates a value supplier that produces precise ip addresses from the given cidr

Parameters:
  • cidr (str) – notation specifying ip range

  • sample (bool) – if the ip addresses should be sampled from the available set

Return type:

ValueSupplierInterface

Returns:

supplier for precise ip addresses

Examples

>>> import datacraft
>>> ips = datacraft.suppliers.ip_precise(cidr="192.168.0.0/22", sample=False)
>>> ips.next(0)
'192.168.0.0'
>>> ips.next(1)
'192.168.0.1'
>>> ips = datacraft.suppliers.ip_precise(cidr="192.168.0.0/22", sample=True)
>>> ips.next(0)
'192.168.0.127'
>>> ips.next(1)
'192.168.0.196'
datacraft.suppliers.ip_supplier(**kwargs)

Creates a value supplier for IPv4 addresses

Keyword Arguments:
  • base (str) – base of ip address, i.e. “192”, “10.” “100.100”, “192.168.”, “10.10.10”

  • cidr (str) – cidr to use, only one of /8, /16, or /24, i.e. “192.168.0.0/24”, “10.0.0.0/16”, “100.0.0.0/8”

Return type:

ValueSupplierInterface

Returns:

supplier for ip addresses

Raises:

SpecException if one of base or cidr is not provided

Examples

>>> import datacraft
>>> ips = datacraft.suppliers.ip_supplier(base="192.168.1")
>>> ips.next(0)
'192.168.1.144'
datacraft.suppliers.list_count_sampler(data, **kwargs)

Samples N elements from the data list based on config. If count is provided, exactly count elements will be returned each iteration. If only min is provided, between min and the total number of elements will be returned. If only max is provided, between one and max elements will be returned. Specifying both min and max will return a sample with a number of elements in that range.

Parameters:

data (list) – list to select subset from

Keyword Arguments:
  • count – number of elements in list to use

  • count_dist – count distribution to use

  • min – minimum number of values to return

  • max – maximum number of values to return

  • join_with – value to join values with, default is None

Return type:

ValueSupplierInterface

Returns:

the supplier

Examples

>>> import datacraft
>>> pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
>>> pet_supplier = datacraft.suppliers.list_count_sampler(pet_list, min=2, max=5)
>>> pet_supplier.next(0)
['rabbit', 'cat', 'pig', 'cat']
>>> pet_supplier = datacraft.suppliers.list_count_sampler(pet_list, count_dist="normal(mean=2,stddev=1,min=1)")
>>> pet_supplier.next(0)
['pig', 'horse']
datacraft.suppliers.list_stats_sampler(data, **kwargs)

sample from a list (or string) with stats-based params

Parameters:

data (Union[str, list]) – list to select subset from

Keyword Arguments:
  • mean (float) – mean number of items/characters to produce

  • stddev (float) – standard deviation from the mean

  • count (int) – number of elements in list/characters to use

  • count_dist (str) – count distribution to use

  • min (int) – minimum number of items/characters to return

  • max (int) – maximum number of items/characters to return

Return type:

ValueSupplierInterface

Returns:

the supplier

Examples

>>> import datacraft
>>> pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
>>> pet_supplier = datacraft.suppliers.list_stats_sampler(pet_list, mean=2, stddev=1)
>>> new_pets = pet_supplier.next(0)
>>> char_config = {"min": 2, "mean": 4, "max": 8}
>>> char_supplier = datacraft.suppliers.list_stats_sampler("#!@#$%^&*()_-~", min=2, mean=4, max=8)
>>> two_to_eight_chars = char_supplier.next(0)
datacraft.suppliers.list_values(data, **kwargs)

creates a Value supplier for the list of provided data

Parameters:

data (list) – for the supplier

Keyword Arguments:
  • as_list (bool) – if data should be returned as a list

  • sample (bool) – if the data should be sampled instead of iterated through incrementally

  • count – constant, list, or weighted map

  • count_dist (str) – distribution in named param function style format

Return type:

ValueSupplierInterface

Returns:

the ValueSupplierInterface for the data list

datacraft.suppliers.mac_address(delimiter=None)

Creates a value supplier that produces mac addresses

Parameters:

delimiter (Optional[str]) – how mac address pieces are separated, default is ‘:’

Return type:

ValueSupplierInterface

Returns:

supplier for mac addresses

Examples

>>> import datacraft
>>> macs = datacraft.suppliers.mac_address()
>>> macs.next(0)
'1E:D4:0F:59:41:FA'
>>> macs = datacraft.suppliers.mac_address('-')
>>> macs.next(0)
'4D-93-36-59-BD-09'
datacraft.suppliers.random_range(start, end, precision=None, count=1)

Creates a random range supplier for the start and end parameters with the given precision (number of decimal places)

Parameters:
  • start (Union[str, int, float]) – of range

  • end (Union[str, int, float]) – of range

  • precision (Union[str, int, float, None]) – number of decimal points to keep

  • count (Union[int, List[int], Dict[str, float], Distribution]) – number of elements to return, default is one

Return type:

ValueSupplierInterface

Returns:

the value supplier for the range

Examples

>>> num_supplier = datacraft.suppliers.random_range(5, 25, precision=3)
>>> # should be between 5 and 25 with 3 decimal places
>>> num_supplier.next(0)
8.377
datacraft.suppliers.range_supplier(start, end, step=1, **kwargs)

Creates a Value Supplier for given range of data

Parameters:
  • start (Union[int, float]) – start of range

  • end (Union[int, float]) – end of range

  • step (Union[int, float]) – of range values

Keyword Arguments:

precision (int) – Number of decimal places to use, in case of floating point range

Returns:

supplier to supply ranges of values with

datacraft.suppliers.resettable(iterator)

Wraps a ResettableIterator to supply values from

Parameters:

iterator (ResettableIterator) – iterator with reset() method

Returns:

supplier to supply generated values with

datacraft.suppliers.sample(data, **kwargs)

Creates a supplier that selects elements from the data list based on the supplier kwargs

Parameters:

data (list) – list of data values to supply values from

Keyword Arguments:
  • mean (float) – mean number of values to include in list

  • stddev (float) – standard deviation from the mean

  • count – number of elements in list to use

  • count_dist – count distribution to use

  • min – minimum number of values to return

  • max – maximum number of values to return

  • join_with – value to join values with, default is None

Returns:

supplier to supply subsets of data list

Examples

>>> import datacraft
>>> supplier = datacraft.suppliers.sample(['dog', 'cat', 'rat'], mean=2)
>>> supplier.next(1)
['cat', 'rat']
datacraft.suppliers.templated(supplier_map, template_str)

Creates a supplier that populates the template string from the supplier map

Parameters:
  • supplier_map (Dict[str, ValueSupplierInterface]) – map of field name -> value supplier for it

  • template_str – templated string to populate

Return type:

ValueSupplierInterface

Returns:

value supplier for template

Examples

>>> from datacraft import suppliers
>>> char_to_num_supplier = { 'char': suppliers.values(['a', 'b', 'c']), 'num': suppliers.values([1, 2, 3]) }
>>> letter_number_template = 'letter {{ char }}, number {{ num }}'
>>> supplier = suppliers.templated(char_to_num_supplier, letter_number_template)
>>> supplier.next(0)
'letter a, number 1'
datacraft.suppliers.unicode_range(data, **kwargs)

Creates a unicode supplier for single or multiple unicode ranges

Parameters:

data – list of unicode ranges to sample from

Keyword Arguments:
  • mean (float) – mean number of values to produce

  • stddev (float) – standard deviation from the mean

  • count (int) – number of unicode characters to produce

  • count_dist (str) – count distribution to use

  • min (int) – minimum number of characters to return

  • max (int) – maximum number of characters to return

  • as_list (bool) – if the results should be returned as a list

  • join_with (str) – value to join values with, default is ‘’

Returns:

supplier to supply subsets of data list

datacraft.suppliers.uuid(variant=None)

Creates a UUID Value Supplier

Parameters:

variant (Optional[int]) – of uuid to use, default is 4

Return type:

ValueSupplierInterface

Returns:

supplier to supply uuids with

datacraft.suppliers.values(spec, **kwargs)

Based on the data, returns the appropriate values supplier. The spec can be a full field spec, a constant, a list, or a dict, or just the raw data.

Parameters:
  • spec (Any) – to load values from, or raw data itself

  • **kwargs – extra kwargs to add to config

Keyword Arguments:
  • as_list (bool) – if data should be returned as a list

  • sample (bool) – if the data should be sampled instead of iterated through incrementally

  • count – constant, list, or weighted map

  • count_dist (str) – distribution in named param function style format

Return type:

ValueSupplierInterface

Returns:

the values supplier for the spec

Examples

>>> import datacraft
>>> raw_spec = {"type": "values", "data": [1,2,3,5,8,13]}
>>> fib_supplier = datacraft.suppliers.values(raw_spec)
>>> fib_supplier = datacraft.suppliers.values([1,2,3,5,8,13])
>>> fib_supplier.next(0)
1
>>> weights =  {"1": 0.1, "2": 0.2, "3": 0.1, "4": 0.2, "5": 0.1, "6": 0.2, "7": 0.1}
>>> mostly_even_supplier = datacraft.suppliers.values(weights)
>>> mostly_even_supplier.next(0)
'4'
datacraft.suppliers.weighted_values(data, config=None)

Creates a weighted value supplier from the data, which is a mapping of each value to the weight it should represent.

Parameters:
  • data (dict) – for the supplier

  • config (Optional[dict]) – optional config (Default value = None)

Return type:

ValueSupplierInterface

Returns:

the supplier

Raises:

SpecException if data is empty

Examples

>>> import datacraft
>>> pets = {
... "dog": 0.5, "cat": 0.2, "bunny": 0.1, "hamster": 0.1, "pig": 0.05, "snake": 0.04, "_NULL_": 0.01
... }
>>> weighted_pet_supplier = datacraft.suppliers.weighted_values(pets)
>>> most_likely_a_dog = weighted_pet_supplier.next(0)

Builder Module

Module for parsing and helper functions for specs

datacraft.builder.entries(raw_spec, iterations, **kwargs)

Creates n entries/records from the provided spec

Parameters:
  • raw_spec (Dict[str, Dict]) – to create entries for

  • iterations (int) – number of records to generate

Keyword Arguments:
  • processor – (RecordProcessor): For any Record Level transformations such as templating or formatting

  • output – (OutputHandlerInterface): For any field or record level output

  • data_dir (str) – path to the data directory with csv files and such

  • enforce_schema (bool) – If schema validation should be applied where possible

Return type:

List[dict]

Returns:

the list of N entries/records

Examples

>>> import datacraft
>>> field_spec = {
...     "id": {"type": "uuid"},
...     "timestamp": {"type": "date.iso.millis"},
...     "handle": {"type": "cc-word", "config": { "min": 4, "max": 8, "prefix": "@" } }
... }
>>> print(*datacraft.entries(field_spec, 3), sep='\n')
{'id': '40bf8be1-23d2-4e93-9b8b-b37103c4b18c', 'timestamp': '2050-12-03T20:40:03.709', 'handle': '@WPNn'}
{'id': '3bb5789e-10d1-4ae3-ae61-e0682dad8ecf', 'timestamp': '2050-11-20T02:57:48.131', 'handle': '@kl1KUdtT'}
{'id': '474a439a-8582-46a2-84d6-58bfbfa10bca', 'timestamp': '2050-11-29T18:08:44.971', 'handle': '@XDvquPI'}
datacraft.builder.generator(raw_spec, iterations, **kwargs)

Creates a generator for the raw spec for the specified iterations

Parameters:
  • raw_spec (Dict[str, Dict]) – to create generator for

  • iterations (int) – number of records to generate

Keyword Arguments:
  • processor – (RecordProcessor): For any Record Level transformations such as templating or formatting

  • output – (OutputHandlerInterface): For any field or record level output

  • data_dir (str) – path to the data directory with csv files and such

  • enforce_schema (bool) – If schema validation should be applied where possible

Yields:

Records or rendered template strings

Return type:

Generator

Returns:

the generator for the provided spec

datacraft.builder.parse_spec(raw_spec)

Parses the raw spec into a DataSpec object. Takes in specs that may contain shorthand specifications. This is helpful if the spec is going to be reused in different scenarios. Otherwise, prefer the generator or entries functions.

Parameters:

raw_spec (dict) – raw dictionary that conforms to JSON spec format

Return type:

DataSpec

Returns:

the fully parsed and loaded spec

Examples

>>> import datacraft
>>> raw_spec = { "field": {"type": "values", "data": [10, 100, 1000]}}
>>> spec = datacraft.parse_spec(raw_spec)
>>> records = list(spec.generator(1))
datacraft.builder.values_for(field_spec, iterations, **kwargs)

Creates n values from the provided field spec

Parameters:
  • field_spec (Dict[str, Dict]) – to create values from

  • iterations (int) – number of values to generate

Keyword Arguments:

enforce_schema (bool) – If schema validation should be applied where possible

Return type:

List[dict]

Returns:

the list of N values

Outputs Module

Module holds output related classes and functions

class datacraft.outputs.WriterInterface

Interface for classes that write the generated values out

abstract write(value)

Write the value to the configured output destination

Parameters:

value – to write

datacraft.outputs.file_name_engine(prefix, extension)

creates a templating engine that will produce file names based on the count

Parameters:
  • prefix (str) – prefix for file name

  • extension (str) – suffix for file name

Return type:

RecordProcessor

Returns:

template engine for producing file names

datacraft.outputs.get_writer(outdir=None, outfile=None, overwrite=False, **kwargs)

creates the appropriate output writer from the given args and params

If no output directory is specified/configured, output will be written to stdout

Parameters:
  • outdir (Optional[str]) – Directory to write output to

  • outfile (Optional[str]) – If a specific file should be used for the output, default is to construct the name from kwargs

  • overwrite (bool) – Should existing files with the same name be overwritten

Keyword Arguments:
  • outfile_prefix – the prefix of the output files i.e. test-data-

  • extension – to append to the file name prefix i.e. .csv

  • suppress_output – if output to stdout should be suppressed, only valid if outdir is None

Return type:

WriterInterface

Returns:

The configured Writer

Examples

>>> import datacraft
>>> csv_writer = datacraft.outputs.get_writer('./output', outfile_prefix='test-data-', extension='.csv')
datacraft.outputs.incrementing_file_writer(outdir, engine)

Creates a WriterInterface that increments the count in the file name once records_per_file records have been written

Parameters:
  • outdir (str) – output directory

  • engine (RecordProcessor) – to generate file names with

Return type:

WriterInterface

Returns:

a Writer that increments a count in the file name

datacraft.outputs.processor(template=None, format_name=None)

Configures the record level processor for either the template or for the format_name

Parameters:
  • template (Union[str, Path, None]) – path to template or template as string

  • format_name (Optional[str]) – one of the valid registered formatter names

Return type:

Optional[RecordProcessor]

Returns:

RecordProcessor if a valid template or format_name is provided, None otherwise

Raises:

SpecException when format_name is not registered or when both template and format_name are specified

Examples

>>> import datacraft
>>> engine = datacraft.outputs.processor(template='/path/to/template.jinja')
>>> engine = datacraft.outputs.processor(template='Inline: {{ variable }}')
>>> formatter = datacraft.outputs.processor(format_name='json')
>>> formatter = datacraft.outputs.processor(format_name='my_custom_registered_format')
datacraft.outputs.record_level(record_processor, writer, records_per_file=1)

Creates an OutputHandler for record level events

Parameters:
  • record_processor (RecordProcessor) – to process the records into strings

  • writer (WriterInterface) – to write the processed records

  • records_per_file (int) – number of records to accumulate before writing

Return type:

OutputHandlerInterface

Returns:

OutputHandlerInterface

datacraft.outputs.single_field(writer, output_key)

Creates an OutputHandler for field level events

Parameters:
  • writer (WriterInterface) – to write the processed records

  • output_key (bool) – if the key should be output along with the value

Returns:

OutputHandlerInterface

datacraft.outputs.single_file_writer(outdir, outname, overwrite)

Creates a Writer for a single output file

Parameters:
  • outdir (str) – output directory

  • outname (str) – output file name

  • overwrite (bool) – if existing output files should be overwritten

Return type:

WriterInterface

Returns:

Writer for a single file

datacraft.outputs.stdout_writer()

Creates a WriterInterface that writes results to stdout

Return type:

WriterInterface

Returns:

writer that writes to stdout

datacraft.outputs.suppress_output_writer()

Returns a writer that suppresses the output to stdout

Return type:

WriterInterface

Template Engines

Handles loading and creating the templating engine

datacraft.template_engines.for_file(template_file)

Loads the templating engine for the template file specified

Parameters:

template_file (Union[str, Path]) – to fill in, string or Path

Return type:

RecordProcessor

Returns:

the templating engine

datacraft.template_engines.string(template)

Returns a template engine for processing templates as strings

Return type:

RecordProcessor

Spec Formatters

data spec formatting

Module with functions that handle formatting specs in an orderly and consistent structure, i.e.:

{
  "type": "<type name>",
  "data": "data stuff",
  "refs": "refs pointers",
  "config": {
    "key": "value..."
  }
}

References

JSON Custom formatting https://stackoverflow.com/questions/13249415/how-to-implement-custom-indentation-when-pretty-printing-with-the-json-module

YAML custom formatting from https://til.simonwillison.net/python/style-yaml-dump via: https://stackoverflow.com/a/8641732 and https://stackoverflow.com/a/16782282

datacraft.spec_formatters.format_json(raw_spec)

Formats the raw_spec as ordered dictionary in JSON

Parameters:

raw_spec (dict) – to format

Return type:

str

Returns:

the ordered and formatted JSON string

datacraft.spec_formatters.format_yaml(raw_spec)

Formats the raw_spec as ordered dictionary in YAML

Parameters:

raw_spec (dict) – to format

Return type:

str

Returns:

the ordered and formatted YAML string

Data Spec Inference

class datacraft.infer.RefsAggregator

Class for adding references to when building inferred specs

add(key, val)

Add spec to refs section with given key/name

Parameters:
  • key (str) – Name used to reference this spec

  • val (dict) – Field Spec for this key/name

class datacraft.infer.ValueListAnalyzer

Interface class for implementations that infer a Field Spec from a list of values

abstract compatibility_score(values)

Check if the analyzer is compatible with the provided values.

Parameters:

values (Generator[Any, None, None]) – Generator producing values to check.

Returns:

0 for not compatible, with steps up to 1 for fully compatible

Return type:

int

abstract generate_spec(name, values, refs, **kwargs)

Generate a specification for the provided list of values. Adds any necessary refs to the refs aggregator.

Parameters:
  • name (str) – name of field this spec is being generated for

  • values (List[Any]) – List of values to generate the spec for.

  • refs (RefsAggregator) – for adding refs if needed for generated spec.

Keyword Arguments:
  • limit – for lists or weighted values, down sample to this size if needed

  • limit_weighted – take top N limit weights

  • duplication_threshold (float) – ratio of unique to total items, if above this threshold, use weighted values

Returns:

A dictionary with the inferred spec for the values.

Return type:

Dict[str, Any]

datacraft.infer.csv_to_spec(file_path, **kwargs)

Read a CSV from the provided file path, convert it to JSON records, and then pass it to the from_examples function to get the spec.

Parameters:

file_path (str) – The path to the CSV file.

Keyword Arguments:
  • limit (int) – for lists or weighted values, down sample to this size if needed

  • limit_weighted (bool) – take top N limit weights

Returns:

The inferred data spec from the CSV data.

Return type:

Dict[str, Union[str, Dict]]

datacraft.infer.from_examples(examples, **kwargs)

Generates a Data Spec from the list of example JSON records

Parameters:

examples (list) – Data to infer Data Spec from

Keyword Arguments:
  • limit (int) – for lists or weighted values, down sample to this size if needed

  • limit_weighted (bool) – take top N limit weights

  • duplication_threshold (float) – ratio of unique to total items, if above this threshold, use weighted values

Returns:

Data Spec as dictionary

Return type:

dict

Examples

>>> import datacraft.infer as infer
>>> xmpls = [
...     {"foo": {"bar": 22.3, "baz": "single"}},
...     {"foo": {"bar": 44.5, "baz": "double"}}
... ]
>>>
>>> infer.from_examples(xmpls)
{'foo': {'type': 'nested', 'fields': {'bar': {'type': 'rand_range', 'data': [22.3, 44.5]}, 'baz': {'type': 'values', 'data': ['single', 'double']}}}}
datacraft.infer.infer_csv_select(file_path)

Infers a csv_select spec from the given csv file

Parameters:

file_path (str) – The path to the CSV file.

Returns:

The csv_select Data Spec for the given csv data.

Return type:

Dict[str, Union[str, Dict]]