Datacraft API

The Datacraft API can be used to generate data in much the same way as the command line tooling. Data Specs are defined as dictionaries and follow the same JSON-based format and schemas. In most cases you can copy a JSON spec from a file, assign it to a variable, and it will generate the same data as the command line datacraft tool.

Examples:

entries and generator

By default, datacraft generates dictionaries from the data specs. The datacraft.entries function returns a list of generated dictionaries. If you have a lot of data to generate, use a generator instead: call datacraft.generator to access the data this way.

import datacraft

spec = {
    "id": {"type": "uuid"},
    "timestamp": {"type": "date.iso.millis"},
    "handle": {"type": "cc-word", "config": { "min": 4, "max": 8, "prefix": "@" } }
}

print(*datacraft.entries(spec, 3), sep='\n')
# {'id': '40bf8be1-23d2-4e93-9b8b-b37103c4b18c', 'timestamp': '2050-12-03T20:40:03.709', 'handle': '@WPNn'}
# {'id': '3bb5789e-10d1-4ae3-ae61-e0682dad8ecf', 'timestamp': '2050-11-20T02:57:48.131', 'handle': '@kl1KUdtT'}
# {'id': '474a439a-8582-46a2-84d6-58bfbfa10bca', 'timestamp': '2050-11-29T18:08:44.971', 'handle': '@XDvquPI'}

# or if you prefer a generator
for record in datacraft.generator(spec, 3_000_000):
    pass
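
When only the first few records of a large run are needed, a generator can be consumed lazily. The sketch below uses a hypothetical stand-in, fake_generator, in place of datacraft.generator so that it runs without the library installed. The stand-in and its record shape are assumptions for illustration only.

```python
from itertools import islice


def fake_generator(iterations):
    """Hypothetical stand-in for datacraft.generator(spec, iterations)."""
    for i in range(iterations):
        # Each record is built only when the consumer asks for it
        yield {"id": i}


# Request three million records but materialize only the first three
first_three = list(islice(fake_generator(3_000_000), 3))
print(first_three)
```

Because the generator is lazy, the remaining records are never created.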

record_entries and record_generator

If you are using Data classes, you can tell datacraft to return your data as a data class using the record_entries function.

import datacraft
from dataclasses import dataclass

@dataclass
class Entry:
    id: str
    timestamp: str
    handle: str

spec = {
    "id": {"type": "uuid"},
    "timestamp": {"type": "date.iso.millis"},
    "handle": {"type": "cc-word", "config": { "min": 4, "max": 8, "prefix": "@" } }
}

print(*datacraft.record_entries(Entry, spec, 3), sep='\n')
# Entry(id='1a5d8158-f095-49f2-abaf-eef2e33b4075', timestamp='2050-07-11T18:58:30.376', handle='@g7Lu0Vd4')
# Entry(id='f9e23a54-f9e8-4aa4-b3f5-aca45d89dd2c', timestamp='2050-07-21T20:00:32.290', handle='@kBCD7')
# Entry(id='61239ab0-2d3d-420f-be01-15ec5d730fd1', timestamp='2050-07-04T13:53:07.322', handle='@GlWfzV6r')

# or if you prefer a generator
for record in datacraft.record_generator(Entry, spec, 3_000_000):
    pass
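
How record_entries builds the data class internally is not described here. One way to picture the mapping, assuming each generated dictionary has keys that match the data class field names, is plain keyword unpacking. The record below is copied from the sample output above rather than generated.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    id: str
    timestamp: str
    handle: str


# A record shaped like the dictionaries shown in the entries example
record = {
    "id": "40bf8be1-23d2-4e93-9b8b-b37103c4b18c",
    "timestamp": "2050-12-03T20:40:03.709",
    "handle": "@WPNn",
}

# Unpack the dictionary into the data class; keys must match field names
entry = Entry(**record)
print(entry.handle)
```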

values_for

If you only want the generated values for a specific field, say you want 100 UUIDs, you can use the datacraft.values_for function.

import datacraft

datacraft.values_for({"type": "uuid"}, 3)
# ['3ab92d2f-58d5-4328-a60e-72ee616199eb', 'cd5d5b64-ff25-4a2f-b69e-5a8c39841fc2', '2326f5c4-1b47-4913-8575-a71950f0fcce']
datacraft.values_for({"type": "ip", "config": {"prefix": "address:"}}, 3)
# ['address:243.228.123.130', 'address:4.22.163.89', 'address:175.230.40.87']
datacraft.values_for({"type": "date.iso"}, 3)
# ['2050-07-21T17:08:41', '2050-07-19T11:33:04', '2050-07-06T20:08:36']

registered_types and type_usage

There are some functions that can be helpful for getting the list of registered types as well as examples for using them with the API.

import datacraft

# List all registered types:
datacraft.registered_types()
# ['calculate', 'char_class', 'cc-ascii', 'cc-lower', '...', 'uuid', 'values', 'replace', 'regex_replace']

# Print API usage for a specific type or types
print(datacraft.type_usage('char_class', 'replace', '...'))
# Example Output
"""
-------------------------------------
replace | API Example:

import datacraft

spec = {
 "field": {
   "type": "values",
   "data": ["foo", "bar", "baz"]
 },
 "replacement": {
   "type": "replace",
   "data": {"ba": "fi"},
   "ref": "field"
 }
}

print(*datacraft.entries(spec, 3), sep='\n')

{'field': 'foo', 'replacement': 'foo'}
{'field': 'bar', 'replacement': 'fir'}
{'field': 'baz', 'replacement': 'fiz'}
"""

Core Classes

class datacraft.DataSpec(raw_spec)

Class representing a DataSpec object

abstract generator(iterations, **kwargs)

Creates a generator that will produce records or render the template for each record

Parameters:
  • iterations (int) – number of iterations to execute

  • **kwargs

Keyword Arguments:
  • processor – (RecordProcessor): For any Record Level transformations such as templating or formatters

  • output – (OutputHandlerInterface): For any field or record level output

  • data_dir (str) – path to the data directory with csv files and such

  • enforce_schema (bool) – If schema validation should be applied where possible

Yields:

Records or rendered template strings

Return type:

Generator

Examples

>>> import datacraft
>>> raw_spec = {'name': ['bob', 'bobby', 'robert', 'bobo']}
>>> spec = datacraft.parse_spec(raw_spec)
>>> template = 'Name: {{ name }}'
>>> processor = datacraft.outputs.processor(template=template)
>>> generator = spec.generator(
...     iterations=4,
...     processor=processor)
>>> record = next(generator)
>>> print(record)
Name: bob

get(*args, **kwargs)

Return the value for key if key is in the dictionary, else default.

items()

a set-like object providing a view on D's items

keys()

a set-like object providing a view on D's keys

pop(k[, d])

Remove the specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.

abstract to_pandas(iterations)

Converts iterations number of records into a pandas DataFrame

Parameters:

iterations (int) – number of iterations to run / records to generate

Returns:

DataFrame with records as rows

values()

an object providing a view on D's values

class datacraft.ValueSupplierInterface

Interface for Classes that supply values

abstract next(iteration)

Produces the next value for the given iteration

Parameters:

iteration – current iteration

Returns:

the next value
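
As a rough illustration of the contract, the following standalone class exposes the same next(iteration) method. It does not import datacraft. In real code the class would presumably subclass datacraft.ValueSupplierInterface, and the name CyclingSupplier is hypothetical.

```python
class CyclingSupplier:
    """Stand-in for a datacraft.ValueSupplierInterface implementation."""

    def __init__(self, values):
        self.values = list(values)

    def next(self, iteration):
        # Use the iteration number to pick a value, wrapping around at the end
        return self.values[iteration % len(self.values)]


supplier = CyclingSupplier(["dog", "cat", "hamster"])
cycled = [supplier.next(i) for i in range(4)]
print(cycled)
```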

class datacraft.Loader

Parent object for loading value suppliers from specs

abstract get(key)

Retrieve the value supplier for the given field or ref key

Parameters:

key (str) – key for the field or ref name

Return type:

ValueSupplierInterface

Returns:

the Value Supplier for the given key

Raises:

SpecException if key not found

abstract get_from_spec(field_spec)

Retrieve the value supplier for the given field spec

Parameters:

field_spec (Any) – dictionary spec or literal values

Return type:

ValueSupplierInterface

Returns:

the Value Supplier for the given spec

Raises:

SpecException if unable to resolve the spec with appropriate handler for the type

abstract get_ref(key)

returns the spec for the ref with the provided key

Parameters:

key (str) – key to lookup ref by

Return type:

dict

Returns:

Ref for key

abstract property spec

get the preprocessed field specs for this loader

class datacraft.Distribution

Interface Class for a numeric distribution such as a Uniform or Gaussian distribution

abstract next_value()

get the next value for this distribution

Return type:

float
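
A minimal sketch of a class with the same shape as this interface, written without importing datacraft. The class name, constructor arguments, and seeding are assumptions made for the example. A real implementation would presumably subclass datacraft.Distribution.

```python
import random


class UniformDistribution:
    """Stand-in for a datacraft.Distribution implementation."""

    def __init__(self, start, end, seed=None):
        self.start = start
        self.end = end
        # A private Random instance keeps the sketch reproducible when seeded
        self._rng = random.Random(seed)

    def next_value(self) -> float:
        return self._rng.uniform(self.start, self.end)


dist = UniformDistribution(10, 100, seed=42)
samples = [dist.next_value() for _ in range(5)]
print(samples)
```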

class datacraft.CasterInterface

Interface for Classes that cast objects to different types

abstract cast(value)

casts the value according to the specified type

Parameters:

value (Any) – to cast

Return type:

Any

Returns:

the cast form of the value

Raises:

SpecException when unable to cast value
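
The following standalone sketch mirrors the cast(value) contract with a caster that reverses strings. It does not import datacraft, so it raises ValueError where a real caster would raise SpecException. The class name is hypothetical.

```python
class ReverseCaster:
    """Stand-in for a datacraft.CasterInterface implementation."""

    def cast(self, value):
        if isinstance(value, str):
            return value[::-1]
        if isinstance(value, list):
            # Cast each element so lists of strings are handled too
            return [self.cast(item) for item in value]
        # A real caster would raise datacraft.SpecException here
        raise ValueError(f"unable to reverse value: {value!r}")


caster = ReverseCaster()
print(caster.cast("datacraft"))
```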

class datacraft.RecordProcessor

A Class that takes in a generated record and returns it formatted as a string for output

abstract process(record)

Processes the given record into the appropriate output string

Parameters:

record (Union[list, dict]) – generated record for current iteration

Return type:

str

Returns:

The formatted record
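
As an illustration of the process(record) contract, here is a standalone processor that renders each record as a JSON string. It does not import datacraft, and the class name JsonLineProcessor is hypothetical.

```python
import json


class JsonLineProcessor:
    """Stand-in for a datacraft.RecordProcessor implementation."""

    def process(self, record) -> str:
        # sort_keys gives a stable field order in the output string
        return json.dumps(record, sort_keys=True)


processor = JsonLineProcessor()
line = processor.process({"name": "bob", "age": 42})
print(line)
```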

class datacraft.OutputHandlerInterface

Interface for handling generated output values

abstract finished_iterations()

This is called when all iterations have been completed

abstract finished_record(iteration, group_name, exclude_internal=False)

This is called whenever all of the fields for a record have been generated for one iteration

Parameters:
  • iteration (int) – iteration we are on

  • group_name (str) – group this record is a part of

  • exclude_internal (bool) – if internal fields should be excluded from output record

abstract handle(key, value)

This is called each time a new value is generated for a given field

Parameters:
  • key (str) – the field name

  • value (Any) – the new value for the field
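
The call order described above (handle for each field, finished_record after each record, finished_iterations at the end) can be sketched with a standalone handler that collects records in memory. It does not import datacraft. The class name and the "demo" group name are hypothetical, and the driving loop is only an approximation of what the library does.

```python
class CollectingHandler:
    """Stand-in for a datacraft.OutputHandlerInterface implementation."""

    def __init__(self):
        self.current = {}
        self.records = []
        self.done = False

    def handle(self, key, value):
        # Called once per generated field value
        self.current[key] = value

    def finished_record(self, iteration, group_name, exclude_internal=False):
        # Called once all fields for this iteration have been handled
        self.records.append(self.current)
        self.current = {}

    def finished_iterations(self):
        self.done = True


handler = CollectingHandler()
for i, name in enumerate(["bob", "bobby"]):
    handler.handle("name", name)
    handler.handle("index", i)
    handler.finished_record(i, group_name="demo")
handler.finished_iterations()
print(handler.records)
```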

class datacraft.ResettableIterator

Iterator class that can be reset to the beginning of the iteration

abstract reset()

This will reset the iterator to the initial state for another full round of iteration
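
A minimal standalone sketch of an iterator with a reset() method follows. It does not import datacraft, and the CountdownIterator name is hypothetical. A real implementation would presumably subclass datacraft.ResettableIterator.

```python
class CountdownIterator:
    """Stand-in for a datacraft.ResettableIterator implementation."""

    def __init__(self, start):
        self.start = start
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

    def reset(self):
        # Restore the initial state so iteration can begin again
        self.current = self.start


countdown = CountdownIterator(3)
first_pass = list(countdown)
countdown.reset()
second_pass = list(countdown)
print(first_pass, second_pass)
```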

Registry Decorators

class datacraft.registries.Registry

Catalogue registry for types, preprocessors, logging configuration, and others

types

Types for field specs, registered functions for creating ValueSupplierInterface that will supply values for the given type

>>> @datacraft.registry.types('special_sauce')
... def _handle_special_type(field_spec: dict, loader: datacraft.Loader) -> ValueSupplierInterface:
...    # return ValueSupplierInterface from spec config

schemas

Schemas for field spec types, used to validate that the spec for a given type conforms to the schema for it

>>> @datacraft.registry.schemas('special_sauce')
... def _special_sauce_schema() -> dict:
...    # return JSON schema validating specs with type: special_sauce

usage

Usage for field spec types, used to provide command line help and examples

>>> @datacraft.registry.usage('special_sauce')
... def _special_sauce_usage() -> Union[str, dict]:
...    # return string describing how to use special_sauce
...    # or a dictionary with {"cli": "cli usage example", "api": "api usage example"}

preprocessors

Functions to modify specs before the data generation process. If there is a customization you want to apply to every data spec, or an extension you added that requires modifications to specs before they are run, this is where you would register that preprocessor.
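
A preprocessor is a function from raw spec to raw spec. The standalone sketch below shows one possible shape for such a function, without registering it. The ensure_config name and the choice to add an empty config block are hypothetical and only for illustration.

```python
import copy


def ensure_config(raw_spec: dict) -> dict:
    """Hypothetical preprocessor: give every dict field spec a config block."""
    updated = copy.deepcopy(raw_spec)  # leave the caller's spec untouched
    for field_spec in updated.values():
        if isinstance(field_spec, dict):
            field_spec.setdefault("config", {})
    return updated


raw_spec = {"id": {"type": "uuid"}, "name": ["bob", "bobby"]}
processed = ensure_config(raw_spec)
print(processed)
```

Returning a modified copy rather than mutating the input keeps the original spec reusable.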

>>> @datacraft.registry.preprocessors('custom-preprocessing')
... def _preprocess_spec_to_some_end(raw_spec: dict) -> dict:
...    # return spec with any modification

logging

Custom logging setup. Can override or modify the default logging behavior.

>>> @datacraft.registry.logging('denoise')
... def _customize_logging(loglevel: str):
...     logging.getLogger('too.verbose.module').level = logging.ERROR

formats

Registered formats for output, selected with the --format <format name> command line argument. Unlike other registered functions, this one is called directly to perform the required formatting function. The return value from the formatter is the new value that will be written to the configured output (default is console).
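
Since a formatter takes a record and returns the string to write, its body can be an ordinary function. The standalone sketch below shows one such function without registering it. The format_key_value name and the key=value layout are hypothetical.

```python
def format_key_value(record: dict) -> str:
    """Hypothetical formatter: render a record as key=value pairs."""
    return " ".join(f"{key}={value}" for key, value in record.items())


formatted = format_key_value({"id": 7, "handle": "@WPNn"})
print(formatted)
```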

>>> @datacraft.registry.formats('custom_format')
... def _format_custom(record: dict) -> str:
...     # write to database or some other custom output, return something to write out or print to console

distribution

Different numeric distributions: normal, uniform, etc. These are used for more nuanced count values. The built-in distributions are uniform and normal.

>>> @datacraft.registry.distribution('hyperbolic_inverse_haversine')
... def _hyperbolic_inverse_haversine(mean, stddev, **kwargs):
...     # return a datacraft.Distribution, args can be custom for the defined distribution

defaults

Default values. Different types have different default values for some configs. This provides a mechanism to override them or to register other custom defaults. Read a default from the registry with datacraft.registries.get_default('var_key'). Use datacraft.registries.all_defaults() to get a mapping of all registered default keys and values.

>>> @datacraft.registry.defaults('special_sauce_ingredient')
... def _default_special_sauce_ingredient():
...     # return the default value (i.e. onions)

casters

Cast or alter values in simple ways. These are all the valid forms of altering generated values after they are created outside of the ValueSupplier types. Use datacraft.registries.registered_casters() to get a list of all the currently registered ones.

>>> @datacraft.registry.casters('reverse')
... def _cast_reverse_strings() -> datacraft.CasterInterface:
...     # return a datacraft.CasterInterface

analyzers

Used by the Data Spec inference tool chain to analyze the list of values for a given field to try to determine an appropriate Field Spec that can be used to approximate the data values present

>>> @datacraft.registry.num_analyzers('custom')
... def _special_value_analyzer() -> datacraft.ValueListAnalyzer:
...     # return a datacraft.ValueListAnalyzer

Datacraft Errors

class datacraft.SpecException

A SpecException indicates that there is a fatal flaw with the configuration or data associated with a Data Spec or one of the described Field Specs. Common errors include undefined or misspelled references, missing or invalid configuration parameters, and invalid or missing data definitions.

class datacraft.SupplierException

A SupplierException indicates that there is a fatal flaw with the configuration or data associated with a Data Spec during run time.

class datacraft.ResourceError

A ResourceError indicates that an underlying resource such as a schema file was not able to be found or loaded.

Suppliers Module

Factory like module for core supplier related functions.

datacraft.suppliers.alter(supplier, **kwargs)

Covers multiple suppliers that alter values if configured to do so through kwargs: cast, buffer, and decorate

Parameters:

supplier – to alter if configured to do so

Keyword Arguments:
  • cast (str) – caster to apply

  • prefix (str) – prefix to prepend to value, default is ‘’

  • suffix (str) – suffix to append to value, default is ‘’

  • quote (str) – string to both append and prepend to value, default is ‘’

  • buffer (bool) – if the values should be buffered

  • buffer_size (int) – size of buffer to use

Return type:

ValueSupplierInterface

Returns:

supplier with alterations

datacraft.suppliers.array_supplier(wrapped, **kwargs)

Wraps an existing supplier and always returns an array/list of elements, uses count config to determine number of items in the list

Parameters:

wrapped (ValueSupplierInterface) – the underlying supplier

Keyword Arguments:
  • count – constant, list, or weighted map

  • data – alias for count

  • count_dist – distribution in named param function style format

Return type:

ValueSupplierInterface

Returns:

The value supplier

Examples

>>> import datacraft
>>> pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"])
>>> returns_mostly_two = datacraft.suppliers.array_supplier(pet_supplier, count_dist="normal(mean=2, stddev=1)")
>>> pet_array = returns_mostly_two.next(0)

datacraft.suppliers.buffered(wrapped, **kwargs)

Creates a Value Supplier that buffers the results of the wrapped supplier allowing the retrieval

Parameters:

wrapped (ValueSupplierInterface) – the Value Supplier to buffer values for

Keyword Arguments:

buffer_size – number of produced values to buffer

Return type:

ValueSupplierInterface

Returns:

a buffered value supplier

datacraft.suppliers.calculate(suppliers_map, formula)

Creates a calculate supplier

Parameters:
  • suppliers_map (Dict[str, ValueSupplierInterface]) – map of name to supplier of values for that name

  • formula (str) – to evaluate, should reference keys in suppliers_map

Return type:

ValueSupplierInterface

Returns:

supplier with calculated values

datacraft.suppliers.cast(supplier, cast_to)

Provides a cast supplier from explicit cast

Parameters:
Return type:

ValueSupplierInterface

Returns:

the casting supplier

datacraft.suppliers.character_class(data, **kwargs)

Creates a character class supplier for the given data

Parameters:

data – set of characters to supply as values

Keyword Arguments:
  • join_with (str) – string to join characters with, default is ‘’

  • exclude (str) – set of characters to exclude from returned values

  • escape (str) – set of characters to escape, i.e. " -> \" for example

  • escape_str (str) – string to use for escaping, default is \

  • mean (float) – mean number of characters to produce

  • stddev (float) – standard deviation from the mean

  • count (int) – number of elements in list to use

  • count_dist (str) – count distribution to use

  • min (int) – minimum number of characters to return

  • max (int) – maximum number of characters to return

Returns:

supplier for characters

datacraft.suppliers.combine(to_combine, join_with=None, as_list=None)

Creates a value supplier that will combine the outputs of the provided suppliers in order. The default is to join the values with an empty string. Provide the join_with config param to specify a different string to join the values with. Set as_list to true, if the values should be returned as a list and not joined

Parameters:
  • to_combine – list of suppliers to combine in order of combination

  • as_list (Optional[bool]) – if the results should be returned as a list

  • join_with (Optional[str]) – value to use to join the values

Returns:

supplier that combines the values of the provided suppliers

Examples

>>> import datacraft
>>> pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"], sample=True)
>>> job_supplier = datacraft.suppliers.values(["breeder", "trainer", "fighter", "wrestler"], sample=True)
>>> interesting_jobs = datacraft.suppliers.combine([pet_supplier, job_supplier], join_with=' ')
>>> next_career = interesting_jobs.next(0)
>>> next_career
'pig wrestler'

datacraft.suppliers.constant(data)

Creates value supplier for the single value

Parameters:

data (Any) – constant data to return on every iteration

Return type:

ValueSupplierInterface

Returns:

value supplier for the single value

Examples

>>> import datacraft
>>> single_int_supplier = datacraft.suppliers.constant(42)
>>> single_str_supplier = datacraft.suppliers.constant("42")
>>> single_float_supplier = datacraft.suppliers.constant(42.42)

datacraft.suppliers.count_supplier(**kwargs)

creates a count supplier from the config, if the count param is defined, otherwise uses default of 1

optionally can specify count or count_dist.

valid data for counts:
  • integer i.e. 1, 7, 99

  • list of integers: [1, 7, 99], [1], [1, 2, 1, 2, 3]

  • weighted map, where keys are numeric strings: {“1”: 0.6, “2”: 0.4}

count_dist will be interpreted as a distribution i.e:

Keyword Arguments:
  • count – constant, list, or weighted map

  • data – alias for count

  • count_dist (str) – distribution in named param function style format

Return type:

ValueSupplierInterface

Returns:

a count supplier
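
One way to read the weighted map form of count is as a set of candidate counts with selection probabilities. The standalone sketch below samples from such a map with the standard library. It is an assumption about the semantics, not datacraft's implementation.

```python
import random

weighted_counts = {"1": 0.6, "2": 0.4}
rng = random.Random(0)

# Keys are numeric strings, so convert them before using them as counts
counts = [int(key) for key in weighted_counts]
weights = list(weighted_counts.values())

# Draw many counts to see the weights reflected in the sample
samples = rng.choices(counts, weights=weights, k=1000)
share_of_ones = samples.count(1) / len(samples)
print(share_of_ones)
```

With these weights, roughly 60% of the draws should be 1 and the rest 2.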

Examples

>>> import datacraft
>>> counts = datacraft.suppliers.count_supplier(count_dist="uniform(start=10, end=100)")

datacraft.suppliers.csv(csv_path, **kwargs)

Creates a csv supplier

Parameters:

csv_path – path to csv file to supply data from

Keyword Arguments:
  • column (int) – 1 based column number, default is 1

  • sample (bool) – if the values for the column should be sampled, if supported

  • count – constant, list, or weighted map

  • count_dist – distribution in named param function style format

  • delimiter (str) – how items are separated, default is ‘,’

  • quotechar (str) – string used to quote values, default is ‘”’

  • headers (bool) – if the CSV file has a header row

  • sample_rows (bool) – if sampling should happen at a row level, not valid if buffering is set to true

Returns:

supplier for csv field

datacraft.suppliers.cut(supplier, start=0, end=None)

Trim output of given supplier from start to end, if length permits

Parameters:
  • supplier (ValueSupplierInterface) – to get output from

  • start (int) – where in output string to cut from (inclusive)

  • end (Optional[int]) – where to end cut (exclusive)

Returns:

The shortened version of the output string
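
The inclusive start and exclusive end described here match Python slice semantics. The standalone function below illustrates them on plain strings, assuming the supplier behaves like a slice. It is not the library's implementation, and the local cut function is a stand-in.

```python
def cut(value: str, start: int = 0, end=None) -> str:
    """Stand-in showing the inclusive start / exclusive end semantics."""
    return value[start:end]


tail = cut("datacraft", start=4)           # from index 4 to the end
head = cut("datacraft", start=0, end=4)    # indexes 0 through 3
short = cut("data", start=2, end=10)       # end beyond the string length
print(tail, head, short)
```

An end past the length of the string simply returns whatever characters are available.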

datacraft.suppliers.date(**kwargs)

Creates a supplier that provides date values according to the specified format and ranges

Can use one of center_date or (start, end, offset, duration_days) etc.

Parameters:

**kwargs

Keyword Arguments:
  • format (str) – Format string for dates

  • center_date (str) – Date matching format to center dates around

  • stddev_days (float) – Standard deviation in days from center date

  • start (str) – start date string

  • end (str) – end date string

  • offset (int) – number of days to shift the duration, positive is back negative is forward

  • duration_days (int) – number of days after start, default is 30

Return type:

ValueSupplierInterface

Returns:

supplier for dates

datacraft.suppliers.decorated(supplier, **kwargs)

Creates a decorated supplier around the provided one

Parameters:
Keyword Arguments:
  • prefix (str) – prefix to prepend to value, default is ‘’

  • suffix (str) – suffix to append to value, default is ‘’

  • quote (str) – string to both append and prepend to value, default is ‘’

Return type:

ValueSupplierInterface

Returns:

the decorated supplier

Examples

>>> import datacraft
>>> nums = datacraft.suppliers.values([1, 2, 3, 4, 5])
>>> prefix_supplier = datacraft.suppliers.decorated(nums, prefix='you are number ')
>>> prefix_supplier.next(0)
you are number 1
>>> suffix_supplier = datacraft.suppliers.decorated(nums, suffix=' more minutes')
>>> suffix_supplier.next(0)
1 more minutes
>>> quoted_supplier = datacraft.suppliers.decorated(nums, quote='"')
>>> quoted_supplier.next(0)
"1"

datacraft.suppliers.distribution_supplier(distribution)

creates a ValueSupplier that uses the given distribution to generate values

Parameters:

distribution (Distribution) – to use

Return type:

ValueSupplierInterface

Returns:

the value supplier

datacraft.suppliers.epoch_date(as_millis=False, **kwargs)

Creates a supplier that provides epoch dates

Can use one of center_date or (start, end, offset, duration_days) etc.

Parameters:

as_millis (bool) – if the timestamp should be millis since epoch, default is seconds

Keyword Arguments:
  • format (str) – Format string for date args used, required if any provided

  • center_date (str) – Date matching format to center dates around

  • stddev_days (float) – Standard deviation in days from center date

  • start (str) – start date string

  • end (str) – end date string

  • offset (int) – number of days to shift the duration, positive is back negative is forward

  • duration_days (str) – number of days after start, default is 30

Return type:

ValueSupplierInterface

Returns:

supplier for dates

datacraft.suppliers.from_list_of_suppliers(supplier_list, modulate_iteration=True)

Returns a supplier that rotates through the provided suppliers incrementally

Parameters:
  • supplier_list (List[ValueSupplierInterface]) – to rotate through

  • modulate_iteration (bool) – if the iteration number should be moded by the index of the supplier

Return type:

ValueSupplierInterface

Returns:

a supplier for these suppliers

Examples

>>> import datacraft
>>> nice_pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"])
>>> mean_pet_supplier = datacraft.suppliers.values(["alligator", "cobra", "mongoose", "killer bee"])
>>> pet_supplier = datacraft.suppliers.from_list_of_suppliers([nice_pet_supplier, mean_pet_supplier])
>>> pet_supplier.next(0)
'dog'
>>> pet_supplier.next(1)
'alligator'

datacraft.suppliers.geo_lat(**kwargs)

configures geo latitude type

Keyword Arguments:
  • precision (int) – number of digits after decimal place

  • start_lat (int) – minimum value for latitude

  • end_lat (int) – maximum value for latitude

  • bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]

Return type:

ValueSupplierInterface

Returns:

supplier for geo.lat type

datacraft.suppliers.geo_long(**kwargs)

configures geo longitude type

Keyword Arguments:
  • precision (int) – number of digits after decimal place

  • start_long (int) – minimum value for longitude

  • end_long (int) – maximum value for longitude

  • bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]

Return type:

ValueSupplierInterface

Returns:

supplier for geo.long type

datacraft.suppliers.geo_pair(**kwargs)

Creates geo pair supplier

Keyword Arguments:
  • precision (int) – number of digits after decimal place

  • lat_first (bool) – if latitude should be populated before longitude

  • start_lat (int) – minimum value for latitude

  • end_lat (int) – maximum value for latitude

  • start_long (int) – minimum value for longitude

  • end_long (int) – maximum value for longitude

  • bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]

  • as_list (bool) – if the values should be returned as a list

  • join_with (str) – if the values should be joined with the provided string

Returns:

supplier for geo.pair type

datacraft.suppliers.ip_precise(cidr, sample=False)

Creates a value supplier that produces precise ip address from the given cidr

Parameters:
  • cidr (str) – notation specifying ip range

  • sample (bool) – if the ip addresses should be sampled from the available set

Return type:

ValueSupplierInterface

Returns:

supplier for precise ip addresses
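
For a sense of what a CIDR block such as 192.168.0.0/22 expands to, the standard library ipaddress module can enumerate it. This is an independent illustration of the range and does not show how datacraft walks it.

```python
import ipaddress
from itertools import islice

network = ipaddress.ip_network("192.168.0.0/22")

# A /22 leaves 10 host bits, so the block spans 1024 addresses
total = network.num_addresses

# Iterating the network walks the addresses in ascending order
first_two = [str(ip) for ip in islice(network, 2)]
last = str(network[-1])
print(total, first_two, last)
```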

Examples

>>> import datacraft
>>> ips = datacraft.suppliers.ip_precise(cidr="192.168.0.0/22", sample=False)
>>> ips.next(0)
'192.168.0.0'
>>> ips.next(1)
'192.168.0.1'
>>> ips = datacraft.suppliers.ip_precise(cidr="192.168.0.0/22", sample=True)
>>> ips.next(0)
'192.168.0.127'
>>> ips.next(1)
'192.168.0.196'

datacraft.suppliers.ip_supplier(**kwargs)

Creates a value supplier for IPv4 addresses

Keyword Arguments:
  • base (str) – base of ip address, i.e. “192”, “10.” “100.100”, “192.168.”, “10.10.10”

  • cidr (str) – cidr to use only one /8 /16 or /24, i.e. “192.168.0.0/24”, “10.0.0.0/16”, “100.0.0.0/8”

Return type:

ValueSupplierInterface

Returns:

supplier for ip addresses

Raises:

SpecException if one of base or cidr is not provided

Examples

>>> import datacraft
>>> ips = datacraft.suppliers.ip_supplier(base="192.168.1")
>>> ips.next(0)
'192.168.1.144'

datacraft.suppliers.list_count_sampler(data, **kwargs)

Samples N elements from data list based on config. If count is provided, each iteration exactly count elements will be returned. If only min is provided, between min and the total number of elements will be provided. If only max is provided, between one and max elements will be returned. Specifying both min and max will provide a sample containing a number of elements in this range.

Parameters:

data (list) – list to select subset from

Keyword Arguments:
  • count – number of elements in list to use

  • count_dist – count distribution to use

  • min – minimum number of values to return

  • max – maximum number of values to return

  • join_with – value to join values with, default is None

Return type:

ValueSupplierInterface

Returns:

the supplier

Examples

>>> import datacraft
>>> pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
>>> pet_supplier = datacraft.suppliers.list_count_sampler(pet_list, min=2, max=5)
>>> pet_supplier.next(0)
['rabbit', 'cat', 'pig', 'cat']
>>> pet_supplier = datacraft.suppliers.list_count_sampler(pet_list, count_dist="normal(mean=2,stddev=1,min=1)")
>>> pet_supplier.next(0)
['pig', 'horse']

datacraft.suppliers.list_stats_sampler(data, **kwargs)

sample from list (or string) with stats based params

Parameters:

data (Union[str, list]) – list to select subset from

Keyword Arguments:
  • mean (float) – mean number of items/characters to produce

  • stddev (float) – standard deviation from the mean

  • count (int) – number of elements in list/characters to use

  • count_dist (str) – count distribution to use

  • min (int) – minimum number of items/characters to return

  • max (int) – maximum number of items/characters to return

Return type:

ValueSupplierInterface

Returns:

the supplier

Examples

>>> import datacraft
>>> pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
>>> pet_supplier = datacraft.suppliers.list_stats_sampler(pet_list, mean=2, stddev=1)
>>> new_pets = pet_supplier.next(0)
>>> char_config = {"min": 2, "mean": 4, "max": 8}
>>> char_supplier = datacraft.suppliers.list_stats_sampler("#!@#$%^&*()_-~", min=2, mean=4, max=8)
>>> two_to_eight_chars = char_supplier.next(0)

datacraft.suppliers.list_values(data, **kwargs)

creates a Value supplier for the list of provided data

Parameters:

data (list) – for the supplier

Keyword Arguments:
  • as_list (bool) – if data should be returned as a list

  • sample (bool) – if the data should be sampled instead of iterated through incrementally

  • count – constant, list, or weighted map

  • count_dist (str) – distribution in named param function style format

Return type:

ValueSupplierInterface

Returns:

the ValueSupplierInterface for the data list

datacraft.suppliers.mac_address(delimiter=None)

Creates a value supplier that produces mac addresses

Parameters:

delimiter (Optional[str]) – how mac address pieces are separated, default is ‘:’

Return type:

ValueSupplierInterface

Returns:

supplier for mac addresses

Examples

>>> import datacraft
>>> macs = datacraft.suppliers.mac_address()
>>> macs.next(0)
'1E:D4:0F:59:41:FA'
>>> macs = datacraft.suppliers.mac_address('-')
>>> macs.next(0)
'4D-93-36-59-BD-09'

datacraft.suppliers.random_range(start, end, precision=None, count=1)

Creates a random range supplier for the start and end parameters with the given precision (number of decimal places)

Parameters:
  • start (Union[str, int, float]) – of range

  • end (Union[str, int, float]) – of range

  • precision (Union[str, int, float, None]) – number of decimal places to keep

  • count (Union[int, List[int], Dict[str, float], Distribution]) – number of elements to return, default is one

Return type:

ValueSupplierInterface

Returns:

the value supplier for the range
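
The effect of precision can be pictured as rounding a uniformly drawn number. The standalone sketch below uses the standard library, assuming uniform sampling between start and end. The helper name random_in_range is hypothetical, and the code is not the library's implementation.

```python
import random

rng = random.Random(7)  # seeded only to make the sketch repeatable


def random_in_range(start, end, precision=None):
    """Hypothetical helper: uniform draw, optionally rounded."""
    value = rng.uniform(start, end)
    return value if precision is None else round(value, precision)


value = random_in_range(5, 25, precision=3)
print(value)
```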

Examples

>>> num_supplier = datacraft.suppliers.random_range(5, 25, precision=3)
>>> # should be between 5 and 25 with 3 decimal places
>>> num_supplier.next(0)
8.377

datacraft.suppliers.range_supplier(start, end, step=1, **kwargs)

Creates a Value Supplier for given range of data

Parameters:
  • start (Union[int, float]) – start of range

  • end (Union[int, float]) – end of range

  • step (Union[int, float]) – of range values

Keyword Arguments:

precision (int) – Number of decimal places to use, in case of floating point range

Returns:

supplier to supply ranges of values with

datacraft.suppliers.resettable(iterator)

Wraps a ResettableIterator to supply values from

Parameters:

iterator (ResettableIterator) – iterator with reset() method

Returns:

supplier to supply generated values with

datacraft.suppliers.sample(data, **kwargs)

Creates a supplier that selects elements from the data list based on the supplier kwargs

Parameters:

data (list) – list of data values to supply values from

Keyword Arguments:
  • mean (float) – mean number of values to include in list

  • stddev (float) – standard deviation from the mean

  • count – number of elements in list to use

  • count_dist – count distribution to use

  • min – minimum number of values to return

  • max – maximum number of values to return

  • join_with – value to join values with, default is None

Returns:

supplier to supply subsets of data list

Examples

>>> import datacraft
>>> supplier = datacraft.suppliers.sample(['dog', 'cat', 'rat'], mean=2)
>>> supplier.next(1)
['cat', 'rat']

datacraft.suppliers.templated(supplier_map, template_str)

Creates a supplier that populates the template string from the supplier map

Parameters:
  • supplier_map (Dict[str, ValueSupplierInterface]) – map of field name -> value supplier for it

  • template_str – templated string to populate

Return type:

ValueSupplierInterface

Returns:

value supplier for template

Examples

>>> from datacraft import suppliers
>>> char_to_num_supplier = { 'char': suppliers.values(['a', 'b', 'c']), 'num': suppliers.values([1, 2, 3]) }
>>> letter_number_template = 'letter {{ char }}, number {{ num }}'
>>> supplier = suppliers.templated(char_to_num_supplier, letter_number_template)
>>> supplier.next(0)
'letter a, number 1'
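Conceptually, each placeholder in the template is replaced with the next value from the supplier registered under the same name. The sketch below imitates that substitution with a regular expression; the render function is a hypothetical stand-in and does not reflect the templating engine datacraft actually uses.

```python
import re
from typing import Any, Dict

# Matches placeholders of the form {{ name }} with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: Dict[str, Any]) -> str:
    """Hypothetical helper: substitute each {{ name }} with values[name]."""
    return _PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), template)


print(render('letter {{ char }}, number {{ num }}', {'char': 'a', 'num': 1}))
# letter a, number 1
```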
datacraft.suppliers.unicode_range(data, **kwargs)

Creates a unicode supplier for single or multiple unicode ranges

Parameters:

data – list of unicode ranges to sample from

Keyword Arguments:
  • mean (float) – mean number of values to produce

  • stddev (float) – standard deviation from the mean

  • count (int) – number of unicode characters to produce

  • count_dist (str) – count distribution to use

  • min (int) – minimum number of characters to return

  • max (int) – maximum number of characters to return

  • as_list (bool) – if the results should be returned as a list

  • join_with (str) – value to join values with, default is ''

Returns:

supplier to supply subsets of data list

datacraft.suppliers.uuid(variant=None)

Creates a UUID Value Supplier

Parameters:

variant (Optional[int]) – variant of UUID to use, default is 4

Return type:

ValueSupplierInterface

Returns:

supplier to supply uuids with

datacraft.suppliers.values(spec, **kwargs)

Based on the data, returns the appropriate values supplier. The data can be a full spec, or the raw data itself: a constant, a list, or a dict.

Parameters:
  • spec (Any) – to load values from, or raw data itself

  • **kwargs – extra kwargs to add to config

Keyword Arguments:
  • as_list (bool) – if data should be returned as a list

  • sample (bool) – if the data should be sampled instead of iterated through incrementally

  • count – constant, list, or weighted map

  • count_dist (str) – distribution in named param function style format

Return type:

ValueSupplierInterface

Returns:

the values supplier for the spec

Examples

>>> import datacraft
>>> raw_spec = {"type": "values", "data": [1,2,3,5,8,13]}
>>> fib_supplier = datacraft.suppliers.values(raw_spec)
>>> fib_supplier = datacraft.suppliers.values([1,2,3,5,8,13])
>>> fib_supplier.next(0)
1
>>> weights =  {"1": 0.1, "2": 0.2, "3": 0.1, "4": 0.2, "5": 0.1, "6": 0.2, "7": 0.1}
>>> mostly_even_supplier = datacraft.suppliers.values(weights)
>>> mostly_even_supplier.next(0)
'4'
datacraft.suppliers.weighted_values(data, config=None)

Creates a weighted value supplier from the data, which is a mapping of each value to the weight it should represent.

Parameters:
  • data (dict) – for the supplier

  • config (Optional[dict]) – optional config (Default value = None)

Return type:

ValueSupplierInterface

Returns:

the supplier

Raises:

SpecException if data is empty

Examples

>>> import datacraft
>>> pets = {
... "dog": 0.5, "cat": 0.2, "bunny": 0.1, "hamster": 0.1, "pig": 0.05, "snake": 0.04, "_NULL_": 0.01
... }
>>> weighted_pet_supplier = datacraft.suppliers.weighted_values(pets)
>>> most_likely_a_dog = weighted_pet_supplier.next(0)
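The idea of weighted selection can be reproduced with random.choices from the standard library. This is only a sketch of the concept: weighted_choice is a hypothetical helper, the weights are a simplified version of the pets mapping above, and ValueError stands in for the SpecException the real function raises.

```python
import random
from collections import Counter
from typing import Dict

pet_weights: Dict[str, float] = {
    "dog": 0.5, "cat": 0.2, "bunny": 0.1, "hamster": 0.1, "pig": 0.05, "snake": 0.05
}


def weighted_choice(weights: Dict[str, float], rng: random.Random) -> str:
    """Hypothetical helper: pick one key with probability proportional to its weight."""
    if not weights:
        raise ValueError("weights must not be empty")
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]


rng = random.Random(42)
counts = Counter(weighted_choice(pet_weights, rng) for _ in range(10_000))
# With these weights "dog" should account for roughly half of the draws.
print(counts.most_common(3))
```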

Builder Module

Module for parsing and helper functions for specs

datacraft.builder.entries(raw_spec, iterations, **kwargs)

Creates n entries/records from the provided spec

Parameters:
  • raw_spec (Dict[str, Dict]) – to create entries for

  • iterations (int) – number of iterations before max

Keyword Arguments:
  • processor – (RecordProcessor): For any record level transformations such as templating or formatters

  • output – (OutputHandlerInterface): For any field or record level output

  • data_dir (str) – path to the data directory with CSV files and such

  • enforce_schema (bool) – If schema validation should be applied where possible

Return type:

List[dict]

Returns:

the list of N entries/records

Examples

>>> import datacraft
>>> field_spec = {
...     "id": {"type": "uuid"},
...     "timestamp": {"type": "date.iso.millis"},
...     "handle": {"type": "cc-word", "config": { "min": 4, "max": 8, "prefix": "@" } }
... }
>>> print(*datacraft.entries(field_spec, 3), sep='\n')
{'id': '40bf8be1-23d2-4e93-9b8b-b37103c4b18c', 'timestamp': '2050-12-03T20:40:03.709', 'handle': '@WPNn'}
{'id': '3bb5789e-10d1-4ae3-ae61-e0682dad8ecf', 'timestamp': '2050-11-20T02:57:48.131', 'handle': '@kl1KUdtT'}
{'id': '474a439a-8582-46a2-84d6-58bfbfa10bca', 'timestamp': '2050-11-29T18:08:44.971', 'handle': '@XDvquPI'}
datacraft.builder.generator(raw_spec, iterations, **kwargs)

Creates a generator for the raw spec for the specified iterations

Parameters:
  • raw_spec (Dict[str, Dict]) – to create generator for

  • iterations (int) – number of iterations before max

Keyword Arguments:
  • processor – (RecordProcessor): For any record level transformations such as templating or formatters

  • output – (OutputHandlerInterface): For any field or record level output

  • data_dir (str) – path to the data directory with CSV files and such

  • enforce_schema (bool) – If schema validation should be applied where possible

Yields:

Records or rendered template strings

Return type:

Generator

Returns:

the generator for the provided spec
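The practical difference between entries and generator is eager versus lazy evaluation. The sketch below shows the pattern with a hypothetical record_stream function built on the standard library; it is not datacraft code and only mirrors the shape of the two calls.

```python
import itertools
import uuid
from typing import Dict, Iterator, List


def record_stream(iterations: int) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a spec-driven generator: yields one record at a time."""
    for index in range(iterations):
        yield {"id": str(uuid.uuid4()), "sequence": str(index)}


# Lazy: only the two records that are consumed are ever built,
# even though three million iterations were requested.
first_two: List[Dict[str, str]] = list(itertools.islice(record_stream(3_000_000), 2))

# Eager: materialising the whole stream is fine for small counts.
all_records: List[Dict[str, str]] = list(record_stream(3))

print(len(first_two), len(all_records))
# 2 3
```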

datacraft.builder.parse_spec(raw_spec)

Parses the raw spec into a DataSpec object. Takes in specs that may contain shorthand specifications. This is helpful if the spec is going to be reused in different scenarios. Otherwise, prefer the generator or entries functions.

Parameters:

raw_spec (dict) – raw dictionary that conforms to JSON spec format

Return type:

DataSpec

Returns:

the fully parsed and loaded spec

Examples

>>> import datacraft
>>> raw_spec = { "field": {"type": "values", "data": [10, 100, 1000]}}
>>> spec = datacraft.parse_spec(raw_spec)
>>> record = list(spec.generator(1))
datacraft.builder.record_entries(data_class, raw_spec, iterations, **kwargs)

Creates a list of instances of a given data class from the provided spec.

Parameters:
  • data_class (Type[TypeVar(T)]) – The data class to create instances of.

  • raw_spec (Dict[str, Dict]) – Specification to create entries for.

  • iterations (int) – Number of iterations before max.

Keyword Arguments:
  • processor – (RecordProcessor): For any record level transformations such as templating or formatters.

  • output – (OutputHandlerInterface): For any field or record level output.

  • data_dir (str) – Path to the data directory with CSV files and such.

  • enforce_schema (bool) – If schema validation should be applied where possible.

Return type:

List[TypeVar(T)]

Returns:

List of instances of the data class.

Examples

>>> from dataclasses import dataclass
>>> from datacraft import record_entries
>>> @dataclass
... class Entry:
...     id: str
...     timestamp: str
...     handle: str
>>> raw_spec = {
...     "id": {"type": "uuid"},
...     "timestamp": {"type": "date.iso.millis"},
...     "handle": {"type": "cc-word", "config": { "min": 4, "max": 8, "prefix": "@" } }
... }
>>> print(*record_entries(Entry, raw_spec, 3), sep='\n')
Entry(id='d5aeb7fa-374c-4228-8645-e8953165f163', timestamp='2024-07-03T04:10:10.016', handle='@DAHQDSsF')
Entry(id='acde6f46-4692-45a7-8f0c-d0a8736c4386', timestamp='2024-07-06T17:43:36.653', handle='@vBTf71sP')
Entry(id='4bb5542f-bf7d-4237-a972-257e24a659dd', timestamp='2024-08-01T03:06:49.724', handle='@gzfY_akS')
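The conversion from generated dictionaries to data class instances can be pictured as unpacking each record into the class constructor. The to_records helper below is hypothetical and the sample row reuses values from the output above; how datacraft performs the mapping internally is not stated in this section.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Type, TypeVar

T = TypeVar("T")


@dataclass
class Entry:
    id: str
    timestamp: str
    handle: str


def to_records(data_class: Type[T], rows: Iterable[Dict[str, str]]) -> Iterator[T]:
    """Hypothetical helper: build one data class instance per dictionary."""
    for row in rows:
        # Keys in the record must match the data class field names.
        yield data_class(**row)


rows = [
    {"id": "d5aeb7fa-374c-4228-8645-e8953165f163",
     "timestamp": "2024-07-03T04:10:10.016",
     "handle": "@DAHQDSsF"},
]
entries = list(to_records(Entry, rows))
print(entries[0].handle)
# @DAHQDSsF
```

A consequence of this approach is that the field names in the spec and in the data class have to line up; a mismatch would raise a TypeError in this sketch.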
datacraft.builder.record_generator(data_class, raw_spec, iterations, **kwargs)

Creates a generator that yields instances of a given data class from the provided spec.

Parameters:
  • data_class (Type[TypeVar(T)]) – The data class to create instances of.

  • raw_spec (Dict[str, Dict]) – Specification to create generator for.

  • iterations (int) – Number of iterations before max.

Keyword Arguments:
  • processor – (RecordProcessor): For any record level transformations such as templating or formatters.

  • output – (OutputHandlerInterface): For any field or record level output.

  • data_dir (str) – Path to the data directory with CSV files and such.

  • enforce_schema (bool) – If schema validation should be applied where possible.

Yields:

Instances of the data class.

Return type:

Generator[TypeVar(T), None, None]

Returns:

The generator for the provided spec.

datacraft.builder.values_for(field_spec, iterations, **kwargs)

Creates n values from the provided field spec

Parameters:
  • field_spec (Dict[str, Dict]) – to create values from

  • iterations (int) – number of iterations before max

Keyword Arguments:

enforce_schema (bool) – If schema validation should be applied where possible

Return type:

List[dict]

Returns:

the list of N values

Raises:

SpecException if field_spec is not valid

Examples

>>> import datacraft
>>> datacraft.values_for({"type": "uuid"}, 3)
['3ab92d2f-58d5-4328-a60e-72ee616199eb', 'cd5d5b64-ff25-4a2f-b69e-5a8c39841fc2', '2326f5c4-1b47-4913-8575-a71950f0fcce']
>>> datacraft.values_for({"type": "ip", "config": {"prefix": "address:"}}, 3)
['address:243.228.123.130', 'address:4.22.163.89', 'address:175.230.40.87']
>>> datacraft.values_for({"type": "values", "data": ["cat", "dog", "dragon"]}, 3)
['cat', 'dog', 'dragon']

Outputs Module

Module holds output related classes and functions

class datacraft.outputs.WriterInterface

Interface for classes that write the generated values out

abstract write(value)

Write the value to the configured output destination

Parameters:

value – to write
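A writer only has to provide a write method. The sketch below mirrors that shape with a hypothetical abstract base class and an in-memory implementation that is handy in tests; both classes are illustrations and are not part of datacraft.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class Writer(ABC):
    """Hypothetical mirror of the single-method interface described above."""

    @abstractmethod
    def write(self, value: Any) -> None:
        """Write the value to the configured output destination."""


class ListWriter(Writer):
    """Collects written values in memory instead of sending them to a file or stdout."""

    def __init__(self) -> None:
        self.written: List[str] = []

    def write(self, value: Any) -> None:
        self.written.append(str(value))


writer = ListWriter()
writer.write({"id": 1})
writer.write("second record")
print(writer.written)
# ["{'id': 1}", 'second record']
```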

datacraft.outputs.file_name_engine(prefix, extension)

Creates a templating engine that will produce file names based on the count

Parameters:
  • prefix (str) – prefix for file name

  • extension (str) – suffix for file name

Return type:

RecordProcessor

Returns:

template engine for producing file names

datacraft.outputs.get_writer(outdir=None, outfile=None, overwrite=False, **kwargs)

Creates the appropriate output writer from the given args and params

If no output directory is specified or configured, output is written to stdout

Parameters:
  • outdir (Optional[str]) – Directory to write output to

  • outfile (Optional[str]) – If a specific file should be used for the output, default is to construct the name from kwargs

  • overwrite (bool) – Should existing files with the same name be overwritten

Keyword Arguments:
  • outfile_prefix – the prefix of the output files, e.g. test-data-

  • extension – to append to the file name prefix, e.g. .csv

  • suppress_output – if output to stdout should be suppressed, only valid if outdir is None

Return type:

WriterInterface

Returns:

The configured Writer

Examples

>>> import datacraft
>>> csv_writer = datacraft.outputs.get_writer('./output', outfile_prefix='test-data-', extension='.csv')
datacraft.outputs.incrementing_file_writer(outdir, engine)

Creates a WriterInterface that increments the count in the file name once records_per_file have been written

Parameters:
  • outdir (str) – output directory

  • engine (RecordProcessor) – to generate file names with

Return type:

WriterInterface

Returns:

a Writer that increments the count in the file name
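The sketch below illustrates the idea of a writer that moves to a new numbered file on every write. IncrementingFileWriter is a hypothetical class, and the naming scheme (prefix, then count, then extension) is an assumption; the file names datacraft produces may differ.

```python
import tempfile
from pathlib import Path
from typing import List


class IncrementingFileWriter:
    """Hypothetical writer: each write() goes to a new, numbered file."""

    def __init__(self, outdir: str, prefix: str, extension: str) -> None:
        self._outdir = Path(outdir)
        self._prefix = prefix
        self._extension = extension
        self._count = 0

    def write(self, value: str) -> Path:
        # Assumed naming scheme: <prefix><count><extension>.
        path = self._outdir / f"{self._prefix}{self._count}{self._extension}"
        path.write_text(value, encoding="utf-8")
        self._count += 1
        return path


with tempfile.TemporaryDirectory() as outdir:
    writer = IncrementingFileWriter(outdir, "test-data-", ".csv")
    names: List[str] = [writer.write(batch).name for batch in ("a,b\n", "c,d\n")]

print(names)
# ['test-data-0.csv', 'test-data-1.csv']
```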

datacraft.outputs.processor(template=None, format_name=None)

Configures the record level processor for either the template or for the format_name

Parameters:
  • template (Union[str, Path, None]) – path to template or template as string

  • format_name (Optional[str]) – one of the valid registered formatter names

Return type:

Optional[RecordProcessor]

Returns:

RecordProcessor if a valid template or format_name is provided, None otherwise

Raises:

SpecException when format_name is not registered or if both template and format_name are specified

Examples

>>> import datacraft
>>> engine = datacraft.outputs.processor(template='/path/to/template.jinja')
>>> engine = datacraft.outputs.processor(template='Inline: {{ variable }}')
>>> formatter = datacraft.outputs.processor(format_name='json')
>>> formatter = datacraft.outputs.processor(format_name='my_custom_registered_format')
datacraft.outputs.record_level(record_processor, writer, records_per_file=1)

Creates an OutputHandler for record level events

Parameters:
  • record_processor (RecordProcessor) – to process the records into strings

  • writer (WriterInterface) – to write the processed records

  • records_per_file (int) – number of records to accumulate before writing

Return type:

OutputHandlerInterface

Returns:

OutputHandlerInterface
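The records_per_file parameter implies buffering: records accumulate until the batch is full, then the batch is processed into a string and handed to the writer. The BatchingHandler below is a hypothetical sketch of that pattern; its method names (handle, flush) are illustrative and are not taken from OutputHandlerInterface.

```python
from typing import Callable, Dict, List


class BatchingHandler:
    """Hypothetical handler: buffer records, then process and write them in batches."""

    def __init__(self, process: Callable[[List[Dict]], str],
                 write: Callable[[str], None], records_per_file: int = 1) -> None:
        self._process = process
        self._write = write
        self._records_per_file = records_per_file
        self._buffer: List[Dict] = []

    def handle(self, record: Dict) -> None:
        self._buffer.append(record)
        if len(self._buffer) >= self._records_per_file:
            self.flush()

    def flush(self) -> None:
        # Write whatever is buffered, e.g. a final partial batch.
        if self._buffer:
            self._write(self._process(self._buffer))
            self._buffer = []


written: List[str] = []
handler = BatchingHandler(
    process=lambda batch: ",".join(record["id"] for record in batch),
    write=written.append,
    records_per_file=2,
)
for number in range(5):
    handler.handle({"id": str(number)})
handler.flush()
print(written)
# ['0,1', '2,3', '4']
```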

datacraft.outputs.single_field(writer, output_key)

Creates an OutputHandler for field level events

Parameters:
  • writer (WriterInterface) – to write the processed records

  • output_key (bool) – if the key should be output along with the value

Returns:

OutputHandlerInterface

datacraft.outputs.single_file_writer(outdir, outname, overwrite)

Creates a Writer for a single output file

Parameters:
  • outdir (str) – output directory

  • outname (str) – output file name

  • overwrite (bool) – whether to overwrite existing output files

Return type:

WriterInterface

Returns:

Writer for a single file

datacraft.outputs.stdout_writer()

Creates a WriterInterface that writes results to stdout

Return type:

WriterInterface

Returns:

writer that writes to stdout

datacraft.outputs.suppress_output_writer()

Returns a writer that suppresses the output to stdout

Return type:

WriterInterface

Template Engines

Handles loading and creating the templating engine

datacraft.template_engines.for_file(template_file)

Loads the templating engine for the template file specified

Parameters:

template_file (Union[str, Path]) – to fill in, string or Path

Return type:

RecordProcessor

Returns:

the templating engine

datacraft.template_engines.string(template)

Returns a template engine for processing templates as strings

Return type:

RecordProcessor

Spec Formatters

data spec formatting

Module with functions that handle formatting specs in an orderly and consistent structure, for example:

{
  "type": "<type name>",
  "data": "data stuff",
  "refs": "refs pointers",
  "config": {
    "key": "value..."
  }
}

References

JSON Custom formatting https://stackoverflow.com/questions/13249415/how-to-implement-custom-indentation-when-pretty-printing-with-the-json-module

YAML custom formatting from https://til.simonwillison.net/python/style-yaml-dump via: https://stackoverflow.com/a/8641732 and https://stackoverflow.com/a/16782282

datacraft.spec_formatters.format_json(raw_spec)

Formats the raw_spec as an ordered dictionary in JSON

Parameters:

raw_spec (dict) – to format

Return type:

str

Returns:

the ordered and formatted JSON string
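The ordering shown in the module description (type, data, refs, config) can be approximated with a few lines of standard-library code. The order_field helper and the KEY_ORDER constant are hypothetical; the real formatter also applies custom indentation, which this sketch does not attempt.

```python
import json
from typing import Any, Dict

# Assumed key order, taken from the structure shown in the module description.
KEY_ORDER = ["type", "data", "refs", "config"]


def order_field(field_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: put the well-known keys first, keep any others after them."""
    ordered = {key: field_spec[key] for key in KEY_ORDER if key in field_spec}
    ordered.update((key, value) for key, value in field_spec.items() if key not in ordered)
    return ordered


raw_spec = {"handle": {"config": {"prefix": "@"}, "type": "cc-word"}}
formatted = json.dumps(
    {name: order_field(field) for name, field in raw_spec.items()}, indent=2
)
print(formatted)
```

Because Python dictionaries preserve insertion order, json.dumps emits the keys in the order the helper arranged them.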

datacraft.spec_formatters.format_yaml(raw_spec)

Formats the raw_spec as an ordered dictionary in YAML

Parameters:

raw_spec (dict) – to format

Return type:

str

Returns:

the ordered and formatted YAML string

Data Spec Inference

class datacraft.infer.RefsAggregator

Class for collecting references while building inferred specs

add(key, val)

Add spec to refs section with given key/name

Parameters:
  • key (str) – Name used to reference this spec

  • val (dict) – Field Spec for this key/name

class datacraft.infer.ValueListAnalyzer

Interface class for implementations that infer a Field Spec from a list of values

abstract compatibility_score(values)

Check if the analyzer is compatible with the provided values.

Parameters:

values (Generator[Any, None, None]) – Generator producing values to check.

Returns:

0 for not compatible, increasing in steps up to 1 for fully compatible

Return type:

int

abstract generate_spec(name, values, refs, **kwargs)

Generate a specification for the provided list of values. Adds any necessary refs to refs aggregator as needed.

Parameters:
  • name (str) – name of field this spec is being generated for

  • values (List[Any]) – List of values to generate the spec for.

  • refs (RefsAggregator) – for adding refs if needed for generated spec.

Keyword Arguments:
  • limit – for lists or weighted values, down sample to this size if needed

  • limit_weighted – take top N limit weights

  • duplication_threshold (float) – ratio of unique to total items, if above this threshold, use weighted values

Returns:

A dictionary with the inferred spec for the values.

Return type:

Dict[str, Any]
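To show how the two abstract methods fit together, the sketch below implements them on a plain class for numeric values. NumericRangeAnalyzer is hypothetical and does not subclass ValueListAnalyzer so that the example stays self-contained; the rand_range output shape follows the from_examples output shown later in this section.

```python
from typing import Any, Dict, Iterable, List


class NumericRangeAnalyzer:
    """Hypothetical analyzer with the two-method shape described above."""

    def compatibility_score(self, values: Iterable[Any]) -> float:
        materialized = list(values)
        # bool is a subclass of int, so exclude it explicitly.
        numeric = all(
            isinstance(value, (int, float)) and not isinstance(value, bool)
            for value in materialized
        )
        return 1.0 if materialized and numeric else 0.0

    def generate_spec(self, name: str, values: List[Any], refs: Any,
                      **kwargs: Any) -> Dict[str, Any]:
        # refs is unused here; an analyzer that needs shared specs would add them to it.
        return {"type": "rand_range", "data": [min(values), max(values)]}


analyzer = NumericRangeAnalyzer()
bar_values = [22.3, 44.5]
if analyzer.compatibility_score(iter(bar_values)) > 0:
    print(analyzer.generate_spec("bar", bar_values, refs=None))
# {'type': 'rand_range', 'data': [22.3, 44.5]}
```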

datacraft.infer.csv_to_spec(file_path, **kwargs)

Read a CSV from the provided file path, convert it to JSON records, and then pass it to the from_examples function to get the spec.

Parameters:

file_path (str) – The path to the CSV file.

Keyword Arguments:
  • limit (int) – for lists or weighted values, down sample to this size if needed

  • limit_weighted (bool) – take top N limit weights

Returns:

The inferred data spec from the CSV data.

Return type:

Dict[str, Union[str, Dict]]
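The first step described above, turning CSV rows into JSON-like records, can be sketched with csv.DictReader. The inline CSV text is made-up sample data, and the snippet stops before the inference step.

```python
import csv
import io
from typing import Dict, List

# Made-up sample data; a real call would read from file_path instead.
csv_text = "name,age\nalice,30\nbob,25\n"

records: List[Dict[str, str]] = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
print(records)
# [{'name': 'alice', 'age': '30'}, {'name': 'bob', 'age': '25'}]
```

Note that csv.DictReader returns every value as a string, so any numeric typing has to happen in a later step.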

datacraft.infer.from_examples(examples, **kwargs)

Generates a Data Spec from the list of example JSON records

Parameters:

examples (list) – Data to infer Data Spec from

Keyword Arguments:
  • limit (int) – for lists or weighted values, down sample to this size if needed

  • limit_weighted (bool) – take top N limit weights

  • duplication_threshold (float) – ratio of unique to total items, if above this threshold, use weighted values

Returns:

Data Spec as dictionary

Return type:

dict

Examples

>>> import datacraft.infer as infer
>>> xmpls = [
...     {"foo": {"bar": 22.3, "baz": "single"}},
...     {"foo": {"bar": 44.5, "baz": "double"}}
... ]
>>>
>>> infer.from_examples(xmpls)
{'foo': {'type': 'nested', 'fields': {'bar': {'type': 'rand_range', 'data': [22.3, 44.5]}, 'baz': {'type': 'values', 'data': ['single', 'double']}}}}
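The limit and duplication_threshold keywords both relate to how repeated values are summarised. The sketch below computes the two quantities involved, the ratio of unique to total items and a weight per value; unique_ratio and to_weights are hypothetical helpers, and how from_examples combines them is not specified here.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def unique_ratio(values: List[Any]) -> float:
    """Ratio of unique to total items, the quantity compared to the threshold."""
    return len(set(values)) / len(values)


def to_weights(values: List[Any], limit: Optional[int] = None) -> Dict[str, float]:
    """Hypothetical helper: turn observed values into weights, keeping the top `limit`."""
    counts = Counter(values).most_common(limit)
    total = sum(count for _, count in counts)
    return {str(value): count / total for value, count in counts}


observed = ["a", "a", "b", "c"]
print(unique_ratio(observed))
# 0.75
print(to_weights(observed))
# {'a': 0.5, 'b': 0.25, 'c': 0.25}
```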
datacraft.infer.infer_csv_select(file_path)

Infers a csv_select spec from the given csv file

Parameters:

file_path (str) – The path to the CSV file.

Returns:

The csv_select Data Spec for the given csv data.

Return type:

Dict[str, Union[str, Dict]]