Datacraft API
The Datacraft API can be used to generate data in the same way as the command line tooling. Data Specs are defined
as dictionaries and follow the JSON based format and schemas. Most of the time you can copy a JSON spec from a file,
assign it to a variable, and it will generate the same data as the command line datacraft tool.
Examples:
entries and generator
By default, datacraft will generate dictionaries from the data specs. You can access a list of generated dictionaries with the datacraft.entries function. If you have a lot of data to generate, you will want to use a generator instead; call datacraft.generator to access the data this way.
import datacraft
spec = {
    "id": {"type": "uuid"},
    "timestamp": {"type": "date.iso.millis"},
    "handle": {"type": "cc-word", "config": {"min": 4, "max": 8, "prefix": "@"}}
}
print(*datacraft.entries(spec, 3), sep='\n')
# {'id': '40bf8be1-23d2-4e93-9b8b-b37103c4b18c', 'timestamp': '2050-12-03T20:40:03.709', 'handle': '@WPNn'}
# {'id': '3bb5789e-10d1-4ae3-ae61-e0682dad8ecf', 'timestamp': '2050-11-20T02:57:48.131', 'handle': '@kl1KUdtT'}
# {'id': '474a439a-8582-46a2-84d6-58bfbfa10bca', 'timestamp': '2050-11-29T18:08:44.971', 'handle': '@XDvquPI'}
# or if you prefer a generator
for record in datacraft.generator(spec, 3_000_000):
    pass
record_entries and record_generator
If you are using dataclasses, you can tell datacraft to return your data as a dataclass using the record_entries function.
import datacraft
from dataclasses import dataclass
@dataclass
class Entry:
    id: str
    timestamp: str
    handle: str

spec = {
    "id": {"type": "uuid"},
    "timestamp": {"type": "date.iso.millis"},
    "handle": {"type": "cc-word", "config": {"min": 4, "max": 8, "prefix": "@"}}
}
print(*datacraft.record_entries(Entry, spec, 3), sep='\n')
# Entry(id='1a5d8158-f095-49f2-abaf-eef2e33b4075', timestamp='2050-07-11T18:58:30.376', handle='@g7Lu0Vd4')
# Entry(id='f9e23a54-f9e8-4aa4-b3f5-aca45d89dd2c', timestamp='2050-07-21T20:00:32.290', handle='@kBCD7')
# Entry(id='61239ab0-2d3d-420f-be01-15ec5d730fd1', timestamp='2050-07-04T13:53:07.322', handle='@GlWfzV6r')
# or if you prefer a generator
for record in datacraft.record_generator(Entry, spec, 3_000_000):
    pass
values_for
If you only want the generated values for a specific field, say you want 100 uuids, then you can use the datacraft.values_for function.
import datacraft
datacraft.values_for({"type": "uuid"}, 3)
# ['3ab92d2f-58d5-4328-a60e-72ee616199eb', 'cd5d5b64-ff25-4a2f-b69e-5a8c39841fc2', '2326f5c4-1b47-4913-8575-a71950f0fcce']
datacraft.values_for({"type": "ip", "config": {"prefix": "address:"}}, 3)
# ['address:243.228.123.130', 'address:4.22.163.89', 'address:175.230.40.87']
datacraft.values_for({"type": "date.iso"}, 3)
# ['2050-07-21T17:08:41', '2050-07-19T11:33:04', '2050-07-06T20:08:36']
registered_types and type_usage
There are some functions that can be helpful for getting the list of registered types as well as examples for using them with the API.
import datacraft
# List all registered types:
datacraft.registered_types()
# ['calculate', 'char_class', 'cc-ascii', 'cc-lower', '...', 'uuid', 'values', 'replace', 'regex_replace']
# Print API usage for a specific type or types
print(datacraft.type_usage('char_class', 'replace', '...'))
# Example Output
"""
-------------------------------------
replace | API Example:
import datacraft
spec = {
"field": {
"type": "values",
"data": ["foo", "bar", "baz"]
},
"replacement": {
"type": "replace",
"data": {"ba": "fi"},
"ref": "field"
}
}
print(*datacraft.entries(spec, 3), sep='\n')
{'field': 'foo', 'replacement': 'foo'}
{'field': 'bar', 'replacement': 'fir'}
{'field': 'baz', 'replacement': 'fiz'}
"""
Core Classes
- class datacraft.DataSpec(raw_spec)
Class representing a DataSpec object
- abstract generator(iterations, **kwargs)
Creates a generator that will produce records or render the template for each record
- Parameters:
iterations (int) – number of iterations to execute
**kwargs
- Keyword Arguments:
processor – (RecordProcessor): For any Record Level transformations such as templating or formatters
output – (OutputHandlerInterface): For any field or record level output
data_dir (str) – path the data directory with csv files and such
enforce_schema (bool) – If schema validation should be applied where possible
- Yields:
Records or rendered template strings
- Return type:
Generator
Examples
>>> import datacraft
>>> raw_spec = {'name': ['bob', 'bobby', 'robert', 'bobo']}
>>> spec = datacraft.parse_spec(raw_spec)
>>> template = 'Name: {{ name }}'
>>> processor = datacraft.outputs.processor(template=template)
>>> generator = spec.generator(
...     iterations=4,
...     processor=processor)
>>> record = next(generator)
>>> print(record)
Name: bob
- get(*args, **kwargs)
Return the value for key if key is in the dictionary, else default.
- items() a set-like object providing a view on D's items
- keys() a set-like object providing a view on D's keys
- pop(k[, d]) v, remove specified key and return the corresponding value.
If the key is not found, return the default if given; otherwise, raise a KeyError.
- abstract to_pandas(iterations)
Converts iterations number of records into a pandas DataFrame
- Parameters:
iterations (int) – number of iterations to run / records to generate
- Returns:
DataFrame with records as rows
- values() an object providing a view on D's values
- class datacraft.ValueSupplierInterface
Interface for Classes that supply values
- abstract next(iteration)
Produces the next value for the given iteration
- Parameters:
iteration – current iteration
- Returns:
the next value
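To illustrate the contract, here is a minimal standalone supplier that cycles through a fixed list (a sketch for illustration; the class name and data are not part of datacraft):

```python
class CyclingSupplier:
    """Supplies values by cycling through a fixed list of data."""

    def __init__(self, data):
        self.data = data

    def next(self, iteration):
        # the current iteration number selects which element to return
        return self.data[iteration % len(self.data)]

supplier = CyclingSupplier(['red', 'green', 'blue'])
print(supplier.next(0))  # red
print(supplier.next(1))  # green
print(supplier.next(3))  # red
```

Anything with a conforming next(iteration) method can serve as a value supplier for a field.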
- class datacraft.Loader
Parent object for loading value suppliers from specs
- abstract get(key)
Retrieve the value supplier for the given field or ref key
- Parameters:
key (str) – key for field or ref name
- Return type:
- Returns:
the Value Supplier for the given key
- Raises:
SpecException – if key not found
- abstract get_from_spec(field_spec)
Retrieve the value supplier for the given field spec
- Parameters:
field_spec (Any) – dictionary spec or literal values
- Return type:
- Returns:
the Value Supplier for the given spec
- Raises:
SpecException – if unable to resolve the spec with appropriate handler for the type
- abstract get_ref(key)
returns the spec for the ref with the provided key
- Parameters:
key (str) – key to lookup ref by
- Return type:
dict
- Returns:
Ref for key
- abstract property spec
get the preprocessed field specs for this loader
- class datacraft.Distribution
Interface Class for a numeric distribution such as a Uniform or Gaussian distribution
- abstract next_value()
get the next value for this distribution
- Return type:
float
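A sketch of what a conforming distribution might look like (illustrative only; the built-in distributions are uniform and normal):

```python
import random

class TriangularDistribution:
    """Sketch of a numeric distribution whose values cluster around a mode."""

    def __init__(self, low, high, mode):
        self.low = low
        self.high = high
        self.mode = mode

    def next_value(self):
        # returns a float drawn from a triangular distribution over [low, high]
        return random.triangular(self.low, self.high, self.mode)

dist = TriangularDistribution(0.0, 10.0, 2.0)
value = dist.next_value()  # a float between 0.0 and 10.0
```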
- class datacraft.CasterInterface
Interface for Classes that cast objects to different types
- abstract cast(value)
casts the value according to the specified type
- Parameters:
value (Any) – value to cast
- Return type:
Any
- Returns:
the cast form of the value
- Raises:
SpecException – when unable to cast value
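A sketch of a conforming caster (the class is hypothetical; a real implementation would raise SpecException, for which ValueError stands in here):

```python
class RoundingCaster:
    """Sketch of a caster: rounds numeric values to a fixed precision."""

    def __init__(self, places=2):
        self.places = places

    def cast(self, value):
        try:
            return round(float(value), self.places)
        except (TypeError, ValueError) as err:
            # datacraft casters raise SpecException here; ValueError stands in for it
            raise ValueError(f'unable to cast value: {value}') from err

caster = RoundingCaster(places=1)
print(caster.cast('3.14159'))  # 3.1
```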
- class datacraft.RecordProcessor
A Class that takes in a generated record and returns it formatted as a string for output
- abstract process(record)
Processes the given record into the appropriate output string
- Parameters:
record (Union[list, dict]) – generated record for current iteration
- Return type:
str
- Returns:
The formatted record
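For illustration, a processor that formats records as JSON lines (a standalone sketch; datacraft's own processors handle templates and formatters):

```python
import json

class JsonLineProcessor:
    """Sketch of a record processor: formats each generated record as a JSON string."""

    def process(self, record):
        # record may be a dict or a list of dicts
        return json.dumps(record, sort_keys=True)

processor = JsonLineProcessor()
print(processor.process({'id': 1, 'name': 'bob'}))  # {"id": 1, "name": "bob"}
```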
- class datacraft.OutputHandlerInterface
Interface for handling generated output values
- abstract finished_iterations()
This is called when all iterations have been completed
- abstract finished_record(iteration, group_name, exclude_internal=False)
This is called whenever all of the fields for a record have been generated for one iteration
- Parameters:
iteration (int) – iteration we are on
group_name (str) – group this record is a part of
exclude_internal (bool) – if internal fields should be excluded from the output record
- abstract handle(key, value)
This is called each time a new value is generated for a given field
- Parameters:
key (str) – the field name
value (Any) – the new value for the field
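A sketch showing how the three callbacks cooperate (class and field names are illustrative):

```python
class CollectingHandler:
    """Sketch of an output handler: accumulates fields into records in memory."""

    def __init__(self):
        self.current = {}
        self.records = []

    def handle(self, key, value):
        # called once per generated field value
        self.current[key] = value

    def finished_record(self, iteration, group_name, exclude_internal=False):
        # called when all fields for one iteration are done
        self.records.append(self.current)
        self.current = {}

    def finished_iterations(self):
        # called once after the final iteration
        pass

handler = CollectingHandler()
handler.handle('id', 1)
handler.handle('name', 'bob')
handler.finished_record(0, 'DEFAULT')
print(handler.records)  # [{'id': 1, 'name': 'bob'}]
```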
Registry Decorators
- class datacraft.registries.Registry
Catalogue registry for types, preprocessors, logging configuration, and others
- types
Types for field specs, registered functions for creating ValueSupplierInterface that will supply values for the given type
>>> @datacraft.registry.types('special_sauce')
... def _handle_special_type(field_spec: dict, loader: datacraft.Loader) -> ValueSupplierInterface:
...     # return ValueSupplierInterface from spec config
- schemas
Schemas for field spec types, used to validate that the spec for a given type conforms to the schema for it
>>> @datacraft.registry.schemas('special_sauce')
... def _special_sauce_schema() -> dict:
...     # return JSON schema validating specs with type: special_sauce
- usage
Usage for field spec types, used to provide command line help and examples
>>> @datacraft.registry.usage('special_sauce')
... def _special_sauce_usage() -> Union[str, dict]:
...     # return string describing how to use special_sauce
...     # or a dictionary with {"cli": "cli usage example", "api": "api usage example"}
- preprocessors
Functions to modify specs before the data generation process. If there is a customization you want to apply to every data spec, or an extension you added requires modifications to specs before they are run, this is where you would register that pre-processor.
>>> @datacraft.registry.preprocessors('custom-preprocessing')
... def _preprocess_spec_to_some_end(raw_spec: dict) -> dict:
...     # return spec with any modifications
- logging
Custom logging setup. Can override or modify the default logging behavior.
>>> @datacraft.registry.logging('denoise')
... def _customize_logging(loglevel: str):
...     logging.getLogger('too.verbose.module').level = logging.ERROR
- formats
Registered formats for output, used with the --format <format name> option. Unlike other registered functions, this one is called directly to perform the required formatting. The return value from the formatter is the new value that will be written to the configured output (default is console).
>>> @datacraft.registry.formats('custom_format')
... def _format_custom(record: dict) -> str:
...     # write to database or some other custom output,
...     # return something to write out or print to console
- distribution
Different numeric distributions, normal, uniform, etc. These are used for more nuanced count values. The built-in distributions are uniform and normal.
>>> @datacraft.registry.distribution('hyperbolic_inverse_haversine')
... def _hyperbolic_inverse_haversine(mean, stddev, **kwargs):
...     # return a datacraft.Distribution, args can be custom for the defined distribution
- defaults
Default values. Different types have different default values for some configs. This provides a mechanism to override or to register other custom defaults. Read a default from the registry with datacraft.registries.get_default('var_key'), while datacraft.registries.all_defaults() will give a mapping of all registered default keys and values.
>>> @datacraft.registry.defaults('special_sauce_ingredient')
... def _default_special_sauce_ingredient():
...     # return the default value (i.e. onions)
- casters
Cast or alter values in simple ways. These are all the valid forms of altering generated values after they are created, outside of the ValueSupplier types. Use datacraft.registries.registered_casters() to get a list of all the currently registered ones.
>>> @datacraft.registry.casters('reverse')
... def _cast_reverse_strings() -> datacraft.CasterInterface:
...     # return a datacraft.CasterInterface
- analyzers
Used by the Data Spec inference tool chain to analyze the list of values for a given field to try to determine an appropriate Field Spec that can be used to approximate the data values present
>>> @datacraft.registry.num_analyzers('custom')
... def _special_value_analyzer() -> datacraft.ValueListAnalyzer:
...     # return a datacraft.ValueListAnalyzer
Datacraft Errors
- class datacraft.SpecException
A SpecException indicates that there is a fatal flaw with the configuration or data associated with a Data Spec or one of the described Field Specs. Common errors include undefined or misspelled references, missing or invalid configuration parameters, and invalid or missing data definitions.
- class datacraft.SupplierException
A SupplierException indicates that there is a fatal flaw with the configuration or data associated with a Data Spec during run time.
- class datacraft.ResourceError
A ResourceError indicates that an underlying resource such as a schema file was not able to be found or loaded.
Suppliers Module
Factory like module for core supplier related functions.
- datacraft.suppliers.alter(supplier, **kwargs)
Covers multiple suppliers that alter values if configured to do so through kwargs: cast, buffer, and decorate
- Parameters:
supplier – to alter if configured to do so
- Keyword Arguments:
cast (str) – caster to apply
prefix (str) – prefix to prepend to value, default is ‘’
suffix (str) – suffix to append to value, default is ‘’
quote (str) – string to both append and prepend to value, default is ‘’
buffer (bool) – if the values should be buffered
buffer_size (int) – size of buffer to use
- Return type:
- Returns:
supplier with alterations
- datacraft.suppliers.array_supplier(wrapped, **kwargs)
Wraps an existing supplier and always returns an array/list of elements, uses count config to determine number of items in the list
- Parameters:
wrapped (ValueSupplierInterface) – the underlying supplier
- Keyword Arguments:
count – constant, list, or weighted map
data – alias for count
count_dist – distribution in named param function style format
- Return type:
- Returns:
The value supplier
Examples
>>> import datacraft
>>> pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"])
>>> returns_mostly_two = datacraft.suppliers.array_supplier(pet_supplier, count_dist="normal(mean=2, stddev=1)")
>>> pet_array = returns_mostly_two.next(0)
- datacraft.suppliers.buffered(wrapped, **kwargs)
Creates a Value Supplier that buffers the results of the wrapped supplier, allowing the retrieval of previously generated values
- Parameters:
wrapped (ValueSupplierInterface) – the Value Supplier to buffer values for
- Keyword Arguments:
buffer_size – number of produced values to buffer
- Return type:
- Returns:
a buffered value supplier
- datacraft.suppliers.calculate(suppliers_map, formula)
Creates a calculate supplier
- Parameters:
suppliers_map (Dict[str, ValueSupplierInterface]) – map of name to supplier of values for that name
formula (str) – formula to evaluate, should reference keys in suppliers_map
- Return type:
- Returns:
supplier with calculated values
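Conceptually, a calculate supplier draws one value from each named supplier and evaluates the formula against those values. A simplified plain-Python model of the idea (datacraft itself uses a restricted expression evaluator rather than raw eval; class names here are illustrative):

```python
class CalculateSketch:
    """Simplified model of a calculate supplier."""

    def __init__(self, suppliers_map, formula):
        self.suppliers_map = suppliers_map
        self.formula = formula

    def next(self, iteration):
        # draw the current value from each named supplier, then evaluate the formula
        values = {name: s.next(iteration) for name, s in self.suppliers_map.items()}
        return eval(self.formula, {'__builtins__': {}}, values)

class Constant:
    """Trivial supplier that returns the same value every iteration."""
    def __init__(self, value):
        self.value = value
    def next(self, iteration):
        return self.value

calc = CalculateSketch({'height_in': Constant(70)}, 'height_in * 2.54')
print(calc.next(0))  # approximately 177.8
```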
- datacraft.suppliers.cast(supplier, cast_to)
Provides a cast supplier from explicit cast
- Parameters:
supplier (ValueSupplierInterface) – supplier to cast results of
cast_to (str) – type to cast values to
- Return type:
- Returns:
the casting supplier
- datacraft.suppliers.character_class(data, **kwargs)
Creates a character class supplier for the given data
- Parameters:
data – set of characters to supply as values
- Keyword Arguments:
join_with (str) – string to join characters with, default is ‘’
exclude (str) – set of characters to exclude from returned values
escape (str) – set of characters to escape, i.e. " -> \"
escape_str (str) – string to use for escaping
mean (float) – mean number of characters to produce
stddev (float) – standard deviation from the mean
count (int) – number of elements in list to use
count_dist (str) – count distribution to use
min (int) – minimum number of characters to return
max (int) – maximum number of characters to return
- Returns:
supplier for characters
- datacraft.suppliers.combine(to_combine, join_with=None, as_list=None)
Creates a value supplier that will combine the outputs of the provided suppliers in order. The default is to join the values with an empty string. Provide the join_with config param to specify a different string to join the values with. Set as_list to true, if the values should be returned as a list and not joined
- Parameters:
to_combine – list of suppliers to combine in order of combination
as_list (Optional[bool]) – if the results should be returned as a list
join_with (Optional[str]) – value to use to join the values
- Returns:
the combined supplier
Examples
>>> import datacraft
>>> pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"], sample=True)
>>> job_supplier = datacraft.suppliers.values(["breeder", "trainer", "fighter", "wrestler"], sample=True)
>>> interesting_jobs = datacraft.suppliers.combine([pet_supplier, job_supplier], join_with=' ')
>>> next_career = interesting_jobs.next(0)
>>> next_career
'pig wrestler'
- datacraft.suppliers.constant(data)
Creates value supplier for the single value
- Parameters:
data (Any) – constant data to return on every iteration
- Return type:
- Returns:
value supplier for the single value
Examples
>>> import datacraft
>>> single_int_supplier = datacraft.suppliers.constant(42)
>>> single_str_supplier = datacraft.suppliers.constant("42")
>>> single_float_supplier = datacraft.suppliers.constant(42.42)
- datacraft.suppliers.count_supplier(**kwargs)
creates a count supplier from the config, if the count param is defined, otherwise uses default of 1
optionally can specify count or count_dist.
- valid data for counts:
integer, i.e. 1, 7, 99
list of integers: [1, 7, 99], [1], [1, 2, 1, 2, 3]
weighted map, where keys are numeric strings: {"1": 0.6, "2": 0.4}
count_dist will be interpreted as a distribution, i.e. "uniform(start=10, end=100)"
- Keyword Arguments:
count – constant, list, or weighted map
data – alias for count
count_dist (str) – distribution in named param function style format
- Return type:
- Returns:
a count supplier
Examples
>>> import datacraft
>>> counts = datacraft.suppliers.count_supplier(count_dist="uniform(start=10, end=100)")
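To illustrate how a weighted count map such as {"1": 0.6, "2": 0.4} behaves, here is a plain-Python sketch of the weighted selection idea (a model of the behavior, not datacraft's internals):

```python
import random

def weighted_count(weight_map):
    """Pick a count from a map of numeric-string keys to relative weights."""
    counts = [int(key) for key in weight_map]
    weights = list(weight_map.values())
    # for the map below, roughly 60% of picks return 1 and 40% return 2
    return random.choices(counts, weights=weights, k=1)[0]

picks = [weighted_count({"1": 0.6, "2": 0.4}) for _ in range(1000)]
```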
- datacraft.suppliers.csv(csv_path, **kwargs)
Creates a csv supplier
- Parameters:
csv_path – path to csv file to supply data from
- Keyword Arguments:
column (int) – 1 based column number, default is 1
sample (bool) – if the values for the column should be sampled, if supported
count – constant, list, or weighted map
count_dist – distribution in named param function style format
delimiter (str) – how items are separated, default is ‘,’
quotechar (str) – string used to quote values, default is ‘”’
headers (bool) – if the CSV file has a header row
sample_rows (bool) – if sampling should happen at a row level, not valid if buffering is set to true
- Returns:
supplier for csv field
- datacraft.suppliers.cut(supplier, start=0, end=None)
Trim output of given supplier from start to end, if length permits
- Parameters:
supplier (ValueSupplierInterface) – supplier to get output from
start (int) – where in output string to cut from (inclusive)
end (Optional[int]) – where to end cut (exclusive)
- Returns:
The shortened version of the output string
- datacraft.suppliers.date(**kwargs)
Creates a supplier that provides date values according to specified format and ranges
Can use one of center_date or (start, end, offset, duration_days) etc.
- Parameters:
**kwargs
- Keyword Arguments:
format (str) – Format string for dates
center_date (str) – Date matching format to center dates around
stddev_days (float) – Standard deviation in days from center date
start (str) – start date string
end (str) – end date string
offset (int) – number of days to shift the duration, positive is back, negative is forward
duration_days (int) – number of days after start, default is 30
- Return type:
- Returns:
supplier for dates
- datacraft.suppliers.decorated(supplier, **kwargs)
Creates a decorated supplier around the provided one
- Parameters:
supplier (ValueSupplierInterface) – the supplier to alter
**kwargs
- Keyword Arguments:
prefix (str) – prefix to prepend to value, default is ‘’
suffix (str) – suffix to append to value, default is ‘’
quote (str) – string to both append and prepend to value, default is ‘’
- Return type:
- Returns:
the decorated supplier
Examples
>>> import datacraft
>>> nums = datacraft.suppliers.values([1, 2, 3, 4, 5])
>>> prefix_supplier = datacraft.suppliers.decorated(nums, prefix='you are number ')
>>> prefix_supplier.next(0)
you are number 1
>>> suffix_supplier = datacraft.suppliers.decorated(nums, suffix=' more minutes')
>>> suffix_supplier.next(0)
1 more minutes
>>> quoted_supplier = datacraft.suppliers.decorated(nums, quote='"')
>>> quoted_supplier.next(0)
"1"
- datacraft.suppliers.distribution_supplier(distribution)
creates a ValueSupplier that uses the given distribution to generate values
- Parameters:
distribution (Distribution) – distribution to use
- Return type:
- Returns:
the value supplier
- datacraft.suppliers.epoch_date(as_millis=False, **kwargs)
Creates a supplier that provides epoch dates
Can use one of center_date or (start, end, offset, duration_days) etc.
- Parameters:
as_millis (bool) – if the timestamp should be millis since epoch, default is seconds
- Keyword Arguments:
format (str) – Format string for date args used, required if any provided
center_date (str) – Date matching format to center dates around
stddev_days (float) – Standard deviation in days from center date
start (str) – start date string
end (str) – end date string
offset (int) – number of days to shift the duration, positive is back, negative is forward
duration_days (int) – number of days after start, default is 30
- Return type:
- Returns:
supplier for dates
- datacraft.suppliers.from_list_of_suppliers(supplier_list, modulate_iteration=True)
Returns a supplier that rotates through the provided suppliers incrementally
- Parameters:
supplier_list (List[ValueSupplierInterface]) – suppliers to rotate through
modulate_iteration (bool) – if the iteration number should be modulated by the index of the supplier
- Return type:
- Returns:
a supplier for these suppliers
Examples
>>> import datacraft
>>> nice_pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"])
>>> mean_pet_supplier = datacraft.suppliers.values(["alligator", "cobra", "mongoose", "killer bee"])
>>> pet_supplier = datacraft.suppliers.from_list_of_suppliers([nice_pet_supplier, mean_pet_supplier])
>>> pet_supplier.next(0)
'dog'
>>> pet_supplier.next(1)
'alligator'
- datacraft.suppliers.geo_lat(**kwargs)
configures geo latitude type
- Keyword Arguments:
precision (int) – number of digits after decimal place
start_lat (int) – minimum value for latitude
end_lat (int) – maximum value for latitude
bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]
- Return type:
- Returns:
supplier for geo.lat type
- datacraft.suppliers.geo_long(**kwargs)
configures geo longitude type
- Keyword Arguments:
precision (int) – number of digits after decimal place
start_long (int) – minimum value for longitude
end_long (int) – maximum value for longitude
bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]
- Return type:
- Returns:
supplier for geo.long type
- datacraft.suppliers.geo_pair(**kwargs)
Creates geo pair supplier
- Keyword Arguments:
precision (int) – number of digits after decimal place
lat_first (bool) – if latitude should be populated before longitude
start_lat (int) – minimum value for latitude
end_lat (int) – maximum value for latitude
start_long (int) – minimum value for longitude
end_long (int) – maximum value for longitude
bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]
as_list (bool) – if the values should be returned as a list
join_with (str) – if the values should be joined with the provided string
- Returns:
supplier for geo.pair type
- datacraft.suppliers.ip_precise(cidr, sample=False)
Creates a value supplier that produces precise ip addresses from the given cidr
- Parameters:
cidr (str) – notation specifying ip range
sample (bool) – if the ip addresses should be sampled from the available set
- Return type:
- Returns:
supplier for precise ip addresses
Examples
>>> import datacraft
>>> ips = datacraft.suppliers.ip_precise(cidr="192.168.0.0/22", sample=False)
>>> ips.next(0)
'192.168.0.0'
>>> ips.next(1)
'192.168.0.1'
>>> ips = datacraft.suppliers.ip_precise(cidr="192.168.0.0/22", sample=True)
>>> ips.next(0)
'192.168.0.127'
>>> ips.next(1)
'192.168.0.196'
- datacraft.suppliers.ip_supplier(**kwargs)
Creates a value supplier for IPv4 addresses
- Keyword Arguments:
base (str) – base of ip address, i.e. “192”, “10.” “100.100”, “192.168.”, “10.10.10”
cidr (str) – cidr to use, only /8, /16, or /24, i.e. "192.168.0.0/24", "10.0.0.0/16", "100.0.0.0/8"
- Return type:
- Returns:
supplier for ip addresses
- Raises:
SpecException – if one of base or cidr is not provided
Examples
>>> import datacraft
>>> ips = datacraft.suppliers.ip_supplier(base="192.168.1")
>>> ips.next(0)
'192.168.1.144'
- datacraft.suppliers.list_count_sampler(data, **kwargs)
Samples N elements from the data list based on config. If count is provided, exactly count elements will be returned each iteration. If only min is provided, between min and the total number of elements will be returned. If only max is provided, between one and max elements will be returned. Specifying both min and max will provide a sample containing a number of elements in this range.
- Parameters:
data (list) – list to select subset from
- Keyword Arguments:
count – number of elements in list to use
count_dist – count distribution to use
min – minimum number of values to return
max – maximum number of values to return
join_with – value to join values with, default is None
- Return type:
- Returns:
the supplier
Examples
>>> import datacraft
>>> pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
>>> pet_supplier = datacraft.suppliers.list_count_sampler(pet_list, min=2, max=5)
>>> pet_supplier.next(0)
['rabbit', 'cat', 'pig', 'cat']
>>> pet_supplier = datacraft.suppliers.list_count_sampler(pet_list, count_dist="normal(mean=2,stddev=1,min=1)")
>>> pet_supplier.next(0)
['pig', 'horse']
- datacraft.suppliers.list_stats_sampler(data, **kwargs)
sample from list (or string) with stats based params
- Parameters:
data (Union[str, list]) – list to select subset from
- Keyword Arguments:
mean (float) – mean number of items/characters to produce
stddev (float) – standard deviation from the mean
count (int) – number of elements in list/characters to use
count_dist (str) – count distribution to use
min (int) – minimum number of items/characters to return
max (int) – maximum number of items/characters to return
- Return type:
- Returns:
the supplier
Examples
>>> import datacraft
>>> pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
>>> pet_supplier = datacraft.suppliers.list_stats_sampler(pet_list, mean=2, stddev=1)
>>> new_pets = pet_supplier.next(0)
>>> char_supplier = datacraft.suppliers.list_stats_sampler("#!@#$%^&*()_-~", min=2, mean=4, max=8)
>>> two_to_eight_chars = char_supplier.next(0)
- datacraft.suppliers.list_values(data, **kwargs)
creates a Value supplier for the list of provided data
- Parameters:
data (list) – data for the supplier
- Keyword Arguments:
as_list (bool) – if data should be returned as a list
sample (bool) – if the data should be sampled instead of iterated through incrementally
count – constant, list, or weighted map
count_dist (str) – distribution in named param function style format
- Return type:
- Returns:
the ValueSupplierInterface for the data list
- datacraft.suppliers.mac_address(delimiter=None)
Creates a value supplier that produces mac addresses
- Parameters:
delimiter (Optional[str]) – how mac address pieces are separated, default is ':'
- Return type:
- Returns:
supplier for mac addresses
Examples
>>> import datacraft
>>> macs = datacraft.suppliers.mac_address()
>>> macs.next(0)
'1E:D4:0F:59:41:FA'
>>> macs = datacraft.suppliers.mac_address('-')
>>> macs.next(0)
'4D-93-36-59-BD-09'
- datacraft.suppliers.random_range(start, end, precision=None, count=1)
Creates a random range supplier for the start and end parameters with the given precision (number of decimal places)
- Parameters:
start (Union[str, int, float]) – start of range
end (Union[str, int, float]) – end of range
precision (Union[str, int, float, None]) – number of decimal points to keep
count (Union[int, List[int], Dict[str, float], Distribution]) – number of elements to return, default is one
- Return type:
- Returns:
the value supplier for the range
Examples
>>> num_supplier = datacraft.suppliers.random_range(5, 25, precision=3)
>>> # should be between 5 and 25 with 3 decimal places
>>> num_supplier.next(0)
8.377
- datacraft.suppliers.range_supplier(start, end, step=1, **kwargs)
Creates a Value Supplier for given range of data
- Parameters:
start (Union[int, float]) – start of range
end (Union[int, float]) – end of range
step (Union[int, float]) – step of range values
- Keyword Arguments:
precision (int) – Number of decimal places to use, in case of floating point range
- Returns:
supplier to supply ranges of values with
- datacraft.suppliers.resettable(iterator)
Wraps a ResettableIterator to supply values from
- Parameters:
iterator (ResettableIterator) – iterator with reset() method
- Returns:
supplier to supply generated values with
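The wrapped iterator must expose a reset() method. A minimal sketch of such a ResettableIterator (illustrative and standalone; the class name is not part of datacraft):

```python
class ListResettableIterator:
    """Sketch of a resettable iterator: iterates a list and can be rewound."""

    def __init__(self, data):
        self.data = data
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.index >= len(self.data):
            raise StopIteration
        value = self.data[self.index]
        self.index += 1
        return value

    def reset(self):
        # rewind so iteration starts from the beginning again
        self.index = 0

it = ListResettableIterator(['a', 'b'])
print(next(it), next(it))  # a b
it.reset()
print(next(it))  # a
```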
- datacraft.suppliers.sample(data, **kwargs)
Creates a supplier that selects elements from the data list based on the supplier kwargs
- Parameters:
data (list) – list of data values to supply values from
- Keyword Arguments:
mean (float) – mean number of values to include in list
stddev (float) – standard deviation from the mean
count – number of elements in list to use
count_dist – count distribution to use
min – minimum number of values to return
max – maximum number of values to return
join_with – value to join values with, default is None
- Returns:
supplier to supply subsets of data list
Examples
>>> import datacraft
>>> supplier = datacraft.suppliers.sample(['dog', 'cat', 'rat'], mean=2)
>>> supplier.next(1)
['cat', 'rat']
- datacraft.suppliers.templated(supplier_map, template_str)
Creates a supplier that populates the template string from the supplier map
- Parameters:
supplier_map (Dict[str, ValueSupplierInterface]) – map of field name -> value supplier for it
template_str – templated string to populate
- Return type:
- Returns:
value supplier for template
Examples
>>> from datacraft import suppliers
>>> char_to_num_supplier = {
...     'char': suppliers.values(['a', 'b', 'c']),
...     'num': suppliers.values([1, 2, 3])
... }
>>> letter_number_template = 'letter {{ char }}, number {{ num }}'
>>> supplier = suppliers.templated(char_to_num_supplier, letter_number_template)
>>> supplier.next(0)
'letter a, number 1'
- datacraft.suppliers.unicode_range(data, **kwargs)
Creates a unicode supplier for single or multiple unicode ranges
- Parameters:
data – list of unicode ranges to sample from
- Keyword Arguments:
mean (float) – mean number of values to produce
stddev (float) – standard deviation from the mean
count (int) – number of unicode characters to produce
count_dist (str) – count distribution to use
min (int) – minimum number of characters to return
max (int) – maximum number of characters to return
as_list (bool) – if the results should be returned as a list
join_with (str) – value to join values with, default is ‘’
- Returns:
supplier to supply subsets of data list
- datacraft.suppliers.uuid(variant=None)
Creates a UUID Value Supplier
- Parameters:
variant (Optional[int]) – variant of uuid to use, default is 4
- Return type:
- Returns:
supplier to supply uuids with
- datacraft.suppliers.values(spec, **kwargs)
Based on the spec, return the appropriate values supplier. The data can be a spec, constant, list, or dict, or just the raw data.
- Parameters:
spec (Any) – spec to load values from, or raw data itself
**kwargs – extra kwargs to add to config
- Keyword Arguments:
as_list (bool) – if data should be returned as a list
sample (bool) – if the data should be sampled instead of iterated through incrementally
count – constant, list, or weighted map
count_dist (str) – distribution in named param function style format
- Return type:
ValueSupplierInterface
- Returns:
the values supplier for the spec
Examples
>>> import datacraft
>>> raw_spec = {"type": "values", "data": [1, 2, 3, 5, 8, 13]}
>>> fib_supplier = datacraft.suppliers.values(raw_spec)
>>> fib_supplier = datacraft.suppliers.values([1, 2, 3, 5, 8, 13])
>>> fib_supplier.next(0)
1
>>> weights = {"1": 0.1, "2": 0.2, "3": 0.1, "4": 0.2, "5": 0.1, "6": 0.2, "7": 0.1}
>>> mostly_even_supplier = datacraft.suppliers.values(weights)
>>> mostly_even_supplier.next(0)
'4'
- datacraft.suppliers.weighted_values(data, config=None)
Creates a weighted value supplier from the data, which is a mapping of value to the weight it should represent.
- Parameters:
data (dict) – data for the supplier
config (Optional[dict]) – optional config (Default value = None)
- Return type:
ValueSupplierInterface
- Returns:
the supplier
- Raises:
SpecException – if data is empty
Examples
>>> import datacraft
>>> pets = {
...     "dog": 0.5, "cat": 0.2, "bunny": 0.1, "hamster": 0.1, "pig": 0.05, "snake": 0.04, "_NULL_": 0.01
... }
>>> weighted_pet_supplier = datacraft.suppliers.weighted_values(pets)
>>> most_likely_a_dog = weighted_pet_supplier.next(0)
Builder Module
Module for parsing and helper functions for specs
- datacraft.builder.entries(raw_spec, iterations, **kwargs)
Creates n entries/records from the provided spec
- Parameters:
raw_spec (Dict[str, Dict]) – spec to create entries for
iterations (int) – number of iterations before max
- Keyword Arguments:
processor (RecordProcessor) – For any record level transformations such as templating or formatters
output (OutputHandlerInterface) – For any field or record level output
data_dir (str) – path to the data directory with csv files and such
enforce_schema (bool) – If schema validation should be applied where possible
- Return type:
List[dict]
- Returns:
the list of N entries/records
Examples
>>> import datacraft
>>> field_spec = {
...     "id": {"type": "uuid"},
...     "timestamp": {"type": "date.iso.millis"},
...     "handle": {"type": "cc-word", "config": { "min": 4, "max": 8, "prefix": "@" } }
... }
>>> print(*datacraft.entries(field_spec, 3), sep='\n')
{'id': '40bf8be1-23d2-4e93-9b8b-b37103c4b18c', 'timestamp': '2050-12-03T20:40:03.709', 'handle': '@WPNn'}
{'id': '3bb5789e-10d1-4ae3-ae61-e0682dad8ecf', 'timestamp': '2050-11-20T02:57:48.131', 'handle': '@kl1KUdtT'}
{'id': '474a439a-8582-46a2-84d6-58bfbfa10bca', 'timestamp': '2050-11-29T18:08:44.971', 'handle': '@XDvquPI'}
- datacraft.builder.generator(raw_spec, iterations, **kwargs)
Creates a generator for the raw spec for the specified iterations
- Parameters:
raw_spec (Dict[str, Dict]) – spec to create generator for
iterations (int) – number of iterations before max
- Keyword Arguments:
processor (RecordProcessor) – For any record level transformations such as templating or formatters
output (OutputHandlerInterface) – For any field or record level output
data_dir (str) – path to the data directory with csv files and such
enforce_schema (bool) – If schema validation should be applied where possible
- Yields:
Records or rendered template strings
- Return type:
Generator
- Returns:
the generator for the provided spec
- datacraft.builder.parse_spec(raw_spec)
Parses the raw spec into a DataSpec object. Takes in specs that may contain shorthand specifications. This is helpful if the spec is going to be reused in different scenarios. Otherwise, prefer the generator or entries functions.
- Parameters:
raw_spec (dict) – raw dictionary that conforms to JSON spec format
- Return type:
DataSpec
- Returns:
the fully parsed and loaded spec
Examples
>>> import datacraft
>>> raw_spec = {"field": {"type": "values", "data": [10, 100, 1000]}}
>>> spec = datacraft.parse_spec(raw_spec)
>>> record = list(spec.generator(1))
- datacraft.builder.record_entries(data_class, raw_spec, iterations, **kwargs)
Creates a list of instances of a given data class from the provided spec.
- Parameters:
data_class (Type[T]) – The data class to create instances of.
raw_spec (Dict[str, Dict]) – Specification to create entries for.
iterations (int) – Number of iterations before max.
- Keyword Arguments:
processor (RecordProcessor) – For any record level transformations such as templating or formatters.
output (OutputHandlerInterface) – For any field or record level output.
data_dir (str) – Path to the data directory with CSV files and such.
enforce_schema (bool) – If schema validation should be applied where possible.
- Return type:
List[T]
- Returns:
List of instances of the data class.
Examples
>>> import datacraft
>>> from dataclasses import dataclass
>>> @dataclass
... class Entry:
...     id: str
...     timestamp: str
...     handle: str
>>> raw_spec = {
...     "id": {"type": "uuid"},
...     "timestamp": {"type": "date.iso.millis"},
...     "handle": {"type": "cc-word", "config": { "min": 4, "max": 8, "prefix": "@" } }
... }
>>> print(*datacraft.record_entries(Entry, raw_spec, 3), sep='\n')
Entry(id='d5aeb7fa-374c-4228-8645-e8953165f163', timestamp='2024-07-03T04:10:10.016', handle='@DAHQDSsF')
Entry(id='acde6f46-4692-45a7-8f0c-d0a8736c4386', timestamp='2024-07-06T17:43:36.653', handle='@vBTf71sP')
Entry(id='4bb5542f-bf7d-4237-a972-257e24a659dd', timestamp='2024-08-01T03:06:49.724', handle='@gzfY_akS')
- datacraft.builder.record_generator(data_class, raw_spec, iterations, **kwargs)
Creates a generator that yields instances of a given data class from the provided spec.
- Parameters:
data_class (Type[T]) – The data class to create instances of.
raw_spec (Dict[str, Dict]) – Specification to create generator for.
iterations (int) – Number of iterations before max.
- Keyword Arguments:
processor (RecordProcessor) – For any record level transformations such as templating or formatters.
output (OutputHandlerInterface) – For any field or record level output.
data_dir (str) – Path to the data directory with CSV files and such.
enforce_schema (bool) – If schema validation should be applied where possible.
- Yields:
Instances of the data class.
- Return type:
Generator[T, None, None]
- Returns:
The generator for the provided spec.
- datacraft.builder.values_for(field_spec, iterations, **kwargs)
Creates n values from the provided field spec
- Parameters:
field_spec (Dict[str, Dict]) – field spec to create values from
iterations (int) – number of iterations before max
- Keyword Arguments:
enforce_schema (bool) – If schema validation should be applied where possible
- Return type:
List[dict]
- Returns:
the list of N values
- Raises:
SpecException – if field_spec is not valid
Examples
>>> import datacraft
>>> datacraft.values_for({"type": "uuid"}, 3)
['3ab92d2f-58d5-4328-a60e-72ee616199eb', 'cd5d5b64-ff25-4a2f-b69e-5a8c39841fc2', '2326f5c4-1b47-4913-8575-a71950f0fcce']
>>> datacraft.values_for({"type": "ip", "config": {"prefix": "address:"}}, 3)
['address:243.228.123.130', 'address:4.22.163.89', 'address:175.230.40.87']
>>> datacraft.values_for({"type": "values", "data": ["cat", "dog", "dragon"]}, 3)
['cat', 'dog', 'dragon']
Outputs Module
Module holds output related classes and functions
- class datacraft.outputs.WriterInterface
Interface for classes that write the generated values out
- abstract write(value)
Write the value to the configured output destination
- Parameters:
value – to write
- datacraft.outputs.file_name_engine(prefix, extension)
Creates a templating engine that will produce file names based on the count
- Parameters:
prefix (str) – prefix for the file name
extension (str) – suffix for the file name
- Return type:
RecordProcessor
- Returns:
template engine for producing file names
- datacraft.outputs.get_writer(outdir=None, outfile=None, overwrite=False, **kwargs)
Creates the appropriate output writer from the given args and params
If no output directory is specified/configured, will write to stdout
- Parameters:
outdir (Optional[str]) – Directory to write output to
outfile (Optional[str]) – If a specific file should be used for the output, default is to construct the name from kwargs
overwrite (bool) – Should existing files with the same name be overwritten
- Keyword Arguments:
outfile_prefix – the prefix of the output files i.e. test-data-
extension – to append to the file name prefix i.e. .csv
suppress_output – if output to stdout should be suppressed, only valid if outdir is None
- Return type:
WriterInterface
- Returns:
The configured Writer
Examples
>>> import datacraft
>>> csv_writer = datacraft.outputs.get_writer('./output', outfile_prefix='test-data-', extension='.csv')
- datacraft.outputs.incrementing_file_writer(outdir, engine)
Creates a WriterInterface that increments the count in the file name once records_per_file have been written
- Parameters:
outdir (str) – output directory
engine (RecordProcessor) – to generate file names with
- Return type:
WriterInterface
- Returns:
a Writer that increments the count in the file name
- datacraft.outputs.processor(template=None, format_name=None)
Configures the record level processor for either the template or for the format_name
- Parameters:
template (Union[str, Path, None]) – path to template or template as string
format_name (Optional[str]) – one of the valid registered formatter names
- Return type:
Optional[RecordProcessor]
- Returns:
RecordProcessor if valid template or format_name provided, None otherwise
- Raises:
SpecException – when format_name is not registered or if both template and format specified
Examples
>>> import datacraft
>>> engine = datacraft.outputs.processor(template='/path/to/template.jinja')
>>> engine = datacraft.outputs.processor(template='Inline: {{ variable }}')
>>> formatter = datacraft.outputs.processor(format_name='json')
>>> formatter = datacraft.outputs.processor(format_name='my_custom_registered_format')
- datacraft.outputs.record_level(record_processor, writer, records_per_file=1)
Creates an OutputHandler for record level events
- Parameters:
record_processor (RecordProcessor) – to process the records into strings
writer (WriterInterface) – to write the processed records
records_per_file (int) – number of records to accumulate before writing
- Return type:
OutputHandlerInterface
- Returns:
OutputHandlerInterface
- datacraft.outputs.single_field(writer, output_key)
Creates an OutputHandler for field level events
- Parameters:
writer (WriterInterface) – to write the processed records
output_key (bool) – if the key should be output along with the value
- Returns:
OutputHandlerInterface
- datacraft.outputs.single_file_writer(outdir, outname, overwrite)
Creates a Writer for a single output file
- Parameters:
outdir (str) – output directory
outname (str) – output file name
overwrite (bool) – if it should overwrite existing output files
- Return type:
WriterInterface
- Returns:
Writer for a single file
- datacraft.outputs.stdout_writer()
Creates a WriterInterface that writes results to stdout
- Return type:
WriterInterface
- Returns:
writer that writes to stdout
- datacraft.outputs.suppress_output_writer()
Returns a writer that suppresses the output to stdout
- Return type:
WriterInterface
Template Engines
Handles loading and creating the templating engine
- datacraft.template_engines.for_file(template_file)
Loads the templating engine for the template file specified
- Parameters:
template_file (Union[str, Path]) – to fill in, string or Path
- Return type:
RecordProcessor
- Returns:
the templating engine
- datacraft.template_engines.string(template)
Returns a template engine for processing templates as strings
- Return type:
RecordProcessor
Spec Formatters
data spec formatting
Module with functions that handle formatting specs in an orderly and consistent structure, i.e.:
{
"type": "<type name>",
"data": "data stuff",
"refs": "refs pointers",
"config": {
"key": "value..."
}
}
References
JSON Custom formatting https://stackoverflow.com/questions/13249415/how-to-implement-custom-indentation-when-pretty-printing-with-the-json-module
YAML custom formatting from https://til.simonwillison.net/python/style-yaml-dump via: https://stackoverflow.com/a/8641732 and https://stackoverflow.com/a/16782282
- datacraft.spec_formatters.format_json(raw_spec)
Formats the raw_spec as ordered dictionary in JSON
- Parameters:
raw_spec (dict) – to format
- Return type:
str
- Returns:
the ordered and formatted JSON string
- datacraft.spec_formatters.format_yaml(raw_spec)
Formats the raw_spec as ordered dictionary in YAML
- Parameters:
raw_spec (dict) – to format
- Return type:
str
- Returns:
the ordered and formatted YAML string
Data Spec Inference
- class datacraft.infer.RefsAggregator
Class for collecting references when building inferred specs
- add(key, val)
Add spec to refs section with given key/name
- Parameters:
key (str) – Name used to reference this spec
val (dict) – Field Spec for this key/name
- class datacraft.infer.ValueListAnalyzer
Interface class for implementations that infer a Field Spec from a list of values
- abstract compatibility_score(values)
Check if the analyzer is compatible with the provided values.
- Parameters:
values (Generator[Any, None, None]) – Generator producing values to check.
- Returns:
0 for not compatible, with steps up to 1 for fully compatible
- Return type:
int
- abstract generate_spec(name, values, refs, **kwargs)
Generate a specification for the provided list of values. Adds any necessary refs to refs aggregator as needed.
- Parameters:
name (str) – name of field this spec is being generated for
values (List[Any]) – List of values to generate the spec for.
refs (RefsAggregator) – for adding refs if needed for generated spec.
- Keyword Arguments:
limit – for lists or weighted values, down sample to this size if needed
limit_weighted – take top N limit weights
duplication_threshold (float) – ratio of unique to total items, if above this threshold, use weighted values
- Returns:
A dictionary with the inferred spec for the values.
- Return type:
Dict[str, Any]
- datacraft.infer.csv_to_spec(file_path, **kwargs)
Read a CSV from the provided file path, convert it to JSON records, and then pass it to the from_examples function to get the spec.
- Parameters:
file_path (str) – The path to the CSV file.
- Keyword Arguments:
limit (int) – for lists or weighted values, down sample to this size if needed
limit_weighted (bool) – take top N limit weights
- Returns:
The inferred data spec from the CSV data.
- Return type:
Dict[str, Union[str, Dict]]
- datacraft.infer.from_examples(examples, **kwargs)
Generates a Data Spec from the list of example JSON records
- Parameters:
examples (list) – Data to infer Data Spec from
- Keyword Arguments:
limit (int) – for lists or weighted values, down sample to this size if needed
limit_weighted (bool) – take top N limit weights
duplication_threshold (float) – ratio of unique to total items, if above this threshold, use weighted values
- Returns:
Data Spec as dictionary
- Return type:
dict
Examples
>>> import datacraft.infer as infer
>>> xmpls = [
...     {"foo": {"bar": 22.3, "baz": "single"}},
...     {"foo": {"bar": 44.5, "baz": "double"}}
... ]
>>>
>>> infer.from_examples(xmpls)
{'foo': {'type': 'nested', 'fields': {'bar': {'type': 'rand_range', 'data': [22.3, 44.5]}, 'baz': {'type': 'values', 'data': ['single', 'double']}}}}
- datacraft.infer.infer_csv_select(file_path)
Infers a csv_select spec from the given csv file
- Parameters:
file_path (str) – The path to the CSV file.
- Returns:
The csv_select Data Spec for the given csv data.
- Return type:
Dict[str, Union[str, Dict]]