Datacraft API
Contents
Core Classes
- class datacraft.DataSpec(raw_spec)
Class representing a DataSpec object
- generator(iterations, **kwargs)
Creates a generator that will produce records or render the template for each record
- Parameters
iterations (int) – number of iterations to execute
**kwargs –
- Keyword Arguments
processor – (RecordProcessor): For any Record Level transformations such as templating or formatters
output – (OutputHandlerInterface): For any field or record level output
data_dir (str) – path to the data directory with csv files and such
enforce_schema (bool) – If schema validation should be applied where possible
- Yields
Records or rendered template strings
Examples
>>> import datacraft
>>> raw_spec = {'name': ['bob', 'bobby', 'robert', 'bobo']}
>>> spec = datacraft.parse_spec(raw_spec)
>>> template = 'Name: {{ name }}'
>>> processor = datacraft.outputs.processor(template=template)
>>> generator = spec.generator(
...     iterations=4,
...     processor=processor)
>>> record = next(generator)
>>> print(record)
Name: bob
- Return type
Generator
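The generator contract above (one record or rendered string per iteration) can be pictured with a small stand-in that does not import datacraft. This is only a conceptual sketch: record_generator is a hypothetical helper, values are picked by cycling through each list, and str.format replaces the '{{ name }}' template syntax used by the real processor.

```python
from typing import Dict, Iterator, List


def record_generator(raw_spec: Dict[str, List[str]],
                     iterations: int,
                     template: str) -> Iterator[str]:
    """Yield one rendered string per iteration (conceptual stand-in only)."""
    for iteration in range(iterations):
        # pick the value for each field by cycling through its list
        record = {name: values[iteration % len(values)]
                  for name, values in raw_spec.items()}
        # str.format stands in for the '{{ name }}' style template
        yield template.format(**record)


raw_spec = {'name': ['bob', 'bobby', 'robert', 'bobo']}
generator = record_generator(raw_spec, iterations=4, template='Name: {name}')
first = next(generator)
remaining = list(generator)
```

Because the result is a generator, records are produced lazily; nothing is rendered until next() is called or the generator is iterated.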
- get(*args, **kwargs)
Return the value for key if key is in the dictionary, else default.
- items() → a set-like object providing a view on D's items
- keys() → a set-like object providing a view on D's keys
- pop(k[, d]) → v, remove specified key and return the corresponding value.
If key is not found, d is returned if given, otherwise KeyError is raised
- abstract to_pandas(iterations)
Converts iterations number of records into a pandas DataFrame
- Parameters
iterations (int) – number of iterations to run / records to generate
- Returns
DataFrame with records as rows
- values() → an object providing a view on D's values
- class datacraft.ValueSupplierInterface
Interface for Classes that supply values
- abstract next(iteration)
Produces the next value for the given iteration
- Parameters
iteration – current iteration
- Returns
the next value
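As a minimal sketch of the contract described above, the following stand-in mirrors the documented next(iteration) method without importing datacraft. SupplierSketch and CyclingSupplier are hypothetical names used for illustration; a real implementation would subclass datacraft.ValueSupplierInterface instead.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class SupplierSketch(ABC):
    """Hypothetical stand-in for datacraft.ValueSupplierInterface."""

    @abstractmethod
    def next(self, iteration: int) -> Any:
        """Produce the next value for the given iteration."""


class CyclingSupplier(SupplierSketch):
    """Cycles through a fixed list of values, indexed by iteration."""

    def __init__(self, values: List[Any]):
        if not values:
            raise ValueError("values must not be empty")
        self.values = values

    def next(self, iteration: int) -> Any:
        # wrap around so any non-negative iteration maps to a value
        return self.values[iteration % len(self.values)]


supplier = CyclingSupplier(["dog", "cat", "hamster"])
first_four = [supplier.next(i) for i in range(4)]
```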
- class datacraft.Loader
Parent object for loading value suppliers from specs
- abstract get(key)
Retrieve the value supplier for the given field or ref key
- Parameters
key (str) – key for field or ref name
- Return type
- Returns
the Value Supplier for the given key
- Raises
SpecException – if key not found
- abstract get_from_spec(field_spec)
Retrieve the value supplier for the given field spec
- Parameters
field_spec (Any) – dictionary spec or literal values
- Return type
- Returns
the Value Supplier for the given spec
- Raises
SpecException – if unable to resolve the spec with appropriate handler for the type
- abstract get_ref(key)
returns the spec for the ref with the provided key
- Parameters
key (str) – key to look up ref by
- Return type
dict
- Returns
Ref for key
- abstract property spec
get the preprocessed field specs for this loader
- class datacraft.Distribution
Interface Class for a numeric distribution such as a Uniform or Gaussian distribution
- abstract next_value()
get the next value for this distribution
- Return type
float
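A distribution only needs to hand back a float each time next_value() is called. The sketch below shows one way such a class might look, using the standard library instead of datacraft. DistributionSketch and BoundedNormal are hypothetical names, and clamping to a min/max range is an assumption made for the example.

```python
import random
from abc import ABC, abstractmethod
from typing import Optional


class DistributionSketch(ABC):
    """Hypothetical stand-in for datacraft.Distribution."""

    @abstractmethod
    def next_value(self) -> float:
        """Get the next value for this distribution."""


class BoundedNormal(DistributionSketch):
    """Gaussian values clamped to the range [minimum, maximum]."""

    def __init__(self, mean: float, stddev: float,
                 minimum: float, maximum: float,
                 seed: Optional[int] = None):
        self.mean = mean
        self.stddev = stddev
        self.minimum = minimum
        self.maximum = maximum
        # a private generator keeps the example reproducible when seeded
        self._rng = random.Random(seed)

    def next_value(self) -> float:
        value = self._rng.gauss(self.mean, self.stddev)
        # clamp so callers never see a value outside the bounds
        return max(self.minimum, min(self.maximum, value))


dist = BoundedNormal(mean=2, stddev=1, minimum=1, maximum=3, seed=7)
samples = [dist.next_value() for _ in range(100)]
```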
- class datacraft.CasterInterface
Interface for Classes that cast objects to different types
- abstract cast(value)
casts the value according to the specified type
- Parameters
value (Any) – to cast
- Return type
Any
- Returns
the cast form of the value
- Raises
SpecException – when unable to cast value
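The cast contract (return the converted value, raise when conversion fails) might be sketched as follows. Nothing here imports datacraft: CastErrorSketch stands in for SpecException and IntCasterSketch is a hypothetical caster, so treat the names as illustrative only.

```python
from typing import Any


class CastErrorSketch(Exception):
    """Hypothetical stand-in for datacraft.SpecException."""


class IntCasterSketch:
    """Casts values to int, truncating any fractional part."""

    def cast(self, value: Any) -> int:
        try:
            # go through float so strings such as "3.7" are accepted
            return int(float(value))
        except (TypeError, ValueError) as err:
            raise CastErrorSketch(f"unable to cast {value!r} to int") from err


caster = IntCasterSketch()
from_string = caster.cast("3.7")
from_float = caster.cast(42.9)
try:
    caster.cast("not a number")
    failed = False
except CastErrorSketch:
    failed = True
```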
- class datacraft.RecordProcessor
A Class that takes in a generated record and returns it formatted as a string for output
- abstract process(record)
Processes the given record into the appropriate output string
- Parameters
record (Union[list, dict]) – generated record for current iteration
- Return type
str
- Returns
The formatted record
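A record processor turns one generated record into a single output string. The following sketch shows the idea with a JSON formatter built on the standard library; JsonProcessorSketch is a hypothetical class and is not the formatter datacraft ships.

```python
import json
from typing import Union


class JsonProcessorSketch:
    """Hypothetical processor with the documented process(record) shape."""

    def process(self, record: Union[list, dict]) -> str:
        # sort_keys gives a stable field order regardless of insertion order
        return json.dumps(record, sort_keys=True)


processor = JsonProcessorSketch()
as_dict = processor.process({"name": "bob", "age": 42})
as_list = processor.process([{"id": 1}, {"id": 2}])
```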
- class datacraft.OutputHandlerInterface
Interface for handling generated output values
- abstract finished_iterations()
This is called when all iterations have been completed
- abstract finished_record(iteration, group_name, exclude_internal=False)
This is called whenever all of the fields for a record have been generated for one iteration
- Parameters
iteration (int) – iteration we are on
group_name (str) – group this record is a part of
exclude_internal (bool) – if internal fields should be excluded from output record
- abstract handle(key, value)
This is called each time a new value is generated for a given field
- Parameters
key (str) – the field name
value (Any) – the new value for the field
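The three callbacks above are invoked in a fixed order: handle for each field, finished_record once per iteration, and finished_iterations at the very end. A minimal sketch of a handler that collects records in memory is shown below. CollectingHandlerSketch and the driver loop are hypothetical; the loop only imitates the call order described in this section.

```python
from typing import Any, Dict, List


class CollectingHandlerSketch:
    """Hypothetical handler that accumulates records in a list."""

    def __init__(self) -> None:
        self.current: Dict[str, Any] = {}
        self.records: List[Dict[str, Any]] = []
        self.done = False

    def handle(self, key: str, value: Any) -> None:
        # called once per generated field value
        self.current[key] = value

    def finished_record(self, iteration: int, group_name: str,
                        exclude_internal: bool = False) -> None:
        # snapshot the fields gathered for this iteration, then reset
        self.records.append(dict(self.current))
        self.current.clear()

    def finished_iterations(self) -> None:
        self.done = True


handler = CollectingHandlerSketch()
names = ["ann", "bob"]
for iteration, name in enumerate(names):
    handler.handle("name", name)
    handler.handle("id", iteration)
    handler.finished_record(iteration, group_name="demo")
handler.finished_iterations()
```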
Registry Decorators
- class datacraft.registries.Registry
Catalogue registry for types, preprocessors, logging configuration, and others
- types
Types for field specs, registered functions for creating ValueSupplierInterface that will supply values for the given type
>>> @datacraft.registry.types('special_sauce')
... def _handle_special_type(field_spec: dict, loader: datacraft.Loader) -> ValueSupplierInterface:
...     # return ValueSupplierInterface from spec config
- schemas
Schemas for field spec types, used to validate that the spec for a given type conforms to the schema for it
>>> @datacraft.registry.schemas('special_sauce')
... def _special_sauce_schema() -> dict:
...     # return JSON schema validating specs with type: special_sauce
- usage
Usage for field spec types, used to provide command line help and examples
>>> @datacraft.registry.usage('special_sauce')
... def _special_sauce_usage() -> str:
...     # return string describing how to use special_sauce
- preprocessors
Functions to modify specs before the data generation process. If there is a customization you want to do for every data spec, or an extension you added that requires modifications to the spec before it is run, this is where you would register that pre-processor.
>>> @datacraft.registry.preprocessors('custom-preprocessing')
... def _preprocess_spec_to_some_end(raw_spec: dict) -> dict:
...     # return spec with any modification
- logging
Custom logging setup. Can override or modify the default logging behavior.
>>> @datacraft.registry.logging('denoise')
... def _customize_logging(loglevel: str):
...     logging.getLogger('too.verbose.module').level = logging.ERROR
- formats
Registered formats for output, selected with the --format <format name> command line option. Unlike other registered functions, this one is called directly to perform the required formatting function. The return value from the formatter is the new value that will be written to the configured output (default is console).
>>> @datacraft.registry.formats('custom_format')
... def _format_custom(record: dict) -> str:
...     # write to database or some other custom output, return something to write out or print to console
- distribution
Different numeric distributions: normal, uniform, etc. These are used for more nuanced count values. The built-in distributions are uniform and normal.
>>> @datacraft.registry.distribution('hyperbolic_inverse_haversine')
... def _hyperbolic_inverse_haversine(mean, stddev, **kwargs):
...     # return a datacraft.Distribution, args can be custom for the defined distribution
- defaults
Default values. Different types have different default values for some configs. This provides a mechanism to override or to register other custom defaults. Read a default from the registry with datacraft.registries.get_default('var_key'), while datacraft.registries.all_defaults() will give a mapping of all registered default keys and values.
>>> @datacraft.registry.defaults('special_sauce_ingredient')
... def _default_special_sauce_ingredient():
...     # return the default value (i.e. onions)
- casters
Cast or alter values in simple ways. These are all the valid forms of altering generated values after they are created outside of the ValueSupplier types. Use datacraft.registries.registered_casters() to get a list of all the currently registered ones.
>>> @datacraft.registry.casters('reverse')
... def _cast_reverse_strings():
...     # return a datacraft.CasterInterface
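All of the registry attributes above follow the same pattern: a decorator called with a key stores the decorated function under that key so it can be looked up later. The sketch below illustrates that pattern in plain Python. MiniRegistry is a hypothetical class written for this example and makes no claim about how datacraft's own registry is implemented.

```python
from typing import Callable, Dict, List


class MiniRegistry:
    """Hypothetical key -> function registry driven by a decorator."""

    def __init__(self) -> None:
        self._funcs: Dict[str, Callable] = {}

    def __call__(self, key: str) -> Callable[[Callable], Callable]:
        def decorator(func: Callable) -> Callable:
            # store the function under its key and return it unchanged
            self._funcs[key] = func
            return func
        return decorator

    def get(self, key: str) -> Callable:
        return self._funcs[key]

    def names(self) -> List[str]:
        return sorted(self._funcs)


casters = MiniRegistry()


@casters('reverse')
def _reverse(value) -> str:
    return str(value)[::-1]


@casters('upper')
def _upper(value) -> str:
    return str(value).upper()


result = casters.get('reverse')('abc')
registered = casters.names()
```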
Datacraft Errors
- class datacraft.SpecException
A SpecException indicates that there is a fatal flaw with the configuration or data associated with a Data Spec or one of the described Field Specs. Common errors include undefined or misspelled references, missing or invalid configuration parameters, and invalid or missing data definitions.
- class datacraft.SupplierException
A SupplierException indicates that there is a fatal flaw with the configuration or data associated with a Data Spec during run time.
- class datacraft.ResourceError
A ResourceError indicates that an underlying resource such as a schema file was not able to be found or loaded.
Suppliers Module
Factory like module for core supplier related functions.
- datacraft.suppliers.alter(supplier, **kwargs)
Covers multiple suppliers that alter values if configured to do so through kwargs: cast, buffer, and decorate
- Parameters
supplier – to alter if configured to do so
- Keyword Arguments
cast (str) – caster to apply
prefix (str) – prefix to prepend to value, default is ‘’
suffix (str) – suffix to append to value, default is ‘’
quote (str) – string to both append and prepend to value, default is ‘’
buffer (bool) – if the values should be buffered
buffer_size (int) – size of buffer to use
- Return type
- Returns
supplier with alterations
- datacraft.suppliers.array_supplier(wrapped, **kwargs)
Wraps an existing supplier and always returns an array/list of elements, uses count config to determine number of items in the list
- Parameters
wrapped (ValueSupplierInterface) – the underlying supplier
- Keyword Arguments
count – constant, list, or weighted map
data – alias for count
count_dist – distribution in named param function style format
- Return type
- Returns
The value supplier
Examples
>>> import datacraft
>>> pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"])
>>> returns_mostly_two = datacraft.suppliers.array_supplier(pet_supplier, count_dist="normal(mean=2, stddev=1)")
>>> pet_array = returns_mostly_two.next(0)
- datacraft.suppliers.buffered(wrapped, **kwargs)
Creates a Value Supplier that buffers the results of the wrapped supplier allowing the retrieval
- Parameters
wrapped (ValueSupplierInterface) – the Value Supplier to buffer values for
- Keyword Arguments
buffer_size – number of produced values to buffer
- Return type
- Returns
a buffered value supplier
- datacraft.suppliers.calculate(suppliers_map, formula)
Creates a calculate supplier
- Parameters
suppliers_map (Dict[str, ValueSupplierInterface]) – map of name to supplier of values for that name
formula (str) – to evaluate, should reference keys in suppliers_map
- Return type
- Returns
supplier with calculated values
- datacraft.suppliers.cast(supplier, cast_to)
Provides a cast supplier from explicit cast
- Parameters
supplier (ValueSupplierInterface) – to cast results of
cast_to (str) – type to cast values to
- Return type
- Returns
the casting supplier
- datacraft.suppliers.character_class(data, **kwargs)
Creates a character class supplier for the given data
- Parameters
data – set of characters to supply as values
- Keyword Arguments
join_with (str) – string to join characters with, default is ‘’
exclude (str) – set of characters to exclude from returned values
mean (float) – mean number of characters to produce
stddev (float) – standard deviation from the mean
count (int) – number of elements in list to use
count_dist (str) – count distribution to use
min (int) – minimum number of characters to return
max (int) – maximum number of characters to return
- Returns
supplier for characters
- datacraft.suppliers.combine(to_combine, join_with=None, as_list=None)
Creates a value supplier that will combine the outputs of the provided suppliers in order. The default is to join the values with an empty string. Provide the join_with config param to specify a different string to join the values with. Set as_list to true, if the values should be returned as a list and not joined
- Parameters
to_combine – list of suppliers to combine in order of combination
as_list (Optional[bool]) – if the results should be returned as a list
join_with (Optional[str]) – value to use to join the values
- Returns
supplier that combines the outputs of the provided suppliers
Examples
>>> import datacraft
>>> pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"], sample=True)
>>> job_supplier = datacraft.suppliers.values(["breeder", "trainer", "fighter", "wrestler"], sample=True)
>>> interesting_jobs = datacraft.suppliers.combine([pet_supplier, job_supplier], join_with=' ')
>>> next_career = interesting_jobs.next(0)
>>> next_career
'pig wrestler'
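The join_with and as_list behavior described for combine can be sketched without datacraft as below. combine_once is a hypothetical helper and the suppliers are plain callables taking an iteration number; only the joining logic is meant to mirror the description.

```python
from typing import Any, Callable, List, Optional, Union

# hypothetical supplier shape for this sketch: iteration -> value
Supplier = Callable[[int], Any]


def combine_once(suppliers: List[Supplier],
                 iteration: int,
                 join_with: Optional[str] = None,
                 as_list: bool = False) -> Union[str, List[Any]]:
    """Combine one value from each supplier, in order."""
    values = [supplier(iteration) for supplier in suppliers]
    if as_list:
        return values
    # default is to join with the empty string
    separator = "" if join_with is None else join_with
    return separator.join(str(value) for value in values)


pets = ["dog", "cat"]
jobs = ["breeder", "trainer"]


def pet_supplier(iteration: int) -> str:
    return pets[iteration % len(pets)]


def job_supplier(iteration: int) -> str:
    return jobs[iteration % len(jobs)]


joined = combine_once([pet_supplier, job_supplier], 1, join_with=" ")
squashed = combine_once([pet_supplier, job_supplier], 0)
listed = combine_once([pet_supplier, job_supplier], 0, as_list=True)
```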
- datacraft.suppliers.constant(data)
Creates value supplier for the single value
- Parameters
data (Any) – constant data to return on every iteration
- Return type
- Returns
value supplier for the single value
Examples
>>> import datacraft
>>> single_int_supplier = datacraft.suppliers.constant(42)
>>> single_str_supplier = datacraft.suppliers.constant("42")
>>> single_float_supplier = datacraft.suppliers.constant(42.42)
- datacraft.suppliers.count_supplier(**kwargs)
Creates a count supplier from the config if the count param is defined, otherwise uses the default of 1.
Optionally, count or count_dist can be specified.
- valid data for counts:
integer i.e. 1, 7, 99
list of integers: [1, 7, 99], [1], [1, 2, 1, 2, 3]
weighted map, where keys are numeric strings: {“1”: 0.6, “2”: 0.4}
count_dist will be interpreted as a distribution i.e:
- Keyword Arguments
count – constant, list, or weighted map
data – alias for count
count_dist (str) – distribution in named param function style format
- Return type
- Returns
a count supplier
Examples
>>> import datacraft
>>> counts = datacraft.suppliers.count_supplier(count_dist="uniform(start=10, end=100)")
- datacraft.suppliers.csv(csv_path, **kwargs)
Creates a csv supplier
- Parameters
csv_path – path to csv file to supply data from
- Keyword Arguments
column (int) – 1 based column number, default is 1
sample (bool) – if the values for the column should be sampled, if supported
count – constant, list, or weighted map
count_dist – distribution in named param function style format
delimiter (str) – how items are separated, default is ‘,’
quotechar (str) – string used to quote values, default is ‘”’
headers (bool) – if the CSV file has a header row
sample_rows (bool) – if sampling should happen at a row level, not valid if buffering is set to true
- Returns
supplier for csv field
- datacraft.suppliers.date(**kwargs)
Creates a supplier that provides date values according to the specified format and ranges
Can use one of center_date or (start, end, offset, duration_days) etc.
- Parameters
**kwargs –
- Keyword Arguments
format (str) – Format string for dates
center_date (str) – Date matching format to center dates around
stddev_days (float) – Standard deviation in days from center date
start (str) – start date string
end (str) – end date string
offset (int) – number of days to shift the duration, positive is back, negative is forward
duration_days (str) – number of days after start, default is 30
date_format_string (str) – format for parsing dates
- Return type
- Returns
supplier for dates
- datacraft.suppliers.decorated(supplier, **kwargs)
Creates a decorated supplier around the provided one
- Parameters
supplier (ValueSupplierInterface) – the supplier to alter
**kwargs –
- Keyword Arguments
prefix (str) – prefix to prepend to value, default is ‘’
suffix (str) – suffix to append to value, default is ‘’
quote (str) – string to both append and prepend to value, default is ‘’
- Return type
- Returns
the decorated supplier
Examples
>>> import datacraft
>>> nums = datacraft.suppliers.values([1, 2, 3, 4, 5])
>>> prefix_supplier = datacraft.suppliers.decorated(nums, prefix='you are number ')
>>> prefix_supplier.next(0)
you are number 1
>>> suffix_supplier = datacraft.suppliers.decorated(nums, suffix=' more minutes')
>>> suffix_supplier.next(0)
1 more minutes
>>> quoted_supplier = datacraft.suppliers.decorated(nums, quote='"')
>>> quoted_supplier.next(0)
"1"
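The decoration step itself is plain string assembly. A rough sketch, independent of datacraft, is given below. DecoratedSketch is a hypothetical class, and placing the quote outside the prefix and suffix is an assumption, since the description above does not state how the three options interact when combined.

```python
from typing import Any, List


class DecoratedSketch:
    """Hypothetical wrapper that adds prefix, suffix, and quote to values."""

    def __init__(self, values: List[Any], prefix: str = "",
                 suffix: str = "", quote: str = ""):
        self.values = values
        self.prefix = prefix
        self.suffix = suffix
        self.quote = quote

    def next(self, iteration: int) -> str:
        value = self.values[iteration % len(self.values)]
        # assumption: quote wraps the already prefixed/suffixed value
        return f"{self.quote}{self.prefix}{value}{self.suffix}{self.quote}"


nums = [1, 2, 3, 4, 5]
prefixed = DecoratedSketch(nums, prefix='you are number ').next(0)
suffixed = DecoratedSketch(nums, suffix=' more minutes').next(0)
quoted = DecoratedSketch(nums, quote='"').next(0)
```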
- datacraft.suppliers.distribution_supplier(distribution)
creates a ValueSupplier that uses the given distribution to generate values
- Parameters
distribution (Distribution) – to use
- Return type
- Returns
the value supplier
- datacraft.suppliers.from_list_of_suppliers(supplier_list, modulate_iteration=True)
Returns a supplier that rotates through the provided suppliers incrementally
- Parameters
supplier_list (List[ValueSupplierInterface]) – to rotate through
modulate_iteration (bool) – if the iteration number should be moded by the index of the supplier
- Return type
- Returns
a supplier for these suppliers
Examples
>>> import datacraft
>>> nice_pet_supplier = datacraft.suppliers.values(["dog", "cat", "hamster", "pig", "rabbit", "horse"])
>>> mean_pet_supplier = datacraft.suppliers.values(["alligator", "cobra", "mongoose", "killer bee"])
>>> pet_supplier = datacraft.suppliers.from_list_of_suppliers([nice_pet_supplier, mean_pet_supplier])
>>> pet_supplier.next(0)
'dog'
>>> pet_supplier.next(1)
'alligator'
- datacraft.suppliers.geo_lat(**kwargs)
configures geo latitude type
- Keyword Arguments
precision (int) – number of digits after decimal place
start_lat (int) – minimum value for latitude
end_lat (int) – maximum value for latitude
bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]
- Return type
- Returns
supplier for geo.lat type
- datacraft.suppliers.geo_long(**kwargs)
configures geo longitude type
- Keyword Arguments
precision (int) – number of digits after decimal place
start_long (int) – minimum value for longitude
end_long (int) – maximum value for longitude
bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]
- Return type
- Returns
supplier for geo.long type
- datacraft.suppliers.geo_pair(**kwargs)
Creates geo pair supplier
- Keyword Arguments
precision (int) – number of digits after decimal place
lat_first (bool) – if latitude should be populated before longitude
start_lat (int) – minimum value for latitude
end_lat (int) – maximum value for latitude
start_long (int) – minimum value for longitude
end_long (int) – maximum value for longitude
bbox (list) – list of size 4 with format: [min Longitude, min Latitude, max Longitude, max Latitude]
as_list (bool) – if the values should be returned as a list
join_with (str) – if the values should be joined with the provided string
- Returns
supplier for geo.pair type
- datacraft.suppliers.ip_precise(cidr, sample=False)
Creates a value supplier that produces precise ip address from the given cidr
- Parameters
cidr (str) – notation specifying ip range
sample (bool) – if the ip addresses should be sampled from the available set
- Return type
- Returns
supplier for precise ip addresses
Examples
>>> import datacraft
>>> ips = datacraft.suppliers.ip_precise(cidr="192.168.0.0/22", sample=False)
>>> ips.next(0)
'192.168.0.0'
>>> ips.next(1)
'192.168.0.1'
>>> ips = datacraft.suppliers.ip_precise(cidr="192.168.0.0/22", sample=True)
>>> ips.next(0)
'192.168.0.127'
>>> ips.next(1)
'192.168.0.196'
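To see which addresses a CIDR block such as the one above covers, the standard library ipaddress module can enumerate it directly. This sketch is independent of datacraft and only illustrates the address range behind the non-sampled example.

```python
import ipaddress

# a /22 leaves 10 host bits, so the block holds 2**10 addresses
network = ipaddress.ip_network("192.168.0.0/22")

first = str(network[0])
second = str(network[1])
last = str(network[-1])
total = network.num_addresses
```

The first two addresses match the values returned by the sequential (sample=False) example above.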
- datacraft.suppliers.ip_supplier(**kwargs)
Creates a value supplier for IPv4 addresses
- Keyword Arguments
base (str) – base of ip address, i.e. “192”, “10.” “100.100”, “192.168.”, “10.10.10”
cidr (str) – cidr to use only one /8 /16 or /24, i.e. “192.168.0.0/24”, “10.0.0.0/16”, “100.0.0.0/8”
- Return type
- Returns
supplier for ip addresses
- Raises
SpecException – if one of base or cidr is not provided
Examples
>>> import datacraft
>>> ips = datacraft.suppliers.ip_supplier(base="192.168.1")
>>> ips.next(0)
'192.168.1.144'
- datacraft.suppliers.list_count_sampler(data, **kwargs)
Samples N elements from data list based on config. If count is provided, each iteration exactly count elements will be returned. If only min is provided, between min and the total number of elements will be provided. If only max is provided, between one and max elements will be returned. Specifying both min and max will provide a sample containing a number of elements in this range.
- Parameters
data (list) – list to select subset from
- Keyword Arguments
count – number of elements in list to use
count_dist – count distribution to use
min – minimum number of values to return
max – maximum number of values to return
join_with – value to join values with, default is None
- Return type
- Returns
the supplier
Examples
>>> import datacraft
>>> pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
>>> pet_supplier = datacraft.suppliers.list_count_sampler(pet_list, min=2, max=5)
>>> pet_supplier.next(0)
['rabbit', 'cat', 'pig', 'cat']
>>> pet_supplier = datacraft.suppliers.list_count_sampler(pet_list, count_dist="normal(mean=2,stddev=1,min=1,max=3)")
>>> pet_supplier.next(0)
['pig', 'horse']
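The min/max rules described for list_count_sampler can be sketched with the standard library as follows. sample_between is a hypothetical helper; drawing with replacement is an assumption based on the repeated 'cat' in the example output above.

```python
import random
from typing import Any, List, Optional


def sample_between(data: List[Any],
                   minimum: Optional[int] = None,
                   maximum: Optional[int] = None,
                   rng: Optional[random.Random] = None) -> List[Any]:
    """Return a random number of elements, bounded by minimum and maximum."""
    rng = rng or random.Random()
    # only max given: between one and max; only min given: min up to len(data)
    low = 1 if minimum is None else minimum
    high = len(data) if maximum is None else maximum
    count = rng.randint(low, high)
    # assumption: elements are drawn with replacement
    return rng.choices(data, k=count)


pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
rng = random.Random(11)
bounded = [sample_between(pet_list, minimum=2, maximum=5, rng=rng)
           for _ in range(50)]
only_max = [sample_between(pet_list, maximum=3, rng=rng) for _ in range(50)]
```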
- datacraft.suppliers.list_stats_sampler(data, **kwargs)
sample from list (or string) with stats based params
- Parameters
data (Union[str, list]) – list to select subset from
- Keyword Arguments
mean (float) – mean number of items/characters to produce
stddev (float) – standard deviation from the mean
count (int) – number of elements in list/characters to use
count_dist (str) – count distribution to use
min (int) – minimum number of items/characters to return
max (int) – maximum number of items/characters to return
- Return type
- Returns
the supplier
Examples
>>> import datacraft
>>> pet_list = ["dog", "cat", "hamster", "pig", "rabbit", "horse"]
>>> pet_supplier = datacraft.suppliers.list_stats_sampler(pet_list, mean=2, stddev=1)
>>> new_pets = pet_supplier.next(0)
>>> char_config = {"min": 2, "mean": 4, "max": 8}
>>> char_supplier = datacraft.suppliers.list_stats_sampler("#!@#$%^&*()_-~", min=2, mean=4, max=8)
>>> two_to_eight_chars = char_supplier.next(0)
- datacraft.suppliers.list_values(data, **kwargs)
creates a Value supplier for the list of provided data
- Parameters
data (list) – for the supplier
- Keyword Arguments
as_list (bool) – if data should be returned as a list
sample (bool) – if the data should be sampled instead of iterated through incrementally
count – constant, list, or weighted map
count_dist (str) – distribution in named param function style format
- Return type
- Returns
the ValueSupplierInterface for the data list
- datacraft.suppliers.mac_address(delimiter=None)
Creates a value supplier that produces mac addresses
- Parameters
delimiter (Optional[str]) – how mac address pieces are separated, default is ‘:’
- Return type
- Returns
supplier for mac addresses
Examples
>>> import datacraft
>>> macs = datacraft.suppliers.mac_address()
>>> macs.next(0)
'1E:D4:0F:59:41:FA'
>>> macs = datacraft.suppliers.mac_address('-')
>>> macs.next(0)
'4D-93-36-59-BD-09'
- datacraft.suppliers.random_range(start, end, precision=None, count=1)
Creates a random range supplier for the start and end parameters with the given precision (number of decimal places)
- Parameters
start (Union[str, int, float]) – of range
end (Union[str, int, float]) – of range
precision (Union[str, int, float, None]) – number of decimal points to keep
count (Union[int, List[int], Dict[str, float], Distribution]) – number of elements to return, default is one
- Return type
- Returns
the value supplier for the range
Examples
>>> import datacraft
>>> num_supplier = datacraft.suppliers.random_range(5, 25, precision=3)
>>> # should be between 5 and 25 with 3 decimal places
>>> num_supplier.next(0)
8.377
- datacraft.suppliers.range_supplier(start, end, step=1, **kwargs)
Creates a Value Supplier for given range of data
- Parameters
start (Union[int, float]) – start of range
end (Union[int, float]) – end of range
step (Union[int, float]) – of range values
- Keyword Arguments
precision (int) – Number of decimal places to use, in case of floating point range
- Returns
supplier to supply ranges of values with
- datacraft.suppliers.resettable(iterator)
Wraps a ResettableIterator to supply values from
- Parameters
iterator (ResettableIterator) – iterator with reset() method
- Returns
supplier to supply generated values with
- datacraft.suppliers.select_list_subset(data, **kwargs)
Creates a supplier that selects elements from the data list based on the supplier kwargs
- Parameters
data (
list
) – list of data values to supply values from- Keyword Arguments
mean (float) – mean number of values to include in list
stddev (float) – standard deviation from the mean
- Returns
supplier to supply subsets of data list
- datacraft.suppliers.templated(supplier_map, template_str)
Creates a supplier that populates the template string from the supplier map
- Parameters
supplier_map (Dict[str, ValueSupplierInterface]) – map of field name -> value supplier for it
template_str – templated string to populate
- Return type
- Returns
value supplier for template
Examples
>>> from datacraft import suppliers
>>> char_to_num_supplier = {
...     'char': suppliers.values(['a', 'b', 'c']),
...     'num': suppliers.values([1, 2, 3])
... }
>>> letter_number_template = 'letter {{ char }}, number {{ num }}'
>>> supplier = suppliers.templated(char_to_num_supplier, letter_number_template)
>>> supplier.next(0)
'letter a, number 1'
- datacraft.suppliers.unicode_range(data, **kwargs)
Creates a unicode supplier for single or multiple unicode ranges
- Parameters
data – list of unicode ranges to sample from
- Keyword Arguments
mean (float) – mean number of values to produce
stddev (float) – standard deviation from the mean
count (int) – number of unicode characters to produce
count_dist (str) – count distribution to use
min (int) – minimum number of characters to return
max (int) – maximum number of characters to return
as_list (bool) – if the results should be returned as a list
join_with (str) – value to join values with, default is ‘’
- Returns
supplier to supply subsets of data list
- datacraft.suppliers.uuid(variant=None)
Creates a UUID Value Supplier
- Parameters
variant (Optional[int]) – of uuid to use, default is 4
- Return type
- Returns
supplier to supply uuids with
- datacraft.suppliers.values(spec, **kwargs)
Based on the data, returns the appropriate values supplier. The data can be a spec, constant, list, or dict, or just the raw data.
- Parameters
spec (Any) – to load values from, or raw data itself
**kwargs – extra kwargs to add to config
- Keyword Arguments
as_list (bool) – if data should be returned as a list
sample (bool) – if the data should be sampled instead of iterated through incrementally
count – constant, list, or weighted map
count_dist (str) – distribution in named param function style format
- Return type
- Returns
the values supplier for the spec
Examples
>>> import datacraft
>>> raw_spec = {"type": "values", "data": [1,2,3,5,8,13]}
>>> fib_supplier = datacraft.suppliers.values(raw_spec)
>>> fib_supplier = datacraft.suppliers.values([1,2,3,5,8,13])
>>> fib_supplier.next(0)
1
>>> weights = {"1": 0.1, "2": 0.2, "3": 0.1, "4": 0.2, "5": 0.1, "6": 0.2, "7": 0.1}
>>> mostly_even_supplier = datacraft.suppliers.values(weights)
>>> mostly_even_supplier.next(0)
'4'
- datacraft.suppliers.weighted_values(data, config=None)
Creates a weighted value supplier from the data, which is a mapping of value to the weight it should represent.
- Parameters
data (dict) – for the supplier
config (Optional[dict]) – optional config (Default value = None)
- Return type
- Returns
the supplier
- Raises
SpecException – if data is empty
Examples
>>> import datacraft
>>> pets = {"dog": 0.5, "cat": 0.2, "bunny": 0.1, "hamster": 0.1, "pig": 0.05, "snake": 0.04, "_NULL_": 0.01}
>>> weighted_pet_supplier = datacraft.suppliers.weighted_values(pets)
>>> most_likely_a_dog = weighted_pet_supplier.next(0)
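Weighted selection of this kind can be reproduced with random.choices from the standard library. The sketch below is not datacraft's implementation: weighted_pick is a hypothetical helper, and it raises ValueError where the documented function raises SpecException.

```python
import random
from collections import Counter
from typing import Any, Dict


def weighted_pick(weights: Dict[Any, float], rng: random.Random) -> Any:
    """Pick one key with probability proportional to its weight."""
    if not weights:
        raise ValueError("data must not be empty")
    keys = list(weights)
    return rng.choices(keys, weights=[weights[key] for key in keys], k=1)[0]


pets = {"dog": 0.5, "cat": 0.2, "bunny": 0.1, "hamster": 0.1,
        "pig": 0.05, "snake": 0.05}
rng = random.Random(42)
draws = Counter(weighted_pick(pets, rng) for _ in range(1000))
most_common_pet = draws.most_common(1)[0][0]
```

Over many draws the observed frequencies approach the configured weights, so the heaviest key dominates the counts.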
Builder Module
Module for parsing and helper functions for specs
Examples
>>> import datacraft
>>> raw_spec = {
... 'name': {'type': 'values', 'data': ['ann', 'bob', 'carl']},
... 'age': {'type': 'rand_int_range', 'data': [22, 47]}
... }
>>> spec = datacraft.parse_spec(raw_spec)
>>> type(spec)
DataSpec
- datacraft.builder.entries(raw_spec, iterations, **kwargs)
Creates n entries from the provided spec
- Parameters
raw_spec (Dict[str, Dict]) – to create generator for
iterations (int) – number of iterations before max
- Keyword Arguments
processor – (RecordProcessor): For any Record Level transformations such as templating or formatters
output – (OutputHandlerInterface): For any field or record level output
data_dir (str) – path to the data directory with csv files and such
enforce_schema (bool) – If schema validation should be applied where possible
- Return type
List[dict]
- Returns
the list of N entries
- datacraft.builder.generator(raw_spec, iterations, **kwargs)
Creates a generator for the raw spec for the specified iterations
- Parameters
raw_spec (Dict[str, Dict]) – to create generator for
iterations (int) – number of iterations before max
- Keyword Arguments
processor – (RecordProcessor): For any Record Level transformations such as templating or formatters
output – (OutputHandlerInterface): For any field or record level output
data_dir (str) – path to the data directory with csv files and such
enforce_schema (bool) – If schema validation should be applied where possible
- Yields
Records or rendered template strings
- Return type
Generator
- Returns
the generator for the provided spec
Outputs Module
Module holds output related classes and functions
- class datacraft.outputs.WriterInterface
Interface for classes that write the generated values out
- abstract write(value)
Write the value to the configured output destination
- Parameters
value – to write
- datacraft.outputs.file_name_engine(prefix, extension)
creates a templating engine that will produce file names based on the count
- Parameters
prefix (str) – prefix for file name
extension (str) – suffix for file name
- Return type
- Returns
template engine for producing file names
- datacraft.outputs.get_writer(outdir=None, outfile=None, overwrite=False, **kwargs)
creates the appropriate output writer from the given args and params
If no output directory is specified/configured will write to stdout
- Parameters
outdir (Optional[str]) – Directory to write output to
outfile (Optional[str]) – If a specific file should be used for the output, default is to construct the name from kwargs
overwrite (bool) – Should existing files with the same name be overwritten
- Keyword Arguments
outfile_prefix – the prefix of the output files i.e. test-data-
extension – to append to the file name prefix i.e. .csv
suppress_output – if output to stdout should be suppressed, only valid if outdir is None
- Return type
- Returns
The configured Writer
Examples
>>> import datacraft
>>> csv_writer = datacraft.outputs.get_writer('./output', outfile_prefix='test-data-', extension='.csv')
- datacraft.outputs.incrementing_file_writer(outdir, engine)
Creates a WriterInterface that increments the count in the file name once records_per_file have been written
- Parameters
outdir (str) – output directory
engine (RecordProcessor) – to generate file names with
- Return type
- Returns
a Writer that increments a count in the file name
- datacraft.outputs.processor(template=None, format_name=None)
Configures the record level processor for either the template or for the format_name
- Parameters
template (Union[str, Path, None]) – path to template or template as string
format_name (Optional[str]) – one of the valid registered formatter names
- Return type
Optional[RecordProcessor]
- Returns
RecordProcessor if a valid template or format_name is provided, None otherwise
- Raises
SpecException – when format_name is not registered or if both template and format specified
Examples
>>> import datacraft
>>> engine = datacraft.outputs.processor(template='/path/to/template.jinja')
>>> engine = datacraft.outputs.processor(template='Inline: {{ variable }}')
>>> formatter = datacraft.outputs.processor(format_name='json')
>>> formatter = datacraft.outputs.processor(format_name='my_custom_registered_format')
- datacraft.outputs.record_level(record_processor, writer, records_per_file=1)
Creates an OutputHandler for record level events
- Parameters
record_processor (RecordProcessor) – to process the records into strings
writer (WriterInterface) – to write the processed records
records_per_file (int) – number of records to accumulate before writing
- Return type
- Returns
OutputHandlerInterface
- datacraft.outputs.single_field(writer, output_key)
Creates an OutputHandler for field level events
- Parameters
writer (WriterInterface) – to write the processed records
output_key (bool) – if the key should be output along with the value
- Returns
OutputHandlerInterface
- datacraft.outputs.single_file_writer(outdir, outname, overwrite)
Creates a Writer for a single output file
- Parameters
outdir (str) – output directory
outname (str) – output file name
overwrite (bool) – if existing output files should be overwritten
- Return type
- Returns
Writer for a single file
- datacraft.outputs.stdout_writer()
Creates a WriterInterface that writes results to stdout
- Return type
- Returns
writer that writes to stdout
- datacraft.outputs.suppress_output_writer()
Returns a writer that suppresses the output to stdout
- Return type
Template Engines
Handles loading and creating the templating engine
- datacraft.template_engines.for_file(template_file)
Loads the templating engine for the template file specified
- Parameters
template_file (Union[str, Path]) – to fill in, string or Path
- Return type
- Returns
the templating engine
- datacraft.template_engines.string(template)
Returns a template engine for processing templates as strings
- Return type
Spec Formatters
Module with functions that handle formatting specs in an orderly and consistent structure i.e:
{
"type": "<type name>",
"data": "data stuff",
"refs": "refs pointers",
"config": {
"key": "value..."
}
}
References
JSON Custom formatting https://stackoverflow.com/questions/13249415/how-to-implement-custom-indentation-when-pretty-printing-with-the-json-module
YAML custom formatting from https://til.simonwillison.net/python/style-yaml-dump via: https://stackoverflow.com/a/8641732 and https://stackoverflow.com/a/16782282
- datacraft.spec_formatters.format_json(raw_spec)
Formats the raw_spec as ordered dictionary in JSON
- Parameters
raw_spec (dict) – to format
- Return type
str
- Returns
the ordered and formatted JSON string
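The key ordering shown at the top of this module (type, data, refs, config) can be approximated with the standard json module. This is a simplified sketch, not the library's formatter: order_field_spec and format_spec_sketch are hypothetical helpers, and the custom indentation tricks from the references above are left out.

```python
import json
from typing import Any, Dict

# preferred key order for a field spec, taken from the structure shown above
KEY_ORDER = ["type", "data", "refs", "config"]


def order_field_spec(field_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with known keys first, in the preferred order."""
    ordered = {key: field_spec[key] for key in KEY_ORDER if key in field_spec}
    # keep any remaining keys after the known ones
    for key, value in field_spec.items():
        if key not in ordered:
            ordered[key] = value
    return ordered


def format_spec_sketch(raw_spec: Dict[str, Any]) -> str:
    """Format every dict-valued field spec with ordered keys as JSON."""
    ordered = {name: order_field_spec(spec) if isinstance(spec, dict) else spec
               for name, spec in raw_spec.items()}
    return json.dumps(ordered, indent=2)


raw_spec = {
    "name": {"config": {"sample": True}, "data": ["ann", "bob"], "type": "values"}
}
formatted = format_spec_sketch(raw_spec)
key_order = list(json.loads(formatted)["name"].keys())
```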
- datacraft.spec_formatters.format_yaml(raw_spec)
Formats the raw_spec as ordered dictionary in YAML
- Parameters
raw_spec (dict) – to format
- Return type
str
- Returns
the ordered and formatted YAML string