Core Types
These are the built-in Field Spec types, organized by the kind of data they generate or by their function or utility.
Strings
For generating strings in various formats
char_class
A char_class type is used to create strings that are made up of characters from specific character classes. The strings can be of fixed or variable length. There are several built-in character classes. You can also provide your own set of characters to sample from. Below is the list of supported character classes:
Built In Classes
class | description
---|---
ascii | All valid ascii characters including control
lower | ascii lowercase
upper | ascii uppercase
digits | Numbers 0 through 9
letters | lowercase and uppercase letters
word | letters + digits + '_'
printable | All printable ascii chars including whitespace
visible | All printable ascii chars excluding whitespace
punctuation | locale specific punctuation
special | locale specific punctuation
hex | Hexadecimal digits including upper and lower case a-f
hex-lower | Hexadecimal digits including only lower case a-f
hex-upper | Hexadecimal digits including only upper case A-F
Prototype:
{
"<field name>": {
"type": "char_class",
"data": <char_class_name>,
or
"type": "cc-<char_class_name>",
or
"type": "char_class",
"data": <string with custom set of characters to sample from>,
or
"type": "char_class",
"data": [<char_class_name1>, <char_class_name2>, ..., <custom characters>],
"config": {
"exclude": <string of characters to exclude from output>,
"escape": <string of characters to escape in output e.g. " -> \\", useful for non JSON output>,
"escape_str": <string to use for escaping, default is \>,
"min": <min number of characters in string>,
"max": <max number of characters in string>,
or
"count": <exact number of characters in string>,
or
"mean": <mean number of characters in string>,
"stddev": <standard deviation from mean for number of characters in string>,
"min": <optional min>,
"max": <optional max>
}
}
}
Examples:
{
"password": {
"type": "char_class",
"data": ["word", "special", "hex-lower", "M4$p3c!@l$@uc3"],
"config": {
"mean": 14,
"stddev": 2,
"min": 10,
"max": 18,
"exclude": ["-", "\""]
}
}
}
$ datacraft -s spec.json -i 4 -r 1 -x -l off --format json
{"password": "c1hbR&V!sYi4+Em"}
{"password": "Z7Qd0AM>$f7'"}
{"password": "9Bh8Z%6?ed4g"}
{"password": "sqQ&I!Ucdp"}
{
"one_to_five_digits:cc-digits?min=1&max=5": {}
}
$ datacraft -s spec.json -i 4 -r 1 -x -l off --format json
{"one_to_five_digits": "43040"}
{"one_to_five_digits": "5"}
{"one_to_five_digits": "6914"}
{"one_to_five_digits": "752"}
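The sizing and pooling behavior above can be pictured with a small Python sketch. This is illustrative only, not datacraft's implementation: the named classes and custom characters form one pool, exclude removes characters from it, and mean/stddev/min/max control the string length.

```python
import random
import string

# Illustrative sketch of char_class behavior, not datacraft internals.
# Named classes plus custom characters form a single pool to sample from.
pool = string.ascii_letters + string.digits + "_" + "abcdef" + "M4$p3c!@l$@uc3"
pool = "".join(c for c in pool if c not in '-"')  # the "exclude" config

def char_class_value(mean=14, stddev=2, min_len=10, max_len=18):
    # Draw a length from a gaussian, clamped to [min_len, max_len].
    size = max(min_len, min(max_len, int(random.gauss(mean, stddev))))
    return "".join(random.choice(pool) for _ in range(size))

print(char_class_value())
```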
unicode_range
Generates strings from unicode ranges
Prototype:
{
"<field>": {
"type": "unicode_range",
"data": [<start_code_point_in_hex>, <end_code_point_in_hex>],
or
"data": [
[<start_code_point_in_hex>, <end_code_point_in_hex>],
[<start_code_point_in_hex>, <end_code_point_in_hex>],
...
[<start_code_point_in_hex>, <end_code_point_in_hex>]
],
"config": {
# String Size Based Config Parameters
"min": <min number of characters in string>,
"max": <max number of characters in string>,
or
"count": <exact number of characters in string>,
or
"mean": <mean number of characters in string>,
"stddev": <standard deviation from mean for number of characters in string>,
"min": <optional min>,
"max": <optional max>
}
}
}
Examples:
{
"text": {
"type": "unicode_range",
"data": ["3040", "309f"],
"config": {
"mean": 5
}
}
}
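Conceptually, each character is a code point drawn from the configured hex range. A minimal Python sketch of the idea (3040-309f is the Hiragana block used in the example above):

```python
import random

# Sketch: build a string by drawing code points uniformly from a hex range.
def unicode_range_value(start_hex, end_hex, size):
    lo, hi = int(start_hex, 16), int(end_hex, 16)
    return "".join(chr(random.randint(lo, hi)) for _ in range(size))

# 3040-309f is the Hiragana block from the example spec
print(unicode_range_value("3040", "309f", 5))
```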
uuid
A standard uuid
Prototype:
{
"<field name>": {
"type": "uuid",
"config": {
"variant": 1, 3, 4, or 5, default is 4, optional
}
}
}
Examples:
{
"id": {
"type": "uuid"
},
"id_shorthand:uuid": {},
"id_variant3": {
"type": "uuid",
"config": {
"variant": 3
}
}
}
Numeric
For generating numeric values in different ways.
range_suppliers
There are two main range types: sequential and random. A sequential range is specified using the range type. A random one uses the rand_range type.
range
Prototype:
{
"<field name>": {
"type": "range",
"data": [<start>, <end>, <step> (optional)],
or
"data": [
[<start>, <end>, <step> (optional)],
[<start>, <end>, <step> (optional)],
...
[<start>, <end>, <step> (optional)],
],
}
}
start: (Union[int, float]) - start of range
end: (Union[int, float]) - end of range
step: (Union[int, float]) - step for range, default is 1
Examples:
{
"zero_to_ten_step_half": {
"type": "range",
"data": [0, 10, 0.5]
}
}
{
"range_shorthand1:range": {
"data": [0, 10, 0.5]
}
}
{"range_shorthand2:range": [0, 10, 0.5]}
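Since Python's built-in range() only handles integers, the float-step behavior of the specs above can be sketched like this (treating the end value as inclusive when the step lands on it exactly, which is an assumption for illustration):

```python
# Sketch of a sequential range with a float step, as in data [0, 10, 0.5].
# Assumes the end value is included when the step lands on it exactly.
def float_range(start, end, step=1):
    values, current = [], start
    while current <= end:
        values.append(current)
        current += step
    return values

print(float_range(0, 10, 0.5))  # 0, 0.5, 1.0, ... up to 10.0
```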
rand_range
Generates a random floating point number in the given range. Use the rand_int_range type as a shortcut for casting the numbers as integers.
Prototype:
{
"<field name>": {
"type": "rand_range",
"data": [<upper>],
or
"data": [<lower>, <upper>],
or
"data": [<lower>, <upper>, <precision> (optional)]
}
}
upper: (Union[int, float]) - upper limit of random range
lower: (Union[int, float]) - lower limit of random range
precision: (int) - Number of digits after decimal point
Examples:
{
"zero_to_ten_three_decimals": {
"type": "rand_range",
"data": [0, 10, 3]
}
}
{
"int_in_range": {
"type": "rand_int_range",
"data": [1, 100]
}
}
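A rand_range draw is essentially a uniform sample rounded to the requested precision; a quick sketch of the idea, not the library's code:

```python
import random

# Sketch of rand_range: uniform draw between lower and upper, rounded to
# the requested number of digits after the decimal point.
def rand_range_value(lower, upper, precision=None):
    value = random.uniform(lower, upper)
    return round(value, precision) if precision is not None else value

print(rand_range_value(0, 10, 3))
```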
distribution
A distribution spec can be built from one of the registered distribution types. Below is a table of the built-in ones. Custom distributions can be registered using Custom Code Loading. See Custom Count Distributions for an example.
distribution | required arguments | optional args | examples
---|---|---|---
uniform | start,end | | "uniform(start=10, end=30)" or "uniform(start=1, end=3)"
gauss | mean,stddev | min,max | "gauss(mean=2, stddev=1)"
gaussian | mean,stddev | min,max | "gaussian(mean=7, stddev=1, min=4)"
normal | mean,stddev | min,max | "normal(mean=25, stddev=10, max=40)"
Prototype:
{
"<field name>": {
"type": "distribution",
"data": "<dist func name>(<param1>=<val1>, ..., <paramN>=<valN>)"
}
}
Examples:
{
"values": {
"type": "distribution",
"data": "uniform(start=10, end=30)"
}
}
{
"age": {
"type": "distribution",
"data": "normal(mean=28, stddev=10, min=18, max=40)",
"config": {"cast": "int"}
}
}
{
"pressure": {
"type": "distribution",
"data": "gauss(mean=33, stddev=3.4756535)",
"config": {
"count_dist": "normal(mean=2, stddev=1, min=1, max=4)",
"as_list": true
}
}
}
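The age example above can be pictured with a short sketch. Clamping to min/max is one simple way to honor those bounds; the library may handle out-of-range draws differently.

```python
import random

# Sketch of "normal(mean=28, stddev=10, min=18, max=40)" with cast=int.
# Clamping is one simple way to enforce min/max on a gaussian draw.
def age_value():
    value = random.gauss(28, 10)
    return int(max(18, min(40, value)))

print(age_value())
```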
A distribution type field with a uniform distribution will produce values similar to a rand_range field. With rand_range it is easier to specify a specific number of decimal places to keep. To do this for the distribution type, you need to make use of the cast config with a roundN caster. See the example below.
{
"values1": {
"type": "rand_range",
"data": [10, 30, 4]
},
"values2": {
"type": "distribution",
"data": "uniform(start=10, end=30)",
"config": {
"cast": "round4"
}
},
"values3:rand_range": [10, 30, 4],
"values4:distribution?cast=round4": "uniform(start=10, end=30)"
}
$ datacraft -s spec.json -i2 --log-level off --printkey
values1 -> 29.7907
values2 -> 18.9114
values3 -> 13.5495
values4 -> 15.5935
values1 -> 22.0634
values2 -> 17.8552
values3 -> 22.982
values4 -> 20.5616
iteration
An iteration or rownum spec is used to populate the record number that is being generated. By default the offset is set to 1. To get zero based indexes for iteration, set the offset config parameter to 0.
Prototype:
{
"<field name>": {
"type": "iteration",
OR
"type": "rownum",
"config": {
"offset": N
}
}
}
Examples:
{
"id": {
"type": "iteration"
}
}
{
"id": {
"type": "rownum",
"config": { "offset": 0 }
}
}
$ datacraft -s iteration.json -i 3 -t 'ID: {{ id | safe }}' -l off
ID: 1
ID: 2
ID: 3
Date & Time
For generating dates and timestamps in a variety of formats
date
A Date Field Spec is used to generate date strings. The default format is day-month-year, i.e. Christmas 2050 would be: 25-12-2050. There is also a date.iso type that generates ISO8601 formatted date strings without microseconds and a date.iso.us type that generates them with microseconds. There are also date.epoch, date.epoch.ms, and date.epoch.millis types. These are for generating unix epoch timestamps. We use the format specification from the datetime module.
type | example output
---|---
date | 18-11-2050
date.iso | 2050-12-01T01:44:35Z
date.iso.ms | 2050-12-01T05:11:20.543Z
date.iso.millis | 2050-12-01T05:11:20.543Z
date.iso.us | 2050-12-01T06:19:02.752373Z
date.iso.micros | 2050-12-01T06:17:05.487878Z
date.epoch | 1669825519
date.epoch.ms | 1668624934547
date.epoch.millis | 1669166880466
Uniformly Sampled Dates
The default strategy is to create random dates within a 30 day range, where the start date is today. You can use the start parameter to set a specific start date for the dates. You can also explicitly specify an end date. The start and end parameters should conform to the specified date format, or the default if none is provided. The offset parameter can be used to shift the dates by a specified number of days. A positive offset will shift the start date back. A negative offset will shift the date forward. The duration_days parameter can be used to specify the number of days that should be covered in the date range, instead of the default 30 days. This parameter is usually specified as an integer constant.
start end (default start + 30 days)
|--------------------------------|
|+offset| start+duration_days
|--------------------------------|
|-offset|
|--------------------------------|
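The uniform strategy sketched in Python terms: pick a random instant between the start date and start plus duration_days, then format it (illustrative only, not the library's code):

```python
import random
from datetime import datetime, timedelta

# Sketch of uniform date sampling: a random instant in
# [start, start + duration_days), formatted with the configured format.
def uniform_date(start, duration_days=30, fmt="%d-%m-%Y"):
    offset_seconds = random.uniform(0, duration_days * 86400)
    return (start + timedelta(seconds=offset_seconds)).strftime(fmt)

print(uniform_date(datetime(2050, 12, 15)))
```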
Dates Distributed around a Center Point
An alternative strategy is to specify a center_date parameter with an optional stddev_days. This will create a normal or gaussian distribution of dates around the center point.
|
|
| | |
| | | | |
| | | | | | |
| | | | | | | | | | | | |
|-------------------------------------|
| | stddev | stddev | |
center
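In Python terms, the center point strategy amounts to adding a gaussian day offset to the center date (a sketch, not the library's code):

```python
import random
from datetime import datetime, timedelta

# Sketch of center_date with stddev_days: gaussian day offsets cluster the
# generated dates around the center point.
def centered_date(center, stddev_days=2, fmt="%Y%m%d %H:%M"):
    offset_days = random.gauss(0, stddev_days)
    return (center + timedelta(days=offset_days)).strftime(fmt)

print(centered_date(datetime(2050, 6, 1, 12, 0)))
```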
Restricting Hours
If you want your generated dates to be restricted to certain hours of the day, provide the hours config param. The value of this parameter can be any type of Field Spec that produces valid integers in the range of 0 to 23. See examples below.
Prototype:
{
"<field name>": {
"type": "date",
OR,
"type": "date.iso",
OR,
"type": "date.iso.ms",
OR,
"type": "date.iso.millis",
OR,
"type": "date.iso.us",
OR,
"type": "date.iso.micros",
"data": "replacement for config.format, valid for type: date only",
"config": {
"format": "Valid datetime format string",
"duration_days": "The number of days from the start date to create date strings for",
"start": "date string matching format or default format to use for start date",
"end": "date string matching format or default format to use for end date",
"offset": "number of days to shift base date by, positive means shift backwards, negative means forward",
"center_date": "date string matching format or default format to use for center date",
"stddev_days": "The standard deviation in days from the center date that dates should be distributed",
"hours": "spec describing how the hours should be populated, i.e. only between 9am and 5pm"
}
}
}
Examples:
Dates that start on 15 Dec 2050 and span a 90 day period
{
"dates": {
"type": "date",
"config": {
"duration_days": "90",
"start": "15-Dec-2050 12:00",
"format": "%d-%b-%Y %H:%M"
}
}
}
Dates centered on 01 Jun 2050 with a standard deviation of +-2 days
{
"dates": {
"type": "date",
"config": {
"center_date": "20500601 12:00",
"format": "%Y%m%d %H:%M",
"stddev_days": "2"
}
}
}
ISO Date Centered at 1 Jun 2050, with weighted hours of the day
{
"start_time": {
"type": "date.iso",
"config": {
"center_date": "2050-06-01T12:00:00Z",
"hours": { "type": "values", "data": { "7": 0.1, "8": 0.2, "9": 0.4, "10": 0.2, "11": 0.1 } }
}
}
}
Epoch Date with milliseconds 14 days in the future with a 7 day window for timestamps
{
"start_time": {
"type": "date.epoch.ms",
"config": {
"offset": -14,
"duration_days": 7
}
}
}
Date format in data element using shorthand notation
{
"start_time:date": "%d-%b-%Y %H:%M"
}
Equivalent to
{
"start_time": {
"type": "date",
"data": "%d-%b-%Y %H:%M"
}
}
.now Variations
All date-type variations support a .now extension, allowing you to generate the current date and time in different formats based on your specific needs. These formats can include human-readable strings, epoch timestamps in various precisions, or ISO standard formats. The flexibility of the .now variations ensures that your data can align with different system requirements.
For example, using the .now extension with a specific format string will generate the current date and time as follows:
{
"event_date": {
"type": "date.now",
"data": "%d-%b-%Y %H:%M:%S"
}
}
This produces output like 15-Sep-2044 10:35:20, which is useful for generating consistent, formatted timestamps.
Available .now Variations:
Each of the following .now types generates the current date and time in a specific format:

type | description
---|---
date.now | Outputs the current date in a human-readable string, supports custom formats.
date.epoch.now | Generates the current Unix timestamp (seconds since 1 January 1970).
date.epoch.ms.now | Returns the Unix timestamp with millisecond precision.
date.epoch.millis.now | Alias for date.epoch.ms.now
date.iso.now | Produces the current date and time in ISO 8601 format.
date.iso.us.now | Provides the ISO 8601 format with microsecond precision.
date.iso.micros.now | Alias for date.iso.us.now
date.iso.ms.now | Outputs the ISO 8601 format with millisecond precision.
date.iso.millis.now | Alias for date.iso.ms.now
These variations work well when using the --server command line option to serve up the data over REST.
Geographic
For generating basic decimal degrees of latitude and longitude
geo types
There are three main geo types: geo.lat, geo.long, and geo.pair. The defaults will create decimal string values in the valid ranges: -90 to 90 for latitude and -180 to 180 for longitude. You can bound the ranges in several ways. The first is with the start_lat, end_lat, start_long, and end_long config params. These will set the individual bounds for each of the segments. You can use one or more of them. The other mechanism is to define a bbox array, which consists of the lower left geo point and the upper right one.
type | param | description
---|---|---
all | precision | number of decimal places for lat or long, default is 4
all | bbox | array of [min Longitude, min Latitude, max Longitude, max Latitude]
geo.lat | start_lat | lower bound for latitude
geo.lat | end_lat | upper bound for latitude
geo.long | start_long | lower bound for longitude
geo.long | end_long | upper bound for longitude
geo.pair | join_with | delimiter to join long and lat with, default is comma
geo.pair | as_list | One of yes, true, or on if the pair should be returned as a list instead of as a joined string
geo.pair | lat_first | if latitude should be first in the generated pair, default is longitude first
geo.pair | start_lat | lower bound for latitude
geo.pair | end_lat | upper bound for latitude
geo.pair | start_long | lower bound for longitude
geo.pair | end_long | upper bound for longitude
Prototype:
{
"<field name>": {
"type": "geo.lat",
or
"type": "geo.long",
or
"type": "geo.pair",
"config": {
"key": Any
}
}
}
Examples:
{
"egypt": {
"type": "geo.pair",
"config": {
"bbox": [
31.33134,
22.03795,
34.19295,
25.00562
],
"precision": 3
}
}
}
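The bbox mechanics can be sketched as two uniform draws inside the box, rounded to the configured precision (longitude first by default); this is an illustration, not datacraft's code:

```python
import random

# Sketch of geo.pair with a bbox of [min Long, min Lat, max Long, max Lat];
# values are rounded to the configured precision and joined with a comma.
def geo_pair(bbox, precision=3):
    min_long, min_lat, max_long, max_lat = bbox
    longitude = round(random.uniform(min_long, max_long), precision)
    latitude = round(random.uniform(min_lat, max_lat), precision)
    return f"{longitude},{latitude}"  # longitude first by default

print(geo_pair([31.33134, 22.03795, 34.19295, 25.00562]))
```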
Network
Network related types
ip/ipv4
IP addresses can be generated using CIDR notation or by specifying a base. For example, if you wanted to generate ips in the 10.0.0.0 to 10.0.0.255 range, you could either specify a cidr param of 10.0.0.0/24 or a base param of 10.0.0.
Prototype:
{
"<field name>": {
"type": "ipv4",
"config": {
"cidr": "<cidr value /8 /16 /24 only>",
OR
"base": "<beginning of ip i.e. 10.0>"
}
}
}
Examples:
{
"network": {
"type": "ipv4",
"config": {
"cidr": "2.22.222.0/16"
}
},
"network_shorthand:ip?cidr=2.22.222.0/16": {},
"network_with_base:ip?base=192.168.0": {}
}
ip.precise
The default ip type only supports cidr masks of /8, /16, and /24. If you want more precise ip ranges you need to use the ip.precise type. This type requires a cidr as the single config param. The default mode for ip.precise is to increment the ip addresses. Set the config param sample to one of true, on, or yes to enable random ip addresses selected from the generated ranges.
Prototype:
{
"<field name>": {
"type": "ip.precise",
"config": {
"cidr": "<valid cidr value>"
}
}
}
Examples:
{
"network": {
"type": "ip.precise",
"config": {
"cidr": "192.168.0.0/14",
"sample": "true"
}
}
}
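The stdlib ipaddress module makes it easy to see what a cidr covers, and sampling from the block, as the sample param does, is then just a random pick. A sketch, not datacraft's code:

```python
import ipaddress
import random

# A /14 mask leaves 18 host bits, so the block covers 2**18 addresses
network = ipaddress.ip_network("192.168.0.0/14")
print(network.num_addresses)  # 262144

# Sketch of sample mode: pick a random address inside the block
offset = random.randrange(network.num_addresses)
address = ipaddress.ip_address(int(network.network_address) + offset)
print(address)
```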
net.mac
For creating MAC addresses
Prototype:
{
"<field name>": {
"type": "net.mac",
"config": {
"dashes": "If dashes should be used as the separator one of on, yes, 'true', or True"
}
}
}
Examples:
{
"network": {
"type": "net.mac"
}
}
{
"network": {
"type": "net.mac",
"config": {
"dashes": "true"
}
}
}
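A MAC address is six octets rendered as hex. A minimal sketch of the two separator modes (the hex casing here is illustrative; the library's output casing may differ):

```python
import random

# Sketch: six random octets as two hex digits each, ":" separated by
# default, "-" separated when dashes is enabled.
def mac_address(dashes=False):
    sep = "-" if dashes else ":"
    return sep.join(f"{random.randrange(256):02X}" for _ in range(6))

print(mac_address())
print(mac_address(dashes=True))
```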
Utility/Common
Common types or types that are used in a utility capacity.
values
There are three types of values specs: Constants, List, and Weighted. Values specs have a shorthand notation where the value of the data element replaces the full spec. See examples below.
Prototype:
{
"<field_name>": {
"type": "values",
"data": Union[str, bool, int, float, list, dict],
"config": {
"key": "value"
}
}
}
Examples:
{"field_constant": {"type": "values", "data": 42}}
{"field_list": {"type": "values", "data": [1, 2, 3, 5, 8, 13]}}
{"field_weighted": {"type": "values", "data": {"200": 0.6, "404": 0.1, "303": 0.3}}}
{"field_weighted_with_null": {"type": "values", "data": {"200": 0.5, "404": 0.1, "303": 0.3, "_NULL_": 0.1}}}
{"shorthand_field_constant": 42}
{"shorthand_field_list": [1, 2, 3, 5, 8, 13]}
{"shorthand_field_weighted": {"200": 0.6, "404": 0.1, "303": 0.3}}
{
"short_hand_field_weighted_with_null": {
"type": "values",
"data": {"200": 0.5, "404": 0.1, "303": 0.3, "_NONE_": 0.1}
}
}
$ datacraft -s spec.json -i 3 -r 1 --format json -x --log-level off
{"short_hand_field_weighted_with_null": "200"}
{"short_hand_field_weighted_with_null": null}
{"short_hand_field_weighted_with_null": "200"}
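Weighted selection can be sketched with random.choices, with the special _NULL_ token mapped to a real null on output (illustrative only, not datacraft's implementation):

```python
import random

# Sketch of a weighted values spec: keys are chosen according to their
# weights, and the _NULL_ token becomes a real None in the output.
data = {"200": 0.5, "404": 0.1, "303": 0.3, "_NULL_": 0.1}

def weighted_value():
    key = random.choices(list(data), weights=list(data.values()))[0]
    return None if key == "_NULL_" else key

print(weighted_value())
```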
Special Output Values
There are certain valid JSON output values that are trickier to produce with a values spec. There are also times when your values are interpreted as strings, but you need them to be output as one of these special values. The way we do this is by using a special token of the form _TYPE_. Below are the current mappings of special tokens to output values:
{
"_NONE_": null,
"_NULL_": null,
"_NIL_": null,
"_TRUE_": true,
"_FALSE_": false
}
This is particularly useful when using a weighted values form of the values spec:
{
"converted": {
"type": "values",
"data": {
"_TRUE_": 0.05,
"_FALSE_": 0.95
}
}
}
$ datacraft -s /tmp/spec.json -i 3 -r 1 --format json -x --log-level off
{"converted": false}
{"converted": false}
{"converted": false}
The special token values can be mixed and matched as well:
{
"mixed": {
"type": "values",
"data": {
"_NONE_": 0.11,
"_NULL_": 0.11,
"_NIL_": 0.11,
"_TRUE_": 0.33,
"_FALSE_": 0.33
}
}
}
$ datacraft -s /tmp/spec.json -i 3 -r 1 --format json -x --log-level off
{"mixed": false}
{"mixed": true}
{"mixed": null}
refs
Pointer to a field spec defined in the references section
Prototype:
{
"<field name>": {
"type": "ref",
"ref": "<ref_name>",
or
"data": <ref_name>,
"config": {
"key": Any
}
}
}
Examples:
{ "pointer": { "type": "ref", "data": "ref_name" }, "refs": { "ref_name": 42 } }
{ "pointer": { "type": "ref", "ref": "ref_name" }, "refs": { "ref_name": 42 } }
{ "pointer:ref": { "ref": "ref_name" }, "refs": { "ref_name": 42 } }
{ "pointer:ref": { "data": "ref_name" }, "refs": { "ref_name": 42 } }
{ "pointer:ref": "ref_name", "refs": { "ref_name": 42 } }
ref_list
Pointer to Field Specs to be injected into a list in the order of the given names. This allows externally defined fields to be injected into specific places in a list of values.
Prototype:
{
"<field name>": {
"type": "ref_list",
"refs": ["<ref_name>", "<ref_name>", ..., "<ref_name>"],
or
"data": ["<ref_name>", "<ref_name>", ..., "<ref_name>"],
"config": {
"key": Any
}
}
}
Example:
In this example we want a location field as a list of [latitude, longitude, altitude]
{
"location": {
"type": "ref_list",
"refs": ["lat", "long", "altitude"]
},
"refs": {
"lat": {
"type": "geo.lat"
},
"long": {
"type": "geo.long"
},
"altitude": {
"type": "rand_int_range",
"data": [5000, 10000]
}
}
}
$ datacraft -s spec.json -i 1 --format json-pretty -x -l off
[
{
"location": [
-36.7587,
-40.5453,
6233
]
}
]
weighted_refs
A weighted_ref spec is used to select values from a set of refs in a weighted fashion.
Prototype:
{
"<field name>": {
"type": "weighted_ref",
"data": {"valid_ref_1": 0.N, "valid_ref_2": 0.N, ...},
"config": {
"key": Any
}
}
}
Examples:
{
"http_code": {
"type": "weighted_ref",
"data": {"GOOD_CODES": 0.7, "BAD_CODES": 0.3}
},
"refs": {
"GOOD_CODES": {
"200": 0.5,
"202": 0.3,
"203": 0.1,
"300": 0.1
},
"BAD_CODES": {
"400": 0.5,
"403": 0.3,
"404": 0.1,
"500": 0.1
}
}
}
config_ref
Reference for holding configurations common to multiple fields.
Prototype:
{
"refs": {
"<config ref name>": {
"type": "config_ref",
"config": {
"key1": Any,
...
"key2": Any
}
}
}
}
Examples:
{
"status": {
"type": "csv",
"config": {
"column": 1,
"config_ref": "tabs_config"
}
},
"description": {
"type": "csv",
"config": {
"column": 2,
"config_ref": "tabs_config"
}
},
"status_type:csv?config_ref=tabs_config&column=3": {},
"refs": {
"tabs_config": {
"type": "config_ref",
"config": {
"datafile": "tabs.csv",
"delimiter": "\t",
"headers": true
}
}
}
}
nested
Nested types are used to create fields that contain subfields. Nested types can also contain nested fields to allow multiple levels of nesting. Use the nested type to generate a field that contains subfields. The subfields are defined in the fields element of the nested spec. The fields element will be treated like a top level DataSpec and has access to the refs and other elements of the root.
Prototype:
{
"<field name>": {
"type": "nested",
"config": {
"count": "Values Spec for Counts, default is 1"
},
"fields": {
"<sub field one>": { spec definition here },
"<sub field two>": { spec definition here },
...
},
"field_groups": <field groups format>
}
}
Examples:
{
"id": {
"type": "uuid"
},
"user": {
"type": "nested",
"fields": {
"user_id": {
"type": "uuid"
},
"geo": {
"type": "nested",
"fields": {
"place_id:cc-digits?mean=5": {},
"coordinates:geo.pair?as_list=true": {}
}
}
}
}
}
The same spec in a slightly more compact format
{
"id:uuid": {},
"user:nested": {
"fields": {
"user_id:uuid": {},
"geo:nested": {
"fields": {
"place_id:cc-digits?mean=5": {},
"coordinates:geo.pair?as_list=true": {}
}
}
}
}
}
Generates the following structure
$ datacraft -s tweet-geo.json --log-level off -x -i 1 --format json-pretty
{
"id": "68092478-2234-41aa-bcc6-e679950770d7",
"user": {
"user_id": "93b3c62e-76ad-4272-b3c1-b434be2c8c30",
"geo": {
"place_id": "5104987632",
"coordinates": [
-93.0759,
68.2469
]
}
}
}
External Data
The csv types are used to input large numbers of values into a spec.
csv types
If you have an existing large set of data in a tabular format that you want to use, it would be burdensome to copy and paste the data into a spec. To make use of data already in a tabular format you can use a csv Field Spec. These specs allow you to identify a column from a tabular data file to use to provide the values for a field. Another advantage of using a csv spec is that it is easy to have correlated fields generated together. All rows will be selected incrementally, unless any of the fields are configured to use sample mode. You can use sample mode on individual columns, or you can use it across all columns by creating a config_ref spec. See csv_select for an efficient way to select multiple columns from a csv file.
csv
Prototype:
{
"<field name>": {
"type": "csv",
"config": {
"datafile": "filename in datadir",
"headers": "yes, on, true for affirmative",
"column": "1 based column number or field name if headers are present",
"delimiter": "how values are separated, default is comma",
"quotechar": "how values are quoted, default is double quote",
"sample": "If the values should be selected at random, default is false",
"count": "Number of values in column to use for value"
}
}
}
Examples:
{
"cities": {
"type": "csv",
"config": {
"datafile": "cities.csv",
"delimiter": "~",
"sample": true
}
}
}
{
"status": {
"type": "csv",
"config": {
"column": 1,
"config_ref": "tabs_config"
}
},
"description": {
"type": "csv",
"config": {
"column": 2,
"config_ref": "tabs_config"
}
},
"status_type:csv?config_ref=tabs_config&column=3": {},
"refs": {
"tabs_config": {
"type": "config_ref",
"config": {
"datafile": "tabs.csv",
"delimiter": "\\t",
"headers": true,
"sample_rows": true
}
}
}
}
csv_select
Prototype:
{
"<field name>": {
"type": "csv_select",
"data": {
"<field_one>": <1 based column index for field 1>,
"<field_two>:<cast>": <1 based column index for field 2>,
"<field_tre>": {
"col": <1 based column index for field 3>,
"cast": "<valid cast value i.e. int, float, etc.>"
},
...,
"<field n>": <1 based column index for field n>
},
"config": {
"datafile": "filename in datadir, or templated name i.e. {{ to_be_filled }}",
"headers": "yes, on, true for affirmative",
"delimiter": "how values are separated, default is comma",
"quotechar": "how values are quoted, default is double quote"
}
}
}
Examples:
{
"placeholder": {
"type": "csv_select",
"data": {
"geonameid": 1,
"name": 2,
"latitude:float": 5,
"longitude": { "col": 6, "cast": "float" },
"country_code": 9,
"population:int": 15
},
"config": {
"datafile": "allCountries.txt",
"headers": false,
"delimiter": "\t"
}
}
}
In the example above, the latitude and longitude columns are both cast to floating point numbers and the population is cast to an integer. See Casting Values for details on available casting types.
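The per-row work of csv_select can be sketched with the stdlib csv module: pick 1 based columns out of each row and apply the casts. The sample row below is hypothetical and only mirrors the geonames column layout.

```python
import csv
import io

# Hypothetical tab-delimited row in the geonames layout (15 columns)
row_text = "2988507\tParis\t\t\t48.85341\t2.3488\t\t\tFR\t\t\t\t\t\t2138551\n"

# field name -> (1 based column, cast), mirroring the csv_select data map
columns = {
    "geonameid": (1, str),
    "name": (2, str),
    "latitude": (5, float),
    "longitude": (6, float),
    "country_code": (9, str),
    "population": (15, int),
}

for row in csv.reader(io.StringIO(row_text), delimiter="\t"):
    record = {field: cast(row[col - 1]) for field, (col, cast) in columns.items()}
    print(record)
```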
weighted_csv
This is useful when you have a large number of weighted values that would not fit nicely into a JSON file. You can specify a value and a weight for that value. The default is that the first column in the csv is the value and the second column is the weight. Example CSV:
city,weight
New York,0.65
Los Angeles,0.23
London,0.87
Paris,0.49
Tokyo,0.32
Sydney,0.91
Beijing,0.04
Rio de Janeiro,0.78
Mumbai,0.56
Cape Town,0.38
Prototype:
{
"<field name>": {
"type": "weighted_csv",
"config": {
"datafile": "filename in datadir",
"headers": "yes, on, true for affirmative",
"column": "1 based column number or field name if headers are present",
"weight_column": "1 based column number or field name if headers are present where weights are defined",
"delimiter": "how values are separated, default is comma",
"quotechar": "how values are quoted, default is double quote",
"sample": "If the values should be selected at random, default is false",
"count": "Number of values in column to use for value"
}
}
}
Examples:
{
"cities": {
"type": "weighted_csv",
"config": {
"datafile": "weighted_cities.csv"
}
}
}
Operator Types
These make use of one or more other fields or references to compute their values.
sample
A sample spec is used to select multiple values from a list to use as the value for a field.
Prototype:
{
"<field name>": {
"type": "sample",
"config": {
"mean": N,
"stddev": N,
"min": N,
"max": N,
or
"count": N,
"join_with": "<optional delimiter to join with>"
},
"data": ["data", "to", "select", "from"],
OR
"ref": "<ref or field with data as list>"
}
}
Examples:
{
"ingredients": {
"type": "sample",
"data": ["onions", "mushrooms", "garlic", "bell peppers", "spinach", "potatoes", "carrots"],
"config": {
"mean": 3,
"stddev": 1,
"min": 2,
"max": 4,
"join_with": ", "
}
}
}
{
"ingredients": {
"type": "sample",
"data": ["onions", "mushrooms", "garlic", "bell peppers", "spinach", "potatoes", "carrots"],
"config": {
"mean": 3,
"stddev": 1,
"min": 2,
"max": 4,
"join_with": "\", \"",
"quote": "\""
}
}
}
$ datacraft -s sample.json -i 3 -t 'Ingredients: {{ ingredients | safe }}' -l off
Ingredients: "garlic", "onions"
Ingredients: "mushrooms", "potatoes", "garlic", "bell peppers"
Ingredients: "potatoes", "mushrooms"
combine
A combine Field Spec is used to concatenate or append two or more fields or references to one another. There are two combine types: combine and combine-list.
combine
Prototype:
{
"<field name>": {
"type": "combine",
"fields": ["valid field name1", "valid field name2"],
OR
"refs": ["valid ref1", "valid ref2"],
"config": {
"join_with": "<optional string to use to join fields or refs, default is none>"
}
}
}
Examples:
{
"combine": {
"type": "combine",
"refs": ["first", "last"],
"config": {
"join_with": " "
}
},
"refs": {
"first": {
"type": "values",
"data": ["zebra", "hedgehog", "llama", "flamingo"]
},
"last": {
"type": "values",
"data": ["jones", "smith", "williams"]
}
}
}
combine-list
Prototype:
{
"<field name>": {
"type": "combine-list",
"refs": [
["valid ref1", "valid ref2"],
["valid ref1", "valid ref2", "valid_ref3", ...], ...
["another_ref", "one_more_ref"]
],
"config": {
"join_with": "<optional string to use to join fields or refs, default is none>"
}
}
}
Examples:
{
"full_name": {
"type": "combine-list",
"refs": [
["first", "last"],
["first", "middle", "last"],
["first", "middle_initial", "last"]
],
"config": {
"join_with": " "
}
},
"refs": {
"first": {
"type": "values",
"data": ["zebra", "hedgehog", "llama", "flamingo"]
},
"last": {
"type": "values",
"data": ["jones", "smith", "williams"]
},
"middle": {
"type": "values",
"data": ["cloud", "sage", "river"]
},
"middle_initial": {
"type": "values",
"data": {"a": 0.3, "m": 0.3, "j": 0.1, "l": 0.1, "e": 0.1, "w": 0.1}
}
}
}
calculate
There are times when one field needs the value of another field in order to calculate its own value. For example, if you wanted to produce values that represented a user's height in inches and in centimeters, you would want them to correlate. You can use the calculate type to specify a formula to do this calculation. There are two ways to specify the fields to calculate a value from. The first is to use the fields and/or the refs keys with an array of fields or refs to use in the formula. The second is to use a map where the field or ref name is mapped to a string that will be used as an alias for it in the formula. See the second example below for the mapped alias version.
Prototype:
{
"<field name>": {
"type": "calculate",
"fields": List[str],
or
"refs": List[str],
"formula": <formula>,
"config": {
"key": Any
}
}
}
formula (str): The formula to use in calculations
Examples:
{
"height_in": [60, 70, 80, 90],
"height_cm": {
"type": "calculate",
"fields": ["height_in"],
"formula": "{{ height_in }} * 2.54"
}
}
{
"long_name_one": {
"type": "values",
"data": [4, 5, 6]
},
"long_name_two": {
"type": "values",
"data": [3, 6, 9]
},
"c": {
"type": "calculate",
"fields": {
"long_name_one": "a",
"long_name_two": "b"
},
"formula": "sqrt({{a}}*{{a}} + {{b}}*{{b}})"
}
}
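Working the aliased example by hand (assuming both value lists advance in step, so the pairs are (4, 3), (5, 6), (6, 9)):

```python
import math

# The formula sqrt(a*a + b*b) applied to each pair of values, where a is
# the alias for long_name_one and b is the alias for long_name_two.
a_values = [4, 5, 6]
b_values = [3, 6, 9]
results = [math.sqrt(a * a + b * b) for a, b in zip(a_values, b_values)]
print(results)  # first pair: sqrt(16 + 9) = 5.0
```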
We use the asteval package to do formula evaluation. This provides a fairly safe way to do the evaluation, and the package provides a bunch of built-in functions as well. We also use the Jinja2 templating engine format for specifying the variable names to substitute. In theory, you could use any valid Jinja2 syntax, i.e.:
{
"formula": "sqrt({{ value_that_might_be_a_string | int }})"
}
templated
A templated Field Spec is used to create strings by injecting the values from other fields into them. The other fields must be defined. The values can come from references or other defined fields. Use the jinja2 {{ field }} syntax to signify where the field should be injected.
Prototype:
{
"<field name>": {
"type": "templated",
"data": "string with {{ jinja2 }} syntax fields",
"fields": ["valid field name1", "valid field name2"],
OR
"refs": ["valid ref1", "valid ref2"]
}
}
Examples:
{
"user_agent": {
"type": "templated",
"data": "Mozilla/5.0 ({{ system }}) {{ platform }}",
"refs": ["system", "platform"]
},
"refs": {
"system": {
"type": "values",
"data": [
"Windows NT 6.1; Win64; x64; rv:47.0",
"Macintosh; Intel Mac OS X x.y; rv:42.0"
]
},
"platform": {
"type": "values",
"data": ["Gecko/20100101 Firefox/47.0", "Gecko/20100101 Firefox/42.0"]
}
}
}
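At its core the templated type is string interpolation; in the sketch below str.format stands in for the jinja2 {{ field }} substitution (illustrative, not the library's engine):

```python
# Sketch of templated: inject field values into the template's slots;
# str.format stands in for the jinja2 {{ field }} substitution.
template = "Mozilla/5.0 ({system}) {platform}"
record = {
    "system": "Windows NT 6.1; Win64; x64; rv:47.0",
    "platform": "Gecko/20100101 Firefox/47.0",
}
user_agent = template.format(**record)
print(user_agent)
```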
replace
Replace one or more parts of the output of a field or reference. Values to replace should be specified as strings. Values to replace with should also be strings.
Prototype:
{
"<field name>": {
"type": "replace",
"ref": "<field or ref to source value from>",
"data": {
"<value to replace 1>": "<value to replace with 1>",
...
"<value to replace N>": "<value to replace with N>",
}
}
}
Examples:
{
"id": {
"type": "uuid"
},
"remove_dashes": {
"type": "replace",
"ref": "id",
"data": { "-": "" }
}
}
$ datacraft --spec uuid-spec.json -i 3 -r 1 -x -l off --format json
{"id": "e809af25-bd85-4118-a5e9-cfdc953e172b", "remove_dashes": "1622e5cf2f334b81a90a6c031e0f78bf"}
{"id": "2a98b892-bb73-49de-8186-fa7cb4510001", "remove_dashes": "9c1d22d6f6e544bb8c0d582c441a1c78"}
{"id": "7986c789-1e5c-46f1-b5f1-a095f6a75209", "remove_dashes": "b50e914ea7994b6bb3194ce8c3402c8e"}
regex_replace
Replace one or more parts of the output of a field or reference using regular expressions to match the value strings. Note that masked is an alias for this type.
Prototype:
{
"<field name>": {
"type": "regex_replace|masked",
"ref": "<field or ref to source value from>",
"data": {
"<regex 1>": "<value to replace with 1>",
...
"<regex N>": "<value to replace with N>",
}
OR
"data": "<replace all values with this>"
}
}
Examples:
This first example will take a 10 digit string of numbers and format it as a phone number. The double backslashes allow the strings to be compiled into regular expressions. Notice the \N format for specifying the group capture replacement.
{
"phone": {
"type": "regex_replace",
"ref": "ten_digits",
"data": {
"^(\\d{3})(\\d{3})(\\d{4})": "(\\1) \\2-\\3"
}
},
"refs": {
"ten_digits": {
"type": "cc-digits",
"config": {
"count": 10,
"buffer": true
}
}
}
}
$ datacraft --spec phone-spec.json -i 4 -r 1 -x -l off --format json
{"phone": "(773) 542-6190"}
{"phone": "(632) 956-3481"}
{"phone": "(575) 307-4587"}
{"phone": "(279) 788-3403"}
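The same pattern and replacement can be exercised directly with Python's re module, which shows how the group capture replacement works:

```python
import re

# The phone example's pattern applied directly: three capture groups over
# a ten digit string, reassembled with the \N group references.
def format_phone(ten_digits):
    return re.sub(r"^(\d{3})(\d{3})(\d{4})", r"(\1) \2-\3", ten_digits)

print(format_phone("7735426190"))  # (773) 542-6190
```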
Masked Example
The masked type is an alias for regex_replace. One mode for this type is to replace all the values with a specified value, for example:
{
"masked_ssn": {
"type": "masked",
"ref": "ssn",
"data": "NNN-NN-NNNN"
},
"age:rand_int_range": [18, 99],
"refs": {
"ssn": [
"123-45-6789",
"111-22-3333",
"555-55-5555"
]
}
}
$ datacraft.exe -s ssn.json -i 3 --format csvh -x -l off
masked_ssn,age
NNN-NN-NNNN,40
NNN-NN-NNNN,42
NNN-NN-NNNN,73