Core Types

These are the built-in field spec types, organized by the type of data they generate or by their function or utility.

Strings

For generating strings in various formats

char_class

A char_class type is used to create strings that are made up of characters from specific character classes. The strings can be of fixed or variable length. There are several built in character classes. You can also provide your own set of characters to sample from. Below is the list of supported character classes:

Built In Classes

class	description
ascii	All valid ascii characters including control
lower	ascii lowercase
upper	ascii uppercase
digits	Numbers 0 through 9
letters	lowercase and uppercase
word	letters + digits + ‘_’
printable	All printable ascii chars including whitespace
visible	All printable ascii chars excluding whitespace
punctuation	locale-specific punctuation
special	locale-specific punctuation
hex	Hexadecimal digits including upper and lower case a-f
hex-lower	Hexadecimal digits only including lower case a-f
hex-upper	Hexadecimal digits only including upper case A-F

Prototype:

{
  "<field name>": {
    "type": "char_class",
    "data": <char_class_name>,
    or
    "type": "cc-<char_class_name>",
    or
    "type": "char_class",
    "data": <string with custom set of characters to sample from>
    or
    "type": "char_class",
    "data": [<char_class_name1>, <char_class_name2>, ..., <custom characters>],
    "config":{
      "exclude": <string of characters to exclude from output>,
      "escape": <string of characters to escape in output e.g. " -> \\", useful if non JSON output>,
      "escape_str": <string to use for escaping, default is \>,
      "min": <min number of characters in string>,
      "max": <max number of characters in string>,
      or
      "count": <exact number of characters in string>
      or
      "mean": <mean number of characters in string>
      "stddev": <standard deviation from mean for number of characters in string>
      "min": <optional min>
      "max": <optional max>
    }
  }
}

Examples:

{
  "password": {
    "type": "char_class",
    "data": ["word", "special", "hex-lower", "M4$p3c!@l$@uc3"],
    "config": {
      "mean": 14,
      "stddev": 2,
      "min": 10,
      "max": 18,
      "exclude": ["-", "\""]
    }
  }
}

$ datacraft -s spec.json -i 4 -r 1 -x -l off --format json
{"password": "c1hbR&V!sYi4+Em"}
{"password": "Z7Qd0AM>$f7'"}
{"password": "9Bh8Z%6?ed4g"}
{"password": "sqQ&I!Ucdp"}

{
  "one_to_five_digits:cc-digits?min=1&max=5": {}
}

$ datacraft -s spec.json -i 4 -r 1 -x -l off --format json
{"one_to_five_digits": "43040"}
{"one_to_five_digits": "5"}
{"one_to_five_digits": "6914"}
{"one_to_five_digits": "752"}
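
The sampling described above can be pictured with a small standard-library sketch. This is an illustration only, not datacraft's implementation: the `CLASSES` table and the `char_class_string` helper are hypothetical names, only a few of the built-in classes are included, and `special` is assumed to behave like Python's `string.punctuation`.

```python
import random
import string

# Hypothetical lookup table covering a few of the class names listed above.
CLASSES = {
    "lower": string.ascii_lowercase,
    "upper": string.ascii_uppercase,
    "digits": string.digits,
    "letters": string.ascii_letters,
    "word": string.ascii_letters + string.digits + "_",
    "special": string.punctuation,  # assumption: same set as string.punctuation
    "hex-lower": string.digits + "abcdef",
}


def char_class_string(data, rng, min_len=1, max_len=10, exclude=""):
    """Build a string by sampling from named classes and/or custom characters."""
    names = data if isinstance(data, list) else [data]
    # A name that is not a known class is treated as a custom character set.
    pool = "".join(CLASSES.get(name, name) for name in names)
    pool = [ch for ch in pool if ch not in exclude]
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(pool) for _ in range(length))


rng = random.Random(1)
digits = char_class_string("digits", rng, min_len=1, max_len=5)
password = char_class_string(["word", "special"], rng, min_len=10, max_len=18, exclude='-"')
print(digits, password)
```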

unicode_range

Generates strings from unicode ranges

Prototype:

{
  "<field>": {
    "type": "unicode_range",
    "data": [<start_code_point_in_hex>, <end_code_point_in_hex>],
    or
    "data": [
        [<start_code_point_in_hex>, <end_code_point_in_hex>],
        [<start_code_point_in_hex>, <end_code_point_in_hex>],
        ...
        [<start_code_point_in_hex>, <end_code_point_in_hex>],
    ],
    "config":{
      # String Size Based Config Parameters
      "min": <min number of characters in string>,
      "max": <max number of characters in string>,
      or
      "count": <exact number of characters in string>
      or
      "mean": <mean number of characters in string>
      "stddev": <standard deviation from mean for number of characters in string>
      "min": <optional min>
      "max": <optional max>
    }
  }
}

Examples:

{
  "text": {
    "type": "unicode_range",
    "data": ["3040", "309f"],
    "config": {
      "mean": 5
    }
  }
}
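
As a rough illustration of what sampling from hex code point ranges means, the sketch below builds a string with Python's `chr`. The helper name `unicode_range_string` is hypothetical, and treating the end of each range as inclusive is an assumption rather than documented behaviour.

```python
import random


def unicode_range_string(ranges, rng, count):
    """Sample `count` characters from hex code point ranges (end assumed inclusive)."""
    if isinstance(ranges[0], str):  # a single [start, end] pair
        ranges = [ranges]
    code_points = []
    for start_hex, end_hex in ranges:
        code_points.extend(range(int(start_hex, 16), int(end_hex, 16) + 1))
    return "".join(chr(rng.choice(code_points)) for _ in range(count))


text = unicode_range_string(["3040", "309f"], random.Random(7), count=5)
# Print code points rather than the characters to stay terminal-encoding safe.
print([hex(ord(ch)) for ch in text])
```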

uuid

A standard uuid

Prototype:

{
  "<field name>": {
    "type": "uuid",
    "config": {
      "variant": <1, 3, 4, or 5; optional, default is 4>
    }
  }
}

Examples:

{
  "id": {
    "type": "uuid"
  },
  "id_shorthand:uuid": {},
  "id_variant3": {
    "type": "uuid",
    "config": {
      "variant": 3
    }
  }
}

Numeric

For generating numeric values in different ways.

range_suppliers

There are two main range types: sequential and random. A sequential range is specified using the range type. A random one uses the rand_range type.

range

Prototype:

{
  "<field name>": {
    "type": "range",
    "data": [<start>, <end>, <step> (optional)],
    or
    "data": [
      [<start>, <end>, <step> (optional)],
      [<start>, <end>, <step> (optional)],
      ...
      [<start>, <end>, <step> (optional)],
    ],
  }
}

start: (Union[int, float]) - start of range
end: (Union[int, float]) - end of range
step: (Union[int, float]) - step for range, default is 1

Examples:

{
  "zero_to_ten_step_half": {
    "type": "range",
    "data": [0, 10, 0.5]
  }
}

{
  "range_shorthand1:range": {
    "data": [0, 10, 0.5]
  }
}

{"range_shorthand2:range": [0, 10, 0.5]}

rand_range

Generates a random floating point number in the given range. Use the rand_int_range type as a shortcut for casting the numbers as integers.

Prototype:

{
  "<field name>": {
    "type": "rand_range",
    "data": [<upper>],
    or
    "data": [<lower>, <upper>],
    or
    "data": [<lower>, <upper>, <precision> (optional)]
  }
}

upper: (Union[int, float]) - upper limit of random range
lower: (Union[int, float]) - lower limit of random range
precision: (int) - Number of digits after decimal point

Examples:

{
  "zero_to_ten_three_decimals": {
    "type": "rand_range",
    "data": [0, 10, 3]
  }
}

{
  "int_in_range": {
    "type": "rand_int_range",
    "data": [1, 100]
  }
}
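
The three data forms can be read as in the following sketch. It is not the library's code: the `rand_range` helper is hypothetical, and a lower bound of 0 for the single-element form is an assumption.

```python
import random


def rand_range(data, rng):
    """Interpret [upper], [lower, upper] or [lower, upper, precision]."""
    precision = None
    if len(data) == 1:
        lower, upper = 0, data[0]  # assumption: lower bound defaults to 0
    elif len(data) == 2:
        lower, upper = data
    else:
        lower, upper, precision = data
    value = rng.uniform(lower, upper)
    return value if precision is None else round(value, precision)


rng = random.Random(42)
three_decimals = rand_range([0, 10, 3], rng)
as_integer = int(rand_range([1, 100], rng))  # mimics the rand_int_range cast
print(three_decimals, as_integer)
```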

integer

The integer type is similar to rand_int_range and uses the same configuration. The only difference is that the data element is not required. If no data element is specified, the range of numbers created will be between +- one billion.

{
  "int_no_args": {
    "type": "integer"
  }
}

{
  "int_with_args": {
    "type": "integer",
    "data": [
        [1, 5], [7, 11], [20, 122]
    ]
  }
}

number

The number type is similar to rand_range and uses the same configuration. The only difference is that the data element is not required. If no data element is specified, the range of numbers created will be between +- one billion.

{
  "num_no_args": {
    "type": "number"
  }
}

{
  "num_with_args": {
    "type": "number",
    "data": [
        [1.1, 5.5], [7.1, 11.33], [20.5, 122.66]
    ]
  }
}

number.N

The number.N type is a specialized version of the number type that automatically rounds values to N decimal places. This type is available for N = 1 through 7. It uses the same configuration as the number type but automatically applies a roundN cast to ensure consistent decimal precision.

Prototype:

{
  "<field name>": {
    "type": "number.<N>",
    "data": [<lower>, <upper>] (optional),
    or
    "data": [
      [<lower>, <upper>],
      [<lower>, <upper>],
      ...
      [<lower>, <upper>],
    ],
  }
}

N: (int) - Number of decimal places (1-7)
lower: (Union[int, float]) - lower limit of random range
upper: (Union[int, float]) - upper limit of random range

Examples:

{
  "two_decimal_places": {
    "type": "number.2",
    "data": [0, 10]
  }
}

{
  "three_decimal_places": {
    "type": "number.3",
    "data": [1.1, 5.5]
  }
}

{
  "one_decimal_place": {
    "type": "number.1",
    "data": [
        [0, 5], [10, 15]
    ]
  }
}

{
  "four_decimal_places": {
    "type": "number.4"
  }
}

distribution

A distribution spec can be built from one of the registered distribution types. Below is the table of the built in ones. Custom distributions can be registered using Custom Code Loading. See Custom Count Distributions for an example.

distribution	required arguments	optional args	examples
uniform	start,end		"uniform(start=10, end=30)"
			"uniform(start=1, end=3)"
gauss	mean,stddev	min,max	"gauss(mean=2, stddev=1)"
guassian			"guassian(mean=7, stddev=1, min=4)"
normal			"normal(mean=25, stddev=10, max=40)"

Prototype:

{
  "<field name>": {
    "type": "distribution",
    "data": "<dist func name>(<param1>=<val1>, ..., <paramN>=<valN>)"
  }
}

Examples:

{
  "values": {
    "type": "distribution",
    "data": "uniform(start=10, end=30)"
  }
}

{
  "age": {
    "type": "distribution",
    "data": "normal(mean=28, stddev=10, min=18, max=40)",
    "config": {"cast": "int"}
  }
}

{
  "pressure": {
    "type": "distribution",
    "data": "gauss(mean=33, stddev=3.4756535)",
    "config": {
      "count_dist": "normal(mean=2, stddev=1, min=1, max=4)",
      "as_list": true
    }
  }
}

A distribution type field with a uniform distribution will produce similar values to a rand_range field. With rand_range it is easier to specify a specific number of decimal places to keep. To do this for the distribution type, you need to make use of the cast config with a roundN caster. See example below.

{
  "values1": {
    "type": "rand_range",
    "data": [10, 30, 4]
  },
  "values2": {
    "type": "distribution",
    "data": "uniform(start=10, end=30)",
    "config": {
      "cast": "round4"
    }
  },
  "values3:rand_range": [10, 30, 4],
  "values4:distribution?cast=round4": "uniform(start=10, end=30)"
}

$ datacraft -s spec.json -i2 --log-level off --printkey
values1 -> 29.7907
values2 -> 18.9114
values3 -> 13.5495
values4 -> 15.5935
values1 -> 22.0634
values2 -> 17.8552
values3 -> 22.982
values4 -> 20.5616
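
To make the distribution string format concrete, here is a minimal sketch of parsing such a string and drawing a value from it. The helpers `parse_distribution` and `draw` are hypothetical, and clamping to min and max is an assumption; the library may handle out-of-range draws differently.

```python
import random
import re


def parse_distribution(expr):
    """Split 'name(key=val, ...)' into a name and a dict of float arguments."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", expr)
    if match is None:
        raise ValueError(f"not a distribution expression: {expr!r}")
    name, arg_text = match.groups()
    args = {}
    for part in filter(None, (p.strip() for p in arg_text.split(","))):
        key, value = part.split("=")
        args[key.strip()] = float(value)
    return name, args


def draw(expr, rng):
    """Draw one value from a uniform or normal distribution expression."""
    name, args = parse_distribution(expr)
    if name == "uniform":
        return rng.uniform(args["start"], args["end"])
    if name in ("gauss", "normal"):
        value = rng.gauss(args["mean"], args["stddev"])
        # Clamping is an assumption; an implementation might resample instead.
        value = max(value, args.get("min", value))
        return min(value, args.get("max", value))
    raise ValueError(f"unsupported distribution: {name}")


rng = random.Random(0)
age = int(draw("normal(mean=28, stddev=10, min=18, max=40)", rng))
rounded = round(draw("uniform(start=10, end=30)", rng), 4)  # like cast=round4
print(age, rounded)
```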

iteration

An iteration or rownum spec is used to populate the record number that is being generated. By default the offset is set to 1. To get zero based indexes for iteration, set the offset config parameter to 0.

Prototype:

{
  "<field name>": {
    "type": "iteration",
    OR
    "type": "rownum",
    OR
    "type": "rownum",
    "config": {
      "offset": N
    }
  }
}

Examples:

{
  "id": {
    "type": "iteration"
  }
}

{
  "id": {
    "type": "rownum",
    "config": { "offset": 0 }
  }
}

$ datacraft -s iteration.json  -i 3 -t 'ID: {{ id | safe }}' -l off
ID: 1
ID: 2
ID: 3

Date & Time

For generating dates and timestamps in a variety of formats

date

A Date Field Spec is used to generate date strings. The default format is day-month-year, i.e. Christmas 2050 would be: 25-12-2050. There is also a date.iso type that generates ISO8601 formatted date strings without microseconds, and a date.iso.us type that generates them with microseconds. The date.epoch, date.epoch.ms, and date.epoch.millis types generate unix epoch timestamps. Format strings follow the format specification from the Python datetime module.

type	example output
date	18-11-2050
date.iso	2050-12-01T01:44:35Z
date.iso.ms	2050-12-01T05:11:20.543Z
date.iso.millis	2050-12-01T05:11:20.543Z
date.iso.us	2050-12-01T06:19:02.752373Z
date.iso.micros	2050-12-01T06:17:05.487878Z
date.epoch	1669825519
date.epoch.ms	1668624934547
date.epoch.millis	1669166880466
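
Because the format strings follow Python's datetime conventions, the output shapes in the table can be reproduced with `strftime`. In this sketch the `%d-%m-%Y` string is an assumed spelling of the day-month-year default, and the ISO string is built by hand for illustration.

```python
from datetime import datetime, timezone

christmas = datetime(2050, 12, 25, 1, 44, 35, 543210, tzinfo=timezone.utc)

default_style = christmas.strftime("%d-%m-%Y")       # day-month-year
custom_style = christmas.strftime("%d-%b-%Y %H:%M")  # month name depends on locale
iso_seconds = christmas.strftime("%Y-%m-%dT%H:%M:%SZ")
epoch_seconds = int(christmas.timestamp())
epoch_millis = int(christmas.timestamp() * 1000)

print(default_style, custom_style, iso_seconds, epoch_seconds, epoch_millis)
```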

Uniformly Sampled Dates

The default strategy is to create random dates within a 30 day range, where the start date is today. You can use the start parameter to set a specific start date for the dates. You can also explicitly specify an end date. The start and end parameters should conform to the specified date format, or the default if none is provided. The offset parameter can be used to shift the dates by a specified number of days. A positive offset will shift the start date back. A negative offset will shift the date forward. The duration_days parameter can be used to specify the number of days that should be covered in the date range, instead of the default 30 days. This parameter is usually specified as an integer constant.

     start                              end (default start + 30 days)
        |--------------------------------|
|+offset|                           start+duration_days
|--------------------------------|
        |-offset|
                |--------------------------------|
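
The window arithmetic in the diagram can be sketched as follows. The helpers `date_window` and `uniform_date` are hypothetical names; the sketch only mirrors the description above, with a positive offset moving the window back and a negative one moving it forward.

```python
import random
from datetime import datetime, timedelta


def date_window(start, duration_days=30, offset=0):
    """Return (begin, end) of the sampling window described by the diagram."""
    begin = start - timedelta(days=offset)  # positive offset shifts back
    return begin, begin + timedelta(days=duration_days)


def uniform_date(start, rng, duration_days=30, offset=0):
    """Pick a date uniformly at random inside the window."""
    begin, end = date_window(start, duration_days, offset)
    return begin + (end - begin) * rng.random()


start = datetime(2050, 12, 15, 12, 0)
begin, end = date_window(start, duration_days=7, offset=-14)
sampled = uniform_date(start, random.Random(0), duration_days=7, offset=-14)
print(begin, end, sampled)
```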

Dates Distributed around a Center Point

An alternative strategy is to specify a center_date parameter with an optional stddev_days. This will create a normal or gaussian distribution of dates around the center point.

                   |
                   |
                |  |  |
             |  |  |  |  |
          |  |  |  |  |  |  |
 |  |  |  |  |  |  |  |  |  |  |  |  |
|-------------------------------------|
|         | stddev | stddev |         |
                center
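
A normally distributed date can be sketched by adding a Gaussian number of days to the center point. The `date_around` helper is a hypothetical name, and the sketch makes no claim about how the library draws its values.

```python
import random
from datetime import datetime, timedelta


def date_around(center, stddev_days, rng):
    """Draw a date whose distance from `center` is normally distributed."""
    return center + timedelta(days=rng.gauss(0, stddev_days))


center = datetime(2050, 6, 1, 12, 0)
rng = random.Random(5)
dates = [date_around(center, 2, rng) for _ in range(5)]
for value in dates:
    print(value.strftime("%Y%m%d %H:%M"))
```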

Restricting Hours

If you want your generated dates to be restricted to certain hours of the day, provide the hours config param. The value of this parameter can be any type of Field Spec that produces valid integers in the range of 0 to 23. See examples below.

Prototype:

{
  "<field name>": {
    "type": "date",
    OR,
    "type": "date.iso",
    OR,
    "type": "date.iso.ms",
    OR,
    "type": "date.iso.millis",
    OR,
    "type": "date.iso.us",
    OR,
    "type": "date.iso.micros",
    "data": "replacement for config.format, valid for type: date only",
    "config": {
      "format": "Valid datetime format string",
      "duration_days": "The number of days from the start date to create date strings for",
      "start": "date string matching format or default format to use for start date",
      "end": "date string matching format or default  format to use for end date",
      "offset": "number of days to shift base date by, positive means shift backwards, negative means forward",
      "center_date": "date string matching format or default format to use for center date",
      "stddev_days": "The standard deviation in days from the center date that dates should be distributed",
      "hours": "spec describing how the hours should be populated, i.e. only between 9am and 5pm"
    }
  }
}

Examples:

Dates that start on 15 Dec 2050 and span a 90 day period

{
  "dates": {
    "type": "date",
    "config": {
      "duration_days": "90",
      "start": "15-Dec-2050 12:00",
      "format": "%d-%b-%Y %H:%M"
    }
  }
}

Dates centered on 01 Jun 2050 with a standard deviation of +-2 days

{
  "dates": {
    "type": "date",
    "config": {
      "center_date": "20500601 12:00",
      "format": "%Y%m%d %H:%M",
      "stddev_days": "2"
    }
  }
}

ISO Date Centered at 1 Jun 2050, with weighted hours of the day

{
  "start_time": {
    "type": "date.iso",
    "config": {
      "center_date": "2050-06-01T12:00:00Z",
      "hours": { "type": "values", "data": { "7": 0.1, "8": 0.2, "9": 0.4, "10": 0.2, "11": 0.1 } }
    }
  }
}

Epoch Date with milliseconds 14 days in the future with a 7 day window for timestamps

{
  "start_time": {
    "type": "date.epoch.ms",
    "config": {
      "offset": -14,
      "duration_days": 7
    }
  }
}

Date format in data element using shorthand notation

{
  "start_time:date": "%d-%b-%Y %H:%M"
}

Equivalent to

{
  "start_time": {
    "type": "date",
    "data": "%d-%b-%Y %H:%M"
  }
}

.now Variations

All date-type variations support a .now extension, allowing you to generate the current date and time in different formats based on your specific needs. These formats can include human-readable strings, epoch timestamps in various precisions, or ISO standard formats. The flexibility of the .now variations ensures that your data can align with different system requirements.

For example, using the .now extension with a specific format string will generate the current date and time as follows:

{
  "event_date": {
    "type": "date.now",
    "data": "%d-%b-%Y %H:%M:%S"
  }
}

This produces output like: 15-Sep-2044 10:35:20, which is useful for generating consistent, formatted timestamps.

Available `.now` Variations:

Each of the following .now types generates the current date and time in a specific format:

`date.now`	Outputs the current date in a human-readable string, supports custom formats.
`date.epoch.now`	Generates the current Unix timestamp (seconds since 1 January 1970).
`date.epoch.millis.now`	Returns the Unix timestamp with millisecond precision.
`date.epoch.ms.now`	Alias for `date.epoch.millis.now`.
`date.iso.now`	Produces the current date and time in ISO 8601 format.
`date.iso.micros.now`	Provides the ISO 8601 format with microsecond precision.
`date.iso.us.now`	Alias for `date.iso.micros.now`.
`date.iso.millis.now`	Outputs the ISO 8601 format with millisecond precision.
`date.iso.ms.now`	Alias for `date.iso.millis.now`.

These variations work well when using the --server command line option to serve up the data over REST.

Geographic

For generating basic decimal degrees of latitude and longitude

geo types

There are three main geo types: geo.lat, geo.long, and geo.pair. The defaults will create decimal string values in the valid ranges: -90 to 90 for latitude and -180 to 180 for longitude. You can bound the ranges in several ways. The first is with the start_lat, end_lat, start_long, end_long config params. These will set the individual bounds for each of the segments. You can use one or more of them. The other mechanism is by defining a bbox array which consists of the lower left geo point and the upper right one.

type	param	description
all	precision	number of decimal places for lat or long, default is 4
	bbox	array of [min Longitude, min Latitude, max Longitude, max Latitude]
geo.lat	start_lat	lower bound for latitude
	end_lat	upper bound for latitude
geo.long	start_long	lower bound for longitude
	end_long	upper bound for longitude
geo.pair	join_with	delimiter to join long and lat with, default is comma
	as_list	One of yes, true, or on if the pair should be returned as a list instead of as a joined string
	lat_first	if latitude should be first in the generated pair, default is longitude first
	start_lat	lower bound for latitude
	end_lat	upper bound for latitude
	start_long	lower bound for longitude
	end_long	upper bound for longitude

Prototype:

{
  "<field name>": {
    "type": "geo.lat",
    or
    "type": "geo.long",
    or
    "type": "geo.pair",
    "config": {
      "key": Any
    }
  }
}

Examples:

{
  "egypt": {
    "type": "geo.pair",
    "config": {
      "bbox": [
        31.33134,
        22.03795,
        34.19295,
        25.00562
      ],
      "precision": 3
    }
  }
}
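
The bbox ordering is easy to get backwards, so here is a small sketch of sampling a pair inside a bounding box. The `geo_pair` helper is hypothetical; it follows the table above (longitude first by default, comma as the join string, precision of 4) but is not the library's implementation.

```python
import random


def geo_pair(rng, bbox=None, precision=4, lat_first=False, join_with=","):
    """Sample a point inside [min lon, min lat, max lon, max lat]."""
    min_lon, min_lat, max_lon, max_lat = bbox or (-180.0, -90.0, 180.0, 90.0)
    lon = round(rng.uniform(min_lon, max_lon), precision)
    lat = round(rng.uniform(min_lat, max_lat), precision)
    pair = (lat, lon) if lat_first else (lon, lat)
    return join_with.join(str(value) for value in pair)


egypt = geo_pair(random.Random(3), bbox=[31.33134, 22.03795, 34.19295, 25.00562], precision=3)
print(egypt)
```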

Network

Network related types

ip/ipv4

IP addresses can be generated using CIDR notation or by specifying a base. For example, if you wanted to generate IPs in the 10.0.0.0 to 10.0.0.255 range, you could either specify a cidr param of 10.0.0.0/24 or a base param of 10.0.0.

Prototype:

{
  "<field name>": {
    "type": "ipv4",
    "config": {
      "cidr": "<cidr value /8 /16 /24 only>",
      OR
      "base": "<beginning of ip i.e. 10.0>"
    }
  }
}

Examples:

{
  "network": {
    "type": "ipv4",
    "config": {
      "cidr": "2.22.222.0/16"
    }
  },
  "network_shorthand:ip?cidr=2.22.222.0/16": {},
  "network_with_base:ip?base=192.168.0": {}
}
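
The relationship between a CIDR block and a base prefix can be checked with Python's standard `ipaddress` module. This only illustrates the address arithmetic and says nothing about how datacraft itself parses these parameters.

```python
import ipaddress

network = ipaddress.ip_network("10.0.0.0/24")
first, last = network[0], network[-1]
print(first, last, network.num_addresses)

# Every address in the /24 starts with the text that a base of 10.0.0 describes.
shares_base = all(str(addr).startswith("10.0.0.") for addr in network)

# A value with host bits set, such as 2.22.222.0/16, needs strict=False here;
# ipaddress then normalizes it to the enclosing network.
loose = ipaddress.ip_network("2.22.222.0/16", strict=False)
print(shares_base, loose)
```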

ip.precise

The default ip type only supports cidr masks of /8 /16 and /24. If you want more precise ip ranges you need to use the ip.precise type. This type requires a cidr as the single config param. The default mode for ip.precise is to increment the ip addresses. Set config param sample to one of true, on, or yes to enable random ip addresses selected from the generated ranges.

Prototype:

{
  "<field name>": {
    "type": "ip.precise",
    "config": {
      "cidr": "<valid cidr value>",
      "sample": "<optional, one of true, on, or yes to select ips at random>"
    }
  }
}

Examples:

{
  "network": {
    "type": "ip.precise",
    "config": {
      "cidr": "192.168.0.0/14",
      "sample": "true"
    }
  }
}

net.mac

For creating MAC addresses

Prototype:

{
  "<field name>": {
    "type": "net.mac",
    "config": {
      "dashes": "If dashes should be used as the separator one of on, yes, 'true', or True"
    }
  }
}

Examples:

{
  "network": {
    "type": "net.mac"
  }
}

{
  "network": {
    "type": "net.mac",
    "config": {
      "dashes": "true"
    }
  }
}
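
For reference, a MAC address is six octets written as two hex digits each. The sketch below shows the two separator styles; the `mac_address` helper is hypothetical, and both the colon default and the lowercase hex digits are assumptions.

```python
import random


def mac_address(rng, dashes=False):
    """Format six random octets as a MAC address string."""
    separator = "-" if dashes else ":"  # assumption: colon is the default
    return separator.join(f"{rng.randrange(256):02x}" for _ in range(6))


rng = random.Random(11)
with_colons = mac_address(rng)
with_dashes = mac_address(rng, dashes=True)
print(with_colons, with_dashes)
```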

Utility/Common

Common types or types that are used in a utility capacity.

values

There are three types of values specs: Constants, List, and Weighted. Values specs have a shorthand notation where the value of the data element replaces the full spec. See examples below.

Prototype:

{
  "<field_name>": {
    "type": "values",
    "data": Union[str, bool, int, float, list, dict],
    "config": {
      "key": "value"
    }
  }
}

Examples:

{"field_constant": {"type": "values", "data": 42}}

{"field_list": {"type": "values", "data": [1, 2, 3, 5, 8, 13]}}

{"field_weighted": {"type": "values", "data": {"200": 0.6, "404": 0.1, "303": 0.3}}}

{"field_weighted_with_null": {"type": "values", "data": {"200": 0.5, "404": 0.1, "303": 0.3, "_NULL_": 0.1}}}

{"shorthand_field_constant": 42}

{"shorthand_field_list": [1, 2, 3, 5, 8, 13]}

{"shorthand_field_weighted": {"200": 0.6, "404": 0.1, "303": 0.3}}

{
    "short_hand_field_weighted_with_null": {
        "type": "values",
        "data": {"200": 0.5, "404": 0.1, "303": 0.3, "_NONE_": 0.1}
    }
}

$ datacraft -s spec.json -i 3 -r 1 --format json -x --log-level off
{"short_hand_field_weighted_with_null": "200"}
{"short_hand_field_weighted_with_null": null}
{"short_hand_field_weighted_with_null": "200"}

Special Output Values

There are certain valid JSON output values that are trickier to produce with a values spec. There are also times when your values are interpreted as strings but you need them to be output as one of these special values. The way we do this is by using a special token of the form _TYPE_. Below are the current mappings of special tokens to output values:

{
    "_NONE_": null,
    "_NULL_": null,
    "_NIL_": null,
    "_TRUE_": true,
    "_FALSE_": false
}

This is particularly useful when using a weighted values form of the values spec:

{
    "converted": {
        "type": "values",
        "data": {
            "_TRUE_": 0.05,
            "_FALSE_": 0.95
        }
    }
}

$ datacraft -s /tmp/spec.json -i 3 -r 1 --format json -x --log-level off
{"converted": false}
{"converted": false}
{"converted": false}

The special token values can be mixed and matched as well:

{
    "mixed": {
        "type": "values",
        "data": {
            "_NONE_": 0.11,
            "_NULL_": 0.11,
            "_NIL_": 0.11,
            "_TRUE_": 0.33,
            "_FALSE_": 0.33
        }
    }
}

$ datacraft -s /tmp/spec.json -i 3 -r 1 --format json -x --log-level off
{"mixed": false}
{"mixed": true}
{"mixed": null}
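
Conceptually, a weighted values spec with special tokens amounts to a weighted choice followed by a token lookup. The sketch below uses `random.choices` for the weighted pick; the `weighted_value` helper is a hypothetical name and not part of the library.

```python
import random

# Token mapping copied from the table above.
SPECIAL_TOKENS = {
    "_NONE_": None,
    "_NULL_": None,
    "_NIL_": None,
    "_TRUE_": True,
    "_FALSE_": False,
}


def weighted_value(weights, rng):
    """Pick a key according to its weight, then translate special tokens."""
    keys = list(weights)
    key = rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
    return SPECIAL_TOKENS.get(key, key)


rng = random.Random(2)
converted = [weighted_value({"_TRUE_": 0.05, "_FALSE_": 0.95}, rng) for _ in range(3)]
status = weighted_value({"200": 0.6, "404": 0.1, "303": 0.3}, rng)
print(converted, status)
```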

refs

Pointer to a field spec defined in references section

Prototype:

{
  "<field name>": {
    "type": "ref",
    "ref": "<ref_name>",
    or
    "data": <ref_name>,
    "config": {
      "key": Any
    }
  }
}

Examples:

{ "pointer": { "type": "ref", "data": "ref_name" }, "refs": { "ref_name": 42 } }

{ "pointer": { "type": "ref", "ref": "ref_name" }, "refs": { "ref_name": 42 } }

{ "pointer:ref": { "ref": "ref_name" }, "refs": { "ref_name": 42 } }

{ "pointer:ref": { "data": "ref_name" }, "refs": { "ref_name": 42 } }

{ "pointer:ref": "ref_name", "refs": { "ref_name": 42 } }

ref_list

Pointer to Field Specs to be injected into list in order of name. This allows externally defined fields to be injected into specific places in a list of values.

Prototype:

{
  "<field name>": {
    "type": "ref_list",
    "refs": ["<ref_name>", "<ref_name>", ...,"<ref_name>"],
    or
    "data": ["<ref_name>", "<ref_name>", ...,"<ref_name>"],
    "config": {
      "key": Any
    }
  }
}

Example:

In this example we want a location field as a list of [latitude, longitude, altitude]

{
  "location": {
    "type": "ref_list",
    "refs": ["lat", "long", "altitude"]
  },
  "refs": {
    "lat": {
      "type": "geo.lat"
    },
    "long": {
      "type": "geo.long"
    },
    "altitude": {
      "type": "rand_int_range",
      "data": [5000, 10000]
    }
  }
}

$ datacraft -s spec.json -i 1 --format json-pretty -x -l off
[
    {
        "location": [
            -36.7587,
            -40.5453,
            6233
        ]
    }
]

weighted_refs

A weighted_ref spec is used to select the values from a set of refs in a weighted fashion.

Prototype:

{
  "<field name>": {
    "type": "weighted_ref",
    "data": {"valid_ref_1": 0.N, "valid_ref_2": 0.N, ...},
    "config": {
      "key": Any
    }
  }
}

Examples:

{
  "http_code": {
    "type": "weighted_ref",
    "data": {"GOOD_CODES": 0.7, "BAD_CODES": 0.3}
  },
  "refs": {
    "GOOD_CODES": {
      "200": 0.5,
      "202": 0.3,
      "203": 0.1,
      "300": 0.1
    },
    "BAD_CODES": {
      "400": 0.5,
      "403": 0.3,
      "404": 0.1,
      "500": 0.1
    }
  }
}
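
One way to read the example above is as a two-stage weighted choice: first a ref is chosen by weight, then a value is drawn from that ref. The sketch below illustrates that reading only; `pick` and `weighted_ref` are hypothetical helpers.

```python
import random

REFS = {
    "GOOD_CODES": {"200": 0.5, "202": 0.3, "203": 0.1, "300": 0.1},
    "BAD_CODES": {"400": 0.5, "403": 0.3, "404": 0.1, "500": 0.1},
}


def pick(weights, rng):
    """Return one key of `weights`, chosen in proportion to its weight."""
    keys = list(weights)
    return rng.choices(keys, weights=list(weights.values()), k=1)[0]


def weighted_ref(data, refs, rng):
    ref_name = pick(data, rng)        # first choose which ref to use
    return pick(refs[ref_name], rng)  # then draw a value from that ref


rng = random.Random(9)
codes = [weighted_ref({"GOOD_CODES": 0.7, "BAD_CODES": 0.3}, REFS, rng) for _ in range(5)]
print(codes)
```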

config_ref

Reference for holding configurations common to multiple fields.

Prototype:

{
  "refs": {
    "<config ref name>": {
      "type": "config_ref",
      "config": {
        "key1": Any,
        ...
        "key2": Any
      }
    }
  }
}

Examples:

{
  "status": {
    "type": "csv",
    "config": {
      "column": 1,
      "config_ref": "tabs_config"
    }
  },
  "description": {
    "type": "csv",
    "config": {
      "column": 2,
      "config_ref": "tabs_config"
    }
  },
  "status_type:csv?config_ref=tabs_config&column=3": {},
  "refs": {
    "tabs_config": {
      "type": "config_ref",
      "config": {
        "datafile": "tabs.csv",
        "delimiter": "\t",
        "headers": true
      }
    }
  }
}

nested

Nested types are used to create fields that contain subfields. The subfields are defined in the fields element of the nested spec, and can themselves be nested to allow multiple levels of nesting. The fields element is treated like a top level DataSpec and has access to the refs and other elements of the root.

Prototype:

{
  "<field name>": {
    "type": "nested",
    "config": {
      "count": "Values Spec for Counts, default is 1"
    },
    "fields": {
      "<sub field one>": { spec definition here },
      "<sub field two>": { spec definition here },
      ...
    },
    "field_groups": <field groups format>
  }
}

Examples:

{
  "id": {
    "type": "uuid"
  },
  "user": {
    "type": "nested",
    "fields": {
      "user_id": {
        "type": "uuid"
      },
      "geo": {
        "type": "nested",
        "fields": {
          "place_id:cc-digits?mean=5": {},
          "coordinates:geo.pair?as_list=true": {}
        }
      }
    }
  }
}

The same spec in a slightly more compact format

{
  "id:uuid": {},
  "user:nested": {
    "fields": {
      "user_id:uuid": {},
      "geo:nested": {
        "fields": {
          "place_id:cc-digits?mean=5": {},
          "coordinates:geo.pair?as_list=true": {}
        }
      }
    }
  }
}

Generates the following structure

$ datacraft -s tweet-geo.json --log-level off -x -i 1 --format json-pretty

{
    "id": "68092478-2234-41aa-bcc6-e679950770d7",
    "user": {
        "user_id": "93b3c62e-76ad-4272-b3c1-b434be2c8c30",
        "geo": {
            "place_id": "5104987632",
            "coordinates": [
                -93.0759,
                68.2469
            ]
        }
    }
}

External Data

The csv types are used to input large numbers of values into a spec.

csv types

If you have an existing large set of data in a tabular format that you want to use, it would be burdensome to copy and paste the data into a spec. To make use of data already in a tabular format you can use a csv Field Spec. These specs allow you to identify a column from a tabular data file to use to provide the values for a field. Another advantage of using a csv spec is that it is easy to have fields that are correlated be generated together. All rows will be selected incrementally, unless any of the fields are configured to use sample mode. You can use sample mode on individual columns, or you can use it across all columns by creating a config_ref spec. See csv_select for an efficient way to select multiple columns from a csv file.

csv

Prototype:

{
  "<field name>": {
    "type": "csv",
    "config": {
      "datafile": "filename in datadir",
      "headers": "yes, on, true for affirmative",
      "column": "1 based column number or field name if headers are present",
      "delimiter": "how values are separated, default is comma",
      "quotechar": "how values are quoted, default is double quote",
      "sample": "If the values should be selected at random, default is false",
      "count": "Number of values in column to use for value"
    }
  }
}

Examples:

{
  "cities": {
    "type": "csv",
    "config": {
      "datafile": "cities.csv",
      "delimiter": "~",
      "sample": true
    }
  }
}

{
  "status": {
    "type": "csv",
    "config": {
      "column": 1,
      "config_ref": "tabs_config"
    }
  },
  "description": {
    "type": "csv",
    "config": {
      "column": 2,
      "config_ref": "tabs_config"
    }
  },
  "status_type:csv?config_ref=tabs_config&column=3": {},
  "refs": {
    "tabs_config": {
      "type": "config_ref",
      "config": {
        "datafile": "tabs.csv",
        "delimiter": "\\t",
        "headers": true,
        "sample_rows": true
      }
    }
  }
}

csv_select

Prototype:

{
  "<field name>": {
    "type": "csv_select",
    "data": {
      "<field_one>": <1 based column index for field 1>,
      "<field_two>:<cast>": <1 based column index for field 2>,
      "<field_three>": {
        "col": <1 based column index for field 3>,
        "cast": "<valid cast value i.e. int, float, etc>"
      },
      ...,
      "<field n>": <1 based column index for field n>
    },
    "config": {
      "datafile": "filename in datadir, or templated name i.e. {{ to_be_filled }}",
      "headers": "yes, on, true for affirmative",
      "delimiter": "how values are separated, default is comma",
      "quotechar": "how values are quoted, default is double quote"
    }
  }
}

Examples:

{
  "placeholder": {
    "type": "csv_select",
    "data": {
      "geonameid": 1,
      "name": 2,
      "latitude:float": 5,
      "longitude": { "col": 6, "cast": "float" },
      "country_code": 9,
      "population:int": 15
    },
    "config": {
      "datafile": "allCountries.txt",
      "headers": false,
      "delimiter": "\t"
    }
  }
}

In the example above, the latitude and longitude columns are both cast to floating point numbers and the population is cast to an integer. See Casting Values for details on available casting types.
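
The two ways of attaching a cast, the name:cast key form and the col/cast object form, can be sketched with the standard `csv` module. The sample row, the `CASTS` table, and the `csv_select` helper below are all made up for illustration and do not reflect the library's internals.

```python
import csv
import io

CASTS = {"int": int, "float": float, "str": str}

# Hypothetical tab-delimited row standing in for a real data file.
TEXT = "1\tParis\t48.8566\t2.3522\t2148000\n"

DATA = {
    "id": 1,
    "name": 2,
    "latitude:float": 3,
    "longitude": {"col": 4, "cast": "float"},
    "population:int": 5,
}


def csv_select(rows, data):
    """Yield one dict per row, picking 1 based columns and applying casts."""
    for row in rows:
        record = {}
        for key, spec in data.items():
            name, _, cast = key.partition(":")
            if isinstance(spec, dict):
                column, cast = spec["col"], spec.get("cast", cast)
            else:
                column = spec
            # Columns without a cast stay as strings.
            record[name] = CASTS.get(cast, str)(row[column - 1])
        yield record


records = list(csv_select(csv.reader(io.StringIO(TEXT), delimiter="\t"), DATA))
print(records)
```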

weighted_csv

This is useful when you have a large number of weighted values that would not fit nicely into a JSON file. You can specify a value and a weight for that value. The default is that the first column in the csv is the value and the second column is the weight. Example CSV:

city,weight
New York,0.65
Los Angeles,0.23
London,0.87
Paris,0.49
Tokyo,0.32
Sydney,0.91
Beijing,0.04
Rio de Janeiro,0.78
Mumbai,0.56
Cape Town,0.38

Prototype:

{
  "<field name>": {
    "type": "weighted_csv",
    "config": {
      "datafile": "filename in datadir",
      "headers": "yes, on, true for affirmative",
      "column": "1 based column number or field name if headers are present",
      "weight_column": "1 based column number or field name if headers are present where weights are defined",
      "delimiter": "how values are separated, default is comma",
      "quotechar": "how values are quoted, default is double quote",
      "sample": "If the values should be selected at random, default is false",
      "count": "Number of values in column to use for value"
    }
  }
}

Examples:

{
  "cities": {
    "type": "weighted_csv",
    "config": {
      "datafile": "weighted_cities.csv"
    }
  }
}
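
A value column paired with a weight column maps naturally onto `random.choices`, which accepts relative weights that need not sum to 1. The following sketch reads a shortened copy of the example CSV from a string; the `load_weighted` helper and its parameter names are hypothetical.

```python
import csv
import io
import random

CSV_TEXT = """city,weight
New York,0.65
Los Angeles,0.23
London,0.87
"""


def load_weighted(text, column=1, weight_column=2, headers=True):
    """Return parallel lists of values and weights using 1 based columns."""
    rows = list(csv.reader(io.StringIO(text)))
    if headers:
        rows = rows[1:]  # skip the header line
    values = [row[column - 1] for row in rows]
    weights = [float(row[weight_column - 1]) for row in rows]
    return values, weights


values, weights = load_weighted(CSV_TEXT)
city = random.Random(4).choices(values, weights=weights, k=1)[0]
print(city)
```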

Operator Types

These make use of one or more other fields or references to compute their values.

sample

A sample spec is used to select multiple values from a list to use as the value for a field.

Prototype:

{
  "<field name>": {
    "type": "sample",
    OR
    "type": "sample",
    "config": {
      "mean": N,
      "stddev": N,
      "min": N,
      "max": N,
      or
      "count": N,
      "join_with": "<optional delimiter to join with>"
    },
    "data": ["data", "to", "select", "from"],
    OR
    "ref": "<ref or field with data  as list>"
  }
}

Examples:

{
  "ingredients": {
    "type": "sample",
    "data": ["onions", "mushrooms", "garlic", "bell peppers", "spinach", "potatoes", "carrots"],
    "config": {
      "mean": 3,
      "stddev": 1,
      "min": 2,
      "max": 4,
      "join_with": ", "
    }
  }
}

{
  "ingredients": {
    "type": "sample",
    "data": ["onions", "mushrooms", "garlic", "bell peppers", "spinach", "potatoes", "carrots"],
    "config": {
      "mean": 3,
      "stddev": 1,
      "min": 2,
      "max": 4,
      "join_with": "\", \"",
      "quote": "\""
    }
  }
}

$ datacraft -s sample.json  -i 3 -t 'Ingredients: {{ ingredients | safe }}' -l off
Ingredients: "garlic", "onions"
Ingredients: "mushrooms", "potatoes", "garlic", "bell peppers"
Ingredients: "potatoes", "mushrooms"
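
One way to read the config above: the number of items is drawn from a normal distribution around mean, limited to the range min to max, and that many distinct items are picked and joined. The Python sketch below illustrates that reading. It is an assumption about the behavior, not datacraft's code, and sample_values is a hypothetical name.

```python
import random


def sample_values(data, mean, stddev, min_count, max_count,
                  join_with=", ", quote="", rng=None):
    """Hypothetical helper: pick a normally distributed number of distinct items."""
    rng = rng or random.Random()
    # Draw the count around the mean, then clamp it to [min_count, max_count].
    count = round(rng.gauss(mean, stddev))
    count = max(min_count, min(max_count, count))
    chosen = rng.sample(data, count)  # sampling without replacement
    # Assumption: `quote` wraps the whole joined string, which matches the output above.
    return quote + join_with.join(chosen) + quote


ingredients = ["onions", "mushrooms", "garlic", "bell peppers",
               "spinach", "potatoes", "carrots"]
line = sample_values(ingredients, mean=3, stddev=1, min_count=2, max_count=4,
                     join_with='", "', quote='"', rng=random.Random(7))
print("Ingredients:", line)
```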

combine

A combine Field Spec is used to concatenate or append two or more fields or references to one another. There are two combine types: combine and combine-list.

combine

Prototype:

{
  "<field name>": {
    "type": "combine",
    "fields": ["valid field name1", "valid field name2"],
    OR
    "refs": ["valid ref1", "valid ref2"],
    "config": {
      "join_with": "<optional string to use to join fields or refs, default is none>"
    }
  }
}

Examples:

{
  "combine": {
    "type": "combine",
    "refs": ["first", "last"],
    "config": {
      "join_with": " "
    }
  },
  "refs": {
    "first": {
      "type": "values",
      "data": ["zebra", "hedgehog", "llama", "flamingo"]
    },
    "last": {
      "type": "values",
      "data": ["jones", "smith", "williams"]
    }
  }
}
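
Conceptually, combine takes one value from each listed field or ref for a record and joins them with join_with. A minimal Python sketch of that idea follows. It assumes each ref cycles through its data in order, and combine_records is a hypothetical helper, not part of datacraft.

```python
from itertools import cycle, islice


def combine_records(refs, names, join_with="", iterations=4):
    """Hypothetical helper: join one value per ref for each generated record."""
    # Assumption: every ref yields its values in order and wraps around when exhausted.
    suppliers = [cycle(refs[name]) for name in names]
    records = (join_with.join(str(v) for v in values) for values in zip(*suppliers))
    return list(islice(records, iterations))


refs = {
    "first": ["zebra", "hedgehog", "llama", "flamingo"],
    "last": ["jones", "smith", "williams"],
}
names = combine_records(refs, ["first", "last"], join_with=" ")
print(names)
# ['zebra jones', 'hedgehog smith', 'llama williams', 'flamingo jones']
```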

combine-list

Prototype:

{
  "<field name>": {
    "type": "combine-list",
    "refs": [
      ["valid ref1", "valid ref2"],
      ["valid ref1", "valid ref2", "valid_ref3", ...], ...
      ["another_ref", "one_more_ref"]
    ],
    "config": {
      "join_with": "<optional string to use to join fields or refs, default is none>"
    }
  }
}

Examples:

{
  "full_name": {
    "type": "combine-list",
    "refs": [
      ["first", "last"],
      ["first", "middle", "last"],
      ["first", "middle_initial", "last"]
    ],
    "config": {
      "join_with": " "
    }
  },
  "refs": {
    "first": {
      "type": "values",
      "data": ["zebra", "hedgehog", "llama", "flamingo"]
    },
    "last": {
      "type": "values",
      "data": ["jones", "smith", "williams"]
    },
    "middle": {
      "type": "values",
      "data": ["cloud", "sage", "river"]
    },
    "middle_initial": {
      "type": "values",
      "data": {"a": 0.3, "m": 0.3, "j": 0.1, "l": 0.1, "e": 0.1, "w": 0.1}
    }
  }
}

calculate

There are times when one field needs the value of another field in order to calculate its own value. For example, if you wanted to produce values that represent a user’s height in inches and in centimeters, you would want them to correlate. You can use the calculate type to specify a formula for this calculation. There are two ways to specify the fields to calculate a value from. The first is to use the fields and/or the refs keys with an array of fields or refs to use in the formula. The second is to use a map where each field or ref name is mapped to a string that serves as its alias in the formula. See the second example below for the mapped alias version.

Prototype:

{
  "<field name>": {
    "type": "calculate",
    "fields": List[str],
    or
    "refs": List[str],
    "formula": <formula>,
    "config": {
      "key": Any
    }
  }
}

formula (str): The formula to use in calculations

Examples:

{
  "height_in": [60, 70, 80, 90],
  "height_cm": {
    "type": "calculate",
    "fields": ["height_in"],
    "formula": "{{ height_in }} * 2.54"
  }
}

{
  "long_name_one": {
    "type": "values",
    "data": [4, 5, 6]
  },
  "long_name_two": {
    "type": "values",
    "data": [3, 6, 9]
  },
  "c": {
    "type": "calculate",
    "fields": {
      "long_name_one": "a",
      "long_name_two": "b"
    },
    "formula": "sqrt({{a}}*{{a}} + {{b}}*{{b}})"
  }
}

We use the asteval package to do formula evaluation. This provides a fairly safe way to do evaluation. The package provides a number of built-in functions as well. We also use the Jinja2 templating engine format for specifying variable names to substitute. In theory, you could use any valid Jinja2 syntax, e.g.:

{
  "formula": "sqrt({{ value_that_might_be_a_string | int }})"
}
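
The two steps described above, substituting field values into the formula and then evaluating it, can be sketched in Python. This is only a stand-in: it uses a regular expression instead of Jinja2 and a restricted eval instead of asteval so that it runs with the standard library alone. It handles plain {{ name }} placeholders only (no filters), and evaluate_formula is a hypothetical name.

```python
import math
import re

# Matches {{ name }} placeholders, with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def evaluate_formula(formula, values):
    """Hypothetical helper: fill in placeholders, then evaluate the arithmetic."""
    expression = PLACEHOLDER.sub(lambda m: repr(values[m.group(1)]), formula)
    # Stand-in for asteval: expose only sqrt and no builtins.
    # This is NOT safe for untrusted formulas.
    return eval(expression, {"__builtins__": {}}, {"sqrt": math.sqrt})


# Aliases a and b stand in for long_name_one and long_name_two.
result = evaluate_formula("sqrt({{a}}*{{a}} + {{b}}*{{b}})", {"a": 4, "b": 3})
print(result)  # 5.0
```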

templated

A templated Field Spec is used to create strings by injecting the values from other fields into them. The other fields must be defined. The values can come from references or other defined fields. Use the jinja2 {{ field }} syntax to signify where the field should be injected.

Prototype:

{
  "<field name>": {
    "type": "templated",
    "data": "string with {{ jinja2 }} syntax fields",
    "fields": ["valid field name1", "valid field name2"],
    OR
    "refs": ["valid ref1", "valid ref2"]
  }
}

Examples:

{
  "user_agent": {
    "type": "templated",
    "data": "Mozilla/5.0 ({{ system }}) {{ platform }}",
    "refs": ["system", "platform"]
  },
  "refs": {
    "system": {
      "type": "values",
      "data": [
        "Windows NT 6.1; Win64; x64; rv:47.0",
        "Macintosh; Intel Mac OS X x.y; rv:42.0"
      ]
    },
    "platform": {
      "type": "values",
      "data": ["Gecko/20100101 Firefox/47.0", "Gecko/20100101 Firefox/42.0"]
    }
  }
}
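
The injection step can be illustrated with a small Python sketch. It is a simplified stand-in for Jinja2 rendering that only understands plain {{ name }} placeholders, and render is a hypothetical helper rather than anything datacraft exposes.

```python
import re

# Matches {{ name }} placeholders, with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Hypothetical helper: replace each {{ name }} with the matching value."""
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


user_agent = render(
    "Mozilla/5.0 ({{ system }}) {{ platform }}",
    {
        "system": "Windows NT 6.1; Win64; x64; rv:47.0",
        "platform": "Gecko/20100101 Firefox/47.0",
    },
)
print(user_agent)
# Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
```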

replace

Replace one or more parts of the output of a field or reference. Values to replace should be specified as strings. Values to replace with should also be strings.

Prototype:

{
  "<field name>": {
    "type": "replace",
    "ref": "<field or ref to source value from>",
    "data": {
      "<value to replace 1>": "<value to replace with 1>",
      ...
      "<value to replace N>": "<value to replace with N>"
    }
  }
}

Examples:

{
  "id": {
    "type": "uuid"
  },
  "remove_dashes": {
    "type": "replace",
    "ref": "id",
    "data": { "-": "" }
  }
}

$ datacraft --spec uuid-spec.json -i 3 -r 1 -x -l off --format json
{"id": "e809af25-bd85-4118-a5e9-cfdc953e172b", "remove_dashes": "1622e5cf2f334b81a90a6c031e0f78bf"}
{"id": "2a98b892-bb73-49de-8186-fa7cb4510001", "remove_dashes": "9c1d22d6f6e544bb8c0d582c441a1c78"}
{"id": "7986c789-1e5c-46f1-b5f1-a095f6a75209", "remove_dashes": "b50e914ea7994b6bb3194ce8c3402c8e"}
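
The replacement itself is plain literal string substitution. A minimal Python sketch of that step, with replace_all as a hypothetical helper that applies each pair in the order given:

```python
import uuid


def replace_all(value, replacements):
    """Hypothetical helper: apply each literal replacement in turn."""
    for old, new in replacements.items():
        value = value.replace(old, new)
    return value


identifier = str(uuid.uuid4())
compact = replace_all(identifier, {"-": ""})
print(identifier, compact)
```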

regex_replace

Replace one or more parts of the output of a field or reference using regular expressions to match the value strings. Note that masked is an alias for this type.

Prototype:

{
  "<field name>": {
    "type": "regex_replace|masked",
    "ref": "<field or ref to source value from>",
    "data": {
      "<regex 1>": "<value to replace with 1>",
      ...
      "<regex N>": "<value to replace with N>"
    }
    OR
    "data": "<replace all values with this>"
  }
}

Examples:

This first example will take a 10 digit string of numbers and format it as a phone number. The double backslashes are JSON escapes, so each pattern contains a single backslash when the strings are compiled into regular expressions. Notice the \N format for specifying the group capture replacement.

{
  "phone": {
    "type": "regex_replace",
    "ref": "ten_digits",
    "data": {
      "^(\\d{3})(\\d{3})(\\d{4})": "(\\1) \\2-\\3"
    }
  },
  "refs": {
    "ten_digits": {
      "type": "cc-digits",
      "config": {
        "count": 10,
        "buffer": true
      }
    }
  }
}

$ datacraft --spec phone-spec.json -i 4 -r 1 -x -l off --format json
{"phone": "(773) 542-6190"}
{"phone": "(632) 956-3481"}
{"phone": "(575) 307-4587"}
{"phone": "(279) 788-3403"}
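
The same pattern and replacement can be tried directly with Python's re module, which is a convenient way to check a rule before putting it in a spec. This is a sketch of the idea only, and regex_replace_all is a hypothetical helper name.

```python
import re


def regex_replace_all(value, replacements):
    """Hypothetical helper: apply each regex replacement in turn."""
    for pattern, replacement in replacements.items():
        value = re.sub(pattern, replacement, value)
    return value


# After JSON decoding, "\\d" becomes \d and "\\1" becomes \1.
# Raw strings express the same patterns in Python.
rules = {r"^(\d{3})(\d{3})(\d{4})": r"(\1) \2-\3"}
phone = regex_replace_all("7735426190", rules)
print(phone)  # (773) 542-6190
```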

Masked Example

The masked type is an alias for regex_replace. One mode for this type is to replace all the values with a specified value, for example:

{
  "masked_ssn": {
    "type": "masked",
    "ref": "ssn",
    "data": "NNN-NN-NNNN"
  },
  "age:rand_int_range": [18, 99],
  "refs": {
    "ssn": [
      "123-45-6789",
      "111-22-3333",
      "555-55-5555"
    ]
  }
}

$ datacraft.exe -s ssn.json -i 3 --format csvh -x -l off
masked_ssn,age
NNN-NN-NNNN,40
NNN-NN-NNNN,42
NNN-NN-NNNN,73
