Core Types

These are the built-in field spec types, organized by the type of data they generate or by their function or utility.

Strings

For generating strings in various formats

char_class

A char_class type is used to create strings that are made up of characters from specific character classes. The strings can be of fixed or variable length. There are several built in character classes. You can also provide your own set of characters to sample from. Below is the list of supported character classes:

Built In Classes

class        description
ascii        All valid ascii characters including control
lower        ascii lowercase
upper        ascii uppercase
digits       Numbers 0 through 9
letters      lowercase and uppercase
word         letters + digits + '_'
printable    All printable ascii chars including whitespace
visible      All printable ascii chars excluding whitespace
punctuation  locale specific punctuation
special      locale specific punctuation
hex          Hexadecimal digits including upper and lower case a-f
hex-lower    Hexadecimal digits only including lower case a-f
hex-upper    Hexadecimal digits only including upper case A-F

Prototype:

{
  "<field name>": {
    "type": "char_class",
    "data": <char_class_name>,
    or
    "type": "cc-<char_class_name>",
    or
    "type": "char_class",
    "data": <string with custom set of characters to sample from>,
    or
    "type": "char_class",
    "data": [<char_class_name1>, <char_class_name2>, ..., <custom characters>],
    "config":{
      "exclude": <string of characters to exclude from output>,
      "escape": <string of characters to escape in output e.g. " -> \\", useful if non JSON output>,
      "escape_str": <string to use for escaping, default is \>,
      "min": <min number of characters in string>,
      "max": <max number of characters in string>,
      or
      "count": <exact number of characters in string>
      or
      "mean": <mean number of characters in string>
      "stddev": <standard deviation from mean for number of characters in string>
      "min": <optional min>
      "max": <optional max>
    }
  }
}

Examples:

{
  "password": {
    "type": "char_class",
    "data": ["word", "special", "hex-lower", "M4$p3c!@l$@uc3"],
    "config": {
      "mean": 14,
      "stddev": 2,
      "min": 10,
      "max": 18,
      "exclude": ["-", "\""]
    }
  }
}
$ datacraft -s spec.json -i 4 -r 1 -x -l off --format json
{"password": "c1hbR&V!sYi4+Em"}
{"password": "Z7Qd0AM>$f7'"}
{"password": "9Bh8Z%6?ed4g"}
{"password": "sqQ&I!Ucdp"}
{
  "one_to_five_digits:cc-digits?min=1&max=5": {}
}
$ datacraft -s spec.json -i 4 -r 1 -x -l off --format json
{"one_to_five_digits": "43040"}
{"one_to_five_digits": "5"}
{"one_to_five_digits": "6914"}
{"one_to_five_digits": "752"}
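
The sampling behavior described above can be sketched in plain Python. This is only an illustration, not datacraft's implementation: the `CLASSES` mapping and the `sample_chars` helper are hypothetical names, only a subset of the classes is mapped, and it assumes the length is drawn from a normal distribution and then clamped to the min and max.

```python
import random
import string

# Hypothetical subset of the built-in classes, mapped to Python's string constants.
CLASSES = {
    "lower": string.ascii_lowercase,
    "upper": string.ascii_uppercase,
    "digits": string.digits,
    "word": string.ascii_letters + string.digits + "_",
    "special": string.punctuation,
    "hex-lower": string.digits + "abcdef",
}


def sample_chars(data, mean, stddev, min_len, max_len, exclude="", rng=random):
    """Build one string from the pooled classes; unknown names count as custom characters."""
    pool = "".join(CLASSES.get(item, item) for item in data)
    pool = [c for c in pool if c not in exclude]
    # Assumption: length is normally distributed, then clamped to [min_len, max_len].
    length = round(rng.gauss(mean, stddev))
    length = max(min_len, min(max_len, length))
    return "".join(rng.choice(pool) for _ in range(length))


password = sample_chars(
    ["word", "special", "hex-lower", "M4$p3c!@l$@uc3"],
    mean=14, stddev=2, min_len=10, max_len=18, exclude='-"',
)
```

Characters that appear in more than one class occur several times in the pool, so in this sketch they are slightly more likely to be picked.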

unicode_range

Generates strings from unicode ranges

Prototype:

{
  "<field>": {
    "type": "unicode_range",
    "data": [<start_code_point_in_hex>, <end_code_point_in_hex>],
    or
    "data": [
        [<start_code_point_in_hex>, <end_code_point_in_hex>],
        [<start_code_point_in_hex>, <end_code_point_in_hex>],
        ...
        [<start_code_point_in_hex>, <end_code_point_in_hex>],
    ],
    "config":{
      # String Size Based Config Parameters
      "min": <min number of characters in string>,
      "max": <max number of characters in string>,
      or
      "count": <exact number of characters in string>
      or
      "mean": <mean number of characters in string>
      "stddev": <standard deviation from mean for number of characters in string>
      "min": <optional min>
      "max": <optional max>
    }
  }
}

Examples:

{
  "text": {
    "type": "unicode_range",
    "data": ["3040", "309f"],
    "config": {
      "mean": 5
    }
  }
}
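
As a rough illustration of what a unicode_range spec describes, the sketch below picks characters whose code points fall in the given hex ranges. The `unicode_string` helper is hypothetical, and treating the end code point as inclusive is an assumption.

```python
import random


def unicode_string(ranges, length, rng=random):
    """Pick `length` characters whose code points fall in the given hex ranges."""
    # Accept either a single [start, end] pair or a list of such pairs.
    if isinstance(ranges[0], str):
        ranges = [ranges]
    code_points = []
    for start_hex, end_hex in ranges:
        # Assumption: the end code point is inclusive.
        code_points.extend(range(int(start_hex, 16), int(end_hex, 16) + 1))
    return "".join(chr(rng.choice(code_points)) for _ in range(length))


# Five characters from the Hiragana block used in the example above.
text = unicode_string(["3040", "309f"], 5)
```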

uuid

A standard uuid

Prototype:

{
  "<field name>": {
    "type": "uuid",
    "config": {
      "variant": 1, 3, 4, or 5, default is 4, optional
    }
  }
}

Examples:

{
  "id": {
    "type": "uuid"
  },
  "id_shorthand:uuid": {},
  "id_variant3": {
    "type": "uuid",
    "config": {
      "variant": 3
    }
  }
}

Numeric

For generating numeric values in different ways.

range_suppliers

There are two main range types: sequential and random. A sequential range is specified using the range type; a random one uses the rand_range type.

range

Prototype:

{
  "<field name>": {
    "type": "range",
    "data": [<start>, <end>, <step> (optional)],
    or
    "data": [
      [<start>, <end>, <step> (optional)],
      [<start>, <end>, <step> (optional)],
      ...
      [<start>, <end>, <step> (optional)],
    ],
  }
}

start: (Union[int, float]) - start of range
end: (Union[int, float]) - end of range
step: (Union[int, float]) - step for range, default is 1

Examples:

{
  "zero_to_ten_step_half": {
    "type": "range",
    "data": [0, 10, 0.5]
  }
}
{
  "range_shorthand1:range": {
    "data": [0, 10, 0.5]
  }
}
{"range_shorthand2:range": [0, 10, 0.5]}

rand_range

Generates a random floating point number in the given range. Use the rand_int_range type as a shortcut for casting the numbers as integers.

Prototype:

{
  "<field name>": {
    "type": "rand_range",
    "data": [<upper>],
    or
    "data": [<lower>, <upper>],
    or
    "data": [<lower>, <upper>, <precision> (optional)]
  }
}

upper: (Union[int, float]) - upper limit of random range
lower: (Union[int, float]) - lower limit of random range
precision: (int) - Number of digits after decimal point

Examples:

{
  "zero_to_ten_three_decimals": {
    "type": "rand_range",
    "data": [0, 10, 3]
  }
}
{
  "int_in_range": {
    "type": "rand_int_range",
    "data": [1, 100]
  }
}
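
The three forms of the data element can be read as in the sketch below. It is not the library's code: the helper names are hypothetical, a lower bound of 0 for the single-element form is an assumption, and so is truncation for the integer variant.

```python
import random


def rand_range(data, rng=random):
    """Interpret [upper], [lower, upper] or [lower, upper, precision]."""
    if len(data) == 1:
        lower, upper, precision = 0, data[0], None  # assumption: lower defaults to 0
    elif len(data) == 2:
        lower, upper, precision = data[0], data[1], None
    else:
        lower, upper, precision = data
    value = rng.uniform(lower, upper)
    return value if precision is None else round(value, precision)


def rand_int_range(data, rng=random):
    # Assumption: casting to int simply truncates the float.
    return int(rand_range(data[:2], rng))


three_decimals = rand_range([0, 10, 3])
whole_number = rand_int_range([1, 100])
```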

distribution

A distribution spec can be built from one of the registered distribution types. Below is the table of the built in ones. Custom distributions can be registered using Custom Code Loading. See Custom Count Distributions for an example.

distribution  required arguments  optional args  examples
uniform       start,end                          "uniform(start=10, end=30)", "uniform(start=1, end=3)"
gauss         mean,stddev         min,max        "gauss(mean=2, stddev=1)"
gaussian      mean,stddev         min,max        "gaussian(mean=7, stddev=1, min=4)"
normal        mean,stddev         min,max        "normal(mean=25, stddev=10, max=40)"

Prototype:

{
  "<field name>": {
    "type": "distribution",
    "data": "<dist func name>(<param1>=<val1>, ..., <paramN>=<valN>)"
  }
}

Examples:

{
  "values": {
    "type": "distribution",
    "data": "uniform(start=10, end=30)"
  }
}
{
  "age": {
    "type": "distribution",
    "data": "normal(mean=28, stddev=10, min=18, max=40)",
    "config": {"cast": "int"}
  }
}
{
  "pressure": {
    "type": "distribution",
    "data": "gauss(mean=33, stddev=3.4756535)",
    "config": {
      "count_dist": "normal(mean=2, stddev=1, min=1, max=4)",
      "as_list": true
    }
  }
}

A distribution type field with a uniform distribution will produce similar values to a rand_range field. With rand_range it is easier to specify a specific number of decimal places to keep. To do this for the distribution type, you need to make use of the cast config with a roundN caster. See example below.

{
  "values1": {
    "type": "rand_range",
    "data": [10, 30, 4]
  },
  "values2": {
    "type": "distribution",
    "data": "uniform(start=10, end=30)",
    "config": {
      "cast": "round4"
    }
  },
  "values3:rand_range": [10, 30, 4],
  "values4:distribution?cast=round4": "uniform(start=10, end=30)"
}
$ datacraft -s spec.json -i2 --log-level off --printkey
values1 -> 29.7907
values2 -> 18.9114
values3 -> 13.5495
values4 -> 15.5935
values1 -> 22.0634
values2 -> 17.8552
values3 -> 22.982
values4 -> 20.5616

Date & Time

For generating dates and timestamps in a variety of formats

date

A date field spec is used to generate date strings. The default format is day-month-year, i.e. Christmas 2050 would be 25-12-2050. There is also a date.iso type that generates ISO8601 formatted date strings without microseconds, date.iso.ms and date.iso.millis types that generate them with milliseconds, and date.iso.us and date.iso.micros types that generate them with microseconds. The date.epoch, date.epoch.ms, and date.epoch.millis types generate unix epoch timestamps. Format strings follow the format specification of the Python datetime module.

type               example output
date               18-11-2050
date.iso           2050-12-01T01:44:35
date.iso.ms        2050-12-01T05:11:20.543
date.iso.millis    2050-12-01T05:11:20.543
date.iso.us        2050-12-01T06:19:02.752373
date.iso.micros    2050-12-01T06:17:05.487878
date.epoch         1669825519
date.epoch.ms      1668624934547
date.epoch.millis  1669166880466

Uniformly Sampled Dates

The default strategy is to create random dates within a 30 day range, where the start date is today. You can use the start parameter to set a specific start date for the dates. You can also explicitly specify an end date. The start and end parameters should conform to the specified date format, or the default if none is provided. The offset parameter can be used to shift the dates by a specified number of days. A positive offset will shift the start date back. A negative offset will shift the date forward. The duration_days parameter can be used to specify the number of days that should be covered in the date range, instead of the default 30 days. This parameter is usually specified as an integer constant.

     start                              end (default start + 30 days)
        |--------------------------------|
|+offset|                           start+duration_days
|--------------------------------|
        |-offset|
                |--------------------------------|
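
The window arithmetic in the diagram above can be worked through with the standard datetime module. This is a sketch of the described semantics only; `date_window` and `random_date` are hypothetical helpers, not part of datacraft.

```python
import random
from datetime import datetime, timedelta


def date_window(start, duration_days=30, offset=0):
    """Return (window_start, window_end) after applying the offset in days."""
    # A positive offset moves the start back; a negative one moves it forward.
    window_start = start - timedelta(days=offset)
    return window_start, window_start + timedelta(days=duration_days)


def random_date(start, duration_days=30, offset=0, rng=random):
    """Pick a uniformly distributed datetime inside the window."""
    lo, hi = date_window(start, duration_days, offset)
    return lo + (hi - lo) * rng.random()


# A 90 day window starting 15 Dec 2050 ends on 15 Mar 2051.
ninety = date_window(datetime(2050, 12, 15), duration_days=90)
# offset=-14 with duration_days=7 covers days 14 through 21 after the start.
future = date_window(datetime(2050, 1, 1), duration_days=7, offset=-14)
```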

Dates Distributed around a Center Point

An alternative strategy is to specify a center_date parameter with an optional stddev_days. This will create a normal or gaussian distribution of dates around the center point.

                   |
                   |
                |  |  |
             |  |  |  |  |
          |  |  |  |  |  |  |
 |  |  |  |  |  |  |  |  |  |  |  |  |
|-------------------------------------|
|         | stddev | stddev |         |
                center
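
A minimal sketch of the center point strategy, assuming the offset from the center is drawn from a normal distribution measured in days. `date_around` is a hypothetical helper, not the library's function.

```python
import random
from datetime import datetime, timedelta


def date_around(center, stddev_days, rng=random):
    """Return a datetime normally distributed around `center`."""
    return center + timedelta(days=rng.gauss(0, stddev_days))


center = datetime(2050, 6, 1, 12, 0)
rng = random.Random(0)  # seeded only so the sketch is repeatable
dates = [date_around(center, 2, rng) for _ in range(1000)]
```

With a standard deviation of 2 days, roughly two thirds of the generated dates should land within 2 days of the center.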

Restricting Hours

If you want your generated dates to be restricted to certain hours of the day, you can provide the hours config param. The value of this parameter can be any type of Field Spec that produces valid integers in the range of 0 to 23. See examples below.

Prototype:

{
  "<field name>": {
    "type": "date",
    OR,
    "type": "date.iso",
    OR,
    "type": "date.iso.ms",
    OR,
    "type": "date.iso.millis",
    OR,
    "type": "date.iso.us",
    OR,
    "type": "date.iso.micros",
    "data": "replacement for config.format, valid for type: date only",
    "config": {
      "format": "Valid datetime format string",
      "duration_days": "The number of days from the start date to create date strings for",
      "start": "date string matching format or default format to use for start date",
      "end": "date string matching format or default format to use for end date",
      "offset": "number of days to shift base date by, positive means shift backwards, negative means forward",
      "center_date": "date string matching format or default format to use for center date",
      "stddev_days": "The standard deviation in days from the center date that dates should be distributed",
      "hours": "spec describing how the hours should be populated, i.e. only between 9am and 5pm"
    }
  }
}

Examples:

Dates that start on 15 Dec 2050 and span a 90 day period

{
  "dates": {
    "type": "date",
    "config": {
      "duration_days": "90",
      "start": "15-Dec-2050 12:00",
      "format": "%d-%b-%Y %H:%M"
    }
  }
}

Dates centered on 01 Jun 2050 with a standard deviation of +-2 days

{
  "dates": {
    "type": "date",
    "config": {
      "center_date": "20500601 12:00",
      "format": "%Y%m%d %H:%M",
      "stddev_days": "2"
    }
  }
}

ISO Date Centered at 1 Jun 2050, with weighted hours of the day

{
  "start_time": {
    "type": "date.iso",
    "config": {
      "center_date": "2050-06-01T12:00:00",
      "hours": { "type": "values", "data": { "7": 0.1, "8": 0.2, "9": 0.4, "10": 0.2, "11": 0.1 } }
    }
  }
}

Epoch Date with milliseconds 14 days in the future with a 7 day window for timestamps

{
  "start_time": {
    "type": "date.epoch.ms",
    "config": {
      "offset": -14,
      "duration_days": 7
    }
  }
}

Date format in data element using shorthand notation

{
  "start_time:date": "%d-%b-%Y %H:%M"
}

Equivalent to

{
  "start_time": {
    "type": "date",
    "data": "%d-%b-%Y %H:%M"
  }
}

Geographic

For generating basic decimal degrees of latitude and longitude

geo types

There are three main geo types: geo.lat, geo.long, and geo.pair. The defaults will create decimal string values in the valid ranges: -90 to 90 for latitude and -180 to 180 for longitude. You can bound the ranges in several ways. The first is with the start_lat, end_lat, start_long, end_long config params. These will set the individual bounds for each of the segments. You can use one or more of them. The other mechanism is by defining a bbox array which consists of the lower left geo point and the upper right one.

type      param       description
all       precision   number of decimal places for lat or long, default is 4
all       bbox        array of [min Longitude, min Latitude, max Longitude, max Latitude]
geo.lat   start_lat   lower bound for latitude
geo.lat   end_lat     upper bound for latitude
geo.long  start_long  lower bound for longitude
geo.long  end_long    upper bound for longitude
geo.pair  join_with   delimiter to join long and lat with, default is comma
geo.pair  as_list     One of yes, true, or on if the pair should be returned as a list instead of as a joined string
geo.pair  lat_first   if latitude should be first in the generated pair, default is longitude first
geo.pair  start_lat   lower bound for latitude
geo.pair  end_lat     upper bound for latitude
geo.pair  start_long  lower bound for longitude
geo.pair  end_long    upper bound for longitude

Prototype:

{
  "<field name>": {
    "type": "geo.lat",
    or
    "type": "geo.long",
    or
    "type": "geo.pair",
    "config": {
      "key": Any
    }
  }
}

Examples:

{
  "egypt": {
    "type": "geo.pair",
    "config": {
      "bbox": [
        31.33134,
        22.03795,
        34.19295,
        25.00562
      ],
      "precision": 3
    }
  }
}
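
To make the bbox ordering concrete, the sketch below samples a pair inside a bounding box using the documented defaults (longitude first, comma delimiter). `geo_pair` is a hypothetical helper, and uniform sampling within the box is an assumption.

```python
import random


def geo_pair(bbox, precision=4, join_with=",", lat_first=False, rng=random):
    """Sample one longitude/latitude pair inside bbox and join it into a string."""
    min_long, min_lat, max_long, max_lat = bbox
    long = round(rng.uniform(min_long, max_long), precision)
    lat = round(rng.uniform(min_lat, max_lat), precision)
    parts = [lat, long] if lat_first else [long, lat]
    return join_with.join(str(p) for p in parts)


# The bounding box from the example above: [min long, min lat, max long, max lat].
EGYPT = [31.33134, 22.03795, 34.19295, 25.00562]
pair = geo_pair(EGYPT, precision=3)
```

Because values are rounded after sampling, a result can differ from the exact bound by up to one unit of the last decimal place.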

Network

Network related types

ip/ipv4

IP addresses can be generated using CIDR notation or by specifying a base. For example, if you wanted to generate IPs in the 10.0.0.0 to 10.0.0.255 range, you could either specify a cidr param of 10.0.0.0/24 or a base param of 10.0.0.

Prototype:

{
  "<field name>": {
    "type": "ipv4",
    "config": {
      "cidr": "<cidr value /8 /16 /24 only>",
      OR
      "base": "<beginning of ip i.e. 10.0>"
    }
  }
}

Examples:

{
  "network": {
    "type": "ipv4",
    "config": {
      "cidr": "2.22.222.0/16"
    }
  },
  "network_shorthand:ip?cidr=2.22.222.0/16": {},
  "network_with_base:ip?base=192.168.0": {}
}
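
The equivalence between a /24 cidr and a three octet base can be checked with Python's standard ipaddress module. This only illustrates the address ranges involved; `in_base` is a hypothetical helper and says nothing about how datacraft generates the addresses.

```python
import ipaddress

# The /24 network from the description above covers 256 addresses.
network = ipaddress.ip_network("10.0.0.0/24")
first, last = network[0], network[-1]  # 10.0.0.0 and 10.0.0.255


def in_base(ip, base):
    """True if the dotted-quad `ip` starts with the octets given in `base`."""
    base_octets = base.rstrip(".").split(".")
    return ip.split(".")[: len(base_octets)] == base_octets


# Every address in the /24 also matches the base "10.0.0".
all_match = all(in_base(str(addr), "10.0.0") for addr in network)
```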

ip.precise

The default ip type only supports cidr masks of /8 /16 and /24. If you want more precise ip ranges you need to use the ip.precise type. This type requires a cidr as the single config param. The default mode for ip.precise is to increment the ip addresses. Set config param sample to one of true, on, or yes to enable random ip addresses selected from the generated ranges.

Prototype:

{
  "<field name>": {
    "type": "ip.precise",
    "config": {
      "cidr": "<valid cidr value>",
      "sample": "<optional, one of true, on, or yes to select ip addresses at random>"
    }
  }
}

Examples:

{
  "network": {
    "type": "ip.precise",
    "config": {
      "cidr": "192.168.0.0/14",
      "sample": "true"
    }
  }
}

net.mac

For creating MAC addresses

Prototype:

{
  "<field name>": {
    "type": "net.mac",
    "config": {
      "dashes": "If dashes should be used as the separator one of on, yes, 'true', or True"
    }
  }
}

Examples:

{
  "network": {
    "type": "net.mac"
  }
}
{
  "network": {
    "type": "net.mac",
    "config": {
      "dashes": "true"
    }
  }
}

Utility/Common

Common types or types that are used in a utility capacity.

values

There are three types of values specs: Constants, List, and Weighted. Values specs have a shorthand notation where the value of the data element replaces the full spec. See examples below.

Prototype:

{
  "<field_name>": {
    "type": "values",
    "data": Union[str, bool, int, float, list, dict],
    "config": {
      "key": "value"
    }
  }
}

Examples:

{"field_constant": {"type": "values", "data": 42}}
{"field_list": {"type": "values", "data": [1, 2, 3, 5, 8, 13]}}
{"field_weighted": {"type": "values", "data": {"200": 0.6, "404": 0.1, "303": 0.3}}}
{"field_weighted_with_null": {"type": "values", "data": {"200": 0.5, "404": 0.1, "303": 0.3, "_NULL_": 0.1}}}
{"shorthand_field_constant": 42}
{"shorthand_field_list": [1, 2, 3, 5, 8, 13]}
{"shorthand_field_weighted": {"200": 0.6, "404": 0.1, "303": 0.3}}
{
    "short_hand_field_weighted_with_null": {
        "type": "values",
        "data": {"200": 0.5, "404": 0.1, "303": 0.3, "_NONE_": 0.1}
    }
}
$ datacraft -s spec.json -i 3 -r 1 --format json -x --log-level off
{"short_hand_field_weighted_with_null": "200"}
{"short_hand_field_weighted_with_null": null}
{"short_hand_field_weighted_with_null": "200"}

Special Output Values

There are certain valid JSON output values that are trickier to produce with a values spec. There are also times when your values are interpreted as strings but you need them to be output as one of these special values. To do this, use a special token of the form _TYPE_. Below are the current mappings of special token to output value:

{
    "_NONE_": null,
    "_NULL_": null,
    "_NIL_": null,
    "_TRUE_": true,
    "_FALSE_": false
}

This is particularly useful when using a weighted values form of the values spec:

{
    "converted": {
        "type": "values",
        "data": {
            "_TRUE_": 0.05,
            "_FALSE_": 0.95
        }
    }
}
$ datacraft -s /tmp/spec.json -i 3 -r 1 --format json -x --log-level off
{"converted": false}
{"converted": false}
{"converted": false}

The special token values can be mixed and matched as well:

{
    "mixed": {
        "type": "values",
        "data": {
            "_NONE_": 0.11,
            "_NULL_": 0.11,
            "_NIL_": 0.11,
            "_TRUE_": 0.33,
            "_FALSE_": 0.33
        }
    }
}
$ datacraft -s /tmp/spec.json -i 3 -r 1 --format json -x --log-level off
{"mixed": false}
{"mixed": true}
{"mixed": null}
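
The combination of weighted selection and special tokens can be sketched as follows. This is an illustration under the mappings listed above, not the library's implementation; `weighted_value` is a hypothetical helper built on `random.choices` from the standard library.

```python
import random

# Token mappings copied from the table above.
SPECIAL = {"_NONE_": None, "_NULL_": None, "_NIL_": None, "_TRUE_": True, "_FALSE_": False}


def weighted_value(data, rng=random):
    """Pick one key according to its weight, then translate special tokens."""
    keys = list(data)
    key = rng.choices(keys, weights=[data[k] for k in keys], k=1)[0]
    return SPECIAL.get(key, key)


rng = random.Random(1)  # seeded only so the sketch is repeatable
converted = [weighted_value({"_TRUE_": 0.05, "_FALSE_": 0.95}, rng) for _ in range(1000)]
```

Keys that are not special tokens, such as "200", are returned unchanged as strings.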

refs

Pointer to a field spec defined in references section

Prototype:

{
  "<field name>": {
    "type": "ref",
    "ref": "<ref_name>",
    or
    "data": <ref_name>,
    "config": {
      "key": Any
    }
  }
}

Examples:

{ "pointer": { "type": "ref", "data": "ref_name" }, "refs": { "ref_name": 42 } }

{ "pointer": { "type": "ref", "ref": "ref_name" }, "refs": { "ref_name": 42 } }

{ "pointer:ref": { "ref": "ref_name" }, "refs": { "ref_name": 42 } }

{ "pointer:ref": { "data": "ref_name" }, "refs": { "ref_name": 42 } }

{ "pointer:ref": "ref_name", "refs": { "ref_name": 42 } }

ref_list

Pointer to Field Specs to be injected into list in order of name. This allows externally defined fields to be injected into specific places in a list of values.

Prototype:

{
  "<field name>": {
    "type": "ref_list",
    "refs": ["<ref_name>", "<ref_name>", ...,"<ref_name>"],
    or
    "data": ["<ref_name>", "<ref_name>", ...,"<ref_name>"],
    "config": {
      "key": Any
    }
  }
}

Example:

In this example we want a location field as a list of [latitude, longitude, altitude]

{
  "location": {
    "type": "ref_list",
    "refs": ["lat", "long", "altitude"]
  },
  "refs": {
    "lat": {
      "type": "geo.lat"
    },
    "long": {
      "type": "geo.long"
    },
    "altitude": {
      "type": "rand_int_range",
      "data": [5000, 10000]
    }
  }
}
$ datacraft -s spec.json -i 1 --format json-pretty -x -l off
[
    {
        "location": [
            -36.7587,
            -40.5453,
            6233
        ]
    }
]

weighted_refs

A weighted_ref spec is used to select the values from a set of refs in a weighted fashion.

Prototype:

{
  "<field name>": {
    "type": "weighted_ref",
    "data": {"valid_ref_1": 0.N, "valid_ref_2": 0.N, ...},
    "config": {
      "key": Any
    }
  }
}

Examples:

{
  "http_code": {
    "type": "weighted_ref",
    "data": {"GOOD_CODES": 0.7, "BAD_CODES": 0.3}
  },
  "refs": {
    "GOOD_CODES": {
      "200": 0.5,
      "202": 0.3,
      "203": 0.1,
      "300": 0.1
    },
    "BAD_CODES": {
      "400": 0.5,
      "403": 0.3,
      "404": 0.1,
      "500": 0.1
    }
  }
}
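
One way to read the example above is as a two stage weighted choice: first pick a ref by weight, then pick a value from that ref. The sketch below works under that assumption; `weighted_pick` and `http_code` are hypothetical helpers.

```python
import random

REFS = {
    "GOOD_CODES": {"200": 0.5, "202": 0.3, "203": 0.1, "300": 0.1},
    "BAD_CODES": {"400": 0.5, "403": 0.3, "404": 0.1, "500": 0.1},
}


def weighted_pick(weights, rng=random):
    """Return one key of `weights`, chosen in proportion to its weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def http_code(rng=random):
    # Stage 1: choose which ref to use. Stage 2: choose a value from that ref.
    ref_name = weighted_pick({"GOOD_CODES": 0.7, "BAD_CODES": 0.3}, rng)
    return weighted_pick(REFS[ref_name], rng)


rng = random.Random(42)  # seeded only so the sketch is repeatable
codes = [http_code(rng) for _ in range(10000)]
```

Under this reading the overall chance of a "200" is 0.7 * 0.5 = 0.35, and the chance of a "400" is 0.3 * 0.5 = 0.15.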

config_ref

Reference for holding configurations common to multiple fields.

Prototype:

{
  "refs": {
    "<config ref name>": {
      "type": "config_ref",
      "config": {
        "key1": Any,
        ...
        "key2": Any
      }
    }
  }
}

Examples:

{
  "status": {
    "type": "csv",
    "config": {
      "column": 1,
      "config_ref": "tabs_config"
    }
  },
  "description": {
    "type": "csv",
    "config": {
      "column": 2,
      "config_ref": "tabs_config"
    }
  },
  "status_type:csv?config_ref=tabs_config&column=3": {},
  "refs": {
    "tabs_config": {
      "type": "config_ref",
      "config": {
        "datafile": "tabs.csv",
        "delimiter": "\t",
        "headers": true
      }
    }
  }
}

nested

Nested types are used to create fields that contain subfields. The subfields are defined in the fields element of the nested spec, and can themselves be nested to allow multiple levels of nesting. The fields element is treated like a top level DataSpec and has access to the refs and other elements of the root.

Prototype:

{
  "<field name>": {
    "type": "nested",
    "config": {
      "count": "Values Spec for Counts, default is 1"
    },
    "fields": {
      "<sub field one>": { spec definition here },
      "<sub field two>": { spec definition here },
      ...
    },
    "field_groups": <field groups format>
  }
}

Examples:

{
  "id": {
    "type": "uuid"
  },
  "user": {
    "type": "nested",
    "fields": {
      "user_id": {
        "type": "uuid"
      },
      "geo": {
        "type": "nested",
        "fields": {
          "place_id:cc-digits?mean=5": {},
          "coordinates:geo.pair?as_list=true": {}
        }
      }
    }
  }
}

The same spec in a slightly more compact format

{
  "id:uuid": {},
  "user:nested": {
    "fields": {
      "user_id:uuid": {},
      "geo:nested": {
        "fields": {
          "place_id:cc-digits?mean=5": {},
          "coordinates:geo.pair?as_list=true": {}
        }
      }
    }
  }
}

Generates the following structure

$ datacraft -s tweet-geo.json --log-level off -x -i 1 --format json-pretty
{
    "id": "68092478-2234-41aa-bcc6-e679950770d7",
    "user": {
        "user_id": "93b3c62e-76ad-4272-b3c1-b434be2c8c30",
        "geo": {
            "place_id": "5104987632",
            "coordinates": [
                -93.0759,
                68.2469
            ]
        }
    }
}

External Data

The csv types are used to input large numbers of values into a spec.

csv types

If you have an existing large set of data in a tabular format that you want to use, it would be burdensome to copy and paste the data into a spec. To make use of data already in a tabular format you can use a csv Field Spec. These specs allow you to identify a column from a tabular data file to use to provide the values for a field. Another advantage of using a csv spec is that it is easy to have fields that are correlated be generated together. All rows will be selected incrementally, unless any of the fields are configured to use sample mode. You can use sample mode on individual columns, or you can use it across all columns by creating a config_ref spec. See csv_select for an efficient way to select multiple columns from a csv file.

csv

Prototype:

{
  "<field name>": {
    "type": "csv",
    "config": {
      "datafile": "filename in datadir",
      "headers": "yes, on, true for affirmative",
      "column": "1 based column number or field name if headers are present",
      "delimiter": "how values are separated, default is comma",
      "quotechar": "how values are quoted, default is double quote",
      "sample": "If the values should be selected at random, default is false",
      "count": "Number of values in column to use for value"
    }
  }
}

Examples:

{
  "cities": {
    "type": "csv",
    "config": {
      "datafile": "cities.csv",
      "delimiter": "~",
      "sample": true
    }
  }
}
{
  "status": {
    "type": "csv",
    "config": {
      "column": 1,
      "config_ref": "tabs_config"
    }
  },
  "description": {
    "type": "csv",
    "config": {
      "column": 2,
      "config_ref": "tabs_config"
    }
  },
  "status_type:csv?config_ref=tabs_config&column=3": {},
  "refs": {
    "tabs_config": {
      "type": "config_ref",
      "config": {
        "datafile": "tabs.csv",
        "delimiter": "\\t",
        "headers": true,
        "sample_rows": true
      }
    }
  }
}

csv_select

Prototype:

{
  "<field name>": {
    "type": "csv_select",
    "data": {
      "<field_one>": <1 based column index for field 1>,
      "<field_two>:<cast>": <1 based column index for field 2>,
      "<field_three>": {
        "col": <1 based column index for field 3>,
        "cast": "<valid cast value i.e. int, float, etc>"
      },
      ...,
      "<field n>": <1 based column index for field n>
    },
    "config": {
      "datafile": "filename in datadir, or templated name i.e. {{ to_be_filled }}",
      "headers": "yes, on, true for affirmative",
      "delimiter": "how values are separated, default is comma",
      "quotechar": "how values are quoted, default is double quote"
    }
  }
}

Examples:

{
  "placeholder": {
    "type": "csv_select",
    "data": {
      "geonameid": 1,
      "name": 2,
      "latitude:float": 5,
      "longitude": { "col": 6, "cast": "float" },
      "country_code": 9,
      "population:int": 15
    },
    "config": {
      "datafile": "allCountries.txt",
      "headers": false,
      "delimiter": "\t"
    }
  }
}

In the example above, the latitude and longitude columns are both cast to floating point numbers and the population is cast to an integer. See Casting Values for details on available casting types.
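
The three ways of naming a column in the data element (plain index, name:cast key, and col/cast object) can be sketched with the standard csv module. This is not the library's code: `select_columns` and `CASTERS` are hypothetical, and the sample row is made up for illustration.

```python
import csv
import io

# Hypothetical subset of casters; the real set is described under Casting Values.
CASTERS = {"int": int, "float": float}


def select_columns(handle, data, delimiter=","):
    """Yield one dict per row, keeping only the configured 1 based columns."""
    for row in csv.reader(handle, delimiter=delimiter):
        record = {}
        for key, spec in data.items():
            name, _, cast = key.partition(":")  # handles the "latitude:float" form
            if isinstance(spec, dict):  # handles the {"col": 6, "cast": "float"} form
                col, cast = spec["col"], spec.get("cast", cast)
            else:
                col = spec
            value = row[col - 1]
            record[name] = CASTERS[cast](value) if cast else value
        yield record


# A made-up tab separated row standing in for a real data file.
sample_file = io.StringIO("1\tSomewhere\t48.85\t2.35\t12345\n")
mapping = {
    "name": 2,
    "latitude:float": 3,
    "longitude": {"col": 4, "cast": "float"},
    "population:int": 5,
}
rows = list(select_columns(sample_file, mapping, delimiter="\t"))
```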

weighted_csv

This is useful when you have a large number of weighted values that would not fit nicely into a JSON file. You can specify a value and a weight for that value. The default is that the first column in the csv is the value and the second column is the weight. Example CSV:

city,weight
New York,0.65
Los Angeles,0.23
London,0.87
Paris,0.49
Tokyo,0.32
Sydney,0.91
Beijing,0.04
Rio de Janeiro,0.78
Mumbai,0.56
Cape Town,0.38

Prototype:

{
  "<field name>": {
    "type": "weighted_csv",
    "config": {
      "datafile": "filename in datadir",
      "headers": "yes, on, true for affirmative",
      "column": "1 based column number or field name if headers are present",
      "weight_column": "1 based column number or field name if headers are present where weights are defined",
      "delimiter": "how values are separated, default is comma",
      "quotechar": "how values are quoted, default is double quote",
      "sample": "If the values should be selected at random, default is false",
      "count": "Number of values in column to use for value"
    }
  }
}

Examples:

{
  "cities": {
    "type": "weighted_csv",
    "config": {
      "datafile": "weighted_cities.csv"
    }
  }
}
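
A minimal sketch of reading value and weight columns and sampling from them, using only the standard library. `load_weighted` is a hypothetical helper, the CSV text is a shortened copy of the example file above, and whether datacraft normalizes the weights is not stated here; `random.choices` treats them as relative weights, so they need not sum to 1.

```python
import csv
import io
import random

CSV_TEXT = """city,weight
New York,0.65
Los Angeles,0.23
Beijing,0.04
"""


def load_weighted(handle, column=1, weight_column=2, headers=True):
    """Return (values, weights) from the 1 based value and weight columns."""
    rows = list(csv.reader(handle))
    if headers:
        rows = rows[1:]  # skip the header row
    values = [row[column - 1] for row in rows]
    weights = [float(row[weight_column - 1]) for row in rows]
    return values, weights


values, weights = load_weighted(io.StringIO(CSV_TEXT))
city = random.choices(values, weights=weights, k=1)[0]
```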

Operator Types

These make use of one or more other fields or references to compute their values.

sample

A sample spec is used to select multiple values from a list to use as the value for a field.

Prototype:

{
  "<field name>": {
    "type": "sample",
    "config": {
      "mean": N,
      "stddev": N,
      "min": N,
      "max": N,
      or
      "count": N,
      "join_with": "<optional delimiter to join with>"
    },
    "data": ["data", "to", "select", "from"],
    OR
    "ref": "<ref or field with data as list>"
  }
}

Examples:

{
  "ingredients": {
    "type": "sample",
    "data": ["onions", "mushrooms", "garlic", "bell peppers", "spinach", "potatoes", "carrots"],
    "config": {
      "mean": 3,
      "stddev": 1,
      "min": 2,
      "max": 4,
      "join_with": ", "
    }
  }
}
{
  "ingredients": {
    "type": "sample",
    "data": ["onions", "mushrooms", "garlic", "bell peppers", "spinach", "potatoes", "carrots"],
    "config": {
      "mean": 3,
      "stddev": 1,
      "min": 2,
      "max": 4,
      "join_with": "\", \"",
      "quote": "\""
    }
  }
}
$ datacraft -s sample.json  -i 3 -t 'Ingredients: {{ ingredients | safe }}' -l off
Ingredients: "garlic", "onions"
Ingredients: "mushrooms", "potatoes", "garlic", "bell peppers"
Ingredients: "potatoes", "mushrooms"
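
The interaction of join_with and quote in the second example can be reproduced with the sketch below. It rests on several assumptions: values are drawn without replacement, the count is a clamped normal draw, and quote wraps the joined string as a whole, which is consistent with the output shown above. `sample_values` is a hypothetical helper.

```python
import random

INGREDIENTS = ["onions", "mushrooms", "garlic", "bell peppers", "spinach", "potatoes", "carrots"]


def sample_values(data, mean, stddev, min_count, max_count, join_with=None, quote="", rng=random):
    """Pick a clamped, normally distributed number of distinct values from data."""
    count = round(rng.gauss(mean, stddev))
    count = max(min_count, min(max_count, count))
    picked = rng.sample(data, count)  # assumption: sampling without replacement
    if join_with is None:
        return picked
    # Assumption: quote wraps the joined string, not each individual value.
    return quote + join_with.join(picked) + quote


line = sample_values(INGREDIENTS, 3, 1, 2, 4, join_with='", "', quote='"')
```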

combine

A combine Field Spec is used to concatenate or append two or more fields or references to one another. There are two combine types: combine and combine-list.

combine

Prototype:

{
  "<field name>": {
    "type": "combine",
    "fields": ["valid field name1", "valid field name2"],
    OR
    "refs": ["valid ref1", "valid ref2"],
    "config": {
      "join_with": "<optional string to use to join fields or refs, default is none>"
    }
  }
}

Examples:

{
  "combine": {
    "type": "combine",
    "refs": ["first", "last"],
    "config": {
      "join_with": " "
    }
  },
  "refs": {
    "first": {
      "type": "values",
      "data": ["zebra", "hedgehog", "llama", "flamingo"]
    },
    "last": {
      "type": "values",
      "data": ["jones", "smith", "williams"]
    }
  }
}

combine-list

Prototype:

{
  "<field name>": {
    "type": "combine-list",
    "refs": [
      ["valid ref1", "valid ref2"],
      ["valid ref1", "valid ref2", "valid_ref3", ...], ...
      ["another_ref", "one_more_ref"]
    ],
    "config": {
      "join_with": "<optional string to use to join fields or refs, default is none>"
    }
  }
}

Examples:

{
  "full_name": {
    "type": "combine-list",
    "refs": [
      ["first", "last"],
      ["first", "middle", "last"],
      ["first", "middle_initial", "last"]
    ],
    "config": {
      "join_with": " "
    }
  },
  "refs": {
    "first": {
      "type": "values",
      "data": ["zebra", "hedgehog", "llama", "flamingo"]
    },
    "last": {
      "type": "values",
      "data": ["jones", "smith", "williams"]
    },
    "middle": {
      "type": "values",
      "data": ["cloud", "sage", "river"]
    },
    "middle_initial": {
      "type": "values",
      "data": {"a": 0.3, "m": 0.3, "j": 0.1, "l": 0.1, "e": 0.1, "w": 0.1}
    }
  }
}

calculate

There are times when one field needs the value of another field in order to calculate its own value. For example, if you wanted to produce values that represented a user's height in inches and in centimeters, you would want them to correlate. You could use the calculate type to specify a formula to do this calculation. There are two ways to specify the fields to calculate a value from. The first is to use the fields and/or the refs keys with an array of fields or refs to use in the formula. The second is to use a map where the field or ref name to be used is mapped to a string that will be used as an alias for it in the formula. See the second example below for the mapped alias version.

Prototype:

{
  "<field name>": {
    "type": "calculate",
    "fields": List[str],
    or
    "refs": List[str],
    "formula": <formula>,
    "config": {
      "key": Any
    }
  }
}

formula (str): The formula to use in calculations

Examples:

{
  "height_in": [60, 70, 80, 90],
  "height_cm": {
    "type": "calculate",
    "fields": ["height_in"],
    "formula": "{{ height_in }} * 2.54"
  }
}
{
  "long_name_one": {
    "type": "values",
    "data": [4, 5, 6]
  },
  "long_name_two": {
    "type": "values",
    "data": [3, 6, 9]
  },
  "c": {
    "type": "calculate",
    "fields": {
      "long_name_one": "a",
      "long_name_two": "b"
    },
    "formula": "sqrt({{a}}*{{a}} + {{b}}*{{b}})"
  }
}
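
The arithmetic behind the two formulas above can be checked directly in Python. The function names are illustrative only, and pairing the two value lists position by position is an assumption about how the fields line up.

```python
import math


def height_cm(height_in):
    """Mirror of the formula "{{ height_in }} * 2.54"."""
    return height_in * 2.54


def hypotenuse(a, b):
    """Mirror of the formula "sqrt({{a}}*{{a}} + {{b}}*{{b}})"."""
    return math.sqrt(a * a + b * b)


heights = [height_cm(h) for h in [60, 70, 80, 90]]
# Assumption: the two value lists are consumed in lockstep.
c_values = [hypotenuse(a, b) for a, b in zip([4, 5, 6], [3, 6, 9])]
```

For instance, 60 inches corresponds to about 152.4 cm, and the first aliased pair (4, 3) gives 5.0.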

We use the asteval package to do formula evaluation. This provides a fairly safe way to do evaluation. The package provides a number of built-in functions as well. We also use the Jinja2 templating engine format for specifying variable names to substitute. In theory, you could use any valid Jinja2 syntax, i.e.:

{
  "formula": "sqrt({{ value_that_might_be_a_string | int }})"
}

templated

A templated Field Spec is used to create strings by injecting the values from other fields into them. The other fields must be defined. The values can come from references or other defined fields. Use the jinja2 {{ field }} syntax to signify where the field should be injected.

Prototype:

{
  "<field name>": {
    "type": "templated",
    "data": "string with {{ jinja2 }} syntax fields",
    "fields": ["valid field name1", "valid field name2"],
    OR
    "refs": ["valid ref1", "valid ref2"]
  }
}

Examples:

{
  "user_agent": {
    "type": "templated",
    "data": "Mozilla/5.0 ({{ system }}) {{ platform }}",
    "refs": ["system", "platform"]
  },
  "refs": {
    "system": {
      "type": "values",
      "data": [
        "Windows NT 6.1; Win64; x64; rv:47.0",
        "Macintosh; Intel Mac OS X x.y; rv:42.0"
      ]
    },
    "platform": {
      "type": "values",
      "data": ["Gecko/20100101 Firefox/47.0", "Gecko/20100101 Firefox/42.0"]
    }
  }
}
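
The substitution step can be illustrated with a small stand-in for the template engine. This sketch uses a regular expression instead of Jinja2 and only handles bare {{ name }} placeholders; `render` is a hypothetical helper.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Replace each {{ name }} placeholder with the matching entry in values."""
    return PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), template)


user_agent = render(
    "Mozilla/5.0 ({{ system }}) {{ platform }}",
    {
        "system": "Windows NT 6.1; Win64; x64; rv:47.0",
        "platform": "Gecko/20100101 Firefox/47.0",
    },
)
```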

replace

Replace one or more parts of the output of a field or reference. Values to replace should be specified as strings. Values to replace with should also be strings.

Prototype:

{
  "<field name>": {
    "type": "replace",
    "ref": "<field or ref to source value from>",
    "data": {
      "<value to replace 1>": "<value to replace with 1>",
      ...
      "<value to replace N>": "<value to replace with N>",
    }
  }
}

Examples:

{
  "id": {
    "type": "uuid"
  },
  "remove_dashes": {
    "type": "replace",
    "ref": "id",
    "data": { "-": "" }
  }
}
$ datacraft --spec uuid-spec.json -i 3 -r 1 -x -l off --format json
{"id": "e809af25-bd85-4118-a5e9-cfdc953e172b", "remove_dashes": "1622e5cf2f334b81a90a6c031e0f78bf"}
{"id": "2a98b892-bb73-49de-8186-fa7cb4510001", "remove_dashes": "9c1d22d6f6e544bb8c0d582c441a1c78"}
{"id": "7986c789-1e5c-46f1-b5f1-a095f6a75209", "remove_dashes": "b50e914ea7994b6bb3194ce8c3402c8e"}

regex_replace

Replace one or more parts of the output of a field or reference using regular expressions to match the value strings. Note that masked is an alias for this type.

Prototype:

{
  "<field name>": {
    "type": "regex_replace|masked",
    "ref": "<field or ref to source value from>",
    "data": {
      "<regex 1>": "<value to replace with 1>",
      ...
      "<regex N>": "<value to replace with N>",
    }
    OR
    "data": "<replace all values with this>"
  }
}

Examples:

This first example will take a 10 digit string of numbers and format it as a phone number. The double backslash is needed so that, after JSON decoding, the strings can be compiled into regular expressions. Notice the \N format for specifying the group capture replacement.

{
  "phone": {
    "type": "regex_replace",
    "ref": "ten_digits",
    "data": {
      "^(\\d{3})(\\d{3})(\\d{4})": "(\\1) \\2-\\3"
    }
  },
  "refs": {
    "ten_digits": {
      "type": "cc-digits",
      "config": {
        "count": 10,
        "buffer": true
      }
    }
  }
}
$ datacraft --spec phone-spec.json -i 4 -r 1 -x -l off --format json
{"phone": "(773) 542-6190"}
{"phone": "(632) 956-3481"}
{"phone": "(575) 307-4587"}
{"phone": "(279) 788-3403"}
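
The same pattern and replacement can be tried with Python's re module. In Python raw strings a single backslash is enough, whereas the JSON spec needs two. `regex_replace` here is a hypothetical helper that mimics the two forms of the data element, not the library's function.

```python
import re


def regex_replace(value, data):
    """Apply each pattern/replacement pair, or return data itself if it is a plain string."""
    if isinstance(data, str):
        return data  # the "replace all values with this" form
    for pattern, replacement in data.items():
        value = re.sub(pattern, replacement, value)
    return value


phone = regex_replace("7735426190", {r"^(\d{3})(\d{3})(\d{4})": r"(\1) \2-\3"})
masked = regex_replace("123-45-6789", "NNN-NN-NNNN")
```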

Masked Example

The masked type is an alias for regex_replace. One mode for this type is to replace all the values with a specified value, for example:

{
  "masked_ssn": {
    "type": "masked",
    "ref": "ssn",
    "data": "NNN-NN-NNNN"
  },
  "age:rand_int_range": [18, 99],
  "refs": {
    "ssn": [
      "123-45-6789",
      "111-22-3333",
      "555-55-5555"
    ]
  }
}
$  datacraft.exe -s ssn.json -i 3 --format csvh -x -l off
masked_ssn,age
NNN-NN-NNNN,40
NNN-NN-NNNN,42
NNN-NN-NNNN,73