Pixie reads in big structured text files, transforms them with JavaScript functions, and writes them back to disk.
The usage examples in this section are based on the following large JSONL file.
Users may extend pixie with (third-party) plugins for many more data formats.
See the [.pxi module section][pxi-module] on how to do that and the plugins section for a list.
Pixie deserializes data into JSON, applies functions, and serializes JSON to another format.
It offers the more descriptive aliases --from and --to as alternatives to --deserializer and --serializer.
time,year,month,day,hours,minutes,seconds
1546300800,2019,1,1,0,0,0
1546300801,2019,1,1,0,0,1
1546300802,2019,1,1,0,0,2
1546300803,2019,1,1,0,0,3
Use Ramda, Lodash or any other JavaScript library:
Pixie follows the [Unix philosophy][unix-philosophy]:
It does one thing (processing structured data), and does it well.
It is written to work together with other programs and it handles text streams because that is a universal interface.
Pixie's space-separated values deserializer makes it very easy to work with the output of other commands.
Array destructuring is especially helpful in this area.
Pixie's philosophy is to provide a small, extensible frame
for processing large files and streams with JavaScript functions.
Different data formats are supported through plugins.
JSON, CSV, SSV, and TSV are supported by default, but users can customize their pixie
installation by choosing from additional plugins, including third-party ones.
Pixie works its magic by chunking, deserializing, applying functions, and serializing data.
Expressed in code, it works like this:
function pxi (data) {                  // Data is passed to pxi from stdin.
  const chunks = chunk(data)           // The data is chunked.
  const jsons = deserialize(chunks)    // The chunks are deserialized into JSON objects.
  const jsons2 = apply(f, jsons)       // f is applied to each object and new JSON objects are returned.
  const string = serialize(jsons2)     // The new objects are serialized to a string.
  process.stdout.write(string)         // The string is written to stdout.
}
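The four stages can be tried out in a small runnable sketch. The helpers below (`chunk`, `deserialize`, `apply`, `serialize`, and `run`) are hypothetical stand-ins written for illustration. They work on a whole string in memory, whereas pixie's plugins process streams, so treat this as a model of the data flow rather than pixie's code.

```javascript
// Illustrative stand-ins only: these helpers are not pixie's plugin code.
const chunk = data => data.split('\n').filter(line => line.trim() !== '')
const deserialize = chunks => chunks.map(c => JSON.parse(c))
const apply = (f, jsons) => jsons.map(f)
const serialize = jsons => jsons.map(j => JSON.stringify(j)).join('\n') + '\n'

// Runs the four stages in order on an in-memory string.
const run = (f, data) => serialize(apply(f, deserialize(chunk(data))))

const input = '{"time":1546300800,"year":2019}\n{"time":1546300801,"year":2019}\n'
const output = run(json => ({time: json.time}), input)
console.log(output)
```

Each stage only depends on the output of the previous one, which is what allows a plugin to replace a single stage without touching the others.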
For example, chunking, deserializing, and serializing JSON is provided by the [pxi-json][pxi-json] plugin.
The last column states which plugins come preinstalled in pxi.
Refer to the .pxi Module section to see how to enable more plugins and how to develop plugins.
New experimental pixie plugins are developed, among other places, in the [pxi-sandbox][pxi-sandbox] repository.
pxi is very fast and beats several similar tools in [performance benchmarks][pxi-benchmarks].
Times are given in CPU time (seconds); wall-clock times may deviate by ± 1s.
The benchmarks were run on a 13" MacBook Pro (2019) with a 2.8 GHz Quad-Core i7 and 16 GB memory.
Feel free to run the [benchmarks][pxi-benchmarks] on your own machine
and if you do, please [open an issue][issues] to report your results!
| [Benchmark][pxi-benchmarks] | Description | pxi | gawk | jq | mlr | fx |
|---|---|---|---|---|---|---|
| JSON 1 | Select an attribute on small JSON objects | 11s | 15s | 46s | - | 284s |
| JSON 2 | Select an attribute on large JSON objects | 20s | 20s | 97s | - | 301s |
| JSON 3 | Pick a single attribute on small JSON objects | 15s | 21s | 68s | 91s | 368s |
| JSON 4 | Pick a single attribute on large JSON objects | 26s | 27s | 130s | 257s† | 420s |
| JSON to CSV 1 | Convert a small JSON to CSV format | 15s | - | 77s | 60s | - |
| JSON to CSV 2 | Convert a large JSON to CSV format | 38s | - | 264s | 237s† | - |
| CSV 1 | Select a column from a small CSV file | 11s | 8s | 37s | 23s | - |
| CSV 2 | Select a column from a large CSV file | 19s | 9s | 66s | 72s | - |
| CSV to JSON 1 | Convert a small CSV to JSON format | 15s | - | - | 120s | - |
| CSV to JSON 2 | Convert a large CSV to JSON format | 42s | - | - | 352s | - |

† mlr appears to load the whole file instead of processing it in chunks when reading JSON.
This is why it fails on large input files.
So in these benchmarks, the first 20,000,000 lines are processed first, followed by the remaining 11,536,000 lines.
The times of both runs are summed up.
pxi and gawk notably beat
jq, mlr, and fx in every benchmark.
However, because pxi parses data instead of manipulating strings, it is more versatile than gawk
and can, for example, convert one data format into another.
pxi and gawk differ greatly in their approaches to transforming data:
While gawk manipulates strings, pxi parses data according to a format, builds an internal JSON representation,
manipulates this JSON, and serializes it to a different format.
Surprisingly, they perform equally well in the benchmarks,
with pxi being a little faster in JSON and gawk in CSV.
However, the more attributes JSON objects have and the more columns CSV files have,
the faster gawk gets compared to pxi, because it does not need to build an internal data representation.
On the other hand, while pxi is able to perform complex format transformations,
gawk is unable to do it because of its different approach.
jq and mlr share pxi's data transformation approach, but focus on different formats:
While jq specializes in transforming JSON, mlr's focus is CSV.
Although pxi does not prefer one format over the other,
it beats both tools in processing speed on their preferred formats.
fx and pxi are very similar in that both are written in JavaScript and use JavaScript as their processing language.
However, although fx specializes in just the JSON format, pxi is at least 15x faster in all benchmarks.
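The speedup factor can be checked against the reported numbers. The following sketch copies the pxi and fx CPU times from the four JSON benchmarks (the only ones with fx results) and computes the ratio for each.

```javascript
// CPU times in seconds, copied from the JSON rows of the benchmark results.
const times = [
  {benchmark: 'JSON 1', pxi: 11, fx: 284},
  {benchmark: 'JSON 2', pxi: 20, fx: 301},
  {benchmark: 'JSON 3', pxi: 15, fx: 368},
  {benchmark: 'JSON 4', pxi: 26, fx: 420}
]

// How many times longer fx takes than pxi in each benchmark.
const speedups = times.map(({benchmark, pxi, fx}) => ({benchmark, speedup: fx / pxi}))
const smallest = Math.min(...speedups.map(s => s.speedup))

console.log(speedups)
console.log(smallest) // the smallest ratio comes from JSON 2 (301 / 20)
```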
All tools differ in their memory needs.
Since pxi and fx are written in an interpreted language, they need approx. 70 MB due to their runtime.
Since gawk and jq are compiled binaries, they only need approx. 1MB.
mlr needs the most memory (up to 11GB), since it appears to load the whole file before processing it in some cases.
Pixie has deserializers (--from) and serializers (--to) for various data formats, including JSON and CSV.
JSON is the default deserializer and serializer, so there is no need to type --from json and --to json.
time,year,month,day,hours,minutes,seconds
1546300800,2019,1,1,0,0,0
1546300801,2019,1,1,0,0,1
1546300802,2019,1,1,0,0,2
1546300803,2019,1,1,0,0,3
1546300804,2019,1,1,0,0,4
Convert JSON to CSV, but keep only time and month:
Serializers can be freely combined with functions.
1546300800,1
1546300801,1
1546300802,1
1546300803,1
1546300804,1
Rename time to timestamp and convert CSV to TSV:
$ pxi '({time, ...rest}) => ({timestamp: time, ...rest})' --from csv --to tsv < 2019.csv
Read in CSV format.
Use destructuring to select all attributes other than time.
Rename time to timestamp and keep all other attributes unchanged.
Write in TSV format.
timestamp	year	month	day	hours	minutes	seconds
1546300800	2019	1	1	0	0	0
1546300801	2019	1	1	0	0	1
1546300802	2019	1	1	0	0	2
1546300803	2019	1	1	0	0	3
1546300804	2019	1	1	0	0	4
Convert CSV to JSON:
$ pxi --deserializer csv --serializer json < 2019.csv
--from and --to are aliases for --deserializer and --serializer that are used to convert between formats.
Deserializing from CSV does not automatically cast strings to other types.
This is intentional, since some use cases may need casting, and others don't.
If you need a key to be an integer, you need to explicitly transform it.
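An explicit cast could look like the following sketch. The helper name `castNumbers` is made up for this example, and the record mirrors the columns of the 2019.csv file used in this section. `Number` is used for the conversion, so swap in `parseInt` if you need to reject fractional values.

```javascript
// Hypothetical helper: converts selected string fields of a record to numbers.
const castNumbers = keys => record => {
  const result = {...record}
  for (const key of keys) {
    if (key in result) result[key] = Number(result[key])
  }
  return result
}

// A record as a CSV deserializer would produce it: every value is a string.
const row = {time: '1546300800', year: '2019', month: '1'}
const cast = castNumbers(['time', 'month'])(row)
console.log(cast) // time and month are numbers now, year is still a string
```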
$ pxi '({month, day}) => month == 5 && day == 4' --applier filter < 2019.jsonl
Appliers determine how functions are applied.
The default applier is map, which applies the function to each element.
Here, we use the filter applier that keeps only elements for which the function yields true.
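The difference between the two appliers can be modelled in a few lines. The `map` and `filter` functions below are simplified in-memory stand-ins, not pixie's implementation. Whether pixie requires a strict `true` or accepts any truthy value is not stated here, so the sketch checks for `true`.

```javascript
// Simplified in-memory models of the appliers; pixie applies them to streamed data.
const map = (f, jsons) => jsons.map(f)
const filter = (f, jsons) => jsons.filter(json => f(json) === true)

const jsons = [
  {month: 5, day: 3},
  {month: 5, day: 4},
  {month: 6, day: 4}
]
const isMayFourth = ({month, day}) => month == 5 && day == 4

const mapped = map(isMayFourth, jsons)       // one boolean per element
const filtered = filter(isMayFourth, jsons)  // only the matching elements
console.log(mapped, filtered)
```

With map, the predicate's return value replaces each element. With filter, the return value only decides whether the original element is kept.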
The --keep attribute takes a stringified JSON array and narrows each element to only the keys in it.
Using --spaces with any value other than 0 formats the serialized JSON using the provided number as spaces.
{
"time": 1546300800
}
{
"time": 1546300801
}
{
"time": 1546300802
}
{
"time": 1546300803
}
{
"time": 1546300804
}
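The effect of the two options can be approximated in plain JavaScript. In this sketch, `keep` and `serialize` are hypothetical helpers, and the indentation of 2 is an assumed value chosen for the example.

```javascript
// Hypothetical model of --keep and --spaces; not pixie's implementation.
const keep = keys => json =>
  Object.fromEntries(keys.filter(k => k in json).map(k => [k, json[k]]))

// JSON.stringify's third argument sets the number of spaces used for indentation.
const serialize = (json, spaces) => JSON.stringify(json, null, spaces)

const keys = JSON.parse('["time"]') // --keep takes a stringified JSON array
const record = {time: 1546300800, year: 2019, month: 1}
const pretty = serialize(keep(keys)(record), 2)
console.log(pretty)
```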
Deserialize JSON that is not given line by line:
$ pxi --by jsonObj < pretty.jsonl
The --chunker or --by attribute defines how data is turned into chunks that are deserialized.
The default chunker is line, which treats each line as a chunk.
In cases where JSON is not given line by line, e.g. if it is pretty-printed, the jsonObj chunker helps.
{"time":1546300800}
{"time":1546300801}
{"time":1546300802}
{"time":1546300803}
{"time":1546300804}
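To see why a line-based chunker cannot handle pretty-printed input, consider how an object-based chunker might work. The sketch below is one possible approach, assuming that chunks are cut wherever a top-level object closes. The name `chunkByObject` is made up, and the real jsonObj chunker may be implemented differently.

```javascript
// Sketch of an object-based chunker: it cuts the input wherever a top-level
// JSON object closes. Braces inside string literals are ignored.
const chunkByObject = data => {
  const chunks = []
  let depth = 0
  let start = -1
  let inString = false
  let escaped = false

  for (let i = 0; i < data.length; i++) {
    const ch = data[i]
    if (inString) {
      if (escaped) escaped = false
      else if (ch === '\\') escaped = true
      else if (ch === '"') inString = false
    } else if (ch === '"') {
      inString = true
    } else if (ch === '{') {
      if (depth === 0) start = i
      depth++
    } else if (ch === '}' && depth > 0) {
      depth--
      if (depth === 0) chunks.push(data.slice(start, i + 1))
    }
  }
  return chunks
}

const prettyInput = '{\n  "time": 1546300800\n}\n{\n  "note": "a } in a string"\n}\n'
const objects = chunkByObject(prettyInput).map(c => JSON.parse(c))
console.log(objects)
```

Splitting the same input on newlines would yield fragments such as `{` and `"time": 1546300800`, none of which is valid JSON on its own.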
Suppose you have to access a web API:
$ curl -s "https://swapi.co/api/people/"
The returned JSON is one big mess and needs to be tamed.
Here, the --with alias for --applier is used.
The function selects the results array.
If it were applied with map, it would return the whole array as an element.
But since we use the flatMap applier, each array item is returned as an element, instead.
The --keep attribute specifies which keys to keep from the returned objects.
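The following sketch contrasts map and flatMap on a response-like object. The appliers are simplified in-memory models, and the `response` object is made up for illustration. Only the `results` key is taken from the description here.

```javascript
// Simplified in-memory models of the two appliers.
const map = (f, jsons) => jsons.map(f)
const flatMap = (f, jsons) => jsons.flatMap(f)

// A made-up response: one object that wraps a results array.
const response = {
  count: 2,
  results: [{name: 'A', height: '172'}, {name: 'B', height: '167'}]
}
const selectResults = json => json.results

const viaMap = map(selectResults, [response])         // one element: the whole array
const viaFlatMap = flatMap(selectResults, [response]) // two elements: the array items
console.log(viaMap.length, viaFlatMap.length)
```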
We use pixie to compute each character's BMI.
The default chunker line and the default applier map are suitable to apply a BMI-computing function to each line.
Before serializing to the default format JSON, we only keep the name and bmi fields.
The map applier supports mutating function inputs, which might be a problem for other appliers, so be careful.
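A BMI-computing function could be sketched as follows. The input shape is an assumption: `height` in centimetres and `mass` in kilograms, both given as strings. The sample character is invented. The function returns a copy instead of mutating its input, which sidesteps the mutation caveat mentioned above.

```javascript
// Assumed input shape: `height` in centimetres and `mass` in kilograms, both as strings.
const addBmi = character => {
  const heightInMetres = Number(character.height) / 100
  const bmi = Number(character.mass) / (heightInMetres * heightInMetres)
  return {...character, bmi} // returns a copy instead of mutating the input
}

// Hypothetical model of --keep: narrows an object to the given keys.
const keep = keys => json => Object.fromEntries(keys.map(k => [k, json[k]]))

const character = {name: 'Example Character', height: '180', mass: '81', gender: 'n/a'}
const result = keep(['name', 'bmi'])(addBmi(character))
console.log(result)
```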
Array destructuring is especially useful when working with space-separated values.
{"size":"704B","file":"."}
{"size":"704B","file":".."}
{"size":"1.2K","file":"bin"}
{"size":"4.4K","file":"dev"}
{"size":"11B","file":"etc"}
{"size":"25B","file":"home"}
{"size":"64B","file":"opt"}
{"size":"192B","file":"private"}
{"size":"2.0K","file":"sbin"}
{"size":"11B","file":"tmp"}
{"size":"352B","file":"usr"}
{"size":"11B","file":"var"}
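The destructuring pattern behind output like this can be shown in isolation. In the sketch below, `deserializeSsv` is a hypothetical stand-in for the space-separated values deserializer, and the input line is made up in the style of a long directory listing.

```javascript
// Hypothetical stand-in for a space-separated values deserializer:
// it splits a line on runs of whitespace into an array of tokens.
const deserializeSsv = line => line.trim().split(/\s+/)

// Array destructuring skips unwanted positions with empty slots.
const pickSizeAndFile = ([, , , , size, , , , file]) => ({size, file})

// A made-up line in the style of a long directory listing.
const line = 'drwxr-xr-x  22 root  wheel  704B Jan  1 00:00 .'
const entry = pickSizeAndFile(deserializeSsv(line))
console.log(entry)
```

The fifth token becomes `size` and the ninth becomes `file`, and all other tokens are dropped without having to name them.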
Allow JSON objects and lists in CSV:
$ echo '{"a":1,"b":[1,2,3]}\n{"a":2,"b":{"c":2}}' |
  pxi --to csv --no-fixed-length --allow-list-values
Pixie can be told to allow JSON encoded lists and objects in CSV files.
Note how pixie takes care of quoting and escaping those values for you.
a,b
1,"[1,2,3]"
2,"{""c"":2}"
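The quoting seen in this output follows the usual CSV convention: a field that contains a comma, a double quote, or a line break is wrapped in double quotes, and every double quote inside it is doubled. A minimal sketch of that rule, with made-up helper names and no claim to match pixie's serializer in detail:

```javascript
// Sketch of CSV quoting. Lists and objects are JSON-encoded first.
const toCsvField = value => {
  const text = typeof value === 'object' && value !== null
    ? JSON.stringify(value)
    : String(value)
  // Quote only if needed, and double any double quote inside the field.
  return /[",\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text
}

const toCsvLine = values => values.map(toCsvField).join(',')

const line1 = toCsvLine([1, [1, 2, 3]])
const line2 = toCsvLine([2, {c: 2}])
console.log(line1) // 1,"[1,2,3]"
console.log(line2) // 2,"{""c"":2}"
```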
Decode JSON values in CSV:
$ echo '{"a":1,"b":[1,2,3]}\n{"a":2,"b":{"c":2}}' |
  pxi --to csv --no-fixed-length --allow-list-values |
  pxi --from csv 'evolve({b: JSON.parse})'
JSON values are treated as strings and are not automatically parsed.
This is intentional, as pixie tries to keep as much out of your way as possible.
They can be transformed back into JSON by applying JSON.parse in a function.
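The decoding step can be reproduced without any library. The `evolve` below is a minimal stand-in for Ramda's function of the same name, limited to flat objects, and the input rows are what a CSV deserializer would plausibly return for the file above (all values as strings).

```javascript
// Minimal stand-in for Ramda's evolve, limited to flat objects:
// applies transformations[key] to record[key] where a transformation exists.
const evolve = transformations => record =>
  Object.fromEntries(
    Object.entries(record).map(([key, value]) =>
      [key, typeof transformations[key] === 'function' ? transformations[key](value) : value]
    )
  )

// Records as they come out of the CSV deserializer: all values are strings.
const rows = [{a: '1', b: '[1,2,3]'}, {a: '2', b: '{"c":2}'}]
const decoded = rows.map(evolve({b: JSON.parse}))
console.log(decoded)
```

Only `b` is parsed. `a` stays a string because no transformation was given for it.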
Users may extend and modify pxi by providing a .pxi module.
If you wish to do that, create a ~/.pxi/index.js file and insert the following base structure:
module.exports = {
  plugins: [],
  context: {},
  defaults: {}
}
The following sections will walk you through all capabilities of .pxi modules.
If you want to skip over the details and instead see sample code, visit [pxi-pxi][pxi-pxi]!
The name is used by pixie to select your extension,
the desc is displayed in the options section of pxi --help, and
the func is called by pixie to transform data.
The sample extensions are bundled into the sample plugin, as follows:
Plugins can come from two sources:
They are either written by the user, as shown in the previous section, or they are installed in ~/.pxi/ as follows:
$ npm install pxi-sample
If a plugin was installed, it has to be imported into ~/.pxi/index.js:
const sample = require('pxi-sample')
Regardless of whether a plugin was defined by a user or installed from npm,
all plugins are added to the .pxi module the same way:
module.exports = {
  plugins: [sample],
  context: {},
  defaults: {}
}
pxi --help should now list the sample plugin extensions in the options section.
:speak_no_evil: Adding plugins may break the pxi command line tool!
If this happens, just remove the plugin from the list and pxi should work normally again.
Use this feature responsibly.
Libraries like [Ramda][ramda] and [Lodash][lodash] are of immense help when writing functions to transform JSON objects
and there have been many heated discussions about which of these libraries is superior.
Since different people have different preferences, pixie lets the user decide which library to use.
First, install your preferred libraries in ~/.pxi/:
$ npm install ramda
$ npm install lodash
Next, add the libraries to ~/.pxi/index.js:
const R = require('ramda')
const L = require('lodash')
module.exports = {
  plugins: [],
  context: Object.assign({}, R, {_: L}),
  defaults: {}
}
You may now use all Ramda functions without prefix, and all Lodash functions with prefix _:
$ pxi "prop('time')" < 2019.jsonl
$ pxi "json => _.get(json, 'time')" < 2019.jsonl
:hear_no_evil: Using Ramda and Lodash in your functions may have a negative impact on performance!
Use this feature responsibly.
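Why Ramda functions need no prefix while Lodash functions live under `_` follows from how the context object is merged. The sketch below uses tiny stand-ins for both libraries so that it runs without installing anything. How pixie exposes the context to your functions internally is not shown here, so the destructuring step is only an assumption about the effect.

```javascript
// Stand-ins for the real libraries so that the sketch runs without npm installs.
const R = {prop: key => obj => obj[key]}   // mimics Ramda's curried prop
const L = {get: (obj, path) => obj[path]}  // mimics Lodash's get for a single key

// Same merge as in the .pxi module: R's keys are spread to the top level,
// while L is stored as a whole under the key `_`.
const context = Object.assign({}, R, {_: L})

// Assumed effect: each key of the context is available as a name in functions.
const {prop, _} = context
const json = {time: 1546300800}
console.log(prop('time')(json), _.get(json, 'time'))
```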
Just as you may extend pixie with third-party libraries like Ramda and Lodash,
you may add your own functions.
This is as simple as adding them to the context in ~/.pxi/index.js:
const getTime = json => json.time
module.exports = {
  plugins: [],
  context: {getTime},
  defaults: {}
}
After adding it to the context, you may use your function:
| | pxi | jq | mlr | fx | gawk |
|---|---|---|---|---|---|
| Description | Small, fast, and magical command-line data processor similar to awk, jq, and mlr. | Command-line JSON processor | Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON | Command-line tool and terminal JSON viewer | The awk utility interprets a special-purpose programming language that makes it possible to handle simple data-reformatting jobs with just a few lines of code |
| Focus | Transforming data with user provided functions and converting between formats | Transforming JSON with user provided functions | Transforming CSV with user provided functions and converting between formats | | |
| Processing language | | | [Predefined verbs and custom put/filter DSL][mlr-verbs] | JavaScript and all [JavaScript libraries][npm] | [awk language][gawk-lang] |
| Extensibility | (Third-party) Plugins, any [JavaScript library][npm], custom functions | (Third-party) [Modules][jq-modules] written in [jq][jq-lang] | Running arbitrary [shell commands][mlr-shell] | Any [JavaScript library][npm], custom functions | [gawk dynamic extensions][gawk-extensions] |
| Similarities | | pxi and jq both heavily rely on JSON | pxi and mlr both convert back and forth between CSV and JSON | pxi and fx both apply JavaScript functions to JSON streams | pxi and gawk both transform data |
| Differences | | pxi and jq use different processing languages | While pxi uses a programming language for data processing, mlr uses a custom put/filter DSL. Also, mlr reads in the whole file while pxi processes it in chunks | pxi supports data formats other than JSON, and fx provides a terminal JSON viewer | While pxi functions transform a JSON into another JSON, gawk does not have a strict format other than transforming strings into other strings |
We are open to, and grateful for, any contributions made by the community.
By contributing to pixie, you agree to abide by the [code of conduct][code].
Please read the [contributing guide][contribute].