Cookbook

Parsing log-file output

This, of course, depends highly on what’s in your log files. But, as an example, suppose you have log-file lines such as

2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] various/sorts/of data {& punctuation} hits=1 status=0 time=2.378

I prefer to pre-filter with grep and/or sed to extract the structured text, then hand that to Miller. Example:

grep 'various/sorts' *.log | sed 's/.*} //' | mlr --fs space --repifs --oxtab stats1 -a min,p10,p50,p90,max -f time -g status
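
Here the sed step deletes everything up through the closing brace, so that what Miller receives is just the trailing key-value pairs. For the sample line above, that is (a reconstruction from that one line):

hits=1 status=0 time=2.378

With --fs space this parses as DKVP data with fields hits, status, and time; --repifs collapses repeated field separators, and --oxtab prints the statistics in vertical-tabular form.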

Rectangularizing data

Suppose you have a method (in whatever language) which is printing things of the form

outer=1
outer=2
outer=3

and then calls another method which prints things of the form

middle=10
middle=11
middle=12
middle=20
middle=21
middle=30
middle=31

and then, perhaps, that second method calls a third method which prints things of the form

inner1=100,inner2=101
inner1=120,inner2=121
inner1=200,inner2=201
inner1=210,inner2=211
inner1=300,inner2=301
inner1=312
inner1=313,inner2=314

with the result that your program’s output is

outer=1
middle=10
inner1=100,inner2=101
middle=11
middle=12
inner1=120,inner2=121
outer=2
middle=20
inner1=200,inner2=201
middle=21
inner1=210,inner2=211
outer=3
middle=30
inner1=300,inner2=301
middle=31
inner1=312
inner1=313,inner2=314

The idea here is that middles starting with a 1 belong to the outer value of 1, and so on. (For example, the outer values might be account IDs, the middle values might be invoice IDs, and the inner values might be invoice line-items.) If you want all the middle and inner lines to have the context of which outer they belong to, you can modify your software to pass that context down through your methods. Alternatively, you can use the following to rectangularize the data. The idea is to use an out-of-stream variable to accumulate fields across records: clear that variable when you see an outer ID, accumulate fields into it, and emit output when you see an inner ID.

$ mlr --from data/rect.txt put -q '
  ispresent($outer) {
    unset @r
  }
  for (k, v in $*) {
    @r[k] = v
  }
  ispresent($inner1) {
    emit @r
  }'
outer=1,middle=10,inner1=100,inner2=101
outer=1,middle=12,inner1=120,inner2=121
outer=2,middle=20,inner1=200,inner2=201
outer=2,middle=21,inner1=210,inner2=211
outer=3,middle=30,inner1=300,inner2=301
outer=3,middle=31,inner1=312,inner2=301
outer=3,middle=31,inner1=313,inner2=314
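
Note the second-to-last output line: the inner1=312 record had no inner2, so the previously accumulated inner2=301 was carried forward into its output. If you'd rather emit such records without the stale field, one option (a sketch, not part of the original recipe) is to clear the inner fields after each emit:

$ mlr --from data/rect.txt put -q '
  ispresent($outer) {
    unset @r
  }
  for (k, v in $*) {
    @r[k] = v
  }
  ispresent($inner1) {
    emit @r;
    unset @r["inner1"];
    unset @r["inner2"]
  }'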

Bulk rename of field names

$ cat data/spaces.csv
a b c,def,g h i
123,4567,890
2468,1357,3579
9987,3312,4543

$ mlr --csv --rs lf rename -r -g ' ,_' data/spaces.csv
a_b_c,def,g_h_i
123,4567,890
2468,1357,3579
9987,3312,4543

$ mlr --csv --irs lf --opprint rename -r -g ' ,_' data/spaces.csv
a_b_c def  g_h_i
123   4567 890
2468  1357 3579
9987  3312 4543

You can also do this with a for-loop, but note that it puts the modified fields after the unmodified ones (an order-preserving variant is sketched after the output below):

$ cat data/bulk-rename-for-loop.mlr
for (oldk,v in $*) {
    @newk = gsub(oldk, " ", "_");
    if (@newk != oldk) {
        unset $[oldk];
        $[@newk] = v
    }
}

$ mlr --icsv --irs lf --opprint put -f data/bulk-rename-for-loop.mlr data/spaces.csv
def  a_b_c g_h_i
4567 123   890
1357 2468  3579
3312 9987  4543
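
If you want to preserve field order, a variant is to build the renamed record in an out-of-stream map and assign it back to $*. This is a sketch (the file name is hypothetical), which should print the same output as the rename example above:

$ cat data/bulk-rename-ordered.mlr
# Rebuild the record with renamed keys, preserving field order.
# @new must be cleared per record, since oosvars persist across records.
unset @new;
for (oldk, v in $*) {
  @new[gsub(oldk, " ", "_")] = v
}
$* = @new

$ mlr --icsv --irs lf --opprint put -f data/bulk-rename-ordered.mlr data/spaces.csv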

Headerless CSV on input or output

Sometimes we get CSV files which lack a header. For example:

$ cat data/headerless.csv
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

You can use Miller to add a header: the --implicit-csv-header option applies positionally indexed labels:

$ mlr --csv --rs lf --implicit-csv-header cat data/headerless.csv
1,2,3
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

$ mlr --icsv --irs lf --implicit-csv-header --opprint cat data/headerless.csv
1     2  3
John  23 present
Fred  34 present
Alice 56 missing
Carol 45 present

Following that, you can rename the positionally indexed labels to names with meaning for your context. For example:

$ mlr --csv --rs lf --implicit-csv-header label name,age,status data/headerless.csv
name,age,status
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

$ mlr --icsv --irs lf --implicit-csv-header --opprint label name,age,status data/headerless.csv
name  age status
John  23  present
Fred  34  present
Alice 56  missing
Carol 45  present

Likewise, if you need to produce CSV lacking its header line, you can pipe Miller's output to the system command sed 1d, or use Miller's --headerless-csv-output option:

$ head -5 data/colored-shapes.dkvp | mlr --ocsv cat
color,shape,flag,i,u,v,w,x
yellow,triangle,1,11,0.6321695890307647,0.9887207810889004,0.4364983936735774,5.7981881667050565
red,square,1,15,0.21966833570651523,0.001257332190235938,0.7927778364718627,2.944117399716207
red,circle,1,16,0.20901671281497636,0.29005231936593445,0.13810280912907674,5.065034003400998
red,square,0,48,0.9562743938458542,0.7467203085342884,0.7755423050923582,7.117831369597269
purple,triangle,0,51,0.4355354501763202,0.8591292672156728,0.8122903963006748,5.753094629505863

$ head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat
yellow,triangle,1,11,0.6321695890307647,0.9887207810889004,0.4364983936735774,5.7981881667050565
red,square,1,15,0.21966833570651523,0.001257332190235938,0.7927778364718627,2.944117399716207
red,circle,1,16,0.20901671281497636,0.29005231936593445,0.13810280912907674,5.065034003400998
red,square,0,48,0.9562743938458542,0.7467203085342884,0.7755423050923582,7.117831369597269
purple,triangle,0,51,0.4355354501763202,0.8591292672156728,0.8122903963006748,5.753094629505863
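
The two options can of course be combined, to process a headerless file and keep it headerless. A sketch, sorting the headerless file lexically by its first column:

$ mlr --csv --rs lf --implicit-csv-header --headerless-csv-output sort -f 1 data/headerless.csv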

Regularizing ragged CSV

Miller handles compliant CSV: in particular, it's an error if the number of data fields in a given data line doesn't match the number of fields in the header line. But in the event that you have a CSV file in which some lines have fewer than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, in which each line stands on its own without reference to a header line.

$ cat data/ragged.csv
a,b,c
1,2,3
4,5
6
7,8,9

$ mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  @nf = NF;
  while(@nf < @maxnf) {
    @nf += 1;
    $[@nf] = ""
  }
'
a,b,c
1,2,3
4,5,
6,,
7,8,9

or, more simply,

$ mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  while(NF < @maxnf) {
    $[NF+1] = "";
  }
'
a,b,c
1,2,3
4,5,
6,,
7,8,9
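
Newer versions of Miller also have an unsparsify verb. If your build has it, the following sketch should do the same padding, since the NIDX keys are simply the positional indices 1, 2, 3 (note that, unlike the streaming put recipes above, unsparsify retains all records before emitting any):

$ mlr --from data/ragged.csv --fs comma --nidx unsparsify --fill-with ""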

Finding missing dates

Suppose you have some date-stamped data which may (or may not) be missing entries for one or more dates:

$ head -n 10 data/miss-date.csv
date,qoh
2012-03-05,10055
2012-03-06,10486
2012-03-07,10430
2012-03-08,10674
2012-03-09,10880
2012-03-10,10718
2012-03-11,10795
2012-03-12,11043
2012-03-13,11177

$ wc -l data/miss-date.csv
    1372 data/miss-date.csv

To find these, you can convert the dates to seconds since the epoch using strptime, then compute adjacent differences (the cat -n simply inserts record-counters):

$ mlr --from data/miss-date.csv --icsv \
  cat -n \
  then put '$datestamp = strptime($date, "%Y-%m-%d")' \
  then step -a delta -f datestamp \
| head

Then, filter for adjacent differences other than 86400 (the number of seconds in a day); the $n != 1 clause excludes the first record, whose delta is trivially zero:

$ mlr --from data/miss-date.csv --icsv \
  cat -n \
  then put '$datestamp = strptime($date, "%Y-%m-%d")' \
  then step -a delta -f datestamp \
  then filter '$datestamp_delta != 86400 && $n != 1'
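
Seconds-valued deltas are hard to eyeball. To express each gap in days instead (a small cosmetic extension of the above, using only verbs already shown), append one more put step:

$ mlr --from data/miss-date.csv --icsv \
  cat -n \
  then put '$datestamp = strptime($date, "%Y-%m-%d")' \
  then step -a delta -f datestamp \
  then filter '$datestamp_delta != 86400 && $n != 1' \
  then put '$gap_days = $datestamp_delta / 86400'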

Given this, it’s now easy to see where the gaps are:

$ mlr cat -n then filter '$n >= 770 && $n <= 780' data/miss-date.csv
n=770,1=2014-04-12,2=129435
n=771,1=2014-04-13,2=129868
n=772,1=2014-04-14,2=129797
n=773,1=2014-04-15,2=129919
n=774,1=2014-04-16,2=130181
n=775,1=2014-04-19,2=130140
n=776,1=2014-04-20,2=130271
n=777,1=2014-04-21,2=130368
n=778,1=2014-04-22,2=130368
n=779,1=2014-04-23,2=130849
n=780,1=2014-04-24,2=131026

$ mlr cat -n then filter '$n >= 1115 && $n <= 1125' data/miss-date.csv
n=1115,1=2015-03-25,2=181006
n=1116,1=2015-03-26,2=180995
n=1117,1=2015-03-27,2=181043
n=1118,1=2015-03-28,2=181112
n=1119,1=2015-03-29,2=181306
n=1120,1=2015-03-31,2=181625
n=1121,1=2015-04-01,2=181494
n=1122,1=2015-04-02,2=181718
n=1123,1=2015-04-03,2=181835
n=1124,1=2015-04-04,2=182104
n=1125,1=2015-04-05,2=182528

Two-pass algorithms

Miller is a streaming record processor; commands are performed once per record. This makes Miller particularly suitable for single-pass algorithms, allowing many of its verbs to process files that are (much) larger than the amount of RAM present in your system. (Of course, Miller verbs such as sort, tac, etc. all must ingest and retain all input records before emitting any output records.) You can also use out-of-stream variables to perform multi-pass computations, at the price of retaining all input records in memory.

Two-pass algorithms: computation of percentages

For example, mapping numeric values down a column to the percentage between their min and max values is two-pass: on the first pass you find the min and max values; on the second, you map each record's value to a percentage.

$ mlr --from data/small --opprint put -q '
  # These are executed once per record, which is the first pass.
  # The key is to use NR to index an out-of-stream variable to
  # retain all the x-field values.
  @x_min = min($x, @x_min);
  @x_max = max($x, @x_max);
  @x[NR] = $x;

  # The second pass is in a for-loop in an end-block.
  end {
    for (nr, x in @x) {
      @x_pct[nr] = 100 * (@x[nr] - @x_min) / (@x_max - @x_min);
    }
    emit (@x, @x_pct), "NR"
  }
'
NR x        x_pct
1  0.346790 25.661943
2  0.758680 100.000000
3  0.204603 0.000000
4  0.381399 31.908236
5  0.573289 66.540542

Two-pass algorithms: line-number ratios

Similarly, finding the total record count requires first reading through all the data:

$ mlr --opprint --from data/small put -q '
  @records[NR] = $*;
  end {
    for((I,k),v in @records) {
      @records[I]["I"] = I;
      @records[I]["N"] = NR;
      @records[I]["PCT"] = 100*I/NR
    }
    emit @records,"I"
  }
' then reorder -f I,N,PCT
I N PCT a   b   i x                   y
1 5 20  pan pan 1 0.3467901443380824  0.7268028627434533
2 5 40  eks pan 2 0.7586799647899636  0.5221511083334797
3 5 60  wye wye 3 0.20460330576630303 0.33831852551664776
4 5 80  eks wye 4 0.38139939387114097 0.13418874328430463
5 5 100 wye pan 5 0.5732889198020006  0.8636244699032729

Two-pass algorithms: records having max value

The idea is to retain records having the largest value of n in the following data:

$ mlr --itsv --irs lf --opprint cat data/maxrows.tsv
a      b      n score
purple red    5 0.743231
blue   purple 2 0.093710
red    purple 2 0.802103
purple red    5 0.389055
red    purple 2 0.880457
orange red    2 0.540349
purple purple 1 0.634451
orange purple 5 0.257223
orange purple 5 0.693499
red    red    4 0.981355
blue   purple 5 0.157052
purple purple 1 0.441784
red    purple 1 0.124912
orange blue   1 0.921944
blue   purple 4 0.490909
purple red    5 0.454779
green  purple 4 0.198278
orange blue   5 0.705700
red    red    3 0.940705
purple red    5 0.072936
orange blue   3 0.389463
orange purple 2 0.664985
blue   purple 1 0.371813
red    purple 4 0.984571
green  purple 5 0.203577
green  purple 3 0.900873
purple purple 0 0.965677
blue   purple 2 0.208785
purple purple 1 0.455077
red    purple 4 0.477187
blue   red    4 0.007487

Of course, the largest value of n isn’t known until after all data have been read. Using an out-of-stream variable we can retain all records as they are read, then filter them at the end:

$ cat data/maxrows.mlr
# Retain all records
@records[NR] = $*;
# Track max value of n
@maxn = max(@maxn, $n);

# After all records have been read, loop through retained records
# and print those with the max n value
end {
  for ((nr,k),v in @records) {
    if (k == "n") {
      if (@records[nr]["n"] == @maxn) {
        emit @records[nr]
      }
    }
  }
}

$ mlr --itsv --irs lf --opprint put -q -f data/maxrows.mlr data/maxrows.tsv
a      b      n score
purple red    5 0.743231
purple red    5 0.389055
orange purple 5 0.257223
orange purple 5 0.693499
blue   purple 5 0.157052
purple red    5 0.454779
orange blue   5 0.705700
purple red    5 0.072936
green  purple 5 0.203577
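
An alternative which avoids retaining all records in Miller's memory, at the cost of reading the file twice, is to compute the max in one pass and filter in a second. A sketch, using NIDX output to capture the bare maximum in a shell variable:

$ n_max=$(mlr --itsv --irs lf --onidx stats1 -a max -f n data/maxrows.tsv)
$ mlr --itsv --irs lf --opprint filter "\$n == $n_max" data/maxrows.tsv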

Filtering paragraphs of text

The idea is to use a record separator which is a pair of newlines. Then, if you want each paragraph to be a record with a single value, use a field separator which isn't present in the input data (e.g. control-A, which is octal 001); or, if you want each paragraph to have its lines as separate values, use newline as the field separator.

$ cat paragraphs.txt
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped
over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick
brown fox jumped over the lazy dogs. The quick brown fox jumped over the
lazy dogs.

Now is the time for all good people to come to the aid of their country.  Now
is the time for all good people to come to the aid of their country.  Now is
the time for all good people to come to the aid of their country.  Now is the
time for all good people to come to the aid of their country.  Now is the
time for all good people to come to the aid of their country.

Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.
Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.
Sphynx of black quartz, judge my vow.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly
on the plain. The rain in Spain falls mainly on the plain. The rain in Spain
falls mainly on the plain. The rain in Spain falls mainly on the plain. The
rain in Spain falls mainly on the plain. The rain in Spain falls mainly on
the plain. The rain in Spain falls mainly on the plain.

$ mlr --from paragraphs.txt --nidx --rs '\n\n' --fs '\001' filter '$1 =~ "the"'
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped
over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick
brown fox jumped over the lazy dogs. The quick brown fox jumped over the
lazy dogs.

Now is the time for all good people to come to the aid of their country.  Now
is the time for all good people to come to the aid of their country.  Now is
the time for all good people to come to the aid of their country.  Now is the
time for all good people to come to the aid of their country.  Now is the
time for all good people to come to the aid of their country.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly
on the plain. The rain in Spain falls mainly on the plain. The rain in Spain
falls mainly on the plain. The rain in Spain falls mainly on the plain. The
rain in Spain falls mainly on the plain. The rain in Spain falls mainly on
the plain. The rain in Spain falls mainly on the plain.

$ mlr --from paragraphs.txt --nidx --rs '\n\n' --fs '\n' cut -f 1,3
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped
brown fox jumped over the lazy dogs. The quick brown fox jumped over the

Now is the time for all good people to come to the aid of their country.  Now
the time for all good people to come to the aid of their country.  Now is the

Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.
Sphynx of black quartz, judge my vow.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly
falls mainly on the plain. The rain in Spain falls mainly on the plain. The

Doing arithmetic on fields with currency symbols

$ cat sample.csv
EventOccurred,EventType,Description,Status,PaymentType,NameonAccount,TransactionNumber,Amount
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,John,1,$230.36
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Fred,2,$32.25
10/1/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Bob,3,$39.02
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Alice,4,$57.54
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Jungle,5,$230.36
10/1/2015,Charged Back,Reason: Payment Stopped,Disputed,Checking,Joe,6,$281.96
10/2/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Joseph,7,$188.19
10/2/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Joseph,8,$188.19
10/2/2015,Charged Back,Reason: Payment Stopped,Disputed,Checking,Anthony,9,$250.00

$ mlr --icsv --opprint cat sample.csv
EventOccurred EventType    Description                               Status   PaymentType NameonAccount TransactionNumber Amount
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    John          1                 $230.36
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Fred          2                 $32.25
10/1/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Bob           3                 $39.02
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Alice         4                 $57.54
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Jungle        5                 $230.36
10/1/2015     Charged Back Reason: Payment Stopped                   Disputed Checking    Joe           6                 $281.96
10/2/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Joseph        7                 $188.19
10/2/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Joseph        8                 $188.19
10/2/2015     Charged Back Reason: Payment Stopped                   Disputed Checking    Anthony       9                 $250.00

$ mlr --csv put '$Amount = sub(string($Amount), "\$", "")' then stats1 -a sum -f Amount sample.csv
Amount_sum
1497.870000

$ mlr --csv --ofmt '%.2lf' put '$Amount = sub(string($Amount), "\$", "")' then stats1 -a sum -f Amount sample.csv
Amount_sum
1497.87
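
The sample data has only a leading dollar sign, but if your amounts also carry thousands separators, e.g. a hypothetical CSV-quoted "$1,234.56", you could strip the commas as well before summing:

$ mlr --csv --ofmt '%.2lf' put '$Amount = gsub(sub(string($Amount), "\$", ""), ",", "")' then stats1 -a sum -f Amount sample.csv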

Program timing

This admittedly artificial example demonstrates using Miller's time and stats functions to introspectively acquire some information about Miller's own runtime. The delta function computes the difference between successive timestamps.

$ ruby -e '10000.times{|i|puts "i=#{i+1}"}' > lines.txt

$ head -n 5 lines.txt
i=1
i=2
i=3
i=4
i=5

$ mlr --ofmt '%.9le' --opprint put '$t=systime()' then step -a delta -f t lines.txt | head -n 7
i     t                 t_delta
1     1430603027.018016 1.430603027e+09
2     1430603027.018043 2.694129944e-05
3     1430603027.018048 5.006790161e-06
4     1430603027.018052 4.053115845e-06
5     1430603027.018055 2.861022949e-06
6     1430603027.018058 3.099441528e-06

$ mlr --ofmt '%.9le' --oxtab \
  put '$t=systime()' then \
  step -a delta -f t then \
  filter '$i>1' then \
  stats1 -a min,mean,max -f t_delta \
  lines.txt
t_delta_min  2.861022949e-06
t_delta_mean 4.077508505e-06
t_delta_max  5.388259888e-05

Using out-of-stream variables

One of Miller’s strengths is its compact notation: for example, given input of the form

$ head -n 5 data/medium
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

you can simply do

$ mlr --oxtab stats1 -a sum -f x data/medium
x_sum 4986.019682

or

$ mlr --opprint stats1 -a sum -f x -g b data/medium
b   x_sum
pan 965.763670
wye 1023.548470
zee 979.742016
eks 1016.772857
hat 1000.192668

rather than the more tedious

$ mlr --oxtab put -q '
  @x_sum += $x;
  end {
    emit @x_sum
  }
' data/medium
x_sum 4986.019682

or

$ mlr --opprint put -q '
  @x_sum[$b] += $x;
  end {
    emit @x_sum, "b"
  }
' data/medium
b   x_sum
pan 965.763670
wye 1023.548470
zee 979.742016
eks 1016.772857
hat 1000.192668

The former (mlr stats1 et al.) has the advantages of being easier to type, less error-prone, and faster-running.

Nonetheless, out-of-stream variables (which I whimsically call oosvars), begin/end blocks, and emit statements give you the ability to implement logic, should you wish, which isn't present in other Miller verbs. (If you find yourself using the same out-of-stream-variable logic over and over, please file a request at https://github.com/johnkerl/miller/issues to get it implemented directly in C as a Miller verb of its own.)

The following examples compute some things using oosvars which are already computable using Miller verbs, by way of providing food for thought.

Mean without/with oosvars

$ mlr --opprint stats1 -a mean -f x data/medium
x_mean
0.498602

$ mlr --opprint put -q '
  @x_sum += $x;
  @x_count += 1;
  end {
    @x_mean = @x_sum / @x_count;
    emit @x_mean
  }
' data/medium
x_mean
0.498602

Keyed mean without/with oosvars

$ mlr --opprint stats1 -a mean -f x -g a,b data/medium
a   b   x_mean
pan pan 0.513314
eks pan 0.485076
wye wye 0.491501
eks wye 0.483895
wye pan 0.499612
zee pan 0.519830
eks zee 0.495463
zee wye 0.514267
hat wye 0.493813
pan wye 0.502362
zee eks 0.488393
hat zee 0.509999
hat eks 0.485879
wye hat 0.497730
pan eks 0.503672
eks eks 0.522799
hat hat 0.479931
hat pan 0.464336
zee zee 0.512756
pan hat 0.492141
pan zee 0.496604
zee hat 0.467726
wye zee 0.505907
eks hat 0.500679
wye eks 0.530604

$ mlr --opprint put -q '
  @x_sum[$a][$b] += $x;
  @x_count[$a][$b] += 1;
  end{
    for ((a, b), v in @x_sum) {
      @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b];
    }
    emit @x_mean, "a", "b"
  }
' data/medium
a   b   x_mean
pan pan 0.513314
pan wye 0.502362
pan eks 0.503672
pan hat 0.492141
pan zee 0.496604
eks pan 0.485076
eks wye 0.483895
eks zee 0.495463
eks eks 0.522799
eks hat 0.500679
wye wye 0.491501
wye pan 0.499612
wye hat 0.497730
wye zee 0.505907
wye eks 0.530604
zee pan 0.519830
zee wye 0.514267
zee eks 0.488393
zee zee 0.512756
zee hat 0.467726
hat wye 0.493813
hat zee 0.509999
hat eks 0.485879
hat hat 0.479931
hat pan 0.464336

Variance and standard deviation without/with oosvars

$ mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
x_count  10000
x_sum    4986.019682
x_mean   0.498602
x_var    0.084270
x_stddev 0.290293

$ cat variance.mlr
@n += 1;
@sumx += $x;
@sumx2 += $x**2;
end {
  @mean = @sumx / @n;
  @var = (@sumx2 - @mean * (2 * @sumx - @n * @mean)) / (@n - 1);
  @stddev = sqrt(@var);
  emitf @n, @sumx, @sumx2, @mean, @var, @stddev
}

$ mlr --oxtab put -q -f variance.mlr data/medium
n      10000
sumx   4986.019682
sumx2  3328.652400
mean   0.498602
var    0.084270
stddev 0.290293

You can also do this keyed, of course, imitating the keyed-mean example above.
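
For instance, keyed by the a field, a sketch following the same algebra as variance.mlr (the file name keyed-variance.mlr is hypothetical):

$ cat keyed-variance.mlr
# Per-key accumulators
@n[$a] += 1;
@sumx[$a] += $x;
@sumx2[$a] += $x**2;
end {
  for (a, v in @n) {
    @mean[a] = @sumx[a] / @n[a];
    @var[a]  = (@sumx2[a] - @mean[a] * (2 * @sumx[a] - @n[a] * @mean[a])) / (@n[a] - 1);
  }
  emit (@mean, @var), "a"
}

$ mlr --opprint put -q -f keyed-variance.mlr data/medium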

Min/max without/with oosvars

$ mlr --oxtab stats1 -a min,max -f x data/medium
x_min 0.000045
x_max 0.999953

$ mlr --oxtab put -q '@x_min = min(@x_min, $x); @x_max = max(@x_max, $x); end{emitf @x_min, @x_max}' data/medium
x_min 0.000045
x_max 0.999953

Keyed min/max without/with oosvars

$ mlr --opprint stats1 -a min,max -f x -g a data/medium
a   x_min    x_max
pan 0.000204 0.999403
eks 0.000692 0.998811
wye 0.000187 0.999823
zee 0.000549 0.999490
hat 0.000045 0.999953

$ mlr --opprint --from data/medium put -q '
  @min[$a] = min(@min[$a], $x);
  @max[$a] = max(@max[$a], $x);
  end{
    emit (@min, @max), "a";
  }
'
a   min      max
pan 0.000204 0.999403
eks 0.000692 0.998811
wye 0.000187 0.999823
zee 0.000549 0.999490
hat 0.000045 0.999953

Delta without/with oosvars

$ mlr --opprint step -a delta -f x data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0.411890
wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077
eks wye 4 0.38139939387114097 0.13418874328430463 0.176796
wye pan 5 0.5732889198020006  0.8636244699032729  0.191890

$ mlr --opprint put '$x_delta = ispresent(@last) ? $x - @last : 0; @last = $x' data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0.411890
wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077
eks wye 4 0.38139939387114097 0.13418874328430463 0.176796
wye pan 5 0.5732889198020006  0.8636244699032729  0.191890

Keyed delta without/with oosvars

$ mlr --opprint step -a delta -f x -g a data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0
wye wye 3 0.20460330576630303 0.33831852551664776 0
eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281
wye pan 5 0.5732889198020006  0.8636244699032729  0.368686

$ mlr --opprint put '$x_delta = ispresent(@last[$a]) ? $x - @last[$a] : 0; @last[$a]=$x' data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0
wye wye 3 0.20460330576630303 0.33831852551664776 0
eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281
wye pan 5 0.5732889198020006  0.8636244699032729  0.368686

Exponentially weighted moving averages without/with oosvars

$ mlr --opprint step -a ewma -d 0.1 -f x data/small
a   b   i x                   y                   x_ewma_0.1
pan pan 1 0.3467901443380824  0.7268028627434533  0.346790
eks pan 2 0.7586799647899636  0.5221511083334797  0.387979
wye wye 3 0.20460330576630303 0.33831852551664776 0.369642
eks wye 4 0.38139939387114097 0.13418874328430463 0.370817
wye pan 5 0.5732889198020006  0.8636244699032729  0.391064

$ mlr --opprint put '
  begin{ @a=0.1 };
  $e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
  @e=$e
' data/small
a   b   i x                   y                   e
pan pan 1 0.3467901443380824  0.7268028627434533  0.346790
eks pan 2 0.7586799647899636  0.5221511083334797  0.387979
wye wye 3 0.20460330576630303 0.33831852551664776 0.369642
eks wye 4 0.38139939387114097 0.13418874328430463 0.370817
wye pan 5 0.5732889198020006  0.8636244699032729  0.391064
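
A keyed variant, imitating the keyed-delta example above: with the verb, this is simply step -a ewma -d 0.1 -f x -g a. The oosvar version below is a sketch along the same lines.

$ mlr --opprint put '
  begin { @alpha = 0.1 };
  $e = ispresent(@ewma[$a]) ? @alpha * $x + (1 - @alpha) * @ewma[$a] : $x;
  @ewma[$a] = $e
' data/small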