Joining consecutive lines with paste
Short intro to paste
paste is a tool defined in POSIX as follows:
The paste utility shall concatenate the corresponding lines of the given input files, and write the resulting lines to standard output.
The default operation of paste shall concatenate the corresponding lines of the input files. The <newline> of every line except the line from the last input file shall be replaced with a <tab>.
And the man page from GNU core utilities explains:
SYNOPSIS
paste [OPTION]... [FILE]...
DESCRIPTION
Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output.
Let's see an example:
$ cat a.txt
1
2
3
$ cat b.txt
101
102
103
$ paste a.txt b.txt
1 101
2 102
3 103
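The separator in that output is a tab, which is easy to mistake for a space on screen. A minimal sketch to check it, assuming a scratch directory from mktemp (the dir and merged variable names are only for illustration):

```shell
# Recreate the two sample files in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' 1 2 3 > "$dir/a.txt"
printf '%s\n' 101 102 103 > "$dir/b.txt"

# Merge corresponding lines; the default delimiter is a tab.
merged=$(paste "$dir/a.txt" "$dir/b.txt")

# od -c dumps the bytes one by one, so the delimiter shows up as \t.
printf '%s\n' "$merged" | od -c

rm -r "$dir"
```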
OK, from the above we can see that this can be quite useful. But wait, there is one behavior that can be missed at first glance...
Passing stdin
paste can take - (a dash) as a source parameter to read from stdin. So far nothing special, but what is worth noting is that if we pass the stdin parameter multiple times, every instance keeps reading from the same file descriptor, each read advancing the same shared read offset over the same input.
This is special-case behavior for stdin as a paste parameter (as opposed to passing file paths that point to the same file). And quite a natural one, given that a program has only one stdin.
While I don't see this being explicitly documented in the manual for the coreutils implementation, this property is mentioned in POSIX:
If '-' is specified for one or more of the files, the standard input shall be used; the standard input shall be read one line at a time, circularly, for each instance of '-'.
In principle a program could buffer stdin to allow multiple independent reads of its contents, but that's not what paste does, the coreutils implementation included.
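The circular reading is easiest to see with more than two dashes. A minimal sketch, with a made-up six-line input and a rows variable used only for illustration:

```shell
# Three '-' operands: paste reads stdin one line at a time, handing the
# lines to each '-' in turn, so every output row takes three consecutive lines.
rows=$(printf '%s\n' 1 2 3 4 5 6 | paste - - -)

# Prints two tab-separated rows: 1 2 3 and 4 5 6.
printf '%s\n' "$rows"
```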
Joining consecutive lines from a single source
Now that we see that state (the read offset) is shared by all the occurrences of stdin passed to paste, we can observe that this allows, for example, merging consecutive lines from a single file, like so:
$ cat a.txt
1
2
3
$ cat a.txt | paste - -
1 2
3
In the above example the text is read via stdin, but because stdin is declared as an input twice, assembling each output line reads two consecutive lines from the single input.
This is as opposed to each of the two instances of the stdin parameter getting its own read offset, which would result in each line being read twice.
To further illustrate the difference, we can get that effect of independent read offsets per (same) file parameter by pointing twice to the same file using file paths instead:
$ paste a.txt a.txt
1 1
2 2
3 3
Above there are two independent read offsets at play, one per opened file descriptor, and both descriptors just happen to refer to the same file.
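Both behaviors can be put side by side in one script. This is a sketch under the assumption that mktemp is available; the tmp, shared and independent names are mine. One detail worth knowing: with an odd number of input lines, the last row of the stdin variant ends with a tab, because POSIX has paste treat an exhausted input as if it supplied empty lines.

```shell
# Throwaway three-line input file.
tmp=$(mktemp)
printf '%s\n' 1 2 3 > "$tmp"

# Two '-' operands share one stdin and therefore one read offset:
# rows are "1<TAB>2" and "3<TAB>" (empty second field).
shared=$(paste - - < "$tmp")

# Two path operands open the file twice, each with its own offset:
# rows are "1<TAB>1", "2<TAB>2", "3<TAB>3".
independent=$(paste "$tmp" "$tmp")

printf 'shared offset:\n%s\n' "$shared"
printf 'independent offsets:\n%s\n' "$independent"

rm -f "$tmp"
```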
Less synthetic example
I've used this property recently when operating on the output of git log -p, which was showing, among other things, dates of commits and commit patches. What I wanted was to get (date, specific diff line) pairs on one line. paste and this usage of stdin allowed me to get from a form like...
date 1
diff line 1
date 2
diff line 2
... to this:
date 1 diff line 1
date 2 diff line 2
by simply adding | paste - - to the pipeline.
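A rough reconstruction of such a pipeline, not the exact one I ran: fake_log is a hypothetical stand-in for git log -p, and the grep pattern and the timeout setting are invented for illustration. The pairing only works when the filter leaves exactly one diff line after each date line; a commit with zero or several matching lines would shift the pairs.

```shell
# Hypothetical stand-in for `git log -p` output (real output has more lines).
fake_log() {
cat <<'EOF'
commit 1111111
Author: A. Author <a@example.com>
Date:   Mon Jan 1 10:00:00 2024 +0000

    Raise timeout

-timeout = 10
+timeout = 30
commit 2222222
Author: A. Author <a@example.com>
Date:   Tue Jan 2 11:00:00 2024 +0000

    Raise timeout again

-timeout = 30
+timeout = 60
EOF
}

# Keep only the date line and the added line of interest per commit,
# then join each date with the line that follows it.
pairs=$(fake_log | grep -E '^(Date:|\+timeout)' | paste - -)
printf '%s\n' "$pairs"
```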