#paste #posix #coreutils #shell

Joining consecutive lines with paste

Short intro to paste

paste is a tool defined in POSIX as follows:

The paste utility shall concatenate the corresponding lines of the given input files, and write the resulting lines to standard output.

The default operation of paste shall concatenate the corresponding lines of the input files. The <newline> of every line except the line from the last input file shall be replaced with a <tab>.

And the man page from GNU coreutils explains:

SYNOPSIS

  paste [OPTION]... [FILE]...

DESCRIPTION

  Write lines consisting of the sequentially corresponding lines from each FILE, separated by TABs, to standard output.

Let's see an example:

$ cat a.txt
1
2
3

$ cat b.txt
101
102
103

$ paste a.txt b.txt
1       101
2       102
3       103

OK, from the above we can see that this can be quite useful. But wait, there is one behavior that can be missed at first glance...

Passing stdin

paste accepts - (a dash) as a file operand, meaning "read from stdin". So far nothing special, but it is worth noting that if we pass - several times, every instance reads from the same file descriptor, and each read advances the same shared offset over the same input.

This behavior is specific to stdin as a paste operand (as opposed to passing the path of the same file several times). It is also quite a natural one, given that a program has only one stdin.

While I don't see this explicitly documented in the manual for the coreutils implementation, this property is mentioned in POSIX:

If '-' is specified for one or more of the files, the standard input shall be used; the standard input shall be read one line at a time, circularly, for each instance of '-'.

In principle, a program could buffer stdin to allow multiple independent reads of its contents, but that is not what paste does, and that includes the coreutils implementation.
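The circular reading described in the POSIX quote can be tried with more than two dashes. This is only a small sketch: the six-line input and the variable name three_cols are made up for illustration.

```shell
# Six input lines and three '-' operands: every output row is assembled
# from the next three lines of the one shared stdin stream.
three_cols=$(printf '%s\n' 1 2 3 4 5 6 | paste - - -)

# Print the result; columns are separated by tabs (the default delimiter).
printf '%s\n' "$three_cols"
```

With six lines and three operands this should give two rows of three columns each.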

Joining consecutive lines from a single source

Now that we know the state (read offset) is shared by all occurrences of stdin passed to paste, we can see that this allows, for example, merging consecutive lines from a single file, like so:

$ cat a.txt
1
2
3

$ cat a.txt | paste - -
1       2
3

In the above example the text is read via stdin, but because stdin is declared as an input twice, assembling each output line consumes two consecutive lines from the single input.

The alternative would be for each of the two stdin operands to get its own read offset, which would result in each line being read twice.

To further illustrate the difference, we can get exactly that effect of independent read offsets per operand if we point twice to the same file using file paths instead:

$ paste a.txt a.txt
1       1
2       2
3       3

Above there are two independent read offsets at play, one per file operand (per opened file descriptor), and both operands just happen to name the same file.
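The two cases can also be reproduced side by side in one short script. Treat it as a sketch under a few assumptions: the temporary file comes from mktemp (widely available, but not part of POSIX), and the variable names tmp, by_path and by_stdin are made up for illustration.

```shell
# Create a throwaway three-line input file.
tmp=$(mktemp)
printf '%s\n' 1 2 3 > "$tmp"

# Same file opened twice by path: two descriptors, two independent offsets,
# so every line shows up in both columns.
by_path=$(paste "$tmp" "$tmp")

# Same file fed once through stdin: one descriptor, one shared offset,
# so consecutive lines end up next to each other.
by_stdin=$(paste - - < "$tmp")

printf 'by path:\n%s\n\nby stdin:\n%s\n' "$by_path" "$by_stdin"

# Clean up the temporary file.
rm -f "$tmp"
```

Note that the last row of the stdin variant holds only the leftover line 3 followed by a tab, because the second operand has already hit end-of-file.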

Less synthetic example

I've used this property recently when operating on the output of git log -p, which was showing, among other things, commit dates and commit patches. What I wanted was to get each (date, specific diff line) pair on one line. paste and this usage of stdin allowed me to get from a form like...

date 1
diff line 1
date 2
diff line 2

... to this:

date 1 diff line 1
date 2 diff line 2

by simply adding | paste - - to the pipeline.
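As a rough sketch of that last step, here is the same transformation on made-up placeholder lines standing in for the already filtered git log -p output (the filtering itself is not shown, and the variable name pairs is hypothetical). The -d option, which POSIX defines for paste, replaces the default tab with another delimiter; a space is used here purely as an example.

```shell
# Placeholder input: alternating "date" and "diff" lines (not real git output).
pairs=$(printf '%s\n' 'date 1' 'diff line 1' 'date 2' 'diff line 2' | paste -d ' ' - -)

# Each output line now holds one (date, diff line) pair, separated by a space.
printf '%s\n' "$pairs"
```

Without -d ' ' the two fields would be separated by a tab instead.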