I had to use GNU Parallel quite extensively lately. In this post I’m sharing some of its features that were most helpful to me.
The Simplest Way to Use Parallel
The simplest way to use Parallel is to pipe a list of commands into it:
parallel <<EOF
date
date +"%Y-%m-%d"
sleep 5 && echo slept for 5s
echo 1+2 | bc
EOF
Wed Jan 15 19:46:06 CET 2025
2025-01-15
3
slept for 5s
By default, it runs as many commands in parallel as there are cores on the machine. It’s possible to set the number of parallel jobs with the -j flag:
for i in {1..100}; do touch file$i.txt; done
find . -name "file*.txt" | parallel -j 3 rm
The previous example also shows how to run the same command multiple times with different arguments. Parallel executes rm for every file name in the list that is passed to its standard input.
It’s also possible to pass the list of arguments on the command line after three colons (:::):
parallel echo ::: A B C
B
A
C
Replacement Strings
In case you want to put the arguments somewhere in the middle of the command, you can use {} to tell Parallel where to put them:
seq 1 10 | parallel touch file-{}.txt
ls file*
file-1.txt file-10.txt file-2.txt file-3.txt file-4.txt file-5.txt file-6.txt file-7.txt file-8.txt file-9.txt
In case there are multiple arguments in each input line, you can refer to them individually by number using {n}. But for that you need to tell Parallel how they are delimited using --colsep (-C):
paste -d ' ' <(seq 1 10) <(seq 101 110) \
| parallel --colsep ' ' touch file_{1}_{2}.txt
ls file*
file_10_110.txt file_1_101.txt file_2_102.txt file_3_103.txt file_4_104.txt file_5_105.txt file_6_106.txt file_7_107.txt file_8_108.txt file_9_109.txt
Another replacement string is the job slot number {%}. It gets replaced with the number of the slot that is about to execute the command. In my case it was useful for distributing commands over a number of pods. Here is an example:
seq 1 100 | parallel -j 5 echo {} ">>" job_{%}_results.txt
ls job_*
head -n 5 job_3_results.txt
job_1_results.txt job_2_results.txt job_3_results.txt job_4_results.txt job_5_results.txt
3
8
13
18
23
As you can see, we asked Parallel to run 5 jobs, and it distributed the numbers over 5 different files, naming each file according to the pattern.
Spread the Input
Instead of using the input as command arguments, Parallel can also spread it over the standard input of multiple commands using the --pipe flag. Here is a simple example:
yes 'some string to repeat' | head -n 100000 | parallel -j 2 --pipe wc -l
47662
47662
4676
Parallel reads the input and splits it into chunks of approximately 1 MB by default. Each chunk is then passed to the standard input of the target command, wc -l in this case. At the same time it makes sure that no more than 2 wc commands are running at any point in time. I said the chunks are approximately 1 MB because Parallel splits the input at newlines rather than at exact 1 MB boundaries. The chunk size can be controlled with --block. For example, --block 1G sets the chunk size to around 1 gigabyte.
Output Buffering
By default, the output of the commands running in parallel is not mixed up. This is useful if, for example, your commands output JSON that you later process with jq. Mixing the outputs would almost certainly make the combined output invalid JSON.
But sometimes the commands just output some logs, and in this case you’d like to see what they are printing as soon as possible. For that there is the --lb (--line-buffer) flag. It allows Parallel to mix up the outputs on a per-line basis. Compare the output of the following command for yourself with and without --lb:
seq 1 5 | sed 's/.*/echo &; sleep 5s; echo &/' | parallel -j 3 --lb
Bonus Feature
If the input is big and executing all the commands takes a while, it’s nice to see how far Parallel is from completion. And there is a way: the --bar flag. Take a look:
My Use Cases
Lately I had to run some scripts on a bunch of Kafka topics. Initially I used simple loops for that, but it was painfully slow. Gradually I converted all the scripts to use Parallel instead. In one case I had to create a lot of topics, so I prepared a list of commands and piped it to Parallel. In another case I had to produce huge amounts of dummy messages, and Parallel helped me here as well. I used a combination of the yes and head commands to generate the exact number of messages I needed, and then sent the stream to Parallel. It broke the stream down using --pipe and sent the data to the topics using multiple Kafka producers.
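A sketch of what that pipeline looked like (the topic name, broker address, and producer invocation here are illustrative assumptions, not the exact ones I used):

```shell
# Generate exactly 1,000,000 dummy messages, then let Parallel
# chunk the stream with --pipe and feed each chunk to one of up
# to 8 concurrently running console producers.
yes '{"key": "value"}' | head -n 1000000 \
  | parallel -j 8 --pipe --block 10M \
      kafka-console-producer --bootstrap-server localhost:9092 --topic dummy-topic
```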
I also have to manage some resources in a number of Kubernetes clusters, and checking the resources’ statuses was slow. Initially I had written a script that simply iterated over each cluster and queried the resources for their statuses. Later I converted that to use Parallel as well. I piped the list of contexts to Parallel and specified the command to execute, which contained the {} placeholder. The output of the command was JSON, so I piped the output of Parallel into jq and, after some transformation, into column -t so that it displayed the data in a nice table.
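In spirit it looked something like this (the pods resource, the jq transformation, and the context listing are illustrative assumptions, not my exact script):

```shell
# Query up to 10 clusters concurrently; {} is replaced with each
# context name. The default output grouping keeps each job's JSON
# intact, so the concatenated stream stays valid input for jq.
kubectl config get-contexts -o name \
  | parallel -j 10 kubectl --context {} get pods -o json \
  | jq -r '.items[] | [.metadata.name, .status.phase] | @tsv' \
  | column -t
```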
Conclusion
I’ve barely scratched the surface in this post. Parallel provides many more features, and I encourage you to spend some time learning it. Take a look at man parallel, man parallel_tutorial and man parallel_examples. It’s a real time saver in so many cases.
I hope the post was helpful to you, and thanks for reading!