As I was trying to parallelize some bash script today, quite unexpectedly, I had more fun than I was hoping for. It was a bumpy ride, so please follow the story from the safety of a comfy chair, basking in the glow of your favorite terminal emulator.
The task
So I was sitting there, facing off against a small snippet of bash code that screamed to be optimized. Truthfully, what was actually screaming at me was the fact that the last stage in a Go pipeline was taking some 27 minutes.
Here’s the infamous snippet:
Actually, not that bad. OK, the $host value could be passed as an argument to the function r_refresh_host instead of being used implicitly, but all in all not too shabby.
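For orientation, here is a minimal sketch of a loop with this shape. The host names and the body of r_refresh_host are placeholders (the real function would connect to the node and run Puppet); only the implicit use of $host is taken from the description above.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical host list; the real script talks to 13 nodes.
hosts="node01 node02 node03"

# Placeholder for the real function. Note that it reads $host from the
# enclosing scope instead of taking the host as "$1".
r_refresh_host() {
  echo "refreshing $host"
}

# One node after another: total time is the sum of all the refreshes.
for host in $hosts; do
  r_refresh_host
done
```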
So what does this snippet actually do? It connects to 13 nodes and updates the configuration using Puppet. If that floats your boat, imagine using Chef instead as it is not important for the story.
Anyway, this can clearly be optimized by connecting to the nodes in parallel. So let’s do just that:
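A sketch of what such a change might look like, assuming each call is wrapped in a subshell, sent to the background with &, and collected with wait. The host names and the echo-only r_refresh_host are placeholders, not the original code.

```shell
#!/usr/bin/env bash

# Hypothetical host list and placeholder function, as assumptions for this sketch.
hosts="node01 node02 node03"

r_refresh_host() {
  echo "refreshing $host"
}

for host in $hosts; do
  (
    # Each subshell gets its own copy of $host at the time it is forked.
    r_refresh_host
  ) &
done

# Block until every background subshell has finished.
wait
```

The output order is no longer guaranteed, since the subshells run concurrently.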
Trivial change and voilà! The deed is done. Well, not really. The snippet is embedded in a script that declares set -e. I would not have thought anything of it had a colleague not mentioned a few days prior that the failure of a backgrounded subshell does not kill the main script. That’s not the behaviour I desired, as I’d like all the nodes to be properly updated.
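The pitfall is easy to reproduce in isolation. The sketch below uses made-up exit codes instead of real nodes: a failing background job does not trip set -e, a bare wait returns 0 regardless, and only waiting on each PID individually surfaces the failures.

```shell
#!/usr/bin/env bash
set -e

# A failing background job does not abort the script under set -e,
# and a bare `wait` returns 0 no matter how the jobs ended.
( exit 3 ) &
wait
echo "still running after a failed background job"

# Waiting on each PID individually returns that job's exit status.
pids=""
for code in 0 3 0; do
  ( exit "$code" ) &
  pids="$pids $!"
done

failed=0
for pid in $pids; do
  # The `||` keeps set -e from aborting here, so every job gets checked.
  wait "$pid" || failed=$((failed + 1))
done
echo "failed jobs: $failed"
```

This prints "failed jobs: 1". A real script would exit non-zero at this point if the count is above zero, which is exactly the kind of additional bash logic a dedicated tool can take off your hands.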
The same colleague told me that he uses something called parallel to solve similar issues.
Enter GNU Parallel
GNU Parallel is “a shell tool for executing jobs in parallel”. It even has quite a nice logo.
Thus began my suffering.
I played with parallel on my Mac after installing it via Homebrew. It worked nicely. The only thing that annoyed me a bit was that I needed the --no-notice flag to silence some undesired output. The snippet:
To be truthful, I fumbled a bit to get there as I’m not a bash expert™. But that’s less important for the story.
After running the Go pipeline, in a test stage that rolls out the change to only one node, I found out that the parallel version I had just installed on that Ubuntu machine does not support the --no-notice flag. Oh, well… The flag went away.
When I retriggered the pipeline, the final stage, which rolls out the configuration to all nodes, failed. This was confusing, as the previous stage, which tests the rollout to one node, had succeeded. What was going on?
The twist
I dug deeper and it turned out that not all the nodes had the same version of GNU Parallel. Some of them (version 20120422) worked just fine, but others (version 20121122) failed miserably. More concretely, even the simplest command:
produced the output 42 or \n, depending on the version in use.
While checking the version, there was a somewhat random message:
Not knowing the Tollef fellow, I added the --gnu flag.
The (happy?) end
And it worked.
Here’s the final one-liner:
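As a rough sketch of the shape such an invocation can take (not the original line): this assumes a placeholder r_refresh_host that takes the host as its first argument, and refresh_all is a name made up for the example. The flags used are only --gnu and the ::: argument separator.

```shell
#!/usr/bin/env bash
set -e

# Placeholder for the real function; here it takes the host as "$1".
r_refresh_host() {
  echo "refreshing $1"
}
# GNU Parallel starts a new shell per job, so a bash function
# has to be exported to be visible there.
export -f r_refresh_host

# refresh_all is a hypothetical wrapper name for this sketch.
refresh_all() {
  # --gnu: the flag that made the difference in this story.
  # ::: separates the command from its arguments; one job runs per argument.
  parallel --gnu r_refresh_host ::: "$@"
}

# Usage: refresh_all node01 node02 node03
```

GNU Parallel exits with a non-zero status when any job fails, so under set -e a single failed node stops the script, which is the behaviour the backgrounded subshells could not provide on their own.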
Somehow you’d think that it’s not necessary to use a --gnu flag in something called GNU Parallel. But life is full of surprises.
To be fair, in the end parallel does the job. And it is better than solving the same problem with additional bash logic. I definitely plan on using it again. Check it out, maybe you should too. Just don’t forget to use a flag or two.
As a side note, this particular Go stage went from 27 min to 4.5 min. All is well that ends well.
EDIT:
One reader was kind enough to explain why the issue happens.
Maybe something else to mention: I’m a big fan of GNU and respect everything it stands for. So take this post as intended and laugh a bit. :)