How to Speed Up Bash Scripts with Multithreading and GNU Parallel

Chris Blackden


If you’re tired of your Bash scripts taking forever to run, this tutorial is for you. Often, you can run Bash scripts in parallel, which can dramatically shorten the total run time. How? By using the GNU Parallel utility, also just called Parallel, with some handy GNU Parallel examples!

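Parallel executes Bash scripts in parallel via a concept often loosely called multi-threading. Strictly speaking, it starts a separate process for each job. This utility allows you to run a different job on each CPU core instead of using only one, cutting down on the time it takes to run a script.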
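As a rough picture of what running jobs at the same time means, the sketch below uses only Bash built-ins (the & operator and wait) to start three jobs in the background. It is not how GNU Parallel is implemented. The work function is a hypothetical stand-in for a slow task, and Parallel automates this pattern while also limiting how many jobs run at once.

```shell
# Hypothetical stand-in for a slow task.
work() {
    sleep 0.2
    echo "job $1 done"
}

# Start three jobs in the background, then wait for all of them to finish.
# The finishing order is not guaranteed, so the lines may appear in any order.
results=$(
    for n in 1 2 3; do
        work "$n" &
    done
    wait
)
printf '%s\n' "$results"
```

All three jobs sleep at the same time, so the whole run takes roughly as long as one job rather than three.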

In this tutorial, you’re going to learn how to multi-thread Bash scripts with a ton of great GNU Parallel examples!

Prerequisites

This tutorial will be full of hands-on demonstrations. If you intend to follow along, be sure you have the following:

  • A Linux computer. Any distribution will work. The tutorial uses Ubuntu 20.04 running on Windows Subsystem for Linux (WSL).
  • A user account with sudo privileges.

Installing GNU Parallel

To begin speeding up Bash scripts with multithreading, you must first install Parallel. So let’s get started by downloading and getting it installed.

1. Open a Bash terminal.

2. Run wget to download the Parallel package. The command below downloads the latest version (parallel-latest) into the current working directory.

wget https://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2

If you’d rather use an older version of GNU Parallel, you can find all packages at the official download site.

3. Now, execute the tar command below to un-archive the package you just downloaded.

Below, the command uses the x flag to extract the archive, j to decompress it with bzip2 (the .bz2 extension), and f to name the archive file that tar should read.

sudo tar -xjf parallel-latest.tar.bz2

You should now have a directory named parallel- followed by the release date of the latest version in year, month, day order (for example, parallel-20210422).

4. Navigate into the package archive folder with cd. In this tutorial, the package archive folder is called parallel-20210422, so the command is cd parallel-20210422.

Navigating to the Parallel archive folder

5. Next, build and install the GNU Parallel binary by running the following commands:

./configure
make
sudo make install

Now, verify that Parallel installed correctly by checking the version that’s installed.

parallel --version
Checking the GNU Parallel version
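Some distributions ship a different tool that is also named parallel, so it can help to confirm that the command on your PATH really is GNU Parallel. The following is a minimal sketch. The have_gnu_parallel function and the install_status variable are hypothetical names used only for this illustration.

```shell
# Hypothetical helper: succeeds only if the "parallel" on PATH identifies itself as GNU Parallel.
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

if have_gnu_parallel; then
    # The first line of --version holds the release name.
    install_status="found: $(parallel --version | head -n 1)"
else
    install_status="GNU Parallel not found on PATH"
fi
echo "$install_status"
```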

When running Parallel for the first time, you might also see a couple of scary lines that display text like perl: warning:. Those warning messages indicate that Parallel can’t detect your current locale and language settings. But don’t worry about those warnings for now. You’ll learn how to fix those warnings later.

Configuring GNU Parallel

Now that Parallel is installed, you can use it right away! But first, it’s important to configure a few minor settings before getting started.

While still in your Bash terminal, run Parallel with the --citation parameter and type will cite at the prompt. Doing so tells Parallel that you agree to cite it in any academic research that uses it.

If you do not want to support GNU or its maintainers, agreeing to cite is not required to use GNU Parallel.

parallel --citation
will cite

Next, set the locale and language environment variables by running the lines below. Setting these variables isn’t a requirement, but GNU Parallel checks for them every time it runs.

If the environment variables do not exist, Parallel will complain about them every time as you saw in the previous section.

This tutorial assumes that you are an English speaker. Other languages are supported as well.

export LC_ALL=C
export LANGUAGE=en_US
export LANG=en_US.UTF-8
Setting locale and language for GNU Parallel
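Exported variables only last for the current shell session. The sketch below sets the same three variables and prints them back to confirm that they took effect. To keep them across sessions, one common approach is to add the export lines to your ~/.bashrc file, which this sketch does not do.

```shell
export LC_ALL=C
export LANGUAGE=en_US
export LANG=en_US.UTF-8

# Print each variable to confirm it is set in the current shell.
for name in LC_ALL LANGUAGE LANG; do
    printf '%s=%s\n' "$name" "$(printenv "$name")"
done
```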

Running Ad-Hoc Shell Commands

Let’s now get started using GNU Parallel! To start with, you’ll learn the basic syntax. Once you’re comfortable with the syntax, you’ll then get into some handy GNU Parallel examples later on.

To begin, let’s cover a super-simple example of just echoing the numbers 1-5.

1. In your Bash terminal, run the following commands. Exciting, right? Bash uses the echo command to send the numbers 1-5 to the terminal. If you put each of these commands in a script, Bash would execute them sequentially, each one waiting for the previous one to finish.

In this example, you are executing five commands that take hardly any time. But imagine if those commands were Bash scripts that actually did something useful but took forever to run.

 echo 1
 echo 2
 echo 3
 echo 4
 echo 5

2. Now, run all of those commands at the same time with Parallel, like below. In this example, Parallel runs the echo command once for each argument that follows the ::: separator: 1, 2, 3, 4, and 5. The three colons tell Parallel that you’re providing input via the command line rather than the pipeline (more later).

In the example below, you pass a single command to Parallel with no options. Here, as in all Parallel examples, Parallel starts a new process for each argument and, by default, runs as many of them at once as there are CPU cores.

# From the command line
 parallel echo ::: 1 2 3 4 5

All Parallel commands follow the syntax parallel [Options] <Command to multi-thread>.
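Because the jobs run concurrently, the output lines are not guaranteed to appear in the same order as the arguments. The sketch below assumes GNU Parallel is on your PATH and adds the -k (--keep-order) option so that the output follows the input order. The have_gnu_parallel helper is a hypothetical name used only for this illustration.

```shell
# Hypothetical helper: succeeds only if the "parallel" on PATH identifies itself as GNU Parallel.
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

ordered=""
if have_gnu_parallel; then
    # -k (--keep-order) prints results in the order of the arguments,
    # even if a later job happens to finish first.
    ordered=$(parallel -k echo ::: 1 2 3 4 5)
    printf '%s\n' "$ordered"
else
    echo "GNU Parallel is not installed; skipping the example." >&2
fi
```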

3. To demonstrate Parallel receiving input from the Bash pipeline, create a file called count_file.txt like below. Each number represents the argument you will pass to the echo command.

 1
 2
 3
 4
 5
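If you would rather not type the file by hand, one way to create it is with seq, as in this minimal sketch. The temporary working directory is an assumption made so the example does not touch existing files. In the tutorial itself, you would run only the seq line in your own working directory.

```shell
# Work in a throwaway directory so the sketch does not overwrite anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# seq prints the integers 1 through 5, one per line.
seq 1 5 > count_file.txt
cat count_file.txt
```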

4. Now, run the cat command to read that file and pipe the output to Parallel, as shown below. In this example, {} is a placeholder that Parallel replaces with each input line (1-5) before running echo.

# From the pipeline
 cat count_file.txt | parallel echo {}
GNU Parallel Example #1

Comparing Bash and GNU Parallel

Right now, using Parallel might just seem like a complicated way to run Bash commands. But the real benefit to you is time savings. Remember, a Bash script runs its commands one after another, while GNU Parallel can run several at once across your CPU cores.

1. To demonstrate the difference in sequential Bash commands vs. Parallel, create a Bash script called test.sh with the following code. Create this script in the same directory you created the count_file.txt in earlier.

The Bash script below reads the count_file.txt file and, for each number in it, sleeps for that many seconds (1, 2, 3, 4, and 5), echoes the sleep length to the terminal, and moves on to the next number.

#!/bin/bash
nums=$(cat count_file.txt) # Read count_file.txt
for num in $nums           # Loop over each number in the file
do
    sleep "$num"           # Wait that many seconds
    echo "$num"            # Print the number
done

2. Now, make the script executable with chmod +x test.sh and run it using the time command to measure how long the script takes to complete. It will take about 15 seconds (1 + 2 + 3 + 4 + 5).

time ./test.sh

3. Now, use the time command again to perform the same task but this time use Parallel to do so.

The command below performs the same task, but instead of waiting for one sleep to finish before starting the next, Parallel starts as many jobs at once as it can, by default one per CPU core. With five or more cores, the whole run takes about 5 seconds, the length of the longest sleep.

time cat count_file.txt | parallel "sleep {}; echo {}"
The prompt on the right side of the terminal confirms the time each command took to complete
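The arithmetic behind the two timings can be sketched as follows. It assumes all five jobs can run at once (five or more CPU cores) and ignores startup overhead, so the real numbers will be slightly higher. The variable names are made up for this illustration.

```shell
# Durations in seconds, matching the contents of count_file.txt.
durations="1 2 3 4 5"

total=0     # sequential run: every sleep waits for the previous one
longest=0   # fully parallel run: bounded by the slowest job
for d in $durations; do
    total=$((total + d))
    if [ "$d" -gt "$longest" ]; then
        longest=$d
    fi
done

echo "sequential: about ${total}s"
echo "parallel:   about ${longest}s (plus a little startup overhead)"
```

On a machine with fewer than five cores, the parallel figure falls somewhere between the two values, because some jobs have to wait for a free slot.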

Know the Dry Run!

It’s now time to get into some more real-world GNU Parallel examples. But, before you do, you should first know about the --dryrun flag. This flag comes in handy when you want to see what will happen without Parallel actually doing it.

The --dryrun flag can be the final sanity check before running a command that might not behave the way you expect. Unfortunately, if you enter a command that would harm your system, the only thing GNU Parallel will help you do is harm it faster!

parallel --dryrun "rm -rf {}"
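To see a dry run in action without risking anything, you could feed it a couple of placeholder arguments, as in the sketch below. The names dir_a and dir_b are made up and do not need to exist, and have_gnu_parallel is a hypothetical helper. Parallel prints each command it would have run and deletes nothing.

```shell
# Hypothetical helper: succeeds only if the "parallel" on PATH identifies itself as GNU Parallel.
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

preview=""
if have_gnu_parallel; then
    # dir_a and dir_b are placeholder names; --dryrun only prints the commands.
    preview=$(parallel -k --dryrun "rm -rf {}" ::: dir_a dir_b)
    printf '%s\n' "$preview"
else
    echo "GNU Parallel is not installed; skipping the example." >&2
fi
```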

GNU Parallel Example #1: Downloading Files from the Web

For this task, you will download a list of files from various URLs on the web. For example, these URLs could represent web pages you want to save, images, or even a list of files from an FTP server.

For this example, you’re going to download a list of archive packages (and the SIG files) from GNU parallel’s FTP server.

1. Create a file called download_items.txt, grab some download links from the official download site, and add them to the file, one per line.

 https://ftp.gnu.org/gnu/parallel/parallel-20120122.tar.bz2
 https://ftp.gnu.org/gnu/parallel/parallel-20120122.tar.bz2.sig
 https://ftp.gnu.org/gnu/parallel/parallel-20120222.tar.bz2
 https://ftp.gnu.org/gnu/parallel/parallel-20120222.tar.bz2.sig

You could save some time by using Python’s Beautiful Soup library to scrape all of the links from the download page.

2. Read all of the URLs from the download_items.txt file and pass them to Parallel, which will invoke wget and pass each URL.

cat download_items.txt | parallel wget {}

Don’t forget that {} in a parallel command is a placeholder for the input string!
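Before downloading anything, you may want to preview the wget commands that Parallel would build. The sketch below is one way to do that, assuming GNU Parallel is installed. It writes a shortened copy of the URL list to a temporary directory, reads it with the :::: file separator, and uses --dryrun so nothing is fetched. The have_gnu_parallel helper and the temporary path are made up for this illustration, and the exact quoting of the printed commands may differ between Parallel versions.

```shell
# Hypothetical helper: succeeds only if the "parallel" on PATH identifies itself as GNU Parallel.
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

# Shortened copy of the URL list, kept in a throwaway directory.
workdir=$(mktemp -d)
cat > "$workdir/download_items.txt" <<'EOF'
https://ftp.gnu.org/gnu/parallel/parallel-20120122.tar.bz2
https://ftp.gnu.org/gnu/parallel/parallel-20120122.tar.bz2.sig
EOF

planned=""
if have_gnu_parallel; then
    # :::: reads the arguments from a file; --dryrun prints the commands without running them.
    planned=$(parallel -k --dryrun wget {} :::: "$workdir/download_items.txt")
    printf '%s\n' "$planned"
else
    echo "GNU Parallel is not installed; skipping the example." >&2
fi
```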

3. Perhaps you need to control the number of jobs that GNU Parallel runs at once. If so, add the --jobs or -j parameter to the command. The --jobs parameter limits the number of jobs that can run concurrently to the number you specify.

For example, to limit Parallel to downloading five URLs at a time, the command would look like this:

#!/bin/bash
 cat download_items.txt | parallel --jobs 5 wget {}

The --jobs parameter in the above command can be adjusted to download any number of files at once. Because downloads mostly wait on the network rather than the CPU, the job count can be higher than the number of CPU cores on the computer you are running on.

4. To demonstrate the effect of the --jobs parameter, now adjust the job count and run the time command to measure how long each run takes.

 time cat download_items.txt | parallel --jobs 5 wget {}
 time cat download_items.txt | parallel --jobs 10 wget {}
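When reading the timings, keep in mind that a higher job count only helps while there are more inputs than job slots. The sketch below is a simplification that assumes every download takes about the same time. It counts how many "rounds" are needed for a given number of URLs and jobs. The rounds function is a hypothetical helper, not part of GNU Parallel.

```shell
# rounds ITEMS JOBS: how many rounds are needed when at most JOBS items run at once.
rounds() {
    local items=$1 jobs=$2
    echo $(( (items + jobs - 1) / jobs ))   # integer ceiling of items / jobs
}

urls=4   # the number of URLs in the example download_items.txt
for jobs in 2 5 10; do
    echo "--jobs $jobs: $(rounds "$urls" "$jobs") round(s) for $urls URLs"
done
```

With only four URLs in the file, both --jobs 5 and --jobs 10 finish in a single round, so the two timings should be close. The difference becomes visible once the list is longer than the job count.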

GNU Parallel Example #2: Unzipping Archive Packages

Now that you have all of these archive files downloaded from the previous example, you must now un-archive them.

While in the same directory as the archive packages, run the following Parallel command. Notice the use of the wildcard (*). Since this directory contains both archive packages and the SIG files, you must tell Parallel to only process .tar.bz2 files.

sudo parallel tar -xjf ::: *.tar.bz2
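The shell expands the *.tar.bz2 pattern before Parallel starts, which is why the .sig files never reach tar. A minimal sketch to check that, using empty placeholder files in a temporary directory (both are assumptions made for illustration):

```shell
# Empty placeholder files standing in for the downloads from the previous example.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch parallel-20120122.tar.bz2 parallel-20120122.tar.bz2.sig \
      parallel-20120222.tar.bz2 parallel-20120222.tar.bz2.sig

# Expand the glob the same way the shell would on the Parallel command line.
set -- *.tar.bz2
archive_count=$#
printf '%s\n' "$@"
echo "$archive_count archive(s) matched"
```

Only the two names ending in .tar.bz2 match. The .sig files end in .tar.bz2.sig, so the pattern skips them.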

Bonus! If you are using GNU Parallel interactively (not in a script), add the --bar flag, as in parallel --bar tar -xjf ::: *.tar.bz2, to have Parallel show you a progress bar while the task is running.

Showing the output from the --bar flag

GNU Parallel Example #3: Removing Files

If you’ve followed examples one and two, you should now have many folders in your working directory taking up space. So let’s remove all of those folders in parallel!

To remove all folders that start with parallel- using Parallel, list all of the folders with ls -d and pipe each of those folder paths to Parallel, invoking rm -rf on each folder, as shown below.

Remember the --dryrun flag!

ls -d parallel-*/ | parallel "rm -rf {}"
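A more defensive variant is sketched below, assuming GNU Parallel is installed. It uses find with -print0 and Parallel's -0 option so that folder names containing spaces or other unusual characters are passed through intact, and it keeps --dryrun so nothing is actually deleted. The temporary directory, the placeholder folders, and the have_gnu_parallel helper are made up for this illustration.

```shell
# Hypothetical helper: succeeds only if the "parallel" on PATH identifies itself as GNU Parallel.
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

# Placeholder folders in a throwaway directory.
workdir=$(mktemp -d)
mkdir "$workdir/parallel-20120122" "$workdir/parallel-20120222"

if have_gnu_parallel; then
    # -print0 and -0 separate names with NUL bytes instead of newlines.
    find "$workdir" -maxdepth 1 -type d -name 'parallel-*' -print0 |
        parallel -0 --dryrun rm -rf {}
else
    echo "GNU Parallel is not installed; skipping the example." >&2
fi

# Dry run only: both folders still exist afterwards.
ls "$workdir"
```

Once the printed commands look right, removing --dryrun performs the deletion.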

Conclusion

Now you can automate tasks with Bash and save yourself a lot of time. What you choose to do with that time is up to you. Whether saving time means leaving work a little early or reading another ATA blog post, it’s time back in your day.

Now think about all of the long-running scripts in your environment. Which ones can you speed up with Parallel?
