How to Install Apache Spark for Windows

Published: 15 November 2023 - 8 min. read

Michael Nguyen Tu

In the ever-expanding realm of big data, the need for tools that can efficiently handle and process vast datasets is more crucial than ever. If you wish to unlock the transformative power of data analysis, set up Apache Spark on your Windows machine and embark on a data exploration journey.

Throughout this tutorial, you’ll learn the ropes of installing Apache Spark on your Windows system, positioning yourself to conquer the fascinating world of data analysis.

Transform your Windows environment into a data processing powerhouse!

Prerequisites

Before diving into the installation process, ensure you have all the necessary prerequisites in place.

  • A Windows 7 system or later – This tutorial uses Windows 10 Pro 22H2.
  • Java installed – This tutorial uses Java 1.8.0_391 (you can verify your installed version as shown below).
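
If you are unsure which Java version you have, the quick check below (run from PowerShell) prints the installed version; it assumes Java is already on your PATH.

# Print the installed Java version to confirm Java is available on your PATH
java -version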

Installing Python for Apache Spark on Windows

As you ensure your Windows environment is primed for the upcoming Apache Spark installation, one essential step stands before you — installing Python.

Python serves as the glue that enhances the interoperability and extensibility of Spark. From handling data manipulations to running Spark scripts, Python acts as a powerful catalyst.

To install Python for Apache Spark on Windows, follow these steps:

1. Open your favorite web browser, visit the official Python download page, and download the latest Python installer. At the time of writing, the latest version is Python 3.12.0.

Downloading Python for Apache Spark on Windows

2. Once downloaded, double-click on the installer to begin the installation process.

3. On the Setup window, tick the Add python.exe to PATH option, and click Install Now.

Enabling this option adds Python to your PATH environment variable, which eases the management of Python packages via package managers like pip or conda. As a result, you can install and manage Python libraries without additional configuration.

Continuing with the Python installation

4. Now, select the Disable path length limit option, and click Close to close the Setup wizard.

This option allows Python to use the “Long Paths” feature to access paths longer than the traditional MAX_PATH limit (260 characters) on Windows. This feature is beneficial as some of the packages used by Apache Spark exceed the MAX_PATH limit.

Disabling the path length limit and closing the installer
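
If you’d like to confirm the change, long path support on Windows is controlled by the LongPathsEnabled registry value, which this installer option sets. A minimal PowerShell check (a value of 1 means long paths are enabled):

# Inspect the Windows long path setting (1 = long paths enabled, 0 = disabled)
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' -Name LongPathsEnabled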

5. Lastly, open PowerShell and run the below commands to verify your Python installation.

pip --version
python --version
Checking the installed versions of the pip package manager and Python

Downloading and Installing Apache Spark

With Python installed, you can now lay the foundation for dynamic capabilities to flourish in your Windows environment. Installing Apache Spark opens a world of possibilities where data-driven insights are at your fingertips.

To install Apache Spark on Windows, proceed with the following:

1. Visit the official Apache Spark download page in another web browser tab.

2. Next, download the Apache Spark installation package as follows:

  • Choose a Spark release – Select the latest release from the dropdown field (i.e., 3.5.0).
  • Choose a package type – Pick a suitable one depending on whether you plan to use Hadoop. For this example, choose the Pre-built for Apache Hadoop 3.3 and later option, which is best if you use Spark with Hadoop; otherwise, select the option without Hadoop. Your Spark and Hadoop versions may differ slightly.
  • Download Spark – Click on the hyperlink, which opens a new tab (step three) to select your Apache Spark download location.
Choosing options for downloading Apache Spark

3. Click the first hyperlink or choose one from the alternate download locations to download the Apache Spark package.

Downloading the Apache Spark package

4. Once downloaded, execute the below command to generate a SHA512 hash of the downloaded file.

Ensure you replace spark-3.5.0-bin-hadoop3.tgz with the exact file name of the one you downloaded (or the full path).

certutil -hashfile .\spark-3.5.0-bin-hadoop3.tgz SHA512

Note down the generated hash, as you will make a comparison later.

Generating a SHA512 hash of the Apache Spark package

5. Switch back to the download page and click the checksum hyperlink. Doing so opens the hash file containing the expected SHA512 checksum for your chosen download (step six).

Opening the checksum link

6. Compare the generated hash you noted in step four with the one in the checksum file.

If both hashes match, your download was not corrupted or tampered with, and you can proceed with extracting the downloaded archive.

Comparing the generated hash with the checksum
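
If you prefer to let PowerShell do the comparison, the sketch below uses Get-FileHash instead of certutil; it assumes you paste only the hexadecimal checksum from the .sha512 file into the $expected placeholder (string comparison in PowerShell is case-insensitive by default).

# Compute the SHA512 hash and compare it to the published checksum
$expected = 'PASTE-THE-CHECKSUM-FROM-THE-SHA512-FILE-HERE'  # placeholder for the published hash
$actual = (Get-FileHash .\spark-3.5.0-bin-hadoop3.tgz -Algorithm SHA512).Hash
if ($actual -eq ($expected -replace '\s', '')) { 'Hashes match' } else { 'Hash mismatch - re-download the file' }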

7. Lastly, unzip/extract the contents of the Apache Spark package using tools like 7-Zip or WinRAR to a directory of your choice.

For example, you can extract everything to the C:\spark folder, which effectively installs Apache Spark in your root directory.
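If you’d rather stay in PowerShell, recent Windows 10 and 11 builds ship a built-in tar utility that can extract .tgz archives. A minimal sketch, assuming you extract to C:\spark:

# Create the target folder and extract the Spark archive with the built-in tar utility
New-Item -ItemType Directory -Path 'C:\spark' -Force | Out-Null
tar -xf .\spark-3.5.0-bin-hadoop3.tgz -C 'C:\spark'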

After extraction, the Apache Spark installation directory structure should resemble the following:

Viewing the folder structure

Installing the winutils.exe Utility for Required Functionalities

Now that Apache Spark has found its home on your Windows machine, the next step involves a nuanced touch — installing the winutils.exe utility. Apache Spark requires the winutils.exe utility to run smoothly on Windows.

This utility provides the file and directory operations that Spark and Hadoop rely on when running on Windows. The winutils.exe utility is crucial because these operations are available natively on Linux-based systems but require additional support on Windows.

To install the winutils.exe utility, complete the steps below:

1. Visit the GitHub repository that provides the winutils.exe utility for Windows and look for the latest Hadoop version. Hadoop is an open-source framework designed for the distributed storage and processing of large data sets.

At the time of writing, the latest revision is hadoop-3.3.5, as shown below.

2. Next, run the following command to download the contents of the bin folder (hadoop-3.3.5/bin) to a folder called C:\hadoop\bin.

💡 If you see a newer version, change the folder name hadoop-3.3.5 and the revision number (r335) accordingly (i.e., hadoop-3.3.6 and r336).

This process effectively installs the winutils.exe utility on your Windows system.

svn export https://github.com/cdarlint/winutils/trunk/hadoop-3.3.5/bin@r335 "C:\hadoop\bin"
Downloading the latest Hadoop version
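
If you don’t have an SVN client installed, one hypothetical alternative is to download the individual files with Invoke-WebRequest. This sketch assumes the cdarlint/winutils repository still uses the master branch and the same folder layout, and that winutils.exe and hadoop.dll are the files you need.

# Hypothetical alternative to svn: fetch the key files directly from the repository
# (assumes the repository's default branch is master and the layout is unchanged)
New-Item -ItemType Directory -Path 'C:\hadoop\bin' -Force | Out-Null
$base = 'https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-3.3.5/bin'
Invoke-WebRequest "$base/winutils.exe" -OutFile 'C:\hadoop\bin\winutils.exe'
Invoke-WebRequest "$base/hadoop.dll" -OutFile 'C:\hadoop\bin\hadoop.dll'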

3. Once downloaded, navigate to the C:\hadoop\bin folder and confirm you have the following structure.

You’ll see files, including the winutils.exe utility, that are necessary to run Hadoop operations on Windows.

Verifying the winutils.exe utility

Setting an Environment Variable for Hadoop and Spark Integration

Having laid the groundwork by installing the winutils.exe utility, your journey now takes a turn into integration. You’ll set up the HADOOP_HOME environment variable to ensure Spark can locate and use the necessary Hadoop components.

To set the HADOOP_HOME environment variable, carry out the following:

Execute the below Set-Item command to create a new HADOOP_HOME environment variable and set its value to Hadoop’s install location (C:\hadoop).

When successful, this command does not provide output to the console, but you’ll verify the result in the following step.

Set-Item -Path "Env:\HADOOP_HOME" -Value 'C:\hadoop'

Next, run the following command to verify the HADOOP_HOME environment variable.

$env:HADOOP_HOME

The command returns the value of the HADOOP_HOME environment variable, which in this case is Hadoop’s install location (C:\hadoop).

Verifying the HADOOP_HOME variable
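
Note that Set-Item only sets the variable for the current PowerShell session. If you want HADOOP_HOME to survive new consoles and reboots, a minimal sketch using the .NET environment API (user scope assumed):

# Persist HADOOP_HOME across sessions at user scope; open a new console to pick it up
[Environment]::SetEnvironmentVariable('HADOOP_HOME', 'C:\hadoop', 'User')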

Launching Spark Shells for Interactive Spark Programming

With a fine-tuned environment for seamless Hadoop and Spark integration, prepare yourself for interactive Spark programming. You’ll launch the Spark Shells, interactive command-line interfaces (CLIs) provided by Apache Spark.

Spark Shells allow you to interactively work with Spark, where you can execute Spark jobs and explore your data via Spark’s APIs.

Below are the two main Spark Shells, depending on your preferred programming language:

  • Scala Spark Shell – The default Spark Shell, primarily used for writing Spark applications in the Scala programming language.
  • PySpark Shell – PySpark is the Python API for Apache Spark, and the PySpark Shell is used for writing Spark applications in Python. It is similar to the Scala Spark Shell but geared toward Python developers.

To launch Spark Shells, execute the following:

1. Run each Set-Item command below to set environment variables for Spark Shell and PySpark Shell (i.e., SPARK_SH and PY_SPARK). These commands have no output on the console but set environment variables to store the full paths of both shells, respectively.

Ensure you replace the full paths (C:\spark\spark-3.5.0-bin-hadoop3\bin\) with where you unzipped Spark, followed by the shells (spark-shell and pyspark).

This one-time process saves time and is convenient as you can quickly launch either of the shells without typing or specifying the full path each time.

Set-Item -Path "Env:\SPARK_SH" -Value 'C:\spark\spark-3.5.0-bin-hadoop3\bin\spark-shell'
Set-Item -Path "Env:\PY_SPARK" -Value 'C:\spark\spark-3.5.0-bin-hadoop3\bin\pyspark'  

2. Next, execute the below Start-Process command to launch the Scala Spark Shell ($env:SPARK_SH).

You only need to call the environment variable that holds the Scala Spark Shell’s full path instead of manually typing everything.

Start-Process $env:SPARK_SH

As you launch the Scala Spark Shell, a new console opens where you’ll see the Scala Read-Eval-Print Loop (REPL) prompt (scala>). This prompt lets you interactively enter and execute Scala code with access to Spark APIs.

Launching the Scala Spark Shell

3. When prompted, click Allow access to let Java communicate through the Windows Firewall, as shown below. Otherwise, you may encounter issues launching the Spark Shell.

This prompt appears since Apache Spark uses a Java-based web server, Jetty, to display its Web UIs.

Allowing Java through Windows firewall
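
While a shell session is running, Spark serves its web UI on port 4040 by default (subsequent shells pick the next free port). You can open it from PowerShell as shown below.

# Open the Spark web UI for the running shell session (default port 4040)
Start-Process 'http://localhost:4040'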

4. Now, switch back to the previous PowerShell console and execute the following command to launch the PySpark Shell ($env:PY_SPARK).

Like with the Scala Spark Shell, you only need to call the environment variable that holds the full path of the PySpark Shell.

Start-Process $env:PY_SPARK

The PySpark Shell opens in another console, where you’ll see the Python REPL prompt (>>>). This prompt lets you interactively enter and execute Python code with access to Spark’s PySpark API.

Launching the PySpark Shell

5. Press Ctrl+D, regardless of which shell you launched, to exit the shell, which also closes the console.

Testing Your Spark Setup with Real-world Examples

After venturing into the dynamic realm of launching Spark Shells for interactive programming, the time has come to put your Spark setup to the test.

In this example, you’ll perform a basic analysis of a CSV file containing information about movie ratings via PySpark’s DataFrame API.

To test your Spark setup, perform these steps:

1. Download the sample CSV file containing movie ratings to a directory of your choice (i.e., C:\logs\movie_ratings.csv). Remember the full path, as you will need it for the following step.

2. Next, launch the PySpark shell and run the following command, which has no output but loads the CSV data (movie_ratings.csv) into a DataFrame.

Replace the path (C:\\logs\\movie_ratings.csv) with the one where you saved the sample CSV file in step one.

df = spark.read.csv("C:\\logs\\movie_ratings.csv", header=True, inferSchema=True)

3. Afterward, run the below commands to perform basic data exploration tasks.

# Show the first few rows of the DataFrame
df.show()

# Display the schema of the DataFrame
df.printSchema()

# Compute basic statistics of the "rating" column
df.describe("rating").show()
Performing basic data exploration tasks

4. Lastly, run each command below to perform a basic data analysis task that calculates the average rating for each movie.

# Import necessary functions
from pyspark.sql.functions import avg

# Calculate the average rating for each movie
average_ratings = df.groupBy("movie_id").agg(avg("rating").alias("avg_rating"))
average_ratings.show()
Performing a basic data analysis task

Conclusion

You’ve successfully navigated the twists and turns of setting up Apache Spark on your Windows machine. Moreover, you seamlessly integrated Spark with Hadoop, launched Spark Shells, and tested your Spark setup with real-world examples, conquering each step of the journey.

Armed with a Windows environment primed for data analysis, consider this stage as just the beginning. Continue experimenting with diverse datasets to refine your skills. Now, why not consider exploring advanced Spark features, such as creating Resilient Distributed Datasets (RDDs), distributed processing, and parallel computing?

Spark has a rich set of APIs for various programming languages like Scala, Python, Java, R, and SQL. Use these APIs to develop robust data applications to process massive datasets in real-time!
