AWS Glue Tutorial for Beginners: Effortlessly Transform Data

Published: 12 July 2023 - 7 min. read

Michael Nguyen Tu

Converting raw data into organized, actionable information may sound complex, but it does not have to be. This beginner-friendly AWS Glue tutorial has got your back.

In this tutorial, you will learn the crucial steps of configuring and executing data transformations with AWS Glue.

Explore and streamline data preparation for cloud-based analytics!

Prerequisites

Before working with AWS Glue, ensure you have an active Amazon Web Services (AWS) account with billing enabled. A free tier account will suffice for this tutorial.

Creating an IAM Role for AWS Glue

Before executing a transformation job, you must create an Identity and Access Management (IAM) role that grants permission to the AWS Glue service. This role defines what type of resources AWS Glue is allowed to access in your AWS account.

To create the IAM role, follow the steps below:

1. Open your preferred web browser, and log in to the AWS Management Console.

2. Search for and select IAM in the result list to access the IAM console.

Accessing the IAM console

3. In the IAM console, navigate to Roles (left pane) and click Create role (top-right). Your browser redirects to a new page dedicated to configuring the role.

Initiating creating a new role

4. Now, configure the following settings for the role:

  • Trusted entity type – Select AWS service so an AWS service will trust the role. Doing so allows that service to assume the role and act on your behalf.
  • Use case – Choose Glue under the Use cases for other AWS services section since you will create the IAM role specifically for AWS Glue, and click Next.
Selecting the trust entity type and use case

5. Search for and select the following policies, and click Next.

  • AWSGlueServiceRole – Grants the AWS Glue service the necessary permissions to perform its operations.
  • AmazonS3FullAccess – Grants full access to S3 resources, allowing AWS Glue to read from and write to S3 buckets.
    AWS Glue needs these broad S3 permissions to perform its extract, transform, and load (ETL) tasks effectively.

💡 Avoid granting unnecessary excessive permissions, as they can pose security risks.

Adding permissions for AWS Glue

6. Provide a descriptive name for the role (i.e., glue_role) and a description.

Providing a descriptive name and description for the role

7. Finally, scroll down, review your settings, and click Create role (bottom-right) to finalize creating the role.

Reviewing the role settings and creating the role
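If you prefer to script this setup instead of clicking through the console, the minimal boto3 (Python SDK) sketch below creates an equivalent role. It assumes your AWS credentials are already configured locally, uses the glue_role name from this tutorial, and attaches the AWS-managed AWSGlueServiceRole and AmazonS3FullAccess policies.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the AWS Glue service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Create the role (glue_role is the name used throughout this tutorial)
iam.create_role(
    RoleName="glue_role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Role assumed by AWS Glue for crawlers and ETL jobs",
)

# Attach the two managed policies selected in the console steps above
iam.attach_role_policy(
    RoleName="glue_role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
iam.attach_role_policy(
    RoleName="glue_role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)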

Creating an S3 Bucket and Uploading a Sample File

Now that you have an IAM role for AWS Glue, you need a place to store your data, specifically, an S3 bucket. An S3 bucket provides a centralized location for storing the data that AWS Glue will process.

In this example, AWS Glue will use AWS S3 as a data store for various operations, such as data extraction, transformation, and loading (ETL) tasks.

To create an S3 bucket and upload a sample file, follow these steps:

1. Download a sample data file (for example, the Every Politician data set) to your local machine. This file contains an unstructured collection of records to serve as the input for the AWS Glue transformation job.

2. Search for and select the S3 service to access the S3 console.

Accessing the S3 console

3. Click Create bucket to initiate creating a new S3 bucket.

Initiating creating a new S3 bucket

4. Now, provide a unique name for your bucket (i.e., sampledata54675) and select the region where the bucket should be located.

A unique name is crucial to avoid conflicts with existing bucket names, while the region selection determines the physical location of your bucket’s data.

Providing a name and region for the bucket

5. Scroll down, keep other options as is, and click Create bucket to create the bucket.

Creating the newly-configured S3 bucket

6. Once created, click the hyperlink for the newly created S3 bucket to navigate to the bucket.

Accessing the newly-created bucket

7. Click Upload and locate the sample file you wish to upload.

Initiating uploading a file

8. Lastly, keep other settings as is, and click Upload to upload the sample file to the newly created bucket.

Uploading a sample file to an S3 bucket

If successful, you will see your newly-uploaded file in your bucket, as shown below.

Verifying the newly-uploaded file exists in the bucket
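If you would rather script these steps, the minimal boto3 sketch below creates the bucket and uploads the sample file. The bucket name mirrors the tutorial, while the local file name (countries.csv) is only a placeholder for whichever sample file you downloaded.

import boto3

s3 = boto3.client("s3")

bucket = "sampledata54675"  # bucket names are globally unique, so use your own

# Create the bucket in your default region; outside us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}
s3.create_bucket(Bucket=bucket)

# Upload the sample data file from your local machine
# (countries.csv is a placeholder; use the file name you actually downloaded)
s3.upload_file(Filename="countries.csv", Bucket=bucket, Key="countries.csv")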

Creating a Glue Crawler to Scan and Catalog Data

You have just uploaded sample data to your S3 bucket, but since it is currently unstructured, you need a way to read the data and build a metadata catalog. How? By creating a glue crawler that automatically scans and catalogs the data.

To create a glue crawler, follow the steps below:

1. Navigate to the AWS Glue console via the AWS Management Console, as shown below.

Accessing the AWS Glue console

2. Next, navigate to Crawlers (left pane) and click Create crawler (upper-right) to initiate creating a new glue crawler.

Initiating creating a new crawler

3. Provide a descriptive name (i.e., glue_crawler) and a description for the crawler, keep other settings as is, and click Next.

Setting the crawler name and description

4. Now, click Add a data source under Data sources to initiate adding a new data source to the crawler.

Initiating adding a data source

5. On the popup window, configure the data source as follows:

  • Data source – Select S3 since your data is in your S3 bucket.
  • S3 path – Click Browse S3, and choose the bucket that contains your uploaded sample data (sampledata54675).
  • Keep other settings as is, and click Add an S3 data source to add the sample data to the crawler.
Adding an S3 data source

6. Once configured, verify the data source, as shown below, and click Next to continue.

Verifying the configured data source

7. On the next screen, select the IAM role you created earlier (glue_role), keep other settings as is, and click Next.

Configuring the security settings

8. Under output and scheduling, click Add database to initiate adding a new database to store the processed data and metadata generated by your glue crawler. This action opens a new browser tab, where you will configure your database details (step nine).

This database provides a structured representation of the data for querying and analysis.

Initiating adding a new target database

9. On the new browser tab, provide a descriptive database name (i.e., glue_database), and click Create database to create the database.

Naming and creating the new database

10. Switch to the previous browser tab, select the newly-created database (glue_database) from the drop-down, keep other settings as is, and click Next.

Setting a target database (glue_database)

11. Lastly, review your settings on the final screen to ensure they are accurate, and click Create crawler (bottom-right) to create the new crawler.

Creating the new crawler

If everything goes well, you will see a screen confirming the successful creation of the crawler. Do not close this screen yet; you will run this crawler in the following section.

Overviewing the crawler properties
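As a scripted alternative to the console wizard, the boto3 sketch below creates both the target database and the crawler. It assumes the glue_role IAM role and the sampledata54675 bucket from the earlier sections already exist.

import boto3

glue = boto3.client("glue")

# Create the target database that will hold the crawler's metadata tables
glue.create_database(DatabaseInput={"Name": "glue_database"})

# Create a crawler that scans the sample bucket using the IAM role created earlier
glue.create_crawler(
    Name="glue_crawler",
    Role="glue_role",
    DatabaseName="glue_database",
    Targets={"S3Targets": [{"Path": "s3://sampledata54675/"}]},
    Description="Crawls the sample data bucket and catalogs its schema",
)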

Running the Glue Crawler to Build a Metadata Catalog

With a new crawler at your disposal, the next step is to run it to kick off the scanning and cataloging process. The crawler builds a metadata catalog that provides a structured representation of your data for querying and analysis purposes.

To run your newly-created glue crawler:

1. On the crawler details page, click Run crawler under the Crawler runs tab to initiate the execution of the crawler.

Initiating the execution of the crawler

Once the crawler starts running, you will see its status and progress on the crawler details page.

Depending on the size and complexity of your data, the crawler may take some time to complete its execution. You can periodically refresh the page to see the updated status of the crawler.

Overviewing the crawler’s execution

Once the crawler has completed its execution, the status changes to Completed, as shown below. At this point, you can proceed with querying your data.

Verifying the crawler status
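If you are automating this workflow, you can also start the crawler and wait for it to finish with boto3. The sketch below polls the crawler until it returns to the READY state, which indicates the run has completed; the 30-second polling interval is an arbitrary choice.

import time
import boto3

glue = boto3.client("glue")

# Start the crawler, then poll its state until it returns to READY
glue.start_crawler(Name="glue_crawler")

while True:
    state = glue.get_crawler(Name="glue_crawler")["Crawler"]["State"]
    if state == "READY":  # READY means the run has finished
        break
    print(f"Crawler state: {state}")
    time.sleep(30)  # arbitrary polling interval

# Report the outcome of the most recent run (SUCCEEDED, FAILED, or CANCELLED)
last_crawl = glue.get_crawler(Name="glue_crawler")["Crawler"]["LastCrawl"]
print(f"Last crawl status: {last_crawl['Status']}")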

2. Next, navigate to Databases (left pane), and click your database to access its properties and tables.

Accessing the database

3. Finally, click the table named after your bucket (sampledata54675) to view the metadata the crawler extracted from it.

Accessing the bucket that has transformed into a table

If successful, you will see information similar to that shown below. This information confirms that the crawler successfully cataloged the data from your bucket into a database table.

Viewing transformed data from the bucket to a table
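You can perform the same verification programmatically. The short boto3 sketch below lists the tables the crawler created in glue_database, along with the column names and types it inferred.

import boto3

glue = boto3.client("glue")

# List the tables the crawler added to glue_database, with the columns it inferred
for table in glue.get_tables(DatabaseName="glue_database")["TableList"]:
    print(table["Name"])
    for column in table["StorageDescriptor"]["Columns"]:
        print(f"  {column['Name']}: {column['Type']}")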

Querying Cataloged Data via Amazon Athena

Now that your data is available in the AWS Glue Data Catalog, you can use various tools to query and analyze it. One such tool is Amazon Athena, an interactive query service that enables you to analyze data directly in Amazon S3 using standard SQL.

To query the data using Amazon Athena, follow the steps below:

1. Search for and access the Athena console.

Accessing the Athena console

2. Select the database where your data is cataloged under the Data section as follows:

  • Data source – Select AwsDataCatalog to indicate that you want to query the data cataloged in AWS Glue.
  • Database – Select the appropriate database from the drop-down field (i.e., glue_database).

💡 If you do not see your desired database in the drop-down, ensure the crawler has completed its execution and cataloged the data.

Selecting the appropriate database for querying data

3. Finally, populate and run the following query in the query editor on the right.

This query returns the first 10 rows from the sampledata54675 table in the glue_database database. Feel free to modify the query to suit your specific requirements.

SELECT *
FROM "glue_database"."sampledata54675"
LIMIT 10;
Querying data from a database

If the query is successful, you will see the results in the Result pane, as shown below. The results contain information about the records stored in the table based on your SQL query.

Take note of the column names, data types, and values returned in the result set. This information helps you understand the structure and content of the queried data.

Viewing the query results
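Athena queries can also be submitted programmatically. The boto3 sketch below runs the same query, waits for it to finish, and prints the result rows. Note that the Athena API requires an S3 output location for query results; the athena-results/ prefix used here is a placeholder you should point at a bucket you own.

import time
import boto3

athena = boto3.client("athena")

# Athena writes query results to S3; this output location is a placeholder
output_location = "s3://sampledata54675/athena-results/"

# Submit the same query shown in the query editor above
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "glue_database"."sampledata54675" LIMIT 10;',
    QueryExecutionContext={"Database": "glue_database"},
    ResultConfiguration={"OutputLocation": output_location},
)
query_id = execution["QueryExecutionId"]

# Wait for the query to reach a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Print each row of the result set (the first row holds the column headers)
results = athena.get_query_results(QueryExecutionId=query_id)
for row in results["ResultSet"]["Rows"]:
    print([field.get("VarCharValue") for field in row["Data"]])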

Conclusion

In this tutorial, you have learned the basics of using AWS Glue to create a Glue crawler, catalog your data, and query that data with Amazon Athena. Data preparation and analysis are essential for any data-driven application, and tools like AWS Glue provide a quick way to extract, transform, and load (ETL) data from various sources into queryable tables.

With AWS Glue, you can now quickly manage and organize data, allowing you to focus more on analyzing and deriving insights from your data. But what you have seen is just the tip of the iceberg. Explore the wide range of capabilities and functionalities AWS Glue can offer!

Why not leverage AWS Glue connections to seamlessly integrate with other AWS services, such as Amazon RDS or Amazon Redshift? This integration enables you to build complex ETL pipelines and achieve even greater data analysis capabilities.
