Getting Started with Azure Data Factory

Victor Silva


Data is often called the new oil, but how does that affect a business transitioning to the cloud? And what happens when you migrate with raw, disorganized data scattered across different storage systems? Microsoft Azure Data Factory is designed to address these questions.

Microsoft Azure Data Factory offers extract-transform-load (ETL) and data integration services. These services let you build data-driven workflows called data pipelines.

Sounds good? Let’s explore more about how awesome Azure Data Factory is!

Prerequisites

If you’d like to follow along with the step-by-step tutorial, be sure you have an Azure subscription. If you don’t have one, you can register for a free Azure account.

What is Azure Data Factory?

In a nutshell, Azure Data Factory provides a cloud-based data integration service that orchestrates the migration and transformation of data between different data stores and compute resources.

One of the most useful things you can do with Azure Data Factory is copy data to and from different Software-as-a-Service (SaaS) applications, on-premises data stores, and cloud data stores. During the copy process, you can even convert file formats.

After getting the data, you can move on to the transformation side of the ETL process with Data Flows, which run on the Integration Runtime (IR).

Creating an Azure Data Factory

An introduction alone isn’t enough to understand how Azure Data Factory works, so you’ll create an actual Azure Data Factory!

1. First, open your favorite web browser and navigate to the Azure Portal.

2. Next, click on the portal menu icon at the top-left of the page and select All services from the menu panel. Doing so lets you view all services available to you.

Accessing All Services

3. Click on Analytics > Data factories to open the overview page, where you’ll see all of your existing data factory resources.

Accessing Data Factories

4. Now, click Create to open the assistant to create a new data factory resource.

Starting to create an Azure Data Factory resource

5. Select the subscription and resource group to place the new resources. But if you prefer to create a new resource group, click the Create new link, enter your preferred name and click OK.

Creating a New Resource Group for Deploying Azure Data Factory

6. Now, configure the instance details with the following:

  • A region to create the resource, typically choosing the location closest to you.
  • A name you prefer for the resource.
  • Check that the version is V2, to get the latest features and improvements (recommended by default).
  • Click on the Next: Git configuration button to continue the creation wizard.
Defining the Instance Details

7. Check the Configure Git later box since you’re only creating the instance for now. Click on Next: Networking to access the Networking configuration tab.

Postponing the Git configuration

8. Click on Review + create to finish the configuration steps. By default, the connectivity method is set to Public endpoint, which is fine for this demo.

The Advanced config tab is only necessary if you need to specify a custom key to encrypt your data instead of using the Microsoft-managed key. The Tags config tab is useful when, for example, you use tags to identify resources within a project. For this demo, neither tab is necessary.

Setting Up Network Configuration

9. Once the final validation passes, click on the Create button to instruct Azure to begin creating the Azure Data Factory instance.

Passing Validations and Creating Azure Data Factory
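
The portal steps above can also be expressed as a single request against Azure's management REST API. The following is a minimal Python sketch rather than the method this tutorial uses: it only builds the request for creating a factory and sends nothing. The subscription ID, resource group name (demo-rg), factory name (demo-adf), and region (eastus) are placeholder assumptions, and you would still need an authenticated HTTP client to submit the request.

```python
import json

# Azure Resource Manager endpoint and the API version used by Data Factory V2.
ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2018-06-01"


def build_factory_request(subscription_id, resource_group, factory_name, region):
    """Return the (method, url, body) triple for a create-or-update call.

    Nothing is sent over the network; this only assembles the request.
    """
    url = (
        f"{ARM_ENDPOINT}/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory_name}"
        f"?api-version={API_VERSION}"
    )
    # The region is the only instance detail required in the body here.
    body = {"location": region}
    return "PUT", url, json.dumps(body)


# Placeholder values (hypothetical), not taken from the tutorial.
method, url, body = build_factory_request(
    "00000000-0000-0000-0000-000000000000", "demo-rg", "demo-adf", "eastus"
)
print(method, url)
print(body)
```

Building the request separately from sending it keeps the sketch runnable without credentials and makes it easy to inspect what would be submitted.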

Creating an Azure Data Lake Storage Account

Now that you have an Azure Data Factory, you can start copying data. But first, create the Azure Data Lake Storage account that Azure Data Factory will copy data into.

1. On the Azure Portal, hover your mouse over the Storage accounts button, and click Create on the pop-up window that appears, as shown below. Doing so redirects you to the “Storage Account” page.

Creating a Storage Account from Azure Portal

2. Under the Basics tab, choose the same resource group and region you selected in the “Creating an Azure Data Factory” section (steps five and six).

Setting up Resource Group and Region for Storage Account

3. Give your storage account a unique name. In this example, the storage account is named demodfstorageaccount. Choose the Standard performance option and change redundancy to Locally-redundant storage (LRS) to keep costs down for these demos. Click Next: Advanced to get to the Advanced tab.

Defining the Name of the Storage Account and Redundancy

4. On the Advanced tab, under Data Lake Storage Gen2, check the Enable hierarchical namespace option. Selecting this option turns your storage account into a data lake instead of a regular blob storage account.

Click on Review + create to review your storage account settings.

Enabling Hierarchical Namespace to the Storage Account

5. After the validation passes, click on the Create button to finish creating your Azure Data Lake Storage account.

Validating and Finalizing Storage Account Creation
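
The choices made in this section map onto a small set of properties. The sketch below, written in Python as an illustration and not something the portal asks you to write, checks a candidate name against the storage account naming rule (3 to 24 characters, lowercase letters and digits only) and builds a Resource Manager style request body with Standard LRS redundancy and the hierarchical namespace enabled. The region value (eastus) is a placeholder assumption.

```python
import re

# Storage account names: 3 to 24 characters, lowercase letters and digits only.
NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name):
    """Check the format only; global uniqueness can only be verified by Azure."""
    return NAME_PATTERN.fullmatch(name) is not None


def build_storage_account_body(region):
    """Request body mirroring the portal choices made in this section."""
    return {
        "location": region,
        "kind": "StorageV2",
        # Standard performance with locally-redundant storage.
        "sku": {"name": "Standard_LRS"},
        # Hierarchical namespace turns the account into Data Lake Storage Gen2.
        "properties": {"isHnsEnabled": True},
    }


print(is_valid_storage_account_name("demodfstorageaccount"))
print(is_valid_storage_account_name("Demo_Storage"))
print(build_storage_account_body("eastus"))  # "eastus" is a placeholder region
```

The name check covers format only. A name that passes it can still be rejected if another account already uses it.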

Performing a Data Ingest with the Copy Data Tool

You now have an active storage account, so all that’s left is to test that it works. You’ll copy a sample CSV file into a blob container called weather in your Azure Data Lake Storage account.

The datasets selected for this demo are from the National Centers for Environmental Information of the National Oceanic and Atmospheric Administration (NCEI-NOAA). NOAA also offers a sample of the daily summaries from the Global Historical Climatology Network (GHCN), which you can download for free as a CSV file.

1. In your browser, navigate to the Azure Data Factory home page, and then click Ingest to open the Copy Data Tool where you’ll create a copy job.

Accessing Copy Data Tool

2. On the Properties page, choose the Built-in copy task option to create a single pipeline. Keep the default Run once now option selected for the task schedule, and click Next.

Selecting the Run once now option runs the copy job immediately after you set it up.

Selecting the Built-in Copy Task

3. Next, on the Source data store page, choose HTTP from the Source type dropdown to use the selected dataset for this demo. Click New connection to specify the source URL.

Adding a New Connection on the Source Data Store

4. Configure the new connection with the following:

  • Give a name to the connection (the definition of the data source)
  • Copy and paste the dataset URL below for this demo to the Base URL field: https://www1.ncdc.noaa.gov/pub/data/cdo/samples/GHCND_sample_csv.csv
  • Change the Authentication type to Anonymous
  • Keep the other default options, and click on
Configuring the New Connection
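
Behind the form, a connection like this one is stored as a JSON definition called a linked service. As a rough sketch of what the settings above correspond to, the Python snippet below builds such a definition for an anonymous HTTP source. The connection name (NoaaSampleHttp) is a hypothetical placeholder, and the definition the tool actually saves may carry additional properties.

```python
import json

# Dataset URL from the tutorial.
SAMPLE_URL = "https://www1.ncdc.noaa.gov/pub/data/cdo/samples/GHCND_sample_csv.csv"


def build_http_linked_service(name, base_url):
    """Linked service definition for an HTTP source with anonymous access."""
    return {
        "name": name,
        "properties": {
            "type": "HttpServer",
            "typeProperties": {
                "url": base_url,
                "authenticationType": "Anonymous",
            },
        },
    }


# "NoaaSampleHttp" is a made-up connection name for illustration.
linked_service = build_http_linked_service("NoaaSampleHttp", SAMPLE_URL)
print(json.dumps(linked_service, indent=2))
```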

5. Click on Next since you don’t need to change or specify anything for the source data store in this demo.

Configuring source data store

6. Now, check the First row as header option so the first row is treated as a header. Set the Column delimiter to Comma (,) because the source dataset is a CSV file.

Click Preview data to view the sample dataset shown in the next step.

File format settings page
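
To see what the First row as header and Column delimiter settings mean in practice, here is a small Python sketch that parses delimited text both ways. It is an analogy using Python's csv module, not how Azure Data Factory is implemented. The sample rows and column names (STATION, DATE, TMAX) are made up for illustration and are not taken from the NOAA file, and the col_0 style fallback names are placeholders.

```python
import csv
import io

# Made-up rows for illustration only; they are NOT taken from the NOAA file.
SAMPLE = "STATION,DATE,TMAX\nST001,20100101,25\nST001,20100102,27\n"


def read_delimited(text, delimiter=",", first_row_as_header=True):
    """Parse delimited text into a list of dictionaries, one per data row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    if first_row_as_header:
        # The first row supplies the column names and is not treated as data.
        header, data = rows[0], rows[1:]
    else:
        # Without a header, every row is data and columns get placeholder names.
        header = [f"col_{i}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]


with_header = read_delimited(SAMPLE)
without_header = read_delimited(SAMPLE, first_row_as_header=False)
print(len(with_header), with_header[0])
print(len(without_header), without_header[0])
```

With the header option on, the sample yields two data rows keyed by column name. With it off, the header line is counted as a third data row.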

7. Scroll through the dataset to see its content, then close the window and click Next.

Preview data from the dataset

8. On the Destination data store page, choose your previously created storage account (Azure Data Lake Storage Gen2) from the Target type dropdown.

Click the New connection button to open the assistant to create a new connection in the next step.

Configuring Target Destination Data Store

9. Configure the new connection with the following:

  • Enter your preferred name for the new connection in the Name field.
  • Click on the Storage account name dropdown and choose the storage account you created in step three of the “Creating an Azure Data Lake Storage Account” section (demodfstorageaccount).
  • Click on Create to create the new connection.
Configuring New Connection for the Azure Data Lake Storage

10. Now, enter your preferred Folder path and File name. For this demo, the dataset is copied to the weather folder with the file name GHCN-daily.

Keep the other default options, and click Next.

Setting up Destination Folder path and File name
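
Other tools that read from Data Lake Storage Gen2 commonly address a file with an abfss URI made up of the container, the storage account, and the path. The Python sketch below assembles that URI for the destination chosen above. It assumes that weather is the container (file system) name, as described earlier in this section. The Copy Data Tool itself does not ask you to type this URI.

```python
def build_abfss_uri(file_system, account_name, path):
    """Build an abfss URI for a file in a Data Lake Storage Gen2 account."""
    # Leading slashes are dropped so the path is not doubled up after the host.
    return (
        f"abfss://{file_system}@{account_name}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


# Names from the tutorial: container "weather", account "demodfstorageaccount",
# and file name "GHCN-daily".
uri = build_abfss_uri("weather", "demodfstorageaccount", "GHCN-daily")
print(uri)
```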

11. In the File format settings page, confirm the File format is set to Text format and the Column delimiter is set to Comma (,), then click Next.

Setting the delimiter to Comma (,) is crucial since you’re creating a copy job for a CSV file.

File format settings page

12. Configure the Settings page with the following:

  • Write a pipeline name in the Task name field.
  • Change the fault tolerance to Skip incompatible rows to avoid errors and check the Enable logging option to save logs.
  • Specify weather/errors (these folders will be created) as the Folder path to save the logs.
  • Click Next to review the overall settings of the copy job, shown in the next step.
Configuring Pipeline Settings for the Deployment Process
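
The idea behind Skip incompatible rows with logging enabled can be illustrated with a few lines of Python. This is a conceptual sketch only and not Azure Data Factory's actual implementation: rows whose column count differs from the expected count are set aside with their line number instead of stopping the copy. The sample text is made up for illustration.

```python
import csv
import io


def copy_skipping_incompatible(text, expected_columns):
    """Split rows into copied rows and skipped (line number, row) pairs."""
    copied, skipped = [], []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) == expected_columns:
            copied.append(row)
        else:
            # A real copy job would write this entry to the log folder.
            skipped.append((line_no, row))
    return copied, skipped


# Made-up input: the third line has only two of the three expected columns.
SAMPLE = "a,b,c\n1,2,3\n4,5\n6,7,8\n"
copied, skipped = copy_skipping_incompatible(SAMPLE, expected_columns=3)
print(copied)
print(skipped)
```

Here the copy finishes with three rows copied (including the header line) and one entry recorded for line 3, which mirrors the way the log folder collects rows that could not be written.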

13. On the Summary page, review the overall settings for the copy job, and click Next to start the deployment process.

Below, you can see that you are ingesting data from an HTTP source to an Azure Data Lake Storage Gen2 target. Since you chose the “run once now” option in step two, the pipeline executes immediately.

Viewing Summary of the Copy Data Tool Process
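
What the summary describes is, underneath, a pipeline with a single Copy activity. The Python sketch below builds a trimmed-down version of such a pipeline definition as a dictionary. The pipeline, activity, and dataset names (CopyWeatherSample, CopyHttpToDataLake, HttpSourceCsv, DataLakeSinkCsv) are hypothetical, and the definition the tool generates includes more settings than are shown here.

```python
import json


def build_copy_pipeline(pipeline_name, source_dataset, sink_dataset):
    """Minimal pipeline definition with one Copy activity for delimited text."""
    copy_activity = {
        "name": "CopyHttpToDataLake",  # hypothetical activity name
        "type": "Copy",
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            # Both sides are delimited text (CSV) in this demo.
            "source": {"type": "DelimitedTextSource"},
            "sink": {"type": "DelimitedTextSink"},
        },
    }
    return {"name": pipeline_name, "properties": {"activities": [copy_activity]}}


# All three names below are made up for illustration.
pipeline = build_copy_pipeline("CopyWeatherSample", "HttpSourceCsv", "DataLakeSinkCsv")
print(json.dumps(pipeline, indent=2))
```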

14. Once the deployment is complete, click on the Monitor button to view the execution on the “Monitor” page.

Click on the Monitor button to open the Monitor page

After the deployment starts and you click Monitor, you’re directed to the Pipeline runs page, as shown below.

Deployment executed successfully
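
The Monitor page shows a run moving through statuses such as Queued and InProgress until it reaches a final status such as Succeeded. If you wanted to watch a run from a script instead, the general pattern is a polling loop like the Python sketch below. The get_status callable is a hypothetical stand-in for whatever client call returns the current status, and the example feeds it a scripted sequence instead of contacting Azure.

```python
import time

# Statuses after which a run will not change any further.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Cancelled"}


def wait_for_run(get_status, poll_seconds=0.0, max_polls=100):
    """Poll get_status() until the run reaches a terminal status."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("pipeline run did not finish in time")


# Scripted statuses standing in for a real monitoring call (hypothetical).
fake_statuses = iter(["Queued", "InProgress", "InProgress", "Succeeded"])
final_status = wait_for_run(lambda: next(fake_statuses))
print(final_status)  # prints "Succeeded"
```

Passing the status lookup in as a function keeps the loop independent of any particular SDK, so the same logic can be tried locally before it is wired to a real client.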

15. Finally, open Azure Storage Explorer, and you’ll see your newly copied data like the example below.

Viewing Newly Moved Data in Azure Storage Explorer

Conclusion

In this tutorial, you learned how to use the Copy Data Tool to download a CSV file from a demo dataset and load it into your Azure Data Lake Storage account. Microsoft Azure Data Factory is a great tool for moving your data into the cloud quickly and efficiently.

What’s next? Why not explore different data sources and datasets? Or transform your data using the file format settings, and save them to a different target data store?
