Azure Data Factory | InsideMSTech

Azure provides many options for data ingestion and Azure data factory is one of them. It is an option for scenarios where you need to transfer data regularly or in technical terms it is cloud data integration service known as data pipeline. Azure data factory works on two key pillars, i.e. data movement and data transformation. This cloud based data integration service allows to create data-driven workflows and orchestration/automation of the data movement and transformation processes.

Courtesy: Microsoft

Let me explain you this through one scenario:

One of the super market is going through the transformation and looking for ways to increase the revenue and customer satisfaction. Many stores are doing well in terms of revenue and customer satisfaction while few stores are struggling to achieve the same. Business has decided to close few existing stores and open new stores in new locations as well. Super market stores capture the customer satisfaction by a survey machines installed on each PoS system. When customer makes the payment, cashier request them to provide a feedback. This feedback system runs on cloud-native app and store the data directly to the cloud in synchronous mode. This organization wants to analyze userbase based on the demographics. All the billing related data is being stored in ERP system that resides in on-premises datacenter.

Strategy team has provided an approach to generate and visualize useful data for new markets. To fulfill this need, you need to consolidate all the data in one place and because of the continuity, it is not one-time job as they need to compare data based on the days, weeks, month, year, time and season wise. With Azure Data Factory you can move your data continuously using data pipeline, and once data has been moved there then you can first transform this data based on the need and later use this data with any systems or use analytics tools like Power BI to visualize the data. Here is the process, which differs between version 1 and 2.

Azure Data Factory v1:

Azure Data Factory v2:

Now let’s understand the process in detail:

Connect & Collect: Whenever you need to play with data, first you need to collect it. In layman language, you can copy the data from multiple sources in different ways such as using some copy utility, FTP/SFTP, scripts etc. This data can be in multiple forms such as structured, un-structured and semi-structured, and can be extracted from multiple sources such as on-premises, SaaS solutions, database, file shares etc. Once you have multiple data sources, frequency and availability of data will also differ. Azure data factory can connect to multiple data sources and collect the data into the centralized data store such as Azure Blob Storage or Azure data lake store etc.

Transform & Enrich: Once the raw data has been collected, you can transform the data using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics and Machine Learning.

Publish: Once you have transformed the data now you can use this valuable data anywhere in the cloud or can send back this data to on-premises as well. This data can be used by any analytics tool such as Power BI to visualize and generate the reports or can be loaded into the Azure Data Warehouse, Azure SQL Database, Azure cosmoDB or anywhere else for further use.

Monitor: Azure Data Factory v2 provides the monitoring capabilities to monitor established data integration pipelines for various purposes. You can leverage built-in support for pipeline monition via PowerShell, log analytics, Azure monitor, API and health panels on the Azure portal.

At present, Azure Data Factory is available in selected regions only. ADF v1 is available in East US, East US2, West US, West Central US and North Europe region while ADF v2 is available in East US, East US2, West US, West Central US, North Europe and West Europe regions. However, a data factory can use compute resources and data stores from other regions as well. Therefore, you can use this service by leveraging ADF from selected regions.

Azure Data Factory pricing can be calculated based on the four parameters:

Number of activities run.
Volume of data moved.
SQL Server Integration Services (SSIS) compute hours.
Whether a pipeline is active or not.

You can calculate your pricing here.

At present, Azure Data Factory version 2 is in preview.

Cloud has become a prominent option for all kind of organizations. When any medium to large organization moves to the cloud, data transfer becomes a biggest challenge. To address this concern, Microsoft provides different types of data transfer options to the customers. Before you get into the details of data transfer options, answer the following questions:

How much data, we need to migrate?
What is going to be the frequency of data transfer?
Data source and destination locations, and respective data regulations?
Find the bottlenecks that may arise at the time of migration?
Do the cost, time and effort comparison between possible data migration types?

If you look from the Microsoft point of view, they have divided the data transfer into four major categories:

Physical data transfer
Data transfer using command line tools and APIs
Data transfer using graphical user interface
Data pipeline

Let me explain you briefly about each one of the data transfer methodology:

Physical data transfer: Widely used when you have large data sets to migrate. It could be leveraged for either one-time data migration activity or for less frequent data migration activity. For physical data transfer you can choose one data methodology based on the size.

Azure Import/Export: The Azure Import/Export can be used to transfer large amount of data using internal SATA HDDs or SSDs. By using this service, you can securely transfer data from on-premises to the cloud blob or file storage and vice-versa. When procuring drives for this service, don’t get confuse between SATA and SAS drives. Order SATA III drives as it is faster than older version of SATA drives and support speed of 6 Gbps.
Azure Data Box: Azure Data Box is an option to transfer very large amount of data, it is very similar to Azure Import/Export but avoids the hurdles of procuring, writing and sending multiple data disks. In this service, Microsoft provides you secure and reliable appliance to transfer data between on-premises and cloud blob and file storage. It is much easier than Azure Import/Export service as Microsoft takes the responsibility of end-to-end logistics.

Data transfer using command line tools and APIs: used when you have enough bandwidth available to migrate limited amount of data between on-premises and cloud blob and file storage. There are multiple tools available to perform this activity.

AzCopy: It is command line to tool to transfer data to and from Azure blob, file and table storage in fast, secure and reliable manner. You can install this tool on Windows or Linux machine to transfer data. It supports parallelism and the ability to resume copy operation when interrupted.
Azure CLI: It is a command line tool to manage Azure services and to upload data to Azure storage. Azure CLI doesn’t need any installation and configuration as it is available through Azure portal itself.
PowerShell: PowerShell is an alternative option for windows administrators to transfer data.

Data transfer using graphical user interface: is a most simpler way of transferring data between cloud on-premises. You have two options available to transfer data using graphical tools.

Azure Portal: Simplest way of exploring and uploading files to the Azure blob storage and data lake store but it has a limitation of exploring and uploading only file at a time.
Azure Storage Explorer: Azure storage explorer is a great option for GUI lovers, it provides a capability to manage, upload and download files through interactive interface for blobs, files, queues, tables and Azure Cosmos DBs objects. It also allows to manage data between blob storage, and between storage accounts.

Data Pipeline: used when you need to transfer data regularly.

Azure Data Factory: It is an option to transfer and transform data using data-driven workflows (a.k.a data pipeline) on a regular basis by leveraging orchestration or automation processes. It is a managed service to transfer data between Azure services, on-premises, or a combination of the two. The workflow can be created and scheduled based on your requirements. It can process and transform the data by leveraging compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.

Apart from the above core data transfer options, following are the list of tools that can be leveraged to transfer data within specific Azure services.

Adlcopy
Distcp
Sqoop
PloyBase
Hadoop command line

InsideMSTech

Learn and Share

Tag Archives: Azure Data Factory

#Azure : Azure Data Factory

#Azure : Data transfer