Even the most experienced data engineer can feel overwhelmed by the range of data platform technologies in Microsoft Azure. In diverse scenarios and industries, data engineers must solve complex data problems to provide business value through data. By understanding the data types and capabilities of the data platform technologies, a data engineer can pick the right tool for the job.
Imagine you’re a data engineer working for an organization that’s starting to explore cloud capabilities. Executives have asked the network infrastructure team to explain the benefits and drawbacks of running IT operations in Azure. The network team approaches you for information about Azure data services. Could you answer their high-level questions? This module will help you achieve that objective.
Understand data storage in Azure Storage
Azure Storage accounts are the base storage type within Azure. Azure Storage offers a massively scalable object store for data objects and file system services in the cloud. It can also provide a messaging store for reliable messaging, or it can act as a NoSQL store.
Azure Storage offers four configuration options:
- Azure Blob: A scalable object store for text and binary data
- Azure Files: Managed file shares for cloud or on-premises deployments
- Azure Queue: A messaging store for reliable messaging between application components
- Azure Table: A NoSQL store for no-schema storage of structured data
Data ingestion in Blob
To ingest data into your system, use Azure Data Factory, Storage Explorer, the AzCopy tool, PowerShell, or Visual Studio. If you use the File Upload feature to import files larger than 2 GB, use PowerShell or Visual Studio instead. AzCopy supports a maximum file size of 1 TB and automatically splits data files that exceed 200 GB. Note that you can't query data while it sits in Blob storage; to analyze it, you have to move the data into a store that supports queries.
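As an illustration, a typical AzCopy (v10) upload of a local folder to a Blob container looks like the following. The account, container, and SAS token are placeholders for your own values:

```sh
# Upload every file in a local folder to a Blob container.
# <account>, <container>, and <SAS-token> are placeholders.
azcopy copy "C:\local\data" "https://<account>.blob.core.windows.net/<container>?<SAS-token>" --recursive
```

The `--recursive` flag walks subfolders so the whole directory tree lands in the container.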
Understand data storage in Azure Data Lake Storage
Azure Data Lake Storage is a Hadoop-compatible data repository that can store any size or type of data. This storage service is available as Generation 1 (Gen1) or Generation 2 (Gen2). Data Lake Storage Gen1 users don’t have to upgrade to Gen2, but they forgo some benefits.
Data Lake Storage Gen2 users take advantage of Azure Blob storage, a hierarchical file system, and performance tuning that helps them process big-data analytics solutions. In Gen2, developers can access data through either the Blob API or the Data Lake file API. Gen2 can also act as a storage layer for a wide range of compute platforms, including Azure Databricks, Hadoop, and Azure HDInsight, without the data having to be loaded into those platforms.
Here are the key features of Data Lake Storage:
- Unlimited scalability
- Hadoop compatibility
- Security support for both access control lists (ACLs) and POSIX compliance
- An optimized Azure Blob File System (ABFS) driver that’s designed for big-data analytics
- Zone-redundant storage
- Geo-redundant storage
In Data Lake Storage Gen1, data engineers query data by using the U-SQL language. In Gen2, use the Azure Blob Storage API or the Azure Data Lake Storage (ADLS) API.
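To give a feel for Gen1 querying, here's a minimal hypothetical U-SQL script; the file paths and column names are assumptions for illustration:

```usql
// Hypothetical U-SQL script for Data Lake Storage Gen1.
// Extract a CSV file from the lake, skipping the header row.
@sales =
    EXTRACT Region string,
            Amount decimal
    FROM "/input/sales.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Filter to one region. U-SQL expressions use C# syntax, hence ==.
@australia =
    SELECT Region, Amount
    FROM @sales
    WHERE Region == "Australia";

// Write the filtered rows back to the lake.
OUTPUT @australia
    TO "/output/australia-sales.csv"
    USING Outputters.Csv();
```

The script runs as a batch job over files in the lake; there's no always-on database engine behind it.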
Understand Azure Cosmos DB
Azure Cosmos DB is a globally distributed, multimodel database. You can deploy it by using several API models:
- SQL API
- MongoDB API
- Cassandra API
- Gremlin API
- Table API
Because of the multimodel architecture of Azure Cosmos DB, you benefit from each model's inherent capabilities. For example, you can use MongoDB for semistructured data, Cassandra for wide columns, or Gremlin for graph databases. When you move your data from SQL, MongoDB, or Cassandra to Azure Cosmos DB, applications that were built by using the SQL, MongoDB, or Cassandra APIs continue to operate.
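For instance, with the SQL API you query JSON items by using a familiar SQL dialect. In this hypothetical query, `c` is the conventional alias for the container's items, and the property names are assumptions for illustration:

```sql
-- Hypothetical Cosmos DB SQL API query over JSON items aliased as c.
SELECT c.id, c.customerName, c.total
FROM c
WHERE c.category = "toys" AND c.total > 100
ORDER BY c.total DESC
```

The same JSON items could instead be reached through the MongoDB or Table APIs if the account was created with one of those models.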
When to use Azure Cosmos DB
Deploy Azure Cosmos DB when you need a NoSQL database that uses one of the supported API models, at planet scale, with low-latency performance. Currently, Azure Cosmos DB supports five-nines uptime (99.999 percent). When it's provisioned correctly, it can support response times below 10 ms.
Consider this example where Azure Cosmos DB helps resolve a business problem. Contoso is an e-commerce retailer based in Manchester, UK. The company sells children’s toys. After reviewing Power BI reports, Contoso’s managers notice a significant decrease in sales in Australia. Managers review customer service cases in Dynamics 365 and see many Australian customer complaints that their site’s shopping cart is timing out.
Contoso's network operations manager confirms the cause: the company's only datacenter is located in London, and the physical distance to Australia is introducing latency. Contoso applies a solution that uses the Microsoft Australia East datacenter to provide a local version of the data to users in Australia. The company migrates its on-premises SQL database to Azure Cosmos DB by using the SQL API. The data can be stored in the UK and replicated to Australia, improving throughput times and performance for Australian users.
Understand Azure SQL Database
Azure SQL Database is a managed relational database service. It supports structures such as relational data and unstructured formats such as spatial and XML data. SQL Database provides online transaction processing (OLTP) that can scale on demand. You’ll also find the comprehensive security and availability that you appreciate in Azure database services.
When to use SQL Database
Use SQL Database when you need to scale up and scale down OLTP systems on demand. SQL Database is a good solution when your organization wants to take advantage of Azure security and availability features. Organizations that choose SQL Database also avoid the risks of capital expenditures and of increasing operational spending on complex on-premises systems.
SQL Database can be more flexible than an on-premises SQL Server solution because you can provision and configure it in minutes. What's more, SQL Database is backed by an Azure service-level agreement (SLA).
SQL Database delivers predictable performance for multiple resource types, service tiers, and compute sizes. Requiring almost no administration, it provides dynamic scalability with no downtime, built-in intelligent optimization, global scalability and availability, and advanced security options. These capabilities let you focus on rapid app development and on speeding up your time to market. You no longer have to devote precious time and resources to managing virtual machines and infrastructure.
Use T-SQL to query the contents of a SQL Database. This method benefits from a wide range of standard SQL features to filter, order, and project the data into the form you need.
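As a sketch, here's a simple T-SQL query against a hypothetical `dbo.Sales` table; the table and column names are assumptions for illustration:

```sql
-- Hypothetical T-SQL query against a SQL Database table named dbo.Sales.
-- Filters to one region, orders by value, and projects three columns.
SELECT TOP (10)
       OrderId,
       Region,
       OrderTotal
FROM   dbo.Sales
WHERE  Region = 'Australia'
ORDER BY OrderTotal DESC;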
Understand Azure SQL Data Warehouse
Azure SQL Data Warehouse is a cloud-based enterprise data warehouse, broadly comparable to Google's BigQuery. It can process massive amounts of data and answer complex business questions.
When to use SQL Data Warehouse
Data loads can increase the processing time for on-premises data warehousing solutions. Organizations that face this issue might look to a cloud-based alternative to reduce processing time and release business intelligence reports faster. But many organizations first consider scaling up on-premises servers. As this approach reaches its physical limits, they look for a solution on a petabyte scale that doesn’t involve complex installations and configurations.
Ingesting and processing data
SQL Data Warehouse uses the extract, load, and transform (ELT) approach for bulk data. SQL professionals are already familiar with bulk-copy tools such as bcp and the SQLBulkCopy API. Data engineers who work with SQL Data Warehouse will soon learn how quickly PolyBase can load data.
PolyBase is a technology that removes this complexity for data engineers: it lets them take advantage of big-data ingestion and processing techniques by offloading complex calculations to the cloud. Developers can combine PolyBase with stored procedures, labels, views, and SQL in their applications. You can also use Azure Data Factory to ingest and process data.
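A typical PolyBase load defines an external data source, a file format, and an external table over files in Blob storage, then copies the rows into the warehouse with CREATE TABLE AS SELECT (CTAS). The names, paths, and schema below are hypothetical:

```sql
-- Hypothetical PolyBase objects; names and paths are placeholders.
CREATE EXTERNAL DATA SOURCE SalesBlobStore
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://sales@contosostorage.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- The external table reads the files in place; no data moves yet.
CREATE EXTERNAL TABLE ext.Sales
(   OrderId    INT,
    Region     NVARCHAR(50),
    OrderTotal DECIMAL(18, 2) )
WITH (LOCATION = '/2019/',
      DATA_SOURCE = SalesBlobStore,
      FILE_FORMAT = CsvFormat);

-- CTAS performs the actual parallel load into the warehouse.
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(OrderId))
AS SELECT * FROM ext.Sales;
```

The external table is only metadata over the files, so the CTAS step is where the parallel bulk load actually happens.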
Understand Azure Stream Analytics
Applications, sensors, monitoring devices, and gateways broadcast continuous event data known as data streams. Streaming data is high volume and has a lighter payload than nonstreaming data.
Data engineers use Azure Stream Analytics to process streaming data and respond to data anomalies in real time. You can use Stream Analytics for Internet of Things (IoT) monitoring, web logs, remote patient monitoring, and point of sale (POS) systems.
When to use Stream Analytics
If your organization must respond to data events in real time or analyze large batches of data in a continuous time-bound stream, Stream Analytics is a good solution. Your organization must decide whether to work with streaming data or batch data.
In real time, data is ingested from applications or IoT devices and gateways into an event hub or IoT hub. The event hub or IoT hub then streams the data into Stream Analytics for real-time analysis.
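A Stream Analytics job then expresses its real-time logic in a SQL-like query language. This hypothetical query counts events per device over 30-second tumbling windows; `SensorInput` and `AlertOutput` stand for the input and output aliases configured on the job:

```sql
-- Hypothetical Stream Analytics query; SensorInput and AlertOutput
-- are the job's configured input and output aliases.
SELECT DeviceId,
       COUNT(*) AS EventCount
INTO   AlertOutput
FROM   SensorInput TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 30)
```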
Batch systems process groups of data that are stored in an Azure Blob store. They do this in a single job that runs at a predefined interval. Don’t use batch systems for business intelligence systems that can’t tolerate the predefined interval. For example, an autonomous vehicle can’t wait for a batch system to adjust its driving. Similarly, a fraud-detection system must decline a questionable financial transaction in real time.
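To make the windowing idea concrete, here's a minimal standard-library Python sketch of the tumbling-window counting a streaming engine performs. A real engine computes this incrementally over an unbounded stream rather than over an in-memory list:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, device) over fixed,
    non-overlapping windows of window_seconds each."""
    counts = defaultdict(int)
    for ts, device in events:
        # Every timestamp maps to exactly one window.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, device)] += 1
    return dict(counts)

# Four timestamped point-of-sale events: (seconds, device id).
events = [(0, "pos-1"), (5, "pos-1"), (12, "pos-2"), (31, "pos-1")]
print(tumbling_window_counts(events, 30))
# → {(0, 'pos-1'): 2, (0, 'pos-2'): 1, (30, 'pos-1'): 1}
```

Because the windows don't overlap, each event is counted exactly once, which is what distinguishes a tumbling window from sliding or hopping windows.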
Understand Azure Databricks
Azure Databricks is a serverless platform that's optimized for Azure. It provides one-click setup, streamlined workflows, and an interactive workspace for Spark-based applications.
Databricks adds capabilities to Apache Spark, including fully managed Spark clusters and an interactive workspace. You can use REST APIs to program clusters.
In Databricks notebooks you’ll use familiar programming tools such as R, Python, Scala, and SQL. Role-based security in Azure Active Directory and Databricks provides enterprise-grade security.
Understand Azure Data Factory
Data Factory is a cloud-integration service. It orchestrates the movement of data between various data stores.
As a data engineer, you can create data-driven workflows in the cloud to orchestrate and automate data movement and data transformation. Use Data Factory to create and schedule data-driven workflows (called pipelines) that can ingest data from data stores.
Data Factory processes and transforms data by using compute services such as Azure HDInsight, Hadoop, Spark, and Azure Machine Learning. Publish output data to data stores such as Azure SQL Data Warehouse so that business intelligence applications can consume the data. Ultimately, you use Data Factory to organize raw data into meaningful data stores and data lakes so your organization can make better business decisions.
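As a sketch of what a pipeline definition looks like, here's a minimal copy-activity pipeline in Data Factory's JSON format. The pipeline and dataset names are hypothetical, and the referenced datasets would be defined separately:

```json
{
  "name": "CopySalesToWarehouse",
  "properties": {
    "activities": [
      {
        "name": "CopyBlobToSqlDw",
        "type": "Copy",
        "inputs":  [ { "referenceName": "BlobSalesDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SqlDwSalesDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink":   { "type": "SqlDWSink" }
        }
      }
    ]
  }
}
```

A trigger attached to the pipeline would then run this copy on a schedule or in response to an event.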
Understand Azure Data Catalog
Analysts, data scientists, developers, and others use Data Catalog to discover, understand, and consume data sources. Data Catalog features a crowdsourcing model of metadata and annotations. In this central location, an organization's users contribute their knowledge to build a community of data sources that are owned by the organization.
Data Catalog is a fully managed cloud service. Users discover and explore data sources, and they help the organization document information about their data sources.
Whether you work with small data or big data, you’ll find that the Azure platform provides a rich set of technologies. Use these to analyze data that’s relational, nonrelational, streaming, text-based, or image-based. Choose the technologies that meet your business needs. Then scale your solutions to meet demands securely.