Apache Spark for Azure Synapse now provides descriptive Livy error codes. When an Azure Synapse Spark job fails, this new and updated error handling feature parses and checks the logs on the backend to identify the root cause and displays it on the monitoring pane along with the steps to take to resolve the issue.
As data engineers, we often get requirements to encrypt, decrypt, mask, or anonymize certain columns of data in files sitting in the data lake when preparing and transforming data with Apache Spark. Spark's extensibility allows us to leverage libraries that are not native to Spark. One such library is Microsoft Presidio, which provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more. It facilitates both fully automated and semi-automated PII (Personally Identifiable Information) de-identification and anonymization flows on multiple platforms.
Azure Data Factory and Synapse Analytics Pipelines have a wealth of linked service connection types that allow them to connect to and interact with many services and data stores. The Workspace UI exposes the most important properties needed for the connection. However, at times we need more control than the UI offers.
When implementing CI/CD processes in the context of Azure Synapse Analytics, you will require different workflows, depending on whether you are automating the integration and delivery of Workspace artifacts (pipelines, notebooks, etc.) or SQL pool objects (tables, stored procedures, etc.).
Microsoft Most Valuable Professionals, or MVPs, are technology experts who passionately share their knowledge with the community. They are always on the “bleeding edge” and have an unstoppable urge to get their hands on new, exciting technologies. They have very deep knowledge of Microsoft products and services, while also being able to bring together diverse platforms, products, and solutions to solve real-world problems.
A common data engineering task is to explore, transform, and load data into a data warehouse using Azure Synapse Apache Spark. The Azure Synapse Dedicated SQL Pool Connector for Apache Spark is the way to read and write large volumes of data efficiently between Apache Spark and a dedicated SQL pool in Synapse Analytics. The connector supports Scala and Python in Synapse Notebooks for these operations.
If you have ever used an Azure Synapse Analytics dedicated SQL pool, you know there are multiple table types to choose from for your workload. You might ask yourself, “When can I use the Replicated table type, and how can I use it efficiently?”
Today I would like to share a scenario that I worked on in one of my serverless SQL pool support cases. The customer asked for advice on how to monitor serverless SQL requests by using Log Analytics.
We are introducing a spaceborne data processing Notebook, which has been published to the Azure Synapse Analytics Gallery. The Notebook uses the STAC (SpatioTemporal Asset Catalog) API to search and download geospatial data from Microsoft Planetary Computer to an Azure Storage account and performs basic geospatial transformations.
I will be writing a series of posts on Synapse connectivity. Because there are many topics to cover, such as inbound and outbound traffic, public and private endpoints, managed VNET, and managed private endpoints, it will be easier to break these into smaller dedicated posts.
Businesses use Apache Spark applications for big data processing (ELT/ETL loads), machine learning, and complex analytics needs. Spark applications can be grouped primarily into three different buckets.
For more information, view the blog here: October 2022 | Microsoft Azure Synapse Analytics Blog | Microsoft Azure Synapse
Contact us today for a more in-depth conversation around Azure Synapse.