Ray Minds

Read Nested Array JSON from Azure Storage in Databricks

Updated: Jun 26

By Abhishek Tyagi


Databricks is a powerful cloud-based data processing and analytics platform. One common task is reading nested JSON data from Azure Storage and processing it in Databricks. This document provides a step-by-step guide on how to achieve this.


JSON example
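The original JSON screenshot is not reproduced here. As a stand-in, here is a minimal nested JSON of the kind this guide assumes (the field names — customer, orders, items — are hypothetical placeholders for your own schema):

```python
import json

# Hypothetical nested JSON: an "orders" array whose elements
# each contain a further nested "items" array.
sample = """
{
  "customer": "Acme Corp",
  "orders": [
    {"order_id": 1, "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]},
    {"order_id": 2, "items": [{"sku": "C3", "qty": 5}]}
  ]
}
"""

# Parse it to confirm the structure is valid JSON.
data = json.loads(sample)
print(data["orders"][0]["items"][0]["sku"])  # → A1
```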

Prerequisites:

Azure Subscription: Ensure you have an active Azure subscription.

Azure Storage Account: A storage account where your JSON files are stored.

Databricks Workspace: Access to a Databricks workspace in Azure.

Step 1: Setting Up Azure Storage Account and JSON File Upload

Create a Storage Account:

  • Navigate to the Azure portal.

  • Create a new storage account or use an existing one.

  • Note the storage account name and the access key (found under the storage account’s Access keys section).

Upload JSON File:

  • In the storage account, create a container (e.g., json-container — container names must be lowercase).

  • Upload your nested JSON file to this container.


Step 2: Configuring Databricks

Launch Databricks Workspace:

  • Go to the Azure portal.

  • Navigate to your Databricks workspace and launch it.

Create a New Cluster:

  • In the Databricks workspace, create a new cluster with the required configuration and libraries. PySpark is already included in the Databricks runtime, so no extra installation is needed for this guide.

Step 3: Accessing Azure Storage from Databricks

Mount Azure Storage:

  • Use the following code in a Databricks notebook to mount the Azure Storage container to Databricks.

Mount query
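The mount screenshot did not survive, so the following is a sketch. The storage account name, container name, secret scope, and mount point below are placeholders — substitute your own values:

```python
# Sketch only: storage account, container, secret scope, and mount point
# are placeholder names.
storage_account = "mystorageaccount"   # your storage account name
container = "json-container"           # the container created in Step 1

# Prefer a Databricks secret scope over pasting the key in plain text.
access_key = dbutils.secrets.get(scope="my-scope", key="storage-key")

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net/",
    mount_point="/mnt/json-mount",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net": access_key
    },
)
```

This runs only inside a Databricks notebook, where `dbutils` is predefined.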

Verify Mounting:

  • Verify the mounting by listing the files in the mounted directory.

verify mount
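A sketch of the verification step, assuming the container was mounted at the placeholder mount point /mnt/json-mount:

```python
# List the files under the mount point to confirm the mount succeeded;
# your uploaded JSON file should appear in the output.
display(dbutils.fs.ls("/mnt/json-mount"))
```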

Step 4: Reading Nested JSON Data

Read JSON File:

  • Use the following code to read the nested JSON file into a DataFrame.

Read JSON
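The read-JSON screenshot is missing; a sketch, assuming the placeholder mount point and file name used above:

```python
# multiLine=true is required when a single JSON record spans multiple lines,
# as pretty-printed nested JSON usually does. The file name is a placeholder.
df = (
    spark.read
    .option("multiLine", "true")
    .json("/mnt/json-mount/sample.json")
)
```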

Inspect DataFrame Schema:

  • Check the schema to understand the structure of the nested JSON.

JSON schema
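For example:

```python
# Nested arrays appear in the schema as array<struct<...>> fields,
# which tells you which columns will need to be exploded.
df.printSchema()
```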

Exploring Nested Data:

  • If the JSON contains nested arrays or structures, you may need to use explode, together with other DataFrame operations such as withColumn (to add a new column holding the exploded elements), to flatten and process the data.
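A sketch of that flattening, assuming a hypothetical schema with a top-level orders array whose elements contain an items array (substitute your own field names):

```python
from pyspark.sql.functions import col, explode

# Each explode turns one row per array into one row per array element.
# "orders" and "items" are placeholder field names.
df_exploded = (
    df.withColumn("order", explode(col("orders")))
      .withColumn("item", explode(col("order.items")))
)
```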


Step 5: Select the desired columns


Select column
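A sketch, assuming the df_exploded DataFrame and placeholder field names from the explode step above:

```python
from pyspark.sql.functions import col

# Pull the desired leaf fields out of the exploded structs into flat columns.
df_flat = df_exploded.select(
    col("customer"),
    col("order.order_id").alias("order_id"),
    col("item.sku").alias("sku"),
    col("item.qty").alias("qty"),
)
```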

Step 6: Display the flattened DataFrame


Display result
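For example, assuming the df_flat DataFrame from the previous step:

```python
# display() renders an interactive table in a Databricks notebook.
display(df_flat)

# Outside a notebook, the plain Spark equivalent is:
# df_flat.show(truncate=False)
```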

Step 7: Clean Up

Unmount the Azure Storage if it is no longer needed.


Clean up
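A sketch, using the placeholder mount point from Step 3:

```python
# Unmount so the credential-backed mount does not linger on the workspace.
dbutils.fs.unmount("/mnt/json-mount")
```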

Conclusion

Reading and processing nested JSON data from Azure Storage in Databricks involves setting up your Azure environment, mounting the storage, reading the data into a Spark DataFrame, and then using Spark's powerful processing capabilities to analyse and transform the data. By following the steps outlined in this document, you should be able to seamlessly integrate and analyse nested JSON data in your Databricks workflows.

