Setting up the pipelines
Creating the Azure HDInsight cluster
Use these configurations:
Cluster Type: Spark
Version: Spark 3.3.0 (HDI 5.1)
Then, under Security + Networking, make sure to create a virtual network and connect the cluster to it.
When selecting the node sizes, E2 v3 is the cheapest option.
Then finish creating the Azure HDInsight cluster. This may take up to 30 minutes.
Configuring the blob storage
Creating the cluster should also have created a blob storage account. We now need to configure it.
Under Containers, create containers named "data", "jars", "querys", "scripts", and "zipped".
In the "data" container, upload all of the patient data in NDJSON (newline-delimited JSON) form.
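NDJSON simply means one complete JSON object per line, with no enclosing array and no trailing commas. A minimal sketch of what a file in the "data" container might look like (the field names here are illustrative assumptions, not a required schema):

```shell
# Write a tiny two-record NDJSON file with hypothetical patient fields.
cat > patients.ndjson <<'EOF'
{"patientId": "p-001", "name": "Example One", "age": 42}
{"patientId": "p-002", "name": "Example Two", "age": 57}
EOF

# Verify every line parses as standalone JSON -- the defining NDJSON property.
python3 -c '
import json
with open("patients.ndjson") as f:
    for line in f:
        json.loads(line)
print("all lines parse as JSON")
'
```

Spark can read such files directly with `spark.read.json("wasbs://data@<account>.blob.core.windows.net/")`, since NDJSON is its default JSON input format.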

Under the "jars" container, upload this file. Once you upload it, generate a SAS URL whose expiry is far in the future.
Then, under the "scripts" container, upload these files.
After this, generate a SAS URL for each of the shell scripts, again setting the expiry far into the future.
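For reference, a blob SAS URL is just the blob's HTTPS address with the SAS token appended as a query string. A sketch of its shape, where the account name and token are placeholders rather than real values:

```shell
# All values below are placeholders -- substitute your own account and the
# token the portal generates for you.
ACCOUNT="mystorageaccount"          # assumption: your storage account name
CONTAINER="scripts"
BLOB="jar_install.sh"
SAS_TOKEN="sv=2022-11-02&sr=b&sp=r&se=2030-01-01T00:00:00Z&sig=PLACEHOLDER"

# Blob URL + "?" + SAS token = SAS URL.
SAS_URL="https://${ACCOUNT}.blob.core.windows.net/${CONTAINER}/${BLOB}?${SAS_TOKEN}"
echo "$SAS_URL"
```

The `se=` field is the expiry timestamp; this is the part you want set far into the future so the cluster can keep fetching the scripts.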

Then edit the "jar_install.sh" file and set JAR_URL to the SAS URL of the JAR from earlier:
#!/bin/bash
# Log starting
echo "Starting the JAR installation script..."
# Define the destination directory for the JAR files
DEST_DIR="/usr/libs/sparklibs"
# Create a local directory for the JAR files
echo "Creating directory $DEST_DIR for JAR files..."
sudo mkdir -p "$DEST_DIR"
# Specify the JAR file URL (the SAS URL generated earlier)
JAR_URL="URL_TO_JAR_IN_BLOB_STORAGE_GOES_HERE" # EDIT THIS LINE
# Define the JAR file destination path
JAR_DEST="$DEST_DIR/library-runtime-6.3.0.jar"
# Download the JAR file from the given URL to the specified directory
echo "Downloading the JAR file to $DEST_DIR..."
if sudo wget -O "$JAR_DEST" "$JAR_URL"; then
    echo "Download successful."
else
    echo "Download failed."
    exit 1
fi
# Print the directory contents to verify the JAR file is there
echo "Listing contents of $DEST_DIR..."
sudo ls -l "$DEST_DIR"
# Final log statement
echo "JAR installation script completed."
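To sanity-check the script's control flow before running it on the cluster, here is a local dry run of the same steps, with `cp` standing in for `wget` and without `sudo` (the paths and the dummy JAR are throwaway assumptions, nothing touches the network):

```shell
# Local stand-in for /usr/libs/sparklibs.
DEST_DIR="./sparklibs"
mkdir -p "$DEST_DIR"

# Fake artifact standing in for the real JAR download.
echo "dummy jar bytes" > library-runtime-6.3.0.jar
JAR_DEST="$DEST_DIR/library-runtime-6.3.0.jar"

# Same success/failure branching as the real script, with cp replacing wget.
if cp library-runtime-6.3.0.jar "$JAR_DEST"; then
    echo "Download successful."
else
    echo "Download failed."
    exit 1
fi

# Confirm the JAR landed in the destination directory.
ls -l "$DEST_DIR"
```

If the copy (or, on the cluster, the download) fails, the script exits with status 1, which surfaces as a failed script action in the portal.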
Setting up the script actions
Now we need to go back to the Azure HDInsight cluster and click Script actions in the left navigation bar.

Using the SAS URLs to the shell scripts in blob storage, apply these settings and let all of the shell scripts run.
Setting the spark configuration
Now we need to edit the cluster to use these settings. To do this, go back to the Azure HDInsight compute cluster and click the cluster dashboard URL (the Ambari UI).

Once we open that URL, the Ambari dashboard loads with a login prompt. These credentials were created during the Azure HDInsight setup.

Once there, go to Services / Spark3 / Configs.
Under "Advanced spark3-env", replace the content with the following:
#!/usr/bin/env bash
export SPARK_CONF_DIR=${SPARK_CONF_DIR:-{{spark_home}}/conf}
export SPARK_LOG_DIR={{spark_log_dir}}
export SPARK_PID_DIR={{spark_pid_dir}}
export SPARK_MAJOR_VERSION=3
SPARK_IDENT_STRING=$USER
SPARK_NICENESS=0
export HADOOP_HOME=${HADOOP_HOME:-{{hadoop_home}}}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-{{hadoop_conf_dir}}}
export SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:/usr/hdp/current/spark3-client/jars/*:/usr/lib/hdinsight-datalake/*:/usr/hdp/current/spark_llap/*:/usr/hdp/current/spark3-client/conf:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hive_warehouse_connector/*
export JAVA_HOME=/usr/lib/jvm/java-11-oracle
if [ -d "/etc/tez/conf/" ]; then
    export TEZ_CONF_DIR=/etc/tez/conf
else
    export TEZ_CONF_DIR=
fi
# Tell pyspark (the shell) to use Anaconda Python.
export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/miniforge/envs/jorienv/bin/python}
export PYSPARK3_PYTHON=${PYSPARK_PYTHON:-/usr/bin/miniforge/envs/jorienv/bin/python}
# Give values for log4j variables for Spark History Server
export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Detwlogger.component=sparkhistoryserver -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,RFA,Anonymizer -Dlog4jspark.log.dir=/var/log/spark -Dlog4jspark.log.file=sparkhistoryserver.log -Dlog4j2.configurationFile=/usr/hdp/current/spark3-client/conf/log4j2.properties"
The specific lines that were changed here are:
This changes the cluster's Java version to Java 11:
export JAVA_HOME=/usr/lib/jvm/java-11-oracle
This changes the Python version from 2 to 3:
export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/miniforge/envs/jorienv/bin/python}
export PYSPARK3_PYTHON=${PYSPARK_PYTHON:-/usr/bin/miniforge/envs/jorienv/bin/python}
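The `${PYSPARK_PYTHON:-...}` syntax used above is bash parameter expansion: it keeps an existing value if the variable is already set and falls back to the Miniforge path otherwise. A quick demonstration with a throwaway variable:

```shell
# With the variable unset, the fallback after ":-" is used.
unset DEMO_PYTHON
echo "${DEMO_PYTHON:-/usr/bin/miniforge/envs/jorienv/bin/python}"
# -> /usr/bin/miniforge/envs/jorienv/bin/python

# With the variable set, its existing value wins over the fallback.
DEMO_PYTHON=/usr/bin/python3
echo "${DEMO_PYTHON:-/usr/bin/miniforge/envs/jorienv/bin/python}"
# -> /usr/bin/python3
```

This is why the edit is safe: if something on the cluster has already exported PYSPARK_PYTHON, that value is respected, and the Miniforge interpreter is only the default.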
Then we need to go to "spark3-defaults".
Here we want to set these variables:
spark.driver.extraClassPath = /usr/libs/sparklibs/*
spark.executor.extraClassPath = /usr/libs/sparklibs/*
spark.yarn.appMasterEnv.PYSPARK3_PYTHON = /usr/bin/miniforge/envs/jorienv/bin/python3
spark.yarn.appMasterEnv.PYSPARK_PYTHON = /usr/bin/miniforge/envs/jorienv/bin/python3
spark.yarn.jars = /usr/libs/sparklibs
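For reference, these are the same settings in plain `spark-defaults.conf` syntax (key, whitespace, value). They are written to a local sample file here only to show the format; on the cluster you enter them through the Ambari Configs UI rather than by editing a file:

```shell
# Sample file showing how the five Ambari key/value pairs map onto
# spark-defaults.conf syntax.
cat > spark3-defaults.sample <<'EOF'
spark.driver.extraClassPath /usr/libs/sparklibs/*
spark.executor.extraClassPath /usr/libs/sparklibs/*
spark.yarn.appMasterEnv.PYSPARK3_PYTHON /usr/bin/miniforge/envs/jorienv/bin/python3
spark.yarn.appMasterEnv.PYSPARK_PYTHON /usr/bin/miniforge/envs/jorienv/bin/python3
spark.yarn.jars /usr/libs/sparklibs
EOF

# Count the Spark properties defined in the sample file.
grep -c '^spark\.' spark3-defaults.sample
# -> 5
```

The two `extraClassPath` entries make the JAR installed by jar_install.sh visible to the driver and executors, and the two `appMasterEnv` entries ensure the YARN application master uses the same Python 3 interpreter as the rest of the cluster.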