Setting up the pipelines

Creating the Azure HDInsight cluster

Use these configs:

Cluster Type: Spark

Version: Spark 3.3.0 (HDI 5.1)

Then, under Security + Networking, make sure to create a "virtual network" and connect it to the cluster.

When selecting the node sizes, E2 V3 is the cheapest.

Then finish creating the Azure HDInsight cluster. This may take up to 30 minutes.
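
For reference, a roughly equivalent cluster can also be created from the Azure CLI. This is a minimal sketch, not the exact portal flow; the cluster name, resource group, storage account, virtual network, and passwords below are placeholders you would replace with your own:

az hdinsight create \
    --name my-spark-cluster \
    --resource-group my-resource-group \
    --type spark \
    --version 5.1 \
    --component-version Spark=3.3 \
    --http-password 'ReplaceWithClusterPassword1!' \
    --ssh-password 'ReplaceWithSshPassword1!' \
    --headnode-size Standard_E2_v3 \
    --workernode-size Standard_E2_v3 \
    --storage-account mystorageacct \
    --vnet-name my-vnet \
    --subnet default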

Configuring the blob storage

Creating the cluster should also have created a blob storage account. We now have to configure it.

Under Containers, create containers named "data", "jars", "querys", "scripts", and "zipped".
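
If you prefer the CLI, here is a quick sketch that creates all five containers in one go (the storage account name is a placeholder, and this assumes your logged-in identity has rights on the account):

for c in data jars querys scripts zipped; do
    az storage container create \
        --name "$c" \
        --account-name mystorageacct \
        --auth-mode login
done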

In the data container, put all of the patient data in NDJSON form.
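
NDJSON is simply one JSON object per line. To push a local folder of NDJSON files into the data container, something like the following should work (the folder and storage account names are placeholders):

az storage blob upload-batch \
    --destination data \
    --source ./patient-ndjson \
    --account-name mystorageacct \
    --auth-mode login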

Under the jars container, upload this file. Once it is uploaded, generate a SAS URL with an expiry date far in the future.
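
You can generate the SAS URL from the portal or the CLI. A sketch of the CLI route, using the JAR name from the install script below and a deliberately far-off expiry (this uses the storage account key; pass --account-key explicitly if it is not picked up from your environment):

az storage blob generate-sas \
    --account-name mystorageacct \
    --container-name jars \
    --name library-runtime-6.3.0.jar \
    --permissions r \
    --expiry 2030-01-01T00:00:00Z \
    --https-only \
    --full-uri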

Then, under the scripts container, you need to upload these files.

After this, generate a SAS URL for each of the shell scripts, again with an expiry date far in the future.
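
The same command works for the scripts; a sketch that loops over the script names (replace the list with your actual file names):

for script in jar_install.sh; do
    az storage blob generate-sas \
        --account-name mystorageacct \
        --container-name scripts \
        --name "$script" \
        --permissions r \
        --expiry 2030-01-01T00:00:00Z \
        --https-only \
        --full-uri
done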

Then you need to edit the "jar_install.sh" file and fill in JAR_URL with the SAS URL for the JAR from earlier:

#!/bin/bash

# Log starting
echo "Starting the JAR installation script..."

# Define the destination directory for the JAR files
DEST_DIR="/usr/libs/sparklibs"

# Create a local directory for the JAR files
echo "Creating directory $DEST_DIR for JAR files..."
sudo mkdir -p "$DEST_DIR"

# Specify the JAR file URL
JAR_URL="URL_TO_JAR_IN_BLOB_STORAGE_GOES_HERE" #EDIT THIS LINE

# Define the JAR file destination path
JAR_DEST="$DEST_DIR/library-runtime-6.3.0.jar"

# Download the JAR file from the given URL to the specified directory
echo "Downloading the JAR file to $DEST_DIR..."
sudo wget -O "$JAR_DEST" "$JAR_URL"
# Check if the download was successful
if [ $? -eq 0 ]; then
    echo "Download successful."
else
    echo "Download failed."
    exit 1
fi

# Print the directory contents to verify the JAR file is there
echo "Listing contents of $DEST_DIR..."
sudo ls -l "$DEST_DIR"

# Final log statement
echo "JAR installation script completed."

Setting up the script actions

Now we need to go back to the Azure HDInsight cluster and click on Script actions in the left nav bar.

Using the SAS URLs to the shell scripts in blob storage, use these settings and let all of the shell scripts run.
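
Script actions can also be submitted from the CLI. A sketch for the JAR install script, assuming the cluster and resource group names from before and your actual SAS URL; --persist-on-success makes the action apply to nodes added later:

az hdinsight script-action execute \
    --cluster-name my-spark-cluster \
    --resource-group my-resource-group \
    --name install-jar \
    --script-uri "<SAS URL to jar_install.sh>" \
    --roles headnode workernode \
    --persist-on-success

Once the action completes, you can verify the JAR landed by SSHing into the head node (HDInsight exposes SSH at <clustername>-ssh.azurehdinsight.net) and running "ls -l /usr/libs/sparklibs".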

Setting the Spark configuration

Now we need to edit the cluster configuration to use what these shell scripts installed. To do this, we go back to our Azure HDInsight cluster and click on the dashboard (Ambari) URL.

Once we click that URL, the Ambari dashboard loads with a login prompt. This login was created during the Azure HDInsight setup.

Once here, go to Services/Spark3/Configs.

Under "Advanced spark3-env" edit the content to this

#!/usr/bin/env bash

export SPARK_CONF_DIR=${SPARK_CONF_DIR:-{{spark_home}}/conf}
export SPARK_LOG_DIR={{spark_log_dir}}
export SPARK_PID_DIR={{spark_pid_dir}}
export SPARK_MAJOR_VERSION=3
SPARK_IDENT_STRING=$USER
SPARK_NICENESS=0
export HADOOP_HOME=${HADOOP_HOME:-{{hadoop_home}}}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-{{hadoop_conf_dir}}}
export SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:/usr/hdp/current/spark3-client/jars/*:/usr/lib/hdinsight-datalake/*:/usr/hdp/current/spark_llap/*:/usr/hdp/current/spark3-client/conf:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hive_warehouse_connector/*
export JAVA_HOME=/usr/lib/jvm/java-11-oracle
if [ -d "/etc/tez/conf/" ]; then
  export TEZ_CONF_DIR=/etc/tez/conf
else
  export TEZ_CONF_DIR=
fi

# Tell pyspark (the shell) to use Anaconda Python.
export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/miniforge/envs/jorienv/bin/python}
export PYSPARK3_PYTHON=${PYSPARK_PYTHON:-/usr/bin/miniforge/envs/jorienv/bin/python}


# Give values for log4j variables for Spark History Server
export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Detwlogger.component=sparkhistoryserver -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,RFA,Anonymizer -Dlog4jspark.log.dir=/var/log/spark -Dlog4jspark.log.file=sparkhistoryserver.log -Dlog4j2.configurationFile=/usr/hdp/current/spark3-client/conf/log4j2.properties"

The specific lines that were changed here are the following.

This changes the cluster's Java version to Java 11:

export JAVA_HOME=/usr/lib/jvm/java-11-oracle

This changes the Python version from 2 to 3 by pointing PySpark at the Python 3 binary in the miniforge environment:

export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/miniforge/envs/jorienv/bin/python}
export PYSPARK3_PYTHON=${PYSPARK_PYTHON:-/usr/bin/miniforge/envs/jorienv/bin/python}

Then we need to go to "spark3-defaults".

Here we want to set these properties:

spark.driver.extraClassPath = /usr/libs/sparklibs/*

spark.executor.extraClassPath = /usr/libs/sparklibs/*

spark.yarn.appMasterEnv.PYSPARK3_PYTHON = /usr/bin/miniforge/envs/jorienv/bin/python3

spark.yarn.appMasterEnv.PYSPARK_PYTHON = /usr/bin/miniforge/envs/jorienv/bin/python3

spark.yarn.jars = /usr/libs/sparklibs
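
As a sanity check, the same settings can also be passed per job on the command line; a sketch of an equivalent spark-submit invocation (the job script name is a placeholder):

spark-submit \
    --conf "spark.driver.extraClassPath=/usr/libs/sparklibs/*" \
    --conf "spark.executor.extraClassPath=/usr/libs/sparklibs/*" \
    --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/miniforge/envs/jorienv/bin/python3" \
    my_job.py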
