Creating the Azure HDInsight cluster
Use these configs:
Cluster Type: Spark
Version: Spark 3.3.0 (HDI 5.1)
Then, under Security + Networking, make sure to create a virtual network and connect the cluster to it.
When selecting the node sizes, E2 V3 is the cheapest.
Then finish creating the Azure HDInsight cluster. This may take up to 30 minutes.
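The portal steps above can also be sketched with the Azure CLI. All names, passwords, and resource-group values below are hypothetical placeholders to adapt, not the exact setup from this guide:

```shell
#!/bin/bash
# Sketch of creating the same Spark 3.3 (HDI 5.1) cluster with the Azure CLI.
# Every name and credential here is a placeholder -- replace before running.
az hdinsight create \
  --name my-hdi-cluster \
  --resource-group my-rg \
  --type spark \
  --version 5.1 \
  --component-version Spark=3.3 \
  --http-user admin \
  --http-password 'Replace-Me-1!' \
  --ssh-user sshuser \
  --ssh-password 'Replace-Me-1!' \
  --storage-account mystorageaccount \
  --headnode-size Standard_E2_v3 \
  --workernode-size Standard_E2_v3 \
  --vnet-name my-vnet \
  --subnet default
```

Provisioning is asynchronous on the Azure side, so even from the CLI the cluster can take up to 30 minutes to become ready.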
Configuring the blob storage
Creating the cluster should also have created a blob storage account. We now have to configure it.
Under Containers, create containers named "data", "jars", "querys", "scripts", and "zipped".
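Creating the five containers can be scripted instead of clicking through the portal. The storage-account name below is a hypothetical placeholder:

```shell
#!/bin/bash
# Create the five containers used by this guide (account name is a placeholder).
STORAGE_ACCOUNT=mystorageaccount

for c in data jars querys scripts zipped; do
  az storage container create \
    --account-name "$STORAGE_ACCOUNT" \
    --name "$c" \
    --auth-mode login
done
```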
In the "data" container, put all of the patient data in NDJSON format.
In the "jars" container, put this file. Once you upload this file, generate a SAS URL with an expiry far in the future.
Then, in the "scripts" container, put these files.
After this, generate a SAS URL for each of the shell scripts, again setting the expiry far into the future.
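The per-script SAS URLs can also be generated in one pass with the CLI. The storage-account name and expiry date below are hypothetical placeholders; the loop just lists whatever was uploaded to "scripts":

```shell
#!/bin/bash
# Print a read-only SAS URL for every blob in the "scripts" container.
STORAGE_ACCOUNT=mystorageaccount   # placeholder
EXPIRY=2030-01-01T00:00Z           # "long into the future"

for blob in $(az storage blob list \
      --account-name "$STORAGE_ACCOUNT" \
      --container-name scripts \
      --query '[].name' -o tsv); do
  sas=$(az storage blob generate-sas \
      --account-name "$STORAGE_ACCOUNT" \
      --container-name scripts \
      --name "$blob" \
      --permissions r \
      --expiry "$EXPIRY" \
      -o tsv)
  echo "https://$STORAGE_ACCOUNT.blob.core.windows.net/scripts/$blob?$sas"
done
```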
Then edit the "jar_install.sh" file and add the JAR_URL from earlier.
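The contents of jar_install.sh are not shown here; a minimal sketch of what the edited script might look like, assuming it only downloads the uploaded JAR into the classpath directory configured later, is (the JAR filename is a placeholder):

```shell
#!/bin/bash
# Hypothetical sketch of jar_install.sh: fetch the JAR from blob storage
# into the directory referenced by spark.*.extraClassPath below.
# JAR_URL is the SAS URL generated from the "jars" container.
JAR_URL="<paste the SAS URL here>"

sudo mkdir -p /usr/libs/sparklibs
sudo wget -O /usr/libs/sparklibs/app.jar "$JAR_URL"
```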
Setting up the script actions
Now go back to the Azure HDInsight cluster and click "Script actions" in the left navbar.
Using the SAS URLs to the shell scripts in blob storage, use these settings and let all of the shell scripts run.
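Script actions can also be submitted from the CLI. The cluster, resource-group, and action names below are hypothetical placeholders, and SCRIPT_URI is one of the SAS URLs generated earlier:

```shell
#!/bin/bash
# Run one script action on head and worker nodes, persisted so it also
# applies to nodes added later (names are placeholders).
SCRIPT_URI="<SAS URL of one shell script>"

az hdinsight script-action execute \
  --cluster-name my-hdi-cluster \
  --resource-group my-rg \
  --name install-deps \
  --script-uri "$SCRIPT_URI" \
  --roles headnode workernode \
  --persist-on-success
```

Repeat the command once per shell script, changing --name and --script-uri each time.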
Setting the spark configuration
Now we need to configure the cluster to use these scripts. To do this, go back to our Azure HDInsight cluster and click on the URL.
Once we click on that URL, a dashboard will load with a login prompt. This login was created during the Azure HDInsight setup.
Once here, go to Services/Spark3/Configs.
Under "Advanced spark3-env", edit the content to this:
The specific lines that were changed here are:
One changes the cluster's Java to Java 11.
The other changes the Python version from 2 to 3.
Then we need to go to "spark3-defaults".
Here we want to set these variables:
spark.driver.extraClassPath = /usr/libs/sparklibs/*
spark.executor.extraClassPath = /usr/libs/sparklibs/*
spark.yarn.appMasterEnv.PYSPARK3_PYTHON = /usr/bin/miniforge/envs/jorienv/bin/python3
spark.yarn.appMasterEnv.PYSPARK_PYTHON = /usr/bin/miniforge/envs/jorienv/bin/python3
spark.yarn.jars = /usr/libs/sparklibs
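After saving, the dashboard will typically prompt you to restart the affected Spark services; do that before submitting jobs. The changes can then be sanity-checked from an SSH session on the head node. This is a sketch using the paths set above:

```shell
#!/bin/bash
# Verify the environment the configs above point at (run on the head node).

# The Python env referenced by spark.yarn.appMasterEnv.* should exist
# and report Python 3.x:
/usr/bin/miniforge/envs/jorienv/bin/python3 --version

# The classpath directory populated by jar_install.sh should contain the JAR(s):
ls /usr/libs/sparklibs
```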