Running Code in Jori
Jori is a platform that lets researchers write and run Apache Spark code, through PySpark, to analyze population-level FHIR (Fast Healthcare Interoperability Resources) data. You submit a PySpark application, Jori runs it on a distributed cluster, and the results are returned to you.
Apache Spark and PySpark
Apache Spark is a fast and general-purpose cluster computing system for large-scale data processing. It provides high-level APIs in various programming languages, including Python, and supports a wide range of data processing tasks, such as batch processing, real-time streaming, machine learning, and graph processing.
PySpark is the Python API for Apache Spark. It allows you to write Spark applications using Python and take advantage of Spark's distributed computing capabilities. PySpark provides a Pandas-like DataFrame API, which is familiar to many data scientists and makes it easy to manipulate and analyze structured data.
Pathling Library
Jori utilizes the Pathling library to facilitate working with FHIR data in PySpark applications. Pathling provides FHIR encoders that transform FHIR Bundles or NDJSON into Spark datasets, allowing you to leverage Spark's powerful data processing capabilities.
With Pathling, every FHIR resource type is already loaded and available when you write your PySpark code. Each PySpark DataFrame is named after its corresponding FHIR resource type. For example, the data for the Patient resource is stored in a DataFrame variable named Patient.
Examples of interacting with the FHIR dataframes
For example, the following expression returns the number of rows in the Patient DataFrame, that is, the number of Patient resources.
Patient.count()
This example prints a table showing the id, gender, and birthDate of each patient. Note that show() displays only the first 20 rows by default.
Patient.select('id', 'gender', 'birthDate').show()
Submitting PySpark Applications to Jori
To run your PySpark application on Jori, follow these steps:
Write your PySpark code using the PySpark API and leverage the Pathling library for working with FHIR data.
Use the IDE on the Jori platform to write and execute your PySpark code.
Jori takes care of the underlying infrastructure and resource management, allowing you to focus on writing your PySpark code and analyzing the FHIR data. It abstracts away the complexities of cluster setup, configuration, and maintenance, making it easier for researchers to leverage the power of distributed computing.
Together, Apache Spark, PySpark, and the Pathling library let researchers process and analyze large volumes of FHIR data on Jori and draw insights from population-level healthcare data.