Session Management

Session Management with Livy

  • Livy: Apache Livy is a service that exposes a Spark cluster over a REST interface. It lets users create Spark sessions, submit jobs, monitor their progress, and retrieve results remotely, without interacting with the cluster directly.

  • Session Creation and Management: The code uses Livy to create and manage Spark sessions, sending HTTP requests to the Livy server to start new sessions, monitor their status, and execute Spark code. This is done through the create_session, add_session_to_pool, and ensure_sessions methods. The session management logic keeps a predefined number of Spark sessions available at all times, optimizing resource utilization and reducing the wait before execution.
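
The REST interaction behind a method like create_session can be sketched as follows. This is a minimal illustration using the requests library against Livy's documented /sessions endpoints; the endpoint URL, session kind, and polling interval are placeholders, not values taken from the actual code:

```python
import time
import requests

LIVY_URL = "http://livy-server:8998"  # placeholder Livy endpoint

def create_session(livy_url=LIVY_URL, timeout=300):
    """Start a PySpark session via Livy and wait until it reaches the idle state."""
    resp = requests.post(f"{livy_url}/sessions", json={"kind": "pyspark"})
    resp.raise_for_status()
    session_id = resp.json()["id"]

    deadline = time.time() + timeout
    while time.time() < deadline:
        state = requests.get(f"{livy_url}/sessions/{session_id}/state").json()["state"]
        if state == "idle":
            return session_id          # session is ready to accept statements
        if state in ("error", "dead", "killed"):
            raise RuntimeError(f"Livy session {session_id} failed: {state}")
        time.sleep(2)                  # poll while the session starts up
    raise TimeoutError(f"Livy session {session_id} not idle after {timeout}s")

def run_statement(session_id, code, livy_url=LIVY_URL):
    """Submit a snippet of Spark code to an existing Livy session."""
    resp = requests.post(f"{livy_url}/sessions/{session_id}/statements",
                         json={"code": code})
    resp.raise_for_status()
    return resp.json()
```

Polling the /state endpoint matters because session startup is slow: a freshly created session reports "starting" for some time before it can accept statements.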

Multithreading for Concurrent Session Management

  • Threading: The code leverages Python's threading module to handle session creation and management asynchronously. This allows the application to maintain responsiveness and efficiency, especially when dealing with I/O-bound operations such as HTTP requests to the Livy server.

  • Session Pool Management: A thread-safe session pool is implemented using a list and a lock (self.session_pool and self.session_pool_lock, respectively). This pool keeps track of available Spark sessions. The session_pool_lock ensures that concurrent modifications to the session pool by multiple threads do not lead to race conditions or inconsistent states.

  • Asynchronous Session Creation: When the number of available sessions falls below a certain threshold, new sessions are created asynchronously in separate threads (create_session_async and add_session_to_pool). This design pattern prevents the application from blocking while waiting for new Spark sessions to be initialized, which can be a time-consuming process.
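
The pool pattern described in the bullets above can be sketched as follows. Names such as min_sessions and create_fn mirror the description rather than the actual source; the slow session-creation call deliberately runs outside the lock so that threads only contend briefly when touching the shared list:

```python
import threading

class SessionPool:
    """Thread-safe pool of Spark sessions that tops itself up asynchronously."""

    def __init__(self, create_fn, min_sessions=2):
        self.create_fn = create_fn            # e.g. a Livy create_session call
        self.min_sessions = min_sessions
        self.session_pool = []
        self.session_pool_lock = threading.Lock()

    def add_session_to_pool(self):
        session = self.create_fn()            # slow I/O: runs outside the lock
        with self.session_pool_lock:
            self.session_pool.append(session)

    def ensure_sessions(self):
        """Start one background thread per missing session."""
        with self.session_pool_lock:
            deficit = self.min_sessions - len(self.session_pool)
        threads = [threading.Thread(target=self.add_session_to_pool)
                   for _ in range(max(0, deficit))]
        for t in threads:
            t.start()
        return threads                        # caller may join() or fire and forget

    def acquire(self):
        """Take a ready session from the pool, or None if it is empty."""
        with self.session_pool_lock:
            return self.session_pool.pop() if self.session_pool else None
```

Returning the threads from ensure_sessions keeps the design flexible: a caller that needs a session immediately can join them, while a background maintenance loop can simply fire and forget.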

Session Initialization

The script initializes DataFrames using Pathling within a Spark session, directly reading healthcare data from Azure Blob Storage for rapid analytics. Each healthcare resource type, such as AllergyIntolerance or Patient, is processed through Pathling to decode FHIR-formatted data into Spark DataFrames. This is achieved by constructing paths to .ndjson files in Blob Storage, which Pathling reads and encodes into structured DataFrames.

These DataFrames are stored in a dictionary keyed by resource type and dynamically assigned to global variables named after each resource. This makes every DataFrame instantly accessible across the Spark session, with no reloading or reprocessing for subsequent queries. By pre-loading the data into structured form, the script removes the per-query overhead of the initial processing steps and significantly accelerates analytics on the healthcare data.

Implementation

  • Dynamic Resource Allocation: By managing a pool of Spark sessions and using multithreading to handle session creation and management, the code implements a dynamic resource allocation mechanism. This ensures that resources are efficiently utilized, scaling up or down based on demand.

  • Efficient Processing of Data Jobs: The session management logic directly supports the business objective of processing data jobs efficiently. By keeping a minimum number of Spark sessions ready, it reduces the latency between job submission and execution. This matters in a business context where timely data processing leads to faster decision-making and improved operational efficiency.

  • Resource Optimization: Through careful session management, the application minimizes the cost associated with running Spark clusters on Azure HDInsight. By maintaining an optimal number of active sessions and terminating idle ones, it ensures that resources are not wasted, thus optimizing cloud expenditure.
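
Terminating idle sessions, as described above, maps onto Livy's DELETE /sessions/{id} endpoint. A sketch, reusing a placeholder endpoint URL; the policy of keeping a fixed number of warm sessions is illustrative, not taken from the source:

```python
import requests

def close_idle_sessions(livy_url, keep=2):
    """Delete idle Livy sessions beyond the number kept warm; return the count closed."""
    sessions = requests.get(f"{livy_url}/sessions").json().get("sessions", [])
    idle = [s for s in sessions if s.get("state") == "idle"]
    for s in idle[keep:]:                      # keep a warm minimum, drop the rest
        requests.delete(f"{livy_url}/sessions/{s['id']}")
    return len(idle[keep:])
```

Run periodically (for example from the same maintenance loop that calls ensure_sessions), this caps the number of executors held on the HDInsight cluster and thus the associated cloud spend.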
