Real-Time Data

The diagram shows a system that listens to a blockchain for new events, in this case new health records added to the chain, and then processes and stores those records for querying and analysis. Here's how the system works step by step for a decentralized electronic health record (EHR) system:
Blockchain events: New health records are added to the blockchain. These records are likely to be encrypted to protect patient privacy and may trigger events that notify listeners when new data is added.
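Since the records on the chain would typically be encrypted, here is a minimal sketch of how a record might be encrypted before a reference to it is written on-chain. The record fields, the key-handling approach, and the AES-256-GCM choice are assumptions for illustration, not details taken from the diagram.

```typescript
import { createCipheriv, randomBytes } from "crypto";

// Hypothetical shape of a health record payload; field names are assumptions.
interface HealthRecord {
  patientId: string;
  recordType: string;
  payload: string; // e.g. a serialized clinical document
}

// Encrypt a record with AES-256-GCM before its reference is written to the
// blockchain. Key management (KMS, per-patient keys, rotation) is out of
// scope for this sketch.
function encryptRecord(record: HealthRecord, key: Buffer) {
  const iv = randomBytes(12); // 96-bit IV, as recommended for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(record), "utf8"),
    cipher.final(),
  ]);
  return {
    iv: iv.toString("base64"),
    authTag: cipher.getAuthTag().toString("base64"),
    ciphertext: ciphertext.toString("base64"),
  };
}
```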
Listener: A dedicated listener service continuously monitors the blockchain for these specific events. When a new health record is written to the chain, the listener detects the corresponding event.
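As a sketch of what such a listener could look like on an Ethereum-style chain, the snippet below uses ethers.js to subscribe to a contract event. The WebSocket endpoint, contract address, and the RecordAdded event signature are all hypothetical placeholders; the real EHR contract would define its own event.

```typescript
import { Contract, WebSocketProvider } from "ethers";

// Placeholder node endpoint and contract address; replace with the real ones.
const provider = new WebSocketProvider("wss://example-node:8546");
const abi = [
  "event RecordAdded(bytes32 indexed recordId, address indexed submitter, string pointer)",
];
const ehrContract = new Contract(
  "0x0000000000000000000000000000000000000000", // deployed EHR contract address
  abi,
  provider
);

// Fires whenever a RecordAdded event is emitted on-chain and hands the
// event data off to the Node.js processing app.
ehrContract.on("RecordAdded", (recordId, submitter, pointer) => {
  console.log(`New record ${recordId} from ${submitter} at ${pointer}`);
  // forwardToProcessingApp({ recordId, submitter, pointer }); // hypothetical handler
});
```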
Node.js App: Once the listener picks up an event, it forwards the data to a Node.js application. This app could be responsible for several tasks, including validating the event, decrypting the health record if necessary, transforming the data into a suitable format, and then sending it to a message broker for reliable delivery.
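A minimal sketch of the publishing side of such a Node.js app is shown below, using the kafkajs client. The broker address, topic name, and event shape are assumptions; validation and decryption are left out to keep the example focused on delivery to the broker.

```typescript
import { Kafka } from "kafkajs";

// Broker list, client id, and topic name are assumptions for this sketch.
const kafka = new Kafka({ clientId: "ehr-ingest-app", brokers: ["kafka-1:9092"] });
const producer = kafka.producer();

interface RecordEvent {
  recordId: string;
  patientId: string;
  addedAt: string; // ISO timestamp
}

// Connect once at application startup.
export async function start() {
  await producer.connect();
}

// Publish one validated (and, if needed, decrypted) event to the broker.
export async function publishRecordEvent(event: RecordEvent) {
  await producer.send({
    topic: "health-records",
    messages: [
      {
        // Keying by patient keeps one patient's records ordered within a partition.
        key: event.patientId,
        value: JSON.stringify(event),
      },
    ],
  });
}
```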
Apache Kafka Streaming: The Node.js app publishes the health record to Apache Kafka, which acts as the messaging backbone. Kafka is a fault-tolerant, high-throughput publish-subscribe system for building distributed data streams; here it handles the health record events as messages in a topic.
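The fault tolerance and throughput come from how the topic is partitioned and replicated across brokers. The sketch below creates such a topic with the kafkajs admin client; the partition and replication counts are illustrative values, not recommendations from the source.

```typescript
import { Kafka } from "kafkajs";

// Broker list and sizing numbers are assumptions; real values depend on the
// expected event rate and the number of brokers in the cluster.
const admin = new Kafka({ clientId: "ehr-admin", brokers: ["kafka-1:9092"] }).admin();

export async function createHealthRecordsTopic() {
  await admin.connect();
  await admin.createTopics({
    topics: [
      {
        topic: "health-records",
        numPartitions: 12,    // more partitions allow more parallel consumers
        replicationFactor: 3, // copies across brokers for fault tolerance
      },
    ],
  });
  await admin.disconnect();
}
```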
Data Warehouse Update: The messages from Kafka are then consumed by the Apache Spark cluster, which refreshes the data available for analytics. This keeps the data warehouse current to within seconds of the data being created.
Data Warehouse: While not explicitly labeled in the diagram, the ultimate destination of the processed health records is a data warehouse within the Azure cluster. This is where the data becomes available for querying and analysis; the warehouse is designed for fast queries and stores the data in a form that is easy to retrieve and analyze.
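In the diagram this consuming stage is Apache Spark, which is usually written in Scala or Python. Purely to illustrate the consume-and-load pattern in the same language as the rest of these sketches, the snippet below uses a kafkajs consumer that micro-batches messages into a warehouse load; loadIntoWarehouse is a hypothetical stand-in for whatever Azure-hosted warehouse the cluster writes to.

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "warehouse-loader", brokers: ["kafka-1:9092"] });
const consumer = kafka.consumer({ groupId: "warehouse-loaders" });

// Hypothetical bulk-load step (e.g. an insert into a staging table);
// intentionally left abstract.
async function loadIntoWarehouse(rows: unknown[]): Promise<void> {}

export async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: "health-records", fromBeginning: false });
  await consumer.run({
    // Processing whole batches keeps the warehouse only seconds behind the chain
    // while avoiding a round trip per message.
    eachBatch: async ({ batch }) => {
      const rows = batch.messages.map((m) => JSON.parse(m.value?.toString() ?? "{}"));
      await loadIntoWarehouse(rows);
    },
  });
}
```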
How far can Apache Kafka scale?
Apache Kafka can feed data into an Apache Spark cluster at very high speed. Kafka is known for its high throughput and low latency, making it capable of handling a large number of messages per second; in one well-known benchmark, Kafka achieved 2 million writes per second on three cheap machines. This throughput makes it well suited for feeding data into a Spark cluster for real-time processing and analysis. The combination of Kafka and Spark allows for processing diverse data sources, such as real-time weather data streams, and for building machine learning models on top of them.
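On the producer side, much of that throughput comes from batching many messages per request and compressing them. The sketch below shows throughput-oriented settings with kafkajs; the topic name and the specific options chosen are assumptions, not a reproduction of the benchmark configuration.

```typescript
import { Kafka, CompressionTypes } from "kafkajs";

const kafka = new Kafka({ clientId: "ehr-bulk-producer", brokers: ["kafka-1:9092"] });
const producer = kafka.producer({ allowAutoTopicCreation: false });

export async function sendBatch(events: { patientId: string; body: string }[]) {
  await producer.connect();
  await producer.send({
    topic: "health-records",
    acks: -1,                             // wait for all in-sync replicas
    compression: CompressionTypes.GZIP,   // trade CPU for network/disk throughput
    // Sending many messages per request amortizes round trips, which is a large
    // part of how Kafka sustains very high write rates.
    messages: events.map((e) => ({ key: e.patientId, value: e.body })),
  });
}
```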