Resources

Apache Spark

Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the AMPLab at the University of California, Berkeley, Spark is an efficient, general-purpose engine for both batch and streaming data processing. It offers APIs in Scala, Python, Java, and R, which makes it accessible to a broad range of developers and data scientists.

Apache Kafka

Apache Kafka is an open-source distributed event streaming platform for building real-time data pipelines and streaming applications. Originally developed at LinkedIn and later donated to the Apache Software Foundation, Kafka provides high-throughput, fault-tolerant publish-subscribe messaging. It is widely used for log aggregation, stream processing, event sourcing, and real-time analytics.
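The publish-subscribe model can be illustrated with a toy, in-memory sketch. This is not the Kafka client API: the `ToyTopic` class and its methods are hypothetical names chosen for illustration. It only models the idea that messages are appended to a log and that each consumer group tracks its own read position, so several groups can read the same stream independently. A real Kafka topic is partitioned, replicated across brokers, and persisted to disk.

```python
from collections import defaultdict


class ToyTopic:
    """Hypothetical in-memory stand-in for a single topic partition."""

    def __init__(self):
        self._log = []                    # append-only list of messages
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group, max_records=10):
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets[group]
        records = self._log[start:start + max_records]
        self._offsets[group] = start + len(records)
        return records


topic = ToyTopic()
topic.publish("user-signed-up")
topic.publish("order-placed")

# Two groups consume the same messages without affecting each other.
analytics_batch = topic.poll("analytics")
audit_batch = topic.poll("audit", max_records=1)
```

Because reading does not remove messages from the log, adding a new consumer group later does not disturb existing ones, which is what makes this model suit log aggregation and event sourcing.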

Apache Livy

Apache Livy is an open-source service that lets application clients interact with Apache Spark over a REST interface. Clients use it to manage Spark sessions and contexts, to submit either code snippets or entire applications as Spark jobs, and to monitor job execution programmatically. Livy is designed to give applications a scalable and secure way to connect to Spark, and by abstracting the details of job submission it lets developers focus on writing Spark code rather than managing infrastructure. It is particularly useful in multi-tenant environments, where multiple and possibly concurrent users need access to the same Spark cluster.
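As a rough sketch of what programmatic submission looks like, the following builds requests against Livy's REST endpoints for interactive sessions (`/sessions`), statements within a session (`/sessions/{id}/statements`), and batch applications (`/batches`). The base URL is an assumption (a local Livy server on its default port 8998), and the helper function names are hypothetical. The requests are only constructed here; calling `send` requires a reachable Livy server.

```python
import json
import urllib.request

# Assumption: a Livy server running locally on its default port.
LIVY_URL = "http://localhost:8998"


def livy_request(path, payload=None):
    """Build (but do not send) a JSON request for the Livy REST API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        url=f"{LIVY_URL}{path}",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST" if payload is not None else "GET",
    )


def create_session(kind="pyspark"):
    # POST /sessions starts an interactive session of the given kind.
    return livy_request("/sessions", {"kind": kind})


def run_statement(session_id, code):
    # POST /sessions/{id}/statements submits a code snippet to that session.
    return livy_request(f"/sessions/{session_id}/statements", {"code": code})


def get_statement(session_id, statement_id):
    # GET on the statement resource returns its state and, once finished, its output.
    return livy_request(f"/sessions/{session_id}/statements/{statement_id}")


def submit_batch(file, class_name=None, args=None):
    # POST /batches submits an entire application (for example a jar or .py file).
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    return livy_request("/batches", payload)


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical interactive flow would be `send(create_session())`, polling the session until it is idle, then `send(run_statement(session_id, "1 + 1"))` and polling `get_statement` for the result. In a secured or multi-tenant deployment the requests would also need whatever authentication the server is configured with, which this sketch leaves out.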

Pathling

Pathling is a specialized tool designed to enable the analysis of healthcare data. It provides a server and API for executing distributed queries across large datasets, particularly those stored in the FHIR (Fast Healthcare Interoperability Resources) format. Developed by CSIRO (Commonwealth Scientific and Industrial Research Organisation) in Australia, Pathling aims to facilitate complex data analysis and interoperability within the healthcare sector, making it easier to derive insights from electronic health records and other medical data.

Kubernetes

Kubernetes is an open-source platform for automating the deployment, scaling, and operation of application containers across clusters of hosts. It was originally designed by Google and is now maintained by the Cloud Native Computing Foundation. Kubernetes provides a framework for running distributed systems resiliently: it handles scaling and failover for applications and provides deployment patterns, among other things. It works with container runtimes that implement its Container Runtime Interface (CRI), such as containerd and CRI-O, and it runs container images built with Docker.

Docker

Docker is an open-source platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure so you can deliver software quickly. With Docker, you can manage your infrastructure in the same ways you manage your applications. By taking advantage of Docker’s methodologies for shipping, testing, and deploying code quickly, you can significantly reduce the delay between writing code and running it in production.

Azure HDInsight

Azure HDInsight is a cloud service provided by Microsoft Azure that enables big data analytics for organizations. It simplifies the management and processing of big data, providing a fully managed, full-spectrum, open-source analytics service for enterprises. HDInsight supports a wide variety of big data technologies including Apache Hadoop, Spark, Kafka, HBase, and more, enabling batch, interactive, and real-time analytics on large-scale datasets stored in Azure or on-premises.
