Big Data Ecosystem

Author: Jacob Park

NoSQL/NewSQL Databases

CockroachDB

An open-source, distributed NewSQL OLTP database, inspired by Google's Spanner, that provides CP guarantees and ACID transactions.

Use Cases

  • PostgreSQL Replacement (Wire-Protocol Compatible).
  • Distributed SQL.
  • ACID Transactions.

See Also

Cassandra

An open-source, distributed wide-column store that combines the replication design of Amazon's Dynamo with the data model of Google's Bigtable, favoring AP guarantees with tunable consistency.

Use Cases

  • Globally Distributed Replication.
  • Automatic TTL.
  • Event Data.
  • Machine Learning Data.
  • Time-Series Data.
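
As a rough illustration of the automatic TTL use case, the sketch below models per-write expiry with a plain in-memory Python dictionary. The class `TTLStore` and its `put`/`get` methods are hypothetical names chosen for this example and are not Cassandra's API; in Cassandra itself a TTL is attached to a write (for example with `USING TTL` in CQL) and the database removes expired data on its own.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLStore:
    """Toy key-value store where every write carries a time-to-live.

    Hypothetical illustration only; this is not Cassandra's API.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._data: Dict[str, Tuple[Any, float]] = {}

    def put(self, key: str, value: Any, ttl_seconds: float) -> None:
        # Store the value together with its absolute expiry time.
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Lazily drop the expired entry on read.
            del self._data[key]
            return None
        return value
```

Passing a fake clock (any zero-argument callable returning seconds) makes the expiry behavior easy to check deterministically.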

See Also

Druid

An open-source, distributed, column-oriented OLAP/analytics database.

Use Cases

  • Historical and Real-Time Analytics.

See Also

Elasticsearch

An open-source, sharded full-text search engine, built on Apache Lucene, that favors CP guarantees.

Use Cases

  • Full-Text Search.
  • Geospatial Intelligence.
  • Log Ingestion, Analysis, and Visualization.
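
Full-text search engines of this kind rest on an inverted index, which maps each term to the documents containing it. The following is a minimal sketch of that idea in plain Python; the function names `tokenize`, `build_index`, and `search` are hypothetical, and real analyzers, relevance scoring, and sharding are deliberately left out.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    # Lowercase and split on non-alphanumeric characters; real analyzers do far more.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    # AND semantics: return the documents that contain every query term.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A query is answered by intersecting the posting sets of its terms, so the cost depends on how many documents contain those terms rather than on the total size of the corpus.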

See Also

Kafka

An open-source, distributed publish/subscribe messaging system built around a partitioned, replicated commit log.

Use Cases

  • Activity Tracking.
  • Real-Time Metrics.
  • Log Aggregation.
  • Stream Processing.
  • Event Sourcing.
  • Commit Log.
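
The commit-log and event-sourcing use cases both rely on an append-only log from which each consumer reads at its own position. The sketch below imitates that model for a single partition held in memory; `CommitLog` and its methods are hypothetical names for illustration and are not a Kafka client API.

```python
from typing import Dict, List


class CommitLog:
    """Toy single-partition, in-memory append-only log (hypothetical, not a Kafka client)."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}  # next offset to read, per consumer group

    def append(self, record: str) -> int:
        # Records are only ever appended; the returned offset is the record's position.
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> List[str]:
        # Each consumer group reads from its own offset, so groups do not affect each other.
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int) -> None:
        # Rewinding an offset replays history, which is what event sourcing relies on.
        self._offsets[group] = offset
```

Because reading never removes records, a new consumer group can start from offset 0 and rebuild its state from the full history.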

See Also

JanusGraph

An open-source, distributed graph database forked from Titan.

Use Cases

  • Fraud Detection.
  • Infrastructure Monitoring.
  • Recommendation Engines.
  • Social Network Graphs.
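
To make the recommendation-engine use case concrete, the sketch below ranks "friends of friends" by the number of shared neighbors over a plain adjacency dictionary. The function `recommend` and the sample data are hypothetical; against JanusGraph itself such a traversal would normally be written in Gremlin rather than in application code like this.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(graph: Dict[str, Set[str]], user: str, limit: int = 3) -> List[str]:
    """Rank non-neighbors of `user` by how many neighbors they share with `user`."""
    direct = graph.get(user, set())
    shared: Counter = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the user themselves and anyone they are already connected to.
            if candidate != user and candidate not in direct:
                shared[candidate] += 1
    # Sort by shared-neighbor count (descending), then by name for a stable result.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]
```

This is a two-hop traversal, the kind of query that a graph database is designed to answer without repeated joins.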

See Also

MongoDB

An open-source, sharded BSON-document store that favors CP guarantees.

Use Cases

  • Flexible Schemas.
  • Complex Hierarchical Data.

See Also

Redis

An open-source, in-memory data structure store.

Use Cases

  • LRU Cache.
  • Complex Data Structures.
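
The LRU cache use case can be sketched with an ordered dictionary: reads move a key to the "most recent" end, and inserts beyond capacity evict from the other end. `LRUCache` below is a hypothetical in-process illustration of the policy, not a Redis client; Redis itself applies an approximated LRU when an eviction policy such as `allkeys-lru` is configured through `maxmemory-policy`.

```python
from collections import OrderedDict
from typing import Any, Optional


class LRUCache:
    """Exact least-recently-used cache with a fixed capacity (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
```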

See Also

RocksDB

An open-source, embeddable persistent key-value store based on Google's LevelDB.

Use Cases

  • Localized State.
  • Low-Latency Embeddable Cache.

See Also

TiDB

An open-source, distributed NewSQL HTAP (OLTP/OLAP) database, with a transaction model based on Google's Percolator, that provides CP guarantees and ACID transactions.

Use Cases

  • MySQL Replacement (Wire-Protocol Compatible).
  • Distributed SQL.
  • ACID Transactions.

See Also

ZooKeeper

An open-source, distributed hierarchical key-value store that provides CP guarantees.

Use Cases

  • Distributed Configurations.
  • Distributed Coordination.
  • Naming Service.
  • Leadership Election.
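
A common leader-election recipe on ZooKeeper has every candidate create an ephemeral sequential node, with the candidate holding the lowest sequence number acting as leader. The sketch below imitates only that rule in memory; `Election` and its methods are hypothetical names, and nothing here talks to a real ZooKeeper ensemble.

```python
from typing import Dict, Optional


class Election:
    """In-memory imitation of the ephemeral-sequential-node election recipe (hypothetical)."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._candidates: Dict[int, str] = {}  # sequence number -> candidate name

    def join(self, name: str) -> int:
        # Mirrors creating an ephemeral sequential node: each joiner gets a unique, increasing number.
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = name
        return seq

    def leave(self, seq: int) -> None:
        # Mirrors the ephemeral node disappearing when a candidate's session ends.
        self._candidates.pop(seq, None)

    def leader(self) -> Optional[str]:
        # The candidate holding the lowest sequence number is the leader.
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]
```

When the current leader leaves, leadership passes to the next-lowest sequence number without any further negotiation between candidates.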

See Also

Processing

Flink

An open-source, distributed streaming data-flow engine.

Use Cases

  • Streaming ETL.
  • Streaming SQL.
  • Event-Driven Applications.
  • Stateful Applications.

See Also

Spark

An open-source, distributed general-purpose cluster-computing framework.

Use Cases

  • Batch ETL.
  • Batch SQL.
  • Data Mining.

See Also

Scheduling

Airflow

An open-source platform to programmatically author, schedule, and monitor workflows.

Use Cases

  • Scheduling ETL Jobs.
  • Scheduling Machine Learning Jobs.
  • Coordinating Data Pipelines.
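
Workflow schedulers of this kind model a pipeline as a directed acyclic graph in which a task runs only after its upstream tasks have finished. The sketch below derives one valid run order with Python's standard `graphlib` module (Python 3.9 or later); it does not use the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order; `dependencies` maps each task to its upstream tasks."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical ETL pipeline: extract -> transform -> load, with a report after load.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}
order = execution_order(pipeline)
```

For a pipeline with branches, several orders can be valid; a real scheduler additionally runs independent branches in parallel and handles retries, neither of which is shown here.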

See Also

Serialization

Arrow

An open-source, language-independent columnar memory format for flat and hierarchical data.

Use Cases

  • In-Memory Analytics.
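
The benefit of a columnar layout for analytics is that an aggregate reads only the columns it needs. The sketch below pivots row-oriented records into per-column lists using plain Python; `to_columns` and the sample data are hypothetical, and the real Arrow format adds typed, contiguous buffers and validity bitmaps that are not modeled here.

```python
from typing import Any, Dict, List


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per column; missing values become None."""
    names: List[str] = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    return {name: [row.get(name) for row in rows] for name in names}


rows = [
    {"user": "a", "latency_ms": 12},
    {"user": "b", "latency_ms": 30},
    {"user": "c", "latency_ms": 18},
]
columns = to_columns(rows)
# An aggregate touches only the one column it needs.
mean_latency = sum(columns["latency_ms"]) / len(columns["latency_ms"])
```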

See Also

Avro

An open-source, row-oriented data serialization and remote procedure call framework.

Use Cases

  • Streaming Analytics.
  • Schema Evolution.
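
Schema evolution means that a reader using a newer schema can still consume records written with an older one: fields the writer lacked are filled from the reader's defaults, and fields the reader does not declare are ignored. The sketch below applies that rule to already-decoded dictionaries; `resolve` is a hypothetical helper, and no Avro library or binary encoding is involved.

```python
from typing import Any, Dict, List


def resolve(record: Dict[str, Any], reader_fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Project a decoded record onto the reader's fields, applying defaults where needed."""
    resolved: Dict[str, Any] = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field added after the record was written: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields the writer had but the reader does not declare are dropped.
    return resolved
```

Adding a field with a default is therefore a backward-compatible change in this model, while adding a field without one makes older records unreadable.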

See Also

Parquet

An open-source, columnar storage format.

Use Cases

  • Batched Analytics.

See Also

Storage

Alluxio

An open-source, memory-centric virtual distributed file system.

Use Cases

  • Storage Abstraction.
  • Remote Data Access Acceleration.

See Also

Hadoop Distributed File System

An open-source, distributed file system designed to run on commodity hardware.

Use Cases

  • Bare-Metal Data Center.

See Also

S3

A proprietary, distributed object store designed for four nines (99.99%) of availability.

Use Cases

  • AWS.

See Also