I have a somewhat tricky use case, and I'm looking for ideas.

I have 5-6 Kafka producers reading from various APIs and writing their raw
data into Kafka.
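
For context, each producer is doing something roughly like this (a simplified
sketch; the topic name and key are just illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RawApiProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // In the real producers this comes from polling the external API
            String rawJson = "{\"source\":\"api-1\",\"payload\":\"...\"}";
            // Keyed by source so all records from one API land in the same partition
            producer.send(new ProducerRecord<>("raw-api-events", "api-1", rawJson));
            producer.flush();
        }
    }
}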

I need to:

- Do ETL on the data, and standardize it.

- Store the standardized data somewhere (HBase / Cassandra / raw HDFS /
Elasticsearch / Postgres)

- Query this data to generate reports / analytics (there will be a web UI
acting as the front end to the data, which will show the reports)

Java is the backend language for everything (the backend of the web UI as
well as the ETL layer).

I'm considering:

- Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive
raw data from Kafka, standardize it, and store it); a rough sketch of the
raw-consumer version follows this list

- Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
and to allow queries

- In the backend of the web UI, I could either use Spark to run queries
across the data (mostly filters), or directly run queries against Cassandra
/ HBase
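
To make the first option concrete, here is roughly what I have in mind for the
raw-consumer version of the ETL layer (just a sketch, assuming a recent Kafka
clients version; standardize() and writeToStore() are placeholders for the real
field mapping and for whichever store we end up picking):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EtlConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");
        props.put("group.id", "etl-standardizer");   // consumer group so several instances can share partitions
        props.put("enable.auto.commit", "false");    // commit only after the write to the store succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-api-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String standardized = standardize(record.value()); // per-source mapping into the common schema
                    writeToStore(standardized);                        // Cassandra / HBase / HDFS write, whichever we pick
                }
                consumer.commitSync(); // commit offsets only once the batch is persisted
            }
        }
    }

    private static String standardize(String raw) { return raw; } // placeholder for the real mapping
    private static void writeToStore(String row) { }              // placeholder for the chosen store's client
}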

I'd appreciate some thoughts / suggestions on which of these alternatives I
should go with (e.g., raw Kafka consumers vs. Spark for ETL, which
persistent data store to use, and how to query that data store in the
backend of the web UI for displaying the reports).
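
For the "query the store directly" option, this is the sort of thing I'm
picturing in the web UI backend (illustrative only, assuming Cassandra with the
DataStax Java driver 3.x; the keyspace, table, and column names are made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ReportQuery {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("cassandra1").build();
             Session session = cluster.connect("reports")) {
            // Filter on the partition key (source) plus a clustering column (event_time),
            // so the query stays a straight lookup rather than a scan
            ResultSet rs = session.execute(
                "SELECT event_time, metric, value FROM standardized_events "
                    + "WHERE source = ? AND event_time >= ?",
                "api-1", new java.util.Date(System.currentTimeMillis() - 86_400_000L));
            for (Row row : rs) {
                System.out.println(row.getTimestamp("event_time") + " "
                    + row.getString("metric") + " = " + row.getDouble("value"));
            }
        }
    }
}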


Thanks.
