Hi Michael,

How about Druid <http://druid.io/> here?

Hive ORC tables are another option; they support Streaming Data Ingest
<https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest> from
Flume and Storm.
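
For what it's worth, here is a minimal Java sketch of that Streaming Data
Ingest API. The metastore URI, database, table name and partition value are
placeholders, and the target table has to be a bucketed, transactional ORC
table (this is essentially what Flume's Hive sink and Storm's Hive bolt do
under the covers):

import java.util.Arrays;
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class HiveStreamingSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder metastore URI, database, table and partition value
    HiveEndPoint endPoint = new HiveEndPoint(
        "thrift://metastore-host:9083", "default", "events",
        Arrays.asList("2016-09-29"));
    // true = create the partition if it does not exist yet
    StreamingConnection conn = endPoint.newConnection(true);

    String[] fieldNames = {"id", "payload"};
    DelimitedInputWriter writer =
        new DelimitedInputWriter(fieldNames, ",", endPoint);

    // Records written in each transaction land as ORC delta files
    TransactionBatch txnBatch = conn.fetchTransactionBatch(10, writer);
    txnBatch.beginNextTransaction();
    txnBatch.write("1,hello".getBytes());
    txnBatch.write("2,world".getBytes());
    txnBatch.commit();
    txnBatch.close();
    conn.close();
  }
}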

However, Spark cannot read ORC transactional tables because of the delta
files, unless compaction has been done (a nightmare).
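
If you do go down that road, a major compaction can at least be triggered
programmatically. A rough sketch over Hive JDBC, with a placeholder
HiveServer2 URL, a hypothetical table called events and a hypothetical
partition column dt:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TriggerCompaction {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Placeholder HiveServer2 URL and credentials -- adjust to your cluster
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver-host:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // Major compaction merges the delta files into a new base file,
      // after which Spark can read the ORC data directly.
      // Drop the PARTITION clause for an unpartitioned table.
      stmt.execute(
          "ALTER TABLE events PARTITION (dt='2016-09-29') COMPACT 'major'");
    }
  }
}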

HTH


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 17:01, Michael Segel <msegel_had...@hotmail.com>
wrote:

> Ok… so what’s the tricky part?
> Spark Streaming isn’t real time so if you don’t mind a slight delay in
> processing… it would work.
>
> The drawback is that you now have a long-running Spark job (assuming it runs
> under YARN), and that could become a problem in terms of security and
> resources. (How well does YARN handle long-running jobs these days in a
> secured cluster? Steve L. may have some insight… )
>
> Raw HDFS would become a problem because Apache HDFS is still WORM (write
> once, read many). (Do you want to write your own compaction code? Or use
> Hive 1.x+?)
>
> HBase? Depending on your admin… stability could be a problem.
> Cassandra? That would be a separate cluster and that in itself could be a
> problem…
>
> YMMV so you need to address the pros/cons of each tool specific to your
> environment and skill level.
>
> HTH
>
> -Mike
>
> > On Sep 29, 2016, at 8:54 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> >
> > I have a somewhat tricky use case, and I'm looking for ideas.
> >
> > I have 5-6 Kafka producers, reading various APIs, and writing their raw
> data into Kafka.
> >
> > I need to:
> >
> > - Do ETL on the data, and standardize it.
> >
> > - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
> ElasticSearch / Postgres)
> >
> > - Query this data to generate reports / analytics (There will be a web
> UI which will be the front-end to the data, and will show the reports)
> >
> > Java is being used as the backend language for everything (backend of
> the web UI, as well as the ETL layer)
> >
> > I'm considering:
> >
> > - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
> (receive raw data from Kafka, standardize & store it)
> >
> > - Using Cassandra, HBase, or raw HDFS, for storing the standardized
> data, and to allow queries
> >
> > - In the backend of the web UI, I could either use Spark to run queries
> across the data (mostly filters), or directly run queries against Cassandra
> / HBase
> >
> > I'd appreciate some thoughts / suggestions on which of these
> alternatives I should go with (e.g., using raw Kafka consumers vs Spark for
> ETL, which persistent data store to use, and how to query that data store
> in the backend of the web UI, for displaying the reports).
> >
> >
> > Thanks.
>
>
