If you're doing any kind of pre-aggregation during ETL, a Spark direct
stream will let you more easily get the delivery semantics you need,
especially if you're using a transactional data store.
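For example, here is a minimal Java sketch of that pattern, not a drop-in
implementation: broker, topic, and table names are placeholders, and it
assumes spark-streaming-kafka-0-10 plus Postgres 9.5+ for ON CONFLICT.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.*;
    import org.apache.spark.streaming.kafka010.*;
    import scala.Tuple2;

    public class AggregatingEtl {
      public static void main(String[] args) throws Exception {
        JavaStreamingContext jssc = new JavaStreamingContext(
            new SparkConf().setAppName("aggregating-etl"), Durations.seconds(30));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092"); // placeholder
        kafkaParams.put("key.deserializer",
            org.apache.kafka.common.serialization.StringDeserializer.class);
        kafkaParams.put("value.deserializer",
            org.apache.kafka.common.serialization.StringDeserializer.class);
        kafkaParams.put("group.id", "etl");
        kafkaParams.put("enable.auto.commit", false);

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Arrays.asList("raw-events"), kafkaParams)); // placeholder topic

        stream.foreachRDD(rdd -> {
          OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
          // pre-aggregate: count per key (stand-in for your real ETL logic)
          Map<String, Long> counts = rdd
              .mapToPair(r -> new Tuple2<>(r.key(), 1L))
              .reduceByKey(Long::sum)
              .collectAsMap();
          try (Connection c =
              DriverManager.getConnection("jdbc:postgresql://dbhost/etl")) {
            c.setAutoCommit(false);
            // upsert aggregates and offsets in ONE transaction, so a failed
            // batch rolls back both and can be replayed from stored offsets
            try (PreparedStatement up = c.prepareStatement(
                    "INSERT INTO counts (k, cnt) VALUES (?, ?) "
                    + "ON CONFLICT (k) DO UPDATE SET cnt = counts.cnt + EXCLUDED.cnt");
                 PreparedStatement off = c.prepareStatement(
                    "INSERT INTO etl_offsets (topic, part, until_offset) VALUES (?, ?, ?) "
                    + "ON CONFLICT (topic, part) DO UPDATE SET until_offset = EXCLUDED.until_offset")) {
              for (Map.Entry<String, Long> e : counts.entrySet()) {
                up.setString(1, e.getKey());
                up.setLong(2, e.getValue());
                up.executeUpdate();
              }
              for (OffsetRange r : ranges) {
                off.setString(1, r.topic());
                off.setInt(2, r.partition());
                off.setLong(3, r.untilOffset());
                off.executeUpdate();
              }
            }
            c.commit();
          }
        });

        jssc.start();
        jssc.awaitTermination();
      }
    }

(On restart you'd read etl_offsets back and seed the stream's starting
offsets from it, so already-committed batches aren't double-counted.)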

If you're literally just copying individual uniquely keyed items from
Kafka to a key-value store, use Kafka consumers, sure.
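For instance, a bare-bones sketch; broker and topic names are made up, and
the in-memory map stands in for whatever key-value client you'd actually use:

    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class KafkaToKv {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder
        props.put("group.id", "copier");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        // stand-in for your real key-value store client
        ConcurrentMap<String, String> kvStore = new ConcurrentHashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          consumer.subscribe(Collections.singletonList("items")); // placeholder topic
          while (true) {
            for (ConsumerRecord<String, String> rec : consumer.poll(1000)) {
              kvStore.put(rec.key(), rec.value()); // overwrite: naturally idempotent
            }
            // commit only after the writes; a crash just replays overwrites
            consumer.commitSync();
          }
        }
      }
    }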

On Thu, Sep 29, 2016 at 10:35 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> My concern with Postgres / Cassandra is only scalability. I will look
> further into Postgres horizontal scaling, thanks.
>
> Writes could be idempotent if done as upserts; otherwise updates would be
> idempotent, but inserts would not.
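> (For example, with Postgres 9.5+ an upsert stays idempotent on replay;
> the table and column names here are just illustrative:
>
>     INSERT INTO devices (id, name) VALUES (?, ?)
>     ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;
>
> Running it twice with the same values leaves the same row.)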
>
> Data should not be lost. The system should be as fault tolerant as possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> I wouldn't give up the flexibility and maturity of a relational
>> database unless you have a very specific use case.  I'm not trashing
>> Cassandra; I've used Cassandra. But if all I know is that you're doing
>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> aggregations without a lot of forethought.  If you're worried about
>> scaling, there are several options for horizontally scaling Postgres
>> in particular.  One of the current best, from what I've worked with, is
>> Citus.
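>> (Illustration only: with Citus you shard an existing table with a single
>> call; the table and column names here are assumed:
>>
>>     SELECT create_distributed_table('events', 'device_id');
>>
>> after which inserts and queries on that table are routed across the
>> worker nodes.)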
>>
>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmc...@gmail.com>
>> wrote:
>> > Hi Cody
>> > Spark direct stream is just fine for this use case.
>> > But why Postgres and not Cassandra?
>> > Is there anything specific here that I may not be aware of?
>> >
>> > Thanks
>> > Deepak
>> >
>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org>
>> > wrote:
>> >>
>> >> How are you going to handle ETL failures?  Do you care about lost /
>> >> duplicated data?  Are your writes idempotent?
>> >>
>> >> Absent any other information about the problem, I'd stay away from
>> >> Cassandra/Flume/HDFS/HBase/whatever, and use a Spark direct stream
>> >> feeding Postgres.
>> >>
>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com>
>> >> wrote:
>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> >> > Nothing is being done to the data except some light ETL and then
>> >> > storing it in Cassandra.
>> >> >
>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>> >> > <deepakmc...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> It's better to use Spark's direct stream to ingest from Kafka.
>> >> >>
>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> I don't think I need separate speed storage and batch storage. Just
>> >> >>> taking in raw data from Kafka, standardizing it, and storing it
>> >> >>> somewhere the web UI can query seems like it will be enough.
>> >> >>>
>> >> >>> I'm thinking about:
>> >> >>>
>> >> >>> - Reading data from Kafka via Spark Streaming
>> >> >>> - Standardizing, then storing it in Cassandra
>> >> >>> - Querying Cassandra from the web ui
>> >> >>>
>> >> >>> That seems like it will work. My question now is whether to use Spark
>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
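>> >> >>> (If you go the Spark Streaming route, the Cassandra write side could
>> >> >>> be a short sketch like this, assuming the DataStax
>> >> >>> spark-cassandra-connector and a made-up keyspace, table, and bean:
>> >> >>>
>> >> >>>     import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
>> >> >>>     import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
>> >> >>>
>> >> >>>     // events is a JavaRDD<Event> of standardized records
>> >> >>>     javaFunctions(events)
>> >> >>>         .writerBuilder("analytics", "events", mapToRow(Event.class))
>> >> >>>         .saveToCassandra();
>> >> >>>
>> >> >>> where Event is a plain bean whose fields match the table's columns.)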
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >> >>> <mich.talebza...@gmail.com> wrote:
>> >> >>>>
>> >> >>>> - Spark Streaming to read data from Kafka
>> >> >>>> - Storing the data on HDFS using Flume
>> >> >>>>
>> >> >>>> You don't need Spark Streaming to read data from Kafka and store it
>> >> >>>> on HDFS; that is a waste of resources. Couple Flume to use Kafka as
>> >> >>>> the source and HDFS as the sink directly:
>> >> >>>>
>> >> >>>> KafkaAgent.sources = kafka-sources
>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
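>> >> >>>> (plus, roughly, the rest of the agent definition; the broker, topic
>> >> >>>> and path names are assumed, with Kafka source properties as in
>> >> >>>> Flume 1.7+:
>> >> >>>>
>> >> >>>> KafkaAgent.channels = mem-channel
>> >> >>>> KafkaAgent.sinks = hdfs-sinks
>> >> >>>> KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
>> >> >>>> KafkaAgent.sources.kafka-sources.kafka.bootstrap.servers = broker1:9092
>> >> >>>> KafkaAgent.sources.kafka-sources.kafka.topics = raw-events
>> >> >>>> KafkaAgent.sources.kafka-sources.channels = mem-channel
>> >> >>>> KafkaAgent.channels.mem-channel.type = memory
>> >> >>>> KafkaAgent.sinks.hdfs-sinks.channel = mem-channel
>> >> >>>> KafkaAgent.sinks.hdfs-sinks.hdfs.path = hdfs:///flume/kafka/%Y-%m-%d
>> >> >>>> KafkaAgent.sinks.hdfs-sinks.hdfs.useLocalTimeStamp = true )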
>> >> >>>>
>> >> >>>> That will be for your batch layer. To analyse it, you can read the
>> >> >>>> HDFS files directly with Spark, or simply store the data in a database
>> >> >>>> of your choice via cron or something. Do not mix your batch layer with
>> >> >>>> your speed layer.
>> >> >>>>
>> >> >>>> Your speed layer will ingest the same data directly from Kafka into
>> >> >>>> Spark Streaming, and that will be online or near real time (defined by
>> >> >>>> your window).
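>> >> >>>> (For example, a sliding window over the direct stream, sketched in
>> >> >>>> Java; the durations are arbitrary:
>> >> >>>>
>> >> >>>>     // 60-second window, recomputed every 10 seconds
>> >> >>>>     stream.map(ConsumerRecord::value)
>> >> >>>>           .window(Durations.seconds(60), Durations.seconds(10))
>> >> >>>>           .count()
>> >> >>>>           .print();
>> >> >>>>
>> >> >>>> where stream is the Kafka direct stream from the earlier sketch.)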
>> >> >>>>
>> >> >>>> Then you have a serving layer to present data from both the speed
>> >> >>>> layer (the one from Spark Streaming) and the batch layer.
>> >> >>>>
>> >> >>>> HTH
>> >> >>>>
>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com>
>> >> >>>> wrote:
>> >> >>>>>
>> >> >>>>> The web UI is actually the speed layer; it needs to be able to query
>> >> >>>>> the data online and show the results in real time.
>> >> >>>>>
>> >> >>>>> It also needs a custom front-end, so a system like Tableau can't be
>> >> >>>>> used; it must have a custom backend and front-end.
>> >> >>>>>
>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will work:
>> >> >>>>>
>> >> >>>>> - Spark Streaming to read data from Kafka
>> >> >>>>> - Storing the data on HDFS using Flume
>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>> >> >>>>> <mich.talebza...@gmail.com> wrote:
>> >> >>>>>>
>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>> >> >>>>>> stored on HDFS using Flume.
>> >> >>>>>>
>> >> >>>>>> - Query this data to generate reports / analytics (there will be a
>> >> >>>>>> web UI which will be the front-end to the data, and will show the
>> >> >>>>>> reports)
>> >> >>>>>>
>> >> >>>>>> This is basically the batch layer, and you need something like
>> >> >>>>>> Tableau or Zeppelin to query the data.
>> >> >>>>>>
>> >> >>>>>> You will also need Spark Streaming to query data online for the
>> >> >>>>>> speed layer. That data could be stored in some transient fabric like
>> >> >>>>>> Ignite or even Druid.
>> >> >>>>>>
>> >> >>>>>> HTH
>> >> >>>>>>
>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com>
>> >> >>>>>> wrote:
>> >> >>>>>>>
>> >> >>>>>>> It needs to be able to scale to a very large amount of data, yes.
>> >> >>>>>>>
>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>> >> >>>>>>> <deepakmc...@gmail.com> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>> What is the message inflow?
>> >> >>>>>>>> If it's really high, definitely Spark will be of great use.
>> >> >>>>>>>>
>> >> >>>>>>>> Thanks
>> >> >>>>>>>> Deepak
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com>
>> >> >>>>>>>> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>> >> >>>>>>>>>
>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
>> >> >>>>>>>>> their raw data into Kafka.
>> >> >>>>>>>>>
>> >> >>>>>>>>> I need to:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / raw
>> >> >>>>>>>>> HDFS / ElasticSearch / Postgres)
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Query this data to generate reports / analytics (there will be
>> >> >>>>>>>>> a web UI which will be the front-end to the data, and will show
>> >> >>>>>>>>> the reports)
>> >> >>>>>>>>>
>> >> >>>>>>>>> Java is being used as the backend language for everything
>> >> >>>>>>>>> (backend of the web UI, as well as the ETL layer).
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'm considering:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>> >> >>>>>>>>> (receive raw data from Kafka, standardize & store it)
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS for storing the
>> >> >>>>>>>>> standardized data, and to allow queries
>> >> >>>>>>>>>
>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>> >> >>>>>>>>> queries across the data (mostly filters), or directly run queries
>> >> >>>>>>>>> against Cassandra / HBase
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>> >> >>>>>>>>> alternatives I should go with (e.g., using raw Kafka consumers vs
>> >> >>>>>>>>> Spark for ETL, which persistent data store to use, and how to
>> >> >>>>>>>>> query that data store in the backend of the web UI, for
>> >> >>>>>>>>> displaying the reports).
>> >> >>>>>>>>>
>> >> >>>>>>>>> Thanks.
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>
>> >> >>>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Thanks
>> >> >> Deepak
>> >> >> www.bigdatabig.com
>> >> >> www.keosha.net
>> >> >
>> >> >
>> >>
>> >>
>> >
>> >
>> >
>
>
