The business use case is to read a user's data from a variety of different
services through their APIs, and then allow the user to query that data
on a per-service basis, as well as in aggregate across all services.

The way I'm considering doing it is to do some basic ETL (drop all the
unnecessary fields, rename some fields to something more manageable, etc.)
and then store the data in Cassandra / Postgres.
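
As a rough sketch of what that ETL step might look like (Python used only
for brevity; the raw field names, the mapping, and the `standardize` helper
are all made up for illustration, not taken from any real service):

```python
from datetime import datetime, timezone

# Hypothetical mapping from one service's raw field names to standardized ones.
# Any raw field not listed here is dropped.
FIELD_MAP = {
    "usr_id": "user_id",
    "created_ts": "date",
    "amt": "amount",
}


def standardize(raw: dict, service: str) -> dict:
    """Keep only mapped fields, rename them, and tag the record with its service."""
    record = {new: raw[old] for old, new in FIELD_MAP.items() if old in raw}
    # Assumption: the raw timestamp is epoch seconds; store it as an ISO date (UTC).
    if "date" in record:
        record["date"] = (
            datetime.fromtimestamp(record["date"], tz=timezone.utc).date().isoformat()
        )
    record["service"] = service
    return record


raw_event = {"usr_id": 42, "created_ts": 1475107200, "amt": 9.5, "debug_blob": "..."}
clean = standardize(raw_event, "service_a")
print(clean)
```

Each service would presumably need its own mapping, but the output shape
stays the same, which is what makes the later queries uniform.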

Then, when the user wants to view a particular report, I'd query the
respective table in Cassandra / Postgres (select .. from data where user = ?
and date between <start> and <end> and some_field = ?).
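
A minimal sketch of that report query, using an in-memory SQLite table as a
stand-in for Postgres. The table layout, the `amount` column, the `user_id`
column name (used here because `user` is a reserved word in Postgres), and
the `report` helper are assumptions for illustration only. It shows the same
filter serving both the per-service view and the aggregate across services:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data ("
    "user_id INTEGER, service TEXT, date TEXT, some_field TEXT, amount REAL)"
)
rows = [
    (42, "service_a", "2016-09-01", "x", 10.0),
    (42, "service_a", "2016-09-15", "x", 5.0),
    (42, "service_b", "2016-09-20", "x", 2.5),
    (42, "service_b", "2016-10-05", "x", 99.0),  # outside the date range
    (7, "service_a", "2016-09-10", "x", 1.0),    # different user
]
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", rows)


def report(user_id, start, end, some_field, service=None):
    """Sum `amount` per service for one user; optionally restrict to one service."""
    sql = (
        "SELECT service, SUM(amount) FROM data "
        "WHERE user_id = ? AND date BETWEEN ? AND ? AND some_field = ?"
    )
    params = [user_id, start, end, some_field]
    if service is not None:
        sql += " AND service = ?"
        params.append(service)
    sql += " GROUP BY service ORDER BY service"
    return conn.execute(sql, params).fetchall()


# Per-service report, then the aggregate across all services.
per_service = report(42, "2016-09-01", "2016-09-30", "x", service="service_a")
all_services = report(42, "2016-09-01", "2016-09-30", "x")
total = sum(amount for _, amount in all_services)
print(per_service, all_services, total)
```

The dates are stored as ISO-8601 strings here so that BETWEEN compares them
correctly as text; a real Postgres table would more likely use a date or
timestamp column.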

How will Spark Streaming help with aggregation? Couldn't the data be queried
from Cassandra / Postgres via the Kafka consumer and aggregated that way?

On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger <c...@koeninger.org> wrote:

> No, direct stream in and of itself won't ensure an end-to-end
> guarantee, because it doesn't know anything about your output actions.
>
> You still need to do some work.  The point is having easy access to
> offsets for batches on a per-partition basis makes it easier to do
> that work, especially in conjunction with aggregation.
>
> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
> > If you use Spark direct streams, they ensure an end-to-end guarantee for
> > messages.
> >
> >
> > On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <ali.rac...@gmail.com>
> > wrote:
> >>
> >> My concern with Postgres / Cassandra is only scalability. I will look
> >> further into Postgres horizontal scaling, thanks.
> >>
> >> Writes could be idempotent if done as upserts, otherwise updates will be
> >> idempotent but not inserts.
> >>
> >> Data should not be lost. The system should be as fault tolerant as
> >> possible.
> >>
> >> What's the advantage of using Spark for reading Kafka instead of direct
> >> Kafka consumers?
> >>
> >> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org>
> >> wrote:
> >>>
> >>> I wouldn't give up the flexibility and maturity of a relational
> >>> database, unless you have a very specific use case.  I'm not trashing
> >>> cassandra, I've used cassandra, but if all I know is that you're doing
> >>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
> >>> aggregations without a lot of forethought.  If you're worried about
> >>> scaling, there are several options for horizontally scaling Postgres
> >>> in particular.  One of the current best from what I've worked with is
> >>> Citus.
> >>>
> >>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmc...@gmail.com>
> >>> wrote:
> >>> > Hi Cody
> >>> > Spark direct stream is just fine for this use case.
> >>> > But why postgres and not cassandra?
> >>> > Is there anything specific here that I may not be aware of?
> >>> >
> >>> > Thanks
> >>> > Deepak
> >>> >
> >>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org>
> >>> > wrote:
> >>> >>
> >>> >> How are you going to handle etl failures?  Do you care about lost /
> >>> >> duplicated data?  Are your writes idempotent?
> >>> >>
> >>> >> Absent any other information about the problem, I'd stay away from
> >>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> >>> >> feeding postgres.
> >>> >>
> >>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com>
> >>> >> wrote:
> >>> >> > Is there an advantage to that vs directly consuming from Kafka?
> >>> >> > Nothing is being done to the data except some light ETL and then
> >>> >> > storing it in Cassandra
> >>> >> >
> >>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
> >>> >> > <deepakmc...@gmail.com>
> >>> >> > wrote:
> >>> >> >>
> >>> >> >> It's better to use Spark's direct stream to ingest from Kafka.
> >>> >> >>
> >>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com>
> >>> >> >> wrote:
> >>> >> >>>
> >>> >> >>> I don't think I need a different speed storage and batch storage.
> >>> >> >>> Just taking in raw data from Kafka, standardizing, and storing it
> >>> >> >>> somewhere where the web UI can query it, seems like it will be
> >>> >> >>> enough.
> >>> >> >>>
> >>> >> >>> I'm thinking about:
> >>> >> >>>
> >>> >> >>> - Reading data from Kafka via Spark Streaming
> >>> >> >>> - Standardizing, then storing it in Cassandra
> >>> >> >>> - Querying Cassandra from the web ui
> >>> >> >>>
> >>> >> >>> That seems like it will work. My question now is whether to use
> >>> >> >>> Spark Streaming to read Kafka, or use Kafka consumers directly.
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >>> >> >>> <mich.talebza...@gmail.com> wrote:
> >>> >> >>>>
> >>> >> >>>> - Spark Streaming to read data from Kafka
> >>> >> >>>> - Storing the data on HDFS using Flume
> >>> >> >>>>
> >>> >> >>>> You don't need Spark Streaming to read data from Kafka and store
> >>> >> >>>> it on HDFS. It is a waste of resources.
> >>> >> >>>>
> >>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
> >>> >> >>>>
> >>> >> >>>> KafkaAgent.sources = kafka-sources
> >>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >>> >> >>>>
> >>> >> >>>> That will be for your batch layer. To analyse, you can read
> >>> >> >>>> directly from the HDFS files with Spark, or simply store the data
> >>> >> >>>> in a database of your choice via cron or something. Do not mix
> >>> >> >>>> your batch layer with your speed layer.
> >>> >> >>>>
> >>> >> >>>> Your speed layer will ingest the same data directly from Kafka
> >>> >> >>>> into Spark Streaming, and that will be online or near real time
> >>> >> >>>> (defined by your window).
> >>> >> >>>>
> >>> >> >>>> Then you have a serving layer to present data from both the speed
> >>> >> >>>> layer (the one from SS) and the batch layer.
> >>> >> >>>>
> >>> >> >>>> HTH
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> Dr Mich Talebzadeh
> >>> >> >>>>
> >>> >> >>>> LinkedIn
> >>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>> >> >>>>
> >>> >> >>>> http://talebzadehmich.wordpress.com
> >>> >> >>>>
> >>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for
> >>> >> >>>> any loss, damage or destruction of data or any other property which
> >>> >> >>>> may arise from relying on this email's technical content is
> >>> >> >>>> explicitly disclaimed. The author will in no case be liable for any
> >>> >> >>>> monetary damages arising from such loss, damage or destruction.
> >>> >> >>>>
> >>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com>
> >>> >> >>>> wrote:
> >>> >> >>>>>
> >>> >> >>>>> The web UI is actually the speed layer, it needs to be able to
> >>> >> >>>>> query the data online, and show the results in real-time.
> >>> >> >>>>>
> >>> >> >>>>> It also needs a custom front-end, so a system like Tableau can't
> >>> >> >>>>> be used, it must have a custom backend + front-end.
> >>> >> >>>>>
> >>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
> >>> >> >>>>> work:
> >>> >> >>>>>
> >>> >> >>>>> - Spark Streaming to read data from Kafka
> >>> >> >>>>> - Storing the data on HDFS using Flume
> >>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
> >>> >> >>>>>
> >>> >> >>>>>
> >>> >> >>>>>
> >>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
> >>> >> >>>>> <mich.talebza...@gmail.com> wrote:
> >>> >> >>>>>>
> >>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be
> >>> >> >>>>>> stored on HDFS using Flume.
> >>> >> >>>>>>
> >>> >> >>>>>> -  Query this data to generate reports / analytics (There will be
> >>> >> >>>>>> a web UI which will be the front-end to the data, and will show
> >>> >> >>>>>> the reports)
> >>> >> >>>>>>
> >>> >> >>>>>> This is basically the batch layer, and you need something like
> >>> >> >>>>>> Tableau or Zeppelin to query the data.
> >>> >> >>>>>>
> >>> >> >>>>>> You will also need Spark Streaming to query data online for the
> >>> >> >>>>>> speed layer. That data could be stored in some transient fabric
> >>> >> >>>>>> like Ignite or even Druid.
> >>> >> >>>>>>
> >>> >> >>>>>> HTH
> >>> >> >>>>>>
> >>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar
> >>> >> >>>>>> <ali.rac...@gmail.com>
> >>> >> >>>>>> wrote:
> >>> >> >>>>>>>
> >>> >> >>>>>>> It needs to be able to scale to a very large amount of data, yes.
> >>> >> >>>>>>>
> >>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
> >>> >> >>>>>>> <deepakmc...@gmail.com> wrote:
> >>> >> >>>>>>>>
> >>> >> >>>>>>>> What is the message inflow?
> >>> >> >>>>>>>> If it's really high, Spark will definitely be of great use.
> >>> >> >>>>>>>>
> >>> >> >>>>>>>> Thanks
> >>> >> >>>>>>>> Deepak
> >>> >> >>>>>>>>
> >>> >> >>>>>>>>
> >>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com>
> >>> >> >>>>>>>> wrote:
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
> >>> >> >>>>>>>>> their raw data into Kafka.
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> I need to:
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra /
> >>> >> >>>>>>>>> Raw HDFS / ElasticSearch / Postgres)
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - Query this data to generate reports / analytics (There will
> >>> >> >>>>>>>>> be a web UI which will be the front-end to the data, and will
> >>> >> >>>>>>>>> show the reports)
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> Java is being used as the backend language for everything
> >>> >> >>>>>>>>> (backend of the web UI, as well as the ETL layer)
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> I'm considering:
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
> >>> >> >>>>>>>>> layer (receive raw data from Kafka, standardize & store it)
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
> >>> >> >>>>>>>>> standardized data, and to allow queries
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
> >>> >> >>>>>>>>> run queries across the data (mostly filters), or directly run
> >>> >> >>>>>>>>> queries against Cassandra / HBase
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
> >>> >> >>>>>>>>> alternatives I should go with (e.g, using raw Kafka consumers
> >>> >> >>>>>>>>> vs Spark for ETL, which persistent data store to use, and how
> >>> >> >>>>>>>>> to query that data store in the backend of the web UI, for
> >>> >> >>>>>>>>> displaying the reports).
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>>
> >>> >> >>>>>>>>> Thanks.
> >>> >> >>>>>>>
> >>> >> >>>>>>>
> >>> >> >>>>>>
> >>> >> >>>>>
> >>> >> >>>>
> >>> >> >>>
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> --
> >>> >> >> Thanks
> >>> >> >> Deepak
> >>> >> >> www.bigdatabig.com
> >>> >> >> www.keosha.net
> >>> >> >
> >>> >> >
> >>> >>
> >>> >>
> >>> >
> >>
> >>
> >
>
