It's better you use Spark's direct stream to ingest from Kafka.
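For illustration, a rough sketch of what that direct stream could look like in Java, assuming the spark-streaming-kafka-0-10 integration; the broker list, consumer group, and topic name below are placeholders, not values from this thread:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class RawEventIngest {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("raw-event-ingest");
    // 10-second micro-batches; tune the window to your latency needs
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "kafka1:9092");      // placeholder brokers
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "etl-consumer");               // placeholder group id
    kafkaParams.put("auto.offset.reset", "latest");

    // Direct stream: one Kafka partition maps to one Spark partition, no receivers
    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Arrays.asList("raw-events"), kafkaParams));    // placeholder topic

    // Standardize each record here, then write it to the chosen store
    stream.foreachRDD(rdd -> rdd.foreach(record ->
        System.out.println(record.topic() + " -> " + record.value())));

    jssc.start();
    jssc.awaitTermination();
  }
}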
On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> I don't think I need a different speed storage and batch storage. Just
> taking in raw data from Kafka, standardizing, and storing it somewhere
> where the web UI can query it, seems like it will be enough.
>
> I'm thinking about:
>
> - Reading data from Kafka via Spark Streaming
> - Standardizing, then storing it in Cassandra
> - Querying Cassandra from the web UI
>
> That seems like it will work. My question now is whether to use Spark
> Streaming to read Kafka, or use Kafka consumers directly.
>
> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>>
>> You don't need Spark Streaming to read data from Kafka and store it on
>> HDFS. It is a waste of resources.
>>
>> Couple Flume to use Kafka as source and HDFS as sink directly:
>>
>> KafkaAgent.sources = kafka-sources
>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>
>> That will be for your batch layer. To analyse it you can read the HDFS
>> files directly with Spark, or simply store the data in a database of your
>> choice via cron or something. Do not mix your batch layer with your speed
>> layer.
>>
>> Your speed layer will ingest the same data directly from Kafka into Spark
>> Streaming, and that will be online or near real time (defined by your
>> window).
>>
>> Then you have a serving layer to present data from both the speed layer
>> (the one from Spark Streaming) and the batch layer.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>>> The web UI is actually the speed layer; it needs to be able to query the
>>> data online and show the results in real time.
>>>
>>> It also needs a custom front-end, so a system like Tableau can't be
>>> used; it must have a custom backend + front-end.
>>>
>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>
>>> - Spark Streaming to read data from Kafka
>>> - Storing the data on HDFS using Flume
>>> - Using Spark to query the data in the backend of the web UI?
>>>
>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>>> on HDFS using Flume.
>>>>
>>>> - Query this data to generate reports / analytics (There will be a web
>>>> UI which will be the front-end to the data, and will show the reports)
>>>>
>>>> This is basically the batch layer, and you need something like Tableau
>>>> or Zeppelin to query the data.
>>>>
>>>> You will also need Spark Streaming to query data online for the speed
>>>> layer. That data could be stored in some transient fabric like Ignite or
>>>> even Druid.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> http://talebzadehmich.wordpress.com
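The two-line KafkaAgent snippet quoted above is only a fragment. As a rough sketch of how such a Kafka-source / HDFS-sink agent might be wired up (Flume 1.7-style Kafka source properties; the broker list, topic, group id, and HDFS path are placeholders):

# Hypothetical Flume agent: Kafka source -> memory channel -> HDFS sink
KafkaAgent.sources  = kafka-sources
KafkaAgent.channels = mem-channel
KafkaAgent.sinks    = hdfs-sinks

KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-sources.kafka.bootstrap.servers = kafka1:9092
KafkaAgent.sources.kafka-sources.kafka.topics = raw-events
KafkaAgent.sources.kafka-sources.kafka.consumer.group.id = flume-batch-layer
KafkaAgent.sources.kafka-sources.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory
KafkaAgent.channels.mem-channel.capacity = 10000
KafkaAgent.channels.mem-channel.transactionCapacity = 1000

KafkaAgent.sinks.hdfs-sinks.type = hdfs
KafkaAgent.sinks.hdfs-sinks.hdfs.path = hdfs://namenode:8020/data/raw-events/%Y-%m-%d
KafkaAgent.sinks.hdfs-sinks.hdfs.fileType = DataStream
KafkaAgent.sinks.hdfs-sinks.hdfs.rollInterval = 300
KafkaAgent.sinks.hdfs-sinks.hdfs.useLocalTimeStamp = true
KafkaAgent.sinks.hdfs-sinks.channel = mem-channel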
>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>
>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>
>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>
>>>>>> What is the message inflow?
>>>>>> If it's really high, definitely Spark will be of great use.
>>>>>>
>>>>>> Thanks
>>>>>> Deepak
>>>>>>
>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>>>
>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>
>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>>> raw data into Kafka.
>>>>>>>
>>>>>>> I need to:
>>>>>>>
>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>
>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / raw
>>>>>>> HDFS / ElasticSearch / Postgres)
>>>>>>>
>>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>>> web UI which will be the front-end to the data, and will show the
>>>>>>> reports)
>>>>>>>
>>>>>>> Java is being used as the backend language for everything (backend
>>>>>>> of the web UI, as well as the ETL layer)
>>>>>>>
>>>>>>> I'm considering:
>>>>>>>
>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>>
>>>>>>> - Using Cassandra, HBase, or raw HDFS for storing the standardized
>>>>>>> data, and to allow queries
>>>>>>>
>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>> queries across the data (mostly filters), or directly run queries
>>>>>>> against Cassandra / HBase
>>>>>>>
>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>> alternatives I should go with (e.g., using raw Kafka consumers vs
>>>>>>> Spark for ETL, which persistent data store to use, and how to query
>>>>>>> that data store in the backend of the web UI, for displaying the
>>>>>>> reports).
>>>>>>>
>>>>>>> Thanks.

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
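On the serving side discussed in the thread: if the standardized data ends up in Cassandra, the web UI's Java backend can run its (mostly filter) queries against it directly with the DataStax Java driver. A minimal sketch, assuming driver 3.x and a hypothetical reports.events table; the contact point, keyspace, and column names are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ReportQuery {
  public static void main(String[] args) {
    // Placeholder contact point; keyspace and table are hypothetical
    try (Cluster cluster = Cluster.builder().addContactPoint("cassandra1").build();
         Session session = cluster.connect("reports")) {

      // Filter-style query served straight to the web UI backend
      ResultSet rs = session.execute(
          "SELECT event_time, source, payload FROM events WHERE source = ? LIMIT 100",
          "api-1");

      for (Row row : rs) {
        System.out.printf("%s %s %s%n",
            row.getTimestamp("event_time"),
            row.getString("source"),
            row.getString("payload"));
      }
    }
  }
}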