It's better you use Spark's direct stream to ingest from Kafka.
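For illustration, a rough sketch of what that direct stream could look like in Java, assuming the spark-streaming-kafka-0-10 integration; the broker list, consumer group, and topic name below are placeholders, not values from this thread:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class RawEventIngest {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("raw-event-ingest");
    // 10-second micro-batches; tune the window to your latency needs
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "kafka1:9092");      // placeholder brokers
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "etl-consumer");               // placeholder group id
    kafkaParams.put("auto.offset.reset", "latest");

    // Direct stream: one Kafka partition maps to one Spark partition, no receivers
    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Arrays.asList("raw-events"), kafkaParams));    // placeholder topic

    // Standardize each record here, then write it to the chosen store
    stream.foreachRDD(rdd -> rdd.foreach(record ->
        System.out.println(record.topic() + " -> " + record.value())));

    jssc.start();
    jssc.awaitTermination();
  }
}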
On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> I don't think I need a different speed storage and batch storage. Just
> taking in raw data from Kafka, standardizing, and storing it somewhere
> where the web UI can query it, seems like it will be enough.
>
> I'm thinking about:
>
> - Reading data from Kafka via Spark Streaming
> - Standardizing, then storing it in Cassandra
> - Querying Cassandra from the web UI
>
> That seems like it will work. My question now is whether to use Spark
> Streaming to read Kafka, or use Kafka consumers directly.
>
> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>>
>> You don't need Spark Streaming to read data from Kafka and store it on
>> HDFS. It is a waste of resources.
>>
>> Couple Flume to use Kafka as source and HDFS as sink directly:
>>
>> KafkaAgent.sources = kafka-sources
>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>
>> That will be for your batch layer. To analyse it you can read the HDFS
>> files directly with Spark, or simply store the data in a database of your
>> choice via cron or something. Do not mix your batch layer with your speed
>> layer.
>>
>> Your speed layer will ingest the same data directly from Kafka into Spark
>> Streaming, and that will be online or near real time (defined by your
>> window).
>>
>> Then you have a serving layer to present data from both the speed layer
>> (the one from Spark Streaming) and the batch layer.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com
>>
>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>>> The web UI is actually the speed layer; it needs to be able to query the
>>> data online and show the results in real time.
>>>
>>> It also needs a custom front-end, so a system like Tableau can't be
>>> used; it must have a custom backend + front-end.
>>>
>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>
>>> - Spark Streaming to read data from Kafka
>>> - Storing the data on HDFS using Flume
>>> - Using Spark to query the data in the backend of the web UI?
>>>
>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>>> on HDFS using Flume.
>>>>
>>>> - Query this data to generate reports / analytics (There will be a web
>>>> UI which will be the front-end to the data, and will show the reports)
>>>>
>>>> This is basically the batch layer, and you need something like Tableau
>>>> or Zeppelin to query the data.
>>>>
>>>> You will also need Spark Streaming to query data online for the speed
>>>> layer. That data could be stored in some transient fabric like Ignite or
>>>> even Druid.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> http://talebzadehmich.wordpress.com
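The two-line KafkaAgent snippet quoted above is only a fragment. As a rough sketch of how such a Kafka-source / HDFS-sink agent might be wired up (Flume 1.7-style Kafka source properties; the broker list, topic, group id, and HDFS path are placeholders):

# Hypothetical Flume agent: Kafka source -> memory channel -> HDFS sink
KafkaAgent.sources  = kafka-sources
KafkaAgent.channels = mem-channel
KafkaAgent.sinks    = hdfs-sinks

KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-sources.kafka.bootstrap.servers = kafka1:9092
KafkaAgent.sources.kafka-sources.kafka.topics = raw-events
KafkaAgent.sources.kafka-sources.kafka.consumer.group.id = flume-batch-layer
KafkaAgent.sources.kafka-sources.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory
KafkaAgent.channels.mem-channel.capacity = 10000
KafkaAgent.channels.mem-channel.transactionCapacity = 1000

KafkaAgent.sinks.hdfs-sinks.type = hdfs
KafkaAgent.sinks.hdfs-sinks.hdfs.path = hdfs://namenode:8020/data/raw-events/%Y-%m-%d
KafkaAgent.sinks.hdfs-sinks.hdfs.fileType = DataStream
KafkaAgent.sinks.hdfs-sinks.hdfs.rollInterval = 300
KafkaAgent.sinks.hdfs-sinks.hdfs.useLocalTimeStamp = true
KafkaAgent.sinks.hdfs-sinks.channel = mem-channel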
>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>
>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>
>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>
>>>>>> What is the message inflow?
>>>>>> If it's really high, definitely Spark will be of great use.
>>>>>>
>>>>>> Thanks
>>>>>> Deepak
>>>>>>
>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>>>
>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>
>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>>> raw data into Kafka.
>>>>>>>
>>>>>>> I need to:
>>>>>>>
>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>
>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / raw
>>>>>>> HDFS / ElasticSearch / Postgres)
>>>>>>>
>>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>>> web UI which will be the front-end to the data, and will show the
>>>>>>> reports)
>>>>>>>
>>>>>>> Java is being used as the backend language for everything (backend
>>>>>>> of the web UI, as well as the ETL layer)
>>>>>>>
>>>>>>> I'm considering:
>>>>>>>
>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>>
>>>>>>> - Using Cassandra, HBase, or raw HDFS for storing the standardized
>>>>>>> data, and to allow queries
>>>>>>>
>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>> queries across the data (mostly filters), or directly run queries
>>>>>>> against Cassandra / HBase
>>>>>>>
>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>> alternatives I should go with (e.g., using raw Kafka consumers vs
>>>>>>> Spark for ETL, which persistent data store to use, and how to query
>>>>>>> that data store in the backend of the web UI, for displaying the
>>>>>>> reports).
>>>>>>>
>>>>>>> Thanks.

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
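On the serving side discussed in the thread: if the standardized data ends up in Cassandra, the web UI's Java backend can run its (mostly filter) queries against it directly with the DataStax Java driver. A minimal sketch, assuming driver 3.x and a hypothetical reports.events table; the contact point, keyspace, and column names are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ReportQuery {
  public static void main(String[] args) {
    // Placeholder contact point; keyspace and table are hypothetical
    try (Cluster cluster = Cluster.builder().addContactPoint("cassandra1").build();
         Session session = cluster.connect("reports")) {

      // Filter-style query served straight to the web UI backend
      ResultSet rs = session.execute(
          "SELECT event_time, source, payload FROM events WHERE source = ? LIMIT 100",
          "api-1");

      for (Row row : rs) {
        System.out.printf("%s %s %s%n",
            row.getTimestamp("event_time"),
            row.getString("source"),
            row.getString("payload"));
      }
    }
  }
}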