Re: spark streaming from kafka real time + batch processing in java
Mohit,

> I want to process the data in real-time as well as store the data in hdfs in year/month/day/hour/ format.

Are you wanting to process it and then put it into HDFS, or just put the raw data into HDFS? If the latter, why not just use Camus (https://github.com/linkedin/camus)? It will easily put the data into the directory structure you are after.

On Fri, Feb 6, 2015 at 12:19 AM, Mohit Durgapal wrote:
> I want to write a spark streaming consumer for kafka in java. I want to
> process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format. I am not sure how to achieve this. Should I
> write separate kafka consumers, one for writing data to HDFS and one for
> spark streaming?
>
> Also I would like to ask what do people generally do with the result of
> spark streams after aggregating over it? Is it okay to update a NoSQL DB
> with aggregated counts per batch interval or is it generally stored in hdfs?
>
> Is it possible to store the mini batch data from spark streaming to HDFS
> in a way that the data is aggregated hourly and put into HDFS in its
> "hour" folder. I would not want a lot of small files equal to the mini
> batches of spark per hour, that would be inefficient for running hadoop
> jobs later.
>
> Is anyone working on the same problem?
>
> Any help and comments would be great.
>
> Regards
> Mohit
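[Editor's note: whichever tool writes the raw data (Camus or a Spark job), the year/month/day/hour/ directory for a record is just a formatting of its timestamp. A minimal plain-Java sketch of that mapping, assuming zero-padded UTC buckets and epoch-millisecond timestamps; the class and method names are hypothetical:]

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class HourlyPath {
    // Formatter producing the year/month/day/hour layout from the thread,
    // e.g. 2015/02/06/00 (assumption: zero-padded fields, UTC clock).
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    // Build the HDFS directory for a record's epoch-millisecond timestamp.
    // pathFor("/data/events", ts) -> "/data/events/2015/02/06/00"
    // for a ts of 2015-02-06T00:19:00Z.
    static String pathFor(String baseDir, long epochMillis) {
        return baseDir + "/" + FMT.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

[Camus computes an equivalent bucket per record on the ingest side; a Spark job would use the same mapping to choose an output directory per hour.]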
Re: spark streaming from kafka real time + batch processing in java
Good questions, some of which I'd like to know the answer to.

>> Is it okay to update a NoSQL DB with aggregated counts per batch interval or is it generally stored in hdfs?

This depends on how you are going to use the aggregate data.

1. Is there a lot of data? If so, and you are going to use the data as inputs to another job, it might benefit from being distributed across the cluster on HDFS (for data locality).

2. Usually when speaking about aggregates there is substantially less data, in which case storing that data in another datastore is okay. If you're talking about a few thousand rows, and having them in something like Mongo or Postgres makes your life easier (reporting software, for example), it's okay to just store the results in another data store, even if you use them as inputs to another job. If the data will grow unbounded over time this might not be a good solution (in which case refer to #1).

On Fri Feb 06 2015 at 6:16:39 AM Mohit Durgapal wrote:
> I want to write a spark streaming consumer for kafka in java. I want to
> process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format. I am not sure how to achieve this. Should I
> write separate kafka consumers, one for writing data to HDFS and one for
> spark streaming?
>
> Also I would like to ask what do people generally do with the result of
> spark streams after aggregating over it? Is it okay to update a NoSQL DB
> with aggregated counts per batch interval or is it generally stored in hdfs?
>
> Is it possible to store the mini batch data from spark streaming to HDFS
> in a way that the data is aggregated hourly and put into HDFS in its
> "hour" folder. I would not want a lot of small files equal to the mini
> batches of spark per hour, that would be inefficient for running hadoop
> jobs later.
>
> Is anyone working on the same problem?
>
> Any help and comments would be great.
>
> Regards
> Mohit
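[Editor's note: the "update a datastore with aggregated counts per batch interval" pattern boils down to each mini-batch contributing a small per-key delta that is merged into running totals. A stripped-down sketch of that merge in plain Java, assuming the external store behaves like an increment/upsert; in a real job the `totals` map would be the NoSQL store and `mergeBatch` would run once per Spark batch interval. All names here are hypothetical:]

```java
import java.util.HashMap;
import java.util.Map;

public class BatchAggregator {
    // Running totals across batch intervals. In practice this role is
    // played by the external store (e.g. an atomic increment in Mongo).
    private final Map<String, Long> totals = new HashMap<>();

    // Merge one mini-batch's per-key counts into the running totals.
    // Each interval produces a small delta, not the raw event data,
    // which is why a row-oriented store copes with the write rate.
    public void mergeBatch(Map<String, Long> batchCounts) {
        batchCounts.forEach((key, count) -> totals.merge(key, count, Long::sum));
    }

    public long totalFor(String key) {
        return totals.getOrDefault(key, 0L);
    }
}
```

[The design point above still applies: this stays cheap only while the key space is bounded; unbounded growth pushes you back toward HDFS.]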
spark streaming from kafka real time + batch processing in java
I want to write a spark streaming consumer for kafka in java. I want to process the data in real-time as well as store the data in hdfs in year/month/day/hour/ format. I am not sure how to achieve this. Should I write separate kafka consumers, one for writing data to HDFS and one for spark streaming?

Also I would like to ask what people generally do with the result of spark streams after aggregating over it. Is it okay to update a NoSQL DB with aggregated counts per batch interval, or is it generally stored in hdfs?

Is it possible to store the mini batch data from spark streaming to HDFS in a way that the data is aggregated hourly and put into HDFS in its "hour" folder? I would not want a lot of small files equal to the mini batches of spark per hour; that would be inefficient for running hadoop jobs later.

Is anyone working on the same problem?

Any help and comments would be great.

Regards
Mohit
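[Editor's note: the small-files concern in the last question is usually handled by consolidating mini-batch output per hour rather than writing one file per batch, either by buffering/compacting before the write or by a downstream compaction job. A minimal plain-Java sketch of the bucketing logic, with hypothetical names; a real Spark job would instead coalesce partitions or compact the hour's files after the hour closes:]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HourlyBuffer {
    private static final long HOUR_MILLIS = 3_600_000L;

    // Records grouped by hour bucket (epoch millis / one hour).
    private final Map<Long, List<String>> byHour = new HashMap<>();

    // Called once per mini-batch: bucket records by hour instead of
    // emitting each mini-batch as its own small HDFS file.
    public void addBatch(long epochMillis, List<String> records) {
        byHour.computeIfAbsent(epochMillis / HOUR_MILLIS, h -> new ArrayList<>())
              .addAll(records);
    }

    // When an hour closes, drain its bucket as one consolidated unit
    // (here just returned; a real job would write a single large file
    // into that hour's HDFS directory).
    public List<String> closeHour(long epochMillis) {
        List<String> drained = byHour.remove(epochMillis / HOUR_MILLIS);
        return drained == null ? List.of() : drained;
    }
}
```

[Buffering inside the streaming job trades durability for fewer files; a compaction job over the hour's directory avoids that trade-off at the cost of a second pass.]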