Good questions, some of which I'd like to know the answers to.

>> Is it okay to update a NoSQL DB with aggregated counts per batch
>> interval or is it generally stored in hdfs?
This depends on how you are going to use the aggregate data.

1. Is there a lot of data? If so, and you are going to use the data as
inputs to another job, it might benefit from being distributed across the
cluster on HDFS (for data locality).

2. Usually when speaking about aggregates there is substantially less
data, in which case storing it in another datastore is fine. If you're
talking about a few thousand rows, and having them in something like
Mongo or Postgres makes your life easier (for reporting software, for
example), it's okay to just store the results in another data store, even
if you use them as inputs to another job. If the data will grow unbounded
over time this might not be a good solution (in which case refer to #1).

On Fri Feb 06 2015 at 6:16:39 AM Mohit Durgapal <durgapalmo...@gmail.com> wrote:

> I want to write a spark streaming consumer for kafka in java. I want to
> process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format. I am not sure how to achieve this. Should I
> write separate kafka consumers, one for writing data to HDFS and one for
> spark streaming?
>
> Also I would like to ask what do people generally do with the result of
> spark streams after aggregating over it? Is it okay to update a NoSQL DB
> with aggregated counts per batch interval or is it generally stored in hdfs?
>
> Is it possible to store the mini batch data from spark streaming to HDFS
> in a way that the data is aggregated hourly and put into HDFS in its
> "hour" folder? I would not want a lot of small files equal to the mini
> batches of spark per hour; that would be inefficient for running hadoop
> jobs later.
>
> Is anyone working on the same problem?
>
> Any help and comments would be great.
>
>
> Regards
>
> Mohit
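To make the "update a datastore with counts per batch interval" pattern concrete, here is a minimal sketch in plain Python. It is not Spark code: the `process_batch` function stands in for the work you would do inside Spark Streaming's per-batch hook, and a `Counter` stands in for Mongo or Postgres. The key point it illustrates is that each batch's counts should be *incremented into* the store (an upsert), not written over it, so totals accumulate across intervals.

```python
from collections import Counter

# Stand-in for an external store such as Mongo or Postgres
# (a real job would upsert through the datastore's client library).
aggregate_store = Counter()

def process_batch(events):
    """Aggregate one mini-batch of event types and upsert the
    per-key totals into the external store."""
    batch_counts = Counter(events)
    # Upsert: increment existing totals rather than overwrite them,
    # so counts accumulate across batch intervals.
    for key, count in batch_counts.items():
        aggregate_store[key] += count

# Two mini-batches, as a streaming job would deliver them.
process_batch(["click", "view", "click"])
process_batch(["view", "click"])
print(aggregate_store["click"])  # → 3
print(aggregate_store["view"])   # → 2
```

If the per-batch result set ever grows beyond "a few thousand rows," this is the point where you would switch to writing the results to HDFS instead, per #1 above.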
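On the year/month/day/hour layout quoted above: the hourly directory is usually derived from each event's own timestamp (not arrival time), so late data still lands in the right "hour" folder. A minimal sketch of that path logic, with a hypothetical base path and helper name:

```python
from datetime import datetime, timezone

def hourly_partition(base_path, epoch_seconds):
    """Build the year/month/day/hour HDFS directory for an event,
    based on the event's own timestamp (UTC)."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base_path}/{t.year:04d}/{t.month:02d}/{t.day:02d}/{t.hour:02d}"

# An event stamped 2015-02-06 06:16:39 UTC lands in the 06 hour folder.
print(hourly_partition("/data/events", 1423203399))
# → /data/events/2015/02/06/06
```

As for the small-files concern: one common approach is to let the mini-batches write into the hour directory as they arrive and then run a periodic compaction job that merges each closed hour into a few large files before downstream Hadoop jobs read it.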