Good questions, some of which I'd like to know the answer to.

>>  Is it okay to update a NoSQL DB with aggregated counts per batch
interval or is it generally stored in HDFS?

This depends on how you are going to use the aggregate data.

1. Is there a lot of data? If so, and you are going to use the data as
inputs to another job, it might benefit from being distributed across the
cluster on HDFS (for data locality).
2. Usually when speaking about aggregates there is substantially less
data, in which case storing that data in another datastore is fine. If
you're talking about a few thousand rows, and having them in something like
Mongo or Postgres makes your life easier (for reporting software, for
example) - even if you use them as inputs to another job - it's okay to
just store the results in another data store (see the sketch below). If
the data will grow unbounded over time this might not be a good solution
(in which case refer to #1).
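
To make #2 concrete, here is a minimal sketch of that pattern using the
Spark 1.x Java API: pushing per-batch aggregated counts into Postgres from
inside foreachRDD. Note the "counts" DStream, the JDBC connection string,
and the "hourly_counts" table are all hypothetical names I made up for the
example, not anything from your setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

// Assumes "counts" is a JavaPairDStream<String, Long> of aggregated
// counts, and "hourly_counts" (key text, cnt bigint) is a hypothetical
// Postgres table.
counts.foreachRDD(new Function<JavaPairRDD<String, Long>, Void>() {
  @Override
  public Void call(JavaPairRDD<String, Long> rdd) {
    rdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Long>>>() {
      @Override
      public void call(Iterator<Tuple2<String, Long>> records) throws Exception {
        // One connection and one batched statement per partition,
        // not per record.
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://dbhost/metrics", "user", "pass");
        PreparedStatement stmt = conn.prepareStatement(
            "INSERT INTO hourly_counts (key, cnt) VALUES (?, ?)");
        try {
          while (records.hasNext()) {
            Tuple2<String, Long> r = records.next();
            stmt.setString(1, r._1());
            stmt.setLong(2, r._2());
            stmt.addBatch();
          }
          stmt.executeBatch();
        } finally {
          stmt.close();
          conn.close();
        }
      }
    });
    return null;
  }
});

The same foreachRDD/foreachPartition shape applies if you swap in the
Mongo driver instead. The main thing is opening one connection per
partition rather than per record; the writes themselves are cheap when the
aggregate is small.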



On Fri Feb 06 2015 at 6:16:39 AM Mohit Durgapal <durgapalmo...@gmail.com>
wrote:

> I want to write a Spark Streaming consumer for Kafka in Java. I want to
> process the data in real time as well as store the data in HDFS in
> year/month/day/hour/ format. I am not sure how to achieve this. Should I
> write separate Kafka consumers, one for writing data to HDFS and one for
> Spark Streaming?
>
> Also, I would like to ask: what do people generally do with the results of
> Spark streams after aggregating over them? Is it okay to update a NoSQL DB
> with aggregated counts per batch interval, or is it generally stored in HDFS?
>
> Is it possible to store the mini-batch data from Spark Streaming to HDFS
> in such a way that the data is aggregated hourly and put into HDFS in its
> "hour" folder? I would not want a lot of small files equal to the mini
> batches of Spark per hour, as that would be inefficient for running Hadoop
> jobs later.
>
> Is anyone working on the same problem?
>
> Any help and comments would be great.
>
>
> Regards
>
> Mohit
>
