Re: spark streaming from kafka real time + batch processing in java
Mohit,

> I want to process the data in real-time as well as store the data in hdfs in year/month/day/hour/ format.

Are you wanting to process it and then put it into HDFS, or just put the raw data into HDFS? If the latter, why not just use Camus (https://github.com/linkedin/camus)? It will easily put the data into the directory structure you are after.

On Fri, Feb 6, 2015 at 12:19 AM, Mohit Durgapal wrote:
> I want to write a spark streaming consumer for kafka in java. I want to
> process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format. I am not sure how to achieve this. Should I
> write separate kafka consumers, one for writing data to HDFS and one for
> spark streaming?
>
> Also I would like to ask what do people generally do with the result of
> spark streams after aggregating over it? Is it okay to update a NoSQL DB
> with aggregated counts per batch interval or is it generally stored in hdfs?
>
> Is it possible to store the mini batch data from spark streaming to HDFS
> in a way that the data is aggregated hourly and put into HDFS in its
> "hour" folder. I would not want a lot of small files equal to the mini
> batches of spark per hour, that would be inefficient for running hadoop
> jobs later.
>
> Is anyone working on the same problem?
>
> Any help and comments would be great.
>
> Regards
> Mohit
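[Editor's note: whichever tool writes the raw data (Camus or a Spark job), the year/month/day/hour/ directory for a record is just a formatting of its timestamp. A minimal plain-Java sketch of that mapping, assuming zero-padded UTC buckets and epoch-millisecond timestamps; the class and method names are hypothetical:]

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class HourlyPath {
    // Formatter producing the year/month/day/hour layout from the thread,
    // e.g. 2015/02/06/00 (assumption: zero-padded fields, UTC clock).
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    // Build the HDFS directory for a record's epoch-millisecond timestamp.
    // pathFor("/data/events", ts) -> "/data/events/2015/02/06/00"
    // for a ts of 2015-02-06T00:19:00Z.
    static String pathFor(String baseDir, long epochMillis) {
        return baseDir + "/" + FMT.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

[Camus computes an equivalent bucket per record on the ingest side; a Spark job would use the same mapping to choose an output directory per hour.]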
Re: spark streaming from kafka real time + batch processing in java
Good questions, some of which I'd like to know the answer to.

>> Is it okay to update a NoSQL DB with aggregated counts per batch interval or is it generally stored in hdfs?

This depends on how you are going to use the aggregate data.

1. Is there a lot of data? If so, and you are going to use the data as inputs to another job, it might benefit from being distributed across the cluster on HDFS (for data locality).

2. Usually when speaking about aggregates there is substantially less data, in which case storing that data in another datastore is okay. If you're talking about a few thousand rows, and having them in something like Mongo or Postgres makes your life easier (reporting software, for example), it's okay to just store the results in another data store, even if you use them as inputs to another job. If the data will grow unbounded over time this might not be a good solution (in which case refer to #1).

On Fri Feb 06 2015 at 6:16:39 AM Mohit Durgapal wrote:
> I want to write a spark streaming consumer for kafka in java. I want to
> process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format. I am not sure how to achieve this. Should I
> write separate kafka consumers, one for writing data to HDFS and one for
> spark streaming?
>
> Also I would like to ask what do people generally do with the result of
> spark streams after aggregating over it? Is it okay to update a NoSQL DB
> with aggregated counts per batch interval or is it generally stored in hdfs?
>
> Is it possible to store the mini batch data from spark streaming to HDFS
> in a way that the data is aggregated hourly and put into HDFS in its
> "hour" folder. I would not want a lot of small files equal to the mini
> batches of spark per hour, that would be inefficient for running hadoop
> jobs later.
>
> Is anyone working on the same problem?
>
> Any help and comments would be great.
>
> Regards
> Mohit
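[Editor's note: the "update a datastore with aggregated counts per batch interval" pattern boils down to each mini-batch contributing a small per-key delta that is merged into running totals. A stripped-down sketch of that merge in plain Java, assuming the external store behaves like an increment/upsert; in a real job the `totals` map would be the NoSQL store and `mergeBatch` would run once per Spark batch interval. All names here are hypothetical:]

```java
import java.util.HashMap;
import java.util.Map;

public class BatchAggregator {
    // Running totals across batch intervals. In practice this role is
    // played by the external store (e.g. an atomic increment in Mongo).
    private final Map<String, Long> totals = new HashMap<>();

    // Merge one mini-batch's per-key counts into the running totals.
    // Each interval produces a small delta, not the raw event data,
    // which is why a row-oriented store copes with the write rate.
    public void mergeBatch(Map<String, Long> batchCounts) {
        batchCounts.forEach((key, count) -> totals.merge(key, count, Long::sum));
    }

    public long totalFor(String key) {
        return totals.getOrDefault(key, 0L);
    }
}
```

[The design point above still applies: this stays cheap only while the key space is bounded; unbounded growth pushes you back toward HDFS.]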
spark streaming from kafka real time + batch processing in java
I want to write a spark streaming consumer for kafka in java. I want to process the data in real-time as well as store the data in hdfs in year/month/day/hour/ format. I am not sure how to achieve this. Should I write separate kafka consumers, one for writing data to HDFS and one for spark streaming?

Also I would like to ask what people generally do with the result of spark streams after aggregating over it. Is it okay to update a NoSQL DB with aggregated counts per batch interval, or is it generally stored in hdfs?

Is it possible to store the mini batch data from spark streaming to HDFS in a way that the data is aggregated hourly and put into HDFS in its "hour" folder? I would not want a lot of small files equal to the mini batches of spark per hour; that would be inefficient for running hadoop jobs later.

Is anyone working on the same problem?

Any help and comments would be great.

Regards
Mohit
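[Editor's note: the small-files concern in the last question is usually handled by consolidating mini-batch output per hour rather than writing one file per batch, either by buffering/compacting before the write or by a downstream compaction job. A minimal plain-Java sketch of the bucketing logic, with hypothetical names; a real Spark job would instead coalesce partitions or compact the hour's files after the hour closes:]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HourlyBuffer {
    private static final long HOUR_MILLIS = 3_600_000L;

    // Records grouped by hour bucket (epoch millis / one hour).
    private final Map<Long, List<String>> byHour = new HashMap<>();

    // Called once per mini-batch: bucket records by hour instead of
    // emitting each mini-batch as its own small HDFS file.
    public void addBatch(long epochMillis, List<String> records) {
        byHour.computeIfAbsent(epochMillis / HOUR_MILLIS, h -> new ArrayList<>())
              .addAll(records);
    }

    // When an hour closes, drain its bucket as one consolidated unit
    // (here just returned; a real job would write a single large file
    // into that hour's HDFS directory).
    public List<String> closeHour(long epochMillis) {
        List<String> drained = byHour.remove(epochMillis / HOUR_MILLIS);
        return drained == null ? List.of() : drained;
    }
}
```

[Buffering inside the streaming job trades durability for fewer files; a compaction job over the hour's directory avoids that trade-off at the cost of a second pass.]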