In general (and I am just prototyping), I have a better idea :)
- Consume Kafka in Spark from topic-A
- Transform the data in Spark (normalize, enrich, etc.)
- Feed it back to Kafka (into a different topic-B)
- Have Flume->HDFS (for M/R, Impala, Spark batch), Spark Streaming, or any
  other compute framework subscribe to topic-B (rough sketch below)
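
Something like this for the consume/transform/produce part (a rough sketch
only: the hosts, topic names and transform() below are placeholders, and I'm
assuming the receiver-based Kafka input from spark-streaming-kafka plus the
new Kafka producer client):

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaRelay {
  // Placeholder for the real normalize/enrich logic.
  def transform(record: String): String = record

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-relay"), Seconds(10))

    // Receiver-based stream from topic-A (one receiver thread).
    val raw = KafkaUtils.createStream(
      ssc, "zkhost:2181", "relay-group", Map("topic-A" -> 1)).map(_._2)

    // Write each transformed partition back to topic-B. The producer is
    // created inside foreachPartition so it is instantiated on the
    // workers (it is not serializable).
    raw.map(transform).foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("bootstrap.servers", "kafkahost:9092")
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach { r =>
          producer.send(new ProducerRecord[String, String]("topic-B", r))
        }
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}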

On Mon, Aug 11, 2014 at 5:57 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Hi,
>
> On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers
> <gwenhael.pasqui...@ericsson.com> wrote:
>>
>> We intend to apply other operations on the data later in the same spark
>> context, but our first step is to archive it.
>>
>>
>>
>> Our goal is something like this:
>>
>> Step 1: consume Kafka
>> Step 2: archive to HDFS AND send to step 3
>> Step 3: transform data
>> Step 4: save transformed data to HDFS as input for M/R
>
>
> I see. Well, I think Spark Streaming may be well suited for that purpose.
>
>>
>> To us it looks like a major flaw if, in streaming mode, Spark Streaming
>> cannot slow down its consumption depending on the available resources.
>
>
> On Mon, Aug 11, 2014 at 10:10 PM, Gwenhael Pasquiers
> <gwenhael.pasqui...@ericsson.com> wrote:
>>
>> I think the kind of self-regulating system you describe would be too
>> difficult to implement and probably unreliable (even more so given that
>> we have multiple slaves).
>
>
> Isn't "slow down its consumption depending on the available resources" a
> "self-regulating system"? I don't see how you can adapt to available
> resources without measuring your execution time and then change how much you
> consume. Did you have any particular form of adaption in mind?
>
> Tobias
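
P.S. For the step 1-4 flow quoted above, a minimal sketch of archiving the
raw stream to HDFS and transforming it inside the same streaming context
(again just a sketch; the paths, topic name and transform() are
placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(
  new SparkConf().setAppName("archive-and-transform"), Seconds(10))

// Step 1: consume Kafka.
val raw = KafkaUtils.createStream(
  ssc, "zkhost:2181", "archive-group", Map("topic-A" -> 1)).map(_._2)

// Step 2: archive the raw stream to HDFS (one directory per batch).
raw.saveAsTextFiles("hdfs:///archive/raw")

// Steps 3 and 4: transform the same stream and save it as M/R input.
def transform(record: String): String = record  // placeholder
raw.map(transform).saveAsTextFiles("hdfs:///data/transformed")

ssc.start()
ssc.awaitTermination()

On the consumption-rate question: as far as I know, newer Spark releases
expose spark.streaming.receiver.maxRate to cap the records per second each
receiver ingests, which is a crude but simple form of the regulation
discussed above.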
