Hi!
I've been using Spark for the last few months and it is awesome. I'm pretty new to 
this topic, so don't be too harsh on me.
Recently I've been doing some simple tests with Spark Streaming for log 
processing and I'm considering different ETL input solutions such as Flume or 
PDI+Kafka.

My use case will be:
1.- Collect logs from different applications located on different physical 
servers.
2.- Transform and pre-process those logs.
3.- Process all the log data with Spark Streaming (a rough sketch of this step is below).
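
For reference, this is roughly what I have in mind for the processing step. It's just a sketch: the socket source, hostname/port and the assumption that the first token of each line is the log level are all placeholders for whatever the real input and log format end up being.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LogLevelCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogLevelCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder input: a socket stream of log lines; the real source would be
    // Flume, Kafka or files on HDFS (see the sketches further down).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Assumption: the first whitespace-separated token of each line is the log level.
    val levelCounts = lines
      .map(line => (line.split("\\s+").headOption.getOrElse("UNKNOWN"), 1))
      .reduceByKey(_ + _)

    // Print the per-batch counts; in reality this would write somewhere useful.
    levelCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}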

I've got a question regarding data locality, i.e. processing the data where it 
resides. Ideally I'd like Spark Streaming (on standalone, YARN or Mesos) to handle 
the decision of processing each piece of data wherever it is located.

I know I can set up whatever Flume workflow (agents --> collectors) I want and 
then sink the aggregated data into HDFS, where I guess the scheduler will pick the 
best worker to operate on each split of the data. Am I right? My rough sketch of that route is below.
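
This is how I picture the HDFS route; the path and batch interval are made up, and I'm assuming the Flume HDFS sink rolls complete files into that directory:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsLogStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HdfsLogStream")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Watch the directory where the Flume HDFS sink rolls finished files
    // (path is a placeholder). My understanding is that tasks reading these
    // files get scheduled close to the HDFS blocks, which is the locality
    // behavior I'm after.
    val logs = ssc.textFileStream("hdfs:///flume/aggregated-logs")

    logs.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}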
Will the Spark Streaming + Flume integration (without sinking into HDFS first) 
provide this kind of behavior?
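
If it helps, this is roughly how I understand the push-based Flume integration would look, assuming the spark-streaming-flume connector is on the classpath; the receiver host and port are placeholders for wherever the Flume Avro sink points:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeLogStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumeLogStream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Push-based integration: a Flume Avro sink sends events to this host:port,
    // where a Spark receiver (running on one executor) listens for them.
    // Hostname and port are placeholders.
    val flumeStream = FlumeUtils.createStream(ssc, "receiver-host", 41414)

    // Assumption: each Flume event body is a raw log line.
    val logLines = flumeStream.map(event => new String(event.event.getBody.array()))

    logLines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}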

Any tips to point me in the right direction?
