Re: Spark and Flume integration - do I understand this correctly?
Great, thanks guys, that helped a lot and I've got a sample working. As a follow-up: when do workers and masters become a necessity?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Flume-integration-do-I-understand-this-correctly-tp10879p10908.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark and Flume integration - do I understand this correctly?
Hi,

Deploying Spark with Flume is pretty simple. What you'd need to do is:

1. Start your Spark Flume DStream receiver on some machine using one of the FlumeUtils.createStream methods, where you specify the hostname and port of the worker node on which you want the Spark executor to run - say a.b.c.d:4585. This is where Spark will receive the data from Flume.

2. Once your application has started, start the Flume agent(s) that will be sending the data, with Avro sinks whose hostname is set to a.b.c.d and port set to 4585.

And you are done!

Tathagata Das wrote:
> Hari, can you help?
> TD
> On Tue, Jul 29, 2014 at 12:13 PM, dapooley wrote:
> > [...]

--
Thanks,
Hari
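The receiver described in step 1 can be sketched as follows. This is an illustrative outline only, assuming the spark-streaming-flume artifact is on the classpath; it mirrors Spark's bundled FlumeEventCount example, with a.b.c.d:4585 standing in for the placeholder host/port above:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeEventCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumeEventCount")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches

    // The receiver binds on the worker at a.b.c.d:4585; the Flume Avro
    // sink must send to this same host and port.
    val stream = FlumeUtils.createStream(
      ssc, "a.b.c.d", 4585, StorageLevel.MEMORY_ONLY_SER_2)

    // Count the Flume events received in each batch.
    stream.count().map(c => s"Received $c flume events").print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Submit this application to the cluster so that an executor runs on a.b.c.d, then start the Flume agent with the matching Avro sink.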
Re: Spark and Flume integration - do I understand this correctly?
Hari, can you help?

TD

On Tue, Jul 29, 2014 at 12:13 PM, dapooley wrote:
> Hi,
>
> I am trying to integrate Spark onto a Flume log sink and avro source.
> [...]
Spark and Flume integration - do I understand this correctly?
Hi,

I am trying to integrate Spark onto a Flume log sink and Avro source. The sink is on one machine (the application), and the source is on another. Log events are being sent from the application server to the Avro source server (a file-roll sink on the Avro source machine writes them out to verify).

The aim is to get Spark to also receive the same events that the Avro source is getting. The steps, I believe, are:

1. Install/start the Spark master (on the Avro source machine).
2. Write the Spark application and deploy it (on the Avro source machine).
3. Add the Spark application as a worker to the master.
4. Configure the Spark application to use the same port as the Avro source.

The test setup uses two Ubuntu VMs on a Windows host.

Flume configuration:

## application ##
## Tail the application log file
# /var/lib/apache-flume-1.5.0-bin/bin/flume-ng agent -n cps -c conf -f conf/flume-conf.properties
# http://flume.apache.org/FlumeUserGuide.html#exec-source
source_agent.sources = tomcat
source_agent.sources.tomcat.type = exec
source_agent.sources.tomcat.command = tail -F /var/lib/tomcat/logs/application.log
source_agent.sources.tomcat.batchSize = 1
source_agent.sources.tomcat.channels = memoryChannel

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
source_agent.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 100

## Send to the Flume collector on the analytics node
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.channel = memoryChannel
source_agent.sinks.avro_sink.hostname = 10.0.2.2
source_agent.sinks.avro_sink.port = 41414

## avro source ##
## Receive Flume events for Spark Streaming
# http://flume.apache.org/FlumeUserGuide.html#memory-channel
agent1.channels = memoryChannel
agent1.channels.memoryChannel.type = memory
agent1.channels.memoryChannel.capacity = 100

## Flume collector on the analytics node
# http://flume.apache.org/FlumeUserGuide.html#avro-source
agent1.sources = avroSource
agent1.sources.avroSource.type = avro
agent1.sources.avroSource.channels = memoryChannel
agent1.sources.avroSource.bind = 0.0.0.0
agent1.sources.avroSource.port = 41414

## Sinks
# http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
agent1.sinks = localout
agent1.sinks.localout.type = file_roll
agent1.sinks.localout.sink.directory = /home/vagrant/flume/logs
agent1.sinks.localout.sink.rollInterval = 0
agent1.sinks.localout.channel = memoryChannel

Thank you in advance for any assistance,

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Flume-integration-do-I-understand-this-correctly-tp10879.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
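The crux of the wiring above is that the application agent's Avro sink must target the host and port the collector agent's Avro source listens on. A standalone sketch that checks this from the property values shown (plain Scala, not part of any Flume API; the property strings are copied from the configuration above):

```scala
object FlumeWiringCheck {
  // Parse "key = value" lines, skipping blanks and comments.
  def props(conf: String): Map[String, String] =
    conf.split("\n")
      .map(_.trim)
      .filter(l => l.nonEmpty && !l.startsWith("#"))
      .map { l =>
        val Array(k, v) = l.split("=", 2)
        k.trim -> v.trim
      }
      .toMap

  def main(args: Array[String]): Unit = {
    val sink = props(
      "source_agent.sinks.avro_sink.hostname = 10.0.2.2\n" +
      "source_agent.sinks.avro_sink.port = 41414")
    val source = props(
      "agent1.sources.avroSource.bind = 0.0.0.0\n" +
      "agent1.sources.avroSource.port = 41414")

    // Ports must match exactly; a bind of 0.0.0.0 means the source
    // listens on all interfaces, so any reachable sink hostname works.
    val portsMatch = sink("source_agent.sinks.avro_sink.port") ==
                     source("agent1.sources.avroSource.port")
    println(s"sink -> source port match: $portsMatch")
  }
}
```

The same pairing applies when adding Spark: whichever host/port the Spark Flume receiver binds (step 1 of the reply above) is what a second Avro sink, or the existing one, must point at.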