Chen,

Have you taken a look at this presentation on Planning and Deploying Flume from ApacheCon?
http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf

It may have the answers you need.

Best,
Jeff

On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <[email protected]> wrote:

> Thanks Saurabh.
> If that is the case, I am actually thinking about using a Storm spout to
> talk to our socket server, so that the Storm cluster takes care of reading
> from the socket server. Then on each Storm node, start a Flume agent
> listening on an RPC port and writing to HDFS (with failover). Then in the
> Storm bolt, simply send the data over RPC so that Flume can pick it up.
> What do you think of this setup? It takes care of failover on both the
> source (by Storm) and the sink (by Flume), but it looks a little
> complicated to me.
> Chen
>
>
> On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <[email protected]> wrote:
>
>> Hi Chen,
>>
>> I don't think Flume has a way to configure multiple sources pointing to
>> the same data source. Of course you can do that, but you will end up
>> with duplicate data. Flume offers failover at the sink level.
>>
>> On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <[email protected]> wrote:
>>
>>> OK, so after more research :) it seems that what I need is failover
>>> for the agent source (not failover for the sink):
>>> if one agent dies, another agent of the same kind starts running.
>>> Does Flume support this scenario?
>>> Thanks,
>>> Chen
>>>
>>>
>>> On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <[email protected]> wrote:
>>>
>>>> After reading more docs, it seems that if I want to achieve my goal,
>>>> I have to do the following:
>>>> 1. Have one agent with the custom source running on one node. This
>>>> agent reads from those 5 socket servers and sinks to some kind of
>>>> sink (maybe another socket?).
>>>> 2. On one or more other machines, set up collectors that read from
>>>> the agent's sink in 1 and sink to HDFS.
>>>> 3. Have a master node managing the nodes in 1 and 2.
>>>>
>>>> But this seems to be overkill in my case: in 1, I can already sink to
>>>> HDFS. Since data arrives at the socket servers much faster than the
>>>> translation part can process it, I want to be able to add more nodes
>>>> later to do the translation job. So what is the correct setup?
>>>> Thanks,
>>>> Chen
>>>>
>>>>
>>>> On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <[email protected]> wrote:
>>>>
>>>>> Guys,
>>>>> In my environment, the client is 5 socket servers, so I wrote a
>>>>> custom source spawning 5 threads, each reading from one of them
>>>>> indefinitely, and the sink is HDFS (a Hive table). This works fine
>>>>> when run with the flume-ng agent.
>>>>>
>>>>> But how can I deploy this in distributed mode (as a cluster)? I am
>>>>> confused about the 3 tiers (agent, collector, storage) mentioned in
>>>>> the docs. Do they apply to my case? How can I separate my
>>>>> agent/collector/storage? Apparently I can only have one agent
>>>>> running: multiple agents would result in reading duplicates from the
>>>>> socket server. But I want another agent to take over if one agent
>>>>> dies, and I would also like horizontal scalability for writing to
>>>>> HDFS. How can I achieve all this?
>>>>>
>>>>> Thank you very much for your advice.
>>>>> Chen
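
A note on the tiered layout discussed in the thread: in Flume NG (1.x) there is no separate master process; the agent/collector/storage split from the older Flume OG docs is expressed simply as two tiers of agents wired together with Avro sinks and sources, all in plain properties files. Below is a minimal sketch under that assumption; the agent names, hostnames, ports, and the custom source class name are placeholders, not anything taken from the thread.

# Tier 1 ("agent"): runs the custom socket-reading source and forwards
# events to a collector over Avro. All names below are hypothetical.
agent1.sources = sock
agent1.channels = ch1
agent1.sinks = avro1

agent1.sources.sock.type = com.example.SocketPollingSource   # placeholder custom source class
agent1.sources.sock.channels = ch1

agent1.channels.ch1.type = memory        # a file channel would be the durable choice

agent1.sinks.avro1.type = avro
agent1.sinks.avro1.hostname = collector1.example.com
agent1.sinks.avro1.port = 4545
agent1.sinks.avro1.channel = ch1

# Tier 2 ("collector", one per machine added for horizontal scale):
# Avro source in, HDFS sink out.
collector1.sources = avroIn
collector1.channels = ch1
collector1.sinks = hdfsOut

collector1.sources.avroIn.type = avro
collector1.sources.avroIn.bind = 0.0.0.0
collector1.sources.avroIn.port = 4545
collector1.sources.avroIn.channels = ch1

collector1.channels.ch1.type = memory

collector1.sinks.hdfsOut.type = hdfs
collector1.sinks.hdfsOut.hdfs.path = hdfs://namenode.example.com/flume/events
collector1.sinks.hdfsOut.hdfs.fileType = DataStream
collector1.sinks.hdfsOut.channel = ch1

Adding more HDFS-writing capacity then means starting more collector agents and pointing additional Avro sinks (or a load-balancing sink group) at them.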
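
Saurabh's point that Flume offers failover at the sink level corresponds to a sink group with a failover sink processor. A minimal sketch, again with placeholder names; the tier-1 agent above could, for example, use two Avro sinks pointed at two different collectors:

agent1.sinks = avro1 avro2
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = avro1 avro2
agent1.sinkgroups.g1.processor.type = failover
# Higher priority wins; traffic moves to the lower-priority sink only when
# the preferred sink fails, and moves back once it recovers.
agent1.sinkgroups.g1.processor.priority.avro1 = 10
agent1.sinkgroups.g1.processor.priority.avro2 = 5
agent1.sinkgroups.g1.processor.maxpenalty = 10000

This covers sink-side failover only; as noted in the thread, Flume itself does not coordinate failover between duplicate sources reading the same upstream socket.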
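
For the Storm variant Chen describes (a bolt hands data to a local Flume agent over RPC), the Flume client SDK's RpcClient can be used from the bolt, provided the agent is configured with an Avro source on the matching port. A minimal sketch, with the host, port, and class name as placeholders:

import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

// Hypothetical helper a Storm bolt could hold to forward tuples to a
// co-located Flume agent's Avro source.
public class FlumeRpcForwarder {
    private final RpcClient client;

    public FlumeRpcForwarder(String agentHost, int agentPort) {
        // The default RpcClient implementation speaks Avro RPC.
        this.client = RpcClientFactory.getDefaultInstance(agentHost, agentPort);
    }

    public void send(String record) throws EventDeliveryException {
        Event event = EventBuilder.withBody(record, Charset.forName("UTF-8"));
        client.append(event);   // throws if the agent cannot be reached
    }

    public void close() {
        client.close();
    }
}

In the bolt, an EventDeliveryException would signal that the local agent is down, at which point the tuple can be failed so Storm replays it; that replay is what gives this setup its source-side failover.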
