Ok, thanks. Quick Q: Won't each sink consume the same data? Do I need to set up the load balancing sink processor to keep that from happening?
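
Just so we're talking about the same thing, here is a rough sketch of the setup I think you're describing: two HDFS sinks attached to one memory channel, each with its own file prefix, and no sink group or sink processor defined. The agent, channel, and sink names, the HDFS path, and the capacity and batch numbers are placeholders made up for illustration, not values from my real flume.conf, and the source section is left out.

```properties
# Hypothetical sketch: "a1", "mem", "hdfs1", "hdfs2", the HDFS path, and all
# numeric values are illustrative placeholders, not taken from my flume.conf.
a1.channels = mem
a1.sinks = hdfs1 hdfs2

# One shared memory channel; transactionCapacity is kept >= the sink batch size.
a1.channels.mem.type = memory
a1.channels.mem.capacity = 1000000
a1.channels.mem.transactionCapacity = 10000

# First HDFS sink, reading from the shared channel.
a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.channel = mem
a1.sinks.hdfs1.hdfs.path = hdfs://namenode.example.org/flume/webrequest
a1.sinks.hdfs1.hdfs.filePrefix = webrequest-sink1
a1.sinks.hdfs1.hdfs.batchSize = 1000
a1.sinks.hdfs1.hdfs.rollInterval = 60

# Second HDFS sink on the same channel, with a different file prefix
# so the two sinks do not collide on file names.
a1.sinks.hdfs2.type = hdfs
a1.sinks.hdfs2.channel = mem
a1.sinks.hdfs2.hdfs.path = hdfs://namenode.example.org/flume/webrequest
a1.sinks.hdfs2.hdfs.filePrefix = webrequest-sink2
a1.sinks.hdfs2.hdfs.batchSize = 1000
a1.sinks.hdfs2.hdfs.rollInterval = 60
```

Note there is no `a1.sinkgroups` section here, which is the part my question is about.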

On Jan 16, 2013, at 5:47 PM, Hari Shreedharan <[email protected]> wrote:

> Also can you try adding more HDFS sinks reading from the same channel. I'd
> recommend using different file prefixes, or paths for each sink, to avoid
> collision. Since each sink really has just one thread driving them, adding
> multiple sinks might help. Also, keep an eye on the memory channel's sizes
> and see if it is filling up (there will be ChannelExceptions in the logs if
> it is).
>
> Hari
>
> --
> Hari Shreedharan
>
> On Wednesday, January 16, 2013 at 2:34 PM, Brock Noland wrote:
>
>> Good to hear! Take five or six thread dumps of it and then send them our
>> way.
>>
>> On Wed, Jan 16, 2013 at 2:30 PM, Andrew Otto <[email protected]> wrote:
>>> Cool, thanks for the advice! That's a great blog post.
>>>
>>> I've changed my ways (for now at least). I've got lots of disks to use once
>>> memory starts working, and this node has tooooons of memory (192G).
>>>
>>> Here's my new flume.conf:
>>> https://gist.github.com/4551513
>>>
>>> This is doing better, for sure. Note that I took out the timestamp
>>> regex_extractor just in case that was impacting performance. I'm using the
>>> regular ol' timestamp interceptor now.
>>>
>>> I'm still not doing so great though. I'm getting about 300 Mb per minute in
>>> my HDFS files. I should be getting about 300G. That's better than before
>>> though. I've got 10% of the data this time, rather than 0.14% :)
>>>
>>> On Jan 16, 2013, at 4:36 PM, Brock Noland <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I would use memory channel for now as opposed to file channel. For
>>>> file channel to keep up with that you'd need multiple disks. Also your
>>>> checkpoint period is super-low which will cause lots of checkpoints
>>>> and slow things down.
>>>>
>>>> However, I think the biggest issue is probably batch size. With that
>>>> much data you are likely going to want a large batch size for all
>>>> components involved. Something like a low multiple of 1000. There is a
>>>> good article on this:
>>>> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
>>>>
>>>> To recap, I would:
>>>>
>>>> Use memory channel for now. Once you prove things work you can work on
>>>> tuning file channel (going to want larger batch sizes and multiple
>>>> disks).
>>>>
>>>> Increase the batch size for your source/sink.
>>>>
>>>> On Wed, Jan 16, 2013 at 1:22 PM, Andrew Otto <[email protected]> wrote:
>>>>> Ok, I'm trying my new UDPSource with Wikimedia's webrequest log stream.
>>>>> This is available to me via UDP Multicast. Everything seems to be working
>>>>> great, except that I seem to be missing a lot of data.
>>>>>
>>>>> Our webrequest log stream consists of about 100000 events per second,
>>>>> which amounts to around 50 Mb per second.
>>>>>
>>>>> I understand that this is probably too much for a single node to handle,
>>>>> but I should be able to either see most of the data written to HDFS, or
>>>>> at least see errors about channels being filled to capacity. HDFS files
>>>>> are set to roll every 60 seconds. Each of my files is only about 4.2MB,
>>>>> which is only 72 Kb per second. That's only 0.14% of the data I'm
>>>>> expecting to consume.
>>>>>
>>>>> Where did the rest of it go? If Flume is dropping it, why doesn't it tell
>>>>> me!?
>>>>>
>>>>> Here's my flume.conf:
>>>>>
>>>>> https://gist.github.com/4551001
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Jan 15, 2013, at 2:31 PM, Andrew Otto <[email protected]> wrote:
>>>>>
>>>>>> I just submitted the patch on
>>>>>> https://issues.apache.org/jira/browse/FLUME-1838.
>>>>>>
>>>>>> Would love some reviews, thanks!
>>>>>> -Andrew
>>>>>>
>>>>>> On Jan 14, 2013, at 1:01 PM, Andrew Otto <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks guys! I've opened up a JIRA here:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/FLUME-1838
>>>>>>>
>>>>>>> On Jan 14, 2013, at 12:43 PM, Alexander Alten-Lorenz
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey Andrew,
>>>>>>>>
>>>>>>>> for your reference, we have a lot of developer information in our
>>>>>>>> wiki:
>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/FLUME/Developer+Section
>>>>>>>> https://cwiki.apache.org/confluence/display/FLUME/Developers+Quick+Hack+Sheet
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> On Jan 14, 2013, at 6:37 PM, Hari Shreedharan
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> Really happy to hear Wikimedia Foundation is considering Flume. I am
>>>>>>>>> fairly sure that if you find such a source useful, there would
>>>>>>>>> definitely be others who find it useful too. I'd recommend filing a
>>>>>>>>> jira and starting a discussion, and then submitting the patch. We
>>>>>>>>> would be happy to review and commit it.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Hari
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Hari Shreedharan
>>>>>>>>>
>>>>>>>>> On Monday, January 14, 2013 at 9:29 AM, Andrew Otto wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I'm a Systems Engineer at the Wikimedia Foundation, and we're
>>>>>>>>>> investigating using Flume for our web request log HDFS imports.
>>>>>>>>>> We've previously been using Kafka, but have had to change short term
>>>>>>>>>> architecture plans in order to get data into HDFS reliably and
>>>>>>>>>> regularly soon.
>>>>>>>>>>
>>>>>>>>>> Our current web request logs are available for consumption over a
>>>>>>>>>> multicast UDP stream. I could hack something together to try and
>>>>>>>>>> pipe this into Flume using the existing sources (SyslogUDPSource, or
>>>>>>>>>> maybe some combination of socat + NetcatSource), but I'd rather
>>>>>>>>>> reduce the number of moving parts. I'd like to consume directly from
>>>>>>>>>> the multicast UDP stream as a Flume source.
>>>>>>>>>>
>>>>>>>>>> I coded up a proof of concept based on the SyslogUDPSource, mainly
>>>>>>>>>> just stripping out the syslog event header extraction, and adding in
>>>>>>>>>> multicast Datagram connection code. I plan on cleaning this up, and
>>>>>>>>>> making this a generic raw UDP source, with multicast being a
>>>>>>>>>> configuration option.
>>>>>>>>>>
>>>>>>>>>> My question to you guys is, is this something the Flume community
>>>>>>>>>> would find useful? If so, should I open up a JIRA to track this?
>>>>>>>>>> I've got a fork of the Flume git repo over on github and will be
>>>>>>>>>> doing my work there. I'd love to share it upstream if it would be
>>>>>>>>>> useful.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> -Andrew Otto
>>>>>>>>>> Systems Engineer
>>>>>>>>>> Wikimedia Foundation
>>>>>>>>
>>>>>>>> --
>>>>>>>> Alexander Alten-Lorenz
>>>>>>>> http://mapredit.blogspot.com
>>>>>>>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>>>
>>>> --
>>>> Apache MRUnit - Unit testing MapReduce -
>>>> http://incubator.apache.org/mrunit/
