Yeah, good point. The ExecSink does no batching and as such will be quite slow when interacting with any channel that guarantees no data loss on commit.
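To put rough numbers on that, here is a back-of-envelope sketch. It takes the figures from the quoted message below (4.16 ms average latency on a 7200rpm disk, and two seeks per commit, one for the data dir and one for the checkpoint) and assumes each commit pays that cost exactly once regardless of how many events it carries. The function name and the model are illustrative only, not anything in Flume, and the result is an upper bound that ignores transfer time and other I/O on the disk.

```python
# Figures quoted in the thread; the cost model itself is an assumption.
AVG_LATENCY_S = 0.00416   # average latency per seek on a 7200rpm disk
SEEKS_PER_COMMIT = 2      # data dir + checkpoint, as assumed in the thread


def max_events_per_second(batch_size,
                          seeks_per_commit=SEEKS_PER_COMMIT,
                          latency_s=AVG_LATENCY_S):
    """Upper bound on throughput if each commit pays the seek cost once."""
    commits_per_second = 1.0 / (seeks_per_commit * latency_s)
    return batch_size * commits_per_second


# Compare one event per commit with hypothetical batch sizes of 10 and 100.
for batch_size in (1, 10, 100):
    bound = max_events_per_second(batch_size)
    print(f"batch size {batch_size:>3}: at most {bound:,.0f} events/s")
```

With one event per commit the bound is about 120 events per second, in line with the "barely more than 100" estimate, and under this model it scales linearly with batch size.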
On Tue, Jul 10, 2012 at 8:54 AM, Juhani Connolly <[email protected]> wrote:

> A further observation:
>
> When running our collector node with avro source and hdfssink, I observed
> it keeping up with about 1400+ events per second. Upon looking at the exec
> sink I noticed it sends every item as a separate event to the processor. So
> I think I may have misunderstood the frequency with which fsync is
> happening, and that the main issue is any sink/source that works together
> with the channel in tiny amounts(resulting in frequent disk flushes and
> strangling throughput).
>
> While improvements to the channel would be very welcome, it may be more
> productive to document this behavior and introduce batching modes to
> those sources/sinks that do not currently feature one.
>
>
> On 07/10/2012 11:14 AM, Juhani Connolly wrote:
>
>> On 07/10/2012 02:36 AM, Brock Noland wrote:
>>
>>> If you ran the workload with file channel and then took 10 thread
>>> dumps I think we'd have enough to understand what is going on.
>>>
>>> Brock
>>>
>> I've taken some dumps and you can find them here:
>> http://people.apache.org/~juhanic/ca-flume-fc-dumps.tar.gz
>>
>> I also included a png from visualvm's thread visualization where you can
>> confirm that the source is constantly busy(trying to get stuff into the
>> file channel), while the 5 sinks are pretty idle. Let me know if there's
>> anything else I can provide
>>
>>> On Mon, Jul 9, 2012 at 11:49 AM, Juhani Connolly
>>> <juhani_connolly@cyberagent.co.jp> wrote:
>>>
>>>> It is currently pushing only 10 events per second or so(roughly 250 bytes
>>>> per event). This is with datadir/checkpoint on the same directory. Of
>>>> course the fact that there is a tail process running and that tomcat is
>>>> also writing out logs is without a doubt compounding the problem somewhat.
>>>>
>>>> I haven't taken a serious look at thread dumps of the file channel since I
>>>> don't have a thorough understanding of it. However analysis has involved
>>>> trying varying numbers of sinks(no throughput difference) and replacing
>>>> with memory channel(which easily handles the 650 ish requests per second
>>>> we have per server for the particular api, no problems even with a single
>>>> sink).
>>>>
>>>> Since you say there's heavy fsyncing, and with 7200rpm disks, each seek
>>>> will have an average latency of 4.16ms, so for alternating seeks between
>>>> the checkpoint and the data dir, if each of those writes happens in order,
>>>> you're already limited to best case of barely more than 100 events per
>>>> second. Our experience so far has shown it to be significantly less.
>>>>
>>>> I do believe that batching a bunch of puts or takes with a single commit
>>>> together as two seeks followed by writes(or one if we can only use a
>>>> single storage file) could give significant returns when paired with a
>>>> batching sink/source(which many already do... Requesting multiple items at
>>>> a time).
>>>>
>>>> If there is any specific data you would like I would be happy to try and
>>>> provide it.
>>>>
>>>>
>>>> On 07/09/2012 05:22 PM, Brock Noland wrote:
>>>>
>>>> On Mon, Jul 9, 2012 at 8:51 AM, Juhani Connolly
>>>> <juhani_connolly@cyberagent.co.jp> wrote:
>>>>
>>>>> - Intended setup with flume was a file channel connected to an avro sink.
>>>>> With only a single disk available, it is extremely slow. JDBC channel is
>>>>> also extremely slow, and MemoryChannel will fill up and start refusing
>>>>> puts as soon as a network issue comes up.
>>>>>
>>>>
>>>> Have you taken a few thread dumps or done other analysis? When you say
>>>> "extremely slow" what do you mean? Configured for no dataloss FileChannel
>>>> is going to be doing a lot of fsync'ing so I am not surprised it's slow.
>>>> That is a property of disks not FileChannel. I think we should use group
>>>> commit but that shouldn't make it 10x faster.
>>>>
>>>> Brock
>>>>

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
