How much data are you ingesting per minute or per second? How many sources are we dealing with here? What kind of channel are you using currently, and what is the memory/storage footprint on the source as well as the sink? Is the traffic uniformly distributed? If not, what is the maximum peak throughput you expect from a given source?
On Thu, Mar 27, 2014 at 11:07 AM, Andrew Ehrlich <[email protected]> wrote:

> What about having more than one flume agent?
>
> You could have two agents that read the small messages and sink to HDFS,
> or two agents that read the messages, serialize them, and send them to a
> third agent which sinks them into HDFS.
>
>
> On Thu, Mar 27, 2014 at 9:43 AM, Chris Schneider <[email protected]> wrote:
>
>> I have a fair bit of data continually being created in the form of
>> smallish messages (a few hundred bytes), which needs to enter flume and
>> eventually sink into HDFS.
>>
>> I need to be sure that the data lands in persistent storage and won't be
>> lost, but otherwise throughput isn't important. It just needs to be fast
>> enough not to back up.
>>
>> I'm running into a bottleneck in the initial ingestion of data.
>>
>> I've tried the netcat source and the thrift source, but both have capped
>> out at a thousand or so records per second.
>>
>> Batching up the thrift API items into sets of 10 and using appendBatch is
>> a pretty large speedup, but still not enough.
>>
>> Here's a gist of my ruby test script, some example runs, and my config:
>>
>> https://gist.github.com/cschneid/9792305
>>
>> 1. Are there any obvious performance changes I can make to speed up
>>    ingestion?
>> 2. How fast can flume reasonably go? Should I switch my source to
>>    something else that's faster? What?
>> 3. Is there a better tool for this kind of task? (rapid, safe ingestion
>>    of small messages)
>>
>> Thanks!
>> Chris
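For reference, a minimal sketch of the appendBatch approach using the Flume Java client SDK (flume-ng-sdk), assuming the agent exposes a thrift source; the host, port, payloads, and batch size of 100 are placeholders, not values from the gist, and real code would add retry handling around the batch call.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class BatchedThriftSender {
    public static void main(String[] args) throws EventDeliveryException {
        // Placeholder host/port; point this at the agent's thrift source.
        RpcClient client = RpcClientFactory.getThriftInstance("localhost", 4444);
        try {
            List<Event> batch = new ArrayList<Event>();
            for (int i = 0; i < 100; i++) {
                // Smallish payloads, a few hundred bytes each in practice.
                batch.add(EventBuilder.withBody(
                        ("message " + i).getBytes(StandardCharsets.UTF_8)));
            }
            // One RPC round trip for the whole batch instead of 100 appends.
            client.appendBatch(batch);
        } finally {
            client.close();
        }
    }
}

Note that batching only amortizes the per-RPC overhead on the ingestion path; the durability guarantee still comes from the channel on the agent side (e.g. a file channel rather than a memory channel), which is why the channel and its storage footprint matter here.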
