Thanks Harsha. My spout is listening to a Kafka queue that carries the ES query from the user's input. Is it safe to spawn a thread in the spout and do the ES query directly there? What is the fundamental difference between doing the query in a thread of the spout vs. a thread of a bolt?
The reason for using Flume is that I have to split the data into different partitions (HDFS folders) depending on a value emitted by the bolt, meaning I would need to modify the HDFS bolt anyway. In the past, I shifted a large amount of data into a partitioned Hive table using this approach (Avro to Flume to HDFS), and it seemed to work well, so I am sticking with it rather than reinventing the wheel.
Thanks,
Chen

On Fri, Jul 11, 2014 at 4:51 PM, Harsha <[email protected]> wrote:
> Hi Chen,
> I looked at your code. The first part is inside a Bolt's execute method, and it looks like it is fetching all the data (10000 per call) from Elasticsearch and emitting each value from inside execute, ending when the ES result set runs out.
> It doesn't look like you followed Storm's conventions here. Was there any reason not to use a Spout? A bolt's execute method gets called once for every tuple that is passed to it. Docs on spouts & bolts: https://storm.incubator.apache.org/documentation/Concepts.html
>
> From your comment in the code, "10000 hits per shard will be returned for each scroll": if it is taking long to read 10000 records from ES, I would suggest you reduce this batch size. The idea is to make quicker calls to ES and push the data downstream, then make another call to ES for the next batch, instead of acquiring one big batch in a single call.
>
> "I am getting around 15000 entries in a batch; the query itself takes about 4 seconds; however, the emit method in the query bolt takes about 20 seconds." Can you try reducing the batch size here too? It looks like the time is spent emitting 15k entries in one go.
>
> Was there any reason/utility in using Flume to write to HDFS? If not, I would recommend the https://github.com/ptgoetz/storm-hdfs bolt.
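Harsha's batch-size advice above amounts to a paging loop: pull a small page from ES, emit it downstream, then ask for the next page, rather than holding one huge result set before emitting anything. A minimal sketch of that loop (Python, with a stubbed `fetch_scroll` standing in for the real Elasticsearch scroll call; all names here are illustrative, not Storm or ES API):

```python
def scroll_batches(fetch_scroll, batch_size):
    """Yield one small page at a time, ES-scroll style.

    fetch_scroll(cursor, size) -> (next_cursor, hits); an empty
    hits list means the result set is exhausted.
    """
    cursor = None
    while True:
        cursor, hits = fetch_scroll(cursor, batch_size)
        if not hits:
            break
        yield hits  # emit this page downstream before fetching more

# Stub standing in for the ES scroll API, over 10 fake records.
records = list(range(10))

def fake_fetch(cursor, size):
    start = cursor or 0
    return start + size, records[start:start + size]

pages = list(scroll_batches(fake_fetch, 3))
# pages: [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

With a smaller batch size the downstream bolts start receiving tuples while the next ES call is still in flight, instead of stalling behind a single 15k-entry emit.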
On Fri, Jul 11, 2014, at 03:37 PM, Chen Wang wrote:
> Here is the output from the ES query bolt:
> "Total execution time for this batch: 179655(millisecond)"
> is the call time around .emit. As you can see, to emit 14000 entries it takes anywhere from 145231 to 180000 ms.
>
> On Fri, Jul 11, 2014 at 2:14 PM, Chen Wang <[email protected]> wrote:
> > Here you go: https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
> > It's actually pretty straightforward. The only thing worth mentioning is that I use another thread in the ES bolt to do the actual query and tuple emit.
> > Thanks for looking.
> > Chen
> >
> > On Fri, Jul 11, 2014 at 1:18 PM, Sam Goodwin <[email protected]> wrote:
> > > Can you show some code? 200 seconds for 15K puts sounds like you're not batching.
> > >
> > > On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang <[email protected]> wrote:
> > > > Typo in the previous email: the emit method in the query bolt takes about 200 (instead of 20) seconds.
> > > >
> > > > On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang <[email protected]> wrote:
> > > > > Hi guys,
> > > > > I have a Storm topology with a single-thread bolt querying a large amount of data (from Elasticsearch), which emits to an HBase bolt (10 threads) that does some filtering, then emits to an Avro bolt (10 threads). The Avro bolt simply emits the tuple to an Avro client, which is received by two Flume nodes and then sunk into HDFS. I am testing in local mode.
> > > > >
> > > > > In the query bolt I am getting around 15000 entries in a batch. The query itself takes about 4 seconds; however, the emit method in the query bolt takes about 20 seconds. Does this mean that the downstream bolts (the HBase bolt and the Avro bolt) cannot keep up with the query bolt?
> > > > >
> > > > > How can I tune my topology to make this process as fast as possible? I tried increasing the HBase threads to 20 but it does not seem to help.
> > > > >
> > > > > I use shuffleGrouping from the query bolt to the HBase bolt, and from the HBase bolt to the Avro bolt.
> > > > >
> > > > > Thanks for any advice,
> > > > > Chen
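On the thread-in-the-spout question at the top of the thread: the usual Storm convention for slow I/O in a spout is to run the blocking query on a background thread that fills a bounded in-memory queue, while nextTuple() only drains that queue and never blocks. The practical difference from a bolt is that a spout's nextTuple() is polled by Storm and can originate tuples, whereas a bolt's execute() fires only when a tuple arrives, so a query that is itself the source of data naturally belongs in a spout. The Storm API is Java; this Python sketch only shows the shape of the pattern (names are illustrative):

```python
import queue
import threading

buf = queue.Queue(maxsize=1000)  # bounded, so a fast producer cannot exhaust memory

def es_query_worker():
    # Stands in for the blocking ES query thread; pushes results as they arrive.
    for hit in range(5):
        buf.put(hit)

emitted = []

def next_tuple():
    # Non-blocking drain, like nextTuple() in a spout: emit one tuple
    # if one is ready, otherwise return immediately.
    try:
        emitted.append(buf.get_nowait())
    except queue.Empty:
        pass

worker = threading.Thread(target=es_query_worker)
worker.start()
worker.join()   # a real spout's worker keeps running; joined here for determinism
for _ in range(5):
    next_tuple()
# emitted: [0, 1, 2, 3, 4]
```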

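Chen's partitioning requirement (routing each record into a different HDFS folder, Hive-style, based on a value in the tuple) reduces to deriving the target directory from the record before writing, whether the writer is Flume or a modified storm-hdfs bolt. A minimal sketch; the base path and field names are made up for illustration:

```python
def partition_path(base, record, key):
    """Derive a Hive-style partition directory from one field of a record."""
    return f"{base}/{key}={record[key]}"

path = partition_path("/data/events", {"region": "us", "id": 42}, "region")
# path: "/data/events/region=us"
```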