Hi Chen,
I looked at your code. The first part appears to be inside a
bolt's execute method. It looks like it fetches all the data
(10000 per call) from Elasticsearch and emits each value from
inside execute, which only returns when the ES result set
runs out.
This doesn't follow Storm's conventions. Was there any reason
not to use a spout here? A bolt's execute method gets called
once for every tuple passed to it. Docs on spouts and bolts:
[1]https://storm.incubator.apache.org/documentation/Concepts.html
From your comment in the code, "10000 hits per shard will be
returned for each scroll": if it is taking long to read 10000
records from ES, I would suggest reducing this batch size.
The idea is to make quicker calls to ES, push the data
downstream, and then make another call to ES for the next
batch, instead of acquiring one big batch in a single call.
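To illustrate the small-batch idea, here is a minimal sketch in plain
Java with no Storm or Elasticsearch dependency. The names PageSource,
PagedReader and PagedReaderDemo are hypothetical and are not taken from
your gist; the in-memory source only stands in for an ES scroll. Each
call to next() hands out one record and fetches a new small page only
when the local buffer is empty, which is roughly how a spout's
nextTuple could pace itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a paged data source such as an ES scroll.
interface PageSource {
    // Returns up to pageSize records starting at offset; empty when exhausted.
    List<String> fetch(int offset, int pageSize);
}

// Pull-based reader: one record per call, one small page per fetch.
class PagedReader {
    private final PageSource source;
    private final int pageSize;
    private final Queue<String> buffer = new ArrayDeque<>();
    private int offset = 0;
    private boolean exhausted = false;
    private int fetchCalls = 0;

    PagedReader(PageSource source, int pageSize) {
        this.source = source;
        this.pageSize = pageSize;
    }

    // Returns the next record, or null once the source has run out.
    String next() {
        if (buffer.isEmpty() && !exhausted) {
            List<String> page = source.fetch(offset, pageSize);
            fetchCalls++;
            if (page.isEmpty()) {
                exhausted = true;
            } else {
                buffer.addAll(page);
                offset += page.size();
            }
        }
        return buffer.poll();
    }

    int fetchCalls() {
        return fetchCalls;
    }
}

class PagedReaderDemo {
    // In-memory source holding `total` synthetic records (assumption for the demo).
    static PageSource inMemory(int total) {
        return (offset, pageSize) -> {
            List<String> page = new ArrayList<>();
            int end = Math.min(total, offset + pageSize);
            for (int i = offset; i < end; i++) {
                page.add("record-" + i);
            }
            return page;
        };
    }

    // Pulls records one at a time until the reader reports exhaustion.
    static int drain(PagedReader reader) {
        int count = 0;
        while (reader.next() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        PagedReader reader = new PagedReader(inMemory(2500), 1000);
        int records = drain(reader);
        System.out.println("records=" + records + " fetches=" + reader.fetchCalls());
    }
}
```

In a real topology the body of next() would sit behind the spout's
nextTuple, so Storm decides when the next record is pulled instead of
one long-running loop pushing everything at once.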
"i am getting around 15000 entries in a batch, the query
itself takes about 4second, however, he emit method in the
query bolt takes about 20 seconds." Can you try reducing the
batch size here too? It looks like the time is going into
emitting 15k entries in one go.
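To put the emit time in perspective, here is a small worked
calculation using two figures from your log output quoted below
(14000 entries in a batch, 179655 ms measured around the emit call).
The class name EmitRate is only illustrative.

```java
// Converts a batch size and elapsed time into per-tuple figures.
class EmitRate {
    static double tuplesPerSecond(long tuples, long elapsedMillis) {
        return tuples * 1000.0 / elapsedMillis;
    }

    static double millisPerTuple(long tuples, long elapsedMillis) {
        return (double) elapsedMillis / tuples;
    }

    public static void main(String[] args) {
        long tuples = 14000;         // batch size reported in the log
        long elapsedMillis = 179655; // "Total execution time for this batch"
        System.out.printf("%.1f tuples/s, %.2f ms per tuple%n",
                tuplesPerSecond(tuples, elapsedMillis),
                millisPerTuple(tuples, elapsedMillis));
    }
}
```

That works out to roughly 78 tuples per second, or about 12.8 ms per
emitted tuple, which suggests the emit call is spending most of its
time waiting rather than doing work.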
Was there any particular reason for using Flume to write to
HDFS? If not, I would recommend using the storm-hdfs bolt:
[2]https://github.com/ptgoetz/storm-hdfs
On Fri, Jul 11, 2014, at 03:37 PM, Chen Wang wrote:
Here is the output from the ES query bolt:
"Total execution time for this batch: 179655" (milliseconds) is
the time measured around the .emit call. As you can see, emitting
14000 entries takes anywhere from 145231 to 180000 ms.
INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=14000 hits=14000 took=26172
40813 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-13_00-00-00
40889 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 782
40890 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 4000 records
59335 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
59335 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=28000 hits=14000 took=18033
238920 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-14_00-00-00
238990 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 179655
238990 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 8000 records
257633 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
257633 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=42000 hits=14000 took=17926
260932 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-15_00-00-00
402852 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-16_00-00-00
402865 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 145231
402865 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 2000 records
417427 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
417427 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=56000 hits=14000 took=13962
417459 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-17_00-00-00
417493 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 66
417493 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 6000 records
429629 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
429629 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=70000 hits=14000 took=12009
441208 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-18_00-00-00
744276 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-19_00-00-00
744277 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 314647
744277 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 0 records
779030 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
779030 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=84000 hits=14000 took=34631
785315 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-20_00-00-00
785332 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 6302
785332 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 4000 records
811859 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
811859 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=98000 hits=14000 took=25806
945938 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-21_00-00-00
960308 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 148449
960308 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 8000 records
983611 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
983611 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=112000 hits=14000 took=22698
983627 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-22_00-00-00
1002262 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-23_00-00-00
1002272 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 18661
1002272 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 2000 records
1021226 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
1021227 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=126000 hits=14000 took=18854
1110480 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-24_00-00-00
1188188 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 166961
1188188 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 6000 records
1204474 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
1204474 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=140000 hits=14000 took=15422
1204495 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-25_00-00-00
1270240 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the new key(hdfs folder) is 2014-07-26_00-00-00
1270240 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 65766
1270240 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 0 records
1284391 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
1284391 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=145861 hits=5861 took=14084
1284414 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 23
1284414 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 5861 records
1284417 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the total hits are 145861
1284417 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- total=145861 hits=0 took=0
1284417 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- Total execution time for this batch: 0
1284418 [pool-1-thread-1] INFO
com.walmartlabs.targeting.storm.bolt.ElasticSearchQueryRunner
- the current batch has 5861 records
Total execution time: 1276946
On Fri, Jul 11, 2014 at 2:14 PM, Chen Wang
<[3][email protected]> wrote:
here you go:
[4]https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
It's actually pretty straightforward. The only thing worth
mentioning is that I use another thread in the ES bolt to do the
actual query and tuple emit.
Thanks for looking.
Chen
On Fri, Jul 11, 2014 at 1:18 PM, Sam Goodwin
<[5][email protected]> wrote:
Can you show some code? 200 seconds for 15K puts sounds like
you're not batching.
On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang
<[6][email protected]> wrote:
Typo in my previous email:
the emit method in the query bolt takes about 200 (instead of
20) seconds.
On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang
<[7][email protected]> wrote:
Hi, guys,
I have a Storm topology with a single-threaded bolt querying a
large amount of data from Elasticsearch and emitting to an HBase
bolt (10 threads), which does some filtering and then emits to an
Avro bolt (10 threads). The Avro bolt simply emits the tuple to an
Avro client; it is then received by two Flume nodes and sunk
into HDFS. I am testing in local mode.
In the query bolt, I am getting around 15000 entries in a
batch. The query itself takes about 4 seconds; however, the emit
method in the query bolt takes about 20 seconds. Does it mean
that the downstream bolts (HBase bolt and Avro bolt) cannot keep
up with the query bolt?
How can I tune my topology to make this process as fast as
possible? I tried increasing the HBase bolt to 20 threads, but it
does not seem to help.
I use shuffleGrouping from the query bolt to the HBase bolt, and
from the HBase bolt to the Avro bolt.
Thanks for any advice.
Chen
References
1. https://storm.incubator.apache.org/documentation/Concepts.html
2. https://github.com/ptgoetz/storm-hdfs
3. mailto:[email protected]
4. https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
5. mailto:[email protected]
6. mailto:[email protected]
7. mailto:[email protected]