Re: writing huge amount of data to HDFS

Harsha Sat, 12 Jul 2014 09:29:13 -0700

Hi Chen,

I thought your bolt was the one reading from ES and that there
was no spout. I suppose it's OK since the ES queries are
flowing from Kafka. Did you measure the HBase bolt's execute
method? It looks like it makes a read call to HBase for each
tuple emitted from the ES bolt. From what I see, the ES bolt
emits a bunch of tuples that go to the HBase bolt, which makes
a call to HBase for each one. It might be blocking there
waiting for the query results, which makes it slower to consume
from the ES bolt.
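
A minimal sketch of the measurement suggested above, in plain
Java with no Storm dependency. LatencyTracker and its methods
are hypothetical names, not part of Storm or the HBase client.
The assumption is that the body of the bolt's execute method
can be wrapped in time(...) and the average logged now and then.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Accumulates per-call latency so the average cost of one call can be logged. */
final class LatencyTracker {
    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    /** Runs the given work and records how long it took, even if it throws. */
    void time(Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
            calls.incrementAndGet();
        }
    }

    /** Number of calls timed so far. */
    long callCount() {
        return calls.get();
    }

    /** Average latency in milliseconds, or 0 if nothing has been timed yet. */
    double averageMillis() {
        long n = calls.get();
        return n == 0 ? 0.0 : totalNanos.get() / 1_000_000.0 / n;
    }
}
```

If the average comes out at several milliseconds per tuple, the
per-tuple HBase read is a likely place to look first.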


Ideally, if you can batch tuples into a single HBase query it
will speed things up, instead of making a call for every tuple.
Alternatively, you can reduce the batch size for the ES query
and emit fewer tuples instead of 15k at a time. Increasing the
parallelism of the HBase bolt might not help, since it
increases the number of connections to HBase. I would start by
measuring the HBaseBolt execute method latency, then reduce the
ES batch size and try to batch up the HBase reads.
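
A rough sketch of the batching idea, in plain Java so it runs
without Storm or HBase on the classpath. BatchBuffer is a
hypothetical helper, not an existing class. The bulk handler
stands in for one multi-row read (the HBase client can take a
list of Get objects in a single call), and the batch size is an
assumption to be tuned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Collects items and hands them to a bulk handler once batchSize items are buffered. */
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> bulkHandler;
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> bulkHandler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.bulkHandler = bulkHandler;
    }

    /** Buffers one item and triggers a bulk call when the buffer is full. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; does nothing when the buffer is empty. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        bulkHandler.accept(batch);
    }
}
```

In a bolt, add(...) would be called from execute and flush()
also on some timer, so that a partial batch does not sit in the
buffer indefinitely. Tuples in a batch should be acked only
after the bulk call for that batch has returned.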

-Harsha





On Sat, Jul 12, 2014, at 12:33 AM, Chen Wang wrote:

Thanks Harsha.
My spout is listening to a Kafka queue that contains the ES
queries from user input. Is it safe to spawn a thread in the
spout and run the ES query directly in the spout? What is the
fundamental difference between doing the query in a spout
thread and doing it in a bolt thread?

The reason for using Flume is that I have to split the data
into different partitions (HDFS folders) depending on the value
emitted by the bolt, meaning I would need to modify the HDFS
bolt anyway. In the past, I shifted a large amount of data to a
partitioned Hive table using this approach (Avro to Flume to
HDFS), and it seemed to work well. So I am sticking with this
approach rather than reinventing the wheel.

Thanks,
Chen


On Fri, Jul 11, 2014 at 4:51 PM, Harsha <[1][email protected]>
wrote:

Hi Chen,
I looked at your code. Is the first part inside a bolt's
execute method? It looks like it fetches all the data (10000
per call) from Elasticsearch and emits each value from inside
the execute method, which ends when the ES result set runs out.
It doesn't look like you followed Storm's conventions here. Was
there any reason not to use a spout? A bolt's execute method
gets called for every tuple that is passed to it. Docs on
spouts and bolts:
[2]https://storm.incubator.apache.org/documentation/Concepts.html

From your comment in the code, "10000 hits per shard will be
returned for each scroll": if it is taking long to read 10000
records from ES, I would suggest reducing this batch size. The
idea is that you make quicker calls to ES, push the data
downstream, and then make another call to ES for the next
batch, instead of acquiring one big batch in a single call.

"i am getting around 15000 entries in a batch, the query
itself takes about 4 seconds, however, the emit method in the
query bolt takes about 20 seconds." Can you try reducing the
batch size here too? It looks like the time is being spent
emitting 15k entries in one go.
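
To illustrate the "smaller pages, forwarded as they arrive"
idea, here is a sketch in plain Java. PagedReader and drain are
hypothetical names. The nextPage supplier is an assumed
stand-in for whatever call fetches the next ES scroll page, and
downstream stands in for the emit call. The page size itself
would be set on the ES request, which is not shown.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

final class PagedReader {
    private PagedReader() {
    }

    /**
     * Pulls pages until an empty one arrives, forwarding every item of a
     * page before the next page is requested. Returns the item count.
     */
    static <T> long drain(Supplier<List<T>> nextPage, Consumer<T> downstream) {
        long forwarded = 0;
        List<T> page = nextPage.get();
        while (!page.isEmpty()) {
            for (T item : page) {
                downstream.accept(item);
                forwarded++;
            }
            page = nextPage.get();
        }
        return forwarded;
    }
}
```

With small pages, downstream bolts start receiving tuples
after the first short fetch rather than after one large one.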

Was there any particular reason for using Flume to write to
HDFS? If not, I would recommend using the
[3]https://github.com/ptgoetz/storm-hdfs bolt.



On Fri, Jul 11, 2014, at 03:37 PM, Chen Wang wrote:

Here is the output from the ES query bolt:
"Total execution time for this batch: 179655(millisecond)" is
the time measured around the call to .emit. As you can see,
emitting 14000 entries takes anywhere from 145231 to 180000
milliseconds.



On Fri, Jul 11, 2014 at 2:14 PM, Chen Wang
<[4][email protected]> wrote:

here you go:
[5]https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
It's actually pretty straightforward. The only thing worth
mentioning is that I use another thread in the ES bolt to do
the actual query and tuple emit.
Thanks for looking.
Chen



On Fri, Jul 11, 2014 at 1:18 PM, Sam Goodwin
<[6][email protected]> wrote:

Can you show some code? 200 seconds for 15K puts sounds like
you're not batching.



On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang
<[7][email protected]> wrote:

Typo in my previous email:
the emit method in the query bolt takes about 200 (instead of
20) seconds.



On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang
<[8][email protected]> wrote:

Hi, Guys,
I have a Storm topology with a single-threaded bolt that
queries a large amount of data (from Elasticsearch) and emits
to an HBase bolt (10 threads), which does some filtering and
then emits to an Avro bolt (10 threads). The Avro bolt simply
emits the tuple to an Avro client; it is then received by two
Flume nodes and sunk into HDFS. I am testing in local mode.

In the query bolt, I am getting around 15000 entries in a
batch. The query itself takes about 4 seconds; however, the
emit method in the query bolt takes about 20 seconds. Does this
mean that the downstream bolts (HBaseBolt and Avro bolt) cannot
keep up with the query bolt?

How can I tune my topology to make this process as fast as
possible? I tried increasing the HBase bolt threads to 20, but
it does not seem to help.

I use shuffleGrouping from the query bolt to the HBase bolt,
and from the HBase bolt to the Avro bolt.

Thanks for any advice.
Chen

References

1. mailto:[email protected]
2. https://storm.incubator.apache.org/documentation/Concepts.html
3. https://github.com/ptgoetz/storm-hdfs
4. mailto:[email protected]
5. https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
6. mailto:[email protected]
7. mailto:[email protected]
8. mailto:[email protected]
