Thanks Harsha.
My spout is listening to a Kafka queue that contains the ES query from the
user's input. Is it safe to spawn a thread in the spout and do the ES query
directly in the spout? What is the fundamental difference between doing the
query in a thread of the spout vs. a thread of a bolt?
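
To make it concrete, something like this is what I have in mind (only a rough
sketch, not my real code; EsScrollClient and the field names are
placeholders):

import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class EsQuerySpout extends BaseRichSpout {

    /** Placeholder for whatever wraps the ES scan/scroll calls. */
    public interface EsScrollClient {
        String startScroll(String queryJson, int batchSize);
        List<Map<String, Object>> nextBatch(String scrollId);
    }

    // Filled by a background thread listening to the Kafka queue of user queries.
    private final Queue<String> pendingQueries = new ConcurrentLinkedQueue<String>();
    private SpoutOutputCollector collector;
    private EsScrollClient es;   // assume this gets wired up in open()
    private String scrollId;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // start the Kafka consumer thread here; it offers query strings into pendingQueries
    }

    @Override
    public void nextTuple() {
        if (scrollId == null) {
            String query = pendingQueries.poll();
            if (query == null) {
                return;                                  // no query yet
            }
            scrollId = es.startScroll(query, 1000);      // small batch per scroll
        }
        List<Map<String, Object>> hits = es.nextBatch(scrollId);
        if (hits.isEmpty()) {
            scrollId = null;                             // this query is done
            return;
        }
        for (Map<String, Object> hit : hits) {
            collector.emit(new Values(hit));             // one tuple per ES hit
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hit"));
    }
}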

The reason for using Flume is that I have to split the data into different
partitions (HDFS folders) depending on a value in the data, meaning I would
need to modify the HDFS bolt anyway. In the past, I shifted a large amount of
data to a partitioned Hive table using this approach (Avro to Flume to HDFS),
and it seemed to work well. So I am sticking with this approach rather than
reinventing the wheel.
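
Roughly, what I mean by splitting on a value (a simplified sketch, not my
actual bolt; the host, port, and header name are placeholders) is to tag each
Flume event with a header and let the Flume HDFS sink choose the folder from
it, e.g. hdfs.path = hdfs://namenode/data/%{partition}:

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class PartitionedFlumeSender {

    // placeholder host/port for the Flume Avro source
    private final RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 4141);

    public void send(String record, String partitionValue) throws EventDeliveryException {
        Map<String, String> headers = new HashMap<String, String>();
        headers.put("partition", partitionValue);   // drives %{partition} in the HDFS sink path
        Event event = EventBuilder.withBody(record.getBytes(Charset.forName("UTF-8")), headers);
        client.append(event);
    }
}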

Thanks,
Chen


On Fri, Jul 11, 2014 at 4:51 PM, Harsha <[email protected]> wrote:

>  Hi Chen,
>           I looked at your code. Is the first part inside a bolt's execute
> method? It looks like it fetches all the data (10000 per call) from
> Elasticsearch and emits each value from inside the execute method, ending
> when the ES result set runs out.
> It doesn't look like you followed Storm's conventions here; was there any
> reason not to use a spout? A bolt's execute method gets called for every
> tuple that gets passed to it. Docs on spouts and bolts:
> https://storm.incubator.apache.org/documentation/Concepts.html
>
> From your comment in the code, "10000 hits per shard will be returned for
> each scroll": if it is taking long to read 10000 records from ES, I would
> suggest reducing this batch size. The idea is that you make quicker calls
> to ES, push the data downstream, and make another call to ES for the next
> batch, instead of acquiring one big batch in a single call.
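>
> Roughly, with the 1.x Java client the loop would look something like the
> sketch below (untested; the index name and sizes are only examples, and I
> am assuming `client` is your ES Client, `queryJson` is the user's query,
> and `collector` is your Storm output collector):
>
> // imports: org.elasticsearch.action.search.SearchResponse and SearchType,
> //          org.elasticsearch.common.unit.TimeValue,
> //          org.elasticsearch.index.query.QueryBuilders,
> //          org.elasticsearch.search.SearchHit, backtype.storm.tuple.Values
> SearchResponse resp = client.prepareSearch("my_index")   // example index
>         .setSearchType(SearchType.SCAN)
>         .setScroll(new TimeValue(60000))
>         .setQuery(QueryBuilders.wrapperQuery(queryJson))
>         .setSize(1000)          // per-shard batch, much smaller than 10000
>         .execute().actionGet();
> while (true) {
>     resp = client.prepareSearchScroll(resp.getScrollId())
>             .setScroll(new TimeValue(60000))
>             .execute().actionGet();
>     if (resp.getHits().getHits().length == 0) {
>         break;                  // scroll exhausted
>     }
>     for (SearchHit hit : resp.getHits().getHits()) {
>         collector.emit(new Values(hit.getSource()));   // push each small batch downstream
>     }
> }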
>
>  "I am getting around 15000 entries in a batch; the query itself takes
> about 4 seconds; however, the emit method in the query bolt takes about 20
> seconds." Can you try reducing the batch size here too? It looks like the
> time is spent emitting 15k entries in one go.
>
>           Was there any reason/utility in using Flume to write to HDFS? If
> not, I would recommend using the https://github.com/ptgoetz/storm-hdfs bolt.
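>
> A minimal setup, roughly following that project's README (the fs URL and
> path are just examples), looks like:
>
> // classes come from org.apache.storm.hdfs.bolt.* in storm-hdfs
> SyncPolicy syncPolicy = new CountSyncPolicy(1000);        // sync every 1000 tuples
> FileRotationPolicy rotationPolicy =
>         new FileSizeRotationPolicy(5.0f, FileSizeRotationPolicy.Units.MB);
> FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/storm/");
> RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");
>
> HdfsBolt hdfsBolt = new HdfsBolt()
>         .withFsUrl("hdfs://namenode:8020")                // example namenode
>         .withFileNameFormat(fileNameFormat)
>         .withRecordFormat(format)
>         .withRotationPolicy(rotationPolicy)
>         .withSyncPolicy(syncPolicy);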
>
>
>
> On Fri, Jul 11, 2014, at 03:37 PM, Chen Wang wrote:
>
> Here is the output from the ES query bolt:
>  "Total execution time for this batch: 179655(millisecond)" is the time
> measured around the .emit call. As you can see, emitting 14000 entries
> takes anywhere from 145231 to 180000 ms.
>
>
>
> On Fri, Jul 11, 2014 at 2:14 PM, Chen Wang <[email protected]>
> wrote:
>
> Here you go:
> https://gist.github.com/cynosureabu/b317646d5c475d0d2e42
> It's actually pretty straightforward. The only thing worth mentioning is
> that I use another thread in the ES bolt to do the actual query and the
> tuple emit.
> Thanks for looking.
> Chen
>
>
>
> On Fri, Jul 11, 2014 at 1:18 PM, Sam Goodwin <[email protected]>
> wrote:
>
> Can you show some code? 200 seconds for 15K puts sounds like you're not
> batching.
>
>
> On Fri, Jul 11, 2014 at 12:47 PM, Chen Wang <[email protected]>
> wrote:
>
> Typo in the previous email:
> the emit method in the query bolt takes about 200 (instead of 20) seconds.
>
>
> On Fri, Jul 11, 2014 at 11:58 AM, Chen Wang <[email protected]>
> wrote:
>
> Hi, Guys,
> I have a Storm topology with a single-threaded bolt querying a large amount
> of data (from Elasticsearch), which emits to an HBase bolt (10 threads) that
> does some filtering and then emits to an Avro bolt (10 threads). The Avro
> bolt simply emits each tuple to an Avro client, which is received by two
> Flume nodes and then sunk into HDFS. I am testing in local mode.
>
> In the query bolt, I am getting around 15000 entries in a batch; the query
> itself takes about 4 seconds, but the emit method in the query bolt takes
> about 20 seconds. Does this mean that the downstream bolts (HBase bolt and
> Avro bolt) cannot keep up with the query bolt?
>
> How can I tune my topology to make this process as fast as possible? I
> tried increasing the HBase threads to 20, but it does not seem to help.
>
> I use shuffleGrouping from the query bolt to the HBase bolt, and from the
> HBase bolt to the Avro bolt.
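>
> For reference, the wiring is essentially the following (a simplified
> sketch; TriggerSpout and the bolt class names are placeholders for my
> actual classes):
>
> // imports: backtype.storm.Config, backtype.storm.LocalCluster,
> //          backtype.storm.topology.TopologyBuilder
> TopologyBuilder builder = new TopologyBuilder();
> builder.setSpout("trigger", new TriggerSpout(), 1);   // kicks off the ES query
> builder.setBolt("es-query", new EsQueryBolt(), 1)     // single-threaded query bolt
>        .shuffleGrouping("trigger");
> builder.setBolt("hbase", new MyHBaseBolt(), 10)       // filtering, 10 threads
>        .shuffleGrouping("es-query");
> builder.setBolt("avro", new AvroBolt(), 10)           // hands tuples to the Avro/Flume client
>        .shuffleGrouping("hbase");
>
> LocalCluster cluster = new LocalCluster();            // testing in local mode
> cluster.submitTopology("es-to-hdfs", new Config(), builder.createTopology());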
>
> Thanks for any advice.
> Chen
>
