Re: HBase bulk load through co-processors

Andrew Purtell Sat, 14 Jul 2012 12:21:20 -0700

Sever,

You and Jesse Yates should talk. See 
http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html


   - Andy

On Jul 14, 2012, at 5:24 AM, Sever Fundatureanu <[email protected]> 
wrote:

> My intention is to implement a Secondary Index as suggested here:
> http://wiki.apache.org/hadoop/Hbase/SecondaryIndexing.
> That page advises adding secondary index edits to a "shared work queue",
> and says that "The shared queue would be a thread or threadpool that picks up
> these secondary table edit jobs and applies them using a normal Put
> operation to the secondary table". Is this shared queue some kind of
> mechanism external to all region servers, or a queue shared only between
> the threads local to one RS?
> 
> Thanks,
> Sever
> 
> On Mon, Jul 9, 2012 at 10:16 AM, Nick Dimiduk <[email protected]> wrote:
> 
>> On Sat, Jul 7, 2012 at 2:58 AM, Sever Fundatureanu <
>> [email protected]> wrote:
>> 
>>> Also, does anybody know what the flow in the system is when a coprocessor
>>> on one RS makes a Put call with a row key which falls on another RS? I.e.,
>>> do the region servers communicate directly with each other?
>>> 
>> 
>> In this case, the coprocessor in your RS is acting like any other HBase
>> client. Puts will write from the coproc to the target RS like a normal
>> write. That is, of course, assuming I understand your implementation.
>> 
>> -n
>> 
>> On Fri, Jul 6, 2012 at 10:16 PM, Nick Dimiduk <[email protected]> wrote:
>>> 
>>>> Sever,
>>>> 
>>>> I presume you're loading your data via online Puts via the MR job (as
>>>> opposed to generating HFiles). What are you hoping to gain from a
>>>> coprocessor implementation vs the 6 MR jobs? Have you pre-split your
>>>> tables? Can the RegionServer(s) handle all the concurrent mappers?
>>>> 
>>>> -n
>>>> 
>>>> On Mon, Jul 2, 2012 at 11:58 AM, Sever Fundatureanu <
>>>> [email protected]> wrote:
>>>> 
>>>>> I agree that increasing the timeout is not the best option. I will work
>>>>> both on better balancing the load and maybe doing it in increments like
>>>>> you suggested. However, for now I want a quick fix to the problem.
>>>>> 
>>>>> Just to see if I understand this right: a ZooKeeper node redirects my
>>>>> client to a region server node and then my client talks directly to this
>>>>> region server; now the timeout happens on the client while talking to the
>>>>> RS, right? It expects some kind of confirmation and it times out. If this
>>>>> is the case, how can I increase this timeout? I only found in the
>>>>> documentation "zookeeper.session.timeout", which is the timeout between
>>>>> ZooKeeper and HBase.
>>>>> 
>>>>> Thanks,
>>>>> Sever
>>>>> 
>>>>> On Mon, Jul 2, 2012 at 8:19 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>>>>> 
>>>>>> Hi Sever,
>>>>>> 
>>>>>> It seems one of the nodes in your cluster is overwhelmed by the load
>>>>>> you are giving it.
>>>>>> 
>>>>>> So IMO, you have two options here:
>>>>>> First, you can try to reduce the load. I mean, split the bulk load into
>>>>>> multiple smaller loads and run them one by one, to give your cluster
>>>>>> time to dispatch it correctly.
>>>>>> Second, you can increase the timeout from 60s to 120s. But you might
>>>>>> face the same issue with 120s, so I really recommend the first option.
>>>>>> 
>>>>>> JM
>>>>>> 
>>>>>> 2012/7/2, Sever Fundatureanu <[email protected]>:
>>>>>>> Can someone please help me with this?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Sever
>>>>>>> 
>>>>>>> On Tue, Jun 26, 2012 at 8:14 PM, Sever Fundatureanu <
>>>>>>> [email protected]> wrote:
>>>>>>> 
>>>>>>>> My keys are built of 4 8-byte IDs. I am currently doing the load with MR,
>>>>>>>> but I get a timeout when doing the LoadIncrementalHFiles call:
>>>>>>>> 
>>>>>>>> 12/06/24 21:29:01 ERROR mapreduce.LoadIncrementalHFiles: Encountered unrecoverable error from region server
>>>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=10, exceptions:
>>>>>>>> Sun Jun 24 21:29:01 CEST 2012, org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$3@4699ecf9, java.net.SocketTimeoutException: Call to das3002.cm.cluster/10.141.0.79:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.141.0.254:43240 remote=das3002.cm.cluster/10.141.0.79:60020]
>>>>>>>> 
>>>>>>>>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1345)
>>>>>>>>       at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.tryAtomicRegionLoad(LoadIncrementalHFiles.java:487)
>>>>>>>>       at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$1.call(LoadIncrementalHFiles.java:275)
>>>>>>>>       at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$1.call(LoadIncrementalHFiles.java:273)
>>>>>>>>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>       at java.lang.Thread.run(Thread.java:662)
>>>>>>>> 12/06/24 21:30:52 ERROR mapreduce.LoadIncrementalHFiles: Encountered unrecoverable error from region server
>>>>>>>> 
>>>>>>>> Is there a way in which I can increase the timeout period?
>>>>>>>> 
>>>>>>>> Thank you,
>>>>>>>> 
>>>>>>>> On Tue, Jun 26, 2012 at 7:05 PM, Andrew Purtell <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> On Tue, Jun 26, 2012 at 9:56 AM, Sever Fundatureanu
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> I have to bulkload 6 tables which contain the same information but with a
>>>>>>>>>> different order to cover all possible access patterns. Would it be a good
>>>>>>>>>> idea to do only one load and use co-processors to populate the other
>>>>>>>>>> tables, instead of doing the traditional MR bulkload which would require 6
>>>>>>>>>> separate jobs?
>>>>>>>>> 
>>>>>>>>> Without knowing more than you've said, it seems better to use MR to
>>>>>>>>> build all input.
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> 
>>>>>>>>>  - Andy
>>>>>>>>> 
>>>>>>>>> Problems worthy of attack prove their worth by hitting back. - Piet
>>>>>>>>> Hein (via Tom White)
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Sever Fundatureanu
>>>>>>>> 
>>>>>>>> Vrije Universiteit Amsterdam
>>>>>>>> E-mail: [email protected]
>>>>>>>> 
> -- 
> Sever Fundatureanu
> 
> Vrije Universiteit Amsterdam
> E-mail: [email protected]
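
One possible reading of the "shared work queue" quoted in the thread is a queue local to each region server process, drained by a small thread pool. The sketch below illustrates only that reading in plain Java; it is not taken from the wiki page or from HBase itself. `IndexEdit`, `SecondaryIndexQueue` and the `applier` callback are hypothetical names, and the callback stands in for a normal Put against the secondary table.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical edit destined for the secondary (index) table. */
final class IndexEdit {
    final String indexRowKey;
    final String primaryRowKey;

    IndexEdit(String indexRowKey, String primaryRowKey) {
        this.indexRowKey = indexRowKey;
        this.primaryRowKey = primaryRowKey;
    }
}

/** Hypothetical per-process queue: a fixed thread pool drains submitted edits. */
final class SecondaryIndexQueue {
    private final ExecutorService pool;
    private final Consumer<IndexEdit> applier;

    SecondaryIndexQueue(int threads, Consumer<IndexEdit> applier) {
        // The executor's internal queue plays the role of the shared work queue.
        this.pool = Executors.newFixedThreadPool(threads);
        this.applier = applier;
    }

    /** Called from the write path; returns without waiting for the index write. */
    void submit(IndexEdit edit) {
        pool.execute(() -> applier.accept(edit));
    }

    /** Stops accepting edits and waits up to the timeout for queued ones to finish. */
    boolean shutdownAndWait(long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In this sketch the queue lives only in memory, so edits that are still queued are lost if the process dies. Whether that is acceptable depends on how consistent the index needs to be.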

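The thread describes row keys built from four 8-byte IDs, stored in six tables that hold the same data in different orders. A minimal sketch of how one such key could be assembled for a given ordering follows. `CompositeKey` and its `order` parameter are hypothetical and do not come from the original poster's code. The sketch assumes the IDs are non-negative, so that big-endian encoding sorts the same way as a byte-wise key comparison.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: builds a 32-byte row key from four 8-byte ids in a chosen order. */
final class CompositeKey {
    private CompositeKey() {}

    /**
     * @param ids   the four ids of one record
     * @param order a permutation of {0,1,2,3}; order[i] is the index of the id
     *              written at position i of the key
     */
    static byte[] build(long[] ids, int[] order) {
        if (ids.length != 4 || order.length != 4) {
            throw new IllegalArgumentException("expected 4 ids and 4 positions");
        }
        // ByteBuffer is big-endian by default, so each id occupies 8 bytes,
        // most significant byte first.
        ByteBuffer buf = ByteBuffer.allocate(4 * Long.BYTES);
        for (int idx : order) {
            buf.putLong(ids[idx]);
        }
        return buf.array();
    }
}
```

Each of the six tables would then use its own fixed `order` array, for example `{2, 0, 1, 3}` for a table whose leading component is the third ID.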