I should follow up by saying that I was asking why he was using an HTablePool, not saying that it was wrong.
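The pattern suggested further down in the thread — a single HTable per RegionObserver, created in start(), closed in stop(), and re-instantiated inside a try/catch if a write fails — can be sketched roughly as follows. Note this is only a sketch of the shape of the pattern: `IndexTable`, `TableFactory`, and `RetryingIndexWriter` are made-up stand-in types, not HBase API; in real code `IndexTable` would be an `HTableInterface` for the index table.

```java
// Stand-ins for HBase classes, so the pattern compiles on its own.
interface IndexTable {
    void increment(String row);   // stands in for HTableInterface.increment(Increment)
    void close();
}

interface TableFactory {
    IndexTable create();          // stands in for new HTable(conf, "doc_idx")
}

class RetryingIndexWriter {
    private final TableFactory factory;
    private IndexTable table;     // one long-lived handle, as created in start()

    RetryingIndexWriter(TableFactory factory) {
        this.factory = factory;
        this.table = factory.create();
    }

    // Surround the use in try/catch; on failure, re-instantiate the
    // connection and retry the write once.
    void increment(String row) {
        try {
            table.increment(row);
        } catch (RuntimeException e) {
            table.close();
            table = factory.create();
            table.increment(row);
        }
    }

    void stop() {                 // as done in the coprocessor's stop()
        table.close();
    }
}
```

The point of the pattern is that the table handle is created once and reused across calls, rather than checked out of a pool and closed on every postCheckAndPut().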
Still, I think in the pool the writes shouldn't have to go to the WAL.

On Feb 19, 2013, at 10:01 AM, Michael Segel <[email protected]> wrote:

> Good question..
>
> You create a class MyRO.
>
> How many instances of MyRO exist per RS?
>
> How many queries can access the instance MyRO at the same time?
>
> On Feb 19, 2013, at 9:15 AM, Wei Tan <[email protected]> wrote:
>
>> A side question: if HTablePool is not encouraged to be used... how do we
>> handle thread safety in using HTable? Is any replacement for HTablePool
>> planned?
>> Thanks,
>>
>> Best Regards,
>> Wei
>>
>> From: Michel Segel <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Date: 02/18/2013 09:23 AM
>> Subject: Re: coprocessor enabled put very slow, help please~~~
>>
>> Why are you using an HTablePool?
>> Why are you closing the table after each iteration through?
>>
>> Try using one HTable object. Turn off the WAL.
>> Instantiate it in start().
>> Close it in stop().
>> Surround the use in a try/catch.
>> If an exception is caught, re-instantiate a new HTable connection.
>>
>> You may want to flush the connection after puts.
>>
>> Again, I am not sure why you are using checkAndPut on the base table. Your
>> count could be off.
>>
>> As an example, look at the poem/rhyme 'Mary had a little lamb'.
>> Then check your word count.
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On Feb 18, 2013, at 7:21 AM, prakash kadel <[email protected]> wrote:
>>
>>> Thank you guys for your replies.
>>> Michael,
>>> I think I didn't make it clear. Here is my use case:
>>>
>>> I have text documents to insert into HBase (with possible duplicates).
>>> Suppose I have a document such as: "I am working. He is not working"
>>>
>>> I want to insert this document into a table in HBase, say table "doc".
>>>
>>> =doc table=
>>> -----------
>>> rowKey: doc_id
>>> cf: doc_content
>>> value: "I am working. He is not working"
>>>
>>> Now, I want to create another table that stores the word counts, say "doc_idx".
>>>
>>> =doc_idx table=
>>> ---------------
>>> rowKey: I,       cf: count, value: 1
>>> rowKey: am,      cf: count, value: 1
>>> rowKey: working, cf: count, value: 2
>>> rowKey: He,      cf: count, value: 1
>>> rowKey: is,      cf: count, value: 1
>>> rowKey: not,     cf: count, value: 1
>>>
>>> My MR job code:
>>> ===============
>>>
>>> if (doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>>>     for (String word : doc_content.split("\\s+")) {
>>>         Increment inc = new Increment(Bytes.toBytes(word));
>>>         inc.addColumn("count", "", 1);
>>>     }
>>> }
>>>
>>> Now, I wanted to do some experiments with coprocessors, so I modified
>>> the code as follows.
>>>
>>> My MR job code:
>>> ===============
>>>
>>> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>>>
>>> Coprocessor code:
>>> =================
>>>
>>> public void start(CoprocessorEnvironment env) {
>>>     pool = new HTablePool(conf, 100);
>>> }
>>>
>>> public boolean postCheckAndPut(c, row, family, byte[] qualifier,
>>>         compareOp, comparator, put, result) {
>>>
>>>     if (!result) return true; // check if the put succeeded
>>>
>>>     HTableInterface table_idx = pool.getTable("doc_idx");
>>>
>>>     try {
>>>         for (KeyValue contentKV : put.get("doc_content", "")) {
>>>             for (String word :
>>>                     Bytes.toString(contentKV.getValue()).split("\\s+")) {
>>>                 Increment inc = new Increment(Bytes.toBytes(word));
>>>                 inc.addColumn("count", "", 1);
>>>                 table_idx.increment(inc);
>>>             }
>>>         }
>>>     } finally {
>>>         table_idx.close();
>>>     }
>>>     return true;
>>> }
>>>
>>> public void stop(env) {
>>>     pool.close();
>>> }
>>>
>>> I am a newbie to HBase. I am not sure this is the right way to do it.
>>> Given that, why is the coprocessor-enabled version much slower than
>>> the one without?
>>>
>>> Sincerely,
>>> Prakash Kadel
>>>
>>> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>>> <[email protected]> wrote:
>>>>
>>>> The issue I was talking about was the use of a check and put.
>>>> The OP wrote:
>>>>>>>> each map inserts to doc table. (checkAndPut)
>>>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>>>>> an index table.
>>>>
>>>> My question is why does the OP use a checkAndPut, and the
>>>> RegionObserver's postCheckAndPut?
>>>>
>>>> Here's a good example...
>>>> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>>>>
>>>> The OP doesn't really get into the use case, so we don't know why the
>>>> checkAndPut is in the M/R job.
>>>> He should just be using put() and then a postPut().
>>>>
>>>> Another issue... since he's writing to a different HTable... how? Does
>>>> he create an HTable instance in the start() method of his RO object and
>>>> then reference it later? Or does he create the instance of the HTable on
>>>> the fly in each postCheckAndPut()?
>>>> Without seeing his code, we don't know.
>>>>
>>>> Note that this is a synchronous set of writes. Your overall return from
>>>> the M/R call to put will wait until the second row is inserted.
>>>>
>>>> Interestingly enough, you may want to consider disabling the WAL on the
>>>> write to the index. You can always run an M/R job that rebuilds the index
>>>> should something occur to the system where you might lose the data.
>>>> Indexes *ARE* expendable. ;-)
>>>>
>>>> Does that explain it?
>>>>
>>>> -Mike
>>>>
>>>> On Feb 18, 2013, at 4:57 AM, yonghu <[email protected]> wrote:
>>>>
>>>>> Hi, Michael
>>>>>
>>>>> I don't quite understand what you mean by "round trip back to the
>>>>> client". In my understanding, as the RegionServer and TaskTracker can
>>>>> be the same node, MR doesn't have to pull data into the client and then
>>>>> process it. You also mention "unnecessary overhead"; can you
>>>>> explain a little bit which operations or data processing can be seen as
>>>>> "unnecessary overhead"?
>>>>>
>>>>> Thanks
>>>>>
>>>>> yong
>>>>>
>>>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>>>> <[email protected]> wrote:
>>>>>> Why?
>>>>>>
>>>>>> This seems like an unnecessary overhead.
>>>>>>
>>>>>> You are writing code within the coprocessor on the server.
>>>>>> Pessimistic code really isn't recommended if you are worried about
>>>>>> performance.
>>>>>>
>>>>>> I have to ask... by the time you have executed the code in your
>>>>>> co-processor, what would cause the initial write to fail?
>>>>>>
>>>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <[email protected]> wrote:
>>>>>>
>>>>>>> It's a local read. I just check the last param of postCheckAndPut,
>>>>>>> indicating whether the Put succeeded. In case the put succeeds, I
>>>>>>> insert a row in another table.
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> Prakash Kadel
>>>>>>>
>>>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <[email protected]> wrote:
>>>>>>>
>>>>>>>> Is your checkAndPut involving a local or remote READ? Due to the
>>>>>>>> nature of LSM, a read is much slower compared to a write...
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Wei
>>>>>>>>
>>>>>>>> From: Prakash Kadel <[email protected]>
>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>> Date: 02/17/2013 07:49 PM
>>>>>>>> Subject: coprocessor enabled put very slow, help please~~~
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I am trying to insert a few million documents into HBase with
>>>>>>>> MapReduce. To enable quick search of the docs I want to have some
>>>>>>>> indexes, so I tried to use coprocessors, but they are slowing down
>>>>>>>> my inserts. Aren't coprocessors supposed to not increase the latency?
>>>>>>>> My settings:
>>>>>>>> 3 region servers
>>>>>>>> 60 maps
>>>>>>>> Each map inserts to the doc table (checkAndPut).
>>>>>>>> A RegionObserver coprocessor does a postCheckAndPut and inserts some
>>>>>>>> rows to an index table.
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>> Prakash
>>>>>>
>>>>>> Michael Segel | (m) 312.755.9623
>>>>>>
>>>>>> Segel and Associates
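Mike's "Mary had a little lamb / check your word count" warning is easy to reproduce with the document from the thread: a plain split("\\s+"), as used in both the MR job and the coprocessor above, keeps punctuation attached to tokens, so "working." and "working" become two different index row keys and the doc_idx table's expected count of 2 for "working" is never reached. A minimal, self-contained check (plain Java, no HBase):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Word counts as the code above would compute them: split on whitespace only.
// Punctuation stays attached, so "working." and "working" are distinct keys.
class WordCount {
    static Map<String, Integer> count(String doc) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : doc.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {I=1, am=1, working.=1, He=1, is=1, not=1, working=1}
        System.out.println(count("I am working. He is not working"));
    }
}
```

Normalizing tokens (stripping punctuation, and possibly lowercasing) before building the Increment row keys would be needed to get the counts the doc_idx table expects.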
