Re: Client Get vs Coprocessor scan performance

Kiru Pakkirisamy Mon, 12 Aug 2013 11:29:16 -0700

James,
We actually planned to use Phoenix for this project. But we did not have much 
time to design on top of Phoenix. 
Also, this app is more like a 'search' app and I wanted it to be doing just 
"key lookups". There is no write and everything is in block cache.
Still, yes, let me take a look at your code. Maybe we will get a chance to 
rewrite this on top of Phoenix.
Thanks for your tip and reminder,
 
Regards,
- kiru



Kiru Pakkirisamy | webcloudtech.wordpress.com


________________________________
 From: James Taylor <[email protected]>
To: [email protected]; Kiru Pakkirisamy <[email protected]> 
Sent: Monday, August 12, 2013 9:41 AM
Subject: Re: Client Get vs Coprocessor scan performance
 


Hey Kiru,
Another option for you may be to use Phoenix 
(https://github.com/forcedotcom/phoenix). In particular, our skip scan may be 
what you're looking for: 
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html.
Under the covers, the skip scan does a series of parallel scans, taking 
advantage of both coprocessors and SEEK_NEXT_USING_HINT. As you can see, 
it's more than 2x faster than the batched get approach. On top of that, your 
queries are not limited to point gets; range scans benefit from it as 
well.
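
To make the idea concrete, here is a minimal sketch in plain Java. It is not Phoenix's implementation and uses no HBase API. A TreeMap stands in for a region's sorted rows, and every class and method name is hypothetical. It only contrasts a filter that examines every row with a loop that seeks straight to each wanted key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;

public class SkipScanSketch {
    /** Number of rows examined by the most recent lookup. */
    static int rowsTouched = 0;

    /** Filter style: walk every row in order and keep the ones whose key is wanted. */
    static List<String> filterScan(TreeMap<String, String> rows, SortedSet<String> wanted) {
        rowsTouched = 0;
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : rows.entrySet()) {
            rowsTouched++;
            if (wanted.contains(e.getKey())) {
                out.add(e.getValue());
            }
        }
        return out;
    }

    /** Seek style: for each wanted key, jump to the first row at or after it. */
    static List<String> seekScan(TreeMap<String, String> rows, SortedSet<String> wanted) {
        rowsTouched = 0;
        List<String> out = new ArrayList<>();
        for (String key : wanted) {
            Map.Entry<String, String> e = rows.ceilingEntry(key);
            if (e == null) {
                break; // past the last row; remaining wanted keys are larger still
            }
            rowsTouched++;
            if (e.getKey().equals(key)) {
                out.add(e.getValue());
            }
        }
        return out;
    }

    /** Hypothetical sample data: keys row000000 .. row(n-1), values v0 .. v(n-1). */
    static TreeMap<String, String> sampleRows(int n) {
        TreeMap<String, String> rows = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            rows.put(String.format("row%06d", i), "v" + i);
        }
        return rows;
    }
}
```

Both methods return the same values; the difference is how many rows they have to look at, which is what the seek hint is meant to reduce.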
Thanks,
James
@JamesPlusPlus



On Sat, Aug 10, 2013 at 11:15 PM, Kiru Pakkirisamy <[email protected]> 
wrote:

>Maybe I spoke too soon. HBASE-6870 fixes the table scan (as verified by the 
>read request metrics on the region).
>But the performance with RowFilter is very bad (actually worse than a full 
>table scan; I don't know how this can happen).
>I hope my API usage is right. All I am doing is adding RowFilters to a 
>FilterList and calling setFilter on the scan.
>I tried looking into AggregateImplementation (which is mentioned as the unit 
>test for this bug) but did not follow through because I am in a rush for a 
>good workaround.
>I have now replaced RowFilters with a Get on the Region (in a loop) after 
>making sure my key is within startKey and endKey of the region.
>I think this is getting my data right. Performance is very good, taking almost 
>half the time of the full scan code we had in the coprocessor earlier.
>Are there any gotchas/bad side effects to using a Get on the Region?
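
The range check mentioned above can be sketched in plain Java as follows. This is not the code from the coprocessor. It assumes the usual HBase convention that a region covers [startKey, endKey), with an empty key meaning unbounded on that side, and that row keys compare as unsigned bytes. The class and method names are hypothetical.

```java
public class RegionRangeSketch {
    /** Lexicographic comparison of two byte arrays, treating each byte as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // Equal up to the shorter length: the shorter array sorts first.
        return a.length - b.length;
    }

    /** True if row falls in [startKey, endKey); an empty key means no bound on that side. */
    static boolean inRegion(byte[] row, byte[] startKey, byte[] endKey) {
        boolean atOrAfterStart = startKey.length == 0 || compareUnsigned(row, startKey) >= 0;
        boolean beforeEnd = endKey.length == 0 || compareUnsigned(row, endKey) < 0;
        return atOrAfterStart && beforeEnd;
    }
}
```

A key equal to the region's end key belongs to the next region, so the upper bound is exclusive.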
>
> 
>Regards,
>- kiru
>
>
>Kiru Pakkirisamy | webcloudtech.wordpress.com
>
>
>
>________________________________
> From: Kiru Pakkirisamy <[email protected]>
>To: "[email protected]" <[email protected]>
>Sent: Friday, August 9, 2013 1:04 PM
>
>Subject: Re: Client Get vs Coprocessor scan performance
>
>
>I think this fixes my issues. On our dev cluster, what used to take 1200 msec 
>is now in the 700-800 msec range. Thanks again.
>I will soon be deploying this to our Performance cluster, where our query is 
>in the 15-second range.
> 
>Regards,
>- kiru
>
>
>Kiru Pakkirisamy | webcloudtech.wordpress.com
>
>
>________________________________
>From: Ted Yu <[email protected]>
>To: "[email protected]" <[email protected]>
>Cc: "[email protected]" <[email protected]>
>Sent: Thursday, August 8, 2013 10:44 PM
>Subject: Re: Client Get vs Coprocessor scan performance
>
>
>I think you need HBASE-6870 which went into 0.94.8
>
>Upgrading should boost coprocessor performance.
>
>Cheers
>
>On Aug 8, 2013, at 10:21 PM, Kiru Pakkirisamy <[email protected]> 
>wrote:
>
>> Ted,
>> Here is the method signature/protocol
>> public Map<String, Double> getFoo(Map<String, Double> input,
>> int topN) throws IOException;
>>
>> There are 31 regions on 4 nodes x 8 CPUs.
>> I am on 0.94.6 (from Hortonworks).
>> It seems to behave as linwukang says: it is almost a full table scan in 
>> the coprocessor.
>> Actually, when I set more specific ColumnPrefixFilters, performance went down.
>> I want to do things on the server side because I don't want to be sending 
>> 500K column/values to the client.
>> I cannot believe a single-threaded client which does some calculations and 
>> a group-by beats the coprocessor running in 31 regions.
>> 
>> Regards,
>> - kiru
>>
>>
>> Kiru Pakkirisamy | webcloudtech.wordpress.com
>>
>>
>> ________________________________
>> From: Ted Yu <[email protected]>
>> To: [email protected]; Kiru Pakkirisamy <[email protected]>
>> Sent: Thursday, August 8, 2013 8:40 PM
>> Subject: Re: Client Get vs Coprocessor scan performance
>>
>>
>> Can you give us a bit more information ?
>>
>> How do you deliver the 55 rowkeys to your endpoint ?
>> How many regions do you have for this table ?
>>
>> What HBase version are you using ?
>>
>> Thanks
>>
>> On Thu, Aug 8, 2013 at 6:43 PM, Kiru Pakkirisamy
>> <[email protected]> wrote:
>>
>>> Hi,
>>> I am seeing odd behavior, with Coprocessor performance lagging behind a
>>> client-side Get.
>>> I have a table with 500000 rows. Each has a variable number of columns in
>>> one column family (in this case about 600000 columns in total are processed).
>>> When I try to get 55 specific rows, the client side completes in half the
>>> time of the coprocessor endpoint.
>>> I am using 55 RowFilters on the Coprocessor scan side. The rows are
>>> processed in exactly the same way in both cases.
>>> Any pointers on how to debug this scenario?
>>>
>>> Regards,
>>> - kiru
>>>
>>>
>>> Kiru Pakkirisamy | webcloudtech.wordpress.com
