Maybe I spoke too soon. HBASE-6870 fixes the table scan (as verified by the read-request metrics on the region). But the performance with RowFilter is very bad (actually worse than a full table scan; I don't know how this can happen).

I hope my API usage is right. All I am doing is adding RowFilters to a FilterList and calling setFilter on the scan. I tried looking into the AggregateImplementation (which is mentioned as the unit test for this bug) but did not follow through because I am in a rush for a good workaround.

I have now replaced the RowFilters with a Get on the Region (in a loop) after making sure my key is within the startKey and endKey of the region. I think this is getting my data right. Performance is very good, almost half the time taken by the full-scan code we had in the coprocessor earlier.

Are there any gotchas/bad side-effects to using a Get on the Region?

Regards,
- kiru
Kiru Pakkirisamy | webcloudtech.wordpress.com

________________________________
From: Kiru Pakkirisamy <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Friday, August 9, 2013 1:04 PM
Subject: Re: Client Get vs Coprocessor scan performance

I think this fixes my issues. On our dev cluster what used to take 1200 msec is now in the 700-800 msec range. Thanks again. I will soon be deploying this to our Performance cluster, where our query is in the 15-second range.

Regards,
- kiru

Kiru Pakkirisamy | webcloudtech.wordpress.com

________________________________
From: Ted Yu <[email protected]>
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Thursday, August 8, 2013 10:44 PM
Subject: Re: Client Get vs Coprocessor scan performance

I think you need HBASE-6870, which went into 0.94.8.
Upgrading should boost coprocessor performance.

Cheers

On Aug 8, 2013, at 10:21 PM, Kiru Pakkirisamy <[email protected]> wrote:

> Ted,
> Here is the method signature/protocol
> public Map<String, Double> getFooMap(Map<String, Double> input,
> int topN) throws IOException;
>
> There are 31 regions on 4 nodes X 8 CPU.
> I am on 0.94.6 (from Hortonworks).
> It seems to behave the way linwukang says: it is almost a full
> table scan in the coprocessor.
> Actually, when I set more specific ColumnPrefixFilters, performance went down.
> I want to do things on the server side because I don't want to be sending
> 500K column/values to the client.
> I cannot believe a single-threaded client which does some calculations and
> group-by beats the coprocessor running in 31 regions.
>
> Regards,
> - kiru
>
>
> Kiru Pakkirisamy | webcloudtech.wordpress.com
>
>
> ________________________________
> From: Ted Yu <[email protected]>
> To: [email protected]; Kiru Pakkirisamy <[email protected]>
> Sent: Thursday, August 8, 2013 8:40 PM
> Subject: Re: Client Get vs Coprocessor scan performance
>
>
> Can you give us a bit more information?
>
> How do you deliver the 55 rowkeys to your endpoint?
> How many regions do you have for this table?
>
> What HBase version are you using?
>
> Thanks
>
> On Thu, Aug 8, 2013 at 6:43 PM, Kiru Pakkirisamy <[email protected]> wrote:
>
>> Hi,
>> I am finding an odd behavior with the Coprocessor performance lagging a
>> client side Get.
>> I have a table with 500000 rows. Each has a variable number of columns in one
>> column family (in this case about 600000 columns in total are processed).
>> When I try to get 55 specific rows, the client side completes in half the
>> time of the coprocessor endpoint.
>> I am using 55 RowFilters on the Coprocessor scan side. The rows are
>> processed in exactly the same way in both cases.
>> Any pointers on how to debug this scenario?
>>
>> Regards,
>> - kiru
>>
>>
>> Kiru Pakkirisamy | webcloudtech.wordpress.com
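
For anyone trying the workaround from the top message (a Get on the Region for each requested key, after checking the key against the region's startKey and endKey), the range check might look roughly like the sketch below. This is only an illustration in plain Java, not code from the thread and not an HBase API: the class RegionKeyRange, its compare and contains methods, and the sample keys are all hypothetical. It assumes the usual HBase conventions that row keys sort as unsigned bytes, that startKey is inclusive and endKey is exclusive, and that an empty key means the range is unbounded on that side.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of HBase): does a row key fall inside a region's key range? */
final class RegionKeyRange {

    private RegionKeyRange() {}

    /** Lexicographic comparison of unsigned bytes, the ordering assumed for row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal, so the shorter key sorts first.
        return a.length - b.length;
    }

    /**
     * True when startKey <= row < endKey. An empty startKey or endKey is
     * treated as unbounded (assumed convention for the first and last region).
     */
    static boolean contains(byte[] startKey, byte[] endKey, byte[] row) {
        boolean afterStart = startKey.length == 0 || compare(row, startKey) >= 0;
        boolean beforeEnd = endKey.length == 0 || compare(row, endKey) < 0;
        return afterStart && beforeEnd;
    }

    public static void main(String[] args) {
        // Made-up region boundaries and keys, for illustration only.
        byte[] start = "row200".getBytes(StandardCharsets.UTF_8);
        byte[] end = "row400".getBytes(StandardCharsets.UTF_8);
        for (String key : new String[] {"row100", "row200", "row399", "row400"}) {
            boolean local = contains(start, end, key.getBytes(StandardCharsets.UTF_8));
            System.out.println(key + " handled by this region: " + local);
        }
    }
}
```

In an endpoint, each region would run a check like this over the requested keys and issue a Get only for the keys it owns, so every key is read by exactly one region instead of every region filtering every row.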
