James,
I have only one column family, cp. Yes, that is how I store the Double. No, the doubles are always positive. The keys look like "A14568 ": less than a million, and I added the letters to randomize them. I group them based on the C_ suffix and, to simplify, order them by the Double. Is there a way to define a sort of "user defined function" on a column that would take care of my calculation on that double?
Thanks again.

Regards,
- kiru
Kiru Pakkirisamy | webcloudtech.wordpress.com

________________________________
From: James Taylor <[email protected]>
To: Kiru Pakkirisamy <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Sunday, August 18, 2013 5:34 PM
Subject: Re: Client Get vs Coprocessor scan performance

Kiru,
What's your column family name? Just to confirm, the column qualifier of your key value is C_10345 and this stores a value as a Double using Bytes.toBytes(double)? Are any of the Double values negative? Any other key values?

Can you give me an idea of the kind of fuzzy filtering you're doing on the 7 char row key? We may want to model that as a set of row key columns in Phoenix to leverage the skip scan more.

How about I model your aggregation as an AVG over a group of rows? What would your GROUP BY expression look like? Are you grouping based on a part of the 7 char row key? Or on some other key value?

Thanks,
James

On Sun, Aug 18, 2013 at 2:16 PM, Kiru Pakkirisamy <[email protected]> wrote:

> James,
> Rowkey - String - len - 7
> Col = String - variable length - but looks C_10345
> Col value = Double
>
> If I can create a Phoenix schema mapping to this existing table that would
> be great. I actually do a group by the column values and return another
> value which is a function of the value and an input double value. Input is
> a Map<String, Double> and return is also a Map<String, Double>.
>
> Regards,
> - kiru
>
> Kiru Pakkirisamy | webcloudtech.wordpress.com
>
> ------------------------------
> *From:* James Taylor <[email protected]>
> *To:* [email protected]; Kiru Pakkirisamy <[email protected]>
> *Sent:* Sunday, August 18, 2013 2:07 PM
> *Subject:* Re: Client Get vs Coprocessor scan performance
>
> Kiru,
> If you're able to post the key values, row key structure, and data types
> you're using, I can post the Phoenix code to query against it. You're doing
> some kind of aggregation too, right? If you could explain that part too,
> that would be helpful.
> It's likely that you can just query the existing
> HBase data you've already created on the same cluster you're already using
> (provided you put the phoenix jar on all the region servers - use our 2.0.0
> version that just came out). Might be interesting to compare the amount of
> code necessary in each approach as well.
> Thanks,
> James
>
> On Sun, Aug 18, 2013 at 12:16 PM, Kiru Pakkirisamy <[email protected]> wrote:
>
> James,
> I am using the FuzzyRowFilter or the Gets within a Coprocessor. Looks
> like I cannot use your SkipScanFilter by itself as it has lots of phoenix
> imports. I thought of writing my own Custom filter and saw that the
> FuzzyRowFilter in the 0.94 branch also had an implementation for
> getNextKeyHint(), only that it works well only with fixed length keys if I
> wanted a complete match of the keys. After my padding my keys to fixed
> length it seems to be fine.
> Once I confirm some key locality and other issues (like heap), I will try
> to bench mark this table alone against Phoenix on another cluster. Thanks.
>
> Regards,
> - kiru
>
> Kiru Pakkirisamy | webcloudtech.wordpress.com
>
> ________________________________
> From: James Taylor <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc: Kiru Pakkirisamy <[email protected]>
> Sent: Sunday, August 18, 2013 11:44 AM
> Subject: Re: Client Get vs Coprocessor scan performance
>
> Would be interesting to compare against Phoenix's Skip Scan
> (http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html)
> which does a scan through a coprocessor and is more than 2x faster
> than multi Get (plus handles multi-range scans in addition to point
> gets).
>
> James
>
> On Aug 18, 2013, at 6:39 AM, Ted Yu <[email protected]> wrote:
>
> > bq. Get'ting 100 rows seems to be faster than the FuzzyRowFilter (mask on
> > the whole length of the key)
> >
> > In this case the Get's are very selective.
> > The number of rows FuzzyRowFilter
> > was evaluated against would be much higher.
> > It would be nice if you remember the time each took.
> >
> > bq. Also, I am seeing very bad concurrent query performance
> >
> > Were the multi Get's performed by your coprocessor within region boundary
> > of the respective coprocessor ? Just to confirm.
> >
> > bq. that would make Coprocessors almost single threaded across multiple
> > invocations ?
> >
> > Let me dig into code some more.
> >
> > Cheers
> >
> > On Sat, Aug 17, 2013 at 10:34 PM, Kiru Pakkirisamy <[email protected]> wrote:
> >
> >> Ted,
> >> On a table with 600K rows, Get'ting 100 rows seems to be faster than the
> >> FuzzyRowFilter (mask on the whole length of the key). I thought the
> >> FuzzyRowFilter's SEEK_NEXT_USING_HINT would help. All this on the client
> >> side, I have not changed my CoProcessor to use the FuzzyRowFilter based on
> >> the client side performance (still doing multiple get inside the
> >> coprocessor). Also, I am seeing very bad concurrent query performance. Are
> >> there any thing that would make Coprocessors almost single threaded across
> >> multiple invocations ?
> >> Again, all this after putting in 0.94.10 (for hbase-6870 sake) which seems
> >> to be very good in bringing up the regions online fast and balanced. Thanks
> >> and much appreciated.
> >>
> >> Regards,
> >> - kiru
> >>
> >> Kiru Pakkirisamy | webcloudtech.wordpress.com
> >>
> >> ________________________________
> >> From: Ted Yu <[email protected]>
> >> To: "[email protected]" <[email protected]>
> >> Sent: Saturday, August 17, 2013 4:19 PM
> >> Subject: Re: Client Get vs Coprocessor scan performance
> >>
> >> HBASE-6870 targeted whole table scanning for each coprocessorService call
> >> which exhibited itself through:
> >>
> >> HTable#coprocessorService -> getStartKeysInRange -> getStartEndKeys ->
> >> getRegionLocations -> MetaScanner.allTableRegions(getConfiguration(),
> >> getTableName(), false)
> >>
> >> The cached region locations in HConnectionImplementation would be used.
> >>
> >> Cheers
> >>
> >> On Sat, Aug 17, 2013 at 2:21 PM, Asaf Mesika <[email protected]> wrote:
> >>
> >>> Ted, can you elaborate a little bit why this issue boosts performance?
> >>> I couldn't figure out from the issue comments if they execCoprocessor scans
> >>> the entire .META. table or and entire table, to understand the actual
> >>> improvement.
> >>>
> >>> Thanks!
> >>>
> >>> On Fri, Aug 9, 2013 at 8:44 AM, Ted Yu <[email protected]> wrote:
> >>>
> >>>> I think you need HBASE-6870 which went into 0.94.8
> >>>>
> >>>> Upgrading should boost coprocessor performance.
> >>>>
> >>>> Cheers
> >>>>
> >>>> On Aug 8, 2013, at 10:21 PM, Kiru Pakkirisamy <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Ted,
> >>>>> Here is the method signature/protocol
> >>>>> public Map<String, Double> getFooMap<String, Double> input,
> >>>>> int topN) throws IOException;
> >>>>>
> >>>>> There are 31 regions on 4 nodes X 8 CPU.
> >>>>> I am on 0.94.6 (from Hortonworks).
> >>>>> I think it seems to behave like what linwukang says, - it is almost a
> >>>>> full table scan in the coprocessor.
> >>>>> Actually, when I set more specific ColumnPrefixFilters performance went
> >>>>> down.
> >>>>> I want to do things on the server side because, I dont want to be
> >>>>> sending 500K column/values to the client.
> >>>>> I cannot believe a single-threaded client which does some calculations
> >>>>> and group-by beats the coprocessor running in 31 regions.
> >>>>>
> >>>>> Regards,
> >>>>> - kiru
> >>>>>
> >>>>> Kiru Pakkirisamy | webcloudtech.wordpress.com
> >>>>>
> >>>>> ________________________________
> >>>>> From: Ted Yu <[email protected]>
> >>>>> To: [email protected]; Kiru Pakkirisamy <[email protected]>
> >>>>> Sent: Thursday, August 8, 2013 8:40 PM
> >>>>> Subject: Re: Client Get vs Coprocessor scan performance
> >>>>>
> >>>>> Can you give us a bit more information ?
> >>>>>
> >>>>> How do you deliver the 55 rowkeys to your endpoint ?
> >>>>> How many regions do you have for this table ?
> >>>>>
> >>>>> What HBase version are you using ?
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>> On Thu, Aug 8, 2013 at 6:43 PM, Kiru Pakkirisamy
> >>>>> <[email protected]> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>> I am finding an odd behavior with the Coprocessor performance lagging a
> >>>>>> client side Get.
> >>>>>> I have a table with 500000 rows. Each have variable # of columns in one
> >>>>>> column family (in this case about 600000 columns in total are processed)
> >>>>>> When I try to get specific 55 rows, the client side completes in half-the
> >>>>>> time as the coprocessor endpoint.
> >>>>>> I am using 55 RowFilters on the Coprocessor scan side. The rows are
> >>>>>> processed are exactly the same way in both the cases.
> >>>>>> Any pointers on how to debug this scenario ?
> >>>>>>
> >>>>>> Regards,
> >>>>>> - kiru
> >>>>>>
> >>>>>> Kiru Pakkirisamy | webcloudtech.wordpress.com
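
For anyone following the thread, the calculation described in the replies above (group by the C_ column qualifier, apply a function of the stored Double and an input double, return the top N as a Map<String, Double>) could be sketched in plain Java roughly as below. This is only an illustration under assumptions, not the actual coprocessor code: the class ColumnScoreSketch, the Cell holder, and the score() function (a simple product here) are hypothetical, and the input map is assumed to be keyed by row key. The encode()/decode() helpers use the 8-byte big-endian layout of the raw IEEE 754 bits, which is the layout Bytes.toBytes(double) produces, so the sketch runs without HBase on the classpath.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnScoreSketch {

    /** Hypothetical stand-in for one HBase cell: row key, column qualifier, raw value bytes. */
    static final class Cell {
        final String rowKey;
        final String qualifier;
        final byte[] value;

        Cell(String rowKey, String qualifier, byte[] value) {
            this.rowKey = rowKey;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    /** Raw IEEE 754 bits as an 8-byte big-endian long (ByteBuffer is big-endian by default). */
    static byte[] encode(double d) {
        return ByteBuffer.allocate(8).putLong(Double.doubleToRawLongBits(d)).array();
    }

    /** Inverse of encode(). */
    static double decode(byte[] bytes) {
        return Double.longBitsToDouble(ByteBuffer.wrap(bytes).getLong());
    }

    /** Assumption: placeholder for the real "function of the value and an input double". */
    static double score(double stored, double input) {
        return stored * input;
    }

    /**
     * Sums score(stored, input) per column qualifier over the cells whose row key
     * appears in the input map, then returns the topN qualifiers by descending total.
     */
    static Map<String, Double> topN(List<Cell> cells, Map<String, Double> input, int topN) {
        Map<String, Double> totals = new HashMap<>();
        for (Cell cell : cells) {
            Double in = input.get(cell.rowKey);
            if (in == null) {
                continue; // row was not requested
            }
            totals.merge(cell.qualifier, score(decode(cell.value), in), Double::sum);
        }

        List<Map.Entry<String, Double>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        // LinkedHashMap keeps the descending order when the caller iterates the result.
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : sorted) {
            if (result.size() >= topN) {
                break;
            }
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

In a real endpoint the cells would come from the region scanner rather than a list, and each region would return its own partial totals for the client to merge before taking the final top N; that merge step is not shown here.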
