Kiru,
Is the column qualifier for the key value storing the double different for different rows? I'm not sure I understand what you're grouping over. Maybe five rows' worth of sample input and expected output would help.
Thanks,
James
On Aug 19, 2013, at 1:37 AM, Kiru Pakkirisamy <[email protected]> wrote:

> James,
> I have only one family, cp. Yes, that is how I store the Double. No, the doubles are always positive.
> The keys look like "A14568"; there are fewer than a million of them, and I added the alphabetic characters to randomize them.
> I group them based on the C_ suffix and, say, order them by the Double (to simplify it).
> Is there a way to do a sort of "user-defined function" on a column? That would take care of my calculation on that double.
> Thanks again.
>
> Regards,
> - kiru
>
>
> Kiru Pakkirisamy | webcloudtech.wordpress.com
>
>
> ________________________________
> From: James Taylor <[email protected]>
> To: Kiru Pakkirisamy <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Sent: Sunday, August 18, 2013 5:34 PM
> Subject: Re: Client Get vs Coprocessor scan performance
>
>
> Kiru,
> What's your column family name? Just to confirm, the column qualifier of your key value is C_10345 and this stores a value as a Double using Bytes.toBytes(double)? Are any of the Double values negative? Any other key values?
>
> Can you give me an idea of the kind of fuzzy filtering you're doing on the 7 char row key? We may want to model that as a set of row key columns in Phoenix to leverage the skip scan more.
>
> How about I model your aggregation as an AVG over a group of rows? What would your GROUP BY expression look like? Are you grouping based on a part of the 7 char row key? Or on some other key value?
>
> Thanks,
> James
>
>
> On Sun, Aug 18, 2013 at 2:16 PM, Kiru Pakkirisamy <[email protected]> wrote:
>
>> James,
>> Rowkey = String - len 7
>> Col = String - variable length - but looks like C_10345
>> Col value = Double
>>
>> If I can create a Phoenix schema mapping to this existing table, that would be great. I actually do a group by on the column values and return another value which is a function of the value and an input double value. Input is a Map<String, Double> and the return is also a Map<String, Double>.
>>
>>
>> Regards,
>> - kiru
>>
>>
>> Kiru Pakkirisamy | webcloudtech.wordpress.com
>>
>> ------------------------------
>> *From:* James Taylor <[email protected]>
>> *To:* [email protected]; Kiru Pakkirisamy <[email protected]>
>> *Sent:* Sunday, August 18, 2013 2:07 PM
>>
>> *Subject:* Re: Client Get vs Coprocessor scan performance
>>
>> Kiru,
>> If you're able to post the key values, row key structure, and data types you're using, I can post the Phoenix code to query against it. You're doing some kind of aggregation too, right? If you could explain that part too, that would be helpful. It's likely that you can just query the existing HBase data you've already created on the same cluster you're already using (provided you put the phoenix jar on all the region servers - use our 2.0.0 version that just came out). Might be interesting to compare the amount of code necessary in each approach as well.
>> Thanks,
>> James
>>
>>
>> On Sun, Aug 18, 2013 at 12:16 PM, Kiru Pakkirisamy <[email protected]> wrote:
>>
>> James,
>> I am using the FuzzyRowFilter or the Gets within a coprocessor. It looks like I cannot use your SkipScanFilter by itself, as it has lots of Phoenix imports. I thought of writing my own custom filter, and saw that the FuzzyRowFilter in the 0.94 branch also has an implementation of getNextKeyHint(); the catch is that it works well only with fixed-length keys if I want a complete match of the keys. After padding my keys to a fixed length it seems to be fine.
>> Once I confirm some key locality and other issues (like heap), I will try to benchmark this table alone against Phoenix on another cluster. Thanks.
>>
>> Regards,
>> - kiru
>>
>>
>> Kiru Pakkirisamy | webcloudtech.wordpress.com
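For reference, here is a minimal sketch of the fixed-length FuzzyRowFilter scan described above, against the 0.94 client API: every key is padded to 7 bytes and the mask is all zeros (every position must match exactly), which is what lets getNextKeyHint() seek from one requested key to the next. The class and variable names are illustrative, not taken from the thread.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
    import org.apache.hadoop.hbase.util.Pair;

    // Hypothetical sketch: scan a set of 7-byte row keys with FuzzyRowFilter.
    // A mask byte of 0 means "this position must match"; with every position
    // fixed, the filter's SEEK_NEXT_USING_HINT jumps straight between keys.
    public class FuzzyScanSketch {
        public static Scan buildScan(List<byte[]> paddedKeys) {
            List<Pair<byte[], byte[]>> fuzzyKeys = new ArrayList<Pair<byte[], byte[]>>();
            byte[] allFixed = new byte[7]; // all zeros: match every byte of the key
            for (byte[] key : paddedKeys) {
                fuzzyKeys.add(new Pair<byte[], byte[]>(key, allFixed));
            }
            Scan scan = new Scan();
            scan.setFilter(new FuzzyRowFilter(fuzzyKeys));
            return scan;
        }
    }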
>>
>>
>> ________________________________
>> From: James Taylor <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Cc: Kiru Pakkirisamy <[email protected]>
>> Sent: Sunday, August 18, 2013 11:44 AM
>> Subject: Re: Client Get vs Coprocessor scan performance
>>
>>
>> Would be interesting to compare against Phoenix's Skip Scan
>> (http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html),
>> which does a scan through a coprocessor and is more than 2x faster than multi Get (plus handles multi-range scans in addition to point gets).
>>
>> James
>>
>> On Aug 18, 2013, at 6:39 AM, Ted Yu <[email protected]> wrote:
>>
>>> bq. Get'ting 100 rows seems to be faster than the FuzzyRowFilter (mask on the whole length of the key)
>>>
>>> In this case the Get's are very selective. The number of rows the FuzzyRowFilter was evaluated against would be much higher.
>>> It would be nice if you remember the time each took.
>>>
>>> bq. Also, I am seeing very bad concurrent query performance
>>>
>>> Were the multi Get's performed by your coprocessor within the region boundary of the respective coprocessor? Just to confirm.
>>>
>>> bq. that would make Coprocessors almost single threaded across multiple invocations ?
>>>
>>> Let me dig into the code some more.
>>>
>>> Cheers
>>>
>>>
>>> On Sat, Aug 17, 2013 at 10:34 PM, Kiru Pakkirisamy <[email protected]> wrote:
>>>
>>>> Ted,
>>>> On a table with 600K rows, Get'ting 100 rows seems to be faster than the FuzzyRowFilter (mask on the whole length of the key). I thought the FuzzyRowFilter's SEEK_NEXT_USING_HINT would help. All this is on the client side; based on that client-side performance I have not changed my coprocessor to use the FuzzyRowFilter (it still does multiple Gets inside the coprocessor). Also, I am seeing very bad concurrent query performance. Is there anything that would make coprocessors almost single threaded across multiple invocations?
>>>> Again, all this is after putting in 0.94.10 (for HBASE-6870's sake), which seems to be very good at bringing the regions online fast and balanced. Thanks, and much appreciated.
>>>>
>>>> Regards,
>>>> - kiru
>>>>
>>>>
>>>> Kiru Pakkirisamy | webcloudtech.wordpress.com
>>>>
>>>>
>>>> ________________________________
>>>> From: Ted Yu <[email protected]>
>>>> To: "[email protected]" <[email protected]>
>>>> Sent: Saturday, August 17, 2013 4:19 PM
>>>> Subject: Re: Client Get vs Coprocessor scan performance
>>>>
>>>>
>>>> HBASE-6870 targeted the whole-table region scan done for each coprocessorService call, which exhibited itself through:
>>>>
>>>> HTable#coprocessorService -> getStartKeysInRange -> getStartEndKeys -> getRegionLocations -> MetaScanner.allTableRegions(getConfiguration(), getTableName(), false)
>>>>
>>>> With the fix, the cached region locations in HConnectionImplementation are used instead.
>>>>
>>>> Cheers
>>>>
>>>>
>>>> On Sat, Aug 17, 2013 at 2:21 PM, Asaf Mesika <[email protected]> wrote:
>>>>
>>>>> Ted, can you elaborate a little bit on why this issue boosts performance?
>>>>> I couldn't figure out from the issue comments whether the coprocessor exec call scans the entire .META. table or an entire table, to understand the actual improvement.
>>>>>
>>>>> Thanks!
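For comparison, here is a minimal sketch of the client-side multi-Get path being timed in the messages above, using the plain 0.94 HTable API. The cp column family and the C_xxxxx qualifiers holding Bytes.toBytes(double) values come from the thread; the class and method names and the flat qualifier-to-value result are assumptions for illustration.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical sketch: batched client-side Gets for a known set of row keys.
    public class MultiGetSketch {
        public static Map<String, Double> fetch(HTable table, List<byte[]> rowKeys)
                throws IOException {
            List<Get> gets = new ArrayList<Get>(rowKeys.size());
            for (byte[] row : rowKeys) {
                Get get = new Get(row);
                get.addFamily(Bytes.toBytes("cp")); // the single column family
                gets.add(get);
            }
            Result[] results = table.get(gets); // batched into per-server multi calls
            Map<String, Double> values = new HashMap<String, Double>();
            for (Result r : results) {
                for (KeyValue kv : r.raw()) {
                    // Values were written with Bytes.toBytes(double), so read them
                    // back with Bytes.toDouble; qualifiers look like C_10345.
                    values.put(Bytes.toString(kv.getQualifier()),
                            Bytes.toDouble(kv.getValue()));
                }
            }
            return values;
        }
    }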
>>>>>
>>>>>
>>>>> On Fri, Aug 9, 2013 at 8:44 AM, Ted Yu <[email protected]> wrote:
>>>>>
>>>>>> I think you need HBASE-6870, which went into 0.94.8.
>>>>>>
>>>>>> Upgrading should boost coprocessor performance.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Aug 8, 2013, at 10:21 PM, Kiru Pakkirisamy <[email protected]> wrote:
>>>>>>
>>>>>>> Ted,
>>>>>>> Here is the method signature/protocol:
>>>>>>> public Map<String, Double> getFoo(Map<String, Double> input, int topN) throws IOException;
>>>>>>>
>>>>>>> There are 31 regions on 4 nodes x 8 CPUs.
>>>>>>> I am on 0.94.6 (from Hortonworks).
>>>>>>> It seems to behave like what linwukang says - it is almost a full table scan in the coprocessor.
>>>>>>> Actually, when I set more specific ColumnPrefixFilters, performance went down.
>>>>>>> I want to do things on the server side because I don't want to be sending 500K column/values to the client.
>>>>>>> I cannot believe a single-threaded client which does some calculations and a group-by beats the coprocessor running in 31 regions.
>>>>>>>
>>>>>>> Regards,
>>>>>>> - kiru
>>>>>>>
>>>>>>>
>>>>>>> Kiru Pakkirisamy | webcloudtech.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: Ted Yu <[email protected]>
>>>>>>> To: [email protected]; Kiru Pakkirisamy <[email protected]>
>>>>>>> Sent: Thursday, August 8, 2013 8:40 PM
>>>>>>> Subject: Re: Client Get vs Coprocessor scan performance
>>>>>>>
>>>>>>>
>>>>>>> Can you give us a bit more information?
>>>>>>>
>>>>>>> How do you deliver the 55 rowkeys to your endpoint?
>>>>>>> How many regions do you have for this table?
>>>>>>>
>>>>>>> What HBase version are you using?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Thu, Aug 8, 2013 at 6:43 PM, Kiru Pakkirisamy <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I am finding an odd behavior, with the coprocessor performance lagging a client-side Get.
>>>>>>>> I have a table with 500000 rows. Each has a variable number of columns in one column family (in this case about 600000 columns in total are processed).
>>>>>>>> When I try to get a specific 55 rows, the client side completes in half the time of the coprocessor endpoint.
>>>>>>>> I am using 55 RowFilters on the coprocessor scan side. The rows are processed in exactly the same way in both cases.
>>>>>>>> Any pointers on how to debug this scenario?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> - kiru
>>>>>>>>
>>>>>>>>
>>>>>>>> Kiru Pakkirisamy | webcloudtech.wordpress.com
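Putting the pieces together, here is a hedged sketch of how a 0.94-style endpoint with the getFoo signature quoted above would be invoked from the client. Passing null start and end keys fans the call out to every region of the table, which is exactly the path HBASE-6870 made cheaper by reusing cached region locations. The protocol name, class names, and the naive merge of per-region results are assumptions, not code from the thread.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.coprocessor.Batch;
    import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;

    // Hypothetical 0.94-style endpoint protocol; only the getFoo signature
    // itself comes from the thread.
    interface FooProtocol extends CoprocessorProtocol {
        Map<String, Double> getFoo(Map<String, Double> input, int topN) throws IOException;
    }

    public class EndpointCallSketch {
        public static Map<String, Double> callAllRegions(HTable table,
                final Map<String, Double> input, final int topN) throws Throwable {
            // coprocessorExec fans the call out to every region in [startKey, endKey);
            // null/null means all regions of the table.
            Map<byte[], Map<String, Double>> perRegion = table.coprocessorExec(
                    FooProtocol.class, null, null,
                    new Batch.Call<FooProtocol, Map<String, Double>>() {
                        public Map<String, Double> call(FooProtocol instance) throws IOException {
                            return instance.getFoo(input, topN);
                        }
                    });
            Map<String, Double> merged = new HashMap<String, Double>();
            for (Map<String, Double> partial : perRegion.values()) {
                merged.putAll(partial); // naive merge; real code would re-rank the topN
            }
            return merged;
        }
    }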
