Something I'm not getting: why not use separate tables instead of CFs for a single table? Simply name your tables tablename_cfname and you get rid of the CF# limitation?
Or are there big pros to having CFs?

JM

2013/4/8 Anoop John <[email protected]>:
> Agree here. The effectiveness depends on what % of data satisfies the
> condition and how it is distributed across HFile blocks. We will get a
> performance gain when we are able to skip some HFile blocks (from
> non-essential CFs). Can you test with a different HFile block size (lower value)?
>
> -Anoop-
>
> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
>
>> I made the following change in TestJoinedScanners.java:
>>
>> -      int flag_percent = 1;
>> +      int flag_percent = 40;
>>
>> The test took longer but still favors joined scanner.
>> I got some new results:
>>
>> 2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157):
>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>> ...
>> 2013-04-08 07:46:12,010 INFO [main] regionserver.TestJoinedScanners(157):
>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>
>> 2013-04-08 07:46:18,358 INFO [main] regionserver.TestJoinedScanners(157):
>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>> ...
>> 2013-04-08 07:46:22,946 INFO [main] regionserver.TestJoinedScanners(157):
>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>
>> Looks like the effectiveness of the joined scanner is affected by the
>> distribution of data.
>>
>> Cheers
>>
>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
>>
>> > Looking at the joined scanner test code, it sets it up such that 1% of the
>> > rows match, which would somewhat be in line with James' results.
>> >
>> > In my own testing a while ago I found a 100% improvement with 0% match.
>> >
>> > -- Lars
>> >
>> > ________________________________
>> > From: Ted Yu <[email protected]>
>> > To: [email protected]
>> > Sent: Sunday, April 7, 2013 4:13 PM
>> > Subject: Re: Essential column family performance
>> >
>> > I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>> > reference.
>> >
>> > On my MacBook, I got the following results from the test:
>> >
>> > 2013-04-07 16:08:17,474 INFO [main] regionserver.TestJoinedScanners(157):
>> > Slow scanner finished in 7.973822 seconds, got 100 rows
>> > ...
>> > 2013-04-07 16:08:17,946 INFO [main] regionserver.TestJoinedScanners(157):
>> > Joined scanner finished in 0.47235 seconds, got 100 rows
>> >
>> > Cheers
>> >
>> > On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>> >
>> > > Looking at
>> > > https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
>> > > I found that it didn't contain TestJoinedScanners, which shows the
>> > > difference in scanner performance:
>> > >
>> > >     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>> > >         Double.toString(timeSec)
>> > >         + " seconds, got " + Long.toString(rows_count/2) + " rows");
>> > >
>> > > The test uses SingleColumnValueFilter:
>> > >
>> > >     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>> > >         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>> > >
>> > > It is possible that the custom filter you were using would exhibit a
>> > > different access pattern compared to SingleColumnValueFilter, e.g. does
>> > > your filter utilize hints?
>> > > It would be easier for me and other people to reproduce the issue you
>> > > experienced if you put your scenario in some test similar to
>> > > TestJoinedScanners.
>> > >
>> > > Will take a closer look at the code Monday.
>> > >
>> > > Cheers
>> > >
>> > > On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
>> > >
>> > >> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>> > >> filterIfMissing isn't the issue - the results of the scan are correct.
>> > >>
>> > >> I can see that if the essential column family has more data compared to
>> > >> the non essential column family that the results would eventually even out.
>> > >> I was hoping to always be able to enable the essential column family
>> > >> feature. Is there an inherent reason why performance would degrade like
>> > >> this? Does it boil down to a single sequential scan versus many seeks?
>> > >>
>> > >> Thanks,
>> > >>
>> > >> James
>> > >>
>> > >> On 04/07/2013 07:44 AM, Ted Yu wrote:
>> > >>
>> > >>> James:
>> > >>> Your test was based on 0.94.6.1, right?
>> > >>>
>> > >>> What Filter were you using?
>> > >>>
>> > >>> If you used SingleColumnValueFilter, have you seen my comment here?
>> > >>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>> > >>>
>> > >>> BTW the use case Max Lapan tried to address has the non essential column
>> > >>> family carrying considerably more data compared to the essential column
>> > >>> family.
>> > >>>
>> > >>> Cheers
>> > >>>
>> > >>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>> > >>>
>> > >>>> Hello,
>> > >>>> We're doing some performance testing of the essential column family
>> > >>>> feature, and we're seeing some performance degradation when comparing
>> > >>>> with and without the feature enabled:
>> > >>>>
>> > >>>>                            Performance of scan relative
>> > >>>>  % of rows selected        to not enabling the feature
>> > >>>>  ---------------------     ----------------------------
>> > >>>>  100%                      1.0x
>> > >>>>  80%                       2.0x
>> > >>>>  60%                       2.3x
>> > >>>>  40%                       2.2x
>> > >>>>  20%                       1.5x
>> > >>>>  10%                       1.0x
>> > >>>>  5%                        0.67x
>> > >>>>  0%                        0.30x
>> > >>>>
>> > >>>> In our scenario, we have two column families. The key value from the
>> > >>>> essential column family is used in the filter, while the key value from
>> > >>>> the other, non essential column family is returned by the scan. Each row
>> > >>>> contains values for both key values, with the values being relatively
>> > >>>> narrow (less than 50 bytes). In this scenario, the only time we're
>> > >>>> seeing a performance gain is when less than 10% of the rows are selected.
>> > >>>>
>> > >>>> Is this a reasonable test? Has anyone else measured this?
>> > >>>>
>> > >>>> Thanks,
>> > >>>>
>> > >>>> James
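
Anoop's point in the thread above (the gain comes from skipping whole HFile blocks of the non-essential CF) can be put into rough numbers. The sketch below is only a back-of-the-envelope model, not HBase code: the class name BlockSkipEstimate and the method skippableFraction are hypothetical, and it assumes every block holds the same number of rows and that each row matches the filter independently with the same probability. Under those assumptions a block can be skipped only if none of its rows match.

```java
// Hypothetical back-of-the-envelope model; BlockSkipEstimate is not an HBase class.
public class BlockSkipEstimate {

    /**
     * Fraction of non-essential-CF blocks that contain no matching row,
     * assuming each row matches independently with probability matchFraction
     * and every block holds exactly rowsPerBlock rows (both are assumptions).
     */
    public static double skippableFraction(double matchFraction, int rowsPerBlock) {
        if (matchFraction < 0.0 || matchFraction > 1.0 || rowsPerBlock < 1) {
            throw new IllegalArgumentException(
                "need 0 <= matchFraction <= 1 and rowsPerBlock >= 1");
        }
        // A block is skippable only when all of its rows fail the filter.
        return Math.pow(1.0 - matchFraction, rowsPerBlock);
    }

    public static void main(String[] args) {
        double[] matchFractions = {0.0, 0.01, 0.05, 0.10, 0.40};
        int[] rowsPerBlock = {10, 100, 1000}; // illustrative values only
        for (int rows : rowsPerBlock) {
            for (double p : matchFractions) {
                System.out.printf("rows/block=%d match=%.0f%% skippable=%.4f%n",
                        rows, p * 100, skippableFraction(p, rows));
            }
        }
    }
}
```

In this model the skippable fraction falls off quickly as either the match percentage or the number of rows per block grows, which is one way to read the suggestion to try a smaller HFile block size. Real data is rarely uniformly scattered, so clustered matches could leave far more blocks skippable than this estimate suggests.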
