Thanks for the background information. To answer your last question: yes. The columns given by addColumn() calls (which the ColumnTracker uses) are checked first, so a cell must pass that check before the filter can include it.
Relaxing this limitation may take some effort. ScanQueryMatcher could take the Filter passed by the user into account, but the filter may not be a ColumnRangeFilter; it can be a FilterList that involves a ColumnRangeFilter. Adding such logic to ScanQueryMatcher#match() would make the code less maintainable.

Can you check whether the column passed to addColumn() is covered by the ColumnRangeFilter and, if so, skip the addColumn() call?

Cheers

On Thu, Jan 15, 2015 at 11:35 PM, Taeyun Kim <[email protected]> wrote:

> It's a somewhat long story.
> Maybe I use HBase in some weird way.
>
> My use case is as follows:
>
> I didn't want to put many small files into HDFS. (Since it is bad for HDFS,
> both for scalability and performance)
>
> The small files are grouped by a test log, since the files are many facets
> of the result of the analysis of one test log. So, they could be the
> members of one SequenceFile.
> But I found SequenceFile (or other similar ones) not attractive, since
> anyway I would get many not-so-big (about 20MB, except for rare cases)
> SequenceFiles, since the analysis result files are not so big and the
> test log files are continually generated.
> So some manual file management and merging could be a must.
>
> So, I decided to use an HBase record as a kind of 'directory' to avoid the
> manual file management. (directory = file group)
> By this, the 'files' are automatically 'merged' into appropriately sized
> HFiles, and as a bonus the 'files' can be automatically deleted when their
> lifetime is done.
>
> The 'directory' has the following files.
>
> - 'm': meta file. (to check the version of the 'directory' format)
> - 'Result.csv.0'
> - 'Result.csv.1'
> - ...
> - 'Result.csv.p': parts file. (has the split count and each size. 'p' is
>   for 'parts')
> - 'AnotherResultA.csv.0'
> - 'AnotherResultA.csv.1'
> - ...
> - 'AnotherResultA.csv.p'
> - 'TestEnvironment.txt'
>
> Each 'file' is saved as a column.
>
> Result files are split for the following reasons:
> - To handle the extreme case where the file is too big to be processed by
>   one task.
> - To save task process memory: the split size is actually smaller than
>   64MB (size for one task) and individually compressed. By this, a task
>   process can have at most one column uncompressed. A task is assigned
>   multiple 'splits'.
>
> For this, I've written an InputFormat class.
>
> Now, the InputFormat class can first Get both 'm' and a parts file to get
> the inputSplit information. This is not a problem. A single Get with 2
> addColumn() calls is sufficient.
> But when the whole content of a file must be read (like
> Files.readAllBytes()), it must Get 'm' and an unknown number of splits that
> have a name range (Result.csv.0 ~ Result.csv.7) to get the whole content
> with a single Get. (addColumn() + ColumnRangeFilter)
> But with the current HBase status, it seems that I have to invoke 2 Gets,
> or disable the version check. (Maybe not a big deal?)
>
> That's all.
>
> If you think that this record is not efficient, or there is a better
> solution, please let me know.
>
> BTW, for the current status, when both addColumn() and ColumnRangeFilter
> are applied, they are practically combined by the 'AND' operator. Right?
>
> -----Original Message-----
> From: Ted Yu [mailto:[email protected]]
> Sent: Friday, January 16, 2015 3:39 PM
> To: [email protected]
> Subject: Re: Get addColumn + ColumnRangeFilter
>
> I reproduced the failed test (testAddColumnWithColumnRangeFilter) after
> modifying your test case to fit master branch.
>
> The reason for one Cell being returned is that ExplicitColumnTracker is
> used by ScanQueryMatcher to first check if the column is part of the
> requested columns (f:fc in your case). The other columns don't pass this
> check, hence they're not included in the result.
>
> Before this part of code is changed, can I ask why you need to call
> g.addColumn() when g has a ColumnRangeFilter associated with it?
>
> Cheers
>
> On Thu, Jan 15, 2015 at 6:22 PM, Taeyun Kim <[email protected]>
> wrote:
>
> > (Sorry if this mail is a duplicate)
> >
> > Hi Ted,
> >
> > I've attached 2 unit test classes.
> >
> > Both have one failed test.
> >
> > - HBaseAddColumnWithColumnRangeFilterTest1.testAddColumnWithColumnRangeFilter():
> >   Expected: 10, Actual 1
> > - HBaseAddColumnWithColumnRangeFilterTest2.testAddColumnWithColumnRangeFilter():
> >   Result is empty
> >
> > If the tests have problems, please let me know.
> >
> >
> > -----Original Message-----
> > From: Ted Yu [mailto:[email protected]]
> > Sent: Thursday, January 15, 2015 6:59 PM
> > To: [email protected]
> > Subject: Re: Get addColumn + ColumnRangeFilter
> >
> > Can you write a unit test which shows this behavior?
> >
> > Thanks
> >
> > > On Jan 14, 2015, at 9:09 PM, Taeyun Kim <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > I have a situation where both Get.addColumn() and Get.setFilter(new
> > > ColumnRangeFilter(…)) are needed for a Get.
> > >
> > > The source code snippet is as follows:
> > >
> > >     Get g = new Get(getRowKey(lfileId));
> > >     g.addColumn(Schema.ColumnFamilyNameBytes, MetaColumnNameBytes);
> > >     g.setFilter(new ColumnRangeFilter(Bytes.toBytes(name), false,
> > >         Bytes.toBytes(name + "~"), false));
> > >     Result r = table.get(g);
> > >
> > >     if (r.isEmpty())
> > >         throw new FileNotFoundException(
> > >             String.format("%d:%d:%s", projectId, lfileId, name));
> > >
> > > When g.addColumn() is commented out, the Result is not empty, while
> > > with g.addColumn() the Result is empty (FileNotFoundException is thrown).
> > >
> > > Is it illegal to use both methods?
> > >
> > > BTW, the version of HBase used is 0.98. (Hortonworks HDP 2.1)
> > >
> > > Thanks.
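
For reference, here is a minimal sketch of the client-side check suggested at the top of this thread: deciding whether a qualifier is already covered by the range given to ColumnRangeFilter, so that the addColumn() call can be skipped. It is plain Java with no HBase dependency. The class name RangeCoverage and the helpers isCovered() and utf8() are hypothetical names made up for illustration, and it assumes that qualifiers are ordered as unsigned bytes and that a null bound means "unbounded". Please verify both assumptions against the HBase version you run.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of HBase.
final class RangeCoverage {

    // True when qualifier lies inside the range [min, max], honouring the
    // inclusive flags. Assumption: a null bound means "no bound on that side".
    static boolean isCovered(byte[] qualifier,
                             byte[] min, boolean minInclusive,
                             byte[] max, boolean maxInclusive) {
        if (min != null) {
            // Lexicographic comparison treating each byte as unsigned (Java 9+).
            int c = Arrays.compareUnsigned(qualifier, min);
            if (c < 0 || (c == 0 && !minInclusive)) {
                return false;
            }
        }
        if (max != null) {
            int c = Arrays.compareUnsigned(qualifier, max);
            if (c > 0 || (c == 0 && !maxInclusive)) {
                return false;
            }
        }
        return true;
    }

    // Convenience conversion for the demo below.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same shape as the range in the original snippet: (name, name + "~").
        byte[] min = utf8("Result.csv");
        byte[] max = utf8("Result.csv~");

        // A split column falls inside the range, so addColumn() is not needed for it.
        System.out.println(isCovered(utf8("Result.csv.0"), min, false, max, false)); // true

        // The meta column 'm' falls outside the range.
        System.out.println(isCovered(utf8("m"), min, false, max, false)); // false
    }
}
```

For the layout described in this thread, 'm' is outside the Result.csv range, so the check reports that the meta column is not covered. In that case skipping addColumn() would lose 'm', and a second Get (or another approach) is still needed.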
