It's a somewhat long story. Maybe I am using HBase in a weird way. My use case is as follows:
I didn't want to put many small files into HDFS, since that is bad for HDFS in both scalability and performance. The small files are grouped by test log: they are different facets of the analysis result for one test log, so they could be the members of one SequenceFile. But SequenceFile (and similar formats) did not look attractive, because I would still end up with many not-so-big SequenceFiles (about 20MB each, except for rare cases): the analysis result files are not that big, and the test log files are generated continually. So some manual file management and merging would be unavoidable.

So I decided to use an HBase record as a kind of 'directory' (directory = file group) to avoid the manual file management. This way the 'files' are automatically 'merged' into appropriately sized HFiles, and as a bonus the 'files' can be deleted automatically when their lifetime is over.

The 'directory' has the following files:

- 'm': meta file. (to check the version of the 'directory' format)
- 'Result.csv.0'
- 'Result.csv.1'
- ...
- 'Result.csv.p': parts file. (has the split count and the size of each split. 'p' is for 'parts')
- 'AnotherResultA.csv.0'
- 'AnotherResultA.csv.1'
- ...
- 'AnotherResultA.csv.p'
- 'TestEnvironment.txt'

Each 'file' is saved as a column.

Result files are split for the following reasons:

- To handle the extreme case where a file is too big to be processed by one task.
- To save task process memory: the split size is actually smaller than 64MB (the size for one task), and each split is individually compressed. This way, a task process holds at most one uncompressed column at a time.

A task is assigned multiple 'splits'. For this, I've written an InputFormat class.

The InputFormat class can first Get both 'm' and a parts file to obtain the InputSplit information. This is not a problem: a single Get with 2 addColumn() calls is sufficient.
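To make the column naming above concrete, here is a small plain-Java sketch. It is only an illustration: the class and method names (DirectoryLayout, splitColumns, partsColumn, inFileRange) are hypothetical and are not part of HBase or of my actual code. The range bounds (name, exclusive, to name + "~", exclusive) are the ones from the snippet in my original mail quoted below, and the sketch assumes column qualifiers are ordered as unsigned bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; none of these names exist in HBase.
class DirectoryLayout {

    // Column names of the numbered splits of one 'file': name.0, name.1, ...
    static List<String> splitColumns(String name, int splitCount) {
        List<String> columns = new ArrayList<>();
        for (int i = 0; i < splitCount; i++) {
            columns.add(name + "." + i);
        }
        return columns;
    }

    // Column name of the parts file that describes the splits of one 'file'.
    static String partsColumn(String name) {
        return name + ".p";
    }

    // True if the qualifier lies strictly between name and name + "~",
    // compared lexicographically as unsigned bytes (an assumption of this sketch).
    static boolean inFileRange(String name, String qualifier) {
        byte[] q = qualifier.getBytes(StandardCharsets.UTF_8);
        byte[] min = name.getBytes(StandardCharsets.UTF_8);
        byte[] max = (name + "~").getBytes(StandardCharsets.UTF_8);
        return Arrays.compareUnsigned(min, q) < 0
                && Arrays.compareUnsigned(q, max) < 0;
    }

    public static void main(String[] args) {
        List<String> columns = List.of("m", "Result.csv.0", "Result.csv.7",
                "Result.csv.p", "AnotherResultA.csv.0", "TestEnvironment.txt");
        for (String column : columns) {
            System.out.println(column + " -> " + inFileRange("Result.csv", column));
        }
    }
}
```

Under these assumptions, one exclusive range covers every split of 'Result.csv' (and its parts file 'Result.csv.p' as well), but never 'm', because 'm' sorts after "Result.csv~".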
But when the whole content of a file must be read (like Files.readAllBytes()), I must Get 'm' plus an unknown number of splits whose names form a range (Result.csv.0 ~ Result.csv.7), and I would like to do that with a single Get (addColumn() + ColumnRangeFilter). With the current HBase behavior, though, it seems that I have to issue 2 Gets, or disable the version check. (Maybe not a big deal?)

That's all. If you think that this record design is not efficient, or that there is a better solution, please let me know.

BTW, with the current behavior, when both addColumn() and ColumnRangeFilter are applied, they are effectively combined with an 'AND' operator. Right?

-----Original Message-----
From: Ted Yu [mailto:[email protected]]
Sent: Friday, January 16, 2015 3:39 PM
To: [email protected]
Subject: Re: Get addColumn + ColumnRangeFilter

I reproduced the failed test (testAddColumnWithColumnRangeFilter) after modifying your test case to fit master branch.

The reason for one Cell being returned is that ExplicitColumnTracker is used by ScanQueryMatcher to first check if the column is part of the requested columns (f:fc in your case). The other columns don't pass this check, hence they're not included in the result.

Before this part of code is changed, can I ask why you need to call g.addColumn() when g has ColumnRangeFilter associated with it.

Cheers

On Thu, Jan 15, 2015 at 6:22 PM, Taeyun Kim <[email protected]> wrote:

> (Sorry if this mail is a duplicate)
>
> Hi Ted,
>
> I've attached 2 unit test classes.
>
> Both have one failed test.
>
> - HBaseAddColumnWithColumnRangeFilterTest1.testAddColumnWithColumnRangeFilter(): Expected: 10, Actual 1
> - HBaseAddColumnWithColumnRangeFilterTest2.testAddColumnWithColumnRangeFilter(): Result is empty
>
> If the tests have problems, please let me know.
>
> -----Original Message-----
> From: Ted Yu [mailto:[email protected]]
> Sent: Thursday, January 15, 2015 6:59 PM
> To: [email protected]
> Subject: Re: Get addColumn + ColumnRangeFilter
>
> Can you write a unit test which shows this behavior?
>
> Thanks
>
> On Jan 14, 2015, at 9:09 PM, Taeyun Kim <[email protected]> wrote:
> >
> > Hi,
> >
> > I have a situation where both Get.addColumn() and Get.setFilter(new ColumnRangeFilter(…)) are needed for one Get.
> >
> > The source code snippet is as follows:
> >
> > Get g = new Get(getRowKey(lfileId));
> > g.addColumn(Schema.ColumnFamilyNameBytes, MetaColumnNameBytes);
> > g.setFilter(new ColumnRangeFilter(Bytes.toBytes(name), false,
> >     Bytes.toBytes(name + "~"), false));
> > Result r = table.get(g);
> >
> > if (r.isEmpty())
> >     throw new FileNotFoundException(
> >         String.format("%d:%d:%s", projectId, lfileId, name));
> >
> > When g.addColumn() is commented out, the Result is not empty, while with g.addColumn() the Result is empty (FileNotFoundException is thrown).
> >
> > Is it illegal to use both methods?
> >
> > BTW, the version of HBase used is 0.98. (Hortonworks HDP 2.1)
> >
> > Thanks.
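The behavior discussed in this thread (the explicit-column check runs first, and the filter only sees columns that passed it) can be illustrated with a toy model. This is a plain-Java sketch, not HBase code: ColumnSelectionModel, select and inRange are hypothetical names, the row content is made up from the column names used above, and the unsigned byte ordering of qualifiers is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical names, not the HBase implementation.
class ColumnSelectionModel {

    // Exclusive range check on qualifiers, compared as unsigned bytes.
    static boolean inRange(String qualifier, String min, String max) {
        byte[] q = qualifier.getBytes(StandardCharsets.UTF_8);
        return Arrays.compareUnsigned(min.getBytes(StandardCharsets.UTF_8), q) < 0
                && Arrays.compareUnsigned(q, max.getBytes(StandardCharsets.UTF_8)) < 0;
    }

    // A column is returned only if it passes BOTH checks:
    // 1) it is one of the explicitly requested columns (skipped if none were requested),
    // 2) it falls inside the column range.
    static List<String> select(List<String> rowColumns, Set<String> explicitColumns,
                               String min, String max) {
        List<String> result = new ArrayList<>();
        for (String qualifier : rowColumns) {
            boolean requested = explicitColumns.isEmpty()
                    || explicitColumns.contains(qualifier);
            if (requested && inRange(qualifier, min, max)) {
                result.add(qualifier);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> row = List.of("Result.csv.0", "Result.csv.1", "Result.csv.p", "m");
        String min = "Result.csv";
        String max = "Result.csv~";
        System.out.println(select(row, Set.of("m"), min, max));
        System.out.println(select(row, Set.of(), min, max));
        System.out.println(select(row, Set.of("Result.csv.0"), min, max));
    }
}
```

In this model, requesting 'm' together with the range gives an empty result, which would be consistent with the FileNotFoundException case; requesting a column that lies inside the range gives exactly that one column, which would be consistent with the "Expected: 10, Actual 1" test. Under the same model, the two-Get approach (one Get for 'm', one Get with only the range filter) returns everything needed.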
