RE: Get addColumn + ColumnRangeFilter

Taeyun Kim Sun, 18 Jan 2015 18:03:06 -0800

Thank you!

In fact, I had forgotten that FilterList exists.
It works well!


-----Original Message-----
From: Ted Yu [mailto:[email protected]] 
Sent: Monday, January 19, 2015 9:46 AM
To: [email protected]
Subject: Re: Get addColumn + ColumnRangeFilter

If the number of splits is greater than 1, you can use FilterList with two 
ColumnRangeFilters when needed.
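
For readers of the archive, here is a minimal sketch of what "two ranges combined with OR" selects. It is a toy model in plain Java, not HBase code: the class TwoRangeModel and its methods are hypothetical, and the choice of ranges (one for 'm', one for the Result.csv.* splits) is only one plausible reading of the suggestion. In HBase itself the same idea would presumably be a FilterList built with FilterList.Operator.MUST_PASS_ONE around two ColumnRangeFilter instances.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy model only: hypothetical names, plain Java, no HBase classes involved.
class TwoRangeModel {
    static byte[] b(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Models one column range in unsigned lexicographic byte order,
    // with each bound either inclusive or exclusive.
    static Predicate<String> range(String min, boolean minInc, String max, boolean maxInc) {
        return col -> {
            int lo = Arrays.compareUnsigned(b(col), b(min));
            int hi = Arrays.compareUnsigned(b(col), b(max));
            return (minInc ? lo >= 0 : lo > 0) && (maxInc ? hi <= 0 : hi < 0);
        };
    }

    // Keeps the columns accepted by the filter, preserving input order.
    static List<String> select(List<String> cols, Predicate<String> filter) {
        return cols.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> cols = List.of("m", "Result.csv.0", "Result.csv.1", "Result.csv.p",
                "AnotherResultA.csv.0", "TestEnvironment.txt");
        // "Must pass one": a column is kept if either range accepts it.
        Predicate<String> either = range("m", true, "m", true)
                .or(range("Result.csv", false, "Result.csv~", false));
        System.out.println(select(cols, either));
    }
}
```

With the column names from this thread, the combined predicate keeps 'm' and the three Result.csv.* columns and drops the rest, which is the single-Get result the original question was after.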

Cheers

On Sun, Jan 18, 2015 at 4:37 PM, Taeyun Kim <[email protected]>
wrote:

> Thanks.
>
> But in my case it is unlikely that the FirstColumnName would be 
> included in the range. (If it is included, it would cause a problem.)
>
> Instead, since the number of splits is usually 1, I will include the 
> name of the first split in the first Get with addColumn(). With that, 
> most queries can be satisfied with a single Get.
>
> Thanks again.
>
> -----Original Message-----
> From: Ted Yu [mailto:[email protected]]
> Sent: Saturday, January 17, 2015 6:31 AM
> To: [email protected]
> Subject: Re: Get addColumn + ColumnRangeFilter
>
> To clarify what I meant, the test passes with the following change:
>
>       Get g = new Get(RowKey);
>       byte[] minColumn = new byte[]{(byte)0};
>       int cmpMin = Bytes.compareTo(FirstColumnNameBytes, 0, FirstColumnNameBytes.length,
>         minColumn, 0, minColumn.length);
>       byte[] maxColumn = Bytes.toBytes("~");
>       int cmpMax = Bytes.compareTo(FirstColumnNameBytes, 0, FirstColumnNameBytes.length,
>         maxColumn, 0, maxColumn.length);
>       if (cmpMin <= 0 || cmpMax >= 0) {
>         g.addColumn(ColumnFamilyNameBytes, FirstColumnNameBytes);  // should be redundant...
>       }
>       g.setFilter(new ColumnRangeFilter(minColumn, false,
>         maxColumn, false));  // ...since this includes the first column
>
> FYI
>
> On Fri, Jan 16, 2015 at 7:23 AM, Ted Yu <[email protected]> wrote:
>
> > Thanks for the background information.
> >
> > For your last question, the columns given by addColumn() calls (which 
> > ColumnTracker uses) are checked first.
> > So yes.
> >
> > Relaxing this limitation may take some effort: ScanQueryMatcher could 
> > take the Filter the user passes into account, but that filter may not be a 
> > ColumnRangeFilter. It can be a FilterList involving ColumnRangeFilter.
> > Adding such logic to ScanQueryMatcher#match() would make the code less 
> > maintainable.
> >
> > Can you check whether the column in addColumn() is covered by the 
> > ColumnRangeFilter and if so, do not call addColumn() ?
> >
> > Cheers
> >
> > On Thu, Jan 15, 2015 at 11:35 PM, Taeyun Kim 
> > <[email protected]>
> > wrote:
> >
> >> It's a somewhat long story.
> >> Maybe I use HBase in some weird way.
> >>
> >> My use case is as follows:
> >>
> >> I didn't want to put many small files into HDFS. (It is bad 
> >> for HDFS, both for scalability and performance.)
> >>
> >> The small files are grouped by test log, since the files are many 
> >> facets of the result of the analysis of one test log. So, they 
> >> could be the members of one SequenceFile.
> >> But I did not find SequenceFile (or similar formats) attractive, 
> >> since I would still end up with many not-so-big (about 20MB, except 
> >> for rare cases) SequenceFiles: the analysis result files are not so 
> >> big and the test log files are continually generated.
> >> So some manual file management and merging would be a must.
> >>
> >> So, I decided to use an HBase record as a kind of 'directory' to 
> >> avoid the manual file management. (directory = file group) This way, 
> >> the 'files' are automatically 'merged' into appropriately sized 
> >> HFiles, and as a bonus the 'files' can be automatically deleted 
> >> when their lifetime ends.
> >>
> >> The 'directory' has the following files.
> >>
> >> - 'm': meta file. (to check the version of the 'directory' format)
> >> - 'Result.csv.0'
> >> - 'Result.csv.1'
> >> - ...
> >> - 'Result.csv.p': parts file. (has the split count and each size. 'p'
> >> is for 'parts')
> >> - 'AnotherResultA.csv.0'
> >> - 'AnotherResultA.csv.1'
> >> - ...
> >> - 'AnotherResultA.csv.p'
> >> - 'TestEnvironment.txt'
> >>
> >> Each 'file' is saved as a column.
> >>
> >> Result files are split for the following reasons:
> >> - To handle the extreme case where the file is too big to be processed 
> >> by one task.
> >> - To save task process memory: the split size is actually 
> >> smaller than 64MB (the size for one task) and each split is individually 
> >> compressed. This way, a task process has at most one column uncompressed 
> >> at a time. A task is assigned multiple 'splits'.
> >>
> >> For this, I've written an InputFormat class.
> >>
> >> Now, the InputFormat class can first Get both 'm' and a parts file 
> >> to get the inputSplit information. This is not a problem. A single 
> >> Get with 2 addColumn() calls is sufficient.
> >> But when the whole content of a file must be read (like 
> >> Files.readAllBytes()), I must Get 'm' and an unknown number of splits 
> >> that share a name range (Result.csv.0 ~ Result.csv.7) to read the whole 
> >> content with a single Get (addColumn() + ColumnRangeFilter). But with 
> >> HBase as it currently stands, it seems that I have to invoke 2 Gets, or 
> >> disable the version check. (Maybe not a big deal?)
> >>
> >> That's all.
> >>
> >> If you think that this Record is not efficient, or there is a better 
> >> solution, please let me know.
> >>
> >> BTW, as things currently stand, when both addColumn() and 
> >> ColumnRangeFilter are applied, they are effectively combined with an 
> >> 'AND' operator. Right?
> >>
> >> -----Original Message-----
> >> From: Ted Yu [mailto:[email protected]]
> >> Sent: Friday, January 16, 2015 3:39 PM
> >> To: [email protected]
> >> Subject: Re: Get addColumn + ColumnRangeFilter
> >>
> >> I reproduced the failed test (testAddColumnWithColumnRangeFilter)
> >> after modifying your test case to fit master branch.
> >>
> >> The reason for one Cell being returned is that 
> >> ExplicitColumnTracker is used by ScanQueryMatcher to first check if 
> >> the column is part of the requested columns (f:fc in your case). 
> >> The other columns don't pass this check, hence they're not included in the 
> >> result.
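
To make the two-stage check described above concrete, here is a toy model in plain Java. It is only a sketch of the observed behavior, not the ScanQueryMatcher or ExplicitColumnTracker implementation: the class AndSemanticsModel and its methods are hypothetical, and the model assumes that an empty set of requested columns means "no explicit column restriction".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Toy model only: these names are hypothetical and this is not HBase code.
class AndSemanticsModel {
    // True if min < column < max in unsigned lexicographic byte order
    // (both bounds exclusive, as in the Get from the original question).
    static boolean inOpenRange(String column, String min, String max) {
        byte[] c = column.getBytes(StandardCharsets.UTF_8);
        return Arrays.compareUnsigned(c, min.getBytes(StandardCharsets.UTF_8)) > 0
            && Arrays.compareUnsigned(c, max.getBytes(StandardCharsets.UTF_8)) < 0;
    }

    // Stage 1: if columns were requested explicitly, drop every other column.
    // Stage 2: apply the range filter to whatever survived stage 1.
    static List<String> select(List<String> rowColumns, Set<String> requested,
                               String min, String max) {
        List<String> out = new ArrayList<>();
        for (String col : rowColumns) {
            if (!requested.isEmpty() && !requested.contains(col)) {
                continue; // rejected before the filter is ever consulted
            }
            if (inOpenRange(col, min, max)) {
                out.add(col);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> row = List.of("m", "Result.csv.0", "Result.csv.1", "Result.csv.p");
        // Filter only: every column inside the range comes back.
        System.out.println(select(row, Set.of(), "Result.csv", "Result.csv~"));
        // Explicit column 'm' plus the filter: 'm' is outside the range,
        // and the in-range columns were never requested.
        System.out.println(select(row, Set.of("m"), "Result.csv", "Result.csv~"));
    }
}
```

In this model the second call returns nothing, mirroring the "Result is empty" failure, while requesting a column that lies inside the range would return exactly that one column, mirroring "Expected: 10, Actual 1".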
> >>
> >> Before this part of the code is changed, can I ask why you need to call
> >> g.addColumn() when g has a ColumnRangeFilter associated with it?
> >>
> >> Cheers
> >>
> >> On Thu, Jan 15, 2015 at 6:22 PM, Taeyun Kim 
> >> <[email protected]>
> >> wrote:
> >>
> >> > (Sorry if this mail is a duplicate)
> >> >
> >> > Hi Ted,
> >> >
> >> > I've attached 2 unit test classes.
> >> >
> >> > Both have one failed test.
> >> >
> >> > - HBaseAddColumnWithColumnRangeFilterTest1.testAddColumnWithColumnRangeFilter():
> >> >   Expected: 10, Actual 1
> >> > - HBaseAddColumnWithColumnRangeFilterTest2.testAddColumnWithColumnRangeFilter():
> >> >   Result is empty
> >> >
> >> > If the tests have problems, please let me know.
> >> >
> >> >
> >> > -----Original Message-----
> >> > From: Ted Yu [mailto:[email protected]]
> >> > Sent: Thursday, January 15, 2015 6:59 PM
> >> > To: [email protected]
> >> > Subject: Re: Get addColumn + ColumnRangeFilter
> >> >
> >> > Can you write a unit test which shows this behavior?
> >> >
> >> > Thanks
> >> >
> >> >
> >> >
> >> > > On Jan 14, 2015, at 9:09 PM, Taeyun Kim <[email protected]> wrote:
> >> > >
> >> > > Hi,
> >> > >
> >> > >
> >> > >
> >> > > I have a situation where both Get.addColumn() and 
> >> > > Get.setFilter(new ColumnRangeFilter(…)) are needed on a single Get.
> >> > >
> >> > > The source code snippet is as follows:
> >> > >
> >> > >
> >> > >
> >> > >        Get g = new Get(getRowKey(lfileId));
> >> > >        g.addColumn(Schema.ColumnFamilyNameBytes, MetaColumnNameBytes);
> >> > >        g.setFilter(new ColumnRangeFilter(Bytes.toBytes(name), false,
> >> > >            Bytes.toBytes(name + "~"), false));
> >> > >        Result r = table.get(g);
> >> > >
> >> > >        if (r.isEmpty())
> >> > >            throw new FileNotFoundException(
> >> > >                String.format("%d:%d:%s", projectId, lfileId, name));
> >> > >
> >> > >
> >> > >
> >> > > When g.addColumn() is commented out, the Result is not empty, 
> >> > > while with g.addColumn() the Result is empty (FileNotFoundException 
> >> > > is thrown).
> >> > >
> >> > > Is it illegal to use both methods?
> >> > >
> >> > >
> >> > >
> >> > > BTW, the version of HBase used is 0.98. (Hortonworks HDP 2.1)
> >> > >
> >> > >
> >> > >
> >> > > Thanks.
> >> >
> >>
> >>
> >
>
>
