RE: Get addColumn + ColumnRangeFilter

Taeyun Kim Thu, 15 Jan 2015 23:48:06 -0800

Some more.

The files cannot be physically merged (that is, each file must retain its 
identity) since there is a requirement that each individual file group can be 
deleted on its own.
And since the files are postprocessed individually, there is no need to scan 
through all the file groups, so HBase's 'slow' scan speed relative to HDFS 
sequential reads is not a concern.

-----Original Message-----
From: Taeyun Kim [mailto:[email protected]] 
Sent: Friday, January 16, 2015 4:36 PM
To: '[email protected]'
Subject: RE: Get addColumn + ColumnRangeFilter

It's a somewhat long story.
Maybe I use HBase in some weird way.

My use case is as follows:

I didn't want to put many small files into HDFS. (Since it is bad for HDFS, 
both for scalability and performance)

The small files are grouped by test log, since the files are many facets of 
the result of the analysis of one test log. So, they could be the members of 
one SequenceFile.
But I found SequenceFile (or other similar formats) unattractive, since I 
would still end up with many not-so-big (about 20MB, except for rare cases) 
SequenceFiles, because the analysis result files are not so big and the test 
log files are continually generated.
So some manual file management and merging would be a must.

So, I decided to use an HBase record as a kind of 'directory' to avoid the 
manual file management. (directory = file group) This way, the 'files' are 
automatically 'merged' into appropriately sized HFiles, and as a bonus the 
'files' can be automatically deleted when their lifetime is over.

The 'directory' has the following files.

- 'm': meta file. (to check the version of the 'directory' format)
- 'Result.csv.0'
- 'Result.csv.1'
- ...
- 'Result.csv.p': parts file. (has the split count and each size. 'p' is for 
'parts')
- 'AnotherResultA.csv.0'
- 'AnotherResultA.csv.1'
- ...
- 'AnotherResultA.csv.p'
- 'TestEnvironment.txt'

Each 'file' is saved as a column.

Result files are split for the following reasons:
- To handle the extreme case where the file is too big to be processed by one 
task.
- To save task process memory: each split is actually smaller than 64MB (the 
size for one task) and is individually compressed. This way, a task process 
holds at most one uncompressed column at a time. A task is assigned multiple 
'splits'.
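
The splitting scheme above can be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, gzip is an assumed codec (the mail does not say which compression is used), and the sketch cuts at plain byte offsets, whereas a real splitter for CSV data would presumably cut at line boundaries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: split a file's bytes and compress each split on its own. */
public class SplitSketch {

    /** Cuts content into pieces of at most splitSize bytes and gzips each piece separately. */
    static List<byte[]> splitAndCompress(byte[] content, int splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<byte[]> parts = new ArrayList<>();
        for (int off = 0; off < content.length; off += splitSize) {
            int len = Math.min(splitSize, content.length - off);
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(content, off, len);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // Each element would be stored as one column (Result.csv.0, Result.csv.1, ...).
            parts.add(buf.toByteArray());
        }
        return parts;
    }

    /** Decompresses a single split, so only one split is held uncompressed at a time. */
    static byte[] decompress(byte[] part) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(part))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because every split is a complete gzip stream of its own, a task can decompress and process one column, drop it, and move on to the next.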

For this, I've written an InputFormat class.

Now, the InputFormat class can first Get both 'm' and a parts file to get the 
inputSplit information. This is not a problem: a single Get with 2 addColumn() 
calls is sufficient.
But when the whole content of a file must be read (like Files.readAllBytes()), 
I must Get 'm' and an unknown number of splits that share a name range 
(Result.csv.0 ~ Result.csv.7), preferably in a single Get. (addColumn() + 
ColumnRangeFilter) But with the current HBase behavior, it seems that I have 
to invoke 2 Gets, or disable the version check. (Maybe not a big deal?)
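
As a rough illustration of the "read the whole file" case, the sketch below models one row as an in-memory sorted map of column name to value; it is not HBase client code, and the class and method names are hypothetical. It applies the same exclusive bounds as the ColumnRangeFilter in the original snippet, (name, name + "~"), which is also why 'm' can never be picked up by that range. It orders the splits by their numeric suffix, because plain lexicographic order would place 'Result.csv.10' before 'Result.csv.2'.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory model of one 'directory' row: column name -> column value. */
public class WholeFileSketch {

    /** Concatenates name.0, name.1, ... in numeric order; name.p and other columns are skipped. */
    static byte[] readAll(NavigableMap<String, byte[]> row, String name) {
        String prefix = name + ".";
        // Re-key by numeric index, since the sorted column order is lexicographic.
        TreeMap<Integer, byte[]> byIndex = new TreeMap<>();
        // Same bounds as the thread's ColumnRangeFilter: (name, name + "~"), both exclusive.
        // String order equals byte order here only because the names are assumed to be ASCII.
        for (Map.Entry<String, byte[]> e
                : row.subMap(name, false, name + "~", false).entrySet()) {
            String key = e.getKey();
            if (!key.startsWith(prefix)) {
                continue; // in range, but not a column of this file
            }
            String suffix = key.substring(prefix.length());
            if (suffix.matches("\\d{1,9}")) { // numbered splits only, so ".p" is ignored
                byIndex.put(Integer.parseInt(suffix), e.getValue());
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : byIndex.values()) {
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }
}
```

If the splits are compressed individually, each value would have to be decompressed before it is appended; that step is left out here.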

That's all.

If you think that this record design is not efficient, or there is a better 
solution, please let me know.

BTW, for the current status, when both addColumn() and ColumnRangeFilter are 
applied, they are practically combined by 'AND' operator. Right?

-----Original Message-----
From: Ted Yu [mailto:[email protected]]
Sent: Friday, January 16, 2015 3:39 PM
To: [email protected]
Subject: Re: Get addColumn + ColumnRangeFilter

I reproduced the failed test (testAddColumnWithColumnRangeFilter) after 
modifying your test case to fit master branch.

The reason for one Cell being returned is that ExplicitColumnTracker is used by 
ScanQueryMatcher to first check if the column is part of the requested columns 
(f:fc in your case). The other columns don't pass this check, hence they're not 
included in the result.
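
The effect described above can be imitated with a small toy model. It uses no HBase classes: the class and method names are hypothetical, and it only mimics "explicit column check first, then column range check" over a list of column names, which amounts to an AND of the two conditions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: explicit-column check followed by a column-range check. */
public class ColumnCheckModel {

    /** Returns the qualifiers that pass both the explicit-column check and the range check. */
    static List<String> select(List<String> qualifiers, Set<String> explicitColumns,
                               String minExclusive, String maxExclusive) {
        List<String> out = new ArrayList<>();
        for (String q : qualifiers) {
            // Step 1: explicit columns. An empty set models "addColumn() was never called",
            // in which case every column passes this step.
            if (!explicitColumns.isEmpty() && !explicitColumns.contains(q)) {
                continue;
            }
            // Step 2: range check with both bounds exclusive, as in the original snippet.
            // (String comparison stands in for byte comparison; fine for ASCII names.)
            if (q.compareTo(minExclusive) > 0 && q.compareTo(maxExclusive) < 0) {
                out.add(q);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("Result.csv.0", "Result.csv.1", "Result.csv.p", "m");
        Set<String> noExplicit = new HashSet<>();
        Set<String> onlyMeta = new HashSet<>(Arrays.asList("m"));
        // Range only: the three Result.csv.* columns are selected.
        System.out.println(select(row, noExplicit, "Result.csv", "Result.csv~"));
        // Explicit 'm' plus range: 'm' is outside the range, so nothing is selected.
        System.out.println(select(row, onlyMeta, "Result.csv", "Result.csv~"));
    }
}
```

In this model, an explicit column that lies inside the range yields exactly one result, and one that lies outside the range yields none, which matches the two failures reported in the attached tests (one Cell in the first, an empty Result in the second).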

Before this part of the code is changed, can I ask why you need to call
g.addColumn() when g has a ColumnRangeFilter associated with it?

Cheers

On Thu, Jan 15, 2015 at 6:22 PM, Taeyun Kim <[email protected]>
wrote:

> (Sorry if this mail is a duplicate)
>
> Hi Ted,
>
> I've attached 2 unit test classes.
>
> Both have one failed test.
>
> -
> HBaseAddColumnWithColumnRangeFilterTest1.testAddColumnWithColumnRangeFilter():
> Expected: 10, Actual 1
> -
> HBaseAddColumnWithColumnRangeFilterTest2.testAddColumnWithColumnRangeFilter():
> Result is empty
>
> If the tests have problems, please let me know.
>
>
> -----Original Message-----
> From: Ted Yu [mailto:[email protected]]
> Sent: Thursday, January 15, 2015 6:59 PM
> To: [email protected]
> Subject: Re: Get addColumn + ColumnRangeFilter
>
> Can you write a unit test which shows this behavior?
>
> Thanks
>
>
>
> > On Jan 14, 2015, at 9:09 PM, Taeyun Kim <
> [email protected]> wrote:
> >
> > Hi,
> >
> >
> >
> > I have a situation where both Get.addColumn() and Get.setFilter(new
> > ColumnRangeFilter(…)) are needed on the same Get.
> >
> > The source code snippet is as follows:
> >
> >
> >        Get g = new Get(getRowKey(lfileId));
> >        g.addColumn(Schema.ColumnFamilyNameBytes, MetaColumnNameBytes);
> >        g.setFilter(new ColumnRangeFilter(Bytes.toBytes(name), false,
> >            Bytes.toBytes(name + "~"), false));
> >        Result r = table.get(g);
> >
> >        if (r.isEmpty())
> >            throw new FileNotFoundException(
> >                String.format("%d:%d:%s", projectId, lfileId, name));
> >
> > When g.addColumn() is commented out, the Result is not empty, while
> > with g.addColumn() the Result is empty (FileNotFoundException is thrown).
> >
> > Is it illegal to use both methods?
> >
> >
> >
> > BTW, the version of HBase used is 0.98. (Hortonworks HDP 2.1)
> >
> >
> >
> > Thanks.
>
