[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-04-05 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247497#comment-13247497
 ] 

He Yongqiang commented on HBASE-5313:
-

Hi Kannan,

We are still experimenting with this. The initial results show less than a 
one-quarter reduction, which is not big enough for us. The timestamp issue is 
low-hanging fruit, which can cut about 8%. 
We will post a diff as soon as we finalize our experiments.

 Restructure hfiles layout for better compression
 

 Key: HBASE-5313
 URL: https://issues.apache.org/jira/browse/HBASE-5313
 Project: HBase
  Issue Type: Improvement
  Components: io
Reporter: dhruba borthakur
Assignee: dhruba borthakur

 An HFile block contains a stream of key-values. Can we organize these KVs 
 on disk in a better way so that we get much greater compression ratios?
 One option (thanks, Prakash) is to store all the keys at the beginning of the 
 block (let's call this the key-section) and then store all their 
 corresponding values towards the end of the block. This will allow us to 
 avoid even decompressing the values when we are scanning and skipping over 
 rows in the block.
 Any other ideas? 
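The key-section/value-section idea above can be sketched as follows. This is an illustrative Python sketch, not HBase code: the length-prefixed framing and the use of zlib are assumptions standing in for HFile's real encoding and compression.

```python
import struct
import zlib

def write_block(kvs):
    """Pack all keys at the front of the block (the key-section) and all
    values at the end, compressing the two sections independently.
    Length-prefixed framing and zlib are illustrative assumptions."""
    key_sec = b"".join(struct.pack(">I", len(k)) + k for k, _ in kvs)
    val_sec = b"".join(struct.pack(">I", len(v)) + v for _, v in kvs)
    ck = zlib.compress(key_sec)
    cv = zlib.compress(val_sec)
    return struct.pack(">I", len(ck)) + ck + cv

def scan_keys(block):
    """Skip over rows by reading only the key-section; the value
    section is never decompressed."""
    (klen,) = struct.unpack_from(">I", block, 0)
    key_sec = zlib.decompress(block[4:4 + klen])
    keys, off = [], 0
    while off < len(key_sec):
        (n,) = struct.unpack_from(">I", key_sec, off)
        keys.append(key_sec[off + 4:off + 4 + n])
        off += 4 + n
    return keys

block = write_block([(b"row1", b"v1"), (b"row2", b"v2")])
```

A scanner built this way pays the value-decompression cost only for rows it actually returns, which is the benefit the description is after.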

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5674) add support in HBase to overwrite hbase timestamp to a version number during major compaction

2012-03-30 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242094#comment-13242094
 ] 

He Yongqiang commented on HBASE-5674:
-

I used the term 'researchy' because it was described that way in one email 
thread; refer to http://osdir.com/ml/general/2012-03/msg52707.html. I have no 
idea how this term came up.

bq. The most of us working on hbase are trying to make it an hardcore 
production worthy platform. 'Pluggable' and 'research', at least on first 
blush, sound like distractions from the project objective.
So are you saying this conflicts with your 'hardcore production worthy 
platform' goal? 

 add support in HBase to overwrite hbase timestamp to a version number during 
 major compaction
 -

 Key: HBASE-5674
 URL: https://issues.apache.org/jira/browse/HBASE-5674
 Project: HBase
  Issue Type: Improvement
Reporter: He Yongqiang
Assignee: He Yongqiang

 Right now, a millisecond-level timestamp is attached to every record. 
 In our case, we only need a version number (mostly it will just be zero). 
 A millisecond timestamp is too heavy to carry, so we should add support for 
 overwriting it with zero during major compaction. 
 KVs written before the major compaction will keep the system timestamp. This 
 should be configurable, so that we do not break anything when the HBase 
 timestamp is specified by the application.





[jira] [Commented] (HBASE-5674) add support in HBase to overwrite hbase timestamp to a version number during major compaction

2012-03-30 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242115#comment-13242115
 ] 

He Yongqiang commented on HBASE-5674:
-

Okay. Now I need to make my lack of a sense of humor public. :)

Here is the real problem:
In our use case, the space the data occupies *really* matters. We need to find 
everything we can do to bring the size down as much as possible. Obviously we 
do not want to bring in LZMA or bzip2 compression, as they are really slow. In 
my simple test, 41MB of data was reduced to 32MB after I rewrote the HBase 
Long timestamp to zero. The 8-byte Long timestamp is heavy because it is a 
binary system timestamp, which makes it very hard to compress (the MemstoreTS 
is also a Long, but it is not a problem because it will eventually be zero). 
And if you look at how that data is used, it is pretty much unused by most 
applications when it is system generated (not specified by the application). A 
good reason to make this configurable is that some applications do specify the 
timestamp; in that case HBase pretty much cannot modify that data. But the 
many other applications that do not care about this data should not suffer 
this problem if data size really matters to them. 
I think this could benefit other community members who may hit this problem 
when they want to decrease their data size. 
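The compression effect described above is easy to reproduce outside HBase. A hedged Python sketch (zlib standing in for HBase's block compression, and a deliberately simplified cell layout): distinct binary millisecond timestamps compress far worse than cells whose timestamps have been rewritten to zero.

```python
import struct
import zlib

def kv_stream(timestamps):
    """Concatenate simplified cells: a row key plus an 8-byte big-endian
    Long timestamp (a loose stand-in for the KeyValue key layout)."""
    out = bytearray()
    for i, ts in enumerate(timestamps):
        out += b"row%06d" % i
        out += struct.pack(">q", ts)
    return bytes(out)

base = 1333000000000  # arbitrary epoch milliseconds
real = kv_stream([base + i * 7 for i in range(5000)])  # distinct system millis
zeroed = kv_stream([0] * 5000)                         # rewritten to "version 0"

# the zeroed stream compresses substantially better
saving = len(zlib.compress(real)) - len(zlib.compress(zeroed))
```

The low bytes of each real timestamp are close to random to the compressor, while a constant zero is nearly free, which matches the 41MB-to-32MB observation in spirit (the exact ratio depends on the data).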








[jira] [Commented] (HBASE-5674) add support in HBase to overwrite hbase timestamp to a version number during major compaction

2012-03-30 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242650#comment-13242650
 ] 

He Yongqiang commented on HBASE-5674:
-

Thanks Matt and stack for pointing out 4676. Yeah, we are very interested in 
the work that is going on in HBase-4767.






[jira] [Commented] (HBASE-5674) add support in HBase to overwrite hbase timestamp to a version number during major compaction

2012-03-29 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242028#comment-13242028
 ] 

He Yongqiang commented on HBASE-5674:
-

bq. For whom?

For our 'researchy' project...

bq. Can you not just have your client specify timestamp of 0?

I hope this can be done in open-source HBase, and can be made pluggable. 






[jira] [Commented] (HBASE-5605) compression does not work in Store.java trunk

2012-03-20 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233681#comment-13233681
 ] 

He Yongqiang commented on HBASE-5605:
-

https://reviews.facebook.net/D2391

 compression does not work in Store.java trunk
 -

 Key: HBASE-5605
 URL: https://issues.apache.org/jira/browse/HBASE-5605
 Project: HBase
  Issue Type: Bug
Reporter: He Yongqiang
Assignee: He Yongqiang







[jira] [Commented] (HBASE-5521) Move compression/decompression to an encoder specific encoding context

2012-03-08 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13225379#comment-13225379
 ] 

He Yongqiang commented on HBASE-5521:
-

The tests passed in my local run; not sure why they failed on Jenkins.

 Move compression/decompression to an encoder specific encoding context
 --

 Key: HBASE-5521
 URL: https://issues.apache.org/jira/browse/HBASE-5521
 Project: HBase
  Issue Type: Improvement
Reporter: He Yongqiang
Assignee: He Yongqiang
 Attachments: HBASE-5521.1.patch, HBASE-5521.D2097.1.patch, 
 HBASE-5521.D2097.2.patch, HBASE-5521.D2097.3.patch, HBASE-5521.D2097.4.patch, 
 HBASE-5521.D2097.5.patch


 As part of working on HBASE-5313, we want to add a new columnar 
 encoder/decoder. It makes sense to move compression to be part of the 
 encoder/decoder:
 1) a scanner for a columnar-encoded block can do lazy decompression of a 
 specific part of a key-value object
 2) it avoids an extra byte copy from the encoder to the hblock-writer. 
 If there is no encoder specified for a writer, the HBlock.Writer will use a 
 default compression context to do something very similar to today's code.
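A minimal sketch of the proposed shape, in illustrative Python with hypothetical names (`EncodingContext` and `ColumnarEncoder` are not the actual HBase classes): the encoder owns its compression context, so the block writer receives ready-to-write bytes and no extra copy is needed between encoder and writer.

```python
import zlib

class EncodingContext:
    """Hypothetical per-encoder compression context; zlib stands in
    for whatever codec the writer is configured with."""
    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data)

class ColumnarEncoder:
    """Hypothetical encoder that compresses its own output instead of
    handing uncompressed bytes back to the block writer."""
    def __init__(self):
        self.ctx = EncodingContext()

    def encode_block(self, kvs):
        # the encoder produces the final on-disk bytes; the block
        # writer just appends them, avoiding an extra buffer copy
        payload = b"".join(k + v for k, v in kvs)
        return self.ctx.compress(payload)

encoder = ColumnarEncoder()
out = encoder.encode_block([(b"k", b"v")])
```

A columnar encoder built on this shape could also compress each column section through its own context, enabling the lazy per-section decompression point 1) describes.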





[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-03-05 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13222512#comment-13222512
 ] 

He Yongqiang commented on HBASE-5313:
-

As part of working on HBASE-5313, we first tried to write a new 
HFileWriter/HFileReader to do it. After finishing some of the work, it became 
clear that this requires a lot of code refactoring in order to reuse as much 
existing code as possible.

Then we found that adding a new columnar encoder/decoder would be easier to 
do, and opened https://issues.apache.org/jira/browse/HBASE-5521 to do the 
encoder/decoder-specific compression work.






[jira] [Commented] (HBASE-5521) Move compression/decompression to an encoder specific encoding context

2012-03-05 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13222576#comment-13222576
 ] 

He Yongqiang commented on HBASE-5521:
-

moved the review to https://reviews.facebook.net/D2097

 Move compression/decompression to an encoder specific encoding context
 --

 Key: HBASE-5521
 URL: https://issues.apache.org/jira/browse/HBASE-5521
 Project: HBase
  Issue Type: Improvement
Reporter: He Yongqiang
Assignee: He Yongqiang
 Attachments: HBASE-5521.1.patch, HBASE-5521.D2097.1.patch







[jira] [Commented] (HBASE-5457) add inline index in data block for data which are not clustered together

2012-02-23 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214915#comment-13214915
 ] 

He Yongqiang commented on HBASE-5457:
-

@stack, we haven't thought this through in much detail, but we can start the 
discussion with an example.

Let's say there is one column family, and it contains only one type of column 
whose name is a combination of 'string and ts'. So the data is sorted by 
'string' first, but one query wants the data sorted by ts instead.

 add inline index in data block for data which are not clustered together
 

 Key: HBASE-5457
 URL: https://issues.apache.org/jira/browse/HBASE-5457
 Project: HBase
  Issue Type: New Feature
Reporter: He Yongqiang

 As we went through our data schema, we found we have one large column family 
 which just duplicates data from another column family; it is merely a re-org 
 of the data to cluster it in a different way than the original column family, 
 in order to serve another type of query efficiently.
 If we compare this second column family with the similar situation in MySQL, 
 it is like an index in MySQL. So if we could add an inline block index on the 
 required columns, the second column family would not be needed.





[jira] [Commented] (HBASE-5457) add inline index in data block for data which are not clustered together

2012-02-23 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215002#comment-13215002
 ] 

He Yongqiang commented on HBASE-5457:
-

@lars, in today's implementation we actually create another column family and 
reorganize the column name to be 'ts and string', so the data is sorted by ts 
in this new column family, and we redirect the query to use the second column 
family. But this approach duplicates data. 
Without the second column family, we could do a search once we have found the 
row, but that requires scanning all the data under the target row key, which 
hurts CPU. 
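The trade-off above can be shown with a toy sketch (illustrative Python; the qualifiers and values are hypothetical): the secondary column family stores the same cells under a reversed 'ts:string' qualifier so that a plain lexicographic sort yields ts order, at the cost of duplicating every value.

```python
# Primary column family: qualifier is 'string:ts', sorted by string.
# Secondary (duplicated) column family: qualifier is 'ts:string', sorted
# by ts. A real implementation would fixed-width-encode ts so that
# lexicographic order matches numeric order; single digits suffice here.
cells = {("apple", 3): b"a", ("pear", 1): b"b", ("zebra", 2): b"c"}

primary = sorted((f"{s}:{t}", v) for (s, t), v in cells.items())
secondary = sorted((f"{t}:{s}", v) for (s, t), v in cells.items())
```

An inline block index on ts would give the secondary ordering without materializing `secondary` at all, which is the point of this issue.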






[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-22 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214032#comment-13214032
 ] 

He Yongqiang commented on HBASE-5313:
-

As a first step, we will go ahead with a simple columnar layout 
implementation, and leave more advanced features (like a nested column layout) 
to a follow-up. 








[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-13 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207057#comment-13207057
 ] 

He Yongqiang commented on HBASE-5313:
-

bq. Can you also list the time it took writing the HFile for each of the three 
schemes ?

@Zhihong, we are still trying to explore more ideas here. Once we have a 
finalized plan, I will get the CPU/latency numbers. 

bq. Yongqiang, what is the delta encoding algorithm did you use? The default 
algorithm only do a simple encoding. Do we have results using prefix with 
fast diff algorithm for the current hfile v2?

@Jerry, I tried all three delta encodings, and Diff with HFileWriterV2 
produced the smallest file in my test. 
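For readers unfamiliar with the encodings being compared, here is a simplified analogue of the Prefix encoding in illustrative Python (the real HBase encoders also handle the timestamp and type fields, which this sketch omits): each sorted key is stored as a shared-prefix length plus the remaining suffix.

```python
def prefix_encode(keys):
    """Store each key of a sorted run as (shared-prefix length, suffix) —
    a simplified analogue of HBase's Prefix data block encoding."""
    out, prev = [], b""
    for k in keys:
        p = 0
        while p < min(len(k), len(prev)) and k[p] == prev[p]:
            p += 1
        out.append((p, k[p:]))
        prev = k
    return out

def prefix_decode(encoded):
    """Rebuild the original keys from (prefix length, suffix) pairs."""
    keys, prev = [], b""
    for p, suffix in encoded:
        k = prev[:p] + suffix
        keys.append(k)
        prev = k
    return keys

encoded = prefix_encode([b"row1/cf:a", b"row1/cf:b", b"row2/cf:a"])
```

Diff and FastDiff go further by also delta-encoding lengths and timestamps, which is why their results can differ from plain Prefix on the same data.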











[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-13 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207070#comment-13207070
 ] 

He Yongqiang commented on HBASE-5313:
-

bq. However, those compression numbers are pretty nice. I worry a little bit 
about having now an hfileV3, so soon on the heels of the last, leading to a 
proliferation of versions. My other concern is that the columnar storage 
doesn't make sense for all cases - Dremel is for a specific use case.
That being said, I would love to see the ability to do Dremel in HBase. How 
about along with a new version/columnar data support comes the ability to 
select storage files on a per-table basis? That would enable some tables to be 
optimized for certain use cases, other tables for others, rather than having to 
use completely different clusters (continuing the multi-tenancy story).

@Jesse Yates, yeah, agreed. One big thing we need to answer is how to 
integrate with the current HFile implementation; we want to reuse code as much 
as possible. I guess a nested columnar structure like Dremel's is what we 
ultimately want for HBase, but we first need to figure out a good story for 
how applications will use it.








[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-10 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205892#comment-13205892
 ] 

He Yongqiang commented on HBASE-5313:
-

@Todd, with such a small block size and the data already sorted, I was also 
thinking it will be very hard to optimize the space.

So we did some experiments by modifying today's HFileWriter. It turns out we 
can still save a lot if we play more tricks.

Here are the test results (block size is 16KB):

*42MB HFile, with Delta compression and with LZO compression* (with the 
default settings on Apache trunk)

*30MB HFile, with Columnar, with Delta compression, and with LZO compression.*

Inside one block, first put all the row keys inside that block, and apply 
delta compression and then LZO compression. After the row keys, put all the 
column-family data in that block and apply Delta+LZO to it. Then similarly 
put the column qualifiers, etc.

*24MB HFile, with Columnar, sorted value column, sorted column_qualifier 
column, and with LZO compression.*

Inside one block, first put all the row keys inside that block, and apply 
delta compression and then LZO compression. After the row keys, put all the 
column-family data in that block and apply Delta+LZO to it. Then put the 
column qualifiers, sort them, and apply Delta+LZO. The TS column and the Code 
column are processed the same way as the column family; the value column is 
processed the same way as the column qualifier. So it is the same disk format 
as the 30MB HFile, except that all the data for 'column_qualifier' and 
'value' is sorted separately.

Of the 24MB file, 6MB is used to store row keys, 7MB to store column 
qualifiers, and 6MB to store values.

More ideas are welcome! 
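The per-column layout described above can be sketched as follows, in illustrative Python with zlib standing in for Delta+LZO and length-prefixed framing as an assumed format: each field of the cells is laid out and compressed as its own column stream, then the streams are reassembled into cells on read.

```python
import struct
import zlib

FIELDS = 5  # row key, column family, qualifier, timestamp, value

def write_columnar_block(cells):
    """Lay the block out column by column, compressing each column
    stream on its own (zlib stands in for Delta+LZO here)."""
    sections = []
    for f in range(FIELDS):
        raw = b"".join(struct.pack(">I", len(c[f])) + c[f] for c in cells)
        comp = zlib.compress(raw)
        sections.append(struct.pack(">I", len(comp)) + comp)
    return b"".join(sections)

def read_columnar_block(block):
    """Decompress each column section and zip the columns back into cells."""
    cols, off = [], 0
    for _ in range(FIELDS):
        (n,) = struct.unpack_from(">I", block, off)
        raw = zlib.decompress(block[off + 4:off + 4 + n])
        col, p = [], 0
        while p < len(raw):
            (m,) = struct.unpack_from(">I", raw, p)
            col.append(raw[p + 4:p + 4 + m])
            p += 4 + m
        cols.append(col)
        off += 4 + n
    return list(zip(*cols))

cells = [(b"row1", b"cf", b"q1", b"\x00" * 8, b"v1"),
         (b"row2", b"cf", b"q2", b"\x00" * 8, b"v2")]
block = write_columnar_block(cells)
```

Grouping like-typed bytes together is what lets the compressor find the redundancy that the interleaved KV layout hides; the 30MB and 24MB variants differ only in whether the qualifier and value streams are additionally sorted.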







[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-08 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203935#comment-13203935
 ] 

He Yongqiang commented on HBASE-5313:
-

bq. I suppose we could use the value length from the key, then know we have 
nth key and by using the value length of all 1 to n-1 keys to find the value.

Yes. The value length is stored in the key header. The key header is cheap and 
can always be decompressed without a big CPU cost.
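In other words, once the value lengths are known from the key headers, the nth value's offset in the value section is just a prefix sum. A trivial sketch (the lengths are hypothetical):

```python
def value_offset(value_lengths, n):
    """Offset of the nth value (0-based) within the value section: the sum
    of the lengths of values 0..n-1, all read from the cheap key headers."""
    return sum(value_lengths[:n])

# hypothetical value lengths taken from four key headers
lengths = [5, 3, 9, 2]
```

So a scan can skip straight to any value it actually needs, without touching the bytes of the values it skips.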






[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-07 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203324#comment-13203324
 ] 

He Yongqiang commented on HBASE-5313:
-

As discussed earlier, one thing we can try is to use something like Hive's 
RCFile. The difference from Hive is that an HBase row's value is not a single 
type. If it turns out the columnar file format helps, we can employ a nested 
columnar format for the value (like what Dremel does). There is a thread on 
Quora about Dremel: 
http://www.quora.com/How-will-Googles-Dremel-change-future-Hadoop-releases.

