[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-04-05 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247497#comment-13247497
 ] 

He Yongqiang commented on HBASE-5313:
-

Hi Kannan,

We are still experimenting with this. The initial results show less than a 
one-quarter reduction, which is not big enough for us. The timestamp issue is 
low-hanging fruit, which can cut about 8%. 
We will post a diff as soon as we finalize our experiments.

 Restructure hfiles layout for better compression
 

 Key: HBASE-5313
 URL: https://issues.apache.org/jira/browse/HBASE-5313
 Project: HBase
  Issue Type: Improvement
  Components: io
Reporter: dhruba borthakur
Assignee: dhruba borthakur

 An HFile block contains a stream of key-values. Can we organize these KVs 
 on disk in a better way so that we get much greater compression ratios?
 One option (thanks, Prakash) is to store all the keys at the beginning of the 
 block (let's call this the key-section) and then store all their 
 corresponding values towards the end of the block. This will allow us to 
 avoid even decompressing the values when we are scanning and skipping over 
 rows in the block.
 Any other ideas? 
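The key-section/value-section idea above can be sketched as follows. This is an illustrative Python sketch, not HBase code: the length-prefixed framing and the use of zlib are assumptions standing in for HFile's real encoding and compression.

```python
import struct
import zlib

def write_block(kvs):
    """Pack all keys at the front of the block (the key-section) and all
    values at the end, compressing the two sections independently.
    Length-prefixed framing and zlib are illustrative assumptions."""
    key_sec = b"".join(struct.pack(">I", len(k)) + k for k, _ in kvs)
    val_sec = b"".join(struct.pack(">I", len(v)) + v for _, v in kvs)
    ck = zlib.compress(key_sec)
    cv = zlib.compress(val_sec)
    return struct.pack(">I", len(ck)) + ck + cv

def scan_keys(block):
    """Skip over rows by reading only the key-section; the value
    section is never decompressed."""
    (klen,) = struct.unpack_from(">I", block, 0)
    key_sec = zlib.decompress(block[4:4 + klen])
    keys, off = [], 0
    while off < len(key_sec):
        (n,) = struct.unpack_from(">I", key_sec, off)
        keys.append(key_sec[off + 4:off + 4 + n])
        off += 4 + n
    return keys

block = write_block([(b"row1", b"v1"), (b"row2", b"v2")])
```

A scanner built this way pays the value-decompression cost only for rows it actually returns, which is the benefit the description is after.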

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5674) add support in HBase to overwrite hbase timestamp to a version number during major compaction

2012-03-30 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242094#comment-13242094
 ] 

He Yongqiang commented on HBASE-5674:
-

I used the term 'researchy' because it was described that way in one email 
thread; refer to http://osdir.com/ml/general/2012-03/msg52707.html. I have no 
idea how this term came up.

bq. The most of us working on hbase are trying to make it an hardcore 
production worthy platform. 'Pluggable' and 'research', at least on first 
blush, sound like distractions from the project objective.
So are you saying this conflicts with your 'hardcore production worthy 
platform' goal? 

 add support in HBase to overwrite hbase timestamp to a version number during 
 major compaction
 -

 Key: HBASE-5674
 URL: https://issues.apache.org/jira/browse/HBASE-5674
 Project: HBase
  Issue Type: Improvement
Reporter: He Yongqiang
Assignee: He Yongqiang

 Right now, a millisecond-level timestamp is attached to every record. 
 In our case, we only need a version number (mostly it will just be zero). 
 A millisecond timestamp is too heavy to carry, so we should add support for 
 overwriting it with zero during major compaction. 
 KVs written before the major compaction will keep the system timestamp. This 
 should be configurable, so that we do not break anything when the HBase 
 timestamp is specified by the application.





[jira] [Commented] (HBASE-5674) add support in HBase to overwrite hbase timestamp to a version number during major compaction

2012-03-30 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242115#comment-13242115
 ] 

He Yongqiang commented on HBASE-5674:
-

Okay. Now I need to make my lack of a sense of humor public. :)

Here is the real problem:
In our use case, the space the data occupies *really* matters. We need to find 
everything we can do to bring the size down as much as possible. Obviously we 
do not want to bring in LZMA or bzip2 compression, as they are really slow. In 
my simple test, 41MB of data was reduced to 32MB after I rewrote the HBase 
Long timestamp to zero. The 8-byte Long timestamp is heavy because it is a 
binary system timestamp, which makes it very hard to compress (the MemstoreTS 
is also a Long, but it is not a problem because it will eventually be zero). 
And if you look at how that data is used, it is pretty much unused by most 
applications when it is system generated (not specified by the application). A 
good reason to make this configurable is that some applications do specify the 
timestamp; in that case HBase pretty much cannot modify that data. But the 
many other applications that do not care about this data should not suffer 
this problem if data size really matters to them. 
I think this could benefit other community members who may hit this problem 
when they want to decrease their data size. 
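The compression effect described above is easy to reproduce outside HBase. A hedged Python sketch (zlib standing in for HBase's block compression, and a deliberately simplified cell layout): distinct binary millisecond timestamps compress far worse than cells whose timestamps have been rewritten to zero.

```python
import struct
import zlib

def kv_stream(timestamps):
    """Concatenate simplified cells: a row key plus an 8-byte big-endian
    Long timestamp (a loose stand-in for the KeyValue key layout)."""
    out = bytearray()
    for i, ts in enumerate(timestamps):
        out += b"row%06d" % i
        out += struct.pack(">q", ts)
    return bytes(out)

base = 1333000000000  # arbitrary epoch milliseconds
real = kv_stream([base + i * 7 for i in range(5000)])  # distinct system millis
zeroed = kv_stream([0] * 5000)                         # rewritten to "version 0"

# the zeroed stream compresses substantially better
saving = len(zlib.compress(real)) - len(zlib.compress(zeroed))
```

The low bytes of each real timestamp are close to random to the compressor, while a constant zero is nearly free, which matches the 41MB-to-32MB observation in spirit (the exact ratio depends on the data).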








[jira] [Commented] (HBASE-5674) add support in HBase to overwrite hbase timestamp to a version number during major compaction

2012-03-30 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242650#comment-13242650
 ] 

He Yongqiang commented on HBASE-5674:
-

Thanks Matt and stack for pointing out 4676. Yeah, we are very interested in 
the work that is going on in HBase-4767.






[jira] [Commented] (HBASE-5674) add support in HBase to overwrite hbase timestamp to a version number during major compaction

2012-03-29 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242028#comment-13242028
 ] 

He Yongqiang commented on HBASE-5674:
-

bq. For whom?

For our 'researchy' project...

bq. Can you not just have your client specify timestamp of 0?

I hope this can be done in open-source HBase, and can be made pluggable. 






[jira] [Commented] (HBASE-5605) compression does not work in Store.java trunk

2012-03-20 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233681#comment-13233681
 ] 

He Yongqiang commented on HBASE-5605:
-

https://reviews.facebook.net/D2391

 compression does not work in Store.java trunk
 -

 Key: HBASE-5605
 URL: https://issues.apache.org/jira/browse/HBASE-5605
 Project: HBase
  Issue Type: Bug
Reporter: He Yongqiang
Assignee: He Yongqiang







[jira] [Commented] (HBASE-5521) Move compression/decompression to an encoder specific encoding context

2012-03-08 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13225379#comment-13225379
 ] 

He Yongqiang commented on HBASE-5521:
-

The tests passed in my local run; not sure why they failed on Jenkins.

 Move compression/decompression to an encoder specific encoding context
 --

 Key: HBASE-5521
 URL: https://issues.apache.org/jira/browse/HBASE-5521
 Project: HBase
  Issue Type: Improvement
Reporter: He Yongqiang
Assignee: He Yongqiang
 Attachments: HBASE-5521.1.patch, HBASE-5521.D2097.1.patch, 
 HBASE-5521.D2097.2.patch, HBASE-5521.D2097.3.patch, HBASE-5521.D2097.4.patch, 
 HBASE-5521.D2097.5.patch


 As part of working on HBASE-5313, we want to add a new columnar 
 encoder/decoder. It makes sense to move compression to be part of the 
 encoder/decoder:
 1) a scanner for a columnar-encoded block can do lazy decompression of a 
 specific part of a key-value object
 2) it avoids an extra byte copy from the encoder to the hblock-writer. 
 If there is no encoder specified for a writer, the HBlock.Writer will use a 
 default compression context to do something very similar to today's code.
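A minimal sketch of the proposed shape, in illustrative Python with hypothetical names (`EncodingContext` and `ColumnarEncoder` are not the actual HBase classes): the encoder owns its compression context, so the block writer receives ready-to-write bytes and no extra copy is needed between encoder and writer.

```python
import zlib

class EncodingContext:
    """Hypothetical per-encoder compression context; zlib stands in
    for whatever codec the writer is configured with."""
    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data)

class ColumnarEncoder:
    """Hypothetical encoder that compresses its own output instead of
    handing uncompressed bytes back to the block writer."""
    def __init__(self):
        self.ctx = EncodingContext()

    def encode_block(self, kvs):
        # the encoder produces the final on-disk bytes; the block
        # writer just appends them, avoiding an extra buffer copy
        payload = b"".join(k + v for k, v in kvs)
        return self.ctx.compress(payload)

encoder = ColumnarEncoder()
out = encoder.encode_block([(b"k", b"v")])
```

A columnar encoder built on this shape could also compress each column section through its own context, enabling the lazy per-section decompression point 1) describes.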





[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-03-05 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13222512#comment-13222512
 ] 

He Yongqiang commented on HBASE-5313:
-

As part of working on HBASE-5313, we first tried to write a new 
HFileWriter/HFileReader to do it. After finishing some of the work, it became 
clear that this requires a lot of code refactoring in order to reuse as much 
existing code as possible.

Then we found that adding a new columnar encoder/decoder would be easier to 
do, and opened https://issues.apache.org/jira/browse/HBASE-5521 to do the 
encoder/decoder-specific compression work.






[jira] [Commented] (HBASE-5521) Move compression/decompression to an encoder specific encoding context

2012-03-05 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13222576#comment-13222576
 ] 

He Yongqiang commented on HBASE-5521:
-

moved the review to https://reviews.facebook.net/D2097

 Move compression/decompression to an encoder specific encoding context
 --

 Key: HBASE-5521
 URL: https://issues.apache.org/jira/browse/HBASE-5521
 Project: HBase
  Issue Type: Improvement
Reporter: He Yongqiang
Assignee: He Yongqiang
 Attachments: HBASE-5521.1.patch, HBASE-5521.D2097.1.patch







[jira] [Commented] (HBASE-5457) add inline index in data block for data which are not clustered together

2012-02-23 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214915#comment-13214915
 ] 

He Yongqiang commented on HBASE-5457:
-

@stack, we haven't thought this through in much detail, but we can start the 
discussion with an example.

Let's say there is one column family, and it contains only one type of column 
whose name is a combination of 'string and ts'. So the data is sorted by 
'string' first, but one query wants the data sorted by ts instead.

 add inline index in data block for data which are not clustered together
 

 Key: HBASE-5457
 URL: https://issues.apache.org/jira/browse/HBASE-5457
 Project: HBase
  Issue Type: New Feature
Reporter: He Yongqiang

 As we went through our data schema, we found we have one large column family 
 which just duplicates data from another column family; it is merely a re-org 
 of the data to cluster it in a different way than the original column family, 
 in order to serve another type of query efficiently.
 If we compare this second column family with the similar situation in MySQL, 
 it is like an index in MySQL. So if we could add an inline block index on the 
 required columns, the second column family would not be needed.





[jira] [Commented] (HBASE-5457) add inline index in data block for data which are not clustered together

2012-02-23 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215002#comment-13215002
 ] 

He Yongqiang commented on HBASE-5457:
-

@lars, in today's implementation we actually create another column family and 
reorganize the column name to be 'ts and string', so the data is sorted by ts 
in this new column family, and we redirect the query to use the second column 
family. But this approach duplicates data. 
Without the second column family, we could do a search once we have found the 
row, but that requires scanning all the data under the target row key, which 
hurts CPU. 
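The trade-off above can be shown with a toy sketch (illustrative Python; the qualifiers and values are hypothetical): the secondary column family stores the same cells under a reversed 'ts:string' qualifier so that a plain lexicographic sort yields ts order, at the cost of duplicating every value.

```python
# Primary column family: qualifier is 'string:ts', sorted by string.
# Secondary (duplicated) column family: qualifier is 'ts:string', sorted
# by ts. A real implementation would fixed-width-encode ts so that
# lexicographic order matches numeric order; single digits suffice here.
cells = {("apple", 3): b"a", ("pear", 1): b"b", ("zebra", 2): b"c"}

primary = sorted((f"{s}:{t}", v) for (s, t), v in cells.items())
secondary = sorted((f"{t}:{s}", v) for (s, t), v in cells.items())
```

An inline block index on ts would give the secondary ordering without materializing `secondary` at all, which is the point of this issue.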






[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-22 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214032#comment-13214032
 ] 

He Yongqiang commented on HBASE-5313:
-

As a first step, we will go ahead with a simple columnar layout 
implementation, and leave more advanced features (like a nested column layout) 
to a follow-up. 








[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-13 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207057#comment-13207057
 ] 

He Yongqiang commented on HBASE-5313:
-

bq. Can you also list the time it took writing the HFile for each of the three 
schemes ?

@Zhihong, we are still trying to explore more ideas here. Once we have a 
finalized plan, I will get the CPU/latency numbers. 

bq. Yongqiang, what is the delta encoding algorithm did you use? The default 
algorithm only do a simple encoding. Do we have results using prefix with 
fast diff algorithm for the current hfile v2?

@Jerry, I tried all three delta encodings, and Diff with HFileWriterV2 
produced the smallest file in my test. 
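For readers unfamiliar with the encodings being compared, here is a simplified analogue of the Prefix encoding in illustrative Python (the real HBase encoders also handle the timestamp and type fields, which this sketch omits): each sorted key is stored as a shared-prefix length plus the remaining suffix.

```python
def prefix_encode(keys):
    """Store each key of a sorted run as (shared-prefix length, suffix) —
    a simplified analogue of HBase's Prefix data block encoding."""
    out, prev = [], b""
    for k in keys:
        p = 0
        while p < min(len(k), len(prev)) and k[p] == prev[p]:
            p += 1
        out.append((p, k[p:]))
        prev = k
    return out

def prefix_decode(encoded):
    """Rebuild the original keys from (prefix length, suffix) pairs."""
    keys, prev = [], b""
    for p, suffix in encoded:
        k = prev[:p] + suffix
        keys.append(k)
        prev = k
    return keys

encoded = prefix_encode([b"row1/cf:a", b"row1/cf:b", b"row2/cf:a"])
```

Diff and FastDiff go further by also delta-encoding lengths and timestamps, which is why their results can differ from plain Prefix on the same data.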











[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-13 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207070#comment-13207070
 ] 

He Yongqiang commented on HBASE-5313:
-

bq. However, those compression numbers are pretty nice. I worry a little bit 
about having now an hfileV3, so soon on the heels of the last, leading to a 
proliferation of versions. My other concern is that the columnar storage 
doesn't make sense for all cases - Dremel is for a specific use case.
That being said, I would love to see the ability to do Dremel in HBase. How 
about along with a new version/columnar data support comes the ability to 
select storage files on a per-table basis? That would enable some tables to be 
optimized for certain use cases, other tables for others, rather than having to 
use completely different clusters (continuing the multi-tenancy story).

@Jesse Yates, yeah, agreed. One big thing we need to answer is how to 
integrate with the current HFile implementation; we want to reuse code as much 
as possible. I guess a nested columnar structure like Dremel's is what we 
ultimately want for HBase, but we first need to figure out a good story for 
how applications will use it.








[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-10 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205892#comment-13205892
 ] 

He Yongqiang commented on HBASE-5313:
-

@Todd, with such a small block size and the data already sorted, I was also 
thinking it will be very hard to optimize the space.

So we did some experiments by modifying today's HFileWriter. It turns out we 
can still save a lot if we play more tricks.

Here are the test results (block size is 16KB):

*42MB HFile, with Delta compression and with LZO compression* (with the 
default settings on Apache trunk)

*30MB HFile, with Columnar, with Delta compression, and with LZO compression.*

Inside one block, first put all the row keys inside that block, and apply 
delta compression and then LZO compression. After the row keys, put all the 
column-family data in that block and apply Delta+LZO to it. Then similarly 
put the column qualifiers, etc.

*24MB HFile, with Columnar, sorted value column, sorted column_qualifier 
column, and with LZO compression.*

Inside one block, first put all the row keys inside that block, and apply 
delta compression and then LZO compression. After the row keys, put all the 
column-family data in that block and apply Delta+LZO to it. Then put the 
column qualifiers, sort them, and apply Delta+LZO. The TS column and the Code 
column are processed the same way as the column family; the value column is 
processed the same way as the column qualifier. So it is the same disk format 
as the 30MB HFile, except that all the data for 'column_qualifier' and 
'value' is sorted separately.

Of the 24MB file, 6MB is used to store row keys, 7MB to store column 
qualifiers, and 6MB to store values.

More ideas are welcome! 
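The per-column layout described above can be sketched as follows, in illustrative Python with zlib standing in for Delta+LZO and length-prefixed framing as an assumed format: each field of the cells is laid out and compressed as its own column stream, then the streams are reassembled into cells on read.

```python
import struct
import zlib

FIELDS = 5  # row key, column family, qualifier, timestamp, value

def write_columnar_block(cells):
    """Lay the block out column by column, compressing each column
    stream on its own (zlib stands in for Delta+LZO here)."""
    sections = []
    for f in range(FIELDS):
        raw = b"".join(struct.pack(">I", len(c[f])) + c[f] for c in cells)
        comp = zlib.compress(raw)
        sections.append(struct.pack(">I", len(comp)) + comp)
    return b"".join(sections)

def read_columnar_block(block):
    """Decompress each column section and zip the columns back into cells."""
    cols, off = [], 0
    for _ in range(FIELDS):
        (n,) = struct.unpack_from(">I", block, off)
        raw = zlib.decompress(block[off + 4:off + 4 + n])
        col, p = [], 0
        while p < len(raw):
            (m,) = struct.unpack_from(">I", raw, p)
            col.append(raw[p + 4:p + 4 + m])
            p += 4 + m
        cols.append(col)
        off += 4 + n
    return list(zip(*cols))

cells = [(b"row1", b"cf", b"q1", b"\x00" * 8, b"v1"),
         (b"row2", b"cf", b"q2", b"\x00" * 8, b"v2")]
block = write_columnar_block(cells)
```

Grouping like-typed bytes together is what lets the compressor find the redundancy that the interleaved KV layout hides; the 30MB and 24MB variants differ only in whether the qualifier and value streams are additionally sorted.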







[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-08 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203935#comment-13203935
 ] 

He Yongqiang commented on HBASE-5313:
-

bq. I suppose we could use the value length from the key, then know we have 
nth key and by using the value length of all 1 to n-1 keys to find the value.

Yes. The value length is stored in the key header. The key header is cheap and 
can always be decompressed without a big CPU cost.
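In other words, once the value lengths are known from the key headers, the nth value's offset in the value section is just a prefix sum. A trivial sketch (the lengths are hypothetical):

```python
def value_offset(value_lengths, n):
    """Offset of the nth value (0-based) within the value section: the sum
    of the lengths of values 0..n-1, all read from the cheap key headers."""
    return sum(value_lengths[:n])

# hypothetical value lengths taken from four key headers
lengths = [5, 3, 9, 2]
```

So a scan can skip straight to any value it actually needs, without touching the bytes of the values it skips.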






[jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression

2012-02-07 Thread He Yongqiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203324#comment-13203324
 ] 

He Yongqiang commented on HBASE-5313:
-

As discussed earlier, one thing we can try is to use something like Hive's 
RCFile. The difference from Hive is that an HBase row's value is not a single 
type. If it turns out the columnar file format helps, we can employ a nested 
columnar format for the value (like what Dremel does). There is a thread on 
Quora about Dremel: 
http://www.quora.com/How-will-Googles-Dremel-change-future-Hadoop-releases.

