Re: multiple data versions vs. multiple rows?

2015-01-20 Thread yonghu
I think we need to take a look at different situations. 1. One column gets frequently updated and the others do not. If we use the row representation, we will include the unchanged data values in each tuple. This may cause large data redundancy. So, I think it can explain why in my test the multiple data

Re: multiple data versions vs. multiple rows?

2015-01-19 Thread Serega Sheypak
Should performance differ significantly if the row value size is small and we don't have too many versions? Assume that a pack of versions for a key is smaller than the recommended HFile block (8KB to 1MB https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/HFile.html), which is minimal "read

Re: multiple data versions vs. multiple rows?

2015-01-19 Thread Jean-Marc Spaggiari
Hi Yong, If you want to compare the performance, you need to run much bigger and longer tests. Don't run them in parallel. Run them at least 10 times each to make sure you have a good trend. Is the difference between the 2 significant? It should not be. JM 2015-01-19 15:17 GMT-05:00 yonghu : > Hi,
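
The advice above (run each test at least 10 times, one after the other, and look at the trend) could be sketched roughly as follows. This is only an illustration: `time_runs`, `summarize` and the `scan_fn` callable are hypothetical names that do not come from the thread, and `scan_fn` stands in for whatever launches one full scan of a table.

```python
import statistics
import time


def time_runs(scan_fn, runs=10):
    """Call scan_fn `runs` times sequentially and return the elapsed seconds of each call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        scan_fn()  # hypothetical: one full scan of the table under test
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (mean, sample standard deviation) of the timings."""
    return statistics.mean(timings), statistics.stdev(timings)
```

Timing both layouts this way, one after the other rather than in parallel, lets the two means be compared against the run-to-run spread. A gap smaller than the standard deviation is probably not significant.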

Re: multiple data versions vs. multiple rows?

2015-01-19 Thread yonghu
Hi, Thanks for your suggestion. I have already considered the first issue, that one row is not allowed to be split between 2 regions. However, I have run a small scan test with MapReduce. I first created a table t1 with 1 million rows and allowed each column to store 10 data versions. Then, I tr

Re: multiple data versions vs. multiple rows?

2015-01-19 Thread Jean-Marc Spaggiari
Hi Yong, A row will not be split between 2 regions. If you plan on having thousands of versions, then, depending on the size of your data, you might end up with a row bigger than your preferred region size. If you plan to keep just a few versions of the history to have a look at it, I would say go with it. If you pl

multiple data versions vs. multiple rows?

2015-01-19 Thread yonghu
Dear all, I want to record user history data. I know there exist two options: one is to store user events in a single row with multiple data versions, and the other is to use multiple rows. I wonder which one is better for performance? Thanks! Yong
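
To make the two options in the question concrete, here is a minimal in-memory sketch. It is not HBase client code. The dictionaries only model the two layouts, and every name in it (`put_version`, `put_row`, `MAX_TS`, the `#` separator, the `ev:type` column) is a hypothetical choice for illustration. The reversed timestamp in the composite row key is an assumption as well, used here so that the newest event sorts first.

```python
from collections import defaultdict

# Option 1: one row per user; each event becomes a new timestamped
# version of the same cell. Model: row key -> column -> {timestamp: value}.
versioned = defaultdict(lambda: defaultdict(dict))


def put_version(user_id, column, ts, value, max_versions=10):
    cell = versioned[user_id][column]
    cell[ts] = value
    # Keep only the newest max_versions entries, mimicking a per-column version limit.
    for old_ts in sorted(cell)[:-max_versions]:
        del cell[old_ts]


def history_from_versions(user_id, column):
    cell = versioned[user_id][column]
    return [cell[ts] for ts in sorted(cell, reverse=True)]  # newest first


# Option 2: one row per event, with a composite row key.
MAX_TS = 10**13  # assumed upper bound on timestamps, used to reverse the sort order
rows = {}


def put_row(user_id, column, ts, value):
    row_key = f"{user_id}#{MAX_TS - ts:013d}"  # zero-padded so keys sort lexicographically
    rows[row_key] = {column: value}


def history_from_rows(user_id, column):
    prefix = f"{user_id}#"
    # A prefix scan over sorted row keys returns the newest event first.
    return [rows[k][column] for k in sorted(rows)
            if k.startswith(prefix) and column in rows[k]]
```

In this model the first option keeps the whole history inside one row and silently drops versions beyond the limit. The second option keeps every event until it is deleted, at the cost of repeating the user id in every row key.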