See this recent thread: http://search-hadoop.com/m/DHED4pDVFG1
Before a major compaction, the query may have to fetch data from multiple
HFiles. This would be slower than fetching the data from a single file.
As for the difference in query duration, you can run the query against your
own data to get a more concrete idea.

Cheers

On Mon, Nov 10, 2014 at 7:45 AM, Bill Q wrote:

> Hi Ted,
> Thanks a lot.
>
> When would it break? Would you please give some details of why the size
> would be a decision factor?
>
> I will have probably 10 cells that have daily updates. And the rest of the
> cells in the column family will only have a handful of versions. So, the
> cells in the same column family will be very skewed in terms of version
> numbers.
>
> And before a major compaction, if I try to grab all the versions of the
> cell, will there be any performance issue? I plan to do a batch process on
> hundreds of thousands of devices with all the versions of those few cells
> pulled out.
>
> On Monday, November 10, 2014, Ted Yu wrote:
>
> > Half a million timestamps with 20 bytes each cell equate to 10MB.
> > That should be fine for your client.
> >
> > Cheers
> >
> > On Mon, Nov 10, 2014 at 7:23 AM, Bill Q wrote:
> >
> > > Hi Ted,
> > > Thanks a lot for the reply.
> > >
> > > For #1, the size of the value alone will be around 20 bytes for each
> > > cell. And there will be hundreds of thousands of timestamps per cell.
> > > But not millions. Any suggestion?
> > >
> > > Many thanks.
> > >
> > > Cao
> > >
> > > On Monday, November 10, 2014, Ted Yu wrote:
> > >
> > > > For #1, what's the expected size of data you want to store?
> > > >
> > > > For #2, the new data inserted under column:value with a newer
> > > > timestamp would be stored in a different HFile. Old and new data
> > > > would be consolidated after major compaction.
> > > >
> > > > Cheers
> > > >
> > > > On Mon, Nov 10, 2014 at 6:21 AM, Bill Q wrote:
> > > >
> > > > > Hi,
> > > > > I am designing a schema to store time series data for each device.
> > > > > And I have a couple of questions that I am not quite sure about.
> > > > >
> > > > > 1. *Is there any downside to storing the data in the same
> > > > > columnfamily:column with a long history of customized timestamps?*
> > > > >
> > > > > For example, I have historical daily data for a device. I would
> > > > > like to use only one column qualifier to store them with a custom
> > > > > timestamp, which is the date the data was collected. So, when I
> > > > > query the data I can easily pull all the time series data for this
> > > > > particular device in one scan.
> > > > >
> > > > > 2. *After a storefile is finalized and becomes immutable, what
> > > > > would happen when someone updates the row?*
> > > > >
> > > > > For example, if I insert a new column:value with a newer timestamp
> > > > > into the same row:columnfamily. Where is this new key/value pair
> > > > > going to sit in HDFS? Is it close to the previous K/V pairs in the
> > > > > storefile?
> > > > >
> > > > > Many thanks.
> > > > >
> > > > > Bill
> > >
> > > --
> > > Many thanks.
> > >
> > > Bill
>
> --
> Many thanks.
>
> Bill
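The size estimate quoted in the thread (half a million timestamps at 20 bytes each) can be checked with a few lines of arithmetic. The sketch below is only a rough illustration: the function names are made up for this example, and `key_size` is an assumed figure standing in for the row key, family, qualifier and timestamp that accompany each stored version, so the second number is indicative at best.

```python
# Back-of-the-envelope estimate of the data returned when reading every
# version of one cell. Function names and key_size are illustrative
# assumptions, not taken from the thread or from HBase itself.

def value_bytes(versions: int, value_size: int) -> int:
    """Bytes of value payload only (the figure used in the thread)."""
    return versions * value_size


def cell_bytes(versions: int, value_size: int, key_size: int) -> int:
    """Bytes when each version also carries a key of key_size bytes."""
    return versions * (value_size + key_size)


versions = 500_000   # "half a million timestamps"
value_size = 20      # bytes per value, as stated in the thread
key_size = 40        # ASSUMPTION: row key + family + qualifier + timestamp

print(value_bytes(versions, value_size) / 1_000_000, "MB of values")
print(cell_bytes(versions, value_size, key_size) / 1_000_000, "MB with assumed keys")
```

With these inputs the value-only figure matches the 10MB mentioned above. The per-version key overhead depends on the actual row key and qualifier lengths, so measure on real data before relying on the second figure.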
