I'm not sure if you're eliding this fact or not, but you'd be much better off if you used a fixed-width format for your keys. So in your example, you'd have:
PATTERN: source(4-byte int).type(4-byte int or smaller).fixed 128-bit hash.8-byte timestamp
EXAMPLE: \x00\x00\x00\x01\x00\x00\x02\x03....

The advantage of this is not only that it's significantly less data (remember, your key is stored on each KeyValue), but also that you can now use FuzzyRowFilter and other techniques to quickly perform scans. The disadvantage is that you have to normalize source -> integer, but I find I can either store that mapping in an enum or cache it for a long time, so it's not a big issue.

-Mike

On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <[email protected]> wrote:

> Thank you very much for the great support!
> This is how I thought to design my key:
>
> PATTERN: source|type|qualifier|hash(name)|timestamp
> EXAMPLE: google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
>
> Do you think my key could be good for my scope (my search will be
> essentially by source or by source|type)?
> Another point is that initially I will not have many sources, so I will
> probably have only google|*, but in later phases there could be more
> sources.
>
> Best,
> Flavio
>
> On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <[email protected]> wrote:
>
>> For #1, yes - the client receives less data after filtering.
>>
>> For #2, please take a look at TestMultiVersions
>> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in 0.94)
>> for the time range:
>>
>>     scan = new Scan();
>>     scan.setTimeRange(1000L, Long.MAX_VALUE);
>>
>> For row key selection, you need a filter. Take a look at
>> FuzzyRowFilter.java
>>
>> Cheers
>>
>> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <[email protected]> wrote:
>>
>>> Thanks for the reply! I thus have two more questions:
>>>
>>> 1) Is it true that filtering on timestamps doesn't affect performance?
>>> 2) Could you send me a little snippet of how you would do such a filter
>>> (by row key + timestamps)?
>>> For example, get all rows whose key starts with
>>> 'someid-' and whose timestamp is greater than some given timestamp?
>>>
>>> Best,
>>> Flavio
>>>
>>> On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <[email protected]> wrote:
>>>
>>>> bq. Using timestamp in row-keys is discouraged
>>>>
>>>> The above is true.
>>>> Prefixing the row key with a timestamp would create a hot region.
>>>>
>>>> bq. should I filter by a simpler row-key plus a filter on timestamp?
>>>>
>>>> You can do the above.
>>>>
>>>> On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <[email protected]> wrote:
>>>>
>>>>> Hi everybody,
>>>>>
>>>>> in my use case I have to perform batch analysis skipping old data.
>>>>> For example, I want to process all rows created after a certain
>>>>> timestamp, passed as a parameter.
>>>>>
>>>>> What is the most effective way to do this?
>>>>> Should I design my row-key to embed the timestamp?
>>>>> Or is filtering by the timestamp of the row just as fast? Or what else?
>>>>>
>>>>> Initially I was thinking to compose my key as:
>>>>> timestamp|source|title|type
>>>>>
>>>>> but:
>>>>>
>>>>> 1) Using timestamp in row-keys is discouraged.
>>>>> 2) If this design is ok, using this approach I still have problems
>>>>> filtering by timestamp, because I cannot find a way to filter
>>>>> numerically (instead of alphanumerically/by string). Example:
>>>>> 1372776400441|something has a smaller timestamp than
>>>>> 1372778470913|somethingelse, but I cannot filter all rows whose key is
>>>>> "numerically" greater than 1372776400441. Is it possible to overcome
>>>>> this issue?
>>>>> 3) If this design is not ok, should I filter by a simpler row-key plus
>>>>> a filter on timestamp? Or what else?
>>>>>
>>>>> Best,
>>>>> Flavio
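[Editor's note] Mike's fixed-width layout can be sketched in plain Java with no HBase dependency. This is a minimal sketch, assuming MD5 for the 128-bit hash and assuming the source/type strings have already been normalized to integer ids; the class and method names (`FixedWidthKey`, `encode`) are illustrative, not from the thread:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Layout: 4-byte source id | 4-byte type id | 16-byte (128-bit) hash | 8-byte timestamp
public class FixedWidthKey {

    static final int KEY_LENGTH = 4 + 4 + 16 + 8; // every key is exactly 32 bytes

    static byte[] encode(int sourceId, int typeId, String name, long timestamp) {
        ByteBuffer buf = ByteBuffer.allocate(KEY_LENGTH);
        buf.putInt(sourceId);   // normalized source -> int (big-endian, so byte
        buf.putInt(typeId);     // order matches numeric order for non-negative ids)
        buf.put(md5(name));     // fixed 128-bit hash of the free-form name
        buf.putLong(timestamp); // 8-byte timestamp
        return buf.array();
    }

    static byte[] md5(String s) { // assumption: MD5 as the 128-bit hash
        try {
            return MessageDigest.getInstance("MD5")
                                .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] key = encode(1, 515, "oven", 1372837702753L);
        System.out.println("key length = " + key.length); // always 32
        // Because every field is fixed-width, source (bytes 0-3) and type
        // (bytes 4-7) always sit at the same offsets, which is what makes
        // prefix scans and FuzzyRowFilter masks possible.
    }
}
```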

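[Editor's note] On the scan side, Ted's two pointers combine: `scan.setTimeRange(minTs, Long.MAX_VALUE)` restricts by cell timestamp, while a FuzzyRowFilter (built from pattern/mask byte-array pairs) restricts by row-key positions. To illustrate the mask semantics without an HBase dependency, here is a hedged sketch of the matching rule only (in the 0.94-era API a mask byte of 0 means "this position must equal the pattern byte" and 1 means "any byte"); this is not the HBase implementation itself:

```java
// Dependency-free illustration of the FuzzyRowFilter matching idea:
// with the 32-byte fixed-width key above, you could fix the 4-byte source
// (or the 8-byte source+type prefix) and wildcard everything after it.
public class FuzzyMatchSketch {

    // Returns true if 'row' equals 'pattern' at every position where mask[i] == 0.
    static boolean fuzzyMatch(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Fix the first 4 bytes (source id = 1), wildcard the rest.
        byte[] pattern = {0, 0, 0, 1, 0, 0, 0, 0};
        byte[] mask    = {0, 0, 0, 0, 1, 1, 1, 1}; // 0 = fixed, 1 = any

        byte[] rowA = {0, 0, 0, 1, 9, 9, 9, 9};    // source 1 -> matches
        byte[] rowB = {0, 0, 0, 2, 9, 9, 9, 9};    // source 2 -> rejected
        System.out.println(fuzzyMatch(rowA, pattern, mask)); // true
        System.out.println(fuzzyMatch(rowB, pattern, mask)); // false
    }
}
```

In real HBase client code the equivalent pattern/mask pair would be handed to `new FuzzyRowFilter(...)` (it takes a list of `Pair<byte[], byte[]>`) and set on the same `Scan` that carries `setTimeRange`, so the region server does both prunings before anything reaches the client.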