Jon,

Short answer: no.

In RDBMS parlance, Accumulo has a single index: the "row" portion of the Key class. That is why putting the timestamp in the row id is the "standard practice" you mention. Any attempt to fetch data based on another component of the key (ignoring locality groups/column family subtleties) is an exhaustive scan of your dataset.
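
To make that concrete, here is a rough sketch in plain Java. It is a toy model, not Accumulo code: a TreeMap stands in for a table sorted by row, and the class and method names (RowIndexSketch, rowsInRange, rowsNewerThan) are made up for illustration. A lookup by row can seek into the sorted data, while a lookup by timestamp has to visit every entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a TreeMap keyed by row stands in for a table sorted by row.
// None of this is Accumulo API; all names are hypothetical.
class RowIndexSketch {

    // Lookup by row: the sorted map seeks to startRow and stops after endRow.
    static List<String> rowsInRange(TreeMap<String, Long> table,
                                    String startRow, String endRow) {
        return new ArrayList<>(table.subMap(startRow, true, endRow, true).keySet());
    }

    // Lookup by timestamp: nothing is sorted by it, so every entry is visited.
    static List<String> rowsNewerThan(TreeMap<String, Long> table, long minTimestamp) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Long> entry : table.entrySet()) {
            if (entry.getValue() > minTimestamp) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, Long> table = new TreeMap<>();
        table.put("row_a", 300L); // value = timestamp of the entry
        table.put("row_b", 100L);
        table.put("row_c", 200L);

        System.out.println(rowsInRange(table, "row_a", "row_b"));
        System.out.println(rowsNewerThan(table, 150L));
    }
}
```

The second method is the shape of any timestamp-based query against your existing table: its cost grows with the size of the whole dataset, not with the number of matches.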

If you are going to support this application for any length of time, it is a good idea to pay the penalty once and rewrite your old data into the new format, so that all of your queries are fast from then on. If you have so much data that you want to avoid running a large MapReduce job, you likely won't want to make your users wait for a read of all of that data to answer every query :)
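
As a sketch of what a "new format" could look like (an assumption on my part, not something your schema dictates): put a reversed, zero-padded timestamp at the front of the row id so that the newest entries sort first. The layout, the "_" separator and the names TimeOrderedRowId and rowId are all hypothetical.

```java
import java.util.TreeMap;

// Hypothetical row id layout: <19-digit reversed timestamp>_<original row id>.
// Assumes non-negative timestamps (e.g. epoch millis).
class TimeOrderedRowId {

    // Long.MAX_VALUE has 19 digits, so padding to 19 keeps string order
    // identical to numeric order. Reversing makes newer entries sort first.
    static String rowId(long timestampMillis, String originalRow) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return String.format("%019d_%s", reversed, originalRow);
    }

    public static void main(String[] args) {
        // A TreeMap stands in for a table sorted by row id.
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowId(1000L, "sensor1"), "old reading");
        table.put(rowId(3000L, "sensor2"), "newest reading");
        table.put(rowId(2000L, "sensor1"), "newer reading");

        // The first key in sorted order belongs to the most recent entry.
        System.out.println(table.firstEntry().getValue());
    }
}
```

With a layout like this, "most recent data" becomes a read from the start of the table rather than a pass over all of it.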

Does that make sense?

- Josh

Parise, Jonathan wrote:
Hi,

I was wondering if there is a way to scan a table based on the
timestamps. For example, is there a way to set a range based on the
timestamp portion of the key?

I know that standard practice is to add a timestamp as part of the row
id, but in this particular case I probably cannot use that technique.
The reason I can’t use it is that I need to find the most recent data in
a preexisting Accumulo instance. Not all of the information was stored
with timestamps appended to the row id. I can’t go back and change
the data, I just have to work with what is there.

So, given a large amount of preexisting data without time information in
the row id, column family or column qualifier, how would you scan for
the most recent data?

Specifically, is there any way to scan/sort by the timestamp portion of
the key? I did not see any way to make a Range with times.

I also really do not want to run a job over all the data to make a new
copy of the table that is sorted. I have a lot of data here and such a
replication would take a very long time.

Thanks,

Jon
