Have you looked at SortedKeyValueFile? https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.html
This may already provide what you need.

Doug

On Dec 3, 2014 10:14 PM, "Joey Echeverria" <[email protected]> wrote:
> It sounds feasible to me. You can certainly seek to a specific sync
> marker, and as long as you're periodically calling sync to get the last
> position, you can save those offsets in a separate file (or files) that
> you load into memory or search sequentially.
>
> This sounds very similar to MapFiles, which used a pair of
> SequenceFiles: one with the data and one with an index of every Nth
> key to speed up lookups of sorted data.
>
> -Joey
>
> On Wed, Dec 3, 2014 at 6:06 PM, Ken Krugler <[email protected]> wrote:
> > Hi all,
> >
> > I'm looking for suggestions on how to optimize a number of Hadoop jobs
> > (written using Cascading) that only need a fraction of the records stored
> > in Avro files.
> >
> > Essentially I have a small number (let's say 10K) of essentially random
> > keys out of a total of 100M unique values, and I need to select & process
> > all and only those records in my Avro files where the key field matches.
> > The set of keys that are of interest changes with each run.
> >
> > I have about 1TB of compressed data to scan through, saved as about 200
> > 5GB files. This represents about 10B records.
> >
> > The data format has to stay as Avro, for interchange with various groups.
> >
> > As I'm building the Avro files, I could sort by the key field.
> >
> > I'm wondering if it's feasible to build a skip table that would let me
> > seek to a sync position in the Avro file and read from it. If the default
> > sync interval is 16K, then I'd have 65M of these that I could use, and
> > even if every key of interest had 100 records that were each in a
> > separate block, this would still dramatically cut down on the amount of
> > data I'd have to scan over.
> >
> > But is that possible? Any input would be appreciated.
> >
> > Thanks,
> >
> > -- Ken
> >
> > --------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
>
> --
> Joey Echeverria
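The skip-table idea Joey and Ken describe can be sketched with a sorted map: every Nth key goes into an index alongside the byte offset that `DataFileWriter.sync()` would return at that point, and a floor lookup then gives the sync position to pass to `DataFileReader.seek()` before scanning forward. The sketch below is a minimal, hypothetical model of just the index part — it uses plain longs in place of real Avro writer/reader calls so it runs standalone; the class and method names are illustrative, not from any library.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of a MapFile-style skip table over a key-sorted Avro file.
// Writing side: every Nth record, store (key, syncPos), where syncPos would
// come from DataFileWriter.sync(). Reading side: floorEntry(key) gives the
// last indexed sync point at or before the key, i.e. where a sequential
// scan should start after DataFileReader.seek(syncPos).
public class SyncIndex {
    private final TreeMap<String, Long> index = new TreeMap<>();

    // Record an index entry; called for every Nth key while writing.
    public void add(String key, long syncPos) {
        index.put(key, syncPos);
    }

    // Latest recorded sync position whose key sorts <= 'key';
    // -1 means the key sorts before the first indexed entry
    // (so the scan starts at the beginning of the file).
    public long seekPosition(String key) {
        Map.Entry<String, Long> e = index.floorEntry(key);
        return e == null ? -1L : e.getValue();
    }

    public static void main(String[] args) {
        SyncIndex idx = new SyncIndex();
        idx.add("apple", 0L);
        idx.add("mango", 16384L);
        idx.add("zebra", 32768L);
        // "melon" sorts between "mango" and "zebra", so the scan
        // starts at the sync point recorded for "mango".
        System.out.println(idx.seekPosition("melon"));    // 16384
        System.out.println(idx.seekPosition("aardvark")); // -1
    }
}
```

With 65M sync points in 1TB of data, a full in-memory index of every sync would be large, but indexing every Nth key (as MapFile does) keeps it small at the cost of scanning at most N records per lookup.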
