It sounds feasible to me. You can certainly seek to a specific sync marker, and as long as you periodically call sync() while writing to capture the current position, you can save those offsets in a separate file (or files) that you either load into memory or search sequentially.

This sounds very similar to MapFiles, which used a pair of SequenceFiles, one with the data and one with an index of every Nth key, to speed up lookups of sorted data.
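For what it's worth, a rough (untested) sketch of that pattern with Avro's Java generic API might look like the code below. The two-field schema, the every-1000th-record INDEX_INTERVAL, and the process() hook are placeholders I made up, not anything from Ken's actual setup, and a real index would presumably be persisted to a side file rather than kept in a TreeMap. The key point is that DataFileWriter.sync() forces a block boundary and returns a position you can later hand to DataFileReader.seek().

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SyncIndexSketch {

  // Placeholder schema: a string key plus a payload field.
  static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
      + "{\"name\":\"key\",\"type\":\"string\"},"
      + "{\"name\":\"value\",\"type\":\"string\"}]}");

  static final int INDEX_INTERVAL = 1000; // index every Nth record (made up)

  // Write records (already sorted by key) and build a key -> sync-position index.
  static TreeMap<String, Long> write(File file, Iterable<GenericRecord> sorted)
      throws IOException {
    TreeMap<String, Long> index = new TreeMap<>();
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
      writer.create(SCHEMA, file);
      long count = 0;
      for (GenericRecord rec : sorted) {
        if (count % INDEX_INTERVAL == 0) {
          // Force a block boundary; sync() returns a position that can later
          // be passed to DataFileReader.seek(). The record appended next is
          // the first record of the new block, so index its key.
          long pos = writer.sync();
          index.put(rec.get("key").toString(), pos);
        }
        writer.append(rec);
        count++;
      }
    }
    return index; // persist this however you like, e.g. as its own Avro file
  }

  // Seek near each wanted key, then scan forward until we pass it.
  static void lookup(File file, TreeMap<String, Long> index, List<String> wantedSorted)
      throws IOException {
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(SCHEMA))) {
      for (String wanted : wantedSorted) {
        // The largest indexed key <= wanted tells us which block to start in.
        Map.Entry<String, Long> entry = index.floorEntry(wanted);
        reader.seek(entry == null ? index.firstEntry().getValue() : entry.getValue());
        while (reader.hasNext()) {
          GenericRecord rec = reader.next();
          int cmp = rec.get("key").toString().compareTo(wanted);
          if (cmp > 0) break;          // data is sorted, so we've passed it
          if (cmp == 0) process(rec);  // a match; keep going, keys may repeat
        }
      }
    }
  }

  static void process(GenericRecord rec) {
    System.out.println(rec); // stand-in for the real per-record work
  }
}

Since both the data and the lookup keys are sorted, each seek lands at the start of the one block that could contain the key, and the scan stops as soon as it passes it, so only a tiny fraction of the file gets decoded.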
-Joey

On Wed, Dec 3, 2014 at 6:06 PM, Ken Krugler <[email protected]> wrote:
> Hi all,
>
> I'm looking for suggestions on how to optimize a number of Hadoop jobs
> (written using Cascading) that only need a fraction of the records stored
> in Avro files.
>
> I have a small number (let's say 10K) of essentially random keys out of a
> total of 100M unique values, and I need to select & process all and only
> those records in my Avro files where the key field matches. The set of
> keys that are of interest changes with each run.
>
> I have about 1TB of compressed data to scan through, saved as about 200
> 5GB files. This represents about 10B records.
>
> The data format has to stay as Avro, for interchange with various groups.
>
> As I'm building the Avro files, I could sort by the key field.
>
> I'm wondering if it's feasible to build a skip table that would let me
> seek to a sync position in the Avro file and read from it. If the default
> sync interval is 16K, then I'd have 65M of these that I could use, and
> even if every key of interest had 100 records that were each in a separate
> block, this would still dramatically cut down on the amount of data I'd
> have to scan over.
>
> But is that possible? Any input would be appreciated.
>
> Thanks,
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>

--
Joey Echeverria
