Have you looked at SortedKeyValueFile?

https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.html

This may already provide what you need.
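
For reference, a minimal sketch of the write/lookup pattern, written from
memory of the javadoc example rather than tested code -- the path, the
index interval, and the string value schema are made up, and in practice
the value schema would be your record schema:

import org.apache.avro.Schema;
import org.apache.avro.hadoop.file.SortedKeyValueFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class SkvfExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Schema keySchema = Schema.create(Schema.Type.STRING);
    Schema valueSchema = Schema.create(Schema.Type.STRING);

    // Writes go to a directory holding a "data" file plus a small "index"
    // file with every Nth key; keys must be appended in sorted order.
    SortedKeyValueFile.Writer<CharSequence, CharSequence> writer =
        new SortedKeyValueFile.Writer<CharSequence, CharSequence>(
            new SortedKeyValueFile.Writer.Options()
                .withConfiguration(conf)
                .withKeySchema(keySchema)
                .withValueSchema(valueSchema)
                .withPath(new Path("/tmp/sorted-kv"))
                .withIndexInterval(128));
    writer.append("key-0001", "value-0001");
    writer.append("key-0002", "value-0002");
    writer.close();

    // Lookups load the index into memory, seek into the data file near
    // the requested key, and scan forward to it.
    SortedKeyValueFile.Reader<CharSequence, CharSequence> reader =
        new SortedKeyValueFile.Reader<CharSequence, CharSequence>(
            new SortedKeyValueFile.Reader.Options()
                .withConfiguration(conf)
                .withKeySchema(keySchema)
                .withValueSchema(valueSchema)
                .withPath(new Path("/tmp/sorted-kv")));
    System.out.println(reader.get("key-0002"));
    reader.close();
  }
}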

Doug
On Dec 3, 2014 10:14 PM, "Joey Echeverria" <[email protected]> wrote:

> It sounds feasible to me. You can certainly seek to a specific sync
> marker, and as long as you periodically call sync() to get the last
> position, you can save those offsets in a separate file (or files) that
> you load into memory or search sequentially.
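>
> A rough, untested sketch of the write side (the field name "key", the
> 10K index interval, and the class name are made up for illustration):
>
> import java.io.File;
> import java.util.List;
> import java.util.TreeMap;
> import org.apache.avro.Schema;
> import org.apache.avro.file.DataFileWriter;
> import org.apache.avro.generic.GenericDatumWriter;
> import org.apache.avro.generic.GenericRecord;
>
> public class BlockIndexWriter {
>   // Writes records (already sorted by their "key" field) and builds a map
>   // of first-key-in-block -> sync position, one entry per ~10K records.
>   static TreeMap<String, Long> write(Schema schema, List<GenericRecord> sorted,
>                                      File out) throws Exception {
>     TreeMap<String, Long> blockIndex = new TreeMap<String, Long>();
>     DataFileWriter<GenericRecord> writer =
>         new DataFileWriter<GenericRecord>(
>             new GenericDatumWriter<GenericRecord>(schema)).create(schema, out);
>     int sinceLastMark = 0;
>     for (GenericRecord rec : sorted) {
>       if (sinceLastMark == 0) {
>         // sync() ends the current block and returns a position that can
>         // later be passed to DataFileReader.seek()
>         long pos = writer.sync();
>         blockIndex.put(rec.get("key").toString(), pos);
>       }
>       writer.append(rec);
>       sinceLastMark = (sinceLastMark + 1) % 10000;
>     }
>     writer.close();
>     return blockIndex;   // persist this to a side file for use at query time
>   }
> }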
>
> This sounds very similar to MapFiles, which use a pair of
> SequenceFiles: one with the data and one with an index of every Nth
> key to speed up lookups of sorted data.
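>
> The read side would then be the Avro analog of the MapFile index
> lookup: find the nearest index entry at or before each wanted key,
> seek there, and scan forward. Again a rough sketch, assuming the side
> index built above and the same hypothetical "key" field:
>
> import java.io.File;
> import java.util.Map;
> import java.util.NavigableMap;
> import org.apache.avro.file.DataFileReader;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericRecord;
>
> public class BlockIndexLookup {
>   // Prints records whose "key" field equals 'wanted'; relies on the data
>   // being sorted by key so the scan can stop once it passes the target.
>   static void lookup(File avroFile, NavigableMap<String, Long> blockIndex,
>                      String wanted) throws Exception {
>     Map.Entry<String, Long> start = blockIndex.floorEntry(wanted);
>     if (start == null) return;             // wanted sorts before the first key
>     DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
>         avroFile, new GenericDatumReader<GenericRecord>());
>     reader.seek(start.getValue());         // jump to the recorded sync point
>     while (reader.hasNext()) {
>       GenericRecord rec = reader.next();
>       int cmp = rec.get("key").toString().compareTo(wanted);
>       if (cmp > 0) break;                  // past the target -- file is sorted
>       if (cmp == 0) System.out.println(rec);
>     }
>     reader.close();
>   }
> }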
>
> -Joey
>
> On Wed, Dec 3, 2014 at 6:06 PM, Ken Krugler <[email protected]>
> wrote:
> > Hi all,
> >
> > I'm looking for suggestions on how to optimize a number of Hadoop jobs
> > (written using Cascading) that only need a fraction of the records
> > stored in Avro files.
> >
> > I have a small number (let's say 10K) of essentially random keys out
> > of a total of 100M unique values, and I need to select & process all
> > and only those records in my Avro files where the key field matches.
> > The set of keys that are of interest changes with each run.
> >
> > I have about 1TB of compressed data to scan through, saved as about
> > 200 5GB files. This represents about 10B records.
> >
> > The data format has to stay as Avro, for interchange with various groups.
> >
> > As I'm building the Avro files, I could sort by the key field.
> >
> > I'm wondering if it's feasible to build a skip table that would let
> > me seek to a sync position in the Avro file and read from it. If the
> > default sync interval is 16K, then I'd have 65M of these that I could
> > use, and even if every key of interest had 100 records that were each
> > in a separate block, this would still dramatically cut down on the
> > amount of data I'd have to scan over.
> >
> > But is that possible? Any input would be appreciated.
> >
> > Thanks,
> >
> > -- Ken
> >
> > --------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
> >
> >
> >
> >
> >
>
>
>
> --
> Joey Echeverria
>
