It sounds feasible to me. You can certainly seek to a specific sync
marker, and as long as you periodically call sync() while writing to
get the last position, you can save those offsets in a separate file
that you load into memory or search sequentially.
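
As a rough sketch (untested, and assuming records are written sorted by
a string field I'll call "key" -- the class, field, and helper names are
just placeholders), it could look something like this with the Java API:

import java.io.File;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSkipTable {

  // While writing records sorted by "key", force a sync marker every
  // indexInterval records and remember (first key of block -> offset).
  static TreeMap<String, Long> writeWithIndex(Schema schema,
      Iterable<GenericRecord> sortedRecords, File out, int indexInterval)
      throws IOException {
    TreeMap<String, Long> index = new TreeMap<>();
    try (DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, out);
      long n = 0;
      for (GenericRecord rec : sortedRecords) {
        if (n++ % indexInterval == 0) {
          // sync() ends the current block and returns a position that
          // can later be passed to DataFileReader.seek()
          index.put(rec.get("key").toString(), writer.sync());
        }
        writer.append(rec);
      }
    }
    return index; // in practice, persist this to a side file
  }

  // Seek to the last indexed offset <= key, then scan forward.
  static GenericRecord lookup(File avroFile, TreeMap<String, Long> index,
      String key) throws IOException {
    Map.Entry<String, Long> entry = index.floorEntry(key);
    if (entry == null) {
      return null; // key sorts before the first indexed record
    }
    try (DataFileReader<GenericRecord> reader =
        new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
      reader.seek(entry.getValue());
      while (reader.hasNext()) {
        GenericRecord rec = reader.next();
        int cmp = rec.get("key").toString().compareTo(key);
        if (cmp == 0) {
          return rec;
        }
        if (cmp > 0) {
          break; // sorted data, so we've gone past it
        }
      }
    }
    return null;
  }
}

Then for each run you'd sort your 10K keys and do one seek plus a short
scan per key (or per group of keys that land in the same block).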

This sounds very similar to MapFiles, which use a pair of
SequenceFiles: one with the data and one with an index of every Nth
key to speed up lookups of sorted data.
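
For reference, the MapFile lookup side looks roughly like this
(assuming Hadoop 2.x and Text keys/values; the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Point at a MapFile directory (holds "data" and "index" SequenceFiles).
Configuration conf = new Configuration();
MapFile.Reader reader = new MapFile.Reader(new Path("/path/to/mapfile"), conf);
try {
  Text value = new Text();
  // get() binary-searches the in-memory index, seeks into the data
  // file, and scans forward; it returns null if the key isn't present.
  if (reader.get(new Text("someKey"), value) != null) {
    System.out.println(value);
  }
} finally {
  reader.close();
}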

-Joey

On Wed, Dec 3, 2014 at 6:06 PM, Ken Krugler <[email protected]> wrote:
> Hi all,
>
> I'm looking for suggestions on how to optimize a number of Hadoop jobs
> (written using Cascading) that only need a fraction of the records stored in
> Avro files.
>
> Essentially, I have a small number (let's say 10K) of more or less random keys
> out of a total of 100M unique values, and I need to select & process all and
> only those records in my Avro files where the key field matches. The set of
> keys of interest changes with each run.
>
> I have about 1TB of compressed data to scan through, saved as roughly 200
> files of 5GB each. This represents about 10B records.
>
> The data format has to stay as Avro, for interchange with various groups.
>
> As I'm building the Avro files, I could sort by the key field.
>
> I'm wondering if it's feasible to build a skip table that would let me seek
> to a sync position in the Avro file and read from it. If the default sync
> interval is 16K, then I'd have about 65M of these sync points that I could
> use, and even if every key of interest had 100 records that were each in a
> separate block, this would still dramatically cut down on the amount of data
> I'd have to scan over.
>
> But is that possible? Any input would be appreciated.
>
> Thanks,
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr



-- 
Joey Echeverria
