I did, but I am not sure whether it allows you to tokenize the Value (when
it is Text or bytearray). For example, say a record in my file is

ABC^GDEF^GGHI^GJKL^GMNO

And I want to be able to tokenize by ^G and access the resulting fields
positionally (as done with PigStorage). Is it possible to do this with the
elephant-bird version?

A = LOAD 'data' using SequenceFileLoader('\u0007');
B = FOREACH A GENERATE $2, $3;
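
To make the intent concrete, here is a minimal stand-alone Java sketch of the
byte-level tokenizing I have in mind. It is only an illustration: the class
name DelimitedValueSketch and the method tokenize are hypothetical and are not
part of Pig, Piggybank, or Elephant-Bird. It indexes into the Value alone; if
a loader also puts the Key in the tuple, the positions would shift by one.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Pig, Piggybank, or Elephant-Bird.
class DelimitedValueSketch {

    // Splits the first len bytes of buf on a single-byte delimiter.
    // Empty fields are kept so that positions stay stable.
    static List<String> tokenize(byte[] buf, int len, byte fieldDel) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == fieldDel) {
                fields.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // The field after the last delimiter (possibly empty).
        fields.add(new String(buf, start, len - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        byte ctrlG = 0x07; // ^G, i.e. '\u0007'
        byte[] record = "ABC\u0007DEF\u0007GHI\u0007JKL\u0007MNO"
                .getBytes(StandardCharsets.UTF_8);
        List<String> fields = tokenize(record, record.length, ctrlG);
        // Positions 2 and 3 of the tokenized Value.
        System.out.println(fields.get(2) + ", " + fields.get(3));
    }
}
```

For the sample record above, positions 2 and 3 of the Value are GHI and JKL.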

In any case, it would be good to have it as part of the Piggybank version
of SequenceFileLoader. I see you were the original author of it!

-Prashant

On Wed, Dec 14, 2011 at 3:44 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Prashant, check out the much more sophisticated version of
> SequenceFileLoader in Elephant-Bird.
>
> D
>
> On Wed, Dec 14, 2011 at 3:12 PM, Prashant Kommireddi <[email protected]> wrote:
>
> > I see a lot of cases where users store data as SequenceFiles and would
> > like to parse the Value (ONLY if Text/DataByteArray) similar to
> > PigStorage(String delim).
> >
> > For example, let's say the logs are delimited by ^G and one would like to
> > read them from SequenceFiles. Currently, this involves 2 operations (1.
> > Read from the SequenceFile and 2. Parse by passing this data to a UDF that
> > does the parsing).
> > We could enhance the SequenceFileLoader to do this internally by
> > providing a constructor similar to PigStorage(String delim) -
> > SequenceFileLoader(String delim). The default constructor would still
> > perform the same way it does now.
> >
> >  // returned tuple contains both Key and Value (tokenized by delimiter).
> >    public SequenceFileLoader(String delimiter) {
> >        this.isDelimSpecified = true;
> >        this.fieldDel = StorageUtil.parseFieldDel(delimiter);
> >        mProtoTuple = new ArrayList<Object>();
> >    }
> >
> >
> >  @Override
> >    public Tuple getNext() throws IOException {
> >   //foo bar
> >   .
> >   .
> >   .
> >
> > if (isDelimSpecified && valType == DataType.CHARARRAY) {
> >            Text val = (Text)value;
> >            int len = val.getLength();
> >            byte[] buf = val.getBytes();
> >            parseValue(buf, len);
> >        } else if (isDelimSpecified && valType == DataType.BYTEARRAY) {
> >            // Tokenize value by delimiter when Value class
> >            // type is DataByteArray
> >            DataByteArray val = (DataByteArray)value;
> >            parseValue(val.get(), val.size());
> >        } else {
> >            // Add the value without tokenizing if delimiter is not
> >            // specified OR value class type is not Text or DataByteArray
> >            mProtoTuple.add(translateWritableToPigDataType(value, valType));
> >        }
> >
> > .
> > .
> > .
> > }
> >
> > //From PigStorage
> > private void parseValue(byte[] buf, int len) {
> >        int start = 0;
> >        for (int i = 0; i < len; i++) {
> >            if (buf[i] == fieldDel) {
> >                readField(buf, start, i);
> >                start = i + 1;
> >            }
> >        }
> >        // Store the field after last delimiter occurs
> >        if (start <= len) {
> >            readField(buf, start, len);
> >        }
> >    }
> >
> >
> > Thoughts?
> >
> > -Prashant
> >
>
