You can provide arbitrary converters to the EB (Elephant-Bird) one. It's not
as simple as just "\u0007", but it's far more extensible (and you can make a
StringTokenizer converter that takes an argument).

the SeqFileLoader in piggybank was an exercise in "here's how you might go
about doing this" rather than a real loader...
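
To make the converter idea concrete, here is a minimal standalone sketch of a
value converter that takes the delimiter as an argument. The ValueConverter
interface and the DelimitedTextConverter class are hypothetical names invented
for illustration; they are not Elephant-Bird's actual converter API. The sketch
scans for the delimiter by hand rather than using java.util.StringTokenizer,
because StringTokenizer skips empty fields, which would shift positional access.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: a stand-in for a pluggable value converter.
// This is NOT Elephant-Bird's real API, only an illustration of the idea.
interface ValueConverter {
    List<String> convert(String value);
}

// Hypothetical converter that receives the delimiter as a constructor argument.
class DelimitedTextConverter implements ValueConverter {
    private final char delimiter;

    DelimitedTextConverter(char delimiter) {
        this.delimiter = delimiter;
    }

    @Override
    public List<String> convert(String value) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) == delimiter) {
                // Empty fields are kept so that positions stay stable.
                fields.add(value.substring(start, i));
                start = i + 1;
            }
        }
        // Field after the last delimiter (may be empty).
        fields.add(value.substring(start));
        return fields;
    }
}

class ConverterDemo {
    public static void main(String[] args) {
        ValueConverter converter = new DelimitedTextConverter('\u0007');
        List<String> fields =
            converter.convert("ABC\u0007DEF\u0007GHI\u0007JKL\u0007MNO");
        // Positional access, comparable to $2 and $3 in Pig.
        System.out.println(fields.get(2) + "," + fields.get(3)); // prints GHI,JKL
    }
}
```

Passing the delimiter through the constructor keeps the loader itself
unchanged; a different value format only needs a different converter.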

D

On Wed, Dec 14, 2011 at 3:59 PM, Prashant Kommireddi <[email protected]> wrote:

> I did, but I am not sure if it allows you to tokenize the Value (when Text
> or bytearray). For example, say a record in my file is
>
> ABC^GDEF^GGHI^GJKL^GMNO
>
> And I want to be able to tokenize by ^G and access this value positionally
> (as done with PigStorage). Is it possible to do it with the elephant-bird
> version?
>
> A = LOAD 'data' using SequenceFileLoader('\u0007');
> B = FOREACH A GENERATE $2, $3;
>
> In any case, it would be good to have it as part of the Piggybank version
> of SequenceFileLoader. I see you were the original author of it!
>
> -Prashant
>
> On Wed, Dec 14, 2011 at 3:44 PM, Dmitriy Ryaboy <[email protected]>
> wrote:
>
> > Prashant, check out the much more sophisticated version of
> > SequenceFileLoader in Elephant-Bird.
> >
> > D
> >
> > On Wed, Dec 14, 2011 at 3:12 PM, Prashant Kommireddi <[email protected]>
> > wrote:
> >
> > > I see a lot of cases where users store data as SequenceFiles and would
> > > like to parse the Value (ONLY if Text/DataByteArray), similar to
> > > PigStorage(String delim).
> > >
> > > For example, let's say the logs are delimited by ^G and one would like
> > > to read them from SequenceFiles. Currently, this involves 2 operations
> > > (1. read from the SequenceFile and 2. parse by passing this data to a
> > > UDF that does the parsing).
> > > We could enhance the SequenceFileLoader to do this internally by
> > > providing a constructor similar to PigStorage(String delim):
> > > SequenceFileLoader(String delim). The default constructor would still
> > > behave the same way it does now.
> > >
> > >    // Returned tuple contains both Key and Value (tokenized by delimiter).
> > >    public SequenceFileLoader(String delimiter) {
> > >        this.isDelimSpecified = true;
> > >        this.fieldDel = StorageUtil.parseFieldDel(delimiter);
> > >        mProtoTuple = new ArrayList<Object>();
> > >    }
> > >
> > >
> > >  @Override
> > >    public Tuple getNext() throws IOException {
> > >   //foo bar
> > >   .
> > >   .
> > >   .
> > >
> > >        if (isDelimSpecified && valType == DataType.CHARARRAY) {
> > >            // Tokenize value by delimiter when Value class type is Text
> > >            Text val = (Text)value;
> > >            int len = val.getLength();
> > >            byte[] buf = val.getBytes();
> > >            parseValue(buf, len);
> > >        } else if (isDelimSpecified && valType == DataType.BYTEARRAY) {
> > >            // Tokenize value by delimiter when Value class
> > >            // type is DataByteArray
> > >            DataByteArray val = (DataByteArray)value;
> > >            parseValue(val.get(), val.size());
> > >        } else {
> > >            // Add the value without tokenizing if the delimiter is not
> > >            // specified OR the value class type is not Text or
> > >            // DataByteArray
> > >            mProtoTuple.add(translateWritableToPigDataType(value, valType));
> > >        }
> > >
> > > .
> > > .
> > > .
> > > }
> > >
> > > //From PigStorage
> > > private void parseValue(byte[] buf, int len) {
> > >        int start = 0;
> > >        for (int i = 0; i < len; i++) {
> > >            if (buf[i] == fieldDel) {
> > >                readField(buf, start, i);
> > >                start = i + 1;
> > >            }
> > >        }
> > >        // Store the field that follows the last delimiter
> > >        if (start <= len) {
> > >            readField(buf, start, len);
> > >        }
> > >    }
> > >
> > >
> > > Thoughts?
> > >
> > > -Prashant
> > >
> >
>
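
The delimiter scan proposed in the quoted message can be tried outside of Pig
with the standalone sketch below. It assumes a single-byte delimiter and
UTF-8 field contents; the class name DelimitedValueParser and its parse
method are hypothetical, and fields are returned as a List<String> instead of
being added to mProtoTuple. The explicit len parameter matters because the
array returned by Text.getBytes() is only valid up to getLength(), so the
buffer can be longer than the record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone helper; mirrors the loop in the quoted parseValue.
class DelimitedValueParser {

    // Splits the first len bytes of buf on a single-byte delimiter.
    static List<String> parse(byte[] buf, int len, byte fieldDel) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == fieldDel) {
                fields.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Field after the last delimiter (empty if the record ends with one).
        fields.add(new String(buf, start, len - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        // Four stale bytes follow the record to imitate a reused backing buffer.
        byte[] buf = "ABC\u0007DEF\u0007GHI\u0007JKL\u0007MNOxxxx"
                .getBytes(StandardCharsets.UTF_8);
        int len = buf.length - 4;

        List<String> fields = parse(buf, len, (byte) 7);
        // Equivalent of "GENERATE $2, $3" on the tokenized value.
        System.out.println(fields.get(2) + "," + fields.get(3)); // prints GHI,JKL
    }
}
```

Note that in the real loader the key occupies position 0 of the tuple, so the
tokenized value fields would be offset by one relative to this list.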
