I did, but I am not sure whether it allows you to tokenize the Value (when it
is Text or bytearray). For example, say a record in my file is
ABC^GDEF^GGHI^GJKL^GMNO
I want to tokenize this by ^G and access the resulting fields positionally
(as done with PigStorage). Is that possible with the elephant-bird
version?
A = LOAD 'data' using SequenceFileLoader('\u0007');
B = FOREACH A GENERATE $2, $3;
In any case, it would be good to have it as part of the Piggybank version
of SequenceFileLoader. I see you were the original author of it!
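
To make the request concrete, here is a minimal standalone sketch of the tokenization I have in mind, assuming a single-byte delimiter. The class name DelimitedValueTokenizer and its tokenize method are made up for illustration and are not part of Pig or Elephant-Bird; the sketch returns plain strings rather than building a Pig tuple.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Pig or Elephant-Bird class.
class DelimitedValueTokenizer {

    // Splits the first len bytes of buf on a single-byte delimiter.
    // Empty fields are kept so that positions stay stable.
    static List<String> tokenize(byte[] buf, int len, byte fieldDel) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == fieldDel) {
                fields.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Field after the last delimiter (empty if the record ends with one).
        fields.add(new String(buf, start, len - start, StandardCharsets.UTF_8));
        return fields;
    }
}
```

For the sample record above, calling tokenize(bytes, bytes.length, (byte) 7) should give five fields, with GHI at index 2 and JKL at index 3.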
-Prashant
On Wed, Dec 14, 2011 at 3:44 PM, Dmitriy Ryaboy <[email protected]> wrote:
> Prashant, check out the much more sophisticated version of
> SequenceFileLoader in Elephant-Bird.
>
> D
>
> On Wed, Dec 14, 2011 at 3:12 PM, Prashant Kommireddi <[email protected]> wrote:
>
> > I see a lot of cases where users store data as SequenceFiles and would like
> > to parse the Value (ONLY if it is Text/DataByteArray), similar to
> > PigStorage(String delim).
> >
> > For example, let's say the logs are delimited by ^G and one would like to
> > read them from SequenceFiles. Currently, this involves two operations: (1)
> > read from the SequenceFile, and (2) parse by passing the data to a UDF that
> > does the parsing.
> > We could enhance SequenceFileLoader to do this internally by providing
> > a constructor similar to PigStorage(String delim):
> > SequenceFileLoader(String delim). The default constructor would still
> > behave the same way it does now.
> >
> > // Returned tuple contains both Key and Value (tokenized by delimiter).
> > public SequenceFileLoader(String delimiter) {
> >     this.isDelimSpecified = true;
> >     this.fieldDel = StorageUtil.parseFieldDel(delimiter);
> >     mProtoTuple = new ArrayList<Object>();
> > }
> >
> >
> > @Override
> > public Tuple getNext() throws IOException {
> >     //foo bar
> >     .
> >     .
> >     .
> >
> >     if (isDelimSpecified && valType == DataType.CHARARRAY) {
> >         // Tokenize value by delimiter when Value class type is Text
> >         Text val = (Text)value;
> >         int len = val.getLength();
> >         byte[] buf = val.getBytes();
> >         parseValue(buf, len);
> >     } else if (isDelimSpecified && valType == DataType.BYTEARRAY) {
> >         // Tokenize value by delimiter when Value class
> >         // type is DataByteArray
> >         DataByteArray val = (DataByteArray)value;
> >         parseValue(val.get(), val.size());
> >     } else {
> >         // Add the value without tokenizing if the delimiter is not
> >         // specified OR the value class type is not Text or
> >         // DataByteArray
> >         mProtoTuple.add(translateWritableToPigDataType(value, valType));
> >     }
> >
> >     .
> >     .
> >     .
> > }
> >
> > // From PigStorage
> > private void parseValue(byte[] buf, int len) {
> >     int start = 0;
> >     for (int i = 0; i < len; i++) {
> >         if (buf[i] == fieldDel) {
> >             readField(buf, start, i);
> >             start = i + 1;
> >         }
> >     }
> >     // Store the field after the last delimiter
> >     if (start <= len) {
> >         readField(buf, start, len);
> >     }
> > }
> >
> >
> > Thoughts?
> >
> > -Prashant
> >
>