Prashant, check out the much more sophisticated version of
SequenceFileLoader in Elephant-Bird.

D

On Wed, Dec 14, 2011 at 3:12 PM, Prashant Kommireddi <[email protected]> wrote:

> I see many cases where users store data as SequenceFiles and would like
> to parse the Value (only when it is Text or DataByteArray) the way
> PigStorage(String delim) does.
>
> For example, let's say the logs are delimited by ^G and one would like to
> read them from SequenceFiles. Currently, this involves two operations:
> (1) read from the SequenceFile, and (2) pass the data to a UDF that does
> the parsing.
> We could enhance the SequenceFileLoader to do this internally by providing
> a constructor similar to PigStorage(String delim):
> SequenceFileLoader(String delim). The default constructor would still
> behave the way it does now.
>
>  // returned tuple contains both Key and Value (tokenized by delimiter).
>    public SequenceFileLoader(String delimiter) {
>        this.isDelimSpecified = true;
>        this.fieldDel = StorageUtil.parseFieldDel(delimiter);
>        mProtoTuple = new ArrayList<Object>();
>    }
>
>
>  @Override
>    public Tuple getNext() throws IOException {
>   //foo bar
>   .
>   .
>   .
>
> if (isDelimSpecified && valType == DataType.CHARARRAY) {
>            Text val = (Text)value;
>            int len = val.getLength();
>            byte[] buf = val.getBytes();
>            parseValue(buf, len);
>        } else if (isDelimSpecified && valType == DataType.BYTEARRAY) {
>            // Tokenize value by delimiter when Value class
>            // type is DataByteArray
>            DataByteArray val = (DataByteArray)value;
>            parseValue(val.get(), val.size());
>        } else {
>            // Add the value without tokenizing if the delimiter is not
>            // specified OR the value class type is not Text or DataByteArray
>            mProtoTuple.add(translateWritableToPigDataType(value, valType));
>        }
>
> .
> .
> .
> }
>
> //From PigStorage
> private void parseValue(byte[] buf, int len) {
>        int start = 0;
>        for (int i = 0; i < len; i++) {
>            if (buf[i] == fieldDel) {
>                readField(buf, start, i);
>                start = i + 1;
>            }
>        }
>        // Store the field after last delimiter occurs
>        if (start <= len) {
>            readField(buf, start, len);
>        }
>    }
>
>
> Thoughts?
>
> -Prashant
>
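
For anyone who wants to try the tokenizing step from parseValue outside of
Pig, here is a minimal standalone sketch. It is not the SequenceFileLoader or
PigStorage code: the class name DelimitedValueSketch and the method
splitFields are hypothetical, fields are returned as UTF-8 strings instead of
being handed to readField, and an empty field is kept as an empty string.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DelimitedValueSketch {

    // Hypothetical helper (not part of Pig): splits the first `len` bytes of
    // `buf` on a single-byte delimiter, mirroring the loop in parseValue.
    static List<String> splitFields(byte[] buf, int len, byte fieldDel) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == fieldDel) {
                fields.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Store the field after the last delimiter (empty if the value ends with one).
        fields.add(new String(buf, start, len - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        byte ctrlG = 0x07; // ^G, the delimiter used in the example above
        byte[] value = "alice\u0007GET\u0007/index.html".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitFields(value, value.length, ctrlG));
        // prints [alice, GET, /index.html]
    }
}
```

The explicit len argument matters for the Text case: Text.getBytes() returns
the backing array, and only the first getLength() bytes of it are valid.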
