I see a lot of cases where users store data as SequenceFiles and would like
to tokenize the Value (only when it is Text or DataByteArray), similar to
PigStorage(String delim).

For example, let's say the logs are delimited by ^G and one would like to
read them from SequenceFiles. Currently, this involves two operations: (1)
read from the SequenceFile and (2) parse the value by passing it to a UDF
that does the tokenizing.
We could enhance the SequenceFileLoader to do this internally by providing
a constructor similar to PigStorage(String delim):
SequenceFileLoader(String delim). The default constructor would still
behave the same way it does now.

    // The returned tuple contains both the Key and the Value (tokenized by
    // the delimiter).
    public SequenceFileLoader(String delimiter) {
        this.isDelimSpecified = true;
        this.fieldDel = StorageUtil.parseFieldDel(delimiter);
        mProtoTuple = new ArrayList<Object>();
    }


    @Override
    public Tuple getNext() throws IOException {
        // foo bar
   .
   .
   .

        if (isDelimSpecified && valType == DataType.CHARARRAY) {
            // Tokenize value by delimiter when Value class type is Text.
            // Only the first getLength() bytes of getBytes() are valid.
            Text val = (Text)value;
            int len = val.getLength();
            byte[] buf = val.getBytes();
            parseValue(buf, len);
        } else if (isDelimSpecified && valType == DataType.BYTEARRAY) {
            // Tokenize value by delimiter when Value class
            // type is DataByteArray
            DataByteArray val = (DataByteArray)value;
            parseValue(val.get(), val.size());
        } else {
            // Add the value without tokenizing if the delimiter is not
            // specified OR the value class type is not Text or
            // DataByteArray
            mProtoTuple.add(translateWritableToPigDataType(value, valType));
        }

.
.
.
}

    // From PigStorage
    private void parseValue(byte[] buf, int len) {
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == fieldDel) {
                readField(buf, start, i);
                start = i + 1;
            }
        }
        // Store the field after last delimiter occurs
        if (start <= len) {
            readField(buf, start, len);
        }
    }
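
To make the tokenizing step concrete, here is a standalone sketch of the
same single-pass scan that runs without Pig or Hadoop on the classpath. The
class name DelimitedValueSketch and the method splitFields are made up for
illustration and are not part of Pig. It also simplifies things by returning
each field as a UTF-8 String (an empty field becomes an empty string), which
is not necessarily what readField does in the loader.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Pig.
class DelimitedValueSketch {

    // Splits the first len bytes of buf on fieldDel in a single pass.
    // len is passed separately because a reused buffer (for example the
    // backing array of a Hadoop Text) can be longer than the valid data.
    static List<String> splitFields(byte[] buf, int len, byte fieldDel) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == fieldDel) {
                fields.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Field after the last delimiter (empty if the value ends with one).
        fields.add(new String(buf, start, len - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        byte ctrlG = 0x07; // ^G
        byte[] value = "alice\u0007GET\u0007/index.html".getBytes(StandardCharsets.UTF_8);
        // prints [alice, GET, /index.html]
        System.out.println(splitFields(value, value.length, ctrlG));
    }
}
```

A value with no delimiter at all comes back as a single field, so the
output for such rows would match what the loader produces today, apart from
the String conversion.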


Thoughts?

-Prashant
