I see a lot of cases where users store data as SequenceFiles and would like
to parse the Value (only if it is Text or DataByteArray), similar to
PigStorage(String delim).
For example, let's say the logs are delimited by ^G and one would like to
read them from SequenceFiles. Currently, this involves two operations: (1)
read from the SequenceFile, and (2) parse the data by passing it to a UDF
that does the parsing.
We could enhance the SequenceFileLoader to do this internally by providing
a constructor similar to PigStorage(String delim):
SequenceFileLoader(String delim). The default constructor would still
behave the same way it does now.
// Returned tuple contains both Key and Value (tokenized by delimiter).
public SequenceFileLoader(String delimiter) {
    this.isDelimSpecified = true;
    this.fieldDel = StorageUtil.parseFieldDel(delimiter);
    mProtoTuple = new ArrayList<Object>();
}
@Override
public Tuple getNext() throws IOException {
    //foo bar
    .
    .
    .
    if (isDelimSpecified && valType == DataType.CHARARRAY) {
        // Tokenize value by delimiter when Value class type is Text.
        // Text.getBytes() returns the backing array, so only the first
        // getLength() bytes are valid.
        Text val = (Text) value;
        int len = val.getLength();
        byte[] buf = val.getBytes();
        parseValue(buf, len);
    } else if (isDelimSpecified && valType == DataType.BYTEARRAY) {
        // Tokenize value by delimiter when Value class
        // type is DataByteArray
        DataByteArray val = (DataByteArray) value;
        parseValue(val.get(), val.size());
    } else {
        // Add the value without tokenizing if delimiter is not
        // specified OR value class type is not Text or
        // DataByteArray
        mProtoTuple.add(translateWritableToPigDataType(value, valType));
    }
    .
    .
    .
}
// From PigStorage
private void parseValue(byte[] buf, int len) {
    int start = 0;
    for (int i = 0; i < len; i++) {
        if (buf[i] == fieldDel) {
            readField(buf, start, i);
            start = i + 1;
        }
    }
    // Store the field after the last delimiter
    if (start <= len) {
        readField(buf, start, len);
    }
}
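To try the tokenizing loop outside of Pig, here is a minimal standalone sketch. It makes a few assumptions that are not part of the proposal. The class name DelimitedValueSketch is hypothetical. Fields are collected as Strings rather than handed to readField(). The sample record is made up. The delimiter is passed in as a byte (0x07 for ^G) instead of being parsed by StorageUtil.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone class, not part of Pig.
public class DelimitedValueSketch {

    // Splits the first len bytes of buf on fieldDel, mirroring parseValue above.
    // Bytes beyond len are ignored, which matters when buf is a backing array
    // that is longer than the valid data.
    static List<String> parseValue(byte[] buf, int len, byte fieldDel) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == fieldDel) {
                fields.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Field after the last delimiter (empty if the value ends with one).
        if (start <= len) {
            fields.add(new String(buf, start, len - start, StandardCharsets.UTF_8));
        }
        return fields;
    }

    public static void main(String[] args) {
        byte ctrlG = 0x07; // ^G
        // Made-up sample record with three ^G-delimited fields.
        byte[] value = "user1\u0007click\u0007home".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseValue(value, value.length, ctrlG));
        // prints [user1, click, home]
    }
}
```

Note that a value ending in the delimiter yields a trailing empty field, and an empty value yields a single empty field. Both follow from the "start <= len" check.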
Thoughts?
-Prashant