Prashant, check out the much more sophisticated version of SequenceFileLoader in Elephant-Bird.
D

On Wed, Dec 14, 2011 at 3:12 PM, Prashant Kommireddi <[email protected]> wrote:

> I see a lot of cases where users store data as SequenceFiles and would like
> to parse through the Value (ONLY if Text/DataByteArray) similar to
> PigStorage(String delim).
>
> For example, let's say the logs are delimited by ^G and one would like to
> read from SequenceFiles. Currently, this would involve 2 operations (1.
> Read from SequenceFile and 2. Parse by passing this data to a UDF that does
> the parsing).
> We could enhance the SequenceFileLoader to do this internally by providing
> a constructor similar to PigStorage(String delim) -
> SequenceFileLoader(String delim). The default constructor would still
> perform the same way it does now.
>
> // returned tuple contains both Key and Value (tokenized by delimiter).
> public SequenceFileLoader(String delimiter) {
>     this.isDelimSpecified = true;
>     this.fieldDel = StorageUtil.parseFieldDel(delimiter);
>     mProtoTuple = new ArrayList<Object>();
> }
>
>
> @Override
> public Tuple getNext() throws IOException {
>     //foo bar
>     .
>     .
>     .
>
>     if (isDelimSpecified && valType == DataType.CHARARRAY) {
>         Text val = (Text)value;
>         int len = val.getLength();
>         byte[] buf = val.getBytes();
>         parseValue(buf, len);
>     } else if (isDelimSpecified && valType == DataType.BYTEARRAY) {
>         // Tokenize value by delimiter when Value class
>         // type is DataByteArray
>         DataByteArray val = (DataByteArray)value;
>         parseValue(val.get(), val.size());
>     } else {
>         // Add the value without tokenizing if delimiter is not
>         // specified OR value class type is not Text or
>         // DataByteArray
>         mProtoTuple.add(translateWritableToPigDataType(value, valType));
>     }
>
>     .
>     .
>     .
> }
>
> //From PigStorage
> private void parseValue(byte[] buf, int len) {
>     int start = 0;
>     for (int i = 0; i < len; i++) {
>         if (buf[i] == fieldDel) {
>             readField(buf, start, i);
>             start = i + 1;
>         }
>     }
>     // Store the field after last delimiter occurs
>     if (start <= len) {
>         readField(buf, start, len);
>     }
> }
>
>
> Thoughts?
>
> -Prashant
>
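For anyone who wants to try the tokenizing step outside of Pig, here is a minimal standalone sketch of the same byte-scanning idea as the quoted parseValue. It is not the SequenceFileLoader or PigStorage code: the class name DelimitedValueSplitter and the method split are hypothetical, it returns plain Strings instead of filling a Pig tuple, and it assumes a single-byte delimiter and UTF-8 text. The length is passed separately from the buffer, as in the quoted code, because only the first len bytes are treated as valid data.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Pig or Elephant-Bird.
public class DelimitedValueSplitter {

    // Splits the first len bytes of buf on a single-byte delimiter.
    public static List<String> split(byte[] buf, int len, byte fieldDel) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == fieldDel) {
                // Field runs from start (inclusive) to i (exclusive).
                fields.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Field after the last delimiter; this is an empty string when the
        // data ends with a delimiter, mirroring the "start <= len" check.
        fields.add(new String(buf, start, len - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        byte ctrlG = 0x07; // ^G, the delimiter from the example in the thread
        byte[] line = "alice\u0007GET\u0007/index.html".getBytes(StandardCharsets.UTF_8);
        System.out.println(split(line, line.length, ctrlG)); // prints [alice, GET, /index.html]
    }
}
```

Scanning the raw bytes avoids decoding the whole value before splitting, which is presumably why the proposal reuses this approach rather than calling String.split.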
