You can provide arbitrary converters to the EB one. Not as simple as just "\u0007", but far more extensible (and you can make a StringTokenizer converter that takes an argument).
The SeqFileLoader in piggybank was an exercise in "here's how you might go about doing this" rather than a real loader...

D

On Wed, Dec 14, 2011 at 3:59 PM, Prashant Kommireddi <[email protected]> wrote:

> I did but I am not sure if it allows you to tokenize Value (when Text or
> bytearray). For example, say a record in my file is
>
> ABC^GDEF^GGHI^GJKL^GMNO
>
> And I want to be able to tokenize by ^G and access this value positionally
> (as done with PigStorage). Is it possible to do it with the elephant-bird
> version?
>
> A = LOAD 'data' using SequenceFileLoader('\u0007');
> B = FOREACH A GENERATE $2, $3;
>
> In any case, it would be good to have it as part of the Piggybank version
> of SequenceFileLoader. I see you were the original author of it!
>
> -Prashant
>
> On Wed, Dec 14, 2011 at 3:44 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
> > Prashant, check out the much more sophisticated version of
> > SequenceFileLoader in Elephant-Bird.
> >
> > D
> >
> > On Wed, Dec 14, 2011 at 3:12 PM, Prashant Kommireddi <[email protected]> wrote:
> >
> > > I see a lot of cases when users store data as SequenceFiles and would like
> > > to parse through the Value (ONLY if Text/DataByteArray) similar to
> > > PigStorage(String delim).
> > >
> > > For example, let's say the logs are delimited by ^G and one would like to
> > > read from SequenceFiles. Currently, this would involve 2 operations (1.
> > > Read from SequenceFile and 2. Parse by passing this data to a UDF that does
> > > the parsing).
> > > We could enhance the SequenceFileLoader to do this internally by providing
> > > a Constructor similar to PigStorage(String delim) -
> > > SequenceFileLoader(String delim). The default constructor would still
> > > perform the same way it does now.
> > >
> > > // returned tuple contains both Key and Value (tokenized by delimiter).
> > > public SequenceFileLoader(String delimiter) {
> > >     this.isDelimSpecified = true;
> > >     this.fieldDel = StorageUtil.parseFieldDel(delimiter);
> > >     mProtoTuple = new ArrayList<Object>();
> > > }
> > >
> > > @Override
> > > public Tuple getNext() throws IOException {
> > >     //foo bar
> > >     .
> > >     .
> > >     .
> > >
> > >     if (isDelimSpecified && valType == DataType.CHARARRAY) {
> > >         Text val = (Text)value;
> > >         int len = val.getLength();
> > >         byte[] buf = val.getBytes();
> > >         parseValue(buf, len);
> > >     } else if (isDelimSpecified && valType == DataType.BYTEARRAY) {
> > >         // Tokenize value by delimiter when Value class
> > >         // type is DataByteArray
> > >         DataByteArray val = (DataByteArray)value;
> > >         parseValue(val.get(), val.size());
> > >     } else
> > >         // Add the value without tokenizing if delimiter is not
> > >         // specified OR value class type is not Text or
> > >         // DataByteArray
> > >         mProtoTuple.add(translateWritableToPigDataType(value, valType));
> > >
> > >     .
> > >     .
> > >     .
> > > }
> > >
> > > //From PigStorage
> > > private void parseValue(byte[] buf, int len) {
> > >     int start = 0;
> > >     for (int i = 0; i < len; i++) {
> > >         if (buf[i] == fieldDel) {
> > >             readField(buf, start, i);
> > >             start = i + 1;
> > >         }
> > >     }
> > >     // Store the field after last delimiter occurs
> > >     if (start <= len) {
> > >         readField(buf, start, len);
> > >     }
> > > }
> > >
> > > Thoughts?
> > >
> > > -Prashant
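
For anyone who wants to try the delimiter scan from parseValue outside of Pig, here is a minimal standalone sketch. It assumes a single-byte delimiter and UTF-8 field contents. The class name DelimSplit and the method split are hypothetical and are not part of Piggybank or Elephant-Bird. It returns strings instead of calling readField.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone helper; not part of Piggybank or Elephant-Bird.
final class DelimSplit {

    // Splits the first len bytes of buf on a single-byte delimiter.
    // Empty fields are kept, including one after a trailing delimiter,
    // which mirrors the "start <= len" check in the quoted parseValue.
    static List<String> split(byte[] buf, int len, byte delim) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == delim) {
                fields.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Field after the last delimiter (empty if the record ends with one).
        fields.add(new String(buf, start, len - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        byte[] record = "ABC\u0007DEF\u0007GHI\u0007JKL\u0007MNO"
                .getBytes(StandardCharsets.UTF_8);
        List<String> fields = split(record, record.length, (byte) 0x07);
        // Positional access, analogous to $2 and $3 in the Pig script above.
        System.out.println(fields.get(2) + " " + fields.get(3)); // prints "GHI JKL"
    }
}
```

The length is passed separately from the buffer for the same reason the quoted code calls val.getLength(): the array returned by Text.getBytes() may be longer than the valid data.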
