Great, thanks. I will check it out.
On Dec 14, 2011, at 5:05 PM, Dmitriy Ryaboy <[email protected]> wrote:

> you can provide arbitrary converters to the EB one.. not as simple as just
> "\u0007", but far more extendible (and you can make a stringtokenizer
> converter that takes an argument)
>
> the SeqFileLoader in piggybank was an exercise in "here's how you might go
> about doing this" rather than a real loader...
>
> D
>
> On Wed, Dec 14, 2011 at 3:59 PM, Prashant Kommireddi
> <[email protected]> wrote:
>
>> I did, but I am not sure if it allows you to tokenize the Value (when Text
>> or bytearray). For example, say a record in my file is
>>
>> ABC^GDEF^GGHI^GJKL^GMNO
>>
>> And I want to be able to tokenize by ^G and access this value positionally
>> (as done with PigStorage). Is it possible to do it with the elephant-bird
>> version?
>>
>> A = LOAD 'data' using SequenceFileLoader('\u0007');
>> B = FOREACH A GENERATE $2, $3;
>>
>> In any case, it would be good to have it as part of the Piggybank version
>> of SequenceFileLoader. I see you were the original author of it!
>>
>> -Prashant
>>
>> On Wed, Dec 14, 2011 at 3:44 PM, Dmitriy Ryaboy <[email protected]>
>> wrote:
>>
>>> Prashant, check out the much more sophisticated version of
>>> SequenceFileLoader in Elephant-Bird.
>>>
>>> D
>>>
>>> On Wed, Dec 14, 2011 at 3:12 PM, Prashant Kommireddi
>>> <[email protected]> wrote:
>>>
>>>> I see a lot of cases when users store data as SequenceFiles and would
>>>> like to parse through the Value (ONLY if Text/DataByteArray) similar to
>>>> PigStorage(String delim).
>>>>
>>>> For example, let's say the logs are delimited by ^G and one would like
>>>> to read from SequenceFiles. Currently, this would involve 2 operations
>>>> (1. Read from SequenceFile and 2. Parse by passing this data to a UDF
>>>> that does the parsing).
>>>> We could enhance the SequenceFileLoader to do this internally by
>>>> providing a Constructor similar to PigStorage(String delim) -
>>>> SequenceFileLoader(String delim). The default constructor would still
>>>> perform the same way it does now.
>>>>
>>>> // returned tuple contains both Key and Value (tokenized by delimiter).
>>>> public SequenceFileLoader(String delimiter) {
>>>>     this.isDelimSpecified = true;
>>>>     this.fieldDel = StorageUtil.parseFieldDel(delimiter);
>>>>     mProtoTuple = new ArrayList<Object>();
>>>> }
>>>>
>>>>
>>>> @Override
>>>> public Tuple getNext() throws IOException {
>>>>     //foo bar
>>>>     .
>>>>     .
>>>>     .
>>>>
>>>>     if (isDelimSpecified && valType == DataType.CHARARRAY) {
>>>>         Text val = (Text)value;
>>>>         int len = val.getLength();
>>>>         byte[] buf = val.getBytes();
>>>>         parseValue(buf, len);
>>>>     } else if (isDelimSpecified && valType == DataType.BYTEARRAY) {
>>>>         // Tokenize value by delimiter when Value class
>>>>         // type is DataByteArray
>>>>         DataByteArray val = (DataByteArray)value;
>>>>         parseValue(val.get(), val.size());
>>>>     } else
>>>>         // Add the value without tokenizing if delimiter is not
>>>>         // specified OR value class type is not Text or
>>>>         // DataByteArray
>>>>         mProtoTuple.add(translateWritableToPigDataType(value, valType));
>>>>
>>>>     .
>>>>     .
>>>>     .
>>>> }
>>>>
>>>> //From PigStorage
>>>> private void parseValue(byte[] buf, int len) {
>>>>     int start = 0;
>>>>     for (int i = 0; i < len; i++) {
>>>>         if (buf[i] == fieldDel) {
>>>>             readField(buf, start, i);
>>>>             start = i + 1;
>>>>         }
>>>>     }
>>>>     // Store the field after last delimiter occurs
>>>>     if (start <= len) {
>>>>         readField(buf, start, len);
>>>>     }
>>>> }
>>>>
>>>>
>>>> Thoughts?
>>>>
>>>> -Prashant
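
For anyone who wants to try the splitting step from the parseValue() snippet quoted above without a Pig or Hadoop setup, here is a minimal stand-alone sketch. It is an illustration only, not code from Piggybank or Elephant-Bird: the class name DelimitedValueSketch and the method splitFields are hypothetical, fields are returned as UTF-8 strings rather than appended to a tuple through readField(), and a trailing delimiter is assumed to yield a final empty field.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone helper; not part of Pig, Piggybank, or Elephant-Bird.
class DelimitedValueSketch {

    // Splits the first len bytes of buf on a single-byte delimiter.
    // Empty fields are kept, so "A^G^GB" yields three fields.
    static List<String> splitFields(byte[] buf, int len, byte delim) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == delim) {
                fields.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Field after the last delimiter (empty if the value ends with one).
        fields.add(new String(buf, start, len - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        // The sample record from the thread, with ^G written as \u0007.
        byte[] record = "ABC\u0007DEF\u0007GHI\u0007JKL\u0007MNO"
                .getBytes(StandardCharsets.UTF_8);
        List<String> fields = splitFields(record, record.length, (byte) 0x07);
        // Positional access, analogous to GENERATE $2, $3 in the Pig script.
        System.out.println(fields.get(2) + "," + fields.get(3));
    }
}
```

Like the quoted code, the sketch takes an explicit length rather than using buf.length, because Text.getBytes() can return a backing array longer than the valid data; only the first getLength() bytes are meaningful.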
