Re: Add parsing functionality to SequenceFileLoader (piggybank)

Prashant Kommireddi Wed, 14 Dec 2011 17:40:29 -0800

Great, thanks I will check it out.

Sent from my iPhone


On Dec 14, 2011, at 5:05 PM, Dmitriy Ryaboy <[email protected]> wrote:

> you can provide arbitrary converters to the EB one.. not as simple as just
> "\u0007", but far more extendible (and you can make a stringtokenizer
> converter that takes an argument)
>
> the SeqFileLoader in piggybank was an exercise in "here's how you might go
> about doing this" rather than a real loader...
>
> D
>
> On Wed, Dec 14, 2011 at 3:59 PM, Prashant Kommireddi 
> <[email protected]> wrote:
>
>> I did, but I am not sure if it allows you to tokenize Value (when Text or
>> bytearray). For example, say a record in my file is
>>
>> ABC^GDEF^GGHI^GJKL^GMNO
>>
>> And I want to be able to tokenize by ^G and access this value positionally
>> (as done with PigStorage). Is it possible to do it with the elephant-bird
>> version?
>>
>> A = LOAD 'data' using SequenceFileLoader('\u0007');
>> B = FOREACH A GENERATE $2, $3;
>>
>> In any case, it would be good to have it as part of the Piggybank version
>> of SequenceFileLoader. I see you were the original author of it!
>>
>> -Prashant
>>
>> On Wed, Dec 14, 2011 at 3:44 PM, Dmitriy Ryaboy <[email protected]>
>> wrote:
>>
>>> Prashant, check out the much more sophisticated version of
>>> SequenceFileLoader in Elephant-Bird.
>>>
>>> D
>>>
>>> On Wed, Dec 14, 2011 at 3:12 PM, Prashant Kommireddi <[email protected]>
>>> wrote:
>>>
>>>> I see a lot of cases when users store data as SequenceFiles and would like
>>>> to parse through the Value (ONLY if Text/DataByteArray) similar to
>>>> PigStorage(String delim).
>>>>
>>>> For example, let's say the logs are delimited by ^G and one would like to
>>>> read from SequenceFiles. Currently, this would involve 2 operations (1.
>>>> Read from SequenceFile and 2. Parse by passing this data to a UDF that does
>>>> the parsing).
>>>> We could enhance the SequenceFileLoader to do this internally by providing
>>>> a Constructor similar to PigStorage(String delim) -
>>>> SequenceFileLoader(String delim). The default constructor would still
>>>> perform the same way it does now.
>>>>
>>>> // returned tuple contains both Key and Value (tokenized by delimiter).
>>>>   public SequenceFileLoader(String delimiter) {
>>>>       this.isDelimSpecified = true;
>>>>       this.fieldDel = StorageUtil.parseFieldDel(delimiter);
>>>>       mProtoTuple = new ArrayList<Object>();
>>>>   }
>>>>
>>>>
>>>> @Override
>>>>   public Tuple getNext() throws IOException {
>>>>  //foo bar
>>>>  .
>>>>  .
>>>>  .
>>>>
>>>> if (isDelimSpecified && valType == DataType.CHARARRAY) {
>>>>           Text val = (Text)value;
>>>>           int len = val.getLength();
>>>>           byte[] buf = val.getBytes();
>>>>           parseValue(buf, len);
>>>>       } else if (isDelimSpecified && valType == DataType.BYTEARRAY) {
>>>>           // Tokenize value by delimiter when Value class
>>>>           // type is DataByteArray
>>>>           DataByteArray val = (DataByteArray)value;
>>>>           parseValue(val.get(), val.size());
>>>>       } else {
>>>>           // Add the value without tokenizing if delimiter is not
>>>>           // specified OR value class type is not Text or
>>>>           // DataByteArray
>>>>           mProtoTuple.add(translateWritableToPigDataType(value, valType));
>>>>       }
>>>>
>>>> .
>>>> .
>>>> .
>>>> }
>>>>
>>>> //From PigStorage
>>>> private void parseValue(byte[] buf, int len) {
>>>>       int start = 0;
>>>>       for (int i = 0; i < len; i++) {
>>>>           if (buf[i] == fieldDel) {
>>>>               readField(buf, start, i);
>>>>               start = i + 1;
>>>>           }
>>>>       }
>>>>       // Store the field after last delimiter occurs
>>>>       if (start <= len) {
>>>>           readField(buf, start, len);
>>>>       }
>>>>   }
>>>>
>>>>
>>>> Thoughts?
>>>>
>>>> -Prashant
>>>>
>>>
>>
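
For readers following the thread, the tokenizing step proposed in the original message can be sketched as a standalone Java class. This is only an illustration under stated assumptions, not the Piggybank or Elephant-Bird code: the class name DelimitedValueSplitter and the method split are hypothetical, fields are decoded as UTF-8 strings instead of being handed to readField, and the delimiter is assumed to be a single byte (as with ^G, byte value 7). The length is passed separately because the array behind a Hadoop Text can be longer than its valid content.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Pig or Piggybank.
final class DelimitedValueSplitter {
    private DelimitedValueSplitter() {}

    // Splits the first len bytes of buf on a single-byte delimiter,
    // following the same loop shape as the parseValue method quoted above.
    static List<String> split(byte[] buf, int len, byte fieldDel) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == fieldDel) {
                fields.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Emit the field after the last delimiter; it is empty when the
        // value ends with a delimiter.
        fields.add(new String(buf, start, len - start, StandardCharsets.UTF_8));
        return fields;
    }
}
```

Applied to the sample record ABC^GDEF^GGHI^GJKL^GMNO with (byte) 7 as the delimiter, split returns five fields, and positions 2 and 3 are GHI and JKL, which is what the FOREACH A GENERATE $2, $3 line in the thread would select.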
