Colum, thank you for passing on the details. Could you also share the version of Pig you are running? I assume you're running in local mode with the example data below?
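In the meantime, a quick way to sanity-check what actually lands in the sequence file is to dump the key/value classes and the records directly. Below is a rough, untested sketch of the kind of reader I'd use, plain Hadoop SequenceFile.Reader plus Mahout's VectorWritable; the class name and the part-file path are just placeholders, adjust them for your setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.VectorWritable;

public class DumpSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Placeholder path: point this at one of the part files under your 'output' dir.
    Path path = new Path(args.length > 0 ? args[0] : "output/part-m-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // Should report Text and VectorWritable for the converter-based output.
      System.out.println(reader.getKeyClass() + " / " + reader.getValueClass());

      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        if (value instanceof VectorWritable) {
          // For VectorWritable, print the underlying vector rather than the Writable itself.
          System.out.println(key + "\t" + ((VectorWritable) value).get());
        } else {
          System.out.println(key + "\t" + value);
        }
      }
    } finally {
      reader.close();
    }
  }
}

If that also shows empty vectors even though the value class is VectorWritable, it points at the write path rather than at how you're reading the file back.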
On Mar 4, 2013, at 3:19 AM, Colum Foley <[email protected]> wrote:
> Hi Andy, Ted,
>
> Thank you both for replying. Below I describe the input data, the Pig script I am using, and the resulting output.
>
> - Input data (in file 'vectorsPigStored.dat'):
>
> bbb (2,{(6595,4.0),(608,1.0)})
> ccd (1,{(9763,1.0)})
> adc (1,{(3670,1.0)})
> ads (1,{(2297,1.0)})
>
> - The full Pig script I am running:
>
> REGISTER 'elephant-bird-core-3.0.7.jar'
> REGISTER 'elephant-bird-pig-3.0.7.jar'
> REGISTER 'elephant-bird-mahout-3.0.7.jar'
> REGISTER 'mahout-core-0.7.jar'
> REGISTER 'mahout-math-0.7.jar'
>
> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val: (cardinality: int, entries: {entry: (index: int, value: double)}));
> -- Store output
> store pair into 'output' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
>   '-c com.twitter.elephantbird.pig.util.TextConverter',
>   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> );
> -- Store output without params for comparison
> store pair into 'outputRaw' using com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>
> - Here is the output I see, printed line by line, followed by the key and value classes (using SequenceFile.Reader, reader.getKeyClass()):
>
> -- from 'output'
> bbb {}
> ccd {}
> adc {}
> ads {}
> class org.apache.hadoop.io.Text class org.apache.mahout.math.VectorWritable
>
> -- from 'outputRaw'
> bbb (2,{(6595,4.0),(608,1.0)})
> ccd (1,{(9763,1.0)})
> adc (1,{(3670,1.0)})
> ads (1,{(2297,1.0)})
> class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
>
> ** Just to confirm that the issue wasn't with my use of chararray keys (instead of integer keys), I also tried a run using int keys, but the result is the same:
>
> -- Output when using SequenceFileStorage with params '-c com.twitter.elephantbird.pig.util.IntWritableConverter', '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> 1 {}
> 1 {}
> 1 {}
> 1 {}
> 1 {}
> class org.apache.hadoop.io.IntWritable class org.apache.mahout.math.VectorWritable
>
> -- Output from SequenceFileStorage without params
> 1 (2,{(6595,4.0),(608,1.0)})
> 1 (1,{(9763,1.0)})
> 1 (1,{(3670,1.0)})
> 1 (1,{(2297,1.0)})
> class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
>
> Any help greatly appreciated.
>
> Thanks again,
> Colum
>
> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
>> Andy,
>>
>> Thanks for popping up!
>>
>> Elephant bird looks like it has awesome potential to make machine learning with Hadoop vastly easier. It is really good to see this kind of response ... that is what turns potential into action.
>>
>> Thanks again.
>>
>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <[email protected]> wrote:
>>
>>> Hi Colum, I'm an ElephantBird project committer and wrote both SequenceFileStorage and the VectorWritableConverter.
>>>
>>> The default Writable type used by SequenceFileStorage for both key and value is Text, hence the Text data when you don't provide extra configuration.
>>>
>>> Could you provide some sample data or task attempt logs from your job to help diagnose the issue? Unit tests for both of these utils cover a lot of edge cases, but if you've found a new one I'd like to get it sorted out!
>>>
>>> Thanks,
>>> Andy
>>>
>>> On Fri, Mar 1, 2013 at 8:29 AM, Ted Dunning <[email protected]> wrote:
>>>
>>>> I haven't touched elephant bird in some time. I had some fits with it at the time that I used it whenever I strayed from the well-trod path, but I had heard it was much better lately.
>>>>
>>>> Sorry not to be much more help than that.
>>>>
>>>> On Fri, Mar 1, 2013 at 3:50 AM, Colum Foley <[email protected]> wrote:
>>>>
>>>>> I am trying to store a Mahout RandomAccessSparseVector using elephant-bird and Pig. The data is of the form key (text), value (RandomAccessSparseVector). When I run Pig describe, it presents the following:
>>>>>
>>>>> pair: {key: int,val: (cardinality: int,entries: {entry: (index: int,value: double)})}
>>>>>
>>>>> My problem is that when I try to store tuples using elephant-bird's SequenceFileStorage as follows:
>>>>>
>>>>> store clusteredOut into 'logsvectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
>>>>>   '-c com.twitter.elephantbird.pig.util.TextConverter',
>>>>>   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter -- -sparse'
>>>>> );
>>>>>
>>>>> it runs successfully, but when I examine the resulting SequenceFile all the vectors are empty.
>>>>>
>>>>> On the other hand, if I run the following instead:
>>>>>
>>>>> store clusteredOut into 'logsvectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>>>>>
>>>>> i.e. do not specify the types of the key or value, the vectors are non-empty but are of type Text, and this causes my clustering algorithm to fail (as it expects VectorWritable).
>>>>>
>>>>> So my problem is that I need to output VectorWritable, but when I do, the resulting vectors are empty.
>>>>>
>>>>> Anyone else have experience with this issue?
>>>>>
>>>>> Many thanks,
>>>>> Colum
