Hi Andy,

I am using Pig 0.10.0 (but am happy to try another version). Yes, I am running in local mode with the example data below.
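In case it's useful, the check I run against the stored files is roughly the sketch below: open a part file with SequenceFile.Reader, print the key/value classes from the file header, and print each record (this is only a rough sketch; the class name DumpSeqFile and the command-line path are placeholders, and it assumes the Hadoop 1.x SequenceFile.Reader API plus Mahout 0.7 on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.VectorWritable;

public class DumpSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    // args[0] is a part file, e.g. output/part-m-00000 (placeholder; the name varies per job)
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      // Key and value classes recorded in the SequenceFile header
      System.out.println(reader.getKeyClass() + " " + reader.getValueClass());
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, val)) {
        // For VectorWritable values, print the wrapped Vector so sparse entries are visible
        Object shown = (val instanceof VectorWritable) ? ((VectorWritable) val).get() : val;
        System.out.println(key + "\t" + shown);
      }
    } finally {
      reader.close();
    }
  }
}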
Thanks again,
Colum

On Mon, Mar 4, 2013 at 3:02 PM, Andy Schlaikjer <[email protected]> wrote:
> Colum, thank you for passing on details. Could you also share with us
> the version of Pig you are running? I assume you're running in local
> mode with the example data below?
>
>
> On Mar 4, 2013, at 3:19 AM, Colum Foley <[email protected]> wrote:
>
>> Hi Andy, Ted,
>>
>> Thank you both for replying. Below I will describe the input data, the
>> Pig script I am using, and the resulting output.
>>
>> - The input data is the following (in file 'vectorsPigStored.dat'):
>>
>> bbb (2,{(6595,4.0),(608,1.0)})
>> ccd (1,{(9763,1.0)})
>> adc (1,{(3670,1.0)})
>> ads (1,{(2297,1.0)})
>>
>> - The full Pig script I am running is as follows:
>>
>> REGISTER 'elephant-bird-core-3.0.7.jar'
>> REGISTER 'elephant-bird-pig-3.0.7.jar'
>> REGISTER 'elephant-bird-mahout-3.0.7.jar'
>> REGISTER 'mahout-core-0.7.jar'
>> REGISTER 'mahout-math-0.7.jar'
>>
>> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray,
>>     val: (cardinality: int, entries: {entry: (index: int, value: double)}));
>>
>> -- Store output
>> store pair into 'output' using
>>     com.twitter.elephantbird.pig.store.SequenceFileStorage (
>>         '-c com.twitter.elephantbird.pig.util.TextConverter',
>>         '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>     );
>>
>> -- Store output without params for comparison
>> store pair into 'outputRaw' using
>>     com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>>
>> - Here is the output that I see, printed record by record, followed by the
>> key and value classes of each output file (read via SequenceFile.Reader,
>> reader.getKeyClass() / reader.getValueClass()):
>>
>> -- from 'output'
>> bbb {}
>> ccd {}
>> adc {}
>> ads {}
>> class org.apache.hadoop.io.Text  class org.apache.mahout.math.VectorWritable
>>
>> -- from 'outputRaw'
>> bbb (2,{(6595,4.0),(608,1.0)})
>> ccd (1,{(9763,1.0)})
>> adc (1,{(3670,1.0)})
>> ads (1,{(2297,1.0)})
>> class org.apache.hadoop.io.Text  class org.apache.hadoop.io.Text
>>
>> Just to confirm that the issue wasn't with my use of chararray keys
>> (instead of integer keys), I also tried a run using int keys, but
>> the result is the same:
>>
>> -- Output when using SequenceFileStorage with params
>> '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>> '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter':
>>
>> 1 {}
>> 1 {}
>> 1 {}
>> 1 {}
>> 1 {}
>> class org.apache.hadoop.io.IntWritable  class org.apache.mahout.math.VectorWritable
>>
>> -- Output from SequenceFileStorage without params:
>>
>> 1 (2,{(6595,4.0),(608,1.0)})
>> 1 (1,{(9763,1.0)})
>> 1 (1,{(3670,1.0)})
>> 1 (1,{(2297,1.0)})
>> class org.apache.hadoop.io.Text  class org.apache.hadoop.io.Text
>>
>> Any help greatly appreciated,
>>
>> Thanks again,
>> Colum
>>
>>
>> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
>>> Andy,
>>>
>>> Thanks for popping up!
>>>
>>> Elephant bird looks like it has awesome potential to make machine learning
>>> with Hadoop vastly easier. It is really good to see this kind of response
>>> ... that is what turns potential into action.
>>>
>>> Thanks again.
>>>
>>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <[email protected]> wrote:
>>>
>>>> Hi Colum, I'm an ElephantBird project committer and wrote both
>>>> SequenceFileStorage and the VectorWritableConverter.
>>>>
>>>> The default Writable type used by SequenceFileStorage for both key and
>>>> value is Text, hence the Text data when you don't provide extra
>>>> configuration.
>>>>
>>>> Could you provide some sample data or task attempt logs from your job to
>>>> help diagnose the issue? Unit tests for both of these utils cover a lot of
>>>> edge cases, but if you've found a new one I'd like to get it sorted out!
>>>>
>>>> Thanks,
>>>> Andy
>>>>
>>>>
>>>> On Fri, Mar 1, 2013 at 8:29 AM, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> I haven't touched elephant bird in some time. I had some fits with it at
>>>>> the time that I used it whenever I strayed from the well-trod path, but I
>>>>> had heard it was much better lately.
>>>>>
>>>>> Sorry not to be much more help than that.
>>>>>
>>>>> On Fri, Mar 1, 2013 at 3:50 AM, Colum Foley <[email protected]> wrote:
>>>>>
>>>>>> I am trying to store Mahout RandomAccessSparseVectors using
>>>>>> elephant-bird and Pig. The data is of the form
>>>>>> key (text), value (RandomAccessSparseVector). When I run Pig's DESCRIBE
>>>>>> it presents the following:
>>>>>>
>>>>>> pair: {key: int, val: (cardinality: int, entries: {entry: (index: int, value: double)})}
>>>>>>
>>>>>> My problem is that when I try to store tuples using elephant-bird's
>>>>>> SequenceFileStorage as follows:
>>>>>>
>>>>>> store clusteredOut into 'logsvectors.dat' using
>>>>>>     com.twitter.elephantbird.pig.store.SequenceFileStorage (
>>>>>>         '-c com.twitter.elephantbird.pig.util.TextConverter',
>>>>>>         '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter -- -sparse'
>>>>>>     );
>>>>>>
>>>>>> it runs successfully, but when I examine the resulting SequenceFile all
>>>>>> the vectors are empty.
>>>>>>
>>>>>> On the other hand, if I run the following instead:
>>>>>>
>>>>>> store clusteredOut into 'logsvectors.dat' using
>>>>>>     com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>>>>>>
>>>>>> i.e., do not specify the types of the key or value, the vectors are
>>>>>> non-empty but are of type Text, and this causes my clustering algorithm
>>>>>> to fail (it expects VectorWritable).
>>>>>>
>>>>>> So my problem is that I need to output VectorWritable values, but when I
>>>>>> do, the resulting vectors are empty.
>>>>>>
>>>>>> Anyone else have experience with this issue?
>>>>>>
>>>>>> Many thanks,
>>>>>> Colum
>>>>
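PS: to make the target format concrete, here is a minimal Java sketch of writing the kind of Text -> VectorWritable SequenceFile the clustering step expects, using only the standard Hadoop 1.x SequenceFile.Writer and Mahout 0.7 APIs. The class name, output path, and vector cardinality are placeholders; the key and entries mirror the first sample record above. This is only meant as a reference for what the SequenceFileStorage output should be equivalent to; in my runs the vectors instead come out empty.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteReferenceVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("referenceVectors.seq"); // placeholder output path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, VectorWritable.class);
    try {
      // One sparse vector mirroring the 'bbb' record; the cardinality (10000)
      // is a placeholder and must be at least as large as the highest index set.
      Vector v = new RandomAccessSparseVector(10000);
      v.set(6595, 4.0);
      v.set(608, 1.0);
      writer.append(new Text("bbb"), new VectorWritable(v));
    } finally {
      writer.close();
    }
  }
}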
