Re: Elephant-Bird SequenceFile Storage of RandomAccessSparseVectors for Mahout

Colum Foley Mon, 04 Mar 2013 03:19:23 -0800

Hi Andy, Ted,

Thank you both for replying. Below I will describe the input data, the
pig script I am using, and the resulting output.


-The input data (in file 'vectorsPigStored.dat') is as follows:

bbb     (2,{(6595,4.0),(608,1.0)})
ccd     (1,{(9763,1.0)})
adc     (1,{(3670,1.0)})
ads     (1,{(2297,1.0)})
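
For readers unfamiliar with this layout, here is a minimal sketch (Python, purely for illustration) of how each line above breaks down into a key, a cardinality, and a set of (index, value) entries. The helpers `parse_line` and `out_of_range` are hypothetical names and are not part of Pig, Elephant-Bird, or Mahout. The second helper simply lists entries whose index is not below the stated cardinality; whether that has anything to do with the empty vectors is not established in this message.

```python
import re

# Hypothetical helpers for illustration only; not an Elephant-Bird or Mahout API.
# Matches: <key> <whitespace> (<cardinality>,{<entries>})
LINE_RE = re.compile(r"^(\S+)\s+\((\d+),\{(.*)\}\)$")
# Matches one (index,value) pair inside the bag.
ENTRY_RE = re.compile(r"\((\d+),([^()]+)\)")


def parse_line(line):
    """Split one stored line into (key, cardinality, {index: value})."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    key, cardinality, body = m.group(1), int(m.group(2)), m.group(3)
    entries = {int(i): float(v) for i, v in ENTRY_RE.findall(body)}
    return key, cardinality, entries


def out_of_range(cardinality, entries):
    """Return the sorted indices that fall outside [0, cardinality)."""
    return sorted(i for i in entries if not 0 <= i < cardinality)


SAMPLE = """\
bbb     (2,{(6595,4.0),(608,1.0)})
ccd     (1,{(9763,1.0)})
adc     (1,{(3670,1.0)})
ads     (1,{(2297,1.0)})
"""

if __name__ == "__main__":
    for line in SAMPLE.splitlines():
        key, cardinality, entries = parse_line(line)
        print(key, cardinality, entries, out_of_range(cardinality, entries))
```

Running the script prints one parsed record per input line, followed by the list of indices that are not below that record's cardinality.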


-The full Pig script I am running is as follows:


REGISTER 'elephant-bird-core-3.0.7.jar'
REGISTER 'elephant-bird-pig-3.0.7.jar'
REGISTER 'elephant-bird-mahout-3.0.7.jar'
REGISTER 'mahout-core-0.7.jar'
REGISTER 'mahout-math-0.7.jar'

pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val:
(cardinality: int, entries: {entry: (index: int, value: double)}));
--Store output
store pair into 'output' using
com.twitter.elephantbird.pig.store.SequenceFileStorage (
   '-c com.twitter.elephantbird.pig.util.TextConverter',
   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
);
--Store output without params for comparison
store pair into 'outputRaw' using
com.twitter.elephantbird.pig.store.SequenceFileStorage ();




-Here is the output that I see, printed record by record; the last line
of each listing shows the key and value classes (read with
SequenceFile.Reader, e.g. reader.getKeyClass()):

-- from 'output'
bbb  {}
ccd  {}
adc  {}
ads  {}
class org.apache.hadoop.io.Text class org.apache.mahout.math.VectorWritable

--from 'outputRaw'
bbb  (2,{(6595,4.0),(608,1.0)})
ccd  (1,{(9763,1.0)})
adc  (1,{(3670,1.0)})
ads  (1,{(2297,1.0)})
class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text



**Just to confirm that the issue wasn't with my use of chararray keys
(instead of integer keys), I also tried a run using int keys, but
the result is the same:



--Output when using SequenceFileStorage with params  '-c
com.twitter.elephantbird.pig.util.IntWritableConverter','-c
com.twitter.elephantbird.pig.mahout.VectorWritableConverter'

1  {}
1  {}
1  {}
1  {}
1  {}
class org.apache.hadoop.io.IntWritable  class
org.apache.mahout.math.VectorWritable

--Output from SequenceFileStorage without params
1  (2,{(6595,4.0),(608,1.0)})
1  (1,{(9763,1.0)})
1  (1,{(3670,1.0)})
1  (1,{(2297,1.0)})
class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text




Any help greatly appreciated,

Thanks again,
Colum



On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
> Andy,
>
> Thanks for popping up!
>
> Elephant bird looks like it has awesome potential to make machine learning
> with Hadoop vastly easier.  It is really good to see this kind of response
> ... that is what turns potential into action.
>
> Thanks again.
>
> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <[email protected]> wrote:
>
>> Hi Colum, I'm an ElephantBird project committer and wrote both
>> SequenceFileStorage and the VectorWritableConverter.
>>
>> The default Writable type used by SequenceFileStorage for both key and
>> value is Text, hence the Text data when you don't provide extra
>> configuration.
>>
>> Could you provide some sample data or task attempt logs from your job to
>> help diagnose the issue? Unit tests for both of these utils cover a lot of
>> edge cases, but if you've found a new one I'd like to get it sorted out!
>>
>> Thanks,
>> Andy
>>
>>
>>
>>
>> On Fri, Mar 1, 2013 at 8:29 AM, Ted Dunning <[email protected]> wrote:
>>
>> > I haven't touched elephant bird in some time.  I had some fits with it at
>> > the time that I used it whenever I strayed from the well-trod path, but I
>> > had heard it was much better lately.
>> >
>> > Sorry not to be much more help than that.
>> >
>> > On Fri, Mar 1, 2013 at 3:50 AM, Colum Foley <[email protected]> wrote:
>> >
>> > > I am trying to store Mahout RandomAccessSparseVector using
>> > > elephant-bird and pig. The data is of the form
>> > > key(text),value(RandomAccessSparseVector). When I run Pig's describe, it
>> > > presents the following:
>> > >
>> > > pair: {key: int,val: (cardinality: int,entries: {entry: (index:
>> > > int,value: double)})}
>> > >
>> > > My problem is that when I try to store tuples using elephant-bird's
>> > > SequenceFileStorage as follows:
>> > >
>> > > store clusteredOut into 'logsvectors.dat' using
>> > > com.twitter.elephantbird.pig.store.SequenceFileStorage (
>> > >    '-c com.twitter.elephantbird.pig.util.TextConverter',
>> > >    '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter  --
>> > > -sparse'
>> > > );
>> > >
>> > > It runs successfully, but when I examine the resulting SequenceFile all
>> > > the vectors are empty.
>> > >
>> > > On the other hand, if I run the following instead:
>> > >
>> > > store clusteredOut into 'logsvectors.dat' using
>> > > com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>> > >
>> > > i.e. I do not specify the types of the key or value.
>> > >
>> > > The vectors are non-empty but are of type Text, and this causes my
>> > > clustering algorithm to fail (it expects VectorWritable).
>> > >
>> > > So my problem is that I need to output in VectorFileFormat, but when I
>> > > do the resulting vectors are empty.
>> > >
>> > > Anyone else have experience with this issue?
>> > >
>> > > Many thanks,
>> > > Colum
>> > >
>> >
>>
