Hi Andy, Ted,
Thank you both for replying. Below I describe the input data, the
Pig script I am using, and the resulting output.
-The input data (in file 'vectorsPigStored.dat') is as follows:
bbb (2,{(6595,4.0),(608,1.0)})
ccd (1,{(9763,1.0)})
adc (1,{(3670,1.0)})
ads (1,{(2297,1.0)})
-The full Pig script I am running is as follows:
REGISTER 'elephant-bird-core-3.0.7.jar'
REGISTER 'elephant-bird-pig-3.0.7.jar'
REGISTER 'elephant-bird-mahout-3.0.7.jar'
REGISTER 'mahout-core-0.7.jar'
REGISTER 'mahout-math-0.7.jar'
pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val:
(cardinality: int, entries: {entry: (index: int, value: double)}));
--Store output
store pair into 'output' using
com.twitter.elephantbird.pig.store.SequenceFileStorage (
'-c com.twitter.elephantbird.pig.util.TextConverter',
'-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
);
--Store output without params for comparison
store pair into 'outputRaw' using
com.twitter.elephantbird.pig.store.SequenceFileStorage ();
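
As a sanity check on the schema declared in the LOAD statement, the sample rows can be parsed outside of Pig. The following is only a minimal Python sketch, not part of the job: the helper names `parse_row` and `summarize` are hypothetical, and it assumes the key and the tuple are separated by whitespace exactly as the rows are printed above.

```python
import re

# Hypothetical helpers for illustration only; they are not part of Pig,
# elephant-bird, or Mahout. Assumed row layout (as printed in this message):
#   <key><whitespace>(<cardinality>,{(<index>,<value>),...})
ROW = re.compile(r'^(\S+)\s+\((\d+),\{(.*)\}\)$')
ENTRY = re.compile(r'\((\d+),([0-9.eE+-]+)\)')


def parse_row(line):
    """Split one row into (key, cardinality, [(index, value), ...])."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError("unexpected row format: %r" % line)
    key, cardinality, body = m.group(1), int(m.group(2)), m.group(3)
    entries = [(int(i), float(v)) for i, v in ENTRY.findall(body)]
    return key, cardinality, entries


def summarize(line):
    """Report the declared cardinality next to the entry count and max index."""
    key, cardinality, entries = parse_row(line)
    max_index = max((i for i, _ in entries), default=None)
    return {
        "key": key,
        "cardinality": cardinality,
        "num_entries": len(entries),
        "max_index": max_index,
    }


# The four sample rows from 'vectorsPigStored.dat' quoted above.
SAMPLE = [
    "bbb (2,{(6595,4.0),(608,1.0)})",
    "ccd (1,{(9763,1.0)})",
    "adc (1,{(3670,1.0)})",
    "ads (1,{(2297,1.0)})",
]

if __name__ == "__main__":
    for row in SAMPLE:
        print(summarize(row))
```

For these four rows, the first field of each tuple equals the number of entries in the bag and is smaller than every index in that bag. Whether VectorWritableConverter interprets that field as the vector's dimension is an assumption that would need to be checked against the converter's documentation; the sketch itself does not establish it.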
-Here is the output that I see, printed record by record, followed by
the key and value classes (read with SequenceFile.Reader, e.g. reader.getKeyClass()):
-- from 'output'
bbb {}
ccd {}
adc {}
ads {}
class org.apache.hadoop.io.Text class org.apache.mahout.math.VectorWritable
-- from 'outputRaw'
bbb (2,{(6595,4.0),(608,1.0)})
ccd (1,{(9763,1.0)})
adc (1,{(3670,1.0)})
ads (1,{(2297,1.0)})
class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
To rule out my use of chararray keys (instead of int keys) as the
cause, I also tried a run with int keys, but the result is the same:
--Output when using SequenceFileStorage with params '-c
com.twitter.elephantbird.pig.util.IntWritableConverter','-c
com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
1 {}
1 {}
1 {}
1 {}
1 {}
class org.apache.hadoop.io.IntWritable class org.apache.mahout.math.VectorWritable
--Output from SequenceFileStorage without params
1 (2,{(6595,4.0),(608,1.0)})
1 (1,{(9763,1.0)})
1 (1,{(3670,1.0)})
1 (1,{(2297,1.0)})
class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
Any help would be greatly appreciated.
Thanks again,
Colum
On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
> Andy,
>
> Thanks for popping up!
>
> Elephant bird looks like it has awesome potential to make machine learning
> with Hadoop vastly easier. It is really good to see this kind of response
> ... that is what turns potential into action.
>
> Thanks again.
>
> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <[email protected]> wrote:
>
>> Hi Colum, I'm an ElephantBird project committer and wrote both
>> SequenceFileStorage and the VectorWritableConverter.
>>
>> The default Writable type used by SequenceFileStorage for both key and
>> value is Text, hence the Text data when you don't provide extra
>> configuration.
>>
>> Could you provide some sample data or task attempt logs from your job to
>> help diagnose the issue? Unit tests for both of these utils cover a lot of
>> edge cases, but if you've found a new one I'd like to get it sorted out!
>>
>> Thanks,
>> Andy
>>
>>
>>
>>
>> On Fri, Mar 1, 2013 at 8:29 AM, Ted Dunning <[email protected]> wrote:
>>
>> > I haven't touched elephant bird in some time. I had some fits with it at
>> > the time that I used it whenever I strayed from the well-trod path, but I
>> > had heard it was much better lately.
>> >
>> > Sorry not to be much more help than that.
>> >
>> > On Fri, Mar 1, 2013 at 3:50 AM, Colum Foley <[email protected]> wrote:
>> >
>> > > I am trying to store Mahout RandomAccessSparseVector using
>> > > elephant-bird and pig. The data is of the form
>> > > key (Text), value (RandomAccessSparseVector). When I run Pig's DESCRIBE,
>> > > it reports the following:
>> > >
>> > > pair: {key: int,val: (cardinality: int,entries: {entry: (index:
>> > > int,value: double)})}
>> > >
>> > > My problem is that when I try to store tuples using elephant-bird's
>> > > SequenceFileStorage as follows:
>> > >
>> > > store clusteredOut into 'logsvectors.dat' using
>> > > com.twitter.elephantbird.pig.store.SequenceFileStorage (
>> > > '-c com.twitter.elephantbird.pig.util.TextConverter',
>> > > '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter -- -sparse'
>> > > );
>> > >
>> > > It runs successfully, but when I examine the resulting SequenceFile, all
>> > > the vectors are empty.
>> > >
>> > > On the other hand, if I run the following instead:
>> > >
>> > > store clusteredOut into 'logsvectors.dat' using
>> > > com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>> > >
>> > > i.e. without specifying the types of the key or value, the vectors are
>> > > non-empty but are of type Text, and this causes my clustering
>> > > algorithm to fail (it expects VectorWritable).
>> > >
>> > > So my problem is that I need to output in VectorFileFormat, but when I
>> > > do the resulting vectors are empty.
>> > >
>> > > Anyone else have experience with this issue?
>> > >
>> > > Many thanks,
>> > > Colum
>> > >
>> >
>>