Colum, thank you for passing on the details. Could you also share the version of Pig you are running? I assume you're running in local mode with the example data below?
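In the meantime, a quick way to sanity-check what actually lands in the sequence file is to dump the key/value classes and the records directly. Below is a rough, untested sketch of the kind of reader I'd use, plain Hadoop SequenceFile.Reader plus Mahout's VectorWritable; the class name and the part-file path are just placeholders, adjust them for your setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.VectorWritable;

public class DumpSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Placeholder path: point this at one of the part files under your 'output' dir.
    Path path = new Path(args.length > 0 ? args[0] : "output/part-m-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // Should report Text and VectorWritable for the converter-based output.
      System.out.println(reader.getKeyClass() + " / " + reader.getValueClass());

      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        if (value instanceof VectorWritable) {
          // For VectorWritable, print the underlying vector rather than the Writable itself.
          System.out.println(key + "\t" + ((VectorWritable) value).get());
        } else {
          System.out.println(key + "\t" + value);
        }
      }
    } finally {
      reader.close();
    }
  }
}

If that also shows empty vectors even though the value class is VectorWritable, it points at the write path rather than at how you're reading the file back.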
On Mar 4, 2013, at 3:19 AM, Colum Foley <[email protected]> wrote:
> Hi Andy, Ted,
>
> Thank you both for replying. Below I describe the input data, the Pig script I am using, and the resulting output.
>
> - Input data (in file 'vectorsPigStored.dat'):
>
> bbb (2,{(6595,4.0),(608,1.0)})
> ccd (1,{(9763,1.0)})
> adc (1,{(3670,1.0)})
> ads (1,{(2297,1.0)})
>
> - The full Pig script I am running:
>
> REGISTER 'elephant-bird-core-3.0.7.jar'
> REGISTER 'elephant-bird-pig-3.0.7.jar'
> REGISTER 'elephant-bird-mahout-3.0.7.jar'
> REGISTER 'mahout-core-0.7.jar'
> REGISTER 'mahout-math-0.7.jar'
>
> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val: (cardinality: int, entries: {entry: (index: int, value: double)}));
> -- Store output
> store pair into 'output' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
>   '-c com.twitter.elephantbird.pig.util.TextConverter',
>   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> );
> -- Store output without params for comparison
> store pair into 'outputRaw' using com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>
> - Here is the output I see, printed line by line, followed by the key and value classes (using SequenceFile.Reader, reader.getKeyClass()):
>
> -- from 'output'
> bbb {}
> ccd {}
> adc {}
> ads {}
> class org.apache.hadoop.io.Text class org.apache.mahout.math.VectorWritable
>
> -- from 'outputRaw'
> bbb (2,{(6595,4.0),(608,1.0)})
> ccd (1,{(9763,1.0)})
> adc (1,{(3670,1.0)})
> ads (1,{(2297,1.0)})
> class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
>
> ** Just to confirm that the issue wasn't with my use of chararray keys (instead of integer keys), I also tried a run using int keys, but the result is the same:
>
> -- Output when using SequenceFileStorage with params '-c com.twitter.elephantbird.pig.util.IntWritableConverter', '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> 1 {}
> 1 {}
> 1 {}
> 1 {}
> 1 {}
> class org.apache.hadoop.io.IntWritable class org.apache.mahout.math.VectorWritable
>
> -- Output from SequenceFileStorage without params
> 1 (2,{(6595,4.0),(608,1.0)})
> 1 (1,{(9763,1.0)})
> 1 (1,{(3670,1.0)})
> 1 (1,{(2297,1.0)})
> class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
>
> Any help greatly appreciated.
>
> Thanks again,
> Colum
>
> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
>> Andy,
>>
>> Thanks for popping up!
>>
>> Elephant bird looks like it has awesome potential to make machine learning with Hadoop vastly easier. It is really good to see this kind of response ... that is what turns potential into action.
>>
>> Thanks again.
>>
>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <[email protected]> wrote:
>>
>>> Hi Colum, I'm an ElephantBird project committer and wrote both SequenceFileStorage and the VectorWritableConverter.
>>>
>>> The default Writable type used by SequenceFileStorage for both key and value is Text, hence the Text data when you don't provide extra configuration.
>>>
>>> Could you provide some sample data or task attempt logs from your job to help diagnose the issue? Unit tests for both of these utils cover a lot of edge cases, but if you've found a new one I'd like to get it sorted out!
>>>
>>> Thanks,
>>> Andy
>>>
>>> On Fri, Mar 1, 2013 at 8:29 AM, Ted Dunning <[email protected]> wrote:
>>>
>>>> I haven't touched elephant bird in some time. I had some fits with it at the time that I used it whenever I strayed from the well-trod path, but I had heard it was much better lately.
>>>>
>>>> Sorry not to be much more help than that.
>>>>
>>>> On Fri, Mar 1, 2013 at 3:50 AM, Colum Foley <[email protected]> wrote:
>>>>
>>>>> I am trying to store a Mahout RandomAccessSparseVector using elephant-bird and Pig. The data is of the form key (text), value (RandomAccessSparseVector). When I run Pig describe, it presents the following:
>>>>>
>>>>> pair: {key: int,val: (cardinality: int,entries: {entry: (index: int,value: double)})}
>>>>>
>>>>> My problem is that when I try to store tuples using elephant-bird's SequenceFileStorage as follows:
>>>>>
>>>>> store clusteredOut into 'logsvectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
>>>>>   '-c com.twitter.elephantbird.pig.util.TextConverter',
>>>>>   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter -- -sparse'
>>>>> );
>>>>>
>>>>> it runs successfully, but when I examine the resulting SequenceFile all the vectors are empty.
>>>>>
>>>>> On the other hand, if I run the following instead:
>>>>>
>>>>> store clusteredOut into 'logsvectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>>>>>
>>>>> i.e. do not specify the types of the key or value, the vectors are non-empty but are of type Text, and this causes my clustering algorithm to fail (as it expects VectorWritable).
>>>>>
>>>>> So my problem is that I need to output VectorWritable, but when I do, the resulting vectors are empty.
>>>>>
>>>>> Anyone else have experience with this issue?
>>>>>
>>>>> Many thanks,
>>>>> Colum
