Hi Andy,

I am using Pig 0.10.0 (but am happy to try another version). Yes, I am running in local mode with the example data below.
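In case it's useful, the check I run against the stored files is roughly the sketch below: open a part file with SequenceFile.Reader, print the key/value classes from the file header, and print each record (this is only a rough sketch; the class name DumpSeqFile and the command-line path are placeholders, and it assumes the Hadoop 1.x SequenceFile.Reader API plus Mahout 0.7 on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.VectorWritable;

public class DumpSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    // args[0] is a part file, e.g. output/part-m-00000 (placeholder; the name varies per job)
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      // Key and value classes recorded in the SequenceFile header
      System.out.println(reader.getKeyClass() + " " + reader.getValueClass());
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, val)) {
        // For VectorWritable values, print the wrapped Vector so sparse entries are visible
        Object shown = (val instanceof VectorWritable) ? ((VectorWritable) val).get() : val;
        System.out.println(key + "\t" + shown);
      }
    } finally {
      reader.close();
    }
  }
}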
Thanks again,
Colum

On Mon, Mar 4, 2013 at 3:02 PM, Andy Schlaikjer <[email protected]> wrote:
> Colum, thank you for passing on details. Could you also share with us
> the version of Pig you are running? I assume you're running in local
> mode with the example data below?
>
>
> On Mar 4, 2013, at 3:19 AM, Colum Foley <[email protected]> wrote:
>
>> Hi Andy, Ted,
>>
>> Thank you both for replying. Below I will describe the input data, the
>> Pig script I am using, and the resulting output.
>>
>> - The input data is the following (in file 'vectorsPigStored.dat'):
>>
>> bbb (2,{(6595,4.0),(608,1.0)})
>> ccd (1,{(9763,1.0)})
>> adc (1,{(3670,1.0)})
>> ads (1,{(2297,1.0)})
>>
>> - The full Pig script I am running is as follows:
>>
>> REGISTER 'elephant-bird-core-3.0.7.jar'
>> REGISTER 'elephant-bird-pig-3.0.7.jar'
>> REGISTER 'elephant-bird-mahout-3.0.7.jar'
>> REGISTER 'mahout-core-0.7.jar'
>> REGISTER 'mahout-math-0.7.jar'
>>
>> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray,
>>     val: (cardinality: int, entries: {entry: (index: int, value: double)}));
>>
>> -- Store output
>> store pair into 'output' using
>>     com.twitter.elephantbird.pig.store.SequenceFileStorage (
>>         '-c com.twitter.elephantbird.pig.util.TextConverter',
>>         '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>     );
>>
>> -- Store output without params for comparison
>> store pair into 'outputRaw' using
>>     com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>>
>> - Here is the output that I see, printed record by record, followed by the
>> key and value classes of each output file (read via SequenceFile.Reader,
>> reader.getKeyClass() / reader.getValueClass()):
>>
>> -- from 'output'
>> bbb {}
>> ccd {}
>> adc {}
>> ads {}
>> class org.apache.hadoop.io.Text  class org.apache.mahout.math.VectorWritable
>>
>> -- from 'outputRaw'
>> bbb (2,{(6595,4.0),(608,1.0)})
>> ccd (1,{(9763,1.0)})
>> adc (1,{(3670,1.0)})
>> ads (1,{(2297,1.0)})
>> class org.apache.hadoop.io.Text  class org.apache.hadoop.io.Text
>>
>> Just to confirm that the issue wasn't with my use of chararray keys
>> (instead of integer keys), I also tried a run using int keys, but
>> the result is the same:
>>
>> -- Output when using SequenceFileStorage with params
>> '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>> '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter':
>>
>> 1 {}
>> 1 {}
>> 1 {}
>> 1 {}
>> 1 {}
>> class org.apache.hadoop.io.IntWritable  class org.apache.mahout.math.VectorWritable
>>
>> -- Output from SequenceFileStorage without params:
>>
>> 1 (2,{(6595,4.0),(608,1.0)})
>> 1 (1,{(9763,1.0)})
>> 1 (1,{(3670,1.0)})
>> 1 (1,{(2297,1.0)})
>> class org.apache.hadoop.io.Text  class org.apache.hadoop.io.Text
>>
>> Any help greatly appreciated,
>>
>> Thanks again,
>> Colum
>>
>>
>> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
>>> Andy,
>>>
>>> Thanks for popping up!
>>>
>>> Elephant bird looks like it has awesome potential to make machine learning
>>> with Hadoop vastly easier. It is really good to see this kind of response
>>> ... that is what turns potential into action.
>>>
>>> Thanks again.
>>>
>>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <[email protected]> wrote:
>>>
>>>> Hi Colum, I'm an ElephantBird project committer and wrote both
>>>> SequenceFileStorage and the VectorWritableConverter.
>>>>
>>>> The default Writable type used by SequenceFileStorage for both key and
>>>> value is Text, hence the Text data when you don't provide extra
>>>> configuration.
>>>>
>>>> Could you provide some sample data or task attempt logs from your job to
>>>> help diagnose the issue? Unit tests for both of these utils cover a lot of
>>>> edge cases, but if you've found a new one I'd like to get it sorted out!
>>>>
>>>> Thanks,
>>>> Andy
>>>>
>>>>
>>>> On Fri, Mar 1, 2013 at 8:29 AM, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> I haven't touched elephant bird in some time. I had some fits with it at
>>>>> the time that I used it whenever I strayed from the well-trod path, but I
>>>>> had heard it was much better lately.
>>>>>
>>>>> Sorry not to be much more help than that.
>>>>>
>>>>> On Fri, Mar 1, 2013 at 3:50 AM, Colum Foley <[email protected]> wrote:
>>>>>
>>>>>> I am trying to store Mahout RandomAccessSparseVectors using
>>>>>> elephant-bird and Pig. The data is of the form
>>>>>> key (text), value (RandomAccessSparseVector). When I run Pig's DESCRIBE
>>>>>> it presents the following:
>>>>>>
>>>>>> pair: {key: int, val: (cardinality: int, entries: {entry: (index: int, value: double)})}
>>>>>>
>>>>>> My problem is that when I try to store tuples using elephant-bird's
>>>>>> SequenceFileStorage as follows:
>>>>>>
>>>>>> store clusteredOut into 'logsvectors.dat' using
>>>>>>     com.twitter.elephantbird.pig.store.SequenceFileStorage (
>>>>>>         '-c com.twitter.elephantbird.pig.util.TextConverter',
>>>>>>         '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter -- -sparse'
>>>>>>     );
>>>>>>
>>>>>> it runs successfully, but when I examine the resulting SequenceFile all
>>>>>> the vectors are empty.
>>>>>>
>>>>>> On the other hand, if I run the following instead:
>>>>>>
>>>>>> store clusteredOut into 'logsvectors.dat' using
>>>>>>     com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>>>>>>
>>>>>> i.e., do not specify the types of the key or value, the vectors are
>>>>>> non-empty but are of type Text, and this causes my clustering algorithm
>>>>>> to fail (it expects VectorWritable).
>>>>>>
>>>>>> So my problem is that I need to output VectorWritable values, but when I
>>>>>> do, the resulting vectors are empty.
>>>>>>
>>>>>> Anyone else have experience with this issue?
>>>>>>
>>>>>> Many thanks,
>>>>>> Colum
>>>>
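PS: to make the target format concrete, here is a minimal Java sketch of writing the kind of Text -> VectorWritable SequenceFile the clustering step expects, using only the standard Hadoop 1.x SequenceFile.Writer and Mahout 0.7 APIs. The class name, output path, and vector cardinality are placeholders; the key and entries mirror the first sample record above. This is only meant as a reference for what the SequenceFileStorage output should be equivalent to; in my runs the vectors instead come out empty.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteReferenceVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("referenceVectors.seq"); // placeholder output path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, VectorWritable.class);
    try {
      // One sparse vector mirroring the 'bbb' record; the cardinality (10000)
      // is a placeholder and must be at least as large as the highest index set.
      Vector v = new RandomAccessSparseVector(10000);
      v.set(6595, 4.0);
      v.set(608, 1.0);
      writer.append(new Text("bbb"), new VectorWritable(v));
    } finally {
      writer.close();
    }
  }
}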
