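I think the issue is with your understanding of what 'cardinality' means here: it is the *dimension* of the vector (featureSpaceSize), not the number of nonzero elements in that particular vector. Your rows declare cardinalities of 2 and 1 but use indices as large as 9763, so none of the entries fit inside the declared dimension, which would explain the empty vectors.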
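To make the distinction concrete, here is a minimal sketch in Python. It is only an illustration: `SparseVector`, `num_nonzero`, and `FEATURE_SPACE_SIZE = 10000` are hypothetical names and values I made up for the example, not Mahout's or Elephant Bird's API. It also raises an error on an out-of-range index purely to make the point; how the real converter handles such entries is not something this thread confirms.

```python
# Illustrative only: a tiny dict-backed sparse vector, not Mahout's implementation.
class SparseVector:
    def __init__(self, cardinality):
        self.cardinality = cardinality  # dimension of the feature space
        self.entries = {}               # index -> nonzero value

    def set(self, index, value):
        # Every index must lie inside the declared dimension.
        if not 0 <= index < self.cardinality:
            raise IndexError(
                f"index {index} outside cardinality {self.cardinality}"
            )
        self.entries[index] = value

    def num_nonzero(self):
        # Number of stored entries; unrelated to the cardinality.
        return len(self.entries)


# Assumed value: it only needs to exceed the largest index in the data (9763).
FEATURE_SPACE_SIZE = 10000

# The 'bbb' row: two nonzero entries in a 10000-dimensional space.
bbb = SparseVector(FEATURE_SPACE_SIZE)
bbb.set(6595, 4.0)
bbb.set(608, 1.0)

# The same row declared with cardinality 2 cannot hold index 6595.
too_small = SparseVector(2)
try:
    too_small.set(6595, 4.0)
    rejected = False
except IndexError:
    rejected = True
```

Under that reading, the first input row would look something like `bbb (10000,{(6595,4.0),(608,1.0)})`, with 10000 standing in for whatever your real feature space size is.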

On Monday, March 4, 2013, Colum Foley wrote:

> Hi Andy,
>
> I am using Pig 0.10.0, (but am happy to try another). Yes, I am
> running in local mode with the example data below.
>
> Thanks again,
> Colum
>
> On Mon, Mar 4, 2013 at 3:02 PM, Andy Schlaikjer
> <[email protected]> wrote:
> > Colum, thank you for passing on details. Could you also share with us
> > the version of pig you are running? I assume you're running in local
> > mode with the example data below?
> >
> >
> > On Mar 4, 2013, at 3:19 AM, Colum Foley <[email protected]> wrote:
> >
> >> Hi Andy, Ted,
> >>
> >> Thank you both for replying. Below I will describe the input data, the
> >> pig script I am using, and the resulting output.
> >>
> >> -Input data is the following (in file 'vectorsPigStored.dat' ):
> >>
> >> bbb (2,{(6595,4.0),(608,1.0)})
> >> ccd (1,{(9763,1.0)})
> >> adc (1,{(3670,1.0)})
> >> ads (1,{(2297,1.0)})
> >>
> >>
> >> -The full Pig script I am running is as follows:
> >>
> >>
> >> REGISTER 'elephant-bird-core-3.0.7.jar'
> >> REGISTER 'elephant-bird-pig-3.0.7.jar'
> >> REGISTER 'elephant-bird-mahout-3.0.7.jar'
> >> REGISTER 'mahout-core-0.7.jar'
> >> REGISTER 'mahout-math-0.7.jar'
> >>
> >> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val:
> >> (cardinality: int, entries: {entry: (index: int, value: double)}));
> >> --Store output
> >> store pair into 'output' using
> >> com.twitter.elephantbird.pig.store.SequenceFileStorage (
> >> '-c com.twitter.elephantbird.pig.util.TextConverter',
> >> '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> >> );
> >> --Store output without params for comparison
> >> store pair into 'outputRaw' using
> >> com.twitter.elephantbird.pig.store.SequenceFileStorage ();
> >>
> >>
> >>
> >>
> >> -Here is the output that I see, printed line by line and below is the
> >> type of input (using SequenceFile.Reader, reader.getKeyClass())
> >>
> >> -- from 'output'
> >> bbb {}
> >> ccd {}
> >> adc {}
> >> ads {}
> >> class org.apache.hadoop.io.Text class org.apache.mahout.math.VectorWritable
> >>
> >> --from 'outputRaw'
> >> bbb (2,{(6595,4.0),(608,1.0)})
> >> ccd (1,{(9763,1.0)})
> >> adc (1,{(3670,1.0)})
> >> ads (1,{(2297,1.0)})
> >> class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
> >>
> >>
> >>
> >> **Just to confirm that the issue wasn't with my use of chararray keys
> >> (instead of integer keys), I also tried a run with using int keys, but
> >> the result is the same:
> >>
> >>
> >>
> >> --Output when using SequenceFileStorage with params '-c
> >> com.twitter.elephantbird.pig.util.IntWritableConverter','-c
> >> com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> >>
> >> 1 {}
> >> 1 {}
> >> 1 {}
> >> 1 {}
> >> 1 {}
> >> class org.apache.hadoop.io.IntWritable class
> >> org.apache.mahout.math.VectorWritable
> >>
> >> --Output from SequenceFileStorage without params
> >> 1 (2,{(6595,4.0),(608,1.0)})
> >> 1 (1,{(9763,1.0)})
> >> 1 (1,{(3670,1.0)})
> >> 1 (1,{(2297,1.0)})
> >> class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
> >>
> >>
> >>
> >>
> >> Any help greatly appreciated,
> >>
> >> Thanks again,
> >> Colum
> >>
> >>
> >>
> >> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
> >>> Andy,
> >>>
> >>> Thanks for popping up!
> >>>
> >>> Elephant bird looks like it has awesome potential to make machine learning
> >>> with Hadoop vastly easier. It is really good to see this kind of response
> >>> ... that is what turns potential into action.
> >>>
> >>> Thanks again.
> >>>
> >>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <

--
-jake
