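I think the issue is with your understanding of what 'cardinality' means
here: it is the *dimension* of the vector (featureSpaceSize), not the
number of nonzero elements in that particular vector. In your example data,
the 'bbb' row declares a cardinality of 2 but uses indices 6595 and 608,
which cannot exist in a 2-dimensional vector.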
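
To make the distinction concrete, here is a minimal sketch in Python. It is
not Mahout's or Elephant Bird's API: the SparseVector class, the build()
helper, and the FEATURE_SPACE_SIZE value of 10000 are all hypothetical,
chosen only so that the cardinality exceeds the largest index in the example
data (9763). The sketch raises an error on an out-of-range index; the real
converter may handle that case differently.

```python
# Hypothetical stand-in for a sparse vector; NOT Mahout's or Elephant Bird's API.
class SparseVector:
    def __init__(self, cardinality):
        self.cardinality = cardinality  # dimension of the vector space
        self.entries = {}               # index -> nonzero value

    def set(self, index, value):
        # Every index must satisfy 0 <= index < cardinality.
        if not 0 <= index < self.cardinality:
            raise IndexError(
                f"index {index} is outside cardinality {self.cardinality}")
        self.entries[index] = value

    def nonzero_count(self):
        return len(self.entries)


def build(cardinality, pairs):
    """Build a vector from (index, value) pairs, as in the Pig tuples."""
    vec = SparseVector(cardinality)
    for index, value in pairs:
        vec.set(index, value)
    return vec


# Assumption: any size larger than the largest index in the data (9763).
FEATURE_SPACE_SIZE = 10000

# The 'bbb' row, with cardinality set to the feature space size.
ok = build(FEATURE_SPACE_SIZE, [(6595, 4.0), (608, 1.0)])
print(ok.cardinality, ok.nonzero_count())  # prints "10000 2"

# The same row with cardinality 2 (the nonzero count) is rejected here.
try:
    build(2, [(6595, 4.0), (608, 1.0)])
except IndexError as err:
    print("rejected:", err)
```

Under that assumption, the first input row would read
(10000,{(6595,4.0),(608,1.0)}) rather than (2,{(6595,4.0),(608,1.0)}).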

On Monday, March 4, 2013, Colum Foley wrote:

> Hi Andy,
>
> I am using Pig 0.10.0, (but am happy to try another). Yes, I am
> running in local mode with the example data below.
>
> Thanks again,
> Colum
>
> On Mon, Mar 4, 2013 at 3:02 PM, Andy Schlaikjer
> <[email protected]> wrote:
> > Colum, thank you for passing on details. Could you also share with us
> > the version of pig you are running? I assume you're running in local
> > mode with the example data below?
> >
> >
> > On Mar 4, 2013, at 3:19 AM, Colum Foley <[email protected]> wrote:
> >
> >> Hi Andy, Ted,
> >>
> >> Thank you both for replying. Below I will describe the input data, the
> >> pig script I am using, and the resulting output.
> >>
> >> -Input data is the following (in file 'vectorsPigStored.dat' ):
> >>
> >> bbb    (2,{(6595,4.0),(608,1.0)})
> >> ccd    (1,{(9763,1.0)})
> >> adc    (1,{(3670,1.0)})
> >> ads    (1,{(2297,1.0)})
> >>
> >>
> >> -The full Pig script I am running is as follows:
> >>
> >>
> >> REGISTER 'elephant-bird-core-3.0.7.jar'
> >> REGISTER 'elephant-bird-pig-3.0.7.jar'
> >> REGISTER 'elephant-bird-mahout-3.0.7.jar'
> >> REGISTER 'mahout-core-0.7.jar'
> >> REGISTER 'mahout-math-0.7.jar'
> >>
> >> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val:
> >> (cardinality: int, entries: {entry: (index: int, value: double)}));
> >> --Store output
> >> store pair into 'output' using
> >> com.twitter.elephantbird.pig.store.SequenceFileStorage (
> >>   '-c com.twitter.elephantbird.pig.util.TextConverter',
> >>   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> >> );
> >> --Store output without params for comparison
> >> store pair into 'outputRaw' using
> >> com.twitter.elephantbird.pig.store.SequenceFileStorage ();
> >>
> >>
> >>
> >>
> >> -Here is the output that I see, printed line by line and below is the
> >> type of input (using SequenceFile.Reader, reader.getKeyClass())
> >>
> >> -- from 'output'
> >> bbb  {}
> >> ccd  {}
> >> adc  {}
> >> ads  {}
> >> class org.apache.hadoop.io.Text    class org.apache.mahout.math.VectorWritable
> >>
> >> --from 'outputRaw'
> >> bbb  (2,{(6595,4.0),(608,1.0)})
> >> ccd  (1,{(9763,1.0)})
> >> adc  (1,{(3670,1.0)})
> >> ads  (1,{(2297,1.0)})
> >> class org.apache.hadoop.io.Text    class org.apache.hadoop.io.Text
> >>
> >>
> >>
> >> **Just to confirm that the issue wasn't with my use of chararray keys
> >> (instead of integer keys), I also tried a run with using int keys, but
> >> the result is the same:
> >>
> >>
> >>
> >> --Output when using SequenceFileStorage with params  '-c
> >> com.twitter.elephantbird.pig.util.IntWritableConverter','-c
> >> com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> >>
> >> 1  {}
> >> 1  {}
> >> 1  {}
> >> 1  {}
> >> 1  {}
> >> class org.apache.hadoop.io.IntWritable    class
> >> org.apache.mahout.math.VectorWritable
> >>
> >> --Output from SequenceFileStorage without params
> >> 1  (2,{(6595,4.0),(608,1.0)})
> >> 1  (1,{(9763,1.0)})
> >> 1  (1,{(3670,1.0)})
> >> 1  (1,{(2297,1.0)})
> >> class org.apache.hadoop.io.Text    class org.apache.hadoop.io.Text
> >>
> >>
> >>
> >>
> >> Any help greatly appreciated,
> >>
> >> Thanks again,
> >> Colum
> >>
> >>
> >>
> >> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
> >>> Andy,
> >>>
> >>> Thanks for popping up!
> >>>
> >>> Elephant bird looks like it has awesome potential to make machine learning
> >>> with Hadoop vastly easier.  It is really good to see this kind of response
> >>> ... that is what turns potential into action.
> >>>
> >>> Thanks again.
> >>>
> >>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <



-- 

  -jake
