Thanks Jake, yes, that's the first thing to fix. Generally, all of your
sparse vectors should have the same "size" (cardinality), though each may
have a different number of non-default values.

Try updating your example input data to read:

bbb     (10000,{(6595,4.0),(608,1.0)})
ccd     (10000,{(9763,1.0)})
adc     (10000,{(3670,1.0)})
ads     (10000,{(2297,1.0)})

All of your indices must fall within [0, cardinality).
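
If it helps to sanity-check the input before storing it, here is a minimal
sketch in plain Python of the range rule above. The helper name check_vector
and the regex-based parsing are only illustrative assumptions on my part; they
are not part of Elephant Bird or Mahout, and the sketch does not reproduce
what VectorWritableConverter does internally.

```python
import re

# Hypothetical helper for illustration only; not an Elephant Bird or Mahout API.
# Assumes the text form used in this thread: (cardinality,{(index,value),...})
VECTOR_RE = re.compile(r"^\((\d+),\{(.*)\}\)$")
ENTRY_RE = re.compile(r"\((\d+),([^()]+)\)")


def check_vector(text):
    """Return the indices that fall outside [0, cardinality) for one vector."""
    match = VECTOR_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognized vector text: %r" % text)
    cardinality = int(match.group(1))
    indices = [int(index) for index, _value in ENTRY_RE.findall(match.group(2))]
    return [index for index in indices if not 0 <= index < cardinality]


original = "(2,{(6595,4.0),(608,1.0)})"
updated = "(10000,{(6595,4.0),(608,1.0)})"
print(check_vector(original))  # lists the indices that exceed a cardinality of 2
print(check_vector(updated))   # empty list: every index is in range
```

An empty list means every index in that vector is below the declared
cardinality.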

Andy



On Mon, Mar 4, 2013 at 8:17 AM, Jake Mannix <[email protected]> wrote:

> I think the issue is with your understanding of what 'cardinality' means
> here: it is the *dimension* of the vector (featureSpaceSize), not the
> number of nonzero elements in that particular vector.
>
> On Monday, March 4, 2013, Colum Foley wrote:
>
> > Hi Andy,
> >
> > I am using Pig 0.10.0, (but am happy to try another). Yes, I am
> > running in local mode with the example data below.
> >
> > Thanks again,
> > Colum
> >
> > On Mon, Mar 4, 2013 at 3:02 PM, Andy Schlaikjer
> > <[email protected]> wrote:
> > > Colum, thank you for passing on details. Could you also share with us
> > > the version of pig you are running? I assume you're running in local
> > > mode with the example data below?
> > >
> > >
> > > On Mar 4, 2013, at 3:19 AM, Colum Foley <[email protected]> wrote:
> > >
> > >> Hi Andy, Ted,
> > >>
> > >> Thank you both for replying. Below I will describe the input data, the
> > >> pig script I am using, and the resulting output.
> > >>
> > >> -Input data is the following (in file 'vectorsPigStored.dat' ):
> > >>
> > >> bbb    (2,{(6595,4.0),(608,1.0)})
> > >> ccd    (1,{(9763,1.0)})
> > >> adc    (1,{(3670,1.0)})
> > >> ads    (1,{(2297,1.0)})
> > >>
> > >>
> > >> -The full Pig script I am running is as follows:
> > >>
> > >>
> > >> REGISTER 'elephant-bird-core-3.0.7.jar'
> > >> REGISTER 'elephant-bird-pig-3.0.7.jar'
> > >> REGISTER 'elephant-bird-mahout-3.0.7.jar'
> > >> REGISTER 'mahout-core-0.7.jar'
> > >> REGISTER 'mahout-math-0.7.jar'
> > >>
> > >> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val:
> > >> (cardinality: int, entries: {entry: (index: int, value: double)}));
> > >> --Store output
> > >> store pair into 'output' using
> > >> com.twitter.elephantbird.pig.store.SequenceFileStorage (
> > >>   '-c com.twitter.elephantbird.pig.util.TextConverter',
> > >>   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> > >> );
> > >> --Store output without params for comparison
> > >> store pair into 'outputRaw' using
> > >> com.twitter.elephantbird.pig.store.SequenceFileStorage ();
> > >>
> > >>
> > >>
> > >>
> > >> -Here is the output that I see, printed line by line, followed by the
> > >> key and value classes (read using SequenceFile.Reader, reader.getKeyClass())
> > >>
> > >> -- from 'output'
> > >> bbb  {}
> > >> ccd  {}
> > >> adc  {}
> > >> ads  {}
> > >> class org.apache.hadoop.io.Text    class org.apache.mahout.math.VectorWritable
> > >>
> > >> --from 'outputRaw'
> > >> bbb  (2,{(6595,4.0),(608,1.0)})
> > >> ccd  (1,{(9763,1.0)})
> > >> adc  (1,{(3670,1.0)})
> > >> ads  (1,{(2297,1.0)})
> > >> class org.apache.hadoop.io.Text    class org.apache.hadoop.io.Text
> > >>
> > >>
> > >>
> > >> **Just to confirm that the issue wasn't with my use of chararray keys
> > >> (instead of integer keys), I also tried a run using int keys, but
> > >> the result is the same:
> > >>
> > >>
> > >>
> > >> --Output when using SequenceFileStorage with params  '-c
> > >> com.twitter.elephantbird.pig.util.IntWritableConverter','-c
> > >> com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> > >>
> > >> 1  {}
> > >> 1  {}
> > >> 1  {}
> > >> 1  {}
> > >> 1  {}
> > >> class org.apache.hadoop.io.IntWritable    class
> > >> org.apache.mahout.math.VectorWritable
> > >>
> > >> --Output from SequenceFileStorage without params
> > >> 1  (2,{(6595,4.0),(608,1.0)})
> > >> 1  (1,{(9763,1.0)})
> > >> 1  (1,{(3670,1.0)})
> > >> 1  (1,{(2297,1.0)})
> > >> class org.apache.hadoop.io.Text    class org.apache.hadoop.io.Text
> > >>
> > >>
> > >>
> > >>
> > >> Any help greatly appreciated,
> > >>
> > >> Thanks again,
> > >> Colum
> > >>
> > >>
> > >>
> > >> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
> > >>> Andy,
> > >>>
> > >>> Thanks for popping up!
> > >>>
> > >>> Elephant bird looks like it has awesome potential to make machine learning
> > >>> with Hadoop vastly easier.  It is really good to see this kind of response
> > >>> ... that is what turns potential into action.
> > >>>
> > >>> Thanks again.
> > >>>
> > >>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <
>
>
>
> --
>
>   -jake
>
