It is usually OK to over-estimate the cardinality, but it depends on the algorithm. Over-estimating becomes an issue when you transform a matrix into a form that has as many rows as you declared columns. This comes up in cooccurrence counting for recommendations and in SVD (where the right singular vectors have as many rows as the original matrix had columns).
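
As a rough illustration of why an over-estimate matters for such transforms, here is a minimal Python sketch. It is not Mahout code: the `cooccurrence` helper and the toy data are made up for this note, and it stores the result densely, which a real implementation may well avoid.

```python
def cooccurrence(rows, cardinality):
    """Dense item-by-item cooccurrence counts (A^T A for 0/1 data).

    rows: list of sets holding the non-zero column indices of each row.
    cardinality: the declared number of columns (hypothetical parameter).
    """
    # The result has as many rows (and columns) as the declared cardinality,
    # regardless of how many columns actually hold data.
    counts = [[0] * cardinality for _ in range(cardinality)]
    for row in rows:
        for i in row:
            for j in row:
                counts[i][j] += 1
    return counts


rows = [{0, 2}, {1, 2}, {2}]       # only columns 0..2 are ever used
exact = cooccurrence(rows, 3)      # cardinality matches the data
padded = cooccurrence(rows, 6)     # cardinality over-estimated
print(len(exact), len(padded))
```

The counts for the real columns are identical in both results; the padded one only adds all-zero rows and columns. In this dense sketch that wasted space grows with the square of the declared size.
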
That suggests it is a good idea to pad the number, as long as the padding is not outrageous.

On Mon, Mar 4, 2013 at 12:42 PM, Colum Foley <[email protected]> wrote:

> Hi Jake, Andy,
>
> Indeed that was the problem, I had thought the cardinality value was
> for the number of items in the bag, many thanks for the help!
>
> Is it OK to overestimate this value or does it need to match the
> actual cardinality exactly?
>
> Thanks,
> Colum
>
> On Mon, Mar 4, 2013 at 4:31 PM, Andy Schlaikjer
> <[email protected]> wrote:
> > Thanks Jake, yes, that's the first thing to fix-- Generally, all of your
> > sparse vectors should have the same "size" (cardinality), but may have
> > different numbers of non-default values.
> >
> > Try updating your example input data to read:
> >
> > bbb (10000,{(6595,4.0),(608,1.0)})
> > ccd (10000,{(9763,1.0)})
> > adc (10000,{(3670,1.0)})
> > ads (10000,{(2297,1.0)})
> >
> > All of your indices must fall within [0, cardinality).
> >
> > Andy
> >
> > On Mon, Mar 4, 2013 at 8:17 AM, Jake Mannix <[email protected]> wrote:
> >
> >> I think the issue is with your understanding of what 'cardinality' means
> >> here: it is the *dimension* of the vector (featureSpaceSize), not the
> >> number of nonzero elements in that particular vector
> >>
> >> On Monday, March 4, 2013, Colum Foley wrote:
> >>
> >> > Hi Andy,
> >> >
> >> > I am using Pig 0.10.0, (but am happy to try another). Yes, I am
> >> > running in local mode with the example data below.
> >> >
> >> > Thanks again,
> >> > Colum
> >> >
> >> > On Mon, Mar 4, 2013 at 3:02 PM, Andy Schlaikjer
> >> > <[email protected]> wrote:
> >> > > Colum, thank you for passing on details. Could you also share with us
> >> > > the version of pig you are running? I assume you're running in local
> >> > > mode with the example data below?
> >> > >
> >> > > On Mar 4, 2013, at 3:19 AM, Colum Foley <[email protected]> wrote:
> >> > >
> >> > >> Hi Andy, Ted,
> >> > >>
> >> > >> Thank you both for replying. Below I will describe the input data, the
> >> > >> pig script I am using, and the resulting output.
> >> > >>
> >> > >> -Input data is the following (in file 'vectorsPigStored.dat' ):
> >> > >>
> >> > >> bbb (2,{(6595,4.0),(608,1.0)})
> >> > >> ccd (1,{(9763,1.0)})
> >> > >> adc (1,{(3670,1.0)})
> >> > >> ads (1,{(2297,1.0)})
> >> > >>
> >> > >> -The full Pig script I am running is as follows:
> >> > >>
> >> > >> REGISTER 'elephant-bird-core-3.0.7.jar'
> >> > >> REGISTER 'elephant-bird-pig-3.0.7.jar'
> >> > >> REGISTER 'elephant-bird-mahout-3.0.7.jar'
> >> > >> REGISTER 'mahout-core-0.7.jar'
> >> > >> REGISTER 'mahout-math-0.7.jar'
> >> > >>
> >> > >> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val:
> >> > >> (cardinality: int, entries: {entry: (index: int, value: double)}));
> >> > >> --Store output
> >> > >> store pair into 'output' using
> >> > >> com.twitter.elephantbird.pig.store.SequenceFileStorage (
> >> > >> '-c com.twitter.elephantbird.pig.util.TextConverter',
> >> > >> '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> >> > >> );
> >> > >> --Store output without params for comparison
> >> > >> store pair into 'outputRaw' using
> >> > >> com.twitter.elephantbird.pig.store.SequenceFileStorage ();
> >> > >>
> >> > >> -Here is the output that I see, printed line by line and below is the
> >> > >> type of input (using SequenceFile.Reader, reader.getKeyClass())
> >> > >>
> >> > >> -- from 'output'
> >> > >> bbb {}
> >> > >> ccd {}
> >> > >> adc {}
> >> > >> ads {}
> >> > >> class org.apache.hadoop.io.Text class org.apache.mahout.math.VectorWritable
> >> > >>
> >> > >> --from 'outputRaw'
> >> > >> bbb (2,{(6595,4.0),(608,1.0)})
> >> > >> ccd (1,{(9763,1.0)})
> >> > >> adc (1,{(3670,1.0)})
> >> > >> ads (1,{(2297,1.0)})
> >> > >> class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
> >> > >>
> >> > >> **Just to confirm that the issue wasn't with my use of chararray keys
> >> > >> (instead of integer keys), I also tried a run with using int keys, but
> >> > >> the result is the same:
> >> > >>
> >> > >> --Output when using SequenceFileStorage with params '-c
> >> > >> com.twitter.elephantbird.pig.util.IntWritableConverter','-c
> >> > >> com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> >> > >>
> >> > >> 1 {}
> >> > >> 1 {}
> >> > >> 1 {}
> >> > >> 1 {}
> >> > >> 1 {}
> >> > >> class org.apache.hadoop.io.IntWritable class
> >> > >> org.apache.mahout.math.VectorWritable
> >> > >>
> >> > >> --Output from SequenceFileStorage without params
> >> > >> 1 (2,{(6595,4.0),(608,1.0)})
> >> > >> 1 (1,{(9763,1.0)})
> >> > >> 1 (1,{(3670,1.0)})
> >> > >> 1 (1,{(2297,1.0)})
> >> > >> class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
> >> > >>
> >> > >> Any help greatly appreciated,
> >> > >>
> >> > >> Thanks again,
> >> > >> Colum
> >> > >>
> >> > >> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
> >> > >>> Andy,
> >> > >>>
> >> > >>> Thanks for popping up!
> >> > >>>
> >> > >>> Elephant bird looks like it has awesome potential to make machine learning
> >> > >>> with Hadoop vastly easier. It is really good to see this kind of response
> >> > >>> ... that is what turns potential into action.
> >> > >>>
> >> > >>> Thanks again.
> >> > >>>
> >> > >>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <
> >>
> >> --
> >> -jake
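
The cardinality point from this thread can be sketched in a few lines of Python. This is only an illustration under assumptions: `check_sparse_vector` is a hypothetical helper, not part of Elephant Bird or Mahout, and it raises an error for an out-of-range index, whereas the run quoted above produced empty vectors.

```python
def check_sparse_vector(cardinality, entries):
    """Return the entries as an index -> value dict.

    cardinality is the dimension of the vector, not the number of entries.
    Raises IndexError if any index falls outside [0, cardinality).
    """
    vec = {}
    for index, value in entries:
        if not 0 <= index < cardinality:
            raise IndexError(f"index {index} outside [0, {cardinality})")
        vec[index] = value
    return vec


# Cardinality as the feature-space size: both indices fit.
ok = check_sparse_vector(10000, [(6595, 4.0), (608, 1.0)])

# Cardinality mistaken for the number of entries: both indices are out of range.
try:
    check_sparse_vector(2, [(6595, 4.0), (608, 1.0)])
    rejected = False
except IndexError:
    rejected = True

print(ok, rejected)
```

The first call mirrors the corrected input (`bbb (10000,{(6595,4.0),(608,1.0)})`); the second mirrors the original input, where 2 was the count of entries rather than the dimension.
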
