Thanks Ted.

On Mon, Mar 4, 2013 at 7:16 PM, Ted Dunning <[email protected]> wrote:

> It is usually OK to over-estimate this, but it depends on the algorithm.
>
> The issue turns up when you transform matrices into a form that has as many
> rows as you declared columns. This comes up in cooccurrence counting for
> recommendations or SVD (where the right singular vectors have as many rows
> as the original matrix had columns).
>
> That may indicate it is a good idea to pad the number, but not be
> outrageous.
>
> On Mon, Mar 4, 2013 at 12:42 PM, Colum Foley <[email protected]> wrote:
>
>> Hi Jake, Andy,
>>
>> Indeed that was the problem, I had thought the cardinality value was
>> for the number of items in the bag, many thanks for the help!
>>
>> Is it OK to overestimate this value or does it need to match the
>> actual cardinality exactly?
>>
>> Thanks,
>> Colum
>>
>> On Mon, Mar 4, 2013 at 4:31 PM, Andy Schlaikjer
>> <[email protected]> wrote:
>> > Thanks Jake, yes, that's the first thing to fix-- Generally, all of your
>> > sparse vectors should have the same "size" (cardinality), but may have
>> > different numbers of non-default values.
>> >
>> > Try updating your example input data to read:
>> >
>> > bbb (10000,{(6595,4.0),(608,1.0)})
>> > ccd (10000,{(9763,1.0)})
>> > adc (10000,{(3670,1.0)})
>> > ads (10000,{(2297,1.0)})
>> >
>> > All of your indices must fall within [0, cardinality).
>> >
>> > Andy
>> >
>> > On Mon, Mar 4, 2013 at 8:17 AM, Jake Mannix <[email protected]> wrote:
>> >
>> >> I think the issue is with your understanding of what 'cardinality' means
>> >> here: it is the *dimension* of the vector (featureSpaceSize), not the
>> >> number of nonzero elements in that particular vector
>> >>
>> >> On Monday, March 4, 2013, Colum Foley wrote:
>> >>
>> >> > Hi Andy,
>> >> >
>> >> > I am using Pig 0.10.0, (but am happy to try another). Yes, I am
>> >> > running in local mode with the example data below.
>> >> >
>> >> > Thanks again,
>> >> > Colum
>> >> >
>> >> > On Mon, Mar 4, 2013 at 3:02 PM, Andy Schlaikjer
>> >> > <[email protected]> wrote:
>> >> > > Colum, thank you for passing on details. Could you also share with us
>> >> > > the version of pig you are running? I assume you're running in local
>> >> > > mode with the example data below?
>> >> > >
>> >> > > On Mar 4, 2013, at 3:19 AM, Colum Foley <[email protected]> wrote:
>> >> > >
>> >> > >> Hi Andy, Ted,
>> >> > >>
>> >> > >> Thank you both for replying. Below I will describe the input data, the
>> >> > >> pig script I am using, and the resulting output.
>> >> > >>
>> >> > >> -Input data is the following (in file 'vectorsPigStored.dat' ):
>> >> > >>
>> >> > >> bbb (2,{(6595,4.0),(608,1.0)})
>> >> > >> ccd (1,{(9763,1.0)})
>> >> > >> adc (1,{(3670,1.0)})
>> >> > >> ads (1,{(2297,1.0)})
>> >> > >>
>> >> > >> -The full Pig script I am running is as follows:
>> >> > >>
>> >> > >> REGISTER 'elephant-bird-core-3.0.7.jar'
>> >> > >> REGISTER 'elephant-bird-pig-3.0.7.jar'
>> >> > >> REGISTER 'elephant-bird-mahout-3.0.7.jar'
>> >> > >> REGISTER 'mahout-core-0.7.jar'
>> >> > >> REGISTER 'mahout-math-0.7.jar'
>> >> > >>
>> >> > >> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val:
>> >> > >> (cardinality: int, entries: {entry: (index: int, value: double)}));
>> >> > >> --Store output
>> >> > >> store pair into 'output' using
>> >> > >> com.twitter.elephantbird.pig.store.SequenceFileStorage (
>> >> > >> '-c com.twitter.elephantbird.pig.util.TextConverter',
>> >> > >> '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>> >> > >> );
>> >> > >> --Store output without params for comparison
>> >> > >> store pair into 'outputRaw' using
>> >> > >> com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>> >> > >>
>> >> > >> -Here is the output that I see, printed line by line and below is the
>> >> > >> type of input (using SequenceFile.Reader, reader.getKeyClass())
>> >> > >>
>> >> > >> -- from 'output'
>> >> > >> bbb {}
>> >> > >> ccd {}
>> >> > >> adc {}
>> >> > >> ads {}
>> >> > >> class org.apache.hadoop.io.Text class org.apache.mahout.math.VectorWritable
>> >> > >>
>> >> > >> --from 'outputRaw'
>> >> > >> bbb (2,{(6595,4.0),(608,1.0)})
>> >> > >> ccd (1,{(9763,1.0)})
>> >> > >> adc (1,{(3670,1.0)})
>> >> > >> ads (1,{(2297,1.0)})
>> >> > >> class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
>> >> > >>
>> >> > >> **Just to confirm that the issue wasn't with my use of chararray keys
>> >> > >> (instead of integer keys), I also tried a run with using int keys, but
>> >> > >> the result is the same:
>> >> > >>
>> >> > >> --Output when using SequenceFileStorage with params '-c
>> >> > >> com.twitter.elephantbird.pig.util.IntWritableConverter','-c
>> >> > >> com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>> >> > >>
>> >> > >> 1 {}
>> >> > >> 1 {}
>> >> > >> 1 {}
>> >> > >> 1 {}
>> >> > >> 1 {}
>> >> > >> class org.apache.hadoop.io.IntWritable class
>> >> > >> org.apache.mahout.math.VectorWritable
>> >> > >>
>> >> > >> --Output from SequenceFileStorage without params
>> >> > >> 1 (2,{(6595,4.0),(608,1.0)})
>> >> > >> 1 (1,{(9763,1.0)})
>> >> > >> 1 (1,{(3670,1.0)})
>> >> > >> 1 (1,{(2297,1.0)})
>> >> > >> class org.apache.hadoop.io.Text class org.apache.hadoop.io.Text
>> >> > >>
>> >> > >> Any help greatly appreciated,
>> >> > >>
>> >> > >> Thanks again,
>> >> > >> Colum
>> >> > >>
>> >> > >> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
>> >> > >>> Andy,
>> >> > >>>
>> >> > >>> Thanks for popping up!
>> >> > >>>
>> >> > >>> Elephant bird looks like it has awesome potential to make machine learning
>> >> > >>> with Hadoop vastly easier. It is really good to see this kind of response
>> >> > >>> ... that is what turns potential into action.
>> >> > >>>
>> >> > >>> Thanks again.
>> >> > >>>
>> >> > >>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <
>> >>
>> >> --
>> >>
>> >> -jake
>> >>
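
For anyone landing on this thread later, here is a rough sketch of the cardinality rule discussed above: every index must fall within [0, cardinality), and all vectors should share one cardinality. It is plain Python with no Mahout or Elephant Bird involved, and the helper names (out_of_range, smallest_valid_cardinality, with_cardinality) are made up for illustration only. The data is the example input from the thread.

```python
# Each record mirrors the Pig tuple (cardinality, {(index, value), ...})
# from the original 'vectorsPigStored.dat' example in this thread.
original = {
    "bbb": (2, [(6595, 4.0), (608, 1.0)]),
    "ccd": (1, [(9763, 1.0)]),
    "adc": (1, [(3670, 1.0)]),
    "ads": (1, [(2297, 1.0)]),
}


def out_of_range(cardinality, entries):
    """Return the indices that fall outside [0, cardinality)."""
    return [index for index, _ in entries if not 0 <= index < cardinality]


def smallest_valid_cardinality(vectors):
    """Largest index seen across all vectors, plus one."""
    return max(index for _, entries in vectors.values() for index, _ in entries) + 1


def with_cardinality(vectors, cardinality):
    """Copy the vectors, giving every one the same declared cardinality."""
    return {key: (cardinality, entries) for key, (_, entries) in vectors.items()}


# With the original data, every index is out of range for its declared size.
problems = {key: out_of_range(card, entries) for key, (card, entries) in original.items()}

# Padding to 10000, as suggested in the thread, leaves no index out of range.
fixed = with_cardinality(original, 10000)
remaining = {key: out_of_range(card, entries) for key, (card, entries) in fixed.items()}

print(problems)
print(smallest_valid_cardinality(original))
print(remaining)
```

The smallest value that would work for this data is the largest index plus one; 10000 pads it a little, which matches the advice above to over-estimate modestly rather than outrageously.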
