Thanks Ted.
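
For anyone who lands on this thread with the same empty-vector output, here is a minimal sketch (in Python rather than Pig) of the fix discussed below: give every row the same cardinality and check that each index falls within [0, cardinality). The name fix_line and the regular expression are hypothetical illustrations, not part of Pig, Elephant Bird, or Mahout, and the sketch assumes the text layout of my example data.

```python
import re

# Matches one "(index,value)" entry. The leading "(n,{" cardinality prefix
# does not match, because "{" is not a valid start for the value.
ENTRY = re.compile(r"\((\d+),([-+0-9.eE]+)\)")


def fix_line(line, cardinality):
    """Rewrite 'key  (n,{(index,value),...})' so that n == cardinality.

    Hypothetical helper for illustration only.
    """
    key, vec = line.split(None, 1)
    entries = [(int(i), float(v)) for i, v in ENTRY.findall(vec)]
    for index, _ in entries:
        # Every index must fall within [0, cardinality).
        if not 0 <= index < cardinality:
            raise ValueError("index %d outside [0, %d)" % (index, cardinality))
    body = ",".join("(%d,%s)" % (i, v) for i, v in entries)
    # Tab-separated key and tuple.
    return "%s\t(%d,{%s})" % (key, cardinality, body)
```

Calling fix_line on my original bbb line with a cardinality of 10000 gives the bbb line Andy suggested, while passing my original value of 2 raises ValueError because index 6595 is out of range.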

On Mon, Mar 4, 2013 at 7:16 PM, Ted Dunning <[email protected]> wrote:
> It is usually OK to over-estimate this, but it depends on the algorithm.
>
> The issue turns up when you transform matrices into a form that has as many
> rows as you declared columns.  This comes up in cooccurrence counting for
> recommendations or SVD (where the right singular vectors have as many rows
> as the original matrix had columns).
>
> That suggests it is a good idea to pad the number, but not
> outrageously.
>
> On Mon, Mar 4, 2013 at 12:42 PM, Colum Foley <[email protected]> wrote:
>
>> Hi Jake, Andy,
>>
>> Indeed, that was the problem. I had thought the cardinality value was
>> for the number of items in the bag. Many thanks for the help!
>>
>> Is it OK to overestimate this value or does it need to match the
>> actual cardinality exactly?
>>
>> Thanks,
>> Colum
>>
>>
>>
>>
>>
>> On Mon, Mar 4, 2013 at 4:31 PM, Andy Schlaikjer
>> <[email protected]> wrote:
>> > Thanks Jake, yes, that's the first thing to fix. Generally, all of your
>> > sparse vectors should have the same "size" (cardinality), but may have
>> > different numbers of non-default values.
>> >
>> > Try updating your example input data to read:
>> >
>> > bbb     (10000,{(6595,4.0),(608,1.0)})
>> > ccd     (10000,{(9763,1.0)})
>> > adc     (10000,{(3670,1.0)})
>> > ads     (10000,{(2297,1.0)})
>> >
>> > All of your indices must fall within [0, cardinality).
>> >
>> > Andy
>> >
>> >
>> >
>> > On Mon, Mar 4, 2013 at 8:17 AM, Jake Mannix <[email protected]> wrote:
>> >
>> >> I think the issue is with your understanding of what 'cardinality' means
>> >> here: it is the *dimension* of the vector (featureSpaceSize), not the
>> >> number of nonzero elements in that particular vector.
>> >>
>> >> On Monday, March 4, 2013, Colum Foley wrote:
>> >>
>> >> > Hi Andy,
>> >> >
>> >> > I am using Pig 0.10.0 (but am happy to try another). Yes, I am
>> >> > running in local mode with the example data below.
>> >> >
>> >> > Thanks again,
>> >> > Colum
>> >> >
>> >> > On Mon, Mar 4, 2013 at 3:02 PM, Andy Schlaikjer
>> >> > <[email protected]> wrote:
>> >> > > Colum, thank you for passing on details. Could you also share with us
>> >> > > the version of Pig you are running? I assume you're running in local
>> >> > > mode with the example data below?
>> >> > >
>> >> > >
>> >> > > On Mar 4, 2013, at 3:19 AM, Colum Foley <[email protected]> wrote:
>> >> > >
>> >> > >> Hi Andy, Ted,
>> >> > >>
>> >> > >> Thank you both for replying. Below I will describe the input data, the
>> >> > >> Pig script I am using, and the resulting output.
>> >> > >>
>> >> > >> -Input data is the following (in file 'vectorsPigStored.dat' ):
>> >> > >>
>> >> > >> bbb    (2,{(6595,4.0),(608,1.0)})
>> >> > >> ccd    (1,{(9763,1.0)})
>> >> > >> adc    (1,{(3670,1.0)})
>> >> > >> ads    (1,{(2297,1.0)})
>> >> > >>
>> >> > >>
>> >> > >> -The full Pig script I am running is as follows:
>> >> > >>
>> >> > >>
>> >> > >> REGISTER 'elephant-bird-core-3.0.7.jar'
>> >> > >> REGISTER 'elephant-bird-pig-3.0.7.jar'
>> >> > >> REGISTER 'elephant-bird-mahout-3.0.7.jar'
>> >> > >> REGISTER 'mahout-core-0.7.jar'
>> >> > >> REGISTER 'mahout-math-0.7.jar'
>> >> > >>
>> >> > >> pair = LOAD 'vectorsPigStored.dat' AS (key: chararray, val:
>> >> > >> (cardinality: int, entries: {entry: (index: int, value: double)}));
>> >> > >> --Store output
>> >> > >> store pair into 'output' using
>> >> > >> com.twitter.elephantbird.pig.store.SequenceFileStorage (
>> >> > >>   '-c com.twitter.elephantbird.pig.util.TextConverter',
>> >> > >>   '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>> >> > >> );
>> >> > >> --Store output without params for comparison
>> >> > >> store pair into 'outputRaw' using
>> >> > >> com.twitter.elephantbird.pig.store.SequenceFileStorage ();
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> -Here is the output that I see, printed line by line; below it are the
>> >> > >> key and value classes (read with SequenceFile.Reader, e.g. reader.getKeyClass())
>> >> > >>
>> >> > >> -- from 'output'
>> >> > >> bbb  {}
>> >> > >> ccd  {}
>> >> > >> adc  {}
>> >> > >> ads  {}
>> >> > >> class org.apache.hadoop.io.Text    class org.apache.mahout.math.VectorWritable
>> >> > >>
>> >> > >> --from 'outputRaw'
>> >> > >> bbb  (2,{(6595,4.0),(608,1.0)})
>> >> > >> ccd  (1,{(9763,1.0)})
>> >> > >> adc  (1,{(3670,1.0)})
>> >> > >> ads  (1,{(2297,1.0)})
>> >> > >> class org.apache.hadoop.io.Text    class org.apache.hadoop.io.Text
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> **Just to confirm that the issue wasn't with my use of chararray keys
>> >> > >> (instead of integer keys), I also tried a run using int keys, but
>> >> > >> the result is the same:
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> --Output when using SequenceFileStorage with params  '-c
>> >> > >> com.twitter.elephantbird.pig.util.IntWritableConverter','-c
>> >> > >> com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>> >> > >>
>> >> > >> 1  {}
>> >> > >> 1  {}
>> >> > >> 1  {}
>> >> > >> 1  {}
>> >> > >> 1  {}
>> >> > >> class org.apache.hadoop.io.IntWritable    class
>> >> > >> org.apache.mahout.math.VectorWritable
>> >> > >>
>> >> > >> --Output from SequenceFileStorage without params
>> >> > >> 1  (2,{(6595,4.0),(608,1.0)})
>> >> > >> 1  (1,{(9763,1.0)})
>> >> > >> 1  (1,{(3670,1.0)})
>> >> > >> 1  (1,{(2297,1.0)})
>> >> > >> class org.apache.hadoop.io.Text    class org.apache.hadoop.io.Text
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> Any help greatly appreciated,
>> >> > >>
>> >> > >> Thanks again,
>> >> > >> Colum
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> On Fri, Mar 1, 2013 at 6:45 PM, Ted Dunning <[email protected]> wrote:
>> >> > >>> Andy,
>> >> > >>>
>> >> > >>> Thanks for popping up!
>> >> > >>>
>> >> > >>> Elephant bird looks like it has awesome potential to make machine learning
>> >> > >>> with Hadoop vastly easier.  It is really good to see this kind of response
>> >> > >>> ... that is what turns potential into action.
>> >> > >>>
>> >> > >>> Thanks again.
>> >> > >>>
>> >> > >>> On Fri, Mar 1, 2013 at 9:59 AM, Andy Schlaikjer <
>> >>
>> >>
>> >>
>> >> --
>> >>
>> >>   -jake
>> >>
>>
