Re: PCA to improve classification performances

Kevin Moulart Mon, 10 Mar 2014 08:42:34 -0700

Yes, but rowId transforms my dataset into two things: an index that maps keys
like 0, 1, 2... to my actual keys, and a sequence file indexed by these
new integer keys.


Then pca/ssvd comes in and outputs a reduced matrix (as a sequence file using
the same keys it found in the input file, which are the IntWritables I got
from rowId).

What I need for trainnb and testnb is a sequence file that combines the
matrix given by pca with the index created by rowId, but I can't find a
way to recombine them into a sequence file in a parallel fashion.
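
One way to picture this recombination is as a map-side join: each mapper holds the docIndex as an in-memory lookup and swaps every integer key for its original key, so no reduce step is needed. The Python sketch below only illustrates that logic on in-memory stand-ins. It is not Mahout or Hadoop code, and the names `load_doc_index` and `relabel`, the sample data, and the assumption that the docIndex fits in memory are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Vector = List[float]


def load_doc_index(entries: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Build the row-id -> original-key lookup from docIndex entries."""
    return dict(entries)


def relabel(doc_index: Dict[int, str],
            rows: Iterable[Tuple[int, Vector]]) -> Iterator[Tuple[str, Vector]]:
    """Map step: swap each integer row id for its original text key.

    Each row is handled independently of the others, which is what
    makes the step easy to run in parallel.
    """
    for row_id, vector in rows:
        yield doc_index[row_id], vector


# Hypothetical sample data standing in for the two sequence files:
# the docIndex (int -> "class/class" key) and the reduced matrix rows.
doc_index = load_doc_index([(0, "1/1"), (1, "0/0"), (2, "1/1")])
reduced_rows = [(0, [0.12, -0.40]), (1, [-0.33, 0.08]), (2, [0.51, 0.27])]

relabeled = list(relabel(doc_index, reduced_rows))
```

In a real job the lookup would presumably be made available to every mapper, and the relabeled pairs written back out as Text/VectorWritable, which is the form trainnb and testnb ask for.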

Kévin Moulart


2014-03-10 15:48 GMT+01:00 Dmitriy Lyubimov <[email protected]>:

> PCA and SSVD propagate the exact row keys given in the input. If you give them
> text keys, U and Usigma will have text keys. They don't change that.
> On Mar 10, 2014 3:39 AM, "Kevin Moulart" <[email protected]> wrote:
>
> > Hi and thanks, I'll try that, but I'd like to do so using a MapReduce job
> > to improve performance.
> >
> > I'm using PCA as a way to reduce the dimension of the dataset, both to
> > improve its relevance (with 1600+ variables, many of them are correlated)
> > and to improve the performance of the classification algorithm used.
> >
> >
> >
> > Kévin Moulart
> >
> >
> > 2014-03-10 9:45 GMT+01:00 Suneel Marthi <[email protected]>:
> >
> > >
> > >
> > >
> > > On Monday, March 10, 2014 4:21 AM, Kevin Moulart <[email protected]> wrote:
> > >
> > > It's not clear to me from your description what exact sequence of steps
> > > you are running through, but an SSVD job requires a matrix as input (not a
> > > sequence file of <Text, VectorWritable>).
> > > When you run seqdumper on your SSVD output, do you see anything?
> > >
> > >
> > > I see a Sequence File Text/VectorWritable with my original keys, and 99
> > > values for each element in my original dataset.
> > >
> > > The next step after you create your sequence files of Vectors would be to
> > > run the rowId job to generate a matrix and docIndex.
> > >
> > > This matrix needs to be the input to SSVD (for dimensional reduction),
> > >
> > >
> > > Ok so I tried that and indeed the SSVD accepts the matrix as input and
> > > gives me a Sequence File IntWritable/VectorWritable.
> > >
> > >
> > > followed by train Naive Bayes and test Naive Bayes.
> > >
> > >
> > > Here it doesn't work anymore: NB wants a Sequence File
> > > Text/VectorWritable, and it won't take the one created above.
> > > Is there a counterpart to rowId that takes a matrix and docIndex and
> > > outputs the SequenceFile?
> > >
> > > >> Hmm... not that I know of. You are going to have to write a utility
> > > that reads docIndex and <IntWritable, VectorWritable> as inputs.
> > >      a)  Create a dictionary of documentId -> documentName from docIndex
> > >      b)
> > >          (i) Read each Pair<IntWritable, VectorWritable> from the
> > >              SequenceFile<IntWritable, VectorWritable>
> > >          (ii) For each pair, read the key <IntWritable> and value
> > >               <VectorWritable> {
> > >                   replace the key with the corresponding documentName
> > >                   <Text> from the dictionary in (a)
> > >                   SequenceFile.Writer.append(Text, VectorWritable)
> > >               }
> > >
> > >    Question: I might have missed it, but what's the reason again that you
> > > are calling PCA before running TrainNaiveBayes?
> > >
> > >    If others have better ideas, please feel free to comment.
> > >
> > >
> > > Kévin Moulart
> > >
> > >
> > > 2014-03-07 16:23 GMT+01:00 Suneel Marthi <[email protected]>:
> > >
> > > It's not clear to me from your description what exact sequence of steps
> > > you are running through, but an SSVD job requires a matrix as input (not a
> > > sequence file of <Text, VectorWritable>).
> > >
> > > When you run seqdumper on your SSVD output, do you see anything?
> > >
> > > The next step after you create your sequence files of Vectors would be to
> > > run the rowId job to generate a matrix and docIndex.
> > >
> > > This matrix needs to be the input to SSVD (for dimensional reduction),
> > > followed by train Naive Bayes and test Naive Bayes.
> > >
> > >
> > >
> > >
> > >
> > > On Friday, March 7, 2014 10:01 AM, Kevin Moulart <[email protected]> wrote:
> > >
> > > Hi again,
> > >
> > > I'm now using Mahout 0.9, and I'm trying to use PCA (via the SSVD) to
> > > reduce the dimension of a dataset from 1600+ features to ~100 and then to
> > > use the reduced dataset to train a naive Bayes model and test it.
> > >
> > > So here is my workflow :
> > >
> > >    - Transform my CSV into a SequenceFile with
> > >
> > > key = class as Text (with a "/" in it to be accepted by NaiveBayes, so in
> > > the form "class/class") using a custom job in MapReduce.
> > >
> > > value = features as VectorWritable
> > >
> > >    - Use mahout command line to reduce the dimension of the dataset :
> > >
> > > mahout ssvd -i /user/myCompany/Echant/echant100k.seq -o
> > > /user/myCompany/Echant/echant100k_red.seq --rank 100 -us -V false -U true
> > > -pca -ow -t 3
> > >
> > > ==> Here I get, if I understand things correctly, U, which is the reduced
> > > dataset.
> > >
> > >    - Use mahout command line to train the NaiveBayes model :
> > >
> > > mahout trainnb -i /user/myCompany/Echant/echant100k_red.seq/U -o
> > > /user/myCompany/Echant/echant100k_red.model -l 0,1
> > > -li /user/myCompany/Echant/labelIndex100k_red -ow
> > >
> > >
> > >    - Use mahout command line to test the generated model :
> > >
> > > mahout testnb
> > > -i /user/myCompany/Echant/echant100k_red.seq/U --model
> > > /user/myCompany/Echant/echant100k_red.model -ow
> > > -o /user/myCompany/Echant/predicted_echant100k --labelIndex
> > > /user/myCompany/Echant/labelIndex100k_red
> > >
> > > (Here I test with the same dataset, but I should try with other datasets
> > > as well once it runs smoothly)
> > >
> > > Here is my problem, everything seems to work quite well until I test my
> > > model : the output is full of NaN :
> > >
> > >
> > > Key: 1: Value: {0:NaN,1:NaN}
> > > Key: 1: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 1: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 1: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 1: Value: {0:NaN,1:NaN}
> > >
> > >
> > > I already have the same problem when training and testing with the full
> > > dataset but there, about 15% of the data still has values in output and
> > > gets predicted, the rest being NaN and unpredicted.
> > >
> > > Could you help me see what could be causing that ?
> > >
> > > Thanks in advance
> > > Bests,
> > >
> > > Kévin Moulart
> > >
> > >
> > >
> > >
> > >
> >
>
