Re: PCA to improve classification performances

Dmitriy Lyubimov Mon, 10 Mar 2014 07:49:09 -0700

PCA and SSVD propagate the exact row keys given in the input. If you give them
Text keys, U and Usigma will have Text keys. The job doesn't change that.
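
For anyone who did run rowId first and ended up with IntWritable keys, the relabelling utility outlined in the quoted message below amounts to a lookup join. Here is a minimal sketch in plain Python, with lists and a dict standing in for the docIndex and matrix SequenceFiles. The function name `restore_text_keys` and the sample data are hypothetical; a real utility would read and write through Hadoop's SequenceFile API instead.

```python
def restore_text_keys(doc_index, matrix_rows):
    """Swap integer row ids for their original text keys.

    doc_index   : iterable of (int_id, text_key) pairs, standing in for docIndex
    matrix_rows : iterable of (int_id, vector) pairs, standing in for the
                  IntWritable/VectorWritable output
    Yields (text_key, vector) pairs in the order the rows are read.
    """
    id_to_key = dict(doc_index)  # step (a): documentId -> documentName
    for row_id, vector in matrix_rows:  # step (b): stream the rows
        if row_id not in id_to_key:
            raise KeyError(f"row id {row_id} is missing from docIndex")
        yield id_to_key[row_id], vector


# Hypothetical sample data: "class/class"-style keys and 3-dimensional rows.
doc_index = [(0, "1/1"), (1, "0/0"), (2, "1/1")]
matrix_rows = [
    (0, [0.12, -0.40, 0.05]),
    (1, [-0.31, 0.22, 0.10]),
    (2, [0.07, 0.01, -0.09]),
]

relabelled = list(restore_text_keys(doc_index, matrix_rows))
for key, vector in relabelled:
    print(key, vector)
```

If the Text-keyed file is passed to ssvd directly, as described above, this step should not be needed at all.
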
On Mar 10, 2014 3:39 AM, "Kevin Moulart" wrote:


> Hi and thanks, I'll try that, but I'd like to do so using a MapReduce job
> to improve performance.
>
> I'm using PCA to reduce the dimension of the dataset, both to
> improve its relevance (with 1600+ variables, many of them are correlated)
> and to improve the performance of the classification algorithm used.
>
>
>
> Kévin Moulart
>
>
> 2014-03-10 9:45 GMT+01:00 Suneel Marthi:
>
> >
> >
> >
> >   On Monday, March 10, 2014 4:21 AM, Kevin Moulart wrote:
> >
> > It's not clear to me from your description what exact sequence of steps
> > you are running, but an SSVD job requires a matrix as input (not a
> > SequenceFile of <Text, VectorWritable>).
> > When you try running seqdumper on your SSVD output, do you see anything?
> >
> >
> > I see a SequenceFile Text/VectorWritable with my original keys, and 99
> > values for each element in my original dataset.
> >
> > The next step after you create your SequenceFiles of Vectors would be to
> > run the rowId job to generate a matrix and docIndex.
> >
> > This matrix needs to be the input to SSVD (for dimensional reduction),
> >
> >
> > Ok so I tried that and indeed the SSVD accepts the matrix as input and
> > gives me a Sequence File IntWritable/VectorWritable.
> >
> >
> > followed by train Naive Bayes and test Naive Bayes.
> >
> >
> > Here it doesn't work anymore: NB wants a SequenceFile
> > Text/VectorWritable, and it won't take the one created above.
> > Is there a counterpart to rowId that takes a matrix and docIndex and
> > outputs the SequenceFile?
> >
> > >> Hmm... not that I know of. You are going to have to write a utility
> > that reads docIndex and <IntWritable, VectorWritable> as inputs:
> >      a)  Create a dictionary of documentId -> documentName from docIndex.
> >      b)
> >          (i) Read each Pair<IntWritable, VectorWritable> from the
> > SequenceFile<IntWritable, VectorWritable>.
> >          (ii) For each pair, read the key <IntWritable> and value
> > <VectorWritable> {
> >                   replace the key with the corresponding documentName
> > <Text> from the dictionary in (a)
> >                   SequenceFile.Writer.append(Text, VectorWritable)
> >               }
> >
> >    Question: I might have missed it, but what's the reason again that
> > you are calling PCA before running TrainNaiveBayes?
> >
> >    If others have better ideas, please feel free to comment.
> >
> >
> > Kévin Moulart
> >
> >
> > 2014-03-07 16:23 GMT+01:00 Suneel Marthi:
> >
> > It's not clear to me from your description what exact sequence of steps
> > you are running, but an SSVD job requires a matrix as input (not a
> > SequenceFile of <Text, VectorWritable>).
> >
> > When you try running seqdumper on your SSVD output, do you see anything?
> >
> > The next step after you create your SequenceFiles of Vectors would be to
> > run the rowId job to generate a matrix and docIndex.
> >
> > This matrix needs to be the input to SSVD (for dimensional reduction),
> > followed by train Naive Bayes and test Naive Bayes.
> >
> >
> >
> >
> >
> > On Friday, March 7, 2014 10:01 AM, Kevin Moulart wrote:
> >
> > Hi again,
> >
> > I'm now using Mahout 0.9, and I'm trying to use PCA (via the SSVD) to
> > reduce the dimension of a dataset from 1600+ features to ~100 and then to
> > use the reduced dataset to train a naive Bayes model and test it.
> >
> > So here is my workflow :
> >
> >    - Transform my CSV into a SequenceFile with
> >
> > key = class as Text (with a "/" in it to be accepted by NaiveBayes, so in
> > the form "class/class") using a custom MapReduce job.
> >
> > value = features as VectorWritable
> >
> >    - Use mahout command line to reduce the dimension of the dataset :
> >
> > mahout ssvd -i /user/myCompny/Echant/echant100k.seq -o
> > /user/myCompany/Echant/echant100k_red.seq --rank 100 -us -V false -U true
> > -pca -ow -t 3
> >
> > ==> Here I get - if I understand things correctly - U, being the reduced
> > dataset.
> >
> >    - Use mahout command line to train the NaiveBayes model :
> >
> > mahout trainnb -i /user/myCompany/Echant/echant100k_red.seq/U -o
> > /user/myCompany/Echant/echant100k_red.model -l 0,1
> > -li /user/myCompany/Echant/labelIndex100k_red -ow
> >
> >
> >    - Use mahout command line to test the generated model :
> >
> > mahout testnb
> > -i /user/myCompany/Echant/echant100k_red.seq/U --model
> > /user/myCompany/Echant/echant100k_red.model -ow
> > -o /user/myCompany/Echant/predicted_echant100k --labelIndex
> > /user/myCompany/Echant/labelIndex100k_red
> >
> > (Here I test with the same dataset, but I should try with other datasets
> > as well once it runs smoothly)
> >
> > Here is my problem, everything seems to work quite well until I test my
> > model : the output is full of NaN :
> >
> >
> > Key: 1: Value: {0:NaN,1:NaN}
> > Key: 1: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 1: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 1: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 1: Value: {0:NaN,1:NaN}
> >
> >
> > I already have the same problem when training and testing with the full
> > dataset but there, about 15% of the data still has values in output and
> > gets predicted, the rest being NaN and unpredicted.
> >
> > Could you help me see what could be causing that ?
> >
> > Thanks in advance
> > Bests,
> >
> > Kévin Moulart
> >
> >
> >
> >
> >
>
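
One thing that may be worth checking for the NaN output, offered as an assumption and not something confirmed anywhere in this thread: multinomial Naive Bayes is designed for non-negative, count-like features, whereas PCA projections are centered and usually contain negative values. If a summed feature weight goes negative, a smoothed log-ratio can end up taking the log of a non-positive number. The sketch below is plain Python, not Mahout's code; `smoothed_log_weight`, `has_negative` and their parameters are hypothetical names used only to illustrate the effect.

```python
import math


def smoothed_log_weight(feature_sum, label_sum, num_features, alpha=1.0):
    """Log of a Laplace-smoothed ratio, returning NaN where the log is undefined.

    Illustrative only: this mirrors the general shape of a multinomial
    Naive Bayes weight, not any specific library's implementation.
    """
    ratio = (feature_sum + alpha) / (label_sum + alpha * num_features)
    # Python's math.log raises on non-positive input; return NaN instead so
    # the effect resembles what a NaN-propagating runtime would produce.
    return math.log(ratio) if ratio > 0 else float("nan")


def has_negative(rows):
    """True if any vector in rows contains a negative component."""
    return any(value < 0 for row in rows for value in row)


# Hypothetical per-label sums for a single feature.
count_like = smoothed_log_weight(feature_sum=3.0, label_sum=10.0, num_features=4)
pca_like = smoothed_log_weight(feature_sum=-2.5, label_sum=10.0, num_features=4)

print(math.isnan(count_like))  # False
print(math.isnan(pca_like))    # True
print(has_negative([[0.12, -0.40], [0.31, 0.22]]))  # True
```

The same check applies to the original 1600+ variables if any of them can be negative, which would be one possible explanation for the NaN values seen before PCA was introduced.
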
