Re: PCA to improve classification performances

Dmitriy Lyubimov Mon, 10 Mar 2014 09:57:06 -0700

Ok, it's just FYI as you build out your pipelines.

FYI there's a bit of inconsistency between the DRM-based methods in Mahout.
Some methods require Int row keys, some don't. Then again, some also rely on
the names of a NamedVector, and some don't.


PCA/SSVD propagates BOTH keys from the sequence file AND names in NamedVectors
if present. But it doesn't do things such as propagating names into keys or
vice versa, which some people have found they need, depending on the
preceding and succeeding methods in their pipelines.
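
For the case discussed further down this thread (rowId replaces text keys with integers, and trainnb/testnb want the text keys back), the re-keying step can be sketched roughly as below. This is only an illustration in plain Python: lists of tuples stand in for the docIndex and the IntWritable/VectorWritable sequence file, the function names `build_doc_dictionary` and `rekey_rows` are made up for this sketch, and the sample keys and values are invented. A real utility would read and write Hadoop sequence files instead.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Vector = List[float]


def build_doc_dictionary(doc_index: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Step (a): map each integer row id to its original text key."""
    return {row_id: name for row_id, name in doc_index}


def rekey_rows(
    rows: Iterable[Tuple[int, Vector]], doc_names: Dict[int, str]
) -> Iterator[Tuple[str, Vector]]:
    """Step (b): swap each integer key for the original text key."""
    for row_id, vector in rows:
        if row_id not in doc_names:
            raise KeyError(f"row id {row_id} is missing from docIndex")
        yield doc_names[row_id], vector


# Invented stand-ins for the docIndex and the reduced matrix (assumptions,
# not real Mahout output).
doc_index = [(0, "1/1"), (1, "0/0"), (2, "1/1")]
reduced = [(0, [0.12, -0.40]), (1, [-0.33, 0.08]), (2, [0.25, 0.19])]

rekeyed = list(rekey_rows(reduced, build_doc_dictionary(doc_index)))
for key, vector in rekeyed:
    print(key, vector)
```

If the docIndex is small enough to hold in memory, the same lookup could presumably run inside each mapper so the re-keying stays parallel; that is an assumption about the data size, not something established in this thread.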

-d


On Mon, Mar 10, 2014 at 8:41 AM, Kevin Moulart <[email protected]> wrote:

> Yes, but rowId transforms my dataset into an index that maps keys
> like 0, 1, 2... to my actual keys, plus a sequence file indexed by these
> new integer keys.
>
> Then pca/ssvd comes in and outputs a reduced matrix (as a sequence file using
> the same keys it found in the input file, which are the IntWritables I got
> from rowId).
>
> What I need for trainnb and testnb is a sequence file that combines the
> matrix produced by pca with the index created by rowId, but I can't find a
> way to recombine them into a sequence file in a parallel fashion.
>
> Kévin Moulart
>
>
> 2014-03-10 15:48 GMT+01:00 Dmitriy Lyubimov <[email protected]>:
>
> > PCA and SSVD propagate the exact row keys given in the input. If you give
> > them text keys, U and Usigma will have text keys. They don't change that.
> > On Mar 10, 2014 3:39 AM, "Kevin Moulart" <[email protected]> wrote:
> >
> > > Hi and thanks, I'll try that, but I'd like to do so using a mapreduce job
> > > to improve performance.
> > >
> > > I'm using PCA to reduce the dimension of the dataset, both to
> > > improve its relevance (with 1600+ variables, many of them are correlated)
> > > and to improve the performance of the classification algorithm used.
> > >
> > >
> > >
> > > Kévin Moulart
> > >
> > >
> > > 2014-03-10 9:45 GMT+01:00 Suneel Marthi <[email protected]>:
> > >
> > > >
> > > >
> > > >
> > > >   On Monday, March 10, 2014 4:21 AM, Kevin Moulart <[email protected]> wrote:
> > > >
> > > > It's not clear to me from ur description as to the exact sequence of steps
> > > > u r running thru, but an SSVD job requires a matrix as input (not a
> > > > sequencefile of <Text, VectorWritables>).
> > > > When u try running a seqdumper on ur SSVD output do u see anything?
> > > >
> > > >
> > > > I see a Sequence File Text/VectorWritable with my original keys, and 99
> > > > values for each element in my original dataset.
> > > >
> > > > The next step after u create ur sequencefiles of Vectors would be to run
> > > > the rowId job to generate a matrix and docIndex.
> > > >
> > > > This matrix needs to be the input to SSVD (for dimensional reduction),
> > > >
> > > >
> > > > Ok so I tried that and indeed the SSVD accepts the matrix as input and
> > > > gives me a Sequence File IntWritable/VectorWritable.
> > > >
> > > >
> > > > followed by train Naive Bayes and test Naive Bayes.
> > > >
> > > >
> > > > Here it doesn't work anymore: NB wants a Sequence File
> > > > Text/VectorWritable, and it won't take the one created above.
> > > > Is there a counterpart to rowId that takes a matrix and docIndex and
> > > > outputs the SequenceFile?
> > > >
> > > > >> Hmm... not that I know of. You are gonna have to write a utility that
> > > > reads docIndex and <IntWritable, VectorWritable> as inputs.
> > > >      a)  Create a dictionary of documentId -> documentName from docIndex
> > > >      b)
> > > >          (i) Read each Pair<IntWritable, VectorWritable> from the
> > > >              SequenceFile<IntWritable, VectorWritable>,
> > > >          (ii) for each pair, read the key <IntWritable> and value
> > > >               <VectorWritable> {
> > > >                   replace the key with the corresponding DocumentName
> > > >                   <Text> from the dictionary in (a)
> > > >                   SequenceFile.Writer.append(Text, VectorWritable)
> > > >               }
> > > >
> > > >    Question: I might have missed it, but what's the reason again u r
> > > > calling PCA before running TrainNaiveBayes?
> > > >
> > > >    If others have better ideas, please feel free to comment.
> > > >
> > > >
> > > > Kévin Moulart
> > > >
> > > >
> > > > 2014-03-07 16:23 GMT+01:00 Suneel Marthi <[email protected]>:
> > > >
> > > > It's not clear to me from ur description as to the exact sequence of steps
> > > > u r running thru, but an SSVD job requires a matrix as input (not a
> > > > sequencefile of <Text, VectorWritables>).
> > > >
> > > > When u try running a seqdumper on ur SSVD output do u see anything?
> > > >
> > > > The next step after u create ur sequencefiles of Vectors would be to run
> > > > the rowId job to generate a matrix and docIndex.
> > > >
> > > > This matrix needs to be the input to SSVD (for dimensional reduction),
> > > > followed by train Naive Bayes and test Naive Bayes.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Friday, March 7, 2014 10:01 AM, Kevin Moulart <[email protected]> wrote:
> > > >
> > > > Hi again,
> > > >
> > > > I'm now using Mahout 0.9, and I'm trying to use PCA (via the SSVD) to
> > > > reduce the dimension of a dataset from 1600+ features to ~100 and then to
> > > > use the reduced dataset to train a naive bayes model and test it.
> > > >
> > > > So here is my workflow :
> > > >
> > > >    - Transform my CSV into a SequenceFile with
> > > >
> > > > key = class as Text (with a "/" in it to be accepted by NaiveBayes, so in
> > > > the form "class/class") using a custom job in MapReduce.
> > > >
> > > > value = features as VectorWritable
> > > >
> > > >    - Use mahout command line to reduce the dimension of the dataset :
> > > >
> > > > mahout ssvd -i /user/myCompany/Echant/echant100k.seq -o
> > > > /user/myCompany/Echant/echant100k_red.seq --rank 100 -us -V false -U true
> > > > -pca -ow -t 3
> > > >
> > > > ==> Here I get, if I understand things correctly, U, being the reduced
> > > > dataset.
> > > >
> > > >    - Use mahout command line to train the NaiveBayes model :
> > > >
> > > > mahout trainnb -i /user/myCompany/Echant/echant100k_red.seq/U -o
> > > > /user/myCompany/Echant/echant100k_red.model -l 0,1
> > > > -li /user/myCompany/Echant/labelIndex100k_red -ow
> > > >
> > > >
> > > >    - Use mahout command line to test the generated model :
> > > >
> > > > mahout testnb
> > > > -i /user/myCompany/Echant/echant100k_red.seq/U --model
> > > > /user/myCompany/Echant/echant100k_red.model -ow
> > > > -o /user/myCompany/Echant/predicted_echant100k --labelIndex
> > > > /user/myCompany/Echant/labelIndex100k_red
> > > >
> > > > (Here I test with the same dataset, but I should try with other datasets
> > > > as well once it runs smoothly)
> > > >
> > > > Here is my problem: everything seems to work quite well until I test my
> > > > model, but then the output is full of NaN:
> > > >
> > > >
> > > > Key: 1: Value: {0:NaN,1:NaN}
> > > > Key: 1: Value: {0:NaN,1:NaN}
> > > > Key: 0: Value: {0:NaN,1:NaN}
> > > > Key: 0: Value: {0:NaN,1:NaN}
> > > > Key: 1: Value: {0:NaN,1:NaN}
> > > > Key: 0: Value: {0:NaN,1:NaN}
> > > > Key: 1: Value: {0:NaN,1:NaN}
> > > > Key: 0: Value: {0:NaN,1:NaN}
> > > > Key: 0: Value: {0:NaN,1:NaN}
> > > > Key: 0: Value: {0:NaN,1:NaN}
> > > > Key: 1: Value: {0:NaN,1:NaN}
> > > >
> > > >
> > > > I already have the same problem when training and testing with the full
> > > > dataset, but there about 15% of the data still has values in the output
> > > > and gets predicted, the rest being NaN and unpredicted.
> > > >
> > > > Could you help me see what could be causing that ?
> > > >
> > > > Thanks in advance
> > > > Bests,
> > > >
> > > > Kévin Moulart
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>
