Yes, but rowId transforms my dataset into two things: an index that maps new keys like 0, 1, 2... to my actual keys, and a sequence file indexed by these new integer keys.
Then pca/ssvd comes in and outputs a reduced matrix (as a sequence file using the same keys it found in the input file, which are the IntWritables I got from rowId). What I need for trainnb and testnb is a sequence file that combines the matrix given by pca with the index created by rowId, but I can't find a way to recombine them into a sequence file in a parallel fashion.

Kévin Moulart

2014-03-10 15:48 GMT+01:00 Dmitriy Lyubimov:

> PCA and SSVD propagate the exact row keys given in the input. If you give
> them text keys, U and Usigma will have text keys. It doesn't change that.
> On Mar 10, 2014 3:39 AM, "Kevin Moulart" wrote:
>
> > Hi and thanks, I'll try that, but I'd like to do so using a mapreduce job
> > to improve performance.
> >
> > I'm using PCA as a way to reduce the dimension of the dataset both to
> > improve its relevance (with 1600+ variables, many of them are correlated)
> > and to improve the performance of the classification algorithm used.
> >
> > Kévin Moulart
> >
> > 2014-03-10 9:45 GMT+01:00 Suneel Marthi:
> >
> > > On Monday, March 10, 2014 4:21 AM, Kevin Moulart wrote:
> > >
> > > Its not clear to me from ur description as to the exact sequence of steps
> > > u r running thru, but an SSVD job requires a matrix as input (not a
> > > sequencefile of <Text, VectorWritables>).
> > > When u try running a seqdumper on ur SSVD output do u see anything?
> > >
> > > I see a Sequence File Text/VectorWritable with my original keys, and 99
> > > values for each element in my original dataset.
> > >
> > > The next step after u create ur sequencefiles of Vectors would be to run
> > > the rowId job to generate a matrix and docIndex.
> > >
> > > This matrix needs to be the input to SSVD (for dimensional reduction),
> > >
> > > Ok so I tried that and indeed the SSVD accepts the matrix as input and
> > > gives me a Sequence File IntWritable/VectorWritable.
> > >
> > > followed by train Naive Bayes and test Naive Bayes.
> > >
> > > Here it doesn't work anymore: the NB wants a Sequence File
> > > Text/VectorWritable, and it won't take the one created above.
> > > Is there a counterpart to rowId that takes a matrix and docIndex and
> > > outputs the SequenceFile?
> > >
> > > >> Hmm... not that I know of. You are gonna have to write a utility that
> > > reads docIndex and <IntWritable/VectorWritable> as inputs.
> > > a) Create a dictionary of documentId, documentName from docIndex
> > > b)
> > >    (i) Read the Pair<IntWritable, VectorWritable> from the
> > >        sequencefile<IntWritable, VectorWritable>,
> > >    (ii) for each pair, read the key <IntWritable> and value
> > >         <VectorWritable> {
> > >             replace each key with the corresponding DocumentName
> > >             <Text> from dictionary in (a)
> > >             SequenceFile.Writer.append(Text, VectorWritable)
> > >         }
> > >
> > > Question: I might have missed it but what's the reason again u r
> > > calling PCA before running TrainNaiveBayes?
> > >
> > > If others have better ideas, please feel free to comment.
> > >
> > > Kévin Moulart
> > >
> > > 2014-03-07 16:23 GMT+01:00 Suneel Marthi:
> > >
> > > Its not clear to me from ur description as to the exact sequence of steps
> > > u r running thru, but an SSVD job requires a matrix as input (not a
> > > sequencefile of <Text, VectorWritables>).
> > >
> > > When u try running a seqdumper on ur SSVD output do u see anything?
> > >
> > > The next step after u create ur sequencefiles of Vectors would be to run
> > > the rowId job to generate a matrix and docIndex.
> > >
> > > This matrix needs to be the input to SSVD (for dimensional reduction),
> > > followed by train Naive Bayes and test Naive Bayes.
> > >
> > > On Friday, March 7, 2014 10:01 AM, Kevin Moulart wrote:
> > >
> > > Hi again,
> > >
> > > I'm now using Mahout 0.9, and I'm trying to use PCA (via the SSVD) to
> > > reduce the dimension of a dataset from 1600+ features to ~100 and then to
> > > use the reduced dataset to train a naive bayes model and test it.
> > >
> > > So here is my workflow:
> > >
> > > - Transform my CSV into a SequenceFile using a custom job in MapReduce, with
> > >
> > >   key = class as Text (with a "/" in it to be accepted by NaiveBayes, so in
> > >   the form "class/class")
> > >
> > >   value = features as VectorWritable
> > >
> > > - Use mahout command line to reduce the dimension of the dataset:
> > >
> > >   mahout ssvd -i /user/myCompny/Echant/echant100k.seq -o /user/myCompany/Echant/echant100k_red.seq --rank 100 -us -V false -U true -pca -ow -t 3
> > >
> > >   ==> Here I get, if I understand things correctly, U, being the reduced
> > >   dataset.
> > >
> > > - Use mahout command line to train the NaiveBayes model:
> > >
> > >   mahout trainnb -i /user/myCompany/Echant/echant100k_red.seq/U -o /user/myCompany/Echant/echant100k_red.model -l 0,1 -li /user/myCompany/Echant/labelIndex100k_red -ow
> > >
> > > - Use mahout command line to test the generated model:
> > >
> > >   mahout testnb -i /user/myCompany/Echant/echant100k_red.seq/U --model /user/myCompany/Echant/echant100k_red.model -ow -o /user/myCompany/Echant/predicted_echant100k --labelIndex /user/myCompany/Echant/labelIndex100k_red
> > >
> > > (Here I test with the same dataset, but I should try with other datasets
> > > as well once it runs smoothly)
> > >
> > > Here is my problem: everything seems to work quite well until I test my
> > > model. The output is full of NaN:
> > >
> > > Key: 1: Value: {0:NaN,1:NaN}
> > > Key: 1: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 1: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 1: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 0: Value: {0:NaN,1:NaN}
> > > Key: 1: Value: {0:NaN,1:NaN}
> > >
> > > I already have the same problem when training and testing with the full
> > > dataset, but there about 15% of the data still has values in the output
> > > and gets predicted, the rest being NaN and unpredicted.
> > >
> > > Could you help me see what could be causing that?
> > >
> > > Thanks in advance
> > > Best,
> > >
> > > Kévin Moulart
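
For reference, the key-restoring utility outlined in the quoted reply (build a dictionary from docIndex, then replace each integer row key with the original Text key) could be sketched roughly as below. This is only an in-memory illustration of the join logic, not Mahout code: plain Java collections stand in for the two sequence files, and the names RestoreRowKeys and restoreKeys are hypothetical. It assumes docIndex maps each integer row id to the original key and that the reduced matrix is keyed by those same ids. Because the original keys are class labels such as "0/0" and repeat across rows, the result is a list of pairs rather than a map.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: in-memory stand-in for joining docIndex with the reduced matrix.
class RestoreRowKeys {

    // docIndex:    row id -> original key (stand-in for the docIndex sequence file)
    // reducedRows: row id -> reduced feature vector (stand-in for the SSVD/PCA output)
    // Returns one (original key, vector) pair per row, in the iteration order of reducedRows.
    static List<Map.Entry<String, double[]>> restoreKeys(
            Map<Integer, String> docIndex, Map<Integer, double[]> reducedRows) {
        List<Map.Entry<String, double[]>> out = new ArrayList<>(reducedRows.size());
        for (Map.Entry<Integer, double[]> row : reducedRows.entrySet()) {
            String originalKey = docIndex.get(row.getKey());
            if (originalKey == null) {
                // A row id missing from the index means the two inputs do not match.
                throw new IllegalStateException("No docIndex entry for row id " + row.getKey());
            }
            out.add(new AbstractMap.SimpleEntry<>(originalKey, row.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> docIndex = new HashMap<>();
        docIndex.put(0, "1/1");
        docIndex.put(1, "0/0");
        docIndex.put(2, "1/1"); // labels repeat, so they cannot serve as map keys

        Map<Integer, double[]> reducedRows = new LinkedHashMap<>();
        reducedRows.put(0, new double[] {0.12, -0.40});
        reducedRows.put(1, new double[] {0.05, 0.33});
        reducedRows.put(2, new double[] {-0.27, 0.08});

        for (Map.Entry<String, double[]> e : restoreKeys(docIndex, reducedRows)) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

On a cluster, the same lookup would presumably be done per record while reading the matrix and writing Text/VectorWritable pairs; for a parallel version, the docIndex side would have to be made available to every task, which this sketch does not attempt.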
