It's not clear to me from your description exactly which sequence of steps you are running, but an SSVD job requires a matrix as input (not a SequenceFile of <Text, VectorWritable>).
When you run seqdumper on your SSVD output, do you see anything? The next step after you create your SequenceFiles of Vectors would be to run the rowId job to generate a matrix and a docIndex. This matrix needs to be the input to SSVD (for dimensionality reduction), followed by training and testing Naive Bayes.

On Friday, March 7, 2014 10:01 AM, Kevin Moulart <[email protected]> wrote:

Hi again,

I'm now using Mahout 0.9, and I'm trying to use PCA (via SSVD) to reduce the dimension of a dataset from 1600+ features to ~100, and then to use the reduced dataset to train a Naive Bayes model and test it.

So here is my workflow:

- Transform my CSV into a SequenceFile using a custom MapReduce job:
    key = class as Text (with a "/" in it to be accepted by NaiveBayes, so in the form "class/class")
    value = features as VectorWritable

- Use the mahout command line to reduce the dimension of the dataset:
    mahout ssvd -i /user/myCompny/Echant/echant100k.seq -o /user/myCompany/Echant/echant100k_red.seq --rank 100 -us -V false -U true -pca -ow -t 3
  ==> Here I get - if I understand things correctly - U, being the reduced dataset.
- Use the mahout command line to train the NaiveBayes model:
    mahout trainnb -i /user/myCompany/Echant/echant100k_red.seq/U -o /user/myCompany/Echant/echant100k_red.model -l 0,1 -li /user/myCompany/Echant/labelIndex100k_red -ow

- Use the mahout command line to test the generated model:
    mahout testnb -i /user/myCompany/Echant/echant100k_red.seq/U --model /user/myCompany/Echant/echant100k_red.model -ow -o /user/myCompany/Echant/predicted_echant100k --labelIndex /user/myCompany/Echant/labelIndex100k_red
  (Here I test with the same dataset, but I should try with other datasets as well once it runs smoothly.)

Here is my problem: everything seems to work quite well until I test my model. The output is full of NaN:

Key: 1: Value: {0:NaN,1:NaN}
Key: 1: Value: {0:NaN,1:NaN}
Key: 0: Value: {0:NaN,1:NaN}
Key: 0: Value: {0:NaN,1:NaN}
Key: 1: Value: {0:NaN,1:NaN}
Key: 0: Value: {0:NaN,1:NaN}
Key: 1: Value: {0:NaN,1:NaN}
Key: 0: Value: {0:NaN,1:NaN}
Key: 0: Value: {0:NaN,1:NaN}
Key: 0: Value: {0:NaN,1:NaN}
Key: 1: Value: {0:NaN,1:NaN}

I already have the same problem when training and testing with the full dataset, but there about 15% of the data still has values in the output and gets predicted, the rest being NaN and unpredicted.

Could you help me see what could be causing that?

Thanks in advance.

Best,
Kévin Moulart
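
For reference, the first two steps suggested in the reply at the top of this thread (run rowid to produce a matrix and docIndex, then pass that matrix to ssvd) might look roughly like the sketch below. This is only an illustration under assumptions: the output directory names echant100k_rowid and echant100k_red are hypothetical, the BASE path is taken from the original message, and the ssvd flags are copied verbatim from that message rather than verified. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: output directory names below are hypothetical placeholders.
# By default the commands are printed, not executed; set DRY_RUN=0 to run them.
BASE=${BASE:-/user/myCompany/Echant}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# 1. Convert the SequenceFile of vectors into a matrix plus a docIndex.
run mahout rowid -i "$BASE/echant100k.seq" -o "$BASE/echant100k_rowid"

# 2. Use the generated matrix as the SSVD input
#    (flags copied as-is from the original message).
run mahout ssvd -i "$BASE/echant100k_rowid/matrix" -o "$BASE/echant100k_red" \
  --rank 100 -us -V false -U true -pca -ow -t 3
```

Running seqdumper on the resulting U directory before moving on to trainnb and testnb is a cheap way to confirm the reduced vectors look sane. Note that rowid replaces the original keys with integer row ids and keeps the mapping back to them in docIndex.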
