Hi everyone, I’m trying to build a classifier that uses documents taken from a Lucene index as its training input.
Following the wiki and the examples, I understood I need to do the following:

Step 1) Transform the documents in the Lucene index into vector format, as in https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
Step 2) Use the transformed data to train a model
Step 3) Use the model to classify new documents

The problem is that I don’t know how to get from Step 1 to Step 2. The trainer needs formatted files (“One doc per line, first entry on the line is the label, rest is the evidence”), while the Driver from Step 1 creates one file containing a term dictionary and another whose contents start like this, followed by unreadable characters:

SEQ_!org.apache.hadoop.io.LongWritable%org.apache.mahout.math.VectorWritable__*org.apache.hadoop.io.compress.DefaultCodec

I guess there are some steps I’m missing, or I’m doing something wrong. My idea would be to read the documents in the Lucene index and use one of the fields as the label (e.g. Category: a document with category “sport” is labeled “sport” in the training set).

Thanks for your help,
Claudia
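
P.S. To make the idea concrete, here is a rough sketch of the output I imagine producing for the trainer. It is only an illustration and leaves out the Lucene reading part entirely: the class name, the helper toTrainingLine, the sample documents, and the choice of a tab between label and evidence are all my own assumptions, not something taken from the Mahout documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: turns (category, text) pairs into "label<TAB>evidence" lines.
// In the real setup the pairs would come from stored fields of the Lucene index.
public class TrainingLineSketch {

    // Hypothetical helper: one document in, one training line out.
    // Assumption: label and evidence are separated by a tab character.
    static String toTrainingLine(String label, String text) {
        // Lowercase, then collapse every run of non-letter, non-digit characters
        // into a single space so the evidence stays on one line.
        String evidence = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}]+", " ")
                .trim();
        return label.trim() + "\t" + evidence;
    }

    // Each element of docs is {category, body}; documents without a category are skipped.
    static List<String> toTrainingLines(List<String[]> docs) {
        List<String> lines = new ArrayList<>();
        for (String[] doc : docs) {
            String category = doc[0];
            if (category == null || category.trim().isEmpty()) {
                continue; // no label, so it cannot be used for training
            }
            lines.add(toTrainingLine(category, doc[1]));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample documents standing in for the index contents.
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"sport", "The match ended 2-1!"});
        docs.add(new String[] {"", "This one has no category."});
        docs.add(new String[] {"politics", "Parliament votes\non the new budget."});

        for (String line : toTrainingLines(docs)) {
            System.out.println(line);
        }
    }
}
```

If this is roughly the right format, what I still need is the correct way to pull the category and the text out of the index for each document.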