Hi everyone,

I’m trying to build a classifier that uses documents taken from a Lucene index 
as its training input.

Following the wiki and the examples, I understand that I need to do the following:

 

Step 1) Transform the documents in the Lucene index into vector format, as in 
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

Step 2) Use the transformed data to train a model

Step 3) Use the model to classify new documents

 

The problem is that I don’t know how to get from Step 1 to Step 2: the trainer 
needs files in a specific format (“One doc per line, first entry on the line is 
the label, rest is the evidence”), while the Driver from Step 1 creates one file 
containing a term dictionary and another containing the following text:

 

SEQ org.apache.hadoop.io.LongWritable org.apache.mahout.math.VectorWritable org.apache.hadoop.io.compress.DefaultCodec
[binary data follows]

 

I guess there are some steps I’m missing or I’m doing something wrong.
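
In case it helps with the diagnosis: the readable part of that output looks to 
me like the header of a Hadoop SequenceFile (the bytes SEQ followed by the key 
and value class names), so the file is presumably binary and not meant to be 
read as text. Below is a rough sketch, in plain Java without Hadoop on the 
classpath, of how I understand such a header can be inspected. It assumes the 
layout is "SEQ", one version byte, then each class name stored as a one-byte 
length followed by UTF-8 bytes. The class SeqHeaderPeek, its methods and the 
version value 6 are my own illustrative choices, not part of Mahout or Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch only: peeks at the start of a SequenceFile-style header held in memory.
final class SeqHeaderPeek {

    // Holds the three header fields this sketch extracts.
    static final class Header {
        final int version;
        final String keyClass;
        final String valueClass;

        Header(int version, String keyClass, String valueClass) {
            this.version = version;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }
    }

    // Reads a name stored as a length byte followed by that many UTF-8 bytes.
    // Assumption: the name is shorter than 128 bytes, so the length is one byte.
    private static String readName(DataInputStream in) throws IOException {
        int len = in.readByte();
        if (len < 0) {
            throw new IOException("multi-byte length, not handled in this sketch");
        }
        byte[] buf = new byte[len];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    // Parses the magic bytes, the version byte and the key/value class names.
    static Header peek(byte[] firstBytes) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(firstBytes))) {
            if (in.readByte() != 'S' || in.readByte() != 'E' || in.readByte() != 'Q') {
                throw new IllegalArgumentException("not a SequenceFile");
            }
            int version = in.readByte();
            return new Header(version, readName(in), readName(in));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds an in-memory header using the class names from my output (demo data).
    static byte[] sampleHeader() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write('S');
        out.write('E');
        out.write('Q');
        out.write(6); // assumed format version
        String[] names = {
            "org.apache.hadoop.io.LongWritable",
            "org.apache.mahout.math.VectorWritable"
        };
        for (String name : names) {
            byte[] b = name.getBytes(StandardCharsets.UTF_8);
            out.write(b.length);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Header h = peek(sampleHeader());
        System.out.println(h.keyClass + " -> " + h.valueClass);
    }
}
```

With a real file, the first few hundred bytes would be passed to peek() instead 
of the sample header. If my reading is right, the vectors themselves are stored 
as LongWritable/VectorWritable pairs after this header, which would explain why 
the rest of the file looks like garbage in a text viewer.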

My idea would be to read the documents in the Lucene index and use one of the 
fields as the label (e.g. Category: a document with category “sport” is labeled 
“sport” in the training set).
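
To make the idea concrete, here is a rough sketch of the conversion I have in 
mind, producing the "one doc per line, label first" format. Each Lucene 
document is represented as a plain Map of stored field values, because I have 
left out the actual index reading. The field names "category" and "body", the 
tab separator between label and evidence, and the class LabeledLineSketch are 
assumptions of mine, not something taken from the Mahout documentation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: turns documents (field name -> stored value) into training lines.
final class LabeledLineSketch {

    // Formats one document as "label<TAB>evidence".
    // Returns null when the label or text field is missing, so callers can skip it.
    static String toTrainingLine(Map<String, String> doc,
                                 String labelField, String textField) {
        String label = doc.get(labelField);
        String text = doc.get(textField);
        if (label == null || text == null) {
            return null;
        }
        // Collapse newlines and repeated whitespace so the document stays on one line.
        String evidence = text.trim().replaceAll("\\s+", " ");
        return label.trim() + "\t" + evidence;
    }

    // Converts all documents, silently dropping those without a label or text.
    static List<String> toTrainingLines(List<Map<String, String>> docs,
                                        String labelField, String textField) {
        List<String> lines = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            String line = toTrainingLine(doc, labelField, textField);
            if (line != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical documents standing in for what would come out of the index.
        Map<String, String> sportDoc = new LinkedHashMap<>();
        sportDoc.put("category", "sport");
        sportDoc.put("body", "Great  match\nlast night");

        Map<String, String> unlabeled = new LinkedHashMap<>();
        unlabeled.put("body", "No category here");

        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(sportDoc);
        docs.add(unlabeled);

        for (String line : toTrainingLines(docs, "category", "body")) {
            System.out.println(line);
        }
    }
}
```

What I don’t know is whether writing such text files myself is the intended 
route, or whether the vectors from Step 1 can be given a label directly.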

 

Thanks for your help

Claudia

 
