Hi,
My problem is to classify text files with multiple attributes.
Some attributes are in text format, others are numeric (e.g. in CSV format).
I've started to work with Naive Bayes and SGD.

And now I have a few questions about each :) (sorry for the long mail, I'm trying to group all my questions to avoid posting dozens of messages)
_Naive Bayes_

- The *TrainClassifier* and *TestClassifier* classes take a data directory as input. If I work with CSV files (multiple attributes), how do I set up the algorithm for each attribute (e.g. the 3rd attribute is numeric, etc.)?

- I'm thinking about a way to make some data entries match several categories. Is this kind of thing possible with classification algorithms (e.g. Naive Bayes)? For example, if I want to tag news, some items could be both "international" and "politics" news.

- I've tried to classify 110,000 entries (after training on 440,000 entries) and Mahout fails with a Java heap space error, even with more than 2 GB of memory for the JVM.
Do I have a configuration issue or does it seem normal?

- I have good results on small data sets with Naive Bayes, better than the SVM and SGD tests I've done (I've tried many algorithms with Weka on my CSV files). The theory says that Naive Bayes only fits big data sets, so is it dangerous to choose it anyway for small data sets? For example, with 80 entries for training and 4 categories, I get a 90% success rate on my text files (with 40 entries in the test data). SVM gives a very bad score on this (~40%), and logistic regression ~60%.
So I'm very confused...
Maybe my entries are very simple for Naive Bayes, so it does not need a lot of data for training?
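
On the question of entries matching several categories: one common approach is to score each category independently (one binary classifier per category) and keep every category whose score clears a threshold. Below is a minimal sketch of that last step only, assuming the per-category scores are already available. The class name, the `tags` helper and the 0.5 threshold are hypothetical and are not part of Mahout.

```java
import java.util.ArrayList;
import java.util.List;

class MultiLabelTagger {
    // Returns every category whose score is at least the threshold,
    // so one entry can receive zero, one or several tags.
    static List<String> tags(String[] categories, double[] scores, double threshold) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < categories.length; i++) {
            if (scores[i] >= threshold) {
                result.add(categories[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] categories = {"international", "politics", "sports"};
        // Hypothetical scores for one news item, one per category.
        double[] scores = {0.8, 0.6, 0.1};
        System.out.println(tags(categories, scores, 0.5));
    }
}
```

The threshold is a tuning choice: a lower value yields more tags per entry, a higher value fewer.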


_SGD_

I worked on the basis of the *RunLogistic* class found in the Mahout examples.
As my examples have more than 2 categories, I used the classifyFull() method from *OnlineLogisticRegression* instead of classifyScalar(). For the evaluation of the model, I had to modify the *Auc* class, because it could only manage a matrix of 2 elements.

It works fine with small data sets, but now I have some strange results with bigger sets.
Maybe I've done it wrong...
So my question is: is there a way in Mahout to classify and test (and get metrics like AUC and a confusion matrix) with more than 2 categories without modifying the provided classes?
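
To make clear what kind of evaluation is meant, here is a minimal sketch of a multi-class confusion matrix in plain Java. It assumes each test entry comes with one score per category (the kind of vector classifyFull() produces, copied here into a `double[]`) and takes the highest-scoring category as the prediction. The class and method names are hypothetical and it does not use any Mahout class.

```java
import java.util.Arrays;

class MultiClassConfusion {
    // Index of the largest score, i.e. the predicted category.
    static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // Builds counts[actual][predicted] over all test entries.
    static int[][] confusion(double[][] scores, int[] actual, int numCategories) {
        int[][] counts = new int[numCategories][numCategories];
        for (int i = 0; i < actual.length; i++) {
            counts[actual[i]][argMax(scores[i])]++;
        }
        return counts;
    }

    // Fraction of entries on the diagonal (correctly classified).
    static double accuracy(int[][] counts) {
        int correct = 0;
        int total = 0;
        for (int r = 0; r < counts.length; r++) {
            for (int c = 0; c < counts[r].length; c++) {
                total += counts[r][c];
                if (r == c) {
                    correct += counts[r][c];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        // Hypothetical scores for 4 test entries and 3 categories.
        double[][] scores = {
            {0.7, 0.2, 0.1},
            {0.1, 0.8, 0.1},
            {0.3, 0.4, 0.3},
            {0.2, 0.2, 0.6}
        };
        int[] actual = {0, 1, 2, 2};
        int[][] counts = confusion(scores, actual, 3);
        for (int[] row : counts) {
            System.out.println(Arrays.toString(row));
        }
        System.out.println("accuracy = " + accuracy(counts));
    }
}
```

Rows are the actual categories and columns the predicted ones, so per-category precision and recall can be read from the same matrix.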

Thanks a lot!

Loic
