3) Database import
The most generic way is to use a Hadoop file reader that queries a
database directly, but I don't know how to help you there.
In classify-20newsgroups.sh, the first stage is:
echo "Creating sequence files from 20newsgroups data"
./bin/mahout seqdirectory \
-i ${WORK_DIR}/20news-all \
-o ${WORK_DIR}/20news-seq
You need to replace this with something that reads the database and
creates Hadoop sequence files in (Text, Text) format, where the key is
a unique name for the document and the value is the text of the
document (a sketch of such a writer follows the next snippet). The next
step in the script turns the text into term vectors. You do not have to
change anything after the above snippet.
echo "Converting sequence files to vectors"
./bin/mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
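Here is a minimal sketch of the replacement step, assuming a
JDBC-accessible database. The connection URL, credentials, table name,
and column names (articles, id, body) are placeholders for illustration,
not anything Mahout defines. It writes one (Text, Text) pair per row
into a chunk file in the directory that seq2sparse reads:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Writes (Text, Text) pairs -- document name, document text -- so that
// seq2sparse can consume the output exactly as it consumes seqdirectory's.
public class DbToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Output file, e.g. ${WORK_DIR}/20news-seq/chunk-0
    Path output = new Path(args[0]);

    // Placeholder JDBC URL, credentials, and query; adjust to your schema.
    Connection db =
        DriverManager.getConnection("jdbc:mysql://localhost/news", "user", "pass");
    Statement stmt = db.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT id, body FROM articles");

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, Text.class, Text.class);
    try {
      Text key = new Text();
      Text value = new Text();
      while (rs.next()) {
        key.set(rs.getString("id"));     // unique document name
        value.set(rs.getString("body")); // full document text
        writer.append(key, value);
      }
    } finally {
      writer.close();
      rs.close();
      stmt.close();
      db.close();
    }
  }
}

Point the -i option of seq2sparse at the directory containing the
file(s) this writes and the rest of the script runs unchanged.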
On Thu, Aug 30, 2012 at 2:30 AM, Salman Mahmood <[email protected]> wrote:
> I have a few questions about classification in Mahout:
>
> 1) It is said that SGD is suited to small data sets and Naive Bayes to
> medium-sized ones. I also learned that the Bayes classifier is meant for
> textual data rather than continuous data.
> I am classifying around 10,000 news articles, and they are all textual
> (no continuous variables are used for classification). In my opinion the
> data set is small, so should I use SGD or Naive Bayes? (given that the
> data is textual)
>
> 2) Since multi-labeling is not supported in Mahout, I generated around 4,000
> binary models using SGD. This way I know whether a particular news item
> belongs to one or more classes ("Apple sues Samsung" belongs to the classes
> "Apple" and "Samsung").
> The problem I am facing is performance. It takes around 4 minutes to
> classify one news item. Although the cost does not grow linearly with the
> number of items (it doesn't take 8 minutes to classify 2 news items, 12
> minutes for 3, and so on), I still want to improve the throughput. What I
> am doing is loading a particular model and classifying N news items, then
> loading the next model and classifying the N items again. With this
> approach it takes 16 minutes to classify 1,000 news items of 75-100 words
> each. Is there a way to improve this further? (One option I am considering
> is to use Hadoop for the classification task. Is that possible?)
>
> 3) Where can I find some good code examples/tutorials for training and
> testing a Mahout classifier using Naive Bayes? There are lots of examples
> on the net, but they all use the command line. I need the code for Naive
> Bayes because my dataset is not in files but in a database, and the
> command-line options read the dataset from files. The Mahout in Action
> book gives a good walkthrough of the SGD code, but not of Naive Bayes.
>
> Thanks!
--
Lance Norskog
[email protected]