3) Database import
The most generic way is to use a Hadoop file reader that queries a
database directly, but I don't know how to help you there.
In classify-20newsgroups.sh, the first stage is:
echo "Creating sequence files from 20newsgroups data"
./bin/mahout seqdirectory \
-i ${WORK_DIR}/20news-all \
-o ${WORK_DIR}/20news-seq
You need to replace this with something that reads the database and
creates Hadoop sequence files in (Text, Text) format, where the key is
a unique name for the document and the value is the text of the
document (a sketch of such a writer follows the next snippet). The next
step in the script turns the text into term vectors. You do not have to
change anything after the above snippet.
echo "Converting sequence files to vectors"
./bin/mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
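Here is a minimal sketch of the replacement step, assuming a
JDBC-accessible database. The connection URL, credentials, table name,
and column names (articles, id, body) are placeholders for illustration,
not anything Mahout defines. It writes one (Text, Text) pair per row
into a chunk file in the directory that seq2sparse reads:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Writes (Text, Text) pairs -- document name, document text -- so that
// seq2sparse can consume the output exactly as it consumes seqdirectory's.
public class DbToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Output file, e.g. ${WORK_DIR}/20news-seq/chunk-0
    Path output = new Path(args[0]);

    // Placeholder JDBC URL, credentials, and query; adjust to your schema.
    Connection db =
        DriverManager.getConnection("jdbc:mysql://localhost/news", "user", "pass");
    Statement stmt = db.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT id, body FROM articles");

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, Text.class, Text.class);
    try {
      Text key = new Text();
      Text value = new Text();
      while (rs.next()) {
        key.set(rs.getString("id"));     // unique document name
        value.set(rs.getString("body")); // full document text
        writer.append(key, value);
      }
    } finally {
      writer.close();
      rs.close();
      stmt.close();
      db.close();
    }
  }
}

Point the -i option of seq2sparse at the directory containing the
file(s) this writes and the rest of the script runs unchanged.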
On Thu, Aug 30, 2012 at 2:30 AM, Salman Mahmood <[email protected]> wrote:
> I have a few questions about classification in Mahout:
>
> 1) It is said that SGD is suited to small data sets and Naive Bayes to
> medium-sized ones. I also learned that the Bayes classifier is meant for
> textual data rather than continuous data.
> I am classifying around 10,000 news articles, and they are all textual
> (no continuous variables are used for classification). In my opinion the
> data set is small, so should I use SGD or Naive Bayes? (given that the
> data is textual)
>
> 2) Since multi-labeling is not supported in Mahout, I generated around 4,000
> binary models using SGD. This way I know whether a particular news item
> belongs to one or more classes ("Apple sues Samsung" belongs to the classes
> "Apple" and "Samsung").
> The problem I am facing is performance. It takes around 4 minutes to
> classify one news item. Although the cost does not grow linearly with the
> number of items (it doesn't take 8 minutes to classify 2 news items, 12
> minutes for 3, and so on), I still want to improve the throughput. What I
> am doing is loading a particular model and classifying N news items, then
> loading the next model and classifying the N items again. With this
> approach it takes 16 minutes to classify 1,000 news items of 75-100 words
> each. Is there a way to improve this further? (One option I am considering
> is to use Hadoop for the classification task. Is that possible?)
>
> 3) Where can I find some good code examples/tutorials for training and
> testing a Mahout classifier using Naive Bayes? There are lots of examples
> on the net, but they all use the command line. I need the code for Naive
> Bayes because my dataset is not in files but in a database, and the
> command-line options read the dataset from files. The Mahout in Action
> book gives a good walkthrough of the SGD code, but not of Naive Bayes.
>
> Thanks!
--
Lance Norskog
[email protected]