Hi,

Once the classifier is trained, what are the standard approaches for the "and then apply it to documents" step?

dan

--- On Tue, 9/20/11, Ken Krugler <[email protected]> wrote:

From: Ken Krugler <[email protected]>
Subject: Re: new to mahout and need direction.
To: [email protected]
Date: Tuesday, September 20, 2011, 7:52 PM

Hi Dan,

I don't think you really need Mahout for the actual processing pipeline.

If I understand the issue correctly, you're trying to come up with potential categories for job postings that are flowing through your system.

So that feels more like a typical train-a-classifier (offline) and then apply it to documents via whatever mechanism fits best with your current workflow.

Which classifier to use, extracting features for training/classification, etc. is where Mahout could be useful.

-- Ken

On Sep 20, 2011, at 7:01pm, Dan wrote:

> Hello,
>
> I am new to using Mahout. I have set up Hadoop, Nutch, and Pig, and I feel I am
> very knowledgeable about Solr and fully understand Lucene. I am a PHP
> developer and have only tinkered with Java code.
>
> I have 2 million jobs and I need to build a categorization system. I figured
> Mahout should do the trick, so I set up the 20newsgroup example and ran it. I am
> trying to figure out how Mahout will fit into the job-posting-into-Solr chain.
>
> Currently a job posting goes into a queue to be processed into a Solr
> document. We have a bunch of processes that add to the
> document, such as calling Google to get a latitude/longitude based on the job
> posting location. I figure Mahout would be in one of these worker queues.
>
> What are my options for accessing Mahout from PHP? Web service? Bash? I would
> like a system where I post it a chunk of text and it returns a list of
> suggested categories, since a job posting could belong to multiple categories.
>
> Any pointers in the right direction would be appreciated.
>
> dan

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

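For what it's worth, here is a minimal sketch of the "train offline, then apply it to documents" pattern discussed above. It is plain Python and deliberately does not use Mahout's API; the tiny Naive Bayes class, the `suggest` function name, the sample postings, and the `min_prob` threshold are all assumptions made up for illustration. In a real setup the model would be trained on the full job corpus and loaded once by whatever worker or service handles the queue.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Toy multinomial Naive Bayes; a stand-in for a real trained model."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> token counts
        self.doc_counts = Counter()              # category -> number of docs
        self.vocab = set()

    def train(self, labeled_docs):
        """Offline step: labeled_docs is an iterable of (text, category)."""
        for text, category in labeled_docs:
            tokens = tokenize(text)
            self.word_counts[category].update(tokens)
            self.doc_counts[category] += 1
            self.vocab.update(tokens)

    def log_scores(self, text):
        """Return the unnormalized log-probability of each category."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            scores[category] = score
        return scores

    def suggest(self, text, top_n=3, min_prob=0.2):
        """Online step: return up to top_n (category, probability) pairs."""
        scores = self.log_scores(text)
        best = max(scores.values())
        # Subtract the max before exponentiating to avoid underflow.
        weights = {c: math.exp(s - best) for c, s in scores.items()}
        total = sum(weights.values())
        ranked = sorted(
            ((c, w / total) for c, w in weights.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
        return [(c, p) for c, p in ranked[:top_n] if p >= min_prob]


# Hypothetical training data; a real system would use the labeled job corpus.
TRAINING = [
    ("php developer building web applications with mysql", "software"),
    ("java engineer for hadoop and solr search platform", "software"),
    ("registered nurse for hospital night shift", "healthcare"),
    ("medical assistant in a family clinic", "healthcare"),
    ("sales manager for retail store", "sales"),
]

model = TinyNaiveBayes()
model.train(TRAINING)
suggestions = model.suggest("senior php developer with solr experience")
print(suggestions)
```

The `suggest` call is the part that would sit in one of the worker queues: it takes a chunk of text and returns a ranked list of categories, so a posting can receive more than one. From PHP, one option is to wrap that call in a small HTTP service and post the text to it; another is to run it as a command-line step in the existing queue.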