I'm not too worried about splitting the data into test and train sets. My main 
issue is that the classifier examples I can find all take as input a file with 
the form (at least for text):

<label>\t<text to classify...>

However, I don't have the original content of the files, only the index with 
term frequency vectors. I know the first step for the Bayesian algorithms is 
creating a TF-IDF vector, but it seems the existing code cannot take TF-IDF 
vectors as input the way the clustering algorithms do, nor any variant of the 
term frequency vectors I can get from Lucene.

At this point, I am going to try to write code to dump the words and 
frequencies from the index, add a label, and modify the BayesFeatureDriver 
class to take my input.
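
A rough sketch of the dump step, assuming the term frequencies for one 
document have already been read out of the index into a map. The Lucene 
reading code is omitted, and TermFreqLineWriter is a made-up name, not a 
Mahout or Lucene class. It rebuilds a "<label>\t<text>" line by repeating 
each term as many times as its frequency:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Mahout or Lucene.
class TermFreqLineWriter {

    // Build one "<label>\t<text>" line by repeating each term as many
    // times as its frequency. Original word order is lost, which should
    // not matter for a bag-of-words classifier.
    static String toLine(String label, Map<String, Integer> termFreqs) {
        StringBuilder sb = new StringBuilder(label).append('\t');
        boolean first = true;
        // Copy into a TreeMap so terms come out in a stable, sorted order.
        for (Map.Entry<String, Integer> e : new TreeMap<>(termFreqs).entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                if (!first) {
                    sb.append(' ');
                }
                sb.append(e.getKey());
                first = false;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example term frequencies; real values would come from the index.
        Map<String, Integer> tf = new TreeMap<>();
        tf.put("invoice", 2);
        tf.put("contract", 1);
        System.out.println(toLine("CategoryX", tf));
    }
}
```

Whether the training job tokenizes this regenerated text back into the same 
terms the index holds is an assumption I would still need to check.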

David


-----Original Message-----
From: Lance Norskog [mailto:[email protected]] 
Sent: Tuesday, April 05, 2011 3:19 PM
To: [email protected]
Subject: Re: Classification with data from Lucene

The Lucene intake does not support searches on the index.

If you can make copies of the index, here's a trick: delete the
documents you don't want, then optimize the index. You will need a
Lucene program to do this.
Use this to separate the big index into training and test indexes.

On Mon, Apr 4, 2011 at 6:51 PM, David Croley <[email protected]> wrote:
> I have a large Lucene index (with TermFreq vectors). I do not have easy
> access to the original source docs that the index was made from. I have
> identified a set of docs in the index as Category X. Is there a way to
> run Mahout's Bayesian classification algorithm, trained on the docs in
> Category X, on the remaining docs in the index to better identify
> category matches?
>
>
>
> I have also exported the Lucene data into a Vector file in prep to run
> some clustering experiments (as per the wiki examples) and also wondered
> if that data could be used to feed the CBayes code. From what I can
> tell, the classification code in Mahout takes a completely different
> form of input compared to the clustering algorithms.
>
>
>
> Thanks for any pointers.
>
>
>
>
>
> David Croley
>
> Lead Engineer
>
> RenewData
>
> 512.351.0198 BlackBerry
>
> 512.276.5518 Desk
>
> [email protected]
>
> www.renewdata.com <http://www.renewdata.com/>
>
>
>
> Global in reach. Local in focus.
>
>
>
>
>
> Confidentiality Notice: This electronic communication contained in this 
> e-mail from [email protected] (including any attachments) may contain 
> privileged and/or confidential information. This communication is intended 
> only for the use of indicated e-mail addressees. Please be advised that any 
> disclosure, dissemination, distribution, copying, or other use of this 
> communication or any attached document other than for the purpose intended by 
> the sender is strictly prohibited. If you have received this communication in 
> error, please notify the sender immediately by reply e-mail and promptly 
> destroy all electronic and printed copies of this communication and any 
> attached document. Thank you in advance for your cooperation.
>



-- 
Lance Norskog
[email protected]


