The Lucene intake does not support restricting the extraction with a search on the index. If you can make a copy of the index, here's a trick: delete the documents you don't want, then optimize the index. You will need a small Lucene program to do this. Use this approach to split the big index into separate training and test indexes.
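A minimal sketch of that trick, assuming a Lucene 3.x-era classpath; the "category" field name and the "test" term are hypothetical placeholders for whatever marks the docs you want out of this split. Run it against a copy of the index, never the original:

```java
// Sketch: carve a training index out of a COPY of the big index by
// deleting the unwanted docs and then optimizing away the deletions.
// Assumes Lucene 3.x; field name "category" and term "test" are examples.
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SplitIndex {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a *copy* of the index (the original stays untouched)
        FSDirectory dir = FSDirectory.open(new File(args[0]));
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // Drop every document tagged for the other split; what remains
        // in this copy becomes the training index.
        writer.deleteDocuments(new Term("category", "test"));

        // Merge segments and physically expunge the deleted documents.
        writer.optimize();
        writer.close();
    }
}
```

Repeat with the opposite delete term on a second copy to produce the test index; both copies keep their TermFreq vectors.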
On Mon, Apr 4, 2011 at 6:51 PM, David Croley <[email protected]> wrote:
> I have a large Lucene index (with TermFreq vectors). I do not have easy
> access to the original source docs that the index was made from. I have
> identified a set of docs in the index as Category X. Is there a way to
> run Mahout's Bayesian classification algorithm, trained on the docs in
> Category X, on the remaining docs in the index to better identify
> category matches?
>
> I have also exported the Lucene data into a Vector file in prep to run
> some clustering experiments (as per the wiki examples) and also wondered
> if that data could be used to feed the CBayes code. From what I can
> tell, the classification code in Mahout takes a completely different
> form of input compared to the clustering algorithms.
>
> Thanks for any pointers.
>
> David Croley
> Lead Engineer
> RenewData
> 512.351.0198 BlackBerry
> 512.276.5518 Desk
> [email protected]
> www.renewdata.com

--
Lance Norskog
[email protected]
