Obtaining IDF values for the terms in a document set

2011-12-15 Thread Mike O'Leary
We have a large set of documents that we would like to index with a customized stopword list. We have run tests by indexing a random set of about 10% of the documents, and we'd like to generate a list of the terms in that smaller set and their IDF values as a way to create a starter set of stopw
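
For readers following this thread, the idea in the question can be sketched without any Lucene dependency. The block below is a minimal sketch, not the poster's code: it assumes documents are already tokenized into in-memory lists, and the class name IdfSketch, its method names, and the threshold value are hypothetical. The idf formula is the classic one used by Lucene 3.x's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: not part of Lucene, and not the code from the thread.
public class IdfSketch {

    // Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1)).
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // For each term, count the number of documents that contain it at least once.
    public static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> doc : docs) {
            // A HashSet ensures a term is counted once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Terms whose IDF is at or below maxIdf, i.e. terms found in most documents.
    // The threshold is an assumption to be tuned against the real sample.
    public static List<String> stopwordCandidates(List<List<String>> docs, double maxIdf) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Integer> e : documentFrequencies(docs).entrySet()) {
            if (idf(e.getValue(), docs.size()) <= maxIdf) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "quick", "fox"),
            Arrays.asList("the", "lazy", "dog"),
            Arrays.asList("the", "fox", "and", "the", "dog"),
            Arrays.asList("a", "quiet", "night"));
        System.out.println(stopwordCandidates(docs, 1.0));
    }
}
```

Low IDF means a term occurs in a large share of the documents, which is why it is a plausible first filter for stopword candidates. The resulting list would still need manual review.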

RE: Obtaining IDF values for the terms in a document set

2011-12-15 Thread Mike O'Leary
terms in a document set On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary wrote: > We have a large set of documents that we would like to index with a > customized stopword list. We have run tests by indexing a random set of about > 10% of the documents, and we'd like to generate

Searching by similarity using term vectors

2012-02-14 Thread Mike O'Leary
If I have indexed a set of documents using term vectors, is there support in Lucene to treat a list of query terms as a small document, create a term vector for it, and find documents by computing similarity between the query's term vector and the term vectors in the index? If so, what API funct
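
The computation described in this question can be illustrated independently of Lucene. The block below is a minimal sketch under stated assumptions: term vectors are modeled as plain term-to-frequency maps, the class and method names are hypothetical, and raw cosine similarity is used for illustration. It is not a Lucene API and does not reproduce Lucene's own scoring.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: models term vectors as term -> frequency maps.
public class TermVectorSimilaritySketch {

    // Build a term-frequency vector from a list of tokens (a document or a query).
    public static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 if either vector is empty.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Euclidean length of a term-frequency vector.
    private static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int freq : v.values()) {
            sumOfSquares += (double) freq * freq;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termFrequencies(Arrays.asList("term", "vector"));
        Map<String, Integer> doc = termFrequencies(
            Arrays.asList("term", "vector", "offsets", "and", "positions"));
        System.out.println(cosine(query, doc));
    }
}
```

In a real index the per-document map would have to be filled from the stored term vectors, and comparing a query against every document this way scales linearly with index size.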

Lucene's use of vectors

2012-03-01 Thread Mike O'Leary
In the Javadoc page for the Similarity class, it says, "Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM." Is the Vector Space Model that is referred to here different than the term

Highlighting in Luke?

2012-03-13 Thread Mike O'Leary
I sent this message to the Luke discussion forum, but there isn't a lot of activity there these days, so I thought I would ask my question here too. I was asked if Luke supports highlighting of matched terms in its search results display. I looked through the code, and it doesn't look to me like

Problem with TermVector offsets and positions not being preserved

2012-07-19 Thread Mike O'Leary
I created an index using Lucene 3.6.0 in which I specified that a certain text field in each document should be indexed, stored, analyzed with no norms, with term vectors, offsets and positions. Later I looked at that index in Luke, and it said that term vectors were created for this field, but

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Friday, July 20, 2012 6:11 AM To: java-user@lucene.apache.org Subject: Re: Problem with TermVector offsets and positions not being preserved Hi Mike: I wrote up some tests last night against 3.6 trying to find some way

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
I neglected to mention that CreateTestIndex uses a collection of data files with .properties extensions that are included in the Lucene In Action source code download. Mike -Original Message- From: Mike O'Leary [mailto:tmole...@uw.edu] Sent: Friday, July 20, 2012 2:10 PM To: java

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
d fields. This tool should be using something like IndexReader.getTermFreqVector for the document to determine if it has term vectors. On Fri, Jul 20, 2012 at 5:10 PM, Mike O'Leary wrote: > Hi Robert, > I put together the following two small applications to try to separate the >

RE: Problem with TermVector offsets and positions not being preserved

2012-07-26 Thread Mike O'Leary
Subject: Re: Problem with TermVector offsets and positions not being preserved On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary wrote: > Hi Robert, > I'm not trying to determine whether a document has term vectors, I'm trying > to determine whether the term vectors that are in th

Supporting advanced search methods in a user interface

2012-08-16 Thread Mike O'Leary
I would like to know if anyone has ideas (or pointers to discussions) about good ways to support advanced search options, such as the various kinds of SpanQuery, in a search application user interface that is understandable to non-expert users. Thanks, Mike

RE: Problem with TermVector offsets and positions not being preserved

2012-08-22 Thread Mike O'Leary
term vectors in the affected fields? Is there a way to add a field to the documents in an index in which this doesn't occur? Thanks, Mike -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Friday, July 20, 2012 5:59 PM To: java-user@lucene.apache.org Subje

RE: Problem with TermVector offsets and positions not being preserved

2012-08-24 Thread Mike O'Leary
ENE-3312 for an effort to fix this trap for google summer of code. On Wed, Aug 22, 2012 at 5:23 PM, Mike O'Leary wrote: > I have one more question about term vector positions and offsets being > preserved. My co-worker is working on updating the documents in an index with > a fi

Uses for IndexWriter.commit(commitUserData)/IndexCommit.getUserData()

2012-09-21 Thread Mike O'Leary
I was looking at IndexWriter.commit(commitUserData) and IndexCommit.getUserData() as possible ways to save metadata about documents in an index, but I realized that the metadata we are looking at could easily get to have way too many map entries to work well. This pair of functions looks useful

Lucene 4.0 PerFieldAnalyzerWrapper question

2012-09-25 Thread Mike O'Leary
I am updating an analyzer that uses a particular configuration of the PerFieldAnalyzerWrapper to work with Lucene 4.0. A few of the fields use a custom analyzer and StandardTokenizer and the other fields use the KeywordAnalyzer and KeywordTokenizer. The older version of the analyzer looks like

RE: Lucene 4.0 PerFieldAnalyzerWrapper question

2012-09-25 Thread Mike O'Leary
ode sample. Are you able to expand on the problem you're encountering? On Wed, Sep 26, 2012 at 11:57 AM, Mike O'Leary wrote: > I am updating an analyzer that uses a particular configuration of the > PerFieldAnalyzerWrapper to work with Lucene 4.0. A few of the fields > use

RE: Lucene 4.0 PerFieldAnalyzerWrapper question

2012-09-25 Thread Mike O'Leary
ourse by not extending Analyzer but instead just instantiating a PerFieldAnalyzerWrapper instance directly instead of your MyPerFieldAnalyzer. On Wed, Sep 26, 2012 at 12:25 PM, Mike O'Leary wrote: > Hi Chris, > In a nutshell, my question is, what should I put in place of ??? to >

RE: Lucene 4.0 PerFieldAnalyzerWrapper question

2012-09-26 Thread Mike O'Leary
com] Sent: Tuesday, September 25, 2012 6:32 PM To: java-user@lucene.apache.org Subject: Re: Lucene 4.0 PerFieldAnalyzerWrapper question Mike, On Wed, Sep 26, 2012 at 1:05 PM, Mike O'Leary wrote: > Hi Chris, > So if I change my analyzer to inherit from AnalyzerWrapper, I need to > def

Registering a local dtd file for use with Digester

2007-02-22 Thread Mike O'Leary
I have a collection of XML files that I would like to parse using Digester in order to index them for Lucene. A DTD file has been supplied for the XML files, but none of those files has a line associating them with the DTD file. Can the Digester's register function be used to tell it to use that D

Storing extra data in index

2007-02-27 Thread Mike O'Leary
how to do something like this? Or is there a better way that I'm not thinking of? Thanks. Mike O'Leary

RE: Storing extra data in index

2007-02-27 Thread Mike O'Leary
So if I wanted to record the length of each individual document, would it be better to store that information with each document, perhaps as an unindexed field? Or are there ways to refer to the indexed documents that don't change through delete and optimize steps? Thanks. Mike O&

Indexing single words and marked phrases

2007-03-02 Thread Mike O'Leary
tter to write an Analyzer, could someone point me to information on how to do this? Thanks. Mike O'Leary