We have a large set of documents that we would like to index with a customized
stopword list. We have run tests by indexing a random set of about 10% of the
documents, and we'd like to generate a list of the terms in that smaller set
and their IDF values as a way to create a starter set of stopw
terms in a document set
On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary wrote:
> We have a large set of documents that we would like to index with a
> customized stopword list. We have run tests by indexing a random set of about
> 10% of the documents, and we'd like to generate
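
The IDF ranking asked about above can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions: each document is assumed to be already tokenized into a list of terms, the class and method names (IdfSketch, documentFrequencies, stopwordCandidates) are hypothetical, and the idf formula, 1 + ln(numDocs / (docFreq + 1)), is the one documented for Lucene's DefaultSimilarity in the 3.x line. Terms with the lowest IDF occur in the most documents, so they are the natural starter stopword candidates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Lucene.
public class IdfSketch {

    // Document frequency: the number of documents that contain each term at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // The HashSet removes repeats so a term counts once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Assumed formula: 1 + ln(numDocs / (docFreq + 1)), as in DefaultSimilarity.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Terms ordered by descending document frequency (that is, ascending IDF),
    // with ties broken alphabetically so the result is deterministic.
    static List<String> stopwordCandidates(List<List<String>> docs, int limit) {
        Map<String, Integer> df = documentFrequencies(docs);
        List<String> terms = new ArrayList<>(df.keySet());
        terms.sort((a, b) -> {
            int byDf = Integer.compare(df.get(b), df.get(a));
            return byDf != 0 ? byDf : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(limit, terms.size())));
    }

    public static void main(String[] args) {
        // Toy corpus standing in for the 10% sample of documents.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "cat"),
                Arrays.asList("the", "dog"),
                Arrays.asList("the", "cat", "sat"));
        Map<String, Integer> df = documentFrequencies(docs);
        for (String term : stopwordCandidates(docs, 2)) {
            System.out.println(term + " idf=" + idf(df.get(term), docs.size()));
        }
    }
}
```

Against a real index the document frequencies would come from the index's term dictionary rather than from re-tokenized text; the ranking step stays the same.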
If I have indexed a set of documents using term vectors, is there support in
Lucene to treat a list of query terms as a small document, create a term vector
for it, and find documents by computing similarity between the query's term
vector and the term vectors in the index? If so, what API funct
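
As a point of reference for the question above, this is what "computing similarity between the query's term vector and a document's term vector" means in the plain vector space sense. It is a minimal sketch in standalone Java, not a Lucene API: the class CosineSketch and its methods are hypothetical, and raw term counts are assumed as vector weights (no IDF weighting or length normalization beyond the cosine itself).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of Lucene.
public class CosineSketch {

    // Builds a term-frequency vector (term -> count) from a list of terms.
    static Map<String, Integer> termVector(String... terms) {
        Map<String, Integer> tv = new HashMap<>();
        for (String term : terms) {
            tv.merge(term, 1, Integer::sum);
        }
        return tv;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> tv) {
        double sumOfSquares = 0.0;
        for (int count : tv.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between two term vectors; 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    public static void main(String[] args) {
        // The query is treated as a small document, as the question proposes.
        Map<String, Integer> query = termVector("term", "vector");
        Map<String, Integer> doc = termVector("term", "vector", "index", "term");
        System.out.println("similarity=" + cosine(query, doc));
    }
}
```

Scoring every document this way means reading each stored term vector, which is a linear scan over the collection.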
In the Javadoc page for the Similarity class, it says,
"Lucene combines Boolean model (BM) of Information Retrieval with Vector Space
Model (VSM) of Information Retrieval - documents "approved" by BM are scored by
VSM."
Is the Vector Space Model that is referred to here different than the term
I sent this message to the Luke discussion forum, but there isn't a lot of
activity there these days, so I thought I would ask my question here too.
I was asked if Luke supports highlighting of matched terms in its search
results display. I looked through the code, and it doesn't look to me like
I created an index using Lucene 3.6.0 in which I specified that a certain text
field in each document should be indexed, stored, analyzed with no norms, with
term vectors, offsets and positions. Later I looked at that index in Luke, and
it said that term vectors were created for this field, but
-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, July 20, 2012 6:11 AM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved
Hi Mike:
I wrote up some tests last night against 3.6 trying to find some way
I neglected to mention that CreateTestIndex uses a collection of data files
with .properties extensions that are included in the Lucene In Action source
code download.
Mike
-----Original Message-----
From: Mike O'Leary [mailto:tmole...@uw.edu]
Sent: Friday, July 20, 2012 2:10 PM
To: java
d fields.
This tool should be using something like IndexReader.getTermFreqVector for the
document to determine if it has term vectors.
On Fri, Jul 20, 2012 at 5:10 PM, Mike O'Leary wrote:
> Hi Robert,
> I put together the following two small applications to try to separate the
>
Subject: Re: Problem with TermVector offsets and positions not being preserved
On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary wrote:
> Hi Robert,
> I'm not trying to determine whether a document has term vectors, I'm trying
> to determine whether the term vectors that are in th
I would like to know if anyone has ideas (or pointers to discussions) about
good ways to support advanced search options, such as the various kinds of
SpanQuery, in a search application user interface that is understandable to
non-expert users.
Thanks,
Mike
term vectors
in the affected fields? Is there a way to add a field to the documents in an
index in which this doesn't occur?
Thanks,
Mike
-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, July 20, 2012 5:59 PM
To: java-user@lucene.apache.org
Subje
ENE-3312 for an effort to fix this
trap for Google Summer of Code.
On Wed, Aug 22, 2012 at 5:23 PM, Mike O'Leary wrote:
> I have one more question about term vector positions and offsets being
> preserved. My co-worker is working on updating the documents in an index with
> a fi
I was looking at IndexWriter.commit(commitUserData) and
IndexCommit.getUserData() as possible ways to save metadata about documents in
an index, but I realized that the metadata we are looking at could easily get
to have way too many map entries to work well. This pair of functions looks
useful
I am updating an analyzer that uses a particular configuration of the
PerFieldAnalyzerWrapper to work with Lucene 4.0. A few of the fields use a
custom analyzer and StandardTokenizer and the other fields use the
KeywordAnalyzer and KeywordTokenizer. The older version of the analyzer looks
like
ode sample.
Are you able to expand on the problem you're encountering?
On Wed, Sep 26, 2012 at 11:57 AM, Mike O'Leary wrote:
> I am updating an analyzer that uses a particular configuration of the
> PerFieldAnalyzerWrapper to work with Lucene 4.0. A few of the fields
> use
ourse by not extending Analyzer but instead
just instantiating a PerFieldAnalyzerWrapper instance directly instead of your
MyPerFieldAnalyzer.
On Wed, Sep 26, 2012 at 12:25 PM, Mike O'Leary wrote:
> Hi Chris,
> In a nutshell, my question is, what should I put in place of ??? to
>
com]
Sent: Tuesday, September 25, 2012 6:32 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 PerFieldAnalyzerWrapper question
Mike,
On Wed, Sep 26, 2012 at 1:05 PM, Mike O'Leary wrote:
> Hi Chris,
> So if I change my analyzer to inherit from AnalyzerWrapper, I need to
> def
I have a collection of XML files that I would like to parse using Digester
in order to index them for Lucene. A DTD file has been supplied for the XML
files, but none of those files has a line associating them
with the DTD file. Can the Digester's register function be used to tell it
to use that D
how to do something like this? Or is there a better way that I'm not
thinking of? Thanks.
Mike O'Leary
So if I wanted to record the length of each individual document, would it be
better to store that information with each document, perhaps as an unindexed
field? Or are there ways to refer to the indexed documents that don't change
through delete and optimize steps? Thanks.
Mike O'Leary
tter to write an Analyzer, could someone point me to information on how
to do this? Thanks.
Mike O'Leary