Hi,
We use Lucene to index and search HTML documents. We extract
all text content from the HTML documents and index it.
While searching the documents, we found in several instances that
the matched search terms are in the navbar section. Since they are in the navbar, almost
all pages in that site end up in
On Monday, Dec 30, 2002, at 15:01 Europe/Zurich, Erik Hatcher wrote:
If you have control over the HTML, how about marking the navbar pieces
with a certain CSS class and then filtering that out from what you
index? It seems like that would be a reasonable way to filter it -
but this is of
Hi Cutting:
Today, while cleaning up my documents, I found some articles Cutting wrote
on search engines. I downloaded these articles from www.lucene.com almost
one and a half years ago.
I think these articles are still very useful for today's search engine
developers for reference. Hope some articles
Is it possible to get a collection of documents based on whether they
have a particular field (regardless of value)? I'm indexing HTML
documents, and want to pull out some information that may or may not be
present in the documents (and adding a field if that information exists
but not
You could write an Analyzer that doesn't drop the '/' character and use an
instance of that Analyzer when you call the QueryParser.parse method.
I never used escape characters myself, and saw a lot of people
complaining it does not work, however, you can check Lucene's
QueryParser test class (in CVS under
Erik,
The attached class does what you want.
Regards,
Terry
PS: I've discovered that this code may not work if the index isn't optimized
(though I've not a clue why that's so).
- Original Message -
From: Erik Hatcher [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, December 30,
Terry,
Thanks for that quick reply and code. I love open source!
But it seems there should be a way to do this without walking every
document, since these fields are indexed after all :)
I realize that Lucene is very sophisticated under the covers, and there
may be some technical reason why
Erik Hatcher wrote:
Is it possible for me to retrieve all the values of a particular field
that exists within an index, across all documents?
For example, I'm indexing documents that have a category associated
with them. Several documents will share the same category. I'd like to
be able to
I don't know if the core API provides this feature since I don't think
you can search just using a wildcard.
However, you may want to provide 1 or more fields which describe the
fields available in this document.
field1exists:true
or
add the field names of the fields that exist in 1 Lucene
Terry Steichen wrote:
I tested StandardAnalyzer (which uses StandardTokenizer) by inputting a set of strings, which produced the following results:
aa/bb/cc/dd was tokenized into 4 terms: aa, bb, cc, dd
aa/bb/cc/d1 was tokenized into 3 terms: aa, bb, cc/d1
aa/bb/c1/dd was tokenized into 2
Yes, the document creation is out of my hands.
And in addition, the HTML documents may not be from a single web site. The
number of websites is dynamic. Even in the case of a single web site, there are
different apps, each having its own layout etc.
So, I am not sure if longest common prefix/suffix
Doug,
Aha. I feel better knowing that I haven't lost my mind, now that I know
what you were trying to do.
As to a suggestion, I would only venture to say that, in its present form,
this results in confusing behavior (as noted in my original message).
Whether that drawback is outweighed by the
For Asian languages (Chinese, Korean, Japanese), bigram-based word segmentation is an
easy way to solve the word segmentation problem.
Bigram-based word segmentation is: C1C2C3C4 = C1C2 C2C3 C3C4 (each Cn is a single CJK
character term)
I think we can make a StandardTokenizer that can handle multi-language mixed content
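The bigram scheme described above (C1C2C3C4 = C1C2 C2C3 C3C4) can be sketched in plain Java. This is only an illustration of the idea, not code from the original message and not a Lucene tokenizer: the class name BigramSegmenter and the method segment are made up for this example, it assumes the input is already a run of CJK characters, and emitting a lone character as its own term is an assumption as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits a run of characters
// into overlapping two-character terms.
public class BigramSegmenter {

    /** C1C2C3C4 -> [C1C2, C2C3, C3C4]. */
    public static List<String> segment(String text) {
        List<String> terms = new ArrayList<>();
        // Work on code points so characters outside the BMP are not split.
        int[] cps = text.codePoints().toArray();
        if (cps.length == 1) {
            // Assumption: a single character becomes a term by itself.
            terms.add(new String(cps, 0, 1));
            return terms;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            // Each term is the character at i plus the one after it.
            terms.add(new String(cps, i, 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Four CJK characters yield three overlapping bigrams.
        System.out.println(segment("\u4e2d\u6587\u5206\u8bcd"));
    }
}
```

The same segmentation would have to be applied to both the indexed text and the query text for matches to line up. Mixed-language content (for example Latin words between CJK runs) is not handled by this sketch.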
First, index the file path as an untokenized field:
Field(filePath, file.getAbsolutePath(), true, true, false)
Second, construct a prefix filter for the searcher. I wrote a StringFilter.java for
exact match and prefix match field filtering, which can be downloaded from:
http://www.chedong.com/tech/lucene_ext.tar.gz
Che, Dong
- Original Message -
From: Terry Steichen [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Saturday, October 26, 2002 6:08 AM
Subject: