Heuristics on searching HTML Documents ?

2002-12-30 Thread Mailing Lists Account
Hi, We use Lucene to index and search HTML documents. We extract all text content from the HTML documents and index it. While searching the documents, we found in several instances that the matched search terms are in the navbar section. Since it is in the navbar, almost all pages in that site end up in

Re: Heuristics on searching HTML Documents ?

2002-12-30 Thread petite_abeille
On Monday, Dec 30, 2002, at 15:01 Europe/Zurich, Erik Hatcher wrote: If you have control over the HTML, how about marking the navbar pieces with a certain CSS class and then filtering that out from what you index? It seems like that would be a reasonable way to filter it - but this is of
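The CSS-class idea above could be applied before indexing. The following is only a rough sketch, not a production approach: a real solution should use an HTML parser, since a regex cannot handle nested elements. The class name "navbar" and the helper are hypothetical.

```java
import java.util.regex.Pattern;

public class NavbarStripper {
    // Hypothetical helper: drops elements whose class attribute contains
    // "navbar" before the text is handed to the indexer. Only handles
    // simple, non-nested elements; an HTML parser is the robust choice.
    static String stripNavbar(String html) {
        Pattern p = Pattern.compile(
            "<(\\w+)[^>]*class=\"[^\"]*navbar[^\"]*\"[^>]*>.*?</\\1>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<div class=\"navbar\"><a href=\"/\">Home</a></div>"
                    + "<p>Actual content</p>";
        System.out.println(stripNavbar(html)); // prints "<p>Actual content</p>"
    }
}
```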

Some articles Cutting wrote which I downloaded from www.lucene.com

2002-12-30 Thread
Hi Cutting: Today, I found some articles Cutting wrote on search engines while cleaning up my documents. I downloaded these articles from www.lucene.com almost one and a half years ago. I think these articles are still very useful as a reference for today's search engine developers. Hope some articles

Querying for documents that have a field

2002-12-30 Thread Erik Hatcher
Is it possible to get a collection of documents based on whether they have a particular field (regardless of value)? I'm indexing HTML documents, and want to pull out some information that may or may not be present in the documents (and adding a field if that information exists but not

Re: QueryParser question

2002-12-30 Thread Otis Gospodnetic
You could write an Analyzer that doesn't drop the '/' character and use an instance of that Analyzer when you call the QueryParser.parse method. I never used escape characters myself, and I saw a lot of people complaining that they do not work; however, you can check Lucene's QueryParser test class (in CVS under
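In Lucene such an Analyzer would typically subclass CharTokenizer and override isTokenChar to accept '/'. As a plain-Java illustration of that token rule (no Lucene dependency; the class and method names here are only for the demo):

```java
import java.util.ArrayList;
import java.util.List;

public class SlashTokenizerDemo {
    // The rule a custom Lucene tokenizer would implement: treat '/' as a
    // letter, so "aa/bb/cc" survives as a single term.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '/';
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                tokens.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("path aa/bb/cc, other"));
        // prints [path, aa/bb/cc, other]
    }
}
```

The same rule must be used at both index time and query time, otherwise the query terms will not match the indexed terms.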

Re: How to obtain unique field values

2002-12-30 Thread Terry Steichen
Erik, The attached class does what you want. Regards, Terry PS: I've discovered that this code may not work if the index isn't optimized (though I've not a clue why that's so). - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Monday, December 30,

Re: How to obtain unique field values

2002-12-30 Thread Erik Hatcher
Terry, Thanks for that quick reply and code. I love open source! But it seems there should be a way to do this without walking every document, since these fields are indexed after all :) I realize that Lucene is very sophisticated under the covers, and there may be some technical reason why

Re: How to obtain unique field values

2002-12-30 Thread Doug Cutting
Erik Hatcher wrote: Is it possible for me to retrieve all the values of a particular field that exists within an index, across all documents? For example, I'm indexing documents that have a category associated with them. Several documents will share the same category. I'd like to be able to
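The usual answer to this question is Lucene's TermEnum, which walks the sorted term dictionary directly instead of every document. A sketch against the Lucene 1.x API of the era, assuming a field named "category" and a hypothetical index path:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class UniqueCategoryValues {
    public static void main(String[] args) throws Exception {
        // Position a TermEnum at the first term of the "category" field;
        // the empty string sorts before any real value.
        IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical path
        TermEnum terms = reader.terms(new Term("category", ""));
        try {
            // Terms are sorted by field, then text, so stop as soon as the
            // enumeration leaves the "category" field.
            while (terms.term() != null
                    && "category".equals(terms.term().field())) {
                System.out.println(terms.term().text());
                if (!terms.next()) break;
            }
        } finally {
            terms.close();
            reader.close();
        }
    }
}
```

This touches only the term dictionary, so it is fast regardless of how many documents share each category value.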

Re: Querying for documents that have a field

2002-12-30 Thread Peter Carlson
I don't know if the core API provides this feature, since I don't think you can search using just a wildcard. However, you may want to provide one or more fields which describe the fields available in the document, e.g. field1exists:true, or add the names of the fields that exist in one Lucene
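The marker-field idea above might look like the following at index time, using the Lucene 1.x-era Field factory methods. The field names "contents", "summary", and "fieldNames" are just placeholders for this sketch:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MarkerFieldExample {
    // Hypothetical helper: records which optional fields are present in a
    // "fieldNames" keyword field, so a query like fieldNames:summary
    // matches exactly the documents that have a summary.
    static Document makeDoc(String contents, String summary) {
        Document doc = new Document();
        doc.add(Field.Text("contents", contents));
        if (summary != null) {
            doc.add(Field.Text("summary", summary));
            // Keyword fields are indexed untokenized, ideal for markers.
            doc.add(Field.Keyword("fieldNames", "summary"));
        }
        return doc;
    }
}
```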

Re: Incomprehensible (to me) tokenizing behavior

2002-12-30 Thread Doug Cutting
Terry Steichen wrote: I tested StandardAnalyzer (which uses StandardTokenizer) by inputting a set of strings, which produced the following results: aa/bb/cc/dd was tokenized into 4 terms: aa, bb, cc, dd; aa/bb/cc/d1 was tokenized into 3 terms: aa, bb, cc/d1; aa/bb/c1/dd was tokenized into 2

Re: Heuristics on searching HTML Documents ?

2002-12-30 Thread Mailing Lists Account
Yes, the document creation is out of my hands. In addition, the HTML documents may not be from a single web site. The number of websites is dynamic. Even in the case of a single web site, there are different apps, each having its own layout, etc. So I am not sure if longest common prefix/suffix

Re: Incomprehensible (to me) tokenizing behavior

2002-12-30 Thread Terry Steichen
Doug, Aha. I feel better knowing that I haven't lost my mind, now that I know what you were trying to do. As to a suggestion, I would only venture to say that, in its present form, this results in confusing behavior (as noted in my original message). Whether that drawback is outweighed by the

Re: Analyzers for various languages

2002-12-30 Thread Che Dong
For Asian languages (Chinese, Korean, Japanese), bigram-based word segmentation is an easy way to solve the word segmentation problem. Bigram-based word segmentation works like this: C1C2C3C4 => C1C2 C2C3 C3C4 (each C# is a single CJK character term). I think StandardTokenizer can be made to handle multi-language mixed content
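The C1C2C3C4 => C1C2 C2C3 C3C4 rule above is simple to implement. A minimal plain-Java sketch (the class name is only for the demo; a real tokenizer would apply this only to runs of CJK characters and pass other text through):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSegmenter {
    // Splits a run of CJK characters into overlapping bigrams:
    // C1C2C3C4 -> [C1C2, C2C3, C3C4]. A single character is kept as-is.
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        if (run.length() <= 1) {
            if (!run.isEmpty()) out.add(run);
            return out;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中文分词")); // prints [中文, 文分, 分词]
    }
}
```

Because every adjacent character pair is indexed, any multi-character query segmented the same way will match, at the cost of a larger index than dictionary-based segmentation.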

Use filter instead of searching Re: Error when trying to match file path

2002-12-30 Thread Che Dong
First, index the file path as an untokenized field: Field(filePath, file.getAbsolutePath(), true, true, false). Second, construct a prefix filter for the searcher. I wrote a StringFilter.java for exact match and prefix match which can be downloaded from:

Re: Bitset Filters

2002-12-30 Thread Che Dong
I wrote a StringFilter for exact-match and prefix-match field filtering: http://www.chedong.com/tech/lucene_ext.tar.gz Che, Dong - Original Message - From: Terry Steichen [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Saturday, October 26, 2002 6:08 AM Subject: