- Index text and HTML files. Any others?
What, no PDF files!!
Ben
--
http://www.pdfbox.org
I would suggest XML as well.
Tom
-Original Message-
From: Ben Litchfield [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003 7:42 AM
To: Lucene Users List
Subject: Re: Lucene demo ideas?
- Index text and HTML files. Any others?
What, no PDF files!!
Ben
--
please keep the discussions on the lucene-user e-mail list.
of course the source code will be available... what is there is already
in lucene's CVS and i will just revamp what is there and commit it.
and when we make lucene releases it will be bundled and made available
as a single download
On Wednesday, September 17, 2003, at 08:43 AM, Killeen, Tom wrote:
I would suggest XML as well.
Again, I'd like to hear more about how you'd do this generically. Tell
me what the field names and values would correspond to when presented
with an XML file.
Erik
Might want two demos, one for Unix environments and one for Windows.
Most users will want a fast start that they can copy and adapt. So quick
targets would be:
filesystems - html / text / pdf / office documents for windows.
xml - fairly simple example maybe against news items.
database - again
I would like to see the taglib for searching the index in the demo. There is
an html form page and result page already built for the taglib that allows
you to change search params and demonstrates a fair amount of the search
capability of Lucene.
- Original Message -
From: Erik Hatcher
I know this may be far-fetched, but how about being able to index
.jsp's? I know this is a spindle thing, but it seems a lot of people
need this functionality.
My suggestion
Russ
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003
Hello all,
I have recently indexed approx 15.8 million XML documents in which I index
the contents of certain elements (titles, states, dates to name a few). I have
27 separate indices and use a MultiSearcher to search these indices.
When I search on the title and state fields with multiple
Hi,
Here are a couple of ideas for XML demos:
1. simply index the content into one 'content' field. Don't worry about
attributes.
2. index a linked Dublin core meta data file:
<link rel="meta" href="index.rdf" />
And add fields for every element after rdf:Description
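A rough sketch of idea 1, flattening an XML file into the text for a single 'content' field and ignoring attributes. It uses only the JDK's DOM parser; the class and method names (XmlContentSketch, extractContent) are made up for illustration and are not part of Lucene. Handing the resulting string to a Lucene field is left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper (not a Lucene class): collects all element text
// into one string suitable for a single 'content' field.
class XmlContentSketch {

    // Returns the text of every element, separated by single spaces.
    // Attributes are not child nodes in DOM, so they are skipped.
    static String extractContent(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            // Collapse runs of whitespace left over from the markup layout.
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk that appends text and CDATA nodes only.
    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        String xml = "<item id=\"42\"><title>Lucene demo</title>"
                + "<body>Index XML files</body></item>";
        System.out.println(extractContent(xml));
    }
}
```

Idea 2 would need a mapping from element names to field names on top of this, which is exactly the part that is hard to do generically.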
Best,
-Rob
-Original
I don't know how lucene handles date ranges, but I was having very poor
results using booleans between different fields because of the way lucene handles
them. What lucene does is that it evaluates each field in the query
separately and retrieves all of the results, then it evaluates the boolean
joins
Hi All,
has someone ever written an extension of QueryParser providing the
possibility to let wildcard search terms be run through an analyzer ( as
suggested by Tatu Saloranta a while ago)? I want to reduce German umlauts to
their base letters (e.g. 'ä' (&auml;) to 'a') and for non-wildcard
Does anyone have any suggestions for searching date ranges? Our
ranges will generally span a 3 to 7 year period.
Apparently Lucene expands ranges to boolean 'or' queries. So if you have
a thousand distinct dates within a range, Lucene will build a query with
a thousand terms...
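To make the expansion concrete, here is a small sketch in plain Java. It does not call Lucene; the class name RangeExpansionSketch and the sample terms are invented for illustration. It only mimics the behaviour described above: every distinct indexed term that falls inside the range becomes its own 'or' clause.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration (not Lucene code) of a range query being
// rewritten into one OR clause per distinct indexed term.
class RangeExpansionSketch {

    // Returns one "field:term" clause for each indexed term in [lower, upper].
    // Dates are compared as strings, which works for fixed-width YYYYMMDD values.
    static List<String> expand(String field, SortedSet<String> indexedTerms,
                               String lower, String upper) {
        List<String> clauses = new ArrayList<>();
        for (String term : indexedTerms) {
            if (term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0) {
                clauses.add(field + ":" + term);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        terms.add("19961231");
        terms.add("19970101");
        terms.add("19990615");
        terms.add("20020404");
        terms.add("20030917");

        List<String> clauses = expand("FiledDate", terms, "19970101", "20020404");
        System.out.println(String.join(" OR ", clauses));
        System.out.println(clauses.size() + " clauses");
    }
}
```

The clause count grows with the number of distinct terms in the range, not with the number of documents, so indexing dates at a coarser granularity is one way to keep the expanded query small.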
One
Does anyone have any suggestions on what they'd like to see in the
demo app?
Show how lucene can 1) do incremental indexing, 2) isn't restricted to
indexing file system resources and 3) can store and query arbitrary
fields. These are in my opinion the features where most other search
engines
I think all the attribute values together with element text values should be
indexed in the content part. Also an XML map file could be used to pick out
the nodes that need to be indexed separately so we do not create too many fields
by indexing non-critical nodes separately. Simple xpath could be used
Erik Hatcher wrote:
On Wednesday, September 17, 2003, at 08:42 AM, Ben Litchfield wrote:
What, no PDF files!!
Haha!
http://www.pdfbox.org
And I've used pdfbox before - it's cool.
And I'm cool with adding PDF and Word indexing to the demo personally,
but I didn't want to increase the weight
On Wednesday, September 17, 2003, at 09:21 AM, Pitre, Russell wrote:
I know this may be far-fetched, but how about being able to index
.jsp's? I know this is a spindle thing, but it seems a lot of people
need this functionality.
Like I communicated in a previous thread, indexing JSP's just
Killeen, Tom wrote:
My query would look something like this: LongTitle:killeen AND
LongTitle:state AND StateDistrict:id AND FiledDate:[1997-01-01 TO
2002-04-04] and it returned in
5.7 seconds
Does anyone have any suggestions for searching date ranges? Our ranges will
generally be between a 3 -
Paging would be great for the results.
Jeff
- Original Message -
From: Erik Hatcher [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003 7:00 AM
Subject: Lucene demo ideas?
I'm about to start some refactorings on the web application demo that
ships with Lucene
And with the latest Lucene codebase in CVS, you could also use a
DateFilter wrapped inside a CachingWrapperFilter instead of a
QueryFilter. Just wanted to mention what is now available.
But I'll reiterate what Doug says... be sure to save off the filter
instance so you don't take the
20030917 (that is, today), I get 157 hits, all of which have a
score of .23000652. If I use 20030916 (yesterday), I get 197 hits,
each of which has a score of .22295427.
So far, all seems logical. However, when I search for all records for
the date 20030915, the first two (of 174 hits) have
Steichen wrote:
I've run across some puzzling behavior regarding scoring. I have a set of documents which contain, among others, a date field (whose content is a string in the YYYYMMDD format). When I query on the date 20030917 (that is, today), I get 157 hits, all of which have a score
* super.lengthNorm(fieldName, Math.max(numTerms,750));
} else {
return super.lengthNorm(fieldName, Math.max(numTerms, 750));
}
}
}
Query #1: pub_date:20030917
All items: Score: .23000652
0.23000652 = weight(pub_date:20030917 in 91197), product of:
0.9994 = queryWeight(pub_date:20030917
Terry Steichen wrote:
0.03125 = fieldNorm(field=pub_date, doc=90992)
1.0 = fieldNorm(field=pub_date, doc=90970)
It looks like the fieldNorms are what differ, not the IDFs. These are
the product of the document and/or field boost, and 1/sqrt(numTerms)
where numTerms is the number of terms
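As a quick check on the arithmetic, here is a sketch of that product in plain Java. It is not Lucene's implementation; the class and method names are made up, and it ignores any rounding Lucene may apply when it stores the norm in the index.

```java
// Hypothetical illustration of fieldNorm = boost * 1/sqrt(numTerms).
// Not Lucene code; stored norms in a real index may be rounded.
class FieldNormSketch {

    // Length normalization: fields with more terms get a smaller factor.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // The norm is the boost multiplied by the length normalization.
    static double fieldNorm(double boost, int numTerms) {
        return boost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // One term, no boost.
        System.out.println(fieldNorm(1.0, 1));
        // 1024 terms, no boost.
        System.out.println(fieldNorm(1.0, 1024));
    }
}
```

With a boost of 1, a single-term field gives 1.0 and a 1024-term field gives 0.03125, the same two values as in the explain output. That only shows what the formula would need as input to produce those numbers; it does not say why some documents ended up with the smaller norm.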
I would have the code ready if wanted...
- Original Message -
From: Pitre, Russell [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003 2:21 PM
Subject: RE: Lucene demo ideas?
I know this may be far-fetched, but how about being able to index
Yeah, that would be great!
- Original Message -
From: Jeff Linwood [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003 5:15 PM
Subject: Re: Lucene demo ideas?
Paging would be great for the results.
Jeff
- Original Message -
From:
Doug,
(1) No, I did *not* boost the pub_date field, either in the indexing process
or in the query itself.
(2) And, each pub_date field of each document (which is in XML format)
contains only one instance of the date string.
(3) And only the pub_date field itself is indexed. There are other
On Wednesday 17 September 2003 07:07, Erik Hatcher wrote:
On Wednesday, September 17, 2003, at 08:43 AM, Killeen, Tom wrote:
I would suggest XML as well.
Again, I'd like to hear more about how you'd do this generically. Tell
me what the field names and values would correspond to when
Hmm. This makes no sense to me. Can you supply a reproducible
standalone test case?
Doug
Terry Steichen wrote:
Doug,
(1) No, I did *not* boost the pub_date field, either in the indexing process
or in the query itself.
(2) And, each pub_date field of each document (which is in XML format)