Re: QueryParser Behavior and Token.setPositionIncrement

2004-04-27 Thread Erik Hatcher
On Apr 26, 2004, at 5:16 PM, Norton, James wrote: Thanks for the reply. I had reached the same conclusion as you regarding the analyzer for queries (no multiple tokens per position), but I would still regard the behaviour of QueryParser as incorrect. I agree that it is odd, but given that

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Erik Hatcher
There is a Berkeley DB implementation of Lucene's Directory in the jakarta-lucene-sandbox repository. Erik On Apr 26, 2004, at 8:35 PM, Yukun Song wrote: As is known, Lucene currently uses flat files to store index information. Does anyone have ideas or resources for combining a database (like

RE: need info for database based Lucene but not flat file

2004-04-27 Thread Cocula Remi
Because Lucene implements its own concept of a document, it is not tied to indexing any particular type of data source. It's up to you to write a tool that is able to browse your database and then submit the data as Lucene documents to the Lucene indexer. For example if your database contains a

Re: what web crawler work best with Lucene?

2004-04-27 Thread Michael Wechner
Tuan Jean Tee wrote: Has anyone implemented any open source web crawler with Lucene? I have a dynamic website and am looking at putting in a search tool. Your advice is very much appreciated. There is a crawler included within Apache Lenya http://cocoon.apache.org/lenya/

sorting by date (XML)

2004-04-27 Thread Michael Wechner
My XML files contain something like <date><year>2004</year><month>04</month><day>27</day>...</date> and I would like to sort by this date. So I guess I need to modify the Documentparser and generate something like a millisecond field and then sort by this, correct? Has anyone done something like this yet?

searching only part of an index

2004-04-27 Thread Alan Smith
Hi I wondered if anyone knows whether it is possible to search ONLY the 100 (or whatever) most recently added documents to a lucene index? I know that once I have all my results ordered by ID number in Hits I could then just display the required amount, but I wondered if there is a way to

RE: sorting by date (XML)

2004-04-27 Thread Nader S. Henein
Here's my two cents on this: Both ways you will need to combine the date in one field, but if you use a millisecond representation you will not be able to use the FLOAT sort type and you'll have to use STRING sort (slower) because the millisecond representation is longer than FLOAT allows, so you
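One way to read the "longer than FLOAT allows" remark is as a precision limit: a 32-bit float carries a 24-bit significand, so epoch milliseconds from 2004 cannot be held exactly. The following is a minimal sketch of that reading in plain Java; the class and method names are illustrative assumptions, and nothing here calls Lucene.

```java
// Illustrative only: shows why epoch milliseconds lose precision as a 32-bit float.
// The class and method names are hypothetical and not part of Lucene.
class MillisAsFloat {
    // Returns true if the value survives a round trip through float unchanged.
    static boolean fitsInFloat(long value) {
        return (long) (float) value == value;
    }

    public static void main(String[] args) {
        long millis = 1083024000000L; // 2004-04-27T00:00:00Z as epoch milliseconds
        // Whether the timestamp is exactly representable as a float.
        System.out.println("exact as float: " + fitsInFloat(millis));
        // Whether two timestamps one millisecond apart collapse to the same float.
        System.out.println("adjacent millis collide: "
                + ((float) millis == (float) (millis + 1)));
    }
}
```

If neighbouring timestamps collapse to the same float, a float-based sort cannot order them reliably, which would be consistent with the advice to fall back to a string sort or to a coarser date representation.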

RE: searching only part of an index

2004-04-27 Thread Nader S. Henein
You may be able to jimmy the bi filter to produce the most recent 100, but really keeping your fetch count at 100 and ordering by DOC should be sufficient. -Original Message- From: Alan Smith [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 27, 2004 4:03 PM To: [EMAIL PROTECTED] Subject:

Re: searching only part of an index

2004-04-27 Thread Ioan Miftode
If you know the id of the last document in the index (I don't know what the best way to get it is), you could probably use a range query, something like: find all docs with the id in [lastId-100 TO lastId]. Maybe you should make sure that the first limit is non-negative, though. Just a thought
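The range-query idea above can be sketched as a small helper that clamps the lower bound at zero and builds the query string. This is a hedged sketch, not a tested recipe: it assumes the application indexes its own `id` field (the field name, the 10-digit width, and the class name are all hypothetical), and it zero-pads the values because Lucene range queries compare terms as strings.

```java
// Sketch: build a range query string covering the most recent ids.
// Assumes an application-maintained, zero-padded "id" field; all names are hypothetical.
class RecentIdRange {
    // Hypothetical fixed width so that string order matches numeric order.
    static final int WIDTH = 10;

    static String pad(long id) {
        return String.format("%0" + WIDTH + "d", id);
    }

    // Returns e.g. "id:[0000000900 TO 0000001000]" for lastId = 1000, n = 100.
    static String lastN(String field, long lastId, int n) {
        long first = Math.max(0L, lastId - n); // keep the lower bound non-negative
        return field + ":[" + pad(first) + " TO " + pad(lastId) + "]";
    }

    public static void main(String[] args) {
        System.out.println(lastN("id", 1000, 100));
        System.out.println(lastN("id", 40, 100));
    }
}
```

Square brackets make both ends inclusive, so `[lastId-100 TO lastId]` spans 101 id values; use `lastId - 99` as the lower bound if exactly 100 are wanted.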

Re: sorting by date (XML)

2004-04-27 Thread Michael Wechner
Nader S. Henein wrote: Here's my two cents on this: Both ways you will need to combine the date in one field, but if you use a millisecond representation you will not be able to use the FLOAT sort type and you'll have to use STRING sort (slower) because the millisecond representation is longer than

RE: searching only part of an index

2004-04-27 Thread Nader S. Henein
Are the DOC ids sequential? Or just unique and ascending, I'm thinking like a good little Oracle boy, so does anyone know? -Original Message- From: Ioan Miftode [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 27, 2004 4:55 PM To: Lucene Users List Subject: Re: searching only part of an

Re: searching only part of an index

2004-04-27 Thread Terry Steichen
I think that if you include the indexing timestamp in the Document you create when indexing, you could sort on this and only pick the first 100. Regards, Terry - Original Message - From: Alan Smith [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, April 27, 2004 8:02 AM Subject:

Re: searching only part of an index

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 9:00 AM, Nader S. Henein wrote: Are the DOC ids sequential? Or just unique and ascending, I'm thinking like a good little Oracle boy, so does anyone know? They are unique and ascending. Gaps in id's exist when documents are removed, and then the id's are squeezed back to

RE: searching only part of an index

2004-04-27 Thread Nader S. Henein
So if Alan wants to limit it to the first 100 he can't really use a range search unless he can guarantee that the index is optimized after deletes, but then if his deletion rounds are anything like mine ( every 2 mins) then optimizing it at each delete will make searching the index really slow.

Re: searching only part of an index

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 9:49 AM, Nader S. Henein wrote: So if Alan wants to limit it to the first 100 he can't really use a range search unless he can guarantee that the index is optimized after deletes, but then if his deletion rounds are anything like mine ( every 2 mins) then optimizing it at

BooleanScorer - 32 required/prohibited clause limit

2004-04-27 Thread Tate Avery
Hello, I am using Lucene 1.3 and I ran into the following exception: java.lang.IndexOutOfBoundsException: More than 32 required/prohibited clauses in query. at org.apache.lucene.search.BooleanScorer.add(BooleanScorer.java:98) Is there any easy way to fix/adjust this (like the

Re: searching only part of an index

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 10:24 AM, Erik Hatcher wrote: On Apr 27, 2004, at 9:49 AM, Nader S. Henein wrote: So if Alan wants to limit it to the first 100 he can't really use a range search unless he can guarantee that the index is optimized after deletes, but then if his deletion rounds are anything

Lucene and MS SQL

2004-04-27 Thread hgadm
Dear all, has anyone had experience using Lucene with data stored in MS SQL Server 2000? How do indexing and searching work in that case? Thanks, Holger

Re: phrase search AND term

2004-04-27 Thread Erik Hatcher
Can you provide a simple test case that shows this problem? Did you reindex when upgrading? On Apr 27, 2004, at 11:31 AM, Ioan Miftode wrote: I recently upgraded to lucene 1.4 RC2 because I needed some sorting capabilities. However some phrase searches don't work anymore (the hits don't even

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Doug Cutting
Yukun Song wrote: As is known, Lucene currently uses flat files to store index information. Does anyone have ideas or resources for combining a database (like MySQL or PostgreSQL) with Lucene instead of the current flat index file formats? A few folks have implemented an SQL-based Lucene Directory, but

Re: sorting by date (XML)

2004-04-27 Thread Otis Gospodnetic
Beware of storing timestamps (DateFields, I guess) in Lucene, if you intend to use range queries (xxx TO yyy). Otis --- Michael Wechner [EMAIL PROTECTED] wrote: My XML files contain something like <date><year>2004</year><month>04</month><day>27</day>...</date> and I would like to sort by this date.

Re-associate a token with its source

2004-04-27 Thread Olaia Vázquez Sánchez
Hello, I have documents in XML in which, for each word, I have 4 positions (top, down, left and right) that would let me highlight this word in a jpg image. I want to index these XML documents and highlight the results of the queries in the image, so I need to store these positions for each

Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Otis Gospodnetic wrote: Beware of storing timestamps (DateFields, I guess) in Lucene, if you intend to use range queries (xxx TO yyy). Why? We have attributes that contain iso8601 date strings and when indexing: Date date = isoConv.parse(value, new ParsePosition(0)); String dateString =

Re: sorting by date (XML)

2004-04-27 Thread Otis Gospodnetic
Because having small time units like milliseconds will result in Range query expanding to a large number of BooleanQueries, if you have a lot of documents with unique time stamps. Rounding the timestamp to minutes, hours, or days, can drastically reduce the number of unique time stamps, hence
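The effect of rounding described above can be illustrated without Lucene: if timestamps are reduced to day resolution before indexing, many distinct millisecond values collapse into a single term. The sketch below uses only the JDK; the class name, method names, the `yyyyMMdd` format, and the UTC time zone are assumptions chosen for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Set;
import java.util.TimeZone;
import java.util.TreeSet;

// Sketch: day-resolution terms collapse many millisecond timestamps into few terms.
// Names, format, and time zone are illustrative assumptions, not Lucene API.
class DayGranularity {
    // Formats a timestamp as a day-resolution term such as "20040427" (UTC).
    static String toDayTerm(long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    // Counts the distinct day terms produced by a set of timestamps.
    static int distinctDayTerms(long[] timestamps) {
        Set<String> terms = new TreeSet<>();
        for (long t : timestamps) {
            terms.add(toDayTerm(t));
        }
        return terms.size();
    }

    public static void main(String[] args) {
        long start = 1083024000000L; // 2004-04-27T00:00:00Z
        long[] stamps = new long[1000];
        for (int i = 0; i < stamps.length; i++) {
            stamps[i] = start + i; // 1000 unique millisecond timestamps
        }
        System.out.println(distinctDayTerms(stamps) + " distinct day term(s)");
    }
}
```

Fewer distinct terms in the field means a range query has fewer terms to expand into, which is the saving the post refers to. The trade-off is that queries can no longer distinguish times within the same day.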

Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Otis Gospodnetic wrote: Because having small time units like milliseconds will result in Range query expanding to a large number of BooleanQueries, if you have a lot of documents with unique time stamps. Rounding the timestamp to minutes, hours, or days, can drastically reduce the number of

Read past EOF and negative bufferLength problem (1.4 rc2)

2004-04-27 Thread Joe Berkovitz
Using Lucene 1.4 rc2 I've run into a fatal problem: certain PhraseQueries cause a Read Past EOF exception (see below), while other PhraseQueries enter an infinite loop due to a negative bufferLength field in CSInputStream. Environment is WinXP, JDK 1.4.2. The index is large, incorporating

Re: sorting by date (XML)

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 2:09 PM, Robert Koberg wrote: Otis Gospodnetic wrote: Because having small time units like milliseconds will result in Range query expanding to a large number of BooleanQueries, if you have a lot of documents with unique time stamps. Rounding the timestamp to minutes, hours,

Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Erik Hatcher wrote: On Apr 27, 2004, at 2:09 PM, Robert Koberg wrote: Otis Gospodnetic wrote: Because having small time units like milliseconds will result in Range query expanding to a large number of BooleanQueries, if you have a lot of documents with unique time stamps. Rounding the

RE: BooleanScorer - 32 required/prohibited clause limit

2004-04-27 Thread Tate Avery
Or if I overlooked some previous post or thread that covers this please help me track it down. Thank you, Tate -Original Message- From: Tate Avery [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 27, 2004 10:20 AM To: [EMAIL PROTECTED] Subject: BooleanScorer - 32 required/prohibited

Re: phrase search AND term

2004-04-27 Thread Ioan Miftode
Thank you Doug, the latest CVS works fine. ioan At 12:23 PM 4/27/2004, you wrote: Ioan Miftode wrote: I recently upgraded to lucene 1.4 RC2 because I needed some sorting capabilities. However some phrase searches don't work anymore (the hits don't even have the terms I'm searching on). Try

Re: sorting by date (XML)

2004-04-27 Thread Erik Hatcher
On Apr 27, 2004, at 3:41 PM, Robert Koberg wrote: Oops, I meant to write DateField.timeToString which I use when querying. If I use DateField.dateToString when indexing but timeToString when searching is that a bad practice? I do only need month, day and year. So should I be indexing with

Re: sorting by date (XML)

2004-04-27 Thread Robert Koberg
Erik Hatcher wrote: On Apr 27, 2004, at 3:41 PM, Robert Koberg wrote: Oops, I meant to write DateField.timeToString which I use when querying. If I use DateField.dateToString when indexing but timeToString when searching is that a bad practice? I do only need month, day and year. So should I

Re: sorting by date (XML)

2004-04-27 Thread Michael Wechner
Robert Koberg wrote: Ah. Great - thanks! I see you added it to the wiki. Thanks again :) I guess you mean http://wiki.apache.org/jakarta-lucene/IndexingDateFields Thanks as well Michi This is perfect in my case since iso8601 is in the format: 2004-04-27T01:23:33 Luckily so far, from my

Index directory name

2004-04-27 Thread Narayan, Anand
I am having a problem with using a network path for the index directory. If I use a path of the form //server/indexdir the IndexWriter finds it and indexes documents but the IndexSearcher throws an exception saying it is not a valid path. I cannot use a local path as I need to be able to

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Incze Lajos
On Tue, Apr 27, 2004 at 09:15:05AM -0700, Doug Cutting wrote: Yukun Song wrote: As is known, Lucene currently uses flat files to store index information. Does anyone have ideas or resources for combining a database (like MySQL or PostgreSQL) with Lucene instead of the current flat index file

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Doug Cutting
Incze Lajos wrote: Could anybody summarize what would be the technical pros/cons of a DB-based directory over the flat files? (What I see at the moment is that for some - significant? - performance penalty you'll get an index available over the network for multiple lucene engines -- if I'm right.)

RE: languages supported by lucene 1.2.1 in eclipse help system

2004-04-27 Thread Eric Isakson
I'm assuming what you have is an eclipse plugin that is making use of the eclipse help system. If what you are doing is relying on the lucene eclipse plugin, you may want to look at the help system anyway since it will give you an example of an eclipse plugin that is using the lucene plugin.

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Incze Lajos
On Tue, Apr 27, 2004 at 02:46:22PM -0700, Doug Cutting wrote: Incze Lajos wrote: Could anybody summarize what would be the technical pros/cons of a DB-based directory over the flat files? (What I see at the moment is that for some - significant? - performance penalty you'll get an index

Re: status of LARM project

2004-04-27 Thread Kelvin Tan
As far as I know, LARM is defunct. I read somewhere, perhaps apocryphal, that Clemens got a job which wasn't supportive of his continued development on LARM. AFAIK there aren't any other active developers of LARM (at least at the time it branched off to SF). Otis recently posted to use Nutch

Re: Index directory name

2004-04-27 Thread Gabriela D
I assume you are using a Wintel platform. You may map the directory where your indexes are kept using a persistent connection (this can be done using the NET USE command at the command prompt). This keeps the network connection always open; otherwise Windows will close the connection after

Re: status of LARM project

2004-04-27 Thread Stephane James Vaucher
I suggest you look at: http://www.manageability.org/blog/stuff/open-source-web-crawlers-java From what I know of nutch, it's meant as the basis for a competitor to the big search engines (i.e. google). For a small web site, it might be overkill, especially if it requires you to build from CVS