Re: QueryParser Behavior and Token.setPositionIncrement
On Apr 26, 2004, at 5:16 PM, Norton, James wrote:
> Thanks for the reply. I had reached the same conclusion as you regarding the analyzer for queries (no multiple tokens per position), but I would still regard the behaviour of QueryParser as incorrect.

I agree that it is odd, but given that PhraseQuery doesn't support token positions either, what would be the correct behavior of QueryParser?

Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: need info for database based Lucene but not flat file
There is a Berkeley DB implementation of Lucene's Directory in the jakarta-lucene-sandbox repository.

Erik

On Apr 26, 2004, at 8:35 PM, Yukun Song wrote:
> As is known, Lucene currently uses flat files to store information for indexing. Does anyone have ideas or resources for combining a database (like MySQL or PostgreSQL) with Lucene instead of the current flat index file formats?
>
> Regards,
> Yukun Song
RE: need info for database based Lucene but not flat file
Because Lucene implements its own concept of a document, it is not tied to any particular type of data source. It is up to you to write a tool that reads your database and submits the data to the Lucene indexer as Lucene documents.

For example, if your database contains a customer entity and you want to index all information about these customers, you can write a module that performs a select on the customer table and, for each row returned, creates a Lucene Document and adds it to the IndexWriter. It is recommended that each Lucene Document contain a keyword Field holding the unique id of the customer in the database.

As a first step, you should become familiar with the concepts of Document and Field. See the Lucene short intro documentation.
Re: what web crawler work best with Lucene?
Tuan Jean Tee wrote:
> Has anyone implemented any open source web crawler with Lucene? I have a dynamic website and am looking at putting in a search tool. Your advice is very much appreciated.

There is a crawler included within Apache Lenya:
http://cocoon.apache.org/lenya/
src/java/org/apache/lenya/search/crawler/*

or you might try LARM:
http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html

HTH
Michi

> Thank you.

--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http://www.wyona.com http://cocoon.apache.org/lenya/
[EMAIL PROTECTED][EMAIL PROTECTED]
sorting by date (XML)
My XML files contain something like

<date><year>2004</year><month>04</month><day>27</day>...</date>

and I would like to sort by this date. So I guess I need to modify the Documentparser and generate something like a millisecond field and then sort by this, correct? Has anyone done something like this yet?

Thanks
Michi
searching only part of an index
Hi

I wondered if anyone knows whether it is possible to search ONLY the 100 (or whatever) most recently added documents to a lucene index? I know that once I have all my results ordered by ID number in Hits I could then just display the required amount, but I wondered if there is a way to avoid searching all documents in the index in the first place?

Many thanks
Alan
RE: sorting by date (XML)
Here's my two cents on this: either way you will need to combine the date into one field. If you use a millisecond representation, you will not be able to use the FLOAT sort type and will have to use the STRING sort (slower), because the millisecond representation is longer than FLOAT allows. So you have three options:

1) Use YYYYMMDD and sort by FLOAT type
2) Use the millisecond representation and sort by STRING type
3) If the date you're entering here is the date of indexing, then you can just sort by DOC type (which is the DOC ID) and save yourself the pain

Hope this helps.
Nader Henein
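A small check on option 1 above, offered as a sketch and not as a statement about how Lucene's FLOAT sort is implemented: a Java float carries 24 bits of precision, so not every whole number above 16,777,216 is representable, and an eight-digit year-month-day value such as 20040427 falls in that range. The class name FloatDateCheck and its method are hypothetical, used only for illustration.

```java
public class FloatDateCheck {

    /** True when two distinct integer date keys convert to the same float value. */
    public static boolean collides(int a, int b) {
        return a != b && (float) a == (float) b;
    }

    public static void main(String[] args) {
        // Between 2^24 and 2^25 consecutive floats are 2 apart,
        // so an odd value such as 20040427 is rounded to an even neighbour.
        System.out.println(collides(20040427, 20040428));
        System.out.println(collides(20040426, 20040428));

        // Epoch milliseconds for 2004-04-27 00:00 UTC do not even fit in an int.
        long millis = 1083024000000L;
        System.out.println(millis > Integer.MAX_VALUE);
    }
}
```

If such collisions matter for your data, sorting the same eight-digit key as an integer or as a string avoids them; whether your Lucene version offers an integer sort type is something to confirm in its javadoc.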
RE: searching only part of an index
You may be able to jimmy the bi filter to produce the most recent 100, but really keeping your fetch count at 100 and ordering by DOC should be sufficient.
Re: searching only part of an index
If you know the id of the last document in the index (I don't know the best way to get it), you could probably use a range query: something like finding all docs with an id in [lastId-100 TO lastId]. You should make sure that the lower limit is non-negative, though.

Just a thought.
ioan
Re: sorting by date (XML)
Nader S. Henein wrote:
> Here's my two cents on this: Both ways you will need to combine the date in one field, but if you use a millisecond representation you will not be able to use the FLOAT sort type and you'll have to use STRING sort (slower) because the millisecond representation is longer than FLOAT allows, so you have three options:
> 1) Use YYYYMMDD and sort by FLOAT type

OK, I guess I will take the FLOAT type then.

> 2) Use the millisecond representation and sort by STRING type
> 3) If the date you're entering here is the date of indexing then you can just sort by DOC type (which is the DOC ID) and save yourself the pain

Unfortunately this isn't possible.

Thanks a lot for your help
Michi
RE: searching only part of an index
Are the DOC ids sequential, or just unique and ascending? I'm thinking like a good little Oracle boy, so does anyone know?
Re: searching only part of an index
I think that if you include the indexing timestamp in the Document you create when indexing, you could sort on this and only pick the first 100.

Regards,
Terry
Re: searching only part of an index
On Apr 27, 2004, at 9:00 AM, Nader S. Henein wrote:
> Are the DOC ids sequential? Or just unique and ascending, I'm thinking like a good little Oracle boy, so does anyone know?

They are unique and ascending. Gaps in ids exist when documents are removed, and the ids are then squeezed back to completely sequential, with no holes, during an optimize.

Erik
RE: searching only part of an index
So if Alan wants to limit it to the first 100, he can't really use a range search unless he can guarantee that the index is optimized after deletes. But if his deletion rounds are anything like mine (every 2 mins), then optimizing at each delete will make searching the index really slow. Right?

Nader
Re: searching only part of an index
On Apr 27, 2004, at 9:49 AM, Nader S. Henein wrote:
> So if Alan wants to limit it to the first 100 he can't really use a range search unless he can guarantee that the index is optimized after deletes, but then if his deletion rounds are anything like mine (every 2 mins) then optimizing it at each delete will make searching the index really slow. Right?

Well, if you know how many you've deleted, then a range would work :)

(number of docs in index minus 100 minus number deleted = starting range for doc id)
BooleanScorer - 32 required/prohibited clause limit
Hello,

I am using Lucene 1.3 and I ran into the following exception:

java.lang.IndexOutOfBoundsException: More than 32 required/prohibited clauses in query.
    at org.apache.lucene.search.BooleanScorer.add(BooleanScorer.java:98)

Is there any easy way to fix/adjust this (like the BooleanQuery.maxClauseCount, for example)? Strangely, I couldn't find mention of the BooleanScorer class in my javadoc.

Thank you for any tips.
Tate

p.s. Yes, I am intentionally generating some rather long boolean queries. :)
Re: searching only part of an index
On Apr 27, 2004, at 10:24 AM, Erik Hatcher wrote:
> On Apr 27, 2004, at 9:49 AM, Nader S. Henein wrote:
>> So if Alan wants to limit it to the first 100 he can't really use a range search unless he can guarantee that the index is optimized after deletes, but then if his deletion rounds are anything like mine (every 2 mins) then optimizing it at each delete will make searching the index really slow. Right?
>
> Well, if you know how many you've deleted, then a range would work :)
> (number of docs in index minus 100 minus number deleted = starting range for doc id)

On second thought - this is incorrect - my apologies. To be clever, you'd have to know which positions the deleted documents were in and account for them accordingly.

Erik
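To make the correction above concrete, here is a minimal sketch that walks backwards from the highest document id and skips deleted positions until it has collected the requested number of live ids. It deliberately does not use Lucene: the deleted positions are modelled with a java.util.BitSet, and the class RecentDocIds and its method are hypothetical names. In Lucene the inputs would presumably come from IndexReader.maxDoc() and IndexReader.isDeleted(int), but that mapping is an assumption to check against your version.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class RecentDocIds {

    /**
     * Returns up to n live document ids, newest first.
     * maxDoc is one greater than the largest id ever assigned;
     * deleted has a bit set for every id that has been removed.
     */
    public static List<Integer> mostRecentLive(int maxDoc, BitSet deleted, int n) {
        List<Integer> ids = new ArrayList<>();
        for (int id = maxDoc - 1; id >= 0 && ids.size() < n; id--) {
            if (!deleted.get(id)) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(9);
        deleted.set(7);
        // A fixed range [maxDoc - 3, maxDoc - 1] = [7, 9] would contain only one live id (8).
        // Skipping deleted positions yields three live ids instead.
        System.out.println(mostRecentLive(10, deleted, 3));
    }
}
```

The point of the sketch is only that a count of deletions is not enough: where the gaps sit determines how far back the range has to reach.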
Lucene and MS SQL
Dear all,

has anyone had experience using Lucene with data stored in MS SQL Server 2000? How do indexing and searching work in that case?

Thanks,
Holger
Re: phrase search AND term
Can you provide a simple test case that shows this problem? Did you reindex when upgrading?

On Apr 27, 2004, at 11:31 AM, Ioan Miftode wrote:
> I recently upgraded to lucene 1.4 RC2 because I needed some sorting capabilities. However some phrase searches don't work anymore (the hits don't even have the terms I'm searching on). They were fine when using 1.3final. I noticed it happens when I combine a phrase search with a simple term like this:
>
> field1:"some phrase search" AND field2:term
>
> Has anyone experienced anything similar? Any thoughts?
>
> thanks
> ioan
Re: need info for database based Lucene but not flat file
Yukun Song wrote:
> As is known, Lucene currently uses flat files to store information for indexing. Does anyone have ideas or resources for combining a database (like MySQL or PostgreSQL) with Lucene instead of the current flat index file formats?

A few folks have implemented an SQL-based Lucene Directory, but none has yet been contributed to Lucene. Hopefully one will be soon. For some discussion of this, see messages on SQLDirectory in the mail archives:

http://nagoya.apache.org/eyebrowse/SearchList?listId=&listName=lucene-user%40jakarta.apache.org&searchText=SQLDirectory&defaultField=subject&Search=Search

Doug
Re: sorting by date (XML)
Beware of storing timestamps (DateFields, I guess) in Lucene, if you intend to use range queries (xxx TO yyy).

Otis
Re-associate a token with its source
Hello,

I have XML documents in which each word has 4 positions (top, bottom, left and right) that would let me highlight the word in a JPG image. I want to index these XML documents and highlight the query results in the image, so I need to store these positions for each word inside the index. I have been looking into how I can use the Token fields to store these attributes, but I didn't find any example where these fields were used.

Thanks,
Olaia Vázquez
Re: sorting by date (XML)
Otis Gospodnetic wrote:
> Beware of storing timestamps (DateFields, I guess) in Lucene, if you intend to use range queries (xxx TO yyy).

Why? We have attributes that contain iso8601 date strings, and when indexing we do:

    Date date = isoConv.parse(value, new ParsePosition(0));
    String dateString = DateField.dateToString(date);
    doc.add(Field.Keyword(name, dateString));

then when searching:

    String from = DateField.timeToString(searchFromDate);
    String to = DateField.timeToString(searchToDate);
    RangeQuery rq = new RangeQuery(new Term(searchKey, from), new Term(searchKey, to), true);

Is this not correct?

best,
-Rob
Re: sorting by date (XML)
Because having small time units like milliseconds will result in a range query expanding to a BooleanQuery with a large number of clauses, if you have a lot of documents with unique time stamps. Rounding the timestamp to minutes, hours, or days can drastically reduce the number of unique time stamps, and hence results in fewer clauses.

Otis
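The effect of rounding described above can be illustrated without Lucene. The sketch below truncates epoch-millisecond timestamps to a chosen granularity and counts how many distinct values remain; each distinct value stands in for one unique indexed term that a range query might have to enumerate. The class TimestampRounding and its methods are hypothetical names, and day boundaries here are UTC, which is an assumption you may need to adjust.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;

public class TimestampRounding {

    /** Truncates an epoch-millisecond timestamp down to the start of its unit (UTC). */
    public static long roundDown(long epochMillis, TimeUnit unit) {
        long size = unit.toMillis(1);
        return Math.floorDiv(epochMillis, size) * size;
    }

    /** Counts the distinct values left after rounding every timestamp to the given unit. */
    public static int distinctTerms(long[] timestamps, TimeUnit unit) {
        Set<Long> terms = new HashSet<>();
        for (long t : timestamps) {
            terms.add(roundDown(t, unit));
        }
        return terms.size();
    }

    public static void main(String[] args) {
        // 1000 documents stamped one second apart.
        long[] stamps = new long[1000];
        for (int i = 0; i < stamps.length; i++) {
            stamps[i] = i * 1000L;
        }
        System.out.println(distinctTerms(stamps, TimeUnit.MILLISECONDS)); // 1000
        System.out.println(distinctTerms(stamps, TimeUnit.MINUTES));      // 17
        System.out.println(distinctTerms(stamps, TimeUnit.DAYS));         // 1
    }
}
```

Which granularity is appropriate depends on how precise the date searches need to be; the coarser the unit, the fewer unique terms end up in the index.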
Re: sorting by date (XML)
Otis Gospodnetic wrote:
> Because having small time units like milliseconds will result in a range query expanding to a BooleanQuery with a large number of clauses, if you have a lot of documents with unique time stamps. Rounding the timestamp to minutes, hours, or days can drastically reduce the number of unique time stamps, and hence results in fewer clauses.

Cool, thanks. So DateField.dateToString is the best, most efficient way, correct?
Read past EOF and negative bufferLength problem (1.4 rc2)
Using Lucene 1.4 rc2 I've run into a fatal problem: certain PhraseQueries cause a "read past EOF" exception (see below), while other PhraseQueries enter an infinite loop due to a negative bufferLength field in CSInputStream.

Environment is WinXP, JDK 1.4.2. The index is large, incorporating 1,000,000 documents, each of which has 3 stored, indexed fields of 10-100 chars.

The problem does not occur with Lucene 1.3 indexing the exact same set of Documents. Nor does it occur with 1.4 rc2 using various smaller sets of documents. Right now my workaround is to use Lucene 1.3.

For the PhraseQuery "a y" (that's right, two single-letter terms), the read-past-EOF exception is as follows:

java.io.IOException: read past EOF
    at org.apache.lucene.store.InputStream.refill(InputStream.java:154)
    at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
    at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:59)
    at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:187)
    at org.apache.lucene.search.PhrasePositions.skipTo(PhrasePositions.java:47)
    at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:69)
    at org.apache.lucene.search.Scorer.score(Scorer.java:37)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:81)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
    at org.apache.lucene.search.Hits.<init>(Hits.java:43)
    at org.apache.lucene.search.Searcher.search(Searcher.java:33)
    at org.apache.lucene.search.Searcher.search(Searcher.java:27)
    at...

For the phrase query "z y", an infinite loop is entered. The loop occurs due to a condition similar to read-past-EOF: at line 153 of org.apache.lucene.store.InputStream, the value of bufferLength goes negative because the value of start exceeds the value of end. This in turn seems to be a consequence of a seek to a position past the end of the stream.

Something is clearly corrupt somewhere in the index structure. I'd love to post the files that reproduce the problem, but it's about 100 MB of data. If someone on the Lucene dev team wants to give me an upload destination, I can post the index somewhere and you can play with the problem.

regards and thanks for any assistance,

Joe Berkovitz
Chief Architect
Ruckus Network, Inc.
Re: sorting by date (XML)
On Apr 27, 2004, at 2:09 PM, Robert Koberg wrote:
> Otis Gospodnetic wrote:
>> Because having small time units like milliseconds will result in a range query expanding to a BooleanQuery with a large number of clauses, if you have a lot of documents with unique time stamps. Rounding the timestamp to minutes, hours, or days can drastically reduce the number of unique time stamps, and hence results in fewer clauses.
>
> Cool, thanks. So DateField.dateToString is the best, most efficient way, correct?

It all depends. But if all you care about is year, month, and day, it is _not_ the most efficient. DateField converts down to milliseconds, which is what Otis was referring to.

Erik
Re: sorting by date (XML)
Erik Hatcher wrote:
> It all depends. But if all you care about is year, month, day, it is _not_ the most efficient. DateField converts down to milliseconds, and is what Otis was referring to.

Oops, I meant to write DateField.timeToString, which I use when querying. If I use DateField.dateToString when indexing but timeToString when searching, is that a bad practice?

I only need month, day and year. So should I be indexing with timeToString? How would you do it if the above is still a bad practice?

Sorry for the basic questions...

best,
-Rob
RE: BooleanScorer - 32 required/prohibited clause limit
Or, if I overlooked some previous post or thread that covers this, please help me track it down.

Thank you,
Tate
Re: phrase search AND term
Thank you Doug, the latest CVS works fine.

ioan

At 12:23 PM 4/27/2004, you wrote:
> Ioan Miftode wrote:
>> I recently upgraded to lucene 1.4 RC2 because I needed some sorting capabilities. However some phrase searches don't work anymore (the hits don't even have the terms I'm searching on).
>
> Try the latest CVS. There were some bugs in 1.4RC2 that have been fixed. (We'll probably do an RC3 release soon. There are currently some bugs in span search that would be good to get fixed in RC3, but perhaps these will have to wait until RC4...)
>
> Doug
Re: sorting by date (XML)
On Apr 27, 2004, at 3:41 PM, Robert Koberg wrote:
> Oops, I meant to write DateField.timeToString which I use when querying. If I use DateField.dateToString when indexing but timeToString when searching is that a bad practice? I do only need month, day and year. So should I be indexing with timeToString? How would you do it if the above is still a bad practice? Sorry for the basic questions...

No worries. This is the type of thing that is a gotcha with dates, and is a prime candidate for a wiki page (nudge, nudge)...

You should represent dates (at index and search time) using YYYYMMDD format - it needs to be lexicographically ordered. Forget DateField and Field.Keyword(String,Date) altogether.

Some tricks are needed if you need QueryParser to translate mm/dd/yyyy format to how you represent it, but it is quite simple (subclass QueryParser, override getRangeQuery).

Erik
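As a rough illustration of the advice above, the following sketch builds year-month-day keys with the standard java.time classes and shows that plain string comparison on such keys matches chronological order. It does not touch Lucene at all: the QueryParser subclass and getRangeQuery override mentioned in the message are not shown, and the class DateKeys with its methods is a hypothetical helper, not part of any library.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DateKeys {

    // Parser for user input in month/day/year form, e.g. 04/27/2004.
    private static final DateTimeFormatter US_INPUT = DateTimeFormatter.ofPattern("MM/dd/uuuu");

    /** Formats a date as an eight-character key, e.g. 20040427. */
    public static String toKey(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    /** Builds the key from an ISO 8601 timestamp such as 2004-04-27T01:23:33. */
    public static String keyFromIso(String isoTimestamp) {
        return toKey(LocalDateTime.parse(isoTimestamp).toLocalDate());
    }

    /** Builds the key from month/day/year user input. */
    public static String keyFromUsInput(String input) {
        return toKey(LocalDate.parse(input, US_INPUT));
    }

    /** Inclusive range test using only lexicographic string comparison. */
    public static boolean inRange(String key, String from, String to) {
        return key.compareTo(from) >= 0 && key.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        String key = keyFromIso("2004-04-27T01:23:33");
        System.out.println(key);
        System.out.println(inRange(key, keyFromUsInput("04/01/2004"), keyFromUsInput("04/30/2004")));
    }
}
```

The same key would be written into the index at indexing time and produced from user input at search time, so both sides of a range comparison use one representation.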
Re: sorting by date (XML)
Erik Hatcher wrote:
> You should represent dates (at index and search time) using YYYYMMDD format - it needs to be lexicographically ordered. Forget DateField and Field.Keyword(String,Date) altogether. Some tricks are needed if you need QueryParser to translate mm/dd/yyyy format to how you represent it, but it is quite simple (subclass QueryParser, override getRangeQuery).

Ah. Great - thanks! I see you added it to the wiki. Thanks again :)

This is perfect in my case since iso8601 is in the format: 2004-04-27T01:23:33

Luckily so far, from my logs, hardly anyone uses the date search. I guess I should have been doing this from the beginning, don't know why I didn't...

best,
-Rob
Re: sorting by date (XML)
Robert Koberg wrote: Ah. Great - thanks! I see you added it to the wiki. Thanks again :)
I guess you mean http://wiki.apache.org/jakarta-lucene/IndexingDateFields
Thanks as well, Michi
Robert Koberg wrote: This is perfect in my case since ISO 8601 is in the format 2004-04-27T01:23:33. Luckily, so far, from my logs, hardly anyone uses the date search. I guess I should have been doing this from the beginning; I don't know why I didn't... best, -Rob
-- Michael Wechner, Wyona Inc. - Open Source Content Management - Apache Lenya http://www.wyona.com http://cocoon.apache.org/lenya/
Index directory name
I am having a problem with using a network path for the index directory. If I use a path of the form //server/indexdir, the IndexWriter finds it and indexes documents, but the IndexSearcher throws an exception saying it is not a valid path. I cannot use a local path, as I need to be able to support a common index directory for a clustered environment. What is the best solution in this case? Thanks, Anand
Re: need info for database based Lucene but not flat file
On Tue, Apr 27, 2004 at 09:15:05AM -0700, Doug Cutting wrote: Yukun Song wrote: As is known, Lucene currently uses flat files to store index information. Does anyone have ideas or resources for combining a database (like MySQL or PostgreSQL) with Lucene instead of the current flat index file format? A few folks have implemented an SQL-based Lucene Directory, but none has yet been contributed to Lucene. Hopefully one will be soon. For some discussion of this, see messages on SQLDirectory in the mail archives: http://nagoya.apache.org/eyebrowse/SearchList?listId=listName=lucene-user%40jakarta.apache.orgsearchText=SQLDirectorydefaultField=subjectSearch=Search Doug
Could anybody summarize the technical pros and cons of a DB-based directory compared with the flat files? (What I see at the moment is that, for some (significant?) performance penalty, you get an index available over the network to multiple Lucene engines, if I'm right.)
incze
Re: need info for database based Lucene but not flat file
Incze Lajos wrote: Could anybody summarize the technical pros and cons of a DB-based directory compared with the flat files? (What I see at the moment is that, for some (significant?) performance penalty, you get an index available over the network to multiple Lucene engines, if I'm right.)
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1344168
Doug
RE: languages supported by lucene 1.2.1 in eclipse help system
I'm assuming what you have is an Eclipse plugin that makes use of the Eclipse help system. If you are relying on the Lucene Eclipse plugin instead, you may want to look at the help system anyway, since it gives you an example of an Eclipse plugin that uses the Lucene plugin.
The Eclipse help system uses Lucene, but it has its own Analyzer class that uses BreakIterator to identify tokens for languages other than English and German. The Lucene Eclipse plugin just exports the Lucene jar and the HTML parser, so that any plugin that depends on the Lucene plugin (like the help system) has those jars on its classpath. For English they use the PorterStemFilter with a StopAnalyzer and a stopword list. For German, they use the GermanAnalyzer supplied by the Lucene jar.
In the latest CVS at :pserver:[EMAIL PROTECTED]:/home/eclipse, see the project in org.eclipse.help.base/src/org/eclipse/help/internal/search. For older Eclipse versions, see the R2_1_maintenance branch of org.eclipse.help/src/org/eclipse/help/internal/search. The class DefaultAnalyzer is the analyzer implementation for languages other than English and German, and WordTokenStream is where they use BreakIterator to break the content from the reader into individual tokens.
The default Eclipse help system sets these extensions in the org.eclipse.help.base plugin:
<!-- Text Analyzers for search -->
<extension id="org.eclipse.help.base.Analyzer_en" point="org.eclipse.help.base.luceneAnalyzer">
  <analyzer locale="en" class="org.eclipse.help.internal.search.Analyzer_en">
  </analyzer>
</extension>
<extension id="org.eclipse.help.base.Analyzer_de" point="org.eclipse.help.base.luceneAnalyzer">
  <analyzer locale="de" class="org.apache.lucene.analysis.de.GermanAnalyzer">
  </analyzer>
</extension>
Look at the extension point schema in http://dev.eclipse.org/viewcvs/index.cgi/~checkout~/org.eclipse.help.base/schema/luceneAnalyzer.exsd?rev=HEADcontent-type=text/plain for how to declare your own analyzer extensions.
Beware though: I read that this affects all help searches in that language, not just the ones for your plugin. Also, since WordTokenStream is in a package with "internal" in its path, you aren't supposed to use that class from other plugins. If you wanted your own analyzer based on that class and a stop list, you shouldn't use it without talking the Eclipse help developers into moving it out of the internal package.
Most of this has been around for a while, so it is probably the same or very similar in previous Eclipse versions. You may need to poke around in the extension point schema in your Eclipse plugins directory to verify that the extension point works the same way in your version of Eclipse. I haven't used it in versions prior to 3.0M8.
Hope this is useful to you, Eric
-----Original Message----- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Saturday, April 24, 2004 10:18 AM To: Lucene Users List Subject: Re: languages supported by lucene 1.2.1 in eclipse help system
That's no myth :) Core Lucene (even the current version) does not include classes that know how to analyze/tokenize text in languages other than English, Russian, and German. However, take a look at the Snowball contributions in the Lucene Sandbox, where a few more analyzers are available, including those for the CJK group of languages. Otis
--- Jason Elliott [EMAIL PROTECTED] wrote: We have a plugin in our Eclipse project named org.apache.lucene_1.2.1. It works quite well in that help system. I've been notified that this particular version of the Lucene search analyzer searches well in German and English (GE), but not so well in the rest of the languages on this planet. I have several questions:
1. If it does not search very well in French, Italian and Japanese (FIJ), what does that really mean to a user conducting searches?
a. If this is a myth and the searches work the same in EFIG-J, please let me know that.
b. If this is not a myth, are there plugins that enable the search to work well in FIJ?
Thanks, jason
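Editor's sketch of the BreakIterator-based tokenizing that Eric describes for languages other than English and German. This is not the Eclipse WordTokenStream source; the class and method names (WordTokens, tokenize) are hypothetical, and the lower-casing and the "contains a letter or digit" filter are assumptions made for the illustration. It uses only java.text.BreakIterator from the JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: locale-aware word splitting with java.text.BreakIterator. */
final class WordTokens {
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        // Walk consecutive boundaries; each [start, end) span is a word, a space, or punctuation.
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // Keep spans that contain at least one letter or digit; drop whitespace and punctuation.
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(candidate.toLowerCase(locale));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, big world!", Locale.ENGLISH)); // [hello, big, world]
    }
}
```

Word boundaries found this way involve no stemming or stop-word removal, which is one reason searches in such languages can behave differently from the English and German cases.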
Re: need info for database based Lucene but not flat file
On Tue, Apr 27, 2004 at 02:46:22PM -0700, Doug Cutting wrote: Incze Lajos wrote: Could anybody summarize the technical pros and cons of a DB-based directory compared with the flat files? (What I see at the moment is that, for some (significant?) performance penalty, you get an index available over the network to multiple Lucene engines, if I'm right.) http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1344168 Doug
Thanks.
incze
Re: status of LARM project
As far as I know, LARM is defunct. I read somewhere (perhaps it is apocryphal) that Clemens got a job which wasn't supportive of his continued development on LARM. AFAIK there aren't any other active developers of LARM (at least at the time it branched off to SF). Otis recently posted a suggestion to use Nutch instead of LARM. Kelvin
On 28 Apr 2004 09:44:04 +0800, Sebastian Ho said: Hi, I have looked at the LARM website and I get different results. http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages says that development has stopped for this project. LARM hosted on SourceForge: the last message in the mailing list was dated 2003. Is it still supported and active? LARM hosted on Apache: it says the project has moved to SourceForge. Can anyone here who is active in LARM comment on the status? Regards, Sebastian Ho
Re: Index directory name
I assume you are using a Wintel platform. You may map the directory where your indexes are kept using a persistent connection (this can be done with the NET USE command at the command prompt). This keeps the network connection always open; otherwise Windows closes the connection after some time (though it is still manually accessible). You can notice this in the Explorer window, where you will find a red cross mark against the mapped network drive. Harsha.
Narayan, Anand [EMAIL PROTECTED] wrote: I am having a problem with using a network path for the index directory. If I use a path of the form //server/indexdir, the IndexWriter finds it and indexes documents, but the IndexSearcher throws an exception saying it is not a valid path. I cannot use a local path, as I need to be able to support a common index directory for a clustered environment. What is the best solution in this case? Thanks, Anand
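Editor's sketch of a quick check that may help narrow down a problem like this one: it reports how the JVM itself sees an index path, using only java.io.File and no Lucene calls. The class and method names (IndexDirCheck, describe) are hypothetical, and the check for a file named "segments" assumes the index layout written by Lucene releases of this period. It does not explain the IndexSearcher exception; it only shows whether the path resolves from the machine doing the searching.

```java
import java.io.File;

/** Illustrative diagnostic (not part of Lucene): how does this JVM see the index path? */
final class IndexDirCheck {
    static String describe(String path) {
        File dir = new File(path);
        if (!dir.exists()) {
            return "missing: " + dir.getAbsolutePath();
        }
        if (!dir.isDirectory()) {
            return "not a directory: " + dir.getAbsolutePath();
        }
        if (!dir.canRead()) {
            return "not readable: " + dir.getAbsolutePath();
        }
        // Assumption: an index written by Lucene of this era contains a file named "segments".
        boolean hasSegments = new File(dir, "segments").isFile();
        return (hasSegments ? "index found: " : "directory without segments file: ")
                + dir.getAbsolutePath();
    }

    public static void main(String[] args) {
        // Pass the path under test, for example the network path or a mapped drive letter.
        System.out.println(describe(args.length > 0 ? args[0] : "//server/indexdir"));
    }
}
```

Running it once with the network path and once with the mapped drive, under the same account as the searching process, shows whether both forms resolve to the same directory.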
Re: status of LARM project
I suggest you look at: http://www.manageability.org/blog/stuff/open-source-web-crawlers-java From what I know of Nutch, it's meant as the basis for a competitor to the big search engines (i.e. Google). For a small web site it might be overkill, especially if it requires you to build from CVS (unless there are distributions). Note: I've got the book Programming Spiders, Bots and Aggregators in Java; it describes spiders using a project called j-spider: http://sourceforge.net/projects/j-spider/ It could probably be adapted for your needs. HTH, sv
On Wed, 28 Apr 2004, Kelvin Tan wrote: As far as I know, LARM is defunct. I read somewhere (perhaps it is apocryphal) that Clemens got a job which wasn't supportive of his continued development on LARM. AFAIK there aren't any other active developers of LARM (at least at the time it branched off to SF). Otis recently posted a suggestion to use Nutch instead of LARM. Kelvin
On 28 Apr 2004 09:44:04 +0800, Sebastian Ho said: Hi, I have looked at the LARM website and I get different results. http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages says that development has stopped for this project. LARM hosted on SourceForge: the last message in the mailing list was dated 2003. Is it still supported and active? LARM hosted on Apache: it says the project has moved to SourceForge. Can anyone here who is active in LARM comment on the status? Regards, Sebastian Ho