Deleting documents that meet a query
I need to delete all the documents from an index that satisfy a BooleanQuery. The only methods I can find (in IndexReader) for deleting a document are delete(Term) and delete(int). I tried searching on my Query with IndexSearcher.search(), then iterating over the returned Hits and deleting each document like this:

    for (int i = 0; i < hits.length(); ++i) { ireader.delete(hits.id(i)); }

I was hoping that the value returned by Hits.id(int) is the docnum expected by IndexReader.delete(int), but the call to delete throws an IOException. So, is there any way I can delete all the documents from an index that satisfy a general Query? Thank you for any advice. Bruce Cota, Unicon, Inc. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Deleting documents that meet a query
Hi, I can confirm that delete(hits.id(i)) is fine: Hits.id(int) returns the docnum that you need. MHF :)

On Mon, 2003-06-09 at 12:11, Bruce Cota wrote:
> I need to delete all the documents from an index that satisfy a BooleanQuery. [...]
Re: Deleting documents that meet a query
Thanks. I will restore that code and try to figure out why it broke :) (Because my alternative solution was far uglier.)

Marie-Hélène Forget wrote:
> Hi, I can confirm that delete(hits.id(i)) is fine: Hits.id(int) returns the docnum that you need. [...]
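For readers of the archive, the approach this thread converges on can be sketched as below, using the Lucene 1.x API names the messages mention (IndexReader.delete(int), IndexSearcher.search(Query), Hits). The class and method layout here is illustrative, not from the thread; one common cause of an IOException on delete is that another process, such as an open IndexWriter, holds the write lock on the same index.

```java
// Sketch of delete-by-query against the Lucene 1.x API discussed above.
// Assumes no IndexWriter (or other writer) holds the write lock on the index.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import java.io.IOException;

public class DeleteByQuery {
    /** Deletes every document matching the query; returns the number deleted. */
    public static int deleteMatching(String indexDir, Query query) throws IOException {
        IndexReader reader = IndexReader.open(indexDir);
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); ++i) {
                reader.delete(hits.id(i)); // Hits.id(i) is the internal docnum
            }
            return hits.length();
        } finally {
            reader.close(); // deletions are committed when the reader is closed
        }
    }
}
```

Note that deletions made through an IndexReader only become visible to new readers once the deleting reader is closed.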
Re: High Capacity (Distributed) Crawler
Leo, Have you started this project? Where is it hosted? It would be nice to see a few alternative implementations of a robust and scalable Java web crawler with the ability to index whatever it fetches. Thanks, Otis

--- Leo Galambos [EMAIL PROTECTED] wrote:
Hi. I would like to write $SUBJ (HCDC), because LARM does not offer many options which are required by web/HTTP crawling, IMHO. Here is my list:
1. I would like to manage the decision of what will be gathered first - this would be based on PageRank, number of errors, connection speed, etc.
2. a pure Java solution without any DBMS/JDBC
3. better configuration in case of an error
4. NIO style, as suggested by the LARM specification
5. egothor's filters for automatic processing of various data formats
6. management of Expires HTTP meta headers, with heuristic rules describing how fast a page can expire (.php often expires faster than .html)
7. reindexing without any data exports from a full-text index
8. an open protocol between the crawler and a full-text engine
If anyone wants to join (or just extend the wish list), let me know, please. -g-
Re: High Capacity (Distributed) Crawler
Hi Otis. The first beta is done (without NIO). It needs further testing, however. Unfortunately, I could not find enough servers which I may hit. I wanted to commit the robot as a part of egothor (it will use it in PULL mode), but we have nice weather here, so I lost any motivation to play with the PC ;-). What interface do you need for Lucene? Will you use PUSH mode (the robot will modify Lucene's index) or PULL mode (the engine will get deltas from the robot)? Tell me what you need and I will try to do my best. -g-

Otis Gospodnetic wrote:
> Leo, Have you started this project? Where is it hosted? [...]
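The PUSH/PULL distinction Leo raises could be expressed roughly as the two interfaces below. This is purely an illustrative sketch: none of these type or method names exist in LARM, egothor, or Lucene, and the real "open protocol" (wish-list item 8) was never specified in this thread.

```java
// Illustrative-only sketch of the two crawler/engine integration modes.
// All names here are hypothetical; nothing below is a real API.
import java.util.List;

public interface CrawlerIntegration {

    // PUSH mode: the crawler writes directly into the engine's index.
    interface PushSink {
        void add(Object doc);      // crawler pushes each fetched page
        void delete(String url);   // and retracts pages that have expired
    }

    // PULL mode: the engine periodically asks the crawler for deltas.
    interface PullSource {
        List fetchDeltasSince(long timestamp); // engine pulls changes on its own schedule
    }
}
```

The trade-off sketched here is the usual one: PUSH couples the crawler to one engine's index format, while PULL lets any engine consume the crawler's output at its own pace.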
Duplicating a document in the index.
Hi list, Is there an easy way to duplicate a document in the index? Or can someone point me in the right direction? Thanks, -- Victor Hadianto NUIX Pty Ltd Level 8, 143 York Street, Sydney 2000 Phone: (02) 9283 9010 Fax: (02) 9283 9020
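One way to approach this with the Lucene 1.x API would be to read back the stored fields of the existing document and add them as a new document. The sketch below assumes that is acceptable, with an important caveat: IndexReader.document(int) returns only *stored* fields, so any field that was indexed but not stored cannot be reconstructed this way.

```java
// Sketch of duplicating a document in a Lucene 1.x index by copying its
// stored fields. Fields that were indexed but not stored are lost.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

import java.io.IOException;

public class DuplicateDocument {
    public static void duplicate(String indexDir, int docNum) throws IOException {
        Document copy;
        IndexReader reader = IndexReader.open(indexDir);
        try {
            copy = reader.document(docNum); // stored fields only
        } finally {
            reader.close();
        }
        // false = append to the existing index rather than create a new one
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        try {
            writer.addDocument(copy); // re-analyzes the stored values
        } finally {
            writer.close();
        }
    }
}
```

Note that the copy is re-analyzed from the stored values on the way back in, so field boosts and any unstored content from the original are not preserved.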