Deleting documents that meet a query

2003-06-09 Thread Bruce Cota
I need to delete all the documents from an index that
satisfy a BooleanQuery.
The only methods I can find (in IndexReader) for deleting
a document are delete(Term) and delete(int).
I tried searching on my Query using IndexSearcher.search(),
then iterating over the returned Hits and deleting each
document like this:
for (int i = 0; i < hits.length(); ++i) {
   ireader.delete(hits.id(i));
}
I was hoping here that the value returned by
Hits.id(int) is the docnum expected by
IndexReader.delete(int), but the call to delete
throws an IOException.

So, is there any way I can delete all the documents from
an Index that satisfy a general Query?
Thank you for any advice.

Bruce Cota,
Unicon, Inc.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Deleting documents that meet a query

2003-06-09 Thread Marie-Hélène Forget
Hi,

I can confirm that delete( hits.id( i ) ) is OK.

Hits.id( int ) returns the docnum that you need.

MHF :)
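For reference, the pattern discussed above can be sketched as below, against the Lucene 1.x API of the time. The index path and query string are illustrative, and the note about the write lock is an assumption about the IOException, not something confirmed in the thread: an IndexWriter still open on the same index holds the write lock, and deleting through an IndexReader will then fail.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DeleteByQuery {
    public static void main(String[] args) throws Exception {
        String indexDir = "index";   // illustrative path

        // Any IndexWriter on this index must be closed first, or its
        // write lock can make reader.delete(...) throw an IOException.
        IndexReader reader = IndexReader.open(indexDir);
        IndexSearcher searcher = new IndexSearcher(reader);

        Query query = QueryParser.parse("foo AND bar", "contents",
                                        new StandardAnalyzer());
        Hits hits = searcher.search(query);

        // Hits.id(i) is the docnum that IndexReader.delete(int)
        // expects, as confirmed above.
        for (int i = 0; i < hits.length(); ++i) {
            reader.delete(hits.id(i));
        }

        searcher.close();
        reader.close();   // closing the reader commits the deletions
    }
}
```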




Re: Deleting documents that meet a query

2003-06-09 Thread Bruce Cota
Thanks.  I will restore that code and try to figure out why
it broke :)  (Because my alternative solution was way uglier.)




Re: High Capacity (Distributed) Crawler

2003-06-09 Thread Otis Gospodnetic
Leo,

Have you started this project?  Where is it hosted?
It would be nice to see a few alternative implementations of a robust
and scalable java web crawler with the ability to index whatever it
fetches.

Thanks,
Otis

--- Leo Galambos [EMAIL PROTECTED] wrote:
 Hi.
 
 I would like to write $SUBJ (HCDC), because LARM does not offer many
 options which are required by web/http crawling IMHO. Here is my list:
 
 1. I would like to manage the decision what will be gathered first -
 this would be based on pageRank, number of errors, connection speed,
 etc.
 2. pure JAVA solution without any DBMS/JDBC
 3. better configuration in case of an error
 4. NIO style as it is suggested by LARM specification
 5. egothor's filters for automatic processing of various data formats
 6. management of Expires HTTP-meta headers, heuristic rules which will
 describe how fast a page can expire (.php often expires faster than
 .html)
 7. reindexing without any data exports from a full-text index
 8. open protocol between the crawler and a full-text engine
 
 If anyone wants to join (or just extend the wish list), let me know,
 please.
 
 -g-
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


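Item 6 in the wish list above (expiry heuristics keyed on page type) can be sketched without any crawler machinery. The TTL values and extension table below are made-up illustrations, not anything specified in the thread:

```java
import java.util.HashMap;
import java.util.Map;

public class ExpiryHeuristic {
    // Illustrative default TTLs in seconds, keyed by URL extension.
    private static final Map<String, Integer> TTL = new HashMap<String, Integer>();
    static {
        TTL.put(".php", 3600);      // dynamic pages expire quickly
        TTL.put(".html", 86400);    // static pages can wait a day
    }
    private static final int DEFAULT_TTL = 43200;

    /** Seconds until a page should be refetched, based on its URL. */
    public static int ttlSeconds(String url) {
        int dot = url.lastIndexOf('.');
        if (dot >= 0) {
            Integer ttl = TTL.get(url.substring(dot));
            if (ttl != null) {
                return ttl.intValue();
            }
        }
        return DEFAULT_TTL;
    }

    public static void main(String[] args) {
        System.out.println(ttlSeconds("http://example.com/index.php"));
        System.out.println(ttlSeconds("http://example.com/page.html"));
    }
}
```

In a real crawler an explicit Expires or Cache-Control header would override this heuristic; the extension rule is only the fallback.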



Re: High Capacity (Distributed) Crawler

2003-06-09 Thread Leo Galambos
Hi Otis.

The first beta is done (without NIO). It needs, however, further 
testing. Unfortunately, I could not find enough servers that I may hit.

I wanted to commit the robot as a part of egothor (it will use it in 
PULL mode), but we have nice weather here, so I lost all motivation to 
play with the PC ;-).

What interface do you need for Lucene? Will you use PUSH (= the robot 
will modify Lucene's index) or PULL (= the engine will get deltas from 
the robot) mode? Tell me what you need and I will do my best.
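The PUSH/PULL distinction described above could be expressed as two small interfaces. This is only a sketch of the idea; none of these names come from egothor, LARM, or Lucene:

```java
import java.util.List;

// PUSH mode: the crawler drives, writing fetched pages straight
// into the full-text engine's index.
interface PushIndexer {
    void addPage(String url, String content);
    void removePage(String url);
}

// PULL mode: the engine drives, periodically asking the crawler
// for the changes ("deltas") accumulated since the last poll.
interface DeltaSource {
    List deltasSince(long lastPollMillis);
}

// A single change record handed from crawler to engine in PULL mode.
class PageDelta {
    final String url;
    final String content;   // null when the page was deleted
    PageDelta(String url, String content) {
        this.url = url;
        this.content = content;
    }
}
```

PULL keeps the crawler decoupled from any particular engine, at the cost of the crawler having to buffer deltas between polls.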

-g-








Duplicating a document in the index.

2003-06-09 Thread Victor Hadianto
Hi list,

Is there an easy way to duplicate a document in the index? Or can someone 
point me in the right direction?
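[One possible approach, sketched against the Lucene 1.x API: read the document's stored fields and re-add them. Note the caveat that only stored fields survive — content that was indexed but not stored is not in the retrieved Document and would be missing from the copy. The path and docnum are illustrative.]

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class DuplicateDoc {
    public static void main(String[] args) throws Exception {
        String indexDir = "index";   // illustrative path
        int docNum = 0;              // document number to duplicate

        // Read the stored fields of the existing document.
        IndexReader reader = IndexReader.open(indexDir);
        Document doc = reader.document(docNum);
        reader.close();

        // Re-add it; the copy gets a new document number.  Fields
        // that were indexed but not stored are NOT in 'doc' and
        // will be missing from the copy.
        IndexWriter writer =
            new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.addDocument(doc);
        writer.close();
    }
}
```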

Thanks,
-- 
Victor Hadianto

NUIX Pty Ltd
Level 8, 143 York Street, Sydney 2000
Phone: (02) 9283 9010
Fax:   (02) 9283 9020

This message is intended only for the named recipient. If you are not the
intended recipient you are notified that disclosing, copying, distributing
or taking any action in reliance on the contents of this message or
attachment is strictly prohibited.
