org.apache.commons.io.FileUtils

2005-10-12 Thread Paul Baclace
I need a recursive file delete for cleaning up after a JUnit test. There is one in Commons IO (org.apache.commons.io): FileUtils.deleteDirectory(File directory). I wonder whether I should add org.apache.commons.io as a new jar in lib, or set up a libtest directory for jars used only by JUnit

Re: org.apache.commons.io.FileUtils

2005-10-12 Thread Paul Baclace
Paul Baclace wrote: "I need a recursive file delete for cleaning up after a JUnit test." I just now spotted org.apache.nutch.fs.LocalFileSystem.delete(File f), which does what I want (recursive, local delete). So there is no need for commons.io. Paul
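
For readers who want the same cleanup without either dependency, a recursive local delete can be sketched with the JDK alone. This is a minimal illustration, not the code of FileUtils.deleteDirectory or LocalFileSystem.delete: the class name RecursiveDelete and the method deleteRecursively are hypothetical, and it assumes a JDK with java.nio.file (Java 7 or later), which is newer than this thread.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper for illustration; not part of Nutch or Commons IO.
class RecursiveDelete {

    /** Deletes root and everything beneath it; does nothing if root does not exist. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                // Files (and symbolic links, which are not followed) are removed as they are visited.
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                // A directory is removed only after all of its entries are gone.
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        // Build a small scratch tree, as a test fixture might, then remove it.
        Path tmp = Files.createTempDirectory("junit-cleanup");
        Files.createDirectories(tmp.resolve("a/b"));
        Files.createFile(tmp.resolve("a/b/file.txt"));
        deleteRecursively(tmp);
        System.out.println(Files.exists(tmp)); // prints false
    }
}
```

In a JUnit test, a call like this would typically sit in the tear-down method so the scratch directory is removed even when the test fails.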

Re: nutch downloads

2005-10-12 Thread Erik Hatcher
Joshua, We have received your message. I'm only remotely involved with Nutch, so I'm prodding other committers to Nutch to please update the links to take advantage of the mirroring system in place. Please - someone reply back volunteering to correct this ASAP. Erik On Oct 11,

[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-12 Thread Fuad Efendi (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331877 ] Fuad Efendi commented on NUTCH-109: --- OK, I'll do it tonight; I believe fetcher.server.delay means "wait for a response from the server, then throw a timeout exception". I can also

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki
Gal Nitzan wrote: "Hi Andrzej, Yes, it seems like a good option. However, it is GPL, and I noticed in one of the posts that this license is no good for apache.org :)." If you refer to the bricks automata library, it's BSD-licensed. I mentioned in one of the posts that the Innovation

[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-12 Thread Fuad Efendi (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331897 ] Fuad Efendi commented on NUTCH-109: --- Oops... need to learn more! [protocol-httpclient] Http.java is a Singleton; it uses MultiThreadedHttpConnectionManager. It uses single

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Doug Cutting
Andrzej Bialecki wrote: "100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking." I have worked with this type of technology before (minimized, determinized FSAs, constructed from large sets of strings expressions) and it should be very fast to

RE: Q about exact match counts?

2005-10-12 Thread Goldschmidt, Dave
Anyone answer this question? I see in the Hits class that there's a boolean totalIsExact attribute, but this becomes false only when deduplication (per site) occurs during the search. And I see that underneath Nutch, Lucene will obtain the documents for only the top hits. But does Nutch/Lucene

Re: nutch downloads

2005-10-12 Thread Doug Cutting
Erik Hatcher wrote: "Please - someone reply back volunteering to correct this ASAP." My bad. I'm fixing this right now. In 24 hours all Nutch downloads should be through the mirrors. Sorry! Doug

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: "100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking." I have worked with this type of technology before (minimized, determinized FSAs, constructed from large sets of strings expressions) and it

[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-12 Thread Fuad Efendi (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331913 ] Fuad Efendi commented on NUTCH-109: --- I was totally wrong and unfair: Have you seen Kelvin Tan's patch? You should take a look, it's in JIRA, and addresses some of the

suspicious outlink count

2005-10-12 Thread EM
202443 Pages consumed: 13 (at index 13). Links fetched: 233386. 202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/]. 202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315. If there is maxoutlinks already specified in the xml config, why does nutch bother

keep count of selected url

2005-10-12 Thread Daniele Menozzi
Hi all, I am interested in keeping count of the number of times each URL is selected by a user. The problem is not how to know which page is clicked, but how to store each page/number-of-clicks tuple. What is the best way to store this information and let Nutch use it? Can you

clustering strategies

2005-10-12 Thread Earl Cahill
I think it would be nice to have a few cluster strategies on the wiki. It seems there are at least three separate needs: CPU, storage and bandwidth, and I think the more those could be cleanly spread to different boxes, the better. Guess I am imagining a breakdown that lists, by priority, how

[jira] Created: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-12 Thread [EMAIL PROTECTED] (JIRA)
OpenSearchServlet outputs illegal xml characters Key: NUTCH-110 URL: http://issues.apache.org/jira/browse/NUTCH-110 Project: Nutch Type: Bug Components: searcher Versions: 0.7 Environment: linux, jdk 1.5

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-12 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars.patch Attached patch runs all XML text through a check for bad XML characters. This patch is brutal, dropping silently

RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-12 Thread Chris Mattmann
Hi, I'm not an XML expert by any means, but wouldn't it be simpler to just wrap any text where illegal chars are possible in a <![CDATA[ ]]> section? That way, the offending characters won't be dropped and the process won't be lossy, no? If the CDATA method won't work, and there's no other way
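
To make the NUTCH-110 discussion concrete, the following is a minimal sketch of the lossy approach described in the patch announcement: pass text through a check and silently drop anything that is not a legal XML 1.0 character. It is not the attached fixIllegalXmlChars.patch; the class XmlCharFilter and its methods are hypothetical names, and the ranges are taken from the "Char" production of the XML 1.0 specification.

```java
// Hypothetical helper for illustration; this is not the attached fixIllegalXmlChars.patch.
class XmlCharFilter {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the text with every illegal XML 1.0 character silently dropped. */
    static String stripIllegalXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Work on code points so that valid surrogate pairs are kept intact;
            // an unpaired surrogate is returned as-is and fails the check above.
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // NUL and BACKSPACE are not legal XML 1.0 characters and are removed.
        System.out.println(stripIllegalXmlChars("ok\u0000\u0008 text")); // prints "ok text"
    }
}
```

On the CDATA question: the XML 1.0 specification requires the content of a CDATA section to consist of legal characters as well, so wrapping the text would avoid escaping markup such as < and & but would not make control characters like the ones above acceptable to a parser.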

[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-12 Thread Fuad Efendi (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331950 ] Fuad Efendi commented on NUTCH-109: --- Please see attachment for more details. In order to be fair (protocol-http uses single shared Socket per Host) I tried to modify this

[jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-12 Thread Fuad Efendi (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ] Fuad Efendi updated NUTCH-109: -- Attachment: test_results.txt Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation