Re: AW: Web Proxy Authentication

2007-02-15 Thread Damian Florczyk
ekoje ekoje wrote: Hello, I tried to modify Nutch in order to pass through a web proxy as advised below but it still doesn't work. I've got the following error: 2007-02-15 17:04:58,285 INFO fetcher.Fetcher - fetching http://lucene.apache.org/nutch/ 2007-02-15 17:04:58,300 INFO http.Http
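
The basic proxy settings for the fetcher live in conf/nutch-site.xml, overriding the defaults from nutch-default.xml. A minimal sketch with placeholder values; authenticating against the proxy, if required, is a separate concern and depends on the HTTP protocol plugin in use:

  <property>
    <name>http.proxy.host</name>
    <value>proxy.example.com</value>  <!-- placeholder: your proxy hostname -->
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8080</value>               <!-- placeholder: your proxy port -->
  </property>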

Exception while intra-net crawling

2007-02-15 Thread Charlie Williams
I had set up a crawl of our intranet (approximately 1.6 million pages) with the crawl parameters set to depth 5 and MAX_INT pages per iteration. After 12 days, on the 3rd iteration, I got a crash with an exception thrown: Exception in thread main java.io.IOException: Job failed! at
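
For reference, the depth and per-iteration limit mentioned above correspond to the -depth and -topN options of the one-step crawl tool; a sketch of such an invocation (paths are examples, and leaving out -topN lets the limit default to Integer.MAX_VALUE):

  bin/nutch crawl urls -dir crawl -depth 5 -topN 50000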

WEB2 help needed - did build but no page display..??

2007-02-15 Thread RP
Hi all, Pulled down the WEB2 stuff via SVN to finally look at the keymatch and spellchecker stuff. Did the ant build per the readme to compile the plugins and build the WAR with no errors. Added the plugins and the plugins directory (I moved all the plugins into nutch/plugins and pointed to that) to
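
If it helps anyone else, pointing Nutch at a relocated plugin directory is normally done through the plugin.folders property in conf/nutch-site.xml; a sketch, assuming the plugins were consolidated under nutch/plugins as described above:

  <property>
    <name>plugin.folders</name>
    <!-- relative or absolute path(s) of the directories holding the plugins -->
    <value>plugins</value>
  </property>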

Re: AW: Web Proxy Authentication

2007-02-15 Thread Dennis Kubes
Fetcher is using the correct proxy but the DNS isn't getting out. Take a look at this, it might help: http://www.rgagnon.com/javadetails/java-0085.html Dennis Kubes Damian Florczyk wrote: ekoje ekoje wrote: Hello, I tried to modify Nutch in order to pass through a web proxy as advised
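
A minimal sketch (class and host names are only examples) for checking whether the crawl machine can resolve external hostnames at all, along the lines of the article linked above:

  import java.net.InetAddress;
  import java.net.UnknownHostException;

  public class DnsCheck {
    public static void main(String[] args) {
      // hostname to test; defaults to the host from the failing fetch above
      String host = (args.length > 0) ? args[0] : "lucene.apache.org";
      try {
        InetAddress addr = InetAddress.getByName(host);
        System.out.println(host + " resolves to " + addr.getHostAddress());
      } catch (UnknownHostException e) {
        System.out.println("cannot resolve " + host + " - DNS is not reachable from this machine");
      }
    }
  }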

crawl indexes and part-00000

2007-02-15 Thread Brian Whitman
I am looking for a simple explanation of what the part-00000 directories in my crawl/index folders are, when they are created, and when they are not. I am having a bit of trouble merging multiple Nutch-created indexes using bin/nutch merge -- the merge tool seems to always expect the

RE: crawl indexes and part-00000

2007-02-15 Thread Gal Nitzan
Hi Brian, Well, it took me a while to figure it out too :-). The number of parts is actually the number of reduce tasks defined in hadoop-site.xml. If you are working with only one machine this value should be one, and when you run different jobs you will notice that the result is saved in
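
In other words, for a single-machine setup the relevant hadoop-site.xml entry would look something like this (a sketch; mapred.reduce.tasks is the Hadoop property that controls the number of parts):

  <property>
    <name>mapred.reduce.tasks</name>
    <!-- one reduce task produces a single part-00000 directory -->
    <value>1</value>
  </property>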

Re: crawl indexes and part-00000

2007-02-15 Thread Brian Whitman
The merge program doesn't care what the name of the folder is; it cares that the folder has a certain structure. So if we assume you have a folder named indexes, the program expects each folder inside indexes (each representing a previous indexing run) to contain a Lucene index (it looks
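
To illustrate the layout being described (directory names are only examples):

  indexes/
    index_1/   <- a Lucene index from one indexing run
    index_2/   <- a Lucene index from another run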

RE: crawl indexes and part-00000

2007-02-15 Thread Gal Nitzan
It's funny, but merge is not run as a job, so you end up with one folder with the merged index in it -- no parts there. Let's say you have 2 separate indexes created in 2 separate runs. Now let's say that one index is located at crawl/index_1 and the second is in crawl/index_2. So now in each of those
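
A sketch of how the merge could then be run from the command line, assuming the IndexMerger usage of that era (output index first, then one or more directories whose subfolders each hold a Lucene index); all paths below are examples:

  # gather the two indexes under a common parent so the merger can pick them up
  mkdir crawl/indexes_to_merge
  mv crawl/index_1 crawl/index_2 crawl/indexes_to_merge/
  bin/nutch merge crawl/merged_index crawl/indexes_to_merge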

Re: WEB2 help needed - did build but no page display..?? Kinda working

2007-02-15 Thread RP
Got live results pages after I moved a Jasper lib out of the way, but keymatch does nothing - the div id is there but nothing is in it. I put the keymatch def in the tiles-def.xml but left the location path at the default, which might be an issue as that is not the same as the plugin path. I