ekoje ekoje wrote:
Hello, I tried to modify Nutch in order to pass through a web proxy as
advised below, but it still doesn't work.
I got the following error:
2007-02-15 17:04:58,285 INFO fetcher.Fetcher - fetching
http://lucene.apache.org/nutch/
2007-02-15 17:04:58,300 INFO http.Http
I had set up a crawl of our intranet (approximately 1.6 million pages) and
had set the crawl parameters to depth 5, MAX_INT pages per iteration.
After 12 days, on the 3rd iteration, I got a crash with an exception thrown:
Exception in thread main java.io.IOException: Job failed!
at
Hi all,
Pulled down the WEB2 stuff via SVN to finally look at the keymatch and
spellchecker stuff. Did the ANT thing per the readme to compile plugins
and build WAR with no errors. Added the plugins and plugins directory
(I moved all the plugins into nutch/plugins and pointed to that) to
Fetcher is using the correct proxy but the DNS isn't getting out. Take
a look at this, it might help.
http://www.rgagnon.com/javadetails/java-0085.html
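
For what it's worth, here is a minimal sketch for checking which proxy the JVM itself would pick for a URL. It only uses the standard JVM networking properties http.proxyHost and http.proxyPort; the class name ProxyCheck and the proxy.example.com:8080 values are placeholders of mine, not anything from this thread. Nutch's protocol plugins read their own proxy settings from the Nutch configuration, so treat this as a JVM-level sanity check rather than a Nutch fix.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

class ProxyCheck {
    /**
     * Sets the standard JVM HTTP proxy properties, then asks the default
     * ProxySelector which proxies it would use for the given URL.
     */
    static List<Proxy> proxiesFor(String host, int port, String url) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
        return ProxySelector.getDefault().select(URI.create(url));
    }

    public static void main(String[] args) {
        // Placeholder proxy host and port (assumptions, replace with your own).
        List<Proxy> proxies =
            proxiesFor("proxy.example.com", 8080, "http://lucene.apache.org/nutch/");
        for (Proxy p : proxies) {
            InetSocketAddress addr = (InetSocketAddress) p.address();
            System.out.println(p.type() + " " + addr);
        }
    }
}
```

If the selector reports a direct connection instead of your proxy, the properties are not being picked up by the JVM at all, which is a different problem from the DNS one.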
Dennis Kubes
Damian Florczyk wrote:
ekoje ekoje wrote:
Hello, I tried to modify Nutch in order to pass through a web proxy as
advised
I am looking for a simple explanation of what the part-0
directories in my crawl/index folders are, when they are created, and
when they are not.
I am having a bit of trouble merging multiple Nutch-created indexes
using bin/nutch merge -- the merge tool seems to always expect the
Hi Brian,
Well, it took me a while to figure it out too :-).
The number of parts is actually the number of reduce tasks defined in
hadoop-site.xml. If you are working with only one machine, this value should
be one, and when you run different jobs you will notice that the result is
saved in
The merge program doesn't care what the name of the folder is. It cares
that it is in a certain structure.
So if we assume you have a folder named indexes, the program wants each
folder inside indexes (each represents a previous run of index) to have a
Lucene index in it (it looks
It's funny, but merge is not run as a job, so you end up with one folder with
the merged index in it and no parts there.
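
To see which layout a given index folder has, something like the sketch below may help. It just lists the sub-directories whose names start with "part-", so a folder written by a job should show one entry per reduce task and a merged index should show none. The class name PartLister and the "part-" prefix match are my own assumptions for illustration; this is not part of Nutch or Hadoop.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PartLister {
    /** Returns the sorted names of sub-directories of dir that start with "part-". */
    static List<String> partDirs(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(Files::isDirectory)
                .map(p -> p.getFileName().toString())
                .filter(name -> name.startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the example location used in this thread.
        Path dir = Paths.get(args.length > 0 ? args[0] : "crawl/index_1");
        if (!Files.isDirectory(dir)) {
            System.out.println(dir + " is not a directory");
            return;
        }
        List<String> parts = partDirs(dir);
        System.out.println(parts.isEmpty()
            ? dir + ": no part directories"
            : dir + ": " + parts);
    }
}
```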
Let's say you have 2 separate indexes created in 2 separate runs.
Now let's say that one index is located at crawl/index_1 and the second is
at crawl/index_2.
So now in each of those
Got live results pages after I moved a jasper lib out of the way, but
keymatch does nothing - the div id is there but nothing is in it. I put
the keymatch def in the tiles-def.xml but left the location path at the
default, which might be an issue as that is not the same as the plugin
path. I