Sean Dean wrote:
You will see that type of error when the fetch of a document goes beyond
the maximum number of redirects set in your nutch-default.xml or nutch-site.xml file.
If you didn't change any settings yourself this will be set to 0, in which case
Nutch will not fetch the redirected page but record it for later fetching.
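For reference, raising that limit is a small override in conf/nutch-site.xml; the property is http.redirect.max, and the value 3 here is only an illustration:

```xml
<!-- Sketch for conf/nutch-site.xml. With the default of 0, the fetcher
     records redirect targets for a later round instead of following them. -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>Maximum number of redirects the fetcher will follow.</description>
</property>
```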
Hi all,
I'm having trouble trying to fetch http://www.hotmail.com.br/. I get the
following error: fetch of http://www.hotmail.com.br/ failed with:
java.lang.NullPointerException
The site seems to be OK, although it's a redirect.
Using: Nutch (0.9-dev)
Any help?
Thanks,
Rafael.
A good place to begin would be to look at the log file generated by Hadoop,
which would be in the "log/" directory. The file is rotated each day,
but that shouldn't be a problem.
You could parse that after each crawl, just taking the links and excluding all
the other information.
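That parsing step could be sketched as a one-liner. The log line below is a made-up example of the fetcher's output, so the grep pattern may need adjusting for your Nutch version's actual log format:

```shell
# Pull just the fetched URLs out of a Hadoop/Nutch log line,
# discarding the timestamp, level, and class-name columns.
echo '2007-03-08 10:00:01 INFO  fetcher.Fetcher - fetching http://www.example.com/page.html' \
  | grep -o 'http://[^ ]*'
```

Run over the whole log (`grep -o 'http://[^ ]*' log/hadoop.log`), this yields one URL per line.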
---
will update you once I am done with that testing... just stuck there :(
kan001 wrote:
>
> As my Linux server is a virtual dedicated server and it often runs into
> out-of-memory errors, I won't be able to do a fetch there right now. I need to
> upgrade the server or stop all applications running in
Hello,
I've been working with Nutch for 2 months now, and I'm very happy to see that
this project is so powerful!
I need to crawl only a given set of websites, so I set the parameter
db.ignore.external.links to true and it works perfectly.
But now I need to create a log file with the list of all
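For anyone searching the archives, the setting mentioned above is a one-property override in conf/nutch-site.xml:

```xml
<!-- Sketch for conf/nutch-site.xml: restrict the crawl to the injected
     sites by ignoring outlinks that point to other hosts. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```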
Hi,
With your configuration of Nutch, the crawl doesn't follow links with dynamic
parameters.
You must edit your regex filter at this line:
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
Hi Hasan,
On 3/8/07, Hasan Diwan <[EMAIL PROTECTED]> wrote:
conf/urlfilter.txt.template contains the line:
[EMAIL PROTECTED]
Remove the '?' and the links will be followed.
Thanks, that made it work.
I had to comment out the whole line '[EMAIL PROTECTED]' to make it work, though. Even though
exactly what I was going to say!
Cheers
Paul
Sir:
On 08/03/07, Jeroen Verhagen <[EMAIL PROTECTED]> wrote:
Surely these links look ordinary enough to be seen and followed by
nutch? Could someone please tell me what could be causing these links
not to be followed?
conf/urlfilter.txt.template contains the line:
[EMAIL PROTECTED]
Remove the '?'
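For reference, assuming the stock 0.8/0.9 filter file, the edit looks roughly like this (the exact character list in your copy may differ):

```
# skip URLs containing certain characters as probable queries, etc.
# stock rule (skips any URL containing ?, *, !, @ or =):
# -[?*!@=]
# relaxed rule with the '?' removed, so query-string URLs pass:
-[*!@=]
```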
Hi all,
I started experimenting with Nutch using the NutchTutorial. I got a
successful crawl to work using the command 'bin/nutch crawl urls -dir
crawl' (no limitations on depth or number of documents). I noticed
that Nutch finishes quite fast. When I looked at the HTML source of
the main page bei
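If the crawl is finishing faster than expected, the crawl tool's flags can make the limits explicit; the numbers here are only illustrative:

```
# Limit the crawl to 3 rounds, keeping the top 50 URLs per round
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```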