Re: Fetch: java.lang.NullPointerException

2007-03-08 Thread Andrzej Bialecki
Sean Dean wrote: You will see that type of error now when the fetch of a document goes beyond the maximum amount of redirects in your nutch-default or nutch-site.xml files. If you didn't change any settings yourself this will be set to 0, in which case it will not fetch the page but record it

Re: Fetch: java.lang.NullPointerException

2007-03-08 Thread Sean Dean
You will see that type of error now when the fetch of a document goes beyond the maximum amount of redirects in your nutch-default or nutch-site.xml files. If you didn't change any settings yourself this will be set to 0, in which case it will not fetch the page but record it for later fetching
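The setting Sean is referring to is `http.redirect.max`, whose stock default of 0 tells the fetcher to record redirects for later fetching instead of following them. A minimal sketch of overriding it (the value 3 is just an example; check the property description in your own nutch-default.xml):

```xml
<!-- In conf/nutch-site.xml: override the default of 0 so the fetcher
     follows up to 3 redirects immediately instead of deferring them. -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>
```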

Fetch: java.lang.NullPointerException

2007-03-08 Thread Rafael Turk
Hi all, I'm having trouble trying to fetch http://www.hotmail.com.br/. I get the following error: fetch of http://www.hotmail.com.br/ failed with: java.lang.NullPointerException The site seems to be OK, although it's a redirect. Using: Nutch (0.9-dev) Any help? Thanks, Rafael.

Re: external host link logging

2007-03-08 Thread Sean Dean
A good place to begin would be to look at the log file generated by Hadoop, which would be in the "log/" directory. The file rolls over each day, but that shouldn't be a problem. You could parse it after each crawl, keeping just the links and discarding the other information. ---
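Sean's suggestion of parsing the Hadoop log could be sketched as below. The log-line format shown here is a hypothetical approximation of Nutch fetcher output; the real lines in your log may differ between versions, so adjust the pattern accordingly.

```python
import re

# Hypothetical sample lines in the style of Nutch/Hadoop fetcher output;
# the actual format in your log file may differ between versions.
sample_log = """\
2007-03-08 10:00:01 INFO  fetcher.Fetcher - fetching http://example.com/page1
2007-03-08 10:00:02 INFO  fetcher.Fetcher - fetching http://other.org/page2
2007-03-08 10:00:03 INFO  indexer.Indexer - Indexer: done
"""

# Capture just the URL from lines that record a fetch.
url_pattern = re.compile(r"fetching (\S+)")

def extract_fetched_urls(log_text):
    """Return only the fetched URLs, dropping all other log information."""
    urls = []
    for line in log_text.splitlines():
        m = url_pattern.search(line)
        if m:
            urls.append(m.group(1))
    return urls

print(extract_fetched_urls(sample_log))
```

From there you could filter the list down to external hosts by comparing each URL's hostname against your seed list.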

Re: [SOLVED] moving crawled db from windows to linux

2007-03-08 Thread kan001
Will update you once I am done with that testing... just stuck there :( kan001 wrote: > > As my Linux server is a virtual dedicated server and it often runs into out-of-memory errors, I won't be able to do the fetch there right now. I need to upgrade the server or stop all applications running in

external host link logging

2007-03-08 Thread djames
Hello, I've been working with Nutch for 2 months now, and I'm very happy to see that this project is so powerful! I need to crawl only a set of given websites, so I set the parameter db.ignore.external.links to false and it works perfectly. But now I need to create a log file with the list of all
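For reference, the property djames mentions looks roughly like this in nutch-default.xml. Note that per the stock property description it is the value `true` that makes Nutch ignore outlinks to external hosts and so keeps the crawl within the given sites (the description text below is a paraphrase, not a verbatim copy):

```xml
<!-- Sketch of the stock property; set this in conf/nutch-site.xml
     to restrict the crawl to the injected hosts. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  are ignored, keeping the crawl within the seeded sites.</description>
</property>
```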

Re: [SOLVED] Newbie questions about followed links

2007-03-08 Thread djames
Hi, with your configuration of Nutch, the crawl doesn't follow links with dynamic parameters. You must edit your regex filter at this line: # skip URLs containing certain characters as probable queries, etc. [EMAIL PROTECTED]

Re: Newbie questions about followed links

2007-03-08 Thread Jeroen Verhagen
Hi Hasan, On 3/8/07, Hasan Diwan <[EMAIL PROTECTED]> wrote: conf/urlfilter.txt.template contains the line: [EMAIL PROTECTED] Remove the '?' and the links will be followed. Thanks, that made it work. I had to comment out the whole line '[EMAIL PROTECTED]' to make it work though? Even though

Re: Newbie questions about followed links

2007-03-08 Thread Paul Liddelow
Exactly what I was going to say! Cheers Paul On 3/8/07, Hasan Diwan <[EMAIL PROTECTED]> wrote: Sir: On 08/03/07, Jeroen Verhagen <[EMAIL PROTECTED]> wrote: > Surely these links look ordinary enough to be seen and followed by > nutch? Could someone please tell me what could be causing these lin

Re: Newbie questions about followed links

2007-03-08 Thread Hasan Diwan
Sir: On 08/03/07, Jeroen Verhagen <[EMAIL PROTECTED]> wrote: Surely these links look ordinary enough to be seen and followed by nutch? Could someone please tell me what could be causing these links not be followed? conf/urlfilter.txt.template contains the line: [EMAIL PROTECTED] Remove the '?'
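The '[EMAIL PROTECTED]' in the messages above is the archive's mangling of a filter pattern that contains an '@'. In stock Nutch configs, the section under that comment typically looks like the sketch below (check your own conf/crawl-urlfilter.txt or urlfilter template, as the shipped rules vary by version):

```
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# To let Nutch follow URLs with query strings, either remove the '?'
# from the character class, or comment the whole rule out:
# -[?*!@=]
```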

Newbie questions about followed links

2007-03-08 Thread Jeroen Verhagen
Hi all, I started experimenting with Nutch using the NutchTutorial. I got a successful crawl to work using the command 'bin/nutch crawl urls -dir crawl' (no limitations on depth or number of documents). I noticed that Nutch finishes quite fast. When I looked in the source-html of the main page bei