Just one last thing: if you send the HTTP requests through a proxy, the fetch
can still fail even when nutch-site.xml and nutch-default.xml are configured
properly. Check your proxy configuration or code and make sure the
"User-Agent" header is set on the outgoing requests. Otherwise, the server
won't accept your HTTP requests.
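
As a rough illustration of what "setting the header in code" can look like,
here is a minimal Java sketch using only the standard library. It is not
Nutch code: the class name, the helper method and the proxy host/port
(proxy.example.com:8080) are placeholders I made up, so adapt them to your
own setup.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

public class ProxyUserAgentExample {

    // Prepares (but does not yet open) an HTTP connection that will be routed
    // through the given proxy, with the User-Agent header set explicitly.
    static HttpURLConnection openWithUserAgent(String url, Proxy proxy, String userAgent)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical proxy host and port; replace with your own.
        Proxy proxy = new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved("proxy.example.com", 8080));
        HttpURLConnection conn =
                openWithUserAgent("http://example.com/", proxy, "Mozilla/4.0");
        // Show the header that would be sent; nothing is sent over the network here.
        System.out.println("User-Agent: " + conn.getRequestProperty("User-Agent"));
    }
}
```

The request is only sent once you call something like conn.getResponseCode(),
so you can inspect the headers first and compare them with what curl -A sends.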

Regards

2010/11/18 Markus Jelsma <[email protected]>

> Yes, but enclose the string in double quotes. Also check the other
> http.agent.* configuration options; they control what goes between the
> parentheses, etc.
>
> On Thursday 18 November 2010 17:08:09 matinte wrote:
> > Hi again,
> > I've been looking deeper, and the error may be because the server filters
> > requests by the value of the User-Agent header. In fact, if I make requests
> > to the server with curl or wget and set the User-Agent value to Mozilla/4.0,
> > it returns the url content correctly:
> > wget -U Mozilla/4.0 <url>
> > curl -A "Mozilla/4.0" <url>
> >
> > Therefore, my goal now is to configure nutch so that the User-Agent header
> > value is correct. To do this, I modified the nutch-default.xml file:
> >  <name>http.agent.name</name>
> >  <value>"Mozilla/4.0"</value>
> >
> > Is it enough?
> >
> > Thanks
> >
> > 2010/11/16 Markus Jelsma-2 [via Lucene] <[email protected]>
> >
> > > definitely!
> > >
> > > On Tuesday 16 November 2010 18:28:17 matinte wrote:
> > > > The url does exist, but when I try, for example, curl <url>, it returns:
> > > > curl: (56) Failure when receiving data from the peer
> > > >
> > > > Could it be a problem with the server?
> > > >
> > > > 2010/11/16 Markus Jelsma-2 [via Lucene] <[hidden email]>
> > > >
> > > > > That should generate an IOException if I'm not mistaken.
> > > > >
> > > > > On Tuesday 16 November 2010 18:16:45 Ye T Thet wrote:
> > > > > > Matinte,
> > > > > >
> > > > > > I have encountered that before.
> > > > > >
> > > > > > In my experience, it is caused by <url>: the url you are trying to
> > > > > > crawl does not exist or the server is not responding.
> > > > > >
> > > > > > Warm Regards,
> > > > > >
> > > > > > YT Thet
> > > > > >
> > > > > > On Wed, Nov 17, 2010 at 12:44 AM, matinte <[hidden email]> wrote:
> > > > > > > Hi,
> > > > > > > I am trying to crawl with a given seed url, but I'm getting the
> > > > > > > following error:
> > > > > > > ...
> > > > > > > fetch of <url> failed with: java.io.EOFException
> > > > > > > -finishing thread FetcherThread, activeThreads=0
> > > > > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > > > > -activeThreads=0
> > > > > > > Fetcher: done
> > > > > > >
> > > > > > > Do you have any idea?
> > > > > > >
> > > > > > > Thanks in advance
> > > > > > > --
> > > > > > > View this message in context:
> > > > > > > http://lucene.472066.n3.nabble.com/Fetch-error-during-crawling-tp1911847p1911847.html
> > > > > > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536600 / 06-50258350
> > > > >
> > >
