Hi Kaveh,

> why is it that some of the links go through with no
> problem and some of them don't?

Do they really get fetched, or is the error logged a few lines further down?

I tried most of your URLs with

 nutch plugin protocol-httpclient org.apache.nutch.protocol.httpclient.Http -verbose <url>

and all of them fail with "Invalid uri". I have no explanation for why some
would fail and others would not.


On 02/14/2012 02:46 AM, kaveh minooie wrote:
Thanks Sebastian, that was very helpful. But then why is it that some of the
links go through with no problem and some of them don't?

On 02/13/2012 04:38 PM, Sebastian Nagel wrote:
Hi Kaveh,

protocol-httpclient does not accept URLs containing white space and other
characters which are, strictly speaking, forbidden in URLs and have to be
escaped; see
http://en.wikipedia.org/wiki/URI_encoding
Most browsers accept these URLs and silently escape the forbidden characters.
protocol-httpclient is less forgiving in this respect. One way to deal with it
is to percent-escape the forbidden characters with a URL normalization rule.
For example, to replace each space with %20, add the following to your
regex-normalize.xml:

<regex>
<pattern>&#x20;</pattern>
<substitution>%20</substitution>
</regex>

But there are more characters to escape:
http://www.prolitegear.com/site/xdpy/ssg/Bargains & Closeouts/Sleeping Bags: 0° to 20° F.html
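
To illustrate what the fully escaped form of such a URL looks like, here is a
minimal Python sketch. It is not part of Nutch; the helper name escape_url is
hypothetical, and it assumes that "&" and ":" may stay unescaped in the path
and that non-ASCII characters such as "°" are escaped as UTF-8 bytes.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url(url: str) -> str:
    """Percent-escape forbidden characters in the path of a URL (illustrative helper)."""
    parts = urlsplit(url)
    # quote() escapes spaces and non-ASCII characters (as UTF-8 bytes).
    # "/", "&" and ":" are left alone (assumption: they are acceptable in a path),
    # and "%" is left alone so that already-escaped sequences are not escaped twice.
    path = quote(parts.path, safe="/%&:")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


url = ("http://www.prolitegear.com/site/xdpy/ssg/"
       "Bargains & Closeouts/Sleeping Bags: 0° to 20° F.html")
print(escape_url(url))
```

With these assumptions each space becomes %20 and each degree sign becomes
%C2%B0, so a rule for the space alone would not be enough for this URL.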

Sebastian

On 02/14/2012 12:57 AM, kaveh minooie wrote:
One of the exceptions that I see a lot in my log files is an "Invalid uri"
exception like this:

2012-02-13 15:05:50,217 ERROR org.apache.nutch.protocol.httpclient.Http:
java.lang.IllegalArgumentException: Invalid uri
'http://www.prolitegear.com/site/xdpy/ssg/Shelters/Shelter
Accessories.html': escaped absolute path
not valid

2012-02-13 15:05:50,217 ERROR org.apache.nutch.protocol.httpclient.Http:
java.lang.IllegalArgumentException: Invalid uri
'http://www.prolitegear.com/site/xdpy/ssg/Shelters/Shelter
Accessories.html': escaped absolute path
not valid
2012-02-13 15:05:50,226 ERROR org.apache.nutch.protocol.httpclient.Http:
java.lang.IllegalArgumentException: Invalid uri
'http://www.prolitegear.com/activity/Adventure
Racing/index.html': escaped absolute path not valid


(There is a space between "Shelter" and "Accessories".) At first I thought
it was because of the space in the link, but these addresses from the same
site go through with no problem:

2012-02-13 15:05:50,114 INFO org.apache.nutch.fetcher.Fetcher: fetching
http://www.prolitegear.com/site/xdpy/ssg/Shelters/Shelter
Accessories.html
2012-02-13 15:05:50,105 INFO org.apache.nutch.fetcher.Fetcher: fetching
http://www.prolitegear.com/site/xdpy/ssg/Accessories/Sun Protection.html
2012-02-13 15:05:50,149 INFO org.apache.nutch.fetcher.Fetcher: fetching
http://www.prolitegear.com/site/xdpy/ssg/Bargains & Closeouts/Sleeping
Bags: 0° to 20° F.html
2012-02-13 15:05:50,100 INFO org.apache.nutch.fetcher.Fetcher: fetching
http://www.prolitegear.com/site/xdpy/ssg/Climbing Gear/Protection.html


Does anybody have any idea what might be wrong here? (I am using
protocol-httpclient, and all the links are actually valid: they work if you
copy and paste them into a browser.)




