[
https://issues.apache.org/jira/browse/NUTCH-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michela Becchi resolved NUTCH-824.
--
Fix Version/s: 1.0.0
Resolution: Fixed
Hi,
I fixed (or, at least, circumvented) this by modifying the
org.apache.nutch.protocol.file.FileResponse class belonging to the
protocol-file plugin.
In particular, at line 120, I added:
120    String path = "".equals(url.getPath()) ? "/" : url.getPath();
121  + String decoded_path = path; // @Michela
122
123  + try {
124  +   decoded_path = java.net.URLDecoder.decode(path, "UTF-8");
125  + } catch (Exception ex) {
126  + }
Then, rather than
- java.io.File f = new java.io.File(path);
I have
+ java.io.File f = new java.io.File(decoded_path);
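For reference, the decode call behaves like this on the file name from the report below (a standalone sketch, not part of the patch itself):

```java
// Minimal demo of the decode step, run outside Nutch.
import java.net.URLDecoder;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        // File name exactly as it appears in the failing crawl URL.
        String encoded = "A.M._%28album%29_8a09.html";
        // URLDecoder turns the percent escapes back into the literal
        // characters stored on disk: %28 -> "(" and %29 -> ")".
        String decoded = URLDecoder.decode(encoded, "UTF-8");
        System.out.println(decoded); // A.M._(album)_8a09.html
    }
}
```

One caveat of this approach: URLDecoder.decode also turns a literal "+" into a space, so file names containing "+" would still be mangled; catching and ignoring the exception, as the patch does, simply falls back to the undecoded path.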
Thanks,
Michela
Crawling - File Error 404 when fetching a file with a hexadecimal character in
the file name.
Key: NUTCH-824
URL: https://issues.apache.org/jira/browse/NUTCH-824
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64
GNU/Linux
Reporter: Michela Becchi
Fix For: 1.0.0
Hello,
I am performing a local file system crawling.
My problem is the following: all files whose names contain percent-encoded
(hexadecimal) characters do not get crawled.
For example, I will see the following error:
fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
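The 404 can be reproduced outside Nutch with a plain java.io.File lookup (a hypothetical standalone sketch; a file in the temp directory stands in for the wiki dump):

```java
// Standalone repro of the 404: a file whose on-disk name contains
// parentheses is not found when looked up by its percent-encoded URL path.
import java.io.File;
import java.net.URLDecoder;

public class EncodedLookupRepro {
    public static void main(String[] args) throws Exception {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        // Create the file under its real (decoded) name, as the wiki dump does.
        File real = new File(dir, "A.M._(album)_8a09.html");
        real.createNewFile();

        // The fetcher hands FileResponse the percent-encoded path instead.
        String encoded = "A.M._%28album%29_8a09.html";

        // Looking up the encoded name literally fails, so FileResponse answers 404.
        System.out.println(new File(dir, encoded).exists()); // false -> 404
        // Decoding first finds the file.
        System.out.println(new File(dir, URLDecoder.decode(encoded, "UTF-8")).exists()); // true

        real.delete();
    }
}
```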
I am using nutch-1.0.
Among other standard settings, I configured nutch-site.xml as follows:
<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
Moreover, crawl-urlfilter.txt looks like:
# skip http:, ftp:, mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[...@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# accept everything else
+.*
---
Thanks,
Michela
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.