[ https://issues.apache.org/jira/browse/NUTCH-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma reassigned NUTCH-824:
-----------------------------------

    Assignee: Markus Jelsma

> Crawling - File Error 404 when fetching a file with a hexadecimal character in
> the file name.
> --------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-824
>                 URL: https://issues.apache.org/jira/browse/NUTCH-824
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 GNU/Linux
>            Reporter: Michela Becchi
>            Assignee: Markus Jelsma
>             Fix For: 1.0.0
>
> Hello,
> I am performing a local file system crawl.
> My problem is the following: no file whose name contains hexadecimal
> (percent-encoded) characters gets crawled.
> For example, I see the following error:
>
> fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
> org.apache.nutch.protocol.file.FileError: File Error: 404
>         at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
> fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
> failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
>
> I am using nutch-1.0.
> Among other standard settings, I configured nutch-site.xml as follows:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> <property>
>   <name>file.content.limit</name>
>   <value>-1</value>
> </property>
>
> Moreover, crawl-urlfilter.txt looks like:
>
> # skip http:, ftp:, & mailto: urls
> -^(http|ftp|mailto):
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> # skip URLs containing certain characters as probable queries, etc.
> -[...@=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> # accept everything else
> +.*
>
> Thanks,
> Michela

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
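A plausible explanation for the 404 above (an assumption on my part, not something the report confirms) is that the file protocol plugin hands the still-percent-encoded URL path to `java.io.File` without decoding it first, so a URL containing `%28` only matches a file literally named with `%28`, not one named with `(`. A minimal sketch of the decoding step that would reconcile the two, using the failing path from the report:

```java
import java.net.URLDecoder;

public class EncodedPathSketch {
    public static void main(String[] args) throws Exception {
        // Path portion of the failing URL, exactly as it appears in the log.
        String rawPath =
            "/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html";

        // If this raw string is passed straight to java.io.File, the lookup
        // searches for a name containing the literal characters "%28"/"%29".
        // Decoding first recovers the name as it exists on disk:
        String decoded = URLDecoder.decode(rawPath, "UTF-8");
        System.out.println(decoded);
        // -> /nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._(album)_8a09.html
    }
}
```

Whether `File.getProtocolOutput` (File.java:92 in the trace) actually skips this decode would need to be checked against the 1.0 source; the sketch only shows why an undecoded path and the on-disk name cannot match.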