[jira] Updated: (NUTCH-952) fix outlink which started with '?' in html parser
[ https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stondet updated NUTCH-952:

    Affects Version/s: 2.0 (was: 1.3)

fix outlink which started with '?' in html parser

    Key: NUTCH-952
    URL: https://issues.apache.org/jira/browse/NUTCH-952
    Project: Nutch
    Issue Type: Bug
    Components: parser
    Affects Versions: 2.0
    Reporter: Stondet
    Attachments: NUTCH-952-v2.patch

<a href="?w=ruby%20on%20rails&ty=c&sd=0">ruby on rails</a>
(a snippet from http://bbs.soso.com/search?ty=c&sd=0&w=rails)

outlink parsed from above link:
http://bbs.soso.com/?w=ruby%20on%20rails&ty=c&sd=0
but expected is
http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
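The behaviour the reporter expects can be illustrated with a minimal sketch. It is not the attached NUTCH-952-v2.patch; the class and method names are hypothetical, and the `&` separators in the example URLs are assumed to have been lost when the message was archived. The idea is that a link consisting only of a query string keeps the base document's path and replaces only its query.

```java
import java.net.URI;

class QueryOnlyLinks {
    /**
     * Resolves target against base. A target that starts with '?' keeps the
     * base path and replaces only the query (and drops any fragment).
     * Hypothetical helper for illustration, not Nutch's own code.
     */
    static String resolve(String base, String target) {
        if (target.startsWith("?")) {
            // Cut the base at its first '?' or '#', whichever comes first.
            int cut = base.length();
            int q = base.indexOf('?');
            int h = base.indexOf('#');
            if (q >= 0) cut = Math.min(cut, q);
            if (h >= 0) cut = Math.min(cut, h);
            return base.substring(0, cut) + target;
        }
        // Every other kind of link goes through the standard resolver.
        return URI.create(base).resolve(target).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve(
            "http://bbs.soso.com/search?ty=c&sd=0&w=rails",
            "?w=ruby%20on%20rails&ty=c&sd=0"));
    }
}
```

Handling the query-only case with plain string operations avoids depending on how a particular resolver treats an empty relative path.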
[jira] Resolved: (NUTCH-954) Bugfix for Content-Length limit in http protocols
[ https://issues.apache.org/jira/browse/NUTCH-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-954.

    Resolution: Fixed

1.3 : Committed revision 1056359
trunk : Committed revision 1056362

Thanks Alexis!

Bugfix for Content-Length limit in http protocols

    Key: NUTCH-954
    URL: https://issues.apache.org/jira/browse/NUTCH-954
    Project: Nutch
    Issue Type: Sub-task
    Components: fetcher
    Affects Versions: 1.3, 2.0
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Fix For: 1.3, 2.0

3. Content-Length limit (nutch3.patch)
This is related to NUTCH-899. The patch keeps the entire flush operation on the Gora datastore from crashing when the MySQL blob limit is exceeded by a few bytes. Both the protocol-http and protocol-httpclient plugins were affected.
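A strict content limit of the kind described here can be sketched as follows. This is only an illustration under stated assumptions, not the committed patch: the class and method names are hypothetical, and treating a negative limit as "no limit" is an assumption modelled on the `-1` convention used for content limits elsewhere in this thread. Each read request is clamped to the remaining budget, so the returned array can never exceed the limit, not even by a few bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class LimitedRead {
    /**
     * Reads at most maxBytes from the stream. A negative maxBytes is
     * treated as "no limit" (assumption for this sketch).
     */
    static byte[] readLimited(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (maxBytes < 0 || out.size() < maxBytes) {
                int want = buf.length;
                if (maxBytes >= 0) {
                    // Never ask for more than the bytes still allowed.
                    want = Math.min(want, maxBytes - out.size());
                }
                int n = in.read(buf, 0, want);
                if (n < 0) {
                    break; // end of stream
                }
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```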
[jira] Commented: (NUTCH-950) Content-Length limit, URL filter and few minor issues
[ https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978832#action_12978832 ]

Julien Nioche commented on NUTCH-950:

Have committed the first 3 sub-issues. Regarding the last one, I haven't tested the first point (version changes) but here are a few comments about the other issues:

* HBase + MySQL: these backends should not be provided by default, same for the MySQL connector. One option would be to add them to the ivy file but comment them out and give a bit of an explanation, e.g. "uncomment this if you want to use xxx as a GORA backend"
* the dependency com.jcraft/jsch should be placed in the ivy file of the corresponding plugin, not in the main one

Alexis, could you please create a new issue for this and then mark this issue as resolved? Having a single JIRA number for completely separate issues is a bad idea and does not help keep things in sync with the svn commits.

Thanks a lot for your contributions

Julien

Content-Length limit, URL filter and few minor issues

    Key: NUTCH-950
    URL: https://issues.apache.org/jira/browse/NUTCH-950
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 2.0
    Reporter: Alexis
    Attachments: nutch1.patch, nutch2.patch, nutch3.patch, nutch4.patch

1. crawl command (nutch1.patch)
The class was renamed to Crawler but the references to it were not updated.

2. URL filter (nutch2.patch)
This avoids an NPE on bogus URLs whose host does not have a suffix.

3. Content-Length limit (nutch3.patch)
This is related to NUTCH-899. The patch keeps the entire flush operation on the Gora datastore from crashing when the MySQL blob limit is exceeded by a few bytes. Both the protocol-http and protocol-httpclient plugins were affected.

4. Ivy configuration (nutch4.patch)
- Change xercesImpl and restlet versions. These 2 version changes are required. The first one currently makes a JUnit test crash, the second one is missing in the default Maven repository.
- Add gora-hbase and zookeeper, which is an HBase dependency. Add the MySQL connector. These jars are necessary to run Gora with HBase or MySQL datastores. (More a suggestion than a requirement here.)
- Add com.jcraft/jsch, which is a protocol-sftp plugin dependency.
[jira] Resolved: (NUTCH-824) Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.
[ https://issues.apache.org/jira/browse/NUTCH-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-824.

    Resolution: Fixed
    Assignee: Julien Nioche (was: Markus Jelsma)

Have reactivated the tests for protocol-file in 1.3 and reorganised the test documents to follow other plugins, i.e. test docs in sample dir.
Protocol-file now decodes the input URLs with UTF-8.

trunk : Committed revision 1056394
1.3 : Committed revision 1056401

Thanks Michela

Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

    Key: NUTCH-824
    URL: https://issues.apache.org/jira/browse/NUTCH-824
    Project: Nutch
    Issue Type: Bug
    Components: fetcher
    Affects Versions: 1.0.0, 1.2, 1.3, 2.0
    Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 GNU/Linux
    Reporter: Michela Becchi
    Assignee: Julien Nioche
    Fix For: 1.3, 2.0, 1.0.0

Hello,

I am performing a local file system crawling.
My problem is the following: all files that contain some hexadecimal characters in the name do not get crawled.
For example, I will see the following error:

fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

I am using nutch-1.0.
Among other standard settings, I configured nutch-site.conf as follows:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.
  </description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

Moreover, crawl-urlfilter.txt looks like:

# skip http:, ftp:, mailto: urls
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[...@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# accept everything else
+.*
~

Thanks,
Michela
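The fix described in the resolution, decoding the input URL with UTF-8 before looking up the file, can be illustrated with a short sketch. This is not the committed change to protocol-file: the class and method names are hypothetical, and `URLDecoder.decode(String, Charset)` requires Java 10 or later. Note also that `URLDecoder` turns `+` into a space, which may or may not be wanted for file paths.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class FilePathDecoding {
    /** Decodes percent-escapes such as %28 and %29 in a file: URL path. */
    static String decodePath(String encodedPath) {
        return URLDecoder.decode(encodedPath, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Path taken from the error message in the report.
        String path = "/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/"
            + "A.M._%28album%29_8a09.html";
        System.out.println(decodePath(path));
    }
}
```

With the escapes decoded, the path names the file as it exists on disk, `A.M._(album)_8a09.html`, rather than a literal `%28album%29` name that does not exist.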
Build failed in Hudson: Nutch-trunk #1361
See https://hudson.apache.org/hudson/job/Nutch-trunk/1361/changes

Changes:

[jnioche] NUTCH-824 FileProtocol does not resolve encoded URLs

[jnioche] NUTCH-954 Strict application of Content-Length limit for http protocols (Alexis Detreglode via jnioche)

--
[...truncated 1007 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A