[jira] [Updated] (NUTCH-966) Behavior of NOINDEX,FOLLOW is not intuitive
[ https://issues.apache.org/jira/browse/NUTCH-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-966: --- Fix Version/s: 2.2 1.7 Behavior of NOINDEX,FOLLOW is not intuitive --- Key: NUTCH-966 URL: https://issues.apache.org/jira/browse/NUTCH-966 Project: Nutch Issue Type: Improvement Components: indexer, parser Affects Versions: 1.2 Reporter: Josh Pavel Priority: Minor Fix For: 1.7, 2.2 If a page has NOINDEX,FOLLOW for the ROBOTS metatag, Nutch will still create a document that can be found in the index via metatag or URL matching. Instead, Nutch should rely on doc or parse metadata but nothing should be stored by the html parser. (thanks to Julien Nioche for helping me to understand the issue). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-911) recrawls file protocol causes Errors/Exceptions when actually not modified or gone
[ https://issues.apache.org/jira/browse/NUTCH-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-911:
    Fix Version/s: 1.7

recrawls file protocol causes Errors/Exceptions when actually not modified or gone

Key: NUTCH-911
URL: https://issues.apache.org/jira/browse/NUTCH-911
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.1
Reporter: Peter Lundberg
Priority: Minor
Fix For: 1.7

When recrawling file systems, files are marked as errors and log entries such as the following appear:

java.net.MalformedURLException
    at java.net.URL.<init>(URL.java:601)
    at java.net.URL.<init>(URL.java:464)
    at java.net.URL.<init>(URL.java:413)
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:85)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:627)
fetch of file:/Users/peter.lundberg/Documents/valtech/scan-test/Peter Lundberg 20090929.pdf failed with: java.net.MalformedURLException

This is due to FileResponse and File not working well together. The same is true for files that disappear from the crawled file system after a while (i.e. they are reported as errors instead of GONE). I am too new to Nutch to know the design rationale behind this or any side effects. Below is a patch I have used that cleans up the segment data and removes the false errors from the log file.

--- src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java (revision 997976)
+++ src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java (working copy)
@@ -79,6 +79,10 @@
         if (code == 200) {                          // got a good response
           return new ProtocolOutput(response.toContent());              // return it
 
+        } else if (code == 404) {                   // handle no such file
+          return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_GONE);
+        } else if (code == 304) {                   // handle not modified
+          return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_NOTMODIFIED);
         } else if (code >= 300 && code < 400) {     // handle redirect
           if (redirects == MAX_REDIRECTS)
             throw new FileException("Too many redirects: " + url);
[jira] [Resolved] (NUTCH-813) Repetitive crawl 403 status page
[ https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-813.
    Resolution: Duplicate

The described problem is identical to that of NUTCH-578. The provided patch (call setPageGoneSchedule when the retry counter hits db.fetch.retry.max) is included in all patches of NUTCH-578.

Repetitive crawl 403 status page

Key: NUTCH-813
URL: https://issues.apache.org/jira/browse/NUTCH-813
Project: Nutch
Issue Type: Bug
Affects Versions: 1.1
Reporter: Nguyen Manh Tien
Priority: Minor
Fix For: 1.7
Attachments: ASF.LICENSE.NOT.GRANTED--Patch

When we crawl a page that returns a 403 status, it is crawled again every day with the default schedule, even when we restrict retries with the parameter db.fetch.retry.max.
[jira] [Resolved] (NUTCH-910) Cached.jsp has a bug with encoding
[ https://issues.apache.org/jira/browse/NUTCH-910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-910.
    Resolution: Won't Fix

This is a legacy issue, so we won't be fixing it.

Cached.jsp has a bug with encoding

Key: NUTCH-910
URL: https://issues.apache.org/jira/browse/NUTCH-910
Project: Nutch
Issue Type: Bug
Components: web gui
Affects Versions: 1.0.0
Environment: Any environment
Reporter: Attila Pados
Priority: Minor
Original Estimate: 2m
Remaining Estimate: 2m

cached.jsp: for pages that have a non-default encoding (not UTF-8, etc.), the cached content is displayed garbled. This is quite annoying, but doesn't critically harm functionality.

add:
    Metadata parseData = bean.getParseData(details).getParseMeta();
original:
    Metadata metaData = bean.getParseData(details).getContentMeta();
replace:
    String encoding = (String) parseData.get("CharEncodingForConversion");

In cached.jsp, the encoding variable is retrieved from the wrong metadata source, contentMeta, which doesn't include this value. It resides in the parse metadata instead. The first line above is not a replacement; it has to be added, because the original metadata is still needed there for other things. Further down, the line that reads the encoding value has to be changed; that one is a replacement.

This fix is for Nutch 1.0. I didn't find an issue in the list that covers this, just a mail on a mailing list, found with Google, that referred to it.
[jira] [Updated] (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-923:
    Patch Info: Patch Available
    Fix Version/s: 1.7

Multilingual support for Solr-index-mapping

Key: NUTCH-923
URL: https://issues.apache.org/jira/browse/NUTCH-923
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.2
Reporter: Matthias Agethle
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.7
Attachments: patch-923-nutch-release-1.2.txt

It would be useful to extend the mapping possibilities when indexing to Solr. One useful feature would be to use the detected language of the HTML page (for example via the language-identifier plugin) and send the content to corresponding language-aware Solr fields. The mapping file could be as follows:

    <field dest="lang" source="lang"/>
    <field dest="title_${lang}" source="title"/>

so that the title field gets mapped to title_en for English pages and title_fr for French pages.

What do you think? Could this be useful to others as well? Or are there already other solutions out there?
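For illustration only, here is a minimal sketch of how a destination name such as title_${lang} could be expanded from a document's fields. It is not taken from the attached patch; the class name FieldNameTemplate, the method expand, and the choice to leave unknown placeholders untouched are all assumptions made for this example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or of the attached patch.
final class FieldNameTemplate {
    // Matches placeholders of the form ${name}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    private FieldNameTemplate() {}

    // Replaces each ${name} with the document's value for "name".
    // Assumption: a placeholder with no matching field is left as-is.
    static String expand(String template, Map<String, String> docFields) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = docFields.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a document whose lang field is "en", expand("title_${lang}", fields) yields title_en; a destination without a placeholder, such as lang, passes through unchanged.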
[jira] [Updated] (NUTCH-829) duplicate hadoop temp files
[ https://issues.apache.org/jira/browse/NUTCH-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-829:
    Fix Version/s: 2.2, 1.7

duplicate hadoop temp files

Key: NUTCH-829
URL: https://issues.apache.org/jira/browse/NUTCH-829
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 1.0.0, 1.1
Reporter: Mike Baranczak
Priority: Minor
Fix For: 1.7, 2.2

When two crawls are started at exactly the same time, I see the following error:

{quote}
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/tmp/hadoop-mike/mapred/temp/generate-temp-1276463469075 already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at org.apache.nutch.crawl.Generator.generate(Generator.java:472)
    at org.apache.nutch.crawl.Generator.generate(Generator.java:409)
    [...]
{quote}

I traced it down to this code in Generator (I'm using Nutch 1.0, but this is still in the trunk):

{quote}
Path tempDir = new Path(getConf().get("mapred.temp.dir", ".") + "/generate-temp-" + System.currentTimeMillis());
{quote}

I admit that this is an unlikely scenario for most users, but it just so happens that I ran into it. To absolutely guarantee that the temp directory doesn't already exist, I suggest changing System.currentTimeMillis() to java.util.UUID.randomUUID().toString().
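A minimal sketch of the suggested change, for illustration. The class TempDirNames and its methods are hypothetical, and a plain String stands in for Hadoop's Path so that the example has no dependencies. Note that a random UUID makes a collision extremely unlikely rather than strictly impossible.

```java
import java.util.UUID;

// Hypothetical helper, not part of Nutch; it only builds directory names.
final class TempDirNames {
    private TempDirNames() {}

    // Scheme from the quoted Generator line: two callers that start in the
    // same millisecond produce the same name.
    static String timestampName(String baseDir, long nowMillis) {
        return baseDir + "/generate-temp-" + nowMillis;
    }

    // Suggested scheme: a random (version 4) UUID per call.
    static String uuidName(String baseDir) {
        return baseDir + "/generate-temp-" + UUID.randomUUID().toString();
    }
}
```

In Generator itself the same idea would amount to replacing System.currentTimeMillis() with java.util.UUID.randomUUID().toString() in the expression that builds tempDir, as the report proposes.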
[jira] [Resolved] (NUTCH-625) Non-ascii character broken in dumped content for mixed encoding (utf-8 and multi-byte)
[ https://issues.apache.org/jira/browse/NUTCH-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-625. Resolution: Won't Fix as per Dogacan's comments Non-ascii character broken in dumped content for mixed encoding (utf-8 and multi-byte) -- Key: NUTCH-625 URL: https://issues.apache.org/jira/browse/NUTCH-625 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Vinci Priority: Minor If the crawl db contains both utf-8 non-ascii character and non-utf-8 non-ascii character(i.e. multi-byte character), the dumped contents by readseg utility will have garbled character appear in all of the non-utf8 non-ascii text, and those texts are unable to repair by encoding reload. At the same time, the utf-8 text is normal, only the non-utf8 text broken. Any possible solution available for repairing the broken text? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-609) Allow Plugins to be Loaded from Jar File(s)
[ https://issues.apache.org/jira/browse/NUTCH-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-609: --- Fix Version/s: 2.2 1.7 Allow Plugins to be Loaded from Jar File(s) --- Key: NUTCH-609 URL: https://issues.apache.org/jira/browse/NUTCH-609 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.7, 2.2 Attachments: NUTCH-609-1-20080212.patch Currently plugins cannot be loaded from a jar file. Plugins must be unzipped in one or more directories specified by the plugin.folders config. I have been thinking about an extension to PluginRepository or PluginManifestParser (or both) that would allow plugins to packaged into multiple independent jar files and placed on the classpath. The system would search the classpath for resources with the correct folder name and would load any plugins in those jars. This functionality would be very useful in making the nutch core more flexible in terms of packaging. It would also help with web applications where we don't want to have a plugins directory included in the webapp. Thoughts so far are unzipping those plugin jars into a common temp directory before loading. Another option is using something like commons vfs to interact with the jar files. VFS essential uses a disk based temporary cache for jar files, so it is pretty much the same solution. What are everyone else's thoughts on this? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-670) feed plugin does not parse RSS2 enclosures
[ https://issues.apache.org/jira/browse/NUTCH-670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-670: --- Fix Version/s: 2.2 1.7 feed plugin does not parse RSS2 enclosures -- Key: NUTCH-670 URL: https://issues.apache.org/jira/browse/NUTCH-670 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Todd Lipcon Priority: Minor Fix For: 1.7, 2.2 Original Estimate: 1h Remaining Estimate: 1h The feed parse in plugins/feed does not get count links found in RSS2 enclosure tags as Outlinks. It's a pretty simple patch - SyndEntry has a getEnclosures call. I'll submit the patch tomorrow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-664) Possibility to update already stored documents.
[ https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-664: --- Fix Version/s: 2.2 Possibility to update already stored documents. --- Key: NUTCH-664 URL: https://issues.apache.org/jira/browse/NUTCH-664 Project: Nutch Issue Type: Wish Reporter: Sergey Khilkov Priority: Minor Fix For: 2.2 We have huge index of stored documents. It is high cost procedure to fetch page, merge indexes any time we update some information about page. The information can be changed 1-3 times per day. At this moment we have to store changed info in database, but in this case we have lots of problems with sorting, search restricions and so on. Lucene itself allows delete single document and add new one into existing index. But there is a problem with hadoop... As I understand hadoop filesystem has no possibility to write in random positions. But it will be great feature if nutch will be able to update created index. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-718) urlfilter-subnets plugin
[ https://issues.apache.org/jira/browse/NUTCH-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-718: --- Fix Version/s: 2.2 1.7 urlfilter-subnets plugin Key: NUTCH-718 URL: https://issues.apache.org/jira/browse/NUTCH-718 Project: Nutch Issue Type: New Feature Reporter: Dmitry Lihachev Priority: Minor Fix For: 1.7, 2.2 Attachments: NUTCH-718-nutchbase.patch, NUTCH-718_urlfilter_subnets.patch, NUTCH-718_urlfilter_subnets_v2.patch This plugin filter urls by netmasks in CIDR-notation -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
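As a rough illustration of the matching step such a filter needs, the sketch below tests whether an IPv4 address falls inside a CIDR block. It is not taken from the attached patches: the class CidrBlock is hypothetical, it handles IPv4 only, and it assumes the host has already been resolved to a dotted-quad address.

```java
// Hypothetical helper, not part of Nutch or of the attached patches.
final class CidrBlock {
    private final int network; // network address with host bits cleared
    private final int mask;    // netmask derived from the prefix length

    private CidrBlock(int network, int mask) {
        this.network = network;
        this.mask = mask;
    }

    // Parses notation such as "10.0.0.0/8".
    static CidrBlock parse(String cidr) {
        String[] parts = cidr.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Not CIDR notation: " + cidr);
        }
        int prefix = Integer.parseInt(parts[1]);
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("Bad prefix length: " + cidr);
        }
        // A shift by 32 is a no-op for int in Java, so /0 is handled separately.
        int mask = (prefix == 0) ? 0 : (-1 << (32 - prefix));
        return new CidrBlock(toInt(parts[0]) & mask, mask);
    }

    // True if the dotted-quad address lies inside this block.
    boolean contains(String ipv4) {
        return (toInt(ipv4) & mask) == network;
    }

    private static int toInt(String dotted) {
        String[] octets = dotted.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + dotted);
        }
        int value = 0;
        for (String octet : octets) {
            int o = Integer.parseInt(octet);
            if (o < 0 || o > 255) {
                throw new IllegalArgumentException("Bad octet in: " + dotted);
            }
            value = (value << 8) | o;
        }
        return value;
    }
}
```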
[jira] [Updated] (NUTCH-750) HtmlParser plugin - page title extraction
[ https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-750: --- Fix Version/s: 1.7 HtmlParser plugin - page title extraction - Key: NUTCH-750 URL: https://issues.apache.org/jira/browse/NUTCH-750 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0 Reporter: Alexey Torochkov Priority: Minor Fix For: 1.7 Attachments: SkipBody.patch A little improvement to trying to extract title tag in body if it doesn't exist in head. In current version DOMContentUtils just skip all after body in getTitle() method. Attached patch allows to change this behavior (for default it doesn't change anything) and can cope with webmasters mistakes -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-737) urlnormalizer-unalias plugin
[ https://issues.apache.org/jira/browse/NUTCH-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-737: --- Fix Version/s: 1.7 urlnormalizer-unalias plugin Key: NUTCH-737 URL: https://issues.apache.org/jira/browse/NUTCH-737 Project: Nutch Issue Type: New Feature Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Priority: Minor Fix For: 1.7 Attachments: NUTCH-737_urlfilter_unalias.patch I tried to search any whole site duplication detection tools without success. This plugin allows to do domain name transformation (for example www.google.com - google.com). It is very stupid, but can be useful when fighting with site aliases. For detect site aliases I use my own ugly class (based on SolrDeleteDuplicates). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1345) JAVA_HOME should not be required
[ https://issues.apache.org/jira/browse/NUTCH-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552041#comment-13552041 ]

Ben McCann commented on NUTCH-1345:

You've probably set $NUTCH_JAVA_HOME then. I don't see why that should be required, however, if Java is on your path. It's fine to allow for an override, but it's just one extra thing to do to get set up for most people.

JAVA_HOME should not be required

Key: NUTCH-1345
URL: https://issues.apache.org/jira/browse/NUTCH-1345
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4
Reporter: Ben McCann
Priority: Minor
Attachments: nutch, nutch.patch

Trying to run Nutch spits out the message "Error: JAVA_HOME is not set." I already have java on my path, so I really wish I didn't need to set JAVA_HOME. It's an extra step to get up and running, and it is not updated by Ubuntu's update-alternatives, so it makes it a lot harder to switch between versions of Java.
[jira] [Updated] (NUTCH-690) bug in DomContentUtils.shouldThrowAwayLink?
[ https://issues.apache.org/jira/browse/NUTCH-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-690:
    Fix Version/s: 1.7

bug in DomContentUtils.shouldThrowAwayLink?

Key: NUTCH-690
URL: https://issues.apache.org/jira/browse/NUTCH-690
Project: Nutch
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Peter Sparks
Priority: Minor
Fix For: 1.7

I found a potential bug in DomContentUtils.shouldThrowAwayLink. It returns true for the 5 links at the top of the home page for www.aksteel.com. Here are the links in the source:

    <a id="Search" href="/search/default.aspx"></a>
    <a id="Investor" style="height: 15px;" href="/investor_information/"></a>
    <a id="Markets" href="/markets_products/"></a>
    <a id="Production" href="/production_facilities/"></a>
    <a id="News" href="/news/"></a>

Perhaps I am just ignorant of what this function is supposed to do, but returning true for these 5 links makes that site impossible to crawl.
[jira] [Updated] (NUTCH-589) Hierarchical Classloaders
[ https://issues.apache.org/jira/browse/NUTCH-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-589: --- Fix Version/s: 1.7 Hierarchical Classloaders - Key: NUTCH-589 URL: https://issues.apache.org/jira/browse/NUTCH-589 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Ryan Levering Priority: Minor Fix For: 1.7 Currently the Nutch plugin classloader flattens all the jars from a plugins' dependencies and instantiates a new classloader for each plugin. I think it would be better to create a hierarchical classloader chain. Currently plugins can't pass objects from a common plugin to one another because the objects are created using different classloaders. Nutch currently avoids this by only using interfaces from a common classloader to pass objects between plugins, but I can't see the harm in improving the plugin classloader. It would require a change to PluginDescription and PluginClassLoader in order to override ClassLoader to maintain the export filter functionality that currently exists. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-569) Protocol plugins should report progress to the fetcher
[ https://issues.apache.org/jira/browse/NUTCH-569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-569:
    Fix Version/s: 1.7

Protocol plugins should report progress to the fetcher

Key: NUTCH-569
URL: https://issues.apache.org/jira/browse/NUTCH-569
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki
Priority: Minor
Fix For: 1.7

When downloading very large files over slow connections, protocol plugins spend a long time in Protocol.getProtocolOutput(...). This sometimes leads to a timeout in Fetcher / Fetcher2, with the message "aborting with hung threads".

Protocol plugins should periodically notify their caller about progress. When the call to getProtocolOutput takes a very long time to return, this will help the caller determine whether the wait is justified. Preferably, the callback interface should allow monitoring not only of binary progress / no-progress, but also of the download speed, so that the caller can terminate slow connections. E.g.

{noformat}
interface ProtocolReporter {
  void progress(long bytesDownloaded);
}
{noformat}
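One way a caller might use such a callback is sketched below. Only the ProtocolReporter interface comes from the issue; the class SpeedTrackingReporter, its methods, the injected clock, and the assumption that bytesDownloaded is a cumulative total are all hypothetical choices made for this example.

```java
import java.util.function.LongSupplier;

// Interface as proposed in the issue.
interface ProtocolReporter {
    void progress(long bytesDownloaded);
}

// Hypothetical caller-side implementation, not part of Nutch.
// Assumption: bytesDownloaded is the cumulative total for the current fetch.
final class SpeedTrackingReporter implements ProtocolReporter {
    private final LongSupplier clockMillis; // injected so the logic is testable
    private final long startMillis;
    private long lastBytes;
    private long lastMillis;

    SpeedTrackingReporter(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
        this.startMillis = clockMillis.getAsLong();
        this.lastMillis = startMillis;
    }

    @Override
    public synchronized void progress(long bytesDownloaded) {
        lastBytes = bytesDownloaded;
        lastMillis = clockMillis.getAsLong();
    }

    // Average speed between construction and the latest progress call.
    synchronized double bytesPerSecond() {
        long elapsed = lastMillis - startMillis;
        return (elapsed <= 0) ? 0.0 : lastBytes * 1000.0 / elapsed;
    }

    // True if no progress has been reported for longer than timeoutMillis.
    synchronized boolean isStalled(long timeoutMillis) {
        return clockMillis.getAsLong() - lastMillis > timeoutMillis;
    }
}
```

A fetcher thread could construct the reporter with System::currentTimeMillis, hand it to the protocol plugin, and have a monitor thread poll isStalled and bytesPerSecond to decide whether a long wait is still justified.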
[jira] [Updated] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs
[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-566:
    Fix Version/s: 2.2, 1.7

Sun's URL class has bug in creation of relative query URLs

Key: NUTCH-566
URL: https://issues.apache.org/jira/browse/NUTCH-566
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
Environment: MacOS X and Linux (CentOS 4.5) both
Reporter: Doug Cook
Priority: Minor
Fix For: 1.7, 2.2
Attachments: RelativeURL.java

I'm using 0.8.1, but this will affect all other versions as well. Relative links of the form "?blah" are resolved incorrectly. For example, with a base URL of "http://www.fleurie.org/entreprise.asp" and a relative link of "?id_entrep=111", Nutch will resolve this pair to the link "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers I tried will resolve the pair to "http://www.fleurie.org/entreprise.asp?id_entrep=111".

I tracked this down to what could be called a bug in Sun's URL class. According to Sun's spec, they parse the relative URL according to RFC 2396. But the original RFC for relative links was RFC 1808, and the two RFCs differ in how they handle relative links beginning with "?". Most browsers (Netscape/Mozilla, IE, Safari) implemented RFC 1808 and stuck with it (for compatibility, and also because the behavior makes more sense). Apparently even the people who wrote RFC 2396 recognized that this was a mistake, and the specified behavior was changed in RFC 3986 to match what browsers do. For a discussion of this, see http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query

Sun's URL implementation, however, still implements RFC 2396, as far as I can tell, and is out of step with the rest of the world. This breaks link extraction on a number of sites.

I implemented a simple workaround, which I'm attaching. It is a static method to create URLs which behaves exactly as new URL(URL base, String relativePath), and I use it as a drop-in replacement for that in DOMContentUtils, Javascript link extraction, etc. Obviously, it only really matters wherever links are extracted. I haven't included the calling code from DOMContentUtils, etc. because my local versions are largely rewritten, but it should be pretty obvious. I put it in the org.apache.nutch.net directory, but obviously feel free to move it to another place if you feel it belongs there!
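The attached RelativeURL.java is not reproduced in this message. Purely as an illustration of the idea, a workaround along these lines could look like the sketch below; the class name RelativeQueryUrls and the method resolve are hypothetical. The sketch keeps the base path whenever the target starts with "?" and defers to java.net.URL otherwise. One known limitation: user info in the base URL is dropped in the query-only case.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper written for this example; it is not the attachment.
final class RelativeQueryUrls {
    private RelativeQueryUrls() {}

    // Intended as a drop-in for new URL(base, target), except that
    // query-only targets ("?a=b") keep the base path, as browsers and
    // RFC 3986 do.
    static URL resolve(URL base, String target) throws MalformedURLException {
        String trimmed = target.trim();
        if (trimmed.startsWith("?")) {
            String path = base.getPath(); // path only, without the base query
            if (path.isEmpty()) {
                path = "/";
            }
            // Rebuild from the parts: same scheme, host and port; the base
            // path followed by the new query (and fragment, if any).
            return new URL(base.getProtocol(), base.getHost(), base.getPort(),
                    path + trimmed);
        }
        return new URL(base, trimmed);
    }
}
```

With the base and relative link from the report, resolve returns http://www.fleurie.org/entreprise.asp?id_entrep=111. Targets that do not start with "?" go through the stock constructor unchanged.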
[jira] [Updated] (NUTCH-431) Move plugin specific properties out of nutch-site.xml and into specific conf files for plugins
[ https://issues.apache.org/jira/browse/NUTCH-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-431: --- Fix Version/s: 2.2 1.7 Move plugin specific properties out of nutch-site.xml and into specific conf files for plugins -- Key: NUTCH-431 URL: https://issues.apache.org/jira/browse/NUTCH-431 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: MacBook Pro, Intel Core Duo 2.0 Ghz, 1.5 GB RAM, Mac OSX 10.4 although improvement is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.7, 2.2 Currently, there are many plugin-specific properties that live in the global nutch properties files, nutch-site.xml and nutch-default.xml. These would be things like the protocol-ftp properties, even the protocol-http properties. It would be nice to refactor these properties out, into plugin specific configuration files, that ship with the plugins themselves. Thoughts? Comments? Tomatoes? :-) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implmen
[ https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-427:
    Patch Info: Patch Available
    Fix Version/s: 2.2, 1.7

protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implementation.

Key: NUTCH-427
URL: https://issues.apache.org/jira/browse/NUTCH-427
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 1.0.0
Environment: JAVA - OS independent
Reporter: Armel Nene
Priority: Minor
Fix For: 1.7, 2.2
Attachments: protocol-smb-diff.txt, protocol-smb-dist.zip, protocol-smb.zip, protocol-smb.zip

Title: protocol-smb - Nutch protocol plugin for crawling Microsoft Windows shares
Author: Armel T. Nene
Update: Vadim Bauer
Email: armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r AT g m x . d e

A. Introduction

The protocol-smb plugin allows you to crawl Microsoft Windows shares. It implements the CIFS/SMB protocol, which is commonly used on Microsoft operating systems. The plugin replicates the behaviour of protocol-file over the CIFS/SMB protocol. It uses the JCifs library and also supports all the properties from the JCifs library. You can find more information on the following site: http://jcifs.samba.org/

The smb protocol syntax for crawling is as follows: smb://x (i.e. smb://server/share).

B. Installation

1) Binaries only: The protocol-smb files can be found in the ../plugins directory.
   - Copy the protocol-smb folder to the NUTCHHOME/build/plugins directory.
   - Put the smb.properties file in the NUTCHHOME/conf directory.
   - Configure the properties in the smb.properties file.
   - Enable the plugin by updating the nutch-site.xml file found in the NUTCHHOME/conf directory, e.g.

     <property>
       <name>plugin.includes</name>
       <value>protocol-smb| other plugins...</value>
       <description> </description>
     </property>

2) Source code: The protocol-smb sources can be found in the ../src directory. Always refer to the Nutch wiki for detailed instructions on building Nutch. In short:
   - Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin.
   - Update the build.xml in NUTCHHOME/src/plugin to include the plugin.
   - Update the NUTCHHOME/default.properties file to include the plugin.
   - Run ant to build.
   - Copy the 'smb.properties' file to NUTCHHOME/conf, and configure the properties.
   - Enable the plugin by updating the nutch-site.xml file.

C. Known Issues

1) MalformedURLException: unknown protocol: smb

The SMB URL protocol handler is not being successfully installed. In short, the jCIFS jar must be loaded by the System class loader. Workaround:

   a) A short-term solution is to install the JCIFS jar library found in the protocol-smb folder in JDKHOME/jre/lib/ext and (or) JREHOME/lib/ext.
   b) After completing step a), if the exception is still thrown, set the system property by passing the following argument to the JVM: -Djava.protocol.handler.pkgs=jcifs
   c) You can also set the property in your code. For example, if you start crawling with org.apache.nutch.crawl.Crawl, add the following two lines. This has the same effect as b).

      public static void main(String args[]) throws Exception {
        System.setProperty("java.protocol.handler.pkgs", "jcifs");
        new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write");
        //and so on

   You can also visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html

2) FATAL smb.SMB - Could not read content of protocol: smb://xx

This problem usually occurs if the following properties are not set correctly in the smb.properties
[jira] [Updated] (NUTCH-410) Faster RegexNormalize with more features
[ https://issues.apache.org/jira/browse/NUTCH-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-410: --- Patch Info: Patch Available Fix Version/s: 2.2 1.7 Faster RegexNormalize with more features Key: NUTCH-410 URL: https://issues.apache.org/jira/browse/NUTCH-410 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.8 Environment: Tested on MacOS X 10.4.7/10.4.8 Reporter: Doug Cook Priority: Minor Fix For: 1.7, 2.2 Attachments: betterRegexNorm.patch The patch associated with this is backwards-compatible and has several improvements over the stock 0.8 RegexURLNormalizer: 1) About a 34% performance improvement, from only executing the superclass (BasicURLNormalizer) once in most cases, instead of twice as the stock version did. 2) Support for expensive host-specific normalizations with good performance. Each regex block optionally takes a list of hosts to which to apply the associated regex. If supplied, the regex will only be applied to these hosts. This should have scalable performance; the comparison is O(1) regardless of the number of hosts. The format is: <regex> <host>www.host1.com</host> <host>host2.site2.com</host> <pattern> my pattern here </pattern> <substitution> my substitution here </substitution> </regex> 3) Support for decoding URLs with escaped character encodings (e.g. %20, etc.). This is useful, for example, to decode jump redirects which have the target URL encoded within the source, as on Yahoo. I tried to create an extensible notion of options, the first of which is unescape. The unescape function is applied *after* the substitution and *only* if the substitution pattern matches. A simple pattern to unescape Yahoo directory redirects would be something like: <regex> <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&amp;]+)</pattern> <substitution>$1</substitution> <options>unescape</options> </regex> 4) Added the notion of iterating the pattern chain. 
This is useful when the result of a normalization can itself be normalized. While some of this can be handled in the stock version by repeating patterns, or by careful ordering of patterns, the notion of iterating is cleaner and more powerful. The chain is defined to iterate only when the previous iteration changes the input, up to a configurable maximum number of iterations. The config parameter to change is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous behavior). The change is performance-neutral when disabled, and has a relatively small performance cost when enabled. Pardon any potentially unconventional Java on my part. I've got lots of C/C++ search engine experience, but Nutch is my first large Java app. I welcome any feedback, and hope this is useful. Doug -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
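The iteration described in point 4 of NUTCH-410 could look roughly like the sketch below. It is an illustration only, not the code from betterRegexNorm.patch: the class name IterativeRegexNormalizer and its methods are hypothetical, and the maxIterations constructor argument merely stands in for the urlnormalizer.regex.maxiterations setting. The chain is re-run only while the previous pass changed the URL, so a limit of 1 behaves like a single pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of an iterated regex normalization chain (not the patch itself). */
final class IterativeRegexNormalizer {

  /** One pattern/substitution pair, mirroring a single regex block in the config. */
  private static final class Rule {
    final Pattern pattern;
    final String substitution;

    Rule(String regex, String substitution) {
      this.pattern = Pattern.compile(regex);
      this.substitution = substitution;
    }
  }

  private final List<Rule> rules = new ArrayList<>();
  private final int maxIterations; // assumed stand-in for urlnormalizer.regex.maxiterations

  IterativeRegexNormalizer(int maxIterations) {
    this.maxIterations = maxIterations;
  }

  IterativeRegexNormalizer add(String regex, String substitution) {
    rules.add(new Rule(regex, substitution));
    return this;
  }

  /** Runs the whole chain, repeating only while the previous pass changed the URL. */
  String normalize(String url) {
    String current = url;
    for (int i = 0; i < maxIterations; i++) {
      String next = current;
      for (Rule rule : rules) {
        next = rule.pattern.matcher(next).replaceAll(rule.substitution);
      }
      if (next.equals(current)) {
        break; // nothing changed, so further passes cannot change anything either
      }
      current = next;
    }
    return current;
  }

  public static void main(String[] args) {
    String url = "http://example.com/a/././b";
    // A single pass leaves one "/./" behind because regex matches cannot overlap.
    System.out.println(new IterativeRegexNormalizer(1).add("/\\./", "/").normalize(url));
    // Allowing more passes lets the chain reach a stable result.
    System.out.println(new IterativeRegexNormalizer(5).add("/\\./", "/").normalize(url));
  }
}
```

With a limit of 1 the loop body runs exactly once, which matches the "previous behavior" default mentioned in the issue.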
[jira] [Updated] (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling
[ https://issues.apache.org/jira/browse/NUTCH-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-409: --- Patch Info: Patch Available Fix Version/s: 2.2 1.7 Add short circuit notion to filters to speedup mixed site/subsite crawling Key: NUTCH-409 URL: https://issues.apache.org/jira/browse/NUTCH-409 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.8 Reporter: Doug Cook Priority: Minor Fix For: 1.7, 2.2 Attachments: shortcircuit.patch In the case where one is crawling a mixture of sites and sub-sites, the prefix matcher can match the sites quite quickly, but either the regex or automaton filters are considerably slower matching the sub-sites. In the current model of AND-ing all the filters together, the pattern-matching filter will be run on every site that matches the prefix matcher -- even if that entire site is to be crawled and there are no sub-site patterns. If only a small portion of the sites actually need sub-site pattern matching, this is much slower than it should be. I propose (and attach) a simple modification allowing considerable speedup for this usage pattern. I define the notion of a short circuit match that means accept this URL and don't run any of the remaining filters in the filter chain. Though with this change, any filter plugin can in theory return a short-circuit match, I have only implemented the short-circuit match for the PrefixURLFilter. The configuration file format is backwards-compatible; shortcircuit matches just have SHORTCIRCUIT: in front of them. One minor gotcha: * Because the shortcircuit matches will avoid running any later filters, all of the site-independent filters need to be BEFORE the PrefixURLFilter in the chain. 
I get my best performance using the following filter chain: 1) The SuffixURLFilter to throw away anything with unwanted extensions 2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping mailto:, bulletin-board pages, etc.) 3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT the sites needing subsite matching 4) The AutomatonURLFilter to match those sites needing subsite pattern matching. I have tens of thousands of sites and an order of magnitude fewer subsites, so skipping step #4 90% of the time speeds things up considerably (my reduce time on a round of crawling is down from some 26 hours to less than 10). There are only two drawbacks to the implementation, and I think they're pretty minor: 1) Because I pass a special token (_PASS_) in the place of the URL to implement the short circuit, if for some reason someone wanted to crawl a URL named _PASS_, there would be problems. I find this highly unlikely, since that's an invalid URL. 2) The correct behavior of steps #3 and #4 above depends upon coordination of the config files between the prefix and automaton filters, creating an opportunity for user error. I thought about creating a new kind of filter which essentially combined the prefix and automaton filters' behaviors, took one config file, and internally handled the short-circuiting. But I think the approach I took is simpler, cleaner, more flexible, and avoids creating yet another kind of filter. Coordinating the config files is pretty easy (I generate them programmatically). As this is my first contribution to Nutch I'm sure that there are things I've missed, whether in coding style or desired patch format. I welcome any feedback, suggestions, etc. Doug -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
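The short-circuit idea in NUTCH-409 can be sketched as follows. This is not the code from shortcircuit.patch: the issue describes passing a special _PASS_ token in place of the URL, whereas this sketch swaps in an explicit three-valued result, which is an assumption made for readability. All class, method and host names here are hypothetical, and the simple linear prefix scan and substring check only stand in for the real PrefixURLFilter and AutomatonURLFilter.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a URL filter chain with a short-circuit accept. */
final class ShortCircuitFilterChain {

  /** SHORT_CIRCUIT means: accept the URL and skip every remaining filter. */
  enum Decision { REJECT, ACCEPT, SHORT_CIRCUIT }

  interface UrlFilter {
    Decision filter(String url);
  }

  /** Rejects URLs that end with an unwanted extension. */
  static UrlFilter suffixFilter(List<String> blockedSuffixes) {
    return url -> {
      for (String suffix : blockedSuffixes) {
        if (url.endsWith(suffix)) return Decision.REJECT;
      }
      return Decision.ACCEPT;
    };
  }

  /** Whole-site prefixes short-circuit; other known prefixes fall through to later filters. */
  static UrlFilter prefixFilter(List<String> shortCircuit, List<String> normal) {
    return url -> {
      for (String prefix : shortCircuit) {
        if (url.startsWith(prefix)) return Decision.SHORT_CIRCUIT;
      }
      for (String prefix : normal) {
        if (url.startsWith(prefix)) return Decision.ACCEPT;
      }
      return Decision.REJECT;
    };
  }

  /** Stand-in for the slower sub-site pattern matcher. */
  static UrlFilter subsiteFilter(String requiredPart) {
    return url -> url.contains(requiredPart) ? Decision.ACCEPT : Decision.REJECT;
  }

  /** Runs the filters in order, stopping early on a reject or a short-circuit accept. */
  static boolean accept(List<UrlFilter> chain, String url) {
    for (UrlFilter f : chain) {
      Decision d = f.filter(url);
      if (d == Decision.REJECT) return false;
      if (d == Decision.SHORT_CIRCUIT) return true;
    }
    return true;
  }

  /** Site-independent filters come first, as the issue's "gotcha" requires. */
  static List<UrlFilter> demoChain() {
    return Arrays.asList(
        suffixFilter(Arrays.asList(".gif", ".jpg")),
        prefixFilter(Arrays.asList("http://whole.example/"),
                     Arrays.asList("http://partial.example/")),
        subsiteFilter("/docs/"));
  }

  public static void main(String[] args) {
    List<UrlFilter> chain = demoChain();
    for (String url : Arrays.asList(
        "http://whole.example/blog/post.html",
        "http://whole.example/pic.gif",
        "http://partial.example/docs/a.html",
        "http://partial.example/blog/a.html")) {
      System.out.println(url + " accepted=" + accept(chain, url));
    }
  }
}
```

An explicit result value avoids reserving a magic URL string, and it is close in spirit to the int return code proposed for URLFilters in NUTCH-477.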
[jira] [Updated] (NUTCH-449) Format of junit output should be configurable
[ https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-449: --- Fix Version/s: 2.2 1.7 Format of junit output should be configurable - Key: NUTCH-449 URL: https://issues.apache.org/jira/browse/NUTCH-449 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1 Reporter: Nigel Daley Priority: Minor Fix For: 1.7, 2.2 Attachments: hudson.patch Allow the junit output format to be set by a system property. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-449) Format of junit output should be configurable
[ https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-449: --- Patch Info: Patch Available Format of junit output should be configurable - Key: NUTCH-449 URL: https://issues.apache.org/jira/browse/NUTCH-449 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1 Reporter: Nigel Daley Priority: Minor Fix For: 1.7, 2.2 Attachments: hudson.patch Allow the junit output format to be set by a system property. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-386) Plugin to index categories by url rules
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-386: --- Patch Info: Patch Available Fix Version/s: 1.7 Plugin to index categories by url rules --- Key: NUTCH-386 URL: https://issues.apache.org/jira/browse/NUTCH-386 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Ernesto De Santis Priority: Minor Fix For: 1.7 Attachments: index-url-category-0.1.zip, index-url-category.jar The compressed zip has a install_notes.txt file with instructions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-351) Protocol forward proxy
[ https://issues.apache.org/jira/browse/NUTCH-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-351: --- Patch Info: Patch Available Fix Version/s: 2.2 1.7 Protocol forward proxy -- Key: NUTCH-351 URL: https://issues.apache.org/jira/browse/NUTCH-351 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.8, 0.8.1, 0.9.0 Reporter: Sami Siren Assignee: Sami Siren Priority: Minor Fix For: 1.7, 2.2 Attachments: protocol-http-proxy-adapter.txt The protocol proxy adapter takes advantage of the protocols known to an HTTP forward proxy. Usually these are at least http, https and ftp. You must configure Nutch to use this plugin and to use an HTTP proxy before use. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-346) Improve readability of logs/hadoop.log
[ https://issues.apache.org/jira/browse/NUTCH-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-346: --- Patch Info: Patch Available Fix Version/s: 2.2 1.7 Improve readability of logs/hadoop.log -- Key: NUTCH-346 URL: https://issues.apache.org/jira/browse/NUTCH-346 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: ubuntu dapper Reporter: Renaud Richardet Priority: Minor Fix For: 1.7, 2.2 Attachments: log4j_plugins.diff adding log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN to conf/log4j.properties dramatically improves the readability of the logs in logs/hadoop.log (removes all INFO) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-477) Extend URLFilters to support different filtering chains
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-477: --- Fix Version/s: 1.7 Extend URLFilters to support different filtering chains --- Key: NUTCH-477 URL: https://issues.apache.org/jira/browse/NUTCH-477 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.7 Attachments: urlfilters.patch I propose to make the following changes to URLFilters: * extend URLFilters so that they support different filtering rules depending on the context where they are executed. This functionality mirrors the one that URLNormalizers already support. * change their return value to an int code, in order to support early termination of long filtering chains. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)
[ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-490: --- Patch Info: Patch Available Fix Version/s: 2.2 1.7 Extension point with filters for Neko HTML parser (with patch) -- Key: NUTCH-490 URL: https://issues.apache.org/jira/browse/NUTCH-490 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Environment: Any Reporter: Marcin Okraszewski Priority: Minor Fix For: 1.7, 2.2 Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, nutch-extensionpoins_plugin.xml.diff In my project I need to set filters for the Neko HTML parser. So instead of hard-coding them, I made an extension point to define filters for Neko. I was following the code for HtmlParser filters. In fact, I think the method to get filters could be generalized to handle both cases, but I didn't want to make too big a mess. The attached patch is for Nutch 0.9. This part of the code hasn't changed in trunk, so it should apply easily. BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an extension point itself. Now there are options for Neko and TagSoup. But if someone would like to use something else, or give different settings to the parser, he would need to modify the HtmlParser class instead of replacing a plugin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-248) add support for internationalized domain names
[ https://issues.apache.org/jira/browse/NUTCH-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-248. Resolution: Won't Fix this is legacy add support for internationalized domain names -- Key: NUTCH-248 URL: https://issues.apache.org/jira/browse/NUTCH-248 Project: Nutch Issue Type: Improvement Components: web gui Reporter: Sami Siren Priority: Minor Internationalized domain names are gaining ground, so Nutch should give a little more support to this feature. At the least, we need punycode encoding/decoding functionality so we can display and enter internationalized domain names in the UI. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-213) checkstyle
[ https://issues.apache.org/jira/browse/NUTCH-213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-213: --- Patch Info: Patch Available Fix Version/s: 2.2 1.7 checkstyle -- Key: NUTCH-213 URL: https://issues.apache.org/jira/browse/NUTCH-213 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Stefan Groschupf Assignee: Dennis Kubes Priority: Minor Fix For: 1.7, 2.2 Attachments: checkstyle-all-4.1.jar, checkstyle.patch Adding a checkstyle target to the ant build file to help contributors verify whitespace problems. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-215) Plugin execution order
[ https://issues.apache.org/jira/browse/NUTCH-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-215. Resolution: Won't Fix we can now explicitly specify the order of indexing, parsing etc. plugins. This can be closed as a legacy issue. Plugin execution order -- Key: NUTCH-215 URL: https://issues.apache.org/jira/browse/NUTCH-215 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Enrico Triolo Priority: Minor Attachments: plugin_order.patch This patch allows nutch to automatically guess the correct order of execution of plugins, depending on their dependencies. This means that, for example, if plugin A depends on plugin B (as stated in the plugins.xml file), then B will be executed before A. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag
[ https://issues.apache.org/jira/browse/NUTCH-49?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-49. --- Resolution: Won't Fix This is well and truly a legacy issue. The FetchListTool no longer even exists. Flag for generate to fetch only new pages to complement the -refetchonly flag - Key: NUTCH-49 URL: https://issues.apache.org/jira/browse/NUTCH-49 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Luke Baker Priority: Minor Attachments: fetchnewonly.patch It would be useful, especially for research/testing purposes, to have a flag for the FetchListTool that makes sure the fetchlist only includes URLs that have not already been fetched (according to the information in the webdb that you're generating the fetchlist from). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-737) urlnormalizer-unalias plugin
[ https://issues.apache.org/jira/browse/NUTCH-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-737. - Resolution: Duplicate urlnormalizer-unalias plugin Key: NUTCH-737 URL: https://issues.apache.org/jira/browse/NUTCH-737 Project: Nutch Issue Type: New Feature Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Priority: Minor Fix For: 1.7 Attachments: NUTCH-737_urlfilter_unalias.patch I searched for whole-site duplication detection tools without success. This plugin performs domain name transformation (for example www.google.com -> google.com). It is very simple, but can be useful when fighting site aliases. To detect site aliases I use my own ugly class (based on SolrDeleteDuplicates). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
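The domain name transformation described in NUTCH-737 might be sketched like this. It is not the code from the attached patch: the class HostUnaliaser, its alias table and its methods are all hypothetical, and the sketch assumes the alias-to-canonical host mapping has already been worked out by some separate detection step.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of host unaliasing, e.g. www.google.com to google.com. */
final class HostUnaliaser {

  // alias host (lower case) mapped to its canonical host; assumed to be supplied by the user
  private final Map<String, String> aliases = new HashMap<>();

  HostUnaliaser alias(String aliasHost, String canonicalHost) {
    aliases.put(aliasHost.toLowerCase(Locale.ROOT), canonicalHost);
    return this;
  }

  /** Rewrites only the host part; URLs with unknown hosts are returned untouched. */
  String normalize(String url) {
    URI u = URI.create(url); // throws IllegalArgumentException for a malformed URL
    String host = u.getHost();
    if (host == null) {
      return url;
    }
    String canonical = aliases.get(host.toLowerCase(Locale.ROOT));
    if (canonical == null) {
      return url;
    }
    // Rebuild from the raw components so existing percent-escapes are kept as they are.
    StringBuilder sb = new StringBuilder();
    sb.append(u.getScheme()).append("://");
    if (u.getRawUserInfo() != null) {
      sb.append(u.getRawUserInfo()).append('@');
    }
    sb.append(canonical);
    if (u.getPort() != -1) {
      sb.append(':').append(u.getPort());
    }
    sb.append(u.getRawPath());
    if (u.getRawQuery() != null) {
      sb.append('?').append(u.getRawQuery());
    }
    if (u.getRawFragment() != null) {
      sb.append('#').append(u.getRawFragment());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    HostUnaliaser unaliaser = new HostUnaliaser().alias("www.google.com", "google.com");
    System.out.println(unaliaser.normalize("http://www.google.com/search?q=nutch"));
    System.out.println(unaliaser.normalize("http://example.org/"));
  }
}
```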
[jira] [Updated] (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-693: --- Fix Version/s: 2.2 1.7 Add configurable option for treating nofollow behaviour. Key: NUTCH-693 URL: https://issues.apache.org/jira/browse/NUTCH-693 Project: Nutch Issue Type: New Feature Reporter: Andrew McCall Priority: Minor Fix For: 1.7, 2.2 Attachments: nutch.nofollow.patch For my purposes I'd like to follow links even if they're marked nofollow- Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1513: Fix Version/s: 2.2 1.7 Support Robots.txt for Ftp urls --- Key: NUTCH-1513 URL: https://issues.apache.org/jira/browse/NUTCH-1513 Project: Nutch Issue Type: Improvement Affects Versions: 1.7, 2.2 Reporter: Tejas Patil Priority: Minor Labels: robots.txt Fix For: 1.7, 2.2 As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code, the Ftp plugin does not parse the robots file and accepts all URLs. In _src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_ {noformat} public RobotRules getRobotRules(Text url, CrawlDatum datum) { return EmptyRobotRules.RULES; }{noformat} It's not clear if this was part of the design or if it's a bug. [0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt [1] : ftp://example.com/robots.txt -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment
[ https://issues.apache.org/jira/browse/NUTCH-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1500: Fix Version/s: 1.7 bin/crawl fails on step solrindex with wrong path to segment Key: NUTCH-1500 URL: https://issues.apache.org/jira/browse/NUTCH-1500 Project: Nutch Issue Type: Bug Affects Versions: 1.6 Reporter: Sebastian Nagel Priority: Trivial Fix For: 1.7 Attachments: NUTCH-1500.patch The bin/crawl script calls the command (bin/nutch) solrindex with the wrong path to the segment which causes solrindex to fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-1489) elasticindex should report the indexed documents like solrindex does
[ https://issues.apache.org/jira/browse/NUTCH-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-1489. --- Resolution: Not A Problem This functionality is addressed both when deployed in local mode and via the Hadoop Map output record counters. elasticindex should report the indexed documents like solrindex does Key: NUTCH-1489 URL: https://issues.apache.org/jira/browse/NUTCH-1489 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: Rogério Pereira Araújo Priority: Trivial When I run: nutch elasticindex elasticsearch to index crawled documents in a standard elasticsearch setup, the process takes some time and finishes, but doesn't report how many documents were indexed. It would be nice to have the same feedback as solrindex. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag
[ https://issues.apache.org/jira/browse/NUTCH-49?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552049#comment-13552049 ] Markus Jelsma commented on NUTCH-49: This has been implemented in NUTCH-1248. Flag for generate to fetch only new pages to complement the -refetchonly flag - Key: NUTCH-49 URL: https://issues.apache.org/jira/browse/NUTCH-49 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Luke Baker Priority: Minor Attachments: fetchnewonly.patch It would be useful, especially for research/testing purposes, to have a flag for the FetchListTool that makes sure the fetchlist only includes URLs that have not already been fetched (according to the information in the webdb that you're generating the fetchlist from). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552050#comment-13552050 ] Markus Jelsma commented on NUTCH-693: - Vote for `won't fix`. We also don't implement an ignore.robotstxt option for the above reasons. Add configurable option for treating nofollow behaviour. Key: NUTCH-693 URL: https://issues.apache.org/jira/browse/NUTCH-693 Project: Nutch Issue Type: New Feature Reporter: Andrew McCall Priority: Minor Fix For: 1.7, 2.2 Attachments: nutch.nofollow.patch For my purposes I'd like to follow links even if they're marked nofollow- Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552053#comment-13552053 ] Lewis John McGibbney commented on NUTCH-693: +1 Markus. Please close off when you can. Add configurable option for treating nofollow behaviour. Key: NUTCH-693 URL: https://issues.apache.org/jira/browse/NUTCH-693 Project: Nutch Issue Type: New Feature Reporter: Andrew McCall Priority: Minor Fix For: 1.7, 2.2 Attachments: nutch.nofollow.patch For my purposes I'd like to follow links even if they're marked nofollow- Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-693. --- Resolution: Won't Fix Fix Version/s: (was: 2.2) (was: 1.7) Add configurable option for treating nofollow behaviour. Key: NUTCH-693 URL: https://issues.apache.org/jira/browse/NUTCH-693 Project: Nutch Issue Type: New Feature Reporter: Andrew McCall Priority: Minor Attachments: nutch.nofollow.patch For my purposes I'd like to follow links even if they're marked nofollow- Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1345) JAVA_HOME should not be required
[ https://issues.apache.org/jira/browse/NUTCH-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552082#comment-13552082 ] Sebastian Nagel commented on NUTCH-1345: JAVA_HOME (or NUTCH_JAVA_HOME) is currently used for two things: # using $JAVA_HOME/bin/java as the Java executable # determining the location of lib/tools.jar, which is part of the JDK (not the JRE). It's probably an unneeded artifact, cf. MAPREDUCE-3624 and HADOOP-7374. If JAVA_HOME is not set, bin/nutch definitely refuses to work. I agree that setting an environment variable may be a small hurdle; however, there are arguments in favour of using JAVA_HOME: - I had to install Nutch on many customers' machines where the default java executable on PATH was not the correct one (= 1.6): setting JAVA_HOME is more transparent than manipulating PATH. NUTCH_JAVA_HOME is even more explicit. - backward compatibility: Nutch should be run by the same JVM as before, not accidentally by another one. - staying parallel to Hadoop, which still uses JAVA_HOME Btw., let JAVA_HOME point to /usr/lib/jvm/default-java for Ubuntu's update-alternatives. JAVA_HOME should not be required Key: NUTCH-1345 URL: https://issues.apache.org/jira/browse/NUTCH-1345 Project: Nutch Issue Type: Improvement Affects Versions: 1.4 Reporter: Ben McCann Priority: Minor Attachments: nutch, nutch.patch Trying to run Nutch spits out the message Error: JAVA_HOME is not set. I already have java on my path, so I really wish I didn't need to set JAVA_HOME. It's an extra step to get up and running and is not updated by Ubuntu's update-alternatives, so it makes it a lot harder to switch between versions of Java. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1345) JAVA_HOME should not be required
[ https://issues.apache.org/jira/browse/NUTCH-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552083#comment-13552083 ] Ben McCann commented on NUTCH-1345: --- I think it's fine to allow overriding the version of Java used with JAVA_HOME or NUTCH_JAVA_HOME, but it shouldn't be required. Convention over configuration. There's far too much configuration required for Nutch. JAVA_HOME should not be required Key: NUTCH-1345 URL: https://issues.apache.org/jira/browse/NUTCH-1345 Project: Nutch Issue Type: Improvement Affects Versions: 1.4 Reporter: Ben McCann Priority: Minor Attachments: nutch, nutch.patch Trying to run Nutch spits out the message Error: JAVA_HOME is not set. I already have java on my path, so I really wish I didn't need to set JAVA_HOME. It's an extra step to get up and running and is not updated by Ubuntu's update-alternatives, so it makes it a lot harder to switch between versions of Java. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Build failed in Jenkins: Nutch-trunk #2082
See https://builds.apache.org/job/Nutch-trunk/2082/
--
[...truncated 3965 lines...]
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java:89: warning: [deprecation] delete(Path) in FileSystem has been deprecated
[javac] fs.delete(testDir);
[javac] ^
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java:108: warning: [rawtypes] found raw type: Iterator
[javac] Iterator it = expected.keySet().iterator();
[javac] ^
[javac] missing type arguments for generic class Iterator<E>
[javac] where E is a type-variable:
[javac] E extends Object declared in interface Iterator
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java:123: warning: [deprecation] delete(Path) in FileSystem has been deprecated
[javac] fs.delete(testDir);
[javac] ^
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java:126: warning: [rawtypes] found raw type: TreeSet
[javac] private void createCrawlDb(Configuration config, FileSystem fs, Path crawldb, TreeSet init, CrawlDatum cd) throws Exception {
[javac] ^
[javac] missing type arguments for generic class TreeSet<E>
[javac] where E is a type-variable:
[javac] E extends Object declared in class TreeSet
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java:130: warning: [rawtypes] found raw type: Iterator
[javac] Iterator it = init.iterator();
[javac] ^
[javac] missing type arguments for generic class Iterator<E>
[javac] where E is a type-variable:
[javac] E extends Object declared in interface Iterator
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:71: warning: [rawtypes] found raw type: TreeMap
[javac] TreeMap init1 = new TreeMap();
[javac] ^
[javac] missing type arguments for generic class TreeMap<K,V>
[javac] where K,V are type-variables:
[javac] K extends Object declared in class TreeMap
[javac] V extends Object declared in class TreeMap
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:71: warning: [rawtypes] found raw type: TreeMap
[javac] TreeMap init1 = new TreeMap();
[javac] ^
[javac] missing type arguments for generic class TreeMap<K,V>
[javac] where K,V are type-variables:
[javac] K extends Object declared in class TreeMap
[javac] V extends Object declared in class TreeMap
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:72: warning: [rawtypes] found raw type: TreeMap
[javac] TreeMap init2 = new TreeMap();
[javac] ^
[javac] missing type arguments for generic class TreeMap<K,V>
[javac] where K,V are type-variables:
[javac] K extends Object declared in class TreeMap
[javac] V extends Object declared in class TreeMap
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:72: warning: [rawtypes] found raw type: TreeMap
[javac] TreeMap init2 = new TreeMap();
[javac] ^
[javac] missing type arguments for generic class TreeMap<K,V>
[javac] where K,V are type-variables:
[javac] K extends Object declared in class TreeMap
[javac] V extends Object declared in class TreeMap
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:73: warning: [rawtypes] found raw type: HashMap
[javac] HashMap expected = new HashMap();
[javac] ^
[javac] missing type arguments for generic class HashMap<K,V>
[javac] where K,V are type-variables:
[javac] K extends Object declared in class HashMap
[javac] V extends Object declared in class HashMap
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:73: warning: [rawtypes] found raw type: HashMap
[javac] HashMap expected = new HashMap();
[javac] ^
[javac] missing type arguments for generic class