[jira] Created: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature
document deduplication (exact duplicates) failed using MD5Signature --- Key: NUTCH-835 URL: https://issues.apache.org/jira/browse/NUTCH-835 Project: Nutch Issue Type: Bug Affects Versions: 1.1, 1.0.0 Environment: Linux, Ubuntu 10.04, Java 1.6.0_20 Reporter: Sebastian Nagel The MD5Signature class calculates different signatures for identical documents. The reason is that byte[] data = content.getContent(); ... StringBuilder().append(data) ... uses java.lang.Object.toString() to get a string representation of the (binary) content which results in unique hash codes (e.g., [B@30dc9065) even for two byte arrays with identical content. A solution would be to take the MD5 sum of the binary content as the first part of the final signature calculation (the parsed content is the second part): ... .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText()); Of course, there are many other solutions... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
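The pitfall can be reproduced in a few lines. As a hedged sketch, the example below uses java.security.MessageDigest from the JDK and a local toHex helper in place of Hadoop's MD5Hash and Nutch's StringUtil, which the actual fix would use:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class SignatureDemo {

    // Hex-encode a digest (stand-in for Nutch's StringUtil.toHexString).
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte x : digest) sb.append(String.format("%02x", x & 0xff));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Two distinct byte arrays with identical content:
        byte[] a = "identical content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "identical content".getBytes(StandardCharsets.UTF_8);

        // StringBuilder has no append(byte[]) overload, so append(Object) is
        // chosen and Object.toString() yields an identity string like "[B@30dc9065":
        String sa = new StringBuilder().append(a).toString();
        String sb = new StringBuilder().append(b).toString();
        System.out.println(sa + " vs " + sb);    // almost always two different strings

        System.out.println(Arrays.equals(a, b)); // true: the content is identical

        // Hashing the bytes themselves yields the same value for equal content:
        String ha = toHex(MessageDigest.getInstance("MD5").digest(a));
        String hb = toHex(MessageDigest.getInstance("MD5").digest(b));
        System.out.println(ha.equals(hb));       // true
    }
}
```

The identity strings differ because Object.toString() encodes the object's identity hash code, not its content, which is exactly why two fetches of the same document got different signatures.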
[jira] Created: (NUTCH-862) HttpClient null pointer exception
HttpClient null pointer exception - Key: NUTCH-862 URL: https://issues.apache.org/jira/browse/NUTCH-862 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: linux, java 6 Reporter: Sebastian Nagel Priority: Minor When re-fetching a document (a continued crawl) HttpClient throws a null pointer exception causing the document to be emptied: 2010-07-27 12:45:09,199 INFO fetcher.Fetcher - fetching http://localhost/doc/selfhtml/html/index.htm 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException 2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:138) 2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154) 2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220) 2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537) 2010-07-27 12:45:09,204 INFO fetcher.Fetcher - fetch of http://localhost/doc/selfhtml/html/index.htm failed with: java.lang.NullPointerException Because the document is re-fetched, the server answers "304" (not modified): 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] "GET /doc/selfhtml/html/index.htm HTTP/1.0" 304 174 "-" "Nutch-1.0" No content is sent in this case (empty http body). 
Index: trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java === --- trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java (revision 979647) +++ trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java (working copy) @@ -134,7 +134,8 @@ if (code == 200) throw new IOException(e.toString()); // for codes other than 200 OK, we are fine with empty content } finally { -in.close(); +if (in != null) + in.close(); get.abort(); } -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-862) HttpClient null pointer exception
[ https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-862: -- Attachment: NUTCH-862.patch patch > HttpClient null pointer exception > - > > Key: NUTCH-862 > URL: https://issues.apache.org/jira/browse/NUTCH-862 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.0.0 > Environment: linux, java 6 >Reporter: Sebastian Nagel >Priority: Minor > Attachments: NUTCH-862.patch > > > When re-fetching a document (a continued crawl) HttpClient throws an null > pointer exception causing the document to be emptied: > 2010-07-27 12:45:09,199 INFO fetcher.Fetcher - fetching > http://localhost/doc/selfhtml/html/index.htm > 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:138) > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154) > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220) > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537) > 2010-07-27 12:45:09,204 INFO fetcher.Fetcher - fetch of > http://localhost/doc/selfhtml/html/index.htm failed with: > java.lang.NullPointerException > Because the document is re-fetched the server answers "304" (not modified): > 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] "GET /doc/selfhtml/html/index.htm > HTTP/1.0" 304 174 "-" "Nutch-1.0" > No content is sent in this case (empty http body). 
> Index: > trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java > === > --- > trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java > (revision 979647) > +++ > trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java > (working copy) > @@ -134,7 +134,8 @@ > if (code == 200) throw new IOException(e.toString()); > // for codes other than 200 OK, we are fine with empty content >} finally { > -in.close(); > +if (in != null) > + in.close(); > get.abort(); >} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-933) Fetcher does not save a pages Last-Modified value in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930588#action_12930588 ] Sebastian Nagel commented on NUTCH-933: --- The modifiedTime stored in a CrawlDatum record is not the "Last-Modified" time sent by the responding server (or the time stamp of a file, in case protocol-file is used) but the time a document was fetched. Is there any reason? Determining the "Last-Modified" time is somewhat difficult since it may be specified in the HTTP header or in HTML as a <meta> tag. But it would be nice-to-have information. In addition, the index-more indexing filter, which provides a field "lastModified", does not do the job very well: it should take the value from content meta data (which seems to be mostly correct) and not from parse meta data. Besides: re-crawling with if-modified-since is not affected: there is no difference if the time of the last fetch is sent, because the document must be re-fetched only if it has been modified since the last fetch. > Fetcher does not save a pages Last-Modified value in CrawlDatum > --- > > Key: NUTCH-933 > URL: https://issues.apache.org/jira/browse/NUTCH-933 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.2 >Reporter: Joe Kemp > > I added the following code in the output method just after the If (content > !=null) statement. > String lastModified = metadata.get("Last-Modified"); > if (lastModified !=null && !lastModified.equals("")) { > try { > Date lastModifiedDate = > DateUtil.parseDate(lastModified); > > datum.setModifiedTime(lastModifiedDate.getTime()); > } catch (DateParseException e) { > > } > } > I now get 304 for pages that haven't changed when I recrawl. Need to do > further testing. Might also need a configuration parameter to turn off this > behavior, allowing pages to be forced to be refreshed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
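The quoted patch uses commons-httpclient's DateUtil; a hedged, JDK-only sketch of the same idea (illustrative names, not Fetcher's actual code) parses the Last-Modified header with java.time and falls back to the fetch time:

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class LastModifiedDemo {

    // Parse an HTTP Last-Modified header into epoch milliseconds,
    // falling back to the fetch time if the header is absent or malformed.
    static long modifiedTime(String lastModified, long fetchTime) {
        if (lastModified == null || lastModified.isEmpty()) return fetchTime;
        try {
            return ZonedDateTime.parse(lastModified, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant().toEpochMilli();
        } catch (DateTimeParseException e) {
            return fetchTime; // keep the old behavior on unparseable input
        }
    }

    public static void main(String[] args) {
        long t = modifiedTime("Tue, 27 Jul 2010 10:45:09 GMT", 0L);
        System.out.println(t > 0); // true: header parsed successfully
    }
}
```

Falling back to the fetch time on bad input mirrors the quoted patch's empty catch block, so a broken header never aborts the fetch.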
[jira] Created: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects
max. redirects not handled correctly: fetcher stops at max-1 redirects -- Key: NUTCH-962 URL: https://issues.apache.org/jira/browse/NUTCH-962 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.2, 1.3, 2.0 Reporter: Sebastian Nagel The fetcher stops following redirects one redirect before the max. redirects is reached. The description of http.redirect.max > The maximum number of redirects the fetcher will follow when > trying to fetch a page. If set to negative or 0, fetcher won't immediately > follow redirected URLs, instead it will record them for later fetching. suggests that if set to 1, one redirect will be followed. I tried to crawl two documents, the first redirecting by a meta refresh tag to the second, with http.redirect.max = 1. The second document is not fetched and the URL has state GONE in CrawlDb. fetching file:/test/redirects/meta_refresh.html redirectCount=0 -finishing thread FetcherThread, activeThreads=1 - content redirect to file:/test/redirects/to/meta_refresh_target.html (fetching now) - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html The attached patch would fix this: if http.redirect.max is 1, one redirect is followed. Of course, this would mean there is no possibility to skip redirects at all since 0 (as well as negative values) means "treat redirects as ordinary links". -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
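The off-by-one can be illustrated with a toy model of the fetch loop (names and structure are hypothetical, not the actual Fetcher code): with the corrected condition, http.redirect.max = 1 follows exactly one redirect.

```java
public class RedirectDemo {

    // Follow up to maxRedirects redirects along a chain of URLs.
    // With the off-by-one bug (condition redirectCount >= maxRedirects - 1)
    // the loop gives up one step early, so max = 1 follows no redirect at all.
    static int follow(String[] chain, int maxRedirects) {
        int redirectCount = 0;
        int i = 0;
        while (i + 1 < chain.length) {           // a redirect is still pending
            if (redirectCount >= maxRedirects)   // corrected literal count
                break;                           // "redirect count exceeded"
            redirectCount++;
            i++;                                 // follow the redirect
        }
        return i; // index of the last page actually fetched
    }

    public static void main(String[] args) {
        String[] chain = { "meta_refresh.html", "meta_refresh_target.html" };
        System.out.println(follow(chain, 1)); // 1: the redirect target is fetched
    }
}
```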
[jira] Updated: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects
[ https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-962: -- Attachment: Fetcher_redir.patch patch for 1.3 to respect count of redirects literally: http.redirect.max = 0 (or negative) :: treat redirects as ordinary links http.redirect.max = 1 :: follow max. 1 redirect http.redirect.max = 2 :: follow max. 2 redirects, etc. > max. redirects not handled correctly: fetcher stops at max-1 redirects > -- > > Key: NUTCH-962 > URL: https://issues.apache.org/jira/browse/NUTCH-962 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.2, 1.3, 2.0 >Reporter: Sebastian Nagel > Attachments: Fetcher_redir.patch > > > The fetcher stops following redirects one redirect before the max. redirects > is reached. > The description of http.redirect.max > > The maximum number of redirects the fetcher will follow when > > trying to fetch a page. If set to negative or 0, fetcher won't immediately > > follow redirected URLs, instead it will record them for later fetching. > suggests that if set to 1 that one redirect will be followed. > I tried to crawl two documents the first redirecting by > > to the second with http.redirect.max = 1 > The second document is not fetched and the URL has state GONE in CrawlDb. > fetching file:/test/redirects/meta_refresh.html > redirectCount=0 > -finishing thread FetcherThread, activeThreads=1 > - content redirect to file:/test/redirects/to/meta_refresh_target.html > (fetching now) > - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html > The attached patch would fix this: if http.redirect.max is 1 : one redirect > is followed. > Of course, this would mean there is no possibility to skip redirects at all > since 0 > (as well as negative values) means "treat redirects as ordinary links". -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] [Created] (NUTCH-1344) BasicURLNormalizer to normalize https same as http
Sebastian Nagel created NUTCH-1344: -- Summary: BasicURLNormalizer to normalize https same as http Key: NUTCH-1344 URL: https://issues.apache.org/jira/browse/NUTCH-1344 Project: Nutch Issue Type: Bug Affects Versions: nutchgora, 1.6 Reporter: Sebastian Nagel Most of the normalization done by BasicURLNormalizer (lowercasing host, removing default port, removal of page anchors, cleaning /./ and /../ in the path) is not done for URLs with protocol https. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1344) BasicURLNormalizer to normalize https same as http
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1344: --- Attachment: NUTCH-1344.patch > BasicURLNormalizer to normalize https same as http > --- > > Key: NUTCH-1344 > URL: https://issues.apache.org/jira/browse/NUTCH-1344 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.6 >Reporter: Sebastian Nagel > Attachments: NUTCH-1344.patch > > > Most of the normalization done by BasicURLNormalizer (lowercasing host, > removing default port, removal of page anchors, cleaning . and . in the path) > is not done for URLs with protocol https. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
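A minimal sketch of what protocol-independent normalization looks like, assuming the same rules should apply to both schemes (an illustration built on java.net.URL, not the BasicURLNormalizer code; only the protocol's default port differs between http and https):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HttpsNormalizeDemo {

    // Lowercase the host, drop the protocol's default port (80 for http,
    // 443 for https) and strip the anchor, for http and https alike.
    static String normalize(String urlString) {
        try {
            URL url = new URL(urlString);
            String protocol = url.getProtocol();             // already lowercased by URL
            int defaultPort = "https".equals(protocol) ? 443 : 80;
            StringBuilder sb = new StringBuilder(protocol).append("://")
                    .append(url.getHost().toLowerCase());
            int port = url.getPort();
            if (port != -1 && port != defaultPort)
                sb.append(':').append(port);
            sb.append(url.getPath().isEmpty() ? "/" : url.getPath());
            if (url.getQuery() != null)
                sb.append('?').append(url.getQuery());
            return sb.toString();                            // the anchor (ref) is dropped
        } catch (MalformedURLException e) {
            return urlString; // leave unparseable URLs untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTPS://Example.COM:443/path/index.html#anchor"));
        // https://example.com/path/index.html
    }
}
```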
[jira] [Commented] (NUTCH-1339) Default URL normalization rules to remove page anchors completely
[ https://issues.apache.org/jira/browse/NUTCH-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258827#comment-13258827 ] Sebastian Nagel commented on NUTCH-1339: BasicURLNormalizer does not remove the anchor for https URLs (NUTCH-1344). At least, in my case this was the real reason for the large number of bad URLs. The only motivation for not removing the anchor completely is the rare case that anchor and query parameters are accidentally swapped. > Default URL normalization rules to remove page anchors completely > - > > Key: NUTCH-1339 > URL: https://issues.apache.org/jira/browse/NUTCH-1339 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.6 >Reporter: Sebastian Nagel > Attachments: NUTCH-1339-2.patch, NUTCH-1339.patch > > > The default rules of URLNormalizerRegex remove the anchor up to the first > occurrence of ? or &. The remaining part of the anchor is kept > which may cause a large, possibly infinite number of outlinks when the same > document > fetched again and again with different URLs, > see http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html > Parameters in inner-page anchors are a common practice in AJAX web sites. > Currently, crawling AJAX content is not supported (NUTCH-1323). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
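The intended fix, removing the anchor completely, boils down to a single regex rule. A hedged sketch (the rule text here is illustrative, not the actual regex-normalize.xml entry):

```java
import java.util.regex.Pattern;

public class AnchorStripDemo {

    // Remove the page anchor completely, instead of only up to the
    // first occurrence of ? or & inside it.
    private static final Pattern ANCHOR = Pattern.compile("#.*");

    static String stripAnchor(String url) {
        return ANCHOR.matcher(url).replaceFirst("");
    }

    public static void main(String[] args) {
        // With the old default rules only "#show" would be removed and "?page=2"
        // kept, spawning ever-new URLs for the same document:
        System.out.println(stripAnchor("http://example.com/doc.html#show?page=2"));
        // http://example.com/doc.html
    }
}
```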
[jira] [Commented] (NUTCH-1293) IndexingFiltersChecker to store detected content type in crawldatum metadata
[ https://issues.apache.org/jira/browse/NUTCH-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263124#comment-13263124 ] Sebastian Nagel commented on NUTCH-1293: The content type should be added to metadata after the check for content == null. {noformat} % nutch indexchecker file:/ fetching: file:/ org.apache.nutch.protocol.file.FileError: File Error: 404 ... Exception in thread "main" java.lang.NullPointerException at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71) {noformat} > IndexingFiltersChecker to store detected content type in crawldatum metadata > > > Key: NUTCH-1293 > URL: https://issues.apache.org/jira/browse/NUTCH-1293 > Project: Nutch > Issue Type: Bug >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Attachments: NUTCH-1293-1.5-1.patch > > > NUTCH-1259 is not implemented in the checker. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1323) AjaxNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273954#comment-13273954 ] Sebastian Nagel commented on NUTCH-1323: After a small test crawl on http://si.draagle.com: # usage is cumbersome because you have to carefully think about in which steps to normalize URLs. This is because AjaxNormalizer acts as a flip-flop: hashbang URLs are escaped, escaped ones are unescaped. If URLs are normalized during parsing and then during CrawlDb update, you get the hashbang URL again. # relative hashbang links are not resolved correctly. The outlink of {noformat} base: http://si.draagle.com/?_escaped_fragment_=browse/group/root/ {noformat} should be {noformat} http://si.draagle.com/?_escaped_fragment_=static/draagle_pogoji_uporabe.html {noformat} but hardly {noformat} http://si.draagle.com/?_escaped_fragment_=browse/group/root/&_escaped_fragment_=static/draagle_pogoji_uporabe.html {noformat} # the outlink set of one page with escaped base URL may contain escaped and unescaped URLs simultaneously as results of ** a relative link without hashbang, e.g., {{}} ** a global link with hashbang If I understood it right: * URLs with escaped fragments are used ** in crawlDb, segments, linkDb (URL acts as key) ** for fetching * unescaped hashbang URLs ** are used in the index (and shown to the user) ** may appear in outlinks, redirects, and seeds Couldn't we bind the decision whether to (un)escape to the current normalizer scope: * if URL contains #! and scope is one of { inject, fetcher/redirect, outlink, ?crawldb/update? } => escape * if URL contains _escaped_fragment_= and scope is index => unescape > AjaxNormalizer > -- > > Key: NUTCH-1323 > URL: https://issues.apache.org/jira/browse/NUTCH-1323 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1323-1.6-1.patch > > > A two-way normalizer for Nutch able to deal with AJAX URL's, converting them > to _escaped_fragment_ URL's and back to an AJAX URL. > https://developers.google.com/webmasters/ajax-crawling/ -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
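The two directions of the flip-flop follow Google's AJAX crawling scheme: "#!state" maps to "?_escaped_fragment_=state" and back. A hedged sketch of the two conversions (illustrative helper names, not the AjaxNormalizer code; scope-based dispatch as proposed above would call one or the other):

```java
public class AjaxUrlDemo {

    static final String MARKER = "_escaped_fragment_=";

    // escape: "#!state" -> "?_escaped_fragment_=state" (for fetching / db keys)
    static String escape(String url) {
        int i = url.indexOf("#!");
        if (i < 0) return url;
        String base = url.substring(0, i);
        String fragment = url.substring(i + 2);
        char sep = base.indexOf('?') < 0 ? '?' : '&';
        return base + sep + MARKER + fragment;
    }

    // unescape: "?_escaped_fragment_=state" -> "#!state" (for indexing / display)
    static String unescape(String url) {
        int i = url.indexOf("?" + MARKER);
        if (i < 0) i = url.indexOf("&" + MARKER);
        if (i < 0) return url;
        return url.substring(0, i) + "#!" + url.substring(i + 1 + MARKER.length());
    }

    public static void main(String[] args) {
        String ajax = "http://si.draagle.com/#!browse/group/root/";
        String escaped = escape(ajax);
        System.out.println(escaped);
        // http://si.draagle.com/?_escaped_fragment_=browse/group/root/
        System.out.println(unescape(escaped).equals(ajax)); // true: round-trips
    }
}
```

Because escape(unescape(u)) and unescape(escape(u)) round-trip, binding each direction to a normalizer scope (escape for inject/fetch scopes, unescape for the index scope) would avoid the flip-flop behavior described above.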
[jira] [Created] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception
Sebastian Nagel created NUTCH-1383: -- Summary: IndexingFiltersChecker to show error message instead of null pointer exception Key: NUTCH-1383 URL: https://issues.apache.org/jira/browse/NUTCH-1383 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.5, 1.6 Reporter: Sebastian Nagel Priority: Minor Fix For: 1.6 IndexingFiltersChecker may throw null pointer exceptions if # the content returned by the protocol implementation is null (artifact of NUTCH-1293) # one of the indexing filters sets doc to null (the interface IndexingFilter allows documents to be excluded by returning null, cf. the IndexingFilter of NUTCH-966) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception
[ https://issues.apache.org/jira/browse/NUTCH-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1383: --- Attachment: NUTCH-1383.patch patch for both null pointer exceptions > IndexingFiltersChecker to show error message instead of null pointer exception > -- > > Key: NUTCH-1383 > URL: https://issues.apache.org/jira/browse/NUTCH-1383 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.5, 1.6 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1383.patch > > > IndexingFiltersChecker may throw null pointer exceptions if > # content returned by protocol implementation is null (artifact of NUTCH-1293) > # if one of the indexing filters sets doc to null (the interface > IndexingFilter allows to exclude documents by returning null, cf. the > IndexingFilter of NUTCH-966) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1389) parsechecker and indexchecker to report truncated content
Sebastian Nagel created NUTCH-1389: -- Summary: parsechecker and indexchecker to report truncated content Key: NUTCH-1389 URL: https://issues.apache.org/jira/browse/NUTCH-1389 Project: Nutch Issue Type: Improvement Components: indexer, parser Affects Versions: 1.5, nutchgora Reporter: Sebastian Nagel Priority: Minor ParserChecker and IndexingFiltersChecker should report when a document is truncated due to {http,file,ftp}.content.limit. Truncated content may cause text and metadata extraction to fail for PDF and other binary document formats. A hint that truncation (and not a broken plugin) is the possible reason would be useful. See NUTCH-965 and {{ParseSegment.isTruncated(content)}}. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
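A truncation check along the lines of ParseSegment.isTruncated can compare the Content-Length header against the bytes actually stored. This is a hypothetical standalone sketch, not the Nutch method itself:

```java
public class TruncationCheckDemo {

    // Report whether fetched content was truncated by {http,file,ftp}.content.limit:
    // compare the declared Content-Length against the number of bytes kept.
    static boolean isTruncated(String contentLengthHeader, byte[] content) {
        if (contentLengthHeader == null) return false; // length unknown, assume complete
        try {
            long declared = Long.parseLong(contentLengthHeader.trim());
            return content.length < declared;
        } catch (NumberFormatException e) {
            return false; // unparseable header, assume complete
        }
    }

    public static void main(String[] args) {
        byte[] stored = new byte[65536];  // e.g. capped at http.content.limit = 64 kB
        System.out.println(isTruncated("1048576", stored));
        // true: 1 MB declared, 64 kB kept - a likely cause of a failing PDF parse
    }
}
```

Printing such a warning in parsechecker and indexchecker would point users at the content limit instead of a supposedly broken parser plugin.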
[jira] [Created] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x
Sebastian Nagel created NUTCH-1415: -- Summary: release packages to contain top level folder apache-nutch-x.x Key: NUTCH-1415 URL: https://issues.apache.org/jira/browse/NUTCH-1415 Project: Nutch Issue Type: Bug Affects Versions: nutchgora, 1.6, 1.5.1 Reporter: Sebastian Nagel Priority: Minor The release packages should contain a top level folder named apache-nutch-x.x (x replaced by major and minor version) as in previous releases. Unpacking the packages from the command line via tar xvfz package.tar.gz or unzip package.zip should place all files in that folder. Cf. discussions on mailing lists: * http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E * http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1415: --- Attachment: NUTCH-1415.patch Fix ant targets tar-src, tar-bin, zip-src, zip-bin Also set appropriate permissions for bin/nutch > release packages to contain top level folder apache-nutch-x.x > - > > Key: NUTCH-1415 > URL: https://issues.apache.org/jira/browse/NUTCH-1415 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.6, 1.5.1 >Reporter: Sebastian Nagel >Priority: Minor > Attachments: NUTCH-1415.patch > > > The release packages should contain a top level folder named apache-nutch-x.x > (x replaced by major and minor version) as in previous releases. Unpacking > the packages from the command line via tar xvfz package.tar.gz or unzip > package.zip should place all files in that folder. Cf. discussions on mailing > lists: > * > http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E > * > http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1415: --- Attachment: NUTCH-1415-2.patch Hi Lewis, you are completely right: the tarfileset / zipfileset of the *-bin targets are missing the parameter prefix="${final.name}". Here is a corrected patch; alternatively, add the prefix parameter manually in the four places. > release packages to contain top level folder apache-nutch-x.x > - > > Key: NUTCH-1415 > URL: https://issues.apache.org/jira/browse/NUTCH-1415 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.6, 1.5.1 >Reporter: Sebastian Nagel >Priority: Minor > Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch > > > The release packages should contain a top level folder named apache-nutch-x.x > (x replaced by major and minor version) as in previous releases. Unpacking > the packages from the command line via tar xvfz package.tar.gz or unzip > package.zip should place all files in that folder. Cf. discussions on mailing > lists: > * > http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E > * > http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
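For illustration, the fix amounts to adding a prefix attribute on every tarfileset / zipfileset of the packaging targets. This is a hedged sketch only: the property and directory names below are assumptions, not copied from Nutch's build.xml.

```xml
<!-- sketch of a tar-bin target: prefix="${final.name}" puts every entry
     under apache-nutch-x.x/ when the archive is unpacked; filemode keeps
     bin/nutch executable (names here are illustrative) -->
<tar compression="gzip" longfile="gnu"
     destfile="${dist.dir}/${final.name}-bin.tar.gz">
  <tarfileset dir="${bin.dist.dir}" prefix="${final.name}"
              excludes="bin/*"/>
  <tarfileset dir="${bin.dist.dir}" prefix="${final.name}"
              includes="bin/*" filemode="755"/>
</tar>
```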
[jira] [Created] (NUTCH-1419) parsechecker and indexchecker to report protocol status
Sebastian Nagel created NUTCH-1419: -- Summary: parsechecker and indexchecker to report protocol status Key: NUTCH-1419 URL: https://issues.apache.org/jira/browse/NUTCH-1419 Project: Nutch Issue Type: Improvement Components: indexer, parser Affects Versions: nutchgora, 1.6 Reporter: Sebastian Nagel Priority: Minor Parsechecker and indexchecker should report the protocol status when the fetch was not successful (status other than 200/ok). In case of a redirect, the protocol status contains the URL a redirect points to. Usually, this URL should be checked instead of the original one which is not indexed. The content of a redirect response is less useful (and often empty): {code} % nutch indexchecker http://lucene.apache.org/nutch/ fetching: http://lucene.apache.org/nutch/ parsing: http://lucene.apache.org/nutch/ contentType: text/html content : 301 Moved Permanently Moved Permanently The document has moved here . Apache/2.4.1 (Unix) OpenSSL/1. title : 301 Moved Permanently host : lucene.apache.org tstamp :Tue Jul 03 13:27:32 CEST 2012 url : http://lucene.apache.org/nutch/ {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1419) parsechecker and indexchecker to report protocol status
[ https://issues.apache.org/jira/browse/NUTCH-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1419: --- Attachment: NUTCH-1419-1.patch Simple patch: in case of a protocol status other than 200 (success): # report the protocol status # exit (since those documents are not parsed and indexed when crawling: parsechecker and indexchecker should behave similarly to an "ordinary" crawl) > parsechecker and indexchecker to report protocol status > --- > > Key: NUTCH-1419 > URL: https://issues.apache.org/jira/browse/NUTCH-1419 > Project: Nutch > Issue Type: Improvement > Components: indexer, parser >Affects Versions: nutchgora, 1.6 >Reporter: Sebastian Nagel >Priority: Minor > Attachments: NUTCH-1419-1.patch > > > Parsechecker and indexchecker should report the protocol status when the > fetch was not successful (status other than 200/ok). > In case of a redirect, the protocol status contains the URL a redirect points > to. Usually, this URL should be checked instead of the original one which is > not indexed. The content of a redirect response is less useful (and often > empty): > {code} > % nutch indexchecker http://lucene.apache.org/nutch/ > fetching: http://lucene.apache.org/nutch/ > parsing: http://lucene.apache.org/nutch/ > contentType: text/html > content : 301 Moved Permanently Moved Permanently The document has > moved here . Apache/2.4.1 (Unix) OpenSSL/1. > title : 301 Moved Permanently > host : lucene.apache.org > tstamp :Tue Jul 03 13:27:32 CEST 2012 > url : http://lucene.apache.org/nutch/ > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns
Sebastian Nagel created NUTCH-1421: -- Summary: RegexURLNormalizer to only skip rules with invalid patterns Key: NUTCH-1421 URL: https://issues.apache.org/jira/browse/NUTCH-1421 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora, 1.6 Reporter: Sebastian Nagel Priority: Minor If a regex-normalize.xml file contains one rule with a syntactically invalid regular expression pattern, all rules are discarded and no normalization is done. In combination with a detailed error message, RegexURLNormalizer should skip only the invalid rule and use all other (valid) rules. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
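The proposed behavior is to compile each rule in its own try/catch so one bad pattern cannot poison the whole file. A hedged sketch (illustrative loader, not the RegexURLNormalizer code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RuleLoaderDemo {

    // Compile each rule separately and keep the valid ones, instead of
    // discarding the whole rule file on the first bad pattern.
    static List<Pattern> compileRules(List<String> patterns) {
        List<Pattern> rules = new ArrayList<>();
        for (String p : patterns) {
            try {
                rules.add(Pattern.compile(p));
            } catch (PatternSyntaxException e) {
                // detailed error message, then skip only this rule
                System.err.println("skipping invalid rule '" + p + "': " + e.getDescription());
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        List<Pattern> rules = compileRules(List.of("&sid=[0-9a-f]+", "([unclosed", "#.*"));
        System.out.println(rules.size()); // 2: only the invalid rule was skipped
    }
}
```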
[jira] [Updated] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns
[ https://issues.apache.org/jira/browse/NUTCH-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1421: --- Attachment: NUTCH-1421-1.patch > RegexURLNormalizer to only skip rules with invalid patterns > --- > > Key: NUTCH-1421 > URL: https://issues.apache.org/jira/browse/NUTCH-1421 > Project: Nutch > Issue Type: Improvement >Affects Versions: nutchgora, 1.6 >Reporter: Sebastian Nagel >Priority: Minor > Attachments: NUTCH-1421-1.patch > > > If a regex-normalize.xml file contains one rule with a syntactically invalid > regular expression patterns, all rules are discarded and no normalization is > done. > In combination with a detailed error message, RegexURLNormalizer should only > skip the invalid rule but use all other (valid) rules. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1422) reset signature for redirects
Sebastian Nagel created NUTCH-1422: -- Summary: reset signature for redirects Key: NUTCH-1422 URL: https://issues.apache.org/jira/browse/NUTCH-1422 Project: Nutch Issue Type: Bug Components: crawldb, fetcher Affects Versions: 1.4 Reporter: Sebastian Nagel Fix For: 1.6 In a long-running continuous crawl with Nutch 1.4, URLs with an HTTP redirect (http.redirect.max = 0) are kept as not-modified in the CrawlDb. Short protocol (cf. attached dumped segment / CrawlDb data): 2012-02-23 : injected 2012-02-24 : fetched 2012-03-30 : re-fetched, signature changed 2012-04-20 : re-fetched, redirected 2012-04-24 : in CrawlDb as db_notmodified, still indexed with old content! The signature of a previously fetched document is not reset when the URL/doc is changed to a redirect at a later time. CrawlDbReducer.reduce then sets the status to db_notmodified because the new signature in the fetch status is identical to the old one. Possible fixes (??): * reset the signature in Fetcher * handle this case in CrawlDbReducer.reduce -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1422) reset signature for redirects
[ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1422: --- Attachment: NUTCH-1422_redir_notmodified_log.txt > reset signature for redirects > - > > Key: NUTCH-1422 > URL: https://issues.apache.org/jira/browse/NUTCH-1422 > Project: Nutch > Issue Type: Bug > Components: crawldb, fetcher >Affects Versions: 1.4 >Reporter: Sebastian Nagel > Fix For: 1.6 > > Attachments: NUTCH-1422_redir_notmodified_log.txt > > > In a long running continuous crawl with Nutch 1.4 URLs with a HTTP redirect > (http.redirect.max = 0) are kept as not-modified in the CrawlDb. Short > protocol (cf. attached dumped segment / CrawlDb data): > 2012-02-23 : injected > 2012-02-24 : fetched > 2012-03-30 : re-fetched, signature changed > 2012-04-20 : re-fetched, redirected > 2012-04-24 : in CrawlDb as db_notmodified, still indexed with old content! > The signature of a previously fetched document is not reset when the URL/doc > is changed to a redirect at a later time. CrawlDbReducer.reduce then sets the > status to db_notmodified because the new signature in with fetch status is > identical to the old one. > Possible fixes (??): > * reset the signature in Fetcher > * handle this case in CrawlDbReducer.reduce -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
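The second proposed fix can be made concrete with a small sketch; the enum values and method below are hypothetical stand-ins for the real CrawlDatum status codes and CrawlDbReducer logic:

```java
import java.util.Arrays;

// Sketch: a fetch result that turned into a redirect must never be folded
// into db_notmodified, even if the stored signature still equals the old one.
public class RedirectSignatureCheck {

  enum FetchStatus { FETCH_SUCCESS, FETCH_REDIR_PERM } // hypothetical names

  static boolean isNotModified(FetchStatus status, byte[] oldSig, byte[] newSig) {
    if (status == FetchStatus.FETCH_REDIR_PERM) {
      return false; // redirect: treat the signature as reset
    }
    return oldSig != null && Arrays.equals(oldSig, newSig);
  }

  public static void main(String[] args) {
    byte[] sig = {0x0a, 0x0b};
    // identical signatures, but the URL now redirects: must count as modified
    System.out.println(isNotModified(FetchStatus.FETCH_REDIR_PERM, sig, sig)); // false
    System.out.println(isNotModified(FetchStatus.FETCH_SUCCESS, sig, sig));    // true
  }
}
```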
[jira] [Commented] (NUTCH-1328) a problem with regex-normalize.xml
[ https://issues.apache.org/jira/browse/NUTCH-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410905#comment-13410905 ] Sebastian Nagel commented on NUTCH-1328: Duplicate of NUTCH-706 > a problem with regex-normalize.xml > -- > > Key: NUTCH-1328 > URL: https://issues.apache.org/jira/browse/NUTCH-1328 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.4 >Reporter: behnam nikbakht > Labels: parse > > there is a regex-pattern in regex-normalize.xml: > ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$) > that remove session ids from urls, but there is some sites, like: > http://www.mehrnews.com/fa > that have urls, like: > http://www.mehrnews.com/fa/newsdetail.aspx?NewsID=1567539 > and with this pattern, this url converted to an invalid url: > http://www.mehrnews.com/fa/newsdetail.aspx?New -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-706) Url regex normalizer
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-706: -- Attachment: NUTCH-706.patch - fix the pattern by adding an anchor prohibiting inner-word matches such as in New{color:red}sId{color} - add test > Url regex normalizer > > > Key: NUTCH-706 > URL: https://issues.apache.org/jira/browse/NUTCH-706 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Meghna Kukreja >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-706.patch > > > Hey, > I encountered the following problem while trying to crawl a site using > nutch-trunk. In the file regex-normalize.xml, the following regex is > used to remove session ids: > ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$). > This pattern also transforms a url, such as, > "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it > matches 'sId' in the 'newsId'), which is incorrect and hence does not > get fetched. This expression needs to be changed to prevent this. > Thanks, > Meghna -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
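A standalone demo of the inner-word match and of one possible anchored fix. OLD is the pattern from regex-normalize.xml; FIXED adds a lookbehind so a match may not start in the middle of a word like "newsId" (illustrative only, not necessarily the exact pattern committed with the patch):

```java
// Demonstrates why "NewsID=1567539" gets mangled by the stock session-id
// rule and how a word-boundary anchor prevents it.
public class SessionIdDemo {

  static final String OLD =
      "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)";
  // Hypothetical anchored variant: forbid matches starting inside a word.
  static final String FIXED =
      "(?<![a-zA-Z0-9])([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)";

  static String normalize(String url, String pattern) {
    return url.replaceAll(pattern, "$4"); // keep only the trailing delimiter
  }

  public static void main(String[] args) {
    String url = "http://www.mehrnews.com/fa/newsdetail.aspx?NewsID=1567539";
    System.out.println(normalize(url, OLD));   // truncated to ...newsdetail.aspx?New
    System.out.println(normalize(url, FIXED)); // left untouched
    // A real session id is still stripped by the anchored variant:
    System.out.println(normalize("http://host.test/p?PHPSESSID=a1b2c3&x=1", FIXED));
  }
}
```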
[jira] [Created] (NUTCH-1436) bin/nutch absent in zip package
Sebastian Nagel created NUTCH-1436: -- Summary: bin/nutch absent in zip package Key: NUTCH-1436 URL: https://issues.apache.org/jira/browse/NUTCH-1436 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.5.1 Reporter: Sebastian Nagel The script bin/nutch is absent in the package apache-nutch-1.5.1-bin.zip, the tar-bin package is not affected. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1436) bin/nutch absent in zip package
[ https://issues.apache.org/jira/browse/NUTCH-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1436: --- Attachment: NUTCH-1436.patch Patch for branch-1.5.1 (if a new bin package is desired). For trunk the last patch of NUTCH-1415 is ok. > bin/nutch absent in zip package > --- > > Key: NUTCH-1436 > URL: https://issues.apache.org/jira/browse/NUTCH-1436 > Project: Nutch > Issue Type: Bug > Components: build >Affects Versions: 1.5.1 >Reporter: Sebastian Nagel > Attachments: NUTCH-1436.patch > > > The script bin/nutch is absent in the package apache-nutch-1.5.1-bin.zip, > the tar-bin package is not affected. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-706) Url regex normalizer
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-706: -- Attachment: NUTCH-706-2.patch Second trial for patch. The first one does not remove: {code} ?_sessionID=... {code} Added more tests to cover more types of real session ids and a further counterexample: {code} ?addressid=... {code} > Url regex normalizer > > > Key: NUTCH-706 > URL: https://issues.apache.org/jira/browse/NUTCH-706 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Meghna Kukreja >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-706-2.patch, NUTCH-706.patch > > > Hey, > I encountered the following problem while trying to crawl a site using > nutch-trunk. In the file regex-normalize.xml, the following regex is > used to remove session ids: > ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$). > This pattern also transforms a url, such as, > "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it > matches 'sId' in the 'newsId'), which is incorrect and hence does not > get fetched. This expression needs to be changed to prevent this. > Thanks, > Meghna -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1454) parsing chm failed
Sebastian Nagel created NUTCH-1454: -- Summary: parsing chm failed Key: NUTCH-1454 URL: https://issues.apache.org/jira/browse/NUTCH-1454 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.5.1 Reporter: Sebastian Nagel Priority: Minor Fix For: 1.6, 2.1 (reported by Jan Riewe, see http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html) Nutch fails to parse chm files with {quote} ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp {quote} Tested with chm test files from Tika: {code} % bin/nutch parsechecker file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm {code} Tika parses this document (but does not extract any content). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
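Assuming the missing piece is only the mime-type-to-plugin mapping, a plausible workaround is an entry in conf/parse-plugins.xml routing the chm mime type to the Tika parser (plugin id as used by the default configuration; verify against your install):

```xml
<!-- Hypothetical addition to conf/parse-plugins.xml -->
<mimeType name="application/vnd.ms-htmlhelp">
  <plugin id="parse-tika" />
</mimeType>
```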
[jira] [Created] (NUTCH-1455) RobotRulesParser to match multi-word user-agent names
Sebastian Nagel created NUTCH-1455: -- Summary: RobotRulesParser to match multi-word user-agent names Key: NUTCH-1455 URL: https://issues.apache.org/jira/browse/NUTCH-1455 Project: Nutch Issue Type: Bug Components: protocol Affects Versions: 1.5.1 Reporter: Sebastian Nagel If the user-agent name(s) configured in http.robots.agents contain spaces, they are not matched even if the name is exactly contained in the robots.txt: http.robots.agents = "Download Ninja,*" If the robots.txt (http://en.wikipedia.org/robots.txt) contains {code} User-agent: Download Ninja Disallow: / {code} all content should be forbidden. But it isn't: {code} % curl 'http://en.wikipedia.org/robots.txt' > robots.txt % grep -A1 -i ninja robots.txt User-agent: Download Ninja Disallow: / % cat test.urls http://en.wikipedia.org/ % bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser robots.txt test.urls 'Download Ninja' ... allowed:http://en.wikipedia.org/ {code} The RFC (http://www.robotstxt.org/norobots-rfc.txt) states that bq. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring. Assuming that "Download Ninja" is a substring of itself, it should match, and http://en.wikipedia.org/ should be forbidden. The point is that the agent name from the User-Agent line is split at spaces while the names from the http.robots.agents property are not (they are only split at ","). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
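The substring matching the RFC calls for can be sketched as follows; this is a simplified illustration, not the real RobotRulesParser (which also handles per-agent precedence and rule merging):

```java
import java.util.List;

// Sketch: match configured agent names against a robots.txt "User-agent:"
// value by substring, splitting the config only at ',' and never at spaces,
// so multi-word names like "Download Ninja" stay intact.
public class RobotsAgentMatcher {

  static boolean applies(String userAgentValue, String httpRobotsAgents) {
    String value = userAgentValue.toLowerCase();
    for (String name : httpRobotsAgents.split(",")) {
      String token = name.trim().toLowerCase();
      if (token.equals("*") || value.contains(token)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(applies("Download Ninja", "Download Ninja,*")); // true
    System.out.println(applies("GoogleBot", "Download Ninja"));        // false
  }
}
```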
[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454282#comment-13454282 ] Sebastian Nagel commented on NUTCH-1467: Since nutch.metadata.Metadata, NutchField, and SolrInputField are multi-valued, wouldn't it be preferable to keep the multiple values instead of concatenating them in advance? This would require changing HTMLMetaTags.generalTags so that it can store multiple values. > nutch 1.5.1 not able to parse mutliValued metatags > -- > > Key: NUTCH-1467 > URL: https://issues.apache.org/jira/browse/NUTCH-1467 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: kiran >Priority: Minor > Fix For: 1.6 > > Attachments: patch.txt > > > Hi, > I have been able to parse metatags in an html page using > http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when > there are two metatags with same name but two different contents. > Does anyone encounter this kind of issue ? > Are there any changes that need to be made to the config files to make it > work ? > When there are two tags with same name and different content, it takes the > value of the later tag and saves it rather than creating a multiValue field. > Edit: I have attached the patch for the file and it is provided by DLA > (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. > Many Thanks, -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
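The multi-valued storage suggested in the comment can be sketched with a plain map; the class below is only an illustration of append-instead-of-overwrite semantics, not the actual HTMLMetaTags code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: keep repeated meta tags multi-valued instead of last-wins.
public class MultiValuedMetaTags {

  private final Map<String, List<String>> tags = new HashMap<>();

  void add(String name, String value) { // append, never overwrite
    tags.computeIfAbsent(name.toLowerCase(), k -> new ArrayList<>()).add(value);
  }

  List<String> get(String name) {
    return tags.getOrDefault(name.toLowerCase(), List.of());
  }

  public static void main(String[] args) {
    MultiValuedMetaTags t = new MultiValuedMetaTags();
    t.add("keywords", "nutch");
    t.add("keywords", "solr"); // same name, different content: both kept
    System.out.println(t.get("keywords")); // [nutch, solr]
  }
}
```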
[jira] [Assigned] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1415: -- Assignee: Sebastian Nagel > release packages to contain top level folder apache-nutch-x.x > - > > Key: NUTCH-1415 > URL: https://issues.apache.org/jira/browse/NUTCH-1415 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.6, 1.5.1 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch > > > The release packages should contain a top level folder named apache-nutch-x.x > (x replaced by major and minor version) as in previous releases. Unpacking > the packages from the command line via tar xvfz package.tar.gz or unzip > package.zip should place all files in that folder. Cf. discussions on mailing > lists: > * > http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E > * > http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457753#comment-13457753 ] Sebastian Nagel commented on NUTCH-1415: This has been fixed only for the 1.5.1 and 2.0 branches. It should be fixed for trunk and 2.x before branching 2.1 and 1.6. Are there any objections? Otherwise I would apply the patches tonight and check the resulting packages (cf. NUTCH-1436). > release packages to contain top level folder apache-nutch-x.x > - > > Key: NUTCH-1415 > URL: https://issues.apache.org/jira/browse/NUTCH-1415 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.6, 1.5.1 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch > > > The release packages should contain a top level folder named apache-nutch-x.x > (x replaced by major and minor version) as in previous releases. Unpacking > the packages from the command line via tar xvfz package.tar.gz or unzip > package.zip should place all files in that folder. Cf. discussions on mailing > lists: > * > http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E > * > http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1415. Resolution: Fixed Fix Version/s: 2.1 1.6 committed to trunk (revision 1387357) and 2.x (revision 1387356) > release packages to contain top level folder apache-nutch-x.x > - > > Key: NUTCH-1415 > URL: https://issues.apache.org/jira/browse/NUTCH-1415 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.6, 1.5.1 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Minor > Fix For: 1.6, 2.1 > > Attachments: NUTCH-1415-2.patch, NUTCH-1415.patch > > > The release packages should contain a top level folder named apache-nutch-x.x > (x replaced by major and minor version) as in previous releases. Unpacking > the packages from the command line via tar xvfz package.tar.gz or unzip > package.zip should place all files in that folder. Cf. discussions on mailing > lists: > * > http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E > * > http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-706) Url regex normalizer
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467990#comment-13467990 ] Sebastian Nagel commented on NUTCH-706: --- Are there any objections to applying and committing the patch? Tests pass for both trunk and 2.x. The problem is reported twice. Until there is a more sophisticated URL normalizer (see Ken Krugler's comment) there is no real alternative other than improving the regex pattern. > Url regex normalizer > > > Key: NUTCH-706 > URL: https://issues.apache.org/jira/browse/NUTCH-706 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Meghna Kukreja >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-706-2.patch, NUTCH-706.patch > > > Hey, > I encountered the following problem while trying to crawl a site using > nutch-trunk. In the file regex-normalize.xml, the following regex is > used to remove session ids: > ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$). > This pattern also transforms a url, such as, > "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it > matches 'sId' in the 'newsId'), which is incorrect and hence does not > get fetched. This expression needs to be changed to prevent this. > Thanks, > Meghna -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place
Sebastian Nagel created NUTCH-1476: -- Summary: SegmentReader getStats should set parsed = -1 if no parsing took place Key: NUTCH-1476 URL: https://issues.apache.org/jira/browse/NUTCH-1476 Project: Nutch Issue Type: Bug Affects Versions: 1.6 Reporter: Sebastian Nagel Priority: Trivial Fix For: 1.6 Attachments: NUTCH-1476.patch The method getStats in SegmentReader sets the number of parsed documents (and also the number of parseErrors) to 0 if no parsing took place for a segment. The values should be set to -1 analogous to the number of fetched docs and fetchErrors. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place
[ https://issues.apache.org/jira/browse/NUTCH-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1476: --- Attachment: NUTCH-1476.patch > SegmentReader getStats should set parsed = -1 if no parsing took place > -- > > Key: NUTCH-1476 > URL: https://issues.apache.org/jira/browse/NUTCH-1476 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.6 >Reporter: Sebastian Nagel >Priority: Trivial > Fix For: 1.6 > > Attachments: NUTCH-1476.patch > > > The method getStats in SegmentReader sets the number of parsed documents (and > also the number of parseErrors) to 0 if no parsing took place for a segment. > The values should be set to -1 analogous to the number of fetched docs and > fetchErrors. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (NUTCH-1252) SegmentReader -get shows wrong data
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-1252: -- Assignee: Sebastian Nagel > SegmentReader -get shows wrong data > --- > > Key: NUTCH-1252 > URL: https://issues.apache.org/jira/browse/NUTCH-1252 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.4, 1.5 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel > Fix For: 1.6 > > Attachments: NUTCH-1252.patch, NUTCH-1252-v2.patch > > > The command/option -get of the SegmentReader may show wrong data associated > with the given URL. > To reproduce: > {code} > % mkdir -p test_readseg/urls > % echo -e > "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0"; > > test_readseg/urls/seeds > % nutch inject test_readseg/crawldb test_readseg/urls > Injector: starting at 2012-01-18 09:32:25 > Injector: crawlDb: test_readseg/crawldb > Injector: urlDir: test_readseg/urls > Injector: Converting injected urls to crawl db entries. > Injector: Merging injected urls into crawl db. > Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03 > % nutch generate test_readseg/crawldb test_readseg/segments/ > Generator: starting at 2012-01-18 09:32:30 > Generator: Selecting best-scoring urls due for fetch. > Generator: filtering: true > Generator: normalizing: true > Generator: jobtracker is 'local', generating exactly one partition. > Generator: Partitioning selected urls for politeness. 
> Generator: segment: test_readseg/segments/20120118093232 > Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03 > % nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' > -nocontent -noparse -nofetch -noparsedata -noparsetext > SegmentReader: get 'http://nutch.apache.org/' > Crawl Generate:: > Version: 7 > Status: 1 (db_unfetched) > Fetch time: Wed Jan 18 09:32:26 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 2592000 seconds (30 days) > Score: 10.0 > Signature: null > Metadata: _ngt_: 1326875550401test: AbcTest > {code} > The metadata and the score indicate that the CrawlDatum shown is the wrong > one (that associated to http://abc.test/ but not to http://nutch.apache.org/). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1344) BasicURLNormalizer to normalize https same as http
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471915#comment-13471915 ] Sebastian Nagel commented on NUTCH-1344: Is there any reason why https should be treated differently from http (and ftp)? > BasicURLNormalizer to normalize https same as http > --- > > Key: NUTCH-1344 > URL: https://issues.apache.org/jira/browse/NUTCH-1344 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.6 >Reporter: Sebastian Nagel > Attachments: NUTCH-1344.patch > > > Most of the normalization done by BasicURLNormalizer (lowercasing host, > removing default port, removal of page anchors, cleaning . and . in the path) > is not done for URLs with protocol https. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
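What "normalize https same as http" amounts to can be sketched scheme-independently; this is an illustration of the intended behavior, not the BasicURLNormalizer implementation:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Map;

// Sketch: treat https (and ftp) exactly like http — lowercase the host,
// drop the scheme's default port, drop the page anchor.
public class SchemeNormalizer {

  static final Map<String, Integer> DEFAULT_PORTS =
      Map.of("http", 80, "https", 443, "ftp", 21);

  static String normalize(String urlString) {
    try {
      URL url = new URL(urlString);
      String proto = url.getProtocol().toLowerCase();
      Integer defaultPort = DEFAULT_PORTS.get(proto);
      if (defaultPort == null) {
        return urlString; // unknown scheme: leave untouched
      }
      String host = url.getHost().toLowerCase();
      String port = (url.getPort() == -1 || url.getPort() == defaultPort)
          ? "" : ":" + url.getPort();
      String path = url.getPath().isEmpty() ? "/" : url.getPath();
      String query = (url.getQuery() == null) ? "" : "?" + url.getQuery();
      return proto + "://" + host + port + path + query; // anchor (#...) dropped
    } catch (MalformedURLException e) {
      return urlString; // unparsable: leave untouched
    }
  }

  public static void main(String[] args) {
    System.out.println(normalize("HTTPS://Example.COM:443/a/b.html#sect"));
    // https://example.com/a/b.html
  }
}
```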
[jira] [Updated] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match "newsId"
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-706: -- Fix Version/s: 2.2 Summary: Url regex normalizer: default pattern for session id removal not to match "newsId" (was: Url regex normalizer) > Url regex normalizer: default pattern for session id removal not to match > "newsId" > -- > > Key: NUTCH-706 > URL: https://issues.apache.org/jira/browse/NUTCH-706 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Meghna Kukreja >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-706-2.patch, NUTCH-706.patch > > > Hey, > I encountered the following problem while trying to crawl a site using > nutch-trunk. In the file regex-normalize.xml, the following regex is > used to remove session ids: > ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$). > This pattern also transforms a url, such as, > "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it > matches 'sId' in the 'newsId'), which is incorrect and hence does not > get fetched. This expression needs to be changed to prevent this. > Thanks, > Meghna -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match "newsId"
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-706. --- Resolution: Fixed committed to trunk (revision 1396796) and 2.x (revision 1396795) > Url regex normalizer: default pattern for session id removal not to match > "newsId" > -- > > Key: NUTCH-706 > URL: https://issues.apache.org/jira/browse/NUTCH-706 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Meghna Kukreja >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-706-2.patch, NUTCH-706.patch > > > Hey, > I encountered the following problem while trying to crawl a site using > nutch-trunk. In the file regex-normalize.xml, the following regex is > used to remove session ids: > ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$). > This pattern also transforms a url, such as, > "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it > matches 'sId' in the 'newsId'), which is incorrect and hence does not > get fetched. This expression needs to be changed to prevent this. > Thanks, > Meghna -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1344) BasicURLNormalizer to normalize https same as http
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1344. Resolution: Fixed Fix Version/s: 2.2 1.6 committed to trunk (revision 1396801) and 2.x (revision 1396800) > BasicURLNormalizer to normalize https same as http > --- > > Key: NUTCH-1344 > URL: https://issues.apache.org/jira/browse/NUTCH-1344 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.6 >Reporter: Sebastian Nagel > Fix For: 1.6, 2.2 > > Attachments: NUTCH-1344.patch > > > Most of the normalization done by BasicURLNormalizer (lowercasing host, > removing default port, removal of page anchors, cleaning . and . in the path) > is not done for URLs with protocol https. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-706) Url regex normalizer: default pattern for session id removal not to match "newsId"
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473599#comment-13473599 ] Sebastian Nagel commented on NUTCH-706: --- First commit erroneously with wrong patch. Correct patch (NUTCH-706-2.patch) now committed to trunk (revision 1396817) and 2.x (revision 1396822). > Url regex normalizer: default pattern for session id removal not to match > "newsId" > -- > > Key: NUTCH-706 > URL: https://issues.apache.org/jira/browse/NUTCH-706 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Meghna Kukreja >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-706-2.patch, NUTCH-706.patch > > > Hey, > I encountered the following problem while trying to crawl a site using > nutch-trunk. In the file regex-normalize.xml, the following regex is > used to remove session ids: > ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$). > This pattern also transforms a url, such as, > "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it > matches 'sId' in the 'newsId'), which is incorrect and hence does not > get fetched. This expression needs to be changed to prevent this. > Thanks, > Meghna -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474460#comment-13474460 ] Sebastian Nagel commented on NUTCH-1475: Indeed, a modified time in the future is a bad choice. But CrawlDatum and WebPage both have a field modifiedTime. It should contain the time of the last fetch or (ideally) even the time of the former fetch if the document is not modified. > Nutch 2.1 Index-More Plugin -- A better fall back value for date field > -- > > Key: NUTCH-1475 > URL: https://issues.apache.org/jira/browse/NUTCH-1475 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.1, 1.5.1 > Environment: All >Reporter: James Sullivan >Priority: Minor > Labels: index-more, plugins > Fix For: 1.6, 2.2 > > Attachments: index-more-1xand2x.patch, index-more-2x.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Among other fields, the more plugin for Nutch 2.x provides a "last modified" > and "date" field for the Solr index. The "last modified" field is the last > modified date from the http headers if available, if not available it is left > empty. Currently, the "date" field is the same as the "last modified" field > unless that field is empty in which case getFetchTime is used as a fall back. > I think getFetchTime is not a good fall back as it is the next fetch time and > often a month or more in the future which doesn't make sense for the date > field. Users do not expect webpages/documents with future dates. A more > sensible fallback would be current date at the time it is indexed. > This is possible by simply changing line 97 of > https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java > from > time = page.getFetchTime(); // use fetch time > to > time = new Date().getTime(); > Users interested in the getFetchTime value can still get it from the "tstamp" > field. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1252) SegmentReader -get shows wrong data
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1252. Resolution: Fixed committed to trunk (revision 1397281) > SegmentReader -get shows wrong data > --- > > Key: NUTCH-1252 > URL: https://issues.apache.org/jira/browse/NUTCH-1252 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.4, 1.5 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel > Fix For: 1.6 > > Attachments: NUTCH-1252.patch, NUTCH-1252-v2.patch > > > The command/option -get of the SegmentReader may show wrong data associated > with the given URL. > To reproduce: > {code} > % mkdir -p test_readseg/urls > % echo -e > "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0"; > > test_readseg/urls/seeds > % nutch inject test_readseg/crawldb test_readseg/urls > Injector: starting at 2012-01-18 09:32:25 > Injector: crawlDb: test_readseg/crawldb > Injector: urlDir: test_readseg/urls > Injector: Converting injected urls to crawl db entries. > Injector: Merging injected urls into crawl db. > Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03 > % nutch generate test_readseg/crawldb test_readseg/segments/ > Generator: starting at 2012-01-18 09:32:30 > Generator: Selecting best-scoring urls due for fetch. > Generator: filtering: true > Generator: normalizing: true > Generator: jobtracker is 'local', generating exactly one partition. > Generator: Partitioning selected urls for politeness. 
> Generator: segment: test_readseg/segments/20120118093232 > Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03 > % nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' > -nocontent -noparse -nofetch -noparsedata -noparsetext > SegmentReader: get 'http://nutch.apache.org/' > Crawl Generate:: > Version: 7 > Status: 1 (db_unfetched) > Fetch time: Wed Jan 18 09:32:26 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 2592000 seconds (30 days) > Score: 10.0 > Signature: null > Metadata: _ngt_: 1326875550401test: AbcTest > {code} > The metadata and the score indicate that the CrawlDatum shown is the wrong > one (that associated to http://abc.test/ but not to http://nutch.apache.org/).
[jira] [Resolved] (NUTCH-1476) SegmentReader getStats should set parsed = -1 if no parsing took place
[ https://issues.apache.org/jira/browse/NUTCH-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1476. Resolution: Fixed committed to trunk (revision 1397298) > SegmentReader getStats should set parsed = -1 if no parsing took place > -- > > Key: NUTCH-1476 > URL: https://issues.apache.org/jira/browse/NUTCH-1476 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.6 >Reporter: Sebastian Nagel >Priority: Trivial > Fix For: 1.6 > > Attachments: NUTCH-1476.patch > > > The method getStats in SegmentReader sets the number of parsed documents (and > also the number of parseErrors) to 0 if no parsing took place for a segment. > The values should be set to -1 analogous to the number of fetched docs and > fetchErrors.
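A minimal sketch of the suggested convention (an illustrative class, not the real SegmentReader): keep -1 as the default so a missing parse part is distinguishable from a real parse count of zero.

```java
public class SegmentStatsSketch {
    // -1 mirrors the convention used for fetched/fetchErrors:
    // "no data for this segment part", as opposed to a real count of 0
    long parsed = -1;
    long parseErrors = -1;

    void recordParseStats(long ok, long errors, boolean segmentWasParsed) {
        if (segmentWasParsed) {
            parsed = ok;
            parseErrors = errors;
        }
        // otherwise keep -1 instead of reporting a misleading 0
    }

    public static void main(String[] args) {
        SegmentStatsSketch stats = new SegmentStatsSketch();
        stats.recordParseStats(0, 0, false); // segment without parse data
        System.out.println("parsed: " + stats.parsed); // -1, not 0
    }
}
```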
[jira] [Resolved] (NUTCH-1383) IndexingFiltersChecker to show error message instead of null pointer exception
[ https://issues.apache.org/jira/browse/NUTCH-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1383. Resolution: Fixed committed to trunk (revision 1397308) > IndexingFiltersChecker to show error message instead of null pointer exception > -- > > Key: NUTCH-1383 > URL: https://issues.apache.org/jira/browse/NUTCH-1383 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.5, 1.6 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1383.patch > > > IndexingFiltersChecker may throw null pointer exceptions if > # content returned by protocol implementation is null (artifact of NUTCH-1293) > # if one of the indexing filters sets doc to null (the interface > IndexingFilter allows to exclude documents by returning null, cf. the > IndexingFilter of NUTCH-966)
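The two NPE cases can be guarded as in this standalone sketch (names are invented; the real chain works on NutchDocument and IndexingFilter, not strings): check for null content up front, and stop the chain with a message when a filter excludes the document by returning null.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class FilterChainSketch {

    /**
     * Run a document through a chain of filters. A filter may return null to
     * exclude the document; the chain stops with a message instead of passing
     * null to the next filter (the NPE reported in this issue).
     */
    public static String applyFilters(String doc, List<UnaryOperator<String>> filters) {
        if (doc == null) {
            System.err.println("Error: no content fetched for this URL");
            return null;
        }
        for (UnaryOperator<String> f : filters) {
            doc = f.apply(doc);
            if (doc == null) {
                System.err.println("Document excluded by an indexing filter");
                return null;
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> filters =
            Arrays.asList(d -> d.trim(), d -> d.isEmpty() ? null : d);
        System.out.println(applyFilters(" some text ", filters)); // passes both filters
        System.out.println(applyFilters("   ", filters));          // excluded by second
    }
}
```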
[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482644#comment-13482644 ] Sebastian Nagel commented on NUTCH-1467: Hi Kiran, thanks for the patch. After a look at it: * instead of replacing {{Properties generalTags}} in HTMLMetaTags.java by a {{HashMap}} it seems preferable to use the class {{metadata.Metadata}}: ** provides the required methods *** add one more value to an array of values *** {{toString()}} etc. ** would shorten the code significantly ** sufficiently tested (own JUnit test) * in addition to {{parse.html.HTMLMetaProcessor.java}} also {{parse.tika.HTMLMetaProcessor.java}} needs to be modified Also, as Julien mentioned, a test would be useful. Added NUTCH-1467-TEST-1.patch as a first draft. Can you have a look at the test? Are all situations covered? Promising: test passes with the current patch applied :) > nutch 1.5.1 not able to parse mutliValued metatags > -- > > Key: NUTCH-1467 > URL: https://issues.apache.org/jira/browse/NUTCH-1467 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: kiran >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1467-TEST-1.patch, NUTCH-1467-trunk.patch, > Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, > Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt > > > Hi, > I have been able to parse metatags in an html page using > http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when > there are two metatags with same name but two different contents. > Does anyone encounter this kind of issue ? > Are there any changes that need to be made to the config files to make it > work ? > When there are two tags with same name and different content, it takes the > value of the later tag and saves it rather than creating a multiValue field. 
> Edit: I have attached the patch for the file and it is provided by DLA > (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. > Many Thanks,
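What the switch to {{metadata.Metadata}} buys can be shown with a minimal stand-in (this is not the Nutch class, only the relevant behaviour): add() appends to an array of values instead of overwriting, so two meta tags with the same name both survive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiValueMeta {
    private final Map<String, List<String>> data = new HashMap<>();

    /** Append one more value under the given name (never overwrite). */
    public void add(String name, String value) {
        data.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    public String[] getValues(String name) {
        return data.getOrDefault(name, new ArrayList<>()).toArray(new String[0]);
    }

    public static void main(String[] args) {
        MultiValueMeta meta = new MultiValueMeta();
        // two <meta name="keywords" ...> tags with different content:
        meta.add("keywords", "nutch");
        meta.add("keywords", "crawler");
        System.out.println(meta.getValues("keywords").length); // both values kept
    }
}
```
With a plain `Properties`/`put`-style store the second `add` would replace the first value, which is exactly the bug reported here.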
[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1467: --- Attachment: NUTCH-1467-TEST-1.patch > nutch 1.5.1 not able to parse mutliValued metatags > -- > > Key: NUTCH-1467 > URL: https://issues.apache.org/jira/browse/NUTCH-1467 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: kiran >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1467-TEST-1.patch, NUTCH-1467-trunk.patch, > Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, > Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt > > > Hi, > I have been able to parse metatags in an html page using > http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when > there are two metatags with same name but two different contents. > Does anyone encounter this kind of issue ? > Are there any changes that need to be made to the config files to make it > work ? > When there are two tags with same name and different content, it takes the > value of the later tag and saves it rather than creating a multiValue field. > Edit: I have attached the patch for the file and it is provided by DLA > (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. > Many Thanks,
[jira] [Resolved] (NUTCH-1421) RegexURLNormalizer to only skip rules with invalid patterns
[ https://issues.apache.org/jira/browse/NUTCH-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1421. Resolution: Fixed Fix Version/s: 2.2 1.6 committed to trunk (rev. 1401459) and 2.x (rev. 1401460) > RegexURLNormalizer to only skip rules with invalid patterns > --- > > Key: NUTCH-1421 > URL: https://issues.apache.org/jira/browse/NUTCH-1421 > Project: Nutch > Issue Type: Improvement >Affects Versions: nutchgora, 1.6 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-1421-1.patch > > > If a regex-normalize.xml file contains one rule with a syntactically invalid > regular expression pattern, all rules are discarded and no normalization is > done. > In combination with a detailed error message, RegexURLNormalizer should only > skip the invalid rule but use all other (valid) rules.
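The suggested behaviour, sketched as a standalone rule loader (names are hypothetical; the real RegexURLNormalizer reads its rules from regex-normalize.xml): compile each pattern individually, then log and skip only the broken one instead of discarding the whole rule set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RuleLoaderSketch {

    /** Compile each pattern; log and skip invalid ones instead of dropping all. */
    public static List<Pattern> loadRules(List<String> patterns) {
        List<Pattern> rules = new ArrayList<>();
        for (String p : patterns) {
            try {
                rules.add(Pattern.compile(p));
            } catch (PatternSyntaxException e) {
                // detailed error message, but the remaining rules stay active
                System.err.println("Skipping invalid rule '" + p + "': " + e.getMessage());
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("&sid=[0-9a-f]+", "([+"); // second is invalid
        System.out.println(loadRules(patterns).size()); // only the valid rule survives
    }
}
```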
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: --- Attachment: NUTCH-1245-578-TEST-1.patch JUnit test to catch this problem and NUTCH-578: a large patch for a test but the idea is to extend it to test also other transitions of CrawlDatum states. > URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb > and is generated over and over again > > > Key: NUTCH-1245 > URL: https://issues.apache.org/jira/browse/NUTCH-1245 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.4, 1.5 >Reporter: Sebastian Nagel >Priority: Critical > Fix For: 1.6 > > Attachments: NUTCH-1245-578-TEST-1.patch > > > A document gone with 404 after db.fetch.interval.max (90 days) has passed > is fetched over and over again but although fetch status is fetch_gone > its status in CrawlDb keeps db_unfetched. Consequently, this document will > be generated and fetched from now on in every cycle. 
> To reproduce: > # create a CrawlDatum in CrawlDb which retry interval hits > db.fetch.interval.max (I manipulated the shouldFetch() in > AbstractFetchSchedule to achieve this) > # now this URL is fetched again > # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to > db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 > days) > # this does not change with every generate-fetch-update cycle, here for two > segments: > {noformat} > /tmp/testcrawl/segments/20120105161430 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:14:21 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:14:48 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > /tmp/testcrawl/segments/20120105161631 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:16:23 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:20:05 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > {noformat} > As far as I can see it's caused by setPageGoneSchedule() in > AbstractFetchSchedule. 
Some pseudo-code: > {code} > setPageGoneSchedule (called from update / CrawlDbReducer.reduce): > datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * > maxInterval > datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516 > if (maxInterval < datum.fetchInterval) // necessarily true >forceRefetch() > forceRefetch: > if (datum.fetchInterval > maxInterval) // true because it's 1.35 * > maxInterval >datum.fetchInterval = 0.9 * maxInterval > datum.status = db_unfetched // > shouldFetch (called from generate / Generator.map): > if ((datum.fetchTime - curTime) > maxInterval) >// always true if the crawler is launched in short intervals >// (lower than 0.35 * maxInterval) >datum.fetchTime = curTime // forces a refetch > {code} > After setPageGoneSchedule is called via update the state is db_unfetched and > the retry interval 0.9 * db.fetch.interval.max (81 days). > Although the fetch time in the CrawlDb is far in the future > {noformat} > % nutch readdb testcrawl/crawldb -url http://localhost/page_gone > URL: http://localhost/page_gone > Version: 7 > Status: 1 (db_unfetched) > Fetch time: Sun May 06 05:20:05 CEST 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Score: 1.0 > Signature: null > Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone > {noformat} > the URL is generated again because (fetch time - current time) is larger than > db.fetch.interval.max. > The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35, and > the fetch time is always close to current time + 1.35 * db.fetch.interval.max. > It's possibly a s
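The oscillation in the pseudo-code can be simulated directly (constants and names are illustrative, not the AbstractFetchSchedule API):

```java
public class GoneScheduleSketch {
    static final long DAY = 24L * 3600 * 1000;
    static final long MAX_INTERVAL = 90 * DAY; // db.fetch.interval.max (90 days)

    /** Mirrors the pseudo-code: setPageGoneSchedule including forceRefetch. */
    static long setPageGoneSchedule(long fetchInterval) {
        fetchInterval = (long) (fetchInterval * 1.5); // 0.9 * max -> 1.35 * max
        if (fetchInterval > MAX_INTERVAL) {
            // forceRefetch: interval capped back, status reset to db_unfetched
            fetchInterval = (long) (MAX_INTERVAL * 0.9);
        }
        return fetchInterval;
    }

    public static void main(String[] args) {
        long interval = (long) (MAX_INTERVAL * 0.9); // 81 days, as in the dumps
        for (int cycle = 1; cycle <= 3; cycle++) {
            interval = setPageGoneSchedule(interval);
            System.out.println("cycle " + cycle + ": " + interval / DAY + " days");
        }
        // every update cycle peaks at 1.35 * max inside the call, trips
        // forceRefetch, and lands back at 0.9 * max with status db_unfetched
    }
}
```
The interval therefore never escapes the 0.9/1.35 band, matching the "6998400 seconds (81 days)" lines in the segment dumps above.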
[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486144#comment-13486144 ] Sebastian Nagel commented on NUTCH-1482: +1 > Rename HTMLParseFilter > -- > > Key: NUTCH-1482 > URL: https://issues.apache.org/jira/browse/NUTCH-1482 > Project: Nutch > Issue Type: Task > Components: parser >Affects Versions: 1.5.1 >Reporter: Julien Nioche > > See NUTCH-861 for a background discussion. We have changed the name in 2.x to > better reflect what it does and I think we should do the same for 1.x. > any objections?
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: --- Attachment: NUTCH-1245-1.patch FetchSchedule.setPageGoneSchedule is called exclusively for a fetch_gone in CrawlDbReducer.reduce. Is there a need to call forceRefetch just after a fetch leads to a fetch_gone (assumed there is little delay between fetch and updatedb)? Attached patch sets the fetchInterval to db.fetch.interval.max and does not call forceRefetch. > URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb > and is generated over and over again > > > Key: NUTCH-1245 > URL: https://issues.apache.org/jira/browse/NUTCH-1245 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.4, 1.5 >Reporter: Sebastian Nagel >Priority: Critical > Fix For: 1.6 > > Attachments: NUTCH-1245-1.patch, NUTCH-1245-578-TEST-1.patch > > > A document gone with 404 after db.fetch.interval.max (90 days) has passed > is fetched over and over again but although fetch status is fetch_gone > its status in CrawlDb keeps db_unfetched. Consequently, this document will > be generated and fetched from now on in every cycle. 
> To reproduce: > # create a CrawlDatum in CrawlDb which retry interval hits > db.fetch.interval.max (I manipulated the shouldFetch() in > AbstractFetchSchedule to achieve this) > # now this URL is fetched again > # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to > db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 > days) > # this does not change with every generate-fetch-update cycle, here for two > segments: > {noformat} > /tmp/testcrawl/segments/20120105161430 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:14:21 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:14:48 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > /tmp/testcrawl/segments/20120105161631 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:16:23 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:20:05 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > {noformat} > As far as I can see it's caused by setPageGoneSchedule() in > AbstractFetchSchedule. 
Some pseudo-code: > {code} > setPageGoneSchedule (called from update / CrawlDbReducer.reduce): > datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * > maxInterval > datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516 > if (maxInterval < datum.fetchInterval) // necessarily true >forceRefetch() > forceRefetch: > if (datum.fetchInterval > maxInterval) // true because it's 1.35 * > maxInterval >datum.fetchInterval = 0.9 * maxInterval > datum.status = db_unfetched // > shouldFetch (called from generate / Generator.map): > if ((datum.fetchTime - curTime) > maxInterval) >// always true if the crawler is launched in short intervals >// (lower than 0.35 * maxInterval) >datum.fetchTime = curTime // forces a refetch > {code} > After setPageGoneSchedule is called via update the state is db_unfetched and > the retry interval 0.9 * db.fetch.interval.max (81 days). > Although the fetch time in the CrawlDb is far in the future > {noformat} > % nutch readdb testcrawl/crawldb -url http://localhost/page_gone > URL: http://localhost/page_gone > Version: 7 > Status: 1 (db_unfetched) > Fetch time: Sun May 06 05:20:05 CEST 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Score: 1.0 > Signature: null > Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone > {noformat} > the URL is generated again because (fetch time - current time) is larger than > db.fetch.int
[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486290#comment-13486290 ] Sebastian Nagel commented on NUTCH-1482: Markus, you are right: I remember the API change of HTMLParseFilter in 1.0: it took me some hours to get the custom plugins compiled. - is it possible to deprecate the extension point and keep it for some time? - at least, place a warning in CHANGES.txt with a link to update instructions in the wiki > Rename HTMLParseFilter > -- > > Key: NUTCH-1482 > URL: https://issues.apache.org/jira/browse/NUTCH-1482 > Project: Nutch > Issue Type: Task > Components: parser >Affects Versions: 1.5.1 >Reporter: Julien Nioche > > See NUTCH-861 for a background discussion. We have changed the name in 2.x to > better reflect what it does and I think we should do the same for 1.x. > any objections?
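Deprecating instead of removing could look like this (illustrative interfaces, not the actual Nutch extension point): keep the old name as a deprecated subtype of the new one, so existing plugins compile with a warning for a release or two.

```java
interface ParseFilter {
    String filter(String content);
}

/** @deprecated use {@link ParseFilter} instead; kept for source compatibility. */
@Deprecated
interface HTMLParseFilter extends ParseFilter {
}

public class DeprecationSketch {
    public static void main(String[] args) {
        // an old plugin still implementing HTMLParseFilter keeps working:
        HTMLParseFilter legacy = content -> content.toLowerCase();
        ParseFilter asNew = legacy; // usable wherever the new type is expected
        System.out.println(asNew.filter("OLD Plugin"));
    }
}
```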
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1245: --- Attachment: NUTCH-1245-2.patch NUTCH-1245-578-TEST-2.patch Improved patches > URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb > and is generated over and over again > > > Key: NUTCH-1245 > URL: https://issues.apache.org/jira/browse/NUTCH-1245 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.4, 1.5 >Reporter: Sebastian Nagel >Priority: Critical > Fix For: 1.6 > > Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch, > NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch > > > A document gone with 404 after db.fetch.interval.max (90 days) has passed > is fetched over and over again but although fetch status is fetch_gone > its status in CrawlDb keeps db_unfetched. Consequently, this document will > be generated and fetched from now on in every cycle. > To reproduce: > # create a CrawlDatum in CrawlDb which retry interval hits > db.fetch.interval.max (I manipulated the shouldFetch() in > AbstractFetchSchedule to achieve this) > # now this URL is fetched again > # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to > db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 > days) > # this does not change with every generate-fetch-update cycle, here for two > segments: > {noformat} > /tmp/testcrawl/segments/20120105161430 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:14:21 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:14:48 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry 
interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > /tmp/testcrawl/segments/20120105161631 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:16:23 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:20:05 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > {noformat} > As far as I can see it's caused by setPageGoneSchedule() in > AbstractFetchSchedule. Some pseudo-code: > {code} > setPageGoneSchedule (called from update / CrawlDbReducer.reduce): > datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * > maxInterval > datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516 > if (maxInterval < datum.fetchInterval) // necessarily true >forceRefetch() > forceRefetch: > if (datum.fetchInterval > maxInterval) // true because it's 1.35 * > maxInterval >datum.fetchInterval = 0.9 * maxInterval > datum.status = db_unfetched // > shouldFetch (called from generate / Generator.map): > if ((datum.fetchTime - curTime) > maxInterval) >// always true if the crawler is launched in short intervals >// (lower than 0.35 * maxInterval) >datum.fetchTime = curTime // forces a refetch > {code} > After setPageGoneSchedule is called via update the state is db_unfetched and > the retry interval 0.9 * db.fetch.interval.max (81 days). 
> Although the fetch time in the CrawlDb is far in the future > {noformat} > % nutch readdb testcrawl/crawldb -url http://localhost/page_gone > URL: http://localhost/page_gone > Version: 7 > Status: 1 (db_unfetched) > Fetch time: Sun May 06 05:20:05 CEST 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Score: 1.0 > Signature: null > Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone > {noformat} > the URL is generated again because (fetch time - current time) is larger than > db.fetch.interval.max. > The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35, and > the fetch time is always close to current time + 1.35 * db.fetch.interval.max. > It's possibly a side effect of NUTCH-516, and may
[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486484#comment-13486484 ] Sebastian Nagel commented on NUTCH-578: --- NUTCH-1245 provides a test to catch this problem. Attached v5 patch: * call setPageGoneSchedule in CrawlDbReducer.reduce when retry counter is hit and status is set to db_gone. All attached patches do this: it will set the fetchInterval to a value larger than one day, so that from now on the URL is not fetched again and again. * reset the retry counter in setPageGoneSchedule so that it cannot overflow and to get again 3 trials after db.max.fetch.interval is reached. > URL fetched with 403 is generated over and over again > - > > Key: NUTCH-578 > URL: https://issues.apache.org/jira/browse/NUTCH-578 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.0.0 > Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I > have checked out the most recent version of the trunk as of Nov 20, 2007 >Reporter: Nathaniel Powell >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: crawl-urlfilter.txt, NUTCH-578.patch, > NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, > NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt > > > I have not changed the following parameter in the nutch-default.xml: > > db.fetch.retry.max > 3 > The maximum number of times a url that has encountered > recoverable errors is generated for fetch. > > However, there is a URL which is on the site that I'm crawling, > www.teachertube.com, which keeps being generated over and over again for > almost every segment (many more times than 3): > fetch of http://www.teachertube.com/images/ failed with: Http code=403, > url=http://www.teachertube.com/images/ > This is a bug, right? > Thanks.
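The behaviour described for the v5 patch, as a standalone sketch (field and method names are invented; the real logic lives in CrawlDbReducer and AbstractFetchSchedule): once the retry counter reaches db.fetch.retry.max, mark the page gone so the gone schedule backs off, and reset the counter so it cannot overflow.

```java
public class RetryCounterSketch {
    static final int RETRY_MAX = 3; // db.fetch.retry.max

    int retries = 0;
    String status = "db_unfetched";

    /** Called from updatedb when a fetch ended in a transient error (e.g. 403). */
    void onRetriableFailure() {
        retries++;
        if (retries >= RETRY_MAX) {
            // retry counter exhausted: treat as gone so the gone schedule
            // backs off and the URL stops being generated every cycle
            status = "db_gone";
            retries = 0; // the reset proposed here; its trade-offs are
                         // discussed in a follow-up comment in this thread
        }
    }

    public static void main(String[] args) {
        RetryCounterSketch datum = new RetryCounterSketch();
        for (int i = 0; i < 3; i++) datum.onRetriableFailure();
        System.out.println(datum.status); // gone after db.fetch.retry.max tries
    }
}
```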
[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-578: -- Attachment: NUTCH-578_v5.patch > URL fetched with 403 is generated over and over again > - > > Key: NUTCH-578 > URL: https://issues.apache.org/jira/browse/NUTCH-578 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.0.0 > Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I > have checked out the most recent version of the trunk as of Nov 20, 2007 >Reporter: Nathaniel Powell >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: crawl-urlfilter.txt, NUTCH-578.patch, > NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, > NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt > > > I have not changed the following parameter in the nutch-default.xml: > > db.fetch.retry.max > 3 > The maximum number of times a url that has encountered > recoverable errors is generated for fetch. > > However, there is a URL which is on the site that I'm crawling, > www.teachertube.com, which keeps being generated over and over again for > almost every segment (many more times than 3): > fetch of http://www.teachertube.com/images/ failed with: Http code=403, > url=http://www.teachertube.com/images/ > This is a bug, right? > Thanks.
[jira] [Commented] (NUTCH-1370) Expose exact number of urls injected @runtime
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487316#comment-13487316 ] Sebastian Nagel commented on NUTCH-1370: +1 Would be nice to see also the number of injected URLs rejected by URL filters. > Expose exact number of urls injected @runtime > -- > > Key: NUTCH-1370 > URL: https://issues.apache.org/jira/browse/NUTCH-1370 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.6, 2.2 > > > Example: When using trunk, currently we see > {code} > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: starting at > 2012-05-22 09:04:00 > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: crawlDb: > crawl/crawldb > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: urlDir: urls > 2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Converting injected > urls to crawl db entries. > 2012-05-22 09:04:00,955 INFO plugin.PluginRepository - Plugins: looking in: > {code} > I would like to see > {code} > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: starting at > 2012-05-22 09:04:00 > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: crawlDb: > crawl/crawldb > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: urlDir: urls > 2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Injected N urls to > crawl/crawldb > 2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Converting injected > urls to crawl db entries. > 2012-05-22 09:04:00,955 INFO plugin.PluginRepository - Plugins: looking in: > {code} > This would make debugging easier and would help those who end up getting > {code} > 2012-05-22 09:04:04,850 WARN crawl.Generator - Generator: 0 records selected > for fetching, exiting ... > {code}
[jira] [Commented] (NUTCH-578) URL fetched with 403 is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487318#comment-13487318 ] Sebastian Nagel commented on NUTCH-578: --- Resetting the retry counter in setPageGoneSchedule has some disadvantages: * the information is lost that the db_gone results from a number of unsuccessful fetches due to transient errors * maybe you do not want to "get again 3 trials after db.max.fetch.interval is reached". If a page has been fetched 3 times in a row with a 403 and we try again after one month and get a 403 again, we do not need 3 trials any more. > URL fetched with 403 is generated over and over again > - > > Key: NUTCH-578 > URL: https://issues.apache.org/jira/browse/NUTCH-578 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.0.0 > Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I > have checked out the most recent version of the trunk as of Nov 20, 2007 >Reporter: Nathaniel Powell >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: crawl-urlfilter.txt, NUTCH-578.patch, > NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, > NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt > > > I have not changed the following parameter in the nutch-default.xml: > > db.fetch.retry.max > 3 > The maximum number of times a url that has encountered > recoverable errors is generated for fetch. > > However, there is a URL which is on the site that I'm crawling, > www.teachertube.com, which keeps being generated over and over again for > almost every segment (many more times than 3): > fetch of http://www.teachertube.com/images/ failed with: Http code=403, > url=http://www.teachertube.com/images/ > This is a bug, right? > Thanks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
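The retry-counter bookkeeping discussed in this comment is easy to get wrong: CrawlDatum keeps the retry count in a byte field, so an unguarded increment on every fetch_retry eventually overflows (the overflow concern later filed as NUTCH-1247). A minimal sketch of the failure and a saturating increment — plain Java, not the actual CrawlDatum code:

```java
public class RetryCounterDemo {
    public static void main(String[] args) {
        // CrawlDatum stores retriesSinceFetch in a byte, so a plain
        // increment wraps around once it reaches Byte.MAX_VALUE:
        byte retries = Byte.MAX_VALUE;
        byte wrapped = (byte) (retries + 1);
        System.out.println(wrapped);   // -128

        // A saturating increment avoids the overflow without resetting the
        // counter, so the "this page failed many times" information survives:
        byte saturated = retries < Byte.MAX_VALUE ? (byte) (retries + 1) : retries;
        System.out.println(saturated); // 127
    }
}
```

Saturating (rather than resetting to 0) matches the concern above that resetting loses the history of transient failures.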
[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488146#comment-13488146 ] Sebastian Nagel commented on NUTCH-1483: Confirmed. The problem is caused by the rule {code} (? Can't crawl filesystem with protocol-file plugin > > > Key: NUTCH-1483 > URL: https://issues.apache.org/jira/browse/NUTCH-1483 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.6, 2.1 > Environment: OpenSUSE 12.1, OpenJDK 1.6.0 >Reporter: Rogério Pereira Araújo > > I tried to follow the same steps described in this wiki page: > http://wiki.apache.org/nutch/IntranetDocumentSearch > I made all required changes on regex-urlfilter.txt and added the following > entry in my seed file: > file:///home/rogerio/Documents/ > The permissions are ok, I'm running nutch with the same user as folder owner, > so nutch has all the required permissions, unfortunately I'm getting the > following error: > org.apache.nutch.protocol.file.FileError: File Error: 404 > at > org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:105) > at > org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:514) > fetch of file://home/rogerio/Documents/ failed with: > org.apache.nutch.protocol.file.FileError: File Error: 404 > Why the logs are showing file://home/rogerio/Documents/ instead of > file:///home/rogerio/Documents/ ??? > Note: The regex-urlfilter entry only works as expected if I add the entry > +^file://home/rogerio/Documents/ instead of +^file:///home/rogerio/Documents/ > as wiki says. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
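The effect of a slash-collapsing normalizer rule on file: URLs can be reproduced with a small regex experiment. The pattern below is illustrative only, not the exact rule from regex-normalize.xml: it collapses runs of slashes except directly after an http/https/ftp scheme, so the empty-authority slashes of a file: URL get mangled.

```java
public class SlashCollapseDemo {
    public static void main(String[] args) {
        // Illustrative rule: collapse runs of slashes unless they follow
        // an http/https/ftp scheme delimiter.
        String rule = "(?<!(https?|ftp):)/{2,}";

        // http URLs survive; duplicate path slashes are cleaned up:
        System.out.println("http://host//a//b".replaceAll(rule, "/"));
        // -> http://host/a/b

        // file: is not covered by the lookbehind, so the slashes after the
        // scheme are collapsed and the URL is broken:
        System.out.println("file:///home/rogerio/Documents/".replaceAll(rule, "/"));
        // -> file:/home/rogerio/Documents/
    }
}
```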
[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1483: --- Affects Version/s: 1.6
[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488200#comment-13488200 ] Sebastian Nagel commented on NUTCH-1483: I tried with 1.x/trunk. For 2.x, URLs with only one slash break the handling of reversed URLs. Have you tried removing the regex normalizer rule?
[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1483: --- Attachment: NUTCH-1483.patch StringUtils.split(String, char) does not preserve empty parts: host is empty in case of file: URLs. Patch includes a test case.
[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488254#comment-13488254 ] Sebastian Nagel commented on NUTCH-1483: Rogério, can you apply the patch, re-compile and try again?
[jira] [Created] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs
Sebastian Nagel created NUTCH-1484: -- Summary: TableUtil unreverseURL fails on file:// URLs Key: NUTCH-1484 URL: https://issues.apache.org/jira/browse/NUTCH-1484 Project: Nutch Issue Type: Bug Affects Versions: 2.1 Reporter: Sebastian Nagel Priority: Critical Fix For: 2.2 (reported by Rogério Pereira Araújo, see NUTCH-1483) When crawling the local filesystem TableUtil.unreverseURL fails for URLs with empty host part (file:///Documents/). StringUtils.split(String, char) does not preserve empty parts which causes: {code} java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98) {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
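The root cause reported above is that the splitter silently drops empty fields. The same contrast can be shown with the JDK alone: StringTokenizer drops empty tokens much like Commons Lang's StringUtils.split(String, char), while String.split with a negative limit preserves them like the splitPreserveAllTokens variant used in the fix. The key layout below is illustrative, not the exact TableUtil format:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class EmptyTokenDemo {
    public static void main(String[] args) {
        // A reversed file: URL key has an empty host field before the ':'
        String key = ":file/home/rogerio/Documents";

        // Empty-dropping split: the host field vanishes, so indexing the
        // parts by position later throws ArrayIndexOutOfBoundsException.
        List<String> lossy = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(key, ":");
        while (st.hasMoreTokens()) lossy.add(st.nextToken());
        System.out.println(lossy.size());     // 1 -- the empty host is gone

        // Preserve-all split: the empty host survives as "".
        String[] all = key.split(":", -1);
        System.out.println(all.length);       // 2
        System.out.println(all[0].isEmpty()); // true
    }
}
```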
[jira] [Commented] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488558#comment-13488558 ] Sebastian Nagel commented on NUTCH-1483: Thanks! Issue with un-reversing URLs pulled out to NUTCH-1484 since it's more critical (no work-around). Fixing the URL normalizers (and filters, see last comment) will take more time. Btw., {{file://localhost/Documents/}} is the only legal form according to [RFC 1738|http://tools.ietf.org/html/rfc1738] (1994) while {{file:///Documents/}} is allowed by [RFC 3986|http://tools.ietf.org/html/rfc3986]: {quote} the "file" URI scheme is defined so that no authority, an empty host, and "localhost" all mean the end-user's machine {quote} Maybe we could also make protocol-file more lenient.
[jira] [Comment Edited] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488558#comment-13488558 ] Sebastian Nagel edited comment on NUTCH-1483 at 11/1/12 8:55 AM: - Thanks! Issue with un-reversing URLs pulled out to NUTCH-1484 since it's more critical (no work-around). Fixing the URL normalizers (and filters, see last comment) will take more time. Btw., {{file://localhost/Documents/}} is the only legal form according to [RFC 1738|http://tools.ietf.org/html/rfc1738] (1994) while {{file:///Documents/}} is allowed by [RFC 3986|http://tools.ietf.org/html/rfc3986] (2005): {quote} the "file" URI scheme is defined so that no authority, an empty host, and "localhost" all mean the end-user's machine {quote} Maybe we could also make protocol-file more lenient.
[jira] [Created] (NUTCH-1485) TableUtil reverseURL to keep userinfo part
Sebastian Nagel created NUTCH-1485: -- Summary: TableUtil reverseURL to keep userinfo part Key: NUTCH-1485 URL: https://issues.apache.org/jira/browse/NUTCH-1485 Project: Nutch Issue Type: Improvement Affects Versions: 2.1 Reporter: Sebastian Nagel Priority: Minor The reversed URL key does not contain the userinfo part of a URL (user name and password: {{ftp://user:passw...@ftp.xyz/file.txt}}; cf. [RFC 3986|http://tools.ietf.org/html/rfc3986] and [http://en.wikipedia.org/wiki/URI_scheme]). Keeping the userinfo would make it easy to crawl a fixed list of protected content. However, URLs with userinfo can be tricky, e.g. [http://cnn.com&story=breaking_news@199.239.136.200/mostpopular], so it is fine if the default is to remove the userinfo. But this should be done in the default URL normalizers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
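java.net.URI illustrates both halves of the point above: the userinfo that the reversed key currently drops, and the deceptive-URL case where everything before the '@' is userinfo and the real host is the IP address. The credentials below are made up for the example:

```java
import java.net.URI;

public class UserInfoDemo {
    public static void main(String[] args) throws Exception {
        // The userinfo part that reverseURL currently discards
        // (user name and password are illustrative):
        URI u = new URI("ftp://user:secret@ftp.example.org/file.txt");
        System.out.println(u.getUserInfo()); // user:secret
        System.out.println(u.getHost());     // ftp.example.org

        // The tricky case from the issue: everything before '@' is parsed
        // as userinfo, so the real host is the IP address:
        URI t = new URI("http://cnn.com&story=breaking_news@199.239.136.200/mostpopular");
        System.out.println(t.getHost());     // 199.239.136.200
    }
}
```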
[jira] [Commented] (NUTCH-1461) Problem with TableUtil
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488585#comment-13488585 ] Sebastian Nagel commented on NUTCH-1461: Cf. NUTCH-1484: same error with file:// URLs which do not contain a host. > Problem with TableUtil > -- > > Key: NUTCH-1461 > URL: https://issues.apache.org/jira/browse/NUTCH-1461 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: nutchgora > Environment: Debian / CDH3 / Nutch 2.0 Release >Reporter: Christian Johnsson > Attachments: regex-urlfilter.txt, TabelUtil_Fix.patch > > > Affects parse and updatedb. > I think I got some misformatted urls into hbase but I can't find them. > It generates this error though. If I empty hbase and restart, it goes for a > couple of million pages indexed, then the error comes up again. Any tips on how to > locate which row in the table generates this error? > 2012-08-28 01:48:10,871 WARN org.apache.hadoop.mapred.Child: Error running > child > java.lang.ArrayIndexOutOfBoundsException: 1 > at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98) > at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:102) > at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:76) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) > at org.apache.hadoop.mapred.Child$4.run(Child.java:266) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) > at org.apache.hadoop.mapred.Child.main(Child.java:260) > 2012-08-28 01:48:10,875 INFO org.apache.hadoop.mapred.Task: Running cleanup > for the task -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488935#comment-13488935 ] Sebastian Nagel commented on NUTCH-1245: They are not duplicates but the effects are similar: NUTCH-1245 - caused by calling forceRefetch just after a fetch leads to a fetch_gone. If the fetchInterval is close to db.fetch.interval.max, setPageGoneSchedule calls forceRefetch. That's useless since we just got a 404 (or within the last day(s) for large crawls). - proposed fix: setPageGoneSchedule should not call forceRefetch but keep the fetchInterval within/below db.fetch.interval.max NUTCH-578 - although the status of a page fetched 3 times (db.fetch.retry.max) with a transient error (fetch_retry) is set to db_gone, the fetchInterval is still only incremented by one day. So the next day this page is fetched again. - every fetch_retry still increments the retry counter so that it may overflow (NUTCH-1247) - fix: ** call setPageGoneSchedule in CrawlDbReducer.reduce when the retry limit is hit and the status is set to db_gone. All patches (by various users/committers) agree on this: it will set the fetchInterval to a value larger than one day, so that from now on the URL is not fetched over and over again. ** reset the retry counter to 0 or prohibit an overflow. I'm not sure what the best solution is, see comments on NUTCH-578. Markus, it would be great if you could start with a look at the JUnit patch. It has two aims: catch the error and make analysis easier (it logs a lot). I would like to extend the test to other CrawlDatum state transitions: these are complex for continuous crawls in combination with retry counters, intervals, signatures, etc. An exhaustive test could ensure that we do not break other state transitions. 
> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb > and is generated over and over again > > > Key: NUTCH-1245 > URL: https://issues.apache.org/jira/browse/NUTCH-1245 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.4, 1.5 >Reporter: Sebastian Nagel >Priority: Critical > Fix For: 1.6 > > Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch, > NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch > > > A document gone with 404 after db.fetch.interval.max (90 days) has passed > is fetched over and over again but although fetch status is fetch_gone > its status in CrawlDb keeps db_unfetched. Consequently, this document will > be generated and fetched from now on in every cycle. > To reproduce: > # create a CrawlDatum in CrawlDb which retry interval hits > db.fetch.interval.max (I manipulated the shouldFetch() in > AbstractFetchSchedule to achieve this) > # now this URL is fetched again > # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to > db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 > days) > # this does not change with every generate-fetch-update cycle, here for two > segments: > {noformat} > /tmp/testcrawl/segments/20120105161430 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:14:21 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:14:48 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > /tmp/testcrawl/segments/20120105161631 > SegmentReader: get 'http://localhost/page_gone' > Crawl 
Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:16:23 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:20:05 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > {noformat} > As far as I can see it's caused by setPageGoneSchedule() in > AbstractFetchSchedule. Some pseudo-code: > {code} > setPageGoneSchedule (called from update / CrawlDbReducer.reduce): > datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * > maxInterval > datum.fetchTime = fet
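The scheduling arithmetic in the pseudo-code above, together with the proposed fix of capping the interval instead of calling forceRefetch, can be sketched in a few lines. Constants and names are illustrative, not the actual AbstractFetchSchedule code:

```java
public class GoneScheduleSketch {
    public static void main(String[] args) {
        final int DAY = 24 * 3600;
        int maxInterval = 90 * DAY;    // db.fetch.interval.max (90 days)
        int fetchInterval = 81 * DAY;  // 0.9 * max, the 6998400s from the report

        // setPageGoneSchedule grows the interval by 1.5x; here that exceeds
        // the maximum, which in the current code triggers forceRefetch:
        fetchInterval = (int) (1.5f * fetchInterval);
        System.out.println(fetchInterval > maxInterval); // true

        // Proposed fix: clamp at the maximum instead of forcing a refetch,
        // so the gone page is not re-generated every cycle.
        if (fetchInterval > maxInterval) {
            fetchInterval = maxInterval;
        }
        System.out.println(fetchInterval / DAY);         // 90
    }
}
```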
[jira] [Created] (NUTCH-1488) bin/nutch to run junit from any directory
Sebastian Nagel created NUTCH-1488: -- Summary: bin/nutch to run junit from any directory Key: NUTCH-1488 URL: https://issues.apache.org/jira/browse/NUTCH-1488 Project: Nutch Issue Type: Improvement Affects Versions: 1.5.1, 2.1 Reporter: Sebastian Nagel Priority: Trivial It should be possible to run a JUnit test via {{bin/nutch junit}} (see [http://wiki.apache.org/nutch/bin/nutch%20junit] and NUTCH-672) from any directory, not only from {{runtime/local/}}. All parts of the class path are absolute but {{test/classes/}} is relative. Is there any reason for this? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
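The absolute-vs-relative distinction behind this report can be checked with java.io.File: a relative classpath entry resolves against the current working directory, which is exactly why the command only works from {{runtime/local/}}. The absolute path below is illustrative:

```java
import java.io.File;

public class ClasspathEntryDemo {
    public static void main(String[] args) {
        // A relative entry resolves against the current working directory,
        // so "bin/nutch junit" only finds it when run from runtime/local/:
        File relative = new File("test/classes");
        System.out.println(relative.isAbsolute()); // false

        // An absolute entry works from any directory (path is illustrative):
        File absolute = new File("/opt/nutch/runtime/local/test/classes");
        System.out.println(absolute.isAbsolute()); // true
    }
}
```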
[jira] [Updated] (NUTCH-1488) bin/nutch to run junit from any directory
[ https://issues.apache.org/jira/browse/NUTCH-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1488: --- Attachment: NUTCH-1488.patch
[jira] [Commented] (NUTCH-1496) ParserJob logs skipped urls with level info
[ https://issues.apache.org/jira/browse/NUTCH-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494950#comment-13494950 ] Sebastian Nagel commented on NUTCH-1496: +1 > ParserJob logs skipped urls with level info > --- > > Key: NUTCH-1496 > URL: https://issues.apache.org/jira/browse/NUTCH-1496 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.1 >Reporter: Nathan Gass >Priority: Trivial > Attachments: patch-parserjob-log-level-2012.txt > > > ParserJob is the only one which logs *all* skipped urls with level info. > Attached patch changes this to level debug, the same level already used by > FetcherJob, IndexerJob, and GeneratorJob. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1484: --- Attachment: NUTCH-1484.patch Revised patch: replaced StringUtils.splitByWholeSeparatorPreserveAllTokens(String str, String separator) with splitPreserveAllTokens(String str, char separator), which is significantly faster (as fast as StringUtils.split(String, char)).
[jira] [Comment Edited] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494952#comment-13494952 ] Sebastian Nagel edited comment on NUTCH-1484 at 11/11/12 7:56 PM: -- Revised patch: replaced StringUtils.splitByWholeSeparatorPreserveAllTokens(String str, String separator) with splitPreserveAllTokens(String str, char separator), which is significantly faster (as fast as StringUtils.split(String, char)).
[jira] [Resolved] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs
[ https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1484. Resolution: Fixed Committed to 2.x (rev. 1408465) > TableUtil unreverseURL fails on file:// URLs > > > Key: NUTCH-1484 > URL: https://issues.apache.org/jira/browse/NUTCH-1484 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.1 >Reporter: Sebastian Nagel >Priority: Critical > Fix For: 2.2 > > Attachments: NUTCH-1484.patch > > > (reported by Rogério Pereira Araújo, see NUTCH-1483) > When crawling the local filesystem TableUtil.unreverseURL fails for URLs with > empty host part (file:///Documents/). StringUtils.split(String, char) does > not preserve empty parts which causes: > {code} > java.lang.ArrayIndexOutOfBoundsException: 1 > at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98) > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1370: --- Attachment: NUTCH-1370-1.x.patch Ferdy is right: custom counters are more transparent. Patch for 1.x > Expose exact number of urls injected @runtime > -- > > Key: NUTCH-1370 > URL: https://issues.apache.org/jira/browse/NUTCH-1370 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch > > > Example: When using trunk, currently we see > {code} > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: starting at > 2012-05-22 09:04:00 > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: crawlDb: > crawl/crawldb > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: urlDir: urls > 2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Converting injected > urls to crawl db entries. > 2012-05-22 09:04:00,955 INFO plugin.PluginRepository - Plugins: looking in: > {code} > I would like to see > {code} > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: starting at > 2012-05-22 09:04:00 > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: crawlDb: > crawl/crawldb > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: urlDir: urls > 2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Injected N urls to > crawl/crawldb > 2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Converting injected > urls to crawl db entries. > 2012-05-22 09:04:00,955 INFO plugin.PluginRepository - Plugins: looking in: > {code} > This would make debugging easier and would help those who end up getting > {code} > 2012-05-22 09:04:04,850 WARN crawl.Generator - Generator: 0 records selected > for fetching, exiting ... > {code} -- This message is automatically generated by JIRA. 
[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1370: --- Attachment: NUTCH-1370-2.x-v3.patch Hi Lewis, yes, the 1.x patch is not easily transferred for 2.x because of different (old vs. new) map reduce APIs. Here is a trial... One question: the logged line "number of urls attempting to inject" suggests that there is a third count "urls successfully injected" or similar. What's the intention with "attempting"? > Expose exact number of urls injected @runtime > -- > > Key: NUTCH-1370 > URL: https://issues.apache.org/jira/browse/NUTCH-1370 > Project: Nutch > Issue Type: Improvement > Components: injector >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch, > NUTCH-1370-2.x-v2.patch, NUTCH-1370-2.x-v3.patch > > > Example: When using trunk, currently we see > {code} > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: starting at > 2012-05-22 09:04:00 > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: crawlDb: > crawl/crawldb > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: urlDir: urls > 2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Converting injected > urls to crawl db entries. > 2012-05-22 09:04:00,955 INFO plugin.PluginRepository - Plugins: looking in: > {code} > I would like to see > {code} > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: starting at > 2012-05-22 09:04:00 > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: crawlDb: > crawl/crawldb > 2012-05-22 09:04:00,239 INFO crawl.Injector - Injector: urlDir: urls > 2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Injected N urls to > crawl/crawldb > 2012-05-22 09:04:00,253 INFO crawl.Injector - Injector: Converting injected > urls to crawl db entries. 
> 2012-05-22 09:04:00,955 INFO plugin.PluginRepository - Plugins: looking in: > {code} > This would make debugging easier and would help those who end up getting > {code} > 2012-05-22 09:04:04,850 WARN crawl.Generator - Generator: 0 records selected > for fetching, exiting ... > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504136#comment-13504136 ] Sebastian Nagel commented on NUTCH-1499: Short and precise patch. However, is there a reason why the problem is not solved on hardware or system level, cf. [[bonding|http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding]]? > Usage of multiple ipv4 addresses and network cards on fetcher machines > -- > > Key: NUTCH-1499 > URL: https://issues.apache.org/jira/browse/NUTCH-1499 > Project: Nutch > Issue Type: New Feature > Components: fetcher >Affects Versions: 1.5.1 >Reporter: Walter Tietze >Priority: Minor > Attachments: apache-nutch-1.5.1.NUTCH-1499.patch > > > Adds for the fetcher threads the ability to use multiple configured ipv4 > addresses. > On some cluster machines there are several ipv4 addresses configured where > each ip address is associated with its own network interface. > This patch enables to configure the protocol-http and the protocol-httpclient > to use these network interfaces in a round robin style. > If the feature is enabled, a helper class reads at *startup* the network > configuration. In each http network connection the next ip address is taken. > This method is synchronized, but this should be no bottleneck for the overall > performance of the fetcher threads. > This feature is tested on our cluster for the protocol-http and the > protocol-httpclient protocol. > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
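The round-robin scheme described in the issue can be sketched as follows (a hypothetical helper for illustration, not the class from the attached patch):

```java
import java.util.List;

public class RoundRobinLocalAddrs {
    // Locally configured IPv4 addresses, read once at startup.
    private final List<String> addrs;
    private int next = 0;

    public RoundRobinLocalAddrs(List<String> addrs) {
        this.addrs = addrs;
    }

    // Called once per outgoing HTTP connection; synchronized, but cheap
    // enough not to throttle the fetcher threads.
    public synchronized String nextAddress() {
        String a = addrs.get(next);
        next = (next + 1) % addrs.size();
        return a;
    }
}
```

The address returned for each connection would then be used as the local bind address when the socket is opened (e.g. via commons-httpclient's HostConfiguration.setLocalAddress).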
[jira] [Created] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment
Sebastian Nagel created NUTCH-1500: -- Summary: bin/crawl fails on step solrindex with wrong path to segment Key: NUTCH-1500 URL: https://issues.apache.org/jira/browse/NUTCH-1500 Project: Nutch Issue Type: Bug Affects Versions: 1.6 Reporter: Sebastian Nagel Priority: Trivial The bin/crawl script calls the command (bin/nutch) solrindex with the wrong path to the segment which causes solrindex to fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment
[ https://issues.apache.org/jira/browse/NUTCH-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1500: --- Attachment: NUTCH-1500.patch > bin/crawl fails on step solrindex with wrong path to segment > > > Key: NUTCH-1500 > URL: https://issues.apache.org/jira/browse/NUTCH-1500 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.6 >Reporter: Sebastian Nagel >Priority: Trivial > Attachments: NUTCH-1500.patch > > > The bin/crawl script calls the command (bin/nutch) solrindex with the wrong > path to the segment which causes solrindex to fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507944#comment-13507944 ] Sebastian Nagel commented on NUTCH-1499: Thanks! That's a plausible reason: (let's call it) "administrative constraints". +1 (lean patch, looks good, I'll try to test it on a machine with suitable network settings) > Usage of multiple ipv4 addresses and network cards on fetcher machines > -- > > Key: NUTCH-1499 > URL: https://issues.apache.org/jira/browse/NUTCH-1499 > Project: Nutch > Issue Type: New Feature > Components: fetcher >Affects Versions: 1.5.1 >Reporter: Walter Tietze >Priority: Minor > Attachments: apache-nutch-1.5.1.NUTCH-1499.patch > > > Adds for the fetcher threads the ability to use multiple configured ipv4 > addresses. > On some cluster machines there are several ipv4 addresses configured where > each ip address is associated with its own network interface. > This patch enables to configure the protocol-http and the protocol-httpclient > to use these network interfaces in a round robin style. > If the feature is enabled, a helper class reads at *startup* the network > configuration. In each http network connection the next ip address is taken. > This method is synchronized, but this should be no bottleneck for the overall > performance of the fetcher threads. > This feature is tested on our cluster for the protocol-http and the > protocol-httpclient protocol. >
[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0
[ https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1038: --- Patch Info: Patch Available > Port IndexingFiltersChecker to 2.0 > -- > > Key: NUTCH-1038 > URL: https://issues.apache.org/jira/browse/NUTCH-1038 > Project: Nutch > Issue Type: New Feature >Affects Versions: nutchgora >Reporter: Markus Jelsma > Fix For: 2.2 > > Attachments: NUTCH-1038.patch > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0
[ https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1038: --- Attachment: NUTCH-1038.patch > Port IndexingFiltersChecker to 2.0 > -- > > Key: NUTCH-1038 > URL: https://issues.apache.org/jira/browse/NUTCH-1038 > Project: Nutch > Issue Type: New Feature >Affects Versions: nutchgora >Reporter: Markus Jelsma > Fix For: 2.2 > > Attachments: NUTCH-1038.patch > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker
Sebastian Nagel created NUTCH-1501: -- Summary: Harmonize behavior of parsechecker and indexchecker Key: NUTCH-1501 URL: https://issues.apache.org/jira/browse/NUTCH-1501 Project: Nutch Issue Type: Improvement Components: indexer, parser Reporter: Sebastian Nagel Priority: Minor Fix For: 2.2 Behaviour of ParserChecker and IndexingFiltersChecker has diverged between trunk and 2.x - missing in 2.x: NUTCH-1320, NUTCH-1207 - open issue to be also applied to 2.x: NUTCH-1419, NUTCH-1389 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1502) Test for CrawlDatum state transitions
Sebastian Nagel created NUTCH-1502: -- Summary: Test for CrawlDatum state transitions Key: NUTCH-1502 URL: https://issues.apache.org/jira/browse/NUTCH-1502 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 1.7, 2.2 Reporter: Sebastian Nagel An exhaustive test to check the matrix of CrawlDatum state transitions (CrawlStatus in 2.x) would be useful to detect errors, esp. for continuous crawls where the number of possible transitions is quite large. Additional factors with impact on state transitions (retry counters, static and dynamic intervals) are also tested. The tests will help to address NUTCH-578 and NUTCH-1245. See the latter for a first sketchy patch.
[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13525439#comment-13525439 ] Sebastian Nagel commented on NUTCH-1245: @kiran: yes, 2.x is affected since fetch schedulers do not differ (much) between 1.x and 2.x. However, with default settings you need a couple of months of continuous crawling to run into this problem. @Markus: good news! Pulled the test out to NUTCH-1502 (broader coverage, need more time). Are there objections regarding the proposed patch? > URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb > and is generated over and over again > > > Key: NUTCH-1245 > URL: https://issues.apache.org/jira/browse/NUTCH-1245 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.4, 1.5 >Reporter: Sebastian Nagel >Priority: Critical > Fix For: 1.7 > > Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch, > NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch > > > A document gone with 404 after db.fetch.interval.max (90 days) has passed > is fetched over and over again but although fetch status is fetch_gone > its status in CrawlDb keeps db_unfetched. Consequently, this document will > be generated and fetched from now on in every cycle. 
> To reproduce: > # create a CrawlDatum in CrawlDb which retry interval hits > db.fetch.interval.max (I manipulated the shouldFetch() in > AbstractFetchSchedule to achieve this) > # now this URL is fetched again > # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to > db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 > days) > # this does not change with every generate-fetch-update cycle, here for two > segments: > {noformat} > /tmp/testcrawl/segments/20120105161430 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:14:21 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:14:48 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > /tmp/testcrawl/segments/20120105161631 > SegmentReader: get 'http://localhost/page_gone' > Crawl Generate:: > Status: 1 (db_unfetched) > Fetch time: Thu Jan 05 16:16:23 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > Crawl Fetch:: > Status: 37 (fetch_gone) > Fetch time: Thu Jan 05 16:20:05 CET 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: > http://localhost/page_gone > {noformat} > As far as I can see it's caused by setPageGoneSchedule() in > AbstractFetchSchedule. 
Some pseudo-code: > {code} > setPageGoneSchedule (called from update / CrawlDbReducer.reduce): > datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * > maxInterval > datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516 > if (maxInterval < datum.fetchInterval) // necessarily true >forceRefetch() > forceRefetch: > if (datum.fetchInterval > maxInterval) // true because it's 1.35 * > maxInterval >datum.fetchInterval = 0.9 * maxInterval > datum.status = db_unfetched // > shouldFetch (called from generate / Generator.map): > if ((datum.fetchTime - curTime) > maxInterval) >// always true if the crawler is launched in short intervals >// (lower than 0.35 * maxInterval) >datum.fetchTime = curTime // forces a refetch > {code} > After setPageGoneSchedule is called via update the state is db_unfetched and > the retry interval 0.9 * db.fetch.interval.max (81 days). > Although the fetch time in the CrawlDb is far in the future > {noformat} > % nutch readdb testcrawl/crawldb -url http://localhost/page_gone > URL: http://localhost/page_gone > Version: 7 > Status: 1 (db_unfetched) > Fetch time: Sun May 06 05:20:05 CEST 2012 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 6998400 seconds (81 days) > Score: 1.0 > Signature: null > Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone > {noformat} > the URL is g
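The fixpoint described by this pseudo-code can be checked numerically; a small self-contained sketch (7776000 seconds = 90 days, the default db.fetch.interval.max; 0.9 of it is the 6998400 seconds / 81 days shown in the segment dumps above):

```java
public class GoneScheduleDemo {
    static final long MAX_INTERVAL = 7776000L; // 90 days in seconds

    // One update cycle for a gone page, following the pseudo-code:
    static long nextInterval(long interval) {
        interval = (long) (1.5 * interval);   // setPageGoneSchedule: 1.35 * maxInterval
        if (interval > MAX_INTERVAL) {        // forceRefetch
            interval = (long) (0.9 * MAX_INTERVAL);
        }
        return interval;
    }

    public static void main(String[] args) {
        long interval = (long) (0.9 * MAX_INTERVAL); // 6998400 s = 81 days
        // The interval is a fixpoint: every cycle ends where it started,
        // so the datum stays db_unfetched and is generated again and again.
        System.out.println(nextInterval(interval)); // -> 6998400
    }
}
```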
[jira] [Commented] (NUTCH-1503) Configuration properties not in sync between FetcherReducer and nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529497#comment-13529497 ] Sebastian Nagel commented on NUTCH-1503: Hi Lewis, both time limit properties are necessary: * fetcher.timelimit.mins for the user to configure the limit (max. duration in minutes) * fetcher.timelimit (internal use only) to set the time the fetcher has to finish (system time in milliseconds, same time for all distributed jobs) Regarding fetcher.threads.per.host.by.ip: maybe we should not add already deprecated properties which will be removed later anyway (cf. NUTCH-1409). +1 for adding fetcher.queue.use.host.settings to nutch-default.xml Btw., your efforts to clean up properties reminded me that some time ago I promised on [user@nutch|http://lucene.472066.n3.nabble.com/Javadoc-incorrect-or-missing-code-in-1-5-1-Generator-td3997883.html] to prepare a list with all Nutch properties and flags indicating whether they are "defined" and documented in nutch-default.xml: [it's in the wiki now|http://wiki.apache.org/nutch/NutchPropertiesCompleteList]. 
> Configuration properties not in sync between FetcherReducer and > nutch-default.xml > - > > Key: NUTCH-1503 > URL: https://issues.apache.org/jira/browse/NUTCH-1503 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 2.1 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1503.patch > > > FetcherReducer.java > Bug: Following properties appear in FetcherReducer but not in > nutch-default.xml > {code} > 290 useHostSettings = > conf.getBoolean("fetcher.queue.use.host.settings", false); > 300 this.timelimit = conf.getLong("fetcher.timelimit", -1); > 450 this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", true); > 698 timelimit = context.getConfiguration().getLong("fetcher.timelimit", > -1); > {code} > Therefore they cannot be used properly in code execution and must be updated, > removed and/or added to nutch-default.xml. > Patch coming up just now. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
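The relation between the two time limit properties described in the comment can be sketched as follows (an illustration of the mechanism only, not the actual Fetcher code; the method name is made up):

```java
public class TimelimitDemo {
    // fetcher.timelimit.mins (user-facing, minutes) is turned into
    // fetcher.timelimit (internal, absolute system time in milliseconds)
    // once, so all distributed fetcher tasks share the same deadline.
    static long deadlineMillis(long timelimitMins, long nowMillis) {
        return timelimitMins > 0 ? nowMillis + timelimitMins * 60L * 1000L : -1L;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // A 180-minute limit ends 10800000 ms (3 hours) from now:
        System.out.println(deadlineMillis(180L, now) - now); // -> 10800000
    }
}
```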
[jira] [Updated] (NUTCH-1038) Port IndexingFiltersChecker to 2.0
[ https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1038: --- Attachment: NUTCH-1038v2.patch Hi Lewis, it's a problem of the patch: the fetch time of a WebPage (unlike CrawlDatum) must be explicitly set. Good catch! Attached improved patch. > Port IndexingFiltersChecker to 2.0 > -- > > Key: NUTCH-1038 > URL: https://issues.apache.org/jira/browse/NUTCH-1038 > Project: Nutch > Issue Type: New Feature >Affects Versions: nutchgora >Reporter: Markus Jelsma > Fix For: 2.2 > > Attachments: NUTCH-1038.patch, NUTCH-1038v2.patch > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)
[ https://issues.apache.org/jira/browse/NUTCH-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545480#comment-13545480 ] Sebastian Nagel commented on NUTCH-1514: +1 But do we need a reference to the removed property in nutch-default.xml? {quote} Replaces the deprecated parameter db.default.fetch.interval. {quote} It has been deprecated for a long time now, so it could be removed without a trace. > Phase out the deprecated configuration properties (if possible) > --- > > Key: NUTCH-1514 > URL: https://issues.apache.org/jira/browse/NUTCH-1514 > Project: Nutch > Issue Type: Improvement > Components: fetcher, generator >Affects Versions: 1.6, 2.1 >Reporter: Tejas Patil >Priority: Trivial > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1514.patch > > > In reference to [0], the deprecated configuration properties can be removed > (only if possible without affecting the functionality). > [0] : > http://mail-archives.apache.org/mod_mbox/nutch-user/201301.mbox/%3ccafkhtfwvm7w-cvusgzwkegdcwrvshptbdftdcn1nnpm1z2-...@mail.gmail.com%3E
[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552028#comment-13552028 ] Sebastian Nagel commented on NUTCH-1499: So, a vote for "won't fix". Comments? > Usage of multiple ipv4 addresses and network cards on fetcher machines > -- > > Key: NUTCH-1499 > URL: https://issues.apache.org/jira/browse/NUTCH-1499 > Project: Nutch > Issue Type: New Feature > Components: fetcher >Affects Versions: 1.5.1 >Reporter: Walter Tietze >Priority: Minor > Fix For: 1.7 > > Attachments: apache-nutch-1.5.1.NUTCH-1499.patch > > > Adds for the fetcher threads the ability to use multiple configured ipv4 > addresses. > On some cluster machines there are several ipv4 addresses configured where > each ip address is associated with its own network interface. > This patch enables to configure the protocol-http and the protocol-httpclient > to use these network interfaces in a round robin style. > If the feature is enabled, a helper class reads at *startup* the network > configuration. In each http network connection the next ip address is taken. > This method is synchronized, but this should be no bottleneck for the overall > performance of the fetcher threads. > This feature is tested on our cluster for the protocol-http and the > protocol-httpclient protocol. > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-813) Repetitive crawl 403 status page
[ https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-813. --- Resolution: Duplicate The described problem is identical to that of NUTCH-578. The provided patch (call setPageGoneSchedule when the retry counter hits db.fetch.retry.max) is included in all patches of NUTCH-578. > Repetitive crawl 403 status page > > > Key: NUTCH-813 > URL: https://issues.apache.org/jira/browse/NUTCH-813 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.1 >Reporter: Nguyen Manh Tien >Priority: Minor > Fix For: 1.7 > > Attachments: ASF.LICENSE.NOT.GRANTED--Patch > > > When we crawl a page that returns a 403 status, it will be crawled repetitively > each day with the default schedule, > even when we restrict by parameter db.fetch.retry.max
[jira] [Commented] (NUTCH-1345) JAVA_HOME should not be required
[ https://issues.apache.org/jira/browse/NUTCH-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552082#comment-13552082 ] Sebastian Nagel commented on NUTCH-1345: JAVA_HOME (or NUTCH_JAVA_HOME) is currently used for two things: # use $JAVA_HOME/bin/java as the Java executable # determining the location of lib/tools.jar, which is part of the JDK (not the JRE). It's probably an unneeded artifact, cf. MAPREDUCE-3624 and HADOOP-7374. If JAVA_HOME is not set, bin/nutch definitely refuses to work. I agree that setting an environment variable may be a little hurdle; however, there are arguments in favour of using JAVA_HOME: - I had to install Nutch on many customers' machines where the default java executable on PATH was not the correct one (>= 1.6): setting JAVA_HOME is more transparent than manipulating PATH. NUTCH_JAVA_HOME is even more explicit. - backward compatibility: Nutch should be run by the same JVM as before, not accidentally by another one. - staying parallel to Hadoop, which still uses JAVA_HOME Btw., let JAVA_HOME point to /usr/lib/jvm/default-java for Ubuntu's update-alternatives. > JAVA_HOME should not be required > > > Key: NUTCH-1345 > URL: https://issues.apache.org/jira/browse/NUTCH-1345 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Ben McCann >Priority: Minor > Attachments: nutch, nutch.patch > > > Trying to run Nutch spits out the message "Error: JAVA_HOME is not set." I > already have java on my path, so I really wish I didn't need to set > JAVA_HOME. It's an extra step to get up and running and is not updated by > Ubuntu's update-alternatives, so it makes it a lot harder to switch between > versions of Java.
[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554353#comment-13554353 ] Sebastian Nagel commented on NUTCH-1087: Hi Tristan, thanks for the patch! The solrindex segment path was already reported in NUTCH-1500. Can you open a new issue for the Mac OS problem? It's cleaner to separate the problems than to reopen resolved issues. Thanks. Btw., maybe a simple solution {code} SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1` {code} without sed or awk is preferable. Does it work on Mac OS? > Deprecate crawl command and replace with example script > --- > > Key: NUTCH-1087 > URL: https://issues.apache.org/jira/browse/NUTCH-1087 > Project: Nutch > Issue Type: Task >Affects Versions: 1.4 >Reporter: Markus Jelsma >Assignee: Julien Nioche >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, > NUTCH-1087-2.1-2.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch > > > * remove the crawl command > * add basic crawl shell script > See thread: > http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html
[jira] [Resolved] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment
[ https://issues.apache.org/jira/browse/NUTCH-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1500. Resolution: Fixed committed to trunk (rev. 1433658) > bin/crawl fails on step solrindex with wrong path to segment > > > Key: NUTCH-1500 > URL: https://issues.apache.org/jira/browse/NUTCH-1500 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.6 >Reporter: Sebastian Nagel >Priority: Trivial > Fix For: 1.7 > > Attachments: NUTCH-1500.patch > > > The bin/crawl script calls the command (bin/nutch) solrindex with the wrong > path to the segment which causes solrindex to fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554381#comment-13554381 ] Sebastian Nagel commented on NUTCH-1087: yes, of course, but currently there is already a if-else to separate local from distributed mode. But let's move the discussion to a new issue. > Deprecate crawl command and replace with example script > --- > > Key: NUTCH-1087 > URL: https://issues.apache.org/jira/browse/NUTCH-1087 > Project: Nutch > Issue Type: Task >Affects Versions: 1.4 >Reporter: Markus Jelsma >Assignee: Julien Nioche >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, > NUTCH-1087-2.1-2.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch > > > * remove the crawl command > * add basic crawl shell script > See thread: > http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1520) SegmentMerger loses records
[ https://issues.apache.org/jira/browse/NUTCH-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556093#comment-13556093 ]

Sebastian Nagel commented on NUTCH-1520:
----------------------------------------

Hi Markus, have a look at NUTCH-1113. An alternative solution is to take, in certain cases, more than one CrawlDatum into the merged segment.

> SegmentMerger looses records
> ----------------------------
>
> Key: NUTCH-1520
> URL: https://issues.apache.org/jira/browse/NUTCH-1520
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.6
> Reporter: Markus Jelsma
> Priority: Critical
> Fix For: 1.7
>
> Attachments: NUTCH-1520-1.7-1.patch
>
> It seems the SegmentMerger tool loses documents. You're likely to see fewer
> documents in an index if you index one or more already merged segments than
> if you index all unmerged segments.
> This is really nasty!
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564274#comment-13564274 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Hi Tejas, thanks, and a few comments on the patch:

"for a given host, sitemaps are processed just once": but they are not cached over cycles, because the cache is bound to the protocol object. Is this correct? So a sitemap is fetched and processed every cycle for every host? If yes, and sitemaps are large (they can be!), this would cause a lot of extra traffic.

Shouldn't sitemap URLs be handled the same way as any other URL: add them to CrawlDb, fetch and parse once, add found links to CrawlDb, cf. [Ken's post at CC|https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/DrAX4Th1A4I]. There are some complications:
- due to their size, sitemaps may require larger size and time limits
- sitemaps may require more frequent re-fetching (e.g., by MimeAdaptiveFetchSchedule)
- the current Outlink class cannot hold the extra information contained in sitemaps (lastmod, changefreq, etc.)

There is another way, which we use for several customers: a SitemapInjector fetches the sitemaps, extracts the URLs, and injects them with all extra information. It's a simple use case for a customized site-search: there is a sitemap, and it shall be used as the seed list or even the exclusive list of documents to be crawled. Is there any interest in this solution? It's not a general solution and not adaptable to a large web crawl.
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Tejas Patil
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
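To make the last complication in the comment above concrete, here is a minimal sketch of an outlink record extended with the sitemap attributes the current Outlink class cannot carry. The class SitemapOutlink and its changefreq-to-interval mapping are hypothetical illustrations, not part of Nutch or of the patch:

```java
import java.time.Instant;

// Hypothetical sketch: an outlink record extended with the sitemap
// attributes (lastmod, changefreq, priority) that the plain
// (toUrl, anchor) outlink representation cannot hold.
public class SitemapOutlink {
    private final String toUrl;
    private final String anchor;
    private final Instant lastMod;    // <lastmod> from the sitemap, may be null
    private final String changeFreq;  // <changefreq>: always, hourly, daily, ...
    private final float priority;     // <priority>: 0.0 .. 1.0, default 0.5

    public SitemapOutlink(String toUrl, String anchor,
                          Instant lastMod, String changeFreq, float priority) {
        this.toUrl = toUrl;
        this.anchor = anchor;
        this.lastMod = lastMod;
        this.changeFreq = changeFreq;
        this.priority = priority;
    }

    public String getToUrl() { return toUrl; }
    public String getAnchor() { return anchor; }
    public Instant getLastMod() { return lastMod; }
    public String getChangeFreq() { return changeFreq; }
    public float getPriority() { return priority; }

    // A fetch schedule could map changefreq to a re-fetch interval in seconds;
    // the concrete values here are arbitrary examples.
    public long suggestedFetchIntervalSeconds() {
        switch (changeFreq == null ? "" : changeFreq) {
            case "always":  return 60;
            case "hourly":  return 3600;
            case "daily":   return 86400;
            case "weekly":  return 604800;
            case "monthly": return 2592000;
            default:        return 2592000; // fall back to a monthly interval
        }
    }
}
```

A schedule plugin could then prefer this interval over the global default when the link originates from a sitemap.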
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564768#comment-13564768 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Yes, SitemapInjector is a map-reduce job. The scenario for its use is the following:
- a small set of sites to be crawled (e.g., to feed a site-search index)
- you can think of sitemaps as "remote seed lists". Because many content management systems can generate sitemaps, it is convenient for site owners to publish seeds this way. The URLs contained in the sitemap can also be the complete and exclusive set of URLs to be crawled (you can use the plugin scoring-depth to limit the crawl to the seed URLs).
- because you can trust the sitemap's content:
  - checks for "cross submissions" are not necessary
  - extra information (lastmod, changefreq, priority) can be used

That's how we use sitemaps: remote seed lists, maintained by customers, quite convenient if you run a crawler as a service.

For large web crawls there is also another aspect: detection of sitemaps, which is bound to the processing of robots.txt. Processing of sitemaps can (and should?) be done the usual Nutch way:
- detection is done in the protocol plugin (see Tejas' patch)
- recording in CrawlDb is done by the Fetcher (cross-submission information can be added)
- fetch (if not yet done), parse (a plugin parse-sitemap based on crawler-commons?) and extract outlinks: sitemaps may require special treatment here because they can be large in size and usually contain many outlinks. Also, the Outlink class needs to be extended to deal with the extra info relevant for scheduling.

Using an extra tool (such as the SitemapInjector) to process the sitemaps has the disadvantage that we first must get all sitemap URLs out of the CrawlDb. On the other hand, special treatment can easily be realized in a separate map-reduce job. Comments?!

Thanks, Tejas: the feature is moving forward thanks to your initiative!
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Tejas Patil
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
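As a rough illustration of the extraction step such a SitemapInjector performs, the following self-contained sketch pulls the loc and lastmod entries out of a sitemap using the JDK's DOM parser. The class and method names are made up for illustration; a real implementation would rely on crawler-commons and run inside the map step of the injector job:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the extraction step of a SitemapInjector:
// read a sitemap and return a map url -> lastmod ("" if absent).
public class SitemapExtractSketch {

    public static Map<String, String> extract(String sitemapXml) throws Exception {
        Map<String, String> urls = new LinkedHashMap<>();
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));
        NodeList entries = doc.getElementsByTagName("url");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            NodeList loc = entry.getElementsByTagName("loc");
            if (loc.getLength() == 0) continue;  // skip malformed entries
            NodeList lastmod = entry.getElementsByTagName("lastmod");
            urls.put(loc.item(0).getTextContent().trim(),
                     lastmod.getLength() > 0 ? lastmod.item(0).getTextContent().trim() : "");
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset>"
            + "<url><loc>http://example.com/a</loc><lastmod>2013-01-01</lastmod></url>"
            + "<url><loc>http://example.com/b</loc></url>"
            + "</urlset>";
        System.out.println(extract(xml));
    }
}
```

The injector would then emit each URL as a CrawlDatum, carrying lastmod and friends as metadata.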
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564827#comment-13564827 ]

Sebastian Nagel commented on NUTCH-1047:
----------------------------------------

As a test for the interface I started to implement a CSV indexer, useful for exporting crawled data or for quick analysis. A first working version (draft, still a lot to do) came in at 100+ lines of code.

+1 for the interface / extension point. Some concerns about the usability of IndexingJob as a "daily" tool:
- it's not really transparent which indexer is run (Solr, Elasticsearch, etc.): you have to look into the plugin.includes property
- options must be passed to indexer plugins as properties: complicated, and there is no help to get a list of available properties

> Pluggable indexing backends
> ---------------------------
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
> NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
> One possible feature would be to add a new endpoint for indexing-backends and
> make the indexing pluggable. At the moment we are hardwired to SOLR, which is
> OK, but as other resources like ElasticSearch are becoming more popular it
> would be better to handle this as plugins. Not sure about the name of the
> endpoint though: we already have indexing-plugins (which are about generating
> fields sent to the backends) and moreover the backends are not necessarily
> for indexing / searching but could be just an external storage, e.g. CouchDB.
> The term backend on its own would be confusing in 2.0 as this could be
> pertaining to the storage in GORA. 'indexing-backend' is the best name that
> came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating
> and cleaning and maybe add a Nutch extension point there so we can easily
> hook up indexing, cleaning and deduplicating for various backends.
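For illustration, the core of a CSV indexing backend like the one drafted in the comment above is mostly a matter of correct field escaping. The sketch below is hypothetical (not the actual draft mentioned in the comment) and follows RFC 4180 quoting:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the core of a CSV indexing backend:
// turn one document's field values into a properly escaped CSV line.
public class CsvLineSketch {

    // Quote a field if it contains the separator, a quote, or a newline;
    // double any embedded quotes (RFC 4180 style).
    static String escape(String field) {
        if (field == null) return "";
        boolean needsQuoting = field.contains(",") || field.contains("\"")
            || field.contains("\n") || field.contains("\r");
        if (!needsQuoting) return field;
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static String toCsvLine(List<String> fields) {
        return fields.stream().map(CsvLineSketch::escape)
                     .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // prints: http://example.com/,"A title, with comma",plain
        System.out.println(toCsvLine(Arrays.asList(
            "http://example.com/", "A title, with comma", "plain")));
    }
}
```

A full IndexWriter plugin would wrap this in the open/write/close lifecycle of the new extension point and stream one line per document to an output file.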