[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915563#comment-13915563 ]

Yasin Kılınç commented on NUTCH-1253:
-------------------------------------

I checked and tested the patch file against the 2.x branch. I used the ant eclipse target, then opened the project in the Eclipse IDE. The project compiles, but Eclipse shows a warning because the bundled nekohtml version is old. I want to attach a patch file for this problem.

> Incompatible neko and xerces versions
> -------------------------------------
>
>     Key: NUTCH-1253
>     URL: https://issues.apache.org/jira/browse/NUTCH-1253
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.4
>     Environment: Ubuntu 10.04
>     Reporter: Dennis Spathis
>     Assignee: Lewis John McGibbney
>     Fix For: 2.3, 1.8
>     Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253-trunk.patch, NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, nutch1253parsed.html, nutch1253test.html
>
> The Nutch 1.4 distribution includes
> - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
> - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: to see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stack trace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:
> <plugin id="lib-nekohtml" name="CyberNeko HTML Parser" version="1.9.11" provider-name="org.cyberneko">
>   <runtime>
>     <library name="nekohtml-0.9.5.jar">
>       <export name="*"/>
>     </library>
>   </runtime>
> </plugin>
> Note the conflicting version numbers (the version attribute is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
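For reference, the debugging aid described in the report (adding a catch (Throwable) to HtmlParser.getParse so the AbstractMethodError is logged instead of swallowed) might look like the sketch below. This is illustrative only: parseNeko() is a hypothetical stand-in for the plugin's existing neko/xerces parsing path, and LOG is assumed to be the plugin's logger.

{code}
// Sketch only, not the shipped code.
public ParseResult getParse(Content content) {
  try {
    return parseNeko(content); // hypothetical helper: the existing parsing logic
  } catch (Throwable t) {
    // AbstractMethodError is an Error, not an Exception, so a plain
    // catch (Exception e) never sees it -- catch Throwable to log it.
    LOG.error("Parse failed for " + content.getUrl(), t);
    return new ParseStatus(t).getEmptyParseResult(content.getUrl(), getConf());
  }
}
{code}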
[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Talat UYARER updated NUTCH-1478:
--------------------------------
    Attachment: NUTCH-1478v5.patch

I fixed several mistakes within the patch. This is final.

> Parse-metatags and index-metadata plugin for Nutch 2.x series
> --------------------------------------------------------------
>
>     Key: NUTCH-1478
>     URL: https://issues.apache.org/jira/browse/NUTCH-1478
>     Project: Nutch
>     Issue Type: Improvement
>     Components: parser
>     Affects Versions: 2.1
>     Reporter: kiran
>     Fix For: 2.3
>     Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch, NUTCH-1478v4.patch, NUTCH-1478v5.patch, Nutch1478.patch, Nutch1478.zip, metadata_parseChecker_sites.png
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. This will take multiple values of the same tag and index them in Solr, as I patched before (https://issues.apache.org/jira/browse/NUTCH-1467). The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no need to give the 'metatag' keyword before metatag names. For example, my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).
> This is only the first version and does not include the junit test. I will update the new version soon. This will parse the tags and index them in Solr. Make sure you also create the fields listed under 'index.parse.md' in nutch-site.xml in Solr's schema.xml. Please let me know if you have any suggestions.
> This is supported by the DLA (Digital Library and Archives) of Virginia Tech.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
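For readers without access to the linked file, a hedged sketch of the kind of nutch-site.xml entries the description refers to; the property names metatags.names and index.parse.md follow the plugin's documented conventions, but the exact separators and values here are assumptions, not the author's actual configuration.

{code}
<!-- Sketch of a possible configuration; values are illustrative. -->
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- per the description above, no "metatag." prefix is needed in this port -->
  <value>description,keywords</value>
</property>
{code}

Each field named in index.parse.md would then also need a matching field definition in Solr's schema.xml, as the description notes.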
[jira] [Comment Edited] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915619#comment-13915619 ]

Talat UYARER edited comment on NUTCH-1478 at 2/28/14 10:03 AM:
---------------------------------------------------------------

I fixed several mistakes within the patch. This is final. [~popalka], can you test the patch?

was (Author: talat):
I fixed several mistakes within the patch. This is final.

> Parse-metatags and index-metadata plugin for Nutch 2.x series
> --------------------------------------------------------------
>
>     Key: NUTCH-1478
>     URL: https://issues.apache.org/jira/browse/NUTCH-1478
>     Project: Nutch
>     Issue Type: Improvement
>     Components: parser
>     Affects Versions: 2.1
>     Reporter: kiran
>     Fix For: 2.3
>     Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch, NUTCH-1478v4.patch, NUTCH-1478v5.patch, Nutch1478.patch, Nutch1478.zip, metadata_parseChecker_sites.png
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. This will take multiple values of the same tag and index them in Solr, as I patched before (https://issues.apache.org/jira/browse/NUTCH-1467). The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no need to give the 'metatag' keyword before metatag names. For example, my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).
> This is only the first version and does not include the junit test. I will update the new version soon. This will parse the tags and index them in Solr. Make sure you also create the fields listed under 'index.parse.md' in nutch-site.xml in Solr's schema.xml. Please let me know if you have any suggestions.
> This is supported by the DLA (Digital Library and Archives) of Virginia Tech.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915701#comment-13915701 ]

Lewis John McGibbney commented on NUTCH-1253:
---------------------------------------------

The version of nekohtml we are using is

<dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.19" conf="*->master"/>

AFAIK this is the most recent.

> Incompatible neko and xerces versions
> -------------------------------------
>
>     Key: NUTCH-1253
>     URL: https://issues.apache.org/jira/browse/NUTCH-1253
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.4
>     Environment: Ubuntu 10.04
>     Reporter: Dennis Spathis
>     Assignee: Lewis John McGibbney
>     Fix For: 2.3, 1.8
>     Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253-trunk.patch, NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, nutch1253parsed.html, nutch1253test.html
>
> The Nutch 1.4 distribution includes
> - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
> - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: to see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stack trace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:
> <plugin id="lib-nekohtml" name="CyberNeko HTML Parser" version="1.9.11" provider-name="org.cyberneko">
>   <runtime>
>     <library name="nekohtml-0.9.5.jar">
>       <export name="*"/>
>     </library>
>   </runtime>
> </plugin>
> Note the conflicting version numbers (the version attribute is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Updated] (NUTCH-1727) Configurable length for Tlds
[ https://issues.apache.org/jira/browse/NUTCH-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sertac TURKEL updated NUTCH-1727:
---------------------------------
    Attachment: (was: NUTCH-1727.patch)

> Configurable length for Tlds
> ----------------------------
>
>     Key: NUTCH-1727
>     URL: https://issues.apache.org/jira/browse/NUTCH-1727
>     Project: Nutch
>     Issue Type: Bug
>     Reporter: Sertac TURKEL
>     Priority: Minor
>     Fix For: 2.3
>
> The length of the TLD should be configurable: there are TLDs like .travel, and the url-validator plugin currently filters out URLs of this type.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915730#comment-13915730 ]

Yasin Kılınç commented on NUTCH-1253:
-------------------------------------

OK, but there is a line like this in the eclipse target of NUTCH_HOME/build.xml:
{code}
<library path="${basedir}/build/plugins/lib-nekohtml/nekohtml-0.9.5.jar" exported="false"/>
{code}

> Incompatible neko and xerces versions
> -------------------------------------
>
>     Key: NUTCH-1253
>     URL: https://issues.apache.org/jira/browse/NUTCH-1253
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.4
>     Environment: Ubuntu 10.04
>     Reporter: Dennis Spathis
>     Assignee: Lewis John McGibbney
>     Fix For: 2.3, 1.8
>     Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253-trunk.patch, NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, nutch1253parsed.html, nutch1253test.html
>
> The Nutch 1.4 distribution includes
> - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
> - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: to see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stack trace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:
> <plugin id="lib-nekohtml" name="CyberNeko HTML Parser" version="1.9.11" provider-name="org.cyberneko">
>   <runtime>
>     <library name="nekohtml-0.9.5.jar">
>       <export name="*"/>
>     </library>
>   </runtime>
> </plugin>
> Note the conflicting version numbers (the version attribute is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
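If the eclipse target is to match the ivy dependency quoted in the previous comment, the entry would presumably need to point at the newer jar, e.g. (illustrative, not a committed change):

{code}
<library path="${basedir}/build/plugins/lib-nekohtml/nekohtml-1.9.19.jar" exported="false"/>
{code}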
[jira] [Updated] (NUTCH-1727) Configurable length for Tlds
[ https://issues.apache.org/jira/browse/NUTCH-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sertac TURKEL updated NUTCH-1727:
---------------------------------
    Attachment: NUTCH-1727.patch

Hi [~lewismc], there was a point that I missed. I found it and updated the patch file. I think it is OK now. Could you review it again?

> Configurable length for Tlds
> ----------------------------
>
>     Key: NUTCH-1727
>     URL: https://issues.apache.org/jira/browse/NUTCH-1727
>     Project: Nutch
>     Issue Type: Bug
>     Reporter: Sertac TURKEL
>     Priority: Minor
>     Fix For: 2.3
>     Attachments: NUTCH-1727.patch
>
> The length of the TLD should be configurable: there are TLDs like .travel, and the url-validator plugin currently filters out URLs of this type.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
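To illustrate the problem being fixed (a sketch, not the patch itself; the property name urlfilter.tld.length is hypothetical): with a hard-coded upper bound of 4 characters, a TLD pattern rejects "travel", whereas a configurable bound accepts it.

{code}
import java.util.regex.Pattern;

public class TldLengthDemo {
  public static void main(String[] args) {
    // A classic validator bound of {2,4} rejects six-letter TLDs:
    System.out.println(Pattern.matches("\\p{Alpha}{2,4}", "travel")); // false
    // A configurable upper bound (hypothetical property name) accepts them:
    int maxTldLength = Integer.getInteger("urlfilter.tld.length", 8);
    System.out.println(
        Pattern.matches("\\p{Alpha}{2," + maxTldLength + "}", "travel")); // true
  }
}
{code}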
[jira] [Created] (NUTCH-1732) IndexerMapReduce to delete explicitly not indexable documents
Sebastian Nagel created NUTCH-1732:
-----------------------------------
    Summary: IndexerMapReduce to delete explicitly not indexable documents
    Key: NUTCH-1732
    URL: https://issues.apache.org/jira/browse/NUTCH-1732
    Project: Nutch
    Issue Type: Bug
    Components: indexer
    Affects Versions: 1.8
    Reporter: Sebastian Nagel
    Fix For: 1.9

In a continuous crawl, a previously successfully indexed document (identified by a URL) can become non-indexable for a couple of reasons and must then be explicitly deleted from the index. Some cases are handled in IndexerMapReduce (duplicates, gone documents or redirects, cf. NUTCH-1139) but others are not:
* failed to parse (but previously successfully parsed): e.g., the document became larger and is now truncated
* rejected by an indexing filter (but previously accepted)
In both cases (maybe there are more) the document should be explicitly deleted (if {{-deleteGone}} is set). Note that this cannot be done in CleaningJob because data from segments is required.
We should also update/add a description for {{-deleteGone}}: it does not only trigger deletion of gone documents but also of redirects and duplicates (and unparseable and skipped docs).

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1732) IndexerMapReduce to delete explicitly not indexable documents
[ https://issues.apache.org/jira/browse/NUTCH-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915828#comment-13915828 ]

Markus Jelsma commented on NUTCH-1732:
--------------------------------------

We have an explicit deleteSkippedByIndexingFilter option instead; it seems I never committed it to Apache Nutch in NUTCH-1449.
{code}
// skip documents discarded by indexing filters
if (doc == null) {
  // https://issues.apache.org/jira/browse/NUTCH-1449
  if (deleteSkippedByIndexingFilter) {
    // emit a delete action instead of silently dropping the document
    NutchIndexAction action = new NutchIndexAction(NutchIndexAction.DELETE);
    output.collect(key, action);
    reporter.incrCounter("IndexerStatus", "Deleted by filters", 1);
  } else {
    reporter.incrCounter("IndexerStatus", "Skipped by filters", 1);
  }
  return;
}
{code}

> IndexerMapReduce to delete explicitly not indexable documents
> --------------------------------------------------------------
>
>     Key: NUTCH-1732
>     URL: https://issues.apache.org/jira/browse/NUTCH-1732
>     Project: Nutch
>     Issue Type: Bug
>     Components: indexer
>     Affects Versions: 1.8
>     Reporter: Sebastian Nagel
>     Fix For: 1.9
>
> In a continuous crawl, a previously successfully indexed document (identified by a URL) can become non-indexable for a couple of reasons and must then be explicitly deleted from the index. Some cases are handled in IndexerMapReduce (duplicates, gone documents or redirects, cf. NUTCH-1139) but others are not:
> * failed to parse (but previously successfully parsed): e.g., the document became larger and is now truncated
> * rejected by an indexing filter (but previously accepted)
> In both cases (maybe there are more) the document should be explicitly deleted (if {{-deleteGone}} is set). Note that this cannot be done in CleaningJob because data from segments is required.
> We should also update/add a description for {{-deleteGone}}: it does not only trigger deletion of gone documents but also of redirects and duplicates (and unparseable and skipped docs).

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915831#comment-13915831 ]

Sebastian Nagel commented on NUTCH-1113:
----------------------------------------

Results of tests: the number of documents in the index after all segments have been
* (A) indexed in chronological order, segment by segment, or
* (B) merged into one segment which has then been indexed
is shown below. B has been run twice: (1) without any patch, and (2) using the patch as of 2014-02-21. IndexerMapReduce was patched with NUTCH-1706-trunk-v2.patch for all 3 runs.
|| || coll 1 || coll 2 || coll 3 ||
| A seg-by-seg | 22178 | 6959 | 45944 |
| B1 merged | 21122 | 6579 | 46029 |
| B2 patched, merged | 22161 | 6959 | 46135 |
3 collections have been tested, all of them with ~100 segments and 100,000 URLs, but with many redirects, robots noindex, etc. (far more than indexable documents).
With the patch (B2 compared to B1) the index contains more documents. For collection 2 it is now equal to the expected number (A). For the other two collections the numbers still differ, but that is because of problems in IndexerMapReduce (NUTCH-1708 and NUTCH-1732).
+1 to commit [~markus17]'s latest patch.

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Priority: Blocker
>     Fix For: 1.9
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Comment Edited] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908265#comment-13908265 ]

Sebastian Nagel edited comment on NUTCH-1113 at 2/28/14 2:45 PM:
-----------------------------------------------------------------

Hi [~markus17], your patch should work (I've tested it exactly the same way). The indexer was run with {{indexer.skip.notmodified == false}}. The problem is that in the merged segment fetch_success datums have been lost, and the following test skipped these URLs:
{code}
if (!parseData.getStatus().isSuccess()
    || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
  return;
}
{code}
Just to clarify that we use the same test set-up:
# start with an empty index
# index (case A) segments in chronological order or (case B) the merged segment
# compare both indexes
The CrawlDb was updated with URLs from all segments. The same CrawlDb is used for all index runs, right?
I plan to run the test with {{indexer.skip.notmodified == false}}. Otherwise, the index will not contain any pages with status notmodified.

was (Author: wastl-nagel):
Hi [~markus17], your patch should work (I've tested it exactly the same way). The indexer was run with {{indexer.skip.notmodified == false}}. The problem is that in the merged segment fetch_success datums have been lost and the following test skipped these URLs:
{code}
if (!parseData.getStatus().isSuccess()
    || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
  return;
}
{code}
Just to clarify that we use the same test set-up:
# start with an empty index
# index (case A) segments in chronological order or (case B) merged segment
# compare both indexes
The CrawlDb was updated with URLs from all segments. The same CrawlDb is used for all index runs, right?
I plan to run the test with {{indexer.skip.notmodified == false}}. Otherwise, we the index will not contain any pages with status notmodified.

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Priority: Blocker
>     Fix For: 1.9
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
HTTP Post request
Hi,

I would like to be able to send an HTTP POST request for Nutch to crawl. I mean, if I ever wanted to crawl a search result, I could do http://www.example.com/search?q=mySearch
But if the server uses HTTP POST, I have not found a way to do it. So what I wanted to do is retrieve the method (POST/GET) and the names of the parameters from a conf file, so that when Nutch comes across a given URL, it will access the page with the right HTTP request. For my example the conf.xml would be like:

<url href="http://www.example.com/search" method="post"/>

But as I am new to Nutch, could someone provide me with some clues on how to start this new plugin?

Best regards,
Zabini

--
View this message in context: http://lucene.472066.n3.nabble.com/HTTP-Post-request-tp4120405.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
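As a starting point, the POST itself is straightforward with the Commons HttpClient 3.x API that Nutch's protocol-httpclient plugin already bundles; the plugin work would mostly be mapping conf entries like the one above onto requests. A rough sketch, assuming the URL, method, and parameter names come from the proposed conf file (the parameter name "q" here is illustrative):

{code}
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

public class PostFetchSketch {
  public static void main(String[] args) throws Exception {
    HttpClient client = new HttpClient();
    // URL and method would be looked up in the proposed conf file
    PostMethod post = new PostMethod("http://www.example.com/search");
    post.addParameter("q", "mySearch"); // parameter names also from the conf
    int status = client.executeMethod(post);
    System.out.println("HTTP " + status);
    post.releaseConnection();
  }
}
{code}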
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
---------------------------------
    Attachment: NUTCH-1113-trunk-junit-final.patch

Final patch including the stuff mentioned by Sebastian and the junit test. I will commit shortly unless there are some final objections :)

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Priority: Blocker
>     Fix For: 1.9
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1732) IndexerMapReduce to delete explicitly not indexable documents
[ https://issues.apache.org/jira/browse/NUTCH-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915889#comment-13915889 ]

Sebastian Nagel commented on NUTCH-1732:
----------------------------------------

Hi [~markus17], that looks like a partial duplicate. I've seen documents in the index whose latest version failed to parse. They should not be in the index, no matter how the segments are indexed: segment by segment in chronological order, all segments in one turn, or merged first (cf. NUTCH-1113). Having one extra option is OK, but the other case (failed parses) could be subsumed under {{-deleteGone}}.

> IndexerMapReduce to delete explicitly not indexable documents
> --------------------------------------------------------------
>
>     Key: NUTCH-1732
>     URL: https://issues.apache.org/jira/browse/NUTCH-1732
>     Project: Nutch
>     Issue Type: Bug
>     Components: indexer
>     Affects Versions: 1.8
>     Reporter: Sebastian Nagel
>     Fix For: 1.9
>
> In a continuous crawl, a previously successfully indexed document (identified by a URL) can become non-indexable for a couple of reasons and must then be explicitly deleted from the index. Some cases are handled in IndexerMapReduce (duplicates, gone documents or redirects, cf. NUTCH-1139) but others are not:
> * failed to parse (but previously successfully parsed): e.g., the document became larger and is now truncated
> * rejected by an indexing filter (but previously accepted)
> In both cases (maybe there are more) the document should be explicitly deleted (if {{-deleteGone}} is set). Note that this cannot be done in CleaningJob because data from segments is required.
> We should also update/add a description for {{-deleteGone}}: it does not only trigger deletion of gone documents but also of redirects and duplicates (and unparseable and skipped docs).

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
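Subsuming the failed-parse case under {{-deleteGone}} might then look like the following sketch, which mirrors the NUTCH-1449 snippet quoted earlier in this issue (the variable names, constructor form, and counter group are assumed from that snippet; this is not a committed patch):

{code}
// Sketch: also delete documents whose latest parse failed when -deleteGone
// is set, in addition to the existing gone/redirect/duplicate handling.
if (deleteGone && parseData != null && !parseData.getStatus().isSuccess()) {
  NutchIndexAction action = new NutchIndexAction(NutchIndexAction.DELETE);
  output.collect(key, action);
  reporter.incrCounter("IndexerStatus", "Deleted unparseable", 1);
  return;
}
{code}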
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
---------------------------------
    Fix Version/s: (was: 1.9)
                   1.8

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Assignee: Markus Jelsma
>     Priority: Blocker
>     Fix For: 1.8
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Resolved] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1113.
----------------------------------
    Resolution: Fixed
    Assignee: Markus Jelsma

Committed revision 1572975. Thanks all for contributing. I am very happy this is fixed once and for all. :)

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Assignee: Markus Jelsma
>     Priority: Blocker
>     Fix For: 1.9
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
Re: Nutch roadmap and documentation
Hi Mateusz,

On Thu, Feb 27, 2014 at 10:35 AM, Mateusz Zakarczemny <mateusz.zakarcze...@up2data.pl> wrote:

> Docs from 1 and 2 branch are mixed together.

As far as I can see they are separate. The tutorials are clearly under different subsections, and the Nutch 2.x docs have their own section as well.

> I understand that detailed documentation easily becomes outdated. But providing information about the existence of a feature is a very basic task of documentation. It should always be up to date.

What exactly are you referring to here? I am slightly puzzled as to why not one other user has requested documentation on individual issues. IMHO, if people wish to use and develop Nutch, they should at minimum subscribe to user@ and dev@... the latter contains EVERY issue which is discussed and every feature enhancement. The same is done for other projects.

> A changelog is not feature documentation.

No, but the point of the changelog is to refer people to what is included. We also now provide a link to the Jira release profile. It is up to users/developers to read up if they wish to learn more about individual issues. An example of such a link can be found in the 2.x CHANGES.txt: Release Report - http://s.apache.org/PGa

> If a new user looks at Nutch, he will not check the changelog but the documentation.

Is this your opinion or are you commenting from a wider audience's perspective?

> I think the new user should be provided with clear information about which branch to choose.

I agree with this. This is why the lists exist. You can ask questions. You can also read the archives. It takes a minimal, well-spent investment of time to dig up what others have asked many, many times. Don't get me wrong, I am all for informing people about the software... however, I am not in the immediate position to write a decent-quality book on Nutch which would do the community and software justice. If you are, then please do.

> What is more, the docs should be divided into branch 1 and 2.

Please see the table of contents on the wiki. Please also see my comments above.

> Pages could link together, but there should be a clean branch tree in the docs, as in the source code. You do not mix packages from two branches; you keep them in separate repos.

ditto

> I don't think that documentation is essential for bugs, only for new features or refactoring. It doesn't have to be a big document. It just has to exist.

But what happens if fixing a bug changes functionality? Then what?

> Nowadays there are some plugins which are not mentioned in plugin central. It is very confusing.

Yes, I agree with this. It is not entirely up-to-date. This is something we should most likely address.

> I know that sometimes developers don't have time to create documentation. But in such cases they should create a new task for that doc. Otherwise nobody knows that the doc is missing and cannot help.

Not true. All you need to do is request karma for the project wiki and you can contribute whatever you feel is missing. I don't accept this argument, sorry.

> I am not saying that Confluence is best for this project. But in my opinion the Nutch docs should be moved to some community/social solution. It would be great if it enabled comments and pull requests (like on GitHub) to improve it.

AFAICT the wiki we currently have IS community oriented. Anyone over the years who has wished to add/edit has been granted karma to do so. Are you really saying that enabling pull requests via GitHub is a better way than simply granting someone karma to edit a page as they wish?

> Maybe MD files would be better? Documentation could be stored with the source code, e.g. a doc folder in each plugin. It would be tied to the source code structure. This approach has many advantages. When I contribute some docs on GitHub I don't have to apply anywhere or ask anybody. I just create a pull request against the documentation. The project leader sees it and can then review and apply it. The whole process takes 3-4 mouse clicks. One drawback is that moving to such a solution would be quite complex and time consuming.

Yes, it certainly would be.

> Over the last 10 years the Nutch documentation grew incrementally. I think it is time to refactor it in a more modular and structured way (like source code). I don't want to rewrite it, just create a better structure.

Honestly, I haven't seen anything in your commentary which would suggest benefits for Nutch as a whole... I am trying NOT to be pessimistic, but I am just struggling to see your point here. If the wiki is outdated... then we should update it, not change to another solution just so we can receive pull requests for documentation. There is an argument for making it as easy as possible to contribute documentation to Nutch. However, as far as I can see, there are no crowds of people rushing to contribute. Please don't take these comments negatively. I am behind any motion to make documentation better. I just don't see eye-to-eye with some of your points.

PS
[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915917#comment-13915917 ]

Julien Nioche commented on NUTCH-1113:
--------------------------------------

Well done, thanks guys!

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Assignee: Markus Jelsma
>     Priority: Blocker
>     Fix For: 1.8
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1706) IndexerMapReduce does not remove db_redir_temp etc
[ https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915919#comment-13915919 ]

Sebastian Nagel commented on NUTCH-1706:
----------------------------------------

Latest patch tested successfully (see NUTCH-1113). Will commit shortly. [~markus17], can you open an issue about the fetch_retry? Regarding the ordering of values when indexing multiple segments of a continuous crawl: there are already NUTCH-1416 and NUTCH-1617.

> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
>     Key: NUTCH-1706
>     URL: https://issues.apache.org/jira/browse/NUTCH-1706
>     Project: Nutch
>     Issue Type: Bug
>     Components: indexer
>     Affects Versions: 1.7
>     Reporter: Markus Jelsma
>     Assignee: Markus Jelsma
>     Priority: Blocker
>     Fix For: 1.8
>     Attachments: NUTCH-1706-trunk-v2.patch, NUTCH-1706-trunk.patch, nutch-1706-testdata.tgz
>
> The code path is wrong in IndexerMapReduce; the delete code should be located after all reducer values have been gathered.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
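The structural fix the issue describes, gathering all reducer values first and only then deciding, could look roughly like the sketch below. This is illustrative, not the committed NUTCH-1706 patch; the NutchIndexAction constructor form and the surrounding variables are assumptions based on the Nutch 1.x indexer API.

{code}
// Sketch: buffer the verdict while consuming ALL values for this key,
// instead of emitting a delete as soon as the first redirect/gone datum
// is seen (an early db_redir_temp must not shadow a later fetch_success).
byte lastFetchStatus = 0;
while (values.hasNext()) {
  CrawlDatum datum = values.next();
  if (CrawlDatum.hasFetchStatus(datum)) {
    lastFetchStatus = datum.getStatus(); // later datums win
  }
  // ... collect parseData, fetchDatum, etc. as before ...
}
// Only after the loop is the complete picture known:
if (deleteGone && (lastFetchStatus == CrawlDatum.STATUS_FETCH_GONE
    || lastFetchStatus == CrawlDatum.STATUS_FETCH_REDIR_TEMP
    || lastFetchStatus == CrawlDatum.STATUS_FETCH_REDIR_PERM)) {
  output.collect(key, new NutchIndexAction(null, NutchIndexAction.DELETE));
  return;
}
{code}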
Build failed in Jenkins: Nutch-trunk #2545
See https://builds.apache.org/job/Nutch-trunk/2545/changes

Changes:

[markus] NUTCH-1113 SegmentMerger can now be safely used to merge segments. If this damn thing breaks again

------------------------------------------
[...truncated 3001 lines...]
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: subcollection
    [javac] Compiling 3 source files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/subcollection/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/subcollection/subcollection.jar

deps-test:

deploy:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/subcollection

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/subcollection

init:
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/tld
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/tld/classes
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/tld/test
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/tld

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: tld
    [javac] Compiling 2 source files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/tld/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/tld/tld.jar

deps-test:

deploy:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/tld

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/tld
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-automaton/test/data
     [copy] Copying 6 files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-automaton/test/data

init:
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-automaton/classes
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-automaton

init-plugin:

deps-jar:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-regex-filter

jar:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-regex-filter

compile-test:
    [javac] Compiling 1 source file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/lib-regex-filter/test
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: urlfilter-automaton
    [javac] Compiling 1 source file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-automaton/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-automaton/urlfilter-automaton.jar

deps-test:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-regex-filter

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-automaton

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-automaton
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-domain/test/data
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-domain/test/data

init:
    [mkdir] Created dir:
[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915969#comment-13915969 ]

Hudson commented on NUTCH-1113:
-------------------------------

FAILURE: Integrated in Nutch-trunk #2545 (See [https://builds.apache.org/job/Nutch-trunk/2545/])
NUTCH-1113 SegmentMerger can now be safely used to merge segments. If this damn thing breaks again (markus: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1572975)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java
* /nutch/trunk/src/test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Assignee: Markus Jelsma
>     Priority: Blocker
>     Fix For: 1.8
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
Build failed in Jenkins: Nutch-trunk #2546
See https://builds.apache.org/job/Nutch-trunk/2546/

------------------------------------------
[...truncated 2159 lines...]

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-ftp

init:
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-http
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-http/classes
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-http/test
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-http

init-plugin:

deps-jar:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-http

jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: protocol-http
    [javac] Compiling 2 source files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-http/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-http/protocol-http.jar

deps-test:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-http

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-http

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-http
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/test/data
     [copy] Copying 5 files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/test/data

init:
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/classes
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-httpclient

init-plugin:

deps-jar:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-http

jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: protocol-httpclient
    [javac] Compiling 8 source files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/protocol-httpclient.jar

deps-test:
     [copy] Copying 2 files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/test
     [copy] Copied 6 empty directories to 5 empty directories under /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/test

deploy:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-httpclient

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-httpclient
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/parse-ext

init:
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/parse-ext
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/parse-ext/classes
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/parse-ext/test

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: parse-ext
    [javac] Compiling 1 source file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/parse-ext/classes
    [javac] warning: [options] bootstrap class