[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915563#comment-13915563 ]

Yasin Kılınç commented on NUTCH-1253:
-------------------------------------

I checked and tested the patch file against the 2.x branch. I used the ant eclipse target, then opened the project in the Eclipse IDE. The project compiles, but Eclipse shows a warning because the bundled nekohtml version is old. I want to attach a patch file for this problem.

> Incompatible neko and xerces versions
> -------------------------------------
>
>     Key: NUTCH-1253
>     URL: https://issues.apache.org/jira/browse/NUTCH-1253
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.4
>     Environment: Ubuntu 10.04
>     Reporter: Dennis Spathis
>     Assignee: Lewis John McGibbney
>     Fix For: 2.3, 1.8
>     Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253-trunk.patch, NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, nutch1253parsed.html, nutch1253test.html
>
> The Nutch 1.4 distribution includes
> - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
> - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: to see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stack trace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:
> <plugin id="lib-nekohtml" name="CyberNeko HTML Parser" version="1.9.11" provider-name="org.cyberneko">
>   <runtime>
>     <library name="nekohtml-0.9.5.jar">
>       <export name="*"/>
>     </library>
>   </runtime>
> </plugin>
> Note the conflicting version numbers (the version attribute is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
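For reference, the debugging aid described in the report (adding a catch (Throwable) to HtmlParser.getParse so the AbstractMethodError is logged instead of swallowed) might look like the sketch below. This is illustrative only: parseNeko() is a hypothetical stand-in for the plugin's existing neko/xerces parsing path, and LOG is assumed to be the plugin's logger.

{code}
// Sketch only, not the shipped code.
public ParseResult getParse(Content content) {
  try {
    return parseNeko(content); // hypothetical helper: the existing parsing logic
  } catch (Throwable t) {
    // AbstractMethodError is an Error, not an Exception, so a plain
    // catch (Exception e) never sees it -- catch Throwable to log it.
    LOG.error("Parse failed for " + content.getUrl(), t);
    return new ParseStatus(t).getEmptyParseResult(content.getUrl(), getConf());
  }
}
{code}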
[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Talat UYARER updated NUTCH-1478:
--------------------------------
    Attachment: NUTCH-1478v5.patch

I fixed several mistakes within the patch. This is final.

> Parse-metatags and index-metadata plugin for Nutch 2.x series
> --------------------------------------------------------------
>
>     Key: NUTCH-1478
>     URL: https://issues.apache.org/jira/browse/NUTCH-1478
>     Project: Nutch
>     Issue Type: Improvement
>     Components: parser
>     Affects Versions: 2.1
>     Reporter: kiran
>     Fix For: 2.3
>     Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch, NUTCH-1478v4.patch, NUTCH-1478v5.patch, Nutch1478.patch, Nutch1478.zip, metadata_parseChecker_sites.png
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. This will take multiple values of the same tag and index them in Solr, as I patched before (https://issues.apache.org/jira/browse/NUTCH-1467). The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no need to give the 'metatag' keyword before metatag names. For example, my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).
> This is only the first version and does not include the junit test. I will update the new version soon. This will parse the tags and index them in Solr. Make sure you also create the fields listed under 'index.parse.md' in nutch-site.xml in Solr's schema.xml. Please let me know if you have any suggestions.
> This is supported by the DLA (Digital Library and Archives) of Virginia Tech.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
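For readers without access to the linked file, a hedged sketch of the kind of nutch-site.xml entries the description refers to; the property names metatags.names and index.parse.md follow the plugin's documented conventions, but the exact separators and values here are assumptions, not the author's actual configuration.

{code}
<!-- Sketch of a possible configuration; values are illustrative. -->
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- per the description above, no "metatag." prefix is needed in this port -->
  <value>description,keywords</value>
</property>
{code}

Each field named in index.parse.md would then also need a matching field definition in Solr's schema.xml, as the description notes.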
[jira] [Comment Edited] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915619#comment-13915619 ]

Talat UYARER edited comment on NUTCH-1478 at 2/28/14 10:03 AM:
---------------------------------------------------------------

I fixed several mistakes within the patch. This is final. [~popalka], can you test the patch?

was (Author: talat):
I fixed several mistakes within the patch. This is final.

> Parse-metatags and index-metadata plugin for Nutch 2.x series
> --------------------------------------------------------------
>
>     Key: NUTCH-1478
>     URL: https://issues.apache.org/jira/browse/NUTCH-1478
>     Project: Nutch
>     Issue Type: Improvement
>     Components: parser
>     Affects Versions: 2.1
>     Reporter: kiran
>     Fix For: 2.3
>     Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch, NUTCH-1478v4.patch, NUTCH-1478v5.patch, Nutch1478.patch, Nutch1478.zip, metadata_parseChecker_sites.png
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. This will take multiple values of the same tag and index them in Solr, as I patched before (https://issues.apache.org/jira/browse/NUTCH-1467). The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no need to give the 'metatag' keyword before metatag names. For example, my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).
> This is only the first version and does not include the junit test. I will update the new version soon. This will parse the tags and index them in Solr. Make sure you also create the fields listed under 'index.parse.md' in nutch-site.xml in Solr's schema.xml. Please let me know if you have any suggestions.
> This is supported by the DLA (Digital Library and Archives) of Virginia Tech.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915701#comment-13915701 ]

Lewis John McGibbney commented on NUTCH-1253:
---------------------------------------------

The version of nekohtml we are using is

<dependency org="net.sourceforge.nekohtml" name="nekohtml" rev="1.9.19" conf="*->master"/>

AFAIK this is the most recent.

> Incompatible neko and xerces versions
> -------------------------------------
>
>     Key: NUTCH-1253
>     URL: https://issues.apache.org/jira/browse/NUTCH-1253
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.4
>     Environment: Ubuntu 10.04
>     Reporter: Dennis Spathis
>     Assignee: Lewis John McGibbney
>     Fix For: 2.3, 1.8
>     Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253-trunk.patch, NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, nutch1253parsed.html, nutch1253test.html
>
> The Nutch 1.4 distribution includes
> - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
> - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: to see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stack trace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:
> <plugin id="lib-nekohtml" name="CyberNeko HTML Parser" version="1.9.11" provider-name="org.cyberneko">
>   <runtime>
>     <library name="nekohtml-0.9.5.jar">
>       <export name="*"/>
>     </library>
>   </runtime>
> </plugin>
> Note the conflicting version numbers (the version attribute is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Updated] (NUTCH-1727) Configurable length for Tlds
[ https://issues.apache.org/jira/browse/NUTCH-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sertac TURKEL updated NUTCH-1727:
---------------------------------
    Attachment: (was: NUTCH-1727.patch)

> Configurable length for Tlds
> ----------------------------
>
>     Key: NUTCH-1727
>     URL: https://issues.apache.org/jira/browse/NUTCH-1727
>     Project: Nutch
>     Issue Type: Bug
>     Reporter: Sertac TURKEL
>     Priority: Minor
>     Fix For: 2.3
>
> The length of the TLD should be configurable: there are TLDs like .travel, and the url-validator plugin currently filters out URLs of this type.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915730#comment-13915730 ]

Yasin Kılınç commented on NUTCH-1253:
-------------------------------------

OK, but there is a line like this in the eclipse target of NUTCH_HOME/build.xml:
{code}
<library path="${basedir}/build/plugins/lib-nekohtml/nekohtml-0.9.5.jar" exported="false"/>
{code}

> Incompatible neko and xerces versions
> -------------------------------------
>
>     Key: NUTCH-1253
>     URL: https://issues.apache.org/jira/browse/NUTCH-1253
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.4
>     Environment: Ubuntu 10.04
>     Reporter: Dennis Spathis
>     Assignee: Lewis John McGibbney
>     Fix For: 2.3, 1.8
>     Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, NUTCH-1253-trunk.patch, NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, nutch1253parsed.html, nutch1253test.html
>
> The Nutch 1.4 distribution includes
> - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
> - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError. (Note: to see the AbstractMethodError, rebuild the HtmlParser plugin and add a catch(Throwable) clause in the getParse method to log the stack trace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:
> <plugin id="lib-nekohtml" name="CyberNeko HTML Parser" version="1.9.11" provider-name="org.cyberneko">
>   <runtime>
>     <library name="nekohtml-0.9.5.jar">
>       <export name="*"/>
>     </library>
>   </runtime>
> </plugin>
> Note the conflicting version numbers (the version attribute is 1.9.11 but the specified library is nekohtml-0.9.5.jar). Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11?

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
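If the eclipse target is to match the ivy dependency quoted in the previous comment, the entry would presumably need to point at the newer jar, e.g. (illustrative, not a committed change):

{code}
<library path="${basedir}/build/plugins/lib-nekohtml/nekohtml-1.9.19.jar" exported="false"/>
{code}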
[jira] [Updated] (NUTCH-1727) Configurable length for Tlds
[ https://issues.apache.org/jira/browse/NUTCH-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sertac TURKEL updated NUTCH-1727:
---------------------------------
    Attachment: NUTCH-1727.patch

Hi [~lewismc], there was a point that I missed. I found it and updated the patch file. I think it is OK now. Could you review it again?

> Configurable length for Tlds
> ----------------------------
>
>     Key: NUTCH-1727
>     URL: https://issues.apache.org/jira/browse/NUTCH-1727
>     Project: Nutch
>     Issue Type: Bug
>     Reporter: Sertac TURKEL
>     Priority: Minor
>     Fix For: 2.3
>     Attachments: NUTCH-1727.patch
>
> The length of the TLD should be configurable: there are TLDs like .travel, and the url-validator plugin currently filters out URLs of this type.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
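To illustrate the problem being fixed (a sketch, not the patch itself; the property name urlfilter.tld.length is hypothetical): with a hard-coded upper bound of 4 characters, a TLD pattern rejects "travel", whereas a configurable bound accepts it.

{code}
import java.util.regex.Pattern;

public class TldLengthDemo {
  public static void main(String[] args) {
    // A classic validator bound of {2,4} rejects six-letter TLDs:
    System.out.println(Pattern.matches("\\p{Alpha}{2,4}", "travel")); // false
    // A configurable upper bound (hypothetical property name) accepts them:
    int maxTldLength = Integer.getInteger("urlfilter.tld.length", 8);
    System.out.println(
        Pattern.matches("\\p{Alpha}{2," + maxTldLength + "}", "travel")); // true
  }
}
{code}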
[jira] [Created] (NUTCH-1732) IndexerMapReduce to delete explicitly not indexable documents
Sebastian Nagel created NUTCH-1732:
-----------------------------------
    Summary: IndexerMapReduce to delete explicitly not indexable documents
    Key: NUTCH-1732
    URL: https://issues.apache.org/jira/browse/NUTCH-1732
    Project: Nutch
    Issue Type: Bug
    Components: indexer
    Affects Versions: 1.8
    Reporter: Sebastian Nagel
    Fix For: 1.9

In a continuous crawl, a previously successfully indexed document (identified by a URL) can become non-indexable for a couple of reasons and must then be explicitly deleted from the index. Some cases are handled in IndexerMapReduce (duplicates, gone documents or redirects, cf. NUTCH-1139) but others are not:
* failed to parse (but previously successfully parsed): e.g., the document became larger and is now truncated
* rejected by an indexing filter (but previously accepted)
In both cases (maybe there are more) the document should be explicitly deleted (if {{-deleteGone}} is set). Note that this cannot be done in CleaningJob because data from segments is required.
We should also update/add a description for {{-deleteGone}}: it does not only trigger deletion of gone documents but also of redirects and duplicates (and unparseable and skipped docs).

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1732) IndexerMapReduce to delete explicitly not indexable documents
[ https://issues.apache.org/jira/browse/NUTCH-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915828#comment-13915828 ]

Markus Jelsma commented on NUTCH-1732:
--------------------------------------

We have an explicit deleteSkippedByIndexingFilter option instead; it seems I never committed it to Apache Nutch in NUTCH-1449.
{code}
// skip documents discarded by indexing filters
if (doc == null) {
  // https://issues.apache.org/jira/browse/NUTCH-1449
  if (deleteSkippedByIndexingFilter) {
    // emit a delete action instead of silently dropping the document
    NutchIndexAction action = new NutchIndexAction(NutchIndexAction.DELETE);
    output.collect(key, action);
    reporter.incrCounter("IndexerStatus", "Deleted by filters", 1);
  } else {
    reporter.incrCounter("IndexerStatus", "Skipped by filters", 1);
  }
  return;
}
{code}

> IndexerMapReduce to delete explicitly not indexable documents
> --------------------------------------------------------------
>
>     Key: NUTCH-1732
>     URL: https://issues.apache.org/jira/browse/NUTCH-1732
>     Project: Nutch
>     Issue Type: Bug
>     Components: indexer
>     Affects Versions: 1.8
>     Reporter: Sebastian Nagel
>     Fix For: 1.9
>
> In a continuous crawl, a previously successfully indexed document (identified by a URL) can become non-indexable for a couple of reasons and must then be explicitly deleted from the index. Some cases are handled in IndexerMapReduce (duplicates, gone documents or redirects, cf. NUTCH-1139) but others are not:
> * failed to parse (but previously successfully parsed): e.g., the document became larger and is now truncated
> * rejected by an indexing filter (but previously accepted)
> In both cases (maybe there are more) the document should be explicitly deleted (if {{-deleteGone}} is set). Note that this cannot be done in CleaningJob because data from segments is required.
> We should also update/add a description for {{-deleteGone}}: it does not only trigger deletion of gone documents but also of redirects and duplicates (and unparseable and skipped docs).

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915831#comment-13915831 ]

Sebastian Nagel commented on NUTCH-1113:
----------------------------------------

Results of tests: the number of documents in the index after all segments have been
* (A) indexed in chronological order, segment by segment, or
* (B) merged into one segment which has then been indexed
is shown below. B has been run twice: (1) without any patch, and (2) using the patch as of 2014-02-21. IndexerMapReduce was patched with NUTCH-1706-trunk-v2.patch for all 3 runs.
|| || coll 1 || coll 2 || coll 3 ||
| A seg-by-seg | 22178 | 6959 | 45944 |
| B1 merged | 21122 | 6579 | 46029 |
| B2 patched, merged | 22161 | 6959 | 46135 |
3 collections have been tested, all of them with ~100 segments and 100,000 URLs, but with many redirects, robots noindex, etc. (far more than indexable documents).
With the patch (B2 compared to B1) the index contains more documents. For collection 2 it is now equal to the expected number (A). For the other two collections the numbers still differ, but that is because of problems in IndexerMapReduce (NUTCH-1708 and NUTCH-1732).
+1 to commit [~markus17]'s latest patch.

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Priority: Blocker
>     Fix For: 1.9
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Comment Edited] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908265#comment-13908265 ]

Sebastian Nagel edited comment on NUTCH-1113 at 2/28/14 2:45 PM:
-----------------------------------------------------------------

Hi [~markus17], your patch should work (I've tested it exactly the same way). The indexer was run with {{indexer.skip.notmodified == false}}. The problem is that in the merged segment fetch_success datums have been lost, and the following test skipped these URLs:
{code}
if (!parseData.getStatus().isSuccess()
    || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
  return;
}
{code}
Just to clarify that we use the same test set-up:
# start with an empty index
# index (case A) segments in chronological order or (case B) the merged segment
# compare both indexes
The CrawlDb was updated with URLs from all segments. The same CrawlDb is used for all index runs, right?
I plan to run the test with {{indexer.skip.notmodified == false}}. Otherwise, the index will not contain any pages with status notmodified.

was (Author: wastl-nagel):
Hi [~markus17], your patch should work (I've tested it exactly the same way). The indexer was run with {{indexer.skip.notmodified == false}}. The problem is that in the merged segment fetch_success datums have been lost and the following test skipped these URLs:
{code}
if (!parseData.getStatus().isSuccess()
    || fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
  return;
}
{code}
Just to clarify that we use the same test set-up:
# start with an empty index
# index (case A) segments in chronological order or (case B) merged segment
# compare both indexes
The CrawlDb was updated with URLs from all segments. The same CrawlDb is used for all index runs, right?
I plan to run the test with {{indexer.skip.notmodified == false}}. Otherwise, we the index will not contain any pages with status notmodified.

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Priority: Blocker
>     Fix For: 1.9
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
HTTP Post request
Hi,

I would like to be able to send an HTTP POST request for Nutch to crawl. I mean, if I ever wanted to crawl a search result, I could do http://www.example.com/search?q=mySearch
But if the server uses HTTP POST, I have not found a way to do it. So what I wanted to do is retrieve the method (POST/GET) and the names of the parameters from a conf file, so that when Nutch comes across a given URL, it will access the page with the right HTTP request. For my example the conf.xml would be like:

<url href="http://www.example.com/search" method="post"/>

But as I am new to Nutch, could someone provide me with some clues on how to start this new plugin?

Best regards,
Zabini

--
View this message in context: http://lucene.472066.n3.nabble.com/HTTP-Post-request-tp4120405.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
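As a starting point, the POST itself is straightforward with the Commons HttpClient 3.x API that Nutch's protocol-httpclient plugin already bundles; the plugin work would mostly be mapping conf entries like the one above onto requests. A rough sketch, assuming the URL, method, and parameter names come from the proposed conf file (the parameter name "q" here is illustrative):

{code}
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

public class PostFetchSketch {
  public static void main(String[] args) throws Exception {
    HttpClient client = new HttpClient();
    // URL and method would be looked up in the proposed conf file
    PostMethod post = new PostMethod("http://www.example.com/search");
    post.addParameter("q", "mySearch"); // parameter names also from the conf
    int status = client.executeMethod(post);
    System.out.println("HTTP " + status);
    post.releaseConnection();
  }
}
{code}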
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
---------------------------------
    Attachment: NUTCH-1113-trunk-junit-final.patch

Final patch including the stuff mentioned by Sebastian and the junit test. I will commit shortly unless there are some final objections :)

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Priority: Blocker
>     Fix For: 1.9
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1732) IndexerMapReduce to delete explicitly not indexable documents
[ https://issues.apache.org/jira/browse/NUTCH-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915889#comment-13915889 ]

Sebastian Nagel commented on NUTCH-1732:
----------------------------------------

Hi [~markus17], that looks like a partial duplicate. I've seen documents in the index whose latest version failed to parse. They should not be in the index, no matter how the segments are indexed: segment by segment in chronological order, all segments in one turn, or merged first (cf. NUTCH-1113). Having one extra option is OK, but the other case (failed parses) could be subsumed under {{-deleteGone}}.

> IndexerMapReduce to delete explicitly not indexable documents
> --------------------------------------------------------------
>
>     Key: NUTCH-1732
>     URL: https://issues.apache.org/jira/browse/NUTCH-1732
>     Project: Nutch
>     Issue Type: Bug
>     Components: indexer
>     Affects Versions: 1.8
>     Reporter: Sebastian Nagel
>     Fix For: 1.9
>
> In a continuous crawl, a previously successfully indexed document (identified by a URL) can become non-indexable for a couple of reasons and must then be explicitly deleted from the index. Some cases are handled in IndexerMapReduce (duplicates, gone documents or redirects, cf. NUTCH-1139) but others are not:
> * failed to parse (but previously successfully parsed): e.g., the document became larger and is now truncated
> * rejected by an indexing filter (but previously accepted)
> In both cases (maybe there are more) the document should be explicitly deleted (if {{-deleteGone}} is set). Note that this cannot be done in CleaningJob because data from segments is required.
> We should also update/add a description for {{-deleteGone}}: it does not only trigger deletion of gone documents but also of redirects and duplicates (and unparseable and skipped docs).

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
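Subsuming the failed-parse case under {{-deleteGone}} might then look like the following sketch, which mirrors the NUTCH-1449 snippet quoted earlier in this issue (the variable names, constructor form, and counter group are assumed from that snippet; this is not a committed patch):

{code}
// Sketch: also delete documents whose latest parse failed when -deleteGone
// is set, in addition to the existing gone/redirect/duplicate handling.
if (deleteGone && parseData != null && !parseData.getStatus().isSuccess()) {
  NutchIndexAction action = new NutchIndexAction(NutchIndexAction.DELETE);
  output.collect(key, action);
  reporter.incrCounter("IndexerStatus", "Deleted unparseable", 1);
  return;
}
{code}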
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
---------------------------------
    Fix Version/s: (was: 1.9)
                   1.8

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Assignee: Markus Jelsma
>     Priority: Blocker
>     Fix For: 1.8
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Resolved] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1113.
----------------------------------
    Resolution: Fixed
    Assignee: Markus Jelsma

Committed revision 1572975. Thanks all for contributing. I am very happy this is fixed once and for all. :)

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Assignee: Markus Jelsma
>     Priority: Blocker
>     Fix For: 1.9
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
Re: Nutch roadmap and documentation
Hi Mateusz,

On Thu, Feb 27, 2014 at 10:35 AM, Mateusz Zakarczemny <mateusz.zakarcze...@up2data.pl> wrote:

> Docs from 1 and 2 branch are mixed together.

As far as I can see they are separate. The tutorials are clearly under different subsections, and the Nutch 2.x docs have their own section as well.

> I understand that detailed documentation easily becomes outdated. But providing information about the existence of a feature is a very basic task of documentation. It should always be up to date.

What exactly are you referring to here? I am slightly puzzled as to why not one other user has requested documentation on individual issues. IMHO, if people wish to use and develop Nutch, they should at minimum subscribe to user@ and dev@... the latter contains EVERY issue which is discussed and every feature enhancement. The same is done for other projects.

> A changelog is not feature documentation.

No, but the point of the changelog is to refer people to what is included. We also now provide a link to the Jira release profile. It is up to users/developers to read up if they wish to learn more about individual issues. An example of such a link can be found in the 2.x CHANGES.txt: Release Report - http://s.apache.org/PGa

> If a new user looks at Nutch, he will not check the changelog but the documentation.

Is this your opinion or are you commenting from a wider audience's perspective?

> I think the new user should be provided with clear information about which branch to choose.

I agree with this. This is why the lists exist. You can ask questions. You can also read the archives. It takes a minimal, well-spent investment of time to dig up what others have asked many, many times. Don't get me wrong, I am all for informing people about the software... however, I am not in the immediate position to write a decent-quality book on Nutch which would do the community and software justice. If you are, then please do.

> What is more, the docs should be divided into branch 1 and 2.

Please see the table of contents on the wiki. Please also see my comments above.

> Pages could link together, but there should be a clean branch tree in the docs, as in the source code. You do not mix packages from two branches; you keep them in separate repos.

ditto

> I don't think that documentation is essential for bugs, only for new features or refactoring. It doesn't have to be a big document. It just has to exist.

But what happens if fixing a bug changes functionality? Then what?

> Nowadays there are some plugins which are not mentioned in plugin central. It is very confusing.

Yes, I agree with this. It is not entirely up-to-date. This is something we should most likely address.

> I know that sometimes developers don't have time to create documentation. But in such cases they should create a new task for that doc. Otherwise nobody knows that the doc is missing and cannot help.

Not true. All you need to do is request karma for the project wiki and you can contribute whatever you feel is missing. I don't accept this argument, sorry.

> I am not saying that Confluence is best for this project. But in my opinion the Nutch docs should be moved to some community/social solution. It would be great if it enabled comments and pull requests (like on GitHub) to improve it.

AFAICT the wiki we currently have IS community oriented. Anyone over the years who has wished to add/edit has been granted karma to do so. Are you really saying that enabling pull requests via GitHub is a better way than simply granting someone karma to edit a page as they wish?

> Maybe MD files would be better? Documentation could be stored with the source code, e.g. a doc folder in each plugin. It would be tied to the source code structure. This approach has many advantages. When I contribute some docs on GitHub I don't have to apply anywhere or ask anybody. I just create a pull request against the documentation. The project leader sees it and can then review and apply it. The whole process takes 3-4 mouse clicks. One drawback is that moving to such a solution would be quite complex and time consuming.

Yes, it certainly would be.

> Over the last 10 years the Nutch documentation grew incrementally. I think it is time to refactor it in a more modular and structured way (like source code). I don't want to rewrite it, just create a better structure.

Honestly, I haven't seen anything in your commentary which would suggest benefits for Nutch as a whole... I am trying NOT to be pessimistic, but I am just struggling to see your point here. If the wiki is outdated... then we should update it, not change to another solution just so we can receive pull requests for documentation. There is an argument for making it as easy as possible to contribute documentation to Nutch. However, as far as I can see, there are no crowds of people rushing to contribute. Please don't take these comments negatively. I am behind any motion to make documentation better. I just don't see eye-to-eye with some of your points.

PS
[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915917#comment-13915917 ]

Julien Nioche commented on NUTCH-1113:
--------------------------------------

Well done, thanks guys!

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Assignee: Markus Jelsma
>     Priority: Blocker
>     Fix For: 1.8
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
[jira] [Commented] (NUTCH-1706) IndexerMapReduce does not remove db_redir_temp etc
[ https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915919#comment-13915919 ]

Sebastian Nagel commented on NUTCH-1706:
----------------------------------------

Latest patch tested successfully (see NUTCH-1113). Will commit shortly. [~markus17], can you open an issue about the fetch_retry? Regarding the ordering of values when indexing multiple segments of a continuous crawl: there are already NUTCH-1416 and NUTCH-1617.

> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
>     Key: NUTCH-1706
>     URL: https://issues.apache.org/jira/browse/NUTCH-1706
>     Project: Nutch
>     Issue Type: Bug
>     Components: indexer
>     Affects Versions: 1.7
>     Reporter: Markus Jelsma
>     Assignee: Markus Jelsma
>     Priority: Blocker
>     Fix For: 1.8
>     Attachments: NUTCH-1706-trunk-v2.patch, NUTCH-1706-trunk.patch, nutch-1706-testdata.tgz
>
> The code path is wrong in IndexerMapReduce; the delete code should be located after all reducer values have been gathered.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
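The structural fix the issue describes, gathering all reducer values first and only then deciding, could look roughly like the sketch below. This is illustrative, not the committed NUTCH-1706 patch; the NutchIndexAction constructor form and the surrounding variables are assumptions based on the Nutch 1.x indexer API.

{code}
// Sketch: buffer the verdict while consuming ALL values for this key,
// instead of emitting a delete as soon as the first redirect/gone datum
// is seen (an early db_redir_temp must not shadow a later fetch_success).
byte lastFetchStatus = 0;
while (values.hasNext()) {
  CrawlDatum datum = values.next();
  if (CrawlDatum.hasFetchStatus(datum)) {
    lastFetchStatus = datum.getStatus(); // later datums win
  }
  // ... collect parseData, fetchDatum, etc. as before ...
}
// Only after the loop is the complete picture known:
if (deleteGone && (lastFetchStatus == CrawlDatum.STATUS_FETCH_GONE
    || lastFetchStatus == CrawlDatum.STATUS_FETCH_REDIR_TEMP
    || lastFetchStatus == CrawlDatum.STATUS_FETCH_REDIR_PERM)) {
  output.collect(key, new NutchIndexAction(null, NutchIndexAction.DELETE));
  return;
}
{code}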
Build failed in Jenkins: Nutch-trunk #2545
See https://builds.apache.org/job/Nutch-trunk/2545/changes

Changes:

[markus] NUTCH-1113 SegmentMerger can now be safely used to merge segments. If this damn thing breaks again

------------------------------------------
[...truncated 3001 lines...]
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: subcollection
    [javac] Compiling 3 source files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/subcollection/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/subcollection/subcollection.jar

deps-test:

deploy:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/subcollection

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/subcollection

init:
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/tld
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/tld/classes
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/tld/test
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/tld

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: tld
    [javac] Compiling 2 source files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/tld/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/tld/tld.jar

deps-test:

deploy:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/tld

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/tld
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-automaton/test/data
     [copy] Copying 6 files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-automaton/test/data

init:
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-automaton/classes
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-automaton

init-plugin:

deps-jar:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-regex-filter

jar:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-regex-filter

compile-test:
    [javac] Compiling 1 source file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/lib-regex-filter/test
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: urlfilter-automaton
    [javac] Compiling 1 source file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-automaton/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-automaton/urlfilter-automaton.jar

deps-test:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-regex-filter

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-automaton

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-automaton
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-domain/test/data
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/urlfilter-domain/test/data

init:
    [mkdir] Created dir:
[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915969#comment-13915969 ]

Hudson commented on NUTCH-1113:
-------------------------------

FAILURE: Integrated in Nutch-trunk #2545 (See [https://builds.apache.org/job/Nutch-trunk/2545/])
NUTCH-1113 SegmentMerger can now be safely used to merge segments. If this damn thing breaks again (markus: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1572975)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java
* /nutch/trunk/src/test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java

> Merging segments causes URLs to vanish from crawldb/index?
> -----------------------------------------------------------
>
>     Key: NUTCH-1113
>     URL: https://issues.apache.org/jira/browse/NUTCH-1113
>     Project: Nutch
>     Issue Type: Bug
>     Affects Versions: 1.3
>     Reporter: Edward Drapkin
>     Assignee: Markus Jelsma
>     Priority: Blocker
>     Fix For: 1.8
>     Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-trunk-junit-final.patch, NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>   nutch generate crawldb/ segments/ -normalize
>   nutch fetch `ls -d segments/* | tail -1`
>   nutch parse `ls -d segments/* | tail -1`
>   nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/
> (I forward-ported the Lucene indexing code from Nutch 1.1.)
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
Build failed in Jenkins: Nutch-trunk #2546
See https://builds.apache.org/job/Nutch-trunk/2546/

------------------------------------------
[...truncated 2159 lines...]

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-ftp

init:
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-http
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-http/classes
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-http/test
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-http

init-plugin:

deps-jar:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-http

jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: protocol-http
    [javac] Compiling 2 source files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-http/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-http/protocol-http.jar

deps-test:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-http

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-http

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-http
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/test/data
     [copy] Copying 5 files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/test/data

init:
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/classes
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-httpclient

init-plugin:

deps-jar:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: lib-http

jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: protocol-httpclient
    [javac] Compiling 8 source files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/classes
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] 1 warning

jar:
      [jar] Building jar: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/protocol-httpclient.jar

deps-test:
     [copy] Copying 2 files to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/test
     [copy] Copied 6 empty directories to 5 empty directories under /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/protocol-httpclient/test

deploy:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-httpclient

copy-generated-lib:
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/protocol-httpclient
     [copy] Copying 1 file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/plugins/parse-ext

init:
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/parse-ext
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/parse-ext/classes
    [mkdir] Created dir: /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/parse-ext/test

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
     [echo] Compiling plugin: parse-ext
    [javac] Compiling 1 source file to /home/hudson/jenkins-slave/workspace/Nutch-trunk/trunk/build/parse-ext/classes
    [javac] warning: [options] bootstrap class