[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453904#comment-16453904
 ] 

Hudson commented on NUTCH-2517:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3522 (See 
[https://builds.apache.org/job/Nutch-trunk/3522/])
NUTCH-2517 mergesegs corrupts segment data - fix name of output (snagel: 
[https://github.com/apache/nutch/commit/2f50e801005493d0217160b7239eb2db82ca89f4])
* (edit) src/java/org/apache/nutch/segment/SegmentMerger.java
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java
* (edit) src/java/org/apache/nutch/indexer/IndexerOutputFormat.java


> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems that mapreduce corrupts the segment folder during the mergesegs command.
>  
> Note that this issue is not limited to merging a single segment as described 
> above. As the attached screenshots show, the problem also appears when 
> executing multiple bin/nutch generate/fetch/parse/updatedb commands before 
> executing mergesegs, resulting in a segment count > 1.
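
For context on the log line quoted above, "SegmentMerger: using segment data 
from: ..." reflects the SegmentMerger probing which of the six standard 
subdirectories exist in its input segments. Below is a hedged sketch of that 
probing: the DIR_NAME constants are real Nutch constants, but the class and 
method are invented for illustration and simplify the actual logic.
{noformat}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;

// Hypothetical helper, not the actual SegmentMerger source.
public class SegmentDirProbe {

  // Returns the space-separated list of subdirectories that actually exist
  // under the given segment, mirroring the "using segment data from:" log.
  public static String probe(FileSystem fs, Path segment) throws IOException {
    StringBuilder parts = new StringBuilder();
    if (fs.exists(new Path(segment, Content.DIR_NAME)))             // content
      parts.append(" ").append(Content.DIR_NAME);
    if (fs.exists(new Path(segment, CrawlDatum.GENERATE_DIR_NAME))) // crawl_generate
      parts.append(" ").append(CrawlDatum.GENERATE_DIR_NAME);
    if (fs.exists(new Path(segment, CrawlDatum.FETCH_DIR_NAME)))    // crawl_fetch
      parts.append(" ").append(CrawlDatum.FETCH_DIR_NAME);
    if (fs.exists(new Path(segment, CrawlDatum.PARSE_DIR_NAME)))    // crawl_parse
      parts.append(" ").append(CrawlDatum.PARSE_DIR_NAME);
    if (fs.exists(new Path(segment, ParseData.DIR_NAME)))           // parse_data
      parts.append(" ").append(ParseData.DIR_NAME);
    if (fs.exists(new Path(segment, ParseText.DIR_NAME)))           // parse_text
      parts.append(" ").append(ParseText.DIR_NAME);
    return parts.toString().trim();
  }
}
{noformat}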

[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453849#comment-16453849
 ] 

ASF GitHub Bot commented on NUTCH-2517:
---

sebastian-nagel closed pull request #321: NUTCH-2517 mergesegs corrupts segment 
data
URL: https://github.com/apache/nutch/pull/321
 
 
   

This is a PR merged from a forked repository. As GitHub hides the original
diff on merge, it is displayed below for the sake of provenance:

diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java 
b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index 87bf58525..424db3d5a 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -99,7 +99,6 @@ private void openReaders(String crawlDb, Configuration config)
 if (readers != null)
   return;
 Path crawlDbPath = new Path(crawlDb, CrawlDb.CURRENT_NAME);
-FileSystem fs = crawlDbPath.getFileSystem(config);
 readers = MapFileOutputFormat.getReaders(crawlDbPath, config);
   }
 
@@ -180,7 +179,7 @@ public synchronized void close(TaskAttemptContext context) 
throws IOException {
 
 public RecordWriter<Text, CrawlDatum> getRecordWriter(TaskAttemptContext
 context) throws IOException {
-  String name = context.getTaskAttemptID().toString();
+  String name = getUniqueFile(context, "part", "");
   Path dir = FileOutputFormat.getOutputPath(context);
   FileSystem fs = dir.getFileSystem(context.getConfiguration());
   DataOutputStream fileOut = fs.create(new Path(dir, name), context);
diff --git a/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java 
b/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
index 54b98dfce..359f9d1fc 100644
--- a/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
+++ b/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
@@ -34,7 +34,7 @@
 Configuration conf = context.getConfiguration();
 final IndexWriters writers = new IndexWriters(conf);
 
-String name = context.getTaskAttemptID().toString();
+String name = getUniqueFile(context, "part", "");
 writers.open(conf, name);
 
 return new RecordWriter<Text, NutchDocument>() {
diff --git a/src/java/org/apache/nutch/segment/SegmentMerger.java 
b/src/java/org/apache/nutch/segment/SegmentMerger.java
index b1f1d8948..f4adf52b4 100644
--- a/src/java/org/apache/nutch/segment/SegmentMerger.java
+++ b/src/java/org/apache/nutch/segment/SegmentMerger.java
@@ -139,7 +139,6 @@
 throws IOException {
 
   context.setStatus(split.toString());
-  Configuration conf = context.getConfiguration();
 
   // find part name
   SegmentPart segmentPart;
@@ -213,7 +212,7 @@ public synchronized void close() throws IOException {
public RecordWriter<Text, MetaWrapper> getRecordWriter(TaskAttemptContext 
context)
 throws IOException {
   Configuration conf = context.getConfiguration();
-  String name = context.getTaskAttemptID().toString();
+  String name = getUniqueFile(context, "part", "");
   Path dir = FileOutputFormat.getOutputPath(context);
   FileSystem fs = dir.getFileSystem(context.getConfiguration());
 
diff --git a/src/java/org/apache/nutch/segment/SegmentReader.java 
b/src/java/org/apache/nutch/segment/SegmentReader.java
index 0b65a2b81..7193c58f7 100644
--- a/src/java/org/apache/nutch/segment/SegmentReader.java
+++ b/src/java/org/apache/nutch/segment/SegmentReader.java
@@ -106,8 +106,7 @@ public void map(WritableComparable<?> key, Writable value,
   FileOutputFormat<WritableComparable<?>, Writable> {
 public RecordWriter<WritableComparable<?>, Writable> getRecordWriter(
 TaskAttemptContext context) throws IOException, InterruptedException {
-  Configuration conf = context.getConfiguration();
-  String name = context.getTaskAttemptID().toString();
+  String name = getUniqueFile(context, "part", "");
   Path dir = FileOutputFormat.getOutputPath(context);
   FileSystem fs = dir.getFileSystem(context.getConfiguration());
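
The common thread in all four hunks is replacing the raw task attempt ID with 
Hadoop's FileOutputFormat.getUniqueFile() when naming output files. A small, 
self-contained demonstration of the difference (the job identifier below is 
invented for illustration):
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.TaskType;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class OutputNameDemo {
  public static void main(String[] args) {
    TaskAttemptID id =
        new TaskAttemptID("local198660722", 1, TaskType.REDUCE, 0, 0);
    TaskAttemptContext context =
        new TaskAttemptContextImpl(new Configuration(), id);

    // Old naming: the full task attempt ID, e.g.
    // "attempt_local198660722_0001_r_000000_0"
    System.out.println(id.toString());

    // New naming: the conventional part file name, e.g. "part-r-00000",
    // which tools reading segment subdirectories expect.
    System.out.println(FileOutputFormat.getUniqueFile(context, "part", ""));
  }
}
{noformat}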
 


 





[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-04-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447333#comment-16447333
 ] 

ASF GitHub Bot commented on NUTCH-2517:
---

sebastian-nagel opened a new pull request #321: NUTCH-2517 mergesegs corrupts 
segment data
URL: https://github.com/apache/nutch/pull/321
 
 
   - fix name of output directories for SegmentMerger and other tools





[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398901#comment-16398901
 ] 

Hudson commented on NUTCH-2517:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3508 (See 
[https://builds.apache.org/job/Nutch-trunk/3508/])
NUTCH-2517 mergesegs corrupts segment data (lewis.mcgibbney: 
[https://github.com/apache/nutch/commit/dc516b700f9e7b735db5af2b5fb439681f3a0e87])
* (edit) src/java/org/apache/nutch/segment/SegmentMerger.java
* (edit) src/java/org/apache/nutch/crawl/LinkDb.java



[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398765#comment-16398765
 ] 

ASF GitHub Bot commented on NUTCH-2517:
---

lewismc closed pull request #293: NUTCH-2517 mergesegs corrupts segment data
URL: https://github.com/apache/nutch/pull/293
 
 
   

This is a PR merged from a forked repository. As GitHub hides the original
diff on merge, it is displayed below for the sake of provenance:

diff --git a/src/java/org/apache/nutch/crawl/LinkDb.java 
b/src/java/org/apache/nutch/crawl/LinkDb.java
index b7a89b229..c6a32ba86 100644
--- a/src/java/org/apache/nutch/crawl/LinkDb.java
+++ b/src/java/org/apache/nutch/crawl/LinkDb.java
@@ -17,31 +17,39 @@
 
 package org.apache.nutch.crawl;
 
-import java.io.*;
+import java.io.File;
+import java.io.IOException;
 import java.lang.invoke.MethodHandles;
+import java.net.MalformedURLException;
+import java.net.URL;
 import java.text.SimpleDateFormat;
-import java.util.*;
-import java.net.*;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Random;
 
-// Commons Logging imports
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
-import org.apache.hadoop.io.*;
-import org.apache.hadoop.fs.*;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.conf.*;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
-import org.apache.hadoop.mapreduce.Mapper.Context;
-import org.apache.hadoop.util.*;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
 import org.apache.nutch.metadata.Nutch;
 import org.apache.nutch.net.URLFilters;
 import org.apache.nutch.net.URLNormalizers;
-import org.apache.nutch.parse.*;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.ParseData;
 import org.apache.nutch.util.HadoopFSUtil;
 import org.apache.nutch.util.LockUtil;
 import org.apache.nutch.util.NutchConfiguration;
@@ -53,7 +61,7 @@
 public class LinkDb extends NutchTool implements Tool {
 
   private static final Logger LOG = LoggerFactory
-  .getLogger(MethodHandles.lookup().lookupClass());
+  .getLogger(MethodHandles.lookup().lookupClass());
 
   public static final String IGNORE_INTERNAL_LINKS = 
"linkdb.ignore.internal.links";
   public static final String IGNORE_EXTERNAL_LINKS = 
"linkdb.ignore.external.links";
@@ -62,6 +70,7 @@
   public static final String LOCK_NAME = ".locked";
 
   public LinkDb() {
+//default constructor
   }
 
   public LinkDb(Configuration conf) {
@@ -69,7 +78,7 @@ public LinkDb(Configuration conf) {
   }
 
   public static class LinkDbMapper extends 
-  Mapper<Text, ParseData, Text, Inlinks> {
+  Mapper<Text, ParseData, Text, Inlinks> {
 private int maxAnchorLength;
 private boolean ignoreInternalLinks;
 private boolean ignoreExternalLinks;
@@ -94,17 +103,16 @@ public void cleanup(){
 }
 
 public void map(Text key, ParseData parseData,
-Context context)
-throws IOException, InterruptedException {
+Context context)
+throws IOException, InterruptedException {
   String fromUrl = key.toString();
   String fromHost = getHost(fromUrl);
   if (urlNormalizers != null) {
 try {
   fromUrl = urlNormalizers
-  .normalize(fromUrl, URLNormalizers.SCOPE_LINKDB); // normalize 
the
-// url
+  .normalize(fromUrl, URLNormalizers.SCOPE_LINKDB); // 
normalize the url
 } catch (Exception e) {
-  LOG.warn("Skipping " + fromUrl + ":" + e);
+  LOG.warn("Skipping {} :", fromUrl, e);
   fromUrl = null;
 }
   }
@@ -112,7 +120,7 @@ public void map(Text key, ParseData parseData,
 try {
   fromUrl = urlFilters.filter(fromUrl); // filter the url
 } catch (Exception e) {
-  LOG.warn("Skipping " + fromUrl + ":" + e);
+  LOG.warn("Skipping {} :", fromUrl, e);
   fromUrl = null;
 }
   }
@@ -131,17 +139,16 @@ public void map(Text key, ParseData parseData,
   }
 } else if (ignoreExternalLinks) {
   String toHost = getHost(toUrl);
-  if (toHost == null || !toHost.equals(fromHost)) { // external link
-continue;

[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398026#comment-16398026
 ] 

ASF GitHub Bot commented on NUTCH-2517:
---

lewismc commented on issue #293: NUTCH-2517 mergesegs corrupts segment data
URL: https://github.com/apache/nutch/pull/293#issuecomment-372889814
 
 
   No semantic changes here folks, will commit in 24hrs unless objections.





[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-13 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398025#comment-16398025
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Correct [~wastl-nagel]



[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-12 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395022#comment-16395022
 ] 

Sebastian Nagel commented on NUTCH-2517:


Hi [~mebbinghaus], first I'm also not able to reproduce the problem - I've 
successfully merged 9 segments from a test crawl:
 - (as a precondition) all 9 segments contain all 6 subfolders (content, 
crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text)
 It's clear: if one of the segments lacks one of these 6 subdirectories, the 
SegmentMerger will fall back to a reduced set of subdirs; see Lewis' log output 
"SegmentMerger: using segment data from: crawl_generate crawl_parse".
 - the index contains exactly the same documents when indexing (a) the 9 
segments and (b) the single merged segment
 - tested both on my local machine and in a Docker container

This problem could be a side-effect of NUTCH-2518: it may happen that some job 
failure (including the SegmentMerger job) is not detected.
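
If NUTCH-2518 is involved, the missing piece would be a check like the one 
sketched below: Job.waitForCompletion() reports failure through its return 
value, so ignoring it lets a failed merge pass silently (illustrative only, 
not the actual Nutch code):
{noformat}
import org.apache.hadoop.mapreduce.Job;

// Hypothetical wrapper showing the check a tool needs after job submission.
public class CheckedJobRunner {
  public static void runChecked(Job job) throws Exception {
    // Blocks until the job finishes; returns true only on success.
    boolean succeeded = job.waitForCompletion(true);
    if (!succeeded) {
      // Without this, a tool continues with incomplete output, e.g. a
      // merged segment missing most of its subdirectories.
      throw new RuntimeException(job.getJobName() + " job did not succeed");
    }
  }
}
{noformat}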

Also dubious: instead of "part-0" (or equiv.), the directory tree of a merged 
segment contains "attempt_local" dirs/files:
{noformat}
merged
└── 20180312094858
    ├── content
    │   └── attempt_local198660722_0001_r_00_0
    │       ├── data
    │       └── index
    ├── crawl_fetch
    │   └── attempt_local198660722_0001_r_00_0
    │       ├── data
    │       └── index
    ├── crawl_generate
    │   └── attempt_local198660722_0001_r_00_0
    ├── crawl_parse
    │   └── attempt_local198660722_0001_r_00_0
    ├── parse_data
    │   └── attempt_local198660722_0001_r_00_0
    │       ├── data
    │       └── index
    └── parse_text
        └── attempt_local198660722_0001_r_00_0
            ├── data
            └── index
{noformat}
The directory names do not really matter; however, we should have a look at 
this.

[~lewismc]: your PR does not include any "semantic" changes (except formatting, 
etc.), right?


[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Marco Ebbinghaus (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391830#comment-16391830
 ] 

Marco Ebbinghaus commented on NUTCH-2517:
-

I double-checked the behavior of version 1.14 and the current master (via 
Docker containers).

*If you do a single crawling cycle*

_1.14_: bin/nutch inject->generate->fetch->parse->updatedb->mergesegs creates a 
merged segment (out of one segment) containing 2 folders: crawl_generate and 
crawl_parse

_master_: bin/nutch inject->generate->fetch->parse->updatedb->mergesegs creates 
a merged segment (out of one segment) containing 1 folder: crawl_generate

_(But it might be that running mergesegs after a single crawling cycle is a 
misuse of the software anyway (idk), so let's look at doing multiple crawling 
cycles, which works better.)_

*If you do two crawling cycles:*

_1.14_: bin/nutch 
inject->generate->fetch->parse->updatedb->generate->fetch->parse->updatedb->mergesegs
 creates a merged segment (out of two segments) containing 6 folders: content, 
crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text

_master_: bin/nutch 
inject->generate->fetch->parse->updatedb->generate->fetch->parse->updatedb->mergesegs
 creates a merged segment (out of two segments) containing 2 folders: 
crawl_generate and crawl_parse

I am not sure how invertlinks works, but I can imagine it depends on ALL 
segment subfolders. The SegmentMerger says:
{quote}SegmentMerger: using segment data from: content crawl_generate 
crawl_fetch crawl_parse parse_data parse_text
{quote}
And I think all of these folders are still needed (e.g. by invertlinks), even if 
multiple segments are merged. So it works in 1.14, but no longer on master.
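
That matches how LinkDb builds its input paths: it adds the parse_data 
subdirectory of every segment, so a merged segment lacking that folder fails 
at job submission with exactly the InvalidInputException quoted above. A 
simplified sketch (not the exact LinkDb.invert() code):
{noformat}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.nutch.parse.ParseData;

// Hypothetical helper illustrating LinkDb's input path construction.
public class LinkDbInputSketch {
  public static void addSegments(Job job, Path[] segments) throws IOException {
    for (Path segment : segments) {
      // Each segment contributes <segment>/parse_data ("parse_data" is
      // ParseData.DIR_NAME); if mergesegs dropped it, submission fails
      // with InvalidInputException.
      FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
    }
  }
}
{noformat}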

 


[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391440#comment-16391440
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Can anyone else confirm the above?



[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391439#comment-16391439
 ] 

ASF GitHub Bot commented on NUTCH-2517:
---

lewismc opened a new pull request #293: NUTCH-2517 mergesegs corrupts segment 
data
URL: https://github.com/apache/nutch/pull/293
 
 
   This is mostly a cleanup of the classes concerned with 
https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2517, namely 
SegmentMerger and LinkDb





[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391430#comment-16391430
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Hi [~mebbinghaus] I ran it from the Docker container and can reproduce some of 
your results; there is one nuance, however, which I'll explain below.
When I run mergesegs and inspect the data structures created within 
mycrawl/MERGEDsegments/segment/... I see BOTH crawl_generate and crawl_parse. 
So there must be something wrong with your crawl cycle for you to have 
generated only one directory. I'll leave that to you to confirm.

The other issue, however, is that when I attempt to invertlinks using one of the 
merged segs, I end up with the same stack trace as you, so I am looking into the 
code right now.


[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-06 Thread Marco Ebbinghaus (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389143#comment-16389143
 ] 

Marco Ebbinghaus commented on NUTCH-2517:
-

I can also reproduce this when NOT running from a Docker container. I checked 
out master on my desktop 30 minutes ago and followed exactly the same workflow 
as described above, and the result is the same: after mergesegs I only have a 
single crawl_generate folder in the merged segment. For the log output please 
see the attached screenshot.

In the meantime I am using the apache/nutch container with tag release-1.14, 
which is working as intended.


[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388718#comment-16388718
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Should be noted that I didn't run this from the Docker container.



[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-06 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388650#comment-16388650
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

I cannot reproduce this... see below for tests

{code}

//inject

/usr/local/nutch(master) $ ./runtime/local/bin/nutch inject mycrawl/crawldb 
urls/seed.txt
Injector: starting at 2018-03-06 14:31:10
Injector: crawlDb: mycrawl/crawldb
Injector: urlDir: urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2018-03-06 14:31:12, elapsed: 00:00:01
{code}

{code}

//simple 'ls' to see what we have

/usr/local/nutch(master) $ ls mycrawl/crawldb/
current/ old/
{code}

{code}
// generate

/usr/local/nutch(master) $ ./runtime/local/bin/nutch generate mycrawl/crawldb 
mycrawl/segments 1
Generator: starting at 2018-03-06 14:31:37
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: mycrawl/segments/20180306143139
Generator: finished at 2018-03-06 14:31:40, elapsed: 00:00:03
{code}

{code}
//fetch

/usr/local/nutch(master) $ ./runtime/local/bin/nutch fetch 
mycrawl/segments/20180306143139 -threads 2
Fetcher: starting at 2018-03-06 14:32:15
Fetcher: segment: mycrawl/segments/20180306143139
Fetcher: threads: 2
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records hit by time limit :0
FetcherThread 36 Using queue mode : byHost
FetcherThread 36 Using queue mode : byHost
FetcherThread 40 fetching http://nutch.apache.org:-1/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
FetcherThread 41 has no more work available
FetcherThread 41 -finishing thread FetcherThread, activeThreads=1
robots.txt whitelist not configured.
FetcherThread 40 has no more work available
FetcherThread 40 -finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2018-03-06 14:32:18, elapsed: 00:00:02
{code}

{code}
//parse

/usr/local/nutch(master) $ ./runtime/local/bin/nutch parse 
mycrawl/segments/20180306143139 -threads 2
ParseSegment: starting at 2018-03-06 14:32:45
ParseSegment: segment: mycrawl/segments/20180306143139
Parsed (140ms):http://nutch.apache.org:-1/
ParseSegment: finished at 2018-03-06 14:32:46, elapsed: 00:00:01
{code}

{code}
// lets see what we have

/usr/local/nutch(master) $ ls mycrawl/
crawldb/  segments/
/usr/local/nutch(master) $ ls mycrawl/segments/20180306143139/
content/  crawl_fetch/  crawl_generate/  crawl_parse/  parse_data/  parse_text/
{code}

{code}
//updatedb

/usr/local/nutch(master) $ ./runtime/local/bin/nutch updatedb mycrawl/crawldb 
mycrawl/segments/20180306143139/
CrawlDb update: starting at 2018-03-06 14:33:40
CrawlDb update: db: mycrawl/crawldb
CrawlDb update: segments: [mycrawl/segments/20180306143139]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2018-03-06 14:33:41, elapsed: 00:00:01
{code}

{code}
//lets see what we have

/usr/local/nutch(master) $ ls mycrawl/
crawldb/  segments/

{code}

{code}
//mergesegs with -dir option

/usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs 
mycrawl/MERGEDsegments -dir mycrawl/segments/ -filter
Merging 1 segments to mycrawl/MERGEDsegments/20180306143518
SegmentMerger:   adding file:/usr/local/nutch/mycrawl/segments/20180306143139
SegmentMerger: using segment data from: content crawl_generate crawl_fetch 
crawl_parse parse_data parse_text
{code}

{code}
// lets see what we have

/usr/local/nutch(master) $ ls mycrawl/
MERGEDsegments/  crawldb/  segments/

/usr/local/nutch(master) $ ls mycrawl/MERGEDsegments/20180306143518/crawl_
crawl_generate/ crawl_parse/
{code}
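
The ls above shows only crawl_generate/ and crawl_parse/ in the merged segment. 
A small standalone check like the one below (a hypothetical helper of my own, 
using only the stock Hadoop FileSystem API; SegmentCheck is not part of Nutch) 
makes any missing parts visible before invertlinks fails with "Input path does 
not exist":

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper (not part of Nutch): verify a segment contains the
// parts that LinkDb and the indexer expect to read.
public class SegmentCheck {

  private static final String[] PARTS = { "content", "crawl_generate",
      "crawl_fetch", "crawl_parse", "parse_data", "parse_text" };

  public static void main(String[] args) throws IOException {
    Path segment = new Path(args[0]);
    FileSystem fs = segment.getFileSystem(new Configuration());
    for (String part : PARTS) {
      Path dir = new Path(segment, part);
      System.out.println(dir + " -> "
          + (fs.exists(dir) ? "present" : "MISSING"));
    }
  }
}
{code}

Run it with the segment directory as the only argument, e.g. 
mycrawl/MERGEDsegments/20180306143518.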

{code}
//mergesegs with single segment directory without dir option

/usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs 
mycrawl/MERGEDsegments2 mycrawl/segments/20180306143139/ -filter
Merging 1 segments to mycrawl/MERGEDsegments2/20180306143617
SegmentMerger:   adding mycrawl/segments/20180306143139
SegmentMerger: using segment data from: content crawl_generate crawl_fetch 
crawl_parse parse_data parse_text
{code}

{code}
// mergesegs with array of segment directories

lmcgibbn@LMC-056430 /usr/local/nutch(master) $ ./runtime/local/bin/nutch 
mergesegs mycrawl/MERGEDsegments3 mycr

[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386469#comment-16386469
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Thank you [~mebbinghaus] for reporting. This appears to be a major bug and 
hence a blocker for the next release. I will begin work on a solution ASAP.
FYI [~omkar20895], this is post the Hadoop upgrade.
