[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453904#comment-16453904 ] Hudson commented on NUTCH-2517: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3522 (See [https://builds.apache.org/job/Nutch-trunk/3522/]) NUTCH-2517 mergesegs corrupts segment data - fix name of output (snagel: [https://github.com/apache/nutch/commit/2f50e801005493d0217160b7239eb2db82ca89f4]) * (edit) src/java/org/apache/nutch/segment/SegmentMerger.java * (edit) src/java/org/apache/nutch/segment/SegmentReader.java * (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java * (edit) src/java/org/apache/nutch/indexer/IndexerOutputFormat.java > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related on trying to merge a > single seg
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453849#comment-16453849 ] ASF GitHub Bot commented on NUTCH-2517: --- sebastian-nagel closed pull request #321: NUTCH-2517 mergesegs corrupts segment data URL: https://github.com/apache/nutch/pull/321 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java b/src/java/org/apache/nutch/crawl/CrawlDbReader.java index 87bf58525..424db3d5a 100644 --- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java +++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java @@ -99,7 +99,6 @@ private void openReaders(String crawlDb, Configuration config) if (readers != null) return; Path crawlDbPath = new Path(crawlDb, CrawlDb.CURRENT_NAME); -FileSystem fs = crawlDbPath.getFileSystem(config); readers = MapFileOutputFormat.getReaders(crawlDbPath, config); } @@ -180,7 +179,7 @@ public synchronized void close(TaskAttemptContext context) throws IOException { public RecordWriter getRecordWriter(TaskAttemptContext context) throws IOException { - String name = context.getTaskAttemptID().toString(); + String name = getUniqueFile(context, "part", ""); Path dir = FileOutputFormat.getOutputPath(context); FileSystem fs = dir.getFileSystem(context.getConfiguration()); DataOutputStream fileOut = fs.create(new Path(dir, name), context); diff --git a/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java b/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java index 54b98dfce..359f9d1fc 100644 --- a/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java +++ b/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java @@ -34,7 +34,7 @@ Configuration conf = context.getConfiguration(); final IndexWriters writers = new IndexWriters(conf); -String name = context.getTaskAttemptID().toString(); +String name = getUniqueFile(context, "part", ""); writers.open(conf, name); return new RecordWriter() { diff --git a/src/java/org/apache/nutch/segment/SegmentMerger.java b/src/java/org/apache/nutch/segment/SegmentMerger.java index b1f1d8948..f4adf52b4 100644 --- a/src/java/org/apache/nutch/segment/SegmentMerger.java +++ b/src/java/org/apache/nutch/segment/SegmentMerger.java @@ -139,7 +139,6 @@ throws IOException { context.setStatus(split.toString()); - Configuration conf = context.getConfiguration(); // find part name SegmentPart segmentPart; @@ -213,7 +212,7 @@ public synchronized void close() throws IOException { public RecordWriter getRecordWriter(TaskAttemptContext context) throws IOException { Configuration conf = context.getConfiguration(); - String name = context.getTaskAttemptID().toString(); + String name = getUniqueFile(context, "part", ""); Path dir = FileOutputFormat.getOutputPath(context); FileSystem fs = dir.getFileSystem(context.getConfiguration()); diff --git a/src/java/org/apache/nutch/segment/SegmentReader.java b/src/java/org/apache/nutch/segment/SegmentReader.java index 0b65a2b81..7193c58f7 100644 --- a/src/java/org/apache/nutch/segment/SegmentReader.java +++ b/src/java/org/apache/nutch/segment/SegmentReader.java @@ -106,8 +106,7 @@ public void map(WritableComparable key, Writable value, FileOutputFormat, Writable> { public RecordWriter, Writable> getRecordWriter( TaskAttemptContext context) throws IOException, InterruptedException { - Configuration conf = context.getConfiguration(); - String name = context.getTaskAttemptID().toString(); + String name = getUniqueFile(context, "part", ""); Path dir = FileOutputFormat.getOutputPath(context); FileSystem fs = dir.getFileSystem(context.getConfiguration()); This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker >
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447333#comment-16447333 ] ASF GitHub Bot commented on NUTCH-2517: --- sebastian-nagel opened a new pull request #321: NUTCH-2517 mergesegs corrupts segment data URL: https://github.com/apache/nutch/pull/321 - fix name of output directories for SegmentMerger and other tools This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related on trying t
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398901#comment-16398901 ] Hudson commented on NUTCH-2517: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3508 (See [https://builds.apache.org/job/Nutch-trunk/3508/]) NUTCH-2517 mergesegs corrupts segment data (lewis.mcgibbney: [https://github.com/apache/nutch/commit/dc516b700f9e7b735db5af2b5fb439681f3a0e87]) * (edit) src/java/org/apache/nutch/segment/SegmentMerger.java * (edit) src/java/org/apache/nutch/crawl/LinkDb.java > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related on trying to merge a > single segment like described above. As you can see on the attached > screenshot that problem also appears when executing multiple bin/nutch > generate/fetch
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398765#comment-16398765 ] ASF GitHub Bot commented on NUTCH-2517: --- lewismc closed pull request #293: NUTCH-2517 mergesegs corrupts segment data URL: https://github.com/apache/nutch/pull/293 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/src/java/org/apache/nutch/crawl/LinkDb.java b/src/java/org/apache/nutch/crawl/LinkDb.java index b7a89b229..c6a32ba86 100644 --- a/src/java/org/apache/nutch/crawl/LinkDb.java +++ b/src/java/org/apache/nutch/crawl/LinkDb.java @@ -17,31 +17,39 @@ package org.apache.nutch.crawl; -import java.io.*; +import java.io.File; +import java.io.IOException; import java.lang.invoke.MethodHandles; +import java.net.MalformedURLException; +import java.net.URL; import java.text.SimpleDateFormat; -import java.util.*; -import java.net.*; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import java.util.Random; -// Commons Logging imports import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import org.apache.hadoop.io.*; -import org.apache.hadoop.fs.*; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.conf.*; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat; -import org.apache.hadoop.mapreduce.Mapper.Context; -import org.apache.hadoop.util.*; +import org.apache.hadoop.util.StringUtils; +import org.apache.hadoop.util.Tool; +import org.apache.hadoop.util.ToolRunner; import org.apache.nutch.metadata.Nutch; import org.apache.nutch.net.URLFilters; import org.apache.nutch.net.URLNormalizers; -import org.apache.nutch.parse.*; +import org.apache.nutch.parse.Outlink; +import org.apache.nutch.parse.ParseData; import org.apache.nutch.util.HadoopFSUtil; import org.apache.nutch.util.LockUtil; import org.apache.nutch.util.NutchConfiguration; @@ -53,7 +61,7 @@ public class LinkDb extends NutchTool implements Tool { private static final Logger LOG = LoggerFactory - .getLogger(MethodHandles.lookup().lookupClass()); + .getLogger(MethodHandles.lookup().lookupClass()); public static final String IGNORE_INTERNAL_LINKS = "linkdb.ignore.internal.links"; public static final String IGNORE_EXTERNAL_LINKS = "linkdb.ignore.external.links"; @@ -62,6 +70,7 @@ public static final String LOCK_NAME = ".locked"; public LinkDb() { +//default constructor } public LinkDb(Configuration conf) { @@ -69,7 +78,7 @@ public LinkDb(Configuration conf) { } public static class LinkDbMapper extends - Mapper { + Mapper { private int maxAnchorLength; private boolean ignoreInternalLinks; private boolean ignoreExternalLinks; @@ -94,17 +103,16 @@ public void cleanup(){ } public void map(Text key, ParseData parseData, -Context context) -throws IOException, InterruptedException { +Context context) +throws IOException, InterruptedException { String fromUrl = key.toString(); String fromHost = getHost(fromUrl); if (urlNormalizers != null) { try { fromUrl = urlNormalizers - .normalize(fromUrl, URLNormalizers.SCOPE_LINKDB); // normalize the -// url + .normalize(fromUrl, URLNormalizers.SCOPE_LINKDB); // normalize the url } catch (Exception e) { - LOG.warn("Skipping " + fromUrl + ":" + e); + LOG.warn("Skipping {} :", fromUrl, e); fromUrl = null; } } @@ -112,7 +120,7 @@ public void map(Text key, ParseData parseData, try { fromUrl = urlFilters.filter(fromUrl); // filter the url } catch (Exception e) { - LOG.warn("Skipping " + fromUrl + ":" + e); + LOG.warn("Skipping {} :", fromUrl, e); fromUrl = null; } } @@ -131,17 +139,16 @@ public void map(Text key, ParseData parseData, } } else if (ignoreExternalLinks) { String toHost = getHost(toUrl); - if (toHost == null || !toHost.equals(fromHost)) { // external link -continue;
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398026#comment-16398026 ] ASF GitHub Bot commented on NUTCH-2517: --- lewismc commented on issue #293: NUTCH-2517 mergesegs corrupts segment data URL: https://github.com/apache/nutch/pull/293#issuecomment-372889814 Ni semantic changes here folks, will commit in 24hrs unless objections. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398025#comment-16398025 ] Lewis John McGibbney commented on NUTCH-2517: - Correct [~wastl-nagel] > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related on trying to merge a > single segment like described above. As you can see on the attached > screenshot that problem also appears when executing multiple bin/nutch > generate/fetch/parse/updatedb commands before executing mergesegs - resulting > in a segment count > 1. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395022#comment-16395022 ] Sebastian Nagel commented on NUTCH-2517: Hi [~mebbinghaus], first I'm also not able to reproduce the problem - I've successfully merged 9 segments from a test crawl: - (as a precondition) all 9 segments contain all 6 subfolders (content, crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text) It's clear: if one of the segments lacks one of these 6 subdirectories, the SegmentMerger will fall-back to a reduced set of subdirs, see Lewis' log output "SegmentMerger: using segment data from: crawl_generate crawl_parse". - the index contains exactly the same documents when indexing (a) the 9 segments and (b) the single merged segment - tested both, on my local machine and in a Docker container This problem could be a side-effect of NUTCH-2518: it may happen that some job failure (including the SegmentMerger job) is not detected. Also dubious: the directory tree of a merged segment contains instead of "part-0" (or equiv.) "attempt_local" dirs/files: {noformat} merged └── 20180312094858 ├── content │ └── attempt_local198660722_0001_r_00_0 │ ├── data │ └── index ├── crawl_fetch │ └── attempt_local198660722_0001_r_00_0 │ ├── data │ └── index ├── crawl_generate │ └── attempt_local198660722_0001_r_00_0 ├── crawl_parse │ └── attempt_local198660722_0001_r_00_0 ├── parse_data │ └── attempt_local198660722_0001_r_00_0 │ ├── data │ └── index └── parse_text └── attempt_local198660722_0001_r_00_0 ├── data └── index {noformat} The directory names do not really matter, however, we should have a look at this. [~lewismc]: your PR does not include any "semantic" changes (except formatting, etc.), right? > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.inp
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391830#comment-16391830 ] Marco Ebbinghaus commented on NUTCH-2517: - I double checked the behaviors of version 1.14 and the current master (via Docker containers). *If you do a single crawling cycle* _1.14_: bin/nutch inject->generate->fetch->parse->updatedb->mergesegs is creating a merged segment (out of one segment) containing 2 folders: crawl_generate and crawl_parse _master_: bin/nutch inject->generate->fetch->parse->updatedb->mergesegs is creating a merged segment (out of one segment) containing 1 folder: crawl_generate _(But it might be that doing a mergesegs after one single crawling cycle is a missuse of the software anyway (idk), so let's have a look on doing multiple crawling cycles, which works better.)_ *If you do two crawling cycles:* _1.14_: bin/nutch inject->+{color:#33}generate->fetch->parse->updatedb{color}+->+generate->fetch->parse->updatedb+->mergesegs is creating a merged segment (out of two segments) containing 6 folders: crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text _master_: bin/nutch inject->+generate->fetch->parse->updatedb+->+generate->fetch->parse->updatedb+->mergesegs is creating a merged segment (out of two segments) containing 2 folders: crawl_generate and crawl_parse I am not sure how invertlinks works, but I can imagine it depends of ALL segment sub folders. The SegmentMerger says: {quote}SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text {quote} And I think all of these folders are still needed (by eg invertlinks), even if multiple segments are merged. So it works for 1.14, and no longer for master. > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.a
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391440#comment-16391440 ] Lewis John McGibbney commented on NUTCH-2517: - Can anyone else confirm the above ? > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related on trying to merge a > single segment like described above. As you can see on the attached > screenshot that problem also appears when executing multiple bin/nutch > generate/fetch/parse/updatedb commands before executing mergesegs - resulting > in a segment count > 1. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391439#comment-16391439 ] ASF GitHub Bot commented on NUTCH-2517: --- lewismc opened a new pull request #293: NUTCH-2517 mergesegs corrupts segment data URL: https://github.com/apache/nutch/pull/293 This is mostly a cleanup of the Classes concerned with https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2517, namely SegmentMerger and LinkDb This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs comm
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391430#comment-16391430 ] Lewis John McGibbney commented on NUTCH-2517: - Hi [~mebbinghaus] I ran it from the Docker container and can reproduce some of your results, there is one nuance however. I'll explain below When I run mergesegs and inspect the data structures created within the mycrawl/MERGEDsegments/segment/... I see BOTH crawl_generate and crawl_parse. So there must be something wrong with your crawl cycle for you only to have generated on directory. I'll leave that down to you to confirm. The other issue however is that when I attempt to invertlinks using one of the merged segs, I end up with the same stack track as you so i am looking into the code right now. > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png, > Screenshot_2018-03-07_07-50-05.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder dur
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389143#comment-16389143 ] Marco Ebbinghaus commented on NUTCH-2517: - I can also reproduce this when NOT running this from a Docker container. I checked out the master on my desktop 30 minutes ago and did the exactly same workflow as described above and the result is the same: after mergesegs I only have one folder crawl_generate in the merged segment. For the log output please see the attached screenshot. In the meantime I am using the apache/nutch container with tag release-1.14, which is working as intended. > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Assignee: Lewis John McGibbney >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related on trying to merge a > single segment like described above. As you can see on the attached > screenshot that problem als
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388718#comment-16388718 ] Lewis John McGibbney commented on NUTCH-2517: - Should be noted that I didn't run this from the Docker container. > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related on trying to merge a > single segment like described above. As you can see on the attached > screenshot that problem also appears when executing multiple bin/nutch > generate/fetch/parse/updatedb commands before executing mergesegs - resulting > in a segment count > 1. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388650#comment-16388650 ] Lewis John McGibbney commented on NUTCH-2517: - I cannot reproduce this... see below for tests {code} //inject /usr/local/nutch(master) $ ./runtime/local/bin/nutch inject mycrawl/crawldb urls/seed.txt Injector: starting at 2018-03-06 14:31:10 Injector: crawlDb: mycrawl/crawldb Injector: urlDir: urls/seed.txt Injector: Converting injected urls to crawl db entries. Injector: overwrite: false Injector: update: false Injector: Total urls rejected by filters: 0 Injector: Total urls injected after normalization and filtering: 1 Injector: Total urls injected but already in CrawlDb: 0 Injector: Total new urls injected: 1 Injector: finished at 2018-03-06 14:31:12, elapsed: 00:00:01 {code} {code} //simple 'ls' to see what we have /usr/local/nutch(master) $ ls mycrawl/crawldb/ current/ old/ {code} {code} // generate /usr/local/nutch(master) $ ./runtime/local/bin/nutch generate mycrawl/crawldb mycrawl/segments 1 Generator: starting at 2018-03-06 14:31:37 Generator: Selecting best-scoring urls due for fetch. Generator: filtering: true Generator: normalizing: true Generator: running in local mode, generating exactly one partition. Generator: Partitioning selected urls for politeness. Generator: segment: mycrawl/segments/20180306143139 Generator: finished at 2018-03-06 14:31:40, elapsed: 00:00:03 {code} {code} //fetch /usr/local/nutch(master) $ ./runtime/local/bin/nutch fetch mycrawl/segments/20180306143139 -threads 2 Fetcher: starting at 2018-03-06 14:32:15 Fetcher: segment: mycrawl/segments/20180306143139 Fetcher: threads: 2 Fetcher: time-out divisor: 2 QueueFeeder finished: total 1 records hit by time limit :0 FetcherThread 36 Using queue mode : byHost FetcherThread 36 Using queue mode : byHost FetcherThread 40 fetching http://nutch.apache.org:-1/ (queue crawl delay=5000ms) Fetcher: throughput threshold: -1 Fetcher: throughput threshold retries: 5 FetcherThread 41 has no more work available FetcherThread 41 -finishing thread FetcherThread, activeThreads=1 robots.txt whitelist not configured. FetcherThread 40 has no more work available FetcherThread 40 -finishing thread FetcherThread, activeThreads=0 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0 -activeThreads=0 Fetcher: finished at 2018-03-06 14:32:18, elapsed: 00:00:02 {code} {code} //parse /usr/local/nutch(master) $ ./runtime/local/bin/nutch parse mycrawl/segments/20180306143139 -threads 2 ParseSegment: starting at 2018-03-06 14:32:45 ParseSegment: segment: mycrawl/segments/20180306143139 Parsed (140ms):http://nutch.apache.org:-1/ ParseSegment: finished at 2018-03-06 14:32:46, elapsed: 00:00:01 {code} {code} // lets see what we have /usr/local/nutch(master) $ ls mycrawl/ crawldb/ segments/ /usr/local/nutch(master) $ ls mycrawl/segments/20180306143139/ content/crawl_fetch/crawl_generate/ crawl_parse/parse_data/ parse_text/ {code} {code} //updatedb /usr/local/nutch(master) $ ./runtime/local/bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180306143139/ CrawlDb update: starting at 2018-03-06 14:33:40 CrawlDb update: db: mycrawl/crawldb CrawlDb update: segments: [mycrawl/segments/20180306143139] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: false CrawlDb update: URL filtering: false CrawlDb update: 404 purging: false CrawlDb update: Merging segment data into db. CrawlDb update: finished at 2018-03-06 14:33:41, elapsed: 00:00:01 {code} {code} //lets see what we have /usr/local/nutch(master) $ ls mycrawl/ crawldb/ segments/ {code} //mergesegs with -dir option /usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs mycrawl/MERGEDsegments -dir mycrawl/segments/ -filter Merging 1 segments to mycrawl/MERGEDsegments/20180306143518 SegmentMerger: adding file:/usr/local/nutch/mycrawl/segments/20180306143139 SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text {code} {code} // lets see what we have /usr/local/nutch(master) $ ls mycrawl/ MERGEDsegments/ crawldb/segments/ /usr/local/nutch(master) $ ls mycrawl/MERGEDsegments/20180306143518/crawl_ crawl_generate/ crawl_parse/ {code} {code} //mergesegs with single segment directory without dir option /usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs mycrawl/MERGEDsegments2 mycrawl/segments/20180306143139/ -filter Merging 1 segments to mycrawl/MERGEDsegments2/20180306143617 SegmentMerger: adding mycrawl/segments/20180306143139 SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text {code} {code} // mergesegs with array of segment directories lmcgibbn@LMC-056430 /usr/local/nutch(master) $ ./runtime/local/bin/nutch mergesegs mycrawl/MERGEDsegments3 mycr
[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386469#comment-16386469 ] Lewis John McGibbney commented on NUTCH-2517: - Thank you [~mebbinghaus] for reporting. This appears to be a major bug and hence a blocker for the next release. I will begin work on a solution ASAP. FYI [~omkar20895] this is post Hadoop upgrade. > mergesegs corrupts segment data > --- > > Key: NUTCH-2517 > URL: https://issues.apache.org/jira/browse/NUTCH-2517 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.15 > Environment: xubuntu 17.10, docker container of apache/nutch LATEST >Reporter: Marco Ebbinghaus >Priority: Blocker > Labels: mapreduce, mergesegs > Fix For: 1.15 > > Attachments: Screenshot_2018-03-03_18-09-28.png > > > The problem probably occurs since commit > [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4] > How to reproduce: > * create container from apache/nutch image (latest) > * open terminal in that container > * set http.agent.name > * create crawldir and urls file > * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls) > * run bin/nutch generate (bin/nutch generate mycrawl/crawldb > mycrawl/segments 1) > ** this results in a segment (e.g. 20180304134215) > * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 > -threads 2) > * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 > -threads 2) > ** ls in the segment folder -> existing folders: content, crawl_fetch, > crawl_generate, crawl_parse, parse_data, parse_text > * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb > mycrawl/segments/20180304134215) > * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments > mycrawl/segments/* -filter) > ** console output: `SegmentMerger: using segment data from: content > crawl_generate crawl_fetch crawl_parse parse_data parse_text` > ** resulting segment: 20180304134535 > * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing > folder: crawl_generate > * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir > mycrawl/MERGEDsegments) which results in a consequential error > ** console output: `LinkDb: adding segment: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535] > LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input > path does not exist: > [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data] > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) > at > org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301) > at > org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318) > at > org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) > at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) > at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) > at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)` > So as it seems mapreduce corrupts the segment folder during mergesegs command. > > Pay attention to the fact that this issue is not related on trying to merge a > single segment like described above. As you can see on the attached > screenshot that problem also appears when executing multiple bin/nutch > generate/fetch/parse/updatedb commands before executing mergesegs - resulting > in a segment count > 1. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)