[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508748 ]

Hudson commented on NUTCH-498:
------------------------------

Integrated in Nutch-Nightly #131 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])

Use Combiner in LinkDb to increase speed of linkdb generation
-------------------------------------------------------------

                Key: NUTCH-498
                URL: https://issues.apache.org/jira/browse/NUTCH-498
            Project: Nutch
         Issue Type: Improvement
         Components: linkdb
   Affects Versions: 0.9.0
           Reporter: Espen Amble Kolstad
           Assignee: Doğacan Güney
           Priority: Minor
            Fix For: 1.0.0
        Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch

I tried to add the following combiner to LinkDb:

  public static enum Counters { COMBINED }

  public static class LinkDbCombiner extends MapReduceBase implements Reducer {
    private int _maxInlinks;

    @Override
    public void configure(JobConf job) {
      super.configure(job);
      _maxInlinks = job.getInt("db.max.inlinks", 1);
    }

    public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter reporter) throws IOException {
      final Inlinks inlinks = (Inlinks) values.next();
      int combined = 0;
      while (values.hasNext()) {
        Inlinks val = (Inlinks) values.next();
        for (Iterator it = val.iterator(); it.hasNext();) {
          if (inlinks.size() >= _maxInlinks) {
            if (combined > 0) {
              reporter.incrCounter(Counters.COMBINED, combined);
            }
            output.collect(key, inlinks);
            return;
          }
          Inlink in = (Inlink) it.next();
          inlinks.add(in);
        }
        combined++;
      }
      if (inlinks.size() == 0) {
        return;
      }
      if (combined > 0) {
        reporter.incrCounter(Counters.COMBINED, combined);
      }
      output.collect(key, inlinks);
    }
  }

This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.

  Map output records:       8,717,810,541
  Combined:                 7,632,541,507
  Resulting output records: 1,085,269,034

That's an 87% reduction of output records from the map phase.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
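The merge step the combiner performs can be simulated outside Hadoop. Below is a minimal sketch, not Nutch code: Inlinks is modeled as a plain List of source URLs, and the class and method names are illustrative. It shows the essential behavior of the reduce above: concatenate the partial inlink lists for one key until the db.max.inlinks cap is reached, then emit early.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the combiner's merge: fold partial inlink lists for
// a single key into one list, capped at maxInlinks.
public class CombinerSketch {
  static List<String> combine(List<List<String>> partials, int maxInlinks) {
    List<String> merged = new ArrayList<>();
    for (List<String> partial : partials) {
      for (String inlink : partial) {
        if (merged.size() >= maxInlinks) {
          return merged;           // cap reached: emit what we have so far
        }
        merged.add(inlink);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<List<String>> partials = Arrays.asList(
        Arrays.asList("http://a/", "http://b/"),
        Arrays.asList("http://c/"),
        Arrays.asList("http://d/", "http://e/"));
    // Only the first three inlinks survive the cap.
    System.out.println(combine(partials, 3)); // prints [http://a/, http://b/, http://c/]
  }
}
```

Because each combiner invocation emits one record per key instead of one per map output, the number of records shuffled to the reducers shrinks roughly by the key-duplication factor, which is where the reported speed-up comes from.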
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508505 ]

Doğacan Güney commented on NUTCH-498:
-------------------------------------

I tested creating a linkdb from ~6M urls:

  Combine input records:  42,091,902
  Combine output records: 15,684,838

(The combiner reduces the number of records to around 1/3.) The job took ~15 minutes overall with the combiner, ~20 minutes without it. So, +1 from me.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508506 ]

Andrzej Bialecki commented on NUTCH-498:
----------------------------------------

+1.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508508 ]

Sami Siren commented on NUTCH-498:
----------------------------------

+1
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505454 ]

Doğacan Güney commented on NUTCH-498:
-------------------------------------

> Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.

Sounds good. I opened NUTCH-499 for this.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505197 ]

Doğacan Güney commented on NUTCH-498:
-------------------------------------

Why can't we just set the combiner class as LinkDb? AFAICS, you are not doing anything different from LinkDb.reduce in LinkDbCombiner.reduce. A one-liner

  job.setCombinerClass(LinkDb.class);

should do the trick, shouldn't it?
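Reusing the reduce function as the combiner is only safe if running the merge in stages (combine some groups map-side, then reduce the intermediate results) gives the same capped output as one pass over everything. The sketch below checks that property on the same simplified model (inlink lists as plain Lists of URLs; names are illustrative, not Nutch API), since capping a concatenation of prefixes still yields a prefix of the full concatenation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Check that staged (combiner + reducer) merging equals single-pass merging
// under the maxInlinks cap.
public class CombinerEquivalence {
  static List<String> merge(List<List<String>> partials, int max) {
    List<String> out = new ArrayList<>();
    for (List<String> p : partials) {
      for (String s : p) {
        if (out.size() >= max) return out;  // cap reached
        out.add(s);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    int max = 3;
    List<String> a = Arrays.asList("u1", "u2");
    List<String> b = Arrays.asList("u3", "u4");
    List<String> c = Arrays.asList("u5");

    // One pass over all partials (reducer only, no combiner).
    List<String> direct = merge(Arrays.asList(a, b, c), max);

    // Combine (a, b) map-side first, then reduce the result with c.
    List<String> combinedAB = merge(Arrays.asList(a, b), max);
    List<String> staged = merge(Arrays.asList(combinedAB, c), max);

    System.out.println(direct.equals(staged)); // prints true
  }
}
```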
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505242 ]

Espen Amble Kolstad commented on NUTCH-498:
-------------------------------------------

Yes, you're right. I forgot that I added a new class just to get the Counter ...
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505249 ]

Doğacan Güney commented on NUTCH-498:
-------------------------------------

After examining the code more closely, I am a bit confused. We have a LinkDb.Merger.reduce and a LinkDb.reduce. They both do the same thing (aggregate inlinks until the size reaches maxInlinks, then collect). Why do we have them separately? Is there a difference between them that I am missing?
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505302 ]

Andrzej Bialecki commented on NUTCH-498:
----------------------------------------

Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.