[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508748
 ] 

Hudson commented on NUTCH-498:
--

Integrated in Nutch-Nightly #131 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/])

 Use Combiner in LinkDb to increase speed of linkdb generation
 --------------------------------------------------------------

                 Key: NUTCH-498
                 URL: https://issues.apache.org/jira/browse/NUTCH-498
             Project: Nutch
          Issue Type: Improvement
          Components: linkdb
    Affects Versions: 0.9.0
            Reporter: Espen Amble Kolstad
            Assignee: Doğacan Güney
            Priority: Minor
             Fix For: 1.0.0
         Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch


 I tried to add the following combiner to LinkDb:
 public static enum Counters { COMBINED }

 public static class LinkDbCombiner extends MapReduceBase implements Reducer {
   private int _maxInlinks;

   @Override
   public void configure(JobConf job) {
     super.configure(job);
     _maxInlinks = job.getInt("db.max.inlinks", 1);
   }

   public void reduce(WritableComparable key, Iterator values,
       OutputCollector output, Reporter reporter) throws IOException {
     // Merge every Inlinks value for this key into the first one.
     final Inlinks inlinks = (Inlinks) values.next();
     int combined = 0;
     while (values.hasNext()) {
       Inlinks val = (Inlinks) values.next();
       for (Iterator it = val.iterator(); it.hasNext();) {
         // Stop merging once the configured inlink cap is reached.
         if (inlinks.size() >= _maxInlinks) {
           if (combined > 0) {
             reporter.incrCounter(Counters.COMBINED, combined);
           }
           output.collect(key, inlinks);
           return;
         }
         Inlink in = (Inlink) it.next();
         inlinks.add(in);
       }
       combined++;
     }
     if (inlinks.size() == 0) {
       return;
     }
     if (combined > 0) {
       reporter.incrCounter(Counters.COMBINED, combined);
     }
     output.collect(key, inlinks);
   }
 }
 This greatly reduced the time it took to generate a new linkdb; in my case
 it cut the time in half.

 Map output records          8,717,810,541
 Combined                    7,632,541,507
 Resulting output records    1,085,269,034

 That's an 87% reduction in output records from the map phase.
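
 For context, wiring such a combiner into the job is a single call during
 setup. A minimal sketch of a driver fragment, assuming LinkDbCombiner is
 nested in LinkDb as above; the helper class and the wiring shown are
 illustrative, not the actual LinkDb driver code:

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.JobConf;
   import org.apache.nutch.crawl.Inlinks;
   import org.apache.nutch.crawl.LinkDb;

   public class LinkDbJobSketch {
     // Hypothetical helper: everything except setCombinerClass is ordinary
     // linkdb job setup; the combiner line is the change this issue proposes.
     static JobConf createJob(Configuration conf) {
       JobConf job = new JobConf(conf, LinkDb.class);
       job.setJobName("linkdb with combiner");
       job.setMapperClass(LinkDb.class);                  // map: invert outlinks
       job.setCombinerClass(LinkDb.LinkDbCombiner.class); // merge Inlinks map-side
       job.setReducerClass(LinkDb.class);                 // final merge per URL
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(Inlinks.class);
       return job;
     }
   }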

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-27 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508505
 ] 

Doğacan Güney commented on NUTCH-498:
-

I tested creating a linkdb from ~6M urls:

Combine input records   42,091,902
Combine output records  15,684,838

(The combiner reduces the number of records to around a third.)

The job took ~15 minutes overall with the combiner, ~20 minutes without.

So, +1 from me.







[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508506
 ] 

Andrzej Bialecki commented on NUTCH-498:
-

+1.




[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-27 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508508
 ] 

Sami Siren commented on NUTCH-498:
--

+1




[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-16 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505454
 ] 

Doğacan Güney commented on NUTCH-498:
-

> Currently there is no difference, indeed. The version in LinkDb.reduce is
> safer, because it uses a separate instance of Inlinks. Perhaps we could
> replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely
> remove LinkDb.reduce.

Sounds good. I opened NUTCH-499 for this.




[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-15 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505197
 ] 

Doğacan Güney commented on NUTCH-498:
-

Why can't we just set the combiner class to LinkDb? AFAICS, you are not doing
anything different from LinkDb.reduce in LinkDbCombiner.reduce. A one-liner

job.setCombinerClass(LinkDb.class);

should do the trick, shouldn't it?
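
As a side note, reusing the reducer as a combiner is only sound because
merging Inlinks is associative: combining some of the values early and
merging the partial results later yields the same set as merging everything
in one pass. A small hypothetical check of that property (the class name and
URLs are made up for illustration):

  import java.util.Arrays;
  import java.util.Iterator;

  import org.apache.nutch.crawl.Inlink;
  import org.apache.nutch.crawl.Inlinks;

  public class CombinerAssociativityCheck {
    // Merge a stream of Inlinks into one, the way LinkDb.reduce does
    // (ignoring the db.max.inlinks cap for brevity).
    static Inlinks merge(Iterator values) {
      Inlinks result = new Inlinks();
      while (values.hasNext()) {
        Inlinks part = (Inlinks) values.next();
        for (Iterator it = part.iterator(); it.hasNext();) {
          result.add((Inlink) it.next());
        }
      }
      return result;
    }

    public static void main(String[] args) {
      Inlinks a = new Inlinks(), b = new Inlinks(), c = new Inlinks();
      a.add(new Inlink("http://a.example/", "anchor a"));
      b.add(new Inlink("http://b.example/", "anchor b"));
      c.add(new Inlink("http://c.example/", "anchor c"));
      // One-pass merge (reducer only) vs. two-stage merge (combiner, then reducer).
      Inlinks onePass = merge(Arrays.asList(a, b, c).iterator());
      Inlinks twoStage = merge(Arrays.asList(
          merge(Arrays.asList(a, b).iterator()), c).iterator());
      System.out.println(onePass.size() == twoStage.size()); // expect: true
    }
  }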




[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-15 Thread Espen Amble Kolstad (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505242
 ] 

Espen Amble Kolstad commented on NUTCH-498:
---

Yes, you're right.

I forgot I added a new class just to get the Counter ...




[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-15 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505249
 ] 

Doğacan Güney commented on NUTCH-498:
-

After examining the code more closely, I am a bit confused. We have both a
LinkDb.Merger.reduce and a LinkDb.reduce. They both do the same thing
(aggregate inlinks until the size reaches maxInlinks, then collect). Why do we
have them separately? Is there a difference between them that I am missing?




[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505302
 ] 

Andrzej Bialecki commented on NUTCH-498:
-

Currently there is no difference, indeed. The version in LinkDb.reduce is 
safer, because it uses a separate instance of Inlinks. Perhaps we could replace 
LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove 
LinkDb.reduce.
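
To make the safety point concrete, here is a sketch of the shape being
described: the reduce accumulates into a freshly allocated Inlinks rather
than mutating the first value off the iterator, which matters because the
mapred runtime may reuse the Writable instances it hands to reduce(). The
class name and the default cap are illustrative, not actual Nutch code:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.nutch.crawl.Inlink;
  import org.apache.nutch.crawl.Inlinks;

  // Illustrative reducer: same aggregation as the combiner above, but the
  // result lives in a separate Inlinks instance instead of the first
  // (possibly runtime-reused) value.
  public class SafeInlinksReducer extends MapReduceBase implements Reducer {
    private int maxInlinks;

    public void configure(JobConf job) {
      super.configure(job);
      maxInlinks = job.getInt("db.max.inlinks", 10000); // default is illustrative
    }

    public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter reporter) throws IOException {
      Inlinks result = new Inlinks(); // fresh instance, as in LinkDb.reduce
      while (values.hasNext() && result.size() < maxInlinks) {
        Inlinks part = (Inlinks) values.next();
        for (Iterator it = part.iterator();
             it.hasNext() && result.size() < maxInlinks;) {
          result.add((Inlink) it.next());
        }
      }
      if (result.size() > 0) {
        output.collect(key, result);
      }
    }
  }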
