[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2007-06-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508445
 ] 

Doğacan Güney commented on NUTCH-289:
-

It seems this issue has kind of died down, but this would be a great feature to 
have. 

Here is how I think we can do this one (my proposal is _heavily_ based on 
Stefan Groschupf's work):

* Add ip as a field to CrawlDatum

* Fetcher always resolves ip and stores it in crawl_fetch (even if CrawlDatum 
already has an ip).

* A similar IpAddressResolver tool that reads crawl_fetch, crawl_parse (and 
probably crawldb) and (optionally) runs before updatedb. 
  - map: <url, CrawlDatum> -> <host of url, <url, CrawlDatum>>. Add a field 
to CrawlDatum's metadata to indicate where it is coming from (crawldb, 
crawl_fetch, or crawl_parse); this field will be removed in reduce. No lookup 
is performed in map().

  - reduce: <host, list(<url, CrawlDatum>)> -> <url, CrawlDatum>. If any 
CrawlDatum already contains an IP address (IP addresses in crawl_fetch take 
precedence over those in crawldb), then output all crawl_parse datums with 
this IP address. Otherwise, perform a lookup. This way, we will not have to 
resolve the IP for most URLs (in a way, we will still be getting the benefits 
of the JVM cache :).

A downside of this approach is that we will either have to read crawldb twice 
or perform IP lookups for hosts that are in crawldb (but not in crawl_fetch).

* Use the cached IP during generation, if it exists.
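The precedence rule in the proposed reduce step can be sketched in plain Java. This is a minimal illustration, not Nutch code: the `DatumStub` class and the `chooseIp` helper are hypothetical stand-ins for a CrawlDatum carrying the proposed source-marker metadata field.

```java
import java.util.*;

// Hypothetical stand-in for CrawlDatum: just the fields the sketch needs.
class DatumStub {
    String source; // "crawldb", "crawl_fetch", or "crawl_parse" (set in map)
    String ip;     // null if not yet resolved
    DatumStub(String source, String ip) { this.source = source; this.ip = ip; }
}

public class IpPrecedence {
    // Pick the IP to assign to all crawl_parse datums for one host:
    // an IP seen in crawl_fetch wins over one seen in crawldb; if no
    // datum carries an IP, the caller falls back to a DNS lookup.
    static String chooseIp(List<DatumStub> datums) {
        String fromCrawlDb = null;
        for (DatumStub d : datums) {
            if (d.ip == null) continue;
            if ("crawl_fetch".equals(d.source)) return d.ip; // highest precedence
            if ("crawldb".equals(d.source)) fromCrawlDb = d.ip;
        }
        return fromCrawlDb; // may be null -> caller resolves via DNS
    }

    public static void main(String[] args) {
        List<DatumStub> datums = Arrays.asList(
            new DatumStub("crawldb", "10.0.0.1"),
            new DatumStub("crawl_parse", null),
            new DatumStub("crawl_fetch", "10.0.0.2"));
        System.out.println(chooseIp(datums)); // prints 10.0.0.2: crawl_fetch wins
    }
}
```

With this rule, a DNS lookup happens only for hosts where no datum in the group already carries an IP.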


 CrawlDatum should store IP address
 --

 Key: NUTCH-289
 URL: https://issues.apache.org/jira/browse/NUTCH-289
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cutting
 Attachments: ipInCrawlDatumDraftV1.patch, 
 ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, 
 ipInCrawlDatumDraftV5.patch


 If the CrawlDatum stored the IP address of the host of its URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just per hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-499) Refactor LinkDb and LinkDbMerger to reuse code

2007-06-27 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508449
 ] 

Sami Siren commented on NUTCH-499:
--

+1, seems good to me


 Refactor LinkDb and LinkDbMerger to reuse code
 --

 Key: NUTCH-499
 URL: https://issues.apache.org/jira/browse/NUTCH-499
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Trivial
 Attachments: NUTCH-499.patch


 LinkDbMerger.reduce and LinkDb.reduce work the same way. Refactor Nutch so 
 that we can use the same code for both.




JIRA email question

2007-06-27 Thread Doğacan Güney

Hi list,

There is this sentence at the end of every JIRA message:

You can reply to this email to add a comment to the issue online.

But, replying to a JIRA message through nutch-dev doesn't add it as a
comment. So you have to either reply to an email through JIRA (in
which case, it looks like you are responding to an imaginary person:)
or through email (in which case, part of the discussion doesn't get
documented in JIRA). Why doesn't this work?

--
Doğacan Güney


[jira] Closed: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-06-27 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-434.
---


Issue resolved and committed.

 Replace usage of ObjectWritable with something based on GenericWritable
 ---

 Key: NUTCH-434
 URL: https://issues.apache.org/jira/browse/NUTCH-434
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: NUTCH-434.patch, NUTCH-434_v2.patch, NUTCH-434_v3.patch


 We should replace the usage of ObjectWritable and classes extending it with 
 a class extending GenericWritable. Classes based on GenericWritable have a 
 smaller footprint on disk, and the base class also does not contain any 
 deprecated classes.
 There is one problem though: ParseData currently needs a Configuration 
 object before it can deserialize itself, and GenericWritable doesn't provide 
 a way to inject a configuration. We could either a) remove the need for the 
 Configuration, or b) write a class similar to GenericWritable that does the 
 conf injecting.
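Option (b) can be sketched without Hadoop on the classpath by stubbing the two interfaces involved. Everything here is illustrative, not the committed API: the interfaces are minimal stand-ins for Hadoop's Configuration/Configurable, and ConfInjectingWrapper names the pattern only.

```java
// Minimal stand-ins for the Hadoop interfaces involved (illustrative only).
interface Configuration {}
interface Configurable { void setConf(Configuration conf); }

// The pattern: a wrapper that, when it materializes an instance of one of
// its known classes, injects the Configuration before the instance is used
// for deserialization.
class ConfInjectingWrapper {
    private final Configuration conf;
    ConfInjectingWrapper(Configuration conf) { this.conf = conf; }

    Object newInstance(Class<?> clazz) throws Exception {
        Object obj = clazz.getDeclaredConstructor().newInstance();
        if (obj instanceof Configurable) {
            ((Configurable) obj).setConf(conf); // inject before readFields()
        }
        return obj;
    }
}

public class ConfInjectDemo {
    // Stand-in for ParseData: a payload class that needs a conf to deserialize.
    static class ParseDataStub implements Configurable {
        Configuration conf;
        public void setConf(Configuration conf) { this.conf = conf; }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration() {};
        ParseDataStub pd = (ParseDataStub)
            new ConfInjectingWrapper(conf).newInstance(ParseDataStub.class);
        System.out.println(pd.conf == conf); // true: configuration was injected
    }
}
```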




[jira] Resolved: (NUTCH-499) Refactor LinkDb and LinkDbMerger to reuse code

2007-06-27 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney resolved NUTCH-499.
-

   Resolution: Fixed
Fix Version/s: 1.0.0

Committed in rev. 551098.

 Refactor LinkDb and LinkDbMerger to reuse code
 --

 Key: NUTCH-499
 URL: https://issues.apache.org/jira/browse/NUTCH-499
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Trivial
 Fix For: 1.0.0

 Attachments: NUTCH-499.patch


 LinkDbMerger.reduce and LinkDb.reduce work the same way. Refactor Nutch so 
 that we can use the same code for both.




[jira] Closed: (NUTCH-499) Refactor LinkDb and LinkDbMerger to reuse code

2007-06-27 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-499.
---


Issue resolved and committed.

 Refactor LinkDb and LinkDbMerger to reuse code
 --

 Key: NUTCH-499
 URL: https://issues.apache.org/jira/browse/NUTCH-499
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Trivial
 Fix For: 1.0.0

 Attachments: NUTCH-499.patch


 LinkDbMerger.reduce and LinkDb.reduce work the same way. Refactor Nutch so 
 that we can use the same code for both.




[jira] Commented: (NUTCH-479) Support for OR queries

2007-06-27 Thread Rob Young (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508479
 ] 

Rob Young commented on NUTCH-479:
-

Hi, I've found a bug in this patch. If I search for title:red OR title:blue, I 
would expect it to be expanded to 
+title:red title:blue, but in fact it expands to +title:red title:blue, 
so there is no way to do term-specific queries.

 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.




[jira] Updated: (NUTCH-479) Support for OR queries

2007-06-27 Thread Rob Young (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rob Young updated NUTCH-479:


Attachment: or.patch

I've changed the patch slightly to work around the bug I mentioned earlier.
Now the queries look like this:
name:name value OR name:other value
and are expanded to:
+name:name value name:other value


 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: or.patch, or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.




[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508505
 ] 

Doğacan Güney commented on NUTCH-498:
-

I tested creating a linkdb from ~6M urls:

Combine input records:  42,091,902
Combine output records: 15,684,838

(The combiner reduces the number of records to around 1/3.)

The job took ~15 min overall with the combiner, ~20 minutes without it.

So, +1 from me.




 Use Combiner in LinkDb to increase speed of linkdb generation
 -

 Key: NUTCH-498
 URL: https://issues.apache.org/jira/browse/NUTCH-498
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Priority: Minor
 Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch


 I tried to add the following combiner to LinkDb:

public static enum Counters { COMBINED }

public static class LinkDbCombiner extends MapReduceBase implements Reducer {

  private int _maxInlinks;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    _maxInlinks = job.getInt("db.max.inlinks", 1);
  }

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    // Merge all incoming Inlinks values into the first one, up to the cap.
    final Inlinks inlinks = (Inlinks) values.next();
    int combined = 0;
    while (values.hasNext()) {
      Inlinks val = (Inlinks) values.next();
      for (Iterator it = val.iterator(); it.hasNext();) {
        if (inlinks.size() >= _maxInlinks) {
          if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
          }
          output.collect(key, inlinks);
          return;
        }
        Inlink in = (Inlink) it.next();
        inlinks.add(in);
      }
      combined++;
    }
    if (inlinks.size() == 0) {
      return;
    }
    if (combined > 0) {
      reporter.incrCounter(Counters.COMBINED, combined);
    }
    output.collect(key, inlinks);
  }
}
 This greatly reduced the time it took to generate a new linkdb. In my case it 
 reduced the time by half.
 Map output records:    8,717,810,541
 Combined:              7,632,541,507
 Resulting output recs: 1,085,269,034
 That's an 87% reduction of output records from the map phase.
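The combiner's core behavior, merging several per-key inlink lists into one list capped at db.max.inlinks entries, can be simulated without Hadoop. This is a plain-Java sketch; the names (mergeInlinks, maxInlinks) and the List<String> stand-in for Inlinks are illustrative, not from the patch.

```java
import java.util.*;

// Plain-Java simulation of the combiner's core behavior: merge several
// per-key inlink lists into one, capped at maxInlinks entries.
public class CombinerSketch {
    static List<String> mergeInlinks(List<List<String>> groups, int maxInlinks) {
        List<String> merged = new ArrayList<>();
        for (List<String> group : groups) {
            for (String inlink : group) {
                if (merged.size() >= maxInlinks) return merged; // cap reached
                merged.add(inlink);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
            Arrays.asList("a.com", "b.com"),
            Arrays.asList("c.com", "d.com", "e.com"));
        // Five inlinks across two lists collapse to one capped list of four.
        System.out.println(mergeInlinks(groups, 4)); // [a.com, b.com, c.com, d.com]
    }
}
```

Because the combiner runs on map output before the shuffle, every group of values it collapses into one record is one less record to sort and transfer, which is where the observed speedup comes from.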




[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508506
 ] 

Andrzej Bialecki  commented on NUTCH-498:
-

+1.

 Use Combiner in LinkDb to increase speed of linkdb generation
 -

 Key: NUTCH-498
 URL: https://issues.apache.org/jira/browse/NUTCH-498
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Priority: Minor
 Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch






[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-27 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508508
 ] 

Sami Siren commented on NUTCH-498:
--

+1

 Use Combiner in LinkDb to increase speed of linkdb generation
 -

 Key: NUTCH-498
 URL: https://issues.apache.org/jira/browse/NUTCH-498
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Priority: Minor
 Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch






[jira] Resolved: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-27 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney resolved NUTCH-498.
-

   Resolution: Fixed
Fix Version/s: 1.0.0
 Assignee: Doğacan Güney

Committed in rev. 551147.

 Use Combiner in LinkDb to increase speed of linkdb generation
 -

 Key: NUTCH-498
 URL: https://issues.apache.org/jira/browse/NUTCH-498
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch






Re: JIRA email question

2007-06-27 Thread Doug Cutting
The problem is that nutch-dev (like most Apache mailing lists) sets the 
Reply-to header to be itself, so that responses don't go back to the 
sender.  If you override this when responding (changing the To: line) 
and respond to the sender, then it should end up as a comment, which 
will be then copied to nutch-dev.  But there's unfortunately no way to 
automatically override this.  Thus it's best to click on the link in the 
message and respond directly in Jira.  This is also more reliable. 
Sending messages to Jira doesn't always seem to work correctly.  It 
might be good to disable that sentence suggesting that folks reply to 
the email, but I don't know if that's possible.


Doug

Doğacan Güney wrote:

Hi list,

There is this sentence at the end of every JIRA message:

You can reply to this email to add a comment to the issue online.

But, replying to a JIRA message through nutch-dev doesn't add it as a
comment. So you have to either reply to an email through JIRA (in
which case, it looks like you are responding to an imaginary person:)
or through email (in which case, part of the discussion doesn't get
documented in JIRA). Why doesn't this work?



Re: NUTCH-119 :: how hard to fix

2007-06-27 Thread Kai_testing Middleton
wow, setting db.max.outlinks.per.page immediately fixed my problem.  It looks 
like I totally mis-diagnosed things.

May I pose two questions:
1) how did you view all the outlinks?
2) how severe is NUTCH-119 - does it occur on a lot of sites?


- Original Message 
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, June 26, 2007 10:56:32 PM
Subject: Re: NUTCH-119 :: how hard to fix

On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:
 I am evaluating nutch+lucene as a crawl and search solution.

 However, I am finding major bugs in nutch right off the bat.

 In particular, NUTCH-119: nutch is not crawling relative URLs.  I have some 
 discussion of it here:
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html

 Most of the links off www.variety.com, one of my main test sites, have 
 relative URLs.  It seems incredible that nutch, which is capable of 
 mapreduce, cannot fetch these URLs.

 It could be that I would fix this bug if, for other reasons, I decide to go 
 with nutch+lucene.  Has anyone tried fixing this problem?  Is it intractable? 
  Or are the developers, who are just volunteers anyway, more interested in 
 fixing other problems?

 Could someone outline the issue for me a bit more clearly so I would know how 
 to evaluate it?

Both this one and the other site you were mentioning (sf911truth) have
more than 100 outlinks. Nutch, by default, only stores 100 outlinks
per page (db.max.outlinks.per.page). The link about.html happens to be
the 105th link or so, so nutch doesn't store it. All you have to do is
either increase db.max.outlinks.per.page or set it to -1 (which
means: store all outlinks).
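For reference, the override goes in conf/nutch-site.xml; a minimal fragment (the property name is from the discussion above, the description wording is mine):

```xml
<property>
  <name>db.max.outlinks.per.page</name>
  <!-- -1 means store all outlinks; the default is 100 -->
  <value>-1</value>
  <description>Maximum number of outlinks stored per page (-1 for no limit).</description>
</property>
```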





   
 


-- 
Doğacan Güney







   
