[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ https://issues.apache.org/jira/browse/NUTCH-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508445 ]

Doğacan Güney commented on NUTCH-289:
-------------------------------------

It seems this issue has kind of died down, but this would be a great feature to have. Here is how I think we can do this one (my proposal is _heavily_ based on Stefan Groschupf's work):

* Add ip as a field to CrawlDatum.
* Fetcher always resolves the ip and stores it in crawl_fetch (even if the CrawlDatum already has an ip).
* A similar IpAddressResolver tool that reads crawl_fetch, crawl_parse (and probably crawldb) and (optionally) runs before updatedb:
  - map: <url, CrawlDatum> -> <host of url, <url, CrawlDatum>>. Add a field to CrawlDatum's metadata to indicate where it is coming from (crawldb, crawl_fetch or crawl_parse); this field is removed in reduce. No lookup is performed in map().
  - reduce: <host, list of <url, CrawlDatum>> -> <url, CrawlDatum>. If any CrawlDatum already contains an ip address (ip addresses in crawl_fetch having precedence over ones in crawldb), then output all crawl_parse datums with this ip address. Otherwise, perform a lookup. This way, we will not have to resolve the ip for most URLs (in a way, we will still be getting the benefits of the JVM cache :). A downside of this approach is that we will either have to read crawldb twice or perform ip lookups for hosts that are in crawldb but not in crawl_fetch.
* Use the cached ip during generation, if it exists.

CrawlDatum should store IP address
----------------------------------

Key: NUTCH-289
URL: https://issues.apache.org/jira/browse/NUTCH-289
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cutting
Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, ipInCrawlDatumDraftV5.patch

If the CrawlDatum stored the IP address of the host of its URL, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just per hostname.
This would be a good way to limit the impact of domain spammers. The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
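The host-keyed map/reduce proposed above can be sketched outside Hadoop. Everything below is hypothetical illustration (class and method names are invented, and resolveIp is a fake stand-in for a real DNS lookup), not Nutch code; it only shows why grouping by host means one lookup per host rather than one per URL:

```java
import java.util.*;

// Sketch of the proposed IpAddressResolver idea: key every URL by its
// host (the "map" step), then resolve an IP at most once per host and
// share it across that host's URLs (the "reduce" step).
public class HostIpResolverSketch {
    static int lookups = 0;

    // Fake DNS lookup; a real implementation would use the resolver.
    static String resolveIp(String host) {
        lookups++;
        return "10.0.0." + Math.abs(host.hashCode() % 250);
    }

    public static Map<String, String> resolveAll(List<String> urls) throws Exception {
        // Map-phase analogue: group URLs by host.
        Map<String, List<String>> byHost = new HashMap<>();
        for (String u : urls) {
            String host = new java.net.URL(u).getHost();
            byHost.computeIfAbsent(host, h -> new ArrayList<>()).add(u);
        }
        // Reduce-phase analogue: one lookup per host, shared by all its URLs.
        Map<String, String> urlToIp = new HashMap<>();
        for (Map.Entry<String, List<String>> e : byHost.entrySet()) {
            String ip = resolveIp(e.getKey());
            for (String u : e.getValue()) urlToIp.put(u, ip);
        }
        return urlToIp;
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = Arrays.asList(
            "http://example.com/a", "http://example.com/b", "http://other.org/c");
        Map<String, String> ips = resolveAll(urls);
        // Three URLs but only two distinct hosts -> exactly two lookups.
        if (lookups != 2) throw new AssertionError("expected 2 lookups, got " + lookups);
        if (!ips.get("http://example.com/a").equals(ips.get("http://example.com/b")))
            throw new AssertionError("same-host URLs should share an IP");
        System.out.println("ok"); // -> ok
    }
}
```

In the real proposal the reduce step would additionally prefer an IP already present in a crawl_fetch datum before falling back to a lookup.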
[jira] Commented: (NUTCH-499) Refactor LinkDb and LinkDbMerger to reuse code
[ https://issues.apache.org/jira/browse/NUTCH-499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508449 ]

Sami Siren commented on NUTCH-499:
----------------------------------

+1, seems good to me.

Refactor LinkDb and LinkDbMerger to reuse code

Key: NUTCH-499
URL: https://issues.apache.org/jira/browse/NUTCH-499
Project: Nutch
Issue Type: Improvement
Components: linkdb
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Trivial
Attachments: NUTCH-499.patch

LinkDbMerger.reduce and LinkDb.reduce work the same way. Refactor Nutch so that we can use the same code for both.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
JIRA email question
Hi list, There is this sentence at the end of every JIRA message: You can reply to this email to add a comment to the issue online. But, replying to a JIRA message through nutch-dev doesn't add it as a comment. So you have to either reply to an email through JIRA (in which case, it looks like you are responding to an imaginary person:) or through email (in which case, part of the discussion doesn't get documented in JIRA). Why doesn't this work? -- Doğacan Güney
[jira] Closed: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable
[ https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-434.
-------------------------------

Issue resolved and committed.

Replace usage of ObjectWritable with something based on GenericWritable

Key: NUTCH-434
URL: https://issues.apache.org/jira/browse/NUTCH-434
Project: Nutch
Issue Type: Improvement
Reporter: Sami Siren
Assignee: Doğacan Güney
Fix For: 1.0.0
Attachments: NUTCH-434.patch, NUTCH-434_v2.patch, NUTCH-434_v3.patch

We should replace the usage of ObjectWritable and classes extending it with classes extending GenericWritable. Classes based on GenericWritable have a smaller footprint on disk, and the base class also does not contain any deprecated classes. There is one problem, though: ParseData currently needs a Configuration object before it can deserialize itself, and GenericWritable doesn't provide a way to inject a configuration. We could either a) remove the need for the Configuration, or b) write a class similar to GenericWritable that does the conf injecting.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
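The smaller disk footprint mentioned above comes from the framing: ObjectWritable writes the full class name with every record, while GenericWritable writes a small index into a fixed, pre-registered class list. A simplified, Hadoop-free illustration of just that header cost (the method names and framing here are assumptions for illustration, not the real Hadoop wire format):

```java
import java.io.*;

// Simplified comparison of the two serialization headers: an
// ObjectWritable-style record carries the full class name, a
// GenericWritable-style record carries a one-byte class index.
public class WritableHeaderSizes {
    static byte[] objectWritableStyle(String className, byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(className);   // full class name written on every record
        out.write(payload);
        return bos.toByteArray();
    }

    static byte[] genericWritableStyle(byte classIndex, byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(classIndex); // one byte selects the class from a fixed list
        out.write(payload);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[] {1, 2, 3, 4};
        int objLen = objectWritableStyle("org.apache.nutch.parse.ParseData", payload).length;
        int genLen = genericWritableStyle((byte) 0, payload).length;
        if (genLen >= objLen) throw new AssertionError("generic header should be smaller");
        System.out.println(objLen + " vs " + genLen); // -> 38 vs 5
    }
}
```

The per-record saving is small, but it is paid on every key/value pair in every segment, which is why the footprint difference matters at crawl scale.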
[jira] Resolved: (NUTCH-499) Refactor LinkDb and LinkDbMerger to reuse code
[ https://issues.apache.org/jira/browse/NUTCH-499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-499.
---------------------------------

Resolution: Fixed
Fix Version/s: 1.0.0

Committed in rev. 551098.

Refactor LinkDb and LinkDbMerger to reuse code

Key: NUTCH-499
URL: https://issues.apache.org/jira/browse/NUTCH-499
Project: Nutch
Issue Type: Improvement
Components: linkdb
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Trivial
Fix For: 1.0.0
Attachments: NUTCH-499.patch

LinkDbMerger.reduce and LinkDb.reduce work the same way. Refactor Nutch so that we can use the same code for both.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-499) Refactor LinkDb and LinkDbMerger to reuse code
[ https://issues.apache.org/jira/browse/NUTCH-499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-499.
-------------------------------

Issue resolved and committed.

Refactor LinkDb and LinkDbMerger to reuse code

Key: NUTCH-499
URL: https://issues.apache.org/jira/browse/NUTCH-499
Project: Nutch
Issue Type: Improvement
Components: linkdb
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Trivial
Fix For: 1.0.0
Attachments: NUTCH-499.patch

LinkDbMerger.reduce and LinkDb.reduce work the same way. Refactor Nutch so that we can use the same code for both.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-479) Support for OR queries
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508479 ]

Rob Young commented on NUTCH-479:
---------------------------------

Hi, I've found a bug in this patch. If I search for title:red OR title:blue I would expect it to be expanded to +title:red title:blue, but in fact it expands to +title:red title:blue, so there is no way to do term-specific queries.

Support for OR queries

Key: NUTCH-479
URL: https://issues.apache.org/jira/browse/NUTCH-479
Project: Nutch
Issue Type: Improvement
Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 1.0.0
Attachments: or.patch

There have been many requests from users to extend the Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-479) Support for OR queries
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Young updated NUTCH-479:
----------------------------

Attachment: or.patch

I've changed the patch slightly to work around the bug I mentioned earlier. Now the queries look like this: name:"name value" OR name:"other value", and are expanded to: +name:"name value" name:"other value".

Support for OR queries

Key: NUTCH-479
URL: https://issues.apache.org/jira/browse/NUTCH-479
Project: Nutch
Issue Type: Improvement
Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Fix For: 1.0.0
Attachments: or.patch, or.patch

There have been many requests from users to extend the Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
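The expansion quoted in these messages can be reproduced with a toy transform. This is not the real Nutch query translator, only a mirror of the one quoted example: with implicit AND every clause carries the required "+" marker, while clauses joined by OR are left without it (optional):

```java
// Toy reproduction of the quoted expansion (not the Nutch parser):
// the first clause keeps the implicit-AND "+" (required) marker,
// OR-joined clauses are emitted without it (optional).
public class OrExpansionSketch {
    static String expand(String query) {
        String[] parts = query.split(" OR ");
        StringBuilder sb = new StringBuilder("+").append(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            sb.append(' ').append(parts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String out = expand("title:red OR title:blue");
        if (!out.equals("+title:red title:blue")) throw new AssertionError(out);
        System.out.println(out); // -> +title:red title:blue
    }
}
```

In Lucene terms, "+a b" means a is required and b is optional; documents matching only the OR'd clause still need the required clause to match, so this is not a full boolean OR.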
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508505 ]

Doğacan Güney commented on NUTCH-498:
-------------------------------------

I tested creating a linkdb from ~6M urls:

Combine input records: 42,091,902
Combine output records: 15,684,838

(The combiner reduces the number of records to around 1/3.) The job took ~15 minutes overall with the combiner, ~20 minutes without it. So, +1 from me.

Use Combiner in LinkDb to increase speed of linkdb generation
-------------------------------------------------------------

Key: NUTCH-498
URL: https://issues.apache.org/jira/browse/NUTCH-498
Project: Nutch
Issue Type: Improvement
Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Priority: Minor
Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch

I tried to add the following combiner to LinkDb:

  public static enum Counters { COMBINED }

  public static class LinkDbCombiner extends MapReduceBase implements Reducer {
    private int _maxInlinks;

    @Override
    public void configure(JobConf job) {
      super.configure(job);
      _maxInlinks = job.getInt("db.max.inlinks", 1);
    }

    public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter reporter) throws IOException {
      final Inlinks inlinks = (Inlinks) values.next();
      int combined = 0;
      while (values.hasNext()) {
        Inlinks val = (Inlinks) values.next();
        for (Iterator it = val.iterator(); it.hasNext();) {
          if (inlinks.size() >= _maxInlinks) {
            if (combined > 0) {
              reporter.incrCounter(Counters.COMBINED, combined);
            }
            output.collect(key, inlinks);
            return;
          }
          Inlink in = (Inlink) it.next();
          inlinks.add(in);
        }
        combined++;
      }
      if (inlinks.size() == 0) {
        return;
      }
      if (combined > 0) {
        reporter.incrCounter(Counters.COMBINED, combined);
      }
      output.collect(key, inlinks);
    }
  }

This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.

Map output records: 8,717,810,541
Combined: 7,632,541,507
Resulting output records: 1,085,269,034

That's an 87% reduction in output records from the map phase.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
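The merge logic the combiner performs can be seen in a Hadoop-free sketch (class and method names here are invented, and plain strings stand in for Inlink objects): the per-URL inlink lists emitted by the map phase are merged locally, capped at maxInlinks, so far fewer records cross the network to the reducers.

```java
import java.util.*;

// Hadoop-free sketch of the combiner's merge: several map-output
// records for the same URL collapse into one record holding at most
// maxInlinks inlinks.
public class InlinkCombinerSketch {
    static List<String> combine(List<List<String>> values, int maxInlinks) {
        List<String> merged = new ArrayList<>(values.get(0));
        for (int i = 1; i < values.size(); i++) {
            for (String inlink : values.get(i)) {
                if (merged.size() >= maxInlinks) return merged; // cap reached, stop early
                merged.add(inlink);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> mapOutputs = Arrays.asList(
            Arrays.asList("a.com", "b.com"),
            Arrays.asList("c.com"),
            Arrays.asList("d.com", "e.com"));
        List<String> merged = combine(mapOutputs, 4);
        // Three map records collapse into one record of at most 4 inlinks.
        if (merged.size() != 4) throw new AssertionError(merged.toString());
        System.out.println(merged); // -> [a.com, b.com, c.com, d.com]
    }
}
```

Because the reducer applies the same merge again, running it early as a combiner changes only where the work happens, not the final linkdb contents, which is what makes the optimization safe.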
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508506 ]

Andrzej Bialecki commented on NUTCH-498:
----------------------------------------

+1.

Use Combiner in LinkDb to increase speed of linkdb generation

Key: NUTCH-498
URL: https://issues.apache.org/jira/browse/NUTCH-498
Project: Nutch
Issue Type: Improvement
Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Priority: Minor
Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508508 ]

Sami Siren commented on NUTCH-498:
----------------------------------

+1

Use Combiner in LinkDb to increase speed of linkdb generation

Key: NUTCH-498
URL: https://issues.apache.org/jira/browse/NUTCH-498
Project: Nutch
Issue Type: Improvement
Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Priority: Minor
Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-498.
---------------------------------

Resolution: Fixed
Fix Version/s: 1.0.0
Assignee: Doğacan Güney

Committed in rev. 551147.

Use Combiner in LinkDb to increase speed of linkdb generation

Key: NUTCH-498
URL: https://issues.apache.org/jira/browse/NUTCH-498
Project: Nutch
Issue Type: Improvement
Components: linkdb
Affects Versions: 0.9.0
Reporter: Espen Amble Kolstad
Assignee: Doğacan Güney
Priority: Minor
Fix For: 1.0.0
Attachments: LinkDbCombiner.patch, LinkDbCombiner.patch

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: JIRA email question
The problem is that nutch-dev (like most Apache mailing lists) sets the Reply-To header to itself, so that responses don't go back to the sender. If you override this when responding (changing the To: line) and reply to the sender, then it should end up as a comment, which will then be copied to nutch-dev. But there's unfortunately no way to automatically override this. Thus it's best to click on the link in the message and respond directly in Jira. This is also more reliable: sending messages to Jira doesn't always seem to work correctly. It might be good to disable that sentence suggesting that folks reply to the email, but I don't know if that's possible.

Doug

Doğacan Güney wrote:
Hi list, There is this sentence at the end of every JIRA message: "You can reply to this email to add a comment to the issue online." But replying to a JIRA message through nutch-dev doesn't add it as a comment. So you have to either reply to an email through JIRA (in which case, it looks like you are responding to an imaginary person:) or through email (in which case, part of the discussion doesn't get documented in JIRA). Why doesn't this work?
Re: NUTCH-119 :: how hard to fix
wow, setting db.max.outlinks.per.page immediately fixed my problem. It looks like I totally mis-diagnosed things. May I pose two questions: 1) how did you view all the outlinks? 2) how severe is NUTCH-119 - does it occur on a lot of sites?

----- Original Message -----
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, June 26, 2007 10:56:32 PM
Subject: Re: NUTCH-119 :: how hard to fix

On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:

I am evaluating nutch+lucene as a crawl and search solution. However, I am finding major bugs in nutch right off the bat. In particular, NUTCH-119: nutch is not crawling relative URLs. I have some discussion of it here: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html Most of the links off www.variety.com, one of my main test sites, have relative URLs. It seems incredible that nutch, which is capable of mapreduce, cannot fetch these URLs. It could be that I would fix this bug if, for other reasons, I decide to go with nutch+lucene. Has anyone tried fixing this problem? Is it intractable? Or are the developers, who are just volunteers anyway, more interested in fixing other problems? Could someone outline the issue for me a bit more clearly so I would know how to evaluate it?

Both this one and the other site you were mentioning (sf911truth) have more than 100 outlinks. Nutch, by default, only stores 100 outlinks per page (db.max.outlinks.per.page). The link about.html happens to be the 105th link or so, so Nutch doesn't store it. All you have to do is either increase db.max.outlinks.per.page or set it to -1 (which means: store all outlinks).

-- Doğacan Güney
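The db.max.outlinks.per.page behaviour described above can be sketched as follows (the class and method names are invented for illustration; the rule shown - keep at most the configured number of outlinks, or all of them when the value is -1 - is the one described in the reply):

```java
import java.util.*;

// Sketch of the outlink cap: keep at most maxOutlinks outlinks per
// page, or all of them when maxOutlinks is -1 (store everything).
public class OutlinkLimitSketch {
    static List<String> truncate(List<String> outlinks, int maxOutlinks) {
        if (maxOutlinks < 0 || outlinks.size() <= maxOutlinks) return outlinks;
        return outlinks.subList(0, maxOutlinks);
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        for (int i = 1; i <= 105; i++) links.add("link" + i + ".html");
        // With the default cap of 100, the 105th link is dropped ...
        if (truncate(links, 100).contains("link105.html")) throw new AssertionError();
        // ... with -1 (store all outlinks), it is kept.
        if (!truncate(links, -1).contains("link105.html")) throw new AssertionError();
        System.out.println("ok"); // -> ok
    }
}
```

This is why a page whose relative link happens to sit past the 100th outlink looks like a "relative URL bug" when it is really just the cap.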