[proposal] Generic Markup Language Parser

2005-11-23 Thread Jérôme Charron
Hi, We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and I) just added a new proposal to the Nutch wiki: http://wiki.apache.org/nutch/MarkupLanguageParserProposal Here is the Summary of Issue: "Currently, Nutch provides some specific markup language parsing plugins: one for handling H

[jira] Commented: (NUTCH-120) one "bad" link on a page kills parsing

2005-11-23 Thread Paul Baclace (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358426 ] Paul Baclace commented on NUTCH-120: Indeed there is a comment that indicates the code keeps trying, but luckily it does not, and it might be unwise to keep trying after th

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Doug Cutting
Andrzej Bialecki wrote: Or to use an implementation of ObjectWritable, which contains all needed partial data? Yes, but ObjectWritable is considerably bigger, and hence slower to copy, sort, etc., since it writes the class name with every instance. LongWritable is a good way to write 64-bit q
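
The size argument above can be illustrated with a minimal sketch that uses only java.io. It does not use the real ObjectWritable or LongWritable classes, and the "class name, then value" layout is an assumption made for illustration, not Hadoop's or Nutch's actual wire format. The class and method names (ClassNameOverheadSketch, plainSize, taggedSize) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ClassNameOverheadSketch {

    // Serializes only the 64-bit value, as a fixed-type record could.
    static int plainSize(long value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(value);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical self-describing layout: class name first, then the value.
    static int taggedSize(long value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(Long.class.getName()); // repeated for every instance
            out.writeLong(value);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("plain:  " + plainSize(42L) + " bytes");
        System.out.println("tagged: " + taggedSize(42L) + " bytes");
    }
}
```

Under these assumptions the tagged record is several times larger than the plain one, and that extra size is paid again on every copy and sort of every record.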

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Andrzej Bialecki
Doug Cutting wrote: [EMAIL PROTECTED] wrote: Implement a reader for CrawlDB, loosely inspired by NUTCH-114 (thanks Stefan!). The reader offers similar functionality to the classic "readdb" command. This looks great! Thanks, Andrzej. No problem - I don't know about you, but I felt like

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Doug Cutting
Doug Cutting wrote: I just ran it on a 50M page crawl. FYI, here's the output:
051123 191703 TOTAL urls: 167780785
051123 191703 avg score: 1.152
051123 191703 max score: 47357.137
051123 191703 min score: 1.0
051123 191703 retry 0: 167780785
051123 191703 status 1

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Implement a reader for CrawlDB, loosely inspired by NUTCH-114 (thanks Stefan!). The reader offers similar functionality to the classic "readdb" command. This looks great! Thanks, Andrzej. I just ran it on a 50M page crawl. It took longer than I expected. The reduce

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Andrzej Bialecki
Sami Siren wrote: + if (k.contains("score")) { Since: 1.5 Ah, indeed. Fixed - thanks! -- Best regards, Andrzej Bialecki

Re: MapRed Generator

2005-11-23 Thread Doug Cutting
Are you crawling only a single host? If so, I can see how this would happen. Using two hosts to crawl a single host is probably not a good idea anyway, no? Doug Anton Potehin wrote: Class Generator We have 2 Reduce Tasks Limit = TopN / 2; Generator.Selector.Reduce for first t

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Sami Siren
+ if (k.contains("score")) { Since: 1.5 -- Sami Siren
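
The "Since: 1.5" remark refers to String.contains, which was added in Java 1.5 and is therefore unavailable on older runtimes. The fix that was actually committed is not shown in this listing. The sketch below shows one common replacement based on String.indexOf, and the names ContainsCompat and containsCompat are hypothetical.

```java
public class ContainsCompat {

    // Works on pre-1.5 runtimes: indexOf returns -1 when the substring is absent.
    static boolean containsCompat(String haystack, String needle) {
        return haystack.indexOf(needle) >= 0;
    }

    public static void main(String[] args) {
        String k = "avg score"; // hypothetical key, for illustration only
        System.out.println(containsCompat(k, "score"));
        System.out.println(containsCompat(k, "status"));
    }
}
```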

MapRed Generator

2005-11-23 Thread Anton Potehin
Class Generator We have 2 Reduce Tasks Limit = TopN / 2; Generator.Selector.Reduce for the first task receives all K,V pairs from the maps, but selects only half of them (the work limit). Generator.Selector.Reduce for the second task doesn't receive any pairs at all! As a result, on output we have half of m
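
The reply in this thread asks whether the crawl covers only a single host. The sketch below shows why that would matter if keys were assigned to reduce tasks by host hash. That partitioning rule is an assumption made for illustration and is not confirmed by the text shown here, and the names HostPartitionSketch, partitionFor and countPerPartition are hypothetical.

```java
import java.net.URI;

public class HostPartitionSketch {

    // Assumed rule: a URL goes to the reduce task chosen by its host's hash.
    static int partitionFor(String url, int numReduceTasks) {
        String host = URI.create(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many URLs land in each reduce task.
    static int[] countPerPartition(String[] urls, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String url : urls) {
            counts[partitionFor(url, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a",
            "http://example.com/b",
            "http://example.com/c"
        };
        int[] counts = countPerPartition(urls, 2);
        System.out.println("task 0: " + counts[0] + ", task 1: " + counts[1]);
    }
}
```

With every URL on one host, one task receives all pairs and the other receives none. If each task then applies a limit of TopN / 2, only half of the requested URLs can be selected.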

Re: Small bug in Generator

2005-11-23 Thread Andrzej Bialecki
Andrzej Bialecki wrote: Not to mention that the code uses a local variable with the same name and different type, to obscure the picture... I'll fix it. Thanks! We were both too late... ;-) r332371 | cutting | 2005-11-10

Re: Small bug in Generator

2005-11-23 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote: There is a small bug in the method Generator.Selector.reduce. Currently it reads: ... while (values.hasNext() && ++count < limit) { ... It should be: ... while (values.hasNext() && ++count <= limit) { ... Not to mention that the code uses a local variable with the same name and different type

mapred crawl

2005-11-23 Thread Anton Potehin
We used Nutch for whole-web crawling. In an infinite loop we run these tasks: 1) bin/nutch generate db -topN 1 2) bin/nutch fetch 3) bin/nutch updatedb db 4) bin/nutch analyze db 5) bin/nutch index 6) bin/nutch dedup segments dedup.tmp After each iteration we produce a new segment and ma

Small bug in Generator

2005-11-23 Thread anton
There is a small bug in the method Generator.Selector.reduce. Currently it reads: ... while (values.hasNext() && ++count < limit) { ... It should be: ... while (values.hasNext() && ++count <= limit) { ...
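
The off-by-one reported here can be reproduced in isolation. The sketch below is not the Nutch code. It assumes the counter starts at zero and that the loop body consumes one value per iteration, and the names LimitLoopSketch, takeWithLessThan and takeWithLessOrEqual are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class LimitLoopSketch {

    // Reported condition: pre-increment combined with '<' stops one short.
    static int takeWithLessThan(Iterator<String> values, long limit) {
        long count = 0; // assumption: the counter starts at zero
        int taken = 0;
        while (values.hasNext() && ++count < limit) {
            values.next();
            taken++;
        }
        return taken;
    }

    // Proposed condition: '<=' lets the loop handle exactly 'limit' values.
    static int takeWithLessOrEqual(Iterator<String> values, long limit) {
        long count = 0;
        int taken = 0;
        while (values.hasNext() && ++count <= limit) {
            values.next();
            taken++;
        }
        return taken;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(takeWithLessThan(urls.iterator(), 3));    // prints 2
        System.out.println(takeWithLessOrEqual(urls.iterator(), 3)); // prints 3
    }
}
```

With five values and a limit of 3, the first variant handles two values and the second handles three, which matches the report.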