Hi,
We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and I) just
added a new proposal to the Nutch wiki:
http://wiki.apache.org/nutch/MarkupLanguageParserProposal
Here is the summary of the issue:
"Currently, Nutch provides some specific markup language parsing plugins:
one for handling H
[
http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358426 ]
Paul Baclace commented on NUTCH-120:
Indeed there is a comment that indicates the code keeps trying, but luckily it
does not, and it might be unwise to keep trying after th
Andrzej Bialecki wrote:
Or to use an implementation of ObjectWritable, which contains all needed
partial data?
Yes, but ObjectWritable is considerably bigger, and hence slower to
copy, sort, etc., since it writes the class name with every instance.
LongWritable is a good way to write 64-bit q
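Doug's point about per-instance overhead can be made concrete with plain DataOutput arithmetic. The sketch below is illustrative, not Nutch code: the class name string and the `WritableSizeSketch` class are my own assumptions, but the sizes follow directly from how `DataOutputStream` encodes a bare long (always 8 bytes) versus a self-describing record that must carry its class name on every instance.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Rough size comparison: a bare long is a fixed 8 bytes, while a
// self-describing record also writes its class name with every instance.
public class WritableSizeSketch {
    // What a LongWritable-style write() amounts to: just the 8-byte value.
    static int bareLongSize() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeLong(42L);
            return buf.size(); // 8 bytes
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // What an ObjectWritable-style record amounts to: class name per instance,
    // then the value. (Class name chosen for illustration.)
    static int selfDescribingSize() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF("org.apache.nutch.io.LongWritable"); // 2 + 32 bytes
            out.writeLong(42L);                                // + 8 bytes
            return buf.size(); // 42 bytes here
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bareLongSize());       // 8
        System.out.println(selfDescribingSize()); // 42
    }
}
```

The gap multiplies across every record that gets copied and sorted, which is why the fixed-size encoding wins for sort keys.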
Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
Implement a reader for CrawlDB, loosely inspired by NUTCH-114 (thanks
Stefan!).
The reader offers similar functionality to the classic "readdb" command.
This looks great! Thanks, Andrzej.
No problem - I don't know about you, but I felt like
Doug Cutting wrote:
I just ran it on a 50M page crawl.
FYI, here's the output:
051123 191703 TOTAL urls: 167780785
051123 191703 avg score:1.152
051123 191703 max score:47357.137
051123 191703 min score:1.0
051123 191703 retry 0: 167780785
051123 191703 status 1
[EMAIL PROTECTED] wrote:
Implement a reader for CrawlDB, loosely inspired by NUTCH-114 (thanks Stefan!).
The reader offers similar functionality to the classic "readdb" command.
This looks great! Thanks, Andrzej.
I just ran it on a 50M page crawl. It took longer than I expected. The
reduce
Sami Siren wrote:
+ if (k.contains("score")) {
Since:
1.5
Ah, indeed. Fixed - thanks!
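Sami's "Since: 1.5" note refers to `String.contains(CharSequence)`, which was only added in Java 5. For code that still had to run on older JDKs at the time, the equivalent is an `indexOf` check; a minimal sketch (the class and method names here are my own):

```java
public class ContainsCompat {
    // String.contains(CharSequence) exists only since Java 5; the
    // pre-1.5-compatible equivalent is an indexOf check.
    static boolean containsCompat(String haystack, String needle) {
        return haystack.indexOf(needle) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsCompat("min score", "score")); // true
        System.out.println(containsCompat("retry 0", "score"));   // false
    }
}
```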
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Uni
Are you crawling only a single host? If so, I can see how this would
happen. Using two hosts to crawl a single host is probably not a good
idea anyway, no?
Doug
Anton Potehin wrote:
Class Generator
We have 2 Reduce Tasks
Limit = TopN / 2;
Generator.Selector.Reduce for first t
+ if (k.contains("score")) {
Since:
1.5
--
Sami Siren
Class Generator
We have 2 Reduce Tasks
Limit = TopN / 2;
Generator.Selector.Reduce for the first task receives all the K,V pairs from
the maps, but selects only half of them (the limit applies).
Generator.Selector.Reduce for the second task doesn't receive any pairs at all!
As a result, the output contains half of m
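Doug's single-host question explains how this can happen: if the partitioner hashes keys by host (so one host's URLs stay together), a single-host crawl sends every pair to one partition, while each reducer's limit is still topN divided by the number of reduce tasks. The sketch below is hypothetical, modeled on the common `(hashCode & MAX_VALUE) % numReduceTasks` convention; the class, the method, and the host-extraction logic are my own, not Nutch code.

```java
// Hypothetical sketch: partitioning by host sends all URLs from one host
// to the same reducer, so the other reducer receives nothing and only
// topN / numReduceTasks entries are ever selected.
public class PartitionSketch {
    static int partitionForHost(String host, int numReduceTasks) {
        return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] urls = {"http://example.org/a", "http://example.org/b",
                         "http://example.org/c"};
        // All three URLs share a host, so they land in the same partition;
        // with topN = 4 and limit = topN / 2 = 2 per reducer, only 2 of the
        // 3 candidates are selected and the other reducer emits nothing.
        for (String u : urls) {
            String host = u.split("/")[2]; // crude host extraction, for illustration
            System.out.println(u + " -> partition " + partitionForHost(host, 2));
        }
    }
}
```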
Andrzej Bialecki wrote:
Not to mention that the code uses a local variable with the same name
and different type, to obscure the picture... I'll fix it. Thanks!
We were both too late... ;-)
r332371 | cutting | 2005-11-10
[EMAIL PROTECTED] wrote:
There is a small bug in the method Generator.Selector.reduce.
Now it:
...
while (values.hasNext() && ++count < limit) {
...
Must be:
...
while (values.hasNext() && ++count <= limit) {
...
Not to mention that the code uses a local variable with the same name
and different type
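The reported off-by-one is easy to reproduce in isolation: with a pre-increment inside the loop condition, `++count < limit` stops after `limit - 1` items instead of `limit`. A minimal sketch (the class and helper names are my own, only the loop condition mirrors the report):

```java
import java.util.Arrays;
import java.util.Iterator;

// Minimal reproduction of the off-by-one: "++count < limit" in the loop
// condition consumes one item fewer than intended; "<=" consumes limit items.
public class OffByOneSketch {
    static int consumed(Iterator<?> values, int limit, boolean fixed) {
        int count = 0;
        int taken = 0;
        while (values.hasNext() && (fixed ? ++count <= limit : ++count < limit)) {
            values.next();
            taken++;
        }
        return taken;
    }

    public static void main(String[] args) {
        Iterator<Integer> a = Arrays.asList(1, 2, 3, 4, 5).iterator();
        Iterator<Integer> b = Arrays.asList(1, 2, 3, 4, 5).iterator();
        System.out.println(consumed(a, 3, false)); // buggy: takes only 2
        System.out.println(consumed(b, 3, true));  // fixed: takes 3
    }
}
```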
We used Nutch for whole-web crawling.
In an infinite loop we run these tasks:
1) bin/nutch generate db -topN 1
2) bin/nutch fetch
3) bin/nutch updatedb db
4) bin/nutch analyze db
5) bin/nutch index
6) bin/nutch dedup segments dedup.tmp
After each iteration we produce new segment and ma