Please do not cross-post questions!
Check out the map-reduce branch in SVN. Map-reduce will do
everything you are looking for, and it works well for me.
Stefan
On 04.11.2005 at 14:32, Arsen Popovyan wrote:
At the moment we are using nutch-nightly (nutch-2005-07-20). We are not
MD5 duplicates, but URL
duplicates are currently handled elsewhere in the mapred branch.
What problems did you see?
Doug
Stefan Groschupf wrote:
Hi,
what is the status of the dedup tool in the map-reduce branch?
The javadoc mentions that the second part isn't implemented
[ http://issues.apache.org/jira/browse/NUTCH-114?page=all ]
Stefan Groschupf updated NUTCH-114:
---
Attachment: CrawlDbStatMapper.java
As discussed now with UTF8 keys and the text based output format.
getting number of urls and links from crawldb
Andrzej,
thanks for the hint, I will maybe have a look later today. :-)
Stefan
On 15.10.2005 at 08:23, Andrzej Bialecki wrote:
Stefan Groschupf wrote:
Michael,
I'm afraid segread doesn't exist in the 0.8
branch anymore.
I knew both methods, but with map-reduce
segread / readdb is on the way... it's actually easy to implement;
look at LinkDbReader for inspiration. If you have some time on your
hands, I'm pretty sure you could implement it... if not, I'll do it
at the beginning of next month.
Just using the MapFileOutputFormat and writing a simple
Hi,
what does this mean, and how do I fix it? :-o
051015 221418 Problem opening checksum file: java.io.IOException:
Cannot find filename /user/myuser/db/current/part-00012/.index.crc.
Ignoring.
It looks like it isn't critical at all, but I was wondering why this can
happen.
Thanks for any
Andrzej Bialecki wrote:
Stefan Groschupf wrote:
Michael,
I'm afraid segread doesn't exist in the 0.8
branch anymore.
I knew both methods, but with map-reduce the file
structures are different; that is why I was asking.
segread / readdb is on the way... it's
Oh interesting, the Apache mailing list system filters out
attachments. :-)
That makes sense, I will put everything into the issue tracker...
On 16.10.2005 at 04:42, Stefan Groschupf wrote:
Hi Nutch 0.8 geeks,
what do you think about the following solution?
As mentioned, we may later have a map-reduce
Hi,
is there any chance to read the statistics of the Nutch 0.8 crawl db,
or a trick to get an idea of how many pages have already been crawled?
Thanks for the hints.
Stefan
of Pages in text format,
Michael Ji,
--- Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi,
is there any chance to read the statistics of the
Nutch 0.8 crawl db,
or a trick to get an idea of how many pages
have already been crawled?
Thanks for the hints.
Stefan
Hi,
I tried to understand the jobtracker code.
Hmm, more than 1000 lines of code in just one class. :-( This makes
understanding the code very difficult.
Anyway, I'm missing a mechanism to reprocess hanging tasks. Maybe I
just didn't find the code, but I invested some time looking for it.
As the google
that crash, I
mean tasks that are 20 times slower on one machine than the tasks
on the other machines.
Stefan
On 10.10.2005 at 20:16, Doug Cutting wrote:
Stefan Groschupf wrote:
Did I miss the section in the jobtracker where this is done, or
would people be interested in me submitting a patch
Hi,
I noticed two problems, but wasn't able to find the cause yet.
Has anyone noticed the same problems and perhaps already knows the
cause?
First, I noticed that in case the local hard drive is full, the reduce
job crashes without any re-execution on another node.
Second, I
the tasktracker crashes when reconnecting to a new jobtracker.
-
Key: NUTCH-108
URL: http://issues.apache.org/jira/browse/NUTCH-108
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
[ http://issues.apache.org/jira/browse/NUTCH-99?page=all ]
Stefan Groschupf updated NUTCH-99:
--
Attachment: port_patch_03.txt
As discussed, the tasktracker now iterates until it finds a free port,
starting with a configurable port from nutch.
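A minimal sketch of the approach the patch describes (the class and method names here are illustrative, not the actual port_patch_03.txt code): starting from a configured base port, try to bind and walk upward until a bind succeeds.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Illustrative sketch only -- not the actual patch code. Starting at a
// configured base port, try each port in turn until a bind succeeds,
// instead of failing outright when the preferred port is taken.
public class PortFinder {
    public static int findFreePort(int startPort, int maxTries) throws IOException {
        for (int port = startPort; port < startPort + maxTries; port++) {
            try (ServerSocket probe = new ServerSocket(port)) {
                return port;            // bind succeeded, so this port is free
            } catch (IOException inUse) {
                // port already bound; try the next one
            }
        }
        throw new IOException("no free port found after " + maxTries + " tries");
    }

    public static void main(String[] args) throws IOException {
        // 49152 is the start of the dynamic/private port range
        System.out.println("free port: " + findFreePort(49152, 100));
    }
}
```

Making the base port configurable then just means reading `startPort` from the configuration instead of picking it at random.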
-
Key: NUTCH-108
URL: http://issues.apache.org/jira/browse/NUTCH-108
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
051008 213532 Lost connection to JobTracker [/
192.168.200.100:7020
Hi,
umbilical.done is called twice when a task is finished.
The map and reduce task implementations call done() in the last
line of their run methods (MapTask: 132, ReduceTask: 273).
But the tasktracker calls umbilical.done a second time in
line 585.
Is this a bug?
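If both call sites are intended, one defensive fix (a sketch only, with a hypothetical stand-in class, not the actual Nutch code) is to make the completion report idempotent so a second call is a harmless no-op:

```java
// Sketch of an idempotent done() guard; TaskStatus is a hypothetical
// stand-in, not a real Nutch class.
public class TaskStatus {
    private boolean done = false;

    public synchronized void done() {
        if (done) {
            return;              // second call (e.g. from the tasktracker) is a no-op
        }
        done = true;
        // ... report completion to the jobtracker exactly once ...
    }

    public synchronized boolean isDone() {
        return done;
    }

    public static void main(String[] args) {
        TaskStatus status = new TaskStatus();
        status.done();
        status.done();           // duplicate call, silently ignored
        System.out.println("done: " + status.isDone());
    }
}
```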
[
http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12331224 ]
Stefan Groschupf commented on NUTCH-99:
---
OK, makes sense.
Do you prefer command-line args for the ports for this 'let's search for a port'
code?
I personally would prefer
Hi,
with the latest mr branch I have a strange problem copying a file from
local to NDFS.
Has anyone managed to use the NDFS client with this branch in the last
few days?
The problem occurs when connecting to a datanode to get the next
free block:
cluster2:~/mr_nutch-0.8-dev myuser$
[ http://issues.apache.org/jira/browse/NUTCH-99?page=all ]
Stefan Groschupf updated NUTCH-99:
--
Attachment: port_patch_02.txt
I noticed there are no tests for the NDFS and map-reduce trackers, so the test
suite still ran after patching the sources. But my
[ http://issues.apache.org/jira/browse/NUTCH-99?page=all ]
Stefan Groschupf updated NUTCH-99:
--
Attachment: port_patch.txt
This patch makes the ports of the datanode and tasktracker configurable in
nutch-default.xml.
I changed as little code as possible
Paul,
I am thinking about the mapred branch and the case of a mapred
multiprocess run over one or more machines. In this case,
multiple tasktracker processes are created.
I'm not sure what you mean.
As far as I understand the code, there is only one tasktracker per machine.
why are the
Hi,
Nutch comes with its own RPC implementation that is very lightweight
and fast - much faster than JMS.
Besides that, the distribution of tasks is done via map-reduce, so there
is no need for JMS.
However, I heard that the Helix people plan to use JMS.
Greetings.
Stefan
On 27.09.2005 at 09:53
make datanode starting port configurable
Key: NUTCH-97
URL: http://issues.apache.org/jira/browse/NUTCH-97
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
Fix
Hi,
why are the taskReportPort and mapOutputPort randomly generated?
I cannot see any reason for that and wonder why we don't just make
them configurable as well.
I can understand that in some situations it is necessary to
reinitialize the tasktracker, but it can in any case use the same
Hi Paul,
my call stack says that actually no other classes use the tasktracker.
Besides that, the tasktracker could implement NutchConfigurable; then all
problems would be solved, since this is the IoC pattern.
Or am I overlooking something?
Stefan
On 27.09.2005 at 01:24, Paul Baclace wrote:
Stefan
Jack,
That is the max outlinks per HTML page.
All your example pages have fewer than 100 outlinks, right?!
Stefan
On 07.09.2005 at 18:43, Jack Tang wrote:
Hi All
Here is the db.max.outlinks.per.page property and its description in
nutch-default.xml
property
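The property definition is cut off above; as it appears in nutch-default.xml it typically looks like the following (the default value and description text here are a sketch from memory and may differ between versions):

```xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that will be processed
  for a page.</description>
</property>
```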
+1!
On 06.09.2005 at 11:41, Jérôme Charron wrote:
Since the plugins can specify dependencies on each other, it
raises an
administration problem.
For a Nutch administrator, it is not user-friendly to specify which
plugins
to activate/deactivate.
With plugin inter-dependencies, the
Hi,
There is a set of standalone crawlers available;
the coolest one from my point of view is crawler.archive.org
Stefan
On 05.09.2005 at 13:15, Camilo Abel Monreal wrote:
Hi :
I am trying to separate the Nutch crawler from the entire project. I need to
download the page to a file. Please, if someone
Ok. So, after a long private mail exchange with Stefan (thanks for
your time
and help stefan), it seems that these modifications are ok.
No! Thanks to you for fixing the problem! :-D
Cheers,
Stefan
In some cases, though, focused crawling requirements may require
extra data to be stored, which is not useful for whole-web, for
example, storing a url's parent and seed url and its depth
(essential for crawl scopes).
Sounds like metadata for a page. :)
Some time ago I submitted a patch to
Hi Jérôme,
I do not object against putting lucene-analyzers-1.9-rc1-dev.jar in
nutch core but I would like to give another option. I think it is
possible to create a plugin which contains and exports this library
and make other analysis plugin depend on it.
Yes, that is possible and sure..
Hi,
I do not object against putting lucene-analyzers-1.9-rc1-dev.jar in
nutch core but I would like to give another option. I think it is
possible to create a plugin which contains and exports this library
and make other analysis plugin depend on it. I am not an expert in
it but I think
Hi,
can someone please tell me what is the technical difference between
org.apache.nutch.io.Writable and java.io.Externalizable?
To me they look very similar, and Externalizable has been available
since JDK 1.1.
What do I miss?
Thanks for any hints.
Stefan
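One observable difference is the wire format: a Writable-style write() emits only the raw field bytes, while Externalizable goes through Java object serialization, which adds stream and class-descriptor framing (and reflection-based instantiation on read). A minimal sketch, with a hypothetical PageStats record and a hand-rolled SimpleWritable interface mirroring Writable's two methods:

```java
import java.io.*;

// SimpleWritable mirrors the two methods org.apache.nutch.io.Writable
// declares; PageStats is a hypothetical record used for illustration.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

class PageStats implements SimpleWritable, Externalizable {
    long urls;
    long links;

    public PageStats() {}                 // no-arg ctor: both styles need one
    PageStats(long urls, long links) { this.urls = urls; this.links = links; }

    // Writable style: raw field bytes only, no class name or header on the wire.
    public void write(DataOutput out) throws IOException {
        out.writeLong(urls);
        out.writeLong(links);
    }
    public void readFields(DataInput in) throws IOException {
        urls = in.readLong();
        links = in.readLong();
    }

    // Externalizable style: same field data, but the surrounding
    // ObjectOutputStream adds stream magic and a class descriptor.
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(urls);
        out.writeLong(links);
    }
    public void readExternal(ObjectInput in) throws IOException {
        urls = in.readLong();
        links = in.readLong();
    }
}

public class Main {
    public static void main(String[] args) throws Exception {
        PageStats stats = new PageStats(1000, 5000);

        // Writable round trip: exactly 16 bytes (two longs).
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        stats.write(new DataOutputStream(raw));

        // Externalizable round trip: field bytes plus serialization framing.
        ByteArrayOutputStream framed = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(framed);
        oos.writeObject(stats);
        oos.close();

        System.out.println("writable bytes=" + raw.size()
                + " externalizable bytes=" + framed.size());
    }
}
```

The Writable round trip is exactly 16 bytes, while the ObjectOutputStream version carries extra header bytes; that compactness, plus not shipping the class name on the wire, is the usual argument for a Writable-style format in RPC.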
What do others think?
I think RMI isn't a good idea; I wasted a lot of time with it. I
like the Nutch RPC very much.
However, I think usage of Externalizable is a good idea: first, it is a
very small change.
Second, many users use Nutch for very custom things, and usage of
Externalizable
try:
http://wiki.media-style.com/display/nutchDocu/Home
Stefan
On 04.08.2005 at 19:54, Nishant Chandra wrote:
Hi,
I am new to Nutch. Are there any articles/tutorials which explain the
internal workings of the crawler (crawl strategy), etc.?
Nishant
Hi Doug,
The slides from my talk yesterday at OSCON give some hints on how
to get started. We need a MapReduce tutorial.
http://wiki.apache.org/nutch/Presentations
Can you explain what this means: Page 20:
- Scheduling is bottleneck, not disk, network or CPU?
Thanks.
Stefan
Hi there,
does Nutch somehow cache DNS lookups?
I found this paper, and section 3.7 gives some very interesting
information.
We notice that our crawlers often crash after a set of unknown-host
exceptions.
We already have one dual-CPU box with a 1 Gbit network connection
running BIND.
So I
, if you dump 1
clients worth
of DNS traffic, they can break or not return results, so I made my own
internal DNS server cache; the machine, a quad Xeon with 4 GB RAM, uses
over 500 MB
of RAM just for caching the domains in memory!!!
-Jay
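The approach Jay describes can be sketched as a small in-process cache around the resolver (a minimal illustration; the class name and API here are hypothetical, not anyone's actual code):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an in-process DNS cache: resolved addresses are
// kept in memory so repeated lookups for the same host never hit the
// resolver again. A production cache would also need a TTL and negative
// caching for hosts that fail to resolve.
public class DnsCache {
    private final ConcurrentHashMap<String, InetAddress> cache =
            new ConcurrentHashMap<String, InetAddress>();

    public InetAddress lookup(String host) throws UnknownHostException {
        InetAddress cached = cache.get(host);
        if (cached != null) {
            return cached;                       // served from memory
        }
        InetAddress resolved = InetAddress.getByName(host);
        cache.put(host, resolved);
        return resolved;
    }

    public static void main(String[] args) throws UnknownHostException {
        DnsCache dns = new DnsCache();
        InetAddress first = dns.lookup("localhost");
        InetAddress second = dns.lookup("localhost");   // cache hit
        System.out.println("same object from cache: " + (first == second));
    }
}
```

Note that the JVM itself also caches successful lookups (controlled by the `networkaddress.cache.ttl` security property), so an explicit cache like this mainly buys control over memory use and eviction policy.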
- Original Message -
From: Stefan Groschupf [EMAIL
server to set up if you're a Windows person is
Windows 2000
Server or Windows 2003 Server; you just enable it and it runs.
There are
many DNS servers for Linux; most distributions come with one on CD.
Mac OS X
Server has it also.
- Original Message -
From: Stefan Groschupf [EMAIL PROTECTED
I'm sure it is a misspelling.
Stefan
On 20.07.2005 at 16:37, Erik Hatcher wrote:
ExtensionPoint.getExtentens() - is this intentional or a misspelling?
Erik
---
company:http://www.media-style.com
forum: