Re: nutch cluster questions.

2005-11-04 Thread Stefan Groschupf
Please do not cross-post questions! Check out the map reduce branch in the svn. The map reduce branch will do everything you are looking for, and it works well for me. Stefan On 04.11.2005 at 14:32, Arsen Popovyan wrote: At the moment we are using nutch-nightly (nutch-2005-07-20). We are not

Re: status dedub

2005-10-25 Thread Stefan Groschupf
md5 duplicates, but url duplicates are currently handled elsewhere in the mapred branch. What problems did you see? Doug Stefan Groschupf wrote: Hi, what is the status of the dedup tool in the mapreduce branch? The javadoc mentioned that the second part isn't implemented

[jira] Updated: (NUTCH-114) getting number of urls and links from crawldb

2005-10-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-114?page=all ] Stefan Groschupf updated NUTCH-114: Attachment: CrawlDbStatMapper.java As discussed, now with UTF8 keys and the text-based output format. getting number of urls and links from crawldb

Re: crawl db stats

2005-10-15 Thread Stefan Groschupf
Andrzej, thanks for the hint, I will have a look, maybe later today. :-) Stefan On 15.10.2005 at 08:23, Andrzej Bialecki wrote: Stefan Groschupf wrote: Michael, I'm afraid segread doesn't exist in the 0.8 branch anymore. I knew both methods, but with map reduce

Re: crawl db stats

2005-10-15 Thread Stefan Groschupf
segread / readdb is on the way... it's actually easy to implement, look at LinkDbReader for inspiration. If you have some time on your hands I'm pretty sure you could implement it... if not, I'll do it in the beginning of next month. Just using the MapFileOutputFormat and writing a simple

Problem opening checksum file

2005-10-15 Thread Stefan Groschupf
Hi, what does this mean and how can I fix it? :-o 051015 221418 Problem opening checksum file: java.io.IOException: Cannot find filename /user/myuser/db/current/part-00012/.index.crc. Ignoring. Looks like it isn't critical at all but I was wondering why this can happen. Thanks for any

patch: Re: crawl db stats

2005-10-15 Thread Stefan Groschupf
wrote Andrzej Bialecki: Stefan Groschupf wrote: Michael, I'm afraid segread doesn't exist in the 0.8 branch anymore. I knew both methods, but with map reduce the file structures are different, that is why I was asking. segread / readdb is on the way... it's

Re: patch: Re: crawl db stats

2005-10-15 Thread Stefan Groschupf
Oh interesting, the apache mailing list system filters out attachments. :-) That makes sense, I will put everything into the issue tracker... On 16.10.2005 at 04:42, Stefan Groschupf wrote: Hi nutch 0.8 geeks. What do you think about the following solution? As mentioned we may have later a map reduce

crawl db stats

2005-10-14 Thread Stefan Groschupf
Hi, is there any chance to read the statistics of the nutch 0.8 crawl db or a trick to get an idea of how many pages are already crawled? Thanks for the hints. Stefan

Re: crawl db stats

2005-10-14 Thread Stefan Groschupf
of Pages in text format, Michael Ji, --- Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, is there any chance to read the statistics of the nutch 0.8 crawl db or a trick to get an idea of how many pages are already crawled? Thanks for the hints. Stefan

reprocessing hanging tasks

2005-10-10 Thread Stefan Groschupf
Hi, I tried to understand the jobtracker code. Hmm, more than 1000 lines of code in just one class. :-( This makes understanding the code very difficult. Anyway, I'm missing a mechanism to reprocess hanging tasks. Maybe I just didn't find the code, but I invested some time looking for it. As the google

Re: reprocessing hanging tasks

2005-10-10 Thread Stefan Groschupf
that crash, I mean tasks that are 20 times slower on one machine than the other tasks on the other machines. Stefan On 10.10.2005 at 20:16, Doug Cutting wrote: Stefan Groschupf wrote: Did I miss the section in the jobtracker where this is done, or are people interested in me submitting a patch
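
A minimal sketch of how such slow tasks could be flagged, assuming the tracker knows the elapsed time of each running task and the durations of the tasks that already completed. The class `StragglerDetector`, its method names and the slowdown factor are hypothetical and do not come from the Nutch jobtracker; the idea is only to compare each running task against the median of the completed ones.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of the Nutch jobtracker. */
final class StragglerDetector {

    /** Median of the given durations; the list must not be empty. */
    static double median(List<Long> durationsMillis) {
        List<Long> sorted = new ArrayList<>(durationsMillis);
        Collections.sort(sorted);
        int mid = sorted.size() / 2;
        return sorted.size() % 2 == 1
                ? sorted.get(mid)
                : (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
    }

    /**
     * Returns the ids of running tasks whose elapsed time exceeds
     * slowdownFactor times the median duration of completed tasks.
     * Without any completed task there is no baseline, so nothing is flagged.
     */
    static List<String> findStragglers(Map<String, Long> runningElapsedMillis,
                                       List<Long> completedDurationsMillis,
                                       double slowdownFactor) {
        List<String> stragglers = new ArrayList<>();
        if (completedDurationsMillis.isEmpty()) {
            return stragglers;
        }
        double threshold = median(completedDurationsMillis) * slowdownFactor;
        for (Map.Entry<String, Long> task : runningElapsedMillis.entrySet()) {
            if (task.getValue() > threshold) {
                stragglers.add(task.getKey());
            }
        }
        Collections.sort(stragglers); // stable order for callers and logs
        return stragglers;
    }

    public static void main(String[] args) {
        Map<String, Long> running = new HashMap<>();
        running.put("task_a", 1_500L);
        running.put("task_b", 25_000L);
        List<Long> completed = new ArrayList<>();
        Collections.addAll(completed, 1_000L, 1_200L, 900L);
        System.out.println(findStragglers(running, completed, 20.0));
    }
}
```

A tracker could then reschedule the flagged tasks on another node and keep whichever copy finishes first; whether that fits the jobtracker design is left open here.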

mr: tasks crash tasks assign to old nodes

2005-10-09 Thread Stefan Groschupf
Hi, I noticed 2 problems, but wasn't able to find the source yet. Has someone maybe noticed the same problems and already knows the source? First, I noticed that if the local hard drive is full, the reduce job crashes without any re-execution on another node. Second I

[jira] Created: (NUTCH-108) tasktracker crashs when reconnecting to a new jobtracker.

2005-10-09 Thread Stefan Groschupf (JIRA)
tasktracker crashs when reconnecting to a new jobtracker. - Key: NUTCH-108 URL: http://issues.apache.org/jira/browse/NUTCH-108 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Stefan Groschupf

[jira] Updated: (NUTCH-99) ports are hardcoded or random

2005-10-09 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=all ] Stefan Groschupf updated NUTCH-99: Attachment: port_patch_03.txt As discussed, the tasktracker now iterates until it finds a free port, starting with a configurable port from nutch
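
The port search described in this update could look roughly like the following sketch. It is not the code from port_patch_03.txt; the class `FreePortFinder`, the method `bindFirstFree` and the `maxAttempts` limit are assumptions made for illustration, and only `java.net.ServerSocket` is a real API.

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Hypothetical helper, not taken from the NUTCH-99 patch. */
final class FreePortFinder {

    /**
     * Tries startPort, startPort + 1, ... and returns a socket bound to the
     * first port that is free. Gives up after maxAttempts ports.
     */
    static ServerSocket bindFirstFree(int startPort, int maxAttempts) throws IOException {
        IOException lastFailure = null;
        for (int i = 0; i < maxAttempts; i++) {
            int port = startPort + i;
            if (port > 65535) {
                break; // ran past the valid TCP port range
            }
            try {
                return new ServerSocket(port); // throws if the port is taken
            } catch (IOException e) {
                lastFailure = e; // remember why the last attempt failed
            }
        }
        throw new IOException("No free port found starting at " + startPort, lastFailure);
    }

    public static void main(String[] args) throws IOException {
        // 50040 stands in for a value read from the configuration (assumption).
        try (ServerSocket socket = bindFirstFree(50040, 100)) {
            System.out.println("Bound to port " + socket.getLocalPort());
        }
    }
}
```

Returning the bound socket instead of just the port number avoids a race in which another process grabs the port between the check and the bind.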

Re: [jira] Created: (NUTCH-108) tasktracker crashs when reconnecting to a new jobtracker.

2005-10-09 Thread Stefan Groschupf
. - Key: NUTCH-108 URL: http://issues.apache.org/jira/browse/NUTCH-108 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Critical 051008 213532 Lost connection to JobTracker [/ 192.168.200.100:7020

umbilical.done is called two times

2005-10-03 Thread Stefan Groschupf
Hi, umbilical.done is called twice when a task finishes. The map and reduce task implementations call done in the last line of their run methods (Maptask: 132, ReduceTask: 273). But the tasktracker calls umbilical.done a second time in line 585. Is this a bug?

[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-10-03 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12331224 ] Stefan Groschupf commented on NUTCH-99: OK, makes sense. Do you prefer command line args for the ports for this 'let's search for a port' code? Personally, I would prefer

strange problem with ndfs

2005-09-30 Thread Stefan Groschupf
Hi, with the latest mr branch I have a strange problem copying a file from local to the ndfs. Has anyone managed to use the ndfs client with this branch in the last few days? The problem occurs while connecting to a datanode to get the next free block: cluster2:~/mr_nutch-0.8-dev myuser$

[jira] Updated: (NUTCH-99) ports are hardcoded or random

2005-09-30 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=all ] Stefan Groschupf updated NUTCH-99: Attachment: port_patch_02.txt I noticed there are no tests for the ndfs and mapreduce trackers, so the test suite was still running after patching the sources. But my

[jira] Updated: (NUTCH-99) ports are hardcoded or random

2005-09-29 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=all ] Stefan Groschupf updated NUTCH-99: Attachment: port_patch.txt This patch makes the ports of the datanode and tasktracker configurable in nutch-default.xml. I changed as little code as possible

Re: why task tracker ports random?

2005-09-27 Thread Stefan Groschupf
Paul, I am thinking about the mapred branch and the case of a mapred multiprocess run over one or more machines. In this case, multiple tasktracker processes are created. I'm not sure what you mean. As far as I understand the code, there is only one tasktracker per machine. why are the

Re: jms

2005-09-27 Thread Stefan Groschupf
Hi, Nutch comes with its own rpc implementation that is very lightweight and fast - much faster than jms. Besides that, the distribution of tasks is done via map reduce, so there is no need for jms. However, I heard that the helix people plan to use jms. Greetings. Stefan On 27.09.2005 at 09:53

[jira] Created: (NUTCH-97) make datanode starting port configurable

2005-09-26 Thread Stefan Groschupf (JIRA)
make datanode starting port configurable Key: NUTCH-97 URL: http://issues.apache.org/jira/browse/NUTCH-97 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Minor Fix

why task tracker ports random?

2005-09-26 Thread Stefan Groschupf
Hi, why are the taskReportPort and mapOutputPort randomly generated? I cannot see any reason for that and wonder why we don't just make them configurable as well. I can understand that in some situations it is necessary to reinitialize the task tracker, but it can in any case use the same

Re: why task tracker ports random?

2005-09-26 Thread Stefan Groschupf
Hi Paul, my call stack says that actually no other classes are using the tasktracker. Besides that, the tasktracker could implement NutchConfigurable; then all problems would be solved, since this is the IoC pattern. Or am I overlooking something? Stefan On 27.09.2005 at 01:24, Paul Baclace wrote: Stefan

Re: db.max.outlinks.per.page is misunderstood?

2005-09-07 Thread Stefan Groschupf
Jack, That is max outlinks per html page. All your example pages have fewer than 100 outlinks, right?! Stefan On 07.09.2005 at 18:43, Jack Tang wrote: Hi All Here is the db.max.outlinks.per.page property and its description in nutch-default.xml property

Re: Plugins dependencies enhancement proposal

2005-09-06 Thread Stefan Groschupf
+1! On 06.09.2005 at 11:41, Jérôme Charron wrote: Since the plugins can specify dependencies on each other, it raises an administrator problem. For a Nutch administrator, it is not user-friendly to specify which plugins to activate/deactivate. With plugin inter-dependencies, the

Re: separate Crawler from nutch

2005-09-05 Thread Stefan Groschupf
Hi, there is a set of standalone crawlers available; the coolest one from my point of view is crawler.archive.org Stefan On 05.09.2005 at 13:15, Camilo Abel Monreal wrote: Hi: I am trying to separate the nutch crawler from the entire project. I need to download the page to a file. Please if someone

Re: MS related plugins refactoring

2005-09-05 Thread Stefan Groschupf
Ok. So, after a long private mail exchange with Stefan (thanks for your time and help stefan), it seems that these modifications are ok. No! Thanks to you for fixing the problem! :-D Cheers, Stefan

Re: To mapred or not

2005-09-01 Thread Stefan Groschupf
In some cases, though, focused crawling requirements may require extra data to be stored, which is not useful for whole-web, for example, storing a url's parent and seed url and its depth (essential for crawl scopes). Sounds like meta data for a page. :) Some time ago I submitted a patch to

Re: Analysis plugins and lucene-analyzers

2005-08-29 Thread Stefan Groschupf
Hi Jérôme, I do not object to putting lucene-analyzers-1.9-rc1-dev.jar in nutch core but I would like to give another option. I think it is possible to create a plugin which contains and exports this library and make the other analysis plugins depend on it. Yes, that is possible and sure..

Re: Analysis plugins and lucene-analyzers

2005-08-27 Thread Stefan Groschupf
Hi, I do not object to putting lucene-analyzers-1.9-rc1-dev.jar in nutch core but I would like to give another option. I think it is possible to create a plugin which contains and exports this library and make the other analysis plugins depend on it. I am not an expert in it but I think

Writable vs Externalizable

2005-08-08 Thread Stefan Groschupf
Hi, can someone please tell me what the technical difference is between org.apache.nutch.io.Writable and java.io.Externalizable? To me they look very similar, and Externalizable has been available since jdk 1.1. What am I missing? Thanks for any hints. Stefan
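
To make the comparison concrete, here is a small sketch that puts both styles side by side. `SimpleWritable` is a local stand-in with the method shape of a Writable-style interface (an assumption made so the example compiles without Nutch on the classpath), and `PageCount` is a made-up value class. One observable difference is that the Writable style writes only the raw fields to a `DataOutput`, while `Externalizable` goes through `ObjectOutputStream`, which adds a stream header and a class descriptor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

final class WritableVsExternalizable {

    /** Local stand-in for a Writable-style interface (assumption, not the Nutch class). */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Made-up value class that supports both styles with the same field layout. */
    static final class PageCount implements SimpleWritable, Externalizable {
        String url = "";
        int count;

        public PageCount() { } // Externalizable needs a public no-arg constructor

        PageCount(String url, int count) { this.url = url; this.count = count; }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeInt(count);
        }

        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            count = in.readInt();
        }

        // ObjectOutput extends DataOutput, so the field code can be reused.
        public void writeExternal(ObjectOutput out) throws IOException { write(out); }

        public void readExternal(ObjectInput in) throws IOException { readFields(in); }
    }

    static byte[] viaWritable(PageCount p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            p.write(out); // raw fields only, no class information
        }
        return bytes.toByteArray();
    }

    static PageCount readWritable(byte[] data) throws IOException {
        PageCount p = new PageCount(); // the caller must already know the type
        p.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        return p;
    }

    static byte[] viaExternalizable(PageCount p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p); // stream header + class descriptor + fields
        }
        return bytes.toByteArray();
    }

    static PageCount readExternalizable(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (PageCount) in.readObject(); // type is recovered from the stream
        }
    }

    public static void main(String[] args) throws Exception {
        PageCount p = new PageCount("http://a/", 7);
        System.out.println("Writable-style bytes:  " + viaWritable(p).length);
        System.out.println("Externalizable bytes:  " + viaExternalizable(p).length);
    }
}
```

For many small records the per-object overhead of the object stream adds up, which is one possible reason to keep a leaner interface; the thread itself does not settle the question.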

Re: Writable vs Externalizable

2005-08-08 Thread Stefan Groschupf
What do others think? I think RMI isn't a good idea. I wasted a lot of time with it. I like the nutch rpc very much. However, I think using Externalizable is a good idea: first, it is a very small change. Second, many users use nutch for very custom things and usage of Externalizable

Re: Documentation

2005-08-04 Thread Stefan Groschupf
try: http://wiki.media-style.com/display/nutchDocu/Home Stefan On 04.08.2005 at 19:54, Nishant Chandra wrote: Hi, I am new to nutch. Are there any articles/tutorials which explain the internal workings of the crawler (crawl strategy) etc. Nishant

Re: near-term plan

2005-08-04 Thread Stefan Groschupf
Hi Doug, The slides from my talk yesterday at OSCON give some hints on how to get started. We need a MapReduce tutorial. http://wiki.apache.org/nutch/Presentations Can you explain what this means: Page 20: - scheduling is bottleneck, not disk, network or CPU? Thanks. Stefan

dns lookup cache?

2005-08-03 Thread Stefan Groschupf
Hi there, does nutch cache dns lookups in any way? I found this paper, and section 3.7 gives some very interesting information. We noticed that our crawlers often crash after a series of unknown host exceptions. We already have one dual cpu box with a 1Gbit network connection running BIND. So I
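
As an illustration of what an in-process DNS cache could look like, here is a minimal sketch. It is not Nutch code: the class `DnsCache`, the `Resolver` interface and the TTL parameter are hypothetical names chosen for this example. Note that the JVM also keeps its own cache of `InetAddress` lookups, controlled by the `networkaddress.cache.ttl` security property, so a cache like this mainly helps when more control over expiry is wanted.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical TTL cache in front of a DNS resolver; not part of Nutch. */
final class DnsCache {

    /** Pluggable lookup so the cache can be exercised without network access. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private static final class Entry {
        final InetAddress address;
        final long expiresAtMillis;

        Entry(InetAddress address, long expiresAtMillis) {
            this.address = address;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final Resolver resolver;
    private final long ttlMillis;

    DnsCache(Resolver resolver, long ttlMillis) {
        this.resolver = resolver;
        this.ttlMillis = ttlMillis;
    }

    /** Returns a cached address if it is still fresh, otherwise resolves and stores it. */
    InetAddress lookup(String host) throws UnknownHostException {
        String key = host.toLowerCase(Locale.ROOT); // host names are case-insensitive
        long now = System.currentTimeMillis();
        Entry cached = cache.get(key);
        if (cached != null && cached.expiresAtMillis > now) {
            return cached.address;
        }
        InetAddress resolved = resolver.resolve(key); // failures are not cached here
        cache.put(key, new Entry(resolved, now + ttlMillis));
        return resolved;
    }

    public static void main(String[] args) throws UnknownHostException {
        // InetAddress.getByName performs the real lookup; ten minutes is an arbitrary TTL.
        DnsCache cache = new DnsCache(InetAddress::getByName, 10 * 60 * 1000L);
        System.out.println(cache.lookup("localhost"));
        System.out.println(cache.lookup("LOCALHOST")); // served from the cache
    }
}
```

Two threads asking for the same uncached host may both trigger a lookup; for a crawler that is usually harmless, and it keeps the sketch free of locking.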

Re: dns lookup cache?

2005-08-03 Thread Stefan Groschupf
, if you dump 1 clients worth of dns traffic they can break or not return results so I made my own internal dns server cache, the machine a quad xeon 4gb ram uses over 500mb of ram just for caching of the domains in memory!!! -Jay - Original Message - From: Stefan Groschupf [EMAIL

Re: dns lookup cache?

2005-08-03 Thread Stefan Groschupf
server to set up if you're a windows person is windows 2000 server or windows 2003 server, you just enable it and it runs, there are many dns servers for linux, most distributions come with it on cd, mac osx server has it also. - Original Message - From: Stefan Groschupf [EMAIL PROTECTED

Re: API misspelling?

2005-07-20 Thread Stefan Groschupf
I'm sure it is a misspelling. Stefan On 20.07.2005 at 16:37, Erik Hatcher wrote: ExtensionPoint.getExtentens() - is this intentional or a misspelling? Erik --- company:http://www.media-style.com forum:
