Tutorial

2005-08-08 Thread Piotr Kosiorowski
Hello, Some time ago someone mentioned on the list a problem with nutch tutorial (I cannot find this email now). I have checked it today and he/she was right. If you follow the nutch Intranet Crawling tutorial you will end up with not very interesting index. This is because it recommends users to

Re: Tutorial

2005-08-08 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: I can commit such changes for 0.7 release (it means today) if I got positive feedback from other committers. +1 -- Best regards, Andrzej Bialecki. Information Retrieval, Semantic

Re: luke??

2005-08-08 Thread Jay Pound
I got it to work now. It wasn't selecting the directory I had chosen, so I typed it in and it works fine. BTW, very cool tool. -J - Original Message - From: Fredrik Andersson [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Sunday, August 07, 2005 6:16 PM Subject: Re: luke?? That's

Re: Ignore external links from crawled domains

2005-08-08 Thread Ken Krugler
A very basic facility seems to be missing in Nutch. If I have a list of 2000 URLs in the Nutch DB and want to ignore external links, I have to build a regex filter covering the thousands of different domains I want to crawl. There is no parameter to crawl only those domains and ignore external links. At these
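To make the request above concrete, here is a minimal sketch of the behaviour being asked for: derive the set of allowed hosts from the seed URL list once, then accept only links whose host is in that set. `SeedHostFilter`, `hostOf` and `accept` are hypothetical names chosen for illustration; this is not a Nutch class or configuration option, and it matches on exact host names only.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Nutch): accepts only URLs whose host
// matches the host of one of the seed URLs.
class SeedHostFilter {
    private final Set<String> allowedHosts = new HashSet<>();

    SeedHostFilter(List<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host != null) {
                allowedHosts.add(host);
            }
        }
    }

    // Returns the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: treat as having no host
        }
    }

    // True only for links that stay on one of the seed hosts.
    boolean accept(String url) {
        String host = hostOf(url);
        return host != null && allowedHosts.contains(host);
    }

    public static void main(String[] args) {
        SeedHostFilter filter = new SeedHostFilter(List.of(
            "http://lucene.apache.org/nutch/",
            "http://www.example.org/"));
        System.out.println(filter.accept("http://lucene.apache.org/nutch/tutorial.html")); // true
        System.out.println(filter.accept("http://other.example.com/page.html"));           // false
    }
}
```

A set lookup per link keeps the cost independent of the number of seed hosts, which is the point of the complaint about maintaining thousands of regex lines by hand.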

NUTCH 79 Fault tolerant searching.

2005-08-08 Thread Piotr Kosiorowski
Hello, I just created an issue in JIRA http://issues.apache.org/jira/browse/NUTCH-79 containing the code for fault tolerant searching. I think it is too late to include it in 0.7 release but I would wait for comments and test it in the meantime. I would like to commit it when release would be

Re: JIRA access

2005-08-08 Thread Doug Cutting
Piotr Kosiorowski wrote: Looking around in JIRA I found out I cannot resolve an issue. I am not sure how it works but I suspect I lack some rights to do so. Am I right? I have added you to the nutch-developers Jira group. Now you should be able to resolve issues, etc. Doug

Re: JIRA access

2005-08-08 Thread Piotr Kosiorowski
Thanks. It works. Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: Looking around in JIRA I found out I cannot resolve an issue. I am not sure how it works but I suspect I lack some rights to do so. Am I right? I have added you to the nutch-developers Jira group. Now you should be

Re: Nutch website deployment

2005-08-08 Thread Doug Cutting
Piotr Kosiorowski wrote: So I have installed forrest and modified src/site/src/documentation/content/xdocs. Then I ran 'forrest', and it generated content in src/site/build/site. And now the questions: Should I copy src/site/build/site to site and commit it? Yes. I'm impressed that you got

Re: regex-url filter

2005-08-08 Thread Jay Pound
Is there any way to filter results to English via search, so I can set up a multi-language search? I thought I saw somewhere that you could put something into the HTML form, a switch sent while submitting the form, that would use a plugin to filter the results. I know I had seen some benchmarks on

Re: svn commit: r230867 - /lucene/nutch/trunk/conf/crawl-urlfilter.txt.template

2005-08-08 Thread Piotr Kosiorowski
No problem for me. I have just run the test crawl on http://lucene.apache.org/nutch as described in the new tutorial, and a lot of PDF and PNG files were causing big exceptions and stack traces in the log. I thought that people (usually using Nutch for the first time) might think that they did something

Re: ndfs problem needs fix

2005-08-08 Thread Doug Cutting
Jay Pound wrote: 1.) we need to split up chunks of data into sub-folders so as not to hit the filesystem's limit on the number of files in a single directory, like the way squid splits up its data into directories. I agree. I am currently using reiser with NDFS so this is
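The squid-style layout suggested above can be sketched as follows: use a few bits of a block identifier to pick a two-level subdirectory, so that no single directory grows without bound. This is only an illustration under stated assumptions. `BlockDirLayout`, `pathFor`, the numeric block ID, the `blk_` file prefix and the 256 x 256 fan-out are all hypothetical and are not taken from the NDFS code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical layout helper (not NDFS code): spreads block files over
// 256 x 256 subdirectories chosen from the low 16 bits of the block ID.
class BlockDirLayout {

    static Path pathFor(Path root, long blockId) {
        int level1 = (int) (blockId & 0xFF);          // lowest byte picks the first level
        int level2 = (int) ((blockId >>> 8) & 0xFF);  // next byte picks the second level
        return root.resolve(String.format("%02x", level1))
                   .resolve(String.format("%02x", level2))
                   .resolve("blk_" + blockId);        // assumed file naming scheme
    }

    public static void main(String[] args) {
        Path root = Paths.get("data");
        // Block IDs that differ in their low bits land in different directories.
        System.out.println(pathFor(root, 0x1234ABL));
        System.out.println(pathFor(root, 0x1234ACL));
    }
}
```

Using the low bits assumes block IDs are roughly uniformly distributed; with sequential or clustered IDs a hash of the ID would spread files more evenly.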

User agent string

2005-08-08 Thread Piotr Kosiorowski
Hello, We should probably change user agent string in nutch-default.xml to point to Apache site. The only question is http.agent.version - should we set it to 0.07 for release and 0.08-dev for future work? I do not know how it was used previously. Current values: property

Re: User agent string

2005-08-08 Thread Doug Cutting
+1 Piotr Kosiorowski wrote: Hello, We should probably change user agent string in nutch-default.xml to point to Apache site. The only question is http.agent.version - should we set it to 0.07 for release and 0.08-dev for future work? I do not know how it was used previously. Current

Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-08 Thread Doug Cutting
[EMAIL PROTECTED] wrote: - <value>http://www.nutch.org/docs/en/bot.html</value> + <value>http://lucene.apache.org/nutch/bot.html</value> I think this should now be: http://lucene.apache.org/nutch/bot.html The docs/en pages have mostly been reduced to the about page, whose translations I hate to

Writable vs Externalizable

2005-08-08 Thread Stefan Groschupf
Hi, can someone please tell me what is the technical difference between org.apache.nutch.io.Writable and java.io.Externalizable? For me that looks very similar and Externalizable is available since jdk 1.1. What do I miss? Thanks for any hints. Stefan

Re: Writable vs Externalizable

2005-08-08 Thread Doug Cutting
Stefan Groschupf wrote: can someone please tell me what is the technical difference between org.apache.nutch.io.Writable and java.io.Externalizable? For me that looks very similar and Externalizable is available since jdk 1.1. What do I miss? You don't miss much! I avoided using Java's
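To make the comparison in this thread concrete, here is a small sketch of one class that supports both styles. The local `Writable` interface is a stand-in that assumes the two-method shape of org.apache.nutch.io.Writable (`write(DataOutput)` and `readFields(DataInput)`); `PageScore` and its helper methods are hypothetical names used only for illustration. Because `ObjectOutput` extends `DataOutput` and `ObjectInput` extends `DataInput`, the Externalizable methods can simply delegate.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical value class supporting both serialization styles.
class PageScore implements Writable, Externalizable {
    String url = "";
    float score;

    public PageScore() {}  // Externalizable requires a public no-arg constructor

    PageScore(String url, float score) {
        this.url = url;
        this.score = score;
    }

    // Writable style: the caller supplies a plain DataOutput / DataInput.
    @Override public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeFloat(score);
    }

    @Override public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        score = in.readFloat();
    }

    // Externalizable style: ObjectOutput/ObjectInput extend the Data* interfaces.
    @Override public void writeExternal(ObjectOutput out) throws IOException {
        write(out);
    }

    @Override public void readExternal(ObjectInput in) throws IOException {
        readFields(in);
    }

    static byte[] toWritableBytes(PageScore p) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            p.write(out);
        }
        return buf.toByteArray();
    }

    static PageScore fromWritableBytes(byte[] bytes) throws IOException {
        PageScore p = new PageScore();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            p.readFields(in);
        }
        return p;
    }

    static byte[] toExternalBytes(PageScore p) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(p);  // also writes a stream header and class descriptor
        }
        return buf.toByteArray();
    }

    static PageScore fromExternalBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (PageScore) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        PageScore original = new PageScore("http://lucene.apache.org/nutch/", 1.5f);
        PageScore viaWritable = fromWritableBytes(toWritableBytes(original));
        PageScore viaExternal = fromExternalBytes(toExternalBytes(original));
        System.out.println(viaWritable.url + " " + viaWritable.score);
        System.out.println(viaExternal.url + " " + viaExternal.score);
    }
}

// Local stand-in; assumed to mirror the methods of org.apache.nutch.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
```

One visible difference in this sketch is who creates the object on the reading side: with the Writable-style methods the caller constructs the instance and fills it in, while the object stream records the class name and instantiates it itself.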

Re: Writable vs Externalizable

2005-08-08 Thread Stefan Groschupf
What do others think? I think RMI isn't a good idea. I wasted a lot of time with it. I like the Nutch RPC very much. However, I think usage of Externalizable is a good idea: first, it is a very small change. Second, many users use Nutch for very custom things and usage of Externalizable