Re: mapred patch for improved error message and some javadoc comments

2005-09-19 Thread Doug Cutting
Paul Baclace wrote: Here is a patch for improving the error message that is displayed when an intranet crawl commandline has a file instead of a directory of files containing URLs. I have committed this to the mapred branch. Thanks, Paul! Doug

Re: use nutch file system independence ...

2005-09-18 Thread Doug Cutting
NDFS is not recommended in 0.7. The version of NDFS in the mapred branch is much improved. Note however that the mapred branch is substantially different than 0.7 and is still incomplete. Doug Transbuerg Tian wrote: hi, all friends, I download nutch0.7 ,and want use ndfs independence.

[jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results

2005-09-15 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12329474 ] Doug Cutting commented on NUTCH-92: --- A minor detail: In Searcher, instead of int[] getDocFreqs(Term[]); The new method will probably have to be something like public

Re: merge mapred to trunk

2005-09-15 Thread Doug Cutting
I will postpone the merge of the mapred branch into trunk until I have a chance to (a) add some MapReduce documentation; and (b) implement MapReduce-based dedup. Doug Doug Cutting wrote: Currently we have three versions of nutch: trunk, 0.7 and mapred. This increases the chances

Re: Whole-web crawling with the mapreduce branch

2005-09-15 Thread Doug Cutting
For now, look at the source for crawl/Crawl.java. I'll try to add some documentation ASAP. Doug Steffen Viken Valvåg wrote: Hi, I'm playing around with the mapreduce branch, and got it working for a simple intranet crawl by following the nutch tutorial on

Re: Event queues vs threads

2005-09-01 Thread Doug Cutting
Kelvin Tan wrote: Each of these stages will be handled in its own thread (except for HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers performance

Re: [Nutch Wiki] Update of Committer's Rules by AndrzejBialecki

2005-08-31 Thread Doug Cutting
Apache Wiki wrote: 1. The SVN repository consists of the following areas: a. '''trunk''' [ ... ] a. '''Release-x.x''' branches [ ... ] This should also mention tags, fixed versions of the code where no development occurs. I also would prefer that tag names and branch names are distinct,

Re: Automating workflow using ndfs

2005-08-31 Thread Doug Cutting
I assume that in most NDFS-based configurations the production search system will not run out of NDFS. Rather, indexes will be created offline for a deployment (i.e., merging things to create an index per search node), then copied out of NDFS to the local filesystem on a production search

Re: merge mapred to trunk

2005-08-31 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I, too, am looking forward to this, but I am wondering what that will do to Kelvin Tan's recent contribution, especially since I saw that both MapReduce and Kelvin's code change how FetchListEntry works. If merging mapred to trunk means losing Kelvin's changes, then I

Re: [mapred] Possible bug, static primatives holding config values?

2005-08-30 Thread Doug Cutting
Jeremy Bensley (sent by Nabble.com) wrote: I have been experimenting with MapReduce to perform some distributed tasks aside from the normal fetch/index routine of Nutch, and overall have had much success. I'm glad to hear this! Today I have been experimenting with running extended duration

Re: svn.apache.org down?

2005-08-19 Thread Doug Cutting
Jérôme Charron wrote: svn.apache.org down, or the problem is on my side? A good way to answer this is to look at: http://monitoring.apache.org/status/ It looks like SVN is currently up. And it works for me too. Doug

Re: Release 0.7

2005-08-16 Thread Doug Cutting
Piotr Kosiorowski wrote: Is anyone working on preparing the release? I am not. If not I can spend some time on it in an hour or so. +1 Thanks, Doug

Re: Slow Results

2005-08-16 Thread Doug Cutting
What API are you using to get hits, NutchBean or OpenSearchServlet? If you're using OpenSearchServlet, then, with 1000 hits, most of your time is probably spent constructing summaries. Do you need the summaries? If not, use NutchBean instead, or modify OpenSearchServlet to not generate

Re: Release 0.7 problem

2005-08-16 Thread Doug Cutting
Piotr Kosiorowski wrote: After making a tar I was trying to go through crawl tutorial. - tar xvfz nutch-0.7.tar.gz bin/nutch - is not executable (and nutch-daemon.sh too). It is strange nobody reported it so far so it may still be my fault. No, it looks like a problem with ant's tar task,

Re: mapred

2005-08-15 Thread Doug Cutting
Jay Pound wrote: is the org.apache.nutch.crawl package a part of the nightly builds? No. Nightly builds are from trunk. The mapred code is in a separate branch in subversion. After the 0.7 release, when the mapred branch is folded into trunk, then it will be in nightly builds. Until then

Re: MapRed - Injector - urlDir - Format?

2005-08-15 Thread Doug Cutting
Fuad Efendi wrote: Which parameter should I pass to Crawl? It should be directory containing smth. in which format? As before, inject takes a flat text file of urls, one per line. If you wish to inject DMOZ urls, there is now a utility main() that will convert the DMOZ file to such a file.
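
The input format described above (a flat text file with one URL per line) can be illustrated with a small standalone sketch. This is not Nutch's injector code: the class name UrlListFile and its methods are hypothetical, and skipping blank lines is an assumption made here for convenience.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: parses a flat "one URL per line" file.
public class UrlListFile {

    /** Keeps one URL per line; skipping blank lines is an assumption of this sketch. */
    static List<String> parseUrls(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    /** Reads a UTF-8 text file and returns the URLs it lists. */
    static List<String> readUrls(Path file) {
        try {
            return parseUrls(Files.readAllLines(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small seed file, then read it back.
        Path file = Files.createTempFile("urls", ".txt");
        Files.write(file, List.of("http://lucene.apache.org/nutch/", "", "http://www.apache.org/"));
        System.out.println(readUrls(file));
    }
}
```

A file in this shape is what would be handed to the inject step; how the mapred branch locates it (single file versus a directory of such files) is not shown here.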

Re: Nutch versions - Was: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-10 Thread Doug Cutting
Piotr Kosiorowski wrote: I think we all refer to 0.7 as next number (and 0.6 as current) so nutch-default.xml contains wrong format. In fact it should still contain -dev suffix. To make undocumented convention documented I would also like to suggest naming releases with X.Y format and naming

Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Doug Cutting
Piotr Kosiorowski wrote: I read your email ten times and still I am not sure what the problem is. The problem is with me. Doug Cutting wrote: [EMAIL PROTECTED] wrote: - <value>http://www.nutch.org/docs/en/bot.html</value> + <value>http://lucene.apache.org/nutch/bot.html</value> I clicked

Re: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Doug Cutting
Piotr Kosiorowski wrote: Will do it tomorrow - I wanted to put down a kind of release checklist in Wiki - starting with where to change numbers. But would like to cover also release howto - but in fact I am not sure how to make a release yet. But will try to gather this information. A

Re: JIRA access

2005-08-08 Thread Doug Cutting
Piotr Kosiorowski wrote: Looking around in JIRA I found out I cannot resolve an issue. I am not sure how it works but I suspect I lack some rights to do so. Am I right? I have added you to the nutch-developers Jira group. Now you should be able to resolve issues, etc. Doug

Re: Nutch website deployment

2005-08-08 Thread Doug Cutting
Piotr Kosiorowski wrote: So I have installed forrest and modified src/site/src/documentation/content/xdocs. Then ran 'forrest'. And it generated content in src/site/build/site. And now the questions: Should I copy src/site/build/site to site and commit it? Yes. I'm impressed that you got

Re: ndfs problem needs fix

2005-08-08 Thread Doug Cutting
Jay Pound wrote: 1.) we need to split up chunks of data into sub-folders as not to run the filesystem out of its physical limitations of concurrent files in a single directory, like the way squid splits up its data into directories. I agree. I am currently using reiser with NDFS so this is
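
The sub-folder idea raised above can be sketched as follows. This is only an illustration of spreading files over nested directories so that no single directory grows too large; it is not NDFS code. The class name BlockPaths, the "blk_" file prefix, the two-level layout and the 256-way fan-out per level are all assumptions of this sketch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical layout helper, not part of NDFS.
public class BlockPaths {

    /**
     * Maps a block id to root/xx/yy/blk_<id>, where xx and yy are two
     * hex-encoded bytes taken from the id's hash. With 256 buckets per
     * level, blocks are spread over up to 65536 leaf directories.
     */
    static Path pathFor(Path root, long blockId) {
        int hash = Long.hashCode(blockId);
        int first = (hash >>> 8) & 0xFF;   // first-level bucket
        int second = hash & 0xFF;          // second-level bucket
        return root.resolve(String.format("%02x", first))
                   .resolve(String.format("%02x", second))
                   .resolve("blk_" + blockId);
    }

    public static void main(String[] args) {
        System.out.println(pathFor(Paths.get("data"), 4660L));
    }
}
```

Because the directory is derived from the id alone, a block can be located again without scanning or keeping a separate index.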

Re: User agent string

2005-08-08 Thread Doug Cutting
+1 Piotr Kosiorowski wrote: Hello, We should probably change user agent string in nutch-default.xml to point to Apache site. The only question is http.agent.version - should we set it to 0.07 for release and 0.08-dev for future work? I do not know how it was used previously. Current

Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-08 Thread Doug Cutting
[EMAIL PROTECTED] wrote: - <value>http://www.nutch.org/docs/en/bot.html</value> + <value>http://lucene.apache.org/nutch/bot.html</value> I think this should now be: http://lucene.apache.org/nutch/bot.html The docs/en pages have mostly been reduced to the about page, whose translations I hate to

Re: Writable vs Externalizable

2005-08-08 Thread Doug Cutting
Stefan Groschupf wrote: can someone please tell me what is the technical difference between org.apache.nutch.io.Writable and java.io.Externalizable? For me that looks very similar and Externalizable is available since jdk 1.1. What do I miss? You don't miss much! I avoided using Java's
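
For readers comparing the two interfaces, here is a minimal sketch of the Writable style of serialization: an object writes its own fields to a DataOutput and reads them back from a DataInput, whereas java.io.Externalizable uses writeExternal(ObjectOutput) and readExternal(ObjectInput). The interface SimpleWritable and the class UrlScore below are hypothetical stand-ins defined only for this example, assuming a write/readFields method pair; the sketch does not reproduce Nutch's own classes or the reasons given in the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableSketch {

    /** Hypothetical stand-in for a Writable-style interface. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Example value type: only its two fields are serialized, no class metadata. */
    static class UrlScore implements SimpleWritable {
        String url = "";
        float score;

        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeFloat(score);
        }

        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            score = in.readFloat();
        }
    }

    static byte[] toBytes(SimpleWritable w) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            w.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void fromBytes(SimpleWritable w, byte[] bytes) {
        try {
            w.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        UrlScore original = new UrlScore();
        original.url = "http://lucene.apache.org/nutch/";
        original.score = 0.5f;

        UrlScore copy = new UrlScore();          // an existing instance is refilled
        fromBytes(copy, toBytes(original));
        System.out.println(copy.url + " " + copy.score);
    }
}
```

Note that readFields fills an already constructed object, so one instance can be reused across many records.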

Re: near-term plan

2005-08-04 Thread Doug Cutting
Stefan Groschupf wrote: http://wiki.apache.org/nutch/Presentations Can you explain what this means: Page 20: - Scheduling is bottleneck, not disk, network or CPU? I mean that neither the CPUs, disks nor network are at 100% of capacity. Disks are running around 50% busy, CPUs a bit higher, and

Re: near-term plan

2005-08-04 Thread Doug Cutting
Jay Pound wrote: Doug, I also ran into this when I was testing ndfs: the system would have to wait for the namenode to tell the datanodes what data to receive and which data to replicate. When did you test this? Which version of Nutch? How many nodes? My benchmark results from just a few days

Re: 0.7-dev, the search scoring

2005-07-28 Thread Doug Cutting
Fredrik Andersson wrote: I just ported a lot of old 0.6 code to 0.7-dev/mapred. Lots of stuff has changed I see! One thing I can't quite grasp though, is why the Hit.getScore() has been removed in favour of the TopDocs-thingie instead? Hit.getScore() was generalized to Hit.getSortValue() in

Re: hits.getTotal()

2005-07-07 Thread Doug Cutting
Ilia S. Yatsenko wrote: Why does hits.getTotal() ignore hitsPerSite? hits.getTotal() always returns the total number of hits, regardless of site. hitsPerSite is a filter on hits as they are displayed. This is the way Google and Yahoo handle this too. Search for NutchAnalysis there. If you look
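
The distinction drawn above (a total that counts every match versus a per-site cap applied only to what is displayed) can be shown with a small sketch. The class HitsPerSiteSketch, its Hit type and the limitPerSite method are hypothetical names for illustration and do not reproduce Nutch's search API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Nutch code.
public class HitsPerSiteSketch {

    static class Hit {
        final String site;
        final String url;

        Hit(String site, String url) {
            this.site = site;
            this.url = url;
        }
    }

    /** Keeps at most hitsPerSite hits per site, preserving ranking order. */
    static List<Hit> limitPerSite(List<Hit> hits, int hitsPerSite) {
        Map<String, Integer> shownPerSite = new HashMap<>();
        List<Hit> displayed = new ArrayList<>();
        for (Hit hit : hits) {
            int shown = shownPerSite.getOrDefault(hit.site, 0);
            if (shown < hitsPerSite) {
                displayed.add(hit);
                shownPerSite.put(hit.site, shown + 1);
            }
        }
        return displayed;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a.example", "http://a.example/1"));
        hits.add(new Hit("a.example", "http://a.example/2"));
        hits.add(new Hit("a.example", "http://a.example/3"));
        hits.add(new Hit("b.example", "http://b.example/1"));

        int total = hits.size();                        // counts every match
        List<Hit> displayed = limitPerSite(hits, 2);    // display-time filter only
        System.out.println("total=" + total + " displayed=" + displayed.size());
    }
}
```

In this sketch the total stays at four while only three hits are displayed, which mirrors why a total can exceed the number of results a user can page through.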
