Nutch returns irrelevant site
Hi, I'm currently setting up a Nutch search engine that searches travel websites. It works quite well but sometimes returns odd results. One good example: one of the 100 or so sites I've asked it to crawl is http://www.hfholidays.co.uk/ . This site is mainly about walking holidays and has many pages containing the word "walking", so when I type "walking" into Nutch I'd expect it to turn up. However, the first result I get back for the keyword "walking" is http://www.hfholidays.co.uk/email.asp , and that page doesn't contain the word "walking" anywhere. Could someone please explain whether this is a bug or just the way Nutch works? I've got an idea of how Google works; if Nutch works in a similar fashion, does this page appear because it is linked from many pages containing the word "walking"? Thanks, Aled
Re: ad feed for nutch
phpAdsNew is OK, but not easy to integrate with a keyword-based system such as search. I've used Inclick before with moderate success; it was under heavy development at the time, however the developers seem to have a strong base to work from. In my experience it's not affordable to really do your own PPC and try to compete. Backfill with Google-specific sites, or establish a mutually beneficial relationship with a 2nd/3rd-tier PPC engine that will co-market with you. -byron

--- Thomas Delnoij [EMAIL PROTECTED] wrote: It should be fairly easy to integrate PhpAdsNew with Nutch: http://phpadsnew.com/. Regards, Thomas

On 12/7/05, Greg Cohen [EMAIL PROTECTED] wrote: Glenn, I'm trying to put together a project that will also require ad serving, but I want it to be open source and give greater transparency to the advertisers than they get today with Google and Overture. If you start developing one, were you thinking of making this an open source project? Thanks. -greg

-----Original Message----- From: Insurance Squared Inc. [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 06, 2005 3:37 AM To: nutch-user@lucene.apache.org Subject: ad feed for nutch

Has anyone had any luck with advertising/ad management systems being integrated into Nutch? Not just something for the owner to admin ads, but to allow external advertisers to manage their accounts/bids, that kind of thing. I'm drawing up plans for one if none are available, but clearly something that's already running would be nicer. Thanks, -glenn
Re: ad feed for nutch
Thank you Byron and Greg for your comments. On reflection I think I'll do as Byron suggested and take an ad feed from a third party until I've got enough of a base to make my own system with my own advertisers worthwhile (at which point, Greg, yes, I'd be happy to make it available). My concern with a Google feed is that I don't think the base feed integrates well - I don't want to be promoting 'ads by Google' and would like some control over the design. I've done some research and I'm going to check out searchfeed.com. I'm sure if I get big enough, Yahoo or Google will provide something custom. I know some folks who have a Yahoo feed and claim they're very happy with it, however they've got such huge volume that they get more consideration than a bit player such as myself. I've heard rumours that MSN is hooking up some beta testers for their ad feed as well; no info on what they're doing. Regards, Glenn

Byron Miller wrote: phpAdsNew is OK, but not easy to integrate with a keyword-based system such as search. [...]
Nutch and Google Maps together for real estate search.
I think I found a website that puts Nutch and Google Maps together for real estate search: http://www.realestateadvisor.com/ Nutch is amazing.
Re: Nutch and Google Maps together for real estate search.
Actually, I just made a guess. When I typed search.jsp after the site's root URL, the file was there, even though some errors popped up.

On 12/7/05, Stefan Groschupf [EMAIL PROTECTED] wrote: Interesting! What makes you think that they use Nutch? Am 07.12.2005 um 16:48 schrieb Benny Krauss: I think I found a website that puts Nutch and Google Maps together for real estate search: http://www.realestateadvisor.com/ Nutch is amazing.
Re: Nutch and Google Maps together for real estate search.
Yeah, I see it too. At http://www.realestateadvisor.com/search.jsp - a URL that I typed in directly rather than followed from a link - I see an invalid page with Nutch Java exceptions:

    root cause java.lang.NullPointerException
        at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
        at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:82)

Diane Palla, Web Services Developer, Seton Hall University, 973 313-6199, [EMAIL PROTECTED]

Benny Krauss [EMAIL PROTECTED] 12/07/2005 11:42 AM Please respond to nutch-user@lucene.apache.org To nutch-user@lucene.apache.org Subject: Re: Nutch and Google Maps together for real estate search.

Actually, I just made a guess. When I typed search.jsp after the site's root URL, the file was there, even though some errors popped up. On 12/7/05, Stefan Groschupf [EMAIL PROTECTED] wrote: Interesting! What makes you think that they use Nutch? Am 07.12.2005 um 16:48 schrieb Benny Krauss: I think I found a website that puts Nutch and Google Maps together for real estate search: http://www.realestateadvisor.com/ Nutch is amazing.
Re: try to restart aborted crawl
Hi, I had the same problem with JVM crashes and in my case it was in fact a hardware problem (memory). It can also be a problem with your software config (but as far as I remember you are using a fairly standard configuration). I doubt it has anything to do with Nutch, except that Nutch stresses the JVM and the whole box, so the problem shows up more easily than during normal system usage. Regards, Piotr

wmelo wrote: The biggest problem is not how to restart the crawl, but the problem that led to the failure itself, more precisely:

    Exception in thread main java.io.IOException: key out of order: http://web.mit.edu/is/about/index.html after http://web.mit.edu/is/?ut/index.html

This kind of problem occurs, for me, almost all the time (together with another that says there is some problem with Java HotSpot), preventing me from really using Nutch. I have reported both problems before, without any answer. I don't know whether this is a bug in Nutch (or in Lucene, I have no idea). The only thing I know is that both issues are very serious non-conformities that should be corrected as soon as possible. Wmelo
NDFS problem on mapred branch
Hi, We have a mapred setup on 4 machines (1 namenode and 3 datanodes). I can access the file system from these machines without any problem. However, when I tried to write a file to the NDFS from a machine other than these 4, I got the following error:

    ~/nutch-mapred/bin$ ./nutch ndfs -put nutch nutch
    051206 152157 parsing file:/home/agmlab/nutch-mapred/conf/nutch-default.xml
    051206 152158 parsing file:/home/agmlab/nutch-mapred/conf/nutch-site.xml
    051206 152158 No FS indicated, using default:192.168.15.118:9001
    051206 152158 Client connection to 192.168.15.118:9001: starting
    Exception in thread main java.io.IOException: Cannot create file /user/agmlab/nutch on client NDFSClient_1904460956
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:128)
        at $Proxy0.create(Unknown Source)
        at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.nextBlockOutputStream(NDFSClient.java:537)
        at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.init(NDFSClient.java:512)
        at org.apache.nutch.ndfs.NDFSClient.create(NDFSClient.java:74)
        at org.apache.nutch.fs.NDFSFileSystem.createRaw(NDFSFileSystem.java:67)
        at org.apache.nutch.fs.NFSDataOutputStream$Summer.init(NFSDataOutputStream.java:41)
        at org.apache.nutch.fs.NFSDataOutputStream.init(NFSDataOutputStream.java:129)
        at org.apache.nutch.fs.NutchFileSystem.create(NutchFileSystem.java:175)
        at org.apache.nutch.fs.NutchFileSystem.create(NutchFileSystem.java:162)
        at org.apache.nutch.fs.NDFSFileSystem.doFromLocalFile(NDFSFileSystem.java:174)
        at org.apache.nutch.fs.NDFSFileSystem.copyFromLocalFile(NDFSFileSystem.java:149)
        at org.apache.nutch.fs.NDFSShell.copyFromLocal(NDFSShell.java:46)
        at org.apache.nutch.fs.NDFSShell.main(NDFSShell.java:234)

From the same machine I was able to list the files and create directories. What may be the problem? Thanks. -- Hamza Kaya
Re: NDFS problem on mapred branch
Hamza Kaya wrote: Hi, We have a mapred setup on 4 machines (1 namenode and 3 datanodes). I can access the file system from these machines without any problem. However, when I tried to write a file to the NDFS from a machine other than these 4, I got the following error: ~/nutch-mapred/bin$ ./nutch ndfs -put nutch nutch

Could you try the same, but using absolute paths? The NDFS client has no notion of a relative or current directory, so file names must always be absolute, i.e. starting with a leading / . -- Best regards, Andrzej Bialecki -- Information Retrieval, Semantic Web, Embedded Unix, System Integration -- http://www.sigram.com Contact: info at sigram dot com
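For example, something like this from the failing box (a sketch; the local and NDFS paths are taken from the log above and are only illustrative):

    # both the local source and the NDFS destination are given as absolute paths
    ./nutch ndfs -put /home/agmlab/nutch /user/agmlab/nutch
    # then list the NDFS directory to confirm the file arrived
    ./nutch ndfs -ls /user/agmlab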
Re: NDFS problem on mapred branch
I had the same problem; you will find it in the mail archive. In my case one box was unable to connect to the other. There can be two causes: first, a firewall may block the ports; a second, common case is that the DNS name of the box on the network and the name the box uses to identify itself to the other boxes are different. Check that you can telnet to the port and/or ping all boxes from all other boxes, using the names that are set up in the hosts config. Please let me know if this was also the problem in your case, since other people have hit this as well and we should perhaps add it to the FAQ. Stefan

Am 06.12.2005 um 14:48 schrieb Hamza Kaya: Hi, We have a mapred setup on 4 machines (1 namenode and 3 datanodes). I can access the file system from these machines without any problem. However, when I tried to write a file to the NDFS from a machine other than these 4, I got the following error: ~/nutch-mapred/bin$ ./nutch ndfs -put nutch nutch [...]
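To run the checks Stefan describes, something like this from each box will do (a sketch; namenode-host and datanode1-host are placeholders for the names in your hosts configuration, and the IP and port come from the error log above):

    # from every box, check that every other box resolves and responds
    ping -c 1 namenode-host
    ping -c 1 datanode1-host
    # check that the namenode RPC port is reachable
    telnet 192.168.15.118 9001
    # check that the box's own idea of its name matches what the other boxes use
    hostname
    cat /etc/hosts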
Re: Nutch returns irrelevant site
You can use the explain page to find out why this page is scored the way it is. I would expect anchor text to be the main component of it. Regards, Piotr

Aled Jones wrote: Hi, I'm currently setting up a Nutch search engine that searches travel websites. It works quite well but sometimes returns odd results. One good example: one of the 100 or so sites I've asked it to crawl is http://www.hfholidays.co.uk/ . The first result I get back for the keyword "walking" is http://www.hfholidays.co.uk/email.asp , and that page doesn't contain the word "walking" anywhere. [...]
Re: Upgrading from Nutch 0.7.1 to 0.8
Dave, here is a step-by-step tutorial for setting up 0.8 on a set of boxes: http://wiki.media-style.com/display/nutchDocu/setup+a+map+reduce+multi+box+system Maybe this can help you. Stefan

Am 07.12.2005 um 17:50 schrieb Goldschmidt, Dave: Hello, Any caveats or pitfalls in upgrading from Nutch 0.7.1 to the latest 0.8 nightly build? I'd like to rebuild a 1-machine 0.7.1 environment, then distribute it out to 2 machines using NDFS. Thanks! DaveG
Re: searching while crawling.
Hi. You do the generate, fetch, update, index cycle, and then you can add the resulting segment to be searched. I prefer to have a 'ready' folder and a 'working' folder; once a segment is fetched and indexed, my shell script just moves it to 'ready'. After 30 days I delete the segment and start again from the beginning. HTH, Stefan

Am 07.12.2005 um 07:53 schrieb K.A.Hussain Ali: Hi all, while crawling using Nutch, can we search over the segments already crawled and indexed? I had some errors while doing it this way - I don't get any hits. Kindly send me your suggestions for overcoming this. Also, should we search only after the whole crawl ends? Thanks in advance, regards -Hussain
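As a rough illustration of one such cycle (a sketch only: it assumes a Nutch 0.7-style webdb at db/ and uses working/ and ready/ as the two segment folders; the exact commands and arguments may differ in your version):

    #!/bin/sh
    # one generate/fetch/update/index pass, then move the finished segment to 'ready'
    bin/nutch generate db working
    seg=`ls -d working/2* | tail -1`
    bin/nutch fetch $seg
    bin/nutch updatedb db $seg
    bin/nutch index $seg
    # only segments in 'ready' are exposed to the search front-end
    mv $seg ready/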
RE: Class Not Found
Sorry it took so long, but I downloaded and installed Ant from ant.apache.org and was able to build the war file without a problem (once I'd gotten the /etc/ant.conf from the rpm out of the way). So if anyone else hits this, just install from either the source or binary distribution from ant.apache.org instead of using an rpm. Thanks, Jake.

-----Original Message----- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Saturday, December 03, 2005 1:32 PM To: nutch-user@lucene.apache.org Subject: Re: Class Not Found

Vanderdray, Jacob wrote: I installed ant from an rpm. It is possible that the rpm I grabbed just doesn't have everything I need. I have seen this problem too using ant installed from rpm. I recommend downloading ant from Apache. Doug
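For anyone who wants the exact steps, they look roughly like this (a sketch; the Ant version, the download URL, and the name of the Nutch build target are assumptions, so check ant.apache.org and Nutch's build.xml):

    # grab a binary Ant distribution from Apache (version and URL are only examples)
    wget http://archive.apache.org/dist/ant/binaries/apache-ant-1.6.5-bin.tar.gz
    tar xzf apache-ant-1.6.5-bin.tar.gz
    export ANT_HOME=`pwd`/apache-ant-1.6.5
    export PATH=$ANT_HOME/bin:$PATH
    # build the Nutch webapp from the source tree
    # (the target name "war" is an assumption - check build.xml for the actual target)
    cd nutch-0.7.1
    ant war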
Re: Setting up a crawler for a country.
Mostly an FYI post for those working with country-specific search engines. Just to continue on this topic: the country-code TLD I'm looking at doesn't provide any information, so we're back to crawling to find domains. To add to the complexity, lots of people here register .coms as their main domain instead of the country-specific TLD.

So our intended solution is to hack the URL filter so that it only crawls and follows sites that match the specific TLD *or* match ARIN's IP list for addresses in the country (note: ARIN publishes a list of IP assignments by country). Not perfect, but it sure beats hand review. We'll assume that if they're hosted here, they're likely a site relevant to the country. For any remaining sites we're going to offer a manual submission service: the sites will be reviewed manually, then added into the filter (we've got a preliminary PHP program that does this running now). A rough filter sketch is shown at the end of this message.

ARIN's IP-to-country mapping isn't quite perfect. For example, our servers are located here yet show up as being in the range of another country. I expect I'll occasionally review the list of sites we've added manually and look for trends in the IP address list to see if we have any missing ranges. At that point I can pull those domains from the filter and just add them to the IP address list.

I'm concerned that adding a huge range of IPs to check will slow the crawler. However, of the four bytes in an IP address, there are only about 10 possibilities for the first byte (i.e. the 000.XXX.XXX.XXX part), so we'll check just the first byte, then continue to drill down if there's a match. HTH.

Matt Kangas wrote: glenn, i know that verisign makes this available for .com and .net as TLD zone files. for ccTLDs like .us and .uk, you'll have to see if the TLD registrar provides the same. the following page has some useful links to these folks: http://www.dnsstuff.com/info/dnslinks.htm --matt

On Nov 29, 2005, at 10:23 AM, Insurance Squared Inc. wrote: Along these same lines (as I'm interested in a similar country-specific project), is there any place to get a list of all the domains for a specific TLD to use to seed Nutch? I.e., if I wanted to get a list of all currently registered .it, .de, or .ca domains? I've looked without success. I'm thinking that this information isn't available due to spamming issues, however in the paper you referenced they discuss crawling an entire TLD, which seemed to indicate they may have access to this info. Thanks, Glenn

Ken Krugler wrote: Is there anyone that can implement a country crawler? I estimate around 40m documents. Please send me info about your prev work and how much time it would take to setup and money :-) Check out the paper titled Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering by Ricardo Baeza-Yates and others. They were using a crawl of Chilean domains to test strategies for efficient crawling, so it seems like it would be of interest to you. The main problem we've run into in doing similar limited-domain crawls is that you wind up with many fewer hosts, and thus more URLs/host in any given fetch loop. The restriction of being polite (one thread per host) leads to lots of retry errors caused by fetcher threads blocking on a host (IP address) that is already being accessed by another fetcher thread, and thus lower pages/second throughput. So we've been making some mods to Nutch to improve our performance, but it's not debugged yet... getting closer, though. -- Ken

-- Matt Kangas / [EMAIL PROTECTED]
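For illustration, the TLD half of that filter could look like the crawl-urlfilter entries below (a sketch only; .ca stands in for the ccTLD in question, example-travel-site.com is a hypothetical manually approved domain, and the ARIN IP-range check would still need custom code outside the regex filter):

    # skip file:, ftp: and mailto: urls
    -^(file|ftp|mailto):
    # accept hosts under the country-code TLD (.ca is just an example)
    +^https?://([a-z0-9-]+\.)*[a-z0-9-]+\.ca/
    # placeholder for a manually reviewed .com domain hosted in-country
    +^https?://([a-z0-9-]+\.)*example-travel-site\.com/
    # skip everything else
    -.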
Luke and Indexes
I have a couple of very basic questions about Luke and indexes in general. Answers to any of these questions are much appreciated:

1. In the Luke overview tab, what does Index version refer to?

2. Also in the overview tab, if Has Deletions? is equal to yes, what are the possible sources of deletions? Dedup? Manual deletions through Luke?

3. Is there any way (with Luke or otherwise) to get a file listing all of the docs in an index? Basically, is there an index equivalent of this command (which outputs all the URLs in a segment): bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir

4. Finally, my last question is the one I'm most perplexed by: I called bin/nutch segread -list -dir for a particular segments directory and found that one directory had 93 entries. BUT, when I opened up the index of that segment in Luke, there were only 23 documents (and 3 deletions)! Where did the rest of the URLs go?

Thanks ahead of time for any helpful suggestions, Bryan
Re: fetch of file:///F:/xxx/xxx/xxx.txt failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
I am unable to understand what you want to say. Is it possible for you to send me a complete configuration in the form of an attachment? With thanks.

On 12/8/05, Hasan Diwan [EMAIL PROTECTED] wrote: On Dec 5, 2005, at 4:57 AM, Arun Kaundal wrote: I am getting a protocol not found error. What configuration setting does my case require? Please come up with a solution soon, I have been waiting for a reply to my posting for a long time.

In your crawl-filter.txt:

    # remove the word file, leaving: -^(ftp|mailto):
    -^(file|ftp|mailto):
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
    +^http*://([a-z0-9]*\.)*/
    +^https*://([a-z0-9]*\.)*/
    # add this:
    +^file:///*
    -.

Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: [Nutch-general] RE: Speed of indexing
It's slightly different, actually. mergeFactor only controls the rate of Lucene index segment creation; it doesn't control the in-memory side. That is what minMergeDocs controls - the size of the in-memory buffer. If you are curious about this and other Lucene details, they are described in Lucene in Action - http://www.lucenebook.com/ - and the code examples are free to download. Otis (A nutch-site.xml sketch for these properties is shown at the end of this message.)

--- Goldschmidt, Dave [EMAIL PROTECTED] wrote: Thanks, I'm trying to get a better understanding of this. Does anyone have experience working with these parameters for large datasets (7-20M documents)? What's the interplay between mergeFactor and minMergeDocs? I think mergeFactor specifies how many documents to store in memory before writing to disk, yes? But this may be overridden by the minMergeDocs parameter, which specifies how many documents must be buffered in memory before being merged - DOES THIS MEAN that nothing is written to disk until minMergeDocs is reached? If I understand it correctly, mergeFactor also specifies when to merge Lucene (not Nutch) segments into a new segment. If I have a mergeFactor of 10, after processing 10 documents in memory a Lucene segment may be written to disk; when 10 Lucene segments exist, they are merged into a single 100-document segment, etc. Back to minMergeDocs: DOES THIS MEAN that all this mergeFactor-based merging occurs in memory until minMergeDocs is reached? What about using RAMDirectory instead? Help? Thanks! DaveG

-----Original Message----- From: Stefan Groschupf [mailto:[EMAIL PROTECTED] Sent: Monday, December 05, 2005 4:29 PM To: nutch-user@lucene.apache.org Subject: Re: Speed of indexing

The Lucene wiki and the Lucene in Action book provide at least a description of the formula, but also no magic formula. Just check the Nutch configuration file: every value that minimizes disk access (and increases memory usage) improves speed. Stefan

Am 05.12.2005 um 22:24 schrieb Goldschmidt, Dave: Hello, In searching for solutions, I found an old post from Doug on tuning these parameters - but this old message applied to ~30,000 documents only: http://marc.theaimsgroup.com/?l=lucene-user&m=110235452004799&w=2 I've upped both mergeFactor and minMergeDocs to 1000 and was able to get ~200 records/second - not bad, but is this the best I can do? That's still ~8 hours of indexing time (and I haven't hit the 'updatedb' phase yet!). I'm going to keep playing with these parameters - BUT does anyone have a FORMULA for tuning them given the memory, Java heap size, etc.? :-) Thanks, DaveG

-----Original Message----- From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] Sent: Monday, December 05, 2005 2:43 PM To: nutch-user@lucene.apache.org Subject: RE: Speed of indexing

Hi, no additional plugins enabled - just an out-of-the-box build. And no, I haven't set any nutch-site settings - the default setting for indexer.minMergeDocs is 50, indexer.maxMergeDocs is 2147483647, and indexer.mergeFactor is 50. Any rule-of-thumb formula for setting these values? Note I've upped the number of open files from 1024 to 4096. Thanks, DaveG

-----Original Message----- From: Byron Miller [mailto:[EMAIL PROTECTED] Sent: Monday, December 05, 2005 2:36 PM To: nutch-user@lucene.apache.org Subject: Re: Speed of indexing

Which plugins do you have enabled? Have you optimized any of your nutch-site settings yet? -byron

--- Goldschmidt, Dave [EMAIL PROTECTED] wrote: Hello, I'm currently indexing ~50 segments, each ~2GB in size, for a total of only ~7,000,000 pages. From the log output, I see an index rate of ~72 records/second. Doing the math, this is over 24 hours of time to index these segments. Does this sound slow? If so, any suggestions as to how to tune this? Note I'm using Nutch 0.7.1 on a Linux box with dual CPUs, 2GB of memory, and a 250GB partition to play with. Thanks, DaveG
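As mentioned above, these knobs are exposed as Nutch properties that can be overridden in conf/nutch-site.xml; a sketch (the property names are the ones quoted in the thread above, the values are only illustrative starting points, not recommendations):

    <!-- sketch for conf/nutch-site.xml; tune the values for your own memory budget -->
    <property>
      <name>indexer.minMergeDocs</name>
      <value>500</value>
      <description>How many documents Lucene buffers in memory before flushing to disk.</description>
    </property>
    <property>
      <name>indexer.mergeFactor</name>
      <value>50</value>
      <description>How many Lucene segments accumulate before they are merged into a larger one; higher values mean less frequent merging but more open files.</description>
    </property>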