Re: Scalability for one site

2009-11-16 Thread Alex McLintock
2009/11/16 Mark Kerzner markkerz...@gmail.com: Hi, I want to politely crawl a site with 1-2 million pages. At about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and can I coordinate the crawlers so as not to cause a DoS attack? Nutch basically uses
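
As a rough check on the "weeks" estimate in the quoted question, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes every fetch to the one site is strictly sequential with a fixed delay, and the function name `crawl_days` is made up for the example, not part of Nutch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(pages, seconds_per_fetch):
    """Days needed to fetch `pages` one after another from a single host.

    Assumption: one request at a time, fixed delay, no retries or failures.
    """
    return pages * seconds_per_fetch / SECONDS_PER_DAY


# Bounds taken from the figures in the question: 1-2 million pages, 1-2 s each.
best = crawl_days(1_000_000, 1.0)
worst = crawl_days(2_000_000, 2.0)
print(f"{best:.1f} to {worst:.1f} days")  # prints "11.6 to 46.3 days"
```

That is roughly two to seven weeks. If the delay is enforced per host, adding more machines would not shorten this for a single site, since they would all have to share the same one-request-at-a-time budget.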

Re: How do I get all the documents in the index without searching?

2009-08-12 Thread Alex McLintock
Try looking at how the indexers work. They *do* iterate through all the documents in the crawl (or rather one segment at a time). However they do it in a Hadoop way... 2009/8/11 Paul Tomblin ptomb...@xcski.com: I want to iterate through all the documents that are in the crawl,

Re: Nutch to SolR. First steps

2009-08-12 Thread Alex McLintock
tied the two together. Alex 2009/8/11 Alex McLintock alex.mclint...@gmail.com: Further information on this: I'm running on a single machine in fake clustering mode. A tmp directory gets created, with nothing but another empty directory inside of it. The hadoop log file just says

Nutch to SolR. First steps

2009-08-11 Thread Alex McLintock
I'm trying to send my Nutch crawl to Solr. I've generated, fetched, and updated several times. I've done an invertlinks. But when I try to do the solrindex it just sits there for ages and doesn't seem to stress the Solr server at all. I'm using Nutch 1.0, Sun Java 1.6, Ubuntu Linux 9.04.

Re: Nutch to SolR. First steps

2009-08-11 Thread Alex McLintock
org.apache.nutch.indexer.anchor.AnchorIndexingFilter Is Solr output a plugin, and is it not set up above? 2009/8/11 Alex McLintock alex.mclint...@gmail.com: I'm trying to send my Nutch crawl to Solr. I've generated, fetched, and updated several times. I've done an invertlinks. But when I try to do

Focussed Web Crawling with Nutch

2009-07-31 Thread Alex McLintock
I've been using a Perl-based focussed web crawler with a MySQL back end, but am now looking at Nutch instead. It seems like a few other people have done something similar. I'm wondering whether we could pool our resources and work together on this? It seems to me that we would be building a few

Re: Gracefull stop in the middle of a fetch phase ?

2009-07-25 Thread Alex McLintock
I am not sure if it solves your problem, but you might do something like disconnect your machines from the internet, preferably by making your DNS server return "don't know that domain". This will relatively quickly cause the remaining part of the fetch to fail. Just a suggestion... Alex 2009/7/23

Re: Gracefull stop in the middle of a fetch phase ?

2009-07-25 Thread Alex McLintock
2009/7/25 Andrzej Bialecki a...@getopt.org: I solved this once by implementing a check in Fetcher.run() for a marker file on HDFS. If the presence of this file was detected, the FetcherThreads would be stopped one by one (again, by setting a flag in their run() methods to terminate the loop).
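
The marker-file idea quoted above can be sketched roughly as follows. This is a Python illustration of the pattern only, not the Nutch Fetcher code: it assumes a local file path instead of HDFS, uses one shared flag rather than stopping the threads one by one, and all the names (`worker`, `run_until_marker`, `stop.marker`) are invented for the example.

```python
import os
import tempfile
import threading
import time


def worker(stop_flag, fetched):
    # Stand-in for a fetcher thread: keep going until the flag is set.
    while not stop_flag.is_set():
        fetched.append("page")  # placeholder for one real fetch
        time.sleep(0.01)


def run_until_marker(marker_path, n_threads=3, poll_seconds=0.02):
    """Run worker threads until a marker file appears, then stop them cleanly."""
    stop_flag = threading.Event()
    fetched = []
    threads = [
        threading.Thread(target=worker, args=(stop_flag, fetched))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    # The controlling loop polls for the marker file.
    while not os.path.exists(marker_path):
        time.sleep(poll_seconds)
    stop_flag.set()  # each worker leaves its loop after its current iteration
    for t in threads:
        t.join()
    return len(fetched)


if __name__ == "__main__":
    marker = os.path.join(tempfile.mkdtemp(), "stop.marker")
    # Simulate an operator creating the marker file shortly after start-up.
    threading.Timer(0.1, lambda: open(marker, "w").close()).start()
    print("fetched before stop:", run_until_marker(marker))
```

The point of the flag is that each thread finishes whatever it is doing and exits on its own, so work already fetched is not lost the way it would be if the process were simply killed.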

Re: error in using generate command

2009-07-23 Thread Alex McLintock
Why does your example say both monster.crawl and test.crawl? Are you perhaps entering the command wrong or is this just an error in the email? Alex 2009/7/18 Beats tarun_agrawal...@yahoo.com: Hi, I'm getting this weird error (at least for me): I'm trying to crawl some web pages..

Re: Tutorial followup - Nutch webapp not seeing stuff?

2009-07-15 Thread Alex McLintock
2009/7/14 oh...@cox.net: BUT, I think that I may have just gotten an idea about why this was not working. It looks like when I run the nutch crawl, the index and indexes directories are not being created until the crawl is completely done. [Is this normal nutch behavior???] I believe

Re: Just getting started w/tutorial- errors in crawl.log

2009-07-14 Thread Alex McLintock
but I get a number of messages in crawl.log, like: Error parsing: http://lucene.apache.org/skin/getMenu.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://lucene.apache.org/skin/getMenu.js at

Re: Nutch Tutorial 1.0 based off of the French Version

2009-07-14 Thread Alex McLintock
2009/7/14 Jake Jacobson jakecjacob...@gmail.com: I did attach it. I am afraid that I can't see anything either. Can you perhaps upload it somewhere and link to it? I'd like to say thank you for your effort. We could do with more tutorials which look at it in different ways. Alex

Re: Integrating Nutch frontend with Backend.

2009-07-13 Thread Alex McLintock
Hello Zaihan, So you have your servlet container running and providing a web application, but it doesn't know where your crawled data is. Find the nutch-site file, something like /var/lib/tomcat6/webapps/ROOT/WEB-INF/classes/nutch-site.xml, and make sure it contains something like configuration

Solr Integration since v1.0 ?

2009-07-07 Thread Alex McLintock
I've looked at a lot of tutorials for linking Nutch and Solr but it seems that this has been improved a lot in version 1.0. Can anyone point me at documentation which takes this into account? Cheers Alex

Writing Plugins - Documentation?

2009-07-06 Thread Alex McLintock
Can anyone point me at some Nutch plugin documentation which goes into more detail than http://wiki.apache.org/nutch/WritingPluginExample-0.9 ? I want to understand all the different places where you might put a plugin and why/how you might do so. Basically I am trying to extract some information

Getting Nutch1.0 example working in tomcat 6 (on ubuntu)

2009-07-04 Thread Alex McLintock
I'm trying out the nutch-1.0 release on Ubuntu 9.04. I'm trying to follow the 0.8 version tutorial found at http://lucene.apache.org/nutch/tutorial8.html Now I seem to be able to do the web crawling and create indexes. I can do searches from the command line using e.g. /local/apps/software/nutch$

Re: Problems when deploy nutch-1.0.war

2009-07-04 Thread Alex McLintock
2009/7/4 xiao yang yangxiao9...@gmail.com: I have downloaded Tomcat6 from the official site and reinstalled it manually. It works! Maybe there's something wrong with Tomcat6 in the Ubuntu mirror site. Oops - I just posted exactly the same problem with the same stack trace, same Ubuntu

Re: Problems when deploy nutch-1.0.war

2009-07-04 Thread Alex McLintock
OK, here is how I fixed this in my Ubuntu 9.04 setup using the normal tomcat6 Ubuntu package. I added this permission line to /etc/tomcat6/policy.d/04webapps.policy: grant { // Attempt to get Nutch working permission java.security.AllPermission; Now I can get the Nutch web app