2009/11/16 Mark Kerzner markkerz...@gmail.com:
Hi,
I want to politely crawl a site with 1-2 million pages. With the speed of
about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop,
and can I coordinate the crawlers so as not to cause a DoS attack?
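For reference, per-host politeness in Nutch is controlled through properties in conf/nutch-site.xml. A sketch (the property names are standard Nutch configuration; the values here are only illustrative):

```xml
<configuration>
  <!-- Minimum delay, in seconds, between successive requests to the same host -->
  <property>
    <name>fetcher.server.delay</name>
    <value>2.0</value>
  </property>
  <!-- Number of fetcher threads allowed to hit a single host at a time -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>
</configuration>
```

With settings like these you can scale the total fetch rate by adding Hadoop nodes while each individual host still sees at most one polite request every couple of seconds.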
Nutch basically uses
Try looking at how the indexers work. They *do* iterate through all
the documents in the crawl (or rather one segment at a time). However
they do it in a Hadoop way...
2009/8/11 Paul Tomblin ptomb...@xcski.com:
I want to iterate through all the documents that are in the crawl,
tied the two together.
Alex
2009/8/11 Alex McLintock alex.mclint...@gmail.com:
Further information to this
I'm running on a single machine in fake clustering mode.
A tmp directory gets created, with nothing but another empty directory
inside of it.
The hadoop log file just says
I'm trying to send my Nutch crawl to Solr. I've generated, fetched,
and updated several times. I've done an invertlinks.
But when I try to do the solrindex it just sits there for ages and
doesn't seem to stress the Solr server at all.
I'm using Nutch 1.0, Sun Java 1.6, Ubuntu Linux 9.04.
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
Is Solr output a plugin, and is it not set up above?
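For the record, the Nutch 1.0 Solr indexing job is invoked roughly like this (the Solr URL and crawl paths are illustrative and need to match your own setup):

```shell
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
```

If this job hangs with no load on Solr, it is worth checking the Hadoop job logs first, since the indexing runs as a MapReduce job before anything is posted to Solr.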
2009/8/11 Alex McLintock alex.mclint...@gmail.com:
I'm trying to send my Nutch crawl to Solr. I've generated, fetched,
and updated several times. I've done an invertlinks.
But when I try to do
I've been using a Perl-based focussed web crawler with a MySQL back
end, but am now looking at Nutch instead. It seems like a few other
people have done something similar. I'm wondering whether we could
pool our resources and work together on this?
It seems to me that we would be building a few
I am not sure if it solves your problem, but you might do something
like disconnecting your machines from the internet - preferably by making
your DNS server return "unknown domain" (NXDOMAIN) for the hosts being fetched.
This will relatively quickly cause the remaining part of the fetch to fail.
Just a suggestion...
Alex
2009/7/23
2009/7/25 Andrzej Bialecki a...@getopt.org:
I solved this once by implementing a check in Fetcher.run() for a marker
file on HDFS. If the presence of this file was detected, the FetcherThreads
would be stopped one by one (again, by setting a flag in their run() methods
to terminate the loop).
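A simplified, self-contained sketch of that pattern, using a local marker file via java.nio instead of HDFS and a plain worker class instead of Nutch's FetcherThread (all class and variable names here are made up for illustration):

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative stand-in for a FetcherThread: it polls for a marker file
// and sets a flag that makes its work loop terminate, as described above.
class StoppableWorker implements Runnable {
    private final Path marker;          // marker file to watch for
    private volatile boolean stopped;   // flag checked by the work loop

    StoppableWorker(Path marker) { this.marker = marker; }

    @Override
    public void run() {
        while (!stopped) {
            // In Nutch this check would use FileSystem.exists() against HDFS.
            if (Files.exists(marker)) {
                stopped = true;         // terminate the loop, as in Fetcher.run()
                break;
            }
            try { Thread.sleep(10); } catch (InterruptedException e) { return; }
        }
    }
}

public class MarkerStopDemo {
    public static void main(String[] args) throws Exception {
        Path marker = Files.createTempDirectory("demo").resolve("stop.marker");
        Thread t = new Thread(new StoppableWorker(marker));
        t.start();
        Thread.sleep(50);               // let the worker loop a few times
        Files.createFile(marker);       // "drop" the marker file
        t.join(2000);                   // the worker should notice it and exit
        System.out.println(t.isAlive() ? "still running" : "stopped");
    }
}
```

The advantage of the marker-file approach is that any client with filesystem access can request a clean shutdown of a long-running fetch, without needing an RPC channel to the fetcher.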
Why does your example say both monster.crawl and test.crawl?
Are you perhaps entering the command wrong or is this just an error in
the email?
Alex
2009/7/18 Beats tarun_agrawal...@yahoo.com:
hi,
I'm getting this weird error (at least for me):
I'm trying to crawl some web pages..
2009/7/14 oh...@cox.net:
BUT, I think that I may have just gotten an idea about why this was not
working.
It looks like when I run the nutch crawl, the index and indexes
directories are not being created until the crawl is completely done.
[Is this normal nutch behavior???]
I believe
but I get a number of messages in crawl.log, like:
Error parsing: http://lucene.apache.org/skin/getMenu.js:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/javascript
url=http://lucene.apache.org/skin/getMenu.js
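Parse errors like this for .js files are usually harmless - Nutch simply has no parser registered for contentType=application/javascript. If the log noise bothers you, one common approach is to skip such URLs entirely at fetch time by extending the suffix-exclusion rule in conf/regex-urlfilter.txt (the exact line below is an illustrative addition, to be merged with the existing rules):

```
# Skip JavaScript files entirely so the parser is never asked to handle them
-\.js$
```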
at
2009/7/14 Jake Jacobson jakecjacob...@gmail.com:
I did attach it.
I am afraid that I can't see anything either. Can you perhaps upload it
somewhere and link to it?
I'd like to say thank you for your effort. We could do with more
tutorials which look at it in different ways.
Alex
Hello Zaihan,
So you have your servlet container running providing a web application
- but it doesn't know where your crawled data is
Find the nutch-site.xml file, something like
/var/lib/tomcat6/webapps/ROOT/WEB-INF/classes/nutch-site.xml
And make sure it contains something like
configuration
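If it helps, the property the Nutch web application reads to locate the crawled data is searcher.dir, so a minimal nutch-site.xml would look something like this (the path is illustrative - point it at the directory containing your crawldb, linkdb, segments, and index):

```xml
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/path/to/crawl</value>
  </property>
</configuration>
```

After editing the file, restart Tomcat so the web app picks up the new value.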
I've looked at a lot of tutorials for linking Nutch and Solr but it
seems that this has been improved a lot in version 1.0.
Can anyone point me at documentation which takes this into account?
Cheers
Alex
Can anyone point me at some Nutch Plugin documentation which goes into
more detail than
http://wiki.apache.org/nutch/WritingPluginExample-0.9
I want to understand all the different places where you might put a
plugin and why/how you might do so.
Basically I am trying to extract some information
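Until better documentation turns up: each plugin declares where it hooks in via the extension points listed in its plugin.xml. A skeletal example for an indexing filter (all ids, jar names, and class names here are made up for illustration):

```xml
<plugin id="myfilter" name="My Indexing Filter" version="1.0.0"
        provider-name="example.org">
  <runtime>
    <!-- The jar built from this plugin's sources -->
    <library name="myfilter.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <!-- Hook into the indexing-filter extension point -->
  <extension id="org.example.myfilter" name="My Indexing Filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="MyFilter" class="org.example.MyFilter"/>
  </extension>
</plugin>
```

The plugin must then be enabled by adding its id to the plugin.includes property in nutch-site.xml, which is the "why/how" part that the wiki example only touches briefly.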
I'm trying out nutch-1.0 release on Ubuntu 9.04. I'm trying to follow
the 0.8 version tutorial found at
http://lucene.apache.org/nutch/tutorial8.html
Now I seem to be able to do the web crawling and create indexes. I can
do searches from the command line using
eg
/local/apps/software/nutch$
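For anyone following along, the command-line search in that tutorial goes through NutchBean, roughly like this from the Nutch install directory (the query term is illustrative):

```shell
bin/nutch org.apache.nutch.searcher.NutchBean apache
```

This reads the index from the directory named by the searcher.dir property (or the local crawl directory) and prints matching hits to the console.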
2009/7/4 xiao yang yangxiao9...@gmail.com:
I have downloaded Tomcat6 from the official site and reinstalled it
manually. It works!
Maybe there's something wrong with Tomcat6 in the Ubuntu mirror site.
Oops - I just posted exactly the same problem with the same stack trace,
same Ubuntu
OK, here is how I fixed this in my Ubuntu 9.04 setup using the normal
tomcat6 Ubuntu package.
I added this permission line to /etc/tomcat6/policy.d/04webapps.policy:
grant {
  // Attempt to get Nutch working
  permission java.security.AllPermission;
};
Now I can get the Nutch web app