Hi All,
I'm using Nutch with Hadoop with great pleasure - it works great and really
increases crawling performance across multiple machines.
I have two strong machines and two older machines which I would like to use.
So far I've been using only the two strong machines with Hadoop.
Now I would like
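One common way to mix stronger and weaker nodes is to lower the per-node task slots on the older machines. A minimal sketch, assuming an old-mapred-API hadoop-site.xml on each weaker node; the exact property names vary across Hadoop versions, so check your version's defaults before copying:

```xml
<!-- hadoop-site.xml on an older node: allow fewer concurrent tasks.
     Property names are from the legacy mapred API and may differ
     in your Hadoop release. -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
</configuration>
```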
Hi all,
I recently downloaded the latest nutchbase ... only to find out that many
things have changed for my plugins, notably the addition of HBase columns.
Not really difficult to change, although plenty of plugins no longer work.
Still, I face a problem with the new scoring plug-in because
Moreover, the class Crawl does not exist after building, so you cannot run nutch
crawl... I'm going to revert to the reference code since this does not work.
2009/11/11 MilleBii mille...@gmail.com
Hello,
Thanks for the reply, but this doesn't seem to work either. I removed the
crawl dir, added the regex you posted, removed the one I had in
regex-urlfilter.txt and crawl-urlfilter.txt and restarted the crawl. My
crawls spend about 90% of their time on who.int ... I have no idea how to
Hello,
The first matching rule wins.
Maybe you have an earlier rule that matches.
Can you send me your filter files by private mail?
Regards,
Reinhard
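To illustrate the first-match-wins ordering: in regex-urlfilter.txt, exclusions have to come before the catch-all accept rule, or they never fire. A sketch (using the who.int host from the question as the example to exclude):

```
# regex-urlfilter.txt is evaluated top to bottom; the first
# matching rule decides the URL's fate.

# skip everything on who.int (must precede the catch-all)
-^https?://([a-z0-9]*\.)*who\.int/

# accept anything else
+.
```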
opsec wrote:
Hi,
I am developing a project based on Nutch. It works great (in Eclipse), but
due to new requirements I have to change the library hadoop-0.12.2-core.jar
to the original source code.
I successfully downloaded that code at:
Pablo Aragón wrote:
Solr is just a search and indexing server. It doesn't do crawling. Nutch does
the crawling and page parsing, and can index into Lucene or into a Solr server.
Nutch is a biggish beast, and if you just need to index a site or even a small
set of them, you may have an easier time with Droids.
Hi all,
I have been trying to run a crawl on a couple of different domains using
nutch:
bin/nutch crawl urls -dir crawled -depth 3
Every time I get the response:
Stopping at depth=x - no more URLs to fetch. Sometimes a page or two at the
first level get crawled and in most other cases, nothing
Hi,
I have setup Nutch on a multinode cluster and the crawl is working fine.
However it seems that Nutch cannot crawl any pages with ~.
I setup nutch to crawl http://www.cs.umbc.edu . Within this website it did
not crawl pages like -
- www.cs.umbc.edu/~varish1
- www.cs.umbc.edu/~relan1
Hi,
I hope someone can find me an answer on this:
why is Nutch re-fetching pages that were fetched and indexed already, over
and over, regardless of the db.fetch properties? Here are my settings:
db.fetch.interval.default = 2592000
db.default.fetch.interval = 30
db.fetch.interval.max = 7776000
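For reference, the two interval properties use different units: db.fetch.interval.default is in seconds, while the older, deprecated db.default.fetch.interval is in days (worth double-checking against your version's nutch-default.xml). The values quoted above are at least mutually consistent, as a quick check shows:

```python
# Sanity-check the fetch-interval settings quoted in the message above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

default_interval_s = 2592000   # db.fetch.interval.default (seconds)
legacy_interval_d = 30         # db.default.fetch.interval (days, deprecated)
max_interval_s = 7776000       # db.fetch.interval.max (seconds)

print(default_interval_s // SECONDS_PER_DAY)  # 30 days
print(max_interval_s // SECONDS_PER_DAY)      # 90 days
print(default_interval_s == legacy_interval_d * SECONDS_PER_DAY)  # True
```

With a 30-day minimum interval configured, re-fetching within days of the first crawl may point at something other than these settings, e.g. a crawl directory that was removed and re-injected.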
Maybe try using '%7e' instead of '~'? For example: www.cs.umbc.edu/%7evarish1
--
View this message in context:
http://old.nabble.com/Nutch-does-not-crawl-pages-starting-with-%7E-tp26312379p26313265.html
Sent from the Nutch - User mailing list archive at Nabble.com.
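The %7e trick works because percent-encoding is case-insensitive and '%7e' decodes back to '~'. A quick check with Python's standard library (used here purely for illustration):

```python
from urllib.parse import unquote

# '%7e' and '%7E' both decode to the tilde character.
print(unquote('www.cs.umbc.edu/%7evarish1'))     # www.cs.umbc.edu/~varish1
print(unquote('%7e') == '~' == unquote('%7E'))   # True
```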
Any other rules in your filter that precede that one?
(+^http://([a-z0-9]*\.)*blogspot.com/)
--
View this message in context:
http://old.nabble.com/Stopping-at-depth%3D0---no-more-URLs-to-fetch-tp26310955p26313305.html
Sent from the Nutch - User mailing list archive at Nabble.com.
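One thing worth checking in that rule: the unescaped dot in blogspot.com matches any character, so the pattern is broader than intended (though it still matches the real domain). A quick way to probe a rule's behavior, sketched with Python's re module rather than Nutch's Java regex engine (the semantics are close enough for this pattern):

```python
import re

# The rule body after the leading '+': note the unescaped dot before 'com'.
rule = r'^http://([a-z0-9]*\.)*blogspot.com/'

print(bool(re.match(rule, 'http://example.blogspot.com/')))    # True
print(bool(re.match(rule, 'http://example.blogspotXcom/')))    # also True: '.' matches 'X'

# Escaping the dot narrows the rule to the intended domain.
strict = r'^http://([a-z0-9]*\.)*blogspot\.com/'
print(bool(re.match(strict, 'http://example.blogspotXcom/')))  # False
```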