Still stuck.
A few other people have reported problems on the mailing list, but they got no
answers...
I found out that my Tomcat install is using commons-logging-1.1 while Nutch
uses commons-logging-1.0.4. Not sure what to think about it... the cause of
the problem, or another problem.
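One way to check whether the duplicate commons-logging jars are actually the issue is to ask the JVM where it loaded the class from. A minimal sketch (the class name is the one both Tomcat and Nutch bundle; running it from a debug JSP inside the webapp is a hypothetical usage, not something Nutch ships):

```java
// Prints where a class was actually loaded from -- useful for spotting
// duplicate or conflicting jars such as two versions of commons-logging.
public class JarLocator {
    static String where(Class<?> cls) {
        java.security.CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Bootstrap classes have no code source; everything else reports
        // the jar or directory it came from.
        return src == null ? "(bootstrap class path)" : src.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(where(Class.forName("org.apache.commons.logging.LogFactory")));
    }
}
```

If the printed path points at Tomcat's shared lib rather than the webapp's WEB-INF/lib, the container's copy is shadowing the one Nutch expects.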
SEVERE: Error
On Sun, Nov 15, 2009 at 2:45 AM, prashant ullegaddi
prashullega...@gmail.com wrote:
-activeThreads=0
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
Installed Tomcat 6 on Ubuntu 8.04 (works fine).
Deployed nutch-1.0.war, and I get the following error.
Modified nutch-site.xml to point the search dir at my crawl directory;
still getting the same error.
Any idea what's wrong?
SEVERE: Exception sending context initialized event (context
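For reference, the nutch-site.xml change described above is normally the searcher.dir property inside the deployed webapp's configuration; a minimal sketch (the crawl path is a placeholder, not taken from the mail):

```xml
<!-- nutch-site.xml (e.g. under the webapp's WEB-INF/classes) -->
<configuration>
  <property>
    <name>searcher.dir</name>
    <!-- placeholder: point this at your crawl output directory -->
    <value>/path/to/crawl</value>
  </property>
</configuration>
```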
Hi
I have spent more than a week fetching using Nutch. Near the end I get
a message in the console indicating it is aborting with 100 hung threads, as
below.
Has anyone seen this before? Does anyone know if I can still use the
fetched segment, or would this cause any error when I'm updating?
Yes, I had it in the past, and one needs to apply a certain patch... but I don't
remember which one off the top of my head; search the mailing list.
2009/11/15 Kalaimathan Mahenthiran matha...@gmail.com
Hi
I have spent more than a week fetching using Nutch. Near the end I get
a message in the
Found it: it has to do with the security policies of Tomcat 6, which I had to
turn off... probably worth some more analysis to find out how to keep
security policies active in general and just grant what's required to Nutch.
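If one wanted to keep Tomcat's security manager enabled rather than turning it off, a grant in catalina.policy along these lines might work. Both the codeBase path and the blanket AllPermission are assumptions to narrow down, not a vetted policy:

```
// $CATALINA_HOME/conf/catalina.policy -- hypothetical grant for the Nutch webapp.
// AllPermission is only a starting point; tighten it to the specific
// file/socket permissions Nutch actually needs.
grant codeBase "file:${catalina.home}/webapps/nutch/-" {
    permission java.security.AllPermission;
};
```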
2009/11/15 MilleBii mille...@gmail.com
Installed Tomcat 6 on Ubuntu 8.04
Hi, All
I'm using Nutch 1.0 on a 12-node cluster. When using the crawler to index
an intranet, it crashed after 12 hours of crawling. One of my slaves crashed
too.
Following are the logs of the crashed node (tasktracker and datanode) and the job:
Here are my questions:
1. What's the reason it crashes?
2. How to
Hi,
Would it be possible to include in Nutch the ability to crawl/download a
page only if the page has been updated since the last crawl? I had read
some time back that there were plans to include such a feature. It would be a
very useful feature to have, IMO. This of course depends on the last
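At the HTTP level, re-fetching only when a page has changed is usually built on conditional GETs. A minimal sketch of the idea (the class name and usage are mine, and note that servers are not obliged to honor If-Modified-Since):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalFetch {
    // Returns true if the page changed since lastFetched (epoch millis);
    // a 304 Not Modified response means the cached copy is still current.
    static boolean changedSince(String url, long lastFetched) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setIfModifiedSince(lastFetched);   // sends an If-Modified-Since header
        int code = conn.getResponseCode();
        conn.disconnect();
        return code != HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: was the page modified in the last 24 hours?
        System.out.println(changedSince(args[0], System.currentTimeMillis() - 86_400_000L));
    }
}
```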
Hi,
What you ask for is not possible using the prune command. Prune is for removing
URLs that follow a specific pattern specified by the administrator.
You will need to parse the HTML page so that the unwanted portions
you mentioned, i.e. div class=menu..., do not get included in the
CONTENT field.
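As a crude illustration of stripping such portions before they reach the content (class name and markup are mine; a real solution would hook into Nutch's HtmlParseFilter extension point and walk the parsed DOM rather than use a regex, which only handles non-nested divs):

```java
import java.util.regex.Pattern;

public class MenuStripper {
    // Removes non-nested <div class="menu">...</div> blocks. Regexes on HTML
    // are fragile (nesting, attribute order, quoting) -- illustration only.
    static final Pattern MENU =
        Pattern.compile("<div\\s+class=\"menu\"[^>]*>.*?</div>",
                        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String stripMenus(String html) {
        return MENU.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<body><div class=\"menu\"><a href=\"/\">Home</a></div>"
                    + "<p>Real content</p></body>";
        System.out.println(stripMenus(page));  // <body><p>Real content</p></body>
    }
}
```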
Droids is much simpler if all you want to do is do a little bit of crawling.
Nutch is built to scale to many millions of web pages.
If you need to crawl just a few sites, I'd suggest Droids.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta,
Hi,
I tried to crawl http://www.cs.umbc.edu. Used the default
crawl_urlfilter.txt and added the line
+^http://([a-z0-9]*\.)cs.umbc.edu/
at the end of the file. In the url directory I added
www.cs.umbc.edu. It works fine and crawls ~varish1 and ~relan1.
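A quick way to sanity-check a crawl-urlfilter regex is to run it through java.util.regex directly (the class below is just an illustration; the leading '+' in the filter file is the accept flag, not part of the regex):

```java
import java.util.regex.Pattern;

public class FilterCheck {
    public static void main(String[] args) {
        // Regex portion of the filter line from the mail. Note the unescaped
        // dots in "cs.umbc.edu" also match any character, not just '.'.
        Pattern accept = Pattern.compile("^http://([a-z0-9]*\\.)cs.umbc.edu/");

        System.out.println(accept.matcher("http://www.cs.umbc.edu/~varish1/").find()); // true
        System.out.println(accept.matcher("http://csee.umbc.edu/").find());            // false
    }
}
```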
Are you