Re: can't deploy nutch-1.0.war ???

2009-11-15 Thread MilleBii
Still stuck. A few other people have reported problems on the mailing list, but they got no answers... I found out that my Tomcat install is using commons-logging-1.1 whilst Nutch uses commons-logging-1.0.4; not sure what to think about it... the cause of the problem or another problem. SEVERE: Error

Re: Problem with Indexing Local Filesystem.

2009-11-15 Thread Paul Tomblin
On Sun, Nov 15, 2009 at 2:45 AM, prashant ullegaddi prashullega...@gmail.com wrote: -activeThreads=0 Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)

loading nutchBeanConstructor error with Tomcat 6

2009-11-15 Thread MilleBii
Installed Tomcat 6 on Ubuntu 8.04 (works fine). Deployed nutch-1.0.war and I get the following error. Modified nutch-site.xml to point the search dir to my crawl directory; still getting the same error. Any idea what's wrong? SEVERE: Exception sending initialized context event (context

at the end of fetching, hung threads

2009-11-15 Thread Kalaimathan Mahenthiran
Hi, I have spent more than a week fetching using Nutch. Near the end I get a message in the console indicating that it is aborting with 100 hung threads, as below. Has anyone seen this before? Does anyone know if I can still use the fetched segment, or would this cause any error when I'm updating...

Re: at the end of fetching, hung threads

2009-11-15 Thread MilleBii
Yes, I had it in the past, and one needs to apply a certain patch... but I don't remember which one off the top of my head; search the mailing list. 2009/11/15 Kalaimathan Mahenthiran matha...@gmail.com Hi, I have spent more than a week fetching using Nutch. Near the end I get a message in the

Re: loading nutchBeanConstructor error with Tomcat 6

2009-11-15 Thread MilleBii
Found it: it has to do with the security policies of Tomcat 6, which I had to turn off... It is probably worth some more analysis to find out how to keep the security policies active in general and just grant what's required to Nutch. 2009/11/15 MilleBii mille...@gmail.com Installed Tomcat 6 on Ubuntu 8.04

Nutch 1.0 - Crawler Crashed - How to Resume

2009-11-15 Thread xiao yang
Hi all, I'm using Nutch 1.0 on a 12-node cluster. When using the crawler to index an intranet, it crashed after 12 hours of crawling. One of my slaves crashed too. Following are the logs of the crashed node (tasktracker and datanode) and the job: Here are my questions: 1. What's the reason it crashes? 2. How to

Re: Nutch near future - strategic directions

2009-11-15 Thread Subhojit Roy
Hi, Would it be possible to include in Nutch the ability to crawl and download a page only if the page has been updated since the last crawl? I had read some time back that there were plans to include such a feature. It would be a very useful feature to have, IMO. This of course depends on the last
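For illustration only, the "download only if updated" idea could be sketched as below. This is not Nutch's implementation; it assumes the server reports an HTTP Last-Modified header and that the crawler stores the time of its previous fetch. The function name needs_refetch and its parameters are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_refetch(last_fetch, last_modified_header):
    """Decide whether a page should be downloaded again.

    last_fetch: timezone-aware datetime of the previous fetch (assumed to be
        stored by the crawler).
    last_modified_header: value of the HTTP Last-Modified header, or None
        when the server did not send one.
    """
    # Without a usable header we cannot tell, so fetch to be safe.
    if not last_modified_header:
        return True
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        # Unparseable date: treat the page as possibly changed.
        return True
    if modified.tzinfo is None:
        # Dates without zone information are treated as UTC here.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_fetch
```

A crawler could call this with the stored fetch time and the header from a HEAD request, or it could send an If-Modified-Since request header and let the server answer 304 Not Modified. Either way the result is only as reliable as the dates the server reports.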

Re: PRUNE : need some help on pruning syntax.

2009-11-15 Thread Subhojit Roy
Hi, What you ask for is not possible using the prune command. Prune is meant to remove URLs that follow a specific pattern specified by the administrator. You will need to parse the HTML page so that the unwanted portions mentioned by you, i.e. div class=menu..., do not get included in the CONTENT field

Re: crawling / data aggregation - is nutch the right tool?

2009-11-15 Thread Otis Gospodnetic
Droids is much simpler if all you want to do is a little bit of crawling. Nutch is built to scale to many millions of web pages. If you need to crawl just a few sites, I'd suggest Droids. Otis

Re: Nutch does not crawl pages starting with ~

2009-11-15 Thread Subhojit Roy
Hi, I tried to crawl http://www.cs.umbc.edu. Used the default crawl-urlfilter.txt and added the line +^http://([a-z0-9]*\.)cs.umbc.edu/http://csee.umbc.edu/%7Evarish1/ at the end of the file. In the url directory I added www.cs.umbc.edu. It works fine and crawls ~varish1 and ~relan1. Are you
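A rough way to check whether a filter rule set would accept URLs containing ~ is sketched below. It assumes the usual convention for such filter files: each rule is a regular expression prefixed with + (accept) or - (reject), the first matching rule decides, and a URL matching no rule is rejected. The rule list is a hypothetical example, not the poster's actual file, and Python's re module is used here although Nutch itself is Java.

```python
import re

# Hypothetical rules in "+pattern" / "-pattern" form, checked top to bottom.
RULES = [
    r"-[?*!@=]",                            # reject URLs with query-like characters
    r"+^http://([a-z0-9]*\.)*cs.umbc.edu/",  # accept hosts under cs.umbc.edu
    r"-.",                                  # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule matching the URL is an accept rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False


if __name__ == "__main__":
    for candidate in (
        "http://www.cs.umbc.edu/~varish1/",
        "http://www.cs.umbc.edu/%7Evarish1/",
        "http://example.com/~someone/",
    ):
        print(candidate, accepts(candidate))
```

Under these assumed rules the tilde is not special, so a URL with ~ (or its %7E encoding) passes as long as the host matches and no earlier reject rule fires.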