common-terms.utf8 not found in class path when using Nutch from WAR file
Hello everybody,

I have run into a rather odd problem that occurs when deploying a Grails (http://grails.codehaus.org/) app as a WAR file in Tomcat. My app instantiates a NutchDocumentAnalyzer during startup as a Spring resource. The Nutch classes and config files are loaded from a JAR inside the lib directory of the app.

All of this works fine when running the app via 'grails run-app'. However, when running the app under Tomcat via the WAR file generated by 'grails war', I get the following stack trace (excerpt):

Caused by: org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.apache.nutch.analysis.NutchDocumentAnalyzer]: Constructor threw exception; nested exception is java.lang.NullPointerException
    at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:98)
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:87)
    at org.springframework.beans.factory.support.ConstructorResolver.autowireConstructor(ConstructorResolver.java:233)
    ... 63 more
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:61)
    at java.io.BufferedReader.<init>(BufferedReader.java:76)
    at java.io.BufferedReader.<init>(BufferedReader.java:91)
    at org.apache.nutch.analysis.CommonGrams.init(CommonGrams.java:152)
    at org.apache.nutch.analysis.CommonGrams.<init>(CommonGrams.java:52)
    at org.apache.nutch.analysis.NutchDocumentAnalyzer$ContentAnalyzer.<init>(NutchDocumentAnalyzer.java:64)
    at org.apache.nutch.analysis.NutchDocumentAnalyzer.<init>(NutchDocumentAnalyzer.java:55)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:83)
    ... 65 more

The exception is raised at line 152 of org.apache.nutch.analysis.CommonGrams because the common-terms.utf8 file cannot be found. However, this file is located at the root level of the nutch.jar in the lib directory, the same JAR that contains the classes themselves.

I have also tried copying the file to TOMCAT/webapps/MY_APP/WEB-INF/classes, TOMCAT/webapps/MY_APP/WEB-INF/ and TOMCAT/webapps/MY_APP/WEB-INF/lib, all to no avail.

Does anybody know what could be causing this?

--
Best regards,
Bjoern Wilmsmann
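A NullPointerException thrown from the java.io.Reader constructor is consistent with a null reader or stream being handed to BufferedReader, and a null is what ClassLoader.getResourceAsStream returns when a class-path lookup fails. The following is a minimal diagnostic sketch, not Nutch code: the class name ResourceProbe and its methods are hypothetical, and it only assumes that the file is looked up by name through a class loader. It reports which class loaders can see the resource and fails with a readable message instead of a bare NullPointerException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for diagnosing class-path lookups; not part of Nutch.
public class ResourceProbe {

    /** Opens the named class-path resource, or fails with a descriptive IOException. */
    public static Reader openResource(ClassLoader loader, String name) throws IOException {
        InputStream in = loader.getResourceAsStream(name);
        if (in == null) {
            // getResourceAsStream signals "not found" by returning null.
            throw new IOException("Resource not found on class path: " + name);
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    /** Reports whether the thread context loader and this class's own loader see the resource. */
    public static String describe(String name) {
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        ClassLoader own = ResourceProbe.class.getClassLoader();
        boolean viaContext = context != null && context.getResource(name) != null;
        boolean viaOwn = own != null && own.getResource(name) != null;
        return "context=" + viaContext + ", own=" + viaOwn;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "common-terms.utf8";
        System.out.println(name + ": " + describe(name));
    }
}
```

Calling ResourceProbe.describe("common-terms.utf8") from inside the web app during startup, once under 'grails run-app' and once under Tomcat, would show whether the two environments differ in which class loader can see the file.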
Re: Video search
Ed Whittaker wrote:

    I am working on a plugin that connects to a speech recognizer. Is there any interest in this in the Nutch community? [...] How much interest is there in indexing audio/video in this manner?

This sounds very interesting indeed. However, this approach might create a performance bottleneck caused by slow network connections and potential server overload (as you said, this approach would be quite expensive in terms of computing power).

--
Best regards,
Bjoern Wilmsmann
Unique IDs for URLs in crawl file
Hi everybody,

I need to attach a unique ID to each URL in the file processed by the Nutch crawler, so that I can identify the URLs when saving the parsed and indexed results in a database.

Does anybody have an idea of the best way to implement such a feature, and the best place for it?

--
Best regards,
Björn Wilmsmann
Re: Lucene query support in Nutch
Hi,

On 07.10.2006 at 17:40, Cristina Belderrain wrote:

    Let me remind you that all this must be done just to provide something that's already there: Nutch is built on top of Lucene, after all. If it's hard to understand why Lucene's capabilities were simply neutralized in Nutch, it's even harder to figure out why no choice was left to users by means of some configuration file.

I think this issue is rooted in the underlying philosophy of Nutch: Nutch was designed with a crawler and indexer of Google-like size in mind. Regular expressions and wildcard queries do not seem to fit this philosophy, as such queries would be far less efficient on a huge data set than simple boolean queries.

Nevertheless, I agree that there should be an option to choose the Lucene query engine instead of the Nutch-flavoured one, because Nutch has proven to be just as suitable for areas that do not require such efficient queries (intranet crawling, for instance) as it is for an all-out web indexing application.

--
Best regards,
Björn Wilmsmann
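The efficiency argument in the message above can be illustrated with a toy inverted index. This is only a sketch under simplifying assumptions, not how Lucene or Nutch implement queries: the class TermLookupSketch and its methods are hypothetical, and real engines apply optimizations that are omitted here. It shows that an exact term query is a single dictionary lookup, while a regular-expression query has to test the pattern against every term in the dictionary.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical toy index for illustration; not Lucene or Nutch code.
public class TermLookupSketch {

    // Term dictionary: term -> sorted set of ids of documents containing it.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    /** Splits the text into lower-cased terms and records the document id for each. */
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Exact term query: one lookup in the dictionary, independent of the other terms. */
    public Set<Integer> termQuery(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    /** Regex query: the pattern is tested against every term in the dictionary. */
    public Set<Integer> regexQuery(String regex) {
        Pattern pattern = Pattern.compile(regex);
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> entry : postings.entrySet()) {
            if (pattern.matcher(entry.getKey()).matches()) {
                result.addAll(entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TermLookupSketch index = new TermLookupSketch();
        index.add(1, "Nutch crawler");
        index.add(2, "Lucene crawling");
        System.out.println(index.termQuery("crawler"));
        System.out.println(index.regexQuery("crawl.*"));
    }
}
```

In this sketch the cost of regexQuery grows with the number of distinct terms in the index, which is the kind of scaling concern the message raises for web-sized data sets.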
Re: Lucene query support in Nutch
Hi everybody,

On 05/10/2006 05:44 Ravi Chintakunta wrote:

    public Hits search(String queryString, int numHits, String dedupField,
                       String sortField, boolean reverse) throws IOException {
      // Parse the query string with Lucene's own parser against the "content" field.
      org.apache.lucene.queryParser.QueryParser parser =
          new org.apache.lucene.queryParser.QueryParser("content",
              new org.apache.lucene.analysis.standard.StandardAnalyzer());
      org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);
      return translateHits(
          optimizer.optimize(luceneQuery, luceneSearcher, numHits, sortField, reverse),
          dedupField, sortField);
    }

This seems to be a good approach. I have not yet tried it out in detail. However, the method optimize() in LuceneQueryOptimizer only takes a BooleanQuery as its query argument, so the line 'return translateHits...' would cause a compile error, wouldn't it?

--
Best regards,
Björn Wilmsmann
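The compile-time concern raised above can be reproduced without Lucene on the class path. The sketch below uses stand-in classes named Query and BooleanQuery and a stand-in optimize() method; all of them are hypothetical placeholders, not the real Lucene or Nutch types. It shows why a value typed as the general Query is rejected by a method that accepts only BooleanQuery, and one possible workaround: wrapping the parsed query in a BooleanQuery when it is not one already.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; NOT the real Lucene or Nutch classes.
public class OptimizeArgumentSketch {

    /** Placeholder for a general query type. */
    public static class Query {
        final String text;
        public Query(String text) { this.text = text; }
    }

    /** Placeholder for a query made up of clauses. */
    public static class BooleanQuery extends Query {
        final List<Query> clauses = new ArrayList<>();
        public BooleanQuery() { super("boolean"); }
        public void add(Query clause) { clauses.add(clause); }
    }

    /** Placeholder for a method that accepts only BooleanQuery; returns the clause count. */
    public static int optimize(BooleanQuery query) {
        return query.clauses.size();
    }

    /** Returns the query unchanged if it is a BooleanQuery, otherwise wraps it in one. */
    public static BooleanQuery asBooleanQuery(Query query) {
        if (query instanceof BooleanQuery) {
            return (BooleanQuery) query;
        }
        BooleanQuery wrapper = new BooleanQuery();
        wrapper.add(query);
        return wrapper;
    }

    public static void main(String[] args) {
        Query parsed = new Query("forum");
        // optimize(parsed);  // would not compile: a Query is not necessarily a BooleanQuery
        System.out.println(optimize(asBooleanQuery(parsed)));
    }
}
```

Whether wrapping is appropriate for the real LuceneQueryOptimizer is not confirmed by this thread; the sketch only demonstrates the typing issue.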
Re: How do I write a nutch query.
Hey,

I have run into the same problem, too. Sometimes Nutch won't return results for queries although there clearly are pages containing the search term. I agree that this must have something to do with Nutch scoring; however, I have not yet found out how to change this behaviour.

On 08.08.2006 at 14:57, Fred Tyre wrote:

    How do I do a search in Nutch? If I go to google.com, I just type in the keyword(s) that I am looking for. Is this not the case with Nutch, or do I have to change the default configuration to enable that ability?

    Example test case: I enter "forum" on the Nutch website and click Search, or I run the following command line:

    bin/nutch org.apache.nutch.searcher.NutchBean forum

    In both cases it returns 0 results. However, if I go into Luke and run a content search on "forum", I get 2 results.

    I looked in your FAQ for this topic and could not find the question/answer. I would think that the above question would be asked more frequently than "Common words are saturating my search results." or "How can I influence Nutch scoring?".

    Please help. I have asked this kind of question before and not gotten a response.

    Sincerely,
    Fred

    Fred Tyre
    Information Services
    Heartland Communications, Inc.
    515-574-2147

--
Best regards,
Björn Wilmsmann