common-terms.utf8 not found in class path when using Nutch from WAR file

2008-01-28 Thread Björn Wilmsmann

Hello everybody,

I have run into a rather weird problem that occurs when deploying a  
Grails (http://grails.codehaus.org/) app as a WAR file in Tomcat. My  
app instantiates a NutchDocumentAnalyzer during startup as a Spring  
resource. The Nutch classes and config files are loaded from a JAR  
inside the lib directory of the app.
All of this works fine when running the app via 'grails run-app'.  
However, when running the app under Tomcat via the WAR file generated  
by 'grails war' I get the following stacktrace (excerpt):


Caused by: org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [org.apache.nutch.analysis.NutchDocumentAnalyzer]: Constructor threw exception; nested exception is java.lang.NullPointerException
	at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:98)
	at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:87)
	at org.springframework.beans.factory.support.ConstructorResolver.autowireConstructor(ConstructorResolver.java:233)
	... 63 more
Caused by: java.lang.NullPointerException
	at java.io.Reader.<init>(Reader.java:61)
	at java.io.BufferedReader.<init>(BufferedReader.java:76)
	at java.io.BufferedReader.<init>(BufferedReader.java:91)
	at org.apache.nutch.analysis.CommonGrams.init(CommonGrams.java:152)
	at org.apache.nutch.analysis.CommonGrams.<init>(CommonGrams.java:52)
	at org.apache.nutch.analysis.NutchDocumentAnalyzer$ContentAnalyzer.<init>(NutchDocumentAnalyzer.java:64)
	at org.apache.nutch.analysis.NutchDocumentAnalyzer.<init>(NutchDocumentAnalyzer.java:55)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:83)
	... 65 more

This is caused by the common-terms.utf8 file not being found at line
152 of org.apache.nutch.analysis.CommonGrams. However, this file is
located at the root level of the nutch.jar in the lib directory, the
same JAR that contains the classes themselves. I have also tried
copying the file to TOMCAT/webapps/MY_APP/WEB-INF/classes,
TOMCAT/webapps/MY_APP/WEB-INF/ and TOMCAT/webapps/MY_APP/WEB-INF/lib,
all to no avail.


Does anybody know what this could possibly be caused by?
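
To narrow a problem like this down, it can help to ask each candidate class loader directly whether it can see the file. The following is only a diagnostic sketch under assumptions: the class ResourceProbe and its methods are hypothetical names (not part of Nutch, Grails or Tomcat), and it assumes the file is resolved through a class loader, which the stack trace suggests but does not prove. The second method shows how a missing resource can be reported with a clear message instead of surfacing as a NullPointerException inside the Reader constructor.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical diagnostic helper; not part of Nutch, Grails or Tomcat. */
final class ResourceProbe {

    /** Returns the first URL under which a candidate loader sees the resource, or null. */
    static URL locate(String name) {
        ClassLoader[] candidates = {
            Thread.currentThread().getContextClassLoader(), // usually the webapp loader under Tomcat
            ResourceProbe.class.getClassLoader(),           // loader that defined this class
            ClassLoader.getSystemClassLoader()              // JVM application class path
        };
        for (ClassLoader loader : candidates) {
            if (loader == null) {
                continue; // a loader reference may legitimately be null
            }
            URL url = loader.getResource(name);
            if (url != null) {
                return url;
            }
        }
        return null;
    }

    /** Opens the resource via the context class loader, failing with a descriptive exception. */
    static BufferedReader open(String name) throws FileNotFoundException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        InputStream in = (loader == null) ? null : loader.getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException(name + " not found on class path");
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

Calling ResourceProbe.locate("common-terms.utf8") once under 'grails run-app' and once inside the deployed WAR would show whether the two environments resolve the file differently, and from which location.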

--
Best regards,
Bjoern Wilmsmann







Re: Video search

2007-03-22 Thread Björn Wilmsmann


Ed Whittaker wrote:

I am working on a plugin that connects to a speech recognizer. Is there any
interest in this in the Nutch community?


[...]


How much interest is there in indexing audio/video in this manner?


This sounds very interesting indeed. However, this approach might
create a performance bottleneck caused by slow network connections
and potential server overload (as you said, this approach would be
quite expensive in terms of computing power).


--
Best regards,
Bjoern Wilmsmann







Unique IDs for URLs in crawl file

2006-11-21 Thread Björn Wilmsmann


Hi everybody,

I need to attach a unique ID to each URL in the file processed by the
Nutch crawler, so that I can identify the URLs when saving the parsed
and indexed results in a database. Does anybody have an idea of the
best way and place to implement such a feature?
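
To make the question concrete, here is a minimal sketch of one possible approach, which is an assumption and not something Nutch provides: derive the ID deterministically from the URL string itself, so that the same URL always maps to the same database key. The class UrlIds and the method idFor are hypothetical names, and SHA-256 is just one reasonable choice of digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a stable ID from a URL string. Not a Nutch API. */
final class UrlIds {

    /** Returns the SHA-256 digest of the URL as 64 lowercase hex characters. */
    static String idFor(String url) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to ship SHA-256.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Note that this treats URLs as plain strings: two spellings of the same page (for example with and without a trailing slash) get different IDs unless the URLs are normalized before hashing.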


--
Best regards,
Björn Wilmsmann




Re: Lucene query support in Nutch

2006-10-07 Thread Björn Wilmsmann


Hi,

On 07.10.2006 at 17:40, Cristina Belderrain wrote:


Let me remind you that all this must be done just to provide something
that's already there: Nutch is built on top of Lucene, after all. If
it's hard to understand why Lucene's capabilities were simply
neutralized in Nutch, it's even harder to figure out why no choice was
left to users by means of some configuration file.


I think this issue is rooted in the underlying philosophy of Nutch:
it was designed with a crawler and indexer on the scale of Google (or
similar) in mind. Regular expressions and wildcard queries do not fit
this philosophy, as they would be far less efficient on a huge data
set than simple boolean queries.
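
The efficiency argument can be illustrated with a toy inverted index. This is a sketch only: ToyIndex and its methods are hypothetical names, and the code models the general idea rather than Lucene's or Nutch's actual internals. An exact term costs a single dictionary lookup, whereas a prefix (wildcard) query must first be expanded into every matching term, each of which contributes its own posting list.

```java
import java.util.Collections;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index for illustration; not how Lucene or Nutch store terms. */
final class ToyIndex {

    // Sorted term dictionary: term -> IDs of the documents containing it.
    private final TreeMap<String, Set<Integer>> postings = new TreeMap<>();

    void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Exact term query: one lookup in the dictionary. */
    Set<Integer> exact(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    /** Expands a prefix into all dictionary terms that start with it. */
    SortedMap<String, Set<Integer>> expand(String prefix) {
        return postings.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    /** Prefix query: the union of the posting lists of every expanded term. */
    Set<Integer> prefix(String prefix) {
        Set<Integer> result = new TreeSet<>();
        for (Set<Integer> docs : expand(prefix).values()) {
            result.addAll(docs);
        }
        return result;
    }
}
```

On a web-scale term dictionary a short prefix can expand into a very large number of terms, which is where the extra cost comes from; on a small intranet index the same expansion is usually cheap.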


Nevertheless, I agree that there should be an option to choose the
Lucene query engine instead of the Nutch-flavoured one, because Nutch
has proven to be just as suitable for areas that do not require such
efficient queries (intranet crawling, for instance) as for an all-out
web indexing application.


--
Best regards,
Björn Wilmsmann




Re: Lucene query support in Nutch

2006-10-05 Thread Björn Wilmsmann


Hi everybody,


On 05/10/2006 05:44 Ravi Chintakunta wrote:


public Hits search(String queryString, int numHits,
                   String dedupField, String sortField, boolean reverse)
    throws IOException {

  org.apache.lucene.queryParser.QueryParser parser =
      new org.apache.lucene.queryParser.QueryParser("content",
          new org.apache.lucene.analysis.standard.StandardAnalyzer());

  org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);

  return translateHits(
      optimizer.optimize(luceneQuery, luceneSearcher, numHits,
                         sortField, reverse),
      dedupField, sortField);
}


This seems to be a good approach. I have not yet tried it out in
detail; however, the method optimize() in LuceneQueryOptimizer only
accepts a BooleanQuery as an argument, so the line 'return
translateHits...' would cause a compile error, wouldn't it?



--
Best regards,
Björn Wilmsmann




Re: How do I write a nutch query.

2006-08-08 Thread Björn Wilmsmann


Hey,

I have run into the same problem. Sometimes Nutch returns no results
for a query even though the index clearly contains pages with the
search term. I agree that this must have something to do with Nutch
scoring; however, I have not yet found out how to change this
behaviour.


On 08.08.2006 at 14:57, Fred Tyre wrote:



How do I do a search in Nutch?
If I go to google.com, I just type in the keyword(s) that I am looking for.

Is this not the case with Nutch, or do I have to change the default
configuration to enable that ability?

Example test case...
I enter forum on the Nutch website and click Search or I run the
following command line...
bin/nutch org.apache.nutch.searcher.NutchBean forum

In both cases it returns 0 Results.

However, if I go into Luke and run a content search on forum, I get 2
results.

I looked on your FAQ for this topic and could not find the question/answer.

I would think that the above question would be more frequently asked than
"Common words are saturating my search results." or "How can I influence
Nutch scoring?".

Please, help.

I have asked this kind of question before and not gotten a response.

Sincerely,
Fred




   Fred Tyre
   Information Services
   Heartland Communications, Inc.
   515-574-2147
   [EMAIL PROTECTED]









--
Best regards,
Björn Wilmsmann

