Malaga-fi Finnish plugin for Nutch

2010-04-12 Thread Hannu Väisänen
Malaga-fi is a Nutch plugin for indexing documents written in Finnish.


Malaga-fi analyses words morphologically, converts them to a base form
(that you find in dictionaries) and indexes the base forms, so that
you find all inflections of a word by just searching for the base
form.

To use an English example, if you search for the word give you find
all documents that have give, gives, gave, given, or giving.
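
The idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the lemma table, the class name BaseFormIndexDemo and its methods are hypothetical, and Malaga-fi derives base forms from the Suomimalaga morphology rather than from a hand-written table.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BaseFormIndexDemo {
    // Hypothetical toy lemma table; a real analyser computes base forms.
    static final Map<String, String> BASE_FORMS = new HashMap<>();
    static {
        BASE_FORMS.put("gives", "give");
        BASE_FORMS.put("gave", "give");
        BASE_FORMS.put("given", "give");
        BASE_FORMS.put("giving", "give");
    }

    // Map a word to its base form; unknown words are kept as they are.
    static String baseForm(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return BASE_FORMS.getOrDefault(w, w);
    }

    // Build an inverted index from base form to document ids.
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> idx = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : docs.get(id).split("\\s+")) {
                if (token.isEmpty()) continue;
                idx.computeIfAbsent(baseForm(token), k -> new TreeSet<>()).add(id);
            }
        }
        return idx;
    }

    // Look up the base form of the query word.
    static Set<Integer> search(Map<String, Set<Integer>> idx, String query) {
        return idx.getOrDefault(baseForm(query), Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "she gave a book", "he is giving advice", "nothing here");
        System.out.println(search(index(docs), "give")); // prints [0, 1]
    }
}
```

Because both the documents and the query are reduced to base forms, the index stays small and a single lookup covers every inflection.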

This is very important in Finnish since Finnish words have literally
tens of thousands of inflected forms.


What you need:

1. Malaga programming language.
   http://home.arcor.de/bjoern-beutel/malaga/


2. Suomimalaga - Description of Finnish morphology written in Malaga.
   http://sourceforge.net/project/showfiles.php?group_id=156731

   Newest version:
   svn co https://voikko.svn.sourceforge.net/svnroot/voikko/trunk/suomimalaga


3. JNA library - Simplified native library access for Java.
   https://jna.dev.java.net/


4. Malaga-fi - Nutch plugin for documents written in Finnish.
   http://sourceforge.net/projects/malaga-fi/


5. Nutch: http://lucene.apache.org/nutch/


Malaga-fi is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.


Nutch, tomcat6, UTF-8 and query filter = crash

2010-04-01 Thread Hannu Väisänen
I have a query filter that works when I search from the command line

$ bin/nutch org.apache.nutch.searcher.NutchBean word


The query filter crashes when it calls native code when I search
through tomcat6 for a word that contains letters that are not in
ASCII.

The filter assumes that its input is in UTF-8, and
I have configured tomcat6 to use UTF-8 everywhere.

So either I have configured tomcat6 incorrectly or I should
configure Nutch to use UTF-8.


This is a log snippet from the file catalina.out (MorphologyHVQueryFilter is my
query filter):

$ less catalina.out

2010-03-31 19:30:05,455 INFO  NutchBean - query request from ::1
2010-03-31 19:30:05,466 INFO  NutchBean - query: k<E4>si   <-- This is not UTF-8.
2010-03-31 19:30:05,466 INFO  NutchBean - lang: fi
2010-03-31 19:30:05,472 INFO  NutchBean - searching for 20 raw hits
2010-03-31 19:30:05,472 INFO  MorphologyHVQueryFilter - MorphologyHVQueryFilter.filter käsi @ +(url:käsi^4.0 anchor:käsi^2.0 content:käsi title:käsi^1.5 host:käsi^2.0)
2010-03-31 19:30:05,472 INFO  MorphologyHVQueryFilter - clauses.length 1
2010-03-31 19:30:05,472 INFO  MorphologyHVQueryFilter - TermSet  [käsi] Clause käsi
2010-03-31 19:30:05,472 INFO  MorphologyHVQueryFilter - Word [käsi]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0115f0e5, pid=3840, tid=2392849264


As you can see, the INFO output from NutchBean is not in UTF-8.
Does that mean that I should configure Nutch or reconfigure tomcat6?

Do you have any ideas on what I should do next?
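
One way to narrow this down is to compare the bytes of the word in the two encodings and to print the JVM's default charset, which can differ between a shell session and the tomcat6 process. The sketch below uses only the standard library; the class name EncodingCheck is hypothetical, and it does not show what the filter actually passes to native code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    // Render bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "k\u00E4si"; // "käsi"
        // The default charset depends on how the JVM was started.
        System.out.println("default charset: " + Charset.defaultCharset());
        // ISO-8859-1 encodes 'ä' as the single byte E4.
        System.out.println("ISO-8859-1: " + hex(word.getBytes(StandardCharsets.ISO_8859_1)));
        // UTF-8 encodes 'ä' as the two bytes C3 A4.
        System.out.println("UTF-8:      " + hex(word.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A single E4 byte in a log line is what ISO-8859-1 output of "käsi" would look like, so running this check inside the webapp and from the command line may show whether the two environments use different default encodings.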


Re: Nutch 1.0 with tomcat6 and Firefox does not find all files on Fedora 12

2010-03-11 Thread Hannu Väisänen
On Wed, Feb 24, 2010 at 03:42:20PM +0200, Sami Siren wrote:
> Hannu,
>
> Do you use same set of QueryFilters both in the webapp and when
> running from shell?
>
> Perhaps your filter is not executed when running from cli? You can
> verify how your query is transformed by running bin/nutch
> org.apache.nutch.searcher.Query and entering some queries.

That seems to be the case:

Parsed: kuusi
Translated: +(url:kuusi^4.0 anchor:kuusi^2.0 content:kuusi title:kuusi^1.5 host:kuusi^2.0)
Query: Parsed: kuuden
Translated: +(url:kuuden^4.0 anchor:kuuden^2.0 content:kuuden title:kuuden^1.5 host:kuuden^2.0)
Query: 

If my filter were executed, Query should search for kuusi or kuu in
the first case (when inputting kuusi) and for kuusi in the second
case (when inputting kuuden).


Query knows that I have a plugin called malaga-fi:

DEBUG plugin.PluginRepository (PluginManifestParser.java:parsePlugin(187)) - plugin: id=malaga-fi name=Malaga Analysis Plug-in version=0.0.1 provider=joensuu.ficlass=null
DEBUG plugin.PluginRepository (PluginManifestParser.java:parseExtension(287)) - impl: point=org.apache.nutch.analysis.NutchAnalyzer class=fi.joensuu.joyds1.nutch.MalagaHVSuggestionAnalyzer
INFO  plugin.PluginRepository (PluginRepository.java:displayStatus(316)) -  Malaga Analysis Plug-in (malaga-fi)

How do I get my filter to execute?


Nutch 1.0 with tomcat6 and Firefox does not find all files on Fedora 12

2010-02-17 Thread Hannu Väisänen
I am using Nutch 1.0 to index files written in Finnish.

I have written a filter MorphologyHVSuggestionFilter that converts
Finnish words to a base form (that you find in dictionaries) and
I index just the base forms so that I find all inflected forms
when searching just for the base form.



When I search for the word 'kuka' like this

bin/nutch org.apache.nutch.searcher.NutchBean kuka
Total hits: 245

Tomcat6 also finds 245 hits.


But when I search for the word 'kuusi'

bin/nutch org.apache.nutch.searcher.NutchBean kuusi
Total hits: 212

Tomcat6 finds only 14 hits.



The tomcat6 log shows this for the word 'kuka':


2010-02-16 21:25:40,909 INFO  NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:25:40,909 INFO  NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:25:40,910 DEBUG MorphologyHVSuggestionFilter - Token1 (kuka,0,4)
2010-02-16 21:25:40,910 DEBUG MorphologyHVSuggestionFilter - Token2 (kuka,0,4)
2010-02-16 21:25:40,910 INFO  NutchBean - query: kuka
2010-02-16 21:25:40,910 INFO  NutchBean - query: kuka
2010-02-16 21:25:40,910 INFO  NutchBean - lang: fi
2010-02-16 21:25:40,910 INFO  NutchBean - lang: fi
2010-02-16 21:25:40,911 INFO  NutchBean - searching for 20 raw hits
2010-02-16 21:25:40,911 INFO  NutchBean - searching for 20 raw hits
2010-02-16 21:25:40,939 INFO  NutchBean - re-searching for 40 raw hits, query: kuka -site:
2010-02-16 21:25:40,939 INFO  NutchBean - re-searching for 40 raw hits, query: kuka -site:
2010-02-16 21:25:40,941 INFO  NutchBean - found 0 raw hits
2010-02-16 21:25:40,941 INFO  NutchBean - found 0 raw hits
2010-02-16 21:25:40,969 INFO  NutchBean - total hits: 245
2010-02-16 21:25:40,969 INFO  NutchBean - total hits: 245


The tomcat6 log shows this for the word 'kuusi':

2010-02-16 21:23:12,777 INFO  NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:23:12,777 INFO  NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token1 (kuusi,0,5)
2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token2 (kuu,0,5)
2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token2 (kuusi,0,0,posIncr=0)
2010-02-16 21:23:12,778 INFO  NutchBean - query: kuusi
2010-02-16 21:23:12,778 INFO  NutchBean - query: kuusi
2010-02-16 21:23:12,778 INFO  NutchBean - lang: fi
2010-02-16 21:23:12,778 INFO  NutchBean - lang: fi
2010-02-16 21:23:12,780 INFO  NutchBean - searching for 20 raw hits
2010-02-16 21:23:12,780 INFO  NutchBean - searching for 20 raw hits
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for url
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for anchor
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for content
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for title
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for host
2010-02-16 21:23:12,813 INFO  NutchBean - total hits: 14
2010-02-16 21:23:12,813 INFO  NutchBean - total hits: 14


The difference between the words 'kuka' and 'kuusi' is that 'kuka'
has only one base form (which happens to be 'kuka'), but 'kuusi'
has two base forms, 'kuusi' and 'kuu' ('moon'; 'si' is a
possessive suffix).

So is it possible that when I search through tomcat6, Nutch returns
only those files that have both words 'kuusi' and 'kuu'? If so, how
can I change this so that it finds files that have either 'kuusi' or 'kuu'
(or, of course, any other base forms of the word I search for :-)?
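
The difference between the two behaviours can be illustrated with a toy inverted index. This is only a sketch of my hypothesis, not of what Nutch does internally: the class BaseFormQueryDemo, its methods and the document ids are all made up, and "required together" could in practice mean an AND or a phrase.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BaseFormQueryDemo {
    // Documents that contain every term (terms required together).
    static Set<Integer> matchAll(Map<String, Set<Integer>> idx, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = idx.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs); // intersection
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    // Documents that contain at least one term (alternatives).
    static Set<Integer> matchAny(Map<String, Set<Integer>> idx, List<String> terms) {
        Set<Integer> result = new TreeSet<>();
        for (String term : terms) {
            result.addAll(idx.getOrDefault(term, Collections.<Integer>emptySet())); // union
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical postings: which documents contain which base form.
        Map<String, Set<Integer>> idx = new HashMap<>();
        idx.put("kuusi", new TreeSet<>(Arrays.asList(1, 2, 3)));
        idx.put("kuu", new TreeSet<>(Arrays.asList(3, 4)));
        List<String> baseForms = Arrays.asList("kuusi", "kuu");
        System.out.println(matchAll(idx, baseForms)); // prints [3]
        System.out.println(matchAny(idx, baseForms)); // prints [1, 2, 3, 4]
    }
}
```

If the webapp combines the base forms the first way, a much smaller hit count than on the command line is what one would expect.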


Malaga-fi is in SourceForge

2009-10-08 Thread Hannu Väisänen
Malaga-fi is a Nutch plugin for indexing documents written in Finnish.
It analyses words morphologically and indexes only the base forms
(that you find in dictionaries) so that you find all inflections of a
word by just searching for the base form.

Now malaga-fi is in SourceForge.


http://sourceforge.net/projects/malaga-fi/
https://malaga-fi.svn.sourceforge.net/svnroot/malaga-fi/


Malaga-fi - Finnish plugin for Nutch - a new version

2009-09-03 Thread Hannu Väisänen
I have released a new version of malaga-fi.

Changes from previous version: malaga-fi recognizes some
common spelling errors.



Malaga-fi is a Nutch plugin for indexing documents written in Finnish.


Malaga-fi analyses words morphologically, converts them to a base form
(that you find in dictionaries) and indexes the base forms, so that
you find all inflections of a word by just searching for the base
form.

To use an English example, if you search for the word give you find
all documents that have give, gives, gave, given, or giving.

This is very important in Finnish since Finnish words have literally
tens of thousands of inflected forms.


What you need:

1. Malaga programming language.
   http://home.arcor.de/bjoern-beutel/malaga/


2. Suomimalaga - Description of Finnish morphology written in Malaga.
   http://sourceforge.net/project/showfiles.php?group_id=156731

   Newest version:
   svn co https://voikko.svn.sourceforge.net/svnroot/voikko/trunk/suomimalaga


3. Malaga-Java - Java interface to Malaga.
   http://joyds1.joensuu.fi/programs/index.html

   Malaga-Java has two versions; both are in the same file.
   You need the thread-safe version.


4. Malaga-fi - Nutch plugin for documents written in Finnish.
   http://joyds1.joensuu.fi/programs/index.html


5. Nutch: http://lucene.apache.org/nutch/


Malaga-fi is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.


shouldFetch rejects all files

2009-08-24 Thread Hannu Väisänen
I am using Nutch to index some directories on my hard disk. It used to
work but now Nutch rejects all files.

The file logs/hadoop.log has this line

DEBUG crawl.Generator - -shouldFetch rejected [file name here] fetchTime=1253697537652, curTime=1251105859942

for every file in the directories I want to index.


How can I start to debug the problem?
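
One starting point is to decode the two numbers in the log line, assuming they are milliseconds since the Unix epoch. The sketch below uses only java.time; the class name FetchTimeCheck and the method untilDue are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

class FetchTimeCheck {
    // Time remaining from curTime until fetchTime (negative if already due).
    static Duration untilDue(long fetchTimeMs, long curTimeMs) {
        return Duration.between(Instant.ofEpochMilli(curTimeMs),
                                Instant.ofEpochMilli(fetchTimeMs));
    }

    public static void main(String[] args) {
        long fetchTime = 1253697537652L; // value from the log line
        long curTime = 1251105859942L;   // value from the log line
        System.out.println("curTime   = " + Instant.ofEpochMilli(curTime));
        System.out.println("fetchTime = " + Instant.ofEpochMilli(fetchTime));
        Duration d = untilDue(fetchTime, curTime);
        System.out.println("due in " + d.toMillis() / 86_400_000.0 + " days");
    }
}
```

With these values fetchTime lies just under 30 days after curTime. That would be consistent with entries that were already fetched once and are not yet due for refetching, although whether it matches the configured fetch interval is something to check in the configuration.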


How to tell Nutch that text files are text files?

2009-07-01 Thread Hannu Väisänen
I am using Nutch to index plain text and LaTeX files.

Nutch thinks that some of the files are of type
application/octet-stream.

I have added these lines to the file parse-plugins.xml:

   <mimeType name="application/octet-stream">
      <plugin id="parse-text" />
   </mimeType>

Now Nutch parses and indexes the files, but when I look at the search
results in Firefox/tomcat6, Nutch says that they are of type
application/octet-stream and does not show them.

How do I tell Nutch that it should show files of type
application/octet-stream as if they were text files?


Malaga-fi - Finnish plugin for Nutch

2009-06-28 Thread Hannu Väisänen
Malaga-fi is a Nutch plugin for indexing documents written in Finnish.


Malaga-fi analyses words morphologically, converts them to a base form
(that you find in dictionaries) and indexes the base forms, so that
you find all inflections of a word by just searching for the base
form.

To use an English example, if you search for the word give you find
all documents that have give, gives, gave, given, or giving.

This is very important in Finnish since Finnish words have literally
tens of thousands of inflected forms.


What you need:

1. Malaga programming language.
   http://home.arcor.de/bjoern-beutel/malaga/


2. Suomimalaga - Description of Finnish morphology written in Malaga.
   http://sourceforge.net/project/showfiles.php?group_id=156731

   Newest version:
   svn co https://voikko.svn.sourceforge.net/svnroot/voikko/trunk/suomimalaga


3. Malaga-Java - Java interface to Malaga.
   http://joyds1.joensuu.fi/programs/index.html

   Malaga-Java has two versions; both are in the same file.
   You need the thread-safe version.


4. Malaga-fi - Nutch plugin for documents written in Finnish.
   http://joyds1.joensuu.fi/programs/index.html


5. Nutch: http://lucene.apache.org/nutch/



Malaga-fi is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.


Re: Nutch can't find all files

2009-04-08 Thread Hannu Väisänen
On Wed, Apr 08, 2009 at 08:54:37AM +0200, Andrzej Bialecki wrote:
> Most likely this is related to the setting db.max.outlinks.per.page. The
> default is 1000. In case of file:// URLs this means that directory
> listings with more than 1000 entries will be truncated. Solution: simply
> increase the limit.

That helped a little. Now Nutch is fetching more files but it is still
skipping files.

I have more questions.

How does Nutch select the files it fetches?

Is it reading every file name in a directory and then selecting what it
fetches?

Is it possible to output the file names Nutch considers for fetching?

Where do I look in the code? (-:


Re: Nutch can't find all files

2009-04-07 Thread Hannu Väisänen
On Mon, Apr 06, 2009 at 11:18:59PM +0800, yanky young wrote:
> Maybe it is about Windows path names and file names.
> In Windows, path names and file names can have whitespace.

I am running Linux and I have no whitespace in my file names.


> log4j.logger.org.apache.nutch.protocol.file=DEBUG,cmdstdout

This did not show the files Nutch is skipping.



Re: Problem with Crawler and Parent Directories

2009-04-06 Thread Hannu Väisänen
On Thu, Apr 02, 2009 at 05:00:47PM +0200, Wolf Fischer wrote:
> +^file:///c:/test/
> -.

Try this:

+^file:///c:/test/
+^file:/c:/test/
-.


That is, add rules with both three slashes and one slash after file:.
That worked for me.
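
Why the extra rule matters can be checked with plain java.util.regex. The sketch below assumes that the rules are tried in order, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected; that is my reading of the filter's behaviour, and the class UrlFilterDemo and the sample URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilterDemo {
    // One filter line: '+' accepts, '-' rejects, the rest is a regex.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            rules.add(new Rule(line.charAt(0) == '+',
                               Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    // First matching rule wins; no match means the URL is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> original = parse("+^file:///c:/test/", "-.");
        List<Rule> both = parse("+^file:///c:/test/", "+^file:/c:/test/", "-.");
        String oneSlash = "file:/c:/test/a.txt";
        System.out.println(accepts(original, oneSlash)); // prints false
        System.out.println(accepts(both, oneSlash));     // prints true
    }
}
```

A URL written with a single slash never matches the three-slash pattern, so with the original rules it falls through to the final "-." line and is dropped.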


Nutch can't find all files

2009-04-02 Thread Hannu Väisänen
I am using Nutch to index my hard disk.

Nutch is skipping some files. They do not show up in the Nutch logs (as
"fetching file:..." lines), and it is as if Nutch does not notice that they
exist.

But when I moved one file that Nutch did not notice to a test
directory that had only a few files and indexed only that directory,
Nutch did index the file.

Any ideas on how I can debug the problem?