Malaga-fi Finnish plugin for Nutch
Malaga-fi is a Nutch plugin for indexing documents written in Finnish. Malaga-fi analyses words morphologically, converts them to a base form (that you find in dictionaries) and indexes the base forms, so that you find all inflections of a word by just searching for the base form. To use an English example, if you search for the word give you find all documents that have give, gives, gave, given, or giving. This is very important in Finnish since Finnish words have literally tens of thousands of inflected forms.

What you need:

1. Malaga programming language.
   http://home.arcor.de/bjoern-beutel/malaga/
2. Suomimalaga - Description of Finnish morphology written in Malaga.
   http://sourceforge.net/project/showfiles.php?group_id=156731
   Newest version:
   svn co https://voikko.svn.sourceforge.net/svnroot/voikko/trunk/suomimalaga
3. JNA library - Simplified native library access for Java.
   https://jna.dev.java.net/
4. Malaga-fi - Nutch plugin for documents written in Finnish.
   http://sourceforge.net/projects/malaga-fi/
5. Nutch:
   http://lucene.apache.org/nutch/

Malaga-fi is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
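
The base-form indexing described in this announcement can be illustrated with a toy sketch that uses the English example from the text. This is not malaga-fi's code: the class name, the hand-written lookup table and the tiny inverted index are all assumptions made for illustration, and the real plugin gets its base forms from Malaga and the Suomimalaga grammar.

```java
import java.util.*;

/** Toy illustration of base-form indexing; not malaga-fi's actual code. */
class BaseFormIndexDemo {
    // Hypothetical stand-in for the morphological analyser: a hand-written
    // table from inflected form to base form.
    static final Map<String, String> BASE_FORMS = Map.of(
        "give", "give", "gives", "give", "gave", "give",
        "given", "give", "giving", "give");

    /** Returns the base form, or the lower-cased word itself if unknown. */
    static String baseForm(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return BASE_FORMS.getOrDefault(w, w);
    }

    /** Builds an inverted index: base form -> ids of documents containing any inflection. */
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> idx = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : docs.get(id).split("\\W+")) {
                if (token.isEmpty()) continue;
                idx.computeIfAbsent(baseForm(token), k -> new TreeSet<>()).add(id);
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("She gave a talk", "He gives up", "Nothing here");
        // Searching for the base form finds both inflected documents.
        System.out.println(index(docs).get("give")); // prints [0, 1]
    }
}
```

Because only base forms are stored, the query side has to be reduced to base forms in the same way before the index is consulted.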
Nutch, tomcat6, UTF-8 and query filter = crash
I have a query filter that works when I search from the command line:

    $ bin/nutch org.apache.nutch.searcher.NutchBean word

The query filter crashes when it calls native code if I search through tomcat6 for a word that contains non-ASCII letters. The filter assumes that its input is in UTF-8, and I have configured tomcat6 to use UTF-8 everywhere. So either I have configured tomcat6 incorrectly or I should configure Nutch to use UTF-8.

This is a log snippet from the file catalina.out (MorphologyHVQueryFilter is my query filter):

    $ less catalina.out
    2010-03-31 19:30:05,455 INFO NutchBean - query request from ::1
    2010-03-31 19:30:05,466 INFO NutchBean - query: kE4si    <-- This is not UTF-8.
    2010-03-31 19:30:05,466 INFO NutchBean - lang: fi
    2010-03-31 19:30:05,472 INFO NutchBean - searching for 20 raw hits
    2010-03-31 19:30:05,472 INFO MorphologyHVQueryFilter - MorphologyHVQueryFilter.filter käsi @ +(url:käsi^4.0 anchor:käsi^2.0 content:käsi title:käsi^1.5 host:käsi^2.0)
    2010-03-31 19:30:05,472 INFO MorphologyHVQueryFilter - clauses.length 1
    2010-03-31 19:30:05,472 INFO MorphologyHVQueryFilter - TermSet [käsi] Clause käsi
    2010-03-31 19:30:05,472 INFO MorphologyHVQueryFilter - Word [käsi]
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x0115f0e5, pid=3840, tid=2392849264

As you can see, the INFO output from NutchBean is not in UTF-8. Does that mean that I should configure Nutch or reconfigure tomcat6? Do you have any ideas on what I should do next?
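
One possible reading of the log, offered as a guess only: 0xE4 is the ISO-8859-1 byte for ä, so the E4 in the logged query may mean the word was handled as Latin-1 somewhere on the way. The sketch below shows how the two encodings of "käsi" differ, and a strict check that could be run on a byte array before it is handed to native code. The class and method names are hypothetical and not part of Nutch or malaga-fi.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Sketch: tell valid UTF-8 apart from Latin-1 bytes before calling native code. */
class Utf8CheckDemo {
    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b & 0xFF));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String word = "k\u00e4si"; // "käsi"
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(hex(utf8) + "  valid UTF-8: " + isValidUtf8(utf8));
        // prints: 6B C3 A4 73 69  valid UTF-8: true
        System.out.println(hex(latin1) + "  valid UTF-8: " + isValidUtf8(latin1));
        // prints: 6B E4 73 69  valid UTF-8: false
    }
}
```

A check like this only turns a crash into a diagnosable error; it does not say where the wrong encoding was introduced.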
Re: Nutch 1.0 with tomcat6 and Firefox does not find all files on Fedora 12
On Wed, Feb 24, 2010 at 03:42:20PM +0200, Sami Siren wrote:
> Hannu,
>
> Do you use same set of QueryFilters both in the webapp and when running from shell? Perhaps your filter is not executed when running from cli? You can verify how your query is transformed by running bin/nutch org.apache.nutch.searcher.Query and entering some queries.

That seems to be the case:

    Parsed: kuusi
    Translated: +(url:kuusi^4.0 anchor:kuusi^2.0 content:kuusi title:kuusi^1.5 host:kuusi^2.0)
    Query:
    Parsed: kuuden
    Translated: +(url:kuuden^4.0 anchor:kuuden^2.0 content:kuuden title:kuuden^1.5 host:kuuden^2.0)
    Query:

If my filter was executed, Query should search for kuusi or kuu in the first case (when inputting kuusi) and for kuusi in the second case (when inputting kuuden).

Query knows that I have a plugin called malaga-fi:

    DEBUG plugin.PluginRepository (PluginManifestParser.java:parsePlugin(187)) - plugin: id=malaga-fi name=Malaga Analysis Plug-in version=0.0.1 provider=joensuu.ficlass=null
    DEBUG plugin.PluginRepository (PluginManifestParser.java:parseExtension(287)) - impl: point=org.apache.nutch.analysis.NutchAnalyzer class=fi.joensuu.joyds1.nutch.MalagaHVSuggestionAnalyzer
    INFO plugin.PluginRepository (PluginRepository.java:displayStatus(316)) - Malaga Analysis Plug-in (malaga-fi)

How do I get my filter to execute?
Nutch 1.0 with tomcat6 and Firefox does not find all files on Fedora 12
I am using Nutch 1.0 to index files written in Finnish. I have written a filter, MorphologyHVSuggestionFilter, that converts Finnish words to a base form (that you find in dictionaries), and I index just the base forms so that I find all inflected forms when searching for just the base form.

When I search for the word 'kuka' like this

    bin/nutch org.apache.nutch.searcher.NutchBean kuka
    Total hits: 245

Tomcat6 also finds 245 hits. But when I search for the word 'kuusi'

    bin/nutch org.apache.nutch.searcher.NutchBean kuusi
    Total hits: 212

Tomcat6 finds only 14 hits.

Tomcat6 log shows this for the word 'kuka':

    2010-02-16 21:25:40,909 INFO NutchBean - query request from 0:0:0:0:0:0:0:1
    2010-02-16 21:25:40,909 INFO NutchBean - query request from 0:0:0:0:0:0:0:1
    2010-02-16 21:25:40,910 DEBUG MorphologyHVSuggestionFilter - Token1 (kuka,0,4)
    2010-02-16 21:25:40,910 DEBUG MorphologyHVSuggestionFilter - Token2 (kuka,0,4)
    2010-02-16 21:25:40,910 INFO NutchBean - query: kuka
    2010-02-16 21:25:40,910 INFO NutchBean - query: kuka
    2010-02-16 21:25:40,910 INFO NutchBean - lang: fi
    2010-02-16 21:25:40,910 INFO NutchBean - lang: fi
    2010-02-16 21:25:40,911 INFO NutchBean - searching for 20 raw hits
    2010-02-16 21:25:40,911 INFO NutchBean - searching for 20 raw hits
    2010-02-16 21:25:40,939 INFO NutchBean - re-searching for 40 raw hits, query: kuka -site:
    2010-02-16 21:25:40,939 INFO NutchBean - re-searching for 40 raw hits, query: kuka -site:
    2010-02-16 21:25:40,941 INFO NutchBean - found 0 raw hits
    2010-02-16 21:25:40,941 INFO NutchBean - found 0 raw hits
    2010-02-16 21:25:40,969 INFO NutchBean - total hits: 245
    2010-02-16 21:25:40,969 INFO NutchBean - total hits: 245

Tomcat6 log shows this for the word 'kuusi':

    2010-02-16 21:23:12,777 INFO NutchBean - query request from 0:0:0:0:0:0:0:1
    2010-02-16 21:23:12,777 INFO NutchBean - query request from 0:0:0:0:0:0:0:1
    2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token1 (kuusi,0,5)
    2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token2 (kuu,0,5)
    2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token2 (kuusi,0,0,posIncr=0)
    2010-02-16 21:23:12,778 INFO NutchBean - query: kuusi
    2010-02-16 21:23:12,778 INFO NutchBean - query: kuusi
    2010-02-16 21:23:12,778 INFO NutchBean - lang: fi
    2010-02-16 21:23:12,778 INFO NutchBean - lang: fi
    2010-02-16 21:23:12,780 INFO NutchBean - searching for 20 raw hits
    2010-02-16 21:23:12,780 INFO NutchBean - searching for 20 raw hits
    2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for url
    2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for anchor
    2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for content
    2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for title
    2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing kuu kuusi for host
    2010-02-16 21:23:12,813 INFO NutchBean - total hits: 14
    2010-02-16 21:23:12,813 INFO NutchBean - total hits: 14

The difference between the words 'kuka' and 'kuusi' is that 'kuka' has only one base form (which happens to be 'kuka'), but 'kuusi' has two base forms, 'kuusi' and 'kuu' ('moon'; 'si' is a possessive suffix). So is it possible that when I search through tomcat6, Nutch returns only those files that contain both 'kuusi' and 'kuu'? If so, how can I change this so that it finds files that have either 'kuusi' or 'kuu' (or, of course, any other base form of the word I search for :-)?
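
The difference between requiring every base form and accepting any of them can be sketched on a toy index. The document ids below are made up and the class is hypothetical; this does not use the Nutch or Lucene query classes, and it does not claim to reproduce how Nutch built the query in the log. It only shows why an "all base forms" match returns far fewer documents than an "any base form" match.

```java
import java.util.*;

/** Toy comparison of AND versus OR matching over several base forms. */
class BaseFormQueryDemo {
    // Toy index: base form -> ids of documents containing it (made-up data).
    static final Map<String, Set<Integer>> INDEX = Map.of(
        "kuusi", Set.of(1, 2, 3),
        "kuu", Set.of(3, 4));

    /** Documents that contain every given base form (AND). */
    static Set<Integer> matchAll(List<String> baseForms) {
        Set<Integer> result = null;
        for (String form : baseForms) {
            Set<Integer> docs = INDEX.getOrDefault(form, Set.of());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new TreeSet<>() : result;
    }

    /** Documents that contain at least one of the given base forms (OR). */
    static Set<Integer> matchAny(List<String> baseForms) {
        Set<Integer> result = new TreeSet<>();
        for (String form : baseForms) {
            result.addAll(INDEX.getOrDefault(form, Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> forms = List.of("kuusi", "kuu");
        System.out.println(matchAll(forms)); // prints [3]
        System.out.println(matchAny(forms)); // prints [1, 2, 3, 4]
    }
}
```

In this picture the 14 hits would correspond to the narrow AND-style result and the 212 hits to the wider one, but that mapping is an inference from the hit counts, not something the logs confirm.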
Malaga-fi is in SourceForge
Malaga-fi is a Nutch plugin for indexing documents written in Finnish. It analyses words morphologically and indexes only the base forms (that you find in dictionaries) so that you find all inflections of a word by just searching for the base form.

Now malaga-fi is in SourceForge.

http://sourceforge.net/projects/malaga-fi/
https://malaga-fi.svn.sourceforge.net/svnroot/malaga-fi/
Malaga-fi - Finnish plugin for Nutch - a new version
I have released a new version of malaga-fi.

Changes from previous version: malaga-fi recognizes some common spelling errors.

Malaga-fi is a Nutch plugin for indexing documents written in Finnish. Malaga-fi analyses words morphologically, converts them to a base form (that you find in dictionaries) and indexes the base forms, so that you find all inflections of a word by just searching for the base form. To use an English example, if you search for the word give you find all documents that have give, gives, gave, given, or giving. This is very important in Finnish since Finnish words have literally tens of thousands of inflected forms.

What you need:

1. Malaga programming language.
   http://home.arcor.de/bjoern-beutel/malaga/
2. Suomimalaga - Description of Finnish morphology written in Malaga.
   http://sourceforge.net/project/showfiles.php?group_id=156731
   Newest version:
   svn co https://voikko.svn.sourceforge.net/svnroot/voikko/trunk/suomimalaga
3. Malaga-Java - Java interface to Malaga.
   http://joyds1.joensuu.fi/programs/index.html
   Malaga-Java has two versions; both are in the same file. You need the thread-safe version.
4. Malaga-fi - Nutch plugin for documents written in Finnish.
   http://joyds1.joensuu.fi/programs/index.html
5. Nutch:
   http://lucene.apache.org/nutch/

Malaga-fi is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
shouldFetch rejects all files
I am using Nutch to index some directories on my hard disk. It used to work but now Nutch rejects all files. File logs/hadoop.log has this

    DEBUG crawl.Generator - -shouldFetch rejected [file name here] fetchTime=1253697537652, curTime=1251105859942

for every file in directories I want to index. How can I start to debug the problem?
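
A first debugging step is to read the two numbers in the log line as milliseconds since the Unix epoch and compare them. The sketch below does that with the values from the log. The rule in isDue (a page is due only once its fetchTime has been reached) is an assumption about what shouldFetch checks, and the class name is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch: interpret fetchTime and curTime from the shouldFetch log line. */
class ShouldFetchDemo {
    /** Assumed rule: a page is due only once its scheduled fetchTime has been reached. */
    static boolean isDue(long fetchTimeMillis, long curTimeMillis) {
        return fetchTimeMillis <= curTimeMillis;
    }

    public static void main(String[] args) {
        long fetchTime = 1253697537652L; // value from the log line
        long curTime = 1251105859942L;   // value from the log line
        Duration wait = Duration.ofMillis(fetchTime - curTime);

        System.out.println("fetchTime = " + Instant.ofEpochMilli(fetchTime));
        System.out.println("curTime   = " + Instant.ofEpochMilli(curTime));
        System.out.println("due now   = " + isDue(fetchTime, curTime)); // prints due now   = false
        System.out.println("wait      = " + wait.toDays() + " days "
            + wait.toHoursPart() + " hours"); // prints wait      = 29 days 23 hours
    }
}
```

The scheduled fetch time lies just under 30 days after the current time, which would fit files that were fetched a few minutes earlier and are not yet due again under a 30-day re-fetch interval. Whether that interval applies to this installation is an assumption worth checking in the configuration.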
How to tell Nutch that text files are text files?
I am using Nutch to index plain text and LaTeX files. Nutch thinks that some of the files are of type application/octet-stream. I have put these lines in the file parse-plugins.xml:

    <mimeType name="application/octet-stream">
      <plugin id="parse-text" />
    </mimeType>

Now Nutch parses and indexes the files, but when I look at the search results in Firefox/tomcat6, Nutch says that they are of type application/octet-stream and does not show them. How do I tell Nutch that it should show files of type application/octet-stream as if they were text files?
Malaga-fi - Finnish plugin for Nutch
Malaga-fi is a Nutch plugin for indexing documents written in Finnish. Malaga-fi analyses words morphologically, converts them to a base form (that you find in dictionaries) and indexes the base forms, so that you find all inflections of a word by just searching for the base form. To use an English example, if you search for the word give you find all documents that have give, gives, gave, given, or giving. This is very important in Finnish since Finnish words have literally tens of thousands of inflected forms.

What you need:

1. Malaga programming language.
   http://home.arcor.de/bjoern-beutel/malaga/
2. Suomimalaga - Description of Finnish morphology written in Malaga.
   http://sourceforge.net/project/showfiles.php?group_id=156731
   Newest version:
   svn co https://voikko.svn.sourceforge.net/svnroot/voikko/trunk/suomimalaga
3. Malaga-Java - Java interface to Malaga.
   http://joyds1.joensuu.fi/programs/index.html
   Malaga-Java has two versions; both are in the same file. You need the thread-safe version.
4. Malaga-fi - Nutch plugin for documents written in Finnish.
   http://joyds1.joensuu.fi/programs/index.html
5. Nutch:
   http://lucene.apache.org/nutch/

Malaga-fi is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Re: Nutch can't find all files
On Wed, Apr 08, 2009 at 08:54:37AM +0200, Andrzej Bialecki wrote:
> Most likely this is related to the setting db.max.outlinks.per.page. The default is 1000. In case of file:// URLs this means that directory listings with more than 1000 entries will be truncated. Solution: simply increase the limit.

That helped a little. Now Nutch is fetching more files but it is still skipping files.

I have more questions. How does Nutch select the files it fetches? Does it read every file name in a directory and then select what it fetches? Is it possible to output the file names Nutch considers for fetching? Where do I look in the code? (-:
Re: Nutch can't find all files
On Mon, Apr 06, 2009 at 11:18:59PM +0800, yanky young wrote:
> Maybe it is about Windows path names and file names. In Windows, path names and file names can have whitespace.

I am running Linux and I have no whitespace in my file names.

> log4j.logger.org.apache.nutch.protocol.file=DEBUG,cmdstdout

This did not show the files Nutch is skipping.
Re: Problem with Crawler and Parent Directories
On Thu, Apr 02, 2009 at 05:00:47PM +0200, Wolf Fischer wrote:
> +^file:///c:/test/
> -.

Try this:

    +^file:///c:/test/
    +^file:/c:/test/
    -.

That is, add one rule with three slashes and one rule with a single slash after the file:. That worked for me.
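
Why the extra rule helps can be sketched with plain java.util.regex. The evaluation model used here is an assumption about how the URL filter rules behave: rules are tried in order, a leading + accepts, a leading - rejects, and the first rule whose pattern is found in the URL decides. The class name and the example file name a.txt are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of first-match-wins URL filtering with +/- prefixed regex rules. */
class UrlFilterDemo {
    /** Returns true if the first matching rule starts with '+', false otherwise. */
    static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return accept;
            }
        }
        return false; // no rule matched
    }

    public static void main(String[] args) {
        List<String> original = List.of("+^file:///c:/test/", "-.");
        List<String> suggested = List.of("+^file:///c:/test/", "+^file:/c:/test/", "-.");

        String oneSlash = "file:/c:/test/a.txt";
        String threeSlashes = "file:///c:/test/a.txt";

        System.out.println(accepts(original, oneSlash));      // prints false
        System.out.println(accepts(suggested, oneSlash));     // prints true
        System.out.println(accepts(suggested, threeSlashes)); // prints true
    }
}
```

With only the three-slash rule, a URL written with a single slash falls through to the catch-all reject rule, which is consistent with files under the directory being filtered out.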
Nutch can't find all files
I am using Nutch to index my hard disk. Nutch is skipping some files. They do not show up in the Nutch logs (like fetching file:...), and it is as if Nutch does not notice that they exist. But when I moved one file that Nutch did not notice to a test directory that had only a few files and indexed only that directory, Nutch did index the file. Any ideas on how I can debug the problem?