Malaga-fi Finnish plugin for Nutch
Malaga-fi is a Nutch plugin for indexing documents written in Finnish. It analyses words morphologically, converts them to their base form (the form found in dictionaries) and indexes the base forms, so that a search for the base form finds every inflection of a word. To use an English example, a search for the word "give" finds all documents that contain give, gives, gave, given, or giving. This is very important in Finnish, since a Finnish word can have tens of thousands of inflected forms.

What you need:

1. Malaga programming language.
   http://home.arcor.de/bjoern-beutel/malaga/
2. Suomimalaga: a description of Finnish morphology written in Malaga.
   http://sourceforge.net/project/showfiles.php?group_id=156731
   Newest version: svn co https://voikko.svn.sourceforge.net/svnroot/voikko/trunk/suomimalaga
3. JNA library: simplified native library access for Java.
   https://jna.dev.java.net/
4. Malaga-fi: Nutch plugin for documents written in Finnish.
   http://sourceforge.net/projects/malaga-fi/
5. Nutch.
   http://lucene.apache.org/nutch/

Malaga-fi is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
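The base-form indexing described in this announcement can be illustrated with a small sketch. It is not Malaga-fi's code: the `BASE_FORMS` table is a hypothetical stand-in for a real morphological analyser such as Malaga with Suomimalaga, and it reuses the English "give" example rather than Finnish words.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a morphological analyser.
# A real analyser would derive the base form instead of looking it up.
BASE_FORMS = {"gives": "give", "gave": "give", "given": "give", "giving": "give"}


def base_form(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    w = word.lower()
    return BASE_FORMS.get(w, w)


def build_index(docs):
    """Map each base form to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[base_form(token.strip(".,;:!?"))].add(doc_id)
    return index


def search(index, query):
    """Normalise the query the same way as the documents, then look it up."""
    return sorted(index.get(base_form(query), set()))


docs = {1: "She gave a talk.", 2: "He is giving up", 3: "Nothing here"}
index = build_index(docs)
print(search(index, "give"))
```

Because both the documents and the query pass through the same `base_form` step, a query for any inflected form matches the same documents as a query for the base form.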
Re: Nutch and EC2
Hi Yves,

I'm going to start some tests of Nutch + Solr on EC2 in a couple of days, so I will be able to give you some feedback on it soon. I'm actually more concerned about computing speed than about RAM or disk space, because I've experienced a consistent lack of performance in CPU-intensive tasks such as compiling large amounts of code.

S

--
Anyone proposing to run Windows on servers should be prepared to explain what they know about servers that Google, Yahoo, and Amazon don't. Paul Graham
A mathematician is a device for turning coffee into theorems. Paul Erdos (who obviously never met a sysadmin)

----- Original message -----
From: Yves Petinot ypeti...@cs.columbia.edu
To: nutch-user@lucene.apache.org
Sent: Fri, 9 April 2010, 16:49:47
Subject: Nutch and EC2

Hi,

I'm currently contemplating migrating my crawler cluster to EC2, and while this appears very tempting (infinite number of nodes), I've read about some potential limitations on the number of map/reduce tasks that can effectively run on any instance. For the L/XL instances in particular there doesn't seem to be any swap space set up (by default at least), so running more than 2 to 4 tasks per instance may not be feasible (assuming 8/16 GB of RAM and ~3 GB per JVM). As a comparison, my current setup with dedicated blade servers can easily sustain 5 to 10 map/reduce tasks per node.

I'm basically trying to understand whether this lack of swap space will effectively mean that I need an EC2 cluster with at least 2 to 3 times more instances than I have nodes in my current cluster.

Does anyone on the list have experience in transitioning to EC2, with this swap issue, and/or with how to spec out an EC2 cluster?

cheers,
-yp
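The sizing estimate in the original question can be reproduced with a short calculation. This is only a sketch: the 2 GB reserved for the operating system and Hadoop daemons and the 10-node cluster are assumptions chosen for illustration, not figures from the thread.

```python
import math


def tasks_per_instance(ram_gb, jvm_gb=3.0, reserved_gb=2.0):
    """Number of task JVMs that fit in RAM when there is no swap.

    reserved_gb is an assumed allowance for the OS and daemons.
    """
    return max(0, int((ram_gb - reserved_gb) // jvm_gb))


def instances_needed(nodes, tasks_per_node, ram_gb):
    """Instances required to match the task slots of an existing cluster."""
    per_instance = tasks_per_instance(ram_gb)
    if per_instance == 0:
        raise ValueError("instance too small to run a single task JVM")
    return math.ceil(nodes * tasks_per_node / per_instance)


# 8 GB and 16 GB instances, ~3 GB per JVM, as in the question.
print(tasks_per_instance(8), tasks_per_instance(16))

# Hypothetical 10-node cluster running 5 tasks per node, moved to 8 GB instances.
print(instances_needed(10, 5, 8))
```

Under these assumptions the 8 GB and 16 GB instances hold 2 and 4 task JVMs, and replacing 10 nodes that run 5 tasks each takes 25 of the 8 GB instances, i.e. 2.5 times as many machines, which is consistent with the "2 to 3 times" estimate in the question.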
Opinion crawling
Hi,

I am a newbie to Nutch. As part of learning it I have done some basic things such as intranet crawling, internet crawling, and trying the plugin example.

Our main objective is opinion crawling: we need to crawl only HTML pages that contain opinions, i.e. user reviews of products, items, movies, etc.

So my question is: can I determine during fetching itself whether an HTML page contains user opinions? If the page contains opinions, parse it; if not, discard it. This is our approach as of now.

Please share your comments and suggestions. Thanks in advance.

Best regards,
Naresh
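The keep-or-discard decision described in this message could look roughly like the sketch below. It is a naive keyword heuristic for illustration only: the cue list and the `min_hits` threshold are assumptions, it is not a Nutch API, and it necessarily runs on HTML that has already been fetched, since the page text is needed to make the decision.

```python
import re

# Assumed cue words that tend to appear on review pages (illustrative only).
OPINION_CUES = {"review", "reviews", "rating", "stars", "recommend",
                "disappointed", "pros", "cons"}


def strip_tags(html):
    """Crudely remove HTML tags, leaving the visible text."""
    return re.sub(r"<[^>]+>", " ", html)


def looks_like_opinion_page(html, min_hits=2):
    """Return True when at least min_hits distinct cue words occur as whole words."""
    words = set(re.findall(r"[a-z']+", strip_tags(html).lower()))
    return len(words & OPINION_CUES) >= min_hits


review_page = ("<html><body><h1>Customer review</h1>"
               "<p>I bought this camera and I recommend it. Rating: 4 stars</p>"
               "</body></html>")
plain_page = "<html><body><p>Opening hours and contact address.</p></body></html>"

print(looks_like_opinion_page(review_page), looks_like_opinion_page(plain_page))
```

A keyword filter like this will misclassify many pages; a trained text classifier would be the more robust choice, with the same keep-or-discard decision applied to its output.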
Re: Nutch and EC2
My experience on EC2 has been that the RAM and disk space are overkill, while the computing speed is lacking. I had been running my crawler on a 1 GB Slicehost slice, and when I moved it over to a medium high-CPU instance on EC2 (~2x the cost), the generate and update steps took 50% longer. Right now I'm looking at using Rackspace Cloud Servers instead.

Kevin

On Mon, Apr 12, 2010 at 5:37 AM, Stefano Cherchi stefanocher...@yahoo.it wrote:

> Hi Yves, I'm going to start some tests of Nutch + Solr on EC2 in a couple of days, so I will be able to give you some feedback on it soon. I'm actually more concerned about computing speed than about RAM or disk space, because I've experienced a consistent lack of performance in CPU-intensive tasks such as compiling large amounts of code.