Malaga-fi Finnish plugin for Nutch

2010-04-12 Thread Hannu Väisänen
Malaga-fi is a Nutch plugin for indexing documents written in Finnish.


Malaga-fi analyses words morphologically, converts them to a base form
(that you find in dictionaries) and indexes the base forms, so that
you find all inflections of a word by just searching for the base
form.

To use an English example, if you search for the word "give", you find
all documents that contain "give", "gives", "gave", "given", or "giving".
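
The idea of indexing base forms can be sketched in a few lines of Python.
This is only an illustration: the BASE_FORMS table is a hand-written
stand-in (an assumption) for the morphological analysis that Malaga and
Suomimalaga perform, and the function names are hypothetical, not part of
Malaga-fi or Nutch.

```python
from collections import defaultdict

# Hand-written stand-in for a morphological analyser (assumption: a real
# setup would obtain the base form from Malaga/Suomimalaga instead).
BASE_FORMS = {
    "give": "give", "gives": "give", "gave": "give",
    "given": "give", "giving": "give",
}

def base_form(word):
    """Map an inflected word to its base form; unknown words map to themselves."""
    w = word.lower()
    return BASE_FORMS.get(w, w)

def build_index(docs):
    """Build an inverted index keyed by base form."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            token = token.strip(".,;:!?")
            if token:
                index[base_form(token)].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing any inflection of the query."""
    return sorted(index.get(base_form(query), set()))

DOCS = {
    1: "She gave a talk.",
    2: "Giving is good.",
    3: "Nothing relevant here.",
}
INDEX = build_index(DOCS)
```

Because both the documents and the query are reduced to the base form,
search(INDEX, "give") and search(INDEX, "gives") return the same documents.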

This is very important in Finnish since Finnish words have literally
tens of thousands of inflected forms.


What you need:

1. Malaga programming language.
   http://home.arcor.de/bjoern-beutel/malaga/


2. Suomimalaga - Description of Finnish morphology written in Malaga.
   http://sourceforge.net/project/showfiles.php?group_id=156731

   Newest version:
   svn co https://voikko.svn.sourceforge.net/svnroot/voikko/trunk/suomimalaga


3. JNA library - Simplified native library access for Java.
   https://jna.dev.java.net/


4. Malaga-fi - Nutch plugin for documents written in Finnish.
   http://sourceforge.net/projects/malaga-fi/


5. Nutch: http://lucene.apache.org/nutch/


Malaga-fi is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.


Re: Nutch and EC2

2010-04-12 Thread Stefano Cherchi
Hi Yves,

I'm going to start some tests of Nutch+Solr on EC2 in a couple of days, so I
will be able to give you some feedback on it soon.

I'm actually a little concerned about computing speed, rather than RAM or disk
space, because I've experienced a consistent lack of performance in
CPU-intensive tasks such as compiling large amounts of code.

S
 -- 
Anyone proposing to run Windows on servers should be prepared to explain 
what they know about servers that Google, Yahoo, and Amazon don't.
Paul Graham


A mathematician is a device for turning coffee into theorems.
Paul Erdos (who obviously never met a sysadmin)



- Original message -
 From: Yves Petinot ypeti...@cs.columbia.edu
 To: nutch-user@lucene.apache.org
 Sent: Fri 9 April 2010, 16:49:47
 Subject: Nutch and EC2
 
 Hi,

 I'm currently contemplating migrating my crawler cluster to EC2, and
 while this appears very tempting (infinite number of nodes), I've read
 about some potential limitations in terms of the number of map/red tasks
 that can effectively run on any instance. Especially for the L/XL
 instances, there doesn't seem to be any swap space set up (by default at
 least), so that running more than 2 to 4 tasks per instance may not be
 feasible (assuming 8/16 G of RAM and ~3G per JVM). As a comparison, my
 current setup with dedicated blade servers can easily sustain 5 to 10
 map/red tasks per node. I'm basically trying to understand whether this
 lack of swap space will effectively mean that I need an EC2 cluster with
 at least 2 to 3 times more instances than I have nodes in my current
 cluster.

 Does anyone on the list have some experience in transitioning to EC2,
 maybe with respect to this swap issue and/or on how to spec out an EC2
 cluster?

 cheers,

 -yp
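
The sizing estimate in the question can be reproduced with a small
calculation. This is a rough sketch under stated assumptions: the 3 GB per
JVM and the 8/16 GB of RAM come from the message, while the 2 GB reserved
for the operating system and daemons is a guess introduced here to account
for the "2 to 4 tasks" figure. The function names are hypothetical.

```python
import math

def tasks_per_instance(ram_gb, jvm_gb=3.0, reserved_gb=2.0):
    """Number of task JVMs that fit in RAM when no swap is available.

    reserved_gb is an assumed allowance for the OS and daemons.
    """
    usable = ram_gb - reserved_gb
    return max(0, int(usable // jvm_gb))

def instances_per_node(tasks_per_node, ram_gb, jvm_gb=3.0, reserved_gb=2.0):
    """EC2 instances needed to match the task capacity of one existing node."""
    per_instance = tasks_per_instance(ram_gb, jvm_gb, reserved_gb)
    if per_instance == 0:
        raise ValueError("instance cannot hold even one task JVM")
    return math.ceil(tasks_per_node / per_instance)

# 8 GB and 16 GB instances, compared with nodes sustaining 5 and 10 tasks.
SMALL = tasks_per_instance(8)
LARGE = tasks_per_instance(16)
RATIO_LOW = instances_per_node(5, 8)
RATIO_HIGH = instances_per_node(10, 16)
```

With these assumptions an 8 GB instance holds 2 task JVMs and a 16 GB
instance holds 4, so matching a node that sustains 5 to 10 tasks takes
about 3 instances, in line with the "2 to 3 times" estimate above.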






Opinion crawling

2010-04-12 Thread NareshG

Hi,

I am a newbie to Nutch. As part of learning it, I have done some basic
things such as intranet crawling, internet crawling, and trying the plugin
example. Our main objective is opinion crawling: we need to crawl only
HTML pages that contain opinions, i.e. user reviews of products, items,
movies, etc. So my question is: can I determine during fetching itself
whether an HTML page contains user opinions? If the page contains
opinions, parse it; if not, discard it.

This is our approach as of now. Please share your comments and suggestions.

Thanks in advance.

Best regards,
Naresh
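
One possible shape for the check described above is a keyword heuristic
applied to the page text. Note that the page content is only available once
the page has been fetched, so a check like this can run after fetching at
the earliest. Everything in this sketch is an assumption made for
illustration: the cue list, the threshold, and the function names are
hypothetical and are not part of Nutch.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

# Assumed cue phrases; a real system would need a tuned list or a classifier.
OPINION_CUES = ("review", "rating", "stars", "i think",
                "recommend", "disappointed", "loved")

def opinion_score(html):
    """Count occurrences of cue phrases (matched as word prefixes) in the text."""
    parser = _TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    return sum(len(re.findall(r"\b" + re.escape(cue), text))
               for cue in OPINION_CUES)

def looks_like_opinion_page(html, threshold=3):
    """Heuristic decision: keep the page if enough cue phrases appear."""
    return opinion_score(html) >= threshold

REVIEW_PAGE = ("<html><body><h1>User reviews</h1><p>I think this phone is "
               "great, I recommend it. Rating: 4 stars.</p></body></html>")
PLAIN_PAGE = ("<html><body><p>Opening hours and contact address.</p>"
              "<script>var review = 'x';</script></body></html>")
```

Pages for which looks_like_opinion_page returns False would be discarded
and the rest passed on for parsing and indexing.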


Re: Nutch and EC2

2010-04-12 Thread Kevin Conor
My experience on EC2 has been that the RAM and disk space are overkill,
while the computing speed is lacking. I had been running my crawler on a
1GB Slicehost slice, and when I moved it over to a medium high-CPU instance
on EC2 (~2x the cost), the generate and update steps took 50% longer. Right
now I'm looking at using Rackspace cloud servers instead.

Kevin
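
The two figures above can be combined into a rough cost-per-job comparison.
This sketch assumes that cost scales with running time and that the 50%
slowdown applies to the whole job, which the message does not state (it
mentions only the generate and update steps). The function name is
hypothetical.

```python
def relative_cost_per_job(price_ratio, runtime_ratio):
    """Cost of one job on the new host relative to the old one.

    price_ratio:   new price per unit time / old price per unit time
    runtime_ratio: new running time / old running time
    """
    return price_ratio * runtime_ratio

# ~2x the price and 50% longer running time, as reported above.
EC2_VS_SLICE = relative_cost_per_job(2.0, 1.5)
```

Under these assumptions each job would cost roughly three times as much on
the EC2 instance as on the original slice.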
