Alexander,

Thanks for the recommendations. They are very valuable to me.
I will probably lean towards Nutch 2.x then. On a side note, I have not found many tutorials or wiki entries on Nutch 2.x yet. I will go ahead and start re-implementing my custom plugins for SOLR and Tika while waiting for the Nutch 2.x documentation.

Thanks,

Y T Thet

On Mon, Jul 9, 2012 at 2:08 AM, Alexander Aristov <[email protected]> wrote:

> Hi
>
> I would suggest you take a recent nutch version anyway. Not only has the
> indexer/web part changed, but a lot of bugs have been fixed and very handy
> cookies have been implemented. One such noticeable improvement was the
> replacement of many doc parsers with the 3rd party tika parser.
>
> Another good improvement since the old days is the fetcher. It works
> much better and doesn't hang in some situations.
>
> As for which version to choose, there are 2 versions:
>
> 1.5.x and 2.0
>
> The 2.0 version contains all the stuff from 1.5.x, but it uses a "database"
> instead of hdfs to keep data.
>
> Both versions send crawled data to solr, which provides indexing and
> searching capabilities.
>
> Unfortunately there is no easy way to migrate from 1.3 to the newest version,
> and the easiest way will be to re-implement your custom plugins for these
> versions.
>
> Best Regards
> Alexander Aristov
>
>
> On 8 July 2012 20:10, Ye T Thet <[email protected]> wrote:
>
> > Hi Folks,
> >
> > I am seeking a recommendation on whether I should use pre Nutch 1.3
> > (without Solr) or new Nutch (2.x) with Solr integration for my research
> > project.
> >
> > A little background information:
> > I developed a prototype for a web search engine during my post grad days
> > using Nutch as crawler, indexer and searcher. It was developed using
> > < Nutch 1.3, meaning not using Solr as searcher.
> >
> > I am continuing my research after a year on hold. I noticed huge
> > changes in Nutch, such as using SOLR as indexer and searcher, 2.x having
> > a changed crawling implementation, etc.
> >
> > The requirements for my project are similar to a typical web search engine
> > with lesser volume (less than 1 million pages for now). Additional
> > requirements are:
> >
> > 1. Language identification (used the language ID plug-in in Nutch with an
> > ngram profile VS new Nutch uses Tika for lang ID)
> > 2. Custom lucene analyzer for the analysis (done in Nutch for pre 1.3 VS
> > done in SOLR)
> >
> > I would appreciate suggestions/comments on whether I should continue with
> > pre 1.3 or new Nutch with SOLR.
> >
> > Thanks,
> >
> > Y T Thet

