Alexander,

Thanks for the recommendations. Those are very valuable to me.

I will probably lean towards Nutch 2.x, then.

On a side note, I have not found many tutorials or wiki entries on Nutch
2.x yet. I will go ahead and start re-implementing my custom plugins for
SOLR and Tika while waiting for the Nutch 2.x documentation.

Thanks,

Y T Thet

On Mon, Jul 9, 2012 at 2:08 AM, Alexander Aristov <
[email protected]> wrote:

> Hi
>
> I would suggest you take a recent Nutch version anyway. Not only has the
> indexer/web part changed, but a lot of bugs have been fixed and very handy
> features have been implemented. One such noticeable improvement was the
> replacement of many document parsers with the third-party Tika parser.
>
> Another good improvement since the old days is the fetcher. It works
> much better and no longer hangs in some situations.
>
> As for which version to choose, there are two:
>
> 1.5.x and 2.0
>
> Version 2.0 contains everything from 1.5.x, but it uses a "database"
> instead of HDFS to store data.
>
> Both versions send crawled data to Solr, which provides indexing and
> searching capabilities.
>
> Unfortunately, there is no easy way to migrate from 1.3 to the newest
> version, and the simplest approach will be to re-implement your custom
> plugins for these versions.
>
> Best Regards
> Alexander Aristov
>
>
> On 8 July 2012 20:10, Ye T Thet <[email protected]> wrote:
>
> > Hi Folks,
> >
> > I am seeking a recommendation on whether I should use pre-1.3 Nutch
> > (without Solr) or the new Nutch (2.x) with Solr integration for my
> > research project.
> >
> > Some background information:
> > I developed a prototype web search engine during my postgraduate days
> > using Nutch as the crawler, indexer and searcher. It was developed using
> > a Nutch version earlier than 1.3, meaning Solr was not used as the
> > searcher.
> >
> > I am continuing my research after a year on hold. I noticed huge
> > changes in Nutch, such as using SOLR as the indexer and searcher, and
> > 2.x having changed the crawling implementation.
> >
> > The requirements for my project are similar to those of a typical web
> > search engine, with a smaller volume (less than 1 million pages for now).
> > Additional requirements are:
> >
> > 1. Language identification (I used the language ID plug-in in Nutch with
> > n-gram profiles, vs. the new Nutch, which uses Tika for language ID)
> > 2. A custom Lucene analyzer for the analysis (done in Nutch before 1.3,
> > vs. done in SOLR)
> >
> > I would appreciate suggestions/comments on whether I should continue with
> > pre-1.3 Nutch or the new Nutch with SOLR.
> >
> > Thanks,
> >
> > Y T Thet
> >
>
