Thanks, Markus! I need some direction. Is it advisable to apply text
extraction before Nutch begins crawling a web page, or should text
extraction be applied to the data stored in HBase?

Also, please correct me if I am wrong: in HBase, does the column 'p: c'
hold the parsed text generated from the column 'f: cnt', which holds the
raw content?


On Thu, Sep 19, 2013 at 4:03 PM, Markus Jelsma
<[email protected]> wrote:

> Because you need proper text/data extraction tools such as the open source
> Boilerpipe (NUTCH-961) or other open source or commercial tools. It is, if
> you ask me, impossible to build a good search engine upon crawled data
> without proper text extraction and removal of boilerplate
> elements/blocks/widgets/bars.
>
> -----Original message-----
> > From: A Laxmi <[email protected]>
> > Sent: Thursday 19th September 2013 21:43
> > To: [email protected]
> > Subject: Re: Nutch with HBase examples?
> >
> > Thank you! I was wondering how you got the summary text below the title
> > crawled so well?
> > http://www.zwudi.com/immobilier/vente-immobiliere-appartement-maison
> >
> > When I crawled, the text summary below the title had a lot of junk
> > (navigation, footers, etc.)
> >
> >
> > On Thu, Sep 19, 2013 at 3:27 PM, lsroudi abdel <[email protected]> wrote:
> >
> > > > Yes, www.zwudi.com is a beta version for training.
> > > > On 19 Sep 2013 21:25, "A Laxmi" <[email protected]> wrote:
> > >
> > > > > Do you know of any search websites developed using (Nutch + HBase + Solr)?
> > > >
> > > >
> > > > > On Thu, Sep 19, 2013 at 3:21 PM, lsroudi abdel <[email protected]> wrote:
> > > >
> > > > > > Actually, I use Nutch with HBase, and Solr for search and indexing.
> > > > > > On 19 Sep 2013 21:13, "A Laxmi" <[email protected]> wrote:
> > > > >
> > > > > > > Can anyone give me some examples of search websites that use
> > > > > > > Nutch 2.2.1 with HBase as a backend?
> > > > > >
> > > > >
> > > >
> > >
> >
>
