Thanks, Markus! I need some direction. Is it advisable to apply text extraction before Nutch begins crawling a web page, or should text extraction be applied to the data already stored in HBase?
Also, please correct me if I am wrong: in HBase, does the column 'p:c' hold the parsed text generated from the column 'f:cnt', which holds the raw content?

On Thu, Sep 19, 2013 at 4:03 PM, Markus Jelsma <[email protected]> wrote:

> Because you need proper text/data extraction tools such as the open source
> Boilerpipe (NUTCH-961) or other open source or commercial tools. It is, if
> you ask me, impossible to build a good search engine upon crawled data
> without proper text extraction and removal of boilerplate
> elements/blocks/widgets/bars.
>
> -----Original message-----
> > From: A Laxmi <[email protected]>
> > Sent: Thursday 19th September 2013 21:43
> > To: [email protected]
> > Subject: Re: Nutch with HBase examples?
> >
> > Thank you! I was wondering how you got the summary text below the title
> > crawled so well?
> > http://www.zwudi.com/immobilier/vente-immobiliere-appartement-maison
> >
> > When I crawled, I got the text summary below the title with a lot of junk
> > (navigation, footers, etc.)
> >
> > On Thu, Sep 19, 2013 at 3:27 PM, lsroudi abdel <[email protected]> wrote:
> >
> > > Yes, www.zwudi.com is a beta version for training
> > > On 19 Sep 2013 21:25, "A Laxmi" <[email protected]> wrote:
> > >
> > > > Do you know of any search websites developed using (Nutch + HBase + Solr)?
> > > >
> > > > On Thu, Sep 19, 2013 at 3:21 PM, lsroudi abdel <[email protected]> wrote:
> > > >
> > > > > Actually I use Nutch with HBase, and Solr for search and indexing.
> > > > > On 19 Sep 2013 21:13, "A Laxmi" <[email protected]> wrote:
> > > > >
> > > > > > Can anyone give me some examples of search websites that use Nutch 2.2.1
> > > > > > with HBase as a backend?

