Hello, Eyeris -

Thanks for your response. I was able to get it working with Tika boilerpipe, but now I have a different problem: some of the crawled pages don't have the expected data. For some pages it brings back only the *Title* and skips all the content, and I am not sure in which special cases this happens. In my case I have two problems now:

1. When a page has an image and one or two lines of text, those lines are not extracted (the text is in a <p> tag).
2. Why is the *Title* added to the start of the *content*? Is there a way not to include it?
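To make the second point concrete, here is a minimal Python sketch of the post-processing I would otherwise have to do on my side. It assumes the title and the content are available as two separate strings after parsing. The function name `strip_leading_title` is hypothetical and is not part of Nutch or Tika, and I would still prefer a configuration option if one exists.

```python
def strip_leading_title(title: str, content: str) -> str:
    """Return content without a leading copy of title.

    Hypothetical helper, not a Nutch or Tika API. Whitespace in both
    strings is collapsed to single spaces before comparing.
    """
    title_norm = " ".join(title.split())
    content_norm = " ".join(content.split())
    n = len(title_norm)
    if n == 0:
        return content_norm
    # Case-insensitive comparison of the first n characters only.
    if content_norm[:n].lower() != title_norm.lower():
        return content_norm
    rest = content_norm[n:]
    # Do not cut in the middle of a word (title "Home" vs. "Homepage ...").
    if rest and rest[0].isalnum():
        return content_norm
    # Drop separators that commonly follow a title.
    return rest.lstrip(" -|:")


if __name__ == "__main__":
    print(strip_leading_title("About Us", "About Us We build search tools."))
```

This only hides the symptom after extraction; it does nothing for the first problem, where the short text is missing entirely.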
As an example of the first problem, see the following image: for the first URL it came back without any data.

[image: Inline image 1]

On Wed, Nov 15, 2017 at 8:57 AM, Eyeris Rodriguez Rueda <[email protected]> wrote:

> Hello.
>
> I am using tika boilerpipe with good results in approximately 2000 websites.
> Rushikesh, if tika boilerpipe is not working for you, maybe it is because
> you are not parsing documents with tika. Please check this configuration
> and tell us.
>
> Make sure that the tika plugin is activated in the plugin.includes property, then
> check:
>
> ***********************************************
> Use tika parser instead of parse-html.
>
> parse-plugins.xml
>
> <mimeType name="text/html">
>   <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="application/xhtml+xml">
>   <plugin id="parse-tika" />
> </mimeType>
> ***********************************************
>
> ***********************************************
> nutch-site.xml
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or
>   none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
>   ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> ****************************************
>
> ----- Original message -----
> From: "Markus Jelsma" <[email protected]>
> To: [email protected]
> Sent: Tuesday, 14 November 2017 17:40:08
> Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
>
> Hello Rushikesh - why is Boilerpipe not working for you? Are you having
> trouble getting it configured - it is really just setting a boolean value.
> Or does it work, but not to your satisfaction?
>
> The Bayan solution should work, theoretically, but just with a lot of
> tedious manual per-site configuration.
>
> Regards,
> Markus
>
> -----Original message-----
> > From:Rushikesh K <[email protected]>
> > Sent: Tuesday 14th November 2017 23:30
> > To: [email protected]
> > Cc: Sebastian Nagel <[email protected]>; [email protected]
> > Subject: Re: Removing header,Footer and left menus while crawling
> >
> > Hello,
> >
> > *Jorge*
> > Thanks for the response. Sorry for the confusion: I am using Nutch 1.13, and I
> > tried configuring Tika boilerpipe with this version, but it doesn't work
> > for me. You suggested writing my own parser, but I am not a Java developer.
> > If by chance you or anyone in the community has a working file, it would be
> > great if you could share it, because many people are facing this
> > issue (I came to know this when I googled).
> >
> > Mark Vega
> > We also tried the Bayan Group extractor plugin with Nutch 1.13, but this is also
> > not working. We followed the same steps. I can share the changes if you want
> > to take a look.
> >
> > I appreciate your quick suggestions!
> >
> > Thanks
> > Rushikesh
> >
> > On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <[email protected]> wrote:
> >
> > > Hello Rushikesh,
> > >
> > > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > > could use the Tika boilerpipe implementation. In nutch-site.xml you
> > > need to enable this feature with:
> > >
> > > <property>
> > >   <name>tika.extractor</name>
> > >   <value>boilerpipe</value>
> > >   <description>
> > >   Which text extraction algorithm to use. Valid values are: boilerpipe or
> > >   none.
> > >   </description>
> > > </property>
> > >
> > > And configure the proper extractor with
> > > the tika.extractor.boilerpipe.algorithm setting.
> > >
> > > This is not a perfect solution, but I've used it successfully in the past.
> > > Of course, your results will depend on the structure (markup) of the
> > > website.
> > >
> > > Another option could be to implement your own parser if you need to have more
> > > control over what to include/exclude from the HTML. You can take a look at
> > > this issue https://issues.apache.org/jira/browse/NUTCH-585 which contains
> > > some info and old patches.
> > >
> > > Best Regards,
> > > Jorge
> > >
> > > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[email protected]> wrote:
> > >
> > > > Hello Sebastian,
> > > > We are excited to be using Nutch 1.3 (with Solr 6.4) for crawling
> > > > our website and we are happy with the search results, but we have a
> > > > requirement to skip the header, footer, left menus and some other parts
> > > > of the page. Can you please guide us on how to exclude those parts? I was
> > > > trying various approaches found on Google, but nothing works for me.
> > > >
> > > > I appreciate your help in advance!
> > > > --
> > > > Regards
> > > > Rushikesh M
> > > > .Net Developer
> > >
> >
> > --
> > Regards
> > Rushikesh M
> > .Net Developer

--
Regards
Rushikesh M
.Net Developer

