Hello,

*Jorge*
Thanks for the response, and sorry for the confusion. I am using Nutch 1.13.
I also tried configuring Tika boilerpipe with this version, but it doesn't
work for me. You suggested writing our own parser, but I am not a Java
developer.
If you or anyone in the community has a working file, it would be great if
you could share it, because many people are facing this issue (I found this
out when I googled it).

Mark Vega
We also tried the Bayan Group extractor plugin with Nutch 1.13, but this is
also not working. We followed the same steps. I can share the changes if you
want to take a look.

I appreciate your quick suggestions!

Thanks
Rushikesh

On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <
[email protected]> wrote:

> Hello Rushikesh,
>
> Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> could use the Tika boilerpipe implementation. In nutch-site.xml you
> need to enable this feature with:
>
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> And configure the proper extractor with
> the tika.extractor.boilerpipe.algorithm setting.
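>
> A minimal sketch of that second setting, assuming the property name and
> values match what nutch-default.xml ships with in Nutch 1.13 (please check
> your own copy; ArticleExtractor here is only an illustrative choice):
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Assumed valid values are:
>   DefaultExtractor, ArticleExtractor or CanolaExtractor.
>   </description>
> </property>
>
> As far as I know, boilerpipe is only applied to documents that the
> parse-tika plugin actually handles, so this presumes HTML pages are routed
> to parse-tika rather than parse-html.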
>
> This is not a perfect solution, but I've used it successfully in the past.
> Of course, your results will depend on the structure (markup) of the
> website.
>
> Another option could be to implement your own parser if you need more
> control over what to include/exclude from the HTML. You can take a look at
> this issue https://issues.apache.org/jira/browse/NUTCH-585 which contains
> some info and old patches.
>
> Best Regards,
> Jorge
>
> On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[email protected]>
> wrote:
>
> > Hello Sebastian,
> > We are very excited about using Nutch 1.3 (with Solr 6.4) for crawling
> > our website, and we are happy with the search results, but we have a
> > requirement to skip the header, footer, left menus and some other parts
> > of the page. Can you please guide us on how to exclude those parts? I
> > tried various approaches found on Google, but nothing works for me.
> >
> > I appreciate your help in advance!
> > --
> > Regards
> > Rushikesh M
> > .Net Developer
> >
>



-- 
Regards
Rushikesh M
.Net Developer
