Hi,

Depending on how homogeneous your sites are, it can make sense to use Nutch
for this.
We actually wrote a plugin which extends HtmlParseFilter and uses XPath to
extract the interesting content.
The XPath expressions are defined in nutch-site.xml, so we don't have to
change the parsing code when the expressions change.
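
To give a rough idea of the approach, here is a minimal standalone sketch. It
is not our actual plugin code: the class name, the field names ("title",
"body") and the sample page are made up for illustration, and a plain Map
stands in for the expressions that would really be read from nutch-site.xml.
In a real plugin the DOM would be handed over by Nutch's parse filter
extension point rather than parsed by hand.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtractSketch {

    // Evaluates every configured XPath expression against the page DOM and
    // returns a map of field name -> extracted text.
    public static Map<String, String> extract(Document doc,
            Map<String, String> expressions) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : expressions.entrySet()) {
            // evaluate(String, Object) returns the string value of the
            // first matching node, or "" if nothing matches.
            fields.put(e.getKey(), xpath.evaluate(e.getValue(), doc).trim());
        }
        return fields;
    }

    // Helper for this sketch only: parses well-formed (X)HTML into a DOM.
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical expressions; in the plugin these come from the config.
        Map<String, String> expressions = new LinkedHashMap<>();
        expressions.put("title", "//title");
        expressions.put("body", "//div[@id='content']");

        Document doc = parse("<html><head><title>Example</title></head>"
                + "<body><div id=\"nav\">Menu</div>"
                + "<div id=\"content\">Main text</div></body></html>");

        // Prints "title = Example" and "body = Main text"; the navigation
        // div is ignored because no expression selects it.
        extract(doc, expressions).forEach(
                (name, text) -> System.out.println(name + " = " + text));
    }
}
```

Because the expressions live in configuration rather than code, supporting a
new site layout only means adding or changing an expression.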

Regards,

Hannes

On Fri, Oct 29, 2010 at 10:47 AM, Andrew McCombe <[email protected]> wrote:

> Hi
>
> Thanks for the offer Erlend, but unfortunately I'm not a Java developer. I
> opted to abandon doing it with Nutch and/or Tika in favour of a Python
> approach to retrieve, parse and index the content needed.
>
> I think that Nutch would have been too heavyweight for my purposes anyway.
>
> Thanks again
>
> Andrew
>
> On 29 October 2010 09:35, Erlend Garåsen <[email protected]> wrote:
>
> > On 28.10.10 13.01, Andrzej Bialecki wrote:
> >
> >> Hopefully this kind of functionality will be improved soon through the
> >> use of the Boilerpipe functionality in Tika - but this is still to be
> >> integrated into both Tika and Nutch.
> >>
> >
> > Interesting. I also missed this kind of functionality in Nutch, but I
> > found a workaround by adding a few lines to DOMContentUtils.java.
> >
> > I understand that Tika will replace parse-html, and therefore I wasn't
> > sure whether I should file a Jira issue about this missing functionality.
> > Anyway, I have found a simple way to stop parsing some content on a web
> > site, for example content between HTML comments
> > (<!-- stopindex --> .... <!-- startindex -->) and inside some div tags.
> >
> > Andrew, I can send you the lines I added so it will be easier for you to
> > write your filter. Or simply add the lines to DOMContentUtils in case
> > you're using parse-html and adapt the code.
> >
> > Erlend
> >
> > --
> > Erlend Garåsen
> > Center for Information Technology Services
> > University of Oslo
> > P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> > Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968,
> > VIP: 31050
> >
>
