Hi, depending on how homogeneous your sites are, it does make sense to use Nutch for this. We actually wrote a plugin that implements the HtmlParseFilter extension point and uses XPath to extract the interesting content. The XPath expressions are defined in nutch-site.xml, so we don't have to change the code when the parsing rules change.
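
To give an idea of the approach, here is a minimal standalone sketch. It is not the plugin code itself. It uses only the JDK's javax.xml.xpath classes, and a java.util.Properties object stands in for the values that would come from nutch-site.xml. The property prefix "extract.xpath.", the class name and the sample page are made up for illustration. In a real parse filter the expressions would be evaluated against the DOM fragment that Nutch hands to the filter rather than a document parsed here.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathExtractSketch {

    // Hypothetical property prefix; the real property names are not given in this thread.
    static final String PREFIX = "extract.xpath.";

    /** Evaluates every configured XPath expression and returns field name -> extracted text. */
    static Map<String, String> extract(Document doc, Properties conf) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new TreeMap<>(); // sorted for stable output
        for (String key : conf.stringPropertyNames()) {
            if (key.startsWith(PREFIX)) {
                String field = key.substring(PREFIX.length());
                // evaluate(String, Object) returns the string value of the first matching node
                fields.put(field, xpath.evaluate(conf.getProperty(key), doc).trim());
            }
        }
        return fields;
    }

    /** Parses well-formed XHTML; real-world HTML would need a lenient HTML parser first. */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the expressions that would be configured in nutch-site.xml.
        Properties conf = new Properties();
        conf.setProperty(PREFIX + "title", "//div[@id='content']/h1");
        conf.setProperty(PREFIX + "body", "//div[@id='content']/p");

        String page = "<html><body>"
                + "<div id='nav'>Home | About</div>"
                + "<div id='content'><h1>Title</h1><p>Body text.</p></div>"
                + "</body></html>";

        // Only the content div is extracted; the navigation div is ignored.
        for (Map.Entry<String, String> e : extract(parse(page), conf).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Keeping the expressions in configuration means a site with a different layout only needs new XPath values, not a rebuilt plugin.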
Regards,
Hannes

On Fri, Oct 29, 2010 at 10:47 AM, Andrew McCombe <[email protected]> wrote:

> Hi
>
> Thanks for the offer Erlend but unfortunately I'm not a Java developer. I opted to abandon doing it with Nutch and/or Tika in favour of a Python approach to retrieve, parse and index the content needed.
>
> I think that Nutch would have been too heavyweight for my purposes anyway.
>
> Thanks again
>
> Andrew
>
> On 29 October 2010 09:35, Erlend Garåsen <[email protected]> wrote:
>
> > On 28.10.10 13.01, Andrzej Bialecki wrote:
> >
> >> Hopefully this kind of functionality will be improved soon through the use of the Boilerpipe functionality in Tika - but this is still to be integrated into both Tika and Nutch.
> >
> > Interesting. I also missed this kind of functionality in Nutch, but I found a workaround by adding a few lines in DOMContentUtils.java.
> >
> > I understand that Tika will replace parse-html, and therefore I wasn't sure I should add a Jira issue about this missing functionality. Anyway, I have found a simple way how one can stop parsing some contents in a web site, for example contents between HTML comments (<!-- stopindex --> .... <!-- startindex -->) and for some div tags.
> >
> > Andrew, I can send you the lines I added so it will be easier for you to write your filter. Or just simply add the lines to DOMContentUtils in case you're using parse-html and adapt the code.
> >
> > Erlend
> >
> > --
> > Erlend Garåsen
> > Center for Information Technology Services
> > University of Oslo
> > P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> > Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

