Cool! I have since learned another method for handling the redundant templated spew in HTML pages: crawl the mobile site instead.
----- Original Message -----
| From: "Markus Jelsma" <markus.jel...@openindex.io>
| To: solr-user@lucene.apache.org
| Sent: Friday, September 7, 2012 3:05:40 AM
| Subject: RE: Is Boilerpipe usable through Solr ExtractingUpdateHandler or the DIH?
|
| It works indeed:
| https://issues.apache.org/jira/browse/SOLR-3808
|
|
| -----Original message-----
| > From:Markus Jelsma <markus.jel...@openindex.io>
| > Sent: Fri 07-Sep-2012 10:40
| > To: solr-user@lucene.apache.org
| > Subject: RE: Is Boilerpipe usable through Solr
| > ExtractingUpdateHandler or the DIH?
| >
| > Hi,
| >
| > It should not be so hard but it looks like the current
| > SolrContentHandler builds up the document via SAX-events. You
| > could pass a
| > BoilerpipeContentHandler((ContentHandler)parsingHandler,
| > BoilerpipeExtractor) to the parser in
| > ExtractingDocumentLoader.java. It should work.
| >
| > Markus
| >
| >
| >
| > -----Original message-----
| > > From:Lance Norskog <goks...@gmail.com>
| > > Sent: Thu 06-Sep-2012 05:51
| > > To: solr-user@lucene.apache.org
| > > Subject: Is Boilerpipe usable through Solr
| > > ExtractingUpdateHandler or the DIH?
| > >
| > > Tika integrated Boilerpipe a few releases back. Is it possible to
| > > invoke it when using the ExtractingUpdateHandler (simple Tika)
| > > or the DataImportHandler?
| > >
| > > http://code.google.com/p/boilerpipe/
| > >
| > >
| > >
| >
|
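
For anyone reading this thread later, here is a rough sketch of the handler-wrapping idea Markus describes: the parser emits SAX events, and a wrapping handler decides which of them reach the handler that builds the document. This is not the SOLR-3808 patch and it does not use Tika or Boilerpipe. It uses only the JDK SAX classes, and every name in it (HandlerWrappingSketch, TextCollector, SkippingHandler, the "nav"/"footer" skip list) is hypothetical. TextCollector stands in for SolrContentHandler and SkippingHandler stands in for BoilerpipeContentHandler. As far as I know, the real Boilerpipe classifies text blocks by features such as link density rather than by tag name, so treat the filter rule here as a placeholder.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerWrappingSketch {

    /** Hypothetical stand-in for SolrContentHandler: collects character data. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /**
     * Hypothetical stand-in for BoilerpipeContentHandler: wraps a delegate
     * handler and forwards only the events outside the skipped elements.
     */
    static class SkippingHandler extends DefaultHandler {
        // Assumption for illustration only: treat these elements as boilerplate.
        private static final Set<String> SKIP = Set.of("nav", "footer");
        private final ContentHandler delegate;
        private int skipDepth = 0; // > 0 while inside a skipped element

        SkippingHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            if (skipDepth > 0 || SKIP.contains(qName)) {
                skipDepth++;
                return;
            }
            delegate.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if (skipDepth > 0) {
                skipDepth--;
                return;
            }
            delegate.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (skipDepth == 0) {
                delegate.characters(ch, start, length);
            }
        }
    }

    /** Parses well-formed markup and returns the text that survives the filter. */
    static String extract(String xml) throws Exception {
        TextCollector collector = new TextCollector();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // The wrapper sits between the parser and the collecting handler.
        reader.setContentHandler(new SkippingHandler(collector));
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><nav>Home | About</nav>"
                + "<p>Actual article text.</p>"
                + "<footer>Copyright</footer></body></html>";
        System.out.println(extract(page));
    }
}
```

The point of the wrapper is that the collecting handler needs no changes: it simply never sees the events that were filtered out, which is why swapping the handler passed to the parser in ExtractingDocumentLoader.java is enough.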