Hello Karl.
On 23 October 2014 17:57, Karl Wright <[email protected]> wrote:

> Looking at the SOLR patch, I have two concerns. First, here's the
> pertinent part of the patch:
>
> >>>>>>
> +      boilerpipe = "de.l3s.boilerpipe.extractors." + boilerpipe;
> +      try {
> +        ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
> +        Class extractorClass = loader.loadClass(boilerpipe);
> +
> +        BoilerpipeExtractor boilerpipeExtractor =
> +            (BoilerpipeExtractor) extractorClass.newInstance();
> +        BoilerpipeContentHandler boilerPipeContentHandler =
> +            new BoilerpipeContentHandler(parsingHandler, boilerpipeExtractor);
> +
> +        parsingHandler = (ContentHandler) boilerPipeContentHandler;
> +      } catch (ClassNotFoundException e) {
> +        log.warn("BoilerpipeExtractor " + boilerpipe + " not found!");
> +      } catch (InstantiationException e) {
> +        log.warn("Could not instantiate " + boilerpipe);
> +      } catch (Exception e) {
> +        log.warn(e.toString());
> +      }
> <<<<<<
>
> The actual extractor in this patch must be specified (the "boilerpipe"
> variable). I imagine there are a number of different extractors, probably
> for different kinds of XML/XHTML. Am I right? If so, how do you expect a
> user to be able to select this, since most jobs crawl documents of
> multiple types?

Yes, there are many extractors (see
http://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/package-summary.html).
For instance, if I am crawling a newspaper website, I may choose to use the
ArticleExtractor.

There is a demo at http://boilerpipe-web.appspot.com/ where you can select
the extractor you want, pass a website URL in the URL field (for instance
http://www.theregister.co.uk/2014/10/27/mozilla_hopes_to_challenge_raspbian_as_rpi_os_of_choice/),
and see the output. The output varies depending on the chosen extractor.

> Secondly, the BoilerpipeContentHandler is just a SAX ContentHandler,
> which basically implies that we'd be parsing XML somehow.
> But we don't currently do that in ManifoldCF for the Tika extractor; I
> believe the parsing occurs inside Tika in that case. If there's a way to
> configure Tika to use a specific boilerpipe extractor, that would be the
> closest match to this kind of functionality, I believe.

Boilerpipe is fully integrated and bundled with Tika:
http://tika.apache.org/1.4/api/org/apache/tika/parser/html/BoilerpipeContentHandler.html

> But in any case, this patch does NOT push tag data into metadata fields
> -- there's no mechanism for that, unless Solr's implementation of
> ContentHandler somehow does it.

You are right, that patch does not do tag extraction. Solr's update chain
does.

> Can you give examples of input and output that you expect to see for
> this proposed functionality?

You can see the output to Solr from the boilerpipe-web demo above.

Thanks.

> Karl
>
> On Thu, Oct 23, 2014 at 11:57 AM, Arcadius Ahouansou <[email protected]>
> wrote:
>
>> Hello Abe-san.
>>
>> Thank you for the response.
>>
>> The BoilerPipe library I was referring to helps to remove
>> common/repetitive page components such as menu items, headings, footers
>> etc. from the crawled content.
>>
>> There is a Solr patch at
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-3808
>> that I have been using. I thought it would be good to have ManifoldCF do
>> this instead.
>>
>> It would also be interesting to have ManifoldCF able to extract the
>> content of HTML tags such as div, h1, ... like Solr.
>>
>> Thanks.
>>
>> On 23 Oct 2014 07:03, "Shinichiro Abe" <[email protected]> wrote:
>>
>>> Hi Arcadius,
>>>
>>> > - use Tika's BoilerPipe to get cleaner content from web sites?
>>> Yes, the Tika extractor will remove tags in HTML
>>> and send content and metadata to the downstream pipeline/output
>>> connection.
>>>
>>> > - What about extracting specific HTML tags such as all h1 or h2 and
>>> > map them to a Solr field?
>>> No, currently it can map only metadata which is extracted by Tika to a
>>> Solr field.
>>> For h1, h2, p tags etc., the Tika extractor doesn't capture them and
>>> doesn't treat them as metadata.
>>> Currently, to capture these tags and map them to fields,
>>> we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
>>>
>>> Regards,
>>> Shinichiro Abe
>>>
>>> On 2014/10/23, at 10:21, Arcadius Ahouansou <[email protected]>
>>> wrote:
>>>
>>> > Hello.
>>> >
>>> > Given that we now have pipelines in ManifoldCF, how feasible is it to:
>>> >
>>> > - use Tika's BoilerPipe to get cleaner content from web sites?
>>> > - What about extracting specific HTML tags such as all h1 or h2 and
>>> >   map them to a Solr field?
>>> >
>>> > Thank you very much.
>>> >
>>> > Arcadius.
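A note on the mechanism in the SOLR-3808 excerpt quoted above: it builds the
fully-qualified extractor class name, loads it through a ClassLoader, and
instantiates it reflectively. A minimal, self-contained sketch of that same
pattern (using a JDK class as a stand-in, since the Boilerpipe jar may not be
on the classpath; `DynamicLoadSketch` and its method name are illustrative,
not from the patch):

```java
import java.util.List;

public class DynamicLoadSketch {

    // Loads a class by fully-qualified name and instantiates it via
    // reflection -- the same mechanism the SOLR-3808 patch uses to pick a
    // Boilerpipe extractor at runtime. Returns null on failure, as the
    // patch's catch-and-warn blocks effectively do.
    static Object loadAndInstantiate(String className) {
        try {
            ClassLoader loader = DynamicLoadSketch.class.getClassLoader();
            Class<?> clazz = loader.loadClass(className);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ClassNotFoundException e) {
            System.err.println("Class " + className + " not found!");
            return null;
        } catch (ReflectiveOperationException e) {
            System.err.println("Could not instantiate " + className);
            return null;
        }
    }

    public static void main(String[] args) {
        // Stand-in for "de.l3s.boilerpipe.extractors." + extractorName
        Object o = loadAndInstantiate("java.util.ArrayList");
        System.out.println(o instanceof List);
    }
}
```

This also makes Karl's concern concrete: whatever string reaches
`loadAndInstantiate` is fixed per configuration, so one job gets exactly one
extractor regardless of how many document types it crawls.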
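On Karl's point that BoilerpipeContentHandler "is just a SAX ContentHandler":
it is a decorator that wraps the downstream handler, which is why the patch
can do `new BoilerpipeContentHandler(parsingHandler, extractor)` and then
swap the wrapper in as the new `parsingHandler`. A JDK-only sketch of that
chaining shape (the trivial whitespace filter here stands in for Boilerpipe's
actual boilerplate detection, which is far more sophisticated):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerChainSketch {

    // Downstream handler: simply accumulates the character data it receives.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override public void characters(char[] ch, int start, int len) {
            text.append(ch, start, len);
        }
    }

    // Wrapper handler: sits in front of a downstream handler and decides
    // what to forward -- the same decorator shape as
    // new BoilerpipeContentHandler(parsingHandler, extractor).
    // Here the "boilerplate" rule is trivial: drop whitespace-only runs.
    static class FilteringHandler extends DefaultHandler {
        private final DefaultHandler downstream;
        FilteringHandler(DefaultHandler downstream) { this.downstream = downstream; }
        @Override public void characters(char[] ch, int start, int len)
                throws SAXException {
            if (!new String(ch, start, len).trim().isEmpty()) {
                downstream.characters(ch, start, len);
            }
        }
    }

    static String parseWithChain(String xml) throws Exception {
        TextCollector collector = new TextCollector();
        DefaultHandler chain = new FilteringHandler(collector);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), chain);
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWithChain("<div> <h1>Title</h1> <p>Body</p> </div>"));
    }
}
```

Because the wrapper only sees SAX events, it slots into any pipeline that
already speaks ContentHandler, which is what makes the Tika/Solr integration
cheap.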
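For the tag-capture workaround Abe-san mentions: Solr's
ExtractingRequestHandler accepts `capture` and `fmap.*` parameters for
pulling specific XHTML elements into their own fields. A hedged
solrconfig.xml sketch (the field names `h1_txt` and `h2_txt` are
placeholders; they must exist in your schema):

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Capture <h1> and <h2> content separately from the main body text -->
    <str name="capture">h1</str>
    <str name="capture">h2</str>
    <!-- Map each captured element to a schema field (placeholder names) -->
    <str name="fmap.h1">h1_txt</str>
    <str name="fmap.h2">h2_txt</str>
  </lst>
</requestHandler>
```

The same parameters can be passed per-request instead of as handler
defaults, which is presumably how the functionality would be driven from a
ManifoldCF output connection.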
