Ok, I see now how it's supposed to work. See CONNECTORS-1088.
Karl

On Tue, Oct 28, 2014 at 3:42 AM, Arcadius Ahouansou <[email protected]> wrote:
>
> Hello Karl.
>
> On 23 October 2014 17:57, Karl Wright <[email protected]> wrote:
>
>> Looking at the SOLR patch, I have two concerns. First, here's the
>> pertinent part of the patch:
>>
>> >>>>>>
>> +        boilerpipe = "de.l3s.boilerpipe.extractors." + boilerpipe;
>> +        try {
>> +          ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
>> +          Class extractorClass = loader.loadClass(boilerpipe);
>> +
>> +          BoilerpipeExtractor boilerpipeExtractor = (BoilerpipeExtractor)extractorClass.newInstance();
>> +          BoilerpipeContentHandler boilerPipeContentHandler = new BoilerpipeContentHandler(parsingHandler, boilerpipeExtractor);
>> +
>> +          parsingHandler = (ContentHandler)boilerPipeContentHandler;
>> +        } catch (ClassNotFoundException e) {
>> +          log.warn("BoilerpipeExtractor " + boilerpipe + " not found!");
>> +        } catch (InstantiationException e) {
>> +          log.warn("Could not instantiate " + boilerpipe);
>> +        } catch (Exception e) {
>> +          log.warn(e.toString());
>> +        }
>> <<<<<<
>>
>> The actual extractor in this patch must be specified (the "boilerpipe"
>> variable). I imagine there are a number of different extractors, probably
>> for different kinds of XML/XHTML. Am I right? If so, how do you expect a
>> user to be able to select this, since most jobs crawl documents of multiple
>> types?
>>
>
> Yes, there are many extractors (see
> http://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/package-summary.html
> ).
>
> For instance, if I am crawling a newspaper website, I may choose to use
> the ArticleExtractor.
> There is a demo at
> http://boilerpipe-web.appspot.com/
> You can select the extractor you want, pass a web site URL into the URL
> field (for instance
> http://www.theregister.co.uk/2014/10/27/mozilla_hopes_to_challenge_raspbian_as_rpi_os_of_choice/
> ), and see the output.
> The output varies depending on the chosen type of extractor.
>
>> Secondly, the BoilerpipeContentHandler is just a SAX ContentHandler,
>> which basically implies that we'd be parsing XML somehow. But we don't
>> currently do that in ManifoldCF for the Tika extractor; I believe the
>> parsing occurs inside Tika in that case. If there's a way to configure
>> Tika to use a specific boilerpipe extractor, that would be the closest
>> match to this kind of functionality, I believe.
>>
>
> Boilerpipe is fully integrated and bundled with Tika:
> http://tika.apache.org/1.4/api/org/apache/tika/parser/html/BoilerpipeContentHandler.html
>
>> But in any case, this patch does NOT push tag data into metadata fields
>> -- there's no mechanism for that, unless Solr's implementation of
>> ContentHandler somehow does it.
>>
>
> You are right, that patch does not do tag extraction.
> Solr's update chain does.
>
>> Can you give examples of the input and output that you expect to see for
>> this proposed functionality?
>>
>
> You can see the output sent to Solr from the boilerpipe-web demo above.
>
> Thanks.
>
>> Karl
>>
>> On Thu, Oct 23, 2014 at 11:57 AM, Arcadius Ahouansou <[email protected]> wrote:
>>
>>> Hello Abe-San.
>>>
>>> Thank you for the response.
>>>
>>> The BoilerPipe library I was referring to helps remove common/repetitive
>>> page components such as menu items, headings, footers, etc. from the
>>> crawled content.
>>>
>>> There is a Solr patch at
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-3808
>>> that I have been using.
>>> I thought it would be good to have ManifoldCF do this instead.
>>>
>>> It would also be interesting to have ManifoldCF able to extract the
>>> content of HTML tags such as div, h1, ... like Solr does.
>>>
>>> Thanks
>>>
>>> On 23 Oct 2014 07:03, "Shinichiro Abe" <[email protected]> wrote:
>>>
>>>> Hi Arcadius,
>>>>
>>>> > - use Tika's BoilerPipe to get cleaner content from web sites?
>>>> Yes, the Tika extractor will remove tags in HTML
>>>> and send content and metadata to the downstream pipeline/output connection.
>>>>
>>>> > - What about extracting specific HTML tags such as all h1 or h2 and
>>>> > map them to a Solr field?
>>>> No, currently it can map only metadata which is extracted by Tika to a
>>>> Solr field.
>>>> For h1, h2, p tags etc., the Tika extractor doesn't capture them and
>>>> doesn't treat them as metadata.
>>>> Currently, to capture these tags and map them to fields,
>>>> we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
>>>>
>>>> Regards,
>>>> Shinichiro Abe
>>>>
>>>> On 2014/10/23, at 10:21, Arcadius Ahouansou <[email protected]> wrote:
>>>>
>>>> > Hello.
>>>> >
>>>> > Given that we now have pipelines in ManifoldCF, how feasible is it to:
>>>> >
>>>> > - use Tika's BoilerPipe to get cleaner content from web sites?
>>>> > - What about extracting specific HTML tags such as all h1 or h2 and
>>>> > map them to a Solr field?
>>>> >
>>>> > Thank you very much.
>>>> >
>>>> > Arcadius.
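For reference, a minimal sketch of how Tika's bundled Boilerpipe support can be wired up outside of Solr, along the lines discussed above. This assumes tika-core, tika-parsers, and the boilerpipe jar are on the classpath; the input file name "page.html" and the choice of ArticleExtractor are illustrative only, not part of any existing ManifoldCF code.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeTikaSketch {
  public static void main(String[] args) throws Exception {
    // Plain text accumulates here; -1 disables the default write limit.
    BodyContentHandler textHandler = new BodyContentHandler(-1);

    // Wrap the text handler so Boilerpipe's ArticleExtractor strips
    // boilerplate (menus, headers, footers) before text reaches it.
    BoilerpipeContentHandler boilerpipeHandler =
        new BoilerpipeContentHandler(textHandler, ArticleExtractor.INSTANCE);

    Metadata metadata = new Metadata();
    try (InputStream in = Files.newInputStream(Paths.get("page.html"))) {
      new AutoDetectParser().parse(in, boilerpipeHandler, metadata, new ParseContext());
    }

    // Boilerplate-free main content of the page.
    System.out.println(textHandler.toString());
  }
}

Swapping ArticleExtractor for one of the other extractors (DefaultExtractor, KeepEverythingExtractor, etc.) changes what is retained, which is the per-job choice raised in the thread above.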
