Hi Abe-san,

Is this capability a configurable function of Tika? We could add Tika configuration to the Tika Extractor if so.

Karl

On Thu, Oct 23, 2014 at 2:03 AM, Shinichiro Abe <[email protected]> wrote:

> Hi Arcadius,
>
> > - use Tika's BoilerPipe to get cleaner content from web sites?
> Yes, Tika extractor will remove tags in html
> and send content and metadata to downstream pipeline/output connection.
>
> > - What about extracting specific HTML tags such as all h1 or h2 and map them to a Solr field?
> No, currently it can map only metadata which is extracted by Tika to Solr field.
> For h1, h2, p tags etc, Tika extractor doesn't capture them and doesn't
> treat them as metadata.
> Currently when capturing these tags and map them to fields,
> we have to use Solr's ExtractingRequestHandler(CAPTURE_ELEMENTS param).
>
> Regards,
> Shinichiro Abe
>
> On 2014/10/23, at 10:21, Arcadius Ahouansou <[email protected]> wrote:
>
> > Hello.
> >
> > Given that we now have pipelines in ManifoldCF, How feasible is it to:
> >
> > - use Tika's BoilerPipe to get cleaner content from web sites?
> > - What about extracting specific HTML tags such as all h1 or h2 and map them to a Solr field?
> >
> > Thank you very much.
> >
> > Arcadius.
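
For reference, the ExtractingRequestHandler route Abe-san mentions can be sketched roughly as follows. This is only an illustration, not part of ManifoldCF: it builds the request URL using Solr's `capture` parameter (the CAPTURE_ELEMENTS param) together with `fmap.<element>` to map each captured element to a field. The host, core name, document id, target field names (`heading_h1`, `heading_h2`) and the helper `build_extract_url` are all assumptions made up for the example.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, tag_to_field):
    """Build a Solr /update/extract URL that captures the given XHTML
    elements and maps each captured element to a Solr field.

    base_url, doc_id and the field names are hypothetical placeholders.
    """
    # literal.id sets the document's unique key directly.
    params = [("literal.id", doc_id)]
    for tag, field in tag_to_field.items():
        # capture=<tag> asks Solr Cell to index that element separately.
        params.append(("capture", tag))
        # fmap.<tag>=<field> renames the captured element to a Solr field.
        params.append((f"fmap.{tag}", field))
    return f"{base_url}/update/extract?{urlencode(params)}"


url = build_extract_url(
    "http://localhost:8983/solr/collection1",  # assumed Solr core URL
    "doc1",                                     # assumed document id
    {"h1": "heading_h1", "h2": "heading_h2"},   # assumed target fields
)
print(url)
```

The document body would still have to be posted to that URL, and the target fields would need to exist in the Solr schema (multi-valued if a page has several h1 or h2 elements).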
