Hi Arcadius,

> - use Tika's BoilerPipe to get cleaner content from web sites?

Yes, the Tika extractor removes the tags from the HTML and sends the content and metadata to the downstream pipeline/output connection.

> - What about extracting specific HTML tags such as all h1 or h2 and map them
> to a Solr field?

No. Currently it can map only the metadata extracted by Tika to Solr fields. The Tika extractor does not capture tags such as h1, h2, or p, and does not treat them as metadata. For now, to capture these tags and map them to fields, we have to use Solr's ExtractingRequestHandler (the CAPTURE_ELEMENTS param).

Regards,
Shinichiro Abe

On 2014/10/23, at 10:21, Arcadius Ahouansou <[email protected]> wrote:

> Hello.
>
> Given that we now have pipelines in ManifoldCF, How feasible is it to:
>
> - use Tika's BoilerPipe to get cleaner content from web sites?
> - What about extracting specific HTML tags such as all h1 or h2 and map them
> to a Solr field?
>
> Thank you very much.
>
> Arcadius.
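
P.S. To illustrate the ExtractingRequestHandler route mentioned in the reply, here is a minimal sketch of how such a request URL could be built. It relies on the handler's `capture` request parameter (which is what the CAPTURE_ELEMENTS constant refers to) together with `fmap.<element>` to rename the captured element to a field. The Solr URL, the core name `collection1`, the document id, and the `tag_h1`/`tag_h2` field names are all assumptions for illustration; the target fields would need to exist (probably as multiValued) in your schema.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, tags, field_prefix="tag_"):
    """Build an /update/extract URL that captures the given XHTML elements.

    solr_base, doc_id and field_prefix are hypothetical values chosen by the
    caller; only the parameter names (literal.*, capture, fmap.*, commit)
    come from Solr's ExtractingRequestHandler.
    """
    params = [("literal.id", doc_id), ("commit", "true")]
    for tag in tags:
        # capture=<element> asks Solr Cell to index that element separately
        params.append(("capture", tag))
        # fmap.<element>=<field> maps the captured element to a schema field
        params.append((f"fmap.{tag}", f"{field_prefix}{tag}"))
    return f"{solr_base.rstrip('/')}/update/extract?{urlencode(params)}"


# Assumed local Solr instance and core name.
url = build_extract_url(
    "http://localhost:8983/solr/collection1", "doc1", ["h1", "h2"]
)
print(url)
```

The HTML document itself would then be POSTed to this URL as the request body. Note that this sends the raw document to Solr for extraction, so it would be used instead of the Tika extractor in the ManifoldCF pipeline for that content, not alongside it.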
