Hi Abe-san,

Is this capability a configurable function of Tika?  We could add Tika
configuration to the Tika Extractor if so.

Karl

On Thu, Oct 23, 2014 at 2:03 AM, Shinichiro Abe <[email protected]>
wrote:

> Hi Arcadius,
>
> > - use Tika's BoilerPipe to get cleaner content from web sites?
> Yes, the Tika extractor removes the HTML tags
> and sends the content and metadata to the downstream pipeline/output connection.
>
> > - What about extracting specific HTML tags such as all h1 or h2 and
> > map them to a Solr field?
> No, currently it can only map metadata extracted by Tika to Solr
> fields.
> The Tika extractor doesn't capture h1, h2, p tags, etc., and doesn't
> treat them as metadata.
> Currently, to capture these tags and map them to fields,
> we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
>
> Regards,
> Shinichiro Abe
>
> On 2014/10/23, at 10:21, Arcadius Ahouansou <[email protected]> wrote:
>
> >
> > Hello.
> >
> > Given that we now have pipelines in ManifoldCF, how feasible is it to:
> >
> > - use Tika's BoilerPipe to get cleaner content from web sites?
> > - What about extracting specific HTML tags such as all h1 or h2 and
> > map them to a Solr field?
> >
> > Thank you very much.
> >
> > Arcadius.
> >
>
>
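
A minimal sketch of the Solr-side approach mentioned in the quoted reply, assuming the ExtractingRequestHandler is mounted at /update/extract and that CAPTURE_ELEMENTS corresponds to the "capture" request parameter, with "fmap.<element>" renaming the captured element to a field. The base URL, core name, document id, and field names (heading_h1, heading_h2) are hypothetical. The function only builds the request URL; it does not contact Solr.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, capture_map):
    """Build an /update/extract URL that captures XHTML elements into fields.

    capture_map maps an element name (e.g. "h1") to a Solr field name.
    All names passed in by the caller are illustrative assumptions.
    """
    params = [("literal.id", doc_id), ("commit", "true")]
    for element, field in capture_map.items():
        # "capture" asks Solr Cell to index the element's text separately.
        params.append(("capture", element))
        # "fmap.<element>" maps the captured element onto a schema field.
        params.append(("fmap." + element, field))
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


# Hypothetical core and field names, for illustration only.
url = build_extract_url(
    "http://localhost:8983/solr/collection1",
    "doc1",
    {"h1": "heading_h1", "h2": "heading_h2"},
)
```

The document body itself would still have to be POSTed to this URL, and the target fields must exist in the Solr schema (or match a dynamic field rule).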
