Re: Extracting Content from Web Crawler using the new PipeLine

Shinichiro Abe Wed, 22 Oct 2014 23:04:27 -0700

Hi Arcadius,

> - use Tika's BoilerPipe to get cleaner content from web sites?
Yes, the Tika extractor will remove the tags from the HTML
and send the content and metadata to the downstream pipeline/output connection.


> - What about extracting specific HTML tags such as all h1 or h2 and map them 
> to a Solr field?
No, currently it can map only the metadata extracted by Tika to Solr fields.
The Tika extractor doesn't capture h1, h2, p tags, etc., and doesn't treat 
them as metadata.
Currently, to capture these tags and map them to fields, 
we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
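
For illustration, here is a minimal sketch of how such a request to the
ExtractingRequestHandler might be built. It assumes that CAPTURE_ELEMENTS
corresponds to the "capture" request parameter, that the handler is registered
at /update/extract (this depends on solrconfig.xml), and that fmap.<name>
renames a captured element to a Solr field. The host, core name, document id
and field names (h1_txt, h2_txt) are hypothetical.

```python
from urllib.parse import urlencode


def build_extract_url(solr_core_url, doc_id, tag_to_field):
    """Build an /update/extract URL that captures HTML elements into fields.

    tag_to_field maps an element name (e.g. "h1") to a Solr field name.
    All names passed in here are illustrative, not taken from a real schema.
    """
    params = [("literal.id", doc_id), ("commit", "true")]
    for tag, field in tag_to_field.items():
        # Assumed to be the request parameter behind CAPTURE_ELEMENTS.
        params.append(("capture", tag))
        # Map the captured element's text to the target Solr field.
        params.append(("fmap." + tag, field))
    return solr_core_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Hypothetical core URL and field names.
url = build_extract_url(
    "http://localhost:8983/solr/collection1",
    "doc1",
    {"h1": "h1_txt", "h2": "h2_txt"},
)
print(url)
```

The HTML document itself would then be POSTed to that URL as the request body.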

Regards,
Shinichiro Abe

On 2014/10/23, at 10:21, Arcadius Ahouansou <[email protected]> wrote:

> 
> Hello.
> 
> Given that we now have pipelines in ManifoldCF, how feasible is it to:
> 
> - use Tika's BoilerPipe to get cleaner content from web sites?
> - What about extracting specific HTML tags such as all h1 or h2 and map them 
> to a Solr field?
> 
> Thank you very much.
> 
> Arcadius.
>
