Extracting Content from Web Crawler using the new PipeLine

Arcadius Ahouansou Wed, 22 Oct 2014 18:23:10 -0700

Hello.

Given that we now have pipelines in ManifoldCF, how feasible is it to:


- use Tika's Boilerpipe support to get cleaner content from web sites?
- extract specific HTML tags, such as all h1 or h2 elements, and map
them to a Solr field?
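
For illustration only, the second item could look roughly like the sketch below when done outside ManifoldCF. It uses only Python's standard-library html.parser and is not ManifoldCF or Tika code; the class name HeadingExtractor, the function headings_to_fields, and the field names h1_txt and h2_txt are all hypothetical.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collects the text of every <h1> and <h2> element."""

    TAGS = ("h1", "h2")

    def __init__(self):
        super().__init__()
        self.headings = {tag: [] for tag in self.TAGS}
        self._current = None   # heading tag currently open, if any
        self._buffer = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside a heading, including nested inline tags.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headings[tag].append(text)
            self._current = None


def headings_to_fields(html):
    """Return a dict with one multi-valued field per heading tag."""
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    # Hypothetical field names; a real Solr schema would define its own.
    return {tag + "_txt": values for tag, values in parser.headings.items()}
```

Calling headings_to_fields on a page's HTML returns a dictionary whose values are lists, which is the shape a multi-valued Solr field would need. Whether a ManifoldCF pipeline stage can produce the same thing is exactly the open question.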

Thank you very much.

Arcadius.
