Hello Karl.
On 23 October 2014 17:57, Karl Wright <[email protected]> wrote:

> Looking at the SOLR patch, I have two concerns. First, here's the
> pertinent part of the patch:
>
> >>>>>>
> +      boilerpipe = "de.l3s.boilerpipe.extractors." + boilerpipe;
> +      try {
> +        ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
> +        Class extractorClass = loader.loadClass(boilerpipe);
> +
> +        BoilerpipeExtractor boilerpipeExtractor =
> +            (BoilerpipeExtractor) extractorClass.newInstance();
> +        BoilerpipeContentHandler boilerPipeContentHandler =
> +            new BoilerpipeContentHandler(parsingHandler, boilerpipeExtractor);
> +
> +        parsingHandler = (ContentHandler) boilerPipeContentHandler;
> +      } catch (ClassNotFoundException e) {
> +        log.warn("BoilerpipeExtractor " + boilerpipe + " not found!");
> +      } catch (InstantiationException e) {
> +        log.warn("Could not instantiate " + boilerpipe);
> +      } catch (Exception e) {
> +        log.warn(e.toString());
> +      }
> <<<<<<
>
> The actual extractor in this patch must be specified (the "boilerpipe"
> variable). I imagine there are a number of different extractors, probably
> for different kinds of XML/XHTML. Am I right? If so, how do you expect a
> user to be able to select this, since most jobs crawl documents of
> multiple types?

Yes, there are many extractors (see
http://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/package-summary.html).
For instance, if I am crawling a newspaper website, I may choose to use the
ArticleExtractor.

There is a demo at http://boilerpipe-web.appspot.com/ where you can select
the extractor you want, pass a website URL in the URL field (for instance
http://www.theregister.co.uk/2014/10/27/mozilla_hopes_to_challenge_raspbian_as_rpi_os_of_choice/),
and see the output. The output varies depending on the chosen extractor.

> Secondly, the BoilerpipeContentHandler is just a SAX ContentHandler,
> which basically implies that we'd be parsing XML somehow.
> But we don't currently do that in ManifoldCF for the Tika extractor; I
> believe the parsing occurs inside Tika in that case. If there's a way to
> configure Tika to use a specific boilerpipe extractor, that would be the
> closest match to this kind of functionality, I believe.

Boilerpipe is fully integrated and bundled with Tika:
http://tika.apache.org/1.4/api/org/apache/tika/parser/html/BoilerpipeContentHandler.html

> But in any case, this patch does NOT push tag data into metadata fields
> -- there's no mechanism for that, unless Solr's implementation of
> ContentHandler somehow does it.

You are right, that patch does not do tag extraction. Solr's update chain
does.

> Can you give examples of input and output that you expect to see for
> this proposed functionality?

You can see the output to Solr from the boilerpipe-web demo above.

Thanks.

> Karl
>
> On Thu, Oct 23, 2014 at 11:57 AM, Arcadius Ahouansou <[email protected]>
> wrote:
>
>> Hello Abe-san.
>>
>> Thank you for the response.
>>
>> The BoilerPipe library I was referring to helps to remove
>> common/repetitive page components such as menu items, headings, footers
>> etc. from the crawled content.
>>
>> There is a Solr patch at
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-3808
>> that I have been using. I thought it would be good to have ManifoldCF do
>> this instead.
>>
>> It would also be interesting to have ManifoldCF able to extract the
>> content of HTML tags such as div, h1, ... like Solr.
>>
>> Thanks.
>>
>> On 23 Oct 2014 07:03, "Shinichiro Abe" <[email protected]> wrote:
>>
>>> Hi Arcadius,
>>>
>>> > - use Tika's BoilerPipe to get cleaner content from web sites?
>>> Yes, the Tika extractor will remove tags in HTML
>>> and send content and metadata to the downstream pipeline/output
>>> connection.
>>>
>>> > - What about extracting specific HTML tags such as all h1 or h2 and
>>> > map them to a Solr field?
>>> No, currently it can map only metadata which is extracted by Tika to a
>>> Solr field.
>>> For h1, h2, p tags etc., the Tika extractor doesn't capture them and
>>> doesn't treat them as metadata.
>>> Currently, to capture these tags and map them to fields,
>>> we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
>>>
>>> Regards,
>>> Shinichiro Abe
>>>
>>> On 2014/10/23, at 10:21, Arcadius Ahouansou <[email protected]>
>>> wrote:
>>>
>>> > Hello.
>>> >
>>> > Given that we now have pipelines in ManifoldCF, how feasible is it to:
>>> >
>>> > - use Tika's BoilerPipe to get cleaner content from web sites?
>>> > - What about extracting specific HTML tags such as all h1 or h2 and
>>> >   map them to a Solr field?
>>> >
>>> > Thank you very much.
>>> >
>>> > Arcadius.
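A note on the mechanism in the SOLR-3808 excerpt quoted above: it builds the
fully-qualified extractor class name, loads it through a ClassLoader, and
instantiates it reflectively. A minimal, self-contained sketch of that same
pattern (using a JDK class as a stand-in, since the Boilerpipe jar may not be
on the classpath; `DynamicLoadSketch` and its method name are illustrative,
not from the patch):

```java
import java.util.List;

public class DynamicLoadSketch {

    // Loads a class by fully-qualified name and instantiates it via
    // reflection -- the same mechanism the SOLR-3808 patch uses to pick a
    // Boilerpipe extractor at runtime. Returns null on failure, as the
    // patch's catch-and-warn blocks effectively do.
    static Object loadAndInstantiate(String className) {
        try {
            ClassLoader loader = DynamicLoadSketch.class.getClassLoader();
            Class<?> clazz = loader.loadClass(className);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ClassNotFoundException e) {
            System.err.println("Class " + className + " not found!");
            return null;
        } catch (ReflectiveOperationException e) {
            System.err.println("Could not instantiate " + className);
            return null;
        }
    }

    public static void main(String[] args) {
        // Stand-in for "de.l3s.boilerpipe.extractors." + extractorName
        Object o = loadAndInstantiate("java.util.ArrayList");
        System.out.println(o instanceof List);
    }
}
```

This also makes Karl's concern concrete: whatever string reaches
`loadAndInstantiate` is fixed per configuration, so one job gets exactly one
extractor regardless of how many document types it crawls.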
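On Karl's point that BoilerpipeContentHandler "is just a SAX ContentHandler":
it is a decorator that wraps the downstream handler, which is why the patch
can do `new BoilerpipeContentHandler(parsingHandler, extractor)` and then
swap the wrapper in as the new `parsingHandler`. A JDK-only sketch of that
chaining shape (the trivial whitespace filter here stands in for Boilerpipe's
actual boilerplate detection, which is far more sophisticated):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerChainSketch {

    // Downstream handler: simply accumulates the character data it receives.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override public void characters(char[] ch, int start, int len) {
            text.append(ch, start, len);
        }
    }

    // Wrapper handler: sits in front of a downstream handler and decides
    // what to forward -- the same decorator shape as
    // new BoilerpipeContentHandler(parsingHandler, extractor).
    // Here the "boilerplate" rule is trivial: drop whitespace-only runs.
    static class FilteringHandler extends DefaultHandler {
        private final DefaultHandler downstream;
        FilteringHandler(DefaultHandler downstream) { this.downstream = downstream; }
        @Override public void characters(char[] ch, int start, int len)
                throws SAXException {
            if (!new String(ch, start, len).trim().isEmpty()) {
                downstream.characters(ch, start, len);
            }
        }
    }

    static String parseWithChain(String xml) throws Exception {
        TextCollector collector = new TextCollector();
        DefaultHandler chain = new FilteringHandler(collector);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), chain);
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWithChain("<div> <h1>Title</h1> <p>Body</p> </div>"));
    }
}
```

Because the wrapper only sees SAX events, it slots into any pipeline that
already speaks ContentHandler, which is what makes the Tika/Solr integration
cheap.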
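For the tag-capture workaround Abe-san mentions: Solr's
ExtractingRequestHandler accepts `capture` and `fmap.*` parameters for
pulling specific XHTML elements into their own fields. A hedged
solrconfig.xml sketch (the field names `h1_txt` and `h2_txt` are
placeholders; they must exist in your schema):

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Capture <h1> and <h2> content separately from the main body text -->
    <str name="capture">h1</str>
    <str name="capture">h2</str>
    <!-- Map each captured element to a schema field (placeholder names) -->
    <str name="fmap.h1">h1_txt</str>
    <str name="fmap.h2">h2_txt</str>
  </lst>
</requestHandler>
```

The same parameters can be passed per-request instead of as handler
defaults, which is presumably how the functionality would be driven from a
ManifoldCF output connection.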
