Ok, I see now how it's supposed to work. See CONNECTORS-1088.
Karl

On Tue, Oct 28, 2014 at 3:42 AM, Arcadius Ahouansou <[email protected]> wrote:
>
> Hello Karl.
>
> On 23 October 2014 17:57, Karl Wright <[email protected]> wrote:
>
>> Looking at the SOLR patch, I have two concerns. First, here's the
>> pertinent part of the patch:
>>
>> >>>>>>
>> +        boilerpipe = "de.l3s.boilerpipe.extractors." + boilerpipe;
>> +        try {
>> +          ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
>> +          Class extractorClass = loader.loadClass(boilerpipe);
>> +
>> +          BoilerpipeExtractor boilerpipeExtractor = (BoilerpipeExtractor)extractorClass.newInstance();
>> +          BoilerpipeContentHandler boilerPipeContentHandler = new BoilerpipeContentHandler(parsingHandler, boilerpipeExtractor);
>> +
>> +          parsingHandler = (ContentHandler)boilerPipeContentHandler;
>> +        } catch (ClassNotFoundException e) {
>> +          log.warn("BoilerpipeExtractor " + boilerpipe + " not found!");
>> +        } catch (InstantiationException e) {
>> +          log.warn("Could not instantiate " + boilerpipe);
>> +        } catch (Exception e) {
>> +          log.warn(e.toString());
>> +        }
>> <<<<<<
>>
>> The actual extractor in this patch must be specified (the "boilerpipe"
>> variable). I imagine there are a number of different extractors, probably
>> for different kinds of XML/XHTML. Am I right? If so, how do you expect a
>> user to be able to select this, since most jobs crawl documents of multiple
>> types?
>>
>
> Yes, there are many extractors (see
> http://boilerpipe.googlecode.com/svn/trunk/boilerpipe-core/javadoc/1.0/de/l3s/boilerpipe/extractors/package-summary.html
> ).
>
> For instance, if I am crawling a newspaper website, I may choose to use
> the ArticleExtractor.
> There is a demo at
> http://boilerpipe-web.appspot.com/
> You can select the extractor you want, pass a web site URL into the URL
> field (for instance
> http://www.theregister.co.uk/2014/10/27/mozilla_hopes_to_challenge_raspbian_as_rpi_os_of_choice/
> ), and see the output.
> The output varies depending on the chosen type of extractor.
>
>> Secondly, the BoilerpipeContentHandler is just a SAX ContentHandler,
>> which basically implies that we'd be parsing XML somehow. But we don't
>> currently do that in ManifoldCF for the Tika extractor; I believe the
>> parsing occurs inside Tika in that case. If there's a way to configure
>> Tika to use a specific boilerpipe extractor, that would be the closest
>> match to this kind of functionality, I believe.
>>
>
> Boilerpipe is fully integrated and bundled with Tika:
> http://tika.apache.org/1.4/api/org/apache/tika/parser/html/BoilerpipeContentHandler.html
>
>> But in any case, this patch does NOT push tag data into metadata fields
>> -- there's no mechanism for that, unless Solr's implementation of
>> ContentHandler somehow does it.
>>
>
> You are right, that patch does not do tag extraction.
> Solr's update chain does.
>
>> Can you give examples of the input and output that you expect to see for
>> this proposed functionality?
>>
>
> You can see the output sent to Solr from the boilerpipe-web demo above.
>
> Thanks.
>
>> Karl
>>
>> On Thu, Oct 23, 2014 at 11:57 AM, Arcadius Ahouansou <[email protected]> wrote:
>>
>>> Hello Abe-San.
>>>
>>> Thank you for the response.
>>>
>>> The BoilerPipe library I was referring to helps remove common/repetitive
>>> page components such as menu items, headings, footers, etc. from the
>>> crawled content.
>>>
>>> There is a Solr patch at
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-3808
>>> that I have been using.
>>> I thought it would be good to have ManifoldCF do this instead.
>>>
>>> It would also be interesting to have ManifoldCF able to extract the
>>> content of HTML tags such as div, h1, ... like Solr does.
>>>
>>> Thanks
>>>
>>> On 23 Oct 2014 07:03, "Shinichiro Abe" <[email protected]> wrote:
>>>
>>>> Hi Arcadius,
>>>>
>>>> > - use Tika's BoilerPipe to get cleaner content from web sites?
>>>> Yes, the Tika extractor will remove tags in HTML
>>>> and send content and metadata to the downstream pipeline/output connection.
>>>>
>>>> > - What about extracting specific HTML tags such as all h1 or h2 and
>>>> > map them to a Solr field?
>>>> No, currently it can map only metadata which is extracted by Tika to a
>>>> Solr field.
>>>> For h1, h2, p tags etc., the Tika extractor doesn't capture them and
>>>> doesn't treat them as metadata.
>>>> Currently, to capture these tags and map them to fields,
>>>> we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
>>>>
>>>> Regards,
>>>> Shinichiro Abe
>>>>
>>>> On 2014/10/23, at 10:21, Arcadius Ahouansou <[email protected]> wrote:
>>>>
>>>> > Hello.
>>>> >
>>>> > Given that we now have pipelines in ManifoldCF, how feasible is it to:
>>>> >
>>>> > - use Tika's BoilerPipe to get cleaner content from web sites?
>>>> > - What about extracting specific HTML tags such as all h1 or h2 and
>>>> > map them to a Solr field?
>>>> >
>>>> > Thank you very much.
>>>> >
>>>> > Arcadius.
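For reference, a minimal sketch of how Tika's bundled Boilerpipe support can be wired up outside of Solr, along the lines discussed above. This assumes tika-core, tika-parsers, and the boilerpipe jar are on the classpath; the input file name "page.html" and the choice of ArticleExtractor are illustrative only, not part of any existing ManifoldCF code.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeTikaSketch {
  public static void main(String[] args) throws Exception {
    // Plain text accumulates here; -1 disables the default write limit.
    BodyContentHandler textHandler = new BodyContentHandler(-1);

    // Wrap the text handler so Boilerpipe's ArticleExtractor strips
    // boilerplate (menus, headers, footers) before text reaches it.
    BoilerpipeContentHandler boilerpipeHandler =
        new BoilerpipeContentHandler(textHandler, ArticleExtractor.INSTANCE);

    Metadata metadata = new Metadata();
    try (InputStream in = Files.newInputStream(Paths.get("page.html"))) {
      new AutoDetectParser().parse(in, boilerpipeHandler, metadata, new ParseContext());
    }

    // Boilerplate-free main content of the page.
    System.out.println(textHandler.toString());
  }
}

Swapping ArticleExtractor for one of the other extractors (DefaultExtractor, KeepEverythingExtractor, etc.) changes what is retained, which is the per-job choice raised in the thread above.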
