On 2010-08-13 09:15, reinhard schwab wrote:
This parser fails to extract outlinks from

http://lucene.apache.org/solr/api/index.html

although there are some frame elements with src attributes.
I have tried to debug why this happens.
It seems that the HtmlParser from Tika is filtering something out.
When I use the TagSoup parser to feed the DOM, I get the outlinks as
expected.

This is a known issue. For now, use the parse-html plugin rather than
parse-tika for HTML parsing.
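
A minimal sketch of what that could look like in conf/nutch-site.xml,
assuming plugin.includes is the property that selects the parser plugins
in your setup. The plugin list shown is only an illustrative assumption,
not a recommended value; keep your own entries and just make sure
parse-html is among them:

```xml
<!-- conf/nutch-site.xml (sketch; the plugin list below is an assumed example) -->
<property>
  <name>plugin.includes</name>
  <!-- parse-html is enabled next to parse-tika, so HTML can be handled
       by parse-html while Tika stays available for other formats -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

If both plugins are enabled, it is probably also worth checking that
conf/parse-plugins.xml maps text/html to parse-html, so that HTML
pages are not handed to parse-tika first.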


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
