On 2010-08-13 09:15, reinhard schwab wrote:
> this parser fails to extract outlinks from
> http://lucene.apache.org/solr/api/index.html
> although there are some frame elements with src attributes.
> i have tried to debug why this happens.
> it seems that HtmlParser from tika is filtering something out.
> when i use the tagsoup parser to feed the dom, i get the outlinks as
> expected.

This is a known issue. For now, use parse-html for HTML parsing.
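
To check which outlinks a frameset page should yield, independent of Nutch and Tika, a minimal sketch in Python using only the standard library might look like the following. This is not the parse-html or parse-tika code; the names FrameLinkExtractor and frame_outlinks and the sample frameset markup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameLinkExtractor(HTMLParser):
    """Collects absolute URLs from the src attribute of frame/iframe tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative src values against the page URL.
                self.outlinks.append(urljoin(self.base_url, src))


def frame_outlinks(html, base_url):
    """Return the frame/iframe targets found in html as absolute URLs."""
    parser = FrameLinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.outlinks


if __name__ == "__main__":
    # Hypothetical frameset markup, not copied from the real page.
    sample = (
        '<html><frameset cols="20%,80%">'
        '<frame src="overview-frame.html" name="packageListFrame">'
        '<FRAME SRC="overview-summary.html">'
        "</frameset></html>"
    )
    for link in frame_outlinks(sample, "http://lucene.apache.org/solr/api/index.html"):
        print(link)
```

Comparing this list with the outlinks Nutch reports for the same page shows quickly whether the frame elements are being dropped before outlink extraction.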
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com