TikaParser

reinhard schwab Fri, 13 Aug 2010 00:10:58 -0700

This parser fails to extract outlinks from

http://lucene.apache.org/solr/api/index.html

although the page has some frame elements with src attributes.
I have tried to debug why this happens.
It seems that Tika's HtmlParser is filtering something out.
When I use the TagSoup parser to build the DOM instead, I get the outlinks as
expected.
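
One way to check whether the frame elements survive into the DOM is to walk
the DOM directly and collect their src attributes. The following is only a
minimal sketch using the JDK's own XML classes, not the Nutch or Tika code
path: the class FrameOutlinks and its methods are hypothetical names, and the
string-based helper assumes well-formed XHTML input with lowercase tag names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Nutch or Tika.
class FrameOutlinks {

    // Collects the non-empty src attribute of every <frame> and <iframe>
    // element. Tag matching is case-sensitive here, so a DOM built with
    // uppercase element names would need "FRAME" and "IFRAME" instead.
    static List<String> frameSources(Document doc) {
        List<String> sources = new ArrayList<>();
        for (String tag : new String[] {"frame", "iframe"}) {
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                String src = ((Element) nodes.item(i)).getAttribute("src");
                if (!src.isEmpty()) {
                    sources.add(src);
                }
            }
        }
        return sources;
    }

    // Convenience wrapper: parses a well-formed XHTML string with the JDK
    // XML parser (an assumption; real pages usually need TagSoup or similar).
    static List<String> frameSourcesFromXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return frameSources(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```

Running frameSources on the DOM built by each parser and comparing the two
lists would show whether the frame elements are already missing before
outlink extraction starts.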

regards
reinhard
