[ https://issues.apache.org/jira/browse/TIKA-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835976#action_12835976 ]
Ken Krugler commented on TIKA-381:
----------------------------------

Things have changed with the switch to TagSoup. The linefeed in the href attribute value is now converted into a space before it gets passed to the XHTMLDowngradeHandler, which is unfortunate: we can no longer tell the difference between a real space and an accidental one. I'll have to dig into this a bit more.

> HtmlParser should strip linefeeds out of links
> ----------------------------------------------
>
>                 Key: TIKA-381
>                 URL: https://issues.apache.org/jira/browse/TIKA-381
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>
> A number of HTML pages contain links where the URL has a linefeed in the
> middle of it.
> Browsers such as Firefox will automatically remove the character, but Tika
> passes it back, which results in a broken URL.
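
For illustration only, the behavior the issue asks for could be sketched roughly as below. This is not Tika's code: the class LinkCleaner and the method stripLineBreaks are hypothetical names, and the sketch assumes it sees the raw href value before any whitespace normalization has turned the linefeed into a space.

```java
// Hypothetical helper, not part of Tika: removes line breaks from a raw href value.
final class LinkCleaner {
    private LinkCleaner() {}

    /**
     * Returns the href with carriage returns and linefeeds removed.
     * Spaces are left untouched, since a space may be a genuine part of the value.
     */
    static String stripLineBreaks(String href) {
        if (href == null) {
            return null;
        }
        // String.replace(CharSequence, CharSequence) replaces every occurrence.
        return href.replace("\r", "").replace("\n", "");
    }
}
```

A call such as LinkCleaner.stripLineBreaks("http://example.com/a\nb") yields "http://example.com/ab". Once the linefeed has already been replaced by a space upstream, as described in the comment, a helper like this can no longer recover the original URL.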