[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769215#comment-17769215 ]

Tim Allison commented on TIKA-2562:
-----------------------------------

We're now getting this with the migration to JSoup in the 3.x/main branch. This looks right?

Lorem ipsum dolor sit amet, consectetur adipiscing laborum.
http://www.google.com">http://www.google.com
https://mail.google.com/mail/?tab=wm">https://mail.google.com/mail/?tab=wm

> tika server parse HTML removes DIVs around hyperlink & adds shape
> -----------------------------------------------------------------
>
>                 Key: TIKA-2562
>                 URL: https://issues.apache.org/jira/browse/TIKA-2562
>             Project: Tika
>          Issue Type: Bug
>          Components: gui, parser, server
>    Affects Versions: 1.17
>            Reporter: NW Brad
>            Priority: Major
>         Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in an HTML document that are parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html [http://localhost:9998/tika] --header "Accept: text/html"
> sent:
> href="http://www.google.com">[http://www.google.com|http://www.google.com/]
> received back:
> href="http://www.google.com">[http://www.google.com|http://www.google.com/]
> Divs are gone and a shape has been added

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352666#comment-16352666 ]

NW Brad commented on TIKA-2562:
-------------------------------

Thanks. I'll take a look at it. It definitely looks like the same issue, but for the title tag. It's too bad the SAXTransformer doesn't give you the option to prevent the issue.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352515#comment-16352515 ]

Tim Allison commented on TIKA-2562:
-----------------------------------

Thank you for looking into this. IIUC, [~rgauss] offers a way to get the closing tags on empty elements in TIKA-895 via configuration of the SAXTransformer.

Fellow Tika-devs, do we want to make this configurable in tika-server...or make it the default...or...?
[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351224#comment-16351224 ]

NW Brad commented on TIKA-2562:
-------------------------------

I was doing some research on this today and this may not be a function of Tika. I think it is probably the SAXTransformerFactory (javax.xml.transform) that is making the change. At least I couldn't find any code in Tika that did it directly. But anything I ran through the SAXTransformerFactory came back with its empty elements converted to self-closing tags, as shown below:

http://www.google.com"> *becomes* http://www.google.com*"/>*

and

*becomes* .

From an XML standpoint the converted syntax is correct, but the anchor tag code, while correct in XML, does not appear to work correctly as HTML in the current versions of both Chrome and Firefox. So, converting HTML via Tika in this situation generates bad HTML for the examples I have.

I believe the SAXTransformerFactory is also deleting the div that is around the "empty" anchor tag, since a div around nothing may not be considered relevant. At least that is what I speculate...
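The self-closing behaviour described in the comment above can be reproduced outside Tika with the JDK's own javax.xml.transform classes. The following is a minimal sketch, not Tika's code: the class name EmptyElementDemo and the helper serialize are hypothetical, and it assumes the default transformer implementation that ships with the JDK. It pushes a single empty anchor element through a TransformerHandler and serializes it once with the "xml" output method and once with the "html" output method.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class; not part of Tika.
class EmptyElementDemo {

    // Serializes one empty <a href="..."> element using the given
    // JAXP output method ("xml" or "html") and returns the markup.
    static String serialize(String method) throws Exception {
        // Assumption: the default JDK factory supports the SAX feature,
        // so the cast to SAXTransformerFactory is safe.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();

        // Configure the serializer before attaching the result.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, method);
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "href", "href", "CDATA", "http://www.google.com");

        // Emit an anchor with no content: start tag immediately followed by end tag.
        handler.startDocument();
        handler.startElement("", "a", "a", attrs);
        handler.endElement("", "a", "a");
        handler.endDocument();

        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("xml:  " + serialize("xml"));
        System.out.println("html: " + serialize("html"));
    }
}
```

With the JDK's built-in serializer, the "xml" method should collapse the empty anchor into a self-closing tag, while the "html" method should keep the explicit end tag that browsers expect. Whether tika-server can or should switch output methods is the open question in this thread, not something this sketch settles.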
[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350822#comment-16350822 ]

Tim Allison commented on TIKA-2562:
-----------------------------------

I'll take a look. This will require some digging.
[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350634#comment-16350634 ]

NW Brad commented on TIKA-2562:
-------------------------------

Thanks. I checked it out by parsing the file with the tagsoup command line, and tagsoup is definitely adding the shape. However, it appears that the div removal is coming from Tika.

Tagsoup parse results:
http://www.google.com">http://www.google.com

Tika parse results:
http://www.google.com">http://www.google.com

The div is gone...

I also noted another parsing problem that comes from Tika, not tagsoup, when dealing with hidden anchors/hyperlinks:

original:
http://www.google.com">

Tagsoup results:
http://www.google.com*">*

Tika results:
http://www.google.com*"/>*

Tika seems to alter the anchor by removing the end tag and replacing it with an empty-element tag. This occurs on other tags as well, most common being with . This may not seem to be a big deal, but with anchors it causes a problem in Chrome and Firefox: the anchor style bleeds into the content immediately following the anchor.

Is there a way in Tika to turn off this feature? If not, do you know where in the code this occurs?

Thanks.
[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350330#comment-16350330 ]

Tim Allison commented on TIKA-2562:
-----------------------------------

This is a "feature" of tagsoup; see, e.g., [https://groups.google.com/forum/#!topic/tagsoup-friends/EfB6i12xBLw].

I'm hesitant to fix this in Tika because we should probably migrate to jsoup, which is actively supported (TIKA-1599).