[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly
[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667869#comment-16667869 ]

Hudson commented on TIKA-2599:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-branch-1x #120 (See [https://builds.apache.org/job/tika-branch-1x/120/])
TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by (dmeikle: [https://github.com/apache/tika/commit/eb53077d62ed31795e676b5bcdce01b8ad809c99])
* (add) tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by (dmeikle: [https://github.com/apache/tika/commit/50a2a8f6391b87fa8f1b766143f2d759c99cae4b])
* (edit) CHANGES.txt

> Hyperlink surrounded by Italics not closed Properly
> ---------------------------------------------------
>
> Key: TIKA-2599
> URL: https://issues.apache.org/jira/browse/TIKA-2599
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14, 1.15, 1.16, 1.17
> Environment: Any
> Reporter: Ronan O'Sullivan
> Assignee: Dave Meikle
> Priority: Minor
> Fix For: 1.20
>
> Attachments: diff-TIKA-2599.txt, testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the
> resulting xhtml is:
>
> Italic Test before link href="http://www.google.com"/>hyperlink italics
> Italic text after hyperlink
>
> The opening italics tag is not closed which is not valid XHTML.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
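The issue description above says the unclosed italics tag makes the output invalid XHTML. A minimal sketch of how that could be checked with the JDK's own XML parser follows. The archive stripped the tags from the quoted sample, so the two markup strings below are assumed reconstructions for illustration only, and `WellFormedSketch` is a hypothetical name that is not part of Tika.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedSketch {
    // Assumed shape of the faulty output: </i> arrives while <a> is still open.
    static final String MIS_NESTED =
            "<p><i>Italic Test before link <a href=\"http://www.google.com\"></i>"
            + "hyperlink italics</a></p>";

    // Assumed shape of well-nested output: <i> is closed before <a> opens.
    static final String PROPERLY_NESTED =
            "<p><i>Italic Test before link </i><a href=\"http://www.google.com\">"
            + "<i>hyperlink italics</i></a><i> Italic text after hyperlink</i></p>";

    /** Returns true if the string parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from
            // printing its own diagnostics to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Mis-nested or unclosed elements end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("mis-nested:      " + isWellFormed(MIS_NESTED));
        System.out.println("properly nested: " + isWellFormed(PROPERLY_NESTED));
    }
}
```

Well-formedness is a weaker condition than XHTML validity, but a document that fails it cannot be valid XHTML, so it is enough to demonstrate the problem reported here.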
[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly
[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667863#comment-16667863 ]

Hudson commented on TIKA-2599:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika-trunk #1584 (See [https://builds.apache.org/job/Tika-trunk/1584/])
TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by (dmeikle: [https://github.com/apache/tika/commit/10a48b7a0077fbe627d3a0111f92910228d05d77])
* (add) tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly
[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667848#comment-16667848 ]

Hudson commented on TIKA-2599:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #338 (See [https://builds.apache.org/job/tika-2.x-windows/338/])
TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by (dmeikle: rev 10a48b7a0077fbe627d3a0111f92910228d05d77)
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (add) tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly
[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667823#comment-16667823 ]

Dave Meikle commented on TIKA-2599:
-----------------------------------

Committed to branch_1x in 324cbd2eb4d64f1e34aba9789ee8b06cbf4d991e and master in 6ccedbadd4f79d7888eabfcd3a74ab85e168. Thanks [~ronanos]!
[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly
[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667821#comment-16667821 ]

ASF GitHub Bot commented on TIKA-2599:
--------------------------------------

dameikle closed pull request #254: TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by Ronan O'Sullivan.
URL: https://github.com/apache/tika/pull/254

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
index 30bd4bb969..6f7d3785bd 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
@@ -528,8 +528,8 @@ private int handleSpecialCharacterRuns(Paragraph p, int index, boolean skipStyli
                 url = text.substring(start, end);
             }
 
-            xhtml.startElement("a", "href", url);
             closeStyleElements(skipStyling, xhtml);
+            xhtml.startElement("a", "href", url);
             for (CharacterRun cr : texts) {
                 handleCharacterRun(cr, skipStyling, xhtml);
             }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
index 7456ac409e..d2c38a42d5 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
@@ -560,6 +560,15 @@ public void testBoldHyperlink() throws Exception {
         assertContains("http://tika.apache.org/\;>hyper link; bold" , xml);
     }
 
+    @Test
+    public void testHyperlinkSurroundedByItalics() throws Exception {
+        //TIKA-2599
+        String xml = getXML("testWORD_italicsSurroundingHyperlink.doc").xml;
+        xml = xml.replaceAll("\\s+", " ");
+        assertContains("Italic Test before link http://www.google.com\;>" +
+                "hyperlink italics Italic text after hyperlink", xml);
+    }
+
     @Test
     public void testMacros() throws Exception {
 
diff --git a/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
new file mode 100644
index 00..24edb8f718
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc differ

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
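The WordExtractor change in the diff above only swaps two calls, so a small sketch may help show why the order matters. It assumes, as the diff suggests but does not state, that `closeStyleElements` emits the end tags of any open style elements such as `<i>`. The `Recorder` class below is a hypothetical stand-in for an XHTML event sink, not Tika's `XHTMLContentHandler`, and the text content is shortened for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class StyleNestingSketch {
    /** Records start/end events and tracks whether each end tag matches the innermost open element. */
    static final class Recorder {
        private final Deque<String> open = new ArrayDeque<>();
        private final StringBuilder out = new StringBuilder();
        private boolean wellNested = true;

        void start(String name) {
            open.push(name);
            out.append('<').append(name).append('>');
        }

        void end(String name) {
            if (open.isEmpty() || !open.peek().equals(name)) {
                // The end tag does not close the innermost open element.
                wellNested = false;
            } else {
                open.pop();
            }
            out.append("</").append(name).append('>');
        }

        void text(String s) {
            out.append(s);
        }

        boolean isWellNested() {
            return wellNested && open.isEmpty();
        }

        String output() {
            return out.toString();
        }
    }

    /** Emits italic text, a link, and more italic text, in either call order. */
    static Recorder emit(boolean closeStylesFirst) {
        Recorder r = new Recorder();
        r.start("i");
        r.text("before ");
        if (closeStylesFirst) {
            r.end("i");   // order after the patch: close styles, then open the link
            r.start("a");
        } else {
            r.start("a"); // order before the patch: open the link, then close styles
            r.end("i");
        }
        r.start("i");
        r.text("link");
        r.end("i");
        r.end("a");
        r.start("i");
        r.text(" after");
        r.end("i");
        return r;
    }

    public static void main(String[] args) {
        for (boolean fixed : new boolean[] {false, true}) {
            Recorder r = emit(fixed);
            System.out.println((fixed ? "patched order: " : "old order:     ")
                    + r.output() + "  wellNested=" + r.isWellNested());
        }
    }
}
```

With the old order the `</i>` event arrives while `<a>` is the innermost open element, which is the mis-nesting the issue reports. With the patched order every end tag closes the element opened most recently.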
[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly
[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667820#comment-16667820 ]

ASF GitHub Bot commented on TIKA-2599:
--------------------------------------

dameikle opened a new pull request #254: TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by Ronan O'Sullivan.
URL: https://github.com/apache/tika/pull/254
[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly
[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667793#comment-16667793 ]

ASF GitHub Bot commented on TIKA-2599:
--------------------------------------

dameikle closed pull request #253: TIKA-2599: Fixed closing of styles around Hyperlinks (by Ronan O'Sullivan)
URL: https://github.com/apache/tika/pull/253

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/CHANGES.txt b/CHANGES.txt
index 1f793d2f62..187531acf1 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -3,6 +3,9 @@ Release 1.20 - ???
    * Use -javaHome or $JAVA_HOME (if they exist) when spawning child in
      tika-server's -spawnChild mode.
 
+   * Fixed closing of styles around Hyperlinks in Word Parser
+     Contributed by Ronan O'Sullivan (TIKA-2599).
+
 Release 1.19.1 - 10/4/2018
 
    * Update PDFBox to 2.0.12, jempbox to 1.8.16
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
index 30bd4bb969..6f7d3785bd 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
@@ -528,8 +528,8 @@ private int handleSpecialCharacterRuns(Paragraph p, int index, boolean skipStyli
                 url = text.substring(start, end);
             }
 
-            xhtml.startElement("a", "href", url);
             closeStyleElements(skipStyling, xhtml);
+            xhtml.startElement("a", "href", url);
             for (CharacterRun cr : texts) {
                 handleCharacterRun(cr, skipStyling, xhtml);
             }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
index 31bd8ba293..d7d6daee56 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
@@ -570,6 +570,15 @@ public void testBoldHyperlink() throws Exception {
         assertContains("http://tika.apache.org/\;>hyper link; bold" , xml);
     }
 
+    @Test
+    public void testHyperlinkSurroundedByItalics() throws Exception {
+        //TIKA-2599
+        String xml = getXML("testWORD_italicsSurroundingHyperlink.doc").xml;
+        xml = xml.replaceAll("\\s+", " ");
+        assertContains("Italic Test before link http://www.google.com\;>" +
+                "hyperlink italics Italic text after hyperlink", xml);
+    }
+
     @Test
     public void testMacros() throws Exception {
 
diff --git a/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
new file mode 100644
index 00..24edb8f718
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc differ

> Hyperlink surrounded by Italics not closed Properly
> ---------------------------------------------------
>
> Key: TIKA-2599
> URL: https://issues.apache.org/jira/browse/TIKA-2599
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14, 1.15, 1.16, 1.17
> Environment: Any
> Reporter: Ronan O'Sullivan
> Priority: Minor
> Attachments: diff-TIKA-2599.txt, testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the
> resulting xhtml is:
>
> Italic Test before link href="http://www.google.com"/>hyperlink italics
> Italic text after hyperlink
>
> The opening italics tag is not closed which is not valid XHTML.
[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly
[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667789#comment-16667789 ]

ASF GitHub Bot commented on TIKA-2599:
--------------------------------------

dameikle opened a new pull request #253: TIKA-2599: Fixed closing of styles around Hyperlinks (by Ronan O'Sullivan)
URL: https://github.com/apache/tika/pull/253

Contributed by Ronan O'Sullivan.
[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly
[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388973#comment-16388973 ]

Ronan O'Sullivan commented on TIKA-2599:
----------------------------------------

Attaching a diff of the fix to JIRA. I cannot create a Review Board request because the git diff is not submitting to Review Board.