[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734469#comment-13734469 ]

Christoph Straßer edited comment on SOLR-4679 at 8/9/13 6:41 AM:

@Uwe: Big thanks for taking care of this issue!
@Hoss Man: Thank you for your input!

was (Author: christophs78):
@Uwe: Big thanks for taking care of this issue!
@Hoss Man: Thank you for your input'!

HTML line breaks (br) are removed during indexing; causes wrong search results

Key: SOLR-4679
URL: https://issues.apache.org/jira/browse/SOLR-4679
Project: Solr
Issue Type: Bug
Components: update
Affects Versions: 4.2
Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
Assignee: Uwe Schindler
Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png

HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during extraction of content from HTML files. They need to be replaced with a space.

Test file:
<html>
<head>
<title>Test mit HTML-Zeilenschaltungen</title>
</head>
<p>
word1<br>word2<br/>
Some other words, a special name like linz<br>and another special name - vienna
</p>
</html>

The Solr content attribute contains the following text:
Test mit HTML-Zeilenschaltungen
word1word2
Some other words, a special name like linzand another special name - vienna

So we are not able to find the word "linz", because it has been merged with the following word into "linzand".

We use the ExtractingRequestHandler to put content into Solr. (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
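The symptom described in the report can be reproduced without Solr. The following is only a minimal sketch: the class and method names are hypothetical, and splitting on whitespace is a rough stand-in for whatever Analyzer the real field uses. It shows why "linz" cannot be matched once the line break is dropped without a replacement space.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Solr or Tika.
class LineBreakTokens {

    // Splits text on runs of whitespace (a crude approximation of tokenization).
    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Text as reported in the issue: the line break vanished, merging two words.
        String extracted = "Some other words, a special name like linzand another special name - vienna";
        // Text as it would look if the line break had been replaced with a space.
        String expected = "Some other words, a special name like linz and another special name - vienna";

        System.out.println(tokens(extracted).contains("linz")); // false
        System.out.println(tokens(expected).contains("linz"));  // true
    }
}
```

The word "vienna" is still found in both cases because it is already surrounded by real whitespace in the source file.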
[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733328#comment-13733328 ]

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:12 AM:

There is another occurrence of this bug with PDF files (SOLR-5124). I think we should apply the workaround and make the ignorable whitespace significant. In my opinion this is not a problem at all, because the Analyzer will remove this stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for br tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug; I have been discussing it since the early days of TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all text-only glue stuff, which was decided at the beginning of the project. I can find the mail from their lists; I was involved in that, too (because I applied some fixes to correctly produce ignorable whitespace in some parsers that were failing to do this. I also added the XHTMLContentHandler stuff that makes block XHTML elements like <p/>, <div/> also emit a newline as ignorable whitespace on the closing element).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not exist (it is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace SAX event to report this added whitespace. The rule that was chosen in TIKA is:
- If you ignore all elements of HTML and only extract plain text, use the ignorable whitespace. This is done, e.g., by TIKA's internal wrappers that produce plain text (TextOnlyContentHandler). They treat all ignorable whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and <br/>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.

was (Author: thetaphi):
There is another occurrence of this bug with PDF files (SOLR-5124). I think we should apply the workaround and make the ignorable whitespace significant. In my opinion this is not a problem at all, because the Analyzer will remove this stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for br tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug; I have been discussing it since the early days of TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all text-only glue stuff, which was decided at the beginning of the project. I can find the mail from their lists; I was involved in that, too (because I applied some fixes to correctly produce ignorable whitespace in some parsers that were failing to do this).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not exist (it is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace SAX event to report this added whitespace. The rule that was chosen in TIKA is:
- If you ignore all elements of HTML and only extract plain text, use the ignorable whitespace. This is done, e.g., by TIKA's internal wrappers that produce plain text (TextOnlyContentHandler). They treat all ignorable whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and <br/>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.
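The first rule above (a plain-text consumer treats ignorable whitespace as significant) can be sketched with the standard SAX API alone. This is an illustration under stated assumptions, not the actual SOLR-4679 patch and not the real SolrContentHandler: the class name PlainTextHandler is hypothetical, and the event sequence in main is simulated by hand rather than produced by TIKA (in particular, the single "\n" emitted for the line break is an assumption).

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical plain-text collector; only DefaultHandler is a real SAX class.
class PlainTextHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // DefaultHandler does nothing here by default, which drops the
        // whitespace. Forwarding it to characters() makes it significant.
        characters(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    public static void main(String[] args) {
        PlainTextHandler handler = new PlainTextHandler();
        // Simulated events for "linz<br>and" (assumed shape, not real TIKA output).
        handler.characters("linz".toCharArray(), 0, 4);
        handler.ignorableWhitespace("\n".toCharArray(), 0, 1);
        handler.characters("and".toCharArray(), 0, 3);
        // Prints "linz" and "and" on separate lines instead of "linzand".
        System.out.println(handler.text());
    }
}
```

A handler that keeps the XHTML structure (the second rule) would instead leave ignorableWhitespace as a no-op and derive its own breaks from the block and line-break elements.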
[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733328#comment-13733328 ]

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:17 AM:

There is another occurrence of this bug with PDF files (SOLR-5124). I think we should apply the workaround and make the ignorable whitespace significant. In my opinion this is not a problem at all, because the Analyzer will remove this stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for br tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug; I have been discussing it since the early days of TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all text-only glue stuff, which was decided at the beginning of the project. I can find the mail from their lists; I was involved in that, too (because I applied some fixes to correctly produce ignorable whitespace in some parsers that were failing to do this. I also added the XHTMLContentHandler stuff that makes block XHTML elements like <p/>, <div/> also emit a newline as ignorable whitespace on the closing element, see TIKA-171).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not exist (it is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace SAX event to report this added whitespace. The rule that was chosen in TIKA is:
- If you ignore all elements of HTML and only extract plain text, use the ignorable whitespace. This is done, e.g., by TIKA's internal wrappers that produce plain text (TextOnlyContentHandler). They treat all ignorable whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and <br/>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.

was (Author: thetaphi):
There is another occurrence of this bug with PDF files (SOLR-5124). I think we should apply the workaround and make the ignorable whitespace significant. In my opinion this is not a problem at all, because the Analyzer will remove this stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for br tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug; I have been discussing it since the early days of TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all text-only glue stuff, which was decided at the beginning of the project. I can find the mail from their lists; I was involved in that, too (because I applied some fixes to correctly produce ignorable whitespace in some parsers that were failing to do this. I also added the XHTMLContentHandler stuff that makes block XHTML elements like <p/>, <div/> also emit a newline as ignorable whitespace on the closing element).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not exist (it is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace SAX event to report this added whitespace. The rule that was chosen in TIKA is:
- If you ignore all elements of HTML and only extract plain text, use the ignorable whitespace. This is done, e.g., by TIKA's internal wrappers that produce plain text (TextOnlyContentHandler). They treat all ignorable whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and <br/>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.
[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1377#comment-1377 ]

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:25 AM:

The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in TIKA-171. I think this was the issue where we decided to emit ignorableWhitespace for all synthetic whitespace added to support text-only extraction.

[~hossman]: I can take the issue if you like. I am +1 to committing your current patch, because it makes use of the stuff we decided in TIKA-171. In my opinion, TIKA-1134 is obsolete, but you or I can add a comment there to explain one more time and document under which circumstances TIKA emits ignorableWhitespace.

was (Author: thetaphi):
The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in TIKA-171. I think this was the issue where we decided to emit ignorableWhitespace for all synthetic whitespace added to support-text only extraction.

[~hossman]: I can take the issue if you like. I am +1 to committing your current patch, because it makes use of the stuff we decided in TIKA-171. In my opinion, TIKA-1134 is obsolete, but you or I can add a comment there to explain one more time and document under which circumstances TIKA emits ignorableWhitespace.