[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734469#comment-13734469
 ] 

Christoph Straßer edited comment on SOLR-4679 at 8/9/13 6:41 AM:
-

@Uwe: Big thanks for taking care of this issue! 
@Hoss Man: Thank you for your input!

 HTML line breaks (<br>) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
Assignee: Uwe Schindler
 Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
 Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png


 HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during 
 extraction of content from HTML files. They need to be replaced with a 
 space.
 Test file:
 <html>
 <head>
 <title>Test mit HTML-Zeilenschaltungen</title>
 </head>
 <p>
 word1<br>word2<br/>
 Some other words, a special name like linz<br>and another special name - vienna
 </p>
 </html>
 The Solr content attribute contains the following text:
 Test mit HTML-Zeilenschaltungen
 word1word2
 Some other words, a special name like linzand another special name - vienna
 So we are not able to find the word "linz".
 We use the ExtractingRequestHandler to put content into Solr 
 (wiki.apache.org/solr/ExtractingRequestHandler).
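
To illustrate why the missing separator breaks the search, here is a minimal sketch in plain Java. It assumes a simple lowercase-and-split-on-whitespace tokenizer as a stand-in for whatever analyzer the field actually uses; the class and method names (TokenDemo, tokens) are hypothetical and not part of Solr.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TokenDemo {

    /**
     * Hypothetical stand-in for an analyzer: lowercases the text and
     * splits it on runs of whitespace. Real Solr analyzers do more.
     */
    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    public static void main(String[] args) {
        // Text as reported in the issue: the <br> vanished without a replacement.
        String extracted = "Some other words, a special name like linzand another special name - vienna";
        // Text as it would look if the <br> had been turned into whitespace.
        String expected = "Some other words, a special name like linz\nand another special name - vienna";

        System.out.println(tokens(extracted).contains("linz")); // false: only "linzand" exists
        System.out.println(tokens(expected).contains("linz"));  // true
    }
}
```

With the glued text the only token is "linzand", so a query for "linz" has nothing to match.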

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733328#comment-13733328
 ] 

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:12 AM:
--

There is another occurrence of this bug with PDF files (SOLR-5124). I think we 
should apply the workaround and make the ignorable whitespace significant. In 
my opinion this is not a problem at all, because the Analyzer will remove this 
stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug; I have been discussing it in TIKA since the very beginning, 
and I don't think it will change! TIKA uses ignorable whitespace for all 
text-only glue stuff, which was decided at the beginning of the project. I can 
find the mail from their lists; I was involved in that, too (because I applied 
some fixes so that some parsers, which were failing to do so, correctly 
produce ignorable whitespace. I also added the XHTMLContentHandler stuff that 
makes block XHTML elements like <p/> and <div/> also emit a newline as 
ignorable whitespace on the closing element).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not 
exist (it is handled differently, but is never reported by HTML parsers), so 
the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace 
SAX event to report this added whitespace. The rule that was chosen in TIKA 
is:
- If you ignore all elements of HTML and only extract plain text, use the 
ignorable whitespace. This is done, e.g., by TIKA's internal wrappers that 
produce plain text (TextOnlyContentHandler). They treat all ignorable 
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so 
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and 
<br/>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.
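
The rule above can be sketched with nothing but the JDK's SAX classes. This is only an illustration under stated assumptions: the handler below is a hypothetical TextCollector, not Solr's SolrContentHandler and not the attached patch, and the events are replayed by hand to imitate what TIKA is described as emitting for "linz<br>and" rather than produced by a real TIKA parser.

```java
import org.xml.sax.helpers.DefaultHandler;

public class IgnorableWhitespaceDemo {

    /**
     * Hypothetical text-only handler. When keepIgnorable is true it treats
     * ignorable whitespace as significant, as the guideline suggests for
     * plain-text extraction.
     */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean keepIgnorable;

        TextCollector(boolean keepIgnorable) {
            this.keepIgnorable = keepIgnorable;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            if (keepIgnorable) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Replays hand-written SAX events imitating "linz<br>and" (an assumption, not real parser output). */
    static String replay(boolean keepIgnorable) {
        TextCollector handler = new TextCollector(keepIgnorable);
        char[] before = "linz".toCharArray();
        char[] glue = "\n".toCharArray();
        char[] after = "and".toCharArray();
        handler.characters(before, 0, before.length);
        handler.ignorableWhitespace(glue, 0, glue.length);
        handler.characters(after, 0, after.length);
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println(replay(false)); // linzand
        System.out.println(replay(true).replace("\n", "\\n")); // linz\nand
    }
}
```

Dropping the ignorable whitespace reproduces the glued "linzand" from the report; keeping it leaves a separator that the Analyzer can later discard.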


[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733328#comment-13733328
 ] 

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:17 AM:
--

There is another occurrence of this bug with PDF files (SOLR-5124). I think we 
should apply the workaround and make the ignorable whitespace significant. In 
my opinion this is not a problem at all, because the Analyzer will remove this 
stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug; I have been discussing it in TIKA since the very beginning, 
and I don't think it will change! TIKA uses ignorable whitespace for all 
text-only glue stuff, which was decided at the beginning of the project. I can 
find the mail from their lists; I was involved in that, too (because I applied 
some fixes so that some parsers, which were failing to do so, correctly 
produce ignorable whitespace. I also added the XHTMLContentHandler stuff that 
makes block XHTML elements like <p/> and <div/> also emit a newline as 
ignorable whitespace on the closing element, see TIKA-171).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not 
exist (it is handled differently, but is never reported by HTML parsers), so 
the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace 
SAX event to report this added whitespace. The rule that was chosen in TIKA 
is:
- If you ignore all elements of HTML and only extract plain text, use the 
ignorable whitespace. This is done, e.g., by TIKA's internal wrappers that 
produce plain text (TextOnlyContentHandler). They treat all ignorable 
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so 
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and 
<br/>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.


[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1377#comment-1377
 ] 

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:25 AM:
--

The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in 
TIKA-171. I think this was the issue where we decided to emit 
ignorableWhitespace for all synthetic whitespace added to support text-only 
extraction.

[~hossman]: I can take the issue if you like. I am +1 to committing your 
current patch, because it makes use of the stuff we decided in TIKA-171. In my 
opinion, TIKA-1134 is obsolete, but you or I can add a comment there to explain 
one more time and document under which circumstances TIKA emits 
ignorableWhitespace.
