[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-09 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734469#comment-13734469
 ] 

Christoph Straßer edited comment on SOLR-4679 at 8/9/13 6:41 AM:
-

@Uwe: Big thanks for taking care of this issue! 
@Hoss Man: Thank you for your input!

  was (Author: christophs78):
@Uwe: Big thanks for taking care of this issue! 
@Hoss Man: Thank you for your input'!
  
 HTML line breaks (br) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
Assignee: Uwe Schindler
 Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
 Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png


 HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during 
 extraction of content from HTML files. They need to be replaced with a 
 space.
 Test-File:
 <html>
 <head>
 <title>Test mit HTML-Zeilenschaltungen</title>
 </head>
 <p>
 word1<br>word2<br/>
 Some other words, a special name like linz<br>and another special name - 
 vienna
 </p>
 </html>
 The Solr-content-attribute contains the following text:
 Test mit HTML-Zeilenschaltungen
 word1word2
 Some other words, a special name like linzand another special name - vienna
 So we are not able to find the word linz.
 We use the ExtractingRequestHandler to put content into Solr. 
 (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733328#comment-13733328
]

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:12 AM:
--

There is another occurrence of this bug with PDF files (SOLR-5124). I think we
should apply the workaround and make the ignorable whitespace significant. In
my opinion this is not a problem at all, because the Analyzer will remove this
stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is
getting ignorable whitespace SAX events for br tags in HTML – which makes no
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this behavior; I have been discussing it in TIKA since the very
beginning, and I don't think it will change! TIKA uses ignorable whitespace for
all text-only glue stuff, which was decided at the beginning of the project. I
can find the mail from their lists; I was involved in that, too (because I
applied some fixes so that some parsers correctly produce ignorable whitespace
where they previously failed to do so. I also added the XHTMLContentHandler
stuff that makes block XHTML elements like <p/>, <div/> also emit a newline as
ignorable whitespace on the closing element).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not
exist (it is handled differently, but is never reported by HTML parsers), so
the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace
SAX event to report this added whitespace. The rule that was chosen in TIKA
is:
- If you ignore all elements of HTML and only extract plain text, use the
ignorable whitespace. This is e.g. done by TIKA's internal wrappers that
produce plain text (TextOnlyContentHandler). They treat all ignorable
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and
<br/>, then you can ignore the ignorable whitespace.
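
To illustrate the rule above, here is a minimal sketch of a text-only SAX
handler. It is not the actual SolrContentHandler and not the attached patch;
the class name TextCollector, the keepIgnorable flag and the replay helper are
hypothetical, and the event sequence (text, one ignorable newline, text) is
only an assumption about what arrives for "linz<br>and" in the test file. It
uses nothing but the JDK's org.xml.sax classes.

```java
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical plain-text collector; NOT the real SolrContentHandler. */
public class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final boolean keepIgnorable;

    public TextCollector(boolean keepIgnorable) {
        this.keepIgnorable = keepIgnorable;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Regular text is always collected.
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Text-only consumers treat this synthetic whitespace as significant.
        if (keepIgnorable) {
            text.append(ch, start, length);
        }
    }

    public String text() {
        return text.toString();
    }

    /** Replays the assumed events for "linz<br>and": text, ignorable newline, text. */
    static String replay(boolean keepIgnorable) {
        TextCollector handler = new TextCollector(keepIgnorable);
        char[] before = "linz".toCharArray();
        char[] newline = "\n".toCharArray();
        char[] after = "and".toCharArray();
        handler.characters(before, 0, before.length);
        handler.ignorableWhitespace(newline, 0, newline.length);
        handler.characters(after, 0, after.length);
        return handler.text();
    }

    public static void main(String[] args) {
        // Dropping the event glues the words together, as in the bug report.
        System.out.println(replay(false));
        // Keeping it leaves a separator that the Analyzer can split on.
        System.out.println(replay(true).replace('\n', ' '));
    }
}
```

With keepIgnorable set to false the two words run together, which matches the
"linzand" symptom from the report; with true a newline separates them, so a
whitespace-splitting Analyzer would see "linz" as its own token.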

Regarding this guideline, your patch is correct and should be applied to solr.

was (Author: thetaphi):
There is another occurrence of this bug with PDF files (SOLR-5124). I think
we should apply the workaround and make the ignorable whitespace significant.
In my opinion this is not a problem at all, because the Analyzer will remove
this stuff in any case, so some additional whitespace would disappear.

Regarding this guideline, your patch is correct and should be applied to solr.

[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733328#comment-13733328
]

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:17 AM:
--

Regarding this guideline, your patch is correct and should be applied to solr.

[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1377#comment-1377
]

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:25 AM:
--

The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in
TIKA-171. I think this was the issue when we decided to emit
ignorableWhitespace for all synthetic whitespace added to support text-only
extraction.

[~hossman]: I can take the issue if you like. I am +1 to committing your
current patch, because it makes use of the stuff we decided in TIKA-171. In my
opinion, TIKA-1134 is obsolete, but you or I can add a comment there to explain
one more time and document under which circumstances TIKA emits
ignorableWhitespace.

was (Author: thetaphi):
The stuff with ignorableWhitespace was discussed between [~jukkaz] and me
in TIKA-171. I think this was the issue when we decided to emit
ignorableWhitespace for all synthetic whitespace added to support-text only
extraction.
