[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734469#comment-13734469
 ] 

Christoph Straßer commented on SOLR-4679:
-

@Uwe: Big thanks for taking care of this issue! 
@Hoss Man: Thank you for your input!

 HTML line breaks (br) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
Assignee: Uwe Schindler
 Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
 Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png


 HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during
 extraction of content from HTML files. They need to be replaced with an empty
 space.
 Test file:
 <html>
 <head>
 <title>Test mit HTML-Zeilenschaltungen</title>
 </head>
 <p>
 word1<br>word2<br/>
 Some other words, a special name like linz<br>and another special name -
 vienna
 </p>
 </html>
 The Solr content attribute contains the following text:
 Test mit HTML-Zeilenschaltungen
 word1word2
 Some other words, a special name like linzand another special name - vienna
 So we are not able to find the word linz.
 We use the ExtractingRequestHandler to put content into Solr.
 (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734759#comment-13734759
 ] 

ASF subversion and git services commented on SOLR-4679:
---

Commit 1512296 from [~thetaphi] in branch 'dev/trunk'
[ https://svn.apache.org/r1512296 ]

SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using 
Solr Cell was missing ignorable whitespace, which is inserted by TIKA for 
convenience to support plain text extraction without using the HTML elements. 
This bug resulted in glued words.

 HTML line breaks (br) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
Assignee: Uwe Schindler
 Fix For: 4.5

 Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
 Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png





[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734763#comment-13734763
 ] 

ASF subversion and git services commented on SOLR-4679:
---

Commit 1512297 from [~thetaphi] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1512297 ]

Merged revision(s) 1512296 from lucene/dev/trunk:
SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using 
Solr Cell was missing ignorable whitespace, which is inserted by TIKA for 
convenience to support plain text extraction without using the HTML elements. 
This bug resulted in glued words.




[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733328#comment-13733328
 ] 

Uwe Schindler commented on SOLR-4679:
-

There is another occurrence of this bug with PDF files (SOLR-5124). I think we
should apply the workaround and make the ignorable whitespace significant. In
my opinion this is not a problem at all, because the Analyzer will remove this
stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this behavior; I have been discussing it in TIKA since the early days,
and I don't think it will change! TIKA uses ignorable whitespace for all
text-only glue stuff, which was decided at the beginning of the project. I can
find the mail from their lists; I was involved in that, too (because I applied
some fixes so that some parsers, which failed to do so, correctly produce
ignorable whitespace).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not
exist (it is handled differently, but is never reported by HTML parsers), so
the idea in TIKA is to reuse (which is a bit incorrect) the ignorableWhitespace
SAX event to report this added whitespace. The rule that was chosen in TIKA
is:
- If you ignore all elements of HTML and only extract plain text, use the
ignorable whitespace. This is done, e.g., by TIKA's internal wrappers that
produce plain text (TextOnlyContentHandler). They treat all ignorable
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and
<br/>, then you can ignore the ignorable whitespace.

According to this guideline, your patch is correct and should be applied to Solr.
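
To illustrate the rule above, here is a minimal sketch of a text-only SAX consumer. It is not the SolrContentHandler and not the attached patch: the class names (IgnorableWhitespaceDemo, TextCollector) and the hand-written event sequence are assumptions made for illustration, and only the JDK's org.xml.sax API is used. The replayed events stand in for what a parser following the rule might report around a <br>: real text via characters(), the synthetic newline via ignorableWhitespace().

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; only the JDK SAX API is real here.
class IgnorableWhitespaceDemo {

    /** Collects plain text; optionally treats ignorable whitespace as significant. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean keepIgnorable;

        TextCollector(boolean keepIgnorable) {
            this.keepIgnorable = keepIgnorable;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // A text-only consumer keeps the synthetic whitespace;
            // a structure-aware consumer may drop it.
            if (keepIgnorable) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Sends a string as a characters() event. */
    static void emitChars(TextCollector h, String s) {
        char[] c = s.toCharArray();
        h.characters(c, 0, c.length);
    }

    /** Replays assumed events for "linz<br>and another". */
    static String replay(boolean keepIgnorable) {
        TextCollector h = new TextCollector(keepIgnorable);
        emitChars(h, "linz");
        char[] newline = {'\n'};
        h.ignorableWhitespace(newline, 0, 1); // stands in for the <br>
        emitChars(h, "and another");
        return h.text();
    }

    public static void main(String[] args) {
        // Dropping ignorable whitespace glues the words together.
        System.out.println(replay(false));                      // linzand another
        // Keeping it preserves the word boundary (newline shown escaped).
        System.out.println(replay(true).replace("\n", "\\n"));  // linz\nand another
    }
}
```

With keepIgnorable set to false the word "linz" cannot be found as a separate token, which matches the symptom in the issue description.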

 HTML line breaks (br) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
 Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
 Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png





[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1377#comment-1377
 ] 

Uwe Schindler commented on SOLR-4679:
-

The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in
TIKA-171. I think this was the issue in which we decided to emit
ignorableWhitespace for all synthetic whitespace added to support text-only
extraction.

[~hossman]: I can take the issue if you like. I am +1 to committing your
current patch, because it makes use of the stuff we decided in TIKA-171. In my
opinion, TIKA-1134 is obsolete, but you or I can add a comment there to explain
one more time and document under which circumstances TIKA emits
ignorableWhitespace.




[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733656#comment-13733656
 ] 

Hoss Man commented on SOLR-4679:


Uwe: I defer to your judgement on this.  if you think the patch is the right
way to go, then +1 from me.




[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733776#comment-13733776
 ] 

Uwe Schindler commented on SOLR-4679:
-

Hoss: I just took this issue because it was unassigned and I was the one
who pushed for adding ignorable whitespace in TIKA at that time. Jukka and I
decided this would be the best approach.

Because you are still not convinced with my argumentation, let me recapitulate
TIKA's problems:

- TIKA decided to use XHTML as its output format to report the parsed documents
to the consumer. This is nice, because it preserves some of the
formatting (like bold fonts, paragraphs, ...) of the original
document. Of course most of this formatting is lost, but you can still detect
things like emphasized text. Having chosen XHTML as its output format, TIKA
must of course use XHTML formatting for new lines and similar. So whenever a
line break is needed, the TIKA parser emits a <br/> tag or places the paragraph
(in a PDF) inside a <p/> element. As we all know, HTML ignores formatting like
newlines, tabs, ... (all are treated as one single whitespace, much like
this regex replacement: {{s/\s+/ /}}).
- On the other hand, TIKA wants to make it simple for people to extract the
*plain text* contents. With the XHTML-only approach this would be hard for the
consumer, because to add the correct newlines, the consumer has to fully
understand XHTML, detect block elements, and replace them with \n.

To support both usages of TIKA, the idea was to embed this information, which is
unimportant to HTML (as HTML ignores whitespace completely), as
ignorableWhitespace, as a convenience for the user. A fully compliant XHTML
consumer would not parse the ignorable stuff. As it understands HTML, it would
detect a <p> element as a block element and format the output.

Solr unfortunately has a somewhat strange approach: It is mainly interested in
the text-only contents, so ideally when consuming the HTML it could use
{{WriteOutContentHandler(StringBuilder,
BodyContentHandler(parserContentHandler))}}. In that case TIKA would do the
right thing automatically: It would extract only text from the body element and
would use the convenience whitespace to format the text in an ASCII-art-like way
(using tabs, newlines, ...) :-)
Solr has a hybrid approach: It collects everything into a content tag (which is
similar to the above approach), but the bug is that, in contrast to TIKA's
official WriteOutContentHandler, it does not use the ignorable whitespace
inserted for convenience. In addition TIKA also has a stack that allows
processing parts of the documents (like the <title> element or all <em> elements).
In that case it has several StringBuilders in parallel that are populated with
the contents. The problems are here too, but cannot be solved by using
ignorable whitespace: e.g. if one indexes only all <em> elements (which are inline
HTML elements, not block elements), there is no whitespace, so all <em> elements
would be glued together in the em field of your index... I just mention this;
in my opinion the SolrContentHandler needs more work to correctly understand
HTML and not just collect element names in a map!

Now to your complaint: You proposed to report the newlines as real
{{characters()}} events - but this is not the right thing to do here. As I said,
HTML does not know these characters; they are ignored. The formatting is done
by the element names (like p, div, table). So the helper whitespace for
text-only consumers should be inserted as ignorableWhitespace only. If we
added it to the real character data, we would report things that every HTML parser
(like nekohtml) would never report to the consumer. Nekohtml would also report
this useless extra whitespace as ignorable.

The convenience here is that TIKA's XHTMLContentHandler, used by all parsers, is
configured to help the text-only user without hurting the HTML-only user.
This differentiation is done by reporting the HTML element names (p, div,
table, th, td, tr, abbr, em, strong, ...) but also reporting the
ASCII-art text-only content like TABs inside tables, newlines after block
elements, ... This is always done as ignorableWhitespace (for convenience); a
real HTML parser must ignore it - and it's correct to do this.
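
The gluing of inline elements mentioned above can be sketched as follows. This is a hypothetical handler (EmFieldDemo, EmCollector), not the SolrContentHandler, and the events are replayed by hand for the assumed input "<em>linz</em> and <em>vienna</em>". Because <em> is an inline element, no ignorable whitespace is emitted between the two elements in this sketch, so a per-element buffer receives the words back to back. The separate flag shows one possible mitigation, appending a space when the element closes; that is an assumption for illustration and not something the patch does.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; only the JDK SAX API is real here.
class EmFieldDemo {

    /** Collects only the text that appears inside <em> elements. */
    static class EmCollector extends DefaultHandler {
        private final StringBuilder emField = new StringBuilder();
        private final boolean separate;
        private int depth = 0;

        EmCollector(boolean separate) {
            this.separate = separate;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("em".equals(localName)) {
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("em".equals(localName)) {
                depth--;
                // Assumed mitigation: separate the values of consecutive elements.
                if (separate && depth == 0) {
                    emField.append(' ');
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                emField.append(ch, start, length);
            }
        }

        String emField() {
            return emField.toString().trim();
        }
    }

    /** Sends a string as a characters() event. */
    static void text(EmCollector h, String s) {
        char[] c = s.toCharArray();
        h.characters(c, 0, c.length);
    }

    /** Replays assumed events for "<em>linz</em> and <em>vienna</em>". */
    static String replay(boolean separate) {
        EmCollector h = new EmCollector(separate);
        Attributes none = new AttributesImpl();
        String ns = "http://www.w3.org/1999/xhtml";
        h.startElement(ns, "em", "em", none);
        text(h, "linz");
        h.endElement(ns, "em", "em");
        text(h, " and ");
        h.startElement(ns, "em", "em", none);
        text(h, "vienna");
        h.endElement(ns, "em", "em");
        return h.emField();
    }

    public static void main(String[] args) {
        System.out.println(replay(false)); // linzvienna
        System.out.println(replay(true));  // linz vienna
    }
}
```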




[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733791#comment-13733791
 ] 

Hoss Man commented on SOLR-4679:


bq. Because you are still not convinced with my argumentation, let me 
recapitulate TIKA's problems:

I never said that ... you said "I can take the issue if you like." and you
explained why the existing patch should be committed -- i'm totally willing to
go along with that, so have at it.  it seems sketchy to me, but if that's the
way Tika works that's the way tika works, you certainly understand it better
than me, so i defer to your assessment.

(as mentioned in TIKA-1134 it would be nice if this type of behavior was better 
documented for people implementing their own ContentHandlers, but that's a Tika 
issue not a Solr issue.)




[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733901#comment-13733901
 ] 

Uwe Schindler commented on SOLR-4679:
-

bq. I never said that ...

You somehow said:

bq. I defer to your judgement on this

So I assumed that you are still not 100% convinced. Sorry.

In any case I will take the issue. In my opinion there is more work to be done
with this crazy stack of StringBuilders to better handle the ignorableWhitespace
when a new field begins/ends. Currently it's inserted after the block end tag, so
it would go one up in the stack only. I have to think a little bit about it,
but the fix in your patch is the easiest for now. And the possibly useless
whitespace on some lower-stacked StringBuilders is generally removed by text
analysis.
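
As a rough illustration of why surplus whitespace is harmless at analysis time, the sketch below splits text on runs of whitespace. It is a plain-Java stand-in, not Lucene's analysis chain; the class and method names (WhitespaceTokensDemo, tokens) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for whitespace-based tokenization; not a Lucene API.
class WhitespaceTokensDemo {

    /** Splits on runs of whitespace, so extra blanks, tabs and newlines vanish. */
    static List<String> tokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // Missing whitespace produces a glued token that no query for "linz" matches.
        System.out.println(tokens("linzand another"));        // [linzand, another]
        // Surplus whitespace yields the same tokens as a single space would.
        System.out.println(tokens("linz\n\n and\tanother"));  // [linz, and, another]
    }
}
```

The asymmetry is the point: too little whitespace changes the tokens, while too much does not.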




[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-04-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626309#comment-13626309
 ] 

Christoph Straßer commented on SOLR-4679:
-

Thank you for checking Tika.

As far as i understand http://wiki.apache.org/solr/ExtractingRequestHandler 
extracts XHTML, not text. Tika XHTML-option-output looks okay too. 

Root issue - like you said - probably somewhere within Solr.

{noformat}
D:\temp\20130409>java -jar tika-app-1.3.jar --xml external.htm
<?xml version="1.0" encoding="UTF-8"?><html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="193"/>
<meta name="Content-Encoding" content="windows-1252"/>
<meta name="Content-Type" content="text/html; charset=windows-1252"/>
<meta name="resourceName" content="external.htm"/>
<meta name="dc:title" content="Test mit HTML-Zeilenschaltungen"/>
<title>Test mit HTML-Zeilenschaltungen</title>
</head>
<body><p>
word1
word2

Some other words, a special name like linz
and another special name - vienna
</p>

</body></html>
{noformat}

 HTML line breaks (br) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
 Attachments: external.htm, Solr_HtmlLineBreak_Linz_NotFound.png, 
 Solr_HtmlLineBreak_Vienna.png





[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-04-09 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626775#comment-13626775
 ] 

Hoss Man commented on SOLR-4679:


Right ... i wonder if somewhere in the flow of SAX events these newlines are
being treated as ignorable whitespace ... i can't imagine why they would be,
but that's the best guess i have at the moment.




[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-04-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625691#comment-13625691
 ] 

Hoss Man commented on SOLR-4679:


FYI, i've confirmed this isn't a general problem with Tika 1.3 -- using the
tika-app.jar with the --text option there is a newline generated in place of
the <br/> tag, so something about Solr's use of Tika is losing this
information...

{noformat}
hossman@frisbee:~/tmp$ cat external.htm 
<html>
<head>
<title>Test mit HTML-Zeilenschaltungen</title>
</head>
<p>
word1<br>word2<br/>
Some other words, a special name like linz<br>and another special name - vienna
</p>
</html>hossman@frisbee:~/tmp$ java -jar tika-app-1.3.jar --text external.htm

word1
word2

Some other words, a special name like linz
and another special name - vienna


hossman@frisbee:~/tmp$ java -jar tika-app-1.3.jar --text external.htm | cat -vet
$
word1$
word2$
$
Some other words, a special name like linz$
and another special name - vienna$
$
$
{noformat}
