[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-09 Thread JIRA

[
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734463#comment-13734463
]

Christoph Straßer commented on SOLR-5124:
-

@Jack: There is no odd Unicode character involved. (A screenshot of the
Fiddler raw view of the extractOnly=true response is attached.)
@Uwe: Big thanks for taking care of this issue! :-)

Solr glues word´s when parsing PDFs under certan circumstances
--

Key: SOLR-5124
URL: https://issues.apache.org/jira/browse/SOLR-5124
Project: Solr
Issue Type: Bug
Components: update
Affects Versions: 4.4
Environment: Windows 7 (I don't think this is relevant)
Reporter: Christoph Straßer
Priority: Minor
Labels: tika,text-extraction
Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png,
03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png,
03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png

For some PDF documents, Solr glues words together at line breaks under certain
circumstances (e.g. the last word of line 1 and the first word of line 2 are
merged into one word).
Stand-alone Tika extracts the text correctly. Attached are a sample PDF and
screenshots of the Tika output and of the corrupted content indexed by Solr.
(This issue does not occur with all PDF documents. I tried to reproduce it
with new Word documents that I converted to PDF in several ways, without
success.) The attached PDF document has a really weird internal structure, but
Tika seems to do its work correctly even with this document.
Our Solr indices contain a good number of these weird documents, which
leads to worse suggestions from the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-09 Thread ASF subversion and git services (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734761#comment-13734761
]

ASF subversion and git services commented on SOLR-5124:
---

Commit 1512296 from [~thetaphi] in branch 'dev/trunk'
[ https://svn.apache.org/r1512296 ]

SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using
Solr Cell was missing ignorable whitespace, which is inserted by TIKA for
convenience to support plain text extraction without using the HTML elements.
This bug resulted in glued words.
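
The effect described in this commit message can be sketched with a plain SAX handler. This is only an illustration, not the code from the commit: the class name `IgnorableWhitespaceDemo`, the `TextCollector` handler and the replayed event sequence are hypothetical, and the sketch assumes the line break between two PDF lines arrives as an `ignorableWhitespace` event rather than as a `characters` event.

```java
import org.xml.sax.helpers.DefaultHandler;

public class IgnorableWhitespaceDemo {

    /** Collects text content; optionally treats ignorable whitespace as real text. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean keepIgnorableWhitespace;

        TextCollector(boolean keepIgnorableWhitespace) {
            this.keepIgnorableWhitespace = keepIgnorableWhitespace;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Dropping this event is what glues the neighbouring words together.
            if (keepIgnorableWhitespace) {
                characters(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Replays a hypothetical event sequence for two consecutive lines of a page. */
    static String extract(boolean keepIgnorableWhitespace) {
        TextCollector handler = new TextCollector(keepIgnorableWhitespace);
        char[] line1 = "first line ends".toCharArray();
        char[] lineBreak = "\n".toCharArray();
        char[] line2 = "here and continues".toCharArray();
        handler.characters(line1, 0, line1.length);
        handler.ignorableWhitespace(lineBreak, 0, lineBreak.length);
        handler.characters(line2, 0, line2.length);
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println(extract(false)); // "endshere" is glued
        System.out.println(extract(true));  // line break is preserved
    }
}
```

With `keepIgnorableWhitespace` set to `false` the handler produces `endshere`; forwarding the event to `characters` keeps the two words apart.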

[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-09 Thread ASF subversion and git services (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734765#comment-13734765
]

ASF subversion and git services commented on SOLR-5124:
---

Commit 1512297 from [~thetaphi] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1512297 ]

Merged revision(s) 1512296 from lucene/dev/trunk:
SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using
Solr Cell was missing ignorable whitespace, which is inserted by TIKA for
convenience to support plain text extraction without using the HTML elements.
This bug resulted in glued words.

[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-08 Thread Uwe Schindler (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733312#comment-13733312
]

Uwe Schindler commented on SOLR-5124:
-

I have not looked into DIH's code, but I know that TIKA emits the extra
whitespace as ignorable-whitespace XML data. The extraction content handler
might be dropping it when it consumes the SAX events.

[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-08 Thread JIRA

[
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733321#comment-13733321
]

Christoph Straßer commented on SOLR-5124:
-

Maybe it's related in some way to SOLR-4679. (I'm not sure, though; we use the
ExtractingRequestHandler.)

[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-08 Thread Uwe Schindler (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733325#comment-13733325
]

Uwe Schindler commented on SOLR-5124:
-

Hi, this is a duplicate of two other issues; SOLR-4679 is the main one. I will
close this as a duplicate.

[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-08 Thread Jack Krupansky (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733547#comment-13733547
]

Jack Krupansky commented on SOLR-5124:
--

Try doing the update with the extractOnly=true parameter and look at the actual
byte codes where the two adjacent terms meet - it may be some odd Unicode value
that Solr filters ignore rather than treat as white space.
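
One way to carry out this check, once the extracted text from an extractOnly=true response is available as a string, is to print the code points around the spot where the two terms meet. The sketch below is an assumption-laden illustration: `BoundaryInspector` and `describe` are hypothetical names, and `Character.isWhitespace` is Java's own definition of whitespace, which may differ from what a given Solr tokenizer or filter treats as a separator.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryInspector {

    /**
     * Describes every code point of text in the char index range [from, to)
     * as "U+XXXX whitespace=true|false".
     */
    static List<String> describe(String text, int from, int to) {
        List<String> out = new ArrayList<>();
        int i = from;
        while (i < to) {
            int cp = text.codePointAt(i);
            out.add(String.format("U+%04X whitespace=%b", cp, Character.isWhitespace(cp)));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a no-break space sits between the two terms.
        String sample = "ends\u00A0here";
        for (String line : describe(sample, 3, 6)) {
            System.out.println(line);
        }
    }
}
```

For the sample string, the middle entry is `U+00A0 whitespace=false`: Java does not count a no-break space as whitespace, which is the kind of character this check is meant to reveal.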
