[jira] [Updated] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA

[
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Christoph Straßer updated SOLR-5124:

Attachment: 04_Solr.png
03_TikaOutput_GUI_StructuredText.png
03_TikaOutput_GUI_PlainText.png
03_TikaOutput_GUI_MainContent.png
03_TikaOutput.png
02_PDF.png
01_alz_2009_folge11_2009_05_28.pdf

Added a sample PDF, screenshots of the Tika output, and a screenshot of the Solr index.

Solr glues words when parsing PDFs under certain circumstances
--

Key: SOLR-5124
URL: https://issues.apache.org/jira/browse/SOLR-5124
Project: Solr
Issue Type: Bug
Components: update
Affects Versions: 4.4
Environment: Windows 7 (I don't think this is relevant)
Reporter: Christoph Straßer
Priority: Minor
Labels: tika,text-extraction
Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png,
03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png,
03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png

For some PDF documents, Solr glues words together at line breaks under certain
circumstances (e.g. the last word of line 1 and the first word of line 2 are
merged into one word).
Stand-alone Tika extracts the text correctly. Attached are a sample PDF,
screenshots of the Tika output, and a screenshot of the corrupted content
indexed by Solr.
This issue does not occur with all PDF documents. I tried to recreate it with
new Word documents that I converted to PDF in several ways, without success.
The attached PDF has a really weird internal structure, but Tika seems to do
its work right even with this document.
Our Solr indices contain a good number of these weird documents, which results
in worse suggestions from the Suggester.
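
The symptom described above can be illustrated with a small sketch. It does not reproduce Solr's or Tika's actual code path, and the report does not state the cause. It only shows what happens when extracted lines are joined without a separator, plus a rough way to spot glued words by comparing stand-alone extraction output with the indexed content. The sample lines and both helper names (`join_lines`, `glued_tokens`) are made up for illustration.

```python
def join_lines(lines, separator):
    """Join extracted text lines with the given separator."""
    return separator.join(line.strip() for line in lines)


def glued_tokens(reference_text, indexed_text):
    """Return tokens of indexed_text that equal two adjacent
    reference tokens concatenated, and that are not reference
    tokens themselves."""
    ref = reference_text.split()
    ref_set = set(ref)
    # Every possible "glued" form of two neighbouring reference tokens.
    pairs = {a + b for a, b in zip(ref, ref[1:])}
    return [t for t in indexed_text.split() if t in pairs and t not in ref_set]


# Hypothetical sample: two lines as they might appear in a PDF.
lines = ["Solr glues the last", "word of a line"]

reference = join_lines(lines, " ")  # line break treated as whitespace
indexed = join_lines(lines, "")     # line break dropped: words get glued

print(glued_tokens(reference, indexed))
```

Running such a comparison over stand-alone Tika output and the stored Solr field for the same document would give a first estimate of how many documents in an index are affected.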

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA

[
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Christoph Straßer updated SOLR-5124:

Description:
For some PDF documents, Solr glues words together at line breaks under certain
circumstances (e.g. the last word of line 1 and the first word of line 2 are
merged into one word).
Stand-alone Tika extracts the text correctly. Attached are a sample PDF,
screenshots of the Tika output, and a screenshot of the corrupted content
indexed by Solr.
This issue does not occur with all PDF documents. I tried to recreate it with
new Word documents that I converted to PDF in several ways, without success.
The attached PDF has a really weird internal structure, but Tika seems to do
its work right even with this document.
Our Solr indices contain a good number of these weird documents, which results
in worse suggestions from the Suggester.

was:
For some kind of PDF-documents Solr glues words at linebreaks under some
circumstances. (eg the last word of line 1 and the first word of line 2 are
merged into one word)
(Stand-alone-)Tika extracts the text correct. Attached you find one sample-PDF
and screenshots of tika-output and the corrupted content indexed by solr.
(This issue does not occur with all PDF-documents. Tried to recreate the issue
with new word-documents, I converted into PDF on multiple ways without
success.) The attached PDF-document has a real weird internal structure. But
Tika seems to do it´s work right. Even with this weird document.
In our Solr-indices we have a good amount of this wird documents. This results
in worse suggestions by the Suggester.
