[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734463#comment-13734463 ]

Christoph Straßer commented on SOLR-5124:
-----------------------------------------

@Jack: No issue with an odd Unicode character. (Fiddler raw view: screenshot of extractOnly=true attached.)
@Uwe: Big thanks for taking care of this issue! :-)

Solr glues word´s when parsing PDFs under certan circumstances
--------------------------------------------------------------

Key: SOLR-5124
URL: https://issues.apache.org/jira/browse/SOLR-5124
Project: Solr
Issue Type: Bug
Components: update
Affects Versions: 4.4
Environment: Windows 7 (I don't think this is relevant)
Reporter: Christoph Straßer
Priority: Minor
Labels: tika,text-extraction
Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png

For some PDF documents, Solr glues words together at line breaks under some circumstances (e.g. the last word of line 1 and the first word of line 2 are merged into one word).
Stand-alone Tika extracts the text correctly.
Attached you will find one sample PDF and screenshots of the Tika output and of the corrupted content indexed by Solr.
(This issue does not occur with all PDF documents. I tried to recreate the issue with new Word documents that I converted into PDF in multiple ways, without success.)
The attached PDF document has a really weird internal structure, but Tika seems to do its work right even with this weird document.
In our Solr indices we have a good number of these weird documents. This results in worse suggestions by the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734761#comment-13734761 ]

ASF subversion and git services commented on SOLR-5124:
-------------------------------------------------------

Commit 1512296 from [~thetaphi] in branch 'dev/trunk'
[ https://svn.apache.org/r1512296 ]

SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.
[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734765#comment-13734765 ]

ASF subversion and git services commented on SOLR-5124:
-------------------------------------------------------

Commit 1512297 from [~thetaphi] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1512297 ]

Merged revision(s) 1512296 from lucene/dev/trunk:
SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.
[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733312#comment-13733312 ]

Uwe Schindler commented on SOLR-5124:
-------------------------------------

I have not looked into DIH's code, but I know that TIKA adds the extra whitespace as ignorable-whitespace XML data. It might be ignored by the extraction content handler when it consumes the SAX events.
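The mechanism suspected here can be sketched with Python's `xml.sax.handler.ContentHandler`, which exposes the same `characters` and `ignorableWhitespace` callbacks as the Java SAX interface. This is only an illustration, not Solr Cell's actual handler: the class names `DroppingTextHandler` and `ForwardingTextHandler`, the helper `feed_two_lines`, and the hand-fed event stream are all assumptions made up for the example.

```python
from xml.sax.handler import ContentHandler


class DroppingTextHandler(ContentHandler):
    """Collects character data only; the inherited ignorableWhitespace()
    is a no-op, so ignorable whitespace events are silently lost."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        self.parts.append(content)

    def text(self):
        return "".join(self.parts)


class ForwardingTextHandler(DroppingTextHandler):
    """Treats ignorable whitespace like ordinary character data."""

    def ignorableWhitespace(self, whitespace):
        self.characters(whitespace)


def feed_two_lines(handler):
    # Hypothetical stand-in for a parser's event stream: two text runs
    # separated only by a newline reported as ignorable whitespace.
    handler.startDocument()
    handler.characters("last word of line one")
    handler.ignorableWhitespace("\n")
    handler.characters("first word of line two")
    handler.endDocument()
    return handler.text()


glued = feed_two_lines(DroppingTextHandler())
separated = feed_two_lines(ForwardingTextHandler())
print(repr(glued))
print(repr(separated))
```

With the dropping handler the two runs are concatenated directly, which is the "glued words" symptom; forwarding the ignorable whitespace keeps a separator between them.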
[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733321#comment-13733321 ]

Christoph Straßer commented on SOLR-5124:
-----------------------------------------

Maybe it´s in some way related to SOLR-4679. (But not sure; We use the ExtractingRequestHandler)
[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733325#comment-13733325 ]

Uwe Schindler commented on SOLR-5124:
-------------------------------------

Hi, this is a duplicate of 2 other issues. SOLR-4679 is the main issue. I will close this as duplicate.
[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733547#comment-13733547 ]

Jack Krupansky commented on SOLR-5124:
--------------------------------------

Try doing the update with the extractOnly=true parameter and look at the actual byte codes where the two adjacent terms meet - it may be some odd Unicode value that Solr filters ignore rather than treat as white space.
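The check suggested above could be done along these lines once the extractOnly=true response text has been saved. This is a rough sketch, not a Solr tool: the function name `describe_junction` and both sample strings are invented for illustration, and the real extracted text would have to be pasted in instead.

```python
import unicodedata


def describe_junction(text, left, right, context=1):
    """Return (char, code point, Unicode name) for every character from
    `context` characters before the end of `left` up to `context`
    characters into `right`, so anything sitting between the two words
    becomes visible."""
    start = text.find(left)
    if start == -1:
        raise ValueError(f"{left!r} not found")
    end_left = start + len(left)
    start_right = text.find(right, end_left)
    if start_right == -1:
        raise ValueError(f"{right!r} not found after {left!r}")
    lo = max(0, end_left - context)
    hi = min(len(text), start_right + context)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text[lo:hi]
    ]


# Invented samples: one with an invisible separator, one truly glued.
with_invisible = "eins\u200bzwei"
truly_glued = "einszwei"

report_invisible = describe_junction(with_invisible, "eins", "zwei")
report_glued = describe_junction(truly_glued, "eins", "zwei")
for row in report_invisible:
    print(row)
```

If the report shows only the last letter of the first word followed by the first letter of the second, the separator is missing from the extracted text itself rather than being an unusual character that a filter ignores.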