[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734463#comment-13734463
 ] 

Christoph Straßer commented on SOLR-5124:
-

@Jack: There is no odd Unicode character involved. (A screenshot of the 
Fiddler raw view of the extractOnly=true response is attached.)
@Uwe: Big thanks for taking care of this issue! :-)

 Solr glues word´s when parsing PDFs under certan circumstances
 --

 Key: SOLR-5124
 URL: https://issues.apache.org/jira/browse/SOLR-5124
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.4
 Environment: Windows 7 (I don't think this is relevant)
Reporter: Christoph Straßer
Priority: Minor
  Labels: tika,text-extraction
 Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png


 For some PDF documents, Solr glues words together at line breaks under 
 certain circumstances (e.g. the last word of line 1 and the first word of 
 line 2 are merged into one word).
 Stand-alone Tika extracts the text correctly. Attached are one sample PDF 
 and screenshots of the Tika output and of the corrupted content indexed by 
 Solr.
 This issue does not occur with all PDF documents. I tried to reproduce it 
 with new Word documents that I converted to PDF in several ways, without 
 success. The attached PDF document has a really weird internal structure, 
 but Tika seems to do its work correctly even with this document.
 Our Solr indices contain a good number of these weird documents, which 
 results in worse suggestions from the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734761#comment-13734761
 ] 

ASF subversion and git services commented on SOLR-5124:
---

Commit 1512296 from [~thetaphi] in branch 'dev/trunk'
[ https://svn.apache.org/r1512296 ]

SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using 
Solr Cell was missing ignorable whitespace, which is inserted by TIKA for 
convenience to support plain text extraction without using the HTML elements. 
This bug resulted in glued words.
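
The effect described in the commit message can be illustrated with a small 
sketch. This is not the Solr Cell code from the commit; it is a Python 
stand-in that uses the standard xml.sax handler interface. The class names, 
the replay() helper, and the event sequence are hypothetical and only mimic 
a parser that reports a line break as ignorable whitespace.

```python
from xml.sax.handler import ContentHandler


class DroppingHandler(ContentHandler):
    """Collects character data only; ignorable whitespace is discarded
    because the inherited ignorableWhitespace() does nothing."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        self.parts.append(content)

    def text(self):
        return "".join(self.parts)


class KeepingHandler(DroppingHandler):
    """Same collector, but treats ignorable whitespace as ordinary text."""

    def ignorableWhitespace(self, whitespace):
        self.characters(whitespace)


def replay(handler):
    """Replay a made-up event sequence for two lines of extracted text."""
    handler.startElement("p", {})
    handler.characters("first line ends with glued")
    handler.ignorableWhitespace("\n")  # line break sent as ignorable whitespace
    handler.characters("words start the second line")
    handler.endElement("p")
    return handler.text()


print(repr(replay(DroppingHandler())))
print(repr(replay(KeepingHandler())))
```

With the first handler the two lines run together into "gluedwords"; with 
the second the line break survives, so a tokenizer would see two words.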




[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734765#comment-13734765
 ] 

ASF subversion and git services commented on SOLR-5124:
---

Commit 1512297 from [~thetaphi] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1512297 ]

Merged revision(s) 1512296 from lucene/dev/trunk:
SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using 
Solr Cell was missing ignorable whitespace, which is inserted by TIKA for 
convenience to support plain text extraction without using the HTML elements. 
This bug resulted in glued words.




[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733312#comment-13733312
 ] 

Uwe Schindler commented on SOLR-5124:
-

I have not looked into DIH's code, but I know that TIKA adds the extra 
whitespace as ignorable whitespace XML data. It might be ignored by the 
extraction content handler when it consumes the SAX events.




[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733321#comment-13733321
 ] 

Christoph Straßer commented on SOLR-5124:
-

Maybe it's related in some way to SOLR-4679 (but I'm not sure; we use the 
ExtractingRequestHandler).




[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733325#comment-13733325
 ] 

Uwe Schindler commented on SOLR-5124:
-

Hi, this is a duplicate of two other issues. SOLR-4679 is the main issue. I 
will close this one as a duplicate.




[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances

2013-08-08 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733547#comment-13733547
 ] 

Jack Krupansky commented on SOLR-5124:
--

Try doing the update with the extractOnly=true parameter and look at the actual 
byte codes where the two adjacent terms meet - it may be some odd Unicode value 
that Solr filters ignore rather than treat as white space.
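
One way to carry out that check, once the extractOnly=true response has been 
saved as text, is to list the code points around the spot where the two 
words meet. The sketch below is only an illustration: the describe() helper 
and the sample string (which places a soft hyphen, U+00AD, between two 
words) are made up and are not taken from the attached PDF.

```python
import unicodedata


def describe(snippet):
    """Return (code point, Unicode name, is_whitespace) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), ch.isspace())
        for ch in snippet
    ]


# Hypothetical extract: an invisible soft hyphen sits between the two words.
sample = "glued\u00adwords"

# Look at the last character of the first word, the character(s) in
# between, and the first character of the second word.
for row in describe(sample[4:7]):
    print(row)
```

If the rows between the two words show a character whose whitespace flag is 
False, a whitespace-based tokenizer would not split there. If nothing at all 
sits between the words, the separator was lost before analysis.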
