[jira] [Commented] (SOLR-2480) Text extraction of password protected files

2011-05-02 Thread Shinichiro Abe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027579#comment-13027579
 ] 

Shinichiro Abe commented on SOLR-2480:
--

{quote}
But I think you want Solr to skip the content field because tika cannot extract 
it for some reasons but add meta data fields, right?
{quote}
Yes, I want to post the metadata without the content when the content throws a parse error.
ExtractingDocumentLoader should also be fixed.
The attached patch implements improvement idea (1).
And I think SOLR-445 can resolve improvement idea (2).


 Text extraction of password protected files
 ---

 Key: SOLR-2480
 URL: https://issues.apache.org/jira/browse/SOLR-2480
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 3.1
Reporter: Shinichiro Abe
Priority: Minor
 Attachments: SOLR-2480-idea1.patch


 Proposal:
 Some files are password-protected: PDFs and Office documents in the 2007 and 
 97 formats.
 These files are posted using Solr Cell.
 We do not need to read these files if we do not know their read password.
 So, text may not be extracted from these files.
 My requirement is that these files should be processed normally, without 
 extracting text and without throwing an exception.
 Background:
 Currently, when you post a password-protected file, Solr returns a 500 server error.
 Solr catches the error in ExtractingDocumentLoader and throws a TikaException.
 I use ManifoldCF.
 When the Solr server responds with 500, ManifoldCF decides that the
 document should be retried, because it has no idea what
 happened.
 It then retries the post many times, never having the password.
 In another case, my customer posts files with embedded images.
 Sometimes Solr seems to throw a TikaException of unknown cause.
 He wants to post just the metadata without extracting text, but the exception 
 stops him from posting.
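
To illustrate the retry problem described in the background, here is a minimal sketch. It is not ManifoldCF's actual logic: the class name, the method, and the rule "retry on any 5xx status, stop otherwise" are assumptions made only for this example.

```java
import java.util.function.IntSupplier;

/** Hypothetical illustration of a client that retries on server errors. */
final class RetryIllustration {

    /**
     * Posts a document until the server stops answering with a 5xx status
     * or until maxAttempts is reached, and returns the number of attempts made.
     * The supplier stands in for "POST the file and return the HTTP status".
     */
    static int attemptsUntilDone(IntSupplier postDocument, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            int status = postDocument.getAsInt();
            if (status < 500) {
                break; // Not a server error: the client treats the result as final.
            }
            // 5xx: the client cannot tell a permanent failure (unknown password)
            // from a temporary one, so it tries again.
        }
        return attempts;
    }
}
```

With a server that always answers 500 for an encrypted file, every allowed attempt is used up. With a 200 response the client stops after the first post.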

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2480) Text extraction of password protected files

2011-05-01 Thread Shinichiro Abe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027544#comment-13027544
 ] 

Shinichiro Abe commented on SOLR-2480:
--

There is a similar issue:
https://issues.apache.org/jira/browse/SOLR-445
If the same policy can be applied here, this issue is a duplicate.




[jira] [Commented] (SOLR-2480) Text extraction of password protected files

2011-05-01 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027547#comment-13027547
 ] 

Koji Sekiguchi commented on SOLR-2480:
--

Though I have not yet read all of the comments on SOLR-445, I don't think your 
requirement is the same.
According to the description of SOLR-445, the reporter wants Solr to skip the 
failing <doc/> and continue adding the rest of the <doc/>s in <add>...</add>. 
But I think you want Solr to skip the content *field* because Tika cannot 
extract it for some reason, but still add the metadata fields, right?
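
The difference between the two requirements can be sketched as follows. This is only an illustration: `SkipPolicies`, `Input`, `ContentSource`, and the field name `content` are hypothetical and are not Solr types or Solr configuration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; none of these types exist in Solr. */
final class SkipPolicies {

    /** Stand-in for text extraction, which may fail (for example on an encrypted file). */
    interface ContentSource {
        String extract() throws Exception;
    }

    /** Stand-in for one posted file: its metadata plus a way to extract its text. */
    static final class Input {
        final Map<String, String> metadata;
        final ContentSource content;

        Input(Map<String, String> metadata, ContentSource content) {
            this.metadata = metadata;
            this.content = content;
        }
    }

    /** Document-level skip: a failing document is dropped, the rest are still added. */
    static List<Map<String, String>> skipFailingDocs(List<Input> batch) {
        List<Map<String, String>> indexed = new ArrayList<>();
        for (Input in : batch) {
            try {
                Map<String, String> doc = new LinkedHashMap<>(in.metadata);
                doc.put("content", in.content.extract());
                indexed.add(doc);
            } catch (Exception e) {
                // Skip this document entirely and continue with the next one.
            }
        }
        return indexed;
    }

    /** Field-level skip: a failing document keeps its metadata but has no content field. */
    static List<Map<String, String>> skipFailingContentField(List<Input> batch) {
        List<Map<String, String>> indexed = new ArrayList<>();
        for (Input in : batch) {
            Map<String, String> doc = new LinkedHashMap<>(in.metadata);
            try {
                doc.put("content", in.content.extract());
            } catch (Exception e) {
                // Leave the content field out; the metadata is still indexed.
            }
            indexed.add(doc);
        }
        return indexed;
    }
}
```

For a batch of one readable file and one encrypted file, the first method indexes one document. The second indexes two, and the encrypted one carries metadata only.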




[jira] [Commented] (SOLR-2480) Text extraction of password protected files

2011-05-01 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027553#comment-13027553
 ] 

Koji Sekiguchi commented on SOLR-2480:
--

BTW, I have a similar issue when using the UIMA update processor, as sometimes 
UIMA annotators fail to extract metadata for some reason (e.g. the Alchemy web 
services stop). I'll open a separate ticket for it.




[jira] [Commented] (SOLR-2480) Text extraction of password protected files

2011-04-28 Thread Shinichiro Abe (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026137#comment-13026137
 ] 

Shinichiro Abe commented on SOLR-2480:
--

Improvement ideas:
1, Always ignore TikaException and index only the metadata.
2, Add a new parameter, ignoreTikaException.
If it is true, Solr returns a 200 response; if it is false, Solr throws the 
TikaException.
3, If Solr can catch the internal exception caused by encryption, change the 
return code according to that exception.
If Solr can identify poi.EncryptedDocumentException, 
pdfbox.exceptions.CryptographyException, etc., it returns 200 or another 
response code; for any other exception it throws the TikaException.
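
A minimal sketch of how ideas 2 and 3 could be combined. This is not the attached patch and not Solr code: `ExtractionFailurePolicy` and its methods are hypothetical, and the exception types are matched by class-name suffix (taken from the list above) only so that the example needs no POI or PDFBox dependency.

```java
/** Hypothetical helper; not part of Solr, Tika, or the attached patch. */
final class ExtractionFailurePolicy {

    enum Action { INDEX_METADATA_ONLY, RETHROW }

    // Class-name suffixes taken from idea 3; the list is illustrative, not complete.
    private static final String[] ENCRYPTION_SUFFIXES = {
        "poi.EncryptedDocumentException",
        "pdfbox.exceptions.CryptographyException"
    };

    /** Returns true if the class name ends with one of the known encryption exception names. */
    static boolean isEncryptionClassName(String className) {
        for (String suffix : ENCRYPTION_SUFFIXES) {
            if (className.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    /** Walks the cause chain (bounded, in case of cycles) looking for an encryption error. */
    static boolean causedByEncryption(Throwable error) {
        Throwable t = error;
        for (int depth = 0; t != null && depth < 32; depth++) {
            if (isEncryptionClassName(t.getClass().getName())) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /**
     * Idea 2: when ignoreTikaException is true, always index the metadata only.
     * Idea 3: otherwise do so only when the failure was caused by encryption.
     */
    static Action decide(Throwable error, boolean ignoreTikaException) {
        if (ignoreTikaException || causedByEncryption(error)) {
            return Action.INDEX_METADATA_ONLY;
        }
        return Action.RETHROW;
    }
}
```

A caller would wrap the parse call in a try/catch and pass the caught exception to `decide`. Which response code to return for the metadata-only case is left open here, as it is in idea 3.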




Re: [jira] [Commented] (SOLR-2480) Text extraction of password protected files

2011-04-28 Thread Erick Erickson
Hmmm, I'm not sure whether this fits into SOLR-445 or not. Could you add this
comment to that
patch discussion so we at least look?

Thanks,
Erick

On Thu, Apr 28, 2011 at 2:03 AM, Shinichiro Abe (JIRA) j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/SOLR-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026137#comment-13026137
  ]

 Shinichiro Abe commented on SOLR-2480:
 --

 Improvement ideas:
 1, Always ignore TikaException and index only the metadata.
 2, Add a new parameter, ignoreTikaException.
 If it is true, Solr returns a 200 response; if it is false, Solr throws the 
 TikaException.
 3, If Solr can catch the internal exception caused by encryption, change the 
 return code according to that exception.
 If Solr can identify poi.EncryptedDocumentException, 
 pdfbox.exceptions.CryptographyException, etc., it returns 200 or another 
 response code; for any other exception it throws the TikaException.
