[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671117#action_12671117 ]

Andrzej Bialecki  commented on NUTCH-643:
-

Fixed in rev. 741558, using CVS HEAD version of PDFBox 0.7.4 from SourceForge. 
During tests on documents containing images I discovered that it's necessary to 
add JAI libraries too - this unfortunately increased the size of the plugin.

 ClassCastException in PdfParser on encrypted PDF with empty password

              Key: NUTCH-643
              URL: https://issues.apache.org/jira/browse/NUTCH-643
          Project: Nutch
       Issue Type: Bug
       Components: fetcher
 Affects Versions: 1.0.0
      Environment: This problem affects the current trunk too.
         Reporter: Guillaume Smet
         Assignee: Andrzej Bialecki
          Fix For: 1.0.0

      Attachments: parse-pdf-PDFBox_upgrade.diff


 Hi,
 If a PDF document is encrypted with an empty password, the PdfParser should 
 decrypt it using the empty password.
 This behaviour is implemented with the following code:
   if (pdf.isEncrypted()) {
     DocumentEncryption decryptor = new DocumentEncryption(pdf);
     // Just try using the default password and move on
     decryptor.decryptDocument();
   }
 This code uses a deprecated API. Moreover, there seems to be a bug in PDFBox in 
 this deprecated API: it throws a ClassCastException, as the following log 
 shows:
 2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
 2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
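 The exception message is consistent with an unchecked downcast somewhere in the deprecated code path. The following is only a minimal sketch of that failure mode, not PDFBox code: the two nested classes are hypothetical stand-ins that borrow the names from the trace, and the subclass relationship between them is an assumption that the log does not confirm.

```java
// Hypothetical stand-ins for the two types named in the stack trace.
// These are NOT the real PDFBox classes; the inheritance shown is an assumption.
class CastSketch {
    static class PDEncryptionDictionary { }
    static class PDStandardEncryption extends PDEncryptionDictionary { }

    /** Unchecked downcast: throws ClassCastException when handed the base type. */
    static PDStandardEncryption uncheckedDowncast(PDEncryptionDictionary dict) {
        return (PDStandardEncryption) dict;
    }

    /** Returns true if the downcast fails for the given dictionary object. */
    static boolean downcastFails(PDEncryptionDictionary dict) {
        try {
            uncheckedDowncast(dict);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A plain base-type instance cannot be cast to the subclass.
        System.out.println(downcastFails(new PDEncryptionDictionary())); // prints "true"
        // An instance that really is the subclass casts cleanly.
        System.out.println(downcastFails(new PDStandardEncryption()));   // prints "false"
    }
}
```

 If the deprecated API does something along these lines, the cast would fail whenever the document's encryption dictionary is instantiated as the more general type, which would match the message in the log.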
 Using the new security API, we don't have any error parsing this document and 
 we can get its content:
   if (pdf.isEncrypted()) {
     // Just try using the default password and move on
     pdf.openProtection(new StandardDecryptionMaterial());
   }
 I attached the patch fixing this problem: it works perfectly with the above 
 document and gets rid of the deprecated API.
 Regards,
 -- 
 Guillaume

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2009-02-06 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671407#action_12671407 ]

Hudson commented on NUTCH-643:
--

Integrated in Nutch-trunk #717 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/717/])
 ClassCastException in PDF parser, upgrade to unofficial PDFBox 0.7.4





[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2009-01-28 Thread Doğacan Güney (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668001#action_12668001 ]

Doğacan Güney commented on NUTCH-643:
-

So... can we commit this patch and PDFBox? It seems PDFBox is released under a 
BSD license. Is it compatible with the ASF license?




[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2009-01-28 Thread Guillaume Smet (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668010#action_12668010 ]

Guillaume Smet commented on NUTCH-643:
--

Hi Doğacan,

The problem isn't the license of PDFBox, which is already included in Nutch. 
It's more that PDFBox is on its way to becoming an Apache project (it's in the 
incubator - see http://incubator.apache.org/pdfbox/) and it seems that you 
can't include a library which is in the incubator.

So you can either wait for PDFBox to become a real Apache project, or build a 
development version from the latest PDFBox tree on sourceforge.net, which is 
what I did (the problem is fixed in the sf.net tree). But you then have a 
development version in the Nutch tree rather than a stable release, and I'm 
not sure that's acceptable.

It's more a problem of release policy and release rules than a technical or 
license problem.

-- 
Guillaume




[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2009-01-28 Thread Andrzej Bialecki (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668015#action_12668015 ]

Andrzej Bialecki  commented on NUTCH-643:
-

+1. Yes, it's compatible.




[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2009-01-28 Thread Andrzej Bialecki (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668018#action_12668018 ]

Andrzej Bialecki  commented on NUTCH-643:
-

(sorry Guillaume, missed your comment) - there is an existing precedent in the 
Nutch source tree, namely the Tika library, which is still incubating. This 
practice is however frowned upon ;) I'm ok with using the latest SF.net version 
of PDFBox built from sources, provided we include a notice about the SVN 
revision of the library. This is probably better than using the version from 
the incubator and making the legal situation worse.




[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2009-01-28 Thread Doğacan Güney (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668021#action_12668021 ]

Doğacan Güney commented on NUTCH-643:
-

Right, we should update Tika to 0.2 (post-incubation) too before releasing 1.0 
:) I actually would have done that a while back, but I know nothing about Tika, 
so I was worried about breaking stuff.





[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2008-08-18 Thread Andrzej Bialecki (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623519#action_12623519 ]

Andrzej Bialecki  commented on NUTCH-643:
-

AFAIK we can't include libraries from projects undergoing incubation, because 
their legal status is not fully confirmed by the ASF. I think we have to wait 
until PDFBox comes out of incubation, or use the latest non-Apache version 
(which unfortunately doesn't yet address this problem).
