[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671117#action_12671117 ]

Andrzej Bialecki commented on NUTCH-643:
----------------------------------------

Fixed in rev. 741558, using CVS HEAD version of PDFBox 0.7.4 from SourceForge. During tests on documents containing images I discovered that it's necessary to add the JAI libraries too - this unfortunately increased the size of the plugin.

ClassCastException in PdfParser on encrypted PDF with empty password
--------------------------------------------------------------------

Key: NUTCH-643
URL: https://issues.apache.org/jira/browse/NUTCH-643
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Environment: This problem affects the current trunk too.
Reporter: Guillaume Smet
Assignee: Andrzej Bialecki
Fix For: 1.0.0
Attachments: parse-pdf-PDFBox_upgrade.diff

Hi,

If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password. This behaviour is implemented with the following code:

    if (pdf.isEncrypted()) {
      DocumentEncryption decryptor = new DocumentEncryption(pdf);
      //Just try using the default password and move on
      decryptor.decryptDocument();
    }

It uses a deprecated API, and moreover there seems to be a bug in PDFBox in this deprecated API (we get a ClassCastException in PDFBox), as we have the following error:

    2008-08-07 19:15:56,860 WARN parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
    2008-08-07 19:15:56,862 WARN parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
    2008-08-07 19:15:56,862 WARN parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
    2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
    2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
    2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
    2008-08-07 19:15:56,874 WARN fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption

Using the new security API, we don't have any error parsing this document and we can get its content:

    if (pdf.isEncrypted()) {
      // Just try using the default password and move on
      pdf.openProtection(new StandardDecryptionMaterial());
    }

I attached the patch fixing this problem: it works perfectly with the above document and gets rid of the deprecated API.

Regards,

--
Guillaume

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
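For readers unfamiliar with this failure mode: the exception in the log of the issue description is what an unchecked downcast produces when the object is only an instance of the superclass. The sketch below reproduces that in isolation. It assumes, based only on the class names in the log, that the two types are related as superclass and subclass. `DowncastSketch`, `EncryptionDictionary` and `StandardEncryption` are hypothetical stand-ins, not PDFBox classes, and this is not the PDFBox source.

```java
// Hypothetical stand-ins: these are NOT the PDFBox classes, only an
// illustration of the superclass/subclass relationship assumed above.
public class DowncastSketch {

    /** Stand-in for a generic encryption dictionary. */
    public static class EncryptionDictionary { }

    /** Stand-in for the password-based specialisation. */
    public static class StandardEncryption extends EncryptionDictionary { }

    /** Unchecked downcast: throws ClassCastException for a plain base instance. */
    static StandardEncryption unsafeCast(EncryptionDictionary dict) {
        return (StandardEncryption) dict;
    }

    /** Returns true when the unchecked downcast fails with ClassCastException. */
    static boolean unsafeCastFails(EncryptionDictionary dict) {
        try {
            unsafeCast(dict);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    /** Guarded variant: check the runtime type before relying on the subclass. */
    static String describe(EncryptionDictionary dict) {
        if (dict instanceof StandardEncryption) {
            return "standard";
        }
        return "generic";
    }

    public static void main(String[] args) {
        System.out.println(unsafeCastFails(new EncryptionDictionary())); // prints true
        System.out.println(unsafeCastFails(new StandardEncryption()));   // prints false
        System.out.println(describe(new EncryptionDictionary()));        // prints generic
    }
}
```

An `instanceof` guard like the one in `describe` is one general way to avoid such an exception. The fix discussed in this issue takes a different route and switches to the newer `openProtection` API.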
[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671407#action_12671407 ]

Hudson commented on NUTCH-643:
------------------------------

Integrated in Nutch-trunk #717 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/717/])
    ClassCastException in PDF parser, upgrade to unofficial PDFBox 0.7.4
[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668001#action_12668001 ]

Doğacan Güney commented on NUTCH-643:
-------------------------------------

So... Can we commit this patch and pdfbox? It seems pdfbox is released under a BSD license. Is it compatible with the ASF license?
[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668010#action_12668010 ]

Guillaume Smet commented on NUTCH-643:
--------------------------------------

Hi Doğacan,

The problem isn't the license of PDFBox, which is already included in Nutch. It's more that PDFBox is on its way to becoming an Apache project (it's in the incubator - see http://incubator.apache.org/pdfbox/) and it seems that you can't include a library which is in the incubator.

So you can either wait for PDFBox to be a real Apache project or build a development version of the latest PDFBox tree on sourceforge.net, which is what I did (the problem is fixed in the sf.net tree). But you then have a development version in the Nutch tree and not a stable release: I'm not sure that's acceptable.

It's more a problem of release policy and release rules than a technical or license problem.

--
Guillaume
[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668015#action_12668015 ]

Andrzej Bialecki commented on NUTCH-643:
----------------------------------------

+1. Yes, it's compatible.
[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668018#action_12668018 ]

Andrzej Bialecki commented on NUTCH-643:
----------------------------------------

(sorry Guillaume, missed your comment) - there is an existing precedent in the Nutch source tree, namely the Tika library, which is still incubating. This practice is however frowned upon ;)

I'm ok with using the latest SF.net version of PDFBox built from sources, provided we include a notice about the SVN revision of the library. This is probably better than using the version from the incubator and making the legal situation worse.
[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668021#action_12668021 ]

Doğacan Güney commented on NUTCH-643:
-------------------------------------

Right, we should update Tika to 0.2 (post-incubation) too before releasing 1.0 :) I would actually have done that a while back, but I know nothing about Tika, so I was worried about breaking stuff.
[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623519#action_12623519 ]

Andrzej Bialecki commented on NUTCH-643:
----------------------------------------

AFAIK we can't include libraries from projects undergoing incubation, because their legal status is not fully confirmed by ASF. I think we have to wait until PDFBox comes out from the incubation, or to use the latest non-Apache version (which unfortunately doesn't yet address this problem).