[jira] Closed: (NUTCH-636) Http client plug-in https doesn't work on IBM JRE
[ https://issues.apache.org/jira/browse/NUTCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-636. --- Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki Http client plug-in https doesn't work on IBM JRE - Key: NUTCH-636 URL: https://issues.apache.org/jira/browse/NUTCH-636 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.9.0 Environment: Suse Enterprise Linux SLES 10 SP1 java version 1.5.0 Java(TM) 2 Runtime Environment, Standard Edition (build pxi32dev-20080315 (SR7)) IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Linux x86-32 j9vmxi3223-20080315 (JIT enabled) J9VM - 20080314_17962_lHdSMr JIT - 20080130_0718ifx2_r8 GC - 200802_08) JCL - 20080314 Reporter: Curtis d'Entremont Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: x509.patch I want to crawl my site, which is https, using the protocol-httpclient plug-in. However it throws exceptions each request, something about an unknown algorithm SunX509 for SSL. I don't recall the exact message. I don't have permission to change the JRE on our production server. I had to modify DummyX509TrustManager to hardcode the string to IbmX509 instead of SunX509 in order to work. It would be better if the plug-in could automatically figure out which one to use. At the very least, try the major ones until you don't hit any exception and take that one. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
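The reporter's suggestion ("try the major ones until you don't hit any exception") can be sketched as follows. This is only an illustration using the standard JSSE class javax.net.ssl.TrustManagerFactory; the class name X509AlgorithmPicker and the method pick are hypothetical, and this is not necessarily what x509.patch or the committed fix does.

```java
import java.security.KeyStore;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper, not part of Nutch: picks a trust manager algorithm
// that the running JRE actually provides instead of hardcoding "SunX509".
public class X509AlgorithmPicker {

    /** Returns the first candidate algorithm name the current JRE supports. */
    static String pick(String... candidates) throws NoSuchAlgorithmException {
        for (String name : candidates) {
            try {
                TrustManagerFactory.getInstance(name);
                return name;
            } catch (NoSuchAlgorithmException e) {
                // Not available on this JRE; fall through to the next candidate.
            }
        }
        throw new NoSuchAlgorithmException("No supported trust manager algorithm among the candidates");
    }

    public static void main(String[] args) throws Exception {
        // Ask the JRE for its own default first, then fall back to vendor-specific names.
        String algorithm = pick(TrustManagerFactory.getDefaultAlgorithm(), "SunX509", "IbmX509");
        TrustManagerFactory factory = TrustManagerFactory.getInstance(algorithm);
        factory.init((KeyStore) null); // null selects the JRE's default trust store
        System.out.println("Using trust manager algorithm: " + algorithm);
    }
}
```

Trying TrustManagerFactory.getDefaultAlgorithm() first means that in most cases no vendor-specific name is needed at all; the explicit names serve only as a fallback.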
[jira] Updated: (NUTCH-251) Administration GUI
[ https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-251: Fix Version/s: (was: 1.0.0) 1.1 Administration GUI -- Key: NUTCH-251 URL: https://issues.apache.org/jira/browse/NUTCH-251 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Minor Fix For: 1.1 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, nutch_gui_plugins_v1.zip, nutch_gui_v1.patch Having a web based administration interface would help to make nutch administration and management much more user friendly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-251) Administration GUI
[ https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671121#action_12671121 ] Andrzej Bialecki commented on NUTCH-251: - Move to 1.1 - needs a significant update. Administration GUI -- Key: NUTCH-251 URL: https://issues.apache.org/jira/browse/NUTCH-251 Project: Nutch Issue Type: Improvement Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Minor Fix For: 1.1 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, nutch_gui_plugins_v1.zip, nutch_gui_v1.patch Having a web based administration interface would help to make nutch administration and management much more user friendly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-685) Content-level redirect status lost in ParseSegment
Content-level redirect status lost in ParseSegment -- Key: NUTCH-685 URL: https://issues.apache.org/jira/browse/NUTCH-685 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki When Fetcher runs in parsing mode, content-level redirects (HTML meta tag Refresh) are properly discovered and recorded in crawl_fetch under source URL and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is run as a separate step, the content-level redirection data is used only to add the new (target) URL, but the status of the original URL is not reset to indicate a redirect. Consequently, status of the original URL will be different depending on the way you run Fetcher, whereas it should be the same. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-563) Include custom fields in BasicQueryFilter
[ https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671124#action_12671124 ] Andrzej Bialecki commented on NUTCH-563: - I'd like to include this functionality in 1.0, but the patch doesn't document this in any way. Could you please add a bit of documentation (class-level javadoc, plus a commented-out example in nutch-default.xml)? Thanks. Include custom fields in BasicQueryFilter - Key: NUTCH-563 URL: https://issues.apache.org/jira/browse/NUTCH-563 Project: Nutch Issue Type: New Feature Components: searcher Reporter: julien nioche Priority: Minor Fix For: 1.0.0 Attachments: diff.BasicQueryFilter.dynamicFields.txt This patch allows additional fields to be included in the BasicQueryFilter by specifying runtime parameters. Any parameter whose name matches the regular expression (query\\.basic\\.(.+)\\.boost) will be added to the list of fields used by the BQF, and the specified float value will be used as its boost.
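As a rough illustration of the parameter matching described in NUTCH-563, the sketch below applies the quoted regular expression to a plain map of parameters. It does not use the Nutch Configuration class or the actual patch code; the class BoostParams, the method extractBoosts, and the field names "category" and "author" are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone example; the real patch reads the Nutch configuration instead.
public class BoostParams {

    // The regular expression quoted in the issue description.
    static final Pattern BOOST_KEY = Pattern.compile("query\\.basic\\.(.+)\\.boost");

    /** Maps each field named in a matching parameter to its float boost. */
    static Map<String, Float> extractBoosts(Map<String, String> params) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            Matcher m = BOOST_KEY.matcher(entry.getKey());
            if (m.matches()) {
                // group(1) is the field name between "query.basic." and ".boost"
                boosts.put(m.group(1), Float.parseFloat(entry.getValue()));
            }
        }
        return boosts;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("query.basic.category.boost", "2.5"); // made-up field name
        params.put("query.basic.author.boost", "1.0");   // made-up field name
        params.put("http.agent.name", "test");           // ignored: does not match
        System.out.println(extractBoosts(params));       // prints {category=2.5, author=1.0}
    }
}
```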
[jira] Commented: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671127#action_12671127 ] Andrzej Bialecki commented on NUTCH-469: - This issue was originally scheduled for 1.0, but it's still incomplete. Either we complete it within a week, or we should move it to 1.1. changes to geoPosition plugin to make it work on nutch 0.9 -- Key: NUTCH-469 URL: https://issues.apache.org/jira/browse/NUTCH-469 Project: Nutch Issue Type: Improvement Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Mike Schwartz Fix For: 1.0.0 Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, NUTCH-469-2007-05-09.txt.gz I have modified the geoPosition plugin (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9. (The code was built originally using nutch 0.7.) I'd like to contribute my changes back to the nutch project. I already communicated with the code's author (Matthias Jaekle), and he agrees with my mods. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-261) Multi Language Support
[ https://issues.apache.org/jira/browse/NUTCH-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-261. --- Resolution: Fixed Multi Language Support -- Key: NUTCH-261 URL: https://issues.apache.org/jira/browse/NUTCH-261 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8 Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 1.0.0 Attachments: query-lang.patch Add multi-lingual support in Nutch, as described in http://wiki.apache.org/nutch/MultiLingualSupport The document analysis part is actually implemented, and two analysis plugins (fr and de) are provided for testing (not deployed by default). The query analysis part is missing for a complete multi-lingual support. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-261) Multi Language Support
[ https://issues.apache.org/jira/browse/NUTCH-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671132#action_12671132 ] Andrzej Bialecki commented on NUTCH-261: - It looks like this patch was committed quite a while ago, so I'm closing this issue. If there are some remaining parts that are left over, they should be tracked in a separate issue. Multi Language Support -- Key: NUTCH-261 URL: https://issues.apache.org/jira/browse/NUTCH-261 Project: Nutch Issue Type: New Feature Components: indexer, searcher Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8 Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 1.0.0 Attachments: query-lang.patch Add multi-lingual support in Nutch, as described in http://wiki.apache.org/nutch/MultiLingualSupport The document analysis part is actually implemented, and two analysis plugins (fr and de) are provided for testing (not deployed by default). The query analysis part is missing for a complete multi-lingual support. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-357) crawling simulation
[ https://issues.apache.org/jira/browse/NUTCH-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671133#action_12671133 ] Andrzej Bialecki commented on NUTCH-357: - Closing this issue - the suggested solution seems to address the problem in a sufficient way. crawling simulation --- Key: NUTCH-357 URL: https://issues.apache.org/jira/browse/NUTCH-357 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: protocol-simulation-pluginV1.patch We recently discovered some serious issues related to crawling and scoring. Reproducing these problems is rather difficult: first, it is not polite to re-crawl a set of pages again and again; second, it is difficult to catch the pages that cause a problem. Therefore it would be very useful to have a testbed to simulate crawls where we can control the responses of the web servers. To begin with, simulating very basic situations, such as a page that points to itself, link chains, or internal links, would already be very useful. Later on, simulating crawls against existing data collections like TREC or a webgraph would be much more interesting, for instance to calculate the quality of the Nutch OPIC implementation against the PageRank scores of the webgraph, or to evaluate crawling strategies.
[jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671134#action_12671134 ] Andrzej Bialecki commented on NUTCH-455: - Since LUCENE-252 is still unresolved, and it's not clear which of the proposed solutions should be selected, I'm postponing this issue. dedup on tokenized fields is faulty --- Key: NUTCH-455 URL: https://issues.apache.org/jira/browse/NUTCH-455 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Fix For: 1.1 Attachments: IndexSearcherCacheWarm.patch (From LUCENE-252) Nutch uses several index servers, and the search results from these servers are merged using a dedup field for deleting duplicates. The values of this field are cached by Lucene's FieldCacheImpl. The default is the site field, which is indexed and tokenized. However, for a tokenized field (for example url in Nutch), FieldCacheImpl returns an array of terms rather than an array of field values, so dedup'ing becomes faulty. The current FieldCache implementation does not respect tokenized fields and, as described above, caches only terms. So when we search using url as the dedup field and a Hit is constructed in IndexSearcher, the dedupValue becomes a token of the url (such as www or com) rather than the whole url. This prevents using tokenized fields as the dedup field. I have written a patch for Lucene and attached it to http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the aforementioned issue with tokenized field caching. However, building such a cache for about 1.5M documents takes 20+ secs. The code in IndexSearcher.translateHits() starts with if (dedupField != null) dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField); and the cache is built on the first call of search in IndexSearcher.
Long story short, I have written a patch against IndexSearcher which warms up the caches of the wanted fields (configurable) in the constructor. I think we should vote for LUCENE-252, and then commit the above patch with the latest version of Lucene.
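To make the dedup failure in NUTCH-455 concrete, here is a toy model in plain Java. It does not use Lucene or Nutch: the tokenizer, the class DedupSketch, and the choice to keep the last token per document are all assumptions made purely for illustration. It only shows why a dedup key that is a single token, rather than the whole field value, collapses distinct results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: real Lucene/Nutch tokenization and caching differ in detail.
public class DedupSketch {

    /** Assumed tokenizer: lower-cases and splits on any non-alphanumeric character. */
    static String[] tokenize(String value) {
        return value.toLowerCase().split("[^a-z0-9]+");
    }

    /** Models a cache that holds a single term per document (here: the last token). */
    static String cachedTerm(String value) {
        String[] tokens = tokenize(value);
        return tokens[tokens.length - 1];
    }

    /** Keeps the first hit for each dedup key, in order. */
    static List<String> dedup(List<String> values, boolean tokenized) {
        Set<String> seen = new LinkedHashSet<>();
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            String key = tokenized ? cachedTerm(value) : value;
            if (seen.add(key)) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sites = Arrays.asList("www.apache.org", "lucene.apache.org", "www.apache.org");
        // Whole values as keys: only the true duplicate is removed.
        System.out.println(dedup(sites, false)); // prints [www.apache.org, lucene.apache.org]
        // Single tokens as keys: both sites share the token "org" and collapse into one hit.
        System.out.println(dedup(sites, true));  // prints [www.apache.org]
    }
}
```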
[jira] Updated: (NUTCH-455) dedup on tokenized fields is faulty
[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-455: Fix Version/s: (was: 1.0.0) 1.1 dedup on tokenized fields is faulty --- Key: NUTCH-455 URL: https://issues.apache.org/jira/browse/NUTCH-455 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0 Reporter: Enis Soztutar Fix For: 1.1 Attachments: IndexSearcherCacheWarm.patch (From LUCENE-252) Nutch uses several index servers, and the search results from these servers are merged using a dedup field for deleting duplicates. The values of this field are cached by Lucene's FieldCacheImpl. The default is the site field, which is indexed and tokenized. However, for a tokenized field (for example url in Nutch), FieldCacheImpl returns an array of terms rather than an array of field values, so dedup'ing becomes faulty. The current FieldCache implementation does not respect tokenized fields and, as described above, caches only terms. So when we search using url as the dedup field and a Hit is constructed in IndexSearcher, the dedupValue becomes a token of the url (such as www or com) rather than the whole url. This prevents using tokenized fields as the dedup field. I have written a patch for Lucene and attached it to http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the aforementioned issue with tokenized field caching. However, building such a cache for about 1.5M documents takes 20+ secs. The code in IndexSearcher.translateHits() starts with if (dedupField != null) dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField); and the cache is built on the first call of search in IndexSearcher. Long story short, I have written a patch against IndexSearcher which warms up the caches of the wanted fields (configurable) in the constructor. I think we should vote for LUCENE-252, and then commit the above patch with the latest version of Lucene.
[jira] Commented: (NUTCH-479) Support for OR queries
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671140#action_12671140 ] Andrzej Bialecki commented on NUTCH-479: - The current patch is not sufficient to solve the issue - postponing to 1.1. Support for OR queries -- Key: NUTCH-479 URL: https://issues.apache.org/jira/browse/NUTCH-479 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: or.patch, or.patch There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-479) Support for OR queries
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-479: Fix Version/s: (was: 1.0.0) 1.1 Support for OR queries -- Key: NUTCH-479 URL: https://issues.apache.org/jira/browse/NUTCH-479 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: or.patch, or.patch There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-262) Summary excerpts and highlights problems
[ https://issues.apache.org/jira/browse/NUTCH-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-262. --- Resolution: Incomplete Summary excerpts and highlights problems Key: NUTCH-262 URL: https://issues.apache.org/jira/browse/NUTCH-262 Project: Nutch Issue Type: Sub-task Components: searcher Affects Versions: 0.8 Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 1.0.0 There are some problems selecting and highlighting snippets for the summary when multi-lingual support is used.
[jira] Commented: (NUTCH-262) Summary excerpts and highlights problems
[ https://issues.apache.org/jira/browse/NUTCH-262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671139#action_12671139 ] Andrzej Bialecki commented on NUTCH-262: - There was no progress on this issue, and there is no patch, so I'm closing it. Summary excerpts and highlights problems Key: NUTCH-262 URL: https://issues.apache.org/jira/browse/NUTCH-262 Project: Nutch Issue Type: Sub-task Components: searcher Affects Versions: 0.8 Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 1.0.0 There are some problems selecting and highlighting snippets for the summary when multi-lingual support is used.
[jira] Commented: (NUTCH-636) Http client plug-in https doesn't work on IBM JRE
[ https://issues.apache.org/jira/browse/NUTCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671119#action_12671119 ] Andrzej Bialecki commented on NUTCH-636: - Fixed in rev. 741559. Thank you! Http client plug-in https doesn't work on IBM JRE - Key: NUTCH-636 URL: https://issues.apache.org/jira/browse/NUTCH-636 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.9.0 Environment: Suse Enterprise Linux SLES 10 SP1 java version 1.5.0 Java(TM) 2 Runtime Environment, Standard Edition (build pxi32dev-20080315 (SR7)) IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Linux x86-32 j9vmxi3223-20080315 (JIT enabled) J9VM - 20080314_17962_lHdSMr JIT - 20080130_0718ifx2_r8 GC - 200802_08) JCL - 20080314 Reporter: Curtis d'Entremont Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: x509.patch I want to crawl my site, which is https, using the protocol-httpclient plug-in. However it throws exceptions each request, something about an unknown algorithm SunX509 for SSL. I don't recall the exact message. I don't have permission to change the JRE on our production server. I had to modify DummyX509TrustManager to hardcode the string to IbmX509 instead of SunX509 in order to work. It would be better if the plug-in could automatically figure out which one to use. At the very least, try the major ones until you don't hit any exception and take that one. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-673: Fix Version/s: (was: 1.0.0) 1.1 Upgrade the Carrot2 plug-in to release 3.0 -- Key: NUTCH-673 URL: https://issues.apache.org/jira/browse/NUTCH-673 Project: Nutch Issue Type: Improvement Components: web gui Affects Versions: 0.9.0 Environment: All Nutch deployments. Reporter: Sean Dean Priority: Minor Fix For: 1.1 Release 3.0 of the Carrot2 plug-in came out recently. We currently have version 2.1 in the source tree, and upgrading it to the latest version before the 1.0 release might make sense. Details on the release can be found here: http://project.carrot2.org/release-3.0-notes.html One major change in requirements is that JDK 1.5 is needed, but this is now also required for Hadoop 0.19, so this wouldn't be the only reason for the switch.
[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671130#action_12671130 ] Andrzej Bialecki commented on NUTCH-673: - Moving to 1.1 - needs more work. Upgrade the Carrot2 plug-in to release 3.0 -- Key: NUTCH-673 URL: https://issues.apache.org/jira/browse/NUTCH-673 Project: Nutch Issue Type: Improvement Components: web gui Affects Versions: 0.9.0 Environment: All Nutch deployments. Reporter: Sean Dean Priority: Minor Fix For: 1.1 Release 3.0 of the Carrot2 plug-in came out recently. We currently have version 2.1 in the source tree, and upgrading it to the latest version before the 1.0 release might make sense. Details on the release can be found here: http://project.carrot2.org/release-3.0-notes.html One major change in requirements is that JDK 1.5 is needed, but this is now also required for Hadoop 0.19, so this wouldn't be the only reason for the switch.
[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671117#action_12671117 ] Andrzej Bialecki commented on NUTCH-643: - Fixed in rev. 741558, using CVS HEAD version of PDFBox 0.7.4 from SourceForge. During tests on documents containing images I discovered that it's necessary to add JAI libraries too - this unfortunately increased the size of the plugin. ClassCastException in PdfParser on encrypted PDF with empty password Key: NUTCH-643 URL: https://issues.apache.org/jira/browse/NUTCH-643 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: This problem affects the current trunk too. Reporter: Guillaume Smet Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: parse-pdf-PDFBox_upgrade.diff
Hi, If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password. This behaviour is implemented with the following code:
    if (pdf.isEncrypted()) {
      DocumentEncryption decryptor = new DocumentEncryption(pdf);
      // Just try using the default password and move on
      decryptor.decryptDocument();
    }
It uses a deprecated API and, moreover, there seems to be a bug in PDFBox in this deprecated API (a ClassCastException inside PDFBox), as we get the following error:
2008-08-07 19:15:56,860 WARN parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-08-07 19:15:56,874 WARN fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
Using the new security API, we don't get any error parsing this document and we can get its content:
    if (pdf.isEncrypted()) {
      // Just try using the default password and move on
      pdf.openProtection(new StandardDecryptionMaterial());
    }
I attached the patch fixing this problem: it works perfectly with the above document and gets rid of the deprecated API. Regards, -- Guillaume
[jira] Closed: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-643. --- Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki ClassCastException in PdfParser on encrypted PDF with empty password Key: NUTCH-643 URL: https://issues.apache.org/jira/browse/NUTCH-643 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: This problem affects the current trunk too. Reporter: Guillaume Smet Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: parse-pdf-PDFBox_upgrade.diff
Hi, If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password. This behaviour is implemented with the following code:
    if (pdf.isEncrypted()) {
      DocumentEncryption decryptor = new DocumentEncryption(pdf);
      // Just try using the default password and move on
      decryptor.decryptDocument();
    }
It uses a deprecated API and, moreover, there seems to be a bug in PDFBox in this deprecated API (a ClassCastException inside PDFBox), as we get the following error:
2008-08-07 19:15:56,860 WARN parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-08-07 19:15:56,874 WARN fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
Using the new security API, we don't get any error parsing this document and we can get its content:
    if (pdf.isEncrypted()) {
      // Just try using the default password and move on
      pdf.openProtection(new StandardDecryptionMaterial());
    }
I attached the patch fixing this problem: it works perfectly with the above document and gets rid of the deprecated API. Regards, -- Guillaume
[jira] Closed: (NUTCH-74) French Analyzer Plugin
[ https://issues.apache.org/jira/browse/NUTCH-74?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-74. -- Resolution: Fixed French Analyzer Plugin -- Key: NUTCH-74 URL: https://issues.apache.org/jira/browse/NUTCH-74 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.6, 0.7, 0.8 Environment: Nutch Reporter: Christophe Noel Assignee: Jerome Charron Fix For: 1.0.0 Attachments: analyze-french.zip, analyzers-050705.patch This is a DRAFT of a new plugin for French analysis (all Java files come from the Lucene project sandbox). It includes an ISO Latin-1 accent filter, plural form removal, ... Analyze-french should be used instead of NutchDocumentAnalysis, as described by Jerome Charron in the New Language Identifier project. It should also be used as a query parser in the Nutch searcher. We lack an EXTENSION-POINT to include this kind of plugin in Nutch. Could anyone help me build this new extension point, please?
[jira] Commented: (NUTCH-74) French Analyzer Plugin
[ https://issues.apache.org/jira/browse/NUTCH-74?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671141#action_12671141 ] Andrzej Bialecki commented on NUTCH-74: This was fixed a long time ago as part of NUTCH-261. French Analyzer Plugin -- Key: NUTCH-74 URL: https://issues.apache.org/jira/browse/NUTCH-74 Project: Nutch Issue Type: New Feature Components: indexer Affects Versions: 0.6, 0.7, 0.8 Environment: Nutch Reporter: Christophe Noel Assignee: Jerome Charron Fix For: 1.0.0 Attachments: analyze-french.zip, analyzers-050705.patch This is a DRAFT of a new plugin for French analysis (all Java files come from the Lucene project sandbox). It includes an ISO Latin-1 accent filter, plural form removal, ... Analyze-french should be used instead of NutchDocumentAnalysis, as described by Jerome Charron in the New Language Identifier project. It should also be used as a query parser in the Nutch searcher. We lack an EXTENSION-POINT to include this kind of plugin in Nutch. Could anyone help me build this new extension point, please?
[jira] Commented: (NUTCH-636) Http client plug-in https doesn't work on IBM JRE
[ https://issues.apache.org/jira/browse/NUTCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671406#action_12671406 ] Hudson commented on NUTCH-636: -- Integrated in Nutch-trunk #717 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/717/]) Httpclient plugin https doesn't work on IBM JRE. Http client plug-in https doesn't work on IBM JRE - Key: NUTCH-636 URL: https://issues.apache.org/jira/browse/NUTCH-636 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.9.0 Environment: Suse Enterprise Linux SLES 10 SP1 java version 1.5.0 Java(TM) 2 Runtime Environment, Standard Edition (build pxi32dev-20080315 (SR7)) IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Linux x86-32 j9vmxi3223-20080315 (JIT enabled) J9VM - 20080314_17962_lHdSMr JIT - 20080130_0718ifx2_r8 GC - 200802_08) JCL - 20080314 Reporter: Curtis d'Entremont Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: x509.patch I want to crawl my site, which is https, using the protocol-httpclient plug-in. However it throws exceptions each request, something about an unknown algorithm SunX509 for SSL. I don't recall the exact message. I don't have permission to change the JRE on our production server. I had to modify DummyX509TrustManager to hardcode the string to IbmX509 instead of SunX509 in order to work. It would be better if the plug-in could automatically figure out which one to use. At the very least, try the major ones until you don't hit any exception and take that one. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12671407#action_12671407 ] Hudson commented on NUTCH-643: -- Integrated in Nutch-trunk #717 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/717/]) ClassCastException in PDF parser, upgrade to unofficial PDFBox 0.7.4 ClassCastException in PdfParser on encrypted PDF with empty password Key: NUTCH-643 URL: https://issues.apache.org/jira/browse/NUTCH-643 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: This problem affects the current trunk too. Reporter: Guillaume Smet Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: parse-pdf-PDFBox_upgrade.diff
Hi, If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password. This behaviour is implemented with the following code:
    if (pdf.isEncrypted()) {
      DocumentEncryption decryptor = new DocumentEncryption(pdf);
      // Just try using the default password and move on
      decryptor.decryptDocument();
    }
It uses a deprecated API and, moreover, there seems to be a bug in PDFBox in this deprecated API (a ClassCastException inside PDFBox), as we get the following error:
2008-08-07 19:15:56,860 WARN parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-08-07 19:15:56,874 WARN fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
Using the new security API, we don't get any error parsing this document and we can get its content:
    if (pdf.isEncrypted()) {
      // Just try using the default password and move on
      pdf.openProtection(new StandardDecryptionMaterial());
    }
I attached the patch fixing this problem: it works perfectly with the above document and gets rid of the deprecated API. Regards, -- Guillaume