[jira] Closed: (NUTCH-636) Http client plug-in https doesn't work on IBM JRE

2009-02-06 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-636.
---

   Resolution: Fixed
Fix Version/s: 1.0.0
 Assignee: Andrzej Bialecki 

 Http client plug-in https doesn't work on IBM JRE
 -

 Key: NUTCH-636
 URL: https://issues.apache.org/jira/browse/NUTCH-636
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Suse Enterprise Linux SLES 10 SP1
 java version 1.5.0
 Java(TM) 2 Runtime Environment, Standard Edition (build pxi32dev-20080315 
 (SR7))
 IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Linux x86-32 j9vmxi3223-20080315 
 (JIT enabled)
 J9VM - 20080314_17962_lHdSMr
 JIT  - 20080130_0718ifx2_r8
 GC   - 200802_08)
 JCL  - 20080314
Reporter: Curtis d'Entremont
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: x509.patch


 I want to crawl my site, which is served over https, using the protocol-httpclient 
 plug-in. However, it throws an exception on each request, something about an 
 unknown algorithm SunX509 for SSL. I don't recall the exact message. I 
 don't have permission to change the JRE on our production server.
 I had to modify DummyX509TrustManager to hardcode the string to IbmX509 
 instead of SunX509 to get it to work. It would be better if the plug-in 
 could automatically figure out which one to use. At the very least, it could 
 try the major ones until one works without an exception and use that one.
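
 The "try until one works" idea can be sketched as follows. This is a minimal
 sketch, not the content of x509.patch: the class name, the helper methods and
 the fallback list are assumptions made for illustration. It asks the JRE for
 its default trust manager algorithm first and only then tries hardcoded names.

```java
import java.security.KeyStore;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.TrustManagerFactory;

class TrustManagerAlgorithmProbe {

    // Names tried after the JRE default. The list and its order are assumptions.
    static final String[] FALLBACKS = {"SunX509", "IbmX509", "PKIX"};

    /** Returns a factory for the first algorithm name this JRE supports. */
    static TrustManagerFactory createFactory() {
        String[] candidates = new String[FALLBACKS.length + 1];
        candidates[0] = TrustManagerFactory.getDefaultAlgorithm();
        System.arraycopy(FALLBACKS, 0, candidates, 1, FALLBACKS.length);
        for (String name : candidates) {
            try {
                return TrustManagerFactory.getInstance(name);
            } catch (NoSuchAlgorithmException e) {
                // This JRE does not know the name; try the next one.
            }
        }
        throw new IllegalStateException("No known trust manager algorithm is available");
    }

    /** Initializes the factory with the JRE's default trust store and counts its managers. */
    static int defaultTrustManagerCount() {
        TrustManagerFactory factory = createFactory();
        try {
            factory.init((KeyStore) null); // null selects the default trust store
        } catch (KeyStoreException e) {
            throw new IllegalStateException(e);
        }
        return factory.getTrustManagers().length;
    }

    public static void main(String[] args) {
        System.out.println("Algorithm: " + createFactory().getAlgorithm());
        System.out.println("Trust managers: " + defaultTrustManagerCount());
    }
}
```

 Asking TrustManagerFactory.getDefaultAlgorithm() first means no vendor name
 has to be hardcoded in the common case; the fallback list only matters if the
 default name itself cannot be instantiated.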

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-251) Administration GUI

2009-02-06 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-251:


Fix Version/s: (was: 1.0.0)
   1.1

 Administration GUI
 --

 Key: NUTCH-251
 URL: https://issues.apache.org/jira/browse/NUTCH-251
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 1.1

 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, 
 nutch_gui_plugins_v1.zip, nutch_gui_v1.patch


 Having a web-based administration interface would help make Nutch 
 administration and management much more user-friendly.




[jira] Commented: (NUTCH-251) Administration GUI

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671121#action_12671121
 ] 

Andrzej Bialecki  commented on NUTCH-251:
-

Move to 1.1 - needs a significant update.

 Administration GUI
 --

 Key: NUTCH-251
 URL: https://issues.apache.org/jira/browse/NUTCH-251
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 1.1

 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, 
 nutch_gui_plugins_v1.zip, nutch_gui_v1.patch


 Having a web-based administration interface would help make Nutch 
 administration and management much more user-friendly.




[jira] Created: (NUTCH-685) Content-level redirect status lost in ParseSegment

2009-02-06 Thread Andrzej Bialecki (JIRA)
Content-level redirect status lost in ParseSegment
--

 Key: NUTCH-685
 URL: https://issues.apache.org/jira/browse/NUTCH-685
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


When Fetcher runs in parsing mode, content-level redirects (HTML meta tag 
Refresh) are properly discovered and recorded in crawl_fetch under source URL 
and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is run as 
a separate step, the content-level redirection data is used only to add the new 
(target) URL, but the status of the original URL is not reset to indicate a 
redirect. Consequently, status of the original URL will be different depending 
on the way you run Fetcher, whereas it should be the same.
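
For readers unfamiliar with the term, a content-level redirect is one declared 
in the page body, typically as a meta tag with http-equiv="refresh", rather than 
in the HTTP status line. The sketch below only illustrates how the target URL 
can be read from the content attribute of such a tag. It is not Nutch's parser; 
the class name, method name and pattern are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRefreshSketch {

    // Matches a content attribute value such as "0; url=http://example.com/new".
    // Group 1 is the delay in seconds, group 2 is the redirect target.
    static final Pattern CONTENT = Pattern.compile(
        "^\\s*(\\d+)\\s*;\\s*url\\s*=\\s*['\"]?([^'\"\\s]+)['\"]?\\s*$",
        Pattern.CASE_INSENSITIVE);

    /** Returns the redirect target, or null if the value declares no target URL. */
    static String redirectTarget(String content) {
        Matcher m = CONTENT.matcher(content);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(redirectTarget("0; URL=http://example.com/new"));
        System.out.println(redirectTarget("30")); // plain refresh, no redirect
    }
}
```

Whichever step discovers the target (Fetcher in parsing mode or ParseSegment), 
the point of this issue is that the source URL should end up with the same 
redirect status in both cases.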




[jira] Commented: (NUTCH-563) Include custom fields in BasicQueryFilter

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671124#action_12671124
 ] 

Andrzej Bialecki  commented on NUTCH-563:
-

I'd like to include this functionality in 1.0, but the patch doesn't document 
this in any way. Could you please add a bit of documentation (class-level 
javadoc, plus a commented-out example in nutch-default.xml)? Thanks.

 Include custom fields in BasicQueryFilter
 -

 Key: NUTCH-563
 URL: https://issues.apache.org/jira/browse/NUTCH-563
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Reporter: julien nioche
Priority: Minor
 Fix For: 1.0.0

 Attachments: diff.BasicQueryFilter.dynamicFields.txt


 This patch allows additional fields to be included in the BasicQueryFilter by 
 specifying runtime parameters.  Any parameter matching the regular expression 
 (query\\.basic\\.(.+)\\.boost) will be added to the list of fields to be 
 used by the BQF, and the specified float value will be used as the boost.
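
 The matching rule described above can be sketched as follows. This is a 
 minimal sketch and not the attached patch: the class name, the method name, 
 the use of a plain Map in place of the Nutch configuration, and the example 
 parameter names are all assumptions. It also assumes the whole parameter name 
 must match the expression.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoostFieldScan {

    // Same expression as quoted in the issue, written as a Java string literal.
    static final Pattern BOOST_KEY = Pattern.compile("query\\.basic\\.(.+)\\.boost");

    /** Collects field name -> boost for every parameter whose name matches. */
    static Map<String, Float> extractBoosts(Map<String, String> params) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            Matcher m = BOOST_KEY.matcher(e.getKey());
            if (m.matches()) {
                // Group 1 is the field name between "query.basic." and ".boost".
                boosts.put(m.group(1), Float.parseFloat(e.getValue()));
            }
        }
        return boosts;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("query.basic.category.boost", "2.0"); // hypothetical custom field
        params.put("query.basic.phrase.slop", "0");      // hypothetical, does not match
        System.out.println(extractBoosts(params));
    }
}
```

 Because (.+) is greedy, a name such as query.basic.a.b.boost would yield the 
 field name "a.b" under this reading of the expression.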




[jira] Commented: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671127#action_12671127
 ] 

Andrzej Bialecki  commented on NUTCH-469:
-

This issue was originally scheduled for 1.0, but it's still incomplete. Either 
we complete it within a week, or we should move it to 1.1.

 changes to geoPosition plugin to make it work on nutch 0.9
 --

 Key: NUTCH-469
 URL: https://issues.apache.org/jira/browse/NUTCH-469
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Mike Schwartz
 Fix For: 1.0.0

 Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, 
 NUTCH-469-2007-05-09.txt.gz


 I have modified the geoPosition plugin 
 (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9.  (The 
 code was built originally using nutch 0.7.)  I'd like to contribute my 
 changes back to the nutch project.  I already communicated with the code's 
 author (Matthias Jaekle), and he agrees with my mods.




[jira] Closed: (NUTCH-261) Multi Language Support

2009-02-06 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-261.
---

Resolution: Fixed

 Multi Language Support
 --

 Key: NUTCH-261
 URL: https://issues.apache.org/jira/browse/NUTCH-261
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8
Reporter: Jerome Charron
Assignee: Jerome Charron
 Fix For: 1.0.0

 Attachments: query-lang.patch


 Add multi-lingual support in Nutch, as described in 
 http://wiki.apache.org/nutch/MultiLingualSupport
 The document analysis part is actually implemented, and two analysis plugins 
 (fr and de) are provided for testing (not deployed by default).
 The query analysis part needed for complete multi-lingual support is missing.




[jira] Commented: (NUTCH-261) Multi Language Support

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671132#action_12671132
 ] 

Andrzej Bialecki  commented on NUTCH-261:
-

It looks like this patch was committed quite a while ago, so I'm closing this 
issue. If there are some remaining parts that are left over, they should be 
tracked in a separate issue.

 Multi Language Support
 --

 Key: NUTCH-261
 URL: https://issues.apache.org/jira/browse/NUTCH-261
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8
Reporter: Jerome Charron
Assignee: Jerome Charron
 Fix For: 1.0.0

 Attachments: query-lang.patch


 Add multi-lingual support in Nutch, as described in 
 http://wiki.apache.org/nutch/MultiLingualSupport
 The document analysis part is actually implemented, and two analysis plugins 
 (fr and de) are provided for testing (not deployed by default).
 The query analysis part needed for complete multi-lingual support is missing.




[jira] Commented: (NUTCH-357) crawling simulation

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671133#action_12671133
 ] 

Andrzej Bialecki  commented on NUTCH-357:
-

Closing this issue - the suggested solution seems to address the problem in a 
sufficient way.

 crawling simulation
 ---

 Key: NUTCH-357
 URL: https://issues.apache.org/jira/browse/NUTCH-357
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: protocol-simulation-pluginV1.patch


 We recently discovered some serious issues related to crawling and scoring. 
 Reproducing these problems is difficult: first, it is not polite to re-crawl 
 a set of pages again and again; second, it is difficult to catch the page 
 that causes a problem. 
 Therefore it would be very useful to have a testbed to simulate crawls where 
 we can control the responses of web servers. 
 To begin with, simulating very basic situations, such as a page that points 
 to itself, link chains, or internal links, would already be very useful. 
 Later on, however, simulating crawls against existing data collections like 
 TREC or a webgraph would be much more interesting, for instance to calculate 
 the quality of the Nutch OPIC implementation against the PageRank scores of 
 the webgraph, or to evaluate crawling strategies.




[jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671134#action_12671134
 ] 

Andrzej Bialecki  commented on NUTCH-455:
-

Since LUCENE-252 is still unresolved, and it's not clear which of the proposed 
solutions should be selected, I'm postponing this issue.

 dedup on tokenized fields is faulty
 ---

 Key: NUTCH-455
 URL: https://issues.apache.org/jira/browse/NUTCH-455
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Fix For: 1.1

 Attachments: IndexSearcherCacheWarm.patch


 (From LUCENE-252) 
 Nutch uses several index servers, and the search results from these servers 
 are merged using a dedup field for deleting duplicates. The values of 
 this field are cached by Lucene's FieldCacheImpl. The default is the site 
 field, which is indexed and tokenized. However, for a tokenized field (for 
 example url in Nutch), FieldCacheImpl returns an array of terms rather than 
 an array of field values, so dedup'ing becomes faulty. The current FieldCache 
 implementation does not respect tokenized fields and, as described above, 
 caches only terms. 
 So when we search using url as the dedup field and a Hit is constructed in 
 IndexSearcher, the dedupValue becomes a token of the url (such as www or 
 com) rather than the whole url. This prevents using tokenized fields as the 
 dedup field. 
 I have written a patch for Lucene and attached it to 
 http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the 
 aforementioned issue with tokenized field caching. However, building such a 
 cache for about 1.5M documents takes 20+ secs. The code in 
 IndexSearcher.translateHits() starts with
 if (dedupField != null) 
   dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
 and the cache is built on the first call to search in IndexSearcher. 
 Long story short, I have written a patch against IndexSearcher which warms 
 up the caches of the wanted fields (configurable) in the constructor. I 
 think we should vote for LUCENE-252, and then commit the above patch with 
 the latest version of Lucene.
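
 The failure mode can be reproduced with a toy model in plain Java. This is a 
 simplified sketch, not Lucene's code: the class, the methods and the 
 tokenization rule are assumptions. It assumes the cache is filled by walking 
 the field's terms in sorted order and writing each term into the slot of every 
 document that contains it, so for a tokenized field the last term wins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TokenizedDedupModel {

    /** Toy inverted index for one field: term -> ids of the documents containing it. */
    static TreeMap<String, List<Integer>> index(String[] docs, boolean tokenized) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < docs.length; doc++) {
            String[] terms = tokenized
                ? docs[doc].split("[^A-Za-z0-9]+")   // crude stand-in for an analyzer
                : new String[] {docs[doc]};          // untokenized: one term per document
            for (String term : terms) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
                }
            }
        }
        return postings;
    }

    /** One cached value per document, filled by walking the terms in sorted order. */
    static String[] cacheValues(TreeMap<String, List<Integer>> postings, int maxDoc) {
        String[] values = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                values[doc] = e.getKey(); // a later term overwrites an earlier one
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String[] urls = {"http://www.example.com/a", "http://www.example.org/b"};
        System.out.println(Arrays.toString(cacheValues(index(urls, true), urls.length)));
        System.out.println(Arrays.toString(cacheValues(index(urls, false), urls.length)));
    }
}
```

 In this model both documents end up with the same dedup value for the 
 tokenized field even though their urls differ, so one of them would be 
 dropped as a duplicate. With the untokenized field each document keeps its 
 whole url.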




[jira] Updated: (NUTCH-455) dedup on tokenized fields is faulty

2009-02-06 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-455:


Fix Version/s: (was: 1.0.0)
   1.1

 dedup on tokenized fields is faulty
 ---

 Key: NUTCH-455
 URL: https://issues.apache.org/jira/browse/NUTCH-455
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Fix For: 1.1

 Attachments: IndexSearcherCacheWarm.patch


 (From LUCENE-252) 
 Nutch uses several index servers, and the search results from these servers 
 are merged using a dedup field for deleting duplicates. The values of 
 this field are cached by Lucene's FieldCacheImpl. The default is the site 
 field, which is indexed and tokenized. However, for a tokenized field (for 
 example url in Nutch), FieldCacheImpl returns an array of terms rather than 
 an array of field values, so dedup'ing becomes faulty. The current FieldCache 
 implementation does not respect tokenized fields and, as described above, 
 caches only terms. 
 So when we search using url as the dedup field and a Hit is constructed in 
 IndexSearcher, the dedupValue becomes a token of the url (such as www or 
 com) rather than the whole url. This prevents using tokenized fields as the 
 dedup field. 
 I have written a patch for Lucene and attached it to 
 http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the 
 aforementioned issue with tokenized field caching. However, building such a 
 cache for about 1.5M documents takes 20+ secs. The code in 
 IndexSearcher.translateHits() starts with
 if (dedupField != null) 
   dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
 and the cache is built on the first call to search in IndexSearcher. 
 Long story short, I have written a patch against IndexSearcher which warms 
 up the caches of the wanted fields (configurable) in the constructor. I 
 think we should vote for LUCENE-252, and then commit the above patch with 
 the latest version of Lucene.




[jira] Commented: (NUTCH-479) Support for OR queries

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671140#action_12671140
 ] 

Andrzej Bialecki  commented on NUTCH-479:
-

The current patch is not sufficient to solve the issue - postponing to 1.1.

 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: or.patch, or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.




[jira] Updated: (NUTCH-479) Support for OR queries

2009-02-06 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-479:


Fix Version/s: (was: 1.0.0)
   1.1

 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: or.patch, or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.




[jira] Closed: (NUTCH-262) Summary excerpts and highlights problems

2009-02-06 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-262.
---

Resolution: Incomplete

 Summary excerpts and highlights problems
 

 Key: NUTCH-262
 URL: https://issues.apache.org/jira/browse/NUTCH-262
 Project: Nutch
  Issue Type: Sub-task
  Components: searcher
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Jerome Charron
 Fix For: 1.0.0


 There are some problems selecting and highlighting snippets for the summary when 
 multi-lingual support is used.




[jira] Commented: (NUTCH-262) Summary excerpts and highlights problems

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671139#action_12671139
 ] 

Andrzej Bialecki  commented on NUTCH-262:
-

There was no progress on this issue, and there is no patch, so I'm closing it.

 Summary excerpts and highlights problems
 

 Key: NUTCH-262
 URL: https://issues.apache.org/jira/browse/NUTCH-262
 Project: Nutch
  Issue Type: Sub-task
  Components: searcher
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Jerome Charron
 Fix For: 1.0.0


 There are some problems selecting and highlighting snippets for the summary when 
 multi-lingual support is used.




[jira] Commented: (NUTCH-636) Http client plug-in https doesn't work on IBM JRE

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671119#action_12671119
 ] 

Andrzej Bialecki  commented on NUTCH-636:
-

Fixed in rev. 741559. Thank you!

 Http client plug-in https doesn't work on IBM JRE
 -

 Key: NUTCH-636
 URL: https://issues.apache.org/jira/browse/NUTCH-636
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Suse Enterprise Linux SLES 10 SP1
 java version 1.5.0
 Java(TM) 2 Runtime Environment, Standard Edition (build pxi32dev-20080315 
 (SR7))
 IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Linux x86-32 j9vmxi3223-20080315 
 (JIT enabled)
 J9VM - 20080314_17962_lHdSMr
 JIT  - 20080130_0718ifx2_r8
 GC   - 200802_08)
 JCL  - 20080314
Reporter: Curtis d'Entremont
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: x509.patch


 I want to crawl my site, which is served over https, using the protocol-httpclient 
 plug-in. However, it throws an exception on each request, something about an 
 unknown algorithm SunX509 for SSL. I don't recall the exact message. I 
 don't have permission to change the JRE on our production server.
 I had to modify DummyX509TrustManager to hardcode the string to IbmX509 
 instead of SunX509 to get it to work. It would be better if the plug-in 
 could automatically figure out which one to use. At the very least, it could 
 try the major ones until one works without an exception and use that one.




[jira] Updated: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2009-02-06 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-673:


Fix Version/s: (was: 1.0.0)
   1.1

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor
 Fix For: 1.1


 Release 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree, and upgrading it to the 
 latest version before the 1.0 release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is that JDK 1.5 must be used, but this is also 
 now required for Hadoop 0.19, so this wouldn't be the only reason for the 
 switch.




[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671130#action_12671130
 ] 

Andrzej Bialecki  commented on NUTCH-673:
-

Moving to 1.1 - needs more work.

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor
 Fix For: 1.1


 Release 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree, and upgrading it to the 
 latest version before the 1.0 release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is that JDK 1.5 must be used, but this is also 
 now required for Hadoop 0.19, so this wouldn't be the only reason for the 
 switch.




[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671117#action_12671117
 ] 

Andrzej Bialecki  commented on NUTCH-643:
-

Fixed in rev. 741558, using CVS HEAD version of PDFBox 0.7.4 from SourceForge. 
During tests on documents containing images I discovered that it's necessary to 
add JAI libraries too - this unfortunately increased the size of the plugin.

 ClassCastException in PdfParser on encrypted PDF with empty password
 

 Key: NUTCH-643
 URL: https://issues.apache.org/jira/browse/NUTCH-643
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: This problem affects the current trunk too.
Reporter: Guillaume Smet
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: parse-pdf-PDFBox_upgrade.diff


 Hi,
 If a PDF document is encrypted with an empty password, the PdfParser should 
 decrypt it using the empty password.
 This behaviour is implemented with the following code:
   if (pdf.isEncrypted()) {
     DocumentEncryption decryptor = new DocumentEncryption(pdf);
     // Just try using the default password and move on
     decryptor.decryptDocument();
   }
 It uses a deprecated API and moreover it seems there is a bug in PDFBox in 
 this deprecated API (we have a ClassCastException in PDFBox) as we have the 
 following error:
 2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
 2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 Using the new security API, we don't have any error parsing this document and 
 we can get its content:
   if (pdf.isEncrypted()) {
     // Just try using the default password and move on
     pdf.openProtection(new StandardDecryptionMaterial());
   }
 I attached the patch fixing this problem: it works perfectly with the above 
 document and gets rid of the deprecated API.
 Regards,
 -- 
 Guillaume




[jira] Closed: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2009-02-06 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-643.
---

   Resolution: Fixed
Fix Version/s: 1.0.0
 Assignee: Andrzej Bialecki 

 ClassCastException in PdfParser on encrypted PDF with empty password
 

 Key: NUTCH-643
 URL: https://issues.apache.org/jira/browse/NUTCH-643
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: This problem affects the current trunk too.
Reporter: Guillaume Smet
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: parse-pdf-PDFBox_upgrade.diff


 Hi,
 If a PDF document is encrypted with an empty password, the PdfParser should 
 decrypt it using the empty password.
 This behaviour is implemented with the following code:
   if (pdf.isEncrypted()) {
     DocumentEncryption decryptor = new DocumentEncryption(pdf);
     // Just try using the default password and move on
     decryptor.decryptDocument();
   }
 It uses a deprecated API and moreover it seems there is a bug in PDFBox in 
 this deprecated API (we have a ClassCastException in PDFBox) as we have the 
 following error:
 2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
 2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 Using the new security API, we don't have any error parsing this document and 
 we can get its content:
   if (pdf.isEncrypted()) {
     // Just try using the default password and move on
     pdf.openProtection(new StandardDecryptionMaterial());
   }
 I attached the patch fixing this problem: it works perfectly with the above 
 document and gets rid of the deprecated API.
 Regards,
 -- 
 Guillaume




[jira] Closed: (NUTCH-74) French Analyzer Plugin

2009-02-06 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-74?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-74.
--

Resolution: Fixed

 French Analyzer Plugin
 --

 Key: NUTCH-74
 URL: https://issues.apache.org/jira/browse/NUTCH-74
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.6, 0.7, 0.8
 Environment: Nutch
Reporter: Christophe Noel
Assignee: Jerome Charron
 Fix For: 1.0.0

 Attachments: analyze-french.zip, analyzers-050705.patch


 This is a DRAFT of a new plugin for French analysis (all Java files come from 
 the Lucene project sandbox)... This includes an ISO LATIN1 accent filter, plural 
 form removal, ...
 Analyze-french should be used instead of NutchDocumentAnalysis as described by 
 Jerome Charron in the New Language Identifier project. It should also be used as 
 a query-parser in the Nutch searcher.
 We are missing an EXTENSION-POINT to include this kind of plugin in Nutch. Could 
 anyone help me build this new Extension Point, please?




[jira] Commented: (NUTCH-74) French Analyzer Plugin

2009-02-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-74?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671141#action_12671141
 ] 

Andrzej Bialecki  commented on NUTCH-74:


This was fixed a long time ago as part of NUTCH-261.

 French Analyzer Plugin
 --

 Key: NUTCH-74
 URL: https://issues.apache.org/jira/browse/NUTCH-74
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.6, 0.7, 0.8
 Environment: Nutch
Reporter: Christophe Noel
Assignee: Jerome Charron
 Fix For: 1.0.0

 Attachments: analyze-french.zip, analyzers-050705.patch


 This is a DRAFT of a new plugin for French analysis (all Java files come from 
 the Lucene project sandbox)... This includes an ISO LATIN1 accent filter, plural 
 form removal, ...
 Analyze-french should be used instead of NutchDocumentAnalysis as described by 
 Jerome Charron in the New Language Identifier project. It should also be used as 
 a query-parser in the Nutch searcher.
 We are missing an EXTENSION-POINT to include this kind of plugin in Nutch. Could 
 anyone help me build this new Extension Point, please?




[jira] Commented: (NUTCH-636) Http client plug-in https doesn't work on IBM JRE

2009-02-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671406#action_12671406
 ] 

Hudson commented on NUTCH-636:
--

Integrated in Nutch-trunk #717 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/717/])
 Httpclient plugin https doesn't work on IBM JRE.


 Http client plug-in https doesn't work on IBM JRE
 -

 Key: NUTCH-636
 URL: https://issues.apache.org/jira/browse/NUTCH-636
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Suse Enterprise Linux SLES 10 SP1
 java version 1.5.0
 Java(TM) 2 Runtime Environment, Standard Edition (build pxi32dev-20080315 
 (SR7))
 IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Linux x86-32 j9vmxi3223-20080315 
 (JIT enabled)
 J9VM - 20080314_17962_lHdSMr
 JIT  - 20080130_0718ifx2_r8
 GC   - 200802_08)
 JCL  - 20080314
Reporter: Curtis d'Entremont
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: x509.patch


 I want to crawl my site, which is https, using the protocol-httpclient 
 plug-in. However, it throws an exception on each request, something about an 
 unknown algorithm SunX509 for SSL. I don't recall the exact message. I 
 don't have permission to change the JRE on our production server.
 I had to modify DummyX509TrustManager to hardcode the string IbmX509 
 instead of SunX509 to get it to work. It would be better if the plug-in 
 could automatically figure out which one to use. At the very least, it could 
 try the major ones until one doesn't throw an exception and use that one.
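
 The "try the major ones" approach described above could look roughly like the 
 sketch below. This is not the content of x509.patch: the class and method names 
 are hypothetical, and only the standard javax.net.ssl API is assumed. It asks 
 the JRE for its own default algorithm first, then falls back to the two names 
 mentioned in this report.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper for illustration; not taken from the Nutch source tree.
final class TrustManagerAlgorithmProbe {

    /** Candidate names: the JRE's configured default, then the names from the report. */
    static String[] candidates() {
        return new String[] {
            TrustManagerFactory.getDefaultAlgorithm(), "SunX509", "IbmX509"
        };
    }

    /** Returns a factory for the first algorithm the installed providers support. */
    static TrustManagerFactory firstAvailable(String... algorithms)
            throws NoSuchAlgorithmException {
        NoSuchAlgorithmException last = null;
        for (String algorithm : algorithms) {
            try {
                return TrustManagerFactory.getInstance(algorithm);
            } catch (NoSuchAlgorithmException e) {
                last = e; // this JRE does not provide it, so try the next name
            }
        }
        if (last == null) {
            throw new NoSuchAlgorithmException("no algorithm names given");
        }
        throw last;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        TrustManagerFactory factory = firstAvailable(candidates());
        System.out.println("Using trust manager algorithm: " + factory.getAlgorithm());
    }
}
```

 Asking for TrustManagerFactory.getDefaultAlgorithm() first means that on most 
 JREs no vendor-specific name is needed at all; the hardcoded names only serve 
 as a fallback.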




[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

2009-02-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671407#action_12671407
 ] 

Hudson commented on NUTCH-643:
--

Integrated in Nutch-trunk #717 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/717/])
 ClassCastException in PDF parser, upgrade to unofficial PDFBox 0.7.4


 ClassCastException in PdfParser on encrypted PDF with empty password
 

 Key: NUTCH-643
 URL: https://issues.apache.org/jira/browse/NUTCH-643
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: This problem affects the current trunk too.
Reporter: Guillaume Smet
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: parse-pdf-PDFBox_upgrade.diff


 Hi,
 If a PDF document is encrypted with an empty password, the PdfParser should 
 decrypt it using the empty password.
 This behaviour is implemented with the following code:
   if (pdf.isEncrypted()) {
     DocumentEncryption decryptor = new DocumentEncryption(pdf);
     // Just try using the default password and move on
     decryptor.decryptDocument();
   }
 It uses a deprecated API, and there also appears to be a bug in PDFBox in 
 that deprecated API: it throws a ClassCastException inside PDFBox, as the 
 following log shows:
 2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
 2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
 Using the new security API, we don't have any error parsing this document and 
 we can get its content:
   if (pdf.isEncrypted()) {
     // Just try using the default password and move on
     pdf.openProtection(new StandardDecryptionMaterial());
   }
 I attached the patch fixing this problem: it works perfectly with the above 
 document and gets rid of the deprecated API.
 Regards,
 -- 
 Guillaume
