[jira] Commented: (NUTCH-798) Upgrade to SOLR1.4

2010-03-10 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843546#action_12843546
 ] 

Sami Siren commented on NUTCH-798:
--

+1

 Upgrade to SOLR1.4
 --

 Key: NUTCH-798
 URL: https://issues.apache.org/jira/browse/NUTCH-798
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1


 in particular SOLR1.4 has a StreamingUpdateSolrServer which would simplify 
 the way we buffer the docs before sending them to the SOLR instance 
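
 For context, the kind of manual buffering the description refers to might
 look roughly like the sketch below. The class name BufferedDocSender, the
 batch size and the Consumer sink are hypothetical and are not taken from the
 Nutch or SolrJ sources. StreamingUpdateSolrServer queues documents
 internally, so a wrapper of this sort would presumably no longer be needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of manual batching; not Nutch or SolrJ code.
final class BufferedDocSender<D> {
    private final int batchSize;
    private final Consumer<List<D>> sink; // stands in for "send this batch to Solr"
    private final List<D> buffer = new ArrayList<>();

    BufferedDocSender(int batchSize, Consumer<List<D>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Buffers one document and sends a batch once the buffer is full.
    void add(D doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered; must be called once at the end of indexing.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // hand over a copy, then reuse the buffer
        buffer.clear();
    }
}
```

 The caller has to remember the final flush(), which is one of the details a
 streaming client takes off the indexer's hands.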

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-793) search.jsp compile errors

2010-02-15 Thread Sami Siren (JIRA)
search.jsp compile errors
-

 Key: NUTCH-793
 URL: https://issues.apache.org/jira/browse/NUTCH-793
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1


While committing the recent searcher interface changes I broke search.jsp, 
which currently does not compile.




[jira] Resolved: (NUTCH-793) search.jsp compile errors

2010-02-15 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-793.
--

Resolution: Fixed

committed a fix

 search.jsp compile errors
 -

 Key: NUTCH-793
 URL: https://issues.apache.org/jira/browse/NUTCH-793
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1


 While committing the recent searcher interface changes I broke search.jsp, 
 which currently does not compile.




[jira] Resolved: (NUTCH-788) search.jsp typo causing searches to fail

2010-02-15 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-788.
--

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Sami Siren

Thanks Sammy for the fix, I did not realize you had spotted this too. It's now 
fixed in trunk.

 search.jsp typo causing searches to fail
 

 Key: NUTCH-788
 URL: https://issues.apache.org/jira/browse/NUTCH-788
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 1.1
 Environment: On trunk
Reporter: Sammy Yu
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: 0001-Fix-up-servlet.patch


 Call to initialize the servlet parameter is missing parentheses.




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-02-15 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833714#action_12833714
 ] 

Sami Siren commented on NUTCH-789:
--

It would be really useful to include the improvements in the functionality, 
since that way almost all parsers (all except Flash?) would be covered.

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported in NUTCH-766, Sami has made a few improvements to the Tika 
 parser. We'll track that progress here.




[jira] Created: (NUTCH-790) Some external javadoc links are broken

2010-02-14 Thread Sami Siren (JIRA)
Some external javadoc links are broken
--

 Key: NUTCH-790
 URL: https://issues.apache.org/jira/browse/NUTCH-790
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Trivial


Nutch javadoc links for lucene and hadoop are broken.




[jira] Updated: (NUTCH-790) Some external javadoc links are broken

2010-02-14 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-790:
-

Attachment: NUTCH-790.patch

proposed patch, fixes links for lucene and hadoop, also updates j2se link to 
version 1.6

 Some external javadoc links are broken
 --

 Key: NUTCH-790
 URL: https://issues.apache.org/jira/browse/NUTCH-790
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Trivial
 Attachments: NUTCH-790.patch


 Nutch javadoc links for lucene and hadoop are broken.




[jira] Created: (NUTCH-791) External links for published javadocs are partially broken

2010-02-14 Thread Sami Siren (JIRA)
External links for published javadocs are partially broken
--

 Key: NUTCH-791
 URL: https://issues.apache.org/jira/browse/NUTCH-791
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Reporter: Sami Siren


Lucene and Hadoop links point to nonexistent URLs. For some versions of the 
apidocs the links are broken, and for some they do not exist at all. 
Basically, the javadocs need to be regenerated with proper 
URLs for external packages.




[jira] Resolved: (NUTCH-790) Some external javadoc links are broken

2010-02-14 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-790.
--

   Resolution: Fixed
Fix Version/s: 1.1

committed

 Some external javadoc links are broken
 --

 Key: NUTCH-790
 URL: https://issues.apache.org/jira/browse/NUTCH-790
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Trivial
 Fix For: 1.1

 Attachments: NUTCH-790.patch


 Nutch javadoc links for lucene and hadoop are broken.




[jira] Updated: (NUTCH-792) Nutch version still contains 1.0

2010-02-14 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-792:
-

Attachment: NUTCH-792.patch

bump version to 1.1-dev

 Nutch version still contains 1.0
 

 Key: NUTCH-792
 URL: https://issues.apache.org/jira/browse/NUTCH-792
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
 Attachments: NUTCH-792.patch


 Should be 1.1-dev now in trunk.




[jira] Created: (NUTCH-792) Nutch version still contains 1.0

2010-02-14 Thread Sami Siren (JIRA)
Nutch version still contains 1.0


 Key: NUTCH-792
 URL: https://issues.apache.org/jira/browse/NUTCH-792
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
 Attachments: NUTCH-792.patch

Should be 1.1-dev now in trunk.




[jira] Resolved: (NUTCH-792) Nutch version still contains 1.0

2010-02-14 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-792.
--

Resolution: Fixed

committed

 Nutch version still contains 1.0
 

 Key: NUTCH-792
 URL: https://issues.apache.org/jira/browse/NUTCH-792
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
 Attachments: NUTCH-792.patch


 Should be 1.1-dev now in trunk.




[jira] Commented: (NUTCH-766) Tika parser

2010-02-10 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832406#action_12832406
 ] 

Sami Siren commented on NUTCH-766:
--

I suggest that we take this a bit further: currently this patch does not use 
Tika for package formats or for HTML.

Julien: was there a reason not to use AutoDetectParser? The only thing that I 
could come up with was that the mime type detection would be done twice. We 
could get around this by implementing something similar to what the composite 
parser does (it uses a parser class (AutoDetectParser) from the context to do 
further parsing) to cover all supported package formats.

Also, was there a reason not to parse HTML with Tika?

I have a patch nearby to demonstrate some of the improvements that I will try 
to post shortly.

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
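
 The selection rule described in the paragraph above can be sketched as
 follows. This is only an illustration under stated assumptions: the class
 ParserSelector and the plugin ids used with it are hypothetical and do not
 mirror the actual ParserFactory code in the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names do not mirror Nutch's ParserFactory.
final class ParserSelector {
    static final String WILDCARD = "*";

    // mime type (or "*") -> plugin id, as a parse-plugins.xml style mapping would provide
    private final Map<String, String> mimeToPlugin = new HashMap<>();

    void register(String mimeType, String pluginId) {
        mimeToPlugin.put(mimeType, pluginId);
    }

    // An explicit mapping wins; otherwise fall back to the wildcard plugin.
    // Returns null when neither an explicit nor a wildcard mapping exists.
    String select(String mimeType) {
        String explicit = mimeToPlugin.get(mimeType);
        return explicit != null ? explicit : mimeToPlugin.get(WILDCARD);
    }
}
```

 With a wildcard entry registered for the Tika plugin, an explicit entry is
 only needed for mime types that should go to a different parser.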
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser i.e. link detection,  metatag handling but also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and make the tika parser 
 depend on it.
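
 As a rough illustration of turning SAX events into a DOM tree, the sketch
 below uses only the JAXP classes that ship with the JDK. It is not the code
 from the patch: the class SaxToDomSketch is hypothetical, and the SAX events
 are written by hand where the plugin would receive them from a Tika parser.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch; the events below stand in for what a parser would emit.
final class SaxToDomSketch {
    // Replays a few SAX events into a DOM Document via an identity TransformerHandler.
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.setResult(new DOMResult(doc)); // SAX events end up as DOM nodes in doc

            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "href", "href", "CDATA", "http://example.com/");
            char[] text = "a link".toCharArray();

            handler.startDocument();
            handler.startElement("", "a", "a", attrs);
            handler.characters(text, 0, text.length);
            handler.endElement("", "a", "a");
            handler.endDocument();
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not build DOM from SAX events", e);
        }
    }
}
```

 Once the events are in a DOM Document, DOM-based utilities such as link
 extraction can walk it regardless of the original document format.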
 The following libraries are required in the lib/ directory of the 
 tika-parser:
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed 

[jira] Updated: (NUTCH-766) Tika parser

2010-02-10 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-766:
-

Attachment: NutchTikaConfig.java

Extended TikaConfig that is able to load parsers and can be used with existing 
Tika classes. The call to super() cannot load the parsers, so the config is 
then processed again locally. This is a hack and hopefully at some point we can 
drop the class altogether.

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
 sample.tar.gz


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser i.e. link detection,  metatag handling but also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and make the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the 
 tika-parser:
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 http://www.digitalpebble.com




[jira] Updated: (NUTCH-766) Tika parser

2010-02-10 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-766:
-

Attachment: TikaParser.java

Modified parser that can process package formats too. To get rid of the mime 
type detection happening twice we have to extend AutoDetectParser so that it 
skips the initial detection but still does the detection for the rest of the 
content (in package formats).

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
 sample.tar.gz, TikaParser.java






[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-02-05 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830053#action_12830053
 ] 

Sami Siren commented on NUTCH-673:
--

{quote}
Any plans or reasons not to upgrade to Lucene 3.0?
{quote}

I see no reason to stick with 2.9

{quote}
I can prepare a patch replacing Lucene 2.9 with Lucene 3.0 (as a separate 
issue).
{quote}

+1

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor
 Fix For: 1.1


 Release 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree and upgrading it to the 
 latest version before 1.0-release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is for JDK 1.5 to be used, but this is also 
 now required for Hadoop 0.19 so this wouldn't be the only reason for the 
 switch.




[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-02 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828561#action_12828561
 ] 

Sami Siren commented on NUTCH-781:
--

{quote}
the version we had was the same as the one provided by Tika 0.4 so I suppose we 
could safely rely on the Tika defaults. MimeUtil currently requires 
tika-mimetypes.xml to be available on the classpath but we could modify 
that so that it uses the default version from the tika jar if nothing can be 
found in conf. Let's put that in a separate JIRA issue if we really want it; in 
the meantime I'll commit the v 0.6 of tika-mimetypes.xml
{quote}

ok. thanks.
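
The fallback proposed in the quote (use the file from conf if present, 
otherwise the default bundled in the Tika jar) could be sketched as below. 
This is only an illustration: ResourceFallback and firstAvailable are 
hypothetical names, and the lookup is passed in as a function so the sketch 
does not assume how MimeUtil actually loads its configuration.

```java
import java.util.function.Function;

// Hypothetical helper; not part of Nutch's MimeUtil.
final class ResourceFallback {
    // Tries each name in order and returns the first non-null lookup result,
    // or null when none of the names can be resolved.
    static <T> T firstAvailable(Function<String, T> lookup, String... names) {
        for (String name : names) {
            T found = lookup.apply(name);
            if (found != null) {
                return found;
            }
        }
        return null;
    }
}
```

With a ClassLoader cl, a call along the lines of 
firstAvailable(cl::getResourceAsStream, customName, bundledName) would prefer 
the custom file and fall back to the bundled one; both resource names are 
left as placeholders here.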

 Update Tika to v0.6  for the MimeType detection
 ---

 Key: NUTCH-781
 URL: https://issues.apache.org/jira/browse/NUTCH-781
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 [from announcement]
 Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
 extracting metadata and structured text content from various documents using
 existing parser libraries.
 Apache Tika 0.6 contains a number of improvements and bug fixes. Details can
 be found in the changes file:
 http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt




[jira] Resolved: (NUTCH-775) Enhance Searcher interface

2010-02-01 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-775.
--

Resolution: Fixed

I committed this

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: NUTCH-775.patch


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.
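
 To make the proposal above more concrete, a context-based search method
 might be used roughly as in the sketch below. All types here are
 placeholders: SketchQuery, SketchHits and SketchContext only stand in for
 Query, Hits and Metadata, and the keys "numHits" and "sortField" are
 hypothetical, not names taken from the patch.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Placeholder types; they only stand in for Nutch's Query, Hits and Metadata.
final class SketchQuery {
    final String text;
    SketchQuery(String text) { this.text = text; }
}

final class SketchHits {
    final int limit;
    final String sortField; // null means "no explicit sort field"
    SketchHits(int limit, String sortField) {
        this.limit = limit;
        this.sortField = sortField;
    }
}

final class SketchContext {
    private final Map<String, String> values = new HashMap<>();
    SketchContext set(String key, String value) { values.put(key, value); return this; }
    String get(String key, String fallback) { return values.getOrDefault(key, fallback); }
}

// New options become new context keys; the method signature stays the same.
interface SketchSearcher {
    SketchHits search(SketchQuery query, SketchContext context) throws IOException;
}

final class DefaultingSearcher implements SketchSearcher {
    @Override
    public SketchHits search(SketchQuery query, SketchContext context) {
        // Options that are not set fall back to defaults chosen by the searcher.
        int limit = Integer.parseInt(context.get("numHits", "10"));
        String sortField = context.get("sortField", null);
        return new SketchHits(limit, sortField);
    }
}
```

 The trade-off is that option names and types are no longer checked by the
 compiler, so the supported keys would need to be documented.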




[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828275#action_12828275
 ] 

Sami Siren commented on NUTCH-781:
--

Did you forget to update conf/tika-mimetypes.xml?

Related question: do we actually need our own version of the Tika config 
anymore? I saw there were some old issues that were fixed in the custom version 
but I would guess those changes, if important, have already made their way into 
Tika?



 Update Tika to v0.6  for the MimeType detection
 ---

 Key: NUTCH-781
 URL: https://issues.apache.org/jira/browse/NUTCH-781
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 [from announcement]
 Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
 extracting metadata and structured text content from various documents using
 existing parser libraries.
 Apache Tika 0.6 contains a number of improvements and bug fixes. Details can
 be found in the changes file:
 http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt




[jira] Commented: (NUTCH-775) Enhance Searcher interface

2010-01-28 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806019#action_12806019
 ] 

Sami Siren commented on NUTCH-775:
--

If there are no objections I'll commit the proposed patch within a few days.

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: NUTCH-775.patch


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.




[jira] Commented: (NUTCH-775) Enhance Searcher interface

2010-01-28 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806051#action_12806051
 ] 

Sami Siren commented on NUTCH-775:
--

{quote}IMHO this could go as it is ... one suggestion though: this 
Query/QueryContext now resembles SolrQuery/SolrParams. Perhaps we could rename 
QueryContext to QueryParams?
{quote}
That sounds reasonable, I will change the name before committing. I also forgot 
to change the web GUI to use the new API; I will do that too.

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: NUTCH-775.patch


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.




[jira] Commented: (NUTCH-766) Tika parser

2010-01-27 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805661#action_12805661
 ] 

Sami Siren commented on NUTCH-766:
--

{quote}
Sure, it's more of a configuration backwards-compat issue. For those folks who 
have gone to the trouble of customizing their nutch configuration 
(nutch-site.xml, or nutch-default.xml, or even parse-plugins), to remove the 
parsing plugins (e.g., basically say they don't exist anymore and update your 
deployed configuration to use the tika-plugin), this patch would require a 
configuration update in their deployed environments. Because of that, why don't 
we ease them into that upgrade with at least one released version before the 
plugins go away. It would make it easier from a configuration backwards-compat 
perspective.
{quote}

Ok, so you mean that we need to have duplicate parser plugins because we don't 
want to ask people already using nutch to reconfigure the bits this involves 
now even though we have to do it later? How is postponing going to ease the 
task they need to do anyway at some point? I still don't understand the (longer 
term) benefit.

I am not strongly against the idea of keeping duplicate plugins, I mean it's 
just another ~20M in the .job, what I am worried about is that the history will 
repeat itself and we will end up having one more case of duplicate components 
(in this case many of them) doing the same work and no interest in cleaning up 
afterwards. Doing it the way I suggested would guarantee that this will not 
happen.


 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling, which also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the tika-parser 
 : 
   library name=asm-3.1.jar/
   library name=bcmail-jdk15-144.jar/
   library name=commons-compress-1.0.jar/
   library name=commons-logging-1.1.1.jar/
   library name=dom4j-1.6.1.jar/
   library name=fontbox-0.8.0-incubator.jar/
   library name=geronimo-stax-api_1.0_spec-1.0.1.jar/
   library name=hamcrest-core-1.1.jar/
   library name=jce-jdk13-144.jar/
   library name=jempbox-0.8.0-incubator.jar/
   library name=metadata-extractor-2.4.0-beta-1.jar/
   library 

[jira] Commented: (NUTCH-766) Tika parser

2010-01-25 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804448#action_12804448
 ] 

Sami Siren commented on NUTCH-766:
--

+1, I'm going to agree on this one here Julien. Other communities have 
convinced me of the need for backwards compat and unobtrusiveness when 
bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving 
the old plugins (perhaps mentioning they should be deprecated and replaced by 
the Tika functionality) and then removing them in 1.2 or 1.3.

Chris, can you please explain to me how keeping two components doing identical 
work would be more backwards compatible than having only 1? 



 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
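 As a rough illustration of the lookup described above, the sketch below
 returns an explicitly associated parser when one exists for a mime type and
 otherwise falls back to the parser registered under *. This is not the
 actual ParserFactory code: the class ParserSelectionSketch, its methods and
 the parser ids are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Nutch's ParserFactory.
final class ParserSelectionSketch {
  // Parser id used for every mime type without an explicit association.
  private final String wildcardParser;
  // Explicit mime type -> parser id associations (the role parse-plugins.xml plays).
  private final Map<String, String> explicit = new HashMap<>();

  ParserSelectionSketch(String wildcardParser) {
    this.wildcardParser = wildcardParser;
  }

  // Register a parser that should be preferred over the wildcard parser.
  void associate(String mimeType, String parserId) {
    explicit.put(mimeType, parserId);
  }

  // The explicit association wins; otherwise the wildcard parser is used.
  String parserFor(String mimeType) {
    return explicit.getOrDefault(mimeType, wildcardParser);
  }
}
```

 With "tika-parser" as the wildcard parser and a single association of
 text/html to "html-parser" (both ids are illustrative), a PDF would go to
 the wildcard parser while HTML would keep its dedicated parser.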
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling, which also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
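 The SAX-to-DOM conversion mentioned above can be sketched with plain JDK
 classes. This is only an illustration of the general technique, not the code
 in the patch: it feeds SAX events from the JDK's XML parser (standing in for
 a Tika parser) into an identity TransformerHandler that builds a DOM tree.
 The class name SaxToDomSketch and the sample input are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical sketch: collect SAX events into a DOM Document.
final class SaxToDomSketch {
  static Document toDom(String xhtml) {
    try {
      // Empty document that will receive the nodes built from the SAX events.
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

      // An identity TransformerHandler is a ContentHandler that writes to a Result.
      SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
      TransformerHandler handler = tf.newTransformerHandler();
      handler.setResult(new DOMResult(doc));

      // Any SAX event source works here; a Tika parser would play this role.
      SAXParserFactory spf = SAXParserFactory.newInstance();
      spf.setNamespaceAware(true);
      XMLReader reader = spf.newSAXParser().getXMLReader();
      reader.setContentHandler(handler);
      reader.parse(new InputSource(new StringReader(xhtml)));
      return doc;
    } catch (Exception e) {
      throw new IllegalStateException("could not build DOM from SAX events", e);
    }
  }
}
```

 Once the events are in a DOM tree, DOM-based utilities such as link
 detection can walk it, for example via doc.getElementsByTagName("a").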
 The following libraries are required in the lib/ directory of the tika-parser 
 : 
   library name=asm-3.1.jar/
   library name=bcmail-jdk15-144.jar/
   library name=commons-compress-1.0.jar/
   library name=commons-logging-1.1.1.jar/
   library name=dom4j-1.6.1.jar/
   library name=fontbox-0.8.0-incubator.jar/
   library name=geronimo-stax-api_1.0_spec-1.0.1.jar/
   library name=hamcrest-core-1.1.jar/
   library name=jce-jdk13-144.jar/
   library name=jempbox-0.8.0-incubator.jar/
   library name=metadata-extractor-2.4.0-beta-1.jar/
   library name=mockito-core-1.7.jar/
   library name=objenesis-1.0.jar/
   library name=ooxml-schemas-1.0.jar/
   library name=pdfbox-0.8.0-incubating.jar/
   library name=poi-3.5-FINAL.jar/
   library name=poi-ooxml-3.5-FINAL.jar/
   library name=poi-scratchpad-3.5-FINAL.jar/
   library name=tagsoup-1.2.jar/
   library name=tika-parsers-0.5-SNAPSHOT.jar/
   library name=xml-apis-1.0.b2.jar/
   library name=xmlbeans-2.3.0.jar/
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 

[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803664#action_12803664
 ] 

Sami Siren commented on NUTCH-766:
--

I took a brief look into the proposed patch, some comments:

The public API footprint of new classes should be smaller, eg use private, 
package private or protected methods/classes as much as possible.

I think the end result of this plugin should be replacing all Tika supported 
parsers (or the parsers we choose to replace) with the TikaParser and not to 
build parallel ways to parse the same formats. So I think we need to copy all of 
the existing test files and move/adapt the existing testcases fully before 
committing this. That is a good way of seeing that the parse result is what is 
expected and also find out about possible differences with old vs. Tika version.
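
To make the first point concrete, here is a small sketch of a class with a
narrow API footprint. It is not taken from the patch; TitleExtractorSketch and
its methods are invented for illustration, and the class itself is left
package-private only so that the example fits in one file.

```java
// Hypothetical class showing one public entry point and hidden helpers.
final class TitleExtractorSketch {

  // The only method callers outside the package are meant to use.
  public String extractTitle(String text) {
    return collapseWhitespace(firstLine(text));
  }

  // Package-private: reachable from tests in the same package, not from other packages.
  String firstLine(String text) {
    int newline = text.indexOf('\n');
    return newline < 0 ? text : text.substring(0, newline);
  }

  // Private: an implementation detail that can change freely.
  private static String collapseWhitespace(String s) {
    return s.trim().replaceAll("\\s+", " ");
  }
}
```

Keeping helpers non-public means they can be renamed or removed later without
breaking code that depends on the plugin.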


 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling, which also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the tika-parser 
 : 
   library name=asm-3.1.jar/
   library name=bcmail-jdk15-144.jar/
   library name=commons-compress-1.0.jar/
   library name=commons-logging-1.1.1.jar/
   library name=dom4j-1.6.1.jar/
   library name=fontbox-0.8.0-incubator.jar/
   library name=geronimo-stax-api_1.0_spec-1.0.1.jar/
   library name=hamcrest-core-1.1.jar/
   library name=jce-jdk13-144.jar/
   library name=jempbox-0.8.0-incubator.jar/
   library name=metadata-extractor-2.4.0-beta-1.jar/
   library name=mockito-core-1.7.jar/
   library name=objenesis-1.0.jar/
   library name=ooxml-schemas-1.0.jar/
   library name=pdfbox-0.8.0-incubating.jar/
   library name=poi-3.5-FINAL.jar/
   library name=poi-ooxml-3.5-FINAL.jar/
   library name=poi-scratchpad-3.5-FINAL.jar/
   library name=tagsoup-1.2.jar/
   library name=tika-parsers-0.5-SNAPSHOT.jar/
   library name=xml-apis-1.0.b2.jar/
   library name=xmlbeans-2.3.0.jar/
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language 

[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803673#action_12803673
 ] 

Sami Siren commented on NUTCH-766:
--

 Sure, but it would be silly to block the whole Tika plugin because Tika does 
 not support such or such format as well as the original Nutch plugins. As I 
 explained above we can configure which parser to use for which mimetype and 
 use the Tika-plugin by default. Hopefully the Tika implementation will get 
 better and better and there will be no need for keeping the old plugins.

I meant test files for the parsers we replace, not all

 BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the 
 current version of Tika and the existing Nutch parsers

ok, I had missed that one.

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling, which also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the tika-parser 
 : 
   library name=asm-3.1.jar/
   library name=bcmail-jdk15-144.jar/
   library name=commons-compress-1.0.jar/
   library name=commons-logging-1.1.1.jar/
   library name=dom4j-1.6.1.jar/
   library name=fontbox-0.8.0-incubator.jar/
   library name=geronimo-stax-api_1.0_spec-1.0.1.jar/
   library name=hamcrest-core-1.1.jar/
   library name=jce-jdk13-144.jar/
   library name=jempbox-0.8.0-incubator.jar/
   library name=metadata-extractor-2.4.0-beta-1.jar/
   library name=mockito-core-1.7.jar/
   library name=objenesis-1.0.jar/
   library name=ooxml-schemas-1.0.jar/
   library name=pdfbox-0.8.0-incubating.jar/
   library name=poi-3.5-FINAL.jar/
   library name=poi-ooxml-3.5-FINAL.jar/
   library name=poi-scratchpad-3.5-FINAL.jar/
   library name=tagsoup-1.2.jar/
   library name=tika-parsers-0.5-SNAPSHOT.jar/
   library name=xml-apis-1.0.b2.jar/
   library name=xmlbeans-2.3.0.jar/
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your 

[jira] Updated: (NUTCH-775) Enhance Searcher interface

2009-12-30 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-775:
-

Attachment: NUTCH-775.patch

I ended up changing the Query API instead since the changes were smaller from 
API perspective that way.

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: NUTCH-775.patch


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.




[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-12-16 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791829#action_12791829
 ] 

Sami Siren commented on NUTCH-666:
--

We should also consider switching to Tika for language identification and route 
the proposed improvements in that area through Tika?

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai. Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.




[jira] Created: (NUTCH-775) Enhance Searcher interface

2009-12-15 Thread Sami Siren (JIRA)
Enhance Searcher interface
--

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1


Current Searcher interface is too limited for many purposes:

Hits search(Query query, int numHits, String dedupField, String sortField,
  boolean reverse) throws IOException;

It would be nice if we had an interface that allowed adding different 
features without changing the interface. I am proposing that we deprecate the 
current search method and introduce something like:

Hits search(Query query, Metadata context) throws IOException;

Also at the same time we should enhance the QueryFilter interface to look 
something like:

BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
throws QueryException;

I would like to hear your comments before proceeding with a patch.
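
To show what the proposed shape could look like in use, here is a
self-contained sketch. It does not use the real Nutch classes: SketchQuery,
SketchHits and SketchContext are simplified stand-ins for Query, Hits and
Metadata, and the "numHits" key is an assumed convention, not an existing one.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for Query: just the text to look for.
final class SketchQuery {
  final String text;
  SketchQuery(String text) { this.text = text; }
}

// Stand-in for Hits: the matching documents.
final class SketchHits {
  final List<String> ids;
  SketchHits(List<String> ids) { this.ids = ids; }
}

// Stand-in for the Metadata context: free-form key/value options.
final class SketchContext {
  private final Map<String, String> values = new HashMap<>();
  SketchContext set(String key, String value) { values.put(key, value); return this; }
  String get(String key, String fallback) { return values.getOrDefault(key, fallback); }
}

// Proposed shape: options travel in the context, so the signature can stay fixed.
interface SketchSearcher {
  SketchHits search(SketchQuery query, SketchContext context) throws IOException;
}

// Toy implementation over an in-memory list, only to exercise the interface.
final class InMemorySearcher implements SketchSearcher {
  private final List<String> docs;
  InMemorySearcher(List<String> docs) { this.docs = docs; }

  @Override
  public SketchHits search(SketchQuery query, SketchContext context) {
    // "numHits" is read from the context instead of being a method parameter.
    int numHits = Integer.parseInt(context.get("numHits", "10"));
    List<String> matches = new ArrayList<>();
    for (String doc : docs) {
      if (matches.size() >= numHits) break;
      if (doc.contains(query.text)) matches.add(doc);
    }
    return new SketchHits(matches);
  }
}
```

A new option such as a sort field or dedup field would then be one more
context key rather than a new parameter on every Searcher implementation.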




[jira] Resolved: (NUTCH-743) Site search powered by Lucene/Solr

2009-07-02 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-743.
--

Resolution: Fixed

committed

 Site search powered by Lucene/Solr
 --

 Key: NUTCH-743
 URL: https://issues.apache.org/jira/browse/NUTCH-743
 Project: Nutch
  Issue Type: New Feature
  Components: documentation
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Attachments: NUTCH-743.patch


 Replace current Nutch site search with Lucene/Solr powered search hosted by 
 Lucid Imagination (http://www.lucidimagination.com/search).  It allows one to 
 search all of the Nutch (content from other parts of the Lucene ecosystem is 
 also available) content from a single place, including web, wiki, JIRA and 
 mail archives. Lucid has a fault tolerant setup with replication and fail 
 over as well as monitoring services in place. 
 A preview of the site with the new search enabled is available at 
 http://people.apache.org/~siren/site/




[jira] Created: (NUTCH-743) Site search powered by Lucene/Solr

2009-06-23 Thread Sami Siren (JIRA)
Site search powered by Lucene/Solr
--

 Key: NUTCH-743
 URL: https://issues.apache.org/jira/browse/NUTCH-743
 Project: Nutch
  Issue Type: New Feature
  Components: documentation
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor


Replace current Nutch site search with Lucene/Solr powered search hosted by 
Lucid Imagination (http://www.lucidimagination.com/search).  It allows one to 
search all of the Nutch (content from other parts of the Lucene ecosystem is 
also available) content from a single place, including web, wiki, JIRA and mail 
archives. Lucid has a fault tolerant setup with replication and fail over as 
well as monitoring services in place. 

A preview of the site with the new search enabled is available at 
http://people.apache.org/~siren/site/





[jira] Updated: (NUTCH-743) Site search powered by Lucene/Solr

2009-06-23 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-743:
-

Attachment: NUTCH-743.patch

If there are no objections I will commit this within a week or so.

 Site search powered by Lucene/Solr
 --

 Key: NUTCH-743
 URL: https://issues.apache.org/jira/browse/NUTCH-743
 Project: Nutch
  Issue Type: New Feature
  Components: documentation
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Attachments: NUTCH-743.patch


 Replace current Nutch site search with Lucene/Solr powered search hosted by 
 Lucid Imagination (http://www.lucidimagination.com/search).  It allows one to 
 search all of the Nutch (content from other parts of the Lucene ecosystem is 
 also available) content from a single place, including web, wiki, JIRA and 
 mail archives. Lucid has a fault tolerant setup with replication and fail 
 over as well as monitoring services in place. 
 A preview of the site with the new search enabled is available at 
 http://people.apache.org/~siren/site/




[jira] Updated: (NUTCH-730) NPE in LinkRank if no nodes with which to create the WebGraph

2009-03-27 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-730:
-

Fix Version/s: (was: 1.0.0)

 NPE in LinkRank if no nodes with which to create the WebGraph
 -

 Key: NUTCH-730
 URL: https://issues.apache.org/jira/browse/NUTCH-730
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-730-1-20090325.patch


 For LinkRank, if there are no nodes to process, then a NullPointerException 
 is thrown when trying to count number of nodes.
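 The issue text does not say where exactly the exception originates. One
 common way for this kind of failure to arise, shown here purely as an
 assumption, is parsing a count from output that turns out to be empty: the
 read returns null and the next call on it fails. The sketch below guards
 that case; NodeCountSketch and readNodeCount are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch of a null-safe count reader; not the LinkRank code.
final class NodeCountSketch {
  // Returns the count on the first line, or 0 when there is nothing to read.
  static int readNodeCount(String counterOutput) {
    try (BufferedReader reader = new BufferedReader(new StringReader(counterOutput))) {
      String line = reader.readLine();
      // readLine() returns null on empty input; calling a method on it would throw an NPE.
      if (line == null || line.trim().isEmpty()) {
        return 0;
      }
      return Integer.parseInt(line.trim());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```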




[jira] Resolved: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-23 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-722.
--

Resolution: Fixed

removed the jars and added note about this in README.txt

 Nutch contains jars that we cannot redistribute
 ---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


 It seems that we have some jars (as part of pdf parser) that we cannot 
 redistribute.
 Jukka's comment from email:
 
 The release contains the Java Advanced Imaging libraries (jai_core.jar and 
 jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
 redistribute those libraries.
 




[jira] Commented: (NUTCH-728) Improve nutch release packaging

2009-03-20 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683814#action_12683814
 ] 

Sami Siren commented on NUTCH-728:
--

not really, it just happens to be the mirror I use.

 Improve nutch release packaging
 ---

 Key: NUTCH-728
 URL: https://issues.apache.org/jira/browse/NUTCH-728
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
 Attachments: NUTCH-728.patch


 see the discussion from 
 http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact




[jira] Created: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Sami Siren (JIRA)
Nutch contains jars that we cannot redistribute
---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


It seems that we have some jars (as part of pdf parser) that we cannot 
redistribute.

Jukka's comment from email:

The release contains the Java Advanced Imaging libraries (jai_core.jar and 
jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
redistribute those libraries.








[jira] Created: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)
LICENCE.txt is lacking info that should be there


 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren


Jukka's comment from email:

* The LICENSE.txt file should have at least references to the licenses of the 
bundled libraries.




[jira] Created: (NUTCH-725) NOTICE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)
NOTICE.txt is lacking info that should be there
---

 Key: NUTCH-725
 URL: https://issues.apache.org/jira/browse/NUTCH-725
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren


Jukka's comment from email:

* The NOTICE.txt file should start with the following lines:

  Apache Nutch
  Copyright 2009 The Apache Software Foundation

* The NOTICE.txt file should contain the required copyright notices
from all bundled libraries.




[jira] Created: (NUTCH-726) README.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)
README.txt is lacking info that should be there
---

 Key: NUTCH-726
 URL: https://issues.apache.org/jira/browse/NUTCH-726
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren


from Jukka's email:

* The README.txt should start with "Apache Nutch" instead of "Nutch"

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-727) Add KEYS file to release artifact

2009-03-19 Thread Sami Siren (JIRA)
Add KEYS file to release artifact
-

 Key: NUTCH-727
 URL: https://issues.apache.org/jira/browse/NUTCH-727
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren


comment from Grant:

 Where's the KEYS file for Nutch?

 hi,

 the keys file is at the top level nutch directory (eg: 
 http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS)

OK, I think it should be in the tarball, too, at the top 





[jira] Resolved: (NUTCH-726) README.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-726.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

committed

 README.txt is lacking info that should be there
 ---

 Key: NUTCH-726
 URL: https://issues.apache.org/jira/browse/NUTCH-726
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren
 Fix For: 1.0.0


 From Jukka's email:
 * The README.txt should start with Apache Nutch instead of Nutch




[jira] Resolved: (NUTCH-724) Drop the JAI libraries

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-724.
--

Resolution: Duplicate

 Drop the JAI libraries
 --

 Key: NUTCH-724
 URL: https://issues.apache.org/jira/browse/NUTCH-724
 Project: Nutch
  Issue Type: Bug
Reporter: Jukka Zitting
Priority: Blocker
 Fix For: 1.0.0


 The PDF parser plugin contains Java Advanced Imaging (JAI) libraries 
 (jai_core.jar and jai_codec.jar) that are licensed under the Sun Binary Code 
 License. The license is incompatible with Apache policies, so we need to drop 
 those libraries.
 AFAIK (see PDFBOX-381) PDFBox only uses the JAI libraries for handling page 
 rotations and tiff images, so simply dropping the JAI jars shouldn't have too 
 much impact. A better solution would be to switch to using Apache PDFBox that 
 has a proper workaround for this issue, but the first Apache PDFBox release 
 has not yet been made.




[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683482#action_12683482
 ] 

Sami Siren commented on NUTCH-722:
--

+1, I am fine with this solution too

 Nutch contains jars that we cannot redistribute
 ---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


 It seems that we have some jars (as part of pdf parser) that we cannot 
 redistribute.
 Jukka's comment from email:
 
 The release contains the Java Advanced Imaging libraries (jai_core.jar and 
 jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
 redistribute those libraries.
 




[jira] Resolved: (NUTCH-725) NOTICE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-725.
--

Resolution: Fixed

went through the libs and added copyright notices

 NOTICE.txt is lacking info that should be there
 ---

 Key: NUTCH-725
 URL: https://issues.apache.org/jira/browse/NUTCH-725
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The NOTICE.txt file should start with the following lines:
   Apache Nutch
   Copyright 2009 The Apache Software Foundation
 * The NOTICE.txt file should contain the required copyright notices
 from all bundled libraries.




[jira] Resolved: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-723.
--

Resolution: Fixed

added licenses of 4rd party software

 LICENCE.txt is lacking info that should be there
 

 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The LICENSE.txt file should have at least references to the licenses of the 
 bundled libraries.




[jira] Issue Comment Edited: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683618#action_12683618
 ] 

Sami Siren edited comment on NUTCH-723 at 3/19/09 2:11 PM:
---

added licenses of 3rd party software

  was (Author: siren):
added licenses of 4rd party software
  
 LICENCE.txt is lacking info that should be there
 

 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukka's comment from email:
 * The LICENSE.txt file should have at least references to the licenses of the 
 bundled libraries.




[jira] Updated: (NUTCH-728) Improve nutch release packaging

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-728:
-

Attachment: NUTCH-728.patch

Added a simple target to generate the source release tgz from an svn tag.

- did not touch the binary one

 Improve nutch release packaging
 ---

 Key: NUTCH-728
 URL: https://issues.apache.org/jira/browse/NUTCH-728
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
 Attachments: NUTCH-728.patch


 see the discussion from 
 http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact




[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12683634#action_12683634
 ] 

Sami Siren commented on NUTCH-722:
--

If there are no objections, I will commit this change tomorrow morning (EET).

 Nutch contains jars that we cannot redistribute
 ---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


 It seems that we have some jars (as part of pdf parser) that we cannot 
 redistribute.
 Jukka's comment from email:
 
 The release contains the Java Advanced Imaging libraries (jai_core.jar and 
 jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
 redistribute those libraries.
 




[jira] Resolved: (NUTCH-715) Subcollection plugin doesn't work with default subcollections.xml file

2009-03-10 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-715.
--

Resolution: Fixed

committed, thanks Dmitry!

 Subcollection plugin doesn't work with default subcollections.xml file
 --

 Key: NUTCH-715
 URL: https://issues.apache.org/jira/browse/NUTCH-715
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
Assignee: Sami Siren
 Fix For: 1.0.0

 Attachments: NUTCH-715-testcase.patch, 
 NUTCH-715_subcollections_fix.patch


 The subcollection plugin can't parse its configuration file because it contains 
 a top-level comment (the ASF notice) and DomUtil doesn't handle top-level 
 comments.
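
A minimal sketch of this kind of failure, assuming it comes from treating the first top-level DOM node as the root element. The class and method names below are hypothetical, and this is not the actual DomUtil code from Nutch; it only uses the JDK's built-in DOM parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical illustration; not the DomUtil code from Nutch.
final class TopLevelCommentSketch {

  // Parses XML text into a DOM document with the JDK's built-in parser.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse XML", e);
    }
  }

  // Fragile: assumes the first top-level node is the root element.
  static String firstChildName(String xml) {
    Node first = parse(xml).getFirstChild();
    return first.getNodeName();
  }

  // Robust: getDocumentElement() ignores comments that precede the root.
  static String rootName(String xml) {
    return parse(xml).getDocumentElement().getTagName();
  }

  public static void main(String[] args) {
    String xml = "<!-- license notice -->\n"
        + "<subcollections><subcollection/></subcollections>";
    System.out.println(firstChildName(xml)); // the comment node, "#comment"
    System.out.println(rootName(xml));       // the real root, "subcollections"
  }
}
```

With a license header in place, the first child of the document is the comment node, so any code that casts it to an element or reads its children as configuration will fail; asking the document for its element avoids that.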




[jira] Commented: (NUTCH-705) parse-rtf plugin

2009-03-10 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12680411#action_12680411
 ] 

Sami Siren commented on NUTCH-705:
--

I think we should start looking at Apache Tika for most (or all) of our parsers.

 parse-rtf plugin
 

 Key: NUTCH-705
 URL: https://issues.apache.org/jira/browse/NUTCH-705
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
Priority: Minor
 Fix For: 1.1

 Attachments: NUTCH-705.patch


 Demoting this issue and moving to 1.1 - current patch is not suitable due to 
 LGPL licensed parts.




[jira] Created: (NUTCH-717) Make Nutch Solr integration easier

2009-03-10 Thread Sami Siren (JIRA)
Make Nutch Solr integration easier
--

 Key: NUTCH-717
 URL: https://issues.apache.org/jira/browse/NUTCH-717
 Project: Nutch
  Issue Type: New Feature
Reporter: Sami Siren
 Fix For: 1.1


Erik Hatcher proposed that we provide a full Solr config dir to be used with 
Nutch-Solr. Currently we only provide the index schema. It would be considerably 
easier to set up Nutch-Solr if we provided the whole conf dir, which you could 
use with Solr like this:

java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar





[jira] Commented: (NUTCH-711) Indexer failing after upgrade to Hadoop 0.19.1

2009-03-04 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12678691#action_12678691
 ] 

Sami Siren commented on NUTCH-711:
--

+1

 Indexer failing after upgrade to Hadoop 0.19.1
 --

 Key: NUTCH-711
 URL: https://issues.apache.org/jira/browse/NUTCH-711
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
Priority: Blocker
 Fix For: 1.0.0

 Attachments: patch.txt


 After upgrade to Hadoop 0.19.1 Reducer is initialized in a different order 
 than before (see http://svn.apache.org/viewvc?view=rev&revision=736239). 
 IndexingFilters populate current JobConf with field options that are required 
 for IndexerOutputFormat to function properly. However, the filters are 
 instantiated in Reducer.configure(), which is now called after the 
 OutputFormat is initialized, and not before as previously.
 The workaround for now is to instantiate IndexingFilters once again inside 
 IndexerOutputFormat.  This issue should be revisited before 1.1 in order to 
 find a better solution.
 See this thread for more information: 
 http://www.lucidimagination.com/search/document/7c62c625c7ea17fe/problem_with_crawling_using_the_latest_1_0_trunk




[jira] Updated: (NUTCH-700) Neko1.9.11 goes into a loop

2009-03-02 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-700:
-

Fix Version/s: 1.0.0
 Assignee: Sami Siren

This one just bit me - the effect is that parsing hangs forever. I am promoting 
it to be fixed in 1.0.

 Neko1.9.11 goes into a loop
 ---

 Key: NUTCH-700
 URL: https://issues.apache.org/jira/browse/NUTCH-700
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: julien nioche
Assignee: Sami Siren
Priority: Critical
 Fix For: 1.0.0


 Neko1.9.11 goes into a loop on some documents e.g. 
 http://mediacet.com/Archive/FourYorkshiremen/bb/post.htm
 http://cizel.co.kr/main.php
 reverting to 0.9.4 seems to fix the problem
 The approach mentioned in https://issues.apache.org/jira/browse/NUTCH-696 
 could be a way to alleviate similar issues
 PS: haven't had time to report to the Neko people yet, will do at some stage




[jira] Resolved: (NUTCH-700) Neko1.9.11 goes into a loop

2009-03-02 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-700.
--

Resolution: Fixed

reverted to 0.9.4

 Neko1.9.11 goes into a loop
 ---

 Key: NUTCH-700
 URL: https://issues.apache.org/jira/browse/NUTCH-700
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: julien nioche
Assignee: Sami Siren
Priority: Critical
 Fix For: 1.0.0


 Neko1.9.11 goes into a loop on some documents e.g. 
 http://mediacet.com/Archive/FourYorkshiremen/bb/post.htm
 http://cizel.co.kr/main.php
 reverting to 0.9.4 seems to fix the problem
 The approach mentioned in https://issues.apache.org/jira/browse/NUTCH-696 
 could be a way to alleviate similar issues
 PS: haven't had time to report to the Neko people yet, will do at some stage




[jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-02 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-669.
--

Resolution: Fixed

replaced fetcher with fetcher2

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Assignee: Sami Siren
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?
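
A rough sketch of the "queue per crawl host" idea described above, under simplifying assumptions: one fixed crawl delay for every host, and the delay counted from the moment a URL is handed out rather than from the end of the fetch. The class and its methods are hypothetical and are not taken from Fetcher2.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration of per-host queueing; not the Fetcher2 code.
final class PerHostQueueSketch {
  private final long crawlDelayMs;
  // One FIFO queue per host, kept in insertion order of the hosts.
  private final Map<String, Queue<String>> queues = new LinkedHashMap<>();
  // Earliest time (ms) at which each host may be contacted again.
  private final Map<String, Long> nextFetchTime = new HashMap<>();

  PerHostQueueSketch(long crawlDelayMs) {
    this.crawlDelayMs = crawlDelayMs;
  }

  // Files the URL under its host (or under the URL itself if no host parses).
  void add(String url) {
    String host = URI.create(url).getHost();
    String key = (host == null) ? url : host;
    queues.computeIfAbsent(key, k -> new ArrayDeque<>()).add(url);
  }

  // Returns the next URL whose host is ready at nowMs, or null if none is.
  String poll(long nowMs) {
    for (Map.Entry<String, Queue<String>> entry : queues.entrySet()) {
      long readyAt = nextFetchTime.getOrDefault(entry.getKey(), 0L);
      if (!entry.getValue().isEmpty() && nowMs >= readyAt) {
        nextFetchTime.put(entry.getKey(), nowMs + crawlDelayMs);
        return entry.getValue().remove();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    PerHostQueueSketch q = new PerHostQueueSketch(1000);
    q.add("http://a.example/1");
    q.add("http://a.example/2");
    q.add("http://b.example/1");
    System.out.println(q.poll(0));    // first URL of a.example
    System.out.println(q.poll(0));    // a.example is delayed, so b.example
    System.out.println(q.poll(1000)); // a.example is ready again
  }
}
```

Because the politeness delay lives in the queue, the protocol layer does not need to block or sleep per host, which is the delegation of duties the description asks about.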




[jira] Commented: (NUTCH-705) parse-rtf plugin

2009-02-27 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677508#action_12677508
 ] 

Sami Siren commented on NUTCH-705:
--

I think that the patch contains some LGPL code that we cannot commit into 
the Apache repository.

 parse-rtf plugin
 

 Key: NUTCH-705
 URL: https://issues.apache.org/jira/browse/NUTCH-705
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
 Fix For: 1.0.0

 Attachments: NUTCH-705.patch







[jira] Resolved: (NUTCH-699) Add an official solr schema for solr integration

2009-02-26 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-699.
--

Resolution: Fixed

committed

 Add an official solr schema for solr integration
 --

 Key: NUTCH-699
 URL: https://issues.apache.org/jira/browse/NUTCH-699
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 1.0.0


 See Andrzej's comments on NUTCH-684 for more info.




[jira] Assigned: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-02-26 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren reassigned NUTCH-669:


Assignee: Sami Siren

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Assignee: Sami Siren
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Commented: (NUTCH-703) Upgrade to Hadoop 0.19.1

2009-02-26 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677266#action_12677266
 ] 

Sami Siren commented on NUTCH-703:
--

Andrzej, are you working on this now?

 Upgrade to Hadoop 0.19.1
 

 Key: NUTCH-703
 URL: https://issues.apache.org/jira/browse/NUTCH-703
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
Priority: Blocker
 Fix For: 1.0.0


 From release notes: Release 0.19.1 fixes many critical bugs in 0.19.0, 
 including ***some data loss issues***..




[jira] Resolved: (NUTCH-247) robot parser to restrict.

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-247.
--

Resolution: Fixed
  Assignee: Sami Siren  (was: Dennis Kubes)

committed this

- added checking to F2 (which is soon to be Fetcher)



 robot parser to restrict.
 -

 Key: NUTCH-247
 URL: https://issues.apache.org/jira/browse/NUTCH-247
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Stefan Groschupf
Assignee: Sami Siren
Priority: Minor
 Fix For: 1.0.0

 Attachments: agent-names.patch, agent-names3.patch.txt


 If the agent name and the robots agents are not properly configured, the Robot 
 rule parser uses LOG.severe to log the problem, although it also fixes it. 
 Later on, the fetcher thread checks for severe errors and stops if there is one.
 RobotRulesParser:
 if (agents.size() == 0) {
   agents.add(agentName);
   LOG.severe("No agents listed in 'http.robots.agents' property!");
 } else if (!((String)agents.get(0)).equalsIgnoreCase(agentName)) {
   agents.add(0, agentName);
   LOG.severe("Agent we advertise (" + agentName
      + ") not listed first in 'http.robots.agents' property!");
 }
 Fetcher.FetcherThread:
   if (LogFormatter.hasLoggedSevere()) // something bad happened
     break;
 I suggest using warn or something similar instead of severe to log this 
 problem.
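
A minimal sketch of the suggested change, assuming plain java.util.logging in place of Nutch's own logging setup. The class name and the normalizeAgents helper are hypothetical; only the two messages and the list handling follow the snippet quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical illustration; not the RobotRulesParser code from Nutch.
final class RobotAgentsSketch {
  private static final Logger LOG =
      Logger.getLogger(RobotAgentsSketch.class.getName());

  // Returns a copy of the configured agents with agentName guaranteed first.
  // Misconfiguration is repaired and logged as a warning, not as severe,
  // so a fetcher that stops on severe log entries keeps running.
  static List<String> normalizeAgents(String agentName, List<String> configured) {
    List<String> agents = new ArrayList<>(configured);
    if (agents.isEmpty()) {
      agents.add(agentName);
      LOG.warning("No agents listed in 'http.robots.agents' property!");
    } else if (!agents.get(0).equalsIgnoreCase(agentName)) {
      agents.add(0, agentName);
      LOG.warning("Agent we advertise (" + agentName
          + ") not listed first in 'http.robots.agents' property!");
    }
    return agents;
  }

  public static void main(String[] args) {
    System.out.println(normalizeAgents("mybot", Arrays.asList("other", "*")));
  }
}
```

The repair logic is unchanged; only the log level differs, which is enough to keep a recoverable configuration problem from ending the fetch.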




[jira] Created: (NUTCH-701) replace Fetcher with Fetcher2

2009-02-24 Thread Sami Siren (JIRA)
replace Fetcher with Fetcher2
-

 Key: NUTCH-701
 URL: https://issues.apache.org/jira/browse/NUTCH-701
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.0.0


Currently there are two fetcher implementations within Nutch, one too many. This 
task tracks the process of promoting Fetcher2.

My plan is basically to
- remove Fetcher altogether and rename Fetcher2 to Fetcher
- fix the Crawl class so it works with the F2 API.

If there are no objections I will proceed with this soon.




[jira] Updated: (NUTCH-701) Replace Fetcher with Fetcher2

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-701:
-

Summary: Replace Fetcher with Fetcher2  (was: replace Fetcher with Fetcher2)

 Replace Fetcher with Fetcher2
 -

 Key: NUTCH-701
 URL: https://issues.apache.org/jira/browse/NUTCH-701
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.0.0


 Currently there are two fetcher implementations within Nutch, one too many. 
 This task tracks the process of promoting Fetcher2.
 My plan is basically to
 - remove Fetcher altogether and rename Fetcher2 to Fetcher
 - fix the Crawl class so it works with the F2 API.
 If there are no objections I will proceed with this soon.




[jira] Resolved: (NUTCH-698) CrawlDb is corrupted after a few crawl cycles

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-698.
--

Resolution: Fixed

committed. thanks guys

 CrawlDb is corrupted after a few crawl cycles
 -

 Key: NUTCH-698
 URL: https://issues.apache.org/jira/browse/NUTCH-698
 Project: Nutch
  Issue Type: Bug
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-698_v1.patch


 After change to hadoop's MapWritable, crawldb becomes corrupted after some 
 fetch cycles. For more details see this discussion thread:
 http://www.nabble.com/Fetcher2-crashes-with-current-trunk-td21978049.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-699) Add an official solr schema for solr integration

2009-02-24 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676233#action_12676233
 ] 

Sami Siren commented on NUTCH-699:
--

We could put it under conf/ ?

 Add an official solr schema for solr integration
 --

 Key: NUTCH-699
 URL: https://issues.apache.org/jira/browse/NUTCH-699
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 1.0.0


 See Andrzej's comments on NUTCH-684 for more info.




[jira] Resolved: (NUTCH-701) Replace Fetcher with Fetcher2

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-701.
--

Resolution: Duplicate

 Replace Fetcher with Fetcher2
 -

 Key: NUTCH-701
 URL: https://issues.apache.org/jira/browse/NUTCH-701
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.0.0


 Currently there are two fetcher implementations within Nutch, one too many. 
 This task tracks the process of promoting Fetcher2.
 My plan is basically to
 - remove Fetcher altogether and rename Fetcher2 to Fetcher
 - fix the Crawl class so it works with the F2 API.
 If there are no objections I will proceed with this soon.




[jira] Updated: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-669:
-

Fix Version/s: (was: 1.1)
   1.0.0

Moving this back to 1.0

Are you close with your patch? As discussed in this thread, we should just 
replace Fetcher with Fetcher2, change the Crawl class, and check that the tests 
pass. Other issues can be dealt with in their own tickets.

I can also help with this if you don't have the time.



 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Resolved: (NUTCH-694) Distributed Search Server fails

2009-02-22 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-694.
--

Resolution: Fixed

Committed. Thanks for testing it.

 Distributed Search Server fails
 ---

 Key: NUTCH-694
 URL: https://issues.apache.org/jira/browse/NUTCH-694
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
 Environment: Single Server with one Nutch instance in 
 DistributedSearchServerMode, not in PseudoDistributedMode
Reporter: Dr. Nadine Hochstotter
Assignee: Sami Siren
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-694-2.patch, NUTCH-694.patch


 I run Nutch on a single server, I have two crawl directories, that's why I 
 use Nutch  in distributed search server mode as described in the hadoop 
 manual.
 But since I have a new Trunk Version (04.02.2009) it fails. Local search on 
 one index works fine. But distributed search throws the following exception:
 In catalina.out (server)
 2009-02-18 17:08:14,906 ERROR NutchBean - org.apache.hadoop.ipc.RemoteException: java.io.IOException: Unknown Protocol classname:org.apache.nutch.searcher.RPCSegmentBean
     at org.apache.nutch.searcher.NutchBean.getProtocolVersion(NutchBean.java:403)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
     at org.apache.hadoop.ipc.Client.call(Client.java:696)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
     at $Proxy4.getProtocolVersion(Unknown Source)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
     at org.apache.nutch.searcher.DistributedSegmentBean.init(DistributedSegmentBean.java:103)
     at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:111)
     at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:80)
     at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:422)
     at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
     at org.apache.catalina.core.StandardContext.start(StandardContext.java:4350)
     at org.apache.catalina.core.StandardContext.reload(StandardContext.java:3099)
     at org.apache.catalina.manager.ManagerServlet.reload(ManagerServlet.java:913)
     at org.apache.catalina.manager.HTMLManagerServlet.reload(HTMLManagerServlet.java:536)
     at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:114)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
     at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:525)
     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
     at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:269)
     at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:81)
     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
     at java.lang.Thread.run(Thread.java:619)
 And in Hadoop.log:
 2009-02-18 17:07:52,847 INFO  ipc.Server - IPC Server handler 48 on 13001: starting
 2009-02-18 17:07:52,847 INFO  ipc.Server - IPC Server handler 49 on 13001: starting
 2009-02-18 17:07:52,847 INFO  ipc.Server - IPC Server handler 40 on 13001: starting
 2009-02-18 

[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains

2009-02-22 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12675793#action_12675793
 ] 

Sami Siren commented on NUTCH-477:
--

It's your call.

IMO the whole URLFilters/URLFilter and URLNormalizers/URLNormalizer arrangement 
is a bit too complex as it is now. We can make it cleaner, but it's probably 
not worth the trouble pre-1.0.



 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.
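
A minimal sketch of the two proposed changes, for illustration only. The names ScopedUrlFilter, ScopedUrlFilters, the ACCEPT/REJECT/ACCEPT_STOP codes and the scope strings are hypothetical; they are not taken from urlfilters.patch or from the Nutch API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical filter contract: returns an int code instead of the URL string. */
interface ScopedUrlFilter {
  int ACCEPT = 0;       // keep the URL and continue with the next filter
  int REJECT = 1;       // drop the URL and stop the chain
  int ACCEPT_STOP = 2;  // keep the URL and skip the remaining filters

  int filter(String url);
}

/** Hypothetical registry holding one filter chain per scope (e.g. "fetcher"). */
final class ScopedUrlFilters {
  private final Map<String, List<ScopedUrlFilter>> chains = new HashMap<>();

  void register(String scope, ScopedUrlFilter filter) {
    chains.computeIfAbsent(scope, k -> new ArrayList<>()).add(filter);
  }

  /** Returns true if the URL survives the chain registered for the scope. */
  boolean accept(String scope, String url) {
    List<ScopedUrlFilter> chain =
        chains.getOrDefault(scope, Collections.<ScopedUrlFilter>emptyList());
    for (ScopedUrlFilter f : chain) {
      int code = f.filter(url);
      if (code == ScopedUrlFilter.REJECT) {
        return false;  // early termination: rejected
      }
      if (code == ScopedUrlFilter.ACCEPT_STOP) {
        return true;   // early termination: accepted, later filters skipped
      }
    }
    return true;       // every filter accepted, or no chain for this scope
  }
}
```

With this shape a cheap filter early in a long chain can return REJECT or ACCEPT_STOP so that the more expensive filters after it never run, and each scope can carry its own chain.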




[jira] Updated: (NUTCH-694) Distributed Search Server fails

2009-02-20 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-694:
-

Attachment: NUTCH-694-2.patch

I rechecked this and found that something else was wrong as well. I am attaching 
a new patch that is now manually tested (we lost the test case somewhere) with 
local and Nutch RPC search.


 Distributed Search Server fails
 ---

 Key: NUTCH-694
 URL: https://issues.apache.org/jira/browse/NUTCH-694
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
 Environment: Single Server with one Nutch instance in 
 DistributedSearchServerMode, not in PseudoDistributedMode
Reporter: Dr. Nadine Hochstotter
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-694-2.patch, NUTCH-694.patch


 I run Nutch on a single server with two crawl directories, which is why I 
 use Nutch in distributed search server mode as described in the Hadoop 
 manual.
 Since I updated to a new trunk version (04.02.2009) it fails. Local search on 
 one index works fine, but distributed search throws the following exception:
 In catalina.out (server):
 2009-02-18 17:08:14,906 ERROR NutchBean - org.apache.hadoop.ipc.RemoteException: java.io.IOException: Unknown Protocol classname:org.apache.nutch.searcher.RPCSegmentBean
at org.apache.nutch.searcher.NutchBean.getProtocolVersion(NutchBean.java:403)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy4.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
at org.apache.nutch.searcher.DistributedSegmentBean.init(DistributedSegmentBean.java:103)
at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:111)
at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:80)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:422)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4350)
at org.apache.catalina.core.StandardContext.reload(StandardContext.java:3099)
at org.apache.catalina.manager.ManagerServlet.reload(ManagerServlet.java:913)
at org.apache.catalina.manager.HTMLManagerServlet.reload(HTMLManagerServlet.java:536)
at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:114)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:525)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:269)
at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:81)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)
 And in Hadoop.log:
 2009-02-18 17:07:52,847 INFO  ipc.Server - IPC Server handler 48 on 13001: starting
 2009-02-18 17:07:52,847 INFO  ipc.Server - IPC Server 

[jira] Updated: (NUTCH-694) Distributed Search Server fails

2009-02-20 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-694:
-

Patch Info: [Patch Available]
  Assignee: Sami Siren

 Distributed Search Server fails
 ---

 Key: NUTCH-694
 URL: https://issues.apache.org/jira/browse/NUTCH-694
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
 Environment: Single Server with one Nutch instance in 
 DistributedSearchServerMode, not in PseudoDistributedMode
Reporter: Dr. Nadine Hochstotter
Assignee: Sami Siren
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-694-2.patch, NUTCH-694.patch



[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

2009-02-20 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-573:
-

Patch Info: [Patch Available]

 Multiple Domains - Query Search
 ---

 Key: NUTCH-573
 URL: https://issues.apache.org/jira/browse/NUTCH-573
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.9.0
 Environment: All
Reporter: Rajasekar Karthik
Assignee: Enis Soztutar
 Fix For: 1.0.0

 Attachments: multiTermQuery_v1.patch


 Searching multiple domains can be done on Lucene, but not that efficiently 
 on Nutch.
 Query:
 +content:abc +(site:www.aaa.com site:www.bbb.com)
 works on Lucene, but the same concept does not work on Nutch.
 In Lucene, it works with 
 org.apache.lucene.analysis.KeywordAnalyzer
 org.apache.lucene.analysis.standard.StandardAnalyzer 
 but NOT on
 org.apache.lucene.analysis.SimpleAnalyzer 
 Is the Nutch analyzer based on SimpleAnalyzer? If so, is there a 
 workaround to make this work? Is there an option to change which analyzer 
 Nutch uses? 
 Just FYI, another solution (inefficient, I believe) that seems to work 
 on Nutch:
 query -site:ccc.com -site:ddd.com 




[jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains

2009-02-20 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-477:
-

Patch Info: [Patch Available]

 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: urlfilters.patch






[jira] Updated: (NUTCH-694) Distributed Search Server fails

2009-02-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-694:
-

Attachment: NUTCH-694.patch

This fixed the problem for me.

 Distributed Search Server fails
 ---

 Key: NUTCH-694
 URL: https://issues.apache.org/jira/browse/NUTCH-694
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
 Environment: Single Server with one Nutch instance in 
 DistributedSearchServerMode, not in PseudoDistributedMode
Reporter: Dr. Nadine Hochstotter
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-694.patch



[jira] Resolved: (NUTCH-695) incorrect mime type detection by MoreIndexingFilter plugin

2009-02-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-695.
--

Resolution: Fixed
  Assignee: Sami Siren

committed, thanks

 incorrect mime type detection by MoreIndexingFilter plugin
 --

 Key: NUTCH-695
 URL: https://issues.apache.org/jira/browse/NUTCH-695
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
Assignee: Sami Siren
 Fix For: 1.0.0

 Attachments: NUTCH-695_MoreIndexingFilter.patch, 
 NUTCH-695_TestMoreIndexingFilter.patch


 When a server sends a {{Content-Type}} header with optional parameters, such as 
 {{Content-Type: text/html; charset=UTF-8}}, MoreIndexingFilter returns null in 
 the {{type}} field.
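
One way to avoid the null, sketched under assumptions: strip the optional parameters before the value is used for MIME type lookup. ContentTypes and baseType are hypothetical names; this is not the code from the attached patches.

```java
import java.util.Locale;

/** Hypothetical helper that reduces a Content-Type header value to "type/subtype". */
final class ContentTypes {
  private ContentTypes() {}

  /**
   * Returns the part of the header value before the first ';', trimmed and
   * lower-cased, or null if the value is missing or has no media type.
   */
  static String baseType(String headerValue) {
    if (headerValue == null) {
      return null;
    }
    int semi = headerValue.indexOf(';');
    String base = (semi >= 0 ? headerValue.substring(0, semi) : headerValue).trim();
    return base.isEmpty() ? null : base.toLowerCase(Locale.ROOT);
  }
}
```

For the header quoted in the report, baseType("text/html; charset=UTF-8") yields "text/html", which can then be looked up as a plain MIME type name.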




[jira] Commented: (NUTCH-694) Distributed Search Server fails

2009-02-19 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12674964#action_12674964
 ] 

Sami Siren commented on NUTCH-694:
--

Strange. Did you update both ends (the server and the client)? Normally the 
web application (.war) is the client.

After patching you should:

1. run ant clean job

2. deploy and run server + client

 Distributed Search Server fails
 ---

 Key: NUTCH-694
 URL: https://issues.apache.org/jira/browse/NUTCH-694
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
 Environment: Single Server with one Nutch instance in 
 DistributedSearchServerMode, not in PseudoDistributedMode
Reporter: Dr. Nadine Hochstotter
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-694.patch



[jira] Resolved: (NUTCH-687) Add RAT

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-687.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

committed

 Add RAT
 ---

 Key: NUTCH-687
 URL: https://issues.apache.org/jira/browse/NUTCH-687
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-687.patch


 Add Apache RAT so we can easily see the situation with required headers.




[jira] Commented: (NUTCH-689) Swf parser doesn't seem to handle relative links

2009-02-18 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12674520#action_12674520
 ] 

Sami Siren commented on NUTCH-689:
--

For some reason I cannot apply the patch:

patching file src/java/org/apache/nutch/parse/swf/SWFParser.java
Hunk #2 FAILED at 94.



 Swf parser doesn't seem to handle relative links
 

 Key: NUTCH-689
 URL: https://issues.apache.org/jira/browse/NUTCH-689
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Peter Sparks
 Attachments: parse-swf.patch


 I was using the SWF parser to extract links from Flash files on the site 
 www.arnoldworldwide.com and got a malformed URL exception because an 
 outlink was found that was a relative link and was not being resolved. I 
 was able to fix it by resolving all links as they are added to the list of 
 outlinks.
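
The fix described above can be sketched with java.net.URI, assuming the base is the URL of the SWF file being parsed. OutlinkResolver is a hypothetical helper, not the code from parse-swf.patch, and the example.com URLs in the note below are placeholders.

```java
import java.net.URI;

/** Hypothetical helper: turns a possibly relative outlink into an absolute URL. */
final class OutlinkResolver {
  private OutlinkResolver() {}

  /**
   * Resolves link against baseUrl. Absolute links are returned unchanged;
   * relative ones are resolved against the base document.
   * Throws IllegalArgumentException if either string is not a valid URI.
   */
  static String resolve(String baseUrl, String link) {
    return URI.create(baseUrl).resolve(link).toString();
  }
}
```

With a base of http://www.example.com/flash/intro.swf, the relative link gallery/index.html resolves to http://www.example.com/flash/gallery/index.html, so the outlink list only ever contains absolute URLs.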




[jira] Resolved: (NUTCH-591) StringIndexOutOfBoundsException when extracting text from a Word document.

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-591.
--

Resolution: Duplicate

duplicate of NUTCH-691

 StringIndexOutOfBoundsException when extracting text from a Word document.
 --

 Key: NUTCH-591
 URL: https://issues.apache.org/jira/browse/NUTCH-591
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: linux
 redhat as4u4 x86
 kernel 2.6.9
Reporter: frank ling

 see 
 http://issues.apache.org/bugzilla/show_bug.cgi?id=41076+




[jira] Resolved: (NUTCH-688) Fix missing/wrong headers in source files

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-688.
--

Resolution: Fixed

I think we are done with this.

 Fix missing/wrong headers in source files
 -

 Key: NUTCH-688
 URL: https://issues.apache.org/jira/browse/NUTCH-688
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Blocker
 Fix For: 1.0.0







[jira] Resolved: (NUTCH-691) Update jakarta poi jars to the most relevant version

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-691.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

committed, Thanks Dmitry

 Update jakarta poi jars to the most relevant version
 

 Key: NUTCH-691
 URL: https://issues.apache.org/jira/browse/NUTCH-691
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
 Fix For: 1.0.0

 Attachments: NUTCH-691-v1-poi.patch, NUTCH-691-v1-test.patch

   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 Updating the Jakarta POI jars to the most relevant version closes bug NUTCH-591.




[jira] Resolved: (NUTCH-563) Include custom fields in BasicQueryFilter

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-563.
--

Resolution: Fixed
  Assignee: Sami Siren

committed, thanks

 Include custom fields in BasicQueryFilter
 -

 Key: NUTCH-563
 URL: https://issues.apache.org/jira/browse/NUTCH-563
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Reporter: julien nioche
Assignee: Sami Siren
Priority: Minor
 Fix For: 1.0.0

 Attachments: diff.BasicQueryFilter.dynamicFields.txt, NUTCH-563.patch


 This patch makes it possible to include additional fields in the 
 BasicQueryFilter by specifying runtime parameters. Any parameter matching the 
 regular expression (query\\.basic\\.(.+)\\.boost) will be added to the list of 
 fields used by the BQF, and the specified float value will be used as the boost.
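
A rough sketch of how such parameters might be collected, assuming they are available as a plain key/value map. BoostParams and extract are hypothetical names and this is not the code from NUTCH-563.patch; the pattern is the one quoted above without its outer group, so the field name is group 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds "query.basic.<field>.boost" parameters. */
final class BoostParams {
  // The expression from the issue, written as a Java string literal.
  private static final Pattern BOOST_KEY =
      Pattern.compile("query\\.basic\\.(.+)\\.boost");

  private BoostParams() {}

  /** Maps each matching field name to its boost, parsed as a float. */
  static Map<String, Float> extract(Map<String, String> params) {
    Map<String, Float> boosts = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : params.entrySet()) {
      Matcher m = BOOST_KEY.matcher(e.getKey());
      if (m.matches()) {
        // group(1) is the field name between "query.basic." and ".boost"
        boosts.put(m.group(1), Float.parseFloat(e.getValue()));
      }
    }
    return boosts;
  }
}
```

A parameter such as query.basic.title.boost=1.5 would add the field "title" with boost 1.5, while unrelated parameters are ignored.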




[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-02-18 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12674603#action_12674603
 ] 

Sami Siren commented on NUTCH-692:
--

Have you seen this outside of EC2? Only in multinode setup?

 AlreadyBeingCreatedException with Hadoop 0.19
 -

 Key: NUTCH-692
 URL: https://issues.apache.org/jira/browse/NUTCH-692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: julien nioche

 I have been using the SVN version of Nutch on an EC2 cluster and got some 
 AlreadyBeingCreatedException errors during the reduce phase of a parse. For 
 some reason one of my tasks crashed, and then I ran into this 
 AlreadyBeingCreatedException when other nodes tried to pick it up.
 There was recently a discussion on the Hadoop user list about similar issues 
 with Hadoop 0.19 (see 
 http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried 
 using 0.18.2 yet but will do so if the problems persist with 0.19.
 I was wondering whether anyone else had experienced the same problem. Do you 
 think 0.19 is stable enough to use for Nutch 1.0?
 I will be running a crawl on a super large cluster in the next couple of 
 weeks and will confirm this issue then.
 J.




[jira] Updated: (NUTCH-583) FeedParser empty links for items

2009-02-18 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-583:
-

Fix Version/s: (was: 1.0.0)
   1.1

pushing this to 1.1

 FeedParser empty links for items
 

 Key: NUTCH-583
 URL: https://issues.apache.org/jira/browse/NUTCH-583
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 1.1


 FeedParser in the feed plugin just discards an item if it does not have a 
 link element. However, RSS 2.0 does not require the link element for each 
 item. 
 Moreover, sometimes the link is given in the guid element, which is a 
 globally unique identifier for the item. I think we can search for the item's 
 URL first; if it is still not found, we can use the feed's URL, but 
 merge all the parse texts into one Parse object. 
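
The fallback order suggested above could look roughly like this. FeedItemUrls and chooseUrl are hypothetical names, and treating a guid as a URL only when it starts with http:// or https:// is an assumption of this sketch, not something the feed plugin is known to do. Merging the parse texts is left out.

```java
/** Hypothetical helper: picks the URL to use for one feed item. */
final class FeedItemUrls {
  private FeedItemUrls() {}

  /**
   * Prefers the item's link, then a guid that looks like a URL,
   * and finally falls back to the URL of the feed itself.
   */
  static String chooseUrl(String link, String guid, String feedUrl) {
    if (link != null && !link.trim().isEmpty()) {
      return link.trim();
    }
    if (guid != null) {
      String g = guid.trim();
      // Assumption: only guids that are plain http(s) URLs are usable as links.
      if (g.startsWith("http://") || g.startsWith("https://")) {
        return g;
      }
    }
    return feedUrl;
  }
}
```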




[jira] Updated: (NUTCH-631) MoreIndexingFilter fails with NoSuchElementException

2009-02-17 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-631:
-

Attachment: NUTCH-631.patch

Attaching a patch that fixes the problem as proposed. If there are no 
objections I will commit this soon.

 MoreIndexingFilter fails with NoSuchElementException
 

 Key: NUTCH-631
 URL: https://issues.apache.org/jira/browse/NUTCH-631
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
 Environment: Verified on CentOS and OSX
Reporter: Stefan Will
Assignee: Chris A. Mattmann
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-631.patch


 I did a simple crawl and started the indexer with the index-more plugin 
 activated. The index job fails with the following stack trace in the task log:
 java.util.NoSuchElementException
 at java.util.TreeMap.key(TreeMap.java:433)
 at java.util.TreeMap.firstKey(TreeMap.java:287)
 at java.util.TreeSet.first(TreeSet.java:407)
 at 
 java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114)
 at 
 org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207)
 at 
 org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90)
 at 
 org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164)
 I traced this down to the part in MoreIndexingFilter where the mime type is 
 split into primary type and subtype for indexing:
 contentType = mimeType.getName();
 String primaryType = mimeType.getSuperType().getName();
 String subType = mimeType.getSubTypes().first().getName();
 Apparently Tika does not have a subtype for text/html. Furthermore, the 
 supertype for text/html is set as application/octet-stream, which I doubt is 
 what we want indexed. Don't we want primaryType to be text and subType to 
 be html ?
 So I changed the code to:
 contentType = mimeType.getName();
 String[] split = contentType.split("/");
 String primaryType = split[0];
 String subType = (split.length > 1) ? split[1] : null;
 
 This does what I think it should do, but perhaps I'm missing something ? 
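
A standalone sketch of the string-based split described above, for readers who want to try it outside of Nutch. The class and method names are hypothetical, and the limit of 2 passed to split is an addition of this sketch (so that any further slash stays in the subtype); it is not part of the reporter's change or of the attached patch.

```java
import java.util.Arrays;

class MimeTypeSplitSketch {

    // Returns {primaryType, subType}; subType is null when the name has no "/" part.
    static String[] splitMimeType(String contentType) {
        String[] split = contentType.split("/", 2);
        String primaryType = split[0];
        String subType = (split.length > 1) ? split[1] : null;
        return new String[] { primaryType, subType };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitMimeType("text/html")));
        System.out.println(Arrays.toString(splitMimeType("text")));
    }
}
```

Because the split works on the type name alone, it never calls first() on an empty sorted set, which is where the NoSuchElementException in the stack trace comes from.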

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-687) Add RAT

2009-02-17 Thread Sami Siren (JIRA)
Add RAT
---

 Key: NUTCH-687
 URL: https://issues.apache.org/jira/browse/NUTCH-687
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Attachments: NUTCH-687.patch

Add apache rat so we can easily see the situation with required headers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-687) Add RAT

2009-02-17 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-687:
-

Attachment: NUTCH-687.patch

 Add RAT
 ---

 Key: NUTCH-687
 URL: https://issues.apache.org/jira/browse/NUTCH-687
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Attachments: NUTCH-687.patch


 Add apache rat so we can easily see the situation with required headers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-631) MoreIndexingFilter fails with NoSuchElementException

2009-02-17 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-631.
--

Resolution: Fixed
  Assignee: Sami Siren  (was: Chris A. Mattmann)

committed, thanks

 MoreIndexingFilter fails with NoSuchElementException
 

 Key: NUTCH-631
 URL: https://issues.apache.org/jira/browse/NUTCH-631
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
 Environment: Verified on CentOS and OSX
Reporter: Stefan Will
Assignee: Sami Siren
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-631.patch


 I did a simple crawl and started the indexer with the index-more plugin 
 activated. The index job fails with the following stack trace in the task log:
 java.util.NoSuchElementException
 at java.util.TreeMap.key(TreeMap.java:433)
 at java.util.TreeMap.firstKey(TreeMap.java:287)
 at java.util.TreeSet.first(TreeSet.java:407)
 at 
 java.util.Collections$UnmodifiableSortedSet.first(Collections.java:1114)
 at 
 org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:207)
 at 
 org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:90)
 at 
 org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:111)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:249)
 at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:52)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:333)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:164)
 I traced this down to the part in MoreIndexingFilter where the mime type is 
 split into primary type and subtype for indexing:
 contentType = mimeType.getName();
 String primaryType = mimeType.getSuperType().getName();
 String subType = mimeType.getSubTypes().first().getName();
 Apparently Tika does not have a subtype for text/html. Furthermore, the 
 supertype for text/html is set as application/octet-stream, which I doubt is 
 what we want indexed. Don't we want primaryType to be text and subType to 
 be html ?
 So I changed the code to:
 contentType = mimeType.getName();
 String[] split = contentType.split("/");
 String primaryType = split[0];
 String subType = (split.length > 1) ? split[1] : null;
 
 This does what I think it should do, but perhaps I'm missing something ? 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-582) Add missing type parameters

2009-02-17 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-582.
--

Resolution: Fixed

yep, all of this has been committed

 Add missing type parameters
 ---

 Key: NUTCH-582
 URL: https://issues.apache.org/jira/browse/NUTCH-582
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Attachments: typeparams.patch


 Hadoop 0.15 added the possibility to use type parameters with several interfaces, 
 which makes it easier to use the correct types in Mappers, Reducers et al. and 
 improves readability. The following patch adds type parameters to 
 Mappers, Reducers, OutputCollectors, MapRunnables, InputFormats and 
 OutputFormats.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-86) LanguageIdentifier API enhancements

2009-02-17 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-86?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-86:


Fix Version/s: (was: 1.0.0)

removing from 1.0 queue since there has been no activity lately

 LanguageIdentifier API enhancements
 ---

 Key: NUTCH-86
 URL: https://issues.apache.org/jira/browse/NUTCH-86
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.6, 0.7, 0.8
Reporter: Jerome Charron
Assignee: Jerome Charron
Priority: Minor

 More information can be found in the following thread on the Nutch-Dev mailing 
 list:
 http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html
 Summary:
 1. LanguageIdentifier API changes. The similarity methods should return an 
 ordered array of language-code/score pairs instead of a simple String 
 containing the language-code.
 2. Ensure consistency between LanguageIdentifier scoring and 
 NGramProfile.getSimilarity().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-609) Allow Plugins to be Loaded from Jar File(s)

2009-02-17 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-609:
-

Fix Version/s: (was: 1.0.0)
   1.1

pushing this to 1.1, feel free to put back if there is traction

 Allow Plugins to be Loaded from Jar File(s)
 ---

 Key: NUTCH-609
 URL: https://issues.apache.org/jira/browse/NUTCH-609
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.1

 Attachments: NUTCH-609-1-20080212.patch


 Currently plugins cannot be loaded from a jar file.  Plugins must be unzipped 
 in one or more directories specified by the plugin.folders config.  I have 
 been thinking about an extension to PluginRepository or PluginManifestParser 
 (or both) that would allow plugins to be packaged into multiple independent jar 
 files and placed on the classpath.  The system would search the classpath for 
 resources with the correct folder name and would load any plugins in those 
 jars.
 This functionality would be very useful in making the nutch core more 
 flexible in terms of packaging.  It would also help with web applications 
 where we don't want to have a plugins directory included in the webapp.
 Thoughts so far are unzipping those plugin jars into a common temp directory 
 before loading.  Another option is using something like commons vfs to 
 interact with the jar files.  VFS essentially uses a disk based temporary cache 
 for jar files, so it is pretty much the same solution.   What are everyone 
 else's thoughts on this?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2009-02-17 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-469:
-

Fix Version/s: (was: 1.0.0)
   1.1

pushing this to 1.1

 changes to geoPosition plugin to make it work on nutch 0.9
 --

 Key: NUTCH-469
 URL: https://issues.apache.org/jira/browse/NUTCH-469
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Mike Schwartz
 Fix For: 1.1

 Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, 
 NUTCH-469-2007-05-09.txt.gz


 I have modified the geoPosition plugin 
 (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9.  (The 
 code was built originally using nutch 0.7.)  I'd like to contribute my 
 changes back to the nutch project.  I already communicated with the code's 
 author (Matthias Jaekle), and he agrees with my mods.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-309) Uses commons logging Code Guards

2009-02-17 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-309:
-

Fix Version/s: (was: 1.0.0)
   1.1

pushing this to 1.1

 Uses commons logging Code Guards
 

 Key: NUTCH-309
 URL: https://issues.apache.org/jira/browse/NUTCH-309
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1


 Code guards are typically used to guard code that only needs to execute in 
 support of logging, that otherwise introduces undesirable runtime overhead in 
 the general case (logging disabled). Examples are multiple parameters, or 
 expressions (e.g. "string" + " more") for parameters. Use the guard methods of 
 the form log.is<Priority>() to verify that logging should be performed, 
 before incurring the overhead of the logging method call. Yes, the logging 
 methods will perform the same check, but only after resolving parameters.
 (description extracted from 
 http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
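
A small sketch of the code-guard pattern quoted above. To stay self-contained it uses java.util.logging (Logger.isLoggable) as a stand-in; with commons logging the guard would instead be a call such as log.isDebugEnabled(). The class, the counter and the describe helper are hypothetical and exist only to make the saved work visible.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class CodeGuardSketch {

    static final Logger LOG = Logger.getLogger("CodeGuardSketch");

    // Counts how often the costly message is actually built.
    static int expensiveCalls = 0;

    // Stands in for any message whose construction is not free.
    static String describe(String url) {
        expensiveCalls++;
        return "fetched " + url + " at " + System.nanoTime();
    }

    static void logFetch(String url) {
        // Guard: build the message only if it would be logged at this level.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(describe(url));
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO);
        logFetch("http://example.com/"); // placeholder URL
        System.out.println("messages built: " + expensiveCalls);
    }
}
```

Without the guard, describe() would run on every call even when FINE logging is disabled, which is exactly the overhead the description refers to.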

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-689) Swf parser doesn't seem to handle relative links

2009-02-17 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12674360#action_12674360
 ] 

Sami Siren commented on NUTCH-689:
--

About development: see 
http://wiki.apache.org/nutch/Becoming%20A%20Nutch%20Developer for instructions 
on developing Nutch, in particular the section "Step Three: Using the JIRA and 
Developing".

You should attach patches instead of full Java source files because it's much 
easier to see what changed by looking at diffs.

 Swf parser doesn't seem to handle relative links
 

 Key: NUTCH-689
 URL: https://issues.apache.org/jira/browse/NUTCH-689
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Peter Sparks
 Attachments: SWFParser.java


 I was using the swf parser to extract links from flash files on the site 
 www.arnoldworldwide.com and I was getting a malformed URL exception because 
 an outlink was found and it was a relative link that wasn't being resolved. I 
 was able to fix it by resolving all links as they are added to the list of 
 outlinks.
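
Resolving each extracted link against the URL of the containing file, as described above, might look like the sketch below. It uses java.net.URI.resolve as a stand-in and is not taken from the attached SWFParser.java; the class name, method name and example.com URLs are placeholders.

```java
import java.net.URI;

class OutlinkResolveSketch {

    // Resolve a possibly relative link against the URL of the document it was found in.
    static String resolve(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link.trim()).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/flash/intro.swf";
        System.out.println(resolve(base, "../about/index.html"));
        System.out.println(resolve(base, "http://other.example.org/page.html"));
    }
}
```

Absolute links pass through unchanged, so the same call can be applied to every link as it is added to the list of outlinks.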

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-621) Nutch needs to declare it's crypto usage

2008-06-10 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12603890#action_12603890
 ] 

Sami Siren commented on NUTCH-621:
--

I agree, it seems to me that we're in the same situation as Jackrabbit? I think we do 
not provide the bc libraries with Nutch, only PDFBox.

 Nutch needs to declare it's crypto usage
 

 Key: NUTCH-621
 URL: https://issues.apache.org/jira/browse/NUTCH-621
 Project: Nutch
  Issue Type: Task
Reporter: Grant Ingersoll
Assignee: Chris A. Mattmann
Priority: Blocker

 Per the ASF board direction outlined at 
 http://www.apache.org/dev/crypto.html, Nutch needs to declare its use of 
 crypto libraries (i.e. BouncyCastle, via PDFBox/Tika).
 See TIKA-118.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-602) Allow configurable number of handlers for search servers

2008-02-07 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12566779#action_12566779
 ] 

Sami Siren commented on NUTCH-602:
--

+1

 Allow configurable number of handlers for search servers
 

 Key: NUTCH-602
 URL: https://issues.apache.org/jira/browse/NUTCH-602
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-602-1-20080205.patch


 This improvement changes the distributed search server to allow a 
 configurable number of RPC handlers.  Previously the number was hardcoded at 10 
 handlers.  In high-volume environments that limit is quickly reached 
 and overall search slows down.  The patch adds the configuration parameter 
 searchers.num.handlers to nutch-default.xml and changes 
 DistributedSearch to read the number of handlers from the configuration.
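
Reading such a setting with a fallback to the old hardcoded value could be sketched as below. java.util.Properties is used here purely as a stand-in for the real configuration object, and the class and method names are hypothetical; only the key searchers.num.handlers and the previous value of 10 come from the description.

```java
import java.util.Properties;

class HandlerConfigSketch {

    static final String NUM_HANDLERS_KEY = "searchers.num.handlers";

    // The value that was hardcoded before the change.
    static final int DEFAULT_NUM_HANDLERS = 10;

    // Read the handler count, falling back to the default when the key is unset or blank.
    static int numHandlers(Properties conf) {
        String raw = conf.getProperty(NUM_HANDLERS_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_NUM_HANDLERS;
        }
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(numHandlers(conf));
        conf.setProperty(NUM_HANDLERS_KEY, "50");
        System.out.println(numHandlers(conf));
    }
}
```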

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-580) Remove deprecated hadoop api calls (FS)

2008-01-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-580.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

Committed.

 Remove deprecated hadoop api calls (FS)
 ---

 Key: NUTCH-580
 URL: https://issues.apache.org/jira/browse/NUTCH-580
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Fix For: 1.0.0

 Attachments: hadoopfsdeprecated.patch


 There are quite a lot of calls to deprecated hadoop api functionality. 
 The following patch takes care of the FS-related ones.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-580) Remove deprecated hadoop api calls (FS)

2007-11-21 Thread Sami Siren (JIRA)
Remove deprecated hadoop api calls (FS)
---

 Key: NUTCH-580
 URL: https://issues.apache.org/jira/browse/NUTCH-580
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Attachments: hadoopfsdeprecated.patch

There are quite a lot of calls to deprecated hadoop api functionality. 
The following patch takes care of the FS-related ones.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-582) Add missing type parameters

2007-11-21 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-582:
-

Attachment: typeparams.patch

 Add missing type parameters
 ---

 Key: NUTCH-582
 URL: https://issues.apache.org/jira/browse/NUTCH-582
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Attachments: typeparams.patch


 Hadoop 0.15 added the possibility to use type parameters with several interfaces, 
 which makes it easier to use the correct types in Mappers, Reducers et al. and 
 improves readability. The following patch adds type parameters to 
 Mappers, Reducers, OutputCollectors, MapRunnables, InputFormats and 
 OutputFormats.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-582) Add missing type parameters

2007-11-21 Thread Sami Siren (JIRA)
Add missing type parameters
---

 Key: NUTCH-582
 URL: https://issues.apache.org/jira/browse/NUTCH-582
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor


Hadoop 0.15 added the possibility to use type parameters with several interfaces, 
which makes it easier to use the correct types in Mappers, Reducers et al. and 
improves readability. The following patch adds type parameters to 
Mappers, Reducers, OutputCollectors, MapRunnables, InputFormats and 
OutputFormats.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-568) Indexer does not update the Lucene TITLE field

2007-10-22 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536756
 ] 

Sami Siren commented on NUTCH-568:
--

There is a BOM (Byte Order Mark) [feff] at the beginning of the file that seems 
to confuse Nutch. I did not track down the change that caused this.
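
For illustration only, dropping a leading BOM from already decoded text can be sketched as below. Where exactly Nutch stumbles over the mark was not tracked down, so this is not a proposed fix, and the class and method names are hypothetical.

```java
class BomSketch {

    // U+FEFF is the byte order mark once the bytes have been decoded to characters.
    static final char BOM = '\uFEFF';

    // Return the text without a leading BOM; any other input is returned unchanged.
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        String withBom = BOM + "<html><head><title>t</title></head></html>";
        System.out.println(stripBom(withBom).startsWith("<html>"));
    }
}
```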

 Indexer does not update the Lucene TITLE field
 

 Key: NUTCH-568
 URL: https://issues.apache.org/jira/browse/NUTCH-568
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
 Environment: Windows XP
Reporter: smorales
 Attachments: RN-071018-24.html


 Hi,
 The indexer is unable to update the field TITLE of the Lucene index when 
 processing specific html documents.
 This issue has been reproduced using Nutch-Nightly Build #241 (Oct 19, 2007 
 4:01:28 AM)
 The problem does not occur using NUTCH 9.0.
 Workflow:
 1.- Extracted package and copy across the following configuration files from 
 NUTCH 9.0
 - {nutch_home_9.0}/bin/url folder, containing the urls
 - {nutch_home_9.0}/conf/nutch-site.xml
 - {nutch_home_9.0}/conf/crawl-urlfilter.txt
 2.- To reproduce the issue, you need to copy the attached html document to 
 your webserver/filesystem.
 3.- Run the crawl.
 For example: ./nutch crawl urls -dir crawl -depth 22
 4.- Open the index using Luke.  For this test, I used lukeall-0.7.1.jar
 5.- Select the document tab and move through the docs until you 
 find our html document.
 You will see that the TITLE field is empty  -- INCORRECT because this html 
 document contains a title.
 6.- Now, open the html document, add a space anywhere then save it again.
 7.- Repeat step 3 and 4.
 You will notice that this time the TITLE field contains the correct 
 information.
 Please advise,
 Many thanks in advance for your support.
 Sergio

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

2007-10-12 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534364
 ] 

Sami Siren commented on NUTCH-565:
--

I didn't actually test this, but it looks like a useful addition to Nutch, so +1 
from me.

 Arc File to Nutch Segments Converter
 

 Key: NUTCH-565
 URL: https://issues.apache.org/jira/browse/NUTCH-565
 Project: Nutch
  Issue Type: Improvement
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: arcsegments2.patch, nutch-565-1-20071009.patch


 Functionality that allows arc files, such as those produced by the internet 
 archive project or by the Grub distributed crawler to be parsed into Nutch 
 segments.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


