[jira] Assigned: (NUTCH-817) parse-(html) does follow links of full html page, parse-(tika) does not follow any links and stops at level 1

2010-05-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-817:
---

Assignee: Julien Nioche

 parse-(html) does follow links of full html page, parse-(tika) does not 
 follow any links and stops at level 1
 

 Key: NUTCH-817
 URL: https://issues.apache.org/jira/browse/NUTCH-817
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Suse linux 11.1, java version 1.6.0_13
Reporter: matthew a. grisius
Assignee: Julien Nioche
 Attachments: sample-javadoc.html


 submitted per Julien Nioche. I did not see where to attach a file, so I pasted 
 it here. BTW: the Tika command line returns an empty HTML body for this file.
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" 
 "http://www.w3.org/TR/html4/frameset.dtd">
 <!--NewPage-->
 <HTML>
 <HEAD>
 <!-- Generated by javadoc on Fri Mar 28 17:23:42 EDT 2008-->
 <TITLE>
 Matrix Application Development Kit
 </TITLE>
 <SCRIPT type="text/javascript">
 targetPage = "" + window.location.search;
 if (targetPage != "" && targetPage != "undefined")
    targetPage = targetPage.substring(1);
 function loadFrames() {
     if (targetPage != "" && targetPage != "undefined")
          top.classFrame.location = top.targetPage;
 }
 </SCRIPT>
 <NOSCRIPT>
 </NOSCRIPT>
 </HEAD>
 <FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">
 <FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">
 <FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
 <FRAME src="allclasses-frame.html" name="packageFrame" title="All classes and 
 interfaces (except non-static nested types)">
 </FRAMESET>
 <FRAME src="overview-summary.html" name="classFrame" title="Package, class 
 and interface descriptions" scrolling="yes">
 <NOFRAMES>
 <H2>
 Frame Alert</H2>
 <P>
 This document is designed to be viewed using the frames feature. If you see 
 this message, you are using a non-frame-capable web client.
 <BR>
 Link to<A HREF="overview-summary.html">Non-frame version.</A>
 </NOFRAMES>
 </FRAMESET>
 </HTML>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-710) Support for rel=canonical attribute

2010-04-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859286#action_12859286
 ] 

Julien Nioche commented on NUTCH-710:
-

As suggested previously, we could either treat canonicals as redirections or 
handle them during deduplication. Neither is a satisfactory solution.

Redirection: we want to index the document if/when the target of the canonical 
is not available for indexing. We also want to follow the outlinks. 
Dedup: we could modify the *DeleteDuplicates code, but canonicals are more 
complex due to the fact that we need to follow redirections.

We probably need a third approach: prefilter by going through the crawldb and 
detect URLs which have a canonical target already indexed or ready to be 
indexed. We need to follow up to X levels of redirection, e.g. doc A is marked 
as having doc B as its canonical representation, doc B redirects to doc C, 
etc. If the end of the redirection chain exists and is valid, then mark A as a 
duplicate of C (intermediate redirects will not get indexed anyway).

As we don't know if it has been indexed yet, we would give it a special marker 
(e.g. status_duplicate) in the crawlDB. Then:
- if the indexer comes across such an entry: skip it
- make it so that *deleteDuplicates can take a list of URLs with 
status_duplicate as an additional source of input, OR have a custom resource 
that deletes such entries from SOLR or Lucene indices

The implementation would be as follows :

Go through all redirections and generate all redirection chains, e.g.

A -> B
B -> C
D -> C

where C is an indexable document (i.e. it has been fetched and parsed - it may 
have already been indexed)

will yield

A -> C
B -> C
D -> C

but also

C -> C

Once we have all possible redirections, go through the crawlDB in search of 
canonicals. If the target of a canonical is the source of a valid alias (e.g. 
A -> B -> C -> D), mark it as 'status:duplicate'.
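The chain resolution described above can be sketched as follows. This is an illustrative Python sketch, not Nutch code: the names `resolve_aliases` and `mark_duplicates` and the `MAX_HOPS` bound are invented for the example.

```python
# Sketch of resolving redirection chains to their indexable end point,
# then marking docs whose canonical target resolves as duplicates.

MAX_HOPS = 5  # "up to X levels of redirection"

def resolve_aliases(redirects, indexable):
    """redirects: dict src -> dst; indexable: set of fetched+parsed URLs.
    Returns a dict mapping each URL to the indexable end of its chain."""
    aliases = {}
    for src in set(redirects) | set(indexable):
        target, hops = src, 0
        while target in redirects and hops < MAX_HOPS:
            target = redirects[target]
            hops += 1
        if target in indexable:
            aliases[src] = target  # includes the identity alias C -> C
    return aliases

def mark_duplicates(canonicals, aliases):
    """canonicals: dict doc -> its declared canonical URL.
    A doc whose canonical resolves to an indexable target is a duplicate."""
    return {doc: aliases[canon]
            for doc, canon in canonicals.items() if canon in aliases}

# The A/B/C/D example from the comment:
aliases = resolve_aliases({"A": "B", "B": "C", "D": "C"}, {"C"})
print(aliases)  # A, B, D and C itself all alias to C (order may vary)
print(mark_duplicates({"X": "A"}, aliases))  # X is a duplicate of C
```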

This design implies generating quite a few intermediate structures + scanning 
the whole crawlDB twice (once for the aliases, then for the canonicals) + 
rewriting the whole crawlDB to mark some of the entries as duplicates.

This would be much easier to do when we have Nutch2/HBase: we could simply 
follow the redirects from the initial URL having a canonical tag instead of 
generating these intermediate structures. We could then modify the entries one 
by one instead of regenerating the whole crawlDB.

WDYT?



 Support for rel=canonical attribute
 -

 Key: NUTCH-710
 URL: https://issues.apache.org/jira/browse/NUTCH-710
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.1
Reporter: Frank McCown
Priority: Minor

 There is a new rel=canonical attribute which is
 now supported by Google, Yahoo, and Live:
 http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
 Adding support for this attribute value will potentially reduce the number of 
 URLs crawled and indexed and reduce duplicate page content.
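Detecting the canonical declaration itself is a small parsing task. The sketch below (standard-library Python, illustrative only, not the parser code Nutch would actually use) pulls the href out of a `<link rel="canonical" ...>` tag.

```python
from html.parser import HTMLParser

class CanonicalExtractor(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel", "").lower() == "canonical"
                and self.canonical is None):
            self.canonical = a.get("href")

page = ('<html><head>'
        '<link rel="canonical" href="http://example.com/item"/>'
        '</head><body>duplicate content</body></html>')
parser = CanonicalExtractor()
parser.feed(page)
print(parser.canonical)  # -> http://example.com/item
```

Self-closing tags are handled because HTMLParser's default handle_startendtag delegates to handle_starttag.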




[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856349#action_12856349
 ] 

Julien Nioche commented on NUTCH-808:
-

Hi Enis,

{quote}
On the other hand, current implementation is ...
{quote}

What do you mean by current implementation? NutchBase?

My gut feeling would be to write a custom framework instead of relying on 
DataNucleus and use AVRO if possible. I really think that HBase support is 
urgently needed but am less convinced that we need MySQL in the very short 
term. 

I know that Cascading has various Tap/Sink implementations, including JDBC, 
HBase, but also SimpleDB. Maybe it would be worth having a look at how they do 
it?

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses the Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks to see whether they 
 suit our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-808:


Fix Version/s: 2.0

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses the Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks to see whether they 
 suit our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.




[jira] Created: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)
Upgrade to Tika 0.7
---

 Key: NUTCH-810
 URL: https://issues.apache.org/jira/browse/NUTCH-810
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


Upgrading to Tika 0.7 before 1.1 release

The TikaConfig mechanism has changed and does not rely on a default XML config 
file anymore. Am working on it.




[jira] Updated: (NUTCH-789) Improvements to Tika parser

2010-04-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-789:


  Component/s: (was: fetcher)
   parser
Fix Version/s: (was: 1.1)

Have created a separate issue for the upgrade to Tika 0.7 and moved this one 
out of 1.1

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: parser
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Closed: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-810.
---

Resolution: Fixed

Committed in rev 931098.

http://issues.apache.org/jira/browse/TIKA-317 changed the way the TikaConfig is 
created, as it does not rely on a tika-config.xml file any longer. Our custom 
TikaConfig has been modified to reflect these changes.

This was the last remaining issue marked for 1.1 



 Upgrade to Tika 0.7
 ---

 Key: NUTCH-810
 URL: https://issues.apache.org/jira/browse/NUTCH-810
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 Upgrading to Tika 0.7 before 1.1 release
 The TikaConfig mechanism has changed and does not rely on a default XML 
 config file anymore. Am working on it.




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-04 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853251#action_12853251
 ] 

Julien Nioche commented on NUTCH-789:
-

Will upgrade as soon as 0.7 is available from 
http://repo1.maven.org/maven2/org/apache/tika/ - which is not the case yet.
I will leave this issue open but unmark it as 1.1

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Created: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
Parse-metatags plugin
-

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser, specify in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
parameter a list of metatag names, with '*' as the default value. The values 
are separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com.






[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809.patch

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch


 h2. Parse-metatags plugin
 *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
 [TIKA-379]).* 
 To use the legacy HTML parser, specify in parse-plugins.xml:
 {code:xml}
 <mimeType name="text/html">
   <plugin id="parse-html" />
 </mimeType>
 {code}
 The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
 parameter a list of metatag names, with '*' as the default value. The values 
 are separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml:
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields, 
 'keywords' and 'description'. Note that keywords is multivalued.
 The MetaTagsQueryFilter allows the fields above to be included in Nutch 
 queries.
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com.




[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: (was: NUTCH-809.patch)

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche

 h2. Parse-metatags plugin
 *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
 [TIKA-379]).* 
 To use the legacy HTML parser, specify in parse-plugins.xml:
 {code:xml}
 <mimeType name="text/html">
   <plugin id="parse-html" />
 </mimeType>
 {code}
 The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
 parameter a list of metatag names, with '*' as the default value. The values 
 are separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml:
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields, 
 'keywords' and 'description'. Note that keywords is multivalued.
 The MetaTagsQueryFilter allows the fields above to be included in Nutch 
 queries.
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com.




[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809.patch

Modified version of the plugin which is compatible with parse-tika

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch


 h2. Parse-metatags plugin
 *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
 [TIKA-379]).* 
 To use the legacy HTML parser, specify in parse-plugins.xml:
 {code:xml}
 <mimeType name="text/html">
   <plugin id="parse-html" />
 </mimeType>
 {code}
 The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
 parameter a list of metatag names, with '*' as the default value. The values 
 are separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml:
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields, 
 'keywords' and 'description'. Note that keywords is multivalued.
 The MetaTagsQueryFilter allows the fields above to be included in Nutch 
 queries.
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com.




[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Description: 
h2. Parse-metatags plugin

The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
parameter a list of metatag names, with '*' as the default value. The values 
are separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com.



  was:
h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser, specify in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
parameter a list of metatag names, with '*' as the default value. The values 
are separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com.




 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch


 h2. Parse-metatags plugin
 The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
 parameter a list of metatag names, with '*' as the default value. The values 
 are separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml:
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields, 
 'keywords' and 'description'. Note that keywords is multivalued.
 The MetaTagsQueryFilter allows the fields above to be included in Nutch 
 queries.
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com.




[jira] Updated: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-706:


Fix Version/s: (was: 1.1)

Both variants of the substitution rule above break existing tests. More work 
will be needed to get a pattern which covers the case described by Meghna *and* 
is compatible with the existing test cases.
Moving it to post-1.1

 Url regex normalizer
 

 Key: NUTCH-706
 URL: https://issues.apache.org/jira/browse/NUTCH-706
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor

 Hey,
 I encountered the following problem while trying to crawl a site using
 nutch-trunk. In the file regex-normalize.xml, the following regex is
 used to remove session ids:
 <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
 This pattern also transforms a URL such as newsId=2000484784794&newsLang=en
 into new&newsLang=en (since it matches 'sId' in 'newsId'), which is
 incorrect and hence does not get fetched. This expression needs to be
 changed to prevent this.
 Thanks,
 Meghna
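The unwanted match is easy to reproduce. The sketch below uses a simplified form of the rule (Python `re` instead of the Java regex engine, with the `(?i)` inline flags folded into `re.I`); the "fixed" pattern only illustrates anchoring the parameter name at a URL delimiter and is not the fix that was actually committed.

```python
import re

URL = "http://example.com/story?newsId=2000484784794&newsLang=en"

# Simplified version of the session-id rule from regex-normalize.xml.
SESSION_RE = re.compile(
    r"([;_]?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&|#|$)", re.I)
# 'sId' inside 'newsId' matches, so the parameter is wrongly stripped:
print(SESSION_RE.sub(r"\4", URL))  # -> http://example.com/story?new&newsLang=en

# Illustrative repair: require the name to start right after a delimiter.
FIXED_RE = re.compile(
    r"((?:^|[?&;_])(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&|#|$)", re.I)
print(FIXED_RE.sub(r"\4", URL))  # the URL is left intact
```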




[jira] Resolved: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-779.
-

   Resolution: Fixed
Fix Version/s: 1.1

Committed revision 929038.

Thanks Andrzej for your feedback

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-779, NUTCH-779-v2.patch


 The patch attached allows parse metadata to be passed to the corresponding 
 entry of the crawldb.  
 Comments are welcome




[jira] Closed: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-785.
---

Resolution: Fixed

Committed revision 929039

Thanks Andrzej for reviewing it

 Fetcher : copy metadata from origin URL when redirecting + call 
 scfilters.initialScore on newly created URL
 ---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-785.patch


 When following redirections, the Fetcher does not copy the metadata from 
 the original URL to the new one, nor does it call the method 
 scfilters.initialScore




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851316#action_12851316
 ] 

Julien Nioche commented on NUTCH-789:
-

Shall we postpone the work on this issue to after 1.1?

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851545#action_12851545
 ] 

Julien Nioche commented on NUTCH-570:
-

{quote}Julien, want to take this?{quote}

Not particularly. I am busy with short-term issues for 1.1, so feel free to 
take it if you have a particular interest in this. 
I would be curious to see some figures on the improvements from this patch; my 
impression is that NUTCH-776 would be quicker to implement and maintain and 
might possibly give similar gains. 

 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  In comparison with old benchmarks I 
 had set with regular Fetcher.java this was at least 3 fold more time.
 Anyway, I realized that the best situation for ordering can be approached by 
 randomization, but in order to get optimal ordering, urls from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduces to optimize the ordering and for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java and just add a flag in nutch-default.xml 
 determining if the user wants to use it.
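The host-spreading idea described above, placing URLs from the same host as far apart in the fetch list as possible, can be approximated with a round-robin interleave over per-host queues. The Python sketch below is illustrative only; the attached patch is a pair of map/reduce jobs, not this code.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse

def spread_by_host(urls):
    """Interleave URLs round-robin across hosts so that consecutive
    fetches hit different hosts whenever possible."""
    by_host = defaultdict(list)
    for u in urls:
        by_host[urlparse(u).netloc].append(u)
    # Take one URL from each host per round until every queue is drained.
    rounds = zip_longest(*by_host.values())
    return [u for rnd in rounds for u in rnd if u is not None]

urls = ["http://a.com/1", "http://a.com/2", "http://a.com/3",
        "http://b.com/1", "http://c.com/1"]
print(spread_by_host(urls))
# -> ['http://a.com/1', 'http://b.com/1', 'http://c.com/1',
#     'http://a.com/2', 'http://a.com/3']
```

A plain hash shuffle can leave same-host URLs adjacent by chance; the interleave guarantees separation as long as more than one host still has URLs queued.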




[jira] Closed: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-784.
---

Resolution: Fixed

Committed revision 928746

 CrawlDBScanner 
 ---

 Key: NUTCH-784
 URL: https://issues.apache.org/jira/browse/NUTCH-784
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-784.patch


 The patch file contains a utility which dumps all the entries matching a 
 regular expression on their URL. The dump mechanism of the crawldb reader is 
 not very useful on large crawldbs, as the output can be extremely large, and 
 the -url function can't help if we don't know which URL we want to have a 
 look at.
 The CrawlDBScanner can either generate a text representation of the 
 CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
 Usage: CrawlDBScanner crawldb output regex [-s status] -text
 regex: regular expression on the crawldb key
 -s status : constraint on the status of the crawldb entries e.g. db_fetched, 
 db_unfetched
 -text : if this parameter is used, the output will be of TextOutputFormat; 
 otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
 For instance, the command below: 
 ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
 -s db_fetched -text
 will generate a text file /tmp/amazon-dump containing all the entries of the 
 crawldb matching the regexp  .+amazon.com.* and having a status of db_fetched




[jira] Updated: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-784:


Fix Version/s: 1.1

 CrawlDBScanner 
 ---

 Key: NUTCH-784
 URL: https://issues.apache.org/jira/browse/NUTCH-784
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-784.patch


 The patch file contains a utility which dumps all the entries matching a 
 regular expression on their URL. The dump mechanism of the crawldb reader is 
 not very useful on large crawldbs, as the output can be extremely large, and 
 the -url function can't help if we don't know which URL we want to have a 
 look at.
 The CrawlDBScanner can either generate a text representation of the 
 CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
 Usage: CrawlDBScanner crawldb output regex [-s status] -text
 regex: regular expression on the crawldb key
 -s status : constraint on the status of the crawldb entries e.g. db_fetched, 
 db_unfetched
 -text : if this parameter is used, the output will be of TextOutputFormat; 
 otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
 For instance, the command below: 
 ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
 -s db_fetched -text
 will generate a text file /tmp/amazon-dump containing all the entries of the 
 crawldb matching the regexp  .+amazon.com.* and having a status of db_fetched




[jira] Created: (NUTCH-806) Merge CrawlDBScanner with CrawlDBReader

2010-03-29 Thread Julien Nioche (JIRA)
Merge CrawlDBScanner with CrawlDBReader
---

 Key: NUTCH-806
 URL: https://issues.apache.org/jira/browse/NUTCH-806
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche


The CrawlDBScanner [NUTCH-784] should be merged with the CrawlDBReader. Will do 
that after the 1.1 release 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-783) IndexerChecker Utility

2010-03-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-783:


Fix Version/s: (was: 1.1)

Removed tag 1.1
Will rename to IndexingPluginsChecker later

 IndexerChecker Utility
 -

 Key: NUTCH-783
 URL: https://issues.apache.org/jira/browse/NUTCH-783
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-783.patch


 This patch contains a new utility which allows checking the configuration of 
 the indexing filters. The IndexerChecker reads and parses a URL, runs the 
 indexers on it, and displays the fields obtained along with the first
  100 characters of their value.
 It can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker 
 http://www.lemonde.fr/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850912#action_12850912
 ] 

Julien Nioche commented on NUTCH-785:
-

Could anyone please review this issue? I would like to commit it in time for 
the 1.1 release

 Fetcher : copy metadata from origin URL when redirecting + call 
 scfilters.initialScore on newly created URL
 ---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-785.patch


 When following redirections, the Fetcher does not copy the metadata from 
 the original URL to the new one, nor does it call the method 
 scfilters.initialScore on the newly created URL

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850915#action_12850915
 ] 

Julien Nioche commented on NUTCH-779:
-

Could anyone please review this issue? I would like to commit it in time for 
the 1.1 release

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-779, NUTCH-779-v2.patch


 The attached patch allows passing parse metadata to the corresponding entry 
 of the crawldb.  
 Comments are welcome

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-776) Configurable queue depth

2010-03-23 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-776:


Fix Version/s: (was: 1.1)

Moving this issue post 1.1
Needs a patch file, some description of the param in nutch-default.xml and more 
importantly some experimentation to see how it impacts the performance of the 
fetching

 Configurable queue depth
 

 Key: NUTCH-776
 URL: https://issues.apache.org/jira/browse/NUTCH-776
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.1
Reporter: MilleBii
Priority: Minor

 I propose that we create a configurable item for the queue depth in 
 Fetcher.java instead of the hard-coded value of 50.
 key name : fetcher.queues.depth
 Default value : remains 50 (of course)
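A sketch of how the proposed key might be declared in nutch-default.xml (hypothetical: the property does not exist yet; the name and default follow the proposal above):

```xml
<property>
  <name>fetcher.queues.depth</name>
  <value>50</value>
  <description>Proposed: maximum depth of a fetch queue in Fetcher.java,
  replacing the hard-coded value of 50.</description>
</property>
```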

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-740.
---

Resolution: Fixed
  Assignee: Julien Nioche

Committed in rev 926003
Thanks Marcin for contributing this patch

 Configuration option to override default language for fetched pages.
 

 Key: NUTCH-740
 URL: https://issues.apache.org/jira/browse/NUTCH-740
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Marcin Okraszewski
Assignee: Julien Nioche
Priority: Minor
 Fix For: 1.1

 Attachments: AcceptLanguage.patch, 
 AcceptLanguage_trunk_2009-06-09.patch, NUTCH-740.patch


 By default the Accept-Language HTTP request header is set to English. 
 Unfortunately this value is hard-coded and there seems to be no way to 
 override it. As a result you may index the English version of pages even 
 though you would prefer them in a different language. 
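A sketch of how such an override could look as a property in nutch-default.xml (the property name and default value shown here are an assumption, not necessarily what the attached patches use):

```xml
<property>
  <name>http.accept.language</name>
  <value>en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the Accept-Language request header sent with
  each HTTP request, instead of the previously hard-coded English
  default.</description>
</property>
```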

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Fix Version/s: 1.1

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-762-v2.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate -> update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one parse of the crawlDB, then updating the DB only once 
 on several segments. The existing Generator allows several successive runs by 
 generating a copy of the crawlDB and marking the URLs to be fetched. In 
 practice this approach does not work well, as we need to read the whole 
 crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator  which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for  
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDb is too slow to be usable on a large scale
 * can max the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for maxing the URLs and 
 for partitioning; however as we can't count the max number of URLs by IP 
 another unit must be chosen while partitioning by IP. 
 We found that using a filter on the score can dramatically improve the 
 performance as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via : nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options :
 MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers 
 numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
 where most parameters are similar to the default Generator's - apart from: 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than 
 the max value specified, e.g. if not enough URLs are available for fetching 
 and they fit in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
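The partitioning by host mentioned above can be pictured with a minimal sketch (illustrative only, not Nutch's actual partitioner; the class and method names are made up for the example): all URLs sharing a host hash to the same partition, so per-host limits can be enforced within a single fetch queue.

```java
import java.net.URI;

// Illustrative sketch: partitioning fetchlist entries by host.
public class HostPartitionSketch {
    // All URLs sharing a host land in the same partition.
    static int partitionByHost(String url, int numPartitions) {
        String host = URI.create(url).getHost().toLowerCase();
        // Mask the sign bit so the modulo is never negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int a = partitionByHost("http://www.example.com/a.html", 4);
        int b = partitionByHost("http://www.example.com/b.html", 4);
        System.out.println(a == b); // same host, same partition
    }
}
```

Partitioning by domain or by IP would follow the same shape, only deriving a different key from the URL before hashing.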
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Attachment: NUTCH-762-v3.patch

new patch which reintroduces the 'generator.update.crawldb' functionality 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848095#action_12848095
 ] 

Julien Nioche commented on NUTCH-762:
-

{quote}
I just noticed that the new Generator uses different config property names 
(generator. vs. generate.), and the older versions are now marked with 
(Deprecated). However, this doesn't reflect the reality - properties with old 
names are simply ignored now, whereas deprecated implies that they should 
still work
{quote}

They will still work if we keep the old Generator as OldGenerator - which is 
what we assume in the patch. If we decide to get shot of the OldGenerator then 
yes, they should not be marked  with (Deprecated)

{quote}
For back-compat reason I think they should still work - the current (admittedly 
awkward) prefix is good enough, and I think that changing it in a minor release 
would create confusion. I suggest reverting to the old names where appropriate, 
and add new properties with the same prefix, i.e. generate..
{quote}

the original assumption was that we'd keep both this version of the generator 
and the old one in which case we could have used a different prefix for the 
properties. If we want to *replace* the old generator altogether - which I 
think would be a good option - then indeed we should discuss whether or not to 
align on the old prefix. 

I don't have strong feelings on whether or not to modify the prefix in a minor 
release.  





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848140#action_12848140
 ] 

Julien Nioche commented on NUTCH-762:
-

The change of prefix also reflected that we now use 2 different parameters to 
specify how to count the URLs (host or domain) and the max number of URLs.  We 
can of course maintain the old parameters as well for the sake of 
compatibility, except that _generate.max.per.host.by.ip_ won't be of much use 
anymore as we don't count per IP.

Have just noticed  that 'crawl.gen.delay' is not documented in 
nutch-default.xml, and does not seem to be used outside the Generator. What is 
it supposed to be used for? 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-762.
---

Resolution: Fixed

Committed revision 926155

Have reverted the prefix for params to 'generate.' + added description of 
crawl.gen.delay in nutch-default.xml + added warning when the user specifies 
generate.max.per.host.by.ip + param generate.max.per.host is now supported

Thanks Andrzej for reviewing it 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-740:


Attachment: NUTCH-740.patch

Slightly modified version of the patch with modifs for protocol-http.
will commit shortly

 Configuration option to override default language for fetched pages.
 

 Key: NUTCH-740
 URL: https://issues.apache.org/jira/browse/NUTCH-740
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Marcin Okraszewski
Priority: Minor
 Fix For: 1.1

 Attachments: AcceptLanguage.patch, 
 AcceptLanguage_trunk_2009-06-09.patch, NUTCH-740.patch


 By default the Accept-Language HTTP request header is set to English. 
 Unfortunately this value is hard-coded and there seems to be no way to 
 override it. As a result you may index the English version of pages even 
 though you would prefer them in a different language. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846910#action_12846910
 ] 

Julien Nioche commented on NUTCH-762:
-

OK, there was indeed an assumption that the generator would not need to be 
called again before an update.  Am happy to add back generate.update.crawldb. 

Note that this version of the Generator also differs from the original version 
in that 

{quote}
*IP resolution is done ONLY on the entries which have been selected for 
fetching (during the partitioning). Running the IP resolution on the whole 
crawlDb is too slow to be usable on a large scale
*can max the number of URLs per host or domain (but not by IP)
{quote}

We could allow more flexibility by counting per IP, again at the expense of 
performance. Not sure it is very useful in practice though. Since the way we 
count the URLs is now decoupled from the way we partition them, we can have a 
hybrid approach, e.g. count per domain THEN partition by IP. 

Any thoughts on whether or not we should reintroduce the counting per IP?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846930#action_12846930
 ] 

Julien Nioche commented on NUTCH-762:
-

Yes, I came across that situation too on a large crawl where a single machine 
was used to host a whole range of unrelated domain names (needless to say the 
host of the domains was not very pleased). We can now handle such cases quite 
simply by partitioning by IP (and counting by domain).

I will have a look at reintroducing *generate.update.crawldb* tomorrow.



 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2010-03-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-469:


Fix Version/s: (was: 1.1)

There have not been any changes to this issue since February 09 and it won't be 
included in 1.1
Marking it as 'fix version : unknown' 

 changes to geoPosition plugin to make it work on nutch 0.9
 --

 Key: NUTCH-469
 URL: https://issues.apache.org/jira/browse/NUTCH-469
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Mike Schwartz
 Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, 
 NUTCH-469-2007-05-09.txt.gz


 I have modified the geoPosition plugin 
 (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9.  (The 
 code was built originally using nutch 0.7.)  I'd like to contribute my 
 changes back to the nutch project.  I already communicated with the code's 
 author (Matthias Jaekle), and he agrees with my mods.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845886#action_12845886
 ] 

Julien Nioche commented on NUTCH-740:
-

A nice contribution, but shouldn't this be applied to the *protocol-http* 
plugin as well, e.g. in HttpResponse?

 Configuration option to override default language for fetched pages.
 

 Key: NUTCH-740
 URL: https://issues.apache.org/jira/browse/NUTCH-740
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Marcin Okraszewski
Assignee: Otis Gospodnetic
Priority: Minor
 Fix For: 1.1

 Attachments: AcceptLanguage.patch, 
 AcceptLanguage_trunk_2009-06-09.patch


 By default the Accept-Language HTTP request header is set to English. 
 Unfortunately this value is hard-coded and there seems to be no way to 
 override it. As a result you may index the English version of pages even 
 though you would prefer them in a different language. 
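A minimal sketch of the proposed behaviour, assuming a hypothetical configuration key `http.accept.language` (the actual patch may use a different name): read the header value from configuration and fall back to the current hard-coded English default.

```java
import java.util.Properties;

// Sketch only: a configurable Accept-Language value with an English fallback.
// The property key "http.accept.language" is an assumption for illustration.
public class AcceptLanguageExample {
    static String acceptLanguage(Properties conf) {
        // fall back to a hard-coded English default when the key is unset
        return conf.getProperty("http.accept.language",
                                "en-us,en-gb,en;q=0.7,*;q=0.3");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(acceptLanguage(conf));              // default: English
        conf.setProperty("http.accept.language", "fi-fi,fi;q=0.8");
        System.out.println(acceptLanguage(conf));              // overridden
    }
}
```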

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846141#action_12846141
 ] 

Julien Nioche commented on NUTCH-762:
-

If I am not mistaken the point of having  _generate.update.crawldb_ was to 
mark the URLs put in a fetchlist in order to be able to do another round of 
generation. This is not necessary now as we can generate several segments 
without writing a new crawldb.
Am I missing something?  

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-762-v2.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate - update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one parse of the crawlDB then update the Db only once 
 on several segments. The existing Generator allows several successive runs by 
 generating a copy of the crawlDB and marking the URLs to be fetched. In 
 practice this approach does not work well as we need to read the whole 
 crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator  which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for  
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDb is too slow to be usable on a large scale
 * can cap the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for capping the URLs and
 for partitioning; however, as we can't count the max number of URLs by IP,
 another unit must be chosen while partitioning by IP.
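The host/domain partitioning described above can be sketched as follows; this is an illustrative stand-in, not Nutch's actual URLPartitioner:

```java
import java.net.URL;

// Sketch: partition URLs by host so that all URLs of one host land in the
// same fetch task. Uses the usual Hadoop partitioner formula
// (non-negative hash modulo the number of partitions).
public class PartitionSketch {
    static int partitionByHost(String url, int numPartitions) throws Exception {
        String host = new URL(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) throws Exception {
        int a = partitionByHost("http://example.com/page1", 4);
        int b = partitionByHost("http://example.com/page2", 4);
        System.out.println(a == b);   // same host -> same partition
    }
}
```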
 We found that using a filter on the score can dramatically improve the 
 performance as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via : nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options :
 MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers 
 numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
 where most parameters are similar to the default Generator - apart from : 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than
 the max value set, e.g. if not enough URLs are available for fetching and they
 fit in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-710) Support for rel=canonical attribute

2010-03-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-710:


Fix Version/s: (was: 1.1)

Great idea. It won't be included in 1.1 though, so moving to *fix: unknown*


 Support for rel=canonical attribute
 -

 Key: NUTCH-710
 URL: https://issues.apache.org/jira/browse/NUTCH-710
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.1
Reporter: Frank McCown
Priority: Minor

 There is a new rel=canonical attribute which is
 now supported by Google, Yahoo, and Live:
 http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
 Adding support for this attribute value will potentially reduce the number of 
 URLs crawled and indexed and reduce duplicate page content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2010-03-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-692.
-

   Resolution: Cannot Reproduce
Fix Version/s: 1.1

I cannot reproduce the issue since we moved to Hadoop 0.20, which is good 
news

 AlreadyBeingCreatedException with Hadoop 0.19
 -

 Key: NUTCH-692
 URL: https://issues.apache.org/jira/browse/NUTCH-692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-692.patch


 I have been using the SVN version of Nutch on an EC2 cluster and got some 
 AlreadyBeingCreatedException during the reduce phase of a parse. For some 
 reason one of my tasks crashed and then I ran into this 
 AlreadyBeingCreatedException when other nodes tried to pick it up.
 There was recently a discussion on the Hadoop user list on similar issues 
 with Hadoop 0.19 (see 
 http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried 
 using 0.18.2 yet but will do if the problems persist with 0.19
 I was wondering whether anyone else had experienced the same problem. Do you 
 think 0.19 is stable enough to use it for Nutch 1.0?
 I will be running a crawl on a super large cluster in the next couple of 
 weeks and I will confirm this issue  
 J.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-798) Upgrade to SOLR1.4

2010-03-11 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-798.
-

Resolution: Fixed

Updated SOLRJ's dependencies at the same time : 

Deleting   lib/apache-solr-common-1.3.0.jar
Adding  (bin)  lib/apache-solr-core-1.4.0.jar
Deleting   lib/apache-solr-solrj-1.3.0.jar
Adding  (bin)  lib/apache-solr-solrj-1.4.0.jar
Deleting   lib/commons-httpclient-3.0.1.jar
Adding  (bin)  lib/commons-httpclient-3.1.jar
Adding  (bin)  lib/commons-io-1.4.jar
Adding  (bin)  lib/geronimo-stax-api_1.0_spec-1.0.1.jar
Adding  (bin)  lib/jcl-over-slf4j-1.5.5.jar
Deleting   lib/slf4j-api-1.4.3.jar
Adding  (bin)  lib/slf4j-api-1.5.5.jar
Adding  (bin)  lib/wstx-asl-3.2.7.jar

Committed revision 921831

 Upgrade to SOLR1.4
 --

 Key: NUTCH-798
 URL: https://issues.apache.org/jira/browse/NUTCH-798
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1


 In particular, SOLR1.4 has a StreamingUpdateSolrServer which would simplify 
 the way we buffer the docs before sending them to the SOLR instance 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-801) Remove RTF and MP3 parse plugins

2010-03-11 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-801.
-

Resolution: Fixed

Committed revision 921840.


 Remove RTF and MP3 parse plugins
 

 Key: NUTCH-801
 URL: https://issues.apache.org/jira/browse/NUTCH-801
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Fix For: 1.1


 *Parse-rtf* and *parse-mp3* are not built by default  due to licensing 
 issues. Since we now have *parse-tika* to handle these formats I would be in 
 favour of removing these 2 plugins altogether to keep things nice and simple. 
 The other plugins will probably be phased out only after the release of 1.1  
 when parse-tika will have been tested a lot more.
 Any reasons not to?
 Julien

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Attachment: (was: NUTCH-762-MultiGenerator.patch)

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-762-v2.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate - update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one parse of the crawlDB then update the Db only once 
 on several segments. The existing Generator allows several successive runs by 
 generating a copy of the crawlDB and marking the URLs to be fetched. In 
 practice this approach does not work well as we need to read the whole 
 crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator  which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for  
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDb is too slow to be usable on a large scale
 * can cap the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for capping the URLs and
 for partitioning; however, as we can't count the max number of URLs by IP,
 another unit must be chosen while partitioning by IP.
 We found that using a filter on the score can dramatically improve the 
 performance as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via : nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options :
 MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers 
 numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
 where most parameters are similar to the default Generator - apart from : 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than
 the max value set, e.g. if not enough URLs are available for fetching and they
 fit in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Attachment: NUTCH-762-v2.patch

Improved version of the patch : 

- fixed a few minor bugs
- renamed Generator into OldGenerator
- renamed MultiGenerator into Generator
- fixed test classes to use new Generator
- documented parameters in nutch-default.xml
- added names of segments to the LOG to facilitate integration in scripts
- PartitionUrlByHost is replaced by URLPartitioner which is more generic

I decided to keep the old version for the time being but we might as well get 
rid of it altogether. The new version is now used in the Crawl class. 

Would be nice if people could give it a good try before we put it in 1.1

Thanks

Julien 

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-762-v2.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations 
 related to the crawlDB (generate - update) tend to take the biggest part of 
 the time. One solution is to limit such operations to a minimum by generating 
 several fetchlists in one parse of the crawlDB then update the Db only once 
 on several segments. The existing Generator allows several successive runs by 
 generating a copy of the crawlDB and marking the URLs to be fetched. In 
 practice this approach does not work well as we need to read the whole 
 crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator  which can 
 generate several fetchlists by reading the crawlDB only once. The 
 MultiGenerator differs from the Generator in other aspects: 
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for  
 fetching (during the partitioning). Running the IP resolution on the whole 
 crawlDb is too slow to be usable on a large scale
 * can cap the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for capping the URLs and
 for partitioning; however, as we can't count the max number of URLs by IP,
 another unit must be chosen while partitioning by IP.
 We found that using a filter on the score can dramatically improve the 
 performance as this reduces the amount of data being sent to the reducers.
 The MultiGenerator is called via : nutch 
 org.apache.nutch.crawl.MultiGenerator ...
 with the following options :
 MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers 
 numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
 where most parameters are similar to the default Generator - apart from : 
 -noNorm (explicit)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than
 the max value set, e.g. if not enough URLs are available for fetching and they
 fit in fewer segments
 Please give it a try and let me know what you think of it
 Julien Nioche
 http://www.digitalpebble.com
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-01 Thread Julien Nioche (JIRA)
SOLRIndexer to commit once all reducers have finished
-

 Key: NUTCH-799
 URL: https://issues.apache.org/jira/browse/NUTCH-799
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1


What about doing only one SOLR commit after the MR job has finished in 
SOLRIndexer instead of doing that at the end of every Reducer? 
I ran into timeout exceptions in some of my reducers and I suspect that this 
was due to the fact that other reducers had already finished and called commit. 
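The proposed change can be sketched like this, with a hypothetical `SolrClient` interface standing in for SolrJ's server class: reducers only add documents, and the driver issues a single commit once the whole job has finished.

```java
// Hypothetical SolrClient interface; the point is where commit() is
// called, not the client API itself.
interface SolrClient {
    void add(String doc);
    void commit();
}

class SolrIndexSketch {
    // Each inner array plays the role of one reducer's output.
    static void run(SolrClient solr, String[][] reducerOutputs) {
        for (String[] docs : reducerOutputs) {
            for (String d : docs) {
                solr.add(d);      // reducers add but never commit
            }
        }
        solr.commit();            // single commit after the whole job
    }
}
```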

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-799:


Attachment: NUTCH-799.patch

 SOLRIndexer to commit once all reducers have finished
 -

 Key: NUTCH-799
 URL: https://issues.apache.org/jira/browse/NUTCH-799
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-799.patch


 What about doing only one SOLR commit after the MR job has finished in 
 SOLRIndexer instead of doing that at the end of every Reducer? 
 I ran into timeout exceptions in some of my reducers and I suspect that this 
 was due to the fact that other reducers had already finished and called 
 commit. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-782) Ability to order htmlparsefilters

2010-03-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-782.
---

Resolution: Fixed

Committed revision 917557

 Ability to order htmlparsefilters
 -

 Key: NUTCH-782
 URL: https://issues.apache.org/jira/browse/NUTCH-782
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-782.patch


 Patch which adds a new parameter 'htmlparsefilter.order' which specifies the 
 order in which HTMLParse filters are applied. HTMLParse filter ordering MAY 
 have an impact on end result, as some filters could rely on the metadata 
 generated by a previous filter.
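A configuration sketch of the new parameter (the property name comes from the patch; the value format and the filter class names below are assumptions for illustration):

```xml
<!-- nutch-site.xml: apply HTMLParse filters in an explicit order.
     The two class names below are placeholders, not real Nutch filters. -->
<property>
  <name>htmlparsefilter.order</name>
  <value>org.example.MetadataExtractingFilter org.example.FilterUsingThatMetadata</value>
  <description>Space-separated list defining the order in which
  HTMLParse filters are applied.</description>
</property>
```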

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-798) Upgrade to SOLR1.4

2010-02-26 Thread Julien Nioche (JIRA)
Upgrade to SOLR1.4
--

 Key: NUTCH-798
 URL: https://issues.apache.org/jira/browse/NUTCH-798
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1


In particular, SOLR1.4 has a StreamingUpdateSolrServer which would simplify the 
way we buffer the docs before sending them to the SOLR instance 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-719.
-

   Resolution: Fixed
Fix Version/s: 1.1

Committed revision 911905.
Thanks to S. Dennis for investigating the issue + R. Schwab for testing it 

 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 I had a look at the logs generated by Fetcher2 and found cases where there 
 were no active fetchQueues but fetchQueues.totalSize was != 0
 fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, 
 fetchQueues.totalSize=1, fetchQueues=0
 since the code relies on fetchQueues.totalSize to determine whether the work 
 is finished or not, the task is blocked until the abort mechanism kicks in
 2009-03-12 09:27:38,977 WARN  fetcher.Fetcher2 - Aborting with 200 hung 
 threads.
 could that be a synchronisation issue? any ideas?
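One way such a stale non-zero total can arise is a shared size counter updated without synchronisation. A minimal sketch of the safe pattern (an assumption about the fix's direction, not the committed NUTCH-719 code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: keep the aggregate queue size in an AtomicInteger so that
// concurrent fetcher threads always see a consistent total and the
// "work finished" check cannot observe a stale non-zero value.
class QueueTotal {
    private final AtomicInteger totalSize = new AtomicInteger();

    void itemAdded()   { totalSize.incrementAndGet(); }
    void itemRemoved() { totalSize.decrementAndGet(); }

    // Fetcher2 stops looping when the total reaches zero.
    boolean workFinished() { return totalSize.get() == 0; }
}
```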

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-719.
---


 fetchQueues.totalSize incorrect in Fetcher2
 ---

 Key: NUTCH-719
 URL: https://issues.apache.org/jira/browse/NUTCH-719
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 I had a look at the logs generated by Fetcher2 and found cases where there 
 were no active fetchQueues but fetchQueues.totalSize was != 0
 fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, 
 fetchQueues.totalSize=1, fetchQueues=0
 since the code relies on fetchQueues.totalSize to determine whether the work 
 is finished or not, the task is blocked until the abort mechanism kicks in
 2009-03-12 09:27:38,977 WARN  fetcher.Fetcher2 - Aborting with 200 hung 
 threads.
 could that be a synchronisation issue? any ideas?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-705) parse-rtf plugin

2010-02-18 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-705.
-

Resolution: Fixed

RTF parsing is now handled by the TikaPlugin (NUTCH-766). Please open an issue 
on Tika if the original problem with non-ASCII chars still occurs

 parse-rtf plugin
 

 Key: NUTCH-705
 URL: https://issues.apache.org/jira/browse/NUTCH-705
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
Priority: Minor
 Fix For: 1.1

 Attachments: NUTCH-705.patch


 Demoting this issue and moving to 1.1 - current patch is not suitable due to 
 LGPL licensed parts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-644) RTF parser doesn't compile anymore

2010-02-18 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-644.
-

Resolution: Fixed

RTF parsing is now handled by the TikaPlugin (NUTCH-766) which solves the issue 
of licensing.

 RTF parser doesn't compile anymore
 --

 Key: NUTCH-644
 URL: https://issues.apache.org/jira/browse/NUTCH-644
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Guillaume Smet
 Attachments: NUTCH-644_v2.patch, NUTCH-644_v3.patch, 
 RTFParseFactory.java-compilation_issues.diff


 Due to API changes, the RTF parser (which is not compiled by default due to 
 licensing problems) doesn't compile anymore.
 The build.xml script doesn't work anymore either, as 
 http://www.cobase.cs.ucla.edu/pub/javacc/rtf_parser_src.jar no longer exists 
 (404). I didn't fix build.xml, as I don't know where we want to get the jar 
 file from; I only fixed the compilation issues.
 Regards,
 -- 
 Guillaume

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-794) Tika parser does not keep attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)
Tika parser does not keep attributes on html tag


 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


The following HTML document : 

<html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>

is rendered as the following xhtml by Tika : 

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>

with the lang attribute getting lost. 

I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
tests don't break anymore 
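The metadata fallback this issue leads to can be sketched as follows; the method and metadata key names are illustrative, not Nutch's actual API:

```java
import java.util.Map;

// Sketch: prefer an explicit lang attribute when the parser surfaced one,
// otherwise fall back to a language hint stored in the parse metadata
// (e.g. a Content-Language header). Names are illustrative only.
class LanguageLookup {
    static String resolveLanguage(String htmlLangAttr, Map<String, String> parseMeta) {
        if (htmlLangAttr != null && !htmlLangAttr.isEmpty()) {
            return htmlLangAttr;
        }
        return parseMeta.get("Content-Language");
    }
}
```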

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-794) Tika parser does not identify lang attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-794:


Description: 
The following HTML document : 

<html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>

is rendered as the following xhtml by Tika : 

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>

with the lang attribute getting lost.  The lang is not stored in the metadata 
either.

I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
tests don't break anymore 

  was:
The following HTML document : 

<html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>

is rendered as the following xhtml by Tika : 

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>

with the lang attribute getting lost. 

I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
tests don't break anymore 

Summary: Tika parser does not identify lang attributes on html tag  (was: 
Tika parser does not keep attributes on html tag)

 Tika parser does not identify lang attributes on html tag
 -

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 The following HTML document : 
 <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
 is rendered as the following xhtml by Tika: 
 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
 with the lang attribute getting lost.  The lang is not stored in the metadata 
 either.
 I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
 tests don't break anymore 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-794) Tika parser does not identify lang attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-794:


Attachment: NUTCH-794.patch

 Tika parser does not identify lang attributes on html tag
 -

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch


 The following HTML document : 
 <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
 is rendered as the following xhtml by Tika: 
 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
 with the lang attribute getting lost.  The lang is not stored in the metadata 
 either.
 I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
 tests don't break anymore 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-794) Language Identification must check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12834147#action_12834147
 ] 

Julien Nioche commented on NUTCH-794:
-

Committed patch in revision 910454

Waiting for issue to be fixed in Tika before closing this issue

 Language Identification must check the parse metadata for language values 
 --

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch


 The following HTML document : 
 <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
 is rendered as the following xhtml by Tika: 
 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
 with the lang attribute getting lost.  The lang is not stored in the metadata 
 either.
 I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
 tests don't break anymore 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-794) Language Identification must check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-794:


Summary: Language Identification must check the parse metadata for 
language values   (was: Tika parser does not identify lang attributes on html tag)

 Language Identification must check the parse metadata for language values 
 --

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch


 The following HTML document : 
 <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
 is rendered as the following xhtml by Tika: 
 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
 with the lang attribute getting lost.  The lang is not stored in the metadata 
 either.
 I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
 tests don't break anymore 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (NUTCH-794) Language Identification must check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-794 started by Julien Nioche.

 Language Identification must use check the parse metadata for language values 
 --

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch






[jira] Updated: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-794:


Component/s: parser

 Language Identification must use check the parse metadata for language values 
 --

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch






[jira] Updated: (NUTCH-782) Ability to order htmlparsefilters

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-782:


Component/s: parser

 Ability to order htmlparsefilters
 -

 Key: NUTCH-782
 URL: https://issues.apache.org/jira/browse/NUTCH-782
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-782.patch


 Patch which adds a new parameter 'htmlparsefilter.order' which specifies the 
 order in which HTMLParse filters are applied. HTMLParse filter ordering MAY 
 have an impact on end result, as some filters could rely on the metadata 
 generated by a previous filter.




[jira] Closed: (NUTCH-766) Tika parser

2010-02-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-766.
---


Have added a small improvement in revision 910187 (prioritise the default Tika 
parser when discovering plugins matching a mime-type).
Thanks to Chris for testing and committing it, and to Andrzej and Sami for 
their comments and suggestions.

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
 sample.tar.gz, TikaParser.java


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Since the core uses Tika only for its MimeType functionality, we only need to 
 put tika-core at the main lib level, whereas the tika plugin obviously needs 
 tika-parsers.jar plus all the jars used internally by Tika.
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling, which also 
 means that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore, as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
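The reuse of the HTML utilities on Tika's XHTML output can be illustrated with plain JAXP: once the SAX events have been materialised as a DOM, link detection is ordinary tree traversal. This is a standalone sketch, not the plugin's actual code:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class LinkExtractSketch {
    // Parse an XHTML string into a DOM and collect the href of every <a> element.
    static List<String> links(String xhtml) {
        List<String> out = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                out.add(((Element) anchors.item(i)).getAttribute("href"));
            }
        } catch (Exception e) {
            // malformed input: return whatever was collected so far
        }
        return out;
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"overview-summary.html\">overview</a></body></html>";
        System.out.println(links(xhtml)); // [overview-summary.html]
    }
}
```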
 The following libraries are required in the lib/ directory of the 
 tika-parser: 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 http://www.digitalpebble.com




[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832454#action_12832454
 ] 

Julien Nioche commented on NUTCH-766:
-

@Chris : I just did a fresh checkout from svn, applied the patch v3, unzipped 
sample.tar.gz into the parse-tika directory and ran the test just as you did, 
but could not reproduce the problem. Could there be a difference between your 
version and the trunk?

@Sami :  

{quote} was there a reason not to use AutoDetect parser?  {quote} 
I suppose we could, as long as we give it a clue about the MimeType obtained 
from the Content. As you pointed out, there could be duplication with the 
detection done by Mime-Util. I suppose one way to do it would be to add a new 
version of the method getParse(Content content, MimeType type). That's an 
interesting point.

{quote} Also was there a reason not to parse html with tika? {quote} 
It is supposed to do so; if it does not, then it's a bug which needs urgent 
fixing.

Regarding parsing package formats, I think the plan is that Tika will handle 
that in the future, but we could try to do that now if we find a relatively 
clean mechanism for doing so. BTW, could you please send a diff and not the 
full code of the class you posted earlier? That would make the comparison much 
easier.




 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
 sample.tar.gz, TikaParser.java



[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832564#action_12832564
 ] 

Julien Nioche commented on NUTCH-766:
-

I had a closer look at the HTML parsing issue. What happens is that the 
association between the mime-type and the parser implementation is not 
explicitly set in parse-plugins.xml, so the ParserFactory goes through all the 
plugins and gets the ones with a matching mimetype (or * for Tika). The Tika 
parser takes no precedence over the default HTML parser; the latter comes 
first in the list and is used for parsing.

Of course that does not happen if parse-html is not specified in 
plugin.includes or if an explicit mapping is set in parse-plugins.xml. I don't 
think we want to have to specify explicitly that Tika should be used in all 
the mappings, and instead reserve mappings for cases where another parser must 
be used instead of Tika.

What we could do though is that, in the cases where no explicit mapping is set 
for a mimetype, Tika (or any parser marked as supporting any mimetype) will be 
put first in the list of discovered parsers, so it would remain the default 
choice unless an explicit mapping is set (even if another plugin is loaded and 
can handle the type).

Makes sense?
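The selection rule described above can be sketched in a minimal, self-contained form; the plugin IDs and the {id, contentType} pairs are a simplification of Nutch's Extension objects, not the real API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParserOrderSketch {
    // Order candidate parser plugins for a content type: exact matches keep
    // their discovery order, wildcard ("*") plugins are pushed to the front
    // so they win whenever no explicit mapping exists.
    static List<String> order(List<String[]> plugins, String contentType) {
        List<String> selected = new ArrayList<>();
        for (String[] p : plugins) {          // p = {pluginId, supportedType}
            if (contentType.equals(p[1])) {
                selected.add(p[0]);           // matching mimetype: append
            } else if ("*".equals(p[1])) {
                selected.add(0, p[0]);        // wildcard: takes priority
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String[]> plugins = Arrays.asList(
                new String[] {"parse-html", "text/html"},
                new String[] {"parse-tika", "*"});
        System.out.println(order(plugins, "text/html")); // [parse-tika, parse-html]
    }
}
```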



 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
 sample.tar.gz, TikaParser.java



[jira] Issue Comment Edited: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832564#action_12832564
 ] 

Julien Nioche edited comment on NUTCH-766 at 2/11/10 5:22 PM:
--

I had a closer look at the HTML parsing issue. What happens is that the 
association between the mime-type and the parser implementation is not 
explicitly set in parse-plugins.xml, so the ParserFactory goes through all the 
plugins and gets the ones with a matching mimetype (or * for Tika). The Tika 
parser takes no precedence over the default HTML parser; the latter comes 
first in the list and is used for parsing.

Of course that does not happen if parse-html is not specified in 
plugin.includes or if an explicit mapping is set in parse-plugins.xml. I don't 
think we want to have to specify explicitly that Tika should be used in all 
the mappings, and instead reserve mappings for cases where another parser must 
be used instead of Tika.

What we could do though is that, in the cases where no explicit mapping is set 
for a mimetype, Tika (or any parser marked as supporting any mimetype) will be 
put first in the list of discovered parsers, so it would remain the default 
choice unless an explicit mapping is set (even if another plugin is loaded and 
can handle the type).

Makes sense?

The ParserFactory section of the patch v3 can be replaced by:

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===================================================================
--- src/java/org/apache/nutch/parse/ParserFactory.java  (revision 909059)
+++ src/java/org/apache/nutch/parse/ParserFactory.java  (working copy)
@@ -348,11 +348,23 @@
                 contentType)) {
           extList.add(extensions[i]);
         }
+        else if ("*".equals(extensions[i].getAttribute("contentType"))) {
+          // default plugins get the priority
+          extList.add(0, extensions[i]);
+        }
       }
       
       if (extList.size() > 0) {
         if (LOG.isInfoEnabled()) {
-          LOG.info("The parsing plugins: " + extList +
+          StringBuffer extensionsIDs = new StringBuffer("[");
+          boolean isFirst = true;
+          for (Extension ext : extList) {
+            if (!isFirst) extensionsIDs.append(" - ");
+            else isFirst = false;
+            extensionsIDs.append(ext.getId());
+          }
+          extensionsIDs.append("]");
+          LOG.info("The parsing plugins: " + extensionsIDs.toString() +
             " are enabled via the plugin.includes system " +
             "property, and all claim to support the content type " +
             contentType + ", but they are not mapped to it in the " +
@@ -369,7 +381,7 @@
 
   private boolean match(Extension extension, String id, String type) {
     return ((id.equals(extension.getId())) &&
-            (type.equals(extension.getAttribute("contentType")) ||
+            (type.equals(extension.getAttribute("contentType")) ||
             extension.getAttribute("contentType").equals("*") ||
             type.equals(DEFAULT_PLUGIN)));
   }



  was (Author: jnioche):
I had a closer look at the HTML parsing issue. What happens is that the 
association between the mime-type and the parser implementation is not 
explicitly set in parse-plugins.xml, so the ParserFactory goes through all the 
plugins and gets the ones with a matching mimetype (or * for Tika). The Tika 
parser takes no precedence over the default HTML parser; the latter comes 
first in the list and is used for parsing.

Of course that does not happen if parse-html is not specified in 
plugin.includes or if an explicit mapping is set in parse-plugins.xml. I don't 
think we want to have to specify explicitly that Tika should be used in all 
the mappings, and instead reserve mappings for cases where another parser must 
be used instead of Tika.

What we could do though is that, in the cases where no explicit mapping is set 
for a mimetype, Tika (or any parser marked as supporting any mimetype) will be 
put first in the list of discovered parsers, so it would remain the default 
choice unless an explicit mapping is set (even if another plugin is loaded and 
can handle the type).

Makes sense?


  
 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
 sample.tar.gz, TikaParser.java



[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832583#action_12832583
 ] 

Julien Nioche commented on NUTCH-766:
-

@Chris : did you run 

ant -f src/plugin/parse-tika/build-ivy.xml 

between steps 5 and 6? This is required in order to populate the lib directory 
automatically.

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
 sample.tar.gz, TikaParser.java






[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-10 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-787:


Fix Version/s: 1.1

 Upgrade Lucene to 3.0.0.
 

 Key: NUTCH-787
 URL: https://issues.apache.org/jira/browse/NUTCH-787
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Dawid Weiss
Priority: Trivial
 Fix For: 1.1

 Attachments: NUTCH-787.patch







[jira] Created: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)
Better list of suffix domains
-

 Key: NUTCH-786
 URL: https://issues.apache.org/jira/browse/NUTCH-786
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


Small improvement to the content of domain-suffixes.xml: added compound TLDs 
for .ar, .co, .id, .il, .mx, .nz and .za




[jira] Updated: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-786:


Attachment: NUTCH-786.patch

Small improvement to the content of domain-suffixes.xml: added compound TLDs 
for .ar, .co, .id, .il, .mx, .nz and .za

 Better list of suffix domains
 -

 Key: NUTCH-786
 URL: https://issues.apache.org/jira/browse/NUTCH-786
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-786.patch


 Small improvement to the content of domain-suffixes.xml : added compound TLD 
 for .ar, .co, .id, .il, .mx, .nz and .za




[jira] Closed: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-786.
---

Resolution: Fixed

Committed revision 906907

 Better list of suffix domains
 -

 Key: NUTCH-786
 URL: https://issues.apache.org/jira/browse/NUTCH-786
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-786.patch


 Small improvement to the content of domain-suffixes.xml : added compound TLD 
 for .ar, .co, .id, .il, .mx, .nz and .za




[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-02 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828548#action_12828548
 ] 

Julien Nioche commented on NUTCH-781:
-

 did you forget to update conf/tika-mimetypes.xml ?
indeed - well spotted, thanks

 Related question: do we actually need our own version of the tika config 
 anymore? I saw there were some old issues that were fixed in the custom 
 version but I would guess those changes, if important, have already made 
 their way into Tika?
the version we had was the same as the one provided by Tika 0.4, so I suppose 
we could safely rely on the Tika defaults. MimeUtil currently requires 
tika-mimetypes.xml to be available in the classpath, but we could modify that 
so that it uses the default version from the Tika jar if nothing can be found 
in conf. Let's put that in a separate JIRA issue if we really want it; in the 
meantime I'll commit the v0.6 of tika-mimetypes.xml.

J.
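Such a classpath fallback could look roughly like the sketch below; the resource names, and in particular the path inside the Tika jar, are assumptions:

```java
import java.io.InputStream;

public class MimeConfigSketch {
    // Look for a local override (e.g. a copy from conf/) on the classpath
    // first, then fall back to the copy assumed to be bundled in the Tika jar.
    static InputStream openMimeTypes() {
        InputStream in = MimeConfigSketch.class.getClassLoader()
                .getResourceAsStream("tika-mimetypes.xml");
        if (in == null) {
            in = MimeConfigSketch.class.getResourceAsStream(
                    "/org/apache/tika/mime/tika-mimetypes.xml"); // assumed path
        }
        return in;
    }

    public static void main(String[] args) {
        System.out.println(openMimeTypes() != null
                ? "found tika-mimetypes.xml"
                : "no tika-mimetypes.xml on classpath");
    }
}
```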


 Update Tika to v0.6  for the MimeType detection
 ---

 Key: NUTCH-781
 URL: https://issues.apache.org/jira/browse/NUTCH-781
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 [from announcement]
 Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
 extracting metadata and structured text content from various documents using
 existing parser libraries.
 Apache Tika 0.6 contains a number of improvements and bug fixes. Details can
 be found in the changes file:
 http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt




[jira] Created: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Julien Nioche (JIRA)
Update Tika to v0.6  for the MimeType detection
---

 Key: NUTCH-781
 URL: https://issues.apache.org/jira/browse/NUTCH-781
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


[from announcement]

Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
extracting metadata and structured text content from various documents using
existing parser libraries.

Apache Tika 0.6 contains a number of improvements and bug fixes. Details can
be found in the changes file:

http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-781.
-

Resolution: Fixed

Committed revision 905228

 Update Tika to v0.6  for the MimeType detection
 ---

 Key: NUTCH-781
 URL: https://issues.apache.org/jira/browse/NUTCH-781
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 [from the announcement]
 Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
 extracting metadata and structured text content from various documents using
 existing parser libraries.
 Apache Tika 0.6 contains a number of improvements and bug fixes. Details can
 be found in the changes file:
 http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-781.
---


 Update Tika to v0.6  for the MimeType detection
 ---

 Key: NUTCH-781
 URL: https://issues.apache.org/jira/browse/NUTCH-781
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 [from the announcement]
 Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
 extracting metadata and structured text content from various documents using
 existing parser libraries.
 Apache Tika 0.6 contains a number of improvements and bug fixes. Details can
 be found in the changes file:
 http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-766:


Attachment: NUTCH-766-v3.patch

Updated version of the plugin: uses Tika 0.6

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins, which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress; your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers); in the work described here we decided 
 to put the libs in 2 different places:
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Since Tika is used by the core only for its MimeType functionality, we only 
 need to put tika-core at the main lib level, whereas the tika plugin obviously 
 needs tika-parsers.jar plus all the jars used internally by Tika.
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one mime-type, which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling; this also 
 means that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore, as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the 
 tika-parser: 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and, if 
 so, to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 http://www.digitalpebble.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-766:


Attachment: (was: Nutch-766.ParserFactory.patch)

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-766:


Attachment: (was: NUTCH-766.tika.patch)

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-782) Ability to order htmlparsefilters

2010-02-01 Thread Julien Nioche (JIRA)
Ability to order htmlparsefilters
-

 Key: NUTCH-782
 URL: https://issues.apache.org/jira/browse/NUTCH-782
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1
 Attachments: NUTCH-782.patch

Patch which adds a new parameter 'htmlparsefilter.order' specifying the 
order in which HTMLParse filters are applied. The ordering may have an 
impact on the end result, as some filters can rely on the metadata 
generated by a previous filter.
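The reordering described above could look roughly like this. This is an illustrative sketch only (the helper name and the exact semantics for filters not listed in the property are assumptions, not the patch's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class FilterOrder {

    // Hypothetical sketch: reorder filter ids so that those listed in a
    // whitespace-separated order property (like 'htmlparsefilter.order')
    // come first, in that order; ids not mentioned keep their original
    // relative order, and unknown names in the property are ignored.
    public static List<String> order(List<String> filters, String orderProp) {
        List<String> result = new ArrayList<>();
        for (String name : orderProp.trim().split("\\s+")) {
            if (filters.contains(name) && !result.contains(name)) {
                result.add(name);
            }
        }
        for (String name : filters) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

So a filter that produces metadata consumed by another can simply be named earlier in the property value.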



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-782) Ability to order htmlparsefilters

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-782:


Attachment: NUTCH-782.patch

 Ability to order htmlparsefilters
 -

 Key: NUTCH-782
 URL: https://issues.apache.org/jira/browse/NUTCH-782
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-782.patch


 Patch which adds a new parameter 'htmlparsefilter.order' which specifies the 
 order in which HTMLParse filters are applied. HTMLParse filter ordering MAY 
 have an impact on end result, as some filters could rely on the metadata 
 generated by a previous filter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-783) IndexerChecker Utility

2010-02-01 Thread Julien Nioche (JIRA)
IndexerChecker Utility
-

 Key: NUTCH-783
 URL: https://issues.apache.org/jira/browse/NUTCH-783
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1


This patch contains a new utility for checking the configuration of 
the indexing filters. The IndexerChecker reads and parses a URL, runs the 
indexing filters on it, and displays the fields obtained and the first 
100 characters of their values.

It can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker 
http://www.lemonde.fr/
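The field display with 100-character truncation can be sketched as follows (a hypothetical helper, not the actual IndexerChecker code):

```java
public class FieldDisplay {

    // Hypothetical sketch: render an index field as "name : value",
    // keeping only the first 100 characters of the value, as the
    // IndexerChecker output described above does.
    public static String render(String name, String value) {
        String shown = value.length() > 100 ? value.substring(0, 100) : value;
        return name + " : " + shown;
    }
}
```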



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-783) IndexerChecker Utility

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-783:
---

Assignee: Julien Nioche

 IndexerChecker Utility
 -

 Key: NUTCH-783
 URL: https://issues.apache.org/jira/browse/NUTCH-783
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-783.patch


 This patch contains a new utility for checking the configuration of 
 the indexing filters. The IndexerChecker reads and parses a URL, runs the 
 indexing filters on it, and displays the fields obtained and the first 
 100 characters of their values.
 It can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker 
 http://www.lemonde.fr/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-783) IndexerChecker Utility

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-783:


Attachment: NUTCH-783.patch

 IndexerChecker Utility
 -

 Key: NUTCH-783
 URL: https://issues.apache.org/jira/browse/NUTCH-783
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-783.patch


 This patch contains a new utility for checking the configuration of 
 the indexing filters. The IndexerChecker reads and parses a URL, runs the 
 indexing filters on it, and displays the fields obtained and the first 
 100 characters of their values.
 It can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker 
 http://www.lemonde.fr/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-779:
---

Assignee: Julien Nioche

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-779


 The attached patch allows parse metadata to be passed to the corresponding 
 entry of the crawldb.  
 Comments are welcome

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-779:


Attachment: NUTCH-779-v2.patch

Improved version of the patch. Followed AB's recommendations and renamed 
STATUS_PARSE_META, added a description for the param 'db.parsemeta.to.crawldb' in 
nutch-default.xml, and fixed an issue with IndexerMapReduce

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-779, NUTCH-779-v2.patch


 The attached patch allows parse metadata to be passed to the corresponding 
 entry of the crawldb.  
 Comments are welcome

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-784) CrawlDBScanner

2010-02-01 Thread Julien Nioche (JIRA)
CrawlDBScanner 
---

 Key: NUTCH-784
 URL: https://issues.apache.org/jira/browse/NUTCH-784
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-784.patch

The patch file contains a utility which dumps all the crawldb entries whose 
URL matches a regular expression. The dump mechanism of the crawldb reader is 
not very useful on large crawldbs, as the output can be extremely large, and 
the -url function can't help if we don't know which URL we want to look at.

The CrawlDBScanner can either generate a text representation of the 
CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 

Usage: CrawlDBScanner crawldb output regex [-s status] -text

regex: regular expression on the crawldb key
-s status : constraint on the status of the crawldb entries e.g. db_fetched, 
db_unfetched
-text : if this parameter is used, the output will be in TextOutputFormat; 
otherwise it generates a 'normal' crawldb with the MapFileOutputFormat

For instance the command below: 
./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s 
db_fetched -text

will generate a text file /tmp/amazon-dump containing all the entries of the 
crawldb matching the regexp .+amazon.com.* and having a status of db_fetched
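The per-entry selection the scanner performs can be sketched as follows. This is a hypothetical helper for illustration (the real tool runs as a job over the crawldb rather than filtering strings):

```java
import java.util.regex.Pattern;

public class ScannerFilter {

    // Hypothetical sketch: keep an entry only if its URL matches the
    // regex and, when a status constraint is given (non-null), its
    // status equals the wanted one.
    public static boolean accept(String url, String status,
                                 String regex, String wantedStatus) {
        if (!Pattern.matches(regex, url)) {
            return false;
        }
        return wantedStatus == null || wantedStatus.equals(status);
    }
}
```

Note that Pattern.matches requires the regex to cover the whole URL, which is why the example expression above is written as .+amazon.com.* rather than a bare substring.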




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-784) CrawlDBScanner

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-784:


Attachment: NUTCH-784.patch

 CrawlDBScanner 
 ---

 Key: NUTCH-784
 URL: https://issues.apache.org/jira/browse/NUTCH-784
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-784.patch


 The patch file contains a utility which dumps all the crawldb entries whose 
 URL matches a regular expression. The dump mechanism of the crawldb reader is 
 not very useful on large crawldbs, as the output can be extremely large, and 
 the -url function can't help if we don't know which URL we want to look at.
 The CrawlDBScanner can either generate a text representation of the 
 CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
 Usage: CrawlDBScanner crawldb output regex [-s status] -text
 regex: regular expression on the crawldb key
 -s status : constraint on the status of the crawldb entries e.g. db_fetched, 
 db_unfetched
 -text : if this parameter is used, the output will be in TextOutputFormat; 
 otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
 For instance the command below: 
 ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
 -s db_fetched -text
 will generate a text file /tmp/amazon-dump containing all the entries of the 
 crawldb matching the regexp .+amazon.com.* and having a status of db_fetched

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-02-01 Thread Julien Nioche (JIRA)
Fetcher : copy metadata from origin URL when redirecting + call 
scfilters.initialScore on newly created URL
---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


When following redirections, the Fetcher neither copies the metadata from 
the original URL to the new one nor calls the method scfilters.initialScore
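What the fix needs to do can be sketched roughly like this. The types here are placeholders (in Nutch the metadata lives on a CrawlDatum and the initialScore call goes through the scoring filters):

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectHelper {

    // Hypothetical sketch: when creating the datum for a redirect target,
    // start from a copy of the origin URL's metadata instead of an empty
    // map; the scoring-filter initialScore call would follow right after.
    public static Map<String, String> datumForRedirect(Map<String, String> originMeta) {
        Map<String, String> meta = new HashMap<>();
        meta.putAll(originMeta);
        return meta;
    }
}
```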

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-785:


Attachment: NUTCH-785.patch

 Fetcher : copy metadata from origin URL when redirecting + call 
 scfilters.initialScore on newly created URL
 ---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-785.patch


 When following redirections, the Fetcher neither copies the metadata from 
 the original URL to the new one nor calls the method scfilters.initialScore

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-01-28 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805892#action_12805892
 ] 

Julien Nioche commented on NUTCH-766:
-

Here is a slightly better version of the patch which: 
• fixes a small bug in the Tika parser (the API has changed slightly between 
1.5beta and 1.5)
• fixes a bug with the TestParserFactory
• adds the tika-plugin to the list of plugins to be built in 
src/plugin/build.xml
• limits the public exposure of methods and classes (see Sami's comment)
• modifies parse-plugins.xml: adds parse-tika and comments out associations 
between some mime-types and the old parsers

I've also added an ANT script which uses Ivy to pull the dependencies and 
copy them into the lib dir. Obviously this won't be needed once the plugin is 
committed, but it should simplify the initial testing. All you need to do after 
applying the patch is:

cd src/plugin/parse-tika/
ant -f build-ivy.xml

I am also attaching the content of the sample directory as an archive - just 
unzip it into src/plugin/parse-tika/ before calling ant test-plugins

Julien




 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch, 
 NUTCH-766.v2, sample.tar.gz



[jira] Updated: (NUTCH-766) Tika parser

2010-01-28 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-766:


Attachment: NUTCH-766.v2
sample.tar.gz

new version of the patch + archive containing the binary docs used for testing

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch, 
 NUTCH-766.v2, sample.tar.gz


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins, which is useful for formats only partially handled 
 by Tika (or not handled at all). Some of the elements below have already been 
 discussed on the mailing lists. Note that this is work in progress; your 
 feedback is welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers); in the work described here we decided 
 to put the libs in two different places:
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Since Tika is used by the core only for its MimeType functionality, we only 
 need to put tika-core at the main lib level, whereas the tika plugin obviously 
 needs tika-parsers.jar plus all the jars used internally by Tika.
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one MIME type, which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika.
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling; this also 
 means that we can use the HTMLParseFilters in exactly the same way. The main 
 difference, though, is that HTMLParseFilters are not limited to HTML documents 
 anymore, as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the 
 tika-parser:
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and, if 
 so, to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is an HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 http://www.digitalpebble.com
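
The wildcard registration described above can be sketched as follows (illustrative only; the ids and parameter names follow the usual Nutch parser plugin descriptor conventions and may differ from the actual patch):

```xml
<!-- Sketch of the tika plugin descriptor (plugin.xml): one parser
     implementation registered for every content type via "*". -->
<extension id="org.apache.nutch.parse.tika"
           name="TikaParser"
           point="org.apache.nutch.parse.Parser">
  <implementation id="org.apache.nutch.parse.tika.TikaParser"
                  class="org.apache.nutch.parse.tika.TikaParser">
    <parameter name="contentType" value="*"/>
    <parameter name="pathSuffix" value=""/>
  </implementation>
</extension>
```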

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-778) Running Nutch On linux having whoami exception?

2010-01-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-778.
-

   Resolution: Invalid
Fix Version/s: (was: 1.0.0)

This is likely to be a problem with the Hadoop configuration or machine setup. 
It is not a Nutch issue as such, so I'll mark it as invalid.

 Running Nutch On linux having whoami exception?
 ---

 Key: NUTCH-778
 URL: https://issues.apache.org/jira/browse/NUTCH-778
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: Linux (RedHat)
Reporter: Prakash Panjwani
   Original Estimate: 1h
  Remaining Estimate: 1h

 I want to run Nutch on Linux. I have logged in as the root user, set all the 
 environment variables and Nutch file settings, and created a url.txt file 
 containing the URLs to crawl. When I try to run Nutch using the following 
 command
 bin/nutch crawl urls -dir pra
 it generates the following exception.
 crawl started in: pra
 rootUrlDir = urls
 threads = 10
 depth = 5
 Injector: starting
 Injector: crawlDb: pra/crawldb
 Injector: urlDir: urls
 Injector: Converting injected urls to crawl db entries.
 Exception in thread "main" java.io.IOException: Failed to get the current 
 user's information.
 at org.apache.hadoop.mapred.JobClient.getUGI(JobClient.java:717)
 at 
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:592)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
 at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)
 Caused by: javax.security.auth.login.LoginException: Login failed: Cannot run 
 program "whoami": java.io.IOException: error=12, Cannot allocate memory
 at 
 org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
 at 
 org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
 at org.apache.hadoop.mapred.JobClient.getUGI(JobClient.java:715)
 ... 5 more
 The server has enough space to run any Java application. I have attached the 
 statistics:
               total       used       free
  Mem:        524320     194632     329688
  -/+ buffers/cache:      194632     329688
  Swap:      2475680          0    2475680
  Total:     3000000     194632    2805368
 Is this sufficient memory for Nutch? Please, someone help me; I am new 
 to Linux and Nutch. 
 Thanks in advance.




[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803670#action_12803670
 ] 

Julien Nioche commented on NUTCH-766:
-

 I think the end result of this plugin should be replacing all Tika-supported 
 parsers (or the parsers we choose to replace) with the TikaParser and not to 
 build parallel ways to parse the same formats.

That's how I see it - it's just that we have the option of choosing when to use 
Tika or not for a given mimetype. It is used by default unless an association 
is created between a parser implementation and a mimetype in 
parse-plugins.xml.
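
Such an override can be sketched in parse-plugins.xml (illustrative values; text/html and parse-html stand in for any mimetype/plugin pair you want to keep away from Tika):

```xml
<!-- Sketch: an explicit mapping keeps text/html on the legacy HTML
     parser; mimetypes without a mapping fall through to the Tika
     plugin, which registers itself for "*". -->
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html"/>
  </mimeType>
</parse-plugins>
```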

 So I think we need to copy all of the existing test files and move/adapt 
 the existing test cases fully before committing this. That is a good way of 
 seeing that the parse result is what is expected and also of finding out 
 about possible differences between the old and the Tika versions.

Sure, but it would be silly to block the whole Tika plugin because Tika does 
not support this or that format as well as the original Nutch plugins do. As I 
explained above, we can configure which parser to use for which mimetype and 
use the Tika plugin by default. Hopefully the Tika implementation will get 
better and better and there will be no need to keep the old plugins.

BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the 
current version of Tika and the existing Nutch parsers.

Even if we decide to keep using the old plugins for some of the formats to 
start with, we'd still be able to use the Tika plugin by default for the ones 
which already have the same coverage.


 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch



[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-19 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802172#action_12802172
 ] 

Julien Nioche commented on NUTCH-779:
-

 The property needs some documentation in nutch-default.xml plus a sensible 
 default. 

Sure - just wanted the general approach to be checked before doing the tedious 
bits. Do you think it makes sense to do things the way I suggested or would you 
use the ScoringFilters instead?


 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
 Attachments: NUTCH-779


 The attached patch makes it possible to pass parse metadata to the 
 corresponding entry of the crawldb.
 Comments are welcome.




[jira] Created: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-18 Thread Julien Nioche (JIRA)
Mechanism for passing metadata from parse to crawldb


 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
 Attachments: NUTCH-779

The attached patch makes it possible to pass parse metadata to the 
corresponding entry of the crawldb.
Comments are welcome.




[jira] Updated: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-18 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-779:


Attachment: NUTCH-779

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
 Attachments: NUTCH-779


 The attached patch makes it possible to pass parse metadata to the 
 corresponding entry of the crawldb.
 Comments are welcome.




[jira] Closed: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2010-01-11 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-767.
---

Resolution: Fixed

Committed revision 897825

 Update Tika to v0.5  for the MimeType detection
 ---

 Key: NUTCH-767
 URL: https://issues.apache.org/jira/browse/NUTCH-767
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: NUTCH-767-part2.patch, NUTCH-767-part3.patch, 
 NUTCH-767.patch

   Original Estimate: 0h
  Remaining Estimate: 0h

 Version 0.5 of Tika requires a few changes to the MimeType 
 implementation. Tika is now split into several jars; we need to place 
 tika-core.jar in the main nutch lib.




[jira] Resolved: (NUTCH-751) Upgrade version of HttpClient

2010-01-11 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-751.
-

Resolution: Later

The changes in the underlying API are quite substantial and this would need a 
bit of work. Maybe this could be done as part of crawler-commons? In the 
meantime I'll just mark it as 'later'.

 Upgrade version of HttpClient 
 --

 Key: NUTCH-751
 URL: https://issues.apache.org/jira/browse/NUTCH-751
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Julien Nioche

 The existing version of commons http-client (3.01) should be replaced with 
 the latest version from http://hc.apache.org/.
 Currently the only way of using the https protocol is to enable http-client. 
 Version 3.01 is buggy and causes a lot of issues which have been reported 
 before. Apparently the new version has been redesigned and should fix them. 
 The old v3.01 is too unstable to be used on a large scale.
  
 I will try to send a patch in the next couple of weeks but would love to hear 
 your thoughts on this.
 J.




[jira] Commented: (NUTCH-766) Tika parser

2010-01-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798727#action_12798727
 ] 

Julien Nioche commented on NUTCH-766:
-

Hi Chris, 

No worries, I'd rather wait for you to have a look at it. It's quite a big 
change, and it would be better if someone else had a look at it; being the 
author, I might miss something obvious.

Thanks

J.

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch





[jira] Assigned: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2010-01-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-269:
---

Assignee: Julien Nioche

 CrawlDbReducer: OOME because no upper-bound on inlinks count
 

 Key: NUTCH-269
 URL: https://issues.apache.org/jira/browse/NUTCH-269
 Project: Nutch
  Issue Type: Bug
Reporter: stack
Assignee: Julien Nioche
Priority: Trivial
 Attachments: too-many-links.patch, too-many-links2.patch


 A CrawlDB update repeatedly OOME'd because a URL had hundreds of thousands 
 of inlinks (the British Foreign Office likes putting a clear.gif multiple 
 times into each page: 
 http://www.fco.gov.uk/Xcelerate/graphics/images/fcomain/clear.gif).




[jira] Commented: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2010-01-08 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797990#action_12797990
 ] 

Julien Nioche commented on NUTCH-269:
-

I will shortly commit a variant of this approach whereby the inlinks are stored 
in a priority queue in order to keep the best-scoring ones. The size of the 
queue is determined by the parameter db.update.max.inlinks, which has a default 
value of 1.
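
A minimal sketch of the bounded-queue idea (hypothetical class and method names; the real CrawlDbReducer operates on CrawlDatum objects keyed by score, not bare floats):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: stream through an arbitrary number of inlink scores while
// only ever holding the top N in memory, bounding heap usage no matter
// how many inlinks a URL has.
public class TopInlinks {

    // Returns the maxInlinks highest scores seen, best first.
    public static List<Float> keepBest(Iterable<Float> scores, int maxInlinks) {
        // Natural-order PriorityQueue = min-heap: the head is the worst
        // score kept so far and is the first candidate for eviction.
        PriorityQueue<Float> queue = new PriorityQueue<>(maxInlinks);
        for (float s : scores) {
            if (queue.size() < maxInlinks) {
                queue.add(s);
            } else if (queue.peek() < s) {
                queue.poll();  // evict the current worst score
                queue.add(s);
            }
        }
        List<Float> best = new ArrayList<>(queue);
        best.sort(Comparator.reverseOrder());
        return best;
    }

    public static void main(String[] args) {
        // Only the three best of five scores survive.
        System.out.println(keepBest(List.of(0.1f, 0.9f, 0.5f, 0.7f, 0.2f), 3));
    }
}
```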

 CrawlDbReducer: OOME because no upper-bound on inlinks count
 

 Key: NUTCH-269
 URL: https://issues.apache.org/jira/browse/NUTCH-269
 Project: Nutch
  Issue Type: Bug
Reporter: stack
Assignee: Julien Nioche
Priority: Trivial
 Attachments: too-many-links.patch, too-many-links2.patch


 A CrawlDB update repeatedly OOME'd because a URL had hundreds of thousands 
 of inlinks (the British Foreign Office likes putting a clear.gif multiple 
 times into each page: 
 http://www.fco.gov.uk/Xcelerate/graphics/images/fcomain/clear.gif).




[jira] Resolved: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2010-01-08 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-269.
-

   Resolution: Fixed
Fix Version/s: 1.1

Committed revision 897180

 CrawlDbReducer: OOME because no upper-bound on inlinks count
 

 Key: NUTCH-269
 URL: https://issues.apache.org/jira/browse/NUTCH-269
 Project: Nutch
  Issue Type: Bug
Reporter: stack
Assignee: Julien Nioche
Priority: Trivial
 Fix For: 1.1

 Attachments: too-many-links.patch, too-many-links2.patch


 A CrawlDB update repeatedly OOME'd because a URL had hundreds of thousands 
 of inlinks (the British Foreign Office likes putting a clear.gif multiple 
 times into each page: 
 http://www.fco.gov.uk/Xcelerate/graphics/images/fcomain/clear.gif).




[jira] Commented: (NUTCH-776) Configurable queue depth

2010-01-07 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797653#action_12797653
 ] 

Julien Nioche commented on NUTCH-776:
-

Did you notice any improvement in the fetch rate after my suggestion on the 
mailing list to use a value larger than 50? Does the memory consumption remain 
reasonable?

 Configurable queue depth
 

 Key: NUTCH-776
 URL: https://issues.apache.org/jira/browse/NUTCH-776
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.1
Reporter: MilleBii
Priority: Minor
 Fix For: 1.1


 I propose that we create a configurable item for the queue depth in 
 Fetcher.java instead of the hard-coded value of 50.
 Key name: fetcher.queues.depth
 Default value: remains 50 (of course)
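
The proposal amounts to reading the value from the configuration instead of the constant; users would then override it in nutch-site.xml, in the usual Hadoop/Nutch property format (a sketch of the proposed, not yet existing, property):

```xml
<property>
  <name>fetcher.queues.depth</name>
  <value>100</value>
  <description>Proposed: maximum number of items held per fetch queue.
  Replaces the value of 50 currently hard-coded in Fetcher.java;
  the default would remain 50.</description>
</property>
```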



