[jira] Assigned: (NUTCH-817) parse-(html) does follow links of full html page, parse-(tika) does not follow any links and stops at level 1
[ https://issues.apache.org/jira/browse/NUTCH-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reassigned NUTCH-817:
-----------------------------------

Assignee: Julien Nioche

parse-(html) does follow links of full html page, parse-(tika) does not follow any links and stops at level 1
-------------------------------------------------------------------------------------------------------------

Key: NUTCH-817
URL: https://issues.apache.org/jira/browse/NUTCH-817
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.1
Environment: Suse linux 11.1, java version 1.6.0_13
Reporter: matthew a. grisius
Assignee: Julien Nioche
Attachments: sample-javadoc.html

Submitted per Julien Nioche. I did not see where to attach a file so I pasted it here. Btw: the Tika command line returns an empty html body for this file.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<!--NewPage-->
<HTML>
<HEAD>
<!-- Generated by javadoc on Fri Mar 28 17:23:42 EDT 2008-->
<TITLE>Matrix Application Development Kit</TITLE>
<SCRIPT type="text/javascript">
targetPage = "" + window.location.search;
if (targetPage != "" && targetPage != "undefined")
    targetPage = targetPage.substring(1);
function loadFrames() {
    if (targetPage != "" && targetPage != "undefined")
        top.classFrame.location = top.targetPage;
}
</SCRIPT>
<NOSCRIPT>
</NOSCRIPT>
</HEAD>
<FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">
<FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">
<FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
<FRAME src="allclasses-frame.html" name="packageFrame" title="All classes and interfaces (except non-static nested types)">
</FRAMESET>
<FRAME src="overview-summary.html" name="classFrame" title="Package, class and interface descriptions" scrolling="yes">
<NOFRAMES>
<H2>Frame Alert</H2>
<P>
This document is designed to be viewed using the frames feature. If you see this message, you are using a non-frame-capable web client.
<BR>
Link to <A HREF="overview-summary.html">Non-frame version.</A>
</NOFRAMES>
</FRAMESET>
</HTML>

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-710) Support for rel=canonical attribute
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286 ]

Julien Nioche commented on NUTCH-710:
-------------------------------------

As suggested previously, we could treat canonicals either as redirections or during deduplication. Neither is a satisfactory solution.

Redirection: we want to index the document if/when the target of the canonical is not available for indexing. We also want to follow the outlinks.

Dedup: we could modify the *DeleteDuplicates code, but canonicals are more complex due to the fact that we need to follow redirections.

We probably need a third approach: prefilter by going through the crawldb to detect URLs which have a canonical target already indexed or ready to be indexed. We need to follow up to X levels of redirection, e.g. doc A is marked as having doc B as its canonical representation, doc B redirects to doc C, etc. If the end of the redirection chain exists and is valid, then mark A as a duplicate of C (intermediate redirs will not get indexed anyway). As we don't know whether it has been indexed yet, we would give it a special marker (e.g. status_duplicate) in the crawlDB. Then:
- if the indexer comes across such an entry: skip it
- make it so that *deleteDuplicates can take a list of URLs with status_duplicate as an additional source of input, OR have a custom resource that deletes such entries in SOLR or Lucene indices

The implementation would be as follows: go through all redirections and generate all redirection chains, e.g.
A -> B
B -> C
D -> C
where C is an indexable document (i.e. it has been fetched and parsed - it may have already been indexed). This will yield
A -> C
B -> C
D -> C
but also C -> C.

Once we have all possible redirections: go through the crawlDB in search of canonicals. If the target of a canonical is the source of a valid alias (e.g. A -> B -> C -> D), mark it as 'status:duplicate'.

This design implies generating quite a few intermediate structures + scanning the whole crawlDB twice (once for the aliases, then for the canonicals) + rewriting the whole crawlDB to mark some of the entries as duplicates. This would be much easier to do when we have Nutch2/HBase: we could simply follow the redirs from the initial URL having a canonical tag instead of generating these intermediate structures. We can then modify the entries one by one instead of regenerating the whole crawlDB.

WDYT?

Support for rel=canonical attribute
-----------------------------------

Key: NUTCH-710
URL: https://issues.apache.org/jira/browse/NUTCH-710
Project: Nutch
Issue Type: New Feature
Affects Versions: 1.1
Reporter: Frank McCown
Priority: Minor

There is a new rel=canonical attribute which is now being supported by Google, Yahoo, and Live:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
Adding support for this attribute value will potentially reduce the number of URLs crawled and indexed and reduce duplicate page content.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
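The chain-following step described in the comment above can be sketched as follows. This is only a minimal in-memory illustration: the class and method names are made up, and the real implementation would resolve chains over the crawlDb with MapReduce rather than a Map.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectChains {

    // Follow a chain of known redirects (source -> target) until reaching a
    // URL that no longer redirects, giving up after maxHops to guard against
    // loops and overly long chains.
    public static String resolve(Map<String, String> redirects, String url, int maxHops) {
        String current = url;
        for (int hop = 0; hop < maxHops; hop++) {
            String next = redirects.get(current);
            if (next == null) {
                return current; // end of chain: the candidate canonical target
            }
            current = next;
        }
        return null; // unresolved: chain too long or a redirect loop
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("A", "B"); // A redirects to B
        redirects.put("B", "C"); // B redirects to C
        // A resolves to C, so a doc whose canonical target is A can be
        // marked as a duplicate of C.
        System.out.println(resolve(redirects, "A", 5)); // prints "C"
    }
}
```

A doc whose canonical target resolves to an indexable end-of-chain URL would then get the status_duplicate marker; an unresolved chain would leave the entry untouched.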
[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856349#action_12856349 ]

Julien Nioche commented on NUTCH-808:
-------------------------------------

Hi Enis,

{quote}
On the other hand, current implementation is ...
{quote}

What do you mean by the current implementation? NutchBase? My gut feeling would be to write a custom framework instead of relying on DataNucleus, and to use AVRO if possible. I really think that HBase support is urgently needed, but I am less convinced that we need MySQL in the very short term. I know that Cascading has various Tap/Sink implementations including JDBC, HBase but also SimpleDB. Maybe it would be worth having a look at how they do it?

Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
-----------------------------------------------------------------------------------------

Key: NUTCH-808
URL: https://issues.apache.org/jira/browse/NUTCH-808
Project: Nutch
Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0

We have an ORM layer in the NutchBase branch, which uses the Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating other frameworks, to see whether they suit our needs. We want at least the following capabilities:
- Using POJOs
- Able to persist objects to at least HBase, Cassandra, and RDBMs
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries

Any comments or suggestions for other frameworks are welcome.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators:
https://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-808:
--------------------------------

Fix Version/s: 2.0

Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
-----------------------------------------------------------------------------------------

Key: NUTCH-808
URL: https://issues.apache.org/jira/browse/NUTCH-808
Project: Nutch
Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0

We have an ORM layer in the NutchBase branch, which uses the Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating other frameworks, to see whether they suit our needs. We want at least the following capabilities:
- Using POJOs
- Able to persist objects to at least HBase, Cassandra, and RDBMs
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries

Any comments or suggestions for other frameworks are welcome.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-810) Upgrade to Tika 0.7
Upgrade to Tika 0.7
-------------------

Key: NUTCH-810
URL: https://issues.apache.org/jira/browse/NUTCH-810
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.1

Upgrading to Tika 0.7 before the 1.1 release. The TikaConfig mechanism has changed and does not rely on a default XML config file anymore. Am working on it.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-789:
--------------------------------

Component/s: (was: fetcher)
             parser
Fix Version/s: (was: 1.1)

Have created a separate issue for the upgrade to Tika 0.7 and moved this one out of 1.1.

Improvements to Tika parser
---------------------------

Key: NUTCH-789
URL: https://issues.apache.org/jira/browse/NUTCH-789
Project: Nutch
Issue Type: Improvement
Components: parser
Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
Attachments: NutchTikaConfig.java, TikaParser.java

As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-810) Upgrade to Tika 0.7
[ https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-810.
-------------------------------

Resolution: Fixed

Committed in rev 931098. http://issues.apache.org/jira/browse/TIKA-317 changed the way the TikaConfig is created, as it does not rely on a tika-config.xml file any longer. Our custom TikaConfig has been modified to reflect these changes. This was the last remaining issue marked for 1.1.

Upgrade to Tika 0.7
-------------------

Key: NUTCH-810
URL: https://issues.apache.org/jira/browse/NUTCH-810
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.1

Upgrading to Tika 0.7 before the 1.1 release. The TikaConfig mechanism has changed and does not rely on a default XML config file anymore. Am working on it.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853251#action_12853251 ]

Julien Nioche commented on NUTCH-789:
-------------------------------------

Will upgrade as soon as 0.7 is available from http://repo1.maven.org/maven2/org/apache/tika/ - which is not the case yet. I will leave this issue open but unmark it as 1.1.

Improvements to Tika parser
---------------------------

Key: NUTCH-789
URL: https://issues.apache.org/jira/browse/NUTCH-789
Project: Nutch
Issue Type: Improvement
Components: fetcher
Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.1
Attachments: NutchTikaConfig.java, TikaParser.java

As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-809) Parse-metatags plugin
Parse-metatags plugin
---------------------

Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*

To use the legacy HTML parser, specify in parse-plugins.xml:
{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml:
{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter makes it possible to include the fields above in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
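For illustration, the filtering behaviour described above (a ';'-separated list of metatag names, with '*' keeping everything) could be interpreted along these lines. This is only a sketch, not code from the NUTCH-809 patch; the class and method names are made up.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class MetaTagsFilter {

    // Keep only the metatags whose names appear in the ';'-separated list,
    // comparing names case-insensitively; '*' keeps every metatag.
    public static Map<String, String> keep(Map<String, String> metaTags, String names) {
        if ("*".equals(names.trim())) {
            return new HashMap<>(metaTags);
        }
        Set<String> wanted = new HashSet<>();
        for (String name : names.split(";")) {
            wanted.add(name.trim().toLowerCase(Locale.ROOT));
        }
        Map<String, String> kept = new HashMap<>();
        for (Map.Entry<String, String> e : metaTags.entrySet()) {
            if (wanted.contains(e.getKey().toLowerCase(Locale.ROOT))) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

With metatags.names set to "description;keywords", only those two tags would survive for the indexer to turn into the 'description' and 'keywords' fields.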
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-809:
--------------------------------

Attachment: NUTCH-809.patch

Parse-metatags plugin
---------------------

Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*

To use the legacy HTML parser, specify in parse-plugins.xml:
{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml:
{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter makes it possible to include the fields above in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-809:
--------------------------------

Attachment: (was: NUTCH-809.patch)

Parse-metatags plugin
---------------------

Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*

To use the legacy HTML parser, specify in parse-plugins.xml:
{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml:
{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter makes it possible to include the fields above in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-809:
--------------------------------

Attachment: NUTCH-809.patch

Modified version of the plugin which is compatible with parse-tika.

Parse-metatags plugin
---------------------

Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*

To use the legacy HTML parser, specify in parse-plugins.xml:
{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml:
{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter makes it possible to include the fields above in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-809:
--------------------------------

Description:
h2. Parse-metatags plugin

The parse-metatags plugin consists of an HTMLParserFilter which takes as parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml:
{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter makes it possible to include the fields above in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com

was:
h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*

To use the legacy HTML parser, specify in parse-plugins.xml:
{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml:
{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter makes it possible to include the fields above in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com

Parse-metatags plugin
---------------------

Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

The parse-metatags plugin consists of an HTMLParserFilter which takes as parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml:
{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter makes it possible to include the fields above in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-706) Url regex normalizer
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-706:
--------------------------------

Fix Version/s: (was: 1.1)

Both variants of the substitution rule above break existing tests. More work will be needed to get a pattern which covers the case described by Meghna *and* is compatible with the existing test cases. Moving it to post-1.1.

Url regex normalizer
--------------------

Key: NUTCH-706
URL: https://issues.apache.org/jira/browse/NUTCH-706
Project: Nutch
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor

Hey,

I encountered the following problem while trying to crawl a site using nutch-trunk. In the file regex-normalize.xml, the following regex is used to remove session ids:

<pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>

This pattern also transforms a URL such as newsId=2000484784794&newsLang=en into new&newsLang=en (since it matches 'sId' in 'newsId'), which is incorrect, and the page hence does not get fetched. This expression needs to be changed to prevent this.

Thanks,
Meghna

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
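The false match Meghna describes is easy to reproduce with java.util.regex (Nutch's regex URL normalizer uses Java regular expressions). The small harness below is just a demo, not Nutch code; the XML entity &amp;amp; from regex-normalize.xml is written as a plain &amp; in the Java string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SessionIdDemo {

    // The session-id pattern quoted in the report.
    static final Pattern RULE = Pattern.compile(
            "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Return the first span the rule would rewrite, or null if it does not fire.
    public static String firstMatch(String url) {
        Matcher m = RULE.matcher(url);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Intended behaviour: a real session id is caught.
        System.out.println(firstMatch("page.jsp;jsessionid=ABC123?x=1")); // ;jsessionid=ABC123?
        // The bug: 'sId' inside 'newsId' also matches, mangling the query string.
        System.out.println(firstMatch("newsId=2000484784794&newsLang=en")); // sId=2000484784794&
    }
}
```

Anchoring the session-id names so they cannot start in the middle of another parameter name (e.g. requiring a preceding separator) would avoid the false positive, but as noted above such variants have to stay compatible with the existing test cases.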
[jira] Resolved: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-779.
---------------------------------

Resolution: Fixed
Fix Version/s: 1.1

Committed revision 929038. Thanks Andrzej for your feedback.

Mechanism for passing metadata from parse to crawldb
----------------------------------------------------

Key: NUTCH-779
URL: https://issues.apache.org/jira/browse/NUTCH-779
Project: Nutch
Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.1
Attachments: NUTCH-779, NUTCH-779-v2.patch

The attached patch makes it possible to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-785.
-------------------------------

Resolution: Fixed

Committed revision 929039. Thanks Andrzej for reviewing it.

Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL
-----------------------------------------------------------------------------------------------------------

Key: NUTCH-785
URL: https://issues.apache.org/jira/browse/NUTCH-785
Project: Nutch
Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.1
Attachments: NUTCH-785.patch

When following redirections, the Fetcher does not copy the metadata from the original URL to the new one, nor does it call the method scfilters.initialScore.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851316#action_12851316 ]

Julien Nioche commented on NUTCH-789:
-------------------------------------

Shall we postpone the work on this issue to after 1.1?

Improvements to Tika parser
---------------------------

Key: NUTCH-789
URL: https://issues.apache.org/jira/browse/NUTCH-789
Project: Nutch
Issue Type: Improvement
Components: fetcher
Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.1
Attachments: NutchTikaConfig.java, TikaParser.java

As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851545#action_12851545 ]

Julien Nioche commented on NUTCH-570:
-------------------------------------

{quote}Julien, want to take this?{quote}

Not particularly. I am busy with short-term issues for 1.1, so feel free to take it if you have a particular interest in this. I would be curious to see some figures on the improvements from this patch; my impression is that NUTCH-776 would be quicker to implement and maintain and might possibly give similar gains.

Improvement of URL Ordering in Generator.java
---------------------------------------------

Key: NUTCH-570
URL: https://issues.apache.org/jira/browse/NUTCH-570
Project: Nutch
Issue Type: Improvement
Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
Attachments: GeneratorDiff.out, GeneratorDiff_v1.out

[Copied directly from my email to the nutch-dev list]

Recently I switched to Fetcher2 over Fetcher for larger whole-web fetches (50-100M at a time). I found that the URLs generated are not optimal because they are simply randomized by a hash comparator. In one crawl on 24 machines it took about 3 days to crawl 30M URLs. In comparison with old benchmarks I had set with regular Fetcher.java this was at least 3-fold more time. Anyway, I realized that the best situation for ordering can be approached by randomization, but in order to get optimal ordering, URLs from the same host should be as far apart in the list as possible. So I wrote a series of 2 map/reduces to optimize the ordering, and for a list of 25M documents it takes about 10 minutes on our cluster. Right now I have it in its own class, but I figured it can go in Generator.java and just add a flag in nutch-default.xml determining if the user wants to use it.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
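The ordering goal Ned describes (URLs from the same host as far apart as possible) can be illustrated with a simple round-robin interleave over per-host lists. This in-memory sketch only shows the ordering idea, not the two map/reduce jobs the patch uses, and the names are made up.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class HostSpread {

    // Emit one URL from each host in turn, cycling until every per-host list
    // is exhausted, so consecutive URLs rarely share a host.
    public static List<String> interleaveByHost(Map<String, List<String>> urlsByHost) {
        List<String> out = new ArrayList<>();
        List<Iterator<String>> iterators = new ArrayList<>();
        for (List<String> urls : urlsByHost.values()) {
            iterators.add(urls.iterator());
        }
        boolean emitted = true;
        while (emitted) {
            emitted = false;
            for (Iterator<String> it : iterators) {
                if (it.hasNext()) {
                    out.add(it.next());
                    emitted = true;
                }
            }
        }
        return out;
    }
}
```

With hosts of unequal size the tail degenerates to runs from the largest host, which is one reason a politeness-aware fetcher (as NUTCH-776 proposes on the fetch side) can achieve similar gains without reordering the generate list.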
[jira] Closed: (NUTCH-784) CrawlDBScanner
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-784.
-------------------------------

Resolution: Fixed

Committed revision 928746.

CrawlDBScanner
--------------

Key: NUTCH-784
URL: https://issues.apache.org/jira/browse/NUTCH-784
Project: Nutch
Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.1
Attachments: NUTCH-784.patch

The patch file contains a utility which dumps all the entries whose URL matches a regular expression. The dump mechanism of the crawldb reader is not very useful on large crawldbs, as the output can be extremely large, and the -url function can't help if we don't know which URL we want to look at. The CrawlDBScanner can generate either a text representation of the CrawlDatum-s or binary objects which can then be used as a new CrawlDB.

Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]

<regex>: regular expression on the crawldb key
-s <status>: constraint on the status of the crawldb entries, e.g. db_fetched, db_unfetched
-text: if this parameter is used, the output will be of TextOutputFormat; otherwise it generates a 'normal' crawldb with the MapFileOutputFormat

For instance the command below:

./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s db_fetched -text

will generate a text file /tmp/amazon-dump containing all the entries of the crawldb matching the regexp .+amazon.com.* and having a status of db_fetched.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-784) CrawlDBScanner
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-784:
--------------------------------

Fix Version/s: 1.1

CrawlDBScanner
--------------

Key: NUTCH-784
URL: https://issues.apache.org/jira/browse/NUTCH-784
Project: Nutch
Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.1
Attachments: NUTCH-784.patch

The patch file contains a utility which dumps all the entries whose URL matches a regular expression. The dump mechanism of the crawldb reader is not very useful on large crawldbs, as the output can be extremely large, and the -url function can't help if we don't know which URL we want to look at. The CrawlDBScanner can generate either a text representation of the CrawlDatum-s or binary objects which can then be used as a new CrawlDB.

Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]

<regex>: regular expression on the crawldb key
-s <status>: constraint on the status of the crawldb entries, e.g. db_fetched, db_unfetched
-text: if this parameter is used, the output will be of TextOutputFormat; otherwise it generates a 'normal' crawldb with the MapFileOutputFormat

For instance the command below:

./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s db_fetched -text

will generate a text file /tmp/amazon-dump containing all the entries of the crawldb matching the regexp .+amazon.com.* and having a status of db_fetched.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-806) Merge CrawlDBScanner with CrawlDBReader
Merge CrawlDBScanner with CrawlDBReader
---------------------------------------

Key: NUTCH-806
URL: https://issues.apache.org/jira/browse/NUTCH-806
Project: Nutch
Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche

The CrawlDBScanner [NUTCH-784] should be merged with the CrawlDBReader. Will do that after the 1.1 release.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-783) IndexerChecker Utility
[ https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-783:
--------------------------------

Fix Version/s: (was: 1.1)

Removed tag 1.1. Will rename to IndexingPluginsChecker later.

IndexerChecker Utility
----------------------

Key: NUTCH-783
URL: https://issues.apache.org/jira/browse/NUTCH-783
Project: Nutch
Issue Type: New Feature
Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-783.patch

This patch contains a new utility which makes it possible to check the configuration of the indexing filters. The IndexerChecker reads and parses a URL and runs the indexers on it, displaying the fields obtained and the first 100 characters of their values. It can be used e.g.:

./nutch org.apache.nutch.indexer.IndexerChecker http://www.lemonde.fr/

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850912#action_12850912 ]

Julien Nioche commented on NUTCH-785:
-------------------------------------

Could anyone please review this issue? I would like to commit it in time for the 1.1 release.

Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL
-----------------------------------------------------------------------------------------------------------

Key: NUTCH-785
URL: https://issues.apache.org/jira/browse/NUTCH-785
Project: Nutch
Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.1
Attachments: NUTCH-785.patch

When following redirections, the Fetcher does not copy the metadata from the original URL to the new one, nor does it call the method scfilters.initialScore.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850915#action_12850915 ] Julien Nioche commented on NUTCH-779: - Could anyone please review this issue? I would like to commit it in time for the 1.1 release. Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-779, NUTCH-779-v2.patch The patch attached allows passing parse metadata to the corresponding entry of the crawldb. Comments are welcome.
[jira] Updated: (NUTCH-776) Configurable queue depth
[ https://issues.apache.org/jira/browse/NUTCH-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-776: Fix Version/s: (was: 1.1) Moving this issue post 1.1. Needs a patch file, some description of the param in nutch-default.xml and, more importantly, some experimentation to see how it impacts the performance of the fetching. Configurable queue depth Key: NUTCH-776 URL: https://issues.apache.org/jira/browse/NUTCH-776 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.1 Reporter: MilleBii Priority: Minor I propose that we create a configurable item for the queue depth in Fetcher.java instead of the hard-coded value of 50. Key name: fetcher.queues.depth Default value: remains 50 (of course)
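The proposal above amounts to a new entry in nutch-default.xml. A sketch of what the property definition might look like; the name fetcher.queues.depth is the one proposed in the issue, not an existing Nutch property:

```xml
<!-- Hypothetical property as proposed in NUTCH-776; not part of
     nutch-default.xml at the time of writing. -->
<property>
  <name>fetcher.queues.depth</name>
  <value>50</value>
  <description>Maximum number of items queued per fetch queue in
  Fetcher.java. The default preserves the previously hard-coded
  value of 50.</description>
</property>
```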
[jira] Closed: (NUTCH-740) Configuration option to override default language for fetched pages.
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-740. --- Resolution: Fixed Assignee: Julien Nioche Committed in rev 926003. Thanks Marcin for contributing this patch. Configuration option to override default language for fetched pages. Key: NUTCH-740 URL: https://issues.apache.org/jira/browse/NUTCH-740 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Marcin Okraszewski Assignee: Julien Nioche Priority: Minor Fix For: 1.1 Attachments: AcceptLanguage.patch, AcceptLanguage_trunk_2009-06-09.patch, NUTCH-740.patch By default the Accept-Language HTTP request header is set to English. Unfortunately this value is hard-coded and there seems to be no way to override it. As a result you may index the English version of pages even though you would prefer them in a different language.
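The committed patch makes the Accept-Language header configurable. A sketch of how the override might look in nutch-site.xml, assuming the property name http.accept.language used by the patch (check nutch-default.xml in your release for the exact name):

```xml
<!-- Request German pages first, falling back to English; the value is
     sent verbatim as the Accept-Language request header. -->
<property>
  <name>http.accept.language</name>
  <value>de-de,de;q=0.8,en-us,en;q=0.5</value>
  <description>Value of the Accept-Language header sent by Nutch's
  HTTP protocol plugins, overriding the former hard-coded English
  default.</description>
</property>
```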
[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Fix Version/s: 1.1 Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-762-v2.patch

When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB, then updating the DB only once for several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well, as we need to read the whole crawlDB as many times as we generate segments. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects:

* can filter the URLs by score
* normalisation is optional
* IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDB is too slow to be usable on a large scale
* can cap the number of URLs per host or domain (but not by IP)
* can choose to partition by host, domain or IP

Typically the same unit (e.g. domain) would be used for capping the URLs and for partitioning; however, as we can't count the max number of URLs by IP, another unit must be chosen when partitioning by IP. We found that using a filter on the score can dramatically improve performance, as it reduces the amount of data sent to the reducers.

The MultiGenerator is called via: nutch org.apache.nutch.crawl.MultiGenerator ... with the following options: MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator, apart from:

* -noNorm : skip normalisation (explicit)
* -topN : max number of URLs per segment
* -maxNumSegments : the actual number of segments generated could be less than the max value if, e.g., not enough URLs are available for fetching and they fit in fewer segments

Please give it a try and let me know what you think of it. Julien Nioche http://www.digitalpebble.com
[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Attachment: NUTCH-762-v3.patch New patch which reintroduces the 'generator.update.crawldb' functionality. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848095#action_12848095 ] Julien Nioche commented on NUTCH-762: - {quote} I just noticed that the new Generator uses different config property names (generator. vs. generate.), and the older versions are now marked with (Deprecated). However, this doesn't reflect the reality - properties with old names are simply ignored now, whereas deprecated implies that they should still work {quote} They will still work if we keep the old Generator as OldGenerator - which is what we assume in the patch. If we decide to get rid of the OldGenerator then yes, they should not be marked with (Deprecated). {quote} For back-compat reasons I think they should still work - the current (admittedly awkward) prefix is good enough, and I think that changing it in a minor release would create confusion. I suggest reverting to the old names where appropriate, and adding new properties with the same prefix, i.e. generate.. {quote} The original assumption was that we'd keep both this version of the generator and the old one, in which case we could have used a different prefix for the properties. If we want to *replace* the old generator altogether - which I think would be a good option - then indeed we should discuss whether or not to align on the old prefix. I don't have strong feelings on whether or not to modify the prefix in a minor release.
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848140#action_12848140 ] Julien Nioche commented on NUTCH-762: - The change of prefix also reflected that we now use 2 different parameters to specify how to count the URLs (host or domain) and the max number of URLs. We can of course maintain the old parameters as well for the sake of compatibility, except that _generate.max.per.host.by.ip_ won't be of much use anymore as we don't count per IP. I have just noticed that 'crawl.gen.delay' is not documented in nutch-default.xml, and does not seem to be used outside the Generator. What is it supposed to be used for?
[jira] Closed: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-762. --- Resolution: Fixed Committed revision 926155. Have reverted the prefix for params to 'generate.' + added description of crawl.gen.delay in nutch-default + added warning when the user specifies generate.max.per.host.by.ip + param generate.max.per.host is now supported. Thanks Andrzej for reviewing it.
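With the prefix reverted to 'generate.', the counting and partitioning behaviour discussed in this thread is driven by configuration. A hedged sketch of the relevant nutch-site.xml entries; the property names and accepted values shown here are my reading of the committed patch and should be verified against nutch-default.xml in the release:

```xml
<!-- Cap the number of URLs selected per counting unit per segment. -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>

<!-- Unit used for counting URLs: host or domain (counting by IP is
     not supported, as discussed above). -->
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>

<!-- Unit used for partitioning fetchlists across fetchers; may differ
     from the counting unit, e.g. count per domain, partition by IP. -->
<property>
  <name>partition.url.mode</name>
  <value>byIP</value>
</property>
```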
[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-740: Attachment: NUTCH-740.patch Slightly modified version of the patch with modifications for protocol-http. Will commit shortly. Configuration option to override default language for fetched pages. Key: NUTCH-740 URL: https://issues.apache.org/jira/browse/NUTCH-740 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Marcin Okraszewski Priority: Minor Fix For: 1.1 Attachments: AcceptLanguage.patch, AcceptLanguage_trunk_2009-06-09.patch, NUTCH-740.patch
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846910#action_12846910 ] Julien Nioche commented on NUTCH-762: - OK, there was indeed an assumption that the generator would not need to be called again before an update. I am happy to add back generate.update.crawldb. Note that this version of the Generator also differs from the original version in that {quote} * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can cap the number of URLs per host or domain (but not by IP) {quote} We could allow more flexibility by counting per IP, again at the expense of performance. Not sure it is very useful in practice though. Since the way we count the URLs is now decoupled from the way we partition them, we can have a hybrid approach, e.g. count per domain THEN partition by IP. Any thoughts on whether or not we should reintroduce the counting per IP?
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846930#action_12846930 ] Julien Nioche commented on NUTCH-762: - Yes, I came across that situation too on a large crawl where a single machine was used to host a whole range of unrelated domain names (needless to say the host of the domains was not very pleased). We can now handle such cases simply by partitioning by IP (and counting by domain). I will have a look at reintroducing *generate.update.crawldb* tomorrow.
[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-469: Fix Version/s: (was: 1.1) There have not been any changes to this issue since February 09 and it won't be included in 1.1. Marking it as 'fix version: unknown'. changes to geoPosition plugin to make it work on nutch 0.9 -- Key: NUTCH-469 URL: https://issues.apache.org/jira/browse/NUTCH-469 Project: Nutch Issue Type: Improvement Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Mike Schwartz Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, NUTCH-469-2007-05-09.txt.gz I have modified the geoPosition plugin (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9. (The code was built originally using nutch 0.7.) I'd like to contribute my changes back to the nutch project. I have already communicated with the code's author (Matthias Jaekle), and he agrees with my mods.
[jira] Commented: (NUTCH-740) Configuration option to override default language for fetched pages.
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845886#action_12845886 ] Julien Nioche commented on NUTCH-740: - A nice contribution, but shouldn't this be applied to the *protocol-http* plugin as well, e.g. in HttpResponse? Configuration option to override default language for fetched pages. Key: NUTCH-740 URL: https://issues.apache.org/jira/browse/NUTCH-740 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Marcin Okraszewski Assignee: Otis Gospodnetic Priority: Minor Fix For: 1.1 Attachments: AcceptLanguage.patch, AcceptLanguage_trunk_2009-06-09.patch
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846141#action_12846141 ] Julien Nioche commented on NUTCH-762: - If I am not mistaken the point of having _generate.update.crawldb_ was to marke the URLs put in a fetchlist in order to be able to do another round of generation. This is not necessary now as we can generate several segments without writing a new crawldb. Am I missing something? Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-762-v2.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. 
domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value if, e.g., not enough URLs are available for fetching and they fit in fewer segments Please give it a try and let me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-710) Support for rel=canonical attribute
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-710: Fix Version/s: (was: 1.1) Great idea. Won't be included in 1.1 though so moving to *fix : unknown* Support for rel=canonical attribute - Key: NUTCH-710 URL: https://issues.apache.org/jira/browse/NUTCH-710 Project: Nutch Issue Type: New Feature Affects Versions: 1.1 Reporter: Frank McCown Priority: Minor There is a new rel=canonical attribute which is now being supported by Google, Yahoo, and Live: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html Adding support for this attribute value will potentially reduce the number of URLs crawled and indexed and reduce duplicate page content. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
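For illustration, here is a minimal stdlib-only sketch of extracting such a canonical link. The class and method names are hypothetical (not part of Nutch), and for brevity it uses a strict XML parse; a real implementation would work on the DOM already produced by Nutch's HTML parser, which tolerates malformed markup.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class CanonicalExtractor {
    // Returns the href of the first <link rel="canonical"> element, or null if absent.
    public static String canonicalUrl(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
        NodeList links = doc.getElementsByTagName("link");
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            if ("canonical".equalsIgnoreCase(link.getAttribute("rel"))) {
                return link.getAttribute("href");
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head>"
                + "<link rel=\"canonical\" href=\"http://example.com/item\"/>"
                + "</head><body/></html>";
        System.out.println(canonicalUrl(page)); // prints http://example.com/item
    }
}
```

An indexer or deduplication step could then substitute the canonical URL for the fetched URL when the two differ.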
[jira] Resolved: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-692. - Resolution: Cannot Reproduce Fix Version/s: 1.1 I cannot reproduce the issue since we moved to Hadoop 0.20, which is good news AlreadyBeingCreatedException with Hadoop 0.19 - Key: NUTCH-692 URL: https://issues.apache.org/jira/browse/NUTCH-692 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-692.patch I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedException during the reduce phase of a parse. For some reason one of my tasks crashed and then I ran into this AlreadyBeingCreatedException when other nodes tried to pick it up. There was recently a discussion on the Hadoop user list on similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do if the problems persist with 0.19 I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use for Nutch 1.0? I will be running a crawl on a super large cluster in the next couple of weeks and I will confirm this issue J. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-798) Upgrade to SOLR1.4
[ https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-798. - Resolution: Fixed Updated SOLRJ's dependencies at the same time : Deleting lib/apache-solr-common-1.3.0.jar Adding (bin) lib/apache-solr-core-1.4.0.jar Deleting lib/apache-solr-solrj-1.3.0.jar Adding (bin) lib/apache-solr-solrj-1.4.0.jar Deleting lib/commons-httpclient-3.0.1.jar Adding (bin) lib/commons-httpclient-3.1.jar Adding (bin) lib/commons-io-1.4.jar Adding (bin) lib/geronimo-stax-api_1.0_spec-1.0.1.jar Adding (bin) lib/jcl-over-slf4j-1.5.5.jar Deleting lib/slf4j-api-1.4.3.jar Adding (bin) lib/slf4j-api-1.5.5.jar Adding (bin) lib/wstx-asl-3.2.7.jar Committed revision 921831 Upgrade to SOLR1.4 -- Key: NUTCH-798 URL: https://issues.apache.org/jira/browse/NUTCH-798 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 in particular SOLR1.4 has a StreamingUpdateSolrServer which would simplify the way we buffer the docs before sending them to the SOLR instance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-801) Remove RTF and MP3 parse plugins
[ https://issues.apache.org/jira/browse/NUTCH-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-801. - Resolution: Fixed Committed revision 921840. Remove RTF and MP3 parse plugins Key: NUTCH-801 URL: https://issues.apache.org/jira/browse/NUTCH-801 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0 Reporter: Julien Nioche Fix For: 1.1 *Parse-rtf* and *parse-mp3* are not built by default due to licensing issues. Since we now have *parse-tika* to handle these formats I would be in favour of removing these 2 plugins altogether to keep things nice and simple. The other plugins will probably be phased out only after the release of 1.1 when parse-tika will have been tested a lot more. Any reasons not to? Julien -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Attachment: (was: NUTCH-762-MultiGenerator.patch) Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-762-v2.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many times as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP.
We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value if, e.g., not enough URLs are available for fetching and they fit in fewer segments Please give it a try and let me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Attachment: NUTCH-762-v2.patch Improved version of the patch : - fixed a few minor bugs - renamed Generator into OldGenerator - renamed MultiGenerator into Generator - fixed test classes to use new Generator - documented parameters in nutch-default.xml - added names of segments to the LOG to facilitate integration in scripts - PartitionUrlByHost is replaced by URLPartitioner which is more generic I decided to keep the old version for the time being but we might as well get rid of it altogether. The new version is now used in the Crawl class. Would be nice if people could give it a good try before we put it in 1.1 Thanks Julien Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-762-v2.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many times as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once.
The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value if, e.g., not enough URLs are available for fetching and they fit in fewer segments Please give it a try and let me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-799) SOLRIndexer to commit once all reducers have finished
SOLRIndexer to commit once all reducers have finished - Key: NUTCH-799 URL: https://issues.apache.org/jira/browse/NUTCH-799 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 What about doing only one SOLR commit after the MR job has finished in SOLRIndexer instead of doing that at the end of every Reducer? I ran into timeout exceptions in some of my reducers and I suspect that this was due to the fact that other reducers had already finished and called commit. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-799) SOLRIndexer to commit once all reducers have finished
[ https://issues.apache.org/jira/browse/NUTCH-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-799: Attachment: NUTCH-799.patch SOLRIndexer to commit once all reducers have finished - Key: NUTCH-799 URL: https://issues.apache.org/jira/browse/NUTCH-799 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 Attachments: NUTCH-799.patch What about doing only one SOLR commit after the MR job has finished in SOLRIndexer instead of doing that at the end of every Reducer? I ran into timeout exceptions in some of my reducers and I suspect that this was due to the fact that other reducers had already finished and called commit. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-782) Ability to order htmlparsefilters
[ https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-782. --- Resolution: Fixed Committed revision 917557 Ability to order htmlparsefilters - Key: NUTCH-782 URL: https://issues.apache.org/jira/browse/NUTCH-782 Project: Nutch Issue Type: New Feature Components: parser Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-782.patch Patch which adds a new parameter 'htmlparsefilter.order' which specifies the order in which HTMLParse filters are applied. HTMLParse filter ordering MAY have an impact on the end result, as some filters could rely on the metadata generated by a previous filter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
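As a sketch of how the new parameter would be used, a nutch-site.xml entry might look like the following; the filter class names and description text below are illustrative only, not taken from the committed patch:

```xml
<!-- Illustrative nutch-site.xml entry; the filter class names are examples only -->
<property>
  <name>htmlparsefilter.order</name>
  <value>org.example.MetadataSeedingFilter org.apache.nutch.analysis.lang.HTMLLanguageParser</value>
  <description>Order in which HTMLParse filters are applied. Filters listed
  earlier run first, so a filter that consumes metadata produced by another
  filter should be listed after it.</description>
</property>
```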
[jira] Created: (NUTCH-798) Upgrade to SOLR1.4
Upgrade to SOLR1.4 -- Key: NUTCH-798 URL: https://issues.apache.org/jira/browse/NUTCH-798 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 in particular SOLR1.4 has a StreamingUpdateSolrServer which would simplify the way we buffer the docs before sending them to the SOLR instance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-719. - Resolution: Fixed Fix Version/s: 1.1 Committed revision 911905. Thanks to S. Dennis for investigating the issue + R. Schwab for testing it fetchQueues.totalSize incorrect in Fetcher2 --- Key: NUTCH-719 URL: https://issues.apache.org/jira/browse/NUTCH-719 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 I had a look at the logs generated by Fetcher2 and found cases where there were no active fetchQueues but fetchQueues.totalSize was != 0 fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, fetchQueues.totalSize=1, fetchQueues=0 since the code relies on fetchQueues.totalSize to determine whether the work is finished or not the task is blocked until the abortion mechanism kicks in 2009-03-12 09:27:38,977 WARN fetcher.Fetcher2 - Aborting with 200 hung threads. could that be a synchronisation issue? any ideas? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
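If the miscount comes from unsynchronised updates to totalSize, one plausible fix is to keep the aggregate counter in an AtomicInteger so that producer and consumer threads update it atomically. The class below is a toy model (hypothetical names, not the actual Fetcher2 code) showing that such a counter returns to zero once every queued item has been fetched, even with many concurrent threads:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the fetch queues: totalSize must reach 0 again once every
// queued item has been fetched, regardless of thread interleaving.
public class QueuesDemo {
    private final AtomicInteger totalSize = new AtomicInteger(0);

    public void addFetchItem()    { totalSize.incrementAndGet(); }
    public void finishFetchItem() { totalSize.decrementAndGet(); }
    public int  getTotalSize()    { return totalSize.get(); }

    // Spawns `threads` workers that each queue and complete `itemsPerThread`
    // items; returns the final counter value (0 if no updates were lost).
    public static int runDemo(int threads, int itemsPerThread) throws InterruptedException {
        QueuesDemo q = new QueuesDemo();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < itemsPerThread; j++) {
                    q.addFetchItem();
                    q.finishFetchItem();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return q.getTotalSize();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo(200, 1000)); // prints 0
    }
}
```

With a plain non-volatile int field instead, lost updates could leave the counter stuck above zero, which matches the symptom in the logs (totalSize=1 with no active queues).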
[jira] Closed: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-719. --- fetchQueues.totalSize incorrect in Fetcher2 --- Key: NUTCH-719 URL: https://issues.apache.org/jira/browse/NUTCH-719 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 I had a look at the logs generated by Fetcher2 and found cases where there were no active fetchQueues but fetchQueues.totalSize was != 0 fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, fetchQueues.totalSize=1, fetchQueues=0 since the code relies on fetchQueues.totalSize to determine whether the work is finished or not the task is blocked until the abortion mechanism kicks in 2009-03-12 09:27:38,977 WARN fetcher.Fetcher2 - Aborting with 200 hung threads. could that be a synchronisation issue? any ideas? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-705) parse-rtf plugin
[ https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-705. - Resolution: Fixed RTF parsing is now handled by the TikaPlugin (NUTCH-766). Please open an issue on Tika if the original problem with non-ascii chars still occurs parse-rtf plugin Key: NUTCH-705 URL: https://issues.apache.org/jira/browse/NUTCH-705 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 1.0.0 Reporter: Dmitry Lihachev Priority: Minor Fix For: 1.1 Attachments: NUTCH-705.patch Demoting this issue and moving to 1.1 - current patch is not suitable due to LGPL licensed parts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-644) RTF parser doesn't compile anymore
[ https://issues.apache.org/jira/browse/NUTCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-644. - Resolution: Fixed RTF parsing is now handled by the TikaPlugin (NUTCH-766) which solves the issue of licensing. RTF parser doesn't compile anymore -- Key: NUTCH-644 URL: https://issues.apache.org/jira/browse/NUTCH-644 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Guillaume Smet Attachments: NUTCH-644_v2.patch, NUTCH-644_v3.patch, RTFParseFactory.java-compilation_issues.diff Due to API changes, the RTF parser (which is not compiled by default due to a licensing problem) doesn't compile anymore. The build.xml script doesn't work anymore either, as http://www.cobase.cs.ucla.edu/pub/javacc/rtf_parser_src.jar doesn't exist anymore (404). I didn't fix the build.xml as I don't know from where we want to get the jar file; I only fixed the compilation issues. Regards, -- Guillaume -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-794) Tika parser does not keep attributes on html tag
Tika parser does not keep attributes on html tag Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 The following HTML document : <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika : <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-794) Tika parser does identify lang attributes on html tag
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Description: The following HTML document : <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika : <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. The lang is not stored in the metadata either. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore was: The following HTML document : <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika : <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore Summary: Tika parser does identify lang attributes on html tag (was: Tika parser does not keep attributes on html tag) Tika parser does identify lang attributes on html tag - Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 The following HTML document : <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika : <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. The lang is not stored in the metadata either. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-794) Tika parser does identify lang attributes on html tag
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Attachment: NUTCH-794.patch Tika parser does identify lang attributes on html tag - Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-794.patch The following HTML document : <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika : <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. The lang is not stored in the metadata either. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834147#action_12834147 ] Julien Nioche commented on NUTCH-794: - Committed patch in revision 910454 Waiting for issue to be fixed in Tika before closing this issue Language Identification must use check the parse metadata for language values -- Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-794.patch The following HTML document : <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika : <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. The lang is not stored in the metadata either. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-794) Language Identification must use check the parse metadata for language values
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Summary: Language Identification must use check the parse metadata for language values (was: Tika parser does identify lang attributes on html tag) Language Identification must use check the parse metadata for language values -- Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-794.patch The following HTML document : <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika : <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. The lang is not stored in the metadata either. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Work started: (NUTCH-794) Language Identification must use check the parse metadata for language values
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-794 started by Julien Nioche. Language Identification must use check the parse metadata for language values -- Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-794.patch The following HTML document : <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika : <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. The lang is not stored in the metadata either. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-794) Language Identification must use check the parse metadata for language values
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Component/s: parser Language Identification must use check the parse metadata for language values -- Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Components: parser Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-794.patch The following HTML document : <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html> is rendered as the following xhtml by Tika : <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html> with the lang attribute getting lost. The lang is not stored in the metadata either. I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests don't break anymore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
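The attribute loss discussed in this issue can be checked with a stdlib-only helper. The class below is hypothetical (not part of Nutch or its tests), and it uses a strict XML parse for brevity where a real HTML parser would be more lenient:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;

public class LangCheck {
    // Returns the lang attribute of the root <html> element, or "" if absent.
    public static String htmlLang(String doc) throws Exception {
        Document d = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(doc.getBytes("UTF-8")));
        return d.getDocumentElement().getAttribute("lang");
    }

    public static void main(String[] args) throws Exception {
        String original = "<html lang=\"fi\"><head>document 1 title</head>"
                + "<body>jotain suomeksi</body></html>";
        String tikaOutput = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title/></head><body>document 1 titlejotain suomeksi</body></html>";
        System.out.println(htmlLang(original));   // prints fi
        System.out.println(htmlLang(tikaOutput)); // prints an empty line - the attribute is gone
    }
}
```

This is why the language identifier cannot rely on the html tag alone when parsing goes through Tika; checking the parse metadata for a language value, as the retitled issue proposes, gives it a fallback.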
[jira] Updated: (NUTCH-782) Ability to order htmlparsefilters
[ https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-782: Component/s: parser Ability to order htmlparsefilters - Key: NUTCH-782 URL: https://issues.apache.org/jira/browse/NUTCH-782 Project: Nutch Issue Type: New Feature Components: parser Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-782.patch Patch which adds a new parameter 'htmlparsefilter.order' which specifies the order in which HTMLParse filters are applied. HTMLParse filter ordering MAY have an impact on the end result, as some filters could rely on the metadata generated by a previous filter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-766. --- Have added small improvement in revision 910187 (Prioritise default Tika parser when discovering plugins matching mime-type). Thanks to Chris for testing and committing it + Andrzej and Sami for their comments and suggestions Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing mechanism to Tika but can still coexist with the existing parsing plugins, which is useful for formats partially handled by Tika (or not handled at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers); in the work described here we decided to put the libs in 2 different places: NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Since Tika is used by the core only for its MimeType functionality, we only need to put tika-core at the main lib level, whereas the tika plugin obviously needs tika-parsers.jar plus all the jars used internally by Tika. Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points.
Unlike most other parsers, Tika handles more than one Mime-type, which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling; this also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it.
The following libraries are required in the lib/ directory of the tika-parser:

<library name="asm-3.1.jar"/>
<library name="bcmail-jdk15-144.jar"/>
<library name="commons-compress-1.0.jar"/>
<library name="commons-logging-1.1.1.jar"/>
<library name="dom4j-1.6.1.jar"/>
<library name="fontbox-0.8.0-incubator.jar"/>
<library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
<library name="hamcrest-core-1.1.jar"/>
<library name="jce-jdk13-144.jar"/>
<library name="jempbox-0.8.0-incubator.jar"/>
<library name="metadata-extractor-2.4.0-beta-1.jar"/>
<library name="mockito-core-1.7.jar"/>
<library name="objenesis-1.0.jar"/>
<library name="ooxml-schemas-1.0.jar"/>
<library name="pdfbox-0.8.0-incubating.jar"/>
<library name="poi-3.5-FINAL.jar"/>
<library name="poi-ooxml-3.5-FINAL.jar"/>
<library name="poi-scratchpad-3.5-FINAL.jar"/>
<library name="tagsoup-1.2.jar"/>
<library name="tika-parsers-0.5-SNAPSHOT.jar"/>
<library name="xml-apis-1.0.b2.jar"/>
<library name="xmlbeans-2.3.0.jar"/>

There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika, and if so, to the same extent; the Wiki is probably the right place for this. The language identifier (which is an HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
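The SAX-to-DOM conversion mentioned above can be sketched as follows (a minimal illustration, not the actual TikaParser code from the patch): a ContentHandler rebuilds the XHTML event stream as a DOM tree, which DOM-based utilities such as link detection and metatag handling can then process.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.DocumentBuilderFactory;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: turn SAX events into a DOM tree. A stack tracks the currently
// open element so children and text nodes land in the right place.
public class DomBuilderHandler extends DefaultHandler {
    private Document doc;
    private final Deque<Element> stack = new ArrayDeque<>();

    @Override
    public void startDocument() {
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Element el = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            el.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        if (stack.isEmpty()) doc.appendChild(el);
        else stack.peek().appendChild(el);
        stack.push(el);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!stack.isEmpty()) {
            stack.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    public Document getDocument() { return doc; }
}
```

Once Tika (or any SAX source) has driven such a handler, the resulting Document can be walked exactly as the html-plugin walks the output of its own DOM parser.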
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832454#action_12832454 ] Julien Nioche commented on NUTCH-766: - @Chris: I just did a fresh checkout from svn, applied the patch v3, unzipped sample.tar.gz onto the directory parse-tika and ran the test just as you did, but could not reproduce the problem. Could there be a difference between your version and the trunk? @Sami: {quote} was there a reason not to use the AutoDetect parser? {quote} I suppose we could, as long as we give it a clue about the MimeType obtained from the Content. As you pointed out, there could be a duplication with the detection done by Mime-Util. I suppose one way to do it would be to add a new version of the method getParse(Content content, MimeType type). That's an interesting point. {quote} Also, was there a reason not to parse HTML with Tika? {quote} It is supposed to do so; if it does not, then it's a bug which needs urgent fixing. Regarding parsing package formats, I think the plan is that Tika will handle that in the future, but we could try to do that now if we find a relatively clean mechanism for doing so. BTW could you please send a diff and not the full code of the class you posted earlier; that would make the comparison much easier.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564 ] Julien Nioche commented on NUTCH-766: - I had a closer look at the HTML parsing issue. What happens is that the association between the mime-type and the parser implementation is not explicitly set in parse-plugins.xml, so the ParserFactory goes through all the plugins and gets the ones with a matching mime-type (or * for Tika). The Tika parser takes no precedence over the default HTML parser; the latter comes first in the list and is used for parsing. Of course that does not happen if parse-html is not specified in plugin.includes, or if an explicit mapping is set in parse-plugins.xml. I don't think we want to have to specify explicitly that Tika should be used in all the mappings, and reserve them for the cases when a parser must be used instead of Tika. What we could do, though, is that in the cases where no explicit mapping is set for a mime-type, Tika (or any parser marked as supporting any mime-type) would be put first in the list of discovered parsers, so it would remain the default choice unless an explicit mapping is set (even if a plugin is loaded and can handle the type). Makes sense?
[jira] Issue Comment Edited: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564 ] Julien Nioche edited comment on NUTCH-766 at 2/11/10 5:22 PM: -- I had a closer look at the HTML parsing issue. What happens is that the association between the mime-type and the parser implementation is not explicitly set in parse-plugins.xml, so the ParserFactory goes through all the plugins and gets the ones with a matching mime-type (or * for Tika). The Tika parser takes no precedence over the default HTML parser; the latter comes first in the list and is used for parsing. Of course that does not happen if parse-html is not specified in plugin.includes, or if an explicit mapping is set in parse-plugins.xml. I don't think we want to have to specify explicitly that Tika should be used in all the mappings, and reserve them for the cases when a parser must be used instead of Tika. What we could do, though, is that in the cases where no explicit mapping is set for a mime-type, Tika (or any parser marked as supporting any mime-type) would be put first in the list of discovered parsers, so it would remain the default choice unless an explicit mapping is set (even if a plugin is loaded and can handle the type). Makes sense?
The ParserFactory section of the patch v3 can be replaced by:

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===================================================================
--- src/java/org/apache/nutch/parse/ParserFactory.java (revision 909059)
+++ src/java/org/apache/nutch/parse/ParserFactory.java (working copy)
@@ -348,11 +348,23 @@
           contentType)) {
         extList.add(extensions[i]);
       }
+      else if ("*".equals(extensions[i].getAttribute("contentType"))) {
+        // default plugins get the priority
+        extList.add(0, extensions[i]);
+      }
     }
     if (extList.size() > 0) {
       if (LOG.isInfoEnabled()) {
-        LOG.info("The parsing plugins: " + extList +
+        StringBuffer extensionsIDs = new StringBuffer("[");
+        boolean isFirst = true;
+        for (Extension ext : extList) {
+          if (!isFirst) extensionsIDs.append(" - ");
+          else isFirst = false;
+          extensionsIDs.append(ext.getId());
+        }
+        extensionsIDs.append("]");
+        LOG.info("The parsing plugins: " + extensionsIDs.toString() +
                  " are enabled via the plugin.includes system " +
                  "property, and all claim to support the content type " +
                  contentType + ", but they are not mapped to it in the " +
@@ -369,7 +381,7 @@
   private boolean match(Extension extension, String id, String type) {
     return ((id.equals(extension.getId())) &&
-            (type.equals(extension.getAttribute("contentType")) ||
+            (type.equals(extension.getAttribute("contentType")) || extension.getAttribute("contentType").equals("*") ||
             type.equals(DEFAULT_PLUGIN)));
   }
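The precedence rule in the patch can be modelled in isolation (a hypothetical simplified sketch, not Nutch's actual classes): exact mime-type matches are appended in discovery order, while a wildcard parser such as parse-tika is promoted to the head of the candidate list, so it wins whenever parse-plugins.xml has no explicit mapping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified model of the ParserFactory change: plugins whose
// declared contentType is "*" (e.g. parse-tika) are inserted at the head of
// the candidate list, so they are chosen when no explicit mapping exists.
public class ParserPriorityDemo {
    // plugins: pairs of {pluginId, declaredContentType}
    public static List<String> candidates(String mimeType, String[][] plugins) {
        List<String> extList = new ArrayList<>();
        for (String[] p : plugins) {
            if (p[1].equals(mimeType)) {
                extList.add(p[0]);        // exact match: appended in discovery order
            } else if ("*".equals(p[1])) {
                extList.add(0, p[0]);     // wildcard parser: promoted to the front
            }
        }
        return extList;
    }
}
```

With parse-html declaring text/html and parse-tika declaring *, the candidate list for text/html comes back with parse-tika first, which is exactly the "default choice unless an explicit mapping is set" behaviour described above.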
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832583#action_12832583 ] Julien Nioche commented on NUTCH-766: - @Chris: did you run ant -f src/plugin/parse-tika/build-ivy.xml between steps 5 and 6? This is required in order to populate the lib directory automatically.
[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-787: Fix Version/s: 1.1 Upgrade Lucene to 3.0.0. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Priority: Trivial Fix For: 1.1 Attachments: NUTCH-787.patch
[jira] Created: (NUTCH-786) Better list of suffix domains
Better list of suffix domains - Key: NUTCH-786 URL: https://issues.apache.org/jira/browse/NUTCH-786 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Small improvement to the content of domain-suffixes.xml: added compound TLDs for .ar, .co, .id, .il, .mx, .nz and .za
[jira] Updated: (NUTCH-786) Better list of suffix domains
[ https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-786: Attachment: NUTCH-786.patch
[jira] Closed: (NUTCH-786) Better list of suffix domains
[ https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-786. --- Resolution: Fixed Committed revision 906907
[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828548#action_12828548 ] Julien Nioche commented on NUTCH-781: - "did you forget to update conf/tika-mimetypes.xml?" Indeed - well spotted, thanks. "Related question: do we actually need our own version of the tika config anymore? I saw there were some old issues that were fixed in the custom version but I would guess those changes, if important, have already made their way into Tika?" The version we had was the same as the one provided by Tika 0.4, so I suppose we could safely rely on the Tika defaults. MimeUtil currently requires tika-mimetypes.xml to be available in the classpath, but we could modify that so that it uses the default version from the tika jar if nothing can be found in conf. Let's put that in a separate JIRA issue if we really want it; in the meantime I'll commit the v0.6 of tika-mimetypes.xml. J.

Update Tika to v0.6 for the MimeType detection --- Key: NUTCH-781 URL: https://issues.apache.org/jira/browse/NUTCH-781 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 [from announcement] Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Apache Tika 0.6 contains a number of improvements and bug fixes. Details can be found in the changes file: http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt
[jira] Created: (NUTCH-781) Update Tika to v0.6 for the MimeType detection
Update Tika to v0.6 for the MimeType detection --- Key: NUTCH-781 URL: https://issues.apache.org/jira/browse/NUTCH-781 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1
[jira] Resolved: (NUTCH-781) Update Tika to v0.6 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-781. - Resolution: Fixed Committed revision 905228
[jira] Closed: (NUTCH-781) Update Tika to v0.6 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-781. ---
[jira] Updated: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: NUTCH-766-v3.patch Updated version of the plugin: uses Tika 0.6
[jira] Updated: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: (was: Nutch-766.ParserFactory.patch)
In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a parser other than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling; it also means that we can use the HTMLParseFilters in exactly the same way. The main difference, though, is that HTMLParseFilters are no longer limited to HTML documents, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it. The following libraries are required in the lib/ directory of the tika-parser:
<library name="asm-3.1.jar"/>
<library name="bcmail-jdk15-144.jar"/>
<library name="commons-compress-1.0.jar"/>
<library name="commons-logging-1.1.1.jar"/>
<library name="dom4j-1.6.1.jar"/>
<library name="fontbox-0.8.0-incubator.jar"/>
<library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
<library name="hamcrest-core-1.1.jar"/>
<library name="jce-jdk13-144.jar"/>
<library name="jempbox-0.8.0-incubator.jar"/>
<library name="metadata-extractor-2.4.0-beta-1.jar"/>
<library name="mockito-core-1.7.jar"/>
<library name="objenesis-1.0.jar"/>
<library name="ooxml-schemas-1.0.jar"/>
<library name="pdfbox-0.8.0-incubating.jar"/>
<library name="poi-3.5-FINAL.jar"/>
<library name="poi-ooxml-3.5-FINAL.jar"/>
<library name="poi-scratchpad-3.5-FINAL.jar"/>
<library name="tagsoup-1.2.jar"/>
<library name="tika-parsers-0.5-SNAPSHOT.jar"/>
<library name="xml-apis-1.0.b2.jar"/>
<library name="xmlbeans-2.3.0.jar"/>
There is a small test suite which needs to be improved.
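To illustrate the wildcard mimetype mentioned above, here is a rough sketch of how the parse-tika plugin descriptor might declare it. The element and attribute layout follows Nutch's usual plugin.xml conventions, but the ids and content shown are assumptions for illustration, not the committed descriptor:

```xml
<!-- Hypothetical excerpt from src/plugin/parse-tika/plugin.xml -->
<extension id="org.apache.nutch.parse.tika"
           name="TikaParser"
           point="org.apache.nutch.parse.Parser">
  <implementation id="org.apache.nutch.parse.tika.TikaParser"
                  class="org.apache.nutch.parse.tika.TikaParser">
    <!-- "*" marks the parser as a candidate for all MIME types;
         ParserFactory was modified to honour this wildcard. -->
    <parameter name="contentType" value="*"/>
  </implementation>
</extension>
```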
We will need to have a look at each individual format and check that it is covered by Tika and, if so, to the same extent; the Wiki is probably the right place for this. The language identifier (which is an HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
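The SAX-to-DOM approach described above can be sketched with plain JDK classes. This is an illustrative sketch only: there is no Tika dependency here, the XHTML string stands in for the SAX events Tika would emit, and the class and method names are invented, not taken from the patch:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch: materialise parser output as a DOM, then run the
// same kind of link-detection utility the html-plugin applies.
public class XhtmlDomSketch {

    // Build a DOM Document from an XHTML string (stand-in for the SAX
    // events Tika would feed into a DOM-building ContentHandler).
    public static Document toDom(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Collect href attributes of <a> elements, as outlink detection would.
    public static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            String href = ((Element) anchors.item(i)).getAttribute("href");
            if (!href.isEmpty()) {
                links.add(href);
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        Document doc = toDom(
            "<html><body><a href=\"http://example.com/\">example</a></body></html>");
        System.out.println(extractLinks(doc));
    }
}
```

Because the filters operate on the DOM rather than on raw HTML, the same HTMLParseFilter chain can run whatever the original document format was.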
[jira] Updated: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: (was: NUTCH-766.tika.patch) Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-782) Ability to order htmlparsefilters
Ability to order htmlparsefilters - Key: NUTCH-782 URL: https://issues.apache.org/jira/browse/NUTCH-782 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-782.patch Patch which adds a new parameter 'htmlparsefilter.order' which specifies the order in which HTMLParse filters are applied. HTMLParse filter ordering MAY have an impact on the end result, as some filters could rely on the metadata generated by a previous filter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
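As a sketch, such an ordering could be expressed in nutch-site.xml as below. The property name comes from the patch; the value format (space-separated extension ids) and the second plugin id are assumptions for illustration:

```xml
<!-- nutch-site.xml: run HTMLParse filters in this explicit order.
     The second extension id below is illustrative, not from the patch. -->
<property>
  <name>htmlparsefilter.order</name>
  <value>org.apache.nutch.analysis.lang.HTMLLanguageParser org.example.nutch.MyMetadataFilter</value>
</property>
```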
[jira] Updated: (NUTCH-782) Ability to order htmlparsefilters
[ https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-782: Attachment: NUTCH-782.patch Ability to order htmlparsefilters - Key: NUTCH-782 URL: https://issues.apache.org/jira/browse/NUTCH-782 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-782.patch Patch which adds a new parameter 'htmlparsefilter.order' which specifies the order in which HTMLParse filters are applied. HTMLParse filter ordering MAY have an impact on the end result, as some filters could rely on the metadata generated by a previous filter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-783) IndexerChecker Utility
IndexerChecker Utility - Key: NUTCH-783 URL: https://issues.apache.org/jira/browse/NUTCH-783 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Fix For: 1.1 This patch contains a new utility which allows checking the configuration of the indexing filters. The IndexerChecker reads and parses a URL, runs the indexing filters on it, and displays the fields obtained along with the first 100 characters of their value. It can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker http://www.lemonde.fr/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-783) IndexerChecker Utility
[ https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-783: --- Assignee: Julien Nioche IndexerChecker Utility - Key: NUTCH-783 URL: https://issues.apache.org/jira/browse/NUTCH-783 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-783.patch This patch contains a new utility which allows checking the configuration of the indexing filters. The IndexerChecker reads and parses a URL, runs the indexing filters on it, and displays the fields obtained along with the first 100 characters of their value. It can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker http://www.lemonde.fr/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-783) IndexerChecker Utility
[ https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-783: Attachment: NUTCH-783.patch IndexerChecker Utility - Key: NUTCH-783 URL: https://issues.apache.org/jira/browse/NUTCH-783 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Fix For: 1.1 Attachments: NUTCH-783.patch This patch contains a new utility which allows checking the configuration of the indexing filters. The IndexerChecker reads and parses a URL, runs the indexing filters on it, and displays the fields obtained along with the first 100 characters of their value. It can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker http://www.lemonde.fr/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-779: --- Assignee: Julien Nioche Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-779 The attached patch allows passing parse metadata to the corresponding entry of the crawldb. Comments are welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-779: Attachment: NUTCH-779-v2.patch Improved version of the patch. Followed AB's recommendations: renamed STATUS_PARSE_META, added a description for the param 'db.parsemeta.to.crawldb' in nutch-default.xml, and fixed an issue with IndexerMapReduce. Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-779, NUTCH-779-v2.patch The attached patch allows passing parse metadata to the corresponding entry of the crawldb. Comments are welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
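A sketch of how the property entry mentioned above might look in nutch-default.xml. The property name is from the patch discussion; the value shown and the description wording are assumptions, not the patch's actual text:

```xml
<property>
  <name>db.parsemeta.to.crawldb</name>
  <!-- Assumed semantics: comma-separated list of parse metadata keys to
       copy into the corresponding crawldb entry; empty disables copying. -->
  <value></value>
</property>
```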
[jira] Created: (NUTCH-784) CrawlDBScanner
CrawlDBScanner --- Key: NUTCH-784 URL: https://issues.apache.org/jira/browse/NUTCH-784 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-784.patch The patch file contains a utility which dumps all the entries whose URL matches a regular expression. The dump mechanism of the crawldb reader is not very useful on large crawldbs, as the output can be extremely large, and the -url function can't help if we don't know which URL we want to look at. The CrawlDBScanner can either generate a text representation of the CrawlDatum-s or binary objects which can then be used as a new CrawlDB. Usage: CrawlDBScanner crawldb output regex [-s status] -text
regex: regular expression on the crawldb key
-s status: constraint on the status of the crawldb entries, e.g. db_fetched, db_unfetched
-text: if this parameter is used, the output will be in TextOutputFormat; otherwise it generates a 'normal' crawldb with MapFileOutputFormat
For instance, the command below: ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s db_fetched -text will generate a text file /tmp/amazon-dump containing all the entries of the crawldb matching the regexp .+amazon.com.* and having a status of db_fetched. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-784) CrawlDBScanner
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-784: Attachment: NUTCH-784.patch CrawlDBScanner --- Key: NUTCH-784 URL: https://issues.apache.org/jira/browse/NUTCH-784 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-784.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL
Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL --- Key: NUTCH-785 URL: https://issues.apache.org/jira/browse/NUTCH-785 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 When following redirections, the Fetcher does not copy the metadata from the original URL to the new one, nor does it call the method scfilters.initialScore. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-785: Attachment: NUTCH-785.patch Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL --- Key: NUTCH-785 URL: https://issues.apache.org/jira/browse/NUTCH-785 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-785.patch When following redirections, the Fetcher does not copy the metadata from the original URL to the new one, nor does it call the method scfilters.initialScore. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805892#action_12805892 ] Julien Nioche commented on NUTCH-766: - Here is a slightly better version of the patch which: • fixes a small bug in the Tika parser (the API has changed slightly between 1.5beta and 1.5) • fixes a bug with the TestParserFactory • adds the tika-plugin to the list of plugins to be built in src/plugin/build.xml • limits public exposure of methods and classes (see Sami's comment) • modifies parse-plugins.xml: added parse-tika and commented out associations between some mime-types and the old parsers I've also added an ANT script which uses Ivy to pull the dependencies and copy them into the lib dir. Obviously this won't be needed once the plugin is committed, but it should simplify the initial testing. All you need to do after applying the patch is: cd src/plugin/parse-tika/ ant -f build-ivy.xml I am also attaching the content of the sample directory as an archive - just unzip it into src/plugin/parse-tika/ before calling 'ant test-plugins'. Julien Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch, NUTCH-766.v2, sample.tar.gz
[jira] Updated: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: NUTCH-766.v2 sample.tar.gz - new version of the patch + archive containing the binary docs used for testing Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch, NUTCH-766.v2, sample.tar.gz -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-778) Running Nutch On linux having whoami exception?
[ https://issues.apache.org/jira/browse/NUTCH-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-778. - Resolution: Invalid Fix Version/s: (was: 1.0.0) This is likely to be a problem with the Hadoop configuration or machine setup. It is not a Nutch issue as such, so I'll mark this as invalid. Running Nutch On linux having whoami exception? --- Key: NUTCH-778 URL: https://issues.apache.org/jira/browse/NUTCH-778 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: Linux (RedHat) Reporter: Prakash Panjwani Original Estimate: 1h Remaining Estimate: 1h I want to run Nutch on Linux. I have logged in as the root user, set all the environment variables and Nutch file settings, and created a url.txt file containing the URLs to crawl. When I try to run Nutch using the following command: bin/nutch crawl urls -dir pra it generates the following exception: crawl started in: pra rootUrlDir = urls threads = 10 depth = 5 Injector: starting Injector: crawlDb: pra/crawldb Injector: urlDir: urls Injector: Converting injected urls to crawl db entries. Exception in thread main java.io.IOException: Failed to get the current user's information.
at org.apache.hadoop.mapred.JobClient.getUGI(JobClient.java:717)
at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:592)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)
Caused by: javax.security.auth.login.LoginException: Login failed: Cannot run program whoami: java.io.IOException: error=12, Cannot allocate memory
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.mapred.JobClient.getUGI(JobClient.java:715)
... 5 more
The server has enough space to run any Java application. I have attached the statistics: total used free Mem: 524320 194632 329688 -/+ buffers/cache: 194632 329688 Swap: 2475680 0 2475680 Total: 300 1946322805368 Is this sufficient memory for Nutch? Please, someone help me; I am new to Linux and Nutch. Thanks in advance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803670#action_12803670 ] Julien Nioche commented on NUTCH-766: - I think the end result of this plugin should be replacing all Tika supported parsers (or the parsers we choose to replace) with the TikaParser and not to build a parallel ways to parse same formats. That's how I see it - it's just that we have the option of choosing when to use Tika or not for a given mimetype. It is used by default unless an association is created between a parser implementation and a mimetype in the parse-plugins.xml So I think we need to copy all of the the existing test files and moveadapt the existing testcases fully before committing this. That is a good way of seeing that the parse result is what is expected and also find out about possible differences with old vs. Tika version. Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins. BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the current version of Tika and the existing Nutch parsers Even if we decide to keep using the old plugins for some of the formats to start with, we'd still be able to the Tika plugin by default for the ones which have already the same coverage Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. 
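The mimetype-to-parser association mentioned in the comment above lives in parse-plugins.xml. As a hedged sketch (the plugin id and mime type here are illustrative, not taken from the actual file), an explicit override keeping a legacy parser for one format while everything else falls through to the tika plugin's default registration might look like:

```xml
<parse-plugins>
  <!-- illustrative override: keep a legacy plugin for one mime type -->
  <mimeType name="text/rtf">
    <plugin id="parse-rtf" />
  </mimeType>
  <!-- any mime type without an explicit entry falls through to the
       tika plugin, which registers itself for all mime types -->
</parse-plugins>
```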
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802172#action_12802172 ] Julien Nioche commented on NUTCH-779: - The property needs some documentation in nutch-default.xml plus a sensible default. Sure - I just wanted the general approach to be checked before doing the tedious bits. Do you think it makes sense to do things the way I suggested, or would you use the ScoringFilters instead? Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Attachments: NUTCH-779 The attached patch makes it possible to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
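The patch itself is attached to the issue, but the general idea under discussion - copying a configurable subset of parse metadata into the crawldb entry - can be sketched with plain collections. The class, method, and property semantics below are hypothetical illustrations, not code from the patch:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MetadataFilter {

    // Copy only the whitelisted keys from the parse metadata into the
    // map destined for the crawldb entry. The comma-separated whitelist
    // plays the role of the configurable property discussed above
    // (names here are hypothetical, not taken from the patch).
    static Map<String, String> filter(Map<String, String> parseMeta,
                                      String whitelist) {
        Set<String> keep =
                new HashSet<>(Arrays.asList(whitelist.split("\\s*,\\s*")));
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : parseMeta.entrySet()) {
            if (keep.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

A whitelist keeps the crawldb compact: only keys the administrator explicitly names survive the parse-to-update hop.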
[jira] Created: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Attachments: NUTCH-779 The attached patch makes it possible to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome
[jira] Updated: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-779: Attachment: NUTCH-779
[jira] Closed: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-767. --- Resolution: Fixed Committed revision 897825 Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767-part2.patch, NUTCH-767-part3.patch, NUTCH-767.patch Original Estimate: 0h Remaining Estimate: 0h Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split into several jars, so we need to place tika-core.jar in the main nutch lib.
[jira] Resolved: (NUTCH-751) Upgrade version of HttpClient
[ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-751. - Resolution: Later The changes in the underlying API are quite substantial and this would need a bit of work. Maybe this could be done as part of crawler-commons? In the meantime I'll just mark it as 'later'. Upgrade version of HttpClient -- Key: NUTCH-751 URL: https://issues.apache.org/jira/browse/NUTCH-751 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Julien Nioche The existing version of commons http-client (3.01) should be replaced with the latest version from http://hc.apache.org/. Currently the only way of using the https protocol is to enable http-client. Version 3.01 is buggy and causes a lot of issues which have been reported before. Apparently the new version has been redesigned and should fix them. The old v3.01 is too unstable to be used on a large scale. I will try to send a patch in the next couple of weeks but would love to hear your thoughts on this. J.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798727#action_12798727 ] Julien Nioche commented on NUTCH-766: - Hi Chris, No worries, I'd rather wait for you to have a look at it. It's quite a big change and it would be better if someone else had a look at it. Being the author I might miss something obvious. Thanks J. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates parsing to Tika but can still coexist with the existing parsing plugins, which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress; your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers); in the work described here we decided to put the libs in 2 different places:
NUTCH_HOME/lib: tika-core.jar
NUTCH_HOME/tika-plugin/lib: tika-parsers.jar
Tika being used by the core only for its MimeType functionalities, we only need to put tika-core at the main lib level, whereas the tika plugin obviously needs tika-parsers.jar plus all the jars used internally by Tika. Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself, or avoided by refactoring the mimetype part of Nutch using extension points.
Unlike most other parsers, Tika handles more than one Mime-type, which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling; this also means that we can use the HTMLParseFilters in exactly the same way. The main difference, though, is that HTMLParseFilters are not limited to HTML documents anymore, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it.
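The SAX-to-DOM conversion described above can be done with the standard JAXP identity transformer. This is a Tika-free sketch of the bridge technique only; the class and method names are illustrative, not the plugin's actual code:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToDom {

    // Bridge SAX events into a DOM Document via the JAXP identity
    // transformer: a TransformerHandler is itself a ContentHandler,
    // and its output is collected into a DOMResult.
    public static Document toDom(String xhtml) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);

        // Feed the handler from any SAX source; in the real plugin the
        // events would come from Tika's parser instead of an XMLReader.
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document doc =
                toDom("<html><body><a href=\"x.html\">link</a></body></html>");
        // Once in DOM form, link detection is a simple tree walk.
        System.out.println(doc.getElementsByTagName("a").getLength()); // prints 1
    }
}
```

Because the result is an ordinary DOM tree, the same link-extraction and metatag utilities can run regardless of the original document format.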
The following libraries are required in the lib/ directory of the tika-parser:
<library name="asm-3.1.jar"/>
<library name="bcmail-jdk15-144.jar"/>
<library name="commons-compress-1.0.jar"/>
<library name="commons-logging-1.1.1.jar"/>
<library name="dom4j-1.6.1.jar"/>
<library name="fontbox-0.8.0-incubator.jar"/>
<library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
<library name="hamcrest-core-1.1.jar"/>
<library name="jce-jdk13-144.jar"/>
<library name="jempbox-0.8.0-incubator.jar"/>
<library name="metadata-extractor-2.4.0-beta-1.jar"/>
<library name="mockito-core-1.7.jar"/>
<library name="objenesis-1.0.jar"/>
<library name="ooxml-schemas-1.0.jar"/>
<library name="pdfbox-0.8.0-incubating.jar"/>
<library name="poi-3.5-FINAL.jar"/>
<library name="poi-ooxml-3.5-FINAL.jar"/>
<library name="poi-scratchpad-3.5-FINAL.jar"/>
<library name="tagsoup-1.2.jar"/>
<library name="tika-parsers-0.5-SNAPSHOT.jar"/>
<library name="xml-apis-1.0.b2.jar"/>
<library name="xmlbeans-2.3.0.jar"/>
There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and, if so, to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien http://www.digitalpebble.com
[jira] Assigned: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-269: --- Assignee: Julien Nioche CrawlDbReducer: OOME because no upper-bound on inlinks count Key: NUTCH-269 URL: https://issues.apache.org/jira/browse/NUTCH-269 Project: Nutch Issue Type: Bug Reporter: stack Assignee: Julien Nioche Priority: Trivial Attachments: too-many-links.patch, too-many-links2.patch A CrawlDB update repeatedly OOME'd because a URL had hundreds of thousands of inlinks (the British Foreign Office likes putting a clear.gif multiple times into each page: http://www.fco.gov.uk/Xcelerate/graphics/images/fcomain/clear.gif).
[jira] Commented: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797990#action_12797990 ] Julien Nioche commented on NUTCH-269: - I will shortly commit a variant of this approach whereby the inlinks are stored in a priority queue in order to keep the best-scoring ones. The size of the queue is determined by the parameter db.update.max.inlinks which has a default value of 1. CrawlDbReducer: OOME because no upper-bound on inlinks count Key: NUTCH-269 URL: https://issues.apache.org/jira/browse/NUTCH-269 Project: Nutch Issue Type: Bug Reporter: stack Assignee: Julien Nioche Priority: Trivial Attachments: too-many-links.patch, too-many-links2.patch
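The bounded priority-queue idea in the comment above can be sketched as follows. The Inlink record and method names are illustrative stand-ins; the actual patch operates on Nutch's CrawlDatum structures and reads the cap from db.update.max.inlinks:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class InlinkCapper {

    // Hypothetical stand-in for the inlink entries the reducer sees.
    record Inlink(String fromUrl, float score) {}

    // Keep only the maxInlinks best-scoring inlinks. A min-heap bounded
    // at the cap evicts the lowest-scoring entry whenever it overflows,
    // so memory stays bounded no matter how many inlinks one URL has.
    static List<Inlink> capInlinks(Iterable<Inlink> inlinks, int maxInlinks) {
        PriorityQueue<Inlink> best =
                new PriorityQueue<>(Comparator.comparingDouble(Inlink::score));
        for (Inlink in : inlinks) {
            best.add(in);
            if (best.size() > maxInlinks) {
                best.poll(); // drops the current lowest score
            }
        }
        return new ArrayList<>(best);
    }
}
```

Because the heap never grows beyond the cap, the pathological clear.gif case described in the issue costs O(maxInlinks) memory instead of accumulating every inlink.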
[jira] Resolved: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-269. - Resolution: Fixed Fix Version/s: 1.1 Committed revision 897180 CrawlDbReducer: OOME because no upper-bound on inlinks count Key: NUTCH-269 URL: https://issues.apache.org/jira/browse/NUTCH-269 Project: Nutch Issue Type: Bug Reporter: stack Assignee: Julien Nioche Priority: Trivial Fix For: 1.1 Attachments: too-many-links.patch, too-many-links2.patch
[jira] Commented: (NUTCH-776) Configurable queue depth
[ https://issues.apache.org/jira/browse/NUTCH-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797653#action_12797653 ] Julien Nioche commented on NUTCH-776: - Did you notice any improvement in the fetch rate after I suggested on the mailing list to use a value larger than 50? Does the memory consumption remain reasonable? Configurable queue depth Key: NUTCH-776 URL: https://issues.apache.org/jira/browse/NUTCH-776 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.1 Reporter: MilleBii Priority: Minor Fix For: 1.1 I propose that we create a configurable item for the queue depth in Fetcher.java instead of the hard-coded value of 50. Key name: fetcher.queues.depth Default value: remains 50 (of course)
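The proposal above amounts to one configuration lookup with a fallback. A minimal sketch, using plain java.util.Properties as a stand-in for the Hadoop Configuration object Nutch actually passes around:

```java
import java.util.Properties;

public class FetcherConfig {

    static final int DEFAULT_QUEUE_DEPTH = 50;

    // Look up fetcher.queues.depth, falling back to the formerly
    // hard-coded 50 when the key is absent.
    static int queueDepth(Properties conf) {
        return Integer.parseInt(conf.getProperty(
                "fetcher.queues.depth",
                Integer.toString(DEFAULT_QUEUE_DEPTH)));
    }
}
```

Keeping 50 as the default preserves the old behaviour for existing installations while letting users raise the depth to trade memory for fetch throughput.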