[jira] Updated: (NUTCH-814) SegmentMerger bug
[ https://issues.apache.org/jira/browse/NUTCH-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-814: Attachment: merger.patch Patch fixing the issue, and a unit test. I will commit this shortly. SegmentMerger bug - Key: NUTCH-814 URL: https://issues.apache.org/jira/browse/NUTCH-814 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Dennis Kubes Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: merger.patch Dennis reported: {quote} In the SegmentMerger.java file about line 150 we have this: final SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(job), fSplit.getPath(), job); Then about line 166 in the record reader we have this: boolean res = reader.next(key, w); If I am reading that right, that would mean that the map task would loop over all records for a given file and not just a given split. {quote} Right, this should instead use SequenceFileRecordReader, which already has the logic to handle splits. Patch coming shortly - thanks for spotting this! This could be the reason for the out-of-disk-space errors that many users reported. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
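For illustration, here is a minimal sketch (not the actual merger.patch) of the kind of record reader the fix implies: delegating to Hadoop's SequenceFileRecordReader, which honours split boundaries, instead of opening a raw SequenceFile.Reader on the split's path. The class and field names below are assumptions.

```java
// Hedged sketch only -- not the committed SegmentMerger code.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.SequenceFileRecordReader;

public class SplitBoundRecordReader {
  private final SequenceFileRecordReader<Text, Writable> reader;

  public SplitBoundRecordReader(Configuration conf, FileSplit split) throws IOException {
    // SequenceFileRecordReader seeks to the split's start offset and stops at its end,
    // unlike new SequenceFile.Reader(fs, fSplit.getPath(), conf), which reads the whole file.
    reader = new SequenceFileRecordReader<Text, Writable>(conf, split);
  }

  public boolean next(Text key, Writable value) throws IOException {
    return reader.next(key, value); // false once this split (not the whole file) is exhausted
  }

  public void close() throws IOException {
    reader.close();
  }
}
```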
[jira] Work stopped: (NUTCH-466) Flexible segment format
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-466 stopped by Andrzej Bialecki . Flexible segment format --- Key: NUTCH-466 URL: https://issues.apache.org/jira/browse/NUTCH-466 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: ParseFilters.java, segmentparts.patch In many situations it is necessary to store more data associated with pages than it's possible now with the current segment format. Quite often it's a binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys. Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment parts, with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios. Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts. Example applications: * storing HTML previews of non-HTML pages, such as PDF, PS and Office documents * storing pre-tokenized version of plain text for faster snippet generation * storing linguistically tagged text for sophisticated data mining * storing image thumbnails etc, etc ... I'm going to prepare a patchset shortly. Any comments and suggestions are welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
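To make the proposal concrete, a hedged sketch of what an arbitrarily named segment part could look like under this scheme: just another MapFile directory inside the segment, keyed by URL and holding any Writable value. The part name "html_preview" and the use of Text values are made up for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class CustomSegmentPart {
  // Write one entry into a hypothetical "html_preview" part of a segment.
  public static void write(Configuration conf, Path segment) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path(segment, "html_preview");
    MapFile.Writer writer = new MapFile.Writer(conf, fs, part.toString(), Text.class, Text.class);
    writer.append(new Text("http://example.com/doc.pdf"), new Text("<html>preview</html>"));
    writer.close();
  }

  // Look up the value stored for a URL in the same part.
  public static Text read(Configuration conf, Path segment, String url) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path(segment, "html_preview");
    MapFile.Reader reader = new MapFile.Reader(fs, part.toString(), conf);
    Text value = new Text();
    reader.get(new Text(url), value); // positions on the key and fills value if present
    reader.close();
    return value;
  }
}
```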
[jira] Updated: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE
[ https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-812: Affects Version/s: 1.1 Priority: Critical (was: Major) Crawl.java incorrectly uses the Generator API resulting in NPE -- Key: NUTCH-812 URL: https://issues.apache.org/jira/browse/NUTCH-812 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Andrzej Bialecki Priority: Critical As reported by Phil Barnett on nutch-user: {quote} The Fix. In line 131 of Crawl.java Generate no longer returns segments like it used to. Now it returns segs. line 131 needs to read If (segs == null) Instead of the current If (segments == null) After that change and a recompile, crawl is working just fine. {quote} -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[VOTE] Board resolution for Nutch as TLP
Hi, Following the discussion, below is the text of the proposed Board Resolution to vote upon. [] +1. Request the Board make Nutch a TLP [] +0. I don't feel strongly about it, but I'm okay with this. [] -1. No, don't request the Board make Nutch a TLP, and here are my reasons... This is a majority count vote (i.e. no vetoes). The vote is open for 72 hours. Here's my +1. === X. Establish the Apache Nutch Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web search platform for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web crawling platform; and be it further RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project: • Andrzej Bialecki a...@... • Otis Gospodnetic o...@... • Dogacan Guney doga...@... • Dennis Kubes ku...@... • Chris Mattmann mattm...@... • Julien Nioche jnio...@... • Sami Siren si...@... RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged. NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. === -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Hold on... (Re: [VOTE] Board resolution for Nutch as TLP)
On 2010-04-12 12:57, Andrzej Bialecki wrote: Hi, Following the discussion, below is the text of the proposed Board Resolution to vote upon. Ehh, scrap that ... I missed one occurrence of the crawling platform. Resending... -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[VOTE 2] Board resolution for Nutch as TLP
Hi, Take two, after s/crawling/search/ ... Following the discussion, below is the text of the proposed Board Resolution to vote upon. [] +1. Request the Board make Nutch a TLP [] +0. I don't feel strongly about it, but I'm okay with this. [] -1. No, don't request the Board make Nutch a TLP, and here are my reasons... This is a majority count vote (i.e. no vetoes). The vote is open for 72 hours. Here's my +1. === X. Establish the Apache Nutch Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web search platform for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web search platform; and be it further RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project: • Andrzej Bialecki a...@... • Otis Gospodnetic o...@... • Dogacan Guney doga...@... • Dennis Kubes ku...@... • Chris Mattmann mattm...@... • Julien Nioche jnio...@... • Sami Siren si...@... RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged. NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. === -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [DISCUSS] Board resolution for Nutch as TLP
On 2010-04-10 04:13, Mattmann, Chris A (388J) wrote: Hi Andrzej, +1, with the following amendment: RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Nutch Project are hereafter discharged. This should read: RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged. Good catch, thanks. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [DISCUSS] Board resolution for Nutch as TLP
On 2010-04-10 15:32, Jukka Zitting wrote: Hi, On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote: WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public. Would it make sense to simplify the scope to ... open-source software related to large-scale web crawling for distribution at no charge to the public? Yes, that's a good change too. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[DISCUSS] Board resolution for Nutch as TLP
Hi, I was told that the next step is to come up with the proposed Board resolution and vote it among committers. Here's the proposed text (shameless copypaste from Tika and Mahout proposals). IMPORTANT NOTE: I removed from the members of the PMC those existing Nutch committers that haven't been active for more than 1 year, with the intention of moving them to Emeritus status. If any one of these people feels left out and would like to become an active committer in the project, please let us know and we will gladly welcome you back :) The text of the resolution follows. Committers, please read it and optionally comment on the salient points of the text, the rest is boilerplate. If there's an overall consensus I will call for a formal vote to submit this proposal to the Board. == X. Establish the Apache Nutch Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web crawling platform; and be it further RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project: • Andrzej Bialecki a...@... • Otis Gospodnetic o...@... • Dogacan Guney doga...@... • Dennis Kubes ku...@... • Chris Mattmann mattm...@... • Julien Nioche jnio...@... • Sami Siren si...@... RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Nutch Project are hereafter discharged. NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. = -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch 2.0 roadmap
On 2010-04-07 18:54, Doğacan Güney wrote: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Hmm .. this puzzles me, do you think we should port changes from 1.1 to nutchbase? I thought we should do it the other way around, i.e. merge nutchbase bits to trunk. * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. Again, the advantage of DataNucleus is that we don't have to handcraft all the mid- to low-level mappings, just the mid-level ones (JOQL or whatever) - the cost of maintenance is lower, and the number of backends that are supported out of the box is larger. Of course, this is just IMHO - we won't know for sure until we try to use both your custom ORM and DataNucleus... -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch 2.0 roadmap
On 2010-04-07 19:24, Enis Söztutar wrote: Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. So, it seems that at some point, we need to bite the bullet, and refactor plugins, dropping backwards compatibility. Right, that was my point - now is the time to break it, with the cut-over to 2.0, and leaving 1.1 branch in a good shape, to serve well enough in the interim period. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch 2.0 roadmap
On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. * plugin cleanup : Tika only for parsing - get rid of everything else? Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. * remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical tag etc... Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc. I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update? Definitely. :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Question: Nutch 0.8.2 and Nutch 0.7.3?
On 2010-04-04 02:59, Mattmann, Chris A (388J) wrote: Hey Guys, Question. I see 2 releases that haven't been cut in JIRA: 0.8.2: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312064 0.7.3: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312176 I'm happy to cut 0.8.2 as part of the 1.1 effort, to get it out the door. However, I have a question: is this Nutch 0.8.2 in SVN? http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.8/ That's the code that was intended to become 0.8.2 ... However, I'm not sure whether there's any benefit in releasing either of these. Those who really had the need to track this branch (or 0.7) likely used the code from this branch even though it wasn't released. And I believe we are not interested in maintaining a new release based on this code...? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [VOTE] Apache Tika 0.7 Release Candidate #1
On 2010-04-02 16:14, Mattmann, Chris A (388J) wrote: * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That OK, Nutchers? Yes - thanks! -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851331#action_12851331 ] Andrzej Bialecki commented on NUTCH-789: - There are no diffs, so it's difficult to figure out what's changed ... I think that Tika will soon release v. 0.7 which may also impact this patch if we decide to upgrade before our release. I asked the Tika guys about their release, let's wait a couple days more. Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: fetcher Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.1 Attachments: NutchTikaConfig.java, TikaParser.java As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-784) CrawlDBScanner
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850896#action_12850896 ] Andrzej Bialecki commented on NUTCH-784: - This should have been reviewed first - I don't question the usefulness of this class, but I think that this should have been added as an option to CrawlDbReader. As it is now we get a new tool with a cryptic name that performs a function that is a variant of another existing tool... CrawlDBScanner --- Key: NUTCH-784 URL: https://issues.apache.org/jira/browse/NUTCH-784 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-784.patch The patch file contains a utility which dumps all the entries matching a regular expression on their URL. The dump mechanism of the crawldb reader is not very useful on large crawldbs as the output can be extremely large and the -url function can't help if we don't know what url we want to have a look at. The CrawlDBScanner can either generate a text representation of the CrawlDatum-s or binary objects which can then be used as a new CrawlDB. Usage: CrawlDBScanner <crawldb> <output> <regex> [-s status] -text regex: regular expression on the crawldb key -s status : constraint on the status of the crawldb entries e.g. db_fetched, db_unfetched -text : if this parameter is used, the output will be of TextOutputFormat; otherwise it generates a 'normal' crawldb with the MapFileOutputFormat for instance the command below : ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s db_fetched -text will generate a text file /tmp/amazon-dump containing all the entries of the crawldb matching the regexp .+amazon.com.* and having a status of db_fetched -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850931#action_12850931 ] Andrzej Bialecki commented on NUTCH-785: - +1. The scoring api should allow us to set this metadata in one call, but changing the API now would be problematic. Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL --- Key: NUTCH-785 URL: https://issues.apache.org/jira/browse/NUTCH-785 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-785.patch When following the redirections, the Fetcher does not copy the metadata from the original URL to the new one or calls the method scfilters.initialScore -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
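A hedged sketch of what the fix amounts to conceptually (names assumed from the issue description, not copied from NUTCH-785.patch): when the fetcher creates the datum for a redirect target, carry over the origin's metadata and let the scoring filters assign an initial score.

```java
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.ScoringFilters;

public class RedirectDatumFactory {
  public static CrawlDatum datumForRedirect(Text redirUrl, CrawlDatum origin,
                                            ScoringFilters scfilters, int fetchInterval) {
    CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, fetchInterval);
    newDatum.getMetaData().putAll(origin.getMetaData()); // copy metadata from the origin URL
    try {
      scfilters.initialScore(redirUrl, newDatum); // let scoring plugins initialize the score
    } catch (ScoringFilterException e) {
      newDatum.setScore(0.0f); // fall back to a neutral score
    }
    return newDatum;
  }
}
```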
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12850939#action_12850939 ] Andrzej Bialecki commented on NUTCH-779: - CrawlDbReducer, the cramped line {{if (metaFromParse!=null){}} needs some whitespace fixing. Other than that, +1. Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-779, NUTCH-779-v2.patch The patch attached allows to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848090#action_12848090 ] Andrzej Bialecki commented on NUTCH-762: - I just noticed that the new Generator uses different config property names (generator. vs. generate.), and the older versions are now marked with (Deprecated). However, this doesn't reflect the reality - properties with old names are simply ignored now, whereas deprecated implies that they should still work. For back-compat reason I think they should still work - the current (admittedly awkward) prefix is good enough, and I think that changing it in a minor release would create confusion. I suggest reverting to the old names where appropriate, and add new properties with the same prefix, i.e. generate.. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value select e.g. not enough URLs are available for fetching and fit in less segments Please give it a try and less me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848110#action_12848110 ] Andrzej Bialecki commented on NUTCH-762: - bq. If we want to replace the old generator altogether - which I think would be a good option I think this makes sense now, since the new Generator in your latest patch is a strict superset of the old one. bq. I don't have strong feelings on whether or not to modify the prefix in a minor release. I do :) , see also here: http://en.wikipedia.org/wiki/Principle_of_least_astonishment IMHO it's all about breaking or not breaking existing installs after a minor upgrade. I suspect most users won't be aware of a subtle change between generate. and generator., especially since the command-line of the new Generator is compatible with the old one. So they will try to use the new Generator while keeping their existing configs. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value select e.g. not enough URLs are available for fetching and fit in less segments Please give it a try and less me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
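One hedged way to honour the compatibility concern above is to consult both prefixes when reading the configuration, so an existing nutch-site.xml keeps working after a minor upgrade. The property names below are illustrative only, not the actual keys used by the patch.

```java
import org.apache.hadoop.conf.Configuration;

public class GeneratorBackCompat {
  public static long maxUrlsPerHost(Configuration conf) {
    long oldStyle = conf.getLong("generate.max.per.host", -1L); // pre-1.1 style key still honoured
    return conf.getLong("generator.max.count", oldStyle);       // new-style key wins if both are set
  }
}
```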
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848173#action_12848173 ] Andrzej Bialecki commented on NUTCH-762: - bq. The change of prefix also reflected that we now use 2 different parameters to specify how to count the URLs (host or domain) and the max number of URLs. We can of course maintain the old parameters as well for the sake of compatibility, except that generate.max.per.host.by.ip won't be of much use anymore as we don't count per IP. Ok. bq. Have just noticed that 'crawl.gen.delay' is not documented in nutch-default.xml, and does not seem to be used outside the Generator. What is it supposed to be used for? Ah, a bit of ancient magic .. ;) This value, expressed in days, defines how long we should keep the lock on records in CrawlDb that were just selected for fetching. If these records are not updated in the meantime, the lock is canceled, i.e. they become eligible for selection again. The default value is 7 days. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many times as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value specified, e.g. if not enough URLs are available for fetching and they fit in fewer segments Please give it a try and let me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
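A hedged sketch of the day-based lock arithmetic described in the comment above (the surrounding Generator code is omitted; the property name and the 7-day default come from the comment, everything else is assumed):

```java
import org.apache.hadoop.conf.Configuration;

public class GenerateLock {
  // A record stamped with a generate timestamp stays locked until crawl.gen.delay
  // (expressed in days) has elapsed; after that it becomes eligible for selection again.
  public static boolean stillLocked(Configuration conf, long generateTimeMs, long nowMs) {
    long delayDays = conf.getLong("crawl.gen.delay", 7L); // default: 7 days
    long delayMs = delayDays * 24L * 60L * 60L * 1000L;
    return (nowMs - generateTimeMs) < delayMs;
  }
}
```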
[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847291#action_12847291 ] Andrzej Bialecki commented on NUTCH-693: - Thanks for the pointer to the article. Indeed, the issue is muddy at best. So far Nutch adhered to a strict interpretation, where the links with this attribute are deleted from page outlinks immediately (so they are not only not followed but also don't affect out-degree metrics). If there is a general agreement in Nutch community towards relaxing this behavior we can further develop this patch - at the moment I don't see such support. Consequently, I propose to discuss it and in the meantime to move this issue to a later release. Add configurable option for treating nofollow behaviour. Key: NUTCH-693 URL: https://issues.apache.org/jira/browse/NUTCH-693 Project: Nutch Issue Type: New Feature Reporter: Andrew McCall Assignee: Otis Gospodnetic Priority: Minor Attachments: nutch.nofollow.patch For my purposes I'd like to follow links even if they're marked nofollow- Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-693: Assignee: (was: Otis Gospodnetic) Add configurable option for treating nofollow behaviour. Key: NUTCH-693 URL: https://issues.apache.org/jira/browse/NUTCH-693 Project: Nutch Issue Type: New Feature Reporter: Andrew McCall Priority: Minor Attachments: nutch.nofollow.patch For my purposes I'd like to follow links even if they're marked nofollow- Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-797: --- Assignee: Andrzej Bialecki parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Assignee: Andrzej Bialecki Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this: <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a> in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constructs a new url with a base URL class built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a target of ?co=0&sk=0&p=2&pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0&sk=0&p=2&pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') > 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query.
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 + // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith("?")) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException + { + if (!target.startsWith("?")) + return new URL(base, target); + + String basePath = base.getPath(); + String baseRightMost = ""; + int baseRightMostIdx = basePath.lastIndexOf("/"); + if (baseRightMostIdx != -1) + { + baseRightMost = basePath.substring(baseRightMostIdx+1); + } + + if (target.startsWith("?")) + target
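The patch text above is truncated by the mail archive. Based only on the prose description (pull the right-most path component out of the base URL and prepend it to a pure-query target), a hedged reconstruction of the remaining logic might look like the following; this is not the attached pureQueryUrl patch:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class PureQueryTargetFixer {
  static URL fixPureQueryTarget(URL base, String target) throws MalformedURLException {
    if (!target.startsWith("?")) {
      return new URL(base, target); // nothing special to do
    }
    String basePath = base.getPath();
    String baseRightMost = "";
    int idx = basePath.lastIndexOf('/');
    if (idx != -1) {
      baseRightMost = basePath.substring(idx + 1); // e.g. "Search.aspx"
    }
    // "?co=0&sk=0&p=2&pi=1" becomes "Search.aspx?co=0&sk=0&p=2&pi=1" before merging
    return new URL(base, baseRightMost + target);
  }
}
```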
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847300#action_12847300 ] Andrzej Bialecki commented on NUTCH-797: - If there are no further comments I'm going to commit the current patch with a TODO to revisit this code if/when it's refactored to an external dependency. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this: <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a> in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constructs a new url with a base URL class built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a target of ?co=0&sk=0&p=2&pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0&sk=0&p=2&pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') > 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query.
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 + // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith("?")) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException + { + if (!target.startsWith("?")) + return new URL(base, target); + + String basePath = base.getPath(); + String baseRightMost = ""; + int baseRightMostIdx = basePath.lastIndexOf("/"); + if (baseRightMostIdx != -1
[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.1.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-787: Assignee: Andrzej Bialecki Summary: Upgrade Lucene to 3.0.1. (was: Upgrade Lucene to 3.0.0.) We're shooting at 3.0.1 now. Upgrade Lucene to 3.0.1. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Assignee: Andrzej Bialecki Priority: Trivial Fix For: 1.1 Attachments: NUTCH-787.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847315#action_12847315 ] Andrzej Bialecki commented on NUTCH-787: - Using Lucene 3.0.1 artifacts I verified that your patch passes all tests and produces correct searchable indexes. I'll commit this shortly. Upgrade Lucene to 3.0.0. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Priority: Trivial Fix For: 1.1 Attachments: NUTCH-787.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-787) Upgrade Lucene to 3.0.1.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-787. --- Resolution: Fixed Committed. Thanks Dawid! Upgrade Lucene to 3.0.1. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Assignee: Andrzej Bialecki Priority: Trivial Fix For: 1.1 Attachments: NUTCH-787.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-803) Upgrade Hadoop to 0.20.2
Upgrade Hadoop to 0.20.2 Key: NUTCH-803 URL: https://issues.apache.org/jira/browse/NUTCH-803 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.1 Per subject. We are currently using 0.20.1, so there are no API changes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-803) Upgrade Hadoop to 0.20.2
[ https://issues.apache.org/jira/browse/NUTCH-803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-803. --- Resolution: Fixed All tests pass - committed. Upgrade Hadoop to 0.20.2 Key: NUTCH-803 URL: https://issues.apache.org/jira/browse/NUTCH-803 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.1 Per subject. We are currently using 0.20.1, so there are no API changes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[DISCUSS] Nutch as a top level project (TLP)?
Hi devs, The ASF Board indicated recently that so called umbrella projects, i.e. projects that host many significant sub-projects, should examine their structure towards simplification, such as merging or splitting out sub-projects. Lucene TLP is such a project. Recently the Lucene PMC accepted the merge of Solr and Lucene core projects. Mahout project will most likely split to its own TLP soon. Which leaves Nutch as a sort of odd duck ;) Moving Nutch to its own TLP has some advantages, mostly an easier decision process - voting on new committers and new releases involves then only those who participate directly in Nutch dev., i.e. the Nutch community. Also, from the coding point of view, Nutch is not intrinsically tied to the Lucene development as if both would require some careful coordination - we just use Lucene as one of many dependencies, and in fact we aim to cleanly separate Nutch search API from Lucene-based API. I can easily imagine Nutch dropping completely the low-level Lucene-based components and moving to a more general search fabric (e.g. SolrCloud). Being its own TLP could also give Nutch more exposure and help to crystallize our mission. There are some disadvantages to such a split, too: we would need to spend some more effort on various administrative tasks, and maintain a separate web site (under Apache, but not under Lucene), and probably some other tasks that I'm not yet aware of. This would also mean that Nutch would have to stand on its own merit, which considering the small number of active committers may be challenging. Let's discuss this, and after we collect some pros and cons I'm going to call for a vote. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846923#action_12846923 ] Andrzej Bialecki commented on NUTCH-797: - That's one option, at least until the crawler-commons produces any artifacts ... Eventually I think that this code and other related code (e.g. deciding which URL is canonical in the presence of redirects, URL normalization and filtering) should end up in the crawler-commons. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this: <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a> In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link and constructs a new URL with a base URL built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0" and a target of "?co=0&sk=0&p=2&pi=1". The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1 because the RFC explicitly states that the rightmost URL segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed URL on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0&sk=0&p=2&pi=1 The URL class then properly constructs the new URL as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1 If it is agreed that this solution works, I believe the other HTML parsers in Nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info:
Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
===
--- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
+++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
@@ -299,6 +299,50 @@
     return false;
   }
+  private URL fixURL(URL base, String target) throws MalformedURLException
+  {
+    // handle params that are embedded into the base url - move them to target
+    // so URL class constructs the new url class properly
+    if (base.toString().indexOf(';') > 0)
+      return fixEmbeddedParams(base, target);
+
+    // handle the case that there is a target that is a pure query.
+    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
+    // URLs but I've seen this in numerous places, for example at
+    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
+    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
+    // URL constructs the base+target combo as
+    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
+    // dropping the Search.aspx target
+    //
+    // Browsers handle these just fine, they must have an exception similar to this
+    if (target.startsWith("?"))
+    {
+      return fixPureQueryTargets(base, target);
+    }
+
+    return new URL(base, target);
+  }
+
+  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
+  {
+    if (!target.startsWith("?"))
+      return new URL(base, target);
+
+    String basePath = base.getPath();
+    String baseRightMost
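To see the resolution behaviour the report describes, here is a minimal sketch; the accenture URLs are taken from the report, and the printed results assume the JDKs cited there (newer JDKs may resolve query-only references per RFC 3986 and keep the Search.aspx segment):
{code:java}
import java.net.MalformedURLException;
import java.net.URL;

public class QueryOnlyResolutionDemo {
  public static void main(String[] args) throws MalformedURLException {
    // Base page and one of the query-only hrefs found in it, as in the report.
    URL base = new URL("http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0");
    String target = "?co=0&sk=0&p=2&pi=1";

    // On the JDKs cited in the report this drops the Search.aspx segment:
    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
    System.out.println(new URL(base, target));

    // The proposed fix re-attaches the rightmost base segment before resolving:
    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
    System.out.println(new URL(base, "Search.aspx" + target));
  }
}
{code}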
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846927#action_12846927 ] Andrzej Bialecki commented on NUTCH-762: - In my experience the IP-based fetching was only (rarely) needed when there was a large number of urls from virtual hosts hosted at the same ISP. In other words, not a common case - others may have different experience depending on their typical crawl targets... IMHO I think we don't have to reimplement this. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-762-v2.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value select e.g. not enough URLs are available for fetching and fit in less segments Please give it a try and less me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Reopened: (NUTCH-802) Problems managing outlinks with large url length
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reopened NUTCH-802: - Assignee: Andrzej Bialecki Submitting a patch is not fixing it; the issue is fixed when the patch is accepted and applied. Problems managing outlinks with large url length Key: NUTCH-802 URL: https://issues.apache.org/jira/browse/NUTCH-802 Project: Nutch Issue Type: Bug Components: parser Reporter: Pablo Aragón Assignee: Andrzej Bialecki Attachments: ParseOutputFormat.patch Nutch can get idle during the collection of outlinks if the URL address of the outlink is too long. The maximum URL sizes for the main web servers are: * Apache: 4,000 bytes * Microsoft Internet Information Server (IIS): 16,384 bytes * Perl HTTP::Daemon: 8,000 bytes URL address sizes bigger than 4,000 bytes are problematic, so the limit should be set in the nutch-default.xml configuration file. I attached a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-802) Problems managing outlinks with large url length
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846932#action_12846932 ] Andrzej Bialecki commented on NUTCH-802: - We already have a general way to control this and other aspects of URL-s as such, namely with URLFilters. I agree that this functionality could be useful, but in the form of a URLFilter (or by adding this control to e.g. urlfilter-basic or urlfilter-validator). Problems managing outlinks with large url length Key: NUTCH-802 URL: https://issues.apache.org/jira/browse/NUTCH-802 Project: Nutch Issue Type: Bug Components: parser Reporter: Pablo Aragón Assignee: Andrzej Bialecki Attachments: ParseOutputFormat.patch Nutch can get idle during the collection of outlinks if the URL address of the outlink is too long. The maximum URL sizes for the main web servers are: * Apache: 4,000 bytes * Microsoft Internet Information Server (IIS): 16,384 bytes * Perl HTTP::Daemon: 8,000 bytes URL address sizes bigger than 4,000 bytes are problematic, so the limit should be set in the nutch-default.xml configuration file. I attached a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
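As a rough illustration of the URLFilter route suggested above, a sketch of a length-limiting filter; the class name, property name and 4096-byte default are made up for the example, and the plugin wiring (plugin.xml, build files) is omitted:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class MaxLengthURLFilter implements URLFilter {

  private Configuration conf;
  private int maxLength = 4096; // illustrative default, not an official Nutch property

  public String filter(String urlString) {
    // Returning null tells the filter chain to drop the URL.
    return (urlString == null || urlString.length() > maxLength) ? null : urlString;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // "urlfilter.maxlength.limit" is a made-up property name for this sketch.
    this.maxLength = conf.getInt("urlfilter.maxlength.limit", 4096);
  }

  public Configuration getConf() {
    return conf;
  }
}
{code}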
[jira] Closed: (NUTCH-796) Zero results problems difficult to troubleshoot due to lack of logging
[ https://issues.apache.org/jira/browse/NUTCH-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-796. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Patch applied in rev. 924945. Thanks for reporting it. Zero results problems difficult to troubleshoot due to lack of logging -- Key: NUTCH-796 URL: https://issues.apache.org/jira/browse/NUTCH-796 Project: Nutch Issue Type: Improvement Components: searcher, web gui Affects Versions: 1.0.0, 1.1 Environment: Linux, x86, nutch searcher and nutch webapps. v1.0, v1.1 Reporter: Jesse Hires Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: logging.patch There are a few places where search can fail in a distributed environment, but when the configuration is not quite right, there are no indications of errors and no logging. Increased logging of failures would help troubleshoot such problems, as well as reduce the "I get 0 results, why?" questions that come across the mailing lists. Areas where logging would be helpful: * search app cannot locate search-servers.txt * search app cannot find a searcher node listed in search-servers.txt * search app cannot connect to the port on a searcher specified in search-servers.txt * searcher (bin/nutch server ...) cannot find the index * searcher cannot find segments * access denied in any of the above scenarios There are probably more that would be helpful, but I am not yet familiar enough to know all the points of possible failure between the webpage and a search node. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-800) Generator builds a URL list that is not encoded
[ https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847071#action_12847071 ] Andrzej Bialecki commented on NUTCH-800: - I'm puzzled by your problem description. Is Nutch affected by potentially malicious URL data? URL form encoding is just a transport encoding; it doesn't make a URL inherently safe (or unsafe). Generator builds a URL list that is not encoded --- Key: NUTCH-800 URL: https://issues.apache.org/jira/browse/NUTCH-800 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.7.3, 0.9.0, 1.0.0, 1.1 Reporter: Jesse Campbell The URL string that is grabbed by the generator when creating the fetch list does not get encoded, could potentially allow unsafe execution, and breaks reading improperly encoded URLs from the scraped pages. Since we a) cannot guarantee that any site we scrape is not malicious, and b) likely do not have control over all content providers, we are currently forced to use a regex normalizer to perform the same function as a built-in java class (it would be unsafe to leave alone). A quick solution would be to update Generator.java to use the java.net.URLEncoder class: line 187: old: String urlString = url.toString(); new: String urlString = URLEncoder.encode(url.toString(), "UTF-8"); line 192: old: u = new URL(url.toString()); new: u = new URL(urlString); The use of URLEncoder.encode could also happen at the updatedb stage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
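A small demo of the point about transport encoding: URLEncoder implements application/x-www-form-urlencoded, so applying it to a whole URL string escapes the scheme and path separators rather than producing a safer equivalent URL (the example.com URL is illustrative):
{code:java}
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncodingDemo {
  public static void main(String[] args) throws UnsupportedEncodingException {
    String url = "http://example.com/search?q=a b";
    // Prints http%3A%2F%2Fexample.com%2Fsearch%3Fq%3Da+b - the result is no longer
    // a usable URL, which is why blanket encoding in the Generator is questionable.
    System.out.println(URLEncoder.encode(url, "UTF-8"));
  }
}
{code}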
[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.
[ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847074#action_12847074 ] Andrzej Bialecki commented on NUTCH-693: - This patch is controversial in the sense that a) Nutch strives to adhere to Internet standards and netiquette, which says that robots should obey nofollow, and b) most Nutch users want a well-behaved robot. You are free of course to modify the source as you did. Therefore I think that this functionality is not applicable to majority of Nutch users, and I vote -1 on including it in Nutch. Add configurable option for treating nofollow behaviour. Key: NUTCH-693 URL: https://issues.apache.org/jira/browse/NUTCH-693 Project: Nutch Issue Type: New Feature Reporter: Andrew McCall Assignee: Otis Gospodnetic Priority: Minor Attachments: nutch.nofollow.patch For my purposes I'd like to follow links even if they're marked nofollow- Ideally I'd like to follow them, but not pass the link juice between them. I've attached a patch that adds a configuration element parser.html.outlinks.ignore_nofollow which allows the parser to ignore the nofollow elements on a page. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-795) Add ability to maintain nofollow attribute in linkdb
[ https://issues.apache.org/jira/browse/NUTCH-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847075#action_12847075 ] Andrzej Bialecki commented on NUTCH-795: - Please see my comment to that issue. Or is there some other use case that you have in mind? Add ability to maintain nofollow attribute in linkdb Key: NUTCH-795 URL: https://issues.apache.org/jira/browse/NUTCH-795 Project: Nutch Issue Type: New Feature Components: linkdb Affects Versions: 1.1 Reporter: Sammy Yu Attachments: 0001-Updated-with-nofollow-support-for-Outlinks.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-780) Nutch crawler did not read configuration files
[ https://issues.apache.org/jira/browse/NUTCH-780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847094#action_12847094 ] Andrzej Bialecki commented on NUTCH-780: - Is the purpose of this issue to make Crawl.java usable via a strongly-typed API instead of the generic main, e.g. something like this: {code} public class Crawl extends Configured { public int crawl(Path output, Path seedDir, int threads, int numCycles, int topN, ...) { ... } } {code} Nutch crawler did not read configuration files -- Key: NUTCH-780 URL: https://issues.apache.org/jira/browse/NUTCH-780 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Vu Hoang Attachments: NUTCH-780.patch Nutch searcher can read properties at the constructor ... {code:java|title=NutchSearcher.java|borderStyle=solid} NutchBean bean = new NutchBean(getFilesystem().getConf(), fs); ... // put search engine code here {code} ... but the Nutch crawler does not; it only reads data from command-line arguments. {code:java|title=NutchCrawler.java|borderStyle=solid} StringBuilder builder = new StringBuilder(); builder.append(domainlist + SPACE); builder.append(ARGUMENT_CRAWL_DIR); builder.append(domainlist + SUBFIX_CRAWLED + SPACE); builder.append(ARGUMENT_CRAWL_THREADS); builder.append(threads + SPACE); builder.append(ARGUMENT_CRAWL_DEPTH); builder.append(depth + SPACE); builder.append(ARGUMENT_CRAWL_TOPN); builder.append(topN + SPACE); Crawl.main(builder.toString().split(SPACE)); {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
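A sketch of how a caller might use such a strongly-typed entry point, assuming the crawl(...) method from the comment above existed (it does not today; paths and parameter values are placeholders):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlCaller {
  public static void main(String[] args) throws Exception {
    // Reads nutch-default.xml / nutch-site.xml instead of relying on CLI arguments.
    Configuration conf = NutchConfiguration.create();

    Crawl crawl = new Crawl();          // hypothetical Configured subclass from the comment above
    crawl.setConf(conf);
    int rc = crawl.crawl(new Path("crawl"), new Path("urls"),
        10 /* threads */, 3 /* cycles */, 1000 /* topN */);
    System.exit(rc);
  }
}
{code}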
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846402#action_12846402 ] Andrzej Bialecki commented on NUTCH-797: - Thanks for reporting this, and providing a patch. An updated revision of the standard, RFC3986 section 5.4.1 example 7 follows the same reasoning. I'll fix this shortly. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0sk=0p=2pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0p=2pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. 
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 + // It has urls in the page of the form href=?co=0sk=0pg=1, and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0sk=0pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith(?)) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException + { + if (!target.startsWith(?)) + return new URL(base, target); + + String basePath = base.getPath(); + String baseRightMost=; + int baseRightMostIdx = basePath.lastIndexOf(/); + if (baseRightMostIdx != -1
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846418#action_12846418 ] Andrzej Bialecki commented on NUTCH-797: - Hm, actually the picture is more complicated than I thought - if we apply both methods (fixEmbeddedParams and fixPureQueryTargets) then some of the test cases from RFC fail. However, all tests succeed if we only apply the fixPureQueryTargets ! Looking at the origin of the fixEmbeddedParams method (NUTCH-436) something must been fixed in java.net.URL, because the test case mentioned in that issue now passes if we apply only fixPureQueryTargets. The same case with test cases in a near-duplicate issue NUTCH-566. Consequently I'm going to remove fixEmbeddedParams. I added all tests from RFC3986 section 5.4.1, and they all pass now. I'll attach an updated patch shortly. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0sk=0p=2pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0p=2pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. 
Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. + // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 + // It has urls in the page of the form href=?co=0sk=0pg=1, and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0sk=0pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith
[jira] Updated: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-797: Attachment: pureQueryUrl-2.patch Updated patch with some refactoring and unit tests. If no objections I'll commit this shortly. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0sk=0p=2pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0p=2pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. 
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 + // It has urls in the page of the form href=?co=0sk=0pg=1, and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0sk=0pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith(?)) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException + { + if (!target.startsWith(?)) + return new URL(base, target); + + String basePath = base.getPath(); + String baseRightMost=; + int baseRightMostIdx = basePath.lastIndexOf(/); + if (baseRightMostIdx != -1) + { + baseRightMost = basePath.substring(baseRightMostIdx+1
[jira] Updated: (NUTCH-796) Zero results problems difficult to troubleshoot due to lack of logging
[ https://issues.apache.org/jira/browse/NUTCH-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-796: Attachment: logging.patch I propose this patch. If there are no objections I'll commit it shortly. Zero results problems difficult to troubleshoot due to lack of logging -- Key: NUTCH-796 URL: https://issues.apache.org/jira/browse/NUTCH-796 Project: Nutch Issue Type: Improvement Components: searcher, web gui Affects Versions: 1.0.0, 1.1 Environment: Linux, x86, nutch searcher and nutch webapps. v1.0, v1.1 Reporter: Jesse Hires Attachments: logging.patch There are a few places where search can fail in a distributed environment, but when the configuration is not quite right, there are no indications of errors and no logging. Increased logging of failures would help troubleshoot such problems, as well as reduce the "I get 0 results, why?" questions that come across the mailing lists. Areas where logging would be helpful: * search app cannot locate search-servers.txt * search app cannot find a searcher node listed in search-servers.txt * search app cannot connect to the port on a searcher specified in search-servers.txt * searcher (bin/nutch server ...) cannot find the index * searcher cannot find segments * access denied in any of the above scenarios There are probably more that would be helpful, but I am not yet familiar enough to know all the points of possible failure between the webpage and a search node. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
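Not the attached patch, but a sketch of the kind of check-and-log it adds, using the commons-logging setup Nutch already relies on (the class, method and message are illustrative):
{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SearchServerListCheck {
  private static final Log LOG = LogFactory.getLog(SearchServerListCheck.class);

  /** Logs a clear warning instead of failing silently when the server list is missing. */
  public static boolean serverListExists(Configuration conf, Path searchServersFile)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    if (!fs.exists(searchServersFile)) {
      LOG.warn("Distributed search is configured but " + searchServersFile
          + " was not found; all queries will return zero results.");
      return false;
    }
    return true;
  }
}
{code}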
[jira] Commented: (NUTCH-787) Upgrade Lucene to 3.0.0.
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846428#action_12846428 ] Andrzej Bialecki commented on NUTCH-787: - Lucene 3.0.1 is out now .. I'll test this patch with 3.0.1 artifacts and will report. Upgrade Lucene to 3.0.0. Key: NUTCH-787 URL: https://issues.apache.org/jira/browse/NUTCH-787 Project: Nutch Issue Type: Task Components: build Reporter: Dawid Weiss Priority: Trivial Fix For: 1.1 Attachments: NUTCH-787.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846437#action_12846437 ] Andrzej Bialecki commented on NUTCH-797: - Unfortunately the way your fix was applied there is not reusable (private method in HtmlParser... ugh :( ). So for the time being I think we'll go with our utility class ... which we should really move to the crawler-commons anyway! parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0sk=0p=2pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0p=2pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. 
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 + // It has urls in the page of the form href=?co=0sk=0pg=1, and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0sk=0pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith(?)) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException + { + if (!target.startsWith(?)) + return new URL(base, target); + + String basePath = base.getPath(); + String baseRightMost=; + int baseRightMostIdx
[jira] Assigned: (NUTCH-774) Retry interval in crawl date is set to 0
[ https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-774: --- Assignee: Andrzej Bialecki Retry interval in crawl date is set to 0 Key: NUTCH-774 URL: https://issues.apache.org/jira/browse/NUTCH-774 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Reinhard Schwab Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-774.patch, NUTCH-774_2.patch When i fetch and parse a feed with the feed plugin, http://www.wachauclimbing.net/home/impressum-disclaimer/feed/ another crawl date is generated http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ after fetching a second round the dump in the crawl db still shows a retry interval with value 0. http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ Version: 7 Status: 2 (db_fetched) Fetch time: Wed Dec 02 12:48:22 CET 2009 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 0 seconds (0 days) Score: 1.084 Signature: db9ab2193924cd2d0b53113a500ca604 Metadata: _pst_: success(1), lastModified=0 a check should be done in DefaultFetchSchedule (or AbstractFetchSchedule) in the method setFetchSchedule -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
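A sketch of the kind of guard the reporter suggests for setFetchSchedule; the method signature is recalled from the 1.x FetchSchedule API and the defaultInterval field (backed by db.fetch.interval.default) is assumed to be visible to subclasses, so treat this as an approximation rather than the committed fix:
{code:java}
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;

public class GuardedFetchSchedule extends DefaultFetchSchedule {

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    // Entries injected with a zero interval (e.g. via the feed plugin) would otherwise
    // keep "Retry interval: 0 seconds" forever; fall back to the configured default.
    if (datum.getFetchInterval() <= 0) {
      datum.setFetchInterval(defaultInterval);
    }
    return super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
        fetchTime, modifiedTime, state);
  }
}
{code}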
[jira] Commented: (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846527#action_12846527 ] Andrzej Bialecki commented on NUTCH-797: - A few issues with this: * does this mean that the fixes would be applied to links found in other content types as well, not just html (the fixup code in TIKA-287 is located in HtmlParser)? * we need this also in other places, e.g. in the redirection handling code (both meta-refresh, javascript location.href and protocol-level redirect) * for a while we still need this in the parse-html plugin that does not use Tika. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Priority: Minor Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0sk=0p=2pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0p=2pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. 
Much thanks Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. + // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 + // It has urls in the page of the form href=?co=0sk=0pg=1, and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0sk=0pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith(?)) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846133#action_12846133 ] Andrzej Bialecki commented on NUTCH-762: - It appears this class is not a strict superset - the generate.update.crawldb functionality is not there. This is a regression in a useful functionality, so I think it needs to be added back. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-762-v2.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value select e.g. not enough URLs are available for fetching and fit in less segments Please give it a try and less me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846174#action_12846174 ] Andrzej Bialecki commented on NUTCH-762: - In case of users generating just 1 segment at a time it's an unexpected loss of flexibility. You can't run this version of Generator twice without first completing _both_ fetching updating of all segments from the previous run - because some of the same urls would be generated in the next round. The point of generate.update.crawldb is to be able to freely interleave generate/update steps. E.g. the following scenario breaks in a non-obvious way: * generate 10 segments * fetch update 8 of them * realize you need more rounds due to e.g. gone pages * generate additional 10 segments ..kaboom! now the new segments partially overlap with the unfetched 2 segments from the previous generation, and you are going to fetch some urls twice. Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-762-v2.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB then update the Db only once on several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many time as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value select e.g. not enough URLs are available for fetching and fit in less segments Please give it a try and less me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
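For reference, the generate.update.crawldb behaviour under discussion works by writing a generate-time marker back into the crawlDb so that the next generate pass can skip entries already handed to a pending segment. A rough sketch of that check, with key and field names recalled from the 1.x Generator (treat them as approximate):
{code:java}
import org.apache.hadoop.io.LongWritable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.metadata.Nutch;

public class GenerateMarkerCheck {
  /** True if the entry already belongs to a segment that has not been fetched/updated yet. */
  public static boolean alreadyGenerated(CrawlDatum datum, long curTime, long genDelay) {
    LongWritable oldGenTime =
        (LongWritable) datum.getMetaData().get(Nutch.WRITABLE_GENERATE_TIME_KEY);
    return oldGenTime != null && oldGenTime.get() + genDelay > curTime;
  }
}
{code}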
[jira] Commented: (NUTCH-798) Upgrade to SOLR1.4
[ https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843555#action_12843555 ] Andrzej Bialecki commented on NUTCH-798: - +1, preferably before the 1.1 freeze so that we can test it. Upgrade to SOLR1.4 -- Key: NUTCH-798 URL: https://issues.apache.org/jira/browse/NUTCH-798 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 in particular SOLR1.4 has a StreamingUpdateSolrServer which would simplify the way we buffer the docs before sending them to the SOLR instance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
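A sketch of the simplification mentioned above, assuming SolrJ 1.4 on the classpath; the URL, queue size and thread count are placeholder values:
{code:java}
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexDemo {
  public static void main(String[] args) throws Exception {
    // Buffers up to 1000 docs and streams them on 2 background threads,
    // so the indexer no longer needs its own batching logic.
    SolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 2);

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.com/");
    doc.addField("title", "Example");

    server.add(doc);
    server.commit();
  }
}
{code}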
[jira] Commented: (NUTCH-801) Remove RTF and MP3 parse plugins
[ https://issues.apache.org/jira/browse/NUTCH-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843587#action_12843587 ] Andrzej Bialecki commented on NUTCH-801: - Definitely +1, the only reason they lingered so long was the lack of a suitable replacement. Remove RTF and MP3 parse plugins Key: NUTCH-801 URL: https://issues.apache.org/jira/browse/NUTCH-801 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0 Reporter: Julien Nioche Fix For: 1.1 *Parse-rtf* and *parse-mp3* are not built by default due to licensing issues. Since we now have *parse-tika* to handle these formats I would be in favour of removing these 2 plugins altogether to keep things nice and simple. The other plugins will probably be phased out only after the release of 1.1 when parse-tika will have been tested a lot more. Any reasons not to? Julien -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: 1.1 release?
On 2010-03-09 18:17, Julien Nioche wrote: Hi Chris, Excellent idea! There have been quite a few changes since 1.0 and it's probably the right time to have a new release. +1. Let's just check JIRA and make sure we didn't forget anything important ... Not really a blocker but https://issues.apache.org/jira/browse/NUTCH-762 would be nice to have in 1.1, just needs a bit of reviewing / testing I suppose. Otherwise this can wait until after 1.1 I'll try to test it before the weekend. -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-799) SOLRIndexer to commit once all reducers have finished
[ https://issues.apache.org/jira/browse/NUTCH-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841790#action_12841790 ] Andrzej Bialecki commented on NUTCH-799: - I think it's ok to do it this way - the commit per reducer may be actually harmful if commit succeeds but the task is killed for any reason and re-ran. Note: the patch has some formatting errors. SOLRIndexer to commit once all reducers have finished - Key: NUTCH-799 URL: https://issues.apache.org/jira/browse/NUTCH-799 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1 Attachments: NUTCH-799.patch What about doing only one SOLR commit after the MR job has finished in SOLRIndexer instead of doing that at the end of every Reducer? I ran into timeout exceptions in some of my reducers and I suspect that this was due to the fact that other reducers had already finished and called commit. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
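A sketch of the commit-once flow under discussion, assuming the old mapred API that Nutch 1.x uses and SolrJ's CommonsHttpSolrServer; the helper method is illustrative, not the attached patch:
{code:java}
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CommitAfterJob {
  /** Run the indexing job, then issue exactly one commit from the submitting client. */
  public static void indexAndCommit(JobConf job, String solrUrl) throws Exception {
    JobClient.runJob(job);                        // reducers only add documents, never commit
    new CommonsHttpSolrServer(solrUrl).commit();  // single commit once all reducers are done
  }
}
{code}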
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832250#action_12832250 ] Andrzej Bialecki commented on NUTCH-766: - +1 to commit this - please remember to update nutch-default.xml to switch to the tika plugin, perhaps add a comment about the deprecated parse-* plugins - most people look here and not in the parse-plugins, where this change is documented... Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the pasring mechanism of Tika but can still coexist with the existing parsing plugins which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one Mime-type which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser i.e. link detection, metatag handling but also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and make the tika parser depend on it. 
The following libraries are required in the lib/ directory of the tika-parser:
<library name="asm-3.1.jar"/>
<library name="bcmail-jdk15-144.jar"/>
<library name="commons-compress-1.0.jar"/>
<library name="commons-logging-1.1.1.jar"/>
<library name="dom4j-1.6.1.jar"/>
<library name="fontbox-0.8.0-incubator.jar"/>
<library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
<library name="hamcrest-core-1.1.jar"/>
<library name="jce-jdk13-144.jar"/>
<library name="jempbox-0.8.0-incubator.jar"/>
<library name="metadata-extractor-2.4.0-beta-1.jar"/>
<library name="mockito-core-1.7.jar"/>
<library name="objenesis-1.0.jar"/>
<library name="ooxml-schemas-1.0.jar"/>
<library name="pdfbox-0.8.0-incubating.jar"/>
<library name="poi-3.5-FINAL.jar"/>
<library name="poi-ooxml-3.5-FINAL.jar"/>
<library name="poi-scratchpad-3.5-FINAL.jar"/>
<library name="tagsoup-1.2.jar"/>
<library name="tika-parsers-0.5-SNAPSHOT.jar"/>
<library name="xml-apis-1.0.b2.jar"/>
<library name="xmlbeans-2.3.0.jar"/>
There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830065#action_12830065 ] Andrzej Bialecki commented on NUTCH-673: - +1 on both counts. Upgrade to Lucene 3.0 may involve more work than expected because of deprecated 2.x APIs that are no longer available in 3.0. Upgrade the Carrot2 plug-in to release 3.0 -- Key: NUTCH-673 URL: https://issues.apache.org/jira/browse/NUTCH-673 Project: Nutch Issue Type: Improvement Components: web gui Affects Versions: 0.9.0 Environment: All Nutch deployments. Reporter: Sean Dean Priority: Minor Fix For: 1.1 Version 3.0 of the Carrot2 plug-in was released recently. We currently have version 2.1 in the source tree and upgrading it to the latest version before the 1.0 release might make sense. Details on the release can be found here: http://project.carrot2.org/release-3.0-notes.html One major change in requirements is that JDK 1.5 must be used, but this is also now required for Hadoop 0.19 so this wouldn't be the only reason for the switch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806031#action_12806031 ] Andrzej Bialecki commented on NUTCH-775: - IMHO this could go as it is ... one suggestion though: this Query/QueryContext now resembles SolrQuery/SolrParams. Perhaps we could rename QueryContext to QueryParams? Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 Attachments: NUTCH-775.patch Current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice that we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; Also at the same time we should enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
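To make the suggested rename concrete, a sketch of what a QueryParams-based signature could look like; the interface and class names are illustrative, not the committed NUTCH-775 API:
{code:java}
import java.io.IOException;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.Query;

/** Illustrative only - a Searcher variant taking a parameter bag instead of positional arguments. */
interface ParamSearcher {
  Hits search(Query query, QueryParams params) throws IOException;
}

/** Per-request options, playing the same role as SolrParams does for SolrQuery. */
class QueryParams {
  private int numHits = 10;
  private String dedupField = "site";
  private String sortField = null;
  private boolean reverse = false;

  QueryParams numHits(int n) { this.numHits = n; return this; }
  QueryParams dedup(String field) { this.dedupField = field; return this; }
  QueryParams sort(String field, boolean rev) { this.sortField = field; this.reverse = rev; return this; }

  int getNumHits() { return numHits; }
  String getDedupField() { return dedupField; }
  String getSortField() { return sortField; }
  boolean isReverse() { return reverse; }
}
{code}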
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804558#action_12804558 ] Andrzej Bialecki commented on NUTCH-766: - I agree with Chris, +1 on keeping the old plugins in 1.1 with a prominent deprecation note, but I feel equally strongly that we should not prolong their life-cycle beyond what we can support, i.e. I'm +1 on removing them in 1.2/1.3. We simply don't have the resources to maintain so many duplicate plugins, and instead we should direct our efforts to improving those in Tika. Tika parser --- Key: NUTCH-766 URL: https://issues.apache.org/jira/browse/NUTCH-766 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Chris A. Mattmann Fix For: 1.1 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing to Tika but can still coexist with the existing parsing plugins, which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress; your feedback is welcome. Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers); in the work described here we decided to put the libs in 2 different places: NUTCH_HOME/lib : tika-core.jar NUTCH_HOME/tika-plugin/lib : tika-parsers.jar Since Tika is used by the core only for its MimeType functionality, we only need to put tika-core at the main lib level, whereas the tika plugin obviously needs tika-parsers.jar + all the jars used internally by Tika. Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points. Unlike most other parsers, Tika handles more than one mime-type, which is why we are using * as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling; this also means that we can use the HTMLParseFilters in exactly the same way. The main difference, though, is that HTMLParseFilters are no longer limited to HTML documents, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it. 
The following libraries are required in the lib/ directory of the tika-parser: <library name="asm-3.1.jar"/> <library name="bcmail-jdk15-144.jar"/> <library name="commons-compress-1.0.jar"/> <library name="commons-logging-1.1.1.jar"/> <library name="dom4j-1.6.1.jar"/> <library name="fontbox-0.8.0-incubator.jar"/> <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/> <library name="hamcrest-core-1.1.jar"/> <library name="jce-jdk13-144.jar"/> <library name="jempbox-0.8.0-incubator.jar"/> <library name="metadata-extractor-2.4.0-beta-1.jar"/> <library name="mockito-core-1.7.jar"/> <library name="objenesis-1.0.jar"/> <library name="ooxml-schemas-1.0.jar"/> <library name="pdfbox-0.8.0-incubating.jar"/> <library name="poi-3.5-FINAL.jar"/> <library name="poi-ooxml-3.5-FINAL.jar"/> <library name="poi-scratchpad-3.5-FINAL.jar"/> <library name="tagsoup-1.2.jar"/> <library name="tika-parsers-0.5-SNAPSHOT.jar"/> <library name="xml-apis-1.0.b2.jar"/> <library name="xmlbeans-2.3.0.jar"/> There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and, if so, to the same extent; the Wiki is probably the right place for this. The language identifier (which is an HTMLParseFilter) seemed to work fine. Again, your comments are welcome. Please bear in mind that this is just a first step. Julien http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
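For illustration, the catch-all mime type mentioned above would sit in the plugin descriptor roughly as follows. This is a sketch, not the committed plugin.xml; the ids, class names and library entries are assumptions based on the description.
{code}
<plugin id="parse-tika" name="Tika Parser Plug-in" version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="parse-tika.jar">
      <export name="*"/>
    </library>
    <!-- plus the Tika jars listed above, e.g. -->
    <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.apache.nutch.parse.tika" name="TikaParser" point="org.apache.nutch.parse.Parser">
    <implementation id="org.apache.nutch.parse.tika.TikaParser" class="org.apache.nutch.parse.tika.TikaParser">
      <!-- "*" marks the parser as a candidate for every mime type -->
      <parameter name="contentType" value="*"/>
      <parameter name="pathSuffix" value=""/>
    </implementation>
  </extension>
</plugin>
{code}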
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802175#action_12802175 ] Andrzej Bialecki commented on NUTCH-779: - Personally I would use ScoringFilters because I'm familiar with the API, but the approach that you propose is certainly more user friendly especially for novice users. Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Attachments: NUTCH-779 The patch attached allows to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801875#action_12801875 ] Andrzej Bialecki commented on NUTCH-779: - You can already achieve this with ScoringFilters, although it requires using three methods instead ... I would also rename the status to parse_meta, it's less cryptic this way. The property needs some documentation in nutch-default.xml plus a sensible default. Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Attachments: NUTCH-779 The patch attached allows to pass parse metadata to the corresponding entry of the crawldb. Comments are welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-655) Injecting Crawl metadata
[ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12797013#action_12797013 ] Andrzej Bialecki commented on NUTCH-655: - I'm not sure about the latest addition (the score option). If we go this route, then I suggest doing the last minor step and recognizing reserved metadata keys to do other useful things as well, like setting the fetch interval. I.e. define and recognize nutch.score and nutch.fetchInterval, and document it properly somewhere ... (wiki? javadoc? cmd-line synopsis?). Injecting Crawl metadata Key: NUTCH-655 URL: https://issues.apache.org/jira/browse/NUTCH-655 Project: Nutch Issue Type: Improvement Components: injector Reporter: Julien Nioche Assignee: Julien Nioche Priority: Minor Fix For: 1.1 Attachments: Injector.patch, NUTCH-655.v2 The patch attached allows injecting metadata into the crawlDB. The input file has to contain fields separated by tabs, with the URL in the first column. The metadata names and values are separated by '='. An input line might look like this: http://www.myurl.com \t categ=value1 \t categ2=value2 This functionality can be useful to store external knowledge and index it with a custom plugin -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
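To make the proposal concrete, a seed file in the tab-separated format described above could combine free-form metadata with the suggested reserved keys. The values below are purely illustrative, and \t stands for a literal tab as in the issue description.
{noformat}
http://www.myurl.com \t nutch.score=2.5 \t nutch.fetchInterval=86400 \t categ=value1
http://www.another.com \t categ2=value2
{noformat}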
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791979#action_12791979 ] Andrzej Bialecki commented on NUTCH-666: - Do you think it was related to the quality of language models that you built (presumably the ones in the patch?) versus the ones in the Nutch plugin, or due to a different classification algorithm? I'm trying to understand the source of such a big difference, because AFAIK the algorithm in textcat is essentially the same as the one we use. Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-775) Enhance Searcher interface
[ https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791411#action_12791411 ] Andrzej Bialecki commented on NUTCH-775: - +1. I would suggest creating a subclass of Metadata, where we can guarantee the presence of some required parameters, e.g.: {code} public class SearchContext extends Metadata { protected int numHits; protected String sortField; protected String dedupField; ... // setters and getters for the above } {code} and change the QueryFilter interface to use SearchContext too. Enhance Searcher interface -- Key: NUTCH-775 URL: https://issues.apache.org/jira/browse/NUTCH-775 Project: Nutch Issue Type: Improvement Components: searcher Reporter: Sami Siren Assignee: Sami Siren Fix For: 1.1 The current Searcher interface is too limited for many purposes: Hits search(Query query, int numHits, String dedupField, String sortField, boolean reverse) throws IOException; It would be nice if we had an interface that allowed adding different features without changing the interface. I am proposing that we deprecate the current search method and introduce something like: Hits search(Query query, Metadata context) throws IOException; At the same time we should also enhance the QueryFilter interface to look something like: BooleanQuery filter(Query input, BooleanQuery translation, Metadata context) throws QueryException; I would like to hear your comments before proceeding with a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
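A minimal sketch of how a caller might use the proposed interface, assuming the SearchContext subclass outlined in the comment above; the setter names are part of that proposal, not committed code.
{code}
SearchContext ctx = new SearchContext();
ctx.setNumHits(20);
ctx.setDedupField("site");
ctx.setSortField("date");
ctx.set("searcher.max.hits.per.site", "2"); // arbitrary extra parameter via Metadata
Hits hits = searcher.search(query, ctx);
{code}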
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790225#action_12790225 ] Andrzej Bialecki commented on NUTCH-666: - Dennis, what's the status of this patch (especially the missing part, the new language identifier)? Analysis plugins for multiple language and new Language Identifier Tool --- Key: NUTCH-666 URL: https://issues.apache.org/jira/browse/NUTCH-666 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-666-1-20081126.patch Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, russian, and thai. Also includes a new Language Identifier tool that used the new indexing framework in NUTCH-646. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: State of nutchbase
Alban Mouton wrote: Hello, I have looked a little into nutch code and mailing lists. I think the nutchbase branch (http://issues.apache.org/jira/browse/NUTCH-650) is very interesting, with a good potential to improve code clarity and flexibility (I find data structure quite obscure in current version). The issue is untouched since last august, so my question is : can nutchbase really be part of nutch 1.1 ? Definitely no. Release 1.1 will be an update to 1.0, with no major design changes. However, we intend to integrate the nutchbase branch with trunk at some point - but since this would be a major change it would come under 2.0 branch or so ... Is there still much work to do or is it almost ready ? Is it a worthy issue for an interested developer with a (still !) limited knowledge of the project ? Please contact Dogacan, who is leading the work on this branch. AFAIK he's going to update the design soon. So far I have only tried to run nutchbase in eclipse by applying the tutorial (http://wiki.apache.org/nutch/RunNutchInEclipse1.0) but I run in errors when building, mostly from Parser and tests. I may start by cleaning this up. See above - please coordinate with Dogacan to avoid duplication of effort. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Updated: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-767: Remaining Estimate: 0h Original Estimate: 0h I applied the patch, and I'm closing this issue - we will track the test failures when we upgrade to Tika 0.6, which is imminent. Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767-part2.patch, NUTCH-767.patch Original Estimate: 0h Remaining Estimate: 0h Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split in several jars; we need to place the tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Reopened: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reopened NUTCH-767: - Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767.patch Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split in several jars; we need to place the tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784790#action_12784790 ] Andrzej Bialecki commented on NUTCH-767: - Reopening this issue because TestContent is failing now - after fixing a trivial compilation problem, the problem seems to be that the type for empty content is auto-detected as text/plain and this value overrides the hint from the Content-Type header. Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767.patch Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split in several jars; we need to place the tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20
[ https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784206#action_12784206 ] Andrzej Bialecki commented on NUTCH-768: - +1. Minor nit: file lib/hsqldb-1.8.0.10.LICENSE.txt uses Windows EOL style, this should be probably corrected before commit. Upgrade Nutch 1.0 to use Hadoop 0.20 Key: NUTCH-768 URL: https://issues.apache.org/jira/browse/NUTCH-768 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.1 Attachments: NUTCH-768-1-20091125.patch Upgrade Nutch 1.0 to use the Hadoop 0.20 release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784250#action_12784250 ] Andrzej Bialecki commented on NUTCH-770: - Fixed in rev. 885776. Thank you! Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: log-770, NUTCH-770-v2.patch, NUTCH-770-v3.patch, NUTCH-770.patch This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-770. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: log-770, NUTCH-770-v2.patch, NUTCH-770-v3.patch, NUTCH-770.patch This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784260#action_12784260 ] Andrzej Bialecki commented on NUTCH-769: - I had to apply this patch by hand, due to NUTCH-770. I also added conf/nutch-default.xml documentation. This was committed in rev. 885785 - thanks! Fetcher to skip queues for URLS getting repeated exceptions - Key: NUTCH-769 URL: https://issues.apache.org/jira/browse/NUTCH-769 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: NUTCH-769-2.patch, NUTCH-769.patch As discussed on the mailing list (see http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg15360.html) this patch allows clearing URL queues in the Fetcher when more than a set number of exceptions have been encountered in a row. This can speed up the fetching substantially in cases where target hosts are not responsive (as a TimeoutException would be thrown) and limits cases where a whole Fetch step is slowed down because of a few queues. By default the parameter fetcher.max.exceptions.per.queue has a value of -1 and is deactivated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
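The nutch-default.xml entry mentioned in the comment above presumably looks roughly like the following; only the property name and the -1 default come from the issue itself, the description text is a paraphrase.
{code}
<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>-1</value>
  <description>The maximum number of protocol-level exceptions (e.g. timeouts)
  tolerated for a given queue before the rest of the queue is purged.
  The default value of -1 deactivates this check.</description>
</property>
{code}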
[jira] Closed: (NUTCH-769) Fetcher to skip queues for URLS getting repeated exceptions
[ https://issues.apache.org/jira/browse/NUTCH-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-769. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Fetcher to skip queues for URLS getting repeated exceptions - Key: NUTCH-769 URL: https://issues.apache.org/jira/browse/NUTCH-769 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: NUTCH-769-2.patch, NUTCH-769.patch As discussed on the mailing list (see http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg15360.html) this patch allows clearing URL queues in the Fetcher when more than a set number of exceptions have been encountered in a row. This can speed up the fetching substantially in cases where target hosts are not responsive (as a TimeoutException would be thrown) and limits cases where a whole Fetch step is slowed down because of a few queues. By default the parameter fetcher.max.exceptions.per.queue has a value of -1 and is deactivated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-767. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki (was: Chris A. Mattmann) Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767.patch Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split in several jars; we need to place the tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-767) Update Tika to v0.5 for the MimeType detection
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784337#action_12784337 ] Andrzej Bialecki commented on NUTCH-767: - Fixed in rev. 885869. Thank you! Update Tika to v0.5 for the MimeType detection --- Key: NUTCH-767 URL: https://issues.apache.org/jira/browse/NUTCH-767 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-767.patch Version 0.5 of Tika requires a few changes to the MimeType implementation. Tika is now split in several jars; we need to place the tika-core.jar in the main nutch lib. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783638#action_12783638 ] Andrzej Bialecki commented on NUTCH-770: - bq. time limit is definitely better than timebomb (but not as amusing). :) let's go for informative and less confusing now ... Could you please also add the nutch-default.xml property and its documentation. Re: FetchQueues - ok, you have a point here. Re: code style - yes. Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
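Assuming the rename from timebomb to time limit goes in as discussed above, the documented property could end up looking like this; the key name (fetcher.timelimit.mins) and the -1 default are assumptions based on the proposed rename, not the committed text.
{code}
<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>Number of minutes allowed for the fetch job, counted from its start.
  Once the limit is reached the QueueFeeder stops feeding new entries and the
  remaining queues are purged. A value of -1 deactivates the limit.</description>
</property>
{code}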
Re: wrong wiki front page
Alban Mouton wrote: No reaction ? Isn't the Wiki admin on this mailing list ? I don't see any link on the Wiki to contact the admin. The french frontpage is still the generic MoinMoin wiki home page and that can make a bad impression to newcomers ! We have little control over the MoinMoin config (AFAIK it's configured for multiple projects), and what you noticed is probably a fallout of the recent wiki upgrade - please create a JIRA issue here: https://issues.apache.org/jira/browse/INFRA (don't forget to mention the project name). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-770) Timebomb for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783283#action_12783283 ] Andrzej Bialecki commented on NUTCH-770: - I propose to change the name of this functionality - timebomb is not self-explanatory, and it suggests that if you misbehave then your cluster may explode ;) Instead I would use time limit, rename all vars and methods to follow this naming, and document it properly in nutch-default.xml. A few comments on the patch: * it has some overlap with NUTCH-769 (the emptyQueue() method), but that's easy to resolve, see also the next point. * why change the code in FetchQueues at all? Time limit is a global condition, we could just break the main loop in run() and ignore the QueueFeeder (or don't start it if the time limit already passed when starting run() ). * the patch does not follow the code style (notably whitespace in for/while loops and assignments). Timebomb for Fetcher Key: NUTCH-770 URL: https://issues.apache.org/jira/browse/NUTCH-770 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Attachments: log-770, NUTCH-770.patch This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This allows keeping the Fetch step under control and works well in combination with NUTCH-769 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-746. --- Resolution: Fixed Assignee: Andrzej Bialecki NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container. Key: NUTCH-746 URL: https://issues.apache.org/jira/browse/NUTCH-746 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or Sun JDK 1.6 OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode) Reporter: Kirby Bohling Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-746.patch NutchBeanConstructor is not cleaning up upon application shutdown (contextDestroyed()). It leaves open the SegmentUpdater, and potentially other resources. This prevents the WebApp's classloader from being GC'ed in Tomcat, which after repeated restarts will lead to a PermGen error. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-746) NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container.
[ https://issues.apache.org/jira/browse/NUTCH-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783287#action_12783287 ] Andrzej Bialecki commented on NUTCH-746: - Fixed in rev. 885148. Thanks! NutchBeanConstructor does not close NutchBean upon contextDestroyed, causing resource leak in the container. Key: NUTCH-746 URL: https://issues.apache.org/jira/browse/NUTCH-746 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Environment: Apache Tomcat 5.5.27 and 6.0.18, Fedora 11, OpenJDK or Sun JDK 1.6 OpenJDK 64-Bit Server VM (build 14.0-b15, mixed mode) Reporter: Kirby Bohling Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-746.patch NutchBeanConstructor is not cleaning up upon application shutdown (contextDestroyed()). It leaves open the SegmentUpdater, and potentially other resources. This prevents the WebApp's classloader from being GC'ed in Tomcat, which after repeated restarts will lead to a PermGen error. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
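The shape of the fix is a standard servlet lifecycle hook; a rough sketch is shown below. The context attribute key and the close() call are assumptions, not the committed patch.
{code}
import java.io.IOException;
import javax.servlet.ServletContext;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import org.apache.nutch.searcher.NutchBean;

public class NutchBeanConstructor implements ServletContextListener {
  public void contextInitialized(ServletContextEvent sce) {
    // existing code: create the NutchBean and store it in the servlet context
  }
  public void contextDestroyed(ServletContextEvent sce) {
    ServletContext ctx = sce.getServletContext();
    NutchBean bean = (NutchBean) ctx.getAttribute("nutchBean"); // attribute key assumed
    if (bean != null) {
      try {
        bean.close(); // stops the SegmentUpdater and releases open searchers
      } catch (IOException e) {
        ctx.log("Failed to close NutchBean", e);
      }
    }
  }
}
{code}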
[jira] Closed: (NUTCH-738) Close SegmentUpdater when FetchedSegments is closed
[ https://issues.apache.org/jira/browse/NUTCH-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-738. --- Resolution: Fixed Assignee: Andrzej Bialecki Close SegmentUpdater when FetchedSegments is closed --- Key: NUTCH-738 URL: https://issues.apache.org/jira/browse/NUTCH-738 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Martina Koch Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: FetchedSegments.patch, NUTCH-738.patch Currently FetchedSegments starts a SegmentUpdater, but never closes it when FetchedSegments is closed. (The problem was described in this mailing: http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg13823.html) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-739. --- Resolution: Fixed Assignee: Andrzej Bialecki SolrDeleteDuplications too slow when using hadoop - Key: NUTCH-739 URL: https://issues.apache.org/jira/browse/NUTCH-739 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: hadoop cluster with 3 nodes Map Task Capacity: 6 Reduce Task Capacity: 6 Indexer: one instance of solr server (on one of the slave nodes) Reporter: Dmitry Lihachev Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch in my environment i always have many warnings like this on the dedup step {noformat} Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing! {noformat} solr logs: {noformat} INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {optimize=} 0 173599 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close INFO: Closing searc...@2ad9ac58 main May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed {noformat} So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications ( solr.optimize() ). Because we have a few job tasks, each of them tries to optimize the solr index before closing. The simplest way to avoid this bug is removing this line and sending an <optimize/> message directly to the solr server after the dedup step -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-739) SolrDeleteDuplications too slow when using hadoop
[ https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783290#action_12783290 ] Andrzej Bialecki commented on NUTCH-739: - Fixed in rev. 885152. Thank you! SolrDeleteDuplications too slow when using hadoop - Key: NUTCH-739 URL: https://issues.apache.org/jira/browse/NUTCH-739 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.0.0 Environment: hadoop cluster with 3 nodes Map Task Capacity: 6 Reduce Task Capacity: 6 Indexer: one instance of solr server (on one of the slave nodes) Reporter: Dmitry Lihachev Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-739_remove_optimize_on_solr_dedup.patch in my environment i always have many warnings like this on the dedup step {noformat} Task attempt_200905270022_0212_r_03_0 failed to report status for 600 seconds. Killing! {noformat} solr logs: {noformat} INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173741 May 27, 2009 10:29:27 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {optimize=} 0 173599 May 27, 2009 10:29:27 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/update params={wt=javabin&waitFlush=true&optimize=true&waitSearcher=true&maxSegments=1&version=2.2} status=0 QTime=173599 May 27, 2009 10:29:27 AM org.apache.solr.search.SolrIndexSearcher close INFO: Closing searc...@2ad9ac58 main May 27, 2009 10:29:27 AM org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean getMBeanInfo WARNING: Could not getStatistics on info bean org.apache.solr.search.SolrIndexSearcher org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed {noformat} So I think the problem is in the piece of code on line 301 of SolrDeleteDuplications ( solr.optimize() ). Because we have a few job tasks, each of them tries to optimize the solr index before closing. The simplest way to avoid this bug is removing this line and sending an <optimize/> message directly to the solr server after the dedup step -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
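The suggested fix (drop the per-task optimize and issue a single one after dedup) amounts to a few lines of SolrJ in the driver, along these lines; the URL is a placeholder and this is a sketch of the idea, not the committed change.
{code}
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeAfterDedup {
  public static void main(String[] args) throws Exception {
    // Placeholder URL - point this at the Solr instance used for indexing.
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    solr.optimize(); // one optimize call from the driver, instead of one per reduce task
  }
}
{code}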
[jira] Closed: (NUTCH-755) DomainURLFilter crashes on malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-755. --- Resolution: Cannot Reproduce Assignee: Andrzej Bialecki DomainURLFilter crashes on malformed URL Key: NUTCH-755 URL: https://issues.apache.org/jira/browse/NUTCH-755 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Tomcat 6.0.14 Java 1.6.0_14 Linux Reporter: Mike Baranczak Assignee: Andrzej Bialecki 2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply filter on url: http:/comments.php java.lang.NullPointerException at org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173) at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) Expected behavior would be to recognize the URL as malformed, and reject it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-755) DomainURLFilter crashes on malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783299#action_12783299 ] Andrzej Bialecki commented on NUTCH-755: - I could not verify that the filter indeed crashes - it simply prints the exception and then returns null, as you suggested. DomainURLFilter crashes on malformed URL Key: NUTCH-755 URL: https://issues.apache.org/jira/browse/NUTCH-755 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: Tomcat 6.0.14 Java 1.6.0_14 Linux Reporter: Mike Baranczak Assignee: Andrzej Bialecki 2009-09-16 21:54:17,001 ERROR [Thread-156] DomainURLFilter - Could not apply filter on url: http:/comments.php java.lang.NullPointerException at org.apache.nutch.urlfilter.domain.DomainURLFilter.filter(DomainURLFilter.java:173) at org.apache.nutch.net.URLFilters.filter(URLFilters.java:88) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:200) at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:113) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:96) at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:70) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) Expected behavior would be to recognize the URL as malformed, and reject it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
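The requested behaviour maps onto the URLFilter contract, where returning null rejects a URL. A hedged sketch of the guard (not the actual DomainURLFilter source):
{code}
// Sketch of the defensive check being asked for; not the plugin's actual code.
public String filter(String urlString) {
  try {
    java.net.URL url = new java.net.URL(urlString);
    String host = url.getHost();
    if (host == null || host.length() == 0) {
      return null; // malformed, e.g. "http:/comments.php" - reject it
    }
    // ... existing lookup of the host/domain against the configured domain set ...
    return urlString;
  } catch (java.net.MalformedURLException e) {
    return null; // reject instead of letting a NullPointerException escape
  }
}
{code}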
[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783302#action_12783302 ] Andrzej Bialecki commented on NUTCH-692: - We should review this issue after the upgrade to Hadoop 0.20 - task output mgmt differs there, and the problem may be nonexistent. AlreadyBeingCreatedException with Hadoop 0.19 - Key: NUTCH-692 URL: https://issues.apache.org/jira/browse/NUTCH-692 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Julien Nioche Attachments: NUTCH-692.patch I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedException during the reduce phase of a parse. For some reason one of my tasks crashed and then I ran into this AlreadyBeingCreatedException when other nodes tried to pick it up. There was recently a discussion on the Hadoop user list on similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do if the problems persist with 0.19 I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use it for Nutch 1.0? I will be running a crawl on a super large cluster in the next couple of weeks and I will confirm this issue J. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-741) Job file includes multiple copies of nutch config files.
[ https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783304#action_12783304 ] Andrzej Bialecki commented on NUTCH-741: - Fixed in rev. 885156. Thank you! Job file includes multiple copies of nutch config files. Key: NUTCH-741 URL: https://issues.apache.org/jira/browse/NUTCH-741 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Kirby Bohling Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: removeJobDupConf.diff From a clean checkout, running ant tar will create a .job file. The .job file includes two copies of the nutch-site.xml and nutch-default.xml file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-741) Job file includes multiple copies of nutch config files.
[ https://issues.apache.org/jira/browse/NUTCH-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-741. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Job file includes multiple copies of nutch config files. Key: NUTCH-741 URL: https://issues.apache.org/jira/browse/NUTCH-741 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.0.0 Reporter: Kirby Bohling Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: removeJobDupConf.diff From a clean checkout, running ant tar will create a .job file. The .job file includes two copies of the nutch-site.xml and nutch-default.xml file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-712. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers - Key: NUTCH-712 URL: https://issues.apache.org/jira/browse/NUTCH-712 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: ParseOutputFormat-NUTCH712v2.patch ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-712) ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers
[ https://issues.apache.org/jira/browse/NUTCH-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783306#action_12783306 ] Andrzej Bialecki commented on NUTCH-712: - Fixed in rev. 885159. Thank you! ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers - Key: NUTCH-712 URL: https://issues.apache.org/jira/browse/NUTCH-712 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: ParseOutputFormat-NUTCH712v2.patch ParseOutputFormat should catch java.net.MalformedURLException coming from normalizers otherwise the whole parsing step crashes instead of simply ignoring dodgy outlinks -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
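The fix amounts to wrapping the per-outlink normalize/filter calls so that a bad outlink is skipped rather than failing the task. A paraphrased sketch using the URLNormalizers/URLFilters calls named in the issue (not an exact diff of the commit):
{code}
// Paraphrase of the per-outlink handling in ParseOutputFormat: a dodgy outlink is
// dropped instead of crashing the whole parsing step.
private String normalizeAndFilter(String toUrl, URLNormalizers normalizers, URLFilters filters) {
  try {
    toUrl = normalizers.normalize(toUrl, URLNormalizers.SCOPE_OUTLINK);
    toUrl = filters.filter(toUrl);
  } catch (MalformedURLException e) {
    return null; // malformed outlink - ignore it
  } catch (URLFilterException e) {
    return null;
  }
  return toUrl;
}
{code}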
[jira] Created: (NUTCH-772) Upgrade Nutch to use Lucene 2.9.1
Upgrade Nutch to use Lucene 2.9.1 - Key: NUTCH-772 URL: https://issues.apache.org/jira/browse/NUTCH-772 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Upgrade Nutch to the latest Lucene release. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: svn commit: r884075 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
david.stu...@progressivealliance.co.uk wrote: While you are doing changes and commits in this area I have been waiting for this patch https://issues.apache.org/jira/browse/NUTCH-760 of mine to be incorporated for a while now. Is it possible it get it in?? It's on my agenda - I'll apply the patch either today or tomorrow, time permitting. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Closed: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java
[ https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-773. --- Resolution: Fixed Assignee: Andrzej Bialecki some minor bugs in AbstractFetchSchedule.java - Key: NUTCH-773 URL: https://issues.apache.org/jira/browse/NUTCH-773 Project: Nutch Issue Type: Bug Components: fetcher, generator Affects Versions: 1.0.0 Reporter: Reinhard Schwab Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: NUTCH-773.patch fixes some minor trivial bugs in AbstractFetchSchedule.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-773) some minor bugs in AbstractFetchSchedule.java
[ https://issues.apache.org/jira/browse/NUTCH-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782509#action_12782509 ] Andrzej Bialecki commented on NUTCH-773: - That was a nasty bug - fixed in rev. 884198. Thanks! some minor bugs in AbstractFetchSchedule.java - Key: NUTCH-773 URL: https://issues.apache.org/jira/browse/NUTCH-773 Project: Nutch Issue Type: Bug Components: fetcher, generator Affects Versions: 1.0.0 Reporter: Reinhard Schwab Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: NUTCH-773.patch fixes some minor trivial bugs in AbstractFetchSchedule.java -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice
[ https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782516#action_12782516 ] Andrzej Bialecki commented on NUTCH-753: - Fixed in rev. 884203 - thanks! Prevent new Fetcher to retrieve the robots twice Key: NUTCH-753 URL: https://issues.apache.org/jira/browse/NUTCH-753 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-753.patch The new Fetcher which is now used by default handles the robots file directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to false to prevent fetching the robots.txt twice (in Fetcher + in protocol), which avoids calling robots.isAllowed. However in practice the robots file is still fetched as there is a call to robots.getCrawlDelay() a bit further which is not covered by the if (Protocol.CHECK_ROBOTS). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
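Paraphrased sketch of the change described above: keep the crawl-delay lookup inside the same guard as the isAllowed() check so robots.txt is only fetched once. The variable names are assumptions; this is not an exact diff of the patch.
{code}
if (checkRobots) {                     // the fetcher's CHECK_ROBOTS guard
  RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
  if (!rules.isAllowed(fit.u)) {
    // report ROBOTS_DENIED and skip the URL, as before
  } else if (rules.getCrawlDelay() > 0) {
    // only now read the crawl delay, so the robots file is not fetched a second time
    crawlDelay = rules.getCrawlDelay();
  }
}
{code}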
[jira] Closed: (NUTCH-753) Prevent new Fetcher to retrieve the robots twice
[ https://issues.apache.org/jira/browse/NUTCH-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-753. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Prevent new Fetcher to retrieve the robots twice Key: NUTCH-753 URL: https://issues.apache.org/jira/browse/NUTCH-753 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: NUTCH-753.patch The new Fetcher which is now used by default handles the robots file directly instead of relying on the protocol. The options Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS are set to false to prevent fetching the robots.txt twice (in Fetcher + in protocol), which avoids calling robots.isAllowed. However in practice the robots file is still fetched as there is a call to robots.getCrawlDelay() a bit further which is not covered by the if (Protocol.CHECK_ROBOTS). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782524#action_12782524 ] Andrzej Bialecki commented on NUTCH-762: - This class offers a strict superset of the current Generator functionality. Maintaining both tools would be cumbersome and error-prone. I propose to replace Generator with MultiGenerator (under the current name Generator). Alternative Generator which can generate several segments in one parse of the crawlDB - Key: NUTCH-762 URL: https://issues.apache.org/jira/browse/NUTCH-762 Project: Nutch Issue Type: New Feature Components: generator Affects Versions: 1.0.0 Reporter: Julien Nioche Attachments: NUTCH-762-MultiGenerator.patch When using Nutch on a large scale (e.g. billions of URLs), the operations related to the crawlDB (generate - update) tend to take the biggest part of the time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB, then updating the DB only once with several segments. The existing Generator allows several successive runs by generating a copy of the crawlDB and marking the URLs to be fetched. In practice this approach does not work well as we need to read the whole crawlDB as many times as we generate a segment. The patch attached contains an implementation of a MultiGenerator which can generate several fetchlists by reading the crawlDB only once. The MultiGenerator differs from the Generator in other aspects: * can filter the URLs by score * normalisation is optional * IP resolution is done ONLY on the entries which have been selected for fetching (during the partitioning). Running the IP resolution on the whole crawlDb is too slow to be usable on a large scale * can max the number of URLs per host or domain (but not by IP) * can choose to partition by host, domain or IP Typically the same unit (e.g. domain) would be used for maxing the URLs and for partitioning; however as we can't count the max number of URLs by IP another unit must be chosen while partitioning by IP. We found that using a filter on the score can dramatically improve the performance as this reduces the amount of data being sent to the reducers. The MultiGenerator is called via : nutch org.apache.nutch.crawl.MultiGenerator ... with the following options : MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num] where most parameters are similar to the default Generator - apart from : -noNorm (explicit) -topN : max number of URLs per segment -maxNumSegments : the actual number of segments generated could be less than the max value if e.g. not enough URLs are available for fetching and they fit in fewer segments Please give it a try and let me know what you think of it Julien Nioche http://www.digitalpebble.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
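An example invocation following the synopsis above; the paths and numbers are placeholders.
{noformat}
bin/nutch org.apache.nutch.crawl.MultiGenerator crawl/crawldb crawl/segments -topN 250000 -numFetchers 4 -maxNumSegments 8 -noFilter
{noformat}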
[jira] Closed: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer
[ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-761. --- Resolution: Fixed Fix Version/s: 1.1 Assignee: Andrzej Bialecki Avoid cloning CrawlDatum in CrawlDbReducer -- Key: NUTCH-761 URL: https://issues.apache.org/jira/browse/NUTCH-761 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: optiCrawlReducer.patch In the huge majority of cases the CrawlDbReducer gets a unique CrawlDatum in its reduce phase, and these will be the entries coming from the crawlDB and not present in the segments. The patch attached optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. This has more impact as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer
[ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12782537#action_12782537 ] Andrzej Bialecki commented on NUTCH-761: - I applied the patch with some changes - reverted the logic in the name of the boolean var, and applied the same method to other cases of non-multiple values. Committed in rev. 884224 - thanks! Avoid cloning CrawlDatum in CrawlDbReducer -- Key: NUTCH-761 URL: https://issues.apache.org/jira/browse/NUTCH-761 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.1 Attachments: optiCrawlReducer.patch In the huge majority of cases the CrawlDbReducer gets a unique CrawlDatum in its reduce phase, and these will be the entries coming from the crawlDB and not present in the segments. The patch attached optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. This has more impact as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
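The optimization boils down to skipping the defensive copy when there is exactly one value. A generic sketch of the reduce-side idiom (illustrative only, not the committed change):
{code}
// Skip the copy in the common single-value case; the old-API iterator reuses its
// object instance, which is why a copy is otherwise required before reading further values.
CrawlDatum datum = values.next();
boolean multiple = values.hasNext();
if (multiple) {
  CrawlDatum copy = new CrawlDatum();
  copy.set(datum);
  datum = copy;
}
// ... the rest of the reduce logic uses 'datum' (and, when 'multiple' is true,
//     the remaining segment entries) exactly as before ...
{code}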
[jira] Closed: (NUTCH-760) Allow field mapping from nutch to solr index
[ https://issues.apache.org/jira/browse/NUTCH-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-760. --- Resolution: Fixed Fix Version/s: 1.1 Allow field mapping from nutch to solr index Key: NUTCH-760 URL: https://issues.apache.org/jira/browse/NUTCH-760 Project: Nutch Issue Type: Improvement Components: indexer Reporter: David Stuart Assignee: Andrzej Bialecki Fix For: 1.1 Attachments: solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch, solrindex_schema.patch I am using nutch to crawl sites and have combined it with solr, pushing the nutch index using the solrindex command. I have set it up as specified on the wiki using the copyField url to id in the schema. Whilst this works fine, it stuffs up my inputs from other sources in solr (e.g. using the solr data import handler) as they have both ids and urls. I have a patch that implements a nutch xml schema defining what basic nutch fields map to in your solr push. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
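The idea is a small mapping file consulted by solrindex instead of a copyField in the Solr schema. A sketch of what such a mapping could look like; the element and attribute names are illustrative, not necessarily those used in the patch.
{code}
<mapping>
  <fields>
    <!-- map Nutch field names (source) onto Solr schema fields (dest) -->
    <field source="content" dest="content"/>
    <field source="title" dest="title"/>
    <field source="segment" dest="segment"/>
    <field source="boost" dest="boost"/>
    <field source="digest" dest="digest"/>
    <field source="url" dest="url"/>
    <!-- fill the Solr unique key from the Nutch url, replacing the copyField trick -->
    <field source="url" dest="id"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
{code}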