Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I committed them a long time ago in an attempt to bring some static analysis tools to the nutch sources. There was a short discussion around it and we all thought it was worth doing, but it never gained enough momentum. There is a pmd target in the build.xml file that uses them - they are not needed at runtime nor for standard builds. As nutch is built using hudson now, I think it would be worth integrating pmd (and checkstyle/findbugs/cobertura might also be interesting) - hudson has very nice plugins for such tools. I use it in my daily job and have found it valuable. But as I am not an active committer now (I only try to follow the mailing lists) I do not think it is my call. But if everyone is interested I can try to look at the integration (though it will move forward slowly - my youngest kid was born just 2 months ago and takes a lot of attention). Piotr On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) j...@apache.org wrote: Update external jars to latest versions --- Key: NUTCH-680 URL: https://issues.apache.org/jira/browse/NUTCH-680 Project: Nutch Issue Type: Improvement Reporter: Doğacan Güney Assignee: Doğacan Güney Priority: Minor Fix For: 1.0.0 This issue will be used to update the external libraries nutch uses. These are the libraries that are outdated (upon a quick glance): nekohtml (1.9.9) lucene-highlighter (2.4.0) jdom (1.1) carrot2 - as mentioned in another issue jets3t - above icu4j (4.0.1) jakarta-oro (2.0.8) We should probably update tika to whatever the latest is as well before 1.0. Please add ones I missed in comments. Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
From what I know (the way we use hudson), hudson has plugins for presenting the tool results only, and the tools need to be executed during the build - and the libraries need to be included so they are available to ant. Piotr On Tue, Jan 20, 2009 at 9:40 PM, Doğacan Güney doga...@gmail.com wrote: On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote: That I don't know... I don't see the jars here: http://svn.apache.org/viewvc/hadoop/core/trunk/lib/ But who knows, maybe maven/ivy fetch them on demand. I don't know. Hmm, does 0.19 use ivy (0.19 also doesn't have pmd)? http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/ Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney doga...@gmail.com To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 1:13:20 PM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic wrote: Lucene doesn't use anything. Hadoop uses pmd integrated in Hudson. Does this mean we do not need the pmd jars in nutch (are they provided by hudson)? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 10:49:44 AM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions 2009/1/20 Piotr Kosiorowski : pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I committed them a long time ago in an attempt to bring some static analysis tools to the nutch sources. There was a short discussion around it and we all thought it was worth doing, but it never gained enough momentum. There is a pmd target in the build.xml file that uses them - they are not needed at runtime nor for standard builds.
As nutch is built using hudson now, I think it would be worth integrating pmd (and checkstyle/findbugs/cobertura might also be interesting) - hudson has very nice plugins for such tools. I use it in my daily job and have found it valuable. Thanks for the explanation. I am definitely +1 on having some sort of static analysis tools for nutch. Does anyone know what hadoop/hbase/lucene use for this? Or do they use anything at all? But as I am not an active committer now (I only try to follow the mailing lists) I do not think it is my call. But if everyone is interested I can try to look at the integration (though it will move forward slowly - my youngest kid was born just 2 months ago and takes a lot of attention). Congratulations! Piotr -- Doğacan Güney
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
I have configured hudson for 10 or more projects and always used the pmd plugin to display the pmd results only - the actual pmd task to generate the report was run from the ant script. Maybe there is a possibility to run pmd reports directly in hudson (not through project build scripts) but I have never come across it. Piotr On Tue, Jan 20, 2009 at 10:39 PM, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote: They've had pmd integrated with Hudson for many months now, I believe. I've seen patches in JIRA that were the result of fixes for problems reported by pmd. Or maybe they run pmd by hand? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney doga...@gmail.com To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 3:40:20 PM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic wrote: That I don't know... I don't see the jars here: http://svn.apache.org/viewvc/hadoop/core/trunk/lib/ But who knows, maybe maven/ivy fetch them on demand. I don't know. Hmm, does 0.19 use ivy (0.19 also doesn't have pmd)? http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/ Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 1:13:20 PM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic wrote: Lucene doesn't use anything. Hadoop uses pmd integrated in Hudson. Does this mean we do not need the pmd jars in nutch (are they provided by hudson)?
Re: FW: Nutch release process help
Chris, I have documented the process in the wiki. Doug has sent the links already. If you have any questions I would be willing to help. I can even do it myself if you find it difficult - I simply do not want to be the bottleneck, as I am behind schedule at work and in private life. I still hope I will be able to be more active in the nutch community in the future. Regards Piotr On 3/6/07, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: It's too bad that this has turned out to be an issue that I've handled incorrectly, and for that, I apologize. Sorry if I blew this out of proportion. We all help each other run this project. I don't think any grave error was made. I just saw an opportunity to remind folks to try to keep project discussions public, and did not mean to rebuke you. I am thrilled that you want to take on the responsibility of making a release. I very much do not want to dampen your enthusiasm for that. As you probably know, the release documentation is at: http://wiki.apache.org/nutch/Release_HOWTO This may need to be updated. You might also look at the release documentation for other projects, to get ideas. http://wiki.apache.org/lucene-hadoop/HowToRelease http://wiki.apache.org/solr/HowToRelease http://wiki.apache.org/jakarta-lucene/ReleaseTodo Cheers, Doug
Re: Reviving Nutch 0.7
Otis, Some time ago people on the list said that they were willing to at least maintain the Nutch 0.7 branch. As a committer (not very active recently) I volunteered to commit patches when they appear - I do not have enough time at the moment to do active coding. I have created a 0.7.3 release in JIRA so we can start looking at it. So - we are ready and willing to move Nutch 0.7 forward, but it looks like there is no interest at the moment. Regards Piotr On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally. Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today. However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be. Fairly regular nutch-user inquiries confirm this. Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond. So, what is going to happen to 0.7? Maintenance mode? I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward. Thoughts? Otis
[jira] Closed: (NUTCH-429) Secured Searches
[ https://issues.apache.org/jira/browse/NUTCH-429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Piotr Kosiorowski closed NUTCH-429. --- Resolution: Invalid Please use the nutch-user mailing list for such questions and JIRA for reporting issues only. I also suggest being more specific on the mailing list about what you mean by Secured Searches. Secured Searches Key: NUTCH-429 URL: https://issues.apache.org/jira/browse/NUTCH-429 Project: Nutch Issue Type: Bug Reporter: Piyush Does NUTCH Support secured Searches? If yes, could you please forward me to appropriate documentation -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: 0.7.3 version
As no objections were raised, I created a 0.7.3 version in JIRA so we can start assigning current JIRA issues to it. Regards Piotr Piotr Kosiorowski wrote: Hello committers, Based on a recent discussion on the nutch user list - (Strategic Direction of Nutch) I would like to prepare a 0.7.3 release. The idea is to allow people who still use 0.7.2 to get rid of the most important bugs and to let them add some small features they need, as the claim is that 0.8.1 is not good for small crawls at the moment. It will allow us to work on the 0.8 branch so it becomes more friendly to small installations. I would like to approach it this way: if no one objects I will create a 0.7.3 release in JIRA and ask people to assign issues with patches to it. I do not have a lot of time personally, so I do not plan to do any development myself - just taking care of high quality patches and committing them - and after some time, when we have gathered some amount of bugfixes/issues, I will prepare the 0.7.3 release. Any objections or comments? Regards Piotr
0.7.3 version
Hello committers, Based on a recent discussion on the nutch user list - (Strategic Direction of Nutch) I would like to prepare a 0.7.3 release. The idea is to allow people who still use 0.7.2 to get rid of the most important bugs and to let them add some small features they need, as the claim is that 0.8.1 is not good for small crawls at the moment. It will allow us to work on the 0.8 branch so it becomes more friendly to small installations. I would like to approach it this way: if no one objects I will create a 0.7.3 release in JIRA and ask people to assign issues with patches to it. I do not have a lot of time personally, so I do not plan to do any development myself - just taking care of high quality patches and committing them - and after some time, when we have gathered some amount of bugfixes/issues, I will prepare the 0.7.3 release. Any objections or comments? Regards Piotr
Re: How to start working with MapReduce?
Please read the tutorial on the nutch site. I suggest posting such issues to nutch-user - you will have a much higher chance of getting a useful response there. Regards Piotr On 11/9/06, kauu [EMAIL PROTECTED] wrote: or is it the same with version 0.8.x? any idea is appreciated On 11/9/06, kauu [EMAIL PROTECTED] wrote: does anyone know the details of the process in the topic "how to start working with MapReduce"? I've read something in the FAQ, but I don't understand it very well. My version is 0.7.2, not 0.8.x -- www.babatu.com
Re: email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)
+1 On 10/16/06, Doug Cutting [EMAIL PROTECTED] wrote: Sami Siren wrote: looks like somebody just enabled the email-to-jira-comments feature. I was just wondering, would it be good to use this feature more widely? I think it would be good. That way mailing list discussion would be logged to the bug as well. This could be achieved by removing the reply-to header from messages coming from jira so that replies get sent to [EMAIL PROTECTED] (I am assuming that is possible). So whenever somebody just hits reply from an email client and writes the comment, it would get automatically attached to the correct issue as a comment. I sent a message to [EMAIL PROTECTED] this morning asking about this. If it's possible, and no one objects, I will request it for the Nutch mailing lists. Doug
Re: Nutch requires JDK 1.5 now?
I had a look at it and it seems I do not have enough permissions to change it. So probably this one goes to Doug... P. Chris Mattmann wrote: Hey Guys, Speaking of which, I noticed that Sami's issue below is a Task in JIRA, which reminded me of a task that I filed a long time ago that would be nice to fix real quick (for those with JIRA permissions to do so): http://issues.apache.org/jira/browse/NUTCH-304 We should really change the email address for JIRA to not use the Apache incubator one anymore, and to use the Lucene one. Sound good? If so, could someone with permissions please take care of it? :-) Cheers, Chris On 10/3/06 9:04 AM, Sami Siren [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Chris Mattmann wrote: Hi Folks, I noticed that Nutch now requires JDK 5 in order to compile, due to recent changes to the PluginRepository and some other classes. I think that this is a good move; however, I wasn't sure that I had seen any official announcement that Nutch now requires 1.5... This is a proactive change - as soon as we upgrade to Hadoop 0.6.x we will lose 1.4 compatibility anyway, so we may as well prepare in advance. Also, "now" refers to the unreleased 0.9; we will keep branch 0.8.x compatible with 1.4. The switch to the 1.5 format was also logged in jira issue http://issues.apache.org/jira/browse/NUTCH-360 -- Sami Siren __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
[jira] Assigned: (NUTCH-374) when http.content.limit be set to -1 and Response.CONTENT_ENCODING is gzip or x-gzip , it can not fetch any thing.
[ http://issues.apache.org/jira/browse/NUTCH-374?page=all ] Piotr Kosiorowski reassigned NUTCH-374: --- Assignee: Piotr Kosiorowski when http.content.limit be set to -1 and Response.CONTENT_ENCODING is gzip or x-gzip , it can not fetch any thing. - Key: NUTCH-374 URL: http://issues.apache.org/jira/browse/NUTCH-374 Project: Nutch Issue Type: Bug Affects Versions: 0.8, 0.8.1 Reporter: King Kong Assigned To: Piotr Kosiorowski I set http.content.limit to -1 so the content being fetched is not truncated. However, if the response used gzip or x-gzip, it was not able to uncompress. I found the problem is in HttpBase.processGzipEncoded (plugin lib-http): ... byte[] content = GZIPUtils.unzipBestEffort(compressed, getMaxContent()); ... Because it does not handle -1 (no limit), the code must be modified to solve it: byte[] content; if (getMaxContent() >= 0) { content = GZIPUtils.unzipBestEffort(compressed, getMaxContent()); } else { content = GZIPUtils.unzipBestEffort(compressed); } -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
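The fix described in NUTCH-374 boils down to branching on whether a content limit is set, where -1 means "no limit". GZIPUtils.unzipBestEffort is Nutch's own utility, so the following is only a hedged, self-contained sketch of the same convention using plain java.util.zip (the class and method names here are illustrative, not Nutch's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLimitDemo {
    // Decompress gzip data; maxContent < 0 means "no limit",
    // mirroring the http.content.limit = -1 convention from the report.
    static byte[] unzip(byte[] compressed, int maxContent) throws IOException {
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int total = 0, n;
        while ((n = in.read(buf)) != -1) {
            if (maxContent >= 0 && total + n > maxContent) {
                out.write(buf, 0, maxContent - total); // truncate at the limit
                return out.toByteArray();
            }
            out.write(buf, 0, n);
            total += n;
        }
        return out.toByteArray();
    }

    // Helper to produce gzip-compressed test input.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(data);
        gz.close();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello gzip world".getBytes("UTF-8");
        byte[] compressed = gzip(original);
        System.out.println(unzip(compressed, -1).length); // 16: full content, no limit
        System.out.println(unzip(compressed, 5).length);  // 5: truncated at limit
    }
}
```

The key point matches the patch in the report: the limit is consulted only when it is non-negative, so -1 decompresses everything instead of failing.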
Re: 0.8 release
No objections from me. We waited long and we can fix things in a maintenance release in a few weeks. Regards Piotr On 7/26/06, Sami Siren [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Sami Siren wrote: There is a package available for testing in http://people.apache.org/~siren/nutch-0.8/ please give it some testing and post your opinion - is it good enough to be a public release? I have some doubts because of NUTCH-266, but so far only 3 people have reported this to be a problem (me included) This is I guess related to a very specific environment - multiple nodes running on cygwin. Usually people run multiple nodes on some flavor of Unix. I don't have any means to test it for this issue ... The bug also appears in a single-node configuration, but I think that it is not that common (guessing from the number of people who have reported it). However, that is now fixed in hadoop trunk. Should we use a patched version of hadoop-0.4.0 in Nutch or wait for 0.5 (which at least still seems to be 1.4 compatible)? The 0.8 package has now hit the mirrors; does anybody have any objections to announcing it? Stefan already commented about two issues he wished to be fixed in 0.8, but to me it looks like they can both be addressed with configuration changes and documentation in the first place, and there's nothing stopping us from releasing 0.8.1 in a very short time addressing the issues discovered in 0.8. -- Sami Siren
Re: log when blocked by robots.txt
I think I would log in both situations but with a different message. +1 P. On 7/21/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Developers, another thing in the discussion to be more polite. I suggest that we log a message in case a requested URL was blocked by a robots.txt. Optimal would be if we only log this message in case only the currently used agent name is blocked and it is not a general blocking of all agents. Should I create a patch? Stefan
Re: Nutch web site
It was maintained in the branch as we agreed the public website should contain docs for the released version. I have nothing against moving it to trunk and maintaining it there. Sorry for the late response, but I am just back from vacation and going through all these emails. Regards, Piotr Sami Siren wrote: Piotr, is there a reason why this (among other) documentation (for all relevant versions) could not be maintained in trunk? -- Sami Siren Piotr Kosiorowski wrote: Andrzej Bialecki wrote: +1, yes it would be really confusing. Since there are more and more people trying 0.8, could we perhaps include a short note that 0.8 and later is NOT compatible with this tutorial, and a reference to the tutorial for 0.8 (or the trunk/branch in general)? I can add both tutorials to the Nutch web site, named "Tutorial for 0.7 version" and "Tutorial for 0.8 version". It should make things clear. Anyone against it? Piotr
Re: 0.8 release
+1. P. Andrzej Bialecki wrote: Sami Siren wrote: How would folks feel about releasing 0.8 now, there has been quite a lot of improvements/new features since 0.7 series and I strongly feel that we should push the first 0.8 series release (alfa/beta) out the door now. It would IMO lower the barrier to first timers try the 0.8 series and that would give us more feedback about the overall quality. Definitely +1. Let's do some testing, however, after the upgrade to hadoop 0.3.2 - hadoop had many, many changes, so we just need to make sure it's stable when used with Nutch ... We should also check JIRA and apply any trivial fixes before the release. If there is a consensus about this I can volunteer to be the RM. That would be great, thanks!
Re: 0.8 release?
I had problems with DOS/Unix newlines and some (still unsolved) environment settings on my linux box - I will try to solve it. Anyway, I was able to apply the patch on Cygwin. Could you please have a look at it so we can be sure I have not applied it wrongly (I think it is correct, but I did it so many times that I want to cross check). Regards Piotr Dawid Weiss wrote: What kind of problems? If you need something, let me know. D. Piotr Kosiorowski wrote: I got some problems while applying Dawid's clustering patch (my linux environment seems not to be set up correctly) - but I switched to cygwin and it looks ok. I will try to commit it today/tomorrow. Regards Piotr On 4/12/06, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Guys, Any progress on the 0.8 release? Was there any resolution about which JIRA issues to complete before the 0.8 release? We had a bit of conversation there and some ideas, but no definitive answer... Thanks for your help, and sorry to pester ;) Cheers, Chris
Re: 0.8 release?
I got some problems while applying Dawid's clustering patch (my linux environment seems not to be set up correctly) - but I switched to cygwin and it looks ok. I will try to commit it today/tomorrow. Regards Piotr On 4/12/06, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Guys, Any progress on the 0.8 release? Was there any resolution about which JIRA issues to complete before the 0.8 release? We had a bit of conversation there and some ideas, but no definitive answer... Thanks for your help, and sorry to pester ;) Cheers, Chris
Re: mapred branch
Anton Potehin wrote: Where is the mapred branch of nutch now? It is developed in trunk now. P.
Re: PMD integration
Jérôme Charron wrote: 2) We do have oro 2.0.7 in the dependencies (I think urlfilter and similar things use it). PMD requires oro 2.0.8. Do you think we can upgrade (as far as I know 2.0.7 and 2.0.8 should be compatible)? We would then have only one oro jar. Piotr, please keep oro-2.0.8 in pmd-ext. I think we can plan to replace the oro regexes with java ones (as in RegexUrlFilter) in the whole nutch code (and then remove oro-2.0.7 from lib): src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java src/java/org/apache/nutch/parse/OutlinkExtractor.java src/java/org/apache/nutch/net/RegexUrlNormalizer.java src/java/org/apache/nutch/net/BasicUrlNormalizer.java I do not agree here - we are going to make a new release next week and releasing with two versions of oro does not look nice. oro is a quite stable product and the changes are in fact minimal: http://svn.apache.org/repos/asf/jakarta/oro/trunk/CHANGES I would like to upgrade to 2.0.8 (as no interface changes were made, it would be trivial) before the 0.8 release. What do others think? Regards Piotr
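The migration Jérôme proposes - replacing jakarta-oro regexes with the JDK's own java.util.regex (available since 1.4) - is mostly mechanical. A hedged sketch of the equivalent calls; the pattern and method names below are illustrative examples, not Nutch's actual filter code:

```java
import java.util.regex.Pattern;

public class RegexMigrationDemo {
    // jakarta-oro style (for comparison):
    //   Perl5Compiler compiler = new Perl5Compiler();
    //   Perl5Matcher matcher = new Perl5Matcher();
    //   org.apache.oro.text.regex.Pattern p = compiler.compile("^https?://");
    //   boolean hit = matcher.contains(input, p);
    //
    // java.util.regex equivalent. Note: a compiled Pattern is immutable and
    // thread-safe, while a Matcher is not - relevant for multi-threaded fetching.
    static final Pattern URL_PREFIX = Pattern.compile("^https?://");

    static boolean matchesPrefix(String input) {
        return URL_PREFIX.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesPrefix("http://lucene.apache.org/nutch/")); // true
        System.out.println(matchesPrefix("ftp://example.org/"));              // false
    }
}
```

Since oro's Perl5 syntax and java.util.regex are largely compatible for simple patterns, most uses in url filters and normalizers would translate one-for-one; exotic Perl5 constructs would need individual review.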
Re: PMD integration
I agree with Jérôme - plugins should be checked too. I would like to integrate PMD for core and plugins over the weekend based on Dawid's work - I will make it a totally separate target (so tests do not depend on it). The goal is to allow other developers to play with pmd easily, but at the same time I do not want the build to be affected. I would also like to look at the possibility of generating cross-referenced HTML code from the Nutch sources, as it looks like pmd can use it and violation reports would be much easier to read. P, On 4/7/06, Jérôme Charron [EMAIL PROTECTED] wrote: that right now it is checking only the main code (without plugins?). Yes, that's correct -- I forgot to mention that. The PMD target is hooked up with tests and stops the build if something fails. I thought the core code should be this strict; for plugins we can have more relaxed rules -1 Since plugins provide a lot of Nutch functionality (without any plugins, Nutch provides no service), I think that plugin code should be as strict as the core code. Thanks Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: PMD integration
I will make it a totally separate target (so tests do not depend on it). That was actually Doug's idea (and I agree with it) to stop the build if PMD complains about something. It's similar to testing -- if your tests fail, the entire build fails. I totally agree with it - but I want to switch it on for others to play with first, and when we agree on the rules we want to use, make it obligatory. Piotr
Re: 0.8 release schedule (was Re: latest build throws error - critical)
Doug Cutting wrote: Piotr, would you like to make this release, or should I? I would prefer you do it this time - I am not sure if I can find some time next week. I would like to do some things before the release though: 1) Commit the clustering patch from Dawid (I took it over from Andrzej). 2) Commit the pmd stuff as optional for this release. We will make it required later. 3) Review the tutorial - I saw some posts on the user list with claims about errors, so I would like to check it before the release. 4) It would be good to go through the JIRA issues before - but I am not sure if I will manage it. Any comments? Regards Piotr
Re: Patch to remove Nutch formating from logs
Hello Christopher, I personally do not like combining logging with severe error handling, but it has been one of the features of Nutch for some time and I do not think it causes infinite loops in normal installations. Changing it as we are preparing to release a new version is not a good idea in my opinion. But I will be happy if we change the way it is handled in the future. So for now -1. Piotr Christopher Burkey wrote: Did anyone get this email? Can a committer acknowledge this has been received? We have been having problems with infinite loops caused by Nutch. My theory is that the problem is related to using the log API to track severe errors. This patch is only a few lines of code and should be easy to insert. Please let me know if it has been received and what the feedback is. Christopher Burkey wrote: Hello, Here is a patch to change org.apache.nutch.util.LogFormatter to not insert itself as the default handler for the system. I have been using Nutch for a year and have been waiting for a version that I can embed into OpenEdit. The problem has been that Nutch inserts itself as the formatter for the Java log system and that interferes with OpenEdit logging. diff -Naur ../java/org/apache/nutch/util/LogFormatter.java java/org/apache/nutch/util/LogFormatter.java --- ../java/org/apache/nutch/util/LogFormatter.java 2006-03-31 13:40:50.0 -0500 +++ java/org/apache/nutch/util/LogFormatter.java 2006-04-05 16:27:59.0 -0400 @@ -16,13 +16,23 @@ package org.apache.nutch.util; -import java.util.logging.*; -import java.io.*; -import java.text.*; +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.io.PrintStream; +import java.io.PrintWriter; +import java.io.StringWriter; +import java.text.FieldPosition; +import java.text.SimpleDateFormat; import java.util.Date; - -/** Prints just the date and the log message. 
*/ - +import java.util.logging.Formatter; +import java.util.logging.Level; +import java.util.logging.LogRecord; +import java.util.logging.Logger; + +/** Prints just the date and the log message. + * This was also used to stop processing as nutch crawls a web site + * [EMAIL PROTECTED] changed this code to use a LogWrapper class to catch severe errors + * */ public class LogFormatter extends Formatter { private static final String FORMAT = "yyMMdd HHmmss"; private static final String NEWLINE = System.getProperty("line.separator"); @@ -35,20 +45,27 @@ private static boolean showTime = true; private static boolean showThreadIDs = false; + protected static LogFormatter sharedformatter = new LogFormatter(); + protected static SevereLogHandler sharedhandler = new SevereLogHandler(sharedformatter); + + /* // install when this class is loaded static { Handler[] handlers = LogFormatter.getLogger().getHandlers(); for (int i = 0; i < handlers.length; i++) { - handlers[i].setFormatter(new LogFormatter()); + handlers[i].setFormatter(sharedformatter); handlers[i].setLevel(Level.FINEST); } } - + */ /** Gets a logger and, as a side effect, installs this as the default * formatter. */ public static Logger getLogger(String name) { // just referencing this class installs it -return Logger.getLogger(name); +Logger logr = Logger.getLogger(name); +logr.addHandler(sharedhandler); + +return logr; } /** When true, time is logged with each entry. */ @@ -60,7 +77,10 @@ public static void setShowThreadIDs(boolean showThreadIDs) { LogFormatter.showThreadIDs = showThreadIDs; } - + public void setLoggedSevere( boolean inSevere ) + { + loggedSevere = inSevere; + } /** * Format the given LogRecord. * @param record the log record to be formatted. 
diff -Naur ../java/org/apache/nutch/util/SevereLogHandler.java java/org/apache/nutch/util/SevereLogHandler.java --- ../java/org/apache/nutch/util/SevereLogHandler.java1969-12-31 19:00:00.0 -0500 +++ java/org/apache/nutch/util/SevereLogHandler.java2006-04-05 16:29:20.0 -0400 @@ -0,0 +1,46 @@ +/* + * Created on Apr 5, 2006 + */ +package org.apache.nutch.util; + +import java.util.logging.Handler; +import java.util.logging.Level; +import java.util.logging.LogRecord; + +public class SevereLogHandler extends Handler +{ +protected LogFormatter fieldNutchFormatter; + +public SevereLogHandler(LogFormatter inFormatter) +{ +setNutchFormatter(inFormatter); +} + +protected LogFormatter getNutchFormatter() +{ +return fieldNutchFormatter; +} + +protected void setNutchFormatter(LogFormatter inNutchFormatter) +{ +fieldNutchFormatter = inNutchFormatter; +} + +public void publish(LogRecord inRecord) +{ +if ( inRecord.getLevel().intValue() == Level.SEVERE.intValue()) +{ +
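The SevereLogHandler patch above is cut off in the archive. For readers who want the gist, here is a self-contained sketch in the same spirit (class and method names are invented, not the actual patch): a java.util.logging Handler that merely records that a SEVERE record was seen, rather than altering global formatter state.

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch in the spirit of the truncated SevereLogHandler patch:
// the handler flags that a SEVERE message occurred so callers can react later.
public class SevereFlagHandler extends Handler {
    private volatile boolean loggedSevere = false;

    @Override
    public void publish(LogRecord record) {
        // Only note SEVERE (and above) records; everything else passes through.
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
            loggedSevere = true;
        }
    }

    @Override public void flush() { }
    @Override public void close() { }

    public boolean loggedSevere() { return loggedSevere; }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("demo");
        SevereFlagHandler handler = new SevereFlagHandler();
        log.addHandler(handler);
        log.severe("boom");
        System.out.println(handler.loggedSevere()); // true
    }
}
```

Attaching such a handler per logger (as the patch's getLogger does) avoids replacing the formatter on the root logger's handlers, which is what interfered with OpenEdit's logging.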
Re: PMD integration
Committed. One can run the pmd checks by 'ant pmd'. It produces a file with an HTML report in the build directory. It covers core nutch and plugins. Currently it uses the unusedcode ruleset checks only, but one can uncomment other rulesets in build.xml (or add other ones according to the pmd documentation). I would like to add cross-referenced source so the report is easier to read in the near future. I have two additional questions for developers: 1) Should we check test sources with pmd? 2) We do have oro 2.0.7 in dependencies (I think urlfilter and similar things). PMD requires oro 2.0.8. Do you think we can upgrade (as far as I know 2.0.7 and 2.0.8 should be compatible)? We would then have only one oro jar. So happy PMD-ing, Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: I will make it a totally separate target (so tests do not depend on it). That was actually Doug's idea (and I agree with it) to stop the build file if PMD complains about something. It's similar to testing -- if your tests fail, the entire build fails. I totally agree with it - but I want to switch it on for others to play with first, and when we agree on the rules we want to use, make it obligatory. So we start out committing it as an independent target, and then add it to the test target? Is that the plan? If so, +1. Doug
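For reference, a pmd target of the kind described above might look roughly like the sketch below. The taskdef class and ruleset file names are as documented by PMD; the jar path, output file, and source directories are assumptions, not necessarily what Nutch's build.xml actually uses.

```xml
<!-- Hedged sketch of an 'ant pmd' target; paths are illustrative assumptions. -->
<target name="pmd" depends="init">
  <taskdef name="pmd" classname="net.sourceforge.pmd.ant.PMDTask"
           classpath="pmd-ext/pmd.jar"/>
  <pmd rulesetfiles="rulesets/unusedcode.xml">
    <!-- Other rulesets can be enabled by extending the list above,
         e.g. rulesets/basic.xml,rulesets/imports.xml -->
    <formatter type="html" toFile="build/pmd-report.html"/>
    <fileset dir="src/java" includes="**/*.java"/>
    <fileset dir="src/plugin" includes="**/src/java/**/*.java"/>
  </pmd>
</target>
```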
Re: Add .settings to svn:ignore on root Nutch folder?
+1 - I offer my help - we can coordinate it and I can do a part of the work. I will also try to commit your patches quickly. Piotr On 4/6/06, Dawid Weiss [EMAIL PROTECTED] wrote: Other options (raised on the Hadoop list) are Checkstyle: PMD seems to be the best choice for an Apache project and they all seem to perform at a similar level. Anything that generates a lot of false positives is bad: it either causes us to skip analysis of lots of files, or ignore the warnings. Skipping the JavaCC-generated classes is reasonable, but I'm wary of skipping much else. I thought a bit about this. The warnings PMD raises may actually make sense to fix. Take a look at maxDoc here: class LuceneQueryOptimizer { private static class LimitExceeded extends RuntimeException { private int maxDoc; public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; } } ... maxDoc is accessed from LuceneQueryOptimizer, which requires a synthetic accessor in LimitExceeded. It also may look confusing because you declare a field private to a class, but use it from the outside... changing the declarations to something like this: class LuceneQueryOptimizer { private static class LimitExceeded extends RuntimeException { final int maxDoc; public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; } } ... removes the warning and also seems to make more sense (note that the package scope of maxDoc doesn't really expose it much more than before because the entire class is private). So... if you agree to change existing warnings as shown above (there are not that many) then integrating PMD with a set of sensible rules may help detect bad smells in the future (I couldn't resist -- it really is called that in software engineering :). I only used the dead code detection ruleset for now; other rulesets can be checked and we will see if they help or quite the contrary. If developers agree to the above, I'll create a patch together with what needs to be fixed to compile cleanly. Otherwise I see little sense in integrating PMD. D.
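The synthetic-accessor point above is easy to reproduce. The sketch below is a hypothetical, simplified stand-in for LuceneQueryOptimizer (the control flow and method names are invented): with the field declared package-private and final, the catch block reads maxDoc directly and the compiler no longer generates a synthetic accessor.

```java
// Hypothetical sketch of the pattern discussed above; not the real Nutch class.
public class OptimizerSketch {

    // Package-private final field: no synthetic accessor is generated, and
    // since the nested class itself is non-public, visibility barely changes.
    static class LimitExceeded extends RuntimeException {
        final int maxDoc;
        LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
    }

    // Collect up to 'limit' documents out of 'docs'; the exception carries
    // how far we got, mirroring how LimitExceeded is used for early exit.
    static int collectUpTo(int limit, int docs) {
        try {
            for (int doc = 0; doc < docs; doc++) {
                if (doc >= limit) throw new LimitExceeded(doc);
            }
            return docs;
        } catch (LimitExceeded e) {
            return e.maxDoc;   // direct field access, no accessor needed
        }
    }

    public static void main(String[] args) {
        System.out.println(collectUpTo(3, 10)); // prints 3
    }
}
```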
PMD integration (was: Re: Add .settings to svn:ignore on root Nutch folder?)
Hi, I have downloaded the patches and generally like them (I have only read them, not applied them yet). I have one question - am I reading it correctly that right now it checks only the main code (without plugins)? P. Dawid Weiss wrote: All right, I thought I'd give it a go since I have a spare few minutes. Jira is off, so I made the patches available here -- http://ophelia.cs.put.poznan.pl/~dweiss/nutch/ pmd.patch is the build file patch and libraries (binaries are in a separate zip file pmd-ext.zip). pmd-fixes.patch fixes the current core code to go through pmd smoothly. I removed obvious unused code, but left FIXME comments where I wasn't sure if the removal could cause side effects (in these places PMD warnings are suppressed with NOPMD comments). I also discovered a bug in PMD... eh... nothing's perfect. https://sourceforge.net/tracker/?func=detail&atid=479921&aid=1465574&group_id=56262 D. Piotr Kosiorowski wrote: +1 - I offer my help - we can coordinate it and I can do a part of the work. I will also try to commit your patches quickly. Piotr On 4/6/06, Dawid Weiss [EMAIL PROTECTED] wrote: Other options (raised on the Hadoop list) are Checkstyle: PMD seems to be the best choice for an Apache project and they all seem to perform at a similar level. Anything that generates a lot of false positives is bad: it either causes us to skip analysis of lots of files, or ignore the warnings. Skipping the JavaCC-generated classes is reasonable, but I'm wary of skipping much else. I thought a bit about this. The warnings PMD raises may actually make sense to fix. Take a look at maxDoc here: class LuceneQueryOptimizer { private static class LimitExceeded extends RuntimeException { private int maxDoc; public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; } } ... maxDoc is accessed from LuceneQueryOptimizer, which requires a synthetic accessor in LimitExceeded. It also may look confusing because you declare a field private to a class, but use it from the outside...
changing the declarations to something like this: class LuceneQueryOptimizer { private static class LimitExceeded extends RuntimeException { final int maxDoc; public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; } } ... removes the warning and also seems to make more sense (note that the package scope of maxDoc doesn't really expose it much more than before because the entire class is private). So... if you agree to change existing warnings as shown above (there are not that many) then integrating PMD with a set of sensible rules may help detect bad smells in the future (I couldn't resist -- it really is called that in software engineering :). I only used the dead code detection ruleset for now; other rulesets can be checked and we will see if they help or quite the contrary. If developers agree to the above, I'll create a patch together with what needs to be fixed to compile cleanly. Otherwise I see little sense in integrating PMD. D.
[jira] Closed: (NUTCH-239) I changed httpclient to use javax.net.ssl instead of com.sun.net.ssl
[ http://issues.apache.org/jira/browse/NUTCH-239?page=all ] Piotr Kosiorowski closed NUTCH-239: --- Fix Version: 0.7.2-dev Resolution: Fixed Assign To: Piotr Kosiorowski Applied with JavaDoc changes. Thanks. I changed httpclient to use javax.net.ssl instead of com.sun.net.ssl Key: NUTCH-239 URL: http://issues.apache.org/jira/browse/NUTCH-239 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.2-dev Environment: RedHat Enterprise Linux Reporter: Jake Vanderdray Assignee: Piotr Kosiorowski Priority: Trivial Fix For: 0.7.2-dev I made the following changes in order to get the dependency on com.sun.ssl out of the 0.7 branch. The same changes have already been applied to the 0.8 branch (Revision 379215) thanks to ab. There is still a dependency on using the Sun JRE. In order to get it to work with the IBM JRE I had to change SunX509 to IbmX509, but I didn't include that change in this patch. Thanks, Jake. Index: DummySSLProtocolSocketFactory.java === --- DummySSLProtocolSocketFactory.java (revision 388638) +++ DummySSLProtocolSocketFactory.java (working copy) @@ -22,8 +22,8 @@ import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; -import com.sun.net.ssl.SSLContext; -import com.sun.net.ssl.TrustManager; +import javax.net.ssl.SSLContext; +import javax.net.ssl.TrustManager; public class DummySSLProtocolSocketFactory implements ProtocolSocketFactory { Index: DummyX509TrustManager.java === --- DummyX509TrustManager.java (revision 388638) +++ DummyX509TrustManager.java (working copy) @@ -10,9 +10,9 @@ import java.security.cert.CertificateException; import java.security.cert.X509Certificate; -import com.sun.net.ssl.TrustManagerFactory; -import com.sun.net.ssl.TrustManager; -import com.sun.net.ssl.X509TrustManager; +import javax.net.ssl.TrustManagerFactory; +import javax.net.ssl.TrustManager; +import javax.net.ssl.X509TrustManager; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; @@ -57,4 +57,12 @@ 
public X509Certificate[] getAcceptedIssuers() { return this.standardTrustManager.getAcceptedIssuers(); } + +public void checkClientTrusted(X509Certificate[] arg0, String arg1) throws CertificateException { + // do nothing +} + +public void checkServerTrusted(X509Certificate[] arg0, String arg1) throws CertificateException { + // do nothing +} } -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
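For context, the javax.net.ssl flavor of this "trust everything" setup looks like the sketch below (class and method names here are invented; the real Nutch classes are the DummySSLProtocolSocketFactory and DummyX509TrustManager shown in the patch). Note that a no-op trust manager disables certificate validation entirely, which is only reasonable for a crawler that must fetch from hosts with broken certificates.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.cert.X509Certificate;

// Hypothetical sketch: a trust-all SSLSocketFactory built purely from
// javax.net.ssl, with no com.sun.net.ssl dependency.
public class TrustAllSketch {

    static class DummyTrustManager implements X509TrustManager {
        public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        public void checkClientTrusted(X509Certificate[] chain, String authType) { }
        public void checkServerTrusted(X509Certificate[] chain, String authType) { }
    }

    public static SSLSocketFactory trustAllFactory() {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            // null key managers and SecureRandom fall back to sensible defaults.
            ctx.init(null, new TrustManager[] { new DummyTrustManager() }, null);
            return ctx.getSocketFactory();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trustAllFactory() != null); // true
    }
}
```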
[jira] Closed: (NUTCH-94) MapFile.Writer throwing 'File exists error'.
[ http://issues.apache.org/jira/browse/NUTCH-94?page=all ] Piotr Kosiorowski closed NUTCH-94: -- Fix Version: 0.7.2-dev Resolution: Duplicate Assign To: Piotr Kosiorowski Duplicate of NUTCH-117. MapFile.Writer throwing 'File exists error'. Key: NUTCH-94 URL: http://issues.apache.org/jira/browse/NUTCH-94 Project: Nutch Type: Bug Components: fetcher Versions: 0.6 Environment: Server 2003, Resin, 1.4.2_05 Reporter: Michael Couck Assignee: Piotr Kosiorowski Fix For: 0.7.2-dev Running Nutch inside a server JVM or multiple times in the same JVM, MapFile.Writer doesn't get collected or closed by the WebDBWriter and the associated files and directories are not deleted, which consequently throws a 'File exists' error in the constructor of MapFile.Writer. It seems that this portion of code is very heavily integrated into Nutch and I am hesitant to look for a solution personally, as a retrofit will be necessary with every release. Has anyone got any ideas, had the same issue, any solutions? Regards Michael
[jira] Closed: (NUTCH-14) NullPointerException NutchBean.getSummary
[ http://issues.apache.org/jira/browse/NUTCH-14?page=all ] Piotr Kosiorowski closed NUTCH-14: -- Resolution: Cannot Reproduce Closed according to Stefan's suggestion NullPointerException NutchBean.getSummary - Key: NUTCH-14 URL: http://issues.apache.org/jira/browse/NUTCH-14 Project: Nutch Type: Bug Components: searcher Reporter: Stefan Groschupf Priority: Minor In heavy load scenarios this may happen when the connection breaks. java.lang.NullPointerException at java.util.Hashtable.get(Hashtable.java:333) at net.nutch.ipc.Client.getConnection(Client.java:276) at net.nutch.ipc.Client.call(Client.java:251) at net.nutch.searcher.DistributedSearch$Client.getSummary(DistributedSearch.java:418) at net.nutch.searcher.NutchBean.getSummary(NutchBean.java:236) at org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:396) at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:99) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:325) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245) at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:825) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:738) at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:526) at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) at java.lang.Thread.run(Thread.java:552)
[jira] Closed: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
[ http://issues.apache.org/jira/browse/NUTCH-117?page=all ] Piotr Kosiorowski closed NUTCH-117: --- Fix Version: 0.7.2-dev Resolution: Fixed Assign To: Piotr Kosiorowski Applied fix by Mike. Also reported off-list by Michal Karwanski. Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL - Key: NUTCH-117 URL: http://issues.apache.org/jira/browse/NUTCH-117 Project: Nutch Type: Bug Versions: 0.7.1, 0.7, 0.6 Environment: Window 2000 P4 1.70GHz 512MB RAM Java 1.5.0_05 Reporter: Stephen Cross Assignee: Piotr Kosiorowski Priority: Critical Fix For: 0.7.2-dev I started a crawl from the command line using nutch 0.7.1: nutch-daemon.sh start crawl urls.txt -dir oct18 -threads 4 -depth 20 After crawling for over 15 hours the crawl crashed with the following exception: 051019 050543 status: segment 20051019050438, 30 pages, 0 errors, 1589818 bytes, 48020 ms 051019 050543 status: 0.6247397 pages/s, 258.65167 kb/s, 52993.934 bytes/page 051019 050544 Updating C:\nutch\crawl.intranet\oct18\db 051019 050544 Updating for C:\nutch\crawl.intranet\oct18\segments\20051019050438 051019 050544 Processing document 0 051019 050544 Finishing update 051019 050544 Processing pagesByURL: Sorted 47 instructions in 0.02 seconds. 051019 050544 Processing pagesByURL: Sorted 2350.0 instructions/second Exception in thread main java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL at org.apache.nutch.io.MapFile$Writer.init(MapFile.java:86) at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549) at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544) at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321) at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371) at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141) This was on the 14th segment of the requested depth of 20. Doing a quick Google on the exception brings up a few previous posts with the same error but no definitive answer; it seems to have been occurring since nutch 0.6.
Site switched to branch-0.7.
Hi, I have updated the site in the 0.7 branch with the latest trunk changes. I have added both tutorials to the site so people will be aware of the differences. I have also committed the DOAP file in the 0.7 branch. The Nutch Website uses branch-0.7 now. Piotr
Nutch 0.7.2
Hello, I would like to release nutch 0.7.2 in a week or two. Some serious bugfixes are already covered and I have a plan to fix one or two more. I found an email from Doug with title [Fwd: Crawler submits forms?] stating: This has been fixed in the mapred branch, but that patch is not in 0.7.1. This alone might be a reason to make a 0.7.2 release. I just want to make sure it was fixed by svn commit: r348533 Fix to not extract urls whose method=post.. I think this was the fix but I wanted to make sure before committing. Any objections against the plan? Piotr
[jira] Closed: (NUTCH-225) Changed the links to the tutorial to point to the wiki
[ http://issues.apache.org/jira/browse/NUTCH-225?page=all ] Piotr Kosiorowski closed NUTCH-225: --- Resolution: Won't Fix I have just updated the Nutch Web site. It now contains both tutorials (for 0.7 and 0.8). I have also added a note to each tutorial stating that more detailed tutorials are available on the Nutch Wiki. Changed the links to the tutorial to point to the wiki -- Key: NUTCH-225 URL: http://issues.apache.org/jira/browse/NUTCH-225 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Jake Vanderdray This is a patch to repoint tutorial links on the nutch site to the wiki. Index: site.xml === --- site.xml (revision 384005) +++ site.xml (working copy) @@ -26,7 +26,7 @@ <docs label="Documentation"> <faq label="FAQ" href="ext:faq" /> <wiki label="Wiki" href="ext:wiki" /> -<tutorial label="Tutorial" href="tutorial.html" /> +<tutorial label="Tutorial" href="ext:tutorial" /> <webmasters label="Robot" href="bot.html" /> <i18n label="i18n" href="i18n.html" /> <apidocs label="API Docs" href="apidocs/index.html" /> @@ -48,6 +48,7 @@ <wiki href="http://wiki.apache.org/nutch/" /> <faq href="http://wiki.apache.org/nutch/FAQ" /> <store href="http://www.cafepress.com/nutch/" /> +<tutorial href="http://wiki.apache.org/nutch/NutchTutorial" /> </external-refs> </site> Index: i18n.xml === --- i18n.xml (revision 384005) +++ i18n.xml (working copy) @@ -188,7 +188,7 @@ href="http://jakarta.apache.org/tomcat/">Tomcat</a> installed.</p> <p>An index is also required. You can collect your own by working -through the <a href="http://lucene.apache.org/nutch/tutorial.html">tutorial</a>. +through the <a href="http://wiki.apache.org/nutch/NutchTutorial">tutorial</a>. Once you have an index, follow the steps outlined at the end of the tutorial for searching.</p>
Re: Tutorial
Oops, sorry for ignoring this discussion - I was looking for comments in JIRA and had already committed the change before reading your discussion. My motivation is to have a usable version of the tutorial - as simple as possible - versioned with the sources, if only for historical purposes: if somebody wants to use nutch 0.7 a year from now he will be able to find a tutorial for it without problems. But for more advanced stuff I fully support the Wiki. I will wait for other committers' opinions before doing anything. Jeff Ritchie wrote: +1 Site tutorial links pointing to wiki tutorials is the best option. Jeff. Richard Braman wrote: +1. No need for 2 tutorials. The only discrepancy I saw was the invertlinks command, which is not in 0.7. I updated the wiki to note that that command only applies to 0.8 -Original Message- From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 08, 2006 9:30 AM To: nutch-dev@lucene.apache.org Subject: Tutorial This is in response to Piotr's comment on my JIRA entry (http://issues.apache.org/jira/browse/NUTCH-225). I haven't been subscribed to this list, so I'm afraid I missed the discussion about the tutorial that went on here. After getting Piotr's comment I went to the archive and read the earlier thread about the tutorial. Here's what I understand: * The tutorial necessarily differs between the 0.7 and the 0.8 branches and this needs to be reflected on the web site by having both tutorials up there. * Some users have requested that the tutorial be moved to the wiki so that it can be more easily edited and updated. In recognition of this I went ahead and added it to the wiki and made some edits based on input from people who were confused about the use of Intranet Crawl as a label. I now realize this needs to be edited some more to indicate that it is the tutorial for the 0.7 branch. I'll do that in a bit.
* Piotr wants the existing tutorials (both the one for 0.7 and the one for 0.8) on the web site as simple versions while copies get put on the wiki and become more advanced versions. In an effort to clear things up and move ahead, can we just do a quick vote on the last point? I'd propose moving both tutorials to the wiki and updating the links on the site to reflect that. I don't think keeping two copies of each tutorial up to date is going to be manageable. I suspect that one is going to go stale and having multiple copies (even if one is shorter than the other) is just going to confuse users. Thanks, Jake.
[jira] Closed: (NUTCH-91) empty encoding causes exception
[ http://issues.apache.org/jira/browse/NUTCH-91?page=all ] Piotr Kosiorowski closed NUTCH-91: -- Fix Version: 0.7.2-dev 0.8-dev Resolution: Fixed Committed with a small extension. Thanks. empty encoding causes exception --- Key: NUTCH-91 URL: http://issues.apache.org/jira/browse/NUTCH-91 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Michael Nebel Fix For: 0.7.2-dev, 0.8-dev I found some sites where the header says: Content-Type: text/html; charset=. This causes an exception in the HtmlParser. My suggestion: Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java === --- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (revision 279397) +++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (working copy) @@ -120,7 +120,7 @@ byte[] contentInOctets = content.getContent(); InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets)); String encoding = StringUtil.parseCharacterEncoding(contentType); - if (encoding!=null) { + if (encoding!=null && !"".equals(encoding)) { metadata.put(OriginalCharEncoding, encoding); if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) { metadata.put(CharEncodingForConversion, encoding);
[jira] Commented: (NUTCH-225) Changed the links to the tutorial to point to the wiki
[ http://issues.apache.org/jira/browse/NUTCH-225?page=comments#action_12369405 ] Piotr Kosiorowski commented on NUTCH-225: - As stated in another thread, I prefer to have a simple tutorial kept in version control with releases. We already have a link to the Wiki on the Nutch Web site, so users have the possibility to find detailed tutorials. So -1 from me. If there are no objections I will close this issue. Changed the links to the tutorial to point to the wiki -- Key: NUTCH-225 URL: http://issues.apache.org/jira/browse/NUTCH-225 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Jake Vanderdray This is a patch to repoint tutorial links on the nutch site to the wiki. Index: site.xml === --- site.xml (revision 384005) +++ site.xml (working copy) @@ -26,7 +26,7 @@ <docs label="Documentation"> <faq label="FAQ" href="ext:faq" /> <wiki label="Wiki" href="ext:wiki" /> -<tutorial label="Tutorial" href="tutorial.html" /> +<tutorial label="Tutorial" href="ext:tutorial" /> <webmasters label="Robot" href="bot.html" /> <i18n label="i18n" href="i18n.html" /> <apidocs label="API Docs" href="apidocs/index.html" /> @@ -48,6 +48,7 @@ <wiki href="http://wiki.apache.org/nutch/" /> <faq href="http://wiki.apache.org/nutch/FAQ" /> <store href="http://www.cafepress.com/nutch/" /> +<tutorial href="http://wiki.apache.org/nutch/NutchTutorial" /> </external-refs> </site> Index: i18n.xml === --- i18n.xml (revision 384005) +++ i18n.xml (working copy) @@ -188,7 +188,7 @@ href="http://jakarta.apache.org/tomcat/">Tomcat</a> installed.</p> <p>An index is also required. You can collect your own by working -through the <a href="http://lucene.apache.org/nutch/tutorial.html">tutorial</a>. +through the <a href="http://wiki.apache.org/nutch/NutchTutorial">tutorial</a>. Once you have an index, follow the steps outlined at the end of the tutorial for searching.</p>
Nutch web site
Hi, It looks like the Nutch web site was updated with a site built from latest trunk - the only problem is it contains the tutorial for the unreleased (yet) version 0.8. I think we talked about it and agreed to keep the tutorial for the latest release on the Web. I have just updated the site in svn (branch-0.7) with the latest changes (forrest 0.7 compatibility and mailing list archives) and rebuilt it using forrest 0.7. If there are no objections, I can switch the web site to use the version from the branch instead of trunk. Regards Piotr
Re: Nutch web site
Andrzej Bialecki wrote: +1, yes it would be really confusing. Since there are more and more people trying 0.8, could we perhaps include a short note that 0.8 and later is NOT compatible with this tutorial, and a reference to the tutorial for 0.8 (or the trunk/branch in general)? I can add both tutorials to the Nutch web site, named "Tutorial for 0.7 version" and "Tutorial for 0.8 version". It should make things clear. Anyone against it? Piotr
[jira] Commented: (NUTCH-79) Fault tolerant searching.
[ http://issues.apache.org/jira/browse/NUTCH-79?page=comments#action_12364496 ] Piotr Kosiorowski commented on NUTCH-79: I think it should work without the changes I suggested in the previous comment - they would simply be useful additions. I have not used it for quite a while, so I would get back to it to make sure it works with the latest code (I hope sooner rather than later) - but no promises at the moment. Fault tolerant searching. - Key: NUTCH-79 URL: http://issues.apache.org/jira/browse/NUTCH-79 Project: Nutch Type: New Feature Components: searcher Reporter: Piotr Kosiorowski Attachments: patch I have finally managed to prepare the first version of the fault tolerant searching I promised a long time ago. It reads the server configuration from the search-groups.txt file (in the startup directory or the directory specified by searcher.dir) if no search-servers.txt file is present. If search-servers.txt is present, it would be read and handled as previously. --- Format of search-groups.txt: * pre * search.group.count=[int] * search.group.name.[i]=[string] (for i=0 to count-1) * * For each name: * [name].part.count=[int] partitionCount * [name].part.[i].host=[string] (for i=0 to partitionCount-1) * [name].part.[i].port=[int] (for i=0 to partitionCount-1) * * Example: * search.group.count=2 * search.group.name.0=master * search.group.name.1=backup * * master.part.count=2 * master.part.0.host=host1 * master.part.0.port= * master.part.1.host=host2 * master.part.1.port= * * backup.part.count=2 * backup.part.0.host=host3 * backup.part.0.port= * backup.part.1.host=host4 * backup.part.1.port= * /pre. If more than one search group is defined in the configuration file, requests are distributed among the groups in round-robin fashion. If one of the servers from a group fails to respond, the whole group is treated as inactive and removed from the pool used to distribute requests. There is a separate recovery thread that every searcher.recovery.delay seconds (default 60) checks whether inactive groups became alive again and, if so, adds them back to the pool of active groups.
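The round-robin selection and failover behaviour described above can be sketched as follows. All names here are invented for illustration; the real implementation lives in the DistributedSearch client attached to the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the group pool described in NUTCH-79:
// round-robin over active groups, whole-group failover, later recovery.
public class SearchGroupPool {
    private final List<String> activeGroups = new ArrayList<>();
    private final List<String> inactiveGroups = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    public SearchGroupPool(List<String> groups) {
        activeGroups.addAll(groups);
    }

    // Pick the next group in round-robin order, or null if none is active.
    public synchronized String nextGroup() {
        if (activeGroups.isEmpty()) return null;
        int i = Math.floorMod(next.getAndIncrement(), activeGroups.size());
        return activeGroups.get(i);
    }

    // Called when any server in a group fails: the whole group goes inactive.
    public synchronized void markFailed(String group) {
        if (activeGroups.remove(group)) inactiveGroups.add(group);
    }

    // Called by the recovery thread when a group responds again.
    public synchronized void markRecovered(String group) {
        if (inactiveGroups.remove(group)) activeGroups.add(group);
    }
}
```

A periodic recovery thread (every searcher.recovery.delay seconds in the patch) would probe each inactive group and call markRecovered on success.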
[jira] Closed: (NUTCH-45) Log corrupt segments in SegmentMergeTool
[ http://issues.apache.org/jira/browse/NUTCH-45?page=all ] Piotr Kosiorowski closed NUTCH-45: -- Fix Version: 0.7.2-dev Resolution: Fixed Applied. Thanks. Log corrupt segments in SegmentMergeTool Key: NUTCH-45 URL: http://issues.apache.org/jira/browse/NUTCH-45 Project: Nutch Type: Improvement Reporter: Otis Gospodnetic Priority: Trivial Fix For: 0.7.2-dev Attachments: SegmentMergeTool.patch Just added a LOG.warning line when corrupt segments are encountered, otherwise they just get skipped silently.
[jira] Closed: (NUTCH-174) Problem encountered with ant during compilation
[ http://issues.apache.org/jira/browse/NUTCH-174?page=all ] Piotr Kosiorowski closed NUTCH-174: --- Fix Version: 0.7.2-dev 0.8-dev Resolution: Fixed Fixed some time ago during preparation of the 0.7.2 release. Please use the version from the SVN branch-0.7. Problem encountered with ant during compilation --- Key: NUTCH-174 URL: http://issues.apache.org/jira/browse/NUTCH-174 Project: Nutch Type: Bug Versions: 0.7.1 Environment: Suse Linux 9.3 Reporter: Matthias Günter Priority: Trivial Fix For: 0.8-dev, 0.7.2-dev There is a directory missing which causes ant to fail. Error message: BUILD FAILED /home/guenter/workspace/lucene/nutch-0.7.1/build.xml:76: The following error occurred while executing this line: /home/guenter/workspace/lucene/nutch-0.7.1/src/plugin/build.xml:9: The following error occurred while executing this line: /home/guenter/workspace/lucene/nutch-0.7.1/src/plugin/build-plugin.xml:85: srcdir /home/guenter/workspace/lucene/nutch-0.7.1/src/plugin/nutch-extensionpoints/src/java does not exist! Compilation worked when I omitted line 9 in nutch-0.7.1/src/plugin/build.xml: <!-- <ant dir="nutch-extensionpoints" target="deploy"/> --> However, I guess that is not what was intended.
Re: test suite fails?
It fails on my machine on the parse-ext tests. I am not sure what is causing it yet and I am afraid I do not have time to investigate it today - maybe in a few days. I did a small change to make it compile a few days ago, but all tests went ok before I committed it. Regards Piotr Stefan Groschupf wrote: Hi, is anyone able to run the test suite without any problems? Stefan --- company: http://www.media-style.com forum: http://www.text-mining.org blog: http://www.find23.net
Re: no static NutchConf
+1 in general. In fact I like the approach presented by Stefan of passing only the required parameters to objects that have a small number of configurable params, instead of NutchConf - it makes it obvious which parameters are required for such basic objects to run, and as they are usually building blocks for something bigger it makes it easier to reuse them with different params in different parts of the code. But I like the direction and will not oppose passing the whole NutchConf in this case. Regards Piotr
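The two styles being compared above can be illustrated as follows. All names are hypothetical, and a plain map stands in for NutchConf; the point is only the difference in what the constructor demands.

```java
import java.util.Map;

// Hypothetical contrast of the two configuration-passing styles discussed above.
public class ConfStyles {

    // Style 1: pass the whole configuration object. Convenient, but the
    // class's real dependencies are hidden inside the constructor body.
    static class FetcherA {
        final int threads;
        FetcherA(Map<String, String> conf) {
            this.threads = Integer.parseInt(conf.getOrDefault("fetcher.threads", "10"));
        }
    }

    // Style 2: pass only what the object needs. Dependencies are explicit,
    // and the class is trivially reusable with different values elsewhere.
    static class FetcherB {
        final int threads;
        FetcherB(int threads) {
            this.threads = threads;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of("fetcher.threads", "4");
        System.out.println(new FetcherA(conf).threads + " " + new FetcherB(4).threads);
    }
}
```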
Re: svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/
Andrzej, Do you think it would be a good idea to commit it in 0.7 branch for 0.7.2 release? I personally prefer to use released libraries instead of RC if possible. It does not require a lot of changes and you have already tested it with existing code... Piotr [EMAIL PROTECTED] wrote: Author: ab Date: Tue Jan 3 23:32:04 2006 New Revision: 365850 URL: http://svn.apache.org/viewcvs?rev=365850view=rev Log: Update Commons HTTPClient to v. 3.0. Add some default headers to prefer HTML content, and in English.
[jira] Closed: (NUTCH-142) NutchConf should use the thread context classloader
[ http://issues.apache.org/jira/browse/NUTCH-142?page=all ] Piotr Kosiorowski closed NUTCH-142: --- Fix Version: 0.7.2-dev 0.8-dev Resolution: Fixed NutchConf should use the thread context classloader --- Key: NUTCH-142 URL: http://issues.apache.org/jira/browse/NUTCH-142 Project: Nutch Type: Improvement Versions: 0.7 Reporter: Mike Cannon-Brookes Fix For: 0.7.2-dev, 0.8-dev Right now NutchConf uses its own static classloader, which is _evil_ in a J2EE scenario. This is simply fixed. Line 52: private ClassLoader classLoader = NutchConf.class.getClassLoader(); Should be: private ClassLoader classLoader = Thread.currentThread().getContextClassLoader(); This means that no matter where Nutch classes are loaded from, it will use the correct J2EE classloader to try to find configuration files (i.e. from WEB-INF/classes). -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
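A minimal sketch of the idea behind the fix (the class name here is hypothetical; the actual NUTCH-142 change is the one-liner quoted in the issue, and the null fallback below is an extra safety net, not part of the reported patch):

```java
// In a servlet container, the thread context classloader can see
// WEB-INF/classes, which the classloader that defined this class may not.
public class ResourceLocator {
    private final ClassLoader classLoader;

    public ResourceLocator() {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        // Fall back to the defining classloader when no context loader is set
        // (e.g. plain threads started outside a container).
        this.classLoader = (cl != null) ? cl : ResourceLocator.class.getClassLoader();
    }

    // Resolve a configuration resource through the chosen classloader.
    public java.net.URL find(String name) {
        return classLoader.getResource(name);
    }
}
```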
[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361520 ] Piotr Kosiorowski commented on NUTCH-138: - I am not sure, but I would suspect it is a problem of bad Tomcat configuration. To handle special characters in query URLs one has to change the default Tomcat configuration - especially the URIEncoding attribute, to UTF8. See: http://tomcat.apache.org/faq/connectors.html#utf8 Please check if it helps in your particular case so we can close the issue. non-Latin-1 characters cannot be submitted for search - Key: NUTCH-138 URL: http://issues.apache.org/jira/browse/NUTCH-138 Project: Nutch Type: Bug Components: web gui Versions: 0.7.1 Environment: Windows XP, Tomcat 5.5.12 Reporter: KuroSaka TeruHiko Priority: Minor The search.html currently specifies the GET method for query submission. Tomcat 5.x only allows the ISO-8859-1 (aka Latin-1) code set to be submitted over GET because of some restrictions of the HTML or HTTP spec they discovered. (If my memory is correct, non-ISO-8859-1 characters were working OK over GET with older versions of Tomcat as long as setCharacterEncoding() is called properly.) To allow proper transmission of non-ISO-8859-1, the POST method should be used. Here's a proposed patch: *** search.html Tue Dec 13 15:02:15 2005 --- search-org.html Tue Dec 13 15:02:07 2005 *** 59,65 **** </span><span class="bodytext"> <center> ! <form name="search" action="../search.jsp" method="post"> <input name="query" size="44">&nbsp;<input type="submit" value="Search"> <a href="help.html">help</a> --- 59,65 ---- </span><span class="bodytext"> <center> ! <form name="search" action="../search.jsp" method="get"> <input name="query" size="44">&nbsp;<input type="submit" value="Search"> <a href="help.html">help</a> BTW, I am aware that Nutch and Lucene won't handle non-Western languages well as packaged. -- This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Closed: (NUTCH-138) non-Latin-1 characters cannot be submitted for search
[ http://issues.apache.org/jira/browse/NUTCH-138?page=all ] Piotr Kosiorowski closed NUTCH-138: --- Resolution: Invalid Setting URIEncoding in the Tomcat config file fixes the problem. non-Latin-1 characters cannot be submitted for search - Key: NUTCH-138 URL: http://issues.apache.org/jira/browse/NUTCH-138 Project: Nutch Type: Bug Components: web gui Versions: 0.7.1 Environment: Windows XP, Tomcat 5.5.12 Reporter: KuroSaka TeruHiko Priority: Minor -- This message is automatically generated by JIRA.
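For reference, the Tomcat setting referred to above is the URIEncoding attribute on the HTTP connector in conf/server.xml. The port and the other attributes below are just placeholders from a typical default config, not values mandated by this fix:

```xml
<!-- conf/server.xml: decode GET query-string parameters as UTF-8
     instead of the ISO-8859-1 default -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           URIEncoding="UTF-8" />
```

With this set, non-Latin-1 query terms submitted via GET reach search.jsp intact, so the form does not need to be switched to POST.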
[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361549 ] Piotr Kosiorowski commented on NUTCH-138: - BTW - just create a user for yourself in the Nutch wiki and you should be able to add a new page with the information without problems. Thanks for checking and documenting it. non-Latin-1 characters cannot be submitted for search - Key: NUTCH-138 URL: http://issues.apache.org/jira/browse/NUTCH-138 Project: Nutch Type: Bug Components: web gui Versions: 0.7.1 Environment: Windows XP, Tomcat 5.5.12 Reporter: KuroSaka TeruHiko Priority: Minor -- This message is automatically generated by JIRA.
Re: Mega-cleanup in trunk/
Andrzej Bialecki wrote: Hi, I just committed a large patch to clean up the trunk/ of obsolete and broken classes remaining from the 0.7.x development line. Please test that things still work as they should ... Hi, I am not sure what is wrong, but a lot of the JUnit tests simply do not compile - I did an svn checkout to a new directory to be sure I did not have anything left over from my experiments. I am looking at it right now, but I would suggest a quick temporary cleanup to make trunk testable: 1) Remove permanently - as the classes under test are removed in trunk: src/test/org/apache/nutch/pagedb/TestFetchListEntry.java src/test/org/apache/nutch/pagedb/TestPage.java src/test/org/apache/nutch/db/TestWebDB.java src/test/org/apache/nutch/db/DBTester.java src/test/org/apache/nutch/tools/TestSegmentMergeTool.java 2) Remove temporarily and create a JIRA issue to fix them: src/test/org/apache/nutch/fetcher/TestFetcher.java src/test/org/apache/nutch/fetcher/TestFetcherOutput.java 3) Remove an unused import in: src/test/org/apache/nutch/parse/TestParseText.java 4) Fix (as it looks simple to fix - I will look at it in the meantime): src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword/TestMSWordParser.java src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java src/plugin/parse-rss/src/test/org/apache/nutch/parse/rss/TestRSSParser.java src/plugin/parse-pdf/src/test/org/apache/nutch/parse/pdf/TestPdfParser.java src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/TestMSPowerPointParser.java src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/AllTests.java After removal of all these non-compiling classes, the tests in trunk complete successfully on my machine (JDK 1.4.2). If no objections are raised - especially from Andrzej - I can do the cleanup tomorrow. P.
Re: how to add additional factor at search time to ranking score
AJ Chen wrote: It would be great if I could add some new functions to the Nutch code to accomplish this. But if it requires customizing the Lucene code, that's fine. I have tried to use the most recent release (1.4.3) of the Lucene source code, but it did not work. Are the Lucene jar files included in the Nutch release (0.7.1) very different from Lucene 1.4.3? If yes, is it possible to get the source code for the Lucene used in Nutch? Nutch uses Lucene 1.9 (not yet released) - built from the Lucene trunk. Simply grab the sources from the Lucene trunk and Nutch should work fine with them. P.
[jira] Commented: (NUTCH-142) NutchConf should use the thread context classloader
[ http://issues.apache.org/jira/browse/NUTCH-142?page=comments#action_12361492 ] Piotr Kosiorowski commented on NUTCH-142: - Thanks. Fixed in the 0.7 branch. Left open to fix it in trunk after cleaning up the trunk JUnit test problems (in the next few days). NutchConf should use the thread context classloader --- Key: NUTCH-142 URL: http://issues.apache.org/jira/browse/NUTCH-142 Project: Nutch Type: Improvement Versions: 0.7 Reporter: Mike Cannon-Brookes -- This message is automatically generated by JIRA.
[jira] Closed: (NUTCH-42) enhance search.jsp such that it can also returns XML
[ http://issues.apache.org/jira/browse/NUTCH-42?page=all ] Piotr Kosiorowski closed NUTCH-42: -- Fix Version: 0.7.2-dev 0.8-dev Resolution: Fixed OpenSearch implemented. enhance search.jsp such that it can also returns XML Key: NUTCH-42 URL: http://issues.apache.org/jira/browse/NUTCH-42 Project: Nutch Type: Wish Components: web gui Reporter: Michael Wechner Priority: Trivial Fix For: 0.7.2-dev, 0.8-dev Attachments: NutchRssSearch.zip, NutchRssSearch.zip, search.jsp.diff, search.jsp.diff Enhance search.jsp such that by specifying a parameter format=xml the JSP will return an XML, whereas if no format is being specified then it will return HTML -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates
[ http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361206 ] Piotr Kosiorowski commented on NUTCH-148: - The 'df' command is required for NDFS operation, so if you were not using NDFS and the nutch shell scripts in 0.7.1 you were able to run it on Windows without cygwin. Now the majority of tools use NDFS, so cygwin is required on Windows. I would assume the other bug is also cygwin related - please test it with cygwin and report if that fixed the issue. In the future, in case of doubt, it is better to ask on the nutch-user mailing list rather than create a JIRA issue first. I will close both your issues now, assuming they are cygwin related. If you find that it still does not work with cygwin, please reopen. org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates -- Key: NUTCH-148 URL: http://issues.apache.org/jira/browse/NUTCH-148 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Environment: Windows XP Home Reporter: raghavendra prabhu I get the following error while running org.apache.nutch.tools.CrawlTool The error actually is in deleteduplicates 51223 001121 Reading url hashes... 051223 001121 Sorting url hashes... 051223 001121 Deleting url duplicates... 051223 001121 Error moving bad file G:\apache-tomcat-5.5.12\webapps\crux\WEB-INF\classes\ddup-workingdir\ddup-20051223001121: java.io.IOException: CreateProcess: df -k G:\apache-tomcat-5.5.12\webapps\crux\WEB-INF\classes\ddup-workingdir\ddup-20051223001121 error=2 It throws the error here in NFSDataInputStream.java The exception is org.apache.nutch.fs.ChecksumException: Checksum error: G:\apache-tomcat-5.5.12\webapps\crux\WEB-INF\classes\ddup-workingdir\ddup-20051223001121 at 0 -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Closed: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates
[ http://issues.apache.org/jira/browse/NUTCH-148?page=all ] Piotr Kosiorowski closed NUTCH-148: --- Resolution: Invalid org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates -- Key: NUTCH-148 URL: http://issues.apache.org/jira/browse/NUTCH-148 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Environment: Windows XP Home Reporter: raghavendra prabhu -- This message is automatically generated by JIRA.
[jira] Closed: (NUTCH-147) nutch map reduce does not work in windows map reduce runs in a loop
[ http://issues.apache.org/jira/browse/NUTCH-147?page=all ] Piotr Kosiorowski closed NUTCH-147: --- Resolution: Invalid The cygwin requirement on Windows is listed in the Nutch tutorial. Please reopen if problems persist after running it from the cygwin environment. nutch map reduce does not work in windows map reduce runs in a loop --- Key: NUTCH-147 URL: http://issues.apache.org/jira/browse/NUTCH-147 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Environment: Windows system Winxp Pro Reporter: raghavendra prabhu Priority: Blocker Description Crawl starts and I am able to see the initial messages Then the map reduce process starts and it continues to run in a loop I do not find the same problem in linux (in linux it works perfectly) Below is the loop I run into clustering.OnlineClusterer) 051222 182058 Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter) 051222 182058 Nutch Content Parser (org.apache.nutch.parse.Parser) 051222 182058 Ontology Model Loader (org.apache.nutch.ontology.Ontology) 051222 182058 Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer) 051222 182058 Nutch Query Filter (org.apache.nutch.searcher.QueryFilter) 051222 182058 found resource crawl-urlfilter.txt at file:/G:/trunklatest/conf/crawl-urlfilter.txt 051222 182058 crawl\url.txt:0+25 051222 182059 crawl\url.txt:0+25 051222 182059 map -521216% 051222 182100 crawl\url.txt:0+25 051222 182100 map -1107496% 051222 182101 crawl\url.txt:0+25 051222 182101 map -1678544% 051222 182102 crawl\url.txt:0+25 051222 182102 map -2265900% 051222 182103 crawl\url.txt:0+25 051222 182103 map -2849416% 051222 182104 crawl\url.txt:0+25 051222 182104 map -3422908% 051222 182105 crawl\url.txt:0+25 The same thing continues -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates
[ http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361128 ] Piotr Kosiorowski commented on NUTCH-148: - Do you have Cygwin installed? Is 'df' working in your cygwin installation? Do you run the crawl from the cygwin shell? Nutch requires cygwin on Windows. org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates -- Key: NUTCH-148 URL: http://issues.apache.org/jira/browse/NUTCH-148 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Environment: Windows XP Home Reporter: raghavendra prabhu -- This message is automatically generated by JIRA.
Re: [VOTE] Committer access for Stefan Groschupf
+1 - especially for the amount of support Stefan gives to Nutch users. P. Andrzej Bialecki wrote: Hi, During the past year and more, Stefan participated actively in the development and contributed many high-quality patches. He's been spending considerable effort on addressing many issues in JIRA, and proposing fixes and improvements. Apparently he has too much free time on his hands, and it's best to catch him now, before he realizes that there are other ways of spending time than hacking Nutch code... ;-) So, I'd like to call for a vote on adding Stefan as a committer.
Re: svn commit: r357334 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/protocol/Content.java src/java/org/apache/nutch/protocol/ContentProperties.java
Doug Cutting wrote: [EMAIL PROTECTED] wrote: +/* + * (non-Javadoc) + * + * @see org.apache.nutch.io.Writable#write(java.io.DataOutput) + */ +public final void write(DataOutput out) throws IOException { We should either include javadoc or not. In general, all public methods should have javadoc. In this case, since this is implementing an interface method, if no Javadoc comment is added, then the interface's will be used. That would be preferable. Frequently in this case folks add a comment like: // javadoc inherited Doug Doug, It is not a JavaDoc comment, as it does not start with /** - it has exactly the effect you mentioned - the JavaDoc will be inherited - in fact Eclipse generates such a comment automatically. In my opinion both versions (// javadoc inherited and the committed one) are OK, and I have no preference towards either of them. Regards, Piotr
JUnit test failures
Hi, I have problems with the JUnit tests in the trunk and mapred branches. TestFetcher fails in both branches. The same test executes correctly in the 0.7 branch. Is it only my problem (environment setup) or are others having it too? I would suspect some change in redirect handling. Regards Piotr
Re: [Fwd: Crawler submits forms?]
Doug Cutting wrote: Andrzej Bialecki wrote: Please also don't forget that the trunk/ will soon be invaded by the code from mapred, I guess some time around the middle of January (Doug?) Thinking about this more, perhaps we should do it sooner. There's already a branch for 0.7.x releases, so what point is there in not merging mapred to trunk now? We'd have fewer branches to maintain, and start getting nightly builds of mapred. Folks who require 0.7.x compatibility can continue to use (and patch) the 0.7.x branch. Objections? Doug +1. Looking at the questions on mailing lists I do not think many people use trunk now. Piotr
Re: Lucene performance bottlenecks
Hi, Some time ago I started to think about implementing a special kind of Lucene Query (if I remember correctly I would have to write my own Scorer and probably a few other classes) optimized for Nutch. I assumed that with a specialized query I would be able to avoid accessing some of the Lucene index structures multiple times, as the same term appears many times in the query generated by Nutch for multi-token queries. I am not a Lucene expert, but maybe it is worth checking whether it might give some performance boost. Does anyone have ideas on why it might help or not? Regards, Piotr
Re: Urlfilter Patch
Jérôme Charron wrote: [...] build a list of file extensions to include (other ones will be excluded) in the fetch process. [...] I would not like to exclude all others - for example, many extensions are valid for HTML - especially dynamically generated pages (jsp, asp, cgi, just to name the easy ones, and a lot of custom ones). But the idea of automatically allowing extensions for which plugins are enabled is good in my opinion. Anyway, I will try to find my own list of forbidden extensions, which I prepared based on 80 million URLs - I just made a list of the most common extensions and went through it manually. I will try to find it over the weekend so we can combine it with the list discussed in this thread. P.
Re: Performance issues with ConjunctionScorer
On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I've been profiling a Nutch installation, and to my surprise the largest amount of throwaway allocations and the most time spent was not in Nutch-specific code, or IPC, but in the Lucene ConjunctionScorer.doNext() method. This method operates on a LinkedList, which seems to be a huge bottleneck. Perhaps it would be possible to replace the LinkedList with a table? I had exactly the same findings some time ago and even replaced the LinkedList with a table and started to prepare the patch and summarize my findings, when at the same time this subject was raised on the Lucene mailing list with a patch - doing exactly the same thing. I cannot find the link to the thread right now - but as far as I remember it is already committed in the SVN trunk. Regards Piotr
Re: Performance issues with ConjunctionScorer
You are right - it is still not committed, but the patch is here: http://issues.apache.org/jira/browse/LUCENE-443. During tests of my patch - which was very, very similar to this one - I saw up to a 5% performance increase. But it will probably mainly result in nicer GC behaviour. Piotr On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Piotr Kosiorowski wrote: On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I've been profiling a Nutch installation, and to my surprise the largest amount of throwaway allocations and the most time spent was not in Nutch-specific code, or IPC, but in the Lucene ConjunctionScorer.doNext() method. This method operates on a LinkedList, which seems to be a huge bottleneck. Perhaps it would be possible to replace the LinkedList with a table? I had exactly the same findings some time ago and even replaced the LinkedList with a table and started to prepare the patch and summarize my findings, when at the same time this subject was raised on the Lucene mailing list with a patch - doing exactly the same thing. I cannot find the link to the thread right now - but as far as I remember it is already committed in the SVN trunk. Can't be - I'm working with the latest revision of Lucene from trunk/ -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
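The pattern behind the LinkedList-to-table replacement discussed in this thread can be sketched as follows. This is an illustration of the general optimization, not Lucene's actual ConjunctionScorer code or the LUCENE-443 patch:

```java
import java.util.List;

// A hot loop that repeatedly walks the same sequence can copy the list into
// an array once, then use cheap indexed access; a LinkedList would allocate
// an iterator per pass and pay pointer-chasing costs on every element.
public class ScorerWalk {
    static long sumManyPasses(List<Integer> scores, int passes) {
        // One-time copy into a table; done outside the hot loop.
        Integer[] arr = scores.toArray(new Integer[0]);
        long total = 0;
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < arr.length; i++) {
                total += arr[i]; // no iterator allocation, contiguous access
            }
        }
        return total;
    }
}
```

Besides the raw speedup, eliminating the per-pass iterator allocations is what produces the "nicer GC behaviour" mentioned above.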
[jira] Closed: (NUTCH-99) ports are hardcoded or random
[ http://issues.apache.org/jira/browse/NUTCH-99?page=all ] Piotr Kosiorowski closed NUTCH-99: -- Resolution: Fixed Patch committed. Thanks Stefan. ports are hardcoded or random - Key: NUTCH-99 URL: http://issues.apache.org/jira/browse/NUTCH-99 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Attachments: port_patch_04.txt, port_patch.txt, port_patch_02.txt, port_patch_03.txt The ports of the tasktracker are random, and the port of the datanode is hardcoded to 7000 as the starting port. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: suspicious outlink count
EM wrote: 202443 Pages consumed: 13 (at index 13). Links fetched: 233386. 202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/]. 202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315. If there is maxoutlinks already specified in the xml config, why does nutch bother counting anything over that again? During PageRank computation, Nutch retrieves all links from a given page by MD5. If we have many pages with the same MD5, it can retrieve all outlinks from those pages - I saw some bot traps with big site structures that had exactly the same MD5 (once I had over a million identical pages in my index, with different URLs from the same host). So in this case we are getting the union of all such outlinks. In some situations having a big number of outlinks is not a problem (like in your case - all pages injected from dmoz are outlinks from dmoz) - but usually it indicates some problems in your index, or at least a reason to look at it. So I have decided to print a warning in this case so one can have a look at such a site. Regards Piotr
Re: to many hdd reads
Committed in trunk and branch-0.7 (just in case we decide to make a 0.7.2 release sometime). Thanks Piotr On 10/11/05, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, don't think I'm a fuddy-duddy, but is it really sensible to do the following in the NutchBean? File [] directories = fs.listFiles(indexesDir); for(int i = 0; i < fs.listFiles(indexesDir).length; i++) { wouldn't it be better to do it like this: File [] directories = fs.listFiles(indexesDir); for(int i = 0; i < directories.length; i++) { First of all, these are many unnecessary disk reads, and second, there is theoretically a chance that the number of files changes during the loop, which would throw an exception. Should I provide a patch, or can one of the committers just change this one word? Thanks! Stefan
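The fix above boils down to hoisting the directory listing out of the loop. A minimal standalone sketch, using plain java.io.File instead of Nutch's NutchFileSystem (the class and method names here are made up for illustration):

```java
import java.io.File;

// List the directory once, then iterate over the cached array. This avoids
// one directory read per loop iteration, and removes the race where files
// appearing or disappearing mid-loop would change the loop bound.
public class ListOnce {
    static int countEntries(File dir) {
        File[] entries = dir.listFiles(); // single directory read
        if (entries == null) return 0;    // dir missing or not a directory
        int n = 0;
        for (int i = 0; i < entries.length; i++) {
            n++; // real code would open each index directory here
        }
        return n;
    }
}
```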
Nutch 0.7.1 and Nutch web site
Hello, I have prepared the Nutch 0.7.1 release today, but I had one problem. I was updating the site in the branch, but to deploy it one must use the version from trunk. Currently I simply committed the generated site in trunk, but this solution is far from perfect. Should we have a version-independent site - always modified in trunk? Or should we think about having the site (e.g. JavaDocs, tutorial etc.) versioned and available for all versions at the same time? I am not sure, so I am asking whether somebody has some ideas about it. Regards Piotr
Re: Nutch Suggestion? (Google like did you mean)
Have a look at http://issues.apache.org/jira/browse/NUTCH-48. I think an n-gram based approach is appropriate here. I was using it in our search engine. Regards Piotr On 9/29/05, Jack Tang [EMAIL PROTECTED] wrote: Hi I very much like Google's Did you mean, and I notice that Nutch does not provide this function now. In this article http://today.java.net/lpt/a/211, author Tim White implemented suggestion using n-grams to generate a suggestion index. Do you think it is good for Nutch? I mean, the index in Nutch will be really huge. Or should we just provide some dictionaries, like jazzy (LGPL) does? Thanks /Jack -- Keep Discovering ... ... http://www.jroller.com/page/jmars
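The core of the n-gram approach mentioned above is indexing each dictionary word by its character n-grams, so a misspelled query matches the candidate sharing the most grams. A minimal sketch of just the gram extraction step (hypothetical helper, not code from NUTCH-48 or the linked article):

```java
import java.util.ArrayList;
import java.util.List;

// Split a word into its overlapping character n-grams, e.g. for a
// "did you mean" index each word is stored under all of its grams.
public class Ngrams {
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }
}
```

A query term that shares, say, three of four bigrams with an indexed word is a strong correction candidate even though the exact strings differ, which is why this scales better than edit-distance scans over the whole dictionary.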
[jira] Closed: (NUTCH-89) parse-rss null pointer exception
[ http://issues.apache.org/jira/browse/NUTCH-89?page=all ] Piotr Kosiorowski closed NUTCH-89: -- Fix Version: 0.8-dev 0.7 Resolution: Fixed Applied in trunk and the 0.7 branch. Thanks. parse-rss null pointer exception Key: NUTCH-89 URL: http://issues.apache.org/jira/browse/NUTCH-89 Project: Nutch Type: Bug Components: fetcher Versions: 0.7, 0.8-dev Reporter: Michael Nebel Fix For: 0.7, 0.8-dev Attachments: parse-rss.20050910.patch The rss-parser causes an exception. The reason is a syntax error in the page. Hitting these pages, the parser tries to add an outlink with null as the anchor. The anchor of an outlink must not be null. java.lang.NullPointerException at org.apache.nutch.io.UTF8.writeString(UTF8.java:236) at org.apache.nutch.parse.Outlink.write(Outlink.java:51) at org.apache.nutch.parse.ParseData.write(ParseData.java:111) at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137) at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127) at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39) at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281) at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261) at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148) Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354) at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488) at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140) I suggest the following patch: Index: src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java === --- src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java (revision 279397) +++ src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java (working copy) @@ -157,11 +157,13 @@ if (r.getLink() != null) { try { // get the outlink -theOutlinks.add(new Outlink(r.getLink(), r -.getDescription())); + if (r.getDescription() != null) { + theOutlinks.add(new Outlink(r.getLink(), r.getDescription())); + } else { + theOutlinks.add(new Outlink(r.getLink(), "")); + } } catch (MalformedURLException e) { -LOG -.info("nutch:parse-rss:RSSParser Exception: MalformedURL: " +LOG.info("nutch:parse-rss:RSSParser Exception: MalformedURL: " + r.getLink() + ": Attempting to continue processing outlinks"); e.printStackTrace(); @@ -185,12 +187,13 @@ if (whichLink != null) { try { -theOutlinks.add(new Outlink(whichLink, theRSSItem -.getDescription())); - + if (theRSSItem.getDescription() != null) { + theOutlinks.add(new Outlink(whichLink, theRSSItem.getDescription())); + } else { + theOutlinks.add(new Outlink(whichLink, "")); + } } catch (MalformedURLException e) { -LOG -.info("nutch:parse-rss:RSSParser Exception: MalformedURL: " +LOG.info("nutch:parse-rss:RSSParser Exception: MalformedURL: " + whichLink + ": Attempting to continue processing outlinks"); e.printStackTrace(); -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments
[ http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12330113 ] Piotr Kosiorowski commented on NUTCH-95: I was renaming segments quite often so I would vote for reading the date from the segment instead of using dir name. DeleteDuplicates depends on the order of input segments --- Key: NUTCH-95 URL: http://issues.apache.org/jira/browse/NUTCH-95 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev, 0.6, 0.7 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki DeleteDuplicates depends on what order the input segments are processed, which in turn depends on the order of segment dirs returned from NutchFileSystem.listFiles(File). In most cases this is undesired and may lead to deleting wrong records from indexes. The silent assumption that segments at the end of the listing are more recent is not always true. Here's the explanation: * Dedup first deletes the URL duplicates by computing MD5 hashes for each URL, and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx is just an int index to the array of open IndexReaders - and if segment dirs are moved/copied/renamed then entries in that array may change their order. And then for all equal triples Dedup keeps just the first entry. Naturally, if segmentIdx is changed due to dir renaming, a different record will be kept and different ones will be deleted... * then Dedup deletes content duplicates, again by computing hashes for each content, and then sorting records by (hash, segmentIdx, docIdx). However, by now we already have a different set of undeleted docs depending on the order of input segments. On top of that, the same factor acts here, i.e. segmentIdx changes when you re-shuffle the input segment dirs - so again, when identical entries are compared the one with the lowest (segmentIdx, docIdx) is picked. Solution: use the fetched date from the first record in each segment to determine the order of segments. 
Alternatively, modify DeleteDuplicates to use the newer algorithm from SegmentMergeTool. This algorithm works by sorting records using tuples of (urlHash, contentHash, fetchDate, score, urlLength). Then: 1. If urlHash is the same, keep the doc with the highest fetchDate (the latest version, as recorded by Fetcher). 2. If contentHash is the same, keep the doc with the highest score, and then if the scores are the same, keep the doc with the shortest url. Initial fix will be prepared for the trunk/ and then backported to the release branch.
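The two rules above can be sketched as a pair of dedup passes. This is a hedged, in-memory illustration: DedupRecord and its field names are hypothetical stand-ins (the real SegmentMergeTool works over index readers with an external sort over the (urlHash, contentHash, fetchDate, score, urlLength) tuples), but the keep/delete decisions follow rules 1 and 2 as described.

```java
import java.util.*;

// In-memory sketch of the SegmentMergeTool-style dedup rules.
// DedupRecord is an illustrative stand-in, not a real Nutch type.
public class DedupSketch {

    public static class DedupRecord {
        public final String urlHash, contentHash, url;
        public final long fetchDate;
        public final float score;
        public DedupRecord(String urlHash, String contentHash,
                           long fetchDate, float score, String url) {
            this.urlHash = urlHash; this.contentHash = contentHash;
            this.fetchDate = fetchDate; this.score = score; this.url = url;
        }
    }

    public static List<DedupRecord> dedup(Collection<DedupRecord> input) {
        // Rule 1: same urlHash -> keep the doc with the highest fetchDate.
        Map<String, DedupRecord> byUrl = new HashMap<>();
        for (DedupRecord r : input)
            byUrl.merge(r.urlHash, r, (a, b) -> a.fetchDate >= b.fetchDate ? a : b);

        // Rule 2: same contentHash -> keep the doc with the highest score,
        // breaking ties by the shortest URL.
        Map<String, DedupRecord> byContent = new HashMap<>();
        for (DedupRecord r : byUrl.values())
            byContent.merge(r.contentHash, r, (a, b) -> {
                if (a.score != b.score) return a.score > b.score ? a : b;
                return a.url.length() <= b.url.length() ? a : b;
            });
        return new ArrayList<>(byContent.values());
    }

    public static void main(String[] args) {
        List<DedupRecord> recs = List.of(
            new DedupRecord("u1", "c1", 100L, 1.0f, "http://a/page"),
            new DedupRecord("u1", "c1", 200L, 1.0f, "http://a/page"),  // newer fetch of same URL
            new DedupRecord("u2", "c1", 150L, 2.0f, "http://b/p"));    // same content, higher score
        for (DedupRecord r : dedup(recs))
            System.out.println(r.urlHash + " kept");
    }
}
```

Note that the result no longer depends on segment directory order: only the data carried in each record (fetch date, score, URL) decides which duplicate survives.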
0.7.1 release
Hello, As it looks like everything that was planned has been committed to the 0.7 branch, I would like to prepare a 0.7.1 release in the next few days. I will change the branch name at the same time to comply with the agreed standard. Any objections? Regards Piotr
Re: svn commit: r290163 - in /lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2: ./ lib/
Hi Andrzej,
Is anything related to the clustering commits still left, or should we proceed with the 0.7.1 release?
Piotr

[EMAIL PROTECTED] wrote:
Author: ab
Date: Mon Sep 19 07:11:07 2005
New Revision: 290163
URL: http://svn.apache.org/viewcvs?rev=290163&view=rev
Log: Update of the clustering plugin, contributed by Dawid Weiss. Carrot2 components updated to the newest stable versions. Improvements in tokenizers (speedups) and stop words handling. Internal API changed slightly (update needed if anyone wants to use other Carrot2 components and uses this code as a glue). Support added for Danish, Finnish, Norwegian (bokmaal) and Swedish.
Added:
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/commons-collections-3.1-patched.jar (with props)
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/log4j-1.2.11.jar (with props)
Removed:
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/commons-collections-3.0.jar
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/log4j-1.2.8.jar
Modified:
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-filter-lingo.jar
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-local-core.jar
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-snowball-stemmers.jar
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-common.jar
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-tokenizer.jar
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.CONTRIBUTORS
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.LICENSE
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/commons-pool.LICENSE
  lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/plugin.xml

Modified: lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-filter-lingo.jar
URL:
http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-filter-lingo.jar?rev=290163&r1=290162&r2=290163&view=diff
==============================================================================
Binary files - no diff available.

Modified: lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-local-core.jar
URL: http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-local-core.jar?rev=290163&r1=290162&r2=290163&view=diff
==============================================================================
Binary files - no diff available.

Modified: lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-snowball-stemmers.jar
URL: http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-snowball-stemmers.jar?rev=290163&r1=290162&r2=290163&view=diff
==============================================================================
Binary files - no diff available.

Modified: lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-common.jar
URL: http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-common.jar?rev=290163&r1=290162&r2=290163&view=diff
==============================================================================
Binary files - no diff available.

Modified: lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-tokenizer.jar
URL: http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2-util-tokenizer.jar?rev=290163&r1=290162&r2=290163&view=diff
==============================================================================
Binary files - no diff available.
Modified: lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.CONTRIBUTORS
URL: http://svn.apache.org/viewcvs/lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.CONTRIBUTORS?rev=290163&r1=290162&r2=290163&view=diff
==============================================================================
--- lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.CONTRIBUTORS (original)
+++ lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2/lib/carrot2.CONTRIBUTORS Mon Sep 19 07:11:07 2005
@@ -5,9 +5,10 @@
 #
 # First name, surname name; Duties; Active from; Institution
-Dawid Weiss; Project administrator, various components, core; 2002; Poznan University of Technology, Poland
-Stanisław, Osiński; Lingo clustering component, ODP Input; 2003; Poznan University of Technology, Poland
+Dawid Weiss; Project administrator, various components, core; 2002; Poland
+Stanisław, Osiński; Lingo clustering component, ODP Input; 2003; Poland
+
 Michał, Wróblewski [*]; AHC clustering components; 2003; Poznan University of Technology, Poland
Re: DistributedSearch$Client.updateSegments() blocking other threads
Hello Andrzej,
You can also try http://issues.apache.org/jira/browse/NUTCH-79 - I think it should also help here. It is a bit complicated, as it contains additional functionality, but if you have any problems I am willing to help. I am going to run some tests of it again and maybe commit it in some time if others think it is worth it.
Regards
Piotr

Andrzej Bialecki wrote:
Hi, I was doing performance testing of a distributed search setup, with JMeter, using the code from trunk/. Whenever one of the backend Servers goes down, there is a hiccup on the frontend, because all ParallelCalls started by the Client, which still use that dead address, need to time out. This is expected, and acceptable. New calls made in the meantime (before updateSegments() discovers that the host is down) will also need to time out - which is so-so; I think it could be improved by removing the offending address at the first sign of trouble, i.e. not waiting for updateSegments() but immediately removing the dead host from liveAddresses. Anyway, read on... What was curious was that the same hiccup would then occur every 10 seconds, which is the hardcoded interval for calling Client.updateSegments(). It was as if the call to updateSegments() was synchronized on the whole class, so that all other calls were blocked until updateSegments() completed. I modified the code so that instead of using DistributedSearch$Client itself as a Thread instance, a new independent Thread instance is created. The hiccups are gone now - the list of liveAddresses is still being updated as it should whenever Servers go down/up, but now updateSegments() doesn't interfere with other calls. I attach the patch - but to be honest I'm still not quite sure what was happening...
Index: DistributedSearch.java
===================================================================
--- DistributedSearch.java (revision 280515)
+++ DistributedSearch.java (working copy)
@@ -112,8 +112,9 @@
     public Client(InetSocketAddress[] addresses) throws IOException {
       this.defaultAddresses = addresses;
       updateSegments();
-      setDaemon(true);
-      start();
+      Thread t = new Thread(this);
+      t.setDaemon(true);
+      t.start();
     }
 
     private static final Method GET_SEGMENTS;
@@ -168,8 +169,10 @@
         liveSegments += segments.length;
       }
 
-      this.liveAddresses = (InetSocketAddress[]) // update liveAddresses
-        liveAddresses.toArray(new InetSocketAddress[liveAddresses.size()]);
+      synchronized (this.liveAddresses) {
+        this.liveAddresses = (InetSocketAddress[]) // update liveAddresses
+          liveAddresses.toArray(new InetSocketAddress[liveAddresses.size()]);
+      }
       LOG.info("STATS: " + liveServers + " servers, " + liveSegments + " segments.");
     }
Re: Problems on Crawling
The "bin/nutch updatedb db $s1" command updates the WebDB with the links you fetched in segment $s1.
Regards
Piotr

Daniele Menozzi wrote:
Hi all, I have questions regarding org.apache.nutch.tools.CrawlTool: I have not really understood the relationship between depth, segments and fetching. Take for example the tutorial; I understand these 2 steps: bin/nutch admin db -create bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000 but, when I do this: bin/nutch generate db segments what happens? I think that a dir called 'segments' is created, and inside of it I can find the links I have previously injected. Ok. Next steps: bin/nutch fetch $s1 bin/nutch updatedb db $s1 Ok, no problems here. But now I cannot understand what happens with this command: bin/nutch generate db segments it is the same command as above, but now I've not injected anything into the DB; it only contains the pages I've previously fetched. So, does it mean that when I generate a segment, it will automagically be filled with links found in the fetched pages? And where are these links saved? And who saves these links? Thank you so much, this work is really interesting! Menoz
Re: Delete an entry in ArrayFile/MapFile
Hello, You cannot do it. These structures were not designed for it. But you can copy all the data to another ArrayFile, skipping the entries you want to delete. Regards Piotr On 9/6/05, Ben [EMAIL PROTECTED] wrote: Hi How can I delete an entry in the ArrayFile/MapFile if I know the id/key? Thanks, Ben
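The copy-and-skip workaround can be sketched as follows. Since ArrayFile/MapFile support only sequential reads and appends, a "delete" is really a rewrite: stream every entry into a new file and drop the ones you no longer want. The Reader/Writer interfaces below are hypothetical in-memory stand-ins for the Nutch file classes (not the real org.apache.nutch.io API); only the streaming pattern is the point.

```java
import java.util.*;

// "Deleting" from an append-only sequential file = copy everything,
// skipping the keys to drop. Reader/Writer are hypothetical stand-ins.
public class CopySkipDelete {

    public interface Reader { Map.Entry<String, String> next(); }   // null at EOF
    public interface Writer { void append(String key, String value); }

    public static void copyWithout(Reader in, Writer out, Set<String> toDelete) {
        Map.Entry<String, String> e;
        while ((e = in.next()) != null) {
            if (toDelete.contains(e.getKey())) continue;  // skip "deleted" entry
            out.append(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        Iterator<Map.Entry<String, String>> it = List.of(
            Map.entry("a", "1"), Map.entry("b", "2"), Map.entry("c", "3")).iterator();
        List<String> kept = new ArrayList<>();
        copyWithout(() -> it.hasNext() ? it.next() : null,
                    (k, v) -> kept.add(k),
                    Set.of("b"));
        System.out.println(kept);  // [a, c]
    }
}
```

With the real on-disk classes the same loop would read from the old file, append to a fresh one, and then swap the files; sorted key order is preserved automatically because the copy is sequential.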
Re: [Nutch Wiki] Update of Committer's Rules by AndrzejBialecki
Doug Cutting wrote: Glancing at other Apache projects in subversion, I see that httpd uses branch names like 2.2.x and tag names like 2.2.4. That's a little cryptic. I propose that we use branch names like branch-2.4 and tag names like release-2.4.1. What do folks think? +1 In fact I wanted to do it this way when I started to create a branch, but as no one objected to the Release-X.Y branch name that was present in the Release-HOWTO I prepared earlier (and I had not thought it through), I decided to go with the Release-HOWTO version to avoid confusion. I can try to change things in the next few days if others agree. I will also roll back the erroneous commit in the tags subdirectory. Regards Piotr
Re: merge mapred to trunk
Doug Cutting wrote: Currently we have three versions of nutch: trunk, 0.7 and mapred. This increases the chances for conflicts. I would thus like to merge the mapred branch into trunk soon. The soonest I could actually start this is next week. Are there any objections? Doug +1 P.
Re: null lang bug? and patch?
Great - I just thought it would be better if you looked at it, instead of me digging into the code. I wanted to be on the safe side with the 0.7.1 release. Regards Piotr Jérôme Charron wrote: I am a bit lost, but just a quick check - shouldn't it also be committed in the Release-0.7 branch? No, the analyzer extension-point is committed only in trunk. It's a new feature, so I follow the Committer's Rules (http://wiki.apache.org/nutch/Committer's_Rules) ;-) Regards Jérôme
Re: Analysis plugins and lucene-analyzers
Hello, I do not object to putting lucene-analyzers-1.9-rc1-dev.jar in the nutch core, but I would like to offer another option. I think it is possible to create a plugin which contains and exports this library, and make the other analysis plugins depend on it. I am not an expert in it, but I think such a solution is also possible. It is just a second idea for you to consider - I do not have a preference for either of these options. Regards Piotr Andrzej Bialecki wrote: Jérôme Charron wrote: Hi, I would like to add some language-specific analysis plugins. In this first approach, each plugin would simply be a wrapper around Lucene's analyzers. So each analysis-lang plugin needs to import lucene-analyzers-1.9-rc1-dev.jar in its lib directory. In order to avoid adding this jar to many plugins, I would like to add lucene-analyzers-1.9-rc1-dev.jar to the nutch core lib. Any comments? Any objection? I'm wondering if you could implement this plugin as a more or less automatic wrapper around any Lucene classes that implement Analyzer, i.e. so that it doesn't require recompiling to change/select the language, or to add a non-standard analyzer from the classpath. I think it's possible to do this, but you would have to code a special case for Snowball analyzers, where the default constructor requires an argument. All of this could be read from the plugin.xml or nutch-default.xml files.
Re: crawl-urlfilter.txt mechanics
crawl-urlfilter.txt is "bin/nutch crawl" specific. If you want to run each step separately, you are in fact doing the Whole Web crawling from the tutorial, so you need to modify regex-urlfilter.txt instead.
Regards
Piotr

On 8/22/05, Michael Ji [EMAIL PROTECTED] wrote:
Hi, When I use intranet crawling, such as calling bin/nutch crawl ..., crawl-urlfilter.txt works - it filters out the urls that do not match the domain I included; actually, when I take a look at crawltool.java, the config files are read into Java Properties by 'NutchConf.get().addConfResource(crawl-tool.xml)'. But when I call each step explicitly myself, such as the loop generate, segment, fetch, updateDB, crawl-urlfilter.txt doesn't work. My questions are: 1) If I want to control the crawler's behavior in the second case, should I call 'NutchConf.get()...' myself? 2) Where exactly does the url-filter work? In the fetcher? So, after loading from .xml and .txt, is all the configuration data kept in Properties for the lifetime of the nutch run? thanks, Michael Ji
Re: Failing JUnit test
Hello Jérôme,
I found it and committed the fix. It was not using UTF-8 encoding in some places. But while looking at the code I feel a little bit worried about LanguageIdentifier.identify(InputStream is), as it reads bytes from the file in chunks and converts each chunk to a string separately. If a multibyte UTF-8 character is located at a chunk boundary, it would be split into two parts. Am I right?
Regards
Piotr

Jérôme Charron wrote: It works on my Linux box - with both JDK 1.4 and 1.5. OK, so it seems to be consistent with my conf. I will try to track it down. I assume it is an encoding problem of the Ngram profile files, but I have no time this evening. Regards Jérôme
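The chunk-boundary concern is real and easy to demonstrate. The sketch below (plain Java, no Nutch code) decodes the same bytes two ways: once per fixed-size chunk, which corrupts a multibyte character split across the boundary, and once through a stream-level decoder, which buffers partial sequences and does not.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Demonstrates the chunk-boundary hazard: decoding fixed-size byte chunks
// independently corrupts a multibyte UTF-8 character that straddles two
// chunks, while a stream-level decoder (InputStreamReader) handles it.
public class Utf8ChunkDemo {

    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            // BUG: each chunk is decoded as if it were complete UTF-8,
            // so a split character turns into replacement chars.
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    static String decodeStreamed(byte[] data) throws IOException {
        Reader r = new InputStreamReader(new ByteArrayInputStream(data),
                                         StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "aé";  // 'é' is two bytes in UTF-8, so the text is 3 bytes
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // chunk size 2 splits 'é' across the boundary
        System.out.println(decodePerChunk(bytes, 2).equals(text));  // false
        System.out.println(decodeStreamed(bytes).equals(text));     // true
    }
}
```

The fix for such code is to wrap the InputStream in an InputStreamReader (or a CharsetDecoder) and read characters, never to decode raw byte chunks individually.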
Re: Failing JUnit test
It works on my Linux box - with both JDK 1.4 and 1.5. I will try to track it down. Regards Piotr Jérôme Charron wrote: I am using JDK 1.5 on Windows - I can test it on 1.4/1.5 on Linux tomorrow - maybe this is the problem. OK. Thanks Jérôme
Failing JUnit test
Hello,
I have updated my local copy today and the JUnit tests started to fail:

junit.framework.ComparisonFailure: expected:<el> but was:<sv>
    at org.apache.nutch.analysis.lang.TestLanguageIdentifier.testIdentify(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

I suspect it is a result of the latest updates to the LanguageIdentifier plugin or its tests. As I am not deep into it, I will not try to debug it myself at the moment - I just wanted you to know about the issue.
Regards
Piotr
Release 0.7
Hello Nutch Committers, Is anyone working on preparing the release? If not, I can spend some time on it in an hour or so. Regards Piotr
Release 0.7 problem
Hello,
I have a problem related to the 0.7 release. After making a tar I was trying to go through the crawl tutorial:

tar xvfz nutch-0.7.tar.gz

bin/nutch is not executable (and nutch-daemon.sh too). I thought it was my mistake - I started doing it on Windows, so I moved to Linux, but the problem persisted. I downloaded the latest nightly build (nutch-2005-08-16.tar.gz) and it is still the same. I am not using the standard nutch script (and build.xml) for my local installation at work, so I had a look and noticed that in my build.xml I have additional elements inside the tar element:

<tarfileset dir="${build.dir}" mode="755">
  <include name="${final.name}/bin/*"/>
</tarfileset>

It is strange nobody reported it so far, so it may still be my fault. But if not - should we make a release with the bin/* scripts not executable, or change the build process? I would go for the change, but then I will do the release tomorrow, as I would like to test it. Comments?
Regards
Piotr
Re: Release 0.7 problem
So I will move the release to tomorrow, as I am a bit sleepy now. Regards Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: After making a tar I was trying to go through the crawl tutorial. - tar xvfz nutch-0.7.tar.gz bin/nutch - is not executable (and nutch-daemon.sh too). It is strange nobody reported it so far, so it may still be my fault. No, it looks like a problem with ant's tar task, which erases executable bits. In prior releases I think Nutch used to directly exec 'tar czf', since ant's tar task didn't support compression. Since it added compression we started using the ant task... But if not - should we make a release with bin/* scripts not executable or change the build process? I think we should fix this before we release. Good job catching it. Doug
Re: Release 0.7 problem
Hi,
Just for information, the only change I plan to make is to change the tar task to:

<target name="tar" depends="package">
  <tar compression="gzip" longfile="gnu" destfile="${build.dir}/${final.name}.tar.gz">
    <tarfileset dir="${build.dir}" mode="664">
      <exclude name="${final.name}/bin/*"/>
      <include name="${final.name}/**"/>
    </tarfileset>
    <tarfileset dir="${build.dir}" mode="755">
      <include name="${final.name}/bin/*"/>
    </tarfileset>
  </tar>
</target>

I will commit and test it tomorrow.
Regards
Piotr

Doug Cutting wrote: Piotr Kosiorowski wrote: After making a tar I was trying to go through the crawl tutorial. - tar xvfz nutch-0.7.tar.gz bin/nutch - is not executable (and nutch-daemon.sh too). It is strange nobody reported it so far, so it may still be my fault. No, it looks like a problem with ant's tar task, which erases executable bits. In prior releases I think Nutch used to directly exec 'tar czf', since ant's tar task didn't support compression. Since it added compression we started using the ant task... But if not - should we make a release with bin/* scripts not executable or change the build process? I think we should fix this before we release. Good job catching it. Doug
Re: VOTE: clustering plugin update for Rel 0.7
Hi, Maybe it would be a better idea to commit it to the 0.7 branch and schedule a new 0.7.1 release in a short time? It is difficult for me to judge whether a patch I have not seen is good for the release, so I would say 0 from me (if you think it is good enough, I will not object). Regards, Piotr Andrzej Bialecki wrote: Hi, This is yet another request for an exception from the no-commit rule before release ... *sigh* Dawid Weiss reported that he prepared an updated version of the Carrot2 clustering plugin, which contains significant updates and improvements. He suggests that it would be a good idea to include it in the 0.7 release. If committers agree, I can commit this updated version before the release (which should happen tomorrow?), however I'm not sure if I can test it sufficiently to be sure that nothing breaks ... If the decision is positive and you have a collection of test segments which works with recent code, your help in testing would be appreciated. Please vote +1 for and -1 against.
Re: FW: Fetcher, ParseText, ParseData - need to modify
Hello,
To change nutch's standard html parsing, the best place to start would probably be the parse-html plugin.
Regards
Piotr

Fuad Efendi wrote:
1. This is part of ParseText: Any Accessories Backup Devices Media Barebone Systems Camcorder Accessories Camcorders Cases External Enclosures CD / DVD Drives Media Cooling Devices Digital Camera Accessories Digital Cameras - it is the content of a dropdown, OPTIONS in HTML. 2. I have some sub-text in ParseText which seems to be an anchor; I compared it visually with the web page...
-----Original Message----- From: Fuad Efendi [mailto:[EMAIL PROTECTED]] Sent: Monday, August 15, 2005 1:20 PM To: nutch-dev@lucene.apache.org Subject: Fetcher, ParseText, ParseData - need to modify
I just caught some output from Fetcher.FetcherThread.outputPage(.) and noticed that some anchors are in the text, and some OPTIONS tags are within the text too. LOG.info("ParseText = " + text); LOG.info("ParseData = " + parseData); I'd like to modify the behaviour: ParseText should contain the subset of the text which I need, and ParseData should contain all anchors. Where should I start? It would be nice to have plugins modifying Fetcher behaviour...
Re: page ranking weights
The boost for a page may be calculated in a few different ways (and in a few different places in nutch):
1) PageRank-based score:
- calculated by the nutch analyze command based on the WebDB
- during fetchlist generation, scores from the WebDB are stored in the segment
- the indexing phase uses the score to set the boost for a page
2) Based on the number of incoming links:
- during fetchlist generation, inlinks are stored in the segment
- during indexing, the number of inlinks is read from the segment and used in the boost calculation
There is a separate command (updatesegs) to update the score and inlink information in existing segments.
Regards
Piotr

Jay Pound wrote: also how does it keep track of incoming links globally on these pages? If the weight is determined by the number of incoming links, then there would have to be somewhere it keeps track, so that when you split your indexes it can still have an accurate value for the distributed search? -J
----- Original Message ----- From: Jay Pound [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Thursday, August 11, 2005 4:49 PM Subject: page ranking weights
At which step does nutch figure out the weight of each page - the updatedb step, or the index step? Thanks, -Jay
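As an illustration of point 2, an inlink-count boost is typically log-damped so that each additional inlink helps with diminishing returns. The formula below is a hypothetical sketch, not the actual Nutch calculation (which lives in the indexing code and has its own configuration knobs):

```java
// A hypothetical log-scaled boost from the inlink count, illustrating why
// the indexer needs inlink data stored in the segment. This is NOT the
// actual Nutch formula; it only shows the general shape of such a boost.
public class BoostSketch {

    static float inlinkBoost(int inlinkCount) {
        // log damping: boost grows with inlinks, but with diminishing returns;
        // log(e + 0) = 1.0, so a page with no inlinks keeps a neutral boost
        return (float) Math.log(Math.E + inlinkCount);
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 1, 10, 100})
            System.out.printf("%d inlinks -> boost %.3f%n", n, inlinkBoost(n));
    }
}
```

Whatever the exact formula, the key point from the list above stands: the boost is computed at index time from data (score or inlinks) that was copied into the segment at fetchlist-generation time, which is why updatesegs exists to refresh that data.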
Nutch versions - Was: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml
Hello,
I think a lot of people will wait some time before moving to the mapreduce implementation, so we will have a 0.7 version to support. I was a heavy CVS branch user in my previous job, taking care of a common library, so I fully agree that such a branch will be needed for bug fixing. I would go (as Doug suggested) for lazy creation of this branch - it should be created on the first need to commit to it.
I also tracked down the version numbers in the nutch source code:
1) CHANGES.txt - that was easy, all releases are listed there
2) default.properties: version=0.7-dev
3) nutch-default.xml:

<property>
  <name>http.agent.version</name>
  <value>0.07</value>
  <description>A version string to advertise in the User-Agent header.</description>
</property>

So we have a small inconsistency in naming versions (I committed 0.07 in nutch-default.xml yesterday, but it used to be 0.06, so I have not changed the format) - I will fix it soon. I think we all refer to 0.7 as the next version (and 0.6 as the current one), so nutch-default.xml contains the wrong format. In fact it should still contain the -dev suffix. To turn this undocumented convention into a documented one, I would also like to suggest naming releases with an X.Y format and naming code developed towards a release X.Y as X.Y-dev. I will try to put a draft of a Release HOWTO doc on the Wiki tomorrow.
Regards,
Piotr

Andrzej Bialecki wrote: Doug Cutting wrote: We may want to start a branch for the release too, as described at: http://svnbook.red-bean.com/en/1.1/ch04s04.html#svn-ch-4-sect-4.4.1 If we think we may someday need a 0.7.1 release, then we will need a 0.7 branch to make it from. We can start this branch later by basing it on the 0.7 tag. But we should never alter the 0.7 tagged code once it is created. Thoughts? I agree. I have similar experience from another OSS project (FreeBSD), where this schema is used extensively. This allows providing some limited maintenance for past releases.
Considering that this is the last release before merging the map-reduce, doing a branch seems very appropriate.
Re: clucene-java bindings
Hello Ben, I personally would be interested mainly in the search part of it, if the speed increase were significant. I am running my indices on Linux/AMD Opterons - I hope CLucene will work well in this environment. I assume CLucene is compatible with the Java Lucene index format, as we do have some tools in Java that manipulate Lucene indices. If you have anything to integrate with nutch I am willing to help with the integration and test it. Regards, Piotr Ben van Klinken wrote: Hi Nutch People, I am a developer of CLucene, which is a full C++ port of Apache Lucene. I would like to propose something to the users of Nutch: I have been working on some SWIG wrappers for CLucene in various higher-level languages such as C#, Java and COM. I started working on the Java wrapper for the purpose of 'stealing' Java test suites in order to test CLucene. I have already managed to run about half of the luceneDotNet tests successfully using the CLucene-csharp bindings (the rest mostly cannot be done because of the lack of director support in the Swig Csharp module). This has been useful in tracking down bugs, etc. Without too much effort, I have managed to get the Java bindings working. I have so far been able to get the IndexFiles demo program to run with very few changes to the Java code (I had to change the imports code and put a System.loadLibrary call in - though these differences would eventually be removable completely). I only spent a minute looking at speeds, but I indexed a directory which took 2.5 seconds on java lucene and the same thing took 1.5 seconds in clucene-java. Of course this is not saying much, but it means that clucene-java *might* be faster. So what I wanted to propose to the users and developers of Nutch is this: with a bit of effort, clucene-java could be good enough to be 'dropped into' the nutch project, thereby speeding up the nutch indexer. We could write directors for clucene-java which would pass off some things like the analysers into java.
This would be beneficial to nutch because of the added speed. If the clucene-java wrapper was written well, there would be no need for any code change in nutch, aside from changing which lucene jar file is loaded. This is just some preliminary thoughts, I'm sure there is still a lot to think about. But I have shown that the concept could work using the demo files and I think that it could give nutch indexing/search a reasonable speed boost. What do people think? I am prepared to nut out this one with whoever is interested cheers, ben
Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml
Hello Doug, I read your email ten times and still I am not sure what the problem is. Regards, Piotr

Doug Cutting wrote: [EMAIL PROTECTED] wrote:
- <value>http://www.nutch.org/docs/en/bot.html</value>
+ <value>http://lucene.apache.org/nutch/bot.html</value>
I think this should now be: http://lucene.apache.org/nutch/bot.html The docs/en pages have mostly been reduced to the about page, whose translations I hate to throw away, even though they don't really fit into the new Forrest-based website. Doug
Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml
No problem at all. I have a lot to learn yet, and it is nice that people like you check my commits for stupid mistakes. Four eyes are always better than two :). Regards, Piotr

Doug Cutting wrote: Piotr Kosiorowski wrote: I read your email ten times and still I am not sure what the problem is. The problem is with me. Doug Cutting wrote: [EMAIL PROTECTED] wrote:
- <value>http://www.nutch.org/docs/en/bot.html</value>
+ <value>http://lucene.apache.org/nutch/bot.html</value>
I clicked on what I thought was the lower link, then looked at the browser and saw the wrong thing with the wrong link. But I must have accidentally clicked on the upper link. Sorry for the confusion! Doug
Re: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml
I will do it tomorrow - I wanted to put down a kind of release checklist on the Wiki, starting with where to change the version numbers. I would also like to cover a release howto, but in fact I am not sure yet how to make a release. I will try to gather this information. Regards Piotr Andrzej Bialecki wrote: [EMAIL PROTECTED] wrote: Author: pkosiorowski Date: Mon Aug 8 13:44:23 2005 New Revision: 230887 URL: http://svn.apache.org/viewcvs?rev=230887&view=rev Log: User agent string related properties updated. Piotr, could you also check the other places where the version number is hardcoded? We should set them to 0.07 now, so that we have the right values in the release ...
Tutorial
Hello,
Some time ago someone mentioned on the list a problem with the nutch tutorial (I cannot find this email now). I have checked it today and he/she was right. If you follow the nutch Intranet Crawling tutorial you will end up with a not very interesting index. This is because it tells users to set the urlfilter and the urls file for the nutch.org domain, but www.nutch.org redirects to http://lucene.apache.org/nutch and all links are rejected by the urlfilter. So I suggest changing it as follows:
The urls file will contain: http://lucene.apache.org/nutch
crawl-urlfilter.txt will contain: +^http://([a-z0-9]*\.)*apache.org/
I would also add pdf and png to the list of rejected extensions in the crawl-urlfilter.txt file, so users would not be confused by errors in the log file (the pdf parsing plugin is disabled in the default configuration). I can commit these changes for the 0.7 release (which means today) if I get positive feedback from the other committers.
Regards
Piotr
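The suggested filter line can be sanity-checked with a quick regex test. The leading '+' in crawl-urlfilter.txt means "accept"; only the pattern after it is the regex. A minimal sketch (plain Java, no Nutch code):

```java
import java.util.regex.Pattern;

// Quick check of the suggested crawl-urlfilter rule: the pattern accepts
// any apache.org host, so the tutorial's redirect target passes while the
// old www.nutch.org URL does not.
public class UrlFilterCheck {

    // The regex part of the "+^http://([a-z0-9]*\.)*apache.org/" rule:
    static final Pattern ACCEPT =
        Pattern.compile("^http://([a-z0-9]*\\.)*apache.org/");

    static boolean accepted(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepted("http://lucene.apache.org/nutch"));  // true
        System.out.println(accepted("http://www.nutch.org/"));           // false
    }
}
```

Note that the real urlfilter applies a whole list of such +/- rules in order; this sketch only verifies that the single proposed pattern matches the intended domain.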
NUTCH 79 Fault tolerant searching.
Hello, I have just created an issue in JIRA, http://issues.apache.org/jira/browse/NUTCH-79, containing the code for fault tolerant searching. I think it is too late to include it in the 0.7 release, so I will wait for comments and test it in the meantime. I would like to commit it once the release is done and the merging of the mapreduce branch is finished. Waiting for comments, Piotr
Re: JIRA access
Thanks. It works. Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: Looking around in JIRA I found out I cannot resolve an issue. I am not sure how it works but I suspect I lack some rights to do so. Am I right? I have added you to the nutch-developers Jira group. Now you should be able to resolve issues, etc. Doug