Re: Next Nutch release
Hi, I just finished reading all the source code for the nutch gui. Personally I don't like putting a lot of code snippets into jsp files, since it takes a lot of time when refactoring. So how about adopting velocity/freemarker with a servlet? In general I agree it is the view layer and should have as little code as possible; however, the idea was to have as few dependencies as possible on third-party tools and libraries, and to get things realized with low tech (jsp). Stefan
Re: Next Nutch release
The old hadoop patch is here: https://issues.apache.org/jira/browse/NUTCH-251 Also we had this conversation: http://www.mail-archive.com/hadoop-dev@lucene.apache.org/msg00314.html I guess after this we neglected to post the patches we use internally. If someone feels strongly about getting the gui working with hadoop, he/she should feel free to update the patch and post it in the hadoop jira. Stefan On 18.01.2007, at 15:39, Doug Cutting wrote: Stefan Groschupf wrote: We run the gui in several production environments with patched hadoop code - since this is from our point of view the clean approach. Everything else feels like a workaround to fix some strange hadoop behaviors. Are there issues in Hadoop's Jira for these? If so, do they have patches attached? Are they linked to the corresponding issue in Nutch? Doug ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
Re: Next Nutch release
Hi Scott, feel free - I have no opinion on that. From my very little point of view the nutch .8 source stream is a one-way street. In all my projects we move as far as possible away from nutch. I like hadoop a lot, and writing custom tools on top of it is just that easy. But nutch .8 was a proof of concept for the early hadoop. There is only one serious developer left - and wow, how great he does his job - but nutch .8 is just too monolithic, too difficult to extend, too difficult to debug, too difficult to integrate for a serious mission-critical application. I spend a significant part of my life daily working with nutch, but if someone asked, I would answer: don't use it. Maybe one day we can get some developers together to first think about a good extendable design and then start a 2.x stream or a new project. And ... yes, no opic, and yes, definitely no plugin architecture (I feel very sorry for all who wasted so much lifetime because of my terribly complicated plugin system) but a clean IoC design with lightweight default interface implementations and great test coverage. Anyway, just my *very little* point of view based on 3.5 years of nutch experience. Stefan On 18.01.2007, at 21:33, Scott Green wrote: Stefan, I also dived into contrib/web2 in nutch. It and the admin-gui both own some plugins based on the nutch plugin architecture. So I think it would be great if we extracted something at a high level, since they should have a lot in common. Well, I don't know if it is the right time to do this job. On 1/19/07, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, I just finished reading all the source code for the nutch gui. Personally I don't like putting a lot of code snippets into jsp files, since it takes a lot of time when refactoring. So how about adopting velocity/freemarker with a servlet? In general I agree it is the view layer and should have as little code as possible; however, the idea was to have as few dependencies as possible on third-party tools and libraries, and to get things realized with low tech (jsp). Stefan ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
Re: Next Nutch release
Hi, great to hear people are still working on these things. It shows once more that getting something in early would save some effort. :) Just some random comments. We run the gui in several production environments with patched hadoop code - since this is from our point of view the clean approach. Everything else feels like a workaround to fix some strange hadoop behaviors. It may be a long time ago that I spoke to Doug and some other Hadoop developers, but at that time I understood that there was a general interest in having a nutch gui and supporting the required functionality in hadoop. I'm not sure if that is still the case or if I had a wrong impression. In any case, from my p.o.v. the clean way would be getting the required minor changes into hadoop (not critical, simple stuff from my point of view) instead of implementing workarounds in nutch. Since hadoop is a kind of child of nutch there should be a close relation, at least to discuss things. Anyway, no strong opinion, just my 2 cents. In any case I'm very happy that people now see the need for a gui as well and that someone is working on it, since I'm kind of busy with other projects. Thanks. Stefan On 17.01.2007, at 06:42, Enis Soztutar wrote: Hi all, for NUTCH-251: I suppose that NUTCH-251 is a relatively significant issue, judging by the votes. Stefan has written a good plugin for the admin gui and I have updated it to work with nutch-0.8, hadoop 0.4. Some of the features in the patch are not appropriate for our use cases and it requires hadoop changes, thus I am currently working on an alternative implementation of the administration gui, which runs a hadoop server (like JobTracker) to listen for submitted jobs, a web GUI to submit and track the jobs from the browser, and a job runner. The architecture details of the patch are as follows (a rough sketch of the queue follows after this message): - an AdminJob abstraction, an abstract class representing a job in nutch - various classes extending AdminJob, for example FetchAdminJob, IndexAdminJob - a queue which sorts the jobs in priority order, by a modified topological sort (jobs can be dependent) - an interface to submit jobs - an rpc server to listen for job submissions - an extension point (basically the same as the previous one) - a web server to serve the plugins' jsps. The features will be: submitting jobs from code, the command line, or the web interface; tracking jobs from the command line or web interface; scheduling jobs. I could send the code or details if anyone is interested in pretesting. And I will appreciate any comments and suggestions on this. I am planning to complete the patch and submit it to Jira ASAP. Sami Siren wrote: Hello, It has been a while since the previous release (0.8.1) and looking at the great fixes done in trunk I'd start thinking about baking a new release soon. Looking at the jira roadmaps there is 1 blocking issue (fixing the license headers) for 0.8.2 and two other blocking issues for 0.9.0, of which I think NUTCH-233 is safe to put in. The top 10 voted issues are currently: NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content NUTCH-48 Did you mean query enhancement/refinement feature NUTCH-251 Administration GUI NUTCH-289 CrawlDatum should store IP address NUTCH-36 Chinese in Nutch NUTCH-185 XMLParser is configurable xml parser plugin.
NUTCH-59 meta data support in webdb NUTCH-92 DistributedSearch incorrectly scores results NUTCH-68 A tool to generate arbitrary fetchlists NUTCH-87 Efficient site-specific crawling for a large number of sites Are there any opinions about issues that should go in before the next release? (Answering yes means that you are willing to provide a patch for it.) -- Sami Siren ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
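A rough sketch of the queue Enis describes above - priority ordering over jobs that may depend on one another - could look like the following. All names here are illustrative and not taken from the actual patch:

import java.util.*;

// Illustrative sketch of a job queue that respects dependencies: a job
// becomes eligible only once every job it depends on has completed, and
// among eligible jobs the highest-priority one runs first.
public class AdminJobQueue {

  public static class Job {
    final String name;
    final int priority;                          // higher runs first
    final Set<Job> dependsOn = new HashSet<Job>();
    boolean done = false;

    Job(String name, int priority) {
      this.name = name;
      this.priority = priority;
    }
  }

  private final List<Job> jobs = new ArrayList<Job>();

  public synchronized void submit(Job job) {
    jobs.add(job);
  }

  // Returns the next runnable job, or null if nothing is eligible yet.
  public synchronized Job next() {
    Job best = null;
    for (Job j : jobs) {
      if (j.done) continue;
      boolean ready = true;
      for (Job dep : j.dependsOn) {
        if (!dep.done) { ready = false; break; }
      }
      if (ready && (best == null || j.priority > best.priority)) {
        best = j;
      }
    }
    return best;
  }
}

An IndexAdminJob would then simply declare its FetchAdminJob as a dependency, and both could be submitted in any order.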
Re: What's the status of Nutch-GUI?
Hi Sami, I guess you refer to these: • LocalJobRunner: • Run as a kind of singleton • Have a kind of jobQueue • Implement the JobSubmissionProtocol status-report methods • Implement the killJob method Right! - how about writing a nutchrunner that just extends the functionality of localjobrunner? That would be one solution; however, I still hope that the hadoop developers understand that it would be of general benefit to improve the local jobrunner. Since it would be somewhat duplicated code it does not feel right, but I also think better this way than never getting this issue solved. - scheduling (jobQueue) could be completely outside of the jobrunner? We solved that with Quartz and a file-based JobStore we implemented back then. Stefan
Re: [jira] Created: (NUTCH-408) Plugin development documentation
did you ever browse this: http://wiki.media-style.com/display/nutchDocu/Home Nothing big, but it will give you some ideas, also about plugins. On 25.11.2006, at 06:32, Armel T. Nene wrote: I agree with you that documentation is vital, not just for extending the current version but also for any plugins and patches created. I have been spending almost two weeks trying to adapt nutch to my project, but I spend more time reading code and trying to understand what it does before I can even start to fix a problem. Come on guys, documentation is good coding practice; we can't read your minds to know exactly what you were trying to achieve by just looking at the implementation code. This is just good constructive criticism. :) Armel -Original Message- From: nutch.newbie (JIRA) [mailto:[EMAIL PROTECTED] Sent: 25 November 2006 03:45 To: nutch-dev@lucene.apache.org Subject: [jira] Created: (NUTCH-408) Plugin development documentation Plugin development documentation Key: NUTCH-408 URL: http://issues.apache.org/jira/browse/NUTCH-408 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1 Environment: Linux Fedora Reporter: nutch.newbie Documentation is rare! But very vital for extending current (0.9) nutch. The current docs on the wiki for 0.7 plugin development were good, but they don't apply to 0.9, and new developers who are joining directly at 0.9 find the 0.7 documentation not enough. A more practical plugin-writing documentation for 0.9 is desired, also explaining the plugin principles in practical terms, i.e. extension points and libs etc. Furthermore it would be good to provide some best-practice examples, i.e. look whether the lib you are planning to use is already in the lib folder, and maybe that version of the external lib is good enough for the plugin dev rather than using another version - things like that. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira ~~~ 101tec Inc. search tech for web 2.1 Menlo Park, California http://www.101tec.com
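Until such documentation exists, a concrete starting point may help. The smallest useful plugin is typically a URL filter; below is a minimal sketch assuming the 0.8-era org.apache.nutch.net.URLFilter extension point, where filter() returns the URL to keep it or null to drop it. Verify the exact interface against your checkout, and remember the class still has to be declared in the plugin's plugin.xml and wired into the plugin build:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Minimal sketch of a URL filter extension: drops any URL containing
// "calendar" and passes everything else through unchanged.
public class NoCalendarURLFilter implements URLFilter {

  private Configuration conf;

  // Return the URL to keep it, or null to filter it out.
  public String filter(String urlString) {
    if (urlString != null && urlString.indexOf("calendar") >= 0) {
      return null;
    }
    return urlString;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}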
Re: Fetcher freezes
Hi, try running with no regular expression filter and check if this helps. Let me know if this solves the problem. You may also want to do a thread dump and send the log to the list to check where exactly the fetcher freezes. Stefan On 03.11.2006, at 15:53, Aisha wrote: Hi, I don't know why, but I have no answer on the 3 forums where I sent my problem. As the problem of Fetcher freezes occurs every time I try to fetch my file system, I can't imagine that I am the only one who has this problem, and as I said in my last e-mail, I found many mails about this problem but no solution seems to have been found. It is a big problem, so I don't understand why nobody seems interested in it. I try to crawl over my file system but the crawl never finishes; it aborts with the message Aborting with 3 hung threads. The number of hung threads is not the same if I retry. I modified the configuration, growing the number of threads, but it doesn't solve the problem. Please could somebody help me, I can't crawl my file system.. thanks in advance. Aïcha -- View this message in context: http://www.nabble.com/Fetcher-freezes-tf2568287.html#a7158776 Sent from the Nutch - Dev mailing list archive at Nabble.com. ~~~ 101tec Inc. search tech for web 2.1 Menlo Park, California http://www.101tec.com
Re: How could I test my modify to NutchAnalysis.jj?
There is an Eclipse JavaCC plugin. It compiles your grammar and you can easily write test code. However, it has its own issues, so you may just want to generate the java files with the nutch ant script and then write unit tests against these files. HTH Stefan On 10.09.2006, at 00:49, heack wrote: I made some changes to this file (with a main func), and I want to test it. What should I do? I use ant to build, but it builds everything. Maybe I could write an ant xml to run it, but is there any easier way to do that? Thank you! ~~~ 101tec Inc. search tech for web 2.1 Menlo Park, California http://www.101tec.com
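For the second approach, the test itself can be very small. The sketch below assumes the 0.8-era entry point NutchAnalysis.parseQuery(String, Configuration) and a Query.getTerms() accessor; both are assumptions to verify against the classes your grammar actually generates:

import junit.framework.TestCase;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.NutchAnalysis;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

// Sketch of a unit test run against the JavaCC-generated parser classes
// after "ant compile". Adjust the calls to whatever your modified
// grammar actually generates.
public class TestNutchAnalysisChanges extends TestCase {

  public void testQueryParsing() throws Exception {
    Configuration conf = NutchConfiguration.create();
    Query query = NutchAnalysis.parseQuery("nutch javacc", conf);
    assertNotNull(query);
    assertEquals(2, query.getTerms().length);  // two plain terms expected
  }
}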
Re: Patch Available status?
Another alternative would be to construct a new workflow that just adds the Patch Available status and still permits issues to be re-opened. +1
Re: Missing pages anchor text
Hi Doug, I'm pretty sure that your problem is related to the deduping of your index. In general the hash of the content of a page is used as the key for the dedup tool. We also ran into the forwarding problem in another case: https://issues.apache.org/jira/browse/NUTCH-353 So maybe we should think about a general solution to the forwarding problem. Greetings, Stefan On 28.08.2006, at 11:33, Doug Cook wrote: Hi, folks, I have just started digging into relevance issues with Nutch, and I'm running into some mysteries. Before I dig too deep, I wanted to check to see if these were known issues (a quick search of the email archives and of JIRA didn't turn up anything). I'm running 0.8 with a handful of patches. I'm frequently finding root pages of sites missing from my index, despite the fact that they have been fetched. In my admittedly short investigation I have found two classes of cases: 1. Root URL is not a redirect, but there is a root-level index.html page. The index.html page is in the index, but the root page is not. Unfortunately, most of the anchor text points to the root page, not the /index.html page, and the anchor text has gone missing along with its associated page, so relevance is poor. 2. Root URL is a redirect to another page. Again, this other page is in the index, but the root page, along with its anchor text, has gone missing. I have a deduped index. Both of these cases could result from dedup throwing out the wrong URL, i.e. the one with more anchor text, although one might expect dedup to merge the two anchor texts (at least in the case of pages which commonly normalize to the same URL, e.g. / and /index.html). The second case might result from the root URL somehow being normalized to its redirect target, but in that case (incorrect, in any case) I would expect the anchor text to also be attached to the redirect target, and it is not. I'm about to rebuild with no deduping and see what I find. Thanks for your help and comments - Doug -- View this message in context: http://www.nabble.com/Missing-pages---anchor-text-tf2179049.html#a6025652 Sent from the Nutch - Dev forum at Nabble.com. ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
Re: [Nutch Wiki] Update of RunNutchInEclipse by UrosG
Hi, + You may have problems with some imports in the parse-mp3 and parse-rtf plugins. Because of incompatibility with the apache licence they were left out of the sources. You can find them here: + + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ + + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ + + You need to copy the jar files into the plugin lib path and refresh the project. Isn't the mp3 plugin deactivated? I suggest we remove it and put it in a kind of sandbox together with the jars. However, I think the sandbox has to be outside of apache. Stefan
Re: Checking if crawl dir exists ...
Hi Michi, what is your motivation for that? Stefan On 25.08.2006, at 06:52, Michael Wechner wrote: Hi I think it would be very useful if the NutchBean would check if the crawl dir exists and throw at least a warning in case it doesn't: Index: nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java === --- nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java (Revision 436787) +++ nutch-0.8/src/java/org/apache/nutch/searcher/NutchBean.java (Arbeitskopie) @@ -95,6 +95,9 @@ if (dir == null) { dir = new Path(this.conf.get("searcher.dir", "crawl")); } + if (!new java.io.File(dir.toString()).exists()) { + LOG.warn("No such directory: " + new java.io.File(dir.toString())); + } Path servers = new Path(dir, "search-servers.txt"); if (fs.exists(servers)) { if (LOG.isInfoEnabled()) { WDYT? Thanks Michi -- Michael Wechner Wyona - Open Source Content Management - Apache Lenya http://www.wyona.com http://lenya.apache.org [EMAIL PROTECTED][EMAIL PROTECTED] +41 44 272 91 61 ~~~ 101tec Inc. Menlo Park, California http://www.101tec.com
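One caveat about the patch: it checks the directory through java.io.File, i.e. on the local filesystem, while the line right below it goes through the Hadoop FileSystem handle, so the warning would fire spuriously when the crawl directory lives in DFS. A hypothetical variant of the same check using the fs handle already in scope in NutchBean:

// Hypothetical variant of the patched check: use the Hadoop FileSystem
// handle (fs) that NutchBean already holds, so the warning is also
// correct when the crawl directory lives in DFS rather than locally.
if (!fs.exists(dir)) {
  LOG.warn("No such directory: " + dir);
}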
Re: [Fwd: Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet]
Hi Renaud, I think you meant editing http://wiki.apache.org/nutch/RunNutchInEclipse , not http://wiki.apache.org/nutch/RenaudRichardet , right? Right! Sorry for the misunderstanding. I had no idea it was your personal page, so it was a bad move to edit it. :-) Thanks again for creating the debugging-nutch-within-eclipse page. Stefan
Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet
Hi Renaud, I updated your page with some more details; I hope that is ok with you. Thanks for creating it. Stefan On 23.08.2006, at 11:51, Apache Wiki wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by RenaudRichardet: http://wiki.apache.org/nutch/RenaudRichardet New page: {{{ Renaud Richardet COO America Wyona Inc. - Open Source Content Management - Apache Lenya office +1 857 776-3195 mobile +1 617 230 9112 renaud.richardet at wyona.com http://www.wyona.com }}}
Re: Junit testing, was: Re: [jira] Updated: (NUTCH-357) crawling simulation
One must also remember that proper junit testing can be used to verify functionality. There's a lot of code currently that is not guarded by unit tests, and I hereby invite everybody to participate in this endless effort and make Nutch unit tests better ;) I completely agree!!! Nutch has more bugs than ever before, since most of the .8 code was developed without tests. Stefan
[jira] Commented: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
[ http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ] Stefan Groschupf commented on NUTCH-354: Since this issue is already closed I can not attach the patch file, so I attach it as text within this comment. If you need the file, let me know and I will send you an offlist mail. Index: src/test/org/apache/nutch/crawl/TestMapWritable.java === --- src/test/org/apache/nutch/crawl/TestMapWritable.java (revision 432325) +++ src/test/org/apache/nutch/crawl/TestMapWritable.java (working copy) @@ -180,6 +180,31 @@ assertEquals(before, after); } + public void testRecycling() throws Exception { + UTF8 value = new UTF8("value"); + UTF8 key1 = new UTF8("a"); + UTF8 key2 = new UTF8("b"); + + MapWritable writable = new MapWritable(); + writable.put(key1, value); + assertEquals(writable.get(key1), value); + assertNull(writable.get(key2)); + + DataOutputBuffer dob = new DataOutputBuffer(); + writable.write(dob); + writable.clear(); + writable.put(key1, value); + writable.put(key2, value); + assertEquals(writable.get(key1), value); + assertEquals(writable.get(key2), value); + + DataInputBuffer dib = new DataInputBuffer(); + dib.reset(dob.getData(), dob.getLength()); + writable.readFields(dib); + assertEquals(writable.get(key1), value); + assertNull(writable.get(key2)); + } + public static void main(String[] args) throws Exception { TestMapWritable writable = new TestMapWritable(); writable.testPerformance(); MapWritable, nextEntry is not reset when Entries are recycled -- Key: NUTCH-354 URL: http://issues.apache.org/jira/browse/NUTCH-354 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.9.0, 0.8.1 Attachments: resetNextEntryInMapWritableV1.patch MapWritable recycles entries from its internal linked list for performance reasons. The nextEntry of an entry is not reset in case a recyclable entry is found. This can cause wrong data in a MapWritable. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Fwd: [webspam-announces] Web Spam Collection Announced
Hi, maybe some people will find this posting interesting. Webspam is one of the biggest issues for nutch whole-web crawls, from my POV. Greetings, Stefan During AIRWeb'06 we announced the availability of the collection. We are currently planning a Web Spam challenge based on the dataset we have built. I assume most of you will be interested in this, so I have moved the webspam-volunteers list to webspam-announces. If you do not want to be on this new webspam-announces list, please send me an e-mail. This was shown during AIRWeb in Seattle: . Web Spam Collection Available August 10th, 2006 We are pleased to announce the availability of a public collection for research on Web spam. This collection is the result of efforts by a team of volunteers: Thiago Alves, Antonio Gulli, Tamas Sarlos, Luca Becchetti, Zoltan Gyongyi, Mike Thelwall, Paolo Boldi, Thomas Lavergn, Belle Tseng, Paul Chirita, Alex Ntoulas, Tanguy Urvoy, Mirel Cosulschi, Josiane-Xavier Parreira, Wenzhong Zhao, Brian Davison, Xiaoguang Qi, Pascal Filoche, Massimo Santini. The corpus is a large set of Web pages in 11,000 .uk hosts downloaded in May 2006 by the Laboratory of Web Algorithmics, Università degli Studi di Milano. The labelling process was coordinated by Carlos Castillo working at the Algorithmic Engineering group at Università di Roma "La Sapienza". The project was funded by the DELIS project (Dynamically Evolving, Large Scale Information Systems). Volunteers were provided with a set of guidelines and were asked to mark a set of hosts as either normal, spam, or borderline. The collection includes about 6,700 judgments done by the volunteers and can be used for testing link-based and content-based Web spam detection and demotion techniques. More information is available on our Web page, including the guidelines given to the human judges, the instructions for obtaining the links and contents of the pages in this collection, and the contact information for questions and comments. http://aeserver.dis.uniroma1.it/webspam/ If you use this data set please subscribe to our mailing list by sending an e-mail to [EMAIL PROTECTED] -- Carlos Castillo Universita di Roma La Sapienza Rome, ITALY
[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] Stefan Groschupf commented on NUTCH-356: Hi Enrico, there will be as many PluginRepositories as there are Configuration objects. So in case you create many Configuration objects you will have a problem with memory. There is no way around having a singleton pluginrepository. However, you can reset the pluginRepository by removing the cached object from the configuration object. In any case, not caching the pluginrepository is a bad idea; think about writing your own plugin that solves your problem - that should be a cleaner solution. Would you agree to close this issue, since we will not be able to commit your changes? Stefan Plugin repository cache can lead to memory leak --- Key: NUTCH-356 URL: http://issues.apache.org/jira/browse/NUTCH-356 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Enrico Triolo Attachments: NutchTest.java, patch.txt While I was trying to solve a problem I reported a while ago (see Nutch-314), I found out that the problem was actually related to the plugin cache used in the PluginRepository.java class. As I said in Nutch-314, I think I somehow 'force' the way nutch is meant to work, since I need to frequently submit new urls and append their contents to the index; I don't (and can't) have an urls.txt file with all the urls I'm going to fetch, but I recreate it each time a new url is submitted. Thus, I think in the majority of cases you won't have problems using nutch as-is, since the problem I found occurs only if nutch is used in a way similar to mine. To simplify your test I'm attaching a class that performs something similar to what I need. It fetches and indexes some sample urls; to avoid webmasters' complaints I left the sample urls list empty, so you should modify the source code and add some urls. Then you only have to run it and watch your memory consumption with top. In my experience I get an OutOfMemoryException after a couple of minutes, but it clearly depends on your heap settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic'). The problem is bound to the PluginRepository 'singleton' instance, since it never gets released. It seems that some class maintains a reference to it, and this class is never released since it is cached somewhere in the configuration. So I modified the PluginRepository's 'get' method so that it never uses the cache and always returns a new instance (you can find the patch in the attachment). This way the memory consumption is always stable and I get no OOM anymore. Clearly this is not the solution, since I guess there are many performance issues involved, but for the moment it works. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
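The practical consequence for API users is to create one Configuration and reuse it across requests, rather than creating one per submitted url. A minimal sketch of the two patterns; the loops exist only to make the leak visible:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;

public class PluginCacheDemo {
  public static void main(String[] args) {
    // Leak pattern: every fresh Configuration gets its own cached
    // PluginRepository, and none of them is ever released.
    for (int i = 0; i < 1000; i++) {
      Configuration conf = NutchConfiguration.create();
      PluginRepository.get(conf);       // builds a new repository each time
    }

    // Fix pattern: one Configuration shared by all requests means one
    // cached PluginRepository for the lifetime of the application.
    Configuration shared = NutchConfiguration.create();
    for (int i = 0; i < 1000; i++) {
      PluginRepository.get(shared);     // same repository reused
    }
  }
}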
[jira] Created: (NUTCH-357) crawling simulation
crawling simulation --- Key: NUTCH-357 URL: http://issues.apache.org/jira/browse/NUTCH-357 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Fix For: 0.9.0 We recently discovered some serious issues related to crawling and scoring. Reproducing these problems is kind of difficult, since first of all it is not polite to re-crawl a set of pages again and again, and secondly it is difficult to catch the page that causes a problem. Therefore it would be very useful to have a testbed to simulate crawls where we can control the responses of web servers. For the very beginning, simulating very basic situations like a page pointing to itself, link chains, or internal links would already be very useful. However, later on simulating crawls against existing data collections like TREC or a webgraph would be much more interesting, for instance to calculate the quality of the nutch OPIC implementation against page rank scores of the webgraph or to evaluate crawling strategies. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-357) crawling simulation
[ http://issues.apache.org/jira/browse/NUTCH-357?page=all ] Stefan Groschupf updated NUTCH-357: --- Attachment: protocol-simulation-pluginV1.patch A very first preview of a plugin that helps to simulate crawls. This protocol plugin can be used to replace the http protocol plugin and return defined content during a fetch. To simulate custom scenarios, an interface named Simulator can be implemented with just one method. The plugin comes with a very simple basic Simulator implementation; however, this already allows simulating the nutch scoring problems known today, like pages pointing to themselves or link chains. For more details see the java doc; I plan to improve the java doc with a native speaker. Feedback is welcome. crawling simulation --- Key: NUTCH-357 URL: http://issues.apache.org/jira/browse/NUTCH-357 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Fix For: 0.9.0 Attachments: protocol-simulation-pluginV1.patch We recently discovered some serious issues related to crawling and scoring. Reproducing these problems is kind of difficult, since first of all it is not polite to re-crawl a set of pages again and again, and secondly it is difficult to catch the page that causes a problem. Therefore it would be very useful to have a testbed to simulate crawls where we can control the responses of web servers. For the very beginning, simulating very basic situations like a page pointing to itself, link chains, or internal links would already be very useful. However, later on simulating crawls against existing data collections like TREC or a webgraph would be much more interesting, for instance to calculate the quality of the nutch OPIC implementation against page rank scores of the webgraph or to evaluate crawling strategies. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
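The patch itself is not reproduced here; going by the description, the extension surface is a single-method interface, roughly of the following hypothetical shape (names and types are guesses for illustration only, the actual patch may differ):

// Hypothetical shape of the single-method Simulator interface described
// above: given a requested url, return the content the simulated web
// server should serve for it.
public interface Simulator {
  String getContentFor(String url);
}

// Trivial implementation simulating a page that links only to itself,
// one of the scoring problems mentioned in the issue.
class SelfLinkSimulator implements Simulator {
  public String getContentFor(String url) {
    return "<html><body><a href=\"" + url + "\">self</a></body></html>";
  }
}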
[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
MapWritable, nextEntry is not reset when Entries are recycled --- Key: NUTCH-354 URL: http://issues.apache.org/jira/browse/NUTCH-354 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8.1, 0.9.0 MapWritable recycles entries from its internal linked list for performance reasons. The nextEntry of an entry is not reset in case a recyclable entry is found. This can cause wrong data in a MapWritable. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ] Stefan Groschupf updated NUTCH-354: --- Attachment: resetNextEntryInMapWritableV1.patch Resets the nextEntry of a recycled entry. MapWritable, nextEntry is not reset when Entries are recycled -- Key: NUTCH-354 URL: http://issues.apache.org/jira/browse/NUTCH-354 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.9.0, 0.8.1 Attachments: resetNextEntryInMapWritableV1.patch MapWritable recycles entries from its internal linked list for performance reasons. The nextEntry of an entry is not reset in case a recyclable entry is found. This can cause wrong data in a MapWritable. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-343) Index MP3 SHA1 hashes
[ http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ] Stefan Groschupf commented on NUTCH-343: Thanks for the contribution, and also for including a test with your patch. :-) Just a small comment from taking a first look at the patch file: my personal experience is that some nutch developers have strong opinions about code formatting, so you may want to check your code formatting. :-) Index MP3 SHA1 hashes - Key: NUTCH-343 URL: http://issues.apache.org/jira/browse/NUTCH-343 Project: Nutch Issue Type: New Feature Affects Versions: 0.8, 0.9.0, 0.8.1 Reporter: Hasan Diwan Attachments: parsemp3.pat Add indexing of the mp3's sha1 hash. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
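For context on what such a patch computes, hashing a file's bytes with the JDK's MessageDigest looks roughly like this. This is a generic sketch, not the code from the attached parsemp3.pat:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

// Generic sketch: compute the SHA-1 hash of a file (e.g. an mp3) as a
// hex string, which an indexing filter could then store as a field.
public class Sha1Hash {
  public static String sha1Of(String path) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA-1");
    InputStream in = new FileInputStream(path);
    try {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        digest.update(buf, 0, n);
      }
    } finally {
      in.close();
    }
    StringBuffer hex = new StringBuffer();
    for (byte b : digest.digest()) {
      hex.append(Integer.toHexString((b >> 4) & 0xf));
      hex.append(Integer.toHexString(b & 0xf));
    }
    return hex.toString();
  }
}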
[jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Stefan Groschupf updated NUTCH-341: --- Attachment: doNotDeleteTmpIndexMergeDirV1.patch +1. I agree it makes no sense at all to require creating a tmp folder manually only for nutch to delete it afterwards with all its content. Very dangerous if a user provides / as the tmp folder. The attached patch rolls back the missing line, and I would love to ask that a developer with write access roll this in asap! THANKS! IndexMerger now deletes entire workingdir after completing Key: NUTCH-341 URL: http://issues.apache.org/jira/browse/NUTCH-341 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.8 Reporter: Chris Schneider Priority: Critical Attachments: doNotDeleteTmpIndexMergeDirV1.patch Change 383304 deleted the following line near Line 117 (see http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java?r1=383304&r2=405204&diff_format=h for details): workDir = new File(workDir, "indexmerger-workingdir"); Previously, if no -workingdir workingdir parameter was specified, IndexMerger.main() would place an indexmerger-workingdir directory into the default directory and then delete the former after completing. Now, IndexMerger.main() defaults the value of its workDir to indexmerger within the default directory, and deletes this workDir afterward. However, if -workingdir workingdir _is_ specified, IndexMerger.main() will now set workDir to _this_ path and delete the _entire_ workingdir afterward. Previously, IndexMerger.main() would only delete workingDir/indexmerger-workingdir, without deleting workingdir itself. This is because the line mentioned above always appended indexmerger-workingdir to workDir. Our hardware configuration on the jobtracker/namenode box attempts to keep all large datasets on a separate, large hard drive. Accordingly, we were keeping dfs.name.dir, dfs.data.dir, mapred.system.dir, and mapred.local.dir on this drive. Unfortunately, we were passing the folder containing these folders in the workingdir parameter to the IndexMerger. As a result, the first time we ran the IndexMerger, we ended up trashing our entire DFS! Perhaps the way that the IndexMerger handles its workingdir parameter now is an acceptable design. However, given the way it handled this parameter in the past, I feel that the current implementation is unacceptably dangerous. More importantly, perhaps there's some way that we could make hadoop more robust in handling its critical data files. I plan to place a directory owned by root with dr permissions into each of these critical directories in order to prevent any of them from suffering the fate of our DFS. This could become part of a standard hadoop installation. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Attachment: respectFetcherParsePropertyV1.patch Hi Jeremy, thanks for catching this. Attached is a fix. Should be easy for a committer to commit this to trunk. Fetcher ignores the fetcher.parse value configured in config file - Key: NUTCH-337 URL: http://issues.apache.org/jira/browse/NUTCH-337 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8, 0.9.0 Reporter: Jeremy Huylebroeck Priority: Trivial Attachments: respectFetcherParsePropertyV1.patch Using the command line call to Fetcher, if the noParsing parameter is given, everything is fine. If noParsing is not given, the value from nutch-site.xml (or nutch-default.xml) should be taken, but true is always passed to the call to fetch. It should be the value from the conf. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
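The gist of the fix: when -noParsing is absent, fall back to the configured value instead of hard-coding true. A sketch of the intent inside Fetcher.main(), not the exact patch (the fetch() signature is assumed from the 0.8 sources):

// Sketch: default to the configured fetcher.parse value and only
// override it when -noParsing is given on the command line.
boolean parsing = conf.getBoolean("fetcher.parse", true);
for (int i = 0; i < args.length; i++) {
  if ("-noParsing".equals(args[i])) {
    parsing = false;
  }
}
fetcher.fetch(segment, threads, parsing);  // pass the resolved value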
[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Priority: Major (was: Trivial) Fetcher ignores the fetcher.parse value configured in config file - Key: NUTCH-337 URL: http://issues.apache.org/jira/browse/NUTCH-337 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8, 0.9.0 Reporter: Jeremy Huylebroeck Attachments: respectFetcherParsePropertyV1.patch Using the command line call to Fetcher, if the noParsing parameter is given, everything is fine. If noParsing is not given, the value from nutch-site.xml (or nutch-default.xml) should be taken, but true is always passed to the call to fetch. It should be the value from the conf. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE
urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE -- Key: NUTCH-350 URL: http://issues.apache.org/jira/browse/NUTCH-350 Project: Nutch Issue Type: Bug Reporter: Stefan Groschupf Priority: Critical Intranet crawls or focused crawls will fetch many pages from the same host. This means a thread will often be blocked because another thread is already fetching from the same host. It is very likely that threads are blocked more often than http.max.delays allows. In such a case the HttpBase.blockAddr method throws an HttpException. This is handled in the fetcher by incrementing the crawlDatum retries and setting the status to STATUS_FETCH_RETRY. That means you have at most db.fetch.retry.max * http.max.delays chances to fetch a url. But in intranet or focused crawls it is very likely that this is not enough, and increasing one of the involved properties dramatically slows down the fetch. I suggest not increasing the CrawlDatum retriesSinceFetch in case the problem was caused by a blocked thread. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
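The suggestion expressed in code terms, roughly (illustrative only; wasBlockedByOtherThread is a hypothetical flag standing in for however the fetcher distinguishes a politeness block from a real failure):

// Only count real fetch failures against db.fetch.retry.max; a thread
// blocked by per-host politeness gets requeued without using up a retry.
if (wasBlockedByOtherThread) {
  datum.setStatus(CrawlDatum.STATUS_FETCH_RETRY);
  // retriesSinceFetch deliberately left unchanged
} else {
  datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
  datum.setStatus(CrawlDatum.STATUS_FETCH_RETRY);
}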
[jira] Updated: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE
[ http://issues.apache.org/jira/browse/NUTCH-350?page=all ] Stefan Groschupf updated NUTCH-350: --- Attachment: protocolRetryV5.patch This patch will dramatically increase the number of successfully fetched pages of an intranet crawl over time. urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE Key: NUTCH-350 URL: http://issues.apache.org/jira/browse/NUTCH-350 Project: Nutch Issue Type: Bug Reporter: Stefan Groschupf Priority: Critical Attachments: protocolRetryV5.patch Intranet crawls or focused crawls will fetch many pages from the same host. This means a thread will often be blocked because another thread is already fetching from the same host. It is very likely that threads are blocked more often than http.max.delays allows. In such a case the HttpBase.blockAddr method throws an HttpException. This is handled in the fetcher by incrementing the crawlDatum retries and setting the status to STATUS_FETCH_RETRY. That means you have at most db.fetch.retry.max * http.max.delays chances to fetch a url. But in intranet or focused crawls it is very likely that this is not enough, and increasing one of the involved properties dramatically slows down the fetch. I suggest not increasing the CrawlDatum retriesSinceFetch in case the problem was caused by a blocked thread. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages
[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ] Stefan Groschupf commented on NUTCH-322: I think this is a serious problem. Page A does a server-side redirect to Page B. Page A is never written to the output. This causes Page A never to change its state or next fetch time, which means that page A is fetched again, again, again ... ∞ I suggest that we write out Page A with a status change to STATUS_DB_GONE. Fetcher discards ProtocolStatus, doesn't store redirected pages --- Key: NUTCH-322 URL: http://issues.apache.org/jira/browse/NUTCH-322 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Reporter: Andrzej Bialecki Fix For: 0.9.0 Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus contains important information, such as the protocol-level response code, lastModified time, and possibly other messages. I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In addition, if ProtocolStatus contains a valid lastModified time, that CrawlDatum's modified time should also be set to this value. Additionally, Fetcher doesn't store redirected pages. Content of such pages is silently discarded. When Fetcher translates from protocol-level status to crawldb-level status it should probably store such pages with the following translation of status codes: * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code indicates a transient change, so we probably shouldn't mark the initial URL as bad. * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a permanent change, so the initial URL is no longer valid, i.e. it will always result in redirects. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
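Combining the suggestion above with the translation Andrzej proposes gives roughly the following shape in the fetcher's output path (illustrative only; the constants are the ones named in the issue, the variable names are not from the real code):

// Illustrative: when the protocol layer reports a redirect, write the
// original url back with a translated status so it is not refetched
// endlessly. Crucially, page A gets written out at all.
if (protocolStatusCode == ProtocolStatus.MOVED) {             // permanent
  datum.setStatus(CrawlDatum.STATUS_DB_GONE);
} else if (protocolStatusCode == ProtocolStatus.TEMP_MOVED) { // transient
  datum.setStatus(CrawlDatum.STATUS_DB_RETRY);
}
output.collect(url, datum);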
[jira] Created: (NUTCH-353) pages that serverside forwards will be refetched every time
pages that serverside forwards will be refetched every time --- Key: NUTCH-353 URL: http://issues.apache.org/jira/browse/NUTCH-353 Project: Nutch Issue Type: Bug Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8.1 Attachments: doNotRefecthForwarderPagesV1.patch Pages that do a serverside forward are not written back into the crawlDb with a status change. Also, the nextFetchTime is not changed. This causes a refetch of the same page again and again. The result is that nutch is not polite, refetching the forwarding and target page in each segment iteration. It also affects the scoring, since the forwarding page contributes its score to all outlinks. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time
[ http://issues.apache.org/jira/browse/NUTCH-353?page=all ] Stefan Groschupf updated NUTCH-353: --- Attachment: doNotRefecthForwarderPagesV1.patch Since we discussed that nutch needs to be more polite, we should fix that asap. pages that serverside forwards will be refetched every time --- Key: NUTCH-353 URL: http://issues.apache.org/jira/browse/NUTCH-353 Project: Nutch Issue Type: Bug Affects Versions: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8.1 Attachments: doNotRefecthForwarderPagesV1.patch Pages that do a serverside forward are not written back into the crawlDb with a status change. Also, the nextFetchTime is not changed. This causes a refetch of the same page again and again. The result is that nutch is not polite, refetching the forwarding and target page in each segment iteration. It also affects the scoring, since the forwarding page contributes its score to all outlinks. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ] Stefan Groschupf resolved NUTCH-322. Resolution: Duplicate duplicate of NUTCH-353 Fetcher discards ProtocolStatus, doesn't store redirected pages --- Key: NUTCH-322 URL: http://issues.apache.org/jira/browse/NUTCH-322 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8 Reporter: Andrzej Bialecki Fix For: 0.9.0 Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus contains important information, such as the protocol-level response code, lastModified time, and possibly other messages. I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In addition, if ProtocolStatus contains a valid lastModified time, that CrawlDatum's modified time should also be set to this value. Additionally, Fetcher doesn't store redirected pages. Content of such pages is silently discarded. When Fetcher translates from protocol-level status to crawldb-level status it should probably store such pages with the following translation of status codes: * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code indicates a transient change, so we probably shouldn't mark the initial URL as bad. * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a permanent change, so the initial URL is no longer valid, i.e. it will always result in redirects. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-347) Build: plugins' Jars not found
[ http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ] Stefan Groschupf commented on NUTCH-347: Please submit this patch! Thanks! Build: plugins' Jars not found -- Key: NUTCH-347 URL: http://issues.apache.org/jira/browse/NUTCH-347 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Otis Gospodnetic Attachments: nutch_build_plugins_patch.txt While building Nutch, I noticed several places where various Jars from plugins' lib directories could not be found, for example: $ ant package ... deploy: [copy] Warning: Could not find file /home/otis/dev/repos/lucene/nutch/trunk/build/lib-log4j/lib-log4j.jar to copy. init: init-plugin: compile: jar: deps-test: deploy: [copy] Warning: Could not find file /home/otis/dev/repos/lucene/nutch/trunk/build/lib-nekohtml/lib-nekohtml.jar to copy. ... The problem is, these lib-.jar files do not exist. Instead, those Jars are typically named with a version in the name, like log4j-1.2.11.jar. I could not find where this lib- prefix comes from, nor where the version is dropped from the name. Anyone knows? In order to avoid these errors I had to make symbolic links and fake things: e.g. ln -s log4j-1.2.11.jar lib-log4j.jar But this should really be fixed somewhere, I just can't see where... :( Note that this doesn't completely break the build, but missing Jars can't be a good thing. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-346) Improve readability of logs/hadoop.log
[ http://issues.apache.org/jira/browse/NUTCH-346?page=comments#action_12428917 ] Stefan Groschupf commented on NUTCH-346: +1 I agree; can you please create a patch file and attach it to this bug? Thanks Improve readability of logs/hadoop.log -- Key: NUTCH-346 URL: http://issues.apache.org/jira/browse/NUTCH-346 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: ubuntu dapper Reporter: Renaud Richardet Priority: Minor Adding log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN to conf/log4j.properties dramatically improves the readability of the logs in logs/hadoop.log (it removes all the PluginRepository INFO messages). -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated
[ http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428918 ] Stefan Groschupf commented on NUTCH-345: Shouldn't the DeflateUtils also be part of the protocol-http plugin? Also, since it is a larger contribution and not just a small bug fix, it would be great to have a junit test within the patch. Thanks for the contribution. Add support for Content-Encoding: deflated -- Key: NUTCH-345 URL: http://issues.apache.org/jira/browse/NUTCH-345 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 0.9.0 Reporter: Pascal Beis Priority: Minor Attachments: nutch-deflate.patch Add support for the deflate content-encoding, next to the already implemented GZIP content-encoding. Patch attached. See also the 'Patch: deflate encoding' thread on nutch-dev on August 7/8 2006. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
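For reference, inflating a deflate-encoded body needs only java.util.zip. A generic sketch, not the attached patch; note that some servers send raw deflate data without the zlib header, which requires new Inflater(true) instead of the default:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.InflaterInputStream;

// Generic sketch: decompress a "Content-Encoding: deflate" body.
public class DeflateDemo {
  public static byte[] inflate(byte[] compressed) throws Exception {
    InflaterInputStream in =
        new InflaterInputStream(new ByteArrayInputStream(compressed));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    in.close();
    return out.toByteArray();
  }
}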
[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8
[ http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ] Stefan Groschupf commented on NUTCH-349: My vote goes to #2. Having a tool that needs to be started manually would be better than complicating the already fragile code, from my point of view. Port Nutch to use Hadoop Text instead of UTF8 - Key: NUTCH-349 URL: http://issues.apache.org/jira/browse/NUTCH-349 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Reporter: Andrzej Bialecki Currently Nutch uses the org.apache.hadoop.io.UTF8 class to store/read Strings. This class has been deprecated in Hadoop 0.5.0, and the Text class should be used instead. Sooner or later we will need to move Nutch to use this class instead of UTF8. This raises numerous issues regarding the compatibility of existing data in CrawlDB, LinkDB and segments. I can see two ways to solve this: * add code in the readers of the respective formats to convert UTF8 to Text on the fly. New writers would only use Text. This is less than ideal, because it complicates the code, and also at some point in time the UTF8 class will be removed. * create a converter (to be maintained as long as UTF8 exists), which converts existing data in bulk from UTF8 to Text. This requires an additional processing step when upgrading, to convert all existing data to the new format. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
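Option #2 amounts to a one-shot rewrite of every SequenceFile keyed by UTF8. A rough sketch, assuming the direct SequenceFile.Reader/Writer constructors of that Hadoop generation; check the exact constructors or factory methods of your Hadoop version before relying on this, and note that a real tool would have to walk the crawldb, linkdb, and segment directories:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.io.Writable;

// Sketch of a one-shot converter: rewrite a SequenceFile keyed by the
// deprecated UTF8 class into one keyed by Text. The value class is
// passed in because it differs per data structure (CrawlDatum, ...).
public class Utf8ToTextConverter {
  public static void convert(FileSystem fs, Path in, Path out,
      Class valueClass, Configuration conf) throws Exception {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, out, Text.class, valueClass);
    UTF8 oldKey = new UTF8();
    Writable value = (Writable) valueClass.newInstance();
    while (reader.next(oldKey, value)) {
      writer.append(new Text(oldKey.toString()), value);  // re-key as Text
    }
    writer.close();
    reader.close();
  }
}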
[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ] Stefan Groschupf commented on NUTCH-233: Hi Otis, yes, for a serious whole-web crawl I need to change this regex first. It only hangs on some random urls that come, for example, from link farms the crawler runs into. wrong regular expression hang reduce process for ever - Key: NUTCH-233 URL: http://issues.apache.org/jira/browse/NUTCH-233 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.9.0 It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt wasn't compatible with java.util.regex, which is actually used in the regex url filter. Maybe it was missed when the regular expression package was changed. The problem was that while reducing a fetch map output the reducer hung forever, since the outputformat was applying the urlfilter to a url that caused the hang. 060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335) 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092) 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java: I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.) However, maybe people can review it and suggest improvements: the old regex would match abcd/foo/bar/foo/bar/foo/ and so will the new one. But the old regex would also match abcd/foo/bar/xyz/foo/bar/foo/, which the new regex will not match. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
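The difference between the two expressions can be checked concretely on the examples from the description (runnable as-is; note that feeding the old pattern a long, repetitive link-farm url can take practically forever due to catastrophic backtracking, which is exactly the reported hang):

import java.util.regex.Pattern;

// Compares the old and the new urlfilter expression on the two example
// urls from the description. Beware: the OLD pattern can take an
// astronomical time on long repetitive urls (catastrophic backtracking).
public class RegexFilterDemo {
  public static void main(String[] args) {
    Pattern oldPattern = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    Pattern newPattern = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    String a = "abcd/foo/bar/foo/bar/foo/";      // repeated /foo/bar/
    String b = "abcd/foo/bar/xyz/foo/bar/foo/";  // repetition broken by /xyz/

    System.out.println(oldPattern.matcher(a).find());  // true
    System.out.println(newPattern.matcher(a).find());  // true
    System.out.println(oldPattern.matcher(b).find());  // true
    System.out.println(newPattern.matcher(b).find());  // false - stricter
  }
}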
[jira] Updated: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs
[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ] Stefan Groschupf updated NUTCH-348: --- Attachment: sortPatchV1.patch What do people think about this kind of solution? Generator is building fetch list using *lowest* scoring URLs Key: NUTCH-348 URL: http://issues.apache.org/jira/browse/NUTCH-348 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Chris Schneider Attachments: sortPatchV1.patch Ever since revision 391271, when the CrawlDatum key was replaced by a FloatWritable key, the Generator.Selector.reduce method has been outputting the *lowest* scoring URLs! The CrawlDatum class has a Comparator that essentially treats higher scoring CrawlDatum objects as less than lower scoring CrawlDatum objects, so the higher scoring ones would appear first in a sequence file sorted using this as the key. When a FloatWritable based on the score itself (as returned from scfilters.generatorSortValue) became the sort key, it should have been negated in Generator.Selector.map to achieve the same result. Curiously, there is a comment to this effect immediately before the FloatWritable is set: // sort by decreasing score sortValue.set(sort); It seems like the simplest way to fix this is to just negate the score, and this seems to work for me: // sort by decreasing score // 2006-08-15 CSc REALLY sort by decreasing score sortValue.set(-sort); Unfortunately, this means that any crawls that have been done using Generator.java after revision 391271 should be discarded, as they were focused on fetching the lowest scoring unfetched URLs in the crawldb, essentially pointing the crawler 180 degrees from its intended direction. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-332) doubling score causes by page internal anchors.
doubling score causes by page internal anchors. --- Key: NUTCH-332 URL: http://issues.apache.org/jira/browse/NUTCH-332 Project: Nutch Issue Type: Bug Affects Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8-dev When a page has no outlinks but several links to itself, e.g. it has a set of anchors, the score of the page is distributed to its outlinks. But all these outlinks point back to the page. This causes the page score to be doubled. I'm not sure, but this may also cause a never-ending fetching loop for this page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer line 107. So the status fetched may be overwritten with unfetched. In such a case we fetch the page again every time and also double the score of this page every time, which causes very high scores without any reason. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
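One fix direction implied by the description is to skip self-referential outlinks when distributing score. An illustrative fragment; contribute() and the surrounding variables are hypothetical placeholders, only Outlink.getToUrl() is the real accessor:

// Illustrative: when distributing a page's score to its outlinks, skip
// links that point back to the page itself (in-page anchors normalize
// to the page url), so a page cannot feed score back to itself.
float share = pageScore / outlinks.length;
for (Outlink link : outlinks) {
  if (link.getToUrl().equals(pageUrl)) {
    continue;  // self-link: do not contribute score back to the page
  }
  contribute(link.getToUrl(), share);  // contribute() is a placeholder
}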
[jira] Commented: (NUTCH-318) log4j not properly configured, readdb doesn't give any information
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ] Stefan Groschupf commented on NUTCH-318: Yes, this happens only in a distributed environment. Please also see my last mail on the hadoop-dev list. I think there are more general logging problems that only occur in a distributed environment, so you will not track them down using the local runner.

log4j not properly configured, readdb doesn't give any information

Key: NUTCH-318
URL: http://issues.apache.org/jira/browse/NUTCH-318
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.9-dev

In the latest 0.8 sources the readdb command doesn't dump any information anymore. This is related to the misconfigured log4j.properties file. Changing

    log4j.rootLogger=INFO,DRFA

to

    log4j.rootLogger=INFO,DRFA,stdout

dumps the information to the console, but not in a nice way. What makes me wonder is that this information should also be in the log file, but it isn't, so there may be further problems here. Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?
[jira] Commented: (NUTCH-318) log4j not properly configured, readdb doesn't give any information
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ] Stefan Groschupf commented on NUTCH-318: Shouldn't that be fixed in 0.8, since as of today this tool just produces no output?!

log4j not properly configured, readdb doesn't give any information

Key: NUTCH-318
URL: http://issues.apache.org/jira/browse/NUTCH-318
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.9-dev

In the latest 0.8 sources the readdb command doesn't dump any information anymore. This is related to the misconfigured log4j.properties file. Changing log4j.rootLogger=INFO,DRFA to log4j.rootLogger=INFO,DRFA,stdout dumps the information to the console, but not in a nice way. What makes me wonder is that this information should also be in the log file, but it isn't, so there may be further problems here. Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?
[jira] Commented: (NUTCH-233) wrong regular expression hangs reduce process forever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ] Stefan Groschupf commented on NUTCH-233: I think this should be fixed in 0.8 too, since everybody who does a real whole-web crawl with over 100 million pages will run into this problem. The problematic URLs come, for example, from spam-bot-generated pages.

wrong regular expression hangs reduce process forever

Key: NUTCH-233
URL: http://issues.apache.org/jira/browse/NUTCH-233
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.9-dev

It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt is not compatible with java.util.regex, which is what the regex URL filter actually uses. It was probably missed when the regular expression package was changed. The symptom was that the reducer of a fetch job hung forever, because the output format applied the URL filter to a URL that triggered the hang:

060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java:

I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the fetch job works (thanks to Grant and Chris B. for helping to find the new regex). However, it would be good if people reviewed it and suggested improvements. Both the old and the new regex match abcd/foo/bar/foo/bar/foo/, but the old regex also matches abcd/foo/bar/xyz/foo/bar/foo/, which the new one does not.
Re: segread vs. readseg
I like it! On 24.07.2006, at 16:10, Andrzej Bialecki wrote: Stefan Neufeind wrote: Andrzej Bialecki wrote: Stefan Groschupf wrote: Hi developers, we have commands like readdb and readlinkdb, but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just a thought. Yes, it seems more consistent. However, if we change it then scripts people wrote would break. We could support both aliases in 0.8 and give a deprecation message. What do others think? Same feeling here. Agreed. What about the following?

Index: bin/nutch
===
--- bin/nutch (revision 424960)
+++ bin/nutch (working copy)
@@ -40,7 +40,7 @@
   echo generate    generate new segments to fetch
   echo fetch       fetch a segment's pages
   echo parse       parse a segment's pages
-  echo segread     read / dump segment data
+  echo readseg     read / dump segment data
   echo mergesegs   merge several segments, with optional filtering and slicing
   echo updatedb    update crawl db from segments after fetching
   echo invertlinks create a linkdb from parsed segments
@@ -158,7 +158,10 @@
   CLASS=org.apache.nutch.crawl.CrawlDbMerger
 elif [ $COMMAND = readlinkdb ] ; then
   CLASS=org.apache.nutch.crawl.LinkDbReader
+elif [ $COMMAND = readseg ] ; then
+  CLASS=org.apache.nutch.segment.SegmentReader
 elif [ $COMMAND = segread ] ; then
+  echo [DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead.
   CLASS=org.apache.nutch.segment.SegmentReader
 elif [ $COMMAND = mergesegs ] ; then
   CLASS=org.apache.nutch.segment.SegmentMerger

-- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web, Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
result comparison tool?
Hi, I remember there was a search result comparison tool within nutch. Is that still alive? How do I use it / find it? I was not able to find it by browsing the trunk sources. Is there any such tool people can suggest for comparing search results with Yahoo or Google results, to play with configuration properties and scoring mechanisms? Thanks for any hints. Stefan
nutch-extensionpoints not in plugin.includes
Hi developers, in the nutch-default.xml property plugin.includes we say: "In any case you need at least include the nutch-extensionpoints plugin." But we do not include it by default:

    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>

We should either update the text or include the plugin; anything else may confuse users. Should I open a bug, or can someone with write access just jump in and fix that? Thanks, Stefan
Re: nutch-extensionpoints not in plugin.includes
I may - but since you know the details of the plugin subsystem, tell me what _should_ be there? I.e. should we really include it in the plugin.includes list, or not? This is a philosophical question. I personally prefer strict definitions, since the application's behavior is better traceable. That was a reason I implemented the plugin system in a strict way. Later on this was washed out by the plugin auto-activation mechanism, which I still think was not a good move. However, at the moment we have the situation that nutch-extensionpoints is not included, but the auto-activation mechanism includes this plugin since it is used by all other plugins. So if you switch off auto-activation today with the default configured plugin.includes, nutch will crash. My personal point of view is to add nutch-extensionpoints and switch off auto-activation... but this is just my personal point of view... Stefan
[jira] Created: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes
UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

Key: NUTCH-325
URL: http://issues.apache.org/jira/browse/NUTCH-325
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
Fix For: 0.8-dev

In the URLFilters constructor we use an array as long as the number of filters defined in the urlfilter.order property. In case those filters are not included in the plugin.includes property, we end up putting null entries into the array. This causes an NPE in URLFilters line 82.
[jira] Updated: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes
[ http://issues.apache.org/jira/browse/NUTCH-325?page=all ] Stefan Groschupf updated NUTCH-325: Attachment: UrlFiltersNPE.patch A patch that uses an ArrayList instead of an array and only puts entries into the list when the entry is not null. This means only URL filters that were actually loaded are stored in the filters array that is cached in the Configuration object.

UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

Key: NUTCH-325
URL: http://issues.apache.org/jira/browse/NUTCH-325
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
Fix For: 0.8-dev
Attachments: UrlFiltersNPE.patch

In the URLFilters constructor we use an array as long as the number of filters defined in the urlfilter.order property. In case those filters are not included in the plugin.includes property, we end up putting null entries into the array. This causes an NPE in URLFilters line 82.
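The gist of the patch, as a hedged sketch (the real URLFilters constructor differs in detail; the interface below stands in for org.apache.nutch.net.URLFilter):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hedged sketch: collect only filters that actually resolved, instead of
    // pre-sizing an array from urlfilter.order and leaving null slots behind.
    interface URLFilter { String filter(String url); }

    class URLFiltersSketch {
      static URLFilter[] build(String[] orderedNames, Map<String, URLFilter> loaded) {
        List<URLFilter> filters = new ArrayList<URLFilter>();
        for (String name : orderedNames) {
          URLFilter f = loaded.get(name);  // null if not in plugin.includes
          if (f != null) {
            filters.add(f);                // skip missing filters -> no NPE later
          }
        }
        return filters.toArray(new URLFilter[filters.size()]);
      }
    }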
log when blocked by robots.txt
Hi Developers, another thing in the discussion about being more polite. I suggest that we log a message in case a requested URL was blocked by a robots.txt. Optimal would be to log this message only in case the currently used agent name is specifically blocked, and not when it is a general blocking of all agents. Should I create a patch? Stefan
[jira] Updated: (NUTCH-323) CrawlDatum.set just references the MapWritable of another object but does not copy it
[ http://issues.apache.org/jira/browse/NUTCH-323?page=all ] Stefan Groschupf updated NUTCH-323: Attachment: MapWritableCopyConstructor.patch The attached patch adds a copy constructor to MapWritable and uses it in the CrawlDatum.set method. There are more places in the code where metadata is passed from one CrawlDatum to another, but I don't see any risk of concurrent usage of the MapWritable there.

CrawlDatum.set just references the MapWritable of another object but does not copy it

Key: NUTCH-323
URL: http://issues.apache.org/jira/browse/NUTCH-323
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.8-dev
Attachments: MapWritableCopyConstructor.patch

Using CrawlDatum.set(aOtherCrawlDatum) copies the data from one CrawlDatum to another. But only a reference to the MapWritable is passed, which means both objects share the same MapWritable and its content. This causes problems with concurrently manipulated MapWritables and their key-value tuples.
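The shape of the fix, sketched below. Note the assumption: Nutch 0.8 shipped its own MapWritable class; the sketch uses the later org.apache.hadoop.io.MapWritable as a stand-in, so treat it as illustrative rather than the actual patch:

    import java.util.Map;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Writable;

    // Hedged sketch: take a private copy of the metadata in CrawlDatum.set()
    // instead of sharing one MapWritable instance between two datums.
    class MapWritableCopySketch {
      static MapWritable copyOf(MapWritable original) {
        MapWritable copy = new MapWritable();
        for (Map.Entry<Writable, Writable> e : original.entrySet()) {
          // a shallow copy of the entries is enough to decouple the two maps
          copy.put(e.getKey(), e.getValue());
        }
        return copy;
      }
    }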
[jira] Created: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored
db.score.link.internal and db.score.link.external are ignored

Key: NUTCH-324
URL: http://issues.apache.org/jira/browse/NUTCH-324
Project: Nutch
Issue Type: Improvement
Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical

The configuration properties db.score.link.external and db.score.link.internal are ignored. For message-board web pages, or pages that have large navigation menus on each page, giving internal links a lower impact makes a lot of sense for scoring. This is also a serious problem for web spam, since spammers can set up just one domain with dynamically generated pages and thereby heavily manipulate the nutch scores. So I also suggest we give db.score.link.internal a default value of something like 0.25.
[jira] Updated: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored
[ http://issues.apache.org/jira/browse/NUTCH-324?page=all ] Stefan Groschupf updated NUTCH-324: Attachment: InternalAndExternalLinkScoreFactor.patch The patch multiplies the score of a page during distributeScoreToOutlink with db.score.link.internal or db.score.link.external.

db.score.link.internal and db.score.link.external are ignored

Key: NUTCH-324
URL: http://issues.apache.org/jira/browse/NUTCH-324
Project: Nutch
Issue Type: Improvement
Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical
Attachments: InternalAndExternalLinkScoreFactor.patch

The configuration properties db.score.link.external and db.score.link.internal are ignored. For message-board web pages, or pages that have large navigation menus on each page, giving internal links a lower impact makes a lot of sense for scoring. This is also a serious problem for web spam, since spammers can set up just one domain with dynamically generated pages and thereby heavily manipulate the nutch scores. So I also suggest we give db.score.link.internal a default value of something like 0.25.
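In spirit, the patch does something like the following (a hedged sketch; the property names are the real ones from nutch-default.xml, the surrounding code is simplified and not the actual plugin):

    import org.apache.hadoop.conf.Configuration;

    // Hedged sketch: weight the score passed to an outlink by whether the
    // link stays on the same host.
    class LinkScoreSketch {
      static float outlinkScore(Configuration conf, float pageScore,
                                String fromHost, String toHost) {
        float internal = conf.getFloat("db.score.link.internal", 1.0f);
        float external = conf.getFloat("db.score.link.external", 1.0f);
        boolean sameHost = fromHost.equalsIgnoreCase(toHost);
        return pageScore * (sameHost ? internal : external);
      }
    }

With db.score.link.internal set to the suggested 0.25, self-hosted link farms would pass on only a quarter of the page's score per internal link, while cross-host links keep their full weight.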
[jira] Resolved: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace
[ http://issues.apache.org/jira/browse/NUTCH-319?page=all ] Stefan Groschupf resolved NUTCH-319. Resolution: Won't Fix Sorry, that is bogus, since it is written to the logging stream.

OPICScoringFilter should use logging API instead of printStackTrace

Key: NUTCH-319
URL: http://issues.apache.org/jira/browse/NUTCH-319
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Assigned To: Andrzej Bialecki
Priority: Trivial
Fix For: 0.8-dev

OPICScoringFilter line 107 should be a log call, not e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?
db.max.inlinks
Hi, shouldn't db.max.inlinks be in the nutch-default.xml configuration? Stefan
OPICScoringFilter Metadata transport scores as String
Hi, OPICScoringFilter line 91:

    content.getMetadata().set(Fetcher.SCORE_KEY, "" + datum.getScore());

and in lines 96 and 102 we set and get the fetch score as Strings. :-o Wouldn't it be better to have Metadata support floats as well, instead of serializing and parsing strings? In general, wouldn't it be a good idea to have Metadata as a child of MapWritable? OO design? Any thoughts? Stefan
[jira] Created: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace
OPICScoringFilter should use logging API instead of printStackTrace

Key: NUTCH-319
URL: http://issues.apache.org/jira/browse/NUTCH-319
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Assigned To: Andrzej Bialecki
Priority: Trivial
Fix For: 0.8-dev

OPICScoringFilter line 107 should be a log call, not e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?
Re: [Nutch-dev] Crawl error
As mentioned, set the environment variables that bin/nutch sets for eclipse as well, especially the logging-related variables! On 10.07.2006, at 00:05, AJ Chen wrote: My classpath has the conf folder. NUTCH_JAVA_HOME is set. In fact, nutch 0.7.1 is working well from my eclipse. I suspect the error comes from changes in version 0.8. The problem is that the log message does not say which file was not found, so it's hard to debug. Any idea? Thanks, AJ On 7/9/06, Stefan Groschupf wrote: Try to put the conf folder on your classpath in eclipse and set the environment variables that are set in bin/nutch. Btw, please do not crosspost. Thanks. Stefan On 09.07.2006, at 21:47, AJ Chen wrote: I checked out the 0.8 code from trunk and tried to set it up in eclipse. When trying to run Crawl from Eclipse using the args "urls -dir crawl -depth 3 -topN 50", I got the following error, which started from LogFactory.getLog(Crawl.class). Any idea what file was not found? There is a url file under the directory urls. Thanks,

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
    at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:38)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].

-AJ
[jira] Created: (NUTCH-318) log4j not properly configured, readdb doesn't give any information
log4j not properly configured, readdb doesn't give any information

Key: NUTCH-318
URL: http://issues.apache.org/jira/browse/NUTCH-318
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.8-dev

In the latest 0.8 sources the readdb command doesn't dump any information anymore. This is related to the misconfigured log4j.properties file. Changing log4j.rootLogger=INFO,DRFA to log4j.rootLogger=INFO,DRFA,stdout dumps the information to the console, but not in a nice way. What makes me wonder is that this information should also be in the log file, but it isn't, so there may be further problems here. Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?
Re: [Nutch-dev] Crawl error
Try to put the conf folder on your classpath in eclipse and set the environment variables that are set in bin/nutch. Btw, please do not crosspost. Thanks. Stefan On 09.07.2006, at 21:47, AJ Chen wrote: I checked out the 0.8 code from trunk and tried to set it up in eclipse. When trying to run Crawl from Eclipse using the args "urls -dir crawl -depth 3 -topN 50", I got the following error, which started from LogFactory.getLog(Crawl.class). Any idea what file was not found? There is a url file under the directory urls. Thanks,

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
    at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:38)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].

-AJ
Re: Nutch based directory and crawler based on keyword
Hi, this question is difficult to answer, and there may be more experts on the nutch user list than on the developer list. In nutch 0.8 you can use the new scoring API to change the scoring of a page for being scheduled for crawling, based on its scores. Have a look at the OPIC score plugin and at the CrawlDatum metadata. The metadata can be used to transport information like custom category weighting scores that take effect in the CrawlDatum score calculation. Attention: this is not scoring at search time, this is scoring for crawl scheduling. Besides that, maybe the simplest way is to write an index plugin that tags a page (keywordMatch:true / false) depending on whether a keyword occurs or not. During search you then extend the search string behind the scenes with something like: yourSearchString + keywordMatch:true. Stefan On 08.07.2006, at 07:03, Syed Kamran Ali wrote: Hi, I have successfully configured nutch 0.7.2. Ran the crawler a few times, all working fine. Now I wanted to know whether there is a way to run the crawler so that it indexes a website only if it finds a certain keyword in it. Also, after the index is created, is it possible to create a categorized directory, like the yahoo and google directories? -- Thanks Kamran
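A hedged sketch of the index-plugin approach described above (the field name and keyword are invented; the real 0.8-era IndexingFilter interface takes more arguments, so treat this as a loose illustration using the Lucene API of that time):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Hedged sketch: tag each page with keywordMatch:true/false so queries
    // can be silently extended with "keywordMatch:true".
    class KeywordMatchFilter {
      private static final String KEYWORD = "nutch"; // hypothetical keyword

      public Document filter(Document doc, String pageText) {
        boolean match = pageText != null
            && pageText.toLowerCase().indexOf(KEYWORD) >= 0;
        // store untokenized so "keywordMatch:true" matches exactly
        doc.add(new Field("keywordMatch", Boolean.toString(match),
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
      }
    }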
Re: Error with Hadoop-0.4.0
Hi Jérôme, I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug. We should fix that. Stefan On 06.07.2006, at 08:54, Jérôme Charron wrote: Hi, I encountered some problems with the Nutch trunk version. In fact it seems to be related to changes for Hadoop-0.4.0 and JDK 1.5 (more precisely, since HADOOP-129 and the replacement of File by Path). In my environment, the crawl command terminates with the following error:

2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory /localpath/crawl/crawldb/current in local is invalid.
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

By looking at the Nutch code, and simply changing line 145 of the Injector to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)), all is working fine. Taking a closer look at the CrawlDb code, I finally don't understand why the following line is in the createJob method: job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME)); Out of curiosity, maybe a hadoop guru can explain why there is such a regression... Does somebody have the same error? Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: Error with Hadoop-0.4.0
We tried your suggested fix in the Injector: mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)), and it worked without any problem. Thanks for catching that; this saved us a lot of time. Stefan On 07.07.2006, at 16:08, Jérôme Charron wrote: I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug. Thanks for this feedback, Stefan. We should fix that. What I suggest is simply to remove line 75 in the createJob method of CrawlDb: job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME)); In fact, this method is only used by Injector.inject() and CrawlDb.update(), and the input path set in createJob is needed neither by Injector.inject() nor by CrawlDb.update(). If there is no objection, I will commit this change tomorrow. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: 0.8 release
+1, but I really would love to see NUTCH-293 as part of nutch 0.8, since this is all about being more polite. Thanks. Stefan On 05.07.2006, at 03:46, Doug Cutting wrote: +1 Piotr Kosiorowski wrote: +1. P. Andrzej Bialecki wrote: Sami Siren wrote: How would folks feel about releasing 0.8 now? There have been quite a lot of improvements and new features since the 0.7 series, and I strongly feel we should push the first 0.8 series release (alpha/beta) out the door now. It would IMO lower the barrier for first-timers to try the 0.8 series, and that would give us more feedback about the overall quality. Definitely +1. Let's do some testing, however, after the upgrade to hadoop 0.3.2 - hadoop had many, many changes, so we just need to make sure it's stable when used with Nutch... We should also check JIRA and apply any trivial fixes before the release. If there is a consensus about this, I can volunteer to be the RM. That would be great, thanks!
noindex / do not index
Hi, as far as I can see, nutch's html parser only supports the meta tag noindex (<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">), but there is an unofficial html noindex tag: http://www.webmasterworld.com/forum10003/2703.htm Maybe this would be another thing to make nutch more polite. Also, please remember my patch to support the crawl-delay property in robots.txt. That would also be something important to make nutch more polite, and maybe a better way than removing the nutch crawler identification. Thoughts? Stefan
Re: how to manipulate with MapWritable metaData in CrawlDatum structure
Hi Feng, MapWritable is a kind of hashmap. You can put in any key-value pair, but the keys and values need to be Writables: http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/Writable.html You can use UTF8 as string key and value, or ByteWritable as key and UTF8 as value, etc. Does this answer your question? Stefan On 12.06.2006, at 04:15, Feng Ji wrote: hi, I wonder how to use the MapWritable metaData in CrawlDatum.java. The API gives us some function calls, but I still don't know how to put information (String) into metaData and retrieve it, or how to convert a MapWritable variable to other types like MetaData or String. Any good samples in Nutch's java classes? thanks, Feng
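For example, a minimal sketch of the pattern Stefan describes (note the assumption: Nutch 0.8 used its own MapWritable class; the later org.apache.hadoop.io.MapWritable stands in for it here, and UTF8 was the era's string Writable, later replaced by Text):

    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.UTF8;
    import org.apache.hadoop.io.Writable;

    // Hedged sketch: writing and reading a string key/value pair in a
    // CrawlDatum-style metadata map.
    public class MetaDataExample {
      public static void main(String[] args) {
        MapWritable meta = new MapWritable();
        meta.put(new UTF8("category"), new UTF8("sports"));  // store
        Writable value = meta.get(new UTF8("category"));     // retrieve
        System.out.println(value);                            // prints: sports
      }
    }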
Re: nutch-default.xml configuration
Hi Lourival, this means all pages older than 30 days are potential candidates for a fetch list created by the segment generation process. Stefan On 12.06.2006, at 16:33, Lourival Júnior wrote: Hi all! I have a question about the nutch-default.xml configuration file. There is a parameter db.default.fetch.interval that is set by default to 30. It means that pages from the webdb are recrawled every 30 days. http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02058.html I want to know whether "recrawled" here means automatic recrawl, or whether I have to execute some shell script before this period to make updates to my WebDB possible. I really want to know this because so far I have not obtained an update in fact. Thanks a lot! -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: nutch-default.xml configuration
Ok. So, do you have any solution to do this job automatically? I have a shell script, but I don't know if it really works yet. Shell scripts are the best solution. Sorry if I'm being redundant. I'm learning about this tool and I have a lot of questions :). No problem, but the nutch user mailing list would be a better list for such questions. Thanks! Stefan Thanks! On 6/12/06, Dima Mazmanov wrote: Hi, Lourival. You wrote on 12 June 2006, 19:33:15: Hi all! I have a question about the nutch-default.xml configuration file. There is a parameter db.default.fetch.interval that is set by default to 30. It means that pages from the webdb are recrawled every 30 days. http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02058.html I want to know whether "recrawled" here means automatic recrawl, or whether I have to execute some shell script before this period to make updates to my WebDB possible. I really want to know this because so far I have not obtained an update in fact. Thanks a lot! You have to recrawl the db manually. -- Regards, Dima mailto:[EMAIL PROTECTED] -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
[jira] Updated: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: Attachment: ipInCrawlDatumDraftV5.patch Release candidate 1 of this patch. This patch contains:
+ adding the IP address to CrawlDatum version 5 (as byte[4])
+ an IpAddressResolver (MapRunnable) tool to look up the IPs multithreaded
+ a property to define whether the IpAddressResolver should be started as part of the crawlDb update tool, to update the parse output folder of a segment (which contains CrawlDatum status linked) before updating the crawlDb
+ use of the cached IP during generation
Please review this patch and give me any improvement suggestions. I think this is a very important issue, since it helps to do _real_ whole-web crawls and not end up in a honey pot after some fetch iterations. Also, if you like, please vote for this issue. :-) Thanks.

CrawlDatum should store IP address

Key: NUTCH-289
URL: http://issues.apache.org/jira/browse/NUTCH-289
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Reporter: Doug Cutting
Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.patch

If the CrawlDatum stored the IP address of its URL's host, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.
The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDb update.
[jira] Updated: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: Attachment: ipInCrawlDatumDraftV4.patch Attached is a patch that only ever uses 4 bytes for the IP. This means we ignore IPv6, which saves us 4 bytes in each CrawlDatum for now. I tested the resolver tool with a 200++ million crawldb, and on average a throughput of 500 IP lookups/sec per box is possible using 1000 threads. I really would love to get this into the sources as the basic version of having the IP address in the CrawlDatum, since I'm working on a tool set of spam detectors that all need IP addresses somehow. Maybe we exclude the tool but start with the CrawlDatum change? :-? Any improvement suggestions? Thanks.

CrawlDatum should store IP address

Key: NUTCH-289
URL: http://issues.apache.org/jira/browse/NUTCH-289
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Reporter: Doug Cutting
Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch

If the CrawlDatum stored the IP address of its URL's host, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.
The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDb update.
[jira] Created: (NUTCH-302) java doc of CrawlDb is wrong
java doc of CrawlDb is wrong

Key: NUTCH-302
URL: http://issues.apache.org/jira/browse/NUTCH-302
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Priority: Trivial
Fix For: 0.8-dev

CrawlDb has the same java doc as Injector.
[jira] Updated: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query
[ http://issues.apache.org/jira/browse/NUTCH-301?page=all ] Stefan Groschupf updated NUTCH-301: Attachment: CommonGramsCacheV1.patch Caches the HashMap COMMON_TERMS in the Configuration instance.

CommonGrams loads analysis.common.terms.file for each query

Key: NUTCH-301
URL: http://issues.apache.org/jira/browse/NUTCH-301
Project: Nutch
Type: Improvement
Components: searcher
Versions: 0.8-dev
Reporter: Chris Schneider
Attachments: CommonGramsCacheV1.patch

The move away from static objects toward instance variables has resulted in the CommonGrams constructor parsing its analysis.common.terms.file for each query. I'm not certain how large a performance impact this really is, but it seems like something you'd want to avoid doing for each query. Perhaps the solution is to keep around an instance of the CommonGrams object itself?
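The shape of the caching idea, sketched. Note the assumptions: the era's Configuration offered setObject/getObject for per-instance caching (the same pattern NUTCH-325 mentions for the filters array), and the parse method below is a placeholder, not the real CommonGrams code:

    import java.util.HashMap;
    import org.apache.hadoop.conf.Configuration;

    // Hedged sketch: parse analysis.common.terms.file once per Configuration
    // and stash the result, instead of re-reading the file on every query.
    class CommonTermsCache {
      @SuppressWarnings("unchecked")
      static HashMap<String, String> getCommonTerms(Configuration conf) {
        HashMap<String, String> terms =
            (HashMap<String, String>) conf.getObject("commongrams.common.terms");
        if (terms == null) {
          terms = parseTermsFile(conf);   // expensive: reads and parses the file
          conf.setObject("commongrams.common.terms", terms);
        }
        return terms;
      }

      private static HashMap<String, String> parseTermsFile(Configuration conf) {
        return new HashMap<String, String>(); // placeholder for the real parse
      }
    }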
[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt
[ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415171 ] Stefan Groschupf commented on NUTCH-293: Any comments? There was already a posting on the nutch-agent mailing list where someone had banned nutch because nutch does not support crawl-delay. Because nutch tries to be polite, from my point of view this is a small but important change. If there are no improvement suggestions, can one of the committers take care of that, _please_? :-)

support for Crawl-delay in Robots.txt

Key: NUTCH-293
URL: http://issues.apache.org/jira/browse/NUTCH-293
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Attachments: crawlDelayv1.patch

Nutch needs support for Crawl-delay as defined in robots.txt; it is not an official standard, but a de-facto standard. See: http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html Webmasters have started blocking nutch since we do not support it.
resolving IP in...
Hi, after playing around to figure out the best place to resolve the IPs of freshly discovered urls, I agree with Andrzej that the ParseOutputFormat isn't the best place. The problem here: ParseOutputFormat is not multithreaded, and we definitely need many threads for IP lookup. I think an IP-resolving MapRunnable that preprocesses segment data (after fetching) before the crawldb update would be a better place:
+ less data to process (as opposed to processing a complete crawldb)
+ good dns cache usage, since many new urls will point to the same host (internal links)
- we may look up urls we already have in the crawldb
Any thoughts? Stefan
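A minimal sketch of the lookup-with-shared-cache pattern such a MapRunnable would use (thread count, cache type, and the unresolved marker are illustrative choices, not from the actual patch):

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.Hashtable;
    import java.util.Map;

    // Hedged sketch: resolve hosts from many threads against a shared cache.
    // Hashtable gives cheap thread safety; internal links hit the cache.
    class IpResolverSketch {
      private final Map<String, byte[]> cache = new Hashtable<String, byte[]>();

      byte[] resolve(String host) {
        byte[] ip = cache.get(host);
        if (ip == null) {
          try {
            ip = InetAddress.getByName(host).getAddress(); // blocking DNS lookup
          } catch (UnknownHostException e) {
            ip = new byte[4];                              // unresolved marker
          }
          cache.put(host, ip); // later URLs on the same host skip the lookup
        }
        return ip;
      }
    }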
[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt
[ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415236 ] Stefan Groschupf commented on NUTCH-293: Hi Andrzej, I agree, but writing a queue-based fetcher is a big step. I already have some basic code (nio based). Also, I don't think a new fetcher would be stable enough to put into a 0.8 release. Since we plan to have a 0.8 release, I think it is a good idea for now to add this functionality. Maybe we make it configurable and switch it off by default? In any case I suggest we solve NUTCH-289 first and then get the fetcher done.

support for Crawl-delay in Robots.txt

Key: NUTCH-293
URL: http://issues.apache.org/jira/browse/NUTCH-293
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Attachments: crawlDelayv1.patch

Nutch needs support for Crawl-delay as defined in robots.txt; it is not an official standard, but a de-facto standard. See: http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html Webmasters have started blocking nutch since we do not support it.
Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar
As far as I understand, hadoop uses commons logging. Should we switch to using commons logging as well? On 06.06.2006, at 11:02, Jérôme Charron wrote: URL: http://svn.apache.org/viewvc?rev=411943&view=rev Log: Updating to Hadoop release 0.3.1. Hadoop now uses Jakarta Commons Logging, configured for log4j by default. If log4j is now included in the core, we can remove the lib-log4j plugin. If no objection, I will do it. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] Stefan Groschupf commented on NUTCH-258: Scott, I agree with you. However, we need a clean patch to solve the problem; we cannot just comment things out of the code. So I vote for the issue and I vote to reopen this issue.

Once Nutch logs a SEVERE log item, Nutch fails forevermore

Key: NUTCH-258
URL: http://issues.apache.org/jira/browse/NUTCH-258
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: All
Reporter: Scott Ganyo
Priority: Critical
Attachments: dumbfix.patch

Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. This is from the run() method in Fetcher.java:

    public void run() {
      synchronized (Fetcher.this) {activeThreads++;} // count threads
      try {
        UTF8 key = new UTF8();
        CrawlDatum datum = new CrawlDatum();
        while (true) {
          if (LogFormatter.hasLoggedSevere())   // something bad happened
            break;                              // exit

Notice the last two lines. This will prevent Nutch from ever fetching again once this is hit, as LogFormatter stores this data in a static. (Also note that LogFormatter.hasLoggedSevere() is also checked in org.apache.nutch.net.URLFilterChecker and will disable this class as well.) This must be fixed or Nutch cannot be run as any kind of long-running service. Furthermore, I believe it is a poor decision to rely on a logging event to determine the state of the application - this could have any number of side effects that would be extremely difficult to track down. (As it already has for me.)
[jira] Updated: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: Attachment: ipInCrawlDatumDraftV1.patch To keep the discussion alive, attached is a _first draft_ for storing the IP in the CrawlDatum, for public discussion. Some notes: the IP is stored as byte[] in the CrawlDatum itself, not in the metadata. There is an IpAddressResolver MapRunnable tool to update a crawlDb using multithreaded IP lookups. In case an IP is available in the CrawlDatum, the Generator uses the cached IP. To discuss: I don't like the idea of post-processing the complete crawlDb every time after an update; processing the crawlDb is expensive in storage usage and time. We could have a property ipLookups with the possible values never|duringParsing|postUpdateDb. Then we could also add some code to look up the IP in the ParseOutputFormat, as discussed, or start the IpAddressResolver as a job in the updateDb tool code. At the moment I write the IP address bytes like this:

    out.writeInt(ipAddress.length);
    out.write(ipAddress);

I think for now we can define that byte[] ipAddress is always 4 bytes long - or should we be IPv6-compatible today? Please give me some comments; I have a strong interest in getting this issue fixed asap and I'm willing to improve things as required. :-)

CrawlDatum should store IP address

Key: NUTCH-289
URL: http://issues.apache.org/jira/browse/NUTCH-289
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Reporter: Doug Cutting
Attachments: ipInCrawlDatumDraftV1.patch

If the CrawlDatum stored the IP address of its URL's host, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.
The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDb update.
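With the fixed 4-byte IPv4 convention that the V4 patch later adopted, the length prefix becomes unnecessary; a hedged sketch of the read/write pair under that assumption (field and class names are invented, not from the patch):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    // Hedged sketch: fixed-width IPv4 serialization for a CrawlDatum field --
    // always 4 bytes, no length prefix, IPv6 deliberately ignored.
    class IpField {
      private byte[] ipAddress = new byte[4];

      void write(DataOutput out) throws IOException {
        out.write(ipAddress);        // exactly 4 bytes, no writeInt(length)
      }

      void readFields(DataInput in) throws IOException {
        in.readFully(ipAddress);     // read the same fixed 4 bytes back
      }
    }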
Re: [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml
hmm... I didn't think about that; are there more opinions on this? I don't believe in this "don't be evil" thing at all. I think it is just a question of time until google feels we are attacking the appliance-server market, and I believe nutch has a serious chance to do so (some time in the far future :-) ). Stefan -- Sami Siren wrote: Are you sure there is no trademark infringement here? Perhaps we should call it something else, just to avoid any potential legal unpleasantries ...
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
I have a proposal for a simple solution: set a flag in the current Configuration instance, and check for this flag. The Configuration instance provides a task-specific context persisting throughout the lifetime of a task - but limited only to that task. Voila - problem solved. We get rid of the dubious use of LogFormatter (I hope, Chris, that even you would agree that this pattern is slightly... unusual ;) ), and we gain a flexible mechanism limited in scope to the current task, which ensures isolation from other tasks in the same JVM. How about that? Wonderful idea :-D +1
[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Stefan Groschupf updated NUTCH-298: Summary: if a 404 for a robots.txt is returned a NPE is thrown (was: if a 404 for a robots.txt is returned no page is fetched at all from the host) Sorry, wrong description.

if a 404 for a robots.txt is returned a NPE is thrown

Key: NUTCH-298
URL: http://issues.apache.org/jira/browse/NUTCH-298
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Fix For: 0.8-dev
Attachments: fixNpeRobotRuleSet.patch

What happens: if no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. In case the http response code is not 200 or 403 but, for example, 404, we do robotRules = EMPTY_RULES; (line 402). EMPTY_RULES is a RobotRuleSet created with the default constructor, so tmpEntries and entries are null and are never changed. If we now try to fetch a page from that host, EMPTY_RULES is used and we call isAllowed on the RobotRuleSet. In this case an NPE is thrown in this line:

    if (entries == null) {
      entries = new RobotsEntry[tmpEntries.size()];

Possible solution: we can initialize tmpEntries by default and also remove other null checks and initializations.
Re: search engine spam detector
The idea to have something like this as a nutch module (dropping pages or ranking them very low) might come up :-) That will be a very long way. I collected some thoughts and a list of web-spam-related papers in my blog: http://www.find23.net/Web-Site/blog/521BA1CD-14C4-4E84-A072-F98E13CAEFE1.html Feedback is welcome. Stefan
[jira] Created: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host
if a 404 for a robots.txt is returned no page is fetched at all from the host

Key: NUTCH-298
URL: http://issues.apache.org/jira/browse/NUTCH-298
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Fix For: 0.8-dev

What happens: if no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. In case the http response code is not 200 or 403 but, for example, 404, we do robotRules = EMPTY_RULES; (line 402). EMPTY_RULES is a RobotRuleSet created with the default constructor, so tmpEntries and entries are null and are never changed. If we now try to fetch a page from that host, EMPTY_RULES is used and we call isAllowed on the RobotRuleSet. In this case an NPE is thrown in this line:

    if (entries == null) {
      entries = new RobotsEntry[tmpEntries.size()];

Possible solution: we can initialize tmpEntries by default and also remove other null checks and initializations.
[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Stefan Groschupf updated NUTCH-298: Attachment: fixNpeRobotRuleSet.patch Fixes the NPE in RobotRuleSet that happens when an empty rule set is used.

if a 404 for a robots.txt is returned no page is fetched at all from the host

Key: NUTCH-298
URL: http://issues.apache.org/jira/browse/NUTCH-298
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Fix For: 0.8-dev
Attachments: fixNpeRobotRuleSet.patch

What happens: if no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. In case the http response code is not 200 or 403 but, for example, 404, we do robotRules = EMPTY_RULES; (line 402). EMPTY_RULES is a RobotRuleSet created with the default constructor, so tmpEntries and entries are null and are never changed. If we now try to fetch a page from that host, EMPTY_RULES is used and we call isAllowed on the RobotRuleSet. In this case an NPE is thrown in this line:

    if (entries == null) {
      entries = new RobotsEntry[tmpEntries.size()];

Possible solution: we can initialize tmpEntries by default and also remove other null checks and initializations.
RobotRuleSet
Hi, I just posted a fix for an NPE in case an empty RobotRuleSet is used. The patch only contains a two-line fix, since I learned that this is the best way to get things committed sooner. :) However, I really don't like the RobotRuleSet implementation, since entries are copied between an ArrayList and an array for no reason, from my point of view. I would love to change that to just use the ArrayList. Any thoughts? Can I get a vote from one committer who would commit this to the sources in case I make this change? :-) Thanks. Stefan
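The simplification Stefan proposes would look roughly like this (a hedged sketch, not the actual class; RobotsEntry is reduced to a prefix/allowed pair for illustration):

    import java.util.ArrayList;
    import java.util.List;

    // Hedged sketch: keep RobotsEntry objects in one list for their whole
    // lifetime, removing the tmpEntries-to-array copy (and, as a side effect,
    // the NPE on the empty rule set).
    class RobotRuleSetSketch {
      static class RobotsEntry {
        final String prefix; final boolean allowed;
        RobotsEntry(String prefix, boolean allowed) {
          this.prefix = prefix; this.allowed = allowed;
        }
      }

      private final List<RobotsEntry> entries = new ArrayList<RobotsEntry>();

      void addEntry(String prefix, boolean allowed) {
        entries.add(new RobotsEntry(prefix, allowed));
      }

      boolean isAllowed(String path) {
        for (RobotsEntry e : entries) {   // empty list -> allowed, no NPE
          if (path.startsWith(e.prefix)) return e.allowed;
        }
        return true;
      }
    }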
[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)
[ http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ] Stefan Groschupf commented on NUTCH-282: Is that related to the host grouping we discussed? Can we close this bug in that case?

Showing too few results on a page (Paging not correct)

Key: NUTCH-282
URL: http://issues.apache.org/jira/browse/NUTCH-282
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind

I did a search and got back the value itemsPerPage from opensearch. But the output shows results 1-8, and I have a total of 46 search results. The same happens for the web interface. Why aren't enough results fetched? The problem might be somewhere in the area where Nutch should only display a certain number of pages per site.
[jira] Commented: (NUTCH-286) Handling common error-pages as 404
[ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ] Stefan Groschupf commented on NUTCH-286: This is difficult to realize, since the http error code is read from the response in the fetcher and set into the protocol status; content analysis can only be done during parsing. Also, such pages normally do not get a high OPIC score and should not appear in the top search results. However, this is a misconfigured http server response, so you may want to open a bug in the typo3 issue tracker. Should we close this issue?

Handling common error-pages as 404

Key: NUTCH-286
URL: http://issues.apache.org/jira/browse/NUTCH-286
Project: Nutch
Type: Improvement
Reporter: Stefan Neufeind

Idea: some pages from some software packages/scripts report an http 200 OK even though a specific page could not be found. An example I just found is: http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef That's a typo3 page explaining, in its standard layout and wording: the requested page did not exist or was inaccessible. So I had the idea that somebody might create a plugin that could find commonly used formulations for "page does not exist" etc. and turn the page into a 404 before feeding it into the nutch index - although the server responded with status 200 OK.
[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] Stefan Groschupf commented on NUTCH-292: +1. Can someone create a clean patch file?

OpenSearchServlet: OutOfMemoryError: Java heap space

Key: NUTCH-292
URL: http://issues.apache.org/jira/browse/NUTCH-292
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Critical
Attachments: summarizer.diff

java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
    org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
    org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

The URL I use is: [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url It seems to be a problem specific to the data I'm working with. Moving the start from 0 to 10, or changing the query, works fine. Or maybe it doesn't have to do with sorting, but I'm just hitting one bad search result that has a broken summary? !! The problem is repeatable, so if anybody has an idea where to search / what to fix, I can easily try that out !!
[jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified
[ http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ] Stefan Groschupf commented on NUTCH-291:

lastModified will only be indexed if you enable the index-more plugin. If you think the way lastModified and date are stored in the index should change, please submit a patch for MoreIndexingFilter.

OpenSearchServlet should return date as well as lastModified
------------------------------------------------------------
Key: NUTCH-291
URL: http://issues.apache.org/jira/browse/NUTCH-291
Project: Nutch
Type: Improvement
Components: web gui
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-291-unfinished.patch

Currently lastModified is provided by OpenSearchServlet - but only in case the lastModified date is known. Since you can sort by date (which is lastModified, or the fetch date if lastModified is not present), it might be useful if OpenSearchServlet could provide date as well.
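A minimal sketch of the fallback rule described above - index lastModified when known, otherwise the fetch date. This is not the actual MoreIndexingFilter code; the field name and the Lucene field flags are assumptions:

    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DateFieldSketch {
      /** Adds a sortable "date" field: lastModified if known, else the fetch time. */
      public static void addDateField(Document doc, long lastModified, long fetchTime) {
        long time = (lastModified > 0) ? lastModified : fetchTime;
        // Day resolution keeps the term space small while still allowing date sorting.
        String value = DateTools.timeToString(time, DateTools.Resolution.DAY);
        doc.add(new Field("date", value, Field.Store.YES, Field.Index.UN_TOKENIZED));
      }
    }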
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] Stefan Groschupf commented on NUTCH-290:

If a parser throws an exception (Fetcher, line 261):

    try {
      parse = this.parseUtil.parse(content);
      parseStatus = parse.getData().getStatus();
    } catch (Exception e) {
      parseStatus = new ParseStatus(e);
    }
    if (!parseStatus.isSuccess()) {
      LOG.warning("Error parsing: " + key + ": " + parseStatus);
      parse = parseStatus.getEmptyParse(getConf());
    }

then we use the empty parse object, and an empty parse contains no text at all - see getText():

    private static class EmptyParseImpl implements Parse {
      private ParseData data = null;
      public EmptyParseImpl(ParseStatus status, Configuration conf) {
        data = new ParseData(status, "", new Outlink[0], new Metadata(), new Metadata());
        data.setConf(conf);
      }
      public ParseData getData() { return data; }
      public String getText() { return ""; }
    }

So the problem should be somewhere else.

parse-pdf: Garbage indexed when text-extraction not allowed
-----------------------------------------------------------
Key: NUTCH-290
URL: http://issues.apache.org/jira/browse/NUTCH-290
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-290-canExtractContent.patch

It seems that garbage (or undecoded text?) is indexed when text extraction for a PDF is not allowed. Example PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
[jira] Closed: (NUTCH-287) Exception when searching with sort
[ http://issues.apache.org/jira/browse/NUTCH-287?page=all ] Stefan Groschupf closed NUTCH-287:
Resolution: Won't Fix
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html

Exception when searching with sort
----------------------------------
Key: NUTCH-287
URL: http://issues.apache.org/jira/browse/NUTCH-287
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.8-dev
Reporter: Stefan Neufeind
Priority: Critical

Running a search with sort=url works. But when using sort=title I get the following exception.

2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception
java.lang.RuntimeException: Unknown sort value type!
    at org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
    at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
    at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
    at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
    at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
    at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
    at java.lang.Thread.run(Thread.java:595)

What is in those lines is:

    WritableComparable sortValue; // convert value to writable
    if (sortField == null) {
      sortValue = new FloatWritable(scoreDocs[i].score);
    } else {
      Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
      if (raw instanceof Integer) {
        sortValue = new IntWritable(((Integer)raw).intValue());
      } else if (raw instanceof Float) {
        sortValue = new FloatWritable(((Float)raw).floatValue());
      } else if (raw instanceof String) {
        sortValue = new UTF8((String)raw);
      } else {
        throw new RuntimeException("Unknown sort value type!");
      }
    }

So I thought that maybe raw is an instance of something strange and tried raw.getClass().getName() or also raw.toString() to track the cause down - but that always resulted in a NullPointerException. So it seems I'm having raw being null for some strange reason. When I try with title2 (or something non-existing) I get a different error that title2 is unknown / not indexed. So I suspect that title
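Since raw turned out to be null, a hypothetical defensive variant of the type dispatch above would at least name the offending field instead of throwing the generic "Unknown sort value type!". Sketch only - it does not fix the underlying cause (sorting on a field like title that has no single sortable term):

    public class SortValueCheck {
      /** Fails with a descriptive message when a sort field yields no value. */
      public static Object requireSortable(Object raw, String sortField) {
        if (raw == null) {
          throw new RuntimeException("Field '" + sortField + "' has no sortable value;"
              + " tokenized fields like 'title' cannot be sorted on directly.");
        }
        return raw;
      }
    }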
[jira] Closed: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=all ] Stefan Groschupf closed NUTCH-284:
Resolution: Won't Fix
Yes, I was missing index-basic.

NullPointerException during index
---------------------------------
Key: NUTCH-284
URL: http://issues.apache.org/jira/browse/NUTCH-284
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind

For quite a while this reduce sort has been going on. Then it fails. What could be wrong with this?

060524 212613 reduce sort
060524 212614 reduce sort
060524 212615 reduce sort
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212619 Optimizing index.
060524 212619 job_jlbhhm
java.lang.NullPointerException
    at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
[jira] Commented: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ] Stefan Groschupf commented on NUTCH-284:

Please try to discuss such things on the user mailing list first before opening an issue; maintaining the issue tracker is very time consuming. But if there is a bug, please do continue to open bug reports. :) Thanks.

NullPointerException during index
---------------------------------
Key: NUTCH-284
URL: http://issues.apache.org/jira/browse/NUTCH-284
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind

For quite a while this reduce sort has been going on. Then it fails. What could be wrong with this?

060524 212613 reduce sort
060524 212614 reduce sort
060524 212615 reduce sort
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212619 Optimizing index.
060524 212619 job_jlbhhm
java.lang.NullPointerException
    at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
    at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
[jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments
[ http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ] Stefan Groschupf commented on NUTCH-281:

Can you submit a patch file?

cached.jsp: base-href needs to be outside comments
--------------------------------------------------
Key: NUTCH-281
URL: http://issues.apache.org/jira/browse/NUTCH-281
Project: Nutch
Type: Bug
Components: web gui
Reporter: Stefan Neufeind
Priority: Trivial

See cached.jsp: the base href=... tag does not take effect when showing a cached page because of the comments around it.
[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error
[ http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ] Stefan Groschupf commented on NUTCH-274:

Should we fix this in Hadoop's TextInputFormat, so it ignores empty lines, or in the Injector?

Empty row in/at end of URL-list results in error
------------------------------------------------
Key: NUTCH-274
URL: http://issues.apache.org/jira/browse/NUTCH-274
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: nightly-2006-05-20
Reporter: Stefan Neufeind
Priority: Minor

This is minor - but it's a little unclean :-) Reproduce: have a URL file with one URL followed by a newline, thus producing an empty line. Outcome: fetcher threads try to fetch two URLs at the same time. The first one is fine - but the second is empty and therefore fails proper protocol detection.

060521 022639 Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
060521 022639 Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
060521 022639 found resource parse-plugins.xml at file:/home/mm/nutch-nightly/conf/parse-plugins.xml
060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060521 022639 fetching http://www.bild.de/
060521 022639 fetching
060521 022639 fetch of failed with: org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: no protocol:
060521 022639 http.proxy.host = null
060521 022639 http.proxy.port = 8080
060521 022639 http.timeout = 1
060521 022639 http.content.limit = 65536
060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 022639 fetcher.server.delay = 1000
060521 022639 http.max.delays = 1000
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 022640 map 0% reduce 0%
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s,
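Either way, the guard itself is trivial. A self-contained sketch of the Injector-side option (a hypothetical helper, not the actual Injector code): trim seed lines and drop the blank ones before they can become fetch entries:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    public class SeedListCleaner {
      /** Returns the non-empty, trimmed lines of a URL seed list. */
      public static List<String> cleanLines(String seedList) throws IOException {
        List<String> urls = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new StringReader(seedList));
        String line;
        while ((line = reader.readLine()) != null) {
          String url = line.trim();
          if (url.length() > 0) { // skip empty rows so no "no protocol" error occurs
            urls.add(url);
          }
        }
        return urls;
      }
    }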
[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] Stefan Groschupf commented on NUTCH-290:

As far as I understand the code, the next parser is only used if the previous parser returns an unsuccessful parsing status. If the parser throws an exception, that exception is not caught in ParseUtil at all. So to solve this problem, the PDF parser should throw an exception rather than report an unsuccessful status, shouldn't it?

parse-pdf: Garbage indexed when text-extraction not allowed
-----------------------------------------------------------
Key: NUTCH-290
URL: http://issues.apache.org/jira/browse/NUTCH-290
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: Stefan Neufeind
Attachments: NUTCH-290-canExtractContent.patch

It seems that garbage (or undecoded text?) is indexed when text extraction for a PDF is not allowed. Example PDF: http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
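A hedged sketch of what the attached NUTCH-290-canExtractContent.patch presumably does (assumed, not copied from the patch): check the PDF's extraction permission and throw, so the parse never falls through to indexing undecoded bytes. Package names follow the old org.pdfbox PDFBox layout in use around Nutch 0.8:

    import java.io.IOException;
    import java.io.InputStream;

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfTextGuard {
      /** Extracts PDF text, failing loudly when extraction is forbidden. */
      public static String extractText(InputStream in) throws IOException {
        PDDocument doc = PDDocument.load(in);
        try {
          if (!doc.getCurrentAccessPermission().canExtractContent()) {
            // Throwing here, rather than returning an unsuccessful status,
            // matches the suggestion in the comment above.
            throw new IOException("PDF permissions forbid text extraction");
          }
          return new PDFTextStripper().getText(doc);
        } finally {
          doc.close();
        }
      }
    }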
[jira] Closed: (NUTCH-286) Handling common error-pages as 404
[ http://issues.apache.org/jira/browse/NUTCH-286?page=all ] Stefan Groschupf closed NUTCH-286:
Resolution: Won't Fix

I hope everybody agrees with the statement: we cannot detect HTTP response codes based on the returned HTML content. Pruning the index is a good way to solve the problem.

Handling common error-pages as 404
----------------------------------
Key: NUTCH-286
URL: http://issues.apache.org/jira/browse/NUTCH-286
Project: Nutch
Type: Improvement
Reporter: Stefan Neufeind

Idea: some pages from some software packages/scripts report an HTTP 200 OK even though the requested page could not be found. An example I just found: http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef That's a TYPO3 page explaining, in its standard layout and wording: "The requested page did not exist or was inaccessible." So I had the idea that somebody might create a plugin that finds commonly used formulations for "page does not exist" etc. and turns the page into a 404 before it is fed into the Nutch index - even though the server responded with status 200 OK.
[jira] Created: (NUTCH-293) support for Crawl-delay in Robots.txt
support for Crawl-delay in Robots.txt
-------------------------------------
Key: NUTCH-293
URL: http://issues.apache.org/jira/browse/NUTCH-293
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical

Nutch needs support for the Crawl-delay directive defined in robots.txt; it is not a standard, but it is a de-facto standard. See: http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html Webmasters have started blocking Nutch because we do not support it.
[jira] Updated: (NUTCH-293) support for Crawl-delay in Robots.txt
[ http://issues.apache.org/jira/browse/NUTCH-293?page=all ] Stefan Groschupf updated NUTCH-293:
Attachment: crawlDelayv1.patch

A first draft of crawl-delay support for Nutch. The problem I see: when IP-based delay is configured, it can happen that we use the crawl delay of one host for another host running on the same IP. Feedback is welcome.

support for Crawl-delay in Robots.txt
-------------------------------------
Key: NUTCH-293
URL: http://issues.apache.org/jira/browse/NUTCH-293
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Attachments: crawlDelayv1.patch

Nutch needs support for the Crawl-delay directive defined in robots.txt; it is not a standard, but it is a de-facto standard. See: http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html Webmasters have started blocking Nutch because we do not support it.
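For illustration, a minimal, hypothetical Crawl-delay parser - this is not the attached crawlDelayv1.patch. It ignores User-agent scoping, which a real implementation must honor, and simply takes the first directive it finds:

    public class CrawlDelayParser {
      /** Returns the crawl delay in milliseconds, or -1 if none is present. */
      public static long parseCrawlDelay(String robotsTxt) {
        String[] lines = robotsTxt.split("\r?\n");
        for (int i = 0; i < lines.length; i++) {
          String line = lines[i].trim().toLowerCase();
          if (line.startsWith("crawl-delay:")) {
            try {
              double seconds =
                  Double.parseDouble(line.substring("crawl-delay:".length()).trim());
              return (long) (seconds * 1000); // fetcher delays are in milliseconds
            } catch (NumberFormatException e) {
              return -1; // malformed value: fall back to the configured default
            }
          }
        }
        return -1;
      }
    }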
Re: JVM error while parsing
Hi, I heard there is a bug in JVM 1.5.0_06; can you try an older JVM, or maybe a 1.4 JVM, and report whether this happens with another JVM as well? Thanks, Stefan

On 30.05.2006, at 14:14, Uygar Yüzsüren wrote:

Hi everyone, I am using Hadoop 0.2.0 and Nutch 0.8, and at the moment I am trying to complete a 1-depth crawl using DFS and the MapReduce structures. However, after a fetch step I encounter the JVM error below at one or more task trackers during the parsing step. It makes no difference whether I use only the default parsers or also the additional ones (PDF, Excel etc.). My task trackers run on AMD X2 64-bit machines and my JVM version is 1.5.0_06. Have you ever faced such a problem at the parse stage? Or how do you think I can track down the cause of this JVM error? The error report is:

060530 144113 task_0007_m_10_0 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060530 144113 task_0007_m_10_0 5.0391704E-6% /crawl/segments/20060521171305/content/part-4/data: 0+12303612
060530 144114 task_0007_m_10_0 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060530 144114 task_0007_m_07_0 0.084114% /crawl/segments/20060521171305/content/part-00011/data:0+12493176
060530 144115 task_0007_m_07_0 0.09551566% /crawl/segments/20060521171305/content/part-00011/data:0+12493176
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An unexpected error has been detected by HotSpot Virtual Machine:
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # SIGSEGV (0xb) at pc=0x003d1d247c10, pid=25093, tid=182894086496
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # Java VM: Java HotSpot(TM) 64-Bit Server VM (1.5.0_06-b05 mixed mode)
060530 144115 task_0007_m_07_0 # Problematic frame:
060530 144115 task_0007_m_07_0 # C [libc.so.6+0x47c10] printf_size+0x740
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An error report file with more information is saved as hs_err_pid25093.log
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # If you would like to submit a bug report, please visit:
060530 144115 task_0007_m_07_0 # http://java.sun.com/webapps/bugreport/crash.jsp
060530 144115 task_0007_m_07_0 #
060530 144115 Server connection on port 51950 from 192.168.15.61: exiting
060530 144115 task_0007_m_07_0 Child Error
java.io.IOException: Task process exit with nonzero status of 134.
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:242)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)

Thank you very much.
Re: Extract infos from documents and query external sites
Think about using the Google API. However, the way to go could be:
+ fetch your pages
+ do not parse the pages
+ write a MapReduce job that extracts your data
++ build an XHTML DOM from the HTML, e.g. using neko
++ use XPath queries to extract your data
++ also check out gate as a named-entity extraction tool, to extract names based on patterns and heuristics
++ write the names to a file
+ build your query URLs
+ inject the query URLs into an empty crawl db
+ create a segment, fetch it, and update it against a second, empty crawl database
+ remove the first segment and db
+ create a segment with your second db and fetch it.
Your second segment will then contain only the paper pages. A sketch of the XPath extraction step follows at the end of this message. HTH, Stefan

On 30.05.2006, at 12:14, HellSpawn wrote:

I'm working on a search engine for my university, and they want me to create a repository of scientific articles from the web :D I read something about XPath for extracting exact parts of a document; once that is done, building the query is very easy. But my doubts are about how to insert all of this into the Nutch crawler... Thank you
--
View this message in context: http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4624272
Sent from the Nutch - Dev forum at Nabble.com.
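A sketch of the neko + XPath step from the list above (the input HTML and the XPath expression are illustrative; the gate/named-entity part is not shown): parse fetched HTML into a DOM with NekoHTML and pull out the target strings.

    import java.io.StringReader;

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;

    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class PaperTitleExtractor {
      public static void main(String[] args) throws Exception {
        String html = "<html><body><h2>An Example Paper Title</h2></body></html>";
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));
        Document dom = parser.getDocument();
        XPath xpath = XPathFactory.newInstance().newXPath();
        // NekoHTML upper-cases element names in the DOM it builds.
        NodeList hits = (NodeList) xpath.evaluate("//H2", dom, XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++) {
          System.out.println(hits.item(i).getTextContent()); // feed these into query URLs
        }
      }
    }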