interesting paper with competing index systems
http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf

Anyone have any further details on this?
wildcard matches not working?
I can't seem to get wildcard matches (e.g. test*) to work in my index using the default Nutch search application. Is there something I'm missing? I'm using Nutch built from trunk, with a patch applied that keeps content within htdig-noindex boundaries from being indexed. Thanks in advance for any help.

-a
Re: XP/Cygwin setup problems
Hi,

You get that error when you run the earlier 0.7 Nutch tutorial against a 0.8-dev Nutch. Use the tutorial for 0.8-dev instead: http://wiki.media-style.com/display/nutchDocu/quick+tutorial+for+nutch+0.8+and+later. Or add the following property to nutch-site.xml:

  <property>
    <name>mapred.input.dir</name>
    <value>C:/cygwin/usr/local/src/nutch-nightly/conf</value>
    <description>The input directory for map/reduce jobs.</description>
  </property>

The original question:

Hi all,

Having some problems getting Nutch to run on XP/Cygwin. This is regarding nutch-2006-01-17.

Intranet crawl: when I do this (after making the urls file, etc.):

  bin/nutch crawl urls -dir cdir -depth 2 > log

I get this in the log:

  060117 114833 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-default.xml
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/crawl-tool.xml
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/mapred-default.xml
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-site.xml
  060117 114834 crawl started in: cdir
  060117 114834 rootUrlDir = urls
  060117 114834 threads = 10
  060117 114834 depth = 2
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-default.xml
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/crawl-tool.xml
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-site.xml
  060117 114834 Injector: starting
  060117 114834 Injector: crawlDb: cdir\crawldb
  060117 114834 Injector: urlDir: urls
  060117 114834 Injector: Converting injected urls to crawl db entries.
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-default.xml
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/crawl-tool.xml
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/mapred-default.xml
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/mapred-default.xml
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-site.xml
  060117 114834 Running job: job_krj0e1
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-default.xml
  060117 114834 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/mapred-default.xml
  060117 114835 parsing \tmp\nutch\mapred\local\localRunner\job_krj0e1.xml
  060117 114835 parsing file:/C:/cygwin/usr/local/src/nutch-nightly/conf/nutch-site.xml
  java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml, mapred-default.xml, \tmp\nutch\mapred\local\localRunner\job_krj0e1.xml, nutch-site.xml
    at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
    at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
    at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
  060117 114835 map 0%
  java.io.IOException: Job failed!
    at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
  Exception in thread "main"

I see that nutch-site.xml is empty and mapred-default.xml is empty.

Whole-web setup: when I do this (after the mkdirs):

  bin/nutch admin db -create

I get this at the prompt:

  Exception in thread "main" java.lang.NoClassDefFoundError: admin

I don't speak Java, so I'm not sure what it's saying. Please help. TIA.
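For anyone following along: that property goes inside the configuration file's root element. A minimal sketch of a complete nutch-site.xml, assuming the <nutch-conf> root element used by the 0.8-dev config files of that era; adjust the path to your own conf directory:

  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
  <nutch-conf>
    <property>
      <name>mapred.input.dir</name>
      <value>C:/cygwin/usr/local/src/nutch-nightly/conf</value>
      <description>The input directory for map/reduce jobs.</description>
    </property>
  </nutch-conf>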
Re: So many Unfetched Pages using MapReduce
Hi Florent,

I did some more tests. Here are the results. I have 3 machines, P4 with 1 GB RAM each. All three are datanodes and one is the namenode. I started from 8 seed urls and measured the effect of a depth-1 crawl under different configurations. The number of unfetched pages changes with the configuration:

Configuration 1: 3 map tasks, 3 reduce tasks, 40 fetch threads, 2 threads per host, http.timeout 10 sec ---> 6700 pages fetched
Configuration 2: 12 map tasks, 6 reduce tasks, 500 fetch threads, 20 threads per host, http.timeout 10 sec ---> 18000 pages fetched
Configuration 3: 40 map tasks, 20 reduce tasks, 500 fetch threads, 20 threads per host, http.timeout 10 sec ---> 37000 pages fetched
Configuration 4: 100 map tasks, 20 reduce tasks, 100 fetch threads, 20 threads per host, http.timeout 10 sec ---> 34000 pages fetched
Configuration 5: 50 map tasks, 50 reduce tasks, 40 fetch threads, 100 threads per host, http.timeout 20 sec ---> 52000 pages fetched
Configuration 6: 50 map tasks, 100 reduce tasks, 40 fetch threads, 100 threads per host, http.timeout 20 sec ---> 57000 pages fetched
Configuration 7: 50 map tasks, 120 reduce tasks, 250 fetch threads, 20 threads per host, http.timeout 20 sec ---> 6 pages fetched

Do you have any idea why pages go missing from the fetcher without any log entry or exception? It really seems to depend on the number of reduce tasks!

Thanks, Mike

On 1/17/06, Mike Smith wrote:

I've experienced the same effect. When I decrease the number of map/reduce tasks, I can fetch more web pages, but increasing them increases the number of unfetched pages. I also get some java.net.SocketTimeoutException: Read timed out exceptions in my datanode log files, but those timeouts couldn't account for this many missing pages! I agree the problem must be somewhere in the fetcher.

Mike

On 1/17/06, Florent Gluck wrote:

I'm having the exact same problem. I noticed that changing the number of map/reduce tasks gives me different DB_fetched results. Looking at the logs, a lot of urls are actually missing: I can't find their trace *anywhere* in the logs, whether on the slaves or the master. I'm puzzled. Currently I'm trying to debug the code to see what's going on. So far I've confirmed that the generator is fine, so the issue must lie further down the pipeline (fetcher?). Let me know if you find anything regarding this issue. Thanks.

--Flo

Mike Smith wrote:

Hi, I have set up four boxes using MapReduce and everything goes smoothly. I fed in about 8 seed urls to begin with and crawled to depth 2. Only about 1900 pages (about 300 MB of data) were fetched, and the rest is marked in the db as unfetched. Does anyone know what could be wrong? This is the output of bin/nutch readdb h2/crawldb -stats:

  060115 171625 Statistics for CrawlDb: h2/crawldb
  060115 171625 TOTAL urls: 99403
  060115 171625 avg score: 1.01
  060115 171625 max score: 7.382
  060115 171625 min score: 1.0
  060115 171625 retry 0: 99403
  060115 171625 status 1 (DB_unfetched): 97470
  060115 171625 status 2 (DB_fetched): 1933
  060115 171625 CrawlDb statistics: done

Thanks, Mike
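For readers who want to reproduce these runs, the knobs Mike is varying correspond to configuration properties. A sketch of Configuration 6 expressed as nutch-site.xml overrides, assuming the 0.8-era property names and that http.timeout is given in milliseconds:

  <property><name>mapred.map.tasks</name><value>50</value></property>
  <property><name>mapred.reduce.tasks</name><value>100</value></property>
  <property><name>fetcher.threads.fetch</name><value>40</value></property>
  <property><name>fetcher.threads.per.host</name><value>100</value></property>
  <property><name>http.timeout</name><value>20000</value></property>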
Re: interesting paper with competing index systems
That's exactly how I felt. No mention of the JVM/platform, options, or versions used. I've just been bombarded by someone (who I can probably assume works on or uses the aforementioned program) asking me why I use Lucene on all of my projects. The paper hardly seems academic, even though that appears to be what they are going for. Thanks again for the quick follow-up.

--- Doug Cutting wrote:

Byron Miller wrote:

http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf

Anyone have any further details on this?

The first author of the paper is also the founder of the company that sells the software described, so these benchmarks should not be considered entirely objective. That's not to say that IXE is not faster than Lucene; it might well be. But they do not list any JVM details, the Lucene version, or any Lucene options. Chances are, with a few informed tweaks, one could improve Lucene's performance on this benchmark. Chances are also that IXE was configured for optimal performance on this benchmark, since the benchmark was performed by the authors of IXE. Also note that this is a micro-benchmark, designed to highlight their skip implementation. A better comparison would average times from a log of real user queries. Please feel free to try to obtain the IXE software and perform benchmarks of your own.

Doug
Re: How do I control log level with MapReduce?
Chris Schneider wrote:

I'm trying to bring up a MapReduce system, but am confused about how to control the logging level. It seems like most of the Nutch code still logs the way it used to, but the -logLevel parameter that used to be passed to each tool's main() method no longer exists (not that these main methods are called by Crawl.java, of course). Previously, if -logLevel was omitted, each tool would set its logLevel field to INFO, but those fields no longer exist either. The result seems to be that the logging level defaults all the way back to the LogFormatter, which sets all of its handlers to FINEST. I was expecting a new configuration property (perhaps a job configuration property?) to control the logging level, but I don't see anything like that. Any guidance would be greatly appreciated.

There is no config property to control the logging level. That would be a useful addition, if someone wishes to contribute it. In the meantime, Nutch uses Java's built-in logging mechanism. Instructions for configuring it are at: http://java.sun.com/j2se/1.4.2/docs/api/java/util/logging/LogManager.html

Doug
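As a stopgap until such a config property exists, the level can also be raised programmatically before invoking a tool, since the Nutch loggers live under the org.apache.nutch namespace. A minimal sketch; the wrapper class is illustrative, not part of Nutch:

  import java.util.logging.Level;
  import java.util.logging.Logger;

  public class QuietLogging {
    // Hold a strong reference so the LogManager doesn't drop our setting.
    private static final Logger NUTCH_LOG = Logger.getLogger("org.apache.nutch");

    public static void main(String[] args) throws Exception {
      NUTCH_LOG.setLevel(Level.WARNING);  // suppress INFO and finer messages
      // ... then invoke the tool, e.g. org.apache.nutch.crawl.Crawl.main(args)
    }
  }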
Re: interesting paper with competing index systems
Not only that, they mention that each test was run only _twice_ to get an average score. With HotSpot JVMs this is meaningless: you need to run at least a dozen or more cycles so that the hot spots get recompiled. This alone discredits the results in my eyes.

Yes, running a load or performance test with only two runs makes no sense on a JVM. On some complex telco systems we have seen that a JVM only becomes hot after many minutes at 100+ req/s, so running only two tests is really far too few.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
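For the record, the warm-up discipline described above is straightforward to apply: run the measured operation many times untimed so HotSpot compiles the hot paths, then time it. A minimal sketch; search() is a placeholder for whatever is being benchmarked, not a real Lucene or IXE call:

  public class WarmupBench {
    static long search(String query) {
      return query.hashCode();  // placeholder for the real work
    }

    public static void main(String[] args) {
      String query = "test";
      long sink = 0;  // accumulate results so the JIT can't eliminate the calls

      for (int i = 0; i < 10000; i++) sink += search(query);  // warm-up, untimed

      int runs = 10000;
      long start = System.nanoTime();
      for (int i = 0; i < runs; i++) sink += search(query);   // measured runs
      long elapsed = System.nanoTime() - start;

      System.out.println("avg ns/query: " + (elapsed / (double) runs)
          + " (sink=" + sink + ")");
    }
  }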
Re: Can't index some pages
Michael Plax wrote:

Question summary: how can I set up the crawler so that it indexes the whole web site? I'm trying to run a crawl with the command from the tutorial:

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 > crawl.log
4. Crawling finishes.
5. I run: bin/nutch readdb crawledtottaly/db -stats

Output:

  $ bin/nutch readdb crawledtottaly/db -stats
  run java in C:\Sun\AppServer\jdk
  060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
  060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
  060118 155526 No FS indicated, using default:local
  Stats for [EMAIL PROTECTED]
  ---
  Number of pages: 63
  Number of links: 3906

6. I get fewer pages than I expected.

This is a common question, but there's no common answer. The problem could be that urls are blocked by your url filter, or by http.max.delays, or something else. What might help is if the fetcher and crawl db printed more detailed statistics. In particular, the fetcher could categorize failures and periodically print a list of failure counts by category. The crawl db updater could also list the number of urls that are filtered. In the meantime, please examine the logs, particularly watching for errors while fetching.

Doug
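One concrete knob worth checking when a crawl stays inside one site: the fetcher gives up on a page after waiting http.max.delays times for its busy host, and the stock value is small. A sketch of a nutch-site.xml override; the description is paraphrased, so treat the exact default and wording in your nutch-default.xml as authoritative:

  <property>
    <name>http.max.delays</name>
    <value>100</value>
    <description>The number of times a fetcher thread will wait for a
    busy host before giving up on the page.</description>
  </property>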
Re: So many Unfetched Pages using MapReduce
Hi Mike,

Your different tests are really interesting, thanks for sharing! I didn't do as many tests. I changed the number of fetch threads and the number of map and reduce tasks, and noticed that it gave me quite different results in terms of pages fetched.

Then I wanted to see if this issue would still happen when running the crawl (single pass) on one single machine, running everything locally without NDFS. So I injected 5 urls and got 2315 urls fetched. I couldn't find a trace of most of the urls in the logs. I noticed that if I put a counter at the beginning of the while(true) loop in the run method of Fetcher.java, I don't end up with 5! After some poking around, I noticed that if I comment out the line doing the page fetch, ProtocolOutput output = protocol.getProtocolOutput(key, datum);, then I do get 5. There seems to be something really wrong with that. It seems to mean that some threads are dying without notification in the http protocol code (if that makes any sense).

I then decided to switch to the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient. With the old protocol I got 5, as expected.

The following bug seems to be very similar to what we are encountering: http://issues.apache.org/jira/browse/NUTCH-136; check out the latest comment. I'm going to remove line 211 and run some tests to see how it behaves (with protocol-http and protocol-httpclient). I'll let you know what I find out.

--Florent

Mike Smith wrote:

Hi Florent, I did some more tests. Here are the results. I have 3 machines, P4 with 1 GB RAM each. All three are datanodes and one is the namenode. I started from 8 seed urls and measured the effect of a depth-1 crawl under different configurations. The number of unfetched pages changes with the configuration:

Configuration 1: 3 map tasks, 3 reduce tasks, 40 fetch threads, 2 threads per host, http.timeout 10 sec ---> 6700 pages fetched
Configuration 2: 12 map tasks, 6 reduce tasks, 500 fetch threads, 20 threads per host, http.timeout 10 sec ---> 18000 pages fetched
Configuration 3: 40 map tasks, 20 reduce tasks, 500 fetch threads, 20 threads per host, http.timeout 10 sec ---> 37000 pages fetched
Configuration 4: 100 map tasks, 20 reduce tasks, 100 fetch threads, 20 threads per host, http.timeout 10 sec ---> 34000 pages fetched
Configuration 5: 50 map tasks, 50 reduce tasks, 40 fetch threads, 100 threads per host, http.timeout 20 sec ---> 52000 pages fetched
Configuration 6: 50 map tasks, 100 reduce tasks, 40 fetch threads, 100 threads per host, http.timeout 20 sec ---> 57000 pages fetched
Configuration 7: 50 map tasks, 120 reduce tasks, 250 fetch threads, 20 threads per host, http.timeout 20 sec ---> 6 pages fetched

Do you have any idea why pages go missing from the fetcher without any log entry or exception? It really seems to depend on the number of reduce tasks!

Thanks, Mike
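Here is a minimal sketch of the counting trick Florent describes: bump a shared counter at the top of every pass through the fetch loop, so threads that die silently show up as a shortfall in the final count. The names (fetchLoop, fetch, the queue) are illustrative stand-ins, not the actual Fetcher internals:

  import java.util.concurrent.atomic.AtomicInteger;

  public class FetchLoopCounter {
    static final AtomicInteger passes = new AtomicInteger();

    static void fetchLoop(java.util.Queue<String> fetchList) {
      while (true) {
        passes.incrementAndGet();       // count every pass, before any fetch
        String url = fetchList.poll();  // stand-in for the real fetch list
        if (url == null) break;
        try {
          fetch(url);                   // if this kills the thread, the count stops short
        } catch (Exception e) {
          // logged failures are fine; silent thread death is the bug
        }
      }
    }

    static void fetch(String url) { /* protocol call would go here */ }
  }

If the final count is lower than the number of entries handed out, something inside the loop is killing threads without an exception reaching the logs, which matches what commenting out getProtocolOutput() suggested.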
Re: So many Unfetched Pages using MapReduce
Florent Gluck wrote:

I then decided to switch to the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient. With the old protocol I got 5, as expected.

There have been a number of complaints about unreliable fetching with protocol-httpclient, so I've switched the default back to protocol-http.

Doug
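To make the same switch on an existing checkout without waiting for the new default, override plugin.includes in nutch-site.xml. A sketch, assuming a stock plugin list; copy the actual default value from your nutch-default.xml and change only the protocol entry:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    <description>Use protocol-http instead of protocol-httpclient.</description>
  </property>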
please help: Recovered from failed datanode connection
I am getting more and more of these messages, though there seems to be no side effect. Any info would be appreciated.

G.
Re: Error at end of MapReduce run with indexing
Matt Zytaruk wrote:

I am having this same problem during the reduce phase of fetching, and am now seeing:

  060119 132458 Task task_r_obwceh timed out. Killing.

That is a different problem: a different timeout. This happens when a task does not report status for too long; it is then assumed to be hung.

Will the jobtracker restart this job?

It will retry that task up to three times.

If so, and if I change the ipc timeout in the config, will the tasktracker read in the new value when the job restarts?

The ipc timeout is not the relevant timeout; the task timeout is what's involved here. And no, at present I think the tasktracker only reads it when it is started, not per job.

Doug
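For reference, the timeout Doug is referring to is a config property with a millisecond value. A sketch of an override, assuming the property is the mapred.task.timeout entry from that era's mapred-default.xml; since the tasktracker reads it only at startup, restart the tasktrackers after changing it:

  <property>
    <name>mapred.task.timeout</name>
    <value>1200000</value>
    <description>Milliseconds a task may go without reporting status
    before it is considered hung and killed (here: 20 minutes).</description>
  </property>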
Re: Can't index some pages
Doug, would it make sense to print a LOG.info() message every time the fetcher bumps into one of these db.max limits? That would help users find out when they need to adjust their configuration. I can prepare a patch if it seems sensible.

--Matt

On Jan 19, 2006, at 5:34 PM, Michael Plax wrote:

Thank you very much. I changed db.max.outlinks.per.page and db.max.anchor.length to 200 and got the whole web site indexed. This particular web site has more than 100 outbound links per page.

Michael

----- Original Message -----
From: Steven Yelton
To: nutch-user@lucene.apache.org
Sent: Thursday, January 19, 2006 5:29 AM
Subject: Re: Can't index some pages

Is it not catching all the outbound links? See db.max.outlinks.per.page; I think the default is 100. I had to bump it up significantly to index a reference site...

Steven

Michael Plax wrote:

Hello,

Question summary: how can I set up the crawler so that it indexes the whole web site? I'm trying to run a crawl with the command from the tutorial:

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 > crawl.log
4. Crawling finishes.
5. I run: bin/nutch readdb crawledtottaly/db -stats

Output:

  $ bin/nutch readdb crawledtottaly/db -stats
  run java in C:\Sun\AppServer\jdk
  060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
  060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
  060118 155526 No FS indicated, using default:local
  Stats for [EMAIL PROTECTED]
  ---
  Number of pages: 63
  Number of links: 3906

6. I get fewer pages than I expected.

What I did:

0. I read http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02458.html
1. I changed the depth to 10, 100, 1000: same results.
2. I changed the start page to a page that did not appear, and I do get that page indexed.

Output:

  $ bin/nutch readdb crawledtottaly/db -stats
  run java in C:\Sun\AppServer\jdk
  060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
  060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
  060118 162103 No FS indicated, using default:local
  Stats for [EMAIL PROTECTED]
  ---
  Number of pages: 64
  Number of links: 3906

This page appears at depth 3 from index.html.

Q: How can I set up the crawler so that it indexes the whole web site?

Thank you,
Michael

P.S. I have attached the configuration files.

urls:

  http://www.totallyfurniture.com/index.html

crawl-urlfilter.txt:

  # The url filter file used by the crawl command.
  # Better for intranet crawling.
  # Be sure to change MY.DOMAIN.NAME to your domain name.

  # Each non-comment, non-blank line contains a regular expression
  # prefixed by '+' or '-'. The first matching pattern in the file
  # determines whether a URL is included or ignored. If no pattern
  # matches, the URL is ignored.

  # skip file:, ftp:, mailto: urls
  -^(file|ftp|mailto):

  # skip image and other suffixes we can't yet parse
  -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*totallyfurniture.com/
  +^http://([a-z0-9]*\.)*yahoo.net/

  # skip everything else
  -.

--
Matt Kangas / [EMAIL PROTECTED]
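For anyone hitting the same wall, here is a sketch of the two overrides Michael describes, as nutch-site.xml entries. The property names and values come from the thread above; the descriptions are paraphrased and approximate:

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>200</value>
    <description>The maximum number of outlinks processed for a page.</description>
  </property>
  <property>
    <name>db.max.anchor.length</name>
    <value>200</value>
    <description>The maximum length of anchor text kept per link.</description>
  </property>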
Re: Can't index some pages
Matt Kangas wrote:

Doug, would it make sense to print a LOG.info() message every time the fetcher bumps into one of these db.max limits? That would help users find out when they need to adjust their configuration. I can prepare a patch if it seems sensible.

Sure, this is sensible. But it isn't done in the fetcher; the limits are applied when the links are read, during the db update.

Doug
RE: interesting paper with competing index systems
Another interesting tool for performing linguistic analysis on natural-language data: http://www.alias-i.com/lingpipe/. But is it really an indexing engine? They are using the NekoHTML parser.

-----Original Message-----
From: Byron Miller

http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf

Anyone have any further details on this?
Re: getOutlinks doesn't work properly
Good call. That's another limit where it would be nice to see a log message when it's exceeded. I'll try to add a patch to NUTCH-182 tomorrow for this.

--Matt

On Jan 19, 2006, at 11:39 PM, Fuad Efendi wrote:

  <property>
    <name>file.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content, in bytes.
    If this value is larger than zero, content longer than it will be
    truncated; otherwise (zero or negative), no truncation at all.
    </description>
  </property>

(the default is 65536)

-----Original Message-----
From: Jack Tang

Hi,

Please change the value of the db.max.outlinks.per.page property (default is 100) to, say, 1000:

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>1000</value>
    <description>The maximum number of outlinks that we'll process for a page.
    </description>
  </property>

/Jack

On 1/20/06, Nguyen Ngoc Giang wrote:

Hi everyone,

I found that the getOutlinks function in html-parser's DOMContentUtils.java doesn't work correctly in some cases. An example is this website: http://blog.donews.com/boyla/. The function returns only 170 records, while the page in fact contains a lot more (Firefox finds 356 links!). When I compare the hyperlink list with the one returned by Firefox, the orders are exactly identical, meaning that the 170th link from getOutlinks is the same as the 170th link from Firefox. Therefore the algorithm seems correct, but there is a bug somewhere. There is no threshold at this point, since the max-outlinks parameter is applied in the updatedb step. Even when I increase the max outlinks to 1000, the situation remains. Any suggestions are very appreciated.

Regards, Giang

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

--
Matt Kangas / [EMAIL PROTECTED]
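When debugging a case like Giang's, it helps to know how many anchor tags the raw page contains before any truncation or parsing. Here is a rough sanity check; it is a naive regex count, not equivalent to DOMContentUtils.getOutlinks (it will overcount commented-out links and miss script-generated ones):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class AnchorCount {
    public static void main(String[] args) throws Exception {
      // Download the page in full, with no content-length cutoff.
      StringBuilder html = new StringBuilder();
      BufferedReader in = new BufferedReader(
          new InputStreamReader(new URL(args[0]).openStream()));
      for (String line; (line = in.readLine()) != null; ) {
        html.append(line).append('\n');
      }
      in.close();

      // Count occurrences of "<a ... href", case-insensitively.
      Matcher m = Pattern.compile("<a\\s[^>]*href", Pattern.CASE_INSENSITIVE)
                         .matcher(html);
      int count = 0;
      while (m.find()) count++;
      System.out.println(count + " anchor tags in " + html.length() + " chars");
    }
  }

If the count drops sharply once the page is cut at 65536 bytes, the content limit, not the link extractor, is what loses the links.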
org.apache.nutch.indexer.IndexMerger (Nutch 0.7)
Hi,

Could anyone let me know definitively whether the IndexMerger(NutchFileSystem nfs, File[] segments, File outputIndex, File localWorkingDir) merge operation merges the segments and overwrites any existing index at outputIndex, or merges the segments into the existing index at outputIndex? If it overwrites, is there another way to merge segments into an existing index without copying the existing index to a temporary area and specifying it as one of the input segments? I am using Nutch 0.7. Thanks.

Regards,
CW
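Not an answer, but for context, a sketch of the call pattern being asked about. Only the constructor signature comes from the question above; the import paths, the filesystem factory method, and the name of the method that performs the merge are assumptions, not checked against the 0.7 source:

  import java.io.File;
  import org.apache.nutch.fs.NutchFileSystem;   // assumed package path
  import org.apache.nutch.indexer.IndexMerger;

  public class MergeSketch {
    public static void main(String[] args) throws Exception {
      NutchFileSystem nfs = NutchFileSystem.get();          // assumed factory method
      File[] segments = { new File("segments/20060119") };  // hypothetical segment dir
      IndexMerger merger =
          new IndexMerger(nfs, segments, new File("index"), new File("tmp"));
      merger.merge();                                       // assumed method name
    }
  }

The question, then, is whether that last call replaces whatever already sits at "index" or folds the new segments into it.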