Re: Lucene query support in Nutch
Tomi said:

> In conclusion, my position is pragmatic: I welcome the simplest
> solution to implement the OR search. I just believe that it'd be
> easiest to do that by extending the Nutch Analyzer.

This seems like a very reasonable approach. I too would very much like OR. It would also be nice if it worked in 0.7.2 and I could drop it in, but that may be asking for too much.

- Bill

--
Bill Goffe [EMAIL PROTECTED]
Department of Economics      voice: (315) 312-3444
SUNY Oswego                  fax: (315) 312-5444
416 Mahar Hall               http://cook.rfe.org
Oswego, NY 13126

"Been there. Done that."
-- Ed Viesturs as he looked up Mount Everest. He climbed it five times, twice without oxygen. He now plans to be the first American to scale all of the world's 8,000 meter mountains. "Climber for the Ages Has Next Peak in View," New York Times, 2/13/00.
Boost for Occurrences in a Page / Analyze Once?
I'm trying to tweak the search results at my http://ese.rfe.org/ and I've got two questions (I'm running 0.7.2):

- In searching at the above for "unemployment" the leading results have 10 or more occurrences of that word on the page. I'd like to reduce the influence of multiple occurrences of a word on a page and give more weight to links, titles, and such. But, in looking at nutch-default.xml I don't see any obvious parameters for this. I have upped the following to these values:

    indexer.score.power      2.5
    db.score.link.external   4.0
    query.url.boost          2.0
    query.anchor.boost       2.0
    query.title.boost        2.0
    query.phrase.boost       2.0

  As the top links for "unemployment" are state agencies, I think I will switch db.ignore.internal.links back to true as there are more external links to where I would like users to go: http://www.bls.gov .

- In the old 0.7 tutorial, I could swear that the example suggested running "nutch analyze", but it no longer mentions that (it's not on the Internet Archive). I believe it also suggested running it more than once. Thoughts on these lines? I currently run it once after "nutch updatedb", but would more runs aid link analysis?

- Bill
Re: Boost for Occurrences in a Page / Analyze Once?
Andrzej said:

> Nutch 0.7 uses a variant of PageRank link analysis, and the analyze
> tool would perform a couple iterations to propagate the scores along
> links. However, it was a slow and very resource-hungry process, so
> sometimes it was even impossible to go through the analysis step even
> for moderately-sized collections.

Interesting. If this is invoked with

    bin/nutch analyze db_dir 3

(three rounds of analysis) it took about 35 minutes with some 300,000 pages on a dual Xeon machine with 3 gigs of RAM. This is a small share of the time spent fetching, generating segments, etc.

> 0.7 offers also an option to use a static ranking method, which
> doesn't require running the analysis step, and which is based on the
> number of outlinks and inlinks.

Um, it isn't clear how to do this. I don't see anything in http://wiki.apache.org/nutch/CommandLineOptions nor in nutch-default.xml.

> Nutch 0.8 uses scoring plugins, which can implement different scoring
> algorithms. The default one is based on OPIC, which is again a variant
> of link-based quality metrics - please see OPICScoringFilter for more
> details.

That sounds useful. The referenced paper sure makes it sound more efficient.

Thanks and best wishes,

Bill

P.S. Any thoughts on how to downplay repeated instances of a word on a page?
Re: Starting Nutch in init.d?
Thanks -- after a rebuild and redeploy this works fine. The init.d tomcat script works OK, but I have yet to try a reboot as I'm not near the server at the moment.

Thanks,

Bill

Matthew said:

> You don't need to cd to the nutch directory for the startup script.
> All you need to do is edit the nutch-site.xml that is found within the
> nutch servlet and include a searcher directory property that tells
> tomcat where to look for the crawl db. So if you have nutch 0.8, edit
> the file
>
>     TOMCAT_PATH/webapps/NUTCH_DIR/WEB-INF/classes/nutch-site.xml
>
> and include the following:
>
>     <property>
>       <name>searcher.dir</name>
>       <value>/your_index_folder_path</value>
>     </property>
>
> I believe the your_index_folder_path is the path to your crawl
> directory. However, if that doesn't work, make it the path to the
> index folder within your crawl directory. Now, save that and make sure
> your script just starts tomcat on init and everything should work fine
> for you.
>
> Matt
>
> Bill Goffe wrote:
>
> > I'd like to start Nutch automatically when I reboot. I wrote a real
> > rough script (see below) that works on my Debian system when the
> > system is up, but I get nothing on a reboot (and the links are set
> > to the /etc/init.d/nutch). Any hints, ideas, or suggestions? I
> > checked the FAQ and the archive but didn't see anything.
> >
> > In addition, it would be great to get messages going into /var/log
> > to help figure out what is going on but I've had no luck doing that.
> >
> > Thanks,
> >
> > Bill
> >
> > ## Start and stop Nutch. Note how specific it is to
> > ## (i) Tomcat (typically $CATALINA_HOME/bin/shutdown.sh
> > ## or $CATALINA_HOME/bin/startup.sh) and (ii) the
> > ## directory with the most recent fetch results.
> >
> > ## PATH stuff
> > PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
> > PATH=$PATH:/usr/local/SUNWappserver/bin
> > CLASSPATH=/usr/local/SUNWappserver/jdk/jre/lib
> > JAVA_HOME=/usr/local/SUNWappserver/jdk
> > CATALINA_HOME=/usr/local/jakarta-tomcat-5
> > JAVA_OPTS="-Xmx1024m -Xms512m"
> >
> > case "$1" in
> > start)
> >     cd /home/bgoffe/nc/40    ## start in correct directory
> >     /usr/local/jakarta-tomcat-5/bin/startup.sh
> >     ;;
> > stop)
> >     /usr/local/jakarta-tomcat-5/bin/shutdown.sh
> >     ;;
> > force-reload|restart)
> >     /usr/local/jakarta-tomcat-5/bin/shutdown.sh
> >     cd /home/bgoffe/nc/40
> >     /usr/local/jakarta-tomcat-5/bin/startup.sh
> >     ;;
> > *)
> >     echo "Usage: /etc/init.d/nutch {start|stop|force-reload|restart}"
> >     exit 1
> >     ;;
> > esac
> >
> > exit 0
Re: modifying header logo and page content
All I can say is that it works for me. I'll admit that I'm not clear on how these files fit together and I simply experimented to see what would work. Nothing seemed to break. I'd be happy to hear about a better method.

- Bill

Chris said:

> Thanks Bill for all the details. You mentioned that you changed
> nutch-page.xsl. But this page has the following line:
>
>     <xsl:comment>This page is automatically generated. Do not edit!</xsl:comment>
>
> Are we indeed supposed to change the HTML formatting through this
> page, or is there another page or source that we can modify?
>
> Chris
>
> --- Bill Goffe [EMAIL PROTECTED] wrote:
>
> > Chris - Here's my list of what I changed for http://ese.rfe.org.
> > It is kinda terse, but it should give you some decent hints even
> > so. I may well have done some things the hard way, but it seems to
> > work. Don't forget you have to run "ant war" and redeploy the war
> > file. Still working on getting the favicon right...
> >
> > - Bill
> >
> > Files changed (all have *.org in the same directory)
> >   src/web/pages/en/search.xml (1st page)
> >   src/web/pages/en/help.xml
> >   src/web/include/en/header.xml (title)
> >   src/web/include/footer.html
> >   src/web/include/style.html
> >   src/web/style/nutch-page.xsl (main formatting)
> >   src/web/style/nutch-header.xsl
> >   src/web/jsp/search.jsp (results page (help.html URL hardcoded))
> >     (note: really could stand a more thorough change; lots of
> >     stuff could be taken out that isn't; weird formatting if you
> >     look at the html)
> >
> > (this material is older)
> >
> > General Guide: http://lucene.apache.org/nutch/i18n.html
> >
> > Static Page Content
> >   search page: src/web/pages/en/search.xml (just the middle of the
> >     page -- no table, only gif is poweredbynutch_01.gif)
> >   others: src/web/pages/en/ (3 files -- about.xml, help.xml,
> >     search.xml)
> >
> > Header
> >   src/web/include/en/header.xml (only thing in that directory --
> >     just a mere listing of the top 4 items; maybe keep about)
> >
> > Dynamic Pages
> >   anchors_en.properties
> >   cached_en.properties
> >   explain_en.properties
> >   search_en.properties
> >   text_en.properties (is this used?)
> >   -- no changes there (I THINK)
> >
> > Footer
> >   src/web/include/footer.html (in same directory have language
> >     specific pages)
> >
> > Styles
> >   src/web/style/nutch-header.xsl
> >   src/web/style/nutch-page.xsl
> >
> > Style Sheet? (has colors)
> >   src/web/include/style.html
> >
> > Colors
> >   #FF9900 -- kinda bright orange
> >   #F9F7F4 -- the middle color on the page (kinda light orange?)
> >   #ECE5DC -- light orange background color (diff than previous?)
> >
> > Notes
> >   - a change in nutch-page.xsl does indeed show up
> >   - since help on the search page isn't working for me, change to
> >     one at nutch.org?
> >
> > CSS
> >   nutch/docs/api/stylesheet.css (what is indeed needed?)
> >
> > Mods
> >   - keep the main table structure (3 on top of each other)
> >   - changed src/web/include/footer.html to mention Nutch, RFE, EDIRC
> >   - changed src/web/include/style.html -- all F9F7F4 -- FF
> >   - src/web/style/nutch-header.xsl much simplified
> >   - took out gif in src/web/pages/en/search.xml
> >   - took out a table in src/web/style/nutch-page.xsl (has
> >     robots.gif) and also took out <td> starting w/ <td width=20
> >
> > > Hello,
> > >
> > > This must have been mentioned before, but I didn't find it in
> > > past user mailing lists or any documentation on the web. How can
> > > I use a custom logo rather than the nutch-logo.gif? Found the
> > > docs on how to modify i18n content but that doesn't give access
> > > to other parts of the final html. Found the xslt sheets, but
> > > these all have a comment that they are generated from some xml
> > > files. Where are these xml files that control the look and feel
> > > of the nutch pages?
> > >
> > > THANKS for any help!
> > >
> > > Chris
favicon?
At http://ese.rfe.org I've had Nutch running for some time, but I have a minor question: how do I put in my own favicon?

In 0.7.1, I put my favicon.ico in src/site/src/documentation/resources/images/ and docs/img/ (wasn't sure which mattered), did an "ant war", and redeployed the resulting war file. The correct favicon is in webapps/ROOT/img/ and http://ese.rfe.org/favicon.ico shows the correct icon.

But it shows inconsistently in Firefox and Internet Explorer on search results and on http://ese.rfe.org in spite of clearing the cache and history in both (in fact, after clearing them, it now doesn't show!). Also, in Firefox, when I drag the blank icon from the address bar to my list of shortcuts (term?) at the top of the browser, the correct icon shows up there but still not on the address bar. Ugh!

Thanks,

Bill
Re: Nutch Search stats
Nutch doesn't save it, but at least you can find the search terms in your Tomcat logs. Granted, it would take some processing, but it would seem to be useful. Here's an entry from mine today:

    127.0.0.1 - - [21/Apr/2006:08:00:48 -0500] "GET /search.jsp?query=irreversible+investment HTTP/1.1" 200 7176

- Bill

Ravish Bhagdev said:

> No. Not at present (unless someone enlightens me)
>
> R
>
> On 4/21/06, Aled Jones [EMAIL PROTECTED] wrote:
>
> > Hiya all
> >
> > Does nutch save any of the search terms entered for stats purposes?
> > E.g. most commonly used terms and so on.
> >
> > Pity but I can't come to the nutch-user meeting, an 11 hour flight
> > too far! ;-)
> >
> > Cheers
> > Aled
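The "some processing" on the Tomcat access log could look roughly like the following. This is only a sketch under assumptions: it assumes log lines shaped like the single entry quoted in this thread and a request parameter named query, and the function name count_queries and the sample lines are made up for illustration. It is not something Nutch itself provides.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Matches the request part of an access-log line, for example:
#   GET /search.jsp?query=irreversible+investment HTTP/1.1
REQUEST_RE = re.compile(r'GET (\S*/search\.jsp\?\S*) HTTP/')


def count_queries(lines):
    """Tally the 'query' parameter of every search.jsp request."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a search request (images, static pages, ...)
        params = parse_qs(urlsplit(match.group(1)).query)
        for query in params.get("query", []):
            # parse_qs has already turned '+' and %xx escapes into text
            counts[query.strip().lower()] += 1
    return counts


# Hypothetical sample lines, modelled on the entry quoted in the thread.
sample = [
    '127.0.0.1 - - [21/Apr/2006:08:00:48 -0500] "GET /search.jsp?query=irreversible+investment HTTP/1.1" 200 7176',
    '127.0.0.1 - - [21/Apr/2006:08:03:10 -0500] "GET /search.jsp?query=Irreversible+Investment HTTP/1.1" 200 7176',
    '127.0.0.1 - - [21/Apr/2006:08:05:02 -0500] "GET /search.jsp?query=unemployment HTTP/1.1" 200 9120',
    '127.0.0.1 - - [21/Apr/2006:08:05:03 -0500] "GET /img/poweredbynutch_01.gif HTTP/1.1" 200 512',
]

for term, hits in count_queries(sample).most_common():
    print(hits, term)
```

To run it over a real log, pass an open file object instead of the sample list; the name and location of the access log depend on how Tomcat is configured.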
Thanks!
I wanted to say thanks to everybody here (and in the past on the developers list) for help with my Economics Search Engine http://ese.rfe.org. I wouldn't have been able to get it running without the suggestions I've received. I entered it onto the list of Public Servers on the wiki.

- Bill

P.S. Vertical search engines seem to be on something of an upswing. There was an article in the Wall Street Journal on Monday, 12/19/05: "Beyond Google," Kevin Delaney, p. R1 http://online.wsj.com/article/SB113459260842822579.html (likely need an account with them) and Jupiter will be having a web cast on them http://www.jupiterwebevents.com/webcasts/looksmart_012506.html .
Re: Nutch Tomcat5 org.apache.jasper.JasperException
, javax.servlet.http.HttpServletResponse, java.lang.String, java.lang.Throwable, boolean) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
org.apache.jasper.servlet.JspServlet.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
javax.servlet.http.HttpServlet.service(javax.servlet.ServletRequest, javax.servlet.ServletResponse) (/usr/lib/libservletapi5-5.0.30.jar.so)
org.apache.catalina.valves.ErrorReportValve.invoke(org.apache.catalina.Request, org.apache.catalina.Response, org.apache.catalina.ValveContext) (/usr/lib/libcatalina-5.0.30.jar.so)
org.apache.coyote.tomcat5.CoyoteAdapter.service(org.apache.coyote.Request, org.apache.coyote.Response) (/usr/lib/libcatalina-5.0.30.jar.so)
org.apache.coyote.http11.Http11Processor.process(java.io.InputStream, java.io.OutputStream) (/usr/lib/libtomcat-http11-5.0.30.jar.so)
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(org.apache.tomcat.util.net.TcpConnection, java.lang.Object[]) (/usr/lib/libtomcat-http11-5.0.30.jar.so)
org.apache.tomcat.util.net.TcpWorkerThread.runIt(java.lang.Object[]) (/tmp/libtomcat-util-5.0.30.jar.so46imxs.so)
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() (/tmp/libtomcat-util-5.0.30.jar.so46imxs.so)
java.lang.Thread.run() (/usr/lib/libgcj.so.6.0.0)

I have no idea what is causing this error. Any ideas? If it makes a difference I am running this installation on a FC4 box. Thanks for any insight anyone can provide.

Mike
Re: Nutch Tomcat5 org.apache.jasper.JasperException
RJ said:

> (nutch-0.7.1 won't recompile on my setup which is, freebsd5.4,
> apache-ant and jdk1.4.2. However, nutch-0.7 will compile for me and
> that is why I'm using it)

I believe that if you go to src/plugin/nutch-extensionpoints and create the empty directories src and src/java, ant will build it.

- Bill
Re: ATB: Nutch webapp not at root context.
Not knowing about the solutions posted here, I had a similar problem with the value of base href. I'm proxying from Apache2, so I put the following in my apache2.conf. There indeed must be better solutions, but this does work. I did need to add the Apache module ext_filter.

- Bill

    ## Change instances of localhost:8080 to rfe.org
    ## (needed for base href in pages from Nutch)
    ExtFilterDefine fixtext mode=output intype=text/html \
        cmd="/bin/sed s/localhost:8080/rfe.org/g"
    <Location />
    SetOutputFilter fixtext
    </Location>

Aled said:

> I just renamed the war, deployed it and it all works ok except that
> when I click Search instead of calling the jsp relative to its
> location it's calling the jsp at root.
>
> E.g. Create war and call mycontext.war, deploy to have
> http://localhost:8080/mycontext/
>
> Go to link above, index.jsp works fine and nutch start page shows up.
> However, once I click Search it jumps to the context at root i.e.
> http://localhost:8080/Search.jsp
>
> Is there something I should be defining when building the war file or
> in web.xml or the nutch-site.xml??
Re: Proxy base href Problem
Hm. I tried that (thanks!), but no go. I wonder if one needs to compile it? After too much thrashing around I ended up adding this to my apache2.conf file:

    ExtFilterDefine fixtext mode=output intype=text/html \
        cmd="/bin/sed s/localhost:8080/rfe.org/g"
    <Location />
    # core directive to cause the fixtext filter to
    # be run on output
    SetOutputFilter fixtext
    </Location>

(straight from http://httpd.apache.org/docs/2.0/mod/mod_ext_filter.html -- you'll need mod_ext_filter installed for it to work). mod_proxy_html _should_ be able to do it, but I wasn't able to get it to run; the above is much more brute force but works.

Below is what else I did to get Nutch running at http://rfe.org:8080 to be seen as http://rfe.org/search by the outside world.

- Bill

Added to apache2.conf:

    <IfModule mod_proxy.c>
    ProxyRequests Off
    <Proxy *>
    Order deny,allow
    Allow from all
    </Proxy>
    ProxyPass /search http://localhost:8080/en/search.html
    ProxyPassReverse /search http://localhost:8080/en/search.html
    ProxyPass /search.jsp http://localhost:8080/search.jsp
    ProxyPassReverse /search.jsp http://localhost:8080/search.jsp
    ProxyPass /cached.jsp http://localhost:8080/cached.jsp
    ProxyPassReverse /cached.jsp http://localhost:8080/cached.jsp
    ProxyPass /explain.jsp http://localhost:8080/explain.jsp
    ProxyPassReverse /explain.jsp http://localhost:8080/explain.jsp
    ProxyPass /anchors.jsp http://localhost:8080/anchors.jsp
    ProxyPassReverse /anchors.jsp http://localhost:8080/anchors.jsp
    ProxyPass /img/reiter http://localhost:8080/img/reiter
    ProxyPassReverse /img/reiter http://localhost:8080/img/reiter
    ProxyPass /img http://localhost:8080/img
    ProxyPassReverse /img http://localhost:8080/img
    ProxyPass /en http://localhost:8080/en
    ProxyPassReverse /en http://localhost:8080/en
    </IfModule>

Added the following to server.xml for Tomcat:

    <Connector className="org.apache.catalina.connector.http.HttpConnector"
               port="8080" minProcessors="5" maxProcessors="75"
               enableLookups="true" acceptCount="100" debug="0"
               connectionTimeout="2" proxyName="rfe.org"
               proxyPort="80" useURIValidationHack="false" />

Jake said:

> Bill,
>
> Take a look at search.jsp. It looks like it sets the base href based
> on the name you hit it as. Since you're proxying back to
> localhost:8080, that's what it sets as the base href. I think you
> should be able to just hardcode that to be rfe.org, restart tomcat and
> it should work.
>
> Current Code:
>
>     String base = requestURI.substring(0, requestURI.lastIndexOf('/'));
>     ...
>     <base href="<%= base + "/" + language %>/">
>
> My Suggestion:
>
>     String base = "rfe.org";
>     ...
>     <base href="<%= base + "/search/" + language %>/">
>
> Jake.
>
> -----Original Message-----
> From: Bill Goffe [mailto:[EMAIL PROTECTED]
> Sent: Thursday, November 10, 2005 2:42 AM
> To: nutch-user@lucene.apache.org
> Subject: Proxy base href Problem
>
> > I'm having a specific problem running Nutch through a proxy server.
> > I run http://rfe.org (a directory of resources for economists). I'd
> > like to run Nutch with data from the web sites listed in rfe and
> > other sites out of http://rfe.org/search (as you might guess, Nutch
> > would be a very nice thing to add to a directory of resources). I've
> > currently got 1.4M pages in a database and by tweaking various
> > parameters in nutch-default.xml I'm getting pretty good results and
> > would really like to move to production use.
> >
> > At any rate, in my apache2.conf file I have a bunch of ProxyPass and
> > ProxyPassReverse statements for the different files and directories
> > that I'd like to pass from Tomcat to Apache. I also added a
> > connector in Tomcat's server.xml. This almost works (feel free to
> > try), but for one problem -- the base href in pages returned by
> > Nutch:
> >
> >     <base href="http://localhost:8080/en/">
> >
> > Clearly, this messes up relative URLs in the pages Nutch returns.
> >
> > How could I go about changing the base URL in pages that Nutch
> > returns? I don't see anything obvious in Tomcat's configuration
> > (likely wouldn't want it there anyway I suspect), nutch/conf,
> > mod_proxy, mod_proxy_html, or mod_rewrite.
> >
> > Thanks a bunch,
> >
> > Bill
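To see why the base href matters, here is a small sketch of how a browser resolves a relative link against it, and of what the sed filter's substitution changes. It is an illustration only: the relative link ../search.jsp?query=unemployment is a hypothetical example rather than a link taken from an actual Nutch page, and the helper names resolve and fix_base are invented here.

```python
from urllib.parse import urljoin


def resolve(base_href, relative_link):
    """Resolve a relative link the way a browser does against <base href>."""
    return urljoin(base_href, relative_link)


def fix_base(html, internal="localhost:8080", public="rfe.org"):
    """Same plain text substitution as: sed s/localhost:8080/rfe.org/g"""
    return html.replace(internal, public)


# The base tag quoted in the thread, as Tomcat emits it behind the proxy.
page = '<base href="http://localhost:8080/en/">'

# Hypothetical relative link on a results page.
link = "../search.jsp?query=unemployment"

# Before the filter: the link resolves to the internal host, which outside
# visitors cannot reach.
print(resolve("http://localhost:8080/en/", link))

# After the filter: the base tag names the public host, so the same link
# resolves to a URL that the ProxyPass rules above can map back to Tomcat.
print(fix_base(page))
print(resolve("http://rfe.org/en/", link))
```

The substitution is deliberately crude: it rewrites every occurrence of the internal host name in the HTML, not only the one inside the base tag, which matches the "brute force" description of the sed approach.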
nutch readdb db -stats Pages Fetched Not Consistent
Now that I've got my reverse proxy up, one less pressing question. With

    ~/nutch/bin/nutch readdb db -stats

I get

    Number of pages: 1,399,730
    Number of links: 5,369,361

Yet when I search my log (I have a perl script that outputs everything to it) with

    grep fetching log | wc -l

I get 313,998 (commas added). I'm thinking that "readdb db -stats" isn't counting the number of downloaded pages -- am I correct?

- Bill
Re: where is catalina.sh?
-5.0.30.jar.so)
org.apache.tomcat.util.net.TcpWorkerThread.runIt(java.lang.Object[]) (/tmp/libtomcat-util-5.0.30.jar.so5gnaze.so)
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() (/tmp/libtomcat-util-5.0.30.jar.so5gnaze.so)
java.lang.Thread.run() (/usr/lib/libgcj.so.6.0.0)