Re: Lucene query support in Nutch

2006-10-10 Thread Bill Goffe
Tomi said:

 In conclusion, my position is pragmatic: I welcome the simplest
 solution to implement the OR search. I just believe that it'd be
 easiest to do that by extending the Nutch Analyzer.

This seems like a very reasonable approach. I too would very much like
OR. It would also be nice if it worked in 0.7.2 and I could drop it in,
but that may be asking for too much.

 - Bill

-- 
Bill Goffe                    [EMAIL PROTECTED]
Department of Economics       voice: (315) 312-3444
SUNY Oswego                   fax:   (315) 312-5444
416 Mahar Hall                http://cook.rfe.org
Oswego, NY  13126

Been there. Done that.
  -- Ed Viesturs as he looked up Mount Everest. He climbed it five times,
     twice without oxygen. He now plans to be the first American to scale
     all of the world's 8,000 meter mountains. Climber for the Ages Has
     Next Peak in View, New York Times, 2/13/00.



Boost for Occurrences in a Page / Analyze Once?

2006-09-21 Thread Bill Goffe
I'm trying to tweak the search results at my http://ese.rfe.org/ and I've
got two questions (I'm running 0.7.2):

  - When searching there for "unemployment", the leading results have
    10 or more occurrences of that word on the page. I'd like to reduce
    the influence of multiple occurrences of a word on a page and give more
    weight to links, titles, and such. But in looking at
    nutch-default.xml I don't see any obvious parameters for this. I have
    raised the following to these values:
      indexer.score.power      2.5
      db.score.link.external   4.0
      query.url.boost          2.0
      query.anchor.boost       2.0
      query.title.boost        2.0
      query.phrase.boost       2.0
    As the top links for "unemployment" are state agencies, I think I will
    switch db.ignore.internal.links back to true, as there are more external
    links to where I would like users to go: http://www.bls.gov .

  - In the old 0.7 tutorial, I could swear that the example suggested
    running nutch analyze, but it no longer mentions that (it's not on
    the Internet Archive). I believe it also suggested running it more
    than once. Thoughts along these lines? I currently run it once after
    nutch updatedb, but would more runs aid link analysis?
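
To make the first question concrete, here is a small Python sketch of what
"reducing the influence of multiple occurrences" means numerically. It is
not Nutch's or Lucene's actual scoring code: the function names and the toy
score are hypothetical, and the square-root damping is only assumed to
resemble what a Lucene-style similarity does with term frequency.

```python
import math


def raw_tf(freq):
    """No damping: ten occurrences count ten times as much as one."""
    return float(freq)


def sqrt_tf(freq):
    """Square-root damping (assumed to resemble a Lucene-style similarity)."""
    return math.sqrt(freq)


def log_tf(freq):
    """Stronger damping: 1 + ln(freq), and 0 when the term is absent."""
    return 1.0 + math.log(freq) if freq > 0 else 0.0


def toy_score(body_freq, title_hit, title_boost=2.0, tf=sqrt_tf):
    """Hypothetical score: damped body frequency plus a flat title boost."""
    return tf(body_freq) + (title_boost if title_hit else 0.0)


if __name__ == "__main__":
    for freq in (1, 4, 16):
        print(freq, raw_tf(freq), sqrt_tf(freq), round(log_tf(freq), 2))
```

With damping like this, a page with one body hit plus a title match can
outrank a page that merely repeats the word several times in the body.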

 - Bill




Re: Boost for Occurrences in a Page / Analyze Once?

2006-09-21 Thread Bill Goffe
Andrzej said:

 Nutch 0.7 uses a variant of PageRank link analysis, and the analyze tool 
 would perform a couple of iterations to propagate the scores along links. 
 However, it was a slow and very resource-hungry process, so sometimes it 
 was impossible to get through the analysis step even for 
 moderately sized collections. 

Interesting. When I invoke it with bin/nutch analyze db_dir 3 (three
rounds of analysis), it takes about 35 minutes for some 300,000 pages on a
dual Xeon machine with 3 gigs of RAM. This is a small share of the time
spent fetching, generating segments, etc.
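
For readers wondering why the number of rounds matters, here is a minimal
Python sketch of generic PageRank-style score propagation. It is textbook
PageRank, not the code behind bin/nutch analyze; the function name, the
damping factor of 0.85, and the tiny link graph are all assumptions made
for illustration.

```python
def pagerank(links, rounds=3, damping=0.85):
    """Propagate scores along links for a fixed number of rounds.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}

    for _ in range(rounds):
        # Every page keeps a small baseline score each round.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among its outlinks.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page without outlinks spreads its score over all pages.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for rounds in (1, 2, 3):
        print(rounds, pagerank(graph, rounds=rounds))
```

Each round moves scores one link further, so scores after one round and
after three rounds can differ noticeably on the same graph.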

 0.7 offers also an option to use a static ranking method, which doesn't
 require running the analysis step, and which is based on the number of
 outlinks and inlinks.

Um, it isn't clear how to do this. I don't see anything in
http://wiki.apache.org/nutch/CommandLineOptions or in nutch-default.xml.

 Nutch 0.8 uses scoring plugins, which can implement different scoring 
 algorithms. The default one is based on OPIC, which is again a variant 
 of link-based quality metrics - please see OPICScoringFilter for more 
 details.

That sounds useful. The referenced paper certainly makes it sound more
efficient.

Thanks and best wishes,

   Bill

P.S. Any thoughts on how to downplay repeated instances of a word on 
 a page?




Re: Starting Nutch in init.d?

2006-07-28 Thread Bill Goffe
Thanks -- after a rebuild and redeploy this works fine. The init.d tomcat
script works OK, but I have yet to try a reboot as I'm not near the server
at the moment.

Thanks,

 Bill


Matthew said:

 You don't need to cd to the nutch directory for the startup script. All 
 you need to do is edit the nutch-site.xml that is found within the nutch 
 servlet and include a searcher directory property that tells tomcat 
 where to look for the crawl db.
 
 So if you have nutch 0.8, edit the file 
 TOMCAT_PATH/webapps/NUTCH_DIR/WEB-INF/classes/nutch-site.xml and include 
 the following:
 
 <property>
   <name>searcher.dir</name>
   <value>/your_index_folder_path</value>
 </property>
 
 
 I believe the your_index_folder_path is the path to your crawl 
 directory.  However, if that doesn't work, make it the path to the index 
 folder within your crawl directory.
 
 Now, save that and make sure your script just starts tomcat on init and 
 everything should work fine for you.
 
 Matt
 
 
 Bill Goffe wrote:
 I'd like to start Nutch automatically when I reboot. I wrote a real rough
 script (see below) that works on my Debian system when the system is up,
 but I get nothing on a reboot (and the links are set to the
 /etc/init.d/nutch).  Any hints, ideas, or suggestions? I checked the FAQ
 and the archive but didn't see anything. In addition, it would be great to
 get messages going into /var/log to help figure out what is going on but
 I've had no luck doing that.
 
 Thanks,
 
Bill
 
 ## Start and stop Nutch. Note how specific it is to
 ## (i) Tomcat (typically $CATALINA_HOME/bin/shutdown.sh
 ## or $CATALINA_HOME/bin/startup.sh) and (ii) the
 ## directory with the most recent fetch results.
 
 ## PATH stuff
 PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
 PATH=$PATH:/usr/local/SUNWappserver/bin
 CLASSPATH=/usr/local/SUNWappserver/jdk/jre/lib
 JAVA_HOME=/usr/local/SUNWappserver/jdk
 CATALINA_HOME=/usr/local/jakarta-tomcat-5
 JAVA_OPTS="-Xmx1024m -Xms512m"
 
 case "$1" in
 start)
   cd /home/bgoffe/nc/40  ## start in correct directory
   /usr/local/jakarta-tomcat-5/bin/startup.sh
   ;;
 
 stop)
  /usr/local/jakarta-tomcat-5/bin/shutdown.sh
  ;;
 
 force-reload|restart)
   /usr/local/jakarta-tomcat-5/bin/shutdown.sh
   cd /home/bgoffe/nc/40
   /usr/local/jakarta-tomcat-5/bin/startup.sh
   ;;
 
 *)
 echo "Usage: /etc/init.d/nutch {start|stop|force-reload|restart}"
 exit 1
 ;;
 
 esac
 
 exit 0
 
   




Re: modifying header logo and page content

2006-04-23 Thread Bill Goffe
All I can say is that it works for me. I'll admit that I'm not clear on
how these files fit together and I simply experimented to see what would
work. Nothing seemed to break.

I'd be happy to hear about a better method.

- Bill


Chris said:

 Thanks Bill for all the details.
 
 You mentioned that you changed nutch-page.xsl. But
 this page has the following line:
 
 <xsl:comment>This page is automatically generated.  Do
 not edit!</xsl:comment>
 
 Are we indeed supposed to change the HTML formatting
 through this page, or is there another page or source that we
 can modify?
 
 Chris
 
 --- Bill Goffe [EMAIL PROTECTED] wrote:
 
  Chris -
    Here's my list of what I changed for http://ese.rfe.org. It is kinda
  terse, but it should give you some decent hints even so. I may well
  have done some things the hard way, but it seems to work.
    Don't forget you have to run ant war and redeploy the war file.
    Still working on getting the favicon right...
        - Bill
  
  Files changed (all have *.org in the same directory)
  src/web/pages/en/search.xml  (1st page)
  src/web/pages/en/help.xml
  src/web/include/en/header.xml  (title)
  src/web/include/footer.html
  src/web/include/style.html
  src/web/style/nutch-page.xsl (main formatting)
  src/web/style/nutch-header.xsl
  src/web/jsp/search.jsp (results page (help.html URL hardcoded))
    (note: really could stand a more thorough change; lots
    of stuff could be taken out that isn't; weird formatting
    if you look at the html)
  
   (this material is older) -
  
  General Guide:
  http://lucene.apache.org/nutch/i18n.html
  
  Static Page Content
  search page: src/web/pages/en/search.xml (just the middle of the page --
    no table, only gif is poweredbynutch_01.gif)
  others: src/web/pages/en/ (3 files -- about.xml, help.xml, search.xml)
  
  Header
  src/web/include/en/header.xml (only thing in that directory -- just a mere
    listing of the top 4 items; maybe keep about)
  
  Dynamic Pages
  anchors_en.properties
  cached_en.properties
  explain_en.properties
  search_en.properties
  text_en.properties (is this used?)
    -- no changes there (I THINK)
  
  Footer
  src/web/include/footer.html
    (in same directory have language specific pages)
  
  Styles
  src/web/style/nutch-header.xsl
  src/web/style/nutch-page.xsl
  
  Style Sheet? (has colors)
  src/web/include/style.html
  
  Colors
    #FF9900 -- kinda bright orange
    #F9F7F4 -- the middle color on the page (kinda light orange?)
    #ECE5DC -- light orange background color (diff than previous?)
  
  Notes
    - a change in nutch-page.xsl does indeed show up
    - since help on the search page isn't working for me, change to one
      at nutch.org?
  
  CSS
    nutch/docs/api/stylesheet.css (what is indeed needed?)
  
  Mods
    - keep the main table structure (3 on top of each other)
    - changed src/web/include/footer.html to mention Nutch, RFE,  EDIRC
    - changed src/web/include/style.html -- all F9F7F4 -- FF
    - src/web/style/nutch-header.xsl much simplified
    - took out gif in src/web/pages/en/search.xml
    - took out a table in src/web/style/nutch-page.xsl (has robots.gif)
      and also took out td starting w/ td width=20
  
  
   Hello,
   
   This must have been mentioned before, but I didn't find it
   in past user mailing lists or any documentation on the
   web.
   
   How can I use a custom logo rather than the
   nutch-logo.gif? I found the docs on how to modify i18n
   content, but that doesn't give access to other parts of
   the final html.
   
   I found the xslt sheets, but these all have a comment
   that they are generated from some xml files. Where are
   these xml files that control the look and feel of the
   nutch pages?
   
   THANKS for any help!
   
   Chris
  

favicon?

2006-04-21 Thread Bill Goffe
At http://ese.rfe.org I've had Nutch running for some time, but I have a
minor question: how do I put in my own favicon? In 0.7.1, I put my
favicon.ico in src/site/src/documentation/resources/images/ and docs/img/
(wasn't sure which mattered), did an ant war, and redeployed the resulting
war file. The correct favicon is in webapps/ROOT/img/ and
http://ese.rfe.org/favicon.ico shows the correct icon.

But it shows inconsistently in Firefox and Internet Explorer on search
results and on http://ese.rfe.org in spite of clearing the cache and
history in both (in fact, after clearing them, it now doesn't show!).
Also, in Firefox, when I drag the blank icon from the address bar to my
list of shortcuts (term?) at the top of the browser, the correct icon
shows up there but still not on the address bar. Ugh!

Thanks,

   Bill




Re: Nutch Search stats

2006-04-21 Thread Bill Goffe
Nutch doesn't save it, but at least you can find the search terms in your
Tomcat logs. Granted, it would take some processing, but it would seem to
be useful. Here's an entry from mine today:
  127.0.0.1 - - [21/Apr/2006:08:00:48 -0500] "GET
  /search.jsp?query=irreversible+investment HTTP/1.1" 200 7176
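
As a rough idea of the processing involved, here is a minimal Python
sketch that tallies search terms from access-log lines shaped like the
entry above. The function name and the regular expression are assumptions;
it presumes each request sits on a single log line and that queries arrive
as the query parameter of search.jsp, as in that entry.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the request part of an access-log line, e.g.
#   "GET /search.jsp?query=foo HTTP/1.1"
REQUEST_RE = re.compile(r'"?GET\s+(\S+)\s+HTTP/[\d.]+"?')


def search_terms(log_lines):
    """Count the query strings sent to search.jsp in the given log lines."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        if not url.path.endswith("/search.jsp"):
            continue
        # parse_qs decodes '+' and %XX escapes in the query string.
        for query in parse_qs(url.query).get("query", []):
            counts[query.strip().lower()] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '127.0.0.1 - - [21/Apr/2006:08:00:48 -0500] '
        '"GET /search.jsp?query=irreversible+investment HTTP/1.1" 200 7176',
    ]
    for term, hits in search_terms(sample).most_common():
        print(hits, term)
```

Feeding it the lines of a real log file (for example via open(path)) would
give the most common terms with most_common().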

- Bill


Ravish Bhagdev said:

 No.  Not at present (unless someone enlightens me)
 
 R
 
 
 On 4/21/06, Aled Jones [EMAIL PROTECTED] wrote:
 
  Hiya all
 
  Does nutch save any of the search terms entered for stats purposes? E.g.
  most commonly used terms and so on.
 
  Pity but I can't come to the nutch-user meeting, an 11 hour flight too
  far! ;-)
 
  Cheers
  Aled
 
 
 
 
 
 




Thanks!

2005-12-20 Thread Bill Goffe
I wanted to say thanks to everybody here (and in the past on the
developers list) for help with my Economics Search Engine
http://ese.rfe.org. I wouldn't have been able to get it running without
the suggestions I've received.

I entered it onto the list of Public Servers on the wiki.

 - Bill

P.S. Vertical search engines seem to be on something of an upswing. There
 was an article in the Wall Street Journal on Monday, 12/19/05: 
 Beyond Google, Kevin Delaney, p. R1 
 http://online.wsj.com/article/SB113459260842822579.html (likely need
 an account with them) and Jupiter will be having a web cast on them
 http://www.jupiterwebevents.com/webcasts/looksmart_012506.html .




Re: Nutch Tomcat5 org.apache.jasper.JasperException

2005-12-12 Thread Bill Goffe
,
   javax.servlet.http.HttpServletResponse, java.lang.String,
   java.lang.Throwable, boolean) (/usr/lib/libjasper5-compiler-5.0.30.jar.so)
   org.apache.jasper.servlet.JspServlet.service(
   javax.servlet.http.HttpServletRequest,
   javax.servlet.http.HttpServletResponse) (/usr/lib/libjasper5-
   compiler-5.0.30.jar.so)
   javax.servlet.http.HttpServlet.service(
   javax.servlet.ServletRequest, javax.servlet.ServletResponse)
   (/usr/lib/libservletapi5-5.0.30.jar.so)
   org.apache.catalina.valves.ErrorReportValve.invoke(
   org.apache.catalina.Request, org.apache.catalina.Response,
   org.apache.catalina.ValveContext) (/usr/lib/libcatalina-5.0.30.jar.so)
   org.apache.coyote.tomcat5.CoyoteAdapter.service(
   org.apache.coyote.Request, org.apache.coyote.Response)
   (/usr/lib/libcatalina-5.0.30.jar.so)
   org.apache.coyote.http11.Http11Processor.process(
   java.io.InputStream, java.io.OutputStream) (/usr/lib/libtomcat-
   http11-5.0.30.jar.so)
  
   org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection
   (org.apache.tomcat.util.net.TcpConnection, java.lang.Object[])
   (/usr/lib/libtomcat-http11-5.0.30.jar.so)
   
   org.apache.tomcat.util.net.TcpWorkerThread.runIt(java.lang.Object[])
   (/tmp/libtomcat-util-5.0.30.jar.so46imxs.so)
   org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
   (/tmp/libtomcat-util-5.0.30.jar.so46imxs.so)
   java.lang.Thread.run() (/usr/lib/libgcj.so.6.0.0)
  
   I have no idea what is causing this error. Any ideas? If it makes a
   difference, I am running this installation on an FC4 box.
   Thanks for any insight anyone can provide.
   Mike
  




Re: Nutch Tomcat5 org.apache.jasper.JasperException

2005-12-10 Thread Bill Goffe
RJ said:

 (nutch-0.7.1 won't recompile on my setup which is, freebsd5.4,
 apache-ant and jdk1.4.2. However, nutch-0.7 will compile for me and that
 is why I'm using it)

I believe that if you go to src/plugin/nutch-extensionpoints and create
the empty directories src and src/java, ant will build it.
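
For anyone scripting that workaround, a minimal Python sketch follows. The
function name and the nutch_root argument are hypothetical; the directory
names are the ones mentioned above.

```python
from pathlib import Path


def ensure_extensionpoints_dirs(nutch_root):
    """Create the empty src and src/java directories under the plugin dir."""
    target = (Path(nutch_root) / "src" / "plugin"
              / "nutch-extensionpoints" / "src" / "java")
    # parents=True also creates the intermediate src directory;
    # exist_ok=True makes the call safe to repeat.
    target.mkdir(parents=True, exist_ok=True)
    return target


if __name__ == "__main__":
    # "nutch-0.7.1" is a placeholder for wherever the source tree lives.
    print(ensure_extensionpoints_dirs("nutch-0.7.1"))
```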

 - Bill




Re: ATB: Nutch webapp not at root context.

2005-11-18 Thread Bill Goffe
Not knowing about the solutions posted here, I had a similar problem with
the value of base href. I'm proxying from Apache2, so I put the following
in my apache2.conf. There indeed must be better solutions, but this does
work. I did need to add the apache module ext_filter.

 - Bill

## Change instances of localhost:8080 to rfe.org
## (needed for base href in pages from Nutch)
ExtFilterDefine fixtext mode=output intype=text/html \
    cmd="/bin/sed s/localhost:8080/rfe.org/g"

<Location />
SetOutputFilter fixtext
</Location>
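
To show what this filter does to a page, here is a small Python sketch of
the same substitution. It is an illustration only, not part of the Apache
setup: both function names are hypothetical, and the second function is a
narrower variant that touches only the base tag, offered as one possible
refinement rather than something tested against Nutch.

```python
import re


def rewrite_all(html, internal="localhost:8080", public="rfe.org"):
    """Same effect as the sed command: replace every occurrence."""
    return html.replace(internal, public)


BASE_RE = re.compile(r'(<base\s+href=")([^"]*)(")', re.IGNORECASE)


def rewrite_base_only(html, internal="localhost:8080", public="rfe.org"):
    """Narrower variant: rewrite the host only inside <base href="...">."""
    def fix(match):
        return (match.group(1)
                + match.group(2).replace(internal, public)
                + match.group(3))
    return BASE_RE.sub(fix, html)


if __name__ == "__main__":
    page = ('<head><base href="http://localhost:8080/en/"></head>'
            '<body>see localhost:8080</body>')
    print(rewrite_all(page))
    print(rewrite_base_only(page))
```

The brute-force version also rewrites any body text that happens to
mention the internal host, which is the trade-off of the sed approach.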


Aled said:

 I just renamed the war, deployed it and it all works ok except that when
 I click Search, instead of calling the jsp relative to its location it's
 calling the jsp at root.  E.g. create the war and call it mycontext.war,
 deploy to have http://localhost:8080/mycontext/. Go to the link above:
 index.jsp works fine and the nutch start page shows up.  However, once I
 click Search it jumps to the context at root, i.e.
 http://localhost:8080/Search.jsp
 
 Is there something I should be defining when building the war file or in
 web.xml or the nutch-site.xml??




Re: Proxy base href Problem

2005-11-10 Thread Bill Goffe
Hm. I tried that (thanks!), but no go. I wonder if one needs to compile
it? After too much thrashing around I ended up adding this to my
apache2.conf file:
  ExtFilterDefine fixtext mode=output intype=text/html \
      cmd="/bin/sed s/localhost:8080/rfe.org/g"
  <Location />
  # core directive to cause the fixtext filter to
  # be run on output
  SetOutputFilter fixtext
  </Location>

(straight from http://httpd.apache.org/docs/2.0/mod/mod_ext_filter.html --
you'll need mod_ext_filter installed for it to work). mod_proxy_html
_should_ be able to do it, but I wasn't able to get it to run; the above
is much more brute force but works. Below is what else I did to get Nutch
running at http://rfe.org:8080 to be seen as http://rfe.org/search by the
outside world.

- Bill

Added to apache2.conf:

<IfModule mod_proxy.c>
  ProxyRequests Off
  <Proxy *>
    Order deny,allow
    Allow from all
  </Proxy>
  ProxyPass /search http://localhost:8080/en/search.html
  ProxyPassReverse /search http://localhost:8080/en/search.html
  ProxyPass /search.jsp http://localhost:8080/search.jsp
  ProxyPassReverse /search.jsp http://localhost:8080/search.jsp
  ProxyPass /cached.jsp http://localhost:8080/cached.jsp
  ProxyPassReverse /cached.jsp http://localhost:8080/cached.jsp
  ProxyPass /explain.jsp http://localhost:8080/explain.jsp
  ProxyPassReverse /explain.jsp http://localhost:8080/explain.jsp
  ProxyPass /anchors.jsp http://localhost:8080/anchors.jsp
  ProxyPassReverse /anchors.jsp http://localhost:8080/anchors.jsp
  ProxyPass /img/reiter http://localhost:8080/img/reiter
  ProxyPassReverse /img/reiter http://localhost:8080/img/reiter
  ProxyPass /img http://localhost:8080/img
  ProxyPassReverse /img http://localhost:8080/img
  ProxyPass /en http://localhost:8080/en
  ProxyPassReverse /en http://localhost:8080/en
</IfModule>

Added the following to server.xml for Tomcat:
  <Connector className="org.apache.catalina.connector.http.HttpConnector"
             port="8080" minProcessors="5" maxProcessors="75"
             enableLookups="true"
             acceptCount="100" debug="0" connectionTimeout="2"
             proxyName="rfe.org" proxyPort="80"
             useURIValidationHack="false" />

Jake said:

 Bill,
 
   Take a look at search.jsp.  It looks like it sets the base href
 based on the name you hit it as.  Since you're proxying back to
 localhost:8080, that's what it sets as the base href.  I think you
 should be able to just hardcode that to be rfe.org, restart tomcat and
 it should work.
 
 Current Code:
   String base = requestURI.substring(0, requestURI.lastIndexOf('/'));
   ...
   <base href="<%= base  + "/" + language %>/">
 
 My Suggestion:
   String base = "rfe.org";
   ...
   <base href="<%= base  + "/search/" + language %>/">
 
 Jake.
 
 -----Original Message-----
 From: Bill Goffe [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, November 10, 2005 2:42 AM
 To: nutch-user@lucene.apache.org
 Subject: Proxy & base href Problem
 
 I'm having a specific problem running Nutch through a proxy server. I run
 http://rfe.org (a directory of resources for economists). I'd like to run
 Nutch with data from the web sites listed in rfe and other sites out of
 http://rfe.org/search (as you might guess, Nutch would be a very nice
 thing to add to a directory of resources). I've currently got 1.4M pages
 in a database and by tweaking various parameters in nutch-default.xml I'm
 getting pretty good results and would really like to move to production
 use.
 
 At any rate, in my apache2.conf file I have a bunch of ProxyPass and
 ProxyPassReverse statements for the different files & directories that I'd
 like to pass from Tomcat to Apache. I also added a connector in Tomcat's
 server.xml.
 
 This almost works (feel free to try), but for one problem -- the base
 href in pages returned by Nutch: <base href="http://localhost:8080/en/">.
 Clearly, this messes up relative URLs in the pages Nutch returns. How
 could I go about changing the base URL in pages that Nutch returns? I
 don't see anything obvious in Tomcat's configuration (likely wouldn't want
 it there anyway, I suspect), nutch/conf, mod_proxy, mod_proxy_html, or
 mod_rewrite. 
 
 Thanks a bunch,
 
Bill
 

nutch readdb db -stats Pages Fetched Not Consistent

2005-11-10 Thread Bill Goffe
Now that I've got my reverse proxy up, one less pressing question. With
  ~/nutch/bin/nutch readdb db -stats
I get
  Number of pages: 1,399,730
  Number of links: 5,369,361
Yet when I search my log (I have a perl script that outputs everything to
it) with
  grep fetching log | wc -l
I get
  313,998
(commas added).

I'm thinking that readdb db -stats isn't counting the number of downloaded
pages -- am I correct?
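
One way to sanity-check the log side of that comparison is to count unique
URLs as well as matching lines, since refetches would inflate the grep
figure. Here is a minimal Python sketch; the function name is hypothetical,
and it assumes each fetch is logged as the word "fetching" followed by a
space and the URL, which is an assumption about the log format rather than
something confirmed here.

```python
def count_fetches(log_lines, marker="fetching "):
    """Return (matching lines, distinct URLs) for lines containing marker."""
    total = 0
    unique_urls = set()
    for line in log_lines:
        index = line.find(marker)
        if index == -1:
            continue
        total += 1
        # Take the first whitespace-delimited token after the marker as the URL.
        rest = line[index + len(marker):].split()
        if rest:
            unique_urls.add(rest[0])
    return total, len(unique_urls)


if __name__ == "__main__":
    sample = [
        "fetching http://example.org/a",
        "fetching http://example.org/b",
        "fetching http://example.org/a",
        "some other log line",
    ]
    print(count_fetches(sample))
```

Note that the marker includes a trailing space, so it is slightly stricter
than a plain grep for "fetching".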

 - Bill




Re: where is catalina.sh?

2005-11-09 Thread Bill Goffe
-5.0.30.jar.so)
 
 org.apache.tomcat.util.net.TcpWorkerThread.runIt(java.lang.Object[])
 (/tmp/libtomcat-util-5.0.30.jar.so5gnaze.so)
 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
 (/tmp/libtomcat-util-5.0.30.jar.so5gnaze.so)
   java.lang.Thread.run() (/usr/lib/libgcj.so.6.0.0)
 
 
 
 
 
   
   
