RE: New Installation - Problems - Error 500

2008-01-30 Thread Paul Stewart
That's wonderful - what a great list! You guys respond very quickly... Now I gotta get back to reading the docs as I'm sure most of what I just asked is already in there...;) Best! Paul -Original Message- From: John Mendenhall [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 29, 2008

Simple question about query terms

2008-01-30 Thread Chaz Hickman
I'm experiencing some trouble forming simple queries that include non-alphabetic characters. One specific instance is when I want to search for the string @test. If I build up the query using addRequiredPhrase, addRequiredTerm, or Query.parse, the search term loses the @ sign at the

Re: Simple question about query terms

2008-01-30 Thread Jasper Kamperman
Yes. When you index your pages, the text is run through an analyzer that parses it into tokens. The analyzer does interesting stuff like lowercasing, throwing away bothersome characters, and stemming (reducing the word looking to look, because look is the stem of the verb). There are many
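
To see what the analyzer does to that term, here is a minimal sketch (assuming the Lucene 2.x analysis API that Nutch 0.9 ships with; the field name "content" and the class name are just illustrations) that prints the tokens produced for the string @test:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class TokenDemo {
      public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // Run the text through the same kind of analysis the indexer uses.
        TokenStream stream = analyzer.tokenStream("content", new StringReader("@test"));
        Token token;
        while ((token = stream.next()) != null) {
          // Prints "test" - the @ sign is dropped by the tokenizer.
          System.out.println(token.termText());
        }
        stream.close();
      }
    }

If the @ sign has to survive, the usual route is an analyzer whose tokenizer keeps it, applied consistently at both index time and query time.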

What does that mean? robots_denied(18)

2008-01-30 Thread Vinci
hi, I finally got the crawler running without exceptions by building from trunk, but I found the linkdb cannot crawl anything... and then I dumped the crawl db and saw this in the metadata: _pst_:robots_denied(18). Any idea?

Re: Fetch issue with Feeds

2008-01-30 Thread Vinci
Hi All, I get the same exception when I try with the nightly build on a static page, can anyone help? Vicious wrote: Hi All, Using the latest nightly build I am trying to run a crawl. I have set the agent property and all relevant plugins. However as soon as I run the crawl I get

Re: Fetch issue with Feeds

2008-01-30 Thread Vinci
Hi, here is some additional information. Before the exception appears, nutch prints two messages:
fetching http://cnn.com
org.apache.tika.mime.MimeUtils load INFO loading [mime-types.xml]
fetch of http://www.cnn.com/ failed with: java.lang.NullPointerException
Fetcher: done
Seems the mime-type has

Re: Fetch issue with Feeds (SOLVED)

2008-01-30 Thread Vinci
Hi, I finally figured out the solution: go to conf/, rename the old mime-types.xml to anything else, then copy tika-mimetypes.xml into the same directory under the name mime-types.xml. The crawler should work now. In short, this happens because 1.0-dev uses Tika, but the old mime-detection config file is
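
For reference, the same steps as commands, assuming $NUTCH_HOME points at the checkout (the .bak name is arbitrary):

    cd $NUTCH_HOME/conf
    # keep the old file around under any other name
    mv mime-types.xml mime-types.xml.bak
    # make Tika's definitions the active mime-type configuration
    cp tika-mimetypes.xml mime-types.xml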

Re: What does that mean? robots_denied(18)

2008-01-30 Thread Vinci
hi, I found the answer: this is generated because robots.txt disallowed crawling of the current url. Hope it helps. Vinci wrote: hi, I finally got the crawler running without exceptions by building from trunk, but I found the linkdb cannot crawl anything... and then I dumped the crawl
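
In other words, the remote site's robots.txt forbids your crawler from fetching those urls. A robots.txt that would cause this looks roughly like the following (the agent name and paths are made up; the agent your crawler announces is whatever you set in http.agent.name):

    # block every crawler from the whole site
    User-agent: *
    Disallow: /

    # or block only a particular agent from part of the site
    User-agent: mycrawler
    Disallow: /private/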

Can Nutch use part of the url found for the next crawling?

2008-01-30 Thread Vinci
hi, I have some trouble with a site that does content redirection: nutch can't crawl the site but can crawl its rss. Unfortunately the links in the rss redirect back to the site, which is the bad part, but I found the link I want appears in the url as a GET parameter:

Cannot parse atom feed with plugin feed installed

2008-01-30 Thread Vinci
Hi, I already added the plugin name to nutch-default.xml, but it still throws the exception ParseException: parser not found for contentType=application/atom+xml, while the rss feed works fine after I added parse-rss. I checked that the feed plugin supports atom feeds with mime-type application/atom+xml, did I miss
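
One thing worth double-checking: local changes normally go in conf/nutch-site.xml (which overrides nutch-default.xml), and the plugin has to be listed in the plugin.includes regular expression under the directory name it ships with. A rough sketch of the property (the plugin list below is illustrative, not the stock default):

    <property>
      <name>plugin.includes</name>
      <!-- add the feed/atom parsing plugin by its directory name -->
      <value>protocol-http|urlfilter-regex|parse-(text|html|rss)|feed|index-basic|query-(basic|site|url)</value>
    </property>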

Re: nutch 0.9, fetch2, fetcher.parse conf value not used

2008-01-30 Thread John Mendenhall
I tried to run fetch without parsing by setting the fetcher.parse property to false. When I ran parse, it said the segment had already been parsed, by the fetch process. It appears NUTCH-337 only fixed the unused fetcher.parse configuration value in the Fetcher.java class. I have tried
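
For anyone following along, the property in question is set like this in conf/nutch-site.xml (a sketch; whether the fetch job actually honours it is exactly what NUTCH-337 and this thread are about):

    <property>
      <name>fetcher.parse</name>
      <!-- false should mean: fetch only, parse in a separate step -->
      <value>false</value>
    </property>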

JDK 1.5 Tomcat 5.5

2008-01-30 Thread Duan, Nick
Does the latest Nutch work with JDK 1.5 or 1.6, and Tomcat 5.5 or 6.0? Thanks! Nick

RE: JDK 1.5 Tomcat 5.5

2008-01-30 Thread Christopher Bader
I'm using Nutch 0.9 (the latest stable release) with JDK 1.5 and Tomcat 6.0. I had a problem with JDK 1.6. CB -Original Message- From: Duan, Nick [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 30, 2008 4:50 PM To: nutch-user@lucene.apache.org Subject: JDK 1.5 Tomcat 5.5 Does

Re: Can Nutch use part of the url found for the next crawling?

2008-01-30 Thread Susam Pal
crawl-urlfilter.txt and regex-urlfilter.txt are used to block or allow certain URLs from being crawled. They do not let you extract one URL from another. You might want to use conf/regex-normalize.xml to do this. Regards, Susam Pal On Jan 31, 2008 1:43 AM, Vinci [EMAIL PROTECTED] wrote: hi, I
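
For example, if the real target url shows up as a GET parameter, a rule in conf/regex-normalize.xml can rewrite the fetched url to that parameter's value. A rough sketch, assuming a parameter literally named url (adjust the pattern to the site's real parameter, and make sure the urlnormalizer-regex plugin is enabled):

    <!-- inside the existing <regex-normalize> root element -->
    <regex>
      <!-- capture the value of the url= parameter and use it as the new url -->
      <pattern>^.*[?&amp;]url=([^&amp;]+).*$</pattern>
      <substitution>$1</substitution>
    </regex>

Note that if the parameter value is URL-encoded, a plain regex rewrite will not decode it.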

strange page rank

2008-01-30 Thread Lyndon Maydwell
Hi list. I'm having trouble figuring out why certain pages are being ranked much higher than others on my Nutch installation. For example, not long ago, the department of computing's homepage was ranked #1 for the query "computing department". However, recently it has dropped in the rankings

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-30 Thread Siddhartha Reddy
Do you have the Java heap space options set in the 'mapred.child.java.opts' property (in conf/hadoop-site.xml)? For a machine with 1gb ram and 1gb swap space, I set this to '-Xms1024m -Xmx2048m'. Best, Siddhartha On Jan 31, 2008 3:23 AM, John Mendenhall [EMAIL PROTECTED] wrote: The one task
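
The property goes into conf/hadoop-site.xml on each node, along these lines (the heap sizes are the ones quoted above, not a general recommendation):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xms1024m -Xmx2048m</value>
    </property>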

Re: nutch 0.9, multiple nodes, not fetching topN links to fetch

2008-01-30 Thread Siddhartha Reddy
Although only one of the machines will be used for the fetch task (because all your urls are from a single host), the other tasks do not have any such restriction and can run on multiple machines. So running in distributed mode might still benefit you. To 'turn off' the 3 slaves, you can simply