Re: Standalone app

2005-11-29 Thread Kasper Hansen
I am using the dir path /home/kah/Downloads/nutch-0.7.1/crawl.pdf/ and getting the following exception: Exception in thread "main" java.io.FileNotFoundException: /home/kah/Downloads/nutch-0.7.1/crawl.pdf/segments (Is a directory) On Tuesday 22 November 2005 13:12, Kasper Hansen wrote: Hi, I get an

Re: Crawl auto updated in nutch?

2005-11-29 Thread Håvard W. Kongsgård
So how do I update a crawl? The updating section of the FAQ is empty :-( http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6 Doug Cutting wrote: Håvard W. Kongsgård wrote: - I want to index about 50 – 100 sites with lots of documents, is it best to use the Intranet

Re: Crawled cached pages do not show images

2005-11-29 Thread Aled Jones
Thanks. Fixed the issue. That line was commented out in the release cached.jsp for some reason? -----Original Message----- From: YourSoft [mailto:[EMAIL PROTECTED] Sent: 29 November 2005 09:37 To: nutch-user@lucene.apache.org Subject:

Re: How can I know how many pages nutch has fetched?

2005-11-29 Thread Thomas Delnoij
Kumar, you can use the nutch readdb [db_name] -stats command to generate statistics for your WebDB and the nutch segread command for your segments. HTH Thomas Delnoij On 11/29/05, Kumar Limbu [EMAIL PROTECTED] wrote: Hi Everyone, I am new to nutch and I would like to know how can I know how

Re: RegexURLFilter / testing regex-urlfilter.txt

2005-11-29 Thread Thomas Delnoij
For the sake of the archives, I will answer my own question here: I had to add the following line to the bin/nutch script to be able to run org.apache.nutch.net.RegexURLFilter from the command line: CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar The nutch script

Re: Standalone app

2005-11-29 Thread Bruno Patini Furtado
The segments dir is Nutch-only and has nothing to do with the Lucene index, which is found at ${nutch-crawl.dir}/index. The following Lucene code works for me: Searcher searcher = new IndexSearcher("${nutch-crawl.dir}/index"); I hope this helps. On 11/29/05, Kasper Hansen [EMAIL PROTECTED]
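A plausible reading of the "(Is a directory)" exception reported earlier in this thread is that the searcher was pointed at the crawl directory itself, where "segments" is a Nutch directory, rather than at the index subdirectory. The sketch below is only an illustration of that check using plain java.io; it assumes that a Lucene index of this era keeps a regular file named "segments" at its top level, and the class and method names are hypothetical, not part of Nutch or Lucene.

```java
import java.io.File;

public class IndexDirCheck {

    // Assumption: a Lucene index directory contains a regular FILE named
    // "segments", whereas a Nutch crawl directory contains a DIRECTORY
    // named "segments".
    public static boolean looksLikeLuceneIndex(File dir) {
        return new File(dir, "segments").isFile();
    }

    // Returns <crawlDir>/index when it looks like a Lucene index, else null.
    public static File resolveIndexDir(File crawlDir) {
        File index = new File(crawlDir, "index");
        return looksLikeLuceneIndex(index) ? index : null;
    }

    public static void main(String[] args) {
        // "crawl" is a placeholder default, not a path from the thread.
        File crawlDir = new File(args.length > 0 ? args[0] : "crawl");
        File index = resolveIndexDir(crawlDir);
        if (index != null) {
            System.out.println("Pass this path to the searcher: " + index);
        } else {
            System.out.println("No Lucene index found under " + crawlDir);
        }
    }
}
```

Running it with the crawl directory as the only argument prints the path that would be handed to the searcher, or a message if no index is found.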

Re: [Nutch-general] Re: Crawled cached pages do not show images

2005-11-29 Thread YourSoft
See JIRA; I sent a patch to solve this problem. Aled Jones wrote: Further to this, although most cached pages work, I sometimes get errors from Tomcat similar to: type Status report message /cgi-bin/pcrdir2.asp description The requested resource (/cgi-bin/pcrdir2.asp) is not available.

Setting up a crawler for a country.

2005-11-29 Thread Ásgeir Halldórsson
Hello, Is there anyone who can implement a country crawler? I estimate around 40m documents. Please send me info about your previous work and how much time and money it would take to set up :-) Regards Asgeir Halldorsson

Re: Setting up a crawler for a country.

2005-11-29 Thread Ken Krugler
Is there anyone who can implement a country crawler? I estimate around 40m documents. Please send me info about your previous work and how much time and money it would take to set up :-) Check out the paper titled "Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering"

Re: Standalone app

2005-11-29 Thread Kasper Hansen
Well, using the path ${nutch-crawl.dir}/index gives the following exception. Why is this so? java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.get(ArrayList.java:326) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155) at

Re: Setting up a crawler for a country.

2005-11-29 Thread Insurance Squared Inc.
Along these same lines (as I'm interested in a similar country-specific project), is there any place to get a list of all the domains for a specific TLD to use to seed nutch? i.e. if I wanted to get a list of all currently registered .it, .de, or .ca domains? I've looked without success. I'm

Re: Setting up a crawler for a country.

2005-11-29 Thread Matt Kangas
glenn, i know that verisign makes this available for .com and .net as TLD zone files. for ccTLDs like .us and .uk, you'll have to see if the TLD registrar provides the same. the following page has some useful links to these folks: http://www.dnsstuff.com/info/dnslinks.htm --matt On Nov

Lucene term-vector

2005-11-29 Thread Kenji
Anybody using term-vectors? Modifying BasicIndexFilter to enable the term-vector option for contents doesn't seem to produce any: // content is indexed, so that it's searchable, but not stored in index doc.add(Field.UnStored("content", parse.getText(), true)); Any ideas? -Kenji
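For readers unfamiliar with the term, a term vector is essentially a per-document map from each term to how often it occurs. The following is a conceptual sketch only, in plain Java with a deliberately naive tokenizer; it does not use or reproduce Lucene's analyzers or its term-vector storage, and the class name is hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TermVectorSketch {

    // Builds a term -> frequency map for one document's text.
    // Simplification: lowercases and splits on anything that is not a
    // letter or digit; a real analyzer does considerably more.
    public static Map<String, Integer> termVector(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if text starts with a separator
            }
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        // prints {good=2, man=1, search=1}
        System.out.println(termVector("Good man, good search"));
    }
}
```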

Good man is Different than Man good in Nutch?

2005-11-29 Thread Victor Lee
Hi, When I went to mozdex.com, which is using Nutch, I realized that the search term good man (no double quotes in actual search term) returns different search results than the search term man good (also no double quotes in actual search term). I went to Google and they are doing similar

Re: Good man is Different than Man good in Nutch?

2005-11-29 Thread Victor Lee
OK, now I remember something from the book Lucene in Action; it said something about word distance. So that's why they return different results. But still, I remember that when I went to Google AdWords and got the new Maximum CPC estimates for phrases containing the same words but with

How to crawl local system files using nutch

2005-11-29 Thread Arun Kumar Sharma
I want to crawl and index local system files. Is there any way to do this using nutch? What do I need to do, and what configuration changes are required? I am very new to nutch, so I need your help in this regard. Thanks in advance for a quick and good response. Regards, Arun Kumar

Re: How to crawl local system files using nutch

2005-11-29 Thread Arun Kumar Sharma
Bill, thanks for the response. I have some more questions for Nutch geeks out there: 1. Can you send me the default configuration that I need to make in crawl-urlfilter.txt for local file spidering? File content below: # skip file:, ftp:, mailto: urls -^(http|ftp|mailto|https):
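To see what a rule such as -^(http|ftp|mailto|https): does to a file: URL, the rule file can be approximated with java.util.regex. This is a minimal sketch, not Nutch's RegexURLFilter: it assumes that each rule line starts with + (accept) or - (reject), that the first matching rule decides, and that a URL matching no rule is rejected. The trailing "+." rule and the sample URLs are illustrative assumptions, not taken from the poster's file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    private final List<Boolean> accept = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    // Parses rule lines; blank lines and lines starting with '#' are skipped.
    public UrlFilterSketch(List<String> ruleLines) {
        for (String line : ruleLines) {
            String rule = line.trim();
            if (rule.isEmpty() || rule.startsWith("#")) {
                continue;
            }
            char sign = rule.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + rule);
            }
            accept.add(sign == '+');
            patterns.add(Pattern.compile(rule.substring(1)));
        }
    }

    // First matching rule wins; a URL that matches no rule is rejected.
    public boolean accepts(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) {
                return accept.get(i);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(Arrays.asList(
                "-^(http|ftp|mailto|https):", // rule quoted in the message above
                "+."));                       // assumed catch-all accept rule
        System.out.println(filter.accepts("file:/home/user/docs/a.pdf")); // true
        System.out.println(filter.accepts("http://example.com/"));        // false
    }
}
```

Under these assumptions, file: URLs fall through the reject rule and are accepted by the catch-all, while http, https, ftp and mailto URLs are dropped.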

Error: searching for 20 raw hits

2005-11-29 Thread Ayyanar Inbamohan
Hi all, I am getting the following error when I use Nutch to search the crawled Nutch database. What could be the problem? Error: searching for 20 raw hits Thanks in advance, inr.

Re: Help require in local hard-disk crawling with Nutch

2005-11-29 Thread Jack Tang
Hi, I hope this helps: http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch /Jack On 11/30/05, Arun Kumar Sharma [EMAIL PROTECTED] wrote: Nutch Geeks- I want to do local hard-disk crawling. I want to know what I need to do for this. I find this article helpful

Re: Error: searching for 20 raw hits

2005-11-29 Thread Jack Tang
Hi, more logging info and exceptions will help in dealing with the problem quickly ;) /Jack On 11/30/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote: Hi all, I am getting the following error when I use Nutch to search the crawled Nutch database. What could be the problem? Error:

Re: How to hack the config?

2005-11-29 Thread Matt Kangas
(i'm moving this to nutch-user, so we don't piss off the nutch-dev folks.) a few ideas: - if you only want to match one site at a time, you can just add site:xxx to the query. the site field exists in the index by default - if you want to assign ids to clusters of sites, you can do the site-
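The first idea above, adding site:xxx to the query, amounts to a small string rewrite before the query is submitted. A minimal sketch follows; the helper name and the example host are hypothetical, and nothing here calls into Nutch itself.

```java
public class SiteQuery {

    // Appends a site: clause to the user's query text.
    // The host value is supplied by the caller; nothing is validated here.
    public static String restrictToSite(String query, String host) {
        String q = query.trim();
        return q.isEmpty() ? "site:" + host : q + " site:" + host;
    }

    public static void main(String[] args) {
        // prints: good man site:www.example.com
        System.out.println(restrictToSite("good man", "www.example.com"));
    }
}
```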