I am using the directory path /home/kah/Downloads/nutch-0.7.1/crawl.pdf/
and getting the following exception!
Exception in thread "main" java.io.FileNotFoundException:
/home/kah/Downloads/nutch-0.7.1/crawl.pdf/segments (Is a directory)
On Tuesday 22 November 2005 13:12, Kasper Hansen wrote:
Hi,
I get an
So how do I update a crawl? The updating section of the FAQ is empty :-(
http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
Doug Cutting wrote:
Håvard W. Kongsgård wrote:
- I want to index about 50–100 sites with lots of documents; is it
best to use the intranet
Thanks, that fixed the issue. That line was commented out in the release
cached.jsp for some reason?
-Original Message-
From: YourSoft [mailto:[EMAIL PROTECTED]]
Sent: 29 November 2005 09:37
To: nutch-user@lucene.apache.org
Subject:
Kumar.
you can use the nutch readdb [db_name] -stats command to generate statistics
for your WebDB and the nutch segread command for your segments.
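For example, the invocations look roughly like this (the db and segment
directory names below are placeholders; substitute your own, run from the
Nutch install directory, and check the usage message each command prints
when run without arguments, since flags differ between versions):

```
bin/nutch readdb crawl.demo/db -stats
bin/nutch segread crawl.demo/segments/20051129103000
```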
HTH Thomas Delnoij
On 11/29/05, Kumar Limbu [EMAIL PROTECTED] wrote:
Hi Everyone,
I am new to nutch and I would like to know how can I know how
For the sake of the archives, I will answer my own question here: I had to
add the following line to the bin/nutch script to be able to run
org.apache.nutch.net.RegexURLFilter from the command line:
CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar
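With that classpath line in place, a quick smoke test might look like the
following. Note this assumes the filter's main() reads URLs from standard
input and echoes its accept/reject decision, so check the class's usage
output first:

```
echo "http://www.example.com/" | bin/nutch org.apache.nutch.net.RegexURLFilter
```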
The nutch script
The segments dir is Nutch-only and has nothing to do with the Lucene index,
which is found at ${nutch-crawl.dir}/index.
The following Lucene code works for me:
Searcher searcher = new IndexSearcher("${nutch-crawl.dir}/index");
I hope this helps.
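Spelled out a bit more fully, a sketch against the Lucene 1.4-era API that
Nutch 0.7 bundles might look like this (the crawl directory name and the
query are made up for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class SearchCrawlIndex {
  public static void main(String[] args) throws Exception {
    // open the merged Lucene index that the crawl tool writes
    Searcher searcher = new IndexSearcher("crawl.demo/index");
    // parse a query against the "content" field
    Query query = QueryParser.parse("nutch", "content", new StandardAnalyzer());
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " hits");
    searcher.close();
  }
}
```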
On 11/29/05, Kasper Hansen [EMAIL PROTECTED]
See in jira, I sent a patch to solve this problem.
Aled Jones wrote:
Further to this, although most cached pages work, I sometimes get errors
from Tomcat similar to:
type Status report
message /cgi-bin/pcrdir2.asp
description The requested resource (/cgi-bin/pcrdir2.asp) is not
available.
Hello,
Is there anyone who can implement a country crawler? I estimate
around 40m documents. Please send me info about your previous work, how much
time it would take to set up, and the cost :-)
Regards
Asgeir Halldorsson
Check out the paper titled "Crawling a Country: Better Strategies
than Breadth-First for Web Page Ordering".
Well, using the path ${nutch-crawl.dir}/index gives the following exception.
Why is this so?
java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.get(ArrayList.java:326)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
at
Along these same lines (as I'm interested in a similar country-specific
project), is there any place to get a list of all the domains for a
specific TLD to use to seed Nutch? E.g., if I wanted a list of
all currently registered .it, .de, or .ca domains?
I've looked without success. I'm
Glenn, I know that Verisign makes these available for .com and .net as
TLD zone files.
For ccTLDs like .us and .uk, you'll have to see if the TLD registrar
provides the same. The following page has some useful links to these
folks:
http://www.dnsstuff.com/info/dnslinks.htm
--matt
On Nov
Anybody using term vectors? Modifying BasicIndexingFilter to enable the
term-vector option for content doesn't seem to produce any:
// content is indexed, so that it's searchable, but not stored in the index
doc.add(Field.UnStored("content", parse.getText(), true));
Any ideas?
-Kenji
Hi,
When I went to mozdex.com, which uses Nutch, I realized that the search
term good man (no double quotes in the actual search term) returns different
search results than the search term man good (also no double quotes in the
actual search term). I went to Google and they are doing similar
OK, now I remember something from the book Lucene in Action; it said
something about word distance. So that's why they return different results.
But still, when I went to Google AdWords and got the new
Maximum CPC estimates for phrases containing the same words but with
I want to crawl and index local filesystem files; is there any way to do this
using Nutch? What do I need to do, and what configuration changes are
required? I am very new to Nutch, so I need your help in this regard.
Thanks in advance for a quick and helpful response.
Regards,
Arun Kumar
Bill
Thanks for the response. I have some more questions for the Nutch geeks out
there:
1. Can you send me the default configuration that I need to put in
crawl-urlfilter.txt for local file spidering?
file content below:
# skip file:, ftp:, mailto: urls
-^(http|ftp|mailto|https):
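That skip line is an ordinary regular expression anchored at the start of
each URL. A quick way to sanity-check a pattern outside Nutch is plain
java.util.regex; the URLs below are made up for illustration:

```java
import java.util.regex.Pattern;

public class FilterCheck {
  // same pattern as the crawl-urlfilter.txt line above, minus the leading "-"
  static final Pattern SKIP = Pattern.compile("^(http|ftp|mailto|https):");

  static boolean skipped(String url) {
    return SKIP.matcher(url).find();
  }

  public static void main(String[] args) {
    // made-up URLs for illustration
    String[] urls = { "http://example.com/page.html", "file:///home/kah/docs/a.pdf" };
    for (String url : urls) {
      System.out.println((skipped(url) ? "skip " : "keep ") + url);
    }
  }
}
```

For local-filesystem crawling you would pair a skip rule like this with a
rule accepting file: URLs, as the article linked elsewhere in the thread
describes.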
Hi all,
I am getting the following error while using Nutch to search the crawled
database; what could be the problem?
Error: searching for 20 raw hits
Thanks in advance,
inr.
Hi
I hope this helps
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
/Jack
On 11/30/05, Arun Kumar Sharma [EMAIL PROTECTED] wrote:
Nutch Geeks-
I want to do local hard-disk crawling and would like to know what I need to
do for this. I found this article helpful
Hi
More logging info and exceptions would help in dealing with the problem quickly ;)
/Jack
On 11/30/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote:
(I'm moving this to nutch-user, so we don't piss off the nutch-dev
folks.)
A few ideas:
- If you only want to match one site at a time, you can just add
site:xxx to the query. The site field exists in the index by default.
- If you want to assign ids to clusters of sites, you can do the site-