Re: Plugin HitCollector

2006-10-23 Thread Andrzej Bialecki
Dennis Kubes wrote: We are running into the same issue. Remember that hits just give you a doc id, and getting hit details from the hit does another read. So looping through the hits to access every document will do a read per document. If it is a small number of hits, no big deal, but the more
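
A minimal sketch of the read pattern being described, assuming the Nutch 0.8 searcher API (the query string and hit count below are invented for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.searcher.*;
    import org.apache.nutch.util.NutchConfiguration;

    public class HitCostDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        NutchBean bean = new NutchBean(conf);
        Query query = Query.parse("nutch", conf);

        // Cheap: the Hits object carries doc ids and sort values only.
        Hits hits = bean.search(query, 100);

        // Expensive: fetching details costs an extra index read per hit.
        Hit[] shown = hits.getHits(0, (int) Math.min(100, hits.getTotal()));
        HitDetails[] details = bean.getDetails(shown);
        for (int i = 0; i < details.length; i++) {
          System.out.println(details[i].getValue("url"));
        }
      }
    }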

Re: "generate db segments topN" with TYPE

2006-10-23 Thread Dennis Kubes
You could use suffix filters to filter out any document that isn't a PDF. Dennis Marco Vanossi wrote: Hi, Do you think there is an easy way to make nutch generate a list of only certain document types to fetch? For example: If one would like to crawl only PDF docs (after some pages were a
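
A hedged sketch of the filter approach, shown here with the regex URL filter (conf/regex-urlfilter.txt) rather than the suffix plugin; the exact rules are an illustration, not from the original mail:

    # accept URLs that end in .pdf
    +\.pdf$
    # skip everything else
    -.

Note that a filter this strict also blocks the HTML pages whose outlinks lead to the PDFs, so it only makes sense once those pages are already in the crawl db, as in Marco's scenario.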

Re: Plugin HitCollector

2006-10-23 Thread Dennis Kubes
We are running into the same issue. Remember that hits just give you a doc id, and getting hit details from the hit does another read. So looping through the hits to access every document will do a read per document. If it is a small number of hits, no big deal, but the more hits to access, the

Re: Fetching outside the domain ?

2006-10-23 Thread Andrzej Bialecki
Tomi NA wrote: 2006/10/18, [EMAIL PROTECTED] <[EMAIL PROTECTED]>: Btw we have some virtual local hosts, how does the db.ignore.external.links setting deal with that? Update: setting db.ignore.external.links to true in nutch-site (and later also in nutch-default as a sanity check) *doesn't work*: I
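
For reference, a sketch of the property as it would be set in conf/nutch-site.xml (the description text is paraphrased, not quoted from nutch-default.xml):

    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
      <description>If true, outlinks that point to a different host
      than the page they were found on are discarded.</description>
    </property>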

Re: Plugin HitCollector

2006-10-23 Thread Andrzej Bialecki
steveb wrote: I would like to use my own HitCollector when doing a search using the NutchBean as I have a requirement to access every document in the result set but without incurring the cost of traversing the Hits collection. Accessing every document will be costly no matter what interface

"generate db segments topN" with TYPE

2006-10-23 Thread Marco Vanossi
Hi, Do you think there is an easy way to make nutch generate a list of only certain document types to fetch? For example: If one would like to crawl only PDF docs (after some pages were already crawled, which linked to PDF docs), the command: "bin/nutch generate db segments -topN 1000 -type:pdf

Re: crawling sites which require authentication

2006-10-23 Thread Tomi NA
2006/10/14, Tomi NA <[EMAIL PROTECTED]>: 2006/10/14, Toufeeq Hussain <[EMAIL PROTECTED]>: > From internal tests with ntlmaps + Nutch, the conclusion we came to was that though it "kinda-works", it puts a huge load on the Nutch server, as ntlmaps is a major memory-hog and the mixture of the two

Problems with Nutch 0.8.1 and Cygwin

2006-10-23 Thread Aled Jones
Hi all, I'm trying to get my nutch system going with 0.8.1 (currently on 0.7.1). Crawling is fine with 0.7.1, but when trying to start a crawl with Cygwin on 0.8.1, I get the following error: bin/nutch: line 194: /cygdrive/d/nutch-0.8.1/D:\Java\jre1.5.0_06/bin/java: No such file or directory. bin/n
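
The doubled path in the error suggests JAVA_HOME is set Windows-style (D:\Java\jre1.5.0_06), which bin/nutch then glues onto a Cygwin path. A hedged fix, assuming the JRE really lives at that location, is to export it in POSIX form before running the script:

    # in ~/.bashrc, or in the shell before invoking bin/nutch
    export JAVA_HOME=/cygdrive/d/Java/jre1.5.0_06
    export PATH="$JAVA_HOME/bin:$PATH"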

Plugin HitCollector

2006-10-23 Thread steveb
I would like to use my own HitCollector when doing a search using the NutchBean as I have a requirement to access every document in the result set but without incurring the cost of traversing the Hits collection. From looking at the source code I noticed the LuceneQueryOptimizer is already using
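
A hypothetical sketch of the kind of collector steveb describes, written against the Lucene 1.9-era API that Nutch 0.8 bundles (the index path, field, and term are invented; this bypasses NutchBean entirely rather than plugging into it):

    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    public class CollectAllIds {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("crawl/index");
        IndexSearcher searcher = new IndexSearcher(reader);
        Query q = new TermQuery(new Term("content", "nutch"));

        final BitSet matches = new BitSet(reader.maxDoc());
        searcher.search(q, new HitCollector() {
          public void collect(int doc, float score) {
            // Record the id only; reading stored fields here would
            // reintroduce the per-document read being avoided.
            matches.set(doc);
          }
        });
        System.out.println(matches.cardinality() + " matching docs");
        searcher.close();
      }
    }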

RE: Re-injecting URLS, perhaps by removing them from the CrawlDB first?

2006-10-23 Thread Gary Bone
Hi Ben, Attached is a method I use to achieve the process that you are after. Each changed url must be on its own line in the txt file. #Remove updated URL's exec 0/db -deletepage $url done As you can see, it pulls a list of updated urls from a file and removes them one by one from the db. Th
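
The script fragment above lost its middle in the archive (the shell's input redirection swallowed the surrounding text). A hedged reconstruction of the pattern it describes; the file name and the nutch invocation are assumptions, and only the -deletepage flag and the loop shape come from the mail:

    # Remove updated URLs, one per line in urls.txt (assumed name)
    exec 0<urls.txt
    while read url
    do
      # exact command lost; something ending in ".../db -deletepage $url"
      $NUTCH_CMD db -deletepage "$url"
    done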

Re: Fetching outside the domain ?

2006-10-23 Thread Tomi NA
2006/10/18, [EMAIL PROTECTED] <[EMAIL PROTECTED]>: Btw we have some virtual local hosts, how does the db.ignore.external.links setting deal with that? Update: setting db.ignore.external.links to true in nutch-site (and later also in nutch-default as a sanity check) *doesn't work*: I feed the crawl pr