Re: [Nutch-general] Caching the search results

2006-09-08 Thread Ken Krugler
You may want to consider using memcached - http://www.danga.com/memcached/ - it's super simple and super stable. I use it over at Simpy.com and the memcached daemon there has been up for months without showing any signs of trouble. We've also had good luck with ehcache
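
For the curious, a minimal sketch of fronting search results with memcached from Java, assuming the spymemcached client; the key scheme, expiry, and runSearch() helper are illustrative, not anything Nutch ships:

  import java.net.InetSocketAddress;
  import net.spy.memcached.MemcachedClient;

  public class SearchCache {
    public static void main(String[] args) throws Exception {
      // Talk to a memcached daemon assumed to be running locally on the default port.
      MemcachedClient cache =
          new MemcachedClient(new InetSocketAddress("localhost", 11211));

      String key = "query:nutch+crawler";  // hypothetical per-query cache key
      Object results = cache.get(key);     // returns null on a cache miss
      if (results == null) {
        results = runSearch(key);          // hypothetical call into the search backend
        cache.set(key, 300, results);      // keep for 5 minutes
      }
      cache.shutdown();
    }

    // Stand-in for the real search call; cached values must be Serializable.
    private static Object runSearch(String key) { return "..."; }
  }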

Re: Recrawling (Tomi NA)

2006-09-08 Thread Tomi NA
On 9/7/06, David Wallace [EMAIL PROTECTED] wrote: Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index when it's crawled, regardless of whether it's already in there, with a
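
If session ids do turn out to be the problem, one common remedy (a sketch, assuming the urlnormalizer-regex plugin is enabled; the exact patterns are illustrative) is to strip them in conf/regex-normalize.xml so the same page always maps to one URL:

  <regex-normalize>
    <!-- strip Java servlet session ids, e.g. ;jsessionid=1A2B3C -->
    <regex>
      <pattern>;jsessionid=[A-Za-z0-9]+</pattern>
      <substitution></substitution>
    </regex>
    <!-- strip PHP session ids passed in the query string -->
    <regex>
      <pattern>[?&amp;]PHPSESSID=[A-Za-z0-9]+</pattern>
      <substitution></substitution>
    </regex>
  </regex-normalize>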

Customize the crawl process

2006-09-08 Thread NamNH
I want to customize the crawling process by modifying the way pages are stored. As far as I know, Nutch stores web pages in a binary file, page by page. After a link analysis step, Nutch crawls to the destination page and downloads it. When pages are stored, I want to write only a link to a

Re: Recrawling (Tomi NA)

2006-09-08 Thread Andrzej Bialecki
Tomi NA wrote: On 9/7/06, David Wallace [EMAIL PROTECTED] wrote: Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index when it's crawled, regardless of whether it's already

Re: Indexing MS Powerpoint files with Lucene

2006-09-08 Thread Andrzej Bialecki
(moved to nutch-user) Tomi NA wrote: On 9/7/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: Tomi NA wrote: On 9/7/06, Nick Burch [EMAIL PROTECTED] wrote: On Thu, 7 Sep 2006, Tomi NA wrote: On 9/7/06, Venkateshprasanna [EMAIL PROTECTED] wrote: Is there any filter available for extracting

RE: Charset question

2006-09-08 Thread kenneth man
Thanks for your reply. I have found that the method you mentioned looks into the HTTP header from the web server. It looks for the charset and does the mapping. The Apache web server which hosts the document has already been configured with: AddDefaultCharset Big5-HKSCS The crawl engine does treat the
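
For reference, the header-based detection being discussed boils down to something like this sketch (the parsing helper and the fallback are illustrative, not Nutch's actual code):

  import java.nio.charset.Charset;

  public class CharsetSniffer {
    // Pull the charset parameter out of a Content-Type header value,
    // e.g. "text/html; charset=Big5-HKSCS" -> "Big5-HKSCS".
    static String charsetFrom(String contentType, String fallback) {
      if (contentType != null) {
        int i = contentType.toLowerCase().indexOf("charset=");
        if (i >= 0) {
          String cs = contentType.substring(i + "charset=".length()).trim();
          int semi = cs.indexOf(';');
          return semi >= 0 ? cs.substring(0, semi) : cs;
        }
      }
      return fallback;
    }

    public static void main(String[] args) {
      String name = charsetFrom("text/html; charset=Big5-HKSCS", "UTF-8");
      byte[] raw = {};  // raw page bytes as fetched
      String text = new String(raw, Charset.forName(name));
      System.out.println(name + ": decoded " + text.length() + " chars");
    }
  }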

Reduce Error during fetch

2006-09-08 Thread Vishal Shah
Hi, I've been trying to get the Nutch fetcher to work for a couple of days, but it always hangs on one of the reduce processes, and the job is aborted. I am using numFetchers=24 during generate, and 24 map tasks and 6 reduce tasks during fetch on a 3-machine cluster. The task that failed was
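
For context, the task counts above would normally come from settings like these in hadoop-site.xml (or nutch-site.xml); the property names are the mapred-era Hadoop ones, and placing them here is a sketch rather than a diagnosis:

  <property>
    <name>mapred.map.tasks</name>
    <value>24</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>6</value>
  </property>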

Getting subcollections

2006-09-08 Thread Alvaro Cabrerizo
Hi, I would like to know how I could get all the subcollections, and how many documents belong to each subcollection, after making a query. The approach I took was to iterate over the results, getting details for each one. The problem is that every query I make is limited by numHits [
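
A sketch of that iterate-and-count approach, assuming the Nutch 0.8 NutchBean API and that the subcollection plugin indexes a "subcollection" field (both the field name and the very large numHits value should be checked against your setup):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.searcher.HitDetails;
  import org.apache.nutch.searcher.Hits;
  import org.apache.nutch.searcher.NutchBean;
  import org.apache.nutch.searcher.Query;
  import org.apache.nutch.util.NutchConfiguration;

  public class SubcollectionCounts {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      NutchBean bean = new NutchBean(conf);
      Query query = Query.parse(args[0], conf);
      Hits hits = bean.search(query, Integer.MAX_VALUE);  // still capped internally
      Map counts = new HashMap();                         // subcollection -> doc count
      for (int i = 0; i < hits.getLength(); i++) {
        HitDetails d = bean.getDetails(hits.getHit(i));
        String sub = d.getValue("subcollection");         // field name is an assumption
        Integer c = (Integer) counts.get(sub);
        counts.put(sub, new Integer(c == null ? 1 : c.intValue() + 1));
      }
      System.out.println(counts);
    }
  }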

Re: Indexing MS Powerpoint files with Lucene

2006-09-08 Thread Tomi NA
On 9/8/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: (moved to nutch-user) Tomi NA wrote: On 9/7/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: Tomi NA wrote: On 9/7/06, Nick Burch [EMAIL PROTECTED] wrote: On Thu, 7 Sep 2006, Tomi NA wrote: On 9/7/06, Venkateshprasanna [EMAIL PROTECTED]

Re: Recrawling (Tomi NA)

2006-09-08 Thread Tomi NA
On 9/8/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: Tomi NA wrote: On 9/7/06, David Wallace [EMAIL PROTECTED] wrote: Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index

How could I merge two indexes together?

2006-09-08 Thread heack
I fetched two sites and indexed them separately. How can I merge them? Thank you!
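
One way to do it (a sketch at the plain Lucene level; the paths are illustrative, and if your Nutch version ships the IndexMerger tool, bin/nutch merge may do the same job from the command line):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class MergeIndexes {
    public static void main(String[] args) throws Exception {
      // Destination index; 'true' creates it from scratch.
      IndexWriter writer =
          new IndexWriter("merged/index", new StandardAnalyzer(), true);
      Directory[] sources = {
          FSDirectory.getDirectory("crawl-site1/index", false),
          FSDirectory.getDirectory("crawl-site2/index", false)
      };
      writer.addIndexes(sources);  // copies both source indexes into the destination
      writer.optimize();
      writer.close();
    }
  }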

Re: Customize the crawl process

2006-09-08 Thread Dennis Kubes
You would need to modify Fetcher line 433 to use a text output format, like this: job.setOutputFormat(TextOutputFormat.class); and you would need to modify Fetcher line 307 to collect only the information you are looking for, maybe something like this: Outlink[] links =
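
To make that concrete, a sketch of the kind of edits being suggested (the line numbers refer to the poster's Fetcher.java; the variable names and the outlink-only collector below are illustrative, not the actual Nutch source):

  // In the job setup: emit plain text instead of Nutch's binary output.
  job.setOutputFormat(TextOutputFormat.class);

  // Where a page has been fetched and parsed: keep only its outlinks.
  Outlink[] links = parse.getData().getOutlinks();
  for (int i = 0; i < links.length; i++) {
    output.collect(key, new Text(links[i].getToUrl()));
  }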

RE: How to Make Nutch Return Search Results Belonging to the Crawl URL List

2006-09-08 Thread victor_emailbox
Thanks. I put the following in nutch-site.xml:

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without

Re: # of tasks executed in parallel

2006-09-08 Thread Dennis Kubes
How many urls are you fetching, and does each machine have the same settings as below? Remember that the number of fetchers is the number of fetcher threads per task per machine. So you would be running 2 tasks per machine * 12 threads * 3 machines = 72 fetcher threads. Dennis Vishal Shah wrote: Hi,
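
The per-task thread count comes from fetcher.threads.fetch; as a sketch, the setup described would correspond to something like this in nutch-site.xml on each machine (the description text is paraphrased):

  <property>
    <name>fetcher.threads.fetch</name>
    <value>12</value>
    <description>Number of fetcher threads each fetch task runs.</description>
  </property>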

Re: Reduce Error during fetch

2006-09-08 Thread Dennis Kubes
You may be running into problems with regex stalls during filtering. Try removing the regex filter from the plugin.includes property in nutch-site.xml. I was having similar problems before switching to using just prefix and suffix filters, as below. I attached my prefix and suffix URL filter files
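
As a sketch, the switch means listing urlfilter-prefix and urlfilter-suffix instead of urlfilter-regex in plugin.includes (the surrounding plugin list here is illustrative) and supplying the two filter files:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>

  # conf/prefix-urlfilter.txt -- only URLs starting with these prefixes pass
  http://intranet.example.com/

  # conf/suffix-urlfilter.txt -- URLs ending with these suffixes are rejected
  .gif
  .jpg
  .zip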

Fetching past Authentication

2006-09-08 Thread Jim Wilson
Dear Nutch User List, I am desperately trying to index an intranet with the following characteristics: 1) Some sites require no authentication - these already work great! 2) Some sites require basic HTTP Authentication. 3) Some sites require NTLM Authentication. 4) No sites require both HTTP and
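
For what it's worth, at the level of the Jakarta Commons HttpClient 3.x library that the protocol-httpclient plugin wraps, the two schemes look roughly like this sketch (host names, credentials, and domain are illustrative; whether your Nutch version exposes them as configuration is a separate question):

  import org.apache.commons.httpclient.HttpClient;
  import org.apache.commons.httpclient.NTCredentials;
  import org.apache.commons.httpclient.UsernamePasswordCredentials;
  import org.apache.commons.httpclient.auth.AuthScope;

  public class AuthSketch {
    public static void main(String[] args) {
      HttpClient client = new HttpClient();
      // Basic HTTP authentication for one host.
      client.getState().setCredentials(
          new AuthScope("basic.intranet.local", 80, AuthScope.ANY_REALM),
          new UsernamePasswordCredentials("user", "secret"));
      // NTLM for another host; NTCredentials also carries the
      // connecting host's name and the NT domain.
      client.getState().setCredentials(
          new AuthScope("ntlm.intranet.local", 80, AuthScope.ANY_REALM),
          new NTCredentials("user", "secret", "crawler-host", "NTDOMAIN"));
    }
  }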

Windows File Shares

2006-09-08 Thread Jim Wilson
Dear Nutch Users, Does anyone have experience indexing the contents of Windows file shares? There's information on the Wiki about indexing the local disk, but nothing about remote shares. Also, does Nutch traverse directories on its own, or does it require document links? Thanks in advance.
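
One approach (a sketch, assuming the share is first mounted or mapped so it is visible through the local filesystem) is to seed the crawl with a file: URL and enable the protocol-file plugin; for directories it serves up a generated listing, so Nutch can traverse them without explicit document links:

  # urls/seed.txt -- the mount point is illustrative
  file:///mnt/winshare/

  <!-- nutch-site.xml: include protocol-file among the plugins -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html|msword)|index-basic|query-(basic|site|url)</value>
  </property>

Note that the stock regex-urlfilter.txt typically excludes file: URLs, so that rule may need relaxing as well.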

Re: two nutch indexes on same webserver

2006-09-08 Thread Dennis Kubes
Assuming you have two separate war files deployed, it should be as easy as setting the searcher.dir property in the nutch-site.xml file in the different WEB-INF directories to the separate index locations. If you want to go the distributed searching route, there is an in-depth explanation on
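
Concretely, that per-webapp override is a sketch like this in each deployment's WEB-INF/classes/nutch-site.xml (the paths are illustrative):

  <property>
    <name>searcher.dir</name>
    <value>/data/crawl-siteA</value>
    <description>Crawl directory this particular webapp searches.</description>
  </property>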