List all indexed sites

2005-07-21 Thread lumavanossi
Hi, Is there a way to list all indexed sites? I've tried searching for "http" (search.jsp?query=http), but with some millions of sites indexed it does not return all sites. Thanks

"Imports"

2005-07-21 Thread lumavanossi
Hi, Which imports do I need to add to an indexer plugin (like import org.apache.nutch.parse.Outlink; and ...) in order to get the code below to work: // add links Outlink[] outlinks = parse.getData().getOutlinks(); int end = Math.min(outlinks.length, UpdateDatabaseTool.MAX_OUTLINKS_PER_PAG

Re: Unsubscribing

2005-07-21 Thread Gus Bourg
I believe this list is now hosted on the apache.org servers, not sourceforge servers (correct me if I'm wrong)... They transferred all of the sourceforge subscriptions over to the new mailing list server. Gus On Thu, 21 Jul 2005, Paul Stewart wrote: Hi there... I'm not a list newbie..really

Re: [Nutch-general] Re: Speed up indexing?

2005-07-21 Thread ogjunk-nutch
Hi, --- Matthias Jaekle <[EMAIL PROTECTED]> wrote: > > You probably don't want to touch indexer.termIndexInterval and > > indexer.maxMergeDocs (determines the max size of an individual > > segment). > Why is maxMergeDocs 50 by default? Should not this value be much > higher? 50 is probably OK fo

Re: [Nutch-general] Re: Speed up indexing?

2005-07-21 Thread Matthias Jaekle
You probably don't want to touch indexer.termIndexInterval and indexer.maxMergeDocs (determines the max size of an individual segment). Why is maxMergeDocs 50 by default? Should not this value be much higher? I found how to calculate the number of opened files. But how could I calculate the memor

Re: [Nutch-general] Re: RDF plugin questions

2005-07-21 Thread Stefan Groschupf
Hi Erik, Stefan - thanks for the reply. I'm still digesting Nutch and how to work with it at a basic level but it does make sense to allow metadata to tag along with fetches - I certainly don't know enough yet to say whether your patch fits into the long-term vision of Nutch or not yet.

Re: [Nutch-general] Re: RDF plugin questions

2005-07-21 Thread Erik Hatcher
Stefan - thanks for the reply. I'm still digesting Nutch and how to work with it at a basic level but it does make sense to allow metadata to tag along with fetches - I certainly don't know enough yet to say whether your patch fits into the long-term vision of Nutch or not yet. I've star

Unsubscribing

2005-07-21 Thread Paul Stewart
Hi there... I'm not a list newbie..really I'm not..;) When I go to https://lists.sourceforge.net/lists/listinfo/nutch-general and try to put my email address in to unsubscribe, it tells me I don't exist... Is something broken? :) Can anyone on here manually unsubscribe me? Appreciate it, Paul

Re: [Nutch-general] Re: Speed up indexing?

2005-07-21 Thread ogjunk-nutch
Hi, --- Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Matthias Jaekle wrote: > > Hi Andrzej, > > > > thanks for your response. I am not really familiar with the lucene > > internals. > > > > I am just running nutch with the default parameters on a debian > sarge > > system with ext3 file syste

Re: Chris Mattmann's RSS plugin? NUTCH-30

2005-07-21 Thread Stefan Groschupf
Hi there, I submitted a patch as well that does not need any external library, just an XML parser and an XSLT processor, which are, I think, part of the JDK anyway. My solution uses an XSL stylesheet that converts different RSS feeds to one format, and then I parse this format with a normal XML parser.

Re: [Nutch-general] Re: Speed up indexing?

2005-07-21 Thread ogjunk-nutch
Matthias, minMergeDocs is what controls how many Documents will be held in memory before being flushed to the disk, and mergeFactor controls how often segments are merged. Both values are 10 by default in Lucene, I believe. If you have Lucene in Action, this is described in more detail there. If

Re: Chris Mattmann's RSS plugin? NUTCH-30

2005-07-21 Thread Chris Mattmann
Hi Andrzej, At the time that I was working diligently on this plugin (April/May), I had done some thorough research into finding what I felt would be the most flexible, reliable way to parse RSS files. The RSS feed parser out of the jakarta-commons sandbox was what I found, and I stand by it. I

Re: Speed up indexing?

2005-07-21 Thread Andrzej Bialecki
Matthias Jaekle wrote: Hi Andrzej, thanks for your response. I am not really familiar with the lucene internals. I am just running nutch with the default parameters on a debian sarge system with ext3 file system, maximum 1024 files opened, and 1 GB RAM. So is ext3 a bad file system for mill

Re: optimize indexes

2005-07-21 Thread Andy Liu
When you run dedup, it undeletes all previous deletions from the index. So run dedup before you prune, otherwise all the documents you pruned will be undeleted by dedup. As far as optimizing goes, you can open an IndexWriter and run optimize(). Take a look at IndexSegment on how an IndexWriter i

Re: Classnotfoundexception in https plugin

2005-07-21 Thread Piotr Kosiorowski
Hello, You probably have some problems with Nutch plugins. It is quite difficult to understand from your email, but I assume you have created a new Nutch plugin for it. 1) Please check if everything is correctly specified in your plugin.xml file. 2) Check if you have included your plugin in nut

Re: Speed up indexing?

2005-07-21 Thread Matthias Jaekle
Hi Andrzej, thanks for your response. I am not really familiar with the lucene internals. I am just running nutch with the default parameters on a debian sarge system with ext3 file system, maximum 1024 files opened, and 1 GB RAM. So is ext3 a bad file system for millions of files? I could no

optimize indexes

2005-07-21 Thread [EMAIL PROTECTED]
Dear Users! How can I really remove deleted entries from the index? I ran the 'prune' tool and the 'dedup' tool, and afterwards I would like to remove the deleted entries from the index. How do I optimize indexes? Regards, Ferenc

multiple website crawling

2005-07-21 Thread Feng (Michael) Ji
Hi there, If I put multiple web URLs in the plain text file "urls" used in the following command, will it fetch multiple web sites for me? "bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log" Thanks, Michael
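A minimal sketch of the setup asked about above, assuming the `crawl` command reads its seed file as one URL per line. The three example hosts are placeholders and not from the original message. The `bin/nutch` call only runs when that script exists in the current directory, and the crawl URL filter configuration would presumably also have to accept each listed domain.

```shell
# Seed file: one start URL per line (hosts below are placeholders).
cat > urls <<'EOF'
http://www.example.com/
http://www.example.org/
http://www.example.net/
EOF

# Count the seeds so it is clear how many sites were requested.
seed_count=$(grep -c '^http' urls)
echo "seeds: $seed_count"

# Run the crawl only when the Nutch launcher is present here (assumption:
# the command is issued from the Nutch install directory).
# '> crawl.log 2>&1' is the portable spelling of the '>&' redirect above.
if [ -x bin/nutch ]; then
  bin/nutch crawl urls -dir crawl.test -depth 3 > crawl.log 2>&1
fi
```

Checking crawl.log afterwards for fetch lines from each host is a quick way to confirm that every seed was actually picked up.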

Re: Speed up indexing?

2005-07-21 Thread Andrzej Bialecki
Matthias Jaekle wrote: 050721 071234 * Optimizing index... ... this takes a long time ... Hello, optimizing the index takes extremely long. I have the feeling that in earlier versions this was much faster. I am trying to index a segment of 7,000,000 pages. This has been running for 10 days now. Processing

Speed up indexing?

2005-07-21 Thread Matthias Jaekle
050721 071234 * Optimizing index... ... this takes a long time ... Hello, optimizing the index takes extremely long. I have the feeling that in earlier versions this was much faster. I am trying to index a segment of 7,000,000 pages. This has been running for 10 days now. Processing started at around

Re: Nutch Plugins Help

2005-07-21 Thread quovadis
Hi Jerome Date: Thu, 21 Jul 2005 09:44:39 GMT Server: Apache/1.3.26 (Unix) mod_gzip/1.3.26.1a mod_auth_pam/1.0a PHP/4.3.11 PHP/3.0.18 mod_ssl/2.8.10 OpenSSL/0.9.6g mod_perl/1.27 mod_jk/1.1.0 FrontPage/5.0.2.2510 Last-Modified: Wed, 13 Oct 2004 04:08:56 GMT ETag: "11810b-d8a-416caa58" Accept-Ranges

Re: Nutch Plugins Help

2005-07-21 Thread Jérôme Charron
On 7/21/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > > Hi > > I'm getting a lot of the following "errors?" when fetching a > segment: > > 050721 094100 fetch okay, but can't parse > http://www.sahunt.co.za/sahunter/recepies/biltongsoup.html, > reason: failed(2,203): Content-Type not applica

Nutch Plugins Help

2005-07-21 Thread quovadis
Hi I'm getting a lot of the following "errors?" when fetching a segment: 050721 094100 fetch okay, but can't parse http://www.sahunt.co.za/sahunter/recepies/biltongsoup.html, reason: failed(2,203): Content-Type not application/msword: The page above is a pure HTML page; however, the fetch is ok bu

Re: Skipping the final indexing step?

2005-07-21 Thread Piotr Kosiorowski
Hello Otis, If you are only reading ParseData and FetcherOutput from a Nutch segment, you do not need a Lucene index at all. So you can safely skip the -i switch. Regards Piotr On 7/21/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > Hello, > > I'm using SegmentMergeTool to merge some large segments, an