Skipping the final indexing step?

2005-07-21 Thread ogjunk-nutch
Hello, I'm using SegmentMergeTool to merge some large segments, and I see that the final index optimization (below) takes a looong time. I think this index creation and optimization is triggered by the -i param to SegmentMergeTool. From what I saw in SegmentMergeTool.java, this is an

Re: [Nutch-general] RE: benchmarking

2005-07-21 Thread ogjunk-nutch
Hi, I'd like to see your presentation, but that server is down. Otis --- Chris Mattmann [EMAIL PROTECTED] wrote: Hi there Jay, Here are some numbers that a colleague and I presented in my graduate computer science seminar class on search engines in the Spring '05 semester at USC. The

Chris Mattmann's RSS plugin? NUTCH-30

2005-07-21 Thread ogjunk-nutch
Hi, Does anyone know why Chris Mattmann's RSS plugin ( http://issues.apache.org/jira/browse/NUTCH-30 ) wasn't put in the repository, and whether there are plans to revive it and include it? Thanks, Otis

Re: Skipping the final indexing step?

2005-07-21 Thread Piotr Kosiorowski
Hello Otis, If you are only reading ParseData and FetcherOutput from a Nutch segment, you do not need a Lucene index at all, so you can safely skip the -i switch. Regards Piotr On 7/21/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hello, I'm using SegmentMergeTool to merge some large segments, and I

Nutch Plugins Help

2005-07-21 Thread quovadis
Hi, I'm getting a lot of the following errors when fetching a segment: 050721 094100 fetch okay, but can't parse http://www.sahunt.co.za/sahunter/recepies/biltongsoup.html, reason: failed(2,203): Content-Type not application/msword: The page above is a pure HTML page; however, the fetch is ok but

Re: Speed up indexing?

2005-07-21 Thread Andrzej Bialecki
Matthias Jaekle wrote: 050721 071234 * Optimizing index... ... this takes a long time ... Hello, optimizing the index takes extremely long. I have the feeling that in earlier versions this was much faster. I am trying to index a segment of 7,000,000 pages. This has been running for 10 days now.

optimize indexes

2005-07-21 Thread [EMAIL PROTECTED]
Dear Users! How do I really delete the deleted entries from the index? I ran the 'prune' tool and the 'dedup' tool, and afterwards I would like to remove the deleted entries from the index. How do I optimize indexes? Regards, Ferenc

Re: Speed up indexing?

2005-07-21 Thread Matthias Jaekle
Hi Andrzej, thanks for your response. I am not really familiar with the Lucene internals. I am just running Nutch with the default parameters on a Debian sarge system with an ext3 file system, maximum 1024 files opened, and 1 GB RAM. So is ext3 a bad file system for millions of files? I could

Re: Chris Mattmann's RSS plugin? NUTCH-30

2005-07-21 Thread Chris Mattmann
Hi Andrzej, At the time that I was working diligently on this plugin (April/May), I had done some thorough research into finding what I felt would be the most flexible, reliable way to parse RSS files. The RSS feed parser out of the jakarta-commons sandbox was what I found, and I stand by it.

Re: [Nutch-general] Re: Speed up indexing?

2005-07-21 Thread ogjunk-nutch
Hi, --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Matthias Jaekle wrote: Hi Andrzej, thanks for your response. I am not really familiar with the Lucene internals. I am just running Nutch with the default parameters on a Debian sarge system with an ext3 file system, maximum 1024

Re: [Nutch-general] Re: RDF plugin questions

2005-07-21 Thread Erik Hatcher
Stefan - thanks for the reply. I'm still digesting Nutch and how to work with it at a basic level but it does make sense to allow metadata to tag along with fetches - I certainly don't know enough yet to say whether your patch fits into the long-term vision of Nutch or not yet. I've

Re: [Nutch-general] Re: RDF plugin questions

2005-07-21 Thread Stefan Groschupf
Hi Erik, Stefan - thanks for the reply. I'm still digesting Nutch and how to work with it at a basic level but it does make sense to allow metadata to tag along with fetches - I certainly don't know enough yet to say whether your patch fits into the long-term vision of Nutch or not yet.

Re: [Nutch-general] Re: Speed up indexing?

2005-07-21 Thread Matthias Jaekle
You probably don't want to touch indexer.termIndexInterval and indexer.maxMergeDocs (determines the max size of an individual segment). Why is maxMergeDocs 50 by default? Shouldn't this value be much higher? I found how to calculate the number of opened files. But how could I calculate the