Hi,
Is there a way to list all indexed sites?
I've tried searching for "http" (search.jsp?query=http), but after some
millions of sites have been indexed it does not return all sites.
Thanks
Hi,
Which imports do I need to add to an indexer plugin
(like import org.apache.nutch.parse.Outlink; and ...)
in order to get the code below to work:
// add links
Outlink[] outlinks = parse.getData().getOutlinks();
int end = Math.min(outlinks.length, UpdateDatabaseTool.MAX_OUTLINKS_PER_PAG
I believe this list is now hosted on the apache.org servers, not
sourceforge servers (correct me if I'm wrong)... They transferred all of
the sourceforge subscriptions over to the new mailing list server.
Gus
On Thu, 21 Jul 2005, Paul Stewart wrote:
Hi there...
I'm not a list newbie..really
Hi,
--- Matthias Jaekle <[EMAIL PROTECTED]> wrote:
> > You probably don't want to touch indexer.termIndexInterval and
> > indexer.maxMergeDocs (determines the max size of an individual
> > segment).
> Why is maxMergeDocs 50 by default? Should not this value be much
> higher?
50 is probably OK fo
You probably don't want to touch indexer.termIndexInterval and
indexer.maxMergeDocs (determines the max size of an individual
segment).
Why is maxMergeDocs 50 by default? Should not this value be much higher?
I found how to calculate the number of opened files
But how could I calculate the memor
Hi Erik,
Stefan - thanks for the reply. I'm still digesting Nutch and how
to work with it at a basic level but it does make sense to allow
metadata to tag along with fetches - I certainly don't know enough
yet to say whether your patch fits into the long-term vision of
Nutch or not yet.
Stefan - thanks for the reply. I'm still digesting Nutch and how to
work with it at a basic level but it does make sense to allow
metadata to tag along with fetches - I certainly don't know enough
yet to say whether your patch fits into the long-term vision of Nutch
or not yet.
I've star
Hi there...
I'm not a list newbie..really I'm not..;)
When I go to https://lists.sourceforge.net/lists/listinfo/nutch-general
and try to put in my email address to unsubscribe, it tells me I don't
exist... Is something broken? :) Can anyone on here manually unsubscribe
me?
Appreciate it,
Paul
Hi,
--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Matthias Jaekle wrote:
> > Hi Andrzej,
> >
> > thanks for your response. I am not really familiar with the Lucene
> > internals.
> >
> > I am just running Nutch with the default parameters on a Debian sarge
> > system with ext3 file syste
Hi there,
I had submitted a patch as well, one that does not need any external
library, just an XML parser and an XSLT processor, which is, I think,
part of the JDK anyway.
My solution works with an XSL stylesheet that converts different RSS feeds
to one format, and then I parse this format with a normal XML parser.
Matthias,
minMergeDocs is what controls how many Documents will be held in memory
before being flushed to the disk, and mergeFactor controls how often
segments are merged. Both values are 10 by default in Lucene, I
believe.
If you have Lucene in Action, this is described in more detail there.
If
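To make the interplay of these settings more concrete, here is a rough sketch in plain Java. It is a simplified model written only for illustration, not Lucene's actual code: the class MergeSimulation and the method simulate are hypothetical names, every added document is modelled as a one-document segment, and a "merge" simply adds up document counts. It only shows how minMergeDocs, mergeFactor and maxMergeDocs might shape the number and size of segments under those assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of segment merging; NOT Lucene's real implementation.
public class MergeSimulation {

    // Returns the sizes (in documents) of all segments left after adding totalDocs documents.
    static List<Integer> simulate(int totalDocs, int minMergeDocs,
                                  int mergeFactor, int maxMergeDocs) {
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < totalDocs; i++) {
            segments.add(1); // assumption: each new document starts as a 1-doc segment
            maybeMerge(segments, minMergeDocs, mergeFactor, maxMergeDocs);
        }
        return segments;
    }

    private static void maybeMerge(List<Integer> segments, int minMergeDocs,
                                   int mergeFactor, int maxMergeDocs) {
        long target = minMergeDocs;
        while (target <= maxMergeDocs) {
            // Walk backwards over the newest segments that are smaller than the target.
            int start = segments.size();
            int mergeDocs = 0;
            while (start > 0 && segments.get(start - 1) < target) {
                start--;
                mergeDocs += segments.get(start);
            }
            if (mergeDocs < target) {
                break; // not enough small segments yet
            }
            // Replace the small trailing segments by one merged segment.
            segments.subList(start, segments.size()).clear();
            segments.add(mergeDocs);
            target *= mergeFactor; // the next level is mergeFactor times larger
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate(1000, 10, 10, Integer.MAX_VALUE).size());
        System.out.println(simulate(1000, 10, 10, 50).size());
    }
}
```

In this model, 1000 documents with minMergeDocs = 10, mergeFactor = 10 and no effective limit on maxMergeDocs end up in a single segment, while maxMergeDocs = 50 leaves 100 segments of 10 documents each. Under these assumptions a low maxMergeDocs means many small segments, and therefore more files to keep open.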
Hi Andrzej,
At the time that I was working diligently on this plugin (April/May), I
had done some thorough research into finding what I felt would be the most
flexible, reliable way to parse RSS files. The RSS feed parser out of the
jakarta-commons sandbox was what I found, and I stand by it. I
Matthias Jaekle wrote:
Hi Andrzej,
thanks for your response. I am not really familiar with the Lucene
internals.
I am just running Nutch with the default parameters on a Debian sarge
system with ext3 file system, maximum 1024 files opened, and 1 GB RAM.
So is ext3 a bad file system for mill
When you run dedup, it undeletes all previous deletions from the
index. So run dedup before you prune, otherwise all the documents you
pruned will be undeleted by dedup.
As far as optimizing goes, you can open an IndexWriter and run
optimize(). Take a look at IndexSegment on how an IndexWriter i
Hello,
You probably have some problems with Nutch plugins. It is quite
difficult to tell from your email, but I assume you have created a
new Nutch plugin for it.
1) Please check if everything is correctly specified in your plugin.xml
file.
2) Check if you have included your plugin in
nut
Hi Andrzej,
thanks for your response. I am not really familiar with the Lucene internals.
I am just running Nutch with the default parameters on a Debian sarge
system with ext3 file system, maximum 1024 files opened, and 1 GB RAM.
So is ext3 a bad file system for millions of files?
I could no
Dear Users!
How do I really remove deleted entries from the index?
I ran the 'prune' tool and the 'dedup' tool, and afterwards I would like to
remove the deleted entries from the index. How do I optimize the indexes?
Regards,
Ferenc
hi there,
If I put multiple web URLs in the plain text file
"urls" used in the following command, will it fetch
multiple web sites for me?
"
bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log
"
thanks,
Michael,
Start yo
Matthias Jaekle wrote:
050721 071234 * Optimizing index...
... this takes a long time ...
Hello,
optimizing the index takes extremely long.
I have the feeling that in earlier versions this was much faster.
I am trying to index a segment of 7,000,000 pages.
It has been running for 10 days now.
Processing
050721 071234 * Optimizing index...
... this takes a long time ...
Hello,
optimizing the index takes extremely long.
I have the feeling that in earlier versions this was much faster.
I am trying to index a segment of 7,000,000 pages.
It has been running for 10 days now.
Processing was starting with around
Hi Jerome
Date: Thu, 21 Jul 2005 09:44:39 GMT
Server: Apache/1.3.26 (Unix) mod_gzip/1.3.26.1a mod_auth_pam/1.0a PHP/4.3.11 PHP/3.0.18 mod_ssl/2.8.10 OpenSSL/0.9.6g mod_perl/1.27 mod_jk/1.1.0 FrontPage/5.0.2.2510
Last-Modified: Wed, 13 Oct 2004 04:08:56 GMT
ETag: "11810b-d8a-416caa58"
Accept-Ranges
On 7/21/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Hi
>
> I'm getting a lot of the following "errors?" when fetching a
> segment:
>
> 050721 094100 fetch okay, but can't parse
> http://www.sahunt.co.za/sahunter/recepies/biltongsoup.html,
> reason: failed(2,203): Content-Type not applica
Hi
I'm getting a lot of the following "errors?" when fetching a
segment:
050721 094100 fetch okay, but can't parse
http://www.sahunt.co.za/sahunter/recepies/biltongsoup.html,
reason: failed(2,203): Content-Type not application/msword:
The page above is a pure HTML page; however, the fetch is OK
bu
Hello Otis,
If you are only reading ParseData and FetcherOutput from a Nutch segment,
you do not need the Lucene index at all. So you can safely skip the -i switch.
Regards
Piotr
On 7/21/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm using SegmentMergeTool to merge some large segments, an