Has anyone ever deployed the Nutch crawler on a machine running cPanel
hosting software?
I mean running as root, not inside a virtual private server.
On Aug 19, 2009, at 8:42 PM, alx...@aim.com wrote:
Hi,
Thanks. What if the URLs in my seed file do not have outlinks, say
.pdf files? Should I still specify the topN variable? All I need is
to index all the URLs in my seed file, and there are about 1 M of them.
topN means that your generated shards
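For a flat seed list like the one described (no outlinks, roughly 1 M URLs), the inject and generate steps can be sketched as below. This is a sketch only, assuming the standard Nutch 1.x command layout; `crawl/crawldb`, `crawl/segments`, and `urls` are assumed paths, not taken from the thread.

```shell
# Inject the flat seed list into the crawldb (paths are assumptions):
bin/nutch inject crawl/crawldb urls

# Generate one fetch list; -topN caps how many URLs go into this
# segment, so a cap at or above the seed count covers everything:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
```

With -topN below the seed count, the remaining URLs would simply wait for a later generate round.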
Hi all,
I am experiencing serious out of memory errors when querying Nutch, and
would appreciate any pointers or advice. I have a Nutch index that I'm
searching using a simple servlet. This servlet queries the index and
returns the results as XML, so other systems in my network can make use
of the
Hello again,
I decided to strip things down to the bare minimum, and I have what I
believe to be a test case that should reproduce this situation, leaving
Tomcat etc. out of the equation. I have attached below a very simple
class that loops over a search (TestNutch.java). If I run this with the
fo
Mark,
I'm very interested in this problem. I'm the author of those
patches. I have access to YourKit. I will set up your test case and
look into it hopefully in the next couple of days. I know we have
done some stress testing, but clearly not enough if you are having
this problem.
Anyth
Hi Kirby,
Thanks for your reply! There's nothing else unusual about my setup.
For my test case, I'm passing nothing but the parameters I originally
posted, so the memory configuration should be the default. IIRC, that's
128 MB in total for the heap. I have tried forcing a GC, but that doesn't
see
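If the default heap really is the bottleneck, it can be raised when the servlet runs under Tomcat. A sketch assuming a standard Tomcat install; the 512m/256m figures are arbitrary examples, not values from the thread.

```shell
# Give Tomcat a larger heap than the JVM default before starting.
# CATALINA_OPTS is read by Tomcat's startup scripts.
export CATALINA_OPTS="-Xmx512m -Xms256m"
$CATALINA_HOME/bin/startup.sh
```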
Hi.
You said that you open and close the Nutch bean on every request.
First,
this is very expensive. Create the Nutch bean only once, save it in
the application, and read it from the application when needed.
Second,
not sure, but maybe it is possible that the PluginRepository has the
memory
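The "create it once" advice above can be sketched generically. NutchBean itself needs the Nutch jars and a servlet context, so a hypothetical `ExpensiveBean` stands in for it here; only the caching pattern is the point.

```java
// CachedBean.java - sketch of caching one expensive object for the
// lifetime of the application instead of constructing it per request.
public class CachedBean {
    // Stand-in for a NutchBean-like object that is costly to build
    // (opens index readers, loads plugins, etc.).
    static class ExpensiveBean {
        static int constructions = 0;
        ExpensiveBean() { constructions++; }
        String search(String q) { return "results for " + q; }
    }

    private static volatile ExpensiveBean instance;

    // Lazily create the bean once; every later request reuses it.
    static ExpensiveBean get() {
        if (instance == null) {
            synchronized (CachedBean.class) {
                if (instance == null) instance = new ExpensiveBean();
            }
        }
        return instance;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) CachedBean.get().search("nutch");
        System.out.println("constructions=" + ExpensiveBean.constructions);
    }
}
```

In a servlet, the same idea would store the bean in the application-scoped ServletContext at startup rather than in a static field.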
> not sure but maybe it is possible that the PluginRepository has the
> memory leak. i think the cache (the weakhashmap) is growing and
> growing.
Is this the same issue as reported here :
https://issues.apache.org/jira/browse/NUTCH-356 ?
It may be adding to my troubles, but I suspect my immediat
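The WeakHashMap failure mode suspected above can be demonstrated in isolation: a WeakHashMap only evicts entries whose keys become unreachable, so if the surrounding code (here a hypothetical stand-in for a plugin registry) keeps strong references to the keys, the "weak" cache grows and never shrinks.

```java
import java.util.*;

public class WeakCacheSketch {
    // Fill a WeakHashMap with n entries whose keys stay strongly
    // referenced, then report how many entries survive a GC.
    static int pinnedSize(int n) {
        Map<Object, String> cache = new WeakHashMap<>();
        List<Object> strongKeys = new ArrayList<>(); // pins every key
        for (int i = 0; i < n; i++) {
            Object key = new Object();
            strongKeys.add(key);
            cache.put(key, "descriptor-" + i);
        }
        System.gc(); // cannot reclaim strongly referenced keys
        return cache.size();
    }

    public static void main(String[] args) {
        // All 1000 entries survive because every key is still reachable.
        System.out.println("entries=" + pinnedSize(1000));
    }
}
```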
On Aug 20, 2009, at 5:42 PM, Mark Round wrote:
> not sure but maybe it is possible that the PluginRepository has the
> memory leak. i think the cache (the weakhashmap) is growing and
> growing.
Is this the same issue as reported here :
https://issues.apache.org/jira/browse/NUTCH-356 ?
Oops, yes. s
One day or another you need to go live, and so it will be for me.
Anyone with experience putting Nutch on Tomcat/JSP hosting packages? I
keep getting lost; always something is wrong:
lack of disk space, or lack of memory, or only being able to upload WAR
files. What about JVM sharing?
Is there no other solution than renting a ded
In the tutorial on the wiki the depth is not specified and topN=1000. I ran
those commands yesterday and they are still running. Will they index all my
URLs? My seed file has about 20K URLs.
Thanks.
Alex.
-Original Message-
From: Marko Bauhardt
To: nutch-user@lucene.apache.org
Se
Is there a way to extract the keywords from an html page? I can't
find it in ParseData or CrawlDatum anywhere.
--
http://www.linkedin.com/in/paultomblin
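Whether the parse pipeline exposes meta tags depends on the Nutch version and configuration, so as a fallback the keywords meta tag can be pulled from the raw HTML directly. A rough, assumption-laden sketch: regex-based rather than a real HTML parser, and it only handles the `name` attribute appearing before `content`.

```java
import java.util.regex.*;

public class KeywordsExtract {
    // Crude extraction of <meta name="keywords" content="...">.
    // Returns the content attribute, or null if no keywords tag exists.
    static String keywords(String html) {
        Pattern p = Pattern.compile(
            "<meta\\s+name=[\"']keywords[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"keywords\" content=\"nutch,crawler\">"
            + "</head></html>";
        System.out.println(keywords(html));
    }
}
```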
Aborting again with a fetch queue size of 1; any idea what I can do?
2009/8/19 MilleBii
> Well in the segment there is nothing but _temporary... and then after a
> number of spinwaiting for the last 6 elements it aborts and the segment is
> empty...
> I guess everything is left in the tmp file
On Fri, Aug 21, 2009 at 09:36, MilleBii wrote:
> Aborting again with a fetchqueue size of 1 any idea what can I do.
>
That's very strange. Normally the fetcher's abort is a controlled action, so
there should be data in segments. Can you check your logs to see if there is
anything?
>
> 2009/8/19
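To check the logs as suggested, something like the following sketch can surface the relevant lines; `logs/hadoop.log` is an assumed path for the Nutch log file and may differ per install.

```shell
# Look for fetcher errors around the abort; adjust the path to your
# install's log location:
grep -iE "fetcher|exception|error" logs/hadoop.log | tail -n 40
```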