nutch and cpanel

2009-08-20 Thread fadzi
Has anyone ever deployed the Nutch crawler on a machine running the cPanel hosting software? I mean running as root, not inside a virtual private server.

Re: topN value in crawl

2009-08-20 Thread Marko Bauhardt
On Aug 19, 2009, at 8:42 PM, alx...@aim.com wrote:
> Hi, thanks. What if the URLs in my seed file do not have outlinks, let's say .pdf files? Should I still specify the topN variable? All I need is to index all the URLs in my seed file, and there are about 1 M of them.
topN means that your generated shards

Possible memory leak in Nutch-1.0 ?

2009-08-20 Thread Mark Round
Hi all, I am experiencing serious out of memory errors when querying Nutch, and would appreciate any pointers or advice. I have a Nutch index that I'm searching using a simple servlet. This servlet queries the index and returns the results as XML, so other systems in my network can make use of the

FW: Possible memory leak in Nutch-1.0 ?

2009-08-20 Thread Mark Round
Hello again, I decided to strip things down to the bare minimum, and I have what I believe to be a test case that should reproduce this situation, leaving Tomcat etc. out of the equation. I have attached below a very simple class that loops over a search (TestNutch.java). If I run this with the fo

Re: FW: Possible memory leak in Nutch-1.0 ?

2009-08-20 Thread Kirby Bohling
Mark, I'm very interested in this problem. I'm the author of those patches. I have access to YourKit. I will set up your test case and look into it, hopefully in the next couple of days. I know we have done some stress testing, but clearly not enough if you are having this problem. Anyth

RE: FW: Possible memory leak in Nutch-1.0 ?

2009-08-20 Thread Mark Round
Hi Kirby, Thanks for your reply! There's nothing else unusual about my setup. For my test case, I'm passing nothing but the parameters I originally posted, so the memory configuration should be the default. IIRC, that's 128 MB in total for the heap. I have tried forcing a GC, but that doesn't see

Re: Possible memory leak in Nutch-1.0 ?

2009-08-20 Thread Marko Bauhardt
Hi. You said that you open and close the Nutch bean at every request. First, this is very expensive. Create the Nutch bean only once, save it in the application, and read it from the application when needed. Second, not sure, but maybe it is possible that the PluginRepository has the memory
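The first suggestion above (create the bean once, keep it, and reuse it for every request) can be sketched roughly as follows. This is only an illustration under stated assumptions: `ExpensiveSearcher` and `SearcherHolder` are hypothetical names, not Nutch classes, and the static holder stands in for the web application scope in which the real bean would be stored.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a searcher that is expensive to open,
// such as the Nutch bean discussed in the thread.
final class ExpensiveSearcher {
    // Counts how many times the costly setup has run.
    static final AtomicInteger OPENED = new AtomicInteger();

    ExpensiveSearcher() {
        OPENED.incrementAndGet();
    }

    String search(String query) {
        return "results for " + query;
    }
}

// Hypothetical holder: builds the searcher on first use and hands the
// same instance to every later caller instead of reopening it.
final class SearcherHolder {
    private static ExpensiveSearcher instance;

    private SearcherHolder() {}

    static synchronized ExpensiveSearcher get() {
        if (instance == null) {
            instance = new ExpensiveSearcher();
        }
        return instance;
    }
}
```

Each request handler would then call `SearcherHolder.get().search(...)` rather than constructing and closing its own searcher, so the setup cost is paid once per application instead of once per request.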

RE: Possible memory leak in Nutch-1.0 ?

2009-08-20 Thread Mark Round
> not sure but maybe it is possible that the PluginRepository has the > memory leak. i think the cache (the weakhashmap) is growing and growing. Is this the same issue as reported here : https://issues.apache.org/jira/browse/NUTCH-356 ? It may be adding to my troubles, but I suspect my immediat

Re: Possible memory leak in Nutch-1.0 ?

2009-08-20 Thread Marko Bauhardt
On Aug 20, 2009, at 5:42 PM, Mark Round wrote:
>> not sure but maybe it is possible that the PluginRepository has the memory leak. i think the cache (the weakhashmap) is growing and growing.
> Is this the same issue as reported here: https://issues.apache.org/jira/browse/NUTCH-356 ?
Oops, yes. s
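As background to the "growing and growing" WeakHashMap remark, here is a minimal sketch of one documented way such a cache can fail to shrink: when each cached value holds a strong reference back to its own key, the key never becomes weakly reachable, so the entry is never dropped. Whether this is what actually happens in PluginRepository is not confirmed by the thread. `ConfigKey`, `CachedRepository`, and `WeakCacheDemo` are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical key type, standing in for a configuration object that is
// created afresh for each request.
final class ConfigKey {}

// Hypothetical cached value that keeps a strong reference to its key.
final class CachedRepository {
    final ConfigKey owner;

    CachedRepository(ConfigKey owner) {
        this.owner = owner;
    }
}

final class WeakCacheDemo {
    // Adds one entry per simulated request, each under a brand-new key,
    // and returns the map so the caller can inspect its size.
    static Map<ConfigKey, CachedRepository> fill(int requests) {
        Map<ConfigKey, CachedRepository> cache = new WeakHashMap<>();
        for (int i = 0; i < requests; i++) {
            ConfigKey key = new ConfigKey();
            // The value points back at the key, so the map itself keeps the
            // key strongly reachable and the entry cannot be collected.
            cache.put(key, new CachedRepository(key));
        }
        return cache;
    }
}
```

In this sketch the map keeps one entry per simulated request even after a garbage collection, because every key stays strongly reachable through its value. Reusing a single key object, or not referencing the key from the value, would avoid the growth.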

Hosting java/jsp rec ?

2009-08-20 Thread MilleBii
One day or another you need to go live, and so will I. Does anyone have experience putting Nutch on Tomcat/JSP hosting packages? I kind of get lost; there is always something wrong: lack of disk space, lack of memory, or only WAR file uploads being allowed. What about JVM sharing? Is there no other solution than renting a ded

Re: topN value in crawl

2009-08-20 Thread alxsss
In the tutorial on the wiki the depth is not specified and topN=1000. I ran those commands yesterday and they are still running. Will it index all my URLs? My seed file has about 20K URLs. Thanks. Alex. -Original Message- From: Marko Bauhardt To: nutch-user@lucene.apache.org Se

Keywords?

2009-08-20 Thread Paul Tomblin
Is there a way to extract the keywords from an html page? I can't find it in ParseData or CrawlDatum anywhere. -- http://www.linkedin.com/in/paultomblin

Re: Fetcher aborting strangely

2009-08-20 Thread MilleBii
Aborting again with a fetchqueue size of 1. Any idea what I can do? 2009/8/19 MilleBii > Well in the segment there is nothing but _temporary... and then after a > number of spinwaiting for the last 6 elements it aborts and the segment is > empty... > I guess everything is left in the tmp file

Re: Fetcher aborting strangely

2009-08-20 Thread Doğacan Güney
On Fri, Aug 21, 2009 at 09:36, MilleBii wrote: > Aborting again with a fetchqueue size of 1 any idea what can I do. > That's very strange. Normally fetcher's abort is a controlled action so there should be data in segments. Can you check your logs to see if there is anything? > > 2009/8/19