Re: Efficient focused crawling

2009-11-28 Thread Eran Zinman
Thanks for your help MilleBii! I will definitely try the square-root option - but is that only valid for outlinks, or does it also affect pages linking to the page? Did you try implementing automatic Regex generation? I'm doing focused crawling but I'm also thinking about scaling it in the future. Also I

Re: Efficient focused crawling

2009-11-28 Thread MilleBii
I just use the Java built-in regex features... and therefore just supplied the string, which I designed for my case using RegexBuddy, a really great tool by the way. Pay attention, though, to static creation in order to avoid regex creation at each plug-in load and run-time hit. Didn't find a way to
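
The "static creation" advice above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual plug-in code: the class name UrlPatternFilter, the method accepts, and the URL expression are all hypothetical. It only shows the general idea of compiling a java.util.regex.Pattern once in a static field instead of on every call.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the name and the expression are illustrative only.
final class UrlPatternFilter {

    // Compiled once, when the class is first loaded, so repeated calls
    // to accepts() do not pay the regex compilation cost again.
    private static final Pattern ARTICLE_URL =
            Pattern.compile("^https?://[^/]+/articles/\\d+\\.html$");

    private UrlPatternFilter() {
        // Static utility, not meant to be instantiated.
    }

    // Returns true when the whole URL matches the precompiled pattern.
    static boolean accepts(String url) {
        return url != null && ARTICLE_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://example.com/articles/42.html"));
        System.out.println(accepts("http://example.com/about.html"));
    }
}
```

Pattern instances are immutable and safe to share between threads, while each call creates its own Matcher, so a static field of this kind should be usable from concurrent fetcher or parser threads.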

Re: Efficient focused crawling

2009-11-28 Thread MilleBii
Oops: why it shouldn't work for others. 2009/11/28 MilleBii mille...@gmail.com I just use the Java built-in regex features... and therefore just supplied the string, which I designed for my case using RegexBuddy, a really great tool by the way. Pay attention, though, to static creation in order

Re: Efficient focused crawling

2009-11-28 Thread Eran Zinman
Hi MilleBii, I think you misinterpreted what I meant. 1. Regarding Regex - I know I can build a Regex beforehand to identify URLs, but I will have to create one manually for each domain I'm crawling, which is not scalable. I'm looking for a way to build Regex automatically using automatic machine

File too large ...(mergesegs)

2009-11-28 Thread Patricio Galeas
Hi all, I'm trying to run a whole-web crawl on my notebook, with: - 40 GB free for the crawl directory - and “hadoop.tmp.dir” set to an external hard disk (USB) with 90 GB free. I used the crawl script from the wiki: http://wiki.apache.org/nutch/Crawl with the following configuration:

Re: Efficient focused crawling

2009-11-28 Thread MilleBii
Oh! 1. not worked, but if you find something I'm interested. 2. The inlinks by definition point to the page you are considering, so I don't understand what you mean. Boosting those inlinks actually means giving more weight to the page, which gets distributed to the outlinks. But probably what

Re: 100 fetches per second?

2009-11-28 Thread Julien Nioche
NUTCH-721 is a different issue. 719 has no patch but describes the solution to the problem you encountered. If you get errors with 770 it would be helpful to comment on the JIRA. 2009/11/27 MilleBii mille...@gmail.com Already applied that patch which is actually 721, I was part of that

Re: 100 fetches per second?

2009-11-28 Thread MilleBii
You're right, I forgot to put 719 manually when I moved on my Linux box. Thanks Julien. We really ought to have a patch for this one, and probably also in a Nutch 1.1. I will comment on the JIRA for 770, bear with me, I've never done that before. Now to the bandwidth issue: I found a way to greatly

Fetcher not ending

2009-11-28 Thread MilleBii
Although I have applied https://issues.apache.org/jira/browse/NUTCH-719 ( 769), my fetcher job hangs at the end: ... -finishing thread FetcherThread, activeThreads=2 -finishing thread FetcherThread, activeThreads=2 -finishing thread FetcherThread, activeThreads=2 -finishing thread

Re: Fetcher not ending

2009-11-28 Thread MilleBii
Oops, a bit too quick on this one: it actually ended while I was making this post. 2009/11/28 MilleBii mille...@gmail.com Although I have applied https://issues.apache.org/jira/browse/NUTCH-719 ( 769), my fetcher job hangs at the end: ... -finishing thread FetcherThread,

Nutch frozen but not exiting

2009-11-28 Thread Paul Tomblin
My nutch crawl just stopped. The process is still there, and doesn't respond to a kill -TERM or a kill -HUP, but it hasn't written anything to the log file in the last 40 minutes. The last thing it logged was some calls to my custom url filter. Nothing has been written in the hadoop directory

Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki
Paul Tomblin wrote: My nutch crawl just stopped. The process is still there, and doesn't respond to a kill -TERM or a kill -HUP, but it hasn't written anything to the log file in the last 40 minutes. The last thing it logged was some calls to my custom url filter. Nothing has been written in

Re: Nutch frozen but not exiting

2009-11-28 Thread Paul Tomblin
On Sat, Nov 28, 2009 at 4:45 PM, Andrzej Bialecki a...@getopt.org wrote: Paul Tomblin wrote: How can I tell what's going on and why it's stopped? Try to generate a thread dump to see what code is being executed. I didn't do any sort of distributed mode because I've only got one core. I had

missing hadoop folder within org.apache...

2009-11-28 Thread Myname To
Hello, can someone help me with this: I am using nutch-0-9 with hadoop and want to use the bw-filter from patch nutch-249. After running ant I get some errors about import problems. As I read somewhere, nutch imports hadoop itself, so it's not necessary to install it separately. I don't really understand

Re: missing hadoop folder within org.apache...

2009-11-28 Thread Varish Mulwad
There is no separate hadoop folder within nutch. -- Varish Myname To wrote: Hello, can someone help me with this: I am using nutch-0-9 with hadoop and want to use the bw-filter from patch nutch-249. After running ant I get some errors about import problems. As I read somewhere, nutch imports hadoop

AW: missing hadoop folder within org.apache...

2009-11-28 Thread Myname To
I saw that before, Varish, but do you have some advice for using the nutch-249 patch: https://issues.apache.org/jira/browse/NUTCH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel If there is no org.apache.hadoop folder, how can I use this patch? Any help, advice, pointers are

Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki
Paul Tomblin wrote: On Sat, Nov 28, 2009 at 5:48 PM, Andrzej Bialecki a...@getopt.org wrote: Paul Tomblin wrote: -bash-3.2$ jstack -F 32507 Attaching to process ID 32507, please wait... Hm, I can't see anything obviously wrong with that thread dump. What's the CPU and swap usage, and

Re: Nutch frozen but not exiting

2009-11-28 Thread Paul Tomblin
On Sat, Nov 28, 2009 at 8:25 PM, Andrzej Bialecki a...@getopt.org wrote: Hm, the curious thing here is that the java process is sleeping, and 99% of cpu is in system time ... usually this would indicate swapping, but since there is no swap in your setup I'm stumped. Still, this may be related