Re: Nutch reindex cron

2009-06-01 Thread Alexander Aristov
You can restart Tomcat from bash. I found a related issue in JIRA: NUTCH-376, "Add methods to control runtime behaviour of NutchBean". Not sure if the case is still valid, but I had to implement a similar mechanism for my own use. Solr might have something that works better with the index. Best Regards, Alexander

hadoop.log in parallel crawling

2009-06-01 Thread Mick Peters
Hi dudes, I'm crawling several sites in parallel, and if I crawl with the default nutch_conf I have problems writing to the same hadoop.log. Now I create a nutch_conf_dir per site and set a new hadoop.log.dir per site in log4j.properties; the problem is when I'm running a crawl and export the new
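One way to keep per-site settings local to each crawl process, rather than exported shell-wide, is to build a separate environment per site and hand it to the child process. The Python sketch below only builds those environments. It assumes that the bin/nutch launcher honours the NUTCH_CONF_DIR and NUTCH_LOG_DIR environment variables (worth checking against the script in your release), and the directory layout and site names are hypothetical.

```python
import os
from pathlib import Path

def site_env(site, base_dir, parent_env=None):
    """Return a copy of the environment with per-site conf and log directories.

    NUTCH_CONF_DIR and NUTCH_LOG_DIR are assumed to be read by bin/nutch;
    the <base_dir>/<site>/conf and <base_dir>/<site>/logs layout is hypothetical.
    """
    env = dict(os.environ if parent_env is None else parent_env)
    site_dir = Path(base_dir) / site
    env["NUTCH_CONF_DIR"] = str(site_dir / "conf")
    env["NUTCH_LOG_DIR"] = str(site_dir / "logs")
    return env

# Each crawl gets its own dict, so nothing is shared through the parent shell.
env_a = site_env("site-a", "/data/nutch-sites", parent_env={})
env_b = site_env("site-b", "/data/nutch-sites", parent_env={})
print(env_a["NUTCH_LOG_DIR"])
print(env_b["NUTCH_LOG_DIR"])
```

Each dictionary can then be passed as the `env=` argument of `subprocess.Popen` when starting that site's crawl, so a later change for one site does not touch a crawl that is already running.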

Getting the language-identifier info

2009-06-01 Thread Larsson85
Hi, I wonder how I can extract the info that the language-identifier plugin produces. Ideally, the info would appear when I dump the data from the segments with the following command: bin/nutch readseg -dump crawl/segments/... output -nofetch -nogenerate -noparse

Re: Arabic language in Nutch

2009-06-01 Thread Chetan Patel
Hello, I want to crawl an Arabic URL (http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar). It uses charset windows-1256. I have another URL (http://www.afp.com/afpcom/ar/home), and it uses charset UTF-8. This link works fine (crawling, indexing and searching working

Problem opening the index

2009-06-01 Thread Raymond Balmès
Hi guys, My webapp suddenly stopped working. I get the following in the logs. Where is this coming from? -Ray- 2009-06-01 17:48:38,688 INFO SearchBean - opening merged index in d://nutch/crawl/index 2009-06-01 17:48:38,715 WARN FileSystem - uri=file:///

Re: Problem opening the index

2009-06-01 Thread Bartosz Gadzimski
I think it is the login of the user running Nutch. It wants one token after running whoami, but it gets: autorite nt\system. So change the user to a one-word name like autorite and it should help. Thanks, Bartosz Raymond Balmès wrote: Hi guys, My webapp does not work anymore suddenly... I get the

RE: Nutch reindex cron

2009-06-01 Thread Malaviya, Sanjay X
Hi Kevin, I have been trying to create a script for re-indexing (I suppose also called re-crawling) to run every night. I am having problems with the section I listed below, especially -adddays 30. It takes more than 24 hrs to reindex. If I make it smaller, then it doesn't pick up the modified files.
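One possible reading of the trade-off described above: as I understand it, the generate step treats a page as due when its next scheduled fetch time falls on or before "now plus -adddays". The Python sketch below is a simplified model of that rule, not Nutch code. The function name, the 30-day re-fetch interval and the dates are assumptions for illustration.

```python
from datetime import datetime, timedelta

def is_due(next_fetch, now, adddays=0):
    """Simplified model: a page is due if its next fetch time
    is not later than now shifted forward by adddays."""
    return next_fetch <= now + timedelta(days=adddays)

now = datetime(2009, 6, 1)
interval = timedelta(days=30)  # assumed re-fetch interval

# Next scheduled fetch time = last fetch time + interval.
fetched_yesterday = now - timedelta(days=1) + interval
fetched_25_days_ago = now - timedelta(days=25) + interval

for adddays in (0, 7, 30):
    print(adddays,
          is_due(fetched_yesterday, now, adddays),
          is_due(fetched_25_days_ago, now, adddays))
```

Under this model, -adddays 30 makes every page due no matter how recently it was fetched, which would be consistent with the long run time, while a smaller value only selects pages that are already close to their scheduled re-fetch.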

Re: Arabic language in Nutch

2009-06-01 Thread Ken Krugler
Hello, I want to crawl Arabic URL (http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar) It contains charset windows-1256. Are you sure it's really 1256? The charset returned by the server (in the headers) for this page is UTF-8: Content-Type: text/html; charset=utf-8
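To make the header check above concrete, here is a small Python sketch that extracts the charset parameter from a Content-Type header value using the standard library. It only parses a header string you already have and does not fetch anything. The helper name is hypothetical, and this is not how Nutch itself detects encodings.

```python
from email.message import Message

def charset_from_content_type(value):
    """Return the lower-cased charset parameter of a Content-Type
    header value, or None if the header declares no charset."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()

# Header value quoted in the message above.
print(charset_from_content_type("text/html; charset=utf-8"))
# A header without a charset parameter.
print(charset_from_content_type("text/html"))
```

Comparing this value with whatever charset the page declares in its own markup is a quick way to spot a mismatch between the server headers and the document.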

Re: Problem opening the index

2009-06-01 Thread Raymond Balmès
Hmm, where do you change that? 2009/6/1 Bartosz Gadzimski bartek...@o2.pl I think it is the login of the user running Nutch. It wants one token after running whoami, but it gets: autorite nt\system. So change the user to a one-word name like autorite and it should help. Thanks, Bartosz Raymond

Can Nutch crawler Impersonate user-agent?

2009-06-01 Thread Jake Jacobson
Hi, I am testing out Nutch 1.0 and it doesn't seem to be able to crawl my website that has the following robots.txt file: User-agent: imo-robot-intelink Disallow: /App_Themes/ Disallow: /app_themes/ Disallow: /Archive/ Disallow: /archive/ Disallow: /Bin/ Disallow: /bin/ I have the

Re: Can Nutch crawler Impersonate user-agent?

2009-06-01 Thread David M. Cole
At 2:23 PM -0400 6/1/09, Jake Jacobson wrote: User-agent: imo-robot-intelink Disallow: /App_Themes/ Disallow: /app_themes/ Disallow: /Archive/ Disallow: /archive/ Disallow: /Bin/ Disallow: /bin/ Jake: I think you need to add one more line after the last line: Allow: / \dmc --

Re: Can Nutch crawler Impersonate user-agent?

2009-06-01 Thread Jake Jacobson
The Allow directive in the robots.txt is optional. If you don't have an explicit disallow statement, it means that directory or file is available for indexing. Jake Jacobson http://www.linkedin.com/in/jakejacobson http://www.facebook.com/people/Jake_Jacobson/622727274 Our greatest fear should
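The point about Allow being optional can be checked with Python's standard-library robots.txt parser. This is a sketch using only the rules quoted in this thread (the original file is cut off in the archive), and it shows how that particular parser behaves, not how Nutch's own robots handling works. The host example.com, the test paths and the second agent name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Only the rules quoted in the thread; the rest of the file is not shown there.
ROBOTS_TXT = """\
User-agent: imo-robot-intelink
Disallow: /App_Themes/
Disallow: /app_themes/
Disallow: /Archive/
Disallow: /archive/
Disallow: /Bin/
Disallow: /bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """Ask the parser whether `agent` may fetch `path` on a placeholder host."""
    return parser.can_fetch(agent, "http://example.com" + path)

print(allowed("imo-robot-intelink", "/index.html"))    # no Disallow rule matches
print(allowed("imo-robot-intelink", "/Bin/tool.dll"))  # matches Disallow: /Bin/
print(allowed("SomeOtherBot", "/Bin/tool.dll"))        # no group names this agent
```

With this parser, a path that matches no Disallow rule is fetchable without any Allow line, and an agent that no group names is not restricted by these rules at all.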

Question on Efficient field updates in the Lucene index in Nutch

2009-06-01 Thread Vijay
Hi all, I have a question regarding field updates to the Lucene index in Nutch. Suppose I am indexing webpages along with tags as an extra field. I want to add an extra tag to a webpage. Is there a clean way for me to do this without having to re-index the page with the updated tags

Re: Nutch reindex cron

2009-06-01 Thread kevin chen
Yes, I experience problems too if I do it too many times. To cope with it, I restart the Tomcat server every day during the lowest-traffic period. In my case, I only reload the index a few times a day, and it works out fine. On Mon, 2009-06-01 at 09:07 +0400, Alexander Aristov wrote: I experienced problems