You can restart Tomcat from bash.
I found this issue in JIRA:
NUTCH-376
Add methods to control runtime behaviour of NutchBean
Not sure if the case is still valid, but I had to implement a similar
mechanism for my own use.
Solr might have something that works better with the index.
Best Regards
Alexander
Hi dudes, I'm crawling several sites in parallel, and if I crawl with the
default nutch_conf I have problems writing to the same hadoop.log.
Now I create a nutch_conf_dir per site and set a new hadoop.log.dir per site
in log4j.properties. The problem is when I'm running a crawl and export the
new
Hi
I wonder how I can extract the info that the language-identifier plugin
produces. Ideally, I would like the info to appear when I dump
the data from the segments with the following command:
bin/nutch readseg -dump crawl/segments/... output -nofetch -nogenerate
-noparse
Hello,
I want to crawl an Arabic URL
(http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar). It
uses the charset windows-1256.
I have another URL (http://www.afp.com/afpcom/ar/home), and it uses the
charset UTF-8. These links work fine (crawling, indexing and searching working
Hi guys,
My webapp suddenly stopped working... I get the following in the logs.
Where is this coming from?
-Ray-
2009-06-01 17:48:38,688 INFO SearchBean - opening merged index in
d://nutch/crawl/index
2009-06-01 17:48:38,715 WARN FileSystem - uri=file:///
I think it is the login of the user running Nutch.
It expects a single token from whoami, but it gets:
autorite nt\system
So change the user to a one-word name like autorite and it should help.
Thanks,
Bartosz
Raymond Balmès wrote:
Hi guys,
My webapp suddenly stopped working... I get the
Hi Kevin,
I have been trying to create a script for re-indexing (I suppose also
called re-crawling) to run every night. I am having problems with the
section I listed below, especially -adddays 30. It takes more than 24
hours to reindex. If I make it smaller, it doesn't pick up the modified
files.
Hello,
I want to crawl an Arabic URL
(http://www.kuna.net.kw/NewsAgenciesPublicSite/HomePage.aspx?Language=ar). It
uses the charset windows-1256.
Are you sure it's really 1256? The charset
returned by the server (in the headers) for this
page is UTF-8:
Content-Type: text/html; charset=utf-8
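
For anyone who wants to check this programmatically, here is a minimal sketch in Python (standard library only) that pulls the charset out of a Content-Type header value and compares it with the charset the page is believed to use. The function names are made up for illustration, and fetching the header from the server is left out; the sketch only parses a header string you already have.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def charset_matches(header_value, expected_charset):
    """True if the header declares the expected charset (case-insensitive)."""
    declared = charset_from_content_type(header_value)
    return declared is not None and declared == expected_charset.lower()


if __name__ == "__main__":
    header = "text/html; charset=utf-8"
    print(charset_from_content_type(header))
    print(charset_matches(header, "windows-1256"))
```

If the second call reports a mismatch, the server header and the assumed page charset disagree, which is the situation described above.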
Hmm, where do you change that?
2009/6/1 Bartosz Gadzimski bartek...@o2.pl
I think it is the login of the user running Nutch.
It expects a single token from whoami, but it gets:
autorite nt\system
So change the user to a one-word name like autorite and it should help.
Thanks,
Bartosz
Raymond
Hi,
I am testing out Nutch 1.0 and it doesn't seem to be able to crawl my
website that has the following robots.txt file:
User-agent: imo-robot-intelink
Disallow: /App_Themes/
Disallow: /app_themes/
Disallow: /Archive/
Disallow: /archive/
Disallow: /Bin/
Disallow: /bin/
I have the
At 2:23 PM -0400 6/1/09, Jake Jacobson wrote:
User-agent: imo-robot-intelink
Disallow: /App_Themes/
Disallow: /app_themes/
Disallow: /Archive/
Disallow: /archive/
Disallow: /Bin/
Disallow: /bin/
Jake:
I think you need to add one more line after the last line:
Allow: /
\dmc
--
The Allow directive in the robots.txt is optional. If you don't have
an explicit disallow statement, it means that directory or file is
available for indexing.
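
A quick way to see this behaviour is the sketch below, which feeds the rules from this thread to Python's standard urllib.robotparser. This is Python's parser, not the one Nutch uses, so it only illustrates the general rule; the host example.com, the sample paths, and the helper name is_allowed are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Rules quoted in this thread; there is no Allow line.
ROBOTS_TXT = """\
User-agent: imo-robot-intelink
Disallow: /App_Themes/
Disallow: /app_themes/
Disallow: /Archive/
Disallow: /archive/
Disallow: /Bin/
Disallow: /bin/
"""


def is_allowed(robots_txt, agent, url):
    """Parse robots.txt text and ask whether the agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    agent = "imo-robot-intelink"
    # example.com and these paths are placeholders, not the real site.
    for path in ("/index.html", "/Bin/tool.dll", "/archive/2009/"):
        url = "http://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, agent, url))
```

With this parser, a path that matches no Disallow line is reported as fetchable even though the file has no Allow line.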
Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/people/Jake_Jacobson/622727274
Our greatest fear should
Hi all,
I have a question regarding field updates to the Lucene index in
Nutch.
Suppose I am indexing web pages along with tags as an extra field. I
want to add an extra tag to a web page. Is there a clean way for me to do
this without having to re-index the page with the updated tags
Yes, I experience problems too if I do it too many times.
To cope with it, I restart the Tomcat server every day during the lowest-traffic
period. In my case, I only reload the index a few times a day, and it
works out fine.
On Mon, 2009-06-01 at 09:07 +0400, Alexander Aristov wrote:
I experienced problems