as possible so that I can easily upgrade the application to keep up with new Nutch releases. Staying away from the newest Nutch version seems backward to me.
AJ
--
AJ Chen, PhD
Palo Alto, CA
http://web2express.org
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
I use the 0.9-dev code and the local file system to crawl on a single machine. After fetching pages, Nutch spends a huge amount of time on the reduce sort and then on reduce after reduce. This seems unnecessary since only the local file system is used. I'm not familiar with the map-reduce code, but I guess it may be
Thanks,
--
AJ Chen, PhD
http://web2express.org
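As context for the message above: in the Hadoop-based 0.8/0.9 code, the same map-reduce jobs run whether or not a cluster is involved; which mode is used is a matter of configuration. A minimal sketch of the single-machine settings of that era, in conf/hadoop-site.xml (shown for orientation only, not a confirmed fix for the slow reduce):

    <configuration>
      <!-- Run map-reduce jobs in-process instead of against a job tracker. -->
      <property>
        <name>mapred.job.tracker</name>
        <value>local</value>
      </property>
      <!-- Use the local file system rather than NDFS/HDFS. -->
      <property>
        <name>fs.default.name</name>
        <value>local</value>
      </property>
    </configuration>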
I'm customizing the 0.9-dev code for my vertical search engine. After rebuilding nutch-0.9-dev.jar and putting it into ROOT\WEB-INF\lib, there is an error when starting Tomcat:
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified)
This is solved. I had accidentally put log4j.properties into ROOT\WEB-INF\classes.
-aj
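A plausible explanation for the error above (an assumption, not stated in the message): the stock conf/log4j.properties points its file appender at ${hadoop.log.dir}, a system property that bin/nutch sets but Tomcat does not, so log4j ends up calling setFile(null, ...). A minimal webapp-safe log4j.properties sketch that avoids such placeholders:

    # Sketch: log to the console only, with no system-property
    # placeholders, so the webapp starts cleanly under Tomcat.
    log4j.rootLogger=INFO,stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n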
Stefan Groschupf [EMAIL PROTECTED] wrote:
Try putting the conf folder on your classpath in Eclipse and setting the
environment variables that are set in bin/nutch.
Btw, please do not crosspost.
Thanks.
Stefan
On 09.07.2006 at 21:47, AJ Chen wrote:
I checked out the 0.8 code from trunk and tried to set it up in Eclipse.
When trying to run Crawl from Eclipse with the args urls -dir crawl -depth 3
-topN 50, I got the following error, which started from
LogFactory.getLog(Crawl.class). Any idea what file was not found? There is a url file under
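An illustrative sketch (not from the thread): with conf/ on the classpath as Stefan suggests, the 0.8 crawl entry point can be driven from a plain Java launcher in Eclipse. The launcher class below is hypothetical; org.apache.nutch.crawl.Crawl is the real entry point.

    import org.apache.nutch.crawl.Crawl;

    // Hypothetical Eclipse launcher; assumes conf/ (nutch-default.xml,
    // nutch-site.xml) is on the classpath and a seed list exists under urls/.
    public class EclipseCrawlLauncher {
      public static void main(String[] args) throws Exception {
        Crawl.main(new String[] {"urls", "-dir", "crawl", "-depth", "3", "-topN", "50"});
      }
    }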
I'm about to use Nutch to crawl semantic data. Links to semantic data files
(RDF, OWL, etc.) can be placed in two places: (1) a link element in the HEAD; (2) an
a href in the BODY. Does the Nutch crawler follow the HEAD link?
I'm also creating a semantic data publishing tool, and I would appreciate any
suggestions regarding how the semantic data
will be searched from the same Nutch search interface.
Thanks,
AJ
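A hedged sketch of one way to collect HEAD links regardless of what the stock HTML parser does: walk the parsed DOM and gather the href of every link element. The class below is hypothetical; in 0.8 this logic would live in an HtmlParseFilter plugin (plugin wiring omitted).

    import org.w3c.dom.NamedNodeMap;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // Hypothetical helper: recursively collect href values from <link>
    // elements so RDF/OWL references in the HEAD can become outlinks.
    public class HeadLinkExtractor {
      public static void collectHeadLinks(Node node, java.util.List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
            && "link".equalsIgnoreCase(node.getNodeName())) {
          NamedNodeMap attrs = node.getAttributes();
          Node href = attrs.getNamedItem("href");
          if (href != null) out.add(href.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          collectHeadLinks(children.item(i), out);
        }
      }
    }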
I have started to see this problem recently. topN = 20k per crawl, but
fetched pages = 15-17k, while error pages = 2000-5000. 25,000
pages are missing so far. This is reproducible with nutch-0.7.1, with both protocol-http
and protocol-httpclient included.
I also see a lot of "Response content" messages. Is this a connection pool
problem in httpclient? If yes, I can filter out URLs containing these trouble
ports before httpclient is fixed.
Thanks,
AJ
On 12/26/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
AJ Chen wrote:
Stefan,
Here is the trace in my log. My SSFetcher (for site-specific fetch) is the same
I have repeatedly seen the following severe errors while fetching
400,000 pages with 200 threads. What may cause "Host connection pool
not found"? This type of error must be avoided; otherwise the fetcher
will stop prematurely.
051224 075950 SEVERE Host connection pool not found,
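"Host connection pool not found" comes from commons-httpclient's MultiThreadedHttpConnectionManager. One common mitigation (an assumption on my part, not a confirmed fix for this report) is to size the pool to match the fetcher thread count; in the httpclient 3.x API that looks roughly like:

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

    // Sketch: size the connection pool to the fetcher thread count so
    // threads do not contend for the per-host pools.
    public class FetcherHttpClientSetup {
      public static HttpClient newClient(int fetcherThreads) {
        MultiThreadedHttpConnectionManager manager =
            new MultiThreadedHttpConnectionManager();
        manager.getParams().setDefaultMaxConnectionsPerHost(fetcherThreads);
        manager.getParams().setMaxTotalConnections(fetcherThreads);
        return new HttpClient(manager);
      }
    }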
Although tagging is not directly related to Nutch, I think combining Nutch
search with the ability to tag search result pages would be quite powerful.
Has anyone implemented tagging on a Nutch search site? Is there a Java open-source
package for the tagging function?
AJ
I'm using Eclipse for the Nutch Java code and trying to set up Eclipse for
debugging JSP pages. I have the WST plugin installed, created a new dynamic
web project called nutch071web, and imported all the web content and jars.
But it failed to run the index.jsp page; see the error message below. Is anyone
-
From: AJ Chen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 25, 2005 4:03 PM
To: nutch-dev@lucene.apache.org
Subject: merge indices from multiple webdb
Has anyone merged indices from two separate webdbs? I have two
separate webdbs and need to find a good way to combine them
and then build one more segment again.
Thank you,
Andrey
-Original Message-
From: AJ Chen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 25, 2005 2:02 PM
To: nutch-dev@lucene.apache.org
Subject: Re: merge indices from multiple webdb
Thanks so much, Graham. This should do it.
A related
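Graham's actual suggestion is not preserved in this digest. As a hedged illustration only, merging two Nutch indexes at the Lucene level (Lucene 1.4-era API; the paths are hypothetical) could look like:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Sketch: create a new index and merge the two source indexes into it.
    // addIndexes() also optimizes the result.
    public class MergeIndexes {
      public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("merged-index", new StandardAnalyzer(), true);
        Directory[] sources = {
            FSDirectory.getDirectory("crawl-a/index", false),
            FSDirectory.getDirectory("crawl-b/index", false)
        };
        writer.addIndexes(sources);
        writer.close();
      }
    }

Note that this only merges the Lucene indexes; combining the webdbs themselves would be a separate step.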
Fuad,
Several days for 120,000 pages? That's very slow. Could you show some status
lines from the log file (grep status:)? What bandwidth do you have?
-AJ
On 10/11/05, Fuad Efendi (JIRA) [EMAIL PROTECTED] wrote:
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
Fuad Efendi updated
Another observation: when the same size fetch list and the same number of
threads were used, the fetcher started at different speeds in different runs,
ranging from 200 KB/s to 1200 KB/s. I'm using DSL at home, so this variation
in download speed could be due to variation in the DSL connection. Several days at the current
speed is just too slow, so I'm planning to get more bandwidth. Could someone
share their experience on what stable rate (pages/sec) can be achieved with
a 3 Mbps or 10 Mbps inbound connection?
Thanks,
AJ
On 9/28/05, AJ Chen [EMAIL PROTECTED] wrote:
I started the crawler with about 2000 sites. The fetcher could achieve
7 pages/sec initially, but the performance gradually dropped to about 2
pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages
and I used 500 threads. What are the main causes of this slowdown?
Below
Jerome, thanks a lot. This is helpful.
-AJ
Jérôme Charron wrote:
Following the tutorial, I redirect the log messages to a log file. But
when crawling 1 million pages, this log file can become huge, and writing
log messages to a huge file can slow down the fetching process. Is
there a better way to manage the log? Maybe saving it to a series of
smaller
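One answer of the era (a sketch, assuming the build logs through log4j; Jérôme's actual reply is not preserved here) is a RollingFileAppender, which caps the log at a maximum size and keeps a series of smaller backup files. The file name and limits below are illustrative:

    # Sketch: rotate the fetch log at 10 MB, keeping 20 backups.
    log4j.rootLogger=INFO,rolling
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.File=logs/fetch.log
    log4j.appender.rolling.MaxFileSize=10MB
    log4j.appender.rolling.MaxBackupIndex=20
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d %p %c - %m%n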
In vertical crawling, there are always some large sites that have tens
of thousands of pages. Fetching a page from these large sites very often
returns "retry later" because http.max.delays is exceeded. Setting
appropriate values for http.max.delays and fetcher.server.delay can
minimize this
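For reference, both are ordinary properties overridden in conf/nutch-site.xml; a sketch with illustrative values:

    <!-- Sketch: override in conf/nutch-site.xml; values are illustrative. -->
    <property>
      <name>http.max.delays</name>
      <!-- How many times the fetcher will defer a busy host before giving up. -->
      <value>100</value>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <!-- Seconds between successive requests to the same server. -->
      <value>2.0</value>
    </property>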
Jack,
Set the max to 100, but run 10 cycles (i.e., depth=10) with the
CrawlTool. You may see that all the outlinks are collected toward the end; 3
cycles is usually not enough.
-AJ
Jack Tang wrote:
Yes, Stefan.
But it missed some URLs, so I set the value to 3000, and then everything was OK.
/Jack
for a public beta. I'll be sure to post here when we're
finally open for business. :)
--Matt
On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:
From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler,
it seems that a new urlfilter is a good place to extend the
inclusion regex capability
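A hedged sketch of such a urlfilter (the class and pattern are hypothetical; the URLFilter extension point is real, and the plugin wiring is omitted). The contract: return the URL to keep it, or null to drop it.

    import java.util.regex.Pattern;
    import org.apache.nutch.net.URLFilter;

    // Hypothetical inclusion filter: keep only URLs under an allowed domain.
    public class InclusionUrlFilter implements URLFilter {
      private static final Pattern INCLUDE =
          Pattern.compile("^https?://([a-z0-9.-]+\\.)?example\\.com/");

      public String filter(String urlString) {
        return INCLUDE.matcher(urlString).find() ? urlString : null;
      }
    }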
BUILD FAILED
nutch\trunk\build.xml:173: Could not create task or type of type: junit.
Did I miss anything for junit? Appreciate your help.
AJ Chen
Apparently, the command ant test does not work. Does anybody have an idea
how to make the unit tests work?
AJ
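An aside on the error above (general Ant behavior, not from this thread): "Could not create task or type of type: junit" usually means Ant cannot see junit.jar, since the optional junit task only loads when JUnit is on Ant's classpath; dropping junit.jar into ANT_HOME/lib is the usual fix.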
Michael Ji wrote:
What is the junit test for? A particular patch?
Sorry if my question is silly.
Michael Ji,
--- AJ Chen [EMAIL PROTECTED] wrote:
I'm a newcomer, trying
Regards,
Fuad Efendi
-Original Message-
From: AJ Chen [mailto:[EMAIL PROTECTED]
Sent: Sunday, August 28, 2005 9:01 PM
To: nutch-dev
Subject: junit test failed
I'm a newcomer, trying to test Nutch for vertical search. I downloaded
the code and compiled it in cygwin. But the unit