Re: Reviving Nutch 0.7

2007-01-22 Thread AJ Chen
as possible so that I can easily upgrade the application to keep up with new nutch release. Keeping away from the newest nutch version is somewhat backward to me. AJ -- AJ Chen, PhD Palo Alto, CA http://web2express.org

Re: [jira] Commented: (NUTCH-395) Increase fetching speed

2006-11-22 Thread AJ Chen

Re: [jira] Resolved: (NUTCH-395) Increase fetching speed

2006-11-13 Thread AJ Chen

how to minimize reduce operations when using single machine

2006-10-27 Thread AJ Chen
I use 0.9-dev code and the local file system to crawl on a single machine. After fetching pages, nutch spends a huge amount of time doing reduce sort and reduce reduce reduce. This is not necessary since it uses only the local file system. I'm not familiar with map-reduce code, but guess it may be

outlink extractor finds lots of junk

2006-10-23 Thread AJ Chen
-devcode. Thanks, -- AJ Chen, PhD http://web2express.org

log error in deploying nutch-0.9-dev.jar

2006-09-07 Thread AJ Chen
I'm customizing 0.9-dev code for my vertical search engine. After rebuilding nutch-0.9-dev.jar and putting it into ROOT\WEB-INF\lib, there is an error when starting Tomcat: log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: \ (The system cannot find the path specified)

Re: log error in deploying nutch-0.9-dev.jar

2006-09-07 Thread AJ Chen
This is solved. I accidentally put log4j.properties into ROOT\WEB-INF\classes. -aj On 9/7/06, AJ Chen [EMAIL PROTECTED] wrote: I'm customizing 0.9-dev code for my vertical search engine. After rebuilding nutch-0.9-dev.jar and putting it into ROOT\WEB-INF\lib, there is an error when starting

Re: [Nutch-dev] Crawl error

2006-07-10 Thread AJ Chen
Groschupf [EMAIL PROTECTED] wrote: Try to put the conf folder on your classpath in Eclipse and set the environment variables that are set in bin/nutch. Btw, please do not crosspost. Thanks. Stefan On 09.07.2006 at 21:47, AJ Chen wrote: I checked out the 0.8 code from trunk and tried to set

Crawl error

2006-07-09 Thread AJ Chen
I checked out the 0.8 code from trunk and tried to set it up in eclipse. When trying to run Crawl from Eclipse using args urls -dir crawl -depth 3 -topN 50, I got the following error, which started from LogFactory.getLog( Crawl.class). Any idea what file was not found? There is a url file under

does nutch follow HEAD link element?

2006-06-16 Thread AJ Chen
I'm about to use nutch to crawl semantic data. Links to semantic data files (RDF, OWL, etc.) can be placed in two places: (1) HEAD link; (2) BODY a href. Does the nutch crawler follow the HEAD link? I'm also creating a semantic data publishing tool, and I would appreciate any suggestion regarding

Re: does nutch follow HEAD link element?

2006-06-16 Thread AJ Chen
will be searched from the same nutch search interface. Thanks, AJ On 6/16/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: AJ Chen wrote: I'm about to use nutch to crawl semantic data. Links to semantic data files (RDF, OWL, etc.) can be placed in two places: (1) HEAD link; (2) BODY a href Does

Re: problems http-client

2006-01-06 Thread AJ Chen
I have started to see this problem recently. topN=20 per crawl, but fetched pages = 15 - 17, while error pages = 2000 - 5000. 25000 pages are missing. this is reproducible with nutch0.7.1, both protocol-http and protocol-httpclient are included. I also see lots of Response content

Re: severe error in fetch

2005-12-30 Thread AJ Chen
connection pool problem in httpclient? If yes, I can filter out url containing these trouble ports before httpclient is fixed. Thanks, AJ On 12/26/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: AJ Chen wrote: Stefan, Here is the trace in my log. My SSFetcher (for site-specific fetch) is the same

severe error in fetch

2005-12-25 Thread AJ Chen
I have seen repeatedly the following severe errors during fetching 400,000 pages with 200 threads. What may cause Host connection pool not found? This type of error must be avoided, otherwise the fetcher will stop prematurely. 051224 075950 SEVERE Host connection pool not found,

java open source software for Tagging ?

2005-11-07 Thread AJ Chen
Although tagging is not directly related to nutch, I think combining nutch search and the ability to tag search result pages will be quite powerful. Has anyone implemented tagging on a nutch search site? Is there a java open source package for tagging? AJ

debug JSP with eclipse

2005-10-30 Thread AJ Chen
I'm using Eclipse for nutch java code and trying to set up Eclipse for debugging JSP pages. I have got the WST plugin installed, created a new dynamic web project called nutch071web, and imported all the webcontent and jars. But it failed to run the index.jsp page; see the error message below. Is anyone

Re: merge indices from multiple webdb

2005-10-25 Thread AJ Chen
- From: AJ Chen [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 25, 2005 4:03 PM To: nutch-dev@lucene.apache.org Subject: merge indices from multiple webdb Has anyone merged indices from two separate webdb? I have two separate webdb and need to find a good way to combine them

Re: merge indices from multiple webdb

2005-10-25 Thread AJ Chen
and then build one more segment again. Thank you, Andrey -Original Message- From: AJ Chen [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 25, 2005 2:02 PM To: nutch-dev@lucene.apache.org Subject: Re: merge indices from multiple webdb Thanks so much, Graham. This should do it. A related

Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-11 Thread AJ Chen
Fuad, Several days for 120,000 pages? That's very slow. Could you show some status lines in the log file? (grep status:) What's the bandwidth you have? -AJ On 10/11/05, Fuad Efendi (JIRA) [EMAIL PROTECTED] wrote: [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ] Fuad Efendi updated

fetch speed issue

2005-10-10 Thread AJ Chen
Another observation: when the same size fetch list and same number of threads were used, the fetcher started at different speeds in different runs, ranging from 200kb/s to 1200kb/s. I'm using DSL at home, so this variation in download speed could be due to the variation in DSL connection. If using

Re: what contibute to fetch slowing down

2005-10-02 Thread AJ Chen
several days at current speed - just too slow. I'm planning to get more bandwidth. Could someone share their experience on what stable rate (pages/sec) can be achieved using 3 mbps or 10 mbps inbound connection? Thanks, AJ On 9/28/05, AJ Chen [EMAIL PROTECTED] wrote: I started the crawler

what contibute to fetch slowing down

2005-09-28 Thread AJ Chen
I started the crawler with about 2000 sites. The fetcher could achieve 7 pages/sec initially, but the performance gradually dropped to about 2 pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages and I used 500 threads. What are the main causes of this slowing down? Below
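
The rates quoted in this message translate directly into total fetch time. A rough back-of-envelope sketch, assuming a constant rate for the whole run (which the message itself says does not hold in practice); the class and method names are hypothetical and not part of Nutch:

```java
public class FetchEstimate {

    /** Hours needed to fetch the given number of pages at a steady rate. */
    public static double hoursToFetch(long pages, double pagesPerSec) {
        if (pagesPerSec <= 0) {
            throw new IllegalArgumentException("pagesPerSec must be positive");
        }
        return pages / pagesPerSec / 3600.0;
    }

    public static void main(String[] args) {
        long pages = 300_000;             // fetch list size from the message
        double[] rates = {7.0, 2.0, 0.5}; // pages/sec figures from the message
        for (double rate : rates) {
            System.out.println(rate + " pages/sec -> "
                    + hoursToFetch(pages, rate) + " hours");
        }
    }
}
```

At 7 pages/sec the 300k-page list would take roughly 12 hours; at 2 pages/sec about 42 hours; at 0.5 pages/sec about 167 hours, close to a week.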

Re: saving log file

2005-09-21 Thread AJ Chen
Jerome, thanks a lot. This is helpful. -AJ Jérôme Charron wrote: Following the tutorial, I redirect the log messages to a log file. But, when crawling 1 million pages, this log file can become huge and writing log messages to a huge file can slow down the fetching process. Is there a better

saving log file

2005-09-20 Thread AJ Chen
Following the tutorial, I redirect the log messages to a log file. But, when crawling 1 million pages, this log file can become huge and writing log messages to a huge file can slow down the fetching process. Is there a better way to manage the log? maybe saving it to a series of smaller
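
One generic way to get "a series of smaller" log files is a size-limited, rotating file handler. The sketch below uses the JDK's standard java.util.logging.FileHandler; it is only an illustration of the idea, not the answer given in the thread, and whether Nutch's own logging can be wired up this way is an assumption. The logger name, file pattern, and size limits are hypothetical.

```java
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

public class RotatingLogSketch {

    /**
     * Returns a logger that writes to a rotating set of files.
     * The "%g" in the pattern is replaced by the generation number (0, 1, 2, ...).
     */
    public static Logger rotatingLogger(String pattern, int maxBytes, int fileCount)
            throws IOException {
        // limit = approximate max bytes per file, count = number of files kept
        FileHandler handler = new FileHandler(pattern, maxBytes, fileCount, true);
        handler.setFormatter(new SimpleFormatter());

        Logger logger = Logger.getLogger("crawl.sketch"); // hypothetical logger name
        logger.setUseParentHandlers(false);               // keep output off the console
        logger.addHandler(handler);
        return logger;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative values: 20 files of about 10 MB each, oldest overwritten first.
        Logger log = rotatingLogger("crawl-%g.log", 10 * 1024 * 1024, 20);
        log.info("fetch started");
    }
}
```

With this setup the newest messages are always in crawl-0.log, and total disk use is bounded by roughly maxBytes times fileCount.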

how to deal with large/slow sites

2005-09-11 Thread AJ Chen
In vertical crawling, there are always some large sites that have tens of thousands of pages. Fetching a page from these large sites very often returns retry later because http.max.delays is exceeded. Setting appropriate values for http.max.delays and fetcher.server.delay can minimize this
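
A small arithmetic sketch of how the two settings interact. It assumes that a thread waits fetcher.server.delay seconds each time it finds a host busy and gives up on the page after http.max.delays such waits; that reading is an assumption here, and the values used are illustrative, not recommended settings. The class and method names are hypothetical.

```java
public class DelayBudget {

    /**
     * Longest time (seconds) one thread would wait on a busy host before the
     * page is put off for a later retry, under the assumption stated above.
     */
    public static double maxWaitSeconds(int maxDelays, double serverDelaySeconds) {
        if (maxDelays < 0 || serverDelaySeconds < 0) {
            throw new IllegalArgumentException("values must be non-negative");
        }
        return maxDelays * serverDelaySeconds;
    }

    public static void main(String[] args) {
        // Illustrative values only: 3 waits of 5 seconds vs. 100 waits of 5 seconds.
        System.out.println("3 x 5.0s   -> " + maxWaitSeconds(3, 5.0) + " s");
        System.out.println("100 x 5.0s -> " + maxWaitSeconds(100, 5.0) + " s");
    }
}
```

Under that assumption, raising http.max.delays lets threads queue longer behind a large site instead of deferring its pages, at the cost of those threads sitting idle while they wait.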

Re: db.max.outlinks.per.page is misunderstood?

2005-09-07 Thread AJ Chen
Jack, Set the max to 100, but run 10 cycles (i.e., depth=10) with the CrawlTool. You may see all the outlinks are collected toward the end. 3 cycles is usually not enough. -AJ Jack Tang wrote: Yes, Stefan. But it missed some URLs, and I set the value to 3000, then everything is OK /Jack

Re: Automating workflow using ndfs

2005-09-02 Thread AJ Chen
for a public beta. I'll be sure to post here when we're finally open for business. :) --Matt On Sep 2, 2005, at 11:43 AM, AJ Chen wrote: From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler, it seems that a new urlfilter is a good place to extend the inclusion regex capability

junit test failed

2005-08-28 Thread AJ Chen
FAILED nutch\trunk\build.xml:173: Could not create task or type of type: junit. Did I miss anything for junit? Appreciate your help. AJ Chen

Re: junit test failed

2005-08-28 Thread AJ Chen
codes. Apparently, the command ant test does not work. Does anybody have an idea how to make the unit test work? AJ Michael Ji wrote: What does junit test stand for? A particular patch? Sorry if my question is silly. Michael Ji, --- AJ Chen [EMAIL PROTECTED] wrote: I'm a newcomer, trying

Re: junit test failed

2005-08-28 Thread AJ Chen
Regards, Fuad Efendi -Original Message- From: AJ Chen [mailto:[EMAIL PROTECTED] Sent: Sunday, August 28, 2005 9:01 PM To: nutch-dev Subject: junit test failed I'm a newcomer, trying to test Nutch for vertical search. I downloaded the code and compiled it in cygwin. But, the unit