Very interesting stats, Jeff. Let me know what your further tests reveal, and thanks!
Cheers,
Chris

On 7/18/10 2:48 PM, "jeff" <[email protected]> wrote:

Chris,

I can't tell yet, as I am still investigating what happened. In the meantime, however, I did find some differences between the Tika HTML parser and the Neko HTML parser. For my test, I uncommented the following mime types in parse-plugins.xml:

<mimeType name="text/html">
    <plugin id="parse-html" />
</mimeType>

<mimeType name="text/sgml">
    <plugin id="parse-html" />
</mimeType>

Guess what: when I crawled apache.org, the originally shipped Nutch 1.1 returned the following info:

CrawlDb statistics start: crawl_tmp/crawldb
Statistics for CrawlDb: crawl_tmp/crawldb
TOTAL urls: 4855
retry 0: 4855
min score: 0.0
avg score: 4.988671E-4
max score: 1.032
status 1 (db_unfetched): 4458
status 2 (db_fetched): 365
status 3 (db_gone): 6
status 4 (db_redir_temp): 16
status 5 (db_redir_perm): 10
CrawlDb statistics: done

whereas the uncommented version returned:

CrawlDb statistics start: crawl_data/crawldb
Statistics for CrawlDb: crawl_data/crawldb
TOTAL urls: 3771
retry 0: 3770
retry 1: 1
min score: 0.0
avg score: 0.0032256695
max score: 1.173
status 1 (db_unfetched): 3375
status 2 (db_fetched): 358
status 3 (db_gone): 29
status 4 (db_redir_temp): 1
status 5 (db_redir_perm): 8
CrawlDb statistics: done

So choosing Tika seems to work with more URLs than Neko. I just don't know why the results are so different between Nutch 1.0 and Nutch 1.1. I will get back to you once I have finished the investigation.

Thanks,
Jeff

On Sun, 2010-07-18 at 10:46 -0700, Mattmann, Chris A (388J) wrote:
> Hi Jeff,
>
> Can you clarify what you are seeing? Is this a parsing problem, or a URL
> filter problem?
>
> Cheers,
> Chris
>
>
> On 7/18/10 9:46 AM, "Jeff Zhou" <[email protected]> wrote:
>
> Thanks Faruk.
>
> So I wonder why the 1.1 release uses Tika, which seems not to be stable at the
> moment. Any ideas?
>
> On Sun, Jul 18, 2010 at 7:21 AM, Faruk Berksöz <[email protected]> wrote:
>
> > There is an open issue
> > (NUTCH-817 <https://issues.apache.org/jira/browse/NUTCH-817>)
> > that may be related to your problem.
> >
> > 2010/7/16 jeff-4 [via Lucene] <[email protected]>
> >
> > > I did check. Nutch 1.0 crawled over 300 links while Nutch 1.1 only 2.
> > >
> > > On Fri, 2010-07-16 at 14:21 +0800, xiao yang wrote:
> > >
> > > > You can use "bin/nutch readdb crawl/crawldb -stats" to check the
> > > > number of pages they crawled.
> > > >
> > > > On Fri, Jul 16, 2010 at 2:07 PM, jeff <[hidden email]> wrote:
> > > > > Hi,
> > > > >
> > > > > I am testing Nutch 1.1 with exactly the same configuration as the
> > > > > one I tested on Nutch 1.0. It took 1.0 a few hours to crawl the
> > > > > bestbuy site, while it only takes 2-3 minutes for 1.1. Does anyone
> > > > > have a similar experience and know why?
> > > > >
> > > > > Thanks.
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Nutch-1-1-crawls-fewer-links-than-1-0-tp971589p976259.html
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
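
For anyone who wants to compare two crawls the way Jeff did, the two "readdb -stats" dumps can be diffed with a short script instead of by eye. What follows is only a minimal Python sketch and not part of Nutch: the helper names parse_crawldb_stats and diff_stats are hypothetical, and it assumes the plain "label: value" layout shown in Jeff's message (other Nutch versions or log settings may print the lines differently).

```python
import re

# One "label: number" pair per line; non-numeric values (paths, "done") are skipped.
STAT_LINE = re.compile(r"^\s*(.+?):\s+([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$")


def parse_crawldb_stats(text):
    """Return {label: float} for every numeric line of a readdb -stats dump."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            label, value = match.groups()
            stats[label] = float(value)
    return stats


def diff_stats(before, after):
    """Return {label: after - before}; a label missing on one side counts as 0."""
    labels = sorted(set(before) | set(after))
    return {k: after.get(k, 0.0) - before.get(k, 0.0) for k in labels}


# The two dumps quoted in the thread (apache.org crawl).
SHIPPED_STATS = """\
CrawlDb statistics start: crawl_tmp/crawldb
Statistics for CrawlDb: crawl_tmp/crawldb
TOTAL urls: 4855
retry 0: 4855
min score: 0.0
avg score: 4.988671E-4
max score: 1.032
status 1 (db_unfetched): 4458
status 2 (db_fetched): 365
status 3 (db_gone): 6
status 4 (db_redir_temp): 16
status 5 (db_redir_perm): 10
CrawlDb statistics: done
"""

MODIFIED_STATS = """\
CrawlDb statistics start: crawl_data/crawldb
Statistics for CrawlDb: crawl_data/crawldb
TOTAL urls: 3771
retry 0: 3770
retry 1: 1
min score: 0.0
avg score: 0.0032256695
max score: 1.173
status 1 (db_unfetched): 3375
status 2 (db_fetched): 358
status 3 (db_gone): 29
status 4 (db_redir_temp): 1
status 5 (db_redir_perm): 8
CrawlDb statistics: done
"""

if __name__ == "__main__":
    shipped = parse_crawldb_stats(SHIPPED_STATS)
    modified = parse_crawldb_stats(MODIFIED_STATS)
    # Print the change per counter, modified config relative to shipped config.
    for label, delta in diff_stats(shipped, modified).items():
        print(f"{label}: {delta:+g}")
```

The same two functions should work on fresh output saved from "bin/nutch readdb crawl/crawldb -stats", provided the lines keep the layout assumed above.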

