Hi Jeff,

Can you clarify what you are seeing? Is this a parsing problem, or a URL filter problem?

Cheers,
Chris

On 7/18/10 9:46 AM, "Jeff Zhou" <[email protected]> wrote:

Thanks Faruk. So I wonder why the 1.1 release uses Tika, which does not seem to be stable at the moment. Any ideas?

On Sun, Jul 18, 2010 at 7:21 AM, Faruk Berksöz <[email protected]> wrote:

> There is an open issue
> (NUTCH-817 <https://issues.apache.org/jira/browse/NUTCH-817>)
> that may be related to your problem.
>
> 2010/7/16 jeff-4 [via Lucene] <[email protected]>
>
> > I did check. Nutch 1.0 crawled over 300 links while Nutch 1.1 crawled only 2.
> >
> > On Fri, 2010-07-16 at 14:21 +0800, xiao yang wrote:
> >
> > > You can use "bin/nutch readdb crawl/crawldb -stats" to check the
> > > number of pages they crawled.
> > >
> > > On Fri, Jul 16, 2010 at 2:07 PM, jeff <[hidden email]> wrote:
> > > > Hi,
> > > >
> > > > I am testing Nutch 1.1 with exactly the same configuration as the
> > > > one I tested on Nutch 1.0. It took 1.0 a few hours to crawl the
> > > > bestbuy site, while it takes only 2-3 minutes for 1.1. Has anyone
> > > > had a similar experience, and does anyone know why?
> > > >
> > > > Thanks.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

