Hi Jeff,

Can you clarify what you are seeing? Is this a parsing problem, or a URL filter 
problem?

Cheers,
Chris


On 7/18/10 9:46 AM, "Jeff Zhou" <[email protected]> wrote:

Thanks Faruk.

So I wonder why the 1.1 release uses Tika, which does not seem to be stable at
the moment. Any ideas?


On Sun, Jul 18, 2010 at 7:21 AM, Faruk Berksöz <[email protected]> wrote:

>
> There is an open issue
> (NUTCH-817 <https://issues.apache.org/jira/browse/NUTCH-817>)
> that may be related to your problem.
>
> 2010/7/16 jeff-4 [via Lucene] <[email protected]>
>
> > I did check. Nutch 1.0 crawled over 300 links, while Nutch 1.1 crawled only 2.
> >
> > On Fri, 2010-07-16 at 14:21 +0800, xiao yang wrote:
> >
> > > You can use "bin/nutch readdb crawl/crawldb -stats" to check the
> > > number of pages they crawled.
> > >
> > > On Fri, Jul 16, 2010 at 2:07 PM, jeff <[hidden email]> wrote:
> > > > Hi,
> > > >
> > > > I am testing Nutch 1.1 with exactly the same configuration that I
> > > > tested on Nutch 1.0. It took 1.0 a few hours to crawl the bestbuy
> > > > site, while 1.1 takes only 2-3 minutes. Has anyone had a similar
> > > > experience, and does anyone know why?
> > > >
> > > > Thanks.
> > > >
> > > >
> >
> >
> >
> >
> > ------------------------------
> >  View message @
> >
> http://lucene.472066.n3.nabble.com/Nutch-1-1-crawls-fewer-links-than-1-0-tp971589p971632.html
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-1-1-crawls-fewer-links-than-1-0-tp971589p976259.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
