Chris,

I can't tell yet, as I am still investigating what happened. In the
meantime, however, I did find some differences between the Tika HTML
parser and the Neko HTML parser.

My test was to uncomment the following MIME types in
parse-plugins.xml:

<mimeType name="text/html">
        <plugin id="parse-html" />
</mimeType>

<mimeType name="text/sgml">
        <plugin id="parse-html" />
</mimeType>
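
To double-check which plugin each MIME type ends up mapped to after
uncommenting, something like the following could be used. It is only a
minimal sketch: the function name plugins_by_mime is made up for
illustration, and it assumes the file contains plain mimeType/plugin
elements like the ones above, with no XML namespaces.

```python
import xml.etree.ElementTree as ET


def plugins_by_mime(xml_text):
    """Map each mimeType name to the plugin ids listed under it.

    Hypothetical helper. Entries that are still commented out are
    skipped, because the default parser drops XML comments.
    """
    root = ET.fromstring(xml_text)
    mapping = {}
    for mime in root.iter("mimeType"):
        mapping[mime.get("name")] = [p.get("id") for p in mime.findall("plugin")]
    return mapping


if __name__ == "__main__":
    # Assumed sample: the two entries above wrapped in a root element.
    sample = """
    <parse-plugins>
      <mimeType name="text/html"><plugin id="parse-html" /></mimeType>
      <mimeType name="text/sgml"><plugin id="parse-html" /></mimeType>
    </parse-plugins>
    """
    for name, ids in plugins_by_mime(sample).items():
        print(name, ids)
```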

Guess what, when I crawled apache.org, the originally shipped nutch 1.1
returned the following info:

CrawlDb statistics start: crawl_tmp/crawldb
Statistics for CrawlDb: crawl_tmp/crawldb
TOTAL urls:     4855
retry 0:        4855
min score:      0.0
avg score:      4.988671E-4
max score:      1.032
status 1 (db_unfetched):        4458
status 2 (db_fetched):  365
status 3 (db_gone):     6
status 4 (db_redir_temp):       16
status 5 (db_redir_perm):       10
CrawlDb statistics: done

, whereas the uncommented version returned:

CrawlDb statistics start: crawl_data/crawldb
Statistics for CrawlDb: crawl_data/crawldb
TOTAL urls:     3771
retry 0:        3770
retry 1:        1
min score:      0.0
avg score:      0.0032256695
max score:      1.173
status 1 (db_unfetched):        3375
status 2 (db_fetched):  358
status 3 (db_gone):     29
status 4 (db_redir_temp):       1
status 5 (db_redir_perm):       8
CrawlDb statistics: done


So Tika seems to pick up more URLs than Neko (4855 total vs. 3771). I
just don't know why the results were so different between nutch 1.0 and
nutch 1.1. I will get back to you once I have finished the investigation.
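
In case it helps anyone repeat the comparison, the two stats dumps above
can be put side by side with a small script. This is only a sketch: the
names parse_stats and diff_stats are made up for illustration, and it
assumes the "readdb -stats" output is plain "key: number" lines exactly
like the text pasted above.

```python
def parse_stats(text):
    """Turn 'key: number' lines from a stats dump into a dict of floats.

    Hypothetical helper. Lines whose value is not a number (for example
    the 'CrawlDb statistics: done' line) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.rpartition(":")
        if not sep:
            continue  # no colon on this line
        try:
            stats[key.strip()] = float(value)
        except ValueError:
            continue  # value is a path or a word, not a number
    return stats


def diff_stats(before, after):
    """Return {key: (before, after, after - before)} for every key seen.

    A key missing from one side is treated as 0.
    """
    rows = {}
    for key in sorted(set(before) | set(after)):
        a = before.get(key, 0.0)
        b = after.get(key, 0.0)
        rows[key] = (a, b, b - a)
    return rows


if __name__ == "__main__":
    # Abbreviated copies of the two dumps above.
    shipped = "TOTAL urls:     4855\nstatus 2 (db_fetched):  365\nstatus 3 (db_gone):     6\n"
    uncommented = "TOTAL urls:     3771\nstatus 2 (db_fetched):  358\nstatus 3 (db_gone):     29\n"
    for key, (a, b, delta) in diff_stats(parse_stats(shipped), parse_stats(uncommented)).items():
        print(f"{key:28s} {a:10.1f} {b:10.1f} {delta:+10.1f}")
```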

Thanks,
Jeff

On Sun, 2010-07-18 at 10:46 -0700, Mattmann, Chris A (388J) wrote:
> Hi Jeff,
> 
> Can you clarify what you are seeing? Is this a parsing problem, or a URL 
> filter problem?
> 
> Cheers,
> Chris
> 
> 
> On 7/18/10 9:46 AM, "Jeff Zhou" <[email protected]> wrote:
> 
> Thanks Faruk.
> 
> So I wonder why the 1.1 release uses Tika, which does not seem to be stable
> at the moment. Any ideas?
> 
> 
> On Sun, Jul 18, 2010 at 7:21 AM, Faruk Berksöz <[email protected]> wrote:
> 
> >
> > There is an open issue
> > (NUTCH-817<https://issues.apache.org/jira/browse/NUTCH-817>)
> > that may be related to your problem!
> >
> > 2010/7/16 jeff-4 [via Lucene] <[email protected]>
> > >
> >
> > > I did check. Nutch 1.0 crawled over 300 links while Nutch 1.1 only 2.
> > >
> > > On Fri, 2010-07-16 at 14:21 +0800, xiao yang wrote:
> > >
> > > > You can use "bin/nutch readdb crawl/crawldb -stats" to check the
> > > > number of pages they crawled.
> > > >
> > > > > On Fri, Jul 16, 2010 at 2:07 PM, jeff <[hidden email]> wrote:
> > > > > Hi,
> > > > >
> > > > > > I am testing nutch 1.1 with exactly the same configuration that I
> > > > > > tested on nutch 1.0. It took 1.0 a few hours to crawl the bestbuy
> > > > > > site, while 1.1 takes only 2-3 minutes. Has anyone had a similar
> > > > > > experience, and does anyone know why?
> > > > >
> > > > > Thanks.
> > > > >
> > > > >
> > >
> > >
> > >
> > >
> > > ------------------------------
> > >  View message @
> > >
> > http://lucene.472066.n3.nabble.com/Nutch-1-1-crawls-fewer-links-than-1-0-tp971589p971632.html
> > >
> > >
> > >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Nutch-1-1-crawls-fewer-links-than-1-0-tp971589p976259.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> 
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 

