Hi guys.

Has anyone seen the number of crawled documents differ between MCF0.4 and
MCF0.5 when running the same web crawl?


I crawled some portal sites on the intranet using MCF0.4 and MCF0.5.
MCF0.4 crawled over 12000 documents, whereas MCF0.5 crawled only
around half as many.
I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
I would not expect changing the DB to affect the crawl results. The two
setups were:


MCF0.4:
  - Crawled Count: over 12000
  - Solr3.5
  - PostgreSQL 9.1.3
  - Tomcat6
  - Max Hop on Links: 15
  - Max Hop on Redirects: 10
  - Include only hosts matching seeds: Checked
  - org.apache.manifoldcf.crawler.threads: 50
  - org.apache.manifoldcf.database.maxhandles: 100


MCF0.5:
  - Crawled Count: around 6000
  - Solr3.5
  - MySQL5.5
  - Tomcat6
  - Max Hop on Links: 15
  - Max Hop on Redirects: 10
  - Include only hosts matching seeds: Checked
  - org.apache.manifoldcf.crawler.threads: 50
  - org.apache.manifoldcf.database.maxhandles: 100
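
To narrow this down, one option is to compare the URLs each version
actually indexed. Below is a minimal sketch in Python, assuming the URLs
from each crawl have been exported to plain text, one URL per line. The
function names and the export step are my own assumptions, not anything
MCF provides.

```python
from collections import Counter
from urllib.parse import urlsplit


def load_urls(lines):
    """Return the set of non-empty URLs from an iterable of text lines."""
    return {line.strip() for line in lines if line.strip()}


def missing_by_host(urls_old, urls_new):
    """Count URLs found in the old crawl but not in the new one, per host."""
    missing = urls_old - urls_new
    return Counter(urlsplit(url).netloc for url in missing)


def report(old_path, new_path):
    """Print hosts ordered by how many URLs the new crawl is missing.

    old_path and new_path are hypothetical export files, one URL per line.
    """
    with open(old_path) as f_old, open(new_path) as f_new:
        old, new = load_urls(f_old), load_urls(f_new)
    for host, count in missing_by_host(old, new).most_common():
        print(host, count)
```

Calling report() on the two exports would show whether the missing
documents are concentrated on a few hosts or spread evenly across all of
them.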


Does anyone have any ideas?
