There was a bug fixed in the way hopcount was being computed. See CONNECTORS-464.
This means that fewer documents are left in the queue, but the number of indexed documents should be the same.

Karl

On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi <[email protected]> wrote:
>
> Hi guys.
>
> I wonder if anyone has ever faced the experience on web crawling that the
> number of crawled counts differs between MCF0.4 and MCF0.5.
>
> I crawled some portal sites on intranet using MCF0.4 and MCF0.5.
> MCF0.4 crawled over 12000 contents, and meanwhile, MCF0.5 crawled only
> around half of the contents.
> I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
> I hope changing DB does not affect the crawling results:
>
> MCF0.4:
> - Crawled Counts: 12000 and over
> - Solr3.5
> - PostgreSQL 9.1.3
> - Tomcat6
> - Max Hop on Links: 15
> - Max Hop on Redirects: 10
> - Include only hosts matching seeds: Checked
> - org.apache.manifoldcf.crawler.threads: 50
> - org.apache.manifoldcf.database.maxhandles: 100
>
> MCF0.5:
> - Crawled Counts: around 6000
> - Solr3.5
> - MySQL5.5
> - Tomcat6
> - Max Hop on Links: 15
> - Max Hop on Redirects: 10
> - Include only hosts matching seeds: Checked
> - org.apache.manifoldcf.crawler.threads: 50
> - org.apache.manifoldcf.database.maxhandles: 100
>
> Does anyone have any ideas?
>
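
As background on the "Max Hop on Links" setting quoted above, here is a minimal sketch of how a hop-count limit bounds a crawl. It is an illustration only and is not ManifoldCF's implementation: the function `crawl_within_hops`, the `links` dictionary, and the URL names are all hypothetical. It assumes the hop count of a document is the length of the shortest link path from any seed, found by breadth-first search.

```python
from collections import deque

def crawl_within_hops(links, seeds, max_hops):
    """Return {url: hop_count} for every URL within max_hops links of a seed.

    `links` maps a URL to the URLs it links to (hypothetical toy data).
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hops:
            continue  # at the limit: keep the document, do not follow its links
        for target in links.get(url, []):
            if target not in hops:  # BFS: the first visit is the shortest path
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops

# Toy link graph: seed -> a -> c -> d, and seed -> b
links = {"seed": ["a", "b"], "a": ["c"], "c": ["d"]}
result = crawl_within_hops(links, ["seed"], max_hops=2)
print(result)
```

With `max_hops=2`, the document `d` sits three links from the seed and is never queued. How the hop count is computed therefore decides which documents enter the queue at all.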
