A bug in the way hopcount was being computed has been fixed.  See
CONNECTORS-464.

This means that fewer documents are left in the queue, but the number
of indexed documents should be the same.
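
For illustration only: the sketch below is generic Python, not
ManifoldCF's actual code. It assumes hopcount means the smallest number
of link hops from any seed, and that a limit such as "Max Hop on Links"
excludes anything further away. The function name and the link graph
are made up for the example.

```python
from collections import deque

def hop_counts(links, seeds, max_hops):
    """Return {url: hops} for every URL within max_hops links of a seed."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # at the limit: do not follow this page's links
        for target in links.get(url, ()):
            if target not in hops:  # first visit in BFS is the shortest path
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops

# Hypothetical link graph: page -> pages it links to
links = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
}

print(hop_counts(links, ["seed"], max_hops=2))
```

With max_hops=2, page "d" is three hops from the seed, so it is left
out of the result.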

Karl

On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi
<[email protected]> wrote:
>
> Hi guys.
>
>
> I wonder if anyone has seen the number of crawled documents differ
> between MCF0.4 and MCF0.5 when web crawling.
>
>
> I crawled some portal sites on our intranet using MCF0.4 and MCF0.5.
> MCF0.4 crawled over 12000 documents, whereas MCF0.5 crawled only
> around half as many.
> I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
> I hope changing the DB does not affect the crawling results:
>
>
> MCF0.4:
>   - Crawled Counts: 12000 and over
>   - Solr3.5
>   - PostgreSQL 9.1.3
>   - Tomcat6
>   - Max Hop on Links: 15
>   - Max Hop on Redirects: 10
>   - Include only hosts matching seeds: Checked
>   - org.apache.manifoldcf.crawler.threads: 50
>   - org.apache.manifoldcf.database.maxhandles: 100
>
>
> MCF0.5:
>   - Crawled Counts: around 6000
>   - Solr3.5
>   - MySQL5.5
>   - Tomcat6
>   - Max Hop on Links: 15
>   - Max Hop on Redirects: 10
>   - Include only hosts matching seeds: Checked
>   - org.apache.manifoldcf.crawler.threads: 50
>   - org.apache.manifoldcf.database.maxhandles: 100
>
>
> Does anyone have any ideas?
>
