Karl,
I do not see any exceptions in the log.

Thanks.

Regards,
Shigeki

2012/7/31 Karl Wright <[email protected]>

> One more question: do you see any exceptions in the manifoldcf log file?
>
> Karl
>
> On Mon, Jul 30, 2012 at 7:03 AM, Karl Wright <[email protected]> wrote:
> > This means that we are seeing some kind of transactional integrity
> > problem with MySQL. I have seen hints of this behavior before. It is
> > not a difference in logic. It could be due to either MySQL bugs or
> > subtle differences in how transactions work in MySQL.
> >
> > I will try to write a load test that uses hopcount filters in order to
> > see if the problem can be reliably reproduced here. If it turns out to
> > be a MySQL problem there would not be much we could do to fix the issue.
> >
> > Karl
> >
> > Sent from my Windows Phone
> > ________________________________
> > From: Shigeki Kobayashi
> > Sent: 7/30/2012 6:36 AM
> > To: [email protected]
> > Subject: Re: crawled counts on WEB crawling differ between MCF0.4 and
> > MCF0.5
> >
> >> (1) Make sure that the repository connections and job definitions are
> >> indeed identical between MySQL and PostgreSQL.
> >
> > Yes, they are all the same.
> >
> >> (2) See if you can locate an example document that was crawled with
> >> PostgreSQL but not crawled with MySQL.
> >
> > I confirmed the documents crawled with PostgreSQL but not crawled with
> > MySQL actually exist.
> >
> >> (3) If you create a second web connection and job under MySQL, and run
> >> the job to completion, does the document that was not included get
> >> skipped again? Or does it seem random which documents are skipped on
> >> each run?
> >
> > Ok. I created two connections and jobs with exactly same description,
> > and then ran the jobs to completion.
> > Those run resulted with different number of crawled documents (as shown
> > in the attached picture).
> >
> > It seems the first run skipped some documents and the second run skipped
> > different documents, but all the skipped docs can be located. I have no
> > clue how those docs are skipped.
> >
> > Regards,
> >
> > Shigeki
> >
> > 2012/7/30 Karl Wright <[email protected]>
> >>
> >> There should be no differences between crawling using MySQL as the
> >> database and PostgreSQL, on the same version of ManifoldCF.
> >>
> >> We include an RSS crawling test which finds exactly the expected
> >> number of documents on MySQL. This is a 100,000 document crawl.
> >> There are no back-end-specific logic differences in the web connector
> >> that would be expected to yield different results based on the
> >> back-end database.
> >>
> >> If you believe you have found a difference between MySQL and
> >> PostgreSQL, I suggest the following:
> >>
> >> (1) Make sure that the repository connections and job definitions are
> >> indeed identical between MySQL and PostgreSQL.
> >> (2) See if you can locate an example document that was crawled with
> >> PostgreSQL but not crawled with MySQL.
> >> (3) If you create a second web connection and job under MySQL, and run
> >> the job to completion, does the document that was not included get
> >> skipped again? Or does it seem random which documents are skipped on
> >> each run?
> >>
> >> Thanks,
> >> Karl
> >>
> >> On Sun, Jul 29, 2012 at 9:51 PM, Shigeki Kobayashi
> >> <[email protected]> wrote:
> >> > Aren't there some difference in crawling logics between MySQL and
> >> > PostgreSQL?
> >> >
> >> > I did some tests on web crawling using both of MySQL and PostgreSQL.
> >> >
> >> > MCF0.5 running on MySQL indexed around 6000, and meanwhile MCF0.5
> >> > running on PostgreSQL indexed over 12000 documents.
> >> >
> >> > MCF0.6 running on MySQL indexed around 6000. MCF0.4 running on
> >> > PostgreSQL indexed over 12000 documents.
> >> >
> >> > Each number of indexed documents above is a result of first crawling
> >> > after deleting indexing history from DB.
> >> >
> >> > It seems that changing DB affects crawling and indexing.
> >> >
> >> > Regards,
> >> >
> >> > Shigeki
> >> >
> >> > 2012/7/27 Karl Wright <[email protected]>
> >> >>
> >> >> There was a bug fixed in the way hopcount was being computed. See
> >> >> CONNECTORS-464.
> >> >>
> >> >> This means that fewer documents are left in the queue, but the number
> >> >> of indexed documents should be the same.
> >> >>
> >> >> Karl
> >> >>
> >> >> On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi
> >> >> <[email protected]> wrote:
> >> >> >
> >> >> > Hi guys.
> >> >> >
> >> >> > I wonder if anyone has ever faced the experience on web crawling
> >> >> > that the number of crawled counts differs between MCF0.4 and
> >> >> > MCF0.5.
> >> >> >
> >> >> > I crawled some portal sites on intranet using MCF0.4 and MCF0.5.
> >> >> > MCF0.4 crawled over 12000 contents, and meanwhile, MCF0.5 crawled
> >> >> > only around half of the contents.
> >> >> > I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
> >> >> > I hope changing DB does not affect the crawling results:
> >> >> >
> >> >> > MCF0.4:
> >> >> > - Crawled Counts: 12000 and over
> >> >> > - Solr3.5
> >> >> > - PostgreSQL 9.1.3
> >> >> > - Tomcat6
> >> >> > - Max Hop on Links: 15
> >> >> > - Max Hop on Redirects: 10
> >> >> > - Include only hosts matching seeds: Checked
> >> >> > - org.apache.manifoldcf.crawler.threads: 50
> >> >> > - org.apache.manifoldcf.database.maxhandles: 100
> >> >> >
> >> >> > MCF0.5:
> >> >> > - Crawled Counts: around 6000
> >> >> > - Solr3.5
> >> >> > - MySQL5.5
> >> >> > - Tomcat6
> >> >> > - Max Hop on Links: 15
> >> >> > - Max Hop on Redirects: 10
> >> >> > - Include only hosts matching seeds: Checked
> >> >> > - org.apache.manifoldcf.crawler.threads: 50
> >> >> > - org.apache.manifoldcf.database.maxhandles: 100
> >> >> >
> >> >> > Does anyone have any ideas?

--
~~~~~~~~~~~~~~~~~~~~~~~~
SoftBank Mobile Corp.
Information Systems Division
System Services Business Unit
Service Planning Department

Shigeki Kobayashi
[email protected]
~~~~~~~~~~~~~~~~~~~~~~~~
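
Steps (2) and (3) suggested in the thread come down to set comparisons over the URLs each run indexed. The following is only a minimal sketch of that comparison, assuming the indexed URLs of each run have already been exported to plain lists by some means not shown here. The function name `compare_runs` and the sample URLs are hypothetical and are not part of ManifoldCF.

```python
def compare_runs(reference, run_a, run_b):
    """Compare two crawl runs against a reference crawl.

    reference: URLs indexed by the run treated as complete
               (for example, the PostgreSQL-backed crawl).
    run_a, run_b: URLs indexed by two runs of the same job definition
                  (for example, two MySQL-backed crawls).
    All three arguments are iterables of URL strings.
    """
    reference, run_a, run_b = set(reference), set(run_a), set(run_b)
    missing_a = reference - run_a  # skipped by the first run
    missing_b = reference - run_b  # skipped by the second run
    return {
        "missing_a": missing_a,
        "missing_b": missing_b,
        # Skipped by both runs: points at something deterministic.
        "always_missing": missing_a & missing_b,
        # Skipped by exactly one run: points at run-to-run variation.
        "only_missing_in_one": missing_a ^ missing_b,
    }


if __name__ == "__main__":
    # Hypothetical sample data standing in for exported URL lists.
    postgres_run = ["http://intranet/a", "http://intranet/b",
                    "http://intranet/c", "http://intranet/d"]
    mysql_run_1 = ["http://intranet/a", "http://intranet/b"]
    mysql_run_2 = ["http://intranet/a", "http://intranet/c"]

    result = compare_runs(postgres_run, mysql_run_1, mysql_run_2)
    for name, urls in result.items():
        print(name, sorted(urls))
```

If `only_missing_in_one` is large compared with `always_missing`, the skipped documents vary from run to run, which matches what was observed with the two identical MySQL jobs.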
