One more question: do you see any exceptions in the ManifoldCF log file?

Karl

On Mon, Jul 30, 2012 at 7:03 AM, Karl Wright <[email protected]> wrote:
> This means that we are seeing some kind of transactional integrity problem
> with MySQL. I have seen hints of this behavior before. It is not a
> difference in logic. It could be due to either MySQL bugs or subtle
> differences in how transactions work in MySQL.
>
> I will try to write a load test that uses hopcount filters in order to see
> if the problem can be reliably reproduced here. If it turns out to be a
> MySQL problem there would not be much we could do to fix the issue.
>
> Karl
>
> ________________________________
> From: Shigeki Kobayashi
> Sent: 7/30/2012 6:36 AM
> To: [email protected]
> Subject: Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5
>
> >> (1) Make sure that the repository connections and job definitions are
> >> indeed identical between MySQL and PostgreSQL.
>
> Yes, they are all the same.
>
> >> (2) See if you can locate an example document that was crawled with
> >> PostgreSQL but not crawled with MySQL.
>
> I confirmed that the documents crawled with PostgreSQL but not crawled with
> MySQL actually exist.
>
> >> (3) If you create a second web connection and job under MySQL, and run
> >> the job to completion, does the document that was not included get
> >> skipped again? Or does it seem random which documents are skipped on
> >> each run?
>
> OK. I created two connections and jobs with exactly the same description,
> and then ran the jobs to completion.
> Those runs resulted in different numbers of crawled documents (as shown in
> the attached picture).
>
> It seems the first run skipped some documents and the second run skipped
> different documents, but all the skipped docs can be located. I have no
> clue how those docs are skipped.
>
> Regards,
>
> Shigeki
>
> 2012/7/30 Karl Wright <[email protected]>
>>
>> There should be no differences between crawling using MySQL as the
>> database and PostgreSQL, on the same version of ManifoldCF.
>>
>> We include an RSS crawling test which finds exactly the expected
>> number of documents on MySQL. This is a 100,000 document crawl.
>> There are no back-end-specific logic differences in the web connector
>> that would be expected to yield different results based on the
>> back-end database.
>>
>> If you believe you have found a difference between MySQL and
>> PostgreSQL, I suggest the following:
>>
>> (1) Make sure that the repository connections and job definitions are
>> indeed identical between MySQL and PostgreSQL.
>> (2) See if you can locate an example document that was crawled with
>> PostgreSQL but not crawled with MySQL.
>> (3) If you create a second web connection and job under MySQL, and run
>> the job to completion, does the document that was not included get
>> skipped again? Or does it seem random which documents are skipped on
>> each run?
>>
>> Thanks,
>> Karl
>>
>> On Sun, Jul 29, 2012 at 9:51 PM, Shigeki Kobayashi
>> <[email protected]> wrote:
>> > Aren't there some differences in crawling logic between MySQL and
>> > PostgreSQL?
>> >
>> > I did some tests on web crawling using both MySQL and PostgreSQL.
>> >
>> > MCF0.5 running on MySQL indexed around 6000, and meanwhile MCF0.5
>> > running on PostgreSQL indexed over 12000 documents.
>> >
>> > MCF0.6 running on MySQL indexed around 6000. MCF0.4 running on
>> > PostgreSQL indexed over 12000 documents.
>> >
>> > Each number of indexed documents above is the result of the first crawl
>> > after deleting the indexing history from the DB.
>> >
>> > It seems that changing the DB affects crawling and indexing.
>> >
>> > Regards,
>> >
>> > Shigeki
>> >
>> > 2012/7/27 Karl Wright <[email protected]>
>> >>
>> >> There was a bug fixed in the way hopcount was being computed. See
>> >> CONNECTORS-464.
>> >>
>> >> This means that fewer documents are left in the queue, but the number
>> >> of indexed documents should be the same.
>> >>
>> >> Karl
>> >>
>> >> On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi
>> >> <[email protected]> wrote:
>> >> >
>> >> > Hi guys.
>> >> >
>> >> > I wonder if anyone has ever seen the number of crawled documents on
>> >> > web crawling differ between MCF0.4 and MCF0.5.
>> >> >
>> >> > I crawled some portal sites on an intranet using MCF0.4 and MCF0.5.
>> >> > MCF0.4 crawled over 12000 documents, and meanwhile, MCF0.5 crawled
>> >> > only around half of them.
>> >> > I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
>> >> > I hope changing the DB does not affect the crawling results:
>> >> >
>> >> > MCF0.4:
>> >> > - Crawled Counts: 12000 and over
>> >> > - Solr3.5
>> >> > - PostgreSQL 9.1.3
>> >> > - Tomcat6
>> >> > - Max Hop on Links: 15
>> >> > - Max Hop on Redirects: 10
>> >> > - Include only hosts matching seeds: Checked
>> >> > - org.apache.manifoldcf.crawler.threads: 50
>> >> > - org.apache.manifoldcf.database.maxhandles: 100
>> >> >
>> >> > MCF0.5:
>> >> > - Crawled Counts: around 6000
>> >> > - Solr3.5
>> >> > - MySQL5.5
>> >> > - Tomcat6
>> >> > - Max Hop on Links: 15
>> >> > - Max Hop on Redirects: 10
>> >> > - Include only hosts matching seeds: Checked
>> >> > - org.apache.manifoldcf.crawler.threads: 50
>> >> > - org.apache.manifoldcf.database.maxhandles: 100
>> >> >
>> >> > Does anyone have any ideas?
>> >> >
>> >
>> > --
>> > 〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜
>> > ソフトバンクモバイル株式会社
>> > 情報システム本部
>> > システムサービス事業統括部
>> > サービス企画部
>> >
>> > 小林 茂樹
>> > [email protected]
>> > 〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜
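
For checks (2) and (3) in the thread, it may help to compare the runs mechanically rather than by eye. The following is only a minimal sketch, assuming the indexed URLs of each run have already been exported to plain lists (the export step is not shown, and it is not a ManifoldCF API). The function names and sample URLs are hypothetical.

```python
def skipped(reference, run):
    """Return the URLs present in the reference crawl but missing from this run."""
    return set(reference) - set(run)


def compare_runs(reference, first_run, second_run):
    """Classify the skipped URLs of two runs against one reference crawl."""
    missing_first = skipped(reference, first_run)
    missing_second = skipped(reference, second_run)
    return {
        # Skipped every time: points at something deterministic.
        "missing_in_both": missing_first & missing_second,
        # Skipped in only one run: points at something nondeterministic.
        "missing_only_in_first": missing_first - missing_second,
        "missing_only_in_second": missing_second - missing_first,
    }


if __name__ == "__main__":
    # Hypothetical sample data standing in for exported URL lists.
    postgresql_run = ["http://host/a", "http://host/b", "http://host/c", "http://host/d"]
    mysql_run_1 = ["http://host/a", "http://host/b", "http://host/c"]
    mysql_run_2 = ["http://host/a", "http://host/c", "http://host/d"]

    result = compare_runs(postgresql_run, mysql_run_1, mysql_run_2)
    for label, urls in result.items():
        print(label, sorted(urls))
```

If "missing_in_both" is empty while the other two sets are not, the skipped documents differ from run to run, which matches what was reported above for the two MySQL jobs.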
