I've shortened the test so that it runs in 2 hours on PostgreSQL, but I've run into the problem that even the PostgreSQL test produces different counts on every run. I've created the ticket CONNECTORS-501 to track this issue. I'll let you know when this is resolved. It may be a week or two, since it looks like it will be somewhat difficult to diagnose.
Karl

On Sun, Aug 5, 2012 at 8:59 PM, Shigeki Kobayashi <[email protected]> wrote:
> Karl,
>
> I was also testing the latest commit and it was too slow.
> I will wait for you back.
>
> Regards,
>
> Shigeki
>
> 2012/8/3 Karl Wright <[email protected]>
>>
>> I'm running it here and it is pretty much too slow to ever finish.
>> Mysqld is chugging for minutes at a time with little apparent
>> progress.
>>
>> I'll have to look into this further when I get back on Tuesday.
>>
>> Karl
>>
>> On Fri, Aug 3, 2012 at 5:10 AM, Karl Wright <[email protected]> wrote:
>> > Hi Shigeki,
>> >
>> > It turns out that the test has not been passing for me, but merely
>> > timing out. I've increased the timeout now and committed that change.
>> > Can you stop your test, drop your "testdb" database, and start the
>> > test over again? A successful test will print the following before
>> > printing any shutdown or cleanup messages:
>> >
>> > System.err.println("Crawl required "+new Long(System.currentTimeMillis()-startTime).toString()+" milliseconds");
>> >
>> > Karl
>> >
>> > On Fri, Aug 3, 2012 at 5:05 AM, Karl Wright <[email protected]> wrote:
>> >> If the test starts to clean up and then hangs, I believe that means
>> >> that it passed. There is also a problem with the test cleanup code
>> >> which is unrelated that I need to look at.
>> >>
>> >> Thanks,
>> >> Karl
>> >>
>> >> On Fri, Aug 3, 2012 at 3:23 AM, Shigeki Kobayashi <[email protected]> wrote:
>> >>> Hi Karl,
>> >>>
>> >>> I figured out where to put the mysql driver. I put
>> >>> mysql-connector-5.x.x.jar in MCF_HOME/lib_proprietary/ then the error
>> >>> was resolved.
>> >>>
>> >>> I also had to modify
>> >>> MCF_HOME/framework/core/src/test/java/org/apache/manifoldcf/core/tests/BaseMySQL.java
>> >>> to change the root password for MySQL.
>> >>>
>> >>> I still get a warning saying "Preclean failed: Error getting connection:
>> >>> Access denied for user 'testuser'@'localhost' (using password: YES)".
>> >>> I don't know how to fix this. Am I supposed to set password for
>> >>> 'testuser'?
>> >>>
>> >>> The program seems to be running at the mcf-test-build.run-load-mysql
>> >>> phase.
>> >>>
>> >>> I will let you know when it's done.
>> >>>
>> >>> Thanks
>> >>>
>> >>> Regards,
>> >>>
>> >>> Shigeki
>> >>>
>> >>> 2012/8/3 Shigeki Kobayashi <[email protected]>
>> >>>>
>> >>>> Hi Karl,
>> >>>>
>> >>>> I executed the following:
>> >>>>
>> >>>> ant run-webcrawler-loadtests-mysql
>> >>>>
>> >>>> I received an error saying "Unable to load database driver:
>> >>>> com.mysql.jdbc.Driver"
>> >>>>
>> >>>> I suppose I have to put mysql-connector-5.x.x.jar somewhere in order
>> >>>> to build the test. If so which directory am I supposed to put in?
>> >>>>
>> >>>> Please let me know.
>> >>>>
>> >>>> Regards,
>> >>>>
>> >>>> Shigeki
>> >>>> 2012/8/3 Karl Wright <[email protected]>
>> >>>>>
>> >>>>> A test has been created for both Postgresql and for MySQL. If you
>> >>>>> check out trunk, you can run the tests like this:
>> >>>>>
>> >>>>> ant run-webcrawler-loadtests-postgresql
>> >>>>>
>> >>>>> and
>> >>>>>
>> >>>>> ant run-webcrawler-loadtests-mysql
>> >>>>>
>> >>>>> I've run the Postgresql test here on Windows and it succeeds. Can
>> >>>>> you confirm that the mysql test fails for you?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Karl
>> >>>>>
>> >>>>> On Tue, Jul 31, 2012 at 8:03 AM, Karl Wright <[email protected]> wrote:
>> >>>>> > I've created CONNECTORS-496 to track this issue.
>> >>>>> >
>> >>>>> > Karl
>> >>>>> >
>> >>>>> > On Tue, Jul 31, 2012 at 3:11 AM, Shigeki Kobayashi <[email protected]> wrote:
>> >>>>> >> Hi Karl
>> >>>>> >>
>> >>>>> >> I use MySQL5.5 and CentOS5.8.
>> >>>>> >> I did not make any MySQL setting. I just specified the manifold's
>> >>>>> >> database maxhandles to 100.
>> >>>>> >>
>> >>>>> >> Regards,
>> >>>>> >>
>> >>>>> >> Shigeki
>> >>>>> >>
>> >>>>> >> 2012/7/31 Karl Wright <[email protected]>
>> >>>>> >>>
>> >>>>> >>> Hi Shigeki,
>> >>>>> >>>
>> >>>>> >>> With the standard MySQL load test, with throttling wide open, running
>> >>>>> >>> on Windows Vista, I get very poor overall performance and parallelism
>> >>>>> >>> - indeed, it's so poor that I doubt there is much parallelism at all
>> >>>>> >>> going on, which may be why I've seen problems only once in a great
>> >>>>> >>> while. See
>> >>>>> >>>
>> >>>>> >>> https://cwiki.apache.org/confluence/display/CONNECTORS/Database+Performance
>> >>>>> >>> . Are you seeing better parallelism than this? Are there MySQL
>> >>>>> >>> switch settings you have changed to enable decent performance? What
>> >>>>> >>> version of MySQL are you using, and what OS?
>> >>>>> >>>
>> >>>>> >>> Karl
>> >>>>> >>>
>> >>>>> >>> On Mon, Jul 30, 2012 at 9:05 PM, Shigeki Kobayashi <[email protected]> wrote:
>> >>>>> >>> > Karl,
>> >>>>> >>> >
>> >>>>> >>> > I do not see any exceptions in the log.
>> >>>>> >>> >
>> >>>>> >>> > Thanks.
>> >>>>> >>> >
>> >>>>> >>> > Regards,
>> >>>>> >>> >
>> >>>>> >>> > Shigeki
>> >>>>> >>> >
>> >>>>> >>> > 2012/7/31 Karl Wright <[email protected]>
>> >>>>> >>> >>
>> >>>>> >>> >> One more question: do you see any exceptions in the manifoldcf log
>> >>>>> >>> >> file?
>> >>>>> >>> >>
>> >>>>> >>> >> Karl
>> >>>>> >>> >>
>> >>>>> >>> >> On Mon, Jul 30, 2012 at 7:03 AM, Karl Wright <[email protected]> wrote:
>> >>>>> >>> >> > This means that we are seeing some kind of transactional integrity
>> >>>>> >>> >> > problem with MySQL. I have seen hints of this behavior before. It is
>> >>>>> >>> >> > not a difference in logic. It could be due to either MySQL bugs or
>> >>>>> >>> >> > subtle differences in how transactions work in MySQL.
>> >>>>> >>> >> >
>> >>>>> >>> >> > I will try to write a load test that uses hopcount filters in order
>> >>>>> >>> >> > to see if the problem can be reliably reproduced here. If it turns
>> >>>>> >>> >> > out to be a MySQL problem there would not be much we could do to fix
>> >>>>> >>> >> > the issue.
>> >>>>> >>> >> >
>> >>>>> >>> >> > Karl
>> >>>>> >>> >> >
>> >>>>> >>> >> > Sent from my Windows Phone
>> >>>>> >>> >> > ________________________________
>> >>>>> >>> >> > From: Shigeki Kobayashi
>> >>>>> >>> >> > Sent: 7/30/2012 6:36 AM
>> >>>>> >>> >> > To: [email protected]
>> >>>>> >>> >> > Subject: Re: crawled counts on WEB crawling differ between MCF0.4 and MCF0.5
>> >>>>> >>> >> >
>> >>>>> >>> >> >>(1) Make sure that the repository connections and job definitions are
>> >>>>> >>> >> >> indeed identical between MySQL and PostgreSQL.
>> >>>>> >>> >> >
>> >>>>> >>> >> > Yes, they are all the same.
>> >>>>> >>> >> >
>> >>>>> >>> >> >>(2) See if you can locate an example document that was crawled with
>> >>>>> >>> >> >> PostgreSQL but not crawled with MySQL.
>> >>>>> >>> >> >
>> >>>>> >>> >> > I confirmed the documents crawled with PostgreSQL but not crawled with
>> >>>>> >>> >> > MySQL actually exist.
>> >>>>> >>> >> >
>> >>>>> >>> >> >>(3) If you create a second web connection and job under MySQL, and run
>> >>>>> >>> >> >> the job to completion, does the document that was not included get
>> >>>>> >>> >> >> skipped again? Or does it seem random which documents are skipped on
>> >>>>> >>> >> >> each run?
>> >>>>> >>> >> >
>> >>>>> >>> >> > Ok. I created two connections and jobs with exactly same description,
>> >>>>> >>> >> > and then ran the jobs to completion.
>> >>>>> >>> >> > Those run resulted with different number of crawled documents (as
>> >>>>> >>> >> > shown in the attached picture).
>> >>>>> >>> >> >
>> >>>>> >>> >> > It seems the first run skipped some documents and the second run
>> >>>>> >>> >> > skipped different documents, but all the skipped docs can be located.
>> >>>>> >>> >> > I have no clue how those docs are skipped.
>> >>>>> >>> >> >
>> >>>>> >>> >> > Regards,
>> >>>>> >>> >> >
>> >>>>> >>> >> > Shigeki
>> >>>>> >>> >> >
>> >>>>> >>> >> > 2012/7/30 Karl Wright <[email protected]>
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> There should be no differences between crawling using MySQL as the
>> >>>>> >>> >> >> database and PostgreSQL, on the same version of ManifoldCF.
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> We include an RSS crawling test which finds exactly the expected
>> >>>>> >>> >> >> number of documents on MySQL. This is a 100,000 document crawl.
>> >>>>> >>> >> >> There are no back-end-specific logic differences in the web connector
>> >>>>> >>> >> >> that would be expected to yield different results based on the
>> >>>>> >>> >> >> back-end database.
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> If you believe you have found a difference between MySQL and
>> >>>>> >>> >> >> PostgreSQL, I suggest the following:
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> (1) Make sure that the repository connections and job definitions are
>> >>>>> >>> >> >> indeed identical between MySQL and PostgreSQL.
>> >>>>> >>> >> >> (2) See if you can locate an example document that was crawled with
>> >>>>> >>> >> >> PostgreSQL but not crawled with MySQL.
>> >>>>> >>> >> >> (3) If you create a second web connection and job under MySQL, and run
>> >>>>> >>> >> >> the job to completion, does the document that was not included get
>> >>>>> >>> >> >> skipped again? Or does it seem random which documents are skipped on
>> >>>>> >>> >> >> each run?
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> Thanks,
>> >>>>> >>> >> >> Karl
>> >>>>> >>> >> >>
>> >>>>> >>> >> >> On Sun, Jul 29, 2012 at 9:51 PM, Shigeki Kobayashi <[email protected]> wrote:
>> >>>>> >>> >> >> > Aren't there some difference in crawling logics between MySQL and
>> >>>>> >>> >> >> > PostgreSQL?
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > I did some tests on web crawling using both of MySQL and PostgreSQL.
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > MCF0.5 running on MySQL indexed around 6000, and meanwhile MCF0.5
>> >>>>> >>> >> >> > running on PostgreSQL indexed over 12000 documents.
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > MCF0.6 running on MySQL indexed around 6000. MCF0.4 running on
>> >>>>> >>> >> >> > PostgreSQL indexed over 12000 documents.
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > Each number of indexed documents above is a result of first crawling
>> >>>>> >>> >> >> > after deleting indexing history from DB.
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > It seems that changing DB affects crawling and indexing.
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > Regards,
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > Shigeki
>> >>>>> >>> >> >> >
>> >>>>> >>> >> >> > 2012/7/27 Karl Wright <[email protected]>
>> >>>>> >>> >> >> >>
>> >>>>> >>> >> >> >> There was a bug fixed in the way hopcount was being computed. See
>> >>>>> >>> >> >> >> CONNECTORS-464.
>> >>>>> >>> >> >> >>
>> >>>>> >>> >> >> >> This means that fewer documents are left in the queue, but the number
>> >>>>> >>> >> >> >> of indexed documents should be the same.
>> >>>>> >>> >> >> >>
>> >>>>> >>> >> >> >> Karl
>> >>>>> >>> >> >> >>
>> >>>>> >>> >> >> >> On Fri, Jul 27, 2012 at 3:00 AM, Shigeki Kobayashi <[email protected]> wrote:
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > Hi guys.
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > I wonder if anyone has ever faced the experience on web crawling that
>> >>>>> >>> >> >> >> > the number of crawled counts differs between MCF0.4 and MCF0.5.
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > I crawled some portal sites on intranet using MCF0.4 and MCF0.5.
>> >>>>> >>> >> >> >> > MCF0.4 crawled over 12000 contents, and meanwhile, MCF0.5 crawled only
>> >>>>> >>> >> >> >> > around half of the contents.
>> >>>>> >>> >> >> >> > I ran MCF0.4 on PostgreSQL and MCF0.5 on MySQL.
>> >>>>> >>> >> >> >> > I hope changing DB does not affect the crawling results:
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > MCF0.4:
>> >>>>> >>> >> >> >> > - Crawled Counts: 12000 and over
>> >>>>> >>> >> >> >> > - Solr3.5
>> >>>>> >>> >> >> >> > - PostgreSQL 9.1.3
>> >>>>> >>> >> >> >> > - Tomcat6
>> >>>>> >>> >> >> >> > - Max Hop on Links: 15
>> >>>>> >>> >> >> >> > - Max Hop on Redirects: 10
>> >>>>> >>> >> >> >> > - Include only hosts matching seeds: Checked
>> >>>>> >>> >> >> >> > - org.apache.manifoldcf.crawler.threads: 50
>> >>>>> >>> >> >> >> > - org.apache.manifoldcf.database.maxhandles: 100
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > MCF0.5:
>> >>>>> >>> >> >> >> > - Crawled Counts: around 6000
>> >>>>> >>> >> >> >> > - Solr3.5
>> >>>>> >>> >> >> >> > - MySQL5.5
>> >>>>> >>> >> >> >> > - Tomcat6
>> >>>>> >>> >> >> >> > - Max Hop on Links: 15
>> >>>>> >>> >> >> >> > - Max Hop on Redirects: 10
>> >>>>> >>> >> >> >> > - Include only hosts matching seeds: Checked
>> >>>>> >>> >> >> >> > - org.apache.manifoldcf.crawler.threads: 50
>> >>>>> >>> >> >> >> > - org.apache.manifoldcf.database.maxhandles: 100
>> >>>>> >>> >> >> >> >
>> >>>>> >>> >> >> >> > Does anyone have any ideas?
>> >>>>
>> >>>
>> >>> --
>> >>> 〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜
>> >>> SoftBank Mobile Corp.
>> >>> Information Systems Division
>> >>> System Services Business Management Department
>> >>> Service Planning Department
>> >>>
>> >>> Shigeki Kobayashi
>> >>> [email protected]
>> >>> 〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜
