Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:

> I add those Disallow lines so that both apps can crawl the same
> number of URLs (approximately 140):
>
> Disallow */basket-villeurbanne/author/*
> Disallow *?p=*
> Disallow */feed
>
> It seems that mnoGoSearch can handle a robots.txt, but not the
> meta robots "noindex,follow"...
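(For context, an assumption on my part since the message does not show the surrounding file: wildcard patterns like these match the syntax of mnoGoSearch's indexer.conf Disallow command, where they would sit next to a Server command. The hostname below is hypothetical.)

```
# indexer.conf fragment (hypothetical site URL)
Server http://www.example.com/
Disallow */basket-villeurbanne/author/*
Disallow *?p=*
Disallow */feed
```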
mnoGoSearch supports meta robots. Can you please give a URL of a
document whose robots directives are ignored? I'll check what's
happening.

> Here are the results:
>
> -----
>
> indexer -C;
> indexer;
>
> [18898]{01} Done (53 seconds, 168 documents, 3503752 bytes, 64.56 Kbytes/sec.)
>
> -----
>
> indexer -C;
> indexer -N5;
>
> [19261]{02} Done (15 seconds, 46 documents, 982938 bytes, 63.99 Kbytes/sec.)
> [19261]{03} Done (15 seconds, 48 documents, 930200 bytes, 60.56 Kbytes/sec.)
> [19261]{01} Done (5 seconds, 14 documents, 323667 bytes, 63.22 Kbytes/sec.)
> [19261]{05} Done (15 seconds, 46 documents, 974427 bytes, 63.44 Kbytes/sec.)
> [19261]{04} Done (5 seconds, 14 documents, 292520 bytes, 57.13 Kbytes/sec.)
> [19261]{--} Done (26 seconds, 168 documents, 3503752 bytes, 131.60 Kbytes/sec.)
>
> -----
>
> indexer -C;
> indexer -N50;
>
> [20289]{11} Done (11 seconds, 28 documents, 585571 bytes, 51.99 Kbytes/sec.)
> [20289]{28} Done (11 seconds, 29 documents, 705247 bytes, 62.61 Kbytes/sec.)
> [20289]{16} Done (11 seconds, 30 documents, 635782 bytes, 56.44 Kbytes/sec.)
> [20289]{30} Done (11 seconds, 30 documents, 635178 bytes, 56.39 Kbytes/sec.)
> [20289]{--} Done (21 seconds, 168 documents, 3504392 bytes, 162.96 Kbytes/sec.)
>
> -----
>
> mysql -uroot -p -N --database=db_test_mnogo --execute="SELECT url FROM url" > ~/ALL.txt;
>
> (cat ~/ALL.txt | parallel -j8 --gnu "wget {}");
>
> real 0m10.638s
> user 0m1.256s
> sys  0m1.519s
>
> -----
>
> Screaming Frog: 12s
>
> This just confirms that mnoGoSearch is relatively slower than
> Screaming Frog, and even slower than the parallel wget pipeline.
> It gets a little better with indexer -N50, though.

Well, this effect can happen with a *small* site and an empty database.
When indexer starts multiple threads (say 10) and the database is
empty, 9 threads immediately go to sleep for 10 seconds, so only the
first thread is actually working.
After 10 seconds the database is no longer empty, because the first
thread has collected some links. So indexer only starts working in
multi-threaded mode after those first 10 seconds. With a bigger site
you will not see any difference between mnoGoSearch and wget/frog.

If you really need to crawl a small site quickly, please apply this
patch:

<patch>
=== modified file 'src/indexer.c'
--- src/indexer.c	2016-03-30 12:13:49 +0000
+++ src/indexer.c	2016-05-14 08:28:25 +0000
@@ -2872,7 +2872,7 @@
 int maxthreads= 1;
 UDM_CRAWLER *ThreadCrawlers= NULL;
 int thd_errors= 0;
-#define UDM_NOTARGETS_SLEEP 10
+#define UDM_NOTARGETS_SLEEP 0
 
 #ifdef WIN32
 unsigned int __stdcall UdmCrawlerMain(void *arg)
</patch>

Here are the results:

./indexer -Cw ; ./indexer -N10
[5853]{--} Done (12 seconds, 168 documents, 3504192 bytes, 285.17 Kbytes/sec.)

It's now as fast as wget and frog, and it crawls more documents
(168 vs 140).

Please note: aggressive crawling is not polite and can even be
considered an attack. It is better not to crawl sites this way,
unless it is your own site or the site owners allow you to do it
this way.

Reply: <http://www.mnogosearch.org/board/message.php?id=21768>
_______________________________________________
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general