[General] Webboard: How to speed up the crawl delay after each URL ?

2016-05-14 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> I added those Disallow lines so that both apps can crawl the same
> number of URLs (approximately 140):
> 
> Disallow */basket-villeurbanne/author/*
> Disallow *?p=*
> Disallow */feed
> 
> As it seems that mnoGoSearch can handle robots.txt, but not the
> meta robots noindex,follow directive.
> 


mnoGoSearch supports meta robots.
Can you please give a URL of a document whose robots directives are ignored?
I'll check what's happening.


> Here are the results :
> 
> -
> 
> indexer -C;
> indexer;
> 
> [18898]{01} Done (53 seconds, 168 documents, 3503752 bytes, 64.56 
> Kbytes/sec.)
> 
> --
> 
> indexer -C;
> indexer -N5;
> 
> [19261]{02} Done (15 seconds, 46 documents, 982938 bytes, 63.99 
> Kbytes/sec.)
> [19261]{03} Done (15 seconds, 48 documents, 930200 bytes, 60.56 
> Kbytes/sec.)
> [19261]{01} Done (5 seconds, 14 documents, 323667 bytes, 63.22 
> Kbytes/sec.)
> [19261]{05} Done (15 seconds, 46 documents, 974427 bytes, 63.44 
> Kbytes/sec.)
> [19261]{04} Done (5 seconds, 14 documents, 292520 bytes, 57.13 
> Kbytes/sec.)
> [19261]{--} Done (26 seconds, 168 documents, 3503752 bytes, 131.60 
> Kbytes/sec.)
> 
> 
> indexer -C;
> indexer -N50;
> [20289]{11} Done (11 seconds, 28 documents, 585571 bytes, 51.99 
> Kbytes/sec.)
> [20289]{28} Done (11 seconds, 29 documents, 705247 bytes, 62.61 
> Kbytes/sec.)
> [20289]{16} Done (11 seconds, 30 documents, 635782 bytes, 56.44 
> Kbytes/sec.)
> [20289]{30} Done (11 seconds, 30 documents, 635178 bytes, 56.39 
> Kbytes/sec.)
> [20289]{--} Done (21 seconds, 168 documents, 3504392 bytes, 162.96 
> Kbytes/sec.)
> 
> 
> mysql -uroot -p -N --database=db_test_mnogo --execute="SELECT url 
> FROM url" > ~/ALL.txt;
> 
> (cat ~/ALL.txt | parallel -j8 --gnu "wget {}");
> 
> real  0m10.638s
> user  0m1.256s
> sys   0m1.519s
> 
> 
> ---
> 
> Screaming Frog : 12s
> 
> 
> It just confirms that mnoGoSearch is relatively slower than
> Screaming Frog, and even compared to the parallel wget bash run,
> mnoGoSearch is slower.
> 
> It gets a little better with indexer -N50, though.


Well, this effect can happen with a *small* site, with an empty database.

When indexer starts multiple threads (say 10) and the database is empty, 9 
threads immediately go to sleep for 10 seconds.
So only the first thread is actually working.

After 10 seconds the database is not empty, because the first thread has 
collected some links.

So it actually starts working in multi-threaded mode only after 10 seconds.

With a bigger site you will not see any difference between mnoGoSearch
and wget or Screaming Frog.
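
Roughly, each crawler thread's loop behaves like the following sketch. This is
not the actual mnoGoSearch source, just a minimal standalone illustration of
the startup behaviour described above; next_target() and queued_urls are
hypothetical stand-ins for the real target-queue code.

/* Minimal sketch (NOT the real mnoGoSearch code) of one crawler thread's
 * loop: when the target queue is empty it sleeps UDM_NOTARGETS_SLEEP
 * seconds before looking again, so with an empty database N-1 of the N
 * threads go idle immediately at startup.
 */
#include <stdio.h>
#include <unistd.h>

#define UDM_NOTARGETS_SLEEP 10   /* the constant the patch below sets to 0 */

static int queued_urls= 1;       /* pretend only the start URL is known    */

static int next_target(void)     /* stand-in for the real queue lookup     */
{
  if (!queued_urls)
    return 0;                    /* nothing to crawl yet                   */
  queued_urls--;
  return 1;
}

int main(void)
{
  int i;
  for (i= 0; i < 3; i++)         /* a few iterations of one thread's loop  */
  {
    if (!next_target())
    {
      /* Empty queue: sleep and retry.  This is where the extra threads   */
      /* spend the first UDM_NOTARGETS_SLEEP seconds on a small site.     */
      printf("no targets, sleeping %d seconds\n", UDM_NOTARGETS_SLEEP);
      sleep(UDM_NOTARGETS_SLEEP);
      continue;
    }
    printf("crawling one document\n");
    /* ...newly discovered links would be added to the queue here...      */
  }
  return 0;
}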


If you really need to crawl a small site quickly,
please apply this patch:


=== modified file 'src/indexer.c'
--- src/indexer.c   2016-03-30 12:13:49 +
+++ src/indexer.c   2016-05-14 08:28:25 +
@@ -2872,7 +2872,7 @@ int maxthreads=   1;
 UDM_CRAWLER *ThreadCrawlers= NULL;
 int thd_errors= 0;
 
-#define UDM_NOTARGETS_SLEEP 10
+#define UDM_NOTARGETS_SLEEP 0
 
 #ifdef  WIN32
 unsigned int __stdcall UdmCrawlerMain(void *arg)
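
If you build from source, applying it could look roughly like this (a sketch
only: the directory and patch file names are placeholders, and the usual
./configure step is assumed to have been run already):

cd mnogosearch-src                    # top of the unpacked source tree
patch -p0 < notargets-sleep.patch     # or edit src/indexer.c by hand
make && make install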



Here are the results:

./indexer -Cw ; ./indexer -N10
[5853]{--} Done (12 seconds, 168 documents, 3504192 bytes, 285.17 Kbytes/sec.)


It's now as fast as wget and Screaming Frog, and it crawls more documents (168 vs 140).


Please note:
Aggressive crawling is not polite and can even be considered an
attack. It is better not to crawl sites that way, unless it is your
own site, or unless the site owners allow you to do it this way.





[General] Webboard: How to speed up the crawl delay after each URL ?

2016-05-09 Thread bar
Author: rafikCyc
Email: 
Message:
I added those Disallow lines so that both apps can crawl the same
number of URLs (approximately 140):

Disallow */basket-villeurbanne/author/*
Disallow *?p=*
Disallow */feed

As it seems that mnoGoSearch can handle robots.txt, but not the
meta robots noindex,follow directive.

Here are the results :

-

indexer -C;
indexer;

[18898]{01} Done (53 seconds, 168 documents, 3503752 bytes, 64.56 
Kbytes/sec.)

--

indexer -C;
indexer -N5;

[19261]{02} Done (15 seconds, 46 documents, 982938 bytes, 63.99 
Kbytes/sec.)
[19261]{03} Done (15 seconds, 48 documents, 930200 bytes, 60.56 
Kbytes/sec.)
[19261]{01} Done (5 seconds, 14 documents, 323667 bytes, 63.22 
Kbytes/sec.)
[19261]{05} Done (15 seconds, 46 documents, 974427 bytes, 63.44 
Kbytes/sec.)
[19261]{04} Done (5 seconds, 14 documents, 292520 bytes, 57.13 
Kbytes/sec.)
[19261]{--} Done (26 seconds, 168 documents, 3503752 bytes, 131.60 
Kbytes/sec.)


indexer -C;
indexer -N50;
[20289]{11} Done (11 seconds, 28 documents, 585571 bytes, 51.99 
Kbytes/sec.)
[20289]{28} Done (11 seconds, 29 documents, 705247 bytes, 62.61 
Kbytes/sec.)
[20289]{16} Done (11 seconds, 30 documents, 635782 bytes, 56.44 
Kbytes/sec.)
[20289]{30} Done (11 seconds, 30 documents, 635178 bytes, 56.39 
Kbytes/sec.)
[20289]{--} Done (21 seconds, 168 documents, 3504392 bytes, 162.96 
Kbytes/sec.)


mysql -uroot -p -N --database=db_test_mnogo --execute="SELECT url 
FROM url" > ~/ALL.txt;

(cat ~/ALL.txt | parallel -j8 --gnu "wget {}");

real  0m10.638s
user  0m1.256s
sys   0m1.519s


---

Screaming Frog : 12s


It just confirms that mnoGoSearch is relatively slower than
Screaming Frog, and even compared to the parallel wget bash run,
mnoGoSearch is slower.

It gets a little better with indexer -N50, though.



[General] Webboard: How to speed up the crawl delay after each URL ?

2016-05-04 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Here is the site : http://www.asbuers.com/

After crawling this site with mnoGoSearch, I did the following:

# Extracted the list of all documents found (478 documents)
mysql -uroot -N --database=tmp --execute="SELECT url FROM url" >ALL.txt

# Run "wget" with 8 threads 
time (cat ALL.txt | parallel -j8 --gnu "wget {}")


With 8 parallel processes, wget downloaded this site in 38 seconds,
which is around the same time that mnoGoSearch spends on the same site.

I guess when you run Screaming Frog, it's not really downloading the entire
site.





[General] Webboard: How to speed up the crawl delay after each URL ?

2016-05-04 Thread bar
Author: rafikCyc
Email: 
Message:
Thank you for the reply.

Well, you're right...
With -p0 it does not have the 1-second limit.

But it remains very slow, though.

--

I just did a quick speed test on a small site (500 documents):
mnoGoSearch vs. Screaming Frog.

The results:

mnoGoSearch: 3.2 URLs / second
Screaming Frog: 40 URLs / second

Same connection, same remote site, but 10 times faster :(



[General] Webboard: How to speed up the crawl delay after each URL ?

2016-05-04 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Hello,
> 
> I've tried this :
> 
> ./indexer -p 0
> 
> but it doesn't work :(
> The indexer sleeps for at least one second after each URL.

With -p0 it does not add any delay between URLs.
I guess the bottleneck is in the connection, or in the remote site.


To speed up crawling, you can run multiple crawling threads in
parallel, for example:

indexer -N5
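
The per-URL delay and the thread count options can also be combined on one
command line, for example:

indexer -p0 -N5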

Make sure not to take the remote site down, though.


> 
> It seems impossible to index with less than a 1-second delay after each URL.
> 
> To index 300 000 documents on my website, for example, the crawl takes 2 full
> days!
> 
> Is there a solution ?



[General] Webboard: How to speed up the crawl delay after each URL ?

2016-05-04 Thread bar
Author: rafikCyc
Email: 
Message:
Hello,

I've tried this :

./indexer -p 0

but it doesn't work :(
The indexer sleeps for at least one second after each URL.

It seems impossible to index with less than a 1-second delay after each URL.

To index 300 000 documents on my website, for example, the crawl takes 2 full
days!

Is there a solution ?


___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general