Re: [Koha] Koha slowed down by Google indexing?!

2017-05-04 Thread Tomas Cohen Arazi
It was mentioned in the release notes a while back. I agree that the
divergence between source installs (what the manual usually talks about)
and the packages has become a real problem.

On Thu, 4 May 2017 at 11:49, Michael Kuhn wrote:

> Hi Magnus
>
> >> The sitemapper tool is baked in Koha. The packages have a handy
> >> koha-sitemap script.
> >
> > And the documentation for it is available if you do this on the command
> > line:
> >
> > $ man koha-sitemap
>
> Yes - but before, of course, the world needs to know there IS such a
> command. That's why I wrote in my e-mail from 3rd May 2017 16:38 the
> following:
>
>
> Unfortunately I couldn't find anything about it in the Koha 16.11
> manual. The only sources I found were:
>
> * https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190
> * http://search.cpan.org/~fredericd/Koha-Contrib-Tamil-0.011/bin/koha-sitemap
>
> Since I couldn't find the command in
>
> https://wiki.koha-community.org/wiki/Commands_provided_by_the_Debian_packages
> I have now added "koha-sitemap" there in the new section "Bot-related".
>
> I still think it'd be good to propagate this better (for example mention
> it in the Koha manual) - or even activate it by default because who
> would want such behavior?
>
>
> Best wishes: Michael
> --
> Geschäftsführer · Diplombibliothekar BBS, Informatiker eidg. Fachausweis
> Admin Kuhn GmbH · Pappelstrasse 20 · 4123 Allschwil · Schweiz
> T 0041 (0)61 261 55 61 · E m...@adminkuhn.ch · W www.adminkuhn.ch
> ___
> Koha mailing list  http://koha-community.org
> Koha@lists.katipo.co.nz
> https://lists.katipo.co.nz/mailman/listinfo/koha
>
-- 
Tomás Cohen Arazi
Theke Solutions (https://theke.io )
✆ +54 9351 3513384
GPG: B2F3C15F
___
Koha mailing list  http://koha-community.org
Koha@lists.katipo.co.nz
https://lists.katipo.co.nz/mailman/listinfo/koha


Re: [Koha] Koha slowed down by Google indexing?!

2017-05-04 Thread Michael Kuhn

Hi Magnus


>> The sitemapper tool is baked in Koha. The packages have a handy
>> koha-sitemap script.
>
> And the documentation for it is available if you do this on the command line:
>
> $ man koha-sitemap


Yes - but before, of course, the world needs to know there IS such a 
command. That's why I wrote in my e-mail from 3rd May 2017 16:38 the 
following:



Unfortunately I couldn't find anything about it in the Koha 16.11
manual. The only sources I found were:

* https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190
* http://search.cpan.org/~fredericd/Koha-Contrib-Tamil-0.011/bin/koha-sitemap


Since I couldn't find the command in 
https://wiki.koha-community.org/wiki/Commands_provided_by_the_Debian_packages 
I have now added "koha-sitemap" there in the new section "Bot-related".


I still think it'd be good to propagate this better (for example mention 
it in the Koha manual) - or even activate it by default because who 
would want such behavior?



Best wishes: Michael


Re: [Koha] Koha slowed down by Google indexing?!

2017-05-04 Thread Magnus Enger
On 3 May 2017 at 20:52, Tomas Cohen Arazi  wrote:
> The sitemapper tool is baked in Koha. The packages have a handy
> koha-sitemap script.

And the documentation for it is available if you do this on the command line:

$ man koha-sitemap

Best regards,
Magnus
Libriotech


Re: [Koha] Koha slowed down by Google indexing?!

2017-05-03 Thread Tomas Cohen Arazi
The sitemapper tool is baked in Koha. The packages have a handy
koha-sitemap script.

Regards.

On Wed, 3 May 2017 at 11:45, Michael Kuhn wrote:

> Hi Mark
>
> >>   # ufw status
> >>   Status: active
> >>
> >>   To                         Action      From
> >>   --                         ------      ----
> >>   22/tcp                     ALLOW       Anywhere
> >>   80/tcp                     ALLOW       Anywhere
> >>   8080/tcp                   ALLOW       Anywhere
> >>   Anywhere                   DENY        66.249.64.32
> >>
> >> Somehow, though, Googlebot is still crawling and eating
> >> CPU!
> >
> > I haven't used UFW, but I'm looking at the documentation here:
> >
> >   https://help.ubuntu.com/community/UFW
> >
> > and it seems that the order of the rules is important.  Quote:
> >
> >   Once a rule is matched the others will not be evaluated (see manual
> >   below) so you must put the specific rules first. As rules change you
> >   may need to delete old rules to ensure that new rules are put in the
> >   proper order.
> >
> > and from the man page:
> >
> >   Rule ordering is important and the first match wins. Therefore when
> >   adding rules, add the more specific rules first with more general
> >   rules later.
>
> Many thanks for the clarification! Yes, this makes sense.
>
> So I would have to delete all rules and write them again in the correct
> order? Very "uncomplicated" indeed ;-)
>
> However, I have now deleted the rule for 66.249.64.32 in UFW since the rule in
> iptables succeeded (without needing any special order).
>
> Best wishes & thanks again: Michael


Re: [Koha] Koha slowed down by Google indexing?!

2017-05-03 Thread Michael Kuhn

Hi Mark


>>   # ufw status
>>   Status: active
>>
>>   To                         Action      From
>>   --                         ------      ----
>>   22/tcp                     ALLOW       Anywhere
>>   80/tcp                     ALLOW       Anywhere
>>   8080/tcp                   ALLOW       Anywhere
>>   Anywhere                   DENY        66.249.64.32
>>
>> Somehow, though, Googlebot is still crawling and eating
>> CPU!
>
> I haven't used UFW, but I'm looking at the documentation here:
>
>   https://help.ubuntu.com/community/UFW
>
> and it seems that the order of the rules is important.  Quote:
>
>   Once a rule is matched the others will not be evaluated (see manual
>   below) so you must put the specific rules first. As rules change you
>   may need to delete old rules to ensure that new rules are put in the
>   proper order.
>
> and from the man page:
>
>   Rule ordering is important and the first match wins. Therefore when
>   adding rules, add the more specific rules first with more general
>   rules later.


Many thanks for the clarification! Yes, this makes sense.

So I would have to delete all rules and write them again in the correct 
order? Very "uncomplicated" indeed ;-)


However, I have now deleted the rule for 66.249.64.32 in UFW since the rule in
iptables succeeded (without needing any special order).


Best wishes & thanks again: Michael


Re: [Koha] Koha slowed down by Google indexing?!

2017-05-03 Thread Michael Kuhn

Hi Hugo


> You're not the only one who has suffered this from Google, but Baidu is
> worse, and some others as well. Here are telegram-style answers to your points...
>
> Yes, I have also suffered a lot from crawlers, and I have spent a lot of
> hours trying to adjust firewalls and robots.txt.
>
> What version of Koha are you using? Modern ones have a command
> koha-sitemap (if I am not wrong).


I'm on Koha 16.11.04, and yes, there is a command "koha-sitemap".

Unfortunately I couldn't find anything about it in the Koha 16.11
manual. The only sources I found were:

* https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190
* http://search.cpan.org/~fredericd/Koha-Contrib-Tamil-0.011/bin/koha-sitemap


Since I couldn't find the command in 
https://wiki.koha-community.org/wiki/Commands_provided_by_the_Debian_packages 
I added it there in the new section "Bot-related". But I think it'd be 
good to propagate this better - or even activate it by default because 
who would want such behavior?



> Google Webmaster warns you that it has no immediate effect; you should
> wait a little longer...
>
> In summary, you have done all the expected work; now it is just time to
> adjust it and wait for the results.
>
> With the combination of robots.txt, koha-sitemap & firewall I have been
> happy for a long time... but you're never safe from this.
>
> :( I am sorry..


I'm getting more patient now since I was at least able to cure the 
symptoms...


Best wishes & thanks again: Michael


Re: [Koha] Koha slowed down by Google indexing?!

2017-05-03 Thread Mark Alexander
Excerpts from Michael Kuhn's message of 2017-05-03 16:14:55 +0200:
>   # ufw status
>   Status: active
> 
>   To                         Action      From
>   --                         ------      ----
>   22/tcp                     ALLOW       Anywhere
>   80/tcp                     ALLOW       Anywhere
>   8080/tcp                   ALLOW       Anywhere
>   Anywhere                   DENY        66.249.64.32
>
> Somehow, though, Googlebot is still crawling and eating
> CPU!

I haven't used UFW, but I'm looking at the documentation here:

  https://help.ubuntu.com/community/UFW

and it seems that the order of the rules is important.  Quote:

  Once a rule is matched the others will not be evaluated (see manual
  below) so you must put the specific rules first. As rules change you
  may need to delete old rules to ensure that new rules are put in the
  proper order. 

and from the man page:

  Rule ordering is important and the first match wins. Therefore when
  adding rules, add the more specific rules first with more general
  rules later. 
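
The first-match behaviour quoted above can be illustrated with a toy model. This is only a sketch of the idea, not how UFW or netfilter is implemented; the `Rule` class and `first_match` function are hypothetical names made up for this example, and the default policy of "DENY" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    action: str                   # "ALLOW" or "DENY"
    port: Optional[int] = None    # None means "any port"
    source: Optional[str] = None  # None means "any source address"


def first_match(rules, source, port, default="DENY"):
    """Return the action of the first rule that matches the packet."""
    for rule in rules:
        if rule.port is not None and rule.port != port:
            continue
        if rule.source is not None and rule.source != source:
            continue
        return rule.action
    return default  # assumed default policy when no rule matches


# Rules in the order shown by "ufw status" in this thread.
as_listed = [
    Rule("ALLOW", port=22),
    Rule("ALLOW", port=80),
    Rule("ALLOW", port=8080),
    Rule("DENY", source="66.249.64.32"),
]
# Same rules with the specific DENY moved to the front.
reordered = [as_listed[-1]] + as_listed[:-1]

print(first_match(as_listed, "66.249.64.32", 80))   # ALLOW
print(first_match(reordered, "66.249.64.32", 80))   # DENY
```

In the listed order the "80/tcp ALLOW Anywhere" rule matches first, so the DENY is never reached. According to the ufw man page a rule can be placed at a given position with "ufw insert NUM RULE", for example "ufw insert 1 deny from 66.249.64.32". This would also explain why the iptables command worked without any reordering: "iptables -I" inserts at the head of the chain by default.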



Re: [Koha] Koha slowed down by Google indexing?!

2017-05-03 Thread Hugo Agud
Hi

You're not the only one who has suffered this from Google, but Baidu is
worse, and some others as well. Here are telegram-style answers to your points...

Yes, I have also suffered a lot from crawlers, and I have spent a lot of
hours trying to adjust firewalls and robots.txt.

What version of Koha are you using? Modern ones have a command koha-sitemap
(if I am not wrong).

Google Webmaster warns you that it has no immediate effect; you should
wait a little longer...

In summary, you have done all the expected work; now it is just time to
adjust it and wait for the results.

With the combination of robots.txt, koha-sitemap & firewall I have been
happy for a long time... but you're never safe from this.

:( I am sorry..

-- 
Hugo Agud - Orex Digital
www.orex.es
Director
Calle Sant Joaquin,117, 2º-3ª · 08922 Santa Coloma de Gramanet - Tel: 933 856 138
ha...@orex.es · http://www.orex.es/



Re: [Koha] Koha slowed down by Google indexing?!

2017-05-03 Thread Michael Kuhn

Hi Mark and Hugo

Many thanks for your hints! I have now done the following.

1. I created a file "/usr/share/koha/opac/htdocs/robots.txt" containing 
this:


 Sitemap: sitemapindex.xml
 User-agent: *
 Disallow: /cgi-bin/

2. I generated a Koha sitemap using the seemingly undocumented Perl 
script "sitemap.pl" (according to 
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=11190) which 
created the file "/usr/share/koha/opac/htdocs/sitemapindex.xml" and the 
file "/usr/share/koha/opac/htdocs/sitemap0001.xml" containing the URLs.


3. Even after a complete reboot of the host the "opac-search.pl" 
processes were still there, appearing immediately after the reboot!


4. I went to Google Webmaster Tools, where I downloaded the HTML
confirmation file "googleb56bd3db2af352b1.html" and placed it in
"/usr/share/koha/opac/htdocs" as well. I also followed the steps given
on the Webmaster Tools page, i.e. I called the URL and I confirmed the
download.


5. Even after a complete reboot of the host the "opac-search.pl" 
processes were still there, appearing immediately after the reboot!


6. I then installed the Uncomplicated Firewall / UFW where I applied the 
following rules and enabled it:


 # ufw status
 Status: active

 To                         Action      From
 --                         ------      ----
 22/tcp                     ALLOW       Anywhere
 80/tcp                     ALLOW       Anywhere
 8080/tcp                   ALLOW       Anywhere
 Anywhere                   DENY        66.249.64.32

Somehow, though, Googlebot is still crawling and eating CPU! This can be
seen in the log file "plack.log", which contains hundreds and thousands
of lines like the following:


 66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET 
/opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


And I also found another bot:

 62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET 
/opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" 
"Linguee Bot (http://www.linguee.com/bot; b...@linguee.com)"


Now what I don't understand is how Googlebot (66.249.64.32) can access
the webserver even though it is blocked by UFW?!


7. Already quite desperate, I finally executed the following lines to drop
all packets from 66.249.64.32 and from the Linguee bot (62.138.14.218).


 # iptables -I INPUT -s 66.249.64.32 -j DROP
 # iptables -I INPUT -s 62.138.14.218 -j DROP

And yes - this actually stopped these harassing bots.

But of course, next was this:

 66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET 
/opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


I also dropped this IP address and now - finally! - the OPAC search for 
the normal user works as fast as expected.
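
Since blocking one address only made the crawler return from a neighbouring one, it may help to tally "plack.log" by client before deciding what to block. The following is a minimal sketch, assuming the log uses the combined-log-style layout visible in the sample lines above; the regular expression is inferred from those lines only, the `tally` function is a hypothetical helper, and the /24 grouping is an arbitrary choice that assumes IPv4 addresses.

```python
import ipaddress
import re
from collections import Counter

# Pattern inferred from the sample lines quoted in this message (an assumption).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def tally(lines):
    """Count requests per client IP, per user agent and per /24 network."""
    by_ip, by_agent, by_net = Counter(), Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like access-log entries
        ip = match.group("ip")
        by_ip[ip] += 1
        by_agent[match.group("agent")] += 1
        try:
            by_net[ipaddress.ip_network(f"{ip}/24", strict=False)] += 1
        except ValueError:
            pass  # not an IPv4 address; no network grouping for this line
    return by_ip, by_agent, by_net


# The three log lines from this message, joined back into single lines.
sample = [
    '66.249.64.32 - - [03/May/2017:15:48:28 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=12872 HTTP/1.1" 200 17703 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '62.138.14.218 - - [03/May/2017:15:48:29 +0200] "GET /opac/opac-search.pl?q=se,phr:%22Zeitreise%22 HTTP/1.1" 200 54672 "-" "Linguee Bot (http://www.linguee.com/bot; b...@linguee.com)"',
    '66.249.64.35 - - [03/May/2017:15:59:21 +0200] "GET /opac/opac-authoritiesdetail.pl?authid=16429 HTTP/1.1" 200 17661 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

by_ip, by_agent, by_net = tally(sample)
for net, count in by_net.most_common():
    print(net, count)
```

For a real log, passing an open file handle to tally() should work the same way, since it only iterates over lines. Grouping by network makes it visible that 66.249.64.32 and 66.249.64.35 belong together, which a per-address count alone would not show.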


In fact I can't believe I'm the only one experiencing this
behavior (especially since the information about creating a sitemap with
"sitemap.pl" is quite hidden and undocumented in the Koha manual).


The other thing is that people usually say it's a good thing to be indexed
by Google. Today, however, I don't agree. Maybe tomorrow I will try to
delete the rules which drop the Google packets, and I really hope Google
will then do what it is told to do in "robots.txt", using the Koha sitemap.
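
One detail about the "robots.txt" from step 1 may be worth checking: the sitemaps.org protocol describes the "Sitemap:" line as taking the full URL of the sitemap, so a bare file name such as "sitemapindex.xml" may simply be ignored by crawlers. The sketch below flags such lines. The function name `sitemap_problems` and the host "opac.example.org" are made up for illustration.

```python
from urllib.parse import urlparse


def sitemap_problems(robots_txt):
    """Return (line number, value) for Sitemap lines that are not absolute URLs."""
    problems = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        field, _, value = line.partition(":")
        if field.strip().lower() != "sitemap":
            continue
        target = value.strip()
        parsed = urlparse(target)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append((number, target))
    return problems


# The file from step 1 of this message.
as_written = """\
Sitemap: sitemapindex.xml
User-agent: *
Disallow: /cgi-bin/
"""

# Same file with a full URL; the host name is a placeholder.
with_full_url = as_written.replace(
    "sitemapindex.xml", "https://opac.example.org/sitemapindex.xml"
)

print(sitemap_problems(as_written))
print(sitemap_problems(with_full_url))
```

This checks only the form of the "Sitemap:" value; whether a given crawler accepts or fetches it is a separate question.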


So all this just for the record - maybe it will help someone in the future.

Best wishes: Michael


Re: [Koha] Koha slowed down by Google indexing?!

2017-05-03 Thread Hugo Agud
Hi

Yes, this is an annoying issue with bots. This one is Google, but there are
plenty of them...

You should use robots.txt properly, but if I am not wrong, with Google it
is more effective to go to the Google Webmaster Tools website and adjust
Googlebot's behaviour for your Koha installation.
You should also use koha-sitemap; depending on the version, it is
out-of-the-box functionality.
Perhaps you could also think about using ufw, or even ufw + fail2ban.

Sometimes bots are a nightmare.
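
To make the ufw + fail2ban suggestion more concrete: the underlying idea is to flag a client once it exceeds some number of requests within a time window, which is roughly what fail2ban's "maxretry" and "findtime" settings express. The sketch below shows only that idea in Python. It is not a fail2ban configuration, and `heavy_hitters` with its parameters is a hypothetical helper with arbitrary thresholds.

```python
from collections import defaultdict, deque


def heavy_hitters(events, max_requests, window_seconds):
    """Flag clients with more than max_requests inside any sliding window.

    events: iterable of (timestamp_in_seconds, client_ip), ordered by time.
    """
    recent = defaultdict(deque)  # per-client timestamps still inside the window
    flagged = set()
    for timestamp, ip in events:
        window = recent[ip]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(ip)
    return flagged


# Illustrative events: one client sends 4 requests in 3 seconds,
# another sends 2 requests two minutes apart (192.0.2.10 is a placeholder).
events = [
    (0, "66.249.64.32"),
    (0, "192.0.2.10"),
    (1, "66.249.64.32"),
    (2, "66.249.64.32"),
    (3, "66.249.64.32"),
    (120, "192.0.2.10"),
]

print(heavy_hitters(events, max_requests=3, window_seconds=60))
```

A threshold like this catches crawlers by behaviour rather than by address, so it keeps working when the bot returns from a different IP.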

2017-05-03 13:49 GMT+02:00 Mark Alexander :

> > When I searched for who 66.249.64.32 is, I saw that this IP address belongs
> > to Google.
>
> This does seem to be the Google indexer:
>
>   % nslookup 66.249.64.32
>   ...
>   32.64.249.66.in-addr.arpa name = crawl-66-249-64-32.googlebot.com.
>
> I haven't seen this problem (yet), but perhaps that is because I have
> a /usr/share/koha/opac/htdocs/robots.txt containing this:
>
> Crawl-delay: 60
>
> User-agent: *
> Disallow: /
>
> User-agent: Googlebot
> Disallow: /cgi-bin/koha/opac-search.pl
> Disallow: /cgi-bin/koha/opac-showmarc.pl
> Disallow: /cgi-bin/koha/opac-detailprint.pl
> Disallow: /cgi-bin/koha/opac-ISBDdetail.pl
> Disallow: /cgi-bin/koha/opac-MARCdetail.pl
> Disallow: /cgi-bin/koha/opac-reserve.pl
> Disallow: /cgi-bin/koha/opac-export.pl
> Disallow: /cgi-bin/koha/opac-detail.pl
> Disallow: /cgi-bin/koha/opac-authoritiesdetail.pl





Re: [Koha] Koha slowed down by Google indexing?!

2017-05-03 Thread Mark Alexander
> When I searched for who 66.249.64.32 is, I saw that this IP address belongs
> to Google.

This does seem to be the Google indexer:

  % nslookup 66.249.64.32
  ...
  32.64.249.66.in-addr.arpa name = crawl-66-249-64-32.googlebot.com.
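
A reverse lookup on its own can be faked by whoever controls the PTR record, so the usual advice (as far as I know, also Google's own) is to follow it with a forward lookup and check that the name resolves back to the same address. Here is a minimal sketch of that two-step check. The function name `is_googlebot` is made up, the accepted domain suffixes are an assumption based on the host name shown above, and the resolver functions are passed in as parameters so the logic can be tried without network access.

```python
import socket

# Suffixes assumed to identify Google's crawlers (an assumption, see lead-in).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the domain, then forward-resolve to confirm."""
    try:
        host = reverse(ip)[0].rstrip(".")
    except OSError:
        return False  # no PTR record or lookup failure
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(host)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the address we started from.
    return ip in addresses
```

Calling is_googlebot("66.249.64.32") with the default arguments performs real DNS queries; the result depends on the network and on current DNS data, so no output is claimed here.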

I haven't seen this problem (yet), but perhaps that is because I have
a /usr/share/koha/opac/htdocs/robots.txt containing this:

Crawl-delay: 60

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/koha/opac-search.pl
Disallow: /cgi-bin/koha/opac-showmarc.pl
Disallow: /cgi-bin/koha/opac-detailprint.pl
Disallow: /cgi-bin/koha/opac-ISBDdetail.pl
Disallow: /cgi-bin/koha/opac-MARCdetail.pl
Disallow: /cgi-bin/koha/opac-reserve.pl
Disallow: /cgi-bin/koha/opac-export.pl
Disallow: /cgi-bin/koha/opac-detail.pl
Disallow: /cgi-bin/koha/opac-authoritiesdetail.pl
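
To see what a file like this permits, it can be fed to Python's standard urllib.robotparser. This is only a sketch: it shows how that one parser reads the rules, and real crawlers may interpret some lines differently. The `allowed` helper and the agent name "SomeOtherBot" are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted above.
ROBOTS_TXT = """\
Crawl-delay: 60

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/koha/opac-search.pl
Disallow: /cgi-bin/koha/opac-showmarc.pl
Disallow: /cgi-bin/koha/opac-detailprint.pl
Disallow: /cgi-bin/koha/opac-ISBDdetail.pl
Disallow: /cgi-bin/koha/opac-MARCdetail.pl
Disallow: /cgi-bin/koha/opac-reserve.pl
Disallow: /cgi-bin/koha/opac-export.pl
Disallow: /cgi-bin/koha/opac-detail.pl
Disallow: /cgi-bin/koha/opac-authoritiesdetail.pl
"""


def allowed(robots_txt, agent, path):
    """Ask urllib.robotparser whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(allowed(ROBOTS_TXT, "Googlebot", "/cgi-bin/koha/opac-search.pl"))   # False
print(allowed(ROBOTS_TXT, "Googlebot", "/cgi-bin/koha/opac-main.pl"))     # True
print(allowed(ROBOTS_TXT, "SomeOtherBot", "/cgi-bin/koha/opac-main.pl"))  # False
print(parser.crawl_delay("Googlebot"))                                    # None
```

Two things stand out under this parser. Every crawler other than Googlebot is shut out completely by "Disallow: /", and the "Crawl-delay: 60" line is not attached to any user agent because it comes before the first "User-agent:" line. As far as I know, Google also documents that Googlebot does not honour Crawl-delay at all, so the Disallow lines are probably what kept it away from the expensive pages.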