Re: [NTG-context] Bad PDF to text crawlers

2015-08-20 Thread Kip Warner

On Wed, 2015-08-19 at 23:35 +0200, Peter Münster wrote:
> Even if you found a way today, tomorrow there would be other bots 
> that see the same text as humans do.

Yes, probably.

> Get the value of HTTP_USER_AGENT and send the replacement text if 
> the agent is a bot. Or use robots.txt.

I'll give that some thought. Thank you.

-- 
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com


___
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki : http://contextgarden.net
___

Re: [NTG-context] Bad PDF to text crawlers

2015-08-19 Thread Peter Münster

On Wed, Aug 19 2015, Kip Warner wrote:

> I was thinking that there must be some way of tricking these bots, 
> depending on how they are implemented, and let's assume they will always 
> find the PDF, to get them to extract only a small invisible layer that 
> just contains some hidden text directing a user to the location to 
> download the original high quality ConTeXt PDF.

Even if you found a way today, tomorrow there would be other bots
that see the same text as humans do.


> Any suggestions?

Get the value of HTTP_USER_AGENT and send the replacement text if the
agent is a bot. Or use robots.txt.

-- 
   Peter
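
A minimal sketch of the HTTP_USER_AGENT suggestion above, assuming a
CGI/WSGI-style environment dictionary on the server side. The marker
list, the replacement text, and the file name document.pdf are all
hypothetical placeholders; none of them comes from this thread, and a
bot that sends a browser-like User-Agent will not be caught.

```python
# Hypothetical substrings that commonly appear in crawler User-Agent strings.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

# Hypothetical replacement text served to bots instead of the PDF.
REPLACEMENT_TEXT = "Please download the original PDF from the author's site.\n"


def is_bot(user_agent):
    """Return True if the User-Agent string contains a known bot marker."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)


def choose_response(environ):
    """Pick (content_type, payload) from a CGI/WSGI-style environ dict.

    Bots get the short replacement text; everyone else gets the path of
    the PDF to serve (document.pdf is an assumed file name).
    """
    if is_bot(environ.get("HTTP_USER_AGENT")):
        return ("text/plain", REPLACEMENT_TEXT)
    return ("application/pdf", "document.pdf")
```

Matching on substrings is only a heuristic, which is in line with the
caveat above that new bots will keep appearing.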

[NTG-context] Bad PDF to text crawlers

2015-08-19 Thread Kip Warner

Hey list,

I have an important document online that I would prefer to keep as a PDF 
and not in another format. Unfortunately, bots frequently extract a text 
version from it and offer that to people looking for the document, which 
is beyond my control. The extracted text looks absolutely awful and 
leaves readers with a really bad understanding of the document's 
contents.

I was thinking that there must be some way of tricking these bots, 
depending on how they are implemented, and let's assume they will always 
find the PDF, to get them to extract only a small invisible layer that 
just contains some hidden text directing a user to the location to 
download the original high quality ConTeXt PDF.

Any suggestions?

-- 
Kip Warner -- Senior Software Engineer
OpenPGP encrypted/signed mail preferred
http://www.thevertigo.com


