Re: [R] Creating a web site checker using R

2019-08-09 Thread Enrico Schumann
> "Chris" == Chris Evans writes:

Chris> I use R a great deal, but its huge web crawling
Chris> power isn't an area I've used. I don't want to reinvent a
Chris> cyberwheel and I suspect someone has done what I want.
Chris> That is a program that would run once a day (easy for
Chris> me to set up as a cron task) and would crawl a single
Chris> root of a web site (mine) and get the file size and a
Chris> CRC or some similar check value for each page as pulled
Chris> off the site (and, obviously, I'd want it not to follow
Chris> off site links). The other key thing would be for it to
Chris> store the values and URLs and be capable of being run
Chris> in "create/update database" mode or in "check pages"
Chris> mode and for the change mode run to Email me a warning
Chris> if a page changes.  The reason I want this is that two
Chris> of my sites have recently had content "disappear":
Chris> neither I nor the ISP can see what's happened and we
Chris> are lacking the very useful diagnostic of the date when
Chris> the change happened, which might have mapped it
Chris> to some component of WordPress, plugins or themes
Chris> having updated.

Chris> I am failing to find any such thing, and all the services
Chris> that offer site checking of this sort are prohibitively
Chris> expensive for me (my sites are zero income and either
Chris> personal or offering free utilities and information).

Chris> If anyone has done this, or something similar, I'd love
Chris> to hear from you if you are willing to share it.  Failing that,
Chris> I think I will have to create this but I know it will
Chris> take me days as this isn't my area of R expertise and
Chris> as, to be brutally honest, I'm a pretty poor
Chris> programmer.  If I go that way, I'm sure people may be
Chris> able to point me to things I may be (legitimately) able
Chris> to recycle in parts to help construct this.

Chris> Thanks in advance,

Chris> Chris


Not an answer, but perhaps two pointers/ideas:

1) Since you mention cron, I suppose you work on a
   Unix-like system, and you likely either have a
   programme called 'wget' installed or can easily
   install it. 'wget' has an option '--mirror', which
   allows you to mirror a website; by default it does
   not follow links to other hosts.
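
   For example (a sketch, to be run from cron; the URL is a
   placeholder, and the exact options may need tuning for
   your site):

```shell
# Mirror a single site into ./mirror/. wget stays on the
# start host by default, so off-site links are not followed.
# --no-parent keeps it below the start URL; --wait is polite.
wget --mirror --no-parent --directory-prefix=mirror \
     --wait=1 https://www.example.org/
```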

2) There is tools::md5sum for computing checksums. You
   could store those in a file and check for changes in
   the files' content (e.g. via 'diff').
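
   A minimal sketch of point 2 in R (the function name,
   the snapshot file name and the return format are my own
   invention, not an existing API): checksum every file
   under the mirror directory, compare against a stored
   snapshot, and report what changed.

```r
## Compare current MD5 checksums of all files under 'root'
## against a stored snapshot, then update the snapshot.
## The snapshot is a dotfile, so list.files() skips it.
check_site <- function(root,
                       state_file = file.path(root, ".md5state.rds")) {
    files <- list.files(root, recursive = TRUE, full.names = TRUE)
    new <- tools::md5sum(files)          ## named vector: path -> md5
    old <- if (file.exists(state_file)) readRDS(state_file) else new[0]
    common  <- intersect(names(new), names(old))
    changed <- common[new[common] != old[common]]
    added   <- setdiff(names(new), names(old))
    removed <- setdiff(names(old), names(new))
    saveRDS(new, state_file)
    list(changed = changed, added = added, removed = removed)
}
```

   Run from cron, anything non-empty in the result could be
   written to standard output, which cron will mail to you
   (e.g. via the MAILTO variable in the crontab).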


regards
Enrico
-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Creating a web site checker using R

2019-08-08 Thread Chris Evans
I use R a great deal, but its huge web crawling power isn't an area I've
used. I don't want to reinvent a cyberwheel and I suspect someone has done what 
I want.  That is a program that would run once a day (easy for me to set up as 
a cron task) and would crawl a single root of a web site (mine) and get the 
file size and a CRC or some similar check value for each page as pulled off the 
site (and, obviously, I'd want it not to follow off site links). The other key 
thing would be for it to store the values and URLs and be capable of being run 
in "create/update database" mode or in "check pages" mode and for the change 
mode run to Email me a warning if a page changes.  The reason I want this is 
that two of my sites have recently had content "disappear": neither I nor the 
ISP can see what's happened and we are lacking the very useful diagnostic of 
the date when the change happened, which might have mapped it to some component of
WordPress, plugins or themes having updated.

I am failing to find any such thing, and all the services that offer site
checking of this sort are prohibitively expensive for me (my sites are zero 
income and either personal or offering free utilities and information).

If anyone has done this, or something similar, I'd love to hear from you if
you are willing to share it.  Failing that, I think I will have to create this but I
know it will take me days as this isn't my area of R expertise and as, to be 
brutally honest, I'm a pretty poor programmer.  If I go that way, I'm sure 
people may be able to point me to things I may be (legitimately) able to 
recycle in parts to help construct this.

Thanks in advance,

Chris

-- 
Chris Evans  Skype: chris-psyctc
Visiting Professor, University of Sheffield 
I do some consultation work for the University of Roehampton 
 and other places but this  
remains my main Email address.
I have "semigrated" to France, see: 
https://www.psyctc.org/pelerinage2016/semigrating-to-france/ if you want to 
book to talk, I am trying to keep that to Thursdays and my diary is now 
available at: https://www.psyctc.org/pelerinage2016/ecwd_calendar/calendar/
Beware: French time, generally an hour ahead of UK.  That page will also take 
you to my blog which started with earlier joys in France and Spain!
