Re: A NH project...

2019-09-23 Thread Joshua Judson Rosen
On 9/23/19 12:36 PM, Bobby Casey wrote:
> I love the idea of this and would contribute if I had some time and energy.
> I've written a few Python scrapers before, but those were all years ago and 
> one-offs.

I expect that these are generally going to be one-offs as well.

> If you could share some details about how you've developed the ones you're 
> presently using

I start off by writing a script that just fetches whatever web pages look relevant
into a revision-controlled directory; pretty-prints the HTML with "hxnormalize -x"
(some of these things are inscrutable without going through that step);
runs a vc diff to stdout; and then commits any new/changed files.

Then I set up a cron job to run that script every few hours (and send me e-mail
whenever the "vc diff to stdout" cited above actually shows a difference).
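A stripped-down sketch of that kind of snapshot script (the particulars here--git
as the VC, hxnormalize from html-xml-utils, the cron schedule--are just
illustrative choices, not necessarily exactly what I'm running):

```shell
#!/bin/sh
# Sketch: fetch a page, pretty-print it, show any diff, commit.
# Run from inside an already-initialized git working directory.

snapshot() {
    url=$1; file=$2
    # Fetch and pretty-print, so the diffs come out line-oriented:
    curl -fsS "$url" | hxnormalize -x > "$file.new" || return 1
    # Print any difference to stdout (cron mails whatever gets printed):
    [ -f "$file" ] && diff -u "$file" "$file.new"
    mv "$file.new" "$file"
    git add "$file"
    git commit -q -m "snapshot $(date -u +%FT%TZ)" -- "$file" || true
}

# Example crontab entry, every 4 hours (cron mails stdout to its owner):
#   0 */4 * * *  cd $HOME/scrapes/wilton && ./snapshot.sh
```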

At that point, while I'm waiting for the `raw' diff-notices, I can look through
the HTML structures, try to identify the parts that I need,
and figure out what the relevant bits of text/data are and how to summarize them
(web.archive.org can be helpful for looking through past revisions of the pages,
 if it seems likely that there's something that's just not represented
 in the samples I've got so far).

So then I write code to *summarize* whatever "status items" are described in
the source material into single-line versions (like "NOW PLAYING: ...",
"STARTING ON SUNDAY (DD/MM): ...", or "ENDS TODAY: ..."),
and I just write all of the lines of "current" summary into a text-file.

And I also revision-control the text-file. Because that means that looking for
"NEWS items" is just doing a VC diff and looking for lines that start with "+" :)
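That extraction step amounts to something like this (a sketch; the helper name
is made up, and any unified-diff-producing VC works the same way):

```shell
# Pull the newly-added lines out of a unified diff: these are the
# "NEWS items".  Skips the "+++ filename" header line, then strips
# the leading "+" marker from each added line.
news_items() {
    grep '^+' | grep -v '^+++' | sed 's/^+//'
}

# e.g.:  git diff HEAD~1 -- summary.txt | news_items
```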

In the case of the Wilton Town Hall Theatre, the text is only *very slightly*
edited from the raw multi-line display format used on their site

(_most_ of the phrasing is copied verbatim from the source material--I'm just
 adjusting punctuation, capitalization, and removing some redundancies
 and awkwardnesses that become more apparent when things are condensed
 into single-line format, e.g.: "STARTS SUNDAY (10/27) PLAYING FOR ONE DAY ONLY").

The news-item lines are written into my "summary listing" file without 
hashtags/bangtags or other markup--
the thing that actually *posts* them runs some sed substitutions to convert
phrases like "silent film" and "live music" to tags.
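Those substitutions are of roughly this shape (a sketch: the real table is longer
and site-specific, and the tag spellings shown here are just for illustration):

```shell
# Sketch of the phrase-to-tag conversion run on summary lines before
# posting (tag spellings are illustrative, not the real ones):
tagify() {
    sed -e 's/silent film/#silentfilm/g' \
        -e 's/live music/#livemusic/g'
}
```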

Once I'm reasonably sure that a given converter isn't going to just spew gibberish,
or be overly aggressive about re-notifying based on overly-subtle source-changes
that a reasonable person wouldn't actually count as "news", I add a social account
for the bot and set it up to post, per this article:


https://www.linux.com/tutorials/weekend-project-monitor-your-server-statusnet-updates/


Most of the methodology here is coming from the experience of having collaborated
on a "Taco Salad Early Warning System" that a friend started when we worked
together about a decade ago :)

> I assume most lists could be easily developed and maintained with one of the 
> many Python web scraping libraries, although I have little to no experience 
> with them.

I did the Wilton movie-listings scraper in bash ;p

That website is kind of "simultaneously ideal and pessimal" source material:

* there's not really any "structured data" to pull out,
  mostly just lists of headings, with some details put into HTML tables
* but (going by the diffs) the HTML seems to be generated automatically
  based on titles/dates/flags set in some sort of database;
  so it seems safe to assume _some level_ of consistency
* nothing about that site (other than the specific movies/times being listed)
  has changed in *years*--possibly even *decades* at this point,
  and seems unlikely to change any time soon (the web-designer referenced
  at the bottom of the page is actually an HVAC contractor who, at some point
  when the WWW was relatively new, apparently decided to dabble in it;
  and the whole point of the theatre is "not changing with the times").

Milford Drive-In's site is apparently based on WordPress..., but with all of the
blogging/chronology stuff removed. Looks like there are RSS feeds, but I've yet
to see them actually contain anything; at least the HTML (provided by
movienewsletters.com) includes some structured markup to make it easier to pick
stuff out.

Looks like the Drive-In's autumn update-schedule is "take the listings offline
the day after the showing, decide mid-week what the next set of movies will be
and put them up then".

Haven't really looked at any other sites yet.

-- 
Connect with me on the GNU social network! 

Not on the network? Ask me for more info!
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org

Re: A NH project...

2019-09-23 Thread Bobby Casey
On Mon, Sep 23, 2019 at 1:34 PM Tyson Sawyer  wrote:

>
> On Mon, Sep 23, 2019 at 12:38 PM Bobby Casey  wrote:
>
>> I've written a few Python scrapers before, but those were all years ago
>> and one-offs.
>>
>
> Back when you used to be a software engineer, before you became a meeting
> attendee? 
>

Yeah, that about sums it up.


Re: A NH project...

2019-09-23 Thread Tyson Sawyer
On Mon, Sep 23, 2019 at 12:38 PM Bobby Casey  wrote:

> I've written a few Python scrapers before, but those were all years ago
> and one-offs.
>

Back when you used to be a software engineer, before you became a meeting
attendee? 

-- 
Tyson D Sawyer

A strong conviction that something must be done is the parent
of many bad measures.   - Daniel Webster


Re: A NH project...

2019-09-23 Thread Bobby Casey
I love the idea of this and would contribute if I had some time and
energy.  I've written a few Python scrapers before, but those were all
years ago and one-offs. If you could share some details about how you've
developed the ones you're presently using then I can try to take a look
if/when I have some spare time. I assume most lists could be easily
developed and maintained with one of the many Python web scraping
libraries, although I have little to no experience with them.

Here's a random list of events and sources that immediately come to mind:

   - Chunky's has a lot of sporting events, classic movies, etc that I
   often wish I knew about sooner
   - Canobie Lake Park has a lot of special dates, events, and shows
   - Salem NH has a summer concert series in the "Field of Dreams" park
   - The visitnh.gov website has a list of festivals that I only seem to
   look at after the interesting events occur (
   https://www.visitnh.gov/things-to-do/event-calendar/festivals  )
   - Not in NH, but Lowell has a great set of festivals and concerts
   - There are several lists of beer festivals, Oktoberfests, etc

Thanks for getting this started.

On Sun, Sep 22, 2019 at 2:10 PM Joshua Judson Rosen 
wrote:

> I just started a project to provide "guerrilla" newsfeeds for
> movie-theatres and stuff;
> wondering if anyone would be interested in helping out--adding more
> newsfeeds for different things.
>
> First is the Wilton Town Hall Theatre, which now has a GNU social stream +
> RSS/Atom feeds
> via , as described here:
>
> https://status.hackerposse.com/conversation/134528#notice-134772
>
> ... because I love seeing things there, and I've been subscribed to their
> mailing-list
> for years..., but I only recently came to realize/accept that the reason I
> *haven't been*
> seeing things there is that their "ALL THE THINGS!" style of listings
> (both in the e-mails and on the website) is not usable for me--
> and my wife (and kids) and I would probably go see more movies if it didn't
> require so much work to digest the listings.
>
> I've also started capturing the Milford Drive-In's listings into a
> revision-control system
> for the last couple of weeks so that I can find trends in how they update;
> thinking about adding a feed for them next--though, since their season is
> about to end,
> I may postpone work on that in favor of adding a listing for something
> else that's
> interesting and live during the winter.
>
> Looking for suggestions--and help writing scrapers/converters. Ideally for
> more small-time
> local things that are more likely to appreciate the publicity than to sue
> me
> into oblivion "just per general corporate policy".
>
> --
> Connect with me on the GNU social network! <
> https://status.hackerposse.com/rozzin>
> Not on the network? Ask me for more info!
>