Lee McFadden wrote:
> If everyone else agrees with Chris I will take a look at the moinmoin
> instance to try and remove the comment system.

I hope the */PageCommentData pages will not be removed as well? We could
harvest them automatically later and add their content to the parent
pages. This could probably be done by a little script using BeautifulSoup
and mechanize.
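
Something like this rough, untested sketch could do the harvesting. It
assumes the comment pages show up in the TitleIndex and keep their body in
the usual <div id="content">; the actual write-back to the parent pages
(that's where mechanize would come in) is left out, because it depends on
the wiki's edit form:

#!/usr/bin/env python

import urllib
import sys

from BeautifulSoup import BeautifulSoup

BASE_URL = 'http://docs.turbogears.org'
TITLE_INDEX = BASE_URL + '/TitleIndex'


def comment_pages():
    """Return the URLs of all */PageCommentData pages listed in the title index."""
    soup = BeautifulSoup(urllib.urlopen(TITLE_INDEX).read())
    content = soup.find('div', id='content')
    links = content.findAll('a', href=True)
    return [l['href'] for l in links if l['href'].endswith('/PageCommentData')]

def harvest(url):
    """Return (parent page URL, comment text) for one comment page."""
    parent = url[:-len('/PageCommentData')]
    soup = BeautifulSoup(urllib.urlopen(BASE_URL + url).read())
    content = soup.find('div', id='content')
    # crude: just join all text nodes of the content div
    text = ' '.join(content.findAll(text=True))
    return parent, text.strip()

def main():
    for url in comment_pages():
        parent, text = harvest(url)
        print >>sys.stderr, "Harvesting comments for '%s'..." % parent
        print '== Comments for %s ==' % parent
        print text.encode('utf-8')
        print

if __name__ == '__main__':
    main()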

BTW, I just hacked together a little script that downloads every page in
the wiki and checks whether it is broken (i.e. contains a DIV with
class="traceback"). This could easily be extended to do more checks and
we could run it on a regular basis. It should cache the downloaded pages,
though; see the sketch after the list below. The script is attached.

It currently coughs up the following list of broken pages:

/1.0/AlternativeTemplating
/1.0/CLIReference
/1.0/Configuration
/1.0/GenerateFigures
/1.0/GettingStarted/Admin
/1.0/GettingStarted/Configuration
/1.0/TgAdmin
/1.0/ThirdParty
/DocTeam
/VideoHelp
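
For the caching, a small helper along these lines (untested; the cache
directory and the helper name are just placeholders) could replace the
urllib.urlopen(...).read() calls in the attached script:

import os
import urllib

CACHE_DIR = 'wiki-cache'  # placeholder location for cached pages

def cached_urlopen(url):
    """Return the body of url, reading from the local cache when possible."""
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    # use the percent-encoded URL as the cache file name
    cache_file = os.path.join(CACHE_DIR, urllib.quote(url, safe=''))
    if os.path.exists(cache_file):
        return open(cache_file, 'rb').read()
    data = urllib.urlopen(url).read()
    open(cache_file, 'wb').write(data)
    return data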

Chris

#!/usr/bin/env python

import urllib
import sys

from BeautifulSoup import BeautifulSoup

BASE_URL = 'http://docs.turbogears.org'
TITLE_INDEX = BASE_URL + '/TitleIndex'


def search_pages(urls, searchstring):
    """Download all pages in urls and return those whose content contains serachstring."""

    pages = []
    print >>sys.stderr, "Downloading and parsing wiki pages..."
    for url in urls:
        try:
            print >>sys.stderr, "Downloading '%s'..." % url
            ret = urllib.urlopen(BASE_URL + url)
        except IOError:  # urllib.urlopen signals network errors as IOError
            print >>sys.stderr, "Could not open '%s'" % url
        else:
            if searchstring in ret.read():
                pages.append(url)
    return pages

def main(args):
    try:
        print >>sys.stderr, "Retrieving title index..."
        ret = urllib.urlopen(TITLE_INDEX)
    except IOError:
        print >>sys.stderr, "Could not retrieve title index from", TITLE_INDEX
        return 1
    else:
        title_index = ret.read()
    soup = BeautifulSoup(title_index)
    content = soup.find('div', id='content')
    links = content.findAll('a', href=True)
    # only visit relative URLs without query params
    urls = [l['href'] for l in links if not
        (l['href'].startswith('http://') or l['href'].startswith('#') or
        '?' in l['href'])]
    urls.sort()
    print >>sys.stderr, urls
    broken_pages = search_pages(urls, '<div class="traceback">')
    print "Broken pages"
    print
    print "\n".join(broken_pages)

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))
