As it happens, I was working on a tokenizer last night (it was the reason for
this tweet: http://twitter.com/jtauber/status/11611451291 ). I had planned to
put it up on GitHub (as part of a general "text tools" project), but here is
the current state.
The core of it is just a two-line Python function:
--------------------
from itertools import groupby


def tokenize(stream, chr_classes):
    """
    tokenize the given stream based on the given character classes.

    chr_classes should be a dictionary mapping character class label to a
    string of member unicode characters::

        {
            "numbers": u"0123456789",
            "whitespace": u" \n",
        }
    """
    # build reverse index from character to character class
    idx = dict((ch, chr_class) for chr_class, chrs in chr_classes.items()
               for ch in chrs)

    # tokenize text
    return groupby((ch for line in stream for ch in line.decode("utf-8")),
                   idx.get)
--------------------
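To make the grouping concrete, here is a minimal sketch of the same idea run on an in-memory string. It assumes Python 3, where strings are already Unicode and no decode step is needed; the name tokenize_text and the sample classes are illustrative only and are not part of the code above.

```python
from itertools import groupby


def tokenize_text(text, chr_classes):
    """Group consecutive characters of `text` by character class (sketch)."""
    # reverse index: character -> class label
    idx = {ch: label for label, chrs in chr_classes.items() for ch in chrs}
    # groupby starts a new group whenever idx.get(ch) changes;
    # characters that belong to no class get the label None
    return [(label, "".join(group)) for label, group in groupby(text, idx.get)]


SAMPLE_CLASSES = {
    "numbers": "0123456789",
    "whitespace": " \n",
}

tokens = tokenize_text("12 345\n", SAMPLE_CLASSES)
for label, token in tokens:
    print(repr(token), label)
```

Each run of characters sharing a class comes out as one token, so "12 345\n" yields four tokens: two number runs and two whitespace runs.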
So, for example, you set up your character classes (yes, I could have just
defined ranges but, in my code last night, I was being explicit about the
characters appearing in the particular text I was tokenizing):
--------------------
def u(s):
    """
    convert utf-8 encoded string to unicode.
    """
    return s.decode("utf-8")


CHR_CLASSES = {
    "editorial": u("[]‹›()"),
    "letters": u(
        "ΒΓΔΖΘΚΛΜΝΞΠΣΤΦΧΨ" "ΑΕΗΟ" "Ῥ"
        "ἈἊἉἌἍἙἘἝἜἚἩἨἬἭἹἸἽἼὍὉὊὈὌὙὩὭὫὨ"
        "βγδζθκλμνξπρσςτφχψ" "αεηιουω" "ῥῤ"
        "ἀἁάὰᾶἄἅἂᾳᾷἃἆᾴᾄ" "ἐἑέὲἔἕἓ" "ἡήὴῆἢἤῃῄῇἥἦἧᾐἠᾖἣᾗ"
        "ἰἱίὶῖἶἷἴἵϊἳΐῒ" "ὀὁόὸὅὃὄὂ" "ὐὑύὺῦὔὖὕὓ"
        "ὡώὼῶὧὥῳᾠῷῴᾧὦὤὠὢᾡὣ"
        "΄"
    ),
    "whitespace": u(" \n"),
    "numbers": u("1234567890"),
    "punctuation": u(".,·;“”"),
    "temp": u("†-"),
}
--------------------
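For comparison, the range-based alternative mentioned above could look roughly like the following. This is only a sketch, assuming Python 3; chars_in_ranges and GREEK_LETTERS are hypothetical names, and treating the two Greek Unicode blocks wholesale as "letters" is an assumption, not what the explicit list does.

```python
def chars_in_ranges(*ranges):
    """Return a string of every character in the given inclusive code point ranges."""
    return "".join(chr(cp) for start, end in ranges for cp in range(start, end + 1))


# Assumption: take both Greek blocks as a whole. This is broader than the
# explicit list: the blocks also contain unassigned code points and
# non-letter signs such as U+037E (Greek question mark).
GREEK_LETTERS = chars_in_ranges(
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x1F00, 0x1FFF),  # Greek Extended
)

print("ἀ" in GREEK_LETTERS, "a" in GREEK_LETTERS)
```

The trade-off is that ranges are shorter to write but silently accept characters that never occur in the text, whereas the explicit list flags anything unexpected.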
and then you're good to go...
--------------------
import sys

FILENAME = sys.argv[1]

for chr_class, token in tokenize(open(FILENAME), CHR_CLASSES):
    print "".join(token).encode("utf-8"), chr_class
--------------------
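Because idx.get returns None for any character that is not listed in a class, the same grouping can double as a check that the classes cover the whole text. A minimal sketch, assuming Python 3; unclassified_chars is a hypothetical helper name and the sample classes are illustrative.

```python
from itertools import groupby


def unclassified_chars(text, chr_classes):
    """Return the sorted list of characters in `text` that belong to no class."""
    idx = {ch: label for label, chrs in chr_classes.items() for ch in chrs}
    missing = set()
    for label, group in groupby(text, idx.get):
        if label is None:  # idx.get found no class for these characters
            missing.update(group)
    return sorted(missing)


classes = {"numbers": "0123456789", "whitespace": " \n"}
print(unclassified_chars("12 ab 3b\n", classes))
```

An empty result means every character in the text is accounted for by some class.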
James
On Apr 5, 2010, at 1:44 PM, Weston Ruter wrote:
> DM:
>
> But what we really need is not a parser but a tokenizer. I'm thinking about
> writing one (my degree work was in compiler writing). Basically, we repeat
> the same tokenization code in several places. It should be trivial to write a
> complete, accurate one.
>
> I've also been wanting to work on a tokenizer. At Open Scriptures, the text
> of a work is currently represented by two models (database tables): Token and
> Structure. Tokens are the smallest divisible units of text, such as words,
> punctuation, and whitespace; and structures are the spans of tokens that form
> logical units, such as verses, paragraphs, quotes, etc. The structures are
> standoff-markup for the tokens. With the underlying data stored in this way,
> it can then be serialized in whichever hierarchy desired
> (book-section-paragraph, book-chapter-verse, all-milestoned, etc) or
> whichever data format is needed (OSIS, SWORD Module, XHTML, etc.)
>
> So what I'm currently ruminating on is the process of importing the raw data
> into the Token and Structure models. I wrote an importer for the Tischendorf
> GNT data which does everything, both tokenizing and parsing, but obviously
> there is going to be a lot of code in common with other importers that are
> written. So I too am thinking about how these importers can be reduced to the
> bare minimum to handle the unique aspects of the raw data (i.e. normalize
> it), and then stream the tokens back to a central importer that parses the
> input and stores it into the Token and Structure models. This central
> importer facility could be a web service.
>
> I'd love to collaborate with you on this. We could come up with a common
> tokenizer that can be used by both SWORD and Open Scriptures. The importer
> web service could take tokens as input and as output generate a SWORD module
> and also populate the Open Scriptures models at the same time.
>
> Thoughts?
>
> Weston
>
>
>
> On Mon, Apr 5, 2010 at 10:24 AM, Daniel Owens <[email protected]> wrote:
> Yes, I agree, and if there were a feedback mechanism for the module creator
> to let them know how to start fixing an OSIS file or conf file, it would save
> Chris (or whoever else approves modules) time on the basic stuff.
>
> Daniel
>
>
> On 4/5/2010 11:09 AM, DM Smith wrote:
> This is a great idea. Rather than emailing source to modules at crosswire dot
> org, one could upload it via a web service. We could have stages of
> validation (xmllint) and construction (osis2mod). Such a service could
> evaluate the quality of the submission.
>
> In Him,
> DM
>
> On 04/05/2010 12:01 PM, Weston Ruter wrote:
> Why not turn osis2mod into a web service? Then it wouldn't matter how it is
> implemented since it would be abstracted away by the web service interface.
> It could use the best XML libraries available today and written in the
> programming language of choice, both of which would make maintenance and the
> addition of new features much easier.
>
> Weston
>
>
>
> On Mon, Apr 5, 2010 at 9:05 AM, DM Smith <[email protected]> wrote:
> On 04/05/2010 09:03 AM, Dmitrijs Ledkovs wrote:
> On 5 April 2010 13:55, Manfred Bergmann<[email protected]> wrote:
>
> Hi DM.
>
> Am 05.04.2010 um 13:21 schrieb DM Smith:
>
>
> Regarding using a "real" parser, it is a good idea. But we don't want SWORD
> to be dependent on an external parser.
>
> What's the reason for that?
> I could understand if it meant the user had to install certain libraries
> manually, but when the sources can be integrated into the project and have
> the appropriate licence, then why not?
>
>
> Manfred
>
>
> IMHO there is no harm in bringing in libxml or a much more lightweight
> parser like GMarkup. The build system just needs to be adjusted to
> link e.g. libxml for the osis2mod binary and not the shared sword library.
> It could even be a new tool, called osisxml2mod for example, that is
> built optionally, so that you can still have a full sword dev
> environment without libxml.
>
> Tools for creating modules do not have to be linked with sword or even
> live in the sword tarball / svn, although that does help consistent
> distribution of tools.
>
> I don't remember all of Troy's reasoning when I argued for a true parser.
>
> From what I recall:
> o To maintain freedom to re-license SWORD (e.g. for some other Bible society)
> we need to be able to keep third-party library dependencies well managed. The
> license needs to be compatible with the GPL but cannot be GPL.
>
> o The parser that we have is minimal and simple, sacrificing accuracy and
> completeness for speed. Regarding accuracy, e.g. the parser allows for spaces
> around = in attribute declarations. Regarding completeness, e.g. it does not
> handle namespaces, cdata, dtds/schemas, .... Significantly, it does not
> require a well-formed document, allowing for fragments. Rather than an error,
> it continues when an xml parser is required to stop.
>
> o This parser has better error reporting in that it is based upon knowledge
> of the input. E.g. it reports the verse having the problem.
>
> o By SWORD having the parser, we are not dependent on finding an
> implementation for every platform (e.g. Windows).
>
> There may be other reasons. I'm willing to live with it.
>
> But what we really need is not a parser but a tokenizer. I'm thinking about
> writing one (my degree work was in compiler writing). Basically, we repeat
> the same tokenization code in several places. It should be trivial to write a
> complete, accurate one.
>
> In His Service,
> DM
>
>
> _______________________________________________
> sword-devel mailing list: [email protected]
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Open Scriptures" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/open-scriptures?hl=en.