Re: Heuristically processing documents
MRAB goo...@mrett.plus.com wrote:
> BJörn Lindqvist wrote:
> 8< ---
>> For example, to find the email you can use a simple regexp. If there is a match you can be certain that that is the author's email. But what algorithms can you use to figure out the other information?
>
> Tricky! :-) How would _you_ recognise them? Have a look at the documents and see if you can see a pattern. For example, names and addresses often consist of a sequence of words in title case, e.g. Björn Lindqvist, which might help you narrow down the list of possibilities. What do telephone numbers look like, etc?

It may help you to think about the problem if you imagine yourself having to extract the information from documents written in a language that you do not understand.

An address may be identified by a number in a line (street address or PO box) that is followed some lines later by another number (zip code). But this hardly qualifies as an algorithm.

A mailto: and/or a set of angle brackets is a strong clue too...

Don't have a clue about the name, though: plain title case might work for John Brown but it fails with Koos van der Merwe.

If there is an email addy in the doc, then it might serve as a clue to where to look, based on the theory that the contact information would be grouped together.

Another clue might be to look for the word Author or its equivalent in a bunch of languages.

Tricky is an understatement.

- Hendrik

--
http://mail.python.org/mailman/listinfo/python-list
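A rough Python sketch of two of the clues above: the "number, then another number a few lines later" address guess, and the idea of only looking near an email address or the word Author. Everything named here (`contact_window`, `address_candidates`, the keyword list, the 4 to 6 digit zip assumption) is made up for illustration and would need tuning against the real documents.

```python
import re

# Loose patterns for guessing, not validation.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Assumption: a handful of translations of "author"; extend as needed.
AUTHOR_RE = re.compile(r"\b(author|autor|auteur|författare)\b", re.IGNORECASE)
NUMBER_RE = re.compile(r"\d")
# Assumption: a zip code is a standalone run of 4 to 6 digits.
ZIP_RE = re.compile(r"\b\d{4,6}\b")


def contact_window(text, radius=3):
    """Return the lines within `radius` lines of an email or author keyword."""
    lines = text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if EMAIL_RE.search(line) or AUTHOR_RE.search(line):
            keep.update(range(max(0, i - radius), min(len(lines), i + radius + 1)))
    return [lines[i] for i in sorted(keep)]


def address_candidates(text, max_gap=3):
    """Return blocks that start on a line with a number and end on a
    zip-like line at most `max_gap` lines later."""
    lines = text.splitlines()
    found = []
    for i, line in enumerate(lines):
        if not NUMBER_RE.search(line):
            continue
        for j in range(i + 1, min(len(lines), i + max_gap + 1)):
            if ZIP_RE.search(lines[j]):
                found.append("\n".join(lines[i:j + 1]))
                break
    return found
```

Running `address_candidates` only on the output of `contact_window` would combine the two clues, on the theory that the contact details sit together.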
Heuristically processing documents
I have a large set of documents in various text formats. I know that each document contains its author's name, email and phone number. Sometimes it also contains the author's home address. The task is to find the name, email and phone for as many documents as possible. Since the documents are not in a specific format, you have to do a lot of guessing, and getting approximate results is fine.

For example, to find the email you can use a simple regexp. If there is a match you can be certain that that is the author's email. But what algorithms can you use to figure out the other information?

--
Regards, Björn
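As an illustration of the "simple regexp" for the email, here is a minimal Python sketch. The pattern and the name `find_emails` are assumptions chosen for the example; the pattern is deliberately loose and does not try to follow the full address grammar.

```python
import re

# Loose pattern: something@label.label[.label...]; good enough for guessing.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text):
    """Return all email-like strings in `text`, in order of appearance."""
    return EMAIL_RE.findall(text)
```

If a document yields exactly one match, that is presumably the author's address; several matches would need some tie-breaking rule.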
Re: Heuristically processing documents
BJörn Lindqvist wrote:
> I have a large set of documents in various text formats. I know that each document contains its author's name, email and phone number. Sometimes it also contains the author's home address. The task is to find the name, email and phone for as many documents as possible. Since the documents are not in a specific format, you have to do a lot of guessing, and getting approximate results is fine.
>
> For example, to find the email you can use a simple regexp. If there is a match you can be certain that that is the author's email. But what algorithms can you use to figure out the other information?

Tricky! :-)

How would _you_ recognise them? Have a look at the documents and see if you can see a pattern. For example, names and addresses often consist of a sequence of words in title case, e.g. Björn Lindqvist, which might help you narrow down the list of possibilities. What do telephone numbers look like, etc?
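A small Python sketch of the two patterns suggested above: runs of title-case words as name candidates, and digit groups as phone candidates. The function names, the punctuation set and the "at least 7 digits" threshold are assumptions for illustration. Names containing lower-case particles will be missed by the title-case test, so treat the output as candidates only.

```python
import re

PUNCT = ".,;:()<>\"'"
# Assumption: a phone number is digits separated by spaces, dots,
# dashes or parentheses, optionally starting with '+'.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{5,}\d")


def title_case_runs(text, min_words=2):
    """Return runs of at least `min_words` consecutive title-case words."""
    runs = []
    for line in text.splitlines():
        current = []
        for word in line.split():
            core = word.strip(PUNCT)
            is_name_like = core.isalpha() and core.istitle()
            if is_name_like:
                current.append(core)
            # A non-matching word or trailing punctuation closes the run.
            if not is_name_like or word[-1] in PUNCT:
                if len(current) >= min_words:
                    runs.append(" ".join(current))
                current = []
        if len(current) >= min_words:
            runs.append(" ".join(current))
    return runs


def phone_candidates(text, min_digits=7):
    """Return phone-like strings that contain at least `min_digits` digits."""
    return [m for m in PHONE_RE.findall(text)
            if sum(ch.isdigit() for ch in m) >= min_digits]
```

Both functions over-generate on purpose; narrowing the list down is a separate step.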