I've played around with similar things myself. If you know that there is
only one 'record' per HTML page, you could use <@REPLACE> to strip the
garbage (<TABLE...> tags, <B> tags, etc.) by setting REPLACESTR=''. You
may also need the <@SUBSTRING> tag, with its START attribute set to a
<@LOCATE> tag that looks for a string you know exists. For example, if you
use this on a variable that contains the HTML source file from your first
example below:

<@SUBSTRING STR="stringvalue"
  START="<@LOCATE STR='stringvalue' FINDSTR='Bolide Forum message starts here'>"
  NUMCHARS="<@CALC EXPR='<@LENGTH STR="stringvalue"> - <@LOCATE STR="stringvalue" FINDSTR="Bolide Forum message starts here">'">

...then you get rid of all of the junk up to where the message starts. Use
the same tag on the result, with START=1 and
NUMCHARS='<@LOCATE STR="newstringvalue" FINDSTR="<!---------- post a followup heading + form ---------->">'
and you get rid of everything after the body of the message. I've guessed
at where you want to start and end, but I hope you get the idea.
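
If it helps to see the trimming logic outside Witango, here is a minimal
Python sketch of the same idea. It is only an illustration, not Witango
code: the function name between_markers and the sample page string are made
up for the example, and the two marker strings are the same guesses used
above.

```python
def between_markers(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Returns None if either marker cannot be found.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    # Only look for the end marker after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


START = "Bolide Forum message starts here"
END = "<!---------- post a followup heading + form ---------->"

# Hypothetical page content, for illustration only.
page = "<HTML><BODY>junk " + START + " Hello world " + END + " more junk</BODY></HTML>"
trimmed = between_markers(page, START, END)
```

As with the <@SUBSTRING> version, the result still begins with the start
marker itself, so that would need to be stripped in a later step.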

Then find out what starts each section, such as a particular comment or
formatting code (e.g. <TR>), and use the <@REPLACE> tag to replace it with
a unique character such as the pipe character '|'. Then use the <@TOKENIZE>
tag to create a one-row array with each part of the message in its own
column.
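
The replace-then-split step can be sketched the same way. Again, this is a
rough Python illustration under assumptions, not Witango code: the function
name split_sections, the <TR> delimiter, and the sample string are
hypothetical, and the real pages will need their own list of delimiters.

```python
import re


def split_sections(text, delimiters, sep="|"):
    """Swap each section delimiter for sep, drop leftover tags, then split."""
    for delimiter in delimiters:
        text = text.replace(delimiter, sep)
    # Remove any remaining HTML tags such as <B> or </B>.
    text = re.sub(r"<[^>]+>", "", text)
    # Split on the separator and discard empty pieces.
    return [part.strip() for part in text.split(sep) if part.strip()]


# Hypothetical trimmed message, for illustration only.
sample = "<TR><B>Nick</B><TR>September 11, 2002<TR>parsing text files"
fields = split_sections(sample, ["<TR>"])
```

One thing to watch for with either approach: if a message body already
contains the separator character, it will be split in the wrong place, so
pick a character that never appears in the postings.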

It will take a little bit of work to get it done the first time but
depending on how many HTML files you've got, it might be worth it because
you'll be able to whip through the remaining files.

Hope this helps,

Steve Smith

Skadt Information Solutions
Office: (519) 624-4388
GTA:    (416) 606-3885
Fax:    (519) 624-3353
Cell:   (416) 606-3885
Email:  [EMAIL PROTECTED]
Web:    http://www.skadt.com


> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Nicholas Froome
> Sent: September 11, 2002 7:22 AM
> To: Multiple recipients of list witango-talk
> Subject: Witango-Talk: parsing text files [OT]
>
>
> I'm moving a Forum from flat HTML pages into a Filemaker
> database, and then maybe into something quicker
>
> I have 800-odd Forum postings that I have to transfer over into a
> database, and it's killing me to do it by hand
>
> The required text is always bookended the same way, so a search
> for text (the posting) bounded by text strings (the HTML) would
> get the data
>
> I need to extract the following:
>
> * filename
> * name
> * date
> * subject
> * body
>
> A sample old page is at:
> http://www.bolide.co.uk/bbs/messages/246.html
>
> The new, 90% complete, Forum is at
> http://www.bolide.co.uk/actions?forum.taf
>
>
> I've tried a search & replace on the files using BBEdit, but the
> small differences in text wrapping and content mean I can't get
> it to work. I've tried (and failed) to AppleScript a batch import
> into Filemaker, but even if I did that I'd have problems parsing
> & extracting the text
>
> I'm not a good enough programmer to whisk something up to parse a
> folder full of HTML files - any suggestions?
>
>
> Many thanks!
> ________________________________________________________________________
> TO UNSUBSCRIBE: send a plain text/US ASCII email to [EMAIL PROTECTED]
>                 with unsubscribe witango-talk in the message body
>
