I've played around with similar things myself. If you know that there is only one 'record' per HTML page, you could use <@REPLACE> to strip the garbage (<TABLE...> tags, <B> tags, etc.) by setting REPLACESTR=''. You may also need the <@SUBSTRING> tag, with its START attribute set to a <@LOCATE> tag that looks for a string you know exists. For example, if you use this on a variable that contains the HTML source file from your first example below:
<@SUBSTRING STR="stringvalue" START="<@LOCATE STR='stringvalue' FINDSTR='Bolide Forum message starts here'>" NUMCHARS="<@CALC EXPR='<@LENGTH STR="stringvalue"> - <@LOCATE STR="stringvalue" FINDSTR="Bolide Forum message starts here">'">

...then you get rid of all of the junk up to where the message starts. Use this same long tag on the result with START=1 and NUMCHARS='<@LOCATE STR="newstringvalue" FINDSTR="<!---------- post a followup heading + form ---------->">' and you get rid of everything after the body of the message. I've guessed at where you want to start and end, but I hope you get the idea.

Then find out what starts each section, such as a particular comment or formatting code (e.g. <TR>), and use the <@REPLACE> tag to replace it with a unique character such as the pipe character '|'. Then use the <@TOKENIZE> tag to create a one-row array, with each part of the message placed in its own column. It will take a little bit of work to get it done the first time, but depending on how many HTML files you've got, it might be worth it because you'll be able to whip through the remaining files.
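If you end up doing this outside Witango, the same trim, replace and tokenize steps can be sketched in Python. Treat this as a rough illustration only: the two boundary strings are the ones guessed at above, while the section markers (<!--name--> and so on) and all of the function names are made up for the example and would need to be swapped for whatever really precedes each field in the pages.

```python
import re
from typing import List

# Boundary strings taken from the example above (guesses, as noted there).
START_MARKER = "Bolide Forum message starts here"
END_MARKER = "<!---------- post a followup heading + form ---------->"

# Hypothetical section markers: substitute whatever really precedes each field.
SECTION_MARKERS = ["<!--name-->", "<!--date-->", "<!--subject-->", "<!--body-->"]


def trim_to_message(html: str) -> str:
    """Drop everything before START_MARKER and from END_MARKER onwards."""
    start = html.find(START_MARKER)
    if start == -1:
        raise ValueError("start marker not found")
    end = html.find(END_MARKER, start)
    if end == -1:
        end = len(html)  # no end marker: keep the rest of the page
    return html[start:end]


def split_fields(message: str, markers: List[str]) -> List[str]:
    """Swap each section marker for '|', strip leftover tags, split on '|'."""
    for marker in markers:
        message = message.replace(marker, "|")
    message = re.sub(r"<[^>]+>", "", message)  # remove remaining HTML tags
    # Whatever precedes the first '|' (the start marker text) is discarded.
    return [part.strip() for part in message.split("|")[1:]]


def extract(html: str) -> List[str]:
    """Run both steps on one page and return its fields in marker order."""
    return split_fields(trim_to_message(html), SECTION_MARKERS)


if __name__ == "__main__":
    page = (
        "<html><table>junk</table>Bolide Forum message starts here"
        "<!--name-->Nick<!--date-->11 Sep 2002<!--subject-->Parsing"
        "<!--body--><p>Hello <b>world</b></p>"
        "<!---------- post a followup heading + form ---------->"
        "<form></form></html>"
    )
    print(extract(page))  # ['Nick', '11 Sep 2002', 'Parsing', 'Hello world']
```

As with the Witango version, a message body that happens to contain a literal '|' would be split in two, so pick a delimiter that cannot occur in the postings.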
Hope this helps,

Steve Smith
Skadt Information Solutions
Office: (519) 624-4388
GTA: (416) 606-3885
Fax: (519) 624-3353
Cell: (416) 606-3885
Email: [EMAIL PROTECTED]
Web: http://www.skadt.com

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Nicholas Froome
> Sent: September 11, 2002 7:22 AM
> To: Multiple recipients of list witango-talk
> Subject: Witango-Talk: parsing text files [OT]
>
> I'm moving a Forum from flat HTML pages into a Filemaker
> database, and then maybe into something quicker
>
> I have 800-odd Forum postings that I have to transfer over into a
> database, and it's killing me to do it by hand
>
> The required text is always bookended the same way, so a search
> for text (the posting) bounded by text strings (the HTML) would
> get the data
>
> I need to extract the following:
>
> * filename
> * name
> * date
> * subject
> * body
>
> A sample old page is at:
> http://www.bolide.co.uk/bbs/messages/246.html
>
> The new, 90% complete, Forum is at
> http://www.bolide.co.uk/actions?forum.taf
>
> I've tried a search & replace on the files using BBEdit, but the
> small differences in text wrapping and content mean I can't get
> it to work. I've tried (and failed) to AppleScript a batch import
> into Filemaker, but even if I did that I'd have problems parsing
> & extracting the text
>
> I'm not a good enough programmer to whisk something up to parse a
> folder full of HTML files - any suggestions?
>
> Many thanks!
> ________________________________________________________________________
> TO UNSUBSCRIBE: send a plain text/US ASCII email to [EMAIL PROTECTED]
> with unsubscribe witango-talk in the message body
