That's a handy handler, Peter, but I think it would need to be enhanced to accommodate Paul's request here, as his algo needs to account for not only white space but also punctuation. Trickier, it needs to accommodate punctuation across multiple languages, so the range of characters to be checked could be potentially quite lengthy and perhaps difficult to anticipate for all possible use cases.

Personally, I wouldn't bother with any language-parsing tasks in anything prior to v7.0, given the power of trueWord. As Mark Waddingham has noted here, most of the increase in the engine size between v6 and v7 is Unicode libraries and tables whose purpose is to handle exactly this sort of problem.
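
To make the trueWord point concrete, here is a minimal sketch of a whole-word search that leaves the word-boundary decisions to the engine. It assumes v7.0 or later; the handler name trueWordHits and its parameters are made up for illustration and are not part of Peter's library or of Paul's request. Note that it returns trueWord numbers rather than character offsets, so it is not a drop-in replacement for the handler below.

```livecode
-- Hypothetical helper, for illustration only (requires LiveCode 7.0 or later).
-- Returns a comma-delimited list of the trueWord numbers in pText that
-- match pWord, or empty if there are none.
function trueWordHits pWord, pText
   local tIndex, tHits
   put 0 into tIndex
   repeat for each trueWord tWord in pText
      add 1 to tIndex
      -- the engine has already split on white space and punctuation,
      -- so a plain comparison is all that is needed here
      if tWord is pWord then put tIndex & comma after tHits
   end repeat
   delete the last char of tHits -- drop the trailing comma, if any
   return tHits
end trueWordHits
```

Because the trueWord chunk relies on the engine's Unicode word-breaking rules, there is no hand-maintained list of white space or punctuation characters to keep current for each language.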

V6 and v7 have been identified as approaching EOL as soon as v8.0 goes final. All serious apps I work on here are being developed in v8, shipping for now with either v6.x or v7.x as needed depending on the specifics of the app at hand. But the moment v8.0 goes final I'll be able to have confidence that it'll do what I need, because I've already run my work through this new engine and submitted bug reports that have since been addressed.

Waiting to run my work in v8.0 until after v8.0 Stable is released would only increase the chances that some uncommon thing my app depends on has a regression I didn't catch when I had the chance, pushing back my own time-to-market while I wait for a v8.1.

With more than 2500 bug fixes and enhancements between v6.0 and v8.0, there's plenty there to keep me motivated about the upgrade.

--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 ____________________________________________________________________
 ambassa...@fourthworld.com                http://www.FourthWorld.com


Peter M. Brigham wrote:

On Mar 17, 2016, at 8:20 AM, David Bovill wrote:

Hi Peter, any chance of sharing it?

Sure. Below is the offsets function that returns all the offsets of a string in 
a container. Then all you have to do is something like this:

function getStringChunks pSearchStr,pText,beginsWholeWord,endsWholeWord
   if beginsWholeWord = empty then put false into beginsWholeWord
   if endsWholeWord = empty then put false into endsWholeWord
   -- default to simple offsets, not whole word offsets
   put offsets(pSearchStr,pText) into offSts
   if offSts = 0 then return empty -- offsets() returns 0 when there is no match
   replace comma with cr in offSts
   put len(pSearchStr) into strLen
   put cr & space & tab & " " into wSpace
   -- include non-breaking space
   repeat for each line i in offSts
      put char i-1 of pText into charBefore
      put char i+strLen of pText into charAfter
      if beginsWholeWord and not (charBefore is in wSpace) then next repeat
      if endsWholeWord and not (charAfter is in wSpace) then next repeat
      put i & comma & (i+strLen-1) & cr after outList
   end repeat
   return line 1 to -1 of outList
end getStringChunks

Pass beginsWholeWord = true and endsWholeWord = true for wholeMatches.
Might not be really fast for pText of 100K+ characters, but should be quite 
efficient on smaller texts. Often LC's chunking functions are faster than regex 
anyway.

---------

function offsets str, pContainer
   -- returns a comma-delimited list of all the offsets of str in pContainer
   -- returns 0 if not found
   -- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
   --     ie, overlapping offsets are not counted
   -- note: to get the last occurrence of a string in a container (often useful)
   --     use "item -1 of offsets(...)"

   if str is not in pContainer then return 0
   put 0 into startPoint
   repeat
      put offset(str,pContainer,startPoint) into thisOffset
      if thisOffset = 0 then exit repeat
      add thisOffset to startPoint
      put startPoint & comma after offsetList
      add length(str)-1 to startPoint
   end repeat
   return item 1 to -1 of offsetList -- delete trailing comma
end offsets

-- Peter

Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig

