Re: Small regex project for pay [CLOSED]

Richard Gaskin Sat, 19 Mar 2016 17:45:51 -0700

That's a handy handler, Peter, but I think it would need to be enhanced to accommodate Paul's request here, as his algo needs to account for not only white space but also punctuation. Trickier, it needs to accommodate punctuation across multiple languages, so the range of characters to be checked could be potentially quite lengthy and perhaps difficult to anticipate for all possible use cases.

Personally, I wouldn't bother with any language-parsing tasks in anything prior to v7.0, given the power of trueWord. As Mark Waddingham has noted here, most of the increase in the engine size between v6 and v7 is Unicode libraries and tables whose purpose is to handle exactly this sort of problem.
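
As a rough sketch of what that looks like in v7.0 or later, trueWord chunks leave the word-boundary decisions to the engine's Unicode rules, so there is no hand-built list of white space and punctuation characters to maintain. The handler name countWholeWord is only illustrative and not something from this thread, and the sketch handles a single-word search term only:

```livecode
-- Hypothetical helper, LiveCode 7.0 or later only (trueWord is not available in v6).
-- Counts how many trueWords in pText are equal to pWord.
function countWholeWord pWord, pText
   local tCount
   put 0 into tCount
   repeat for each trueWord tWord in pText
      -- "is" follows the caseSensitive property, which is false by default
      if tWord is pWord then add 1 to tCount
   end repeat
   return tCount
end countWholeWord
```

Returning character offsets rather than a count, or matching multi-word phrases, would need more bookkeeping than this shows.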

V6 and v7 have been identified as slated for EOL as soon as v8.0 goes final. All serious apps I work on here are being developed in v8, shipping for now with either v6.x or v7.x as needed depending on the specifics of the app at hand. But the moment v8.0 goes final I'll be able to have confidence that it'll do what I need, because I've already run my work through this new engine and have already submitted bug reports that have already been addressed.

Waiting to run my work in v8.0 until after v8.0 Stable is released would only increase my chances that some uncommon thing my app depends on has met with a regression I didn't identify when I had the chance, pushing back my own time-to-market by having to wait for a v8.1.

With more than 2500 bug fixes and enhancements between v6.0 and v8.0, there's plenty there to keep me motivated about the upgrade.


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 ____________________________________________________________________
 ambassa...@fourthworld.com                http://www.FourthWorld.com


Peter M. Brigham wrote:

On Mar 17, 2016, at 8:20 AM, David Bovill wrote:

Hi Peter, any chance of sharing it?


Sure. Below is the offsets function that returns all the offsets of a string in 
a container. Then all you have to do is something like this:

function getStringChunks pSearchStr,pText,beginsWholeWord,endsWholeWord
   if beginsWholeWord = empty then put false into beginsWholeWord
   if endsWholeWord = empty then put false into endsWholeWord
   -- default to simple offsets, not whole word offsets
   put offsets(pSearchStr,pText) into offSts
   if offSts = 0 then return empty -- offsets() returns 0 when nothing is found
   replace comma with cr in offSts
   put len(pSearchStr) into strLen
   put cr & space & tab & " " into wSpace
   -- include non-breaking space
   repeat for each line i in offSts
      put char i-1 of pText into charBefore
      put char i+strLen of pText into charAfter
      if beginsWholeWord and not (charBefore is in wSpace) then next repeat
      if endsWholeWord and not (charAfter is in wSpace) then next repeat
      put i & comma & (i+strLen-1) & cr after outList
   end repeat
   return line 1 to -1 of outList
end getStringChunks

Pass beginsWholeWord = true and endsWholeWord = true for wholeMatches.
Might not be really fast for pText of 100K+ characters, but should be quite 
efficient on smaller texts. Often LC's chunking functions are faster than regex 
anyway.
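
A call might look something like the sketch below. The field name "Source", the search string and the highlight color are all made up for illustration, and the handler assumes getStringChunks and offsets are available in the message path:

```livecode
on mouseUp
   -- field "Source" is a hypothetical field holding the text to search
   put the text of field "Source" into tText
   -- nothing to do if the string does not occur at all
   if "cat" is not in tText then exit mouseUp
   put getStringChunks("cat", tText, true, true) into tChunks
   repeat for each line tChunk in tChunks
      -- each line of the result is "startChar,endChar"
      put item 1 of tChunk into tStart
      put item 2 of tChunk into tEnd
      set the backgroundColor of char tStart to tEnd of field "Source" to "yellow"
   end repeat
end mouseUp
```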

---------

function offsets str, pContainer
   -- returns a comma-delimited list of all the offsets of str in pContainer
   -- returns 0 if not found
   -- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
   --     ie, overlapping offsets are not counted
   -- note: to get the last occurrence of a string in a container (often useful)
   --     use "item -1 of offsets(...)"

   if str is not in pContainer then return 0
   put 0 into startPoint
   repeat
      put offset(str,pContainer,startPoint) into thisOffset
      if thisOffset = 0 then exit repeat
      add thisOffset to startPoint
      put startPoint & comma after offsetList
      add length(str)-1 to startPoint
   end repeat
   return item 1 to -1 of offsetList -- delete trailing comma
end offsets

-- Peter

Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig


