That's a handy handler, Peter, but I think it would need to be enhanced
to accommodate Paul's request here, as his algo needs to account for not
only white space but also punctuation. Trickier, it needs to
accommodate punctuation across multiple languages, so the range of
characters to be checked could be potentially quite lengthy and perhaps
difficult to anticipate for all possible use cases.
Personally, I wouldn't bother with any language-parsing tasks in
anything prior to v7.0, given the power of trueWord. As Mark Waddingham
has noted here, most of the increase in the engine size between v6 and
v7 is Unicode libraries and tables whose purpose is to handle exactly
this sort of problem.
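As a rough illustration of that point, here is a minimal sketch of whole-word matching with the trueWord chunk, assuming v7.0 or later. The handler name countWholeWord and its parameter names are made up for this example; only the trueWord chunk type comes from the engine, which applies Unicode word-breaking rules so the script does not have to enumerate whitespace and punctuation itself.

```livecode
-- Hypothetical helper: counts how many times pSearchStr appears as a
-- whole word in pText. Requires LiveCode 7.0 or later for trueWord.
function countWholeWord pSearchStr, pText
   local tCount
   put 0 into tCount
   repeat for each trueWord tWord in pText
      -- the comparison honors the caseSensitive property (false by default)
      if tWord is pSearchStr then add 1 to tCount
   end repeat
   return tCount
end countWholeWord
```

This only handles a single-word search string and returns a count rather than character offsets, so it is a starting point and not a drop-in replacement for the handler quoted below.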
Both v6 and v7 have been identified as reaching EOL as soon as v8.0 goes
final. All serious apps I work on here are being developed in v8,
shipping for now with either v6.x or 7.x as needed depending on the
specifics of the app at hand. But the moment v8.0 goes final I'll be
able to have confidence that it'll do what I need because I've already
run my work through this new engine and have already submitted bug
reports that have already been addressed.
Waiting to run my work in v8.0 until after v8.0 Stable is released would
only increase my chances that some uncommon thing my app depends on meets
with a regression I didn't identify when I had the chance, pushing back
my own time-to-market by having to wait for a v8.1.
With more than 2,500 bug fixes and enhancements between v6.0 and v8.0,
there's plenty there to keep me motivated about the upgrade.
--
Richard Gaskin
Fourth World Systems
Software Design and Development for the Desktop, Mobile, and the Web
____________________________________________________________________
ambassa...@fourthworld.com http://www.FourthWorld.com
Peter M. Brigham wrote:
On Mar 17, 2016, at 8:20 AM, David Bovill wrote:
Hi Peter, any chance of sharing it?
Sure. Below is the offsets function that returns all the offsets of a string in
a container. Then all you have to do is something like this:
function getStringChunks pSearchStr,pText,beginsWholeWord,endsWholeWord
   if beginsWholeWord = empty then put false into beginsWholeWord
   if endsWholeWord = empty then put false into endsWholeWord
   -- default to simple offsets, not whole word offsets
   put offsets(pSearchStr,pText) into offSts
   -- offsets() returns 0 when nothing is found; bail out rather than
   -- treating 0 as a real offset
   if offSts = 0 then return empty
   replace comma with cr in offSts
   put len(pSearchStr) into strLen
   put cr & space & tab & " " into wSpace
   -- include non-breaking space (the last quoted character is meant to
   -- be one; retype it if mail transit turned it into a plain space)
   repeat for each line i in offSts
      put char i-1 of pText into charBefore
      put char i+strLen of pText into charAfter
      if beginsWholeWord and not (charBefore is in wSpace) then next repeat
      if endsWholeWord and not (charAfter is in wSpace) then next repeat
      put i & comma & (i+strLen-1) & cr after outList
   end repeat
   return line 1 to -1 of outList -- delete trailing cr
end getStringChunks
Pass beginsWholeWord = true and endsWholeWord = true for whole-word matches.
Might not be really fast for pText of 100K+ characters, but should be quite
efficient on smaller texts. Often LC's chunking functions are faster than regex
anyway.
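A call might look like the following sketch, which assumes both getStringChunks and the offsets function from this message are in the message path. The mouseUp handler, the sample text, and the t-prefixed variable names are invented for illustration.

```livecode
on mouseUp
   local tText, tChunks, tChunk, tFound
   put "the cat sat on the catalog" into tText
   -- ask for matches that both begin and end on a word boundary
   put getStringChunks("cat", tText, true, true) into tChunks
   -- each line of tChunks is "startChar,endChar"
   repeat for each line tChunk in tChunks
      put char (item 1 of tChunk) to (item 2 of tChunk) of tText & cr after tFound
   end repeat
   answer tFound
end mouseUp
```

Because the boundary test only looks for whitespace, the "cat" inside "catalog" is skipped, but a match followed directly by punctuation would be skipped too.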
---------
function offsets str, pContainer
   -- returns a comma-delimited list of all the offsets of str in pContainer
   -- returns 0 if not found
   -- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
   --    ie, overlapping offsets are not counted
   -- note: to get the last occurrence of a string in a container (often useful)
   --    use "item -1 of offsets(...)"
   if str is not in pContainer then return 0
   put 0 into startPoint
   repeat
      put offset(str,pContainer,startPoint) into thisOffset
      if thisOffset = 0 then exit repeat
      add thisOffset to startPoint
      put startPoint & comma after offsetList
      add length(str)-1 to startPoint
   end repeat
   return item 1 to -1 of offsetList -- delete trailing comma
end offsets
-- Peter
Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig