Thanks Mark,

Your suggestion would not help for my application: the offset function is no faster than counting returns (a line chunk expression) to get to the proper line. I have used it before, though, for parsing a large USDA food nutrition database, and it helped a lot. It does give me some more ideas about a hybrid approach that could speed things up a bit. I know exactly which line and item the data I want is in, and it is always the next one. I might be able to live with the chunk specification for the line number, then use a repeat for each item and put 2500 items in an array. That way I will only need 2500 array items at any one time instead of 125,000,000 array items per data file. But I will still have to put 125,000,000 items into array elements and then read them back out again once per data pass. That is perhaps 10-100 times slower than an "access" keyword, instead of 1000-10,000 times slower. I will do some sample tests and see what I come up with.
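
A minimal sketch of the hybrid approach described above, assuming each line of the data holds the comma-delimited items for one pass (2500 per line is an assumption). The function name itemsOfLine and all variable names are hypothetical, and the data is passed by reference (@) only to avoid copying a very large string:

```
-- Hypothetical sketch: fetch one line with a chunk expression, then
-- walk its items sequentially with "repeat for each".
function itemsOfLine @pData, pLineNum
  local tLine, tItems, tCount
  -- the line chunk still counts returns from the top of pData
  put line pLineNum of pData into tLine
  put 0 into tCount
  repeat for each item tValue in tLine
    add 1 to tCount
    put tValue into tItems[tCount]
  end repeat
  return tItems -- array keyed 1..tCount
end itemsOfLine
```

Only one line's worth of items is held in the array at a time, but every item is still copied into an array element once per pass, which is the cost noted above.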

Dennis

On Apr 12, 2005, at 7:29 PM, Mark Brownell wrote:

Hi Dennis,

I have found that large data files can be broken down into smaller objects using simplified XML, where access is obtained using a pull-parser. Unlike the XML parser in Revolution, a very fast pull-parser can break down objects and parse out specific fields without ever building a full parse in the more traditional form that standard parsers use. So if you can break down your data and transform it into simple element-type XML, you might be able to create a system that can find information in large data objects.

I once asked the creators of Rev to add or create a faster pull-parser. They came up with something that would improve on my Transcript-based pull-parser by about 20%. I found out that all I needed to do was lock the screen before parsing my files and unlock it when I was done in order to get the speeds I was looking for. In other words, what I did in Transcript was very fast for a pull-parser written natively in Transcript.
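
As a rough illustration of the lock screen point, a parsing call can be wrapped as below. This is only a sketch: the wrapper name parseRecords is hypothetical, and it calls the getElementsArray function posted further down in this message.

```
-- Hypothetical wrapper: suspend screen updates while parsing.
function parseRecords pData
  local tRecords
  lock screen
  put getElementsArray("<record>", "</record>", pData) into tRecords
  unlock screen
  return tRecords
end parseRecords
```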

Here it is one more time:

HTH,

Mark

==================

-- put getElementsArray("<record>", "</record>", tZap) into theArray
function getElementsArray tStartTag, tEndTag, StringToSearch
  put empty into tArray
  put 0 into tStart1 -- absolute position of the last start tag found
  put 0 into tStart2 -- absolute position of the last end tag found
  put 1 into tElementNum
  put the number of chars in tStartTag into dChars
  repeat
    -- offset() returns a position relative to the chars skipped
    put offset(tStartTag,StringToSearch,tStart1) into tNum1
    put (tNum1 + tStart1) into tStart1
    if tNum1 < 1 then exit repeat
    put offset(tEndTag,StringToSearch,tStart2) into tNum2
    put (tNum2 + tStart2) into tStart2
    if tNum2 < 1 then exit repeat
    --if tNum2 < tNum1 then exit repeat
    -- copy the text between the two tags
    put char (tStart1 + dChars) to (tStart2 - 1) of StringToSearch into zapped
    put zapped into tArray[tElementNum]
    add 1 to tElementNum
  end repeat
  return tArray
end getElementsArray


-- put getElement("<record>", "</record>", tZap) into theElement
function getElement tStTag, tEdTag, stngToSch
  put empty into zapped
  put the number of chars in tStTag into dChars
  put offset(tStTag,stngToSch) into tNum1
  put offset(tEdTag,stngToSch) into tNum2
  -- return "error" when either tag is missing
  if tNum1 < 1 then
    return "error"
  end if
  if tNum2 < 1 then
    return "error"
  end if
  put char (tNum1 + dChars) to (tNum2 - 1) of stngToSch into zapped
  return zapped
end getElement

=================
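
A short usage sketch for the two functions above. The sample data, the <name> and <qty> tags, and the handler itself are made up for illustration; it assumes well-formed, non-nested records.

```
on mouseUp
  local tZap, tRecords, tNames, i
  -- made-up sample data with two records
  put "<record><name>apple</name><qty>3</qty></record>" & \
      "<record><name>pear</name><qty>5</qty></record>" into tZap
  put getElementsArray("<record>", "</record>", tZap) into tRecords
  repeat with i = 1 to the number of lines in the keys of tRecords
    -- pull a single field out of each record
    put getElement("<name>", "</name>", tRecords[i]) & return after tNames
  end repeat
  put tNames -- shows the collected names in the message box
end mouseUp
```

Note that getElement returns the string "error" when a tag is missing, so a caller that cares should check for that value before using the result.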


The idea is to break apart the essential functional elements of the
"repeat for each" control structure to allow more flexibility.  This sample has a
bit more refinement than what I posted yesterday in Bugzilla.

The new keyword would be "access", but could be something else.

An example of the use of the new keyword syntax would be:

access each line X in arrayX --initial setup of pointers and X value
access each item Y in arrayY --initial setup of pointers and Y value
repeat for number of lines of arrayX times --same as a repeat for each
   put X & comma & Y & return after ArrayXY --merged array
   next line X --puts the next line value in X
   next item Y --if arrayY has fewer elements than arrayX, then empty
               --is supplied; could also put "End of String" in the result
end repeat

Another advantage of this syntax is that it provides for more
flexibility in the structure of loops.  You could repeat forever, then
exit repeat when you run out of values (based on getting an empty
back).  The possibilities for high-speed sequential-access data
processing are much expanded, which opens up more possibilities for
Revolution.
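
Until something like "access" exists, the sequential behavior can be approximated in plain Transcript by keeping a character pointer and calling offset with a skip count. The sketch below is only an illustration: nextItem and its parameters are hypothetical names, both parameters are passed by reference, and it has not been timed against the alternatives discussed here.

```
-- Hypothetical helper: returns the next comma-delimited item of pList
-- that starts just after char position pPos, and advances pPos.
-- Returns empty once the list is exhausted.
function nextItem @pList, @pPos
  local tLen, tRel, tValue
  put the number of chars in pList into tLen
  if pPos >= tLen then return empty
  put offset(comma, pList, pPos) into tRel
  if tRel = 0 then
    -- no further comma: the rest of the list is the last item
    put char (pPos + 1) to tLen of pList into tValue
    put tLen into pPos
  else
    put char (pPos + 1) to (pPos + tRel - 1) of pList into tValue
    add tRel to pPos
  end if
  return tValue
end nextItem

-- usage, mirroring the merge example above:
-- put 0 into tPos
-- repeat for each line X in arrayX
--   put X & comma & nextItem(arrayY, tPos) & return after ArrayXY
-- end repeat
```

One limitation of this sketch is that an empty item in the middle of the list is indistinguishable from running out of items; the pointer would have to be compared with the length of the list to tell the two apart.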

I would love to get your feedback or other ideas about solving this
problem.

Dennis

_______________________________________________ use-revolution mailing list [email protected] http://lists.runrev.com/mailman/listinfo/use-revolution

