[R] Parsing very large xml datafiles with SAX: How to profile anonymous functions?

2012-10-26 Thread Frederic Fournier
Hello everyone,

I'm trying to parse a very large XML file using SAX with the XML package
(i.e., mainly the xmlEventParse function). This function takes as an
argument a list of other functions (handlers) that will be called to handle
particular XML nodes.

When I use Rprof(), all the handler functions are lumped together under
the anonymous label, and I get something like this:

$by.total
                         total.time total.pct self.time self.pct
system.time                  151.22     99.99      0.00     0.00
MyParsingFunction            149.38     98.77      0.00     0.00
xmlEventParse                149.38     98.77      0.00     0.00
.Call                        149.32     98.73      3.04     2.01
Anonymous                    146.74     97.02    141.26    93.40  <--- !!
xmlValue                       3.04      2.01      0.46     0.30
xmlValue.XMLInternalNode       2.58      1.71      0.14     0.09
standardGeneric                2.12      1.40      0.50     0.33
gc                             1.86      1.23      1.86     1.23
...


Is there a way to make Rprof() identify the different handler functions, so
I can know which one might be a bottleneck? Is there another profiling tool
that would be more appropriate in a case like this?

Thank you very much for your help!

Frederic


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Parsing very large xml datafiles with SAX: How to profile anonymous functions?

2012-10-26 Thread Duncan Temple Lang
Hi Frederic

Perhaps the simplest way to profile the individual handlers is to write
each one as a regular named function, i.e. assigned to a variable in your
workspace (or in a function body), and then to write the handler functions
you pass to the parser as wrappers that call these by name:

  startElement = function(name, attr, ...) {
    # code you want to run when we encounter the start of an XML element
  }

  myText = function(...) {
    # code
  }

Now, when calling xmlEventParse():

  xmlEventParse(filename,
                handlers = list(.startElement = function(...) startElement(...),
                                .text = function(...) myText(...)))

Then the profiler will see the calls to startElement and myText.

There is a small overhead from the extra layer of function calls, but you
will get the profile information.
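
For anyone who wants to try this end to end, here is a minimal sketch of the wrapper approach run under Rprof(). It assumes the XML package is installed and that R was built with profiling support. The `counts` environment, the generated `doc` string, and `n_items` are made up for illustration and stand in for the real file and handler logic.

```r
library(XML)  # assumption: the XML package is installed

# Hypothetical counters, kept in an environment so the handlers can update them.
counts <- new.env()
counts$elements <- 0L
counts$text_calls <- 0L

# Named handlers: these are the names that should show up in the profile.
startElement <- function(name, attrs, ...) {
  counts$elements <- counts$elements + 1L
}
myText <- function(...) {
  counts$text_calls <- counts$text_calls + 1L
}

# Generated stand-in for the large XML file (illustrative only).
n_items <- 20000L
doc <- paste0("<doc>",
              paste(rep("<item>alpha</item>", n_items), collapse = ""),
              "</doc>")

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)
invisible(xmlEventParse(doc, asText = TRUE,
                        handlers = list(.startElement = function(...) startElement(...),
                                        .text = function(...) myText(...))))
Rprof(NULL)

# The first line of the profile file is a header; more lines mean samples exist.
if (length(readLines(prof_file)) > 1L) {
  print(head(summaryRprof(prof_file)$by.total, 10))
}
```

With a document this small the profiler may collect few or no samples, which is why the summary is guarded. On a real file the named handlers should appear as separate rows alongside the anonymous wrappers.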

  D.


