Thanks Scott & Dave. It is great to be able to get insight and pointers to direct my investigation so quickly!
Would it be correct to conclude that mergesort: 1) theoretically requires linear space for the merge step, which is likely prohibitive for large XML documents (assuming linked lists or similar structures, which eliminate that memory requirement, are not involved); and 2) practically may require quadratic space for the merge step if successive merge steps do not reuse the memory allocated by prior merge steps (each prior step's buffer being left for garbage collection -- this is obviously implementation-specific)? Scott/Dave/anyone, what do you think?

More practically: "large" (for the XML documents in #1 above) can be a relative term based on the real and/or virtual memory available to the JVM; based on our results, I'd say a document is large if its in-memory footprint is above 1/4 the size of the JVM heap (default 64M in 1.2.2). Obviously DOM-based representations suffer hugely here. In my example, 20,000 * 30 = 600,000 objects, and at 250 bytes each that's approx 143MB. With a 1GB heap, even a mergesort requiring only linear space could potentially take more than 1/4 of the heap. If my conclusion above about practical memory usage is correct (I need to investigate the NodeSorter class you mentioned), then the memory needed for this operation would greatly exceed the heap.

Any comments on this are welcome as well. I'll post what I find in NodeSorter.

Thanks,
Martin

On Fri, 2003-03-21 at 22:10, David N Bertoni/Cambridge/IBM wrote:
> Just thought I'd chime in. One of my experiences in doing the sorting work
> in Xalan-C was that, for large data sets, the cost of caching could start
> to overwhelm the comparison cost. As a result, I left some conditional
> code in that disabled caching, which helped for large documents where the
> sort key values were large.
>
> Dave
>
> Scott Boag/Cambridge/IBM
> To: Martin Dengler <[EMAIL PROTECTED]>
> cc: [EMAIL PROTECTED] (bcc: David N Bertoni/Cambridge/IBM)
> Subject: Re: Xalan sort algorithm performance?
> 03/21/2003 12:10 PM
>
> It's a mergesort (see Robert Sedgewick's Algorithms book). An issue may be
> the match-and-select it has to do, though these should be cached so that
> once the match-and-select is done, only the comparison need be made. At
> one time I was under the impression that our sorting was comparatively
> pretty good. But it's been a while since I've been in that code.
>
> You might want to root around in the
> org.apache.xalan.transformer.NodeSorter code, or do a line profile of that
> class.
>
> -scott
>
> Martin Dengler <[EMAIL PROTECTED]> wrote on 03/20/2003 04:17:27 PM:
>
> > Hi,
> >
> > Please yell if this is an FAQ or my googling skills need help.
> >
> > What sorting algorithm does Xalan use -- and what are the algorithm's
> > runtime and memory footprint in big-O notation?
> >
> > I am using Xalan to sort large XML documents: approx 3 levels deep of
> > a tree with 2-20,000 first child (attached to root) nodes and 30
> > grandchildren per child node. Sorting is done on an attribute of the
> > grandchild nodes. The performance -- especially in memory usage -- is
> > abysmal: 10-60 seconds sort time, and memory usage is at least O(n^2).
> >
> > I've read the w3c performance pointers, but cutting the document down
> > and doing fewer transforms really doesn't cut it (neither is applicable
> > to a one-pass sort of a large document).
> >
> > Are there any ways to change the sort algorithm's memory/runtime
> > behavior?
> >
> > Thanks,
> > Martin
> >
> > PS -- test specs: Xerces, Xalan (latest), Solaris 2.8, JVM 1.2.2_07, Sun
> > E450, 2GB real mem, 1GB JVM heap, 2x450MHz processors.
>
> [attachment "signature.asc" deleted by Scott Boag/Cambridge/IBM]
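P.S. To make the space question concrete, here's a rough sketch (plain Java sorting ints; this is NOT Xalan's actual NodeSorter, which I still need to read) of a top-down mergesort that preallocates a single auxiliary buffer and reuses it for every merge, so the extra space stays at O(n). A variant that allocated a fresh buffer inside each merge() call would instead churn on the order of n log n elements of garbage across all passes -- less than the quadratic figure I guessed above, but on a 1.2.2-era collector that churn could still hurt badly.

```java
import java.util.Arrays;

// Hypothetical sketch, not Xalan code: mergesort with ONE reusable
// auxiliary buffer, so auxiliary space is O(n) for the whole sort.
public class MergeSortSketch {

    public static void sort(int[] a) {
        int[] aux = new int[a.length];   // single buffer, reused by every merge
        sort(a, aux, 0, a.length - 1);
    }

    private static void sort(int[] a, int[] aux, int lo, int hi) {
        if (lo >= hi) return;
        int mid = lo + (hi - lo) / 2;
        sort(a, aux, lo, mid);
        sort(a, aux, mid + 1, hi);
        merge(a, aux, lo, mid, hi);
    }

    // Copies the range into aux once, then merges the two halves back.
    // Allocating "int[] aux = new int[hi - lo + 1]" HERE instead is the
    // per-merge-step allocation pattern discussed above.
    private static void merge(int[] a, int[] aux, int lo, int mid, int hi) {
        System.arraycopy(a, lo, aux, lo, hi - lo + 1);
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if (i > mid)              a[k] = aux[j++];
            else if (j > hi)          a[k] = aux[i++];
            else if (aux[j] < aux[i]) a[k] = aux[j++];
            else                      a[k] = aux[i++];
        }
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 1, 4, 2};
        sort(data);
        System.out.println(Arrays.toString(data));  // [1, 2, 3, 4, 5]
    }
}
```

If NodeSorter follows the per-merge-allocation pattern, that could explain a lot of the footprint we're seeing; I'll report back once I've read the code.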
