Thanks Scott & Dave.  It is great to be able to get insight and pointers
to direct my investigation so quickly!

Would it be correct to conclude that mergesort: 1) theoretically
requires linear space for the merge step that likely is prohibitive for
large XML documents (assuming linked lists / similar are not involved,
which eliminate the memory requirements); and 2) practically may require
quadractic space for the merge step due to successive merge steps not
re-using memory space allocated to prior merge steps (prior merge-step's
space is left for garbage collection -- this is obviously
implementation-specific)?

Scott/Dave/anyone, what do you think?

More practically: "Large" (XML documents in #1 above) can be a relative
term based on real and/or virtual memory available to the JVM; based on
our results I'd say a document is large if it's in-memory footprint is
above 1/4 the size of the JVM heap (default 64M in 1.2.2).  Obviously
DOM-based representations suffer hugely here.  In my example, 20,000 *
30 = 600,000 objects and with 250bytes each that's approx 143MB.  With a
1GB heap, a mergesort which just required linear space would potentially
take more than 1/4 of heap.  If my conclusion above as to practical
memory usage (need to investigate the NodeSorter class you mentioned) is
correct then the memory needed for this operation would greatly exceed
the heap.

Any comments on this welcome as well.  I'll post what I find in
NodeSorter.

Thanks,
Martin

On Fri, 2003-03-21 at 22:10, David N Bertoni/Cambridge/IBM wrote:
> 
> 
> 
> 
> Just thought I'd chime in.  One of my experiences in doing the sorting work
> in Xalan-C was, that for large data sets, the cost of caching could start
> to overwhelm the comparison cost.  As a result, I left some conditional
> code in that disabled caching, which helped for large documents where the
> sort key values were large.
> 
> Dave
> 
> 
> 
>                                                                               
>                                            
>                       Scott                                                   
>                                            
>                       Boag/Cambridge/I         To:      Martin Dengler 
> <[EMAIL PROTECTED]>                                
>                       BM                       cc:      [EMAIL PROTECTED], 
> (bcc: David N                      
>                       <[EMAIL PROTECTED]         Bertoni/Cambridge/IBM)       
>                                              
>                       bm.com>                  Subject: Re: Xalan sort 
> algorithm performance?                            
>                                                                               
>                                            
>                       03/21/2003 12:10                                        
>                                            
>                       PM                                                      
>                                            
>                                                                               
>                                            
> 
> 
> 
> 
> It's a mergesort (see Robert Sedgewick's Algorithms book).  An issue may be
> the match-and-select it has to do, though these should be cached so that
> once the match-and-select is done, only the comparison need be made.  At
> one time I am under the impression that our sorting was comparatively
> pretty good.  But it's been I while since I've been in that code.
> 
> You might want to root around in the
> org.apache.xalan.transformer.NodeSorter code, or do a line profile of that
> class.
> 
> -scott
> 
> Martin Dengler <[EMAIL PROTECTED]> wrote on 03/20/2003 04:17:27 PM:
> 
> > Hi,
> >
> >    Please yell if this is an FAQ or my googling skills need help.
> >
> >    What sorting algorithm does Xalan use -- and what is the algorithm(s)
> > runtime and memory footprint in big-O notation?
> >
> >    I am using Xalan to sort large XML documents: approx 3 levels deep of
> > a tree with 2-20,000 first child (attached to root) nodes and 30
> > grandchildren per child node.  Sorting is done on an attribute of the
> > grandchild nodes.  The performance -- especially in memory usage -- is
> > abysmal: 10-60 seconds sort time, memory usage is at least O(n^2).
> >
> >    I've read the w3c performance pointers, but cutting the document down
> > and doing fewer transforms really doesn't cut it (neither are applicable
> > to a one-pass sort of a large document).
> >
> >    Are there any ways to change the sort algortihm's memory/runtime
> > behavior?
> >
> > Thanks,
> > Martin
> >
> > PS -- test specs: Xerces, Xalan (latest), Solaris 2.8, JVM 1.2.2_07, Sun
> > E450, 2GB real mem, 1GB JVM heap, 2x450mhz processors.
> > [attachment "signature.asc" deleted by Scott Boag/Cambridge/IBM]

Attachment: signature.asc
Description: This is a digitally signed message part

Reply via email to