RE: cherry picking nodes [LONG and NOT urgent]

Swanson, Brion Fri, 20 Jul 2001 08:40:37 -0700
On a personal note, from what I've heard about DTM, it's considerably
lighter-weight than the current DOM implementation that Xerces1 uses.  If
that is indeed the case, then it would be preferable to have that smaller
implementation for the processing I do.  I deal with literally millions of
records daily/weekly/monthly that are at some point turned into a DOM tree
to be transformed and modified before being serialized into a file
repository.  If the DOM trees did not take up as much memory as they do now,
some of the problems we've had in the past with memory would be alleviated
to a point.  Also, I imagine simple traversals and DOM operations would
get somewhat faster with this change.

I'd be interested in contributing, but am not sure that I have the
appropriate level of experience and/or time available to commit at this
point.  I also have to do quite a bit more research into DTM to understand
it more fully....right now it's mostly a buzzword that I've heard good
things about. :-)

I wonder how many other people are interested in this particular aspect of
Xerces2?

Regards,
Brion Swanson

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 20, 2001 11:53 AM
To: [EMAIL PROTECTED]
Subject: RE: cherry picking nodes [LONG and NOT urgent]


Hi Brion,

Obviously one of the factors that will influence whether the DTM makes its
way into Xerces2 is user demand.  How many folks out there would like to
see this?  And perhaps more importantly:  How many folks are willing to
help make it happen?

I don't recall this question being discussed very much on the list before.
My personal suspicion is that there's more need for schema support in
Xerces2, but if there are people all fired up to crank out DTM code, who
knows what might happen?  There are also some questions I have about the
DTM's intimate relationship with XPath--which we also don't currently
support--but this all certainly merits discussion.

Cheers,
Neil

Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  416-448-3519, T/L 778-3519
E-mail:  [EMAIL PROTECTED]



"Swanson, Brion" <[EMAIL PROTECTED]> on 07/20/2001 09:55:26 AM

Please respond to [EMAIL PROTECTED]

To:   "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
cc:
Subject:  RE: cherry picking nodes [LONG and NOT urgent]


I think I heard (a rumor?) that Xerces2 will use a lighter-weight DOM (based
on, or using, DTM?) versus the larger, clunkier DOM that Xerces1 uses.  And
then again, I might just be talking out of wishful thinking.

Could someone confirm or refute my statement?

Brion Swanson

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 20, 2001 3:41 AM
To: [EMAIL PROTECTED]
Subject: RE: cherry picking nodes [LONG and NOT urgent]



Thanks for your reply, it always feels good to know there is life out
there.

Anyway, reading your solution, and re-reading my question again, I think I
am chasing a wild goose.

Your solution is similar to mine, tho you use getElementsByTagName() where I
use XPath (I do have a good reason for that).

In fact I think what I want is to be able to apply XPath to something
lighter (in bytecode) than a DOM, if there is such a thing. But I guess the
complexity of XPath queries requires something like the DOM. Another
hypothetical solution would be the ability to pre-serialize the Nodes I am
happy to ignore, assuming that a serialized Node takes up less memory than
the Node itself.

Maybe I ought to look into the code itself to find out what exactly is a
DOM document (but I am too new to Java and doubt I could make sense of it)

Is it a heavy-duty list of heavy-weight Nodes that contain their
attributes and values, plus heavy-duty list objects for their children and
sibling Nodes?
Or is it a light-weight linked list of light-weight Nodes that only point
to a single common binary representation of the XML?
Or something else?

Does anyone know of a good resource about these deep meaningful questions?

Thanks for your comments.




"Swanson, Brion" <[EMAIL PROTECTED]> on 19/07/2001 20:53:02

Please respond to [EMAIL PROTECTED]

To:   "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
cc:
Subject:  RE: cherry picking nodes [LONG and NOT urgent]


I'm not sure my solution is exactly what you're looking for, but at least it
SOUNDS different from what you're currently doing.

First of all, you still need to parse the XML document into a DOM (if you're
getting the NodeSet from a file, otherwise you can probably just pass in an
already-built DOM tree).

Second, use org.w3c.dom.Document.getElementsByTagName(String name) to
get a NodeList of all of your 'target' nodes.  In this way, you've
selected exactly those nodes that you wanted (and each of them knows about
its parent and children).  If you need sibling information, simply get the
node's parent and traverse its children.
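
The sibling lookup described above could be sketched roughly as follows.
This is only an illustration, not code from the thread: the class name
SiblingSketch and both helper methods are made up for the example, and it
assumes a JAXP parser (javax.xml.parsers) is available.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SiblingSketch {

    // Collects the element siblings of a node by walking its parent's children.
    static List<Element> elementSiblings(Node node) {
        List<Element> siblings = new ArrayList<>();
        Node parent = node.getParentNode();
        if (parent == null) {
            return siblings; // detached node: no siblings to report
        }
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            // Skip the node itself, plus text and comment nodes.
            if (!child.isSameNode(node) && child.getNodeType() == Node.ELEMENT_NODE) {
                siblings.add((Element) child);
            }
        }
        return siblings;
    }

    // Hypothetical helper: comma-separated tag names of the siblings of the
    // first <target> element found in the given XML text.
    static String siblingNamesOfFirstTarget(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node target = doc.getElementsByTagName("target").item(0);
            StringBuilder names = new StringBuilder();
            for (Element sibling : elementSiblings(target)) {
                if (names.length() > 0) {
                    names.append(',');
                }
                names.append(sibling.getTagName());
            }
            return names.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                siblingNamesOfFirstTarget("<B><C/><target id=\"one\"/><D><E/></D></B>"));
    }
}
```

Walking getFirstChild()/getNextSibling() avoids building a NodeList for
the parent, which is a design choice of this sketch rather than a
requirement.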

This way, it saves you having to hardcode (or to know in any manner) the
XPath of the target node beyond its name.  It also returns you a nice neat
list of nodes that you can convert into a hashtable if you want for fast
lookup.

The changes you make to those nodes are 'live' changes, meaning you are
changing the actual DOM tree: the NodeList hands back references to the
tree's own nodes, not copies of them.
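
A small way to see the 'live' behaviour for yourself is sketched here.
The class and method names are invented for the illustration, and the
input document is just a toy; the point is only that an edit made through
a NodeList entry is visible when the Document is queried again.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LiveNodeSketch {

    // Changes the id of the first <target> through a NodeList entry, then
    // reads it back with a fresh lookup on the Document.
    static String changeFirstTargetId(String xml, String newId) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList targets = doc.getElementsByTagName("target");
            Element first = (Element) targets.item(0);
            first.setAttribute("id", newId); // edits the node inside the tree

            // Second, independent lookup: it sees the modified attribute.
            Element lookedUpAgain = (Element) doc.getElementsByTagName("target").item(0);
            return lookedUpAgain.getAttribute("id");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints "changed"
        System.out.println(changeFirstTargetId("<A><target id=\"one\"/></A>", "changed"));
    }
}
```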

Finally, when you're ready to write it all out, you simply have to get the
document element (if you haven't already) and serialize it!  Voila!

A code snippet might look similar to the following:

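  import java.util.Enumeration;
  import java.util.Hashtable;
  import org.w3c.dom.Element;
  import org.w3c.dom.Node;
  import org.w3c.dom.NodeList;

  // myDocument is an org.w3c.dom.Document that has already been parsed.
  NodeList targetNodes = myDocument.getElementsByTagName("target");
  Hashtable nodeTable = new Hashtable();
  for (int i = 0; i < targetNodes.getLength(); i++) {
    Node target = targetNodes.item(i);
    String id = ((Element) target).getAttribute("id");

    // Key each target node by its id attribute for fast lookup later.
    nodeTable.put(id, target);
  }

  // Visit every stored target node.
  for (Enumeration keys = nodeTable.keys(); keys.hasMoreElements();) {
    Node currentNode = (Node) nodeTable.get(keys.nextElement());
    // ... do something with this target node ...
  }

  // ... now we're done, serialize the Document ...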
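
The final "serialize the Document" step is not spelled out above. One
possible way to do it, assuming the standard JAXP identity Transformer
(javax.xml.transform) rather than any Xerces-specific serializer, is
sketched below. SerializeSketch, toXml and roundTrip are names made up
for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SerializeSketch {

    // Writes a DOM node (a whole Document or a single subtree) out as text
    // using an identity transform.
    static String toXml(Node node) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> declaration so subtrees can be embedded.
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: parse XML text into a DOM and serialize it again.
    static String roundTrip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return toXml(doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<A><target id=\"one\"/><E><F/></E></A>"));
    }
}
```

To write to a file instead of a String, the StreamResult can wrap a
java.io.File or an OutputStream.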

Hope this helps!

Brion Swanson

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 18, 2001 5:30 AM
To: [EMAIL PROTECTED]
Subject: cherry picking nodes [LONG and NOT urgent]



Hiya,

Can anyone think of a nice alternative to my newbie solution to the
following problem? All comments will be greatly appreciated.

Starting from something like :

<A>
  <B>
    <C/>
    <target id="one"/>
    <D>
      <E/>
      <target id="two"/>
    </D>
  </B>
  <target id="three"/>
  <E>
    <F>
      <G>...</G>
    </F>
  </E>
</A>

I want to build some sort of hash table {"one": node1, "two":node2,
"three": node3} as soon as possible. I really don't care about other nodes,
but I'd like the resulting object to be as light weight as possible. The
aim is to come back later and replace the <target> nodes with specific
data, without having to traverse the DOM (or whatever memory representation
of the parsed xml) again.

Right now, I have taken the heavy weight approach:

1) Parse the document as a DOM.
2) Find my <target> nodes (using XPath) and add them to my map.
3) Use the map to set up the content of the target nodes.
4) Keep the (DOM+map) object around for writing to file, or further use of
the map.
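
For readers who want to see steps 1 to 3 in code, here is a rough sketch.
It is not the poster's actual code: it assumes the javax.xml.xpath API
(which postdates this thread), the expression //target[@id] is a guess at
the query, and filling a target with text content stands in for whatever
"specific data" is really inserted. All class and method names are
hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TargetMapSketch {

    // Step 1: parse the text source into a DOM.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 2: find the <target> elements with XPath and index them by id.
    static Map<String, Element> indexTargets(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    "//target[@id]", doc, XPathConstants.NODESET);
            Map<String, Element> byId = new LinkedHashMap<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Element target = (Element) hits.item(i);
                byId.put(target.getAttribute("id"), target);
            }
            return byId;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 3: use the map to set content, with no further tree traversal.
    // Returns false when no target with that id was indexed.
    static boolean fill(Map<String, Element> byId, String id, String text) {
        Element target = byId.get(id);
        if (target == null) {
            return false;
        }
        target.setTextContent(text);
        return true;
    }

    // Hypothetical convenience for trying the sketch out.
    static int countTargets(String xml) {
        return indexTargets(parse(xml)).size();
    }

    public static void main(String[] args) {
        Document doc = parse("<A><B><target id=\"one\"/></B><target id=\"three\"/></A>");
        Map<String, Element> byId = indexTargets(doc);
        fill(byId, "one", "some data");
        System.out.println(byId.keySet());
    }
}
```

Note that the map holds references into the same DOM, so it does not by
itself make the in-memory representation any lighter.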

The serialized object (DOM+hashmap) is about 14 KB for a 4 KB XML source.
Considering that most of this is information about nodes I don't care about
(tho I can't discard them because I need them to create the final xml
document), I am looking for an alternative approach, using a lighter
representation of the DOM.

A good example is the <E> node. Once I know that it doesn't contain any
targets, I don't need to know about its children or siblings. In fact, to
make later serialization (to a string) faster, I'd like to keep it as a
string.

But, I do care about the children or siblings of the <target> nodes,
because I do some processing on them (checking out attributes and cloning
nodes for example) before putting them in my map.

Finally I do not want (not that I have the ability to anyway) to reinvent
the wheel, therefore I do not really fancy using SAX to build my own
personal DOM.

Conclusion, here is my wish list:

1) Need to start from a text source.
2) Need an XPath-like way to find my nodes
3) Need something very similar to org.w3c.dom.Node for my target nodes
4) Need something very light, String-like, for all other nodes.
5) Need to be able to serialize the result back to text.

Any thoughts?


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
