[ http://issues.apache.org/jira/browse/XERCESJ-977?page=comments#action_54756 ]

Ed Tyrrill commented on XERCESJ-977:
------------------------------------

I ran into this same problem using the Xerces parser that is packaged with Java 5.0. I did some investigation and discovered the reason for the problem.

First, you can download an XML document and DTD that will allow you to reproduce the problem from:

ftp://ftp.avamar.com/pub/files/sun/x6.xml
ftp://ftp.avamar.com/pub/files/sun/event_catalog.dtd

All of my investigation used the code that comes with jdk1.5.0. When I compared this code to Xerces 2.6.2, the two appeared to be virtually identical. The fix is not simple because it really requires a minor design change. I'll describe in detail what is happening so that this problem may be fixed.

In com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl there are a number of two-dimensional arrays that keep track of the values and structure of the document. One of these arrays, fNodePrevSib, keeps track of the previous sibling of the current node in the tree. The problem is that the value -1 is used to indicate that there is no previous sibling, and the value -1 is also used to indicate that that index in the array is unused.

Now a little bit about the two-dimensional arrays. These arrays allocate new "chunks" as the parsing proceeds, and dereference a chunk so it can be garbage collected if it becomes empty. The NPE occurs because the code thinks a chunk in fNodePrevSib is empty, frees it, and then goes back to get more previous sibling information from it.

So why is a chunk becoming empty? The XML file we are trying to parse uses a lot of entities. When DeferredDocumentImpl finds an entity, it places the entity name in one index, places the replacement string for the entity in the next index, and then goes back and actually replaces the entity name with the replacement string. Just by chance, the entity is placed in the last index in chunk 11.

The replacement string is over 64 characters long, so it gets broken in two and is placed in the first two indexes of chunk 12. The first part of the replacement string has no previous sibling, so when it is added to chunk 12 the usage count on fNodePrevSib is not incremented. When the second part of the string is added, the usage count on fNodePrevSib becomes 1. The next thing that happens is replacing the entity with its replacement string, so the first part of the string is taken out of chunk 12. The usage count on chunk 12 of fNodePrevSib is decremented to 0, and the chunk is dereferenced (set to null). So when we go to get the second half of the string, we get the NPE trying to access the null chunk.

So the real problem is that the dual use of the -1 value causes the usage count on the chunks to get out of sync. This only ever matters when you delete enough entries to stop using an entire chunk.

You might ask where all this is happening in the code. Let me describe that now. In the appendChild() method, on line 673, getChunkIndex() is called to get the index of the previous child node. That index is -1 for the first half of the entity replacement string. That value, olast, is passed into setChunkIndex() to set the value -1. If you go down to setChunkIndex(), you will see on line 1977 that if the "value" parameter is -1, it calls clearChunkIndex() instead of storing the value. The previous sibling info for the second half is then correctly stored in the next call to appendChild().

Next, when the entity replacement is being performed, insertBefore() is called. In insertBefore(), the second call to setChunkIndex() again has a value of -1. This causes setChunkIndex() to call clearChunkIndex() again, and this time the code on line 2038 is run, causing the chunk to be set to null. Soon after that another call is made to insertBefore(), which causes the NPE.

I hope this gives you all the information you will need to resolve this issue.

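To make the failure mode concrete, here is a minimal sketch of a chunked table that uses -1 both for "unused" and for "no previous sibling". It is a simplified model written for this comment, not the Xerces source: the class and method names (ChunkedIntTable, set, clear, createChunk), the chunk size of 4, and the exact slots touched are all assumptions chosen to keep the example small.

```java
import java.util.Arrays;

/** Simplified, hypothetical model of one chunked int table; not the Xerces source. */
class ChunkedIntTable {
    static final int CHUNK_SIZE = 4;        // tiny chunk size, chosen for the demo
    final int[][] chunks = new int[4][];    // chunks[c] == null: freed or never created
    final int[] useCount = new int[4];      // slots in chunk c holding a value other than -1

    /** Allocate a chunk with every slot set to -1. */
    void createChunk(int chunk) {
        chunks[chunk] = new int[CHUNK_SIZE];
        Arrays.fill(chunks[chunk], -1);
        useCount[chunk] = 0;
    }

    /** Store a value. Storing -1 is routed to clear(), so "no previous sibling" is never counted. */
    int set(int chunk, int index, int value) {
        if (value == -1) {
            return clear(chunk, index);
        }
        int old = chunks[chunk][index];     // throws NullPointerException if the chunk was freed
        if (old == -1) {
            useCount[chunk]++;
        }
        chunks[chunk][index] = value;
        return old;
    }

    /** Reset a slot to -1 and free the chunk once no counted slot remains. */
    int clear(int chunk, int index) {
        int old = chunks[chunk] != null ? chunks[chunk][index] : -1;
        if (old != -1) {
            chunks[chunk][index] = -1;
            if (--useCount[chunk] == 0) {
                chunks[chunk] = null;       // chunk looks empty, so it is released
            }
        }
        return old;
    }
}

class ChunkSentinelDemo {
    /** Replays a sequence like the one described above; true if the last write hits a freed chunk. */
    static boolean reproduces() {
        ChunkedIntTable prevSib = new ChunkedIntTable();
        prevSib.createChunk(1);
        prevSib.set(1, 0, -1);  // first node in the chunk: no previous sibling, not counted
        prevSib.set(1, 1, 4);   // second node: previous sibling is node 4, use count becomes 1
        prevSib.set(1, 1, -1);  // relinking clears that slot: use count 0, chunk set to null
        try {
            prevSib.set(1, 0, 3);  // slot 0 still belongs to a live node, but its chunk is gone
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE reproduced: " + reproduces());
    }
}
```

The point of the sketch is that the use count only tracks slots holding something other than -1, so a chunk whose remaining nodes all happen to have no previous sibling is indistinguishable from an empty one.
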
Thanks,
Ed Tyrrill

> Null pointer exception during DOM parsing
> -----------------------------------------
>
>          Key: XERCESJ-977
>          URL: http://issues.apache.org/jira/browse/XERCESJ-977
>      Project: Xerces2-J
>         Type: Bug
>   Components: DOM
>     Versions: 2.6.2
>     Reporter: Emily Horton
>
> We are parsing large numbers of xml files with DOM and are very occasionally getting
> a null pointer exception when parsing. In this case we tracked the problem down to
> a point in the text where there was a quoted attribute inside quoted text:
> “[a]nimals should be housed in facilities dedicated to or assigned for that
> purpose...<bibr rid="b2"/>”
> Any of the following changes to the document would get rid of the null pointer
> exception and allow parsing:
> 1) Changing the bibr tag to a different tag without any attributes.
> 2) Removing the outside quotes.
> 3) Moving the bibr tag to outside the quotes.
> Here is the stack trace for the error:
> 522316528 [Thread-200] ERROR -> org.apache.xerces.dom.DeferredDocumentImpl.setChunkIndex(Unknown Source)
> 522316529 [Thread-200] ERROR -> org.apache.xerces.dom.DeferredDocumentImpl.insertBefore(Unknown Source)
> 522316529 [Thread-200] ERROR -> org.apache.xerces.parsers.AbstractDOMParser.endGeneralEntity(Unknown Source)
> 522316529 [Thread-200] ERROR -> org.apache.xerces.impl.dtd.XMLDTDValidator.endGeneralEntity(Unknown Source)
> 522316529 [Thread-200] ERROR -> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source)
> 522316530 [Thread-200] ERROR -> org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source)
> 522316530 [Thread-200] ERROR -> org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source)
> 522316530 [Thread-200] ERROR -> org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
> 522316530 [Thread-200] ERROR -> org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
> 522316530 [Thread-200] ERROR -> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
> 522316530 [Thread-200] ERROR -> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
> 522316530 [Thread-200] ERROR -> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> 522316531 [Thread-200] ERROR -> org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> 522316531 [Thread-200] ERROR -> org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
> 522316531 [Thread-200] ERROR -> org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> 522316531 [Thread-200] ERROR -> org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> 522316531 [Thread-200] ERROR -> org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> 522316531 [Thread-200] ERROR -> javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
