http://nagoya.apache.org/bugzilla/show_bug.cgi?id=2336
*** shadow/2336 Tue Jun 26 12:56:20 2001
--- shadow/2336.tmp.8454 Tue Jun 26 12:56:20 2001
***************
*** 0 ****
--- 1,80 ----
+ +============================================================================+
+ | Large data problem with SAX: characters method chops value when offset at  |
+ +----------------------------------------------------------------------------+
+ |        Bug #: 2336                       Product: Xerces-J                 |
+ |       Status: NEW                        Version: 1.4.1                    |
+ |   Resolution:                           Platform: PC                       |
+ |     Severity: Major                   OS/Version: Windows NT/2K            |
+ |     Priority: Other                    Component: SAX                      |
+ +----------------------------------------------------------------------------+
+ |  Assigned To: [EMAIL PROTECTED]                                            |
+ |  Reported By: [EMAIL PROTECTED]                                            |
+ +----------------------------------------------------------------------------+
+ |          URL:                                                              |
+ +============================================================================+
+ |                                DESCRIPTION                                 |
+ Using JDK 1.3.1 on Win2K and SAX 2.0 (also with SAX 1.x).
+
+ Steps to reproduce:
+
+ 1. run against the file: http://www.geocities.com/ascii_text/sax_example.xml
+ 2. as the file is parsed, a character array/buffer of size 16384 is passed
+    to the characters method in the implementation class
+ 3. the offset value passed to the characters method is moved along to match
+    the text contained within the elements
+ 4. what happens if a text value is partly in one buffer and partly in the
+    next buffer? only part of the text value is given
+
+ For example, in the file referenced above, we have the following xml snippet:
+
+ "...
+   <column>
+     <column_name>COLUMN_5</column_name>
+     <column_value>VALUE_5</column_value>
+   </column>
+ ..."
+
+ It turns out that as the file is processed, the current buffer contains:
+
+ "...
+   <column>
+     <column_name>CO" [END OF BUFFER]
+
+ and the next buffer contains:
+
+ [BEGINNING OF BUFFER]"LUMN_5</column_name>
+     <column_value>VALUE_5</column_value>
+   </column>
+ ..."
+
+ The corresponding buffer-related values passed to the characters
+ method are as follows.
+
+ * For the current buffer:
+
+     OFFSET: 16382
+     LENGTH: 2
+     CHARACTER ARRAY LENGTH: 16384
+
+ * For the next buffer:
+
+     OFFSET: 0
+     LENGTH: 6
+     CHARACTER ARRAY LENGTH: 16384
+
+ We can see that the text value COLUMN_5 is chopped into "CO" and "LUMN_5",
+ and the reason is that the first part of the value ("CO") lies in one
+ buffer and the second part ("LUMN_5") lies in the next buffer.
+
+ As a result of all this, the characters value reported is incorrect.
+
+ This doesn't happen for every buffer; in fact, it took an XML file
+ 45,000 lines long to get the problem to show up. But for large XML
+ files, it almost always happens to me. This is a serious problem
+ because it prevents me from using and testing large data sets.
+
+ Please e-mail me if there is a work-around or if I am misusing
+ the API.
+
+ I was unable to find another reference to this problem, but I
+ would be surprised if others haven't encountered it.
\ No newline at end of file
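
On the work-around question: the SAX documentation for the characters
method allows a parser to deliver contiguous text in several chunks, so a
handler generally cannot treat one call as one complete value. A minimal
sketch of the usual approach follows: append every chunk to a buffer and read
the value only in endElement. The class name AccumulatingHandler and the
helper replayReportedCalls are hypothetical and not part of the report or of
Xerces-J; the helper simply replays the two calls with the offsets and lengths
listed above. StringBuilder needs Java 5 or later; on JDK 1.3.1, StringBuffer
would be the substitute.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not from the report) that joins split character data. */
class AccumulatingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private String lastValue; // text of the most recently closed element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumes data-style XML without mixed content: start afresh per element.
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once for a single text value: append, never overwrite.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only here is the complete value known.
        lastValue = text.toString();
        text.setLength(0);
    }

    String getLastValue() {
        return lastValue;
    }

    /** Replays the two characters() calls described in the report. */
    static String replayReportedCalls() {
        char[] first = new char[16384];
        "CO".getChars(0, 2, first, 16382);   // "CO" sits at the very end of buffer 1
        char[] second = new char[16384];
        "LUMN_5".getChars(0, 6, second, 0);  // "LUMN_5" starts buffer 2

        AccumulatingHandler handler = new AccumulatingHandler();
        handler.startElement("", "column_name", "column_name", null);
        handler.characters(first, 16382, 2); // OFFSET 16382, LENGTH 2
        handler.characters(second, 0, 6);    // OFFSET 0, LENGTH 6
        handler.endElement("", "column_name", "column_name");
        return handler.getLastValue();
    }

    public static void main(String[] args) {
        System.out.println(replayReportedCalls()); // prints COLUMN_5
    }
}
```

With a real parser the same handler would be passed to the parse call
unchanged; the value for column_name is then read in endElement rather than
in characters.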
