Claude, I am afraid you will be disappointed: today Daffodil 2.3.0 cannot handle data files this large in one shot.
An unlimited-size stream of smaller data items can be handled, but not a single large file being parsed into a single JSON root. Lots of people have this requirement. We do have ambitions on the roadmap to provide a more incremental parse and unparse for large files when the DFDL schema allows it. On the parse side this would be more like the XML SAX or StAX APIs (you can still create JSON as the infoset; this is just the API style). The unparse side already has a streaming API, but the implementation doesn't provide the streaming behavior except in very, very simple schemas.

If you are interested in becoming a Daffodil developer to implement what you need, we are always looking for contributors to dig in, and we would provide lots of initial assistance.

-Mike Beckerle
Tresys

From: Claude Mamo
Sent: Thursday, May 23, 2:21 AM
Subject: Unparsing a 10 GB JSON infoset
To: [email protected]

Hello all,

I'm testing Daffodil's capability to handle large files. The parsing is done in chunks, but the unparsing happens in one go. For the latter, the following error occurs after about 100 MB is written out to disk: "OutOfMemoryError: GC Overhead Limit Exceeded". Should unparsing happen in chunks as well, or could this be a memory leak? The DFDL schema isn't particularly complex from my perspective, and the validation is very basic (mostly maxOccurs=1 for a few elements).

Claude
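
Mike mentions that the unparse side already has a streaming-style API, i.e. one that pulls infoset events from an InfosetInputter and writes data to a channel. As a rough sketch (not taken from this thread) of how such a call looks against the 2.3.0 Java API: the schema and file names below are placeholders, and the exact JsonInfosetInputter constructor argument (Reader vs. InputStream) should be verified against the 2.3.0 javadoc.

import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.Diagnostic;
import org.apache.daffodil.japi.ProcessorFactory;
import org.apache.daffodil.japi.UnparseResult;
import org.apache.daffodil.japi.infoset.JsonInfosetInputter;

public class JsonUnparseSketch {
    public static void main(String[] args) throws Exception {
        // Compile the DFDL schema (file names here are placeholders).
        Compiler c = Daffodil.compiler();
        ProcessorFactory pf = c.compileFile(new File("mySchema.dfdl.xsd"));
        if (pf.isError()) {
            for (Diagnostic d : pf.getDiagnostics()) System.err.println(d.getMessage());
            return;
        }
        DataProcessor dp = pf.onPath("/");

        // The unparse call is event-shaped: it pulls infoset events from an
        // InfosetInputter and writes the resulting data to a channel. Whether
        // memory actually stays bounded for a 10 GB infoset depends on the
        // implementation and the schema, per the discussion above.
        try (FileReader json = new FileReader("infoset.json");           // assumption: Reader-based constructor in 2.3.0
             FileOutputStream out = new FileOutputStream("data.bin")) {
            JsonInfosetInputter inputter = new JsonInfosetInputter(json);
            WritableByteChannel channel = Channels.newChannel(out);
            UnparseResult res = dp.unparse(inputter, channel);
            if (res.isError()) {
                for (Diagnostic d : res.getDiagnostics()) System.err.println(d.getMessage());
            }
        }
    }
}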
