Hi Mike,

Thus far I have only been able to parse about ¼ of the 744 MB Shape file. The 
portion that I successfully parsed produced an XML file that is 1.2 GB in size. 
Presumably, if I could parse the entire Shape file, the resulting XML would be 
somewhere in the 5 GB range. Out of curiosity, I wanted to count the number of 
<variable-length-record> elements (there is one such element per shape) in the 
resulting XML file. I wrote an XSLT program to do the counting, but Saxon 
generated an out-of-memory error. So I used streaming XML to do the count 
instead - there are 149,108 variable-length records (shapes). Extrapolating, 
the entire Shape file contains roughly 600,000 shapes.
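
For concreteness, here is a minimal sketch of one way to do such a streaming 
count, using the JDK's built-in StAX pull parser from Scala (the input file 
name is just a placeholder):

import java.io.FileInputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object CountRecords {
  def main(args: Array[String]): Unit = {
    // Pull-parse the XML one event at a time, so the whole document
    // never has to be held in memory.
    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new FileInputStream("shapes.xml")) // placeholder name
    var count = 0L
    while (reader.hasNext) {
      if (reader.next() == XMLStreamConstants.START_ELEMENT &&
          reader.getLocalName == "variable-length-record")
        count += 1
    }
    reader.close()
    println(s"variable-length-record count: $count")
  }
}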

I don't know if this Shape file is typical. It is a Shape file for a coastal 
region. I can imagine such files would often be huge.

/Roger

From: Mike Beckerle [mailto:mbecke...@tresys.com]
Sent: Tuesday, June 12, 2018 10:45 AM
To: users@daffodil.apache.org
Subject: Re: Why is there an arbitrary limit that Daffodil imposes so that 
arrays can't be bigger than 1024 elements?




So, a shape file that big may not be possible to parse right now.



If you just think about the enlargement due to expanding the data from the 
shapefile representation, which is fairly dense, into something more like an 
XML DOM tree: every field in the data becomes a Java object, or several, each 
of which has many bytes of overhead. A file of 744 MB might turn into 7 GB of 
storage. That's assuming only a 10-to-1 expansion, which, honestly, might not 
be a big enough factor. It could be 20-to-1, though I doubt it is 100-to-1.



Question: are the files typically like this? Or are these somewhat extreme 
examples?



This may be a case where true streaming (e.g., XML SAX style) parsing is 
needed. This is something on our roadmap 
(https://issues.apache.org/jira/browse/DAFFODIL-934), but it has not yet been 
made a high priority.



In the interim... you *could* just get a *lot* more RAM; e.g., my laptop has 
64 GB.



Alternatively... do you know any Scala/Java developers who might want to add 
features to Daffodil? We can certainly help someone get up the learning curve, 
learn enough Scala, provide and/or review a design, etc.



...mike beckerle

Tresys

________________________________
From: Costello, Roger L. <coste...@mitre.org<mailto:coste...@mitre.org>>
Sent: Tuesday, June 12, 2018 10:18:07 AM
To: users@daffodil.apache.org<mailto:users@daffodil.apache.org>
Subject: RE: Why is there an arbitrary limit that Daffodil imposes so that 
arrays can't be bigger than 1024 elements?


Thanks for the explanation, Mike!



  *   what if we just increased all these initial limits to 1M instead of 1K?



1M is not sufficient. I just encountered a Shape file containing a polygon with 
4,900,469 points.



I used the maxOccursBounds tunable to increase the repetition limit to 5 million:



daffodil parse -TmaxOccursBounds=5000000 ...
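
(For reference, a complete command line would look something like the 
following; the schema and file names here are hypothetical:)

daffodil parse -s shapefile.dfdl.xsd -TmaxOccursBounds=5000000 -o output.xml coastal.shp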



But that resulted in a "GC overhead limit exceeded" error.



So I added the -XX:-UseGCOverheadLimit flag to avoid that:



set JOPTS=-Xms4096M -Xmx4096M -XX:ReservedCodeCacheSize=512M -XX:-UseGCOverheadLimit



But now I'm getting a "java.lang.OutOfMemoryError: Java heap space" error. My 
next attempt will be to further increase -Xms.
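
For example (assuming the wrapper script keeps picking up JOPTS as in the 
snippet above; note that -Xmx is what actually caps the heap, while -Xms only 
sets its starting size), something like:

set JOPTS=-Xms8192M -Xmx8192M -XX:ReservedCodeCacheSize=512M -XX:-UseGCOverheadLimit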



Any suggestions you might have would be appreciated. My Shape file is large ... 
744 MB.



/Roger



From: Mike Beckerle [mailto:mbecke...@tresys.com]
Sent: Tuesday, June 12, 2018 10:00 AM
To: users@daffodil.apache.org<mailto:users@daffodil.apache.org>
Subject: Re: Why is there an arbitrary limit that Daffodil imposes so that 
arrays can't be bigger than 1024 elements?



We certainly can enlarge these initial settings, as they do seem awfully small.



And we can probably have an "unlimited" setting, but the point of this 
"limited" behavior was, in general, to avoid the kinds of problems that come 
up with unlimited - as in "did you really mean 8 trillion is OK?"



E.g., with a regex containing ".*", did you really mean "*" as in any number, 
as in trillions? Or did you mean "pretty big by human standards, like maybe a 
million"?



In DFDL, due to backtracking, if there is an error in the data it is possible 
for the parser to waste a lot of time thrashing around, hopelessly trying to 
parse the data. Some reasonable limits that make it fail faster are helpful in 
these cases. Commercial data integration products have various tunable limits 
of this sort as well.



There is various guidance on the use of regexes in XSD, for example, that 
frowns upon the unbounded * and + quantifiers for these same reasons.



So, all that said... what if we just increased all these initial limits to 1M 
instead of 1K?



I'm open to all suggestions for how to improve here. Just wanted to explain 
current rationales.



...mike beckerle

Tresys

________________________________

From: Costello, Roger L. <coste...@mitre.org<mailto:coste...@mitre.org>>
Sent: Tuesday, June 12, 2018 7:08:31 AM
To: users@daffodil.apache.org<mailto:users@daffodil.apache.org>
Subject: Why is there an arbitrary limit that Daffodil imposes so that arrays 
can't be bigger than 1024 elements?



Hi Folks,

I am creating a DFDL schema for Shape files. I ran my DFDL schema on a Shape 
file and the parse crashed. I discovered that the Shape file has a polygon with 
1,371 points (so I need the <Point> element repeated 1,371 times), but Daffodil 
imposes a limit of 1,024 repetitions. I learned how to increase that limit:

daffodil parse -TmaxOccursBounds=2048 ...

I did that and it took care of the error I was getting.

I ran my DFDL schema on another Shape file and the parse crashed. Upon 
investigation I found the Shape file has a polygon with 3,087 points. So I 
increased the limit again:

daffodil parse -TmaxOccursBounds=4096 ...

I did that and it took care of the error I was getting.

Now I begin to wonder - why? Why does Daffodil impose a limit? I think there 
should be no limit. Is there a reason that it can't be unlimited?

/Roger
