Thanks for the explanation Mike!

  *   what if we just increased all these initial limits to 1M instead of 1K ?

1M is not sufficient. I just encountered a Shape file containing a polygon with 
4,900,469 points.

I used the maxOccursBounds flag to increase the repetition limit to 5 million:

daffodil parse -TmaxOccursBounds=5000000 ...

But that resulted in a "GC overhead limit exceeded".

So I disabled that check with the -XX:-UseGCOverheadLimit JVM option:

set JOPTS=-Xms4096M -Xmx4096M -XX:ReservedCodeCacheSize=512M -XX:-UseGCOverheadLimit

But now I'm getting a "java.lang.OutOfMemoryError: Java heap space" error. My 
next attempt will be to further increase -Xmx (the maximum heap size).

Any suggestions you might have would be appreciated. My Shape file is large ... 
744 MB.
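For reference, here is everything described above combined into one sketch. This assumes, as in the snippet above, a Windows launcher that reads its JVM options from a JOPTS variable; the variable name and the 8 GB heap figure are guesses to adjust for your machine, not tested values:

```
rem JVM options: -Xmx sets the maximum heap, which governs "Java heap space" errors;
rem -XX:-UseGCOverheadLimit disables the "GC overhead limit exceeded" check.
set JOPTS=-Xms4096M -Xmx8192M -XX:ReservedCodeCacheSize=512M -XX:-UseGCOverheadLimit

rem Daffodil tunable: raise the array-repetition bound above the 4,900,469-point polygon.
daffodil parse -TmaxOccursBounds=5000000 ...
```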

/Roger

From: Mike Beckerle [mailto:mbecke...@tresys.com]
Sent: Tuesday, June 12, 2018 10:00 AM
To: users@daffodil.apache.org
Subject: Re: Why is there an arbitrary limit that Daffodil imposes so that 
arrays can't be bigger than 1024 elements?


We certainly can enlarge these initial settings, as they do seem awfully small.



And we can probably add an "unlimited" setting, but the point of this "limited" 
behavior was, in general, to avoid the kinds of problems that come up with 
unlimited - as in "did you really mean 8 trillion is ok?"



E.g., with a regex containing ".*", did you really mean "*" as in any number, 
as in trillions? Or did you mean "pretty big by human standards, like maybe a 
million"?



In DFDL, due to backtracking, if there is an error in the data, it is possible 
for the parser to waste a lot of time thrashing around, hopelessly trying to 
parse the data. Some reasonable limits that make it fail faster are helpful in 
these cases. Commercial data integration products have various tunable limits of 
this sort as well.



There is various guidance on using regexes in XSD, for example, that frowns 
upon use of the wildcard * and + quantifiers for these same reasons.
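The concern about unbounded quantifiers is easy to demonstrate; a minimal sketch in Java (the class name and the 1,000 cap are arbitrary choices for illustration, not anything from Daffodil), contrasting "*" with an explicit bounded quantifier:

```java
import java.util.regex.Pattern;

public class BoundedQuantifierDemo {
    public static void main(String[] args) {
        // Unbounded: accepts any number of repetitions, however large.
        Pattern unbounded = Pattern.compile("a*");
        // Bounded: an explicit, human-chosen cap on repetitions.
        Pattern bounded = Pattern.compile("a{0,1000}");

        // 2000 'a' characters (avoids String.repeat so this runs on Java 8+).
        String tooLong = new String(new char[2000]).replace('\0', 'a');

        System.out.println(unbounded.matcher(tooLong).matches()); // true
        System.out.println(bounded.matcher("aaaa").matches());    // true
        System.out.println(bounded.matcher(tooLong).matches());   // false
    }
}
```

The bounded form rejects pathological inputs outright instead of silently accepting "trillions."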



So all that said... what if we just increased all these initial limits to 1M 
instead of 1K ?



I'm open to all suggestions for how to improve here. Just wanted to explain 
current rationales.



...mike beckerle

Tresys

________________________________
From: Costello, Roger L. <coste...@mitre.org>
Sent: Tuesday, June 12, 2018 7:08:31 AM
To: users@daffodil.apache.org
Subject: Why is there an arbitrary limit that Daffodil imposes so that arrays 
can't be bigger than 1024 elements?

Hi Folks,

I am creating a DFDL schema for Shape files. I ran my DFDL schema on a Shape 
file and the parse crashed. I discovered that the Shape file has a polygon with 
1,371 points (so I need the <Point> element repeated 1,371 times) but Daffodil 
imposes a limit of 1,024 repetitions. I learned how to increase that limit:

daffodil parse -TmaxOccursBounds=2048 ...

I did that and it took care of the error I was getting.
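For context, the repetitions being bounded here come from a variable-occurrence array in the schema. A hypothetical declaration for the Point array might look like the following (the element names, the shp: prefix, and the use of dfdl:occursCountKind="expression" are assumptions for illustration, not taken from the actual schema):

```xml
<!-- Illustrative sketch only: names and structure are assumed -->
<xs:element name="NumPoints" type="xs:int"/>
<xs:element name="Point" type="shp:PointType"
            minOccurs="0" maxOccurs="unbounded"
            dfdl:occursCountKind="expression"
            dfdl:occursCount="{ ../NumPoints }"/>
```

Each parsed repetition of an element like this counts against the maxOccursBounds tunable, which is why the value needed tracks the largest polygon in the data.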

I ran my DFDL schema on another Shape file and the parse crashed. Upon 
investigation I found the Shape file has a polygon with 3,087 points. So I 
increased the limit again:

daffodil parse -TmaxOccursBounds=4096 ...

I did that and it took care of the error I was getting.

Now I begin to wonder - why? Why does Daffodil impose a limit? I think there 
should be no limit. Is there a reason that it can't be unlimited?

/Roger
