Update on my DFDL schema …
I spent 40 hours adding a third layer of dfdl:choiceDispatchKey. That
has made the DFDL schema extremely complicated. Now my DFDL schema
requires no backtracking. Previously, before adding the third layer of
dfdl:choiceDispatchKey, it took 226 seconds to parse the input file.
Now, with the third layer of dfdl:choiceDispatchKey and no backtracking,
it takes 214 seconds. So, after putting in 40 hours of work, introducing
a ton of complexity, and eliminating all backtracking, I get a savings of
only 12 seconds.
*From:* Mike Beckerle <mbecke...@apache.org>
*Sent:* Tuesday, January 9, 2024 6:12 PM
*To:* users@daffodil.apache.org
*Subject:* [EXT] Re: Parsing 5 million lines of input is taking 4
minutes - too slow!
I was thinking about instrumentation output from Daffodil that would
make it clear how many of the attempts to parse each choice branch end
in backtracking.
A histogram listing each element declaration and how many times it is
parsed would basically tell you whether things were being parsed over
and over. When something is parsed many more times than its parent is
parsed, that's an indication it's being parsed excessively.
I want to comment on this refactoring of the schema:
What we really need ... is direct dispatch on these choices,
but it does require a little restructuring. Instead, you could replace
the above choice with something like this:
<element name="Foo_Records">
<complexType>
<sequence>
<element name="field1" ... />
<element name="field2" ... />
<element name="field3" ... />
<choice dfdl:choiceDispatchKey="{ ./field3 }">
<element ref="Foo_PrimaryRecords"
dfdl:choiceBranchKey="A" />
<element ref="Foo_SecondaryRecords"
dfdl:choiceBranchKey="B" />
<element ref="Foo_TertiaryRecords"
dfdl:choiceBranchKey="C" />
</choice>
</sequence>
</complexType>
</element>
I think the above is elegant in that it expresses
1. that field1, 2, 3 are first and are uniformly defined in all the
data (of these Foo records).
2. that only field3 is needed to determine which of Primary, Secondary,
and Tertiary records is in the data.
Those two things together are what enable an efficient, non-backtracking
parser to be created for this. They are also the kinds of statements found
in most data message format specifications, buried somewhere in the prose.
That information needs to be captured in the DFDL schema in some way that
allows the implementation to take advantage of it.
If you don't restructure the schema like the above, how else would you
capture that information in the DFDL schema?
If you leave these things out and just express a choice among a bunch of
layouts, the DFDL schema is telling me much less about the format. It's
not declaring the uniformity that is intentionally part of the format spec.
On Fri, Jan 5, 2024 at 2:29 PM Steve Lawrence <slawre...@apache.org> wrote:
Roger,
I've taken a look at your schema, and it does look like excessive
backtracking is potentially part of the issue. Although many of your
outer choices use direct dispatch, some of the inner choices do not.
I'll try to stay general in my examples so the discussions can stay
public, but if the solution isn't clear let me know and we can talk
privately.
Your schema has a lot of choices that look like this:
<choice>
  <element ref="Foo_PrimaryRecords" />
  <element ref="Foo_SecondaryRecords" />
  <element ref="Foo_TertiaryRecords" />
  ...
</choice>
With each of those looking something like this:
<element name="Foo_PrimaryRecords">
<complexType>
<sequence>
<element name="field1" ... />
<element name="field2" ... />
<element name="field3" ...>
<simpleType>
<restriction base="validString">
<enumeration value="A" />
</restriction>
</simpleType>
</element>
</sequence>
<!-- fields specific to Foo_PrimaryRecord -->
</complexType>
</element>
<element name="Foo_SecondaryRecords">
<complexType>
<sequence>
<element name="field1" ... />
<element name="field2" ... />
<element name="field3" ...>
<simpleType>
<restriction base="validString">
<enumeration value="B" />
</restriction>
</simpleType>
</element>
</sequence>
<!-- fields specific to Foo_SecondaryRecord -->
</complexType>
</element>
And something similar for the TertiaryRecords.
Note that the first few fields are exactly the same; the records differ
only after field3, and which branch applies is determined by the value of
field3. This means, for example, that if the record is a tertiary record,
then Daffodil will need to repeatedly parse the same first fields and
backtrack before the validString assertion finally succeeds (the common
fields are parsed in the failed Foo_PrimaryRecords attempt, again in the
failed Foo_SecondaryRecords attempt, and once more in the successful
Foo_TertiaryRecords attempt). So in order to parse one record, we may
parse the same fields multiple times, which is inefficient and could slow
things down.
What we really need to avoid this is direct dispatch on these choices,
but it does require a little restructuring. Instead, you could replace
the above choice with something like this:
<element name="Foo_Records">
<complexType>
<sequence>
<element name="field1" ... />
<element name="field2" ... />
<element name="field3" ... />
<choice dfdl:choiceDispatchKey="{ ./field3 }">
<element ref="Foo_PrimaryRecords"
dfdl:choiceBranchKey="A" />
<element ref="Foo_SecondaryRecords"
dfdl:choiceBranchKey="B" />
<element ref="Foo_TertiaryRecords"
dfdl:choiceBranchKey="C" />
</choice>
</sequence>
</complexType>
</element>
And then each of the Foo_*Records elements contains only the fields
specific to that group of records (i.e., what they have now without the
first common fields). And note that those might contain even more choices.
For example, maybe Foo_SecondaryRecords has a choice of
Foo_SecondaryOldRecords and Foo_SecondaryNewRecords.
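As a rough sketch of how that nested level could use direct dispatch too
(the field4 discriminator and the "OLD"/"NEW" branch keys here are made-up
placeholders, not from your actual schema):

<element name="Foo_SecondaryRecords">
  <complexType>
    <sequence>
      <element name="field4" ... />
      <choice dfdl:choiceDispatchKey="{ ./field4 }">
        <element ref="Foo_SecondaryOldRecords" dfdl:choiceBranchKey="OLD" />
        <element ref="Foo_SecondaryNewRecords" dfdl:choiceBranchKey="NEW" />
      </choice>
    </sequence>
  </complexType>
</element>

The same idea applies at every level: hoist the fields the branches share,
and dispatch on whatever field distinguishes them.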
You already do something similar for some of your outer choices--doing
it for all the choices should help to speed things up.
I'm sure there are still performance improvements we can make to
Daffodil to optimize text parsing, but these kinds of changes should
help quite a bit in the meantime.
On 2024-01-04 01:54 PM, Mike Beckerle wrote:
> Roger said: So, that means it takes 130 seconds to parse the
> 5-million-line input file and build an internal infoset but only 96
> seconds to create the 4 GB XML file. That makes no sense.
>
> I agree with you that parsing 5M records and it taking 2 minutes ...
> something is clearly wrong and it is much too slow.
> 2 minutes to a modern CPU is like the Jurassic Age to a human. I.e.,
> it's an almost unimaginably long time.
>
> But parsing taking longer than writing out XML */can/* make sense
> depending on the format complexity. Writing out 4GB of XML is just a
> bunch of text I/O. Per byte that's pretty fast.
>
> The speed of DFDL parsing is proportional to /the number of decisions/
> the parser must make, which is linearly, but weakly, correlated to the
> data size. The constant factors vary widely with the format.
>
> For an extreme example: there is a mil-std-2045 message header that is
> 33 bits long. It mostly consists of hidden groups of presence bits that
> are 0, indicating that some optional component is not present. Each such
> bit requires a DFDL choice of two possibilities, evaluation of a
> choice-dispatch-key expression or an occursCount expression, and most of
> those then create *nothing* in the infoset. So a bunch of overhead to
> consume 1 bit of input and decide to add nothing to the infoset. Repeat
> almost 30 times. You have now consumed less than 5 bytes of the input.
> In terms of parse speed in bytes/second this is going to be super slow
> because every byte requires a bunch of parser decision making. Writing
> out the corresponding 956 bytes of XML text is going to be very quick in
> comparison to this parsing.
>
> (FYI: This extreme example is on github here:
> https://github.com/DFDLSchemas/mil-std-2045/blob/master/src/test/resources/com/owlcyberdefense/mil-std-2045/milstd2045.tdml
> The test is named test_2045_C_minimum_size_header. )
>
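> To illustrate the kind of structure involved, here is a minimal sketch of
> a single presence bit. It is not taken from the 2045 schema; the names
> and the omitted properties are placeholders, and in the real schema the
> flag lives in a hidden group so it never appears in the infoset:
>
> <sequence>
>   <!-- 1-bit flag: 1 if the optional component is present, 0 if not -->
>   <element name="fooPresent" ... dfdl:lengthKind="explicit"
>            dfdl:length="1" dfdl:lengthUnits="bits" />
>   <!-- the optional component occurs 0 or 1 times, driven by the flag -->
>   <element name="foo" ... minOccurs="0" maxOccurs="1"
>            dfdl:occursCountKind="expression"
>            dfdl:occursCount="{ ../fooPresent }" />
> </sequence>
>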
> I realize your data and schema likely don't have such extreme behavior.
>
> We need to get your schema so we can figure out where the performance
> problem is, whether there is a workaround, and what kind of Daffodil
> features would eliminate all this guesswork about what's slow about it.
>
> On Thu, Jan 4, 2024 at 3:59 AM Roger L Costello <coste...@mitre.org> wrote:
>
> Steve wrote:
>
> > One way to get a decent approximation for how much time is used for
> > the former [build an internal infoset] is to use the "null" infoset
> > outputter, e.g.
> >
> >     daffodil parse -I null ...
> >
> > This still parses data and builds the internal infoset but turns
> > infoset serialization into a no-op.
>
> Thanks Steve. I did as you suggested and here's the result:
>
> * 130 seconds
>
> That is super surprising. I would have expected it to take much,
> much less time.
>
> So, that means it takes 130 seconds to parse the 5-million-line
> input file and build an internal infoset but only 96 seconds to
> create the 4 GB XML file. That makes no sense.
>
> /Roger
>
> *From:* Steve Lawrence <slawre...@apache.org>
> *Sent:* Tuesday, January 2, 2024 9:18 AM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: Parsing 5 million lines of input is taking 4
> minutes - too slow!
>
> You are correct that daffodil builds an internal infoset and then
> serializes that to something else (e.g. XML, EXI, JSON). One way to
> get a decent approximation for how much time is used for the former is
> to use the "null" infoset outputter, e.g.
>
>     daffodil parse -I null ...
>
> This still parses data and builds the internal infoset but turns
> infoset serialization into a no-op.
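>
> (For a rough split of parse time vs. XML-writing time, you can time the
> same parse both ways; the schema and data file names below are just
> placeholders:
>
>     daffodil parse -I null -s yourSchema.dfdl.xsd yourData.dat
>     daffodil parse -s yourSchema.dfdl.xsd -o output.xml yourData.dat
>
> The difference between the two wall-clock times is roughly the cost of
> serializing the infoset to XML.)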
>
> On 2023-12-26 02:04 PM, Roger L Costello wrote:
>
> > Hi Folks,
> >
> > My input file contains 5 million 132-character records.
> >
> > I have done everything that I can think of to make the parsing faster:
> >
> > 1. I precompiled the schema and used it to do the parsing
> > 2. I set Java -Xmx40960m
> > 3. I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer
> >
> > And yet it still takes 4 minutes before the (4 GB) XML file is produced.
> > Waiting 4 minutes is not acceptable for my clients.
> >
> > A couple of questions:
> >
> > 1. Is there anything else that I can do to speed things up?
> > 2. I believe there is time needed to do the parsing and generate an
> >    in-memory parse tree, and there is time needed to serialize the
> >    in-memory parse tree to an XML file. Is there a way to find those
> >    two times? I suspect the former is a lot quicker than the latter.
> >
> > /Roger