Roger,
I've taken a look at your schema, and it does look like excessive
backtracking is potentially part of the issue. Although many of your
outer choices use direct dispatch, some of the inner choices do not.
I'll try to stay general in my examples so the discussion can stay
public, but if the solution isn't clear, let me know and we can talk
privately.
Your schema has a lot of choices that look like this:
<choice>
  <element ref="Foo_PrimaryRecords" />
  <element ref="Foo_SecondaryRecords" />
  <element ref="Foo_TertiaryRecords" />
  ...
</choice>
With each of those looking something like this:
<element name="Foo_PrimaryRecords">
<complexType>
<sequence>
<element name="field1" ... />
<element name="field2" ... />
<element name="field3" ...>
<simpleType>
<restriction base="validString">
<enumeration value="A" />
</restriction>
</simpleType>
</element>
</sequence>
<!-- fields specific to Foo_PrimaryRecord -->
</complexType>
</element>
<element name="Foo_SecondaryRecords">
<complexType>
<sequence>
<element name="field1" ... />
<element name="field2" ... />
<element name="field3" ...>
<simpleType>
<restriction base="validString">
<enumeration value="B" />
</restriction>
</simpleType>
</element>
</sequence>
<!-- fields specific to Foo_SecondaryRecord -->
</complexType>
</element>
And something similar for the TertiaryRecords.
Note that the first few fields are exactly the same across the record
types; they differ only after field3, based on the value of field3.
This means, for example, that if the record is a tertiary record,
Daffodil will need to repeatedly parse the same first fields and
backtrack before the enumeration restriction on field3 finally matches.
So in order to parse one record, we may parse the same fields multiple
times, which is inefficient and could slow things down.
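To make that concrete, suppose the records look something like this
(made-up values, just for illustration):

  field1 field2 A ...fields specific to primary records...
  field1 field2 B ...fields specific to secondary records...
  field1 field2 C ...fields specific to tertiary records...

To parse the third record, Daffodil tries the Foo_PrimaryRecords
branch, parses field1, field2, and field3, fails because field3 is "C"
rather than "A", backtracks, tries Foo_SecondaryRecords, parses the
same three fields again, fails on "B", backtracks again, and only then
succeeds on the Foo_TertiaryRecords branch.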
What we need to avoid this is direct dispatch on these inner choices
as well, but that requires a little restructuring. Instead, you could
replace the above choice with something like this:
<element name="Foo_Records">
<complexType>
<sequence>
<element name="field1" ... />
<element name="field2" ... />
<element name="field3" ... />
<choice dfdl:choiceDispatchKey="{ ./field3 }">
<element ref="Foo_PrimaryRecords" dfdl:choiceBranchKey="A" />
<element ref="Foo_SecondaryRecords" dfdl:choiceBranchKey="B" />
<element ref="Foo_TertiaryRecords" dfdl:choiceBranchKey="C" />
</choice>
</sequence>
</complexType>
</element>
Each of the Foo_*Records elements would then contain only the fields
specific to that group of records (i.e. what they have now, minus the
common leading fields). Note that those might contain even more
choices. For example, maybe Foo_SecondaryRecords has a choice of
Foo_SecondaryOldRecords and Foo_SecondaryNewRecords.
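If so, the same pattern applies one level down. Here's a rough sketch;
I'm making up a field4 and the "OLD"/"NEW" branch keys, so substitute
whatever value actually distinguishes old from new secondary records:

<element name="Foo_SecondaryRecords">
  <complexType>
    <sequence>
      <element name="field4" ... />
      <choice dfdl:choiceDispatchKey="{ ./field4 }">
        <element ref="Foo_SecondaryOldRecords" dfdl:choiceBranchKey="OLD" />
        <element ref="Foo_SecondaryNewRecords" dfdl:choiceBranchKey="NEW" />
      </choice>
    </sequence>
  </complexType>
</element>

The key point is that every choice, inner or outer, should resolve its
branch from a value that has already been parsed rather than by trial
and error.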
You already do something similar for some of your outer choices--doing
it for all the choices should help to speed things up.
I'm sure there are still performance improvements we can make to
Daffodil to optimize text parsing, but these kinds of changes should
help quite a bit in the meantime.
On 2024-01-04 01:54 PM, Mike Beckerle wrote:
Roger said: So, that means it takes 130 seconds to parse the
5-million-line input file and build an internal infoset but only 96
seconds to create the 4 GB XML file. That makes no sense.
I agree with you that parsing 5M records taking 2 minutes ...
something is clearly wrong, and it is much too slow.
2 minutes to a modern CPU is like the Jurassic Age to a human. I.e.,
it's an almost unimaginably long time.
But parsing taking longer than writing out XML *can* make sense
depending on the format complexity. Writing out 4GB of XML is just a
bunch of text I/O. Per byte that's pretty fast.
The speed of DFDL parsing is proportional to /the number of decisions/
the parser must make, which is linearly, but weakly, correlated to the
data size. The constant factors vary widely with the format.
For an extreme example: there is a mil-std-2045 message header that is
33 bits long. It mostly consists of hidden groups of presence bits that
are 0 indicating that some optional component is not present. Each such
bit requires a DFDL choice of two possibilities, evaluation of a
choice-dispatch-key expression or an occursCount expression, and most of
those then create *nothing* in the infoset. So a bunch of overhead to
consume 1 bit of input and decide to add nothing to the infoset. Repeat
almost 30 times. You have now consumed less than 5 bytes of the input.
In terms of parse speed in bytes/second this is going to be super slow
because every byte requires a bunch of parser decision making. Writing
out the corresponding 956 bytes of XML text is going to be very quick in
comparison to this parsing.
(FYI: This extreme example is on github here:
https://github.com/DFDLSchemas/mil-std-2045/blob/master/src/test/resources/com/owlcyberdefense/mil-std-2045/milstd2045.tdml
The test is named test_2045_C_minimum_size_header. )
I realize your data and schema likely don't have such extreme behavior.
We need to get your schema so we can figure out where the performance
problem is, whether there is a workaround, and what kind of Daffodil
features would eliminate all this guesswork about what's slow about it.
On Thu, Jan 4, 2024 at 3:59 AM Roger L Costello <coste...@mitre.org> wrote:
Steve wrote:

> One way to get a decent approximation for how much time is used for
> the former [build an internal infoset] is to use the "null" infoset
> outputter, e.g.
>
>   daffodil parse -I null ...
>
> This still parses data and builds the internal infoset but turns
> infoset serialization into a no-op.

Thanks Steve. I did as you suggested and here's the result:

* 130 seconds

That is super surprising. I would have expected it to take much,
much less time.

So, that means it takes 130 seconds to parse the 5-million-line
input file and build an internal infoset but only 96 seconds to
create the 4 GB XML file. That makes no sense.

/Roger
From: Steve Lawrence <slawre...@apache.org>
Sent: Tuesday, January 2, 2024 9:18 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: Parsing 5 million lines of input is taking 4
minutes - too slow!
You are correct that daffodil builds an internal infoset and then
serializes that to something else (e.g. XML, EXI, JSON). One way to get
a decent approximation for how much time is used for the former is to
use the "null" infoset outputter, e.g.

  daffodil parse -I null ...

This still parses data and builds the internal infoset but turns
infoset serialization into a no-op.
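For example, assuming a schema file and a data file (the file names
here are just placeholders), comparing wall-clock times of these two
runs gives a rough split between parsing and XML serialization:

  time daffodil parse -I null -s yourSchema.dfdl.xsd yourData.dat
  time daffodil parse -s yourSchema.dfdl.xsd -o output.xml yourData.dat

The first measures parsing alone; the difference between the two
approximates the cost of writing out the XML.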
On 2023-12-26 02:04 PM, Roger L Costello wrote:
> Hi Folks,
>
> My input file contains 5 million 132-character records.
>
> I have done everything that I can think of to make the parsing faster:
>
> 1. I precompiled the schema and used it to do the parsing
> 2. I set Java -Xmx40960m
> 3. I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer
>
> And yet it still takes 4 minutes before the (4 GB) XML file is produced.
> Waiting 4 minutes is not acceptable for my clients.
>
> A couple of questions:
>
> 1. Is there anything else that I can do to speed things up?
> 2. I believe there is time needed to do the parsing and generate an
> in-memory parse tree, and there is time needed to serialize the
> in-memory parse tree to an XML file. Is there a way to find those
> two times? I suspect the former is a lot quicker than the latter.
>
> /Roger