Re: Parsing text without an end terminator?

Mike Beckerle Thu, 14 Nov 2024 07:56:48 -0800

This is great, Mark.

DFDL does not have a way to say that one element is terminated by the
initiator of whatever comes next. It's a long-standing feature request.


So you need to use look-ahead to do what you want. That means you need to
use dfdl:lengthKind="pattern" to gather the data up to, but not including,
the expected terminator.

dfdl:lengthPattern="[^\[]{1,100}(?=\[)"

That will match 1 to 100 characters other than "[", followed by, but not
including, a "[" character.

The next element can just be a string that is lengthKind delimited with no
terminator specified which means "to end of data".
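
If you want to try the regular expression outside of DFDL first, here is a
minimal sketch. It assumes Python's re module is a reasonable stand-in for the
DFDL regex engine for this simple pattern, and the function name
split_at_first_bracket is made up for illustration. The first value plays the
role of f1 (the pattern match), the second plays the role of f2 (everything
to end of data).

```python
import re

# Same pattern as the dfdl:lengthPattern above. Python's re is only a
# stand-in for the DFDL regex engine here (assumption).
F1_PATTERN = re.compile(r"[^\[]{1,100}(?=\[)")

def split_at_first_bracket(data):
    """Return (f1, f2) for one line of data.

    f1 is whatever the pattern matches at the start of the data.
    f2 is everything after f1, up to the end of the data.
    """
    m = F1_PATTERN.match(data)
    f1 = m.group(0) if m else ""  # a non-match gives a zero-length f1
    f2 = data[len(f1):]
    return f1, f2

print(split_at_first_bracket('AAA["bbb"["ccc"]]'))
```

The look-ahead (?=\[) checks for the "[" without consuming it, so the bracket
stays in the data for the second element.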

The usual problem with dfdl:lengthKind="pattern" is that a failure to match
the pattern at all does *not* cause a parse error. Rather it just causes
the length to be zero. If the data is type xs:string then zero is a legal
length. So often a string element with lengthKind 'pattern' carries a
dfdl:assert that the length is not zero, so as to cause a non-match to be a
parse error. This is needed often enough that a named simpleType "nzString"
for "non-zero-length string" turns out to be convenient to have around.
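
To illustrate the zero-length behavior, here is a rough sketch, again using
Python's re module as a stand-in for the DFDL regex engine (an assumption).
The names ParseError, pattern_length, and parse_nz_string are hypothetical and
only model the idea of an assert that turns a non-match into an error. This is
not how Daffodil implements it.

```python
import re

# Same pattern as in the dfdl:lengthPattern example.
F1_PATTERN = re.compile(r"[^\[]{1,100}(?=\[)")

class ParseError(Exception):
    """Hypothetical stand-in for a DFDL parse error."""

def pattern_length(data):
    """Length found by the pattern; a non-match yields 0, not an error."""
    m = F1_PATTERN.match(data)
    return len(m.group(0)) if m else 0

def parse_nz_string(data):
    """Model of the 'nzString' idea: reject a zero-length result."""
    n = pattern_length(data)
    if n == 0:
        raise ParseError("pattern did not match; zero-length string rejected")
    return data[:n]

print(pattern_length('AAA["bbb"]'))  # pattern matches "AAA"
print(pattern_length('AAA'))         # no "[" follows, so the length is 0
```

Without the check in parse_nz_string, data with no "[" would quietly produce
an empty first element instead of failing.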


On Thu, Nov 14, 2024 at 10:16 AM Mark Kozak <mark.ko...@adeptus-cs.com>
wrote:

> Apologies for being unclear. Hopefully this helps.
>
>
>
> From that example I would want:
>
> <f1>AAA</f1>
>
> <f2>["bbb"["ccc"]]</f2><!-- literally all characters after the AAA as just
> a string -->
>
>
>
> The number of nested brackets is unknown. A more complicated example could
> be:
>
> AAA["bbb"["ccc"]["ddd"]]
>
> Producing:
>
> <f1>AAA</f1>
>
> <f2>["bbb"["ccc"]["ddd"]]</f2>
>
>
>
> So all I really need is two elements which are the string before the first
> separator (the left bracket) and literally everything else.
>
>
>
> I have a similar data stream that uses : as the separator. That example
> might look like:
>
>
>
> AAA:"bbb":"ccc"
>
>
>
> Again, the number of : separated strings is unbounded. So getting the
> following would work:
>
>
>
> <f1>AAA</f1>
>
> <f2>"bbb":"ccc"</f2>
>
>
>
> I think both examples are really the same problem, with one using the [ as
> a separator and the second using :
>
> I can use different schema solutions if they are in fact not the same
> problem.
>
>
>
>
>
>
>
>
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Thursday, November 14, 2024 9:51 AM
> *To:* users@daffodil.apache.org
> *Subject:* Re: Parsing text without an end terminator?
>
>
>
> I'm going to need more to go on than this.
>
>
>
> Can you provide (several) richer examples? It's not clear from this little
> snippet what's even the terminator you were describing before.
>
> You started with ":" terminators, now we're looking at matched pairs of
> brackets. How does one relate to the other?
>
>
>
> When you say the second element is "the rest of the line", what exactly do
> you mean by that? Do you want:
>
>
>
> <f1>AAA</f1>
>
> <f2>["bbb"["ccc"]]</f2><!-- literally all characters after the AAA as just
> a string -->
>
>
>
> Or something where the fields inside f2 are also parsed based on the
> brackets?
>
>
>
> <f1>AAA</f1>
>
> <f2>
>
>   <f3>bbb</f3>
>
>   <f4>ccc</f4>
>
> </f2>
>
>
>
>
>
> On Thu, Nov 14, 2024 at 9:28 AM Mark Kozak <mark.ko...@adeptus-cs.com>
> wrote:
>
> Here is an example of the type of data I need to parse.
>
>
>
> AAA["bbb"["ccc"]]
>
>
>
> The file has exactly one line with no terminator. Ideally, I would like to
> get 2 elements. The first is the AAA, and the second is the rest of the
> line. I can work with or without the first left bracket.
>
>
>
>
>
> *From:* Mark Kozak
> *Sent:* Thursday, November 14, 2024 8:58 AM
> *To:* users@daffodil.apache.org
> *Subject:* RE: Parsing text without an end terminator?
>
>
>
> The final terminator is not allowed.
>
>
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Thursday, November 14, 2024 8:55 AM
> *To:* users@daffodil.apache.org
> *Subject:* Re: Parsing text without an end terminator?
>
>
>
> Did you try using dfdl:separator ?
>
>
>
> To clarify, in your format is this final terminator optional, or is it not
> allowed to be present?
>
>
>
> Alas, the dfdl:documentFinalTerminatorCanBeMissing property is not
> implemented by Daffodil. (See https://daffodil.apache.org/unsupported/)
>
> It is suitable only for final terminators that are optional, but which
> will be added when unparsing.
>
>
>
>
>
> On Wed, Nov 13, 2024 at 5:42 PM Mark Kozak <mark.ko...@adeptus-cs.com>
> wrote:
>
> Hello Community,
>
>
>
> I have a text file that is delimited with a character like :
>
> The challenge I am having is that there is no delimiter at the end of the
> file. I can get things to work if I add a new-line to the end and specify a
> terminator to be the NL. I thought the documentFinalTerminatorCanBeMissing
> property would be the solution, but setting that to yes did not appear to
> make a difference. Are there any recommended workarounds?
>
>
>
> Thanks for the support,
>
>
>
> Mark Kozak
>
> Director of Engineering
>
> Adeptus Cyber Solutions
>
> Adeptus-CS.com
>
>
>
>
