RE: Parsing text without an end terminator?

Mark Kozak Thu, 14 Nov 2024 11:21:21 -0800

That suggestion works like a charm. This tool and its support come through 
every time!

Leaving out the delimiter was the trick I needed.

Thank you.

From: Mike Beckerle <mbecke...@apache.org> 
Sent: Thursday, November 14, 2024 10:55 AM
To: users@daffodil.apache.org
Subject: Re: Parsing text without an end terminator?

This is great Mark,

DFDL does not have a way to say that one element is terminated by the initiator 
of whatever comes next. It's a long-standing feature request.

So you need to use look-ahead to do what you want. That means you need to use 
dfdl:lengthKind="pattern" to gather the data up to, but not including, the 
expected terminator. 

dfdl:lengthPattern="[^\[]{1,100}(?=\[)"

That will match 1 to 100 non open-bracket characters followed by, but not 
including, a "[" character. 
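
As a rough illustration of what that pattern matches, here is a sketch that uses Python's re module purely as a stand-in for the regex engine; it is not Daffodil itself, and the helper name split_at_first_bracket is made up for this example. The look-ahead stops the match just before the first "[", so everything from the bracket onward is left unconsumed:

```python
import re

# Same regular expression as the dfdl:lengthPattern above.
F1_PATTERN = re.compile(r"[^\[]{1,100}(?=\[)")


def split_at_first_bracket(data):
    """Return (f1, rest): the text before the first '[' and everything after it.

    Hypothetical helper for illustration only; in Daffodil the schema does this.
    """
    m = F1_PATTERN.match(data)  # match() is anchored at the start of the data
    if m is None:
        return "", data  # no match: zero-length f1, nothing consumed
    # The look-ahead is not part of the match, so '[' stays in the remainder.
    return m.group(0), data[m.end():]


f1, rest = split_at_first_bracket('AAA["bbb"["ccc"]]')
print(f1)    # AAA
print(rest)  # ["bbb"["ccc"]]
```

Note that input with no "[" at all, or with more than 100 characters before the first "[", produces no match in this sketch.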

The next element can just be a string with dfdl:lengthKind="delimited" and no 
terminator specified, which means "to end of data".

The usual problem with dfdl:lengthKind="pattern" is that a failure to match the 
pattern at all does *not* cause a parse error. Rather it just causes the length 
to be zero. If the data is type xs:string then zero is a legal length. So often 
a string element with lengthKind 'pattern' carries a dfdl:assert that the 
length is not zero, so as to cause a non-match to be a parse-error. This is 
needed often enough that a named simpleType "nzString" for "non-zero-length 
string" turns out to be convenient to have around. 
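
The zero-length behavior can be simulated outside of DFDL. The following is only a sketch in Python, not Daffodil code: pattern_length mimics a non-match yielding length zero, and parse_nz_string plays the role of the assert that turns a zero length into an error. Both function names are invented for this illustration.

```python
import re


def pattern_length(pattern, data):
    """Length of the pattern match at the start of data, or 0 if there is none.

    Mimics lengthKind 'pattern': a non-match is not an error, just length zero.
    """
    m = re.match(pattern, data)
    return m.end() if m else 0


def parse_nz_string(pattern, data):
    """Take the pattern-determined prefix, failing if it is empty.

    Stands in for a string element carrying an assert that its length is
    not zero (the 'nzString' idea); raising ValueError models the parse error.
    """
    n = pattern_length(pattern, data)
    if n == 0:
        raise ValueError("parse error: pattern did not match (zero-length string)")
    return data[:n]
```

With the pattern from earlier in this message, pattern_length returns 0 for the input "AAA" (no "[" follows), so parse_nz_string raises instead of quietly producing an empty string.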

On Thu, Nov 14, 2024 at 10:16 AM Mark Kozak <mark.ko...@adeptus-cs.com> wrote:

Apologies for being unclear. Hopefully this helps.

From that example I would want:

<f1>AAA</f1>

<f2>["bbb"["ccc"]]</f2><!-- literally all characters after the AAA as just a 
string -->

The number of nested brackets is unknown. A more complicated example could be:

AAA["bbb"["ccc"]["ddd"]]

Producing:

<f1>AAA</f1>

<f2>["bbb"["ccc"]["ddd"]]</f2>

So all I really need is two elements which are the string before the first 
separator (the left bracket) and literally everything else.

I have a similar data stream that uses : as the separator. That example might 
look like:

AAA:"bbb":"ccc"

Again, the number of ":"-separated strings is unbounded. So getting the 
following would work:

<f1>AAA</f1>

<f2>"bbb":"ccc"</f2>

I think both examples are really the same problem, with one using the [ as a 
separator and the second using :

I can use different schema solutions if they are in fact not the same problem.

From: Mike Beckerle <mbecke...@apache.org> 
Sent: Thursday, November 14, 2024 9:51 AM
To: users@daffodil.apache.org
Subject: Re: Parsing text without an end terminator?

I'm going to need more to go on than this. 

Can you provide (several) richer examples? It's not clear from this little 
snippet what's even the terminator you were describing before.

You started with ":" terminators, now we're looking at matched pairs of 
brackets. How does one relate to the other?

When you say the second element is "the rest of the line", what exactly do you 
mean by that? Do you want:

<f1>AAA</f1>

<f2>["bbb"["ccc"]]</f2><!-- literally all characters after the AAA as just a 
string -->

Or something where the fields inside f2 are also parsed based on the brackets?

<f1>AAA</f1>

<f2>

  <f3>bbb</f3>

  <f4>ccc</f4>

</f2>

On Thu, Nov 14, 2024 at 9:28 AM Mark Kozak <mark.ko...@adeptus-cs.com> wrote:

Here is an example of the type of data I need to parse. 

AAA["bbb"["ccc"]]

The file has exactly one line with no terminator. Ideally, I would like to get 
2 elements. The first is the AAA, and the second is the rest of the line. I can 
work with or without the first left bracket.

From: Mark Kozak 
Sent: Thursday, November 14, 2024 8:58 AM
To: users@daffodil.apache.org
Subject: RE: Parsing text without an end terminator?

The final terminator is not allowed.

From: Mike Beckerle <mbecke...@apache.org> 
Sent: Thursday, November 14, 2024 8:55 AM
To: users@daffodil.apache.org
Subject: Re: Parsing text without an end terminator?

Did you try using dfdl:separator ? 

To clarify, in your format is this final terminator optional, or is it not 
allowed to be present? 

Alas, the dfdl:documentFinalTerminatorCanBeMissing property is not implemented 
by Daffodil. (See https://daffodil.apache.org/unsupported/)

It is suitable only for final terminators that are optional, but which will be 
added when unparsing. 

On Wed, Nov 13, 2024 at 5:42 PM Mark Kozak <mark.ko...@adeptus-cs.com> wrote:

Hello Community,

I have a text file that is delimited with a character like :

The challenge I am having is that there is no delimiter at the end of the file. 
I can get things to work if I add a new-line to the end and specify a 
terminator to be the NL. I thought the documentFinalTerminatorCanBeMissing 
property would be the solution, but setting that to yes did not appear to make 
a difference. Are there any recommended workarounds?

Thanks for the support,

Mark Kozak

Director of Engineering

Adeptus Cyber Solutions

Adeptus-CS.com
