AI/LLMs - was: Re: BLOB feature - is it being used?

Mike Beckerle Fri, 21 Feb 2025 12:11:12 -0800

Consistent with my experience. There's nowhere near enough DFDL material
on GitHub or Stack Overflow (yet) for an LLM to really create DFDL from a
spec, and I've had little luck uploading format specs and asking it
questions about them.


Just asking GPT4o about DFDL is quite interesting though.

Here's a chat I did:
https://chatgpt.com/share/67b8bf26-1814-800f-95d6-d08ac63e14b8

I actually tried to use GPT4o to help generate a property index into
the PDF so that one could quickly jump to properties. It can't even list
the properties in DFDL, but the attempt was very interesting. DFDL
has a property named "finalTerminatorCanBeMissing". If you read that once
and grokked the concept, but couldn't remember the exact name, then
"terminatorCanBeRequired" is not a bad guess, and that was GPT4o's
guess. That suggests it has this material organized somehow as
concepts, not as a bunch of snippets to regurgitate. When I followed up with
"I haven't heard of the dfdl:terminatorCanBeRequired before. How does that
work?" it corrected its mistake and got the name
dfdl:finalTerminatorCanBeMissing this time.

So the challenge is this: if you know DFDL well, you can spot its mistakes.
If you are a naive user trying to learn DFDL, it gets quite a bit right,
but you can't tell which parts are wrong, so the mistakes it does make are
perhaps problematic.

I believe I can prompt GPT4o to create a DFDL schema for me. Since I know
DFDL, this is not faster than typing it myself. But if you don't know DFDL,
you can ask it to do lots of things that would save time, like 'move that
local simple type definition to a separate file and name it "intType2"'.
It is going to do that sort of thing just fine, particularly because
that mostly just uses knowledge of XSD.
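
To make the "move that local simple type" request concrete, here is a minimal sketch of what such a refactoring might look like. Everything except the name intType2 is an assumption for illustration: the element name count, the restriction, and the file names main.dfdl.xsd and types.dfdl.xsd are hypothetical, the schemas have no target namespace, and DFDL format annotations are left out so that only the XSD mechanics show. The block holds three separate schema documents, not one file.

```xml
<!-- Before: main.dfdl.xsd, with an anonymous local simple type -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="count">
    <xs:simpleType>
      <xs:restriction base="xs:int">
        <xs:maxInclusive value="9999"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>

<!-- After: types.dfdl.xsd (hypothetical file) holds the now-named type -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="intType2">
    <xs:restriction base="xs:int">
      <xs:maxInclusive value="9999"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

<!-- After: main.dfdl.xsd includes that file and refers to the type by name -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="types.dfdl.xsd"/>
  <xs:element name="count" type="intType2"/>
</xs:schema>
```

With a target namespace, the type reference would need a prefix bound to that namespace, which is the kind of detail worth checking in whatever the LLM produces.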

This is all pretty cool.


On Sat, Feb 15, 2025 at 7:37 AM Claude Mamo <claude.m...@gmail.com> wrote:

> I don't know if it's possible to capture the DICOM format in DFDL
>
>
> Hah, a research team generated DICOM DFDL schemas from LLMs:
>
>
> https://github.com/narfindustries/llm-tests-langsec/blob/main/results/1.0/DICOM/dicom-claude-3-5-haiku-20241022.xsd
>
> https://github.com/narfindustries/llm-tests-langsec/blob/main/results/1.0/DICOM/dicom-claude-3-5-sonnet-20241022.xsd
>
> https://github.com/narfindustries/llm-tests-langsec/blob/main/results/1.0/DICOM/dicom-deepseek-ai-deepseek-v3.xsd
>
> https://github.com/narfindustries/llm-tests-langsec/blob/main/results/1.0/DICOM/dicom-gemini-1.5-flash.xsd
>
> https://github.com/narfindustries/llm-tests-langsec/blob/main/results/1.0/DICOM/dicom-gpt-4-turbo.xsd
>
> https://github.com/narfindustries/llm-tests-langsec/blob/main/results/1.0/DICOM/dicom-gpt-4o.xsd
>
> https://github.com/narfindustries/llm-tests-langsec/blob/main/results/1.0/DICOM/dicom-meta-llama-llama-3.3-70b-instruct-turbo.xsd
>
> The schema from gemini particularly stands out ;)
>
> On Fri, Jan 10, 2025 at 7:04 PM Claude Mamo <claude.m...@gmail.com> wrote:
>
>> My 2 cents. I reckon this BLOB feature would be useful to have as part of
>> the spec long-term. In healthcare integration, DICOM to FHIR integration is
>> a use case which I came across. I don't know if it's possible to capture
>> the DICOM format in DFDL (it seems doable at first glance), but suppose it
>> is, then I can easily imagine a situation where the integrator wants to
>> lift the metadata from the DICOM file to create resources on a FHIR server
>> and dump the pixel data somewhere else.
>>
>> Claude
>>
>> On Fri, Jan 10, 2025 at 4:50 PM Mike Beckerle <mbecke...@apache.org>
>> wrote:
>>
>>> Very helpful thanks Mark. I can cite this thread of emails as support
>>> for the BLOB feature to be added to DFDL v2.0.
>>>
>>> On Fri, Jan 10, 2025 at 10:46 AM Mark Kozak <mark.ko...@adeptus-cs.com>
>>> wrote:
>>>
>>>> Yes, I agree, but…
>>>>
>>>> When I say ‘other image processing software’, I am not talking about
>>>> Photoshop or other ‘standard’ commercial applications that require a
>>>> well-formed image file like a JFIF. I have files that use various
>>>> compression algorithms, JPEG 2000 for example. I can write that
>>>> compressed pixel blob to a file and use a JPEG 2000 library to decompress
>>>> it, so I can work with the actual pixel values outside the DFDL dataflow.
>>>> Once decompressed, the processing could be Python scripts or any other
>>>> pixel processing software. It’s also helpful in examining blob payloads
>>>> that are supposed to be image data but are not behaving as one might
>>>> expect. I can use a variable to turn on image debug mode to get the
>>>> image blob to a file for examination.
>>>>
>>>>
>>>>
>>>> So, yes, there are times I do both a and b.
>>>>
>>>> I hope that helps.
>>>>
>>>>
>>>>
>>>> -Mark
>>>>
>>>>
>>>>
>>>> *From:* Mike Beckerle <mbecke...@apache.org>
>>>> *Sent:* Friday, January 10, 2025 10:23 AM
>>>> *To:* users@daffodil.apache.org
>>>> *Subject:* Re: BLOB feature - is it being used?
>>>>
>>>>
>>>>
>>>> Thanks Mark,
>>>>
>>>> I have a question, or rather, I really don't understand how one parses
>>>> image data while also using BLOBs to process the image content.
>>>>
>>>> This is from the Wiki page describing the BLOB feature:
>>>>
>>>> *A variety of data formats such as for image and video files, consist
>>>> of fields of what is effectively metadata, surrounding large blocks of data
>>>> containing compressed image or video data.*
>>>>
>>>>
>>>>
>>>> *An important use case for DFDL is to expose this metadata for easy
>>>> use, and to provide access to the large data via a streaming mechanism akin
>>>> to opening a file, rather than including large chunks of a hexBinary string
>>>> in the infoset, as is common today.*
>>>>
>>>> The above suggests BLOBs can be used to keep a giant array of pixel
>>>> bytes out of memory. So far so good.
>>>>
>>>>
>>>>
>>>> But if you are trying to both
>>>>
>>>>
>>>>
>>>> (a) expose and inspect/sanitize the image metadata, and also
>>>>
>>>> (b) process the image (e.g., to remove steganography),
>>>>
>>>>
>>>>
>>>> then I don't see how this works. Standard image processing libraries
>>>> are going to want the entire image "file", not just the pixel data bytes
>>>> from somewhere down inside that file.  That implies that the BLOB isn't
>>>> just the blob of pixel data, but rather the BLOB must be the entire image
>>>> "file" extracted from within the surrounding data envelope. But that
>>>> implies that you are not using a DFDL schema to parse the image field by
>>>> field so as to inspect/sanitize the metadata fields.
>>>>
>>>>
>>>>
>>>> In other words, DFDL+BLOB extension will let you do (a) or (b) but not
>>>> both.
>>>>
>>>>
>>>>
>>>> Do I have this right, or am I misunderstanding the use case?
>>>>
>>>>
>>>>
>>>> Thanks for any info
>>>>
>>>>
>>>>
>>>> -mike beckerle
>>>>
>>>> On Thu, Jan 9, 2025 at 4:02 PM Mark Kozak <mark.ko...@adeptus-cs.com>
>>>> wrote:
>>>>
>>>> I have occasionally used this feature to get pixel data (other than NITF)
>>>> out to a file so that it can be processed using other image processing
>>>> software.
>>>>
>>>> -----Original Message-----
>>>> From: Mike Beckerle <mbecke...@apache.org>
>>>> Sent: Thursday, January 9, 2025 2:14 PM
>>>> To: users@daffodil.apache.org
>>>> Subject: BLOB feature - is it being used?
>>>>
>>>> Is anyone using the experimental BLOB feature in Daffodil?
>>>> (https://s.apache.org/daffodil-blob-feature)
>>>>
>>>> If so, please reply, or you can email me directly.
>>>>
>>>> This BLOB feature was added as we thought it would be used for the
>>>> pixels of images.
>>>>
>>>> I've not seen any questions about it or discussion since it got
>>>> implemented.
>>>>
>>>> I do know that it is used in the NITF DFDL schema on github, but the
>>>> test data for that schema does *not* use that element at all, so
>>>> nothing that is part of that schema exercises the feature.
>>>>
>>>> I ask because this extension to DFDL v1.0, if used, would be a strong
>>>> candidate for inclusion in the next version of the DFDL specification
>>>> (from OGF and ISO).
>>>> But.... if nobody is using the BLOB feature, that means other
>>>> techniques are sufficient, and then there will be push back within the
>>>> DFDL Workgroup against adding this feature to DFDL as part of the
>>>> standard.
>>>>
>>>> Personally, I have used this idea for "blobs" of data:
>>>>
>>>> <element name="pixels" dfdl:lengthKind="explicit"
>>>>          dfdl:length='{ ... the big blob length ...}'>
>>>>   <complexType>
>>>>     <sequence>
>>>>       <element name="blob" dfdl:lengthKind="implicit">
>>>>         <!-- this blob element allows a
>>>>              dfdl:outputValueCalc='{ dfdl:contentLength(..../pixels/blob) }'
>>>>              to work to capture the length when unparsing -->
>>>>         <complexType>
>>>>           <sequence>
>>>>             <!-- Avoid giant lines. This is XML. Users *may* want to
>>>>                  open it in a text editor.
>>>>                  Note max size of blob is 100000100.
>>>>             -->
>>>>             <element name="a" type="xs:hexBinary" minOccurs="0"
>>>>                      maxOccurs="1000000" dfdl:lengthKind="explicit"
>>>>                      dfdl:length="100" dfdl:occursCountKind="implicit"/>
>>>>             <element name="last" type="xs:hexBinary" minOccurs="0"
>>>>                      maxOccurs="1" dfdl:lengthKind="delimited"/>
>>>>           </sequence>
>>>>         </complexType>
>>>>       </element>
>>>>     </sequence>
>>>>   </complexType>
>>>> </element>
>>>>
>>>> That combined with the use of EXI to avoid the XML text bloat seems
>>>> like it would address most needs.
>>>>
>>>>
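
For comparison with the hexBinary chunking idea quoted above, here is a rough sketch of how the same pixel data might be declared using the experimental BLOB extension. It rests on assumptions to check against the wiki page linked in the original message: that a BLOB is an xs:anyURI element marked with dfdlx:objectKind="bytes", and that the dfdlx prefix is bound to the DFDL extensions namespace shown. The sibling element pixelByteCount is hypothetical, and the usual dfdl:format defaults are omitted.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
           xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions">
  <xs:element name="image">
    <xs:complexType>
      <xs:sequence>
        <!-- hypothetical metadata field giving the pixel data size in bytes -->
        <xs:element name="pixelByteCount" type="xs:unsignedInt"/>
        <!-- assumed BLOB spelling: the infoset would carry a URI to a file
             holding the bytes, not the bytes themselves -->
        <xs:element name="pixels" type="xs:anyURI"
                    dfdlx:objectKind="bytes"
                    dfdl:lengthKind="explicit"
                    dfdl:lengthUnits="bytes"
                    dfdl:length="{ ../pixelByteCount }"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The design difference from the chunked hexBinary approach is where the bytes end up: in a separate file that other software can open, which matches the "get pixel data out to a file" usage described in the thread.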
