lsbfdump for data archaeology -- Happy Holidays! -> Your gift is: a nice LSBF Dump Utility

Mike Beckerle Fri, 15 Dec 2023 06:41:09 -0800

Enough of us are doing data archaeology (i.e., digging around in the bits)
with LSBF data these days that I created this utility.


lsbfdump is a command-line tool that creates a bit-level dump of data that
has dfdl:bitOrder="leastSignificantBitFirst".

For example, this is what it outputs:

01000110 01001100 01000101 01111111 | 0x00000000
00000000 00000001 00000001 00000010 | 0x00000004
00000000 00000000 00000000 00000000 | 0x00000008
00000000 00000000 00000000 00000000 | 0x0000000C
00000000 00111110 00000000 00000011 | 0x00000010
                  00000000 00000001 | 0x00000014

The address (in hex) is on the right. The bytes start on the right and
increase moving left and downward. The least significant bit of each byte
is on the right (as people usually write numbers).
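
To make that layout concrete, here is a minimal Scala sketch of the
formatting just described. It is not the code from the daffodil-extra
repository; the names LsbfDumpSketch, bits, and dumpLines are made up for
illustration, and it assumes 4 bytes per row as in the example above.

```scala
object LsbfDumpSketch {
  // Render one byte as 8 binary digits, least significant bit on the right.
  def bits(b: Byte): String = {
    val s = Integer.toBinaryString(b & 0xFF)
    ("0" * (8 - s.length)) + s
  }

  // Hypothetical helper: one output line per group of `perRow` bytes.
  // Within a row the lowest-addressed byte is placed rightmost, and the
  // address printed on the right is the address of that rightmost byte.
  def dumpLines(data: Array[Byte],
                startAddress: Long = 0L,
                perRow: Int = 4): List[String] = {
    val rowWidth = perRow * 9 - 1 // 8 digits per byte plus separating spaces
    data.grouped(perRow).zipWithIndex.map { case (row, i) =>
      val body = row.reverse.map(bits).mkString(" ")
      // A short final row is right-aligned so its bytes stay in column.
      val padded = (" " * (rowWidth - body.length)) + body
      val addr = startAddress + i.toLong * perRow
      f"$padded | 0x$addr%08X"
    }.toList
  }
}
```

Calling LsbfDumpSketch.dumpLines on the four bytes 0x7F 0x45 0x4C 0x46
should give the first row of the example above.
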

This is not an official Apache release. You have to clone the
daffodil-extra repository (via 'git clone
https://github.com/apache/daffodil-extra.git') and build it yourself, and
the repository is not tagged.
(The whole daffodil-extra repo on GitHub is for these sorts of unofficial
side-pony projects and examples.)

But building this is easy, and creates a small-ish native binary executable
(less than 10MB in size) via the very cool *sbt native image* plugin (not
to be confused with 'scala native', which I tried without success). The sbt
native image plugin pulls down and uses GraalVM technology under the hood.

Caveat: I have only built this on Linux. I have not tried MS-Windows, but
the sbt-native-image plugin and GraalVM documentation say that this will
work. (I hope someone tries this, or maybe even contributes automated setup
for the repo to test this on Linux and Windows on every commit.)

See: https://github.com/apache/daffodil-extra/tree/main/lsbfdump

I hope you find this useful.

Interestingly, I used generative AI, specifically ChatGPT 4.0, to create the
first draft of this. I subsequently modified it to clean it up, but it
was a huge time saver.

The rest of this message is about generative AI.

If you are interested in generative-AI tools for programming assistance,
here's a link to the whole chat session I did to create this lsbfdump tool:
Scala Binary Bytes Display
<https://chat.openai.com/share/44944901-f1fb-4e53-8e87-042584ac61f5> .
If you have not used ChatGPT for coding before, you may find it of interest.
I also recently tried Google Bard (updated just last week), and it now seems
to generate useful and interesting code as well, and even includes
reference citations to its sources of knowledge.

To whet your appetite, here are some prompts I gave ChatGPT 4.0. This was
kind of like 'pair programming' where it was writing the code, and I was
telling it what to change or add.

First... "Scala program to display bytes 4 at a time right to left
ordering, as binary bits."

(creates and explains program)

Next... "Can you create all the files for a complete scala-native project
that implements this taking a file name and an offset into that file as
where to start displaying the bytes?"

(modifies program accordingly)

Next... "Two changes. One make the offset argument optional. Two, add an
option to append the address, the address of the first byte (rightmost
byte) of each line, on the right side of each line either in decimal or in
hex."

(modifies program accordingly)

Next... "A few more changes. Add another optional argument which is the
number of bytes to print with the default value of 128 if not provided.
Change the --address to --noAddress and invert the sense so that you get
the address displayed by default, and supplying --noAddress turns it off.
Do not display the byte in hex before the bits. Do not display a dashed
line between rows. Change the name of the scala object to LSBFDump"

(modifies program accordingly)

I had to ask it to correct some mistakes it made, which it did smoothly,
and at some point it was clear the remaining changes were easier for me to
just do than to ask the bot to do, so I took the code and modified it by
hand after that.

Some things it did were very nice. I told it to "add a --help option which
displays an explanation of usage" and then to "add the usage examples to
the help text" and voila, it added them and created a nice help text:

  def printHelp(): Unit = {
    println(
      """Usage: LSBFDump --file <filename> [--offset <offset>] [--length <numBytes>] [--noAddress] [--help]
        |
        |<filename>   : The file to read bytes from or '-' for standard input.
        |[offset]     : The starting offset in the file (default is 0).
        |[length]     : The number of bytes to display (default is 128).
        |--noAddress  : Do not display the address of each byte line.
        |--help       : Display this help information.
        |
        |Examples:
        | Default usage (128 bytes from standard input, starting at offset 0, with addresses):
        |   ./LSBFDump --file -
        |
        | With specific file, offset and byte count:
        |   ./LSBFDump --file filename --offset 10 --length 64
        |
        | With --noAddress to hide addresses:
        |   ./LSBFDump --file filename --offset 10 --length 64 --noAddress
        |""".stripMargin)
  }
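
For anyone curious how options like these might be handled, here is a
small sketch of a parser for the flags listed in that help text. It is an
illustration only, not the parsing code in the repository; ArgsSketch,
Config, and parse are hypothetical names, and the defaults (offset 0,
length 128, addresses shown) are taken from the help text.

```scala
import scala.util.Try

object ArgsSketch {
  // Hypothetical settings record; the field names are illustrative only.
  final case class Config(
    file: Option[String] = None,
    offset: Long = 0L,
    length: Int = 128,
    showAddress: Boolean = true,
    help: Boolean = false)

  // Accept only non-negative integers for --offset and --length.
  private def nonNegative(s: String): Option[Long] =
    Try(s.toLong).toOption.filter(_ >= 0)

  // Walk the argument list; Left carries an error message for the caller.
  def parse(args: List[String],
            cfg: Config = Config()): Either[String, Config] =
    args match {
      case Nil => Right(cfg)
      case "--help" :: rest => parse(rest, cfg.copy(help = true))
      case "--noAddress" :: rest => parse(rest, cfg.copy(showAddress = false))
      case "--file" :: name :: rest => parse(rest, cfg.copy(file = Some(name)))
      case "--offset" :: n :: rest =>
        nonNegative(n) match {
          case Some(v) => parse(rest, cfg.copy(offset = v))
          case None => Left(s"--offset expects a non-negative integer: $n")
        }
      case "--length" :: n :: rest =>
        nonNegative(n).filter(_ <= Int.MaxValue) match {
          case Some(v) => parse(rest, cfg.copy(length = v.toInt))
          case None => Left(s"--length expects a non-negative integer: $n")
        }
      case other :: _ => Left(s"Unrecognized or incomplete option: $other")
    }
}
```

Returning an Either keeps the error reporting in one place: the caller can
print the message together with the help text on a Left, and go on to do
the dump on a Right.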


Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com
