Handling complex, multi-record, single-line pipe-separated text?

Andrew Thorburn Wed, 13 Nov 2013 18:06:33 -0800

First up, I'm currently building a proxy, effectively, for some
interfaces that I have to work with. On one side is a set of web
services that my application will be calling, so that I can
standardise on *something*. On the other side is a set of IBM MQ
queues that - mostly - require fixed-length records. So what I am
doing is sending a SOAP message to ServiceMix, transforming that into
a POJO, then transforming that POJO into a flat file via BeanIO, which
in turn gets sent out to MQ. That might seem a little inefficient, but
it beats writing thousands of lines of XSL to transform the XML into a
flat file.


One format in particular, however, doesn't seem to be achievable with
any of the data formats available in Camel, as far as I can tell, but
I would appreciate some advice on this front. Bindy cannot - it isn't
nearly comprehensive enough. BeanIO *almost* can, if only the records
were on separate lines. But they're not - they're all on the same
line. Neither Flatpack nor Smooks seems to handle this format either,
from what I've read.

The basic format is like a CSV file except with pipes ("|") instead of
commas. However, despite being only a single line, there are multiple
records in that line, and several of the records have a variable
number of repetitions.

Now, given that I only need to *generate* this format, not parse it, I
could probably generate it with BeanIO and do some sort of
post-processing on it to strip out the newlines or something similar
(the response is significantly simpler and contains no repeating
records - parsing that is not a problem). However, I would like to
know if there is anything out there which would support this format
properly, should I find it necessary to parse it in the future, and to
avoid hacking in something now which I will later regret.
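
As an illustration only, the post-processing idea mentioned above could
be sketched like this in plain Java. It assumes the generator has
already written one pipe-delimited record per line; the class and method
names (SingleLineJoiner, joinRecords) are made up for this example and
are not part of BeanIO or Camel.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper: collapses one-record-per-line output into a single line.
class SingleLineJoiner {

    // Splits on any line break (\R covers \n, \r\n and \r), drops blank
    // lines, and joins the remaining records with the field delimiter.
    static String joinRecords(String multiLine) {
        return Arrays.stream(multiLine.split("\\R"))
                .filter(line -> !line.isEmpty())
                .collect(Collectors.joining("|"));
    }
}
```

Joining with "|" rather than simply deleting the newlines matters here:
a record that ends in an empty field (such as "ESTATUS|") must still be
followed by a delimiter before the next record's type character.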

For example, if we take the following sample (trimmed down
significantly for brevity):

COM|US|CORP|FIELD1|DATE1|TIME1|DATE2|DATE3|TYPE|1|A|ABC123|DEF456||SURNAME|FIRSTNAME|MIDDLENAME|GENDER|DOB|ADDRESS1|ADDRESS2|ADDRESS3|ADDRESS4|ADDRESS5|A|ABC123|DEF456||SURNAME|FIRSTNAME|MIDDLENAME|GENDER|DOB|ADDRESS1|ADDRESS2|ADDRESS3|ADDRESS4|ADDRESS5|A|ABC123|DEF456||SURNAME|FIRSTNAME|MIDDLENAME|GENDER|DOB|ADDRESS1|ADDRESS2|ADDRESS3|ADDRESS4|ADDRESS5|S|1|STUFF||CODE||ESTATUS||G|ID|||SURNAME|FIRSTNAME|MIDDLENAME|GENDER|DOB|ADDRESS1|ADDRESS2|ADDRESS3|G|ID|||SURNAME|FIRSTNAME|MIDDLENAME|GENDER|DOB|ADDRESS1|ADDRESS2|ADDRESS3|G|ID|||SURNAME|FIRSTNAME|MIDDLENAME|GENDER|DOB|ADDRESS1|ADDRESS2|ADDRESS3|I|...|R|...|R|...|R|...|Y|...|Y|...|Y|...

Breaking that down, the first set of columns,
COM|US|CORP|FIELD1|DATE1|TIME1|DATE2|DATE3|TYPE|1, is effectively a
header. This appears exactly once and is not an issue.

The next set of columns,
A|ABC123|DEF456||SURNAME|FIRSTNAME|MIDDLENAME|GENDER|DOB|ADDRESS1|ADDRESS2|ADDRESS3|ADDRESS4|ADDRESS5,
represents a single record that can appear 1 to 99 times in the line.
This is where I start seeing problems. In my POJO, I would like to
represent this as a list of "A" records, and then have the data format
generate one record for each list item, but without adding a
line-break afterwards. If I were to parse this, it would need to know
that after the first record, the "COM" record, it should look at the
first character to see what the type of the record is - in this case
it is an "A" record, and there must be at least one record, and there
may be up to 99 records. In the example, I have repeated it three
times. Note that while BeanIO could, in theory, handle this as a
repeating segment, I have other repeating segments following this one.
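
To make the generating side concrete, here is a minimal hand-rolled
sketch, not a BeanIO or Camel feature. It treats the header and each "A"
record as a plain list of field values and appends them without any line
break. SingleLineWriter and write are hypothetical names; the 1 to 99
limit is the one described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical writer: header fields followed by 1..99 "A" records, all on one line.
class SingleLineWriter {

    static String write(List<String> header, List<List<String>> aRecords) {
        // The "A" record must appear at least once and at most 99 times.
        if (aRecords.isEmpty() || aRecords.size() > 99) {
            throw new IllegalArgumentException(
                    "expected 1 to 99 A records, got " + aRecords.size());
        }
        List<String> fields = new ArrayList<>(header);
        for (List<String> record : aRecords) {
            fields.addAll(record); // no record terminator, just more fields
        }
        return String.join("|", fields);
    }
}
```

The other repeating segments would be appended to the same field list in
the same way, each with its own occurrence limits.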

The next set of columns, S|1|STUFF||CODE||ESTATUS||, is repeated
exactly once, and the type of record is "S", identified by the first
character.

The next set, 
G|ID|||SURNAME|FIRSTNAME|MIDDLENAME|GENDER|DOB|ADDRESS1|ADDRESS2|ADDRESS3,
is similar to the "A" record, but can appear 0 to 99 times. I have
included it three times in this example.

The next set, I|..., is similar to "S" in that it only appears once.

The next set, R|..., can appear 0 to 99 times.

The next set, Y|..., can appear 0 to 99 times.

There are other repeating segments too - that's just a small part of
the whole record.
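
For the parsing direction, one possible approach is a small table-driven
reader: each record type is described by its tag, its number of fields
and its minimum and maximum occurrences, and the reader walks the split
fields in that order. The sketch below is an assumption-laden
illustration, not an existing library. SingleLineParser and Spec are
invented names, and the field counts a caller passes in would have to
come from the real specification (the sample above is trimmed, and the
I, R and Y records are elided).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical table-driven parser for tagged, fixed-width record groups on one line.
class SingleLineParser {

    // One record type: identifying first field, field count, occurrence limits.
    static final class Spec {
        final String tag;
        final int width, min, max;

        Spec(String tag, int width, int min, int max) {
            this.tag = tag;
            this.width = width;
            this.min = min;
            this.max = max;
        }
    }

    static Map<String, List<List<String>>> parse(String line, List<Spec> specs) {
        // Limit -1 keeps trailing empty fields instead of discarding them.
        List<String> fields = Arrays.asList(line.split("\\|", -1));
        Map<String, List<List<String>>> out = new LinkedHashMap<>();
        int pos = 0;
        for (Spec spec : specs) {
            List<List<String>> records = new ArrayList<>();
            // Consume records while the next field carries this type's tag.
            while (records.size() < spec.max
                    && pos + spec.width <= fields.size()
                    && fields.get(pos).equals(spec.tag)) {
                records.add(new ArrayList<>(fields.subList(pos, pos + spec.width)));
                pos += spec.width;
            }
            if (records.size() < spec.min) {
                throw new IllegalArgumentException(
                        "too few " + spec.tag + " records at field " + pos);
            }
            out.put(spec.tag, records);
        }
        if (pos != fields.size()) {
            throw new IllegalArgumentException("unparsed fields from index " + pos);
        }
        return out;
    }
}
```

Because every record type has a fixed number of fields, the tag is only
ever checked at a record boundary, so a data value that happens to equal
"A" or "G" in the middle of a record is not mistaken for a new record.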

I hope this makes sense - it seems like this is a particularly unusual
record format to have to deal with, so it is perhaps unsurprising that
I can't find a tool that will handle it.
