RE: Re: Re: Bindy plus Unicode

dev Sat, 25 Jan 2020 08:42:26 -0800


Hi Alex,
 
well, which would then be the appropriate branch? Master or 3.x?
I guess if I create a ticket, I will be informed by e-mail about what
happens to it, right?
I think there could be a ticket plus a PR within the next two weeks.


A word on ICU4J. Of course I understand that an Apache project has to be
careful, but there are features, like splitting strings into graphemes, that
need functionality the old logic in the JDK doesn't support. The library is
very common (e.g. LibreOffice uses it) and AFAIK the de facto standard for
working with elaborate Unicode.
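
To make the grapheme point concrete, here is a minimal sketch that counts
UTF-16 chars, code points, and grapheme clusters for the same string. It is
only an illustration: it uses the JDK's own java.text.BreakIterator so that it
runs without extra dependencies, and the class and method names
(GraphemeSketch, graphemeCount) are made up for this example. ICU4J offers a
BreakIterator class with a similar shape in com.ibm.icu.text, and how well the
JDK one handles newer emoji sequences depends on the Java version.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class, not part of Camel or ICU4J.
class GraphemeSketch {

    // Counts user-perceived characters (grapheme clusters) by walking
    // the character boundaries reported by the JDK's BreakIterator.
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 (combining acute accent), then 'a'
        String s = "e\u0301a";
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("graphemes:   " + graphemeCount(s));
    }
}
```

For this input the char and code point counts are both 3, while the grapheme
count is 2, because the base letter and its combining accent form one
user-perceived character.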

-- Mik
 
----
Sent: Friday, 24 January 2020 at 19:15
From: "Alex Dettinger" <aldettin...@gmail.com>
To: users@camel.apache.org
Subject: Re: Re: Bindy plus Unicode
Hi Michael,

Good to know that you sorted it out :) The compatibility between the
ICU4J license and the Apache License is not straightforward; we would need
to look closer.
Still, creating a quick ticket and sharing a GitHub project would make it
possible to save your work, and it may be of interest to the community
later on.
If one provided a PR against 3.x, chances are that it could be back-ported
to 2.x. Please keep the time frame in mind, as 2.x may close at the end of
this year.

Alex

On Fri, Jan 24, 2020 at 5:20 PM Michael Greulich <d...@greulich-online.eu>
wrote:

>
> Hi Alex,
>
> well, your comment was already very helpful. I created a custom DataFormat
> and ModelFactory from the default ones for FixedLength. Of course I obeyed
> the terms of the Apache license ;-) For some aspects of recognizing
> chars, I used the ICU4J library, because the support for some things (e.g.
> emojis) in the Java runtime is not up to date. The ICU license is quite
> permissive, too. I have no idea whether this is a problem for an Apache
> project...
>
> Well, I think I’m not the only one who has this use case, so I think
> this can be useful for the community, too. Currently I’m under pressure,
> but I think I will create a JIRA ticket once the stress has eased. If
> the community is interested, I can provide the code of my solution and
> would be glad if this goes upstream (i.e. into the Camel distro) some
> day.
>
> Currently we (the company I work for) are using Camel 2.2, and I guess this
> will be the case for some time. If this feature or bug fix (I am not sure
> which it actually is and will leave that decision to the community) is
> accepted, in which version will it be included? Only Camel 3.x, or will it
> be back-ported to 2.2?
>
> -- Mik
>
> --------------------------------------------------------------------------
> Sent: Friday, 24 January 2020 at 11:43
> From: "Alex Dettinger" <aldettin...@gmail.com>
> To: users@camel.apache.org
> Subject: Re: Bindy plus Unicode
> Hi Michael,
>
> I was just looking at this component for another purpose, and it looks
> to me that fixed-length tokenization occurs here:
>
>
> https://github.com/apache/camel/blob/master/components/camel-bindy/src/main/java/org/apache/camel/dataformat/bindy/BindyFixedLengthFactory.java#L212..L216
> So it counts in Java chars and not code points. You could experiment
> with injecting a custom BindyFixedLengthFactory via
> dataFormat.setModelFactory(..).
>
> If you feel that an extension point to customize the counting/selection of
> chars/code points/graphemes would be valuable to the community, feel free
> to raise a JIRA ticket.
>
> Alex
>
>
> On Fri, Jan 24, 2020 at 9:52 AM Michael Greulich
> <mich...@greulich-online.eu> wrote:
>
> > Hi,
> >
> > I’m having problems with the bindy component and wonder if there is
> > something I missed. Maybe someone can help me address it. I cannot
> > believe that I’m the first to hit this problem.
> >
> > I need to port an EAI application built using bindy that reads a
> > fixed-length file(*), converts it, and sends the data somewhere else.
> > Currently this file is in Latin-1 encoding, but we need to move it to
> > Unicode, effectively UTF-8. We have an ugly but effectively unavoidable
> > legacy application that creates the file.
> >
> > Unicode is a bit tricky when it comes to counting the length of a string,
> > especially since Java uses UTF-16 internally, which means 1 or 2 (Java)
> > chars depending on the code point. Bindy seems to use substring
> > internally for selection and counts chars like Java does. This means the
> > length of a string is the count of chars, i.e. UTF-16 code units, not
> > code points, which are the common denominator (e.g. see the definition of
> > string length in XML Schema). And when one takes combining chars into
> > account (one “base char” plus 0 to n combining chars are perceived as one
> > “char” by users), it becomes even more of a problem.
> >
> > Is there a way to tell bindy how it counts and selects the tokens based
> > on char counts in a given line? Any suggestions? Is there a related bug
> > or an upcoming change that addresses this problem?
> >
> > -- Mik
> >
> > (*) This means that certain data (columns, if you will) start at certain
> > positions.
> >
> >
>
>
>
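
As an illustration of the counting problem raised at the start of this thread,
the following sketch contrasts a char-based substring with a selection that
counts code points. It is not Bindy's code and not the custom factory
mentioned above; the class and method names (CodePointSlice, slice) are
hypothetical, and only standard java.lang.String methods are used. It does
not address combining characters, which would need grapheme-level handling.

```java
// Hypothetical helper, not part of Camel: selects a fixed-length field
// by code points instead of UTF-16 chars.
class CodePointSlice {

    // Returns `length` code points of `line`, starting at the 0-based
    // code point position `start`. String.offsetByCodePoints throws
    // IndexOutOfBoundsException if the line is too short.
    static String slice(String line, int start, int length) {
        int from = line.offsetByCodePoints(0, start);
        int to = line.offsetByCodePoints(from, length);
        return line.substring(from, to);
    }

    public static void main(String[] args) {
        // 'A', then U+1F600 (stored as a surrogate pair), then "BC"
        String line = "A\uD83D\uDE00BC";

        // Char-based: the first "2 chars" end in the middle of the pair.
        String byChars = line.substring(0, 2);
        System.out.println(Character.isHighSurrogate(byChars.charAt(1)));

        // Code-point-based: the first 2 code points keep the pair intact.
        String byCodePoints = slice(line, 0, 2);
        System.out.println(byCodePoints.codePointCount(0, byCodePoints.length()));
    }
}
```

A field definition of "2 characters" thus yields a broken surrogate pair in
the first case and two complete code points in the second.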
