Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-05 Thread rhkramer
On Wednesday, April 04, 2018 02:45:49 PM Don Armstrong wrote: > On Wed, 04 Apr 2018, rhkra...@gmail.com wrote: > > I've considered maildir--it meets some of my requirements (that is, to > > make something close to an askSam workalike), but one drawback is that > > it is essentially one email (i.e.,

Re: Invalid UTF-8 byte?

2018-04-04 Thread Michael Stone
On Thu, Apr 05, 2018 at 09:42:19AM +1200, Ben Caradoc-Davies wrote: On 05/04/18 02:09, to...@tuxteam.de wrote: Try UTF-16, what Microsoft (and a couple of years ago Apple) love to call "Unicode": in more "Western" contexts every second byte is NULL! The Java platform uses UTF-16 internally:

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread deloptes
rhkra...@gmail.com wrote: > I'll probably look into notmuch, just for kicks. > > I've considered maildir--it meets some of my requirements (that is, to > make something close to an askSam workalike), but one drawback is that it > is essentially one email (i.e., my "record").  One of the desirable

Re: Invalid UTF-8 byte?

2018-04-04 Thread Ben Caradoc-Davies
On 05/04/18 02:09, to...@tuxteam.de wrote: Try UTF-16, what Microsoft (and a couple of years ago Apple) love to call "Unicode": in more "Western" contexts every second byte is NULL! The Java platform uses UTF-16 internally: "The char data type (and therefore the value that a Character object
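
A small Python sketch of the point quoted above, that UTF-16 text in mostly-ASCII ("Western") contexts has a zero byte in every second position. The sample string is arbitrary, and little-endian byte order without a BOM is an assumption made only to keep the output simple.

```python
# Illustration only: the sample text and the choice of little-endian
# UTF-16 without a BOM are assumptions, not taken from the thread.
text = "Debian"

utf16 = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

# In UTF-16-LE each ASCII character becomes its own byte followed by 0x00.
zero_positions = [i for i, b in enumerate(utf16) if b == 0]

print(utf16)
print(zero_positions)
print(0 in utf8)  # the UTF-8 form of the same text has no zero byte
```

This is also why a NUL byte cannot serve as a record separator for UTF-16 data, while it can for ASCII or UTF-8 text that contains no NUL characters.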

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 01:36:15 PM Don Armstrong wrote: > On Tue, 03 Apr 2018, rhkra...@gmail.com wrote: > > I am building (have built several iterations) of a free format > > database to work something like askSam. It is a mashup of several > > applications, things like recoll, kmail, nail, k

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 02:10:16 PM Jonathan de Boyne Pollard wrote: > rhkramer: > > The reason I wanted such a byte was to use it as a record separator in > > a set of text files (that I use as an askSam "workalike" (or > > "worksimilar") so that I could use msort (which depends on a 1 byte >

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread tomas
On Wed, Apr 04, 2018 at 03:44:23PM -0300, Henrique de Moraes Holschuh wrote: [...] > That said, it is always safe to break valid "modified UTF-8" into > records using zeroes, as long as you don't expect the result to be valid > UTF-8 (it isn't valid

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Jonathan de Boyne Pollard
Henrique de Moraes Holschuh: Also, a text file MAY contain NULs (the character), it is just considered bad practice (nowadays?). Don't assume you won't see any. For example, received e-mail is *more* likely to have NULs in it than normal text due to the quality of some mail agents out there.

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Don Armstrong
On Wed, 04 Apr 2018, rhkra...@gmail.com wrote: > I've considered maildir--it meets some of my requirements (that is, to > make something close to an askSam workalike), but one drawback is that > it is essentially one email (i.e., my "record"). One of the desirable > features of askSam is that you d

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Jonathan de Boyne Pollard
rhkramer: Where were you in 2000 when I started the project? I cannot speak for anyone else, but I was probably once again giving a frequently given answer that I eventually put up on a WWW page. http://jdebp.eu./FGA/mail-mbox-formats.html

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Henrique de Moraes Holschuh
On Wed, 04 Apr 2018, to...@tuxteam.de wrote: > On Wed, Apr 04, 2018 at 08:18:23AM -0300, Henrique de Moraes Holschuh wrote: > > On Tue, 03 Apr 2018, Michael Lange wrote: > > > I believe (please anyone correct me if I am wrong) that "text" files > > > won't contain any null byte; many text editors e

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 01:36:15 PM Don Armstrong wrote: > On Tue, 03 Apr 2018, rhkra...@gmail.com wrote: > > I am building (have built several iterations) of a free format > > database to work something like askSam. It is a mashup of several > > applications, things like recoll, kmail, nail, k

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Jonathan de Boyne Pollard
rhkramer: The reason I wanted such a byte was to use it as a record separator in a set of text files (that I use as an askSam "workalike" (or "worksimilar") so that I could use msort (which depends on a 1 byte record separator to --separate the records ;-) while sorting. Some of the files alr

Re: mbox vs maildir vs better formats [Re: Invalid UTF-8 byte? (was: Re: utf)]

2018-04-04 Thread Nicolas George
Don Armstrong (2018-04-04): > There are definitely better formats than Maildir, like Dovecot's > multi-dbox.[1] > > These issues are why almost everyone who uses Maildir just uses it as > the backing message store and uses the index on top to do avoid ever > reading all of the messages in the Mail

mbox vs maildir vs better formats [Re: Invalid UTF-8 byte? (was: Re: utf)]

2018-04-04 Thread Don Armstrong
On Wed, 04 Apr 2018, Nicolas George wrote: > Don Armstrong (2018-04-04): > > You should consider looking at using Maildir with notmuch and using > > things which integrate notmuch.[1] > > Maildir is not that much better than mbox. Sure, it eliminates most of > its worse flaws, but it brings flaws

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Nicolas George
Don Armstrong (2018-04-04): > You should consider looking at using Maildir with notmuch and using > things which integrate notmuch.[1] Maildir is not that much better than mbox. Sure, it eliminates most of its worse flaws, but it brings flaws of its own, like trashing the inode and dentries caches

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Don Armstrong
On Tue, 03 Apr 2018, rhkra...@gmail.com wrote: > I am building (have built several iterations) of a free format > database to work something like askSam. It is a mashup of several > applications, things like recoll, kmail, nail, kate and the data is > stored in mbox formatted files. > > Each record

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 10:24:06 AM Greg Wooledge wrote: > On Wed, Apr 04, 2018 at 04:15:48PM +0200, Andre Majorel wrote: > > On 2018-04-04 14:55 +0200, Nicolas George wrote: > > > I have given you advice (for free), you are not taking it. Too bad for > > > you. Good day. > > > > Is advice th

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 10:15:48 AM Andre Majorel wrote: > On 2018-04-04 14:55 +0200, Nicolas George wrote: > > I have given you advice (for free), you are not taking it. Too bad for > > you. Good day. > > Is advice that comes with condescension truly free ? Thank you!

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Greg Wooledge
On Wed, Apr 04, 2018 at 04:15:48PM +0200, Andre Majorel wrote: > On 2018-04-04 14:55 +0200, Nicolas George wrote: > > > I have given you advice (for free), you are not taking it. Too bad for > > you. Good day. > > Is advice that comes with condescension truly free ? Any advice that stops the OP

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Andre Majorel
On 2018-04-04 14:55 +0200, Nicolas George wrote: > I have given you advice (for free), you are not taking it. Too bad for > you. Good day. Is advice that comes with condescension truly free ? -- André Majorel I trust bugs.debian.org to not publish my email addr

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread tomas
On Wed, Apr 04, 2018 at 08:18:23AM -0300, Henrique de Moraes Holschuh wrote: > On Tue, 03 Apr 2018, Michael Lange wrote: > > I believe (please anyone correct me if I am wrong) that "text" files > > won't contain any null byte; many text editors even re

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
On Wednesday, April 04, 2018 08:26:41 AM Greg Wooledge wrote: > On Wed, Apr 04, 2018 at 01:23:25PM +0200, Nicolas George wrote: > > rhkra...@gmail.com (2018-04-03): > > > and the data is stored in mbox formatted files. > > > > DO NOT DO THAT. > > > > This is the only goo

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Nicolas George
rhkra...@gmail.com (2018-04-04): > I'll convert the file format after you convert the programs to work with the > different file format. Those programs include kmail, nail, (essentially all > email programs that use mbox as the file format), recoll (conversion should > not > be difficult), var

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
I'll convert the file format after you convert the programs to work with the different file format. Those programs include kmail, nail, (essentially all email programs that use mbox as the file format), recoll (conversion should not be difficult), various editors (nedit, kate, for which I've wr

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Greg Wooledge
On Wed, Apr 04, 2018 at 01:23:25PM +0200, Nicolas George wrote: > rhkra...@gmail.com (2018-04-03): > > and the data is stored in mbox formatted files. > > DO NOT DO THAT. > > This is the only good advice you can have for that project. Store your > data in a decent form

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Nicolas George
rhkra...@gmail.com (2018-04-04): > Sorry, I already have 300 MB plus stored in that format. Then convert. Small extra work now. Many less headaches later. Regards, -- Nicolas George

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread rhkramer
Sorry, I already have 300 MB plus stored in that format. Where were you in 2000 when I started the project? On Wednesday, April 04, 2018 07:23:25 AM Nicolas George wrote: > rhkra...@gmail.com (2018-04-03): > > and the data is stored in mbox formatted files. > > DO NO

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Nicolas George
rhkra...@gmail.com (2018-04-03): > and the data is stored in mbox formatted files. DO NOT DO THAT. This is the only good advice you can have for that project. Store your data in a decent format. Regards, -- Nicolas George

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-04 Thread Henrique de Moraes Holschuh
On Tue, 03 Apr 2018, Michael Lange wrote: > I believe (please anyone correct me if I am wrong) that "text" files > won't contain any null byte; many text editors even refuse to open such a Depends on the encoding. For ASCII, ISO-8859-* and UTF-8 (and any other modern encoding AFAIK, other than mo

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread rhkramer
On Tuesday, April 03, 2018 08:30:04 AM Greg Wooledge wrote: > WHAT ARE YOU TRYING TO DO? I am building (have built several iterations) of a free format database to work something like askSam. It is a mashup of several applications, things like recoll, kmail, nail, kate and the data is stored in

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread rhkramer
On Tuesday, April 03, 2018 08:30:04 AM Greg Wooledge wrote: > > Addendum: iirc (again please correct me if I am wrong) unix file names > > may contain (at least in theory) any byte except 2F (the slash) and the > > null byte. So if your text files might contain arbitrary file names there > > may be

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread Michael Lange
On Tue, 3 Apr 2018 15:47:57 -0400 Greg Wooledge wrote: > On Tue, Apr 03, 2018 at 09:36:42PM +0200, Michael Lange wrote: > > From what i have understood I think the OP should certainly at least, > > whatever the files they want to include exactly look like and > > whichever byte they choose as de

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread Greg Wooledge
On Tue, Apr 03, 2018 at 09:36:42PM +0200, Michael Lange wrote: > From what i have understood I think the OP should certainly at least, > whatever the files they want to include exactly look like and whichever > byte they choose as delimiter, scan the file first for such a byte and if > it is actua
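
The suggestion quoted here, scanning a file for the chosen delimiter byte before relying on it, could be sketched in Python roughly as follows. The function name `count_byte`, the chunk size, and the throwaway sample file are all assumptions made for the illustration.

```python
import os
import tempfile


def count_byte(path: str, delimiter: int, chunk_size: int = 1 << 20) -> int:
    """Count occurrences of one byte value in a file (hypothetical helper).

    The file is read in binary chunks so large mailboxes need not fit
    in memory.
    """
    needle = bytes([delimiter])
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += chunk.count(needle)
    return total


# Demo on a temporary file; the content is invented sample data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first record\nsecond record\x1ewith a stray RS byte\n")
    sample_path = tmp.name

nul_count = count_byte(sample_path, 0x00)  # candidate: NUL
rs_count = count_byte(sample_path, 0x1E)   # candidate: ASCII Record Separator
os.remove(sample_path)

print(nul_count, rs_count)
```

A count of zero means only that the byte is unused in that file today; data added later would need the same check.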

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread Michael Lange
Hi, On Tue, 3 Apr 2018 14:32:08 +0200 wrote: > > > Probably it is the same with some other control characters like 04 > > > (End of Transmission). When I look at > > > https://en.wikipedia.org/wiki/ASCII it seems like 1C (File > > > Separator) or 1E (Record Separator) might be appropriate choice

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread tomas
On Tue, Apr 03, 2018 at 02:14:07PM +0200, Michael Lange wrote: > On Tue, 3 Apr 2018 13:58:33 +0200 > Michael Lange wrote: > > > I believe (please anyone correct me if I am wrong) that "text" files > > won't contain any null byte; many text editors ev

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread Greg Wooledge
> Addendum: iirc (again please correct me if I am wrong) unix file names > may contain (at least in theory) any byte except 2F (the slash) and the > null byte. So if your text files might contain arbitrary file names there > may be (at least in theory) a (admittedly very small) chance that such a >

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread rhkramer
On Tuesday, April 03, 2018 07:54:35 AM Nicolas George wrote: > rhkra...@gmail.com (2018-04-03): > > Next I'll have to refresh my memory on how to replace the existing From > > with From preceded by the null character, i.e., something like: > > > > Find: \n\nFrom > > Replace with \n\n0x00\nFrom >
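
The find-and-replace quoted here could be sketched in Python on raw bytes as below. This only illustrates the pattern exactly as written in the message; the function name `mark_records` and the sample mailbox are invented, and the replies in this thread argue against the approach as a whole.

```python
def mark_records(data: bytes) -> bytes:
    """Insert a NUL byte and newline before each "\\n\\nFrom" (hypothetical helper).

    This is a plain byte substitution: it also matches body text where a
    blank line is followed by a line starting with "From", and it does
    not mark the very first message in the file.
    """
    return data.replace(b"\n\nFrom", b"\n\n\x00\nFrom")


# Invented two-message sample in mbox-like layout.
mbox = (
    b"From a@example.org Mon Apr  2 08:37:54 2018\n"
    b"Subject: one\n\nbody one\n\n"
    b"From b@example.org Tue Apr  3 07:43:02 2018\n"
    b"Subject: two\n\nbody two\n"
)

marked = mark_records(mbox)
records = marked.split(b"\x00")

print(marked.count(b"\x00"))
print(len(records))
```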

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread Michael Lange
On Tue, 3 Apr 2018 13:58:33 +0200 Michael Lange wrote: > I believe (please anyone correct me if I am wrong) that "text" files > won't contain any null byte; many text editors even refuse to open such > a file, I guess since they assume it is a "binary" file. > Probably it is the same with some ot

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread Michael Lange
Hi, On Tue, 3 Apr 2018 07:43:02 -0400 rhkra...@gmail.com wrote: > > maybe you could use the null byte? > > Thanks! > > Surprisingly (to me), this (and maybe several other of the control > characters might work--I did a search of one of the files, and there > are no null bytes. I believe (pleas

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread Nicolas George
rhkra...@gmail.com (2018-04-03): > Next I'll have to refresh my memory on how to replace the existing From with > From preceded by the null character, i.e., something like: > > Find: \n\nFrom > Replace with \n\n0x00\nFrom This is a very bad idea, and you are obviously about to reproduce the err

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-03 Thread rhkramer
On Monday, April 02, 2018 06:43:28 PM Michael Lange wrote: > On Mon, 2 Apr 2018 08:37:54 -0400 > > rhkra...@gmail.com wrote: > > A few weeks ago, I was looking for a byte that, in UTF-8, would be a > > totally invalid byte (not an invalid sequence of bytes). At the time, > > I tried some googling

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-02 Thread Michael Lange
Hi, On Mon, 2 Apr 2018 08:37:54 -0400 rhkra...@gmail.com wrote: > A few weeks ago, I was looking for a byte that, in UTF-8, would be a > totally invalid byte (not an invalid sequence of bytes). At the time, > I tried some googling, but it looked rather hopeless (maybe it was my > googling that w

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-02 Thread rhkramer
Thanks, again, to Henrique and tomas for the followups! On Monday, April 02, 2018 02:40:55 PM to...@tuxteam.de wrote: > On Mon, Apr 02, 2018 at 03:18:38PM -0300, Henrique de Moraes Holschuh wrote:

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-02 Thread tomas
On Mon, Apr 02, 2018 at 08:40:55PM +0200, to...@tuxteam.de wrote: > On Mon, Apr 02, 2018 at 03:18:38PM -0300, Henrique de Moraes Holschuh wrote: > > On Mon, 02 Apr 2018, rhkra...@gmail.com wrote: > > > The wikipedia article is rather interesting, in a

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-02 Thread tomas
On Mon, Apr 02, 2018 at 03:18:38PM -0300, Henrique de Moraes Holschuh wrote: > On Mon, 02 Apr 2018, rhkra...@gmail.com wrote: > > The wikipedia article is rather interesting, in a quick skim, I learned > > some > > interesting things about UTF-8, esp

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-02 Thread Henrique de Moraes Holschuh
On Mon, 02 Apr 2018, rhkra...@gmail.com wrote: > The wikipedia article is rather interesting, in a quick skim, I learned some > interesting things about UTF-8, especially the property of self- > synchronization. Yes, UTF-8 is a brilliant design. > I had trouble reading that large table--but if I

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-02 Thread rhkramer
Thanks to tomas and Henrique! The wikipedia article is rather interesting, in a quick skim, I learned some interesting things about UTF-8, especially the property of self- synchronization. I had trouble reading that large table--but if I simply take the red boxes at face value, maybe there are

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-02 Thread Henrique de Moraes Holschuh
On Mon, 02 Apr 2018, rhkra...@gmail.com wrote: > A few weeks ago, I was looking for a byte that, in UTF-8, would be a totally > invalid byte (not an invalid sequence of bytes). At the time, I tried some > googling, but it looked rather hopeless (maybe it was my googling that was > hopeless). 0
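
For the question quoted here, a hedged Python sketch: per RFC 3629, the bytes 0xC0, 0xC1 and 0xF5 through 0xFF never occur anywhere in well-formed UTF-8, so each of them is invalid on its own rather than only in certain sequences. The helper name `is_valid_utf8` and the sample string are made up for this illustration.

```python
# Bytes that cannot appear anywhere in well-formed UTF-8 (RFC 3629):
# 0xC0 and 0xC1 could only start overlong encodings, and 0xF5..0xFF
# would encode values beyond U+10FFFF or are not defined at all.
NEVER_IN_UTF8 = frozenset({0xC0, 0xC1} | set(range(0xF5, 0x100)))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "naive café".encode("utf-8")

# Inserting any of these bytes between two ASCII characters makes
# otherwise valid text invalid.
rejected = [b for b in sorted(NEVER_IN_UTF8)
            if not is_valid_utf8(sample[:2] + bytes([b]) + sample[2:])]

print(is_valid_utf8(sample))
print(len(NEVER_IN_UTF8))
print(rejected == sorted(NEVER_IN_UTF8))
```

Such a byte is safe as a separator only while the file is treated as raw bytes; tools that expect valid UTF-8 will reject the file once it is inserted.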

Re: Invalid UTF-8 byte? (was: Re: utf)

2018-04-02 Thread tomas
On Mon, Apr 02, 2018 at 08:37:54AM -0400, rhkra...@gmail.com wrote: > On Monday, April 02, 2018 03:39:05 AM Andre Majorel wrote: > > > Why? UTF (especially UTF-8) is vastly superior for all purposes: > > I wouldn't say that. UTF-8 breaks a number of as