Re: Mozilla Charset Detectors

2017-05-30 Thread Gabriel Sandor
They can come from arbitrary sources that are out of my control. Therefore I may not get the charset of the original document, so all I'm left with is heuristic detection for those fragments. The application must be able to deal with any XML it receives; it doesn't impose any particular structure o

Re: Mozilla Charset Detectors

2017-05-26 Thread Daniel Veditz
On Fri, May 26, 2017 at 4:12 AM, wrote:
> Still, sometimes XML fragments come up and even if they are not 100% XML spec compliant I still have to process them. This includes encoding detection as well, when the XML declaration is missing from the fragments.

Where do the fragments come fro

Re: Mozilla Charset Detectors

2017-05-26 Thread gabi . t . sandor
On Friday, May 26, 2017 at 10:01:18 AM UTC+3, Henri Sivonen wrote:
> > Think of XML files without the "encoding" attribute in the declaration or HTML files without the meta charset tag.
>
> Per spec, these must be treated as UTF-16 if there's a UTF-16 BOM and as UTF-8 otherwise. It's highl

Re: Mozilla Charset Detectors

2017-05-26 Thread Henri Sivonen
On Thu, May 25, 2017 at 10:44 PM, wrote:
> Think of XML files without the "encoding" attribute in the declaration or HTML files without the meta charset tag.

Per spec, these must be treated as UTF-16 if there's a UTF-16 BOM and as UTF-8 otherwise. It's highly inappropriate to run heuristic d
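The rule stated in this message (UTF-16 if there is a UTF-16 BOM, UTF-8 otherwise) can be sketched in a few lines. This is only an illustration of that rule for input with no encoding declaration, not code from the thread or from any Mozilla library; the function names `pick_encoding` and `decode_fragment` are hypothetical, and Python is used here purely for brevity even though the original question concerns a C# project.

```python
import codecs


def pick_encoding(data: bytes) -> str:
    """Return a Python codec name for XML that carries no encoding declaration.

    Assumption: only the BOM rule quoted above is applied. No XML declaration
    is parsed and no heuristic detection is attempted.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to choose the byte order and
        # removes it from the decoded text.
        return "utf-16"
    # "utf-8-sig" decodes plain UTF-8 and also drops a UTF-8 BOM if present.
    return "utf-8-sig"


def decode_fragment(data: bytes) -> str:
    """Decode a fragment strictly; invalid bytes raise UnicodeDecodeError."""
    return data.decode(pick_encoding(data))


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
    print(pick_encoding(sample), decode_fragment(sample))
```

Decoding strictly means that a fragment which is neither valid UTF-8 nor BOM-marked UTF-16 fails loudly instead of being silently misread; what to do with such input is the question the rest of the thread debates.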

Re: Mozilla Charset Detectors

2017-05-25 Thread gabi . t . sandor
On Tuesday, May 23, 2017 at 7:47:12 PM UTC+3, Joshua Cranmer 🐧 wrote:
> On 5/23/17 2:58 AM, Gabriel Sandor wrote:
> > Hello Henri,
> >
> > I was afraid this might be the case, so the library really is deprecated.
> >
> > The project I'm working on involves a multilingual environment, users, and f

Re: Mozilla Charset Detectors

2017-05-23 Thread Joshua Cranmer 🐧
On 5/23/17 2:58 AM, Gabriel Sandor wrote:
> Hello Henri,
>
> I was afraid this might be the case, so the library really is deprecated.
>
> The project I'm working on involves a multilingual environment, users, and files, so yes, having a good encoding detector is important. Thanks for the alternate recomme

Re: Mozilla Charset Detectors

2017-05-23 Thread Gabriel Sandor
[...] wrote:
> > I recently came across the Mozilla Charset Detectors tool, at https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on a C# project where I could use a port of this library (e.g. https://github.com/errepi/ude) f

Re: Mozilla Charset Detectors

2017-05-22 Thread Henri Sivonen
On Mon, May 22, 2017 at 12:13 PM, Gabriel Sandor wrote:
> I recently came across the Mozilla Charset Detectors tool, at https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on a C# project where I could use a port of this library (e.g. https://github.co

Re: Mozilla Charset Detectors

2017-05-22 Thread Jonathan Kew
On 22/05/2017 10:13, Gabriel Sandor wrote:
> Greetings,
>
> I recently came across the Mozilla Charset Detectors tool, at https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on a C# project where I could use a port of this library (e.g. https://github.com/errepi/ude) for adv

Mozilla Charset Detectors

2017-05-22 Thread Gabriel Sandor
Greetings, I recently came across the Mozilla Charset Detectors tool, at https://www-archive.mozilla.org/projects/intl/chardet.html. I'm working on a C# project where I could use a port of this library (e.g. https://github.com/errepi/ude) for advanced charset detection. I'm not sure