Hello, Am 15.04.2018 um 05:08 schrieb suzuki toshiya: > Hi, > > Now I'm reworking with cpp-frontend encoding issue, and found that: > previous patch fixes the issue for text drawn in a PDF, but does > not fix the broken metadata if it is coded by PDFDocEncoding > (sample PDF is attached). > > During the rework, I'm becoming questionable about the usage of > iconv() in cpp frontend, by following reasons. > > 1) currently, the usage of iconv() in cpp frontend is only the > conversion of Unicode: UTF-8 <-> UTF-16. This is completely > mathematical conversion, using iconv() could be overkill. > > in fact, poppler has its own conversion routines (see UTF.h), > which are independent from iconv(). > > 2) iconv() requires the allocated buffer, but does not help > the calculation of the required buffer size. > > because this difficulty, current usages of iconv() have ugly > fallbacks for the cases that prepared buffer is too short, > like this: > > size_t ir = iconv(ic, (ICONV_CONST char **)&me_data, &me_len_char, > &str_data, &str_len_left); > if ((ir == (size_t)-1) && (errno == E2BIG)) { > const size_t delta = str_data - &str[0]; > str_len_left += str.size(); > str.resize(str.size() * 2); > str_data = &str[delta]; > ir = iconv(ic, (ICONV_CONST char **)&me_data, &me_len_char, &str_data, > &str_len_left); > if (ir == (size_t)-1) { > return byte_array(); > } > } > str.resize(str.size() - str_len_left); > return str; > > maybe due to this difficulty, glib or Qt frontends do not use > iconv() directly, they use better utility functions in these > frameworks. > > the functions in UTF.h have the functions to calculate the > required buffer size from given UTF-8 or UTF-16 data. > > 3) iconv() receives the buffer filled by, or to be filled by > UTF-16 data. currently, casting by reinterpret_cast<char *>() > is used to interchange between ustring (the array of unsigned > short) and a byte buffer, but it is not theoretically portable. > On the systems whose short is greater than 16-bit, or its > alignment is greater than 16-bit, the casted byte buffer would > have unexpected data from MSB bits or padding, and it could > make iconv() confused. using uint16_t (as the functions in UTF.h) > would be safer. > > To do in a theoretically portable way, we should allocate the > byte buffer for UTF-16 data, and copy to or copy from it, > instead of the casting. This would be uneasy for the maintenance. > > == > > How could I resolve this issue? There might be a few candidates. > > A) use the functions in UTF.h, and change ustring class > to fit them. > - pros: easiest maintainability, minimum duplication. > - cons: API changed, and the memory management by gmem.h > would be introduced in cpp frontend, it would be less > consistent in cpp frontend internal. > > B) rewrite the functions in UTF.h by template programming > and have the interfaces fitting to current ustring(). > - pros: minimum duplication, good consistency, API kept. > - cons: maintainability would be decreased slightly. > > C) write diversions of the functions in UTF.h in cpp frontend. > - pros: acceptable maintainability, good consistency, API kept. > - cons: duplication of almost same code into 2 places.
I think what is best here is hard to decide without seeing actual code that at least starts in one of these directions. Concerning ustring, I think an API change is not possible except for changing unsigned short to uint16_t since it already was broken where sizeof(unsigned short) != 2. Best regards, Adam > I hope to hear the comments from experts. > > Regards, > mpsuzuki > > > > _______________________________________________ > poppler mailing list > poppler@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/poppler >
signature.asc
Description: OpenPGP digital signature
_______________________________________________ poppler mailing list poppler@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/poppler