Bug#719858: codesearch: Indexer only accepts only valid UTF-8

2013-08-16 Thread Hilko Bengen
Package: codesearch
Version: 0.0~hg20120502-2
Severity: normal
Tags: patch

After noticing that cindex silently skipped some files in my codebase, I found that this happened to files that contained non-ASCII characters that were encoded as Latin-1. Here's a quick reproducer: $ pwd
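The symptom described above can be illustrated with Go's standard `unicode/utf8` package. This is only a sketch, not the cindex code itself: `shouldIndex` is a hypothetical helper, and it assumes the indexer's check amounts to a UTF-8 validity test on the file contents.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// shouldIndex is a hypothetical stand-in for the indexer's check.
// Assumption: a file is skipped whenever its bytes are not valid UTF-8.
func shouldIndex(data []byte) bool {
	return utf8.Valid(data)
}

func main() {
	latin1 := []byte{0xe4, 0xf6, 0xfc} // "äöü" encoded as Latin-1
	utf8Text := []byte("äöü")          // the same text in a UTF-8 Go source file

	fmt.Println("Latin-1 accepted:", shouldIndex(latin1))
	fmt.Println("UTF-8 accepted:", shouldIndex(utf8Text))
}
```

Under that assumption, a Latin-1 file containing any byte above 0x7f that does not happen to form a valid UTF-8 sequence would be dropped, while pure-ASCII Latin-1 files would pass.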

Bug#719858: codesearch: Indexer only accepts only valid UTF-8

2013-08-16 Thread Michael Stapelberg
Hi Hilko,

Hilko Bengen ben...@debian.org writes:
> The first patch simply removes this check. It has worked well for me for several months (before the codesearch package appeared in Debian). I

The issue I have with that is that it will break the assumption that everything which is in the index is

Bug#719858: codesearch: Indexer only accepts only valid UTF-8

2013-08-16 Thread Hilko Bengen
Hi Michael,

> Hilko Bengen ben...@debian.org writes:
>> The first patch simply removes this check. It has worked well for me for several months (before the codesearch package appeared in Debian). I
> The issue I have with that is that it will break the assumption that everything which is in the

Bug#719858: codesearch: Indexer only accepts only valid UTF-8

2013-08-16 Thread Michael Stapelberg
Hi Hilko,

Hilko Bengen ben...@debian.org writes:
> BTW, I just tried passing 'äöü' as a Latin1-encoded string (bytes e4 f6 fc) to csearch. This led to regexp/syntax failing with an invalid UTF-8 error, so this does not work, even if the character encoding of the search term matches that of the
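The `regexp/syntax` failure mentioned in the quoted text can be reproduced directly with the Go standard library. This is a minimal sketch: `isInvalidUTF8Pattern` is a hypothetical helper, and the use of the `syntax.Perl` flag set is an assumption about how csearch parses its pattern.

```go
package main

import (
	"errors"
	"fmt"
	"regexp/syntax"
)

// isInvalidUTF8Pattern reports whether parsing the pattern fails
// specifically because the pattern is not valid UTF-8.
// Hypothetical helper; syntax.Perl is assumed, not taken from csearch.
func isInvalidUTF8Pattern(pattern string) bool {
	_, err := syntax.Parse(pattern, syntax.Perl)
	var perr *syntax.Error
	return errors.As(err, &perr) && perr.Code == syntax.ErrInvalidUTF8
}

func main() {
	latin1 := "\xe4\xf6\xfc" // 'äöü' as Latin-1 bytes e4 f6 fc
	fmt.Println("Latin-1 pattern rejected:", isInvalidUTF8Pattern(latin1))
	fmt.Println("UTF-8 pattern rejected:", isInvalidUTF8Pattern("äöü"))
}
```

The parser rejects the raw bytes before any matching happens, so a search term in the same encoding as the file never reaches the index lookup.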