[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-02-28 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Mason James  changed:

   What|Removed |Added

 Status|Pushed to oldoldstable  |RESOLVED
 Resolution|--- |FIXED

-- 
You are receiving this mail because:
You are watching all bug changes.
___
Koha-bugs mailing list
Koha-bugs@lists.koha-community.org
https://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-bugs
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-02-20 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Hayley Mapley  changed:

   What|Removed |Added

 Version(s) released in|20.05.00, 19.11.03, 19.05.08|20.05.00, 19.11.03, 19.05.08, 18.11.14
 CC||hayleymap...@catalyst.net.nz
 Status|Pushed to oldstable |Pushed to oldoldstable

--- Comment #50 from Hayley Mapley  ---
Backported to 18.11.x for 18.11.14



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-02-05 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Lucas Gass  changed:

   What|Removed |Added

 Status|Pushed to stable|Pushed to oldstable
 CC||lu...@bywatersolutions.com
 Version(s) released in|20.05.00, 19.11.03|20.05.00, 19.11.03, 19.05.08

--- Comment #49 from Lucas Gass  ---
backported to 19.05.x for 19.05.08



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-30 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Joy Nelson  changed:

   What|Removed |Added

 Version(s) released in|20.05.00|20.05.00, 19.11.03
 CC||j...@bywatersolutions.com
 Status|Pushed to master|Pushed to stable

--- Comment #48 from Joy Nelson  ---
Pushed to 19.11.x branch for 19.11.03



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-22 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Martin Renvoize  changed:

   What|Removed |Added

 Blocks||24167


Referenced Bugs:

https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=24167
[Bug 24167] We should support installation on Debian 10


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-10 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Martin Renvoize  changed:

   What|Removed |Added

 Status|Passed QA   |Pushed to master
 Version(s) released in||20.05.00



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-10 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #47 from Martin Renvoize  ---
Nice work everyone!

Pushed to master for 20.05



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-10 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #46 from Jonathan Druart  ---
(In reply to Marcel de Rooy from comment #43)
> (In reply to Jonathan Druart from comment #42)
> > Martin, I think you forgot to add the new dependency.
> 
> C4/Charset.pm:use Unicode::Normalize;
> C4/Installer/PerlDependencies.pm:'Unicode::Normalize' => {
> C4/Record.pm:use Unicode::Normalize; # _entity_encode
> Koha/Patron.pm:use Unicode::Normalize;
> misc/migration_tools/bulkmarcimport.pl:use Unicode::Normalize;
> 
> What did you mean?

Indeed, I missed that it was already in our dependency list.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-10 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Marcel de Rooy  changed:

   What|Removed |Added

 Attachment #97099 is obsolete|0|1

--- Comment #44 from Marcel de Rooy  ---
Created attachment 97174
  --> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=97174&action=edit
Bug 14759: Replace Text::Unaccent with Unicode::Normalize

As shown in the comments on the bug, it appears that Unicode::Normalize
is the most reliable way to strip accents from strings for this use
case.

Signed-off-by: Jonathan Druart 

Signed-off-by: Marcel de Rooy 
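
For readers unfamiliar with the approach named in the commit message, here is a minimal sketch of accent stripping with Unicode::Normalize. It is not the code from the attachment: the helper name `generate_userid` and the `firstname.surname` format are assumptions made for illustration. The idea is to decompose each character (NFKD) and then delete the combining marks that the decomposition exposes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper, not taken from the patch: build a login name from
# a patron's names and strip accents from the result.
sub generate_userid {
    my ( $firstname, $surname ) = @_;

    my $name = ( defined $firstname && length $firstname )
        ? "$firstname.$surname"
        : $surname;

    # Compatibility decomposition splits "\x{E9}" (e with acute) into
    # "e" followed by U+0301 COMBINING ACUTE ACCENT.
    my $userid = NFKD($name);

    # Remove the combining marks, leaving only the base characters.
    $userid =~ s/\p{NonspacingMark}//g;

    return lc $userid;
}

binmode( STDOUT, ':encoding(UTF-8)' );
print generate_userid( "Andr\x{E9}", "Pr\x{E9}vost" ), "\n";    # andre.prevost
```

Lower-casing after the decomposition is a deliberate choice in this sketch: by then the accented capitals have already been reduced to plain base letters.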



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-10 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Marcel de Rooy  changed:

   What|Removed |Added

 Attachment #97100 is obsolete|0|1

--- Comment #45 from Marcel de Rooy  ---
Created attachment 97175
  --> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=97175&action=edit
Bug 14759: Add test

Signed-off-by: Jonathan Druart 

Signed-off-by: Marcel de Rooy 



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-10 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Marcel de Rooy  changed:

   What|Removed |Added

 Status|Signed Off  |Passed QA
   Patch complexity|Medium patch|Small patch



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-10 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Marcel de Rooy  changed:

   What|Removed |Added

   Assignee|ke...@carvingit.com|martin.renvoize@ptfs-europe.com
 QA Contact|testo...@bugs.koha-community.org|m.de.r...@rijksmuseum.nl



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-10 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Marcel de Rooy  changed:

   What|Removed |Added

 CC||m.de.r...@rijksmuseum.nl

--- Comment #43 from Marcel de Rooy  ---
(In reply to Jonathan Druart from comment #42)
> Martin, I think you forgot to add the new dependency.

C4/Charset.pm:use Unicode::Normalize;
C4/Installer/PerlDependencies.pm:'Unicode::Normalize' => {
C4/Record.pm:use Unicode::Normalize; # _entity_encode
Koha/Patron.pm:use Unicode::Normalize;
misc/migration_tools/bulkmarcimport.pl:use Unicode::Normalize;

What did you mean?



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-09 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Jonathan Druart  changed:

   What|Removed |Added

 CC||jonathan.dru...@bugs.koha-community.org

--- Comment #42 from Jonathan Druart  ---
Martin, I think you forgot to add the new dependency.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-09 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #41 from Jonathan Druart  ---
Created attachment 97100
  --> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=97100&action=edit
Bug 14759: Add test

Signed-off-by: Jonathan Druart 



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-09 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Jonathan Druart  changed:

   What|Removed |Added

 Status|Needs Signoff   |Signed Off



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-09 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Jonathan Druart  changed:

   What|Removed |Added

 Attachment #96974 is obsolete|0|1

--- Comment #40 from Jonathan Druart  ---
Created attachment 97099
  --> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=97099&action=edit
Bug 14759: Replace Text::Unaccent with Unicode::Normalize

As shown in the comments on the bug, it appears that Unicode::Normalize
is the most reliable way to strip accents from strings for this use
case.

Signed-off-by: Jonathan Druart 



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-09 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Hans Pålsson  changed:

   What|Removed |Added

 CC||hans.pals...@hkr.se

--- Comment #39 from Hans Pålsson  ---
I applied this patch after hitting exactly the problem described in
bug 24292. The patch solved the problem and allowed me to upgrade the
installation from 19.05.04 to 19.05.06. However, I applied the patch by
"manually" editing the code as described, so I can't sign off with a clear
conscience. This is unorthodox, but I needed the patch to update a test server.
That worked, but I do not have a sandbox or similar in which to recreate the
problem.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-08 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Katrin Fischer  changed:

   What|Removed |Added

 CC||mjn...@gmail.com

--- Comment #38 from Katrin Fischer  ---
*** Bug 24292 has been marked as a duplicate of this bug. ***



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-08 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Martin Renvoize  changed:

   What|Removed |Added

   Severity|normal  |critical



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-08 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Martin Renvoize  changed:

   What|Removed |Added

 Status|In Discussion   |Needs Signoff



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-08 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Martin Renvoize  changed:

   What|Removed |Added

 Attachment #42120 is obsolete|0|1



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-08 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #37 from Martin Renvoize  ---
Created attachment 96974
  --> https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=96974&action=edit
Bug 14759: Replace Text::Unaccent with Unicode::Normalize

As shown in the comments on the bug, it appears that Unicode::Normalize
is the most reliable way to strip accents from strings for this use
case.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2020-01-01 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #36 from David Cook  ---
Perhaps we could make this a pluggable configuration option? We could even
leave the default as Text::Unaccent, and then those of us not using Debian
could manually switch to Text::Unaccent::PurePerl?
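
One way to picture such an option is a small dispatch table keyed by a configuration value. The sketch below is only an illustration: the backend names `none` and `normalize`, the `%UNACCENT_BACKEND` table, and the `unaccent` function are all hypothetical, and it uses core Unicode::Normalize rather than either Text::Unaccent module so that it runs anywhere.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical registry of unaccenting strategies, keyed by the value of
# an assumed configuration option.
my %UNACCENT_BACKEND = (
    # Leave the string untouched.
    none => sub { return $_[0] },

    # Decompose, then drop combining marks.
    normalize => sub {
        my $decomposed = NFKD( $_[0] );
        $decomposed =~ s/\p{NonspacingMark}//g;
        return $decomposed;
    },
);

# Look up the configured backend; die on an unknown name rather than guess.
sub unaccent {
    my ( $string, $backend ) = @_;
    my $code = $UNACCENT_BACKEND{$backend}
        or die "Unknown unaccent backend '$backend'\n";
    return $code->($string);
}
```

A site preferring another module could register one more entry that wraps it, leaving callers of `unaccent` unchanged.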



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2019-12-23 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #35 from Katrin Fischer  ---
(In reply to Martin Renvoize from comment #34)
> I'd really like to understand why we need to unaccent the logins in the
> first place... was it because of how MySQL stores the data, perhaps, and then
> does a lookup?

I don't think that was the problem, or at least I can't imagine it would be.
What I could think of is limited 'keyboards' on self-checks and similar that
don't allow logging in with diacritics. The username is possibly also used by
external services for authentication... maybe it was adapting to that?



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2019-12-23 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Martin Renvoize  changed:

   What|Removed |Added

 CC||martin.renvoize@ptfs-europe.com

--- Comment #34 from Martin Renvoize  ---
I'd really like to understand why we need to unaccent the logins in the first
place... was it because of how MySQL stores the data, perhaps, and then does a
lookup?



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2019-12-22 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Mason James  changed:

   What|Removed |Added

 CC||m...@kohaaloha.com



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2019-12-22 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Katrin Fischer  changed:

   What|Removed |Added

   See Also||https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=24292



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2019-11-04 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #33 from Colin Campbell  ---
It looks like, in addition to the problems building on 64-bit systems, it fails
to build on the most recent versions of Perl. I think this is a result of the
removal of "." from the default module search path (@INC).
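
Perl 5.26 stopped including the current directory in `@INC` by default, so build or test scripts that load helper files by relative path can break. Whether that is what affects Text::Unaccent is the guess made above, not something confirmed here. The sketch below only shows how to check a given interpreter; the function name `dot_in_inc` is made up for this example.

```perl
use strict;
use warnings;

# Report whether the current directory is searched when loading modules.
# @INC may also hold code references (hooks), so skip anything that is a ref.
sub dot_in_inc {
    return scalar grep { !ref($_) && $_ eq '.' } @INC;
}

printf "perl %vd: '.' %s in \@INC\n", $^V, dot_in_inc() ? 'is' : 'is not';
```

When an old distribution needs the former behaviour during a build, setting the environment variable `PERL_USE_UNSAFE_INC=1` restores "." on Perl 5.26 and later; treat that as a stopgap rather than a fix.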



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2019-01-06 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #32 from David Cook  ---
Went to the Australia mirror...

http://ftp.au.debian.org/debian/pool/main/libt/libtext-unaccent-perl/

http://ftp.au.debian.org/debian/pool/main/libt/libtext-unaccent-perl/libtext-unaccent-perl_1.08-1.3.diff.gz

Patch looks consistent with the following:

https://bugs.launchpad.net/ubuntu/+source/libtext-unaccent-perl/+bug/460640

https://rt.cpan.org/Public/Bug/Display.html?id=21177

I don't think the maintainer has done anything on CPAN or Debian for many
years. Looks like other Debian folk fixed the Debian package, and CPAN is just
abandoned.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2019-01-06 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #31 from David Cook  ---
In any case, this seems like a major sticking point for anyone not using
Debian-based distros.

For now, I'm just removing Text::Unaccent where necessary.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2019-01-06 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #30 from David Cook  ---
(In reply to Colin Campbell from comment #25)
> Text::Unaccent does not build on 64-bit systems; the tests fail because of
> errors in the C code. There has been a patch for that for four years, but it
> looks like the upstream code is moribund. If you look at the test results, it
> now fails on all Linux test builds. The module has not been kept up to date
> to handle modern Perl strings. I think the Debian version may be patched to
> fix the bug in the 64-bit tests, but it is buggy and should not be relied on.
> I suggest that moving to Text::Unaccent::PurePerl be prioritized.

I just ran into this again. 

It can't be built on a 64-bit system running Perl 5.26. If you do force the
build and install, you'll just get difficult-to-diagnose 500 errors in Koha.

As Colin mentioned, there have been known issues with this for ages:
https://rt.cpan.org/Public/Dist/Display.html?Name=Text-Unaccent

I've gone to https://packages.debian.org/stretch/libtext-unaccent-perl to see
if I can find the patch that Debian uses.

I reckon
http://deb.debian.org/debian/pool/main/libt/libtext-unaccent-perl/libtext-unaccent-perl_1.08-1.3.diff.gz
is the patch but the Debian CDN keeps timing out on me. 

When I try with curl, it seems like the CDN is refusing connections on port 80
and 443, so something looks like it's up with Debian...



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2018-12-13 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

David Cook  changed:

   What|Removed |Added

   See Also||https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=21848



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2018-12-13 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #29 from David Cook  ---
(In reply to Mirko Tietgen from comment #28)
> Some tests fail ATM when building a package in Buster (unstable), and looking
> into that I ended up here. Looking into packaging Text::Unaccent::PurePerl, I
> found it has not had an update since 2013, so we would exchange one dead
> package for another.

I'm thinking "Strip NonspacingMark" instead of Text::Unaccent* might be the
solution?
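
A minimal sketch of that idea, using core Unicode::Normalize: decompose the string, then delete every character in the Nonspacing_Mark (Mn) category. The function name `strip_nonspacing_marks` is made up for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Canonically decompose precomposed characters, then delete the combining
# marks (general category Mn) that the decomposition exposes.
sub strip_nonspacing_marks {
    my ($string) = @_;
    my $decomposed = NFD($string);
    $decomposed =~ s/\p{NonspacingMark}//g;
    return $decomposed;
}

binmode( STDOUT, ':encoding(UTF-8)' );
print strip_nonspacing_marks("na\x{EF}ve"), "\n";    # naive
```

One limitation to keep in mind: letters such as "ø" (U+00F8) or "ß" (U+00DF) have no decomposition, so this technique leaves them unchanged.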



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2018-11-16 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Mirko Tietgen  changed:

   What|Removed |Added

 CC||mi...@abunchofthings.net

--- Comment #28 from Mirko Tietgen  ---
Some tests fail ATM when building a package in Buster (unstable), and looking
into that I ended up here. Looking into packaging Text::Unaccent::PurePerl, I
found it has not had an update since 2013, so we would exchange one dead package
for another.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2017-11-20 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #27 from Colin Campbell  ---
(In reply to David Cook from comment #26)
> (In reply to Colin Campbell from comment #25)
> > Text::Unaccent does not build on 64-bit systems; the tests fail because of
> > errors in the C code. There has been a patch for that for four years, but it
> > looks like the upstream code is moribund. If you look at the test results, it
> > now fails on all Linux test builds. The module has not been kept up to date
> > to handle modern Perl strings. I think the Debian version may be patched to
> > fix the bug in the 64-bit tests, but it is buggy and should not be relied on.
> > I suggest that moving to Text::Unaccent::PurePerl be prioritized.
> 
> Where are we using Text::Unaccent? Is it still just in userid strings?

I think that is the only location



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2017-11-16 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #26 from David Cook  ---
(In reply to Colin Campbell from comment #25)
> Text::Unaccent does not build on 64-bit systems; the tests fail because of
> errors in the C code. There has been a patch for that for four years, but it
> looks like the upstream code is moribund. If you look at the test results, it
> now fails on all Linux test builds. The module has not been kept up to date
> to handle modern Perl strings. I think the Debian version may be patched to
> fix the bug in the 64-bit tests, but it is buggy and should not be relied on.
> I suggest that moving to Text::Unaccent::PurePerl be prioritized.

Where are we using Text::Unaccent? Is it still just in userid strings?



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2017-11-16 Thread bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Colin Campbell  changed:

   What|Removed |Added

 CC||colin.campbell@ptfs-europe.com

--- Comment #25 from Colin Campbell  ---
Text::Unaccent does not build on 64-bit systems; the tests fail because of
errors in the C code. There has been a patch for that for four years, but it
looks like the upstream code is moribund. If you look at the test results, it
now fails on all Linux test builds. The module has not been kept up to date to
handle modern Perl strings. I think the Debian version may be patched to fix
the bug in the 64-bit tests, but it is buggy and should not be relied on. I
suggest that moving to Text::Unaccent::PurePerl be prioritized.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-17 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #24 from David Cook  ---
(In reply to Katrin Fischer from comment #22)
> I can confirm what has been said about Japanese - removing the diacritics
> from katakana or hiragana makes an unwanted difference.
> 
> Looking at the answers so far, it seems like there can't be a perfect
> solution that works as expected for all languages - maybe we need to make this
> optional?

I still think it might be worthwhile to re-open
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=7411, although that's
probably easier said than done now that its problem behaviour has been
corrected...

Clearly, Text::Unaccent hasn't been doing anything for CJK or Arabic anyway,
and everything has been fine. So I don't know if we really need to be
unaccenting the userid anyway. Can the problem really be localized to French?
Or were French libraries the only ones to notice the original problem?

The original comment on this bug also talks about replacing Text::Unaccent with
Text::Unaccent::PurePerl because there were issues on 64-bit CentOS, but the
community doesn't really support anything other than Debian/Ubuntu anyway.


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-13 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #22 from Katrin Fischer  ---
I can confirm what has been said about Japanese - removing the diacritics from
katakana or hiragana makes an unwanted difference.

Looking at the answers so far, it seems like there can't be a perfect solution
working as expected for all languages - maybe we need to make this optional?



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-13 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Karam Qubsi  changed:

   What|Removed |Added

 CC||karamqu...@gmail.com

--- Comment #23 from Karam Qubsi  ---
Hi,
For Arabic, yes, it is true that deleting all the diacritics ( َ  ُ  ِ  ْ  ٍ  ٌ  ً )
will be fine in most cases.
In more detail, though, the diacritics affect the meaning of a word. For
example, مُدَرِّسة means a female teacher; if we change the diacritics to make it
مَدرَسة, it means a school.
In general, however, they are not widely used in daily life, as we can
understand the meaning of a word from its context.

The ancient Arabs did not use any diacritics; they were introduced only to
help new learners of the Arabic language.

So if the output has no diacritics at all, I think that would be OK in most
cases.

I hope that helps,
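
The trade-off described above can be checked with a minimal sketch. It assumes the "Strip NonspacingMark" approach amounts to NFD decomposition followed by removal of \p{Mn} characters; the helper name strip_marks and the exact code-point spellings of the two words are illustrative assumptions, not taken from the thread.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop every nonspacing mark (\p{Mn}).
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

# Assumed spellings, written as code points so the file encoding is irrelevant.
# "Female teacher": meem+damma, dal+fatha, reh+shadda+kasra, seen, teh marbuta.
my $teacher = "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{651}\x{650}\x{633}\x{629}";
# "School": meem+fatha, dal, reh+fatha, seen, teh marbuta.
my $school  = "\x{645}\x{64e}\x{62f}\x{631}\x{64e}\x{633}\x{629}";

my $same_before = $teacher eq $school ? 1 : 0;
my $same_after  = strip_marks($teacher) eq strip_marks($school) ? 1 : 0;

print "identical before stripping: $same_before\n";
print "identical after stripping:  $same_after\n";
```

Both words reduce to the same five base letters, which is the loss of meaning described above; whether that matters for a userid is a separate question.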


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-10 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Yuval Hager  changed:

   What|Removed |Added

 CC||yha...@yhager.com

--- Comment #16 from Yuval Hager  ---
I ran the script from comment #5 on some more Hebrew text; I hope I did not
forget any diacritic marks.
I don't know what Text::Unaccent is doing. Text::Unaccent::PurePerl, on the
other hand, seems to be doing too little, if anything.

All the outputs from 'Strip NonspacingMark' seem correct: they are perfectly
readable and have all the diacritics removed.

I modified the script a bit, so it's easy to compare the three options:

Text::Unaccent   - קָמָץ => קָ×ָץ
Text::Unaccent   - פַתח => פַת×
Text::Unaccent   - עִבְרִית => ×¢Ö´×ְרִ×ת
Text::Unaccent   - חוֹלָם => ××Ö¹×Ö¸×
Text::Unaccent   - זָנָב, תָּכְנִית => ×ָנָ×, תָּ×Ö°× Ö´×ת
Text::Unaccent   - צָהֳרַיִם => צָ×ֳרַ×Ö´×
Text::Unaccent   - קַל => קַ×
Text::Unaccent   - חֲלוֹם => ×Ö²××Ö¹×
Text::Unaccent   - מֶלֶךְ => ×Ö¶×Ö¶×Ö°
Text::Unaccent   - נֶאֱמָן => × Ö¶×Ö±×Ö¸×
Text::Unaccent   - לֵב => ×Öµ×
Text::Unaccent   - יִכְתְּבוּ => ×Ö´×ְתְּ××Ö¼
Text::Unaccent   - שִׁיר => שִ××ר
Text::Unaccent   - דֻּבִּים => ×Ö»Ö¼×Ö´Ö¼××
Text::Unaccent   - חֹלִי => ×Ö¹×Ö´×
Text::Unaccent   - סוּס => ס×ּס
Text::Unaccent   - נוֹף => × ×Ö¹×£
Text::Unaccent   - גַמּד => ×Ö·×Ö¼×
Text::Unaccent   - מְסַבֵּךְ => ×ְסַ×ÖµÖ¼×Ö°
Text::Unaccent   - שולחנהּ => ש ×Ö¼
Text::Unaccent   - שֵׁם => שֵ××
Text::Unaccent   - עֶשֶׂר => עֶשֶ×ר
Text::Unaccent   - אֵלֶּה, אָנָּא, הֵמָּה, לָמָּה, שָׁמָּה, בָּתִּים, 
שָׁבַרְתִּי, תַּלְתַּל, לְבַד, חַג,
לַיְלָה => ×Öµ×Ö¶Ö¼×, ×ָנָּ×, ×Öµ×Ö¸Ö¼×, ×Ö¸×Ö¸Ö¼×, שָ××Ö¸Ö¼×,
×ָּתִּ××, שָ××ַרְתִּ×, תַּ×ְתַּ×, ×Ö°×Ö·×, ×Ö·×, ×Ö·×Ö°×Ö¸×
Text::Unaccent::PurePerl - קָמָץ => קָמָץ
Text::Unaccent::PurePerl - פַתח => פַתח
Text::Unaccent::PurePerl - עִבְרִית => עִבְרִית
Text::Unaccent::PurePerl - חוֹלָם => חוֹלָם
Text::Unaccent::PurePerl - זָנָב, תָּכְנִית => זָנָב, תָּכְנִית
Text::Unaccent::PurePerl - צָהֳרַיִם => צָהֳרַיִם
Text::Unaccent::PurePerl - קַל => קַל
Text::Unaccent::PurePerl - חֲלוֹם => חֲלוֹם
Text::Unaccent::PurePerl - מֶלֶךְ => מֶלֶךְ
Text::Unaccent::PurePerl - נֶאֱמָן => נֶאֱמָן
Text::Unaccent::PurePerl - לֵב => לֵב
Text::Unaccent::PurePerl - יִכְתְּבוּ => יִכְתְּבוּ
Text::Unaccent::PurePerl - שִׁיר => שִׁיר
Text::Unaccent::PurePerl - דֻּבִּים => דֻּבִּים
Text::Unaccent::PurePerl - חֹלִי => חֹלִי
Text::Unaccent::PurePerl - סוּס => סוּס
Text::Unaccent::PurePerl - נוֹף => נוֹף
Text::Unaccent::PurePerl - גַמּד => גַמּד
Text::Unaccent::PurePerl - מְסַבֵּךְ => מְסַבֵּךְ
Text::Unaccent::PurePerl - שולחנהּ => שולחנהּ
Text::Unaccent::PurePerl - שֵׁם => שֵׁם
Text::Unaccent::PurePerl - עֶשֶׂר => עֶשֶׂר
Text::Unaccent::PurePerl - אֵלֶּה, אָנָּא, הֵמָּה, לָמָּה, שָׁמָּה, בָּתִּים, 
שָׁבַרְתִּי, תַּלְתַּל, לְבַד, חַג,
לַיְלָה => אֵלֶּה, אָנָּא, הֵמָּה, לָמָּה, שָׁמָּה, בָּתִּים, שָׁבַרְתִּי, 
תַּלְתַּל, לְבַד, חַג, לַיְלָה
Strip NonspacingMark - קָמָץ => קמץ
Strip NonspacingMark - פַתח => פתח
Strip NonspacingMark - עִבְרִית => עברית
Strip NonspacingMark - חוֹלָם => חולם
Strip NonspacingMark - זָנָב, תָּכְנִית => זנב, תכנית
Strip NonspacingMark - צָהֳרַיִם => צהרים
Strip NonspacingMark - קַל => קל
Strip NonspacingMark - חֲלוֹם => חלום
Strip NonspacingMark - מֶלֶךְ => מלך
Strip NonspacingMark - נֶאֱמָן => נאמן
Strip NonspacingMark - לֵב => לב
Strip NonspacingMark - יִכְתְּבוּ => יכתבו
Strip NonspacingMark - שִׁיר => שיר
Strip NonspacingMark - דֻּבִּים => דבים
Strip NonspacingMark - חֹלִי => חלי
Strip NonspacingMark - סוּס => סוס
Strip NonspacingMark - נוֹף => נוף
Strip NonspacingMark - גַמּד => גמד
Strip NonspacingMark - מְסַבֵּךְ => מסבך
Strip NonspacingMark - שולחנהּ => שולחנה
Strip NonspacingMark - שֵׁם => שם
Strip NonspacingMark - עֶשֶׂר => עשר
Strip NonspacingMark - אֵלֶּה, אָנָּא, הֵמָּה, לָמָּה, שָׁמָּה, בָּתִּים, 
שָׁבַרְתִּי, תַּלְתַּל, לְבַד, חַג,
לַיְלָה => אלה, אנא, המה, למה, שמה, בתים, שברתי, תלתל, לבד, חג, לילה


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-10 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #17 from Katrin Fischer  ---
Thx Yuval! It looks to me like using the new method would be a big step in the
right direction.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-10 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #19 from David Cook  ---
(In reply to Katrin Fischer from comment #17)
> Thx Yuval! It looks to me like using the new method would be a big step in
> the right direction.

I agree.

Text::Unaccent and Text::Unaccent::PurePerl don't appear to be comprehensive
enough to deal with many languages. While they seem to handle Latin and Greek
characters, they don't do so well with Arabic and Hebrew.

Note that nothing seems to happen with the (Japanese?) ideograms that Galen
tested. I wonder if accents are even a thing with CJK languages... I've asked a
friend who knows Chinese for her input on that one. Oh, I know some people with
Japanese experience as well... I should ask them.

I think we should also ask Vietnamese users, as Vietnamese has a lot of
diacritics... and I think they might actually be quite significant.
(https://en.wikipedia.org/wiki/Vietnamese_alphabet#Tone_marks)

I'll update the listserv to ask for people with Vietnamese knowledge too... as
that could potentially answer Galen's question about whether or not we should
even be unaccenting userid values...



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-10 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #18 from David Cook  ---
Way to go Yuval!

Try replacing the following line:

print "Text::Unaccent   - $_ => " .
Text::Unaccent::unac_string('utf-8', $_) . "\n";

with these lines:

print "Text::Unaccent   - $_ => ";
print Text::Unaccent::unac_string('utf-8', $_)."\n";

--

I suspect that will make the output of Text::Unaccent and
Text::Unaccent::PurePerl the same. 

The epic-length posts I wrote earlier were about how Perl wasn't handling the
output of Text::Unaccent as expected. 

--

Also, try replacing this line:

use Text::Unaccent qw//;

with these lines:

use Text::Unaccent qw/unac_debug/;
unac_debug($Text::Unaccent::DEBUG_HIGH);

That will also tell you what Text::Unaccent is doing (or probably not doing).



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-10 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #20 from Yuval Hager  ---
> I suspect that will make the output of Text::Unaccent and
> Text::Unaccent::PurePerl the same. 
>

Not really, it stays the same garbled mess.

> unac_debug($Text::Unaccent::DEBUG_HIGH);
> 
> That will also tell you what Text::Unaccent is doing (or probably not doing).

I tested on one string:

unac.c:13708: unac_data0[7] & unac_positions[0][8]: 0x05e7 => untouched
unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
unac.c:13708: unac_data0[30] & unac_positions[0][31]: 0x05de => untouched
unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x05e5 => untouched
Text::Unaccent   - קָמָץ => קָ×ָץ


> Note that nothing seems to happen with the (Japanese?) ideograms that Galen
> tested. I wonder if accents are even a thing with CJK languages...

I am definitely not an authoritative source, but I know a tiny bit of Japanese.
The characters above are kanji, which to the best of my knowledge do not have
diacritics. BUT Japanese has two more scripts, hiragana and katakana, and both
use diacritics, which CANNOT be removed, or they change the sound (and
potentially the meaning).
For example, in the word Hiragana, the first syllable is ひ (Hi, pronounced Hee).
The same syllable with two ticks is び, and it sounds like Bee. A circle makes
it ぴ, which sounds like Pee. Testing those three:

Text::Unaccent   - ひびぴ => ã²ã²ã²
Text::Unaccent::PurePerl - ひびぴ => ひひひ
Strip NonspacingMark - ひびぴ => ひひひ

So we've changed 'Hee Bee Pee' to 'Hee Hee Hee'. The same result (and same
syllables) for Katakana:

Text::Unaccent   - ヒビピ => ããã
Text::Unaccent::PurePerl - ヒビピ => ヒヒヒ
Strip NonspacingMark - ヒビピ => ヒヒヒ

So diacritics, at least in those two alphabets, should not be removed, to the
best of my knowledge.
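
The reason all three kana collapse to the same character can be seen from their canonical decompositions. The following is a minimal sketch, assuming the stripping step is NFD followed by removal of \p{Mn}; it prints code points rather than the characters themselves so that it does not depend on terminal encoding.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hiragana HI, BI and PI, given as code points.
my @kana = ("\x{3072}", "\x{3073}", "\x{3074}");

my @stripped;
for my $char (@kana) {
    my $decomposed = NFD($char);

    # List the code points that make up the canonical decomposition.
    my @parts = map { sprintf 'U+%04X', ord } split //, $decomposed;
    printf "U+%04X decomposes to %s\n", ord($char), join(' ', @parts);

    # Removing nonspacing marks leaves only the base kana.
    (my $base = $decomposed) =~ s/\p{Mn}//g;
    push @stripped, sprintf('U+%04X', ord $base);
}
print "after stripping: @stripped\n";
```

BI and PI decompose into HI plus a combining voiced or semi-voiced sound mark, and both marks are nonspacing, so any approach that strips \p{Mn} after decomposition will remove them.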


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-10 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #21 from David Cook  ---
(In reply to Yuval Hager from comment #20)
> > I suspect that will make the output of Text::Unaccent and
> > Text::Unaccent::PurePerl the same. 
> >
> 
> Not really, it stays the same garbled mess.
> 

That's odd.

In that case, you could try replacing the following line:

print "Text::Unaccent   - $_ => " .
Text::Unaccent::unac_string('utf-8', $_) . "\n";

with these lines:

use Encode;
my $unaccented = Text::Unaccent::unac_string('utf-8', $_);
$unaccented = decode("UTF-8", $unaccented);

print "Text::Unaccent   - $_ => $unaccented \n";

The garbled mess is, basically, because we're using "use utf8" and
Text::Unaccent returns strings without a UTF8 flag.

> > unac_debug($Text::Unaccent::DEBUG_HIGH);
> > 
> > That will also tell you what Text::Unaccent is doing (or probably not 
> > doing).
> 
> I tested on one string:
> 
> unac.c:13708: unac_data0[7] & unac_positions[0][8]: 0x05e7 => untouched
> unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
> unac.c:13708: unac_data0[30] & unac_positions[0][31]: 0x05de => untouched
> unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
> unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x05e5 => untouched
> Text::Unaccent   - קָמָץ => קָ×ָץ
> 
> 
> > Note that nothing seems to happen with the (Japanese?) ideograms that Galen
> > tested. I wonder if accents are even a thing with CJK languages...
> 
> I am definitely not an authoritative source, but I know a tiny bit of
> Japanese. The characters above are kanji, which to the best of my knowledge
> do not have diacritics. BUT Japanese has two more scripts, hiragana and
> katakana, and both use diacritics, which CANNOT be removed, or they change
> the sound (and potentially the meaning).
> For example, in the word Hiragana, the first syllable is ひ (Hi, pronounced
> Hee). The same syllable with two ticks is び, and it sounds like Bee. A
> circle makes it ぴ, which sounds like Pee. Testing those three:
> 

I was just reading some comments from a friend who was suggesting the same
thing. 

> Text::Unaccent   - ひびぴ => ã²ã²ã²
> Text::Unaccent::PurePerl - ひびぴ => ひひひ
> Strip NonspacingMark - ひびぴ => ひひひ
> 
> So we've changed 'Hee Bee Pee' to 'Hee Hee Hee'. The same result (and same
> syllables) for Katakana:
> 
> Text::Unaccent   - ヒビピ => ããã
> Text::Unaccent::PurePerl - ヒビピ => ヒヒヒ
> Strip NonspacingMark - ヒビピ => ヒヒヒ
> 
> So diacritics, at least in those two alphabets, should not be removed, to
> the best of my knowledge.

In that case, I really wonder whether we should actually be removing accents
for any languages, and instead look at why we started stripping accents in the
first place.

Text::Unaccent is clearly not removing accents for many languages, so clearly
it can't be that big of a problem, no?


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-08 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #14 from Galen Charlton  ---
(In reply to David Cook from comment #13)
> As I pointed out in my overly long comments, it doesn't appear that
> Text::Unaccent is actually mangling non-Latin characters. 
> 
> Rather, in your example, it looks like Perl doesn't correctly handle the
> concatenated string composed of one string with a UTF8 flag set and one
> string without a UTF8 flag set. 

Other way around: Text::Unaccent is not emitting Perl Unicode strings, as would
be much preferable; rather, it is emitting octet sequences. A good pattern to
aim for is using *only* Unicode strings within core code, and relegating use
of Encode and friends to input and output; Text::Unaccent would get in the way
of that.
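
A minimal sketch of that pattern, using only the core Encode module: decode octets once at the input boundary, work with character strings in between, and encode once at the output boundary. The variable names are illustrative and are not taken from Koha.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input boundary: UTF-8 octets, as they might arrive from a form or a file.
my $input_octets = "\xD9\x85\xD9\x8F\xD8\xAF";

# Decode once; from here on we only handle Perl character strings.
my $chars = decode('UTF-8', $input_octets);

# Core logic works on characters, so lengths and concatenation behave sanely.
my $label = "Arabic: " . $chars;
printf "characters: %d, octets: %d\n", length($chars), length($input_octets);

# Output boundary: encode once, just before the data leaves the program.
my $output_octets = encode('UTF-8', $label);
```

A module that hands back octets in the middle of this flow forces an extra decode at every call site, which is the friction described above.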



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-08 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #15 from David Cook  ---
(In reply to Galen Charlton from comment #14)
> Other way around: Text::Unaccent is not, as it would be much preferable,
> emitting Perl Unicode strings; rather, it is emitting octet-sequences.

Sorry, I must have been unclear; I meant to say that Text::Unaccent is emitting
octet sequences (which is why using decode() on the string returned by
Text::Unaccent would create a Perl Unicode string).

And that Perl itself was causing problems when it tried to create a new string
from an octet sequence string and a Perl Unicode string.

> A good pattern to aim for is using *only* Unicode strings within core code,
> and relegating use of Encode and friends to input and output; Text::Unaccent
> would get in the way of that.

Fair enough. I'm not in favour of Text::Unaccent per se. I was curious why it
seemed to mangle some strings, and I shared what answers I found. 

I suspect Unicode::Normalize will really be the way to go, as you suggest. It
seems much more comprehensive than Text::Unaccent and Text::Unaccent::PurePerl.
I imagine we just need feedback from people experienced in Arabic, Hebrew, and
CJK languages.
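
A minimal sketch of what a Unicode::Normalize-based replacement could look like. It assumes the "Strip NonspacingMark" variant tested in this thread works by decomposing and removing \p{Mn} characters (the original script from comment #5 is not reproduced here), and the function name strip_nonspacing_marks is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper. Expects an already-decoded Perl character string.
sub strip_nonspacing_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);     # split base letters from combining marks
    $decomposed =~ s/\p{Mn}//g;      # drop the nonspacing marks
    return NFC($decomposed);         # recompose whatever remains
}

# Latin: e-acute, l, e-grave, v, e.
my $latin = strip_nonspacing_marks("\x{e9}l\x{e8}ve");

# Hebrew: qof, qamats, mem, qamats, final tsadi (the string from the debug log).
my $hebrew = strip_nonspacing_marks("\x{5e7}\x{5b8}\x{5de}\x{5b8}\x{5e5}");

print "latin: $latin\n";
printf "hebrew code points: %s\n",
    join(' ', map { sprintf 'U+%04X', ord } split //, $hebrew);
```

Unlike Text::Unaccent, this takes and returns character strings, so it needs no decode step afterwards. It is also more aggressive, which is why feedback from readers of the affected scripts matters.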



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-07 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #8 from Galen Charlton  ---
(In reply to David Cook from comment #7)
> Is it for generating a userid?

Yes

> Do we really need to remove accents
> for that?

Per bug 7411, there was apparently an issue searching on usernames with
diacritics, although in retrospect that may simply have been an issue with
mismatched Unicode normalization forms -- impossible to tell now.

The current patch set for bug 7679 also proposes to use Text::Unaccent, but I'm
dubious about that one.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-07 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #11 from David Cook  ---
Analyzing what "use utf8" does and it's... interesting.


#use utf8;
#binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);

Hex = d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9
Text::Unaccent   - مُدَرِّسَة => مُدَرِّسَة

echo "مُدَرِّسَة" | xxd -p
d985d98fd8afd98ed8b1d990d991d8b3d98ed8a90a

[That last 0a byte is just a LF character (ie \n)]

use utf8;
#binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);

Hex = 454f2f4e315051334e29
Text::Unaccent   - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©

#use utf8;
binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);

Hex = d985d98fd8afd98ed8b1d990d991d8b3d98ed8a9
Text::Unaccent   - ��د�ر��س�ة => ��د�ر��س�ة

use utf8;
binmode STDOUT, ':utf8';
say "Hex = ".unpack("H*",$_);

Hex = 454f2f4e315051334e29
Text::Unaccent   - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©
--

I have no idea what 454f2f4e315051334e29 is... it's not UTF-8 or Latin1. In
fact, if you try to read it as either... you'll just get "EO/N1PQ3N)".

Ahh, I was missing this error message: Character in 'H' format wrapped in
unpack at unaccent.pl line 46.

Here's some more info using Devel::Peek::Dump():
PV = 0x1ba6b20
"\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0
[UTF8 "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}"]

Indeed, if we look back at our UTF-8 table:
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=1536

0645 is the code point for ARABIC LETTER MEEM which would be encoded as d9 85.

454f2f4e315051334e29 is clearly a butchering of the internal string of Unicode
code points
"\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}", where
only the low byte of each code point is shown.
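
To get the hex of the UTF-8 bytes from a character string without hitting the "wrapped in unpack" problem, one option is to encode explicitly before unpacking. A minimal sketch, using the same ten code points as above:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The Arabic test word as a Perl character string (ten code points).
my $chars = "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}";

# Encode to UTF-8 octets first, then hex-dump the octets.
my $hex = unpack('H*', encode('UTF-8', $chars));
print "Hex = $hex\n";
```

This should print the same hex string that xxd reported for the word, minus the trailing 0a.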

--

Ahh... I think I might have figured it out.

When you use "use utf8":

$_ = PV = 0xf25f60
"\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0
[UTF8 "\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629}"]
Text::Unaccent::unac_string('UTF-8', $_) = PV = 0x2a0a0c0
"\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251"\0

If you print out the content of "Text::Unaccent::unac_string('UTF-8', $_)" on
its own, you'll get مُدَرِّسَة.

However, if you mix $_ and $unaccented in a single concatenated string, you're
going to wind up with a correct $_ but a double-encoded $unaccented.

If you look at the concatenated string, you'll get a PV of:

PV = 0x29028c0 "Text::Unaccent -
\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251
-
\303\231\302\205\303\231\302\217\303\230\302\257\303\231\302\216\303\230\302\261\303\231\302\220\303\231\302\221\303\230\302\263\303\231\302\216\303\230\302\251
\n"\0 [UTF8 "Text::Unaccent -
\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} -
\x{d9}\x{85}\x{d9}\x{8f}\x{d8}\x{af}\x{d9}\x{8e}\x{d8}\x{b1}\x{d9}\x{90}\x{d9}\x{91}\x{d8}\x{b3}\x{d9}\x{8e}\x{d8}\x{a9}
\n"]

So in that UTF8 section you have $_ represented by Unicode code points, while
the UTF-8 encoded bytes of $unaccented have each been turned into a separate
code point, one per byte.

If you wanted to concatenate them both in the string, you'd first have to run
"$unaccented = decode('UTF-8', $unaccented)". Then your concatenated string
would internally look like: 

PV = 0x27812a0 "Text::Unaccent -
\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251
-
\331\205\331\217\330\257\331\216\330\261\331\220\331\221\330\263\331\216\330\251
\n"\0 [UTF8 "Text::Unaccent -
\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} -
\x{645}\x{64f}\x{62f}\x{64e}\x{631}\x{650}\x{651}\x{633}\x{64e}\x{629} \n"]

And that would be correct:

Text::Unaccent - مُدَرِّسَة - مُدَرِّسَة
Strip NonspacingMark - مُدَرِّسَة => مدرسة

I mean... the output still doesn't do us much good, but that explains the
mangling.

While we gave Text::Unaccent a Perl string with a UTF8 flag set, it took that
string through to some C code using a XS interface, did a few things (depending
on the scenario), and then passed back a Perl string without a UTF8 flag set,
which seems to confuse Perl.

If we do a utf8::upgrade($unaccented) earlier, it still creates a string with
incorrect code points...


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-07 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #9 from David Cook  ---
(In reply to Galen Charlton from comment #8)
> (In reply to David Cook from comment #7)
> > Do we really need to remove accents
> > for that?
> 
> Per bug 7411, there was apparently an issue searching on usernames with
> diacritics, although in retrospect that may simply have been an issue with
> mismatched Unicode normalization forms -- impossible to tell now.
> 
> The current patch set for bug 7679 also proposes to use Text::Unaccent, but
> I'm dubious about that one.

It's surprising that Text::Unaccent doesn't appear to be working correctly,
since it is using iconv for the heavy lifting, and iconv seems to be pretty
good when it comes to character conversions.

I can't speak to Hebrew or Greek (while I thought I wasn't bad with the modern
Greek alphabet, I didn't know they used accents...), but Arabic sure is
interesting.

So we have the following string:
مُدَرِّسَة

If we run the following:
echo "مُدَرِّسَة" | xxd -p

We get this hex:
d985d98fd8afd98ed8b1d990d991d8b3d98ed8a90a

If we look at the first couple bytes there using a UTF-8 table
(http://www.utf8-chartable.de/unicode-utf8-table.pl)

d985 = م = ARABIC LETTER MEEM
d98f = ُُ = ARABIC DAMMA
Together, these are written like مُ 

However, if you add the letter "dal":
d8af = د = ARABIC LETTER DAL

You'll get something like the following:
مُد

We'd recognize that from the "English end/Arabic start" of the string: 
"مُدَرِّسَة"

I had forgotten that Hebrew only has consonants in its alphabet, and it appears
Arabic is the same. So that "damma" indicates a vowel sound but isn't a letter
per se. I'd say it's a diacritic and this would agree:
https://en.wikipedia.org/wiki/Arabic_diacritics#.E1.B8.8Cammah

So the output for "Strip Nonspacing Mark" looks good in the very first case at
least:

Strip NonspacingMark - مُدَرِّسَة => مدرسة

Although I don't know if it makes sense semantically as I don't read Arabic. If
I understand correctly, you can omit vowel sounds from written Arabic and rely
purely on context for meaning?
(https://en.wikipedia.org/wiki/Arabic_alphabet#Vowels)

At a glance, the Strip NonspacingMark looks OK for Greek too as those
diacritics appear to be there purely for pronunciation like in languages
written in the Roman alphabet.
(https://en.wikipedia.org/wiki/Modern_Greek#Phonology_and_orthography)


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-07 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #12 from David Cook  ---
More interesting things... 

You can still have a Perl string with a UTF8 flag set, even when you're not
using "use utf8"...

My example:

my $arabic =  "\x{0645}";
PV = 0x190e950 "\331\205"\0 [UTF8 "\x{645}"]


Interestingly, if I don't use "use utf8", and use a UTF8 encoded character in
my source code, I get a string without a UTF8 flag:

my $arabic_text = "م";
PV = 0x1ea92b0 "\331\205"\0

I imagine use of the \x{} construct must do a utf8::upgrade...

--

In any case, if I put $arabic and $arabic_text into the same string, I get the
following:

my $arabic_result = "Arabic = $arabic_text = $arabic";
say $arabic_result;

Arabic = Ù� = م 
PV = 0x29feda0 "Arabic = \303\231\302\205 = \331\205"\0 [UTF8 "Arabic =
\x{d9}\x{85} = \x{645}"]


However, if I try "$arabic_text = decode("UTF-8", $arabic_text)", which
according to http://perldoc.perl.org/Encode.html means: $characters =
decode('UTF-8', $octets), then I get the following:

Arabic = م = م
PV = 0x15efe50 "Arabic = \331\205 = \331\205"\0 [UTF8 "Arabic = \x{645} =
\x{645}"]

Alternatively, I could have done "$arabic = encode("UTF-8",$arabic);", which
would yield this result:

Arabic = م = م
PV = 0x832210 "Arabic = \331\205 = \331\205"\0

This explains the UTF8 flag a bit:
http://perldoc.perl.org/Encode.html#The-UTF8-flag
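
The two kinds of string can be told apart without Devel::Peek. The following is a minimal sketch using only utf8::is_utf8 and the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars  = "\x{645}";     # character string: one code point (MEEM)
my $octets = "\xD9\x85";    # byte string: the UTF-8 encoding of that code point

# Same underlying bytes, but different lengths and a different UTF8 flag.
printf "chars:  length %d, UTF8 flag %s\n",
    length($chars),  utf8::is_utf8($chars)  ? 'on' : 'off';
printf "octets: length %d, UTF8 flag %s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';

# decode() turns octets into characters; encode() goes the other way.
my $decoded = decode('UTF-8', $octets);
my $encoded = encode('UTF-8', $chars);

print "decode(octets) eq chars:  ", ($decoded eq $chars  ? "yes" : "no"), "\n";
print "encode(chars)  eq octets: ", ($encoded eq $octets ? "yes" : "no"), "\n";
```

The flag itself is an internal detail; what matters in practice is whether a given string holds characters or octets, and that the two are never mixed.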

So yeah... that's cool... who knew that was a thing, eh?


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-07 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #13 from David Cook  ---
(In reply to Galen Charlton from comment #5)
> Some conclusions:
> 
> [1] Text::Unaccent mangles non-Latin characters outright; that's enough
> reason to get rid of it.

As I pointed out in my overly long comments, it doesn't appear that
Text::Unaccent is actually mangling non-Latin characters. 

Rather, in your example, it looks like Perl doesn't correctly handle the
concatenated string composed of one string with a UTF8 flag set and one string
without a UTF8 flag set. 

It looks like Perl tries to do a utf8::upgrade() on the string without the UTF8
flag set (ie the one returned from Text::Unaccent's C code), and instead of
reading it as an octet string and correctly translating into a UTF8 string of
corresponding Unicode code points, it reads each octet in as a code point,
which creates a completely different string for display purposes even though
the underlying octets are the same. 

When given the octets d9 and 85 (ie the Arabic letter Meem), it creates a "UTF8
string" with the code points "\x{d9}\x{85}" when it should create a "UTF8
string" with the single code point "\x{645}".

This only appears to be a problem when you put the Text::Unaccent string in the
same string as a Perl string with a UTF8 flag. If you were to break them into
two separate lines, they'd display correctly in the terminal. Or you could use
Encode::decode("UTF-8",$unaccented) to create a Perl string with a UTF8 flag
with the proper code point "\x{645}".
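
The mangling and the fix can be reproduced without Text::Unaccent at all. In this minimal sketch, $octets is only a stand-in for a value returned by an XS module that emits UTF-8 bytes; the output is shown as hex so that it does not depend on the terminal.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars  = "\x{645}";     # character string, as produced under "use utf8"
my $octets = "\xD9\x85";    # stand-in for octets handed back by an XS module

# Mixing the two: each octet is treated as its own code point on upgrade.
my $mixed = "$chars = $octets";

# Decoding first keeps everything in character form.
my $fixed = "$chars = " . decode('UTF-8', $octets);

# Encode for output and dump as hex.
my $mixed_hex = unpack('H*', encode('UTF-8', $mixed));
my $fixed_hex = unpack('H*', encode('UTF-8', $fixed));
print "mixed: $mixed_hex\n";
print "fixed: $fixed_hex\n";
```

In the mixed case the second MEEM comes out as c3 99 c2 85, the double-encoded form seen in the Devel::Peek dumps; in the fixed case both sides are d9 85.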

> [2] Both Text::Unaccent::PurePerl and stripping NonspacingMark characters
> are better -- they strip accents from Latin scripts, and don't mangle
> non-Latin.  Removing NonspacingMark characters is more aggressive; I think
> we need input from Arabic, Hebrew, and Greek users as to whether that is
> acceptable -- or, alternatively, if we need a system preference, or need to
> bite the bullet and package Text::Unaccent::PurePerl.

I suspect that Text::Unaccent and Text::Unaccent::PurePerl are mostly the same,
but that Text::Unaccent::PurePerl doesn't lose the UTF8 flag on the input
string. We could avoid Text::Unaccent::PurePerl if we simply use
"Encode::decode("UTF-8",$unaccented)" when using Text::Unaccent to translate
the internal byte string into an internal UTF8 string. While it might not be
required that we do that, doing so would probably prevent future buggy
behaviour from occurring.

That said, Text::Unaccent and Text::Unaccent::PurePerl don't necessarily look
good enough. They miss diacritics in Arabic at least, although I think we
definitely need input from Arabic, Hebrew, and CJK users regarding how
stripping NonspacingMark affects those strings. My guess is that it's fine to
strip the diacritics out of Arabic, but there are people much more qualified
than me to answer that question on the listserv. 

Greek actually looks OK with Text::Unaccent if the encoding is handled. We can
see that a bit more clearly with the following lines:

use Text::Unaccent qw/unac_debug/;
unac_debug($Text::Unaccent::DEBUG_HIGH);



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-07 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #10 from David Cook  ---
So I think I've tracked down the C code behind Text::Unaccent:

https://github.com/gitpan/Text-Unaccent/blob/master/unac.c

The only reference I see to "damma" is in the U+FE70...U+FEFF code point range,
which appears to list isolated forms. That is not what we're dealing with in
these examples.

While I haven't reviewed the code extensively, it looks like the tables used
for Text::Unaccent are lacking...

If you replace the following line in Galen's script:

use Text::Unaccent qw//;

with

use Text::Unaccent qw/unac_debug/;
unac_debug($Text::Unaccent::DEBUG_HIGH);

You'll get more details of how Text::Unaccent is working (or not working as it
were).

Here's the output I get for the Arabic:

unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x0645 => untouched
unac.c:13708: unac_data0[15] & unac_positions[0][16]: 0x064f => untouched
unac.c:13708: unac_data34[15] & unac_positions[34][16]: 0x062f => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[17] & unac_positions[34][18]: 0x0631 => untouched
unac.c:13708: unac_data0[16] & unac_positions[0][17]: 0x0650 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x0651 => untouched
unac.c:13708: unac_data34[19] & unac_positions[34][20]: 0x0633 => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[9] & unac_positions[34][10]: 0x0629 => untouched
Text::Unaccent   - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©
Strip NonspacingMark - مُدَرِّسَة => مدرسة

Here's the output I get for the Greek:
unac.c:13708: unac_data21[6] & unac_positions[21][7]: 0x0386 => 0x0391
unac.c:13708: unac_data22[12] & unac_positions[22][13]: 0x03ac => 0x03b1
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[8] & unac_positions[21][9]: 0x0388 => 0x0395
unac.c:13708: unac_data22[13] & unac_positions[22][14]: 0x03ad => 0x03b5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[9] & unac_positions[21][10]: 0x0389 => 0x0397
unac.c:13708: unac_data22[14] & unac_positions[22][15]: 0x03ae => 0x03b7
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[10] & unac_positions[21][11]: 0x038a => 0x0399
unac.c:13708: unac_data22[15] & unac_positions[22][16]: 0x03af => 0x03b9
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[12] & unac_positions[21][13]: 0x038c => 0x039f
unac.c:13708: unac_data23[12] & unac_positions[23][13]: 0x03cc => 0x03bf
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[14] & unac_positions[21][15]: 0x038e => 0x03a5
unac.c:13708: unac_data23[13] & unac_positions[23][14]: 0x03cd => 0x03c5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[15] & unac_positions[21][16]: 0x038f => 0x03a9
unac.c:13708: unac_data23[14] & unac_positions[23][15]: 0x03ce => 0x03c9
Text::Unaccent   - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Î�α Î�ε Î�η Î�ι Î�ο
Υ� Ω�
Strip NonspacingMark - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω

Interestingly, we can see a reference to the Greek capital letter alpha with the
tonos diacritic:

* 0386 GREEK CAPITAL LETTER ALPHA WITH TONOS
*   0391 GREEK CAPITAL LETTER ALPHA

Indeed, in the output, we can see that 0x0386 was changed to 0x0391... although
admittedly I don't know exactly how. It looks like a binary operation that uses
a bitmask to produce a certain value... we don't need to know 100% how that
mechanism is working right now... just that it works as described above.
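
The 0x0386 => 0x0391 mapping can be cross-checked against Unicode's own
canonical decomposition with Unicode::Normalize. This is only a sketch: it
says nothing about how unac.c does its lookup, and decomposition() is a helper
name made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Return the canonical decomposition of one code point as a list of
# code points.
sub decomposition {
    my ($cp) = @_;
    return map { ord($_) } split //, NFD(chr $cp);
}

# GREEK CAPITAL LETTER ALPHA WITH TONOS
printf "U+%04X => %s\n", 0x0386,
    join(' ', map { sprintf 'U+%04X', $_ } decomposition(0x0386));
```

The decomposition is the base letter U+0391 followed by the combining mark
U+0301, so dropping the mark leaves the same 0x0391 that Text::Unaccent
reports.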

--

So in the Arabic example... everything was "untouched" and yet the output is
garbled. That's certainly an encoding issue... 

Indeed, look at the following:

dcook@koha:~/experiments> echo "مُدَرِّسَة" | iconv -f latin1 -t utf-8
��د�ر��س�ة

That is the same output as Text::Unaccent:

Text::Unaccent   - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©

So somewhere along the line that UTF-8 string is getting double-encoded.
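
The same effect can be reproduced in Perl with the core Encode module. A
minimal sketch, assuming a single Meem as input; hex_octets() is a helper
invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Render a byte string as space-separated hex octets.
sub hex_octets {
    return join ' ', map { sprintf '%02x', ord($_) } split //, $_[0];
}

# One character: ARABIC LETTER MEEM.
my $text = "\x{645}";

# Correct output: encode once to UTF-8 octets.
my $once = encode('UTF-8', $text);

# Double encoding: misread those octets as Latin-1 characters, then
# encode the result to UTF-8 a second time.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Undoing the second step recovers the single-encoded octets.
my $fixed = encode('ISO-8859-1', decode('UTF-8', $twice));

print "once:  ", hex_octets($once),  "\n";
print "twice: ", hex_octets($twice), "\n";
print "fixed: ", hex_octets($fixed), "\n";
```

The octets d9 85 become c3 99 c2 85 when encoded twice, and converting back
from UTF-8 to Latin-1 restores d9 85.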

Check this out:

dcook@koha:~/experiments> echo "مُدَرِّسَة" | iconv -f latin1 -t utf-8 | iconv -f utf-8 -t latin1
مُدَرِّسَة

I think the double-encoding is down to us using "binmode STDOUT, ':utf8';"
(which tells Perl to output UTF-8 encoded bytes instead of Latin-1, or whatever
other single-byte encoding it normally uses) and "use utf8" (which tells Perl
that the source code uses UTF-8)...

Removing those gets us the following:

Text::Unaccent   - été => ete

Strip NonspacingMark - été => A▒tA▒
Text::Unaccent   - umlaüt => umlaut
Wide character in print at unaccent.pl line 47.
Strip NonspacingMark - umlaüt => umlaA1⁄4t
Text::Unaccent   - עברית => עברית

Strip NonspacingMark - עברית => עב▒ י▒a

[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-06 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

David Cook  changed:

   What|Removed |Added

 CC||dc...@prosentient.com.au

--- Comment #7 from David Cook  ---
My knowledge of Arabic is pretty much non-existent, but I recall a librarian I
know once wanting Zebra to remove hamza for search purposes...
(https://en.wikipedia.org/wiki/Arabic_diacritics)

What's the purpose of Text::Unaccent currently? It's only used when adding
members? Is it for generating a userid? Do we really need to remove accents for
that?



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-05 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Katrin Fischer  changed:

   What|Removed |Added

   Keywords||dependency
 CC||katrin.fisc...@bsz-bw.de



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-05 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Katrin Fischer  changed:

   What|Removed |Added

   Severity|enhancement |normal
Version|3.20|master

--- Comment #6 from Katrin Fischer  ---
Seeing the bad results for non-Latin scripts, I am promoting this from
enhancement to bug. Not sure about the Arabic - I think we need a native
speaker/reader.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-05 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #3 from Galen Charlton  ---
I've also had issues getting Text::Unaccent to install on RedHat-like distros. 
That said, at the moment Text::Unaccent::PurePerl is not packaged for Debian or
Ubuntu.

It is also (presumably) slower than Text::Unaccent, although given that
unac_string() is used only when registering a new patron, I don't think speed
matters much here.

I'd like to suggest an alternative way to get the job done - use
Unicode::Normalize, which is already a Koha dependency:

http://stackoverflow.com/a/17561928/880696

If we do that, we can drop the dependency on Text::Unaccent entirely.
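
A minimal sketch of that approach, assuming the linked answer's technique of
decomposing with NFKD and then dropping nonspacing marks. strip_marks() is a
name made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Decompose each character into base letter plus combining marks, then
# delete the marks (general category Mn, i.e. NonspacingMark).
sub strip_marks {
    my ($str) = @_;
    my $decomposed = NFKD($str);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks("\x{e9}t\x{e9}"), "\n";    # "été"
print strip_marks("\x{f8}"), "\n";           # "ø"
```

Note that this only affects characters that Unicode decomposes into a base
letter plus a mark: "é" loses its accent, but "ø" (U+00F8) has no
decomposition and comes back unchanged.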



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-05 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

--- Comment #5 from Galen Charlton  ---
I wrote a little test program to compare the options:

___BEGIN___
#!/usr/bin/perl

use Modern::Perl;
use Text::Unaccent qw//;
use Text::Unaccent::PurePerl qw//;
use utf8;
use Unicode::Normalize;

binmode STDOUT, ':utf8';
my @str = (
    'été',
    'umlaüt',
    'עברית',
    'חוֹלָם',
    '北京市',
    'Άά Έέ Ήή Ίί Όό Ύύ Ώώ',
    'مُدَرِّسَة'
);

# Decompose the string (NFKD), then drop the combining marks.
sub unaccent {
    my $str = NFKD(shift);
    $str =~ s/\p{NonspacingMark}//g;
    return $str;
}

foreach (@str) {
    if ($_ eq 'مُدَرِّسَة') {
        # special case to avoid locking my terminal session (!)
        print "Text::Unaccent   - $_ => *refusing to let Text::Unaccent do this*\n";
    } else {
        print "Text::Unaccent   - $_ => " .
            Text::Unaccent::unac_string('utf-8', $_) . "\n";
    }
    print "Text::Unaccent::PurePerl - $_ => " .
        Text::Unaccent::PurePerl::unac_string($_) . "\n";
    print "Strip NonspacingMark - $_ => " . unaccent($_) . "\n";
}
___END___

Here's its output:

Text::Unaccent   - été => ete
Text::Unaccent::PurePerl - été => ete
Strip NonspacingMark - été => ete
Text::Unaccent   - umlaüt => umlaut
Text::Unaccent::PurePerl - umlaüt => umlaut
Strip NonspacingMark - umlaüt => umlaut
Text::Unaccent   - עברית => ×¢×ר×ת
Text::Unaccent::PurePerl - עברית => עברית
Strip NonspacingMark - עברית => עברית
Text::Unaccent   - חוֹלָם => ××Ö¹×Ö¸×
Text::Unaccent::PurePerl - חוֹלָם => חוֹלָם
Strip NonspacingMark - חוֹלָם => חולם
Text::Unaccent   - 北京市 => å京å¸
Text::Unaccent::PurePerl - 北京市 => 北京市
Strip NonspacingMark - 北京市 => 北京市
Text::Unaccent   - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Îα Îε Îη Îι Îο Î¥Ï
 ΩÏ
Text::Unaccent::PurePerl - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Strip NonspacingMark - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Text::Unaccent   - مُدَرِّسَة => *refusing to let Text::Unaccent do 
this*
Text::Unaccent::PurePerl - مُدَرِّسَة => مُدَرِّسَة
Strip NonspacingMark - مُدَرِّسَة => مدرسة

Some conclusions:

[1] Text::Unaccent mangles non-Latin characters outright; that's enough reason
to get rid of it.
[2] Both Text::Unaccent::PurePerl and stripping NonspacingMark characters are
better -- they strip accents from Latin scripts, and don't mangle non-Latin. 
Removing NonspacingMark characters is more aggressive; I think we need input
from Arabic, Hebrew, and Greek users as to whether that is acceptable -- or,
alternatively, if we need a system preference, or need to bite the bullet and
package Text::Unaccent::PurePerl.


[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-05 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Galen Charlton  changed:

   What|Removed |Added

 Status|Needs Signoff   |In Discussion



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-12-05 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Katrin Fischer  changed:

   What|Removed |Added

   See Also||http://bugs.koha-community.
   ||org/bugzilla3/show_bug.cgi?
   ||id=7411
 CC||sophie.meyni...@biblibre.co
   ||m

--- Comment #4 from Katrin Fischer  ---
I didn't remember, but it looks like I introduced this dependency. I am in
favor of reducing troublesome dependencies - so I am totally fine with an
alternative solution.

Unfortunately the initial bug 7411 has no clear problem description, so it's
hard to tell now why we made the change in the first place.

Maybe Biblibre could check the linked Mantis entry?
http://mantis.biblibre.com/view.php?id=7744
Adding Sophie to this bug.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-09-01 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Zeno Tajoli  changed:

   What|Removed |Added

 CC||z.taj...@cineca.it
   Patch complexity|--- |Medium patch

--- Comment #2 from Zeno Tajoli  ---
Patch complexity is 'Medium' because this change has many architectural
connections.



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-09-01 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Zeno Tajoli  changed:

   What|Removed |Added

 CC|z.taj...@cineca.it  |



[Koha-bugs] [Bug 14759] Replacement for Text::Unaccent

2015-08-31 Thread bugzilla-daemon
http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759

Ketan Kulkarni  changed:

   What|Removed |Added

   Assignee|koha-b...@lists.koha-commun |ke...@carvingit.com
   |ity.org |
 Status|NEW |Needs Signoff
 CC||ke...@carvingit.com

--- Comment #1 from Ketan Kulkarni  ---
Created attachment 42120
  -->
http://bugs.koha-community.org/bugzilla3/attachment.cgi?id=42120&action=edit
This patch uses the proposed module - Text::Unaccent::PurePerl
