Your code does not explicitly specify the charset when reading/writing files. Don't use FileWriter to save the output; use new OutputStreamWriter(new FileOutputStream("…"), "UTF-8") instead. Unfortunately, FileWriter does not have a charset parameter, so you need to use OutputStreamWriter.
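A minimal sketch of that replacement, written to stay compatible with Java 6. The class name Utf8Copy, the method copyAsUtf8, and the temp-file name are made up for illustration; in the original code the Reader would come from tika.parse(...), while a StringReader stands in for it here so the example runs without Tika on the classpath.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;

public class Utf8Copy {

    // Hypothetical helper: copies every character from the reader into the
    // target file, always encoding as UTF-8 instead of the platform default.
    // The caller remains responsible for closing the reader.
    public static void copyAsUtf8(Reader reader, File target) throws IOException {
        Writer writer = new OutputStreamWriter(new FileOutputStream(target), "UTF-8");
        try {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } finally {
            // Closing flushes the encoder and releases the file handle.
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the Reader returned by tika.parse(...).
        Reader reader = new StringReader("Gr\u00fc\u00dfe aus Bremen");
        File out = File.createTempFile("tika-output", ".txt");
        try {
            copyAsUtf8(reader, out);
        } finally {
            reader.close();
        }
        System.out.println("Wrote " + out.length() + " bytes to " + out);
    }
}
```

Because the charset is named explicitly, the bytes written should not depend on the JVM's default charset, so the same code would behave the same way in a plain Java project and inside a container.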
The reason you see a difference may be that Karaf uses a different default charset.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

From: Bratislav Stojanovic [mailto:[email protected]]
Sent: Friday, October 18, 2013 11:42 AM
To: [email protected]
Subject: Fwd: Tika OSGi bundle does not produce UTF-8 output

Hi,

I've tried exactly the same code in two scenarios:

    Tika tika = new Tika();
    Metadata metadata = new Metadata();
    Reader reader = tika.parse(new File("..."));
    FileWriter fw = new FileWriter(new File("..."));
    int data = reader.read();
    StringBuilder sb = new StringBuilder();
    while (data != -1){
        char dataChar = (char) data;
        sb.append(dataChar);
        fw.write(dataChar);
        data = reader.read();
    }

When I put this code in a simple Java project with tika-app-1.4.jar as a dependency, it generates UTF-8 output (correct).

When I put this code inside a bundle with tika-bundle and tika-core as dependencies and deploy it inside Karaf, it generates ANSI output (blah).

Both projects are managed with Maven and Eclipse 4.2. Do I have to additionally set something, or should I embed tika-app inside my bundle (using maven-bundle-plugin)?

I'm using Tika 1.4, Java 1.6.45, Win 7 x64 and Karaf 2.3.3.

--
Bratislav Stojanovic, M.Sc.
