Your code does not explicitly specify a charset when reading/writing files.
Don't use FileWriter to save the output; use new OutputStreamWriter(new
FileOutputStream("..."), "UTF-8") instead (unfortunately FileWriter has no
charset parameter, so you need to use OutputStreamWriter).
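A minimal sketch of this (file name and content are hypothetical; the "UTF-8" string form is used because StandardCharsets is only available from Java 7):

```java
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Utf8WriteSketch {
    public static void main(String[] args) throws Exception {
        // Explicit charset: output is UTF-8 regardless of the JVM default
        Writer w = new OutputStreamWriter(new FileOutputStream("out.txt"), "UTF-8");
        try {
            w.write("héllo"); // non-ASCII character survives the round trip
        } finally {
            w.close();        // flushes and releases the file handle
        }
    }
}
```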

 

The reason you see a difference is probably that Karaf uses a different
default charset.
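To confirm this (an assumption, not verified here), a quick check is to print the JVM default charset from the standalone project and from inside the Karaf container and compare:

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // The charset FileWriter falls back to when none is given
        System.out.println(Charset.defaultCharset());
        // The system property it is usually derived from
        System.out.println(System.getProperty("file.encoding"));
    }
}
```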

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: [email protected]

 

From: Bratislav Stojanovic [mailto:[email protected]] 
Sent: Friday, October 18, 2013 11:42 AM
To: [email protected]
Subject: Fwd: Tika OSGi bundle does not produce UTF-8 output

 

Hi,

 

I've tried exactly the same code in two scenarios :

 

Tika tika = new Tika();
Metadata metadata = new Metadata(); // note: never passed to parse(), so it stays empty

Reader reader = tika.parse(new File("..."));
FileWriter fw = new FileWriter(new File("...")); // uses the platform default charset!

StringBuilder sb = new StringBuilder();
int data = reader.read();
while (data != -1) {
    char dataChar = (char) data;
    sb.append(dataChar);
    fw.write(dataChar);
    data = reader.read();
}
fw.close();     // flush and release the output file
reader.close();

 

When I put this code in a simple Java project with tika-app-1.4.jar as a
dependency, it generates UTF-8 output (correct).

When I put this code inside a bundle with tika-bundle and tika-core as
dependencies and deploy it inside Karaf, it generates ANSI output (blah).

Both projects are managed with Maven and Eclipse 4.2.

 

Do I have to additionally set something, or should I embed tika-app inside my
bundle (using maven-bundle-plugin)?

 

I'm using Tika 1.4, Java 1.6.45, Win 7 x64 and Karaf 2.3.3.

 

 

-- 

Bratislav Stojanovic, M.Sc.
