I was trying to run the following command:

java -jar "C:\Code\ATEK\CT\apache-tika-0.9\tika-app-0.9.jar" --xml "C:\Data\ATEK\CT\Tests\Example Documents\Urdu امریکہ کےانتباہ کےبعدبھارت میں سکیورٹی میں اضافہ.doc" > "C:\Data\ATEK\CT\Tests\Example Documents_OUT_XML\Urdu امریکہ کےانتباہ کےبعدبھارت میں سکیورٹی میں اضافہ.doc.OUT.xml"

The command was in a batch file saved as UTF-8 without a BOM.
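
One thing I suspect, but have not confirmed, is that cmd.exe reads the batch file in the console's active code page rather than as UTF-8, so the Urdu file name never reaches Java intact. The sketch below only illustrates that effect in plain Java. ISO-8859-1 is an assumed stand-in for whatever code page the console actually uses, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BatchFileEncoding {
    // Encodes the text as UTF-8 (how the batch file was saved) and then
    // decodes those bytes with a different, single-byte charset.
    static String misread(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        String name = "Urdu امریکہ.doc"; // shortened file name for illustration
        // ISO-8859-1 is an assumption; the real console code page may differ.
        String garbled = misread(name, StandardCharsets.ISO_8859_1);
        System.out.println("same name after misreading: " + garbled.equals(name));
        System.out.println("characters before: " + name.length()
                + ", after: " + garbled.length());
    }
}
```

If this is what happens, the path Java receives would name a file that does not exist on disk, even though the batch file looks correct in a UTF-8 editor.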

This produced an empty output file with the following name:

Urdu امریکہ کےانتباہ کےبعدبھارت میں سکیورٹی میں اضافہ.doc.OUT.xml

It also generated the following exception:

Exception in thread "main" java.net.MalformedURLException: unknown protocol: c
        at java.net.URL.<init>(URL.java:590)
        at java.net.URL.<init>(URL.java:480)
        at java.net.URL.<init>(URL.java:429)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:298)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
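
The same message can be reproduced outside Tika with java.net.URL alone, which suggests the argument is being parsed as a URL string, so that the drive letter "C:" is read as a protocol. My assumption, which I have not verified against the Tika source, is that the CLI falls back to URL parsing when it cannot find the argument as a file. A minimal sketch, with a hypothetical path and made-up class and method names:

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

public class WindowsPathAsUrl {
    // Returns the error message from parsing the raw string as a URL,
    // or null if the string parses successfully.
    static String urlParseError(String arg) {
        try {
            new URL(arg);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    // Builds a file: URL from a filesystem path via java.io.File,
    // so the drive letter is never treated as a protocol.
    static URL fileUrl(String path) throws MalformedURLException {
        return new File(path).toURI().toURL();
    }

    public static void main(String[] args) throws Exception {
        String path = "C:\\Data\\example.doc"; // hypothetical path
        System.out.println("parsed as URL string: " + urlParseError(path));
        System.out.println("via File.toURI(): protocol = " + fileUrl(path).getProtocol());
    }
}
```

Going through File.toURI() avoids the protocol problem, but it does not help if the path itself has already been garbled before it reaches Java.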

Is this caused by the Unicode characters in either the input or the output 
filepath? Is there a way of processing such files without renaming them?

Thanks,

Mr. Ilya Zavorin, Ph.D.
Principal Research Analyst
Knowledge and Information Management Division
CACI International
4831 Walden Lane
Lanham, MD 20706
ph: 1-301-306-2859
fx: 1-301-306-8201
www.caci.com