I was trying to run the following command (a single line in the batch file):
java -jar "C:\Code\ATEK\CT\apache-tika-0.9\tika-app-0.9.jar" --xml "C:\Data\ATEK\CT\Tests\Example Documents\Urdu امریکہ کےانتباہ کےبعدبھارت میں سکیورٹی میں اضافہ.doc" > "C:\Data\ATEK\CT\Tests\Example Documents_OUT_XML\Urdu امریکہ کےانتباہ کےبعدبھارت میں سکیورٹی میں اضافہ.doc.OUT.xml"
The command was in a batch file saved as UTF-8 without a BOM.
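One possible explanation, which I have not confirmed: if cmd.exe reads the batch file in a single-byte console code page rather than as UTF-8, each two-byte Urdu character would turn into two unrelated characters before the path ever reaches Java. The sketch below only illustrates that effect. ISO-8859-1 stands in for whatever code page the console actually uses, and the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Encode the text as UTF-8, then decode the bytes with a single-byte
    // charset, as a console using a legacy code page might do.
    // ISO-8859-1 is an assumption; the real console code page may differ.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Last word of the file name, written with escapes so the source
        // file itself stays ASCII.
        String word = "\u0627\u0636\u0627\u0641\u06C1";
        String mangled = misdecode(word);
        System.out.println(word.length());         // 5 characters
        System.out.println(mangled.length());      // 10 characters after mis-decoding
        System.out.println(word.equals(mangled));  // false
    }
}
```

If this is what happens, the path passed to Tika would no longer name an existing file, even though the file is there on disk.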
This produced an empty output file with the following name:
Urdu امریکہ کےانتباہ کےبعدبھارت میں سکیورٹی میں اضافہ.doc.OUT.xml
It also generated the following exception:
Exception in thread "main" java.net.MalformedURLException: unknown protocol: c
    at java.net.URL.<init>(URL.java:590)
    at java.net.URL.<init>(URL.java:480)
    at java.net.URL.<init>(URL.java:429)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:298)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
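The exception message itself can be reproduced without Tika. The trace shows TikaCLI.process calling the java.net.URL constructor, so my guess (an assumption, I have not read the Tika source) is that the argument was not recognized as an existing file and was then parsed as a URL, where the drive letter "C:" is read as a protocol name. A minimal sketch, with a hypothetical class name and a made-up ASCII path:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlFallbackDemo {
    // Try to parse the argument as a URL; return "ok" on success,
    // otherwise the MalformedURLException message.
    static String tryAsUrl(String arg) {
        try {
            new URL(arg);
            return "ok";
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A Windows path: everything before the first ':' is taken as the protocol.
        System.out.println(tryAsUrl("C:\\Data\\example.doc"));        // unknown protocol: c
        // The same location written as a file URL parses without error.
        System.out.println(tryAsUrl("file:///C:/Data/example.doc"));  // ok
    }
}
```

Note that a plain ASCII path triggers the same message, so the error by itself does not prove the Unicode characters are at fault. It only suggests the path was not treated as a file.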
Is this caused by the Unicode characters in either the input or the output
filepath? Is there a way of processing such files without renaming them?
Thanks,
Mr. Ilya Zavorin, Ph.D.
Principal Research Analyst
Knowledge and Information Management Division
CACI International
4831 Walden Lane
Lanham, MD 20706
ph: 1-301-306-2859
fx: 1-301-306-8201
www.caci.com