Ilya,

Since you are at the CMD prompt, you are at the mercy of the fact
that the CMD box only works in 'ascii' mode (the current ANSI code
page), not in Unicode. I recommend that you try this out with a
little Java program that calls the Tika API, and I expect, though
cannot guarantee, that it will work.
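
To illustrate the code page point, here is a minimal sketch of what
probably happens to a UTF-8 file name when its bytes are read in a
single-byte code page. The class and method names are made up for this
example, and ISO-8859-1 is only a stand-in, since I do not know which
ANSI code page your machine uses. The sample word is meant to resemble
the Urdu text in your file name and is written with Unicode escapes so
the source file itself stays plain ASCII.

```java
import java.nio.charset.StandardCharsets;

public class CodePageMismatch {

    // Decode the UTF-8 bytes of the text as if they were single-byte text.
    // ISO-8859-1 is an assumed stand-in for the real (unknown) ANSI code page.
    public static String misreadAsSingleByte(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the misreading: take the bytes back out and decode them as UTF-8.
    public static String recover(String misread) {
        byte[] raw = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample Urdu word, given as Unicode escapes (illustrative only).
        String word = "\u0627\u0645\u0631\u06cc\u06a9\u06c1";
        String misread = misreadAsSingleByte(word);

        System.out.println("original chars: " + word.length());
        System.out.println("misread chars:  " + misread.length());
        System.out.println("same text:      " + word.equals(misread));
        System.out.println("recoverable:    " + word.equals(recover(misread)));
    }
}
```

Each Urdu letter takes two bytes in UTF-8, so the misread name comes
out twice as long and no longer matches any file on disk. Inside a Java
program the path is an ordinary Java String, so it does not have to
pass through the batch file at all.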



On Tue, Mar 8, 2011 at 4:38 PM, Ilya Zavorin <[email protected]> wrote:
> I was trying to run the following command:
>
>
>
> java -jar "C:\Code\ATEK\CT\apache-tika-0.9\tika-app-0.9.jar" --xml
> "C:\Data\ATEK\CT\Tests\Example Documents\Urdu امریکہ کےانتباہ کےبعدبھارت میں
> سکیورٹی میں اضافہ.doc" > "C:\Data\ATEK\CT\Tests\Example
> Documents_OUT_XML\Urdu امریکہ کےانتباہ کےبعدبھارت میں سکیورٹی میں
> اضافہ.doc.OUT.xml"
>
>
>
> This command was specified in a batch file that was saved as a UTF-8 file
> without the BOM.
>
>
>
> This produced an empty output file with the following name:
>
>
>
> Urdu امریکہ کےانتباہ کےبعدبھارت میں
> سکیورٹی میں اضافہ.doc.OUT.xml
>
> It also generated the following exception:
>
>
>
> Exception in thread "main" java.net.MalformedURLException: unknown protocol: c
>         at java.net.URL.<init>(URL.java:590)
>         at java.net.URL.<init>(URL.java:480)
>         at java.net.URL.<init>(URL.java:429)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:298)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
>
>
>
> Is this caused by the Unicode characters in either the input or the output
> filepath? Is there a way of processing such files without renaming them?
>
>
>
> Thanks,
>
>
>
> Mr. Ilya Zavorin, Ph.D.
>
> Principal Research Analyst
>
> Knowledge and Information Management Division
>
> CACI International
>
> 4831 Walden Lane
>
> Lanham, MD 20706
>
> ph: 1-301-306-2859
>
> fx: 1-301-306-8201
>
> [email protected]
>
> www.caci.com
>
>
