Tika 1.1 .chm parsing

Kevin Miller Fri, 13 Apr 2012 08:33:53 -0700

I was updating my .NET wrapper around Tika and noticed that you now support
.chm parsing as of version 0.10. I added a test to see whether extraction of
.chm files was working. I tried .chm files generated by two different sources
with no luck. Tika returns metadata about the file (content-length,
mimetype, resourceName) but no text output.


Here is how my code calls Tika:

https://github.com/KevM/tikaondotnet/blob/tika1-1/TikaOnDotnet/TextExtractor.cs

Here is my test:

        [Test]
        public void should_extract_from_chm()
        {
            var textExtractionResult = new TextExtractor().Extract("docs.chm");

            textExtractionResult.Metadata["resourceName"].ShouldContain("docs.chm");

            // FAILS HERE
            textExtractionResult.Text.Trim().ShouldNotBeEmpty();
        }



Shouldn't the .CHM parser be returning text extracted from the help files?
