Just my 5 cents ;-)

The basic structure of the MS CAB archive is described in the [MS-CAB] document, which you can find on the Microsoft Open Specifications site. An older version of the documentation is also available as part of the MS Cab SDK (also on the MS site).
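
To give an idea of what the container part looks like, here is a rough Java sketch that reads only the fixed-size portion of the cabinet header (CFHEADER). It assumes the layout I remember from [MS-CAB]: an "MSCF" signature followed by little-endian fields. The class and field names are made up for illustration and are not taken from any existing library, so please check the offsets against the spec before relying on them.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Illustrative only: fixed-size part of CFHEADER, names are hypothetical. */
final class CabHeaderSketch {
    /** Size in bytes of the fixed part of CFHEADER (assumed from [MS-CAB]). */
    static final int FIXED_HEADER_SIZE = 36;

    final long cabinetSize;  // cbCabinet: total size of the cabinet file
    final long filesOffset;  // coffFiles: offset of the first CFFILE entry
    final int versionMinor;
    final int versionMajor;
    final int folderCount;   // cFolders
    final int fileCount;     // cFiles
    final int flags;         // optional-field flags, not interpreted here

    private CabHeaderSketch(long cabinetSize, long filesOffset, int versionMinor,
                            int versionMajor, int folderCount, int fileCount, int flags) {
        this.cabinetSize = cabinetSize;
        this.filesOffset = filesOffset;
        this.versionMinor = versionMinor;
        this.versionMajor = versionMajor;
        this.folderCount = folderCount;
        this.fileCount = fileCount;
        this.flags = flags;
    }

    /** Parses the first 36 bytes of a cabinet; optional fields are ignored. */
    static CabHeaderSketch parse(byte[] data) {
        if (data.length < FIXED_HEADER_SIZE) {
            throw new IllegalArgumentException("too short for CFHEADER");
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);

        byte[] signature = new byte[4];
        buf.get(signature);
        if (!"MSCF".equals(new String(signature, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("missing MSCF signature");
        }

        buf.getInt();                                  // reserved1
        long cabinetSize = buf.getInt() & 0xFFFFFFFFL; // unsigned 32-bit
        buf.getInt();                                  // reserved2
        long filesOffset = buf.getInt() & 0xFFFFFFFFL; // unsigned 32-bit
        buf.getInt();                                  // reserved3
        int versionMinor = buf.get() & 0xFF;
        int versionMajor = buf.get() & 0xFF;
        int folderCount = buf.getShort() & 0xFFFF;
        int fileCount = buf.getShort() & 0xFFFF;
        int flags = buf.getShort() & 0xFFFF;
        // setID and iCabinet (2 bytes each) follow; skipped in this sketch.

        return new CabHeaderSketch(cabinetSize, filesOffset, versionMinor,
                versionMajor, folderCount, fileCount, flags);
    }
}
```

A real extractor would go on to read the folder and file entries that follow the header; this only shows that the container itself is simple to parse.
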
MS Cab data can be compressed with the Quantum, Deflate, or LZX algorithms. The LZX used in MS Cab is a modified variant, and the existing documentation doesn't describe these modifications well. Some of them are covered in the [MS-PATCH] document, although the LZX described there also differs from the one used in MS Cab... The use of Deflate is described in the [MS-MCI] document...

On Tue, Feb 7, 2012 at 1:04 PM, Nick Burch <[email protected]> wrote:
> On Tue, 7 Feb 2012, Jan Høydahl wrote:
>>
>> Would it be possible to add support to extract the proprietary MS .CAB
>> archive format? I cannot find any Java-based extractors out there but there
>> exists one in C.
>
>
> You'd need to read either the file format docs, or the C source code to
> understand the format (whichever is easier), then use that to write Java
> code for it. I think you should be able to find existing Java code to handle
> DEFLATE (in Java itself or Commons Compress) and LZX (in POI), not sure
> about Quantum.
>
> Alternately, if you have command line tools to read the format, you may be
> able to use that from Tika. However, that'd need a bit of work, as the Tika
> external parsers support doesn't currently handle embedded resources
>
> Nick

--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott
