RE: How to extract embedded files from Office 07

Dmitry Goldenberg Thu, 28 Aug 2008 17:31:19 -0700

Rainer,
You were right, the .bin file in /embeddings is Ole and can be read with POIFS.
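
For context, here is a minimal sketch of how the .bin parts could be located before handing them to POIFS. It assumes the Office 2007 file is an ordinary ZIP package with embedded objects stored under an embeddings/ folder (for example word/embeddings/), and that an OLE2 file starts with the signature bytes d0 cf 11 e0 a1 b1 1a e1 quoted further down in this thread. The names EmbeddingScan and findOleEmbeddings are made up for illustration and are not part of POI; only java.util.zip is used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not a POI class.
class EmbeddingScan {

    // OLE2 compound-document signature (the "magic number" from this thread).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /**
     * Returns the names of ZIP entries under an "embeddings/" folder
     * whose first 8 bytes match the OLE2 signature.
     */
    static List<String> findOleEmbeddings(InputStream packageStream) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(packageStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: embedded parts live in <app>/embeddings/.
                if (entry.isDirectory() || !entry.getName().contains("/embeddings/")) {
                    continue;
                }
                // Read the first 8 bytes; read() may return fewer than requested.
                byte[] head = new byte[OLE2_MAGIC.length];
                int total = 0;
                while (total < head.length) {
                    int n = zip.read(head, total, head.length - total);
                    if (n < 0) {
                        break;
                    }
                    total += n;
                }
                if (total == head.length && Arrays.equals(head, OLE2_MAGIC)) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }
}
```

The entries this returns are the ones worth opening with POIFS; anything else in the folder is skipped.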

The gotcha is that POIFS currently has no API to extract the embedded file 
out of the Ole structures.

HSLF has an API to enumerate Ole objects within slides. But what I need is a 
generic API that would let me do the following:

List<Embedding> embeddings = poifs.getEmbeddings();
for (Embedding embedding : embeddings) {
    System.out.println(">> Embedding: " + embedding.getName());
    // outputDir is the target directory (a File or a String path)
    File target = new File(outputDir, Utils.getCleanFileName(embedding.getName()));
    embedding.extractTo(new FileOutputStream(target));
}

getEmbeddings() could be getOleObjects() or whatever, but that's the gist of 
it.

- Dmitry

-----Original Message-----
From: Rainer Schwarze [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 28, 2008 6:49 PM
To: POI Users List
Subject: Re: How to extract embedded files from Office 07

Dmitry Goldenberg wrote:
> Yegor,
>
> The first 8 bytes contain the standard MS Office magic number stuff - d0 cf 
> 11 e0 a1 b1 1a e1.
>
> Seems like they compress data in a proprietary way. I've read one post where 
> someone recommended the .NET Packaging API to crack these ...  Not a good 
> option ...

Hi Dmitry,

this may be interesting (unless you already found it):

http://www.nabble.com/Can-POIFS-convert-PDF-to-OLE-td18568081.html

Looking at such things I suspect this:

The data is inside "Ole10Native". This could be extracted using POIFS.
The structures there look like this:

[4 bytes] = size of structure including data
[???]     = a few flags and strings (zero-terminated)
[4 bytes] = size of the actual embedded binary data
[???]     = the actual binary data

If you know that it is a ZIP file, you could search for the byte sequence
[size]"PK", where the expected [size] depends on the search position.
Assume you start immediately after the first 4 bytes, which hold the total
length; then the expected size value is length-4. Step forward by one byte
and check for the sequence with size set to length-5, and so on. When the
6 bytes match the expected [size]PK sequence, you can be fairly sure that
"PK" marks the start of the ZIP file and [size] is its size.
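
The search described above can be sketched roughly as follows. This only illustrates the heuristic and is not a parser for the real Ole10Native layout. It assumes that the 4-byte fields are little-endian, that the leading length does not count its own 4 bytes, and that the embedded data runs to the end of the structure (which is what the length-4, length-5 arithmetic implies). The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not a POI class.
class Ole10NativeZipScan {

    /**
     * Scans the raw bytes of an Ole10Native stream for a 4-byte size field
     * followed by "PK", where the size equals (total length - position).
     * Returns the offset of the 'P', or -1 if nothing matches.
     */
    static int findZipStart(byte[] stream) {
        if (stream.length < 10) {
            return -1; // too short for length + size + "PK"
        }
        // Assumption: integers in the stream are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        int length = buf.getInt(0); // total length from the first 4 bytes

        // Start right after the total length and slide forward one byte at a time.
        for (int pos = 4; pos + 6 <= stream.length; pos++) {
            int expectedSize = length - pos; // length-4, length-5, ...
            if (expectedSize <= 0) {
                break; // no room left for any data
            }
            if (buf.getInt(pos) == expectedSize
                    && stream[pos + 4] == 'P'
                    && stream[pos + 5] == 'K') {
                return pos + 4; // ZIP data starts here, expectedSize bytes long
            }
        }
        return -1;
    }
}
```

If the method returns an offset p, the ZIP file would be the bytes from p to the end of the structure, and the 4 bytes just before p hold its size.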

Of course nothing beats the analysis of the actual binary data structure
:-) (Would this be worth the effort for your purpose?)

Best wishes, Rainer
--

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

