In the specified file the watermark is not text but an OfficeDrawing
shape anchored to the header document part. See the following test from
POI trunk:
public void testWatermark() throws UnsupportedEncodingException
{
    HWPFDocument hwpfDocument = HWPFTestDataSamples
            .openSampleFile( "watermark.doc" );
    // Take the first drawing anchored in the header document part
    OfficeDrawing drawing = hwpfDocument.getOfficeDrawingsHeaders()
            .getOfficeDrawings().iterator().next();
    EscherContainerRecord escherContainerRecord = drawing
            .getOfficeArtSpContainer();
    // 0xF00B is the record id of the shape's EscherOptRecord
    EscherOptRecord officeArtFOPT = escherContainerRecord
            .getChildById( (short) 0xF00B );
    // 0x00c0 (192) is the geotext.unicode property holding the text
    EscherComplexProperty gtextUNICODE = (EscherComplexProperty) officeArtFOPT
            .lookup( 0x00c0 );
    // The data is UTF-16LE and still contains its trailing NUL
    String text = new String( gtextUNICODE.getComplexData(), "UTF-16LE" );
    assertEquals( "DRAFT CONTRACT\0", text );
}
Adding text extracted this way to the document metadata relies on too
many assumptions:
- we assume there is only one header (i.e. a single page structure, no
even/odd pages, no first/last pages, etc.)
- we assume the first office art shape is actually the watermark
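Note also the trailing \0 in the asserted string: the geotext.unicode
data is stored as UTF-16LE together with its NUL terminator, so the
terminator has to be stripped before the text is used as metadata. A
minimal sketch of that step in plain Java, using the bytes shown for
geotext.unicode in the HWPFLister dump of watermark.doc. The class and
method names (GeoTextDecoder, decode) are hypothetical and not part of
POI:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI: turns the raw complex data of a
// geotext.unicode property into a clean Java string.
final class GeoTextDecoder {
    private GeoTextDecoder() {
    }

    /** Decodes UTF-16LE bytes and drops any trailing NUL terminators. */
    static String decode(byte[] complexData) {
        String text = new String(complexData, StandardCharsets.UTF_16LE);
        int end = text.length();
        while (end > 0 && text.charAt(end - 1) == '\0') {
            end--;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        // Bytes copied from propNum 192 (geotext.unicode) in the dump.
        byte[] data = {
            0x44, 0x00, 0x52, 0x00, 0x41, 0x00, 0x46, 0x00, 0x54, 0x00,
            0x20, 0x00, 0x43, 0x00, 0x4F, 0x00, 0x4E, 0x00, 0x54, 0x00,
            0x52, 0x00, 0x41, 0x00, 0x43, 0x00, 0x54, 0x00, 0x00, 0x00
        };
        System.out.println(decode(data)); // prints "DRAFT CONTRACT"
    }
}
```

In the test above the same result would come from passing
gtextUNICODE.getComplexData() to decode().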
For your information, below are excerpts from the document dump produced
by HWPFLister:
HWPFLister watermark.doc --escher --officeDrawings
== ESCHER PROPERTIES (rebuilded) ==
org.apache.poi.ddf.EscherContainerRecord (DggContainer):
isContainer: true
options: 0x000F
recordId: 0xF000
numchildren: 2
children:
Child 0:
org.apache.poi.ddf.EscherDggRecord:
RecordId: 0xF006
Options: 0x0000
ShapeIdMax: 2050
NumIdClusters: 3
NumShapesSaved: 3
DrawingsSaved: 2
DrawingGroupId1: 1
NumShapeIdsUsed1: 2
DrawingGroupId2: 2
NumShapeIdsUsed2: 2
Child 1:
org.apache.poi.ddf.EscherSplitMenuColorsRecord:
RecordId: 0xF11E
Options: 0x0040
Color1: 0x08000004
Color2: 0x08000001
Color3: 0x08000002
Color4: 0x100000F7
org.apache.poi.ddf.EscherContainerRecord (DgContainer):
isContainer: true
options: 0x000F
recordId: 0xF002
numchildren: 2
children:
Child 0:
org.apache.poi.ddf.EscherDgRecord:
RecordId: 0xF008
Options: 0x0010
NumShapes: 2
LastMSOSPID: 1025
Child 1:
org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
isContainer: true
options: 0x000F
recordId: 0xF003
numchildren: 2
children:
Child 0:
org.apache.poi.ddf.EscherContainerRecord (SpContainer):
isContainer: true
options: 0x000F
recordId: 0xF004
numchildren: 2
children:
Child 0:
org.apache.poi.ddf.EscherSpgrRecord:
RecordId: 0xF009
Options: 0x0001
RectX: 0
RectY: 0
RectWidth: -32767
RectHeight: -32767
Child 1:
org.apache.poi.ddf.EscherSpRecord:
RecordId: 0xF00A
Options: 0x0002
ShapeId: 1024
Flags: GROUP|PATRIARCH (0x00000005)
Child 1:
org.apache.poi.ddf.EscherContainerRecord (SpContainer):
isContainer: true
options: 0x000F
recordId: 0xF004
numchildren: 3
children:
Child 0:
org.apache.poi.ddf.EscherSpRecord:
RecordId: 0xF00A
Options: 0x0882
ShapeId: 1025
Flags: HAVEANCHOR|HASSHAPETYPE (0x00000A00)
Child 1:
org.apache.poi.ddf.EscherOptRecord:
isContainer: false
options: 0x0143
recordId: 0xF00B
numchildren: 0
properties:
propNum: 4, RAW: 0x0004, propName: transform.rotation, complex: false, blipId: false, value: 20643840 (0x013B0000)
propNum: 133, RAW: 0x0085, propName: text.wraptext, complex: false, blipId: false, value: 2 (0x00000002)
propNum: 135, RAW: 0x0087, propName: text.anchortext, complex: false, blipId: false, value: 1 (0x00000001)
propNum: 192, propName: geotext.unicode, complex: true, blipId: true, data:
00: 44, 00, 52, 00, 41, 00, 46, 00, 54, 00, 20, 00, 43, 00, 4F, 00, 4E, 00, 54, 00, 52, 00, 41, 00, 43, 00, 54, 00, 00, 00,
propNum: 197, propName: geotext.fontfamilyname, complex: true, blipId: true, data:
00: 43, 00, 61, 00, 6C, 00, 69, 00, 62, 00, 72, 00, 69, 00, 00, 00,
propNum: 255, RAW: 0x00FF, propName: geotext.strikethroughfont, complex: false, blipId: false, value: -47872 (0xFFFF4500)
propNum: 327, RAW: 0x0147, propName: geometry.adjustvalue, complex: false, blipId: false, value: 10800 (0x00002A30)
propNum: 383, RAW: 0x017F, propName: geometry.fillok, complex: false, blipId: false, value: 262205 (0x0004003D)
propNum: 384, RAW: 0x0180, propName: fill.filltype, complex: false, blipId: false, value: 0 (0x00000000)
propNum: 385, RAW: 0x0181, propName: fill.fillcolor, complex: false, blipId: false, value: 0 (0x00000000)
propNum: 386, RAW: 0x0182, propName: fill.fillopacity, complex: false, blipId: false, value: 32768 (0x00008000)
propNum: 387, RAW: 0x0183, propName: fill.fillbackcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
propNum: 447, RAW: 0x01BF, propName: fill.nofillhittest, complex: false, blipId: false, value: 1048592 (0x00100010)
propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 0 (0x00000000)
propNum: 450, RAW: 0x01C2, propName: linestyle.backcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
propNum: 470, RAW: 0x01D6, propName: linestyle.linejoinstyle, complex: false, blipId: false, value: 2 (0x00000002)
propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 589824 (0x00090000)
propNum: 575, RAW: 0x023F, propName: shadowstyle.shadowobsured, complex: false, blipId: false, value: 131072 (0x00020000)
propNum: 896, propName: groupshape.shapename, complex: true, blipId: true, data:
00: 50, 00, 6F, 00, 77, 00, 65, 00, 72, 00, 50, 00, 6C, 00, 75, 00, 73, 00, 57, 00, 61, 00, 74, 00, 65, 00, 72, 00, 4D, 00, 61, 00,
32: 72, 00, 6B, 00, 4F, 00, 62, 00, 6A, 00, 65, 00, 63, 00, 74, 00, 35, 00, 31, 00, 36, 00, 31, 00, 30, 00, 31, 00, 31, 00, 30, 00,
64: 34, 00, 00, 00,
propNum: 959, RAW: 0x03BF, propName: groupshape.print, complex: false, blipId: false, value: 2097184 (0x00200020)
Child 2:
org.apache.poi.ddf.EscherTertiaryOptRecord:
isContainer: false
options: 0x0043
recordId: 0xF122
numchildren: 0
properties:
propNum: 911, RAW: 0x038F, propName: groupshape.posh, complex: false, blipId: false, value: 2 (0x00000002)
propNum: 912, RAW: 0x0390, propName: groupshape.posrelh, complex: false, blipId: false, value: 0 (0x00000000)
propNum: 913, RAW: 0x0391, propName: groupshape.posv, complex: false, blipId: false, value: 2 (0x00000002)
propNum: 914, RAW: 0x0392, propName: groupshape.posrelv, complex: false, blipId: false, value: 0 (0x00000000)
org.apache.poi.ddf.EscherContainerRecord (DgContainer):
isContainer: true
options: 0x000F
recordId: 0xF002
numchildren: 3
children:
Child 0:
org.apache.poi.ddf.EscherDgRecord:
RecordId: 0xF008
Options: 0x0020
NumShapes: 1
LastMSOSPID: 2049
Child 1:
org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
isContainer: true
options: 0x000F
recordId: 0xF003
numchildren: 1
children:
Child 0:
org.apache.poi.ddf.EscherContainerRecord (SpContainer):
isContainer: true
options: 0x000F
recordId: 0xF004
numchildren: 2
children:
Child 0:
org.apache.poi.ddf.EscherSpgrRecord:
RecordId: 0xF009
Options: 0x0001
RectX: 0
RectY: 0
RectWidth: -32767
RectHeight: -32767
Child 1:
org.apache.poi.ddf.EscherSpRecord:
RecordId: 0xF00A
Options: 0x0002
ShapeId: 2048
Flags: GROUP|PATRIARCH (0x00000005)
Child 2:
org.apache.poi.ddf.EscherContainerRecord (SpContainer):
isContainer: true
options: 0x000F
recordId: 0xF004
numchildren: 3
children:
Child 0:
org.apache.poi.ddf.EscherSpRecord:
RecordId: 0xF00A
Options: 0x0012
ShapeId: 2049
Flags: HAVEANCHOR|BACKGROUND|HASSHAPETYPE (0x00000E00)
Child 1:
org.apache.poi.ddf.EscherOptRecord:
isContainer: false
options: 0x0043
recordId: 0xF00B
numchildren: 0
properties:
propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 134217729 (0x08000001)
propNum: 459, RAW: 0x01CB, propName: linestyle.linewidth, complex: false, blipId: false, value: 0 (0x00000000)
propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 524296 (0x00080008)
propNum: 513, RAW: 0x0201, propName: shadowstyle.color, complex: false, blipId: false, value: 134217730 (0x08000002)
Child 2:
org.apache.poi.ddf.EscherClientDataRecord:
RecordId: 0xF011
Options: 0x0000
Extra Data:
00000000 01 00 00 00 ....
== OFFICE DRAWINGS (rebuilded) ==
=== Document part: HEADER ===
OfficeDrawingImpl: [FSPA]
.spid = (1025 )
.xaLeft = (14 )
.yaTop = (2309 )
.xaRight = (9346 )
.yaBottom = (11640 )
.flags = (16500 )
.fHdr = false
.bx = 2
.by = 2
.wr = 3
.wrk = 0
.fRcaSimple = false
.fBelowText = true
.fAnchorLock = false
.cTxbx = (0 )
[/FSPA]
=== Document part: MAIN ===
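
A side note on reading the raw values in the dump above: as far as I
understand the Office drawing format, transform.rotation and
fill.fillopacity are stored as 16.16 fixed-point numbers. Under that
assumption the shape is rotated by 315 degrees and filled at 50%
opacity, which fits a diagonal, semi-transparent watermark. A small
sketch of the conversion (EscherFixedPoint and toDouble are made-up
names, not POI API):

```java
// Hypothetical helper, not part of POI. It assumes the property value is
// a signed 16.16 fixed-point number: the high 16 bits are the integer
// part and the low 16 bits are the fraction.
final class EscherFixedPoint {
    private EscherFixedPoint() {
    }

    /** Converts a raw 16.16 fixed-point property value to a double. */
    static double toDouble(int raw) {
        return raw / 65536.0;
    }

    public static void main(String[] args) {
        // transform.rotation, value: 20643840 (0x013B0000)
        System.out.println(toDouble(0x013B0000)); // prints 315.0
        // fill.fillopacity, value: 32768 (0x00008000)
        System.out.println(toDouble(0x00008000)); // prints 0.5
    }
}
```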
--
Best regards,
Sergey
On Tue, Aug 23, 2011 at 5:45 PM, Julien Nioche
<[email protected]> wrote:
> Created https://issues.apache.org/jira/browse/TIKA-696 to track the issue.
>
> Can't see the watermark when saving and reopening the doc in the .docx
> format; I have attached the .doc example
>
> Thanks
>
> Julien
>
> On 23 August 2011 14:06, Nick Burch <[email protected]> wrote:
>
>> On Tue, 23 Aug 2011, Julien Nioche wrote:
>>
>>> We definitely don't get them in Tika. See docs attached (saved with
>>> OpenOffice )
>>>
>>
>> It's probably worth putting these sample files on a tika issue so they
>> don't get lost, and can be used in a future unit test
>>
>> The next thing to check is probably to unzip the .docx file, and see where
>> the watermark text lives. If it's in the main document part then it should
>> be fairly easy to get for Tika. If it's in a different part, then a little
>> bit of support will likely be needed on the POI side to allow easier access
>> to it
>>
>>
>> Nick
>>
>>
>>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
--
Sergey Vladimirov
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]