In the specified file the watermark is not text but an OfficeDrawing shape
anchored to the header document part. See the following example from POI
trunk:

    public void testWatermark() throws UnsupportedEncodingException
    {
        HWPFDocument hwpfDocument = HWPFTestDataSamples
                .openSampleFile( "watermark.doc" );
        // Take the first drawing anchored in the header document part
        OfficeDrawing drawing = hwpfDocument.getOfficeDrawingsHeaders()
                .getOfficeDrawings().iterator().next();
        EscherContainerRecord escherContainerRecord = drawing
                .getOfficeArtSpContainer();

        // 0xF00B is the record id of the shape's EscherOptRecord
        EscherOptRecord officeArtFOPT = escherContainerRecord
                .getChildById( (short) 0xF00B );
        // Property 0x00c0 (192) is geotext.unicode, see the dump below
        EscherComplexProperty gtextUNICODE = (EscherComplexProperty) officeArtFOPT
                .lookup( 0x00c0 );

        // The data is UTF-16LE and includes a trailing NUL character
        String text = new String( gtextUNICODE.getComplexData(), "UTF-16LE" );
        assertEquals( "DRAFT CONTRACT\0", text );
    }
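
The test above compares against "DRAFT CONTRACT\0" because the property data
carries a terminating NUL. A consumer would probably want the text without it.
A minimal sketch of that step, using only the JDK; GeoTextDecoder is a
hypothetical helper name, not a POI class, and the bytes are copied from the
geotext.unicode property in the dump:

```java
import java.nio.charset.StandardCharsets;

final class GeoTextDecoder {
    /** Decodes UTF-16LE property data and drops trailing NUL characters. */
    static String decode(byte[] complexData) {
        String text = new String(complexData, StandardCharsets.UTF_16LE);
        int end = text.length();
        while (end > 0 && text.charAt(end - 1) == '\0') {
            end--;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        // Bytes of propNum 192 (geotext.unicode) as listed by HWPFLister.
        byte[] data = {
            0x44, 0, 0x52, 0, 0x41, 0, 0x46, 0, 0x54, 0, 0x20, 0, 0x43, 0,
            0x4F, 0, 0x4E, 0, 0x54, 0, 0x52, 0, 0x41, 0, 0x43, 0, 0x54, 0,
            0, 0
        };
        System.out.println(decode(data)); // prints "DRAFT CONTRACT"
    }
}
```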

Adding this text to the document metadata relies on too many assumptions:
 - we assume there is only one header (i.e. a single page structure, no
   even/odd pages, no first/last pages, etc.)
 - we assume the first office art shape is actually the watermark
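
One way to weaken the second assumption would be to look at every shape
rather than the first one, and keep only those whose name looks like a
watermark. In this file the groupshape.shapename property (propNum 896)
decodes to "PowerPlusWaterMarkObject516101104". The sketch below is only a
heuristic built on that observation: ShapeInfo and WatermarkHeuristic are
hypothetical names, and it is an assumption that other producers name their
watermark shapes in a similar way.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class WatermarkHeuristic {
    /** Hypothetical holder for two complex properties read from one shape. */
    static final class ShapeInfo {
        final byte[] shapeName; // groupshape.shapename (propNum 896), may be null
        final byte[] geoText;   // geotext.unicode (propNum 192), may be null

        ShapeInfo(byte[] shapeName, byte[] geoText) {
            this.shapeName = shapeName;
            this.geoText = geoText;
        }
    }

    /** Decodes NUL-terminated UTF-16LE data; returns "" for null input. */
    static String utf16(byte[] data) {
        if (data == null) {
            return "";
        }
        String s = new String(data, StandardCharsets.UTF_16LE);
        int nul = s.indexOf('\0');
        return nul >= 0 ? s.substring(0, nul) : s;
    }

    /** Collects the text of every shape whose name looks like a watermark. */
    static List<String> watermarkTexts(List<ShapeInfo> shapes) {
        List<String> result = new ArrayList<>();
        for (ShapeInfo shape : shapes) {
            // Assumption: watermark shapes carry "WaterMark" in their name.
            boolean named = utf16(shape.shapeName).contains("WaterMark");
            if (named && shape.geoText != null) {
                result.add(utf16(shape.geoText));
            }
        }
        return result;
    }
}
```

The ShapeInfo objects would have to be filled from each OfficeDrawing found in
the header part; that wiring is left out here on purpose.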

For your information, below are excerpts from the document dump produced by
HWPFLister:
    HWPFLister watermark.doc --escher --officeDrawings

== ESCHER PROPERTIES (rebuilded) ==
org.apache.poi.ddf.EscherContainerRecord (DggContainer):
  isContainer: true
  options: 0x000F
  recordId: 0xF000
  numchildren: 2
  children:
   Child 0:
    org.apache.poi.ddf.EscherDggRecord:
      RecordId: 0xF006
      Options: 0x0000
      ShapeIdMax: 2050
      NumIdClusters: 3
      NumShapesSaved: 3
      DrawingsSaved: 2
      DrawingGroupId1: 1
      NumShapeIdsUsed1: 2
      DrawingGroupId2: 2
      NumShapeIdsUsed2: 2

   Child 1:
    org.apache.poi.ddf.EscherSplitMenuColorsRecord:
      RecordId: 0xF11E
      Options: 0x0040
      Color1: 0x08000004
      Color2: 0x08000001
      Color3: 0x08000002
      Color4: 0x100000F7

org.apache.poi.ddf.EscherContainerRecord (DgContainer):
  isContainer: true
  options: 0x000F
  recordId: 0xF002
  numchildren: 2
  children:
   Child 0:
    org.apache.poi.ddf.EscherDgRecord:
      RecordId: 0xF008
      Options: 0x0010
      NumShapes: 2
      LastMSOSPID: 1025

   Child 1:
    org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
      isContainer: true
      options: 0x000F
      recordId: 0xF003
      numchildren: 2
      children:
       Child 0:
        org.apache.poi.ddf.EscherContainerRecord (SpContainer):
          isContainer: true
          options: 0x000F
          recordId: 0xF004
          numchildren: 2
          children:
           Child 0:
            org.apache.poi.ddf.EscherSpgrRecord:
              RecordId: 0xF009
              Options: 0x0001
              RectX: 0
              RectY: 0
              RectWidth: -32767
              RectHeight: -32767

           Child 1:
            org.apache.poi.ddf.EscherSpRecord:
              RecordId: 0xF00A
              Options: 0x0002
              ShapeId: 1024
              Flags: GROUP|PATRIARCH (0x00000005)


       Child 1:
        org.apache.poi.ddf.EscherContainerRecord (SpContainer):
          isContainer: true
          options: 0x000F
          recordId: 0xF004
          numchildren: 3
          children:
           Child 0:
            org.apache.poi.ddf.EscherSpRecord:
              RecordId: 0xF00A
              Options: 0x0882
              ShapeId: 1025
              Flags: HAVEANCHOR|HASSHAPETYPE (0x00000A00)

           Child 1:
            org.apache.poi.ddf.EscherOptRecord:
              isContainer: false
              options: 0x0143
              recordId: 0xF00B
              numchildren: 0
              properties:
                propNum: 4, RAW: 0x0004, propName: transform.rotation, complex: false, blipId: false, value: 20643840 (0x013B0000)
                propNum: 133, RAW: 0x0085, propName: text.wraptext, complex: false, blipId: false, value: 2 (0x00000002)
                propNum: 135, RAW: 0x0087, propName: text.anchortext, complex: false, blipId: false, value: 1 (0x00000001)
                propNum: 192, propName: geotext.unicode, complex: true, blipId: true, data:
            00: 44, 00, 52, 00, 41, 00, 46, 00, 54, 00, 20, 00, 43, 00, 4F, 00, 4E, 00, 54, 00, 52, 00, 41, 00, 43, 00, 54, 00, 00, 00,
                propNum: 197, propName: geotext.fontfamilyname, complex: true, blipId: true, data:
            00: 43, 00, 61, 00, 6C, 00, 69, 00, 62, 00, 72, 00, 69, 00, 00, 00,
                propNum: 255, RAW: 0x00FF, propName: geotext.strikethroughfont, complex: false, blipId: false, value: -47872 (0xFFFF4500)
                propNum: 327, RAW: 0x0147, propName: geometry.adjustvalue, complex: false, blipId: false, value: 10800 (0x00002A30)
                propNum: 383, RAW: 0x017F, propName: geometry.fillok, complex: false, blipId: false, value: 262205 (0x0004003D)
                propNum: 384, RAW: 0x0180, propName: fill.filltype, complex: false, blipId: false, value: 0 (0x00000000)
                propNum: 385, RAW: 0x0181, propName: fill.fillcolor, complex: false, blipId: false, value: 0 (0x00000000)
                propNum: 386, RAW: 0x0182, propName: fill.fillopacity, complex: false, blipId: false, value: 32768 (0x00008000)
                propNum: 387, RAW: 0x0183, propName: fill.fillbackcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
                propNum: 447, RAW: 0x01BF, propName: fill.nofillhittest, complex: false, blipId: false, value: 1048592 (0x00100010)
                propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 0 (0x00000000)
                propNum: 450, RAW: 0x01C2, propName: linestyle.backcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
                propNum: 470, RAW: 0x01D6, propName: linestyle.linejoinstyle, complex: false, blipId: false, value: 2 (0x00000002)
                propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 589824 (0x00090000)
                propNum: 575, RAW: 0x023F, propName: shadowstyle.shadowobsured, complex: false, blipId: false, value: 131072 (0x00020000)
                propNum: 896, propName: groupshape.shapename, complex: true, blipId: true, data:
            00: 50, 00, 6F, 00, 77, 00, 65, 00, 72, 00, 50, 00, 6C, 00, 75, 00, 73, 00, 57, 00, 61, 00, 74, 00, 65, 00, 72, 00, 4D, 00, 61, 00,
            32: 72, 00, 6B, 00, 4F, 00, 62, 00, 6A, 00, 65, 00, 63, 00, 74, 00, 35, 00, 31, 00, 36, 00, 31, 00, 30, 00, 31, 00, 31, 00, 30, 00,
            64: 34, 00, 00, 00,
                propNum: 959, RAW: 0x03BF, propName: groupshape.print, complex: false, blipId: false, value: 2097184 (0x00200020)

           Child 2:
            org.apache.poi.ddf.EscherTertiaryOptRecord:
              isContainer: false
              options: 0x0043
              recordId: 0xF122
              numchildren: 0
              properties:
                propNum: 911, RAW: 0x038F, propName: groupshape.posh, complex: false, blipId: false, value: 2 (0x00000002)
                propNum: 912, RAW: 0x0390, propName: groupshape.posrelh, complex: false, blipId: false, value: 0 (0x00000000)
                propNum: 913, RAW: 0x0391, propName: groupshape.posv, complex: false, blipId: false, value: 2 (0x00000002)
                propNum: 914, RAW: 0x0392, propName: groupshape.posrelv, complex: false, blipId: false, value: 0 (0x00000000)
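
A note on two of the raw values in the shape above: transform.rotation
(0x013B0000) and fill.fillopacity (0x00008000) are 16.16 fixed-point numbers,
so the upper 16 bits hold the integer part. The sketch below shows the
conversion; EscherFixedPoint is a hypothetical helper name, not a POI class.

```java
final class EscherFixedPoint {
    /** Converts a signed 16.16 fixed-point value to a double. */
    static double toDouble(int raw) {
        return raw / 65536.0;
    }

    public static void main(String[] args) {
        // transform.rotation from the dump, in degrees
        System.out.println(toDouble(0x013B0000)); // prints 315.0
        // fill.fillopacity from the dump, as a fraction of full opacity
        System.out.println(toDouble(0x00008000)); // prints 0.5
    }
}
```

So the shape is rotated by 315 degrees and filled at half opacity, which fits
a diagonal, semi-transparent text watermark.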



org.apache.poi.ddf.EscherContainerRecord (DgContainer):
  isContainer: true
  options: 0x000F
  recordId: 0xF002
  numchildren: 3
  children:
   Child 0:
    org.apache.poi.ddf.EscherDgRecord:
      RecordId: 0xF008
      Options: 0x0020
      NumShapes: 1
      LastMSOSPID: 2049

   Child 1:
    org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
      isContainer: true
      options: 0x000F
      recordId: 0xF003
      numchildren: 1
      children:
       Child 0:
        org.apache.poi.ddf.EscherContainerRecord (SpContainer):
          isContainer: true
          options: 0x000F
          recordId: 0xF004
          numchildren: 2
          children:
           Child 0:
            org.apache.poi.ddf.EscherSpgrRecord:
              RecordId: 0xF009
              Options: 0x0001
              RectX: 0
              RectY: 0
              RectWidth: -32767
              RectHeight: -32767

           Child 1:
            org.apache.poi.ddf.EscherSpRecord:
              RecordId: 0xF00A
              Options: 0x0002
              ShapeId: 2048
              Flags: GROUP|PATRIARCH (0x00000005)



   Child 2:
    org.apache.poi.ddf.EscherContainerRecord (SpContainer):
      isContainer: true
      options: 0x000F
      recordId: 0xF004
      numchildren: 3
      children:
       Child 0:
        org.apache.poi.ddf.EscherSpRecord:
          RecordId: 0xF00A
          Options: 0x0012
          ShapeId: 2049
          Flags: HAVEANCHOR|BACKGROUND|HASSHAPETYPE (0x00000E00)

       Child 1:
        org.apache.poi.ddf.EscherOptRecord:
          isContainer: false
          options: 0x0043
          recordId: 0xF00B
          numchildren: 0
          properties:
            propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 134217729 (0x08000001)
            propNum: 459, RAW: 0x01CB, propName: linestyle.linewidth, complex: false, blipId: false, value: 0 (0x00000000)
            propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 524296 (0x00080008)
            propNum: 513, RAW: 0x0201, propName: shadowstyle.color, complex: false, blipId: false, value: 134217730 (0x08000002)

       Child 2:
        org.apache.poi.ddf.EscherClientDataRecord:
          RecordId: 0xF011
          Options: 0x0000
          Extra Data:
        00000000 01 00 00 00                                     ....
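
The "options" field printed for each record above packs two values: the low
4 bits are the record version and the upper 12 bits are the record instance.
For option records the instance is the number of properties, which matches
the dump (0x0143 gives 20 properties, 0x0043 gives 4). A minimal sketch of the
split; EscherOptions is a hypothetical helper name, not a POI class:

```java
final class EscherOptions {
    /** Record version: the low 4 bits of the options field. */
    static int version(int options) {
        return options & 0x000F;
    }

    /** Record instance: the upper 12 bits of the options field. */
    static int instance(int options) {
        return (options >> 4) & 0x0FFF;
    }

    public static void main(String[] args) {
        // EscherOptRecord of the watermark shape: 20 properties
        System.out.println(instance(0x0143)); // prints 20
        // Containers in the dump use version 0xF
        System.out.println(version(0x000F));  // prints 15
    }
}
```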



== OFFICE DRAWINGS (rebuilded) ==
=== Document part: HEADER ===
OfficeDrawingImpl: [FSPA]
    .spid                 =  (1025 )
    .xaLeft               =  (14 )
    .yaTop                =  (2309 )
    .xaRight              =  (9346 )
    .yaBottom             =  (11640 )
    .flags                =  (16500 )
         .fHdr                     = false
         .bx                       = 2
         .by                       = 2
         .wr                       = 3
         .wrk                      = 0
         .fRcaSimple               = false
         .fBelowText               = true
         .fAnchorLock              = false
    .cTxbx                =  (0 )
[/FSPA]
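
The indented fields under .flags are bit fields of the 16-bit value 16500
(0x4074). The sketch below reproduces them, assuming the bit layout of the
Spa structure in the [MS-DOC] specification (fHdr: 1 bit, bx: 2, by: 2,
wr: 4, wrk: 4, then three 1-bit flags). FspaFlags is a hypothetical helper
name, not a POI class.

```java
final class FspaFlags {
    static boolean fHdr(int flags)        { return (flags & 0x0001) != 0; }
    static int     bx(int flags)          { return (flags >> 1) & 0x3; }
    static int     by(int flags)          { return (flags >> 3) & 0x3; }
    static int     wr(int flags)          { return (flags >> 5) & 0xF; }
    static int     wrk(int flags)         { return (flags >> 9) & 0xF; }
    static boolean fRcaSimple(int flags)  { return (flags & 0x2000) != 0; }
    static boolean fBelowText(int flags)  { return (flags & 0x4000) != 0; }
    static boolean fAnchorLock(int flags) { return (flags & 0x8000) != 0; }

    public static void main(String[] args) {
        int flags = 16500; // .flags value from the dump
        System.out.println("wr = " + wr(flags));                 // prints "wr = 3"
        System.out.println("fBelowText = " + fBelowText(flags)); // prints "fBelowText = true"
    }
}
```

fBelowText = true means the shape is drawn behind the text, which is what one
would expect from a watermark.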

=== Document part: MAIN ===

-- 
Best regards,
Sergey

On Tue, Aug 23, 2011 at 5:45 PM, Julien Nioche
<[email protected]> wrote:
> Created https://issues.apache.org/jira/browse/TIKA-696 to track the issue.
>
> Can't see the watermark when saving and reopening the doc in the .docx
> format. I have attached the .doc example.
>
> Thanks
>
> Julien
>
> On 23 August 2011 14:06, Nick Burch <[email protected]> wrote:
>
>> On Tue, 23 Aug 2011, Julien Nioche wrote:
>>
>>> We definitely don't get them in Tika. See docs attached (saved with
>>> OpenOffice )
>>>
>>
>> It's probably worth putting these sample files on a tika issue so they
>> don't get lost, and can be used in a future unit test
>>
>> The next thing to check is probably to unzip the .docx file, and see where
>> the watermark text lives. If it's in the main document part then it should
>> be fairly easy to get for Tika. If it's in a different part, then a little
>> bit of support will likely be needed on the POI side to allow easier access
>> to it
>>
>>
>> Nick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
Sergey Vladimirov
