Hi Sergey,
Sounds like we can't extract the watermarks in a generic way, then.
Thanks for your comments.

Julien

On 23 August 2011 16:40, Sergey Vladimirov <[email protected]> wrote:

> In the specified file the watermark is not text but an OfficeDrawing shape
> anchored to the header document part. See the following example from POI
> trunk:
>
>    public void testWatermark() throws UnsupportedEncodingException
>    {
>        HWPFDocument hwpfDocument = HWPFTestDataSamples
>                .openSampleFile( "watermark.doc" );
>        // Take the first drawing anchored in the header document part
>        OfficeDrawing drawing = hwpfDocument.getOfficeDrawingsHeaders()
>                .getOfficeDrawings().iterator().next();
>        EscherContainerRecord escherContainerRecord = drawing
>                .getOfficeArtSpContainer();
>
>        // 0xF00B is the record id of the shape's EscherOptRecord
>        EscherOptRecord officeArtFOPT = escherContainerRecord
>                .getChildById( (short) 0xF00B );
>        // Property 0x00c0 (192) is geotext.unicode, the watermark text
>        EscherComplexProperty gtextUNICODE = (EscherComplexProperty) officeArtFOPT
>                .lookup( 0x00c0 );
>
>        // The text is stored as NUL-terminated UTF-16LE
>        String text = new String( gtextUNICODE.getComplexData(), "UTF-16LE" );
>        assertEquals( "DRAFT CONTRACT\0", text );
>    }
>
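To make the decoding step concrete without needing POI on the classpath, here is a minimal sketch that turns the raw property bytes into a clean string. The class and method names are mine, not POI's, and stripping the trailing NUL is my own assumption about what a caller would want; the test above keeps it.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: decode an Escher complex string property (UTF-16LE, NUL-terminated). */
class GtextDecoder {

    /** Decodes UTF-16LE bytes and drops any trailing NUL terminators. */
    static String decode(byte[] complexData) {
        String text = new String(complexData, StandardCharsets.UTF_16LE);
        int end = text.length();
        while (end > 0 && text.charAt(end - 1) == '\0') {
            end--;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        // Bytes copied from the geotext.unicode property in the HWPFLister dump.
        byte[] data = { 0x44, 0, 0x52, 0, 0x41, 0, 0x46, 0, 0x54, 0, 0x20, 0,
                0x43, 0, 0x4F, 0, 0x4E, 0, 0x54, 0, 0x52, 0, 0x41, 0, 0x43, 0,
                0x54, 0, 0, 0 };
        System.out.println(decode(data)); // prints: DRAFT CONTRACT
    }
}
```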
> Adding this text to the document metadata relies on too many assumptions:
>  - we assume there is only one header (i.e. a single page structure, no
> even/odd pages, no first/last pages, etc.)
>  - we assume the first office art shape is actually the watermark
>
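One way to soften the second assumption might be to look at every shape and keep the ones whose name hints at a watermark, rather than taking the first. The sketch below works on plain strings only. It assumes the shape names have already been read (for example from groupshape.shapename, which decodes to "PowerPlusWaterMarkObject516101104" in this file), and the "contains watermark" rule is a guess based on this single sample, not a documented convention. "Picture 1" is a made-up name for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch: pick watermark candidates by shape name instead of by position. */
class WatermarkNameFilter {

    /** Hypothetical heuristic: the name contains "watermark", ignoring case. */
    static boolean looksLikeWatermark(String shapeName) {
        return shapeName != null
                && shapeName.toLowerCase(Locale.ROOT).contains("watermark");
    }

    /** Keeps every matching shape, so several headers can each contribute one. */
    static List<String> watermarkCandidates(List<String> shapeNames) {
        List<String> hits = new ArrayList<>();
        for (String name : shapeNames) {
            if (looksLikeWatermark(name)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Picture 1", "PowerPlusWaterMarkObject516101104");
        System.out.println(watermarkCandidates(names));
        // prints: [PowerPlusWaterMarkObject516101104]
    }
}
```

This still would not be generic: a watermark created by another tool may carry a different name or none at all.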
> For your information, below are excerpts from the document dump produced by
> HWPFLister:
>    HWPFLister watermark.doc --escher --officeDrawings
>
> == ESCHER PROPERTIES (rebuilded) ==
> org.apache.poi.ddf.EscherContainerRecord (DggContainer):
>  isContainer: true
>  options: 0x000F
>  recordId: 0xF000
>  numchildren: 2
>  children:
>   Child 0:
>    org.apache.poi.ddf.EscherDggRecord:
>      RecordId: 0xF006
>      Options: 0x0000
>      ShapeIdMax: 2050
>      NumIdClusters: 3
>      NumShapesSaved: 3
>      DrawingsSaved: 2
>      DrawingGroupId1: 1
>      NumShapeIdsUsed1: 2
>      DrawingGroupId2: 2
>      NumShapeIdsUsed2: 2
>
>   Child 1:
>    org.apache.poi.ddf.EscherSplitMenuColorsRecord:
>      RecordId: 0xF11E
>      Options: 0x0040
>      Color1: 0x08000004
>      Color2: 0x08000001
>      Color3: 0x08000002
>      Color4: 0x100000F7
>
> org.apache.poi.ddf.EscherContainerRecord (DgContainer):
>  isContainer: true
>  options: 0x000F
>  recordId: 0xF002
>  numchildren: 2
>  children:
>   Child 0:
>    org.apache.poi.ddf.EscherDgRecord:
>      RecordId: 0xF008
>      Options: 0x0010
>      NumShapes: 2
>      LastMSOSPID: 1025
>
>   Child 1:
>    org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
>      isContainer: true
>      options: 0x000F
>      recordId: 0xF003
>      numchildren: 2
>      children:
>       Child 0:
>        org.apache.poi.ddf.EscherContainerRecord (SpContainer):
>          isContainer: true
>          options: 0x000F
>          recordId: 0xF004
>          numchildren: 2
>          children:
>           Child 0:
>            org.apache.poi.ddf.EscherSpgrRecord:
>              RecordId: 0xF009
>              Options: 0x0001
>              RectX: 0
>              RectY: 0
>              RectWidth: -32767
>              RectHeight: -32767
>
>           Child 1:
>            org.apache.poi.ddf.EscherSpRecord:
>              RecordId: 0xF00A
>              Options: 0x0002
>              ShapeId: 1024
>              Flags: GROUP|PATRIARCH (0x00000005)
>
>
>       Child 1:
>        org.apache.poi.ddf.EscherContainerRecord (SpContainer):
>          isContainer: true
>          options: 0x000F
>          recordId: 0xF004
>          numchildren: 3
>          children:
>           Child 0:
>            org.apache.poi.ddf.EscherSpRecord:
>              RecordId: 0xF00A
>              Options: 0x0882
>              ShapeId: 1025
>              Flags: HAVEANCHOR|HASSHAPETYPE (0x00000A00)
>
>           Child 1:
>            org.apache.poi.ddf.EscherOptRecord:
>              isContainer: false
>              options: 0x0143
>              recordId: 0xF00B
>              numchildren: 0
>              properties:
>                propNum: 4, RAW: 0x0004, propName: transform.rotation, complex: false, blipId: false, value: 20643840 (0x013B0000)
>                propNum: 133, RAW: 0x0085, propName: text.wraptext, complex: false, blipId: false, value: 2 (0x00000002)
>                propNum: 135, RAW: 0x0087, propName: text.anchortext, complex: false, blipId: false, value: 1 (0x00000001)
>                propNum: 192, propName: geotext.unicode, complex: true, blipId: true, data:
>            00: 44, 00, 52, 00, 41, 00, 46, 00, 54, 00, 20, 00, 43, 00, 4F, 00, 4E, 00, 54, 00, 52, 00, 41, 00, 43, 00, 54, 00, 00, 00,
>                propNum: 197, propName: geotext.fontfamilyname, complex: true, blipId: true, data:
>            00: 43, 00, 61, 00, 6C, 00, 69, 00, 62, 00, 72, 00, 69, 00, 00, 00,
>                propNum: 255, RAW: 0x00FF, propName: geotext.strikethroughfont, complex: false, blipId: false, value: -47872 (0xFFFF4500)
>                propNum: 327, RAW: 0x0147, propName: geometry.adjustvalue, complex: false, blipId: false, value: 10800 (0x00002A30)
>                propNum: 383, RAW: 0x017F, propName: geometry.fillok, complex: false, blipId: false, value: 262205 (0x0004003D)
>                propNum: 384, RAW: 0x0180, propName: fill.filltype, complex: false, blipId: false, value: 0 (0x00000000)
>                propNum: 385, RAW: 0x0181, propName: fill.fillcolor, complex: false, blipId: false, value: 0 (0x00000000)
>                propNum: 386, RAW: 0x0182, propName: fill.fillopacity, complex: false, blipId: false, value: 32768 (0x00008000)
>                propNum: 387, RAW: 0x0183, propName: fill.fillbackcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
>                propNum: 447, RAW: 0x01BF, propName: fill.nofillhittest, complex: false, blipId: false, value: 1048592 (0x00100010)
>                propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 0 (0x00000000)
>                propNum: 450, RAW: 0x01C2, propName: linestyle.backcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
>                propNum: 470, RAW: 0x01D6, propName: linestyle.linejoinstyle, complex: false, blipId: false, value: 2 (0x00000002)
>                propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 589824 (0x00090000)
>                propNum: 575, RAW: 0x023F, propName: shadowstyle.shadowobsured, complex: false, blipId: false, value: 131072 (0x00020000)
>                propNum: 896, propName: groupshape.shapename, complex: true, blipId: true, data:
>            00: 50, 00, 6F, 00, 77, 00, 65, 00, 72, 00, 50, 00, 6C, 00, 75, 00, 73, 00, 57, 00, 61, 00, 74, 00, 65, 00, 72, 00, 4D, 00, 61, 00,
>            32: 72, 00, 6B, 00, 4F, 00, 62, 00, 6A, 00, 65, 00, 63, 00, 74, 00, 35, 00, 31, 00, 36, 00, 31, 00, 30, 00, 31, 00, 31, 00, 30, 00,
>            64: 34, 00, 00, 00,
>                propNum: 959, RAW: 0x03BF, propName: groupshape.print, complex: false, blipId: false, value: 2097184 (0x00200020)
>
>           Child 2:
>            org.apache.poi.ddf.EscherTertiaryOptRecord:
>              isContainer: false
>              options: 0x0043
>              recordId: 0xF122
>              numchildren: 0
>              properties:
>                propNum: 911, RAW: 0x038F, propName: groupshape.posh, complex: false, blipId: false, value: 2 (0x00000002)
>                propNum: 912, RAW: 0x0390, propName: groupshape.posrelh, complex: false, blipId: false, value: 0 (0x00000000)
>                propNum: 913, RAW: 0x0391, propName: groupshape.posv, complex: false, blipId: false, value: 2 (0x00000002)
>                propNum: 914, RAW: 0x0392, propName: groupshape.posrelv, complex: false, blipId: false, value: 0 (0x00000000)
>
>
>
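A note on reading two of the numeric properties in the shape above: as far as I understand the Escher format, transform.rotation and fill.fillopacity are 16.16 fixed-point values, so the raw integers need dividing by 65536. The helper below is a sketch under that assumption; its names are mine and it is not part of POI.

```java
/** Sketch: interpret Escher property values as 16.16 fixed-point numbers. */
class FixedPoint1616 {

    /** Upper 16 bits are the integer part, lower 16 bits the fraction. */
    static double toDouble(int raw) {
        return raw / 65536.0;
    }

    public static void main(String[] args) {
        // transform.rotation from the dump: 20643840 (0x013B0000)
        System.out.println(toDouble(20643840)); // prints: 315.0
        // fill.fillopacity from the dump: 32768 (0x00008000)
        System.out.println(toDouble(32768));    // prints: 0.5
    }
}
```

Read that way, the shape is rotated by 315 degrees and half transparent, which is consistent with a diagonal "DRAFT" style watermark, though neither value is proof on its own.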
> org.apache.poi.ddf.EscherContainerRecord (DgContainer):
>  isContainer: true
>  options: 0x000F
>  recordId: 0xF002
>  numchildren: 3
>  children:
>   Child 0:
>    org.apache.poi.ddf.EscherDgRecord:
>      RecordId: 0xF008
>      Options: 0x0020
>      NumShapes: 1
>      LastMSOSPID: 2049
>
>   Child 1:
>    org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
>      isContainer: true
>      options: 0x000F
>      recordId: 0xF003
>      numchildren: 1
>      children:
>       Child 0:
>        org.apache.poi.ddf.EscherContainerRecord (SpContainer):
>          isContainer: true
>          options: 0x000F
>          recordId: 0xF004
>          numchildren: 2
>          children:
>           Child 0:
>            org.apache.poi.ddf.EscherSpgrRecord:
>              RecordId: 0xF009
>              Options: 0x0001
>              RectX: 0
>              RectY: 0
>              RectWidth: -32767
>              RectHeight: -32767
>
>           Child 1:
>            org.apache.poi.ddf.EscherSpRecord:
>              RecordId: 0xF00A
>              Options: 0x0002
>              ShapeId: 2048
>              Flags: GROUP|PATRIARCH (0x00000005)
>
>
>
>   Child 2:
>    org.apache.poi.ddf.EscherContainerRecord (SpContainer):
>      isContainer: true
>      options: 0x000F
>      recordId: 0xF004
>      numchildren: 3
>      children:
>       Child 0:
>        org.apache.poi.ddf.EscherSpRecord:
>          RecordId: 0xF00A
>          Options: 0x0012
>          ShapeId: 2049
>          Flags: HAVEANCHOR|BACKGROUND|HASSHAPETYPE (0x00000E00)
>
>       Child 1:
>        org.apache.poi.ddf.EscherOptRecord:
>          isContainer: false
>          options: 0x0043
>          recordId: 0xF00B
>          numchildren: 0
>          properties:
>            propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 134217729 (0x08000001)
>            propNum: 459, RAW: 0x01CB, propName: linestyle.linewidth, complex: false, blipId: false, value: 0 (0x00000000)
>            propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 524296 (0x00080008)
>            propNum: 513, RAW: 0x0201, propName: shadowstyle.color, complex: false, blipId: false, value: 134217730 (0x08000002)
>
>       Child 2:
>        org.apache.poi.ddf.EscherClientDataRecord:
>          RecordId: 0xF011
>          Options: 0x0000
>          Extra Data:
>        00000000 01 00 00 00                                     ....
>
>
>
> == OFFICE DRAWINGS (rebuilded) ==
> === Document part: HEADER ===
> OfficeDrawingImpl: [FSPA]
>    .spid                 =  (1025 )
>    .xaLeft               =  (14 )
>    .yaTop                =  (2309 )
>    .xaRight              =  (9346 )
>    .yaBottom             =  (11640 )
>    .flags                =  (16500 )
>         .fHdr                     = false
>         .bx                       = 2
>         .by                       = 2
>         .wr                       = 3
>         .wrk                      = 0
>         .fRcaSimple               = false
>         .fBelowText               = true
>         .fAnchorLock              = false
>    .cTxbx                =  (0 )
> [/FSPA]
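For anyone wanting to use the FSPA anchor as an extra hint (a watermark normally sits behind the text), here is a rough sketch of how the .flags word can be unpacked. The bit layout is my assumption, taken from the order in which HWPFLister prints the fields and checked only against the value 16500 in this dump. The class is illustrative and not a POI type.

```java
/** Sketch: unpack the 16-bit FSPA flags word into the fields HWPFLister prints. */
class FspaFlags {
    final boolean fHdr;
    final int bx, by, wr, wrk;
    final boolean fRcaSimple, fBelowText, fAnchorLock;

    FspaFlags(int flags) {
        // Assumed layout, lowest bit first: 1 + 2 + 2 + 4 + 4 + 1 + 1 + 1 bits.
        fHdr = (flags & 0x1) != 0;
        bx = (flags >> 1) & 0x3;
        by = (flags >> 3) & 0x3;
        wr = (flags >> 5) & 0xF;
        wrk = (flags >> 9) & 0xF;
        fRcaSimple = ((flags >> 13) & 0x1) != 0;
        fBelowText = ((flags >> 14) & 0x1) != 0;
        fAnchorLock = ((flags >> 15) & 0x1) != 0;
    }

    public static void main(String[] args) {
        FspaFlags f = new FspaFlags(16500); // .flags value from the dump
        System.out.println("bx=" + f.bx + " by=" + f.by + " wr=" + f.wr
                + " wrk=" + f.wrk + " fBelowText=" + f.fBelowText);
        // prints: bx=2 by=2 wr=3 wrk=0 fBelowText=true
    }
}
```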
>
> === Document part: MAIN ===
>
> --
> Best regards,
> Sergey
>
> On Tue, Aug 23, 2011 at 5:45 PM, Julien Nioche
> <[email protected]> wrote:
> > Created https://issues.apache.org/jira/browse/TIKA-696 to track the
> > issue.
> >
> > I can't see the watermark after saving the doc in .docx format and
> > reopening it. I have attached the .doc example.
> >
> > Thanks
> >
> > Julien
> >
> > On 23 August 2011 14:06, Nick Burch <[email protected]> wrote:
> >
> >> On Tue, 23 Aug 2011, Julien Nioche wrote:
> >>
> >>> We definitely don't get them in Tika. See docs attached (saved with
> >>> OpenOffice).
> >>>
> >>
> >> It's probably worth putting these sample files on a Tika issue so they
> >> don't get lost, and can be used in a future unit test.
> >>
> >> The next thing to check is probably to unzip the .docx file, and see
> >> where the watermark text lives. If it's in the main document part then
> >> it should be fairly easy to get for Tika. If it's in a different part,
> >> then a little bit of support will likely be needed on the POI side to
> >> allow easier access to it.
> >>
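The check suggested above can be scripted with nothing but the JDK, since a .docx is a zip package. This is only a sketch: the class name and the plain substring search are my own choices, and it will miss text that is XML-escaped or split across several elements. It needs Java 9 or later for readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Sketch: report which parts of a .docx package contain a given string. */
class DocxPartSearch {

    /** Returns the names of all zip entries whose content contains the needle. */
    static List<String> partsContaining(String docxPath, String needle) throws IOException {
        List<String> hits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(docxPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    // Parts are treated as UTF-8 text; binary parts simply won't match.
                    String content = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    if (content.contains(needle)) {
                        hits.add(entry.getName());
                    }
                }
            }
        }
        return hits;
    }
}
```

Calling `DocxPartSearch.partsContaining("watermark.docx", "DRAFT CONTRACT")` would then list the matching part names (the file name here is a placeholder), which answers the "main document part or somewhere else" question directly.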
> >>
> >> Nick
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>
>
>
> --
> Sergey Vladimirov
>
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
