Hi Sergey,

Sounds like we can't extract the watermarks in a generic way then. Thanks for your comments.

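For anyone reading the dump quoted below: the geotext.unicode data bytes look like UTF-16LE with a trailing NUL, which would explain why the test expects "DRAFT CONTRACT\0". Here is a minimal sketch of decoding such a row with plain Java. WatermarkTextDecoder and its two methods are hypothetical names used only for illustration; they are not part of POI or Tika.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of POI or Tika. */
class WatermarkTextDecoder {

    /** Parses a dump row such as "44, 00, 52, 00," into raw bytes. */
    static byte[] parseHexDump(String dumpRow) {
        // Tokens are two-digit hex values separated by commas and/or whitespace.
        String[] tokens = dumpRow.trim().split("[,\\s]+");
        byte[] bytes = new byte[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
        }
        return bytes;
    }

    /** Decodes UTF-16LE property data and drops any trailing NUL terminators. */
    static String decode(byte[] complexData) {
        String text = new String(complexData, StandardCharsets.UTF_16LE);
        int end = text.length();
        while (end > 0 && text.charAt(end - 1) == '\0') {
            end--;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        // Bytes copied from the geotext.unicode row of the quoted dump.
        String dumpRow = "44, 00, 52, 00, 41, 00, 46, 00, 54, 00, 20, 00, 43, 00, 4F, 00, "
                + "4E, 00, 54, 00, 52, 00, 41, 00, 43, 00, 54, 00, 00, 00,";
        System.out.println(decode(parseHexDump(dumpRow)));
    }
}
```

This only covers the decoding step. Deciding which shape (if any) is the watermark still depends on the assumptions Sergey lists.
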
Julien

On 23 August 2011 16:40, Sergey Vladimirov <[email protected]> wrote:

> In the specified file the watermark is not text, but an OfficeDrawing shape
> anchored to the header document part. Check the following example from POI
> trunk:
>
> public void testWatermark() throws UnsupportedEncodingException
> {
>     HWPFDocument hwpfDocument = HWPFTestDataSamples
>             .openSampleFile( "watermark.doc" );
>     // take the first drawing anchored in the header document part
>     OfficeDrawing drawing = hwpfDocument.getOfficeDrawingsHeaders()
>             .getOfficeDrawings().iterator().next();
>     EscherContainerRecord escherContainerRecord = drawing
>             .getOfficeArtSpContainer();
>
>     // 0xF00B is the record ID of the shape's EscherOptRecord
>     EscherOptRecord officeArtFOPT = escherContainerRecord
>             .getChildById( (short) 0xF00B );
>     // 0x00c0 (192) is the geotext.unicode property
>     EscherComplexProperty gtextUNICODE = (EscherComplexProperty) officeArtFOPT
>             .lookup( 0x00c0 );
>
>     String text = new String( gtextUNICODE.getComplexData(), "UTF-16LE" );
>     assertEquals( "DRAFT CONTRACT\0", text );
> }
>
> Adding this text to the document metadata relies on too many assumptions:
> - we assume there is only one header (i.e. single page structure, no
>   even/odd pages, no first/last pages, etc.)
> - we assume the first office art is actually the watermark
>
> For your information, below are quotes from the document dump produced by HWPFLister:
> HWPFLister watermark.doc --escher --officeDrawings
>
> == ESCHER PROPERTIES (rebuilded) ==
> org.apache.poi.ddf.EscherContainerRecord (DggContainer):
>   isContainer: true
>   options: 0x000F
>   recordId: 0xF000
>   numchildren: 2
>   children:
>   Child 0:
>     org.apache.poi.ddf.EscherDggRecord:
>       RecordId: 0xF006
>       Options: 0x0000
>       ShapeIdMax: 2050
>       NumIdClusters: 3
>       NumShapesSaved: 3
>       DrawingsSaved: 2
>       DrawingGroupId1: 1
>       NumShapeIdsUsed1: 2
>       DrawingGroupId2: 2
>       NumShapeIdsUsed2: 2
>
>   Child 1:
>     org.apache.poi.ddf.EscherSplitMenuColorsRecord:
>       RecordId: 0xF11E
>       Options: 0x0040
>       Color1: 0x08000004
>       Color2: 0x08000001
>       Color3: 0x08000002
>       Color4: 0x100000F7
>
> org.apache.poi.ddf.EscherContainerRecord (DgContainer):
>   isContainer: true
>   options: 0x000F
>   recordId: 0xF002
>   numchildren: 2
>   children:
>   Child 0:
>     org.apache.poi.ddf.EscherDgRecord:
>       RecordId: 0xF008
>       Options: 0x0010
>       NumShapes: 2
>       LastMSOSPID: 1025
>
>   Child 1:
>     org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
>       isContainer: true
>       options: 0x000F
>       recordId: 0xF003
>       numchildren: 2
>       children:
>       Child 0:
>         org.apache.poi.ddf.EscherContainerRecord (SpContainer):
>           isContainer: true
>           options: 0x000F
>           recordId: 0xF004
>           numchildren: 2
>           children:
>           Child 0:
>             org.apache.poi.ddf.EscherSpgrRecord:
>               RecordId: 0xF009
>               Options: 0x0001
>               RectX: 0
>               RectY: 0
>               RectWidth: -32767
>               RectHeight: -32767
>
>           Child 1:
>             org.apache.poi.ddf.EscherSpRecord:
>               RecordId: 0xF00A
>               Options: 0x0002
>               ShapeId: 1024
>               Flags: GROUP|PATRIARCH (0x00000005)
>
>
>       Child 1:
>         org.apache.poi.ddf.EscherContainerRecord (SpContainer):
>           isContainer: true
>           options: 0x000F
>           recordId: 0xF004
>           numchildren: 3
>           children:
>           Child 0:
>             org.apache.poi.ddf.EscherSpRecord:
>               RecordId: 0xF00A
>               Options: 0x0882
>               ShapeId: 1025
>               Flags: HAVEANCHOR|HASSHAPETYPE (0x00000A00)
>
>           Child 1:
>             org.apache.poi.ddf.EscherOptRecord:
>               isContainer: false
>               options: 0x0143
>               recordId: 0xF00B
>               numchildren: 0
>               properties:
>                 propNum: 4, RAW: 0x0004, propName: transform.rotation, complex: false, blipId: false, value: 20643840 (0x013B0000)
>                 propNum: 133, RAW: 0x0085, propName: text.wraptext, complex: false, blipId: false, value: 2 (0x00000002)
>                 propNum: 135, RAW: 0x0087, propName: text.anchortext, complex: false, blipId: false, value: 1 (0x00000001)
>                 propNum: 192, propName: geotext.unicode, complex: true, blipId: true, data:
>                 00: 44, 00, 52, 00, 41, 00, 46, 00, 54, 00, 20, 00, 43, 00, 4F, 00, 4E, 00, 54, 00, 52, 00, 41, 00, 43, 00, 54, 00, 00, 00,
>                 propNum: 197, propName: geotext.fontfamilyname, complex: true, blipId: true, data:
>                 00: 43, 00, 61, 00, 6C, 00, 69, 00, 62, 00, 72, 00, 69, 00, 00, 00,
>                 propNum: 255, RAW: 0x00FF, propName: geotext.strikethroughfont, complex: false, blipId: false, value: -47872 (0xFFFF4500)
>                 propNum: 327, RAW: 0x0147, propName: geometry.adjustvalue, complex: false, blipId: false, value: 10800 (0x00002A30)
>                 propNum: 383, RAW: 0x017F, propName: geometry.fillok, complex: false, blipId: false, value: 262205 (0x0004003D)
>                 propNum: 384, RAW: 0x0180, propName: fill.filltype, complex: false, blipId: false, value: 0 (0x00000000)
>                 propNum: 385, RAW: 0x0181, propName: fill.fillcolor, complex: false, blipId: false, value: 0 (0x00000000)
>                 propNum: 386, RAW: 0x0182, propName: fill.fillopacity, complex: false, blipId: false, value: 32768 (0x00008000)
>                 propNum: 387, RAW: 0x0183, propName: fill.fillbackcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
>                 propNum: 447, RAW: 0x01BF, propName: fill.nofillhittest, complex: false, blipId: false, value: 1048592 (0x00100010)
>                 propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 0 (0x00000000)
>                 propNum: 450, RAW: 0x01C2, propName: linestyle.backcolor, complex: false, blipId: false, value: 16777215 (0x00FFFFFF)
>                 propNum: 470, RAW: 0x01D6, propName: linestyle.linejoinstyle, complex: false, blipId: false, value: 2 (0x00000002)
>                 propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 589824 (0x00090000)
>                 propNum: 575, RAW: 0x023F, propName: shadowstyle.shadowobsured, complex: false, blipId: false, value: 131072 (0x00020000)
>                 propNum: 896, propName: groupshape.shapename, complex: true, blipId: true, data:
>                 00: 50, 00, 6F, 00, 77, 00, 65, 00, 72, 00, 50, 00, 6C, 00, 75, 00, 73, 00, 57, 00, 61, 00, 74, 00, 65, 00, 72, 00, 4D, 00, 61, 00,
>                 32: 72, 00, 6B, 00, 4F, 00, 62, 00, 6A, 00, 65, 00, 63, 00, 74, 00, 35, 00, 31, 00, 36, 00, 31, 00, 30, 00, 31, 00, 31, 00, 30, 00,
>                 64: 34, 00, 00, 00,
>                 propNum: 959, RAW: 0x03BF, propName: groupshape.print, complex: false, blipId: false, value: 2097184 (0x00200020)
>
>           Child 2:
>             org.apache.poi.ddf.EscherTertiaryOptRecord:
>               isContainer: false
>               options: 0x0043
>               recordId: 0xF122
>               numchildren: 0
>               properties:
>                 propNum: 911, RAW: 0x038F, propName: groupshape.posh, complex: false, blipId: false, value: 2 (0x00000002)
>                 propNum: 912, RAW: 0x0390, propName: groupshape.posrelh, complex: false, blipId: false, value: 0 (0x00000000)
>                 propNum: 913, RAW: 0x0391, propName: groupshape.posv, complex: false, blipId: false, value: 2 (0x00000002)
>                 propNum: 914, RAW: 0x0392, propName: groupshape.posrelv, complex: false, blipId: false, value: 0 (0x00000000)
>
>
>
> org.apache.poi.ddf.EscherContainerRecord (DgContainer):
>   isContainer: true
>   options: 0x000F
>   recordId: 0xF002
>   numchildren: 3
>   children:
>   Child 0:
>     org.apache.poi.ddf.EscherDgRecord:
>       RecordId: 0xF008
>       Options: 0x0020
>       NumShapes: 1
>       LastMSOSPID: 2049
>
>   Child 1:
>     org.apache.poi.ddf.EscherContainerRecord (SpgrContainer):
>       isContainer: true
>       options: 0x000F
>       recordId: 0xF003
>       numchildren: 1
>       children:
>       Child 0:
>         org.apache.poi.ddf.EscherContainerRecord (SpContainer):
>           isContainer: true
>           options: 0x000F
>           recordId: 0xF004
>           numchildren: 2
>           children:
>           Child 0:
>             org.apache.poi.ddf.EscherSpgrRecord:
>               RecordId: 0xF009
>               Options: 0x0001
>               RectX: 0
>               RectY: 0
>               RectWidth: -32767
>               RectHeight: -32767
>
>           Child 1:
>             org.apache.poi.ddf.EscherSpRecord:
>               RecordId: 0xF00A
>               Options: 0x0002
>               ShapeId: 2048
>               Flags: GROUP|PATRIARCH (0x00000005)
>
>
>
>   Child 2:
>     org.apache.poi.ddf.EscherContainerRecord (SpContainer):
>       isContainer: true
>       options: 0x000F
>       recordId: 0xF004
>       numchildren: 3
>       children:
>       Child 0:
>         org.apache.poi.ddf.EscherSpRecord:
>           RecordId: 0xF00A
>           Options: 0x0012
>           ShapeId: 2049
>           Flags: HAVEANCHOR|BACKGROUND|HASSHAPETYPE (0x00000E00)
>
>       Child 1:
>         org.apache.poi.ddf.EscherOptRecord:
>           isContainer: false
>           options: 0x0043
>           recordId: 0xF00B
>           numchildren: 0
>           properties:
>             propNum: 448, RAW: 0x01C0, propName: linestyle.color, complex: false, blipId: false, value: 134217729 (0x08000001)
>             propNum: 459, RAW: 0x01CB, propName: linestyle.linewidth, complex: false, blipId: false, value: 0 (0x00000000)
>             propNum: 511, RAW: 0x01FF, propName: linestyle.nolinedrawdash, complex: false, blipId: false, value: 524296 (0x00080008)
>             propNum: 513, RAW: 0x0201, propName: shadowstyle.color, complex: false, blipId: false, value: 134217730 (0x08000002)
>
>       Child 2:
>         org.apache.poi.ddf.EscherClientDataRecord:
>           RecordId: 0xF011
>           Options: 0x0000
>           Extra Data:
>           00000000 01 00 00 00 ....
>
>
>
> == OFFICE DRAWINGS (rebuilded) ==
> === Document part: HEADER ===
> OfficeDrawingImpl: [FSPA]
>     .spid = (1025 )
>     .xaLeft = (14 )
>     .yaTop = (2309 )
>     .xaRight = (9346 )
>     .yaBottom = (11640 )
>     .flags = (16500 )
>         .fHdr = false
>         .bx = 2
>         .by = 2
>         .wr = 3
>         .wrk = 0
>         .fRcaSimple = false
>         .fBelowText = true
>         .fAnchorLock = false
>     .cTxbx = (0 )
> [/FSPA]
>
> === Document part: MAIN ===
>
> --
> Best regards,
> Sergey
>
> On Tue, Aug 23, 2011 at 5:45 PM, Julien Nioche
> <[email protected]> wrote:
> > Created https://issues.apache.org/jira/browse/TIKA-696 to track the issue.
> >
> > I can't see the watermark when saving and reopening the doc in .docx
> > format; I have attached the .doc example.
> >
> > Thanks
> >
> > Julien
> >
> > On 23 August 2011 14:06, Nick Burch <[email protected]> wrote:
> >
> >> On Tue, 23 Aug 2011, Julien Nioche wrote:
> >>
> >>> We definitely don't get them in Tika. See docs attached (saved with
> >>> OpenOffice)
> >>
> >> It's probably worth putting these sample files on a Tika issue so they
> >> don't get lost, and can be used in a future unit test.
> >>
> >> The next thing to check is probably to unzip the .docx file, and see where
> >> the watermark text lives. If it's in the main document part then it should
> >> be fairly easy to get for Tika. If it's in a different part, then a little
> >> bit of support will likely be needed on the POI side to allow easier access
> >> to it.
> >>
> >> Nick
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
>
> --
> Sergey Vladimirov

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
