On 03/30/2012 05:09 AM, J.Pietschmann wrote:
On 29.03.2012 01:24, Craig Ringer wrote:
I'd also like to have getEncodedName() return a byte[] not a
String, since an encoded PDF name isn't actually text data.
Sounds like a reasonable idea.
BTW, is there any reason Fop's PDF library uses java.lang.String when
working with sequences of PDF data bytes?
I'd chalk this up to historical reasons, as usual. Feel free to
provide a patch which cleans this up.
J.Pietschmann
Here's how I'd like to rewrite PDFName; untested code as an example of
what I'm getting at. This is just a standalone file; a patch that
incorporates it into the main sources will be a lot more work that I'm
holding off on until I know folks here agree with the approach.
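
To make the byte-oriented idea concrete, here is a minimal sketch of escaping a name over raw bytes instead of a String. It is only an illustration: the class and method names are hypothetical and not part of FOP, and I've assumed the encoded form includes the leading slash. The escaping rule is the one from the PDF spec's name syntax: any byte outside '!'..'~', any delimiter, and '#' itself is written as '#' plus two hex digits.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not FOP API): escapes raw name bytes using PDF name syntax. */
final class PdfNameEscapeSketch {
    // PDF delimiter characters, which may not appear literally inside a name.
    private static final String DELIMITERS = "()<>[]{}/%";
    private static final byte[] HEX = "0123456789ABCDEF".getBytes(StandardCharsets.US_ASCII);

    /** Returns the escaped name, including the leading '/' (an assumption of this sketch). */
    static byte[] escape(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length + 1);
        out.write('/');
        for (byte b : raw) {
            int c = b & 0xFF; // treat the byte as unsigned
            boolean regular = c >= 0x21 && c <= 0x7E
                    && c != '#' && DELIMITERS.indexOf(c) < 0;
            if (regular) {
                out.write(c);
            } else {
                // Anything else is written as '#' followed by two hex digits.
                out.write('#');
                out.write(HEX[c >> 4]);
                out.write(HEX[c & 0x0F]);
            }
        }
        return out.toByteArray();
    }
}
```

The point of the sketch is that nothing in it ever needs a charset after the initial String-to-bytes step, so there is no place for an accidental re-encoding to creep in.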
In any case, after reading more of the PDF library I'm rethinking the
wisdom of trying to make this change. The change itself is correct,
but it'll be really hard to safely integrate into the rest of the PDF
library because of the difficulty of auditing every site to ensure
nothing breaks. Java likes to call `toString()` automatically in places,
meaning that anywhere that doesn't use the proper PDFWritable output
methods PDFName inherits will break by producing bad PDF data that might
be quite hard to spot. I'd start by making PDFName.toString() throw (for
testing), but that'd only catch issues in code that test paths actually hit.
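
As a rough illustration of the hazard, and of the "make toString() throw" test aid, here is a small standalone sketch. StrictName is a hypothetical stand-in, not the real PDFName; it only shows that plain string concatenation invokes toString() implicitly, so a throwing toString() turns a silent bad-output bug into a loud failure on any code path that is actually exercised.

```java
/** Hypothetical demo (not FOP code) of catching implicit toString() calls. */
final class ThrowingNameSketch {

    /** Stand-in for a byte-based name whose toString() is disabled for testing. */
    static final class StrictName {
        private final byte[] bytes;

        StrictName(byte[] bytes) {
            this.bytes = bytes.clone();
        }

        byte[] getBytes() {
            return bytes.clone();
        }

        @Override
        public String toString() {
            throw new UnsupportedOperationException(
                    "use the byte-oriented output methods instead");
        }
    }

    /** Returns true if implicit string conversion tripped the guard. */
    static boolean concatenationThrows(StrictName name) {
        try {
            // String concatenation converts the operand via toString().
            String s = "prefix " + name;
            return s.isEmpty(); // not reached when toString() throws
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

The same thing happens with StringBuilder.append(Object), String.valueOf(Object) and "%s" in String.format, which is why auditing call sites by eye is so unreliable.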
Given the number of these kinds of issues in FOP's PDF library I'm more
and more inclined to wonder if it should just be replaced with PDFBox.
It's *full* of text encoding issues, it crams 8-bit binary data into the
lower 8 bits of Unicode strings, etc. Most of the classes that extend
basics like PDFDictionary act like the base class isn't public API and
break if anyone else changes the dictionary in ways they don't expect,
too; they should have-a PDFDictionary not be-a PDFDictionary really.
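
To show why stuffing binary data into the lower 8 bits of a String is fragile, here is a small sketch; the class and method names are made up for illustration. The bytes survive only if every conversion on the way out uses ISO-8859-1. As soon as any layer encodes the same String with another charset such as UTF-8, every byte above 0x7F turns into two bytes and the PDF data is corrupted.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo (not FOP code) of binary data smuggled through a String. */
final class BytesInStringSketch {

    /** Packs each byte into the low 8 bits of one char. */
    static String pack(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original bytes, but only because the charset matches pack(). */
    static byte[] unpackLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** What happens when some other layer encodes the same String as UTF-8. */
    static byte[] unpackUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, packing the three bytes 0x25 0xE2 0xFF and unpacking with ISO-8859-1 gives the same three bytes back, while the UTF-8 path yields five bytes. Keeping such data in a byte[] (or a ByteString wrapper) removes the question entirely.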
PDFBox is far from perfect, but it has a clean separation between the
model classes (PD) and the basic PDF data types (COSxxx); it has clean
equivalents of PDFName, PDFString, etc (COSName, COSString); it has a
good PDF parser already, etc.
Maybe it'd be easier for me to whip up a port of FOP's PDF output code
to PDFBox? I suspect I'm insane to mention the possibility of doing that
without evaluating the amount of work involved first, so I'm not
promising anything, but by the looks of it, it might be easier than doing
the cleanups I'd like to do in FOP.
Thoughts?
--
Craig Ringer
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the License); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/* $Id$ */
package org.apache.fop.pdf;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.Serializable;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.util.*;
import org.apache.commons.io.output.CountingOutputStream;
/**
* Class representing a PDF name object.
*
*/
public class PDFName extends PDFObject {
private static final Map<ByteString, PDFName> commonNames;
private final ByteString unescapedName;
private ByteString escapedName;
/**
* Creates a new PDF name object from a Unicode java string,
* encoding the name as UTF-8.
*
* @param name the name value
*/
public PDFName(String name) {
super();
this.unescapedName = new ByteString(name.getBytes(java.nio.charset.StandardCharsets.UTF_8));
}
/**
* Creates a new PDF name object from a sequence of bytes
* in no particular encoding.
*
* By PDF convention you should use UTF-8 when encoding names
* (as is done by the String-based PDFName constructor), but this
* is NOT required by the spec.
*/
private PDFName(ByteString name) {
super();
this.unescapedName = name;
}
/**
* Create a PDFName with a pre-escaped name supplied. This is mostly useful
* when defining names from data parsed from PDF data, or when allocating
* pre-cached names.
*
* @param unescapedName Name with PDF name escapes decoded
* @param escapedName Name encoded with PDF escapes
*/
private PDFName(ByteString unescapedName, ByteString escapedName) {
this.unescapedName =