RE: [EXT] Re: How to replace multi character delimiter with ASCII 001

Peter Wicks (pwicks) Wed, 06 Nov 2019 20:33:00 -0800

Shawn,

We had the same issue, and use special and multi character delimiters here. I 
have not been able to find a CSV library that supports multi-character 
delimiters, otherwise I would have updated the CSV Record Reader to support it. 
I created a special Record Reader that supports multi-character delimiters. We 
use this in Convert Record to convert to a different format as soon as possible 
😊.  I don’t know if your up for using custom code… But just in case you are, 
here is my personal implementation that we use in house.


Thanks,
  Peter

--Class 1--

public class CSVReader extends AbstractControllerService implements 
RecordReaderFactory {
    static final PropertyDescriptor COLUMN_DELIMITER = new 
PropertyDescriptor.Builder()
            .name("pt-column-delimiter")
            .displayName("Column Delimiter")
            .description("The character(s) to use to separate columns of data. 
Special characters like metacharacter should use the '\\\\' notation. If not 
specified Ctrl+A delimiter is used.")
            .required(false)
            .defaultValue("\\u0001")
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();
    static final PropertyDescriptor RECORD_DELIMITER = new 
PropertyDescriptor.Builder()
            .name("pt-record-delimiter")
            .displayName("Record Delimiter")
            .description("The character(s) to use to separate rows of data. For 
line return press 'Shift+Enter' in this field. Special characters should use 
the '\\u' notation.")
            .required(false)
            .defaultValue("\n")
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();
    static final PropertyDescriptor SKIP_HEADER_ROW = new 
PropertyDescriptor.Builder()
            .name("pt-skip-header")
            .displayName("Skip First Row")
            .description("Specifies whether or not the first row of data will 
be skipped.")
            .allowableValues("true", "false")
            .defaultValue("true")
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();

    private volatile String colDelimiter;
    private volatile String recDelimiter;
   private volatile boolean skipHeader;

    @OnEnabled
    public void storeCsvFormat(final ConfigurationContext context) {
        this.colDelimiter = 
StringEscapeUtils.unescapeJava(context.getProperty(COLUMN_DELIMITER).getValue());
        this.recDelimiter = 
StringEscapeUtils.unescapeJava(context.getProperty(RECORD_DELIMITER).getValue());
        this.skipHeader = context.getProperty(SKIP_HEADER_ROW).asBoolean();
    }

    @Override
    protected List<PropertyDescriptor> getSupportedPropertyDescriptors() {
        List<PropertyDescriptor> propertyDescriptors = new ArrayList<>();
        propertyDescriptors.add(COLUMN_DELIMITER);
        propertyDescriptors.add(RECORD_DELIMITER);
        propertyDescriptors.add(SKIP_HEADER_ROW);

        return propertyDescriptors;
    }

    @Override
    public RecordReader createRecordReader(Map<String, String> map, InputStream 
inputStream, ComponentLog componentLog) throws MalformedRecordException, 
IOException, SchemaNotFoundException {
        return new CSVRecordReader(inputStream, componentLog, this.skipHeader, 
this.colDelimiter, this.recDelimiter);
    }
}


--- Class 2 ---

public class CSVRecordReader implements RecordReader {
    private final PeekableScanner s;
    private final RecordSchema schema;
    private final String colDelimiter;
    private final String recordDelimiter;

    public CSVRecordReader(final InputStream in, final ComponentLog logger, 
final boolean hasHeader, final String colDelimiter, final String 
recordDelimiter) throws IOException {
        this.recordDelimiter = recordDelimiter;
        this.colDelimiter = colDelimiter;

        s = new PeekableScanner(new Scanner(in, 
"UTF-8").useDelimiter(recordDelimiter));
        //Build a basic schema based on row count
        final String forRowCount = s.peek();
        final List<RecordField> fields = new ArrayList<>();

        if (forRowCount != null) {
            final String[] columns = forRowCount.split(colDelimiter, -1);
            for (int nColumnIndex = 0; nColumnIndex < columns.length; 
nColumnIndex++) {
                fields.add(new RecordField("Column_" + 
String.valueOf(nColumnIndex), RecordFieldType.STRING.getDataType(), true));
            }

            schema = new SimpleRecordSchema(fields);
        } else {
            schema = null;
        }

        //Skip the header line, if there is one
        if (hasHeader && s.hasNext()) s.next();
    }

    @Override
    public Record nextRecord(boolean b, boolean b1) throws IOException, 
MalformedRecordException {
        if(!s.hasNext()) return null;

        final String row = s.next();
        final List<RecordField> recordFields = getSchema().getFields();

        final Map<String, Object> values = new 
LinkedHashMap<>(recordFields.size() * 2);
        final String[] columns = row.split(colDelimiter, -1);

        for (int i = 0; i < columns.length; i++) {
            final RecordField recordField = recordFields.get(i);
            final String rawFieldName = recordField.getFieldName();
            final String rawValue = columns[i];

            values.put(rawFieldName, rawValue);
        }

        return new MapRecord(schema, values);
    }

    @Override
    public RecordSchema getSchema() throws MalformedRecordException {
        return schema;
    }

    @Override
    public void close() throws IOException {
        s.close();
    }

    private class PeekableScanner
    {
        private Scanner scan;
        private String next;

        public PeekableScanner( Scanner scan )
        {
            this.scan = scan;
            next = (scan.hasNext() ? scan.next() : null);
        }

        public boolean hasNext()
        {
            return (next != null);
        }

        public String next()
        {
            String current = next;
            next = (scan.hasNext() ? scan.next() : null);
            return current;
        }

        public String peek()
        {
            return next;
        }

        public void close(){
            scan.close();
        }
    }
}


From: Andy LoPresto <[email protected]>
Sent: Wednesday, November 6, 2019 4:40 PM
To: [email protected]
Subject: [EXT] Re: How to replace multi character delimiter with ASCII 001

I haven’t tried this, but you might be able to use ${"AQ==“:base64Decode()} as 
AQ== is the Base64 encoded \u0001 ?

Andy LoPresto
[email protected]<mailto:[email protected]>
[email protected]<mailto:[email protected]>
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69


On Nov 6, 2019, at 12:25 PM, Shawn Weeks 
<[email protected]<mailto:[email protected]>> wrote:

I'm trying to process a delimited file with a multi character delimiter which 
is not supported by the CSV Record Reader. To get around that I'm trying to 
replace the delimiter with ASCII 001 the same delimiter used by Hive and one I 
know isn't in the data. Here is my current configuration but NiFi isn't 
interpreting \u0001. I've also tried \001 and ${literal('\u0001')}. None of 
which worked. What is the correct way to do this?

Thanks
Shawn Weeks

<Outlook-asqiibgt.png>

RE: [EXT] Re: How to replace multi character delimiter with ASCII 001

Reply via email to