Apparently there is a permitted way to encode this, and I have a patch, but JIRA is down. If it doesn't come back up soon, I'll email you the patch.
Karl

On Sun, May 15, 2016 at 12:11 PM, Karl Wright <[email protected]> wrote:

> Hi Silvio,
>
> This sounds like a problem with the way the Elastic Search connector is
> forming JSON. The spec is silent on control characters:
>
> http://rfc7159.net/rfc7159#rfc.section.8.1
>
> ... so we just embed those in strings. But it sounds like ElasticSearch's
> JSON parser is not so happy with them.
>
> If we can find an encoding that satisfies everyone, we can change the code
> to do what is needed. Maybe "\0" for null, etc?
>
> Karl
>
>
> On Sun, May 15, 2016 at 10:21 AM, <[email protected]> wrote:
>
>> Hi Apache ManifoldCF user list
>>
>> I’m experimenting with Apache ManifoldCF 2.3 which I use to index the
>> network Windows shares of our company. I’m using Elasticsearch 1.7.4,
>> Apache ManifoldCF 2.3 with MS Active Directory as authority source.
>> I defined a job with the following connection configuration comprising
>> the following chain of transformations (order in the list indicates the
>> order of the transformations):
>>
>> 1. Repository connection (MS Network Share)
>> 2. Allowed documents
>> 3. Tika extractor
>> 4. Metadata adjuster
>> 5. Elasticsearch
>>
>> I do this because I don’t want to store the original document inside the
>> elasticsearch index but only the extracted text of the document. This works
>> so far. However, there are numerous documents which cause an exception of
>> the following kind when being analyzed and sent to the indexer by Apache
>> ManifoldCF.
>> Note that the exception happens in the Elasticsearch analyzer:
>>
>> [2016-03-16 22:22:43,884][DEBUG][action.index ] [Tefral the
>> Surveyor] [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]:
>> Failed to execute [index {[shareindex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf],
>> source[{"access_permission:extract_for_accessibility" : "true","dcterms:created" :
>> "2016-03-02T13:03:47Z","access_permission:can_modify" :
>> "true","access_permission:modify_annotations" : "true","Creation-Date" :
>> "2016-03-02T13:03:47Z","fileLastModified" :
>> "2016-03-02T13:03:37.433Z","access_permission:fill_in_form" :
>> "true","created" : "Wed Mar 02 14:03:47 CET 2016","stream_size" :
>> "52067","dc:format" :
>> "application\/pdf; version=1.4","access_permission:can_print" :
>> "true","stream_name" : "MäuseTastaturen 2.3.16 -
>> Kopie.pdf","xmp:CreatorTool" : "Canon iR-ADV C5250 PDF","resourceName" :
>> "MäuseTastaturen 2.3.16 - Kopie.pdf","fileCreatedOn" :
>> "2016-03-16T21:22:24.085Z","access_permission:assemble_document" :
>> "true","meta:creation-date" : "2016-03-02T13:03:47Z","lastModified" :
>> "Wed Mar 02 14:03:37 CET 2016","pdf:PDFVersion" :
>> "1.4","X-Parsed-By" : "org.apache.tika.parser.DefaultParser","shareName" :
>> "AppDevData$","access_permission:can_print_degraded" : "true","xmpTPg:NPages" :
>> "1","createdOn" : "Wed Mar 16 22:22:24 CET 2016","pdf:encrypted" :
>> "false","access_permission:extract_content" : "true","producer" :
>> "Adobe PSL 1.2e for Canon ","attributes" : "32","Content-Type" :
>> "applica-tion\/pdf","allow_token_document" :
>> ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],"deny_token_document"
>> : "LDAPConn:DEAD_AUTHORITY","allow_token_share" : "__nosecurity__","deny_token_share" :
>> "__nosecurity__","allow_token_parent" :
>> "__nosecurity__","deny_token_parent" : "__nosecurity__","content" : ""}]}]
>> org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_source]
>>     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
>>     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
>>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
>>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
>>     at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
>>     at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
>>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse content to map
>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
>>     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
>>     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
>>     ... 11 more
>> Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using backslash to be included in string value
>>  at [Source: [B@5b774e8b; line: 1, column: 1145]
>>     at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>>     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>>     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>>     at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>>     ... 14 more
>>
>> This happens for documents of different types/extensions, such as PDF and
>> XLSX files. It seems that Tika sometimes does not remove special
>> characters such as the null character (0x0000). The presence of these
>> characters causes Elasticsearch to reject the request, so the document is
>> not indexed at all, because special characters need to be escaped when
>> passed in a JSON request. Is there a way to work around the problem with
>> the existing functionality of Apache ManifoldCF?
>>
>> Regards
>> Silvio
>>
>>
>
>
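
For reference, here is a minimal sketch of the kind of encoding discussed above. It is not the actual patch or the connector's real code; the class name `JsonStringEscaper` and its `escape` method are hypothetical, made up for illustration. It relies on the JSON string grammar (RFC 7159, Section 7), which lets control characters U+0000 through U+001F be written as a backslash-u escape with four hex digits, so a null character becomes `\u0000`. Note that `\0` is not among the escapes JSON defines, so a strict parser would likely reject it as well.

```java
// Hypothetical helper for illustration only; not taken from ManifoldCF.
final class JsonStringEscaper {
    private JsonStringEscaper() {}

    /**
     * Returns the text to place between the quotes of a JSON string literal.
     * Quotes, backslashes, and all control characters below U+0020 are escaped.
     */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters (including NUL) get the
                        // generic six-character hex escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A field value containing an embedded NUL, as Tika may emit.
        String content = "text with NUL" + '\0' + "inside";
        System.out.println("{\"content\" : \"" + escape(content) + "\"}");
    }
}
```

With this kind of escaping, the embedded null character reaches the server as the six characters `\u0000` inside the string, which a Jackson-based parser should accept. Whether the null should be escaped or simply dropped before indexing is a separate design choice.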
