Here's the patch. Relatively short.

Karl
On Sun, May 15, 2016 at 12:27 PM, Karl Wright <[email protected]> wrote:
> There is a way apparently you are allowed to encode this, and I have a
> patch, but JIRA is down. If it doesn't come back up soon I'll email you
> the patch.
>
> Karl
>
>
> On Sun, May 15, 2016 at 12:11 PM, Karl Wright <[email protected]> wrote:
>
>> Hi Silvio,
>>
>> This sounds like a problem with the way the Elastic Search connector is
>> forming JSON. The spec is silent on control characters:
>>
>> http://rfc7159.net/rfc7159#rfc.section.8.1
>>
>> ... so we just embed those in strings. But it sounds like
>> ElasticSearch's JSON parser is not so happy with them.
>>
>> If we can find an encoding that satisfies everyone, we can change the
>> code to do what is needed. Maybe "\0" for null, etc?
>>
>> Karl
>>
>>
>> On Sun, May 15, 2016 at 10:21 AM, <[email protected]>
>> wrote:
>>
>>> Hi Apache ManifoldCF user list
>>>
>>> I’m experimenting with Apache ManifoldCF 2.3 which I use to index the
>>> network Windows shares of our company. I’m using Elasticsearch 1.7.4,
>>> Apache ManifoldCF 2.3 with MS Active Directory as authority source.
>>> I defined a job with the following connection configuration comprising
>>> the following chain of transformations (order in the list indicates the
>>> order of the transformations):
>>>
>>> 1. Repository connection (MS Network Share)
>>> 2. Allowed documents
>>> 3. Tika extractor
>>> 4. Metadata adjuster
>>> 5. Elasticsearch
>>>
>>> I do this because I don’t want to store the original document inside the
>>> Elasticsearch index but only the extracted text of the document. This works
>>> so far. However, there are numerous documents which cause an exception of
>>> the following kind when being analyzed and sent to the indexer by Apache
>>> ManifoldCF. Note that the exception happens in the Elasticsearch analyzer:
>>>
>>> [2016-03-16 22:22:43,884][DEBUG][action.index ] [Tefral the Surveyor]
>>> [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]:
>>> Failed to execute [index
>>> {[shareindex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf],
>>> source[{"access_permission:extract_for_accessibility" : "true","dcterms:created" :
>>> "2016-03-02T13:03:47Z","access_permission:can_modify" :
>>> "true","access_permission:modify_annotations" : "true","Creation-Date" :
>>> "2016-03-02T13:03:47Z","fileLastModified" :
>>> "2016-03-02T13:03:37.433Z","access_permission:fill_in_form" :
>>> "true","created" : "Wed Mar 02 14:03:47 CET 2016","stream_size" :
>>> "52067","dc:format" :
>>> "application\/pdf; version=1.4","access_permission:can_print" :
>>> "true","stream_name" : "MäuseTastaturen 2.3.16 - Kopie.pdf","xmp:CreatorTool" :
>>> "Canon iR-ADV C5250 PDF","resourceName" : "MäuseTastaturen 2.3.16 - Kopie.pdf","fileCreatedOn" :
>>> "2016-03-16T21:22:24.085Z","access_permission:assemble_document" :
>>> "true","meta:creation-date" : "2016-03-02T13:03:47Z","lastModified" :
>>> "Wed Mar 02 14:03:37 CET 2016","pdf:PDFVersion" :
>>> "1.4","X-Parsed-By" : "org.apache.tika.parser.DefaultParser","shareName" :
>>> "AppDevData$","access_permission:can_print_degraded" : "true","xmpTPg:NPages" :
>>> "1","createdOn" : "Wed Mar 16 22:22:24 CET 2016","pdf:encrypted" :
>>> "false","access_permission:extract_content" : "true","producer" :
>>> "Adobe PSL 1.2e for Canon ","attributes" : "32","Content-Type" :
>>> "application\/pdf","allow_token_document" :
>>> ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],"deny_token_document" :
>>> "LDAPConn:DEAD_AUTHORITY","allow_token_share" : "__nosecurity__","deny_token_share" :
>>> "__nosecurity__","allow_token_parent" :
>>> "__nosecurity__","deny_token_parent" : "__nosecurity__","content" : ""}]}]
>>> org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_source]
>>>     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
>>>     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
>>>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
>>>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
>>>     at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
>>>     at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
>>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
>>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
>>>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse content to map
>>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
>>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
>>>     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
>>>     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
>>>     ... 11 more
>>> Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using backslash to be included in string value
>>>     at [Source: [B@5b774e8b; line: 1, column: 1145]
>>>     at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>>>     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>>>     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>>>     at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>>>     ... 14 more
>>>
>>> This happens for documents of different types/extensions, such as pdfs as
>>> well as xlsx, etc. It seems that Tika sometimes does not remove special
>>> characters such as the null character 0x0000. The presence of the special
>>> characters causes Elasticsearch to omit the indexing of the document. Thus
>>> the document is not indexed at all, as special characters need to be
>>> escaped when handed over as a JSON request. Is there a way to work around
>>> the problem with the existing functionality of Apache ManifoldCF?
>>>
>>> Regards
>>> Silvio
>>>
>>>
>>
>>
>
CONNECTORS-elasticsearch.patch
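For readers who cannot open the attachment: the sketch below is not the attached patch and not the connector's actual code. It only illustrates the kind of escaping the thread is about. RFC 7159 section 7 requires that the quotation mark, the reverse solidus, and the control characters U+0000 through U+001F be escaped inside a JSON string, and the generic form is a backslash, the letter u, and four hex digits. "\0" is not one of the escapes JSON defines, so a NUL would be written as the six-character sequence backslash-u-0000. The class name JsonStringEscapeSketch and the method quote are hypothetical names chosen for this illustration.

```java
// Hypothetical helper, not taken from ManifoldCF or from the attached patch.
final class JsonStringEscapeSketch {
    private JsonStringEscapeSketch() {}

    /**
     * Returns the value as a double-quoted JSON string in which the
     * quotation mark, the backslash, and every control character
     * in the range U+0000..U+001F are escaped.
     */
    static String quote(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 2);
        sb.append('"');
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b"); break;
                case '\f': sb.append("\\f"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters, including NUL,
                        // get the generic four-hex-digit escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        sb.append('"');
        return sb.toString();
    }

    public static void main(String[] args) {
        // A value containing a NUL, like the extracted text in the failing request.
        String extracted = "content" + (char) 0 + "with NUL";
        System.out.println(quote(extracted));
    }
}
```

With this sketch, a value containing the NUL that Jackson rejected in the log above comes out with the NUL replaced by its escape, so the parser sees six ordinary characters instead of a raw control character. Whether Elasticsearch should then index text containing NULs at all is a separate question that the sketch does not address.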
