Yes.

Karl
On Mon, May 16, 2016 at 1:14 PM, Silvio Meier <[email protected]> wrote:

> Hi Karl
>
> Thanks for the fast response and the patch. I'll patch the version that
> I have. Will the patch be included in the next official release of
> Apache ManifoldCF?
>
> Regards
> Silvio
>
> On 15.05.2016 18:37, Karl Wright wrote:
>
>> Here's the patch. Relatively short.
>>
>> Karl
>>
>> On Sun, May 15, 2016 at 12:27 PM, Karl Wright <[email protected]> wrote:
>>
>>> There is apparently a way you are allowed to encode this, and I have
>>> a patch, but JIRA is down. If it doesn't come back up soon I'll email
>>> you the patch.
>>>
>>> Karl
>>>
>>> On Sun, May 15, 2016 at 12:11 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Silvio,
>>>>
>>>> This sounds like a problem with the way the Elasticsearch connector
>>>> is forming JSON. The spec is silent on control characters:
>>>>
>>>> http://rfc7159.net/rfc7159#rfc.section.8.1
>>>>
>>>> ... so we just embed those in strings. But it sounds like
>>>> Elasticsearch's JSON parser is not so happy with them.
>>>>
>>>> If we can find an encoding that satisfies everyone, we can change
>>>> the code to do what is needed. Maybe "\0" for null, etc.?
>>>>
>>>> Karl
>>>>
>>>> On Sun, May 15, 2016 at 10:21 AM, <[email protected]> wrote:
>>>>
>>>>> Hi Apache ManifoldCF user list
>>>>>
>>>>> I'm experimenting with Apache ManifoldCF 2.3, which I use to index
>>>>> our company's Windows network shares. I'm using Elasticsearch 1.7.4
>>>>> and Apache ManifoldCF 2.3 with MS Active Directory as the authority
>>>>> source. I defined a job whose connection configuration comprises the
>>>>> following chain of transformations (the list order is the order in
>>>>> which the transformations run):
>>>>>
>>>>> 1. Repository connection (MS Network Share)
>>>>> 2. Allowed documents
>>>>> 3. Tika extractor
>>>>> 4. Metadata adjuster
>>>>> 5. Elasticsearch
>>>>>
>>>>> I do this because I don't want to store the original document inside
>>>>> the Elasticsearch index, only the extracted text of the document.
>>>>> This works so far. However, there are numerous documents that cause
>>>>> an exception of the following kind when they are analyzed and sent
>>>>> to the indexer by Apache ManifoldCF. Note that the exception happens
>>>>> in the Elasticsearch analyzer:
>>>>>
>>>>> [2016-03-16 22:22:43,884][DEBUG][action.index ] [Tefral the Surveyor] [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]: Failed to execute [index {[shareindex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf], source[{"access_permission:extract_for_accessibility" : "true","dcterms:created" : "2016-03-02T13:03:47Z","access_permission:can_modify" : "true","access_permission:modify_annotations" : "true","Creation-Date" : "2016-03-02T13:03:47Z","fileLastModified" : "2016-03-02T13:03:37.433Z","access_permission:fill_in_form" : "true","created" : "Wed Mar 02 14:03:47 CET 2016","stream_size" : "52067","dc:format" : "application\/pdf; version=1.4","access_permission:can_print" : "true","stream_name" : "MäuseTastaturen 2.3.16 - Kopie.pdf","xmp:CreatorTool" : "Canon iR-ADV C5250 PDF","resourceName" : "MäuseTastaturen 2.3.16 - Kopie.pdf","fileCreatedOn" : "2016-03-16T21:22:24.085Z","access_permission:assemble_document" : "true","meta:creation-date" : "2016-03-02T13:03:47Z","lastModified" : "Wed Mar 02 14:03:37 CET 2016","pdf:PDFVersion" : "1.4","X-Parsed-By" : "org.apache.tika.parser.DefaultParser","shareName" : "AppDevData$","access_permission:can_print_degraded" : "true","xmpTPg:NPages" : "1","createdOn" : "Wed Mar 16 22:22:24 CET 2016","pdf:encrypted" : "false","access_permission:extract_content" : "true","producer" : "Adobe PSL 1.2e for Canon ","attributes" : "32","Content-Type" : "application\/pdf","allow_token_document" : ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],"deny_token_document" : "LDAPConn:DEAD_AUTHORITY","allow_token_share" : "__nosecurity__","deny_token_share" : "__nosecurity__","allow_token_parent" : "__nosecurity__","deny_token_parent" : "__nosecurity__","content" : ""}]}]
>>>>> org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_source]
>>>>>     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
>>>>>     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
>>>>>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
>>>>>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
>>>>>     at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
>>>>>     at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
>>>>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
>>>>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
>>>>>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>> Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse content to map
>>>>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
>>>>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
>>>>>     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
>>>>>     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
>>>>>     ... 11 more
>>>>> Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using backslash to be included in string value
>>>>>  at [Source: [B@5b774e8b; line: 1, column: 1145]
>>>>>     at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>>>>>     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>>>>>     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>>>>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>>>>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>>>>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>>>>>     at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>>>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>>>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>>>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>>>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>>>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>>>>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>>>>>     ... 14 more
>>>>>
>>>>> This happens for documents of different types/extensions, such as
>>>>> PDF and XLSX files. It seems that Tika sometimes does not remove
>>>>> special characters such as the null character 0x0000. Their presence
>>>>> causes Elasticsearch to reject the request, because such characters
>>>>> must be escaped when handed over in a JSON request, so the document
>>>>> is not indexed at all. Is there a way to work around the problem
>>>>> with the existing functionality of Apache ManifoldCF?
>>>>>
>>>>> Regards
>>>>> Silvio
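[Editor's note: the patch itself is not included in this thread. As a hedged illustration only of the kind of fix discussed above, one way a connector could make such documents indexable is to escape control characters (U+0000–U+001F), which Jackson refuses to accept raw inside JSON string values, as `\u00XX` sequences before building the request body. The class and method names below are hypothetical, not actual ManifoldCF code.]

```java
// Hypothetical sketch (not the actual ManifoldCF patch): escape control
// characters so field values survive a strict JSON parser such as Jackson's.
public class JsonControlCharEscaper {

    // Replaces every control character (U+0000..U+001F) in the given
    // string with its six-character JSON escape, e.g. NUL -> \u0000.
    public static String escapeControlChars(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulates extracted text containing a stray NUL, like the
        // documents Tika sometimes produces in the report above.
        String dirty = "extracted text\u0000with a stray null";
        System.out.println(escapeControlChars(dirty));
        // prints: extracted text\u0000with a stray null
    }
}
```

An alternative for an index, where the escape sequence itself carries no value, would be to strip these characters entirely (replace them with nothing or a space) in a transformation step before the output connector.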
