Hi Karl
Thanks for the fast response and the patch. I'll patch the version that I have.
Will the patch be included in the next official release of Apache ManifoldCF?
Regards
Silvio
On 15.05.2016 18:37, Karl Wright wrote:
Here's the patch. Relatively short.
Karl
On Sun, May 15, 2016 at 12:27 PM, Karl Wright wrote:
Apparently there is a permitted way to encode this, and I
have a patch, but JIRA is down. If it doesn't come back up soon,
I'll email you the patch.
Karl
On Sun, May 15, 2016 at 12:11 PM, Karl Wright wrote:
Hi Silvio,
This sounds like a problem with the way the Elasticsearch
connector is forming JSON. The spec is silent on control
characters:
http://rfc7159.net/rfc7159#rfc.section.8.1
... so we just embed those in strings. But it sounds like
Elasticsearch's JSON parser is not so happy with them.
If we can find an encoding that satisfies everyone, we can
change the code to do what is needed. Maybe "\0" for null, etc?
Karl
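For reference, RFC 7159 (section 7) does define an encoding for this case: control characters U+0000 through U+001F must be escaped inside JSON strings, and the six-character form `\u0000` is always available. The sketch below shows that escaping in plain Java. It is not the ManifoldCF patch mentioned in this thread; the class name `JsonStringEscaper` and its `escape` method are hypothetical names chosen for illustration.

```java
public final class JsonStringEscaper {
    private JsonStringEscaper() {}

    /** Returns the input with JSON string escapes applied (no surrounding quotes). */
    public static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\b': out.append("\\b"); break;
                case '\f': out.append("\\f"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters, including NUL,
                        // are written in the six-character hex form.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A value containing a NUL and a tab, similar to what an extractor might emit.
        String raw = "Canon" + (char) 0 + "\tPDF";
        System.out.println(escape(raw));
    }
}
```

With this approach the NUL character stays in the indexed value as an escaped code point, so nothing is silently dropped; whether that is desirable for search is a separate question.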
On Sun, May 15, 2016 at 10:21 AM, Silvio wrote:
Hi Apache ManifoldCF user list
I’m experimenting with Apache ManifoldCF 2.3, which I use
to index our company’s Windows network shares. My setup is
Elasticsearch 1.7.4 and Apache ManifoldCF 2.3, with MS
Active Directory as the authority source.
I defined a job whose pipeline consists of the following
chain of connections and transformations (the order of the
list is the order of processing):
1. Repository connection (MS Network Share)
2. Allowed documents
3. Tika extractor
4. Metadata adjuster
5. Elasticsearch
I do this because I don’t want to store the original
document in the Elasticsearch index, only the
extracted text of the document. This works so far.
However, numerous documents cause an exception of the
following kind when Apache ManifoldCF analyzes them and
sends them to the indexer. Note that the exception is
thrown by Elasticsearch:
[2016-03-16 22:22:43,884][DEBUG][action.index ] [Tefral
the Surveyor] [shareindex][2],
node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]: Failed to
execute [index {[shareindex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf],
source[{"access_permission:extract_for_accessibility" : "true","dcterms:created" :
"2016-03-02T13:03:47Z","access_permission:can_modify" :
"true","access_permission:modify_annotations" :
"true","Creation-Date" : "2016-03-02T13:03:47Z","fileLastModified" :
"2016-03-02T13:03:37.433Z","access_permission:fill_in_form" :
"true","created" : "Wed Mar 02 14:03:47 CET
2016","stream_size" : "52067","dc:format" :
"application\/pdf;
version=1.4","access_permission:can_print" :
"true","stream_name" : "MäuseTastaturen 2.3.16 -
Kopie.pdf","xmp:CreatorTool" : "Canon iR-ADV C5250
PDF","resourceName" : "MäuseTastaturen 2.3.16 -
Kopie.pdf","fileCreatedOn" :
"2016-03-16T21:22:24.085Z","access_permission:assemble_document"
: "true","meta:creation-date" : "2016-03-02T13:03:47Z","lastModified" : "Wed Mar 02 14:03:37 CET
2016","pdf:PDFVersion" : "1.4","X-Parsed-By" :
"org.apache.tika.parser.DefaultParser","shareName" :
"AppDevData$","access_permission:can_print_degraded" : "true","xmpTPg:NPages" :
"1","createdOn" : "Wed Mar 16 22:22:24 CET
2016","pdf:encrypted" :
"false","access_permission:extract_content" :
"true","producer" :
"Adobe PSL 1.2e for Canon ","attributes" :
"32","Content-Type" :
"application\/pdf","allow_token_document" :
["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],"deny_token_document"
: "LDAPConn:DEAD_AUTHORITY","allow_token_share" : "__nosecurity__","deny_token_share" :
"__nosecurity__","allow_token_parent" :
"__nosecurity__","deny_token_parent" :
"__nosecurity__","content" : ""}]}]
org.elasticsearch.index.mapper.MapperParsingException:
failed to parse [_source]
at
org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
at
org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
at
org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
at
org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
at
org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
at
org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
at
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchParseException:
Failed to parse content to map
at
org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
at
org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
at
org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
at
org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
... 11 more
Caused by:
org.elasticsearch.common.jackson.core.JsonParseException:
Illegal unquoted character ((CTRL-CHAR, code 0)): has to
be escaped using backslash to be included in string value
at [Source: [B@5b774e8b; line: 1, column: 1145]
at
org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
at
org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
at
org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
at
org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
at
org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
at
org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
at
org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
at
org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
at
org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
at
org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
at
org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
at
org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
at
org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
... 14 more
This happens for documents of different types, including
PDF and XLSX files. It seems that Tika sometimes leaves
control characters such as the null character (0x0000) in
the extracted text. These characters need to be escaped
when handed over in a JSON request; because they are not,
Elasticsearch rejects the request and the document is not
indexed at all. Is there a way to work around the problem
with the existing functionality of Apache ManifoldCF?
Regards
Silvio
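As an illustration of the kind of workaround asked about above, one option is to drop control characters from the extracted text before it reaches the output connector. The sketch below is plain Java and is not existing ManifoldCF functionality; `ControlCharStripper` and `strip` are hypothetical names, and the choice to keep tab, line feed and carriage return is an assumption about what is useful in extracted text.

```java
public final class ControlCharStripper {
    private ControlCharStripper() {}

    /**
     * Removes control characters below U+0020 from the text,
     * keeping tab, line feed and carriage return.
     */
    public static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = c >= 0x20 || c == '\t' || c == '\n' || c == '\r';
            if (keep) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Simulated extractor output with an embedded NUL character.
        String extracted = "Adobe PSL 1.2e for Canon" + (char) 0;
        String cleaned = strip(extracted);
        System.out.println(cleaned.length() == extracted.length() - 1);
    }
}
```

Stripping changes the indexed text, whereas escaping preserves it, so which one fits depends on whether the control characters carry any meaning in the source documents.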