Re: Question regarding Tika text extraction and elasticsearch

Karl Wright Sun, 15 May 2016 09:28:22 -0700

Apparently there is an allowed way to encode this, and I have a patch, but
JIRA is down.  If it doesn't come back up soon, I'll email you the patch.
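
For illustration only, and not the patch mentioned above: a minimal sketch
of one way such an encoding could look, assuming the goal is to follow the
JSON string grammar, which requires the quotation mark, the backslash, and
the control characters U+0000 through U+001F to be escaped. Note that "\0"
is not one of the escapes JSON defines, so NUL is written with the six
character backslash-u form instead. The class and method names here are
hypothetical and do not come from the ManifoldCF code base.

```java
/**
 * Hypothetical helper (not ManifoldCF code): escapes a Java string so it
 * can be embedded between double quotes in a JSON document.
 */
final class JsonStringEscaper {
    private JsonStringEscaper() {}

    /** Returns the escaped text, without the surrounding quotes. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                // Characters with a short two-character escape in JSON.
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\b': out.append("\\b");  break;
                case '\f': out.append("\\f");  break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters, including NUL:
                        // backslash, 'u', then four hex digits.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up extracted text containing a NUL character.
        String extracted = "some text\u0000more text";
        System.out.println("\"content\" : \"" + escape(extracted) + "\"");
    }
}
```

With this approach the NUL character reaches the JSON parser as an escape
sequence rather than as a raw byte. Whether it is better to escape such
characters or to strip them before indexing is a separate design choice.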


Karl


On Sun, May 15, 2016 at 12:11 PM, Karl Wright <[email protected]> wrote:

> Hi Silvio,
>
> This sounds like a problem with the way the Elastic Search connector is
> forming JSON.  The spec is silent on control characters:
>
> http://rfc7159.net/rfc7159#rfc.section.8.1
>
> ... so we just embed those in strings.  But it sounds like ElasticSearch's
> JSON parser is not so happy with them.
>
> If we can find an encoding that satisfies everyone, we can change the code
> to do what is needed.  Maybe "\0" for null, etc?
>
> Karl
>
>
> On Sun, May 15, 2016 at 10:21 AM, <[email protected]> wrote:
>
>> Hi Apache ManifoldCF user list
>>
>> I’m experimenting with Apache ManifoldCF 2.3, which I use to index our
>> company’s Windows network shares. I’m using Elasticsearch 1.7.4 and
>> Apache ManifoldCF 2.3, with MS Active Directory as the authority source.
>> I defined a job whose connection configuration comprises the following
>> chain of transformations (the order in the list is the order in which
>> they are applied):
>>
>> 1.    Repository connection (MS Network Share)
>> 2.    Allowed documents
>> 3.    Tika extractor
>> 4.    Metadata adjuster
>> 5.    Elasticsearch
>>
>> I do this because I don’t want to store the original document inside the
>> Elasticsearch index but only the extracted text of the document. This works
>> so far. However, numerous documents cause an exception of the following
>> kind when they are analyzed and sent to the indexer by Apache ManifoldCF.
>> Note that the exception happens in the Elasticsearch analyzer:
>>
>> [2016-03-16 22:22:43,884][DEBUG][action.index             ] [Tefral the
>> Surveyor] [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]:
>> Failed to execute [index {[sharein
>> dex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf],
>> source[{"access_permission:extract_for_access
>> ibility" : "true","dcterms:created" :
>> "2016-03-02T13:03:47Z","access_permission:can_modify" :
>> "true","access_permission:modify_annotations" : "true","Creation-Date" :
>> "2016-03-02T1
>> 3:03:47Z","fileLastModified" :
>> "2016-03-02T13:03:37.433Z","access_permission:fill_in_form" :
>> "true","created" : "Wed Mar 02 14:03:47 CET 2016","stream_size" :
>> "52067","dc:format" :
>>  "application\/pdf; version=1.4","access_permission:can_print" :
>> "true","stream_name" : "MäuseTastaturen 2.3.16 -
>> Kopie.pdf","xmp:CreatorTool" : "Canon iR-ADV C5250  PDF","resourc
>> eName" : "MäuseTastaturen 2.3.16 - Kopie.pdf","fileCreatedOn" :
>> "2016-03-16T21:22:24.085Z","access_permission:assemble_document" :
>> "true","meta:creation-date" : "2016-03-02T13:03:
>> 47Z","lastModified" : "Wed Mar 02 14:03:37 CET 2016","pdf:PDFVersion" :
>> "1.4","X-Parsed-By" : "org.apache.tika.parser.DefaultParser","shareName" :
>> "AppDevData$","access_permission:
>> can_print_degraded" : "true","xmpTPg:NPages" : "1","createdOn" : "Wed Mar
>> 16 22:22:24 CET 2016","pdf:encrypted" :
>> "false","access_permission:extract_content" : "true","producer" :
>> "Adobe PSL 1.2e for Canon ","attributes" : "32","Content-Type" :
>> "application\/pdf","allow_token_document" :
>> ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S
>> -1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],"deny_token_document"
>> : "LDAPConn:DEAD_AUTHORITY","allow_token_share" : "
>> __nosecurity__","deny_token_share" :
>> "__nosecurity__","allow_token_parent" :
>> "__nosecurity__","deny_token_parent" : "__nosecurity__","content" : ""}]}]
>> org.elasticsearch.index.mapper.MapperParsingException: failed to parse
>> [_source]
>>         at
>> org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
>>         at
>> org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
>>         at
>> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
>>         at
>> org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
>>         at
>> org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
>>         at
>> org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
>>         at
>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
>>         at
>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
>>         at
>> org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse
>> content to map
>>         at
>> org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
>>         at
>> org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
>>         at
>> org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
>>         at
>> org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
>>         ... 11 more
>> Caused by: org.elasticsearch.common.jackson.core.JsonParseException:
>> Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using
>> backslash to be included in string va
>> lue
>>  at [Source: [B@5b774e8b; line: 1, column: 1145]
>>         at
>> org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>>         at
>> org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>>         at
>> org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>>         at
>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>>         at
>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>>         at
>> org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>>         at
>> org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>>         at
>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>>         at
>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>>         at
>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>>         at
>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>>         at
>> org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>>         at
>> org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>>         ... 14 more
>>
>> This happens for documents of different types/extensions, such as PDF and
>> XLSX. It seems that Tika sometimes does not remove special characters such
>> as the null character (0x0000). Because special characters must be escaped
>> when handed over in a JSON request, their presence causes Elasticsearch to
>> reject the document, so it is not indexed at all. Is there a way to work
>> around the problem with the existing functionality of Apache ManifoldCF?
>>
>> Regards
>> Silvio
>>
>>
>
>
