Re: Question regarding Tika text extraction and elasticsearch

Silvio Meier Mon, 16 May 2016 10:14:54 -0700

Hi Karl

Thanks for the fast response and the patch. I'll patch the version that I have. 
Will the patch be included in the next official release of Apache ManifoldCF?


Regards
Silvio


On 15.05.2016 18:37, Karl Wright wrote:

Here's the patch.  Relatively short.

Karl

On Sun, May 15, 2016 at 12:27 PM, Karl Wright <[email protected]> wrote:


    Apparently there is a permitted way to encode this, and I
    have a patch, but JIRA is down.  If it doesn't come back up soon,
    I'll email you the patch.

    Karl


    On Sun, May 15, 2016 at 12:11 PM, Karl Wright <[email protected]> wrote:

        Hi Silvio,

        This sounds like a problem with the way the Elasticsearch
        connector is forming JSON.  The spec is silent on control
        characters:

        http://rfc7159.net/rfc7159#rfc.section.8.1

        ... so we just embed those in strings.  But it sounds like
        Elasticsearch's JSON parser is not so happy with them.

        If we can find an encoding that satisfies everyone, we can
        change the code to do what is needed.  Maybe "\0" for null, etc?

        Karl


        On Sun, May 15, 2016 at 10:21 AM, <[email protected]> wrote:

            Hi Apache ManifoldCF user list
            I’m experimenting with Apache ManifoldCF 2.3, which I use
            to index our company’s Windows network shares. I’m
            using Elasticsearch 1.7.4 and Apache ManifoldCF 2.3, with MS
            Active Directory as the authority source.
            I defined a job whose connection configuration comprises
            the following chain of transformations (the order in the
            list is the order in which they are applied):

            1.    Repository connection (MS Network Share)
            2.    Allowed documents
            3.    Tika extractor
            4.    Metadata adjuster
            5.    Elasticsearch
            I do this because I don’t want to store the original
            document in the Elasticsearch index, only the
            extracted text of the document. This works so far.
            However, numerous documents cause an
            exception of the following kind when they are analyzed and
            sent to the indexer by Apache ManifoldCF. Note that the
            exception occurs in Elasticsearch:
            [2016-03-16 22:22:43,884][DEBUG][action.index ] [Tefral the Surveyor] [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]: Failed to execute [index {[shareindex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf], source[{
            "access_permission:extract_for_accessibility" : "true",
            "dcterms:created" : "2016-03-02T13:03:47Z",
            "access_permission:can_modify" : "true",
            "access_permission:modify_annotations" : "true",
            "Creation-Date" : "2016-03-02T13:03:47Z",
            "fileLastModified" : "2016-03-02T13:03:37.433Z",
            "access_permission:fill_in_form" : "true",
            "created" : "Wed Mar 02 14:03:47 CET 2016",
            "stream_size" : "52067",
            "dc:format" : "application\/pdf; version=1.4",
            "access_permission:can_print" : "true",
            "stream_name" : "MäuseTastaturen 2.3.16 - Kopie.pdf",
            "xmp:CreatorTool" : "Canon iR-ADV C5250 PDF",
            "resourceName" : "MäuseTastaturen 2.3.16 - Kopie.pdf",
            "fileCreatedOn" : "2016-03-16T21:22:24.085Z",
            "access_permission:assemble_document" : "true",
            "meta:creation-date" : "2016-03-02T13:03:47Z",
            "lastModified" : "Wed Mar 02 14:03:37 CET 2016",
            "pdf:PDFVersion" : "1.4",
            "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
            "shareName" : "AppDevData$",
            "access_permission:can_print_degraded" : "true",
            "xmpTPg:NPages" : "1",
            "createdOn" : "Wed Mar 16 22:22:24 CET 2016",
            "pdf:encrypted" : "false",
            "access_permission:extract_content" : "true",
            "producer" : "Adobe PSL 1.2e for Canon ",
            "attributes" : "32",
            "Content-Type" : "application\/pdf",
            "allow_token_document" : ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],
            "deny_token_document" : "LDAPConn:DEAD_AUTHORITY",
            "allow_token_share" : "__nosecurity__",
            "deny_token_share" : "__nosecurity__",
            "allow_token_parent" : "__nosecurity__",
            "deny_token_parent" : "__nosecurity__",
            "content" : ""}]}]
            org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_source]
                    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
                    at org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
                    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
                    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
                    at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
                    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
                    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
                    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
                    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                    at java.lang.Thread.run(Thread.java:745)
            Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse content to map
                    at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
                    at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
                    at org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
                    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
                    ... 11 more
            Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using backslash to be included in string value
             at [Source: [B@5b774e8b; line: 1, column: 1145]
                    at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
                    at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
                    at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
                    at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
                    at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
                    at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
                    at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
                    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
                    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
                    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
                    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
                    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
                    at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
                    ... 14 more
            This happens for documents of different types/extensions,
            such as PDF and XLSX. It seems that Tika
            sometimes does not remove special characters such as the null
            character (0x0000). Because special characters must be
            escaped when handed over in a JSON request, their presence
            causes Elasticsearch to reject the document, so it is not
            indexed at all. Is there a way to work around the problem with
            the existing functionality of Apache ManifoldCF?
            Regards
            Silvio
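
The Jackson error quoted in the thread matches RFC 7159, section 7, which requires the control characters U+0000 through U+001F to be escaped inside JSON strings; "\0" is not among the escapes JSON defines, but the six-character form \u0000 is. Below is a minimal sketch of that kind of escaping in Java. It is not the ManifoldCF patch mentioned in the thread, and the class and method names (JsonEscape, escape) are hypothetical.

```java
final class JsonEscape {
    private JsonEscape() {}

    /**
     * Escapes a string so it can be placed between the quotation marks of a
     * JSON string literal. Hypothetical helper, not ManifoldCF code.
     */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;  // quotation mark
                case '\\': sb.append("\\\\"); break;  // reverse solidus
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters (including NUL) become
                        // a backslash, the letter u, and four hex digits.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A value containing a raw NUL, as Tika output sometimes does.
        String raw = "a" + (char) 0 + "b";
        System.out.println("{\"content\" : \"" + escape(raw) + "\"}");
    }
}
```

With this sketch, a content value holding a raw NUL between "a" and "b" is written as a\u0000b, which a strict parser such as the Jackson one in the stack trace accepts. Dropping control characters before indexing would be an alternative, at the cost of changing the extracted text.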
