Yes.

Karl
On Mon, May 16, 2016 at 1:14 PM, Silvio Meier <[email protected]> wrote:

> Hi Karl
>
> Thanks for the fast response and the patch. I'll patch the version that
> I have. Will the patch be included in the next official release of
> Apache ManifoldCF?
>
> Regards
> Silvio
>
> On 15.05.2016 18:37, Karl Wright wrote:
>
>> Here's the patch. Relatively short.
>>
>> Karl
>>
>> On Sun, May 15, 2016 at 12:27 PM, Karl Wright <[email protected]> wrote:
>>
>>> There is apparently a way you are allowed to encode this, and I have
>>> a patch, but JIRA is down. If it doesn't come back up soon I'll email
>>> you the patch.
>>>
>>> Karl
>>>
>>> On Sun, May 15, 2016 at 12:11 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Silvio,
>>>>
>>>> This sounds like a problem with the way the Elasticsearch connector
>>>> is forming JSON. The spec is silent on control characters:
>>>>
>>>> http://rfc7159.net/rfc7159#rfc.section.8.1
>>>>
>>>> ... so we just embed those in strings. But it sounds like
>>>> Elasticsearch's JSON parser is not so happy with them.
>>>>
>>>> If we can find an encoding that satisfies everyone, we can change
>>>> the code to do what is needed. Maybe "\0" for null, etc.?
>>>>
>>>> Karl
>>>>
>>>> On Sun, May 15, 2016 at 10:21 AM, <[email protected]> wrote:
>>>>
>>>>> Hi Apache ManifoldCF user list
>>>>>
>>>>> I'm experimenting with Apache ManifoldCF 2.3, which I use to index
>>>>> our company's Windows network shares. I'm using Elasticsearch 1.7.4
>>>>> and Apache ManifoldCF 2.3 with MS Active Directory as the authority
>>>>> source. I defined a job whose connection configuration comprises the
>>>>> following chain of transformations (the list order is the order in
>>>>> which the transformations run):
>>>>>
>>>>> 1. Repository connection (MS Network Share)
>>>>> 2. Allowed documents
>>>>> 3. Tika extractor
>>>>> 4. Metadata adjuster
>>>>> 5. Elasticsearch
>>>>>
>>>>> I do this because I don't want to store the original document inside
>>>>> the Elasticsearch index, only the extracted text of the document.
>>>>> This works so far. However, there are numerous documents that cause
>>>>> an exception of the following kind when they are analyzed and sent
>>>>> to the indexer by Apache ManifoldCF. Note that the exception happens
>>>>> in the Elasticsearch analyzer:
>>>>>
>>>>> [2016-03-16 22:22:43,884][DEBUG][action.index ] [Tefral the Surveyor] [shareindex][2], node[O2bWpnsKS8iAE7hwGEOpuA], [P], s[STARTED]: Failed to execute [index {[shareindex][attachment][file://///du-evs-01/AppDevData%24/0Repository/temp/indexingtestcorpus/M%C3%A4useTastaturen%202.3.16%20-%20Kopie.pdf], source[{"access_permission:extract_for_accessibility" : "true","dcterms:created" : "2016-03-02T13:03:47Z","access_permission:can_modify" : "true","access_permission:modify_annotations" : "true","Creation-Date" : "2016-03-02T13:03:47Z","fileLastModified" : "2016-03-02T13:03:37.433Z","access_permission:fill_in_form" : "true","created" : "Wed Mar 02 14:03:47 CET 2016","stream_size" : "52067","dc:format" : "application\/pdf; version=1.4","access_permission:can_print" : "true","stream_name" : "MäuseTastaturen 2.3.16 - Kopie.pdf","xmp:CreatorTool" : "Canon iR-ADV C5250 PDF","resourceName" : "MäuseTastaturen 2.3.16 - Kopie.pdf","fileCreatedOn" : "2016-03-16T21:22:24.085Z","access_permission:assemble_document" : "true","meta:creation-date" : "2016-03-02T13:03:47Z","lastModified" : "Wed Mar 02 14:03:37 CET 2016","pdf:PDFVersion" : "1.4","X-Parsed-By" : "org.apache.tika.parser.DefaultParser","shareName" : "AppDevData$","access_permission:can_print_degraded" : "true","xmpTPg:NPages" : "1","createdOn" : "Wed Mar 16 22:22:24 CET 2016","pdf:encrypted" : "false","access_permission:extract_content" : "true","producer" : "Adobe PSL 1.2e for Canon ","attributes" : "32","Content-Type" : "application\/pdf","allow_token_document" : ["LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16152","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-16153","LDAPConn:S-1-5-21-1751174259-1996115066-1435642685-7894"],"deny_token_document" : "LDAPConn:DEAD_AUTHORITY","allow_token_share" : "__nosecurity__","deny_token_share" : "__nosecurity__","allow_token_parent" : "__nosecurity__","deny_token_parent" : "__nosecurity__","content" : ""}]}]
>>>>> org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_source]
>>>>>     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
>>>>>     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.preParse(SourceFieldMapper.java:240)
>>>>>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:540)
>>>>>     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
>>>>>     at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:492)
>>>>>     at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:192)
>>>>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
>>>>>     at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
>>>>>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>> Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse content to map
>>>>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:130)
>>>>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:81)
>>>>>     at org.elasticsearch.index.mapper.internal.SourceFieldMapper.parseCreateField(SourceFieldMapper.java:274)
>>>>>     at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
>>>>>     ... 11 more
>>>>> Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 0)): has to be escaped using backslash to be included in string value
>>>>>  at [Source: [B@5b774e8b; line: 1, column: 1145]
>>>>>     at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1487)
>>>>>     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518)
>>>>>     at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._throwUnquotedSpace(ParserMinimalBase.java:482)
>>>>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2357)
>>>>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2287)
>>>>>     at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:286)
>>>>>     at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86)
>>>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:293)
>>>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
>>>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
>>>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
>>>>>     at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
>>>>>     at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:125)
>>>>>     ... 14 more
>>>>>
>>>>> This happens for documents of different types/extensions, such as
>>>>> PDF and XLSX files. It seems that Tika sometimes does not remove
>>>>> special characters such as the null character 0x0000. Their presence
>>>>> causes Elasticsearch to reject the request, because such characters
>>>>> must be escaped when handed over in a JSON request, so the document
>>>>> is not indexed at all. Is there a way to work around the problem
>>>>> with the existing functionality of Apache ManifoldCF?
>>>>>
>>>>> Regards
>>>>> Silvio
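[Editor's note: the patch itself is not included in this thread. As a hedged illustration only of the kind of fix discussed above, one way a connector could make such documents indexable is to escape control characters (U+0000–U+001F), which Jackson refuses to accept raw inside JSON string values, as `\u00XX` sequences before building the request body. The class and method names below are hypothetical, not actual ManifoldCF code.]

```java
// Hypothetical sketch (not the actual ManifoldCF patch): escape control
// characters so field values survive a strict JSON parser such as Jackson's.
public class JsonControlCharEscaper {

    // Replaces every control character (U+0000..U+001F) in the given
    // string with its six-character JSON escape, e.g. NUL -> \u0000.
    public static String escapeControlChars(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulates extracted text containing a stray NUL, like the
        // documents Tika sometimes produces in the report above.
        String dirty = "extracted text\u0000with a stray null";
        System.out.println(escapeControlChars(dirty));
        // prints: extracted text\u0000with a stray null
    }
}
```

An alternative for an index, where the escape sequence itself carries no value, would be to strip these characters entirely (replace them with nothing or a space) in a transformation step before the output connector.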
