Thank you for the examples, Karl.
However, when I include this definition in the job definition and then run the
job, ManifoldCF seems to enter some kind of loop in the running state.
Looking at the manifoldcf.log file, I see many entries like this:
>>>>>>
FATAL 2015-08-19 07:51:48,231 (Worker thread '70') - org.apache.manifoldcf.crawlerthreads - Error tossed: null
java.lang.NullPointerException
        at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.append(ForcedMetadataConnector.java:646)
        at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.processExpression(ForcedMetadataConnector.java:678)
        at org.apache.manifoldcf.agents.transformation.forcedmetadata.ForcedMetadataConnector.addOrReplaceDocumentWithException(ForcedMetadataConnector.java:134)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3221)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3072)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2706)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1503)
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1468)
        at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1813)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:379)
<<<<<<
This may or may not be related to these earlier messages in the same log file:
>>>>>>
INFO 2015-08-19 07:47:47,307 (main) - org.apache.manifoldcf.root - Synchronization storage cleaned up
INFO 2015-08-19 07:48:07,830 (main) - org.apache.manifoldcf.root - Running...
INFO 2015-08-19 07:48:07,846 (main) - org.apache.manifoldcf.root - Running...
INFO 2015-08-19 07:48:07,994 (Agents thread) - org.apache.manifoldcf.jobs - Cleaning up all process data
INFO 2015-08-19 07:48:08,036 (Agents thread) - org.apache.manifoldcf.jobs - Cleanup complete
INFO 2015-08-19 07:48:08,064 (Agents thread) - org.apache.manifoldcf.jobs - Starting cluster
INFO 2015-08-19 07:48:08,072 (Agents thread) - org.apache.manifoldcf.jobs - Cluster start complete
INFO 2015-08-19 07:48:08,075 (Agents thread) - org.apache.manifoldcf.root - Starting up pull-agent...
INFO 2015-08-19 07:48:08,088 (Agents thread) - org.apache.manifoldcf.root - Starting up pull-agent...
INFO 2015-08-19 07:48:08,133 (Agents thread) - org.apache.manifoldcf.root - Pull-agent started
INFO 2015-08-19 07:48:08,182 (Agents thread) - org.apache.manifoldcf.root - Pull-agent started
ERROR 2015-08-19 07:48:44,184 (qtp858007949-11) - org.apache.manifoldcf.misc - Missing resource 'ForcedMetadata.ForcedMetadataNameMustNotBeNull' in bundle 'org.apache.manifoldcf.agents.transformation.forcedmetadata.common' for locale 'en_US'
java.util.MissingResourceException: Can't find resource for bundle java.util.PropertyResourceBundle, key ForcedMetadata.ForcedMetadataNameMustNotBeNull
        at java.util.ResourceBundle.getObject(ResourceBundle.java:395)
        at java.util.ResourceBundle.getString(ResourceBundle.java:355)
        at org.apache.manifoldcf.core.i18n.Messages.getMessage(Messages.java:193)
        at org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:240)
        at org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:208)
        at org.apache.manifoldcf.ui.i18n.ResourceBundleWrapper.getString(ResourceBundleWrapper.java:44)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at .......
<<<<<<
If I edit the job definition, remove the regular expression, and save the
job, then almost immediately I see these entries in the log:
>>>>>>
INFO 2015-08-19 07:52:28,300 (Finisher thread) - org.apache.manifoldcf.jobs - Marked job 1439951495926 for shutdown
INFO 2015-08-19 07:52:28,434 (Job reset thread) - org.apache.manifoldcf.jobs - Job 1439951495926 now completed
INFO 2015-08-19 07:52:38,332 (Job notification thread) - org.apache.manifoldcf.jobs - Found job 1439951495926 in need of notification
<<<<<<
Thank you,
Mike
Date: Wed, 19 Aug 2015 03:45:30 -0400
Subject: Re: Metadata expressions
From: [email protected]
To: [email protected]
Hi Mike,
The documentation (which does not seem to have been updated on the site yet)
says the following:
>>>>>>
<p>You can also use regular expressions in the substitution string, for
example: "${there|[0-9]*}", which will extract the first sequence of
sequential numbers it finds in the value of the field "there", or
"${there|string(.*)|1}", which will include everything following "string"
in the field value. (The third argument specifies the regular expression
group number, with an optional suffix of "l" or "u" meaning upper-case or
lower-case.)</p>
<p>Enter a parameter name, and either select to remove the value or provide
an expression. If you chose to supply an expression, enter the expression
in the box.
<<<<<<
To evaluate your regular expression with the specific input you gave, I
typically use a regex applet, if you can find a browser that still allows it:
http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
Dropping your stuff in and clicking the "find()" button yields this:
"Pattern did not match"
So your regex is not correct. But, "Protocol (\d+)" does match, with the
following group outputs:
start() = 0, end() = 16
group(0) = "Protocol 1234500"
group(1) = "1234500"
So you want group 1. Therefore, the MCF expression would be:
expression = Protocol-${protocol_name|Protocol (\d+)|1}
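The same check can be reproduced outside the browser with plain
java.util.regex. This is only a minimal sketch: the class and method names
are hypothetical (not part of ManifoldCF), and it assumes the expression is
applied with find() semantics, as in the regex tester above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for trying a pattern against a sample field value.
// It is not ManifoldCF code; it only mirrors the find() test described above.
class MetadataRegexCheck {

    // Returns the requested group of the first find() match,
    // or null when the pattern does not occur anywhere in the input.
    static String extractGroup(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String protocolName = "Protocol 1234500 (USPA00012345) second version";

        // Original attempt: requires the literal text "string" before the digits.
        System.out.println(extractGroup("string(\\d+)", protocolName, 0));
        // prints: null

        // Corrected pattern: group 1 holds only the digits.
        String id = extractGroup("Protocol (\\d+)", protocolName, 1);
        System.out.println("Protocol-" + id);
        // prints: Protocol-1234500
    }
}
```

Group 0 is the whole match ("Protocol 1234500"), which is why group 1 is the
one to ask for when only the number is wanted.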
Thanks,
Karl
On Tue, Aug 18, 2015 at 11:19 PM, Mike Caceres <[email protected]> wrote:
If I have a document with the following metadata values:
"protocol_name" : "Protocol 1234500 (USPA00012345) second version"
and I want to produce a new metadata field that looks like this:
"protocol_id" : "Protocol-1234500"
should the metadata expression look like this?
parameter name = protocol_id
remove this parameter = false
expression = Protocol-${protocol_name|string(\d+)|0}
Thank you!