[ 
https://issues.apache.org/jira/browse/SOLR-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804093#action_12804093
 ] 

Jan Høydahl commented on SOLR-1725:
-----------------------------------

This is very much wanted!

Would it make more sense to execute the scripts in the order they are named in
the scripts param? If I have two pipelines/chains that need to use the same
scripts but in different orders, I'm in trouble.

Isn't solr/lib/scripts a more natural location for code than solr/conf?

How would parameters be passed to each script, to allow reusable scripts
instead of hardcoded ones?

To overcome some of these limitations, why not reuse the existing pipeline
mechanism to define the chain as well, i.e. allow only one script per
processor? The order of scripts is then dictated by the order of <processor>
tags in the ProcessorChain, and we reuse the existing parameter passing logic.
A positive side effect is that you can compose a ProcessorChain with a mix and
match of Java and script based processors. Class/script instantiation needs to
be optimized, of course.
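
As a rough sketch of the ordering point (not something the current patch
supports): with one script per processor, two chains could reuse the same two
scripts in opposite orders just by reordering their <processor> entries. The
chain names, the script names (clean.js, enrich.js) and the single-valued
"script" parameter are all made up here for illustration.

```xml
<!-- Hypothetical sketch: one script per processor, ordered by tag position. -->
<updateRequestProcessorChain name="clean-then-enrich">
  <processor class="solr.ScriptUpdateProcessorFactory">
    <str name="script">clean.js</str>
  </processor>
  <processor class="solr.ScriptUpdateProcessorFactory">
    <str name="script">enrich.js</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- The same two scripts in the opposite order, in a second chain. -->
<updateRequestProcessorChain name="enrich-then-clean">
  <processor class="solr.ScriptUpdateProcessorFactory">
    <str name="script">enrich.js</str>
  </processor>
  <processor class="solr.ScriptUpdateProcessorFactory">
    <str name="script">clean.js</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With a single comma-separated scripts list sorted lexicographically, both
chains would end up running the scripts in the same order.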

Example use case: Say you have an XML input with structured data where one of
the fields is the file name of a PDF. You want to convert the PDF using a Tika
processor (hopefully to come) and then sanitize the Author metadata from the
parsed PDF. This could then look like this in solrconfig.xml:

<updateRequestProcessorChain name="xmlwithpdf">
    <processor class="solr.FileReaderProcessorFactory">
      <str name="filenameField">filename</str>
      <str name="outputField">binarydata</str>
    </processor>
    <processor class="solr.TikaProcessorFactory">
      <str name="inputField">binarydata</str>
      <str name="fmap.author">tikaauthor</str>
      <str name="fmap.content">text</str>
    </processor>
    <processor class="solr.ScriptUpdateProcessorFactory">
      <str name="script">author_sanitizer.js</str>
      <str name="inputField">tikaauthor</str>
      <str name="outputField">author</str>
      <str name="discardRegex">Microsoft Word.*|Adobe Distiller.*|PDF995.*|Unknown</str>
      <bool name="overwriteExisting">false</bool>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

> Script based UpdateRequestProcessorFactory
> ------------------------------------------
>
>                 Key: SOLR-1725
>                 URL: https://issues.apache.org/jira/browse/SOLR-1725
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Uri Boness
>         Attachments: SOLR-1725.patch, SOLR-1725.patch, SOLR-1725.patch, 
> SOLR-1725.patch
>
>
> A script based UpdateRequestProcessorFactory (uses JDK6 script engine
> support). The main goal of this plugin is to be able to configure/write
> update processors without the need to write and package Java code.
> The update request processor factory enables writing update processors in
> scripts located in the {{solr.solr.home}} directory. The factory accepts one
> (mandatory) configuration parameter named {{scripts}} which accepts a
> comma-separated list of file names. It will look for these files under the
> {{conf}} directory in solr home. When multiple scripts are defined, their
> execution order is defined by the lexicographical order of the script file
> name (so {{scriptA.js}} will be executed before {{scriptB.js}}).
> The script language is resolved based on the script file extension (that is,
> a *.js file will be treated as a JavaScript script), therefore an extension
> is mandatory.
> Each script file is expected to have one or more methods with the same
> signature as the methods in the {{UpdateRequestProcessor}} interface. It is
> *not* required to define all methods, only those that are required by the
> processing logic.
> The following variables are defined as global variables for each script:
>  * {{req}} - The SolrQueryRequest
>  * {{rsp}} - The SolrQueryResponse
>  * {{logger}} - A logger that can be used for logging purposes in the script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
