DIH transformers

Fergus McMenemie Mon, 16 Feb 2009 01:53:10 -0800

Hello.

I have been beating my head around the data-config.xml listed
at the end of this message. It breaks in a few different ways.


  1) I have bodged TemplateTransformer to allow it to return 
     when one of the variables is undefined. This ensures my
     uniqueKey is always defined. But thinking more on
     Noble's comments, there is use in having it work both ways,
     i.e. leaving the column undefined or replacing the variable
     with "". I still like my idea about using the default
     value of a Solr field from schema.xml, but I can't figure
     out how/where best to implement it.

  2) Having used TemplateTransformer to assign a value to an
     entity column, that column cannot be used in other
     TemplateTransformer operations. In my project I am
     attempting to reuse "x.fileWebPath". To fix this, the
     last line of transformRow() in TemplateTransformer.java
     needs to be replaced with the following, which as well as
     'putting' the templated string in 'row' also saves it
     into the 'resolver'.

     **originally**
      row.put(column, resolver.replaceTokens(expr));
      }

     **new**
      String columnName = map.get(DataImporter.COLUMN);
      expr=resolver.replaceTokens(expr);
      row.put(columnName, expr);
      resolverMapCopy.put(columnName, expr);
      }

     As an aside I think I ran into the issues covered by
     SOLR-993. It took a while to figure out that I could not add
     a single column name/value to the resolver. I had instead
     to add to the map that was already stored within the
     resolver.

  3) No entity column names can be used within RegexTransformer.
     I guess all the stuff that was added to TemplateTransformer
     to allow column names to be used in templates needs to be
     re-added into RegexTransformer. I am doing that now... but am
     confused by the fragment of code which copies from resolverMap
     into resolverMapCopy. As best I can see resolverMap is always
     empty; but I am barely able to follow the code! Can somebody
     explain when/why resolverMap would be populated?

     Also, I begin to understand comments made by Noble in
     SOLR-1001 about resolving "entity attributes in 
     ContextImpl.getEntityAttribute" and I guess Shalin was
     right as well. However it also seems wrong that at the
     top of every transformer we are going to repeat the
     same code to load the resolver with information about the 
     entity.

  4) Since I am reusing template output within other templates,
     the order of execution becomes important. Can I assume that
     the explicitly listed columns in an entity are processed by
     the various transformers in the order they appear within
     data-config.xml? I *think* that the list of columns within
     an entity as returned by getAllEntityFields() is actually
     an ArrayList, which I think is order dependent. Is this
     correct?

  5) Should I raise this as a single JIRA issue?

  6) Having played with this stuff, I was going to add a bit
     more to the wiki highlighting some of the possibilities
     and issues with transformers. But want to check with the 
     list first!
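
A minimal standalone sketch of the behaviour described in (2) and (4):
templates are applied in declaration order, and each result is stored both
in the row and in the map used for token lookup, so a later template can
refer to an earlier column. This is not the DIH code. The class and method
names (TemplateChainSketch, replaceTokens, applyTemplates) and the sample
path are made up for illustration, and undefined variables are simply
replaced with "" here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the DIH classes; none of these names are Solr's API.
class TemplateChainSketch {

    // Matches ${name} placeholders, e.g. ${x.fileAbsolutePath}.
    private static final Pattern TOKEN = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${name} with its value from 'known'; undefined names become "".
    static String replaceTokens(String template, Map<String, String> known) {
        Matcher m = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = known.getOrDefault(m.group(1), "");
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Apply templates in the iteration order of 'templates'. Each result goes
    // into the row and is also published as "<entity>.<column>" in 'known',
    // so that templates processed later can reference it.
    static Map<String, String> applyTemplates(String entity,
                                              Map<String, String> templates,
                                              Map<String, String> known) {
        Map<String, String> row = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : templates.entrySet()) {
            String value = replaceTokens(field.getValue(), known);
            row.put(field.getKey(), value);
            known.put(entity + "." + field.getKey(), value);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> known = new LinkedHashMap<>();
        known.put("jc.fileAbsolutePath", "/content/a.xml"); // sample value

        // LinkedHashMap preserves insertion order, standing in for the
        // order of <field> elements in data-config.xml.
        Map<String, String> templates = new LinkedHashMap<>();
        templates.put("fileAbsolutePath", "${jc.fileAbsolutePath}");
        templates.put("vdkvgwkey", "${x.fileAbsolutePath}#1");

        System.out.println(applyTemplates("x", templates, known));
    }
}
```

If the two templates were declared the other way round, vdkvgwkey would
resolve ${x.fileAbsolutePath} to "" because the column would not yet have
been published, which is why the processing order matters.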


   <dataConfig>
     <dataSource name="myfilereader" type="FileDataSource"/>
     <document>
       <entity name="jc"
               processor="FileListEntityProcessor"
               fileName="^.*\.xml$"
               newerThan="'NOW-1000DAYS'"
               recursive="true"
               rootEntity="false"
               dataSource="null"
               baseDir="/Volumes/spare/ts/solr/content">
         <entity name="x"
                 dataSource="myfilereader"
                 processor="XPathEntityProcessor"
                 url="${jc.fileAbsolutePath}"
                 rootEntity="true"
                 stream="false"
                 forEach="/record | /record/mediaBlock"
                 transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">

           <field column="fileAbsolutePath" template="${jc.fileAbsolutePath}" />
           <field column="fileWebPath"      regex="${x.test}(.*)" replaceWith="/ford$1" sourceColName="fileAbsolutePath"/>
           <field column="title"            xpath="/record/title" />
           <field column="para1" name="para" xpath="/record/sect1/para" />
           <field column="para2" name="para" xpath="/record/list/listitem/para" />
           <field column="pubdate"          xpath="/record/metadata/da...@qualifier='pubDate']" dateTimeFormat="yyyyMMdd" />

           <field column="vurl"             xpath="/record/mediaBlock/mediaObject/@vurl" />
           <field column="imgSrcArticle"    template="${dataimporter.request.fordinstalldir}" />
           <field column="imgCpation"       xpath="/record/mediaBlock/caption" />

           <field column="test"             template="${dataimporter.request.contentinstalldir}" />
           <!-- **problem is that vurl is just a fragment of the info needed to access the picture. -->
           <field column="imgWebPathICON"   regex="(.*)/.*" replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/>
           <field column="imgWebPathFULL"   regex="(.*)/.*" replaceWith="$1/imagery/${x.vurl}.jpg" sourceColName="fileWebPath"/>
           <field column="vdkvgwkey"        template="${jc.fileAbsolutePath}#${x.vurl}" />
         </entity>
       </entity>
     </document>
   </dataConfig>
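
For the fileWebPath and imgWebPathICON fields in the config above, this is
a rough standalone sketch of what resolving variables inside "regex" and
"replaceWith" could look like. It is not RegexTransformer's code. The names
RegexResolveSketch, resolve and regexReplace are hypothetical, as are the
sample values "news/a.xml" and "pic01". It also assumes that a variable's
value should be treated as literal text rather than as regex syntax, which
is a design choice and not something DIH is known to do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; these names are not part of Solr's DIH API.
class RegexResolveSketch {

    // Matches ${name} placeholders but not plain group references like $1.
    private static final Pattern TOKEN = Pattern.compile("\\$\\{([^}]+)\\}");

    // Substitute ${name} tokens, passing each value through 'escape' first.
    // Undefined names resolve to "" (an assumption for this sketch).
    static String resolve(String text, Map<String, String> known,
                          UnaryOperator<String> escape) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = escape.apply(known.getOrDefault(m.group(1), ""));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Resolve variables in both attributes, then run the regex replacement.
    static String regexReplace(String source, String regex, String replaceWith,
                               Map<String, String> known) {
        // Values placed in the pattern are quoted so they match literally.
        String pattern = resolve(regex, known, Pattern::quote);
        // Values placed in the replacement are escaped; $1 is left intact.
        String replacement = resolve(replaceWith, known, Matcher::quoteReplacement);
        return Pattern.compile(pattern).matcher(source).replaceAll(replacement);
    }

    public static void main(String[] args) {
        Map<String, String> known = new HashMap<>();
        known.put("x.test", "/Volumes/spare/ts/solr/content"); // baseDir from the config
        known.put("x.vurl", "pic01");                          // sample value

        String abs = "/Volumes/spare/ts/solr/content/news/a.xml"; // sample file
        String web = regexReplace(abs, "${x.test}(.*)", "/ford$1", known);
        String icon = regexReplace(web, "(.*)/.*",
                                   "$1/imagery/${x.vurl}s.jpg", known);
        System.out.println(web);
        System.out.println(icon);
    }
}
```

With these sample values the two printed lines are /ford/news/a.xml and
/ford/news/imagery/pic01s.jpg. Note that "x.test" has to be known before
fileWebPath is processed, whereas in the config above the "test" column is
declared after it.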

Regards Fergus.

-- 

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
