[ 
https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688280#action_12688280
 ] 

Fergus McMenemie commented on SOLR-1060:
----------------------------------------

Downloaded your version of my patch. Thanks for taking a look at it and making 
the improvements.

However, I still can't get things to work. My solr-data.xml is now as follows:
{code}
     <entity name="single-delete"
             dataSource="myURIreader"
             processor="XPathEntityProcessor"
             url="${dataimporter.request.single}"
             rootEntity="true"
             flatten="true"
             stream="false"
             forEach="/record | /record/mediaBlock"
             transformer="TemplateTransformer">

       <field column="fileAbsolutePath"  template="${dataimporter.request.single}" />
       <field column="$deleteDocByQuery" template="fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}" />
       <field column="vdkvgwkey"         template="${dataimporter.request.single}" />
     </entity>

{code}

But an attempt to delete a document produces the following:
{code}
Mar 23, 2009 12:45:42 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-1.4-dev path=/dataimport params={command=full-import&clean=false&entity=single-delete&commit=true&single=file:///Volumes/spare/ts/janes/schema/janesxml/data/news/jdw/jdw2008/jni71796.xml} status=0 QTime=1
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.TemplateTransformer transformRow
WARNING: Unable to resolve variable: dataimporter.functions.escapeQueryChars(dataimporter.request.single) while parsing expression: fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}
Mar 23, 2009 12:45:42 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
        commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_3,version=1237809265075,generation=3,filenames=[_5.nrm, _5.tii, _5.tis, _5.fdx, _5.prx, _5.fdt, _5.fnm, segments_3, _5.frq]
Mar 23, 2009 12:45:42 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1237809265075
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.TemplateTransformer transformRow
WARNING: Unable to resolve variable: dataimporter.functions.escapeQueryChars(dataimporter.request.single) while parsing expression: fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.TemplateTransformer transformRow
WARNING: Unable to resolve variable: dataimporter.functions.escapeQueryChars(dataimporter.request.single) while parsing expression: fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}
Mar 23, 2009 12:45:42 PM org.apache.solr.handler.dataimport.DocBuilder commit
INFO: Full Import completed successfully
Mar 23, 2009 12:45:42 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true)
{code}
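
For reference, what I am hoping the escapeQueryChars call in the template will do is backslash-escape any characters that the query parser treats as syntax, so that the file URI can be used as a literal value in the $deleteDocByQuery field. A minimal standalone sketch of that idea in Java is below. The class name QueryEscapeSketch, the example path and the exact set of escaped characters are my assumptions for illustration only; this is not the DIH implementation, which may well escape a different set.

```java
// Standalone sketch, NOT Solr code: shows the kind of escaping expected
// from an escapeQueryChars-style function.
class QueryEscapeSketch {
    // Assumed set of query-syntax characters (hypothetical, for illustration).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;";

    // Prefix every assumed-special character and every whitespace
    // character with a backslash; copy all other characters unchanged.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical file URI standing in for the "single" request parameter.
        String single = "file:///Volumes/spare/example.xml";
        System.out.println("fileAbsolutePath:" + escape(single));
    }
}
```

With escaping of this kind, the colon in the "file:" scheme would no longer be read as a field separator in the delete query.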

> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
>                 Key: SOLR-1060
>                 URL: https://issues.apache.org/jira/browse/SOLR-1060
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: regex-fix.patch, SOLR-1060.patch, SOLR-1060.patch, 
> SOLR-1060.patch, SOLR-1060.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea 
> that whatever daemon is used to maintain your content store, it is likely to 
> drop a report or log file explaining what has changed within your content 
> store. I wish to use this report file to control the indexing of the new or 
> changed content and the removal of old content. The report files, perhaps 
> from un-tar or un-zip, are likely to reference jpegs and directory stubs 
> which need to be ignored. I assumed a file-based content repository, but this 
> should be expanded to handle URIs as well.
>
> I feel that the current FileListEntityProcessor is poorly named. It should be 
> called the dirWalkEntityProcessor or dirCrawlEntityProcessor or such, and 
> this new EntityProcessor should have the name FileListEntityProcessor. 
> However, what is done is done. I then came up with ManifestEntityProcessor, 
> which I thought suited: manifest files are all over the content sets I deal 
> with and the dictionary definition seemed close enough ("ship's manifest"). 
> However, how about ChangeListEntityProcessor?
> {code}
>        <entity name="jc"
>                processor="ManifestEntityProcessor"
>                baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>                rootEntity="false"
>                dataSource="null"
>                allowRegex="^.*\.xml$"
>                blockRegex="usc2009"
>                manifestFileName="/Volumes/ts/man-find.txt"
>                docAddRegex=".*"
>                >
> {code}
> The new entity attributes are as follows.
>
>    *manifestFileName* is the required location of the manifest file. If this 
> value is relative, it is assumed to be relative to baseDir.
>
>    *allowRegex* is an optional attribute that, if present, discards any line 
> which does not match the regex.
>
>    *blockRegex* is an optional attribute that is applied after any allowRegex 
> and discards any line which matches the regex.
>
>    *docAddRegex* is a required regex to identify lines which, when matched, 
> should cause docs to be added to the index. As well as matching the line, it 
> should also return the portion of the line which contains the filepath as 
> group(1).
>
>    *docDeleteRegex* is an optional regex to identify lines which, when 
> matched, should cause docs to be deleted from the index. As well as matching 
> the line, it should also return the portion of the line which contains the 
> filepath as group(1). **PLANNED**
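
The order in which the attributes above are applied to each manifest line can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patch: the class name ManifestLineSketch, the method pathToAdd, the use of find() rather than a whole-line match, and the "x <path>" manifest line format in the example are all hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and matching semantics are assumptions.
class ManifestLineSketch {
    private final Pattern allow;  // optional, may be null
    private final Pattern block;  // optional, may be null
    private final Pattern docAdd; // required

    ManifestLineSketch(String allowRegex, String blockRegex, String docAddRegex) {
        this.allow = allowRegex == null ? null : Pattern.compile(allowRegex);
        this.block = blockRegex == null ? null : Pattern.compile(blockRegex);
        this.docAdd = Pattern.compile(docAddRegex);
    }

    // Returns the file path to index, or empty if the line is discarded.
    Optional<String> pathToAdd(String line) {
        // 1. allowRegex: discard lines that do not match.
        if (allow != null && !allow.matcher(line).find()) {
            return Optional.empty();
        }
        // 2. blockRegex: applied after allowRegex, discard lines that match.
        if (block != null && block.matcher(line).find()) {
            return Optional.empty();
        }
        // 3. docAddRegex: must match and expose the file path as group(1).
        Matcher m = docAdd.matcher(line);
        if (m.find() && m.groupCount() >= 1 && m.group(1) != null) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical manifest lines in an "x <path>" style.
        ManifestLineSketch f =
            new ManifestLineSketch("^.*\\.xml$", "usc2009", "^x (.*)$");
        System.out.println(f.pathToAdd("x data/a.xml"));    // kept
        System.out.println(f.pathToAdd("x data/a.jpg"));    // fails allowRegex
        System.out.println(f.pathToAdd("x usc2009/b.xml")); // hit by blockRegex
    }
}
```

Note that a docAddRegex with no capture group, such as ".*", yields no path in this sketch, since there is no group(1) to return.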

