[ https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683851#action_12683851 ]
Fergus McMenemie commented on SOLR-1060:
----------------------------------------

My original patch did all this in the ChangeListEntityProcessor, as an option!

However, as a separate issue, I do think we have an ambiguity in the face-value behaviour of the following code when a mismatch occurs.

{code}
<field column="$deleteDocByQuery" regex="^DELETE.*" sourceColName="rawLine"/>
<field column="$deleteDocByQuery" regex="^DELETE.*" sourceColName="rawLine" replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" />
{code}

While I do understand that under the hood one is a match and the other a replace, I think that we could enhance the existing transformer somehow to streamline its interface. After all, a new custom Transformer would just be a regex by another name.

Not sure what to do for the best.

1) I could put my optional code back into ChangeListEntityProcessor?
2) I can also get around the problem with temporary fields, but it is rather ugly:

{code}
<entity name="jc"
        processor="ChangeListEntityProcessor"
        acceptLineRegex="^.*\.xml$"
        omitLineRegex="usc2009"
        fileName="file:///Volumes/spare/ts/man-findlsurl.txt"
        rootEntity="false"
        dataSource="null"
        baseLocation="file:///Volumes/spare/ts/ford/schema/"
        transformer="RegexTransformer"
        >
  <field column="fileAbsolutePath" regex="^.*\s+([^ ]*)$" replaceWith="${jc.baseLocation}/$1" sourceColName="rawLine"/>
  <field column="dummy" regex="^DELETE.*" replaceWith="fileAbsolutePath:${jc.fileAbsolutePath}" sourceColName="rawLine"/>
  <field column="$deleteDocByQuery" regex="^fileAbsolutePath:" sourceColName="dummy"/>
  <entity name="x"
          dataSource="myURIreader"
          processor="XPathEntityProcessor"
{code}

> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
> Key: SOLR-1060
> URL: https://issues.apache.org/jira/browse/SOLR-1060
> Project: Solr
> Issue Type: New Feature
> Components: contrib - DataImportHandler
> Affects Versions: 1.4
> Reporter: Fergus McMenemie
> Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-1060.patch, SOLR-1060.patch
>
> Original Estimate: 120h
> Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea
> that whatever daemon is used to maintain your content store, it is likely to
> drop a report or log file explaining what has changed within your content
> store. I wish to use this report file to control the indexing of the new or
> changed content and the removal of old content. The report files, perhaps
> from un-tar or un-zip, are likely to reference jpegs and directory stubs
> which need to be ignored. I assumed a file-based content repository, but this
> should be expanded to handle URIs as well.
> I feel that the current FileListEntityProcessor is poorly named. It should be
> called the dirWalkEntityProcessor or dirCrawlEntityProcessor or such. And
> this new EntityProcessor should have the name FileListEntityProcessor.
> However, what is done is done. I then came up with manifestEntityProcessor,
> which I thought suited: manifest files are all over the content sets I deal
> with, and the dictionary definition seemed close enough ("ship's manifest").
> However, how about ChangeListEntityProcessor?
> {code}
> <entity name="jc"
>         processor="ManifestEntityProcessor"
>         baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>         rootEntity="false"
>         dataSource="null"
>         allowRegex="^.*\.xml$"
>         blockRegex="usc2009"
>         manifestFileName="/Volumes/ts/man-find.txt"
>         docAddRegex=".*"
>         >
> {code}
> The new entity fields are as follows.
>
> *manifestFileName* is the required location of the manifest file. If this
> value is relative, it is assumed to be relative to baseDir.
> *allowRegex* is an optional attribute that if present discards any line
> which does not match the regExp
>
> *blockRegex* is an optional attribute that is applied after any allowRegex
> and discards any line which matches the regExp
> *docAddRegex* is a required regex to identify lines which when matched
> should cause docs to be added to the index. As well as matching the line it
> should also return the portion of the line which contains the filepath as
> group(1)
> *docDeleteRegex* is an optional value of a regex to identify documents
> which when matched should be deleted from the index. As well as matching the
> line it should also return the portion of the line which contains the
> filepath as group(1) **PLANNED**

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
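
For anyone trying to picture how the four regex attributes in the quoted description interact, here is a rough Java sketch of the per-line decision. It is not code from the attached patch: the class ManifestLineFilter, its Action enum and the classify method are hypothetical names made up for illustration. It also rests on a few assumptions the description does not settle: patterns are applied with find() rather than a full match, docDeleteRegex is checked before docAddRegex (so that a catch-all docAddRegex such as ".*" does not swallow delete lines), and when a regex has no capture group the whole match is used as the path.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of the SOLR-1060 patch.
final class ManifestLineFilter {
    enum Action { ADD, DELETE, SKIP }

    static final class Result {
        final Action action;
        final String path; // null when action is SKIP

        Result(Action action, String path) {
            this.action = action;
            this.path = path;
        }
    }

    private final Pattern allow;     // optional, null when not configured
    private final Pattern block;     // optional, null when not configured
    private final Pattern docAdd;    // required
    private final Pattern docDelete; // optional (marked PLANNED in the description)

    ManifestLineFilter(String allowRegex, String blockRegex,
                       String docAddRegex, String docDeleteRegex) {
        this.allow = compileOrNull(allowRegex);
        this.block = compileOrNull(blockRegex);
        this.docAdd = Pattern.compile(docAddRegex);
        this.docDelete = compileOrNull(docDeleteRegex);
    }

    private static Pattern compileOrNull(String regex) {
        return regex == null ? null : Pattern.compile(regex);
    }

    Result classify(String line) {
        // allowRegex: discard any line that does not match.
        if (allow != null && !allow.matcher(line).find()) {
            return new Result(Action.SKIP, null);
        }
        // blockRegex: applied after allowRegex, discard any line that matches.
        if (block != null && block.matcher(line).find()) {
            return new Result(Action.SKIP, null);
        }
        // Assumption: deletes are tested before adds.
        if (docDelete != null) {
            Matcher m = docDelete.matcher(line);
            if (m.find()) {
                return new Result(Action.DELETE, pathOf(m));
            }
        }
        Matcher m = docAdd.matcher(line);
        if (m.find()) {
            return new Result(Action.ADD, pathOf(m));
        }
        return new Result(Action.SKIP, null);
    }

    // Assumption: use group(1) when the regex defines it, else the whole match.
    private static String pathOf(Matcher m) {
        return m.groupCount() >= 1 ? m.group(1) : m.group();
    }
}
```

With allowRegex "^.*\.xml$" and blockRegex "usc2009" as in the quoted config, plus made-up add and delete patterns such as "^ADD\s+(\S+)$" and "^DELETE\s+(\S+)$", a line reading "ADD a/b.xml" would come back as an add for a/b.xml, while a jpeg line or anything containing usc2009 would be skipped.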