[jira] Updated: (SOLR-1060) a new DIH EnityProcessor allowing text file lists of files to be indexed

Fergus McMenemie (JIRA) Tue, 10 Mar 2009 11:47:15 -0700

     [ https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Fergus McMenemie updated SOLR-1060:
-----------------------------------

    Description: 
I have finished a new DIH EntityProcessor. It is designed around the idea that 
whatever daemon is used to maintain your content store is likely to drop a 
report or log file explaining what has changed within that content store. I 
wish to use this report file to control the indexing of new or changed 
content and the removal of old content. The report files, perhaps from un-tar 
or un-zip, are likely to reference JPEGs and directory stubs, which need to be 
ignored. I assumed a file-based content repository, but this should be expanded 
to handle URIs as well.

I feel that the current FileListEntityProcessor is poorly named. It should be 
called dirWalkEntityProcessor or dirCrawlEntityProcessor or some such, and this 
new EntityProcessor should have the name FileListEntityProcessor. However, what 
is done is done. I then came up with ManifestEntityProcessor, which I thought 
suited: manifest files are all over the content sets I deal with, and the 
dictionary definition seemed close enough ("ship's manifest"). However, how 
about ChangeListEntityProcessor?
{code}
       <entity name="jc"
               processor="ManifestEntityProcessor"
               baseDir="/Volumes/Techmore/ts/aaa/schema/data"
               rootEntity="false"
               dataSource="null"

               allowRegex="^.*\.xml$"
               blockRegex="usc2009"
               manifestFileName="/Volumes/ts/man-find.txt"
               manifestAddRegex=".*"
               >
{code}

The new entity fields are as follows.
 
   *manifestFileName* is the required location of the manifest file. If this 
value is relative, it is assumed to be relative to baseDir.

   *allowRegex* is an optional attribute that, if present, discards any line 
that does not match the regex.
 
   *blockRegex* is an optional attribute that is applied after any allowRegex 
and discards any line that matches the regex.

   *docAddRegex* is a required regex identifying lines that, when matched, 
should cause docs to be added to the index. As well as matching the line, it 
should return the portion of the line containing the file path as group(1).

   *docDeleteRegex* is an optional regex identifying documents that, when 
matched, should be deleted from the index. As well as matching the line, it 
should return the portion of the line containing the file path as group(1). 
**PLANNED**
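
The filtering order described above can be sketched roughly as follows. This is 
only an illustration under stated assumptions, not the code attached to this 
issue: the class ManifestLineFilter and its method names are hypothetical, the 
regexes are assumed to be applied with java.util.regex find() semantics, and a 
line whose docAddRegex match yields no group(1) is assumed to be skipped.

```java
import java.io.File;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch of the per-line manifest filtering described above.
 * Names and matching semantics are assumptions, not the SOLR-1060 patch.
 */
class ManifestLineFilter {
    private final Pattern allow;   // optional, may be null
    private final Pattern block;   // optional, may be null
    private final Pattern docAdd;  // required, must capture the path as group(1)

    ManifestLineFilter(String allowRegex, String blockRegex, String docAddRegex) {
        this.allow = allowRegex == null ? null : Pattern.compile(allowRegex);
        this.block = blockRegex == null ? null : Pattern.compile(blockRegex);
        this.docAdd = Pattern.compile(Objects.requireNonNull(docAddRegex, "docAddRegex is required"));
    }

    /** A relative manifestFileName is taken to be relative to baseDir. */
    static File resolveManifest(String baseDir, String manifestFileName) {
        File f = new File(manifestFileName);
        return f.isAbsolute() ? f : new File(baseDir, manifestFileName);
    }

    /** Returns the file path to index for this line, or null if the line is discarded. */
    String pathToAdd(String line) {
        // allowRegex runs first: drop lines that do not match it.
        if (allow != null && !allow.matcher(line).find()) {
            return null;
        }
        // blockRegex runs after allowRegex: drop lines that do match it.
        if (block != null && block.matcher(line).find()) {
            return null;
        }
        // docAddRegex must match and expose the file path as group(1).
        Matcher m = docAdd.matcher(line);
        if (!m.find() || m.groupCount() < 1) {
            return null;
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        // Illustrative values only; "^(.*)$" treats the whole line as the path.
        ManifestLineFilter filter = new ManifestLineFilter("^.*\\.xml$", "usc2009", "^(.*)$");
        String[] lines = {"docs/a.xml", "docs/pic.jpg", "docs/usc2009/b.xml", "docs/"};
        for (String line : lines) {
            System.out.println(line + " -> " + filter.pathToAdd(line));
        }
    }
}
```

With these illustrative settings only docs/a.xml survives: the jpeg and the 
directory stub fail allowRegex, and the usc2009 file is dropped by blockRegex. 
A docDeleteRegex, once implemented, could presumably follow the same pattern 
with a second method.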



  was:
I have finished a new DIH EntityProcessor. It is designed around the idea that 
whatever daemon is used to maintain your content store is likely to drop a 
report or log file explaining what has changed within that content store. I 
wish to use this report file to control the indexing of new or changed 
content and the removal of old content. The report files, perhaps from un-tar 
or un-zip, are likely to reference JPEGs and directory stubs, which need to be 
ignored. I assumed a file-based content repository, but this should be expanded 
to handle URIs as well.

I feel that the current FileListEntityProcessor is poorly named. It should be 
called dirWalkEntityProcessor or dirCrawlEntityProcessor or some such, and this 
new EntityProcessor should have the name FileListEntityProcessor. However, what 
is done is done. I then came up with ManifestEntityProcessor, which I thought 
suited: manifest files are all over the content sets I deal with, and the 
dictionary definition seemed close enough ("ship's manifest"). However, how 
about ChangeListEntityProcessor?
{code}
       <entity name="jc"
               processor="ManifestEntityProcessor"
               baseDir="/Volumes/Techmore/ts/aaa/schema/data"
               rootEntity="false"
               dataSource="null"

               allowRegex="^.*\.xml$"
               blockRegex="usc2009"
               manifestFileName="/Volumes/ts/man-find.txt"
               manifestAddRegex=".*"
               >
{code}

The new entity fields are as follows.
 
   manifestFileName is the required location of the manifest file. If this 
value is relative, it is assumed to be relative to baseDir.

   allowRegex is an optional attribute that, if present, discards any line 
that does not match the regex.
 
   blockRegex is an optional attribute that is applied after any allowRegex 
and discards any line that matches the regex.

   docAddRegex is a required regex identifying lines that, when matched, 
should cause docs to be added to the index. As well as matching the line, it 
should return the portion of the line containing the file path as group(1).

   docDeleteRegex is an optional regex identifying documents that, when 
matched, should be deleted from the index. As well as matching the line, it 
should return the portion of the line containing the file path as group(1). 
**PLANNED**




> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
>                 Key: SOLR-1060
>                 URL: https://issues.apache.org/jira/browse/SOLR-1060
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>             Fix For: 1.4
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
