[jira] Updated: (SOLR-1060) a new DIH EnityProcessor allowing text file lists of files to be indexed

Fergus McMenemie (JIRA) Mon, 16 Mar 2009 07:06:13 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Fergus McMenemie updated SOLR-1060:
-----------------------------------

    Attachment: SOLR-1060.patch

I have rewritten and tested the ChangeListEntityProcessor so that it supports 
URLs. This allows the list of changes to be fetched from a local file, a simple 
URL or any RESTful web service. The change list must contain one change per 
line. Each line can contain an absolute file:/// or http:// pathname, or a 
relative pathname. The entity attribute baseLocation specifies a prefix to be 
used with relative pathnames. baseLocation must be a valid URL; file:/// or 
http:// for the moment. The entity attributes are as follows:

 * *fileName* is the required URL location of the change list. If this value is 
relative, it is assumed to be relative to baseLocation.
 * *acceptLineRegex* is an optional attribute that, if present, discards any 
line read from the change list which does not match the regex.
 * *omitLineRegex* is an optional attribute that is applied after any 
acceptLineRegex and discards any line read from the change list which matches 
the regex.
 * *docAddRegex* is an optional regex to identify lines which, when matched, 
should cause docs to be added to the index. As well as matching the line it 
should also return the portion of the line which is to be treated as the 
pathname, as group(1). If not specified, the whole line is assumed to be a valid 
pathname.
 * *docDeleteRegex* is an optional regex to identify documents which, when 
matched, should be deleted from the index. As well as matching the line it 
should also return the portion of the line which contains the filepath as 
group(1). *PLANNED WORK see SOLR-1059*
 * *baseLocation* is a required prefix added to fileName, or to lines read from 
the change list, which do not appear to be absolute http:// or file:/// URLs.
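
To make the order of these checks concrete, here is a minimal Java sketch of how a single line from the change list might be handled. It is not the code in SOLR-1060.patch: the class name ChangeListLineSketch and the method toAbsolutePath are hypothetical, the use of find() rather than matches() is an assumption based on the sample regexes, and skipping a line that fails docAddRegex is also an assumption.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the class shipped in the patch. */
public class ChangeListLineSketch {
    private final Pattern accept;  // acceptLineRegex, null when not configured
    private final Pattern omit;    // omitLineRegex, null when not configured
    private final Pattern docAdd;  // docAddRegex, null when not configured
    private final URI base;        // baseLocation, must be a valid URL

    public ChangeListLineSketch(String acceptLineRegex, String omitLineRegex,
                                String docAddRegex, String baseLocation) {
        this.accept = acceptLineRegex == null ? null : Pattern.compile(acceptLineRegex);
        this.omit = omitLineRegex == null ? null : Pattern.compile(omitLineRegex);
        this.docAdd = docAddRegex == null ? null : Pattern.compile(docAddRegex);
        this.base = URI.create(baseLocation);
    }

    /** Returns the absolute location for a line, or empty if the line is discarded. */
    public Optional<String> toAbsolutePath(String line) {
        // 1. acceptLineRegex: discard lines that do not match (assumed find() semantics).
        if (accept != null && !accept.matcher(line).find()) {
            return Optional.empty();
        }
        // 2. omitLineRegex: applied after acceptLineRegex, discards lines that match.
        if (omit != null && omit.matcher(line).find()) {
            return Optional.empty();
        }
        // 3. docAddRegex: group(1) is the pathname; without it the whole line is used.
        String path = line;
        if (docAdd != null) {
            Matcher m = docAdd.matcher(line);
            if (!m.find()) {
                return Optional.empty(); // assumption: unmatched lines are skipped
            }
            path = m.group(1);
        }
        // 4. Absolute file:/// and http:// locations are kept; others get baseLocation.
        if (path.startsWith("http://") || path.startsWith("file:///")) {
            return Optional.of(path);
        }
        return Optional.of(base.resolve(path).toString());
    }
}
```

With the attribute values from my sample entity, a line such as "M  docs/a.xml" would come out as http://localhost/ford/docs/a.xml, while a line ending in .jpg or containing usc2009 would be dropped.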

Here is a sample of the way I used it:-
{code}
       <entity name="jc"
               processor="ChangeListEntityProcessor"
               acceptLineRegex="^.*\.xml$"
               omitLineRegex="usc2009"
               fileName="file:///Volumes/ts/man-findlsurl.txt"
               rootEntity="false"
               dataSource="null"
               baseLocation="http://localhost/ford/"
               docAddRegex="\s+([^ ]*)$"
               >
{code}

This entity returns a row containing a single "fileAbsolutePath" field for each 
pathname accepted from the change list. If docDeleteRegex was matched then 
further fields will also be returned: $deleteDocId=?? and $deleteDocQuery=??. 
What do I need to set these values to?

I have also created a URLDataSource, and it seems to work. However "an expert" 
had better review what I have done; I am still very inexperienced with Java best 
practice. On that topic: why did we not rename the existing httpDataSource to 
URLDataSource and then make httpDataSource a wrapper for URLDataSource?
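
For reviewers, the core idea of reading file:/// and http:// locations through one code path can be sketched as below. This is only an illustration built on java.net.URL; it is not the URLDataSource in the patch and does not implement the DIH DataSource API. The class name UrlReaderSketch, the method open and the fixed UTF-8 encoding are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not the URLDataSource from the patch. */
public class UrlReaderSketch {

    /**
     * Opens a line-oriented reader on any location that java.net.URL
     * understands, for example file:/// or http://.
     */
    public static BufferedReader open(String location) throws IOException {
        URL url = URI.create(location).toURL();
        // Assumption: the content is UTF-8 text; a real data source would
        // probably make the encoding configurable.
        return new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
    }
}
```

Because the scheme is handled by java.net.URL, the caller does not need to know whether the change list lives on disk or behind a web service.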

Testing with my sample of 40000 documents reveals no noticeable slowdown 
compared with FileListEntityProcessor.

> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
>                 Key: SOLR-1060
>                 URL: https://issues.apache.org/jira/browse/SOLR-1060
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>             Fix For: 1.4
>
>         Attachments: SOLR-1060.patch, SOLR-1060.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea 
> that whatever daemon is used to maintain your content store, it is likely to 
> drop a report or log file explaining what has changed within your content 
> store. I wish to use this report file to control the indexing of the new or 
> changed content and the removal of old content. The report files, perhaps 
> from un-tar or un-zip, are likely to reference jpegs and directory stubs 
> which need to be ignored. I assumed a file based content repository but this 
> should be expanded to handle URIs as well.
> I feel that the current FileListEntityProcessor is poorly named. It should be 
> called the dirWalkEntityProcessor or dirCrawlEntityProcessor or such. And 
> this new EntityProcessor should have the name FileListEntityProcessor. 
> However what is done is done. I then came up with ManifestEntityProcessor, 
> which I thought suited; manifest files are all over the content sets I deal 
> with and the dictionary definition seemed close enough ("ship's manifest"). 
> However, how about ChangeListEntityProcessor?
> {code}
>        <entity name="jc"
>                processor="ManifestEntityProcessor"
>                baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>                rootEntity="false"
>                dataSource="null"
>                allowRegex="^.*\.xml$"
>                blockRegex="usc2009"
>                manifestFileName="/Volumes/ts/man-find.txt"
>                docAddRegex=".*"
>                >
> {code}
> The new entity fields are as follows.
>  
>    *manifestFileName* is the required location of the manifest file. If this 
> value is relative, it is assumed to be relative to baseDir.
>    *allowRegex* is an optional attribute that, if present, discards any line 
> which does not match the regex.
>    *blockRegex* is an optional attribute that is applied after any allowRegex 
> and discards any line which matches the regex.
>    *docAddRegex* is a required regex to identify lines which, when matched, 
> should cause docs to be added to the index. As well as matching the line it 
> should also return the portion of the line which contains the filepath as 
> group(1).
>    *docDeleteRegex* is an optional regex to identify documents which, when 
> matched, should be deleted from the index. As well as matching the line it 
> should also return the portion of the line which contains the filepath as 
> group(1). **PLANNED**

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
