[jira] Commented: (SOLR-1060) a new DIH EnityProcessor allowing text file lists of files to be indexed

Fergus McMenemie (JIRA) Fri, 13 Mar 2009 02:19:21 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681649#action_12681649
 ]


Fergus McMenemie commented on SOLR-1060:
----------------------------------------

Oh dear, this is getting complicated!

"My concern is that we have two data sources whose names identify their 
respective functionality. With this change FileDataSource becomes redundant and 
HttpDataSource does not give the impression that it can read files too. I 
assume that everyone will be generating the changeset using their own sweet 
tools/programs. Therefore it is a simple task for the changeset generator to 
generate http/file separately or mark them differently. Then one can use 
different root entities."

Hmmmm, no I am not sure about this. 
# Firstly I agree "FileDataSource becomes redundant and HttpDataSource .. can 
read files"; bit of a mess really. Ideally I think we need a new dataSource 
that can read from either a FileSystem or a URI. 
# I, the poor old content indexer, am often presented with the manifest as a 
fait accompli. It comes as part of the update kit, and I have little or no 
control over its format. I would have to organise some middleware to sort out 
its format if we restrict DIH, which would be a pity since the proposed changes 
should allow Solr to directly handle every case I have seen, and I suspect that 
is well over 80% of the use cases.
# Even if the lines read from the changelist are simple filepaths, how we 
access those files will depend on other factors. They could be on a local or 
remote machine. The lines read from the file will not indicate this. As Noble 
implies, we may not know this ahead of time, so we need to be able to pass 
parameters into the system which supply that information.

<thinking out loud>
# We need to be able to *read lines* describing changes we *may* wish to make 
to our index from a file:// or a restful web service or URL.
# The lines read will need to be analysed for two purposes: a) to identify the 
portion of the line we are interested in, and b) to reformat that portion such 
that it can be passed to the child entity, which will in turn pass it to a 
dataSource.
# We do not know which dataSource the child entity may be using, which makes 
the reformatting stage 2b) a bit more tricky. Hence the required cooperation.

1) and 2a) could be done by ChangeListEntityProcessor (as Noble says, we need an 
EntityProcessor because it is generating data... without even a datasource!)
2b) could be done by a transformer; information will need to be available to 
the transformer to allow it to deal with local or remote access. 
3)?????
</thinking out loud>

For the moment I was intending to build 1), 2a) and 2b) into the 
ChangeListEntityProcessor; that does not appear to be too bad. Once that is 
done, perhaps we can look again at whether 2b) needs lifting into a separate 
EntityProcessor or Transformer.
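
To make 1), 2a) and 2b) a little more concrete, here is a rough Java sketch. It 
is not the code in the attached patch. The class name ChangeListSketch, the 
"A <path>" line format and the "base" parameter are all made up for 
illustration, and the lines are assumed to have already been read (step 1) from 
wherever they live. The base stands in for the information that would have to be 
passed in to say whether access is local or remote.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch only; not part of DIH and not the code in SOLR-1060.patch. */
class ChangeListSketch {

    /**
     * Step 2a: return group(1) of the first match, or null if the line does
     * not match. The regex must define a capturing group 1, otherwise
     * Matcher.group(1) throws IndexOutOfBoundsException.
     */
    static String extractPath(Pattern docAddRegex, String line) {
        Matcher m = docAddRegex.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Step 2b: join a caller-supplied base (a directory or an http URL prefix,
     * both assumptions) and the extracted path with exactly one slash.
     * No URL encoding is attempted here.
     */
    static String toLocation(String base, String path) {
        String b = base.endsWith("/") ? base : base + "/";
        String p = path.startsWith("/") ? path.substring(1) : path;
        return b + p;
    }

    /** Steps 2a and 2b applied to lines that were already read in step 1. */
    static List<String> process(List<String> lines, Pattern docAddRegex, String base) {
        List<String> locations = new ArrayList<>();
        for (String line : lines) {
            String path = extractPath(docAddRegex, line);
            if (path != null) {
                locations.add(toLocation(base, path));
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        // Assumed line format: "A <path>" marks an added or changed file.
        Pattern docAdd = Pattern.compile("^A\\s+(\\S+\\.xml)$");
        List<String> lines = Arrays.asList("A docs/a.xml", "A img/pic.jpg", "D docs/b.xml");
        for (String location : process(lines, docAdd, "http://example.com/content")) {
            System.out.println(location);
        }
    }
}
```

Passing the base in separately is only one way of supplying the local-or-remote 
information; the same job could sit in a Transformer as suggested in 2b).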

> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
>                 Key: SOLR-1060
>                 URL: https://issues.apache.org/jira/browse/SOLR-1060
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>             Fix For: 1.4
>
>         Attachments: SOLR-1060.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea 
> that whatever daemon is used to maintain your content store, it is likely to 
> drop a report or log file explaining what has changed within your content 
> store. I wish to use this report file to control the indexing of the new or 
> changed content and the removal of old content. The report files, perhaps 
> from un-tar or un-zip, are likely to reference jpegs and directory stubs 
> which need to be ignored. I assumed a file-based content repository, but this 
> should be expanded to handle URIs as well.
> I feel that the current FileListEntityProcessor is poorly named. It should be 
> called the dirWalkEntityProcessor or dirCrawlEntityProcessor or such, and 
> this new EntityProcessor should have the name FileListEntityProcessor. 
> However, what is done is done. I then came up with ManifestEntityProcessor, 
> which I thought suited: manifest files are all over the content sets I deal 
> with and the dictionary definition seemed close enough ("ship's manifest"). 
> However, how about ChangeListEntityProcessor?
> {code}
>        <entity name="jc"
>                processor="ManifestEntityProcessor"
>                baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>                rootEntity="false"
>                dataSource="null"
>                allowRegex="^.*\.xml$"
>                blockRegex="usc2009"
>                manifestFileName="/Volumes/ts/man-find.txt"
>                docAddRegex=".*"
>                >
> {code}
> The new entity fields are as follows.
>  
>    *manifestFileName* is the required location of the manifest file. If this 
> value is relative, it is assumed to be relative to baseDir.
>    *allowRegex* is an optional attribute that, if present, discards any line 
> which does not match the regex.
>    *blockRegex* is an optional attribute that is applied after any allowRegex 
> and discards any line which matches the regex.
>    *docAddRegex* is a required regex to identify lines which, when matched, 
> should cause docs to be added to the index. As well as matching the line, it 
> should also return the portion of the line which contains the filepath as 
> group(1).
>    *docDeleteRegex* is an optional regex to identify lines which, when 
> matched, should cause docs to be deleted from the index. As well as matching 
> the line, it should also return the portion of the line which contains the 
> filepath as group(1). **PLANNED**
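
As a rough illustration of how the regex attributes described above might 
combine for a single manifest line, here is a minimal Java sketch. It is not 
the code in the attached patch. The class ManifestLineSketch, the Action enum 
and the "A "/"D " line prefixes are hypothetical. It also assumes that "match" 
means a partial match (Matcher.find), which is what an unanchored blockRegex 
such as "usc2009" suggests, and that docAddRegex is tried before docDeleteRegex.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch only; not the code in SOLR-1060.patch. */
class ManifestLineSketch {

    enum Action { ADD, DELETE, SKIP }

    /**
     * Decide what to do with one manifest line.
     * allowRegex, blockRegex and docDeleteRegex are optional and may be null;
     * docAddRegex is required.
     */
    static Action classify(String line, Pattern allowRegex, Pattern blockRegex,
                           Pattern docAddRegex, Pattern docDeleteRegex) {
        // allowRegex: discard any line that does not match.
        if (allowRegex != null && !allowRegex.matcher(line).find()) {
            return Action.SKIP;
        }
        // blockRegex: applied after allowRegex, discard any line that matches.
        if (blockRegex != null && blockRegex.matcher(line).find()) {
            return Action.SKIP;
        }
        // Assumed order: lines that match docAddRegex are adds ...
        if (docAddRegex.matcher(line).find()) {
            return Action.ADD;
        }
        // ... then lines that match the (planned) docDeleteRegex are deletes.
        if (docDeleteRegex != null && docDeleteRegex.matcher(line).find()) {
            return Action.DELETE;
        }
        return Action.SKIP;
    }

    public static void main(String[] args) {
        Pattern allow = Pattern.compile("^.*\\.xml$");
        Pattern block = Pattern.compile("usc2009");
        Pattern add = Pattern.compile("^A (.*)$");    // assumed line format
        Pattern delete = Pattern.compile("^D (.*)$"); // assumed line format
        String[] lines = {
            "A docs/a.xml", "A img/pic.jpg", "A usc2009/a.xml", "D docs/old.xml"
        };
        for (String line : lines) {
            System.out.println(line + " -> " + classify(line, allow, block, add, delete));
        }
    }
}
```

The filepath itself would then come from group(1) of whichever regex matched.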

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
