[jira] Commented: (SOLR-1060) a new DIH EnityProcessor allowing text file lists of files to be indexed

Fergus McMenemie (JIRA) Tue, 17 Mar 2009 14:02:15 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682802#action_12682802
 ]


Fergus McMenemie commented on SOLR-1060:
----------------------------------------

Hi,

I have the following snippet from my data-config.xml. This is after removing
all code from ChangeListEntityProcessor which deals with finding the juicy
part of the line. However, I get tracebacks whenever I start Tomcat saying
that my schema.xml has no mention of $deleteQuery. Do I have to declare a
field $deleteQuery in my schema.xml? If so, it is rather ugly!

{Mar 17, 2009 8:54:31 PM org.apache.solr.handler.dataimport.DataImporter loadDataConfig
INFO: Data Configuration loaded successfully
Mar 17, 2009 8:54:31 PM org.apache.solr.handler.dataimport.DataImportHandler inform
SEVERE: Exception while loading DataImporter
org.apache.solr.handler.dataimport.DataImportHandlerException: There are errors in the Schema
The field :$deleteQuery present in DataConfig does not have a counterpart in Solr Schema

        at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:109)
        at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:96)
        at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:388)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:571)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:122)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
        at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:223)
}


{code}
<dataConfig>
  <dataSource name="myFILEreader" type="FileDataSource"/>    
  <dataSource name="myURIreader"  type="URLDataSource" />    
    <document>
      <entity name="jc"
               processor="ChangeListEntityProcessor"
               acceptLineRegex="^.*\.xml$"
               omitLineRegex="usc2009"
               fileName="file:///Volumes/spare/ts/man-findlsurl.txt"
               rootEntity="false"
               dataSource="null"
               baseLocation="file:///Volumes/spare/ts/ford"
               transformer="RegexTransformer"
               >
        <!-- the following columns are only defined if the regex matches -->
        <field column="fileAbsolutePath"    regex="\s+([^ ]*)$" replaceWith="${jc.baseLocation}/$1"  sourceColName="rawLine"/>
        <field column="$deleteQuery"        regex="^DELETE\s+"  replaceWith="${jc.fileAbsolutePath}" sourceColName="rawLine"/>

        <entity name="x"
                dataSource="myurireader"
                processor="XPathEntityProcessor"
                url="${jc.fileAbsolutePath}"
                rootEntity="true"
                flatten="true"
                stream="false"
                forEach="/record | /record/mediaBlock"
                transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">

          <field column="fileAbsolutePath" template="${jc.fileAbsolutePath}" />
          <field column="fileWebPath"      template="${jc.fileAbsolutePath}" regex="${dataimporter.request.fordinstalldir}(.*)" replaceWith="/ford$1"/>
          <field column="fileWebDir"       regex="(.*)/.*" replaceWith="$1" sourceColName="fileWebPath"/>
{code}

> a new DIH EnityProcessor allowing text file lists of files to be indexed
> ------------------------------------------------------------------------
>
>                 Key: SOLR-1060
>                 URL: https://issues.apache.org/jira/browse/SOLR-1060
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1060.patch, SOLR-1060.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have finished a new DIH EntityProcessor. It is designed around the idea 
> that whatever daemon is used to maintain your content store, it is likely to 
> drop a report or log file explaining what has changed within your content 
> store. I wish to use this report file to control the indexing of the new or 
> changed content and the removal of old content. The report files, perhaps 
> from un-tar or un-zip, are likely to reference jpegs and directory stubs 
> which need to be ignored. I assumed a file-based content repository, but this 
> should be expanded to handle URIs as well.
> I feel that the current FileListEntityProcessor is poorly named. It should be 
> called the dirWalkEntityProcessor or dirCrawlEntityProcessor or such, and 
> this new EntityProcessor should have the name FileListEntityProcessor. 
> However, what is done is done. I then came up with ManifestEntityProcessor, 
> which I thought suited: manifest files are all over the content sets I deal 
> with and the dictionary definition seemed close enough ("ship's manifest"). 
> However, how about ChangeListEntityProcessor?
> {code}
>        <entity name="jc"
>                processor="ManifestEntityProcessor"
>                baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>                rootEntity="false"
>                dataSource="null"
>                allowRegex="^.*\.xml$"
>                blockRegex="usc2009"
>                manifestFileName="/Volumes/ts/man-find.txt"
>                docAddRegex=".*"
>                >
> {code}
> The new entity fields are as follows.
>
>    *manifestFileName* is the required location of the manifest file. If this 
> value is relative, it is assumed to be relative to baseDir.
>    *allowRegex* is an optional attribute that, if present, discards any line 
> which does not match the regex.
>    *blockRegex* is an optional attribute that is applied after any allowRegex 
> and discards any line which matches the regex.
>    *docAddRegex* is a required regex to identify lines which, when matched, 
> should cause docs to be added to the index. As well as matching the line, it 
> should also return the portion of the line which contains the filepath as 
> group(1).
>    *docDeleteRegex* is an optional regex to identify lines which, when 
> matched, should cause docs to be deleted from the index. As well as matching 
> the line, it should also return the portion of the line which contains the 
> filepath as group(1). **PLANNED**
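
The filtering order described in the quoted field list can be sketched in plain Java. This is only an illustration and is not taken from the attached patch: the class name ManifestLineFilter, the method pathToAdd, the sample manifest lines, and the docAddRegex value "^(.*)$" are all hypothetical. It also assumes that "match" means the pattern is found anywhere in the line (find() rather than matches()), and that a docAddRegex without a capture group falls back to the whole match; the real processor may behave differently on both points.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT the SOLR-1060 patch: applies the line filters
// in the order given by the attribute descriptions.
class ManifestLineFilter {
    private final Pattern allowRegex;   // optional, may be null
    private final Pattern blockRegex;   // optional, may be null
    private final Pattern docAddRegex;  // required

    ManifestLineFilter(String allowRegex, String blockRegex, String docAddRegex) {
        this.allowRegex = allowRegex == null ? null : Pattern.compile(allowRegex);
        this.blockRegex = blockRegex == null ? null : Pattern.compile(blockRegex);
        this.docAddRegex = Pattern.compile(docAddRegex);
    }

    /** Returns the file path to index, or empty if the line is discarded. */
    Optional<String> pathToAdd(String line) {
        // Assumption: a pattern "matches" if it is found anywhere in the line.
        if (allowRegex != null && !allowRegex.matcher(line).find()) {
            return Optional.empty();   // allowRegex present and not matched
        }
        if (blockRegex != null && blockRegex.matcher(line).find()) {
            return Optional.empty();   // blockRegex is applied after allowRegex
        }
        Matcher m = docAddRegex.matcher(line);
        if (!m.find()) {
            return Optional.empty();   // not an "add" line
        }
        // group(1) carries the file path; fall back to the whole match
        // when the regex defines no capture group (an assumption).
        return Optional.ofNullable(m.groupCount() >= 1 ? m.group(1) : m.group());
    }

    public static void main(String[] args) {
        ManifestLineFilter f = new ManifestLineFilter("^.*\\.xml$", "usc2009", "^(.*)$");
        String[] lines = {"a/b/doc1.xml", "a/b/photo.jpg", "usc2009/doc2.xml"};
        for (String line : lines) {
            System.out.println(line + " -> " + f.pathToAdd(line).orElse("(discarded)"));
        }
    }
}
```

With these sample lines, only the first survives all three filters: the jpeg fails allowRegex and the usc2009 line is caught by blockRegex.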

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
