Hi All,

I did some research on this and found a couple of alternatives that might
suit my use case. Please share your thoughts.

Can I update all documents indexed by a /dataimport request by using the
last_index_time in dataimport.properties?
If so, can anyone please give me some pointers?
What I currently have in mind is something like the following:

1. Store the indexing timestamp of the document as a field, e.g.:
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

2. Read the last_index_time from the dataimport.properties

3. Query the IDs of all documents indexed after the last_index_time and send
them through the Stanbol update processor.
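
A rough sketch of steps 2 and 3 in Python, just to show the idea. It is
based on some assumptions I have not verified against my setup: that
dataimport.properties is a Java properties file with the colons escaped
(last_index_time=2014-01-22 18\:14\:00), that the value is in the server's
local time zone, and that the field is the "timestamp" field from step 1.
The sample content, the function names and the utc_offset_hours parameter
are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def read_last_index_time(properties_text):
    """Return the raw last_index_time value from the properties text, or None."""
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "last_index_time":
            # Assumption: colons are written escaped as '\:' in the file.
            return value.strip().replace("\\:", ":")
    return None


def to_solr_range_query(raw_time, field="timestamp", utc_offset_hours=0.0):
    """Build a range query matching documents stamped at or after raw_time."""
    local = datetime.strptime(raw_time, "%Y-%m-%d %H:%M:%S")
    # Assumption: the value is server-local time; Solr date fields expect UTC.
    tz = timezone(timedelta(hours=utc_offset_hours))
    utc = local.replace(tzinfo=tz).astimezone(timezone.utc)
    return "%s:[%s TO *]" % (field, utc.strftime("%Y-%m-%dT%H:%M:%SZ"))


# Made-up sample content; a real script would read the file from the conf directory.
sample = "#Wed Jan 22 18:14:00 IST 2014\nlast_index_time=2014-01-22 18\\:14\\:00\n"
raw = read_last_index_time(sample)
query = to_solr_range_query(raw, utc_offset_hours=5.5)
print(query)  # timestamp:[2014-01-22T12:44:00Z TO *]
```

The resulting string would go into the q (or fq) parameter of a /select
request with fl=id to collect the document IDs.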

But I have a question here:
Does the last_index_time refer to when the dataimport started
(onImportStart) or when it finished (onImportEnd)?
If it is the onImportEnd timestamp, then this solution won't work, because
the timestamp indexed in the document field will satisfy:
onImportStart < doc-index-timestamp < onImportEnd.


Another alternative I can think of is to trigger an update chain via an
EventListener configured to run after a dataimport has been processed
(onImportEnd).
In this case, can the context in DIH give the list of document IDs processed
in the /dataimport request? If so, I can send those document IDs with an
/update request to run the Stanbol update process.

Please give me your ideas and suggestions.

Thanks,
Dileepa




On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody <dileepajayak...@gmail.com
> wrote:

> Hi All,
>
> I have a Solr requirement to send all the documents imported from a
> /dataimport query to go through another update chain as a separate
> background process.
>
> Currently I have configured my custom update chain in the /dataimport
> handler itself. But since my custom update process needs to connect to an
> external enhancement engine (Apache Stanbol) to enhance the documents with
> some NLP fields, it has a negative impact on the /dataimport process.
> The solution will be to have a separate update process running to enhance
> the content of the documents imported from /dataimport.
>
> Currently I have configured my custom Stanbol Processor as below in my
> /dataimport handler.
>
> <requestHandler name="/dataimport" class="solr.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">data-config.xml</str>
>     <str name="update.chain">stanbolInterceptor</str>
>   </lst>
> </requestHandler>
>
> <updateRequestProcessorChain name="stanbolInterceptor">
>   <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
>   <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
>
>
> What I need now is to separate the two processes of dataimport and
> Stanbol enhancement.
> So this is like running a separate re-indexing process periodically over
> the documents imported from /dataimport to add the Stanbol fields.
>
> The question is how to trigger my Stanbol update process for the documents
> imported from /dataimport.
> In Solr, to trigger an /update request we need to know the ID and the
> fields of the document to be updated. In my case I need to run all the
> documents imported by the previous /dataimport process through a Stanbol
> update.chain.
>
> Is there a way to keep track of the document IDs imported from
> /dataimport?
> Any advice or pointers will be really helpful.
>
> Thanks,
> Dileepa
>
