Hi Michel

Can you please try the following:

On Mon, Mar 26, 2012 at 5:51 PM, Michel Benevento <[email protected]> wrote:

> rm ../stanbol/sling/datafiles/TZW.solrindex.zip
> sleep 5
> cd TZW
> rm -rf indexing/destination
> rm -rf indexing/dist

rm -rf indexing/resource/tdb

> java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar index
> mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles
>

The "indexing/resource/tdb" folder contains the Jena TDB triple store
with the imported RDF data. These data are kept in between indexing
runs mainly because the time needed to import the RDF data is
typically about the same as the time needed for the indexing itself.
Because of that it makes a lot of sense to reuse already imported RDF
data when you index RDF dumps (e.g. DBpedia).

In cases where the RDF data change, this default is not optimal,
because the changed dataset is appended to the data already present in
the Jena TDB store. This means that if you change or remove things in
your thesaurus, the old triples will still be present in the triple
store and therefore also appear in the created index.

I must say it is very confusing that users need to delete something
within the "/indexing/resources" folder when they change the RDF data,
so I will create an issue to change this behavior. I think I will try
to create a named graph for each imported RDF file. This would allow
already existing data in the Jena TDB store to be deleted
automatically when a file with the same name is imported again.

Can you please check and report back whether this is the cause of your problem?

Thanks in advance

best
Rupert

>
> On 26 mrt. 2012, at 17:11, Rupert Westenthaler wrote:
>
>> Hi Michel
>> On 26.03.2012, at 16:40, Michel Benevento wrote:
>>
>>> Hello,
>>>
>>> As I am experimenting with various versions of my importfile I have changed 
>>> my namespace urls. But when I refresh the index, the old namespaces keep 
>>> accumulating in my results, resulting in duplicates. Is this intended 
>>> behavior? How can I get rid of these (cached?) results and return to a 
>>> pristine state?
>>>
>>
>> I think I have an explanation for what you are seeing. Can you please check 
>> that.
>>
>> The indexing tool does NOT delete the "{indexing-root}/indexing/destination" 
>> folder. So if you index your data twice without deleting this folder the new 
>> data will be appended. This would explain why you still see the data with 
>> the old namespaces. So please try to delete the indexing/destination folder 
>> and index again.
>>
>> This behavior is not a bug, but a feature, because it allows indexing 
>> multiple datasets. I am currently writing some documentation on that, so I 
>> will copy the relevant section to the end of this mail.
>>
>> best
>> Rupert
>>
>> - - -
>> ### Indexing Datasets separately
>>
>> This demo indexes all four datasets in a single step. However, this is not 
>> required: with a simple trick it is possible to index different datasets 
>> with different indexing configurations into the same target. This section 
>> describes how this can be achieved and why users might want to do it.
>>
>> This demo uses Solr as the target of the indexing process. Theoretically 
>> there could be several possibilities, but currently this is the only 
>> available IndexingDestination implementation. The SolrIndex used to store 
>> the data is located at 
>> "{indexing-root}/indexing/destination/indexes/default/{name}". If this 
>> directory does not already exist it is initialized by the indexing tool 
>> based on the SolrCore configuration in 
>> "{indexing-root}/indexing/config/{name}", or on the default SolrCore 
>> configuration if none is present. However, if it already exists then this 
>> core is used and the data of the current indexing process are added to the 
>> existing SolrCore.
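
Sketched as a directory layout (only "destination/indexes/default/{name}",
"config/{name}", "dist", and "resource/tdb" are named in this thread; the
tree form itself is an illustration):

```
{indexing-root}/indexing/
├── config/{name}/                       SolrCore configuration used to
│                                        initialize the core, if present
├── destination/indexes/default/{name}/  the SolrIndex written by the tool
├── dist/                                packaged {name}.solrindex.zip
└── resource/tdb/                        imported RDF data (Jena TDB)
```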
>>
>> Because of that it is possible to subsequently add information from 
>> different datasets to the same SolrIndex. However, users need to know that 
>> if the different datasets contain the same entity (a resource with the 
>> same URI), the information from the second dataset will replace that from 
>> the first. Nonetheless, in the given demo this allows creating separate 
>> configurations (e.g. mappings) for all four datasets while still ensuring 
>> that the indexed data end up in the same SolrIndex.
>>
>> This might be useful in situations where the same property (e.g. 
>> rdfs:label) is used in different ways by the different datasets, because 
>> then one could create a mapping for dataset1 that maps rdfs:label > 
>> skos:prefLabel and a mapping for dataset2 that ensures rdfs:label > 
>> skos:altLabel.
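
As a sketch, the two mappings could live in per-dataset configuration
files. The file name "mappings.txt" and the exact paths are assumptions
for illustration; the "source > target" syntax is the one used above:

```
# dataset1: {indexing-root-1}/indexing/config/mappings.txt (hypothetical path)
rdfs:label > skos:prefLabel

# dataset2: {indexing-root-2}/indexing/config/mappings.txt (hypothetical path)
rdfs:label > skos:altLabel
```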
>>
>> Workflows like that can easily be implemented with shell scripts or by 
>> setting soft links in the file system.
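
As a minimal sketch of the soft-link variant: both indexing roots get an
indexing/destination soft link pointing at one shared directory, so
whatever each indexing run writes lands in the same SolrIndex. All folder
names below are made up for illustration, and the actual jar invocation is
omitted.

```shell
# Sketch: two indexing roots sharing one destination via soft links.
# All names here are hypothetical; run the indexing jar in each root
# afterwards to fill the shared SolrIndex.
WORK=$(mktemp -d)
mkdir -p "$WORK/shared-destination"
mkdir -p "$WORK/dataset1/indexing" "$WORK/dataset2/indexing"

# Point both indexing/destination folders at the shared directory.
ln -s "$WORK/shared-destination" "$WORK/dataset1/indexing/destination"
ln -s "$WORK/shared-destination" "$WORK/dataset2/indexing/destination"

# Anything written under either root's indexing/destination ends up in
# the shared directory -- e.g. the SolrIndex created by the indexing tool.
touch "$WORK/dataset1/indexing/destination/written-by-run-1"
touch "$WORK/dataset2/indexing/destination/written-by-run-2"
ls "$WORK/shared-destination"
```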
>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
