Hi Michel

On Tue, Mar 27, 2012 at 9:55 AM, Michel Benevento <[email protected]> wrote:
> Success!
>
> resources/tdb was the culprit, thank you Rupert.
>

good to hear

> PS Maybe it should be a setting in indexing.properties(?) if you want to 
> override or append to an index?
>

This would be another possibility to solve this issue if the named
graph approach does not work out. I prefer the named graph solution
because it would work "magically" - without requiring the user to
provide any kind of configuration.

However, a property like that would be a good idea for
enabling/disabling the automatic deletion of the destination folder.
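
Just to illustrate the idea, such a setting could look like the
following (a sketch only - the property name is hypothetical and does
not exist in indexing.properties yet):

    # hypothetical: clear previously imported RDF data before a new indexing run
    rdf.clearImportedData=true

With something like that the current behavior would stay the default
and users could opt in to a clean re-import of changed RDF files.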

best
Rupert

>
>
> On 27 mrt. 2012, at 09:38, Rupert Westenthaler wrote:
>
>> Hi Michel
>>
>> Can you please try the following
>>
>> On Mon, Mar 26, 2012 at 5:51 PM, Michel Benevento <[email protected]> 
>> wrote:
>>
>>> rm ../stanbol/sling/datafiles/TZW.solrindex.zip
>>> sleep 5
>>> cd TZW
>>> rm -rf indexing/destination
>>> rm -rf indexing/dist
>>
>> rm -rf indexing/resources/tdb
>>
>>> java -jar 
>>> org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar
>>>  index
>>> mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles
>>>
>>
>> The "indexing/resource/tdb" folder contains the Jena TDB triplestore
>> with the imported RDF data. This data are kept in-between indexing
>> processes mainly because the time needed to import the RDF data is
>> typically approximately the same as needed for the indexing process.
>> Because of that it makes a lot of sense to reuse already imported RDF
>> data if you index RDF dumps (e.g. DBpedia).
>>
>> In cases where the RDF data change, this default is not optimal,
>> because the changed dataset is appended to the data already present in
>> the Jena TDB store. This means that if you change or remove things in
>> your thesaurus, the old statements will still be present within the
>> triple store and therefore also appear in the created index.
>>
>> I must say that it is very confusing if users need to delete something
>> within the "indexing/resources" folder whenever they change the RDF
>> data, so I will create an issue to change this behavior. I think I
>> will try to create a named graph for each imported RDF file. This
>> would allow automatically deleting already existing data within the
>> Jena TDB store when a file with the same name is imported again.
>>
>> Can you please check and report back whether this is the cause of your problem?
>>
>> Thanks in advance
>>
>> best
>> Rupert
>>
>>>
>>> On 26 mrt. 2012, at 17:11, Rupert Westenthaler wrote:
>>>
>>>> Hi Michel
>>>> On 26.03.2012, at 16:40, Michel Benevento wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> As I am experimenting with various versions of my import file I have 
>>>>> changed my namespace URLs. But when I refresh the index, the old 
>>>>> namespaces keep accumulating in my results, resulting in duplicates. Is 
>>>>> this intended behavior? How can I get rid of these (cached?) results and 
>>>>> return to a pristine state?
>>>>>
>>>>
>>>> I think I have an explanation for what you are seeing. Can you please 
>>>> check that.
>>>>
>>>> The indexing tool does NOT delete the 
>>>> "{indexing-root}/indexing/destination" folder. So if you index your data 
>>>> twice without deleting this folder, the new data will be appended. This 
>>>> would explain why you still see the data with the old namespaces. So 
>>>> please try to delete the indexing/destination folder and index again, as 
>>>> in the example below.
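>>>>
>>>> For example (run from the {indexing-root} directory; the jar is the
>>>> genericrdf indexing tool used by the demo):
>>>>
>>>>     rm -rf indexing/destination
>>>>     java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar index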
>>>>
>>>> This behavior is not a bug but a feature, because it allows indexing 
>>>> multiple datasets. I am currently writing some documentation on that, so 
>>>> I will copy the related section to the end of this mail.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> - - -
>>>> ### Indexing Datasets separately
>>>>
>>>> This demo indexes all four datasets in a single step. However, this is not 
>>>> required. With a simple trick it is possible to index different datasets 
>>>> with different indexing configurations to the same target. This section 
>>>> describes how this can be achieved and why users might want to do this.
>>>>
>>>> This demo uses Solr as the target of the indexing process. Theoretically 
>>>> there could be several possibilities, but currently this is the only 
>>>> available IndexingDestination implementation. The SolrIndex used to store 
>>>> the data is located at 
>>>> "{indexing-root}/indexing/destination/indexes/default/{name}". If this 
>>>> directory does not already exist, it is initialized by the indexing tool 
>>>> based on the SolrCore configuration in 
>>>> "{indexing-root}/indexing/config/{name}", or on the default SolrCore 
>>>> configuration if not present. However, if it already exists, this core is 
>>>> used and the data of the current indexing process are added to the 
>>>> existing SolrCore.
>>>>
>>>> Because of that it is possible to subsequently add information from 
>>>> different datasets to the same SolrIndex. However, users need to know 
>>>> that if different datasets contain the same entity (a resource with the 
>>>> same URI), the information of the second dataset will replace that of the 
>>>> first. Nonetheless, in the given demo this would allow creating separate 
>>>> configurations (e.g. mappings) for all four datasets while still ensuring 
>>>> that the indexed data are contained in the same SolrIndex.
>>>>
>>>> This might be useful in situations where the same property (e.g. 
>>>> rdfs:label) is used by the different datasets in different ways. One 
>>>> could then create a mapping for dataset1 that maps rdfs:label > 
>>>> skos:prefLabel and a mapping for dataset2 that ensures rdfs:label > 
>>>> skos:altLabel, as sketched below.
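>>>>
>>>> A minimal sketch of such mappings (using the field mapping syntax from 
>>>> above; the actual file locations depend on the indexing configuration of 
>>>> each dataset):
>>>>
>>>>     # mappings for dataset1: rdfs:label holds preferred labels
>>>>     rdfs:label > skos:prefLabel
>>>>
>>>>     # mappings for dataset2: rdfs:label holds alternative labels
>>>>     rdfs:label > skos:altLabel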
>>>>
>>>> Workflows like that can easily be implemented with shell scripts or by 
>>>> setting soft links in the file system, for example as follows.
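>>>>
>>>> A minimal sketch of such a workflow (directory names are only examples; 
>>>> each dataset directory holds its own indexing configuration, and 
>>>> <genericrdf-indexer>.jar stands for the genericrdf indexing tool jar):
>>>>
>>>>     # index dataset1 into its own destination
>>>>     (cd dataset1 && java -jar <genericrdf-indexer>.jar index)
>>>>
>>>>     # let dataset2 reuse the destination of dataset1 via a soft link
>>>>     rm -rf dataset2/indexing/destination
>>>>     ln -s ../../dataset1/indexing/destination dataset2/indexing/destination
>>>>
>>>>     # index dataset2 - its data are added to the same SolrIndex
>>>>     (cd dataset2 && java -jar <genericrdf-indexer>.jar index)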
>>>
>>
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
