Success!

resources/tdb was the culprit, thank you Rupert.

Michel


PS Maybe it should be a setting in indexing.properties(?) if you want to 
override or append to an index?
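Something along these lines, say, in indexing.properties. To be clear, the property below is made up to illustrate the suggestion; it is not an existing option of the indexing tool:

```
# Hypothetical option (does not exist yet): control whether RDF data
# already imported into indexing/resource/tdb is reused (append) or
# cleared before the next import (override).
indexing.source.rdf.importMode=override
```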



On 27 Mar 2012, at 09:38, Rupert Westenthaler wrote:

> Hi Michel
> 
> Can you please try the following
> 
> On Mon, Mar 26, 2012 at 5:51 PM, Michel Benevento <[email protected]> 
> wrote:
> 
>> rm ../stanbol/sling/datafiles/TZW.solrindex.zip
>> sleep 5
>> cd TZW
>> rm -rf indexing/destination
>> rm -rf indexing/dist
> 
> rm -rf indexing/resource/tdb
> 
>> java -jar 
>> org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar
>>  index
>> mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles
>> 
> 
> The "indexing/resource/tdb" folder contains the Jena TDB triplestore
> with the imported RDF data. These data are kept between indexing
> runs mainly because importing the RDF data typically takes about as
> long as the indexing process itself. Because of that it makes a lot
> of sense to reuse already imported RDF data when you index RDF
> dumps (e.g. DBpedia).
> 
> When the RDF data change, however, this default is not optimal,
> because the changed dataset is appended to the data already present
> in the Jena TDB store. This means that if you change or remove
> things in your thesaurus, they will still be present in the triple
> store and therefore also appear in the created index.
> 
> I must say it is very confusing that users need to delete something
> within the "/indexing/resources" folder when they change the RDF
> data, so I will create an issue to change this behavior. I think I
> will try to create a named graph for each imported RDF file. This
> would make it possible to automatically delete already existing data
> within the Jena TDB store when a file with the same name is imported
> again.
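As a sketch of that idea in SPARQL 1.1 Update terms (the graph names and file paths below are made up for illustration; this is not what the tool currently does):

```
# Re-importing "thesaurus.nt": first drop whatever an earlier run
# loaded from the same file, then load the new version into the same
# per-file named graph.
DROP SILENT GRAPH <urn:x-import:thesaurus.nt> ;
LOAD <file:indexing/resources/rdfdata/thesaurus.nt>
  INTO GRAPH <urn:x-import:thesaurus.nt>
```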
> 
> Can you please check and report back whether this is the cause of your problem?
> 
> Thanks in advance
> 
> best
> Rupert
> 
>> 
>> On 26 Mar 2012, at 17:11, Rupert Westenthaler wrote:
>> 
>>> Hi Michel
>>> On 26.03.2012, at 16:40, Michel Benevento wrote:
>>> 
>>>> Hello,
>>>> 
>>>> As I am experimenting with various versions of my importfile I have 
>>>> changed my namespace urls. But when I refresh the index, the old 
>>>> namespaces keep accumulating in my results, resulting in duplicates. Is 
>>>> this intended behavior? How can I get rid of these (cached?) results and 
>>>> return to a pristine state?
>>>> 
>>> 
>>> I think I have an explanation for what you are seeing. Can you please check 
>>> that.
>>> 
>>> The indexing tool does NOT delete the 
>>> "{indexing-root}/indexing/destination" folder. So if you index your data 
>>> twice without deleting this folder the new data will be appended. This 
>>> would explain why you still see the data with the old namespaces. So please 
>>> try to delete the indexing/destination folder and index again.
>>> 
>>> This behavior is not a bug but a feature, because it allows indexing 
>>> multiple datasets. I am currently writing some documentation on that, so I 
>>> will copy the relevant section at the end of this mail.
>>> 
>>> best
>>> Rupert
>>> 
>>> - - -
>>> ### Indexing Datasets separately
>>> 
>>> This demo indexes all four datasets in a single step. However, this is not 
>>> required. With a simple trick it is possible to index different datasets 
>>> with different indexing configurations into the same target. This section 
>>> describes how this can be achieved and why users might want to do it.
>>> 
>>> This demo uses Solr as the target of the indexing process. In theory there 
>>> could be several possibilities, but currently this is the only available 
>>> IndexingDestination implementation. The SolrIndex used to store the data is 
>>> located at "{indexing-root}/indexing/destination/indexes/default/{name}". If 
>>> this directory does not already exist, it is initialized by the indexing 
>>> tool based on the SolrCore configuration in 
>>> "{indexing-root}/indexing/config/{name}", or on the default SolrCore 
>>> configuration if none is present. However, if it already exists, then this 
>>> core is used and the data of the current indexing process are added to the 
>>> existing SolrCore.
>>> 
>>> Because of that it is possible to subsequently add information from 
>>> different datasets to the same SolrIndex. However, users need to know that 
>>> if different datasets contain the same entity (a resource with the same 
>>> URI), the information from the second dataset will replace that of the 
>>> first. Nonetheless, in the given demo this would allow creating separate 
>>> configurations (e.g. mappings) for all four datasets while still ensuring 
>>> the indexed data end up in the same SolrIndex.
>>> 
>>> This might be useful in situations where the same property (e.g. 
>>> rdfs:label) is used by the different datasets in different ways, because 
>>> then one could create a mapping for dataset1 that maps rdfs:label > 
>>> skos:prefLabel, and for dataset2 a mapping that ensures rdfs:label > 
>>> skos:altLabel.
>>> 
>>> Workflows like that can easily be implemented with shell scripts or by 
>>> setting soft links in the file system.
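A minimal sketch of such a soft-link workflow (all directory and dataset names below are made up; the sandboxed `mktemp` setup and `echo` stand in for the real indexing run, which is left commented out):

```shell
#!/bin/sh
# Sketch: index two datasets into the same target by re-pointing
# "indexing/config" at a per-dataset configuration via a soft link,
# then running the indexing tool once per dataset.
set -e
work=$(mktemp -d)            # sandbox so the sketch touches nothing real
cd "$work"
mkdir -p config-dataset1 config-dataset2 indexing

for ds in dataset1 dataset2; do
  rm -f indexing/config                  # drop the previous link, if any
  ln -s "../config-$ds" indexing/config  # activate this dataset's config
  echo "indexing with $(readlink indexing/config)"
  # java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar index
done
```

Each pass would leave the tool writing into the same `indexing/destination`, which is exactly the append behavior described above.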
>> 
> 
> 
> 
> -- 
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
