Success! resources/tdb was the culprit, thank you Rupert.
Michel

PS Maybe it should be a setting in indexing.properties(?) if you want to override or append to an index?

On 27 mrt. 2012, at 09:38, Rupert Westenthaler wrote:

> Hi Michel
>
> Can you please try the following
>
> On Mon, Mar 26, 2012 at 5:51 PM, Michel Benevento <[email protected]> wrote:
>
>> rm ../stanbol/sling/datafiles/TZW.solrindex.zip
>> sleep 5
>> cd TZW
>> rm -rf indexing/destination
>> rm -rf indexing/dist
>
> rm -rf indexing/resource/tdb
>
>> java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar index
>> mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles
>
> The "indexing/resource/tdb" folder contains the Jena TDB triple store
> with the imported RDF data. These data are kept between indexing runs,
> mainly because importing the RDF data typically takes about as long as
> the indexing process itself. Because of that it makes a lot of sense
> to reuse already imported RDF data if you index RDF dumps (e.g.
> DBpedia).
>
> In cases where the RDF data change, this default is not optimal,
> because the changed dataset is appended to the data already present in
> the Jena TDB store. This means that if you change or remove things in
> your thesaurus, they will still be present in the triple store and
> therefore also appear in the created index.
>
> I must say that it is very confusing that users need to delete
> something within the "/indexing/resources" folder when they change the
> RDF data, so I will create an issue to change this behavior. I think I
> will try to create a named graph for each imported RDF file. This
> would make it possible to automatically delete already existing data
> in the Jena TDB store when a file with the same name is imported
> again.
>
> Can you please check and report back whether this is the cause of your
> problem.
>
> Thanks in advance
>
> best
> Rupert
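For reference, here is the whole cleanup-and-reindex sequence with Rupert's correction folded into Michel's script (a consolidated sketch; the TZW paths and the relative location of the Stanbol launcher are specific to Michel's setup):

    # remove the previously deployed index archive from the running Stanbol instance
    rm ../stanbol/sling/datafiles/TZW.solrindex.zip
    sleep 5
    cd TZW
    # clear the output of the previous indexing run ...
    rm -rf indexing/destination
    rm -rf indexing/dist
    # ... and the cached Jena TDB triple store, so changed RDF data are
    # re-imported instead of being appended to stale triples
    rm -rf indexing/resource/tdb
    # rebuild the index and deploy the new archive
    java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar index
    mv indexing/dist/TZW.solrindex.zip ../../stanbol/sling/datafiles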
>>
>> On 26 mrt. 2012, at 17:11, Rupert Westenthaler wrote:
>>
>>> Hi Michel
>>>
>>> On 26.03.2012, at 16:40, Michel Benevento wrote:
>>>
>>>> Hello,
>>>>
>>>> As I am experimenting with various versions of my import file I have
>>>> changed my namespace URLs. But when I refresh the index, the old
>>>> namespaces keep accumulating in my results, resulting in duplicates.
>>>> Is this intended behavior? How can I get rid of these (cached?)
>>>> results and return to a pristine state?
>>>
>>> I think I have an explanation for what you are seeing. Can you please
>>> check that.
>>>
>>> The indexing tool does NOT delete the
>>> "{indexing-root}/indexing/destination" folder. So if you index your
>>> data twice without deleting this folder, the new data will be
>>> appended. This would explain why you still see the data with the old
>>> namespaces. So please try to delete the indexing/destination folder
>>> and index again.
>>>
>>> This behavior is not a bug but a feature, because it allows indexing
>>> multiple datasets. I am currently writing some documentation on that,
>>> so I will copy the related section to the end of this mail.
>>>
>>> best
>>> Rupert
>>>
>>> - - -
>>> ### Indexing Datasets separately
>>>
>>> This demo indexes all four datasets in a single step. However, this
>>> is not required. With a simple trick it is possible to index
>>> different datasets with different indexing configurations into the
>>> same target. This section describes how this can be achieved and why
>>> users might want to do it.
>>>
>>> This demo uses Solr as the target of the indexing process.
>>> Theoretically there could be several possibilities, but currently
>>> this is the only available IndexingDestination implementation. The
>>> SolrIndex used to store the data is located at
>>> "{indexing-root}/indexing/destination/indexes/default/{name}". If
>>> this directory does not already exist, it is initialized by the
>>> indexing tool based on the SolrCore configuration in
>>> "{indexing-root}/indexing/config/{name}", or on the default SolrCore
>>> configuration if none is present. However, if it already exists,
>>> this core is used and the data of the current indexing process are
>>> added to the existing SolrCore.
>>>
>>> Because of that it is possible to subsequently add information from
>>> different datasets to the same SolrIndex. However, users need to be
>>> aware that if the different datasets contain the same entity (a
>>> resource with the same URI), the information of the second dataset
>>> will replace that of the first. Nonetheless, in the given demo this
>>> would allow creating separate configurations (e.g. mappings) for all
>>> four datasets while still ensuring the indexed data end up in the
>>> same SolrIndex.
>>>
>>> This might be useful in situations where the same property (e.g.
>>> rdfs:label) is used by the different datasets in different ways,
>>> because then one could create a mapping for dataset1 that maps
>>> rdfs:label > skos:prefLabel and a mapping for dataset2 that ensures
>>> rdfs:label > skos:altLabel.
>>>
>>> Workflows like that can easily be implemented with shell scripts or
>>> by setting soft links in the file system.
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
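Two sketches to make the quoted documentation section concrete. First, the per-dataset label mappings: using the source > target notation from the example above, the two indexing configurations could contain lines like these (the mappings.txt file name and the dataset1/dataset2 directory names are illustrative assumptions, not taken from this thread):

    # dataset1/indexing/config/mappings.txt (assumed location)
    # labels from dataset1 become preferred labels in the shared index
    rdfs:label > skos:prefLabel

    # dataset2/indexing/config/mappings.txt (assumed location)
    # labels from dataset2 become alternative labels only
    rdfs:label > skos:altLabel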
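Second, the soft-link workflow: a minimal shell sketch of indexing two datasets into the same SolrIndex, assuming sibling indexing roots dataset1 and dataset2 whose configurations use the same index name, so that the second run finds and appends to the SolrCore created by the first (whether sharing only the destination folder is sufficient may depend on the rest of your configuration):

    cd dataset1
    # the first run initializes indexing/destination/indexes/default/{name}
    java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar index

    cd ../dataset2
    # point dataset2 at dataset1's destination so its data are added to the
    # same SolrCore instead of a fresh one
    rm -rf indexing/destination
    ln -s ../../dataset1/indexing/destination indexing/destination
    java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.9.0-incubating-SNAPSHOT-jar-with-dependencies.jar index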
