Hi all, I've been working with MCF for the past few days and am very happy with what it lets me do. I have a pipeline going from my repository to Solr, and it works fine. But there is one point I clearly don't understand:
How do you know exactly which fields are going to be output in a given configuration? I found that I had to resort to trial and error, tweaking my Solr schema to avoid "undefined field xxxxx" errors from Manifold when trying to write to Solr. Now to be fair, I could clearly just ignore any fields I don't specifically know I want, but I'd like to understand how this works.

Is it the case that the initial set of fields depends on the repository connector? I seemed to get some Alfresco-specific fields when reading from Alfresco, which I did not get from the simple dummy file-system repo I was initially experimenting with.

It also seems that Tika adds some fields (actually a lot of fields) even when you don't have a Tika transform wired in explicitly. Is it the case that you need to put in an explicit Tika transform if you want to control which fields are contributed by Tika? And on that point, is there a master list of possible fields that Tika will emit, or is Tika just transforming the names of metadata fields in the documents it encounters and programmatically generating field names?

Any and all help on understanding how this works is greatly appreciated.

Phil
