: I'm streaming over the document content (presumably via tika) and its
: gathering the document's metadata which includes the keywords metadata field.
: Since I'm also passing that field from the DB to the REST call as a list (as
: you suggested) there is a collision because the keywords field is single
: valued.
:
: I can change this behavior using a copy field. What I wanted to know is if
: there was a specific reason the default schema defined a field like keywords
: single valued so I could make sure I wasn't missing something before I changed
: things.

That file is just an example; you're absolutely free to change it to meet
your use case.

I'm not very familiar with Tika, but based on the comment in the example
config...

   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them.
   -->

...I suspect it was intentional that the keywords field is *not*
multiValued (I guess Tika always returns a single delimited value?), but if
you have multiple discrete values you want to send for your DB-backed data,
there is no downside to changing that.

: While I'm at it, I'd REALLY like to know how to use DIH to index the metadata
: from the database while simultaneously streaming over the document content and
: indexing it. I've never quite figured it out yet but I have to believe it is
: a possibility.

There's a TikaEntityProcessor that can be used to have Tika crunch the data
that comes from an "entity" and extract specific fields. It can be used in
combination with a JdbcDataSource and a BinFileDataSource, so that a field
in your DB data specifies the name of a file on disk to use as the
TikaEntity -- but I've personally never tried it.

Here's a simple example someone posted last year that they got working...

http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html


-Hoss
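
For reference, a rough and untested sketch of the two changes discussed above. Everything that is not named in the message is a hypothetical placeholder: the field type, the JDBC driver and URL, the table and column names (documents, id, title, keywords, file_path), and the target field names. First, the keywords field in schema.xml switched to multiValued (the type shown is an assumption; keep whatever type your schema already uses):

```xml
<!-- schema.xml: same keywords field as the example schema, now multiValued.
     The type name is an assumption; reuse the one already in your schema. -->
<field name="keywords" type="text" indexed="true" stored="true" multiValued="true"/>
```

Second, a DIH data-config.xml that pulls metadata rows from the database and hands each row's file path to a nested TikaEntityProcessor entity reading through a BinFileDataSource. It assumes the DB stores keywords as one comma-delimited string, which RegexTransformer's splitBy turns into multiple values:

```xml
<dataConfig>
  <!-- Hypothetical JDBC settings; substitute your own driver, URL, credentials. -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/mydb"
              user="solr" password="secret"/>
  <!-- Reads the raw bytes of a file on local disk for Tika to parse. -->
  <dataSource name="bin" type="BinFileDataSource"/>

  <document>
    <!-- Outer entity: one Solr document per DB row (table/columns are assumed). -->
    <entity name="doc" dataSource="db" transformer="RegexTransformer"
            query="SELECT id, title, keywords, file_path FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- Split the delimited DB value into separate keywords values. -->
      <field column="keywords" name="keywords" splitBy=","/>

      <!-- Inner entity: Tika parses the file named by the row's file_path. -->
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${doc.file_path}" format="text">
        <!-- "text" is the extracted body; "content" is an assumed target field. -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Only the extracted body text is mapped from Tika here, so the file's own keywords metadata never reaches the keywords field and cannot collide with the DB values. Note that some databases return column names in upper case, in which case the variable would need to be ${doc.FILE_PATH}.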