: Solr aims at being an answer to "enterprise needs", by indexing
: structured data for different applications. However I think that many
: enterprises would like to be able to structure information themselves.
that's exactly what Solr is about: letting a schema creator define
what the structure is, and letting people put data in whatever fields they
want.
Could a future "parser plugin" architecture make sure that the parser
output is in a well-defined format? In that case there could be a separate
step for pure document processing.
In other words, everything fed into the document processing stage should
be in a universal format, complete with its source and the parser that was
used, of course. From this document, fields could be extracted and
computed via simple programming to meet the requirements of the schema.
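A rough sketch of what such a universal document and the field extraction step might look like. Everything here is hypothetical and only illustrates the idea: `StandardDocument`, `to_schema_fields`, and the field names are made up for this example and are not part of Solr.

```python
from dataclasses import dataclass, field


@dataclass
class StandardDocument:
    """Hypothetical universal format produced by any parser plugin."""
    source: str                      # where the raw data came from
    parser: str                      # which parser produced this document
    text: str                        # extracted character data
    metadata: dict = field(default_factory=dict)


def to_schema_fields(doc, schema_fields):
    """Extract or compute only the fields that the schema asks for."""
    candidates = {
        "id": doc.source,
        "parser": doc.parser,
        "body": doc.text,
        "title": doc.metadata.get("title", ""),
        "word_count": len(doc.text.split()),  # a computed field
    }
    return {name: candidates[name] for name in schema_fields if name in candidates}


doc = StandardDocument(
    source="file:///docs/report.pdf",
    parser="pdf",
    text="Quarterly results were good",
    metadata={"title": "Q3 Report"},
)
fields = to_schema_fields(doc, ["id", "title", "word_count"])
```

The schema still decides which fields exist; the processing step only maps the universal document onto them.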
the problem with providing support for unstructured data out of the box is
that it's got no structure :) ... how would Solr know what to do with the
binary data it finds? how would it know what charset to use when reading
that data? ... assuming it gets character data, how does it know which
strings should go in which fields? how does it know which analyzers to
use?
With regard to the above, this could be handled by the parser, which
creates the "standard document". This document would also contain
metadata relevant to solving these tasks. The document processing stage
would then know which conversion to use.
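As a minimal sketch of that idea, assuming the parser records a charset and a content type in the metadata: the processing stage could look up the conversion from those two values. The metadata keys, the `CONVERTERS` table, and the `process` function are all invented for illustration; nothing like this exists in Solr today.

```python
import re


def strip_tags(text):
    """Crude tag removal, good enough for a sketch (not a real HTML parser)."""
    return re.sub(r"<[^>]+>", "", text)


# Hypothetical table: content type reported by the parser -> conversion to apply.
CONVERTERS = {
    "text/plain": lambda text: text,
    "text/html": strip_tags,
}


def process(raw, metadata):
    """Decode and convert raw bytes using only what the parser recorded."""
    text = raw.decode(metadata.get("charset", "utf-8"))
    convert = CONVERTERS.get(metadata.get("content_type"))
    if convert is None:
        raise ValueError("no conversion known for this content type")
    return convert(text)


raw = "<p>caf\u00e9</p>".encode("latin-1")
result = process(raw, {"charset": "latin-1", "content_type": "text/html"})
```

The point is only that the charset and conversion decisions move into data carried by the document, so the processing stage itself does not have to guess.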
some code somewhere has to make these decisions ... at the moment that
code needs to be provided by the user and run outside of Solr ... i
suspect it won't be long before much of that code can run inside of Solr
as a plugin, but it will still need to be provided by the user to parse
truly unstructured data.
Yep. But my idea of a "standard document" - wouldn't that help a bit?
Don't look at me, I'm just a newbie :)
Eivind