I've been struggling with how to get various bits of structured data into Solr documents. On different projects I have tried different approaches, but none of them feels great.

Take a simple example where I want a document field to hold a list of linked items, each with a name, an ID, and a path. I have tried things like:

<doc>
  <field name="id">ID</field>
  <field name="link">IDA nameA pathA</field>
  <field name="link">IDB nameB pathB</field>
  <field name="link">IDC nameC pathC</field>
</doc>

This is OK (when spaces are a problem, I've tokenized on \n instead), but it feels very brittle.

I'm considering a general JSON tokenizer and want to know what you all think. Consider:
<doc>
  <field name="id">ID</field>
  <field name="link">{ "id":10, "name":"nameA", "path":"/..." }</field>
  <field name="link">{ "id":11, "name":"nameB", "path":"/..." }</field>
  <field name="link">{ "id":12, "name":"nameC", "path":"/..." }</field>
</doc>

The tokenizer can make a token for each key:value pair, that is:
 id:10, name:nameA, path:...., id:11, ...
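
To make the token stream concrete, here is a minimal sketch in Python. It is not a Lucene/Solr tokenizer, just an illustration of the output I have in mind, and it assumes each field value is strict JSON (commas between pairs). The function name key_value_tokens is made up for this example.

```python
import json


def key_value_tokens(field_values):
    """Yield one 'key:value' token per pair, in document order.

    field_values: iterable of JSON object strings, one per 'link' field.
    """
    for raw in field_values:
        obj = json.loads(raw)  # assumes strict JSON; raises ValueError otherwise
        for key, value in obj.items():
            yield f"{key}:{value}"


# The "/..." paths are kept elided, exactly as in the example document.
links = [
    '{ "id":10, "name":"nameA", "path":"/..." }',
    '{ "id":11, "name":"nameB", "path":"/..." }',
]

for token in key_value_tokens(links):
    print(token)
```

A real implementation would also have to decide what to do with nested objects and arrays; this sketch only handles flat objects.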

Perhaps this could be part of the general 'tag' design:
http://wiki.apache.org/solr/UserTagDesign

Rather than having fixed prefixes like "~erik#lucene", we could use JSON syntax:
 {user:erik, text:lucene, date:20071112 }

Using noggit (http://svn.apache.org/repos/asf/labs/noggit/), the JSON parsing is super fast. Prefix queries are probably slower with a longer string, but I guess you could just use shorter keys:
 {u:erik, t:lucene, d:20071112 }
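
As a rough sketch of how a prefix lookup over such terms might behave (plain Python string matching, not an actual Lucene prefix query), assume each tag is indexed as one term with the user first. The helper names tag_term and user_prefix are hypothetical, and every sample tag other than erik/lucene/20071112 is invented for illustration.

```python
def tag_term(user, text, date):
    """Build one tag term in the short-key form shown above."""
    return "{u:%s, t:%s, d:%s }" % (user, text, date)


def user_prefix(user):
    """Prefix that matches every tag term written by one user.

    The trailing ", " keeps 'erik' from also matching 'erika'.
    """
    return "{u:%s, " % user


# Sample terms; only the first one comes from the example above.
terms = sorted([
    tag_term("erik", "lucene", "20071112"),
    tag_term("erik", "solr", "20071113"),
    tag_term("erika", "lucene", "20071112"),
    tag_term("ryan", "json", "20071112"),
])

erik_tags = [t for t in terms if t.startswith(user_prefix("erik"))]
for term in erik_tags:
    print(term)
```

The catch is that key order now matters: a prefix can only select on the leading keys, so the user has to come first if "all tags by erik" is the common query.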

Thoughts?

ryan
