JSON tokenizer? tagging ideas

Ryan McKinley Fri, 25 Jan 2008 15:25:08 -0800

I've been struggling with how to get various bits of structured data into Solr documents. In various projects I have tried various ideas, but none feel great.

Take a simple example where I want a document field to be a list of linked data with name, ID, and path. I have tried things like:


<doc>
  <field name="id">ID</field>
  <field name="link">IDA nameA pathA</field>
  <field name="link">IDB nameB pathB</field>
  <field name="link">IDC nameC pathC</field>
</doc>

This is OK -- when spaces are a problem, I've tokenized on \n -- but this feels very brittle.

I'm considering a general JSON tokenizer and want to know what you all think. Consider:

<doc>
  <field name="id">ID</field>
  <field name="link">{ "id":10, "name":"nameA", "path":"/..." }</field>
  <field name="link">{ "id":11, "name":"nameB", "path":"/..." }</field>
  <field name="link">{ "id":12, "name":"nameB", "path":"/..." }</field>
</doc>

The tokenizer can make a token for each key:value pair, that is:
 id:10, name:nameA, path:...., id:11...
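
To make the idea concrete, here is a minimal sketch of the key:value tokenization in Python. It is only an illustration of the proposal, not a Solr or Lucene tokenizer: the function name json_tokens and the paths "/a" and "/b" are made up for the example, it uses Python's json module rather than noggit, and it assumes each field value is a flat, strictly valid JSON object.

```python
import json

def json_tokens(field_value):
    """Return one "key:value" token per top-level pair of a JSON object.

    Assumes a flat object; nested objects or arrays would need their own rule.
    """
    obj = json.loads(field_value)
    return [f"{key}:{value}" for key, value in obj.items()]

# Hypothetical field values; the paths are placeholders for illustration.
links = [
    '{ "id":10, "name":"nameA", "path":"/a" }',
    '{ "id":11, "name":"nameB", "path":"/b" }',
]

# Tokens from every "link" value end up in the same multi-valued field.
tokens = [tok for link in links for tok in json_tokens(link)]
print(tokens)
# ['id:10', 'name:nameA', 'path:/a', 'id:11', 'name:nameB', 'path:/b']
```

One thing the sketch makes visible: once the tokens are flattened, nothing records that id:10 and name:nameA came from the same link value, so a real tokenizer would probably want position information for that.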

Perhaps this could be part of the general 'tag' design:
http://wiki.apache.org/solr/UserTagDesign

Rather than having fixed prefixes like "~erik#lucene", we could use JSON syntax:
 {user:erik, text:lucene, date:20071112 }

Using noggit (http://svn.apache.org/repos/asf/labs/noggit/), the JSON parsing is super fast. The prefix queries are probably slower with a longer string, but I guess you could just use:

 {u:erik, t:lucene, d:20071112 }
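
As a rough sketch of how short keys and prefix queries could fit together, the Python below rewrites long keys to the one-letter forms and runs a prefix lookup over a sorted list of tokens. The sorted list only stands in for an index's term dictionary; it is not how Lucene implements prefix queries. SHORT_KEYS, tag_tokens, prefix_matches, and the second tag are all hypothetical, and the input is written as strict JSON (quoted keys and strings), unlike the shorthand above.

```python
import bisect
import json

# Hypothetical long-to-short key mapping, following the u/t/d shorthand.
SHORT_KEYS = {"user": "u", "text": "t", "date": "d"}

def tag_tokens(tag_json):
    """Turn one JSON tag object into "shortkey:value" tokens."""
    tag = json.loads(tag_json)
    return [f"{SHORT_KEYS.get(key, key)}:{value}" for key, value in tag.items()]

def prefix_matches(sorted_terms, prefix):
    """Return every term starting with prefix, scanning from the first candidate."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        matches.append(term)
    return matches

# The second tag is made-up data for illustration.
terms = sorted(
    tag_tokens('{"user": "erik", "text": "lucene", "date": 20071112}')
    + tag_tokens('{"user": "ryan", "text": "solr", "date": 20080125}')
)

print(prefix_matches(terms, "u:"))      # ['u:erik', 'u:ryan']
print(prefix_matches(terms, "d:2007"))  # ['d:20071112']
```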

Thoughts?

ryan
