Hi,

> On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter <mathias.wal...@gmx.net>
> wrote:
> > I indexed about 90 million sentences and the PAS (predicate argument
> > structures) they consist of (which are about 500 million). Then I try
> > to do NER (named entity recognition) by searching about 5 million
> > entities. For each entity I need all of the search results, not just
> > the top X. Since about 10 percent of the entities are highly frequent
> > (i.e. there are more than 5 million hits for "human"), it takes very
> > long to obtain the data from the index. "Very long" means about a day
> > with 15 distributed Katta nodes. Katta is just a distribution and
> > shard balancing solution on top of Lucene.
>
> If you aren't getting top-N results/doing search, are you sure a search
> engine library/server is the right tool for this job?
No, I'm not sure, but I didn't find another solution. Any other solution
would also have to build some kind of index and provide some search API.
Because I need SpanNearQuery and PhraseQuery to find multi-term entities, I
think Solr/Lucene is a good starting point. I also need the classic top-N
results for the web application, so a single solution is preferred.

> > Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte
> > array. The size was increased to 7 characters (= 14 bytes), which is
> > still a gain of more than 50 percent compared to the UTF-8 encoding.
> > BTW: I found no example of how to use the IndexableBinaryStringTools
> > class except in the unit tests.
>
> It is deprecated in trunk, because you can index binary terms (your own
> byte[]) directly if you want. To do this, you need to use a custom
> AttributeFactory.

How do I use it with Solr, i.e. how do I set up a schema.xml that uses a
custom AttributeFactory?

--
Kind regards,
Mathias
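
For reference, a minimal sketch of the SpanNearQuery approach mentioned
above for matching a multi-term entity as adjacent, in-order terms. The
field name "sentence" and the example entity are illustrative, not taken
from the actual schema:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  public class MultiTermEntityQuerySketch {
    public static void main(String[] args) {
      // Match the multi-word entity "epidermal growth factor" as
      // adjacent (slop 0), in-order terms in an assumed "sentence" field.
      SpanQuery[] words = new SpanQuery[] {
          new SpanTermQuery(new Term("sentence", "epidermal")),
          new SpanTermQuery(new Term("sentence", "growth")),
          new SpanTermQuery(new Term("sentence", "factor"))
      };
      SpanNearQuery entityQuery = new SpanNearQuery(words, 0, true);
      System.out.println(entityQuery);
    }
  }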
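
Since the thread notes there is no usage example for
IndexableBinaryStringTools outside the unit tests, here is a minimal sketch
of encoding a byte[] into indexable chars and decoding it back. It assumes
the array-based encode/decode overloads (older releases expose only the
CharBuffer/ByteBuffer variants), and the 11-byte identifier is illustrative:

  import org.apache.lucene.util.IndexableBinaryStringTools;

  public class BinaryTermEncodingSketch {
    public static void main(String[] args) {
      // An arbitrary 11-byte id, as discussed in the thread; the
      // contents are made up for illustration.
      byte[] id = new byte[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};

      // Encode the bytes into chars that can be indexed as a normal term.
      int encLen = IndexableBinaryStringTools.getEncodedLength(id, 0, id.length);
      char[] encoded = new char[encLen];
      IndexableBinaryStringTools.encode(id, 0, id.length, encoded, 0, encLen);
      String term = new String(encoded); // index and search this string

      // Decode the term back into the original bytes.
      char[] chars = term.toCharArray();
      int decLen = IndexableBinaryStringTools.getDecodedLength(chars, 0, chars.length);
      byte[] decoded = new byte[decLen];
      IndexableBinaryStringTools.decode(chars, 0, chars.length, decoded, 0, decLen);
    }
  }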