Hi,

> On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter <mathias.wal...@gmx.net>
> wrote:
> > I indexed about 90 million sentences and the PAS (predicate argument
> > structures) they consist of (which are about 500 million). Then I try to
> > do NER (named entity recognition) by searching for about 5 million
> > entities. For each entity I need all search results, not just the top X.
> > Since about 10 percent of the entities are high-frequency (i.e. there are
> > more than 5 million hits for "human"), it takes very long to obtain the
> > data from the index. "Very long" means about a day with 15 distributed
> > Katta nodes. Katta is just a distribution and shard balancing solution on
> > top of Lucene.
> 
> if you aren't getting top-N results/doing search, are you sure a
> search engine library/server is the right tool for this job?

No, I'm not sure, but I haven't found a better solution. Any alternative would 
also have to build some kind of index and provide a search API. Because I need 
SpanNearQuery and PhraseQuery to match multi-term entities, I think 
Solr/Lucene is a good starting point. I also need the classic top-N results 
for the web application, so a single solution is preferred.
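
For illustration, this is roughly how I build the queries for a multi-term
entity such as "tumor necrosis factor" (just a sketch; the field name
"sentence" and the slop value are placeholders, not my actual schema):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class EntityQueries {

  // Exact phrase match for a multi-term entity.
  static Query phraseFor(String field, String... terms) {
    PhraseQuery phrase = new PhraseQuery();
    for (String t : terms) {
      phrase.add(new Term(field, t));
    }
    return phrase;
  }

  // Proximity match that tolerates a few intervening terms, in order.
  static Query nearFor(String field, int slop, String... terms) {
    SpanQuery[] clauses = new SpanQuery[terms.length];
    for (int i = 0; i < terms.length; i++) {
      clauses[i] = new SpanTermQuery(new Term(field, terms[i]));
    }
    return new SpanNearQuery(clauses, slop, true); // true = in order
  }

  public static void main(String[] args) {
    Query exact = phraseFor("sentence", "tumor", "necrosis", "factor");
    Query sloppy = nearFor("sentence", 1, "tumor", "necrosis", "factor");
    System.out.println(exact + "\n" + sloppy);
  }
}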

> > Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte
> > array. The size increased to 7 characters (= 14 bytes), which is still a
> > gain of more than 50 percent compared to the UTF-8 encoding. BTW: I found
> > no example of how to use the IndexableBinaryStringTools class except in
> > the unit tests.
> 
> it is deprecated in trunk, because you can index binary terms (your
> own byte[]) directly if you want. To do this, you need to use a custom
> AttributeFactory.

How do I use it with Solr, i.e. how do I set up a schema.xml using a custom 
AttributeFactory?
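
For context, this is roughly how I encode the 11-byte keys with
IndexableBinaryStringTools at the moment (a sketch against the array-based
3.x API as I recall it; the surrounding TokenFilter is omitted):

import org.apache.lucene.util.IndexableBinaryStringTools;

public class BinaryTermEncoder {

  // Encode an arbitrary byte[] (here: an 11-byte PAS key) into chars that
  // can be indexed as a term without the UTF-8 size blow-up.
  static char[] encode(byte[] key) {
    int encodedLength =
        IndexableBinaryStringTools.getEncodedLength(key, 0, key.length);
    char[] encoded = new char[encodedLength];
    IndexableBinaryStringTools.encode(key, 0, key.length,
                                      encoded, 0, encodedLength);
    return encoded;
  }

  // Decode a stored term back into the original bytes.
  static byte[] decode(char[] encoded) {
    int decodedLength =
        IndexableBinaryStringTools.getDecodedLength(encoded, 0, encoded.length);
    byte[] decoded = new byte[decodedLength];
    IndexableBinaryStringTools.decode(encoded, 0, encoded.length,
                                      decoded, 0, decodedLength);
    return decoded;
  }
}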

--
Kind regards,
Mathias
