RE: Optimizing number of segments in lucene index (no writes/deletes, only reads)

2017-06-14 Thread Uwe Schindler
Hi, This article is still very correct! Use the defaults of TieredMergePolicy, nothing more to say. The problems only start once you optimize/forceMerge for the first time and still update the index afterwards, because then your index is no longer structured in an optimal way and the huge segment w
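For reference, a minimal sketch of the setup Uwe recommends (hypothetical index path and analyzer; IndexWriterConfig already uses TieredMergePolicy by default, so setting it here is only to make the default explicit):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.FSDirectory;

    public class DefaultMergeSetup {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // TieredMergePolicy is already the default; shown only for clarity.
        config.setMergePolicy(new TieredMergePolicy());
        try (IndexWriter writer = new IndexWriter(
            FSDirectory.open(Paths.get("/path/to/index")), config)) {
          // index documents with the defaults; avoid forceMerge on an
          // index that will still receive updates afterwards
          writer.commit();
        }
      }
    }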

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-06-14 Thread David Smiley
Nice! On Tue, Jun 13, 2017 at 11:12 PM Tom Hirschfeld wrote: > Hey All, > > I was able to solve my problem a few weeks ago and wanted to update you > all. The root issue was with the caching mechanism in the > "makeDistanceValueSource" method in the Lucene spatial module; it appears > that documents

Re: Optimizing number of segments in lucene index (no writes/deletes, only reads)

2017-06-14 Thread Riccardo Tasso
2017-06-14 19:20 GMT+02:00 Uwe Schindler : > You also lose the ability to parallelize searches with an Executor on > IndexSearcher! How can you say that? Isn't it true that multiple readers can access the same segment concurrently?
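For context, parallelizing a single search across segments looks roughly like this (a sketch; the index path and thread-pool size are assumptions):

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ParallelSearch {
      public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try (DirectoryReader reader =
            DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
          // With an executor, ONE query is fanned out across segments;
          // after forceMerge(1) there is only one segment left, so a
          // single search has nothing to parallelize.
          IndexSearcher searcher = new IndexSearcher(reader, executor);
          TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10);
          System.out.println("total hits: " + hits.totalHits);
        } finally {
          executor.shutdown();
        }
      }
    }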

Using POS payloads for chunking

2017-06-14 Thread José Tomás Atria
Hello! I'm not particularly familiar with Lucene's search API (as I've been using the library mostly as a dumb index rather than a search engine), but I am almost certain that, using its payload capabilities, it would be trivial to implement a regular chunker to look for patterns in sequences of p
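For the record, reading payloads back out goes through the postings API, roughly as below (a fragment, not a full program; the field name "body" and term "run" are hypothetical, and the field must have been indexed with payloads):

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    // inside a method, given an open DirectoryReader `reader`:
    for (LeafReaderContext ctx : reader.leaves()) {
      Terms terms = ctx.reader().terms("body");          // hypothetical field
      if (terms == null) continue;
      TermsEnum te = terms.iterator();
      if (!te.seekExact(new BytesRef("run"))) continue;  // hypothetical term
      PostingsEnum postings = te.postings(null, PostingsEnum.PAYLOADS);
      while (postings.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
        for (int i = 0; i < postings.freq(); i++) {
          postings.nextPosition();
          BytesRef payload = postings.getPayload();      // e.g. a POS tag
          // a chunker could match payload patterns across positions here
        }
      }
    }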

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello, We use POS tagging too, and encode the tags as payload bitsets for scoring, which is, as far as I know, the only possibility with payloads. So, instead of encoding them as payloads, why not index your treebank POS tags as tokens at the same position, like synonyms? If you do that, you can
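To illustrate the same-position idea, a minimal TokenFilter sketch that stacks a POS-tag token on top of each word with position increment 0, the way a synonym filter would (the tagger lookup is a hypothetical placeholder):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public final class PosTagInjectFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncAtt =
          addAttribute(PositionIncrementAttribute.class);
      private String pendingTag;  // POS tag to emit at the same position

      public PosTagInjectFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (pendingTag != null) {
          // emit the stacked tag token at the same position as the word
          termAtt.setEmpty().append(pendingTag);
          posIncAtt.setPositionIncrement(0);
          pendingTag = null;
          return true;
        }
        if (!input.incrementToken()) return false;
        pendingTag = lookupPosTag(termAtt.toString());  // hypothetical tagger
        return true;
      }

      private String lookupPosTag(String word) {
        return "POS_NN";  // placeholder: a real tagger would go here
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pendingTag = null;
      }
    }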

Re: Using POS payloads for chunking

2017-06-14 Thread Erik Hatcher
Markus - how are you encoding payloads as bitsets and using them for scoring? Curious to see how folks are leveraging them. Erik > On Jun 14, 2017, at 4:45 PM, Markus Jelsma wrote: > > Hello, > > We use POS-tagging too, and encode them as payload bitsets for scoring, which > is, as f

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello Erik, Using Solr, though most of the parts involved are actually Lucene, we have a CharFilter that appends treebank tags to whitespace-delimited words using a delimiter; further on we pick these tokens up with the delimiter and the POS tag. It won't work with some Tokenizers, and if you put it before WDF, it'll split as you kn
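For reference, the delimiter approach Markus describes can be wired up with DelimitedPayloadTokenFilter, roughly as below (a sketch; token text such as "run|VB" is assumed to come out of the upstream CharFilter):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
    import org.apache.lucene.analysis.payloads.IdentityEncoder;

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        // "run|VB" -> term "run" carrying the bytes of "VB" as its payload
        TokenStream stream = new DelimitedPayloadTokenFilter(
            tokenizer, '|', new IdentityEncoder());
        return new TokenStreamComponents(tokenizer, stream);
      }
    };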

Re: Using POS payloads for chunking

2017-06-14 Thread Erick Erickson
Markus: I don't believe that payloads are limited in size at all. LUCENE-7705 was done in part because there _was_ a hard-coded 256 limit for some of the tokenizers. A payload (at least in recent versions) is just some bytes attached to the token, and (with LUCENE-7705) can be arbitrarily long. Of course i
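To illustrate, setting a payload of arbitrary length is just a matter of handing a BytesRef to the PayloadAttribute (a fragment meant to live inside a TokenFilter's incrementToken; the flag bit is hypothetical):

    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    // in the filter: PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    byte[] bits = new byte[4];  // 32 flag bits instead of 8; any length works
    bits[0] |= 1 << 3;          // set a hypothetical flag bit
    payloadAtt.setPayload(new BytesRef(bits));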

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello Erick, no worries, I recognize you two. I will take a look at your references tomorrow. Although I am still fine with eight bits, I have no more than one bit to spare. If Lucene allows us to pass longer bitsets via the BytesRef, it would be awesome and easy to encode. Thanks! Markus -Orig

Re: Using POS payloads for chunking

2017-06-14 Thread Tommaso Teofili
I think it'd be interesting to also investigate using TypeAttribute [1] together with TypeTokenFilter [2]. Regards, Tommaso [1] : https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html [2] : https://lucene.apache.org/core/6_5_0/analyzers-common/org
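For reference, a short sketch of filtering by token type with TypeTokenFilter (the type string is an assumption; it would have to be set on the TypeAttribute by an earlier filter, since tokenizers default to a generic type):

    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.TypeTokenFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    Tokenizer tok = new WhitespaceTokenizer();
    tok.setReader(new StringReader("some tagged text"));
    Set<String> stopTypes = new HashSet<>();
    stopTypes.add("POS_DT");  // hypothetical type assigned by an earlier filter
    // drops tokens whose TypeAttribute matches a listed type; the
    // three-arg constructor with useWhiteList=true keeps only those types
    TokenStream filtered = new TypeTokenFilter(tok, stopTypes);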

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello Tommaso, These don't propagate to search, right? But they can be used in the analyzer chain! This would be a better solution than using delimiters on words. The only problem is that TypeFilter only works on tokens, after the tokenizer. The bonus of a CharFilter is that it sees the whole text, s

RE: Optimizing number of segments in lucene index (no writes/deletes, only reads)

2017-06-14 Thread Luís Filipe Nassif
In the past I have tried IndexSearcher with an ExecutorService to parallelize searches over multiple segments on an SSD disk. That was with Lucene 4.9. Unfortunately the searches became slower with various numbers of threads in the pool, and much slower with 1 thread. There was some overhead with tha

RE: Optimizing number of segments in lucene index (no writes/deletes, only reads)

2017-06-14 Thread Uwe Schindler
Hi, What was meant is that a *single* search can be parallelized. Of course you can run multiple searches in parallel, but that is completely unrelated to the question. Uwe - Uwe Schindler Achterdiek 19, D-28357 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message-