Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-11 Thread Michael McCandless
On Sat, Jan 10, 2015 at 7:58 PM, Tom Burton-West tburt...@umich.edu wrote: Thanks Mike, We run our Solr 3.x indexing with 10GB/shard. I've been testing Solr 4 with 4, 6, and 8GB of heap. As of Friday night, when the indexes were about half done (about 400GB on disk), only the 4GB run had issues.

Upgrading Lucene from 3.5 to 4.10 - how to handle Java API changes

2015-01-11 Thread Martin Wunderlich
Hi all, I am currently in the process of upgrading a search engine application from Lucene 3.5.0 to version 4.10.3. There have been some substantial API changes in version 4 that break backward compatibility. I have managed to fix most of them, but a few issues remain that I could use some

RE: Upgrading Lucene from 3.5 to 4.10 - how to handle Java API changes

2015-01-11 Thread Uwe Schindler
Hi, First, there is also a migration guide next to the changes log: http://lucene.apache.org/core/4_10_3/MIGRATE.html 1. If you implement an Analyzer, you have to override createComponents(), which returns a TokenStreamComponents object. See other Analyzer’s source code to understand how to

Highlighter - SimpleSpanFragmenter bug

2015-01-11 Thread zsolt.szloboda
The highlighter's SimpleSpanFragmenter has a bug, documented in https://issues.apache.org/jira/browse/LUCENE-2229, that practically makes it unusable with PhraseQuery. I can confirm that the bug still exists in version 4.10 (the JIRA issue was created back in 2010). The symptom is that if there

SegmentCommitInfos and live/deleted files

2015-01-11 Thread Varun Thacker
I wanted to know what the difference is between the two ways I am getting a list of all segment files belonging to a segment. method1 never returns .liv files. https://gist.github.com/vthacker/98065232c3d2da579700 -- Regards, Varun Thacker http://www.vthacker.in/

Re: SegmentCommitInfos and live/deleted files

2015-01-11 Thread Robert Muir
files are either per-segment or per-commit. the first only returns per-segment files. this means it won't include any per-commit files: * segments_N itself * generational .liv for deletes * generational .fnm/.dvd/etc for docvalues updates. the second includes per-commit files, too. it doesn't

Re: Upgrading Lucene from 3.5 to 4.10 - how to handle Java API changes

2015-01-11 Thread Martin Wunderlich
Hi Uwe, Thanks a lot for the detailed reply. I'll see how far I get with it, but being quite new to Lucene, it seems I am lacking a bit of background information to fully understand the response below. In particular, I need to do some background reading on how token streams and readers work,

Re: SegmentCommitInfos and live/deleted files

2015-01-11 Thread Varun Thacker
Thanks Robert for pointing out the difference. On Sun, Jan 11, 2015 at 10:29 PM, Robert Muir rcm...@gmail.com wrote: files are either per-segment or per-commit. the first only returns per-segment files. this means it won't include any per-commit files: * segments_N itself * generational

Custom tokenizer

2015-01-11 Thread Vihari Piratla
Hi, I am trying to implement a custom tokenizer for my application and I have a few questions about it. 1. Is there a way to give an existing analyzer (say, EnglishAnalyzer) a custom tokenizer and make it use that tokenizer instead of, say, StandardTokenizer? 2. Why are analyzers such as