[Lucene.Net] [jira] [Updated] (LUCENENET-427) Provide limit on phrase analysis in FastVectorHighlighter (LUCENE-3234)
[ https://issues.apache.org/jira/browse/LUCENENET-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Digy updated LUCENENET-427: --- Attachment: FastVectorHighlighter.patch Provide limit on phrase analysis in FastVectorHighlighter (LUCENE-3234) --- Key: LUCENENET-427 URL: https://issues.apache.org/jira/browse/LUCENENET-427 Project: Lucene.Net Issue Type: Improvement Affects Versions: Lucene.Net 2.9.2, Lucene.Net 2.9.4, Lucene.Net 2.9.4g Reporter: Digy Priority: Minor Fix For: Lucene.Net 2.9.4g Attachments: FastVectorHighlighter.patch https://issues.apache.org/jira/browse/LUCENE-3234 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[Lucene.Net] [jira] [Resolved] (LUCENENET-427) Provide limit on phrase analysis in FastVectorHighlighter (LUCENE-3234)
[ https://issues.apache.org/jira/browse/LUCENENET-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Digy resolved LUCENENET-427. Resolution: Fixed. Committed.
[jira] [Commented] (LUCENE-3243) FastVectorHighlighter - add position offset to FieldPhraseList.WeightedPhraseInfo.Toffs
[ https://issues.apache.org/jira/browse/LUCENE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055347#comment-13055347 ] Koji Sekiguchi commented on LUCENE-3243: Thank you for the proposal and patch! I don't understand: * What is the position offset? Isn't it just a position? * Why is the position offset a String? * Why do you need setPositionOffset()? I don't understand the implementation of the method... it appends the argument position to the current position. FastVectorHighlighter - add position offset to FieldPhraseList.WeightedPhraseInfo.Toffs --- Key: LUCENE-3243 URL: https://issues.apache.org/jira/browse/LUCENE-3243 Project: Lucene - Java Issue Type: Improvement Components: modules/highlighter Affects Versions: 3.2 Environment: Lucene 3.2 Reporter: Jahangir Anwari Priority: Minor Labels: feature, lucene Attachments: LUCENE-3243.patch.diff I needed to return position offsets along with highlighted snippets when using the FVH for highlighting. Using the [LUCENE-3141|https://issues.apache.org/jira/browse/LUCENE-3141] patch I was able to get the fragInfo for a particular phrase search. Currently the Toffs (term offsets) class only stores the start and end offset. To get the position offset, I added the position offset information to the Toffs and FieldPhraseList classes.
[jira] [Commented] (LUCENE-1889) FastVectorHighlighter: support for additional queries
[ https://issues.apache.org/jira/browse/LUCENE-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055364#comment-13055364 ] Koji Sekiguchi commented on LUCENE-1889: Patch looks really good! bq. To handle RangeQuery, you'd need to add another such data structure: it would probably be best to introduce some new abstraction to represent all of these query-proxies. Would you like to try this one? :) bq. It seemed a less useful case to me anyway since we don't usually use range queries in the context of full text; more often they come up in structured metadata? Curious if you have requests for that? I don't have the requirement for highlighting range queries, even wildcard, prefix and regexp either. Because I'm using FVH to highlight terms in N-gram fields, and these MultiTermQueries are not ideal for N-gram. But if FVH could cover range queries, it should be nicer for users. FastVectorHighlighter: support for additional queries - Key: LUCENE-1889 URL: https://issues.apache.org/jira/browse/LUCENE-1889 Project: Lucene - Java Issue Type: Wish Components: modules/highlighter Reporter: Robert Muir Priority: Minor Attachments: LUCENE-1889.patch I am using fastvectorhighlighter for some strange languages and it is working well! One thing i noticed immediately is that many query types are not highlighted (multitermquery, multiphrasequery, etc) Here is one thing Michael M posted in the original ticket: {quote} I think a nice [eventual] model would be if we could simply re-run the scorer on the single document (using InstantiatedIndex maybe, or simply some sort of wrapper on the term vectors which are already a mini-inverted-index for a single doc), but extend the scorer API to tell us the exact term occurrences that participated in a match (which I don't think is exposed today). {quote} Due to strange requirements I am using something similar to this (but specialized to our case). 
I am doing strange things like forcing multitermqueries to rewrite into boolean queries so they will be highlighted, and flattening multiphrasequeries into boolean or'ed phrasequeries. I do not think these things would be 'fast', but i had a few ideas that might help: * looking at contrib/highlighter, you can support FilteredQuery in flatten() by calling getQuery() right? * maybe as a last resort, try Query.extractTerms() ? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
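The flatten()/extractTerms() ideas above can be illustrated with a toy query tree. This is a hypothetical sketch with stand-in classes, not Lucene's actual Query API: it recursively collects leaf terms for highlighting, unwrapping FilteredQuery via getQuery() as the comment suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenSketch {
    // Minimal stand-ins for a query tree (hypothetical, not o.a.l.search classes).
    interface Query {}
    static class TermQuery implements Query {
        final String term;
        TermQuery(String term) { this.term = term; }
    }
    static class BooleanQuery implements Query {
        final List<Query> clauses;
        BooleanQuery(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }
    static class FilteredQuery implements Query {
        final Query inner;
        FilteredQuery(Query inner) { this.inner = inner; }
        Query getQuery() { return inner; }
    }

    // Recursively flatten the tree into the terms to highlight.
    static void flatten(Query q, List<String> out) {
        if (q instanceof TermQuery) {
            out.add(((TermQuery) q).term);
        } else if (q instanceof BooleanQuery) {
            for (Query c : ((BooleanQuery) q).clauses) flatten(c, out);
        } else if (q instanceof FilteredQuery) {
            flatten(((FilteredQuery) q).getQuery(), out); // unwrap FilteredQuery
        }
    }

    public static List<String> terms(Query q) {
        List<String> out = new ArrayList<>();
        flatten(q, out);
        return out;
    }

    public static void main(String[] args) {
        Query q = new BooleanQuery(
            new TermQuery("fast"),
            new FilteredQuery(new TermQuery("vector")));
        System.out.println(terms(q)); // [fast, vector]
    }
}
```

The real FVH flatten() additionally has to preserve phrase structure, which is why multiphrasequeries get expanded into OR'ed phrase queries rather than bare terms.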
[jira] [Commented] (LUCENE-826) Language detector
[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055374#comment-13055374 ] Jan Høydahl commented on LUCENE-826: Reviving this issue - it would be interesting to arrive at a proposal for whether this code could replace Tika's existing LanguageIdentifier. We still need to solve the case of small texts. I'm thinking of a hybrid solution where we fall back to a dictionary-based detector for small texts, e.g. based on OOo dictionaries. Language detector - Key: LUCENE-826 URL: https://issues.apache.org/jira/browse/LUCENE-826 Project: Lucene - Java Issue Type: New Feature Reporter: Karl Wettin Assignee: Karl Wettin Attachments: ld.tar.gz, ld.tar.gz A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid false positive classifications. Depends on contrib/analyzers/ngrams for tokenization, Weka for classification (logistic support vector models), feature selection and normalization of token frequencies. Optionally Wikipedia and NekoHTML for training data harvesting.
Initialized like this:
{code}
LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
root.addBranch("uralic");
root.addBranch("fino-ugric", "uralic");
root.addBranch("ugric", "uralic");
root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
root.addBranch("proto-indo european");
root.addBranch("germanic", "proto-indo european");
root.addBranch("northern germanic", "germanic");
root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
root.addBranch("west germanic", "germanic");
root.addLanguage("west germanic", "eng", "english", "en", "UK");
root.mkdirs();
LanguageClassifier classifier = new LanguageClassifier(root);
if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
  classifier.compileTrainingData(); // from wikipedia
}
classifier.buildClassifier();
{code}
The training set built from Wikipedia consists of the pages describing the home country of each registered language, in the language to train. The above example passes this test (testEquals is the same as assertEquals, just not required; only one of them fails, see comment):
{code}
assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
testEquals("swe", classifier.classify(norway_in_swedish).getISO());
testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
testEquals("swe", classifier.classify(finland_in_swedish).getISO());
testEquals("swe", classifier.classify(uk_in_swedish).getISO());
testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
testEquals("fin", classifier.classify(norway_in_finnish).getISO());
testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
testEquals("fin", classifier.classify(uk_in_finnish).getISO());
testEquals("dan", classifier.classify(sweden_in_danish).getISO());
// it is ok that this fails. dan and nor are very similar, and the document about norway in danish is very small.
testEquals("dan", classifier.classify(norway_in_danish).getISO());
assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
testEquals("dan", classifier.classify(finland_in_danish).getISO());
testEquals("dan", classifier.classify(uk_in_danish).getISO());
testEquals("eng", classifier.classify(sweden_in_english).getISO());
testEquals("eng", classifier.classify(norway_in_english).getISO());
testEquals("eng", classifier.classify(denmark_in_english).getISO());
testEquals("eng", classifier.classify(finland_in_english).getISO());
assertEquals("eng", classifier.classify(uk_in_english).getISO());
{code}
I don't know how well it works on lots of languages, but this fits my needs for now. I'll try to do more work on considering the language trees when classifying.
It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled arff-file.
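For readers unfamiliar with ngram-based detection, the general idea behind such a detector can be sketched in a few lines: build character-trigram profiles from training text and classify new text by nearest profile. This is a minimal, self-contained illustration only; the actual patch uses contrib/analyzers ngrams plus Weka (logistic SVMs), not cosine similarity over raw counts as below.

```java
import java.util.HashMap;
import java.util.Map;

public class NGramLangSketch {
    // Character trigram counts of a text, padded with spaces.
    public static Map<String, Integer> profile(String text) {
        String s = " " + text.toLowerCase() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram profiles (both assumed non-empty).
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * (double) other;
            na += e.getValue() * (double) e.getValue();
        }
        for (int v : b.values()) nb += v * (double) v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Classify by most similar training profile; returns its language key.
    public static String classify(Map<String, Map<String, Integer>> trained, String text) {
        Map<String, Integer> p = profile(text);
        String best = null;
        double bestSim = -1;
        for (Map.Entry<String, Map<String, Integer>> e : trained.entrySet()) {
            double sim = similarity(p, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> trained = new HashMap<>();
        trained.put("eng", profile("the quick brown fox jumps over the lazy dog and the cat"));
        trained.put("swe", profile("den snabba bruna räven hoppar över den lata hunden och katten"));
        System.out.println(classify(trained, "over the lazy dog")); // prints "eng"
    }
}
```

The "requires a paragraph of text" caveat in the issue description shows up here too: with only a handful of trigrams, closely related languages (dan/nor) become hard to separate.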
[jira] [Commented] (SOLR-2614) stats with pivot
[ https://issues.apache.org/jira/browse/SOLR-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055396#comment-13055396 ] pengyao commented on SOLR-2614: --- Could somebody help me or give me some suggestions? Or is it easy to patch? Thanks very much. stats with pivot Key: SOLR-2614 URL: https://issues.apache.org/jira/browse/SOLR-2614 Project: Solr Issue Type: Improvement Components: SearchComponents - other Affects Versions: 4.0 Reporter: pengyao Priority: Critical Fix For: 4.0 Is it possible to get stats (like the Stats Component: min, max, sum, count, missing, sumOfSquares, mean and stddev) from numeric fields inside hierarchical facets (with more than one level, like pivot)? I would like to query: ...?q=*:*&version=2.2&start=0&rows=0&stats=true&stats.field=numeric_field1&stats.field=numeric_field2&stats.pivot=field_x,field_y,field_z and get min, max, sum, count, etc. from numeric_field1 and numeric_field2 for all combinations of field_x, field_y and field_z (hierarchical values). Using stats.facet I get just one field at one level, and using facet.pivot I get just counts, but no stats. Looping in the client application over all combinations of facet values would be too slow because there are a lot of combinations. Thanks a lot! This is very important, because count values alone are sometimes not enough. Please add stats with pivot in Solr 4.0. Thanks a lot.
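Until such a feature exists, what the reporter asks for (min/max/sum/count per pivot combination) can be computed client-side over returned documents. A minimal sketch with a hypothetical row format (not Solr's API), which is exactly the looping the reporter wants to push server-side for large combination counts:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PivotStatsSketch {
    public static class Stats {
        public long count;
        public double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum;
        void add(double v) { count++; sum += v; min = Math.min(min, v); max = Math.max(max, v); }
        public double mean() { return sum / count; }
    }

    // rows: (field_x value, field_y value, numeric field value); the pivot
    // path "x/y" is the grouping key, one Stats accumulator per combination.
    public static Map<String, Stats> pivotStats(List<Object[]> rows) {
        Map<String, Stats> out = new TreeMap<>();
        for (Object[] r : rows) {
            String key = r[0] + "/" + r[1];
            out.computeIfAbsent(key, k -> new Stats()).add(((Number) r[2]).doubleValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[]{"A", "x", 1.0},
            new Object[]{"A", "x", 3.0},
            new Object[]{"A", "y", 5.0});
        Map<String, Stats> stats = pivotStats(rows);
        System.out.println(stats.get("A/x").count + " " + stats.get("A/x").mean()); // 2 2.0
    }
}
```

Doing this server-side per shard (and merging partial sums/counts rather than raw docs) is what makes a real stats.pivot implementation efficient.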
[jira] [Commented] (LUCENE-3079) Facetiing module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055402#comment-13055402 ] Toke Eskildsen commented on LUCENE-3079: This is quite a different design from the quarter-baked one I've proposed in SOLR-2412 (which is really just a thin wrapper around LUCENE-2369). While maintaining a sidecar index makes the workflow more complicated, I would expect it to be beneficial for re-open speed and scalability. Technical note: For hierarchical faceting, I find that it is possible to avoid storing all levels in the hierarchy. By maintaining two numbers for each tag, denoting the tag level and the level of the previous tag that matches, only the relevant tags need to be indexed (full explanation at https://sbdevel.wordpress.com/2010/10/05/fast-hierarchical-faceting/). Kudos for contributing solid code. I am looking forward to seeing the patch. Facetiing module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3079.patch Faceting is a hugely important feature, available in Solr today but not [easily] usable by Lucene-only apps. We should fix this, by creating a shared faceting module. Ideally, we factor out Solr's faceting impl, and maybe poach/merge from other impls (eg Bobo browse). Hoss describes some important challenges we'll face in doing this (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here: {noformat} To look at faceting as a concrete example, there are big reasons faceting works so well in Solr: Solr has total control over the index, knows exactly when the index has changed so it can rebuild caches, has a strict schema so it can make sense of field types and pick faceting algos accordingly, has a multi-phase distributed search approach to get exact counts efficiently across multiple shards, etc... (and there are still a lot of additional enhancements and improvements that can be made to take even more advantage of the knowledge Solr has because it owns the index, that no one has had time to tackle) {noformat} This is a great list of the things we face in refactoring. It's also important because, if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same needs, so we really have to address them in creating the shared module. I think we should get a basic faceting module started, but should not cut Solr over at first. We should iterate on the module, fold in improvements, etc., and then, once we can fully verify that cutting over doesn't hurt Solr (ie lose functionality or performance), we can later cut over.
[jira] [Resolved] (LUCENE-3231) Add fixed size DocValues int variants expose Arrays where possible
[ https://issues.apache.org/jira/browse/LUCENE-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-3231. - Resolution: Fixed Assignee: Simon Willnauer Lucene Fields: [New, Patch Available] (was: [New]) Add fixed size DocValues int variants expose Arrays where possible Key: LUCENE-3231 URL: https://issues.apache.org/jira/browse/LUCENE-3231 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3231.patch, LUCENE-3231.patch Currently we only have a variable bit-packed ints implementation. For flexible scoring or loading field caches it is desirable to have fixed int implementations for 8, 16, 32 and 64 bits.
[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)
[ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055437#comment-13055437 ] Koji Sekiguchi commented on SOLR-2583: -- I'd like the feature as I'm using ExternalFileField a lot! bq. what do you say regarding the suggestion to use HashMap up to ~5.5% and above that using the float[]? Looking at your test, I think it is reasonable. But I'd like to use CompactByteArray. I saw it win over HashMap and float[] at 5% and above in my test. How about introducing compact=yes (default is no, and float[] is used) along with sparse=yes/no/auto? Make external scoring more efficient (ExternalFileField, FileFloatSource) - Key: SOLR-2583 URL: https://issues.apache.org/jira/browse/SOLR-2583 Project: Solr Issue Type: Improvement Components: search Reporter: Martin Grotzke Priority: Minor Attachments: FileFloatSource.java.patch, patch.txt External scoring eats a lot of memory, depending on the number of documents in the index. The ExternalFileField (used for external scoring) uses FileFloatSource, where one FileFloatSource is created per external scoring file. FileFloatSource creates a float array with the size of the number of docs (this is also done if the file to load is not found). If there are far fewer entries in the scoring file than there are docs in total, the big float array wastes a lot of memory. This could be optimized by using a map of doc -> score, so that the map contains only as many entries as there are scoring entries in the external file, and no more.
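The proposed optimization, switching between a dense float[] and a sparse map depending on how many docs actually have scores, can be sketched as follows. These are hypothetical classes, not the actual FileFloatSource patch; the ~5% density threshold is taken from the discussion above, and CompactByteArray would be a third representation in the same spirit.

```java
import java.util.HashMap;
import java.util.Map;

public class ExternalScoresSketch {
    public interface Scores { float get(int doc); }

    // Dense: one float per doc, fast lookups, O(maxDoc) memory.
    static class DenseScores implements Scores {
        final float[] values;
        DenseScores(int maxDoc, Map<Integer, Float> entries) {
            values = new float[maxDoc];
            for (Map.Entry<Integer, Float> e : entries.entrySet()) {
                values[e.getKey()] = e.getValue();
            }
        }
        public float get(int doc) { return values[doc]; }
    }

    // Sparse: memory proportional to the number of scored docs only.
    static class SparseScores implements Scores {
        final Map<Integer, Float> map;
        SparseScores(Map<Integer, Float> entries) { map = entries; }
        public float get(int doc) { return map.getOrDefault(doc, 0f); }
    }

    // Pick the representation by density of scored docs.
    public static Scores build(int maxDoc, Map<Integer, Float> entries) {
        double density = entries.size() / (double) maxDoc;
        return density < 0.05 ? new SparseScores(entries) : new DenseScores(maxDoc, entries);
    }

    public static void main(String[] args) {
        Map<Integer, Float> entries = new HashMap<>();
        entries.put(7, 2.5f);
        Scores s = build(1000000, entries); // far below 5% dense -> sparse map
        System.out.println(s.get(7) + " " + s.get(8)); // 2.5 0.0
    }
}
```

A sparse=auto option as suggested above would amount to exactly this build() decision, with the threshold made configurable.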
[jira] [Commented] (LUCENE-3079) Facetiing module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055443#comment-13055443 ] Shai Erera commented on LUCENE-3079: Thanks Toke for the pointer. I think it's very interesting. We've actually explored in the past storing just the category/leaf, instead of the entire hierarchy, in the document. The search response time was much slower than what I reported above (nearly a 2x slowdown). While storing the entire hierarchy indeed consumes more space, it performs better at search time, and we figure that space today is cheap; usually search apps are more interested in faster search response times and are willing to spend some more time at the indexing and analysis stages. Nevertheless, the link you provided proposes an interesting way to manage the hierarchy, and I think it's worth exploring at some point. It could be that it will perform better than how we managed it when we indexed just the leaf category for each document. We'd also need to see how to update the taxonomy on the go. For example, it describes that for A/B/C you know that its level is 3 (that's easy) and that the previous category/tag that matches (P) is A. But what if at some point A/B is added to a document? What happens to the data indexed for the doc with A/B/C, whose previous matching category is now A/B? It's not clear to me, but it could be that I've missed the description in the proposal. I am very close to uploading the patch. Hopefully I'll upload it by the end of my day. Facetiing module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3079.patch
[jira] [Commented] (LUCENE-1889) FastVectorHighlighter: support for additional queries
[ https://issues.apache.org/jira/browse/LUCENE-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055452#comment-13055452 ] Robert Muir commented on LUCENE-1889: - {quote} A possible issue is that regex support will differ from RegexpQuery, but I think? that Java's is a superset, so should be ok, but I'm not sure about this one. {quote} Actually, these are totally different syntaxes! An alternative way to flatten these multitermqueries could be to implement o.a.l.index.Terms with what is in the term vector... then you could rewrite them with their own code. Trying to generate an equivalent string pattern could be a little problematic; for example, wildcard supports escaped terms (and could contain other characters that are java.util.regex syntax characters but not wildcard syntax characters), the regex syntax is different, etc. If you still decide you want to do it this way though, I would use o.a.l.util.automaton instead of java.util.regex. Besides being faster, this is internally what these queries are using anyway, so you can convert them with, for example, WildcardQuery.toAutomaton(). Then union these and match against the union'ed machine instead of a List. But personally I would look at going the Terms/rewriteMethod route if possible; this way all multitermqueries will just work. FastVectorHighlighter: support for additional queries - Key: LUCENE-1889 URL: https://issues.apache.org/jira/browse/LUCENE-1889 Project: Lucene - Java Issue Type: Wish Components: modules/highlighter Reporter: Robert Muir Priority: Minor Attachments: LUCENE-1889.patch
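For reference, the "generate an equivalent string pattern" approach that Robert warns about could look like the following sketch. This is hypothetical illustrative code, not the patch under discussion: it deliberately ignores wildcard escaping (one of the exact problems he points out), and a real implementation should prefer o.a.l.util.automaton as he suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WildcardMatchSketch {
    // Convert Lucene-style wildcard syntax (* = any run, ? = one char) into a
    // java.util.regex pattern, quoting every other character. NOTE: does not
    // handle backslash-escaped wildcard characters.
    public static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*");
            else if (c == '?') sb.append(".");
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Match term-vector terms against the pattern to find highlight candidates.
    public static List<String> matching(String wildcard, List<String> terms) {
        Pattern p = toRegex(wildcard);
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (p.matcher(t).matches()) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matching("te?t*",
            Arrays.asList("test", "text", "tests", "team"))); // [test, text, tests]
    }
}
```

The automaton route replaces toRegex() with WildcardQuery.toAutomaton(), unions the automata, and runs the term-vector terms against the single unioned machine.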
RE: [VOTE] release 3.3 (take two)
Hi, This time, all is fine: - Smoke tests with tests.iter=100,tests.multiplicator=100,tests.nightly on Sun Java 1.5.0_22 Solaris-64 passed for Lucene-Core (using the Lucene-Src package, so it even compiles fine). Lucene-Contrib failed somehow, but with iter=1 it passed. One core dump on testing contrib/analysis-common/ (Portuguese); it seems a Java 5 bug happens sometimes (unfortunately the hs_err log is gone), not reproducible. So all fine, I don't want to accuse Java 5 - but the policeman is angry and wants an expensive ticket + driver license removed from Java 5 :-) - All signatures are fine, and all from Robert Muir, whom I know personally: find . -name '*.asc' | xargs -L1 gpg --verify - Artifact META-INFs are correct, versions and revno are correct - Lucene-core-3.3.0.jar from Maven plugged into PANGAEA worked without recompiling. No Hotspot issues - MMap is fine - Extracted the Solr src package; ant test with iter=1 and multiplicator=100 and nightly passes from the root folder (includes Lucene tests) - Extracted the Lucene and Solr binary packages and checked contents for completeness (licenses, javadocs, ...) - fine! The Solr.WAR file contains correct artifact versions; unfortunately the Lucene jar files are not available in the dist folder inside the Solr binary - is that wanted? Small issue: - systemrequirements.html: the JUnit version listed as a requirement is, ahm, very old. We should remove that in future, as JUnit is bundled with the src package. So here is my PMC +1 !
Uwe - Generics|Java5|Signature|Manifest Policeman with PMC vote - - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Sunday, June 26, 2011 5:12 PM To: dev@lucene.apache.org Subject: [VOTE] release 3.3 (take two) Artifacts here: http://s.apache.org/lusolr330rc1 working release notes here: http://wiki.apache.org/lucene-java/ReleaseNote33 http://wiki.apache.org/solr/ReleaseNote33 To see the changes between the previous release candidate (rc0): svn diff -r 1139028:1139775 https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_3 Here is my +1 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3079) Facetiing module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055480#comment-13055480 ] Toke Eskildsen commented on LUCENE-3079: SOLR-2412/LUCENE-2369 were created with the trade-offs of (relatively) long startup, low memory use, and high performance: When the index is (re)opened, the hierarchy is analyzed by iterating the terms (it could be offloaded to index time, but it is still iterate-the-entire-term-list after each change). This does not play well with real-time, but should be a nice fit for large indexes with a low update rate. As for speed, my theory is that the sparser hierarchy (only the concrete paths) wins due to less counting, but without another solution to compare against, it has so far remained a theory. There are some measurements at https://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/ but I find that for hierarchical faceting, small changes to test setups can easily have vast implications on performance, so they are not comparable to your million-document test. Facetiing module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3079.patch
[jira] [Updated] (LUCENE-3217) Improve DocValues merging
[ https://issues.apache.org/jira/browse/LUCENE-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-3217: Attachment: LUCENE-3217.patch here is a patch for int variant. All fixed int variants are merged without loading them into memory and bulk merged if no deleted docs are present. Improve DocValues merging - Key: LUCENE-3217 URL: https://issues.apache.org/jira/browse/LUCENE-3217 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3217.patch Some DocValues impl. still load all values from merged segments into memory during merge. For efficiency we should merge them on the fly without buffering in memory -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3179) OpenBitSet.prevSetBit()
[ https://issues.apache.org/jira/browse/LUCENE-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055506#comment-13055506 ] Michael McCandless commented on LUCENE-3179: Patch looks good Uwe -- thanks! OpenBitSet.prevSetBit() --- Key: LUCENE-3179 URL: https://issues.apache.org/jira/browse/LUCENE-3179 Project: Lucene - Java Issue Type: Improvement Reporter: Paul Elschot Assignee: Paul Elschot Priority: Minor Fix For: 3.3, 4.0 Attachments: LUCENE-3179-fix.patch, LUCENE-3179-fix.patch, LUCENE-3179-long-ntz.patch, LUCENE-3179-long-ntz.patch, LUCENE-3179.patch, LUCENE-3179.patch, LUCENE-3179.patch, TestBitUtil.java, TestOpenBitSet.patch Find a previous set bit in an OpenBitSet. Useful for parent testing in nested document query execution (LUCENE-2454).
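For readers curious what prevSetBit() does, here is a minimal stand-alone sketch over a long[] bit set. This is illustrative code, not the actual OpenBitSet patch, and the caller is assumed to keep fromIndex within bounds (fromIndex < bits.length * 64).

```java
public class PrevSetBitSketch {
    // Return the largest set bit index <= fromIndex, or -1 if none.
    public static int prevSetBit(long[] bits, int fromIndex) {
        if (fromIndex < 0) return -1;
        int word = fromIndex >> 6;   // which long holds fromIndex
        int bit = fromIndex & 63;    // bit position within that long
        // Mask off bits above fromIndex in the first word examined
        // (bit == 63 needs the full mask: 1L << 64 would wrap in Java).
        long w = bits[word] & (bit == 63 ? -1L : (1L << (bit + 1)) - 1);
        while (true) {
            if (w != 0) {
                // Highest set bit of w, offset by the word's base index.
                return (word << 6) + 63 - Long.numberOfLeadingZeros(w);
            }
            if (word-- == 0) return -1; // ran off the low end
            w = bits[word];
        }
    }

    public static void main(String[] args) {
        long[] bits = new long[2];
        bits[0] |= 1L << 3;  // bit 3
        bits[1] |= 1L << 6;  // bit 70
        System.out.println(prevSetBit(bits, 69)); // 3
        System.out.println(prevSetBit(bits, 70)); // 70
        System.out.println(prevSetBit(bits, 2));  // -1
    }
}
```

The use case from the issue (parent testing in nested document queries, LUCENE-2454) scans backwards from a child doc to the nearest preceding parent bit, which is exactly this operation.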
[jira] [Commented] (LUCENE-3240) Move FunctionQuery, ValueSources and DocValues to Queries module
[ https://issues.apache.org/jira/browse/LUCENE-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055510#comment-13055510 ] Michael McCandless commented on LUCENE-3240: Looks great Chris! Move FunctionQuery, ValueSources and DocValues to Queries module Key: LUCENE-3240 URL: https://issues.apache.org/jira/browse/LUCENE-3240 Project: Lucene - Java Issue Type: Sub-task Components: core/search Reporter: Chris Male Fix For: 4.0 Attachments: LUCENE-3240.patch, LUCENE-3240.patch, LUCENE-3240.patch Having resolved the FunctionQuery sorting issue and moved the MutableValue classes, we can now move FunctionQuery, ValueSources and DocValues to a Queries module. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055526#comment-13055526 ] Chris Male commented on LUCENE-3079: Great contribution, Shai. What about putting it into a branch? I think it really does need a thorough review before we put it into trunk. Faceting module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3079.patch, LUCENE-3079.patch Faceting is a hugely important feature, available in Solr today but not [easily] usable by Lucene-only apps. We should fix this by creating a shared faceting module. Ideally, we factor out Solr's faceting impl, and maybe poach/merge from other impls (eg Bobo Browse). Hoss describes some important challenges we'll face in doing this (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here: {noformat} To look at faceting as a concrete example, there are big reasons faceting works so well in Solr: Solr has total control over the index, knows exactly when the index has changed to rebuild caches, has a strict schema so it can make sense of field types and pick faceting algos accordingly, has a multi-phase distributed search approach to get exact counts efficiently across multiple shards, etc... (and there are still a lot of additional enhancements and improvements that can be made to take even more advantage of the knowledge Solr has because it owns the index, that no one has had time to tackle) {noformat} This is a great list of the things we face in refactoring. It's also important because, if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same needs, and so we really have to address them in creating the shared module. I think we should get a basic faceting module started, but should not cut Solr over at first.
We should iterate on the module, fold in improvements, etc., and then, once we can fully verify that cutting over doesn't hurt Solr (i.e. lose functionality or performance), we can cut over later. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055532#comment-13055532 ] Shai Erera commented on LUCENE-3079: We can put it in a branch for trunk, in case we plan to refactor the code right away (at first I just thought to get it to compile against trunk). I thought that at first people would like to get hands-on experience with it, before we discuss changes and refactoring. I mean, this code can really be released with Lucene's next 3x release. And since everything is @lucene.experimental, and is in its own separate contrib/module, I don't think a branch will ease the review or refactoring process? Faceting module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Issue Comment Edited] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055532#comment-13055532 ] Shai Erera edited comment on LUCENE-3079 at 6/27/11 1:24 PM: - We can put it in a branch for trunk, in case we plan to refactor the code right away (at first I just thought to get it to compile against trunk). I thought that at first people would like to get hands-on experience with it, before we discuss changes and refactoring. I mean, this code can really be released with Lucene's next 3x release. And since everything is @lucene.experimental, and is in its own separate contrib/module, I don't think a branch will ease the review or refactoring process? I guess what I'm aiming for is for our users to get this feature soon. And I'm afraid that putting it in a branch will only delay it. Faceting module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055544#comment-13055544 ] Michael McCandless commented on LUCENE-1536: bq. My question: Do we really need to make the delDocs inverse in this issue? I agree, let's break this (inverting delDocs/skipDocs) into a new issue and do it first, then come back to this issue. There's still more work to do here, eg the bits should be stored inverted too (and the sparse encoding flipped). bq. The method name getNotDeletedDocs() should also be getVisibleDocs() or similar [I don't like double negation]. +1 for getVisibleDocs -- I also don't like double negation! bq. In general, reversing the delDocs might be a good idea, but we should do it separate and hard (not allow both variants implemented by IndexReader Co.). I agree it must be hard cutover -- no more getDelDocs, and getVisibleDocs is abstract in IR. bq. About the impls: FieldCacheRangeFilter can also implement getBits() directly as FieldCache is random access. It should just return an own Bits impl for the DocIdSet that checks the filtering in get(index). Ahh, right: FCRF has no trouble being random access, and it can re-use the already created matchDoc in the subclasses. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. 
This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
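The "random access filter API" being benchmarked above boils down to consulting a bitset-style get(doc) inside the matching loop, instead of advancing a filter iterator in lock-step with the scorer. A minimal sketch, using simplified stand-in interfaces rather than Lucene's actual Bits/DocIdSet types; a null filter plays the role of the filter=null control case.

```java
/** Sketch of random-access filter application: the matching loop asks
 *  the filter directly about each candidate doc, the same way deleted
 *  docs are checked today. Interfaces are simplified stand-ins. */
public class FilterAccess {
    /** Random-access view of a filter, like a bitset. */
    interface Bits {
        boolean get(int doc);
        int length();
    }

    /** Simple array-backed Bits for illustration. */
    static class ArrayBits implements Bits {
        final boolean[] b;
        ArrayBits(boolean[] b) { this.b = b; }
        public boolean get(int doc) { return b[doc]; }
        public int length() { return b.length; }
    }

    /** Counts candidates accepted by the filter; a null filter accepts
     *  everything (the "100% density" control in the benchmark). */
    static int countMatches(int[] candidateDocs, Bits filter) {
        int hits = 0;
        for (int doc : candidateDocs) {
            if (filter == null || filter.get(doc)) {
                hits++;
            }
        }
        return hits;
    }
}
```

The iterator alternative instead advances a DocIdSetIterator alongside the scorer; the benchmark's point is that for dense filters the direct get(doc) check is much cheaper.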
[jira] [Commented] (LUCENE-1889) FastVectorHighlighter: support for additional queries
[ https://issues.apache.org/jira/browse/LUCENE-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055546#comment-13055546 ] Mike Sokolov commented on LUCENE-1889: -- Robert: Thanks that sounds like good advice. I wasn't completely happy with that Pattern list anyway; really still just feeling my way around Lucene and trying random things at this point a bit. I wonder if you could comment on this possible other idea, following up on Mike M's quote above: I tried hacking up SpanScorer to see if I could get positions out of it using a custom Collector, but found that by the time a doc was reported, SpanScorer had already iterated over and dropped the positions. I was thinking of adding a Collector.collectSpans(int start, int end), and having SpanScorer call it (it would be an empty function in Collector proper) or something like that. At this point I'm wondering if it might be possible to rewrite many queries as some kind of SpanQuery (using a visitor), without the need to actually alter all the Query implementations. Is there a better way? I was also thinking it might be possible to capture and re-use positions gathered during the initial scoring episode rather than having to re-score during highlighting, but I guess that's a separate issue. Koji: Thanks for the review, but it sounds like some more iteration is needed here; for sure on RegExpQuery. I probably should have tested that a bit more carefully, although the one thing I tried (character classes) seems to work the same. FastVectorHighlighter: support for additional queries - Key: LUCENE-1889 URL: https://issues.apache.org/jira/browse/LUCENE-1889 Project: Lucene - Java Issue Type: Wish Components: modules/highlighter Reporter: Robert Muir Priority: Minor Attachments: LUCENE-1889.patch I am using fastvectorhighlighter for some strange languages and it is working well! 
One thing i noticed immediately is that many query types are not highlighted (multitermquery, multiphrasequery, etc) Here is one thing Michael M posted in the original ticket: {quote} I think a nice [eventual] model would be if we could simply re-run the scorer on the single document (using InstantiatedIndex maybe, or simply some sort of wrapper on the term vectors which are already a mini-inverted-index for a single doc), but extend the scorer API to tell us the exact term occurrences that participated in a match (which I don't think is exposed today). {quote} Due to strange requirements I am using something similar to this (but specialized to our case). I am doing strange things like forcing multitermqueries to rewrite into boolean queries so they will be highlighted, and flattening multiphrasequeries into boolean or'ed phrasequeries. I do not think these things would be 'fast', but i had a few ideas that might help: * looking at contrib/highlighter, you can support FilteredQuery in flatten() by calling getQuery() right? * maybe as a last resort, try Query.extractTerms() ? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
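The flatten() idea in the last bullet can be sketched as a recursive walk that collects primitive term queries, unwrapping FilteredQuery-style wrappers via getQuery(). The query classes below are simplified stand-ins for illustration, not Lucene's actual Query hierarchy.

```java
import java.util.Set;

/** Sketch of query flattening for highlighting: walk a query tree and
 *  collect the primitive terms so the highlighter can match them.
 *  The Query classes are simplified stand-ins, not Lucene's. */
public class QueryFlattener {
    static abstract class Query {}

    static class TermQuery extends Query {
        final String term;
        TermQuery(String term) { this.term = term; }
    }

    static class BooleanQuery extends Query {
        final Query[] clauses;
        BooleanQuery(Query... clauses) { this.clauses = clauses; }
    }

    /** Wrapper query: flattening just delegates to the wrapped query. */
    static class FilteredQuery extends Query {
        final Query inner;
        FilteredQuery(Query inner) { this.inner = inner; }
        Query getQuery() { return inner; }
    }

    static void flatten(Query q, Set<String> out) {
        if (q instanceof TermQuery) {
            out.add(((TermQuery) q).term);
        } else if (q instanceof BooleanQuery) {
            for (Query c : ((BooleanQuery) q).clauses) flatten(c, out);
        } else if (q instanceof FilteredQuery) {
            flatten(((FilteredQuery) q).getQuery(), out); // unwrap
        }
        // In real Lucene, Query.extractTerms() is the last-resort fallback.
    }
}
```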
[jira] [Updated] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3079: Attachment: LUCENE-3079.patch Just some trivial test modifications so the tests work with an unmodified LuceneTestCase: * in some cases, if an assertion failed it would print the seed... but LTC does this. * in other tests, the test wanted to repeat a random sequence, but instead of exposing LTC internals, the test just grabs random.nextLong, makes a new Random from this, and then resets it with .setSeed. Faceting module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
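The seed trick described above, drawing a seed from the shared random and then using setSeed to replay the identical sequence, looks roughly like this. It is an illustrative sketch, not the actual test code.

```java
import java.util.Random;

/** Sketch of repeating a random sequence without exposing test-framework
 *  internals: derive a private seed from the shared Random, consume the
 *  sequence, then setSeed back to the same value to replay it. */
public class RepeatableRandom {
    /** Runs the same n-draw sequence twice and returns the last value
     *  of each run; the two runs are identical by construction. */
    static long[] twoRuns(Random shared, int n) {
        long seed = shared.nextLong();   // private seed from the shared source
        Random r = new Random(seed);
        long firstRunLast = 0;
        for (int i = 0; i < n; i++) firstRunLast = r.nextLong();
        r.setSeed(seed);                 // rewind: state is exactly as at construction
        long secondRunLast = 0;
        for (int i = 0; i < n; i++) secondRunLast = r.nextLong();
        return new long[] { firstRunLast, secondRunLast };
    }
}
```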
[jira] [Commented] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055570#comment-13055570 ] Robert Muir commented on LUCENE-3079: - {quote} I guess what I'm aiming for is for our users to get this feature soon. And I'm afraid that putting it in a branch will only delay it. {quote} +1 My suggestion: # commit to branch 3.x with @experimental. # next, do a fast port to trunk; this doesn't mean heavy refactoring to take advantage of things like docvalues, just get it working correctly on trunk's APIs. # finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue. Faceting module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1889) FastVectorHighlighter: support for additional queries
[ https://issues.apache.org/jira/browse/LUCENE-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055580#comment-13055580 ] Robert Muir commented on LUCENE-1889: - Hi Mike, Simon has an issue open to make a lot of what you are talking about wrt positions easier: LUCENE-2878. In my opinion, once LUCENE-2878 is resolved, we may want to then consider adding the capability for a codec to encode the offset deltas in parallel with the positions (so it's just a stream of delta-encoded integers you read in parallel with the positions for things like highlighting). Then, highlighting would not require term vectors anymore, right? I think this would be much faster and more efficient without the space waste of term vectors, and we could prototype such a thing by encoding these ourselves into the payloads... which is close to the same, but I think ultimately optionally supporting offsets this way will be better, especially with block-oriented compression algorithms. FastVectorHighlighter: support for additional queries Key: LUCENE-1889 URL: https://issues.apache.org/jira/browse/LUCENE-1889 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
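The "stream of delta-encoded integers" idea for storing offsets in parallel with positions amounts to plain gap encoding: store the difference from the previous offset, which yields small, compressible numbers. A real codec would write packed ints; the sketch below shows only the principle, with plain int lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of delta (gap) encoding for an ascending offset stream, as a
 *  lighter alternative to term vectors for highlighting. Illustrative
 *  only; a codec would write packed/block-compressed ints. */
public class DeltaOffsets {
    /** Delta-encode an ascending sequence of offsets into gaps. */
    static List<Integer> encode(List<Integer> offsets) {
        List<Integer> deltas = new ArrayList<>();
        int prev = 0;
        for (int off : offsets) {
            deltas.add(off - prev); // small gap, compresses well
            prev = off;
        }
        return deltas;
    }

    /** Decode by running-summing the gaps back into absolute offsets. */
    static List<Integer> decode(List<Integer> deltas) {
        List<Integer> offsets = new ArrayList<>();
        int sum = 0;
        for (int d : deltas) {
            sum += d;
            offsets.add(sum);
        }
        return offsets;
    }
}
```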
[JENKINS-MAVEN] Lucene-Solr-Maven-trunk #161: POMs out of sync
Build: https://builds.apache.org/job/Lucene-Solr-Maven-trunk/161/ No tests ran. Build Log (for compile errors): [...truncated 14056 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055581#comment-13055581 ] Robert Muir commented on LUCENE-1536: - {quote} +1 for getVisibleDocs – I also don't like double negation! {quote} I agree... getVisibleDocs(), or another alternative would be getLiveDocs(). if a filter can support random access API, we should use it Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3217) Improve DocValues merging
[ https://issues.apache.org/jira/browse/LUCENE-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055582#comment-13055582 ] Simon Willnauer commented on LUCENE-3217: - I am going to commit this part of the patch soon if nobody objects. Improve DocValues merging Key: LUCENE-3217 URL: https://issues.apache.org/jira/browse/LUCENE-3217 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3079: --- Attachment: LUCENE-3079-dev-tools.patch Thanks Robert for the fix. This indeed looks better than patching LTC! Patch for dev-tools only, this time w/ Maven support too. I hope it works well :). Faceting module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055584#comment-13055584 ] Shai Erera commented on LUCENE-3079: {quote} My suggestion: 1. commit to branch 3.x with @experimental. 2. next, do a fast port to trunk, this doesnt mean heavy refactoring to take advantage of things like docvalues, just get it working correctly on trunk's APIs. 3. finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue. {quote} I agree. I'll give it a day or two before I commit, unless everyone agree it can be committed today, in which case I'll happily press the button :). Facetiing module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch Faceting is a hugely important feature, available in Solr today but not [easily] usable by Lucene-only apps. We should fix this, by creating a shared faceting module. Ideally, we factor out Solr's faceting impl, and maybe poach/merge from other impls (eg Bobo browse). Hoss describes some important challenges we'll face in doing this (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here: {noformat} To look at faceting as a concrete example, there are big the reasons faceting works so well in Solr: Solr has total control over the index, knows exactly when the index has changed to rebuild caches, has a strict schema so it can make sense of field types and pick faceting algos accordingly, has multi-phase distributed search approach to get exact counts efficiently across multiple shards, etc... 
(and there are still a lot of additional enhancements and improvements that can be made to take even more advantage of the knowledge Solr has because it owns the index, that no one has had time to tackle) {noformat} This is a great list of the things we face in refactoring. It's also important because, if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same needs and so we really have to address them in creating the shared module. I think we should get a basic faceting module started, but should not cut Solr over at first. We should iterate on the module, fold in improvements, etc., and then, once we can fully verify that cutting over doesn't hurt Solr (ie lose functionality or performance) we can later cutover. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3220) Implement various ranking models as Similarities
[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mark Nemeskey updated LUCENE-3220: Attachment: LUCENE-3220.patch Information-based model framework due to Clinchant and Gaussier added. Implement various ranking models as Similarities Key: LUCENE-3220 URL: https://issues.apache.org/jira/browse/LUCENE-3220 Project: Lucene - Java Issue Type: Sub-task Components: core/search Affects Versions: flexscoring branch Reporter: David Mark Nemeskey Assignee: David Mark Nemeskey Labels: gsoc Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch Original Estimate: 336h Remaining Estimate: 336h With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu. TODO: * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible * _BM25_: the current mock implementation might be OK * _LM_ * _DFR_ Done: -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
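As background for the ranking models listed above, here is a minimal, self-contained sketch of the classic BM25 formula. This is not the EasySimilarity code from the patch; the class, method, and parameter names are made up for illustration.

```java
// Minimal BM25 sketch, illustrating the ranking formula discussed in
// LUCENE-3220. Illustrative only; not the patch's EasySimilarity code.
class Bm25Sketch {
    // Standard free parameters.
    static final double K1 = 1.2;
    static final double B = 0.75;

    // idf for a term appearing in docFreq of numDocs documents.
    static double idf(long docFreq, long numDocs) {
        return Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score one term in one document: tf = term frequency in the doc,
    // docLen = doc length in tokens, avgLen = average doc length.
    static double score(double tf, double docLen, double avgLen,
                        long docFreq, long numDocs) {
        double norm = K1 * ((1 - B) + B * docLen / avgLen);
        return idf(docFreq, numDocs) * tf * (K1 + 1) / (tf + norm);
    }
}
```

The framework's job is essentially to supply the statistics (docFreq, numDocs, docLen, avgLen) so that each model only has to implement a formula like the one above.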
[jira] [Commented] (LUCENE-1889) FastVectorHighlighter: support for additional queries
[ https://issues.apache.org/jira/browse/LUCENE-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055599#comment-13055599 ] Mike Sokolov commented on LUCENE-1889: -- Ah, I see - that's awesome, thanks, had no idea. Yeah - I had been thinking about matching positions-offsets using the existing term vectors, which was going to be kind of unpleasant; you have to iterate by term, which you don't care about, and scan for a matching position. FastVectorHighlighter: support for additional queries - Key: LUCENE-1889 URL: https://issues.apache.org/jira/browse/LUCENE-1889 Project: Lucene - Java Issue Type: Wish Components: modules/highlighter Reporter: Robert Muir Priority: Minor Attachments: LUCENE-1889.patch I am using fastvectorhighlighter for some strange languages and it is working well! One thing i noticed immediately is that many query types are not highlighted (multitermquery, multiphrasequery, etc) Here is one thing Michael M posted in the original ticket: {quote} I think a nice [eventual] model would be if we could simply re-run the scorer on the single document (using InstantiatedIndex maybe, or simply some sort of wrapper on the term vectors which are already a mini-inverted-index for a single doc), but extend the scorer API to tell us the exact term occurrences that participated in a match (which I don't think is exposed today). {quote} Due to strange requirements I am using something similar to this (but specialized to our case). I am doing strange things like forcing multitermqueries to rewrite into boolean queries so they will be highlighted, and flattening multiphrasequeries into boolean or'ed phrasequeries. I do not think these things would be 'fast', but i had a few ideas that might help: * looking at contrib/highlighter, you can support FilteredQuery in flatten() by calling getQuery() right? * maybe as a last resort, try Query.extractTerms() ? -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1889) FastVectorHighlighter: support for additional queries
[ https://issues.apache.org/jira/browse/LUCENE-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055603#comment-13055603 ] Robert Muir commented on LUCENE-1889: - well I think Simon might be looking for feedback on LUCENE-2878, which would allow you to get at the positions and corresponding payloads. So as an experiment close to what you describe, you could play with his patch, make a TokenFilter that copies whatever offset info highlighting needs into the payload (OffsetAsPayloadFilter or something), and try to make a quick-n-dirty highlighter that uses it? It would be interesting to see what the performance is like from this versus the term vectors, besides working with all queries :) FastVectorHighlighter: support for additional queries - Key: LUCENE-1889 URL: https://issues.apache.org/jira/browse/LUCENE-1889 Project: Lucene - Java Issue Type: Wish Components: modules/highlighter Reporter: Robert Muir Priority: Minor Attachments: LUCENE-1889.patch I am using fastvectorhighlighter for some strange languages and it is working well! One thing i noticed immediately is that many query types are not highlighted (multitermquery, multiphrasequery, etc) Here is one thing Michael M posted in the original ticket: {quote} I think a nice [eventual] model would be if we could simply re-run the scorer on the single document (using InstantiatedIndex maybe, or simply some sort of wrapper on the term vectors which are already a mini-inverted-index for a single doc), but extend the scorer API to tell us the exact term occurrences that participated in a match (which I don't think is exposed today). {quote} Due to strange requirements I am using something similar to this (but specialized to our case). I am doing strange things like forcing multitermqueries to rewrite into boolean queries so they will be highlighted, and flattening multiphrasequeries into boolean or'ed phrasequeries. 
I do not think these things would be 'fast', but i had a few ideas that might help: * looking at contrib/highlighter, you can support FilteredQuery in flatten() by calling getQuery() right? * maybe as a last resort, try Query.extractTerms() ? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
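The "flattening" trick described above (rewriting compound queries down to the terms a highlighter can mark up) can be pictured with a toy query tree. The classes below are hypothetical stand-ins, not Lucene's actual Query API; note how FilteredQuery is handled by simply recursing into getQuery(), as the comment suggests.

```java
import java.util.*;

// Toy illustration of query "flattening" for highlighting: walk a query
// tree and collect the leaf terms. Made-up classes, not Lucene's Query API.
class FlattenSketch {
    interface Query {}
    static class TermQuery implements Query {
        final String term;
        TermQuery(String term) { this.term = term; }
    }
    static class BooleanQuery implements Query {
        final List<Query> clauses;
        BooleanQuery(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }
    // Stand-in for FilteredQuery: flattening just recurses into getQuery().
    static class FilteredQuery implements Query {
        final Query inner;
        FilteredQuery(Query inner) { this.inner = inner; }
        Query getQuery() { return inner; }
    }

    // Recursively collect the terms of a query tree.
    static Set<String> flatten(Query q) {
        Set<String> terms = new LinkedHashSet<>();
        collect(q, terms);
        return terms;
    }
    private static void collect(Query q, Set<String> out) {
        if (q instanceof TermQuery) {
            out.add(((TermQuery) q).term);
        } else if (q instanceof BooleanQuery) {
            for (Query c : ((BooleanQuery) q).clauses) collect(c, out);
        } else if (q instanceof FilteredQuery) {
            collect(((FilteredQuery) q).getQuery(), out);
        }
    }
}
```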
[jira] [Updated] (LUCENE-3243) FastVectorHighlighter - add position offset to FieldPhraseList.WeightedPhraseInfo.Toffs
[ https://issues.apache.org/jira/browse/LUCENE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jahangir Anwari updated LUCENE-3243: Attachment: CustomSolrHighlighter.java FastVectorHighlighter - add position offset to FieldPhraseList.WeightedPhraseInfo.Toffs --- Key: LUCENE-3243 URL: https://issues.apache.org/jira/browse/LUCENE-3243 Project: Lucene - Java Issue Type: Improvement Components: modules/highlighter Affects Versions: 3.2 Environment: Lucene 3.2 Reporter: Jahangir Anwari Priority: Minor Labels: feature, lucene Attachments: CustomSolrHighlighter.java, LUCENE-3243.patch.diff Needed to return position offsets along with highlighted snippets when using FVH for highlighting. Using the ([LUCENE-3141|https://issues.apache.org/jira/browse/LUCENE-3141]) patch I was able to get the fragInfo for a particular Phrase search. Currently the Toffs(Term offsets) class only stores the start and end offset. To get the position offset, I added the position offset information in Toffs and FieldPhraseList class. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3246) Invert IR.getDelDocs -> IR.getLiveDocs
Invert IR.getDelDocs -> IR.getLiveDocs -- Key: LUCENE-3246 URL: https://issues.apache.org/jira/browse/LUCENE-3246 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Spinoff from LUCENE-1536, where we need to fix the low level filtering we do for deleted docs to match Filters (ie, a set bit means the doc is accepted) so that filters can be pushed all the way down to the enums when possible/appropriate. This change also inverts the meaning of the first arg to TermsEnum.docs/AndPositions (renames from skipDocs to liveDocs). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3243) FastVectorHighlighter - add position offset to FieldPhraseList.WeightedPhraseInfo.Toffs
[ https://issues.apache.org/jira/browse/LUCENE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055610#comment-13055610 ] Jahangir Anwari commented on LUCENE-3243: - Hi Koji, Sorry for not elaborating more on our requirements and our implementation. Basically for every search result we needed the position (word offset) information of the search hits in the document. On the search result page, this position offset information was embedded in the search result links. When the user clicked on a search link, at the target page we would highlight the search terms using JavaScript and the position offset information. To return the position offset information along with the highlighted snippet we created a CustomSolrHighlighter (attached). Depending on the type of query the custom highlighter returns the position offset information. # Non-phrase query: Using FieldTermStack we return the term position offset for the terms in the query. # Phrase query: Using the WeightedFragInfo.fragInfos we return the term position offset for the terms in the query. But currently the Toffs (Term offsets) class only stores the start and end offset, so we updated it to store the position information as well. Answers to your questions: * *What is the position offset? Isn't it just a position?* Yes, it is just the position. * *Why is the position offset String?* Since for phrase queries (e.g. divine knowledge) the position-gap between terms == 1, WeightedPhraseInfo would only store the startOffset (i.e. 12) of the first term of the phrase and the endOffset (i.e. 29) of the phrase terms. {code} [startOffset, endOffset] divine knowledge: [(12,29)] {code} But as we needed the position information (i.e. 5,6) of all the terms, it required storing the positions of the terms of a phrase query as a String. 
{code} [startOffset, endOffset, positions] divine knowledge: [(12,29, [5,6])] {code} * *Why do you need setPositionOffset()?* setPositionOffset() is used to store the positions of consecutive terms of a phrase query. For every term of the phrase query it just appends the argument position to the current positions (i.e. [5,6]). P.S. In order to be able to override the doHighlightingByFastVectorHighlighter() method in CustomSolrHighlighter we had to change the access modifier for alternateField() and getSolrFragmentsBuilder() to protected. FastVectorHighlighter - add position offset to FieldPhraseList.WeightedPhraseInfo.Toffs --- Key: LUCENE-3243 URL: https://issues.apache.org/jira/browse/LUCENE-3243 Project: Lucene - Java Issue Type: Improvement Components: modules/highlighter Affects Versions: 3.2 Environment: Lucene 3.2 Reporter: Jahangir Anwari Priority: Minor Labels: feature, lucene Attachments: CustomSolrHighlighter.java, LUCENE-3243.patch.diff Needed to return position offsets along with highlighted snippets when using FVH for highlighting. Using the ([LUCENE-3141|https://issues.apache.org/jira/browse/LUCENE-3141]) patch I was able to get the fragInfo for a particular Phrase search. Currently the Toffs (Term offsets) class only stores the start and end offset. To get the position offset, I added the position offset information in the Toffs and FieldPhraseList classes. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
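The Toffs extension described in this comment can be pictured as a small record type holding character offsets plus token positions. This is an illustrative sketch, not the attached patch, and it deliberately uses a List&lt;Integer&gt; where the patch used a String, since that avoids re-parsing later.

```java
import java.util.*;

// Sketch of a term-offsets record that carries token positions alongside
// character offsets, mirroring the Toffs change discussed above.
// Hypothetical class, not the actual patch.
class TermOffsetsSketch {
    private final int startOffset;  // character offset of the first term
    private final int endOffset;    // character offset past the last term
    private final List<Integer> positions = new ArrayList<>();

    TermOffsetsSketch(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    // Mirrors setPositionOffset(): append the position of the next
    // consecutive term of the phrase.
    void addPosition(int position) { positions.add(position); }

    @Override public String toString() {
        return "(" + startOffset + "," + endOffset + ", " + positions + ")";
    }
}
```

For the "divine knowledge" example above, a record built with offsets 12 and 29 plus positions 5 and 6 renders as `(12,29, [5, 6])`.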
[jira] [Updated] (LUCENE-3246) Invert IR.getDelDocs -> IR.getLiveDocs
[ https://issues.apache.org/jira/browse/LUCENE-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-3246: --- Attachment: LUCENE-3246.patch Initial patch, pulled out of LUCENE-1536, plus 1) renamed IR.getNotDeletedDocs to IR.getLiveDocs, and 2) fixed IR to force subclasses to override this (removing getDeletedDocs). I think this is close, but the one thing remaining is to fix the IR impls to properly invert their del docs (now they create a NotDocs wrapper around their current bitsets). Invert IR.getDelDocs - IR.getLiveDocs -- Key: LUCENE-3246 URL: https://issues.apache.org/jira/browse/LUCENE-3246 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3246.patch Spinoff from LUCENE-1536, where we need to fix the low level filtering we do for deleted docs to match Filters (ie, a set bit means the doc is accepted) so that filters can be pushed all the way down to the enums when possible/appropriate. This change also inverts the meaning first arg to TermsEnum.docs/AndPositions (renames from skipDocs to liveDocs). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
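The "NotDocs wrapper" mentioned in the patch note can be pictured as a trivial adapter that flips a deleted-docs set into a live-docs set, giving it Filter semantics (a set bit means the doc is accepted). This is a schematic illustration with simplified types, not the committed Lucene code.

```java
// Schematic illustration of the NotDocs/NotBits wrapper idea: expose a
// deleted-docs bitset with the inverted, Filter-style meaning (set bit
// means the document is live). Illustrative only; types are simplified.
class LiveDocsSketch {
    interface Bits {
        boolean get(int index);
        int length();
    }

    // Wraps any Bits and inverts each bit.
    static class NotBits implements Bits {
        private final Bits in;
        NotBits(Bits in) { this.in = in; }
        public boolean get(int index) { return !in.get(index); }
        public int length() { return in.length(); }
    }

    // Turn a deleted-docs view into a live-docs view (null = no deletions).
    static Bits liveDocs(Bits delDocs) {
        return delDocs == null ? null : new NotBits(delDocs);
    }
}
```

The remaining work described above is to make readers store the live-docs set directly instead of paying for a wrapper like this at query time.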
[jira] [Issue Comment Edited] (LUCENE-3243) FastVectorHighlighter - add position offset to FieldPhraseList.WeightedPhraseInfo.Toffs
[ https://issues.apache.org/jira/browse/LUCENE-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055610#comment-13055610 ] Jahangir Anwari edited comment on LUCENE-3243 at 6/27/11 4:01 PM: -- Hi Koji, Sorry for not elaborating more on our requirements and our implementation. Basically for every search result we needed the position (word offset) information of the search hits in the document. On the search result page, this position offset information was embedded in the search result links. When the user clicked on a search link, at the target page we would highlight the search terms using JavaScript and the position offset information. To return the position offset information along with the highlighted snippet we created a CustomSolrHighlighter (attached). Depending on the type of query the custom highlighter returns the position offset information. # Non-phrase query: Using FieldTermStack we return the term position offset for the terms in the query. # Phrase query: Using the WeightedFragInfo.fragInfos we return the term position offset for the terms in the query. But currently the Toffs (Term offsets) class only stores the start and end offset, so we updated it to store the position information as well. Answers to your questions: * *What is the position offset? Isn't it just a position?* Yes, it is just the position. * *Why is the position offset String?* Since for phrase queries (e.g. divine knowledge) the position-gap between terms == 1, WeightedPhraseInfo would only store the startOffset (i.e. 12) of the first term of the phrase and the endOffset (i.e. 29) of the phrase terms. {code} [startOffset, endOffset] divine knowledge: [(12,29)] {code} But as we needed the position information (i.e. 5,6) of all the terms, it required storing the positions of the terms of a phrase query as a String. {code} [startOffset, endOffset, positions] divine knowledge: [(12,29, [5,6])] {code} * *Why do you need setPositionOffset()?* setPositionOffset() is used to store the positions of consecutive terms of a phrase query. For every term of the phrase query it just appends the argument position to the current positions (i.e. [5,6]). Example output: {code}
<lst name="/book/title/pg15">
  <arr name="para">
    <str>un of <strong class="highlight">divine knowledge</strong> and understanding, and become the recipients of a grace that is infinite and </str>
  </arr>
  <str name="positionOffsets">80,81,118,119</str>
</lst>
{code} P.S. In order to be able to override the doHighlightingByFastVectorHighlighter() method in CustomSolrHighlighter we had to change the access modifier for alternateField() and getSolrFragmentsBuilder() to protected.
RE: [JENKINS-MAVEN] Lucene-Solr-Maven-trunk #161: POMs out of sync
This was the same misspelled common module problem. I should have run both 'ant generate-maven-artifacts' *and* 'mvn install' when I committed the (partial) fix last time... Anyway, again I've committed a fix. Going over to Jenkins now to run the trunk maven build again. 13th try's the charm? -Original Message- From: Apache Jenkins Server [mailto:jenk...@builds.apache.org] Sent: Monday, June 27, 2011 10:37 AM To: dev@lucene.apache.org Subject: [JENKINS-MAVEN] Lucene-Solr-Maven-trunk #161: POMs out of sync Build: https://builds.apache.org/job/Lucene-Solr-Maven-trunk/161/ No tests ran. Build Log (for compile errors): [...truncated 14056 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-3247) Update CompoundFile format on the website
[ https://issues.apache.org/jira/browse/LUCENE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-3247: --- Assignee: Simon Willnauer Update CompoundFile format on the website - Key: LUCENE-3247 URL: https://issues.apache.org/jira/browse/LUCENE-3247 Project: Lucene - Java Issue Type: Task Components: general/website Affects Versions: 3.4, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 3.4, 4.0 since we changed the compound file format lately we should update the website accordingly -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3247) Update CompoundFile format on the website
Update CompoundFile format on the website - Key: LUCENE-3247 URL: https://issues.apache.org/jira/browse/LUCENE-3247 Project: Lucene - Java Issue Type: Task Components: general/website Affects Versions: 3.4, 4.0 Reporter: Simon Willnauer Priority: Minor Fix For: 3.4, 4.0 since we changed the compound file format lately we should update the website accordingly -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3247) Update CompoundFile format on the website
[ https://issues.apache.org/jira/browse/LUCENE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-3247: Attachment: LUCENE-3247.patch here is a patch. Update CompoundFile format on the website - Key: LUCENE-3247 URL: https://issues.apache.org/jira/browse/LUCENE-3247 Project: Lucene - Java Issue Type: Task Components: general/website Affects Versions: 3.4, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 3.4, 4.0 Attachments: LUCENE-3247.patch since we changed the compound file format lately we should update the website accordingly -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Updating the website
hey folks, I tried to update the website yesterday and ran into some problems with permissions etc. I talked to the infra guys, who helped me fix it. Yet, the fact that we are relying on Grant's cron job bugs me a little. It seems that we are doing things not the Apache way, where you just go into people.apache.org:/www/lucene.apache.org and run svn update. We still export stuff from certain svn paths into that directory via /home/gsingers/bin/exportLuceneDocs.sh I wonder if we can achieve the same thing by using something like svn externals (maybe I am wrong here), or whether we should change the layout of the website so we can simply run svn update there? Simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3216) Store DocValues per segment instead of per field
[ https://issues.apache.org/jira/browse/LUCENE-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-3216: Attachment: LUCENE-3216.patch next iteration, this time fixing most of the Byte variants to only write / open one file at a time. Straight variants are still missing. Store DocValues per segment instead of per field Key: LUCENE-3216 URL: https://issues.apache.org/jira/browse/LUCENE-3216 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3216.patch, LUCENE-3216_floats.patch currently we are storing docvalues per field which results in at least one file per field that uses docvalues (or at most two per field per segment depending on the impl.). Yet, we should try to by default pack docvalues into a single file if possible. To enable this we need to hold all docvalues in memory during indexing and write them to disk once we flush a segment. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055649#comment-13055649 ] Simon Willnauer commented on LUCENE-2793: - bq. Should IOContext and MergeInfo be in oal.store not .index? +1 bq. I think SegmentMerger should receive an IOCtx from its caller, and yeah I think we should pass the IOContext in via the ctor. Yet, for IW#addIndexes you can simply build a best effort IOContext like: {code} for (IndexReader indexReader : readers) { numDocs += indexReader.numDocs(); } final IOContext context = new IOContext(new MergeInfo(numDocs, -1, true, false)); {code} bq. I think on flush IOContext should include num docs and estimated +1 I think that is good, no? bq. Somehow, lucene/contrib/demo/data is deleted on the branch. We should check if anything else is missing! oh man... I will check. You use new IOContext(Context.FLUSH) and new IOContext(Context.READ) in your patch, but we have statics like IOContext.READ; maybe we need FLUSH too? For the tests I think we should start randomizing the IOContext. I think you should add a newIOContext(Random random) to LuceneTestCase and get the context from there in a unit test. At the end of the day we should see the same behavior whatever context you pass in, right? 
simon Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Assignee: Varun Thacker Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2793-nrt.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
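The newIOContext(Random) test helper suggested in the comment above could look roughly like this. The types here are simplified stand-ins for the IOContext/MergeInfo classes on the LUCENE-2793 branch, not the real API; the point is only that tests pick a random context and should behave identically whichever one comes back.

```java
import java.util.Random;

// Rough sketch of a newIOContext(Random) helper for randomized tests.
// Types are simplified stand-ins, not the real branch API.
class IOContextSketch {
    enum Context { READ, FLUSH, MERGE }

    static class IOContext {
        final Context context;
        IOContext(Context context) { this.context = context; }
    }

    // Return a randomly chosen context; code under test should behave
    // the same whichever context it receives.
    static IOContext newIOContext(Random random) {
        Context[] all = Context.values();
        return new IOContext(all[random.nextInt(all.length)]);
    }
}
```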
[jira] [Created] (LUCENE-3248) In BufferedIndexInput to cleanup the bufferSize variable passed down to it as the default bufferSize(BUFFER_SIZE) is always used
In BufferedIndexInput to cleanup the bufferSize variable passed down to it as the default bufferSize(BUFFER_SIZE) is always used Key: LUCENE-3248 URL: https://issues.apache.org/jira/browse/LUCENE-3248 Project: Lucene - Java Issue Type: Improvement Components: core/store Affects Versions: 4.0, IOContext branch Reporter: Varun Thacker After adding IOContext in (LUCENE-2793) we can optimize the size of all the buffers accordingly. This patch would cleanup all the unused bufferSize variables. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3247) Update CompoundFile format on the website
[ https://issues.apache.org/jira/browse/LUCENE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-3247. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [New]) committed to trunk and backported to 3.x Update CompoundFile format on the website - Key: LUCENE-3247 URL: https://issues.apache.org/jira/browse/LUCENE-3247 Project: Lucene - Java Issue Type: Task Components: general/website Affects Versions: 3.4, 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 3.4, 4.0 Attachments: LUCENE-3247.patch since we changed the compound file format lately we should update the website accordingly -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[VOTE] Drop Java 5 support for trunk (Lucene 4.0)
This issue has been discussed on various occasions and lately on LUCENE-3239 (https://issues.apache.org/jira/browse/LUCENE-3239) The main reasons for this have been discussed on the issue but let me put them out here too: - Lack of testing on Jenkins with Java 5 - Java 5 reached its end of life a long time ago, so Java 5 is totally unmaintained, which means for us that bugs have to either be hacked around, tests disabled, warnings placed, but some things simply cannot be fixed... we cannot actually support something that is no longer maintained: we do find JRE bugs (http://wiki.apache.org/lucene-java/SunJavaBugs) and it's important that bugs actually get fixed: cannot do everything with hacks. - Due to Java 5 we take legitimate performance hits, like 20% slower grouping speed. For reference please read through the issue mentioned above. A lot of the committers seem to be on the same page here to drop Java 5 support, so I am calling out an official vote. All Lucene 3.x releases will remain with Java 5 support; this vote is for trunk only. Here is my +1 Simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3246) Invert IR.getDelDocs -> IR.getLiveDocs
[ https://issues.apache.org/jira/browse/LUCENE-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3246: -- Attachment: LUCENE-3246-IndexSplitters.patch Hi Mike, some work for you: I removed the nocommits in both contrib IndexSplitters. Now only NotBits usage in core is left over, right? Invert IR.getDelDocs - IR.getLiveDocs -- Key: LUCENE-3246 URL: https://issues.apache.org/jira/browse/LUCENE-3246 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3246-IndexSplitters.patch, LUCENE-3246.patch Spinoff from LUCENE-1536, where we need to fix the low level filtering we do for deleted docs to match Filters (ie, a set bit means the doc is accepted) so that filters can be pushed all the way down to the enums when possible/appropriate. This change also inverts the meaning first arg to TermsEnum.docs/AndPositions (renames from skipDocs to liveDocs). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
My big +1. D. On Mon, Jun 27, 2011 at 7:38 PM, Simon Willnauer simon.willna...@googlemail.com wrote: [...]
RE: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
My +1 for trunk :-) I will change hudson scripts once this vote passes! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, June 27, 2011 7:38 PM To: dev@lucene.apache.org Subject: [VOTE] Drop Java 5 support for trunk (Lucene 4.0) [...]
RE: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1 -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, June 27, 2011 1:38 PM To: dev@lucene.apache.org Subject: [VOTE] Drop Java 5 support for trunk (Lucene 4.0) [...]
[jira] [Commented] (LUCENE-3246) Invert IR.getDelDocs -> IR.getLiveDocs
[ https://issues.apache.org/jira/browse/LUCENE-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055683#comment-13055683 ] Michael McCandless commented on LUCENE-3246: Awesome, thanks Uwe! I'll work on SR cutting over to live docs on disk... Invert IR.getDelDocs -> IR.getLiveDocs -- Key: LUCENE-3246 URL: https://issues.apache.org/jira/browse/LUCENE-3246 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3246-IndexSplitters.patch, LUCENE-3246.patch Spinoff from LUCENE-1536, where we need to fix the low-level filtering we do for deleted docs to match Filters (i.e., a set bit means the doc is accepted) so that filters can be pushed all the way down to the enums when possible/appropriate. This change also inverts the meaning of the first arg to TermsEnum.docs/AndPositions (renames from skipDocs to liveDocs).
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1, never thought I'd see the day ;-) -Yonik http://www.lucidimagination.com On Mon, Jun 27, 2011 at 1:38 PM, Simon Willnauer simon.willna...@googlemail.com wrote: [...]
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1, never thought I'd see the day ;-) We should run Mike's super-duper graphical visualizations of average/stddev query speed for: - latest SUN 1.5, - trunk with latest SUN 1.6, - trunk with latest SUN 1.6 after upgrades to use 1.6-specific infrastructure (Arrays.copyOf, bit fiddling intrinsics). This would be interesting and maybe inspiring for folks still willing to keep 1.5 support in place ;) Dawid P.S. s/SUN/Oracle/g...
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1 On Jun 27, 2011, at 1:38 PM, Simon Willnauer wrote: [...] - Mark Miller lucidimagination.com
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1 On Mon, Jun 27, 2011 at 1:38 PM, Simon Willnauer simon.willna...@googlemail.com wrote: [...]
[jira] [Updated] (LUCENE-3245) Realtime terms dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-3245: - Attachment: LUCENE-3245.patch Here's a cut with a first implementation of the CSLM and AIA terms dictionaries. I think we're ready to benchmark writes. Realtime terms dictionary - Key: LUCENE-3245 URL: https://issues.apache.org/jira/browse/LUCENE-3245 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor Attachments: LUCENE-3245.patch, LUCENE-3245.patch, LUCENE-3245.patch For LUCENE-2312 we need a realtime terms dictionary. While ConcurrentSkipListMap may be used, it has drawbacks in terms of high object overhead which can impact GC collection times and heap memory usage. If we implement a skip list that uses primitive backing arrays, we can hopefully have a data structure that is [as] fast and memory efficient. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
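As an illustrative baseline only: a ConcurrentSkipListMap-backed terms dictionary like the one mentioned above keeps terms sorted and is safe for concurrent readers and writers, but each entry carries per-node object overhead, which is exactly the GC/heap motivation given for the primitive-array skip list. This is a made-up sketch, not the attached patch's API:

```java
import java.util.concurrent.ConcurrentSkipListMap;

public class CslmTermsDict {
    // term -> term ordinal; CSLM keeps keys sorted and is lock-free
    private final ConcurrentSkipListMap<String, Integer> termToOrd =
        new ConcurrentSkipListMap<>();

    public void add(String term, int ord) { termToOrd.put(term, ord); }

    public Integer lookup(String term) { return termToOrd.get(term); }

    // seek to the first term at or after the given one, as a TermsEnum would
    public String ceiling(String term) { return termToOrd.ceilingKey(term); }
}
```

Benchmarking writes against this baseline is what quantifies whether the primitive-backed variant pays off.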
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 9121 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/9121/

1 tests failed.

REGRESSION: org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback

Error Message: file 1.fnx was already written to

Stack Trace:
java.io.IOException: file 1.fnx was already written to
	at org.apache.lucene.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:347)
	at org.apache.lucene.index.SegmentInfos.writeGlobalFieldMap(SegmentInfos.java:817)
	at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:305)
	at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:813)
	at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:3789)
	at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2649)
	at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2720)
	at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1074)
	at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2041)
	at org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:1964)
	at org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback(TestAddIndexes.java:929)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1430)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1348)

Build Log (for compile errors): [...truncated 3426 lines...]
[jira] [Commented] (SOLR-2619) two sfields in geospatial search
[ https://issues.apache.org/jira/browse/SOLR-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055720#comment-13055720 ] jose rodriguez commented on SOLR-2619: -- Hi David, thanks for your reply. When I said it works for me, it's because I tried hundreds of other possibilities without success. I was trying to run it all from q= _query_:{} _query_:{} and very, very large ones, etc. If I understand correctly, could I have both {!geofilt} in q? Is there a better way to do my query than q={!geofilt sfield=location_1}fq={!geofilt sfield=location_2} in this case? Thanks. two sfields in geospatial search Key: SOLR-2619 URL: https://issues.apache.org/jira/browse/SOLR-2619 Project: Solr Issue Type: Wish Components: clients - php Affects Versions: 3.2 Environment: Using with drupal Reporter: jose rodriguez Fix For: 3.2 Is it possible to create a query with two sfields (geospatial search)? I mean two different pt and d values for each field. If I need from - to, then I need fields around the from coordinate and around the to coordinate. Thanks.
[jira] [Updated] (LUCENE-2341) explore morfologik integration
[ https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michał Dybizbański updated LUCENE-2341: --- Attachment: morfologik-polish-1.5.2.jar morfologik-stemming-1.5.2.jar morfologik-fsa-1.5.2.jar LUCENE-2341.diff Dawid, as you suggested, I've changed the interface to MorfologikAnalyzer and MorfologikFilter to account for the changes in Morfologik 1.5.2, namely the multiple dictionaries. Both those classes' constructors now accept a PolishStemmer.DICTIONARY (instead of a languageCode String as in the previous patch). A PolishStemmer object is instantiated by MorfologikFilter, so each invocation of MorfologikAnalyzer.createComponents (which instantiates MorfologikFilter) is coupled with an individual instance of PolishStemmer. This way, sharing a MorfologikAnalyzer between separate threads is safe (even though MorfologikFilter itself isn't thread-safe) provided each thread obtains its own TokenStreamComponents through ReusableAnalyzerBase.createComponents (is this always the case? Looking at other filters, they don't look thread-safe either...) explore morfologik integration -- Key: LUCENE-2341 URL: https://issues.apache.org/jira/browse/LUCENE-2341 Project: Lucene - Java Issue Type: New Feature Components: modules/analysis Reporter: Robert Muir Assignee: Dawid Weiss Attachments: LUCENE-2341.diff, LUCENE-2341.diff, LUCENE-2341.diff, morfologik-fsa-1.5.2.jar, morfologik-polish-1.5.2.jar, morfologik-stemming-1.5.0.jar, morfologik-stemming-1.5.2.jar Dawid Weiss mentioned on LUCENE-2298 that there is another Polish stemmer available: http://sourceforge.net/projects/morfologik/ This works differently than LUCENE-2298, and ideally would be another option for users.
[jira] [Issue Comment Edited] (LUCENE-2341) explore morfologik integration
[ https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055729#comment-13055729 ] Michał Dybizbański edited comment on LUCENE-2341 at 6/27/11 8:19 PM: - Dawid, as you suggested, I've changed the interface to MorfologikAnalyzer and MorfologikFilter to account for the changes in Morfologik 1.5.2, namely the multiple dictionaries. Both those classes' constructors now accept a PolishStemmer.DICTIONARY (instead of a languageCode String as in the previous patch). A PolishStemmer object is instantiated by MorfologikFilter, so each invocation of MorfologikAnalyzer.createComponents (which instantiates MorfologikFilter) is coupled with an individual instance of PolishStemmer. This way, sharing a MorfologikAnalyzer between separate threads is safe (even though MorfologikFilter itself isn't thread-safe) provided each thread obtains its own TokenStreamComponents through ReusableAnalyzerBase.createComponents (is this always the case? Looking at other filters, they don't look thread-safe either...) was (Author: michcio): [...]
explore morfologik integration -- Key: LUCENE-2341 URL: https://issues.apache.org/jira/browse/LUCENE-2341 Project: Lucene - Java Issue Type: New Feature Components: modules/analysis Reporter: Robert Muir Assignee: Dawid Weiss Attachments: LUCENE-2341.diff, LUCENE-2341.diff, LUCENE-2341.diff, morfologik-fsa-1.5.2.jar, morfologik-polish-1.5.2.jar, morfologik-stemming-1.5.0.jar, morfologik-stemming-1.5.2.jar Dawid Weiss mentioned on LUCENE-2298 that there is another Polish stemmer available: http://sourceforge.net/projects/morfologik/ This works differently than LUCENE-2298, and ideally would be another option for users. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
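The per-thread sharing pattern described in the comment above — a shared analyzer object whose non-thread-safe filter components are instantiated once per thread — can be sketched as follows. This is a toy model with made-up class names and a fake stemming rule, not the Morfologik integration's real API:

```java
public class PerThreadComponents {
    // Stand-in for a non-thread-safe filter: it reuses a mutable scratch
    // buffer, so two threads must never share one instance.
    static class Stemmer {
        private final StringBuilder scratch = new StringBuilder();
        String stem(String word) {
            scratch.setLength(0);
            scratch.append(word);
            // toy rule: strip a trailing 's' (NOT real Polish stemming)
            if (scratch.length() > 1 && scratch.charAt(scratch.length() - 1) == 's') {
                scratch.setLength(scratch.length() - 1);
            }
            return scratch.toString();
        }
    }

    // Each thread lazily gets its own Stemmer, mirroring how each thread
    // obtains its own TokenStreamComponents from the shared analyzer.
    private final ThreadLocal<Stemmer> perThread = new ThreadLocal<Stemmer>() {
        @Override protected Stemmer initialValue() { return new Stemmer(); }
    };

    public String stem(String word) { return perThread.get().stem(word); }
}
```

The shared object stays safe precisely because the mutable state never crosses a thread boundary.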
[jira] [Updated] (LUCENE-3228) build should allow you (especially hudson) to refer to a local javadocs installation instead of downloading
[ https://issues.apache.org/jira/browse/LUCENE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated LUCENE-3228: - Attachment: LUCENE-3228.patch rmuir: here's a rough patch showing how the link offline stuff works (as far as I understand it anyway). Some quick testing didn't turn up any problems, but I didn't test the modules/contribs usage of invoke-javadoc. There may be cleanup we want to do - for now I avoided adding more sys properties for the package-list dirs, but maybe we want them? I dunno. There are also some existing instances of the link tag that look totally bogus and broken (see the WTF comments I added) but I didn't test what changes if I remove them. This patch should also fix SOLR-2439 (use relative links for lucene jdocs from solr jdocs). build should allow you (especially hudson) to refer to a local javadocs installation instead of downloading --- Key: LUCENE-3228 URL: https://issues.apache.org/jira/browse/LUCENE-3228 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-3228.patch Currently, we fail on all javadocs warnings. However, you get a warning if javadoc cannot download the package-list from sun.com. So I think we should optionally allow setting a sysprop using linkoffline. Then we would get far fewer fake hudson failures. I feel like Mike opened an issue for this already but I cannot find it.
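As a rough illustration of the linkoffline mechanism the patch relies on (the paths and href here are placeholders, and this is not the actual build file), Ant's javadoc task can resolve external links against a locally stored package-list instead of downloading it at build time:

```xml
<javadoc destdir="build/docs/api" sourcepath="src/java">
  <!-- offline="true" tells javadoc to read the package-list from
       packagelistLoc on disk; href is only used to render the links -->
  <link offline="true"
        href="http://download.oracle.com/javase/6/docs/api/"
        packagelistLoc="/path/to/local/jdk-package-list"/>
</javadoc>
```

With the package-list local, a flaky or unreachable sun.com no longer produces a javadoc warning, so Jenkins stops failing on it.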
[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)
[ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055737#comment-13055737 ] Martin Grotzke commented on SOLR-2583: -- bq. Looking at your test, I think it is reasonable. But I'd like to use CompactByteArray. I saw it win over HashMap and float[] at 5% and above in my test. Can you share your test code or something similar? Perhaps you can just fork https://github.com/magro/lucene-solr/ and add an appropriate test that reflects your data? Make external scoring more efficient (ExternalFileField, FileFloatSource) - Key: SOLR-2583 URL: https://issues.apache.org/jira/browse/SOLR-2583 Project: Solr Issue Type: Improvement Components: search Reporter: Martin Grotzke Priority: Minor Attachments: FileFloatSource.java.patch, patch.txt External scoring consumes a lot of memory, depending on the number of documents in the index. The ExternalFileField (used for external scoring) uses FileFloatSource, where one FileFloatSource is created per external scoring file. FileFloatSource creates a float array with the size of the number of docs (this is also done if the file to load is not found). If there are many fewer entries in the scoring file than there are docs in total, the big float array wastes a lot of memory. This could be optimized by using a map of doc -> score, so that the map contains as many entries as there are scoring entries in the external file, but not more.
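The trade-off under discussion — a dense float[maxDoc] versus a sparse map holding only the docs that actually appear in the external file — can be sketched like this. The class and method names are illustrative, not Solr's FileFloatSource API:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseScores {
    // only docs present in the external file get an entry
    private final Map<Integer, Float> scores = new HashMap<>();
    private final float defaultScore;

    public SparseScores(float defaultScore) {
        this.defaultScore = defaultScore;
    }

    public void put(int docId, float score) {
        scores.put(docId, score);
    }

    // Docs absent from the file fall back to the default score,
    // mirroring the fill value of the dense float[] it replaces.
    public float get(int docId) {
        Float f = scores.get(docId);
        return f == null ? defaultScore : f;
    }
}
```

The memory win only holds when the file is sparse relative to maxDoc: boxed map entries cost far more per doc than four bytes, which is why the comment above weighs HashMap against float[] (and CompactByteArray) at different densities.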
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1 -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 27. juni 2011, at 19.38, Simon Willnauer wrote: [...]
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1 Mike McCandless http://blog.mikemccandless.com On Mon, Jun 27, 2011 at 1:38 PM, Simon Willnauer simon.willna...@googlemail.com wrote: [...]
[JENKINS] Lucene-Solr-tests-only-3.x - Build # 9126 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/9126/

1 tests failed.

REGRESSION: org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback

Error Message: MockDirectoryWrapper: cannot close: there are still open files: {_co.cfs=1}

Stack Trace:
java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still open files: {_co.cfs=1}
	at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:483)
	at org.apache.lucene.index.TestAddIndexes$RunAddIndexesThreads.closeDir(TestAddIndexes.java:693)
	at org.apache.lucene.index.TestAddIndexes.testAddIndexesWithRollback(TestAddIndexes.java:924)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1277)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1195)
Caused by: java.lang.RuntimeException: unclosed IndexOutput: _co.cfs
	at org.apache.lucene.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:410)
	at org.apache.lucene.store.MockCompoundFileDirectoryWrapper.<init>(MockCompoundFileDirectoryWrapper.java:39)
	at org.apache.lucene.store.MockDirectoryWrapper.createCompoundOutput(MockDirectoryWrapper.java:439)
	at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:128)
	at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:3101)
	at org.apache.lucene.index.TestAddIndexes$CommitAndAddIndexes3.doBody(TestAddIndexes.java:839)
	at org.apache.lucene.index.TestAddIndexes$RunAddIndexesThreads$1.run(TestAddIndexes.java:667)

Build Log (for compile errors): [...truncated 6497 lines...]
[jira] [Commented] (SOLR-2619) two sfields in geospatial search
[ https://issues.apache.org/jira/browse/SOLR-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055788#comment-13055788 ] jose rodriguez commented on SOLR-2619: -- Excuse my English, David. What I wanted to say is that I didn't find a way to put it all into q= without using fq. I was reading about the possibility of writing it using nested queries: http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/ But everything I tried was without success. And if it is possible to use nested queries in this case... is that better than my option using fq? I'm a newbie with Solr. two sfields in geospatial search Key: SOLR-2619 URL: https://issues.apache.org/jira/browse/SOLR-2619 Project: Solr Issue Type: Wish Components: clients - php Affects Versions: 3.2 Environment: Using with drupal Reporter: jose rodriguez Fix For: 3.2 Is it possible to create a query with two sfields (geospatial search)? I mean two different pt and d values for each field. If I need from - to, then I need fields around the from coordinate and around the to coordinate. Thanks.
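For illustration, the nested-query form being asked about might look like the following, combining both geofilts inside q via _query_. The pt and d values are placeholders, and whether this beats the q/fq split depends on whether the second filter should affect scoring and filter caching:

```
q=_query_:"{!geofilt sfield=location_1 pt=45.15,-93.85 d=5}"
  AND _query_:"{!geofilt sfield=location_2 pt=44.90,-93.20 d=5}"
```

If the second location constraint is purely a filter (no scoring impact), keeping it in fq as in the original query is usually preferable, since fq results are cached independently.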
[jira] [Commented] (SOLR-2366) Facet Range Gaps
[ https://issues.apache.org/jira/browse/SOLR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055793#comment-13055793 ] Hoss Man commented on SOLR-2366: bq. Guess my main point with the examples was to suggest that a facet.range.spec should not require facet.range.start and facet.range.end, but that the first and last values in the spec list should be taken as start and end, instead of requiring start and end in addition. ... bq. Simply document that facet.range.spec is mutually exclusive to the parameters gap, start, end and other. I respect your argument, but I think if this new spec param is going to be mutually exclusive of facet.range.other as well as all of the existing mandatory facet.range params (facet.range.gap, facet.range.start, and facet.range.end), then it seems like what you're describing really shouldn't be an extension of facet.range at all... it sounds like it should be some completely distinct type of faceting (sequence faceting?) with its own params and section in the response, i.e....
{noformat}
facet.seq=fieldName
f.fieldName.facet.seq.spec=0,5,25,50,100,200,400,*
f.fieldName.facet.seq.include=edge
{noformat}
(where facet.seq.include has the same semantics as facet.range.include... except I don't think edge makes sense at all w/o the other param concept... need to think it through more) Otherwise it could get really confusing for users trying to understand which facet.range.* params do/don't make sense if they start using facet.range.gap and then switch to facet.range.spec (or vice-versa)... i.e.: how come I'm not getting the before/after ranges when I use 'facet.range.spec=0,5,25,50&facet.range.other=after'?
Facet Range Gaps Key: SOLR-2366 URL: https://issues.apache.org/jira/browse/SOLR-2366 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Priority: Minor Fix For: 3.3 Attachments: SOLR-2366.patch, SOLR-2366.patch There really is no reason why the range gap for date and numeric faceting needs to be evenly spaced. For instance, if and when SOLR-1581 is completed and one were doing spatial distance calculations, one could facet by function into 3 different sized buckets: walking distance (0-5KM), driving distance (5KM-150KM) and everything else (150KM+), for instance. We should be able to quantize the results into arbitrarily sized buckets. I'd propose the syntax to be a comma separated list of sizes for each bucket. If only one value is specified, then it behaves as it currently does. Otherwise, it creates the different size buckets. If the number of buckets doesn't evenly divide up the space, then the size of the last bucket specified is used to fill out the remaining space (not sure on this) For instance, facet.range.start=0 facet.range.end=400 facet.range.gap=5,25,50,100 would yield buckets of: 0-5,5-30,30-80,80-180,180-280,280-380,380-400 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
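The variable-gap bucketing proposed in the issue description can be sketched as plain arithmetic: each value in the gap list sizes one bucket, the last gap is reused until the end of the range, and the final bucket is clipped to end. This is a hypothetical illustration of the described semantics, not Solr code:

```java
import java.util.ArrayList;
import java.util.List;

public class VariableGaps {
    // Returns [lo, hi) bucket bounds for the given start/end and gap list.
    static List<int[]> buckets(int start, int end, int[] gaps) {
        List<int[]> out = new ArrayList<>();
        int lo = start;
        int i = 0;
        while (lo < end) {
            // reuse the last gap once the list is exhausted
            int gap = gaps[Math.min(i, gaps.length - 1)];
            int hi = Math.min(lo + gap, end); // clip the final bucket to end
            out.add(new int[] { lo, hi });
            lo = hi;
            i++;
        }
        return out;
    }
}
```

For start=0, end=400, gaps=5,25,50,100 this yields exactly the buckets listed in the description: 0-5, 5-30, 30-80, 80-180, 180-280, 280-380, 380-400.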
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1 Sanne 2011/6/27 Michael McCandless luc...@mikemccandless.com: +1 Mike McCandless http://blog.mikemccandless.com On Mon, Jun 27, 2011 at 1:38 PM, Simon Willnauer simon.willna...@googlemail.com wrote: This issue has been discussed on various occasions and lately on LUCENE-3239 (https://issues.apache.org/jira/browse/LUCENE-3239) The main reasons for this have been discussed on the issue but let me put them out here too: - Lack of testing on Jenkins with Java 5 - Java 5 end of lifetime is reached a long time ago so Java 5 is totally unmaintained which means for us that bugs have to either be hacked around, tests disabled, warnings placed, but some things simply cannot be fixed... we cannot actually support something that is no longer maintained: we do find JRE bugs (http://wiki.apache.org/lucene-java/SunJavaBugs) and its important that bugs actually get fixed: cannot do everything with hacks.\ - due to Java 5 we legitimate performance hits like 20% slower grouping speed. For reference please read through the issue mentioned above. A lot of the committers seem to be on the same page here to drop Java 5 support so I am calling out an official vote. all Lucene 3.x releases will remain with Java 5 support this vote is for trunk only. Here is my +1 Simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1 On Tue, Jun 28, 2011 at 10:04 AM, Sanne Grinovero sanne.grinov...@gmail.com wrote: +1 Sanne 2011/6/27 Michael McCandless luc...@mikemccandless.com: +1 Mike McCandless http://blog.mikemccandless.com On Mon, Jun 27, 2011 at 1:38 PM, Simon Willnauer simon.willna...@googlemail.com wrote: This issue has been discussed on various occasions and lately on LUCENE-3239 (https://issues.apache.org/jira/browse/LUCENE-3239) The main reasons for this have been discussed on the issue but let me put them out here too: - Lack of testing on Jenkins with Java 5 - Java 5 end of lifetime is reached a long time ago so Java 5 is totally unmaintained which means for us that bugs have to either be hacked around, tests disabled, warnings placed, but some things simply cannot be fixed... we cannot actually support something that is no longer maintained: we do find JRE bugs (http://wiki.apache.org/lucene-java/SunJavaBugs) and its important that bugs actually get fixed: cannot do everything with hacks.\ - due to Java 5 we legitimate performance hits like 20% slower grouping speed. For reference please read through the issue mentioned above. A lot of the committers seem to be on the same page here to drop Java 5 support so I am calling out an official vote. all Lucene 3.x releases will remain with Java 5 support this vote is for trunk only. Here is my +1 Simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Chris Male | Software Developer | JTeam BV.| www.jteam.nl
[jira] [Resolved] (SOLR-2619) two sfields in geospatial search
[ https://issues.apache.org/jira/browse/SOLR-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-2619. Resolution: Invalid Fix Version/s: (was: 3.2) there doesn't seem to actually be a concrete improvement/bug identified here. jose: if you are having difficulties understanding/using solr features, please start by posting a detailed question explaining your usecase/problem to the solr-user mailing list http://wiki.apache.org/solr/UsingMailingLists two sfields in geospatial search Key: SOLR-2619 URL: https://issues.apache.org/jira/browse/SOLR-2619 Project: Solr Issue Type: Wish Components: clients - php Affects Versions: 3.2 Environment: Using with drupal Reporter: jose rodriguez Is it possible to create a query with two sfields (geospatial search)? I mean two different pt and d values for each field. If I need a from - to query, then I need fields around the from coordinate and around the to coordinates. Thanks. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1 : Date: Mon, 27 Jun 2011 19:38:08 +0200 : From: Simon Willnauer simon.willna...@googlemail.com : Reply-To: dev@lucene.apache.org, simon.willna...@gmail.com : To: dev@lucene.apache.org : Subject: [VOTE] Drop Java 5 support for trunk (Lucene 4.0) : : This issue has been discussed on various occasions and lately on : LUCENE-3239 (https://issues.apache.org/jira/browse/LUCENE-3239) : : The main reasons for this have been discussed on the issue but let me : put them out here too: : : - Lack of testing on Jenkins with Java 5 : - Java 5 end of lifetime is reached a long time ago so Java 5 is : totally unmaintained which means for us that bugs have to either be : hacked around, tests disabled, warnings placed, but some things simply : cannot be fixed... we cannot actually support something that is no : longer maintained: we do find JRE bugs : (http://wiki.apache.org/lucene-java/SunJavaBugs) and its important : that bugs actually get fixed: cannot do everything with hacks.\ : - due to Java 5 we legitimate performance hits like 20% slower grouping speed. : : For reference please read through the issue mentioned above. : : A lot of the committers seem to be on the same page here to drop Java : 5 support so I am calling out an official vote. : : all Lucene 3.x releases will remain with Java 5 support this vote is : for trunk only. : : : Here is my +1 : : Simon : : - : To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org : For additional commands, e-mail: dev-h...@lucene.apache.org : : -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3228) build should allow you (especially hudson) to refer to a local javadocs installation instead of downloading
[ https://issues.apache.org/jira/browse/LUCENE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056084#comment-13056084 ] Robert Muir commented on LUCENE-3228: - I am glad you had the same WTF, although ant docs say its ok to use both, the current tasks in e.g. lucene have both the link attribute and nested link-without-href-wtf, and as i tried mixing linkoffline in different ways, it would appear to work, until i changed the link to javaBROKENURL.sun.com/, etc. I think we should go with this patch so we aren't downloading this junk anymore, it causes false build failures, the only trick I can think of is how to ensure lucene source releases build by themself without reaching back to dev-tools (i think this is broken on trunk at the moment, but it does work on 3.x right now) build should allow you (especially hudson) to refer to a local javadocs installation instead of downloading --- Key: LUCENE-3228 URL: https://issues.apache.org/jira/browse/LUCENE-3228 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-3228.patch Currently, we fail on all javadocs warnings. However, you get a warning if it cannot download the package-list from sun.com So I think we should allow you optionally set a sysprop using linkoffline. Then we would get much less hudson fake failures I feel like Mike opened an issue for this already but I cannot find it. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Thacker updated LUCENE-2793: -- Attachment: LUCENE-2793.patch I have made the necessary changes. I might still have missed changing a couple of test cases to use a random IOContext. I wanted to put it out so that you all can have a look as soon as possible. Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Assignee: Varun Thacker Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2793-nrt.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch, LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
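The kind of class the LUCENE-2793 description is asking for might look like the sketch below. This is illustrative only, under the assumptions stated in the issue (a context enum, a buffer size, and DIRECT/SEQUENTIAL hints); the names, buffer sizes, and factory methods are not the committed Lucene API.

```java
// Hypothetical sketch of an IOContext as described in LUCENE-2793:
// a value object passed to Directory.createOutput/openInput so the
// Directory can tune I/O per use case (merge vs. search).
public class IOContextSketch {
    public enum Context { MERGE, READ, FLUSH, DEFAULT }

    public final Context context;
    public final int readBufferSize;
    public final boolean direct;      // hint: bypass the OS buffer cache
    public final boolean sequential;  // hint: access will be sequential

    public IOContextSketch(Context context, int readBufferSize,
                           boolean direct, boolean sequential) {
        this.context = context;
        this.readBufferSize = readBufferSize;
        this.direct = direct;
        this.sequential = sequential;
    }

    // A merge reads large files front to back, so a larger buffer and
    // DIRECT/SEQUENTIAL hints make sense there (sizes are illustrative).
    public static IOContextSketch forMerge() {
        return new IOContextSketch(Context.MERGE, 64 * 1024, true, true);
    }

    // Searching does small, scattered reads and benefits from the OS cache.
    public static IOContextSketch forSearch() {
        return new IOContextSketch(Context.READ, 1024, false, false);
    }
}
```

A reader pooled for merging would then carry a different IOContext than one pooled for searching, which is exactly the IW reader-pooling fix the issue mentions.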
[jira] [Resolved] (LUCENE-3240) Move FunctionQuery, ValueSources and DocValues to Queries module
[ https://issues.apache.org/jira/browse/LUCENE-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male resolved LUCENE-3240. Resolution: Fixed Assignee: Chris Male Committed revision 1140379. I'll open a separate task to move the impls. Move FunctionQuery, ValueSources and DocValues to Queries module Key: LUCENE-3240 URL: https://issues.apache.org/jira/browse/LUCENE-3240 Project: Lucene - Java Issue Type: Sub-task Components: core/search Reporter: Chris Male Assignee: Chris Male Fix For: 4.0 Attachments: LUCENE-3240.patch, LUCENE-3240.patch, LUCENE-3240.patch Having resolved the FunctionQuery sorting issue and moved the MutableValue classes, we can now move FunctionQuery, ValueSources and DocValues to a Queries module. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3249) Move Solr's FunctionQuery impls to Queries Module
Move Solr's FunctionQuery impls to Queries Module - Key: LUCENE-3249 URL: https://issues.apache.org/jira/browse/LUCENE-3249 Project: Lucene - Java Issue Type: Sub-task Reporter: Chris Male Now that we have the main interfaces in the Queries module, we can move the actual impls over. Impls that won't be moved are: function/distance/* (to be moved to a spatial module) function/FileFloatSource.java (depends on Solr's Schema, data directories and exposes a RequestHandler) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)
[ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056224#comment-13056224 ] Koji Sekiguchi commented on SOLR-2583: -- I didn't save the test snippet because I wrote it out of my office (I used stranger's PC). What I did was just using CompactByteArray instead of CompactFloatArray in your FileFloatSourceMemoryTest.java. Make external scoring more efficient (ExternalFileField, FileFloatSource) - Key: SOLR-2583 URL: https://issues.apache.org/jira/browse/SOLR-2583 Project: Solr Issue Type: Improvement Components: search Reporter: Martin Grotzke Priority: Minor Attachments: FileFloatSource.java.patch, patch.txt External scoring eats much memory, depending on the number of documents in the index. The ExternalFileField (used for external scoring) uses FileFloatSource, where one FileFloatSource is created per external scoring file. FileFloatSource creates a float array with the size of the number of docs (this is also done if the file to load is not found). If there are much less entries in the scoring file than there are number of docs in total the big float array wastes much memory. This could be optimized by using a map of doc - score, so that the map contains as many entries as there are scoring entries in the external file, but not more. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3250) remove contrib/misc and contrib/wordnet's dependencies on analyzers module
remove contrib/misc and contrib/wordnet's dependencies on analyzers module -- Key: LUCENE-3250 URL: https://issues.apache.org/jira/browse/LUCENE-3250 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir These contribs don't actually analyze any text. After this patch, only the contrib/demo relies upon the analyzers module... we can separately try to figure that one out (I don't think any of these lucene contribs needs to reach back into modules/) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3250) remove contrib/misc and contrib/wordnet's dependencies on analyzers module
[ https://issues.apache.org/jira/browse/LUCENE-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3250: Attachment: LUCENE-3250.patch remove contrib/misc and contrib/wordnet's dependencies on analyzers module -- Key: LUCENE-3250 URL: https://issues.apache.org/jira/browse/LUCENE-3250 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Attachments: LUCENE-3250.patch These contribs don't actually analyze any text. After this patch, only the contrib/demo relies upon the analyzers module... we can separately try to figure that one out (I don't think any of these lucene contribs needs to reach back into modules/) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3250) remove contrib/misc and contrib/wordnet's dependencies on analyzers module
[ https://issues.apache.org/jira/browse/LUCENE-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056236#comment-13056236 ] Chris Male commented on LUCENE-3250: +1 remove contrib/misc and contrib/wordnet's dependencies on analyzers module -- Key: LUCENE-3250 URL: https://issues.apache.org/jira/browse/LUCENE-3250 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Attachments: LUCENE-3250.patch, LUCENE-3250_suggest.patch These contribs don't actually analyze any text. After this patch, only the contrib/demo relies upon the analyzers module... we can separately try to figure that one out (I don't think any of these lucene contribs needs to reach back into modules/) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3250) remove contrib/misc and contrib/wordnet's dependencies on analyzers module
[ https://issues.apache.org/jira/browse/LUCENE-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056244#comment-13056244 ] Robert Muir commented on LUCENE-3250: - ok, i'll commit this soon, if anyone wants to take care of the intellij/maven deps, please go for it (eclipse is one huge megaproject with all the jars in classpath so it does not know about these things) remove contrib/misc and contrib/wordnet's dependencies on analyzers module -- Key: LUCENE-3250 URL: https://issues.apache.org/jira/browse/LUCENE-3250 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Attachments: LUCENE-3250.patch, LUCENE-3250_suggest.patch These contribs don't actually analyze any text. After this patch, only the contrib/demo relies upon the analyzers module... we can separately try to figure that one out (I don't think any of these lucene contribs needs to reach back into modules/) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3250) remove contrib/misc and contrib/wordnet's dependencies on analyzers module
[ https://issues.apache.org/jira/browse/LUCENE-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-3250. - Resolution: Fixed Fix Version/s: 4.0 Assignee: Robert Muir remove contrib/misc and contrib/wordnet's dependencies on analyzers module -- Key: LUCENE-3250 URL: https://issues.apache.org/jira/browse/LUCENE-3250 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-3250.patch, LUCENE-3250_suggest.patch These contribs don't actually analyze any text. After this patch, only the contrib/demo relies upon the analyzers module... we can separately try to figure that one out (I don't think any of these lucene contribs needs to reach back into modules/) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2950) Modules under top-level modules/ directory should be included in lucene's build targets, e.g. 'package-tgz', 'package-tgz-src', and 'javadocs'
[ https://issues.apache.org/jira/browse/LUCENE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056245#comment-13056245 ] Robert Muir commented on LUCENE-2950: - just following up: the only thing in lucene reaching back into modules right now is contrib/demo... Modules under top-level modules/ directory should be included in lucene's build targets, e.g. 'package-tgz', 'package-tgz-src', and 'javadocs' -- Key: LUCENE-2950 URL: https://issues.apache.org/jira/browse/LUCENE-2950 Project: Lucene - Java Issue Type: Bug Components: general/build Affects Versions: 4.0 Reporter: Steven Rowe Priority: Blocker Fix For: 4.0 Lucene's top level {{modules/}} directory is not included in the binary or source release distribution Ant targets {{package-tgz}} and {{package-tgz-src}}, or in {{javadocs}}, in {{lucene/build.xml}}. (However, these targets do include Lucene contribs.) This issue is visible via the nightly Jenkins (formerly Hudson) job named Lucene-trunk, which publishes binary and source artifacts, using {{package-tgz}} and {{package-tgz-src}}, as well as javadocs using the {{javadocs}} target, all run from the top-level {{lucene/}} directory. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3250) remove contrib/misc and contrib/wordnet's dependencies on analyzers module
[ https://issues.apache.org/jira/browse/LUCENE-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056246#comment-13056246 ] Chris Male commented on LUCENE-3250: I'll sort out the IntelliJ and Maven deps in a moment. remove contrib/misc and contrib/wordnet's dependencies on analyzers module -- Key: LUCENE-3250 URL: https://issues.apache.org/jira/browse/LUCENE-3250 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-3250.patch, LUCENE-3250_suggest.patch These contribs don't actually analyze any text. After this patch, only the contrib/demo relies upon the analyzers module... we can separately try to figure that one out (I don't think any of these lucene contribs needs to reach back into modules/) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3249) Move Solr's FunctionQuery impls to Queries Module
[ https://issues.apache.org/jira/browse/LUCENE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male updated LUCENE-3249: --- Attachment: LUCENE-3249.patch Patch which moves the impls. Compiles and tests pass. I'd like to commit this in the next day or so. Move Solr's FunctionQuery impls to Queries Module - Key: LUCENE-3249 URL: https://issues.apache.org/jira/browse/LUCENE-3249 Project: Lucene - Java Issue Type: Sub-task Components: core/search Reporter: Chris Male Fix For: 4.0 Attachments: LUCENE-3249.patch Now that we have the main interfaces in the Queries module, we can move the actual impls over. Impls that won't be moved are: function/distance/* (to be moved to a spatial module) function/FileFloatSource.java (depends on Solr's Schema, data directories and exposes a RequestHandler) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3249) Move Solr's FunctionQuery impls to Queries Module
[ https://issues.apache.org/jira/browse/LUCENE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056251#comment-13056251 ] Chris Male commented on LUCENE-3249: Command for patch: {code} svn --parents mkdir modules/queries/src/java/org/apache/lucene/queries/function/valuesource svn --parents mkdir modules/queries/src/java/org/apache/lucene/queries/function/docvalues svn move solr/src/java/org/apache/solr/search/function/*Function.java modules/queries/src/java/org/apache/lucene/queries/function/valuesource/ svn move solr/src/java/org/apache/solr/search/function/*FieldSource.java modules/queries/src/java/org/apache/lucene/queries/function/valuesource svn move solr/src/java/org/apache/solr/search/function/*ValueSource.java modules/queries/src/java/org/apache/lucene/queries/function/valuesource svn move solr/src/java/org/apache/solr/search/function/*CacheSource.java modules/queries/src/java/org/apache/lucene/queries/function/valuesource svn move solr/src/java/org/apache/solr/search/function/ConstNumberSource.java modules/queries/src/java/org/apache/lucene/queries/function/valuesource svn move solr/src/java/org/apache/solr/search/function/*DocValues.java modules/queries/src/java/org/apache/lucene/queries/function/docvalues {code} Move Solr's FunctionQuery impls to Queries Module - Key: LUCENE-3249 URL: https://issues.apache.org/jira/browse/LUCENE-3249 Project: Lucene - Java Issue Type: Sub-task Components: core/search Reporter: Chris Male Fix For: 4.0 Attachments: LUCENE-3249.patch Now that we have the main interfaces in the Queries module, we can move the actual impls over. Impls that won't be moved are: function/distance/* (to be moved to a spatial module) function/FileFloatSource.java (depends on Solr's Schema, data directories and exposes a RequestHandler) -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Reopened] (LUCENE-3191) Add TopDocs.merge to merge multiple TopDocs
[ https://issues.apache.org/jira/browse/LUCENE-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reopened LUCENE-3191: - Reopening: this code in SlowCollatedStringComparator is totally broken: {noformat} @Override public int compareValues(BytesRef first, BytesRef second) { if (first == null) { if (second == null) { return 0; } return -1; } else if (second == null) { return 1; } else { return collator.compare(first, second); } } {noformat} I haven't tracked this issue to understand what's going on here, but you cannot pass BytesRefs to collator.compare. If this code is ever reached (and looking at the test i wrote for this damn thing, it's unclear if this code is even necessary?!), it *will* throw ClassCastException: http://download.oracle.com/javase/1.5.0/docs/api/java/text/Collator.html#compare(java.lang.Object, java.lang.Object) Add TopDocs.merge to merge multiple TopDocs --- Key: LUCENE-3191 URL: https://issues.apache.org/jira/browse/LUCENE-3191 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.3, 4.0 Attachments: LUCENE-3191-3x.patch, LUCENE-3191.patch, LUCENE-3191.patch, LUCENE-3191.patch, LUCENE-3191.patch, LUCENE-3191.patch It's not easy today to merge TopDocs, eg produced by multiple shards, supporting arbitrary Sort. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
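One way to avoid the ClassCastException Robert describes is to decode the bytes to Strings before handing them to the Collator, since java.text.Collator.compare only accepts Strings (or Objects that are Strings). The sketch below illustrates that shape with plain byte[] standing in for Lucene's BytesRef; it is not the committed fix, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Locale;

// Illustrative comparator: same null handling as the quoted snippet,
// but decodes UTF-8 bytes to String before calling Collator.compare,
// which would otherwise throw ClassCastException on non-String args.
public class CollatedCompare {
    private final Collator collator;

    public CollatedCompare(Locale locale) {
        this.collator = Collator.getInstance(locale);
    }

    public int compareValues(byte[] first, byte[] second) {
        if (first == null) {
            return second == null ? 0 : -1;
        } else if (second == null) {
            return 1;
        }
        // decode before comparing -- Collator.compare expects Strings
        String a = new String(first, StandardCharsets.UTF_8);
        String b = new String(second, StandardCharsets.UTF_8);
        return collator.compare(a, b);
    }
}
```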
[jira] [Updated] (LUCENE-3250) remove contrib/misc and contrib/wordnet's dependencies on analyzers module
[ https://issues.apache.org/jira/browse/LUCENE-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male updated LUCENE-3250: --- Attachment: LUCENE-3250.patch Patch which fixes the deps for Maven and IntelliJ. Also fixes incorrect IntelliJ dependencies on the common module, when it should be analysis-common. I'll commit. remove contrib/misc and contrib/wordnet's dependencies on analyzers module -- Key: LUCENE-3250 URL: https://issues.apache.org/jira/browse/LUCENE-3250 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-3250.patch, LUCENE-3250.patch, LUCENE-3250_suggest.patch These contribs don't actually analyze any text. After this patch, only the contrib/demo relies upon the analyzers module... we can separately try to figure that one out (I don't think any of these lucene contribs needs to reach back into modules/) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2950) Modules under top-level modules/ directory should be included in lucene's build targets, e.g. 'package-tgz', 'package-tgz-src', and 'javadocs'
[ https://issues.apache.org/jira/browse/LUCENE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056259#comment-13056259 ] Chris Male commented on LUCENE-2950: The xml-query-parser demo also reaches back to StandardAnalyzer. Does this get included in the packaging? Modules under top-level modules/ directory should be included in lucene's build targets, e.g. 'package-tgz', 'package-tgz-src', and 'javadocs' -- Key: LUCENE-2950 URL: https://issues.apache.org/jira/browse/LUCENE-2950 Project: Lucene - Java Issue Type: Bug Components: general/build Affects Versions: 4.0 Reporter: Steven Rowe Priority: Blocker Fix For: 4.0 Lucene's top level {{modules/}} directory is not included in the binary or source release distribution Ant targets {{package-tgz}} and {{package-tgz-src}}, or in {{javadocs}}, in {{lucene/build.xml}}. (However, these targets do include Lucene contribs.) This issue is visible via the nightly Jenkins (formerly Hudson) job named Lucene-trunk, which publishes binary and source artifacts, using {{package-tgz}} and {{package-tgz-src}}, as well as javadocs using the {{javadocs}} target, all run from the top-level {{lucene/}} directory. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2341) explore morfologik integration
[ https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056261#comment-13056261 ] Robert Muir commented on LUCENE-2341: - {quote} provided each thread obtains its own TokenStreamComponents through ReusableAnalyzerBase.createComponents (is this always the case ? looking at other filters, thay don't look thread-safe neither ..) {quote} yes, its the case that Analyzer/ReusableAnalyzerBase take care of this with a threadlocal, as long as each thread only needs to use one tokenstream at a time (which is true for all lucene consumers), see: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/analysis/Analyzer.java explore morfologik integration -- Key: LUCENE-2341 URL: https://issues.apache.org/jira/browse/LUCENE-2341 Project: Lucene - Java Issue Type: New Feature Components: modules/analysis Reporter: Robert Muir Assignee: Dawid Weiss Attachments: LUCENE-2341.diff, LUCENE-2341.diff, LUCENE-2341.diff, morfologik-fsa-1.5.2.jar, morfologik-polish-1.5.2.jar, morfologik-stemming-1.5.0.jar, morfologik-stemming-1.5.2.jar Dawid Weiss mentioned on LUCENE-2298 that there is another Polish stemmer available: http://sourceforge.net/projects/morfologik/ This works differently than LUCENE-2298, and ideally would be another option for users. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1 On Tue, Jun 28, 2011 at 1:50 AM, Chris Hostetter hossman_luc...@fucit.orgwrote: +1 : Date: Mon, 27 Jun 2011 19:38:08 +0200 : From: Simon Willnauer simon.willna...@googlemail.com : Reply-To: dev@lucene.apache.org, simon.willna...@gmail.com : To: dev@lucene.apache.org : Subject: [VOTE] Drop Java 5 support for trunk (Lucene 4.0) : : This issue has been discussed on various occasions and lately on : LUCENE-3239 (https://issues.apache.org/jira/browse/LUCENE-3239) : : The main reasons for this have been discussed on the issue but let me : put them out here too: : : - Lack of testing on Jenkins with Java 5 : - Java 5 end of lifetime is reached a long time ago so Java 5 is : totally unmaintained which means for us that bugs have to either be : hacked around, tests disabled, warnings placed, but some things simply : cannot be fixed... we cannot actually support something that is no : longer maintained: we do find JRE bugs : (http://wiki.apache.org/lucene-java/SunJavaBugs) and its important : that bugs actually get fixed: cannot do everything with hacks.\ : - due to Java 5 we legitimate performance hits like 20% slower grouping speed. : : For reference please read through the issue mentioned above. : : A lot of the committers seem to be on the same page here to drop Java : 5 support so I am calling out an official vote. : : all Lucene 3.x releases will remain with Java 5 support this vote is : for trunk only. : : : Here is my +1 : : Simon : : - : To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org : For additional commands, e-mail: dev-h...@lucene.apache.org : : -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2623) Solr JMX MBeans do not survive core reloads
Solr JMX MBeans do not survive core reloads
---
Key: SOLR-2623
URL: https://issues.apache.org/jira/browse/SOLR-2623
Project: Solr
Issue Type: Bug
Components: multicore
Affects Versions: 3.2, 3.1, 1.4.1, 1.4
Reporter: Alexey Serba
Priority: Minor

Solr JMX MBeans do not survive core reloads

{noformat:title=Steps to reproduce}
sh cd example
sh vi multicore/core0/conf/solrconfig.xml # enable jmx
sh java -Dcom.sun.management.jmxremote -Dsolr.solr.home=multicore -jar start.jar
sh echo 'open 8842 # 8842 is java pid
domain solr/core0
beans
' | java -jar jmxterm-1.0-alpha-4-uber.jar
solr/core0:id=core0,type=core
solr/core0:id=org.apache.solr.handler.StandardRequestHandler,type=org.apache.solr.handler.StandardRequestHandler
solr/core0:id=org.apache.solr.handler.StandardRequestHandler,type=standard
solr/core0:id=org.apache.solr.handler.XmlUpdateRequestHandler,type=/update
solr/core0:id=org.apache.solr.handler.XmlUpdateRequestHandler,type=org.apache.solr.handler.XmlUpdateRequestHandler
...
solr/core0:id=org.apache.solr.search.SolrIndexSearcher,type=searcher
solr/core0:id=org.apache.solr.update.DirectUpdateHandler2,type=updateHandler
sh curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0'
sh echo 'open 8842 # 8842 is java pid
domain solr/core0
beans
' | java -jar jmxterm-1.0-alpha-4-uber.jar
# there's only one bean left after Solr core reload
solr/core0:id=org.apache.solr.search.SolrIndexSearcher,type=Searcher@2e831a91 main
{noformat}

The root cause of this is Solr core reload behavior:
# create new core (which overwrites existing registered MBeans)
# register new core and close old one (we remove/un-register MBeans on oldCore.close)

The correct sequence is:
# unregister MBeans from old core
# create and register new core
# close old core without touching MBeans

-- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
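The corrected reload sequence from the report above (unregister old, register new, then close old without touching MBeans) can be sketched with the standard javax.management API. This is a toy illustration of the ordering, not Solr's actual JMX code; the MBean class and ObjectName are invented for the sketch.

```java
// Sketch: simulate a "core reload" against the platform MBeanServer with
// the safe ordering — unregister the old core's bean before registering
// the new one, so the replacement bean survives the old core's shutdown.
import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

public class CoreReloadMBeanSketch {
    // Standard-MBean convention: class Dummy exposes interface DummyMBean.
    public interface DummyMBean { int getId(); }
    public static class Dummy implements DummyMBean {
        final int id;
        Dummy(int id) { this.id = id; }
        public int getId() { return id; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("solr/core0:type=core,id=core0");

        server.registerMBean(new Dummy(1), name); // old core's bean

        // Correct order: unregister old first, then register the new core's
        // bean under the same name. (The buggy order registered the new bean
        // first and then let oldCore.close() unregister it.)
        server.unregisterMBean(name);
        server.registerMBean(new Dummy(2), name);

        // The new core's bean is live; closing the old core must not touch it.
        int id = (Integer) server.getAttribute(name, "Id");
        System.out.println("id=" + id);
        server.unregisterMBean(name); // cleanup
    }
}
```

Registering a second bean under an already-taken ObjectName throws InstanceAlreadyExistsException, which is why the unregister-first ordering matters.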
Issues with Grouping
The trunk has issues with grouping (NPE). I get this with or without f.hgid_i1.facet.numFacetTerms set to 1. I think it has to do with an NPE in grouping; in 4.0 it fails in other code. Thoughts?

[junit] - Standard Error -
[junit] NOTE: reproduce with: ant test -Dtestcase=NumFacetTermsFacetsTest -Dtestmethod=testNumFacetTermsFacetCounts -Dtests.seed=3921835369594659663:-3219730304883530389
[junit] *** BEGIN org.apache.solr.request.NumFacetTermsFacetsTest.testNumFacetTermsFacetCounts: Insane FieldCache usage(s) ***
[junit] SUBREADER: Found caches for descendants of DirectoryReader(segments_3 _0(4.0):C6)+hgid_i1
[junit] 'DirectoryReader(segments_3 _0(4.0):C6)'='hgid_i1',class org.apache.lucene.search.FieldCache$DocTermsIndex,org.apache.lucene.search.cache.DocTermsIndexCreator@603bb3eb=org.apache.lucene.search.cache.DocTermsIndexCreator$DocTermsIndexImpl#1026179434 (size =~ 372 bytes)
[junit] 'org.apache.lucene.index.SegmentCoreReaders@7e8905bd'='hgid_i1',int,org.apache.lucene.search.cache.IntValuesCreator@30781822=org.apache.lucene.search.cache.CachedArray$IntValues#291172425 (size =~ 92 bytes)
[junit]
[junit] *** END org.apache.solr.request.NumFacetTermsFacetsTest.testNumFacetTermsFacetCounts: Insane FieldCache usage(s) ***
[junit] ---
[junit] Testcase: testNumFacetTermsFacetCounts(org.apache.solr.request.NumFacetTermsFacetsTest): FAILED
[junit] org.apache.solr.request.NumFacetTermsFacetsTest.testNumFacetTermsFacetCounts: Insane FieldCache usage(s) found expected:<0> but was:<1>
[junit] junit.framework.AssertionFailedError: org.apache.solr.request.NumFacetTermsFacetsTest.testNumFacetTermsFacetCounts: Insane FieldCache usage(s) found expected:<0> but was:<1>
[junit] at org.apache.lucene.util.LuceneTestCase.assertSaneFieldCaches(LuceneTestCase.java:725)
[junit] at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:620)
[junit] at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:96)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1430)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1348)

assertQ("check group and facet counts with numFacetTerms=1",
    req("q", "id:[1 TO 6]",
        "indent", "on",
        "facet", "true",
        "group", "true",
        "group.field", "hgid_i1",
        "f.hgid_i1.facet.limit", "-1",
        "f.hgid_i1.facet.mincount", "1",
        "f.hgid_i1.facet.numFacetTerms", "1",
        "facet.field", "hgid_i1"),
    "*[count(//arr[@name='groups'])=1]",
    "*[count(//lst[@name='facet_fields']/lst[@name='hgid_i1']/int)=1]", // there is 1 unique item
    "//lst[@name='hgid_i1']/int[@name='numFacetTerms'][.='4']");

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
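The "Insane FieldCache usage" failure above flags the same field (hgid_i1) being cached both for the top-level DirectoryReader and for one of its segment readers, duplicating memory. A toy sketch of that kind of check, not Lucene's actual FieldCacheSanityChecker (all names here are invented for illustration), looks like this:

```java
// Sketch: detect the same field cached at both a sub-reader and its
// parent reader — the condition assertSaneFieldCaches complains about.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FieldCacheSanitySketch {
    // cache: readerKey -> fields cached for that reader.
    // parentOf: sub-reader key -> parent (top-level) reader key.
    static List<String> findInsaneEntries(Map<String, Set<String>> cache,
                                          Map<String, String> parentOf) {
        List<String> insane = new ArrayList<>();
        for (Map.Entry<String, String> e : parentOf.entrySet()) {
            Set<String> subFields = cache.getOrDefault(e.getKey(), Set.of());
            Set<String> parentFields = cache.getOrDefault(e.getValue(), Set.of());
            for (String f : subFields) {
                if (parentFields.contains(f)) {
                    insane.add(f + " cached for both " + e.getKey()
                               + " and " + e.getValue());
                }
            }
        }
        return insane;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> cache = new HashMap<>();
        cache.put("DirectoryReader", Set.of("hgid_i1"));  // top-level entry
        cache.put("SegmentReader_0", Set.of("hgid_i1"));  // segment entry
        Map<String, String> parentOf = Map.of("SegmentReader_0", "DirectoryReader");

        List<String> insane = findInsaneEntries(cache, parentOf);
        System.out.println(insane.size() + " insane entries");
    }
}
```

In the failing test, the grouping code populated the cache at the top-level reader while faceting populated it per segment, which is why the sanity check reports one insane entry instead of zero.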
[jira] [Commented] (SOLR-2623) Solr JMX MBeans do not survive core reloads
[ https://issues.apache.org/jira/browse/SOLR-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056281#comment-13056281 ] Alexey Serba commented on SOLR-2623: Related bug report in solr mailing list - http://www.lucidimagination.com/search/document/f109d695b7e5d2ae/weird_issue_with_solr_and_jconsole_jmx Solr JMX MBeans do not survive core reloads --- Key: SOLR-2623 URL: https://issues.apache.org/jira/browse/SOLR-2623 Project: Solr Issue Type: Bug Components: multicore Affects Versions: 1.4, 1.4.1, 3.1, 3.2 Reporter: Alexey Serba Priority: Minor -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2623) Solr JMX MBeans do not survive core reloads
[ https://issues.apache.org/jira/browse/SOLR-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serba updated SOLR-2623: --- Attachment: SOLR-2623.patch Added test Solr JMX MBeans do not survive core reloads --- Key: SOLR-2623 URL: https://issues.apache.org/jira/browse/SOLR-2623 Project: Solr Issue Type: Bug Components: multicore Affects Versions: 1.4, 1.4.1, 3.1, 3.2 Reporter: Alexey Serba Priority: Minor Attachments: SOLR-2623.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [VOTE] Drop Java 5 support for trunk (Lucene 4.0)
+1 On Mon, Jun 27, 2011 at 11:08 PM, Simon Willnauer simon.willna...@googlemail.com wrote: -- Regards, Shalin Shekhar Mangar. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (SOLR-2623) Solr JMX MBeans do not survive core reloads
[ https://issues.apache.org/jira/browse/SOLR-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar reassigned SOLR-2623: --- Assignee: Shalin Shekhar Mangar Solr JMX MBeans do not survive core reloads --- Key: SOLR-2623 URL: https://issues.apache.org/jira/browse/SOLR-2623 Project: Solr Issue Type: Bug Components: multicore Affects Versions: 1.4, 1.4.1, 3.1, 3.2 Reporter: Alexey Serba Assignee: Shalin Shekhar Mangar Priority: Minor Attachments: SOLR-2623.patch, SOLR-2623.patch, SOLR-2623.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org