[Lucene.Net] Problem while creating index for the xml file
Dear Lucene team,

I would like to create index files for the XML file below using the Lucene.Net DLL v2.9. I used the code below, but it is not working. Please guide me on creating index files for this XML file. Thanks in advance.

<NewsHistory>
<News>
<Story eid="34151">
<Stream>8742656</Stream>
<Identifier>KDILI00D9L36</Identifier>
<GroupIdentifier></GroupIdentifier>
<VersionType>ORIGINAL</VersionType>
<Action>ADD_1STPASS</Action>
<WireNumber>25</WireNumber>
<WireCode>BN</WireCode>
<Language>ENGLISH</Language>
<Time>20090115 13:30:00.000</Time>
<HotLevel>2</HotLevel>
<Headline>*U.S. INITIAL JOBLESS CLAIMS ROSE 54,000 TO 524,000 LAST WEEK</Headline>
<Type>PLAIN</Type>
<Text>China’s statistics bureau said it condemns leaks of economic data and those responsible</Text>
</Story>
<Story eid="34151">
<Stream>8742656</Stream>
<Identifier>KDILI03T6SQU</Identifier>
<GroupIdentifier></GroupIdentifier>
<VersionType>ORIGINAL</VersionType>
<Action>ADD_1STPASS</Action>
<WireNumber>25</WireNumber>
<WireCode>BN</WireCode>
<Language>ENGLISH</Language>
<Time>20090115 13:30:00.000</Time>
<HotLevel>0</HotLevel>
<Headline>*U.S. INITIAL JOBLESS CLAIMS ROSE 54,000 TO 524,000 LAST WEEK</Headline>
<Type>PLAIN</Type>
<Text>China’s foreign-exchange reserves exceeded $3 trillion for the first time</Text>
</Story>
</News>
</NewsHistory>

Code
===
string indexFileLocation = @"C:\Index";
Lucene.Net.Store.Directory dir = FSDirectory.GetDirectory(indexFileLocation, true);

//create an analyzer to process the text
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();

IndexWriter indexWriter = new IndexWriter(indexFileLocation, new StandardAnalyzer(), true);
TextReader txtReader = new StreamReader(@"C:\NewsMetaData.xml");

//create a document, add in a single field
Document doc = new Document();
Field fldContent = new Field("contents", txtReader, Field.TermVector.YES);
doc.Add(fldContent);

//write the document to the index
indexWriter.AddDocument(doc);

//optimize and close the writer
indexWriter.Optimize();
indexWriter.Close();
RE: [Lucene.Net] Problem while creating index for the xml file
What's the issue you're having? It seems like you're indexing the entire XML document as one field, which likely isn't the best way to go.

~P

Date: Tue, 17 May 2011 11:04:30 +0530
From: vlalithasivajyo...@gmail.com
To: lucene-net-dev@lucene.apache.org
Subject: [Lucene.Net] Problem while creating index for the xml file
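The reply's advice is to index each XML element as its own field rather than the whole file as one "contents" field. The original question is C#, but Lucene.Net mirrors the Java API, so here is a minimal, hedged Java sketch of just the field-extraction step using only stdlib DOM parsing; class and method names are hypothetical, and each resulting map would become one Lucene Document with one Field per entry:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class StoryFieldExtractor {

    // Pull one field-map out of every <Story> element. Each map would become one
    // Lucene Document, with a separate Field per entry (Identifier, Headline,
    // Text, ...), instead of one Document holding the whole file in one field.
    public static List<Map<String, String>> extract(String xml) {
        try {
            org.w3c.dom.Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Map<String, String>> docs = new ArrayList<>();
            NodeList stories = dom.getElementsByTagName("Story");
            for (int i = 0; i < stories.getLength(); i++) {
                Map<String, String> fields = new LinkedHashMap<>();
                NodeList children = stories.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    if (children.item(j) instanceof Element) {
                        Element e = (Element) children.item(j);
                        fields.put(e.getTagName(), e.getTextContent());
                    }
                }
                docs.add(fields);
            }
            return docs;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<News><Story eid=\"34151\"><Identifier>KDILI00D9L36</Identifier>"
                + "<Headline>JOBLESS CLAIMS ROSE</Headline></Story></News>";
        // One map per story; each entry becomes a separately searchable field.
        System.out.println(extract(xml));
    }
}
```

With fields split out like this, a query can target `Headline` or `Text` individually instead of matching against raw markup.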
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 8078 - Still Failing
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/8078/

No tests ran.

Build Log (for compile errors):
[...truncated 47 lines...]

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3093) Build failed in the flexscoring branch because of Javadoc warnings
[ https://issues.apache.org/jira/browse/LUCENE-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mark Nemeskey updated LUCENE-3093:
    Affects Version/s: flexscoring branch

Thanks Robert! I have added the flexscoring branch to the Affected Version/s field as well to indicate that this whole issue belongs there.

Build failed in the flexscoring branch because of Javadoc warnings
Key: LUCENE-3093
URL: https://issues.apache.org/jira/browse/LUCENE-3093
Project: Lucene - Java
Issue Type: Bug
Components: Javadocs
Affects Versions: flexscoring branch
Environment: N/A
Reporter: David Mark Nemeskey
Assignee: Robert Muir
Priority: Minor
Fix For: flexscoring branch
Attachments: LUCENE-3093.patch
Original Estimate: 1h
Remaining Estimate: 1h

Ant build log:

[javadoc] Standard Doclet version 1.6.0_24
[javadoc] Building tree for all the packages and classes...
[javadoc] /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/src/java/org/apache/lucene/search/Similarity.java:93: warning - Tag @link: can't find tf(float) in org.apache.lucene.search.Similarity
[javadoc] /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java:588: warning - @param argument term is not a parameter name.
[javadoc] /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java:588: warning - @param argument docFreq is not a parameter name.
[javadoc] /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java:618: warning - @param argument terms is not a parameter name.
[javadoc] Generating /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/org/apache/lucene/store/instantiated//package-summary.html...
[javadoc] Copying file /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/doc-files/classdiagram.png to directory /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/org/apache/lucene/store/instantiated/doc-files...
[javadoc] Copying file /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/doc-files/HitCollectionBench.jpg to directory /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/org/apache/lucene/store/instantiated/doc-files...
[javadoc] Copying file /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/doc-files/classdiagram.uxf to directory /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/org/apache/lucene/store/instantiated/doc-files...
[javadoc] Generating /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/serialized-form.html...
[javadoc] Copying file /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/prettify/stylesheet+prettify.css to file /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/stylesheet+prettify.css...
[javadoc] Building index for all the packages and classes...
[javadoc] Building index for all classes...
[javadoc] Generating /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/help-doc.html...
[javadoc] 4 warnings

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033919#comment-13033919 ]

Stanislaw Osinski commented on SOLR-2448:

Hi, if there are no objections, I'd like to commit this patch later today. Thanks! S.

Upgrade Carrot2 to version 3.5.0
Key: SOLR-2448
URL: https://issues.apache.org/jira/browse/SOLR-2448
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0
Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar

Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these.
[jira] [Updated] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM
[ https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3101:
    Attachment: LUCENE-3101_test.patch

an explicit test case

TestMinimize.testAgainstBrzozowski reproducible seed OOM
Key: LUCENE-3101
URL: https://issues.apache.org/jira/browse/LUCENE-3101
Project: Lucene - Java
Issue Type: Bug
Reporter: selckin
Assignee: Uwe Schindler
Attachments: LUCENE-3101_test.patch

{code}
[junit] Testsuite: org.apache.lucene.util.automaton.TestMinimize
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.792 sec
[junit]
[junit] - Standard Error -
[junit] NOTE: reproduce with: ant test -Dtestcase=TestMinimize -Dtestmethod=testAgainstBrzozowski -Dtests.seed=-7429820995201119781:1013305000165135537
[junit] NOTE: test params are: codec=PreFlex, locale=ru, timezone=America/Pangnirtung
[junit] NOTE: all tests run in this JVM: [TestMinimize]
[junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 (64-bit)/cpus=8,threads=1,free=294745976,total=310378496
[junit] - ---
[junit] Testcase: testAgainstBrzozowski(org.apache.lucene.util.automaton.TestMinimize): Caused an ERROR
[junit] Java heap space
[junit] java.lang.OutOfMemoryError: Java heap space
[junit] at java.util.BitSet.initWords(BitSet.java:144)
[junit] at java.util.BitSet.init(BitSet.java:139)
[junit] at org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:85)
[junit] at org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:52)
[junit] at org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:502)
[junit] at org.apache.lucene.util.automaton.RegExp.toAutomatonAllowMutate(RegExp.java:478)
[junit] at org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:428)
[junit] at org.apache.lucene.util.automaton.AutomatonTestUtil.randomAutomaton(AutomatonTestUtil.java:256)
[junit] at org.apache.lucene.util.automaton.TestMinimize.testAgainstBrzozowski(TestMinimize.java:43)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
[junit]
[junit] Test org.apache.lucene.util.automaton.TestMinimize FAILED
{code}
[jira] [Assigned] (LUCENE-3070) Enable DocValues by default for every Codec
[ https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reassigned LUCENE-3070:
    Assignee: Simon Willnauer

Enable DocValues by default for every Codec
Key: LUCENE-3070
URL: https://issues.apache.org/jira/browse/LUCENE-3070
Project: Lucene - Java
Issue Type: Task
Components: Index
Affects Versions: CSF branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Fix For: CSF branch
Attachments: LUCENE-3070.patch

Currently DocValues are enabled via a wrapper Codec, so each codec that needs DocValues must be wrapped by DocValuesCodec. The DocValues writer and reader should be moved to Codec to be enabled by default.
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033937#comment-13033937 ]

Michael McCandless commented on LUCENE-3098:

Patch looks great Martijn; thanks!

Maybe, until we work out how multiple collectors can update a single TopGroups result, we should make TopGroups' totalGroupCount changeable after the fact? Ie, add a setter? This way apps can at least do it themselves before passing the TopGroups on to consumers within the apps.

Also, could you update the code sample in package.html, showing how to also use the TotalGroupCountCollector, incl. setting this totalGroupCount in the TopGroups?

Grouped total count
Key: LUCENE-3098
URL: https://issues.apache.org/jira/browse/LUCENE-3098
Project: Lucene - Java
Issue Type: New Feature
Reporter: Martijn van Groningen
Fix For: 3.2, 4.0
Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch

When grouping you can currently get two counts:
* Total hit count, which counts all documents that matched the query.
* Total grouped hit count, which counts all documents that have been grouped in the top N groups.
Since the end user gets groups in his search result instead of plain documents, the total number of groups makes more sense as the total count in many situations.
[jira] [Commented] (LUCENE-3097) Post grouping faceting
[ https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033939#comment-13033939 ]

Michael McCandless commented on LUCENE-3097:

Thanks for the example Bill -- that makes sense!

I think, in general, the post-group faceting should act as if you had indexed a single document per group, with multi-valued fields containing the union of all field values within that group, and then done normal faceting. I believe this defines the semantics we are after for post-grouping faceting.

Post grouping faceting
Key: LUCENE-3097
URL: https://issues.apache.org/jira/browse/LUCENE-3097
Project: Lucene - Java
Issue Type: New Feature
Reporter: Martijn van Groningen
Priority: Minor
Fix For: 3.2, 4.0

This issue focuses on implementing post grouping faceting.
* How to handle multivalued fields: what field value to show with the facet.
* What the facet counts should be based on:
** Facet counts can be based on the normal documents. Ungrouped counts.
** Facet counts can be based on the groups. Grouped counts.
** Facet counts can be based on the combination of group value and facet value. Matrix counts.
And probably more implementation options.

The first two methods are implemented in the SOLR-236 patch. For the first option it calculates a DocSet based on the individual documents from the query result. For the second option it calculates a DocSet for all the most relevant documents of a group. Once the DocSet is computed, the FacetComponent and StatsComponent use the DocSet to create facets and statistics.

The last one is a bit more complex. I think it is best explained with an example. Let's say we search on travel offers:

||hotel||departure_airport||duration||
|Hotel a|AMS|5|
|Hotel a|DUS|10|
|Hotel b|AMS|5|
|Hotel b|AMS|10|

If we group by hotel and have a facet for airport, most end users expect (according to my experience, of course) the following airport facet: AMS: 2, DUS: 1. The above result can't be achieved by the first two methods. You either get counts AMS:3 and DUS:1, or 1 for both airports.
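The "one document per group with the union of its field values" semantics described above can be sketched in a few lines of plain Java (a conceptual model with hypothetical names, not the actual Lucene faceting code), reproducing the AMS: 2, DUS: 1 result from the hotel example:

```java
import java.util.*;

public class GroupedFacetCounts {

    // Post-group facet counts: count each facet value once per group, i.e. as if
    // each group were a single document whose multi-valued field holds the union
    // of the facet values of its members.
    public static Map<String, Integer> countPerGroup(List<String[]> rows) {
        // rows: {groupValue, facetValue} pairs, one per original document
        Map<String, Set<String>> valuesPerGroup = new LinkedHashMap<>();
        for (String[] row : rows) {
            valuesPerGroup.computeIfAbsent(row[0], k -> new LinkedHashSet<>())
                          .add(row[1]);
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Set<String> values : valuesPerGroup.values())
            for (String v : values)
                counts.merge(v, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> offers = Arrays.asList(
            new String[]{"Hotel a", "AMS"}, new String[]{"Hotel a", "DUS"},
            new String[]{"Hotel b", "AMS"}, new String[]{"Hotel b", "AMS"});
        System.out.println(countPerGroup(offers)); // {AMS=2, DUS=1}
    }
}
```

Per-document counting over the same rows would give AMS:3, DUS:1, which is exactly the mismatch the issue describes.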
[jira] [Commented] (LUCENE-3097) Post grouping faceting
[ https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033940#comment-13033940 ]

Martijn van Groningen commented on LUCENE-3097:

bq. If I say, facet.field=gender I would expect:

I think this can be achieved by basing the facet counts on the normal documents (ungrouped counts).

{quote} If we had Spatial, and I had lat long for each address, I would expect if I say sort=geodist() asc that it would group and then find the closest point for each grouping to return in the proper order. For example, if I was at 103 E 5th St, I would expect the output for doctorid=1 to be: {quote}

This just depends on the sort / group sort you provide. I think this should already work in the Solr trunk.

bq. If I only need the 1st point in the grouping I would expect the other points to be omitted.

This depends on the group limit you provide in the request.

Post grouping faceting
Key: LUCENE-3097
URL: https://issues.apache.org/jira/browse/LUCENE-3097
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033943#comment-13033943 ]

Martijn van Groningen commented on LUCENE-3098:

I will update both patches today. A setter in TopGroups for now seems fine to me.

Grouped total count
Key: LUCENE-3098
URL: https://issues.apache.org/jira/browse/LUCENE-3098
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033946#comment-13033946 ]

Michael McCandless commented on LUCENE-3098:

One more idea: should we add a getter to TotalGroupCountCollector so you can actually get the groups (Collection<BytesRef>) themselves...? (Ie, not just the total unique count.)

Grouped total count
Key: LUCENE-3098
URL: https://issues.apache.org/jira/browse/LUCENE-3098
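The collector being discussed, together with the proposed getter, boils down to tracking distinct group values. A minimal conceptual sketch in plain Java (hypothetical names; the real TotalGroupCountCollector works on indexed group fields, not strings):

```java
import java.util.*;

// Conceptual model of a total-group-count collector: remember the distinct group
// value of every matching document. The set's size is the grouped total count,
// and exposing the set itself is the proposed "get the groups" getter.
public class TotalGroupCountSketch {
    private final Set<String> groups = new LinkedHashSet<>();

    // called once per matching document, with that document's group value
    public void collect(String groupValue) { groups.add(groupValue); }

    public int getTotalGroupCount() { return groups.size(); }

    // the proposed getter: the unique group values, not just their count
    public Set<String> getGroups() { return Collections.unmodifiableSet(groups); }

    public static void main(String[] args) {
        TotalGroupCountSketch c = new TotalGroupCountSketch();
        for (String g : new String[]{"Hotel a", "Hotel b", "Hotel a"}) c.collect(g);
        System.out.println(c.getTotalGroupCount()); // 2
        System.out.println(c.getGroups());          // [Hotel a, Hotel b]
    }
}
```

This also makes the setter discussion concrete: a second collector computes this count separately, so TopGroups needs a way to receive it after the fact.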
[jira] [Commented] (LUCENE-3097) Post grouping faceting
[ https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033947#comment-13033947 ]

Michael McCandless commented on LUCENE-3097:

Right, gender in this example was single-valued per group.

Another way to visualize / define how post-group faceting should behave is: imagine for every facet value (ie field + value) you could define an aggregator. Today, that aggregator is just the count of how many docs had that value from the full result set. But you could instead define it to be count(distinct(doctor_id)), and then you'll get the group counts you want. (Other aggregators are conceivable -- max(relevance), min+max(prices), etc.)

Conceptually I think this also defines the post-group faceting functionality, even if we would never implement it this way (ie count(distinct(doctor_id)) would be way too costly to do naively).

Post grouping faceting
Key: LUCENE-3097
URL: https://issues.apache.org/jira/browse/LUCENE-3097
[jira] [Commented] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM
[ https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033950#comment-13033950 ]

Robert Muir commented on LUCENE-3101:

the problem appears to be splitblock[] and partition[]. these are using n^2 space... the rest of the datastructures seem ok (either just #states or sigma * #states)

these two were cut over from arraylist to bitset in revision 1026190, but it looks like they are sparse and we should use a better datastructure (just for these two, i think the other bitsets are all fine).

TestMinimize.testAgainstBrzozowski reproducible seed OOM
Key: LUCENE-3101
URL: https://issues.apache.org/jira/browse/LUCENE-3101
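The n^2 blow-up in the comment above can be made concrete with back-of-the-envelope arithmetic (an illustration only, not the actual fix): n dense java.util.BitSet instances of n bits each cost n^2 bits no matter how few bits are set, while a sparse per-state set of indices grows only with the members actually added.

```java
// Illustration of why dense per-state bitsets OOM for large automata, while a
// sparse structure (e.g. a hash set or sorted int list per state) stays small
// when the sets are mostly empty, as splitblock[]/partition[] are reported to be.
public class SparseVsDense {

    // bits allocated by n dense n-bit sets: n^2
    static long denseBits(int numStates) {
        return (long) numStates * numStates;
    }

    // entries stored by n sparse sets averaging k members each: n * k
    static long sparseEntries(int numStates, int avgMembersPerState) {
        return (long) numStates * avgMembersPerState;
    }

    public static void main(String[] args) {
        System.out.println(denseBits(100_000));        // 10000000000 bits (~1.25 GB)
        System.out.println(sparseEntries(100_000, 4)); // 400000 entries
    }
}
```

At 100,000 states the dense layout already exceeds a default JVM heap, which matches the OutOfMemoryError raised inside BitSet.initWords in the stack trace.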
[jira] [Commented] (LUCENE-3097) Post grouping faceting
[ https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033953#comment-13033953 ]

Michael McCandless commented on LUCENE-3097:

In fact, I think a very efficient way to implement post-group faceting is something like LUCENE-2454. Ie, we just have to ensure, at indexing time, that docs within the same group are adjacent, if you want to be able to count by unique group values.

Hmm... but I think this (what your identifier field is, for facet counting purposes) should be decoupled from how you group. I may group by State, for presentation purposes, but count facets by doctor_id.

Post grouping faceting
Key: LUCENE-3097
URL: https://issues.apache.org/jira/browse/LUCENE-3097
[jira] [Updated] (LUCENE-3070) Enable DocValues by default for every Codec
[ https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-3070:
    Attachment: LUCENE-3070.patch

This patch adds UOE to the PreFlex codec and makes FieldInfo#docValues transactional to prevent wrong flags if non-aborting exceptions occur. I also added some random docValues fields to RandomIndexWriter, as well as some basic checks to CheckIndex. It's not perfect yet, but it's a start.

Enable DocValues by default for every Codec
Key: LUCENE-3070
URL: https://issues.apache.org/jira/browse/LUCENE-3070
[jira] [Updated] (LUCENE-3014) comparator API for segment versions
[ https://issues.apache.org/jira/browse/LUCENE-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3014:
    Attachment: LUCENE-3014.patch

initial patch

comparator API for segment versions
Key: LUCENE-3014
URL: https://issues.apache.org/jira/browse/LUCENE-3014
Project: Lucene - Java
Issue Type: Task
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Critical
Fix For: 3.2
Attachments: LUCENE-3014.patch

See LUCENE-3012 for an example. Things get ugly if you want to use SegmentInfo.getVersion(). For example, what if we committed my patch and released 3.2, but later released 3.1.1 (would 3.1.1 then be what's written and returned by this function?). Then suddenly we have broken the index format, because we are using Strings here without a reasonable comparator API. In this case one should be able to compute whether the version is < 3.2 safely. If we don't do this, and we rely upon this version information internally in Lucene, I think we are going to break something.
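The problem the issue describes is that plain string comparison orders "3.1.1" after "3.2" lexicographically wrong for release semantics. A sketch of a dotted-version comparator (hypothetical helper class, not the committed patch) that compares numerically, segment by segment:

```java
import java.util.Comparator;

// Compares dotted version strings like "3.1.1" vs "3.2" numerically per segment,
// treating missing trailing segments as 0 (so "3.2" == "3.2.0"). Plain string
// comparison would get "3.1.1" vs "3.2" right here, but fails for e.g.
// "3.10" vs "3.2"; numeric segment comparison handles both.
public class SegmentVersionComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        String[] as = a.split("\\."), bs = b.split("\\.");
        int n = Math.max(as.length, bs.length);
        for (int i = 0; i < n; i++) {
            int ai = i < as.length ? Integer.parseInt(as[i]) : 0;
            int bi = i < bs.length ? Integer.parseInt(bs[i]) : 0;
            if (ai != bi) return Integer.compare(ai, bi);
        }
        return 0;
    }

    public static void main(String[] args) {
        SegmentVersionComparator cmp = new SegmentVersionComparator();
        System.out.println(cmp.compare("3.1.1", "3.2") < 0);  // true
        System.out.println(cmp.compare("3.2", "3.2.0") == 0); // true
        System.out.println(cmp.compare("3.10", "3.2") > 0);   // true
    }
}
```

With such a comparator, code can safely ask "is this segment's version older than 3.2?" even when a 3.1.1 release ships after 3.2.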
[jira] [Commented] (LUCENE-3070) Enable DocValues by default for every Codec
[ https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033970#comment-13033970 ]

Robert Muir commented on LUCENE-3070:

Seems like it might be a good idea in RandomIndexWriter to sometimes not add docvalues?
Re: 3.2.0 (or 3.1.1)
+1 for pushing 3.2!!

There have been discussions about porting DWPT to 3.x, but I think it's a little premature now, and I am still not sure if we should do it at all. The refactoring is pretty intense throughout all of IndexWriter, and it integrates with Flex / Codecs. I am not saying it's impossible, it's certainly doable, but I am not sure it's worth the hassle; let's rather concentrate on 4.0.

The question is whether we should backport stuff like LUCENE-2881 to 3.2 or hold off until 3.3, and should we do it at all?

simon

On Sat, May 14, 2011 at 12:30 PM, Michael McCandless luc...@mikemccandless.com wrote:
+1 for 3.2.
Mike
http://blog.mikemccandless.com

On Sat, May 14, 2011 at 12:32 AM, Shai Erera ser...@gmail.com wrote:
+1 for 3.2! And also, we should adopt that approach going forward (no more bug-fix releases for the stable branch, except for the last release before 4.0 is out). That means updating the release TODO with, e.g., not creating a branch for 3.2.x, only tagging it. When 4.0 is out, we branch 3.x.y off the last 3.x tag.
Shai

On Saturday, May 14, 2011, Ryan McKinley ryan...@gmail.com wrote:
On Fri, May 13, 2011 at 6:40 PM, Grant Ingersoll gsing...@apache.org wrote:
It's been just over 1 month since the last release. We've all said we want to get to about a 3-month release cycle (if not more often). I think this means we should start shooting for a next release sometime in June. Which, in my mind, means we should start working on wrapping up issues now, IMO. Here's what's open for 3.2:
Lucene: https://issues.apache.org/jira/browse/LUCENE/fixforversion/12316070
Solr: https://issues.apache.org/jira/browse/SOLR/fixforversion/12316172
Thoughts?
+1 for 3.2 with a new feature freeze pretty soon
[jira] [Commented] (LUCENE-3070) Enable DocValues by default for every Codec
[ https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033971#comment-13033971 ]

Simon Willnauer commented on LUCENE-3070:

bq. Seems like it might be a good idea in RandomIndexWriter to sometimes not add docvalues?

Yeah, I think we should make this per RIW session, not per document though, since we already have random DocValues types, so some docs might get docvalues_int_xyz and some might get docvalues_float_xyz fields.
[jira] [Updated] (LUCENE-3070) Enable DocValues by default for every Codec
[ https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-3070:
    Attachment: LUCENE-3070.patch

New patch: I added random DocValues to updateDocument, and randomly enable / disable docValues entirely on optimize / commit / getReader, so we also get segments that don't have docValues at all, etc. I think I will commit soon if nobody objects.
Re: 3.2.0 (or 3.1.1)
On Mon, May 16, 2011 at 1:30 PM, Robert Muir rcm...@gmail.com wrote:
On Mon, May 16, 2011 at 7:10 AM, Simon Willnauer simon.willna...@googlemail.com wrote:
The question is whether we should backport stuff like LUCENE-2881 to 3.2 or hold off until 3.3, and should we do it at all?

I think it depends solely on whether someone is willing to do the work. The only thing I would suggest is that if we did such a thing, it would really be preferred to have around 2 weeks of Hudson runs to knock out problems.

Absolutely, but I think we can safely move that to 3.3 though; I am busy with other things right now.

simon
[jira] [Updated] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM
[ https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-3101:
    Attachment: LUCENE-3101.patch

This patch reverts splitblock[], partition[] and reverse[][] to the state before r1026190; the top-level BitSets (not the ones in inner loops) are unchanged.

TestMinimize.testAgainstBrzozowski reproducible seed OOM
--------------------------------------------------------
         Key: LUCENE-3101
         URL: https://issues.apache.org/jira/browse/LUCENE-3101
     Project: Lucene - Java
  Issue Type: Bug
    Reporter: selckin
    Assignee: Uwe Schindler
 Attachments: LUCENE-3101.patch, LUCENE-3101_test.patch

{code}
[junit] Testsuite: org.apache.lucene.util.automaton.TestMinimize
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.792 sec
[junit]
[junit] ------------- Standard Error -------------
[junit] NOTE: reproduce with: ant test -Dtestcase=TestMinimize -Dtestmethod=testAgainstBrzozowski -Dtests.seed=-7429820995201119781:1013305000165135537
[junit] NOTE: test params are: codec=PreFlex, locale=ru, timezone=America/Pangnirtung
[junit] NOTE: all tests run in this JVM:
[junit] [TestMinimize]
[junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 (64-bit)/cpus=8,threads=1,free=294745976,total=310378496
[junit] ------------- ---------------- ---------------
[junit] Testcase: testAgainstBrzozowski(org.apache.lucene.util.automaton.TestMinimize): Caused an ERROR
[junit] Java heap space
[junit] java.lang.OutOfMemoryError: Java heap space
[junit] at java.util.BitSet.initWords(BitSet.java:144)
[junit] at java.util.BitSet.<init>(BitSet.java:139)
[junit] at org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:85)
[junit] at org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:52)
[junit] at org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:502)
[junit] at org.apache.lucene.util.automaton.RegExp.toAutomatonAllowMutate(RegExp.java:478)
[junit] at org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:428)
[junit] at org.apache.lucene.util.automaton.AutomatonTestUtil.randomAutomaton(AutomatonTestUtil.java:256)
[junit] at org.apache.lucene.util.automaton.TestMinimize.testAgainstBrzozowski(TestMinimize.java:43)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
[junit]
[junit] Test org.apache.lucene.util.automaton.TestMinimize FAILED
{code}
[jira] [Commented] (LUCENE-3070) Enable DocValues by default for every Codec
[ https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033977#comment-13033977 ]

Robert Muir commented on LUCENE-3070:

Looks good, I think this will help the test coverage a lot. Can you rename swtichDoDocValues to switchDoDocValues? :)
[jira] [Updated] (LUCENE-3070) Enable DocValues by default for every Codec
[ https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-3070:
    Attachment: LUCENE-3070.patch

Fixed the typo - I will commit in a second.
Moving towards Lucene 4.0
Hey folks,

we just started the discussion about Lucene 3.2 and releasing more often. Yet, I think we should also start planning for Lucene 4.0 soon. We have tons of stuff in trunk that people want to have, and we can't just keep on talking about it - we need to push this out to our users. From my perspective we should decide on at least the big outstanding issues, like:

- BulkPostings (my +1, since I want to enable positional scoring on all queries)
- DocValues (pretty close)
- FlexibleScoring (+-0; I think we should wait to see how GSoC turns out and decide then?)
- Codec support for stored fields, norms, term vectors (not sure about that, but it seems doable, at least an API plus the current impl as default)
- Realtime Search aka. searchable RAM buffer (this seems quite far off; while I would love to have it, it seems we need to push this to 4.0)

For DocValues the decision seems easy, since we are very close with that and I expect it to land by the end of June. I want to kick off the discussion here, so nothing will be set in stone really, but I think we should plan to release somewhere near the end of the year?!

simon
[jira] [Updated] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM
[ https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-3101:
    Attachment: LUCENE-3101.patch

After some perf analysis it showed that replacing the LinkedList in partition[] by a HashSet makes it faster. Order is unimportant, and the b1.remove()/b2.add() combination in the inner loop no longer uses a linear scan.
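Uwe's point about the remove()/add() combination can be sketched in isolation: LinkedList.remove(Object) scans the list linearly, while HashSet.remove is expected O(1), so moving a state between partition blocks stops depending on the block size. A toy sketch, not Lucene code:

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionMoveDemo {
    public static void main(String[] args) {
        // Moving element x from block b1 to block b2, as in Hopcroft's
        // refinement loop. With a LinkedList, remove(Object) scans the whole
        // list on every move; with a HashSet the expected cost is O(1), and
        // the order of elements inside a block does not matter to the
        // algorithm, so nothing is lost by dropping the list.
        Set<Integer> b1 = new HashSet<>();
        Set<Integer> b2 = new HashSet<>();
        for (int q = 0; q < 1000; q++) b1.add(q);

        int x = 500;
        b1.remove(x);  // expected O(1), no linear scan
        b2.add(x);

        System.out.println(b1.size() + " " + b2.size()); // 999 1
    }
}
```

The same reasoning applies to the b1.remove()/b2.add() pair the comment mentions: with a hash-based block representation, each refinement step costs time proportional to the states actually moved, not the block size.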
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3102:
    Lucene Fields: [New, Patch Available] (was: [New])

Few issues with CachingCollector
--------------------------------
         Key: LUCENE-3102
         URL: https://issues.apache.org/jira/browse/LUCENE-3102
     Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
    Reporter: Shai Erera
    Priority: Minor
     Fix For: 3.2, 4.0
 Attachments: LUCENE-3102.patch

CachingCollector (introduced in LUCENE-1421) has a few issues:

# Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out of order (depends on the Query), and thus replay(Collector) will forward document IDs out of order to a Collector that may not support it.
# It does not clear cachedScores + cachedSegs upon exceeding RAM limits.
# I think that instead of comparing curScores to null in order to determine if scores are requested, we should have a specific boolean - for clarity.
# Can this check, if (base + nextLength > maxDocsToCache) (line 168), be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?

Also:

* The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if the CachingCollector ctor already takes a boolean cacheScores? I think it's better defined explicitly than implicitly?
* Let's introduce a factory method for creating a specialized version depending on whether scoring is requested (i.e., impl the TODO in line 189).
* I think it's a useful collector which stands on its own and is not specific to grouping. Can we move it to core?
* How about using OpenBitSet instead of int[] for doc IDs?
** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries.
** NOTE: OpenBitSet can be used for in-order collection only, so we can use it only if the wrapped Collector does not support out-of-order collection.
* Do you think we can modify this Collector to not necessarily wrap another Collector? We have such a Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking we can make CachingCollector *optionally* wrap another Collector, and then someone could reuse it by setting the RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
* I think a set of dedicated unit tests for this class alone would be good.

That's it so far. Perhaps, if we do all of the above, more things will pop up.
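The int[] vs bitset trade-off in the list above is easy to quantify: an int[] costs 4 bytes per cached hit, while a bitset costs maxDoc/8 bytes regardless of how many docs matched, and it can only replay IDs in increasing order. A back-of-the-envelope sketch, using java.util.BitSet merely as a stand-in for Lucene's OpenBitSet (the numbers are illustrative, not from the issue):

```java
import java.util.BitSet;

public class CacheSizeDemo {
    public static void main(String[] args) {
        int maxDoc = 10_000_000;   // docs in the index (illustrative)
        int numHits = 5_000_000;   // matching docs to cache (illustrative)

        long intArrayBytes = 4L * numHits;  // 4 bytes per cached doc ID
        long bitSetBytes = maxDoc / 8L;     // 1 bit per doc in the index

        // With many hits the bitset wins; with few hits the int[] wins.
        System.out.println(intArrayBytes + " vs " + bitSetBytes);

        // A bitset can only replay doc IDs in increasing order, so the
        // original arrival order of out-of-order collection is lost:
        BitSet cache = new BitSet(maxDoc);
        cache.set(42);  // collected first
        cache.set(7);   // collected second, out of order
        System.out.println(cache.nextSetBit(0)); // replays 7 before 42
    }
}
```

This is exactly why the NOTE above restricts the bitset representation to in-order collection: the structure stores membership, not sequence.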
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3102:
    Attachment: LUCENE-3102.patch

The patch includes the bug fixes + a test. Still none of the items I listed after 'Also ...'; I plan to tackle those next, in subsequent patches.

Question -- perhaps we can commit these changes incrementally? I.e., after we iterate on the changes in this patch, if they are ok, commit them, then do the rest of the stuff? Or is a single commit with everything preferable?

Mike, there is another reason to separate Collector.needsScores() from cacheScores -- it is possible someone will pass a Collector which needs scores but won't want CachingCollector to 'cache' them. In that case, the wrapped Collector should be delegated setScorer instead of cachedScorer. I will leave Collector.needsScores() for a different issue though?
Re: Moving towards Lucene 4.0
> I think we should also start planning for Lucene 4.0 soon.

+1 !

I think we should focus on everything that's *infrastructure* in 4.0, so that we can develop additional features in subsequent 4.x releases. If we end up releasing 4.0 just to discover many things will need to wait for 5.0, it'll be a big loss.

So Codecs seem like *infra* to me, and can we make sure the necessary API is in place for RT Search and such? I think a lot of the new API in 4.0 is @lucene.experimental anyway? In short, if we have enough API support in 4.0 already, we can release it and develop features in 4.x releases. The only thing we should 'push' is stuff that requires serious API changes (I doubt there are many like that, maybe just Codecs support for the stuff you mentioned).

Shai

On Mon, May 16, 2011 at 2:52 PM, Simon Willnauer simon.willna...@googlemail.com wrote:
[...]
simon
[jira] [Resolved] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM
[ https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler resolved LUCENE-3101.
    Resolution: Fixed
    Fix Version/s: 4.0
    Lucene Fields: [New, Patch Available] (was: [New])

Committed revision: 1103711

Thanks Robert for help with this horrible monster!
Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml
stanislav you are a full committer afaik?!

simon

On Mon, May 16, 2011 at 2:11 PM, stanis...@apache.org wrote:
Author: stanislaw
Date: Mon May 16 12:11:57 2011
New Revision: 1103709
URL: http://svn.apache.org/viewvc?rev=1103709&view=rev
Log: Adding myself (Stanislaw Osinski) to the contrib committer list.

Modified:
lucene/java/site/docs/whoweare.html
lucene/java/site/docs/whoweare.pdf
lucene/java/site/src/documentation/content/xdocs/whoweare.xml

Modified: lucene/java/site/docs/whoweare.html
URL: http://svn.apache.org/viewvc/lucene/java/site/docs/whoweare.html?rev=1103709&r1=1103708&r2=1103709&view=diff
==============================================================================
--- lucene/java/site/docs/whoweare.html (original)
+++ lucene/java/site/docs/whoweare.html Mon May 16 12:11:57 2011
@@ -3,7 +3,7 @@
 <head>
 <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
 <meta content="Apache Forrest" name="Generator">
-<meta name="Forrest-version" content="0.9">
+<meta name="Forrest-version" content="0.8">
 <meta name="Forrest-skin-name" content="lucene">
 <title>Apache Lucene/Solr - Who We Are</title>
 <link type="text/css" href="skin/basic.css" rel="stylesheet">
@@ -343,6 +343,9 @@ document.write("Last Published: " + docu
 <b>Patrick O'Leary</b> (pjaol@...)</li>
 <li>
+<b>Stanislaw Osinski</b> (stanislaw@...)</li>
+
+<li>
 <b>Chris Male</b> (chrism@...)</li>
 <li>
@@ -355,7 +358,7 @@ document.write("Last Published: " + docu
 </div>
-<a name="N100B0"></a><a name="emeritus"></a>
+<a name="N100B5"></a><a name="emeritus"></a>
 <h2 class="boxed">Emeritus Committers</h2>
 <div class="section">
 <ul>

Modified: lucene/java/site/docs/whoweare.pdf
URL: http://svn.apache.org/viewvc/lucene/java/site/docs/whoweare.pdf?rev=1103709&r1=1103708&r2=1103709&view=diff
==============================================================================
Binary files - no diff available.

Modified: lucene/java/site/src/documentation/content/xdocs/whoweare.xml
URL: http://svn.apache.org/viewvc/lucene/java/site/src/documentation/content/xdocs/whoweare.xml?rev=1103709&r1=1103708&r2=1103709&view=diff
==============================================================================
--- lucene/java/site/src/documentation/content/xdocs/whoweare.xml (original)
+++ lucene/java/site/src/documentation/content/xdocs/whoweare.xml Mon May 16 12:11:57 2011
@@ -38,6 +38,7 @@
 <ul>
 <li><b>Wolfgang Hoschek</b> (whoschek@...)</li>
 <li><b>Patrick O'Leary</b> (pjaol@...)</li>
+<li><b>Stanislaw Osinski</b> (stanislaw@...)</li>
 <li><b>Chris Male</b> (chrism@...)</li>
 <li><b>Andi Vajda</b> (vajda@...)</li>
 <li><b>Karl Wettin</b> (kalle@...)</li>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1103711 - in /lucene/dev/trunk/lucene/src: java/org/apache/lucene/util/automaton/MinimizationOperations.java test/org/apache/lucene/util/automaton/TestMinimize.java
On Mon, May 16, 2011 at 2:15 PM, uschind...@apache.org wrote:

Author: uschindler
Date: Mon May 16 12:15:45 2011
New Revision: 1103711

URL: http://svn.apache.org/viewvc?rev=1103711&view=rev
Log:
LUCENE-3101: Fix n^2 memory usage in minimizeSchindler() ähm minimizeHopcroft()

LOL ^ ^

Modified:
    lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java
    lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestMinimize.java

Modified: lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java?rev=1103711&r1=1103710&r2=1103711&view=diff
==============================================================================
--- lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java (original)
+++ lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java Mon May 16 12:15:45 2011
@@ -30,6 +30,8 @@
 package org.apache.lucene.util.automaton;
 
 import java.util.BitSet;
+import java.util.ArrayList;
+import java.util.HashSet;
 import java.util.LinkedList;
 
 /**
@@ -72,8 +74,12 @@ final public class MinimizationOperation
     final int[] sigma = a.getStartPoints();
     final State[] states = a.getNumberedStates();
     final int sigmaLen = sigma.length, statesLen = states.length;
-    final BitSet[][] reverse = new BitSet[statesLen][sigmaLen];
-    final BitSet[] splitblock = new BitSet[statesLen], partition = new BitSet[statesLen];
+    @SuppressWarnings("unchecked") final ArrayList<State>[][] reverse =
+      (ArrayList<State>[][]) new ArrayList[statesLen][sigmaLen];
+    @SuppressWarnings("unchecked") final HashSet<State>[] partition =
+      (HashSet<State>[]) new HashSet[statesLen];
+    @SuppressWarnings("unchecked") final ArrayList<State>[] splitblock =
+      (ArrayList<State>[]) new ArrayList[statesLen];
     final int[] block = new int[statesLen];
     final StateList[][] active = new StateList[statesLen][sigmaLen];
     final StateListNode[][] active2 = new StateListNode[statesLen][sigmaLen];
@@ -82,8 +88,8 @@ final public class MinimizationOperation
     final BitSet split = new BitSet(statesLen),
       refine = new BitSet(statesLen), refine2 = new BitSet(statesLen);
     for (int q = 0; q < statesLen; q++) {
-      splitblock[q] = new BitSet(statesLen);
-      partition[q] = new BitSet(statesLen);
+      splitblock[q] = new ArrayList<State>();
+      partition[q] = new HashSet<State>();
       for (int x = 0; x < sigmaLen; x++) {
         active[q][x] = new StateList();
       }
@@ -92,23 +98,22 @@ final public class MinimizationOperation
     for (int q = 0; q < statesLen; q++) {
       final State qq = states[q];
       final int j = qq.accept ? 0 : 1;
-      partition[j].set(q);
+      partition[j].add(qq);
       block[q] = j;
       for (int x = 0; x < sigmaLen; x++) {
-        final BitSet[] r =
+        final ArrayList<State>[] r =
           reverse[qq.step(sigma[x]).number];
         if (r[x] == null)
-          r[x] = new BitSet();
-        r[x].set(q);
+          r[x] = new ArrayList<State>();
+        r[x].add(qq);
       }
     }
     // initialize active sets
     for (int j = 0; j <= 1; j++) {
-      final BitSet part = partition[j];
       for (int x = 0; x < sigmaLen; x++) {
-        for (int i = part.nextSetBit(0); i >= 0; i = part.nextSetBit(i+1)) {
-          if (reverse[i][x] != null)
-            active2[i][x] = active[j][x].add(states[i]);
+        for (final State qq : partition[j]) {
+          if (reverse[qq.number][x] != null)
+            active2[qq.number][x] = active[j][x].add(qq);
         }
       }
     }
@@ -121,18 +126,19 @@ final public class MinimizationOperation
     // process pending until fixed point
     int k = 2;
     while (!pending.isEmpty()) {
-      IntPair ip = pending.removeFirst();
+      final IntPair ip = pending.removeFirst();
       final int p = ip.n1;
       final int x = ip.n2;
       pending2.clear(x*statesLen + p);
       // find states that need to be split off their blocks
       for (StateListNode m = active[p][x].first; m != null; m = m.next) {
-        final BitSet r = reverse[m.q.number][x];
-        if (r != null) for (int i = r.nextSetBit(0); i >= 0; i = r.nextSetBit(i+1)) {
+        final ArrayList<State> r = reverse[m.q.number][x];
+        if (r != null) for (final State s : r) {
+          final int i = s.number;
           if (!split.get(i)) {
             split.set(i);
             final int j = block[i];
-            splitblock[j].set(i);
+            splitblock[j].add(s);
             if (!refine2.get(j)) {
               refine2.set(j);
               refine.set(j);
@@ -142,18 +148,19 @@ final public class MinimizationOperation
       }
       // refine blocks
       for (int j = refine.nextSetBit(0); j >=
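The commit above swaps per-state BitSets for predecessor lists. As a rough illustration (plain JDK, not the Lucene code; the class and numbers are made up for the example), a BitSet of predecessors per state costs n bits for each of n states no matter how sparse the automaton is, while a list-based reverse index stores only the transitions that actually exist:

```java
// Hypothetical illustration of the memory fix in LUCENE-3101: why a
// BitSet per state is O(n^2) while predecessor lists are O(edges).
public class ReverseEdgeMemory {
    // Bits needed if every one of n states keeps an n-bit BitSet of
    // its predecessors: n * n bits, regardless of sparsity.
    static long bitSetBits(int n) {
        return (long) n * n;
    }

    // Entries needed if each state keeps only its actual predecessors;
    // with out-degree d per state there are n * d edges in total.
    static long listEntries(int n, int outDegree) {
        return (long) n * outDegree;
    }

    public static void main(String[] args) {
        int n = 100_000; // states in the automaton (assumed size)
        int d = 2;       // assumed typical out-degree per symbol
        System.out.println("BitSet bits:  " + bitSetBits(n));       // 10^10
        System.out.println("List entries: " + listEntries(n, d));   // 2 * 10^5
    }
}
```

The quadratic term is what blew up the randomly generated test automaton in LUCENE-3101; the list representation scales with the number of transitions instead.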
[jira] [Resolved] (LUCENE-3070) Enable DocValues by default for every Codec
[ https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-3070. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [New]) Enable DocValues by default for every Codec --- Key: LUCENE-3070 URL: https://issues.apache.org/jira/browse/LUCENE-3070 Project: Lucene - Java Issue Type: Task Components: Index Affects Versions: CSF branch Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: CSF branch Attachments: LUCENE-3070.patch, LUCENE-3070.patch, LUCENE-3070.patch, LUCENE-3070.patch Currently DocValues are enable with a wrapper Codec so each codec which needs DocValues must be wrapped by DocValuesCodec. The DocValues writer and reader should be moved to Codec to be enabled by default. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Moving towards Lucene 4.0
+1 Mike http://blog.mikemccandless.com On Mon, May 16, 2011 at 7:52 AM, Simon Willnauer simon.willna...@googlemail.com wrote: Hey folks, we just started the discussion about Lucene 3.2 and releasing more often. Yet, I think we should also start planning for Lucene 4.0 soon. We have tons of stuff in trunk that people want to have and we can't just keep on talking about it - we need to push this out to our users. From my perspective we should decide on at least the big outstanding issues like: - BulkPostings (my +1 since I want to enable positional scoring on all queries) - DocValues (pretty close) - FlexibleScoring (+- 0 I think we should wait how gsoc turns out and decide then?) - Codec Support for Stored Fields, Norms TV (not sure about that but seems doable at least an API and current impl as default) - Realtime Search aka. Searchable Ram Buffer (this seems quite far though while I would love to have it it seems we need to push this to 4.0) For DocValues the decision seems easy since we are very close with that and I expect it to land until end of June. I want to kick off the discussion here so nothing will be set to stone really but I think we should plan to release somewhere near the end of the year?! simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Reopened] (LUCENE-1149) add XA transaction support
[ https://issues.apache.org/jira/browse/LUCENE-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-1149:
----------------------------------------

Sorry, you're right, this issue isn't really a dup (I've reopened it).

I was just saying that Lucene's IW APIs are already transactional, so one should be able to build a transactions layer on top. Ie, you should not have to make a new index for each transaction.

We would still need a layer that mates this up to the XA transactions API (I think?). Does anyone have a patch for this?

add XA transaction support
--------------------------

                Key: LUCENE-1149
                URL: https://issues.apache.org/jira/browse/LUCENE-1149
            Project: Lucene - Java
         Issue Type: New Feature
         Components: Index
           Reporter: robert engels

Need to add XA transaction support to Lucene. Without XA support, it is difficult to keep disparate resources (e.g. database) in sync with the Lucene index.

A review of the XA support added to Hibernate might be a good start (although Hibernate almost always uses an XA-capable backing store database).

It would be ideal to have a combined IndexReaderWriter instance, then create a XAIndexReaderWriter which wraps it.

The implementation might be as simple as an XA log file which lists the XA transaction id, and the segments XXX number(s), since Lucene already allows you to rollback to a previous version (??? for sure, or does it only allow you to abort the current commit).

If operating under an XA transaction, then no explicit commits or rollbacks should be allowed on the instance. The index would be committed during XA prepare(), and then if needed rolled back when requested. The XA commit() would be a no-op.

There is a lot more to this, but this should get the ball rolling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
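The two-phase shape described in the issue (commit during XA prepare(), cheap XA commit(), rollback on demand) can be sketched in plain Java. "TxIndex" below is a hypothetical stand-in, not Lucene's API; a real integration would delegate these methods to an IndexWriter-like resource:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of mapping an index writer to XA two-phase commit:
// prepare makes pending changes durable-but-invisible, commit publishes
// them, rollback discards everything since the last commit.
class TxIndex {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private boolean prepared = false;

    void addDocument(String doc) { pending.add(doc); }

    // XA prepare(): after this, commit must be able to succeed cheaply.
    void prepareCommit() { prepared = true; }

    // XA commit(): only legal once prepared.
    void commit() {
        if (!prepared) throw new IllegalStateException("prepareCommit first");
        committed.addAll(pending);
        pending.clear();
        prepared = false;
    }

    // XA rollback(): drop all uncommitted changes.
    void rollback() { pending.clear(); prepared = false; }

    int numDocs() { return committed.size(); }
}

public class XaSketch {
    public static void main(String[] args) {
        TxIndex index = new TxIndex();
        index.addDocument("doc1");
        index.prepareCommit();
        index.commit();                      // doc1 is now visible
        index.addDocument("doc2");
        index.rollback();                    // doc2 is discarded
        System.out.println(index.numDocs()); // 1
    }
}
```

The point of the sketch is that no new index per transaction is needed: the resource itself distinguishes prepared, committed, and rolled-back state.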
Re: Moving towards Lucene 4.0
On Mon, May 16, 2011 at 7:52 AM, Simon Willnauer simon.willna...@googlemail.com wrote: Hey folks, we just started the discussion about Lucene 3.2 and releasing more often. Yet, I think we should also start planning for Lucene 4.0 soon. We have tons of stuff in trunk that people want to have and we can't just keep on talking about it - we need to push this out to our users. From my perspective we should decide on at least the big outstanding issues like: - BulkPostings (my +1 since I want to enable positional scoring on all queries) in my own opinion, this is probably the most important to decide how to handle. I think it might not be good if we introduce a new major version branch (4.x) with flexible indexing if the postings APIs limit us from actually taking advantage of it. I think that we should look at (shai brought up a previous thread about this) when 4.x is released, 3.x goes into bugfix mode and we open up 5.x. So, we want to make sure we actually have things stable enough (from an API and flexibility perspective) that we will be able to get some life out of the 4.x series and add new features to it. I think there is a lot left to do with bulkpostings and its going to require a lot of work, but at the same time I really don't like that we have serious improvements/features in trunk (some have been there now for years) still unreleased and not yet available to users. Some other crazy ideas (just for discussion): * we could try to be more aggressive about backporting and getting more life out of 3.x, and getting some of these features to users. For example, perhaps things like DWPT, DocValues, more efficient terms index, automaton, etc could be backported safely. the advantage here is that we get the features to the users, but the disadvantage is it would be a lot of effort backporting. * we could decide that we do actually have enough flexibility now in 4.x to get several releases out of it (e.g. 
containing features like docvalues, realtime search, etc), even though we know its limited to some extent, and defer api-breakers like bulkpostings/flexscoring to 5.x. the advantage here is that we could start looking at 4.x releasing very soon, but there are some disadvantages, like forcing people have to change a lot of their code to upgrade but for less gain, and potentially limiting ourselves in the 4.x branch by its APIs. * we could do nothing at all, and keep going like we are going now, deciding that we are actually getting enough useful features into 3.x releases that its ok for us to block 4.0 on some of these tougher issues like bulkpostings. The disadvantage is of course even longer wait time for the features that have been sitting in trunk a while, but it keeps 3.x stable and is less work for us. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Moving towards Lucene 4.0
Sorry to be negative, - BulkPostings (my +1 since I want to enable positional scoring on all queries) My problem is the really crappy and unusable API of BulkPostings (wait for my talk at Lucene Rev...). For anybody else than Mike, Yonik and yourself that’s unusable. I tried to understand even the simple MultiTermQueryWrapperFilter - easy on trunk, horrible on branch - sorry that’s a no-go. Its code duplication everywhere and unreadable. Uwe - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Moving towards Lucene 4.0
On Mon, May 16, 2011 at 8:48 AM, Uwe Schindler u...@thetaphi.de wrote: Sorry to be negative, - BulkPostings (my +1 since I want to enable positional scoring on all queries) My problem is the really crappy and unusable API of BulkPostings (wait for my talk at Lucene Rev...). For anybody else than Mike, Yonik and yourself that’s unusable. I tried to understand even the simple MultiTermQueryWrapperFilter - easy on trunk, horrible on branch - sorry that’s a no-go. Its code duplication everywhere and unreadable. I don't think you should apologize for being negative, its true there is a ton of work to do here before that branch is ready. Thats why in my email I tried to brainstorm some alternative ways we could get some of these features into the hands of users without being held up by this work. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Reopened] (SOLR-2383) Velocity: Generalize range and date facet display
[ https://issues.apache.org/jira/browse/SOLR-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl reopened SOLR-2383: --- Reopening to add patch for branch 3.2 Velocity: Generalize range and date facet display - Key: SOLR-2383 URL: https://issues.apache.org/jira/browse/SOLR-2383 Project: Solr Issue Type: Bug Components: Response Writers Reporter: Jan Høydahl Assignee: Grant Ingersoll Labels: facet, range, velocity Fix For: 4.0 Attachments: SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch Velocity (/browse) GUI has hardcoded price range facet and a hardcoded manufacturedate_dt date facet. Need general solution which work for any facet.range and facet.date. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2383) Velocity: Generalize range and date facet display
[ https://issues.apache.org/jira/browse/SOLR-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-2383: -- Fix Version/s: 3.2 Velocity: Generalize range and date facet display - Key: SOLR-2383 URL: https://issues.apache.org/jira/browse/SOLR-2383 Project: Solr Issue Type: Bug Components: Response Writers Reporter: Jan Høydahl Assignee: Grant Ingersoll Labels: facet, range, velocity Fix For: 3.2, 4.0 Attachments: SOLR-2383-branch_32.patch, SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch Velocity (/browse) GUI has hardcoded price range facet and a hardcoded manufacturedate_dt date facet. Need general solution which work for any facet.range and facet.date. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2383) Velocity: Generalize range and date facet display
[ https://issues.apache.org/jira/browse/SOLR-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl updated SOLR-2383: -- Attachment: SOLR-2383-branch_32.patch This Velocity enhancement should make it to 3.2. In this patch I back-port what was committed for 4.0, with these exceptions: * No pivot facets * End range uses ] instead of } Velocity: Generalize range and date facet display - Key: SOLR-2383 URL: https://issues.apache.org/jira/browse/SOLR-2383 Project: Solr Issue Type: Bug Components: Response Writers Reporter: Jan Høydahl Assignee: Grant Ingersoll Labels: facet, range, velocity Fix For: 3.2, 4.0 Attachments: SOLR-2383-branch_32.patch, SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch Velocity (/browse) GUI has hardcoded price range facet and a hardcoded manufacturedate_dt date facet. Need general solution which work for any facet.range and facet.date. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM
[ https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034004#comment-13034004 ]

Robert Muir commented on LUCENE-3101:
-------------------------------------

Thanks for reporting this selckin, this is a great find, definitely amazed we randomly generated this one :)

TestMinimize.testAgainstBrzozowski reproducible seed OOM
--------------------------------------------------------

                Key: LUCENE-3101
                URL: https://issues.apache.org/jira/browse/LUCENE-3101
            Project: Lucene - Java
         Issue Type: Bug
           Reporter: selckin
           Assignee: Uwe Schindler
            Fix For: 4.0

        Attachments: LUCENE-3101.patch, LUCENE-3101.patch, LUCENE-3101.patch, LUCENE-3101_test.patch

{code}
[junit] Testsuite: org.apache.lucene.util.automaton.TestMinimize
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.792 sec
[junit]
[junit] ------------- Standard Error -----------------
[junit] NOTE: reproduce with: ant test -Dtestcase=TestMinimize -Dtestmethod=testAgainstBrzozowski -Dtests.seed=-7429820995201119781:1013305000165135537
[junit] NOTE: test params are: codec=PreFlex, locale=ru, timezone=America/Pangnirtung
[junit] NOTE: all tests run in this JVM:
[junit] [TestMinimize]
[junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 (64-bit)/cpus=8,threads=1,free=294745976,total=310378496
[junit] ------------- ---------------- ---------------
[junit] Testcase: testAgainstBrzozowski(org.apache.lucene.util.automaton.TestMinimize): Caused an ERROR
[junit] Java heap space
[junit] java.lang.OutOfMemoryError: Java heap space
[junit]     at java.util.BitSet.initWords(BitSet.java:144)
[junit]     at java.util.BitSet.<init>(BitSet.java:139)
[junit]     at org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:85)
[junit]     at org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:52)
[junit]     at org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:502)
[junit]     at org.apache.lucene.util.automaton.RegExp.toAutomatonAllowMutate(RegExp.java:478)
[junit]     at org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:428)
[junit]     at org.apache.lucene.util.automaton.AutomatonTestUtil.randomAutomaton(AutomatonTestUtil.java:256)
[junit]     at org.apache.lucene.util.automaton.TestMinimize.testAgainstBrzozowski(TestMinimize.java:43)
[junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
[junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
[junit]
[junit] Test org.apache.lucene.util.automaton.TestMinimize FAILED
{code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
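The stack trace shows the OOM happening while constructing a BitSet inside minimizeHopcroft. A back-of-the-envelope check (the state count below is an assumed round number, and "HeapEstimate" is just an illustration, not project code) shows how quickly per-state BitSets outgrow the test JVM's heap:

```java
// Rough heap estimate for the failing allocation pattern: on the order
// of n BitSets of n bits each, i.e. about n*n/8 bytes total.
public class HeapEstimate {
    static long quadraticBitSetBytes(int n) {
        return (long) n * n / 8;
    }

    public static void main(String[] args) {
        // With an assumed 50,000 automaton states, the BitSets alone need
        // ~312 MB, more than the ~310 MB total heap shown in the test log.
        System.out.println(quadraticBitSetBytes(50_000)); // 312500000
    }
}
```

This is why a randomly generated automaton with a large state count was enough to trigger the failure deterministically from the seed.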
[jira] [Commented] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM
[ https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034006#comment-13034006 ]

Dawid Weiss commented on LUCENE-3101:
-------------------------------------

There is a lot of power in randomness, huh? :) I really like these randomized tests... this should be a built-in functionality in JUnit (call it 'repeatable randomness')...

TestMinimize.testAgainstBrzozowski reproducible seed OOM
--------------------------------------------------------

                Key: LUCENE-3101
                URL: https://issues.apache.org/jira/browse/LUCENE-3101
            Project: Lucene - Java
         Issue Type: Bug
           Reporter: selckin
           Assignee: Uwe Schindler
            Fix For: 4.0

        Attachments: LUCENE-3101.patch, LUCENE-3101.patch, LUCENE-3101.patch, LUCENE-3101_test.patch

{code}
[junit] Testsuite: org.apache.lucene.util.automaton.TestMinimize
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.792 sec
[junit]
[junit] ------------- Standard Error -----------------
[junit] NOTE: reproduce with: ant test -Dtestcase=TestMinimize -Dtestmethod=testAgainstBrzozowski -Dtests.seed=-7429820995201119781:1013305000165135537
[junit] NOTE: test params are: codec=PreFlex, locale=ru, timezone=America/Pangnirtung
[junit] NOTE: all tests run in this JVM:
[junit] [TestMinimize]
[junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 (64-bit)/cpus=8,threads=1,free=294745976,total=310378496
[junit] ------------- ---------------- ---------------
[junit] Testcase: testAgainstBrzozowski(org.apache.lucene.util.automaton.TestMinimize): Caused an ERROR
[junit] Java heap space
[junit] java.lang.OutOfMemoryError: Java heap space
[junit]     at java.util.BitSet.initWords(BitSet.java:144)
[junit]     at java.util.BitSet.<init>(BitSet.java:139)
[junit]     at org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:85)
[junit]     at org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:52)
[junit]     at org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:502)
[junit]     at org.apache.lucene.util.automaton.RegExp.toAutomatonAllowMutate(RegExp.java:478)
[junit]     at org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:428)
[junit]     at org.apache.lucene.util.automaton.AutomatonTestUtil.randomAutomaton(AutomatonTestUtil.java:256)
[junit]     at org.apache.lucene.util.automaton.TestMinimize.testAgainstBrzozowski(TestMinimize.java:43)
[junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
[junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
[junit]
[junit] Test org.apache.lucene.util.automaton.TestMinimize FAILED
{code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Moving towards Lucene 4.0
On Mon, May 16, 2011 at 2:57 PM, Robert Muir rcm...@gmail.com wrote: On Mon, May 16, 2011 at 8:48 AM, Uwe Schindler u...@thetaphi.de wrote: Sorry to be negative, - BulkPostings (my +1 since I want to enable positional scoring on all queries) My problem is the really crappy and unusable API of BulkPostings (wait for my talk at Lucene Rev...). For anybody else than Mike, Yonik and yourself that’s unusable. I tried to understand even the simple MultiTermQueryWrapperFilter - easy on trunk, horrible on branch - sorry that’s a no-go. Its code duplication everywhere and unreadable. I don't think you should apologize for being negative, its true there is a ton of work to do here before that branch is ready. Thats why in my email I tried to brainstorm some alternative ways we could get some of these features into the hands of users without being held up by this work. I have to admit that branch is very rough and the API is super hard to use. For now! Lets not be dragged away into discussion how this API should look like there will be time for that. I agree with robert that I see a large amount of work left on that branch though so maybe we should move the positional scoring (LUCENE-2878) over to trunk as another option. I think we should not wait much longer with Lucene 4.0 so I lean towards Roberts option 2 even if we need to pay the price for a major change in 5.0. I am not sure if we really need to change much API for Realtime Search since this should be hidden in IW, IndexingChain and IW#getReader() - I kind of like the idea to be close to 4.0 :) simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Moving towards Lucene 4.0
On Mon, May 16, 2011 at 9:12 AM, Simon Willnauer simon.willna...@googlemail.com wrote: I have to admit that branch is very rough and the API is super hard to use. For now! Lets not be dragged away into discussion how this API should look like there will be time for that. +1, this is what i really meant by decide how to handle. I don't think we will be able to quickly decide how to fix the branch itself, i think its really complicated. But we can admit its really complicated and won't be solved very soon, and try to figure out a release strategy with this in mind. (p.s. sorry simon, you got two copies of this message i accidentally hit reply instead of reply-all) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml
Hi Stanisław, You don’t need to be logged into people.apache.org to update the website. Have you seen these instructions? The “unversioned website” section is what you want, I think: http://wiki.apache.org/lucene-java/HowToUpdateTheWebsite Steve From: stac...@gmail.com [mailto:stac...@gmail.com] On Behalf Of Stanislaw Osinski Sent: Monday, May 16, 2011 8:56 AM To: dev@lucene.apache.org; simon.willna...@gmail.com Cc: java-...@lucene.apache.org; java-comm...@lucene.apache.org Subject: Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml stanislav you are a full committer afaik?! I've been working mostly on the clustering plugin for now, so I'm not sure if it's right to move me to the core section right away :-) Incidentally, I tried to svn up on /www/lucene.apache.org/java/docshttp://lucene.apache.org/java/docs at people.apache.orghttp://people.apache.org to push the modifications live, but there is an SVN lock on that directory. Am I missing anything? I'm assuming that's the right directory for the commiters list? S.
Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml
Hi Steve, That explains everything, thanks! I somehow failed to locate that wiki page and was looking at http://wiki.apache.org/solr/Website_Update_HOWTO instead. S. On Mon, May 16, 2011 at 15:25, Steven A Rowe sar...@syr.edu wrote: Hi Stanisław, You don’t need to be logged into people.apache.org to update the website. Have you seen these instructions? The “unversioned website” section is what you want, I think: http://wiki.apache.org/lucene-java/HowToUpdateTheWebsite Steve *From:* stac...@gmail.com [mailto:stac...@gmail.com] *On Behalf Of *Stanislaw Osinski *Sent:* Monday, May 16, 2011 8:56 AM *To:* dev@lucene.apache.org; simon.willna...@gmail.com *Cc:* java-...@lucene.apache.org; java-comm...@lucene.apache.org *Subject:* Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml stanislav you are a full committer afaik?! I've been working mostly on the clustering plugin for now, so I'm not sure if it's right to move me to the core section right away :-) Incidentally, I tried to svn up on /www/lucene.apache.org/java/docs at people.apache.org to push the modifications live, but there is an SVN lock on that directory. Am I missing anything? I'm assuming that's the right directory for the commiters list? S.
Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml
On May 16, 2011, at 8:55 AM, Stanislaw Osinski wrote: stanislav you are a full committer afaik?! I've been working mostly on the clustering plugin for now, so I'm not sure if it's right to move me to the core section right away :-) Incidentally, I tried to svn up on /www/lucene.apache.org/java/docs at people.apache.org to push the modifications live, but there is an SVN lock on that directory. Am I missing anything? I'm assuming that's the right directory for the commiters list? S. Stanislav - we certainly nominated you in the spirit of maintaining the carrot2 contrib, but you are still a full committer. We have decided to stop adding new Contrib committers. A full committer may be someone that only works on part of the project. IMO, a full committer might be someone that only has commit bits so that he can update the website! We trust full committers to only mess with what they are comfortable with. So we trust that you will stick to Carrot2 or other areas you are strong in, and that if you want to move into other code, you will do so intelligently. Essentially, by making you a Committer, we are mostly just saying - we trust you. But you are a full committer and not a contrib committer. We no longer mint new contrib committers. - Mark Miller lucidimagination.com Lucene/Solr User Conference May 25-26, San Francisco www.lucenerevolution.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034018#comment-13034018 ] Robert Muir commented on SOLR-1942: --- any update on this? Would be nice to be able to hook in codecproviders and codecs this way. Ability to select codec per field - Key: SOLR-1942 URL: https://issues.apache.org/jira/browse/SOLR-1942 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Yonik Seeley Assignee: Grant Ingersoll Fix For: 4.0 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch We should use PerFieldCodecWrapper to allow users to select the codec per-field. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034025#comment-13034025 ] Martijn van Groningen commented on LUCENE-3098: --- Hmmm... So you get a list of all grouped values. That can be useful. Only remember that doesn't tell anything about the group head (most relevant document of a group), since we don't sort inside the groups. Grouped total count --- Key: LUCENE-3098 URL: https://issues.apache.org/jira/browse/LUCENE-3098 Project: Lucene - Java Issue Type: New Feature Reporter: Martijn van Groningen Fix For: 3.2, 4.0 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch When grouping currently you can get two counts: * Total hit count. Which counts all documents that matched the query. * Total grouped hit count. Which counts all documents that have been grouped in the top N groups. Since the end user gets groups in his search result instead of plain documents with grouping. The total number of groups as total count makes more sense in many situations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml
Hi Mark, Thanks for clarifying the difference between contrib and full committers, I was probably too shy to subscribe myself to the latter group right away :-) For the time being, I'll most likely stick with maintaining the clustering bit and will consult you guys if I have something to contribute in the other areas of the code. S. On Mon, May 16, 2011 at 15:41, Mark Miller markrmil...@gmail.com wrote: Stanislav - we certainly nominated you in the spirit of maintaining the carrot2 contrib, but you are still a full committer. We have decided to stop adding new Contrib committers. A full committer may be someone that only works on part of the project. IMO, a full committer might be someone that only has commit bits so that he can update the website! We trust full committers to only mess with what they are comfortable with. So we trust that you will stick to Carrot2 or other areas you are strong in, and that if you want to move into other code, you will do so intelligently. Essentially, by making you a Committer, we are mostly just saying - we trust you. But you are a full committer and not a contrib committer. We no longer mint new contrib committers. - Mark Miller lucidimagination.com Lucene/Solr User Conference May 25-26, San Francisco www.lucenerevolution.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034040#comment-13034040 ] Michael McCandless commented on LUCENE-3098: Right, we'd make it clear the collection is unordered. It just seems like, since we are building up this collection anyway, we may as well give access to the consumer? Grouped total count --- Key: LUCENE-3098 URL: https://issues.apache.org/jira/browse/LUCENE-3098 Project: Lucene - Java Issue Type: New Feature Reporter: Martijn van Groningen Fix For: 3.2, 4.0 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch When grouping currently you can get two counts: * Total hit count. Which counts all documents that matched the query. * Total grouped hit count. Which counts all documents that have been grouped in the top N groups. Since the end user gets groups in his search result instead of plain documents with grouping. The total number of groups as total count makes more sense in many situations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034050#comment-13034050 ] Martijn van Groningen commented on LUCENE-3098: --- That is true. It is just a simple unordered collection of all values of the group field that match the query. I'll include this as well. Grouped total count --- Key: LUCENE-3098 URL: https://issues.apache.org/jira/browse/LUCENE-3098 Project: Lucene - Java Issue Type: New Feature Reporter: Martijn van Groningen Fix For: 3.2, 4.0 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch With grouping you can currently get two counts: * Total hit count, which counts all documents that matched the query. * Total grouped hit count, which counts all documents that have been grouped into the top N groups. Since the end user gets groups in his search result instead of plain documents, the total number of groups as the total count makes more sense in many situations.
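The "unordered collection of group values" idea above can be sketched without Lucene at all: the total group count is just the size of a set fed while collecting. A minimal sketch with a hypothetical `Hit` class (illustrative names only, not Lucene's grouping API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GroupCountSketch {
    // Hypothetical matched hit: doc ID plus the value of the grouping field.
    static class Hit {
        final int docId;
        final String group;
        Hit(int docId, String group) { this.docId = docId; this.group = group; }
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit(0, "a"), new Hit(1, "b"), new Hit(2, "a"),
            new Hit(3, "c"), new Hit(4, "b"));

        int totalHitCount = hits.size();      // every document matching the query
        Set<String> groups = new HashSet<>(); // unordered, as discussed in the comments
        for (Hit h : hits) {
            groups.add(h.group);
        }
        int totalGroupCount = groups.size();  // the additional count proposed in LUCENE-3098

        System.out.println(totalHitCount + " " + totalGroupCount); // prints: 5 3
    }
}
```

Since the set is built as a side effect of collection anyway, exposing it to the consumer (as Michael suggests) costs nothing extra.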
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034051#comment-13034051 ] Grant Ingersoll commented on SOLR-1942: --- I thought I would have time last week, but that turned out to not be the case. If you have time, Robert, feel free, otherwise I might be able to get to it later in the week (pending conf. prep). From the sounds of it, it likely just needs to be updated to trunk and then it should be ready to go (we should also doc it on the wiki) Ability to select codec per field - Key: SOLR-1942 URL: https://issues.apache.org/jira/browse/SOLR-1942 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Yonik Seeley Assignee: Grant Ingersoll Fix For: 4.0 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch We should use PerFieldCodecWrapper to allow users to select the codec per-field. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush
[ https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034053#comment-13034053 ] Simon Willnauer commented on LUCENE-3090: - I did 150 runs of all Lucene tests incl. contrib - no failure so far. Seems to be good to go. DWFlushControl does not take active DWPT out of the loop on fullFlush - Key: LUCENE-3090 URL: https://issues.apache.org/jira/browse/LUCENE-3090 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Critical Fix For: 4.0 Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch We have seen several OOMs on TestNRTThreads, all of them caused by DWFlushControl missing DWPTs that are set as flushPending but can't flush due to a full flush going on. Yet that means that those DWPTs are filling up in the background while they should actually be checked out and blocked until the full flush finishes. Even further, we currently stall on maxNumThreadStates while we should stall on the number of active thread states. I will attach a patch tomorrow.
[jira] [Commented] (SOLR-1942) Ability to select codec per field
[ https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034059#comment-13034059 ] Robert Muir commented on SOLR-1942: --- ok thanks Grant. I'll take a look thru the patch some today and post back what I think. Ability to select codec per field - Key: SOLR-1942 URL: https://issues.apache.org/jira/browse/SOLR-1942 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Yonik Seeley Assignee: Grant Ingersoll Fix For: 4.0 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch We should use PerFieldCodecWrapper to allow users to select the codec per-field. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Moving towards Lucene 4.0
We anyway seem to mark every new API as @lucene.experimental these days, so we shouldn't have too much problem when 4.0 is out :). Experimental API is subject to change at any time. We can consider that as an option as well (maybe it adds another option to Robert's?). Though personally, I'm not a big fan of this notion - I think we deceive ourselves and users when we have @experimental on a stable branch. Any @experimental API on trunk today falls into this bucket after 4.0 is out. And I'm sure there are a couple in 3.x already. Don't get me wrong - I don't suggest we should stop using it. But I think we should consider to review the @experimental API before every stable release, and reduce it over time, not increase it. Shai On Mon, May 16, 2011 at 4:20 PM, Robert Muir rcm...@gmail.com wrote: On Mon, May 16, 2011 at 9:12 AM, Simon Willnauer simon.willna...@googlemail.com wrote: I have to admit that branch is very rough and the API is super hard to use. For now! Lets not be dragged away into discussion how this API should look like there will be time for that. +1, this is what i really meant by decide how to handle. I don't think we will be able to quickly decide how to fix the branch itself, i think its really complicated. But we can admit its really complicated and won't be solved very soon, and try to figure out a release strategy with this in mind. (p.s. sorry simon, you got two copies of this message i accidentally hit reply instead of reply-all) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Field should accept BytesRef?
But when you create an untokenized field (or even a binary field, which is stored-only at the moment), you could theoretically index the bytes directly Right, if I already have a BytesRef of what needs to be indexed, then passing the BR into Field/able should reduce garbage collection of strings? On Sun, May 15, 2011 at 9:59 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, I think Jason meant the field value, not the field name. Field names should stay Strings, as they are only identifiers; making them BytesRefs is not really useful. But when you create an untokenized field (or even a binary field, which is stored-only at the moment), you could theoretically index the bytes directly. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Sunday, May 15, 2011 6:22 PM To: dev@lucene.apache.org Subject: Re: Field should accept BytesRef? On Sun, May 15, 2011 at 12:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: In the Field object a text value must be of type string, however I think we can allow a BytesRef to be passed in? It would be nice if we sorted them in byte order too? I think right now fields are sorted in UTF-16 order, but terms are sorted in UTF-8 order? (if so, this is confusing)
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3102: --- Attachment: LUCENE-3102.patch bq. Only thing is: I would be careful about directly setting those private fields of the cachedScorer; I think (not sure) this incurs an access check on each assignment. Maybe make them package protected? Or use a setter? Good catch Mike. I read about it some and found a nice webpage which explains the implications (http://www.glenmccl.com/jperf/). Indeed, if the member is private (whether it's in the inner or outer class), there is an access check. So the right thing to do is to declare it protected / package-private, which I did. Thanks for the opportunity to get some education! Patch fixes this. I intend to commit this shortly + move the class to core + apply to trunk. Then, I'll continue w/ the rest of the improvements. Few issues with CachingCollector Key: LUCENE-3102 URL: https://issues.apache.org/jira/browse/LUCENE-3102 Project: Lucene - Java Issue Type: Bug Components: contrib/* Reporter: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3102.patch, LUCENE-3102.patch CachingCollector (introduced in LUCENE-1421) has a few issues: # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it. # It does not clear cachedScores + cachedSegs upon exceeding RAM limits # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity # This check if (base + nextLength > maxDocsToCache) (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them? 
Also: * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if the CachingCollector ctor already takes a boolean cacheScores? I think it's better defined explicitly than implicitly? * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189) * I think it's a useful collector, which stands on its own and is not specific to grouping. Can we move it to core? * How about using OpenBitSet instead of int[] for doc IDs? ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries ** NOTE: OpenBitSet can be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such a Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking we can make CachingCollector *optionally* wrap another Collector; then someone can reuse it by setting the RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores. * I think a set of dedicated unit tests for this class alone would be good. That's it so far. Perhaps, if we do all of the above, more things will pop up.
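The access-check issue behind Shai's fix can be illustrated without Lucene: when one class touches a *private* field of another (even a nested one), javac routes the access through a synthetic accessor method, whereas a package-private field is written directly. A minimal sketch of the pattern the patch adopts (names are illustrative, not taken from the actual patch):

```java
public class CachedScorerSketch {
    static class CachedScorer {
        // Package-private on purpose: javac compiles cross-class access to
        // private members through a synthetic accessor, so package-private
        // fields let the enclosing class write them directly.
        int doc;
        float score;
    }

    final CachedScorer cachedScorer = new CachedScorer();

    // Stand-in for the hot path that replays cached hits into a Collector.
    void setCached(int doc, float score) {
        cachedScorer.doc = doc;     // direct field write, no accessor call
        cachedScorer.score = score;
    }

    public static void main(String[] args) {
        CachedScorerSketch sketch = new CachedScorerSketch();
        sketch.setCached(42, 1.5f);
        System.out.println(sketch.cachedScorer.doc + " " + sketch.cachedScorer.score); // prints: 42 1.5
    }
}
```

On a per-document hot loop like replay(), avoiding the extra accessor call is the point of the change, even though the JIT will often inline it anyway.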
Re: Field should accept BytesRef?
On Mon, May 16, 2011 at 11:29 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: But when you create an untokenized field (or even a binary field, which is stored-only at the moment), you could theoretically index the bytes directly Right, if I already have a BytesRef of what needs to be indexed, then passing the BR into Field/able should reduce garbage collection of strings? you can do this with a tokenstream, see http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/Test2BTerms.java for an example (sorry i somehow was confused about your message earlier). - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words
[ https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski resolved SOLR-2450. - Resolution: Fixed Committed to trunk and branch_3x. Carrot2 clustering should use both its own and Solr's stop words Key: SOLR-2450 URL: https://issues.apache.org/jira/browse/SOLR-2450 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 Attachments: SOLR-2450.patch While using only Solr's stop words for clustering isn't a good idea (compared to indexing, clustering needs more aggressive stop word removal to get reasonable cluster labels), it would be good if Carrot2 used both its own and Solr's stop words. I'm not sure what the best way to implement this would be though. My first thought was to simply load {{stopwords.txt}} from Solr config dir and merge them with Carrot2's. But then, maybe a better approach would be to get the stop words from the StopFilter being used? Ideally, we should also consider the per-field stop filters configured on the fields used for clustering. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2449) Loading of Carrot2 resources from Solr config directory
[ https://issues.apache.org/jira/browse/SOLR-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski resolved SOLR-2449. - Resolution: Fixed Committed to trunk and branch_3x. Loading of Carrot2 resources from Solr config directory --- Key: SOLR-2449 URL: https://issues.apache.org/jira/browse/SOLR-2449 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Fix For: 3.2, 4.0 Attachments: SOLR-2449.patch Currently, Carrot2 clustering algorithms read linguistic resources (stop words, stop labels) from the classpath (Carrot2 JAR), which makes them difficult to edit/override. The directory from which Carrot2 should read its resources (absolute, or relative to Solr config dir) could be specified in the {{engine}} element. By default, the path could be e.g. {{solr.conf/clustering/carrot2}}. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski resolved SOLR-2448. - Resolution: Fixed Committed to trunk and branch_3x. Upgrade Carrot2 to version 3.5.0 Key: SOLR-2448 URL: https://issues.apache.org/jira/browse/SOLR-2448 Project: Solr Issue Type: Task Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2505) Output cluster scores
[ https://issues.apache.org/jira/browse/SOLR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski resolved SOLR-2505. - Resolution: Fixed Committed to trunk and branch_3x. Output cluster scores - Key: SOLR-2505 URL: https://issues.apache.org/jira/browse/SOLR-2505 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 Carrot2 algorithms compute cluster scores; we could expose them on the output from Solr clustering component. Along with scores, we can output a boolean flag that marks the Other Topics groups. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3084: -- Attachment: LUCENE-3084-trunk-only.patch Here is an updated patch that removes some List&lt;SI&gt; usage from DirectoryReader and IndexWriter for rollback when commit fails. I am still not happy with IndexWriter code interacting directly with the list, but this should maybe be fixed later. This patch could also be backported to clean up 3.x, but for backwards compatibility the SegmentInfos class should still extend Vector&lt;SI&gt;; we can make the segments field simply point to this. I am not sure how to deprecate extension of a class? A possibility would be to add each Vector method as an overridden, deprecated one-liner, but that's a no-brainer and stupid to do :( MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cut over to List&lt;SI&gt; instead.
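The cutover the patch describes (and the back-compat concern about SegmentInfos extending Vector) boils down to composition over inheritance. A rough sketch with simplified stand-in types, not Lucene's real classes:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SegmentInfosSketch {
    // Simplified stand-in for Lucene's per-segment metadata.
    static class SegmentInfo {
        final String name;
        final int docCount;
        SegmentInfo(String name, int docCount) { this.name = name; this.docCount = docCount; }
    }

    // Composition instead of "extends Vector<SegmentInfo>": the extra
    // bookkeeping fields stay here, and merge code only ever sees the list.
    static class SegmentInfos {
        long version; // example of a field that merging does not need
        private final List<SegmentInfo> segments = new ArrayList<>();

        void add(SegmentInfo si) { segments.add(si); }

        List<SegmentInfo> asList() { return Collections.unmodifiableList(segments); }
    }

    // Merge code takes a plain List<SegmentInfo>, the shape proposed for
    // MergePolicy.OneMerge.segments.
    static int totalDocs(List<SegmentInfo> merge) {
        int sum = 0;
        for (SegmentInfo si : merge) sum += si.docCount;
        return sum;
    }

    public static void main(String[] args) {
        SegmentInfos infos = new SegmentInfos();
        infos.add(new SegmentInfo("_0", 100));
        infos.add(new SegmentInfo("_1", 50));
        System.out.println(totalDocs(infos.asList())); // prints: 150
    }
}
```

The awkward alternative Uwe mentions, overriding every Vector method as a deprecated one-liner, is exactly what composition avoids.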
[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034091#comment-13034091 ] Michael McCandless commented on LUCENE-3102: Patch looks great Shai -- +1 to commit!! Yes that is very sneaky about the private fields in inner/outer classes -- it's good you added a comment explaining it! Few issues with CachingCollector Key: LUCENE-3102 URL: https://issues.apache.org/jira/browse/LUCENE-3102 Project: Lucene - Java Issue Type: Bug Components: contrib/* Reporter: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3102.patch, LUCENE-3102.patch CachingCollector (introduced in LUCENE-1421) has a few issues: # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it. # It does not clear cachedScores + cachedSegs upon exceeding RAM limits # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity # This check if (base + nextLength > maxDocsToCache) (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them? Also: * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if the CachingCollector ctor already takes a boolean cacheScores? I think it's better defined explicitly than implicitly? * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189) * I think it's a useful collector, which stands on its own and is not specific to grouping. Can we move it to core? * How about using OpenBitSet instead of int[] for doc IDs? 
** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores. * I think a set of dedicated unit tests for this class alone would be good. That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034093#comment-13034093 ] Michael McCandless commented on LUCENE-3084: Uwe, this looks like a great step forward! Even if there are other things to fix later, we should commit this first (progress not perfection). Thanks! On backporting, this is an experimental API, and it's rather expert for code to be interacting with SegmentInfos, so I think we can just break it (and advertise we did so)? MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cut over to List&lt;SI&gt; instead.
[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush
[ https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034095#comment-13034095 ] Michael McCandless commented on LUCENE-3090: Patch looks good but hairy Simon! I ran 144 iters of all (Solr+Lucene+Lucene-contrib) tests. I hit three fails (one in Solr's TestJoin.testRandomJoin, and two in Solr's HighlighterTest) but I don't think these are related to this patch. DWFlushControl does not take active DWPT out of the loop on fullFlush - Key: LUCENE-3090 URL: https://issues.apache.org/jira/browse/LUCENE-3090 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Critical Fix For: 4.0 Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch We have seen several OOMs on TestNRTThreads, all of them caused by DWFlushControl missing DWPTs that are set as flushPending but can't flush due to a full flush going on. Yet that means that those DWPTs are filling up in the background while they should actually be checked out and blocked until the full flush finishes. Even further, we currently stall on maxNumThreadStates while we should stall on the number of active thread states. I will attach a patch tomorrow.
[jira] [Created] (SOLR-2521) TestJoin.testRandom fails
TestJoin.testRandom fails - Key: SOLR-2521 URL: https://issues.apache.org/jira/browse/SOLR-2521 Project: Solr Issue Type: Bug Reporter: Michael McCandless Fix For: 4.0 Hit this random failure; it reproduces on trunk: {noformat} [junit] Testsuite: org.apache.solr.TestJoin [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 4.512 sec [junit] [junit] - Standard Error - [junit] 2011-05-16 12:51:46 org.apache.solr.TestJoin testRandomJoin [junit] SEVERE: GROUPING MISMATCH: mismatch: '0'!='1' @ response/numFound [junit] request=LocalSolrQueryRequest{echoParams=allindent=trueq={!join+from%3Dsmall_i+to%3Dsmall3_is}*:*wt=json} [junit] result={ [junit] responseHeader:{ [junit] status:0, [junit] QTime:0, [junit] params:{ [junit] echoParams:all, [junit] indent:true, [junit] q:{!join from=small_i to=small3_is}*:*, [junit] wt:json}}, [junit] response:{numFound:1,start:0,docs:[ [junit] { [junit] id:NXEA, [junit] score_f:87.90162, [junit] small3_ss:[N, [junit] v, [junit] n], [junit] small_i:4, [junit] small2_i:1, [junit] small2_is:[2], [junit] small3_is:[69, [junit] 88, [junit] 54, [junit] 80, [junit] 75, [junit] 83, [junit] 57, [junit] 73, [junit] 85, [junit] 52, [junit] 50, [junit] 88, [junit] 51, [junit] 89, [junit] 12, [junit] 8, [junit] 19, [junit] 23, [junit] 53, [junit] 75, [junit] 26, [junit] 99, [junit] 0, [junit] 44]}] [junit] }} [junit] expected={numFound:0,start:0,docs:[]} [junit] model={NXEA:Doc(0):[id=NXEA, score_f=87.90162, small3_ss=[N, v, n], small_i=4, small2_i=1, small2_is=2, small3_is=[69, 88, 54, 80, 75, 83, 57, 73, 85, 52, 50, 88, 51, 89, 12, 8, 19, 23, 53, 75, 26, 99, 0, 44]],JSLZ:Doc(1):[id=JSLZ, score_f=11.198811, small2_ss=[c, d], small3_ss=[b, R, H, Q, O, f, C, e, Z, u, z, u, w, I, f, _, Y, r, w, u], small_i=6, small2_is=[2, 3], small3_is=[22, 1]],FAWX:Doc(2):[id=FAWX, score_f=25.524109, small_s=d, small3_ss=[O, D, X, `, W, z, k, M, j, m, r, [, E, P, w, ^, y, T, e, R, V, H, g, e, I], small_i=2, small2_is=[2, 1], small3_is=[95, 
42]],GDDZ:Doc(3):[id=GDDZ, score_f=8.483642, small2_ss=[b, e], small3_ss=[o, i, y, l, I, O, r, O, f, d, E, e, d, f, b, P], small2_is=[6, 6], small3_is=[36, 48, 9, 8, 40, 40, 68]],RBIQ:Doc(4):[id=RBIQ, score_f=97.06258, small_s=b, small2_s=c, small2_ss=[e, e], small_i=2, small2_is=6, small3_is=[13, 77, 96, 45]],LRDM:Doc(5):[id=LRDM, score_f=82.302124, small_s=b, small2_s=a, small2_ss=d, small3_ss=[H, m, O, D, I, J, U, D, f, N, ^, m, I, j, L, s, F, h, A, `, c, j], small2_i=2, small2_is=[2, 7], small3_is=[81, 31, 78, 23, 88, 1, 7, 86, 20, 7, 40, 52, 100, 81, 34, 45, 87, 72, 14, 5]]} [junit] NOTE: reproduce with: ant test -Dtestcase=TestJoin -Dtestmethod=testRandomJoin -Dtests.seed=-4998031941344546449:8541928265064992444 [junit] NOTE: test params are: codec=RandomCodecProvider: {id=MockRandom, small2_ss=Standard, small2_is=MockFixedIntBlock(blockSize=1738), small2_s=MockFixedIntBlock(blockSize=1738), small3_is=MockVariableIntBlock(baseBlockSize=77), small_i=MockFixedIntBlock(blockSize=1738), small_s=MockVariableIntBlock(baseBlockSize=77), score_f=MockSep, small2_i=Pulsing(freqCutoff=9), small3_ss=SimpleText}, locale=sr_BA, timezone=America/Barbados [junit] NOTE: all tests run in this JVM: [junit] [TestJoin] [junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=252342544,total=308084736 [junit] - --- [junit] Testcase: testRandomJoin(org.apache.solr.TestJoin): FAILED [junit] mismatch: '0'!='1' @ response/numFound [junit] junit.framework.AssertionFailedError: mismatch: '0'!='1' @ response/numFound [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211) [junit] at org.apache.solr.TestJoin.testRandomJoin(TestJoin.java:172) [junit] [junit] [junit] Test org.apache.solr.TestJoin FAILED {noformat} -- This message is automatically generated by JIRA. 
[jira] [Assigned] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file
[ https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-3100: --- Assignee: Simon Willnauer IW.commit() writes but fails to fsync the N.fnx file Key: LUCENE-3100 URL: https://issues.apache.org/jira/browse/LUCENE-3100 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising bug! Because the new N.fnx file is written at the last minute along with the segments file, it's not included in the sis.files() that IW uses to figure out which files to sync. This bug means one could call IW.commit(), successfully, return, and then the machine could crash and when it comes back up your index could be corrupted. We should hopefully first fix TestCrash so that it hits this bug (maybe it needs more/better randomization?), then fix the bug -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034101#comment-13034101 ] Michael McCandless commented on SOLR-2519: -- I think the attached patch is a good starting point. It fixes the generic text fieldType to have good all-around defaults for all languages, so that non-whitespace languages work fine. Then, I think we should iteratively add in custom languages over time (as separate issues). We can e.g. add text_en_autophrase, text_en, text_zh, etc. We should at least do a first sweep of the nice analyzers module and add fieldTypes for them. This way we will eventually get to the ideal future where we have text_XX coverage for many languages. Improve the defaults for the text field type in default schema.xml Key: SOLR-2519 URL: https://issues.apache.org/jira/browse/SOLR-2519 Project: Solr Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.2, 4.0 Attachments: SOLR-2519.patch Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5 The text fieldType in schema.xml is unusable for non-whitespace languages, because it has the dangerous auto-phrase feature (of Lucene's QP -- see LUCENE-2458) enabled. Lucene leaves this off by default, as does ElasticSearch (http://www.elasticsearch.org/). Furthermore, the text fieldType uses WhitespaceTokenizer when StandardTokenizer is a better cross-language default. Until we have language-specific field types, I think we should fix the text fieldType to work well for all languages, by: * Switching from WhitespaceTokenizer to StandardTokenizer * Turning off auto-phrase
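In schema.xml terms, the two changes described above could look roughly like this. This is a sketch of the direction, not the committed patch, and the filter chain shown is illustrative:

```xml
<!-- Sketch: generic cross-language "text" type -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">  <!-- auto-phrase off -->
  <analyzer>
    <!-- StandardTokenizer instead of WhitespaceTokenizer, so
         non-whitespace languages are tokenized sensibly -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Language-specific types like text_en or text_zh would then layer their own tokenizers and filters on top of this neutral default.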
[jira] [Assigned] (LUCENE-2027) Deprecate Directory.touchFile
[ https://issues.apache.org/jira/browse/LUCENE-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-2027: Assignee: Michael McCandless

Deprecate Directory.touchFile
Key: LUCENE-2027
URL: https://issues.apache.org/jira/browse/LUCENE-2027
Project: Lucene - Java
Issue Type: Improvement
Components: Store
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Trivial
Fix For: 4.0
Attachments: LUCENE-2027.patch

Lucene doesn't use this method, and FindBugs reports that FSDirectory's impl shouldn't swallow the result returned from File.setLastModified.
[jira] [Updated] (LUCENE-2027) Deprecate Directory.touchFile
[ https://issues.apache.org/jira/browse/LUCENE-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2027: Attachment: LUCENE-2027.patch

Patch, removing Dir.touchFile from trunk. For 3.x I'll deprecate.
[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush
[ https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034103#comment-13034103 ] Simon Willnauer commented on LUCENE-3090:

Thanks Mike for the review and testing! It makes me feel better with those asserts in there now... I will commit tomorrow.

DWFlushControl does not take active DWPT out of the loop on fullFlush
Key: LUCENE-3090
URL: https://issues.apache.org/jira/browse/LUCENE-3090
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Critical
Fix For: 4.0
Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch

We have seen several OOMs on TestNRTThreads, and all of them are caused by DWFlushControl missing DWPTs that are set as flushPending but can't flush due to a full flush going on. That means those DWPTs are filling up in the background while they should actually be checked out and blocked until the full flush finishes. Further, we currently stall on maxNumThreadStates while we should stall on the number of active thread states. I will attach a patch tomorrow.
[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3084: Attachment: LUCENE-3084-trunk-only.patch

New patch that also has BalancedMergePolicy from contrib refactored to the new API (sorry, that was missing).

MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos
Key: LUCENE-3084
URL: https://issues.apache.org/jira/browse/LUCENE-3084
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 3.2, 4.0
Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch

SegmentInfos carries a bunch of fields beyond the list of SIs, but for merging purposes these fields are unused. We should cut over to List&lt;SI&gt; instead.
[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034120#comment-13034120 ] Yonik Seeley commented on SOLR-2519:

I think maybe there's a misconception that the fieldType named text was meant to be generic for all languages. As I said in the thread, if I had to do it over again, I would have named it text_en, because that's what its purpose was. But at this point, it seems like the best way forward is to leave text as an English fieldType and simply add other fieldTypes that can support other languages.

Some downsides I see to this patch (i.e. trying to make the 'text' fieldType generic):
- The current WordDelimiterFilter options in the fieldType feel like a trap for non-whitespace-delimited languages. WDF is configured to index catenations as well as splits... so all of the tokens (words?) that are split out are also catenated together and indexed (which seems like it could lead to some truly huge tokens erroneously being indexed.)
- You left the English stemmer on the text fieldType... but if it's supposed to be generic, couldn't this be bad for some other western languages, where it could cause stemming collisions of words not related to each other?

Taking into account all the existing users (and all the existing documentation, examples, tutorial, etc.), I favor a more conservative approach of adding new fieldTypes rather than radically changing the behavior of existing ones. Random question: what are the implications of changing from WhitespaceTokenizer to StandardTokenizer, esp. w.r.t. WDF?
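The catenation trap Yonik describes can be seen with a toy re-implementation of split-plus-catenate behavior. This is not Solr's actual WordDelimiterFilter, only a sketch of what catenate-style options do to a token stream:

```python
import re

def word_delimiter(token, catenate_all=True):
    """Toy sketch of WordDelimiterFilter-style behavior (NOT Solr's code):
    split a token on delimiters and case changes, emit the sub-words, and
    optionally also emit the catenation of all parts as one extra token."""
    # Split on non-word chars / underscores, and on lower->upper transitions.
    parts = [p for p in re.split(r"[\W_]+|(?<=[a-z])(?=[A-Z])", token) if p]
    out = list(parts)
    if catenate_all and len(parts) > 1:
        out.append("".join(parts))  # the catenated token is indexed as well
    return out
```

For a compound like wi-fi the index receives wi, fi, and the catenated wifi; if a whitespace tokenizer hands this filter an entire unsegmented sentence, the catenated form becomes one huge bogus token, which is the trap being pointed out.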
[jira] [Commented] (SOLR-2520) Solr creates invalid jsonp strings
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034151#comment-13034151 ] Hoss Man commented on SOLR-2520:

I'm confused here: as far as i can tell, the JSONResponseWriter does in fact output valid JSON (the link mentioned points out that there are control characters valid in JSON which are not valid in javascript, but that's what the response writer produces -- JSON) ... so what is the bug? And what do you mean by the query option to ask for jsonp? ... i don't see that option in the JSONResponseWriter (is this bug about some third party response writer?)

Solr creates invalid jsonp strings
Key: SOLR-2520
URL: https://issues.apache.org/jira/browse/SOLR-2520
Project: Solr
Issue Type: Bug
Affects Versions: 4.0
Reporter: Benson Margulies

Please see http://timelessrepo.com/json-isnt-a-javascript-subset. If a stored field contains invalid Javascript characters, and you use the query option to ask for jsonp, Solr does *not* escape some invalid Unicode characters, resulting in strings that explode on contact with browsers.
[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034154#comment-13034154 ] Michael McCandless commented on SOLR-2519:

bq. I think maybe there's a misconception that the fieldType named text was meant to be generic for all languages.

Regardless of what the original intention was, text today has become the generic text fieldType new users use when starting with Solr. I mean, it has the perfect name for that :)

bq. As I said in the thread, if I had to do it over again, I would have named it text_en because that's what its purpose was.

Hindsight is 20/20... but we can still fix this today. We shouldn't lock ourselves into poor defaults. Especially as things improve and we get better analyzers, etc., we should be free to improve the defaults in schema.xml to take advantage of these improvements.

bq. But at this point, it seems like the best way forward is to leave text as an English fieldType and simply add other fieldTypes that can support other languages.

I think this is a dangerous approach -- the name (ie, missing _en if in fact it has such English-specific configuration) is misleading and traps new users. Ideally, in the future, we wouldn't even have a text fieldType, only text_XX per-language examples and then maybe something like text_general, which you use if you cannot find your language.

{quote} Some downsides I see to this patch (i.e. trying to make the 'text' fieldType generic): The current WordDelimiterFilter options in the fieldType feel like a trap for non-whitespace-delimited languages. WDF is configured to index catenations as well as splits... so all of the tokens (words?) that are split out are also catenated together and indexed (which seems like it could lead to some truly huge tokens erroneously being indexed.) {quote}

Ahh, good point. I think we should remove WDF altogether from the generic text fieldType.
{quote} You left the English stemmer on the text fieldType... but if it's supposed to be generic, couldn't this be bad for some other western languages where it could cause stemming collisions of words not related to each other? {quote}

+1, we should remove the stemming from text too.

bq. Taking into account all the existing users (and all the existing documentation, examples, tutorial, etc), I favor a more conservative approach of adding new fieldTypes rather than radically changing the behavior of existing ones.

Can you point to specific examples (docs, examples, tutorial)? I'd like to understand how much work it is to fix these... My feeling is we should simply do the work here (I'll sign up for it) and fix any places that actually rely on the specifics of the text fieldType, e.g. autophrase. We shouldn't avoid fixing things well because it's going to be more work today, especially if someone (me) is signing up to do it.

Also: existing users would be unaffected by this? They've already copied over / edited their own schema.xml? This is mainly about new users?
[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034158#comment-13034158 ] Michael McCandless commented on SOLR-2519:

It's also spooky that the text fieldType has different index-time vs query-time analyzers? Ie, WDF is configured differently.
[jira] [Commented] (SOLR-2520) Solr creates invalid jsonp strings
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034159#comment-13034159 ] Benson Margulies commented on SOLR-2520:

Fun happens when you specify something in json.wrf. This demands 'jsonp' instead of json, which results in the response being treated as javascript, not json. wt=json&json.wrf=SOME_PREFIX will cause Solr to respond with SOME_PREFIX({whatever it was otherwise going to return}) instead of just {whatever it was otherwise going to return}. If there is then an interesting Unicode character in there, Chrome implodes and Firefox quietly rejects.
[jira] [Updated] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen updated LUCENE-3098: Attachment: LUCENE-3098.patch

Attached patch with the discussed changes. 3.x patch follows soon.

Grouped total count
Key: LUCENE-3098
URL: https://issues.apache.org/jira/browse/LUCENE-3098
Project: Lucene - Java
Issue Type: New Feature
Reporter: Martijn van Groningen
Fix For: 3.2, 4.0
Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch

When grouping, currently you can get two counts: * Total hit count, which counts all documents that matched the query. * Total grouped hit count, which counts all documents that have been grouped into the top N groups. Since with grouping the end user gets groups in the search result instead of plain documents, the total number of groups makes more sense as the total count in many situations.
[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034172#comment-13034172 ] Hoss Man commented on SOLR-2519:

I feel like we are conflating two issues here: the default behavior of TextField, and the example configs. i don't have any strong opinions about changing the default behavior of TextField when {{autoGeneratePhraseQueries}} is not specified in the {{&lt;fieldType/&gt;}}, but if we do make such a change, it should be contingent on the schema version property (which we should bump) so that people who upgrade will get consistent behavior with their existing configs (TextField.init already has an example of this from when we changed the default of {{omitNorms}}).

As far as the example configs: i agree with yonik that changing text at this point might be confusing ... i think the best way to iterate moving forward would probably be:
* rename {{&lt;fieldType name="text"/&gt;}} and {{&lt;field name="text"/&gt;}} to something that makes their purpose more clear (text_en, or text_western, or text_european, or some other more general descriptive word for the types of languages where it makes sense) and switch all existing {{&lt;field/&gt;}} declarations that currently use field type text to use this new name.
* add a new {{&lt;fieldType name="text_general"/&gt;}} which is designed (and documented) to be a general purpose field type for when the language is unknown (it may make sense to fix/repurpose the existing {{&lt;fieldType name="textgen"/&gt;}} for this, since it already suggests that's what it's for)
* Audit all {{&lt;field/&gt;}} declarations that use text_en (or whatever name was chosen above) and the existing sample data for those fields to see if it makes more sense to change them to text_general; also change any where, based on usage, it shouldn't matter.
The end result being that we have no {{&lt;fieldType/&gt;}} named text in the example configs, so people won't get it confused with previous versions, and we'll have a new {{&lt;fieldType/&gt;}} that works as well as possible with all languages, which we use as much as possible with the example data.
[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034176#comment-13034176 ] Hoss Man commented on SOLR-2519:

bq. Also: existing users would be unaffected by this? They've already copied over / edited their own schema.xml? This is mainly about new users?

The trap we've seen with this type of thing in the past (ie: the numeric fields) is that people who tend to use the example configs w/o changing them much refer to the example field types by name when talking about them on the mailing list, not considering that those names can have different meanings depending on version. if we make radical changes to a {{&lt;fieldType/&gt;}} but leave the name alone, it could confuse a lot of people, ie: "i tried using the 'text' field but it didn't work"; "which version of solr are you using?"; "Solr 4.1"; "that should work, what exactly does your schema look like"; "..."; "that's the schema from 3.6"; "yeah, i started with 3.6 and then upgraded to 4.1 later"; etc...

Bottom line: it's less confusing to *remove* a {{&lt;fieldType/&gt;}} and add new ones with new names than to make radical changes to existing ones.
[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
[ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3102: Component/s: (was: contrib/*) modules/grouping

Few issues with CachingCollector
Key: LUCENE-3102
URL: https://issues.apache.org/jira/browse/LUCENE-3102
Project: Lucene - Java
Issue Type: Bug
Components: modules/grouping
Reporter: Shai Erera
Priority: Minor
Fix For: 3.2, 4.0
Attachments: LUCENE-3102.patch, LUCENE-3102.patch

CachingCollector (introduced in LUCENE-1421) has a few issues:
# Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query), and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it.
# It does not clear cachedScores + cachedSegs upon exceeding RAM limits.
# I think that instead of comparing curScores to null in order to determine if scores are requested, we should have a specific boolean - for clarity.
# Can this check, if (base + nextLength &gt; maxDocsToCache) (line 168), be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them?

Also:
* The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if the CachingCollector ctor already takes a boolean cacheScores? I think it's better defined explicitly than implicitly?
* Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189).
* I think it's a useful collector, which stands on its own and is not specific to grouping. Can we move it to core?
* How about using OpenBitSet instead of int[] for doc IDs?
** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries.
** NOTE: OpenBitSet can only be used for in-order collection.
So we can use that if the wrapped Collector does not support out-of-order collection.
* Do you think we can modify this Collector to not necessarily wrap another Collector? We have such a Collector, which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking we can make CachingCollector *optionally* wrap another Collector, and then someone can reuse it by setting the RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores.
* I think a set of dedicated unit tests for this class alone would be good.

That's it so far. Perhaps, if we do all of the above, more things will pop up.
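The cache-and-replay idea under discussion, including making the wrapped Collector optional, can be sketched roughly as follows. This is a Python sketch of the concept only, not Lucene's CachingCollector:

```python
class CachingCollector:
    """Concept sketch (NOT Lucene's implementation): forward each hit to an
    optional wrapped collector while caching doc IDs (and optionally scores),
    so the hit stream can later be replayed into another collector."""

    def __init__(self, wrapped=None, cache_scores=True, max_docs_to_cache=1_000_000):
        self.wrapped = wrapped              # wrapping is optional, per the last bullet
        self.cache_scores = cache_scores    # explicit boolean, per point #3
        self.max_docs = max_docs_to_cache
        self.doc_ids, self.scores = [], []

    def collect(self, doc_id, score):
        if len(self.doc_ids) < self.max_docs:
            self.doc_ids.append(doc_id)
            if self.cache_scores:
                self.scores.append(score)
        if self.wrapped is not None:
            self.wrapped.collect(doc_id, score)

    def replay(self, collector):
        # Docs are replayed in the order they were cached, so an out-of-order
        # source replays out-of-order -- the first issue raised above.
        for i, doc_id in enumerate(self.doc_ids):
            score = self.scores[i] if self.cache_scores else None
            collector.collect(doc_id, score)
```

With no wrapped collector and an effectively unlimited cache, this degenerates into the "just collect all matching docs + scores" use case described in the last bullet.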
[jira] [Updated] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-2520: Summary: JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data (was: Solr creates invalid jsonp strings)

Benson: thanks for the clarification, i've updated the summary to attempt to clarify the root of the issue. Would it make more sense to have a JavascriptResponseWriter, or to have the JSONResponseWriter do unicode escaping/stripping if/when json.wrf is specified?
[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034185#comment-13034185 ] Michael McCandless commented on SOLR-2519:

bq. Bottom line: it's less confusing to remove a fieldType and add new ones with new names than to make radical changes to existing ones.

Ahh, this makes great sense! I really like your proposal Hoss, and that's a great point about emails to the mailing lists. So we'd have no more text fieldType. Just text_en (what text now is) and text_general (basically just StandardAnalyzer, but maybe move/absorb textgen over). Over time we can add in more language-specific text_XX fieldTypes...
[jira] [Commented] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034187#comment-13034187 ] Benson Margulies commented on SOLR-2520:

I'd vote for the latter. I assume that there is some large inventory of people who are currently using json.wrf=foo and who would benefit from the change. However, I have limited context here, so if anyone else knows more about how users are using this stuff, I hope they will speak up. Sorry not to have been fully clear on the first attempt.
[jira] [Updated] (LUCENE-3103) create a simple test that indexes and searches byte[] terms
[ https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3103: Attachment: LUCENE-3103.patch

Attached is a first patch... maybe Uwe won't be able to resist rewriting it to make it simpler :)

create a simple test that indexes and searches byte[] terms
Key: LUCENE-3103
URL: https://issues.apache.org/jira/browse/LUCENE-3103
Project: Lucene - Java
Issue Type: Test
Components: general/test
Reporter: Robert Muir
Fix For: 4.0
Attachments: LUCENE-3103.patch

Currently, the only good test that does this is Test2BTerms (disabled by default). I think we should test this capability, and also have a simpler example of how to do this.
[jira] [Commented] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
[ https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034197#comment-13034197 ]

Yonik Seeley commented on SOLR-2520:
------------------------------------

It looks like we already escape \u2028 (see SOLR-1936), so we should just do the same for \u2029?

> JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data
>                 Key: SOLR-2520
>                 URL: https://issues.apache.org/jira/browse/SOLR-2520
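The fix Yonik points at can be sketched in plain Java. This is a hypothetical helper, not Solr's actual JSONResponseWriter code: U+2028 and U+2029 are legal raw inside a JSON string but act as line terminators in JavaScript source, so a json.wrf=foo (JSONP) response has to \u-escape them.

```java
// Hypothetical sketch of the escaping SOLR-2520 asks for; not Solr's
// actual writer code. U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH
// SEPARATOR) are valid unescaped in JSON, but when the response is
// wrapped in a JSONP callback and evaluated as JavaScript, they end
// the line and break the string literal.
public class JsEscape {
    public static String escapeForJavascript(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\u2028') {
                sb.append("\\u2028");
            } else if (c == '\u2029') {
                sb.append("\\u2029");
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a raw U+2028 in a stored field becomes a safe escape sequence
        System.out.println(escapeForJavascript("a\u2028b")); // prints a\u2028b
    }
}
```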
[jira] [Updated] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3098:
------------------------------------------
    Attachment: LUCENE-3098.patch

Attached a new patch.
* Renamed TotalGroupCountCollector to AllGroupsCollector. This rename reflects better what the collector is actually doing.
* Group values are now collected in an ArrayList instead of a LinkedList. The initialSize is now also used for the ArrayList.

> Grouped total count
> -------------------
>
>                 Key: LUCENE-3098
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3098
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Martijn van Groningen
>             Fix For: 3.2, 4.0
>         Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch
>
> When grouping you can currently get two counts:
> * Total hit count: counts all documents that matched the query.
> * Total grouped hit count: counts all documents that have been grouped into the top N groups.
> Since with grouping the end user gets groups in his search result instead of plain documents, the total number of groups as total count makes more sense in many situations.
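The third count the issue adds (the total number of distinct groups) boils down to remembering every group value seen while hits stream by. A minimal plain-Java sketch of the idea behind AllGroupsCollector follows; this is not the real Lucene Collector API, just the core bookkeeping:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the idea behind AllGroupsCollector (not the real
// Lucene collector API): as matching documents stream by, remember
// each distinct group value; the set's size is the grouped total count,
// independent of how many top-N groups are actually returned.
public class AllGroupsSketch {
    private final Set<String> groups = new HashSet<>();

    // stands in for Collector.collect(int doc) plus a per-doc
    // group-value lookup (which the real collector does via FieldCache)
    public void collect(String groupValue) {
        groups.add(groupValue);
    }

    public int getGroupCount() {
        return groups.size();
    }

    public static void main(String[] args) {
        AllGroupsSketch c = new AllGroupsSketch();
        for (String g : new String[] {"a", "b", "a", "c", "b"}) {
            c.collect(g);
        }
        // 5 hits, but only 3 distinct groups
        System.out.println(c.getGroupCount()); // prints 3
    }
}
```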
[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml
[ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034203#comment-13034203 ]

Robert Muir commented on SOLR-2519:
-----------------------------------

As someone frustrated by this (but who would ultimately like to move past it and try to help with Solr's intl), I just wanted to say +1 to Hoss Man's proposal. My only suggestion on what he said is that I would greatly prefer text_en over text_western or whatever, for these reasons:
1. The stemming and stopwords and crap here are English.
2. For other western languages, even if you swap these out for, say, French or Italian (which is the seemingly obvious way to cut over), the whole WDF+autophrase combination is still a huge trap (see http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance for an example); in this case ElisionFilter can be used to avoid it.

> Improve the defaults for the text field type in default schema.xml
> ------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>         Attachments: SOLR-2519.patch
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace languages, because it has the dangerous auto-phrase feature (of Lucene's QP -- see LUCENE-2458) enabled. Lucene leaves this off by default, as does ElasticSearch (http://www.elasticsearch.org/).
> Furthermore, the text fieldType uses WhitespaceTokenizer when StandardTokenizer is a better cross-language default. Until we have language-specific field types, I think we should fix the text fieldType to work well for all languages, by:
> * Switching from WhitespaceTokenizer to StandardTokenizer
> * Turning off auto-phrase
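In schema.xml terms, the two changes the issue proposes would look roughly like the fragment below. This is a sketch, not the committed default: the field type name and filter chain are illustrative, while `autoGeneratePhraseQueries` is the Solr attribute controlling the auto-phrase behavior discussed here.

```xml
<!-- Sketch of the proposed defaults: StandardTokenizer instead of
     WhitespaceTokenizer, and autoGeneratePhraseQueries="false" so
     non-whitespace languages (CJK etc.) don't get surprise phrase
     queries from the query parser. Name and filters are illustrative. -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```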
[jira] [Commented] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034214#comment-13034214 ]

Michael McCandless commented on LUCENE-3098:
--------------------------------------------

Looks great Martijn! I'll commit in a day or two if nobody objects...

> Grouped total count
>                 Key: LUCENE-3098
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3098
[jira] [Assigned] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-3098:
------------------------------------------
    Assignee: Michael McCandless

> Grouped total count
>                 Key: LUCENE-3098
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3098
[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms
[ https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034217#comment-13034217 ]

Robert Muir commented on LUCENE-3103:
-------------------------------------

One thing I did previously (it seemed like overkill, but maybe it's good to do) was to call clearAttributes() and setBytesRef() on each incrementToken(), more like a normal tokenizer. We could still change it to work like this; in that case clear() would set the BytesRef to null. Another thing to inspect is the reflection API, so that toString() prints the bytes... I didn't check this.

> create a simple test that indexes and searches byte[] terms
>                 Key: LUCENE-3103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3103
[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms
[ https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034220#comment-13034220 ]

Uwe Schindler commented on LUCENE-3103:
---------------------------------------

Reflection should work correctly. No need to change anything.

> create a simple test that indexes and searches byte[] terms
>                 Key: LUCENE-3103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3103
[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms
[ https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034224#comment-13034224 ]

Michael McCandless commented on LUCENE-3103:
--------------------------------------------

+1 -- this is a great test to add, now that we support arbitrary binary terms.

> create a simple test that indexes and searches byte[] terms
>                 Key: LUCENE-3103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3103
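"Arbitrary binary terms" implies an ordering over raw byte[], and the one the index uses is unsigned lexicographic order. A minimal standalone sketch of that comparison follows; it is loosely what a BytesRef-style comparison does, not Lucene's actual code, and it shows why Java's signed bytes need the `& 0xff` mask:

```java
// Standalone sketch of unsigned lexicographic byte[] ordering, loosely
// the order binary terms sort in (not Lucene's actual BytesRef code).
// Java bytes are signed, so 0x80 == -128 would sort *before* 0x7f == 127
// without the mask; terms must compare as unsigned values instead.
public class ByteTermOrder {
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // mask to unsigned 0..255
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // shared prefix: the shorter term sorts first
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] lo = {0x7f};
        byte[] hi = {(byte) 0x80};
        // unsigned order: 0x7f < 0x80, even though -128 < 127 signed
        System.out.println(compare(lo, hi) < 0); // prints true
    }
}
```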
[jira] [Updated] (LUCENE-3098) Grouped total count
[ https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn van Groningen updated LUCENE-3098:
------------------------------------------
    Attachment: LUCENE-3098-3x.patch

Great! Attached the 3x backport.

> Grouped total count
>                 Key: LUCENE-3098
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3098
[jira] [Updated] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file
[ https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-3100:
------------------------------------
    Attachment: LUCENE-3100.patch

Here is a patch sync'ing the file on successful write during prepareCommit.

> IW.commit() writes but fails to fsync the N.fnx file
> ----------------------------------------------------
>
>                 Key: LUCENE-3100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3100
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Simon Willnauer
>             Fix For: 4.0
>         Attachments: LUCENE-3100.patch
>
> In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising bug! Because the new N.fnx file is written at the last minute along with the segments file, it's not included in the sis.files() that IW uses to figure out which files to sync.
> This bug means one could call IW.commit(), successfully, return, and then the machine could crash, and when it comes back up your index could be corrupted.
> We should hopefully first fix TestCrash so that it hits this bug (maybe it needs more/better randomization?), then fix the bug.
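The bug class here (a file written but never fsync'd before the commit point is published) can be illustrated with plain java.nio. This is a generic sketch of a durable write, not IndexWriter's actual sync path: a plain write only hands bytes to the OS cache, and it is `force(true)` that asks the OS to push data and metadata to stable storage, which is what commit() must guarantee for every file the commit point references.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Generic durable-write sketch (not IndexWriter's actual sync code):
// ch.write() only reaches the OS page cache; a crash before the OS
// flushes it can lose or corrupt the file even though write() returned.
// force(true) = fsync of data + metadata, the step LUCENE-3100 found
// missing for the N.fnx file.
public class DurableWrite {
    public static void writeDurably(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(data));
            ch.force(true); // fsync: without this the "commit" is not durable
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("fnx-demo", ".bin");
        writeDurably(p, new byte[] {1, 2, 3});
        System.out.println(Files.size(p)); // prints 3
    }
}
```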
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034242#comment-13034242 ]

Simon Willnauer commented on LUCENE-3092:
-----------------------------------------

Mike, I attached a patch to LUCENE-3100 and tested with the latest patch on this issue. The test randomly fails (after I close the IW in the test!). Here is a trace:

{noformat}
junit-sequential:
    [junit] Testsuite: org.apache.lucene.store.TestNRTCachingDirectory
    [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.16 sec
    [junit]
    [junit] - Standard Error -
    [junit] NOTE: reproduce with: ant test -Dtestcase=TestNRTCachingDirectory -Dtestmethod=testNRTAndCommit -Dtests.seed=-753565914717395747:-1817581638532977526
    [junit] NOTE: test params are: codec=RandomCodecProvider: {docid=SimpleText, body=MockFixedIntBlock(blockSize=1993), title=Pulsing(freqCutoff=3), titleTokenized=MockSep, date=SimpleText}, locale=ar_AE, timezone=America/Santa_Isabel
    [junit] NOTE: all tests run in this JVM: [TestNRTCachingDirectory]
    [junit] NOTE: Mac OS X 10.6.7 x86_64/Apple Inc. 1.6.0_24 (64-bit)/cpus=2,threads=1,free=46213552,total=85000192
    [junit] Testcase: testNRTAndCommit(org.apache.lucene.store.TestNRTCachingDirectory): FAILED
    [junit] limit=12 actual=16
    [junit] junit.framework.AssertionFailedError: limit=12 actual=16
    [junit]     at org.apache.lucene.index.RandomIndexWriter.doRandomOptimize(RandomIndexWriter.java:165)
    [junit]     at org.apache.lucene.index.RandomIndexWriter.close(RandomIndexWriter.java:199)
    [junit]     at org.apache.lucene.store.TestNRTCachingDirectory.testNRTAndCommit(TestNRTCachingDirectory.java:179)
    [junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
    [junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
    [junit]
    [junit] Test org.apache.lucene.store.TestNRTCachingDirectory FAILED
{noformat}

> NRTCachingDirectory, to buffer small segments in a RAMDir
> ---------------------------------------------------------
>
>                 Key: LUCENE-3092
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3092
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 3.2, 4.0
>         Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, LUCENE-3092.patch, LUCENE-3092.patch
>
> I created this simple Directory impl, whose goal is to reduce IO contention in a frequent-reopen NRT use case. The idea is, when reopening quickly but not indexing that much content, you wind up with many small files created over time, which can stress the IO system, e.g. if merges and searching are also fighting for IO.
> So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment does it write through to the real (delegate) directory. This lets you spend some RAM to reduce IO.
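The core policy the issue describes (small newly flushed files live in RAM, large merge outputs go straight to the real directory) can be sketched independently of Lucene's Directory API. All names below are hypothetical stand-ins, not the real NRTCachingDirectory interface:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of NRTCachingDirectory's write-through policy,
// independent of Lucene's real Directory API: files at or below a size
// threshold stay in a RAM map; anything bigger (e.g. a merged segment)
// is written through to the delegate (disk) store immediately.
public class CachingStoreSketch {
    private final Map<String, byte[]> ramFiles = new HashMap<>();
    private final Map<String, byte[]> diskFiles = new HashMap<>(); // stands in for the delegate directory
    private final int maxCachedBytes;

    public CachingStoreSketch(int maxCachedBytes) {
        this.maxCachedBytes = maxCachedBytes;
    }

    public void writeFile(String name, byte[] contents) {
        if (contents.length <= maxCachedBytes) {
            ramFiles.put(name, contents);  // small NRT flush: cache in RAM
        } else {
            diskFiles.put(name, contents); // big merge output: write through
        }
    }

    public boolean isCached(String name) {
        return ramFiles.containsKey(name);
    }

    public static void main(String[] args) {
        CachingStoreSketch dir = new CachingStoreSketch(1024);
        dir.writeFile("_0.cfs", new byte[100]);  // tiny segment -> RAM
        dir.writeFile("_1.cfs", new byte[4096]); // merged segment -> disk
        System.out.println(dir.isCached("_0.cfs")); // prints true
        System.out.println(dir.isCached("_1.cfs")); // prints false
    }
}
```

The real implementation also has to evict cached files to disk once they participate in a commit (which is exactly where the LUCENE-3100 fsync interaction above surfaced), but the size-based routing is the essential idea.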