[Lucene.Net] Problem while creating index for the xml file

2011-05-16 Thread Lalitha siva jyothi V
Dear Lucene team,

I would like to create index files for the XML file below using the
Lucene.Net dll v2.9. I used the code below, but it's not working.
Please guide me in creating index files for this XML file. Thanks
in advance


<NewsHistory>
<News>
<Story eid="34151">
<Stream>8742656</Stream>
<Identifier>KDILI00D9L36</Identifier>
<GroupIdentifier></GroupIdentifier>
<VersionType>ORIGINAL</VersionType>
<Action>ADD_1STPASS</Action>
<WireNumber>25</WireNumber>
<WireCode>BN</WireCode>
<Language>ENGLISH</Language>
<Time>20090115 13:30:00.000</Time>
<HotLevel>2</HotLevel>
<Headline>*U.S. INITIAL JOBLESS CLAIMS ROSE 54,000 TO 524,000
LAST WEEK</Headline>
<Type>PLAIN</Type>
<Text>China’s statistics bureau said it condemns leaks of
economic data and those responsible</Text>
</Story>
<Story eid="34151">
<Stream>8742656</Stream>
<Identifier>KDILI03T6SQU</Identifier>
<GroupIdentifier></GroupIdentifier>
<VersionType>ORIGINAL</VersionType>
<Action>ADD_1STPASS</Action>
<WireNumber>25</WireNumber>
<WireCode>BN</WireCode>
<Language>ENGLISH</Language>
<Time>20090115 13:30:00.000</Time>
<HotLevel>0</HotLevel>
<Headline>*U.S. INITIAL JOBLESS CLAIMS ROSE 54,000 TO 524,000
LAST WEEK</Headline>
<Type>PLAIN</Type>
<Text>China’s foreign-exchange reserves exceeded $3 trillion for
the first time</Text>
</Story>
</News>
</NewsHistory>

Code
===
string indexFileLocation = @"C:\Index";
Lucene.Net.Store.Directory dir =
FSDirectory.GetDirectory(indexFileLocation, true);

//create an analyzer to process the text
Lucene.Net.Analysis.Analyzer analyzer = new
Lucene.Net.Analysis.Standard.StandardAnalyzer();

//use the directory and analyzer created above
IndexWriter indexWriter = new IndexWriter(dir, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);

TextReader txtReader = new StreamReader(@"C:\NewsMetaData.xml");

//create a document, add in a single field
Document doc = new Document();
Field fldContent = new Field("contents", txtReader, Field.TermVector.YES);

doc.Add(fldContent);

//write the document to the index
indexWriter.AddDocument(doc);

//optimize and close the writer
indexWriter.Optimize();
indexWriter.Close();


RE: [Lucene.Net] Problem while creating index for the xml file

2011-05-16 Thread Prescott Nasser

What's the issue you're having? It seems like you're indexing the entire XML
document as one field, which likely isn't the best way to go.
 
~P
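
As a sketch of that per-field approach (illustrative code, not part of the original thread; written in Java, which the Lucene.Net 2.9 API closely mirrors): parse the file with the JDK's DOM API, collect one name-to-value map per `<Story>`, and then add each story to the index as its own Document. The class name and field handling here are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class StoryExtractor {
    /** Parses NewsHistory XML and returns one name->value field map per Story. */
    public static List<Map<String, String>> extract(String xml) {
        try {
            org.w3c.dom.Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Map<String, String>> stories = new ArrayList<>();
            NodeList storyNodes = dom.getElementsByTagName("Story");
            for (int i = 0; i < storyNodes.getLength(); i++) {
                Element story = (Element) storyNodes.item(i);
                Map<String, String> fields = new LinkedHashMap<>();
                NodeList children = story.getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        fields.put(child.getNodeName(), child.getTextContent().trim());
                    }
                }
                // With Lucene you would now create one Document per story,
                // add each entry as a Field ("Headline", "Text", ...), and
                // call indexWriter.addDocument(doc) here.
                stories.add(fields);
            }
            return stories;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Indexing Headline and Text as separate analyzed fields (and e.g. Identifier as an untokenized keyword field) lets queries target individual fields instead of the raw markup.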






 Date: Tue, 17 May 2011 11:04:30 +0530
 From: vlalithasivajyo...@gmail.com
 To: lucene-net-dev@lucene.apache.org
 Subject: [Lucene.Net] Problem while creating index for the xml file

[JENKINS] Lucene-Solr-tests-only-trunk - Build # 8078 - Still Failing

2011-05-16 Thread Apache Jenkins Server
Build: https://builds.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/8078/

No tests ran.

Build Log (for compile errors):
[...truncated 47 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3093) Build failed in the flexscoring branch because of Javadoc warnings

2011-05-16 Thread David Mark Nemeskey (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mark Nemeskey updated LUCENE-3093:


Affects Version/s: flexscoring branch

Thanks Robert! I have added the flexscoring branch to the Affected Version/s 
field as well to indicate that this whole issue belongs there.

 Build failed in the flexscoring branch because of Javadoc warnings
 --

 Key: LUCENE-3093
 URL: https://issues.apache.org/jira/browse/LUCENE-3093
 Project: Lucene - Java
  Issue Type: Bug
  Components: Javadocs
Affects Versions: flexscoring branch
 Environment: N/A
Reporter: David Mark Nemeskey
Assignee: Robert Muir
Priority: Minor
 Fix For: flexscoring branch

 Attachments: LUCENE-3093.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Ant build log:
   [javadoc] Standard Doclet version 1.6.0_24
   [javadoc] Building tree for all the packages and classes...
   [javadoc] 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/src/java/org/apache/lucene/search/Similarity.java:93:
  warning - Tag @link: can't find tf(float) in 
 org.apache.lucene.search.Similarity
   [javadoc] 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java:588:
  warning - @param argument term is not a parameter name.
   [javadoc] 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java:588:
  warning - @param argument docFreq is not a parameter name.
   [javadoc] 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java:618:
  warning - @param argument terms is not a parameter name.
   [javadoc] Generating 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/org/apache/lucene/store/instantiated//package-summary.html...
   [javadoc] Copying file 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/doc-files/classdiagram.png
  to directory 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/org/apache/lucene/store/instantiated/doc-files...
   [javadoc] Copying file 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/doc-files/HitCollectionBench.jpg
  to directory 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/org/apache/lucene/store/instantiated/doc-files...
   [javadoc] Copying file 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/doc-files/classdiagram.uxf
  to directory 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/org/apache/lucene/store/instantiated/doc-files...
   [javadoc] Generating 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/serialized-form.html...
   [javadoc] Copying file 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/prettify/stylesheet+prettify.css
  to file 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/stylesheet+prettify.css...
   [javadoc] Building index for all the packages and classes...
   [javadoc] Building index for all classes...
   [javadoc] Generating 
 /home/savior/Development/workspaces/java/Lucene-GSoC/lucene/build/docs/api/all/help-doc.html...
   [javadoc] 4 warnings

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-16 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033919#comment-13033919
 ] 

Stanislaw Osinski commented on SOLR-2448:
-

Hi, if there are no objections, I'd like to commit this patch later today. 
Thanks! S.

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, 
 SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Updated] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM

2011-05-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3101:


Attachment: LUCENE-3101_test.patch

an explicit test case

 TestMinimize.testAgainstBrzozowski reproducible seed OOM
 

 Key: LUCENE-3101
 URL: https://issues.apache.org/jira/browse/LUCENE-3101
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Uwe Schindler
 Attachments: LUCENE-3101_test.patch


 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.TestMinimize
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.792 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestMinimize 
 -Dtestmethod=testAgainstBrzozowski 
 -Dtests.seed=-7429820995201119781:1013305000165135537
 [junit] NOTE: test params are: codec=PreFlex, locale=ru, 
 timezone=America/Pangnirtung
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestMinimize]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=294745976,total=310378496
 [junit] -  ---
 [junit] Testcase: 
 testAgainstBrzozowski(org.apache.lucene.util.automaton.TestMinimize): 
 Caused an ERROR
 [junit] Java heap space
 [junit] java.lang.OutOfMemoryError: Java heap space
 [junit] at java.util.BitSet.initWords(BitSet.java:144)
 [junit] at java.util.BitSet.init(BitSet.java:139)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:85)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:52)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:502)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomatonAllowMutate(RegExp.java:478)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:428)
 [junit] at 
 org.apache.lucene.util.automaton.AutomatonTestUtil.randomAutomaton(AutomatonTestUtil.java:256)
 [junit] at 
 org.apache.lucene.util.automaton.TestMinimize.testAgainstBrzozowski(TestMinimize.java:43)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Test org.apache.lucene.util.automaton.TestMinimize FAILED
 {code}




[jira] [Assigned] (LUCENE-3070) Enable DocValues by default for every Codec

2011-05-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-3070:
---

Assignee: Simon Willnauer

 Enable DocValues by default for every Codec
 ---

 Key: LUCENE-3070
 URL: https://issues.apache.org/jira/browse/LUCENE-3070
 Project: Lucene - Java
  Issue Type: Task
  Components: Index
Affects Versions: CSF branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: CSF branch

 Attachments: LUCENE-3070.patch


 Currently DocValues are enabled with a wrapper Codec, so each codec which needs 
 DocValues must be wrapped by DocValuesCodec. The DocValues writer and reader 
 should be moved to Codec to be enabled by default.




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033937#comment-13033937
 ] 

Michael McCandless commented on LUCENE-3098:


Patch looks great Martijn; thanks!

Maybe, until we work out how multiple collectors can update a single 
TopGroups result, we should make TopGroups' totalGroupCount changeable after 
the fact?  Ie, add a setter?  This way apps can at least do it themselves 
before passing the TopGroups onto consumers within the apps?

Also, could you update the code sample in package.html, showing how to also use 
the TotalGroupCountCollector, incl. setting this totalGroupCount in the 
TopGroups?

 Grouped total count
 ---

 Key: LUCENE-3098
 URL: https://issues.apache.org/jira/browse/LUCENE-3098
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
 LUCENE-3098.patch, LUCENE-3098.patch


 When grouping you can currently get two counts:
 * Total hit count, which counts all documents that matched the query.
 * Total grouped hit count, which counts all documents that have been grouped 
 into the top N groups.
 Since the end user gets groups in his search result instead of plain 
 documents, the total number of groups as the total count makes more 
 sense in many situations. 




[jira] [Commented] (LUCENE-3097) Post grouping faceting

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033939#comment-13033939
 ] 

Michael McCandless commented on LUCENE-3097:


Thanks for the example Bill -- that makes sense!

I think, in general, the post-group faceting should act as if you had indexed 
a single document per group, with multi-valued fields containing the union of 
all field values within that group, and then done normal faceting.  I believe 
this defines the semantics we are after for post-grouping faceting.
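
That definition can be sketched without any Lucene API (a pure-Java illustration; the class and method names are made up for this sketch): take the union of each group's facet values, then let every group contribute at most one count per value.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PostGroupFacets {
    /**
     * docs: each entry is {groupKey, facetValue}. Returns facet counts
     * computed as if one document existed per group, carrying the union
     * of the group's facet values.
     */
    public static Map<String, Integer> facetByGroup(List<String[]> docs) {
        // group key -> union of facet values seen in that group
        Map<String, Set<String>> unions = new LinkedHashMap<>();
        for (String[] doc : docs) {
            unions.computeIfAbsent(doc[0], k -> new LinkedHashSet<>()).add(doc[1]);
        }
        // each group contributes at most 1 to each facet value it contains
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Set<String> values : unions.values()) {
            for (String v : values) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For the hotel example in this issue, Hotel a contributes {AMS, DUS} and Hotel b contributes {AMS}, giving AMS: 2, DUS: 1.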

 Post grouping faceting
 --

 Key: LUCENE-3097
 URL: https://issues.apache.org/jira/browse/LUCENE-3097
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
Priority: Minor
 Fix For: 3.2, 4.0


 This issue focuses on implementing post grouping faceting.
 * How to handle multivalued fields. What field value to show with the facet.
 * What the facet counts should be based on:
 ** Facet counts can be based on the normal documents. Ungrouped counts. 
 ** Facet counts can be based on the groups. Grouped counts.
 ** Facet counts can be based on the combination of group value and facet 
 value. Matrix counts.   
 And probably more implementation options.
 The first two methods are implemented in the SOLR-236 patch. For the first 
 option it calculates a DocSet based on the individual documents from the 
 query result. For the second option it calculates a DocSet for all the most 
 relevant documents of a group. Once the DocSet is computed, the FacetComponent 
 and StatsComponent use the DocSet to create facets and statistics.  
 This last one is a bit more complex. I think it is best explained with an 
 example. Let's say we search on travel offers:
 ||hotel||departure_airport||duration||
 |Hotel a|AMS|5|
 |Hotel a|DUS|10|
 |Hotel b|AMS|5|
 |Hotel b|AMS|10|
 If we group by hotel and have a facet on airport, most end users expect 
 (in my experience, of course) the following airport facet:
 AMS: 2
 DUS: 1
 The above result can't be achieved by the first two methods. You either get 
 counts AMS:3 and DUS:1, or 1 for both airports.




[jira] [Commented] (LUCENE-3097) Post grouping faceting

2011-05-16 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033940#comment-13033940
 ] 

Martijn van Groningen commented on LUCENE-3097:
---

bq. If I say, facet.field=gender I would expect:
I think this can be achieved by basing the facet counts on the normal 
documents. Ungrouped counts.

{quote}
If we had Spatial, and I had lat long for each address, I would expect if I say 
sort=geodist() asc that it would group and then find the closest 
point for each grouping to return in the proper order. For example, if I was at 
103 E 5th St, I would expect the output for doctorid=1 to be:
{quote}
This just depends on the sort / group sort you provide. I think this should 
already work in the Solr trunk.

bq. If I only need the 1st point in the grouping I would expect the other 
points to be omitted.
This depends on the group limit you provide in the request.

 Post grouping faceting
 --

 Key: LUCENE-3097
 URL: https://issues.apache.org/jira/browse/LUCENE-3097
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
Priority: Minor
 Fix For: 3.2, 4.0


 This issue focuses on implementing post grouping faceting.
 * How to handle multivalued fields. What field value to show with the facet.
 * What the facet counts should be based on:
 ** Facet counts can be based on the normal documents. Ungrouped counts. 
 ** Facet counts can be based on the groups. Grouped counts.
 ** Facet counts can be based on the combination of group value and facet 
 value. Matrix counts.   
 And probably more implementation options.
 The first two methods are implemented in the SOLR-236 patch. For the first 
 option it calculates a DocSet based on the individual documents from the 
 query result. For the second option it calculates a DocSet for all the most 
 relevant documents of a group. Once the DocSet is computed, the FacetComponent 
 and StatsComponent use the DocSet to create facets and statistics.  
 This last one is a bit more complex. I think it is best explained with an 
 example. Let's say we search on travel offers:
 ||hotel||departure_airport||duration||
 |Hotel a|AMS|5|
 |Hotel a|DUS|10|
 |Hotel b|AMS|5|
 |Hotel b|AMS|10|
 If we group by hotel and have a facet on airport, most end users expect 
 (in my experience, of course) the following airport facet:
 AMS: 2
 DUS: 1
 The above result can't be achieved by the first two methods. You either get 
 counts AMS:3 and DUS:1, or 1 for both airports.




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033943#comment-13033943
 ] 

Martijn van Groningen commented on LUCENE-3098:
---

I will update both patches today. A setter in TopGroups for now seems fine to 
me.

 Grouped total count
 ---

 Key: LUCENE-3098
 URL: https://issues.apache.org/jira/browse/LUCENE-3098
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
 LUCENE-3098.patch, LUCENE-3098.patch


 When grouping you can currently get two counts:
 * Total hit count, which counts all documents that matched the query.
 * Total grouped hit count, which counts all documents that have been grouped 
 into the top N groups.
 Since the end user gets groups in his search result instead of plain 
 documents, the total number of groups as the total count makes more 
 sense in many situations. 




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033946#comment-13033946
 ] 

Michael McCandless commented on LUCENE-3098:


One more idea: should we add a getter to TotalGroupCountCollector so you can 
actually get the groups (CollectionBytesRef) themselves...?  (Ie, not just 
the total unique count).

 Grouped total count
 ---

 Key: LUCENE-3098
 URL: https://issues.apache.org/jira/browse/LUCENE-3098
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
 LUCENE-3098.patch, LUCENE-3098.patch


 When grouping you can currently get two counts:
 * Total hit count, which counts all documents that matched the query.
 * Total grouped hit count, which counts all documents that have been grouped 
 into the top N groups.
 Since the end user gets groups in his search result instead of plain 
 documents, the total number of groups as the total count makes more 
 sense in many situations. 




[jira] [Commented] (LUCENE-3097) Post grouping faceting

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033947#comment-13033947
 ] 

Michael McCandless commented on LUCENE-3097:


Right, gender in this example was single-valued per group.

Another way to visualize / define how post-group faceting should behave is: 
imagine for every facet value (ie field + value) you could define an aggregator. 
Today, that aggregator is just the count of how many docs had that value in 
the full result set.  But you could instead define it to be 
count(distinct(doctor_id)), and then you'll get the group counts you want.  
(Other aggregators are conceivable -- max(relevance), min+max(prices), etc.).

Conceptually I think this also defines the post-group faceting functionality, 
even if we would never implement it this way (ie count(distinct(doctor_id)) 
would be way too costly to do naively).

 Post grouping faceting
 --

 Key: LUCENE-3097
 URL: https://issues.apache.org/jira/browse/LUCENE-3097
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
Priority: Minor
 Fix For: 3.2, 4.0


 This issue focuses on implementing post grouping faceting.
 * How to handle multivalued fields. What field value to show with the facet.
 * What the facet counts should be based on:
 ** Facet counts can be based on the normal documents. Ungrouped counts. 
 ** Facet counts can be based on the groups. Grouped counts.
 ** Facet counts can be based on the combination of group value and facet 
 value. Matrix counts.   
 And probably more implementation options.
 The first two methods are implemented in the SOLR-236 patch. For the first 
 option it calculates a DocSet based on the individual documents from the 
 query result. For the second option it calculates a DocSet for all the most 
 relevant documents of a group. Once the DocSet is computed, the FacetComponent 
 and StatsComponent use the DocSet to create facets and statistics.  
 This last one is a bit more complex. I think it is best explained with an 
 example. Let's say we search on travel offers:
 ||hotel||departure_airport||duration||
 |Hotel a|AMS|5|
 |Hotel a|DUS|10|
 |Hotel b|AMS|5|
 |Hotel b|AMS|10|
 If we group by hotel and have a facet on airport, most end users expect 
 (in my experience, of course) the following airport facet:
 AMS: 2
 DUS: 1
 The above result can't be achieved by the first two methods. You either get 
 counts AMS:3 and DUS:1, or 1 for both airports.




[jira] [Commented] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033950#comment-13033950
 ] 

Robert Muir commented on LUCENE-3101:
-

the problem appears to be splitblock[] and partition[]. these are using n^2 
space... 
the rest of the datastructures seem ok (either just #states or sigma * #states)

these two were cut over from arraylist to bitset in revision 1026190, but it 
looks like they are 
sparse and we should use a better datastructure (just for these two, i think 
the other bitsets are all fine).
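
A back-of-the-envelope check of that n^2 concern (illustrative arithmetic only, not the actual automaton code): a dense java.util.BitSet holding one bit per (state, state) pair allocates roughly n^2/8 bytes up front, regardless of how few bits are ever set.

```java
public class DenseBitsetCost {
    /** Approximate bytes a dense bitset over all (state, state) pairs needs. */
    public static long denseBitsetBytes(long numStates) {
        // one bit per pair -> numStates^2 bits -> divide by 8 for bytes
        return numStates * numStates / 8;
    }

    public static void main(String[] args) {
        // 50,000 states already demand ~312 MB, on the order of the ~300 MB
        // heap shown in the failing test's JVM info; a sparse structure
        // would only pay for the pairs actually marked.
        System.out.println(denseBitsetBytes(50_000)); // 312500000
    }
}
```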


 TestMinimize.testAgainstBrzozowski reproducible seed OOM
 

 Key: LUCENE-3101
 URL: https://issues.apache.org/jira/browse/LUCENE-3101
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Uwe Schindler
 Attachments: LUCENE-3101_test.patch


 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.TestMinimize
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.792 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestMinimize 
 -Dtestmethod=testAgainstBrzozowski 
 -Dtests.seed=-7429820995201119781:1013305000165135537
 [junit] NOTE: test params are: codec=PreFlex, locale=ru, 
 timezone=America/Pangnirtung
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestMinimize]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=294745976,total=310378496
 [junit] -  ---
 [junit] Testcase: 
 testAgainstBrzozowski(org.apache.lucene.util.automaton.TestMinimize): 
 Caused an ERROR
 [junit] Java heap space
 [junit] java.lang.OutOfMemoryError: Java heap space
 [junit] at java.util.BitSet.initWords(BitSet.java:144)
 [junit] at java.util.BitSet.init(BitSet.java:139)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:85)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:52)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:502)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomatonAllowMutate(RegExp.java:478)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:428)
 [junit] at 
 org.apache.lucene.util.automaton.AutomatonTestUtil.randomAutomaton(AutomatonTestUtil.java:256)
 [junit] at 
 org.apache.lucene.util.automaton.TestMinimize.testAgainstBrzozowski(TestMinimize.java:43)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Test org.apache.lucene.util.automaton.TestMinimize FAILED
 {code}




[jira] [Commented] (LUCENE-3097) Post grouping faceting

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033953#comment-13033953
 ] 

Michael McCandless commented on LUCENE-3097:


In fact, I think a very efficient way to implement post-group faceting is 
something like LUCENE-2454.

Ie, we just have to ensure, at indexing time, that docs within the same group 
are adjacent, if you want to be able to count by unique group values.

Hmm... but I think this (what your identifier field is, for facet counting 
purposes) should be decoupled from how you group.  I may group by State, for 
presentation purposes, but count facets by doctor_id.
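
The index-time adjacency idea can be sketched without Lucene (illustrative code; the names are made up): if all documents of a group are contiguous, the unique group count is just the number of value changes in a single pass, with no per-group hash table.

```java
import java.util.List;
import java.util.Objects;

public class AdjacentGroupCount {
    /**
     * Counts distinct groups in one pass, assuming documents with the same
     * group value are adjacent (e.g. sorted by group at indexing time).
     */
    public static int countGroups(List<String> groupValues) {
        int groups = 0;
        String prev = null;
        boolean first = true;
        for (String v : groupValues) {
            if (first || !Objects.equals(v, prev)) {
                groups++; // a new run of adjacent documents = a new group
            }
            prev = v;
            first = false;
        }
        return groups;
    }
}
```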

 Post grouping faceting
 --

 Key: LUCENE-3097
 URL: https://issues.apache.org/jira/browse/LUCENE-3097
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
Priority: Minor
 Fix For: 3.2, 4.0


 This issues focuses on implementing post grouping faceting.
 * How to handle multivalued fields. What field value to show with the facet.
 * Where the facet counts should be based on
 ** Facet counts can be based on the normal documents. Ungrouped counts. 
 ** Facet counts can be based on the groups. Grouped counts.
 ** Facet counts can be based on the combination of group value and facet 
 value. Matrix counts.   
 And properly more implementation options.
 The first two methods are implemented in the SOLR-236 patch. For the first 
 option it calculates a DocSet based on the individual documents from the 
 query result. For the second option it calculates a DocSet for all the most 
 relevant documents of a group. Once the DocSet is computed the FacetComponent 
 and StatsComponent use one the DocSet to create facets and statistics.  
 This last one is a bit more complex. I think it is best explained with an 
 example. Lets say we search on travel offers:
 |||hotel||departure_airport||duration||
 |Hotel a|AMS|5
 |Hotel a|DUS|10
 |Hotel b|AMS|5
 |Hotel b|AMS|10
 If we group by hotel and have a facet for airport. Most end users expect 
 (according to my experience off course) the following airport facet:
 AMS: 2
 DUS: 1
 The above result can't be achieved by the first two methods. You either get 
 counts AMS:3 and DUS:1 or 1 for both airports.




[jira] [Updated] (LUCENE-3070) Enable DocValues by default for every Codec

2011-05-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-3070:


Attachment: LUCENE-3070.patch

This patch adds UOE to PreFlex codec and makes FieldInfo#docValues 
transactional to prevent wrong flags if non-aborting exceptions occur.

I also added some random docValues fields to RandomIndexWriter, as well as some 
basic checks to CheckIndex. It's not perfect yet, but it's a start.

 Enable DocValues by default for every Codec
 ---

 Key: LUCENE-3070
 URL: https://issues.apache.org/jira/browse/LUCENE-3070
 Project: Lucene - Java
  Issue Type: Task
  Components: Index
Affects Versions: CSF branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: CSF branch

 Attachments: LUCENE-3070.patch, LUCENE-3070.patch


 Currently DocValues are enabled via a wrapper Codec, so each codec that needs 
 DocValues must be wrapped by DocValuesCodec. The DocValues writer and reader 
 should be moved to Codec to be enabled by default.




[jira] [Updated] (LUCENE-3014) comparator API for segment versions

2011-05-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3014:


Attachment: LUCENE-3014.patch

Initial patch.

 comparator API for segment versions
 ---

 Key: LUCENE-3014
 URL: https://issues.apache.org/jira/browse/LUCENE-3014
 Project: Lucene - Java
  Issue Type: Task
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Critical
 Fix For: 3.2

 Attachments: LUCENE-3014.patch


 See LUCENE-3012 for an example.
 Things get ugly if you want to use SegmentInfo.getVersion().
 For example, what if we committed my patch and released 3.2, but later released 
 3.1.1 (would 3.1.1 then be what's written and returned by this function?)
 Then suddenly we'd have broken the index format, because we are using Strings here 
 without a reasonable comparator API.
 In this case one should be able to compute safely whether the version is < 3.2.
 If we don't do this, and we rely on this version information internally in 
 Lucene, I think we are going to break something.
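The comparator Robert is asking for amounts to comparing versions segment by segment as numbers rather than as raw Strings. A minimal sketch (illustrative names, not Lucene's actual API):

```java
public class VersionComparator {
    // Compare dotted version strings numerically: "3.1.1" < "3.2" < "3.10".
    static int compareVersions(String a, String b) {
        String[] as = a.split("\\."), bs = b.split("\\.");
        int n = Math.max(as.length, bs.length);
        for (int i = 0; i < n; i++) {
            // A missing trailing segment counts as 0, so "3.2" == "3.2.0".
            int ai = i < as.length ? Integer.parseInt(as[i]) : 0;
            int bi = i < bs.length ? Integer.parseInt(bs[i]) : 0;
            if (ai != bi) return Integer.compare(ai, bi);
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(compareVersions("3.1.1", "3.2") < 0); // true
        System.out.println(compareVersions("3.10", "3.2") > 0);  // true
        // Raw String comparison gets this wrong, which is the whole point:
        System.out.println("3.10".compareTo("3.2") < 0);         // true
    }
}
```

Without something like this, any code doing `version1.compareTo(version2)` on SegmentInfo version strings silently misorders two-digit minor versions.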




[jira] [Commented] (LUCENE-3070) Enable DocValues by default for every Codec

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033970#comment-13033970
 ] 

Robert Muir commented on LUCENE-3070:
-

Seems like it might be a good idea in RandomIndexWriter to sometimes not add 
docvalues?


 Enable DocValues by default for every Codec
 ---

 Key: LUCENE-3070
 URL: https://issues.apache.org/jira/browse/LUCENE-3070
 Project: Lucene - Java
  Issue Type: Task
  Components: Index
Affects Versions: CSF branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: CSF branch

 Attachments: LUCENE-3070.patch, LUCENE-3070.patch


 Currently DocValues are enabled via a wrapper Codec, so each codec that needs 
 DocValues must be wrapped by DocValuesCodec. The DocValues writer and reader 
 should be moved to Codec to be enabled by default.




Re: 3.2.0 (or 3.1.1)

2011-05-16 Thread Simon Willnauer
+1 for pushing 3.2!

There have been discussions about porting DWPT to 3.x, but I think it's
a little premature now, and I am still not sure we should do it at
all. The refactoring is pretty intense throughout all of IndexWriter, and
it integrates with Flex / Codecs. I am not saying it's impossible, it's
certainly doable, but I am not sure it's worth the hassle; let's
rather concentrate on 4.0.

The question is whether we should backport stuff like LUCENE-2881 to 3.2 or
hold off until 3.3, or whether we should do it at all.

simon

On Sat, May 14, 2011 at 12:30 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 +1 for 3.2.

 Mike

 http://blog.mikemccandless.com

 On Sat, May 14, 2011 at 12:32 AM, Shai Erera ser...@gmail.com wrote:
 +1 for 3.2!

 And also, we should adopt that approach going forward (no more bug-fix
 releases for the stable branch, except for the last release before 4.0
 is out). That means updating the release TODO with, e.g., not creating
 a branch for 3.2.x, only tagging it. When 4.0 is out, we branch 3.x.y off
 the last 3.x tag.

 Shai

 On Saturday, May 14, 2011, Ryan McKinley ryan...@gmail.com wrote:
 On Fri, May 13, 2011 at 6:40 PM, Grant Ingersoll gsing...@apache.org 
 wrote:
 It's been just over 1 month since the last release.  We've all said we 
 want to get to about a 3 month release cycle (if not more often).  I think 
 this means we should start shooting for a next release sometime in June.  
 Which, in my mind, means we should start working on wrapping up issues 
 now, IMO.

 Here's what's open for 3.2 against:
 Lucene: https://issues.apache.org/jira/browse/LUCENE/fixforversion/12316070
 Solr: https://issues.apache.org/jira/browse/SOLR/fixforversion/12316172

 Thoughts?


 +1 for 3.2 with a new feature freeze pretty soon







[jira] [Commented] (LUCENE-3070) Enable DocValues by default for every Codec

2011-05-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033971#comment-13033971
 ] 

Simon Willnauer commented on LUCENE-3070:
-

bq. Seems like it might be a good idea in RandomIndexWriter to sometimes not 
add docvalues?

Yeah, I think we should make this per RIW session rather than per document, since 
we already have random DocValues types: some docs might get docvalues_int_xyz 
fields and some might get docvalues_float_xyz fields.

 Enable DocValues by default for every Codec
 ---

 Key: LUCENE-3070
 URL: https://issues.apache.org/jira/browse/LUCENE-3070
 Project: Lucene - Java
  Issue Type: Task
  Components: Index
Affects Versions: CSF branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: CSF branch

 Attachments: LUCENE-3070.patch, LUCENE-3070.patch


 Currently DocValues are enabled via a wrapper Codec, so each codec that needs 
 DocValues must be wrapped by DocValuesCodec. The DocValues writer and reader 
 should be moved to Codec to be enabled by default.




[jira] [Updated] (LUCENE-3070) Enable DocValues by default for every Codec

2011-05-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-3070:


Attachment: LUCENE-3070.patch

New patch: I added random DocValues to updateDocument, and I randomly enable / 
disable docValues entirely on optimize / commit / getReader, so we get segments 
that don't have docValues at all, etc. I think I will commit soon if nobody 
objects.

 Enable DocValues by default for every Codec
 ---

 Key: LUCENE-3070
 URL: https://issues.apache.org/jira/browse/LUCENE-3070
 Project: Lucene - Java
  Issue Type: Task
  Components: Index
Affects Versions: CSF branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: CSF branch

 Attachments: LUCENE-3070.patch, LUCENE-3070.patch, LUCENE-3070.patch


 Currently DocValues are enabled via a wrapper Codec, so each codec that needs 
 DocValues must be wrapped by DocValuesCodec. The DocValues writer and reader 
 should be moved to Codec to be enabled by default.




Re: 3.2.0 (or 3.1.1)

2011-05-16 Thread Simon Willnauer
On Mon, May 16, 2011 at 1:30 PM, Robert Muir rcm...@gmail.com wrote:
 On Mon, May 16, 2011 at 7:10 AM, Simon Willnauer
 simon.willna...@googlemail.com wrote:
 the question is if we should backport stuff like LUCENE-2881 to 3.2 or
 if we should hold off until 3.3, should we do it at all?


 I think it depends solely on whether someone is willing to do the work. The
 only thing I would suggest is that if we did such a thing, it would really
 be preferable to have around 2 weeks of Hudson runs to knock
 out problems.


Absolutely, but I think we can safely move that to 3.3; I am
busy with other things right now.

simon




[jira] [Updated] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM

2011-05-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-3101:
--

Attachment: LUCENE-3101.patch

This patch reverts splitblock[], partition[] and reverse[][] to their state before 
r1026190; the top-level BitSets (those not in inner loops) are unchanged.

 TestMinimize.testAgainstBrzozowski reproducible seed OOM
 

 Key: LUCENE-3101
 URL: https://issues.apache.org/jira/browse/LUCENE-3101
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Uwe Schindler
 Attachments: LUCENE-3101.patch, LUCENE-3101_test.patch


 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.TestMinimize
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.792 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestMinimize 
 -Dtestmethod=testAgainstBrzozowski 
 -Dtests.seed=-7429820995201119781:1013305000165135537
 [junit] NOTE: test params are: codec=PreFlex, locale=ru, 
 timezone=America/Pangnirtung
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestMinimize]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=294745976,total=310378496
 [junit] -  ---
 [junit] Testcase: 
 testAgainstBrzozowski(org.apache.lucene.util.automaton.TestMinimize): 
 Caused an ERROR
 [junit] Java heap space
 [junit] java.lang.OutOfMemoryError: Java heap space
 [junit] at java.util.BitSet.initWords(BitSet.java:144)
  [junit] at java.util.BitSet.&lt;init&gt;(BitSet.java:139)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:85)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:52)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:502)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomatonAllowMutate(RegExp.java:478)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:428)
 [junit] at 
 org.apache.lucene.util.automaton.AutomatonTestUtil.randomAutomaton(AutomatonTestUtil.java:256)
 [junit] at 
 org.apache.lucene.util.automaton.TestMinimize.testAgainstBrzozowski(TestMinimize.java:43)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Test org.apache.lucene.util.automaton.TestMinimize FAILED
 {code}




[jira] [Commented] (LUCENE-3070) Enable DocValues by default for every Codec

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033977#comment-13033977
 ] 

Robert Muir commented on LUCENE-3070:
-

Looks good, I think this will help the test coverage a lot.

can you rename swtichDoDocValues to switchDoDocValues? :)

 Enable DocValues by default for every Codec
 ---

 Key: LUCENE-3070
 URL: https://issues.apache.org/jira/browse/LUCENE-3070
 Project: Lucene - Java
  Issue Type: Task
  Components: Index
Affects Versions: CSF branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: CSF branch

 Attachments: LUCENE-3070.patch, LUCENE-3070.patch, LUCENE-3070.patch


 Currently DocValues are enabled via a wrapper Codec, so each codec that needs 
 DocValues must be wrapped by DocValuesCodec. The DocValues writer and reader 
 should be moved to Codec to be enabled by default.




[jira] [Updated] (LUCENE-3070) Enable DocValues by default for every Codec

2011-05-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-3070:


Attachment: LUCENE-3070.patch

fixed typo - I will commit in a second.

 Enable DocValues by default for every Codec
 ---

 Key: LUCENE-3070
 URL: https://issues.apache.org/jira/browse/LUCENE-3070
 Project: Lucene - Java
  Issue Type: Task
  Components: Index
Affects Versions: CSF branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: CSF branch

 Attachments: LUCENE-3070.patch, LUCENE-3070.patch, LUCENE-3070.patch, 
 LUCENE-3070.patch


 Currently DocValues are enabled via a wrapper Codec, so each codec that needs 
 DocValues must be wrapped by DocValuesCodec. The DocValues writer and reader 
 should be moved to Codec to be enabled by default.




Moving towards Lucene 4.0

2011-05-16 Thread Simon Willnauer
Hey folks,

we just started the discussion about Lucene 3.2 and releasing more
often. Yet, I think we should also start planning for Lucene 4.0 soon.
We have tons of stuff in trunk that people want to have and we can't
just keep on talking about it - we need to push this out to our users.
From my perspective we should decide on at least the big outstanding
issues like:

- BulkPostings (my +1, since I want to enable positional scoring on all queries)
- DocValues (pretty close)
- FlexibleScoring (+-0; I think we should wait to see how GSoC turns out and
decide then?)
- Codec support for Stored Fields, Norms & TV (not sure about that, but it
seems doable, at least as an API with the current impl as default)
- Realtime Search aka Searchable RAM Buffer (this seems quite far off;
much as I would love to have it, it seems we need to push it past 4.0)

For DocValues the decision seems easy, since we are very close with
that and I expect it to land by the end of June. I want to kick off the
discussion here, so nothing will be set in stone really, but I think we
should plan to release somewhere near the end of the year?!


simon




[jira] [Updated] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM

2011-05-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-3101:
--

Attachment: LUCENE-3101.patch

Some perf analysis showed that replacing the LinkedList in 
partition[] with a HashSet makes it faster: order is unimportant, and the 
b1.remove()/b2.add() combination in the inner loop no longer does a linear scan.
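The LinkedList-to-HashSet change can be shown in isolation (a sketch of the access pattern only, not the actual MinimizationOperations code):

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionMove {
    // Hopcroft-style refinement repeatedly moves a state between partition
    // blocks. With a LinkedList, from.remove(state) scans the list, which is
    // O(n) per move; with a HashSet the same move is expected O(1), and the
    // iteration order of the block doesn't matter to the algorithm.
    static <T> void move(Set<T> from, Set<T> to, T state) {
        from.remove(state); // hash lookup, no linear scan
        to.add(state);
    }

    public static void main(String[] args) {
        Set<Integer> b1 = new HashSet<>();
        Set<Integer> b2 = new HashSet<>();
        for (int s = 0; s < 5; s++) b1.add(s);
        move(b1, b2, 3);
        System.out.println(b1.contains(3)); // false
        System.out.println(b2.contains(3)); // true
    }
}
```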

 TestMinimize.testAgainstBrzozowski reproducible seed OOM
 

 Key: LUCENE-3101
 URL: https://issues.apache.org/jira/browse/LUCENE-3101
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Uwe Schindler
 Attachments: LUCENE-3101.patch, LUCENE-3101.patch, 
 LUCENE-3101_test.patch


 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.TestMinimize
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.792 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestMinimize 
 -Dtestmethod=testAgainstBrzozowski 
 -Dtests.seed=-7429820995201119781:1013305000165135537
 [junit] NOTE: test params are: codec=PreFlex, locale=ru, 
 timezone=America/Pangnirtung
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestMinimize]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=294745976,total=310378496
 [junit] -  ---
 [junit] Testcase: 
 testAgainstBrzozowski(org.apache.lucene.util.automaton.TestMinimize): 
 Caused an ERROR
 [junit] Java heap space
 [junit] java.lang.OutOfMemoryError: Java heap space
 [junit] at java.util.BitSet.initWords(BitSet.java:144)
  [junit] at java.util.BitSet.&lt;init&gt;(BitSet.java:139)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:85)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:52)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:502)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomatonAllowMutate(RegExp.java:478)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:428)
 [junit] at 
 org.apache.lucene.util.automaton.AutomatonTestUtil.randomAutomaton(AutomatonTestUtil.java:256)
 [junit] at 
 org.apache.lucene.util.automaton.TestMinimize.testAgainstBrzozowski(TestMinimize.java:43)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Test org.apache.lucene.util.automaton.TestMinimize FAILED
 {code}




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Lucene Fields: [New, Patch Available]  (was: [New])

 Few issues with CachingCollector
 

 Key: LUCENE-3102
 URL: https://issues.apache.org/jira/browse/LUCENE-3102
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3102.patch


 CachingCollector (introduced in LUCENE-1421) has a few issues:
 # Since the wrapped Collector may support out-of-order collection, the 
 document IDs cached may be out-of-order (depending on the Query), and thus 
 replay(Collector) will forward document IDs out-of-order to a Collector that 
 may not support it.
 # It does not clear cachedScores + cachedSegs upon exceeding RAM limits.
 # I think that instead of comparing curScores to null in order to determine 
 whether scores are requested, we should have a specific boolean, for clarity.
 # Can the check if (base + nextLength > maxDocsToCache) (line 168) be 
 relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
 maxDocsToCache constraint, but if it were 10K I would? Wouldn't we still want 
 to try and cache them?
 Also:
 * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
 need that if the CachingCollector ctor already takes a boolean cacheScores? I 
 think it's better defined explicitly than implicitly.
 * Let's introduce a factory method for creating a specialized version depending 
 on whether scoring is requested (i.e., impl the TODO in line 189).
 * I think it's a useful collector which stands on its own and is not specific 
 to grouping. Can we move it to core?
 * How about using OpenBitSet instead of int[] for doc IDs?
 ** If the number of hits is big, we'd gain some RAM back, and be able to 
 cache more entries.
 ** NOTE: OpenBitSet can be used for in-order collection only. So we can 
 use it if the wrapped Collector does not support out-of-order collection.
 * Do you think we can modify this Collector to not necessarily wrap another 
 Collector? We have such a Collector, which stores (in-memory) all matching doc 
 IDs + scores (if required). Those are later fed into several processes that 
 operate on them (e.g. fetch more info from the index etc.). I am thinking we 
 can make CachingCollector *optionally* wrap another Collector; then 
 someone could reuse it by setting the RAM limit to unlimited (we should have a 
 constant for that) in order to simply collect all matching docs + scores.
 * I think a set of dedicated unit tests for this class alone would be good.
 That's it so far. Perhaps, if we do all of the above, more things will pop up.
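The int[]-vs-bit-set trade-off can be estimated with a back-of-the-envelope calculation, using java.util.BitSet as a stand-in for OpenBitSet (a sketch under assumed index sizes; actual numbers depend on the segment):

```java
import java.util.BitSet;

public class CacheMemory {
    // Bytes to cache `hits` doc IDs in an int[]: 4 bytes per cached doc.
    static long intArrayBytes(int hits) { return (long) hits * Integer.BYTES; }

    // Bytes for a bit set over the whole segment: 1 bit per doc in the index.
    static long bitSetBytes(int maxDoc) { return maxDoc / 8L; }

    public static void main(String[] args) {
        int maxDoc = 10_000_000; // assumed docs in the index
        int hits = 2_000_000;    // assumed matching docs to cache

        // int[] wins for sparse result sets; the bit set wins roughly once
        // hits exceed maxDoc / 32.
        System.out.println(intArrayBytes(hits)); // 8000000
        System.out.println(bitSetBytes(maxDoc)); // 1250000

        // The in-order caveat from the issue: a bit set keeps only
        // membership, not collection order, so it fits in-order
        // collection only.
        BitSet cached = new BitSet(maxDoc);
        cached.set(42);
        System.out.println(cached.get(42)); // true
    }
}
```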




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Attachment: LUCENE-3102.patch

Patch includes the bug fixes + a test, but still none of the items I listed after 
'Also ...'. I plan to tackle those next, in subsequent patches.

Question -- perhaps we can commit these changes incrementally? I.e., after we 
iterate on the changes in this patch, if they are ok, commit them, then do the 
rest of the stuff? Or is a single commit w/ everything preferable?

Mike, there is another reason to separate Collector.needsScores() from 
cacheScores: it is possible someone will pass a Collector which needs scores 
but won't want CachingCollector to 'cache' them. In that case, the 
wrapped Collector should be delegated setScorer instead of cachedScorer.

I will leave Collector.needsScores() for a different issue, though.
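Shai's delegation point can be sketched like this (hypothetical names, not the actual CachingCollector code): the wrapped Collector should see the cached scorer only when scores are actually being cached.

```java
public class ScorerDelegation {
    interface Scorer { float score(); }

    // Stand-in for the live scorer of the running query.
    static final Scorer REAL = () -> 1.0f;
    // Stand-in for a scorer that replays previously cached scores.
    static final Scorer CACHED = () -> 2.0f;

    // If scores are cached, delegate the cached scorer to the wrapped
    // collector; otherwise pass the real scorer straight through, so a
    // score-needing Collector still works when caching is off.
    static Scorer scorerFor(boolean cacheScores) {
        return cacheScores ? CACHED : REAL;
    }

    public static void main(String[] args) {
        System.out.println(scorerFor(true) == CACHED);  // true
        System.out.println(scorerFor(false) == REAL);   // true
    }
}
```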

 Few issues with CachingCollector
 

 Key: LUCENE-3102
 URL: https://issues.apache.org/jira/browse/LUCENE-3102
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3102.patch


 CachingCollector (introduced in LUCENE-1421) has a few issues:
 # Since the wrapped Collector may support out-of-order collection, the 
 document IDs cached may be out-of-order (depending on the Query), and thus 
 replay(Collector) will forward document IDs out-of-order to a Collector that 
 may not support it.
 # It does not clear cachedScores + cachedSegs upon exceeding RAM limits.
 # I think that instead of comparing curScores to null in order to determine 
 whether scores are requested, we should have a specific boolean, for clarity.
 # Can the check if (base + nextLength > maxDocsToCache) (line 168) be 
 relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
 maxDocsToCache constraint, but if it were 10K I would? Wouldn't we still want 
 to try and cache them?
 Also:
 * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
 need that if the CachingCollector ctor already takes a boolean cacheScores? I 
 think it's better defined explicitly than implicitly.
 * Let's introduce a factory method for creating a specialized version depending 
 on whether scoring is requested (i.e., impl the TODO in line 189).
 * I think it's a useful collector which stands on its own and is not specific 
 to grouping. Can we move it to core?
 * How about using OpenBitSet instead of int[] for doc IDs?
 ** If the number of hits is big, we'd gain some RAM back, and be able to 
 cache more entries.
 ** NOTE: OpenBitSet can be used for in-order collection only. So we can 
 use it if the wrapped Collector does not support out-of-order collection.
 * Do you think we can modify this Collector to not necessarily wrap another 
 Collector? We have such a Collector, which stores (in-memory) all matching doc 
 IDs + scores (if required). Those are later fed into several processes that 
 operate on them (e.g. fetch more info from the index etc.). I am thinking we 
 can make CachingCollector *optionally* wrap another Collector; then 
 someone could reuse it by setting the RAM limit to unlimited (we should have a 
 constant for that) in order to simply collect all matching docs + scores.
 * I think a set of dedicated unit tests for this class alone would be good.
 That's it so far. Perhaps, if we do all of the above, more things will pop up.




Re: Moving towards Lucene 4.0

2011-05-16 Thread Shai Erera

 I think we should also start planning for Lucene 4.0 soon.


+1 !

I think we should focus on everything that's *infrastructure* in 4.0, so
that we can develop additional features in subsequent 4.x releases. If we
end up releasing 4.0 only to discover that many things need to wait until 5.0,
it'll be a big loss.

So Codecs seem like *infra* to me; and can we make sure the necessary API is
in place for RT Search and such? I think a lot of the new API in 4.0 is
@lucene.experimental anyway?

In short, if we have enough API support in 4.0 already, we can release it
and develop features in 4.x releases. The only thing we should 'push' is
stuff that requires serious API changes (I doubt there are many like that,
maybe just Codec support for the stuff you mentioned).

Shai

On Mon, May 16, 2011 at 2:52 PM, Simon Willnauer 
simon.willna...@googlemail.com wrote:

 Hey folks,

 we just started the discussion about Lucene 3.2 and releasing more
 often. Yet, I think we should also start planning for Lucene 4.0 soon.
 We have tons of stuff in trunk that people want to have and we can't
 just keep on talking about it - we need to push this out to our users.
 From my perspective we should decide on at least the big outstanding
 issues like:

 - BulkPostings (my +1, since I want to enable positional scoring on all
 queries)
 - DocValues (pretty close)
 - FlexibleScoring (+-0; I think we should wait to see how GSoC turns out and
 decide then?)
 - Codec support for Stored Fields, Norms & TV (not sure about that, but it
 seems doable, at least as an API with the current impl as default)
 - Realtime Search aka Searchable RAM Buffer (this seems quite far off;
 much as I would love to have it, it seems we need to push it past 4.0)

 For DocValues the decision seems easy, since we are very close with
 that and I expect it to land by the end of June. I want to kick off the
 discussion here, so nothing will be set in stone really, but I think we
 should plan to release somewhere near the end of the year?!


 simon





[jira] [Resolved] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM

2011-05-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-3101.
---

   Resolution: Fixed
Fix Version/s: 4.0
Lucene Fields: [New, Patch Available]  (was: [New])

Committed revision: 1103711

Thanks Robert for help with this horrible monster!

 TestMinimize.testAgainstBrzozowski reproducible seed OOM
 

 Key: LUCENE-3101
 URL: https://issues.apache.org/jira/browse/LUCENE-3101
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Uwe Schindler
 Fix For: 4.0

 Attachments: LUCENE-3101.patch, LUCENE-3101.patch, LUCENE-3101.patch, 
 LUCENE-3101_test.patch


 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.TestMinimize
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.792 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestMinimize 
 -Dtestmethod=testAgainstBrzozowski 
 -Dtests.seed=-7429820995201119781:1013305000165135537
 [junit] NOTE: test params are: codec=PreFlex, locale=ru, 
 timezone=America/Pangnirtung
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestMinimize]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=294745976,total=310378496
 [junit] -  ---
 [junit] Testcase: 
 testAgainstBrzozowski(org.apache.lucene.util.automaton.TestMinimize): 
 Caused an ERROR
 [junit] Java heap space
 [junit] java.lang.OutOfMemoryError: Java heap space
 [junit] at java.util.BitSet.initWords(BitSet.java:144)
  [junit] at java.util.BitSet.&lt;init&gt;(BitSet.java:139)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:85)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:52)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:502)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomatonAllowMutate(RegExp.java:478)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:428)
 [junit] at 
 org.apache.lucene.util.automaton.AutomatonTestUtil.randomAutomaton(AutomatonTestUtil.java:256)
 [junit] at 
 org.apache.lucene.util.automaton.TestMinimize.testAgainstBrzozowski(TestMinimize.java:43)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Test org.apache.lucene.util.automaton.TestMinimize FAILED
 {code}




Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml

2011-05-16 Thread Simon Willnauer
Stanislaw, you are a full committer AFAIK?!

simon

On Mon, May 16, 2011 at 2:11 PM,  stanis...@apache.org wrote:
 Author: stanislaw
 Date: Mon May 16 12:11:57 2011
 New Revision: 1103709

 URL: http://svn.apache.org/viewvc?rev=1103709&view=rev
 Log:
 Adding myself (Stanislaw Osinski) to the contrib committer list.

 Modified:
    lucene/java/site/docs/whoweare.html
    lucene/java/site/docs/whoweare.pdf
    lucene/java/site/src/documentation/content/xdocs/whoweare.xml

 Modified: lucene/java/site/docs/whoweare.html
 URL: 
 http://svn.apache.org/viewvc/lucene/java/site/docs/whoweare.html?rev=1103709&r1=1103708&r2=1103709&view=diff
 ==
 --- lucene/java/site/docs/whoweare.html (original)
 +++ lucene/java/site/docs/whoweare.html Mon May 16 12:11:57 2011
 @@ -3,7 +3,7 @@
  <head>
  <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <meta content="Apache Forrest" name="Generator">
 -<meta name="Forrest-version" content="0.9">
 +<meta name="Forrest-version" content="0.8">
  <meta name="Forrest-skin-name" content="lucene">
  <title>Apache Lucene/Solr - Who We Are</title>
  <link type="text/css" href="skin/basic.css" rel="stylesheet"
 @@ -343,6 +343,9 @@ document.write("Last Published: " + docu
  <b>Patrick O'Leary</b> (pjaol@...)</li>

  <li>
 +<b>Stanislaw Osinski</b> (stanislaw@...)</li>
 +
 +<li>
  <b>Chris Male</b> (chrism@...)</li>

  <li>
 @@ -355,7 +358,7 @@ document.write("Last Published: " + docu
  </div>


 -<a name="N100B0"></a><a name="emeritus"></a>
 +<a name="N100B5"></a><a name="emeritus"></a>
  <h2 class="boxed">Emeritus Committers</h2>
  <div class="section">
  <ul>

 Modified: lucene/java/site/docs/whoweare.pdf
 URL: 
 http://svn.apache.org/viewvc/lucene/java/site/docs/whoweare.pdf?rev=1103709&r1=1103708&r2=1103709&view=diff
 ==
 Binary files - no diff available.

 Modified: lucene/java/site/src/documentation/content/xdocs/whoweare.xml
 URL: 
 http://svn.apache.org/viewvc/lucene/java/site/src/documentation/content/xdocs/whoweare.xml?rev=1103709&r1=1103708&r2=1103709&view=diff
 ==
 --- lucene/java/site/src/documentation/content/xdocs/whoweare.xml (original)
 +++ lucene/java/site/src/documentation/content/xdocs/whoweare.xml Mon May 16 
 12:11:57 2011
 @@ -38,6 +38,7 @@
  <ul>
  <li><b>Wolfgang Hoschek</b> (whoschek@...)</li>
  <li><b>Patrick O'Leary</b> (pjaol@...)</li>
 +<li><b>Stanislaw Osinski</b> (stanislaw@...)</li>
  <li><b>Chris Male</b> (chrism@...)</li>
  <li><b>Andi Vajda</b> (vajda@...)</li>
  <li><b>Karl Wettin</b> (kalle@...)</li>







Re: svn commit: r1103711 - in /lucene/dev/trunk/lucene/src: java/org/apache/lucene/util/automaton/MinimizationOperations.java test/org/apache/lucene/util/automaton/TestMinimize.java

2011-05-16 Thread Simon Willnauer
On Mon, May 16, 2011 at 2:15 PM,  uschind...@apache.org wrote:
 Author: uschindler
 Date: Mon May 16 12:15:45 2011
 New Revision: 1103711

 URL: http://svn.apache.org/viewvc?rev=1103711&view=rev
 Log:
 LUCENE-3101: Fix n^2 memory usage in minimizeSchindler() ähm 
 minimizeHopcroft()

LOL ^ ^

 Modified:
    
 lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java
    
 lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestMinimize.java

 Modified: 
 lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java
 URL: 
 http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java?rev=1103711&r1=1103710&r2=1103711&view=diff
 ==
 --- 
 lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java
  (original)
 +++ 
 lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java
  Mon May 16 12:15:45 2011
 @@ -30,6 +30,8 @@
  package org.apache.lucene.util.automaton;

  import java.util.BitSet;
 +import java.util.ArrayList;
 +import java.util.HashSet;
  import java.util.LinkedList;

  /**
 @@ -72,8 +74,12 @@ final public class MinimizationOperation
     final int[] sigma = a.getStartPoints();
     final State[] states = a.getNumberedStates();
     final int sigmaLen = sigma.length, statesLen = states.length;
 -    final BitSet[][] reverse = new BitSet[statesLen][sigmaLen];
 -    final BitSet[] splitblock = new BitSet[statesLen], partition = new BitSet[statesLen];
 +    @SuppressWarnings("unchecked") final ArrayList<State>[][] reverse =
 +      (ArrayList<State>[][]) new ArrayList[statesLen][sigmaLen];
 +    @SuppressWarnings("unchecked") final HashSet<State>[] partition =
 +      (HashSet<State>[]) new HashSet[statesLen];
 +    @SuppressWarnings("unchecked") final ArrayList<State>[] splitblock =
 +      (ArrayList<State>[]) new ArrayList[statesLen];
     final int[] block = new int[statesLen];
     final StateList[][] active = new StateList[statesLen][sigmaLen];
     final StateListNode[][] active2 = new StateListNode[statesLen][sigmaLen];
 @@ -82,8 +88,8 @@ final public class MinimizationOperation
     final BitSet split = new BitSet(statesLen),
       refine = new BitSet(statesLen), refine2 = new BitSet(statesLen);
     for (int q = 0; q < statesLen; q++) {
 -      splitblock[q] = new BitSet(statesLen);
 -      partition[q] = new BitSet(statesLen);
 +      splitblock[q] = new ArrayList<State>();
 +      partition[q] = new HashSet<State>();
       for (int x = 0; x < sigmaLen; x++) {
         active[q][x] = new StateList();
       }
 @@ -92,23 +98,22 @@ final public class MinimizationOperation
     for (int q = 0; q < statesLen; q++) {
       final State qq = states[q];
       final int j = qq.accept ? 0 : 1;
 -      partition[j].set(q);
 +      partition[j].add(qq);
       block[q] = j;
       for (int x = 0; x < sigmaLen; x++) {
 -        final BitSet[] r =
 +        final ArrayList<State>[] r =
           reverse[qq.step(sigma[x]).number];
         if (r[x] == null)
 -          r[x] = new BitSet();
 -        r[x].set(q);
 +          r[x] = new ArrayList<State>();
 +        r[x].add(qq);
       }
     }
     // initialize active sets
     for (int j = 0; j <= 1; j++) {
 -      final BitSet part = partition[j];
       for (int x = 0; x < sigmaLen; x++) {
 -        for (int i = part.nextSetBit(0); i >= 0; i = part.nextSetBit(i+1)) {
 -          if (reverse[i][x] != null)
 -            active2[i][x] = active[j][x].add(states[i]);
 +        for (final State qq : partition[j]) {
 +          if (reverse[qq.number][x] != null)
 +            active2[qq.number][x] = active[j][x].add(qq);
         }
       }
     }
 @@ -121,18 +126,19 @@ final public class MinimizationOperation
     // process pending until fixed point
     int k = 2;
     while (!pending.isEmpty()) {
 -      IntPair ip = pending.removeFirst();
 +      final IntPair ip = pending.removeFirst();
       final int p = ip.n1;
       final int x = ip.n2;
       pending2.clear(x*statesLen + p);
       // find states that need to be split off their blocks
       for (StateListNode m = active[p][x].first; m != null; m = m.next) {
 -        final BitSet r = reverse[m.q.number][x];
 -        if (r != null) for (int i = r.nextSetBit(0); i >= 0; i = r.nextSetBit(i+1)) {
 +        final ArrayList<State> r = reverse[m.q.number][x];
 +        if (r != null) for (final State s : r) {
 +          final int i = s.number;
           if (!split.get(i)) {
             split.set(i);
             final int j = block[i];
 -            splitblock[j].set(i);
 +            splitblock[j].add(s);
             if (!refine2.get(j)) {
               refine2.set(j);
               refine.set(j);
 @@ -142,18 +148,19 @@ final public class MinimizationOperation
       }
       // refine blocks
       for (int j = refine.nextSetBit(0); j >=
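The patch above swaps dense BitSet cross-products (every reverse[q][x] cell costs statesLen bits whether or not any transition exists, hence the n^2 memory) for collections that grow with the transitions actually present. A standalone toy sketch of that storage change, using a hypothetical 5-state transition table instead of Lucene's automaton classes:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class ReverseTable {
    public static void main(String[] args) {
        // Hypothetical toy DFA: next[q][x] = target of state q on symbol x.
        int states = 5, sigma = 2;
        int[][] next = {{1, 2}, {3, 4}, {3, 3}, {0, 1}, {2, 2}};

        // Dense form (old style): a statesLen-bit BitSet per populated cell.
        BitSet[][] denseRev = new BitSet[states][sigma];
        // Sparse form (new style): a list holding only actual predecessors.
        @SuppressWarnings("unchecked")
        List<Integer>[][] sparseRev = new List[states][sigma];

        for (int q = 0; q < states; q++) {
            for (int x = 0; x < sigma; x++) {
                int t = next[q][x];
                if (denseRev[t][x] == null) denseRev[t][x] = new BitSet(states);
                denseRev[t][x].set(q);
                if (sparseRev[t][x] == null) sparseRev[t][x] = new ArrayList<>();
                sparseRev[t][x].add(q);
            }
        }

        // Both encode the same predecessor sets for every (state, symbol).
        for (int t = 0; t < states; t++) {
            for (int x = 0; x < sigma; x++) {
                int dense = denseRev[t][x] == null ? 0 : denseRev[t][x].cardinality();
                int sparse = sparseRev[t][x] == null ? 0 : sparseRev[t][x].size();
                if (dense != sparse) throw new AssertionError("tables differ");
            }
        }
        System.out.println("tables agree");
    }
}
```

Both tables answer "which states reach t on symbol x"; the sparse one simply never pays statesLen bits for cells with few or no predecessors, which is what removes the quadratic up-front allocation.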

[jira] [Resolved] (LUCENE-3070) Enable DocValues by default for every Codec

2011-05-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-3070.
-

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [New])

 Enable DocValues by default for every Codec
 ---

 Key: LUCENE-3070
 URL: https://issues.apache.org/jira/browse/LUCENE-3070
 Project: Lucene - Java
  Issue Type: Task
  Components: Index
Affects Versions: CSF branch
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: CSF branch

 Attachments: LUCENE-3070.patch, LUCENE-3070.patch, LUCENE-3070.patch, 
 LUCENE-3070.patch


 Currently DocValues are enable with a wrapper Codec so each codec which needs 
 DocValues must be wrapped by DocValuesCodec. The DocValues writer and reader 
 should be moved to Codec to be enabled by default.




Re: Moving towards Lucene 4.0

2011-05-16 Thread Michael McCandless
+1

Mike

http://blog.mikemccandless.com

On Mon, May 16, 2011 at 7:52 AM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 Hey folks,

 we just started the discussion about Lucene 3.2 and releasing more
 often. Yet, I think we should also start planning for Lucene 4.0 soon.
 We have tons of stuff in trunk that people want to have and we can't
 just keep on talking about it - we need to push this out to our users.
 From my perspective we should decide on at least the big outstanding
 issues like:

 - BulkPostings (my +1 since I want to enable positional scoring on all 
 queries)
 - DocValues (pretty close)
 - FlexibleScoring (+- 0 I think we should wait how gsoc turns out and
 decide then?)
 - Codec Support for Stored Fields, Norms  TV (not sure about that but
 seems doable at least an API and current impl as default)
 - Realtime Search aka Searchable Ram Buffer (this seems quite far off;
 while I would love to have it, it seems we need to push this to 4.0)

 For DocValues the decision seems easy since we are very close with
 that, and I expect it to land by the end of June. I want to kick off
 the discussion here so nothing will be set in stone, but I think we
 should plan to release somewhere near the end of the year?!


 simon




[jira] [Reopened] (LUCENE-1149) add XA transaction support

2011-05-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-1149:




Sorry, you're right this issue isn't really a dup (I've reopened it).

I was just saying that Lucene's IW APIs are already transactional so
one should be able to build a transactions layer on top.  Ie, you
should not have to make a new index for each transaction.

We would still need a layer that mates this up to the XA transactions
API (I think?).  Does anyone have a patch for this?


 add XA transaction support
 --

 Key: LUCENE-1149
 URL: https://issues.apache.org/jira/browse/LUCENE-1149
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: robert engels

 Need to add XA transaction support to Lucene.
 Without XA support, it is difficult to keep disparate resources (e.g. 
 database) in sync with the Lucene index.
 A review of the XA support added to Hibernate might be a good start (although 
 Hibernate almost always uses a XA capable backing store database).
 It would be ideal to have a combined IndexReaderWriter instance, then create 
 a XAIndexReaderWriter which wraps it.
 The implementation might be as simple as a XA log file which lists the XA 
 transaction id, and the segments XXX number(s), since Lucene already allows 
 you to rollback to a previous version (??? for sure, or does it only allow 
 you to abort the current commit).
 If operating under a XA transaction, the no explicit commits or rollbacks 
 should be allowed on the instance.
 The index would be committed during XA prepare(), and then if needed 
 rolledback when requested. The XA commit() would be a no-op.
 There is a lot more to this but this should get the ball rolling.
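Lucene's IndexWriter does expose a two-phase commit (prepareCommit(), commit(), rollback()), which is what "already transactional" refers to above. A minimal sketch of how an XA layer might sequence those calls; TwoPhaseIndex here is a hypothetical stand-in for IndexWriter so the sketch stays self-contained, and this is far from a complete XAResource implementation:

```java
// Hypothetical stand-in for IndexWriter's two-phase-commit surface.
interface TwoPhaseIndex {
    void prepareCommit(); // flush and fsync, but keep the commit invisible
    void commit();        // make the prepared commit visible
    void rollback();      // discard changes since the last commit
}

// Sketch of the mapping suggested above: XA prepare() does the real work
// via prepareCommit(), so the XA commit() phase is a near no-op.
final class XAIndexAdapter {
    private final TwoPhaseIndex index;
    private boolean prepared = false;

    XAIndexAdapter(TwoPhaseIndex index) { this.index = index; }

    void xaPrepare() { index.prepareCommit(); prepared = true; }
    void xaCommit() {
        if (!prepared) throw new IllegalStateException("commit before prepare");
        index.commit();
        prepared = false;
    }
    void xaRollback() { index.rollback(); prepared = false; }
}

public class XASketch {
    public static void main(String[] args) {
        final StringBuilder log = new StringBuilder();
        XAIndexAdapter xa = new XAIndexAdapter(new TwoPhaseIndex() {
            public void prepareCommit() { log.append("prepare "); }
            public void commit()        { log.append("commit "); }
            public void rollback()      { log.append("rollback "); }
        });
        xa.xaPrepare(); // XA prepare phase
        xa.xaCommit();  // XA commit phase
        System.out.println("sequence: " + log.toString().trim());
    }
}
```

A real adapter would also have to implement the rest of the XAResource protocol (start/end, recover, forget), which is where most of the remaining work discussed in the issue lives.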




Re: Moving towards Lucene 4.0

2011-05-16 Thread Robert Muir
On Mon, May 16, 2011 at 7:52 AM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 Hey folks,

 we just started the discussion about Lucene 3.2 and releasing more
 often. Yet, I think we should also start planning for Lucene 4.0 soon.
 We have tons of stuff in trunk that people want to have and we can't
 just keep on talking about it - we need to push this out to our users.
 From my perspective we should decide on at least the big outstanding
 issues like:

 - BulkPostings (my +1 since I want to enable positional scoring on all 
 queries)

In my own opinion, this is probably the most important one to decide
how to handle. I think it might not be good if we introduce a new major
version branch (4.x) with flexible indexing if the postings APIs limit
us from actually taking advantage of it.
I think we should consider (Shai brought up a previous thread about
this) that when 4.x is released, 3.x goes into bugfix mode and we open
up 5.x. So we want to make sure we actually have things stable enough
(from an API and flexibility perspective) that we will be able to get
some life out of the 4.x series and add new features to it.

I think there is a lot left to do with bulkpostings and it's going to
require a lot of work, but at the same time I really don't like that
we have serious improvements/features in trunk (some have been there
now for years) still unreleased and not yet available to users.

Some other crazy ideas (just for discussion):
* we could try to be more aggressive about backporting and getting
more life out of 3.x, and getting some of these features to users.
For example, perhaps things like DWPT, DocValues, more efficient terms
index, automaton, etc could be backported safely. the advantage here
is that we get the features to the users, but the disadvantage is it
would be a lot of effort backporting.
* we could decide that we do actually have enough flexibility now in
4.x to get several releases out of it (e.g. containing features like
docvalues, realtime search, etc), even though we know its limited to
some extent, and defer api-breakers like bulkpostings/flexscoring to
5.x. the advantage here is that we could start looking at 4.x
releasing very soon, but there are some disadvantages, like forcing
people to change a lot of their code to upgrade for less
gain, and potentially limiting ourselves in the 4.x branch by its
APIs.
* we could do nothing at all, and keep going like we are going now,
deciding that we are actually getting enough useful features into 3.x
releases that it's OK for us to block 4.0 on some of these tougher
issues like bulkpostings. The disadvantage is of course even longer
wait time for the features that have been sitting in trunk a while,
but it keeps 3.x stable and is less work for us.




RE: Moving towards Lucene 4.0

2011-05-16 Thread Uwe Schindler
Sorry to be negative,

 - BulkPostings (my +1 since I want to enable positional scoring on all 
 queries)

My problem is the really crappy and unusable API of BulkPostings (wait for my 
talk at Lucene Rev...). For anybody else than Mike, Yonik and yourself that’s 
unusable. I tried to understand even the simple MultiTermQueryWrapperFilter - 
easy on trunk, horrible on branch - sorry that’s a no-go.

It's code duplication everywhere, and unreadable.

Uwe





Re: Moving towards Lucene 4.0

2011-05-16 Thread Robert Muir
On Mon, May 16, 2011 at 8:48 AM, Uwe Schindler u...@thetaphi.de wrote:
 Sorry to be negative,

 - BulkPostings (my +1 since I want to enable positional scoring on all 
 queries)

 My problem is the really crappy and unusable API of BulkPostings (wait for my 
 talk at Lucene Rev...). For anybody else than Mike, Yonik and yourself that’s 
 unusable. I tried to understand even the simple MultiTermQueryWrapperFilter - 
 easy on trunk, horrible on branch - sorry that’s a no-go.

 Its code duplication everywhere and unreadable.


I don't think you should apologize for being negative; it's true there
is a ton of work to do here before that branch is ready. That's why
in my email I tried to brainstorm some alternative ways we could get
some of these features into the hands of users without being held up
by this work.




[jira] [Reopened] (SOLR-2383) Velocity: Generalize range and date facet display

2011-05-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl reopened SOLR-2383:
---


Reopening to add patch for branch 3.2

 Velocity: Generalize range and date facet display
 -

 Key: SOLR-2383
 URL: https://issues.apache.org/jira/browse/SOLR-2383
 Project: Solr
  Issue Type: Bug
  Components: Response Writers
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
  Labels: facet, range, velocity
 Fix For: 4.0

 Attachments: SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch, 
 SOLR-2383.patch, SOLR-2383.patch


 Velocity (/browse) GUI has hardcoded price range facet and a hardcoded 
 manufacturedate_dt date facet. Need general solution which work for any 
 facet.range and facet.date.




[jira] [Updated] (SOLR-2383) Velocity: Generalize range and date facet display

2011-05-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-2383:
--

Fix Version/s: 3.2

 Velocity: Generalize range and date facet display
 -

 Key: SOLR-2383
 URL: https://issues.apache.org/jira/browse/SOLR-2383
 Project: Solr
  Issue Type: Bug
  Components: Response Writers
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
  Labels: facet, range, velocity
 Fix For: 3.2, 4.0

 Attachments: SOLR-2383-branch_32.patch, SOLR-2383.patch, 
 SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch


 Velocity (/browse) GUI has hardcoded price range facet and a hardcoded 
 manufacturedate_dt date facet. Need general solution which work for any 
 facet.range and facet.date.




[jira] [Updated] (SOLR-2383) Velocity: Generalize range and date facet display

2011-05-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl updated SOLR-2383:
--

Attachment: SOLR-2383-branch_32.patch

This Velocity enhancement should make it to 3.2.

In this patch I back-port what was committed for 4.0, with these exceptions:
* No pivot facets
* End range uses ] instead of }

 Velocity: Generalize range and date facet display
 -

 Key: SOLR-2383
 URL: https://issues.apache.org/jira/browse/SOLR-2383
 Project: Solr
  Issue Type: Bug
  Components: Response Writers
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
  Labels: facet, range, velocity
 Fix For: 3.2, 4.0

 Attachments: SOLR-2383-branch_32.patch, SOLR-2383.patch, 
 SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch, SOLR-2383.patch


 Velocity (/browse) GUI has hardcoded price range facet and a hardcoded 
 manufacturedate_dt date facet. Need general solution which work for any 
 facet.range and facet.date.




[jira] [Commented] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034004#comment-13034004
 ] 

Robert Muir commented on LUCENE-3101:
-

Thanks for reporting this selckin, this is a great find, definitely amazed we 
randomly generated this one :)

 TestMinimize.testAgainstBrzozowski reproducible seed OOM
 

 Key: LUCENE-3101
 URL: https://issues.apache.org/jira/browse/LUCENE-3101
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Uwe Schindler
 Fix For: 4.0

 Attachments: LUCENE-3101.patch, LUCENE-3101.patch, LUCENE-3101.patch, 
 LUCENE-3101_test.patch


 {code}
 [junit] Testsuite: org.apache.lucene.util.automaton.TestMinimize
 [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 3.792 sec
 [junit] 
 [junit] - Standard Error -
 [junit] NOTE: reproduce with: ant test -Dtestcase=TestMinimize 
 -Dtestmethod=testAgainstBrzozowski 
 -Dtests.seed=-7429820995201119781:1013305000165135537
 [junit] NOTE: test params are: codec=PreFlex, locale=ru, 
 timezone=America/Pangnirtung
 [junit] NOTE: all tests run in this JVM:
 [junit] [TestMinimize]
 [junit] NOTE: Linux 2.6.37-gentoo amd64/Sun Microsystems Inc. 1.6.0_25 
 (64-bit)/cpus=8,threads=1,free=294745976,total=310378496
 [junit] -  ---
 [junit] Testcase: 
 testAgainstBrzozowski(org.apache.lucene.util.automaton.TestMinimize): 
 Caused an ERROR
 [junit] Java heap space
 [junit] java.lang.OutOfMemoryError: Java heap space
 [junit] at java.util.BitSet.initWords(BitSet.java:144)
 [junit] at java.util.BitSet.&lt;init&gt;(BitSet.java:139)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:85)
 [junit] at 
 org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:52)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:502)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomatonAllowMutate(RegExp.java:478)
 [junit] at 
 org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:428)
 [junit] at 
 org.apache.lucene.util.automaton.AutomatonTestUtil.randomAutomaton(AutomatonTestUtil.java:256)
 [junit] at 
 org.apache.lucene.util.automaton.TestMinimize.testAgainstBrzozowski(TestMinimize.java:43)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
 [junit] at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
 [junit] 
 [junit] 
 [junit] Test org.apache.lucene.util.automaton.TestMinimize FAILED
 {code}




[jira] [Commented] (LUCENE-3101) TestMinimize.testAgainstBrzozowski reproducible seed OOM

2011-05-16 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034006#comment-13034006
 ] 

Dawid Weiss commented on LUCENE-3101:
-

There is a lot of power in randomness, huh? :) I really like these randomized 
tests... this should be a built-in functionality in JUnit (call it 'repeatable 
randomness')...
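The "repeatable randomness" idea boils down to deriving every random choice in a test from a single seed and printing that seed, so a failing run can be replayed exactly. A toy sketch (not LuceneTestCase itself):

```java
import java.util.Random;

public class RepeatableRandomness {
    // All "random" test input flows from the seed, never from ambient state.
    static int randomInput(long seed) {
        return new Random(seed).nextInt(1000);
    }

    public static void main(String[] args) {
        // Honor -Dtests.seed=... if given, otherwise pick a fresh seed.
        long seed = Long.getLong("tests.seed", new Random().nextLong());
        System.out.println("NOTE: reproduce with -Dtests.seed=" + seed);

        // Same seed => same inputs, so any failure is reproducible.
        if (randomInput(seed) != randomInput(seed)) {
            throw new AssertionError("not deterministic");
        }
        System.out.println("deterministic for seed " + seed);
    }
}
```

This is essentially what the "reproduce with: ant test ... -Dtests.seed=..." lines in the failure reports above rely on.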

 TestMinimize.testAgainstBrzozowski reproducible seed OOM
 

 Key: LUCENE-3101
 URL: https://issues.apache.org/jira/browse/LUCENE-3101
 Project: Lucene - Java
  Issue Type: Bug
Reporter: selckin
Assignee: Uwe Schindler
 Fix For: 4.0

 Attachments: LUCENE-3101.patch, LUCENE-3101.patch, LUCENE-3101.patch, 
 LUCENE-3101_test.patch






Re: Moving towards Lucene 4.0

2011-05-16 Thread Simon Willnauer
On Mon, May 16, 2011 at 2:57 PM, Robert Muir rcm...@gmail.com wrote:
 On Mon, May 16, 2011 at 8:48 AM, Uwe Schindler u...@thetaphi.de wrote:
 Sorry to be negative,

 - BulkPostings (my +1 since I want to enable positional scoring on all 
 queries)

 My problem is the really crappy and unusable API of BulkPostings (wait for 
 my talk at Lucene Rev...). For anybody else than Mike, Yonik and yourself 
 that’s unusable. I tried to understand even the simple 
 MultiTermQueryWrapperFilter - easy on trunk, horrible on branch - sorry 
 that’s a no-go.

 Its code duplication everywhere and unreadable.


 I don't think you should apologize for being negative, its true there
 is a ton of work to do here before that branch is ready. Thats why
 in my email I tried to brainstorm some alternative ways we could get
 some of these features into the hands of users without being held up
 by this work.


I have to admit that branch is very rough and the API is super hard to
use - for now!
Let's not get dragged into a discussion of how this API should look;
there will be time for that. I agree with Robert that there is a large
amount of work left on that branch, though, so maybe we should move the
positional scoring (LUCENE-2878) over to trunk as another option.

I think we should not wait much longer with Lucene 4.0, so I lean
towards Robert's option 2, even if we need to pay the price for a major
change in 5.0. I am not sure we really need to change much API for
Realtime Search, since this should be hidden in IW, IndexingChain and
IW#getReader() - I kind of like the idea of being close to 4.0 :)

simon




Re: Moving towards Lucene 4.0

2011-05-16 Thread Robert Muir
On Mon, May 16, 2011 at 9:12 AM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 I have to admit that branch is very rough and the API is super hard to
 use. For now!
 Lets not be dragged away into discussion how this API should look like
 there will be time
 for that.

+1, this is what I really meant by "decide how to handle". I don't
think we will be able to quickly decide how to fix the branch
itself; I think it's really complicated. But we can admit it's really
complicated and won't be solved very soon, and try to figure out a
release strategy with this in mind.

(p.s. sorry simon, you got two copies of this message i accidentally
hit reply instead of reply-all)




RE: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml

2011-05-16 Thread Steven A Rowe
Hi Stanisław,

You don’t need to be logged into people.apache.org to update the website.

Have you seen these instructions?  The “unversioned website” section is what 
you want, I think:

http://wiki.apache.org/lucene-java/HowToUpdateTheWebsite

Steve

From: stac...@gmail.com [mailto:stac...@gmail.com] On Behalf Of Stanislaw 
Osinski
Sent: Monday, May 16, 2011 8:56 AM
To: dev@lucene.apache.org; simon.willna...@gmail.com
Cc: java-...@lucene.apache.org; java-comm...@lucene.apache.org
Subject: Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html 
docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml

stanislav you are a full committer afaik?!

I've been working mostly on the clustering plugin for now, so I'm not sure if 
it's right to move me to the core section right away :-)

Incidentally, I tried to svn up on /www/lucene.apache.org/java/docs at
people.apache.org to push the modifications live, but there is an SVN lock
on that directory. Am I missing anything? I'm assuming that's the right
directory for the commiters list?

S.




Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml

2011-05-16 Thread Stanislaw Osinski
Hi Steve,

That explains everything, thanks! I somehow failed to locate that wiki page
and was looking at http://wiki.apache.org/solr/Website_Update_HOWTO instead.

S.

On Mon, May 16, 2011 at 15:25, Steven A Rowe sar...@syr.edu wrote:

 Hi Stanisław,



 You don’t need to be logged into people.apache.org to update the website.



 Have you seen these instructions?  The “unversioned website” section is
 what you want, I think:



 http://wiki.apache.org/lucene-java/HowToUpdateTheWebsite



 Steve



 *From:* stac...@gmail.com [mailto:stac...@gmail.com] *On Behalf Of *Stanislaw
 Osinski
 *Sent:* Monday, May 16, 2011 8:56 AM

 *To:* dev@lucene.apache.org; simon.willna...@gmail.com
 *Cc:* java-...@lucene.apache.org; java-comm...@lucene.apache.org
 *Subject:* Re: svn commit: r1103709 - in /lucene/java/site:
 docs/whoweare.html docs/whoweare.pdf
 src/documentation/content/xdocs/whoweare.xml



 stanislav you are a full committer afaik?!



 I've been working mostly on the clustering plugin for now, so I'm not sure
 if it's right to move me to the core section right away :-)



 Incidentally, I tried to svn up on /www/lucene.apache.org/java/docs at
 people.apache.org to push the modifications live, but there is an SVN lock
 on that directory. Am I missing anything? I'm assuming that's the right
 directory for the commiters list?



 S.







Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml

2011-05-16 Thread Mark Miller

On May 16, 2011, at 8:55 AM, Stanislaw Osinski wrote:

 stanislav you are a full committer afaik?!
 
 I've been working mostly on the clustering plugin for now, so I'm not sure if 
 it's right to move me to the core section right away :-)
 
 Incidentally, I tried to svn up on /www/lucene.apache.org/java/docs at 
 people.apache.org to push the modifications live, but there is an SVN lock on 
 that directory. Am I missing anything? I'm assuming that's the right 
 directory for the commiters list?
 
 S.
 
 

Stanislav - we certainly nominated you in the spirit of maintaining the carrot2 
contrib, but you are still a full committer. We have decided to stop adding new 
Contrib committers. A full committer may be someone that only works on part of 
the project. IMO, a full committer might be someone that only has commit bits 
so that he can update the website! We trust full committers to only mess with 
what they are comfortable with. So we trust that you will stick to Carrot2 or 
other areas you are strong in, and that if you want to move into other code, 
you will do so intelligently. Essentially, by making you a Committer, we are 
mostly just saying - we trust you.

But you are a full committer and not a contrib committer. We no longer mint new 
contrib committers.

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org






-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034018#comment-13034018
 ] 

Robert Muir commented on SOLR-1942:
---

any update on this? Would be nice to be able to hook in codecproviders and 
codecs this way.

 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Assignee: Grant Ingersoll
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034025#comment-13034025
 ] 

Martijn van Groningen commented on LUCENE-3098:
---

Hmmm... So you get a list of all grouped values. That can be useful. Only 
remember that it doesn't tell you anything about the group head (the most 
relevant document of a group), since we don't sort inside the groups.

 Grouped total count
 ---

 Key: LUCENE-3098
 URL: https://issues.apache.org/jira/browse/LUCENE-3098
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
 LUCENE-3098.patch, LUCENE-3098.patch


 When grouping currently you can get two counts:
 * Total hit count. Which counts all documents that matched the query.
 * Total grouped hit count. Which counts all documents that have been grouped 
 in the top N groups.
 Since the end user gets groups in their search results instead of plain 
 documents when grouping, the total number of groups often makes more sense 
 as the total count.




Re: svn commit: r1103709 - in /lucene/java/site: docs/whoweare.html docs/whoweare.pdf src/documentation/content/xdocs/whoweare.xml

2011-05-16 Thread Stanislaw Osinski
Hi Mark,

Thanks for clarifying the difference between contrib and full committers, I
was probably too shy to subscribe myself to the latter group right away :-)
For the time being, I'll most likely stick with maintaining the clustering
bit and will consult you guys if I have something to contribute in the other
areas of the code.

S.

On Mon, May 16, 2011 at 15:41, Mark Miller markrmil...@gmail.com wrote:


 Stanislav - we certainly nominated you in the spirit of maintaining the
 carrot2 contrib, but you are still a full committer. We have decided to stop
 adding new Contrib committers. A full committer may be someone that only
 works on part of the project. IMO, a full committer might be someone that
 only has commit bits so that he can update the website! We trust full
 committers to only mess with what they are comfortable with. So we trust
 that you will stick to Carrot2 or other areas you are strong in, and that if
 you want to move into other code, you will do so intelligently. Essentially,
 by making you a Committer, we are mostly just saying - we trust you.

 But you are a full committer and not a contrib committer. We no longer mint
 new contrib committers.

 - Mark Miller
 lucidimagination.com

 Lucene/Solr User Conference
 May 25-26, San Francisco
 www.lucenerevolution.org






 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034040#comment-13034040
 ] 

Michael McCandless commented on LUCENE-3098:


Right, we'd make it clear the collection is unordered.

It just seems like, since we are building up this collection anyway, we may as 
well give access to the consumer?

 Grouped total count
 ---

 Key: LUCENE-3098
 URL: https://issues.apache.org/jira/browse/LUCENE-3098
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
 LUCENE-3098.patch, LUCENE-3098.patch


 When grouping currently you can get two counts:
 * Total hit count. Which counts all documents that matched the query.
 * Total grouped hit count. Which counts all documents that have been grouped 
 in the top N groups.
 Since the end user gets groups in their search results instead of plain 
 documents when grouping, the total number of groups often makes more sense 
 as the total count.




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034050#comment-13034050
 ] 

Martijn van Groningen commented on LUCENE-3098:
---

That is true. It is just a simple unordered collection of all values of the 
group field that match the query. I'll include this as well.

 Grouped total count
 ---

 Key: LUCENE-3098
 URL: https://issues.apache.org/jira/browse/LUCENE-3098
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
 LUCENE-3098.patch, LUCENE-3098.patch


 When grouping currently you can get two counts:
 * Total hit count. Which counts all documents that matched the query.
 * Total grouped hit count. Which counts all documents that have been grouped 
 in the top N groups.
 Since the end user gets groups in their search results instead of plain 
 documents when grouping, the total number of groups often makes more sense 
 as the total count.




[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-16 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034051#comment-13034051
 ] 

Grant Ingersoll commented on SOLR-1942:
---

I thought I would have time last week, but that turned out not to be the case.  
If you have time, Robert, feel free; otherwise I might be able to get to it 
later in the week (pending conf. prep).  From the sounds of it, it likely just 
needs to be updated to trunk and then it should be ready to go (we should also 
document it on the wiki).

 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Assignee: Grant Ingersoll
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.




[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush

2011-05-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034053#comment-13034053
 ] 

Simon Willnauer commented on LUCENE-3090:
-

I did 150 runs for all Lucene Tests incl. contrib - no failure so far. Seems to 
be good to go.

 DWFlushControl does not take active DWPT out of the loop on fullFlush
 -

 Key: LUCENE-3090
 URL: https://issues.apache.org/jira/browse/LUCENE-3090
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Critical
 Fix For: 4.0

 Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch


 We have seen several OOMs on TestNRTThreads, and all of them are caused by 
 DWFlushControl missing DWPTs that are set as flushPending but can't flush due 
 to a full flush going on. Yet that means that those DWPTs are filling up in 
 the background while they should actually be checked out and blocked until 
 the full flush finishes. Furthermore, we currently stall on 
 maxNumThreadStates while we should stall on the number of active thread 
 states. I will attach a patch tomorrow.




[jira] [Commented] (SOLR-1942) Ability to select codec per field

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034059#comment-13034059
 ] 

Robert Muir commented on SOLR-1942:
---

ok thanks Grant. I'll take a look thru the patch some today and post back what 
I think.

 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Assignee: Grant Ingersoll
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.




Re: Moving towards Lucene 4.0

2011-05-16 Thread Shai Erera
We seem to mark every new API as @lucene.experimental these days anyway, so
we shouldn't have too much of a problem when 4.0 is out :).

Experimental API is subject to change at any time. We can consider that as
an option as well (maybe it adds another option to Robert's?).

Though personally, I'm not a big fan of this notion - I think we deceive
ourselves and users when we have @experimental on a stable branch. Any
@experimental API on trunk today falls into this bucket after 4.0 is out.
And I'm sure there are a couple in 3.x already.

Don't get me wrong - I don't suggest we should stop using it. But I think we
should consider reviewing the @experimental API before every stable
release, and reduce it over time, not increase it.

Shai

On Mon, May 16, 2011 at 4:20 PM, Robert Muir rcm...@gmail.com wrote:

 On Mon, May 16, 2011 at 9:12 AM, Simon Willnauer
 simon.willna...@googlemail.com wrote:
  I have to admit that branch is very rough and the API is super hard to
  use. For now!
  Let's not be dragged into a discussion of how this API should look;
  there will be time for that.

 +1, this is what i really meant by decide how to handle. I don't
 think we will be able to quickly decide how to fix the branch
 itself, i think its really complicated. But we can admit its really
 complicated and won't be solved very soon, and try to figure out a
 release strategy with this in mind.

 (p.s. sorry simon, you got two copies of this message i accidentally
 hit reply instead of reply-all)

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Field should accept BytesRef?

2011-05-16 Thread Jason Rutherglen
 But when you create an untokenized field (or even a binary field, which is 
 stored-only at the moment), you could theoretically index the bytes directly

Right, if I already have a BytesRef of what needs to be indexed, then
passing the BR into Field/able should reduce garbage collection of
strings?

On Sun, May 15, 2011 at 9:59 AM, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 I think Jason meant the field value,  not the field name.

 Field names should stay Strings; as they are only identifiers, making them 
 BytesRefs is not really useful.

 But when you create an untokenized field (or even a binary field, which is 
 stored-only at the moment), you could theoretically index the bytes directly.

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Sunday, May 15, 2011 6:22 PM
 To: dev@lucene.apache.org
 Subject: Re: Field should accept BytesRef?

 On Sun, May 15, 2011 at 12:05 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
  In the Field object a text value must be of type String; however, I
  think we can allow a BytesRef to be passed in?
 

 it would be nice if we sorted them in byte order too? I think right now 
 fields
 are sorted in utf-16 order, but terms are sorted in utf-8 order? (if so, 
 this is
 confusing)
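The UTF-16 vs. UTF-8 order discrepancy mentioned above only surfaces for supplementary characters, but it is real. A minimal stand-alone demonstration in plain Java (no Lucene involved; the specific code points are just a convenient example pair):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SortOrderDemo {
    public static void main(String[] args) {
        String bmp = "\uFF61";                                // U+FF61, a BMP character
        String supp = new String(Character.toChars(0x10000)); // U+10000, outside the BMP

        // UTF-16 code unit order: the lead surrogate 0xD800 sorts before 0xFF61,
        // so the supplementary character comes FIRST.
        boolean suppFirstUtf16 = bmp.compareTo(supp) > 0;

        // UTF-8 byte order (== code point order): U+FF61 encodes as EF BD A1,
        // U+10000 as F0 90 80 80, so the BMP character comes FIRST.
        byte[] a = bmp.getBytes(StandardCharsets.UTF_8);
        byte[] b = supp.getBytes(StandardCharsets.UTF_8);
        boolean bmpFirstUtf8 = Arrays.compareUnsigned(a, b) < 0;

        System.out.println(suppFirstUtf16 && bmpFirstUtf8); // true: the two orders disagree
    }
}
```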

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
 commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Attachment: LUCENE-3102.patch

bq. Only thing is: I would be careful about directly setting those private 
fields of the cachedScorer; I think (not sure) this incurs an access check on 
each assignment. Maybe make them package protected? Or use a setter?

Good catch Mike. I read about it some and found a nice web page which 
explains the implications (http://www.glenmccl.com/jperf/). Indeed, if the 
member is private (whether it's in the inner or outer class), there is an 
access check. So the right thing to do is to declare it protected / 
package-private, which I did. Thanks for the opportunity to get some education!

Patch fixes this. I intend to commit this shortly + move the class to core + 
apply to trunk. Then, I'll continue w/ the rest of the improvements.

 Few issues with CachingCollector
 

 Key: LUCENE-3102
 URL: https://issues.apache.org/jira/browse/LUCENE-3102
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3102.patch, LUCENE-3102.patch


 CachingCollector (introduced in LUCENE-1421) has a few issues:
 # Since the wrapped Collector may support out-of-order collection, the 
 document IDs cached may be out-of-order (depends on the Query) and thus 
 replay(Collector) will forward document IDs out-of-order to a Collector that 
 may not support it.
 # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
 # I think that instead of comparing curScores to null, in order to determine 
 if scores are requested, we should have a specific boolean - for clarity
 # This check if (base + nextLength > maxDocsToCache) (line 168) can be 
 relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
 maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
 to try and cache them?
 Also:
 * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
 need that if CachingCollector ctor already takes a boolean cacheScores? I 
 think it's better defined explicitly than implicitly?
 * Let's introduce a factory method for creating a specialized version if 
 scoring is requested / not (i.e., impl the TODO in line 189)
 * I think it's a useful collector, which stands on its own and not specific 
 to grouping. Can we move it to core?
 * How about using OpenBitSet instead of int[] for doc IDs?
 ** If the number of hits is big, we'd gain some RAM back, and be able to 
 cache more entries
 ** NOTE: OpenBitSet can only be used for in-order collection only. So we can 
 use that if the wrapped Collector does not support out-of-order
 * Do you think we can modify this Collector to not necessarily wrap another 
 Collector? We have such Collector which stores (in-memory) all matching doc 
 IDs + scores (if required). Those are later fed into several processes that 
 operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
 can make CachingCollector *optionally* wrap another Collector and then 
 someone can reuse it by setting RAM limit to unlimited (we should have a 
 constant for that) in order to simply collect all matching docs + scores.
 * I think a set of dedicated unit tests for this class alone would be good.
 That's it so far. Perhaps, if we do all of the above, more things will pop up.
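The core of the collector described above can be sketched in a few lines of plain Java (names invented; the real class also tracks scores, per-segment starts, and a RAM limit): cache doc IDs as they are collected, then replay them in the same order to another consumer.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;

public class MiniCachingCollector {
    private int[] cachedDocs = new int[4]; // grows as needed
    private int size;

    // Called once per hit, in whatever order the wrapped collector sees them.
    public void collect(int doc) {
        if (size == cachedDocs.length)
            cachedDocs = Arrays.copyOf(cachedDocs, size * 2);
        cachedDocs[size++] = doc;
    }

    // Replays cached doc IDs in collection order; a target that requires
    // in-order docs is only safe if collection itself was in order.
    public void replay(IntConsumer target) {
        for (int i = 0; i < size; i++) target.accept(cachedDocs[i]);
    }

    public static void main(String[] args) {
        MiniCachingCollector c = new MiniCachingCollector();
        for (int doc : new int[] {2, 9, 4, 17}) c.collect(doc); // out of order on purpose
        StringBuilder sb = new StringBuilder();
        c.replay(d -> sb.append(d).append(' '));
        System.out.println(sb.toString().trim()); // 2 9 4 17 -- replay preserves order
    }
}
```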




Re: Field should accept BytesRef?

2011-05-16 Thread Robert Muir
On Mon, May 16, 2011 at 11:29 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 But when you create an untokenized field (or even a binary field, which is 
 stored-only at the moment), you could theoretically index the bytes directly

 Right, if I already have a BytesRef of what needs to be indexed, then
 passing the BR into Field/able should reduce garbage collection of
 strings?


you can do this with a tokenstream, see
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/Test2BTerms.java
for an example

(sorry i somehow was confused about your message earlier).




[jira] [Resolved] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2450.
-

Resolution: Fixed

Committed to trunk and branch_3x.

 Carrot2 clustering should use both its own and Solr's stop words
 

 Key: SOLR-2450
 URL: https://issues.apache.org/jira/browse/SOLR-2450
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2450.patch


 While using only Solr's stop words for clustering isn't a good idea (compared 
 to indexing, clustering needs more aggressive stop word removal to get 
 reasonable cluster labels), it would be good if Carrot2 used both its own and 
 Solr's stop words.
 I'm not sure what the best way to implement this would be though. My first 
 thought was to simply load {{stopwords.txt}} from Solr config dir and merge 
 them with Carrot2's. But then, maybe a better approach would be to get the 
 stop words from the StopFilter being used? Ideally, we should also consider 
 the per-field stop filters configured on the fields used for clustering.




[jira] [Resolved] (SOLR-2449) Loading of Carrot2 resources from Solr config directory

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2449.
-

Resolution: Fixed

Committed to trunk and branch_3x.

 Loading of Carrot2 resources from Solr config directory
 ---

 Key: SOLR-2449
 URL: https://issues.apache.org/jira/browse/SOLR-2449
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
 Fix For: 3.2, 4.0

 Attachments: SOLR-2449.patch


 Currently, Carrot2 clustering algorithms read linguistic resources (stop 
 words, stop labels) from the classpath (Carrot2 JAR), which makes them 
 difficult to edit/override. The directory from which Carrot2 should read its 
 resources (absolute, or relative to Solr config dir) could be specified in 
 the {{engine}} element. By default, the path could be e.g. 
 {{solr.conf/clustering/carrot2}}.




[jira] [Resolved] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2448.
-

Resolution: Fixed

Committed to trunk and branch_3x.

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, 
 SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Resolved] (SOLR-2505) Output cluster scores

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2505.
-

Resolution: Fixed

Committed to trunk and branch_3x.

 Output cluster scores
 -

 Key: SOLR-2505
 URL: https://issues.apache.org/jira/browse/SOLR-2505
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0


 Carrot2 algorithms compute cluster scores; we could expose them on the output 
 from Solr clustering component. Along with scores, we can output a boolean 
 flag that marks the Other Topics groups.




[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos

2011-05-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-3084:
--

Attachment: LUCENE-3084-trunk-only.patch

Here is an updated patch that removes some List&lt;SI&gt; usage from DirectoryReader and 
IndexWriter for rollback when commit fails. I am still not happy with 
IndexWriter code interacting directly with the list, but this should maybe be 
fixed later.

This patch could also be backported to clean up 3.x, but for backwards 
compatibility, the SegmentInfos class should still extend Vector&lt;SI&gt;; we 
can make the segments field simply point to this. I am not sure how to 
deprecate extension of a class? A possibility would be to add each Vector 
method as an overridden, deprecated one-liner, but that's a no-brainer and 
stupid to do :(
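One way to keep 3.x source compatibility along the lines hinted at above, sketched with plain stdlib types (this SegmentInfos is a stand-in, not the real class): keep extending Vector for old callers, but expose the new List-typed view as a field that simply aliases `this`, so code written against the 4.0-style API already works.

```java
import java.util.List;
import java.util.Vector;

public class SegmentListDemo {
    // Stand-in for SegmentInfos: still extends Vector for back compat...
    @SuppressWarnings("serial")
    static class SegmentInfos extends Vector<String> {
        // ...while new code is written against this List view, which is just
        // an alias of the object itself, so both views always stay in sync.
        final List<String> segments = this;
    }

    public static void main(String[] args) {
        SegmentInfos infos = new SegmentInfos();
        infos.add("_0");          // legacy Vector-style call
        infos.segments.add("_1"); // new List-style call
        System.out.println(infos.segments.size() == infos.size()); // true, same object
    }
}
```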

 MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos
 --

 Key: LUCENE-3084
 URL: https://issues.apache.org/jira/browse/LUCENE-3084
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084.patch


 SegmentInfos carries a bunch of fields beyond the list of SI, but for merging 
 purposes these fields are unused.
 We should cut over to List&lt;SI&gt; instead.




[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034091#comment-13034091
 ] 

Michael McCandless commented on LUCENE-3102:


Patch looks great Shai -- +1 to commit!!

Yes that is very sneaky about the private fields in inner/outer classes -- it's 
good you added a comment explaining it!

 Few issues with CachingCollector
 

 Key: LUCENE-3102
 URL: https://issues.apache.org/jira/browse/LUCENE-3102
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3102.patch, LUCENE-3102.patch


 CachingCollector (introduced in LUCENE-1421) has a few issues:
 # Since the wrapped Collector may support out-of-order collection, the 
 document IDs cached may be out-of-order (depends on the Query) and thus 
 replay(Collector) will forward document IDs out-of-order to a Collector that 
 may not support it.
 # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
 # I think that instead of comparing curScores to null, in order to determine 
 if scores are requested, we should have a specific boolean - for clarity
 # This check if (base + nextLength > maxDocsToCache) (line 168) can be 
 relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
 maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
 to try and cache them?
 Also:
 * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
 need that if CachingCollector ctor already takes a boolean cacheScores? I 
 think it's better defined explicitly than implicitly?
 * Let's introduce a factory method for creating a specialized version if 
 scoring is requested / not (i.e., impl the TODO in line 189)
 * I think it's a useful collector, which stands on its own and not specific 
 to grouping. Can we move it to core?
 * How about using OpenBitSet instead of int[] for doc IDs?
 ** If the number of hits is big, we'd gain some RAM back, and be able to 
 cache more entries
 ** NOTE: OpenBitSet can only be used for in-order collection only. So we can 
 use that if the wrapped Collector does not support out-of-order
 * Do you think we can modify this Collector to not necessarily wrap another 
 Collector? We have such Collector which stores (in-memory) all matching doc 
 IDs + scores (if required). Those are later fed into several processes that 
 operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
 can make CachingCollector *optionally* wrap another Collector and then 
 someone can reuse it by setting RAM limit to unlimited (we should have a 
 constant for that) in order to simply collect all matching docs + scores.
 * I think a set of dedicated unit tests for this class alone would be good.
 That's it so far. Perhaps, if we do all of the above, more things will pop up.




[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034093#comment-13034093
 ] 

Michael McCandless commented on LUCENE-3084:


Uwe, this looks like a great step forward?  Even if there are other things to 
fix later, we should commit this first (progress not perfection)?  Thanks!

On backporting, this is an experimental API, and it's rather expert for code 
to be interacting with SegmentInfos, so I think we can just break it (and 
advertise we did so)?

 MergePolicy.OneMerge.segments should be List&lt;SegmentInfo&gt; not SegmentInfos
 --

 Key: LUCENE-3084
 URL: https://issues.apache.org/jira/browse/LUCENE-3084
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084.patch


 SegmentInfos carries a bunch of fields beyond the list of SI, but for merging 
 purposes these fields are unused.
 We should cut over to List<SI> instead.




[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034095#comment-13034095
 ] 

Michael McCandless commented on LUCENE-3090:


Patch looks good but hairy, Simon!

I ran 144 iters of all (Solr+lucene+lucene-contrib) tests.  I hit three fails 
(one in Solr's TestJoin.testRandomJoin, and two in Solr's HighlighterTest) but 
I don't think these are related to this patch.

 DWFlushControl does not take active DWPT out of the loop on fullFlush
 -

 Key: LUCENE-3090
 URL: https://issues.apache.org/jira/browse/LUCENE-3090
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Critical
 Fix For: 4.0

 Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch


 We have seen several OOMs on TestNRTThreads, and all of them are caused by 
 DWFlushControl missing DWPTs that are set as flushPending but can't flush due 
 to a full flush going on. That means those DWPTs are filling up in 
 the background while they should actually be checked out and blocked until 
 the full flush finishes. Furthermore, we currently stall on 
 maxNumThreadStates when we should stall on the number of active thread states. 
 I will attach a patch tomorrow.




[jira] [Created] (SOLR-2521) TestJoin.testRandom fails

2011-05-16 Thread Michael McCandless (JIRA)
TestJoin.testRandom fails
-

 Key: SOLR-2521
 URL: https://issues.apache.org/jira/browse/SOLR-2521
 Project: Solr
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 4.0


Hit this random failure; it reproduces on trunk:

{noformat}

[junit] Testsuite: org.apache.solr.TestJoin
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 4.512 sec
[junit] 
[junit] - Standard Error -
[junit] 2011-05-16 12:51:46 org.apache.solr.TestJoin testRandomJoin
[junit] SEVERE: GROUPING MISMATCH: mismatch: '0'!='1' @ response/numFound
[junit] 
request=LocalSolrQueryRequest{echoParams=all&indent=true&q={!join+from%3Dsmall_i+to%3Dsmall3_is}*:*&wt=json}
[junit] result={
[junit]   responseHeader:{
[junit] status:0,
[junit] QTime:0,
[junit] params:{
[junit]   echoParams:all,
[junit]   indent:true,
[junit]   q:{!join from=small_i to=small3_is}*:*,
[junit]   wt:json}},
[junit]   response:{numFound:1,start:0,docs:[
[junit]   {
[junit] id:NXEA,
[junit] score_f:87.90162,
[junit] small3_ss:[N,
[junit]   v,
[junit]   n],
[junit] small_i:4,
[junit] small2_i:1,
[junit] small2_is:[2],
[junit] small3_is:[69,
[junit]   88,
[junit]   54,
[junit]   80,
[junit]   75,
[junit]   83,
[junit]   57,
[junit]   73,
[junit]   85,
[junit]   52,
[junit]   50,
[junit]   88,
[junit]   51,
[junit]   89,
[junit]   12,
[junit]   8,
[junit]   19,
[junit]   23,
[junit]   53,
[junit]   75,
[junit]   26,
[junit]   99,
[junit]   0,
[junit]   44]}]
[junit]   }}
[junit] expected={numFound:0,start:0,docs:[]}
[junit] model={NXEA:Doc(0):[id=NXEA, score_f=87.90162, small3_ss=[N, 
v, n], small_i=4, small2_i=1, small2_is=2, small3_is=[69, 88, 54, 80, 75, 83, 
57, 73, 85, 52, 50, 88, 51, 89, 12, 8, 19, 23, 53, 75, 26, 99, 0, 
44]],JSLZ:Doc(1):[id=JSLZ, score_f=11.198811, small2_ss=[c, d], 
small3_ss=[b, R, H, Q, O, f, C, e, Z, u, z, u, w, I, f, _, Y, r, w, u], 
small_i=6, small2_is=[2, 3], small3_is=[22, 1]],FAWX:Doc(2):[id=FAWX, 
score_f=25.524109, small_s=d, small3_ss=[O, D, X, `, W, z, k, M, j, m, r, [, E, 
P, w, ^, y, T, e, R, V, H, g, e, I], small_i=2, small2_is=[2, 1], 
small3_is=[95, 42]],GDDZ:Doc(3):[id=GDDZ, score_f=8.483642, small2_ss=[b, 
e], small3_ss=[o, i, y, l, I, O, r, O, f, d, E, e, d, f, b, P], small2_is=[6, 
6], small3_is=[36, 48, 9, 8, 40, 40, 68]],RBIQ:Doc(4):[id=RBIQ, 
score_f=97.06258, small_s=b, small2_s=c, small2_ss=[e, e], small_i=2, 
small2_is=6, small3_is=[13, 77, 96, 45]],LRDM:Doc(5):[id=LRDM, 
score_f=82.302124, small_s=b, small2_s=a, small2_ss=d, small3_ss=[H, m, O, D, 
I, J, U, D, f, N, ^, m, I, j, L, s, F, h, A, `, c, j], small2_i=2, 
small2_is=[2, 7], small3_is=[81, 31, 78, 23, 88, 1, 7, 86, 20, 7, 40, 52, 100, 
81, 34, 45, 87, 72, 14, 5]]}
[junit] NOTE: reproduce with: ant test -Dtestcase=TestJoin 
-Dtestmethod=testRandomJoin 
-Dtests.seed=-4998031941344546449:8541928265064992444
[junit] NOTE: test params are: codec=RandomCodecProvider: {id=MockRandom, 
small2_ss=Standard, small2_is=MockFixedIntBlock(blockSize=1738), 
small2_s=MockFixedIntBlock(blockSize=1738), 
small3_is=MockVariableIntBlock(baseBlockSize=77), 
small_i=MockFixedIntBlock(blockSize=1738), 
small_s=MockVariableIntBlock(baseBlockSize=77), score_f=MockSep, 
small2_i=Pulsing(freqCutoff=9), small3_ss=SimpleText}, locale=sr_BA, 
timezone=America/Barbados
[junit] NOTE: all tests run in this JVM:
[junit] [TestJoin]
[junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 
1.6.0_21 (64-bit)/cpus=24,threads=1,free=252342544,total=308084736
[junit] -  ---
[junit] Testcase: testRandomJoin(org.apache.solr.TestJoin): FAILED
[junit] mismatch: '0'!='1' @ response/numFound
[junit] junit.framework.AssertionFailedError: mismatch: '0'!='1' @ 
response/numFound
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
[junit] at org.apache.solr.TestJoin.testRandomJoin(TestJoin.java:172)
[junit] 
[junit] 
[junit] Test org.apache.solr.TestJoin FAILED
{noformat}


[jira] [Assigned] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file

2011-05-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-3100:
---

Assignee: Simon Willnauer

 IW.commit() writes but fails to fsync the N.fnx file
 

 Key: LUCENE-3100
 URL: https://issues.apache.org/jira/browse/LUCENE-3100
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: 4.0


 In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising 
 bug!
 Because the new N.fnx file is written at the last minute along with the 
 segments file, it's not included in the sis.files() that IW uses to figure 
 out which files to sync.
 This bug means one could call IW.commit(), successfully, return, and then the 
 machine could crash and when it comes back up your index could be corrupted.
 We should hopefully first fix TestCrash so that it hits this bug (maybe it 
 needs more/better randomization?), then fix the bug
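As background on the sync the issue refers to: a file that IW.commit() wrote but never fsynced may vanish if the machine crashes after commit returns. A minimal, general-purpose sketch of a durable write using plain NIO (the class and method names here are illustrative, not Lucene's Directory/IndexWriter code):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class SyncedWrite {
  /** Write bytes and fsync them to stable storage before returning. */
  static void writeAndSync(Path path, byte[] data) {
    try (FileChannel ch = FileChannel.open(path,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING)) {
      ByteBuffer buf = ByteBuffer.wrap(data);
      while (buf.hasRemaining()) {
        ch.write(buf);
      }
      // The missing step in the bug above: without force(), a successful
      // return does not guarantee the bytes survive a machine crash.
      ch.force(true);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Round-trip helper so the sketch is easy to exercise. */
  static String writeSyncAndReadBack(String content) {
    try {
      Path p = Files.createTempFile("synced", ".txt");
      writeAndSync(p, content.getBytes(StandardCharsets.UTF_8));
      return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The bug here is exactly that the N.fnx file skipped the force() step because it was not in the sis.files() list used to decide what to sync.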




[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034101#comment-13034101
 ] 

Michael McCandless commented on SOLR-2519:
--

I think the attached patch is a good starting point. It fixes the
generic text fieldType to have good all around defaults for all
languages, so that non-whitespace languages work fine.

Then, I think we should iteratively add in custom languages over time
(as separate issues).  We can e.g. add text_en_autophrase, text_en,
text_zh, etc.  We should at least do a first sweep of the analyzers
module and add fieldTypes for them.

This way we will eventually get to the ideal future when we have
text_XX coverage for many languages.


 Improve the defaults for the text field type in default schema.xml
 

 Key: SOLR-2519
 URL: https://issues.apache.org/jira/browse/SOLR-2519
 Project: Solr
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.2, 4.0

 Attachments: SOLR-2519.patch


 Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
 The text fieldType in schema.xml is unusable for non-whitespace
 languages, because it has the dangerous auto-phrase feature (of
 Lucene's QP -- see LUCENE-2458) enabled.
 Lucene leaves this off by default, as does ElasticSearch
 (http://www.elasticsearch.org/).
 Furthermore, the text fieldType uses WhitespaceTokenizer when
 StandardTokenizer is a better cross-language default.
 Until we have language specific field types, I think we should fix
 the text fieldType to work well for all languages, by:
   * Switching from WhitespaceTokenizer to StandardTokenizer
   * Turning off auto-phrase
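A hypothetical sketch of what such a language-neutral fieldType could look like in schema.xml (illustrative only; the name and exact filter chain are assumptions, not the attached patch):

```xml
<!-- Sketch: StandardTokenizer for cross-language tokenization,
     auto-phrase explicitly off; no stemmer, no WordDelimiterFilter. -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```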




[jira] [Assigned] (LUCENE-2027) Deprecate Directory.touchFile

2011-05-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2027:
--

Assignee: Michael McCandless

 Deprecate Directory.touchFile
 -

 Key: LUCENE-2027
 URL: https://issues.apache.org/jira/browse/LUCENE-2027
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 4.0

 Attachments: LUCENE-2027.patch


 Lucene doesn't use this method, and, FindBugs reports that FSDirectory's impl 
 shouldn't swallow the returned result from File.setLastModified.




[jira] [Updated] (LUCENE-2027) Deprecate Directory.touchFile

2011-05-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2027:
---

Attachment: LUCENE-2027.patch

Patch, removing Dir.touchFile from trunk.

For 3.x I'll deprecate.

 Deprecate Directory.touchFile
 -

 Key: LUCENE-2027
 URL: https://issues.apache.org/jira/browse/LUCENE-2027
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 4.0

 Attachments: LUCENE-2027.patch


 Lucene doesn't use this method, and, FindBugs reports that FSDirectory's impl 
 shouldn't swallow the returned result from File.setLastModified.




[jira] [Commented] (LUCENE-3090) DWFlushControl does not take active DWPT out of the loop on fullFlush

2011-05-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034103#comment-13034103
 ] 

Simon Willnauer commented on LUCENE-3090:
-

Thanks mike for review and testing!! It makes me feel better with those asserts 
in there now... I will commit tomorrow.

 DWFlushControl does not take active DWPT out of the loop on fullFlush
 -

 Key: LUCENE-3090
 URL: https://issues.apache.org/jira/browse/LUCENE-3090
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Critical
 Fix For: 4.0

 Attachments: LUCENE-3090.patch, LUCENE-3090.patch, LUCENE-3090.patch


 We have seen several OOMs on TestNRTThreads, and all of them are caused by 
 DWFlushControl missing DWPTs that are set as flushPending but can't flush due 
 to a full flush going on. That means those DWPTs are filling up in 
 the background while they should actually be checked out and blocked until 
 the full flush finishes. Furthermore, we currently stall on 
 maxNumThreadStates when we should stall on the number of active thread states. 
 I will attach a patch tomorrow.




[jira] [Updated] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos

2011-05-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-3084:
--

Attachment: LUCENE-3084-trunk-only.patch

New patch that also has BalancedMergePolicy from contrib refactored to new API 
(sorry that was missing).

 MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
 --

 Key: LUCENE-3084
 URL: https://issues.apache.org/jira/browse/LUCENE-3084
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch


 SegmentInfos carries a bunch of fields beyond the list of SI, but for merging 
 purposes these fields are unused.
 We should cut over to List<SI> instead.




[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml

2011-05-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034120#comment-13034120
 ] 

Yonik Seeley commented on SOLR-2519:


I think maybe there's a misconception that the fieldType named text was meant 
to be generic for all languages.  As I said in the thread, if I had to do it 
over again, I would have named it text_en because that's what it's purpose 
was.  But at this point, it seems like the best way forward is to leave text 
as an english fieldType and simply add other fieldTypes that can support other 
languages.

Some downsides I see to this patch (i.e. trying to make the 'text' fieldType 
generic):
- The current WordDelimiterFilter options in the fieldType feel like a trap for 
non-whitespace-delimited languages.  WDF is configured to index catenations as 
well as splits... so all of the tokens (words?) that are split out are also 
catenated together and indexed (which seems like it could lead to some truly 
huge tokens erroneously being indexed.)
- You left the english stemmer on the text fieldType... but if it's supposed 
to be generic, couldn't this be bad for some other western languages where it 
could cause stemming collisions of words not related to each other?

Taking into account all the existing users (and all the existing documentation, 
examples, tutorial, etc), I favor a more conservative approach of adding new 
fieldTypes rather than radically changing the behavior of existing ones.

Random question: what are the implications of changing from WhitespaceTokenizer 
to StandardTokenizer, esp w.r.t. WDF?

 Improve the defaults for the text field type in default schema.xml
 

 Key: SOLR-2519
 URL: https://issues.apache.org/jira/browse/SOLR-2519
 Project: Solr
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.2, 4.0

 Attachments: SOLR-2519.patch


 Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
 The text fieldType in schema.xml is unusable for non-whitespace
 languages, because it has the dangerous auto-phrase feature (of
 Lucene's QP -- see LUCENE-2458) enabled.
 Lucene leaves this off by default, as does ElasticSearch
 (http://www.elasticsearch.org/).
 Furthermore, the text fieldType uses WhitespaceTokenizer when
 StandardTokenizer is a better cross-language default.
 Until we have language specific field types, I think we should fix
 the text fieldType to work well for all languages, by:
   * Switching from WhitespaceTokenizer to StandardTokenizer
   * Turning off auto-phrase




[jira] [Commented] (SOLR-2520) Solr creates invalid jsonp strings

2011-05-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034151#comment-13034151
 ] 

Hoss Man commented on SOLR-2520:


I'm confused here: As far as i can tell, the JSONResponseWriter does in fact 
output valid JSON (the link mentioned points out that there are control 
characters valid in JSON which are not valid in javascript, but that's what the 
response writer produces -- JSON) ... so what is the bug?

And what do you mean by "the query option to ask for jsonp"? ...  i don't see 
that option in the JSONResponseWriter

(is this bug about some third party response writer?)

 Solr creates invalid jsonp strings
 --

 Key: SOLR-2520
 URL: https://issues.apache.org/jira/browse/SOLR-2520
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Benson Margulies

 Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
 If a stored field contains invalid Javascript characters, and you use the 
 query option to ask for jsonp, solr does *not* escape some invalid Unicode 
 characters, resulting in strings that explode on contact with browsers.




[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034154#comment-13034154
 ] 

Michael McCandless commented on SOLR-2519:
--

bq. I think maybe there's a misconception that the fieldType named text was 
meant to be generic for all languages.

Regardless of what the original intention was, text today has become
the generic text fieldType that new users use when starting with Solr.  I
mean, it has the perfect name for that :)

bq. As I said in the thread, if I had to do it over again, I would have named 
it text_en because that's what it's purpose was.

Hindsight is 20/20... but, we can still fix this today.  We shouldn't
lock ourselves into poor defaults.

Especially, as things improve and we get better analyzers, etc., we
should be free to improve the defaults in schema.xml to take advantage
of these improvements.

bq. But at this point, it seems like the best way forward is to leave text as 
an english fieldType and simply add other fieldTypes that can support other 
languages.

I think this is a dangerous approach -- the name (ie, missing _en if
in fact it has such English-specific configuration) is misleading and
traps new users.

Ideally, in the future, we wouldn't even have a text fieldType, only
text_XX per-language examples and then maybe something like
text_general, which you use if you cannot find your language.

{quote}
Some downsides I see to this patch (i.e. trying to make the 'text' fieldType 
generic):

The current WordDelimiterFilter options the fieldType feel like a trap for 
non-whitespace-delimited languages. WDF is configured to index catenations as 
well as splits... so all of the tokens (words?) that are split out are also 
catenated together and indexed (which seems like it could lead to some truly 
huge tokens erroneously being indexed.)
{quote}
Ahh good point.  I think we should remove WDF altogether from the
generic text fieldType.

{quote}
You left the english stemmer on the text fieldType... but if it's supposed to 
be generic, couldn't this be bad for some other western languages where it 
could cause stemming collisions of words not related to each other?
{quote}

+1, we should remove the stemming too from text.

bq. Taking into account all the existing users (and all the existing 
documentation, examples, tutorial, etc), I favor a more conservative approach 
of adding new fieldTypes rather than radically changing the behavior of 
existing ones.

Can you point to specific examples (docs, examples, tutorial)?  I'd
like to understand how much work it is to fix these...

My feeling is we should simply do the work here (I'll sign up to it)
and fix any places that actually rely on the specifics of text
fieldType, eg autophrase.

We shouldn't avoid fixing things well because it's gonna be more work
today, especially if someone (me) is signing up to do it.

Also: existing users would be unaffected by this?  They've already
copied over / edited their own schema.xml?  This is mainly about new
users?


 Improve the defaults for the text field type in default schema.xml
 

 Key: SOLR-2519
 URL: https://issues.apache.org/jira/browse/SOLR-2519
 Project: Solr
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.2, 4.0

 Attachments: SOLR-2519.patch


 Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
 The text fieldType in schema.xml is unusable for non-whitespace
 languages, because it has the dangerous auto-phrase feature (of
 Lucene's QP -- see LUCENE-2458) enabled.
 Lucene leaves this off by default, as does ElasticSearch
 (http://www.elasticsearch.org/).
 Furthermore, the text fieldType uses WhitespaceTokenizer when
 StandardTokenizer is a better cross-language default.
 Until we have language specific field types, I think we should fix
 the text fieldType to work well for all languages, by:
   * Switching from WhitespaceTokenizer to StandardTokenizer
   * Turning off auto-phrase




[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034158#comment-13034158
 ] 

Michael McCandless commented on SOLR-2519:
--

It's also spooky that text fieldType has different index
time vs query time analyzers?  Ie, WDF is configured differently.

 Improve the defaults for the text field type in default schema.xml
 

 Key: SOLR-2519
 URL: https://issues.apache.org/jira/browse/SOLR-2519
 Project: Solr
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.2, 4.0

 Attachments: SOLR-2519.patch


 Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
 The text fieldType in schema.xml is unusable for non-whitespace
 languages, because it has the dangerous auto-phrase feature (of
 Lucene's QP -- see LUCENE-2458) enabled.
 Lucene leaves this off by default, as does ElasticSearch
 (http://www.elasticsearch.org/).
 Furthermore, the text fieldType uses WhitespaceTokenizer when
 StandardTokenizer is a better cross-language default.
 Until we have language specific field types, I think we should fix
 the text fieldType to work well for all languages, by:
   * Switching from WhitespaceTokenizer to StandardTokenizer
   * Turning off auto-phrase




[jira] [Commented] (SOLR-2520) Solr creates invalid jsonp strings

2011-05-16 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034159#comment-13034159
 ] 

Benson Margulies commented on SOLR-2520:


Fun happens when you specify something in json.wrf. This demands 'jsonp' 
instead of json, which results in the response being treated as javascript, not 
json.  wt=json&json.wrf=SOME_PREFIX will cause Solr to respond with

 SOME_PREFIX({whatever it was otherwise going to return})

instead of just

 {whatever it was otherwise going to return}

If there is then an interesting Unicode character in there, Chrome implodes and 
firefox quietly rejects.
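The usual culprits are U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR): both are legal unescaped inside JSON strings, but they are line terminators in JavaScript, so an unescaped occurrence inside a JSONP callback body is a syntax error. A minimal sketch of the kind of escaping a JSONP writer needs (illustrative; not Solr's actual response-writer code):

```java
class JsonpEscape {
  /**
   * Escape U+2028/U+2029, which are valid raw in JSON strings but are
   * line terminators in JavaScript and so break a JSONP callback body.
   */
  static String escapeForJavascript(String json) {
    StringBuilder sb = new StringBuilder(json.length());
    for (int i = 0; i < json.length(); i++) {
      char c = json.charAt(i);
      if (c == 0x2028) {
        sb.append("\\u2028");
      } else if (c == 0x2029) {
        sb.append("\\u2029");
      } else {
        sb.append(c); // everything else is already JS-safe if it was JSON-safe
      }
    }
    return sb.toString();
  }
}
```

The output is still valid JSON (`\uXXXX` escapes are legal in JSON strings), so applying this unconditionally costs nothing for plain wt=json responses.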



 Solr creates invalid jsonp strings
 --

 Key: SOLR-2520
 URL: https://issues.apache.org/jira/browse/SOLR-2520
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Benson Margulies

 Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
 If a stored field contains invalid Javascript characters, and you use the 
 query option to ask for jsonp, solr does *not* escape some invalid Unicode 
 characters, resulting in strings that explode on contact with browsers.




[jira] [Updated] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-3098:
--

Attachment: LUCENE-3098.patch

Attached patch with the discussed changes.
3x patch follows soon.

 Grouped total count
 ---

 Key: LUCENE-3098
 URL: https://issues.apache.org/jira/browse/LUCENE-3098
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
 LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch


 When grouping currently you can get two counts:
 * Total hit count. Which counts all documents that matched the query.
 * Total grouped hit count. Which counts all documents that have been grouped 
 in the top N groups.
 Since with grouping the end user gets groups in his search result instead of 
 plain documents, the total number of groups as the total count makes more 
 sense in many situations.
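The distinction between the counts can be illustrated with a toy example (GroupCounts and its inputs are made up for illustration; this is not Lucene's grouping module):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Illustrative only: two of the counts discussed, computed from a flat
 * list holding the group value of each matching doc.
 */
class GroupCounts {
  /** Total hit count: every document that matched the query. */
  static int totalHits(List<String> groupPerHit) {
    return groupPerHit.size();
  }

  /** Grouped total count: the number of distinct groups that matched. */
  static int totalGroupCount(List<String> groupPerHit) {
    Set<String> groups = new LinkedHashSet<>(groupPerHit);
    return groups.size();
  }
}
```

For five matching docs spread over three groups, totalHits is 5 while totalGroupCount is 3; the issue argues the latter is what a user paging through groups usually wants.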




[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml

2011-05-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034172#comment-13034172
 ] 

Hoss Man commented on SOLR-2519:


I feel like we are conflating two issues here: the default behavior of 
TextField, and the example configs.

i don't have any strong opinions about changing the default behavior of 
TextField when {{autoGeneratePhraseQueries}} is not specified in the 
{{<fieldType/>}}, but if we do make such a change, it should be contingent on 
the schema version property (which we should bump) so that people who upgrade 
will get consistent behavior with their existing configs (TextField.init 
already has an example of this for when we changed the default of {{omitNorms}})

as far as the example configs: i agree with yonik, that changing text at this 
point might be confusing ... i think the best way to iterate moving forward 
would probably be:

* rename {{<fieldType name="text"/>}} and {{<field name="text"/>}} to something 
that makes their purpose more clear (text_en, or text_western, or 
text_european, or some other more general descriptive word for the types of 
languages where it makes sense) and switch all existing {{<field/>}} 
declarations that currently use field type text to use this new name.

* add a new {{<fieldType name="text_general"/>}} which is designed (and 
documented) to be a general purpose field type when the language is unknown (it 
may make sense to fix/repurpose the existing {{<fieldType name="textgen"/>}} 
for this, since it already suggests that's what it's for)

* Audit all {{<field/>}} declarations that use text_en (or whatever name was 
chosen above) and the existing sample data for those fields to see if it makes 
more sense to change them to text_general; also change any where, based on 
usage, it shouldn't matter.

The end result being that we have no {{<fieldType/>}} named text in the 
example configs, so people won't get it confused with previous versions, and 
we'll have a new {{<fieldType/>}} that works as well as possible with all 
languages, which we use as much as possible with the example data.






 Improve the defaults for the text field type in default schema.xml
 

 Key: SOLR-2519
 URL: https://issues.apache.org/jira/browse/SOLR-2519
 Project: Solr
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.2, 4.0

 Attachments: SOLR-2519.patch


 Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
 The text fieldType in schema.xml is unusable for non-whitespace
 languages, because it has the dangerous auto-phrase feature (of
 Lucene's QP -- see LUCENE-2458) enabled.
 Lucene leaves this off by default, as does ElasticSearch
 (http://www.elasticsearch.org/).
 Furthermore, the text fieldType uses WhitespaceTokenizer when
 StandardTokenizer is a better cross-language default.
 Until we have language specific field types, I think we should fix
 the text fieldType to work well for all languages, by:
   * Switching from WhitespaceTokenizer to StandardTokenizer
   * Turning off auto-phrase




[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml

2011-05-16 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034176#comment-13034176
 ] 

Hoss Man commented on SOLR-2519:


bq. Also: existing users would be unaffected by this? They've already copied 
over / edited their own schema.xml? This is mainly about new users?

The trap we've seen with this type of thing in the past (ie: the numeric 
fields) is that people who tend to use the example configs w/o changing them 
much refer to the example field types by name when talking about them on the 
mailing list, not considering that those names can have different meanings 
depending on version.

if we make radical changes to a {{<fieldType/>}} but leave the name alone, it 
could confuse a lot of people, ie: "i tried using the 'text' field but it 
didn't work"; "which version of solr are you using?"; "Solr 4.1"; "that should 
work, what exactly does your schema look like?"; "..."; "that's the schema from 
3.6"; "yeah, i started with 3.6 and then upgraded to 4.1 later", etc...

Bottom line: it's less confusing to *remove* a {{<fieldType/>}} and add new ones 
with new names than to make radical changes to existing ones.




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Component/s: (was: contrib/*)
 modules/grouping

 Few issues with CachingCollector
 

 Key: LUCENE-3102
 URL: https://issues.apache.org/jira/browse/LUCENE-3102
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/grouping
Reporter: Shai Erera
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3102.patch, LUCENE-3102.patch


 CachingCollector (introduced in LUCENE-1421) has a few issues:
 # Since the wrapped Collector may support out-of-order collection, the 
 document IDs cached may be out-of-order (depends on the Query) and thus 
 replay(Collector) will forward document IDs out-of-order to a Collector that 
 may not support it.
 # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
 # I think that instead of comparing curScores to null, in order to determine 
 if scores are requested, we should have a specific boolean - for clarity
 # Can this check, {{if (base + nextLength > maxDocsToCache)}} (line 168), be 
 relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
 maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
 to try and cache them?
 Also:
 * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
 need that if CachingCollector ctor already takes a boolean cacheScores? I 
 think it's better defined explicitly than implicitly?
 * Let's introduce a factory method for creating a specialized version if 
 scoring is requested / not (i.e., impl the TODO in line 189)
 * I think it's a useful collector, which stands on its own and not specific 
 to grouping. Can we move it to core?
 * How about using OpenBitSet instead of int[] for doc IDs?
 ** If the number of hits is big, we'd gain some RAM back, and be able to 
 cache more entries
 ** NOTE: OpenBitSet can be used for in-order collection only, so we can 
 use it if the wrapped Collector does not support out-of-order collection
 * Do you think we can modify this Collector to not necessarily wrap another 
 Collector? We have such Collector which stores (in-memory) all matching doc 
 IDs + scores (if required). Those are later fed into several processes that 
 operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
 can make CachingCollector *optionally* wrap another Collector and then 
 someone can reuse it by setting RAM limit to unlimited (we should have a 
 constant for that) in order to simply collect all matching docs + scores.
 * I think a set of dedicated unit tests for this class alone would be good.
 That's it so far. Perhaps, if we do all of the above, more things will pop up.
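As a back-of-the-envelope illustration of the OpenBitSet point (my own numbers and class, not from the patch): an int[] costs 4 bytes per cached hit, while a bit set costs roughly maxDoc/8 bytes regardless of how many docs match, so the bit set wins once more than about maxDoc/32 documents are cached.

```java
// Rough sizing sketch for the int[]-vs-bit-set tradeoff.
public class CacheSizing {
    // int[] of doc IDs: 4 bytes per cached hit
    static long intArrayBytes(long numHits) {
        return 4L * numHits;
    }
    // bit set over the whole doc ID space: maxDoc bits, rounded up to bytes
    static long bitSetBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }
    public static void main(String[] args) {
        long maxDoc = 10_000_000L;
        // crossover is at numHits ~ maxDoc / 32 = 312,500 hits
        System.out.println(intArrayBytes(400_000)); // 1600000 bytes
        System.out.println(bitSetBytes(maxDoc));    // 1250000 bytes
    }
}
```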




[jira] [Updated] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated SOLR-2520:
---

Summary: JSONResponseWriter w/json.wrf can produce invalid javascript 
depending on unicode chars in response data  (was: Solr creates invalid jsonp 
strings)

Benson: thanks for the clarification, i've updated the summary to attempt to 
clarify the root of the issue.

Would it make more sense to have a JavascriptResponseWriter, or to have the 
JSONResponseWriter do unicode escaping/stripping if/when json.wrf is specified?

 JSONResponseWriter w/json.wrf can produce invalid javascript depending on 
 unicode chars in response data
 

 Key: SOLR-2520
 URL: https://issues.apache.org/jira/browse/SOLR-2520
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Benson Margulies

 Please see http://timelessrepo.com/json-isnt-a-javascript-subset.
 If a stored field contains invalid Javascript characters, and you use the 
 query option to ask for jsonp, solr does *not* escape some invalid Unicode 
 characters, resulting in strings that explode on contact with browsers.




[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034185#comment-13034185
 ] 

Michael McCandless commented on SOLR-2519:
--

bq. Bottom line: it's less confusing to remove a fieldType and add new ones 
with new names than to make radical changes to existing ones.

Ahh, this makes great sense!

I really like your proposal Hoss, and that's a great point about emails to the 
mailing lists.

So we'd have no more text fieldType.  Just text_en (what text now is) and 
text_general (basically just StandardAnalyzer, but maybe move/absorb textgen 
over).

Over time we can add in more language specific text_XX fieldTypes...




[jira] [Commented] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034187#comment-13034187
 ] 

Benson Margulies commented on SOLR-2520:


I'd vote for the latter. I assume that there is some large inventory of people 
who are currently using json.wrf=foo and who would benefit from the change. 
However, I have limited context here, so if anyone else knows more about how 
users are using this stuff I hope they will speak up. Sorry not to have been 
fully clear on the first attempt.





[jira] [Updated] (LUCENE-3103) create a simple test that indexes and searches byte[] terms

2011-05-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3103:


Attachment: LUCENE-3103.patch

attached is a first patch... maybe Uwe won't be able to resist rewriting it to 
make it simpler :)

 create a simple test that indexes and searches byte[] terms
 ---

 Key: LUCENE-3103
 URL: https://issues.apache.org/jira/browse/LUCENE-3103
 Project: Lucene - Java
  Issue Type: Test
  Components: general/test
Reporter: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-3103.patch


 Currently, the only good test that does this is Test2BTerms (disabled by 
 default)
 I think we should test this capability, and also have a simpler example for 
 how to do this.




[jira] [Commented] (SOLR-2520) JSONResponseWriter w/json.wrf can produce invalid javascript depending on unicode chars in response data

2011-05-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034197#comment-13034197
 ] 

Yonik Seeley commented on SOLR-2520:


It looks like we already escape \u2028 (see SOLR-1936), so we should just do 
the same for \u2029?
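For context: U+2028 and U+2029 are legal unescaped inside JSON strings but are line terminators in JavaScript source, so a json.wrf-wrapped response containing them is a syntax error when evaluated as JS. A minimal sketch of the kind of escaping being discussed (the helper name is mine, not Solr's actual writer code):

```java
// Hypothetical helper: escape the two Unicode chars that are valid in
// JSON strings but terminate lines in JavaScript source.
public class JsonEscapeDemo {
    static String escapeJsSeparators(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\u2028') sb.append("\\u2028");      // LINE SEPARATOR
            else if (c == '\u2029') sb.append("\\u2029"); // PARAGRAPH SEPARATOR
            else sb.append(c);
        }
        return sb.toString();
    }
    public static void main(String[] args) {
        // prints x\u2028y\u2029z
        System.out.println(escapeJsSeparators("x\u2028y\u2029z"));
    }
}
```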




[jira] [Updated] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-3098:
--

Attachment: LUCENE-3098.patch

Attached a new patch.

* Renamed TotalGroupCountCollector to AllGroupsCollector. This rename better 
reflects what the collector is actually doing.
* Group values are now collected in an ArrayList instead of a LinkedList. The 
initialSize is now also used for the ArrayList.

 Grouped total count
 ---

 Key: LUCENE-3098
 URL: https://issues.apache.org/jira/browse/LUCENE-3098
 Project: Lucene - Java
  Issue Type: New Feature
Reporter: Martijn van Groningen
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3098-3x.patch, LUCENE-3098.patch, 
 LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch, LUCENE-3098.patch


 When grouping currently you can get two counts:
 * Total hit count. Which counts all documents that matched the query.
 * Total grouped hit count. Which counts all documents that have been grouped 
 in the top N groups.
 Since the end user gets groups in his search result instead of plain 
 documents when grouping, the total number of groups as the total count makes 
 more sense in many situations. 
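The two existing counts versus the proposed one can be illustrated with plain strings standing in for group values (my own sketch, not the Lucene grouping API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class GroupCounts {
    // proposed count: number of distinct group values among the matching docs
    static int totalGroupCount(List<String> groupOfMatch) {
        return new HashSet<>(groupOfMatch).size();
    }
    public static void main(String[] args) {
        // one entry per matching document, holding that doc's group value
        List<String> matches = Arrays.asList("a", "a", "b", "c", "c", "c");
        System.out.println(matches.size());           // total hit count: 6
        System.out.println(totalGroupCount(matches)); // total group count: 3
    }
}
```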




[jira] [Commented] (SOLR-2519) Improve the defaults for the text field type in default schema.xml

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034203#comment-13034203
 ] 

Robert Muir commented on SOLR-2519:
---

As someone frustrated by this (but who would ultimately like to move past it 
and try to help with solr's intl), I just wanted to say +1 to Hoss Man's 
proposal.

My only suggestion on what he said is that I would greatly prefer text_en over 
text_western or whatever for these reasons:
1. the stemming and stopwords and crap here are english.
2. for other western languages, even if you swap these out for, say, french or 
italian (which is the seemingly obvious way to cut over), the whole 
WDF+autophrase combination is still a huge trap (see 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance 
for an example). in that case ElisionFilter can be used to avoid it.




[jira] [Commented] (LUCENE-3098) Grouped total count

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034214#comment-13034214
 ] 

Michael McCandless commented on LUCENE-3098:


Looks great Martijn!

I'll commit in a day or two if nobody objects...




[jira] [Assigned] (LUCENE-3098) Grouped total count

2011-05-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-3098:
--

Assignee: Michael McCandless




[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms

2011-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034217#comment-13034217
 ] 

Robert Muir commented on LUCENE-3103:
-

one thing i did previously (seemed overkill, but maybe good to do) was to call 
clearAttributes() and setBytesRef() on each incrementToken(),
more like a normal tokenizer. we could still change it to work like this; in 
that case clear() would set the BytesRef to null.

another thing to inspect is the reflection api, so that toString() prints the 
bytes... didn't check this.





[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms

2011-05-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034220#comment-13034220
 ] 

Uwe Schindler commented on LUCENE-3103:
---

Reflection should work correctly. No need to change anything.




[jira] [Commented] (LUCENE-3103) create a simple test that indexes and searches byte[] terms

2011-05-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034224#comment-13034224
 ] 

Michael McCandless commented on LUCENE-3103:


+1 -- this is a great test to add, now that we support arbitrary binary terms.





[jira] [Updated] (LUCENE-3098) Grouped total count

2011-05-16 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-3098:
--

Attachment: LUCENE-3098-3x.patch

Great! Attached the 3x backport.




[jira] [Updated] (LUCENE-3100) IW.commit() writes but fails to fsync the N.fnx file

2011-05-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-3100:


Attachment: LUCENE-3100.patch

here is a patch sync'ing the file on successful write during prepareCommit

 IW.commit() writes but fails to fsync the N.fnx file
 

 Key: LUCENE-3100
 URL: https://issues.apache.org/jira/browse/LUCENE-3100
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-3100.patch


 In making a unit test for NRTCachingDir (LUCENE-3092) I hit this surprising 
 bug!
 Because the new N.fnx file is written at the last minute along with the 
 segments file, it's not included in the sis.files() that IW uses to figure 
 out which files to sync.
 This bug means one could call IW.commit(), successfully, return, and then the 
 machine could crash and when it comes back up your index could be corrupted.
 We should hopefully first fix TestCrash so that it hits this bug (maybe it 
 needs more/better randomization?), then fix the bug.




[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034242#comment-13034242
 ] 

Simon Willnauer commented on LUCENE-3092:
-

mike, I attached a patch to LUCENE-3100 and tested it with the latest patch on 
this issue. The test randomly fails (after I close the IW in the test!). Here 
is a trace:

{noformat}

junit-sequential:
[junit] Testsuite: org.apache.lucene.store.TestNRTCachingDirectory
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.16 sec
[junit] 
[junit] - Standard Error -
[junit] NOTE: reproduce with: ant test -Dtestcase=TestNRTCachingDirectory 
-Dtestmethod=testNRTAndCommit 
-Dtests.seed=-753565914717395747:-1817581638532977526
[junit] NOTE: test params are: codec=RandomCodecProvider: 
{docid=SimpleText, body=MockFixedIntBlock(blockSize=1993), 
title=Pulsing(freqCutoff=3), titleTokenized=MockSep, date=SimpleText}, 
locale=ar_AE, timezone=America/Santa_Isabel
[junit] NOTE: all tests run in this JVM:
[junit] [TestNRTCachingDirectory]
[junit] NOTE: Mac OS X 10.6.7 x86_64/Apple Inc. 1.6.0_24 
(64-bit)/cpus=2,threads=1,free=46213552,total=85000192
[junit] -  ---
[junit] Testcase: 
testNRTAndCommit(org.apache.lucene.store.TestNRTCachingDirectory):FAILED
[junit] limit=12 actual=16
[junit] junit.framework.AssertionFailedError: limit=12 actual=16
[junit] at 
org.apache.lucene.index.RandomIndexWriter.doRandomOptimize(RandomIndexWriter.java:165)
[junit] at 
org.apache.lucene.index.RandomIndexWriter.close(RandomIndexWriter.java:199)
[junit] at 
org.apache.lucene.store.TestNRTCachingDirectory.testNRTAndCommit(TestNRTCachingDirectory.java:179)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1282)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1211)
[junit] 
[junit] 
[junit] Test org.apache.lucene.store.TestNRTCachingDirectory FAILED
{noformat}

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch, 
 LUCENE-3092.patch, LUCENE-3092.patch


 I created this simple Directory impl, whose goal is to reduce IO
 contention in a frequent-reopen NRT use case.
 The idea is, when reopening quickly but not indexing that much
 content, you wind up with many small files created over time, which can
 stress the IO system, e.g. if merges and searching are also
 fighting for IO.
 So, NRTCachingDirectory puts these newly created files into a RAMDir,
 and only when they are merged into a too-large segment does it
 write through to the real (delegate) directory.
 This lets you spend some RAM to reduce IO.
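The write-through idea can be sketched as a simple size-based routing rule. The class and method names here are my own illustration, not the actual NRTCachingDirectory API, whose real policy may consider more than file size:

```java
// Sketch: route newly created files by expected size -- small
// flushed segments go to the in-memory cache, large merged
// segments go straight to the delegate (on-disk) directory.
public class CachePolicy {
    private final long maxCachedBytes;

    public CachePolicy(long maxCachedBytes) {
        this.maxCachedBytes = maxCachedBytes;
    }

    // true -> create the file in the RAMDir; false -> in the delegate
    public boolean cacheInRam(long expectedFileBytes) {
        return expectedFileBytes <= maxCachedBytes;
    }

    public static void main(String[] args) {
        CachePolicy p = new CachePolicy(5L << 20);    // cache files up to 5 MB
        System.out.println(p.cacheInRam(64L << 10));  // small flushed file: true
        System.out.println(p.cacheInRam(100L << 20)); // large merged file: false
    }
}
```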



