[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845214#action_12845214 ]

Uwe Schindler commented on SOLR-1677:

I also added support for instantiating Lucene Analyzers directly, which broke with the 3.0 upgrade. The new code now prefers a one-arg Version ctor and falls back to the no-arg one. The only thing that is not working at the moment is the -Aware stuff, as SolrResourceLoader.newInstance() was not usable.

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677-lucenetrunk-branch.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9.

In Lucene 3.0 this matchVersion ctor parameter is mandatory, and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.

This patch adds basic support for the Lucene Version property to the base factories. Subclasses can then use the luceneMatchVersion decoded enum (in 3.0) / parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a Java 5 enum. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene).

This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
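The two mechanisms described in this message (decoding a configured version string with LUCENE_24 as the default, and preferring a one-arg Version constructor before falling back to the no-arg one) can be illustrated roughly as follows. This is only a sketch under assumptions: `MatchVersion`, `VersionedAnalyzer`, `PlainAnalyzer` and `VersionSupport` are hypothetical stand-ins invented here, not the classes from the attached patches or from Lucene.

```java
import java.util.Locale;

/** Hypothetical stand-in for org.apache.lucene.util.Version; not the real class. */
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

/** Hypothetical analyzer that offers a one-arg version constructor. */
class VersionedAnalyzer {
    final MatchVersion matchVersion;
    public VersionedAnalyzer(MatchVersion matchVersion) { this.matchVersion = matchVersion; }
}

/** Hypothetical analyzer that only offers a no-arg constructor. */
class PlainAnalyzer {
    public PlainAnalyzer() { }
}

final class VersionSupport {
    private VersionSupport() { }

    /** Decodes a configured version string; a missing value falls back to LUCENE_24. */
    static MatchVersion parse(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return MatchVersion.LUCENE_24;
        }
        // An enum makes the helper map unnecessary: valueOf does the lookup.
        return MatchVersion.valueOf(configured.trim().toUpperCase(Locale.ROOT));
    }

    /** Prefers a public one-arg (MatchVersion) constructor, else the public no-arg one. */
    static <T> T newInstance(Class<T> clazz, MatchVersion version) {
        try {
            try {
                return clazz.getConstructor(MatchVersion.class).newInstance(version);
            } catch (NoSuchMethodException noVersionCtor) {
                return clazz.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }
}
```

With these stand-ins, `VersionSupport.newInstance(VersionedAnalyzer.class, v)` receives the version, while `PlainAnalyzer` is still constructed through its no-arg constructor.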
[jira] Updated: (SOLR-1799) enable matching of CamelCase with camelcase in WordDelimiterFilter
[ https://issues.apache.org/jira/browse/SOLR-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-1799:

Fix Version/s: 1.5 (was: 1.3)

enable matching of CamelCase with camelcase in WordDelimiterFilter

Key: SOLR-1799
URL: https://issues.apache.org/jira/browse/SOLR-1799
Project: Solr
Issue Type: Improvement
Components: search
Affects Versions: 1.3, 1.4
Reporter: Chris Darroch
Priority: Minor
Fix For: 1.5
Attachments: SOLR-1799.patch

At the bottom of the WordDelimiterFilter.java code there's the following comment:

// downsides: if source text is powershot then a query of PowerShot won't match!

Another serious example for us might be something like an indexed document containing the word Tribeca or Soho, and then a user trying to search for TriBeCa or SoHo.

This issue has turned up in a couple of recent mailing list threads:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%3cfe4f94830908201429j3ffbcdd3s3cb7d80542b31...@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3c72d9e9500905121619p68c27099ibc7079e52cb0e...@mail.gmail.com%3e

In the first thread I found the best explication of what my own misunderstanding was, and it's something I'm sure must trip up other people as well:

{quote}
I've misunderstood WordDelimiterFilter. You might think that catenateAll=1 would append the full phrase (sans delimiters) as an OR against the query. So jOkersWild would produce:

j (okers wild) OR jokerswild

But you thought wrong. Its actually:

j (okers wild jokerswild)

Which is confusing and won't match...
{quote}

In the second thread, Yonik Seeley gives a good explanation of why this occurs, and provides a suggested workaround where you duplicate your data fields and then query on one using generateWordParts=1 and on the other using catenateWords=1. That works, but obviously requires data duplication.

In our case, we are also following what I believe is recommended practice and duplicating our data already into stemmed and unstemmed indexes. To my mind, to further duplicate both of these fields a second time, with no difference in the indexed data of the additional copy, seems needlessly wasteful when the problem lies entirely in the query side of things.

At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky, but seems to work for us.

In WordDelimiterFilter, if generateWordParts=1 and catenateWords=2, then we move the concatenated word to overlap its position with the first generated token instead of the last (which is the behaviour with catenateWords=1). We further insert a preceding dummy flag token with the special type CATENATE_FIRST.

In SolrPluginUtils in the DisjunctionMaxQueryParser class we just copy in the entirety of the getFieldQuery() code from Lucene's QueryParser. This is ugly, I know. This code is then tweaked so that in the case where the dummy flag token is seen, it creates a BooleanQuery with the following token (the concatenated word) as a conditional TermQuery clause, and then adds the generated terms in their usual MultiPhraseQuery as a second conditional clause.

Now I realize this patch is (a) not likely acceptable on style and elegance grounds, and (b) only against Solr 1.3, not trunk. My apologies for both; after I'd spent most of what time I had available tracking down the source of the problem, I just needed to get something working quickly. Perhaps this patch will inspire others to greatness, though, or at a minimum provide a starting point for those who stumble over this same issue.

Thanks for a great application! Cheers.
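To make the position handling discussed in this issue easier to follow, here is a deliberately simplified model in Java. It is not the real WordDelimiterFilter: `CaseSplitSketch`, `Tok` and the boolean flag are hypothetical names, the splitting rule only looks at lower-to-upper case changes, and the dummy CATENATE_FIRST flag token from the patch is left out. It only shows where the catenated word lands relative to the generated parts.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of word-part generation and catenation. */
final class CaseSplitSketch {
    private CaseSplitSketch() { }

    /** A token text together with its 1-based position in the stream. */
    static final class Tok {
        final String text;
        final int position;
        Tok(String text, int position) { this.text = text; this.position = position; }
        @Override public String toString() { return text + "@" + position; }
    }

    /** Splits a word wherever a lower-case letter is followed by an upper-case one. */
    static List<String> splitOnCaseChange(String word) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < word.length(); i++) {
            if (Character.isLowerCase(word.charAt(i - 1)) && Character.isUpperCase(word.charAt(i))) {
                parts.add(word.substring(start, i));
                start = i;
            }
        }
        parts.add(word.substring(start));
        return parts;
    }

    /**
     * Emits the word parts at consecutive positions, plus the catenated word.
     * With catenateAtFirstPosition=false the catenated word overlaps the last part;
     * with true it overlaps the first part, as the description above proposes.
     */
    static List<Tok> analyze(String word, boolean catenateAtFirstPosition) {
        List<String> parts = splitOnCaseChange(word);
        List<Tok> tokens = new ArrayList<>();
        for (int i = 0; i < parts.size(); i++) {
            tokens.add(new Tok(parts.get(i), i + 1));
        }
        if (parts.size() > 1) {
            int position = catenateAtFirstPosition ? 1 : parts.size();
            tokens.add(new Tok(String.join("", parts), position));
        }
        return tokens;
    }
}
```

In this model a source text of "powershot" yields a single token, while a query of "PowerShot" yields Power and Shot at positions 1 and 2 plus the catenated word, which is why the two only match if the query treats the catenated word as an alternative to the part sequence.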
[jira] Updated: (SOLR-1814) select count(distinct fieldname) in SOLR
[ https://issues.apache.org/jira/browse/SOLR-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-1814:

Fix Version/s: (was: 1.4)

select count(distinct fieldname) in SOLR

Key: SOLR-1814
URL: https://issues.apache.org/jira/browse/SOLR-1814
Project: Solr
Issue Type: New Feature
Components: SearchComponents - other
Affects Versions: 1.5
Reporter: Marcus Herou
Fix For: 1.5
Attachments: CountComponent.java

I have seen questions on the mailing list about having the functionality for counting distinct values on a field. We at Tailsweep want to do that as well, for example in our blogsearch.

Example: "You had 1345 hits on 244 blogs"

The 244 part is not possible in SOLR today (correct me if I am wrong). So I've written a component which does this. Attaching it.
[jira] Updated: (SOLR-1814) select count(distinct fieldname) in SOLR
[ https://issues.apache.org/jira/browse/SOLR-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-1814:

Affects Version/s: (was: 2.0) (was: 1.6) (was: 1.4)
Fix Version/s: (was: 2.0) (was: 1.6)

select count(distinct fieldname) in SOLR

Key: SOLR-1814
URL: https://issues.apache.org/jira/browse/SOLR-1814
Project: Solr
Issue Type: New Feature
Components: SearchComponents - other
Affects Versions: 1.5
Reporter: Marcus Herou
Fix For: 1.5
Attachments: CountComponent.java

I have seen questions on the mailing list about having the functionality for counting distinct values on a field. We at Tailsweep want to do that as well, for example in our blogsearch.

Example: "You had 1345 hits on 244 blogs"

The 244 part is not possible in SOLR today (correct me if I am wrong). So I've written a component which does this. Attaching it.
[jira] Updated: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
[ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated SOLR-1677:

Attachment: SOLR-1677-lucenetrunk-branch-3.patch
            SOLR-1677-lucenetrunk-branch-2.patch

Just for documentation: here are the patches with improvements to the version support for the Lucene-trunk upgrade branch.

- More lenient matchVersion support (V.V)
- Default matchVersion for tests
- Remove code duplication and some additional checks for analysis plugins that need version support to enforce the version

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677-lucenetrunk-branch-2.patch, SOLR-1677-lucenetrunk-branch-3.patch, SOLR-1677-lucenetrunk-branch.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility with old indexes created using older versions of Lucene. The most important example is StandardTokenizer, which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in 2.9.

In Lucene 3.0 this matchVersion ctor parameter is mandatory, and in 3.1, with much more Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9, the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.

This patch adds basic support for the Lucene Version property to the base factories. Subclasses can then use the luceneMatchVersion decoded enum (in 3.0) / parameter (in 2.9) for constructing TokenStreams. The code currently contains a helper map to decode the version strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a Java 5 enum. The default value is Version.LUCENE_24 (as this is the default for the no-version ctors in Lucene).

This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845301#action_12845301 ]

Robert Muir commented on SOLR-1804:

I wonder if you guys have any insight into why the results of this test may have changed from 16 to 15 between Lucene 3.0 and Lucene 3.1-dev: http://svn.apache.org/viewvc?view=revision&revision=923048

It did not change between Lucene 2.9 and Lucene 3.0, so I'm concerned about why the results would change between 3.0 and 3.1-dev.

One possible explanation would be if Carrot2 used Version.LUCENE_CURRENT somewhere in its code. Any ideas?

Upgrade Carrot2 to 3.2.0

Key: SOLR-1804
URL: https://issues.apache.org/jira/browse/SOLR-1804
Project: Solr
Issue Type: Improvement
Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

http://project.carrot2.org/release-3.2.0-notes.html
Carrot2 is now LGPL free, which means we should be able to bundle the binary!
[jira] Created: (SOLR-1823) XMLWriter throws ClassCastException on writing maps other than <String,?>
XMLWriter throws ClassCastException on writing maps other than <String,?>

Key: SOLR-1823
URL: https://issues.apache.org/jira/browse/SOLR-1823
Project: Solr
Issue Type: Improvement
Components: documentation, Response Writers
Reporter: Frank Wesemann

http://lucene.apache.org/solr/api/org/apache/solr/response/SolrQueryResponse.html#returnable_data says that a Map containing any of the items in this list may be contained in a SolrQueryResponse and will be handled by QueryResponseWriters. This is not true for (at least) keys in Maps: XMLWriter tries to cast keys to Strings.
[jira] Created: (SOLR-1824) partial field types created on error
partial field types created on error Key: SOLR-1824 URL: https://issues.apache.org/jira/browse/SOLR-1824 Project: Solr Issue Type: Bug Affects Versions: 1.1.0 Reporter: Yonik Seeley Priority: Minor When abortOnConfigurationError=false, and there is a typo in one of the filters in a chain, the field type is still created by omitting that particular filter. This is particularly dangerous since it will result in incorrect indexing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1824) partial field types created on error
[ https://issues.apache.org/jira/browse/SOLR-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845319#action_12845319 ] Yonik Seeley commented on SOLR-1824: The partial field is created regardless of abortOnConfigurationError... it's just more serious when it's false and things may look OK. partial field types created on error Key: SOLR-1824 URL: https://issues.apache.org/jira/browse/SOLR-1824 Project: Solr Issue Type: Bug Affects Versions: 1.1.0 Reporter: Yonik Seeley Priority: Minor When abortOnConfigurationError=false, and there is a typo in one of the filters in a chain, the field type is still created by omitting that particular filter. This is particularly dangerous since it will result in incorrect indexing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1823) XMLWriter throws ClassCastException on writing maps other than <String,?>
[ https://issues.apache.org/jira/browse/SOLR-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Wesemann updated SOLR-1823:

Attachment: SOLR-1823.patch

This patch uses String.valueOf(entry.getKey()) to write an entry's key, so writing a non-String key can no longer fail with a ClassCastException.

XMLWriter throws ClassCastException on writing maps other than <String,?>

Key: SOLR-1823
URL: https://issues.apache.org/jira/browse/SOLR-1823
Project: Solr
Issue Type: Improvement
Components: documentation, Response Writers
Reporter: Frank Wesemann
Attachments: SOLR-1823.patch

http://lucene.apache.org/solr/api/org/apache/solr/response/SolrQueryResponse.html#returnable_data says that a Map containing any of the items in this list may be contained in a SolrQueryResponse and will be handled by QueryResponseWriters. This is not true for (at least) keys in Maps: XMLWriter tries to cast keys to Strings.
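The difference between casting a map key and converting it with String.valueOf can be shown in isolation. The following is a minimal sketch, not the actual XMLWriter code: `MapKeySketch` and its two methods are hypothetical, and the output format is a simplified imitation of a Solr XML response without any escaping.

```java
import java.util.Map;

/** Hypothetical illustration of the two ways to turn a map key into an element name. */
final class MapKeySketch {
    private MapKeySketch() { }

    /** Casts each key to String; throws ClassCastException for any non-String key. */
    static String writeWithCast(Map<?, ?> map) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<?, ?> entry : map.entrySet()) {
            String name = (String) entry.getKey();
            out.append("<str name=\"").append(name).append("\">")
               .append(entry.getValue()).append("</str>");
        }
        return out.toString();
    }

    /** Converts each key with String.valueOf, which accepts any object (and null). */
    static String writeWithValueOf(Map<?, ?> map) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<?, ?> entry : map.entrySet()) {
            String name = String.valueOf(entry.getKey());
            out.append("<str name=\"").append(name).append("\">")
               .append(entry.getValue()).append("</str>");
        }
        return out.toString();
    }
}
```

For a map with an Integer key, `writeWithCast` fails at the cast, while `writeWithValueOf` writes the key's string form.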
Re: XMLWriter
Created SOLR-1823. I attached a patch for this particular problem.

> Any other places we missed this?

None that I could spot; there are so many warnings about unchecked casts or about not using generics.

--
Kind regards,
Frank Wesemann

Fotofinder GmbH         VAT ID: DE812854514
Software Development    Web: http://www.fotofinder.com/
Potsdamer Str. 96       Tel: +49 30 25 79 28 90
10785 Berlin            Fax: +49 30 25 79 28 999
Registered office: Berlin
District Court Berlin Charlottenburg (HRB 73099)
Managing Director: Ali Paczensky
[jira] Commented: (SOLR-1824) partial field types created on error
[ https://issues.apache.org/jira/browse/SOLR-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845419#action_12845419 ]

Uwe Schindler commented on SOLR-1824:

It should be easy to fix. The init() method in the AbstractPluginLoader anonymous class checks for plugin != null. In the null case it should throw an exception to make the whole loadAnalyzer() call invalid, which makes the field type disappear.

partial field types created on error

Key: SOLR-1824
URL: https://issues.apache.org/jira/browse/SOLR-1824
Project: Solr
Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Yonik Seeley
Priority: Minor

When abortOnConfigurationError=false, and there is a typo in one of the filters in a chain, the field type is still created by omitting that particular filter. This is particularly dangerous since it will result in incorrect indexing.
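The behaviour under discussion, silently skipping a filter that could not be loaded versus failing the whole chain, can be sketched independently of Solr. Everything below is hypothetical: `FilterChainSketch`, the filter names and the use of a plain set of known names in place of real plugin loading are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical contrast between lenient and fail-fast loading of a filter chain. */
final class FilterChainSketch {
    private FilterChainSketch() { }

    /** Lenient: unknown filters are dropped, so a typo yields a shorter, wrong chain. */
    static List<String> loadLenient(List<String> requested, Set<String> known) {
        List<String> chain = new ArrayList<>();
        for (String name : requested) {
            if (known.contains(name)) {
                chain.add(name);
            }
        }
        return chain;
    }

    /** Strict: the first unknown filter invalidates the whole chain. */
    static List<String> loadStrict(List<String> requested, Set<String> known) {
        List<String> chain = new ArrayList<>();
        for (String name : requested) {
            if (!known.contains(name)) {
                throw new IllegalArgumentException("Unknown filter factory: " + name);
            }
            chain.add(name);
        }
        return chain;
    }
}
```

With the lenient variant, a misspelled stop filter simply vanishes from the chain and documents are indexed without it; the strict variant surfaces the typo before any field type exists.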
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845441#action_12845441 ]

Stanislaw Osinski commented on SOLR-1804:

Hi Robert,

The Lucene dependency is the only change, right? Or did you also upgrade Carrot2, e.g. from 3.1 to 3.2? If the latter is the case, the number of clusters may have changed, e.g. because we tuned stop words or other algorithm attributes.

S.

Upgrade Carrot2 to 3.2.0

Key: SOLR-1804
URL: https://issues.apache.org/jira/browse/SOLR-1804
Project: Solr
Issue Type: Improvement
Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

http://project.carrot2.org/release-3.2.0-notes.html
Carrot2 is now LGPL free, which means we should be able to bundle the binary!
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845451#action_12845451 ]

Robert Muir commented on SOLR-1804:

Hi Stanislaw:

Correct, I did not upgrade anything else, just Lucene. I'm sorry it's not exactly related to this issue (although if we need to upgrade Carrot2 to be compatible with Lucene 3.x, then that's ok).

My concern is more that we did something in Lucene between 3.0 and now that caused the results to be different... though again this could be explained if somewhere in its code Carrot2 uses some Lucene analysis component, but doesn't hardwire Version to LUCENE_29.

If all else fails I can try to seek out the svn rev # of Lucene that causes this change, by brute force binary search :)

Upgrade Carrot2 to 3.2.0

Key: SOLR-1804
URL: https://issues.apache.org/jira/browse/SOLR-1804
Project: Solr
Issue Type: Improvement
Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

http://project.carrot2.org/release-3.2.0-notes.html
Carrot2 is now LGPL free, which means we should be able to bundle the binary!
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845453#action_12845453 ] Grant Ingersoll commented on SOLR-1804: --- Robert, instead of tracking it down by brute force, you might just dump out the clusters and see if they are still reasonable. If they are, I wouldn't worry too much about it, as it is likely due to the issues Staszek mentioned. Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845455#action_12845455 ]

Robert Muir commented on SOLR-1804:

Grant, I am concerned about a possible BW break in Lucene trunk, that is all. I think it's strange that the 3.0 and 3.1 jars give different results.

Can you tell me if the clusters are reasonable? Here is the output.

{noformat}
junit.framework.AssertionFailedError: number of clusters: [
{labels=[Data Mining Applications], docs=[5, 13, 25, 12, 27],clusters=[]},
{labels=[Databases],docs=[15, 21, 7, 17, 11],clusters=[]},
{labels=[Knowledge Discovery],docs=[6, 18, 15, 17, 10],clusters=[]},
{labels=[Statistical Data Mining],docs=[28, 24, 2, 14],clusters=[]},
{labels=[Data Mining Solutions],docs=[5, 22, 8],clusters=[]},
{labels=[Data Mining Techniques],docs=[12, 2, 14],clusters=[]},
{labels=[Known as Data Mining],docs=[23, 17, 19],clusters=[]},
{labels=[Text Mining],docs=[6, 9, 29],clusters=[]},
{labels=[Dedicated],docs=[10, 11],clusters=[]},
{labels=[Extraction of Hidden Predictive],docs=[3, 11],clusters=[]},
{labels=[Information from Large],docs=[3, 7],clusters=[]},
{labels=[Neural Networks],docs=[12, 1],clusters=[]},
{labels=[Open],docs=[15, 20],clusters=[]},
{labels=[Research],docs=[26, 8],clusters=[]},
{labels=[Other Topics],docs=[16],clusters=[]}
] expected:16 but was:15
{noformat}

Upgrade Carrot2 to 3.2.0

Key: SOLR-1804
URL: https://issues.apache.org/jira/browse/SOLR-1804
Project: Solr
Issue Type: Improvement
Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

http://project.carrot2.org/release-3.2.0-notes.html
Carrot2 is now LGPL free, which means we should be able to bundle the binary!
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845459#action_12845459 ] Stanislaw Osinski commented on SOLR-1804: - I was about to offer advice similar to Grant's, but wanted to wait to confirm the scope of changes. If it was only Lucene dependency update, with the assumption that the update didn't change the documents fed to Carrot2 in tests, the results shouldn't change. Carrot2 uses Lucene interfaces internally, but the tokenizer is not the standard Lucene one; so no Version.LUCENE_* issues as far as I can tell. I haven't got Solr code handy, but maybe the test performs clustering on summaries generated from the original test documents and Lucene 3.x introduces some changes in the way summaries are generated? If the clusters look reasonable, the problem is probably not critical, but still worth investigation to make sure it's not a bug of some kind. S. Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845462#action_12845462 ] Stanislaw Osinski commented on SOLR-1804: - Yeah, the clusters look good. When you're done with upgrading Lucene to 3.x, we could also upgrade Carrot2 to version 3.2.0, which is LGPL-free and could be distributed together with Solr. S. Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845474#action_12845474 ]

Robert Muir commented on SOLR-1804:

Thanks for the confirmation that the clusters are ok.

Well, this is embarrassing: it turns out it is a backwards break, though documented, and the culprit is yours truly. This is the reason it gets different results:

{noformat}
* LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default.
  This means that terms with a position increment gap of zero do not
  affect the norms calculation by default. (Robert Muir)
{noformat}

I'll change the test to expect 15 clusters with Lucene 3.1, thanks :)

Upgrade Carrot2 to 3.2.0

Key: SOLR-1804
URL: https://issues.apache.org/jira/browse/SOLR-1804
Project: Solr
Issue Type: Improvement
Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

http://project.carrot2.org/release-3.2.0-notes.html
Carrot2 is now LGPL free, which means we should be able to bundle the binary!
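The effect of the change quoted from LUCENE-2286 can be worked through with a small calculation. The sketch below assumes the classic length norm of 1/sqrt(numTerms) and treats a token with position increment 0 as an overlap; `LengthNormSketch` is a hypothetical helper written for this illustration, not Lucene's DefaultSimilarity, and it ignores field boosts and norm encoding.

```java
/** Hypothetical model of a length norm with and without discounting overlap tokens. */
final class LengthNormSketch {
    private LengthNormSketch() { }

    /**
     * Computes 1/sqrt(numTerms) for a field described by its tokens' position increments.
     * Tokens with increment 0 sit on top of the previous token (e.g. synonyms or
     * catenated words) and are left out of numTerms when discountOverlaps is true.
     */
    static float lengthNorm(int[] positionIncrements, boolean discountOverlaps) {
        int length = positionIncrements.length;
        int overlaps = 0;
        for (int increment : positionIncrements) {
            if (increment == 0) {
                overlaps++;
            }
        }
        int numTerms = discountOverlaps ? length - overlaps : length;
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}
```

For a field with increments {1, 0, 1, 0}, the norm is 1/sqrt(4) without discounting and 1/sqrt(2) with it, so scores for fields containing overlapping tokens shift once the default flips. That kind of shift in scoring is consistent with a downstream result such as a cluster count changing.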
removal of deprecated HtmlStrip*Tokenizer factories
Hello, Is there any concern with removing the deprecated HtmlStrip*Tokenizer factories? These can be done with CharFilter instead and they have some problems with lucene's trunk. If no one objects, I'd like to remove these in the branch. Otherwise, Uwe tells me there is some way to make them work if need be. Thanks! -- Robert Muir rcm...@gmail.com
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Mon, Mar 15, 2010 at 9:39 PM, Robert Muir rcm...@gmail.com wrote:

> Hello, Is there any concern with removing the deprecated HtmlStrip*Tokenizer factories?

Maybe a communication issue, you need to read the source code or javadocs to know it is deprecated

> These can be done with CharFilter instead and they have some problems with lucene's trunk.

Personally, I don't object, but then one should consider bumping Solr to 2.0 along with the removal of other deprecated APIs/features

And of course adapt the wiki page as well http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Best regards
Paul

> If no one objects, I'd like to remove these in the branch. Otherwise, Uwe tells me there is some way to make them work if need be. Thanks!
> --
> Robert Muir rcm...@gmail.com
Re: removal of deprecated HtmlStrip*Tokenizer factories
On 03/15/2010 05:24 PM, Paul Borgermans wrote:
> On Mon, Mar 15, 2010 at 9:39 PM, Robert Muir rcm...@gmail.com wrote:
>> Hello, Is there any concern with removing the deprecated HtmlStrip*Tokenizer factories?
> Maybe a communication issue, you need to read the source code or javadocs to know it is deprecated

It is certainly deprecated ;)

>> These can be done with CharFilter instead and they have some problems with lucene's trunk.
> Personally, I don't object, but then one should consider bumping Solr to 2.0 along with the removal of other deprecated APIs/features
> And of course adapt the wiki page as well http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

It's looking like the next version of Solr will actually be 3.1. Or whatever the next version of Lucene is. So it's a great time to remove deprecations IMO.

> Best regards
> Paul
>> If no one objects, I'd like to remove these in the branch. Otherwise, Uwe tells me there is some way to make them work if need be. Thanks!
>> --
>> Robert Muir rcm...@gmail.com

--
- Mark
http://www.lucidimagination.com
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Tue, Mar 16, 2010 at 2:09 AM, Robert Muir rcm...@gmail.com wrote:

> Hello, Is there any concern with removing the deprecated HtmlStrip*Tokenizer factories? These can be done with CharFilter instead and they have some problems with lucene's trunk. If no one objects, I'd like to remove these in the branch. Otherwise, Uwe tells me there is some way to make them work if need be.

Is there a way we can fix LUCENE-2098 too?

--
Regards,
Shalin Shekhar Mangar.
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Mon, Mar 15, 2010 at 5:30 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

> Is there a way we can fix LUCENE-2098 too?

I think this is good to fix, yet removing the deprecations is unrelated to this slowdown. The deprecated functionality (HtmlStrip*Tokenizer) is implemented in terms of the slower CharFilter, so it's not any faster, and getting rid of it won't slow anyone down.

That being said, I agree we should still try to improve the performance of this stuff.

--
Robert Muir rcm...@gmail.com
Re: welcome new lucene/solr committers
: Development on branches/solr to get on lucene trunk is progressing at
: a furious (nay... ferocious) pace, pushed by the not new, but new to
: solr committers. Feels great to have everyone on the same team!

I feel like i must have missed out on some sort of discussion -- what was the motivation behind creating a branch for this? (as opposed to just using solr/trunk, since it seemed like there was a clear consensus from all the solr devs (in the merge discussion) that the next major solr release should be in sync with Lucene 3.x)

Also: why such a horrible branch name? ... seems more than a little vague.

-Hoss
Re: removal of deprecated HtmlStrip*Tokenizer factories
: Is there any concern with removing the deprecated HtmlStrip*Tokenizer factories?

I'm not averse to gutting *internal* deprecated classes on just about any release (requiring plugin writers to deal with the deprecation) but if it's possible to keep things working for users with no java knowledge i'd prefer it.

In the case of these factories: can't we eliminate the Html*Tokenizers themselves, but make the *factories* return the necessary *Tokenizer wrapped in an HtmlStripCharFilter?

(if not oh well, i'm just looking for ways to simplify the upgrade path for the common case)

-Hoss
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Mon, Mar 15, 2010 at 7:18 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

> In the case of these factories: can't we eliminate the Html*Tokenizers themselves, but make the *factories* return the neccessary *Tokenizer wrapped in an HtmlStripCharFilter ?

They would not be able to re-use if you did this, because when you call reset(Reader) on them, the Reader would not be wrapped.

--
Robert Muir rcm...@gmail.com
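The reuse problem described in this reply can be modelled without Lucene. In the sketch below, all names (`ReuseSketch`, `SketchTokenizer`, `stripTags`, `create`) are hypothetical, and the tag stripping is a naive stand-in for what HtmlStripCharFilter does; the point is only that wrapping done once at creation time is lost when a caller later hands a raw Reader to reset.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical model of a factory that wraps the Reader only at creation time. */
final class ReuseSketch {
    private ReuseSketch() { }

    /** Reads a Reader to the end and returns its content. */
    static String readAll(Reader in) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Naive tag stripper: drops everything between '<' and '>'. */
    static Reader stripTags(Reader in) {
        StringBuilder text = new StringBuilder();
        boolean inTag = false;
        for (char c : readAll(in).toCharArray()) {
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                text.append(c);
            }
        }
        return new StringReader(text.toString());
    }

    /** Stand-in for a reusable tokenizer: it just exposes the text it would tokenize. */
    static final class SketchTokenizer {
        private Reader input;
        SketchTokenizer(Reader input) { this.input = input; }
        /** Reuse path: the caller supplies a new Reader, which nobody wraps. */
        void reset(Reader input) { this.input = input; }
        String text() { return readAll(input); }
    }

    /** Factory path: the Reader is wrapped once, when the tokenizer is created. */
    static SketchTokenizer create(Reader raw) {
        return new SketchTokenizer(stripTags(raw));
    }
}
```

The first document passed through `create` is stripped, but a second document supplied through `reset` reaches the tokenizer with its markup intact, which is the failure mode the reply points out.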
Re: welcome new lucene/solr committers
On 03/15/2010 07:14 PM, Chris Hostetter wrote:
> : Development on branches/solr to get on lucene trunk is progressing at
> : a furious (nay... ferocious) pace, pushed by the not new, but new to
> : solr committers. Feels great to have everyone on the same team!
>
> I feel like i must have missed out on some sort of discussion -- what was the motivation behind creating a branch for this? (as opposed to just using solr/trunk, since it seemed like there was a clear concensus from all the solr devs (in the merge discussion) that the next major solr release should be in sync with Lucene 3.x)

Because getting Solr on Lucene 3.x is a combination of a bunch of issues and patches - Robert and I were trying to juggle them all and it was majorly annoying. So we made a branch that we could commit crappy stuff to, fast and furious, to get things up to speed and iterate. This branch is basically the culmination of all the patches, plus whatever else we needed.

> Also: why such a horrible branch name? ... seems more then a little vague.

God, don't ask. As Robert and I were looking for a place for a branch, it came up in #Lucene irc chat that we should put it in a certain place. It turns out that certain place caused a ruckus. For one, Uwe popped up and said something like: REVERT!! REVERT!! REVERT!! REVERT!! REVERT!!

So while it made some sense to call it solr in the unspoken place that it was, I was in such a hurry to move it that I just left the name. Now it would require everyone svn switching to change it, so we have just left it for now. Renames and moves are easy in svn though, so I'm sure we could organize something better - we just meant for this to be a very temporary scratch pad to play with what was needed to get up to Lucene trunk.

We haven't meant to do anything official, which is why we haven't dropped onto the dev-list - we were just looking for a branch to hash out these patches. Now it's up to everyone what we do with this branch.

> -Hoss

--
- Mark
http://www.lucidimagination.com
Re: removal of deprecated HtmlStrip*Tokenizer factories
: They would not be able to re-use if you did this, because when you
: call reset(Reader) on them, the Reader would not be wrapped.

Hmmm... I'm not sure i understand how any declared CharFilter/Tokenizer combo will be able to deal with this any better, but i'll take your word for it.

Kill it then, and we'll just have to start making a list in the Upgrading section of CHANGES.txt noting the recommended upgrade path for this (and many, many things to come, i imagine).

-Hoss
Re: removal of deprecated HtmlStrip*Tokenizer factories
On Mon, Mar 15, 2010 at 7:25 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

> Hmmm... I'm not sure i understand how any declared CharFilter/Tokenizer combo will be able to deal with this any better, but i'll take your word for it.

You can see this behavior in SolrAnalyzer's reusableTokenStream method: it re-uses the Tokenizer but wraps the readers with charStream() [overridden by TokenizerChain to wrap the Reader with your CharFilter chain].

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
      // if (true) return tokenStream(fieldName, reader);
      TokenStreamInfo tsi = (TokenStreamInfo)getPreviousTokenStream();
      if (tsi != null) {
        tsi.getTokenizer().reset(charStream(reader)); // <-- right here

> Kill it then, and we'll just have to start making a list in the Upgrading section of CHANGES.txt noting the recommended upgrade path for this (and many, many things to come, i imagine)

Cool, I'll add some additional verbiage to the CHANGES in the branch.

--
Robert Muir
rcm...@gmail.com
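To make the reset(Reader) point concrete, here is a toy model of the reuse problem, offered only as a sketch. None of the classes below are Lucene or Solr types: TagStrippingReader, ToyTokenizer and the two reuse helpers are hypothetical stand-ins, and only the name charStream() is borrowed from the excerpt above. The assumption being illustrated is that a reused tokenizer sees char-filtered input only if whoever calls reset wraps each new Reader first.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Toy model of tokenizer reuse; these are NOT Lucene or Solr classes. */
final class ReuseSketch {

    /** Hypothetical stand-in for a CharFilter: drops everything between '<' and '>'. */
    static final class TagStrippingReader extends FilterReader {
        private boolean inTag = false;

        TagStrippingReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '<') {
                    inTag = true;
                } else if (c == '>') {
                    inTag = false;
                } else if (!inTag) {
                    return c;
                }
            }
            return -1;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = 0;
            while (n < len) {
                int c = read();
                if (c == -1) {
                    break;
                }
                buf[off + n++] = (char) c;
            }
            return (n == 0 && len > 0) ? -1 : n;
        }
    }

    /** Hypothetical reusable tokenizer: reset(Reader) takes the new input exactly as given. */
    static final class ToyTokenizer {
        private Reader input;

        void reset(Reader reader) {
            this.input = reader;
        }

        /** Returns the raw character stream this tokenizer would go on to tokenize. */
        String readAll() {
            try {
                StringBuilder sb = new StringBuilder();
                int c;
                while ((c = input.read()) != -1) {
                    sb.append((char) c);
                }
                return sb.toString();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Plays the role charStream() has in the excerpt: wrap every incoming Reader. */
    static Reader charStream(Reader reader) {
        return new TagStrippingReader(reader);
    }

    /** Reuse done by an owner that knows about the filter chain. */
    static String reuseWithWrapping(ToyTokenizer t, String doc) {
        t.reset(charStream(new StringReader(doc)));
        return t.readAll();
    }

    /** Reuse done by a caller that only knows reset(Reader): the filter is silently lost. */
    static String reuseWithoutWrapping(ToyTokenizer t, String doc) {
        t.reset(new StringReader(doc));
        return t.readAll();
    }

    public static void main(String[] args) {
        ToyTokenizer t = new ToyTokenizer();
        System.out.println(reuseWithWrapping(t, "<b>hello</b>"));    // prints "hello"
        System.out.println(reuseWithoutWrapping(t, "<i>world</i>")); // prints "<i>world</i>"
    }
}
```

In this sketch the wrapping lives with the owner of the chain rather than inside the tokenizer, which is the design choice the thread is arguing for: a tokenizer that strips markup internally cannot re-apply that on reset, while a separate filter can be re-applied every time.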
Re: welcome new lucene/solr committers
On Mon, Mar 15, 2010 at 07:25:00PM -0400, Mark Miller wrote:

> We haven't meant to do anything official, which is why we haven't dropped onto the dev-list - we were just looking for a branch to hash out these patches.

Makes sense to me. This is the kind of thing you'd do on a local checkout with git-svn, but if you don't have expertise in that (: I don't either :) then a throwaway svn branch is an alternative.

Marvin Humphrey
[jira] Commented: (SOLR-1803) ExtractingRequestHandler does not propagate multiple values to a multi-valued field
[ https://issues.apache.org/jira/browse/SOLR-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845656#action_12845656 ]

Lance Norskog commented on SOLR-1803:

Actually the problem is that the effect of combining params and generated values is not defined well. I suggest that the semantics should be, a param is treated exactly like a generated field. Under this theory, these are the test cases:

literal.single_s=abc and no generated single_s data:
    <str name="single_s">abc</str>

literal.single_s=abc and generated data def:
    <str name="single_s">abc def</str>

literal.multi_s=abc and generated data def:
    <arr name="multi_s"><str>abc</str><str>def</str></arr>

Is this a coherent and useful semantics?

ExtractingRequestHandler does not propagate multiple values to a multi-valued field
---
Key: SOLR-1803
URL: https://issues.apache.org/jira/browse/SOLR-1803
Project: Solr
Issue Type: Bug
Components: contrib - Solr Cell (Tika extraction)
Reporter: Lance Norskog
Priority: Minor
Attachments: display-extracting-bug.patch

When multiple values for one field are extracted from a document, only the last value is stored in the document. If one or more values are given as parameters, those values are all stored.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
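The three test cases in the comment above can be written down as a small executable sketch. This is only an illustration of the proposed semantics, not ExtractingRequestHandler code: the class LiteralMergeSketch and the method combine are hypothetical names, and joining single-valued data with a space is an assumption read off the "abc def" case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch of "a param is treated exactly like a generated field"; hypothetical names throughout. */
final class LiteralMergeSketch {

    /**
     * Pools literal params and generated values as if they came from one source.
     * A multi-valued field keeps every value; a single-valued field gets them
     * concatenated with a space (assumption taken from the "abc def" test case).
     */
    static List<String> combine(List<String> literals, List<String> generated, boolean multiValued) {
        List<String> all = new ArrayList<>(literals);
        all.addAll(generated);
        if (multiValued || all.size() <= 1) {
            return all;
        }
        return Collections.singletonList(String.join(" ", all));
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        // literal.single_s=abc, no generated data
        System.out.println(combine(Arrays.asList("abc"), none, false));                 // prints [abc]
        // literal.single_s=abc, generated data def
        System.out.println(combine(Arrays.asList("abc"), Arrays.asList("def"), false)); // prints [abc def]
        // literal.multi_s=abc, generated data def
        System.out.println(combine(Arrays.asList("abc"), Arrays.asList("def"), true));  // prints [abc, def]
    }
}
```

Writing it this way makes one property of the proposal visible: the literal has no special status, so it can neither replace a generated value nor be suppressed by one.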
lucene and solr trunk
Due to a tremendous amount of work by our newly merged committer corps, the get-on-lucene-trunk branch (branches/solr) is ready for prime-time as the new solr trunk! Lucene and Solr need to move to a common trunk for a host of reasons, including single patches that can cover both, shared tags and branches, and shared test code w/o a test jar.

The current Lucene trunk is: .../lucene/java/trunk
The current Solr trunk is: .../lucene/solr/trunk

So, we have a few options on where to put Solr's new trunk:

Lucene moves to Solr's trunk:
  /solr/trunk, /solr/trunk/lucene

Solr moves to Lucene's trunk:
  /java/trunk, /java/trunk/solr

Both projects move to a new trunk:
  /something/trunk/java, /something/trunk/solr

-Yonik
Re: lucene and solr trunk
On 03/15/2010 11:28 PM, Yonik Seeley wrote: So, we have a few options on where to put Solr's new trunk: Solr moves to Lucene's trunk: /java/trunk, /java/trunk/sol +1. With the goal of merged dev, merged tests, this looks the best to me. Simple to do patches that span both, simple to setup Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it. -- - Mark http://www.lucidimagination.com
Re: lucene and solr trunk
On Mon, Mar 15, 2010 at 11:43 PM, Mark Miller markrmil...@gmail.com wrote: Solr moves to Lucene's trunk: /java/trunk, /java/trunk/sol +1. With the goal of merged dev, merged tests, this looks the best to me. Simple to do patches that span both, simple to setup Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it. +1 -- Robert Muir rcm...@gmail.com
[jira] Commented: (SOLR-1803) ExtractingRequestHandler does not propagate multiple values to a multi-valued field
[ https://issues.apache.org/jira/browse/SOLR-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845691#action_12845691 ]

Hoss Man commented on SOLR-1803:

Lance: i agree that the current semantics are either poorly defined, or not very useful, but your suggestion seems like it overlooks what are probably the two most common cases:

* to have literal values that overwrite/replace extracted values
* to have literal values that act as defaults unless extracted values are found

...those seem like they should both be possible for single and multivalued fields
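The two cases listed above could be modelled as a per-field policy. The sketch below is purely illustrative: LiteralPolicySketch, Policy, OVERRIDE, DEFAULT and resolve are hypothetical names invented here, and nothing in it corresponds to an existing ExtractingRequestHandler parameter.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch of the two literal-handling cases; every name here is hypothetical. */
final class LiteralPolicySketch {

    enum Policy {
        /** Literal values overwrite/replace extracted values. */
        OVERRIDE,
        /** Literal values act as defaults unless extracted values are found. */
        DEFAULT
    }

    /** Picks the values to index for one field; works the same for single and multi-valued fields. */
    static List<String> resolve(List<String> literals, List<String> extracted, Policy policy) {
        switch (policy) {
            case OVERRIDE:
                return literals.isEmpty() ? extracted : literals;
            case DEFAULT:
                return extracted.isEmpty() ? literals : extracted;
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        List<String> literal = Arrays.asList("abc");
        List<String> extracted = Arrays.asList("def", "ghi");
        List<String> none = Collections.emptyList();

        System.out.println(resolve(literal, extracted, Policy.OVERRIDE)); // prints [abc]
        System.out.println(resolve(literal, extracted, Policy.DEFAULT));  // prints [def, ghi]
        System.out.println(resolve(literal, none, Policy.DEFAULT));       // prints [abc]
    }
}
```

Because resolve returns a whole list either way, the same rule applies to single and multi-valued fields, which is the property asked for in the comment.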
Re: lucene and solr trunk
: prime-time as the new solr trunk! Lucene and Solr need to move to a
: common trunk for a host of reasons, including single patches that can
: cover both, shared tags and branches, and shared test code w/o a test
: jar.

Without a clearer picture of how people envision development overhead working as we move forward, it's really hard to understand how any of these ideas make sense...

1) how should the automated build process(es) work?
2) how are we going to do branching/tagging for releases? particularly in situations where one product is ready for a release and the other isn't?
3) how are we going to deal with minor bug fix release tagging?
4) should it be possible for people to check out Lucene-Java w/o checking out Solr?

(i suspect a whole lot of people who only care about the core library are going to really adamantly not want to have to check out all of Solr just to work on the core)

: Both projects move to a new trunk:
: /something/trunk/java, /something/trunk/solr

my gut says something like this will make the most sense, assuming /something/trunk == /java/trunk and java actually means core ... ie: this discussion should really be part and parcel with how contribs should be reorged.

-Hoss
[jira] Commented: (SOLR-1803) ExtractingRequestHandler does not propagate multiple values to a multi-valued field
[ https://issues.apache.org/jira/browse/SOLR-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845697#action_12845697 ]

Mark Miller commented on SOLR-1803:

bq. Actually the problem is that the effect of combining params and generated values is not defined well.

Your tests and summary don't appear to try to cover this ... should we update the Title and Description?

bq. I suggest that the semantics should be, a param is treated exactly like a generated field.

Have you tested that this is not the case? When I look at the code, it appears to me that it does what your proposed semantics say - params are treated like generated fields when adding multiple fields or concatenating - I have not tested this, but that's what the code looks like it's doing ...
Re: lucene and solr trunk
On Tue, Mar 16, 2010 at 12:01 AM, Chris Hostetter hossman_luc...@fucit.org wrote: 4) should it be possible for people to check out Lucene-Java w/o checking out Solr? (i suspect a whole lot of people who only care about the core library are going to really adamantly not want to have to check out all of Solr just to work on the core) This wouldn't really be merged development now would it? When I run 'ant test' I want the Solr tests to run, too. If one breaks because of a change, I want to look at the source and know why. -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
: (i suspect a whole lot of people who only care about the core library are : going to really adamantly not want to have to check out all of Solr just : to work on the core) : : This wouldn't really be merged development now would it? : When I run 'ant test' I want the Solr tests to run, too. : If one breaks because of a change, I want to look at the source and know why. And as a committer, you should be concerned about things like this ... that doesn't mean every user of Lucene-Java who wants to build from source or apply their own local patches is going to feel the same way. -Hoss
Re: lucene and solr trunk
On Tue, Mar 16, 2010 at 12:39 AM, Chris Hostetter hossman_luc...@fucit.org wrote: And as a committer, you should be concerned about things like this ... that doesn't mean every user of Lucene-Java who wants to build from source or apply their own local patches is going to feel the same way. Yep, those users probably already hate our backwards tests and the contrib tests too. -- Robert Muir rcm...@gmail.com
Re: lucene and solr trunk
Hi Hoss, : (i suspect a whole lot of people who only care about the core library are : going to really adamantly not want to have to check out all of Solr just : to work on the core) : : This wouldn't really be merged development now would it? : When I run 'ant test' I want the Solr tests to run, too. : If one breaks because of a change, I want to look at the source and know why. And as a committer, you should be concerned about things like this ... that doesn't mean every user of Lucene-Java who wants to build from source or apply their own local patches is going to feel the same way. +1. Personally, I'm one of those users and appreciate the separation in SVN. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: lucene and solr trunk
: Yep, those users probably already hate our backwards tests and the
: contrib tests too.

probably ... which is just another reason why it probably makes sense to move core stuff from Lucene-Java into its own module alongside solr, and other modules that get refactored out of Solr or the existing contribs.

But back to my first point: these types of issues are why some discussions are warranted about what the plan should be for automated builds, releases, point-release branching, etc... before we pick a directory structure.

trunk is nothing more than a convention in SVN, so we could decide that Solr should live under /lucene/yatzee/solr and Lucene-Java should live under /lucene/bigfoot/java, and branches and tags of both should live in /lucene/whatsallthisnow/somestuff, but if that doesn't actually make progress any easier there's not much point.

Likewise, there's not much point in picking between any of the other structures suggested so far unless we have a clear idea how we're going to use them. structure should follow function.

-Hoss