date:20100315

[jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-03-15 Thread Uwe Schindler (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845214#action_12845214
]

Uwe Schindler commented on SOLR-1677:
-

I also added support for instantiating Lucene Analyzers directly, which broke
with the 3.0 upgrade. The new code now prefers a one-arg Version ctor and falls
back to the no-arg one. The only thing that is not working at the moment is the
-Aware stuff, as SolrResourceLoader.newInstance() was not usable.
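
The constructor preference described above can be sketched with plain reflection. This is only an illustration under stated assumptions, not the code from the patch: MatchVersion stands in for o.a.lucene.util.Version, and VersionedAnalyzer, LegacyAnalyzer and newInstance() are hypothetical names.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for org.apache.lucene.util.Version.
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

// Hypothetical analyzer that offers a one-arg version constructor.
class VersionedAnalyzer {
    final MatchVersion version;
    public VersionedAnalyzer(MatchVersion version) { this.version = version; }
}

// Hypothetical analyzer that only offers a no-arg constructor.
class LegacyAnalyzer {
    public LegacyAnalyzer() {}
}

class AnalyzerInstantiationSketch {
    // Prefer a public one-arg (MatchVersion) constructor; fall back to the no-arg one.
    static <T> T newInstance(Class<T> clazz, MatchVersion version) {
        try {
            try {
                Constructor<T> ctor = clazz.getConstructor(MatchVersion.class);
                return ctor.newInstance(version);
            } catch (NoSuchMethodException e) {
                // No version-aware constructor: use the no-arg one instead.
                return clazz.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }

    public static void main(String[] args) {
        VersionedAnalyzer a = newInstance(VersionedAnalyzer.class, MatchVersion.LUCENE_29);
        LegacyAnalyzer b = newInstance(LegacyAnalyzer.class, MatchVersion.LUCENE_29);
        System.out.println(a.version + " / " + b.getClass().getSimpleName());
    }
}
```

Trying the more specific constructor first means that analyzers which honour a version always receive the configured one, while older analyzers keep working unchanged.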

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and
BaseTokenFilterFactory
---

Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677-lucenetrunk-branch.patch, SOLR-1677.patch,
SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch

Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards
compatibility with old indexes created using older versions of Lucene. The
most important example is StandardTokenizer, which changed its behaviour with
posIncr and incorrect host token types in 2.4 and also in 2.9.

In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with
much more Unicode support, almost every Tokenizer/TokenFilter needs this
Version parameter. In 2.9, the deprecated old ctors without Version take
LUCENE_24 as default to mimic the old behaviour, e.g. in StandardTokenizer.

This patch adds basic support for the Lucene Version property to the base
factories. Subclasses can then use the decoded luceneMatchVersion enum (in
3.0) / parameter (in 2.9) for constructing TokenStreams. The code currently
contains a helper map to decode the version strings, but in 3.0 it can be
replaced by Version.valueOf(String), as Version is a Java 5 enum. The default
value is Version.LUCENE_24 (as this is the default for the no-version ctors
in Lucene).

This patch also removes unneeded conversions to CharArraySet from
StopFilterFactory (now done by Lucene since 2.9). The generics are also fixed
to match Lucene 3.0.
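
As a rough sketch of the decoding step described above, the following shows how an enum's valueOf(String) can replace a hand-written helper map, with LUCENE_24 as the default when no value is configured. LuceneVersionSketch and parse() are assumptions made for illustration; they are not the names used in the patch, and the real enum is o.a.lucene.util.Version.

```java
import java.util.Locale;

// Hypothetical stand-in for org.apache.lucene.util.Version.
enum LuceneVersionSketch { LUCENE_24, LUCENE_29, LUCENE_30 }

class VersionParsingSketch {
    // Default mirrors the no-version ctors, which behave like LUCENE_24.
    static final LuceneVersionSketch DEFAULT = LuceneVersionSketch.LUCENE_24;

    // Decodes a configured version string; a missing value yields the default.
    static LuceneVersionSketch parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return DEFAULT;
        }
        try {
            return LuceneVersionSketch.valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown luceneMatchVersion: " + value, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("LUCENE_29"));
        System.out.println(parse(null));
    }
}
```

Rejecting unknown strings loudly, instead of silently falling back to the default, keeps a typo in the configuration from changing analysis behaviour unnoticed.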

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1799) enable matching of CamelCase with camelcase in WordDelimiterFilter

2010-03-15 Thread Shalin Shekhar Mangar (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shalin Shekhar Mangar updated SOLR-1799:

Fix Version/s: (was: 1.3)
1.5

enable matching of CamelCase with camelcase in WordDelimiterFilter
--

Key: SOLR-1799
URL: https://issues.apache.org/jira/browse/SOLR-1799
Project: Solr
Issue Type: Improvement
Components: search
Affects Versions: 1.3, 1.4
Reporter: Chris Darroch
Priority: Minor
Fix For: 1.5

Attachments: SOLR-1799.patch

At the bottom of the WordDelimiterFilter.java code there's the following
comment:
// downsides: if source text is powershot then a query of PowerShot
won't match!
Another serious example for us might be something like an indexed document
containing the word Tribeca or Soho, and then a user trying to search for
TriBeCa or SoHo.
This issue has turned up in a couple of recent mailing list threads:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%3cfe4f94830908201429j3ffbcdd3s3cb7d80542b31...@mail.gmail.com%3e
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3c72d9e9500905121619p68c27099ibc7079e52cb0e...@mail.gmail.com%3e
In the first thread I found the best explication of what my own
misunderstanding was, and it's something I'm sure must trip up other people
as well:
{quote}
I've misunderstood WordDelimiterFilter. You might think that catenateAll=1
would append the full phrase (sans delimiters) as an OR against the query.
So jOkersWild would produce:
j (okers wild) OR jokerswild
But you thought wrong. It's actually:
j (okers wild jokerswild)
Which is confusing and won't match...
{quote}
In the second thread, Yonik Seeley gives a good explanation of why this
occurs, and provides a suggested workaround where you duplicate your data
fields and then query on one using generateWordParts=1 and on the other
using catenateWords=1. That works, but obviously requires data
duplication. In our case, we are also following what I believe is
recommended practice and duplicating our data already into stemmed and
unstemmed indexes. To my mind, to further duplicate both of these fields a
second time, with no difference in the indexed data of the additional copy,
seems needlessly wasteful when the problem lies entirely in the query side of
things.
At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky,
but seems to work for us. In WordDelimiterFilter, if generateWordParts=1
and catenateWords=2, then we move the concatenated word to overlap its
position with the first generated token instead of the last (which is the
behaviour with catenateWords=1). We further insert a preceding dummy flag
token with the special type CATENATE_FIRST.
In SolrPluginUtils in the DisjunctionMaxQueryParser class we just copy in the
entirety of the getFieldQuery() code from Lucene's QueryParser. This is
ugly, I know. This code is then tweaked so that in the case where the dummy
flag token is seen, it creates a BooleanQuery with the following token (the
concatenated word) as a conditional TermQuery clause, and then adds the
generated terms in their usual MultiPhraseQuery as a second conditional
clause.
Now I realize this patch is (a) not likely acceptable on style and elegance
grounds, and (b) only against Solr 1.3, not trunk. My apologies for both;
after I'd spent most of what time I had available tracking down the source of
the problem, I just needed to get something working quickly. Perhaps this
patch will inspire others to greatness, though, or at a minimum provide a
starting point for those who stumble over this same issue.
Thanks for a great application! Cheers.


[jira] Updated: (SOLR-1814) select count(distinct fieldname) in SOLR

2010-03-15 Thread Shalin Shekhar Mangar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-1814:


Fix Version/s: (was: 1.4)

 select count(distinct fieldname) in SOLR
 

 Key: SOLR-1814
 URL: https://issues.apache.org/jira/browse/SOLR-1814
 Project: Solr
  Issue Type: New Feature
  Components: SearchComponents - other
Affects Versions: 1.5
Reporter: Marcus Herou
 Fix For: 1.5

 Attachments: CountComponent.java


 I have seen questions on the mailing list about having the functionality for 
 counting distinct values on a field. We at Tailsweep want that as well, for 
 example in our blog search.
 Example:
 You had 1345 hits on 244 blogs
 The 244 part is not possible in SOLR today (correct me if I am wrong). So 
 I've written a component which does this. Attaching it.


[jira] Updated: (SOLR-1814) select count(distinct fieldname) in SOLR

2010-03-15 Thread Shalin Shekhar Mangar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-1814:


Affects Version/s: (was: 2.0)
   (was: 1.6)
   (was: 1.4)
Fix Version/s: (was: 2.0)
   (was: 1.6)


[jira] Updated: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory

2010-03-15 Thread Uwe Schindler (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Uwe Schindler updated SOLR-1677:

Attachment: SOLR-1677-lucenetrunk-branch-3.patch
SOLR-1677-lucenetrunk-branch-2.patch

Just for documentation:
Here are the patches with improvements to the version support for the Lucene-trunk
upgrade branch.

- More lenient matchVersion support (V.V)
- Default matchVersion for tests
- Remove code duplication and some additional checks for analysis plugins that
need version support to enforce the version

Add support for o.a.lucene.util.Version for BaseTokenizerFactory and
BaseTokenFilterFactory
---

Key: SOLR-1677
URL: https://issues.apache.org/jira/browse/SOLR-1677
Project: Solr
Issue Type: Sub-task
Components: Schema and Analysis
Reporter: Uwe Schindler
Attachments: SOLR-1677-lucenetrunk-branch-2.patch,
SOLR-1677-lucenetrunk-branch-3.patch, SOLR-1677-lucenetrunk-branch.patch,
SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch


[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845301#action_12845301
 ] 

Robert Muir commented on SOLR-1804:
---

I wonder if you guys have any insight into why the result of this test may have 
changed from 16 to 15 between Lucene 3.0 and Lucene 3.1-dev: 
http://svn.apache.org/viewvc?view=revision&revision=923048

It did not change between Lucene 2.9 and Lucene 3.0, so I'm concerned about why 
the results would change between 3.0 and 3.1-dev. 

One possible explanation would be if Carrot2 used Version.LUCENE_CURRENT 
somewhere in its code. Any ideas?

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!


[jira] Created: (SOLR-1823) XMLWriter throws ClassCastException on writing maps other than <String,?>

2010-03-15 Thread Frank Wesemann (JIRA)

XMLWriter throws ClassCastException on writing maps other than <String,?>
-

 Key: SOLR-1823
 URL: https://issues.apache.org/jira/browse/SOLR-1823
 Project: Solr
  Issue Type: Improvement
  Components: documentation, Response Writers
Reporter: Frank Wesemann


http://lucene.apache.org/solr/api/org/apache/solr/response/SolrQueryResponse.html#returnable_data
 says that a Map containing any of the items in this list may be contained in 
a SolrQueryResponse and will be handled by QueryResponseWriters.

This is not true for (at least) Keys in Maps.
XMLWriter tries to cast keys to Strings. 


[jira] Created: (SOLR-1824) partial field types created on error

2010-03-15 Thread Yonik Seeley (JIRA)

partial field types created on error


 Key: SOLR-1824
 URL: https://issues.apache.org/jira/browse/SOLR-1824
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Yonik Seeley
Priority: Minor


When abortOnConfigurationError=false, and there is a typo in one of the filters 
in a chain, the field type is still created by omitting that particular filter. 
 This is particularly dangerous since it will result in incorrect indexing.


[jira] Commented: (SOLR-1824) partial field types created on error

2010-03-15 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845319#action_12845319
 ] 

Yonik Seeley commented on SOLR-1824:


The partial field is created regardless of abortOnConfigurationError... it's 
just more serious when it's false and things may look OK.


[jira] Updated: (SOLR-1823) XMLWriter throws ClassCastException on writing maps other than <String,?>

2010-03-15 Thread Frank Wesemann (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Wesemann updated SOLR-1823:
-

Attachment: SOLR-1823.patch

This patch uses String.valueOf(entry.getKey()) to write an entry's key, so
writing a key can no longer fail with a ClassCastException.
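
To illustrate the difference, here is a small self-contained sketch, not the actual XMLWriter code: writeWithCast() mimics the failing approach of casting each key to String, and writeWithValueOf() mimics the patched approach. Both method names and the output format are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MapKeyWritingSketch {
    // Failing approach: assumes every key is a String.
    static String writeWithCast(Map<?, ?> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<?, ?> entry : map.entrySet()) {
            String name = (String) entry.getKey(); // ClassCastException for non-String keys
            sb.append(name).append('=').append(entry.getValue()).append(';');
        }
        return sb.toString();
    }

    // Patched approach: String.valueOf accepts any key type, including null.
    static String writeWithValueOf(Map<?, ?> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<?, ?> entry : map.entrySet()) {
            String name = String.valueOf(entry.getKey());
            sb.append(name).append('=').append(entry.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Object, Object> map = new LinkedHashMap<>();
        map.put(42, "answer"); // Integer key, as a Map<Integer,?> in a response might contain
        System.out.println(writeWithValueOf(map));
        try {
            writeWithCast(map);
        } catch (ClassCastException e) {
            System.out.println("cast failed for a non-String key");
        }
    }
}
```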




Re: XMLWriter

2010-03-15 Thread Frank Wesemann



Created SOLR-1823.
I attached a patch for this particular problem.


> Any other places we missed this?

None that I could spot, though there are so many warnings about unchecked
casts and code not using generics.

--
Kind regards,

Frank Wesemann
Fotofinder GmbH         VAT ID DE812854514
Software Development    Web: http://www.fotofinder.com/
Potsdamer Str. 96       Tel: +49 30 25 79 28 90
10785 Berlin            Fax: +49 30 25 79 28 999

Registered office: Berlin
District Court Berlin-Charlottenburg (HRB 73099)
Managing Director: Ali Paczensky

[jira] Commented: (SOLR-1824) partial field types created on error

2010-03-15 Thread Uwe Schindler (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845419#action_12845419
 ] 

Uwe Schindler commented on SOLR-1824:
-

It should be easy to fix. The init() method in the AbstractPluginLoader 
anonymous class checks for plugin != null. In the null case it should throw 
an exception to make the whole loadAnalyzer() call fail, which makes the field 
type disappear.
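
The difference between the current and the suggested behaviour can be sketched as follows. This is a simplified model, not Solr's AbstractPluginLoader: filters are represented by their names, and KNOWN, create(), loadLenient() and loadStrict() are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FilterChainLoadingSketch {
    // Hypothetical set of filter names that can be instantiated.
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList("lowercase", "stop", "stem"));

    // Returns the filter (here: just its name), or null if creation fails, e.g. on a typo.
    static String create(String name) {
        return KNOWN.contains(name) ? name : null;
    }

    // Behaviour described in the issue: a failed filter is silently left out of the chain.
    static List<String> loadLenient(List<String> configured) {
        List<String> chain = new ArrayList<>();
        for (String name : configured) {
            String plugin = create(name);
            if (plugin != null) {
                chain.add(plugin);
            }
        }
        return chain;
    }

    // Suggested behaviour: a null plugin aborts loading, so no partial chain exists.
    static List<String> loadStrict(List<String> configured) {
        List<String> chain = new ArrayList<>();
        for (String name : configured) {
            String plugin = create(name);
            if (plugin == null) {
                throw new IllegalStateException(
                    "Cannot create filter '" + name + "'; rejecting the whole field type");
            }
            chain.add(plugin);
        }
        return chain;
    }

    public static void main(String[] args) {
        List<String> configured = Arrays.asList("lowercase", "stpo", "stem"); // "stpo" is a typo
        System.out.println(loadLenient(configured)); // chain silently misses the stop filter
        try {
            loadStrict(configured);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```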


[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845441#action_12845441
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

Hi Robert,

Lucene dependency is the only change, right? Or did you also upgrade Carrot2, 
e.g. from 3.1 to 3.2? If the latter is the case, the number of clusters may have 
changed, e.g. because we tuned stop words or other algorithm attributes.

S.




[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845451#action_12845451
 ] 

Robert Muir commented on SOLR-1804:
---

Hi Stanislaw:

Correct, I did not upgrade anything else, just lucene. 

I'm sorry it's not exactly related to this issue 
(although if we need to upgrade Carrot2 to be compatible with Lucene 3.x, then 
that's OK)

My concern is more that we did something in Lucene between 3.0 
and now that caused the results to be different... though again
this could be explained if somewhere in its code Carrot2 uses some
Lucene analysis component, but doesn't hardwire Version to LUCENE_29.

If all else fails I can try to seek out the svn rev # of Lucene that causes 
this change,
by brute force binary search :)
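
For reference, a binary search over revisions could look like the sketch below. The revision numbers and the isBad predicate are made up; in practice the predicate would mean building that revision and running the failing test. The sketch assumes the behaviour flips exactly once between the known-good and known-bad revisions.

```java
import java.util.function.IntPredicate;

class RevisionBisectSketch {
    // Returns the smallest revision in (good, bad] for which isBad holds,
    // assuming isBad is false at 'good', true at 'bad', and flips only once.
    static int firstBadRevision(int good, int bad, IntPredicate isBad) {
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (isBad.test(mid)) {
                bad = mid;   // the change happened at or before mid
            } else {
                good = mid;  // the change happened after mid
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Hypothetical revisions: 100 is known good, 200 is known bad.
        int culprit = firstBadRevision(100, 200, rev -> rev >= 137);
        System.out.println("first bad revision: " + culprit);
    }
}
```

Each step halves the range, so about log2(bad - good) builds are needed.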


[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Grant Ingersoll (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845453#action_12845453
 ] 

Grant Ingersoll commented on SOLR-1804:
---

Robert, instead of tracking it down by brute force, you might just dump out the 
clusters and see if they are still reasonable.  If they are, I wouldn't worry 
too much about it, as it is likely due to the issues Staszek mentioned.


[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845455#action_12845455
 ] 

Robert Muir commented on SOLR-1804:
---

Grant, I am concerned about a possible backwards-compatibility break in Lucene
trunk, that is all.
I think it's strange that the 3.0 and 3.1 jars give different results.

Can you tell me if the clusters are reasonable? Here is the output.

{noformat}
junit.framework.AssertionFailedError: number of clusters: [
{labels=[Data Mining Applications], docs=[5, 13, 25, 12, 27],clusters=[]}, 
{labels=[Databases],docs=[15, 21, 7, 17, 11],clusters=[]}, 
{labels=[Knowledge Discovery],docs=[6, 18, 15, 17, 10],clusters=[]}, 
{labels=[Statistical Data Mining],docs=[28, 24, 2, 14],clusters=[]}, 
{labels=[Data Mining Solutions],docs=[5, 22, 8],clusters=[]}, 
{labels=[Data Mining Techniques],docs=[12, 2, 14],clusters=[]}, 
{labels=[Known as Data Mining],docs=[23, 17, 19],clusters=[]}, 
{labels=[Text Mining],docs=[6, 9, 29],clusters=[]}, 
{labels=[Dedicated],docs=[10, 11],clusters=[]}, 
{labels=[Extraction of Hidden Predictive],docs=[3, 11],clusters=[]}, 
{labels=[Information from Large],docs=[3, 7],clusters=[]}, 
{labels=[Neural Networks],docs=[12, 1],clusters=[]}, 
{labels=[Open],docs=[15, 20],clusters=[]}, 
{labels=[Research],docs=[26, 8],clusters=[]}, 
{labels=[Other Topics],docs=[16],clusters=[]}
] expected:<16> but was:<15>
{noformat}


[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845459#action_12845459
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

I was about to offer advice similar to Grant's, but wanted to wait to confirm 
the scope of changes.

If it was only Lucene dependency update, with the assumption that the update 
didn't change the documents fed to Carrot2 in tests, the results shouldn't 
change. Carrot2 uses Lucene interfaces internally, but the tokenizer is not the 
standard Lucene one; so no Version.LUCENE_* issues as far as I can tell.

I haven't got Solr code handy, but maybe the test performs clustering on 
summaries generated from the original test documents and Lucene 3.x introduces 
some changes in the way summaries are generated?

If the clusters look reasonable, the problem is probably not critical, but 
still worth investigation to make sure it's not a bug of some kind.

S.



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845462#action_12845462
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

Yeah, the clusters look good. When you're done with upgrading Lucene to 3.x, we 
could also upgrade Carrot2 to version 3.2.0, which is LGPL-free and could be 
distributed together with Solr.

S.


[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845474#action_12845474
 ] 

Robert Muir commented on SOLR-1804:
---

Thanks for the confirmation the clusters are ok.

Well, this is embarrassing, it turns out it is a backwards break, 
though documented, and the culprit is yours truly.

This is the reason it gets different results:
{noformat}
* LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default.
  This means that terms with a position increment gap of zero do not
  affect the norms calculation by default.  (Robert Muir)
{noformat}

I'll change the test to expect 15 clusters with Lucene 3.1, thanks :)
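
The effect of that change on norms can be sketched numerically. The code below is a simplified model of a DefaultSimilarity-style length norm, 1/sqrt(number of terms), that ignores boosts and the lossy byte encoding of norms; the class, the method names and the token counts are invented for illustration.

```java
class DiscountOverlapsSketch {
    // Length norm in the style of DefaultSimilarity: 1 / sqrt(number of terms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // length: all tokens in the field; numOverlap: tokens with position increment 0.
    static float computeNorm(int length, int numOverlap, boolean discountOverlaps) {
        int numTerms = discountOverlaps ? length - numOverlap : length;
        return lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Hypothetical field: 16 tokens, 7 of them stacked on an existing position.
        System.out.println(computeNorm(16, 7, false)); // old default: overlaps are counted
        System.out.println(computeNorm(16, 7, true));  // new default: overlaps are ignored
    }
}
```

With overlaps discounted, fields containing stacked tokens (for example synonyms or catenated word parts) get a higher norm than before, which changes scores and can change which documents are fed to a downstream component such as clustering.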


removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Robert Muir

Hello,

Is there any concern with removing the deprecated HtmlStrip*Tokenizer factories?

These can be done with CharFilter instead and they have some problems
with lucene's trunk.

If no one objects, I'd like to remove these in the branch.
Otherwise, Uwe tells me there is some way to make them work if need be.

Thanks!

-- 
Robert Muir
rcm...@gmail.com

Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Paul Borgermans

On Mon, Mar 15, 2010 at 9:39 PM, Robert Muir <rcm...@gmail.com> wrote:
> Hello,
>
> Is there any concern with removing the deprecated HtmlStrip*Tokenizer
> factories?

Maybe a communication issue: you need to read the source code or
javadocs to know it is deprecated.

> These can be done with CharFilter instead and they have some problems
> with lucene's trunk.

Personally, I don't object, but then one should consider bumping Solr
to 2.0 along with the removal of other deprecated APIs/features.

And of course adapt the wiki page as well:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Best regards
Paul

> If no one objects, I'd like to remove these in the branch.
> Otherwise, Uwe tells me there is some way to make them work if need be.
>
> Thanks!
>
> --
> Robert Muir
> rcm...@gmail.com

Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Mark Miller


On 03/15/2010 05:24 PM, Paul Borgermans wrote:
> On Mon, Mar 15, 2010 at 9:39 PM, Robert Muir <rcm...@gmail.com> wrote:
>> Hello,
>>
>> Is there any concern with removing the deprecated HtmlStrip*Tokenizer
>> factories?
>
> Maybe a communication issue, you need to read the source code or
> javadocs to know it is deprecated

It is certainly deprecated ;)

>> These can be done with CharFilter instead and they have some problems
>> with lucene's trunk.
>
> Personally, I don't object, but then one should consider bumping Solr
> to 2.0 along with the removal of other deprecated API's/features
>
> And of course adapt the wiki page as well
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

It looks like the next version of Solr will actually be 3.1, or whatever
the next version of Lucene is. So it's a great time to remove
deprecations IMO.

> Best regards
> Paul
>
>> If no one objects, I'd like to remove these in the branch.
>> Otherwise, Uwe tells me there is some way to make them work if need be.
>>
>> Thanks!
>>
>> --
>> Robert Muir
>> rcm...@gmail.com

--
- Mark

http://www.lucidimagination.com

Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Shalin Shekhar Mangar

On Tue, Mar 16, 2010 at 2:09 AM, Robert Muir <rcm...@gmail.com> wrote:

> Hello,
>
> Is there any concern with removing the deprecated HtmlStrip*Tokenizer
> factories?
>
> These can be done with CharFilter instead and they have some problems
> with lucene's trunk.
>
> If no one objects, I'd like to remove these in the branch.
> Otherwise, Uwe tells me there is some way to make them work if need be.


Is there a way we can fix LUCENE-2098 too?

-- 
Regards,
Shalin Shekhar Mangar.

Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Robert Muir

On Mon, Mar 15, 2010 at 5:30 PM, Shalin Shekhar Mangar
<shalinman...@gmail.com> wrote:

> Is there a way we can fix LUCENE-2098 too?

I think this is good to fix, yet removing the deprecations is
unrelated to this slowdown.

The deprecated functionality (HtmlStrip*Tokenizer) is implemented in
terms of the slower CharFilter, so it's not any faster; getting rid of
it won't slow anyone down.

That being said I think we should still try to improve the performance
of this stuff, I agree.

-- 
Robert Muir
rcm...@gmail.com

Re: welcome new lucene/solr committers

2010-03-15 Thread Chris Hostetter


: Development on branches/solr to get on lucene trunk is progressing at
: a furious (nay... ferocious) pace, pushed by the not new, but new to
: solr committers.  Feels great to have everyone on the same team!

I feel like i must have missed out on some sort of discussion -- what was 
the motivation behind creating a branch for this? (as opposed to just 
using solr/trunk, since it seemed like there was a clear consensus from 
all the solr devs (in the merge discussion) that the next major solr 
release should be in sync with Lucene 3.x)

Also: why such a horrible branch name?  ... seems more than a little 
vague.



-Hoss

Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Chris Hostetter


: Is there any concern with removing the deprecated HtmlStrip*Tokenizer 
: factories?

I'm not averse to gutting *internal* deprecated classes on just about any 
release (requiring plugin writers to deal with the deprecation) but if 
it's possible to keep things working for users with no java knowledge i'd 
prefer it.

In the case of these factories: can't we eliminate the Html*Tokenizers 
themselves, but make the *factories* return the necessary *Tokenizer 
wrapped in an HtmlStripCharFilter ?

(if not oh well, i'm just looking for ways to simplify the upgrade path 
for the common case)



-Hoss

Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Robert Muir

On Mon, Mar 15, 2010 at 7:18 PM, Chris Hostetter
<hossman_luc...@fucit.org> wrote:

> In the case of these factories: can't we eliminate the Html*Tokenizers
> themselves, but make the *factories* return the necessary *Tokenizer
> wrapped in an HtmlStripCharFilter ?

The Tokenizers could not be reused if you did this, because when you
call reset(Reader) on them, the new Reader would not be wrapped.
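
The reuse problem can be shown with a toy model. None of the classes below are Lucene classes: TagStrippingReader is a heavily simplified stand-in for HtmlStripCharFilter, SketchTokenizer stands in for a reusable Tokenizer, and create() stands in for a factory that wraps the Reader once at creation time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class TokenizerReuseSketch {
    // Simplified stand-in for HtmlStripCharFilter: drops everything between '<' and '>'.
    static class TagStrippingReader extends Reader {
        private final Reader in;
        private boolean inTag;

        TagStrippingReader(Reader in) { this.in = in; }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = 0;
            while (n < len) {
                int c = in.read();
                if (c < 0) break;
                if (c == '<') inTag = true;
                else if (c == '>') inTag = false;
                else if (!inTag) buf[off + n++] = (char) c;
            }
            return (n == 0 && len > 0) ? -1 : n;
        }

        @Override
        public void close() throws IOException { in.close(); }
    }

    // Stand-in for a reusable Tokenizer: it just holds a Reader that reset() replaces.
    static class SketchTokenizer {
        private Reader input;

        SketchTokenizer(Reader input) { this.input = input; }

        void reset(Reader input) { this.input = input; }

        String readAll() {
            try {
                StringBuilder sb = new StringBuilder();
                char[] buf = new char[64];
                for (int n; (n = input.read(buf, 0, buf.length)) != -1; ) {
                    sb.append(buf, 0, n);
                }
                return sb.toString();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // The factory wraps the Reader only once, when the tokenizer is created.
    static SketchTokenizer create(Reader raw) {
        return new SketchTokenizer(new TagStrippingReader(raw));
    }

    public static void main(String[] args) {
        SketchTokenizer t = create(new StringReader("<b>hello</b>"));
        System.out.println(t.readAll());            // wrapped: tags are stripped
        t.reset(new StringReader("<i>world</i>"));  // reuse: the caller passes a raw Reader
        System.out.println(t.readAll());            // not wrapped: tags come through
    }
}
```

In this model the stripping only survives reuse if whoever calls reset(Reader) also applies the wrapper, which is what declaring the char filter separately from the tokenizer allows.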


-- 
Robert Muir
rcm...@gmail.com

Re: welcome new lucene/solr committers

2010-03-15 Thread Mark Miller


On 03/15/2010 07:14 PM, Chris Hostetter wrote:

: Development on branches/solr to get on lucene trunk is progressing at
: a furious (nay... ferocious) pace, pushed by the not new, but new to
: solr committers.  Feels great to have everyone on the same team!

I feel like i must have missed out on some sort of discussion -- what was
the motivation behind creating a branch for this? (as opposed to just
using solr/trunk, since it seemed like there was a clear concensus from
all the solr devs (in the merge discussion) that the next major solr
release should be in sync with Lucene 3.x)
   


Because getting Solr on Lucene 3.x is a combination of a bunch of issues
and patches - robert and I were trying to juggle them all and it was
major annoying. So we made a branch that we could commit crappy stuff
to, fast and furious, to get things up to speed and iterate. This branch
is basically the culmination of all the patches, plus whatever else we
needed.




Also: why such a horrible branch name?  ... seems more than a little
vague.
   


God don't ask. As Robert and I were looking for a place for a branch, it 
came up in #Lucene irc chat that we should put it in a certain place.
It turns out, that certain place caused a ruckus. For one, Uwe popped
up and said something like:


REVERT!!
REVERT!!
REVERT!!
REVERT!!
REVERT!!

So while it made some sense to call it solr in the unspoken place that 
it was, I was in such a hurry to move it I just left the name. Now it 
would require everyone svn switching to change it, so we have just left 
it for now. Renames and moves are easy in svn though, so I'm sure we 
could organize something better - we just meant for this to be a very
temporary scratch pad to play with what was needed to get up to Lucene trunk.


We haven't meant to do anything official is why we haven't dropped onto
the dev-list - we were just looking for a branch to hash out these
patches. Now it's up to everyone what we do with this branch.





-Hoss

   



--
- Mark

http://www.lucidimagination.com

Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Chris Hostetter


: They would not be able to re-use if you did this, because when you
: call reset(Reader) on them, the Reader would not be wrapped.

Hmmm... I'm not sure i understand how any declared CharFilter/Tokenizer
combo will be able to deal with this any better, but i'll take your word
for it.

Kill it then, and we'll just have to start making a list in the
Upgrading section of CHANGES.txt noting the recommended upgrade path
for this (and many, many things to come i imagine)



-Hoss

Re: welcome new lucene/solr committers

2010-03-15 Thread Mark Miller

Sorry - hit a bad keyboard short cut and sent this mid way through 
writing it - please disregard and read the followup.


On 03/15/2010 07:21 PM, Mark Miller wrote:

On 03/15/2010 07:14 PM, Chris Hostetter wrote:

: Development on branches/solr to get on lucene trunk is progressing at
: a furious (nay... ferocious) pace, pushed by the not new, but new to
: solr committers.  Feels great to have everyone on the same team!

I feel like i must have missed out on some sort of discussion -- what 
was

the motivation behind creating a branch for this? (as opposed to just
using solr/trunk, since it seemed like there was a clear consensus from
all the solr devs (in the merge discussion) that the next major solr
release should be in sync with Lucene 3.x)


Because getting Solr on Lucene 3.x is a combination of a bunch of
issues and patches - robert and I were trying to juggle them all and
it was major annoying. So we made a branch that we could commit crappy
stuff to, fast and furious, to get things up to speed and iterate. This
branch is basically the culmination of all the patches, plus whatever
else we needed.




Also: why such a horrible branch name?  ... seems more than a little
vague.


God don't ask. As Robert and I were looking for a place for a branch, 
it came up in #Lucene irc chat that we should put it in a certain place.
It turns out, that certain place caused a ruckus. Uwe popped up and
said something like:


REVERT!!

REVERT!!




-Hoss







--
- Mark

http://www.lucidimagination.com

Re: removal of deprecated HtmlStrip*Tokenizer factories

2010-03-15 Thread Robert Muir

On Mon, Mar 15, 2010 at 7:25 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 Hmmm... I'm not sure i understand how any declared CharFilter/Tokenizer
 combo will be able to deal with this any better, but i'll take your word
 for it.

you can see this behavior in SolrAnalyzer's reusableTokenStream
method, it re-uses the Tokenizer but wraps the readers with
charStream() [overridden by TokenizerChain to wrap the Reader with
your CharFilter chain].

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    // if (true) return tokenStream(fieldName, reader);
    TokenStreamInfo tsi = (TokenStreamInfo)getPreviousTokenStream();
    if (tsi != null) {
      tsi.getTokenizer().reset(charStream(reader)); // <-- right here
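
The reuse problem described above can be reproduced in a toy model. The sketch below is only an illustration under stated assumptions: ToyTokenizer, stripTags, createWrapped and reuse are hypothetical stand-ins, not Lucene or Solr classes, and the tag stripping is a crude regex rather than what HtmlStripCharFilter does. It shows that wrapping the Reader once inside a factory is lost as soon as the tokenizer is reset with a raw Reader, while re-wrapping on every reset (the role charStream() plays in the excerpt) keeps the filtering.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Toy model of the reuse problem; none of these types are Lucene or Solr APIs. */
public class ReuseSketch {

    /** Hypothetical stand-in for a char filter: drops anything that looks like a tag. */
    static Reader stripTags(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) sb.append((char) c);
        return new StringReader(sb.toString().replaceAll("<[^>]*>", " "));
    }

    /** Hypothetical stand-in for a reusable whitespace tokenizer. */
    static class ToyTokenizer {
        private Reader input;
        ToyTokenizer(Reader input) { this.input = input; }

        /** Swaps in a new Reader; it knows nothing about how the old one was wrapped. */
        void reset(Reader input) { this.input = input; }

        List<String> tokens() throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) sb.append((char) c);
            List<String> out = new ArrayList<>();
            for (String t : sb.toString().trim().split("\\s+")) {
                if (!t.isEmpty()) out.add(t);
            }
            return out;
        }
    }

    /** Factory-style wrapping: the filter is applied only when the tokenizer is created. */
    static ToyTokenizer createWrapped(Reader reader) throws IOException {
        return new ToyTokenizer(stripTags(reader));
    }

    /** Analyzer-style reuse: every new Reader passes through the filter before reset. */
    static void reuse(ToyTokenizer tokenizer, Reader reader) throws IOException {
        tokenizer.reset(stripTags(reader));
    }

    public static void main(String[] args) throws IOException {
        ToyTokenizer t = createWrapped(new StringReader("<b>first</b> doc"));
        System.out.println(t.tokens()); // [first, doc]

        t.reset(new StringReader("<i>second</i> doc")); // raw Reader, wrapping is gone
        System.out.println(t.tokens()); // [<i>second</i>, doc]

        reuse(t, new StringReader("<i>third</i> doc")); // re-wrapped on reset
        System.out.println(t.tokens()); // [third, doc]
    }
}
```

The point of the design is that the wrapping has to live with whoever calls reset, not with whoever constructed the tokenizer.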



 Kill it then, and we'll just have to start making a list in the
 Upgrading section of CHANGES.txt noting the recommended upgrade path
 for this (and many, many things to come i imagine)


cool, I'll add some additional verbiage to the CHANGES in the branch.



-- 
Robert Muir
rcm...@gmail.com

Re: welcome new lucene/solr committers

2010-03-15 Thread Marvin Humphrey

On Mon, Mar 15, 2010 at 07:25:00PM -0400, Mark Miller wrote:
 We haven't meant to do anything official is why we haven't dropped onto
 the dev-list - we were just looking for a branch to hash out these
 patches.

Makes sense to me.  This is the kind of thing you'd do on a local checkout
with git-svn, but if you don't have expertise in that (: I don't either :)
then a throwaway svn branch is an alternative.

Marvin Humphrey

[jira] Commented: (SOLR-1803) ExtractingRequestHandler does not propagate multiple values to a multi-valued field

2010-03-15 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845656#action_12845656
 ] 

Lance Norskog commented on SOLR-1803:
-

Actually the problem is that the effect of combining params and generated 
values is not defined well. I suggest that the semantics should be, a param is 
treated exactly like a generated field.

Under this theory, these are the test cases:

literal.single_s=abc and no generated single_s data:
<str name="single_s">abc</str>

literal.single_s=abc and generated data def:
<str name="single_s">abc def</str>

literal.multi_s=abc and generated data def:
<arr name="multi_s">
  <str>abc</str>
  <str>def</str>
</arr>

Is this a coherent and useful semantics? 
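
The three test cases above can be written down as a small sketch. This is only an illustration of the proposed semantics, not code from ExtractingRequestHandler: the class LiteralMergeSketch and the method merge are hypothetical, and joining single-valued data with a space is an assumption read off the "abc def" case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only; not taken from ExtractingRequestHandler. */
public class LiteralMergeSketch {

    /**
     * Treats literal.* params exactly like generated values: everything is
     * collected in order, then a single-valued field joins the values with a
     * space (assumption) while a multi-valued field keeps them separate.
     */
    static List<String> merge(List<String> literals, List<String> generated, boolean multiValued) {
        List<String> all = new ArrayList<>(literals);
        all.addAll(generated);
        if (multiValued || all.size() <= 1) {
            return all;
        }
        return Collections.singletonList(String.join(" ", all));
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        System.out.println(merge(Arrays.asList("abc"), none, false));                // [abc]
        System.out.println(merge(Arrays.asList("abc"), Arrays.asList("def"), false)); // [abc def]
        System.out.println(merge(Arrays.asList("abc"), Arrays.asList("def"), true));  // [abc, def]
    }
}
```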

 ExtractingRequestHandler does not propagate multiple values to a multi-valued 
 field
 ---

 Key: SOLR-1803
 URL: https://issues.apache.org/jira/browse/SOLR-1803
 Project: Solr
  Issue Type: Bug
  Components: contrib - Solr Cell (Tika extraction)
Reporter: Lance Norskog
Priority: Minor
 Attachments: display-extracting-bug.patch


 When multiple values for one field are extracted from a document, only the 
 last value is stored in the document. If one or more values are given as 
 parameters, those values are all stored.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

lucene and solr trunk

2010-03-15 Thread Yonik Seeley

Due to a tremendous amount of work by our newly merged committer
corps, the get-on-lucene-trunk branch (branches/solr) is ready for
prime-time as the new solr trunk!  Lucene and Solr need to move to a
common trunk for a host of reasons, including single patches that can
cover both, shared tags and branches, and shared test code w/o a test
jar.

The current Lucene trunk is: .../lucene/java/trunk
The current Solr trunk is: .../lucene/solr/trunk

So, we have a few options on where to put Solr's new trunk:

Lucene moves to Solr's trunk:
  /solr/trunk, /solr/trunk/lucene

Solr moves to Lucene's trunk:
  /java/trunk, /java/trunk/solr

Both projects move to a new trunk:
  /something/trunk/java, /something/trunk/solr

-Yonik

Re: lucene and solr trunk

2010-03-15 Thread Mark Miller


On 03/15/2010 11:28 PM, Yonik Seeley wrote:

So, we have a few options on where to put Solr's new trunk:


Solr moves to Lucene's trunk:
   /java/trunk, /java/trunk/sol
+1. With the goal of merged dev, merged tests, this looks the best to
me. Simple to do patches that span both, simple to setup
Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it.

--
- Mark

http://www.lucidimagination.com

Re: lucene and solr trunk

2010-03-15 Thread Robert Muir

On Mon, Mar 15, 2010 at 11:43 PM, Mark Miller markrmil...@gmail.com wrote:

 Solr moves to Lucene's trunk:
   /java/trunk, /java/trunk/sol

 +1. With the goal of merged dev, merged tests, this looks the best to me.
 Simple to do patches that span both, simple to setup
 Solr to use Lucene trunk rather than jars. Short paths. Simple. I like it.


+1


-- 
Robert Muir
rcm...@gmail.com

[jira] Commented: (SOLR-1803) ExtractingRequestHandler does not propagate multiple values to a multi-valued field

2010-03-15 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845691#action_12845691
 ] 

Hoss Man commented on SOLR-1803:


Lance: i agree that the current semantics are either poorly defined, or not
very useful, but your suggestion seems like it overlooks what are probably the
two most common cases:
 * to have literal values that overwrite/replace extracted values
 * to have literal values that act as defaults unless extracted values are
found
...those seem like they should both be possible for single and multivalued
fields
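
The two cases listed above could be sketched as alternative resolution modes. Everything here is hypothetical: the enum Mode, its constant names, and the resolve method are invented for illustration, and nothing in this thread says ExtractingRequestHandler exposes such a switch. The sketch also assumes that in the overwrite case the extracted values are kept when no literal was supplied.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only; the modes and names are assumptions, not an existing Solr option. */
public class LiteralModeSketch {

    enum Mode {
        OVERRIDE, // literal values replace extracted values
        DEFAULT   // literal values are used only when nothing was extracted
    }

    static List<String> resolve(Mode mode, List<String> literals, List<String> extracted) {
        switch (mode) {
            case OVERRIDE:
                // Assumption: with no literal supplied, extracted values are kept.
                return literals.isEmpty() ? extracted : literals;
            case DEFAULT:
                return extracted.isEmpty() ? literals : extracted;
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        List<String> lit = Arrays.asList("abc");
        List<String> ext = Arrays.asList("def", "ghi");
        List<String> none = Collections.emptyList();
        System.out.println(resolve(Mode.OVERRIDE, lit, ext)); // [abc]
        System.out.println(resolve(Mode.DEFAULT, lit, ext));  // [def, ghi]
        System.out.println(resolve(Mode.DEFAULT, lit, none)); // [abc]
    }
}
```

Because both modes return a list, the same logic applies to single and multivalued fields; how a single-valued field collapses several values is a separate question.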


Re: lucene and solr trunk

2010-03-15 Thread Chris Hostetter

: prime-time as the new solr trunk!  Lucene and Solr need to move to a
: common trunk for a host of reasons, including single patches that can
: cover both, shared tags and branches, and shared test code w/o a test
: jar.

Without a clearer picture of how people envision development overhead 
working as we move forward, it's really hard to understand how any of 
these ideas make sense...
  1) how should the automated build process(es) work?
  2) how are we going to do branching/tagging for releases?  particularly
in situations where one product is ready for a release and the other isn't?
  3) how are we going to deal with minor bug fix release tagging?
  4) should it be possible for people to check out Lucene-Java w/o 
checking out Solr?

(i suspect a whole lot of people who only care about the core library are 
going to really adamantly not want to have to check out all of Solr just 
to work on the core)

: Both projects move to a new trunk:
:   /something/trunk/java, /something/trunk/solr

my gut says something like this will make the most sense, assuming
/something/trunk == /java/trunk and java actually means core ... 
ie: this discussion should really be part and parcel with how contribs 
should be reorged.



-Hoss

[jira] Commented: (SOLR-1803) ExtractingRequestHandler does not propagate multiple values to a multi-valued field

2010-03-15 Thread Mark Miller (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845697#action_12845697
 ] 

Mark Miller commented on SOLR-1803:
---

bq. Actually the problem is that the effect of combining params and generated 
values is not defined well.

Your tests and summary don't appear to try and cover this ... should we update 
the Title and Description?

bq.  I suggest that the semantics should be, a param is treated exactly like a 
generated field.

Have you tested that this is not the case? When I look at the code, it appears 
to me that it does what your proposed semantics say -
params are treated like generated fields when adding multiple fields or 
concatenating - I have not tested this, but that's what the
code looks like it's doing ...


Re: lucene and solr trunk

2010-03-15 Thread Robert Muir

On Tue, Mar 16, 2010 at 12:01 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:
  4) should it be possible for people to check out Lucene-Java w/o
 checking out Solr?

 (i suspect a whole lot of people who only care about the core library are
 going to really adamantly not want to have to check out all of Solr just
 to work on the core)

This wouldn't really be merged development now would it?
When I run 'ant test' I want the Solr tests to run, too.
If one breaks because of a change, I want to look at the source and know why.

-- 
Robert Muir
rcm...@gmail.com

Re: lucene and solr trunk

2010-03-15 Thread Chris Hostetter


:  (i suspect a whole lot of people who only care about the core library are
:  going to really adamantly not want to have to check out all of Solr just
:  to work on the core)
: 
: This wouldn't really be merged development now would it?
: When I run 'ant test' I want the Solr tests to run, too.
: If one breaks because of a change, I want to look at the source and know why.

And as a committer, you should be concerned about things like this ... 
that doesn't mean every user of Lucene-Java who wants to build from source 
or apply their own local patches is going to feel the same way.


-Hoss

Re: lucene and solr trunk

2010-03-15 Thread Robert Muir

On Tue, Mar 16, 2010 at 12:39 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 And as a committer, you should be concerned about things like this ...
 that doesn't mean every user of Lucene-Java who wants to build from source
 or apply their own local patches is going to feel the same way.


Yep, those users probably already hate our backwards tests and the
contrib tests too.


-- 
Robert Muir
rcm...@gmail.com

Re: lucene and solr trunk

2010-03-15 Thread Mattmann, Chris A (388J)

Hi Hoss,

 :  (i suspect a whole lot of people who only care about the core library are
 :  going to really adamantly not want to have to check out all of Solr just
 :  to work on the core)
 :
 : This wouldn't really be merged development now would it?
 : When I run 'ant test' I want the Solr tests to run, too.
 : If one breaks because of a change, I want to look at the source and know
 why.
 
 And as a committer, you should be concerned about things like this ...
 that doesn't mean every user of Lucene-Java who wants to build from source
 or apply their own local patches is going to feel the same way.

+1. Personally, I'm one of those users and appreciate the separation in SVN.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

Re: lucene and solr trunk

2010-03-15 Thread Chris Hostetter


: Yep, those users probably already hate our backwards tests and the
: contrib tests too.

probably ... which is just another reason why it probably makes sense
to move core stuff from Lucene-Java into its own module alongside
solr, and other modules that get refactored out of Solr or the
existing contribs.

But back to my first point: these types of issues are why some discussions
are warranted about what the plan should be for automated builds,
releases, point-release branching, etc... before we pick a directory
structure.

trunk is nothing more than a convention in SVN, so we could decide that
Solr should live under /lucene/yatzee/solr and Lucene-Java should live
under /lucene/bigfoot/java, and branches and tags of both should live in
/lucene/whatsallthisnow/somestuff, but if that doesn't actually make
progress any easier there's not much point.  -- Likewise, there's not much
point in picking between any of the other structures suggested so far
unless we have a clear idea how we're going to use them.

structure should follow function.


-Hoss
