On Sun, May 15, 2011 at 7:44 PM, Mark Miller <markrmil...@gmail.com> wrote:

>> Could you please revert your commit, until we've reached some
>> consensus on this discussion first?
>
> Let's reach some consensus, but why revert? This has been the behavior - 
> shouldn't the consensus onus be on changing it to begin with? That's how I 
> see it.

To be clear, I'm asking that Yonik revert his commit from yesterday
(rev 1103444), where he added "text_nwd" fieldType and dynamic fields
*_nwd to the example schema.xml.

I agree we should reach consensus before changing what's already
committed; that's exactly why I'm asking Yonik to revert.  We were in
the middle of discussing this, and I had posted a patch on SOLR-2519,
when he suddenly committed the text_nwd change yesterday.

Does anyone disagree that Yonik's commit was inappropriate?  This is
not how we work at Apache.

> I'm going to need to get back up to speed on this issue before I can comment 
> more helpfully. Better out of the box support for other languages is 
> important - I think it makes sense to discuss this issue again myself.

+1

Solr, out of the box, is just awful for non-whitespace languages (e.g.
CJK, and others).  And for every user who comes to the list asking for
help (thank you cyang2010!), I imagine there are many others who
simply gave up and walked away from Solr when they tried it on CJK
content.
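
To make the problem concrete, here is a minimal sketch (plain Python,
not Solr or Lucene code) of why splitting on whitespace fails for CJK
text.  The function names and sample strings are hypothetical, chosen
only for illustration, and overlapping character bigrams are shown as
one common fallback, not as what any particular analyzer does.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace only."""
    return text.split()


def char_bigrams(text):
    """Overlapping two-character tokens, with whitespace removed first."""
    chars = "".join(text.split())
    if len(chars) < 2:
        return [chars] if chars else []
    return [chars[i:i + 2] for i in range(len(chars) - 1)]


# Hypothetical sample inputs (assumptions for illustration only).
english = "solr search server"
chinese = "我爱北京天安门"

print(whitespace_tokens(english))  # three separate words
print(whitespace_tokens(chinese))  # the whole sentence as one token
print(char_bigrams(chinese))       # smaller, searchable units
```

With whitespace splitting, the Chinese sentence becomes a single token,
so a search for a word inside it (such as 北京) cannot match it as a
term; the bigram version does produce that term.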

Lucene has made awesome strides in having natural defaults that work
well across many languages, thanks to the hard work of Robert and
others: StandardAnalyzer now actually follows a standard (UAX #29,
text segmentation), automatic phrase queries are off in the query
parser, etc.  I think we should take advantage of this in Solr, just
like ElasticSearch does.

Really, the best solution (I think) would be to have language-specific
fieldTypes (text_en, text_zh, etc.), but I suspect there's a good
amount of work to reach that, so in the meantime I think we should fix
the defaults for the "text" fieldType to work well across many
languages.

Mike

http://blog.mikemccandless.com
