[jira] [Commented] (LUCENE-7202) Come up with a comprehensive proposal for naming spatial modules and technologies
[ https://issues.apache.org/jira/browse/LUCENE-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239406#comment-15239406 ] Jack Krupansky commented on LUCENE-7202:

Morton seems like more of a codec-level issue than an API - you still have k dimensions of coordinates, but they are simply encoded into a single number for each k-dimensional point. Maybe the implementation name finds its way into the API, but the first issue should be what is logically being modeled - what kind of points: lat-lon, geospatial, or what. Presumably any kind of k-dimensional space can be Morton-encoded.

XYZ? That's fine for math-style axes, for things like 3-D CAD models and 3-D printing, but seems inappropriate for a coordinate system intended to model points on the surface of a sphere, like the locations of places around the globe. To me, "Geo" seems to be an accepted reference to modeling "geographical" locations on the globe/planet.

How you model things like the location of a satellite or the space station is another matter. Geosynchronous satellites simply have an elevation/altitude above a surface point. Non-geosynchronous satellites have an orbit rather than a location per se, although we can speak of their location (surface plus elevation/altitude) at any given/specified moment in time. Ditto for aircraft, which have a flight path and only a momentary location at some altitude (although a helicopter can maintain a location for a longer moment.)

Besides geospatial surface points and 3-D CAD-style modeling, which real-world use cases are these modules intended to cover? IOW, how should real-world users relate to them and choose from them?
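As a concrete illustration of the encoding being discussed (a plain-Java sketch, not Lucene code), a 2-D Morton/Z-order code interleaves the bits of the two coordinates into one number; the same trick generalizes to k dimensions:

```java
// Interleave the bits of x and y (bit i of x -> bit 2i, bit i of y ->
// bit 2i+1) to produce a single Z-order (Morton) code.
public class Morton2D {
    // spread the low 32 bits of v so they occupy the even bit positions
    private static long spread(long v) {
        v &= 0xFFFFFFFFL;
        v = (v | (v << 16)) & 0x0000FFFF0000FFFFL;
        v = (v | (v << 8))  & 0x00FF00FF00FF00FFL;
        v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0FL;
        v = (v | (v << 2))  & 0x3333333333333333L;
        v = (v | (v << 1))  & 0x5555555555555555L;
        return v;
    }

    public static long encode(int x, int y) {
        return spread(x) | (spread(y) << 1);
    }
}
```

Whatever naming is chosen, this makes the point above concrete: the k coordinates survive intact, they are just packed into one sortable number, so the encoding is an index-format detail rather than something the API's naming needs to expose.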
> Come up with a comprehensive proposal for naming spatial modules and > technologies > - > > Key: LUCENE-7202 > URL: https://issues.apache.org/jira/browse/LUCENE-7202 > Project: Lucene - Core > Issue Type: Task > Components: modules/sandbox, modules/spatial, modules/spatial3d >Affects Versions: master >Reporter: Karl Wright > > There are three different spatial implementations circulating at the moment, > and nobody seems happy with the naming of them. For each implementation > strategy, we need both a module name and a descriptive technology name that > we can use to distinguish one from the other. I would expect the following > people to have an interest in this process: [~rcmuir], [~dsmiley], > [~mikemccand], etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8396) Investigate PointField to replace NumericField types
[ https://issues.apache.org/jira/browse/SOLR-8396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212559#comment-15212559 ] Jack Krupansky commented on SOLR-8396: -- My apologies as I am still only very slowly coming up to speed on this "New Math For Lucene" stuff. It feels like there are three distinct issues in play:
1. Desire to use the latest and greatest Lucene numeric field types. Granted, they are now called IntPoint, FloatPoint, DoublePoint, etc., but functionally they are still simply int, float, and double values - no semantic difference, just the class names and then some method name changes for indexing (?) and query. My feeling is that we should preserve the legacy type names even if Lucene insists on calling them "points." Keep user schema files unchanged.
2. Desire to work with existing data - and existing schema files. Mixed metaphors: cans of worms and nested Russian dolls.
3. Desire to auto-upgrade existing Solr index data to new "points" for better performance, reduced storage, reduced memory, reduced heap.
Some points:
1. Personally, I think it would be worth the effort to see if the Lucene guys can stick to the old names for IntField, et al., even if the implementation is different under the hood.
2. Maybe there will be a need to be able to open an existing numeric field, discover that it is a legacy numeric field (trie), and then under the hood use some wrapper to maintain the new API for the old format. IOW, switch Solr to using the new API, even for legacy numeric fields.
3. Seems like there is some need to investigate the possibility of a NumericFieldUpgrader to rewrite a trie field as a point field. Seems like a necessary job for the Lucene guys for existing Lucene indexes, even if Solr wasn't in the picture.
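For readers also coming up to speed: the underlying format change is that point fields index numerics as fixed-width sortable bytes rather than trie-encoded precision-step terms. A rough plain-Java sketch of the idea (not Lucene's actual code):

```java
// Sketch: point fields index an int as fixed-width, big-endian,
// sign-flipped bytes so that unsigned lexicographic byte order matches
// numeric order - which is what the BKD tree relies on for range queries.
public class SortableBytes {
    public static byte[] encodeInt(int v) {
        int s = v ^ 0x80000000; // flip the sign bit
        return new byte[] {
            (byte) (s >>> 24), (byte) (s >>> 16),
            (byte) (s >>> 8),  (byte) s
        };
    }

    // unsigned lexicographic comparison of two encoded values
    public static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < 4; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return 0;
    }
}
```

That difference in on-disk representation is why a wrapper or upgrader would be needed: a trie field and a point field holding the same int values are not byte-compatible.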
> Investigate PointField to replace NumericField types > > > Key: SOLR-8396 > URL: https://issues.apache.org/jira/browse/SOLR-8396 > Project: Solr > Issue Type: Improvement >Reporter: Ishan Chattopadhyaya > Attachments: SOLR-8396.patch, SOLR-8396.patch > > > In LUCENE-6917, [~mikemccand] mentioned that DimensionalValues are better > than NumericFields in most respects. We should explore the benefits of using > it in Solr and hence, if appropriate, switch over to using them.
[jira] [Commented] (SOLR-8176) Model distributed graph traversals with Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208763#comment-15208763 ] Jack Krupansky commented on SOLR-8176: -- To what extent can the graph traversal be parallelized for the data on a single node? The eternal question with Solr is how much data you can put on a node before you need to shard, or how big each shard can be. I'm curious how graph traversal affects that calculation. Also, how merge policy and segment size should be configured so that segments can be traversed in parallel. If there were some more ideal way to organize the nodes in segments, maybe people could pack a lot more data on fat nodes to reduce the inter-node delays. Alternatively, maybe having more nodes means more of the operations can be done in parallel without conflicting on local machine resources. Interesting tradeoffs. > Model distributed graph traversals with Streaming Expressions > - > > Key: SOLR-8176 > URL: https://issues.apache.org/jira/browse/SOLR-8176 > Project: Solr > Issue Type: New Feature > Components: clients - java, SolrCloud, SolrJ >Affects Versions: master >Reporter: Joel Bernstein > Labels: Graph > Fix For: master > > > I think it would be useful to model a few *distributed graph traversal* use > cases with Solr's *Streaming Expression* language. This ticket will explore > different approaches with a goal of implementing two or three common graph > traversal use cases.
[jira] [Commented] (SOLR-8844) Date math silently ignored for date strings
[ https://issues.apache.org/jira/browse/SOLR-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193332#comment-15193332 ] Jack Krupansky commented on SOLR-8844: -- 1. No field is specified for the fq parameter here. What is df? 2. Do any matching date/time values occur as literal strings in the default search field? > Date math silently ignored for date strings > --- > > Key: SOLR-8844 > URL: https://issues.apache.org/jira/browse/SOLR-8844 > Project: Solr > Issue Type: Bug >Affects Versions: 5.5 >Reporter: Markus Jelsma >Priority: Minor > Fix For: 6.1 > > > Consider the following query, ordered by date ascending: {code} > http://localhost:8983/solr/logs/select?q=*:*&fq=[2011-05-26T08:15:36Z%2B3DAY%20TO%20NOW/DAY]&sort=time%20asc > {code} > Should not have a result set where the first entry has > 2011-05-26T08:15:36Z for the time field. > It appears date math is just ignored, while i would expect it to work or > throw an error.
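The likely fix on the user side is to attach the range to the date field explicitly. A sketch (the `time` field and `logs` collection names come from the issue; the helper itself is hypothetical) of building the field-qualified filter query with proper URL encoding:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DateMathQuery {
    // Attach the field name so the range (and its date math) applies to
    // the "time" field rather than being parsed against the default field.
    public static String buildFq() {
        return "time:[2011-05-26T08:15:36Z+3DAY TO NOW/DAY]";
    }

    public static String buildUrl() {
        // URL-encode so '+' in the date math survives as %2B instead of
        // being decoded as a space by the servlet container.
        String fq = URLEncoder.encode(buildFq(), StandardCharsets.UTF_8);
        return "http://localhost:8983/solr/logs/select?q=*:*&fq=" + fq
                + "&sort=time%20asc";
    }
}
```

Without the `time:` prefix, the bracketed expression is evaluated against df, which would explain date math appearing to be silently ignored.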
[jira] [Commented] (SOLR-8831) allow _version_ field to be retrievable via docValues
[ https://issues.apache.org/jira/browse/SOLR-8831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191504#comment-15191504 ] Jack Krupansky commented on SOLR-8831: -- Now that docValues is supported for _version_, the question arises as to which is preferred (faster, less memory), stored or docValues. IOW, which should be the default. I presume it should be docValues, but I have no real clue. Also, the doc for Atomic Update has this example as a Power Tip, that has BOTH stored and docValues set: {code} {code} See: https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents Should that be changed to stored="false"? Or, is there actually some additional hidden benefit to stored="true" AND docValues="true"? > allow _version_ field to be retrievable via docValues > - > > Key: SOLR-8831 > URL: https://issues.apache.org/jira/browse/SOLR-8831 > Project: Solr > Issue Type: Improvement >Reporter: Yonik Seeley > Fix For: 6.0 > > Attachments: SOLR-8831.patch > > > Right now, one is prohibited from having an unstored _version_ field, even if > docValues are enabled.
[jira] [Comment Edited] (SOLR-8831) allow _version_ field to be unstored
[ https://issues.apache.org/jira/browse/SOLR-8831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191075#comment-15191075 ] Jack Krupansky edited comment on SOLR-8831 at 3/11/16 3:49 PM:

Can we come up with a nice clean term for "stored or docValues are enabled"? I mean, the issue title here is misleading, as the description then indicates - "if docValues are enabled." So, it should be "allow _version_ field to be unstored if docValues are enabled." Traditional database nomenclature is no help here since the concept of non-stored data is meaningless in a true database.

Personally, I'd be happier if Solr hid a lot of the byzantine complexity of Lucene, including this odd distinction between stored and docValues. I mean, to me they are just two different implementations of the logical concept of storing data for later retrieval - how the data is stored rather than whether it is stored.

I'll offer two suggested simple terms to be used at the Solr level even if Lucene insists on remaining byzantine: "xstored" or "retrievable", both meaning that the field attributes make it possible for Solr to retrieve data after indexing, either because the field is stored or has docValues enabled. This is not a proposal for a feature, but simply terminology to be used to talk about fields which are... "either stored or have docValues enabled." (If I wanted a feature, it might be to have a new attribute like retrieval_storage="{by_field|by_document|none}" or... stored="{yes|no|docValues|fieldValues}".) I'm not proposing any feature here since that would be out of the scope of the issue, but since this issue needs doc, I am just proposing new terminology for that doc.

Again, to summarize more briefly, I am proposing that the terminology of "retrievable" be used to refer to fields that are either stored or have docValues enabled.

> allow _version_ field to be unstored > > > Key: SOLR-8831 > URL: https://issues.apache.org/jira/browse/SOLR-8831 > Project: Solr > Issue Type: Improvement >Reporter: Yonik Seeley > Attachments: SOLR-8831.patch > > > Right now, one is prohibited from having an unstored _version_ field, even if > docValues are enabled.
[jira] [Commented] (SOLR-8812) ExtendedDismaxQParser (edismax) ignores Boolean OR when q.op=AND
[ https://issues.apache.org/jira/browse/SOLR-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188158#comment-15188158 ] Jack Krupansky commented on SOLR-8812: -- The difference in the generated query appears to be the "))~2", which indicates a BooleanQuery with a minShouldMatch of 2, which means that both OR/SHOULD terms MUST match, effectively turning SHOULD/OR into MUST/AND. I'm guessing it was this 5.5 change: SOLR-2649: {code} * SOLR-2649: MM ignored in edismax queries with operators. (Greg Pendlebury, Jan Høydahl et. al. via Erick Erickson) {code} I think q.op=AND simply sets MM=100%, effectively overriding the explicit OR. > ExtendedDismaxQParser (edismax) ignores Boolean OR when q.op=AND > > > Key: SOLR-8812 > URL: https://issues.apache.org/jira/browse/SOLR-8812 > Project: Solr > Issue Type: Bug > Components: query parsers >Affects Versions: 5.5 >Reporter: Ryan Steinberg > > The edismax parser ignores Boolean OR in queries when q.op=AND. This behavior > is new to Solr 5.5.0 and an unexpected major change. > Example: > "q": "id:12345 OR zz", > "defType": "edismax", > "q.op": "AND", > where "12345" is a known document ID and "zz" is a string NOT present > in my data > Version 5.5.0 produces zero results: > "rawquerystring": "id:12345 OR zz", > "querystring": "id:12345 OR zz", > "parsedquery": "(+((id:12345 > DisjunctionMaxQuery((text:zz)))~2))/no_coord", > "parsedquery_toString": "+((id:12345 (text:zz))~2)", > "explain": {}, > "QParser": "ExtendedDismaxQParser" > Version 5.4.0 produces one result as expected > "rawquerystring": "id:12345 OR zz", > "querystring": "id:12345 OR zz", > "parsedquery": "(+(id:12345 > DisjunctionMaxQuery((text:zz))))/no_coord", > "parsedquery_toString": "+(id:12345 (text:zz))" > "explain": {}, > "QParser": "ExtendedDismaxQParser"
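A toy model (plain Java, not Lucene code) of why the "~2" makes the OR behave like an AND - a BooleanQuery matches only when at least minimumNumberShouldMatch of its SHOULD clauses hit:

```java
// Toy model of BooleanQuery's minimumNumberShouldMatch semantics:
// a document matches when the number of SHOULD clauses it satisfies
// is at least the minShouldMatch ("~N") value.
public class MinShouldMatch {
    public static boolean matches(boolean[] shouldClauseHits, int minShouldMatch) {
        int hits = 0;
        for (boolean hit : shouldClauseHits) {
            if (hit) hits++;
        }
        return hits >= minShouldMatch;
    }
}
```

In the reported query there are two SHOULD clauses, id:12345 and text:zz; with mm forced to 100% (i.e. minShouldMatch=2) a document matching only id:12345 no longer qualifies, which is exactly the zero-result behavior observed in 5.5.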
[jira] [Commented] (SOLR-8740) use docValues by default
[ https://issues.apache.org/jira/browse/SOLR-8740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185336#comment-15185336 ] Jack Krupansky commented on SOLR-8740: -- My apologies for any unnecessary noise I may have caused here. I just think that every single docValues issue raised for Solr should endeavor to make the lives of Solr users a lot easier, not more complicated and even more confusing. As things stand, docValues is more of an expert-only feature. The mere fact that we can't make docValues uniformly the default illustrates that in spades. > use docValues by default > > > Key: SOLR-8740 > URL: https://issues.apache.org/jira/browse/SOLR-8740 > Project: Solr > Issue Type: Improvement >Affects Versions: master >Reporter: Yonik Seeley > Fix For: master > > > We should consider switching to docValues for most of our non-text fields. > This may be a better default since it is more NRT friendly and acts to avoid > OOM errors due to large field cache or UnInvertedField entries.
[jira] [Commented] (SOLR-8740) use docValues by default
[ https://issues.apache.org/jira/browse/SOLR-8740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183177#comment-15183177 ] Jack Krupansky commented on SOLR-8740: -- And default to docValuesFormat="Memory" as well, or is that already the default when docValues="true" is set? Personally, I still find the whole docValues vs. Stored fields narrative extremely confusing. I've never been able to figure out why Lucene still needs Stored fields (other than for tokenized text fields) if docValues is so much better. In any case, with this Jira in place, there should be clear doc as to what scenarios, if any, stored="true" might have any utility for non-tokenized/text fields. > use docValues by default > > > Key: SOLR-8740 > URL: https://issues.apache.org/jira/browse/SOLR-8740 > Project: Solr > Issue Type: Improvement >Affects Versions: master >Reporter: Yonik Seeley > Fix For: master > > > We should consider switching to docValues for most of our non-text fields. > This may be a better default since it is more NRT friendly and acts to avoid > OOM errors due to large field cache or UnInvertedField entries.
[jira] [Commented] (SOLR-3744) Solr LuceneQParser only handles pure negative queries at the top-level query, but not within parenthesized sub-queries
[ https://issues.apache.org/jira/browse/SOLR-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176543#comment-15176543 ] Jack Krupansky commented on SOLR-3744: -- Personally, I think the proper fix is in Lucene BooleanQuery itself - if no positive clauses are present, a MatchAllDocsQuery should be added as a MUST clause. For example, currently if you have only one clause and it is MUST_NOT, BQ explicitly rewrites to MatchNoDocsQuery. See: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java Any objection from the core Lucene committers? What does ES do when no positive clauses are present in a subquery? > Solr LuceneQParser only handles pure negative queries at the top-level query, > but not within parenthesized sub-queries > -- > > Key: SOLR-3744 > URL: https://issues.apache.org/jira/browse/SOLR-3744 > Project: Solr > Issue Type: Bug > Components: query parsers >Affects Versions: 3.6.1, 4.0-BETA >Reporter: Jack Krupansky > > The SolrQuerySyntax wiki says that pure negative queries are supported ("Pure > negative queries (all clauses prohibited) are allowed"), which is true at the > top-level query, but not for sub-queries enclosed within parentheses. > See: > http://wiki.apache.org/solr/SolrQuerySyntax > Some queries that will not evaluate properly: > test AND (-fox) > test (-fox) > test OR (abc OR (-fox)) > test (-fox) > Sub-queries combined with the "AND" and "OR" keyword operators also fail to > evaluate properly. For example, > test OR -fox > -fox OR test > Note that all of these queries are supported properly by the edismax query > parser.
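A sketch of the proposed rewrite's effect (plain Java over doc-id sets, not actual Lucene code): adding an implicit match-all MUST clause turns a pure-negative Boolean query into "all docs minus the excluded ones":

```java
import java.util.HashSet;
import java.util.Set;

// Toy evaluator: a pure-negative query over doc ids, after the proposed
// rewrite of adding MatchAllDocsQuery as a MUST clause alongside the
// original MUST_NOT clause.
public class PureNegative {
    public static Set<Integer> evaluate(Set<Integer> allDocs, Set<Integer> prohibited) {
        Set<Integer> result = new HashSet<>(allDocs); // implicit match-all MUST
        result.removeAll(prohibited);                 // apply MUST_NOT
        return result;
    }
}
```

Without the implicit MUST clause there is nothing for the MUST_NOT to subtract from, which is why a parenthesized (-fox) currently matches nothing.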
[jira] [Commented] (SOLR-3744) Solr LuceneQParser only handles pure negative queries at the top-level query, but not within parenthesized sub-queries
[ https://issues.apache.org/jira/browse/SOLR-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176525#comment-15176525 ] Jack Krupansky commented on SOLR-3744: -- Long ago... I'll try to remember. My vague recollection is that Solr simply fixed the top-level and inherited the nested behavior from Lucene, but now that Solr has its own copy of the basic query parser it should be fixable in the base query parser. But... I don't recall where I had tracked that down to. > Solr LuceneQParser only handles pure negative queries at the top-level query, > but not within parenthesized sub-queries > -- > > Key: SOLR-3744 > URL: https://issues.apache.org/jira/browse/SOLR-3744 > Project: Solr > Issue Type: Bug > Components: query parsers >Affects Versions: 3.6.1, 4.0-BETA >Reporter: Jack Krupansky > > The SolrQuerySyntax wiki says that pure negative queries are supported ("Pure > negative queries (all clauses prohibited) are allowed"), which is true at the > top-level query, but not for sub-queries enclosed within parentheses. > See: > http://wiki.apache.org/solr/SolrQuerySyntax > Some queries that will not evaluate properly: > test AND (-fox) > test (-fox) > test OR (abc OR (-fox)) > test (-fox) > Sub-queries combined with the "AND" and "OR" keyword operators also fail to > evaluate properly. For example, > test OR -fox > -fox OR test > Note that all of these queries are supported properly by the edismax query > parser.
[jira] [Commented] (SOLR-8110) Start enforcing field naming recomendations in next X.0 release?
[ https://issues.apache.org/jira/browse/SOLR-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171096#comment-15171096 ] Jack Krupansky commented on SOLR-8110: -- bq. "safe"... "moderate"... "legacy" My only real nit is that it would be a shame if we couldn't say simply that people will be safe if they stick to Java identifier rules. That would mean $ and full Unicode. My point is that it makes learning Solr more intuitive since Java is more of a commonly-known entity - "Solr field names are Java identifiers", rather than encumber people with yet another set of rules to learn. Note that the current Solr code mostly uses isJavaIdentifierStart/isJavaIdentifierPart today, but disallowing $, probably due to parameter substitution. IOW, Unicode is there today. See: https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/search/StrParser.java https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/search/SolrReturnFields.java > Start enforcing field naming recomendations in next X.0 release? > > > Key: SOLR-8110 > URL: https://issues.apache.org/jira/browse/SOLR-8110 > Project: Solr > Issue Type: Improvement >Reporter: Hoss Man > Attachments: SOLR-8110.patch, SOLR-8110.patch > > > For a very long time now, Solr has made the following "recommendation" > regarding field naming conventions... > bq. field names should consist of alphanumeric or underscore characters only > and not start with a digit. This is not currently strictly enforced, but > other field names will not have first class support from all components and > back compatibility is not guaranteed. ... > I'm opening this issue to track discussion about if/how we should start > enforcing this as a rule instead (instead of just a "recommendation") in our > next/future X.0 (ie: major) release. 
> The goals of doing so being: > * simplify some existing code/apis that currently use hueristics to deal with > lists of field and produce strange errors when the huerstic fails (example: > ReturnFields.add) > * reduce confusion/pain for new users who might start out unaware of the > recommended conventions and then only later encountering a situation where > their field names are not supported by some feature and get frustrated > because they have to change their schema, reindex, update index/query client > expectations, etc...
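A sketch of what "Java identifier rules, minus $" validation could look like (a hypothetical helper for illustration, not existing Solr code, though StrParser and SolrReturnFields use the same Character methods):

```java
// Hypothetical validator: Java identifier rules (full Unicode letters),
// but disallowing '$' to avoid clashing with parameter substitution.
public class FieldName {
    public static boolean isValid(String name) {
        if (name == null || name.isEmpty()) return false;
        char first = name.charAt(0);
        if (first == '$' || !Character.isJavaIdentifierStart(first)) return false;
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '$' || !Character.isJavaIdentifierPart(c)) return false;
        }
        return true;
    }
}
```

Note this accepts Unicode letters (e.g. accented characters) for free, since Character.isJavaIdentifierStart/Part already implement the Java identifier rules; the sketch iterates chars and so ignores supplementary-plane code points for simplicity.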
[jira] [Commented] (SOLR-8110) Start enforcing field naming recomendations in next X.0 release?
[ https://issues.apache.org/jira/browse/SOLR-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171092#comment-15171092 ] Jack Krupansky commented on SOLR-8110: -- bq. lucene expressions I was going to say that Lucene Expressions are basically JavaScript, but... they are sort-of based on JS, but really more of a conceptual rather than literal basis. Here's Lucene's grammar rule for VARIABLE: {code} VARIABLE: ID ARRAY* ( [.] ID ARRAY* )*; fragment ARRAY: [[] ( STRING | INTEGER ) [\]]; fragment ID: [_$a-zA-Z] [_$a-zA-Z0-9]*; fragment STRING : ['] ( '\\\'' | '\\\\' | ~[\\'] )*? ['] | ["] ( '\\"' | '\\\\' | ~[\\"] )*? ["] ; {code} See: https://github.com/apache/lucene-solr/blob/master/lucene/expressions/src/java/org/apache/lucene/expressions/js/Javascript.g4 No Unicode support, no random special characters, just $ and _, but apparently dot as well. An ID is: {code} ID: [_$a-zA-Z] [_$a-zA-Z0-9]* {code} And any number of IDs can be written with dots between them to represent a single VARIABLE token. JavaScript identifiers are defined in the ECMAScript spec: https://tc39.github.io/ecma262/#prod-IdentifierName Letters in Java/ECMAScript are Unicode as defined by the Unicode properties "ID_Start" and "ID_Continue". Java/ECMAScript support $ and _ in addition to letters. Identifier start and continue character types are defined by the Unicode UAX#31 Identifier spec: http://unicode.org/reports/tr31/ > Start enforcing field naming recomendations in next X.0 release? > > > Key: SOLR-8110 > URL: https://issues.apache.org/jira/browse/SOLR-8110 > Project: Solr > Issue Type: Improvement >Reporter: Hoss Man > Attachments: SOLR-8110.patch, SOLR-8110.patch > > > For a very long time now, Solr has made the following "recommendation" > regarding field naming conventions... > bq. field names should consist of alphanumeric or underscore characters only > and not start with a digit.
This is not currently strictly enforced, but > other field names will not have first class support from all components and > back compatibility is not guaranteed. ... > I'm opening this issue to track discussion about if/how we should start > enforcing this as a rule instead (instead of just a "recommendation") in our > next/future X.0 (ie: major) release. > The goals of doing so being: > * simplify some existing code/apis that currently use hueristics to deal with > lists of field and produce strange errors when the huerstic fails (example: > ReturnFields.add) > * reduce confusion/pain for new users who might start out unaware of the > recommended conventions and then only later encountering a situation where > their field names are not supported by some feature and get frustrated > because they have to change their schema, reindex, update index/query client > expectations, etc...
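The expressions ID rule quoted above can be checked with a plain Java regex (an illustration only; the actual lexer is ANTLR-generated from Javascript.g4):

```java
import java.util.regex.Pattern;

// The Lucene expressions ID rule: ASCII letters, digits, '_' and '$',
// not starting with a digit - notably no Unicode, unlike Java/ECMAScript.
public class ExpressionId {
    private static final Pattern ID = Pattern.compile("[_$a-zA-Z][_$a-zA-Z0-9]*");

    public static boolean isId(String s) {
        return ID.matcher(s).matches();
    }
}
```

This makes the practical gap concrete: a Unicode field name that is a perfectly legal Java/ECMAScript identifier would still be rejected by Lucene expressions.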
[jira] [Commented] (SOLR-8110) Start enforcing field naming recomendations in next X.0 release?
[ https://issues.apache.org/jira/browse/SOLR-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170700#comment-15170700 ] Jack Krupansky commented on SOLR-8110: -- I can't recall any explicit statement on case sensitivity, although I would imagine that the existing "anything goes" model would default to case-sensitive. Personally, I would prefer case-insensitive. I can't recall a schema in which case-sensitive field names were used, while case mistakes are not uncommon. > Start enforcing field naming recomendations in next X.0 release? > > > Key: SOLR-8110 > URL: https://issues.apache.org/jira/browse/SOLR-8110 > Project: Solr > Issue Type: Improvement >Reporter: Hoss Man > Attachments: SOLR-8110.patch, SOLR-8110.patch > > > For a very long time now, Solr has made the following "recommendation" > regarding field naming conventions... > bq. field names should consist of alphanumeric or underscore characters only > and not start with a digit. This is not currently strictly enforced, but > other field names will not have first class support from all components and > back compatibility is not guaranteed. ... > I'm opening this issue to track discussion about if/how we should start > enforcing this as a rule instead (instead of just a "recommendation") in our > next/future X.0 (ie: major) release. > The goals of doing so being: > * simplify some existing code/apis that currently use hueristics to deal with > lists of field and produce strange errors when the huerstic fails (example: > ReturnFields.add) > * reduce confusion/pain for new users who might start out unaware of the > recommended conventions and then only later encountering a situation where > their field names are not supported by some feature and get frustrated > because they have to change their schema, reindex, update index/query client > expectations, etc...
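If case-insensitive names were adopted, the lookup side would be cheap; a minimal sketch (illustrative only, not Solr code) using the JDK's built-in case-insensitive comparator:

```java
import java.util.Map;
import java.util.TreeMap;

// Case-insensitive field lookup via String.CASE_INSENSITIVE_ORDER.
// Hypothetical sketch only; Solr's actual schema lookup differs.
public class CaseInsensitiveFields {
    public static void main(String[] args) {
        Map<String, String> fields = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        fields.put("createdAt", "date");
        // A query using the wrong case still finds the field.
        System.out.println(fields.get("createdat")); // date
        System.out.println(fields.get("CREATEDAT")); // date
    }
}
```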
[jira] [Commented] (SOLR-8110) Start enforcing field naming recomendations in next X.0 release?
[ https://issues.apache.org/jira/browse/SOLR-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170698#comment-15170698 ] Jack Krupansky commented on SOLR-8110: -- The dollar sign is permitted in Java identifiers, including at the start. Per the Java Language Specification, "The "Java letters" include uppercase and lowercase ASCII Latin letters A-Z (\u0041-\u005a), and a-z (\u0061-\u007a), and, for historical reasons, the ASCII underscore (_, or \u005f) and dollar sign ($, or \u0024)." It goes on to say that "The $ character should be used only in mechanically generated source code or, rarely, to access pre-existing names on legacy systems." See: https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.8 If anything, I had been assuming that we were proposing a superset of Java identifiers (hyphen and dot as part of a name). I'm not positive whether the dollar sign might conflict with parameter substitution.
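The JLS rules cited above can be checked directly against the java.lang.Character helpers; a quick sketch (the class name is just for illustration):

```java
// Which characters count as Java identifier characters, per the JLS
// rules exposed by Character.isJavaIdentifierStart/Part.
public class IdentifierCharCheck {
    public static void main(String[] args) {
        // '$' and '_' are legal identifier starts, for historical reasons.
        System.out.println(Character.isJavaIdentifierStart('$')); // true
        System.out.println(Character.isJavaIdentifierStart('_')); // true
        // Digits may appear only after the first character.
        System.out.println(Character.isJavaIdentifierStart('7')); // false
        System.out.println(Character.isJavaIdentifierPart('7'));  // true
        // Hyphen and dot are not identifier characters at all.
        System.out.println(Character.isJavaIdentifierPart('-'));  // false
        System.out.println(Character.isJavaIdentifierPart('.'));  // false
    }
}
```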
[jira] [Commented] (SOLR-8110) Start enforcing field naming recomendations in next X.0 release?
[ https://issues.apache.org/jira/browse/SOLR-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170155#comment-15170155 ] Jack Krupansky commented on SOLR-8110: -- I've accepted the fact that Solr will probably never need to support full infix expressions. If somebody wants to seriously propose full infix expressions, fine, but it seems like too much to worry about such vague possibilities now. Note that I am still a proponent of having quoted/escaped names which allow anything in a name, a la SQL.
[jira] [Commented] (SOLR-8110) Start enforcing field naming recomendations in next X.0 release?
[ https://issues.apache.org/jira/browse/SOLR-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169830#comment-15169830 ] Jack Krupansky commented on SOLR-8110: -- Dot is a tough case. I can see reserving it for future expansion, but I can also see its utility as a pseudo-field delimiter, such as when data came from a SQL ETL operation that actually did use the dot in a compound field name. How about saying that dot is pseudo-reserved for compound field name references: if the decomposed field name has a well-defined meaning in some context (such as where there are contextual named structural entities like table or collection names), then so be it, but if it has no clear meaning in the context, then the full, dotted name is treated as a raw field name. So, at the level of the fl parameter, a dotted name would get parsed as a compound name and then treated as a simple field name.
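One way to read that "pseudo-reserved dot" proposal in code; everything here (class name, method, the entity-lookup step) is hypothetical illustration, not anything that exists in Solr:

```java
import java.util.Set;

// Hypothetical sketch: try to resolve a dotted name against known
// structural entities (e.g. collection names); if the prefix has no
// clear meaning, fall back to treating the whole dotted string as a
// raw field name.
public class DottedNameResolver {
    private final Set<String> knownEntities;

    public DottedNameResolver(Set<String> knownEntities) {
        this.knownEntities = knownEntities;
    }

    /** Returns the field-name portion to use for lookup. */
    public String resolve(String name) {
        int dot = name.indexOf('.');
        if (dot > 0 && knownEntities.contains(name.substring(0, dot))) {
            // Prefix has a well-defined meaning: treat as a compound name.
            return name.substring(dot + 1);
        }
        // No clear meaning in context: whole string is the field name.
        return name;
    }

    public static void main(String[] args) {
        DottedNameResolver r = new DottedNameResolver(Set.of("products"));
        System.out.println(r.resolve("products.price")); // price
        System.out.println(r.resolve("unit.price"));     // unit.price
    }
}
```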
[jira] [Commented] (SOLR-8713) New UI points to the wiki for Query Syntax instead of the Reference Guide
[ https://issues.apache.org/jira/browse/SOLR-8713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169703#comment-15169703 ] Jack Krupansky commented on SOLR-8713: -- Be careful, because Confluence has the working text of the NEXT release of Solr (6.0 right now), not the current release or even necessarily the release the admin UI is actually running. It would be nice if the Confluence doc were per-release in addition to the development version, but right now only the PDF is per-release, which is what the admin UI should point to. > New UI points to the wiki for Query Syntax instead of the Reference Guide > - > > Key: SOLR-8713 > URL: https://issues.apache.org/jira/browse/SOLR-8713 > Project: Solr > Issue Type: Bug > Components: UI >Affects Versions: master >Reporter: Tomás Fernández Löbbe >Priority: Trivial > Labels: newdev > > Old Admin UI points to > https://cwiki.apache.org/confluence/display/solr/Query+Syntax+and+Parsing but > the new one points to http://wiki.apache.org/solr/SolrQuerySyntax
[jira] [Commented] (SOLR-8110) Start enforcing field naming recomendations in next X.0 release?
[ https://issues.apache.org/jira/browse/SOLR-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159113#comment-15159113 ] Jack Krupansky commented on SOLR-8110: -- 1. Since enforcement of naming conventions is a new concept, I would suggest making it optional in 6.x, preferably opt-out - most people can probably live with it without problems. Whether it would be a schema version trigger or a separate config/schema option can be debated. 2. Consider the concept of delimited identifiers as in SQL - enclose non-regular names in quotes. It is worth noting that highly irregular names are not supported in queries even today (most special characters will terminate the field name in most query parsers).
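A sketch of the SQL-style delimited-identifier idea mentioned in point 2 (pure illustration, not an existing Solr parser): double quotes delimit the name, and a doubled quote inside stands for a literal one.

```java
// Minimal SQL-style delimited-identifier reader: "a-b.c" -> a-b.c,
// with "" inside the quotes standing for a literal double quote.
// Hypothetical sketch; real query parsers would integrate this into
// their tokenizers.
public class DelimitedIdentifier {
    public static String unquote(String s) {
        if (s.length() < 2 || s.charAt(0) != '"' || s.charAt(s.length() - 1) != '"') {
            throw new IllegalArgumentException("not a delimited identifier: " + s);
        }
        return s.substring(1, s.length() - 1).replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"a-b.c\""));          // a-b.c
        System.out.println(unquote("\"say \"\"hi\"\"\"")); // say "hi"
    }
}
```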
[jira] [Commented] (SOLR-8110) Start enforcing field naming recomendations in next X.0 release?
[ https://issues.apache.org/jira/browse/SOLR-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15156259#comment-15156259 ] Jack Krupansky commented on SOLR-8110: -- There is the issue of simple ASCII letters vs. Unicode letters. Java identifiers support arbitrary Unicode letters, which "allows programmers to use identifiers in their programs that are written in their native languages." See Character.isJavaIdentifierStart and Character.isJavaIdentifierPart.
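Validating a whole field name against those Java identifier rules is only a couple of lines; the method name here is mine, for illustration:

```java
// Validate a candidate field name using the Java identifier rules,
// which accept non-ASCII letters via Character.isJavaIdentifierStart/Part.
public class UnicodeNameCheck {
    static boolean isJavaStyleName(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        return s.chars().skip(1).allMatch(Character::isJavaIdentifierPart);
    }

    public static void main(String[] args) {
        System.out.println(isJavaStyleName("prix_café"));  // true: é is a letter
        System.out.println(isJavaStyleName("приложение")); // true: Cyrillic letters
        System.out.println(isJavaStyleName("first-name")); // false: hyphen
        System.out.println(isJavaStyleName("1st"));        // false: leading digit
    }
}
```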
[jira] [Commented] (SOLR-8621) solrconfig.xml: deprecate/replace <mergePolicy> with <mergePolicyFactory>
[ https://issues.apache.org/jira/browse/SOLR-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153085#comment-15153085 ] Jack Krupansky commented on SOLR-8621: -- Shouldn't the index config reference page still list <mergePolicy> but with a "Deprecated" notice? Ditto for <mergeFactor>. The Upgrading Solr ref page does give an example of how to migrate from MP to MPF (and for MF) - it would be nice to link to that from a deprecation notice on the index config page. See: https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig https://cwiki.apache.org/confluence/display/solr/Upgrading+Solr > solrconfig.xml: deprecate/replace <mergePolicy> with <mergePolicyFactory> > - > > Key: SOLR-8621 > URL: https://issues.apache.org/jira/browse/SOLR-8621 > Project: Solr > Issue Type: Task >Reporter: Christine Poerschke >Assignee: Christine Poerschke > Fix For: 5.5, master > > Attachments: SOLR-8621-example_contrib_configs.patch, > SOLR-8621-example_contrib_configs.patch, SOLR-8621.patch, > explicit-merge-auto-set.patch > > > *end-user benefits:* > * Lucene's UpgradeIndexMergePolicy can be configured in Solr > * Lucene's SortingMergePolicy can be configured in Solr (with SOLR-5730) > * customisability: arbitrary merge policies including wrapping/nested merge > policies can be created and configured > *roadmap:* > * solr 5.5 introduces <mergePolicyFactory> support > * solr 5.5 deprecates (but maintains) <mergePolicy> support > * SOLR-8668 in solr 6.0(\?) will remove <mergePolicy> support
[jira] [Comment Edited] (SOLR-7555) Display total space and available space in Admin
[ https://issues.apache.org/jira/browse/SOLR-7555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151147#comment-15151147 ] Jack Krupansky edited comment on SOLR-7555 at 2/17/16 8:52 PM: --- I recently noticed that quite a few of the Amazon EC2 instance types have two or more local SSD storage devices. Should Solr display "total space" across all available local devices or just for the storage device on which Solr appears to be configured? If the instance supports EBS only, I presume it would be the total for the EBS storage that the instance type supports. > Display total space and available space in Admin > > > Key: SOLR-7555 > URL: https://issues.apache.org/jira/browse/SOLR-7555 > Project: Solr > Issue Type: Improvement > Components: web gui >Affects Versions: 5.1 >Reporter: Eric Pugh >Assignee: Erik Hatcher >Priority: Minor > Fix For: 6.0 > > Attachments: DiskSpaceAwareDirectory.java, > SOLR-7555-display_disk_space.patch, SOLR-7555-display_disk_space_v2.patch, > SOLR-7555-display_disk_space_v3.patch, SOLR-7555-display_disk_space_v4.patch, > SOLR-7555-display_disk_space_v5.patch, SOLR-7555.patch, SOLR-7555.patch, > SOLR-7555.patch > > > Frequently I have access to the Solr Admin console, but not the underlying > server, and I'm curious how much space remains available. This little patch > exposes total Volume size as well as the usable space remaining: > !https://monosnap.com/file/VqlReekCFwpK6utI3lP18fbPqrGI4b.png! > I'm not sure if this is the best place to put this, as every shard will share > the same data, so maybe it should be on the top level Dashboard?
> Also not sure what to call the fields!
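For what it's worth, the per-device question maps directly onto the JDK API: java.io.File reports space for the partition holding a given path, so "which device" reduces to "which path you ask about". A minimal sketch (not the patch's actual code):

```java
import java.io.File;

// Report total and usable space for the partition containing a path,
// using the stock java.io.File calls a handler could expose per data dir.
public class DiskSpace {
    public static void main(String[] args) {
        File dir = new File(".");           // e.g. a core's data directory
        long total = dir.getTotalSpace();   // size of this partition only
        long usable = dir.getUsableSpace(); // space available to this JVM
        System.out.printf("total=%d usable=%d (%.1f%% free)%n",
                total, usable, 100.0 * usable / total);
    }
}
```

With multiple local SSDs, calling this once per mount point would give the per-device breakdown being discussed.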
[jira] [Commented] (SOLR-8110) Start enforcing field naming recomendations in next X.0 release?
[ https://issues.apache.org/jira/browse/SOLR-8110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151121#comment-15151121 ] Jack Krupansky commented on SOLR-8110: -- It would be nice to say that a "Solr identifier" had the same rules as a Java identifier, but Java allows dollar signs and excludes keywords and reserved terms like if, for, true, false, null. Hmmm... I don't know if many people would complain if Solr didn't allow those keywords as field names. The main exceptions to the current soft rule that I have run across are: 1. Dot for compound names. 2. Hyphen, which feels a little more natural than underscore unless you're truly thinking about Java code and imagining that you could write a minus sign for a subtraction operation. 3. An ISO date/time value for dynamic fields which want to be timestamped. An optional text keyword prefix and hyphen are common for these timestamped columns as well. 4. Spaces, but I think sensible people can accept that those are not permitted in names. The main difficulty I am aware of in Solr is the parsing of function queries, including (or especially) in the field list of the fl parameter.
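The current recommendation amounts to one regex (reading "alphanumeric" as ASCII here), and each exception case listed above fails it:

```java
import java.util.regex.Pattern;

// The recommended convention: alphanumeric or underscore characters
// only, not starting with a digit (ASCII reading).
public class RecommendedName {
    static final Pattern RECOMMENDED = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    public static void main(String[] args) {
        String[] names = {
            "price_usd",                 // follows the recommendation
            "table.field",               // dot: compound name
            "first-name",                // hyphen
            "log-2016-02-22T00:00:00Z",  // timestamped dynamic field
        };
        for (String n : names) {
            System.out.println(n + " -> " + RECOMMENDED.matcher(n).matches());
        }
    }
}
```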
[jira] [Commented] (SOLR-5730) make Lucene's SortingMergePolicy and EarlyTerminatingSortingCollector configurable in Solr
[ https://issues.apache.org/jira/browse/SOLR-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129131#comment-15129131 ] Jack Krupansky commented on SOLR-5730: -- Let me try again... again, my apologies for not commenting much earlier before things got a bit complicated. Let me see if I have this straight: 1. There are three related tickets: SOLR-4654, SOLR-5730, SOLR-8621. 2. There are three key features of interest: UpgradeIndexMergePolicy, SortingMergePolicy , and EarlyTerminatingSortingCollector. 3. The first ticket is kind of the umbrella. 4. The second ticket is focused on the second and third features. 5. The third ticket is the foundation for all three features. 6. The third ticket has some user impact and delivers some additional minor benefits, but enabling those other three features is its true purpose. 7. SortingMergePolicy and EarlyTerminatingSortingCollector are really two sides of a single feature, the index side and the query side of (in my words) "pre-sorted indexing". Now, I have only one remaining question area: Isn't the forceMerge method the only real benefit of UpgradeIndexMergePolicy? Is that purely for the Solr optimize option, or is there some intent to surface it for users some other way in Solr? Isn't it more of a one-time operation rather than something that should be in place for all merge operations? Or is it so cheap if not used that we should simply pre-configure it all the time? 
> make Lucene's SortingMergePolicy and EarlyTerminatingSortingCollector > configurable in Solr > -- > > Key: SOLR-5730 > URL: https://issues.apache.org/jira/browse/SOLR-5730 > Project: Solr > Issue Type: New Feature >Reporter: Christine Poerschke >Assignee: Christine Poerschke >Priority: Minor > Attachments: SOLR-5730-part1of2.patch, SOLR-5730-part1of2.patch, > SOLR-5730-part2of2.patch, SOLR-5730-part2of2.patch > > > *Example configuration (solrconfig.xml) - corresponding to latest attached > patch:* > {noformat} > > timestamp desc > > {noformat} > *Example configuration (solrconfig.xml) - corresponding to current > (work-in-progress master-solr-8621) SOLR-8621 efforts:* > {noformat} > - > + > + TieredMergePolicyFactory > + timestamp desc > + > {noformat} > *Example use (EarlyTerminatingSortingCollector):* > {noformat} > =timestamp+desc=true > {noformat}
[jira] [Commented] (SOLR-5730) make Lucene's SortingMergePolicy and EarlyTerminatingSortingCollector configurable in Solr
[ https://issues.apache.org/jira/browse/SOLR-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128303#comment-15128303 ] Jack Krupansky commented on SOLR-5730: -- Sorry for arriving so late to the party here, but I've gotten lost in all the back and forth... is there going to be a simple, easy-to-use XML element that lets the user simply enable sorted merging and specify a field list, as opposed to having to manually construct an elaborate Lucene-level set of wrapped merge policies? Sure, some experts will wish to fully configure every detail of a Lucene merge policy, but for non-expert users who just want to ensure that their index is pre-sorted to align with a query sort, the syntax should be... simple. If the user does construct some elaborate wrapped MP, then some sort of parameter substitution would be needed, but if the user uses the default solrconfig, which has no explicit MP, Solr should build that full, wrapped MP with just the sort field names substituted. In short, I just want to know whether this is intended to be a very easy-to-use feature (supposed to be the trademark of Solr) or some super-elaborate expert-only feature that we would be forced to recommend average users stay away from. Personally, my preference would be to focus on introducing a first-class Solr feature of a "preferred document order", which is effectively a composite primary key in database nomenclature. So, let's not forget that this is Solr we are talking about, not raw Lucene. I'd like to know that [~yo...@apache.org] and [~hossman] are explicitly on board with what is being proposed.
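For concreteness, a hedged sketch of what the factory-based wrapped-policy configuration under discussion might look like. The element, attribute, and class names follow the SOLR-8621 work-in-progress branch and are assumptions here; the final syntax may differ:

```xml
<!-- Hypothetical sketch only: names follow the SOLR-8621
     work-in-progress and may not match the final syntax. -->
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
    <str name="sort">timestamp desc</str>
    <str name="wrapped.prefix">inner</str>
    <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
  </mergePolicyFactory>
</indexConfig>
```

If something like this became the "simple" form, the sort line alone would cover the non-expert case being argued for, with the wrapped inner policy defaulted.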
[jira] [Commented] (SOLR-8621) solrconfig.xml: deprecate/replace <mergePolicy> with <mergePolicyFactory>
[ https://issues.apache.org/jira/browse/SOLR-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127090#comment-15127090 ] Jack Krupansky commented on SOLR-8621: -- Will both of the existing elements be deprecated as well (in addition to being allowed within the new <mergePolicyFactory>)?
[jira] [Commented] (SOLR-8621) solrconfig.xml: deprecate/replace <mergePolicy> with <mergePolicyFactory>
[ https://issues.apache.org/jira/browse/SOLR-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127099#comment-15127099 ] Jack Krupansky commented on SOLR-8621: -- IIUC, the motivation here is to permit any number of merge policies to be configured, with the goal of supporting wrapping of merge policies. Okay, but what tells Solr which MP is the outer/default MP?
[jira] [Commented] (SOLR-8621) solrconfig.xml: deprecate/replace <mergePolicy> with <mergePolicyFactory>
[ https://issues.apache.org/jira/browse/SOLR-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123590#comment-15123590 ] Jack Krupansky commented on SOLR-8621: -- Is this simply a rename of the XML element (from <mergePolicy> to <mergePolicyFactory>), or is there some other user-visible feature enhancement or change? Is the Fix Version 6.0?
[jira] [Commented] (LUCENE-6991) WordDelimiterFilter bug
[ https://issues.apache.org/jira/browse/LUCENE-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115405#comment-15115405 ] Jack Krupansky commented on LUCENE-6991: Does seem odd and wrong. I also notice that it is not generating terms for the single letters from the %-escapes: %3A, %2F. It also seems odd that the long token of catenated word parts does not include all of the word parts from the URL. It seems like a digit not preceded by a letter causes a break, while a digit preceded by a letter prevents a break. Since you are using the whitespace tokenizer, the WDF only sees one space-delimited term at a time. You might try your test with just the URL portion itself, both with and without the escaped quote, to see whether that affects anything. > WordDelimiterFilter bug > --- > > Key: LUCENE-6991 > URL: https://issues.apache.org/jira/browse/LUCENE-6991 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 4.10.4, 5.3.1 >Reporter: Pawel Rog >Priority: Minor > > I was preparing an analyzer which contains WordDelimiterFilter and I realized it > sometimes gives results different than expected. > I prepared a short test which shows the problem. I haven't used Lucene tests > for this but that doesn't matter for showing the bug.
> {code} > String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +] \"GET > /products/key-phrase-extractor/ HTTP/1.1\"" + > " 200 3437 http://www.google.com/url?sa=t=j==s&; + > > "source=web=15=rja=0CEgQFjAEOAo=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" > + > > "phrase-extractor%2F=TPOuUbaWM-OKiQfGxIGYDw=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg" > + > "=oYitONI2EIZ0CQar7Ej8HA=bv.47380653,d.aGc\" \"Mozilla/5.0 > (X11; Ubuntu; Linux i686; rv:20.0) " + > "Gecko/20100101 Firefox/20.0\""; > List tokens1 = new ArrayList(); > List tokens2 = new ArrayList(); > WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(); > TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed); > tokenStream = new WordDelimiterFilter(tokenStream, > WordDelimiterFilter.GENERATE_WORD_PARTS | > WordDelimiterFilter.CATENATE_WORDS | > WordDelimiterFilter.SPLIT_ON_CASE_CHANGE, > null); > CharTermAttribute charAttrib = > tokenStream.addAttribute(CharTermAttribute.class); > tokenStream.reset(); > while(tokenStream.incrementToken()) { > tokens1.add(charAttrib.toString()); > System.out.println(charAttrib.toString()); > } > tokenStream.end(); > tokenStream.close(); > urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +] \"GET > /products/key-phrase-extractor/ HTTP/1.1\"" + > " 200 3437 \"http://www.google.com/url?sa=t=j==s&; + > > "source=web=15=rja=0CEgQFjAEOAo=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" > + > > "phrase-extractor%2F=TPOuUbaWM-OKiQfGxIGYDw=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg" > + > "=oYitONI2EIZ0CQar7Ej8HA=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; > Ubuntu; Linux i686; rv:20.0) " + > "Gecko/20100101 Firefox/20.0\""; > System.out.println("\n\n\n\n"); > tokenStream = analyzer.tokenStream("test", urlIndexed); > tokenStream = new WordDelimiterFilter(tokenStream, > WordDelimiterFilter.GENERATE_WORD_PARTS | > WordDelimiterFilter.CATENATE_WORDS | > WordDelimiterFilter.SPLIT_ON_CASE_CHANGE, > null); > charAttrib = tokenStream.addAttribute(CharTermAttribute.class); > 
tokenStream.reset(); > while(tokenStream.incrementToken()) { > tokens2.add(charAttrib.toString()); > System.out.println(charAttrib.toString()); > } > tokenStream.end(); > tokenStream.close(); > assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2)); > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8029) Modernize and standardize Solr APIs
[ https://issues.apache.org/jira/browse/SOLR-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109850#comment-15109850 ] Jack Krupansky commented on SOLR-8029: -- Is this likely to be in 6.0 or 6.1? +1 for 6.0, even if not absolutely 100% completely done. At least 6.0 can be billed as having a modern API, even if there might be some additional work required to get it fully rock solid/fully tested in 6.1. > Modernize and standardize Solr APIs > --- > > Key: SOLR-8029 > URL: https://issues.apache.org/jira/browse/SOLR-8029 > Project: Solr > Issue Type: Improvement >Affects Versions: Trunk >Reporter: Noble Paul >Assignee: Noble Paul > Labels: API, EaseOfUse > Fix For: Trunk > > Attachments: SOLR-8029.patch, SOLR-8029.patch, SOLR-8029.patch, > SOLR-8029.patch > > > Solr APIs have organically evolved and they are sometimes inconsistent with > each other or not in sync with the widely followed conventions of the HTTP > protocol. Trying to make incremental changes to make them modern is like > applying a band-aid. So, we have done a complete rethink of what the APIs > should be. The most notable aspects of the API are as follows: > The new set of APIs will be placed under a new path {{/solr2}}. The legacy > APIs will continue to work under the {{/solr}} path as they used to and they > will be eventually deprecated. > There are 4 types of requests in the new API > * {{/v2/<collection>/*}} : Hit a collection directly or manage > collections/shards/replicas > * {{/v2/<core>/*}} : Hit a core directly or manage cores > * {{/v2/cluster/*}} : Operations on cluster not pertaining to any collection > or core. e.g: security, overseer ops etc > This will be released as part of a major release. Check the link given below > for the full specification. 
Your comments are welcome > [Solr API version 2 Specification | http://bit.ly/1JYsBMQ] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3141) Deprecate OPTIMIZE command in Solr
[ https://issues.apache.org/jira/browse/SOLR-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086474#comment-15086474 ] Jack Krupansky commented on SOLR-3141: -- bq. optimize() is rarely necessary anymore Well... I used to say that same thing, because I was under the impression that the common merge policies would automatically optimize segments over time, but over the past year there have been several email threads with users who had heavy update/delete usage patterns where the index size appeared to remain bloated due to deleted/updated documents. So... we need a revised story... and doc. What exactly should we be telling people who update/delete lots of docs frequently and still find that the index is bloated? Is there maybe some underlying bug or tuning of the delete/merge policy needed? Or... maybe people still need an explicit "force merge" command to effectively say "I just finished a large batch of document updates/deletes but I'm done now, so merge away." Personally, I would like to see a "start batch" mode, which signals that the user intends to make a lot of changes and Solr/Lucene should make no attempt to optimize or clean things up or update caches until the user signals "end of batch", at which time any appropriate merging or optimization or cache refreshing can occur. Not everybody will want to do this, but it still seems to be a semi-common use of Solr. > Deprecate OPTIMIZE command in Solr > -- > > Key: SOLR-3141 > URL: https://issues.apache.org/jira/browse/SOLR-3141 > Project: Solr > Issue Type: Improvement > Components: update >Affects Versions: 3.5 >Reporter: Jan Høydahl >Assignee: Jan Høydahl > Labels: force, optimize > Fix For: 4.9, Trunk > > Attachments: SOLR-3141.patch, SOLR-3141.patch, SOLR-3141.patch > > > Background: LUCENE-3454 renames optimize() as forceMerge(). Please read that > issue first. 
> Now that optimize() is rarely necessary anymore, and renamed in Lucene APIs, > what should be done with Solr's ancient optimize command? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
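For reference, the discussion above is about the Lucene-level rename of optimize() to forceMerge(). A hedged sketch of the renamed calls (real Lucene 5.x APIs; the in-memory directory and writer setup here are purely illustrative, not taken from Solr):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ForceMergeSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // in-memory, for illustration only
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // ... a large batch of document updates/deletes would happen here ...

        // The renamed call behind Solr's optimize command: merge the index
        // down to one segment, reclaiming space held by deleted documents.
        writer.forceMerge(1);

        // Lighter-weight alternative for the "index stays bloated" case:
        // only rewrites segments that contain deletions.
        writer.forceMergeDeletes();

        writer.close();
    }
}
```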
[jira] [Commented] (SOLR-2649) MM ignored in edismax queries with operators
[ https://issues.apache.org/jira/browse/SOLR-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038585#comment-15038585 ] Jack Krupansky commented on SOLR-2649: -- The behavior of mm only applying to the top-level query is not documented at present. Even if having mm apply only to the top-level query is intended, it seems a separate matter as to how the q.op parameter applies. I've never seen any doc or discussion that suggested that the default operator should only apply to the top-level query. I haven't looked at the code lately, but it used to be that q.op was just used to set the internal mm value and then completely ignored in the sense that it was not passed down to the Lucene query parser to use as the Lucene default operator. IOW, the Lucene setDefaultOperator method was never called. See: https://lucene.apache.org/core/5_3_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setDefaultOperator(org.apache.lucene.queryparser.classic.QueryParser.Operator) > MM ignored in edismax queries with operators > > > Key: SOLR-2649 > URL: https://issues.apache.org/jira/browse/SOLR-2649 > Project: Solr > Issue Type: Improvement > Components: query parsers >Reporter: Magnus Bergmark >Assignee: Erick Erickson > Fix For: 4.9, Trunk > > Attachments: SOLR-2649-with-Qop.patch, SOLR-2649-with-Qop.patch, > SOLR-2649.diff, SOLR-2649.patch > > > Hypothetical scenario: > 1. User searches for "stocks oil gold" with MM set to "50%" > 2. User adds "-stockings" to the query: "stocks oil gold -stockings" > 3. User gets no hits since MM was ignored and all terms were AND-ed > together > The behavior seems to be intentional, although the reason why is never > explained: > // For correct lucene queries, turn off mm processing if there > // were explicit operators (except for AND). 
> boolean doMinMatched = (numOR + numNOT + numPluses + numMinuses) == 0; > (lines 232-234 taken from > tags/lucene_solr_3_3/solr/src/java/org/apache/solr/search/ExtendedDismaxQParserPlugin.java) > This makes edismax unsuitable as a replacement for dismax; mm is one of the > primary features of dismax. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
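To make the setDefaultOperator distinction above concrete, here is a hedged sketch using Lucene's public classic query parser API (this is not Solr's edismax code; the field name and analyzer choice are arbitrary):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class DefaultOperatorSketch {
    public static void main(String[] args) throws Exception {
        QueryParser qp = new QueryParser("text", new StandardAnalyzer());

        // This is the call the comment refers to: it changes how bare terms
        // are combined, so "stocks oil gold" becomes a conjunction rather
        // than a disjunction. The claim above is that edismax used q.op only
        // to derive mm internally and never made this call.
        qp.setDefaultOperator(QueryParser.Operator.AND);

        Query q = qp.parse("stocks oil gold");
        System.out.println(q); // e.g. +text:stocks +text:oil +text:gold
    }
}
```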
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988338#comment-14988338 ] Jack Krupansky commented on LUCENE-6874: Certainly Solr can update its example schemas to use whatever alternative tokenizer or option is decided on so that Solr users, many of whom are not Java developers, will no longer fall into this NBSP trap, but... that still feels like a less than desirable resolution. [~thetaphi], could you elaborate more specifically on the existing use case that you are trying to preserve? I mean, like in terms of a real-world example. Where do some of your NBSPs actually live in the wild? It seems to me that the vast majority of normal users would not be negatively impacted by having "white space" be defined using the Unicode model. I never objected to using the Java model, but that's because I had overlooked this nuance of NBSP. My concern for Solr users is that NBSP occurs somewhat commonly in HTML web pages - as a formatting technique more than an attempt at influencing tokenization. > WhitespaceTokenizer should tokenize on NBSP > --- > > Key: LUCENE-6874 > URL: https://issues.apache.org/jira/browse/LUCENE-6874 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: David Smiley >Priority: Minor > Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch > > > WhitespaceTokenizer uses [Character.isWhitespace > |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] > to decide what is whitespace. Here's a pertinent excerpt: > bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or > PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', > '\u2007', '\u202F') > Perhaps Character.isWhitespace should have been called > isLineBreakableWhitespace? > I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to > work around but why leave this trap in by default? 
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988646#comment-14988646 ] Jack Krupansky commented on LUCENE-6874: bq. Because WST and WDF should really only be used as a last resort. Absolutely agreed. From a Solr user perspective we really need a much simpler model for semi-standard tokens out of the box, without the user having to scratch their head and resort to WST in the first (last) place. LOL - maybe if we could eliminate this need to resort to WST, we wouldn't have to fret as much about WST. bq. I generally suggest to my users to use ClassicTokenizer Personally, I've always refrained from recommending CT since I thought ST was supposed to replace it and that the email and URL support was considered an excess not worth keeping. I've considered CT as if it were deprecated (which it is not.) And, I never see anybody else recommending it on the user list. And, the fact that it can't handle slashes for product numbers is a deal killer. I'm not sure that I would argue in favor of resurrecting CT as a first-class recommendation, especially since it can't handle non-European languages, but... That said, I do think it is worth separately (from this Jira) considering a fresh, new tokenizer that starts with the goodness of ST and adds in an approximation of the reasons that people resort to WST. Whether that can be an option on ST or has to be a separate tokenizer would need to be debated. I'd prefer an option on ST, either to simply allow embedded special characters or to specify a list or regex of special characters to be allowed or excluded. People would still need to combine NewT with WDF, but at least the tokenization would be more explicit. Personally I would prefer to see an option for whether to retain or strip external punctuation vs. embedded special characters. 
Trailing periods and commas and colons and enclosing parentheses are just the kinds of things we had to resort to WDF for when using WST to retain embedded special characters. And if people really want to be ambitious, a totally new tokenizer that subsumed the good parts of WDF would make the lives of many Solr users much easier. > WhitespaceTokenizer should tokenize on NBSP > --- > > Key: LUCENE-6874 > URL: https://issues.apache.org/jira/browse/LUCENE-6874 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: David Smiley >Priority: Minor > Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch > > > WhitespaceTokenizer uses [Character.isWhitespace > |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] > to decide what is whitespace. Here's a pertinent excerpt: > bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or > PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', > '\u2007', '\u202F') > Perhaps Character.isWhitespace should have been called > isLineBreakableWhitespace? > I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to > work around but why leave this trap in by default? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988615#comment-14988615 ] Jack Krupansky commented on LUCENE-6874: Tika is the other (main?) approach to ingesting text from HTML web pages. I haven't checked exactly what it does on &nbsp;. Maybe [~dsmiley] could elaborate on which use case he was encountering that inspired this Jira issue. > WhitespaceTokenizer should tokenize on NBSP > --- > > Key: LUCENE-6874 > URL: https://issues.apache.org/jira/browse/LUCENE-6874 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: David Smiley >Priority: Minor > Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch > > > WhitespaceTokenizer uses [Character.isWhitespace > |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] > to decide what is whitespace. Here's a pertinent excerpt: > bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or > PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', > '\u2007', '\u202F') > Perhaps Character.isWhitespace should have been called > isLineBreakableWhitespace? > I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to > work around but why leave this trap in by default? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985540#comment-14985540 ] Jack Krupansky commented on LUCENE-6874: +1 for using the Unicode definition of white space rather than the (odd) Java definition. From a Solr user perspective, the fact that Java is used for implementation under the hood should be irrelevant. That said, the Javadoc for WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already. The term "non-breaking white space" explicitly refers to line breaking and has no mention of tokens in either Unicode or traditional casual usage. From a Solr user perspective, there is like zero value to having NBSP from HTML web pages being treated as if it were not traditional white space. From a Solr user perspective, the primary use of whitespace tokenizer is to avoid the fact that standard tokenizer breaks on various special characters such as occur in product numbers. In short, the benefits to Solr users for NBSP being tokenized as white space seem to outweigh any minor use cases for treating it as non-white space. A compatibility mode can be provided if those minor use cases are considered truly worthwhile. > WhitespaceTokenizer should tokenize on NBSP > --- > > Key: LUCENE-6874 > URL: https://issues.apache.org/jira/browse/LUCENE-6874 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: David Smiley >Priority: Minor > > WhitespaceTokenizer uses [Character.isWhitespace > |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] > to decide what is whitespace. Here's a pertinent excerpt: > bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or > PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', > '\u2007', '\u202F') > Perhaps Character.isWhitespace should have been called > isLineBreakableWhitespace? > I think WhitespaceTokenizer should tokenize on this. 
I am aware it's easy to > work around but why leave this trap in by default? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
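The Java-vs-Unicode distinction argued above can be checked with nothing but the JDK. A small illustrative demo (no Lucene involved; the class name is arbitrary):

```java
// NBSP (U+00A0) is a Unicode space character, yet Character.isWhitespace -
// the method WhitespaceTokenizer delegates to - deliberately excludes
// non-breaking spaces, which is why the tokenizer does not split on it.
public class NbspWhitespaceDemo {
    public static void main(String[] args) {
        char nbsp = '\u00A0';
        System.out.println(Character.isSpaceChar(nbsp));  // true: Unicode SPACE_SEPARATOR
        System.out.println(Character.isWhitespace(nbsp)); // false: excluded as non-breaking
        System.out.println(Character.isWhitespace(' '));  // true: ordinary space
    }
}
```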
[jira] [Comment Edited] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP
[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985540#comment-14985540 ] Jack Krupansky edited comment on LUCENE-6874 at 11/2/15 5:34 PM: - +1 for using the Unicode definition of white space rather than the (odd) Java definition. From a Solr user perspective, the fact that Java is used for implementation under the hood should be irrelevant. That said, the Javadoc for WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already. The term "non-breaking white space" explicitly refers to line breaking and has no mention of tokens in either Unicode or traditional casual usage. From a Solr user perspective, there is like zero value to having NBSP from HTML web pages being treated as if it were not traditional white space. From a Solr user perspective, the primary use of whitespace tokenizer is to avoid the fact that standard tokenizer breaks on various special characters such as occur in product numbers. One of the ongoing problems in the Solr community is the sheer amount of time spent explaining nuances and gotchas, even if they do happen to be documented somewhere in the fine print - no sane user reads the fine print anyway. No Solr user actually uses WhitespaceTokenizer directly - they reference WhitespaceTokenizerFactory, and then having to drop down to Lucene and Java for doc is way too much to ask a typical Solr user. Our collective goal should be to minimize nuances and gotchas (IMHO.) In short, the benefits to Solr users for NBSP being tokenized as white space seem to outweigh any minor use cases for treating it as non-white space. A compatibility mode can be provided if those minor use cases are considered truly worthwhile. Ugh... there are plenty of other places in doc for other tokenizers and filters that refer to "whitespace" and need to address this same issue, either to treat NBSP as white space or doc the nuance/gotcha much more thoroughly and effectively. 
OTOH... an alternative view... having so many un/poorly-documented nuances and gotchas is money in the pockets of consultants and a great argument in favor of Solr users maximizing the employment of Solr consultants. was (Author: jkrupan): +1 for using the Unicode definition of white space rather than the (odd) Java definition. From a Solr user perspective, the fact that Java is used for implementation under the hood should be irrelevant. That said, the Javadoc for WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already. The term "non-breaking white space" explicitly refers to line breaking and has no mention of tokens in either Unicode or traditional casual usage. From a Solr user perspective, there is like zero value to having NBSP from HTML web pages being treated as if it were not traditional white space. From a Solr user perspective, the primary use of whitespace tokenizer is to avoid the fact that standard tokenizer breaks on various special characters such as occur in product numbers. In short, the benefits to Solr users for NBSP being tokenized as white space seem to outweigh any minor use cases for treating it as non-white space. A compatibility mode can be provided if those minor use cases are considered truly worthwhile. > WhitespaceTokenizer should tokenize on NBSP > --- > > Key: LUCENE-6874 > URL: https://issues.apache.org/jira/browse/LUCENE-6874 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: David Smiley >Priority: Minor > > WhitespaceTokenizer uses [Character.isWhitespace > |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] > to decide what is whitespace. Here's a pertinent excerpt: > bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or > PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', > '\u2007', '\u202F') > Perhaps Character.isWhitespace should have been called > isLineBreakableWhitespace? 
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to > work around but why leave this trap in by default? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6842) No way to limit the fields cached in memory and leads to OOM when there are thousand of fields (thousands)
[ https://issues.apache.org/jira/browse/LUCENE-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963420#comment-14963420 ] Jack Krupansky commented on LUCENE-6842: Generally, Lucene has few hard limits, but the general guidance is that ultimately you will be limited by available system resources such as RAM and CPU. There may not be any hard limit to the number of fields, but that doesn't mean that you can safely assume that a large number of fields will always work for a limited amount of RAM and CPU. Exactly how much RAM and CPU you need will depend on your specific application, and you will have to test for it yourself - known as a proof of concept. Generally, people have resource problems based on the number of documents rather than the number of fields for each document. You haven't detailed how many documents you are indexing and how many of these fields are actually present in an average document. Who knows, maybe the number of fields is not the problem per se and it is the number of documents that is the cause of the resource issue, or a combination of the two. That said, I will defer to the more senior Lucene committers here, but personally I would suggest that "hundreds" or "low thousands" is a more practical recommended best-practice upper limit on the total number of fields in a Lucene index. Generally, "dozens" or at most "low hundreds" would be most recommended and the safest assumption. Sure, maybe 10,000 fields might actually work, but then the number of documents and operations and query complexity will also come into play. All of that said, I'm sure we are all intently curious why exactly you feel that you need so many fields. 
> No way to limit the fields cached in memory and leads to OOM when there are > thousand of fields (thousands) > -- > > Key: LUCENE-6842 > URL: https://issues.apache.org/jira/browse/LUCENE-6842 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 4.6.1 > Environment: Linux, openjdk 1.6.x >Reporter: Bala Kolla > Attachments: HistogramOfHeapUsage.png > > > I am opening this defect to get some guidance on how to handle a case of > server running out of memory and it seems like it's something to do how we > index. But want to know if there is anyway to reduce the impact of this on > memory usage before we look into the way of reducing the number of fields. > Basically we have many thousands of fields being indexed and it's causing a > large amount of memory being used (25GB) and eventually leading to > application to hang and force us to restart every few minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8160) Terms query parser should optionally do query analysis
[ https://issues.apache.org/jira/browse/SOLR-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955080#comment-14955080 ] Jack Krupansky commented on SOLR-8160: -- The doc is a bit misleading, for both the Term and Terms query parsers: bq. documents matching any of the specified values. This can be useful for generating filter queries from the external human readable terms returned by the faceting or terms components It should be explicit that these are indexed, already analyzed term values, not "external human readable terms" as the doc indicates. > Terms query parser should optionally do query analysis > --- > > Key: SOLR-8160 > URL: https://issues.apache.org/jira/browse/SOLR-8160 > Project: Solr > Issue Type: Improvement > Components: query parsers, search >Affects Versions: 5.3 >Reporter: Devansh Dhutia > > Field setup as > {code} > multiValued="false" required="false" /> > > > > > > > > > > > {code} > Value sent to cs field for indexing include: AA, BB > Following is observed > {code}={!terms f=cs}AA,BB{code} yields 0 results > {code}={!terms f=cs}aa,bb{code} yields 2 results > {code}=cs:(AA BB){code} yields 2 results > {code}=cs:(aa bb){code} yields 2 results > The first variant above should behave like the other 3 & obey query time > analysis -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-6301) Deprecate Filter
[ https://issues.apache.org/jira/browse/LUCENE-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14953084#comment-14953084 ] Jack Krupansky edited comment on LUCENE-6301 at 10/12/15 12:58 PM: --- I know this change has been in progress for awhile, but it just kind of sunk in for me finally, and now I'm wondering what the impact on Solr will be. I mean, wasn't Filter supposed to be a big performance win over a Query since it eliminates the performance impact of scoring? If that was the case, is Lucene providing some alternate method of achieving a similar performance improvement? I think it is, but... not stated quite so explicitly. An example of the expected migration would help a lot. I think the example should be in the Lucene Javadoc - "To filter documents without the performance overhead of scoring, use the following technique..." If I understand properly, one would simply wrap the query in a BooleanQuery with a single clause that uses BooleanQuery.Clause.FILTER and that would have exactly the same effect (and performance gain) as the old Filter class. Is that statement 100% accurate? If so, it would be good to make it explicit here in Jira, in the deprecation comment in the Filter class, and in BooleanQuery as well. Thanks! was (Author: jkrupan): I know this change has been in progress for awhile, but it just kind of sunk in for me finally, and now I'm wondering what the impact on Solr will be. I mean, wasn't Filter supposed to be a big performance win over a Query since it eliminates the performance impact of scoring? If that was the case, is Lucene providing some alternate method of achieving a similar performance improvement? I think it is, but... not stated quite so explicitly. An example of the expected migration would help a lot. I think the example should be in the Lucene Javadoc - "To filter documents without the performance overhead of scoring, use the following technique..." 
If I understand properly, one would simply wrap the query in a BooleanQuery with a single clause that uses BooleanQuery.Clause.FILTER and that would have exactly the same effect (and performance gain) as the old Filter class. Is that statement 100% accurate? If so, it would be good to make it explicit here in Jira, in the deprecation comment in the Filter class, and in BooleanQuery as well. Thanks! > Deprecate Filter > > > Key: LUCENE-6301 > URL: https://issues.apache.org/jira/browse/LUCENE-6301 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Assignee: Adrien Grand > Fix For: 5.2, Trunk > > Attachments: LUCENE-6301.patch, LUCENE-6301.patch > > > It will still take time to completely remove Filter, but I think we should > start deprecating it now to state our intention and encourage users to move > to queries as soon as possible? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6301) Deprecate Filter
[ https://issues.apache.org/jira/browse/LUCENE-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14953084#comment-14953084 ] Jack Krupansky commented on LUCENE-6301: I know this change has been in progress for awhile, but it just kind of sunk in for me finally, and now I'm wondering what the impact on Solr will be. I mean, wasn't Filter supposed to be a big performance win over a Query since it eliminates the performance impact of scoring? If that was the case, is Lucene providing some alternate method of achieving a similar performance improvement? I think it is, but... not stated quite so explicitly. An example of the expected migration would help a lot. I think the example should be in the Lucene Javadoc - "To filter documents without the performance overhead of scoring, use the following technique..." If I understand properly, one would simply wrap the query in a BooleanQuery with a single clause that uses BooleanQuery.Clause.FILTER and that would have exactly the same effect (and performance gain) as the old Filter class. Is that statement 100% accurate? If so, it would be good to make it explicit here in Jira, in the deprecation comment in the Filter class, and in BooleanQuery as well. Thanks! > Deprecate Filter > > > Key: LUCENE-6301 > URL: https://issues.apache.org/jira/browse/LUCENE-6301 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Assignee: Adrien Grand > Fix For: 5.2, Trunk > > Attachments: LUCENE-6301.patch, LUCENE-6301.patch > > > It will still take time to completely remove Filter, but I think we should > start deprecating it now to state our intention and encourage users to move > to queries as soon as possible? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
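The migration asked about above can be sketched with Lucene's actual class names. Note the constant lives on BooleanClause.Occur, not BooleanQuery.Clause; the field names and queries below are illustrative, not taken from the issue's patches:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FilterMigrationSketch {
    public static void main(String[] args) {
        Query userQuery = new TermQuery(new Term("body", "lucene"));
        Query restriction = new TermQuery(new Term("status", "published"));

        // Old world: wrap "restriction" in a Filter (e.g. QueryWrapperFilter).
        // New world: add it as a FILTER clause - it must match, but it
        // contributes nothing to the score, so the scoring work the old
        // Filter avoided is still avoided.
        BooleanQuery bq = new BooleanQuery.Builder()
            .add(userQuery, BooleanClause.Occur.MUST)
            .add(restriction, BooleanClause.Occur.FILTER)
            .build();
        System.out.println(bq);
    }
}
```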
[jira] [Commented] (LUCENE-6301) Deprecate Filter
[ https://issues.apache.org/jira/browse/LUCENE-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14953174#comment-14953174 ] Jack Krupansky commented on LUCENE-6301: Thanks! LGTM. Now let's see if the Solr guys pick up on this. > Deprecate Filter > > > Key: LUCENE-6301 > URL: https://issues.apache.org/jira/browse/LUCENE-6301 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Assignee: Adrien Grand > Fix For: 5.2, Trunk > > Attachments: LUCENE-6301.patch, LUCENE-6301.patch > > > It will still take time to completely remove Filter, but I think we should > start deprecating it now to state our intention and encourage users to move > to queries as soon as possible? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6305) BooleanQuery.equals should ignore clause order
[ https://issues.apache.org/jira/browse/LUCENE-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950255#comment-14950255 ] Jack Krupansky commented on LUCENE-6305: No objection, but it would be good for the javadoc for BQ and BQ.Builder to explicitly state the contract that the order in which clauses are added will not impact the results of the query, their order, or the performance of query execution - assuming those facts are all true. > BooleanQuery.equals should ignore clause order > -- > > Key: LUCENE-6305 > URL: https://issues.apache.org/jira/browse/LUCENE-6305 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Attachments: LUCENE-6305.patch, LUCENE-6305.patch > > > BooleanQuery.equals is sensitive to the order in which clauses have been > added. So for instance "+A +B" would be considered different from "+B +A" > although it generates the same matches and scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
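[Editor's note] Order-insensitive equality is easy to misread as set equality. A self-contained sketch, with plain strings standing in for clauses (illustrative only, not Lucene's implementation), shows the multiset semantics such a contract would describe — "+A +B" equals "+B +A", but a duplicated clause still matters:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClauseOrder {
    // Multiset comparison: order is ignored, but duplicate counts are not.
    static boolean clausesEqual(List<String> a, List<String> b) {
        if (a.size() != b.size()) return false;
        Map<String, Integer> counts = new HashMap<>();
        for (String c : a) counts.merge(c, 1, Integer::sum);
        for (String c : b) {
            // Going negative means b has a clause (or a duplicate) a lacks.
            if (counts.merge(c, -1, Integer::sum) < 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(clausesEqual(List.of("+A", "+B"), List.of("+B", "+A")));
        System.out.println(clausesEqual(List.of("+A", "+A"), List.of("+A", "+B")));
    }
}
```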
[jira] [Commented] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942686#comment-14942686 ] Jack Krupansky commented on LUCENE-6664: Hey [~mikemccand], don't get discouraged, this was a very valuable exercise. I am a solid proponent of getting multi-term synonyms working in a full and robust manner, but I recognize that they just don't fit in cleanly with the existing flat token stream architecture. That's life. In any case, don't give up on this long-term effort. Maybe the best thing for now is to retain the traditional flat synonym filter for compatibility, fully add the new SynonymGraphFilter, and then add the optional ability to enable graph support in the main Lucene query parser. (Alas, Solr has its own fork of the Lucene query parser.) Support within phrase queries is the tricky part. It would also be good to address the issue of non-phrase terms being analyzed separately - the query parser should recognize that adjacent terms without operators are to be analyzed as a group so that multi-token synonyms can be recognized. > Replace SynonymFilter with SynonymGraphFilter > - > > Key: LUCENE-6664 > URL: https://issues.apache.org/jira/browse/LUCENE-6664 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael McCandless >Assignee: Michael McCandless > Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, > LUCENE-6664.patch, usa.png, usa_flat.png > > > Spinoff from LUCENE-6582. > I created a new SynonymGraphFilter (to replace the current buggy > SynonymFilter), that produces correct graphs (does no "graph > flattening" itself). I think this makes it simpler. > This means you must add the FlattenGraphFilter yourself, if you are > applying synonyms during indexing. 
> Index-time syn expansion is a necessarily "lossy" graph transformation > when multi-token (input or output) synonyms are applied, because the > index does not store {{posLength}}, so there will always be phrase > queries that should match but do not, and then phrase queries that > should not match but do. > http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html > goes into detail about this. > However, with this new SynonymGraphFilter, if instead you do synonym > expansion at query time (and don't do the flattening), and you use > TermAutomatonQuery (future: somehow integrated into a query parser), > or maybe just "enumerate all paths and make union of PhraseQuery", you > should get 100% correct matches (not sure about "proper" scoring > though...). > This new syn filter still cannot consume an arbitrary graph. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
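[Editor's note] The "enumerate all paths and make union of PhraseQuery" idea from the quoted description can be sketched with a toy graph model (node numbers and the `Edge` type are invented for illustration; this is not Lucene's TokenStream API). The example encodes "usa" as a synonym of "united states of america" between the same two graph nodes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SynonymPaths {
    // Toy token graph: node -> outgoing (token, targetNode) edges.
    record Edge(String token, int to) {}

    // Enumerate every token path from `from` to `end`; each resulting path
    // would become one PhraseQuery in the query-time union discussed above.
    static List<String> allPaths(Map<Integer, List<Edge>> graph, int from, int end) {
        List<String> out = new ArrayList<>();
        walk(graph, from, end, "", out);
        return out;
    }

    private static void walk(Map<Integer, List<Edge>> g, int node, int end,
                             String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix.trim());
            return;
        }
        for (Edge e : g.getOrDefault(node, List.of())) {
            walk(g, e.to, end, prefix + " " + e.token, out);
        }
    }

    public static void main(String[] args) {
        Map<Integer, List<Edge>> graph = Map.of(
            0, List.of(new Edge("usa", 4), new Edge("united", 1)),
            1, List.of(new Edge("states", 2)),
            2, List.of(new Edge("of", 3)),
            3, List.of(new Edge("america", 4)));
        System.out.println(allPaths(graph, 0, 4));
    }
}
```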
[jira] [Commented] (LUCENE-6821) TermQuery's constructors should clone the incoming term
[ https://issues.apache.org/jira/browse/LUCENE-6821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942310#comment-14942310 ] Jack Krupansky commented on LUCENE-6821: Won't this change risk increasing the amount of GC due to all these extra objects? Might it be advisable to have an alternative constructor that doesn't clone, so that users like Solr can exploit the fact that their code won't be making any further use of the input term? > TermQuery's constructors should clone the incoming term > --- > > Key: LUCENE-6821 > URL: https://issues.apache.org/jira/browse/LUCENE-6821 > Project: Lucene - Core > Issue Type: Task >Reporter: Adrien Grand >Priority: Minor > Attachments: LUCENE-6821.patch > > > This is a follow-up of LUCENE-6435: the bug stems from the fact that you can > build term queries out of shared BytesRef objects (such as the ones returned > by TermsEnum.next), which is a bit trappy. If TermQuery's constructors would > clone the incoming term, we wouldn't have this trap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
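[Editor's note] The trap behind this issue — and the GC tradeoff of cloning — can be shown with a minimal stand-in for the shared-buffer situation. The `Ref` class below is a hypothetical stand-in for Lucene's BytesRef, not Lucene code:

```java
import java.util.Arrays;

public class CloneDemo {
    // Minimal stand-in for BytesRef: a mutable view over a byte[].
    static final class Ref {
        byte[] bytes;
        Ref(byte[] b) { bytes = b; }
        Ref deepCopy() { return new Ref(Arrays.copyOf(bytes, bytes.length)); }
    }

    public static void main(String[] args) {
        byte[] shared = {'a', 'b'};
        Ref aliased = new Ref(shared);            // trap: sees later mutations
        Ref cloned  = new Ref(shared).deepCopy(); // safe: defensive copy costs an allocation
        shared[0] = 'z';                          // buffer reuse, as TermsEnum.next() does
        System.out.println((char) aliased.bytes[0]); // the aliased "query" silently changed
        System.out.println((char) cloned.bytes[0]);  // the cloned one is unaffected
    }
}
```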
[jira] [Commented] (SOLR-7249) Solr engine misses null-values in OR null part for eDisMax parser
[ https://issues.apache.org/jira/browse/SOLR-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363383#comment-14363383 ] Jack Krupansky commented on SOLR-7249: -- It's best to pursue this type of issue on the Solr user list first. Have you added debugQuery=true to your request and looked at the parsed_query in the response? That shows how your query is actually interpreted. You wrote AND -area, but that probably should be NOT area or simply -area. Solr engine misses null-values in OR null part for eDisMax parser --- Key: SOLR-7249 URL: https://issues.apache.org/jira/browse/SOLR-7249 Project: Solr Issue Type: Bug Components: query parsers Affects Versions: 4.10.3 Environment: Windows 7 CentOS 6.6 Reporter: Arsen Li Solr engine misses null-values in OR null part for eDisMax parser For example, I have the following query: ((*:* AND -area:[* TO *]) OR area:[100 TO 300]) AND objectId:40105451 full query path visible in Solr Admin panel is select?q=((*%3A*+AND+-area%3A%5B*+TO+*%5D)+OR+area%3A%5B100+TO+300%5D)+AND+objectId%3A40105451&wt=json&indent=true so, it should return a record if area is between 100 and 300 or area is not declared. It works ok for the default parser, but when I check the edismax checkbox in the Solr admin panel - it returns nothing (area for objectId=40105451 is null). 
The request path is the following: select?q=((*%3A*+AND+-area%3A%5B*+TO+*%5D)+OR+area%3A%5B100+TO+300%5D)+AND+objectId%3A40105451&wt=json&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true However, when I move the query from the q field to the q.alt field - it works ok; the query is select?wt=json&indent=true&defType=edismax&q.alt=((*%3A*+AND+-area%3A%5B*+TO+*%5D)+OR+area%3A%5B100+TO+300%5D)+AND+objectId%3A40105451&stopwords=true&lowercaseOperators=true Note: asterisks are not saved by the editor; refer to http://stackoverflow.com/questions/29059460/solr-misses-or-null-query-when-parsing-by-edismax-parser if more accurate syntax is needed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5507) Admin UI - Refactoring using AngularJS
[ https://issues.apache.org/jira/browse/SOLR-5507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259657#comment-14259657 ] Jack Krupansky commented on SOLR-5507: -- bq. All I ask, though, is that you forgive the occasional burst of ebullient enthusiasm! No need for it to be forgiven... all ebullient enthusiasm is always welcome and encouraged. Admin UI - Refactoring using AngularJS -- Key: SOLR-5507 URL: https://issues.apache.org/jira/browse/SOLR-5507 Project: Solr Issue Type: Improvement Components: web gui Reporter: Stefan Matheis (steffkes) Assignee: Stefan Matheis (steffkes) Priority: Minor Attachments: SOLR-5507.patch On the LSR in Dublin, i've talked again to [~upayavira] and this time we talked about Refactoring the existing UI - using AngularJS: providing (more, internal) structure and what not ; He already started working on the Refactoring, so this is more a 'tracking' issue about the progress he/we do there. Will extend this issue with a bit more context additional information, w/ thoughts about the possible integration in the existing UI and more (: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6892) Make it possible to define update request processors as toplevel components
[ https://issues.apache.org/jira/browse/SOLR-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259665#comment-14259665 ] Jack Krupansky commented on SOLR-6892: -- Thanks for the description updates. Comments...
1. We need to be explicit about how and when the hard-wired processors are invoked - in particular, the run update processor. The log update processor is somewhat special in that it is not mandatory, but a lot of people are not explicitly aware of it, so if they leave it out, they will be wondering why they don't get logging of updates.
2. I suggest three parameters: pre.processors to specify processors before the default chain, post.processors to specify processors after the default chain (before or after run update and log update??), and processors to specify a processor list to completely replace the default chain.
3. Make log update be automatically added at the end unless a nolog processor is specified.
4. Make run update be automatically added at the end unless a norun processor is specified.
5. Discuss processor vs. processors - I prefer the latter since it is explicit, but maybe allow both since the singular/plural can be confusing.
6. Consider supporting both a single parameter with a csv list as well as multiple parameters each with a single value. I prefer having the choice. Having a separate parameter for each processor can be more explicit sometimes.
7. Consider a single-processor parameter with the option to specify the parameters for that processor. That would make it possible to invoke the various field mutating update processors, which would be especially cool and convenient. 
Make it possible to define update request processors as toplevel components Key: SOLR-6892 URL: https://issues.apache.org/jira/browse/SOLR-6892 Project: Solr Issue Type: Bug Reporter: Noble Paul Assignee: Noble Paul The current update processor chain is rather cumbersome and we should be able to use the updateprocessors without a chain. The scope of this ticket is * A new tag updateProcessor becomes a toplevel tag and it will be equivalent to the {{processor}} tag inside {{updateRequestProcessorChain}} . The only difference is that it should require a {{name}} attribute. The {{updateProcessorChain}} tag will continue to exist and it should be possible to define processor inside as well . It should also be possible to reference a named URP in a chain. * Any update request will be able to pass a param {{processor=a,b,c}} , where a,b,c are names of update processors. A just in time chain will be created with those URPs * Some in built update processors (wherever possible) will be predefined with standard names and can be directly used in requests * What happens when I say processor=a,b,c in a request? It will execute the default chain after the just-in-time chain {{a-b-c}} . * How to execute a different chain other than the default chain? the same old mechanism of update.chain=x means that the chain {{x}} will be applied after {{a,b,c}} * How to avoid the default processor chain from being executed ? There will be an implicit URP called {{STOP}} . send your request as processor=a,b,c,STOP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
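[Editor's note] The pre/post/replace composition suggested in the comment above could work roughly as follows. All parameter and processor names here are invented for illustration; this is a sketch of the proposal, not Solr's API:

```java
import java.util.ArrayList;
import java.util.List;

public class ChainComposer {
    // Build the effective processor list from hypothetical request params
    // (pre.processors, post.processors, processors) plus the solrconfig
    // default chain. "nolog"/"norun" are the marker names proposed above.
    static List<String> effectiveChain(List<String> pre, List<String> post,
                                       List<String> replace, List<String> defaults) {
        List<String> chain = new ArrayList<>(pre);
        chain.addAll(replace.isEmpty() ? defaults : replace);
        chain.addAll(post);
        // Hard-wired tail: log and run are appended unless suppressed.
        if (!chain.remove("nolog")) chain.add("LogUpdateProcessor");
        if (!chain.remove("norun")) chain.add("RunUpdateProcessor");
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(effectiveChain(
            List.of("dedupe"), List.of(), List.of(), List.of("defaultChain")));
    }
}
```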
[jira] [Commented] (SOLR-5507) Admin UI - Refactoring using AngularJS
[ https://issues.apache.org/jira/browse/SOLR-5507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259455#comment-14259455 ] Jack Krupansky commented on SOLR-5507: -- This issue has gotten confused. Please clarify the summary and description to inform readers whether the intention is:
1. Simply refactor the implementation to make the code more maintainable and extensible.
2. Add features to the existing UI to cater to advanced users.
3. Revamp the UI itself to cater to new and novice users.
4. Replace the existing UI or supplement it with two UI's, one for novices (guides them through processes) and one for experts (access more features more easily.)
IOW, what are the requirements here? I'm not opposed to any of the above, but the original issue summary and description seemed more focused on the internal implementation rather than the externals of a new UI. Admin UI - Refactoring using AngularJS -- Key: SOLR-5507 URL: https://issues.apache.org/jira/browse/SOLR-5507 Project: Solr Issue Type: Improvement Components: web gui Reporter: Stefan Matheis (steffkes) Assignee: Stefan Matheis (steffkes) Priority: Minor Attachments: SOLR-5507.patch On the LSR in Dublin, i've talked again to [~upayavira] and this time we talked about Refactoring the existing UI - using AngularJS: providing (more, internal) structure and what not ; He already started working on the Refactoring, so this is more a 'tracking' issue about the progress he/we do there. Will extend this issue with a bit more context additional information, w/ thoughts about the possible integration in the existing UI and more (: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6892) Make update processors toplevel components
[ https://issues.apache.org/jira/browse/SOLR-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259566#comment-14259566 ] Jack Krupansky commented on SOLR-6892: -- Issue type should be Improvement, not Bug, right? Make update processors toplevel components --- Key: SOLR-6892 URL: https://issues.apache.org/jira/browse/SOLR-6892 Project: Solr Issue Type: Bug Reporter: Noble Paul Assignee: Noble Paul The current update processor chain is rather cumbersome and we should be able to use the updateprocessors without a chain. The scope of this ticket is * updateProcessor tag becomes a toplevel tag and it will be equivalent to the processor tag inside updateRequestProcessorChain . The only difference is that it should require a {{name}} attribute * Any update request will be able to pass a param {{processor=a,b,c}} , where a,b,c are names of update processors. A just in time chain will be created with those update processors * Some in built update processors (wherever possible) will be predefined with standard names and can be directly used in requests -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6892) Make update processors toplevel components
[ https://issues.apache.org/jira/browse/SOLR-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259567#comment-14259567 ] Jack Krupansky commented on SOLR-6892: -- It might be instructive to look at how the search handler deals with search components and possibly consider rationalizing the two handlers so that there is a little more commonality in how lists of components/processors are specified. For example, consider a first, last, and full processor list. IOW, be able to specify a list of processors to apply before the solrconfig-specified list, after, or to completely replace the solrconfig-specified list of processors. Make update processors toplevel components --- Key: SOLR-6892 URL: https://issues.apache.org/jira/browse/SOLR-6892 Project: Solr Issue Type: Bug Reporter: Noble Paul Assignee: Noble Paul The current update processor chain is rather cumbersome and we should be able to use the updateprocessors without a chain. The scope of this ticket is * updateProcessor tag becomes a toplevel tag and it will be equivalent to the processor tag inside updateRequestProcessorChain . The only difference is that it should require a {{name}} attribute * Any update request will be able to pass a param {{processor=a,b,c}} , where a,b,c are names of update processors. A just in time chain will be created with those update processors * Some in built update processors (wherever possible) will be predefined with standard names and can be directly used in requests -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6852) SimplePostTool should no longer default to collection1
[ https://issues.apache.org/jira/browse/SOLR-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247435#comment-14247435 ] Jack Krupansky commented on SOLR-6852: -- Is this really for 5.0 only and not trunk/6.0 as well? SimplePostTool should no longer default to collection1 -- Key: SOLR-6852 URL: https://issues.apache.org/jira/browse/SOLR-6852 Project: Solr Issue Type: Improvement Reporter: Anshum Gupta Assignee: Anshum Gupta Fix For: 5.0 Attachments: SOLR-6852.patch, SOLR-6852.patch Solr no longer would be bootstrapped with collection1 and so it no longer makes sense for the SimplePostTool to default to collection1 either. Without an explicit collection/core/url value, the call should just fail fast. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4792) stop shipping a war in 5.0
[ https://issues.apache.org/jira/browse/SOLR-4792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229141#comment-14229141 ] Jack Krupansky commented on SOLR-4792: -- As I just noted on the Solr user list, it would be helpful if people could provide a reference to some existing server products that they are attempting to model Solr 5.0 after - or provide a rationale as to why no existing server products provide a model worthy of adopting for Solr. I mean, are we trying to reinvent the wheel here, or what?! So, which existing Apache server product is Solr 5.0 most closely trying to emulate in terms of overall operation as a server and web service? I'd request that the description of this Jira be redone to provide a clearer description of what Solr is expected to look like - from a Solr user perspective - once the infamous war is no longer shipped. I mean, the phrase "we are free to do anything we want" may mean something to some of the more elite devs here, but show a little sympathy to the rest of the Solr community! stop shipping a war in 5.0 -- Key: SOLR-4792 URL: https://issues.apache.org/jira/browse/SOLR-4792 Project: Solr Issue Type: Task Components: Build Reporter: Robert Muir Assignee: Mark Miller Fix For: 5.0, Trunk Attachments: SOLR-4792.patch see the vote on the developer list. This is the first step: if we stop shipping a war then we are free to do anything we want. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6079) PatternReplaceCharFilter crashes JVM with OutOfMemoryError
[ https://issues.apache.org/jira/browse/LUCENE-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227671#comment-14227671 ] Jack Krupansky commented on LUCENE-6079: But the pattern might in fact need the entire input, such as to match the end of the input with $. Still, it would be nice to have an optional chunked mode for cases such as this (assuming the pattern doesn't end with $), such as when the input is the full text of a multi-MB PDF file. I would suggest that such a mode be the default, with a reasonable chunk size such as 100K. There should also be an overlap size: when reading the next chunk, matching would restart within an overlap carried over from the end of the previous chunk, and a match would not be allowed to extend into the overlap area at the end of a chunk (unless it is the last chunk), so that matches can be made across chunk boundaries. Actually, it turns out that there was such a feature, with a maxBlockChars parameter, but it was deprecated long ago - no mention in CHANGES.TXT. But... it's still supported in the factory code, with only a TODO comment suggesting that a warning would be appropriate, but the actual Lucene filter constructor simply ignores this parameter. PatternReplaceCharFilter crashes JVM with OutOfMemoryError -- Key: LUCENE-6079 URL: https://issues.apache.org/jira/browse/LUCENE-6079 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Affects Versions: 4.10.2 Environment: Microsoft Windows, x86_64, 32 GB main memory Reporter: Alexander Veit Priority: Critical PatternReplaceCharFilter fills memory with input data until an OutOfMemoryError is thrown. 
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
    at java.lang.StringBuilder.append(StringBuilder.java:190)
    at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.fill(PatternReplaceCharFilter.java:84)
    at org.apache.lucene.analysis.pattern.PatternReplaceCharFilter.read(PatternReplaceCharFilter.java:74)
    ...
PatternReplaceCharFilter should read data chunk-wise and pass the transformed output chunk-wise to the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
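[Editor's note] The chunk-plus-overlap scheme described in the comment above can be sketched as follows. This is a hypothetical implementation unrelated to the deprecated maxBlockChars code; it assumes a match extends at most `overlap` characters back across a chunk boundary, and requires `chunk > overlap`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkedMatch {
    // Scan `input` in windows of `chunk` chars; each new window re-reads the
    // last `overlap` chars of the previous one so short matches can cross a
    // window boundary without buffering the whole input.
    static List<String> findAll(String input, Pattern p, int chunk, int overlap) {
        List<String> out = new ArrayList<>();
        int start = 0, emittedUpTo = 0;
        while (start < input.length()) {
            int end = Math.min(input.length(), start + chunk);
            boolean lastChunk = (end == input.length());
            Matcher m = p.matcher(input.substring(start, end));
            while (m.find()) {
                // Skip the re-found tail of a match already emitted.
                if (start + m.start() < emittedUpTo) continue;
                // Matches beginning in the trailing overlap zone are deferred
                // to the next window, which re-reads those characters.
                if (!lastChunk && m.start() >= chunk - overlap) continue;
                out.add(m.group());
                emittedUpTo = start + m.end();
            }
            start = lastChunk ? end : end - overlap;
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern words = Pattern.compile("[a-z]+");
        System.out.println(findAll("aa bbbb cc dddd", words, 6, 3));
    }
}
```

Matches longer than the overlap can still be split, which mirrors why the real filter ended up buffering everything.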
[jira] [Commented] (SOLR-4587) Implement Saved Searches a la ElasticSearch Percolator
[ https://issues.apache.org/jira/browse/SOLR-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203509#comment-14203509 ] Jack Krupansky commented on SOLR-4587: -- bq. as long as the API remains the same -1 Just go with a contrib module ASAP, like even today's Luwak in 5.0, and let people get experience with an experimental API, and then debate what the final, non-contrib API should be, or maybe there might be real benefit with multiple modules with somewhat distinct APIs for different use cases. No need to presume that a one-size-fits-all API is necessarily best here. Implement Saved Searches a la ElasticSearch Percolator -- Key: SOLR-4587 URL: https://issues.apache.org/jira/browse/SOLR-4587 Project: Solr Issue Type: New Feature Components: SearchComponents - other, SolrCloud Reporter: Otis Gospodnetic Fix For: Trunk Use Lucene MemoryIndex for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4586) Increase default maxBooleanClauses
[ https://issues.apache.org/jira/browse/SOLR-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14196260#comment-14196260 ] Jack Krupansky commented on SOLR-4586: -- [~yo...@apache.org], I think you just stumbled upon the single most compelling reason for releasing and attracting people to Solr 5.0 - No more Max Boolean Clauses! Increase default maxBooleanClauses -- Key: SOLR-4586 URL: https://issues.apache.org/jira/browse/SOLR-4586 Project: Solr Issue Type: Improvement Affects Versions: 4.2 Environment: 4.3-SNAPSHOT 1456767M - ncindex - 2013-03-15 13:11:50 Reporter: Shawn Heisey Attachments: SOLR-4586.patch, SOLR-4586.patch, SOLR-4586.patch, SOLR-4586.patch, SOLR-4586.patch, SOLR-4586_verify_maxClauses.patch In the #solr IRC channel, I mentioned the maxBooleanClauses limitation to someone asking a question about queries. Mark Miller told me that maxBooleanClauses no longer applies, that the limitation was removed from Lucene sometime in the 3.x series. The config still shows up in the example even in the just-released 4.2. Checking through the source code, I found that the config option is parsed and the value stored in objects, but does not actually seem to be used by anything. I removed every trace of it that I could find, and all tests still pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4586) Increase default maxBooleanClauses
[ https://issues.apache.org/jira/browse/SOLR-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14196266#comment-14196266 ] Jack Krupansky commented on SOLR-4586: -- [~reparker], yeah, this is the known behavior - the first core loaded sets this setting and any subsequent core loads ignore any new setting. So, yes, you need the bounce to change it. Increase default maxBooleanClauses -- Key: SOLR-4586 URL: https://issues.apache.org/jira/browse/SOLR-4586 Project: Solr Issue Type: Improvement Affects Versions: 4.2 Environment: 4.3-SNAPSHOT 1456767M - ncindex - 2013-03-15 13:11:50 Reporter: Shawn Heisey Attachments: SOLR-4586.patch, SOLR-4586.patch, SOLR-4586.patch, SOLR-4586.patch, SOLR-4586.patch, SOLR-4586_verify_maxClauses.patch In the #solr IRC channel, I mentioned the maxBooleanClauses limitation to someone asking a question about queries. Mark Miller told me that maxBooleanClauses no longer applies, that the limitation was removed from Lucene sometime in the 3.x series. The config still shows up in the example even in the just-released 4.2. Checking through the source code, I found that the config option is parsed and the value stored in objects, but does not actually seem to be used by anything. I removed every trace of it that I could find, and all tests still pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5302) Analytics Component
[ https://issues.apache.org/jira/browse/SOLR-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191672#comment-14191672 ] Jack Krupansky commented on SOLR-5302: -- Fix version still says trunk only... but this will be in 5.0 (branch_5x), right? Analytics Component --- Key: SOLR-5302 URL: https://issues.apache.org/jira/browse/SOLR-5302 Project: Solr Issue Type: New Feature Reporter: Steven Bower Assignee: Erick Erickson Fix For: Trunk Attachments: SOLR-5302.patch, SOLR-5302.patch, SOLR-5302.patch, SOLR-5302.patch, SOLR-5302_contrib.patch, Search Analytics Component.pdf, Statistical Expressions.pdf, solr_analytics-2013.10.04-2.patch This ticket is to track a replacement for the StatsComponent. The AnalyticsComponent supports the following features: * All functionality of StatsComponent (SOLR-4499) * Field Faceting (SOLR-3435) ** Support for limit ** Sorting (bucket name or any stat in the bucket ** Support for offset * Range Faceting ** Supports all options of standard range faceting * Query Faceting (SOLR-2925) * Ability to use overall/field facet statistics as input to range/query faceting (ie calc min/max date and then facet over that range * Support for more complex aggregate/mapping operations (SOLR-1622) ** Aggregations: min, max, sum, sum-of-square, count, missing, stddev, mean, median, percentiles ** Operations: negation, abs, add, multiply, divide, power, log, date math, string reversal, string concat ** Easily pluggable framework to add additional operations * New / cleaner output format Outstanding Issues: * Multi-value field support for stats (supported for faceting) * Multi-shard support (may not be possible for some operations, eg median) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5992) Version should not be encoded as a String in the index
[ https://issues.apache.org/jira/browse/LUCENE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160283#comment-14160283 ] Jack Krupansky commented on LUCENE-5992: What about versions of an index during the development process, like each time a change to the index format is committed? Such as the alpha and beta stages in 4.0? I'd be happier with four version ints: major, minor, patch, change. Although, in theory, we shouldn't be changing the index format in either minor or patch releases, bug fixes for indexing can be valid changes as well. Now, the question is whether change should reset to zero each time we branch, or should really just be an ever-increasing index format version number. The latter may make sense, but either is fine. The latter also makes sense given the potential for successive releases that don't introduce index incompatibilities. I lean towards the latter, but it still makes sense to defensively record which release wrote an index. Version should not be encoded as a String in the index -- Key: LUCENE-5992 URL: https://issues.apache.org/jira/browse/LUCENE-5992 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 5.0, Trunk Attachments: LUCENE-5992.patch The version is really just 3 (maybe 4) ints under-the-hood, but today we write it as a String which then requires spooky string tokenization/parsing when we open the index. I think it should be encoded directly as ints. In LUCENE-5952 I had tried to make this change, but it was controversial, and got booted. Then in LUCENE-5969, I tried again, but that issue has morphed (nicely!) into fixing all sorts of things *except* these three ints. Maybe 3rd time's a charm ;) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
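[Editor's note] The four-ints proposal from the comment above can be sketched directly: write the version components as fixed-width ints instead of tokenizing a string like "4.10.2" at read time. This is a hypothetical encoding for illustration, not the patch on the issue:

```java
import java.nio.ByteBuffer;

public class VersionInts {
    // Pack major/minor/patch/change as four big-endian ints (16 bytes).
    static byte[] encode(int major, int minor, int patch, int change) {
        return ByteBuffer.allocate(16)
                .putInt(major).putInt(minor).putInt(patch).putInt(change)
                .array();
    }

    // Read them back without any string parsing.
    static int[] decode(byte[] b) {
        ByteBuffer buf = ByteBuffer.wrap(b);
        return new int[] { buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt() };
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(decode(encode(5, 0, 0, 1))));
    }
}
```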
[jira] [Commented] (LUCENE-5989) Add BinaryField, to index a single binary token
[ https://issues.apache.org/jira/browse/LUCENE-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159511#comment-14159511 ] Jack Krupansky commented on LUCENE-5989: bq. rename StringField to KeywordField, making it more obvious that this field isn't tokenized. Then a KeywordsField can take a String or BytesRef in ctors. Both Lucene and Solr are suffering from a conflation of the two concepts of treating an input stream as a single token (a keyword) and as a sequence of tokens (sequence of keywords). We have the KeywordTokenizer that does NOT tokenize the input stream into a sequence of keywords. The term keyword search is commonly used to describe the ability of search engines to find individual keywords in extended streams of text - a clear reference to keyword in a tokenized stream. So, I don't understand how it is claimed that renaming StringField to KeywordField is making anything obvious - it seems to me to be adding to the existing confusion rather than clarifying anything. I mean, the term keyword should be treated more as a synonym for token or term, NOT as a synonym for string or raw character sequence. I agree that we need a term for raw, uninterpreted character sequence, but it seems to me that string is a more obvious candidate than keyword. There has been some grumbling at the Solr level that KeywordTokenizer should be renamed to... something, anything, but just not KeywordTokenizer, which obviously implies that the input stream will be tokenized into a sequence of keywords, which it does not. In an effort to try to resolve this ongoing confusion, can somebody provide some historical background as to how KeywordTokenizer got its name, and how a subset of people continue to refer to an uninterpreted sequence of characters as a keyword rather than a string? I checked the Javadoc, Jira, and even the source code, but came up empty. 
In short, it is a real eye-opener to see a claim that the term keyword in any way makes it obvious that input is not tokenized!! Maybe we could fix this for 5.0 to have a cleaner set of terminology going forward. At a minimum, we should have some clarifying language in the Javadoc. And hopefully we can refrain from making the confusion/conflation worse by renaming StringField to KeywordField. bq. Then a KeywordsField can take a String Is that simply a typo or is the intent to have both a KeywordField (singular) and a KeywordsField (plural)? I presume it is a typo, but... maybe it's a Freudian slip and highlights this semantic difficulty that persists in the Lucene terminology (and hence infects Solr terminology as well.) Add BinaryField, to index a single binary token --- Key: LUCENE-5989 URL: https://issues.apache.org/jira/browse/LUCENE-5989 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 5.0, Trunk Attachments: LUCENE-5989.patch 5 years ago (LUCENE-1458) we enabled fully binary terms in the lowest levels of Lucene (the codec APIs) yet today, actually adding an arbitrary byte[] binary term during indexing is far from simple: you must make a custom Field with a custom TokenStream and a custom TermToBytesRefAttribute, as far as I know. This is supremely expert, I wonder if anyone out there has succeeded in doing so? I think we should make indexing a single byte[] as simple as indexing a single String. This is a pre-cursor for issues like LUCENE-5596 (encoding IPv6 address as byte[16]) and LUCENE-5879 (encoding native numeric values in their simple binary form). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
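The conflation the comment describes can be made concrete with a toy contrast in plain Java (this is not the Lucene analysis API): "keyword search" suggests splitting text into many keywords, while KeywordTokenizer-style behavior emits the whole input as a single token.

```java
import java.util.Arrays;
import java.util.List;

// Toy contrast between the two meanings of "keyword" that the comment says
// are being conflated. Plain Java only; not the actual Lucene analysis chain.
public class TokenizeDemo {
    // "Keyword search" sense: the input is tokenized into many keywords.
    static List<String> asKeywords(String input) {
        return Arrays.asList(input.trim().split("\\s+"));
    }

    // KeywordTokenizer sense: the entire input is emitted as one token.
    static List<String> asSingleToken(String input) {
        return List.of(input);
    }
}
```

For the input "quick brown fox", the first method yields three keywords while the second yields the one untokenized string — which is why the name KeywordTokenizer reads as the opposite of what the class does.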
[jira] [Commented] (SOLR-6568) Join Discovery Contrib
[ https://issues.apache.org/jira/browse/SOLR-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150589#comment-14150589 ] Jack Krupansky commented on SOLR-6568: -- This sounds quite interesting, but... it's tagged as minor, so... what's the catch or limitation that prevents this from being a major? Does it work well or at all for indexes that are not 100% memory resident? What about SSD? Does it only work with integer join keys? Is that a restriction that could be relaxed? Or possibly have two parallel components, one that is super fast for integer keys and only reasonably fast for non-integer keys. Might it be possible to build an off-heap map from non-integer key to a temporary integer key? Join Discovery Contrib -- Key: SOLR-6568 URL: https://issues.apache.org/jira/browse/SOLR-6568 Project: Solr Issue Type: New Feature Reporter: Joel Bernstein Assignee: Joel Bernstein Priority: Minor Fix For: 5.0 This contribution was commissioned by the *NCBI* (National Center for Biotechnology Information). The Join Discovery Contrib is a set of Solr plugins that support large scale joins and join facets between Solr cores. There are two different Join implementations included in this contribution. Both implementations are designed to work with integer join keys. It is very common in large BioInformatic and Genomic databases to use integer primary and foreign keys. Integer keys allow Bioinformatic and Genomic search engines and discovery tools to perform complex operations on large data sets very efficiently. The Join Discovery Contrib provides features that will be applicable to anyone working with the freely available databases from the NCBI and likely a large number of other BioInformatic and Genomic databases. These features are not specific though to Bioinformatics and Genomics, they can be used in any datasets where integer keys are used to define the primary and foreign keys. 
What is included in this contrib: 1) A new JoinComponent. This component is used instead of the standard QueryComponent. It facilitates very large scale relational joins between two Solr indexes (cores). The join algorithm used in this component is known as a *parallel partitioned merge join*. This is an algorithm which partitions the results from both sides of the join and then sorts and merges the partitions in parallel. Below are some of its features: * Sub-second performance on very large joins. The parallel join algorithm is capable of sub-second performance on joins with tens of millions of records on both sides of the join. * The JoinComponent returns tuples with fields from both sides of the join. The initial release returns the primary keys from both sides of the join and the join key. * The tuples also include, and are ranked by, a combined score from both sides of the join. * Special purpose memory-mapped on-disk indexes to support \*:\* joins. This makes it possible to join an entire index with a sub-set of another index with sub-second performance. * Support for very fast one-to-one, one-to-many and many-to-many joins. Fast many-to-many joins make it possible to join between indexes on multi-value fields. 2) A new JoinFacetComponent. This component provides facets for both indexes involved in the join. 3) The BitSetJoinQParserPlugin. A very fast parallel filter join based on bitsets that supports infinite levels of nesting. It can be used as a filter query in combination with the JoinComponent or with the standard query component. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
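The "off-heap map from non-integer key to a temporary integer key" suggested in the comment amounts to dictionary encoding. A minimal on-heap sketch (hypothetical; the contrib itself supports only integer keys, and a real variant would presumably live off-heap as the comment proposes):

```java
import java.util.HashMap;
import java.util.Map;

// Dictionary encoding: give each distinct non-integer join key a dense
// temporary int id, so integer-only join machinery can be reused unchanged.
// On-heap HashMap for brevity; the comment suggests an off-heap variant.
public class KeyDictionary {
    private final Map<String, Integer> ids = new HashMap<>();

    public int idFor(String key) {
        Integer id = ids.get(key);
        if (id == null) {
            id = ids.size(); // next dense id: 0, 1, 2, ...
            ids.put(key, id);
        }
        return id;
    }
}
```

Because the ids are dense, both sides of a join can be partitioned and merged on ints exactly as with native integer keys, at the cost of one dictionary lookup per key.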
[jira] [Comment Edited] (SOLR-6568) Join Discovery Contrib
[ https://issues.apache.org/jira/browse/SOLR-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150589#comment-14150589 ] Jack Krupansky edited comment on SOLR-6568 at 9/27/14 1:53 PM: --- This sounds quite interesting, but... it's tagged as minor, so... what's the catch or limitation that prevents this from being a major? Does it work well or at all for indexes that are not 100% memory resident? What about SSD? Does it only work with integer join keys? Is that a restriction that could be relaxed? Or possibly have two parallel components, one that is super fast for integer keys and only reasonably fast for non-integer keys. Might it be possible to build an off-heap map from non-integer key to a temporary integer key? was (Author: jkrupan): This sounds quite interesting, but... it's tagged as minor, so... what's the catch or limitation that prevents this from being a major? Does it well well or at all for indexes that are not 100% memory resident? What about SSD? Does it only work with integer join keys? Is that a restriction that could be relaxed? Or possibly have two parallel components, one that is super fast for integer keys and only reasonably fast for non-integer keys. Might it be possible to build an off-heap map from non-integer key to a temporary integer key? Join Discovery Contrib -- Key: SOLR-6568 URL: https://issues.apache.org/jira/browse/SOLR-6568 Project: Solr Issue Type: New Feature Reporter: Joel Bernstein Assignee: Joel Bernstein Priority: Minor Fix For: 5.0 This contribution was commissioned by the *NCBI* (National Center for Biotechnology Information). The Join Discovery Contrib is a set of Solr plugins that support large scale joins and join facets between Solr cores. There are two different Join implementations included in this contribution. Both implementations are designed to work with integer join keys. 
It is very common in large BioInformatic and Genomic databases to use integer primary and foreign keys. Integer keys allow Bioinformatic and Genomic search engines and discovery tools to perform complex operations on large data sets very efficiently. The Join Discovery Contrib provides features that will be applicable to anyone working with the freely available databases from the NCBI and likely a large number of other BioInformatic and Genomic databases. These features are not specific though to Bioinformatics and Genomics, they can be used in any datasets where integer keys are used to define the primary and foreign keys. What is included in this contrib: 1) A new JoinComponent. This component is used instead of the standard QueryComponent. It facilitates very large scale relational joins between two Solr indexes (cores). The join algorithm used in this component is known as a *parallel partitioned merge join*. This is an algorithm which partitions the results from both sides of the join and then sorts and merges the partitions in parallel. Below are some of its features: * Sub-second performance on very large joins. The parallel join algorithm is capable of sub-second performance on joins with tens of millions of records on both sides of the join. * The JoinComponent returns tuples with fields from both sides of the join. The initial release returns the primary keys from both sides of the join and the join key. * The tuples also include, and are ranked by, a combined score from both sides of the join. * Special purpose memory-mapped on-disk indexes to support \*:\* joins. This makes it possible to join an entire index with a sub-set of another index with sub-second performance. * Support for very fast one-to-one, one-to-many and many-to-many joins. Fast many-to-many joins make it possible to join between indexes on multi-value fields. 2) A new JoinFacetComponent. This component provides facets for both indexes involved in the join. 
3) The BitSetJoinQParserPlugin. A very fast parallel filter join based on bitsets that supports infinite levels of nesting. It can be used as a filter query in combination with the JoinComponent or with the standard query component. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6445) Allow flexible JSON input
[ https://issues.apache.org/jira/browse/SOLR-6445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113800#comment-14113800 ] Jack Krupansky commented on SOLR-6445: -- +1 for violating the JSON standard! Okay, sure maybe we should have an option to require strict JSON, but it should default to false. Could we support unquoted simple name values as well? Like: {code} {id: my-key} {code} And if people strenuously object, maybe we just need to have a Solr JSON (SJSON or SON - Solr Object Notation) format with the relaxed rules. Allow flexible JSON input -- Key: SOLR-6445 URL: https://issues.apache.org/jira/browse/SOLR-6445 Project: Solr Issue Type: Improvement Reporter: Noble Paul Support single quotes and unquoted keys {code:javascript} //all the following must be valid and equivalent {id :mykey} {'id':'mykey'} {id: mykey} {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
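One way to support the relaxed forms in the issue description without a full custom parser is a pre-processing pass that rewrites them into strict JSON. The sketch below is a hypothetical toy: it assumes flat objects whose strings contain no embedded quotes, colons, or braces, and it would wrongly quote literals like true or numbers.

```java
import java.util.regex.Pattern;

// Toy pre-processor rewriting the issue's relaxed forms into strict JSON.
// Hypothetical sketch only: assumes flat objects with no embedded quotes,
// colons, or braces inside string values.
public class RelaxedJson {
    private static final Pattern BARE_KEY =
        Pattern.compile("([{,]\\s*)([A-Za-z_][\\w-]*)(\\s*:)");
    private static final Pattern BARE_VALUE =
        Pattern.compile("(:\\s*)([A-Za-z_][\\w-]*)");

    public static String toStrict(String relaxed) {
        String s = relaxed.replace('\'', '"');               // single -> double quotes
        s = BARE_KEY.matcher(s).replaceAll("$1\"$2\"$3");    // quote bare keys
        return BARE_VALUE.matcher(s).replaceAll("$1\"$2\""); // quote bare values
    }
}
```

All three forms from the issue ({id :mykey}, {'id':'mykey'}, {id: mykey}) normalize to the same strict object under this sketch, which is the equivalence the issue asks for.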
[jira] [Commented] (SOLR-3619) Rename 'example' dir to 'server' and pull examples into an 'examples' directory
[ https://issues.apache.org/jira/browse/SOLR-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090765#comment-14090765 ] Jack Krupansky commented on SOLR-3619: -- I do hope that people use Elasticsearch as a high-priority benchmark for whether standing up a tutorial or production instance of Solr is easy enough. I mean, I still hear plenty of chatter that Solr is too hard. Granted, a lot of that is just perception, but the final result of this issue should be that Solr has two SHORT web pages for those two use cases that clearly show that Solr is just as easy to stand up as Elasticsearch. Elasticsearch says "Installation is a snap." Solr needs to be able to do the same. Rename 'example' dir to 'server' and pull examples into an 'examples' directory --- Key: SOLR-3619 URL: https://issues.apache.org/jira/browse/SOLR-3619 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Timothy Potter Fix For: 4.9, 5.0 Attachments: SOLR-3619.patch, SOLR-3619.patch, managed-schema, server-name-layout.png, solrconfig.xml -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6315) Remove SimpleOrderedMap
[ https://issues.apache.org/jira/browse/SOLR-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14083962#comment-14083962 ] Jack Krupansky commented on SOLR-6315: -- Is order a part of the contract for the usages of this class? I mean, the current Javadoc does explicitly say that repetition and null values are NOT a part of the contract, but it doesn't say that order, another feature of NamedList, is not important, while the name itself says Ordered. Kind of ambiguous, so a first order (Hah!) of business is to clarify whether maintaining order is a part of the contract, and then to validate that contract with actual usages. Switching to Map implies that order is no longer part of the contract, so it will be free to vary from release to release or between JVMs. Personally, I wish that Map were named UnorderedMap, or even UnstableOrderMap, to make the contract crystal clear. In fact it would be great to have the ordering of serialization of Map be a seeded random test framework parameter to catch cases where the code or test cases have become dependent on order of map serialization or any other non-contract behavior for that matter. Will this change have ANY behavior change that will be visible to Solr application developers or users? Remove SimpleOrderedMap --- Key: SOLR-6315 URL: https://issues.apache.org/jira/browse/SOLR-6315 Project: Solr Issue Type: Improvement Components: clients - java Reporter: Shai Erera Assignee: Shai Erera Attachments: SOLR-6315.patch As I described on SOLR-912, SimpleOrderedMap is a redundant and generally useless class, with confusing jdocs. We should remove it. I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
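The seeded-random serialization idea in the comment can be sketched as a test helper: serialize a Map's entries in an order derived from the test framework's seed, so code that silently depends on iteration order fails reproducibly. A hypothetical illustration, not part of Solr's test framework:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical test helper: emit a Map's entries in a seeded-random order so
// tests that silently depend on serialization order fail reproducibly under
// a given seed, in the spirit of the comment's suggestion.
public class ShuffledSerializer {
    public static String serialize(Map<String, String> map, long seed) {
        List<String> keys = new ArrayList<>(map.keySet());
        Collections.sort(keys);                      // canonical order first
        Collections.shuffle(keys, new Random(seed)); // then a reproducible shuffle
        StringBuilder sb = new StringBuilder("{");
        for (String k : keys) {
            if (sb.length() > 1) sb.append(",");
            sb.append(k).append("=").append(map.get(k));
        }
        return sb.append("}").toString();
    }
}
```

Sorting before shuffling makes the output a pure function of (contents, seed), so a failing order-dependence bug can be replayed from the seed alone.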
[jira] [Commented] (LUCENE-5867) Add BooleanSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082778#comment-14082778 ] Jack Krupansky commented on LUCENE-5867: Would this be expected to result in any dramatic improvement in indexing or query performance, or a dramatic reduction in index size? Add BooleanSimilarity - Key: LUCENE-5867 URL: https://issues.apache.org/jira/browse/LUCENE-5867 Project: Lucene - Core Issue Type: New Feature Reporter: Robert Muir Attachments: LUCENE-5867.patch This can be used when the user doesn't want tf/idf scoring for some reason. The idea is that the score is just query_time_boost * index_time_boost, no queryNorm/IDF/TF/lengthNorm... -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6103) Add DateRangeField
[ https://issues.apache.org/jira/browse/SOLR-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082978#comment-14082978 ] Jack Krupansky commented on SOLR-6103: -- You might want to take a peek at the LucidWorks Search query parser support of date queries. It would be so nice to have comparable date support in Solr itself. It includes the ability to auto-expand a simple partial date/time term into a full range, as well as using partial date/time in explicit range queries. See: http://docs.lucidworks.com/display/lweug/Date+Queries Add DateRangeField -- Key: SOLR-6103 URL: https://issues.apache.org/jira/browse/SOLR-6103 Project: Solr Issue Type: New Feature Components: spatial Reporter: David Smiley Assignee: David Smiley Fix For: 5.0 Attachments: SOLR-6103.patch LUCENE-5648 introduced a date range index search capability in the spatial module. This issue is for a corresponding Solr FieldType to be named DateRangeField. LUCENE-5648 includes a parseCalendar(String) method that parses a superset of Solr's strict date format. It also parses partial dates (e.g.: 2014-10 has month specificity), and the trailing 'Z' is optional, and a leading +/- may be present (minus indicates BC era), and * means all-time. The proposed field type would use it to parse a string and also both ends of a range query, but furthermore it will also allow an arbitrary range query of the form {{calspec TO calspec}} such as: {noformat}2000 TO 2014-05-21T10{noformat} Which parses as the year 2000 thru 2014 May 21st 10am (GMT). I suggest this syntax because it is aligned with Lucene's range query syntax. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6103) Add DateRangeField
[ https://issues.apache.org/jira/browse/SOLR-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14083031#comment-14083031 ] Jack Krupansky commented on SOLR-6103: -- One nuance is for the end of the range - [2010 TO 2012] should expand the starting date to the beginning of that period, but expand the ending date to the end of that period (2012-12-31T23:59:59.999Z). And [2010 TO 2012} would expand the ending date to the beginning (rather than the ending) of the period (2012-01-01T00:00:00Z), with the exclusive flag set as well. Add DateRangeField -- Key: SOLR-6103 URL: https://issues.apache.org/jira/browse/SOLR-6103 Project: Solr Issue Type: New Feature Components: spatial Reporter: David Smiley Assignee: David Smiley Fix For: 5.0 Attachments: SOLR-6103.patch LUCENE-5648 introduced a date range index search capability in the spatial module. This issue is for a corresponding Solr FieldType to be named DateRangeField. LUCENE-5648 includes a parseCalendar(String) method that parses a superset of Solr's strict date format. It also parses partial dates (e.g.: 2014-10 has month specificity), and the trailing 'Z' is optional, and a leading +/- may be present (minus indicates BC era), and * means all-time. The proposed field type would use it to parse a string and also both ends of a range query, but furthermore it will also allow an arbitrary range query of the form {{calspec TO calspec}} such as: {noformat}2000 TO 2014-05-21T10{noformat} Which parses as the year 2000 thru 2014 May 21st 10am (GMT). I suggest this syntax because it is aligned with Lucene's range query syntax. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
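The endpoint rule in the comment — lower (and exclusive-upper) bounds snap to the start of the period, inclusive upper bounds snap to its last millisecond — can be sketched for year-granularity terms with java.time. This is a hypothetical illustration, not the DateRangeField implementation:

```java
import java.time.LocalDateTime;
import java.time.Year;

// Hypothetical endpoint expansion for a year-granularity range term, per the
// comment: [.. TO 2012] snaps to the year's last millisecond, while
// [.. TO 2012} (exclusive) snaps to the year's first instant.
public class YearEndpoint {
    static LocalDateTime startOf(int year) {
        return Year.of(year).atDay(1).atStartOfDay();
    }

    static LocalDateTime endOf(int year) {
        Year y = Year.of(year);
        // Last day of the year at millisecond resolution, matching the
        // 2012-12-31T23:59:59.999Z example above.
        return y.atDay(y.length()).atTime(23, 59, 59, 999_000_000);
    }
}
```

A month- or day-granularity term would expand the same way, just over a shorter period; only the inclusive/exclusive flag decides which of the two snap points applies.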
[jira] [Comment Edited] (SOLR-6103) Add DateRangeField
[ https://issues.apache.org/jira/browse/SOLR-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14083031#comment-14083031 ] Jack Krupansky edited comment on SOLR-6103 at 8/1/14 10:42 PM: --- One nuance is for the end of the range - [2010 TO 2012] should expand the starting date to the beginning of that period, but expand the ending date to the end of that period (2012-12-31T23:59:59.999Z). And [2010 TO 2012} would expand the ending date to the beginning (rather than the ending) of the period (2012-01-01T00:00:00Z), with the exclusive flag set as well. was (Author: jkrupan): Once nuance is for the end of the range - [2010 TO 2012] should expand the starting date to the beginning of that period, but expand the ending date to the end of that period (2012-12-31T23:59:59.999Z). And [2010 TO 2012} would expand the ending date to the beginning (rather than the ending) of the period (2012-01-01T00:00:00Z), with the exclusive flag set as well. Add DateRangeField -- Key: SOLR-6103 URL: https://issues.apache.org/jira/browse/SOLR-6103 Project: Solr Issue Type: New Feature Components: spatial Reporter: David Smiley Assignee: David Smiley Fix For: 5.0 Attachments: SOLR-6103.patch LUCENE-5648 introduced a date range index search capability in the spatial module. This issue is for a corresponding Solr FieldType to be named DateRangeField. LUCENE-5648 includes a parseCalendar(String) method that parses a superset of Solr's strict date format. It also parses partial dates (e.g.: 2014-10 has month specificity), and the trailing 'Z' is optional, and a leading +/- may be present (minus indicates BC era), and * means all-time. The proposed field type would use it to parse a string and also both ends of a range query, but furthermore it will also allow an arbitrary range query of the form {{calspec TO calspec}} such as: {noformat}2000 TO 2014-05-21T10{noformat} Which parses as the year 2000 thru 2014 May 21st 10am (GMT). 
I suggest this syntax because it is aligned with Lucene's range query syntax. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5859) Remove Version.java completely
[ https://issues.apache.org/jira/browse/LUCENE-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080778#comment-14080778 ] Jack Krupansky commented on LUCENE-5859: bq. users don't even understand how this versioning works anyway Anybody want to take a shot at a clear description that will make sense to the rest of us? I mean, don't normal users simply want precisely one thing - back compat with their existing index, like always, plus auto upgrade when that is sensible? Should a non-expert user EVER be setting the version explicitly? Some advanced or expert users want to create indexes for a specific release, but let's not confuse them with normal users. I concede that this may be an overly simplistic view, but I think we should start with where normal users should want to be, and at least elaborate in the language of normal users precisely what additional considerations they need to keep in mind and decisions they will have to make and what factors they will need to consider, with specific recommendations. And this is just Lucene. Solr... will it stay unchanged at the API level, or is this Lucene change going to ripple out to Solr users as well? Remove Version.java completely -- Key: LUCENE-5859 URL: https://issues.apache.org/jira/browse/LUCENE-5859 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Fix For: 5.0 Attachments: LUCENE-5859_dead_code.patch This has always been a mess: analyzers are easy enough to make on your own, we don't need to take responsibility for the users analysis chain for 2 major releases. The code maintenance is horrible here. This creates a huge usability issue too, and as seen from numerous mailing list issues, users don't even understand how this versioning works anyway. I'm sure someone will whine if i try to remove these constants, but we can at least make no-arg ctors forwarding to VERSION_CURRENT so that people who don't care about back compat (e.g. 
just prototyping) don't have to deal with the horribly complex versioning system. If you want to make the argument that doing this is trappy (I heard this before), I think that's bogus, and I'll counter by trying to remove them. Either way, I'm personally not going to add any of this kind of back compat logic myself ever again. Updated: description of the issue updated as expected. We should remove this API completely. No one else on the planet has APIs that require a mandatory version parameter. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5849) Scary read past EOF in RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-5849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079873#comment-14079873 ] Jack Krupansky commented on LUCENE-5849: Any sense of whether this is JVM-dependent? Or whether it is an issue for the JVM itself? Scary read past EOF in RAMDir --- Key: LUCENE-5849 URL: https://issues.apache.org/jira/browse/LUCENE-5849 Project: Lucene - Core Issue Type: Bug Reporter: Michael McCandless Attachments: TestBinaryDocIndex.java Nightly build hit this: http://builds.flonkings.com/job/Lucene-trunk-Linux-Java7-64-test-only/91095 And I'm able to repro at least once after beasting w/ the right JVM (1.7.0_55) and G1GC. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5843) IndexWriter should refuse to create an index with more than INT_MAX docs
[ https://issues.apache.org/jira/browse/LUCENE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072300#comment-14072300 ] Jack Krupansky commented on LUCENE-5843: That Solr Jira has my comments as well, but I just want to reiterate that the actual limit should be more clearly documented. I filed a Jira for that quite a while ago - LUCENE-4104. And if this new issue resolves the problem, please mark my old LUCENE-4105 issue as a duplicate. IndexWriter should refuse to create an index with more than INT_MAX docs Key: LUCENE-5843 URL: https://issues.apache.org/jira/browse/LUCENE-5843 Project: Lucene - Core Issue Type: Bug Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 5.0, 4.10 It's more and more common for users these days to create very large indices, e.g. indexing lines from log files, or packets on a network, etc., and it's not hard to accidentally exceed the maximum number of documents in one index. I think the limit is actually Integer.MAX_VALUE-1 docs, because we use that value as a sentinel during searching. I'm not sure what IW does today if you create a too-big index but it's probably horrible; it may succeed and then at search time you hit nasty exceptions when we overflow int. I think it should throw an IndexFullException instead. It'd be nice if we could do this on the very doc that when added would go over the limit, but I would also settle for just throwing at flush as well ... i.e. I think what's really important is that the index does not become unusable. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
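The guard the issue asks for is simple to sketch: reject an add that would push the document count past Integer.MAX_VALUE - 1 (since MAX_VALUE itself is used as a search-time sentinel). A hypothetical illustration using a plain IllegalStateException, not the IndexFullException the issue proposes:

```java
// Hypothetical guard in the spirit of LUCENE-5843: refuse an add that would
// push the doc count past the searchable limit (Integer.MAX_VALUE - 1,
// because MAX_VALUE is used as a sentinel during searching).
public class DocLimit {
    public static final int MAX_DOCS = Integer.MAX_VALUE - 1;

    public static void checkCanAdd(long currentDocCount, int toAdd) {
        // long arithmetic so the check itself cannot overflow int.
        if (currentDocCount + toAdd > MAX_DOCS) {
            throw new IllegalStateException(
                "index would exceed " + MAX_DOCS + " documents");
        }
    }
}
```

Checking per-add (rather than at flush) is what keeps the index usable: the failing document is rejected before anything is written.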
[jira] [Commented] (SOLR-6260) Rename DirectUpdateHandler2
[ https://issues.apache.org/jira/browse/SOLR-6260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069198#comment-14069198 ] Jack Krupansky commented on SOLR-6260: -- I noticed that the SolrCore code does in fact default the update handler class to DUH2/SUH if the class attribute is not specified, so maybe the upgrade instructions can simply be for users to remove the updateHandler class attribute, rather than for them to have to learn yet another internal name. And I would reiterate my proposal to remove the class attribute from the example solrconfig.xml files, for both 5.0 and 4.x. Either way, the patch should include changes to the Upgrading section of CHANGES.txt. Do those three things and then I'm an easy +1! Rename DirectUpdateHandler2 --- Key: SOLR-6260 URL: https://issues.apache.org/jira/browse/SOLR-6260 Project: Solr Issue Type: Improvement Affects Versions: 5.0 Reporter: Tomás Fernández Löbbe Assignee: Mark Miller Priority: Minor Attachments: SOLR-6260.patch, SOLR-6260.patch DirectUpdateHandler was removed, I think in Solr 4. DirectUpdateHandler2 should be renamed, at least remove that 2. I don't know really what direct means here. Maybe it could be renamed to DefaultUpdateHandler, or UpdateHandlerDefaultImpl, or other good suggestions -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-6260) Rename DirectUpdateHandler2
[ https://issues.apache.org/jira/browse/SOLR-6260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069198#comment-14069198 ] Jack Krupansky edited comment on SOLR-6260 at 7/21/14 8:20 PM: --- I noticed that the SolrCore code does in fact default the update handler class to DUH2/SUH if the class attribute is not specified, so maybe the upgrade instructions can simply be for users to remove the updateHandler class attribute, rather than for them to have to learn yet another internal name. And I would reiterate my proposal to remove the class attribute from the example solrconfig.xml files, for both 5.0 and 4.x. Either way, the patch should include changes to the Upgrading section of CHANGES.txt. Do those three things and then I'm an easy +1! was (Author: jkrupan): I noticed that the SolrCore code does in fact default the update handler class to DIH2/SIH if the class attribute is not specified, so maybe the upgrade instructions can simply be for users to remove the updateHandler class attribute, rather than for them to have to learn yet another internal name. And I would reiterate my proposal to remove the class attribute from the example solrconfig.xml files, for both 5.0 and 4.x. Either way, the patch should include changes to the Upgrading section of CHANGES.txt. Do those three things and then I'm an easy +1! Rename DirectUpdateHandler2 --- Key: SOLR-6260 URL: https://issues.apache.org/jira/browse/SOLR-6260 Project: Solr Issue Type: Improvement Affects Versions: 5.0 Reporter: Tomás Fernández Löbbe Assignee: Mark Miller Priority: Minor Attachments: SOLR-6260.patch, SOLR-6260.patch DirectUpdateHandler was removed, I think in Solr 4. DirectUpdateHandler2 should be renamed, at least remove that 2. I don't know really what direct means here. 
Maybe it could be renamed to DefaultUpdateHandler, or UpdateHandlerDefaultImpl, or other good suggestions -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
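Since SolrCore already defaults the update handler class, the upgrade note proposed above would amount to deleting a single attribute from solrconfig.xml. A rough before/after sketch (the child elements shown are just illustrative stock-config content):

```xml
<!-- Before: the internal class name leaks into user-facing config -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit><maxTime>15000</maxTime></autoCommit>
</updateHandler>

<!-- After: omit the class attribute and let SolrCore pick the default -->
<updateHandler>
  <autoCommit><maxTime>15000</maxTime></autoCommit>
</updateHandler>
```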
[jira] [Commented] (SOLR-3619) Rename 'example' dir to 'server' and pull examples into an 'examples' directory
[ https://issues.apache.org/jira/browse/SOLR-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067950#comment-14067950 ] Jack Krupansky commented on SOLR-3619: -- bq. a database like MySQL But Solr isn't a database! (Nor is Elasticsearch.) I think part of the issue here is that there are two distinct use cases: single core and multi-core, or single collection and multiple collection. Solr is perfectly usable in single-core/collection mode - the user need not concern themselves with naming a collection. In that case, the fact that there is this extra level of abstraction called a collection and it is named collection1 is a bit of an annoyance and distraction, so the less annoying the better. Forcing the user to come up with a name and perform an extra step of naming that default collection adds no significant value for the single-core/collection use case, or the onboarding or introduction of new users to Solr as a simple but powerful search platform. Sure, once the user has decided that they indeed have the multi-core/collection use case, THEN they will want to name their cores/collections with real names. Sure, by all means make support for this use case as clean and convenient as possible. Why not simply give the user a choice, up front, and let them decide for themselves what use case they want? Whether that is a separate download or a separate startup command or a separate start directory seems like more of a detail than an architectural choice for de-supporting one useful use case. I would say leave the current example where it is, as it is, and have a separate, clean download for multi-collection server mode. I'm sure people deploying SolrCloud clusters in the cloud would appreciate the latter, without any burden of example and tutorial fluff. And maybe the use case distinction is simply SolrCloud vs. traditional Solr. 
And then for the new (5.0) SolrCloud server mode, we can have a little script for quick demo mode that is more like the current example/collection1 setup - or a separate example/introduction/tutorial download from the raw server download. In short, don't sacrifice the current simplicity, but do pursue the 5.0 server mode. Maybe if progress were made on the 5.0 Solr server, some of these details would just fall out or at least be more obvious and non-controversial. As it is, this is feeling a lot more like rearranging deck chairs on the Titanic than helping Solr to leapfrog to a whole new level in either server-ness or ease-of-use-ness. BTW, has any thought been given to including a packaging of the 5.0 Solr server as a Windows service? That might also help to clarify some of this packaging stuff. Rename 'example' dir to 'server' and pull examples into an 'examples' directory --- Key: SOLR-3619 URL: https://issues.apache.org/jira/browse/SOLR-3619 Project: Solr Issue Type: Improvement Reporter: Mark Miller Fix For: 4.9, 5.0 Attachments: SOLR-3619.patch, server-name-layout.png -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6260) Rename DirectUpdateHandler2
[ https://issues.apache.org/jira/browse/SOLR-6260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068068#comment-14068068 ] Jack Krupansky commented on SOLR-6260: -- Could we at least remove it from the example solrconfig in 5.0? Change the name as you see fit, and make it the default for the updateHandler class attribute? I mean, it always was kind of a wart to have to specify that kind of internal detail externally like that. Rename DirectUpdateHandler2 --- Key: SOLR-6260 URL: https://issues.apache.org/jira/browse/SOLR-6260 Project: Solr Issue Type: Improvement Affects Versions: 5.0 Reporter: Tomás Fernández Löbbe Priority: Minor Attachments: SOLR-6260.patch, SOLR-6260.patch DirectUpdateHandler was removed, I think in Solr 4. DirectUpdateHandler2 should be renamed, at least remove that 2. I don't know really what direct means here. Maybe it could be renamed to DefaultUpdateHandler, or UpdateHandlerDefaultImpl, or other good suggestions -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5746) solr.xml parsing of str vs int vs bool is brittle; fails silently; expects odd type for shareSchema
[ https://issues.apache.org/jira/browse/SOLR-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14059805#comment-14059805 ] Jack Krupansky commented on SOLR-5746: -- Will the changes for this issue result in a bump of the Solr schema version (to 1.6), so that if existing apps do happen to work (albeit maybe incorrectly) with the current version 1.5 schema processing, they will still work in Solr 4.10 (or whenever this ships)? I hope so. solr.xml parsing of str vs int vs bool is brittle; fails silently; expects odd type for shareSchema -- Key: SOLR-5746 URL: https://issues.apache.org/jira/browse/SOLR-5746 Project: Solr Issue Type: Bug Affects Versions: 4.3, 4.4, 4.5, 4.6 Reporter: Hoss Man Attachments: SOLR-5746.patch, SOLR-5746.patch A comment in the ref guide got me looking at ConfigSolrXml.java and noticing that the parsing of solr.xml options here is very brittle and confusing. In particular: * if a boolean option foo is expected along the lines of {{<bool name="foo">true</bool>}} it will silently ignore {{<str name="foo">true</str>}} * likewise for an int option {{<int name="bar">32</int>}} vs {{<str name="bar">32</str>}} ... this is inconsistent with the way solrconfig.xml is parsed. In solrconfig.xml, the xml nodes are parsed into a NamedList, and the above options will work in either form, but an invalid value such as {{<bool name="foo">NOT A BOOLEAN</bool>}} will generate an error earlier (when parsing config) than {{<str name="foo">NOT A BOOLEAN</str>}} (attempt to parse the string as a bool the first time the config value is needed) In addition, i notice this really confusing line... 
{code} propMap.put(CfgProp.SOLR_SHARESCHEMA, doSub("solr/str[@name='shareSchema']")); {code} shareSchema is used internally as a boolean option, but as written the parsing code will ignore it unless the user explicitly configures it as a {{<str/>}} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
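The solrconfig.xml-style leniency described above can be sketched in plain Java. This is stand-in code, not Solr's actual ConfigSolrXml: the idea is to coerce the node's text to the expected type regardless of whether the user wrote a bool, int, or str element, and to fail loudly on garbage instead of silently ignoring the setting.

```java
// Hypothetical sketch of type-lenient config parsing (not Solr code).
public class LenientConfig {
    // Coerce a raw node value to boolean whether it came from a
    // <bool> or a <str> element; reject garbage with a clear error.
    public static boolean asBool(String raw) {
        if ("true".equalsIgnoreCase(raw.trim())) return true;
        if ("false".equalsIgnoreCase(raw.trim())) return false;
        throw new IllegalArgumentException("Not a boolean: " + raw);
    }

    // Same idea for ints: <int> and <str> are interchangeable.
    public static int asInt(String raw) {
        return Integer.parseInt(raw.trim());
    }
}
```

With this approach a misconfigured value surfaces as an exception at config-parse time, which is the behavior the issue asks for.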
[jira] [Commented] (SOLR-247) Allow facet.field=* to facet on all fields (without knowing what they are)
[ https://issues.apache.org/jira/browse/SOLR-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058716#comment-14058716 ] Jack Krupansky commented on SOLR-247: - The earlier commentary clearly lays out that the primary concern is that it would be a performance nightmare, but... that does depend on your particular use case. Personally, I would say to go forward with adding this feature, but with a clear documentation caveat that this feature should be used with great care since it is likely to be extremely memory and performance intensive and more of a development testing tool than a production feature, although it could have value when wildcard patterns are crafted with care for a very limited number of fields. Allow facet.field=* to facet on all fields (without knowing what they are) -- Key: SOLR-247 URL: https://issues.apache.org/jira/browse/SOLR-247 Project: Solr Issue Type: Improvement Reporter: Ryan McKinley Priority: Minor Labels: beginners, newdev Attachments: SOLR-247-FacetAllFields.patch, SOLR-247.patch, SOLR-247.patch, SOLR-247.patch I don't know if this is a good idea to include -- it is potentially a bad idea to use it, but that can be ok. This came out of trying to use faceting for the LukeRequestHandler top term collecting. http://www.nabble.com/Luke-request-handler-issue-tf3762155.html -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
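The "wildcard patterns crafted with care for a very limited number of fields" idea amounts to expanding a glob against the schema's field names before faceting. A minimal illustration with an invented helper class (this is not how Solr implements it):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: expand a facet.field glob like "price_*"
// against the known field names, so faceting only touches matches.
public class FacetFieldExpander {
    public static List<String> expand(String glob, List<String> fields) {
        // Translate the glob into a regex: '*' matches any run of chars,
        // everything else is treated literally via \Q...\E quoting.
        Pattern p = Pattern.compile(Pattern.quote(glob).replace("*", "\\E.*\\Q"));
        List<String> out = new ArrayList<>();
        for (String f : fields) {
            if (p.matcher(f).matches()) out.add(f);
        }
        return out;
    }
}
```

A pattern like `price_*` would keep the expansion (and the memory cost) bounded, while a bare `*` expands to every field, which is exactly the performance nightmare the earlier comments warn about.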
[jira] [Commented] (LUCENE-3451) Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051934#comment-14051934 ] Jack Krupansky commented on LUCENE-3451: [~yo...@apache.org] says: bq. The current handling of boolean queries with only prohibited clauses is not a bug, but working as designed, so this issue is about changing that behavior. Currently working applications will now start unexpectedly throwing exceptions... now that's trappy. The fact that a pure negative query, actually a sub-query within parentheses in the query parser, returns zero documents has been a MAJOR problem for Solr users. I've lost count of how many times it has come up on the user list and we tell users to work around the problem by manually inserting \*:\* after the left parenthesis. But I am interested in hearing why it is believed that it is working as designed and whether there are really applications that would intentionally write a list of negative clauses when the design is that they will simply be ignored and match no documents. If that kind of compatibility is really needed, I would say it can be accommodated with a config setting, rather than keeping the current behavior, which gives unexpected and bad results for so many other people. I would prefer to fix the problem by having BQ do the right thing: implicitly start with a MatchAllDocsQuery if only MUST_NOT clauses are present, but... if that is not possible, an exception would be much better. Alternatively, given the difficulty of doing almost anything with the various query parsers, the method that generates the BQ for the query parser (QueryParserBase.getBooleanQuery) should just check for pure negative clauses and then add the MADQ. If this is massively controversial, just add a config option to disable it. 
Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery - Key: LUCENE-3451 URL: https://issues.apache.org/jira/browse/LUCENE-3451 Project: Lucene - Core Issue Type: Improvement Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.9, 5.0 Attachments: LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch We should at least in Lucene 4.0 remove the hack in BooleanFilter that allows pure negative Filter clauses. This is not supported by BooleanQuery and confuses users (I think that's the problem in LUCENE-3450). The hack is buggy, as it does not respect deleted documents and returns them in its DocIdSet. Also we should think about disallowing pure-negative Queries at all and throw UOE. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
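The implicit-MatchAllDocsQuery fix proposed in the comment boils down to one check at query-construction time. A toy sketch with stand-in types, not Lucene's actual BooleanClause API:

```java
import java.util.List;

// Hypothetical sketch of the pure-negative check a query parser
// could apply before building a BooleanQuery (not Lucene code).
public class PureNegativeFix {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // Returns true when a MatchAllDocsQuery should be prepended:
    // all clauses are prohibited, so negations need a base to subtract from.
    public static boolean needsMatchAll(List<Occur> clauses) {
        if (clauses.isEmpty()) return false;
        for (Occur o : clauses) {
            if (o != Occur.MUST_NOT) return false;
        }
        return true;
    }
}
```

In the real code path this check would live in something like the parser's getBooleanQuery, so `(-a -b)` silently becomes `(*:* -a -b)` instead of matching nothing.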
[jira] [Commented] (LUCENE-3451) Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051978#comment-14051978 ] Jack Krupansky commented on LUCENE-3451: Thanks, [~yo...@apache.org]. Although the (a -x) stop word case seems to argue even more strenuously for at least an exception if \*:\* can't be inserted. Besides, the stop word case is better handled by the Lucid approach of keeping all stop words (if they are indexed) if the sub-query terms are all stop words as in this case. So it would only be problematic for the case of non-indexed stop words, which is really an anti-pattern anyway these days. Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery - Key: LUCENE-3451 URL: https://issues.apache.org/jira/browse/LUCENE-3451 Project: Lucene - Core Issue Type: Improvement Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.9, 5.0 Attachments: LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch We should at least in Lucene 4.0 remove the hack in BooleanFilter that allows pure negative Filter clauses. This is not supported by BooleanQuery and confuses users (I think that's the problem in LUCENE-3450). The hack is buggy, as it does not respect deleted documents and returns them in its DocIdSet. Also we should think about disallowing pure-negative Queries at all and throw UOE. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3451) Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051980#comment-14051980 ] Jack Krupansky commented on LUCENE-3451: [~yo...@apache.org] says: bq. I personally think it would be fine to insert *:* for the user where appropriate. Ah! Since the divorce that gave Solr custody of its own copy of QueryParserBase, this change could be made there, right? I can file a Solr Jira for that (or just use one of the two open Solr issues related to pure-negative sub-queries), unless you want to do it. And then if the Solr people are happy over there, the Lucene guys can have their exception here and close this issue, and then everybody can live happily ever after, right? Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery - Key: LUCENE-3451 URL: https://issues.apache.org/jira/browse/LUCENE-3451 Project: Lucene - Core Issue Type: Improvement Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.9, 5.0 Attachments: LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch We should at least in Lucene 4.0 remove the hack in BooleanFilter that allows pure negative Filter clauses. This is not supported by BooleanQuery and confuses users (I think that's the problem in LUCENE-3450). The hack is buggy, as it does not respect deleted documents and returns them in its DocIdSet. Also we should think about disallowing pure-negative Queries at all and throw UOE. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5791) QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load
[ https://issues.apache.org/jira/browse/LUCENE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046813#comment-14046813 ] Jack Krupansky commented on LUCENE-5791: At least consider clear Javadoc on limitations and performance, such as the need to keep wildcard patterns brief. Maybe consider a limit on how many wildcards can be used in a single wildcard query. Possibly configurable. Maybe consider a trim mode - if too many wildcards appear, simply trim trailing portions of the pattern to get under the limit. For example, this test case might get trimmed to abc*mno*xyz*. This would still match all of the intended matches, albeit also matching some unintended cases. Maybe a limit of three wildcards would be reasonable. Does ? have the same issue, or is it much more linear? Would ???*???*???*??? be as bad as abc*mno*xyz*pqr* ? Do adjacent ** get collapsed to a single * ? QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load --- Key: LUCENE-5791 URL: https://issues.apache.org/jira/browse/LUCENE-5791 Project: Lucene - Core Issue Type: Bug Components: modules/queryparser Environment: Lucene 4.7.2 Java 6 Reporter: Clemens Wyss Attachments: afterdet.png The following testcase runs endlessly and produces VERY heavy load. ... String query = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut " + "labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et " + "ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. " + "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt " + "ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores " + "et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet"; query = query.replaceAll( "\\s+", "*" ); try { QueryParserUtil.parse( query, new String[] { "test" }, new Occur[] { Occur.MUST }, new KeywordAnalyzer() ); } catch ( Exception e ) { Assert.fail( e.getMessage() ); } ... I don't say this testcase makes sense, nevertheless the question remains whether this is a bug or a feature? 99% of the time the threaddump/stacktrace looks as follows: BasicOperations.determinize(Automaton) line: 680 Automaton.determinize() line: 759 SpecialOperations.getCommonSuffixBytesRef(Automaton) line: 165 CompiledAutomaton.<init>(Automaton, Boolean, boolean) line: 168 CompiledAutomaton.<init>(Automaton) line: 91 WildcardQuery(AutomatonQuery).<init>(Term, Automaton) line: 67 WildcardQuery.<init>(Term) line: 57 WildcardQueryNodeBuilder.build(QueryNode) line: 42 WildcardQueryNodeBuilder.build(QueryNode) line: 32 StandardQueryTreeBuilder(QueryTreeBuilder).processNode(QueryNode, QueryBuilder) line: 186 StandardQueryTreeBuilder(QueryTreeBuilder).process(QueryNode) line: 125 StandardQueryTreeBuilder(QueryTreeBuilder).build(QueryNode) line: 218 StandardQueryTreeBuilder.build(QueryNode) line: 82 StandardQueryTreeBuilder.build(QueryNode) line: 53 StandardQueryParser(QueryParserHelper).parse(String, String) line: 258 StandardQueryParser.parse(String, String) line: 168 QueryParserUtil.parse(String, String[], BooleanClause$Occur[], Analyzer) line: 119 IndexingTest.queryParserUtilLimit() line: 1450 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
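The trim mode floated in the comment - collapse adjacent `**` runs first, then cap the pattern at N wildcards - could look like the following. This is a hypothetical helper, not Lucene code:

```java
// Hypothetical sketch of a wildcard "trim mode" (not Lucene code):
// limit a pattern to at most maxWildcards '*' characters.
public class WildcardTrimmer {
    public static String trim(String pattern, int maxWildcards) {
        String p = pattern.replaceAll("\\*+", "*"); // collapse "**" runs
        int seen = 0;
        for (int i = 0; i < p.length(); i++) {
            if (p.charAt(i) == '*' && ++seen == maxWildcards) {
                return p.substring(0, i + 1); // keep up through the Nth '*'
            }
        }
        return p;
    }
}
```

With a limit of three, `abc*mno*xyz*pqr*` trims to `abc*mno*xyz*`, which (as the comment notes) still matches every intended hit plus some unintended ones, while keeping the automaton small enough to determinize quickly.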
[jira] [Comment Edited] (LUCENE-5791) QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load
[ https://issues.apache.org/jira/browse/LUCENE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14046813#comment-14046813 ] Jack Krupansky edited comment on LUCENE-5791 at 6/28/14 11:11 AM: -- At least consider clear Javadoc on limitations and performance, such as the need to keep wildcard patterns brief. Maybe consider a limit of how many wildcards can be used in a single wildcard query. Possibly configurable. Maybe consider a trim mode - if too many wildcards appear, simply trim trailing portions of the pattern to get under the limit. For example, this test case might get trimmed to abc*mno*xyz*. This would still match all of the intended matches, albeit also matching some unintended cases. Maybe a limit of three wildcards would be reasonable. Does ? have the same issue, or is it much more linear? Would ???*???*???*??? be as bad as abc*mno*xyz*pqr* ? Do adjacent ** get collapsed to a single * ? Fuzzy query has a very strict limit to assure that it is performant - I would think that these two query types should have the same performance goals. was (Author: jkrupan): At least consider clear Javadoc on limitations and performance, such as the need to keep wildcard patterns brief. Maybe consider a limit of how many wildcards can be used in a single wildcard query. Possibly configurable. Maybe consider a trim mode - if too many wildcards appear, simply trim trailing portions of the pattern to get under the limit. For example, this test case might get trimmed to abc*mno*xyz*. This would still match all of the intended matches, albeit also matching some unintended cases. Maybe a limit of three wildcards would be reasonable. Does ? have the same issue, or is it much more linear? Would ???*???*???*??? be as bad as abc*mno*xyz*pqr* ? Do adjacent ** get collapsed to a single * ? 
QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load --- Key: LUCENE-5791 URL: https://issues.apache.org/jira/browse/LUCENE-5791 Project: Lucene - Core Issue Type: Bug Components: modules/queryparser Environment: Lucene 4.7.2 Java 6 Reporter: Clemens Wyss Attachments: afterdet.png The following testcase runs endlessly and produces VERY heavy load. ... String query = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut " + "labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et " + "ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. " + "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt " + "ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores " + "et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet"; query = query.replaceAll( "\\s+", "*" ); try { QueryParserUtil.parse( query, new String[] { "test" }, new Occur[] { Occur.MUST }, new KeywordAnalyzer() ); } catch ( Exception e ) { Assert.fail( e.getMessage() ); } ... I don't say this testcase makes sense, nevertheless the question remains whether this is a bug or a feature? 
99% of the time the threaddump/stacktrace looks as follows: BasicOperations.determinize(Automaton) line: 680 Automaton.determinize() line: 759 SpecialOperations.getCommonSuffixBytesRef(Automaton) line: 165 CompiledAutomaton.<init>(Automaton, Boolean, boolean) line: 168 CompiledAutomaton.<init>(Automaton) line: 91 WildcardQuery(AutomatonQuery).<init>(Term, Automaton) line: 67 WildcardQuery.<init>(Term) line: 57 WildcardQueryNodeBuilder.build(QueryNode) line: 42 WildcardQueryNodeBuilder.build(QueryNode) line: 32 StandardQueryTreeBuilder(QueryTreeBuilder).processNode(QueryNode, QueryBuilder) line: 186 StandardQueryTreeBuilder(QueryTreeBuilder).process(QueryNode) line: 125 StandardQueryTreeBuilder(QueryTreeBuilder).build(QueryNode) line: 218 StandardQueryTreeBuilder.build(QueryNode) line: 82 StandardQueryTreeBuilder.build(QueryNode) line: 53 StandardQueryParser(QueryParserHelper).parse(String, String) line: 258 StandardQueryParser.parse(String, String) line: 168 QueryParserUtil.parse(String, String[], BooleanClause$Occur[], Analyzer) line: 119 IndexingTest.queryParserUtilLimit() line: 1450 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail:
[jira] [Commented] (LUCENE-5785) White space tokenizer has undocumented limit of 256 characters per token
[ https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043454#comment-14043454 ] Jack Krupansky commented on LUCENE-5785: It is worth keeping in mind that a token isn't necessarily the same as a term. It may indeed be desirable to limit the length of terms in the Lucene index for tokenized fields, but all too often an initial token is further broken down using token filters (e.g., word delimiter filter) so that the final term(s) are much shorter than the initial token. So, 256 may be a reasonable limit for indexed terms, but not a great limit for initial tokenization in a complex analysis chain. Whether the default token length limit should be changed as part of this issue is open. Personally I'd prefer a more reasonable limit such as 4096. But as long as the limit can be upped using a tokenizer attribute, that should be enough for now. White space tokenizer has undocumented limit of 256 characters per token Key: LUCENE-5785 URL: https://issues.apache.org/jira/browse/LUCENE-5785 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.8.1 Reporter: Jack Krupansky Priority: Minor The white space tokenizer breaks tokens at 256 characters, which is a hard-wired limit of the character tokenizer abstract class. The limit of 256 is obviously fine for normal, natural language text, but excessively restrictive for semi-structured data. 1. Document the current limit in the Javadoc for the character tokenizer. Add a note to any derived tokenizers (such as the white space tokenizer) that token size is limited as per the character tokenizer. 2. Add the setMaxTokenLength method to the character tokenizer ala the standard tokenizer so that an application can control the limit. This should probably be added to the character tokenizer abstract class, and then other derived tokenizer classes can inherit it. 3. Disallow a token size limit of 0. 4. 
A limit of -1 would mean no limit. 5. Add a token limit mode method - skip (what the standard tokenizer does), break (current behavior of the white space tokenizer and its derived tokenizers), and trim (what I think a lot of people might expect.) 6. Not sure whether to change the current behavior of the character tokenizer (break mode) to fix it to match the standard tokenizer, or to be trim mode, which is my choice and likely to be what people might expect. 7. Add matching attributes to the tokenizer factories for Solr, including Solr XML javadoc. At a minimum, this issue should address the documentation problem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
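The three proposed limit modes (skip, break, trim) are easy to state precisely in code. This sketch uses invented names throughout - no such mode method exists in Lucene's CharTokenizer today:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the proposed token-length limit modes.
public class TokenLimit {
    enum Mode { SKIP, BREAK, TRIM }

    public static List<String> apply(String token, int max, Mode mode) {
        List<String> out = new ArrayList<>();
        if (token.length() <= max) { out.add(token); return out; }
        switch (mode) {
            case SKIP:  // drop the over-long token entirely (StandardTokenizer-style)
                break;
            case BREAK: // emit max-length chunks (current CharTokenizer behavior)
                for (int i = 0; i < token.length(); i += max)
                    out.add(token.substring(i, Math.min(i + max, token.length())));
                break;
            case TRIM:  // keep only the first max characters
                out.add(token.substring(0, max));
                break;
        }
        return out;
    }
}
```

The break-vs-trim distinction matters in practice: break mode silently turns one over-long value into several bogus terms, while trim mode indexes a truncated but searchable prefix.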
[jira] [Commented] (LUCENE-5785) White space tokenizer has undocumented limit of 256 characters per token
[ https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043670#comment-14043670 ] Jack Krupansky commented on LUCENE-5785: bq. Make the limit configurable for all tokenizers, and expose that config option in the Solr schema. I wouldn't mind having a Solr-only, core/schema-specific default setting. Not like max Boolean clause which was a Java static for Lucene and quite a mess in terms of the order cores were loaded. In short, leave the default as 256 in Lucene, but we could have Solr default to something much less restrictive, like 4096, and in addition to the tokenizer-specific attribute, the user could specify a global (for the core/schema) override. One key advantage of the schema-global override is that the user could leave the existing field types intact. White space tokenizer has undocumented limit of 256 characters per token Key: LUCENE-5785 URL: https://issues.apache.org/jira/browse/LUCENE-5785 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.8.1 Reporter: Jack Krupansky Priority: Minor The white space tokenizer breaks tokens at 256 characters, which is a hard-wired limit of the character tokenizer abstract class. The limit of 256 is obviously fine for normal, natural language text, but excessively restrictive for semi-structured data. 1. Document the current limit in the Javadoc for the character tokenizer. Add a note to any derived tokenizers (such as the white space tokenizer) that token size is limited as per the character tokenizer. 2. Add the setMaxTokenLength method to the character tokenizer ala the standard tokenizer so that an application can control the limit. This should probably be added to the character tokenizer abstract class, and then other derived tokenizer classes can inherit it. 3. Disallow a token size limit of 0. 4. A limit of -1 would mean no limit. 5. 
Add a token limit mode method - skip (what the standard tokenizer does), break (current behavior of the white space tokenizer and its derived tokenizers), and trim (what I think a lot of people might expect.) 6. Not sure whether to change the current behavior of the character tokenizer (break mode) to fix it to match the standard tokenizer, or to be trim mode, which is my choice and likely to be what people might expect. 7. Add matching attributes to the tokenizer factories for Solr, including Solr XML javadoc. At a minimum, this issue should address the documentation problem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5785) White space tokenizer has undocumented limit of 256 characters per token
[ https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041194#comment-14041194 ] Jack Krupansky commented on LUCENE-5785: The pattern tokenizer can be used as a workaround for the white space tokenizer since it doesn't have that hard-wired token length limit. White space tokenizer has undocumented limit of 256 characters per token Key: LUCENE-5785 URL: https://issues.apache.org/jira/browse/LUCENE-5785 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.8.1 Reporter: Jack Krupansky Priority: Minor The white space tokenizer breaks tokens at 256 characters, which is a hard-wired limit of the character tokenizer abstract class. The limit of 256 is obviously fine for normal, natural language text, but excessively restrictive for semi-structured data. 1. Document the current limit in the Javadoc for the character tokenizer. Add a note to any derived tokenizers (such as the white space tokenizer) that token size is limited as per the character tokenizer. 2. Add the setMaxTokenLength method to the character tokenizer ala the standard tokenizer so that an application can control the limit. This should probably be added to the character tokenizer abstract class, and then other derived tokenizer classes can inherit it. 3. Disallow a token size limit of 0. 4. A limit of -1 would mean no limit. 5. Add a token limit mode method - skip (what the standard tokenizer does), break (current behavior of the white space tokenizer and its derived tokenizers), and trim (what I think a lot of people might expect.) 6. Not sure whether to change the current behavior of the character tokenizer (break mode) to fix it to match the standard tokenizer, or to be trim mode, which is my choice and likely to be what people might expect. 7. Add matching attributes to the tokenizer factories for Solr, including Solr XML javadoc. 
At a minimum, this issue should address the documentation problem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5785) White space tokenizer has undocumented limit of 256 characters per token
Jack Krupansky created LUCENE-5785: -- Summary: White space tokenizer has undocumented limit of 256 characters per token Key: LUCENE-5785 URL: https://issues.apache.org/jira/browse/LUCENE-5785 Project: Lucene - Core Issue Type: Improvement Components: modules/analysis Affects Versions: 4.8.1 Reporter: Jack Krupansky Priority: Minor The white space tokenizer breaks tokens at 256 characters, which is a hard-wired limit of the character tokenizer abstract class. The limit of 256 is obviously fine for normal, natural language text, but excessively restrictive for semi-structured data. 1. Document the current limit in the Javadoc for the character tokenizer. Add a note to any derived tokenizers (such as the white space tokenizer) that token size is limited as per the character tokenizer. 2. Add the setMaxTokenLength method to the character tokenizer ala the standard tokenizer so that an application can control the limit. This should probably be added to the character tokenizer abstract class, and then other derived tokenizer classes can inherit it. 3. Disallow a token size limit of 0. 4. A limit of -1 would mean no limit. 5. Add a token limit mode method - skip (what the standard tokenizer does), break (current behavior of the white space tokenizer and its derived tokenizers), and trim (what I think a lot of people might expect.) 6. Not sure whether to change the current behavior of the character tokenizer (break mode) to fix it to match the standard tokenizer, or to be trim mode, which is my choice and likely to be what people might expect. 7. Add matching attributes to the tokenizer factories for Solr, including Solr XML javadoc. At a minimum, this issue should address the documentation problem. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6113) Edismax doesn't parse well the query uf (User Fields)
[ https://issues.apache.org/jira/browse/SOLR-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009733#comment-14009733 ] Jack Krupansky commented on SOLR-6113: -- Better doc for the intended behavior would help, at least a little. At least we could point people to a clear description of what actually happens. Edismax doesn't parse well the query uf (User Fields) - Key: SOLR-6113 URL: https://issues.apache.org/jira/browse/SOLR-6113 Project: Solr Issue Type: Bug Components: query parsers Reporter: Liram Vardi It seems that the Edismax User Fields feature does not behave as expected. For instance, assuming the following query: _q=id:b* user:Anna Collins&defType=edismax&uf=* -user&rows=0_ The parsed query (taken from the query debug info) is: _+((id:b* (text:user) (text:anna collins))~1)_ I expect that because user was filtered out in uf (User Fields), the parsed query should not contain the user search part. In other words, the parsed query should look simply like this: _+id:b*_ This issue is affected by the patch on issue SOLR-2649: when changing the default OP of Edismax to AND, the query results change. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6065) Solr / IndexWriter should prevent you from adding docs if it creates an index to big to open
[ https://issues.apache.org/jira/browse/SOLR-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996549#comment-13996549 ] Jack Krupansky commented on SOLR-6065: -- As a historical note, I had filed LUCENE-4104 and LUCENE-4105, as well as SOLR-3504 and SOLR-3505, to both document and check against the per-index document limit in both Lucene and Solr. I think Lucene should check against the limit, and then Solr should respond to that condition. Two interesting use cases: 1. Deleted documents exist, so Solr should tell the user that optimize can resolve the problem. 2. No deleted documents exist, so Solr can only report that the document limit has been reached. As an afterthought, maybe we should have a configurable Solr parameter for maximum documents per shard, since anybody adding 2 billion documents to a shard is very likely to run into performance issues long before they get near the absolute maximum limit. I'd suggest a Solr configurable limit of, say, 250 million. Alternatively, this configurable limit could simply be a (noisy) warning, or maybe it could be configurable as either a hard error or a soft warning. Solr / IndexWriter should prevent you from adding docs if it creates an index to big to open Key: SOLR-6065 URL: https://issues.apache.org/jira/browse/SOLR-6065 Project: Solr Issue Type: Bug Reporter: Hoss Man yamazaki reported an error on solr-user where, on opening a new searcher, he got an IAE from BaseCompositeReader because the numDocs was greater than Integer.MAX_VALUE. I'm surprised that in a straightforward setup (ie: no AddIndex merging) IndexWriter will even let you add more docs than max int. We should investigate if this makes sense and either add logic in IndexWriter to prevent this from happening, or add logic to Solr's UpdateHandler to prevent things from getting that far.
ie: we should be failing to add too many documents and leaving the index usable -- not accepting the add and leaving the index in an unusable state. stack trace reported by user... {noformat} ERROR org.apache.solr.core.CoreContainer – Unable to create core: collection1 org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.init(SolrCore.java:821) at org.apache.solr.core.SolrCore.init(SolrCore.java:618) at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:949) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:984) at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:597) at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:592) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1438) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1550) at org.apache.solr.core.SolrCore.init(SolrCore.java:796) ... 13 more Caused by: org.apache.solr.common.SolrException: Error opening Reader at org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:172) at org.apache.solr.search.SolrIndexSearcher.init(SolrIndexSearcher.java:183) at org.apache.solr.search.SolrIndexSearcher.init(SolrIndexSearcher.java:179) at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1414) ...
15 more Caused by: java.lang.IllegalArgumentException: Too many documents, composite IndexReaders cannot exceed 2147483647 at org.apache.lucene.index.BaseCompositeReader.init(BaseCompositeReader.java:77) at org.apache.lucene.index.DirectoryReader.init(DirectoryReader.java:368) at org.apache.lucene.index.StandardDirectoryReader.init(StandardDirectoryReader.java:42) at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:71) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783) at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52) at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:88) at
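The configurable per-shard cap suggested in the comment above could be sketched as follows. This is a hypothetical illustration: the function name, default cap, and the hard-error/soft-warning switch are all invented here (no such Solr setting exists in the issue); only the absolute Lucene limit of Integer.MAX_VALUE is real.

```python
# Hypothetical sketch of a configurable per-shard document cap with
# hard-error vs. soft-warning modes; all names are invented for illustration.
import warnings

LUCENE_MAX_DOCS = 2_147_483_647  # Integer.MAX_VALUE, the hard Lucene limit

def check_doc_limit(num_docs, max_shard_docs=250_000_000, hard=False):
    """Fail the add before the index becomes unopenable; optionally
    enforce a much lower configurable cap as an error or a warning."""
    if num_docs >= LUCENE_MAX_DOCS:
        # Always a hard failure: exceeding this leaves the index unopenable.
        raise RuntimeError("Too many documents: index would exceed the Lucene limit")
    if num_docs >= max_shard_docs:
        msg = f"Shard has {num_docs} docs, over the configured cap of {max_shard_docs}"
        if hard:
            raise RuntimeError(msg)
        warnings.warn(msg)  # soft mode: noisy warning, add still accepted
```

The point of the sketch is the ordering: the check happens before the add is accepted, so the failure mode is a rejected update rather than a corrupted, unopenable index.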
[jira] [Commented] (SOLR-6036) Can't create collection with replicationFactor=0
[ https://issues.apache.org/jira/browse/SOLR-6036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986967#comment-13986967 ] Jack Krupansky commented on SOLR-6036: -- But I can sympathize - the term "copies of the data" is ambiguous and vague, unless you have seriously taken the mantra "there is no master!" to heart and etched it into your arms with acid. Maybe "instances of the data" would be a little less ambiguous. Can't create collection with replicationFactor=0 Key: SOLR-6036 URL: https://issues.apache.org/jira/browse/SOLR-6036 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.3.1, 4.8 Reporter: John Wong Priority: Trivial
solrcloud$ curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=collection&numShards=2&replicationFactor=0'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">400</int><int name="QTime">60052</int></lst>
<str name="Operation createcollection caused exception:">org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: replicationFactor must be greater than or equal to 0</str>
<lst name="exception"><str name="msg">replicationFactor must be greater than or equal to 0</str><int name="rspCode">400</int></lst>
<lst name="error"><str name="msg">replicationFactor must be greater than or equal to 0</str><int name="code">400</int></lst>
</response>
I am using Solr 4.3.1, but I peeked into the source up to 4.8 and the problem still persists, though in 4.8 the exception message is now "must be greater than 0". The code snippet in OverseerCollectionProcessor.java: {code} if (repFactor <= 0) { throw new SolrException(ErrorCode.BAD_REQUEST, REPLICATION_FACTOR + " must be greater than 0"); } {code} I believe the <= should just be <, as it won't allow 0. It may have been legacy from when replicationFactor of 1 included the leader/master copy, whereas in Solr 4.x, replicationFactor is defined by additional replicas on top of the leader.
http://wiki.apache.org/solr/SolrCloud replicationFactor: The number of copies of each document (or, the number of physical replicas to be created for each logical shard of the collection.) A replicationFactor of 3 means that there will be 3 replicas (one of which is normally designated to be the leader) for each logical shard. NOTE: in Solr 4.0, replicationFactor was the number of *additional* copies as opposed to the total number of copies. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6003) JSON Update increment field with non-stored fields causes subtle problems
[ https://issues.apache.org/jira/browse/SOLR-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981086#comment-13981086 ] Jack Krupansky commented on SOLR-6003: -- It sounds like a separate Jira should be filed for some of these broader discussions. This specific Jira should focus on the specific issue of increment for a non-stored field, and append to a non-stored multivalued field. Clearly this case should produce an exception, since it can't possibly do anything reasonable: it needs to access the previous value before applying the increment or append. JSON Update increment field with non-stored fields causes subtle problems - Key: SOLR-6003 URL: https://issues.apache.org/jira/browse/SOLR-6003 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.7.1 Reporter: Kingston Duffie In our application we have large multi-field documents. We occasionally need to increment one of the numeric fields or add a value to a multi-value text field. This appears to work correctly using JSON update. But later we discovered that documents were disappearing from search results, and eventually found the documentation that indicates that to use field modification you must store all fields of the document. Perhaps you will argue that you need to impose this restriction -- which I would hope could be overcome, given the cost to us of having to store all fields. But in any case, it would be better for others if you could return an error if someone tries to update a field on documents with non-stored fields. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
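For reference, the atomic-update requests under discussion use the inc and add operators; a JSON update body along these lines (the field names here are just examples) increments a numeric field and appends to a multivalued field, which is exactly the case that requires the previous stored values:

```json
[
  {
    "id": "doc1",
    "view_count": { "inc": 1 },
    "tags": { "add": "new-tag" }
  }
]
```

To apply inc or add, Solr must reconstruct the existing document from its stored fields first; any non-stored field's old value is simply unavailable, hence the proposal that this produce an error rather than a silently lossy update.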
[jira] [Commented] (SOLR-5871) Ability to see the list of fields that matched the query with scores
[ https://issues.apache.org/jira/browse/SOLR-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967913#comment-13967913 ] Jack Krupansky commented on SOLR-5871: -- I've lost count of how many times users have requested this feature. The basic request is for an easy way to determine which fields matched which values for each document, as opposed to having to sift through the debug explanation. One technical difficulty is analysis - the results could report the analyzed field values which matched, which won't necessarily literally agree with the source terms due to case, stemming, synonyms, etc. Ability to see the list of fields that matched the query with scores Key: SOLR-5871 URL: https://issues.apache.org/jira/browse/SOLR-5871 Project: Solr Issue Type: Wish Reporter: Alexander S. Assignee: Erick Erickson Hello, I need the ability to tell users what content matched their query, this way:
| Name | Twitter Profile | Topics | Site Title | Site Description | Site content |
| John Doe | Yes | No | Yes | No | Yes |
| Jane Doe | No | Yes | No | No | Yes |
All these columns are indexed text fields and I need to know what content matched the query; it would also be cool to be able to show the score per field. As far as I know, right now there's no way to return this information when running a query request. Debug output is suitable for visual review but has lots of nesting levels and is hard to understand. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5936) Deprecate non-Trie-based numeric (and date) field types in 4.x and remove them from 5.0
[ https://issues.apache.org/jira/browse/SOLR-5936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13954438#comment-13954438 ] Jack Krupansky commented on SOLR-5936: -- As part of this cleanup, could somebody volunteer to create a plain-English summary of exactly what a trie field really is, what good it is, and why we can't live without them? I've read the code and, okay, there is a sequence of bit shifts and generation of extra terms, but in plain English, what's the point? I'm not asking for a recitation of the actual algorithm(s), but some intuitively accessible summary. I would note that the typical examples are for strings with prefixes rather than binary numbers. See: http://en.wikipedia.org/wiki/Trie And, is trie really the best solution for number types? Does it actually have real value for float and double values? And I would really like to see some plain, easily readable explanation of precision step. Again, especially for real numbers. And how should precision step be used for dates? I mean, other than assuring sort order, why bother with trie? Or more specifically, why does a Solr (or Lucene) user need to know that trie is used for the implementation? Specifically, for example, does it matter if a field has an evenly distributed range of numeric values with little repetition vs. numeric codes where there is a relatively small number of distinct values (e.g., 1-10, or scores of 0-100 or dates in years between 1970 and 2014) and relatively high cardinality? I mean, does trie do a uniformly great job for both of these extreme use cases, including for faceting? And if trie really is the best approach for numeric fields, why not just do all of this under the hood instead of polluting the field type names with trie? IOW, rename TrieIntField to IntField, etc. To me, trie just seems like unnecessary noise to average users. 
Deprecate non-Trie-based numeric (and date) field types in 4.x and remove them from 5.0 --- Key: SOLR-5936 URL: https://issues.apache.org/jira/browse/SOLR-5936 Project: Solr Issue Type: Task Components: Schema and Analysis Reporter: Steve Rowe Assignee: Steve Rowe Priority: Minor Fix For: 4.8, 5.0 Attachments: SOLR-5936.branch_4x.patch, SOLR-5936.branch_4x.patch We've been discouraging people from using non-Trie numeric/date field types for years, it's time we made it official. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
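As a rough plain-English answer to the question in the comment above: a trie numeric field indexes each number at several precisions, with precisionStep controlling how many low-order bits are shifted away at each level, so a range query can match a few coarse "bucket" terms in the middle of the range and only needs exact terms at the edges. A simplified sketch of that idea (the real Lucene encoding uses prefix-coded byte terms; only the bit-shifting intuition is kept here):

```python
# Simplified sketch of Lucene's trie / precisionStep idea. The real
# implementation prefix-codes each shifted value into byte terms; this
# keeps only the bit-shifting intuition.

def trie_terms(value, precision_step=8, bits=32):
    """Index one number as several terms: the exact value plus
    progressively coarser buckets with trailing bits shifted away."""
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]

# 1234 with precisionStep=8 yields the exact value plus three coarser
# buckets; a range query like [1000 TO 2000] can match whole buckets at
# the coarse levels instead of enumerating every exact value in between.
print(trie_terms(1234, 8))  # [(0, 1234), (8, 4), (16, 0), (24, 0)]
```

This also hints at the trade-off behind the question about value distributions: more terms are stored per value (smaller precisionStep means more levels), in exchange for visiting far fewer terms at range-query time; for low-cardinality fields the extra levels buy relatively little.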
[jira] [Commented] (SOLR-5896) Create and edit a CWiki page that describes UpdateRequestProcessors, especially FieldMutatingUpdateProcessors
[ https://issues.apache.org/jira/browse/SOLR-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943316#comment-13943316 ] Jack Krupansky commented on SOLR-5896: -- I have plenty of examples for these (and all other) update processors in my e-book: http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html Create and edit a CWiki page that describes UpdateRequestProcessors, especially FieldMutatingUpdateProcessors - Key: SOLR-5896 URL: https://issues.apache.org/jira/browse/SOLR-5896 Project: Solr Issue Type: Improvement Components: documentation Affects Versions: 4.8, 5.0 Reporter: Erick Erickson Assignee: Erick Erickson The capabilities here aren't really documented as a group anywhere I could see in the official pages; there are a couple of references to them but nothing that really draws attention to them. These need to be documented. Where does it make sense to put this? It doesn't really fit under Understanding Analyzers, Tokenizers, and Filters, except kinda, since they can be used to alter how data gets indexed - think of the Parse[Date|Int|Float..] factories. Straw-man: add child pages to Understanding Analyzers, Tokenizers, and Filters for What is an UpdateRequestProcessor, UpdateRequestProcessors, and probably something like How to configure your UpdateRequestProcessor. Or??? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5654) Create a synonym filter factory that is (re)configurable, and capable of reporting its configuration, via REST API
[ https://issues.apache.org/jira/browse/SOLR-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13886574#comment-13886574 ] Jack Krupansky commented on SOLR-5654: -- Two reasonable and reliable use cases I have encountered: 1. Update or replace query-time synonyms - no risk for existing indexed data. 2. Add new index-time synonyms that will apply to new indexed documents - again, no expectation that they would apply to existing documents, but reindexing would of course apply them anyway. Create a synonym filter factory that is (re)configurable, and capable of reporting its configuration, via REST API -- Key: SOLR-5654 URL: https://issues.apache.org/jira/browse/SOLR-5654 Project: Solr Issue Type: Sub-task Components: Schema and Analysis Reporter: Steve Rowe A synonym filter factory could be (re)configurable via REST API by registering with the RESTManager described in SOLR-5653, and then responding to REST API calls to modify its init params and its synonyms resource file. Read-only (GET) REST API calls should also be provided, both for init params and the synonyms resource file. It should be possible to add/remove/modify one or more entries in the synonyms resource file. We should probably use JSON for the REST request body, as is done in the Schema REST API methods. -- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
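Since the issue proposes JSON request bodies as in the Schema REST API, one possible shape for an add/modify request body might look like the following. This is purely illustrative - the exact API shape is what this issue is meant to decide, and the structure and key names here (initArgs, managedMap) are assumptions:

```json
{
  "initArgs": { "ignoreCase": true },
  "managedMap": {
    "mad": ["angry", "upset"],
    "tv": ["television"]
  }
}
```

A GET on the same resource would then return both the init params and the current synonym mappings, matching the read-only half of the proposal.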
[jira] [Commented] (SOLR-5517) Treat POST with no Content-Type as application/x-www-form-urlencoded
[ https://issues.apache.org/jira/browse/SOLR-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13835838#comment-13835838 ] Jack Krupansky commented on SOLR-5517: -- What about curl commands? It is kind of an annoyance that you have to explicitly enter a Content-type. Treat POST with no Content-Type as application/x-www-form-urlencoded Key: SOLR-5517 URL: https://issues.apache.org/jira/browse/SOLR-5517 Project: Solr Issue Type: Improvement Reporter: Ryan Ernst Attachments: SOLR-5517.patch While the http spec states requests without a content-type should be treated as application/octet-stream, the html spec says instead that post requests without a content-type should be treated as a form (http://www.w3.org/MarkUp/html-spec/html-spec_8.html#SEC8.2.1). It would be nice to allow large search requests from html forms, and not have to rely on the browser to set the content type (since the spec says it doesn't have to). -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5401) In Solr's ResourceLoader, add a check for @Deprecated annotation in the plugin/analysis/... class loading code, so we print a warning in the log if a deprecated factory
[ https://issues.apache.org/jira/browse/SOLR-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808292#comment-13808292 ] Jack Krupansky commented on SOLR-5401: -- Solr has a logging admin API that will return recent log entries. For example: {code} curl "http://localhost:8983/solr/admin/logging?threshold=WARN&test&since=0&indent=true" {code} More examples and the API parameters are in the admin API section of my e-book that is currently in progress, but that isn't out yet. The source code is currently your best guide: org.apache.solr.handler.admin.LoggingHandler. In Solr's ResourceLoader, add a check for @Deprecated annotation in the plugin/analysis/... class loading code, so we print a warning in the log if a deprecated factory class is used -- Key: SOLR-5401 URL: https://issues.apache.org/jira/browse/SOLR-5401 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.6, 4.5 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.6, 5.0 Attachments: SOLR-5401.patch While changing an antique 3.6 schema.xml to Solr 4.5, I noticed that some factories were deprecated in 3.x and were no longer available in 4.x (e.g. solr._Language_PorterStemFilterFactory). If the user had gotten a notice earlier, this could have been prevented and the user would have upgraded sooner. In fact the factories were @Deprecated in 3.6, but the Solr loader does not print any warning. My proposal is to add some simple code to SolrResourceLoader so that it prints a warning about the deprecated class if any configuration setting loads a class with a @Deprecated annotation. So we can prevent that problem in the future. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5401) In Solr's ResourceLoader, add a check for @Deprecated annotation in the plugin/analysis/... class loading code, so we print a warning in the log if a deprecated factory
[ https://issues.apache.org/jira/browse/SOLR-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808551#comment-13808551 ] Jack Krupansky commented on SOLR-5401: -- I suspect that all the needed logic is sprinkled throughout the Solr logging API. Yes, probably way too much effort for this one test, but it would be good to have lots of other Solr features fully test their error and warning handling, so eventually this piece of test infrastructure would be valuable. In Solr's ResourceLoader, add a check for @Deprecated annotation in the plugin/analysis/... class loading code, so we print a warning in the log if a deprecated factory class is used -- Key: SOLR-5401 URL: https://issues.apache.org/jira/browse/SOLR-5401 Project: Solr Issue Type: Improvement Components: Schema and Analysis Affects Versions: 3.6, 4.5 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.6, 5.0 Attachments: SOLR-5401.patch While changing an antique 3.6 schema.xml to Solr 4.5, I noticed that some factories were deprecated in 3.x and were no longer available in 4.x (e.g. solr._Language_PorterStemFilterFactory). If the user had gotten a notice earlier, this could have been prevented and the user would have upgraded sooner. In fact the factories were @Deprecated in 3.6, but the Solr loader does not print any warning. My proposal is to add some simple code to SolrResourceLoader so that it prints a warning about the deprecated class if any configuration setting loads a class with a @Deprecated annotation. So we can prevent that problem in the future. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org