Re: Rollback w/ Atomic Update
Yonik Seeley wrote
> "rollback" is a lucene-level operation that isn't really supported at
> the solr level:
> https://issues.apache.org/jira/browse/SOLR-4733

I find it odd that this unsupported operation has been around since Solr 1.4. In this case, it seems like there is some underlying issue specific to partial updates.

--
View this message in context: http://lucene.472066.n3.nabble.com/Rollback-w-Atomic-Update-tp4309550p4309596.html
Sent from the Solr - User mailing list archive at Nabble.com.
Rollback w/ Atomic Update
We've noticed that partial updates are not rolling back with subsequent commits based on the same document id. Our only success in mitigating this issue has been to issue an empty commit immediately following the rollback. I've included an example below showing the partial update's unexpected results. We are currently using SolrJ 4.8.1 with the default deletion policy and auto commits disabled in the configuration. Any help would be greatly appreciated in better understanding this scenario.

/update?commit=true (initial add)
[ { "id": "12345", "createdBy_t": "John Someone" } ]

/update
[ { "id": "12345", "favColors_txt": { "set": ["blue", "green"] } } ]

/update?rollback=true
[]

/update?commit=true
[ { "id": "12345", "cityBorn_t": { "add": "Charleston" } } ]

/select?q=id:12345
[ { "id": "12345", "createdBy_t": "John Someone", "favColors_txt": ["blue", "green"], "cityBorn_t": "Charleston" } ]
Re: Atomic Update w/ Date Copy Field
Stefan Matheis-3 wrote
> To me, it sounds more like you shouldn't have to care about such gory
> details as a user - at all.
>
> Would you mind opening an issue on JIRA, Todd? Including all the details you
> already provided, as well as a link to this thread, would be best.
>
> Depending on what you actually did to find this all out, you probably do
> even have a test case at hand which demonstrates the behaviour? If not,
> that's obviously not a problem :)

Agreed on the gory details. Yes, it definitely seems like the format should be consistent between full and partial updates. I'll go ahead and open an issue on JIRA.

Alexandre Rafalovitch wrote
> I noticed (and abused) the issue Todd described in my Solr puzzle at:
> http://blog.outerthoughts.com/2016/04/solr-5-puzzle-magic-date-answer/
>
> The second format ("EEE...") looks rather strange. I would suspect
> that the conversion Date->String code is using the active locale and
> that is the format default for that locale. So, the bug might be that
> the locale needs to be more specific to preserve the consistency.

Thank you for the Solr puzzle reference. The EEE format is most certainly the java.util.Date.toString() method being called when re-creating the field.
Re: Atomic Update w/ Date Copy Field
It looks like the issue has to do with the Date object. When the document is fully updated (with the date specified), the field is created with a String object, so everything is indexed as it appears. When the document is partially updated (with the date omitted), the field is re-created using the previously stored Date object, which takes the "toString" representation (i.e. EEE MMM dd HH:mm:ss zzz yyyy).

I ended up creating a DateTextField which extends TextField and simply overrides the "FieldType.createField(SchemaField, Object, float)" method. I then check for a Date instance and format as necessary. Any ideas on a better approach, or does it sound like this is the way to go? I wasn't sure if this could be accomplished in a filter or some other way.
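The core of that createField override can be sketched outside of Solr; the class and method names below are illustrative, not the actual DateTextField code:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateValueNormalizer {

    // Mirrors the check described above: if the value arrives as a
    // java.util.Date (the partial-update case, where the stored field is
    // re-created), format it as ISO-8601/UTC; otherwise the value is already
    // a String (the full-update case) and passes through untouched.
    public static String normalize(Object value) {
        if (value instanceof Date) {
            SimpleDateFormat iso =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.ROOT);
            iso.setTimeZone(TimeZone.getTimeZone("UTC"));
            return iso.format((Date) value);
        }
        return value.toString();
    }

    public static void main(String[] args) {
        Date d = new Date(1436878697535L); // 2015-07-14T12:58:17.535Z
        System.out.println(normalize(d));                          // ISO form
        System.out.println(normalize("2015-07-14T12:58:17.535Z")); // unchanged
    }
}
```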
Atomic Update w/ Date Copy Field
We recently started using atomic updates in our application and have since noticed that date fields copied to a text field have varying results between full and partial updates. When the document is fully updated, the copied text date appears as expected (i.e. yyyy-MM-dd'T'HH:mm:ss.SSSZ); however, when the document is partially updated (while omitting the date field), the original stored date value is copied in a different format (i.e. EEE MMM d HH:mm:ss z yyyy). I've included an example below of what we are seeing with the indexed value of our "createdDate_facet_t" field. Is there a way that we can force the copy field to always use "yyyy-MM-dd'T'HH:mm:ss.SSSZ" as the resulting text format without having to always include the field in the update?

/update (full)
{ "id": "12345", "createdBy_t": "someone", "createdDate_dt": "2015-07-14T12:58:17.535Z" }
createdDate_facet_t = "2015-07-14t12:58:17.535z"

/update (partial)
{ "id": "12345", "createdBy_t": { "set": "another" } }
createdDate_facet_t = "tue jul 14 12:58:17 utc 2015"
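The two renderings in the example can be reproduced in plain Java, which points at java.util.Date.toString() as the source of the second format (the default timezone is pinned to UTC here to match the example):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class CopyFieldFormatDemo {
    public static void main(String[] args) throws Exception {
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));

        SimpleDateFormat iso =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.ROOT);
        iso.setTimeZone(TimeZone.getTimeZone("UTC"));
        Date stored = iso.parse("2015-07-14T12:58:17.535Z");

        // Full update: the incoming value is still a String, so the copyField
        // target indexes the ISO-8601 text as-is.
        System.out.println(iso.format(stored)); // 2015-07-14T12:58:17.535Z

        // Partial update: the previously stored value is a Date object, and
        // stringifying it falls back to Date.toString():
        System.out.println(stored);             // Tue Jul 14 12:58:17 UTC 2015
    }
}
```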
RE: DIH Caching w/ BerkleyBackedCache
James, I apologize for the late response.

Dyer, James-2 wrote
> With the DIH request, are you specifying "cacheDeletePriorData=false"?

We are not specifying that property (it looks like it defaults to "false"). I'm actually seeing this issue when running a full clean/import. It appears that the Berkeley DB "cleaner" is always removing the oldest file once there are three. In this case, I'll see two 1GB files, and then as the third file is being written (after ~200MB) the oldest 1GB file will fall off (i.e. get deleted). I'm only utilizing ~13% disk space at the time.

I'm using Berkeley DB version 4.1.6 with Solr 4.8.1. I'm not specifying any other configuration properties other than what I mentioned before. I simply cannot figure out what is going on with the "cleaner" logic that would deem that file "lowest utilized". Is there any other Berkeley DB/system configuration I could consider that would affect this?

It's possible that this caching simply might not be suitable for our data set, where one document might contain a field with tens of thousands of values... maybe this is the bottleneck with using this database, as every add copies in the prior data and then the "cleaner" removes the old stuff. Maybe it's working like it should but just incredibly slowly... I can get a full index without caching in about two hours; however, when using this caching it was still running after 24 hours (still caching the sub-entity).

Thanks again for the reply.

Respectfully,
Todd
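If it is the JE cleaner being too aggressive, its thresholds can be adjusted via a je.properties file placed in the environment (cache) directory, which JE reads automatically when the environment is opened. Whether BerkleyBackedCache leaves these settings alone is an assumption worth verifying; a sketch:

```properties
# je.properties, placed in the Berkeley DB JE environment (cache) directory.

# Rename cleaned log files to *.del instead of deleting them -- useful for
# diagnosing whether the cleaner is really discarding live cache data:
je.cleaner.expunge=false

# Utilization threshold the cleaner tries to maintain (default 50);
# lowering it makes the cleaner less aggressive:
je.cleaner.minUtilization=50

# Log file size cap. The ~1GB files observed suggest the cache implementation
# overrides JE's 10MB default, so this may be ignored here -- an assumption:
# je.log.fileMax=1073741824
```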
Re: DIH Caching w/ BerkleyBackedCache
Mikhail Khludnev wrote
> It's worth mentioning that for a really complex relation scheme it might be
> challenging to organize all of them into parallel ordered streams.

This will most likely be the issue for us, which is why I would like to have the Berkeley cache solution to fall back on, if possible. Again, I'm not sure why, but it appears that the Berkeley cache is overwriting itself (i.e. cleaning up unused data) when building the database... I've read plenty of other threads where it appears folks are having success using that caching solution.

Mikhail Khludnev wrote
> Threads... you said? Which ones? Declarative parallelization in
> EntityProcessor worked only with certain 3.x versions.

We are running multiple DIH instances which query against specific partitions of the data (i.e. a mod of the document id we're indexing).
Re: DIH Caching w/ BerkleyBackedCache
Mikhail Khludnev wrote
> "External merge" join helps to avoid boilerplate caching in such simple
> cases.

Thank you for the reply. I can certainly look into this, though I would have to apply the patch for our version (i.e. 4.8.1). I really just simplified our data configuration here; it actually consists of many sub-entities that are successfully using the SortedMapBackedCache cache. I imagine this would still apply to those, as the queries themselves are simple for the most part. I assume performance-wise this would only require the single table scan?

I'm still very much interested in resolving this Berkeley database cache issue. I'm sure there is some minor configuration I'm missing that is causing this behavior. Again, I've had no issues with the SortedMapBackedCache for its caching purpose... I've tried simplifying our data configuration to only one thread with a single sub-entity, with the same results.

Again, any help would be greatly appreciated with this.
DIH Caching w/ BerkleyBackedCache
We currently index using DIH along with the SortedMapBackedCache cache implementation, which has worked well until recently when we needed to index a much larger table. We were running into memory issues using the SortedMapBackedCache, so we tried switching to the BerkleyBackedCache but appear to have some configuration issues. I've included our basic setup below.

The issue we're running into is that the Berkeley database appears to be evicting database files (see message below) before they've completed. When I watch the cache directory I only ever see two database files at a time, each ~1GB in size (this appears to be hard coded). Is there some additional configuration I'm missing to prevent the process from "cleaning" up database files before the index has finished? I think this "cleanup" continually kicks off the caching, which never completes... without caching, the indexing takes ~2 hours. Any help would be greatly appreciated. Thanks.

Cleaning message: "Chose lowest utilized file for cleaning. fileChosen: 0x0 ..."
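The "basic setup" referenced above did not survive the archive. For readers, a minimal sketch of the kind of configuration being described — a parent entity with a sub-entity cached by the Berkeley-DB-backed implementation — might look like the following; the datasource, table, and column names are hypothetical, and only the cache attributes actually mentioned in this thread (cacheImpl, cacheKey, cacheLookup) are shown:

```xml
<dataConfig>
  <dataSource driver="..." url="..." />
  <document>
    <entity name="item" query="SELECT ID, NAME FROM ITEM">
      <!-- Sub-entity cached on disk; this is where the ~1GB database
           files described above are produced. -->
      <entity name="feature"
              query="SELECT ITEM_ID, DESCRIPTION FROM FEATURE"
              processor="SqlEntityProcessor"
              cacheImpl="BerkleyBackedCache"
              cacheKey="ITEM_ID"
              cacheLookup="item.ID"/>
    </entity>
  </document>
</dataConfig>
```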
Re: DIH Caching with Delta Import
Erick Erickson wrote
> Have you considered using SolrJ instead of DIH? I've seen
> situations where that can make a difference for things like
> caching small tables at the start of a run, see:
>
> searchhub.org/2012/02/14/indexing-with-solrj/

Nice write-up. I think we're going to move to that eventually so we can leverage our models instead of maintaining a separate data configuration. Thank you for sharing the link.
RE: DIH Caching with Delta Import
Dyer, James-2 wrote
> The DIH Cache feature does not work with delta import. Actually, much of
> DIH does not work with delta import. The workaround you describe is
> similar to the approach described here:
> https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport ,
> which in my opinion is the best way to implement partial updates with DIH.

Not what I was hoping to hear, but at least that explains the delta import funkiness we were experiencing. Thank you for providing the partial updates implementation link.
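For reference, the delta-via-full-import approach on the linked wiki page boils down to folding the delta criteria into the main query, so every run goes through the full-import code path; a sketch with placeholder table and column names:

```xml
<entity name="item" pk="ID"
        query="SELECT * FROM ITEM
               WHERE '${dataimporter.request.clean}' != 'false'
                  OR LAST_MODIFIED &gt; '${dataimporter.last_index_time}'"/>
```

Running with clean=true performs a full import; running with clean=false restricts the same query to rows modified since the last run.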
DIH Caching with Delta Import
It appears that DIH entity caching (e.g. SortedMapBackedCache) does not work with deltas... is this simply a bug with the DIH cache support, or somehow by design? Any ideas on a workaround for this?

Ideally, I could just omit the "cacheImpl" attribute, but that leaves the query (using the default processor in my case) without the appropriate where clause including the "cacheKey" and "cacheLookup". Should SqlEntityProcessor be smart enough to ignore the cache with deltas and simply append a where clause which includes the "cacheKey" and "cacheLookup"? Or possibly just include a where clause such as ('${dih.request.command}' = 'full-import' or cacheKey = cacheLookup)? I suppose those could be used to mitigate the issue, but I was hoping for possibly a better solution.

Any help would be greatly appreciated. Thank you.
Re: Numeric Sorting with 0 and NULL Values
Todd Long wrote
> I'm curious as to where the loss of precision would be when using
> "-(Double.MAX_VALUE)" as you mentioned? Also, any specific reason why you
> chose that over Double.MIN_VALUE (sorry, just making sure I'm not missing
> something)?

So, to answer my own question, it looks like Double.MIN_VALUE is somewhat misleading (or poorly named, perhaps?)... the javadoc states "A constant holding the smallest positive nonzero value of type double". In this case, the cast to int/long would result in 0 with the loss of precision, which is definitely not what I want (and back to the original issue). It would certainly seem that -Double.MAX_VALUE would be the way to go! This is something that I was not aware of with Double... thank you.

Chris Hostetter-3 wrote
> ...i mention this as being a workaround for floats/doubles because the
> functions are evaluated as doubles (no "casting" or "forced integer
> context" type support at the moment), so with integer/float fields there
> would be some loss of precision.

I'm still curious whether or not there would be any cast issue going from double to int/long within the "def()" function. Any additional details would be greatly appreciated.
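The two constants can be compared directly in plain Java; the narrowing-cast behavior below follows the Java Language Specification (out-of-range doubles saturate at the integral type's bounds):

```java
public class DoubleBoundsDemo {
    public static void main(String[] args) {
        // Double.MIN_VALUE is the smallest POSITIVE double, not the most
        // negative one -- casting it to long yields 0, colliding with real zeros.
        System.out.println(Double.MIN_VALUE);         // 4.9E-324
        System.out.println((long) Double.MIN_VALUE);  // 0

        // The most negative finite double is -Double.MAX_VALUE.
        System.out.println(-Double.MAX_VALUE);        // -1.7976931348623157E308

        // Narrowing casts of out-of-range doubles saturate (JLS 5.1.3):
        System.out.println((long) -Double.MAX_VALUE); // -9223372036854775808
        System.out.println((int) -Double.MAX_VALUE);  // -2147483648
    }
}
```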
Re: Numeric Sorting with 0 and NULL Values
Chris Hostetter-3 wrote
> ...i mention this as being a workaround for floats/doubles because the
> functions are evaluated as doubles (no "casting" or "forced integer
> context" type support at the moment), so with integer/float fields there
> would be some loss of precision.

Excellent, thank you for the reply. My initial thought was going with the extra un-indexed/un-stored field... I wasn't aware of the "docValues" attribute to be used in that case for sorting (I assume this is more for performance). Thank you for the default value explanation. I definitely like the workaround as a reindex-free option.

I'm curious as to where the loss of precision would be when using "-(Double.MAX_VALUE)" as you mentioned? Also, any specific reason why you chose that over Double.MIN_VALUE (sorry, just making sure I'm not missing something)? I would think an int or long field would simply cast down from the double min/max value... at least that is what I gathered from poking around the "def()" function code. Of course, the decimal would be lost with the int and long, but I would still come away with the min values of -2147483648 and -9223372036854775808, respectively.
Numeric Sorting with 0 and NULL Values
I'm trying to sort on numeric (e.g. TrieDoubleField) fields and running into an issue where 0 and NULL values are being compared as equal. This appears to be the "common case" in the FieldComparator class, where the missing value (i.e. NULL) gets assigned for a 0 value (which is perfectly valid). Is there any way around this short of indexing another field to signify there is a value? I need the sort such that ascending will have the NULL values first and descending will have the NULL values last (i.e. sortMissingFirst="false" and sortMissingLast="false").

expected: NULL NULL 0 0.7 5 32
actual:   NULL 0 NULL 0.7 5 32

Please let me know if I can provide any additional information. Thank you.
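The workaround quoted in the replies above (Chris Hostetter's def() suggestion) substitutes an explicit sentinel for missing values at query time, so NULL no longer collides with a real 0. Roughly, with a hypothetical field name and -Double.MAX_VALUE as the sentinel:

```
sort=def(price_d,-1.7976931348623157E308) asc
```

Documents without a value in price_d then sort with the most negative finite double instead of 0, putting them first ascending and last descending.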
Re: Wildcard/Regex Searching with Decimal Fields
Sounds good. Thank you for the synonym (definitely will work on this) and padding suggestions.

- Todd
Re: Wildcard/Regex Searching with Decimal Fields
I see what you're saying, and that should do the trick. I could index 123 with an index synonym 123.0. Then my regex query /123/ should hit, along with a boolean query 123.0 OR 123.00*. Is there a cleaner approach to breaking apart the boolean query in this case? Right now, outside of Solr, I'm just looking for any extraneous zeros and wildcards to get the exact value (e.g. 123.0) and OR'ing that with the original user input. Thank you for your help.

- Todd
Re: Wildcard/Regex Searching with Decimal Fields
Erick Erickson wrote
> But I _really_ have to go back to one of my original questions:
> What's the use-case?

The use-case is with autocompleting fields. The user might know a frequency starts with 2, so we want to limit those results (e.g. 2, 23, 214, etc.). We would still index/store the numeric type but maintain an additional string index for autocompleting (and regular expressions). We can throw away the contains but will at least need the starts-with behavior.

- Todd
Wildcard/Regex Searching with Decimal Fields
I'm having some normalization issues when trying to search decimal fields (i.e. TrieDoubleField copied to TextField).

1. Wildcard searching: I created a separate TextField field type (e.g. filter_decimal) which filters whole numbers to have at least one decimal place (i.e. dot zero) using the pattern replace filter. When I build the query, I remove any extraneous zeros in the decimal (e.g. 235.000 becomes 235.0) to make sure my wildcard search will match on the non-wildcard decimal (hopefully that makes sense). I then build the wildcard query based on the original input along with the extraneous zeros removed (see examples below). Is this the best approach, or does Solr allow me to go about this another way?

e.g. input: 2*5.000  query: filter_decimal:2*5.000* OR filter_decimal:2*5.0
e.g. input: 235.     query: filter_decimal:235.*

2. Regex searching: When indexing decimal fields with a dot zero, any regular expressions that don't take that into account return no results (see example below). The only way around this is by dropping the dot zero when indexing. Of course, this now requires me to define another field type with an appropriate pattern replace filter. I tried creating a query token filter, but by the time I get the term attribute I don't know if the search was a regular expression or not. Any ideas on this? Is it best to just create another field type that removes the dot zero?

e.g. /23[58]/ (will not match on 235.0)

Please let me know if I can provide any additional details. Thanks for the help!
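The "remove any extraneous zeros" step described in point 1 can be sketched as a small helper; the class and method names are illustrative, not actual application code:

```java
public class DecimalNormalizer {

    // Normalizes input so it matches the indexed "dot zero" form:
    // whole numbers gain ".0", and runs of trailing zeros collapse to a
    // single one (e.g. "235.000" -> "235.0", "235.500" -> "235.5",
    // "235" -> "235.0"). Wildcards pass through: "2*5.000" -> "2*5.0".
    public static String normalize(String s) {
        if (!s.contains(".")) {
            return s + ".0";             // whole number: append ".0"
        }
        s = s.replaceAll("0+$", "");     // strip trailing zeros
        if (s.endsWith(".")) {
            s = s + "0";                 // keep exactly one zero after the dot
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(normalize("235.000")); // 235.0
        System.out.println(normalize("235.500")); // 235.5
        System.out.println(normalize("235"));     // 235.0
        System.out.println(normalize("2*5.000")); // 2*5.0
    }
}
```

The normalized form would then be OR'ed with the original wildcard input, as in the first example above.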
Re: Wildcard/Regex Searching with Decimal Fields
Essentially, we have a grid of data (i.e. frequencies, baud rates, data rates, etc.) and we allow wildcard filtering on the various columns. As the user provides input in a specific column, we simply filter the overall data by an implicit starts-with query (i.e. 23 becomes 23*). In most cases, yes, a range search would suffice until you get to those contains queries. We are working with strings with the need to properly handle the decimal place. I don't know the exact use case where the contains query comes into play with the numerics, but most likely it would have to do with pattern matching (i.e. knowing a certain sequence where 2*3 might be helpful).

It's easy enough to normalize the user input and perform an OR search with the wildcard. I'm just trying to find a way to index the data once that allows me to handle the dot zero in both wildcard and regex searches. I guess it would be nice to index the numeric as a string without the dot zero and, when performing a search, have the input hit against both the whole number and dot zero.

Erick Erickson wrote
> You could simply inject synonyms without the .0 in the same field though.

Using a SynonymFilterFactory? If so, can this be done dynamically, as I won't know the numeric (I guess we can call them string) values?
Re: Wildcard/Regex Searching with Decimal Fields
Erick Erickson wrote
> No, not using SynonymFilterFactory. Rather take that as a base for a
> custom Filter that doesn't use any input file.

OK, I just wanted to make sure I wasn't missing something that could be done with the SynonymFilterFactory itself. At one time, I started going down this path, but I wasn't sure if I could access the indexed values using a query filter, though I assume that is part of what SynonymFilterFactory is doing... I was able to create a custom filter, but I was only able to access the query input, from which I still couldn't distinguish what type of search was being done (i.e. regex or not). The regex query input did not include the surrounding forward slashes.

- Todd