Re: Rollback w/ Atomic Update

2016-12-13 Thread Todd Long
Yonik Seeley wrote
> "rollback" is a lucene-level operation that isn't really supported at
> the solr level:
> https://issues.apache.org/jira/browse/SOLR-4733

I find it odd that this unsupported operation has been around since Solr
1.4. In this case, it seems like there is some underlying issue specific to
partial updates.





Rollback w/ Atomic Update

2016-12-13 Thread Todd Long
We've noticed that partial (atomic) updates are not being rolled back when a
subsequent commit involves the same document id. Our only success in
mitigating this has been to send an empty commit immediately after the
rollback. I've included an example below showing the unexpected results of
the partial update. We are currently using SolrJ 4.8.1 with the default
deletion policy and auto commits disabled in the configuration. Any help
would be greatly appreciated in better understanding this scenario.

/update?commit=true (initial add)
---------------------------------
[
  {
    "id": "12345",
    "createdBy_t": "John Someone"
  }
]

/update
-------
[
  {
    "id": "12345",
    "favColors_txt": { "set": ["blue", "green"] }
  }
]

/update?rollback=true
---------------------
[]

/update?commit=true
-------------------
[
  {
    "id": "12345",
    "cityBorn_t": { "add": "Charleston" }
  }
]

/select?q=id:12345
------------------
[
  {
    "id": "12345",
    "createdBy_t": "John Someone",
    "favColors_txt": ["blue", "green"],
    "cityBorn_t": "Charleston"
  }
]
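
For reference, here is a minimal SolrJ 4.x sketch of the same sequence,
including the empty-commit workaround (the server URL and core name are
illustrative):

import java.util.Arrays;
import java.util.Collections;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RollbackExample {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // initial add + commit
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "12345");
        doc.addField("createdBy_t", "John Someone");
        solr.add(doc);
        solr.commit();

        // uncommitted atomic ("set") update
        SolrInputDocument partial = new SolrInputDocument();
        partial.addField("id", "12345");
        partial.addField("favColors_txt",
                Collections.singletonMap("set", Arrays.asList("blue", "green")));
        solr.add(partial);

        // roll back the pending update...
        solr.rollback();
        // ...workaround: an empty commit appears to be needed here, otherwise
        // the rolled-back values resurface after the next commit on this id
        solr.commit();

        // later atomic update on the same id + commit
        SolrInputDocument later = new SolrInputDocument();
        later.addField("id", "12345");
        later.addField("cityBorn_t", Collections.singletonMap("add", "Charleston"));
        solr.add(later);
        solr.commit();

        System.out.println(solr.query(new SolrQuery("id:12345")).getResults());
    }
}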





Re: Atomic Update w/ Date Copy Field

2016-09-07 Thread Todd Long
Stefan Matheis-3 wrote
> To me, it sounds more like you shouldn’t have to care about such gory
> details as a user - at all.
> 
> would you mind opening an issue on JIRA Todd? Including all the details you
> already provided in as well as a link to this thread, would be best.
> 
> Depending on what you actually did to find this all out, you probably do
> even have a test case at hand which demonstrates the behaviour? if not,
> that’s obviously not a problem :)

Agreed on the gory details. Yes, it definitely seems like the format should
be consistent between full and partial updates. I'll go ahead and open an
issue on JIRA.


Alexandre Rafalovitch wrote
> I noticed (and abused) the issue Todd described in my Solr puzzle at:
> http://blog.outerthoughts.com/2016/04/solr-5-puzzle-magic-date-answer/
> 
> The second format ("EEE...") looks rather strange. I would suspect
> that the conversion Date->String code is using the active locale and
> that is the default format for that locale. So, the bug might be that
> the locale needs to be more specific to preserve the consistency.

Thank you for the Solr puzzle reference. The EEE format is almost certainly
the result of java.util.Date.toString() being called when the field is
re-created.
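
For illustration, a quick way to see the two renderings side by side (the
timestamp is just an example, and Date.toString() output depends on the
JVM's default locale/time zone):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateFormats {
    public static void main(String[] args) throws Exception {
        // canonical Solr date format, with 'Z' as a literal and UTC times
        SimpleDateFormat iso =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.ROOT);
        iso.setTimeZone(TimeZone.getTimeZone("UTC"));

        Date d = iso.parse("2015-07-14T12:58:17.535Z");
        System.out.println(iso.format(d)); // 2015-07-14T12:58:17.535Z
        System.out.println(d);             // e.g. Tue Jul 14 12:58:17 UTC 2015
    }
}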





Re: Atomic Update w/ Date Copy Field

2016-08-30 Thread Todd Long
It looks like the issue has to do with the Date object. When the document is
fully updated (with the date specified) the field is created with a String
object, so everything is indexed as it appears. When the document is
partially updated (with the date omitted) the field is re-created using the
previously stored Date object, which takes the "toString" representation
(i.e. EEE MMM dd HH:mm:ss zzz yyyy).

I ended up creating a DateTextField which extends TextField and simply
overrides the "FieldType.createField(SchemaField, Object, float)" method. I
then check for a Date instance and format as necessary.

Any ideas on a better approach or does it sound like this is the way to go?
I wasn't sure if this could be accomplished in a filter or some other way.
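
In case it helps anyone else, here is a rough sketch of the DateTextField
described above (Solr 4.x FieldType API; the formatter details and error
handling are simplified and illustrative):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

import org.apache.lucene.index.IndexableField;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.schema.TextField;

public class DateTextField extends TextField {

    @Override
    public IndexableField createField(SchemaField field, Object value, float boost) {
        if (value instanceof Date) {
            // normalize stored Date values back to the canonical Solr format
            // instead of letting Date.toString() leak into the copy field
            SimpleDateFormat iso =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.ROOT);
            iso.setTimeZone(TimeZone.getTimeZone("UTC"));
            value = iso.format((Date) value);
        }
        return super.createField(field, value, boost);
    }
}

(SimpleDateFormat is not thread-safe, which is why the sketch creates one per
call rather than sharing an instance.)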





Atomic Update w/ Date Copy Field

2016-08-29 Thread Todd Long
We recently started using atomic updates in our application and have since
noticed that date fields copied to a text field have varying results between
full and partial updates. When the document is fully updated the copied text
date appears as expected (i.e. yyyy-MM-dd'T'HH:mm:ss.SSSZ); however, when
the document is partially updated (while omitting the date field) the
original stored date value is copied in a different format (i.e. EEE MMM d
HH:mm:ss z yyyy). I've included an example below of what we are seeing with
the indexed value of our "createdDate_facet_t" field. Is there a way that we
can force the copy field to always use "yyyy-MM-dd'T'HH:mm:ss.SSSZ" as the
resulting text format without having to always include the date field in the
update?

schema
------
[schema excerpt stripped by the mailing list archive; it included the
"createdDate_dt" date field and the copyField into "createdDate_facet_t"]


/update (full)
--------------
{
  "id": "12345",
  "createdBy_t": "someone",
  "createdDate_dt": "2015-07-14T12:58:17.535Z"
}

createdDate_facet_t = "2015-07-14t12:58:17.535z"

/update (partial)
-----------------
{
  "id": "12345",
  "createdBy_t": { "set": "another" }
}

createdDate_facet_t = "tue jul 14 12:58:17 utc 2015"





RE: DIH Caching w/ BerkleyBackedCache

2015-12-16 Thread Todd Long
James,

I apologize for the late response.


Dyer, James-2 wrote
> With the DIH request, are you specifying "cacheDeletePriorData=false"

We are not specifying that property (it looks like it defaults to "false").
I'm actually seeing this issue when running a full clean/import.

It appears that the Berkeley DB "cleaner" is always removing the oldest file
once there are three. In this case, I'll see two 1GB files and then as the
third file is being written (after ~200MB) the oldest 1GB file will fall off
(i.e. get deleted). I'm only utilizing ~13% disk space at the time. I'm
using Berkeley DB version 4.1.6 with Solr 4.8.1. I'm not specifying any
other configuration properties other than what I mentioned before. I simply
cannot figure out what is going on with the "cleaner" logic that would deem
that file "lowest utilized". Any other Berkeley DB/system configuration I
could consider that would affect this?

It's possible that this caching simply isn't suitable for our data set,
where a single document might contain a field with tens of thousands of
values... maybe that is the bottleneck with this database, since every add
copies in the prior data and then the "cleaner" removes the old entries.
Maybe it's working as it should but is just incredibly slow... I can get a
full index without caching in about two hours; however, with this caching
enabled it was still running after 24 hours (still caching the sub-entity).

Thanks again for the reply.

Respectfully,
Todd





Re: DIH Caching w/ BerkleyBackedCache

2015-11-17 Thread Todd Long
Mikhail Khludnev wrote
> It's worth to mention that for really complex relations scheme it might be
> challenging to organize all of them into parallel ordered streams.

This will most likely be the issue for us, which is why I would like to have
the Berkeley cache solution to fall back on, if possible. Again, I'm not
sure why, but it appears that the Berkeley cache is overwriting itself (i.e.
cleaning up unused data) while building the database... I've read plenty of
other threads where folks appear to be having success with that caching
solution.


Mikhail Khludnev wrote
> threads... you said? Which ones? Declarative parallelization in
> EntityProcessor worked only with certain 3.x version.

We are running multiple DIH instances which query against specific
partitions of the data (i.e. mod of the document id we're indexing).





Re: DIH Caching w/ BerkleyBackedCache

2015-11-16 Thread Todd Long
Mikhail Khludnev wrote
> "External merge" join helps to avoid boilerplate caching in such simple
> cases.

Thank you for the reply. I can certainly look into this, though I would have
to apply the patch for our version (i.e. 4.8.1). I really just simplified
our data configuration here; it actually consists of many sub-entities that
are successfully using the SortedMapBackedCache cache. I imagine this would
still apply to those, as the queries themselves are simple for the most
part. I assume that, performance-wise, this would only require the single
table scan?

I'm still very much interested in resolving this Berkeley database cache
issue. I'm sure there is some minor configuration I'm missing that is
causing this behavior. Again, I've had no issues with SortedMapBackedCache
for its intended caching purpose... I've tried simplifying our data
configuration to a single thread with a single sub-entity, with the same
results. Any help would be greatly appreciated.





DIH Caching w/ BerkleyBackedCache

2015-11-13 Thread Todd Long
We currently index using DIH along with the SortedMapBackedCache cache
implementation, which worked well until recently, when we needed to index a
much larger table. We were running into memory issues with the
SortedMapBackedCache, so we tried switching to the BerkleyBackedCache but
appear to have some configuration issues. I've included our basic setup
below. The issue we're running into is that the Berkeley database appears to
be removing its data files (see message below) before the cache has finished
building. When I watch the cache directory I only ever see two database
files at a time, each ~1GB in size (this appears to be hard coded). Is there
some additional configuration I'm missing to prevent the process from
"cleaning" up database files before the index has finished? I think this
"cleanup" continually re-triggers the caching, which never completes...
without caching, the full index takes ~2 hours. Any help would be greatly
appreciated. Thanks.

Cleaning message: "Chose lowest utilized file for cleaning. fileChosen: 0x0
..."

[data-config excerpt stripped by the mailing list archive; it defined the
parent SQL entity and a sub-entity cached with BerkleyBackedCache]



Re: DIH Caching with Delta Import

2015-11-03 Thread Todd Long
Erick Erickson wrote
> Have you considered using SolrJ instead of DIH? I've seen
> situations where that can make a difference for things like
> caching small tables at the start of a run, see:
> 
> searchhub.org/2012/02/14/indexing-with-solrj/

Nice write-up. I think we're going to move to that eventually so we can
leverage our models instead of maintaining a separate data configuration.
Thank you for sharing the link.





RE: DIH Caching with Delta Import

2015-10-24 Thread Todd Long
Dyer, James-2 wrote
> The DIH Cache feature does not work with delta import.  Actually, much of
> DIH does not work with delta import.  The workaround you describe is
> similar to the approach described here:
> https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport ,
> which in my opinion is the best way to implement partial updates with DIH.

Not what I was hoping to hear, but at least that explains the delta import
funkiness we were experiencing. Thank you for providing the partial updates
implementation link.





DIH Caching with Delta Import

2015-10-20 Thread Todd Long
It appears that DIH entity caching (e.g. SortedMapBackedCache) does not work
with deltas... is this simply a bug with the DIH cache support or somehow by
design?

Any ideas on a workaround for this? Ideally, I could just omit the
"cacheImpl" attribute, but that leaves the query (using the default
processor in my case) without the appropriate where clause involving the
"cacheKey" and "cacheLookup". Should SqlEntityProcessor be smart enough to
ignore the cache for deltas and simply append a where clause built from the
"cacheKey" and "cacheLookup"? Or possibly just include a where clause such
as ('${dih.request.command}' = 'full-import' or cacheKey = cacheLookup)? I
suppose those could be used to mitigate the issue, but I was hoping for a
better solution.

Any help would be greatly appreciated. Thank you.





Re: Numeric Sorting with 0 and NULL Values

2015-10-07 Thread Todd Long
Todd Long wrote
> I'm curious as to where the loss of precision would be when using
> "-(Double.MAX_VALUE)" as you mentioned? Also, any specific reason why you
> chose that over Double.MIN_VALUE (sorry, just making sure I'm not missing
> something)?

So, to answer my own question: it looks like Double.MIN_VALUE is somewhat
misleading (or poorly named, perhaps?)... the javadoc states "A constant
holding the smallest positive nonzero value of type double". In this case,
the cast to int/long would result in 0 due to the loss of precision, which
is definitely not what I want (and back to the original issue). It would
certainly seem that -Double.MAX_VALUE is the way to go! This is something I
was not aware of with Double... thank you.


Chris Hostetter-3 wrote
> ...i mention this as being a workaround for floats/doubles because the
> functions are evaluated as doubles (no "casting" or "forced integer
> context" type support at the moment), so with integer/float fields there
> would be some loss of precision.

I'm still curious whether there would be any cast issue going from double to
int/long within the "def()" function. Any additional details would be
greatly appreciated.
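
For reference, a minimal SolrJ sketch of the reindex-free sort workaround
being discussed, assuming the def() function is used to substitute
-Double.MAX_VALUE (written out below) for missing values; the field name and
URL are illustrative:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class NullsFirstSort {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        // def() falls back to -Double.MAX_VALUE when price_d has no value,
        // so missing values sort before 0 in ascending order
        q.addSort("def(price_d,-1.7976931348623157E308)", SolrQuery.ORDER.asc);

        System.out.println(solr.query(q).getResults());
    }
}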





Re: Numeric Sorting with 0 and NULL Values

2015-10-06 Thread Todd Long
Chris Hostetter-3 wrote
> ...i mention this as being a workaround for floats/doubles because the
> functions are evaluated as doubles (no "casting" or "forced integer
> context" type support at the moment), so with integer/float fields there
> would be some loss of precision.

Excellent, thank you for the reply.

My initial thought was to go with the extra un-indexed/un-stored field... I
wasn't aware of the "docValues" attribute to be used in that case for
sorting (I assume this is more for performance). Thank you for the default
value explanation.

I definitely like the workaround as a reindex-free option. I'm curious as to
where the loss of precision would be when using "-(Double.MAX_VALUE)" as you
mentioned. Also, is there any specific reason why you chose that over
Double.MIN_VALUE (sorry, just making sure I'm not missing something)? I
would think an int or long field would simply cast down from the double
min/max value... at least that is what I gathered from poking around the
"def()" function code. Of course, the decimal would be lost with int and
long, but I would still come away with the min values of -2147483648 and
-9223372036854775808, respectively.





Numeric Sorting with 0 and NULL Values

2015-10-04 Thread Todd Long
I'm trying to sort on numeric fields (e.g. TrieDoubleField) and running into
an issue where 0 and NULL values are compared as equal. This appears to be
the "common case" in the FieldComparator class, where the missing value
(i.e. NULL) is treated as 0 (which is itself a perfectly valid value). Is
there any way around this short of indexing another field to signify that
there is a value? I need the sort such that ascending will have the NULL
values first and descending will have the NULL values last (i.e.
sortMissingFirst="false" and sortMissingLast="false").

expected:
NULL
NULL
0
0.7
5
32

actual:
NULL
0
NULL
0.7
5
32

Please let me know if I can provide any additional information. Thank you.





Re: Wildcard/Regex Searching with Decimal Fields

2015-05-19 Thread Todd Long
Sounds good. Thank you for the synonym (definitely will work on this) and
padding suggestions.

- Todd





Re: Wildcard/Regex Searching with Decimal Fields

2015-05-19 Thread Todd Long
I see what you're saying and that should do the trick. I could index 123 with
an index synonym 123.0. Then my regex query /123/ should hit along with a
boolean query 123.0 OR 123.00*. Is there a cleaner approach to breaking
apart the boolean query in this case? Right now, outside of Solr, I'm just
looking for any extraneous zeros and wildcards to get the exact value (e.g.
123.0) and OR'ing that with the original user input.

Thank you for your help.

- Todd





Re: Wildcard/Regex Searching with Decimal Fields

2015-05-19 Thread Todd Long
Erick Erickson wrote
> But I _really_ have to go back to one of my original questions: What's
> the use-case?

The use case is autocomplete on these fields. The user might know a
frequency starts with 2, so we want to limit the results accordingly (e.g.
2, 23, 214, etc.). We would still index/store the numeric type but maintain
an additional string index for autocomplete (and regular expressions). We
can throw away the "contains" behavior but will at least need "starts with".

- Todd





Wildcard/Regex Searching with Decimal Fields

2015-05-18 Thread Todd Long
I'm having some normalization issues when trying to search decimal fields
(i.e. a TrieDoubleField copied to a TextField).

1. Wildcard searching: I created a separate TextField field type (e.g.
filter_decimal) which normalizes whole numbers to have at least one decimal
place (i.e. a trailing "dot zero") using the pattern replace filter. When I
build the query I remove any extraneous zeros from the decimal (e.g. 235.000
becomes 235.0) to make sure my wildcard search will match the non-wildcard
decimal (hopefully that makes sense). I then build the wildcard query from
the original input OR'ed with the zeros-removed form (see the examples
below; a rough sketch of this normalization also appears at the end of this
message). Is this the best approach, or does Solr allow me to go about this
another way?

e.g.
input: 2*5.000
query: filter_decimal:2*5.000* OR filter_decimal:2*5.0

e.g.
input: 235.
query: filter_decimal:235.*

2. Regex searching: When indexing decimal fields with a dot zero, any
regular expression that doesn't take that into account returns no results
(see the example below). The only way around this is to drop the dot zero
when indexing. Of course, this now requires me to define another field type
with an appropriate pattern replace filter. I tried creating a query token
filter, but by the time I get the term attribute I don't know if the search
was a regular expression or not. Any ideas on this? Is it best to just
create another field type that removes the dot zero?
field type that removes the dot zero?

e.g. /23[58]/ (will not match on 235.0)

Please let me know if I can provide any additional details. Thanks for the
help!
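
For what it's worth, a rough sketch of the client-side normalization
described in item 1 above (purely illustrative; not part of Solr):

public class DecimalNormalizer {

    // strip extraneous trailing zeros from the decimal portion so the
    // normalized variant can be OR'ed with the original wildcard input
    static String stripExtraneousZeros(String input) {
        if (!input.contains(".")) {
            return input;
        }
        String normalized = input.replaceAll("0+$", ""); // "2*5.000" -> "2*5."
        if (normalized.endsWith(".")) {
            normalized += "0";                           // "2*5."    -> "2*5.0"
        }
        return normalized;
    }

    public static void main(String[] args) {
        String input = "2*5.000";
        String query = "filter_decimal:" + input + "* OR filter_decimal:"
                + stripExtraneousZeros(input);
        // filter_decimal:2*5.000* OR filter_decimal:2*5.0
        System.out.println(query);
    }
}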





Re: Wildcard/Regex Searching with Decimal Fields

2015-05-18 Thread Todd Long
Essentially, we have a grid of data (i.e. frequencies, baud rates, data
rates, etc.) and we allow wildcard filtering on the various columns. As the
user provides input in a specific column, we simply filter the overall data
with an implicit starts-with query (i.e. 23 becomes 23*). In most cases,
yes, a range search would suffice until you get to the contains queries. We
are working with strings, with the need to properly handle the decimal
place. I don't know the exact use case where a contains query comes into
play with the numerics, but most likely it would have to do with pattern
matching (i.e. knowing a certain sequence, where 2*3 might be helpful).

It's easy enough to normalize the user input and perform an OR search with
the wildcard. I'm just trying to find a way to index the data once in a way
that allows me to handle the dot zero in both wildcard and regex searches. I
guess it would be nice to index the numeric as a string without the dot zero
and, when performing a search, have the input hit against both the whole
number and the dot-zero form.


Erick Erickson wrote
> You could simply inject synonyms without the .0 in the same field
> though.

Using a SynonymFilterFactory? If so, can this be done dynamically as I won't
know the numeric (I guess we can call them string) values.





Re: Wildcard/Regex Searching with Decimal Fields

2015-05-18 Thread Todd Long
Erick Erickson wrote
> No, not using SynonymFilterFactory. Rather take that as a base for a
> custom Filter that doesn't use any input file.

OK, I just wanted to make sure I wasn't missing something that could be done
with the SynonymFilterFactory itself. At one time I started going down this
path, but I wasn't sure if I could access the indexed values from a query
filter, though I assume that is part of what SynonymFilterFactory is
doing... I was able to create a custom filter, but I was only able to access
the query input, from which I still couldn't distinguish what type of search
was being done (i.e. regex or not). The regex query input did not include
the surrounding forward slashes.

- Todd
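
A rough sketch of what such a custom filter might look like (Lucene 4.x
analysis API; the class name and the exact "dot zero" handling are
illustrative):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class TrailingDotZeroSynonymFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt =
            addAttribute(PositionIncrementAttribute.class);

    private AttributeSource.State pending; // saved state for the extra token

    public TrailingDotZeroSynonymFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pending != null) {
            // emit the ".0"-stripped variant at the same position, like a synonym
            restoreState(pending);
            pending = null;
            String term = termAtt.toString();
            termAtt.setEmpty().append(term.substring(0, term.length() - 2));
            posIncAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        if (termAtt.toString().endsWith(".0")) {
            pending = captureState(); // queue the stripped variant
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending = null;
    }
}

In practice this would be paired with a small TokenFilterFactory so it can be
referenced from the field type's analyzer chain in the schema.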


