Re: Problem with numeric math types and the dataimport handler

2015-05-20 Thread Shalin Shekhar Mangar
Sounds similar to https://issues.apache.org/jira/browse/SOLR-6165 which I fixed in 4.10. Can you try a newer release? On Wed, May 20, 2015 at 6:51 AM, Shawn Heisey apa...@elyograg.org wrote: An unusual problem is happening with the DIH on a field that is an unsigned BIGINT in the MySQL

Re: Deduplication

2015-05-20 Thread Bram Van Dam
Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code to actually do the indexing. This sounded like the perfect option ... until I read Jack's comment: My

Re: Looking up arrays in a sub-entity

2015-05-20 Thread Upayavira
Personally, I see this as a limit of the dataimporthandler. It gets you started, but when your needs get at all complicated, it can't help you. I would encourage you to write your own indexing code. A little bit of code that reads over your database, sorts it out in the right way, and pushes it

Re: Solr query which return only those docs whose all tokens are from given list

2015-05-20 Thread Naresh Yadav
Requesting Solr experts again to suggest some solutions to my above problem, as I am not able to solve this. On Tue, May 12, 2015 at 11:04 AM, Naresh Yadav nyadav@gmail.com wrote: Thanks Andrew, you got my problem precisely, but the solutions you suggested may not work for me. In my API I get

Re: Problem with numeric math types and the dataimport handler

2015-05-20 Thread Shawn Heisey
On 5/20/2015 12:06 AM, Shalin Shekhar Mangar wrote: Sounds similar to https://issues.apache.org/jira/browse/SOLR-6165 which I fixed in 4.10. Can you try a newer release? I can't upgrade yet. I am using a plugin that hasn't been verified against anything newer than 4.9. When a new version

Re: Deduplication

2015-05-20 Thread Bram Van Dam
On 19/05/15 14:47, Alessandro Benedetti wrote: Hi Bram, what do you mean with: I would like it to provide the unique value myself, without having the deduplicator create a hash of field values. This is not deduplication, but simple document filtering based on a constraint. In the

[solr 5.1] Looking for full text + collation search field

2015-05-20 Thread Björn Keil
Hello, might anyone suggest a field type with which I may do both a full text search (i.e. there is an analyzer including a tokenizer) and apply a collation? An example for what I want to do: There is a field composer for which I passed the value Dvořák, Antonín. I want the following queries to

Re: Deduplication

2015-05-20 Thread Alessandro Benedetti
What the Solr de-duplication offers you is to calculate, for each input document, a hash (based on a set of fields). You can then select two options: - Index everything, documents with the same signature will be equal - avoid the overwriting of duplicates. How the similarity hash is calculated

Re: Term Frequency Calculation - Clarification

2015-05-20 Thread ariya bala
Thanks Jack. In my case there is only one document - Foo Foo is in bar As per your comment, I should expect TF to be 2. But I am getting one. Is there any check where if one match is a subset of other, is calculated once? My class extends DefaultSimilarity. Cheers Ariya Bala S On Wed, May 20,

Term Frequency Calculation - Clarification

2015-05-20 Thread ariya bala
Hi, I have made a custom class for scoring the similarity (TermFrequencyBiasedSimilarity). The score was deduced by considering just the TF part (achieved by setting IDF=1). Question is: - *Document content:* Foo Foo is in bar *Search query:* Foo bar *slop:* 3 With Slop 3, There

Re: Term Frequency Calculation - Clarification

2015-05-20 Thread Jack Krupansky
Yes. tf is both 1 and 2 - tf is per document, which is 1 for the first document and 2 for the second document. See: http://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html -- Jack Krupansky On Wed, May 20, 2015 at 6:13 AM, ariya bala

Problem using a function with a multivalued field

2015-05-20 Thread Fernando Agüero
Hi everyone, I’ve been reading answers around this problem but I wanted to make sure that there is another way out of my problem. The thing is that the solution shouldn’t be on index-time, involve indexing a new field or changing this multi-valued field to a single-valued one. Problem: I

When is too many fields in qf is too many?

2015-05-20 Thread Steven White
Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by passing in the list of fields via

Error on grouping result set

2015-05-20 Thread Abhijit Deka
Hi, I am having some problem while grouping the result set. I have a Solr schema like this: <fields> <field name="id" type="string" indexed="false" stored="true" required="true" /> <field name="product" type="string" indexed="true" stored="true" required="true" /> <field name="vendor" type="string" indexed="true"

Re: Term Frequency Calculation - Clarification

2015-05-20 Thread ariya bala
Please ignore. On Wed, May 20, 2015 at 2:45 PM, ariya bala ariya...@gmail.com wrote: Thanks Jack. In my case there is only one document - Foo Foo is in bar As per your comment, I should expect TF to be 2. But I am getting one. Is there any check where if one match is a subset of other, is

Re: Solr query which return only those docs whose all tokens are from given list

2015-05-20 Thread Mikhail Khludnev
Use an update processor to add the number of tags per doc, e.g. check CountFieldValuesUpdateProcessorFactory. Doc1 - tags:T1 T2 ; tagNum: 2 Doc2 - tags:T1 T3 ; tagNum: 2 Doc3 - tags:T1 T4 ; tagNum: 2 Doc4 - tags:T1 T2 T3 ; tagNum: 3 then when you search for tags you need to get number of tags matched
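
The tag-counting idea above can be sketched in plain Python, outside Solr. This is only a minimal sketch: it assumes the truncated message goes on to compare the number of matched tags with the stored tagNum, and the names `DOCS` and `docs_within_allowed_tags` are hypothetical illustrations, not Solr APIs.

```python
# Hypothetical in-memory stand-in for the four example documents above.
DOCS = {
    "Doc1": ["T1", "T2"],
    "Doc2": ["T1", "T3"],
    "Doc3": ["T1", "T4"],
    "Doc4": ["T1", "T2", "T3"],
}


def docs_within_allowed_tags(docs, allowed):
    """Return IDs of docs whose every tag appears in the allowed list."""
    allowed = set(allowed)
    result = []
    for doc_id, tags in docs.items():
        tag_num = len(tags)  # what the update processor would store as tagNum
        matched = sum(1 for tag in tags if tag in allowed)  # tags hit by the query
        if matched == tag_num:  # assumption: keep the doc only when all tags matched
            result.append(doc_id)
    return result


print(docs_within_allowed_tags(DOCS, ["T1", "T2", "T3"]))
```

With the allowed list T1, T2, T3, Doc3 is dropped because its T4 tag is not in the list, so only one of its two tags matches.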

Re: Deduplication

2015-05-20 Thread Shalin Shekhar Mangar
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam bram.van...@intix.eu wrote: Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code to actually do the indexing. This

Re: Block Join Query update documents, how to do it correctly?

2015-05-20 Thread Mikhail Khludnev
On Thu, May 14, 2015 at 12:01 AM, Tom Devel deve...@gmail.com wrote: I tried to repost the whole modified document (the parent and ALL of its children as one file), and it seems to work on a small toy example, but of course I cannot be sure for a larger instance with thousands of documents,

Re: When is too many fields in qf is too many?

2015-05-20 Thread Steven White
Also, is this 1500 fields that are always populated, or are there really a larger number of different record types, each with a relatively small number of fields populated in a particular document? Answer: This is a large number of different record types, each with a relatively small number of

Re: Looking up arrays in a sub-entity

2015-05-20 Thread rumford
I was able to get what I wanted by processing the column in question as massaged text, so that it was a comma-delimited series of IDs, and then passing that to a subentity query that went something like: SELECT value FROM othertable WHERE id IN (${master.ids}). It's slow but I think it's getting
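
For readers who follow the earlier advice in this thread and write their own indexing code, the same lookup can be sketched in Python. This is a rough sketch under stated assumptions: it uses an in-memory SQLite database instead of MySQL, the table `othertable` comes from the message above, and the `master` table layout, the sample rows and the helper `values_for` are hypothetical.

```python
import sqlite3

# Hypothetical sample data: each master row carries a comma-delimited ID list.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master (id INTEGER PRIMARY KEY, ids TEXT);
    CREATE TABLE othertable (id INTEGER PRIMARY KEY, value TEXT);
    INSERT INTO master VALUES (1, '10,20'), (2, '30');
    INSERT INTO othertable VALUES (10, 'red'), (20, 'green'), (30, 'blue');
""")


def values_for(conn, id_list):
    """Resolve a comma-delimited ID string to the matching othertable values."""
    ids = [int(part) for part in id_list.split(",") if part.strip()]
    if not ids:
        return []
    # One placeholder per ID keeps the IN (...) clause parameterized.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT value FROM othertable WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    )
    return [row[0] for row in rows]


# Build one document per master row, with the looked-up values as a list.
master_rows = conn.execute("SELECT id, ids FROM master ORDER BY id").fetchall()
docs = [{"id": mid, "values": values_for(conn, id_list)} for mid, id_list in master_rows]
print(docs)
```

Like the sub-entity query, this issues one lookup per master row, so it shares the same performance trade-off.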

[ANN] Relevant Search -- The Book on Search Relevance

2015-05-20 Thread Doug Turnbull
Hello fellow Solr users, We're writing a book on applied Lucene search relevance -- Relevant Search (http://manning.com/turnbull). We want to teach you to improve the quality of your Solr search results! We're trying to bridge the academic side of Information Retrieval from books like Intro. to

Re: When is too many fields in qf is too many?

2015-05-20 Thread Steven White
Thanks Shawn. I have already switched to using POST because I need to send a long list of data in qf. My question isn't about POST / GET, it's about Solr and Lucene having to deal with such long list of fields. Here is the text of my question reposted: Given the above, beside the fact that a

Re: Suggestion on field type

2015-05-20 Thread Vishal Swaroop
Thank you all... You all are experts... I will go with double as this seems to be more feasible. Regards On Tue, May 19, 2015 at 7:26 PM, Walter Underwood wun...@wunderwood.org wrote: A field type based on BigDecimal could be useful, but that would be a fair amount more work. Double is

Re: scoreMode ToParentBlockJoinQuery

2015-05-20 Thread Mikhail Khludnev
Hello, Here is the patch https://issues.apache.org/jira/browse/SOLR-5882 On Tue, May 12, 2015 at 1:11 PM, StrW_dev r.j.bamb...@structweb.nl wrote: Hi Is it possible to configure the scoreMode of the Parent block join query parser (ToParentBlockJoinQuery)? It seems it's set to none, while

Re: When is too many fields in qf is too many?

2015-05-20 Thread Shawn Heisey
On 5/20/2015 9:24 AM, Steven White wrote: I have already switched to using POST because I need to send a long list of data in qf. My question isn't about POST / GET, it's about Solr and Lucene having to deal with such long list of fields. Here is the text of my question reposted: Given

Re: solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread Ravi Solr
Shawn, I agree with you, but some of the decisions in the corporate world are handed down through higher powers/pay grade, who do not always like to hear counter arguments. For example, this is the same reason why govt/federal restrict tech folks to only use certified DBs/App Servers like Oracle, WSAD

Re: Edismax

2015-05-20 Thread Walter Underwood
I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:04
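
The difference between the two boost styles can be illustrated with a toy calculation. This is a simplified model and not Solr's actual scoring code: the functions `additive` and `multiplicative` and the sample scores are assumptions chosen for illustration.

```python
def additive(score, bonus):
    """Toy model of an additive boost: a fixed amount is added to the score."""
    return score + bonus


def multiplicative(score, factor):
    """Toy model of a multiplicative boost: the score is scaled by a factor."""
    return score * factor


# Compare the relative effect of each boost on low, medium and high base scores.
for base in (0.05, 5.0, 500.0):
    add_ratio = additive(base, 2.0) / base
    mul_ratio = multiplicative(base, 2.0) / base
    print(f"base={base}: additive x{add_ratio:.3f}, multiplicative x{mul_ratio:.3f}")
```

In this model the additive bonus swamps a low base score and barely moves a high one, while the multiplicative factor has the same relative effect at every score level.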

Re: Edismax

2015-05-20 Thread John Blythe
could i do that the same way as my mention of using bq? the docs aren't very rich in their example or explanation of boost= here: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser thanks! -- *John Blythe* Product Manager Lead Developer 251.605.3071 |

Upgrading question

2015-05-20 Thread Craig Longman
We've been using Solr a bit now for a year or so, 4.6 is the oldest version of Solr we've deployed. We're currently working through the process we'll use to upgrade to 5.1, an upgrade we need for the new facet.stats capabilities. Reading the Major Changes document, it indicates that there is

Re: Error on grouping result set

2015-05-20 Thread Erick Erickson
Possibly you changed the field type sometime without completely blowing away your index and re-indexing from scratch? Based on: unexpected docvalues type SORTED_SET for field 'vendor' (expected=SORTED) Because you can't group on multi-valued fields, which is I think what's going on here. Either

Re: Edismax

2015-05-20 Thread Walter Underwood
I believe that boost is a superset of the bq functionality. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:16 PM, John Blythe j...@curvolabs.com wrote: could i do that the same way as my mention of using bq? the docs aren't very

Re: ConfigSets and SolrCloud

2015-05-20 Thread Erick Erickson
What is it? There isn't one except zkcli and variants ;). Things are all automatic once you get things _to_ Zookeeper, but pushing the config sets up is a manual process. The usual process is to have the configs in some VCS somewhere so they're safe, and do the usual checkout/edit/checkin and at

Re: [solr 5.1] Looking for full text + collation search field

2015-05-20 Thread Ahmet Arslan
Hi Bjorn, solr.ICUCollationField is useful for *sorting*, and you cannot sort on tokenized fields. Your example looks like diacritics insensitive search. Please see : ASCIIFoldingFilterFactory Ahmet On Wednesday, May 20, 2015 2:53 PM, Björn Keil deeph...@web.de wrote: Hello, might anyone
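
What diacritics-insensitive matching means for the example value can be shown with a few lines of Python. This is only an approximation based on Unicode decomposition, assumed here for illustration; ASCIIFoldingFilterFactory uses its own mapping and covers characters this sketch does not. The helper name `fold_diacritics` is hypothetical.

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks after decomposing each character (NFKD)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("Dvořák, Antonín"))
```

If the same folding is applied to both the indexed value and the query, a search for the unaccented spelling can match the accented original.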

Re: Edismax

2015-05-20 Thread Shawn Heisey
On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of

Re: Edismax

2015-05-20 Thread John Blythe
thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confident metric of sorts. we've found some good standard deviation info when plotting out the accuracy of the top result and the relative score with the analyzers currently in production and hope to strengthen

Edismax

2015-05-20 Thread John Blythe
Hi all, I've been fine tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user generated searching is far

Re: Problem using a function with a multivalued field

2015-05-20 Thread Erick Erickson
bq: Keep a copy of the value into a non-multi-valued field, using an update processor: This involves indexing a new field Why can't you do this? You can't re-index the data perhaps? It's by far the easiest solution Best, Erick On Wed, May 20, 2015 at 2:45 AM, Fernando Agüero

Re: Edismax

2015-05-20 Thread John Blythe
cool, will check into it some more via testing -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:22 PM, Walter Underwood wun...@wunderwood.org wrote: I believe that boost is a

Re: Edismax

2015-05-20 Thread Walter Underwood
I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54

Re: Edismax

2015-05-20 Thread John Blythe
new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave

Re: Need help with Nested docs situation

2015-05-20 Thread Mikhail Khludnev
data scale and request rate can judge between block, plain joins and field collapsing. On Thu, Apr 30, 2015 at 1:07 PM, roySolr royrutten1...@gmail.com wrote: Hello, I have a situation and i'm a little bit stuck on the way how to fix it. For example the following data structure: *Deal*

Re: When is too many fields in qf is too many?

2015-05-20 Thread Doug Turnbull
Yeah a copyField into one could be a good space/time tradeoff. It can be more manageable to use an all field for both relevancy and performance, if you can handle the duplication of data. You could set tie=1.0, which effectively sums all the matches instead of picking the best match. You'll still
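
The effect of the tie parameter can be sketched as a small formula: the best field score plus tie times the sum of the remaining field scores. The function name `dismax_score` and the sample scores below are hypothetical; only the formula is meant to mirror how a disjunction-max query combines per-field scores.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus tie times the sum of all other field scores."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


scores = [3.0, 1.0, 0.5]  # hypothetical per-field scores for one document
print(dismax_score(scores, tie=0.0))  # winner takes all
print(dismax_score(scores, tie=1.0))  # every matching field contributes fully
```

Values of tie between 0 and 1 give the non-winning fields a partial say in the final score.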

Re: Help to index nested document

2015-05-20 Thread Mikhail Khludnev
I'm absolutely sure that you need to group them externally in the indexer eg like a child VALUES entity in DataImportHandler. On Mon, May 11, 2015 at 9:52 PM, Vishal Swaroop vishal@gmail.com wrote: Need your valuable inputs... I am indexing data from database (one table) which is in this

Re: Solr Cloud: No live SolrServers available

2015-05-20 Thread Chetan Vora
Seems like the attachments get stripped off. Anyways, here is the 4.7 log on startup INFO - 2015-05-20 10:35:45.786; org.eclipse.jetty.server.Server; jetty-8.1.10.v20130312 INFO - 2015-05-20 10:35:45.804; org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor

ConfigSets and SolrCloud

2015-05-20 Thread Jim . Musil
Hi, I need a little clarification on configSets in solr 5.x. According to this page: https://cwiki.apache.org/confluence/display/solr/Config+Sets I can create named configSets to be shared by other cores. If I create them using this method AND am operating in SolrCloud mode, will it

Re: When is too many fields in qf is too many?

2015-05-20 Thread Steven White
Hi Doug, Your blog write up on relevancy is very interesting, I didn't know this. Looks like I have to go back to my drawing board and figure out an alternative solution: somehow get those group-based-fields data into a single field using copyField. Thanks Steve On Wed, May 20, 2015 at 11:17

Re: When is too many fields in qf is too many?

2015-05-20 Thread Steven White
Thanks for calling out maxBooleanClauses. The current default of 1024 has not caused me any issues (so far) in my testing. However, you probably saw Doug Turnbull's reply, it looks like my relevance will suffer. Steve On Wed, May 20, 2015 at 11:42 AM, Shawn Heisey apa...@elyograg.org wrote:

Re: solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread Toke Eskildsen
Shawn Heisey apa...@elyograg.org wrote: I'm wondering ... if Jetty is good enough for the Google App Engine, why isn't it good enough for your infrastructure standards? Replace Jetty vs. Glassfish with Linux vs. Windows, Eclipse vs. Idea, emacs vs. vi, Java vs. C#... There are many reasons

Re: Edismax

2015-05-20 Thread Erick Erickson
John: The spam filter is very aggressive. Try changing the type to plain text rather than rich text or html... Best, Erick On Wed, May 20, 2015 at 2:35 PM, John Blythe j...@curvolabs.com wrote: thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confident

Re: Edismax

2015-05-20 Thread John Blythe
Good call thank you On Wed, May 20, 2015 at 5:15 PM, Erick Erickson erickerick...@gmail.com wrote: John: The spam filter is very aggressive. Try changing the type to plain text rather than rich text or html... Best, Erick On Wed, May 20, 2015 at 2:35 PM, John Blythe j...@curvolabs.com

Re: Edismax

2015-05-20 Thread Shawn Heisey
On 5/20/2015 3:35 PM, John Blythe wrote: regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box

Re: Reindex of document leaves old fields behind

2015-05-20 Thread Erick Erickson
Well, let's see the code. Standard updates should replace the previous docs, reindexing the same unique ID with fewer fields should show fewer fields. So something's weird here. Although do, just for yucks, issue a query on some of the unique ids in question, I'd be curious if you get more than

Reindex of document leaves old fields behind

2015-05-20 Thread tuxedomoon
I'm reindexing Mongo docs into SolrCloud. The new docs have had a few fields removed so upon reindexing those fields should be gone in Solr. They are not. So the result is a new doc merged with an old doc rather than a replacement which is what I need. I do not know whether the issue is with

Re: Reindex of document leaves old fields behind

2015-05-20 Thread tuxedomoon
The uniqueKey value is the same. The new documents contain fewer fields than the already indexed ones. Could this cause the updates to be treated as atomic? With the persisting fields treated as un-updated? Routing should be implicit since the collection was created using numShards. Many

Re: solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread TK Solr
On 5/20/15, 8:21 AM, Shawn Heisey wrote: As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these

Re: solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread TK Solr
Never mind. I found that thread. Sorry for the noise. On 5/20/15, 5:56 PM, TK Solr wrote: On 5/20/15, 8:21 AM, Shawn Heisey wrote: As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the

SolrCloud Leader Election

2015-05-20 Thread Ryan Steele
My SolrCloud cluster isn't reassigning the collections leaders from downed cores--the downed cores are still listed as the leaders. The cluster has been in the state for a few hours and the logs continue to report No registered leader was found after waiting for 4000ms. Is there a way to force

Re: SolrCloud delete by query performance

2015-05-20 Thread Ryan Cutter
GC is operating the way I think it should but I am lacking memory. I am just surprised because indexing is performing fine (documents going in) but deletions are really bad (documents coming out). Is it possible these deletes are hitting many segments, each of which I assume must be re-built?

Re: Upgrading question

2015-05-20 Thread Erick Erickson
Yep. Solr/Lucene strives for one major revision backwards compatibility. So any 5x should be able to read any index produced with 4x, but no index produced with 3x. Best, Erick On Wed, May 20, 2015 at 2:44 PM, Craig Longman clong...@iconect.com wrote: We've been using Solr a bit now for a year

Re: Edismax

2015-05-20 Thread Upayavira
A few things: Scores aren't confidence metrics, they are relative rankings, in relation to a single resultset, that's all. Secondly for edismax, boost does multiplicative boosting (whatever function you provide, the score is multiplied by that), whereas bf does additive boosting. Upayavira On

Re: Reindex of document leaves old fields behind

2015-05-20 Thread Shawn Heisey
On 5/20/2015 4:43 PM, tuxedomoon wrote: I'm reindexing Mongo docs into SolrCloud. The new docs have had a few fields removed so upon reindexing those fields should be gone in Solr. They are not. So the result is a new doc merged with an old doc rather than a replacement which is what I

SolrCloud delete by query performance

2015-05-20 Thread Ryan Cutter
I have a collection with 1 billion documents and I want to delete 500 of them. The collection has a dozen shards and a couple replicas. Using Solr 4.4. Sent the delete query via HTTP: http://hostname:8983/solr/my_collection/update?stream.body=<delete><query>source:foo</query></delete> Took a couple
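
A delete-by-query URL of this kind can be assembled with proper escaping in a few lines of Python. This is a minimal sketch that only builds the URL and sends nothing; it assumes the standard Solr XML delete message, and the helper name `delete_by_query_url` is hypothetical. The host, collection and query are the ones quoted in the message.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape


def delete_by_query_url(base_url, collection, query):
    """Build an update URL carrying an XML delete-by-query in stream.body."""
    # Escape XML special characters in the query before embedding it.
    body = f"<delete><query>{escape(query)}</query></delete>"
    # URL-encode the XML so angle brackets and colons survive transport.
    params = urlencode({"stream.body": body})
    return f"{base_url}/solr/{collection}/update?{params}"


url = delete_by_query_url("http://hostname:8983", "my_collection", "source:foo")
print(url)
```

The printed URL contains the percent-encoded form of the XML message, which is what an HTTP client would actually send.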

Re: SolrCloud delete by query performance

2015-05-20 Thread Shawn Heisey
On 5/20/2015 5:41 PM, Ryan Cutter wrote: I have a collection with 1 billion documents and I want to delete 500 of them. The collection has a dozen shards and a couple replicas. Using Solr 4.4. Sent the delete query via HTTP: http://hostname:8983/solr/my_collection/update?stream.body=

Re: SolrCloud delete by query performance

2015-05-20 Thread Shawn Heisey
On 5/20/2015 5:57 PM, Ryan Cutter wrote: GC is operating the way I think it should but I am lacking memory. I am just surprised because indexing is performing fine (documents going in) but deletions are really bad (documents coming out). Is it possible these deletes are hitting many

Re: SolrCloud delete by query performance

2015-05-20 Thread Ryan Cutter
Shawn, thank you very much for that explanation. It helps a lot. Cheers, Ryan On Wed, May 20, 2015 at 5:07 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 5:57 PM, Ryan Cutter wrote: GC is operating the way I think it should but I am lacking memory. I am just surprised because

solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread Ravi Solr
I have read that solr 5.x has moved away from deployable WAR architecture to a runnable Java Application architecture. Our infrastructure/standards folks are adamant about not running SOLR on Jetty (as we are about to upgrade from 4.7.2 to 5.1), any ideas on how I can make it run on Glassfish or

Re: When is too many fields in qf is too many?

2015-05-20 Thread Jack Krupansky
The uf parameter is used to specify which fields a user may query against - the qf parameter specifies the set of fields that an unfielded query term must be queried against. The user is free to specify fielded query terms, like field1:term1 OR field2:term2. So, which use case are you really

Re: When is too many fields in qf is too many?

2015-05-20 Thread Doug Turnbull
Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner takes all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here

Re: solr 5.x on glassfish/tomcat instead of jetty

2015-05-20 Thread Shawn Heisey
On 5/20/2015 9:07 AM, Ravi Solr wrote: I have read that solr 5.x has moved away from deployable WAR architecture to a runnable Java Application architecture. Our infrastructure/standards folks are adamant about not running SOLR on Jetty (as we are about to upgrade from 4.7.2 to 5.1), any ideas

Re: Solr Cloud: No live SolrServers available

2015-05-20 Thread Chetan Vora
Erick Thanks for your response. Logs don't seem to show any explicit errors (I have log level at INFO). I am attaching the logs from a 4.7 start and a 5.1 start here. Note that both logs seem to show the shards as Down initially but for 5.1, the state change to Active later on. Also, note that

Re: When is too many fields in qf is too many?

2015-05-20 Thread Shawn Heisey
On 5/20/2015 6:27 AM, Steven White wrote: My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by