Re: full name free text search problem

2018-01-31 Thread Alexandre Rafalovitch
You need to tokenize the full name in several different ways and then search both (all) tokenization versions with different boosts. This way you can tokenize as full string (perhaps lowercased) and then also on white space and then maybe even with phonetic mapping to catch spellings. You can
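A minimal sketch of the multi-tokenization idea above, assuming an edismax request and three hypothetical fields (`name_exact`, `name_ws`, `name_phonetic`) that would each hold a copy of the full name analyzed in a different way. The field names and boost values are illustrative only and do not come from the thread.

```python
from urllib.parse import urlencode

# Hypothetical fields: each would be a copy of the full name, analyzed as
# a single lowercased string, as whitespace tokens, and as phonetic codes.
FIELD_BOOSTS = {
    "name_exact": 10.0,
    "name_ws": 3.0,
    "name_phonetic": 1.0,
}


def build_name_query(full_name):
    """Return edismax parameters that search every tokenization of the name."""
    qf = " ".join(f"{field}^{boost:g}" for field, boost in FIELD_BOOSTS.items())
    return {"defType": "edismax", "q": full_name, "qf": qf}


params = build_name_query("Dae Kim")
query_string = urlencode(params)  # ready to append to a /select URL
```

With boosts ordered this way, a match on the whole-string field should outrank a match on a single token, which in turn outranks a phonetic-only match.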

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
@Walter: Perhaps you are right not to consider stemming. Instead, fuzzy search will cover these along with the misspellings. In case of symbols, we want the titles matching the symbols ranked higher than the others. Perhaps we can use this field only for boosting. Certain movies have around

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Walter Underwood
I was the first search engineer at Netflix and moved their search from a home-grown engine to Solr. It worked very well with a single title field and aliases. I think your schema is too complicated for movie search. Stemming is not useful. It doesn’t help search and it can hurt. You don’t want

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
@Tim Casey: Yeah... TFIDFSimilarity weighs towards shorter documents. This is done through the fieldNorm component in the class. The issue is when the field is multivalued. Consider a field with two strings of 4 tokens each. The fieldNorm from the Lucene TFIDFSimilarity class considers the total
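A small sketch of the effect described above, assuming the truncated sentence refers to the total token count across all values of the field, and using the classic length norm 1/sqrt(number of tokens). It ignores the lossy single-byte encoding that Lucene applies to stored norms, so the numbers are illustrative only.

```python
import math
from typing import List


def classic_length_norm(num_tokens: int) -> float:
    """Classic TF-IDF length norm: 1 / sqrt(number of tokens in the field)."""
    return 1.0 / math.sqrt(num_tokens)


def multivalued_norm(values: List[List[str]]) -> float:
    """Norm for a multivalued field, computed over the tokens of all values."""
    total_tokens = sum(len(tokens) for tokens in values)
    return classic_length_norm(total_tokens)


single = multivalued_norm([["beauty", "and", "the", "beast"]])
double = multivalued_norm([
    ["beauty", "and", "the", "beast"],
    ["la", "belle", "et", "bete"],
])
```

Here `single` is 0.5 while `double` is about 0.354, so a title that carries an alias is penalized even though each individual value is just as short.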

Re: Query fields with data of certain length

2018-01-31 Thread Zheng Lin Edwin Yeo
Hi, Have you managed to get the regex for this string in Chinese: 预支款管理及账务处理办法 ? Regards, Edwin On 4 January 2018 at 18:04, Zheng Lin Edwin Yeo wrote: > Hi Emir, > > An example of the string in Chinese is 预支款管理及账务处理办法 > > The number of characters is 12, but the expected

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Tim Casey
For smaller length documents TFIDFSimilarity will weight towards shorter documents. Another way to say this, if your documents are 5-10 terms, the 5 terms are going to win. You might think about having per token, or token pair, weight. I would be surprised if there was not something similar out

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
@Walter: We have 6 fields declared in schema.xml for title each with different type of analyzer. One without processing symbols, other stemmed and other removing symbols, etc. So, if we have separate fields for each alias it will be that many times the number of final fields declared in

Re: Distributed search cross cluster

2018-01-31 Thread Jan Høydahl
Erick: > ...one for each cluster and just merged the docs when it got them back This would be the logical way. I'm afraid that "just merged the docs" is the crux here, that would make this an expensive task. You'd have to merge docs, facets, highlights etc, handle the different search phases

Re: facet.method=uif not working in solr cloud?

2018-01-31 Thread Wei
Thanks Alessandro. Totally agree that from the logic I can't see why the requested facet.method=uif is not accepted. I don't see anything in solr.log either. However, I find that the uif method somehow works with the JSON Facet API in cloud mode, e.g.: curl

Re: Distributed search cross cluster

2018-01-31 Thread Jan Høydahl
Hi, I am an ex-FAST employee and actually used Unity a lot myself, even hacking the code writing custom mixers etc :) That is all cool, if you want to write a generic federation layer. In our case we only ever need to talk to Solr instances with exactly the same schema and document types,

Re: Mixing simple and nested docs in same update?

2018-01-31 Thread Jan Høydahl
Thanks for the reply. I see that the child doctransformer (https://lucene.apache.org/solr/guide/6_6/transforming-result-documents.html#TransformingResultDocuments-_child_-ChildDocTransformerFactory) has a childFilter= option which, when used, solves the issue/bug. But such a childFilter does

Re: Help with Boolean search using Solr parser edismax

2018-01-31 Thread Emir Arnautović
Hi Wendy, I was thinking of the query q=method:"x-ray*" "Solution NMR" This should be equivalent to one with OR between them. If you want to put AND between those two, the query would be q=+method:"x-ray*" +"Solution NMR" Emir -- Monitoring - Log Management - Alerting - Anomaly Detection Solr &
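The two query shapes above can be sketched as small helpers. This assumes the default operator is OR (q.op=OR), so that clauses separated by a space are optional; the helper names are made up for illustration.

```python
def any_of(clauses):
    """Join clauses with spaces: each is optional when the default operator is OR."""
    return " ".join(clauses)


def all_of(clauses):
    """Prefix each clause with '+' so that every clause is required."""
    return " ".join("+" + clause for clause in clauses)


clauses = ['method:"x-ray*"', '"Solution NMR"']
or_query = any_of(clauses)
and_query = all_of(clauses)
```

Building the string from separate clauses avoids the problem of OR being glued to a neighbouring term when the surrounding spaces are lost.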

Re: Minimum memory requirement

2018-01-31 Thread Shawn Heisey
On 1/31/2018 1:54 PM, TK Solr wrote: On my AWS t2.micro instance, which only has 1 GB memory, I installed Solr (4.7.1 - please don't ask) and tried to run it in sample directory as java -jar start.jar. It exited shortly due to lack of memory. How much memory does Solr require to run, with

RE: 7.2.1 cluster dies within minutes after restart

2018-01-31 Thread Markus Jelsma
Hello S.G. We do not complain about speed improvements at all, it is clear 7.x is faster than its predecessor. The problem is stability and not recovering from weird circumstances. In general, it is our high load cluster containing user interaction logs that suffers the most. Our main text

Minimum memory requirement

2018-01-31 Thread TK Solr
On my AWS t2.micro instance, which only has 1 GB memory, I installed Solr (4.7.1 - please don't ask) and tried to run it in sample directory as java -jar start.jar. It exited shortly due to lack of memory. How much memory does Solr require to run, with empty core? TK

Re: Long GC Pauses

2018-01-31 Thread S G
Hey Maulin, I hope you are using some tools to look at your gc.log file (there are a couple available online) or grepping for pauses. Do you mind sharing your G1GC settings and some screenshots from your gc.log analyzer's output? -SG On Wed, Jan 31, 2018 at 9:16 AM, Erick Erickson

Re: 7.2.1 cluster dies within minutes after restart

2018-01-31 Thread S G
We did some basic load testing on our 7.1.0 and 7.2.1 clusters. And that came out all right. We saw a performance increase of about 30% in read latencies between 6.6.0 and 7.1.0 And then we saw a performance degradation of about 10% between 7.1.0 and 7.2.1 in many metrics. But overall, it still

Re: Help with Boolean search using Solr parser edismax

2018-01-31 Thread Wendy2
Hi Emir, Listed below are the debugQuery outputs from query without "OR" operator. I really appreciate your help! --Wendy ===DebugQuery Outputs for case 1f-a, 1f-b without "OR" operator= *1f-a (/search?q=+method:"x-ray*" +method:"Solution NMR") result counts = 0: *

Sorting results for spatial search

2018-01-31 Thread Leila Deljkovic
Hiya, So I have some nested documents in my index with this kind of structure: { "id": "parent", "gridcell_rpt": "POLYGON((30 10, 40 40, 20 40, 10 20, 30 10))", "density": "30", "_childDocuments_" : [ { "id":"child1",

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Luigi Caiazza
Hi, first of all, thank you for your answers. @ Rick: the reason is that the set of pages that are stored into the disk represents just a static view of the Web, in order to let my experiments be fully replicable. My need is to run simulations of different crawlers on top of it, each working on

Re: Help with Boolean search using Solr parser edismax

2018-01-31 Thread Emir Arnautović
Hi Wendy, With OR with spaces OR is interpreted as another search term. Can you try without OR, with just a space between the two parts? If you need AND, use + before each part. HTH, Emir On Jan 31, 2018 6:24 PM, "Wendy2" wrote: Hi Emir, Thank you so much for following up with

Re: Help with Boolean search using Solr parser edismax

2018-01-31 Thread Wendy2
Hi Emir, Thank you so much for following up with your ticket. Listed below are the parts of debugQuery outputs via /search request handler. The reason I used * in the query term is that there are a couple of methods starting with "x-ray". When I used space surrounding the "OR" boolean search

Re: Long GC Pauses

2018-01-31 Thread Erick Erickson
Just to double check, when you say you're seeing 60-200 sec GC pauses, are you looking at the GC logs (or using some kind of monitor) or is that the time it takes the query to respond to the client? Because a single GC pause that long on 40G is unusual no matter what. Another take on Jason's

Re: How to avoid warning message

2018-01-31 Thread Shawn Heisey
On 1/31/2018 9:07 AM, Tamás Barta wrote: I'm using Solr 6.6.2 and I use Zookeeper to handle Solr cloud. In Java client I use SolrJ this way: *client = new CloudSolrClient.Builder().withZkHost(zkHostString).build();* In the log I see the following: *WARN

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Erick Erickson
Or use a boost for the phrase, something like "beauty and the beast"^5 On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood wrote: > You can use a separate field for title aliases. That is what I did for > Netflix search. > > Why disable idf? Disabling tf for titles can be a

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Walter Underwood
You can use a separate field for title aliases. That is what I did for Netflix search. Why disable idf? Disabling tf for titles can be a good idea, for example the movie “New York, New York” is not twice as much about New York as some other film that just lists it once. Also, consider using a

How to avoid warning message

2018-01-31 Thread Tamás Barta
Hi, I'm using Solr 6.6.2 and I use Zookeeper to handle Solr cloud. In Java client I use SolrJ this way: *client = new CloudSolrClient.Builder().withZkHost(zkHostString).build();* In the log I see the following: *WARN [org.apache.zookeeper.SaslClientCallbackHandler] Could not login: the

Re: Query parser problem, using fuzzy search

2018-01-31 Thread David Frese
On 29.01.18 at 18:05, Erick Erickson wrote: Try searching with the word and in lowercase. Somehow you have to allow the parser to distinguish the two. Oh yeah, the biggest unsolved problem in the ~80-year history of programming languages... NOT ;-) You _might_ be able to try "AND~2" (with

Re: Long GC Pauses

2018-01-31 Thread Jason Gerlowski
Hi Maulin, To clarify, when you said "...allocated 40 GB RAM to each shard." above, I'm going to assume you meant "to each node" instead. If you actually did mean "to each shard" above, please correct me and anyone who chimes in afterward. Firstly, it's really hard to even take guesses about

Solrj + spring data: Indexing file body + own fields

2018-01-31 Thread Joris De Smedt
Hi, I'm using SolrJ 6.6.1 found in spring-data-solr 3.0.3.RELEASE; Solr is 7.2.1. I'm currently able to upload a SolrDocument via spring-data but would like to add the equivalent of Tika's new AutoDetectParser().parse(stream, new BodyContentHandler(-1), new Metadata()) as a content field.

Save the document size in to a new field

2018-01-31 Thread Blackknight
Hello guys, I want to add an option to search documents by size. For example, find the top categories with the biggest documents. I thought about creating a new update processor which counts the bytes of all fields in the document, but I think it won't work well, because some fields are

Long GC Pauses

2018-01-31 Thread Maulin Rathod
Hi, We are using SolrCloud 6.1. We have around 20 collections on 4 nodes (2 shards, each with 2 replicas). We have allocated 40 GB RAM to each shard. Intermittently we see long GC pauses (60 sec to 200 sec), due to which Solr stops responding and hence collections go in

Clusterstatus Action

2018-01-31 Thread Chris Ulicny
Hi all, According to the documentation, the 'shard' parameter for the CLUSTERSTATUS action should allow a comma delimited list of shards. However, passing 'shard1,shard2' as the value results in a shard-not-found error where it was looking for 'shard1,shard2'. Not a search for 'shard1' and

Re: Using SolrJ for digest authentication

2018-01-31 Thread Rick Leir
Eddy Maybe your request is getting through twice. Check your logs to see. Cheers -- Rick On January 31, 2018 5:59:53 AM EST, ddramireddy wrote: >We are currently deploying Solr in war mode(Yes, recommendation is not >war. >But this is something I can't change now. Planned

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Rick Leir
Luigi Is there a reason for not indexing all of your on-disk pages? That seems to be the first step. But I do not understand what your goal is. Cheers -- Rick On January 30, 2018 1:33:27 PM EST, Luigi Caiazza wrote: >Hello, > >I am working on a project that simulates a

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
Hi Luigi, What about using an updatable DocValue [1] for the field x? You could initially set it to -1, and then update it for the docs in step j. Range queries should still work and the update should be fast. Cheers [1]
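A hedged sketch of what such an update could look like as a JSON atomic-update payload. It assumes `x` is a single-valued numeric field with docValues enabled and neither indexed nor stored, which is what Solr needs in order to apply the change in place; the document id and value are made up.

```python
import json


def set_value_update(doc_id, field, value):
    """Build one atomic-update document that sets `field` to `value`."""
    return {"id": doc_id, field: {"set": value}}


# Hypothetical document id and value; the body would be POSTed to /update.
payload = json.dumps([set_value_update("page-42", "x", 7)])

# Pages that still hold the initial -1 are excluded by a range filter.
range_filter = "x:[0 TO *]"
```

Only the id and the changed field travel over the wire, which is why this kind of update should stay cheap even when many documents change in a single step.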

Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
Hi, We are using Solr for our movie title search. As it is "title search", this should be treated differently from the normal document search. Hence, we use a modified version of TFIDFSimilarity with the following changes. - disabled TF & IDF and will only have 1 as value. - disabled norms by

Re: Save the document size in to a new field

2018-01-31 Thread Emir Arnautović
With any generic solution there will be always the question of what is the document size: should you count the same field twice if indexed in two different ways? Does size of index count or size of response? If simplified version works for you - approximate doc size to the size of the largest

OnImportEnd EventListener

2018-01-31 Thread Srinivas Kashyap
Hello, I'm trying to get the documents which got indexed on calling DIH and I want to differentiate such documents from the ones which are added using SolrJ atomic update. Is it possible to get the document primary keys which got indexed through the "onImportEnd" EventListener? Any alternative way

RE: 7.2.1 cluster dies within minutes after restart

2018-01-31 Thread Markus Jelsma
Ah thanks, i just submitted a patch fixing it. Anyway, in the end it appears this is not the problem we are seeing as our timeouts were already at 30 seconds. All i know is that at some point nodes start to lose ZK connections due to timeouts (logs say so, but all within 30 seconds), the logs

Re: Searching for an efficient and scalable way to filter query results using non-indexed and dynamic range values

2018-01-31 Thread Alessandro Benedetti
I am not sure I fully understood your use case, but let me suggest a few different possible solutions : 1) Query Time join approach : you keep 2 collections, one static with all the pages, one that just stores lightweight documents containing the crawling interaction : 1) Id, content -> Pages

Re: OnImportEnd EventListener

2018-01-31 Thread Emir Arnautović
So all fields are DIH imported? And you just want to know which are from the last run? Can you add date field and track when DIH started and ended and filter based on that? Emir -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training -
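One way to express the date-window idea above as a filter query, assuming a hypothetical date field named `indexed_at` that is stamped at import time and that the start and end of the DIH run were recorded separately. Neither the field name nor the timestamps come from the thread.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format a timezone-aware datetime in Solr's UTC date syntax."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def import_window_filter(field, start, end):
    """Build an fq that keeps documents stamped during one import run."""
    return f"{field}:[{solr_date(start)} TO {solr_date(end)}]"


run_start = datetime(2018, 1, 31, 9, 0, tzinfo=timezone.utc)
run_end = datetime(2018, 1, 31, 9, 30, tzinfo=timezone.utc)
fq = import_window_filter("indexed_at", run_start, run_end)
```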

Using SolrJ for digest authentication

2018-01-31 Thread ddramireddy
We are currently deploying Solr in war mode (yes, the recommendation is not war, but this is something I can't change now; it is planned for the future). I am setting up authentication for Solr. As the basic authentication provided by Solr is not working in Solr 6.4.2, I am setting up digest authentication in Tomcat for

RE: OnImportEnd EventListener

2018-01-31 Thread Srinivas Kashyap
Hi Emir, Thanks for the reply, As I'm doing atomic update on the existing documents(already indexed from DIH) as well, with the suggested approach, I might end up doing atomic update on DIH imported document and commit the same. So, I wanted to get the document values which were indexed when

Re: Computing record score depending on its association with other records

2018-01-31 Thread Gintautas Sulskus
Yes, that is correct. Collection 'features' stores mapping between features and their scores. For simplicity, I tried to keep the level of detail about these collections to a minimum. Both collections contain thousands of records and are updated by (lily) hbase-indexer. Therefore storing

Re: OnImportEnd EventListener

2018-01-31 Thread Emir Arnautović
Hi Srinivas, I guess you can add some field that will be set in your DIH config - something like: And you can use ‘dih’ field to filter out docs that are imported using DIH. HTH, Emir -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training

Re: facet.method=uif not working in solr cloud?

2018-01-31 Thread Alessandro Benedetti
I worked personally on the SimpleFacets class which does the facet method selection : FacetMethod appliedFacetMethod = selectFacetMethod(field, sf, requestedMethod, mincount, exists); RTimer timer = null; if (fdebug != null)

Re: Help with Boolean search using Solr parser edismax

2018-01-31 Thread Emir Arnautović
Hi Wendy, I see several issues, but not sure if any of them is the reason why you are not getting what you expect: * there are no spaces around OR and that results in query being parsed sometimes with OR, e.g. (pdb_id:OR\"Solution)^5 * wildcard in quotes - it is not handled as you expected - the

Re: full name free text search problem

2018-01-31 Thread Alessandro Benedetti
"I am getting the records matching the full name sorted by distance. If the input string(for ex Dae Kim) is provided, I am getting the records other than Dae Kim(for ex Rodney Kim) too at the top of the search results including Dae Kim just before the next Dae Kim because Kim is matching with

Re: Distributed search cross cluster

2018-01-31 Thread Bernd Fehling
Many years ago, in a different universe, when Federated Search was a buzzword we used Unity from FAST FDS (which is now MS ESP). It worked pretty well across many systems like FAST FDS, Google, Gigablast, ... Very flexible with different mixers, parsers, query transformers. Was written in Python

Re: Distributed search cross cluster

2018-01-31 Thread Charlie Hull
On 30/01/2018 16:09, Jan Høydahl wrote: Hi, A customer has 10 separate SolrCloud clusters, with same schema across all, but different content. Now they want users in each location to be able to federate a search across all locations. Each location is 100% independent, with separate ZK etc.