Managed schema used with Cloudera MapreduceIndexerTool and morphlines?

2017-03-17 Thread Jay Hill
I've got a very difficult project to tackle. I've been tasked with using schemaless mode to index JSON files that we receive. The structure of the JSON files will always be very different, as we're receiving files from different customers totally unrelated to one another. We are attempting to build

Cascading failures with replicas

2017-03-17 Thread Walter Underwood
I’m running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I shut down Solr on one host because it got into some kind of bad, can’t-recover state where it was causing timeouts across the whole cluster (bug #1). I ran a load benchmark near the capacity of the cluster. This had run

Re: How on EARTH do I remove 's in schema file?

2017-03-17 Thread donato
Thanks for the response, Erick! Can you download my schema file here? CLICK HERE . I'm not too familiar with this technology yet. I tried adding that debug=query at the end of my URL, but nothing happened. Thanks again for the response! All along, I just wanted queries

Re: analysis matches aren't counting as matches in query

2017-03-17 Thread Erick Erickson
The most common issue here is that the query isn't being parsed like you think it is. The simplest is that the query has spaces somewhere. I.e. "q=f1:a b" gets parsed as q=f1:a default_field:b The analysis page (which I assume you're talking about) tells you what happens _after_ the query is
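
Erick's point about spaces can be made concrete by quoting the multi-term value before URL-encoding it, so both terms stay on the intended field instead of falling back to the default field. A minimal sketch (the field name f1 is just illustrative):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QueryEncode {
    public static void main(String[] args) {
        // Unquoted, q=f1:a b is parsed as f1:a plus default_field:b.
        // Quoting the value keeps both terms on the intended field.
        String quoted = "f1:\"a b\"";
        // URL-encode before putting it in the q= parameter.
        String encoded = URLEncoder.encode(quoted, StandardCharsets.UTF_8);
        System.out.println("q=" + encoded);
    }
}
```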

Re: How on EARTH do I remove 's in schema file?

2017-03-17 Thread Erick Erickson
What stemmers are you using? I got the results I wanted by using EnglishPossessiveFilterFactory followed by PorterStemFilterFactory. Or you could use Porter alone and remove the leftover trailing apostrophe. Best, Erick On Fri, Mar 17, 2017 at 5:05 PM, Erick Erickson wrote: > Your
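
The chain Erick describes might look like this in a schema file; the fieldType name is illustrative and the tokenizer choice is an assumption:

```xml
<!-- Sketch of an analysis chain; field type name is illustrative -->
<fieldType name="text_en_possessive" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- strips a trailing 's before stemming, so "patrick's" -> "patrick" -->
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```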

Re: How on EARTH do I remove 's in schema file?

2017-03-17 Thread Erick Erickson
Your schema file didn't come through. Have you tried looking at the admin UI/Analysis page for the three values? That often tells you what is going on. The other thing to do is attach debug=query to the URL. That'll show you how the query parsed, which is separate from the analysis bits. Best, Erick

Solr Split

2017-03-17 Thread Azazel K
Hi, We have a solr index running in 4.5.0 that we are trying to upgrade to 4.7.2 and split the shard. The uniqueKey is a TrieLongField, and its values are always negative: In prod (2 shards, 1 replica for each shard) Max : -9223372035490849922 Min : -9223372036854609508 In lab (1
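
For reference, shard splitting is driven through the Collections API; a sketch of the call (host, collection, and shard names are placeholders):

```text
# Collections API call to split shard1 into two sub-shards:
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1
```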

How on EARTH do I remove 's in schema file?

2017-03-17 Thread donato
I have been racking my brain for days... I need to remove 's from, say, "patrick's". If I search for "patrick" or "patricks" I get the same number of results; however, if I search for "patrick's" it's a different number. I just want Solr to ignore the 's. Can someone PLEASE help me? It is driving me

analysis matches aren't counting as matches in query

2017-03-17 Thread John Blythe
hi all, i'm having a hard time understanding why i'm not getting hits on a manufacturer field that i recently updated. i get the following results, the top row being the index analysis and the second the query: RDTF mentor advanced sterilize RDTF mentor advanced sterilize yet when the

Re: Parallelizing post filter for better performance

2017-03-17 Thread Joel Bernstein
You'll probably get better results by trying to get more performance out of your single threaded postfilter. If you can post the code in your collect() method you may get some ideas on how to improve the performance. Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Mar 17, 2017 at 2:13 PM,

Re: Parallelizing post filter for better performance

2017-03-17 Thread Mikhail Khludnev
Lucene can search segments in parallel. In Solr, you can break it into multiple shards/cores and run a distributed search even on a single Solr instance (even without SolrCloud). On Fri, Mar 17, 2017 at 9:13 PM, Sundeep T wrote: > Hello, > > Is there a way to execute the post
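
The single-instance distributed search Mikhail mentions can be requested with the shards parameter; a sketch with illustrative core names:

```text
# Manual distributed search across two cores on one Solr instance:
http://localhost:8983/solr/core1/select?q=*:*&shards=localhost:8983/solr/core1,localhost:8983/solr/core2
```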

Re: fq performance

2017-03-17 Thread Yonik Seeley
On Fri, Mar 17, 2017 at 2:17 PM, Shawn Heisey wrote: > On 3/17/2017 8:11 AM, Yonik Seeley wrote: >> For Solr 6.4, we've managed to circumvent this for filter queries and >> other contexts where scoring isn't needed. >> http://yonik.com/solr-6-4/ "More efficient filter

Re: SOLR Data Locality

2017-03-17 Thread Toke Eskildsen
Imad Qureshi wrote: > I understand that but unfortunately that's not an option right now. > We already have 16 TB of index in HDFS. > > So let me rephrase this question. How important is data locality for > SOLR. Is performance impacted if SOLR data is on a remote

Re: fq performance

2017-03-17 Thread Shawn Heisey
On 3/17/2017 8:11 AM, Yonik Seeley wrote: > For Solr 6.4, we've managed to circumvent this for filter queries and > other contexts where scoring isn't needed. > http://yonik.com/solr-6-4/ "More efficient filter queries" Nice! If the filter looks like the following (because q.op=AND), does it

Parallelizing post filter for better performance

2017-03-17 Thread Sundeep T
Hello, Is there a way to execute the post filter in a parallel mode so that multiple query results can be filtered in parallel? Right now, in our code, the post filter is becoming kind of bottleneck as we had to do some post processing on every returned result, and it runs serially in a single

Re: Data Import

2017-03-17 Thread Mike Thomsen
If Solr is down, then adding through SolrJ would fail as well. Kafka's new API has some great features for this sort of thing. The new client API is designed to be run in a long-running loop where you poll for new messages with a certain amount of defined timeout (ex: consumer.poll(1000) for 1s)
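
The real Kafka consumer API isn't reproduced here; this only sketches the shape of the poll-with-timeout loop Mike describes, using a BlockingQueue as a stand-in for the topic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollLoop {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the Kafka topic; the real consumer.poll(1000) call is
        // replaced here by a BlockingQueue poll with the same timeout shape.
        BlockingQueue<String> topic = new ArrayBlockingQueue<>(16);
        topic.put("doc1");
        topic.put("doc2");

        List<String> indexed = new ArrayList<>();
        // Long-running loop (bounded here for the sketch): poll with a
        // timeout; on timeout, just loop again instead of failing.
        for (int i = 0; i < 3; i++) {
            String msg = topic.poll(100, TimeUnit.MILLISECONDS);
            if (msg != null) {
                indexed.add(msg); // in real code: send to Solr, then commit the offset
            }
        }
        System.out.println(indexed);
    }
}
```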

Re: Data Import

2017-03-17 Thread OTH
Are Kafka and SQS interchangeable? (The latter does not seem to be free.) @Wunder: I'm assuming, that updating to Solr would fail if Solr is unavailable not just if posting via say a DB trigger, but probably also if trying to post through SolrJ? (Which is what I'm using for now.) So, even if

RE: Data Import

2017-03-17 Thread Liu, Daphne
No, I use the free version. I have the driver from someone else. I can share it if you want to use Cassandra. They modified it for me since the free JDBC driver I found would time out when the document is greater than 16 MB. Kind regards, Daphne Liu BI Architect - Matrix SCM CEVA Logistics

Re: Data Import

2017-03-17 Thread vishal jain
Streaming the data through kafka would be a good option if near real time data indexing is the key requirement. In our application the RDBMS data is populated by an ETL job periodically so we don't need real time data indexing for now. Cheers, Vishal On Fri, Mar 17, 2017 at 10:30 PM, Erick

Re: SOLR Data Locality

2017-03-17 Thread Imad Qureshi
Hi Mike I understand that but unfortunately that's not an option right now. We already have 16 TB of index in HDFS. So let me rephrase this question. How important is data locality for SOLR. Is performance impacted if SOLR data is on a remote node? Thanks Imad > On Mar 17, 2017, at 12:02

Re: Data Import

2017-03-17 Thread Walter Underwood
That fails if Solr is not available. To avoid dropping updates, you need some kind of persistent queue. We use Amazon SQS for our incremental updates. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 17, 2017, at 10:09 AM, OTH

Re: Data Import

2017-03-17 Thread vishal jain
Hi Daphne, Are you using DSE? Thanks & Regards, Vishal On Fri, Mar 17, 2017 at 7:40 PM, Liu, Daphne wrote: > I just want to share my recent project. I have successfully sent all our > EDI documents to Cassandra 3.7 clusters using Solr 6.3 Data Import JDBC >

Re: Data Import

2017-03-17 Thread OTH
Could the database trigger not just post the change to solr? On Fri, Mar 17, 2017 at 10:00 PM, Erick Erickson wrote: > Or set a trigger on your RDBMS's main table to put the relevant > information in a different table (call it EVENTS) and have your SolrJ > consult the

Re: SOLR Data Locality

2017-03-17 Thread Mike Thomsen
I've only ever used the HDFS support with Cloudera's build, but my experience turned me off using HDFS. I'd much rather use the native file system than HDFS. On Tue, Mar 14, 2017 at 10:19 AM, Muhammad Imad Qureshi < imadgr...@yahoo.com.invalid> wrote: > We have a 30 node Hadoop cluster and each

Re: Data Import

2017-03-17 Thread Erick Erickson
Or set a trigger on your RDBMS's main table to put the relevant information in a different table (call it EVENTS) and have your SolrJ consult the EVENTS table periodically. Essentially you're using the EVENTS table as a queue where the trigger is the producer and the SolrJ program is the consumer.

Re: Data Import

2017-03-17 Thread vishal jain
Thanks to all of you for the valuable inputs. Being on J2ee platform I also felt using solrJ in a multi threaded environment would be a better choice to index RDBMS data into SolrCloud. I will try with a scheduler triggered micro service to do the job using SolrJ. Regards, Vishal On Fri, Mar 17,
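
A scheduler-driven SolrJ indexer typically sends documents in fixed-size batches; the batching itself is plain Java (batch size and element types are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    // Split a list of documents into fixed-size batches, the usual shape
    // for feeding SolrJ's client.add(batch) without holding everything
    // in memory at once.
    static <T> List<List<T>> batches(List<T> docs, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += size) {
            out.add(new ArrayList<>(docs.subList(i, Math.min(i + size, docs.size()))));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 10; i++) ids.add(i);
        // 10 documents in batches of 4 -> 3 batches (4 + 4 + 2)
        System.out.println(batches(ids, 4).size());
    }
}
```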

Re: Data Import

2017-03-17 Thread Alexandre Rafalovitch
One assumes by hooking into the same code that updates the RDBMS, as opposed to reverse engineering the changes from looking at the DB content. This would be especially the case for Delete changes. Regards, Alex. http://www.solr-start.com/ - Resources for Solr users, new and experienced

Re: Data Import

2017-03-17 Thread OTH
> > Also, SolrJ is good when you want your RDBMS updates made immediately > available in Solr. How can SolrJ be used to make RDBMS updates immediately available? Thanks On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar wrote: > Hi Vishal, > > As per my experience DIH is

Enhanced output of SearchComponent not visible in SolrCloud

2017-03-17 Thread Markus, Sascha
Hi, I created a search component which enriches the response for a query. So my json result looks like { "responseHeader":{...}, "response":{"numFound":116652,"start":0,"maxScore":1.0,"docs":...}, "facet_counts":{...}, "facets":{...}, "expand.entities":{... } } I did this using

Re: Grouping and result pagination

2017-03-17 Thread Shawn Heisey
On 3/17/2017 9:07 AM, Erick Erickson wrote: > I think the answer is that you have to co-locate the docs with the > same value you're grouping by on the same shard whether in SolrCloud > or not... > > Hmmm: from: >

Re: Grouping and result pagination

2017-03-17 Thread Erick Erickson
I think the answer is that you have to co-locate the docs with the same value you're grouping by on the same shard whether in SolrCloud or not... Hmmm: from: https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats "group.ngroups and

Re: Alphanumeric sort with alphabets first

2017-03-17 Thread Erick Erickson
I would back up further and say that 2500 fields is too much from the start. Why do you need this many fields? And you say you can sort on any of them... for a corpus of any decent size this is going to chew up memory like crazy. Admittedly OS memory if you use docValues but still memory. That
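
For completeness, a sort-oriented field with docValues, as Erick mentions, is declared in the schema roughly like this (field and type names are illustrative):

```xml
<!-- docValues keeps sort/facet data in column-oriented form, memory-mapped
     by the OS rather than held on the Java heap -->
<field name="price_sort" type="tdouble" indexed="false" stored="false" docValues="true"/>
```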

Re: fq performance

2017-03-17 Thread Erick Erickson
And to chime in. bq: It contains information about who have access to the documents, like field as (U1_s:true). I wanted to make explicit the implications of Michael's response. You are talking about different _fields_ per user or group, i.e. Don't do this, it's horribly wasteful. Instead as

Grouping and result pagination

2017-03-17 Thread Shawn Heisey
We use pagination (start/rows) frequently with our queries. Nothing unusual there. Now we have need to use grouping with a request like this, for a set-mode search, where only one document from each set is returned:
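
Such a request might combine grouping with the usual paging parameters; a sketch with illustrative field names:

```text
# One document per set, paged ten groups at a time:
q=foo&group=true&group.field=set_id&group.limit=1&start=0&rows=10
```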

Re: Exact match works only for some of the strings

2017-03-17 Thread Mikhail Khludnev
Hello Gintas, From the first letter I've got that you use a colon to separate field name and text. But here it's =, which is never advised in Lucene syntax. On Fri, Mar 17, 2017 at 2:37 PM, Gintautas Sulskus < gintautas.suls...@gmail.com> wrote: > Hi, > > Thank you for your replies. > Sorry,

Re: managing active/passive cores in Solr and Haystack

2017-03-17 Thread Shawn Heisey
On 3/15/2017 7:55 AM, serwah sabetghadam wrote: > Thanks Erick for the fast answer:) > > I knew about sharding, just as far as I know it will work on different > servers. > I wonder if it is possible to do sth like sharding as you mentioned but on > a single standalone Solr? > Can I use the

Re: fq performance

2017-03-17 Thread Yonik Seeley
On Fri, Mar 17, 2017 at 9:09 AM, Shawn Heisey wrote: [...] > Lucene has a global configuration called "maxBooleanClauses" which > defaults to 1024. For Solr 6.4, we've managed to circumvent this for filter queries and other contexts where scoring isn't needed.

RE: Data Import

2017-03-17 Thread Liu, Daphne
I just want to share my recent project. I have successfully sent all our EDI documents to Cassandra 3.7 clusters using Solr 6.3 Data Import JDBC Cassandra connector indexing our documents. Since Cassandra is so fast for writing, compression rate is around 13% and all my documents can be kept in

Re: Data Import

2017-03-17 Thread Alexandre Rafalovitch
I feel DIH is much better for prototyping, even though people do use it in production. If you do want to use DIH, you may benefit from reviewing the DIH-DB example I am currently rewriting in https://issues.apache.org/jira/browse/SOLR-10312 (may need to change luceneMatchVersion in solrconfig.xml

Re: Get handler not working

2017-03-17 Thread Chris Ulicny
I didn't realize extra parameters were ignored on collection creation. I believe I have all of the trace log from the get request included in the attached document. The collection used was setup as CollectionOne previously. One instance in cloud mode with 2 shards with router.field=iqroutingkey.

Re: fq performance

2017-03-17 Thread Shawn Heisey
On 3/17/2017 12:46 AM, Ganesh M wrote: > For how many ORs can solr give the results in less than one second? Can > I pass 100s of OR conditions in the solr query? Will that affect the > performance? This is a question that's impossible to answer. The number will vary depending on the nature of

Re: Data Import

2017-03-17 Thread Shawn Heisey
On 3/17/2017 3:04 AM, vishal jain wrote: > I am new to Solr and am trying to move data from my RDBMS to Solr. I know the > available options are: > 1) Post Tool > 2) DIH > 3) SolrJ (as ours is a J2EE application). > > I want to know what is the recommended way for Data import in production >

Unified highlighter and complexphrase

2017-03-17 Thread Bjarke Buur Mortensen
Hi list, Given the text: "Kontraktsproget vil være dansk og arbejdssproget kan være dansk, svensk, norsk og engelsk" and the query: {!complexphrase df=content_da}("sve* no*") the unified highlighter (hl.method=unified) does not return any highlights. For reference, the original highlighter returns
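
For anyone reproducing this, the request might look like the following (query and field name taken from the mail; the other parameters are assumptions):

```text
# Switching hl.method between "unified" and "original" selects the highlighter:
q={!complexphrase df=content_da}"sve* no*"&hl=true&hl.fl=content_da&hl.method=unified
```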

Re: Exact match works only for some of the strings

2017-03-17 Thread Gintautas Sulskus
Hi, Thank you for your replies. Sorry, forgot to specify, I am using Solr 4.10.3 (from Cloudera CDH 5.9.0). When I search for name:Guardian I can see both "Guardian EU-referendum" and "Guardian US" in the result set. The debugQuery results for both queries are identical

SolrCloud with Haystack

2017-03-17 Thread serwah
Hi there, Is there anyone who has used SolrCloud communicating with Django Haystack? When I have to use Django Haystack plus distributed search, the question is whether SolrCloud could be a solution. I ask this as I have not found good sources on that. Best, Serwah

Re: Data Import

2017-03-17 Thread Sujay Bawaskar
Hi Vishal, As per my experience DIH is the best for RDBMS-to-Solr indexing. DIH with caching has the best performance. DIH nested entities allow you to define simple queries. Also, SolrJ is good when you want your RDBMS updates made immediately available in Solr. DIH full import can be used for index
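
A minimal DIH configuration of the kind Sujay describes, with a cached nested entity; driver, connection details, table names, and queries are all placeholders:

```xml
<!-- data-config.xml sketch: one parent entity plus a cached child entity -->
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/db"
              user="user" password="pass"/>
  <document>
    <entity name="item" query="SELECT id, name FROM item">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <!-- caching loads the child table once instead of one query per parent row -->
      <entity name="tag" query="SELECT item_id, tag FROM tag"
              cacheImpl="SortedMapBackedCache"
              cacheKey="item_id" cacheLookup="item.id">
        <field column="tag" name="tags"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```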

Data Import

2017-03-17 Thread vishal jain
Hi, I am new to Solr and am trying to move data from my RDBMS to Solr. I know the available options are: 1) Post Tool 2) DIH 3) SolrJ (as ours is a J2EE application). I want to know what is the recommended way for Data import in production environment. Will sending data via SolrJ in batches be

Solr data Import

2017-03-17 Thread vishal jain
Hi, I am new to Solr and am trying to move data from my RDBMS to Solr. I know the available options are: 1) Post Tool 2) DIH 3) SolrJ (as ours is a J2EE application). I want to know what is the recommended way for Data import in production environment. Will sending data via SolrJ in batches be

Re: fq performance

2017-03-17 Thread Michael Kuhlmann
Hi Ganesh, you might want to use something like this: fq=access_control:(g1 g2 g5 g99 ...) Then it's only one fq filter per request. Internally it's like an OR condition, but in a more condensed form. I have already used this with up to 500 values without major performance degradation (but
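
Building that condensed fq clause from a user's group memberships is a one-liner; a sketch (field and group names are illustrative):

```java
import java.util.List;
import java.util.StringJoiner;

public class FqBuilder {
    // Build a single fq clause from the user's group memberships,
    // in the condensed OR form: access_control:(g1 g2 g5 ...)
    static String accessFilter(List<String> groups) {
        StringJoiner j = new StringJoiner(" ", "access_control:(", ")");
        groups.forEach(j::add);
        return j.toString();
    }

    public static void main(String[] args) {
        System.out.println(accessFilter(List.of("g1", "g2", "g5", "g99")));
    }
}
```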

Re: question about function query

2017-03-17 Thread Bernd Fehling
Hi Mikhail, thanks for your help. After some more reading and testing I found the solution. Just in case someone else needs it here are the results. original query: q=collection:ftmuenster+AND+-description:*=* --> numFound="1877" frange query:

Re: fq performance

2017-03-17 Thread Ganesh M
Hi Shawn / Michael, Thanks for your replies and I guess you have got my scenarios exactly right. Initially my document contains information about who have access to the documents, like field as (U1_s:true). if 100 users can access a document, we will have 100 such fields for each user. So when