(Newbie Help!) Seeking guidance in regards to Solr's suggestor and others

2016-12-12 Thread KV Medmeme
Hi Friends, I'm new to Solr and have been working on it for the past 2-3 months, trying to really get my feet wet with it so that I can transition the search engine at my current job to Solr. (Eww, Sphinx, haha.) Anyway, I need some help. I was running around the net getting my suggester working and

Re: Does sharding improve or degrade performance?

2016-12-12 Thread Shawn Heisey
On 12/12/2016 1:14 PM, Piyush Kunal wrote: > We did the following change: > > 1. Previously we had 1 shard and 32 replicas for 1.2million documents of > size 5 GB. > 2. We changed it to 4 shards and 8 replicas for 1.2 million documents of > size 5GB How many machines and shards per machine were

Re: Does sharding improve or degrade performance?

2016-12-12 Thread Erick Erickson
Sharding adds inevitable overhead. In particular, each request, rather than being serviced on a single replica, has to send out a first request to each replica, get the ID and sort criteria back, then send out a second request to get the actual docs. Especially if you're asking for a lot of rows
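The two-request pattern described above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not Solr's actual code: the function name `two_phase_search`, the dict-based "shards", and the integer sort keys are all hypothetical.

```python
import heapq

def two_phase_search(shards, rows):
    """Simulate a two-phase distributed search.

    shards: list of dicts mapping doc id -> (sort_key, stored_fields).
    This data layout is hypothetical and chosen only for illustration.
    """
    # Phase 1: every shard returns just (sort_key, id) for its local top `rows`.
    candidates = []
    for shard in shards:
        local_top = heapq.nlargest(
            rows, ((key, doc_id) for doc_id, (key, _fields) in shard.items())
        )
        candidates.extend(local_top)

    # The coordinator merges the per-shard lists into the global top `rows`.
    global_top = heapq.nlargest(rows, candidates)

    # Phase 2: fetch stored fields only for the winners, from the owning shard.
    results = []
    for _key, doc_id in global_top:
        for shard in shards:
            if doc_id in shard:
                results.append((doc_id, shard[doc_id][1]))
                break
    return results, len(candidates)


shards = [
    {"a": (3, {"title": "A"}), "b": (1, {"title": "B"})},
    {"c": (2, {"title": "C"}), "d": (5, {"title": "D"})},
]
docs, merged = two_phase_search(shards, rows=2)
print(docs, merged)
```

In this sketch the coordinator handles `rows` candidates per shard before fetching any documents, which is one way to see why a large `rows` value makes the overhead more noticeable as the shard count grows.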

Re: How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Erick Erickson
bq: We are indexing with autocommit at 30 minutes OK, check the size of your tlogs. What this means is that all the updates accumulate for 30 minutes in a single tlog. That tlog will be closed when autocommit happens and a new one opened for the next 30 minutes. The first tlog won't be purged

Re: OOMs in Solr

2016-12-12 Thread Erick Erickson
bq: ...so I wonder if reducing the heap is going to help or it won’t matter that much... Well, if you're hitting OOM errors then you have no _choice_ but to reduce the heap. Or increase the memory. And you don't have much physical memory to grow into. Longer term, reducing the JVM size (assuming

Re: OOMs in Solr

2016-12-12 Thread Alfonso Muñoz-Pomer Fuentes
The post you linked to strongly advises buying SSDs. I got in touch with the systems department in my organization and it turns out that our VM storage is SSD-backed, so I wonder if reducing the heap is going to help or it won’t matter that much. Of course, there’s nothing

Re: How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Susheel Kumar
One option: purge all documents before the full reindex, so that you don't need to run optimize, unless you need the data to serve queries at the same time. I think you are running out of space because your 43 million documents may be consuming 30% of total disk space, and when you re-index the total

Re: How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Michael Joyner
We are having an issue with running out of space when trying to do a full re-index. We are indexing with autocommit at 30 minutes. We have it set to only optimize at the end of an indexing cycle. On 12/12/2016 02:43 PM, Erick Erickson wrote: First off, optimize is actually rarely necessary.

Re: Does sharding improve or degrade performance?

2016-12-12 Thread Piyush Kunal
All our shards and replicas reside on different machines with 16GB RAM and 4 cores. On Tue, Dec 13, 2016 at 1:44 AM, Piyush Kunal wrote: > We did the following change: > > 1. Previously we had 1 shard and 32 replicas for 1.2million documents of > size 5 GB. > 2. We

Does sharding improve or degrade performance?

2016-12-12 Thread Piyush Kunal
We made the following change: 1. Previously we had 1 shard and 32 replicas for 1.2 million documents of size 5 GB. 2. We changed it to 4 shards and 8 replicas for 1.2 million documents of size 5 GB. We have a combined load of around 20k RPM for Solr. But unfortunately we saw a degradation in performance

Re: How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Susheel Kumar
How much difference is there between the two parameters below on your Solr stats screen? For example, in our case we have very frequent updates, which results in max docs = 2 x num docs over time, and in that case I have seen optimization help query performance. Unless you have huge
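The comparison suggested above can be expressed as a simple ratio. In Lucene, max docs still counts deleted documents that have not yet been merged away, while num docs counts only live documents. The helper name `deleted_fraction` is hypothetical, and the two numbers are assumed to be read off the stats screen by hand.

```python
def deleted_fraction(max_doc, num_docs):
    """Fraction of index entries that are deleted but not yet merged away."""
    if max_doc <= 0:
        return 0.0  # empty index: nothing to report
    return (max_doc - num_docs) / max_doc


# The "max docs = 2 x num docs" case from the message above:
print(deleted_fraction(max_doc=2_000_000, num_docs=1_000_000))
```

A value near zero suggests there is little for an optimize to reclaim. What ratio justifies an optimize is a judgment call that this sketch does not try to settle.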

Re: regex-urlfilter help

2016-12-12 Thread KRIS MUSSHORN
sorry my mistake.. sent to wrong list.   - Original Message - From: "Shawn Heisey" To: solr-user@lucene.apache.org Sent: Monday, December 12, 2016 2:36:26 PM Subject: Re: regex-urlfilter help On 12/12/2016 12:19 PM, KRIS MUSSHORN wrote: > I'm using nutch

error diagnosis help.

2016-12-12 Thread KRIS MUSSHORN
I've scoured my Nutch and Solr config files and I can't find any cause. Suggestions? Monday, December 12, 2016 2:37:13 PM ERROR null RequestHandlerBase org.apache.solr.common.SolrException: Unexpected character '&' (code 38) in epilog; expected '<'

Setting Shard Count at Initial Startup of SolrCloud

2016-12-12 Thread Furkan KAMACI
Hi, I have an external Zookeeper. I don't wanna use SolrCloud as test. I upload confs to Zookeeper: server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir server/solr/my_collection/conf -confname my_collection Start servers: Server 1: bin/solr start -cloud -d

Re: How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Erick Erickson
First off, optimize is actually rarely necessary. I wouldn't bother unless you have measurements to prove that it's desirable. I would _certainly_ not call optimize every 10M docs. If you must call it at all call it exactly once when indexing is complete. But see above. As far as the commit, I'd

Re: OOMs in Solr

2016-12-12 Thread Erick Erickson
The biggest bang for the buck is _probably_ docValues for the fields you facet on. If that's the culprit, you can also reduce your JVM heap considerably. As Toke says, leaving this little memory for the OS is bad. Here's the writeup on why:

Re: regex-urlfilter help

2016-12-12 Thread Shawn Heisey
On 12/12/2016 12:19 PM, KRIS MUSSHORN wrote: > I'm using nutch 1.12 and Solr 5.4.1. > > Crawling a website and indexing into nutch. > > AFAIK the regex-urlfilter.txt file will cause content to not be crawled.. > > what if I have > https:///inside/default.cfm as my seed url... >

RE: Unicode Character Problem

2016-12-12 Thread Allison, Timothy B.
> I don't see any weird character when I manual copy it to any text editor. That's a good diagnostic step, but there's a chance that Adobe (or your viewer) got it right, and Tika or PDFBox isn't getting it right. If you run tika-app on the file [0], do you get the same problem? See our stub

regex-urlfilter help

2016-12-12 Thread KRIS MUSSHORN
I'm using nutch 1.12 and Solr 5.4.1.   Crawling a website and indexing into nutch.   AFAIK the regex-urlfilter.txt file will cause content to not be crawled..   what if I have https:///inside/default.cfm  as my seed url... I want the links on this page to be crawled and indexed but I

Re: Copying Tokens

2016-12-12 Thread Alexandre Rafalovitch
Multilingual is - hard - fun. What you are trying to do is probably not super-doable as copyField copies original text representation. You don't want to copy tokens anyway, as your query-time analysis chains are different too. I would recommend looking at the books first. Mine talks about

Re: OOMs in Solr

2016-12-12 Thread Susheel Kumar
Double-check whether your queries are running into deep pagination (q=*:*...=). This is something I recently experienced, and it was the only cause of OOM. You may have the GC logs from when the OOM happened; plotting them in GC Viewer may give insight into how gradually your heap filled up and ran into OOM.
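A rough way to see why deep pagination strains the heap: to serve a page starting at offset `start`, each index involved has to collect its top `start + rows` entries, and in a sharded setup the coordinator merges that many from every shard. The function name `entries_to_collect` and the example numbers are hypothetical; this is back-of-the-envelope arithmetic, not a measurement.

```python
def entries_to_collect(num_shards, start, rows):
    """Return (entries per shard, total entries merged by the coordinator)."""
    per_shard = start + rows          # each shard must rank this many entries
    return per_shard, num_shards * per_shard


# Page 1 versus a page one million rows deep, on a single index:
print(entries_to_collect(num_shards=1, start=0, rows=10))
print(entries_to_collect(num_shards=1, start=1_000_000, rows=10))
```

The work grows with the offset rather than with the page size, so a request for 10 rows at a very large offset is far more expensive than it looks.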

How to check optimized or disk free status via solrj for a particular collection?

2016-12-12 Thread Michael Joyner
Halp! I need to reindex over 43 million documents. When optimized, the collection is currently < 30% of disk space. We tried it over this weekend and it ran out of space during the reindexing. I'm thinking the best solution for what we are trying to do is to call commit/optimize every

Re: Distribution Packages

2016-12-12 Thread Pushkar Raste
We use jdeb maven plugin to build the debian packages, we use it for Solr as well On Dec 12, 2016 9:03 AM, "Adjamilton Junior" wrote: > Hi folks, > > I am new here and I wonder to know why there's no Solr 6.x packages for > ubuntu/debian? > > Thank you. > > Adjamilton Junior >

Map Highlight Field into Another Field

2016-12-12 Thread Furkan KAMACI
Hi, One can use * in highlight fields, like: content_* So, content_de and content_en can match it. However, the response will include such fields: "highlighting":{ "my query":{ "content_de": "content_en": ... Is it possible to map matched fields into a predefined field?

Copying Tokens

2016-12-12 Thread Furkan KAMACI
Hi, I'm testing language identification. I've enabled it in solrconfig.xml. Here are my dynamic fields in the schema: So, after indexing, I see that fields are generated: content_en content_ru I copy my fields into a text field: Here is my text field: I want to let users only search on only

Re: Unicode Character Problem

2016-12-12 Thread Furkan KAMACI
Hi Ahmet, I don't see any weird character when I manual copy it to any text editor. On Sat, Dec 10, 2016 at 6:19 PM, Ahmet Arslan wrote: > Hi Furkan, > > I am pretty sure this is a pdf extraction thing. > Turkish characters caused us trouble in the past during

Re: OOMs in Solr

2016-12-12 Thread Alfonso Muñoz-Pomer Fuentes
Thanks again. I’m learning more about Solr in this thread than in my previous months reading about it! Moving to Solr Cloud is a possibility we’ve discussed and I guess it will eventually happen, as the index will grow no matter what. I’ve already lowered filterCache from 512 to 64 and I’m

Re: empty result set for a sort query

2016-12-12 Thread Yonik Seeley
Ah, 2-phase distributed search is the most likely answer (and currently classified as more of a limitation than a bug)... Phase 1 collects the top N ids from each shard (and merges them to find the global top N) Phase 2 retrieves the stored fields for the global top N If any of the ids have been

Re: Distribution Packages

2016-12-12 Thread Shawn Heisey
On 12/12/2016 7:03 AM, Adjamilton Junior wrote: > I am new here and I wonder to know why there's no Solr 6.x packages > for ubuntu/debian? There are no official Solr packages for ANY operating system. We have binary releases that include an installation script for UNIX-like operating systems

Re: OOMs in Solr

2016-12-12 Thread Shawn Heisey
On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote: > I’m writing because in our web application we’re using Solr 5.1.0 and > currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are > dedicated to Solr and nothing else is running there). We have four > cores, that are this

RE: OOMs in Solr

2016-12-12 Thread Prateek Jain J
You can also try the following: 1. Reduce the thread stack size using the -Xss flag. 2. Try to use sharding instead of a single large instance (if possible). 3. Reduce the cache size in solrconfig.xml. Regards, Prateek Jain -Original Message- From: Alfonso Muñoz-Pomer Fuentes

Distribution Packages

2016-12-12 Thread Adjamilton Junior
Hi folks, I am new here and I wonder to know why there's no Solr 6.x packages for ubuntu/debian? Thank you. Adjamilton Junior

Re: Antw: Re: Solr 6.2.1 :: Collection Aliasing

2016-12-12 Thread Shawn Heisey
On 12/12/2016 3:56 AM, Rainer Gnan wrote: > Do the query this way: > http://hostname.de:8983/solr/live/select?indent=on=*:* > > I have no idea whether the behavior you are seeing is correct or wrong, > but if you send the traffic directly to the alias it should work correctly. > > It might turn

Re: OOMs in Solr

2016-12-12 Thread Alfonso Muñoz-Pomer Fuentes
I wasn’t aware of docValues and filterCache policies. We’ll try to fine-tune it and see if it helps. Thanks so much for the info! On 12/12/2016 12:13, Toke Eskildsen wrote: On Mon, 2016-12-12 at 10:13 +, Alfonso Muñoz-Pomer Fuentes wrote: I’m writing because in our web application we’re

Re: OOMs in Solr

2016-12-12 Thread Alfonso Muñoz-Pomer Fuentes
Thanks for the reply. Here’s some more info... Disk space: 39 GB / 148 GB (used / available) Deployment model: Single instance JVM version: 1.7.0_04 Number of queries: avgRequestsPerSecond: 0.5478469104833896 GC algorithm: None specified, so I guess it defaults to the parallel GC. On

Traverse over response docs in SearchComponent impl.

2016-12-12 Thread Markus Jelsma
Hello - i need to traverse over the list of response docs in a SearchComponent, get all values for a specific field, and then conditionally add a new field. The request handler is configured as follows: dostuff I can see that Solr calls the component's process() method, but from

Re: OOMs in Solr

2016-12-12 Thread Toke Eskildsen
On Mon, 2016-12-12 at 10:13 +, Alfonso Muñoz-Pomer Fuentes wrote: > I’m writing because in our web application we’re using Solr 5.1.0 > and currently we’re hosting it on a VM with 32 GB of RAM (of which 30 > are dedicated to Solr and nothing else is running there). This leaves very little

Antw: Re: Solr 6.2.1 :: Collection Aliasing

2016-12-12 Thread Rainer Gnan
Hi Shawn, your workaround works and is exactly what I was looking for. Did you find this solution via trial and error or can you point me to the appropriate section in the APRGuide? Thanks a lot! Rainer Rainer Gnan Bayerische Staatsbibliothek

Re: Solr 6.2.1 :: Collection Aliasing

2016-12-12 Thread Shawn Heisey
On 12/12/2016 3:32 AM, Rainer Gnan wrote: > Hi, > > actually I am trying to use Collection Aliasing in a SolrCloud-environment. > > My set up is as follows: > > 1. Collection_1 (alias "live") linked with config_1 > 2. Collection_2 (alias "test") linked with config_2 > 3. Collection_1 is different

Re: Data Import Handler - maximum?

2016-12-12 Thread Shawn Heisey
On 12/11/2016 8:00 PM, Brian Narsi wrote: > We are using Solr 5.1.0 and DIH to build index. > > We are using DIH with clean=true and commit=true and optimize=true. > Currently retrieving about 10.5 million records in about an hour. > > I will like to find from other member's experiences as to how

Solr 6.2.1 :: Collection Aliasing

2016-12-12 Thread Rainer Gnan
Hi, actually I am trying to use Collection Aliasing in a SolrCloud-environment. My set up is as follows: 1. Collection_1 (alias "live") linked with config_1 2. Collection_2 (alias "test") linked with config_2 3. Collection_1 is different to Collection _2 4. config_1 is different to config_2

RE: OOMs in Solr

2016-12-12 Thread Prateek Jain J
Please provide some information, such as: disk space available; deployment model of Solr (SolrCloud or single instance); JVM version; number and type of queries; GC algorithm used; etc. Regards, Prateek Jain -Original Message- From: Alfonso Muñoz-Pomer Fuentes

Re: empty result set for a sort query

2016-12-12 Thread moscovig
I am not sure that it's related, but with local tests we got to a scenario where we add a doc that somehow has an *empty key*, and then, when querying with sort over creationTime with rows=1, we get an empty result set. When specifying the recent doc's shard with shards=shard2, we do have results. I

Re: empty result set for a sort query

2016-12-12 Thread moscovig
Hi Thanks for the reply. We are using select?q=*:*=creationTimestamp+desc=1 So as you said we should have got results. Another piece of information is that we commit within 300ms when inserting the "sanity" doc. And again, we delete by query. We don't have any custom plugin/query

Re: Data Import Handler - maximum?

2016-12-12 Thread Bernd Fehling
Am 12.12.2016 um 04:00 schrieb Brian Narsi: > We are using Solr 5.1.0 and DIH to build index. > > We are using DIH with clean=true and commit=true and optimize=true. > Currently retrieving about 10.5 million records in about an hour. > > I will like to find from other member's experiences as to

RE: Problem with Cross Data Center Replication

2016-12-12 Thread WILLMES Gero (SAFRAN IDENTITY AND SECURITY)
Hi Erick, thanks for the hint. Indeed, I just forgot to paste the section into the email. It was configured just the same way as you wrote. Do you have any idea what else could be the cause of the error? Best regards, Gero -Original Message- From: Erick Erickson