TimeAllowed bug
Weird fq caching bug when using timeAllowed:

1. Find a pwid (in this case YLGVQ).
2. Run a query with an fq on the pwid and timeAllowed=1:
   http://hgsolr2devsl.healthgrades.com:8983/solr/providersearch/select/?q=*:*&wt=json&fl=pwid&fq=pwid:YLGVQ&timeAllowed=1
3. Ensure #2 returns 0 results.
4. Rerun the query without the timeAllowed param:
   http://hgsolr2devsl.healthgrades.com:8983/solr/providersearch/select/?q=*:*&wt=json&fl=pwid&fq=pwid:YLGVQ
5. Note that after removing the timeAllowed parameter the query is still returning 0 results.

Solr seems to be caching the fq when the timeAllowed parameter is present.

Bill Bell
Sent from mobile
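A minimal SolrJ sketch of the reproduction above. The host, core name, and pwid value come from the report; everything else (a plain Solr 4.x core with default handlers) is assumed:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class TimeAllowedRepro {
      public static void main(String[] args) throws SolrServerException {
          HttpSolrServer server = new HttpSolrServer(
              "http://hgsolr2devsl.healthgrades.com:8983/solr/providersearch");

          // Step 2: fq on the pwid with timeAllowed=1 (likely to time out)
          SolrQuery withTimeAllowed = new SolrQuery("*:*");
          withTimeAllowed.addFilterQuery("pwid:YLGVQ");
          withTimeAllowed.setFields("pwid");
          withTimeAllowed.set("timeAllowed", 1);
          long hits1 = server.query(withTimeAllowed).getResults().getNumFound();

          // Step 4: same fq without timeAllowed -- should find the doc,
          // but returns 0 if the partial fq result was cached
          SolrQuery noTimeAllowed = new SolrQuery("*:*");
          noTimeAllowed.addFilterQuery("pwid:YLGVQ");
          noTimeAllowed.setFields("pwid");
          long hits2 = server.query(noTimeAllowed).getResults().getNumFound();

          System.out.println("with timeAllowed=1: " + hits1 + ", without: " + hits2);
      }
  }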
Re: Solr performance is slow with just 1GB of data indexed
We use 8gb to 10gb for those size indexes all the time.

Bill Bell
Sent from mobile

On Aug 23, 2015, at 8:52 AM, Shawn Heisey apa...@elyograg.org wrote:

On 8/22/2015 10:28 PM, Zheng Lin Edwin Yeo wrote:
Hi Shawn, Yes, I've increased the heap size to 4GB already, and I'm using a machine with 32GB RAM. Is it recommended to further increase the heap size to like 8GB or 16GB?

Probably not, but I know nothing about your data. How many Solr docs were created by indexing 1GB of data? How much disk space is used by your Solr index(es)? I know very little about clustering, but it looks like you've gotten a reply from Toke, who knows a lot more about that part of the code than I do.

Thanks,
Shawn
Re: solr multicore vs sharding vs 1 big collection
Yeah, a separate collection by month or year is good and can really help in this case.

Bill Bell
Sent from mobile

On Aug 2, 2015, at 5:29 PM, Jay Potharaju jspothar...@gmail.com wrote:

Shawn, thanks for the feedback. I agree that increasing the timeout might alleviate the timeout issue. The main problem with increasing the timeout is the detrimental effect it will have on the user experience, therefore I can't increase it. I have looked at the queries that threw errors; the next time I try them, everything seems to work fine. Not sure how to reproduce the error. My concern with increasing the memory to 32GB is what happens when the index size grows over the next few months. One of the other solutions I have been thinking about is to rebuild the index (weekly), create a new collection, and use it. Are there any good references for doing that?
Thanks
Jay

On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey apa...@elyograg.org wrote:

On 8/2/2015 8:29 AM, Jay Potharaju wrote:
The document contains around 30 fields and has stored set to true for almost 15 of them. And these stored fields are queried and updated all the time. You will notice that the deleted documents are almost 30% of the docs. And it has stayed around that percentage and has not come down. I did try optimize but that was disruptive as it caused search errors. I have been playing with merge factor to see if that helps with deleted documents or not. It is currently set to 5. The server has 24 GB of memory, out of which memory consumption is around 23 GB normally and the JVM is set to 6 GB. And I have noticed that the available memory on the server goes down to 100 MB at times during a day. All the updates are run through DIH.

Using all available memory is completely normal operation for ANY operating system. If you hold up Windows as an example of one that doesn't ... it lies to you about available memory. All modern operating systems will utilize memory that is not explicitly allocated for the OS disk cache. The disk cache will instantly give up any of the memory it is using for programs that request it. Linux doesn't try to hide the disk cache from you, but older versions of Windows do. In the newer versions of Windows that have the Resource Monitor, you can go there to see the actual memory usage including the cache.

Every day at least once I see the following error, which results in search errors on the front end of the site:

ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.eclipse.jetty.io.EofException

From what I have read these are mainly due to timeouts, and my timeout is set to 30 seconds and I can't set it to a higher number. I was thinking maybe due to high memory usage, sometimes it leads to bad performance/errors.

Although this error can be caused by timeouts, it has a specific meaning. It means that the client disconnected before Solr responded to the request, so when Solr tried to respond (through Jetty), it found a closed TCP connection. Client timeouts need to either be completely removed, or set to a value much longer than any request will take. Five minutes is a good starting value. If all your client timeouts are set to 30 seconds and you are seeing EofExceptions, that means that your requests are taking longer than 30 seconds, and you likely have some performance issues. It's also possible that some of your client timeouts are set a lot shorter than 30 seconds.

My objective is to stop the errors; adding more memory to the server is not a good scaling strategy.
That is why I was thinking maybe there is an issue with the way things are set up and it needs to be revisited.

You're right that adding more memory to the servers is not a good scaling strategy for the general case ... but in this situation, I think it might be prudent. For your index and heap sizes, I would want the company to pay for at least 32GB of RAM. Having said that ... I've seen Solr installs work well with a LOT less memory than the ideal. I don't know that adding more memory is necessary, unless your system (CPU, storage, and memory speeds) is particularly slow. Based on your document count and index size, your documents are quite small, so I think your memory size is probably good -- if the CPU, memory bus, and storage are very fast. If one or more of those subsystems aren't fast, then make up the difference with lots of memory.

Some light reading, where you will learn why I think 32GB is an ideal memory size for your system:
https://wiki.apache.org/solr/SolrPerformanceProblems

It is possible that your 6GB heap is not quite big enough for good performance, or that your GC is not well-tuned. These topics are also discussed on that wiki page. If you increase your heap size, then the likelihood of needing more memory in the system becomes greater, because there will be less memory available for the disk cache.

Thanks,
Shawn

--
Thanks
Jay Potharaju
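Following Shawn's timeout advice above, a sketch of what raising the client timeout to five minutes looks like for a SolrJ 4.x client; the URL and values are illustrative, not from the thread:

  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class ClientTimeouts {
      public static HttpSolrServer buildClient() {
          HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
          server.setConnectionTimeout(5000); // fail fast when connecting
          server.setSoTimeout(300000);       // 5 minutes for the response, per the advice above
          return server;
      }
  }

Other client stacks (load balancers, front-end HTTP libraries) have their own timeouts and need the same treatment, since the EofException is raised by whichever hop closes the connection first.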
Re: Division with Stats Component when Grouping in Solr
It would be cool to be able to set 2 group-by fields with facets:

GROUP BY site_id, keyword

Bill Bell
Sent from mobile

On Jun 13, 2015, at 2:28 PM, Yonik Seeley ysee...@gmail.com wrote:
GROUP BY site_id, keyword
Re: Facet
Ok, clarification: the limit is set to -1, but the average result is 300. The amount of strings stored in the field increased a lot, like 250k to 350k, but the amount coming out is limited by facet.prefix.

Would creating 900 fields be better? Then I could just put the prefix in the field name, like this: proc_ps122. Thoughts? So far I have heard SolrCloud and docValues as viable solutions, and to stay away from enum.

Bill Bell
Sent from mobile

On Apr 5, 2015, at 2:56 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

William Bell billnb...@gmail.com wrote:
Sent: 05 April 2015 06:20
To: solr-user@lucene.apache.org
Subject: Facet

We increased our number of terms (String) in a facet by 50,000.

Do you mean facet.limit=50000?

Now we are getting an error when we facet by this field - so we switched it to facet.method=enum, and now the results come back. However, when we put it into production we literally hit a wall (CPU went to 100% for 16 cores) after about 30 minutes live.

It was strange that enum worked. Internally, the difference between facet.limit=100 and facet.limit=50000 is quite small. The real hits are the fine-counting within SolrCloud and serializing the result in order to deliver it to the client. I thought enum behaved the same as fc with regard to those two.

We tried adding more machines to reduce the CPU, but it did not help.

Sounds like SolrCloud. More machines does not help here; it might even be worse. What happens is that distributed faceting is two-phase, where the second phase is fine-counting. The fine-counting essentially makes all shards perform micro-searches for a large part of the terms returned: your shards are bogged down by tens of thousands of small searches. If you are feeling adventurous, you can try putting http://tokee.github.io/lucene-solr/ on a test-installation (I am the author). It changes the way the fine-counting is done. Depending on your container, you might need to raise the internal limits for GET-communication. Tomcat has a default of 2MB somewhere (sorry, don't remember the details), which is not a lot for 50,000 values.

What are some ideas? We are going to try docValues on the field. Does anyone know if method=fc or method=enum works for docValues? I cannot find any documentation on that.

If docValues are enabled, fc will use them. It does not change anything for enum. But I would argue against enum for anything in the thousands anyway.

We are thinking of splitting the field into 2 fields (fielda, fieldb). At least the number will be less, but not sure if it will help memory?

The killer is the number of terms requested/returned.

The weird thing is for the first 30 minutes things are performing great. Literally at like 10% CPU across 16 cores, not much memory and normal GC.

It might be because you have just been lucky. Take a look at https://twitter.com/anjacks0n/status/509284768035262464 for how different performance can be for different result set sizes.

Originally the facet was method=fc. Is there an issue with enum? We have facet.threads=20 set, and not sure this is wise for enum?

Facet threading does not thread within each field; it just means that multiple fields are processed in parallel.

- Toke Eskildsen
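A sketch of the knobs discussed above as a single SolrJ request; the field name and prefix are hypothetical, and per Toke's note, facet.method=fc will use docValues automatically when the field has them:

  import org.apache.solr.client.solrj.SolrQuery;

  public class FacetKnobs {
      public static SolrQuery build() {
          SolrQuery q = new SolrQuery("*:*");
          q.setFacet(true);
          q.addFacetField("proc_ps");           // hypothetical high-cardinality string field
          q.setFacetLimit(-1);                  // unlimited, as in the report
          q.setFacetMinCount(1);
          q.set("facet.prefix", "122");         // hypothetical prefix filter
          q.set("facet.method", "fc");          // fc uses docValues if the field has them
          q.set("facet.threads", 20);           // parallelism across fields, not within one
          return q;
      }
  }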
Re: ZFS File System for SOLR 3.6 and SOLR 4
Is there an advantage for XFS over ext4 for Solr? Anyone done testing?

Bill Bell
Sent from mobile

On Mar 27, 2015, at 8:14 AM, Shawn Heisey apa...@elyograg.org wrote:

On 3/27/2015 12:30 AM, abhi Abhishek wrote:
i am trying to use ZFS as the filesystem for my Linux environment. are there any performance implications of using any filesystem other than ext-3/ext-4 with SOLR?

That should work with no problem. The only time Solr tends to have problems is if you try to use a network filesystem. As long as it's a local filesystem and it implements everything a program can typically expect from a local filesystem, Solr should work perfectly. Because of the compatibility problems that the license for ZFS has with the GPL, ZFS on Linux is probably not as well tested as other filesystems like ext4, xfs, or btrfs, but I have not heard about any big problems, so it's probably safe.

Thanks,
Shawn
Re: How to boost documents at index time?
Issue a Jira ticket? Did you try debugQuery?

Bill Bell
Sent from mobile

On Mar 28, 2015, at 1:49 AM, CKReddy Bhimavarapu chaitu...@gmail.com wrote:

I want to boost docs at index time. I am doing this using the boost parameter on the doc element, <doc boost="2.0">, but I can't see a direct impact on the doc by using debugQuery. My question is: is there any other way to boost docs at index time and see the reflected changes, i.e. a direct impact?

--
ckreddybh. chaitu...@gmail.com
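For reference, a sketch of the SolrJ 4.x equivalent of <doc boost="2.0"> (the field names and values here are hypothetical):

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class IndexTimeBoost {
      public static void main(String[] args) throws Exception {
          HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "42");
          doc.addField("title", "a boosted title", 3.0f); // per-field index-time boost
          doc.setDocumentBoost(2.0f);                     // whole-document boost, like <doc boost="2.0">
          server.add(doc);
          server.commit();
      }
  }

One likely reason the effect is hard to see: index-time boosts are folded into the field norm, so they require omitNorms=false on the field and show up inside the fieldNorm factor of a debugQuery explanation rather than as a separate line.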
Re: Sort on multivalued attributes
Definitely needed!!

Bill Bell
Sent from mobile

On Feb 9, 2015, at 5:51 AM, Jan Høydahl jan@cominvent.com wrote:

Sure, vote for it. The number of votes does not directly make it get prioritized sooner, so you had better also add a comment to the JIRA; it will raise committers' attention. Even better, of course, is if you are able to help bring the issue forward by submitting patches.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 9 Feb 2015, at 12:15, Flavio Pompermaier pomperma...@okkam.it wrote:

Do I have to vote for it..?

On Mon, Feb 9, 2015 at 11:50 AM, Jan Høydahl jan@cominvent.com wrote:

See https://issues.apache.org/jira/browse/SOLR-2522

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 9 Feb 2015, at 10:30, Flavio Pompermaier pomperma...@okkam.it wrote:

In my use case it could be very helpful because I use the SIREn plugin to index arbitrary JSON-LD, and this plugin automatically indexes all nested attributes as Solr fields. Thus I need, for example, to gather all entries with a certain value of the type attribute, ordered by name (but name could be a multivalued attribute in my use case :( ). I'd like to avoid switching to Elasticsearch just to have this single feature.
Thanks for the support,
Flavio

On Mon, Feb 9, 2015 at 10:02 AM, Anshum Gupta ans...@anshumgupta.net wrote:

Sure, that's correct and makes sense in some use cases. I'll need to check if Solr functions support such a thing.

On Mon, Feb 9, 2015 at 12:47 AM, Flavio Pompermaier pomperma...@okkam.it wrote:

I saw that this is possible in Lucene (https://issues.apache.org/jira/browse/LUCENE-5454) and also in Elasticsearch. Or am I wrong?

On Mon, Feb 9, 2015 at 9:05 AM, Anshum Gupta ans...@anshumgupta.net wrote:

Unless I'm missing something here, sorting on a multi-valued field would be non-deterministic in nature.

On Sun, Feb 8, 2015 at 11:59 PM, Flavio Pompermaier pomperma...@okkam.it wrote:

Hi to all, is there any possibility that in the near future Solr could support sorting on multivalued fields?
Best,
Flavio

--
Anshum Gupta
http://about.me/anshumgupta
Re: Collations are not working fine.
Can you order the collations by highest to lowest hits?

Bill Bell
Sent from mobile

On Feb 9, 2015, at 6:47 AM, Nitin Solanki nitinml...@gmail.com wrote:

I am working on spell checking in Solr. I have implemented suggestions and collations in my spellcheck component. Most of the time collations work fine, but in a few cases they fail.

Working: I tried the query "gone wthh thes wnd". In this, "wnd" doesn't give the suggestion "wind", but the collation comes out right: "gone with the wind", hits = 117.

Not working: But when I tried the query "gone wthh thes wint": in this, "wint" does give the suggestion "wind", but the collation doesn't come out right. Instead of "gone with the wind" it gives "gone with the west", hits = 1.

And I also want to know what "hits" means in collations.
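A sketch of the collation parameters involved, with illustrative values; when spellcheck.collateExtendedResults is on, "hits" in the response is the number of documents each collation would match when run against the index:

  import org.apache.solr.client.solrj.SolrQuery;

  public class CollationParams {
      public static SolrQuery build() {
          SolrQuery q = new SolrQuery("gone wthh thes wint");
          q.set("spellcheck", true);
          q.set("spellcheck.collate", true);
          q.set("spellcheck.maxCollations", 5);             // return several candidates
          q.set("spellcheck.maxCollationTries", 10);        // verify candidates against the index
          q.set("spellcheck.collateExtendedResults", true); // include per-collation "hits"
          return q;
      }
  }

With several verified collations returned, a client can sort them by their hits values itself, which is one hedge against the low-hits collation winning as described above.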
Re: How large is your solr index?
For Solr 5 why don't we switch it to 64 bit??

Bill Bell
Sent from mobile

On Dec 29, 2014, at 1:53 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

And that Lucene index document limit includes deleted and updated documents, so even if your actual document count stays under 2^31-1, deleting and updating documents can push the apparent document count over the limit unless you very aggressively merge segments to expunge deleted documents.

-- Jack Krupansky

On Mon, Dec 29, 2014 at 12:54 PM, Erick Erickson erickerick...@gmail.com wrote:

When you say 2B docs on a single Solr instance, are you talking only one shard? Because if you are, you're very close to the absolute upper limit of a shard; internally the doc id is an int, or 2^31. 2^31 + 1 will cause all sorts of problems. But yeah, your 100B documents are going to use up a lot of servers...
Best,
Erick

On Mon, Dec 29, 2014 at 7:24 AM, Bram Van Dam bram.van...@intix.eu wrote:

Hi folks, I'm trying to get a feel for how large Solr can grow without slowing down too much. We're looking into a use-case with up to 100 billion documents (SolrCloud), and we're a little afraid that we'll end up requiring 100 servers to pull it off. The largest index we currently have is ~2 billion documents in a single Solr instance. Documents are smallish (5k each) and we have ~50 fields in the schema, with an index size of about 2TB. Performance is mostly OK. Cold searchers take a while, but most queries are alright after warming up. I wish I could provide more statistics, but I only have very limited access to the data (...banks...). I'd be very grateful to anyone sharing statistics, especially on the larger end of the spectrum -- with or without SolrCloud.
Thanks,
- Bram
Re: Old facet value doesn't go away after index update
Set mincount=1

Bill Bell
Sent from mobile

On Dec 19, 2014, at 12:22 PM, Tang, Rebecca rebecca.t...@ucsf.edu wrote:

Hi there, I have an index that has a field called collection_facet. There was a value 'Ness Motley Law Firm Documents' that we wanted to update to 'Ness Motley Law Firm'. There were 36,132 records with this value. So I re-indexed just the 36,132 records. After the update, I ran a facet query (q=*:*&facet=true&facet.field=collection_facet) to see if the value got updated, and I saw:

Ness Motley Law Firm 36,132 -- as expected
Ness Motley Law Firm Documents 0 — Why is this value still here even though clearly there are no records with this value anymore?

I thought maybe it was cached, so I restarted Solr, but I still got the same results.

facet_fields: {
  collection_facet: [
    ...
    Ness Motley Law Firm, 36132,
    ...
    Ness Motley Law Firm Documents, 0
  ]
}

Rebecca Tang
Applications Developer, UCSF CKM
Legacy Tobacco Document Library (legacy.library.ucsf.edu)
E: rebecca.t...@ucsf.edu
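Background on the suggestion: the old term can linger in the index until the segments holding the deleted versions are merged away, so it appears with a zero count unless facet.mincount hides it. A sketch of the query with that set (core and handler assumed to be defaults):

  import org.apache.solr.client.solrj.SolrQuery;

  public class FacetMinCount {
      public static SolrQuery build() {
          SolrQuery q = new SolrQuery("*:*");
          q.setFacet(true);
          q.addFacetField("collection_facet");
          q.setFacetMinCount(1); // hide terms with no remaining live documents
          return q;
      }
  }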
Re: Solr Dynamic Field Performance
How about perf if you dynamically create 5000 fields?

Bill Bell
Sent from mobile

On Sep 14, 2014, at 10:06 AM, Erick Erickson erickerick...@gmail.com wrote:

Dynamic fields, once they are actually _in_ a document, aren't any different from statically defined fields. Literally, there's no place in the search code that I know of that _ever_ has to check whether a field was dynamically or statically defined. AFAIK, the only additional cost would be figuring out which pattern matched at index time, which is such a tiny portion of the cost of indexing that I doubt you could measure it.
Best,
Erick

On Sun, Sep 14, 2014 at 7:58 AM, Saumitra Srivastav saumitra.srivast...@gmail.com wrote:

I have a collection with 200 fields and 300M docs running in cloud mode. Each doc has around 20 fields. I now have a use case where I need to replace these explicit fields with 6 dynamic fields. Each of these 200 fields will match one of the 6 dynamic fields. I am evaluating the performance implications of switching to dynamicFields. I have tested with a smaller dataset (5M docs) but didn't notice any indexing or query performance degradation. Queries on dynamic fields will be either faceting, range queries, or full-text search. Are there any known performance issues with using dynamicFields instead of explicit ones?

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Dynamic-Field-Performance-tp4158737.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to solve?
Yeah, we already use it. I will try to create a custom function; if I get it to work I will post it. The challenge for me is how to dynamically match and add them based on the faceting.

Here is a better example. The doctor core has payloads as name:val. The names are doctor specialties. I need to pull back by the name, since the user faceted on a specialty. So far payloads work. But the user now wants to facet on another specialty. For example, they are looking for a cardiologist and an internal medicine doctor, and if the doctor practices at the same hospital I need to take the values and add them; else take the max value for the 2 specialties. Make sense now? Seems like I need to create a payload and my own custom function.

Bill Bell
Sent from mobile

On Sep 6, 2014, at 12:57 PM, Erick Erickson erickerick...@gmail.com wrote:

Here's a blog with an end-to-end example. Jack's right, it takes some configuration, and having first-class support in Solr would be a good thing...
http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/
Best,
Erick

On Sat, Sep 6, 2014 at 10:24 AM, Jack Krupansky j...@basetechnology.com wrote:

Payloads really don't have first-class support in Solr. It's a solid feature of Lucene, but never expressed well in Solr. Any thoughts or proposals are welcome! (Hmmm... I wonder what the good folks at Heliosearch have up their sleeves in this area?!)

-- Jack Krupansky

-----Original Message----- From: William Bell
Sent: Friday, September 5, 2014 10:03 PM
To: solr-user@lucene.apache.org
Subject: How to solve?

We have a core with each document as a person. We want to boost based on the sweater color, but if the person has sweaters in their closet which are from the same manufacturer, we want to boost even more by adding them together.

Peter Smit - Sweater: Blue = 1 : Nike, Sweater: Red = 2 : Nike, Sweater: Blue = 1 : Polo
Tony S - Sweater: Red = 2 : Nike
Bill O - Sweater: Red = 2 : Polo, Blue = 1 : Polo

Scores:
Peter Smit - 1 + 2 = 3
Tony S - 2
Bill O - 2 + 1

I thought about using payloads.

sweaters_payload
Blue: Nike: 1
Red: Nike: 2
Blue: Polo: 1

How do I query this?

http://localhost:8983/solr/persons?q=*:*&sort=??

Ideas?

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076
Re: embedded documents
See my Jira. It supports it via json.fsuffix=_json&wt=json

http://mail-archives.apache.org/mod_mbox/lucene-dev/201304.mbox/%3CJIRA.12641293.1365394604231.125944.1365397875874@arcas%3E

Bill Bell
Sent from mobile

On Aug 24, 2014, at 6:43 AM, Jack Krupansky j...@basetechnology.com wrote:

Indexing and querying of raw JSON would be a valuable addition to Solr, so maybe you could simply explain more precisely your data model and transformation rules. For example, when multi-level nesting occurs, what does your loader do? Maybe if the field names were derived by concatenating the full path of JSON key names, like titles_json.FR, field-name nesting could be handled in a fully automated manner. I had been thinking of filing a Jira proposing exactly that, so that even the most deeply nested JSON maps could be supported, although combinations of arrays and maps would be problematic.

-- Jack Krupansky

-----Original Message----- From: Michael Pitsounis
Sent: Wednesday, August 20, 2014 7:14 PM
To: solr-user@lucene.apache.org
Subject: embedded documents

Hello everybody, I had a requirement to store complicated json documents in solr. I have modified the JsonLoader to accept complicated json documents with arrays/objects as values. It stores the object/array and then flattens it and indexes the fields.

e.g. basic example document:

{
  "titles_json": {"FR": "This is the FR title", "EN": "This is the EN title"},
  "id": 103,
  "guid": "3b2f2998-85ac-4a4e-8867-beb551c0b3c6"
}

It will store titles_json:{"FR":"This is the FR title", "EN":"This is the EN title"} and then index the fields:

titles.FR: This is the FR title
titles.EN: This is the EN title

Do you see any problems with this approach?

Regards,
Michael Pitsounis
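A sketch of the flattening rule Jack describes (concatenating the full path of JSON key names), using Jackson. The dot separator and the array handling are assumptions for illustration, not what Michael's modified JsonLoader actually does:

  import com.fasterxml.jackson.databind.JsonNode;
  import com.fasterxml.jackson.databind.ObjectMapper;
  import java.util.Iterator;
  import java.util.LinkedHashMap;
  import java.util.Map;

  public class JsonFlattener {
      public static Map<String, String> flatten(String json) throws Exception {
          JsonNode root = new ObjectMapper().readTree(json);
          Map<String, String> fields = new LinkedHashMap<String, String>();
          walk("", root, fields);
          return fields;
      }

      private static void walk(String path, JsonNode node, Map<String, String> out) {
          if (node.isObject()) {
              Iterator<Map.Entry<String, JsonNode>> it = node.fields();
              while (it.hasNext()) {
                  Map.Entry<String, JsonNode> e = it.next();
                  String child = path.isEmpty() ? e.getKey() : path + "." + e.getKey();
                  walk(child, e.getValue(), out);
              }
          } else if (node.isArray()) {
              for (int i = 0; i < node.size(); i++) {
                  walk(path + "." + i, node.get(i), out); // arrays: index joins the path
              }
          } else {
              out.put(path, node.asText()); // leaf: full key path becomes the field name
          }
      }
  }

For the example document above, flatten() would yield titles_json.FR and titles_json.EN entries, matching the field-naming scheme being proposed.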
Re: SolrCloud Scale Struggle
Seems way overkill. Are you using /get at all? If you need the docs avail right away - why? How about after 30 seconds? How many docs do you get added per second during peak? Even Google has a delay when you do Adwords.

One idea is yo have an empty core that you insert into and then shard into the queries. So one fire would be called newdocs and then you would add this core into your query. There are a couple issues with this with scoring but it works nicely. I would not even use Solrcloud for that core.

Try to reduce number of Java running. Reduce memory and use one java per machine. Then if you need faster avail if docs you really need to ask why. Why not later? If it got search or just showing the user the info? If for showing maybe query a not indexes table for the few not yet indexed?? Or just store in a db to show the user the info and index later?

Bill Bell
Sent from mobile

On Aug 1, 2014, at 4:19 AM, anand.mahajan an...@zerebral.co.in wrote:

Hello all, struggling to get this going with SolrCloud.

Requirement in brief:
- Ingest about 4M Used Cars listings a day and track all unique cars for changes
- 4M automated searches a day (during the ingestion phase, to check whether a doc already exists in the index (based on values of 4-5 key fields) or is a new one or an updated version)
- Of the 4M - about 3M updates to existing docs (for every non-key value change)
- About 1M inserts a day (I'm assuming these many new listings come in every day)
- Daily bulk CSV exports of inserts/updates in the last 24 hours of various snapshots of the data to various clients

My current deployment:
i) I'm using Solr 4.8 and have set up a SolrCloud with 6 dedicated machines - 24 Core + 96 GB RAM each.
ii) There are over 190M docs in the SolrCloud at the moment (for all replicas it's consuming overall disk 2340GB, which implies each doc is at about 5-8kb in size.)
iii) The docs are split into 36 shards - and 3 replicas per shard (in all 108 Solr Jetty processes split over 6 servers, leaving about 18 Jetty JVMs running on each host)
iv) There are 60 fields per doc and all fields are stored at the moment :( (The backend is only Solr at the moment)
v) The current shard/routing key is a combination of Car Year, Make and some other car-level attributes that help classify the cars
vi) We are mostly using the default Solr config as of now - no heavy caching as the search is pretty random in nature
vii) Autocommit is on - with maxDocs = 1

Current throughput & issues:
With the above-mentioned deployment the daily throughput is only at about 1.5M on average (inserts + updates) - falling way short of what is required. Search is slow - some queries take about 15 seconds to return - and since insert is dependent on at least one search, that degrades the write throughput too. (This is not a Solr issue - but the app demands it so.)

Questions:
1. Autocommit with maxDocs = 1 - is that a goof-up and could that be slowing down indexing? It's a requirement that all docs are available as soon as indexed.
2. Should I have been better served had I deployed a single Jetty Solr instance per server with multiple cores running inside? The servers do start to swap out after a couple of days of Solr uptime - right now we reboot the entire cluster every 4 days.
3. The routing key is not able to effectively balance the docs on available shards - there are a few shards with just about 2M docs - and others over 11M docs. Shall I split the larger shards? But I do not have more nodes/hardware to allocate to this deployment.
In such a case, would splitting up the large shards give better read-write throughput?
4. To remain with the current hardware - would it help if I remove 1 replica each from a shard? But that would mean even when just 1 node goes down for a shard, there would be only 1 live node left, and that would not serve the write requests.
5. Also, is there a way to control where the split-shard replicas would go? Is there a pattern/rule that Solr follows when it creates replicas for split shards?
6. I read somewhere that creating a core would cost the OS one thread and a file handle. Since a core represents an index in its entirety, would it not be allocated the configured number of write threads? (The default, that is, 8.)
7. The ZooKeeper cluster is deployed on the same boxes as the Solr instances - would separating the ZK cluster out help?

Sorry for the long thread - I thought of asking these all at once rather than posting separate ones.

Thanks,
Anand

--
View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Scale-Struggle-tp4150592.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud Scale Struggle
Auto correct not good. Corrected below.

Bill Bell
Sent from mobile

On Aug 2, 2014, at 11:11 AM, Bill Bell billnb...@gmail.com wrote:

Seems way overkill. Are you using /get at all? If you need the docs avail right away - why? How about after 30 seconds? How many docs do you get added per second during peak? Even Google has a delay when you do Adwords.

One idea is to have an empty core that you insert into and then shard into the queries. So one core would be called newdocs and then you would add this core into your query. There are a couple issues with this with scoring but it works nicely. I would not even use SolrCloud for that core.

Try to reduce the number of Java instances running. Reduce memory and use one java per machine. Then if you need faster avail of docs you really need to ask why. Why not later? Do you need search or just showing the user the info? If for showing, maybe query an indexed table for the few not yet indexed?? Or just store in a db to show the user the info and index later?

Bill Bell
Sent from mobile

On Aug 1, 2014, at 4:19 AM, anand.mahajan an...@zerebral.co.in wrote:

Hello all, struggling to get this going with SolrCloud.

Requirement in brief:
- Ingest about 4M Used Cars listings a day and track all unique cars for changes
- 4M automated searches a day (during the ingestion phase, to check whether a doc already exists in the index (based on values of 4-5 key fields) or is a new one or an updated version)
- Of the 4M - about 3M updates to existing docs (for every non-key value change)
- About 1M inserts a day (I'm assuming these many new listings come in every day)
- Daily bulk CSV exports of inserts/updates in the last 24 hours of various snapshots of the data to various clients

My current deployment:
i) I'm using Solr 4.8 and have set up a SolrCloud with 6 dedicated machines - 24 Core + 96 GB RAM each.
ii) There are over 190M docs in the SolrCloud at the moment (for all replicas it's consuming overall disk 2340GB, which implies each doc is at about 5-8kb in size.)
iii) The docs are split into 36 shards - and 3 replicas per shard (in all 108 Solr Jetty processes split over 6 servers, leaving about 18 Jetty JVMs running on each host)
iv) There are 60 fields per doc and all fields are stored at the moment :( (The backend is only Solr at the moment)
v) The current shard/routing key is a combination of Car Year, Make and some other car-level attributes that help classify the cars
vi) We are mostly using the default Solr config as of now - no heavy caching as the search is pretty random in nature
vii) Autocommit is on - with maxDocs = 1

Current throughput & issues:
With the above-mentioned deployment the daily throughput is only at about 1.5M on average (inserts + updates) - falling way short of what is required. Search is slow - some queries take about 15 seconds to return - and since insert is dependent on at least one search, that degrades the write throughput too. (This is not a Solr issue - but the app demands it so.)

Questions:
1. Autocommit with maxDocs = 1 - is that a goof-up and could that be slowing down indexing? It's a requirement that all docs are available as soon as indexed.
2. Should I have been better served had I deployed a single Jetty Solr instance per server with multiple cores running inside? The servers do start to swap out after a couple of days of Solr uptime - right now we reboot the entire cluster every 4 days.
3. The routing key is not able to effectively balance the docs on available shards - there are a few shards with just about 2M docs - and others over 11M docs. Shall I split the larger shards? But I do not have more nodes/hardware to allocate to this deployment. In such a case, would splitting up the large shards give better read-write throughput?
4. To remain with the current hardware - would it help if I remove 1 replica each from a shard? But that would mean even when just 1 node goes down for a shard, there would be only 1 live node left, and that would not serve the write requests.
5. Also, is there a way to control where the split-shard replicas would go? Is there a pattern/rule that Solr follows when it creates replicas for split shards?
6. I read somewhere that creating a core would cost the OS one thread and a file handle. Since a core represents an index in its entirety, would it not be allocated the configured number of write threads? (The default, that is, 8.)
7. The ZooKeeper cluster is deployed on the same boxes as the Solr instances - would separating the ZK cluster out help?

Sorry for the long thread - I thought of asking these all at once rather than posting separate ones.

Thanks,
Anand

--
View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Scale-Struggle-tp4150592.html
Sent from the Solr - User mailing list archive at Nabble.com.
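On question 1 above: a common alternative to autocommit with maxDocs = 1 is to batch adds and let commitWithin bound the visibility delay, along the lines Bill suggests ("how about after 30 seconds?"). A sketch under that assumption - the core name, batch size, and 30-second window are illustrative:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchedAdds {
      public static void main(String[] args) throws Exception {
          HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/cars");
          List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
          for (int i = 0; i < 500; i++) {            // batch size is illustrative
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "listing-" + i);    // hypothetical field
              batch.add(doc);
          }
          // One request for the whole batch; docs become searchable within 30s,
          // amortizing the commit cost instead of committing per document.
          server.add(batch, 30000);
      }
  }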
Latest jetty
Since we are now on the latest Java JDK, can we move to Jetty 9? Thoughts?

Bill Bell
Sent from mobile
Re: stucked with log4j configuration
Well, I hope log4j2 is something Solr supports when it goes GA.

Bill Bell
Sent from mobile

On Apr 12, 2014, at 7:26 AM, Aman Tandon amantandon...@gmail.com wrote:

I have upgraded my Solr 4.2 to Solr 4.7.1, but in my logs there is an error for log4j:

log4j: Could not find resource

Please find attached a screenshot of the error console:
https://drive.google.com/file/d/0B5GzwVkR3aDzdjE1b2tXazdxcGs/edit?usp=sharing

--
With Regards
Aman Tandon
Re: boost results within 250km
Just take geodist() and use the map function, and send it to bf or boost.

Bill Bell
Sent from mobile

On Apr 9, 2014, at 8:26 AM, Erick Erickson erickerick...@gmail.com wrote:

Why do you want to do this? This sounds like an XY problem; you're asking how to do something specific without explaining why you care. Perhaps there are other ways to do this.
Best,
Erick

On Tue, Apr 8, 2014 at 11:30 PM, Aman Tandon amantandon...@gmail.com wrote:

How can I give more boost to the results within 250km than to the others, without using result filtering?
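A sketch of Bill's suggestion with edismax; the query, location field, and point are illustrative. map(geodist(),0,250,2,1) returns 2 for docs within 250 km and 1 otherwise, and the edismax boost parameter multiplies that into the score, so nearby docs are boosted without filtering anything out:

  import org.apache.solr.client.solrj.SolrQuery;

  public class DistanceBoost {
      public static SolrQuery build() {
          SolrQuery q = new SolrQuery("cardiologist");   // hypothetical query
          q.set("defType", "edismax");
          q.set("sfield", "store");                      // hypothetical location field
          q.set("pt", "45.15,-93.85");                   // hypothetical user position
          q.set("boost", "map(geodist(),0,250,2,1)");    // 2x within 250km, 1x beyond
          return q;
      }
  }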
Re: Luke 4.6.1 released
Yes, it works with Solr.

Bill Bell
Sent from mobile

On Feb 16, 2014, at 3:38 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Does it work with Solr? I couldn't tell from this repo what the description was or what its Solr relevance is. I am sure all the long-timers know, but for more recent Solr people the additional information would be useful.

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, Feb 17, 2014 at 3:02 AM, Dmitry Kan solrexp...@gmail.com wrote:

Hello! Luke 4.6.1 has just been released. Grab it here:
https://github.com/DmitryKey/luke/releases/tag/4.6.1
Fixes: loading the jar from the command line is now working fine.

--
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: twitter.com/dmitrykan
Status of 4.6.1?
We just need the bug fix for solr.xml: https://issues.apache.org/jira/browse/SOLR-5543

Bill Bell
Sent from mobile
Re: Call to Solr via TCP
Yeah, open a socket to the port and send correct GET syntax, and Solr will respond with results...

Bill Bell
Sent from mobile

On Dec 10, 2013, at 2:50 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote:

Zwer, is there a reason you need to do this? It's probably very hard to get Solr to speak TCP. But if you're having a performance or infrastructure problem, the group might be able to help you with a far simpler solution.

Sent from my Windows Phone

From: Zwer
Sent: 12/10/2013 12:15 PM
To: solr-user@lucene.apache.org
Subject: Re: Call to Solr via TCP

Maybe I asked incorrectly. Solr is a web application, hosted by some servlet container and reachable via HTTP. HTTP is an extension of TCP, and I would like to know whether there exists some lower-level way to communicate with an application (i.e. Solr) hosted by Jetty?

--
View this message in context: http://lucene.472066.n3.nabble.com/Call-to-Solr-via-TCP-tp4105932p4105935.html
Sent from the Solr - User mailing list archive at Nabble.com.
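To make Bill's point concrete: HTTP is just text over the TCP socket, so there is no lower-level protocol to speak to Jetty; you can talk to Solr with a raw socket yourself. Host, port, and path below are illustrative:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.io.PrintWriter;
  import java.net.Socket;

  public class RawSolrGet {
      public static void main(String[] args) throws Exception {
          Socket socket = new Socket("localhost", 8983);
          PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
          // A minimal HTTP/1.0 request; Solr replies with a normal HTTP response
          out.print("GET /solr/select?q=*:*&wt=json HTTP/1.0\r\n");
          out.print("Host: localhost\r\n\r\n");
          out.flush();
          BufferedReader in = new BufferedReader(
              new InputStreamReader(socket.getInputStream()));
          for (String line; (line = in.readLine()) != null; ) {
              System.out.println(line);   // status line, headers, then the JSON body
          }
          socket.close();
      }
  }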
Re: How to work with remote solr savely?
Do you have a sample Jetty XML to set up basic auth for updates in Solr?

Sent from my iPad

On Nov 22, 2013, at 7:34 AM, michael.boom my_sky...@yahoo.com wrote:

Use HTTP basic authentication, set up in your servlet container (Jetty/Tomcat). That should work fine if you are *not* using SolrCloud.

-
Thanks,
Michael
--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-work-with-remote-solr-savely-tp4102612p4102613.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: useColdSearcher in SolrCloud config
Wouldn't true mean "use the cold searcher"? It seems backwards to me...

Sent from my iPad

On Nov 22, 2013, at 2:44 AM, ade-b adrian.bro...@gmail.com wrote:

Hi. The definition of the useColdSearcher config element in solrconfig.xml is: "If a search request comes in and there is no current registered searcher, then immediately register the still warming searcher and use it. If false then all requests will block until the first searcher is done warming."

By the term 'block', I assume SOLR returns a non-200 response to requests. Does anybody know the exact response code returned when the server is blocking requests?

If a new SOLR server is introduced into an existing array of SOLR servers (in a SOLR Cloud setup), it will sync its index from the leader. To save you having to specify warm-up queries in the solrconfig.xml file for first searchers, would/could the new server not auto-warm its caches from the caches of an existing server?

Thanks
Ade

--
View this message in context: http://lucene.472066.n3.nabble.com/useColdSearcher-in-SolrCloud-config-tp4102569.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: NullPointerException
It seems to be a modified row referenced in EvaluatorBag. I am not familiar with either.

Sent from my iPad

On Nov 22, 2013, at 3:05 AM, Adrien RUFFIE a.ruf...@e-deal.com wrote:

Hello all, I have performed a full indexing with Solr, but when I try to perform an incremental indexing I get the following exception (cf. attachment). Anyone have an idea of the problem?

Great thanks,
log.txt
Re: Reverse mm(min-should-match)
This is an awesome idea!

Sent from my iPad

On Nov 22, 2013, at 12:54 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote:

Instead of specifying a percentage or number of query terms that must match tokens in a field, I'd like to do the opposite -- specify how much of a field must match a query. The problem I'm trying to solve is to boost document titles that closely match the query string.

If a title looks something like:

Title: [solr] [the] [worlds] [greatest] [search] [engine]

I want to be able to specify how much of the field must match the query string. This differs from normal mm. Normal mm specifies how much of the query must match a field. As an example, with this title, if I use normal mm=100% and perform the following query:

mm=100%
q=solr

This will match the title above, as 100% of [solr] matches the field.

What I really want to get at is a reverse mm:

Rmm=100%
q=solr

The title above will not match in this case. Only 1/6 of the tokens in the field match the query. However, an exact search would match:

Rmm=100%
q=solr the worlds greatest search engine

Here 100% of the query matches the title, so I'm good. Is there any way to achieve this in Solr?

--
Doug Turnbull
Search & Big Data Architect
OpenSource Connections http://o19s.com
Re: Jetty 9?
So no Jetty 9 until Solr 5? Java 7 is at release 40. Is that our commitment, to not require Java 7 until Solr 5? Most people are probably already on Java 7...

Bill Bell
Sent from mobile

On Nov 7, 2013, at 1:29 AM, Furkan KAMACI furkankam...@gmail.com wrote:

Here is an issue that points to that: https://issues.apache.org/jira/browse/SOLR-4839

2013/11/7 William Bell billnb...@gmail.com

When are we moving Solr to Jetty 9?

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076
Re: Performance of rows and start parameters
Do you want to look through them all? Have you considered the Lucene API? Not sure if that is better, but it might be.

Bill Bell
Sent from mobile

On Nov 4, 2013, at 6:43 AM, michael.boom my_sky...@yahoo.com wrote:

I saw that some time ago there was a JIRA ticket discussing this, but still I found no relevant information on how to deal with it. When working with a big number of docs (e.g. 70M in my case), I'm using start=0&rows=30 in my requests. For the first request the query time is OK; the next one is visibly slower, the third even more slow, and so on, until I get some huge query times of up to 140 secs after a few hundred requests. My tests were done with SolrMeter at a rate of 1000 qpm. The same thing happens at 100 qpm, though. Is there a best practice for this situation, or maybe an explanation why the query time is increasing from request to request?

Thanks!

-
Thanks,
Michael
--
View this message in context: http://lucene.472066.n3.nabble.com/Performance-of-rows-and-start-parameters-tp4099194.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Core admin: create new core
You could pre-create a bunch of directories and base configs. Create as needed. Then use the schemaless API to set it up... Or make changes in a script and reload the core.

Bill Bell
Sent from mobile

On Nov 4, 2013, at 6:06 AM, Erick Erickson erickerick...@gmail.com wrote:

Right, this has been an issue for a while; there's no current way to do this. Someday I'll be able to work on SOLR-4779, which should go some way toward making this work more easily. It's still not exactly what you're looking for, but it might work. Of course with SolrCloud you can specify a configuration set that is used for multiple collections. People are using Puppet or similar to automate this over large numbers of nodes, but that's not entirely satisfactory either in our case, I suspect.

FWIW,
Erick

On Mon, Nov 4, 2013 at 4:00 AM, Bram Van Dam bram.van...@intix.eu wrote:

The core admin CREATE function requires that the new instance dir and schema/config exist already. Is there a particular reason for this? It would be incredibly convenient if I could create a core with a new schema and new config simply by calling CREATE (maybe providing the contents of config.xml and schema.xml as base64-encoded strings in HTTP POST or something?). I'm guessing this isn't currently possible?

Ta,
- bram
Re: Proposal for new feature, cold replicas, brainstorming
Yeah, replicating to a DR site would be good too.

Bill Bell
Sent from mobile

On Oct 24, 2013, at 6:27 AM, yriveiro yago.rive...@gmail.com wrote:

I've been wondering for some time whether it's possible to have replicas of a shard synchronized but in a state where they can't accept queries, only updates. A replica in this replication-only mode would only wake up to accept queries if it's the last replica alive, and would go back to replication mode when another replica becomes alive and synchronized.

The motivation for this is simple: I want to have replication, but I don't want n active replicas with full resources allocated (caches and so on). This is useful in environments where replication is needed but a high query throughput is not fundamental and resources are limited.

I know that right now this is not possible, but I think it's a feature that could be implemented in an easy way by creating a new status for shards.

The bottom-line question is: am I the only one with this kind of requirement? Does a functionality like this make sense?

-
Best regards
--
View this message in context: http://lucene.472066.n3.nabble.com/Proposal-for-new-feature-cold-replicas-brainstorming-tp4097501.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - what's the next big thing?
Full JSON support: deep complex-object indexing and search. Game changer.

Bill Bell
Sent from mobile

On Oct 26, 2013, at 1:04 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi,

On Sat, Oct 26, 2013 at 5:58 AM, Saar Carmi saarca...@gmail.com wrote:

LOL, Jack. I can imagine Otis saying that.

Funny indeed, but not really.

Otis, with this marriage, are we going to see map-reduce based queries?

Can you please describe what you mean by that? Maybe with an example.

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Oct 25, 2013 10:03 PM, Jack Krupansky j...@basetechnology.com wrote:

But a lot of that big yellow elephant stuff is in 4.x anyway. (Otis: I was afraid that you were going to say that the next big thing in Solr is... Elasticsearch!)

-- Jack Krupansky

-----Original Message----- From: Otis Gospodnetic
Sent: Friday, October 25, 2013 2:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr - what's the next big thing?

Saar, the marriage with the big yellow elephant is a big deal. It changes the scale.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Oct 25, 2013 5:32 AM, Saar Carmi saarca...@gmail.com wrote:

If I am not mistaken, the most impressive improvement of Solr 4.0 compared to previous versions was the SolrCloud architecture. What would be the next big thing in Solr 5.0?

Saar
Re: Spatial Distance Range
Yes, frange works.

Bill Bell
Sent from mobile

On Oct 22, 2013, at 8:17 AM, Eric Grobler impalah...@googlemail.com wrote:

Hi Everyone, normally one would search for documents where the location is within a specified distance, for example within 5 km:

fq={!geofilt pt=45.15,-93.85 sfield=store d=5}

Is there a way to specify a range between 10 and 20 km? Something like:

fq={!geofilt pt=45.15,-93.85 sfield=store distancefrom=10 distanceupto=20}

Thanks
Ericz
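The frange form Bill is referring to, sketched with the values from the question: geodist() computes each document's distance from pt in km, and frange keeps documents whose value falls between l and u, giving a 10-20 km ring:

  import org.apache.solr.client.solrj.SolrQuery;

  public class DistanceRing {
      public static SolrQuery build() {
          SolrQuery q = new SolrQuery("*:*");
          q.setFields("name", "store");
          q.set("sfield", "store");
          q.set("pt", "45.15,-93.85");
          q.addFilterQuery("{!frange l=10 u=20}geodist()"); // 10km <= distance <= 20km
          return q;
      }
  }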
Re: Skipping caches on a /select
But a global one on a qt would be awesome!!!

Bill Bell
Sent from mobile

On Oct 17, 2013, at 2:43 PM, Yonik Seeley ysee...@gmail.com wrote:

There isn't a global cache=false... it's a local param that can be applied to any fq or q parameter independently.

-Yonik

On Thu, Oct 17, 2013 at 4:39 PM, Tim Vaillancourt t...@elementspace.com wrote:

Thanks Yonik,

Does cache=false apply to all caches? The docs make it sound like it is for filterCache only, but I could be misunderstanding. When I force a commit and perform a /select query many times with cache=false, I notice my query gets cached still; my guess is in the queryResultCache. At first the query takes 500ms+, then all subsequent requests take 0-1ms. I'll confirm this queryResultCache assumption today.

Cheers,
Tim

On 16/10/13 06:33 PM, Yonik Seeley wrote:

On Wed, Oct 16, 2013 at 6:18 PM, Tim Vaillancourt t...@elementspace.com wrote:

I am debugging some /select queries on my Solr tier and would like to see if there is a way to tell Solr to skip the caches on a given /select query if it happens to ALREADY be in the cache. Live queries are being inserted and read from the caches, but I want my debug queries to bypass the cache entirely. I do know about the cache=false param (that causes the results of a select to not be INSERTED into the cache), but what I am looking for instead is a way to tell Solr to not read the cache at all, even if there actually is a cached result for my query.

Yeah, cache=false for q or fq should already not use the cache at all (read or write).

-Yonik
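What the local param looks like in practice, per Yonik's description that it applies per q or fq (the field names and values are illustrative):

  import org.apache.solr.client.solrj.SolrQuery;

  public class UncachedDebugQuery {
      public static SolrQuery build() {
          // cache=false goes on each clause it should apply to
          SolrQuery q = new SolrQuery("{!cache=false}id:123");   // hypothetical main query
          q.addFilterQuery("{!cache=false}type:pizza");          // hypothetical filter
          return q;
      }
  }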
DIH
We have a custom Field processor in DIH and we are not CPU bound on one core... How do we thread it?? We need to use more cores. The box has 32 cores and 1 is 100% CPU bound. Ideas?

Bill Bell
Sent from mobile
Re: DIH
We are NOW CPU bound. Thoughts???

Bill Bell
Sent from mobile

On Oct 15, 2013, at 8:49 PM, Bill Bell billnb...@gmail.com wrote:

We have a custom Field processor in DIH and we are not CPU bound on one core... How do we thread it?? We need to use more cores. The box has 32 cores and 1 is 100% CPU bound. Ideas?

Bill Bell
Sent from mobile
Re: Solr 4.4.0 on Ubuntu 10.04 with Jetty 6.1 from package Repository
Does this work? "I can suggest -XX:-UseLoopPredicate to switch off predicates."??? Which version of 7 is recommended?

Bill Bell
Sent from mobile

On Oct 10, 2013, at 11:29 AM, Smiley, David W. dsmi...@mitre.org wrote:

*Don't* use JDK 7u40; it's been known to cause index corruption and SIGSEGV faults with Lucene: LUCENE-5212. This has not gone unnoticed by Oracle.

~ David

On 10/10/13 12:34 PM, Guido Medina guido.med...@temetra.com wrote:

2. Java version: There are huge performance wins between Java 5, 6 and 7; we use Oracle JDK 7u40.
Re: Field with default value and stored=false, will be reset back to the default value in case of updating other fields
You have to update the whole record, including all fields...

Bill Bell
Sent from mobile

On Oct 9, 2013, at 7:50 PM, deniz denizdurmu...@gmail.com wrote:

Hi all, I have encountered some problems and posted about them on Stack Overflow here:
http://stackoverflow.com/questions/19285251/solr-field-with-default-value-resets-itself-if-it-is-stored-false

As you can see from the response, does it make sense to open a bug ticket for this? Because, although I can work around this by setting everything back to stored=true, it does not make sense to keep every field stored when I don't need to return them in the search results. Or can anyone give a more detailed explanation of why this is expected and normal?

-
Zeki ama calismiyor... Calissa yapar...
--
View this message in context: http://lucene.472066.n3.nabble.com/Field-with-default-value-and-stored-false-will-be-reset-back-to-the-default-value-in-case-of-updatins-tp4094508.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.5 spatial search - distance and score
You can apply his 4.5 patches to 4.4, or take trunk and it is there.

Bill Bell
Sent from mobile

On Sep 12, 2013, at 6:23 PM, Weber solrmaill...@fluidolabs.com wrote:

I'm trying to get the score by using a custom boost and also get the distance. I found David's code* to get it using Intersects, which I want to replace by {!geofilt} or geodist().

*David's code: https://issues.apache.org/jira/browse/SOLR-4255

He told me geodist() will be available again for this kind of field, which is a geohash type. So I'd like to know how it can be done today on 4.4 with {!geofilt}, and how it will be done on 4.5 using geodist().

Thanks in advance.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-5-spatial-search-distance-and-score-tp4089706.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Some highlighted snippets aren't being returned
Zip up all your configs.

Bill Bell
Sent from mobile

On Sep 8, 2013, at 3:00 PM, Eric O'Hanlon elo2...@columbia.edu wrote:

Hi again Everyone, I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts.

Thanks,
Eric

On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote:

Hi Everyone, I'm facing an issue in which my solr query is returning highlighted snippets for some, but not all, results. For reference, I'm searching through an index that contains web crawls of human-rights-related websites. I'm running Solr as a webapp under Tomcat and I've included the query's Solr params from the Tomcat log:

...
webapp=/solr-4.2 path=/select params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8 status=0 QTime=108
...

For the query above (which can be simplified to say: find all documents that contain the word "unangan" and return facets, highlights, etc.), I get five search results. Only three of these are returning highlighted snippets. Here's the highlighting portion of the solr response (note: printed in ruby notation because I'm receiving this response in a Rails app):

highlighting => {
  "20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf" => {},
  "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf" => {},
  "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf" => {},
  "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf" => {contents => [...actual snippet is returned here...]},
  "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf" => {contents => [...actual snippet is returned here...]},
  "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999" => {contents => [...actual snippet is returned here...]},
  "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw" => {contents => [...actual snippet is returned here...]},
  "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf" => {}
}

I have eight (as opposed to five) results above because I'm also doing a grouped query, grouping by a field called original_url, and this leads to five grouped results. I've confirmed that my highlight-lacking results DO contain the word "unangan", as expected, and this term appears in a text field that's indexed and stored, and searched in all text searches. For example, one of the search results is for a crawl of this document:

http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf

And if you view that document on the web, you'll see that it does contain "unangan".
Has anyone seen this before? And does anyone have any good suggestions for troubleshooting/fixing the problem? Thanks! - Eric
Re: Concat 2 fields in another field
If it's for search, just copyField into a multivalued field. Or do it at indexing time using DIH or code. A Rhino script works too.

Bill Bell
Sent from mobile

On Aug 27, 2013, at 7:15 AM, Jack Krupansky j...@basetechnology.com wrote:

I have additional examples in the two most recent early access releases of my book - variations on using the existing update processors.

-- Jack Krupansky

-----Original Message----- From: Federico Chiacchiaretta
Sent: Tuesday, August 27, 2013 8:39 AM
To: solr-user@lucene.apache.org
Subject: Re: Concat 2 fields in another field

Hi, we do the same thing using an update request processor chain; this is the snippet from solrconfig.xml:

<updateRequestProcessorChain name="concatenation">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">firstname</str>
    <str name="dest">concatfield</str>
  </processor>
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">lastname</str>
    <str name="dest">concatfield</str>
  </processor>
  <processor class="solr.ConcatFieldUpdateProcessorFactory">
    <str name="fieldName">concatfield</str>
    <str name="delimiter">_</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Regards,
Federico Chiacchiaretta

2013/8/27 Markus Jelsma markus.jel...@openindex.io

You may be more interested in the ConcatFieldUpdateProcessorFactory:
http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/update/processor/ConcatFieldUpdateProcessorFactory.html

-----Original message-----
From: Alok Bhandari alokomprakashbhand...@gmail.com
Sent: Tuesday 27th August 2013 14:05
To: solr-user@lucene.apache.org
Subject: Re: Concat 2 fields in another field

Thanks for the reply. But I don't want to introduce any scripting in my code, so I want to know whether there is any Java component available for the same.

--
View this message in context: http://lucene.472066.n3.nabble.com/Concat-2-fields-in-another-field-tp4086786p4086791.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2.1 update to 4.3/4.4 problem
Index and query analyzer: type=index.

Bill Bell
Sent from mobile

On Aug 26, 2013, at 5:42 AM, skorrapa korrapati.sus...@gmail.com wrote:

I have also re-indexed the data and tried, and also tried with the below:

<fieldType name="string_lower_case" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="select">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

This didn't work as well...

On Mon, Aug 26, 2013 at 4:03 PM, skorrapa [via Lucene] ml-node+s472066n4086601...@n3.nabble.com wrote:

Hello All, I am still facing the same issue. Case-insensitive search is not working on Solr 4.3. I am using the below configuration in schema.xml:

<fieldType name="string_lower_case" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="select">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Basically I want my string, which could have spaces or characters like '-' or '\', to be searched upon case-insensitively. Please help.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-2-1-update-to-4-3-4-4-problem-tp4081896p4086606.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: How might one search for dupe IDs other than faceting on the ID field?
This seems like a fairly large issue. Can you create a JIRA issue? Bill Bell Sent from mobile On Jul 30, 2013, at 12:34 PM, Dotan Cohen dotanco...@gmail.com wrote: On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal alghos...@gmail.com wrote: Does adding facet.mincount=2 help? In fact, when adding facet.mincount=20 (I know that some dupes are in the hundreds) I got the OutOfMemoryError in seconds instead of minutes. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
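For reference, the faceting approach under discussion looks like this in SolrJ (a sketch; the field name id stands in for whatever the unique key is, and the explicit facet limit keeps the response bounded, since asking for every value over a huge field is exactly what runs out of memory):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FindDupeIds {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);              // only the facet counts are needed
        q.setFacet(true);
        q.addFacetField("id");
        q.setFacetMinCount(2);     // facet.mincount=2: values occurring in 2+ docs are dupes
        q.setFacetLimit(1000);     // bound the response instead of asking for everything
        QueryResponse rsp = server.query(q);
        FacetField f = rsp.getFacetField("id");
        for (FacetField.Count c : f.getValues()) {
            System.out.println(c.getName() + " appears " + c.getCount() + " times");
        }
    }
}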
Re: Performance question on Spatial Search
Can you compare with the old geo handler as a baseline? Bill Bell Sent from mobile On Jul 29, 2013, at 4:25 PM, Erick Erickson erickerick...@gmail.com wrote: This is very strange. I'd expect slow queries on the first few queries while these caches were warmed, but after that I'd expect things to be quite fast. For a 12G index and 256G RAM, you have on the surface a LOT of hardware to throw at this problem. You can _try_ giving the JVM, say, 18G but that really shouldn't be a big issue, your index files should be MMaped. Let's try the crude thing first and give the JVM more memory. FWIW Erick On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower smb-apa...@alcyon.net wrote: I've been doing some performance analysis of a spatial search use case I'm implementing in Solr 4.3.0. Basically I'm seeing search times a lot higher than I'd like them to be and I'm hoping people may have some suggestions for how to optimize further. Here are the specs of what I'm doing now: Machine: - 16 cores @ 2.8ghz - 256gb RAM - 1TB (RAID 1+0 on 10 SSD) Content: - 45M docs (not very big, only a few fields with no large textual content) - 1 geo field (using config below) - index is 12gb - 1 shard - Using MMapDirectory Field config:

<fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType" distErrPct="0.025" maxDistErr="0.00045" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" units="degrees"/>
<field name="geopoint" indexed="true" multiValued="false" required="false" stored="true" type="geo"/>

What I've figured out so far: - Most of my time (98%) is being spent in java.nio.Bits.copyToByteArray(long,Object,long,long), which is being driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(), which from what I gather is basically reading terms from the .tim file in blocks - I moved from Java 1.6 to 1.7 based upon what I read here: http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/ and it definitely had some positive impact (I haven't been able to measure this independently yet) - I changed maxDistErr from 0.09 (which is 1m precision per the docs) to 0.00045 (50m precision) - It looks to me that the .tim files are being memory mapped fully (i.e. they show up in pmap output); the virtual size of the JVM is ~18gb (heap is 6gb) - I've optimized the index but this doesn't have a dramatic impact on performance. Changing the precision and the JVM upgrade yielded a drop from ~18s avg query time to ~9s avg query time. This is fantastic but I want to get this down into the 1-2 second range. At this point it seems that I am bottlenecked on copying memory out of the mapped .tim file, which leads me to think that the only solution to my problem would be to read less data or somehow read it more efficiently. If anyone has any suggestions of where to go with this I'd love to know. thanks, steve
Re: How to setup SimpleFSDirectoryFactory
I get a similar situation using Windows 2008 and Solr 3.6. Memory using mmap is never released, even if I turn off traffic and commit and do a manual gc. If the size of the index is 3gb then memory used will be heap + 3gb of shared used. If I use a 6gb index I get heap + 6gb. If I turn off MMapDirectoryFactory it goes back down. When is the MMap supposed to release memory? It only does it on JVM restart now. Bill Bell Sent from mobile On Jul 22, 2012, at 6:21 AM, geetha anjali anjaliprabh...@gmail.com wrote: It happens in 3.6; for this reason I thought of moving to Solandra. If I do a commit, all documents are persisted without any issues. There are no issues in terms of any functionality; the only thing that happens is an increase in physical RAM use, which goes higher and higher, stops at the maximum, and never comes down. Thanks On Sun, Jul 22, 2012 at 3:38 AM, Lance Norskog goks...@gmail.com wrote: Interesting. Which version of Solr is this? What happens if you do a commit? On Sat, Jul 21, 2012 at 8:01 AM, geetha anjali anjaliprabh...@gmail.com wrote: Hi Uwe, Great to know. We have files indexing 1/min. After 30 mins I see all my physical memory, say, at 100 percent used (Windows). On deep investigation I found that mmap is not releasing OS file handles. Do you find this behaviour? Thanks On 20 Jul 2012 14:04, Uwe Schindler u...@thetaphi.de wrote: Hi Bill, MMapDirectory uses the file system cache of your operating system, which has the following consequences: in Linux, top/free should normally report only *few* free memory, because the O/S uses all memory not allocated by applications to cache disk I/O (and shows it as allocated, so having 0% free memory is just normal on Linux and also Windows). If you have other applications or Lucene/Solr itself that allocate lots of heap space or malloc() a lot, then you are reducing free physical memory, so reducing FS cache. This depends also on your swappiness parameter (if swappiness is higher, inactive processes are swapped out easier, default is 60% on Linux - freeing more space for FS cache - the backside is of course that maybe in-memory structures of Lucene and other applications get paged out). You will only see no paging at all if all memory allocated by all applications + all mmapped files fits into memory. But paging in/out the mmapped Lucene index is much cheaper than using SimpleFSDirectory or NIOFSDirectory. If you use SimpleFS or NIO and your index is not in FS cache, it will also read it from physical disk again, so where is the difference? Paging is actually cheaper as no syscalls are involved. If you want as much as possible of your index in physical RAM, copy it to /dev/null regularly and buy more RUM :-) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: uwe@thetaphi... From: Bill Bell [mailto:billnb...@gmail.com] Sent: Friday, July 20, 2012 5:17 AM Subject: Re: ... stop using it? The least used memory will be removed from the OS automatically? I see some paging. Wouldn't paging slow down the querying? My index is 10gb and every 8 hours we get most of it in shared memory. The memory is 99 percent used, and that does not leave any room for other apps. Other implications? Sent from my mobile device 720-256-8076 On Jul 19, 2012, at 9:49 AM ... Heap space or free system RAM: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Uwe ... use it since you might run out of memory on large indexes right? Here is how I got SimpleFSDirectoryFactory to work. Just set -Dsolr.directoryFactor...
set it all up with a helper in solrconfig.xml... if (Constants.WINDOWS) { if (MMapDirectory.UNMAP_SUPPORTED && Constants.JRE_IS_64... -- Lance Norskog goks...@gmail.com
Re: How to setup SimpleFSDirectoryFactory
Thanks. Are you saying that if we run low on memory, the MMapDirectory will stop using it? The least used memory will be removed from the OS automatically? I see some paging. Wouldn't paging slow down the querying? My index is 10gb and every 8 hours we get most of it in shared memory. The memory is 99 percent used, and that does not leave any room for other apps. Other implications? Sent from my mobile device 720-256-8076 On Jul 19, 2012, at 9:49 AM, Uwe Schindler u...@thetaphi.de wrote: Read this, then you will see that MMapDirectory will use 0% of your Java Heap space or free system RAM: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: William Bell [mailto:billnb...@gmail.com] Sent: Tuesday, July 17, 2012 6:05 AM Subject: How to setup SimpleFSDirectoryFactory We all know that MMapDirectory is fastest. However we cannot always use it since you might run out of memory on large indexes, right? Here is how I got SimpleFSDirectoryFactory to work. Just set -Dsolr.directoryFactory=solr.SimpleFSDirectoryFactory. Your solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

You can check it with http://localhost:8983/solr/admin/stats.jsp Notice that the default for Windows 64bit is MMapDirectory, else NIOFSDirectory except for Windows. It would be nicer if we just set it all up with a helper in solrconfig.xml...

if (Constants.WINDOWS) {
  if (MMapDirectory.UNMAP_SUPPORTED && Constants.JRE_IS_64BIT)
    return new MMapDirectory(path, lockFactory);
  else
    return new SimpleFSDirectory(path, lockFactory);
} else {
  return new NIOFSDirectory(path, lockFactory);
}

-- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: Mmap
Any thoughts on this? Is the default Mmap? Sent from my mobile device 720-256-8076 On Feb 14, 2012, at 7:16 AM, Bill Bell billnb...@gmail.com wrote: Does someone have an example of using unmap in 3.5 and chunksize? I am using Solr 3.5. I noticed in solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

I don't see this parameter taking effect when I set -Dsolr.directoryFactory=solr.MMapDirectoryFactory. How do I see the setting in the log or in stats.jsp? I cannot find a place that indicates whether it is set or not. I would assume StandardDirectoryFactory is being used, but I see the same thing whether I set it or not. Bill Bell Sent from mobile
Re: Problem with sorting solr docs
Would all optional fields need sortMissingLast and sortMissingFirst set even when not sorting on that field? Seems broken to me. Sent from my Mobile device 720-256-8076 On Jul 3, 2012, at 6:45 AM, Shubham Srivastava shubham.srivast...@makemytrip.com wrote: Just adding to the below-- If there is a field (say X) which is not populated, and in the query I am not sorting on this particular field but on another field (say Y), still the result ordering would depend on X. In fact, in the below problem mentioned by Harsh, making X sortMissingLast="false" sortMissingFirst="false" solved the problem, while in the query he was sorting on Y. This seems a bit illogical. Regards, Shubham From: Harshvardhan Ojha [harshvardhan.o...@makemytrip.com] Sent: Tuesday, July 03, 2012 5:58 PM To: solr-user@lucene.apache.org Subject: RE: Problem with sorting solr docs Hi, I have added <field name="latlng" indexed="true" stored="true" sortMissingLast="false" sortMissingFirst="false"/> to my schema.xml, although I am searching on the name field. It seems to be working fine. What is its default behavior? Regards Harshvardhan Ojha -Original Message- From: Rafał Kuć [mailto:r@solr.pl] Sent: Tuesday, July 03, 2012 5:35 PM To: solr-user@lucene.apache.org Subject: Re: Problem with sorting solr docs Hello! But the latlng field is not taken into account when sorting with sort defined such as in your query. You only sort on the name field and only that field. You can also define Solr behavior when there is no value in the field, by adding sortMissingLast="true" or sortMissingFirst="true" to your type definition in the schema.xml file. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch Hi, Thanks for the reply. I want to sort my docs on the name field; it is working well only if I have all fields populated. But my latlng field is optional; not every doc will have this value. So those docs are not getting sorted. Regards Harshvardhan Ojha -Original Message- From: Rafał Kuć [mailto:r@solr.pl] Sent: Tuesday, July 03, 2012 5:24 PM To: solr-user@lucene.apache.org Subject: Re: Problem with sorting solr docs Hello! Your query suggests that you are sorting on the 'name' field instead of the latlng field (sort=name+asc). The question is what you are trying to achieve? Do you want to sort your documents from a given geographical point? If that's the case you may want to look here: http://wiki.apache.org/solr/SpatialSearch/ and look at the possibility of sorting on the distance from a given point. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch Hi, I have 260 docs which I want to sort on a single field latlng. <doc> <str name="id">1</str> <str name="name">Amphoe Khanom</str> <str name="latlng">1.0,1.0</str> </doc> My query is: http://localhost:8080/solr/select?q=*:*&sort=name+asc This query sorts all documents except those which don't have latlng, and I can't keep any default value for this field. My question is how can I sort all docs on latlng? Regards Harshvardhan Ojha | Software Developer - Technology Development | MakeMyTrip.com, 243 SP Infocity, Udyog Vihar Phase 1, Gurgaon, Haryana - 122 016, India What's new?: Inspire - Discover an inspiring new way to plan and book travel online.
Re: UI
The php.net plugin is the best. SolrPHPClient is missing several features. Sent from my Mobile device 720-256-8076 On May 21, 2012, at 6:35 AM, Tolga to...@ozses.net wrote: Hi, Can you recommend a good PHP UI to search? Is SolrPHPClient good?
Re: slave index not cleaned
This is a known issue in 1.4, especially on Windows. Some of it was resolved in 3.x. Bill Bell Sent from mobile On May 14, 2012, at 5:54 AM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, replication will require up to twice the space of the index _temporarily_, just checking if that's what you're seeing. But that should go away reasonably soon. Out of curiosity, what happens if you restart your server, do the extra files go away? But it sounds like your index is growing over a longer period of time than just a single replication, is that true? Best Erick On Fri, May 11, 2012 at 6:03 AM, Jasper Floor jasper.fl...@m4n.nl wrote: Hi, On Thu, May 10, 2012 at 5:59 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Jasper, Sorry, I should've added more technical info without being prompted. Solr does handle that for you. Some more stuff to share: * Solr version? 1.4 * JVM version? 1.7 update 2 * OS? Debian (2.6.32-5-xen-amd64) * Java replication? yes * Errors in Solr logs? no * deletion policy section in solrconfig.xml? missing I would say, but I don't see this on the replication wiki page. This is what we have configured for replication:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">${solr.master.url}/df-stream-store/replication</str>
    <str name="pollInterval">00:20:00</str>
    <str name="compression">internal</str>
    <str name="httpConnTimeout">5000</str>
    <str name="httpReadTimeout">1</str>
  </lst>
</requestHandler>

We will be updating to 3.6 fairly soon however. To be honest, from what I've read, Solr cloud is what we really want in the future, but we will have to be patient for that. thanks in advance mvg, Jasper You may also want to look at your Index report in SPM (http://sematext.com/spm) before/during/after replication and share what you see. Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm - Original Message - From: Jasper Floor jasper.fl...@m4n.nl To: solr-user@lucene.apache.org Cc: Sent: Thursday, May 10, 2012 9:08 AM Subject: slave index not cleaned Perhaps I am missing the obvious but our slaves tend to run out of disk space. The index sizes grow to multiple times the size of the master. So I just toss all the data and trigger a replication. However, can't solr handle this for me? I'm sorry if I've missed a simple setting which does this for me, but if it's there then I have missed it. mvg Jasper
Re: Replication. confFiles and permissions.
Why would you replicate the dataimport properties? The master does the importing, not the slave... Sent from my Mobile device 720-256-8076 On May 9, 2012, at 7:23 AM, stockii stock.jo...@googlemail.com wrote: Hello. I'm running Solr replication. It works well, but I need to replicate my dataimport-properties. When server1 replicates this file it creates a new file every time, with *.timestamp, because the first replication run creates this file with wrong permissions ... how can I tell Solr replication to chmod 755 dataimport-properties ... ? ;-) thx - --- System One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 1 Core with 45 Million Documents other Cores 200.000 - Solr1 for Search-Requests - commit every Minute - 5GB Xmx - Solr2 for Update-Request - delta every Minute - 4GB Xmx -- View this message in context: http://lucene.472066.n3.nabble.com/Replication-confFiles-and-permissions-tp3973825.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it possible to limit the bandwidth of replication
+1 as well especially for larger indexes Sent from my Mobile device 720-256-8076 On May 9, 2012, at 9:46 AM, Jan Høydahl jan@cominvent.com wrote: I think we have to add this for java based rep. +1
Re: Solritas in production
I would not use Solritas except for very rudimentary solutions and prototypes. Sent from my Mobile device 720-256-8076 On May 6, 2012, at 6:02 AM, András Bártházi and...@barthazi.hu wrote: Hi, We're currently evaluating Solr as a Sphinx replacement. Our site has 1.000.000+ pageviews a day; it's a real estate search engine. The development is almost done, and it seems to be working fine, however some of my colleagues came up with the idea that we're using it wrong. We're using it as a service from PHP/Symfony. They think we should use Solritas as a frontend, so site visitors will use it directly, so no PHP will be involved, so it will use much less infrastructure. One of them said that even mobile.de uses it that way (I have found no clue about it at all). Do you think it is a good idea? Do you know services using Solritas as a frontend on a public site? My personal opinion is that using Solritas in production is a very bad idea for us, but I have not so much experience with Solr yet, and the Solritas documentation is far from detailed and up-to-date, so I don't really know what it is really usable for. Thanks, Andras
Re: change index/store at indexing time
Yes you can. Just use a script that is called for each row. Bill Bell Sent from mobile On Apr 27, 2012, at 6:38 PM, Vazquez, Maria (STM) maria.vazq...@dexone.com wrote: Hi, I'm migrating a project from Lucene 2.9 to Solr 3.4. There is a special case in the code that indexes the same field in two different ways, which is completely legal in Lucene directly but I don't know how to duplicate this same behavior in Solr:

if (isFirstGeo) {
  document.add(new Field("geoids", geoId, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
  isFirstGeo = false;
} else {
  if (countProducts > 100)
    document.add(new Field("geoids", geoId, Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS));
  else
    document.add(new Field("geoids", geoId, Field.Store.YES, Field.Index.NO));
}

Is there any way to do this in Solr in a Transformer? I'm using the DIH to index and I can't see a way to do this other than having three fields in the schema like geoids_store_index, geoids_nostore_index, and geoids_store_noindex. Thanks a lot in advance. Maria
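A sketch of the Transformer route: since a Solr schema fixes stored/indexed per field rather than per document, this hypothetical DIH transformer routes the value into the three differently-configured fields Maria proposes (all names are illustrative, and the comparison direction is an assumption since the archive garbled the original operator):

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Hypothetical DIH transformer: picks a destination field per row so the
// same value ends up stored+indexed, indexed-only, or stored-only.
public class GeoIdRoutingTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        Object geoId = row.get("geoId");
        Object count = row.get("countProducts");
        if (geoId == null || count == null) {
            return row;
        }
        boolean isFirstGeo = Boolean.TRUE.equals(row.get("isFirstGeo"));
        int countProducts = Integer.parseInt(count.toString());
        if (isFirstGeo) {
            row.put("geoids_store_index", geoId);    // stored and indexed
        } else if (countProducts > 100) {            // adjust comparison as needed
            row.put("geoids_nostore_index", geoId);  // indexed only
        } else {
            row.put("geoids_store_noindex", geoId);  // stored only
        }
        return row;
    }
}

It would be referenced on the DIH entity as transformer="GeoIdRoutingTransformer" (fully qualified class name), with the three fields declared in schema.xml carrying the matching stored/indexed flags.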
Re: commit stops
We also see extreme slowness using Solr 3.6 when trying to commit a delete. We also get hangs. We do at most 1 commit a week. Rebuilding from scratch using DIH works fine and has never hung. Bill Bell Sent from mobile On Apr 27, 2012, at 5:59 PM, mav.p...@holidaylettings.co.uk mav.p...@holidaylettings.co.uk wrote: Thanks for the reply. The client expects a response within 2 minutes and after that will report an error. When we build fresh it seems to work and the operation takes a second or two to complete. Once it gets to a stage where it hangs, it simply won't accept any further commits. I did an index check and all was ok. I don't see any major commit happening at any time; it seems to just hang. Even starting up and shutting down takes ages. We make 3 - 4 commits a day. We use solr 3.5 No autocommit On 28/04/2012 00:56, Yonik Seeley yo...@lucidimagination.com wrote: On Fri, Apr 27, 2012 at 9:18 AM, mav.p...@holidaylettings.co.uk mav.p...@holidaylettings.co.uk wrote: We have an index of about 3.5gb which seems to work fine until it suddenly stops accepting new commits. Users can still search on the front end but nothing new can be committed and it always times out on commit. Any ideas? Perhaps the commit happens to cause a major merge which may take a long time (and solr isn't going to allow overlapping commits). How long does a commit request take to time out? What Solr version is this? Do you have any kind of auto-commit set up? How often are you manually committing? -Yonik lucenerevolution.com - Lucene/Solr Open Source Search Conference. Boston May 7-10
Re: Does Solr fit my needs?
You could use SQL Server and External Fields in Solr to get what you need from the database on result of the query. Bill Bell Sent from mobile On Apr 27, 2012, at 8:31 AM, G.Long jde...@gmail.com wrote: Hi there :) I'm looking for a way to save xml files into some sort of database and i'm wondering if Solr would fit my needs. The xml files I want to save have a lot of child nodes which also contain child nodes with multiple values. The depth level can be more than 10. After having indexed the files, I would like to be able to query for subparts of those xml files and be able to reconstruct them as xml files with all their children included. However, I'm wondering if it is possible with an index like solr lucene to keep or easily recover the structure of my xml data? Thanks for your help, Regards, Gary
Question concerning date fields
We are loading a long (number of seconds since 1970?) value into Solr using java and Solrj. What is the best way to convert this into the right Solr date fields? Sent from my Mobile device 720-256-8076
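A minimal SolrJ sketch, assuming the long really is seconds (not milliseconds) since the Unix epoch and that the schema has a date field (created_at is a made-up name here): multiply by 1000 and hand SolrJ a java.util.Date, which it serializes to the UTC ISO-8601 form Solr date fields expect.

import java.util.Date;
import org.apache.solr.common.SolrInputDocument;

public class EpochSecondsToSolrDate {
    public static void main(String[] args) {
        long epochSeconds = 1335571200L;            // example input value
        Date date = new Date(epochSeconds * 1000L); // java.util.Date wants milliseconds
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("created_at", date);           // sent as an ISO-8601 UTC timestamp
        System.out.println(doc);
    }
}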
Re: ExtractingRequestHandler
I have had good luck with creating a separate core index for just data. This is a different core than the indexed core. Very fast. Bill Bell Sent from mobile On Apr 1, 2012, at 11:15 AM, Erick Erickson erickerick...@gmail.com wrote: Yes, you can. but Generally, storing the raw input in Solr is not the best approach. The problem here is that pretty soon you get a huge index that contains *everything*. Solr was not intended to be a data store. Besides, you then need to store the binary form of the file. Solr only deals with text, not markup. Most people index the text in Solr, and enough information so the application knows where to go to fetch the original document when the user drills down (e.g. file path, database PK, etc). Would that work for your situation? Best Erick On Sat, Mar 31, 2012 at 3:55 PM, spr...@gmx.eu wrote: Hi, I want to index various filetypes in solr, this can easily done with ExtractingRequestHandler. But I also need the extracted content back. I know ext.extract.only but then nothing gets indexed, right? Can I index the document AND get the content back as with ext.extract.only? In a single request? Thank you
Re: Empty facet counts
Send your schema.xml. Did you apply any patches? What version of Solr? Bill Bell Sent from mobile On Mar 29, 2012, at 5:26 AM, Youri Westerman yo...@pluxcustoms.nl wrote: Hi, I'm currently learning how to use solr and everything seems pretty straight forward. For some reason when I use faceted queries it returns only empty sets in the facet_count section. The get params I'm using are: ?q=*:*&rows=0&facet=true&facet.field=urn The result: facet_counts: { facet_queries: { }, facet_fields: { }, facet_dates: { }, facet_ranges: { } } The urn field is indexed and there are enough entries to be counted. When adding facet.method=Enum, nothing changes. Does anyone know why this is happening? Am I missing something? Thanks in advance! Youri
Re: DataImportHandler: backups prior to full-import
You could use the Solr Command Utility (SCU), which runs from Windows and can be scheduled to run. https://github.com/justengland/Solr-Command-Utility This is a Windows utility that will index using a core, and swap it if it succeeds. It works with Solr. Let me know if you have any questions. On Mar 28, 2012, at 10:11 PM, Shawn Heisey s...@elyograg.org wrote: On 3/28/2012 12:46 PM, Artem Shnayder wrote: Does anyone know of any work done to automatically run a backup prior to a DataImportHandler full-import? I've asked this question on #solr and was pointed to https://wiki.apache.org/solr/SolrReplication?highlight=%28backup%29#HTTP_API which is helpful but is not an automatic backup in the context of full-imports. I'm wondering if anyone else has done this work yet. I have located a previous message from you where you mention that you are on Ubuntu. If that's true, you can use hard links to make nearly instantaneous backups with a single command: ln /path/to/index/* /path/to/backup/. One caveat to that - the backup must be on the same filesystem as the index. If keeping backups on another filesystem (or even another computer) is important, then treat the hard link backup as a temporary directory. Copy the files from that directory to your remote location, then delete them. This works because of the way that Lucene (and by extension Solr) manages files on disk - existing segment files are never modified. If they get merged, new files are created before the old ones are deleted. There is only one file in an index directory that does change without getting a new name - segments.gen. I have verified (on Solr 3.5) that even this file is properly handled so that a hard link backup keeps the correct version. For people running on Windows, this particular method won't work. Newer Windows server versions do have one feature that might actually make it possible to do something similar - shadow copies. I do not know how to leverage the feature, though. Thanks, Shawn
Re: Performance Question
The size of the index does matter, practically speaking. Bill Bell Sent from mobile On Mar 19, 2012, at 11:41 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Exactly. That's what I mean. On Mon, Mar 19, 2012 at 6:15 PM, Jamie Johnson jej2...@gmail.com wrote: Mikhail, Thanks for the response. Just to be clear, you're saying that the size of the index does not matter, it's more the size of the results? On Fri, Mar 16, 2012 at 2:43 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello, Frankly speaking, the computational complexity of Lucene search depends on the size of the search result: numFound*log(start+rows), but not on the size of the index. Regards On Fri, Mar 16, 2012 at 9:34 PM, Jamie Johnson jej2...@gmail.com wrote: I'm curious if anyone can tell me how Solr/Lucene performs in a situation where you have 100,000 documents each with 100 tokens vs having 1,000,000 documents each with 10 tokens. Should I expect the performance to be the same? Any information would be greatly appreciated. -- Sincerely yours Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- Sincerely yours Mikhail Khludnev Lucid Certified Apache Lucene/Solr Developer Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Solr core swap after rebuild in HA-setup / High-traffic
DIH sets the time of update to the start time, not the end time. So when the index is rebuilt, if you run a delta and use the update time you should be okay. We normally go back a few minutes to make sure we have everything, as a fail-safe. Sent from my Mobile device 720-256-8076 On Mar 14, 2012, at 12:58 PM, KeesSchepers k...@keesschepers.nl wrote: Hello everybody, I am designing a new Solr architecture for one of my clients. This Solr architecture is for a high-traffic website with millions of visitors, but I am facing some design problems where I hope you guys could help me out. In my situation there are 4 Solr servers running, 1 server is master and 3 are slave. They are running Solr version 1.4. I use two cores 'live' and 'rebuild' and I use Solr DIH to rebuild a core, which goes like this: 1. I wipe the rebuild core 2. I run the DIH over the complete dataset (4 million documents) in pieces of 20.000 records (to prevent very long mysql locks) 3. After the DIH is finished (2 hours) we also have to update the rebuild core with changes from the last two hours, this is a problem 4. After updating is done and the core is no more than some seconds behind, we want to SWAP the cores. Everything goes well except for step 3. The rebuild and the core swap are all okay. Because the website is undergoing changes every minute, we cannot pause the delta-import on the live core and fall behind for 2 hours. The problem is that I can't figure out a clean system that does not delay the live core too long and uses the DIH instead of writing a lot of code. Did anyone face this problem before or could give me some tips? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-core-swap-after-rebuild-in-HA-setup-High-traffic-tp3826461p3826461.html Sent from the Solr - User mailing list archive at Nabble.com.
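For reference, the catch-up delta in step 3 runs against the rebuild core's DIH handler, and the swap in step 4 maps onto the CoreAdmin API; a sketch with the thread's core names (host and handler path are assumptions, /dataimport being the usual DIH path):

http://localhost:8983/solr/rebuild/dataimport?command=delta-import
http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=rebuild

After the SWAP, the freshly built index serves under the live name and the old one becomes rebuild for the next cycle.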
Re: Dynamically Load Query Time Synonym File
It would depend. If the synonyms are used on indexing, you need to reindex. Otherwise, you could reload and use the synonyms on query. On 2/26/12 4:05 AM, Ahmet Arslan iori...@yahoo.com wrote: Is there a way to dynamically load a synonym file without restarting solr core ? There is an open jira for this : https://issues.apache.org/jira/browse/SOLR-1307
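For the query-time case, the synonym file is re-read when the core is reloaded, so a full restart isn't needed; a sketch using the CoreAdmin API (host and core name are assumptions):

http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0

Index-time synonyms still require reindexing after the reload, since already-indexed tokens don't change.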
Debugging on 3.5
I did find a solution, but the output is horrible. Why does the explain output look so bad?

<lst name="explain">
  <str name="2H7DF">
    6.351252 = (MATCH) boost(*:*,query(specialties_ids: #1;#0;#0;#0;#0;#0;#0;#0;#0; ,def=0.0)), product of:
      1.0 = (MATCH) MatchAllDocsQuery, product of:
        1.0 = queryNorm
      6.351252 = query(specialties_ids: #1;#0;#0;#0;#0;#0;#0;#0;#0; ,def=0.0)=6.351252
  </str>
</lst>

defType=edismax&boost=query($param)&param=multi_field:87 -- We like the boost parameter in SOLR 3.5 with eDismax. The question we have is that we would like to replace bq with boost, but we get the multi-valued field issue when we try to do this. Bill Bell Sent from mobile
Mmap
Does someone have an example of using unmap in 3.5 and chunksize? I am using Solr 3.5. I noticed in solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

I don't see this parameter taking effect when I set -Dsolr.directoryFactory=solr.MMapDirectoryFactory. How do I see the setting in the log or in stats.jsp? I cannot find a place that indicates whether it is set or not. I would assume StandardDirectoryFactory is being used, but I see the same thing whether I set it or not. Bill Bell Sent from mobile
Re: Improving performance for SOLR geo queries?
Can we get this back ported to 3x? Bill Bell Sent from mobile On Feb 14, 2012, at 3:45 AM, Matthias Käppler matth...@qype.com wrote: hey thanks all for the suggestions, didn't have time to look into them yet as we're feature-sprinting for MWC, but will report back with some feedback over the next weeks (we will have a few more performance sprints in March) Best, Matthias On Mon, Feb 13, 2012 at 2:32 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Feb 9, 2012 at 1:46 PM, Yonik Seeley yo...@lucidimagination.com wrote: One way to speed up numeric range queries (at the cost of increased index size) is to lower the precisionStep. You could try changing this from 8 to 4 and then re-indexing to see how that affects your query speed. Your issue, and the fact that I had been looking at the post-filtering code again for another client, reminded me that I had been planning on implementing post-filtering for spatial. It's now checked into trunk. If you have the ability to use trunk, you can add a high cost (like cost=200) along with cache=false to trigger it. More details here: http://www.lucidimagination.com/blog/2012/02/10/advanced-filter-caching-in-solr/ -Yonik lucidimagination.com -- Matthias Käppler Lead Developer API Mobile Qype GmbH Großer Burstah 50-52 20457 Hamburg Telephone: +49 (0)40 - 219 019 2 - 160 Skype: m_kaeppler Email: matth...@qype.com Managing Director: Ian Brotherston Amtsgericht Hamburg HRB 95913 This e-mail and its attachments may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail and its attachments. Any unauthorized copying, disclosure or distribution of this e-mail and its attachments is strictly forbidden. This notice also applies to future messages.
Help with MMapDirectoryFactory in 3.5
I am using Solr 3.5. I noticed in solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

I don't see this parameter taking effect when I set -Dsolr.directoryFactory=solr.MMapDirectoryFactory. How do I see the setting in the log or in stats.jsp? I cannot find a place that indicates whether it is set or not. I would assume StandardDirectoryFactory is being used, but I do see (whether I set it or NOT):

name: searcher
class: org.apache.solr.search.SolrIndexSearcher
version: 1.0
description: index searcher
stats: searcherName : Searcher@71fc3828 main
caching : true
numDocs : 2121163
maxDoc : 2121163
reader : SolrIndexReader{this=1867ec28,r=ReadOnlyDirectoryReader@1867ec28,refCnt=1,segments=1}
readerDir : org.apache.lucene.store.MMapDirectory@C:\solr\jetty\example\solr\providersearch\data\index lockFactory=org.apache.lucene.store.NativeFSLockFactory@45c1cfc1
indexVersion : 1324594650551
openedAt : Sat Feb 11 09:49:31 MST 2012
registeredAt : Sat Feb 11 09:49:31 MST 2012
warmupTime : 0

Also, how do I set unmap and what is the purpose of chunk size?
Re: Help with MMapDirectoryFactory in 3.5
Also, does someone have an example of using unmap in 3.5 and chunksize? From: Bill Bell billnb...@gmail.com Date: Sat, 11 Feb 2012 10:39:56 -0700 To: solr-user@lucene.apache.org Subject: Help with MMapDirectoryFactory in 3.5 I am using Solr 3.5. I noticed in solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

I don't see this parameter taking effect when I set -Dsolr.directoryFactory=solr.MMapDirectoryFactory. How do I see the setting in the log or in stats.jsp? I cannot find a place that indicates whether it is set or not. I would assume StandardDirectoryFactory is being used, but I do see (whether I set it or NOT):

name: searcher
class: org.apache.solr.search.SolrIndexSearcher
version: 1.0
description: index searcher
stats: searcherName : Searcher@71fc3828 main
caching : true
numDocs : 2121163
maxDoc : 2121163
reader : SolrIndexReader{this=1867ec28,r=ReadOnlyDirectoryReader@1867ec28,refCnt=1,segments=1}
readerDir : org.apache.lucene.store.MMapDirectory@C:\solr\jetty\example\solr\providersearch\data\index lockFactory=org.apache.lucene.store.NativeFSLockFactory@45c1cfc1
indexVersion : 1324594650551
openedAt : Sat Feb 11 09:49:31 MST 2012
registeredAt : Sat Feb 11 09:49:31 MST 2012
warmupTime : 0

Also, how do I set unmap and what is the purpose of chunk size?
boost question. need boost to take a query like bq
We like the boost parameter in SOLR 3.5 with eDismax. The question we have is that we would like to replace bq with boost, but we get the multi-valued field issue when we try to do the equivalent queries: HTTP ERROR 400. Problem accessing /solr/providersearch/select. Reason: can not use FieldCache on multivalued field: specialties_ids q=*:*&bq=multi_field:87^2&defType=dismax How do you do this using boost? q=*:*&boost=multi_field:87&defType=edismax We know we can use bq with edismax, but we like the multiply feature of boost. If I change it to a single-valued field I get results, but they are all 1.0. <str name="YFFL5"> 1.0 = (MATCH) MatchAllDocsQuery, product of: 1.0 = queryNorm </str> q=*:*&boost=single_field:87&defType=edismax // this works, but I need it on multivalued
FW: boost question. need boost to take a query like bq
I did find a solution, but the output is horrible. Why does the explain output look so bad?

<lst name="explain">
  <str name="2H7DF">
    6.351252 = (MATCH) boost(*:*,query(specialties_ids: #1;#0;#0;#0;#0;#0;#0;#0;#0; ,def=0.0)), product of:
      1.0 = (MATCH) MatchAllDocsQuery, product of:
        1.0 = queryNorm
      6.351252 = query(specialties_ids: #1;#0;#0;#0;#0;#0;#0;#0;#0; ,def=0.0)=6.351252
  </str>
</lst>

defType=edismax&boost=query($param)&param=multi_field:87 -- We like the boost parameter in SOLR 3.5 with eDismax. The question we have is that we would like to replace bq with boost, but we get the multi-valued field issue when we try to do the equivalent queries: HTTP ERROR 400. Problem accessing /solr/providersearch/select. Reason: can not use FieldCache on multivalued field: specialties_ids q=*:*&bq=multi_field:87^2&defType=dismax How do you do this using boost? q=*:*&boost=multi_field:87&defType=edismax We know we can use bq with edismax, but we like the multiply feature of boost. If I change it to a single-valued field I get results, but they are all 1.0. <str name="YFFL5"> 1.0 = (MATCH) MatchAllDocsQuery, product of: 1.0 = queryNorm </str> q=*:*&boost=single_field:87&defType=edismax // this works, but I need it on multivalued
Re: Performance issue: Frange with geodist()
I added a Jira issue for this: https://issues.apache.org/jira/browse/SOLR-2840 On 10/13/11 8:15 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Oct 13, 2011 at 9:55 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: is it possible with geofilt and facet.query? facet.query={!geofilt pt=45.15,-93.85 sfield=store d=5} Yes, that should be both possible and faster... something along the lines of: sfield=store&pt=45.15,-93.85 facet.query={!geofilt d=10 key=d10} facet.query={!geofilt d=20 key=d20} facet.query={!geofilt d=50 key=d50} Eventually we should implement range faceting over functions and also add a max distance you care about to the geodist function. -Yonik http://www.lucene-eurocon.com - The Lucene/Solr User Conference On Thu, Oct 13, 2011 at 4:20 PM, roySolr royrutten1...@gmail.com wrote: I don't want to use some basic facets. When the user doesn't get any results I want to search in the radius of his search location. Example: apple store in Manchester gives no result. I want this: Click here to see 2 results in a radius of 10km. Click here to see 11 results in a radius of 50km. Click here to see 19 results in a radius of 100km. With geodist() and facet.query this is possible but the performance isn't very good.. -- View this message in context: http://lucene.472066.n3.nabble.com/Performance-issue-Frange-with-geodist-tp3417962p3418429.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail (Mike) Khludnev Developer Grid Dynamics tel. 1-415-738-8644 Skype: mkhludnev http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Scoring of DisMax in Solr
This seems like a bug to me. On 10/4/11 6:52 PM, David Ryan help...@gmail.com wrote: Hi, When I examine the score calculation of DisMax in Solr, it looks to me that DisMax is using tf x idf^2 instead of tf x idf. Does anyone have insight into why tf x idf is not used here? Here is the score contribution from one field: score(q,c) = queryWeight x fieldWeight = tf x idf x idf x queryNorm x fieldNorm Here is the example that I used to derive the formula above. Clearly, idf is multiplied twice in the score calculation. * http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&indent=on&debugQuery=true&fl=id,score * <str name="6H500F0"> 0.18314168 = (MATCH) sum of: 0.18314168 = (MATCH) weight(text:gb in 1), product of: 0.35845062 = queryWeight(text:gb), product of: 2.3121865 = idf(docFreq=6, numDocs=26) 0.15502669 = queryNorm 0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of: 1.4142135 = tf(termFreq(text:gb)=2) 2.3121865 = idf(docFreq=6, numDocs=26) 0.15625 = fieldNorm(field=text, doc=1) </str> Thanks!
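Working through the numbers in that debug output confirms the observation (and this is standard Lucene TF-IDF scoring rather than anything DisMax-specific: idf enters once in the query weight and once in the field weight):

queryWeight = idf x queryNorm = 2.3121865 x 0.15502669 = 0.35845062
fieldWeight = tf x idf x fieldNorm = 1.4142135 x 2.3121865 x 0.15625 = 0.5109258
score = queryWeight x fieldWeight = 0.35845062 x 0.5109258 = 0.18314168

so idf appears in both factors and the single-term score is proportional to tf x idf^2.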
Re: Solr stopword problem in Query
This is a pretty serious issue. Bill Bell Sent from mobile On Sep 26, 2011, at 4:09 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, I have a text field named *textForQuery*. The following content has been indexed into solr in the field textForQuery: *Coke Studio at MTV* When I fired the query *textForQuery:(coke studio at mtv)* the results showed 0 documents. After running the same query in debugMode I got the following results: <result name="response" numFound="0" start="0"/> <lst name="debug"> <str name="rawquerystring">textForQuery:(coke studio at mtv)</str> <str name="querystring">textForQuery:(coke studio at mtv)</str> <str name="parsedquery">PhraseQuery(textForQuery:"coke studio ? mtv")</str> <str name="parsedquery_toString">textForQuery:"coke studio ? mtv"</str> Why did the query not match any document even when there is a document with the value of textForQuery as *Coke Studio at MTV*? Is this because of the stopword *at* present in the stopword list? -- Thanks & Regards, Isan Fulia.
Re: indexing a xml file
Send us the example solr.xml and schema.xml. You are missing fields in the schema.xml that you are referencing. On 9/24/11 8:15 AM, ahmad ajiloo ahmad.aji...@gmail.com wrote: hello The Solr Tutorial page explains how to index a xml file, but when I try to index a xml file with this command: ~/Desktop/apache-solr-3.3.0/example/exampledocs$ java -jar post.jar solr.xml I get this error: SimplePostTool: FATAL: Solr returned an error #400 ERROR:unknown field 'name' can anyone help me? thanks
Best Solr escaping?
What is the best algorithm for escaping strings before sending to Solr? Does someone have some code? A few things I have witnessed in q using the DIH handler: * Double quotes that are not balanced can cause several issues, from an error (strip the double quote?) to no results. * Should we use + or %20, and what cases make sense: Dr. Phil Smith or Dr.+Phil+Smith or Dr.%20Phil%20Smith - also what is the impact of double quotes? * Unmatched parentheses, i.e. opening ( and not closing: (Dr. Holstein or Cardiologist+(Dr. Holstein Regular encoding of strings does not always work for the whole string due to several issues like white space: white space works better when we use a backslash escape (Bill\ Bell), especially when using facets. Thoughts? Code? Ideas? Better Wikis?
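On the code question: SolrJ ships an escaper for Solr/Lucene query syntax, and percent-encoding for transport is a separate second step when building the request by hand; a sketch (the input string comes from the examples above):

import java.net.URLEncoder;
import org.apache.solr.client.solrj.util.ClientUtils;

public class EscapeForSolr {
    public static void main(String[] args) throws Exception {
        String userInput = "Cardiologist (Dr. Holstein";
        // Escapes query specials such as + - ! ( ) { } [ ] ^ " ~ * ? : \
        String escaped = ClientUtils.escapeQueryChars(userInput);
        // Then URL-encode when not letting SolrJ send the query for you
        String urlReady = URLEncoder.encode(escaped, "UTF-8");
        System.out.println(escaped);
        System.out.println(urlReady);
    }
}

This handles the unbalanced quote and parenthesis cases by neutralizing the characters rather than stripping them.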
Re: Search query doesn't work in solr/browse pnnel
Yes. It appears that cannot be encoded in the URL, or there are really bad results. For example we get an error on the first request, but if we refresh it goes away. On 9/23/11 2:57 PM, hadi md.anb...@gmail.com wrote: When I create a query like something&fl=content in solr/browse, the & and = in the URL are converted to %26 and %3D and no results occur, but it works in solr/admin advanced search and also in the URL bar directly. How can I solve this problem? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Search-query-doesn-t-work-in-solr-browse-pnnel-tp3363032p3363032.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Distinct elements in a field
SOLR-2242 can do it. On 9/16/11 2:15 AM, swiss knife swiss_kn...@email.com wrote: I could get this number by using group.ngroups=true&group.limit=0 but doing grouping for this seems like overkill. Would you advise using JIRA SOLR-1814? - Original Message - From: swiss knife Sent: 09/15/11 12:43 PM To: solr-user@lucene.apache.org Subject: Distinct elements in a field Simple question: I want to know how many distinct elements I have in a field, where these also verify a query. Do you know if there's a way to do it today in 3.4? I saw SOLR-1814 and SOLR-2242. SOLR-1814 seems fairly easy to use. What do you think? Thank you
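For reference, the grouping variant mentioned above looks like this as a full request (the field name is illustrative); ngroups in the response is the distinct-value count among documents matching the query:

q=*:*&group=true&group.field=myfield&group.ngroups=true&group.limit=0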
Re: Re; DIH Scheduling
You can easily use cron with curl to do what you want to do. On 9/12/11 2:47 PM, Pulkit Singhal pulkitsing...@gmail.com wrote: I don't see anywhere in: http://issues.apache.org/jira/browse/SOLR-2305 any statement that shows the code's inclusion was decided against when did this happen and what is needed from the community before someone with the powers to do so will actually commit this? 2011/6/24 Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com On Thu, Jun 23, 2011 at 9:13 PM, simon mtnes...@gmail.com wrote: The Wiki page describes a design for a scheduler, which has not been committed to Solr yet (I checked). I did see a patch the other day (see https://issues.apache.org/jira/browse/SOLR-2305) but it didn't look well tested. I think that you're basically stuck with something like cron at this time. If your application is written in java, take a look at the Quartz scheduler - http://www.quartz-scheduler.org/ It was considered and decided against. -Simon -- - Noble Paul
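A sketch of the cron-plus-curl approach (host, core name, and schedule are assumptions; /dataimport is the usual DIH handler path):

# crontab entry: full import every night at 02:00
0 2 * * * curl -s "http://localhost:8983/solr/mycore/dataimport?command=full-import" > /dev/null

command=delta-import on the same handler covers the incremental case.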
Re: pagination with grouping
There are 2 use cases: 1. rows=10 means 10 groups. 2. rows=10 means 10 results (regardless of groups). I thought there was a total number of groups (ngroups) for case #1. I don't believe case #2 has been coded. On 9/8/11 2:22 PM, alx...@aim.com alx...@aim.com wrote: Hello, When trying to implement pagination as in the case without grouping, I see two issues. 1. with rows=10 the solr feed displays 10 groups, not 10 results 2. there is no total number of results with grouping, to show the last page. In detail: 1. I need to display only 10 results on one page. For example if I have group.limit=5 and the first group has 5 docs, the second 3 and the third 2, then only these 3 groups must be displayed on the first page. Currently specifying rows=10 shows 10 groups, and if we have 5 docs in each group then on the first page we will have 50 docs. 2. I need to show the last page, for which I need the total number of results with grouping. For example if I have 5 groups with numbers of docs 5, 4, 3, 2, 1 then this total number must be 15. Any ideas how to achieve this? Thanks in advance. Alex.
Re: Terms.regex performance issue
We do something like: http://localhost:8983/solr/provs/terms?terms.fl=payor&terms.regex.flag=case_insensitive&terms.regex=%28.*%29WHAT USER TYPES%28.*%29&terms.limit=-1 We want not just prefix but anywhere in the terms. On 8/19/11 5:21 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Subject: Terms.regex performance issue : : As I want to use it in an Autocomplete it has to be fast. Terms.prefix gets : results in around 100 milliseconds, while terms.regex is 10 to 20 times : slower. can you elaborate on how you are using terms.regex? what does your regex look like? .. particularly if your usecase is autocomplete terms.prefix seems like an odd choice. Possible XY Problem? https://people.apache.org/~hossman/#xyproblem Have you looked at using the Suggester plugin? https://wiki.apache.org/solr/Suggester -Hoss
Re: hierarchical faceting in Solr?
Naomi, Just create a login and update it!! On 8/22/11 12:27 PM, Erick Erickson erickerick...@gmail.com wrote: Try searching the Solr user's list for hierarchical, this topic has been discussed numerous times. It would be great if you could collate the various solutions and update the wiki, all you have to do is create a login... Best Erick On Mon, Aug 22, 2011 at 1:49 PM, Naomi Dushay ndus...@stanford.edu wrote: Chris, Is there a document somewhere on how to do this? If not, might you create one? I could even imagine such a document living on the Solr wiki ... this one has mostly ancient content: http://wiki.apache.org/solr/HierarchicalFaceting - Naomi
Re: copyField for big indexes
It depends. copyField may be good if you want to copy into a soundex field, and then boost the soundex field lower than the tokenized field. What are you trying to do? On 8/22/11 11:14 AM, Tom springmeth...@gmail.com wrote: Is it a good rule of thumb that, when dealing with large indexes, copyField should not be used? It seems to duplicate the indexing of data. You don't need copyField to be able to search on multiple fields. Example: if I have two fields, title and post, and I want to search on both, I could just query title:word OR post:word So it seems to me if you have lots of data and large indexes, copyField should be avoided. Any thoughts? -- View this message in context: http://lucene.472066.n3.nabble.com/copyField-for-big-indexes-tp3275712p3275712.html Sent from the Solr - User mailing list archive at Nabble.com.
Boost or BQ?
What is the difference between boost= and bq=? I cannot find any documentation.
Score
How do I change the score to scale it between 0 and 100, regardless of the raw score? q.alt=*:*&bq=lang:Spanish&defType=dismax Bill Bell Sent from mobile
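Lucene scores aren't normalized, so there is no query parameter for this; a common client-side approach (a sketch, not a Solr feature) divides each score on the page by maxScore:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class ScaledScores {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery();
        q.setParam("q.alt", "*:*");
        q.setParam("bq", "lang:Spanish");
        q.setParam("defType", "dismax");
        q.setFields("id", "score");          // score must be requested explicitly
        QueryResponse rsp = server.query(q);
        SolrDocumentList docs = rsp.getResults();
        float max = docs.getMaxScore();      // populated when score is in fl
        for (SolrDocument doc : docs) {
            float score = (Float) doc.getFieldValue("score");
            System.out.println(doc.getFieldValue("id") + " -> " + (100.0f * score / max));
        }
    }
}

Note the caveat: this rescales per result page, so a 100 on one query is not comparable to a 100 on another.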
Loggly support
How do you setup log4j to work with Loggly for SOLR logs? Anyone have this set up? Bill
Re: exceeded limit of maxWarmingSearchers ERROR
OK, I'll ask the elephant in the room. What is the difference between the new UpdateHandler from Mark and SOLR-RA? The UpdateHandler works with 4.0; does SOLR-RA work with 4.0 trunk? Pros/Cons? On 8/14/11 8:10 PM, Nagendra Nagarajayya nnagaraja...@transaxtions.com wrote: Naveen: NRT with Apache Solr 3.3 and RankingAlgorithm does not need a commit for a document to become searchable. Any document that you add through update becomes immediately searchable. So no need to commit from within your update client code. Since there is no commit, the cache does not have to be cleared or the old searchers closed or new searchers opened and warmed (the error that you are facing). Regards - Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org On 8/14/2011 10:37 AM, Naveen Gupta wrote: Hi Mark/Erick/Nagendra, I was not very confident about NRT at that point of time, when we started the project almost 1 year ago; definitely I would try NRT and see the performance. The current requirement was working fine till we were using commitWithin 10 millisecs in the XMLDocument which we were posting to SOLR. But due to that, we were getting very poor performance (almost 3 mins for 15,000 docs) per user. There are many parallel users committing to our SOLR. So we removed the commitWithin, and hence performance was much much better. But then we are getting this maxWarmingSearcher error, because we are committing separately as a curl request once the entire doc is submitted for indexing. The question here is what is the difference between commitWithin and commit (apart from the fact that commit takes memory and processes and additional hardware usage)? Why we want it to be visible as soon as possible: we are applying many business rules on top of the results (older indexes as well as new ones) and applying different filters. Up to 5 mins is fine for us, but more than that and we need to think about other optimizations. We will definitely try NRT. But please tell me other options which we can apply in order to optimize? Thanks Naveen On Sun, Aug 14, 2011 at 9:42 PM, Erick Erickson erickerick...@gmail.com wrote: Ah, thanks, Mark... I must have been looking at the wrong JIRAs. Erick On Sun, Aug 14, 2011 at 10:02 AM, Mark Miller markrmil...@gmail.com wrote: On Aug 14, 2011, at 9:03 AM, Erick Erickson wrote: You either have to go to near real time (NRT), which is under development, but not committed to trunk yet NRT support is committed to trunk. - Mark Miller lucidimagination.com
Re: Cache replication
OK. But SOLR has built-in caching. Do you not like the caching? What do you think we should change in the SOLR cache? Bill On 8/10/11 9:16 AM, didier deshommes dfdes...@gmail.com wrote: Consider putting a cache (memcached, redis, etc) *in front* of your solr slaves. Just make sure to update it when replication occurs. didier On Tue, Aug 9, 2011 at 6:07 PM, arian487 akarb...@tagged.com wrote: I'm wondering if the caches on all the slaves are replicated across (such as queryResultCache). That is to say, if I hit one of my slaves and cache a result, and I make a search later and that search happens to hit a different slave, will that first cached result be available for use? This is pretty important because I'm going to have a lot of slaves and if this isn't done, then I'd have a high chance of running a lot of uncached queries. Thanks :) -- View this message in context: http://lucene.472066.n3.nabble.com/Cache-replication-tp3240708p3240708.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: exceeded limit of maxWarmingSearchers ERROR
I understand. Have you looked at Mark's patch? From his performance tests, it looks pretty good. When would RA work better? Bill On 8/14/11 8:40 PM, Nagendra Nagarajayya nnagaraja...@transaxtions.com wrote: Bill: The technical details of the NRT implementation in Apache Solr with RankingAlgorithm (SOLR-RA) are available here: http://solr-ra.tgels.com/papers/NRT_Solr_RankingAlgorithm.pdf (Some changes for Solr 3.x, but for the most part it is as above) Regarding support for 4.0 trunk, that should happen sometime soon. Regards - Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org On 8/14/2011 7:11 PM, Bill Bell wrote: OK, I'll ask the elephant in the room... What is the difference between the new UpdateHandler from Mark and SOLR-RA? The UpdateHandler works with 4.0; does SOLR-RA work with 4.0 trunk? Pros/Cons? On 8/14/11 8:10 PM, Nagendra Nagarajayya nnagaraja...@transaxtions.com wrote: Naveen: NRT with Apache Solr 3.3 and RankingAlgorithm does not need a commit for a document to become searchable. Any document that you add through update becomes immediately searchable. So no need to commit from within your update client code. Since there is no commit, the cache does not have to be cleared or the old searchers closed or new searchers opened and warmed (the error that you are facing). Regards - Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org On 8/14/2011 10:37 AM, Naveen Gupta wrote: Hi Mark/Erick/Nagendra, I was not very confident about NRT at that point of time, when we started the project almost 1 year ago; definitely I would try NRT and see the performance. The current requirement was working fine till we were using commitWithin 10 millisecs in the XMLDocument which we were posting to SOLR. But due to that, we were getting very poor performance (almost 3 mins for 15,000 docs) per user. There are many parallel users committing to our SOLR. So we removed the commitWithin, and hence performance was much much better. But then we are getting this maxWarmingSearcher error, because we are committing separately as a curl request once the entire doc is submitted for indexing. The question here is what is the difference between commitWithin and commit (apart from the fact that commit takes memory and processes and additional hardware usage)? Why we want it to be visible as soon as possible: we are applying many business rules on top of the results (older indexes as well as new ones) and applying different filters. Up to 5 mins is fine for us, but more than that and we need to think about other optimizations. We will definitely try NRT. But please tell me other options which we can apply in order to optimize? Thanks Naveen On Sun, Aug 14, 2011 at 9:42 PM, Erick Erickson erickerick...@gmail.com wrote: Ah, thanks, Mark... I must have been looking at the wrong JIRAs. Erick On Sun, Aug 14, 2011 at 10:02 AM, Mark Miller markrmil...@gmail.com wrote: On Aug 14, 2011, at 9:03 AM, Erick Erickson wrote: You either have to go to near real time (NRT), which is under development, but not committed to trunk yet NRT support is committed to trunk. - Mark Miller lucidimagination.com
Re: SOLR 3.3.0 multivalued field sort problem
I have a different use case. Consider a spatial multivalued field with latlong values for addresses. I would want sort by geodist() to return the closest distance in each group. For example find me the closest restaurant which each doc being a chain name like pizza hut. Or doctors with multiple offices. Bill Bell Sent from mobile On Aug 13, 2011, at 12:31 PM, Martijn v Groningen martijn.is.h...@gmail.com wrote: The first solution would make sense to me. Some kind of a strategy mechanism for this would allow anyone to define their own rules. Duplicating results would be confusing to me. On 13 August 2011 18:39, Michael Lackhoff mich...@lackhoff.de wrote: On 13.08.2011 18:03 Erick Erickson wrote: The problem I've always had is that I don't quite know what sorting on multivalued fields means. If your field had tokens a and z, would sorting on that field put the doc at the beginning or end of the list? Sure, you can define rules (first token, last token, average of all tokens (whatever that means)), but each solution would be wrong sometime, somewhere, and/or completely useless. Of course it would need rules but I think it wouldn't be too hard to find rules that are at least far better than the current situation. My wish would include an option that decides if the field can be used just once or every value on its own. If the option is set to FALSE, only the first value would be used, if it is TRUE, every value of the field would get its place in the result list. so, if we have e.g. record1: ccc and bbb record2: aaa and zzz it would be either record2 (aaa) record1 (ccc) or record2 (aaa) record1 (bbb) record1 (ccc) record2 (zzz) I find these two outcomes most plausible so I would allow them if technical possible but whatever rule looks more plausible to the experts: some solution is better than no solution. -Michael -- Met vriendelijke groet, Martijn van Groningen
Re: ideas for indexing large amount of pdf docs
You could send PDFs for processing using a queue solution like Amazon SQS. Kick off Amazon instances to process the queue. Once you process with Tika to text, just send the update to Solr. Bill Bell Sent from mobile

On Aug 13, 2011, at 10:13 AM, Erick Erickson erickerick...@gmail.com wrote: Yeah, parsing PDF files can be pretty resource-intensive, so one solution is to offload it somewhere else. You can use the Tika libraries in SolrJ to parse the PDFs on as many clients as you want, transmitting just the results to Solr for indexing. How are all these docs being submitted? Is this some kind of on-the-fly indexing/searching or what? I'm mostly curious what your projected max ingestion rate is... Best Erick

On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova) r...@libnova.es wrote: Hi all, I want to ask about the best way to implement a solution for indexing a large amount of pdf documents, between 10-60 MB each, with 100 to 1000 users connected simultaneously. I currently have 1 core of Solr 3.3.0 and it works fine for a small number of pdf docs, but I'm worried about the moment we enter production. Some possibilities:

i. Clustering. I have no experience with this, so it may be a bad idea to venture into it.
ii. A multicore solution: use some kind of hash to choose one core for each query (exact queries) and thus reduce the size of the individual indexes to consult, or consult all the cores at the same time (complex queries).
iii. Do nothing more and wait for the catastrophe in the response times :P

Can someone with experience help a bit in deciding? Thanks a lot in advance.
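A rough sketch of Erick's suggestion: parse PDFs with Tika on client machines and send only the extracted text to Solr. The paths, field names, and the assumption of Tika plus SolrJ on the classpath are mine:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfIndexer {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            File pdf = new File(args[0]);

            // Extract plain text on the client so Solr never sees the raw PDF.
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            Metadata metadata = new Metadata();
            InputStream in = new FileInputStream(pdf);
            try {
                new AutoDetectParser().parse(in, handler, metadata);
            } finally {
                in.close();
            }

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", pdf.getName());
            doc.addField("title", metadata.get("title"));
            doc.addField("text", handler.toString());
            server.add(doc);
            server.commit();
        }
    }

Run one of these per worker machine (or per SQS consumer, in Bill's setup) and the Solr server only ever handles small text updates.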
Re: Problem with xinclude in solrconfig.xml
What was it? Bill Bell Sent from mobile

On Aug 10, 2011, at 2:21 PM, Way Cool way1.wayc...@gmail.com wrote: Sorry for the spam. I just figured it out. Thanks.

On Wed, Aug 10, 2011 at 2:17 PM, Way Cool way1.wayc...@gmail.com wrote: Hi, guys, based on the document below, I should be able to include a file under the same directory by specifying a relative path via xinclude in solrconfig.xml: http://wiki.apache.org/solr/SolrConfigXml However, I am getting the following error when I use a relative path (an absolute path works fine, though): SEVERE: org.xml.sax.SAXParseException: Error attempting to parse XML file. Any ideas? Thanks, YH
Re: getting result count only
q=*:*&rows=0

On 8/6/11 8:24 PM, Jason Toy jason...@gmail.com wrote: How can I run a query to get the result count only? I only need the count, so I don't need Solr to send me all the results back.
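The same count-only trick via SolrJ, for completeness (a minimal sketch; server URL illustrative):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CountOnly {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(0); // ask for no documents, just the header and the count
            QueryResponse rsp = server.query(query);
            System.out.println("numFound = " + rsp.getResults().getNumFound());
        }
    }

numFound is always the total match count, regardless of rows, so rows=0 skips all the document-fetching work.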
Re: Problem with making Solr query
The string type does no analysis, so the whole title is indexed as a single term. You might need to switch the field type to a tokenized text type. Also make sure your default field is title, or add title:implementation in your search. Bill Bell Sent from mobile

On Aug 5, 2011, at 8:43 AM, dhryvastov dhryvas...@serena.com wrote: Hi - I am new to Solr and Lucene and I started to research their capabilities this week. My current task seems very simple (and I believe it is) but I have an issue. I have successfully indexed an MSSQL database table. The table has probably 20 fields, and I have indexed two of them: id and title. The question is: how can I get all records from this table (I mean their ids) where the word specified in the search appears? I send the following GET request: http://localhost:8983/solr/db/select/?q=implementation The response contains 0 results (numFound=0), but there are at least 5 records among the first 10 which contain this word in the title. My schema.xml contains:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="string" indexed="true" stored="true" required="true"/>
</fields>

What GET request should I make to get the expected results? I feel I have omitted something simple, but it is the second day that I can't find what. Please help. Thanks for your response.
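To make Bill's point concrete: with type="string" the whole title is indexed as one untokenized term, so q=implementation only matches a title that is exactly "implementation". Switching title to a tokenized text type in schema.xml (and reindexing) makes word-level matches work. A small SolrJ check that also qualifies the field explicitly, so the result does not depend on the configured default search field (core name from the thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class TitleSearch {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/db");
            // Qualify the field so the query does not rely on defaultSearchField.
            SolrQuery query = new SolrQuery("title:implementation");
            System.out.println(server.query(query).getResults().getNumFound());
        }
    }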
Re: Spatial Search and Highlighting
I think 4.0 supports fl=geodist()

On 8/1/11 3:47 PM, Ralf Musick ra...@gmx.de wrote: Hi David, so that workaround ("As a temporary workaround for older Solr versions, it's possible to obtain distances by using geodist or geofilt as the only scoring part of the main query") and highlighting do not fit together, right? OK, then I have to calculate the distance on my own. Thank you very much for the information! Best regards, Ralf

Am 01.08.2011 23:30, schrieb Smiley, David W.: Ralf, highlighting (and search relevancy, i.e. the score) is performed on the user query, which must be in the q parameter. In your case, I see you placed your geospatial query there and put your user query into a filter query (fq). You have them reversed. You stated that returning the distance information as described on the wiki didn't work; that's because those instructions are for Solr 4.0 (not released yet), notice the warning symbol. I recommend that you calculate the distance yourself since Solr 4.0 isn't out yet. There is plenty of information on the web on how to calculate the distance between two lat-lon points using the Haversine algorithm. ~ David

On Aug 1, 2011, at 5:00 PM, Ralf Musick wrote: Hi David, an example is: http://localhost:8983/solr/browse?indent=on&hl=on&hl.fl=name,manu&sort=score+asc&sfield=store&json.nl=map&wt=json&rows=10&start=0&q={!func}geodist%28%29&pt=45.17614%2C-93.87341&fq=%28name%20:+%28canon%29%29^8 I have to say I need the calculated distance as a return field (score) in the result list. The pseudo-field solution described here http://wiki.apache.org/solr/SpatialSearch#Returning_the_distance did not work, so I created the query above. Thanks, Ralf

Am 01.08.2011 22:21, schrieb Smiley, David W.: Can you demonstrate the bug against the example data? If so, provide the URL please. ~ David

On Aug 1, 2011, at 4:00 PM, Ralf Musick wrote: Hi, I combined a spatial distance search with a fulltext search as described in http://wiki.apache.org/solr/SpatialSearch#geodist_-_The_distance_function . I'm using Solr 3.3 and that works fine. BUT I want to use highlighting of the fulltext query words, and that does not work. Before Solr 3.3, I used Solr 1.4 with the Spatial Search plugin from JTeam, and that worked fine, also with highlighting. After refactoring because of the API change, I miss the highlighting feature. Is that a known issue? Or what is my mistake / what do I have to do? Example query:

INFO: [organisations] webapp=/solr path=/select params={hl.fragsize=250&sort=score+asc&sfield=store_lat_lon&json.nl=map&hl.fl=name,category_name&wt=json&hl=on&rows=10&fl=id,name,street,city,score,lat,lng&start=0&q={!func}geodist()&pt=52.5600917,13.4222482&fq=((country_name:+(automatisierung))^8+OR+(category_name:+(automatisierung))^10+OR+(sub_category_name:+(automatisierung))^10)} hits=37 status=0 QTime=7

Thanks in advance, Ralf
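Since David recommends computing the distance client-side on pre-4.0 Solr, here is a plain-Java Haversine sketch (the 6371 km Earth radius is the usual approximation; the coordinates are the point from the thread plus an arbitrary second point):

    public class Haversine {
        static final double EARTH_RADIUS_KM = 6371.0;

        // Great-circle distance in kilometers between two lat/lon points.
        static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
            double dLat = Math.toRadians(lat2 - lat1);
            double dLon = Math.toRadians(lon2 - lon1);
            double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                     + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                     * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return EARTH_RADIUS_KM * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        }

        public static void main(String[] args) {
            System.out.println(distanceKm(52.5600917, 13.4222482, 52.520008, 13.404954));
        }
    }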
Re: sort by function
This seems like a bug.

On 8/1/11 7:47 AM, Jamie Johnson jej2...@gmail.com wrote: I've never tried, but could it be sort=sum(field1,field2,field3)%20desc ?

On Mon, Aug 1, 2011 at 9:43 AM, Gastone Penzo gastone.pe...@gmail.com wrote: Hi, I need to order by function, like: sort=sum(field1,field2,field3)+desc but Solr gives me this error: Missing sort order. Why does this happen? I read that it is possible to order by function from version 1.3 (http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function); I use version 1.4. Nobody has an idea? Thanx Gastone
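A note on the likely cause (my reading, not from the thread): sort-by-function arrived in Solr 3.1, as far as I know, so on 1.4 the sort parser only understands plain field names and trips over the function syntax, producing "Missing sort order". On 3.1+ the query from the thread works; a SolrJ sketch:

    import org.apache.solr.client.solrj.SolrQuery;

    public class FunctionSort {
        public static void main(String[] args) {
            SolrQuery query = new SolrQuery("*:*");
            // Works on Solr 3.1+, where sort-by-function was introduced;
            // the 1.4 sort parser only understands plain field names.
            query.setSortField("sum(field1,field2,field3)", SolrQuery.ORDER.desc);
            System.out.println(query); // prints the encoded query string
        }
    }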
Re: How can i find a document by a special id?
Why not just search the two fields? q=*:*&fq=mediacode:AB OR id:123456 You could take the user input and substitute it: q=*:*&fq=mediacode:$input OR id:$input Of course you can also use dismax and wrap with an OR. Bill Bell Sent from mobile

On Jul 20, 2011, at 3:38 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Am 20.07.2011 19:23, schrieb Kyle Lee:
: Is the mediacode always alphabetic, and is the ID always numeric?
:
: No sadly not. We expose our products on too many medias :-).

If I'm understanding you correctly, you're saying even the prefix AB is not special, that there could be any number of prefixes identifying different mediacodes? And the product ids aren't all numeric? Your question seems absurd. I can only assume that I am horribly misunderstanding your situation (which is very easy to do when you only have a single contrived piece of example data to go on). As a general rule, it's not a good idea to think about Solr in the same way as a relational database. But perhaps if you imagine for a moment that your Solr index *was* a (read-only) relational database, with each Solr field corresponding to a column in your DB, and then you described in pseudo-code/SQL how you would go about doing the types of id lookups you want to do, it might give us a better idea of your situation so we can suggest an approach for dealing with it. -Hoss
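Bill's two-field lookup, sketched in SolrJ with the user input escaped so query-syntax characters in an id cannot break the filter (field names are from the thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.util.ClientUtils;

    public class IdOrMediacode {
        static SolrQuery build(String input) {
            String safe = ClientUtils.escapeQueryChars(input);
            SolrQuery query = new SolrQuery("*:*");
            // Match the value against either field; Solr unions the two clauses.
            query.addFilterQuery("mediacode:" + safe + " OR id:" + safe);
            return query;
        }
    }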
Re: Data Import from a Queue
Yes, this is a good reason for using a queue. I have used Amazon SQS this way and it was simple to set up. Bill Bell Sent from mobile

On Jul 20, 2011, at 2:59 AM, Stefan Matheis matheis.ste...@googlemail.com wrote: Brandon, I don't know how they are using it in detail, but part of Chef's architecture is this one: Chef Server -> RabbitMQ -> Chef Solr Indexer -> Solr http://wiki.opscode.com/download/attachments/7274878/chef-server-arch.png Perhaps not exactly what you're looking for, but it may give you an idea? Regards Stefan

Am 19.07.2011 19:04, schrieb Brandon Fish: Let me provide some more details to the question: I was unable to find any example implementations where individual documents (single document per message) are read from a message queue (like ActiveMQ or RabbitMQ) and then added to Solr via SolrJ, an HTTP POST, or another method. Does anyone know of any available examples for this type of import? If no examples exist, what would be a recommended commit strategy for performance? My best guess for this would be to have a queue per core and commit once the queue is empty. Thanks.

On Mon, Jul 18, 2011 at 6:52 PM, Erick Erickson erickerick...@gmail.com wrote: This is a really cryptic problem statement. You might want to review: http://wiki.apache.org/solr/UsingMailingLists Best Erick

On Fri, Jul 15, 2011 at 1:52 PM, Brandon Fish brandon.j.f...@gmail.com wrote: Does anyone know of any existing examples of importing data from a queue into Solr? Thank you.
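A sketch of Brandon's proposed strategy: drain the queue into Solr and commit once the queue goes empty. A BlockingQueue stands in for ActiveMQ/RabbitMQ/SQS here, and all names are illustrative:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class QueueIndexer implements Runnable {
        private final BlockingQueue<SolrInputDocument> queue;
        private final SolrServer server;

        QueueIndexer(BlockingQueue<SolrInputDocument> queue, SolrServer server) {
            this.queue = queue;
            this.server = server;
        }

        public void run() {
            boolean dirty = false; // any uncommitted adds pending?
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    SolrInputDocument doc = queue.poll(5, TimeUnit.SECONDS);
                    if (doc != null) {
                        server.add(doc);
                        dirty = true;
                    } else if (dirty) {
                        // Queue drained: commit once for the whole batch.
                        server.commit();
                        dirty = false;
                    }
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }

The commit-on-empty rule amortizes commits across bursts of messages while still keeping latency bounded by the poll timeout.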
Re: configure dismax requesthandlar for boost a field
Add score to the fl parameter: fl=*,score

On 7/4/11 11:09 PM, Romi romijain3...@gmail.com wrote: I am not returning score for the queries, as I suppose it should be reflected in the search results, meaning a doc having the query string in the description field comes higher than a doc having the query string in the name field. And yes, I restarted Solr after making changes in the configuration. - Thanks and Regards, Romi
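The SolrJ equivalent of Bill's fl=*,score answer (a one-parameter sketch; the query string is illustrative):

    import org.apache.solr.client.solrj.SolrQuery;

    public class ScoreInResults {
        public static void main(String[] args) {
            SolrQuery query = new SolrQuery("implementation");
            query.set("fl", "*,score"); // return all stored fields plus the score
            System.out.println(query);
        }
    }

Seeing the actual score per document makes it much easier to verify whether the dismax boosts are being applied at all.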
Re: Exception when using result grouping and sorting by geodist() with Solr 3.3
Did you add: fq={!geofilt} ??

On 7/3/11 11:14 AM, Thomas Heigl tho...@umschalt.com wrote: Hello, I just tried up(down?)grading our current Solr 4.0 trunk setup to Solr 3.3.0, as result grouping was the only reason for us to stay with the trunk. Everything worked like a charm except for one of our queries, where we group results by the owning user and sort by distance. A simplified example of my query (that still fails) looks like this:

q=*:*&group=true&group.field=user.uniqueId_s&group.main=true&group.format=grouped&sfield=user.location_p&pt=48.20927,16.3728&sort=geodist() asc

The exception thrown is:

Caused by: org.apache.solr.common.SolrException: Unweighted use of sort geodist(latlon(user.location_p),48.20927,16.3728)
    at org.apache.solr.search.function.ValueSource$1.newComparator(ValueSource.java:106)
    at org.apache.lucene.search.SortField.getComparator(SortField.java:413)
    at org.apache.lucene.search.grouping.AbstractFirstPassGroupingCollector.<init>(AbstractFirstPassGroupingCollector.java:81)
    at org.apache.lucene.search.grouping.TermFirstPassGroupingCollector.<init>(TermFirstPassGroupingCollector.java:56)
    at org.apache.solr.search.Grouping$CommandField.createFirstPassCollector(Grouping.java:587)
    at org.apache.solr.search.Grouping.execute(Grouping.java:256)
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:237)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:140)
    ... 39 more

Any ideas how to fix this or work around this error for now? I'd really like to move from the trunk to the stable 3.3.0 release, and this is the only problem currently keeping me from doing so. Cheers, Thomas
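One workaround worth trying here (a sketch, not verified against 3.3): make the distance the score by moving geodist() into q as a function query, then have the grouped query sort on score, which grouping does support. This is the same "geodist as the only scoring part of the main query" trick quoted from the wiki in the highlighting thread above:

    import org.apache.solr.client.solrj.SolrQuery;

    public class GroupByDistance {
        public static void main(String[] args) {
            SolrQuery query = new SolrQuery("{!func}geodist()");
            query.set("sfield", "user.location_p");
            query.set("pt", "48.20927,16.3728");
            query.set("group", "true");
            query.set("group.field", "user.uniqueId_s");
            query.set("group.main", "true");
            // Sort by score, which here *is* the distance, instead of geodist().
            query.setSortField("score", SolrQuery.ORDER.asc);
            System.out.println(query);
        }
    }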
Re: faceting on field with two values
The easiest way is to concat() the fields in SQL and pass the result to indexing as one field, already merged together. Thanks,

On 7/5/11 1:12 AM, elisabeth benoit elisaelisael...@gmail.com wrote: Hello, I have two fields, TOWN and POSTALCODE, and I want to concat those two into one field to do faceting. My two fields are declared as follows:

<field name="TOWN" type="string" indexed="true" stored="true"/>
<field name="POSTALCODE" type="string" indexed="true" stored="true"/>

The concat field is declared as follows:

<field name="TOWN_POSTALCODE" type="string" indexed="true" stored="true" multiValued="true"/>

and I do the copyField as follows:

<copyField source="TOWN" dest="TOWN_POSTALCODE"/>
<copyField source="POSTALCODE" dest="TOWN_POSTALCODE"/>

When I do faceting on the TOWN_POSTALCODE field, I only get answers like:

<lst name="TOWN_POSTALCODE">
  <int name="62200">5</int>
  <int name="62280">5</int>
  <int name="boulogne sur mer">5</int>
  <int name="saint martin boulogne">5</int>
  ...

which means the faceting is done on the TOWN part or the POSTALCODE part of TOWN_POSTALCODE. But I would like to have answers like:

<lst name="TOWN_POSTALCODE">
  <int name="boulogne sur mer 62200">5</int>
  <int name="paris 75016">5</int>
</lst>

Is this possible with Solr? Thanks, Elisabeth
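Bill's suggestion, sketched at the indexing layer rather than in SQL: merge town and postal code into the combined facet field before the document reaches Solr. This replaces the two copyField lines, which can only copy the values as separate facet terms (field names from the thread):

    import org.apache.solr.common.SolrInputDocument;

    public class TownPostalDoc {
        static SolrInputDocument build(String id, String town, String postalCode) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("TOWN", town);
            doc.addField("POSTALCODE", postalCode);
            // One merged value, so the facet shows "boulogne sur mer 62200"
            // instead of the town and the code as separate facet terms.
            doc.addField("TOWN_POSTALCODE", town + " " + postalCode);
            return doc;
        }
    }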
Re: How many fields can SOLR handle?
This is taxonomy/index design... One way is to have a series of fields by category:

TV: tv_size, resolution
Computer: cpu, gpu

Solr can have as many fields as you need, and fields you do not populate in the index are ignored. So if a user picks TV, you pass these to Solr: q=*:*&facet=true&facet.field=tv_size&facet.field=resolution If a user picks Computer, you pass these to Solr: q=*:*&facet=true&facet.field=cpu&facet.field=gpu The other option is to return ALL of the fields faceted, but this is not recommended, since you would certainly have performance issues depending on the number of fields.

On 7/5/11 1:00 AM, roySolr royrutten1...@gmail.com wrote: Hi, I know I can add components to my request handler. In this situation the facets are dependent on their category. So if a user chooses the category TV:

Inch: 32 inch (5), 34 inch (3), 40 inch (1)
Resolution: Full HD (5), HD ready (2)

When a user searches for the category Computer:

CPU: Intel (12), AMD (10)
GPU: Ati (5), Nvidia (2)

So I can't put it in my request handler as a default; every search can have different facets. Do you understand what I mean?
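Bill's per-category facet selection, sketched in SolrJ (the category-to-field mapping is illustrative):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrQuery;

    public class CategoryFacets {
        static final Map<String, String[]> FACETS_BY_CATEGORY = new HashMap<String, String[]>();
        static {
            FACETS_BY_CATEGORY.put("TV", new String[] {"tv_size", "resolution"});
            FACETS_BY_CATEGORY.put("Computer", new String[] {"cpu", "gpu"});
        }

        static SolrQuery build(String category) {
            SolrQuery query = new SolrQuery("*:*");
            String[] fields = FACETS_BY_CATEGORY.get(category);
            if (fields != null) {
                query.setFacet(true);
                query.addFacetField(fields); // facet only on this category's fields
            }
            return query;
        }
    }

Keeping the mapping in the client (or in a per-category request handler) avoids faceting on every field for every request, which is the performance trap Bill warns about.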