Re: Can not find solr core on admin page after setup
Yes, I do. I installed the Solr example instance. Engy.
Re: Can not find solr core on admin page after setup
Hi Engy, Have you copied Solr's war file (e.g. solr-4.5.1.war for the latest Solr distribution) from the Solr distribution into Tomcat's webapps directory (renamed to solr.war)? After putting that file in place and restarting Tomcat, it will create a 'solr' folder under webapps. If you still see no admin page, please check the Tomcat log (catalina.out). Thanks.

On Tue, Oct 29, 2013 at 8:54 PM, engy.morsy engy.mo...@bibalex.org wrote: Hi, I set up Solr 4.2 under Apache Tomcat on a Windows machine. I created solr.xml under catalina/localhost that holds the solr/home path. I have only one core, so the solr.xml under the Solr instance looks like:

    <cores adminPath="/admin/cores" defaultCoreName="core0">
      <core name="core0" instanceDir="core0"/>
    </cores>

After starting the Apache service, I did not find the core on the admin page. I checked the logs but no errors were found. I checked that the data folder was created successfully. I am not even able to access the core directly. Any idea? Thanks, Engy

-- wassalam, [bayu]
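For reference, the Tomcat context descriptor under catalina/localhost that the thread mentions usually looks roughly like this. This is a sketch only, since the thread never shows it; the docBase and solr/home paths are assumptions for a Windows setup:

    <!-- conf/Catalina/localhost/solr.xml (paths are hypothetical) -->
    <Context docBase="C:/tomcat/webapps/solr.war" debug="0" crossContext="true">
      <Environment name="solr/home" type="java.lang.String"
                   value="C:/solr" override="true"/>
    </Context>

If the core still does not appear, a mismatch between this solr/home value and the directory that actually contains solr.xml and the core0 folder is a common culprit.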
Re: Can not find solr core on admin page after setup
Hi Bayu, I did that, but for Solr 4.2. catalina.out has no exceptions at all. Thanks
SolrCloud full index replication on leader failure
Hi, I have a problem with SolrCloud in a specific test case and I wanted to know whether this is the way it should work, or whether there is any way to avoid it. I have the following scenario:

- Three machines
- Each one runs one ZooKeeper and one Solr 4.1.0
- Each Solr stores 7 million documents and the index is 2GB

The test consists of sending queries to Solr (100 concurrent queries, continuously) and then forcing a leader failure by shutting down both its ZooKeeper and its Solr. When we shut down any Solr that is not the leader there are no problems; the other two keep answering queries. However, if we shut down the leader, the following happens:

- Both Solrs continue responding to queries until the leader election starts
- One of them is elected leader and the other one stops responding to queries (I've read it goes into recovery mode until its index is synchronized with the leader's)
- Then, even though both indexes are the same (they were synchronized before the leader failure), the whole index is replicated
- While the 2GB are replicated from the leader to the remaining server, the recovering server does not respond to queries, so the leader must handle the whole query load and finally crashes from having too many queries to answer (on top of replicating its index)

My question: is it normal that the whole index is replicated on a leader change even though the leader's index and the other Solr's index should be the same? Is there any way to avoid it? Maybe I have some configuration wrong? Would moving to Solr 4.5.x avoid this behavior?

Aside from this problem everything seems to work fine, but that point of failure is too risky for us. Thanks in advance.

-- Alejandro Marqués Rodríguez, Paradigma Tecnológico, http://www.paradigmatecnologico.com, Avenida de Europa, 26. Ática 5. 3ª Planta, 28224 Pozuelo de Alarcón. Tel.: 91 352 59 42
Re: Configuration and specs to index a 1 terabyte (TB) repository
On Tue, 2013-10-29 at 14:24 +0100, eShard wrote: I have a 1 TB repository with approximately 500,000 documents (that will probably grow from there) that needs to be indexed.

As Shawn points out, that isn't telling us much. If you describe the documents, how and how often you index, and how you query them, it will help a lot. Let me offer some observations from a related project we are starting at Statsbiblioteket.

We are planning to index 20 TB of harvested web resources (*.dk from the last 8 years, or at least the resources our crawlers sunk their tentacles into). We have two text indexes generated from about 1% and 2% of that corpus, respectively. They are 200GB and 420GB in size and contain ~75 million and (whoops, offline, so guessing from memory here) ~150 million documents. For testing purposes we issued simple searches: 2-4 OR'ed terms, picked at random from a Danish dictionary.

One of our test machines is a 2*8-core Xeon machine with 32GB of RAM (about ~12GB free for caching) and SSDs as storage. We had room for a 2-shard cloud on the SSDs, so searches were issued against 2*200GB of index holding a total of 150 million documents. CentOS/Solr 4.3. Hammering that machine with 32 threads gave us a median response time of 200ms and a 99th percentile of 5-800 ms (depending on test run); a single thread has a median of 30ms and a 99th percentile of 70-130ms. CPU load peaked at 300-400% and IOWait at 30-40%, but was not closely monitored.

Our current vision is to shard the projected 20TB index into ~800GB or ~1TB chunks (depending on which drives we choose) and put one shard on each physical SSD, thereby sidestepping the whole RAID TRIM problem. We do have the great luxury of running nightly batch index updates on a single shard instead of continuous updates. We would probably go for smaller shards if they were all updated continuously. The projected price for the full setup ranges from $50,000 to $100,000, depending on where we land on the off-the-shelf vs. enterprise scale. (I need to write a blog post on this.)

With that in mind, I urge you to do some testing on a machine with SSD and modest memory vs. a traditional spinning-drives and monster-memory machine.

- Toke Eskildsen, State and University Library, Denmark
Return the synonyms as part of Solr response
Hi, We have a requirement where we need to send the matched synonyms as part of the Solr response. Do we need to customize the Solr response handler to do this? Regards, Siva
Re: SolrCloud liveness problems
Hi, I experience the same problem, using version 4.4.0. In my case: 2 Solr nodes, 4 collections (each with 1 shard and 2 replicas), and 3 ZooKeepers.

Replicas can get state=down when a connection to ZooKeeper is lost. However, there are 2 more ZooKeeper servers, so this shouldn't be a problem, right? The only errors in the log are like the following:

    Error inspecting tlog tlog{file=/opt/solr/server/blabla/replica1/data/tlog/tlog.0001106 refcount=2}

Funny thing is, the replicas with the error work just fine; the ones without errors are causing problems. Maybe because the replicas with this error go through the recovery process and the others do not? There seems to be absolutely no problem with the replicas that are down. The only dirty hack to fix things is editing clusterstate.json and changing the state from down to active. It doesn't seem right, but it does work. Jeroen

On 18-9-2013 5:50, Mark Miller wrote: SOLR-5243 and SOLR-5240 will likely improve the situation. Both fixes are in 4.5 - the first RC for 4.5 will likely come tomorrow. Thanks to Yonik for sussing these out. - Mark

On Sep 17, 2013, at 2:43 PM, Mark Miller markrmil...@gmail.com wrote:

On Sep 17, 2013, at 12:00 PM, Vladimir Veljkovic vladimir.veljko...@boxalino.com wrote: Hello there, we have the following setup: SolrCloud 4.4.0 (3 nodes, physical machines), ZooKeeper 3.4.5 (3 nodes, physical machines). We have a number of rather small collections (~10K or ~100K documents) that we would like to load onto all Solr instances (numShards=1, replication_factor=3) and access through the local network interface, as the load balancing is done in layers above. We can live (and we actually do, in the test phase) with updating entire collections whenever we need to, switching collection aliases and removing the old collections.

We stumbled across the following problem: as soon as all three Solr nodes become a leader of at least one collection, restarting any node makes it completely unresponsive (timeout), both through the admin interface and for replication. If we restart all Solr nodes, the cluster ends up in some kind of deadlock and the only remedy we found is a clean Solr installation, removing the ZooKeeper data and re-posting the collections. Apparently, the leader is waiting for replicas to come up and they try to synchronize but time out on HTTP requests, so everything ends up in some kind of deadlock, maybe related to: https://issues.apache.org/jira/browse/SOLR-5240

Yup, that sounds exactly like what you would expect with SOLR-5240. A fix for that is coming in 4.5, which is probably a week or so away.

Eventually (after a few minutes), the leader takes over and marks the collections active, but remains blocked on the HTTP interface, so other nodes cannot synchronize. In further tests, we loaded 4 collections with numShards=1 and replication_factor=2. By chance, one node became the leader for all 4 collections. Restarting the node which was not the leader worked without problems, but when we restarted the leader it happened that: - the leader shut down, and the other nodes became leaders of 2 collections each - the leader started up, 3 collections on it became active, one collection remained down, and the node became unresponsive and timed out on HTTP requests.

Hard to say - I'll experiment with 4.5 and see if I can duplicate this. - Mark

As this behavior is completely unexpected for a cluster solution, I wonder if somebody else has experienced the same problems or whether we are doing something entirely wrong.
Best regards -- Vladimir Veljkovic Senior Java Entwickler Boxalino AG vladimir.veljko...@boxalino.com www.boxalino.com Tuning Kit for your Online Shop Product Search - Recommendations - Landing Pages - Data intelligence - Mobile Commerce
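For reference, the clusterstate.json hack Jeroen describes is typically done with the zkcli tool that ships with Solr. A sketch under assumed paths and an assumed ZooKeeper address; hand-editing cluster state is a last resort, not a supported operation:

    # pull the current cluster state (zkhost value is an assumption)
    cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd getfile /clusterstate.json /tmp/clusterstate.json
    # edit /tmp/clusterstate.json: change "state":"down" to "state":"active"
    # for the stuck replica, then push it back
    cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd putfile /clusterstate.json /tmp/clusterstate.json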
Making a Web Request is failing with 403 Request Forbidden
Hi All, I am making a web server call to a link-shortening website (bit.ly) but receiving a 403 Request Forbidden. If I use their web page to shorten the link, it works fine. Can anybody tell me what might be the reason for such vague behavior? Here is the code:

    import java.io.BufferedReader;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.ProtocolException;
    import java.net.URL;
    import javax.net.ssl.HttpsURLConnection;

    String url = "https://bitly.com/shorten/";
    StringBuffer response;
    try {
        URL obj = new URL(url);
        HttpsURLConnection con = (HttpsURLConnection) obj.openConnection();
        // add request headers
        con.setRequestMethod("POST");
        con.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25.0.1364.160 Chrome/25.0.1364.160 Safari/537.22");
        con.setRequestProperty("Accept-Language", "en-US,en;q=0.8");
        con.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        con.setRequestProperty("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.3");
        con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        con.setRequestProperty("Host", "bitly.com");
        String urlParameters = "url=http://bit.ly/1f3aLrP&ie=utf-8&oe=utf-8&gws_rd=cr&ei=sKlwUvPbN8j-rAf-5IDwAQ&basic_style=1&classic_mode=&rapid_shorten_mode=&_xsrf=a2b71eaf499c4690a77a21d3c87e6302";
        // send POST request
        con.setDoOutput(true);
        DataOutputStream wr = new DataOutputStream(con.getOutputStream());
        wr.writeBytes(urlParameters);
        wr.flush();
        wr.close();
        int responseCode = con.getResponseCode();
        System.out.println("Response Code : " + responseCode);
        BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String inputLine;
        response = new StringBuffer();
        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }
        in.close();
        System.out.println(response.toString());
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (ProtocolException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

Hoping for your response. Thanks!
Re: Making a Web Request is failing with 403 Request Forbidden
On Wed, Oct 30, 2013 at 4:50 PM, Vineet Mishra clearmido...@gmail.com wrote: I am making a web server call to a website for shortening links, that is bit.ly, but receiving a 403 Request Forbidden. Although if I use their webpage to shorten the web link it works fine. Can anybody tell me what might be the reason for such vague behavior?

This does not seem to be a Solr question. Perhaps look at more generic web-request tracing tools like Wireshark to compare the valid and failing requests. If this is Solr related, please narrow it down to the Solr aspect of the problem. Regards, Alex.

Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Configuration and specs to index a 1 terabyte (TB) repository
On Tue, 2013-10-29 at 16:41 +0100, Shawn Heisey wrote: If you put the index on SSD, you could get by with less RAM, but a RAID solution that works properly with SSD (TRIM support) is hard to find, so SSD failure in most situations effectively means a server failure. Solr and Lucene have a track record of shredding SSDs into failure, because typically there is a LOT of writing involved.

Why would TRIM have any influence on whether or not a drive failure also means server failure?

If the track record you are referring to involves the problems that the Jenkins server for Lucene development had, I know of two failed drives from that setup and they were both OCZ. No surprise here; it pays to examine the reliability of the different models before buying. My current rule is to avoid OCZ like the plague and go for a Samsung 840 or an Intel drive. http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923.html

- Toke Eskildsen, State and University Library, Denmark
Atomic Updates in SOLR
I am working on an offline tagging capability to tag records with a thesaurus dictionary of key concepts. I am able to use the update=add option in XML and JSON update calls to update a specific document field. However, if I run the same atomic update query twice, the multivalued string fields start showing duplicate values. E.g., for a field named tag that initially held copper, iron, steel, after running the atomic update

    <field name="tag" update="add">steel</field>

I get the tag field values copper, iron, steel, steel (so steel gets added twice). I looked at RemoveDuplicatesTokenFilterFactory, but it removes duplicate tokens, not duplicate multivalued field values. Is there any update processor to stop the incoming duplicate value from being indexed? Thanks in advance for any help. Regards, Anupam
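For reference, the same add operation in JSON form, as a minimal sketch with a made-up document id, posted to /update with Content-Type: application/json:

    [{"id": "doc1", "tag": {"add": "steel"}}]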
query with colon in bq
I have a question about a query with a colon in bq. I use edismax and I set q and bq like this:

    .../select?defType=edismax&q=1:100^100 1 100^30&qf=Title^2.0 Body&bq=Title:(1:100)^6.0 Body:(1:100)^6.0

With this query I get an error in bq: undefined field 1. How do I use a query with a colon in bq?
Re: Atomic Updates in SOLR
Perhaps you are accidentally running the update request more than once? Can you try using an optimistic update with _version_ while sending the update? That way, if some part of your code is making a duplicate request, Solr will throw an error. See https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents

On Wed, Oct 30, 2013 at 3:35 PM, Anupam Bhattacharya anupam...@gmail.com wrote: [...]

-- Regards, Shalin Shekhar Mangar.
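A sketch of what such an optimistic update looks like; the id and _version_ values here are made up. The _version_ must match the value currently stored on the document, and a mismatch makes Solr reject the request with a version-conflict error (HTTP 409):

    [{"id": "doc1", "_version_": 1450000000000000000, "tag": {"add": "steel"}}]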
Re: Atomic Updates in SOLR
I am not sure optimistic concurrency would help with deduplicating, but yes, as Shalin points out, you'll be able to spot issues with your client code.

On Wed, Oct 30, 2013 at 4:18 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: [...]

-- Anshum Gupta http://www.anshumgupta.net
Re: Atomic Updates in SOLR
Ah, I misread your email. You are actually sending the update twice and asking how to dedupe the multi-valued field values. No, I don't think we have an update processor which can do that.

On Wed, Oct 30, 2013 at 4:18 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: [...]

-- Regards, Shalin Shekhar Mangar.
Re: Atomic Updates in SOLR
I think it'll be a good thing to have. I just created a JIRA for it: https://issues.apache.org/jira/browse/SOLR-5403 Will try and get to it soon.

On Wed, Oct 30, 2013 at 4:28 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: [...]

-- Anshum Gupta http://www.anshumgupta.net
Re: Background merge errors with Solr 4.4.0 on Optimize call
Robert: Thanks. I'm on my way out the door, so I'll have to put up a JIRA with your patch later if it hasn't been done already. Erick

On Tue, Oct 29, 2013 at 10:14 PM, Robert Muir rcm...@gmail.com wrote: I think it's a bug, but that's just my opinion. I sent a patch to dev@ for thoughts.

On Tue, Oct 29, 2013 at 6:09 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, so you're saying that merging indexes where a field has been removed isn't handled. So you have some documents that do have a "what" field, but your schema doesn't have it, is that true? It _seems_ like you could get by by putting the _what_ field back into your schema, just not sending any data to it in new docs. I'll let others who understand merging better than me chime in on whether this is a case that should be handled or a bug. I pinged the dev list to see what the opinion is. Best, Erick

On Mon, Oct 28, 2013 at 6:39 PM, Matthew Shapiro m...@mshapiro.net wrote: Sorry for reposting right after my previous reply, but I just looked at the error trace more closely and noticed:

    Caused by: java.lang.IllegalArgumentException: no such field what

The 'what' field was removed at the request of the customer, as they wanted the logic behind what gets queried in the what field to be code-side instead of Solr-side (for easier changes without having to re-index everything; I didn't feel strongly either way, and since they are paying me, I took it out). This makes me wonder if it's crashing while merging because a field that used to be there is now gone. However, this seems odd to me, as Solr doesn't even let me delete the old data; instead it leaves my collection in an extremely bad state, and the only remedy I can think of is to nuke the index at the filesystem level. If this is indeed the cause of the crash, is the only way to delete a field to completely empty your index first?

On Mon, Oct 28, 2013 at 6:34 PM, Matthew Shapiro m...@mshapiro.net wrote: Thanks for your response. You were right, Solr is logging to the catalina.out file for Tomcat. When I click the optimize button in Solr's admin interface the following logs are written: http://apaste.info/laup About JVM memory, Solr's admin interface lists JVM memory at 3.1% (221.7MB dark grey, 512.56MB light grey, 6.99GB total).

On Mon, Oct 28, 2013 at 6:29 AM, Erick Erickson erickerick...@gmail.com wrote: For Tomcat, Solr output often goes to catalina.out by default, so the output might be there. You can configure Solr to send the logs almost anywhere you please, but without some specific setup on your part the log output just goes to the servlet container's default. I took a quick glance at the code, but since the merges happen in the background, there's not much context for where that error is thrown. How much memory is there for the JVM? I'm grasping at straws a bit... Erick

On Sun, Oct 27, 2013 at 9:54 PM, Matthew Shapiro m...@mshapiro.net wrote: I am working on implementing Solr as the search backend for our web system. So far things have been going well, but today I made some schema changes and now things have broken. I updated the schema.xml file and reloaded the core (via the admin interface). No errors were reported in the logs. I then pushed 100 records to be indexed.
A call to commit afterwards seemed fine; however, my next call to optimize caused the following errors:

    java.io.IOException: background merge hit exception: _2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37 [maxNumSegments=1]
    null:java.io.IOException: background merge hit exception: _2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37 [maxNumSegments=1]

Unfortunately, googling for background merge hit exception came up with two things: a corrupt index or not enough free space. The machine hosting Solr has 227 of 229GB free (according to df -h), so that's not it. I then ran CheckIndex on the index and got the following results: http://apaste.info/gmGU As someone who is new to Solr and Lucene, as far as I can tell this means my index is fine. So I am at a loss. I'm fairly sure I could delete my data directory and rebuild it, but I am more interested in finding out why it is having issues, what the best way to fix it is, and what the best way to prevent it from happening is when this goes into production. Does anyone have advice that may help? As an aside, I do not have a stack trace for you because the Solr admin page isn't giving me one. I tried looking in the logs directory under my Solr directory, but it does not contain any logs.
Store Solr OpenBitSets In Solr Indexes
Hi All, What should be the field type if I have to save Solr's OpenBitSet value within a Solr document object and retrieve it later for search?

    OpenBitSet bits = new OpenBitSet();
    bits.set(0);
    bits.set(1000);
    doc.addField("SolrBitSets", bits);

What should be the field type of SolrBitSets? Thanks
Re: Data import handler with multi tables
That is what I'd call a compound key? :) Using multiple attributes to generate a unique key across multiple tables...

On Wednesday, October 30, 2013 at 2:10 AM, dtphat wrote: Yes, I've just used concat(id, '_', tableName) instead of using a compound key. I think this is an easy way. Thanks. - Phat T. Dong
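In data-import-handler terms, the concat approach would look roughly like this; a sketch with invented table and column names, assuming a dataSource is already configured in data-config.xml:

    <document>
      <!-- prefix each table's id so the uniqueKey never collides -->
      <entity name="products"
              query="SELECT CONCAT(id, '_', 'products') AS id, name FROM products"/>
      <entity name="authors"
              query="SELECT CONCAT(id, '_', 'authors') AS id, name FROM authors"/>
    </document>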
Re: Atomic Updates in SOLR
Unfortunately, atomic add is add to a list (append) rather than add to a set (only unique values). But you can use the unique-fields update processor (solr.UniqFieldsUpdateProcessorFactory) to de-dupe specified multivalued fields. See: http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/UniqFieldsUpdateProcessorFactory.html My e-book has more examples as well. -- Jack Krupansky

-----Original Message----- From: Anupam Bhattacharya Sent: Wednesday, October 30, 2013 6:05 AM To: solr-user@lucene.apache.org Subject: Atomic Updates in SOLR [...]
Re: Atomic Updates in SOLR
Oops... I need to note that the parameters have changed since Solr 4.4 - I gave the link for 4.5.1, but for 4.4 and earlier, use: http://lucene.apache.org/solr/4_4_0/solr-core/org/apache/solr/update/processor/UniqFieldsUpdateProcessorFactory.html (My book is for 4.4 and hasn't been updated for 4.5 yet, but the gist of the examples is the same.) -- Jack Krupansky

-----Original Message----- From: Jack Krupansky Sent: Wednesday, October 30, 2013 9:03 AM To: solr-user@lucene.apache.org Subject: Re: Atomic Updates in SOLR [...]
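To make the version difference concrete, a configuration sketch; the chain name and the field name tag are assumptions taken only from this thread's example:

    <!-- Solr 4.5+ style: field-mutating selector parameters -->
    <updateRequestProcessorChain name="dedupe-values">
      <processor class="solr.UniqFieldsUpdateProcessorFactory">
        <str name="fieldName">tag</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

    <!-- Solr 4.4 and earlier used a fields list instead:
      <lst name="fields"><str>tag</str></lst>
    -->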
Unable to add mahout classifier
Hi, I made a few changes to solrconfig.xml, created a jar file, added it to Solr's lib folder and tried to start it. The changes in solrconfig.xml are:

    <updateRequestProcessorChain name="mahoutclassifier" default="true">
      <processor class="com.mahout.solr.classifier.CategorizeDocumentFactory">
        <str name="inputField">LEAD_NOTES</str>
        <str name="outputField">category</str>
        <str name="defaultCategory">Others</str>
        <str name="model">naiveBayesModel</str>
      </processor>
      <processor class="solr.RunUpdateProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

    <requestHandler name="/update/csv" class="solr.CSVRequestHandler">
      <lst name="defaults">
        <str name="stream.contentType">application/csv</str>
        <str name="update.processor">mahoutclassifier</str>
      </lst>
    </requestHandler>

I attached the class file. But I get the following error:

    org.apache.solr.common.SolrException: Error Instantiating UpdateRequestProcessorFactory, com.mahout.solr.classifier.CategorizeDocumentFactory failed to instantiate org.apache.solr.update.processor.UpdateRequestProcessorFactory
        at org.apache.solr.core.SolrCore.init(SolrCore.java:834)
        at org.apache.solr.core.SolrCore.init(SolrCore.java:625)
        at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:522)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:557)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
    Caused by: org.apache.solr.common.SolrException: Error Instantiating UpdateRequestProcessorFactory, com.mahout.solr.classifier.CategorizeDocumentFactory failed to instantiate org.apache.solr.update.processor.UpdateRequestProcessorFactory
        at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:547)
        at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:582)
        at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2144)
        at org.apache.solr.update.processor.UpdateRequestProcessorChain.init(UpdateRequestProcessorChain.java:119)
        at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:584)
        at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2128)
        at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2122)
        at org.apache.solr.core.SolrCore.loadUpdateProcessorChains(SolrCore.java:906)
        at org.apache.solr.core.SolrCore.init(SolrCore.java:766)
        ... 13 more
    Caused by: java.lang.ClassCastException: class com.mahout.solr.classifier.CategorizeDocumentFactory
        at java.lang.Class.asSubclass(Unknown Source)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:433)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:381)
        at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:526)
        ... 21 more

Thanks, Subbu
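The final ClassCastException means the loaded class could not be cast to UpdateRequestProcessorFactory; usually the custom class does not extend that base class, or it was compiled against a different Solr version than the one on the server. A minimal skeleton of what Solr expects; the class and field names mirror the config above, but the body is an assumption, not Subbu's actual code:

    import java.io.IOException;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class CategorizeDocumentFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
          SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            // classify doc.getFieldValue("LEAD_NOTES") and set "category" here
            super.processAdd(cmd);
          }
        };
      }
    }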
Re: Configuration and specs to index a 1 terabyte (TB) repository
Wow again! Thank you all very much for your insights. We will certainly take all of this under consideration. Erik: I want to upgrade but unfortunately it's not up to me. You're right, we definitely need to do it. And SolrJ sounds interesting, thanks for the suggestions. By the way, is there a Solr upgrade guide out there anywhere? Thanks again!
Re: Language detection for multivalued field
Hi, First, the feature will only detect ONE language per field, even if it is a multi-valued field. In your case there is VERY little text for the detector, so do not expect great detection quality. But I believe the detector chose ES as the language and mapped the whole field as tag_es. The reason you do not see tag_es in the first schema version is naturally that you have it defined as stored=false. If you want individual detection of each value, please send the values in differently named fields, or file a JIRA to request individual language detection for the values of a multiValued field. -- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com

22. okt. 2013 kl. 14:16 skrev vatuska vatu...@yandex.ru: Can you elaborate on your comment "There isn't tag indexed"? Are you saying that your multiValued tag field is not indexed at all, gone, missing?

There isn't any tag_... field, despite indexed="true" stored="true" on the dynamicField. I found the reason, but I don't understand why. If I specify

    <str name="langid.whitelist">en,es</str>

there isn't any tag_... field for the document

    ...
    <field name="tag">español</field>
    <field name="tag">first</field>
    <field name="tag">My tag</field>
    ...

when these lines are in schema.xml:

    <dynamicField name="*_undfnd" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_en" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_es" type="text_es" indexed="true" stored="false" multiValued="true"/>

But if I specify

    <dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>

there is a tag_es: "español", "first", "My tag" in the stored document. Could you explain, please, how this works?
Re: Configuration and specs to index a 1 terabyte (TB) repository
On 10/30/2013 4:00 AM, Toke Eskildsen wrote: On Tue, 2013-10-29 at 16:41 +0100, Shawn Heisey wrote: If you put the index on SSD, you could get by with less RAM, but a RAID solution that works properly with SSD (TRIM support) is hard to find, so SSD failure in most situations effectively means a server failure. Solr and Lucene have a track record of shredding SSD into failure, because typically there is a LOT of writing involved. Why would TRIM have any influence on whether or not a driver failure also means server failure? I left out a step in my description. Lack of TRIM support in RAID means that I would avoid RAID with SSD. No RAID means that when the SSD fails, that Solr is out of commission until its SSD can be replaced. If you've got multiple replicas and good error alarming, then that won't pose a major issue. I don't know how Solr would behave if you put each core on its own SSD and one of them fails. Hopefully it's smart enough to keep going with the cores that have working filesystems. Thanks, Shawn
Re: query with colon in bq
Escape any special characters with a backslash, or put the full term in quotes. -- Jack Krupansky

-----Original Message----- From: jihyun suh Sent: Wednesday, October 30, 2013 6:28 AM To: solr-user@lucene.apache.org Subject: query with colon in bq [...]
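Concretely, using the field names from the question (a sketch, not a tested query):

    bq=Title:"1:100"^6.0 Body:"1:100"^6.0

or, with backslash escaping:

    bq=Title:1\:100^6.0 Body:1\:100^6.0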
Re: Indexing logs files of thousands of GBs
Hello, As suggested by Chris, I am now reading the files from a Java program and creating SolrInputDocuments, but I ran into this exception while doing server.add(document). When I tried to increase ramBufferSizeMB, it wouldn't let me make it more than 2 gig.

    org.apache.solr.client.solrj.SolrServerException: Server at http://localhost:8983/solr/logsIndexing returned non ok status:500, message:the request was rejected because its size (2097454) exceeds the configured maximum (2097152)
    org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the request was rejected because its size (2097454) exceeds the configured maximum (2097152)
        at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl$1.raiseError(FileUploadBase.java:902)
        at org.apache.commons.fileupload.util.LimitedInputStream.checkLimit(LimitedInputStream.java:71)
        at org.apache.commons.fileupload.util.LimitedInputStream.read(LimitedInputStream.java:128)
        at org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable(MultipartStream.java:977)
        at org.apache.commons.fileupload.MultipartStream$ItemInputStream.read(MultipartStream.java:887)
        at java.io.InputStream.read(Unknown Source)
        at org.apache.commons.fileupload.util.Streams.copy(Streams.java:94)
        at org.apache.commons.fileupload.util.Streams.copy(Streams.java:64)
        at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)
        at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
        at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344)
        at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397)
        at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHand
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
        at Filewalker.walk(LogsIndexer.java:48)
        at Filewalker.main(LogsIndexer.java:69)

How do I get rid of this? Thanks, Prerna
Computing Results So That They are Returned in Search Results
I'd like to throw out a design question and see if it's possible to solve this with Solr. I have a set of computed data that I'd like to make searchable. Ideally, I'd like to have all documents indexed and call it a day, but the nature of the data is such that it needs to be computed from a definition. I'm interested in searching on definitions and then creating results on the fly that are calculated from something embedded in the definition. Is it possible to embed this calculation logic into Solr's result-handling process? I know this sounds exotic, but I can't index these calculated documents because I don't know what the boundary is, and specifying an arbitrary number isn't ideal. Has anyone run across something like this? Thanks, Alejandro
Re: Configuration and specs to index a 1 terabyte (TB) repository
On Wed, 2013-10-30 at 14:24 +0100, Shawn Heisey wrote: On 10/30/2013 4:00 AM, Toke Eskildsen wrote: Why would TRIM have any influence on whether or not a driver failure also means server failure? I left out a step in my description. Lack of TRIM support in RAID means that I would avoid RAID with SSD. No RAID means that when the SSD fails, that Solr is out of commission until its SSD can be replaced. That makes sense, thanks. I don't know how Solr would behave if you put each core on its own SSD and one of them fails. Hopefully it's smart enough to keep going with the cores that have working filesystems. I don't know either. Seems like it would be a useful thing to test. We did some comparison on 9 shards of 420GB (against a SAN), where we tested SolrCloud with 9 independent Solr instances vs. a single instance with multiple cores. The overhead of independent instances did not seem severe for that shard size and should be resilient against single drive failure. As we're looking at a cumulative heap requirement of 100GB+ due to grouping and faceting, it might be preferable to run with independent Solrs anyway to minimize garbage collection pauses. I do not know if that logic extends in general to large Solr installations. Regards, Toke Eskildsen, State and University Library, Denmark
Evaluating a SOLR index with trec_eval
Hello! Is there a simple way to evaluate a SOLR index with TREC_EVAL? I mean: * preparing a query file in some format Solr will understand, but where each query has an ID * getting results out in trec format, with these query IDs attached Thanks Michael
[SolrCloud-Solrj] Document router problem connecting to Zookeeper ensemble
I have a ZooKeeper ensemble hosted on one Amazon server. Using CloudSolrServer and trying to connect, I get this really unusual error:

    969 [main] INFO org.apache.solr.common.cloud.ConnectionManager - Client is connected to ZooKeeper
    1043 [main] INFO org.apache.solr.common.cloud.ZkStateReader - Updating cluster state from ZooKeeper...
    Exception in thread main org.apache.solr.common.SolrException: Unknown document router '{name=implicit}'
        at org.apache.solr.common.cloud.DocRouter.getDocRouter(DocRouter.java:46)

although in my collection I have the compositeId routing strategy (from clusterstate.json). This is how I instantiate the server:

    CloudSolrServer server = new CloudSolrServer(
        "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2181,"
        + "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2182,"
        + "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2183");
    server.setDefaultCollection("example");
    SolrPingResponse ping = server.ping();

Any hint?

-- Benedetti Alessandro Visiting card: http://about.me/alessandro_benedetti "Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry?" William Blake - Songs of Experience - 1794 England
SolrCloud batch updates
I'm currently using a SolrCloud setup and I index my data using a couple of in-house indexing clients. The clients process some files and post JSON messages containing the added documents in batches. Initially my batch size was 100k docs and the post request took about 20-30 seconds. I switched to 10k batches, and now the updates are much faster but also more numerous. My commit settings are:

- autoCommit - 45s / 100k docs, openSearcher=false
- autoSoftCommit - every 3 minutes

I'm trying to figure out which is preferable: bigger, rarer batches or smaller, more frequent ones? And why? Which background operations take place after posting docs? At which point does replication kick in - after commit or after update?

- Thanks, Michael
Problem querying with edismax and hyphens
Hi, The query z-score doesn't match a doc with zscore in the index. The analysis tool shows that this query would match this data in the index, but it's the edismax query parser step that seems to screw things up. Is there some combination of autoGeneratePhraseQueries, WordDelimiterFilterFactory parameters, and/or something else I can change or add to generically make the query match without modifying mm? I.e. without adding a rule to specifically synonymize or split the term zscore via some dictionary of words.

The query I want to match but doesn't: z-score (with mm=-30%)

In the index: zscore

The analyzer:

    <fieldType autoGeneratePhraseQueries="false" class="solr.TextField" name="lowStopText" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter catenateAll="1" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnCaseChange="0" splitOnNumerics="0" types="wdfftypes.txt"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter catenateAll="1" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnCaseChange="0" splitOnNumerics="0" types="wdfftypes.txt"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
      </analyzer>
    </fieldType>

The parsed edismax query with autoGeneratePhraseQueries=true: +(def_term:"(z-score z) (score zscore)")

The parsed edismax query with autoGeneratePhraseQueries=false: +(((def_term:z-score def_term:z def_term:score def_term:zscore)~3))

Thanks, Vardhan
Re: Computing Results So That They are Returned in Search Results
You could create a custom value source and then use it in a function query embedded in your returned fields list (fl). The function query would use a function (value source) that takes a field, fetches its value, performs some arbitrary calculation, and returns the result:

    fl=id,name,my-func(field1),my-func(field2)

-- Jack Krupansky

-----Original Message----- From: Alejandro Calbazana Sent: Wednesday, October 30, 2013 10:10 AM To: solr-user@lucene.apache.org Subject: Computing Results So That They are Returned in Search Results [...]
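A sketch of such a custom value source in Java against the Solr 4.x plugin API; the function name my-func and the doubling logic are made up for illustration:

    import org.apache.lucene.queries.function.FunctionValues;
    import org.apache.lucene.queries.function.ValueSource;
    import org.apache.lucene.queries.function.valuesource.SimpleFloatFunction;
    import org.apache.solr.search.FunctionQParser;
    import org.apache.solr.search.SyntaxError;
    import org.apache.solr.search.ValueSourceParser;

    public class MyFuncParser extends ValueSourceParser {
      @Override
      public ValueSource parse(FunctionQParser fp) throws SyntaxError {
        final ValueSource source = fp.parseValueSource();
        return new SimpleFloatFunction(source) {
          @Override
          protected String name() { return "my-func"; }
          @Override
          protected float func(int doc, FunctionValues vals) {
            // arbitrary per-document calculation goes here
            return 2.0f * vals.floatVal(doc);
          }
        };
      }
    }

It would be registered in solrconfig.xml with <valueSourceParser name="my-func" class="com.example.MyFuncParser"/> and then used in fl as shown above.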
Re: solr 4.5.0 configuration Error: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config file .../solrconfig.xml
On 10/30/2013 9:24 AM, Elena Camossi wrote: Hi everyone, I'm trying to configure Solr 4.5.0 on Red Hat Linux to work with CKAN and Tomcat, but Solr cannot initialize the core (I'm configuring just one core, but this is likely to change in the near future; I'm using contexts for this setup). Tomcat is working correctly and lists solr among the running applications. When I open the Solr dashboard, the Solr instance is running but I see this error:

    SolrCore Initialization Failures
    ckan-schema-2.0: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config file /usr/share/solr/ckan/conf/solrconfig.xml

[snip]

The core settings in my solr.xml (/usr/share/solr/solr.xml in my installation) are:

    <solr persistent="true" sharedLib="lib">
      <cores adminPath="/admin/cores" defaultCoreName="ckan">
        <core name="ckan-schema-2.0" instanceDir="ckan/conf">
          <property name="dataDir" value="/var/lib/solr/data/ckan"/>
        </core>
      </cores>
    </solr>

Typically, instanceDir will not have the conf on it - it should just be ckan here. Solr automatically adds the conf when it is looking for the configuration. Later you show that you have dataDir defined in solrconfig.xml -- take that out entirely. The dataDir is specified in solr.xml; putting it in solrconfig.xml as well is just asking for problems -- especially if you ever end up sharing the solrconfig.xml between more than one core, which is what happens with SolrCloud. Also, evidence seems to suggest that the ${dataDir} substitution that used to work in older versions was a fluke. After a recent rigorous properties cleanup, it is no longer supported unless you actually define it as a Java system property. Finally, make sure that the permissions of all paths leading to both the symlink for your conf directory and the actual conf directory are readable by the tomcat user, not just root.
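Putting those suggestions together, the corrected solr.xml would presumably look like this (same names and paths as in the thread; only instanceDir changes, and dataDir stays here rather than in solrconfig.xml):

    <solr persistent="true" sharedLib="lib">
      <cores adminPath="/admin/cores" defaultCoreName="ckan">
        <core name="ckan-schema-2.0" instanceDir="ckan">
          <property name="dataDir" value="/var/lib/solr/data/ckan"/>
        </core>
      </cores>
    </solr>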
Re: Indexing logs files of thousands of GBs
I have set the multipartUploadLimitInKB parameter to 10240 (it was 2048 earlier): multipartUploadLimitInKB=10240. Now it gives the following error for the same files:

    http://localhost:8983/solr/logsIndexing returned non ok status:500, message:the request was rejected because its size (10486046) exceeds the configured maximum (10485760)
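For reference, this limit lives on the requestParsers element in solrconfig.xml. A sketch, with the value raised to whatever ceiling the largest request actually needs (units are KB); note that form-data uploads have a separate formdataUploadLimitInKB knob:

    <requestDispatcher handleSelect="false">
      <requestParsers enableRemoteStreaming="true"
                      multipartUploadLimitInKB="102400"
                      formdataUploadLimitInKB="102400"/>
    </requestDispatcher>

That said, sending multi-MB requests is worth rethinking; splitting a log file into one small Solr document per line or per event keeps each request well under any such limit.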
RE: solr 4.5.0 configuration Error: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config file .../solrconfig.xml
Dear Shawn, thanks a lot for your quick answer. -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: mercoledì 30 ottobre 2013 17:12 To: solr-user@lucene.apache.org Subject: Re: solr 4.5.0 configuration Error: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config file .../solrconfig.xml On 10/30/2013 9:24 AM, Elena Camossi wrote: Hi everyone, I'm trying to configure Solr 4.5.0 on Linux Red Hat to work with CKAN and Tomcat, but Solr cannot initialize the core (I'm configuring just one core, but this is likely to change in the near future. I'm using contexts for this set up). Tomcat is working correctly, and lists solr among its running applications. When I open the Solr dashboard, the Solr instance is running but I see this error:

SolrCore Initialization Failures ckan-schema-2.0: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config file /usr/share/solr/ckan/conf/solrconfig.xml

[snip]

The content of my solr.xml for core settings (/usr/share/solr/solr.xml, in my installation) is:

  <solr persistent="true" sharedLib="lib">
    <cores adminPath="/admin/cores" defaultCoreName="ckan">
      <core name="ckan-schema-2.0" instanceDir="ckan/conf">
        <property name="dataDir" value="/var/lib/solr/data/ckan"/>
      </core>
    </cores>
  </solr>

Typically, instanceDir will not have the conf on it - it should just be ckan for this. Solr automatically adds the conf when it is looking for the configuration. Later you show that you have dataDir defined in solrconfig.xml -- take that out entirely. The dataDir is specified in solr.xml; putting it in solrconfig.xml as well is just asking for problems -- especially if you ever end up sharing the solrconfig.xml between more than one core, which is what happens with SolrCloud. Also, evidence seems to suggest that the ${dataDir} substitution that used to work in older versions was a fluke. After a recent rigorous properties cleanup, it is no longer supported, unless you actually define it as a Java system property.

Actually, I had tried instanceDir="ckan" but it didn't work either (with the same error, just reporting a wrong path to solrconfig.xml). I used this configuration taking a suggestion from here: http://stackoverflow.com/questions/16230493/apache-solr-unable-to-access-admin-page). But now that I have commented out the dataDir setting in solrconfig.xml as you suggest, the behaviour changes and I get a different error in the Solr logging:

SolrCore Initialization Failures ckan-schema-2.0: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error loading class 'solr.clustering.ClusteringComponent' Please check your logs for more information

Log4j (org.slf4j.impl.Log4jLoggerFactory) Time Level Logger Message
17:36:43 WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../contrib/extraction/lib (resolved as: /usr/share/solr/ckan/../../../contrib/extraction/lib).
17:36:43 WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /usr/share/solr/ckan/../../../dist).
17:36:43 WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../contrib/clustering/lib/ (resolved as: /usr/share/solr/ckan/../../../contrib/clustering/lib).
17:36:43 WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /usr/share/solr/ckan/../../../dist).
17:36:43 WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../contrib/langid/lib/ (resolved as: /usr/share/solr/ckan/../../../contrib/langid/lib).
17:36:43 WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /usr/share/solr/ckan/../../../dist).
17:36:43 WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../contrib/velocity/lib (resolved as: /usr/share/solr/ckan/../../../contrib/velocity/lib).
17:36:43 WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /usr/share/solr/ckan/../../../dist).
17:36:44 WARN SolrCore [ckan-schema-2.0] Solr index directory '/var/lib/solr/data/ckan/index' doesn't exist. Creating new index...
17:36:45 ERROR CoreContainer Unable to create core: ckan-schema-2.0
17:36:45 ERROR CoreContainer null:org.apache.solr.common.SolrException: Unable to create core: ckan-schema-2.0 null:org.apache.solr.common.SolrException: Unable to create core: ckan-schema-2.0 at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:936) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:568) at
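[Editorial note: for reference, a solr.xml that follows Shawn's two suggestions would look roughly like the sketch below. It reuses Elena's paths; everything else is an assumption. Note also that defaultCoreName has to match the name of an actual core, which the posted config does not seem to do.]

  <solr persistent="true" sharedLib="lib">
    <cores adminPath="/admin/cores" defaultCoreName="ckan-schema-2.0">
      <!-- instanceDir without the trailing conf; Solr appends conf itself.
           dataDir is set here as a core attribute, not in solrconfig.xml. -->
      <core name="ckan-schema-2.0" instanceDir="ckan"
            dataDir="/var/lib/solr/data/ckan"/>
    </cores>
  </solr>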
Re: Computing Results So That They are Returned in Search Results
Sounds really close to what I'm looking for, but this sounds like it would result in a new field on a document (or a new value for a field defined to hold the result of a function). Would it be possible for a function query to produce a new document so that I can associate the computed value with it? Thanks, Alejandro On Wed, Oct 30, 2013 at 12:05 PM, Jack Krupansky j...@basetechnology.com wrote: You could create a custom value source and then use it in a function query embedded in your return fields list (fl). So, the function query could use a function (value source) that takes a field, fetches its value, performs some arbitrary calculation, and then returns that value. fl=id,name,my-func(field1),my-func(field2) -- Jack Krupansky -Original Message- From: Alejandro Calbazana Sent: Wednesday, October 30, 2013 10:10 AM To: solr-user@lucene.apache.org Subject: Computing Results So That They are Returned in Search Results I'd like to throw out a design question and see if it's possible to solve this with Solr. I have a set of data that is computed that I'd like to make searchable. Ideally, I'd like to have all documents indexed and call it a day, but the nature of the data is such that it needs to be computed given a definition. I'm interested in searching on definitions and then creating results on the fly that are calculated based on something embedded in the definition. Is it possible to embed this calculation logic into Solr's result handling process? I know this sounds exotic, but the nature of the data is such that I can't index these calculated documents because I don't know what the boundary is and specifying an arbitrary number isn't ideal. Has anyone run across something like this? Thanks, Alejandro
Re: Indexing logs files of thousands of GBs
Hi, Hm, sorry for not helping with this particular issue directly, but it looks like you are *uploading* your logs and indexing that way? Wouldn't pushing them be a better fit when it comes to log indexing? We recently contributed a Logstash output that can index logs to Solr, which may be of interest - have a look at https://twitter.com/otisg/status/395563043045638144 -- it includes a little diagram that shows how this fits into the picture. Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ On Wed, Oct 30, 2013 at 9:55 AM, keshari.prerna keshari.pre...@gmail.com wrote: Hello, As suggested by Chris, now I am accessing the files using a Java program and creating SolrInputDocument, but I ran into this exception while doing server.add(document). When I tried to increase ramBufferSizeMB, it doesn't let me make it more than 2 gig. org.apache.solr.client.solrj.SolrServerException: Server at http://localhost:8983/solr/logsIndexing returned non ok status:500, message:the request was rejected because its size (2097454) exceeds the configured maximum (2097152) org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the request was rejected because its size (2097454) exceeds the configured maximum (2097152) at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl$1.raiseError(FileUploadBase.java:902) at org.apache.commons.fileupload.util.LimitedInputStream.checkLimit(LimitedInputStream.java:71) at org.apache.commons.fileupload.util.LimitedInputStream.read(LimitedInputStream.java:128) at org.apache.commons.fileupload.MultipartStream$ItemInputStream.makeAvailable(MultipartStream.java:977) at org.apache.commons.fileupload.MultipartStream$ItemInputStream.read(MultipartStream.java:887) at java.io.InputStream.read(Unknown Source) at org.apache.commons.fileupload.util.Streams.copy(Streams.java:94) at org.apache.commons.fileupload.util.Streams.copy(Streams.java:64) at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362) at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126) at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:344) at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:397) at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHand at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:121) at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106) at Filewalker.walk(LogsIndexer.java:48) at Filewalker.main(LogsIndexer.java:69) How do I get rid of this?
Thanks, Prerna -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073p4098438.html Sent from the Solr - User mailing list archive at Nabble.com.
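[Editorial note: the 2097152-byte ceiling in the stack trace above is Solr's default 2048 KB multipart upload limit, enforced by the request parsers; it is not ramBufferSizeMB. One way out - offered as a sketch, with an arbitrary 200 MB value - is to raise that limit in solrconfig.xml's requestDispatcher section; sending smaller batches of documents per add request would also stay under the default.]

  <requestDispatcher handleSelect="false">
    <!-- limit is in KB; 2048 KB is the 2097152-byte default seen in the error -->
    <requestParsers enableRemoteStreaming="false"
                    multipartUploadLimitInKB="204800"/>
  </requestDispatcher>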
Re: solr 4.5.0 configuration Error: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config file .../solrconfig.xml
On 10/30/2013 10:44 AM, Elena Camossi wrote: Actually, I had tried instanceDir="ckan" but it didn't work either (with the same error, just reporting a wrong path to solrconfig.xml). I used this configuration taking a suggestion from here: http://stackoverflow.com/questions/16230493/apache-solr-unable-to-access-admin-page). But now that I have commented out the dataDir setting in solrconfig.xml as you suggest, the behaviour changes and I get a different error in the Solr logging: SolrCore Initialization Failures ckan-schema-2.0: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error loading class 'solr.clustering.ClusteringComponent' Please check your logs for more information Log4j (org.slf4j.impl.Log4jLoggerFactory) Time Level Logger Message 17:36:43 WARN SolrResourceLoader Can't find (or read) directory to add to classloader: ../../../contrib/extraction/lib (resolved as: /usr/share/solr/ckan/../../../contrib/extraction/lib). Your solrconfig.xml file includes the ClusteringComponent, but you don't have the jars required for that component available. Your solrconfig file does have a bunch of lib directives, but they don't point anywhere that's valid -- they assume that the entire Solr download is available, not just what's in the example dir. The jar for that particular component can be found in the download as dist/solr-clustering-X.X.X.jar ... but it is likely to also require additional jars, such as those found in contrib/clustering/lib. When it comes to extra jars for contrib or third-party components, the best thing to do is remove all lib directives from solrconfig.xml and put the jars in ${solr.solr.home}/lib. For you that location would be /usr/share/solr/lib. Solr automatically looks in this location without any extra configuration. Further advice - remove things you don't need from your config. If you're not planning to use the clustering component, take it out. Also remove any handlers that refer to components you won't be using -- the /browse handler is a prime example of something that most people don't need. Thanks, Shawn
Re: Computing Results So That They are Returned in Search Results
A function query simply returns a calculated result based on existing data - no new fields required. Did you actually want to precompute a value, store it in the index, and then query on it? If so, you could do that at indexing time with a custom or scripted update processor. Flesh out an example of exactly what you want. -- Jack Krupansky -Original Message- From: Alejandro Calbazana Sent: Wednesday, October 30, 2013 12:46 PM To: solr-user@lucene.apache.org Subject: Re: Computing Results So That They are Returned in Search Results Sounds really close to what I'm looking for, but this sounds like it would result in a new field on a document (or a new value for a field defined to hold the result of a function). Would it be possible for a function query to produce a new document so that I can associate the computed value with it? Thanks, Alejandro On Wed, Oct 30, 2013 at 12:05 PM, Jack Krupansky j...@basetechnology.com wrote: You could create a custom value source and then use it in a function query embedded in your return fields list (fl). So, the function query could use a function (value source) that takes a field, fetches its value, performs some arbitrary calculation, and then returns that value. fl=id,name,my-func(field1),my-func(field2) -- Jack Krupansky -Original Message- From: Alejandro Calbazana Sent: Wednesday, October 30, 2013 10:10 AM To: solr-user@lucene.apache.org Subject: Computing Results So That They are Returned in Search Results I'd like to throw out a design question and see if it's possible to solve this with Solr. I have a set of data that is computed that I'd like to make searchable. Ideally, I'd like to have all documents indexed and call it a day, but the nature of the data is such that it needs to be computed given a definition. I'm interested in searching on definitions and then creating results on the fly that are calculated based on something embedded in the definition. Is it possible to embed this calculation logic into Solr's result handling process? I know this sounds exotic, but the nature of the data is such that I can't index these calculated documents because I don't know what the boundary is and specifying an arbitrary number isn't ideal. Has anyone run across something like this? Thanks, Alejandro
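[Editorial note: to make Jack's suggestion concrete, here is a minimal sketch of a custom value source for Solr 4.x. Everything here is illustrative - the class name, the function name myfunc (avoiding the hyphen in my-func, which the function syntax may not tolerate), and the doubling calculation. It would be registered in solrconfig.xml with <valueSourceParser name="myfunc" class="com.example.MyFuncParser"/> and then used as fl=id,name,myfunc(field1).]

  import org.apache.lucene.queries.function.FunctionValues;
  import org.apache.lucene.queries.function.ValueSource;
  import org.apache.lucene.queries.function.valuesource.SimpleFloatFunction;
  import org.apache.solr.search.FunctionQParser;
  import org.apache.solr.search.SyntaxError;
  import org.apache.solr.search.ValueSourceParser;

  public class MyFuncParser extends ValueSourceParser {
    @Override
    public ValueSource parse(FunctionQParser fp) throws SyntaxError {
      // the field (or nested function) given as the argument, e.g. myfunc(field1)
      final ValueSource source = fp.parseValueSource();
      return new SimpleFloatFunction(source) {
        @Override
        protected String name() { return "myfunc"; }
        @Override
        protected float func(int doc, FunctionValues vals) {
          // arbitrary per-document calculation on the fetched value
          return vals.floatVal(doc) * 2.0f;
        }
      };
    }
  }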
Replacing Google Mini Search Appliance with Solr?
Hello all, Been lurking on the list for a while. Our two Google Mini search appliances, used to index our public web sites, are at end of life and need replacing. Google is no longer selling the mini appliances, and buying the big appliance is not cost beneficial. http://search.richmond.edu/ We would run a solr replacement on Linux (CentOS, Red Hat, similar) with OpenJDK or Oracle Java. Background == ~130 sites only ~12,000 pages (at a depth of 3) probably ~40,000 pages if we go to a depth of 4 We use key matches a lot. In solr terms these are elevated documents (elevations) We would code a search query form in php and wrap it into our design (http://www.richmond.edu) I have played with and love lucidworks and know that their $ solution works for our use cases, but the cost model is not attractive for such a small collection. So with solr, what are my open source options, and what are people's experiences crawling and indexing web sites with solr + a crawler? I understand Solr does not ship with a crawler, so getting one working would be first up. We can code in Java, PHP, Python etc. if we have to, but we don't want to write a crawler if we can avoid it. Thanks in advance for any information. -- Eric Palmer Web Services U of Richmond
RE: Replacing Google Mini Search Appliance with Solr?
Hi Eric, We have also helped some government institutions replace their expensive GSA with open source software. In our case we use Apache Nutch 1.7 to crawl the websites and index to Apache Solr. It is very effective, robust and scales easily with Hadoop if you have to. Nutch may not be the easiest tool for the job but it is very stable, feature rich and has an active community here at Apache. Cheers, -Original message- From:Palmer, Eric epal...@richmond.edu Sent: Wednesday 30th October 2013 18:48 To: solr-user@lucene.apache.org Subject: Replacing Google Mini Search Appliance with Solr? Hello all, Been lurking on the list for a while. Our two Google Mini search appliances, used to index our public web sites, are at end of life and need replacing. Google is no longer selling the mini appliances, and buying the big appliance is not cost beneficial. http://search.richmond.edu/ We would run a solr replacement on Linux (CentOS, Red Hat, similar) with OpenJDK or Oracle Java. Background == ~130 sites only ~12,000 pages (at a depth of 3) probably ~40,000 pages if we go to a depth of 4 We use key matches a lot. In solr terms these are elevated documents (elevations) We would code a search query form in php and wrap it into our design (http://www.richmond.edu) I have played with and love lucidworks and know that their $ solution works for our use cases, but the cost model is not attractive for such a small collection. So with solr, what are my open source options, and what are people's experiences crawling and indexing web sites with solr + a crawler? I understand Solr does not ship with a crawler, so getting one working would be first up. We can code in Java, PHP, Python etc. if we have to, but we don't want to write a crawler if we can avoid it. Thanks in advance for any information. -- Eric Palmer Web Services U of Richmond
Re: Replacing Google Mini Search Appliance with Solr?
Nutch is an excellent option. It should feel very comfortable for people migrating away from the Google appliances. Apache Droids is another possible approach, and I've found people using Heritrix or Manifold for various use cases (and usually in combination with other use cases where the extra overhead was worth the trouble). I think the simplest approach will be Nutch... it's absolutely worth taking a shot at it. DO NOT write a crawler! That is a rabbit hole you do not want to peer down into :) On Oct 30, 2013, at 10:54 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Eric, We have also helped some government institutions replace their expensive GSA with open source software. In our case we use Apache Nutch 1.7 to crawl the websites and index to Apache Solr. It is very effective, robust and scales easily with Hadoop if you have to. Nutch may not be the easiest tool for the job but it is very stable, feature rich and has an active community here at Apache. Cheers, -Original message- From:Palmer, Eric epal...@richmond.edu Sent: Wednesday 30th October 2013 18:48 To: solr-user@lucene.apache.org Subject: Replacing Google Mini Search Appliance with Solr? Hello all, Been lurking on the list for a while. Our two Google Mini search appliances, used to index our public web sites, are at end of life and need replacing. Google is no longer selling the mini appliances, and buying the big appliance is not cost beneficial. http://search.richmond.edu/ We would run a solr replacement on Linux (CentOS, Red Hat, similar) with OpenJDK or Oracle Java. Background == ~130 sites only ~12,000 pages (at a depth of 3) probably ~40,000 pages if we go to a depth of 4 We use key matches a lot. In solr terms these are elevated documents (elevations) We would code a search query form in php and wrap it into our design (http://www.richmond.edu) I have played with and love lucidworks and know that their $ solution works for our use cases, but the cost model is not attractive for such a small collection. So with solr, what are my open source options, and what are people's experiences crawling and indexing web sites with solr + a crawler? I understand Solr does not ship with a crawler, so getting one working would be first up. We can code in Java, PHP, Python etc. if we have to, but we don't want to write a crawler if we can avoid it. Thanks in advance for any information. -- Eric Palmer Web Services U of Richmond
Re: SolrCloud batch updates
Hi Michael, Here's a good post by Erick Erickson about understanding commits and transaction logs in Solr: http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ About the replication, as soon as you post an update, here's what happens: 1. The update gets routed to the correct leader 2. The leader writes it to its transaction log 3. The leader forwards the update to the replicas 4. Once the replicas acknowledge the update as successful, the leader returns a success message for the update. Hope that helps. On Wed, Oct 30, 2013 at 9:06 PM, michael.boom my_sky...@yahoo.com wrote: I'm currently using a SolrCloud setup and I index my data using a couple of in-house indexing clients. The clients process some files and post json messages containing added documents in batches. Initially my batch size was 100k docs and the post request took about 20-30 secs. I switched to 10k batches and now the updates are much faster but also more in number. My commit settings are: - autocommit - 45s / 100k docs, openSearcher=false - softAutoCommit - every 3 minutes I'm trying to figure out which one is preferable - bigger batches, rarely, or smaller batches, often? And why? What are the background operations that take place after posting docs? At which point does the replication kick in - after commit or after update? - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-batch-updates-tp4098463.html Sent from the Solr - User mailing list archive at Nabble.com. -- Anshum Gupta http://www.anshumgupta.net
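[Editorial note: a sketch of the batching pattern under discussion with SolrJ 4.x. The batch size, collection name, ZooKeeper hosts and fields are all illustrative.]

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Collects documents and posts them in fixed-size batches.
  public class BatchIndexer {
    public static void main(String[] args) throws Exception {
      CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
      server.setDefaultCollection("collection1");
      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 100000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title", "document " + i);
        batch.add(doc);
        if (batch.size() == 10000) {   // batch size is the knob under discussion
          server.add(batch);           // routed to leaders, logged, forwarded to replicas
          batch.clear();
        }
      }
      if (!batch.isEmpty()) server.add(batch);
      // rely on autoCommit/softAutoCommit from solrconfig.xml rather than committing here
      server.shutdown();
    }
  }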
Re: Replacing Google Mini Search Appliance with Solr?
Markus and Jason, thanks for the info. I will start to research Nutch. On writing a crawler - agreed, it is a rabbit hole. -- Eric Palmer Web Services U of Richmond To report technical issues, obtain technical support or make requests for enhancements please visit http://web.richmond.edu/contact/technical-support.html On 10/30/13 2:53 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Nutch is an excellent option. It should feel very comfortable for people migrating away from the Google appliances. Apache Droids is another possible approach, and I've found people using Heritrix or Manifold for various use cases (and usually in combination with other use cases where the extra overhead was worth the trouble). I think the simplest approach will be Nutch... it's absolutely worth taking a shot at it. DO NOT write a crawler! That is a rabbit hole you do not want to peer down into :) On Oct 30, 2013, at 10:54 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Eric, We have also helped some government institutions replace their expensive GSA with open source software. In our case we use Apache Nutch 1.7 to crawl the websites and index to Apache Solr. It is very effective, robust and scales easily with Hadoop if you have to. Nutch may not be the easiest tool for the job but it is very stable, feature rich and has an active community here at Apache. Cheers, -Original message- From:Palmer, Eric epal...@richmond.edu Sent: Wednesday 30th October 2013 18:48 To: solr-user@lucene.apache.org Subject: Replacing Google Mini Search Appliance with Solr? Hello all, Been lurking on the list for a while. Our two Google Mini search appliances, used to index our public web sites, are at end of life and need replacing. Google is no longer selling the mini appliances, and buying the big appliance is not cost beneficial. http://search.richmond.edu/ We would run a solr replacement on Linux (CentOS, Red Hat, similar) with OpenJDK or Oracle Java. Background == ~130 sites only ~12,000 pages (at a depth of 3) probably ~40,000 pages if we go to a depth of 4 We use key matches a lot. In solr terms these are elevated documents (elevations) We would code a search query form in php and wrap it into our design (http://www.richmond.edu) I have played with and love lucidworks and know that their $ solution works for our use cases, but the cost model is not attractive for such a small collection. So with solr, what are my open source options, and what are people's experiences crawling and indexing web sites with solr + a crawler? I understand Solr does not ship with a crawler, so getting one working would be first up. We can code in Java, PHP, Python etc. if we have to, but we don't want to write a crawler if we can avoid it. Thanks in advance for any information. -- Eric Palmer Web Services U of Richmond
Re: [SolrCloud-Solrj] Document router problem connecting to Zookeeper ensemble
Hi Alessandro, What version of Solr are you running, and what's the version of SolrJ? I am guessing they are different. On Wed, Oct 30, 2013 at 8:32 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: I have a ZooKeeper ensemble hosted on one Amazon server. Using CloudSolrServer and trying to connect, I obtain this really unusual error: 969 [main] INFO org.apache.solr.common.cloud.ConnectionManager - Client is connected to ZooKeeper 1043 [main] INFO org.apache.solr.common.cloud.ZkStateReader - Updating cluster state from ZooKeeper... Exception in thread "main" org.apache.solr.common.SolrException: Unknown document router '{name=implicit}' at org.apache.solr.common.cloud.DocRouter.getDocRouter(DocRouter.java:46) Although in my collection I have the compositeId routing strategy (from the clusterState.json). This is how I instantiate the server:

  CloudSolrServer server;
  server = new CloudSolrServer("ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2181", "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2182", "ec2-xx.xx.xx.eu-west-1.compute.amazonaws.com:2183");
  server.setDefaultCollection("example");
  SolrPingResponse ping = server.ping();

Any hint? -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England -- Anshum Gupta http://www.anshumgupta.net
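[Editorial note: two things worth checking in the snippet above, offered as hedged suggestions rather than a diagnosis. First, as far as I can tell CloudSolrServer takes the ZooKeeper ensemble as a single comma-delimited zkHost string; there is no three-string constructor. Second, the 'Unknown document router' error is a classic symptom of a SolrJ jar older than the cluster it talks to, so aligning the two versions is the first thing to try. A sketch, with hostnames abbreviated as in the original post:]

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.client.solrj.response.SolrPingResponse;

  public class PingCloud {
    public static void main(String[] args) throws Exception {
      // one comma-delimited zkHost string, not separate arguments
      String zkHost = "ec2-xx.eu-west-1.compute.amazonaws.com:2181,"
                    + "ec2-xx.eu-west-1.compute.amazonaws.com:2182,"
                    + "ec2-xx.eu-west-1.compute.amazonaws.com:2183";
      CloudSolrServer server = new CloudSolrServer(zkHost);
      server.setDefaultCollection("example");
      SolrPingResponse ping = server.ping();
      System.out.println("ping status: " + ping.getStatus());
      server.shutdown();
    }
  }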
AJAX Solr returning the default wildcard *:* and not what I query
I am currently integrating the JavaScript framework AJAX Solr with my domain. I am trying to query words such as 'doctorate' or 'programs', but the console reports only '*:*', the default wildcard. Just curious if anyone has any helpful hints? The problem can be seen in detail on Stackoverflow, http://stackoverflow.com/questions/19691535/ajax-solr-returning-the-default-wildcard-and-not-what-i-query Thank you, Mark
Re: Evaluating a SOLR index with trec_eval
Hi Michael, I know you are asking about Solr, but in case you haven't seen it, Ian Soboroff has a nice little demo for Lucene: https://github.com/isoboroff/trec-demo. There is also the lucene benchmark code: http://lucene.apache.org/core/4_5_1/benchmark/org/apache/lucene/benchmark/quality/package-summary.html Otherwise, all I can think of is writing an app layer that keeps track of the id, sends the query to Solr, parses the search results and spits out results in the TREC format. I'd love to find some open-source code that does what you ask. I did a quick and dirty version of something like that for the INEX book track. I'll see if I can find the code and if it is in any shape to share. Tom Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-scale-search On Wed, Oct 30, 2013 at 10:52 AM, Michael Preminger michael.premin...@hioa.no wrote: Hello! Is there a simple way to evaluate a SOLR index with TREC_EVAL? I mean: * preparing a query file in some format Solr will understand, but where each query has an ID * getting results out in TREC format, with these query IDs attached Thanks Michael
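[Editorial note: along the lines of the app layer Tom describes, a rough SolrJ sketch. Topic IDs, field names and the run tag are placeholders; the output is the standard trec_eval run format, one "qid Q0 docno rank score tag" line per hit.]

  import java.io.PrintWriter;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class TrecRunWriter {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
      String[][] topics = { { "301", "international organized crime" },
                            { "302", "poliomyelitis and post polio" } };
      PrintWriter out = new PrintWriter("solr.run");
      for (String[] topic : topics) {
        SolrQuery q = new SolrQuery(topic[1]);
        q.setRows(1000);
        q.setFields("id", "score");      // doc id and score are all TREC needs
        QueryResponse rsp = solr.query(q);
        int rank = 1;
        for (SolrDocument doc : rsp.getResults()) {
          out.printf("%s Q0 %s %d %s myrun%n",
              topic[0], doc.getFieldValue("id"), rank++, doc.getFieldValue("score"));
        }
      }
      out.close();
      solr.shutdown();
    }
  }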
SV: Evaluating a SOLR index with trec_eval
Hi, Tom! Thanks a lot. I'll check Ian's stuff and anticipate yours... As you know, ProveIt is now terminated as an INEX track, but we still hope to write a paper for a journal summarizing what was done, and it would be nice to have you on. AND, you'll be happy (or shocked) to know that this week I used your INEX paper from 2011 as an example of practice-near research in a seminar I was running for the students, and they had an assignment to write a reflection note in advance in which they relate your work to their own assignment. Michael Fra: Tom Burton-West [tburt...@umich.edu] Sendt: 30. oktober 2013 20:26 To: solr-user@lucene.apache.org Emne: Re: Evaluating a SOLR index with trec_eval Hi Michael, I know you are asking about Solr, but in case you haven't seen it, Ian Soboroff has a nice little demo for Lucene: https://github.com/isoboroff/trec-demo. There is also the lucene benchmark code: http://lucene.apache.org/core/4_5_1/benchmark/org/apache/lucene/benchmark/quality/package-summary.html Otherwise, all I can think of is writing an app layer that keeps track of the id, sends the query to Solr, parses the search results and spits out results in the TREC format. I'd love to find some open-source code that does what you ask. I did a quick and dirty version of something like that for the INEX book track. I'll see if I can find the code and if it is in any shape to share. Tom Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-scale-search On Wed, Oct 30, 2013 at 10:52 AM, Michael Preminger michael.premin...@hioa.no wrote: Hello! Is there a simple way to evaluate a SOLR index with TREC_EVAL? I mean: * preparing a query file in some format Solr will understand, but where each query has an ID * getting results out in TREC format, with these query IDs attached Thanks Michael
SV: Evaluating a SOLR index with trec_eval
... AND apologies to everyone for erroneously posting irrelevant stuff on the list. Michael Fra: Michael Preminger [michael.premin...@hioa.no] Sendt: 30. oktober 2013 20:34 To: solr-user@lucene.apache.org Emne: SV: Evaluating a SOLR index with trec_eval Hi, Tom! Thanks a lot. I'll check Ian's stuff and anticipate yours... As you know, ProveIt is now terminated as an INEX track, but we still hope to write a paper for a journal summarizing what was done, and it would be nice to have you on. AND, you'll be happy (or shocked) to know that this week I used your INEX paper from 2011 as an example of practice-near research in a seminar I was running for the students, and they had an assignment to write a reflection note in advance in which they relate your work to their own assignment. Michael Fra: Tom Burton-West [tburt...@umich.edu] Sendt: 30. oktober 2013 20:26 To: solr-user@lucene.apache.org Emne: Re: Evaluating a SOLR index with trec_eval Hi Michael, I know you are asking about Solr, but in case you haven't seen it, Ian Soboroff has a nice little demo for Lucene: https://github.com/isoboroff/trec-demo. There is also the lucene benchmark code: http://lucene.apache.org/core/4_5_1/benchmark/org/apache/lucene/benchmark/quality/package-summary.html Otherwise, all I can think of is writing an app layer that keeps track of the id, sends the query to Solr, parses the search results and spits out results in the TREC format. I'd love to find some open-source code that does what you ask. I did a quick and dirty version of something like that for the INEX book track. I'll see if I can find the code and if it is in any shape to share. Tom Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-scale-search On Wed, Oct 30, 2013 at 10:52 AM, Michael Preminger michael.premin...@hioa.no wrote: Hello! Is there a simple way to evaluate a SOLR index with TREC_EVAL? I mean: * preparing a query file in some format Solr will understand, but where each query has an ID * getting results out in TREC format, with these query IDs attached Thanks Michael
Re: AJAX Solr returning the default wildcard *:* and not what I query
On 10/30/2013 1:26 PM, Reyes, Mark wrote: I am currently integrating the JavaScript framework AJAX Solr with my domain. I am trying to query words such as 'doctorate' or 'programs', but the console reports only '*:*', the default wildcard. Just curious if anyone has any helpful hints? The problem can be seen in detail on Stackoverflow, http://stackoverflow.com/questions/19691535/ajax-solr-returning-the-default-wildcard-and-not-what-i-query We would have to know what Solr is actually receiving from your app. The Solr log should have an entry for every query you do, and it includes all of the parameters for that query. This is *not* the Logging tab in the admin UI, but the actual logfile. On Solr 4.3 and later with the example logging setup, this is typically $CWD/logs/solr.log. Thanks, Shawn
ReplicationHandler - SnapPull failed to download a file completely.
We are continuously getting this exception during replication from master to slave. Our index size is 9.27G and we are trying to replicate a slave from scratch. It's a different file each time; sometimes we get to 60% replication before it fails and sometimes only 10% - we never managed a successful replication.

30 Oct 2013 18:38:52,884 [explicit-fetchindex-cmd] ERROR ReplicationHandler - SnapPull failed :org.apache.solr.common.SolrException: Unable to download _aa7_Lucene41_0.tim completely. Downloaded 0!=1054090 at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1244) at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1124) at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:719) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:397) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317) at org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:218)

I read in some thread that there was a related bug in Solr 4.1, but we are using Solr 4.3 and tried with 4.5.1 also. It seems that DirectoryFileFetcher sometimes cannot download a file; the file is downloaded to the slave with size zero. We are running in a test environment where bandwidth is high. This is the master setup:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">startup</str>
      <str name="confFiles">stopwords.txt,spellings.txt,synonyms.txt,protwords.txt,elevate.xml,currency.xml</str>
      <str name="commitReserveDuration">00:00:50</str>
    </lst>
  </requestHandler>

and the slave setup:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://solr-master.saltdev.sealdoc.com:8081/solr-master</str>
      <str name="httpConnTimeout">15</str>
      <str name="httpReadTimeout">30</str>
    </lst>
  </requestHandler>
Re: AJAX Solr returning the default wildcard *:* and not what I query
solr.log file per Solr 4.5: http://pastebin.com/zSpERJZA Thanks Shawn, Mark On 10/30/13, 12:44 PM, Shawn Heisey s...@elyograg.org wrote: On 10/30/2013 1:26 PM, Reyes, Mark wrote: I am currently integrating the JavaScript framework AJAX Solr with my domain. I am trying to query words such as 'doctorate' or 'programs', but the console reports only '*:*', the default wildcard. Just curious if anyone has any helpful hints? The problem can be seen in detail on Stackoverflow, http://stackoverflow.com/questions/19691535/ajax-solr-returning-the-default-wildcard-and-not-what-i-query We would have to know what Solr is actually receiving from your app. The Solr log should have an entry for every query you do, and it includes all of the parameters for that query. This is *not* the Logging tab in the admin UI, but the actual logfile. On Solr 4.3 and later with the example logging setup, this is typically $CWD/logs/solr.log. Thanks, Shawn
Re: ReplicationHandler - SnapPull failed to download a file completely.
On 10/30/2013 1:49 PM, Shalom Ben-Zvi Kazaz wrote: We are continuously getting this exception during replication from master to slave. Our index size is 9.27G and we are trying to replicate a slave from scratch. It's a different file each time; sometimes we get to 60% replication before it fails and sometimes only 10% - we never managed a successful replication. [snip] This is the master setup:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">startup</str>
      <str name="confFiles">stopwords.txt,spellings.txt,synonyms.txt,protwords.txt,elevate.xml,currency.xml</str>
      <str name="commitReserveDuration">00:00:50</str>
    </lst>
  </requestHandler>

I assume that you're probably doing commits fairly often, resulting in a lot of merge activity that frequently deletes segments. That commitReserveDuration parameter needs to be made larger. I would imagine that it takes a lot more than 50 seconds to do the replication - even if you've got an extremely fast network, replicating 9.27GB probably takes several minutes. From the wiki page on replication: If your commits are very frequent and your network is particularly slow, you can tweak an extra attribute <str name="commitReserveDuration">00:00:10</str>. This is roughly the time taken to download 5MB from master to slave. Default is 10 secs. http://wiki.apache.org/solr/SolrReplication#Master You've said that your network is not slow, but with that much data, all networks are slow. Thanks, Shawn
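[Editorial note: concretely, applying that advice to the master config posted earlier might look like the sketch below. The 00:10:00 value is only a guess and should be sized to however long a full ~9 GB pull actually takes.]

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">startup</str>
      <str name="confFiles">stopwords.txt,spellings.txt,synonyms.txt,protwords.txt,elevate.xml,currency.xml</str>
      <!-- reserve segment files long enough for a full index pull -->
      <str name="commitReserveDuration">00:10:00</str>
    </lst>
  </requestHandler>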
Re: AJAX Solr returning the default wildcard *:* and not what I query
On 10/30/2013 1:55 PM, Reyes, Mark wrote: solr.log file per Solr 4.5 http://pastebin.com/zSpERJZA Your queries all look like the following, with different numbers for the parameters json.wrf and _ (underscore), which I've never seen before and which I assume Solr just ignores: {json.wrf=jQuery171015135826403275132_1383154109139&q=*:*&_=1383154109332&wt=json} Those query parameters include q=*:*, so Solr is returning what it was asked for. You'll need to figure out why your ajax code is not sending q=doctorate or q=programs instead. Thanks, Shawn
Re: Replacing Google Mini Search Appliance with Solr?
Hi Eric, I have also developed mini-applications replacing GSA for some of our clients, using Apache Nutch + Solr to crawl multilingual sites and enable multilingual search. Nutch+Solr is very stable and the Nutch mailing list provides good support. Reference link to start: https://sites.google.com/site/profilerajanimaski/webcrawlers/apache-nutch Thanks Rajani On Thu, Oct 31, 2013 at 12:27 AM, Palmer, Eric epal...@richmond.edu wrote: Markus and Jason, thanks for the info. I will start to research Nutch. On writing a crawler - agreed, it is a rabbit hole. -- Eric Palmer Web Services U of Richmond To report technical issues, obtain technical support or make requests for enhancements please visit http://web.richmond.edu/contact/technical-support.html On 10/30/13 2:53 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Nutch is an excellent option. It should feel very comfortable for people migrating away from the Google appliances. Apache Droids is another possible approach, and I've found people using Heritrix or Manifold for various use cases (and usually in combination with other use cases where the extra overhead was worth the trouble). I think the simplest approach will be Nutch... it's absolutely worth taking a shot at it. DO NOT write a crawler! That is a rabbit hole you do not want to peer down into :) On Oct 30, 2013, at 10:54 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Eric, We have also helped some government institutions replace their expensive GSA with open source software. In our case we use Apache Nutch 1.7 to crawl the websites and index to Apache Solr. It is very effective, robust and scales easily with Hadoop if you have to. Nutch may not be the easiest tool for the job but it is very stable, feature rich and has an active community here at Apache. Cheers, -Original message- From:Palmer, Eric epal...@richmond.edu Sent: Wednesday 30th October 2013 18:48 To: solr-user@lucene.apache.org Subject: Replacing Google Mini Search Appliance with Solr? Hello all, Been lurking on the list for a while. Our two Google Mini search appliances, used to index our public web sites, are at end of life and need replacing. Google is no longer selling the mini appliances, and buying the big appliance is not cost beneficial. http://search.richmond.edu/ We would run a solr replacement on Linux (CentOS, Red Hat, similar) with OpenJDK or Oracle Java. Background == ~130 sites only ~12,000 pages (at a depth of 3) probably ~40,000 pages if we go to a depth of 4 We use key matches a lot. In solr terms these are elevated documents (elevations) We would code a search query form in php and wrap it into our design (http://www.richmond.edu) I have played with and love lucidworks and know that their $ solution works for our use cases, but the cost model is not attractive for such a small collection. So with solr, what are my open source options, and what are people's experiences crawling and indexing web sites with solr + a crawler? I understand Solr does not ship with a crawler, so getting one working would be first up. We can code in Java, PHP, Python etc. if we have to, but we don't want to write a crawler if we can avoid it. Thanks in advance for any information. -- Eric Palmer Web Services U of Richmond
Re: AJAX Solr returning the default wildcard *:* and not what I query
As Shawn pointed out, it seems like your client is actually sending out *:* queries all the time. Perhaps you have the wrong id for the search box, or something else that results in your ajax library never actually receiving the actual input value, but I'm just guessing. On Thu, Oct 31, 2013 at 1:25 AM, Reyes, Mark mark.re...@bpiedu.com wrote: solr.log file per Solr 4.5: http://pastebin.com/zSpERJZA Thanks Shawn, Mark On 10/30/13, 12:44 PM, Shawn Heisey s...@elyograg.org wrote: On 10/30/2013 1:26 PM, Reyes, Mark wrote: I am currently integrating the JavaScript framework AJAX Solr with my domain. I am trying to query words such as 'doctorate' or 'programs', but the console reports only '*:*', the default wildcard. Just curious if anyone has any helpful hints? The problem can be seen in detail on Stackoverflow, http://stackoverflow.com/questions/19691535/ajax-solr-returning-the-default-wildcard-and-not-what-i-query We would have to know what Solr is actually receiving from your app. The Solr log should have an entry for every query you do, and it includes all of the parameters for that query. This is *not* the Logging tab in the admin UI, but the actual logfile. On Solr 4.3 and later with the example logging setup, this is typically $CWD/logs/solr.log. Thanks, Shawn -- Anshum Gupta http://www.anshumgupta.net
Re: Problem querying with edismax and hyphens
I too have come across this same exact problem. One thing that I have found is that with autoGeneratePhraseQueries=true, you can find the case where your index has 'z score' and your query is z-score, but with false it will not find it. As to your specific problem with the single token zscore in the index and z-score as the query, I'm still stumped. Hopefully someone else can answer this question? On Wed, Oct 30, 2013 at 11:56 AM, Vardhan Dharnidharka vardhan1...@hotmail.com wrote: Hi, The query z-score doesn't match a doc with zscore in the index. The analysis tool shows that this query would match this data in the index, but it's the edismax query parser step that seems to screw things up. Is there some combination of autoGeneratePhraseQueries, WordDelimiterFilterFactory parameters, and/or something else I can change or add to generically make the query match without modifying the mm? i.e. without adding a rule to specifically synonymize or split the term zscore with some dictionary of words. The query I want to match but doesn't: z-score mm=-30% In the index: zscore The analyzer:

  <fieldType autoGeneratePhraseQueries="false" class="solr.TextField" name="lowStopText" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter catenateAll="1" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnCaseChange="0" splitOnNumerics="0" types="wdfftypes.txt"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter catenateAll="1" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnCaseChange="0" splitOnNumerics="0" types="wdfftypes.txt"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>
  </fieldType>

The parsed edismax query with autoGeneratePhraseQueries=true: +(def_term:"(z-score z) (score zscore)") The parsed edismax query with autoGeneratePhraseQueries=false: +(((def_term:z-score def_term:z def_term:score def_term:zscore)~3)) Thanks Vardhan
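[Editorial note: one workaround worth testing - a sketch, not a verified fix. If the query-time WordDelimiterFilter stops generating the split parts and keeps only the catenated token, z-score analyzes to the single token zscore and the mm accounting never sees the extra clauses. The tradeoff is that z-score would then only match documents indexed with zscore (or a catenated z-score), not the two-word form z score.]

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- collapse delimited parts into a single catenated token at query time -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0" catenateAll="1"
            preserveOriginal="0" splitOnCaseChange="0"
            splitOnNumerics="0" types="wdfftypes.txt"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>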
Re: Replacing Google Mini Search Appliance with Solr?
Thanks for the link Sent from my iPhone On Oct 30, 2013, at 4:06 PM, Rajani Maski rajinima...@gmail.com wrote: Hi Eric, I have also developed mini-applications replacing GSA for some of our clients, using Apache Nutch + Solr to crawl multilingual sites and enable multilingual search. Nutch+Solr is very stable and the Nutch mailing list provides good support. Reference link to start: https://sites.google.com/site/profilerajanimaski/webcrawlers/apache-nutch Thanks Rajani On Thu, Oct 31, 2013 at 12:27 AM, Palmer, Eric epal...@richmond.edu wrote: Markus and Jason, thanks for the info. I will start to research Nutch. On writing a crawler - agreed, it is a rabbit hole. -- Eric Palmer Web Services U of Richmond To report technical issues, obtain technical support or make requests for enhancements please visit http://web.richmond.edu/contact/technical-support.html On 10/30/13 2:53 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Nutch is an excellent option. It should feel very comfortable for people migrating away from the Google appliances. Apache Droids is another possible approach, and I've found people using Heritrix or Manifold for various use cases (and usually in combination with other use cases where the extra overhead was worth the trouble). I think the simplest approach will be Nutch... it's absolutely worth taking a shot at it. DO NOT write a crawler! That is a rabbit hole you do not want to peer down into :) On Oct 30, 2013, at 10:54 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Eric, We have also helped some government institutions replace their expensive GSA with open source software. In our case we use Apache Nutch 1.7 to crawl the websites and index to Apache Solr. It is very effective, robust and scales easily with Hadoop if you have to. Nutch may not be the easiest tool for the job but it is very stable, feature rich and has an active community here at Apache. Cheers, -Original message- From:Palmer, Eric epal...@richmond.edu Sent: Wednesday 30th October 2013 18:48 To: solr-user@lucene.apache.org Subject: Replacing Google Mini Search Appliance with Solr? Hello all, Been lurking on the list for a while. Our two Google Mini search appliances, used to index our public web sites, are at end of life and need replacing. Google is no longer selling the mini appliances, and buying the big appliance is not cost beneficial. http://search.richmond.edu/ We would run a solr replacement on Linux (CentOS, Red Hat, similar) with OpenJDK or Oracle Java. Background == ~130 sites only ~12,000 pages (at a depth of 3) probably ~40,000 pages if we go to a depth of 4 We use key matches a lot. In solr terms these are elevated documents (elevations) We would code a search query form in php and wrap it into our design (http://www.richmond.edu) I have played with and love lucidworks and know that their $ solution works for our use cases, but the cost model is not attractive for such a small collection. So with solr, what are my open source options, and what are people's experiences crawling and indexing web sites with solr + a crawler? I understand Solr does not ship with a crawler, so getting one working would be first up. We can code in Java, PHP, Python etc. if we have to, but we don't want to write a crawler if we can avoid it. Thanks in advance for any information. -- Eric Palmer Web Services U of Richmond
Re: Computing Results So That They are Returned in Search Results
Also note that function queries only return numbers (given their origin in scoring). They cannot be used to create virtual string or text fields. Upayavira On Wed, Oct 30, 2013, at 05:19 PM, Jack Krupansky wrote: A function query simply returns a calculated result based on existing data - no new fields required. Did you actually want to precompute a value, store it in the index, and then query on it? If so, you could do that at indexing time with a custom or scripted update processor. Flesh out an example of exactly what you want. -- Jack Krupansky -Original Message- From: Alejandro Calbazana Sent: Wednesday, October 30, 2013 12:46 PM To: solr-user@lucene.apache.org Subject: Re: Computing Results So That They are Returned in Search Results Sounds really close to what I'm looking for, but this sounds like it would result in a new field on a document (or a new value for a field defined to hold the result of a function). Would it be possible for a function query to produce a new document so that I can associate the computed value with it? Thanks, Alejandro On Wed, Oct 30, 2013 at 12:05 PM, Jack Krupansky j...@basetechnology.com wrote: You could create a custom value source and then use it in a function query embedded in your return fields list (fl). So, the function query could use a function (value source) that takes a field, fetches its value, performs some arbitrary calculation, and then returns that value. fl=id,name,my-func(field1),my-func(field2) -- Jack Krupansky -Original Message- From: Alejandro Calbazana Sent: Wednesday, October 30, 2013 10:10 AM To: solr-user@lucene.apache.org Subject: Computing Results So That They are Returned in Search Results I'd like to throw out a design question and see if it's possible to solve this with Solr. I have a set of data that is computed that I'd like to make searchable. Ideally, I'd like to have all documents indexed and call it a day, but the nature of the data is such that it needs to be computed given a definition. I'm interested in searching on definitions and then creating results on the fly that are calculated based on something embedded in the definition. Is it possible to embed this calculation logic into Solr's result handling process? I know this sounds exotic, but the nature of the data is such that I can't index these calculated documents because I don't know what the boundary is and specifying an arbitrary number isn't ideal. Has anyone run across something like this? Thanks, Alejandro
Re: Unable to add mahout classifier
(13/10/30 22:09), lovely kasi wrote: Hi, I made a few changes to solrconfig.xml, created a jar file, added it to the lib folder of Solr and tried to start it. The changes in solrconfig.xml are:

  <updateRequestProcessorChain name="mahoutclassifier" default="true">
    <processor class="com.mahout.solr.classifier.CategorizeDocumentFac">
      <str name="inputField">LEAD_NOTES</str>
      <str name="outputField">category</str>
      <str name="defaultCategory">Others</str>
      <str name="model">naiveBayesModel</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
    <processor class="solr.LogUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

What is com.mahout.solr.classifier.CategorizeDocumentFac? Is it a classifier delivered by the Solr community? koji -- http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
Re: Need idea to standardize keywords - ring tone vs ringtone
I tried using synonyms, but that doesn't actually change the stored text, just the indexed value. I need a way to change the raw value stored in Solr. Maybe I should use a custom update processor to standardize the data. -- View this message in context: http://lucene.472066.n3.nabble.com/Need-idea-to-standardize-keywords-ring-tone-vs-ringtone-tp4097794p4098530.html Sent from the Solr - User mailing list archive at Nabble.com.
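[Editorial note: a minimal sketch of such an update processor. The keywords field name, the factory name and the normalization rule are all illustrative. It would be wired into an updateRequestProcessorChain ahead of solr.RunUpdateProcessorFactory, e.g. <processor class="com.example.NormalizeKeywordsFactory"/>, so the rewritten value is what gets both indexed and stored.]

  import java.io.IOException;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  // Rewrites the raw field value before it is indexed AND stored.
  public class NormalizeKeywordsFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
        SolrQueryResponse rsp, UpdateRequestProcessor next) {
      return new UpdateRequestProcessor(next) {
        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          Object v = doc.getFieldValue("keywords");   // hypothetical field
          if (v != null) {
            doc.setField("keywords",
                v.toString().replaceAll("(?i)ring\\s+tone", "ringtone"));
          }
          super.processAdd(cmd);   // hand on to the rest of the chain
        }
      };
    }
  }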
Re: Return the synonyms as part of Solr response
Hi Siva, (13/10/30 18:12), sivaprasad wrote: Hi, We have a requirement where we need to send the matched synonyms as part of the Solr response. I don't think that Solr has such a function. Do we need to customize the Solr response handler to do this? So the answer is yes. koji -- http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
Re: Configuration and specs to index a 1 terabyte (TB) repository
A flat distribution of queries is a poor test. Real queries have a zipf distribution. The flat distribution will get almost no benefit from caching, so it will give too low a number and stress disk IO too much. The 99th percentile is probably the same for both distributions, because that is dominated by rare queries. Real query loads will get a much smaller boost from SSD in the median and up to about the 75th percentile. wunder Search guy for Netflix and now Chegg On Oct 30, 2013, at 1:43 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Tue, 2013-10-29 at 14:24 +0100, eShard wrote: I have a 1 TB repository with approximately 500,000 documents (that will probably grow from there) that needs to be indexed. As Shawn points out, that isn't telling us much. If you describe the documents, how and how often you index and how you query them, it will help a lot. Let me offer some observations from a related project we are starting at Statsbiblioteket. We are planning to index 20 TB of harvested web resources (*.dk from the last 8 years, or at least the resources our crawlers sunk their tentacles into). We have two text indexes generated from about 1% and 2% of that corpus, respectively. They are 200GB and 420GB in size and contain ~75 million and (whoops, offline, so remembering/guessing here) ~150 million documents. For testing purposes we issued simple searches: 2-4 OR'ed terms, picked at random from a Danish dictionary. One of our test machines is a 2*8 core Xeon machine with 32GB of RAM (about ~12GB free for caching) and SSD as storage. We had room for a 2-shard cloud on the SSDs, so searches were issued to a 2*200GB index of a total of 150 million documents. CentOS/Solr 4.3. Hammering that machine with 32 threads gave us a median response time of 200ms and a 99-percentile of 500-800 ms (depending on test run); a single thread has a median of 30ms and a 99-percentile of 70-130ms. CPU load peaked at 300-400% and IOWait at 30-40%, but was not closely monitored. Our current vision is to shard the projected 20TB index into ~800GB or ~1TB chunks (depending on which drives we choose) and put one shard on each physical SSD, thereby sidestepping the whole RAID TRIM problem. We do have the great luxury of running nightly batch index updates on a single shard instead of continuous updates. We would probably go for smaller shards if they were all updated continuously. The projected price for the full setup ranges from $50,000-$100,000, depending on where we land on the off-the-shelf to enterprise scale. (I need to write a blog post on this.) With that in mind, I urge you to do some testing on a machine with SSD and modest memory vs. a traditional spinning drives and monster-memory machine. - Toke Eskildsen, State and University Library, Denmark
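[Editorial note: for anyone building such a load test, a small sketch of zipf-style query sampling - inverse-CDF over a popularity-ranked query list. The exponent s=1.0 is a common assumption, not something from this thread.]

  import java.util.List;
  import java.util.Random;

  // Draws queries zipf-style (rank r weighted 1/r^s) instead of uniformly.
  public class ZipfQuerySampler {
    private final double[] cdf;            // cumulative weights, normalized to 1.0
    private final List<String> queries;    // ordered most- to least-popular
    private final Random rnd = new Random();

    public ZipfQuerySampler(List<String> queries, double s) {
      this.queries = queries;
      this.cdf = new double[queries.size()];
      double sum = 0;
      for (int r = 1; r <= queries.size(); r++) {
        sum += 1.0 / Math.pow(r, s);       // zipf weight of rank r
        cdf[r - 1] = sum;
      }
      for (int i = 0; i < cdf.length; i++) {
        cdf[i] /= sum;
      }
    }

    public String next() {                 // inverse-CDF sampling
      double u = rnd.nextDouble();
      for (int i = 0; i < cdf.length; i++) {
        if (u <= cdf[i]) return queries.get(i);
      }
      return queries.get(queries.size() - 1);
    }
  }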
Re: Computing Results So That They are Returned in Search Results
So here is my use case with a little more detail. I'm working with recurring events. Each event has an expression associated with it that defines its recurrence pattern. For example, monthly, daily, yearly... The event has metadata associated with it that is searchable. When a user performs a search, they can match on various metadata fields, but the query can also span a range of dates. If a match occurs, I'd like to unwind the expression into the instances specified by the pattern and return these virtual instances as results. Right now, I'm post processing data to hammer out the results that fit the window of time specified in the query, but this moves sorting and pagination out of the Solr tier. I'd like to see if I can get it to stay there :) Post processing also prohibits me from faceting, which would be extremely useful. I'm trying to avoid heavy post processing if I can. Given the nature of the data, it's not really feasible for me to pre-assemble instance data and index it, since I don't know the window of time a user will be looking at. Thanks, Alejandro On Wed, Oct 30, 2013 at 6:35 PM, Upayavira u...@odoko.co.uk wrote: Also note that function queries only return numbers (given their origin in scoring). They cannot be used to create virtual string or text fields. Upayavira On Wed, Oct 30, 2013, at 05:19 PM, Jack Krupansky wrote: A function query simply returns a calculated result based on existing data - no new fields required. Did you actually want to precompute a value, store it in the index, and then query on it? If so, you could do that at indexing time with a custom or scripted update processor. Flesh out an example of exactly what you want. -- Jack Krupansky -Original Message- From: Alejandro Calbazana Sent: Wednesday, October 30, 2013 12:46 PM To: solr-user@lucene.apache.org Subject: Re: Computing Results So That They are Returned in Search Results Sounds really close to what I'm looking for, but this sounds like it would result in a new field on a document (or a new value for a field defined to hold the result of a function). Would it be possible for a function query to produce a new document so that I can associate the computed value with it? Thanks, Alejandro On Wed, Oct 30, 2013 at 12:05 PM, Jack Krupansky j...@basetechnology.com wrote: You could create a custom value source and then use it in a function query embedded in your return fields list (fl). So, the function query could use a function (value source) that takes a field, fetches its value, performs some arbitrary calculation, and then returns that value. fl=id,name,my-func(field1),my-func(field2) -- Jack Krupansky -Original Message- From: Alejandro Calbazana Sent: Wednesday, October 30, 2013 10:10 AM To: solr-user@lucene.apache.org Subject: Computing Results So That They are Returned in Search Results I'd like to throw out a design question and see if it's possible to solve this with Solr. I have a set of data that is computed that I'd like to make searchable. Ideally, I'd like to have all documents indexed and call it a day, but the nature of the data is such that it needs to be computed given a definition. I'm interested in searching on definitions and then creating results on the fly that are calculated based on something embedded in the definition. Is it possible to embed this calculation logic into Solr's result handling process?
I know this sounds exotic, but the nature of the data is such that I can't index these calculated documents because I don't know what the boundary is and specifying an arbitrary number isn't ideal. Has anyone run across something like this? Thanks, Alejandro
How to get similarity score between 0 and 1 not relative score
Hi, We have a requirement where the user would like to see a score (between 0 and 1) that tells how close the input search string is to the result string. So if the input was very close but not an exact match, the score could be .90, etc. I do understand that we can take the score from Solr divided by the highest score, but that will always show 1 even if the match was not exact. Regards, Susheel
Re: How to get similarity score between 0 and 1 not relative score
Hi Susheel, Have a look at this: http://wiki.apache.org/lucene-java/ScoresAsPercentages You may really want to reconsider doing that. On Thu, Oct 31, 2013 at 9:41 AM, sushil sharma sushil2...@yahoo.co.in wrote: Hi, We have a requirement where the user would like to see a score (between 0 and 1) that tells how close the input search string is to the result string. So if the input was very close but not an exact match, the score could be .90, etc. I do understand that we can take the score from Solr divided by the highest score, but that will always show 1 even if the match was not exact. Regards, Susheel -- Anshum Gupta http://www.anshumgupta.net
Solr grouping performance problem
Hi, I've recently upgraded to SolrCloud (4.4) from Master-Slave mode. One of the changes I made in the queries is adding grouping to remove duplicate results. The grouping is done on a specific field. But the change seems to have a huge effect on query performance: the group option made queries roughly 10 times slower. For example, this query takes 1 sec to execute; the number of results is around 105387. http://localhost:8083/solr/browse?fq=language:(english)&wt=xml&rows=10&start=0&fq=(ContentGroup-local:Learn Explore OR ADSKContentGroup-local:Getting Started)&q=line&sort=score desc&group=true&group.field=dedup&group.ngroups=true If I exclude the group option, it comes down to 190ms: http://localhost:8083/solr/browse?fq=language:(english)&wt=xml&rows=10&start=0&fq=(ContentGroup-local:Learn Explore OR ADSKContentGroup-local:Getting Started)&q=line I'm running this query against an 8 million doc index. I have 2 shards with 1 replica each, each running on an m1.xlarge EC2 instance with 8GB allocated memory. Is this a known issue, or am I missing something that is making this query expensive? I bumped into this JIRA -- https://issues.apache.org/jira/browse/SOLR-5027 -- which talks about CollapsingQParserPlugin as an alternative to grouping, but that seems to be available only in 4.6. Just wondering if it can be an alternative in my case and whether it's possible to apply it as a patch on 4.4. Any pointers will be appreciated. - Thanks, Shamik
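[Editorial note: for reference, the CollapsingQParserPlugin from that JIRA replaces the group parameters with a collapsing filter query, roughly along these lines. This is 4.6 syntax; whether the patch backports cleanly to 4.4 would have to be tested, and there is no direct group.ngroups equivalent - numFound simply counts the collapsed groups.]

  q=line&fq=language:(english)&fq={!collapse field=dedup}&sort=score desc&rows=10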
Re: Grouping performance problem
Bumping up this thread as I'm facing a similar issue. Any solution? -- View this message in context: http://lucene.472066.n3.nabble.com/Grouping-performance-problem-tp3995245p4098566.html Sent from the Solr - User mailing list archive at Nabble.com.