Re: Anyone else see this error when running unit tests?
Okay, so I think I found a solution. If you are a Maven user and don't mind forcing the test codec to Lucene40, then do the following: add this to your pom.xml under the build > pluginManagement > plugins section:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <version>2.13</version>
    <configuration>
      <argLine>-Dtests.codec=Lucene40</argLine>
    </configuration>
  </plugin>

If you are running in Eclipse, simply add this as a VM argument. The default test codec is set to random, which means there is a possibility of picking Lucene3x if some random variable is 2 and other conditions are met. For me, my test-framework jar must not have been ahead of the lucene one (b/c I don't control the classpath order, and honestly this shouldn't be a requirement to run a test), so it periodically bombed. This little fix seems to have helped, provided that you don't care about Lucene3x vs Lucene40 for your tests (I am on Lucene40, so it's fine for me). HTH! Amit

On Mon, Feb 4, 2013 at 6:18 PM, Roman Chyla roman.ch...@gmail.com wrote:
Me too, it fails randomly with test classes. We use Solr 4.0 for testing, no maven, only ant. --roman

On 4 Feb 2013 20:48, Mike Schultz mike.schu...@gmail.com wrote:
Yes. Just today actually. I had some unit tests based on AbstractSolrTestCase which worked in 4.0, but in 4.1 they would fail intermittently with that error message. The key to this behavior is found by looking at the code in the Lucene class TestRuleSetupAndRestoreClassEnv. I don't understand it completely, but there are a number of random code paths through there. The following helped me get around the problem, at least in the short term:

  @org.apache.lucene.util.LuceneTestCase.SuppressCodecs({"Lucene3x","Lucene40"})
  public class CoreLevelTest extends AbstractSolrTestCase {

I also needed to call this inside my setUp() method; in 4.0 this wasn't required:

  initCore("solrconfig.xml", "schema.xml", "/tmp/my-solr-home");
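For those running tests through ant rather than Maven (as in Roman's setup), the same system property can be forwarded to the forked test JVM. A minimal sketch, assuming the tests run through Ant's stock junit task (the path id and directories are placeholders for whatever your build file uses):

  <junit fork="yes">
    <!-- force the randomized test framework to pick the Lucene40 codec -->
    <jvmarg value="-Dtests.codec=Lucene40"/>
    <classpath refid="test.classpath"/>
    <batchtest todir="build/test-results">
      <fileset dir="src/test" includes="**/*Test.java"/>
    </batchtest>
  </junit>

Note that jvmarg only takes effect with fork="yes"; otherwise the tests run inside Ant's own JVM and ignore it.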
Re: replication problems with solr4.1
I may be missing something, but let me go back to your original statements: 1) You build the index once per week from scratch. 2) You replicate this from master to slave.

My understanding of the way replication works is that it's meant to only send along files that are new; if any files with the same name on the master and slave have different sizes, then this is treated as a corruption of sorts, and the slave creates an index.<timestamp> directory and pulls the full index down. This, I think, explains your index.<timestamp> issue, although why the old index/ directory isn't being deleted I'm not sure about. This is why I was asking about OS details, file system details, etc. (perhaps something else is locking that directory, preventing Java from deleting it?).

The second issue is the index generation, which is governed by commits and is represented by the last few characters of the segments_XX file name. When the slave downloads the index and copies the new files, it does a commit to force a new searcher, hence why the slave's generation will be +1 from the master's. The index version is a timestamp, and it may be the case that the version represents the point in time when the index was downloaded to the slave. In general, these details shouldn't matter, because replication is only triggered if the master's version > the slave's version, and the clocks on all servers are synched to some common clock.

Caveat to my answer, however: I have yet to try 4.1, as this is next on my TODO list, so maybe I'll run into the same problem :-) but I wanted to provide some info as I just recently dug through the replication code to understand it better myself. Cheers, Amit

On Wed, Feb 13, 2013 at 11:57 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
OK, then index generation and index version are out of the question when it comes to verifying that master and slave index are in sync. What else is possible? The strange thing is: if the master is 2 or more generations ahead of the slave, then it works! With your logic the slave must _always_ be one generation ahead of the master, because the slave replicates from master and then does an additional commit to recognize the changes on the slave. This implies that the slave acts as follows:
- if the master is one generation ahead, then do an additional commit
- if the master is 2 or more generations ahead, then do _no_ commit
OR
- if the master is 2 or more generations ahead, then do a commit but don't change the generation and version of the index
Can this be true? I would say not really. Regards, Bernd

On 13.02.2013 20:38, Amit Nithian wrote:
Okay, so then that should explain the generation difference of 1 between the master and slave.

On Wed, Feb 13, 2013 at 10:26 AM, Mark Miller markrmil...@gmail.com wrote:
On Feb 13, 2013, at 1:17 PM, Amit Nithian anith...@gmail.com wrote: doesn't it do a commit to force solr to recognize the changes?
yes. - Mark
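A quick way to inspect the version and generation numbers discussed above is the ReplicationHandler's HTTP API (assuming the standard /replication handler is registered on both machines; host and core names are placeholders):

  # index version + generation on the master
  http://master:8983/solr/core1/replication?command=indexversion

  # full replication state on the slave, including what was last pulled
  http://slave:8983/solr/core1/replication?command=details

Comparing the two outputs shows whether a slave is actually lagging, independent of the clock-derived version numbers.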
RE: Why a phrase is getting searched against default fields in solr
It is returning me all the documents which contain the phrase, as it is searching against the default field. My default field is like below:

  <field name="SearchableField" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="Product-Name-*" dest="SearchableField"/>
  <copyField source="Product-Description-*" dest="SearchableField"/>

I have defined SearchableField as the default field. Thanks, Pragyanshis

Date: Wed, 13 Feb 2013 23:18:06 -0800 From: iori...@yahoo.com Subject: Re: Why a phrase is getting searched against default fields in solr To: solr-user@lucene.apache.org
Hi Pragyanshis, What happens when you remove the bq parameter?

--- On Thu, 2/14/13, Pragyanshis Pattanaik pragyans...@outlook.com wrote:
From: Pragyanshis Pattanaik pragyans...@outlook.com Subject: Why a phrase is getting searched against default fields in solr To: solr Forum solr-user@lucene.apache.org Date: Thursday, February 14, 2013, 8:24 AM
Hi, This might be a very silly question, but I want to know why this is happening. If I am using the edismax query parser in solr and passing a query something like below:

  q=IPhone5&wt=xml&edismax=true&qf=Product-Name-0^100&bq=(Product-Rating-0%3A7^300+OR+Product-Rating-0%3A8^400+OR+Product-Rating-0%3A9^500+OR+Product-Rating-0%3A10^600+OR+Product-Rating-0%3A*)

then why is it searching in the default fields? As I am specifying qf, it should search in the fields specified in the qf parameter and boost those documents which have a higher rating. Please correct me if my understanding is wrong. Note: I am using Solr 4.0 Alpha. Thanks, Pragyanshis
Re: Boost Specific Phrase
Hi Hemant, I think your use case would be useful for relevancy tuning. It could be implemented as either a SearchComponent or a QParserPlugin. The edismax query parser has pf2 and pf3 parameters that can remedy this to some degree. An edismax extension will probably be the best place to put it. Similar to https://issues.apache.org/jira/browse/SOLR-4381

--- On Thu, 2/14/13, Hemant Verma hemantverm...@gmail.com wrote:
From: Hemant Verma hemantverm...@gmail.com Subject: Re: Boost Specific Phrase To: solr-user@lucene.apache.org Date: Thursday, February 14, 2013, 7:56 AM
Thanks for the response. The pf parameter actually boosts documents considering all search keywords mentioned in the main query, but I am looking for something which boosts documents considering only a few search keywords from the user query. As per the example, the user query is (project manager in India with 2 yrs experience) and my dictionary contains one entry, 'project manager', which specifies that if the user's query has 'project manager' in it, then boost those documents which contain 'project manager' as an exact match.
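Until such a component exists, a rough approximation is possible at the application layer: when the dictionary matches a phrase in the user's query, add it as a boost query. A sketch (the field names and boost factors are made-up values):

  q=project manager in India with 2 yrs experience
  &defType=edismax
  &qf=title^10 description
  &bq=title:"project manager"^50

bq adds score to documents matching the quoted phrase without restricting the result set, which is close to the boost-on-exact-match behavior described above.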
How to protect Solr 4.1 Admin page?
Hi, I'm sure it's an old question.. I just want to protect the Admin page (/solr) with Basic Authentication, but I haven't found a good answer out there yet. I use Solr 4.1 with Apache Tomcat 7.0.35. Could anyone give me some quick hints or links? Thanks in advance! -- wassalam, [bayu]
Re: How to protect Solr 4.1 Admin page?
On 14 February 2013 14:05, Bayu Widyasanyata bwidyasany...@gmail.com wrote:
Hi, I'm sure it's an old question.. I just want to protect the Admin page (/solr) with Basic Authentication, but I haven't found a good answer out there yet. I use Solr 4.1 with Apache Tomcat 7.0.35. [...]

The easiest way to do this with Tomcat 7 is:
1. Install the manager app, and set up roles in conf/tomcat-users.xml
2. A UserDatabaseRealm should already be defined in conf/server.xml
3. Depending on how you installed Solr, there should be a folder like webapps/solr/WEB-INF/ . In that folder, edit web.xml, and add security-constraint and security-role tags. The entries for the latter should match the entries in step 1.

These links should be of help:
http://tomcat.apache.org/tomcat-7.0-doc/realm-howto.html
http://www.tomcatexpert.com/ask-the-experts/basic-auth-configuration-tomcat-7-https

Regards, Gora
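For step 3, the web.xml additions look roughly like this (a sketch following the standard servlet spec; the role name solr-admin is an example and must match a role defined in conf/tomcat-users.xml in step 1):

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Solr Admin</web-resource-name>
      <url-pattern>/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>solr-admin</role-name>
    </auth-constraint>
  </security-constraint>
  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Solr Admin</realm-name>
  </login-config>
  <security-role>
    <role-name>solr-admin</role-name>
  </security-role>

Note that /* protects everything, including the update and query handlers; narrow the url-pattern if you only want to lock down the admin UI.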
Re: How to protect Solr 4.1 Admin page?
On Thu, Feb 14, 2013 at 3:53 PM, Gora Mohanty g...@mimirtech.com wrote:
3. Depending on how you installed Solr, there should be a folder like webapps/solr/WEB-INF/ . In that folder, edit web.xml, and add security-constraint and security-role tags. The entries for the latter should match the entries in step 1.

One thing that I could not find is the folder webapps/solr/WEB-INF/. I installed the binary Solr distribution. Might it not be created until the webapp is deployed or first accessed..?? I'm not sure... :( since I'm also new to Tomcat deployment. Thanks, -- wassalam, [bayu]
RE: Why a phrase is getting searched against default fields in solr
Hi, instead of edismax=true, can you try defType=edismax? ahmet

--- On Thu, 2/14/13, Pragyanshis Pattanaik pragyans...@outlook.com wrote:
From: Pragyanshis Pattanaik pragyans...@outlook.com Subject: RE: Why a phrase is getting searched against default fields in solr To: solr Forum solr-user@lucene.apache.org Date: Thursday, February 14, 2013, 10:21 AM
It is returning me all the documents which contain the phrase, as it is searching against the default field. My default field is like below:

  <field name="SearchableField" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="Product-Name-*" dest="SearchableField"/>
  <copyField source="Product-Description-*" dest="SearchableField"/>

I have defined SearchableField as the default field. Thanks, Pragyanshis

Date: Wed, 13 Feb 2013 23:18:06 -0800 From: iori...@yahoo.com Subject: Re: Why a phrase is getting searched against default fields in solr To: solr-user@lucene.apache.org
Hi Pragyanshis, What happens when you remove the bq parameter?

--- On Thu, 2/14/13, Pragyanshis Pattanaik pragyans...@outlook.com wrote:
From: Pragyanshis Pattanaik pragyans...@outlook.com Subject: Why a phrase is getting searched against default fields in solr To: solr Forum solr-user@lucene.apache.org Date: Thursday, February 14, 2013, 8:24 AM
Hi, This might be a very silly question, but I want to know why this is happening. If I am using the edismax query parser in solr and passing a query something like below:

  q=IPhone5&wt=xml&edismax=true&qf=Product-Name-0^100&bq=(Product-Rating-0%3A7^300+OR+Product-Rating-0%3A8^400+OR+Product-Rating-0%3A9^500+OR+Product-Rating-0%3A10^600+OR+Product-Rating-0%3A*)

then why is it searching in the default fields? As I am specifying qf, it should search in the fields specified in the qf parameter and boost those documents which have a higher rating. Please correct me if my understanding is wrong. Note: I am using Solr 4.0 Alpha. Thanks, Pragyanshis
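For reference, the corrected request with Ahmet's suggestion applied would look something like this (same parameters as the original query, with edismax=true replaced by defType=edismax, which is the parameter the query parser framework actually honors):

  q=IPhone5&wt=xml&defType=edismax&qf=Product-Name-0^100&bq=(Product-Rating-0%3A7^300+OR+Product-Rating-0%3A8^400+OR+Product-Rating-0%3A9^500+OR+Product-Rating-0%3A10^600+OR+Product-Rating-0%3A*)

With defType=edismax set, qf takes effect and the query no longer falls back to the default field.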
Re: How to protect Solr 4.1 Admin page?
On 14 February 2013 14:42, Bayu Widyasanyata bwidyasany...@gmail.com wrote:
On Thu, Feb 14, 2013 at 3:53 PM, Gora Mohanty g...@mimirtech.com wrote:
3. Depending on how you installed Solr, there should be a folder like webapps/solr/WEB-INF/ . In that folder, edit web.xml, and add security-constraint and security-role tags. The entries for the latter should match the entries in step 1.
One thing that I could not find is the folder webapps/solr/WEB-INF/. I installed the binary Solr distribution. Might it not be created until the webapp is deployed or first accessed..?? I'm not sure... :( since I'm also new to Tomcat deployment.

Presumably, you followed http://wiki.apache.org/solr/SolrTomcat : "Copy the .war file dist/apache-solr-*.war into $SOLR_HOME as solr.war". Instead, remove solr.war, and try adding it through the browser interface of the Tomcat Web Application Manager, as described, e.g., in the section "Deploying Solr with the Tomcat Manager" at http://lucidworks.lucidimagination.com/display/solr/Running+Solr+on+Tomcat . You might need to change the entry for solr/home in webapps/solr/WEB-INF/web.xml . I imagine there is a way of adding web.xml with the other mode of installation, but I am not sure how to do that. Regards, Gora
How-to get date of indexing process
Hi everybody, I am looking for a way to get the date of the last indexing process or commit event that happened on my Solr server. I found a possible solution by adding a timestamp field, for example:

  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

But I would like a solution without modifying the schema of the Solr server. I checked the statistics page but did not find a useful date there. Any ideas? Thanks
RE: How-to get date of indexing process
See: admin/luke?show=index or the admin UI.

-Original message- From: Miguel miguel.valen...@juntadeandalucia.es Sent: Thu 14-Feb-2013 10:45 To: solr-user@lucene.apache.org Subject: How-to get date of indexing process
Hi everybody, I am looking for a way to get the date of the last indexing process or commit event that happened on my Solr server. I found a possible solution by adding a timestamp field, for example:
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
But I would like a solution without modifying the schema of the Solr server. I checked the statistics page but did not find a useful date there. Any ideas? Thanks
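For example (host and core name are placeholders), the Luke handler reports the last-modified time of the index along with the version and segment counts:

  http://localhost:8983/solr/collection1/admin/luke?show=index&wt=json

Look for the lastModified entry in the index section of the response; it reflects the last commit, so no timestamp field is needed.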
Re: Solr 4.1.0 not using solrcore.properties ?
James, I'm not completely sure, and I have not tested the following: entityname.last_index_time might also not be accessible... Daniel

On Thu, Feb 14, 2013 at 12:47 AM, Daniel Rijkhof daniel.rijk...@gmail.com wrote:
James, I debugged it until I found where things go 'wrong'. Apparently the current implementation of VariableResolver does not allow the use of a period '.' in any variable/property key you want to use... It's reserved for namespaces. Personally I would really love to use a period in my variable/property key names, and I see no reason why this should be an issue... So, using for example solr.dataimport.jdbcDriver=org.h2.Driver will not work, while using just jdbcDriver=org.h2.Driver works fine... So I will rename all my properties... but it took me hours to find out why something that used to work stopped working... I have never had problems using periods in any properties file... apparently Solr is the only project that doesn't allow the use of periods... Even if this were documented in a way that people could find, I guess it would be better to just allow periods by changing the implementation of the VariableResolver just a little... 00.43 now... off to bed. Let me know what you think, Daniel

On Wed, Feb 13, 2013 at 6:45 PM, Dyer, James james.d...@ingramcontent.com wrote:
The code that resolves variables in DIH was refactored extensively in 4.1.0. So if you've got a case where it does not resolve the variables properly, please give the details. We can open a JIRA issue and get this fixed. James Dyer Ingram Content Group (615) 213-4311

-Original Message- From: Daniel Rijkhof [mailto:daniel.rijk...@gmail.com] Sent: Wednesday, February 13, 2013 11:09 AM To: solr-user@lucene.apache.org Subject: Re: Solr 4.1.0 not using solrcore.properties ?
I am looking at the source code of 4.1.0 and I cannot find any proof that Solr 4.1.0's DIH would actually use any properties from the solrcore.properties file. I did, however, find that Solr does load my solrcore.properties file... It's strange that this would have been changed. Does anybody have proof that it can still use properties defined in solrcore.properties within the DIH configuration? In that case, please reply... Daniel

On Wed, Feb 13, 2013 at 4:22 PM, Daniel Rijkhof daniel.rijk...@gmail.com wrote:
I have the following problem: I'm upgrading from a nightly build 4.0.* to 4.1.0. My dataimport is configured with ${variables}, which always worked fine until this upgrade. My solrcore.properties file seems to be ignored. Solr.xml:

  <?xml version="1.0" encoding="UTF-8" ?>
  <solr sharedLib="lib" persistent="true">
    <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:}">
      <core default="true" name="hfselectdata" instanceDir="hfselectdata"/>
    </cores>
  </solr>

and in solrhome/hfselectdata/conf/ is the file solrcore.properties. Anybody any suggestions? Greatly appreciated, Daniel
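To spell out the workaround (key and driver names taken from this thread; the url value is only a placeholder): keep the keys in solrcore.properties free of periods,

  # solrcore.properties
  # solr.dataimport.jdbcDriver=org.h2.Driver   <- will NOT resolve in 4.1.0 DIH
  jdbcDriver=org.h2.Driver

and reference them in the DIH config without a namespace prefix:

  <dataSource driver="${jdbcDriver}" url="jdbc:h2:mem:testdb"/>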
Re: How-to get date of indexing process
Thanks Markus, I didn't know that page. It's all I needed. Thanks again

On 14/02/2013 10:47, Markus Jelsma wrote:
See: admin/luke?show=index or the admin UI.
-Original message- From: Miguel miguel.valen...@juntadeandalucia.es Sent: Thu 14-Feb-2013 10:45 To: solr-user@lucene.apache.org Subject: How-to get date of indexing process
Hi everybody, I am looking for a way to get the date of the last indexing process or commit event that happened on my Solr server. I found a possible solution by adding a timestamp field, for example:
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
But I would like a solution without modifying the schema of the Solr server. I checked the statistics page but did not find a useful date there. Any ideas? Thanks
Re: Why SolrInputDocument use a LinkedHashMap
Almost. I did not benchmark it, but I tend to believe this from http://docs.oracle.com/javase/6/docs/api/java/util/LinkedHashMap.html : iteration over the collection-views of a LinkedHashMap requires time proportional to the size of the map, regardless of its capacity. Iteration over a HashMap is likely to be more expensive, requiring time proportional to its capacity. André

On 02/13/2013 06:58 PM, knort wrote:
If the order is not important, using a HashMap offers the same fast iteration on the fields but without having an extra LinkedList.

-- André Bois-Crettez Search technology, Kelkoo http://www.kelkoo.com/
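A minimal, self-contained illustration of the Javadoc's point (the capacity value is arbitrary, and the timings are indicative only):

  import java.util.HashMap;
  import java.util.LinkedHashMap;
  import java.util.Map;

  public class IterationCost {
      public static void main(String[] args) {
          // Both maps hold a single entry, but both were created with a
          // huge initial capacity.
          Map<String, String> hash = new HashMap<String, String>(1000000);
          Map<String, String> linked = new LinkedHashMap<String, String>(1000000);
          hash.put("id", "1");
          linked.put("id", "1");

          // Iterating the HashMap scans all ~1M buckets to find the one
          // entry; the LinkedHashMap just follows its entry linked list.
          long t0 = System.nanoTime();
          for (Map.Entry<String, String> e : hash.entrySet()) { /* no-op */ }
          long t1 = System.nanoTime();
          for (Map.Entry<String, String> e : linked.entrySet()) { /* no-op */ }
          long t2 = System.nanoTime();
          System.out.println("HashMap:       " + (t1 - t0) + " ns");
          System.out.println("LinkedHashMap: " + (t2 - t1) + " ns");
      }
  }

This is why SolrInputDocument can use a LinkedHashMap without paying an iteration penalty for keeping a predictable field order.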
RE: Why a phrase is getting searched against default fields in solr
Yes, I made some changes to the request handler: I added <str name="defType">edismax</str> and removed the df field specified there, and now it's working as I expected. Thanks for the help, ahmet.

Date: Thu, 14 Feb 2013 01:31:14 -0800 From: iori...@yahoo.com Subject: RE: Why a phrase is getting searched against default fields in solr To: solr-user@lucene.apache.org
Hi, instead of edismax=true, can you try defType=edismax? ahmet [...]
Implement price range filter: DataImportHandler started. Not Initialized. No commands can be run
On all products I have I want to implement a price range filter. Since this price range applies to the entire population and not to a single product, my assumption was that it would not make sense to define this within the shopitem entity, but rather under the document shopitems. So that's what I did in my data-config below. But now on these requests:

  http://localhost:8983/solr/tt-shop/dataimport?command=reload-config
  http://localhost:8983/solr/tt-shop/dataimport?command=full-import

I get the error: DataImportHandler started. Not Initialized. No commands can be run

  <dataConfig>
    <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://localhost:1433;databaseName=" user="**" password="*"/>
    <document name="shopitems">
      <entity name="shopitem" pk="id" query="select * from products">
        <field name="id" column="ID"/>
        <field name="prijs" column="prijs"/>
        <field name="createdate" column="createdate"/>
      </entity>
      <entity name="pricerange" query=";With Categorized as (Select CASE
          When prijs &lt;= 1000 Then '&lt;10'
          When prijs &gt; 1000 and prijs &lt;= 2500 Then '[10-25]'
          When prijs &gt; 2500 and prijs &lt;= 5000 Then '[25-50]'
          Else '&gt;50' END as PriceCategory From products)
          Select PriceCategory, Count(*) as Cnt From Categorized Group By PriceCategory">
      </entity>
    </document>
  </dataConfig>
solr 4.1 spatial with JTS - spatial query withitin a WKT polygon contained within another query ...
Hello Everyone, I've been integrating Solr 4.1 into a Web GIS solution and it's working great. I have implemented JTS within Solr 4.1 and indexed thousands of WKT polygons provided by an XML document generated by a GE GIS Core system. Everything seems to be working out great. Now I have a feature where I want to query solr with geo:intersects((POLYGON(... with a polygon too big to send via the xmlhttp object. I'm getting an http 505 error. 1. Is there any other way of sending this huge string back to solr? (I've tried GET and POST) 2. This polygon was the result of a previous query, so is there a way of querying inside a query? Something like ... fq=geo:intersects(another query.spatialfield_with_the_wkt_polygon) ? Thanks, Guilherme
Re: Index-time synonyms and trailing wildcard issue
Hello Jack, Thanks for your answer. It helped me gain a deeper understanding of what happens at index time, and I found a solution myself: putting the synonym filter in both filter chains (index and query), setting expand=false, and putting the desired synonym first in the row does the trick. Synonyms line (reversed order!): orange, apfelsine. All documents containing apfelsine are now mapped to orange, so there are no more documents containing apfelsine that would match a wildcard query for apfel*. (Apfelsine is a true synonym for Orange in German, meaning chinese apple. Apfel = apple, which shouldn't match oranges.) Problem solved, thanks again for the help! Johannes Rodenwald

----- Original Message ----- From: Jack Krupansky j...@basetechnology.com To: solr-user@lucene.apache.org Sent: Wednesday, 13 February 2013 17:17:40 Subject: Re: Index-time synonyms and trailing wildcard issue
By doing synonyms at index time, you cause apfelsin to be added to documents that contain only orang, so of course documents that previously only contained orang will now match for apfelsin or any term query that matches apfelsin, such as a wildcard. At query time, Lucene cannot tell whether your original document contained apfelsin or if apfelsin was added when the document was indexed due to an index-time synonym. Solution: Either disable index-time synonyms, or have a parallel field (via copyField) that does not have the index-time synonyms. But... perhaps you should clarify what you really intend to happen with these pseudo-synonyms. -- Jack Krupansky
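A sketch of the resulting schema.xml analyzer chains (the fieldType name and tokenizer are illustrative; the key parts are the identical SynonymFilterFactory entries with expand="false" in both chains):

  <fieldType name="text_de_syn" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- synonyms.txt contains the line: orange, apfelsine -->
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="false"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="false"/>
    </analyzer>
  </fieldType>

With expand="false", every term on the line is rewritten to the first one, so both orange and apfelsine are indexed and queried as orange, and apfel* no longer matches documents that only mentioned Apfelsine.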
JMX generation number is wrong
I'm trying to monitor the state of a master-slave Solr 4.1 cluster. I can easily get the generation number of the slaves using JMX like this:

  solr/{corename}/org.apache.solr.handler.ReplicationHandler/generation

That works fine. However, on the master this number is always 1, which makes it rather hard to check if the slaves are lagging behind. Is this a defect in the JMX properties in Solr, and should I file a Jira? Ari -- Aristedes Maniatis GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A
get filterCache in Component
Hi, We need to get the filterCache in a Component, but SolrIndexSearcher.getCache(String name) does not return it. It seems the filterCache is not added to cacheMap and can therefore not be returned.

  SolrCache<Query,DocSet> filterCache = rb.req.getSearcher().getCache("filterCache");

will always return null. Can we get the filterCache via other means, or should it be added to the cacheMap so getCache can return it? Thanks, Markus
Re: Most common query
If I'm understanding your question correctly, you have to build that out yourself. Solr doesn't store the searches, nor the results. Hmm, though if you keep the Solr logs around you can reconstruct the queries from them, although it takes a bit of work. The other place would be your servlet container logs, which should be able to store all the queries.

On Wed, Feb 13, 2013 at 10:27 AM, ROSENBERG, YOEL (YOEL) CTR yoel.rosenb...@alcatel-lucent.com wrote:
Hi, I have a question, hope you can help me. I would like to get a report, using the solr admin tools, that returns all the searches made on the system between dates. What is the correct way to do it? BR, Yoel
Yoel Rosenberg ALCATEL-LUCENT Support Engineer T: +972 77 9088584 M: +972 54 239 5204 yoel.rosenb...@alcatel-lucent.com
Re: What should focus be on hardware for solr servers?
One data point: I can comfortably index and search the Wikipedia dump (11M articles, 5M with text) on my Macbook Pro. Admittedly not heavy-duty queries, but... Erick

On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro m...@mshapiro.net wrote:
Excellent, thank you very much for the reply!

On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
Matthew Shapiro [m...@mshapiro.net] wrote: Sorry, I should clarify our current statistics. First of all, I meant 183k documents (not 183, whoops). Around 100k of those are full-fledged html articles (not web pages but articles in our CMS with html content inside of them),
If an article is around 10-30 pages (or the equivalent), this is still a small corpus.
the rest of the data are more like key/value data records with a lot of attached metadata for searching.
If the number of unique categories (model, author, playtime, lix, favorite_band, year...) in the metadata is in the lower hundreds, you should be fine.
Also, what I meant by search without a search term is that probably 80% (hard to confirm due to the lack of stats given by the GSA) of our searches are done on pure metadata clauses without any searching through the content itself,
That clarifies a lot, thanks. So we have roughly speaking 4000*5 queries/day ~= 14 queries/minute. Guessing wildly that your peak-time traffic is about 5 times that, we end up with about 1 query/second. That is a very light load for the Solr installation we're discussing.
so for example give me documents that have a content type of video, that are marked for client X, have a category of Y or Z, and was published to platform A, ordered by date published.
That is a near-trivial query and you should get a reply very fast on modest hardware.
The searches that use a search term are more like use the same query from the example as before, but find me all the documents that have the string My Video in the title and description.
Unless you experiment with fuzzy matches and phrase slop, this should also be fast. Ignoring analyzers, there is practically no difference between a metadata field and a larger content field in Solr. Your current search (guessing here) iterates all terms in the content fields and takes a comparatively large penalty when a large document is encountered. The inversion of the index in Solr means that the search terms are looked up in a dictionary that refers to the documents they belong to. The penalty for having thousands or millions of terms in a field of an inverted index, as compared to tens or hundreds, is very small. We're still in "any random machine you've got available" land, so I second Michael's suggestion. Regards, Toke Eskildsen
Re: Multi Core / On demand loading
I updated this page: http://wiki.apache.org/solr/CoreAdmin, look for transientCacheSize and loadOnStartup. Be aware that this is somewhat in flux, but anything you find please report! Man, oh man, do I have a lot of documentation to do on all this once the dust settles Erick On Wed, Feb 13, 2013 at 5:10 PM, Vinay B, vybe3...@gmail.com wrote: Amongst the highlights for the SOLR 4.1 release, I see Multi-core: On-demand core loading and LRU-based core unloading after reaching a user-specified maximum number. How is this configured and where should I be looking for a reference on this feature? Thanks
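A sketch of what this looks like in solr.xml (attribute values are examples; see the wiki page above for specifics):

  <cores adminPath="/admin/cores" transientCacheSize="10">
    <!-- loaded lazily on first request, and eligible for LRU unloading -->
    <core name="ondemand1" instanceDir="ondemand1" loadOnStartup="false" transient="true"/>
    <!-- a normal core: loaded at startup, never auto-unloaded -->
    <core name="always" instanceDir="always"/>
  </cores>

transientCacheSize caps how many transient cores may be loaded at once; the least recently used one is closed when the cap is exceeded.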
Re: Most common query
Hi, If I am not mistaken, I saw an open Jira issue to collect queries and calculate popular searches, etc. Some commercial solutions exist:
http://sematext.com/search-analytics/index.html
http://soleami.com/blog/soleami-start_en.html

--- On Wed, 2/13/13, ROSENBERG, YOEL (YOEL) CTR yoel.rosenb...@alcatel-lucent.com wrote:
From: ROSENBERG, YOEL (YOEL) CTR yoel.rosenb...@alcatel-lucent.com Subject: Most common query To: solr-user@lucene.apache.org Date: Wednesday, February 13, 2013, 5:27 PM
Hi, I have a question, hope you can help me. I would like to get a report, using the solr admin tools, that returns all the searches made on the system between dates. What is the correct way to do it? BR, Yoel
Yoel Rosenberg ALCATEL-LUCENT Support Engineer T: +972 77 9088584 M: +972 54 239 5204 yoel.rosenb...@alcatel-lucent.com
Re: What should focus be on hardware for solr servers?
That raises the question of how your average professional notebook computer (PC or Mac or Linux) compares to a garden-variety cloud server such as an Amazon EC2 m1.large (or m3.xlarge) in terms of performance, such as document ingestion rate or how many documents you can load before load and/or query performance starts to fall off the cliff. Anybody have any numbers? I mean, is a MacBook Pro half of an EC2 m1.large? Twice? Less? More? Any rough feel? (With all the usual caveats that it all depends and your mileage will vary.) But the intent would be for a similar workload on both (like loading the wikipedia dump.) -- Jack Krupansky

-Original Message- From: Erick Erickson Sent: Thursday, February 14, 2013 7:31 AM To: solr-user@lucene.apache.org Subject: Re: What should focus be on hardware for solr servers?
One data point: I can comfortably index and search the Wikipedia dump (11M articles, 5M with text) on my Macbook Pro. Admittedly not heavy-duty queries, but... Erick

On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro m...@mshapiro.net wrote:
Excellent, thank you very much for the reply! [...]
Re: Solr 4.1.0 not using solrcore.properties ?
Daniel: It would be great if you would go ahead and edit the Wiki; all you have to do is create a signon. Having just gone through the pain of figuring this out, you're best positioned to know how to warn others! Best, Erick

On Thu, Feb 14, 2013 at 4:56 AM, Daniel Rijkhof daniel.rijk...@gmail.com wrote:
James, I'm not completely sure, and I have not tested the following: entityname.last_index_time might also not be accessible... Daniel

On Thu, Feb 14, 2013 at 12:47 AM, Daniel Rijkhof daniel.rijk...@gmail.com wrote:
James, I debugged it until I found where things go 'wrong'. Apparently the current implementation of VariableResolver does not allow the use of a period '.' in any variable/property key you want to use... It's reserved for namespaces. [...]
Re: Multi Core / On demand loading
Almost forgot. Do be aware of https://issues.apache.org/jira/browse/SOLR-4400. This came to light under an absurd load of opening/closing transient cores, which only means it won't show up until you go into production. The fix is on both trunk and 4x. On Thu, Feb 14, 2013 at 7:46 AM, Erick Erickson erickerick...@gmail.comwrote: I updated this page: http://wiki.apache.org/solr/CoreAdmin, look for transientCacheSize and loadOnStartup. Be aware that this is somewhat in flux, but anything you find please report! Man, oh man, do I have a lot of documentation to do on all this once the dust settles Erick On Wed, Feb 13, 2013 at 5:10 PM, Vinay B, vybe3...@gmail.com wrote: Amongst the highlights for the SOLR 4.1 release, I see Multi-core: On-demand core loading and LRU-based core unloading after reaching a user-specified maximum number. How is this configured and where should I be looking for a reference on this feature? Thanks
Re: Combining Solr score with customized user ratings for a document
Well, thinking a bit more, the second solution is not practical. If Solr retrieves, say, 1,000 documents, I would have to navigate through ALL of them (maybe fewer, with some reasonable upper limit) to recalculate the scores and reorder them according to the new score, although the Web App is going to show just the first 20. In other words, I would lose the benefits of Solr's (well, and most DBs') row/offset feature to retrieve information in batches rather than the whole result set, which may not be seen by the user at all. I'm now wondering if a custom implementation of a ValueSource + a FunctionQuery is a solution to my problem... Any hint? Thanks! Álvaro
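A ValueSource/FunctionQuery is indeed the usual route, and it may not even require custom code. One common pattern (assuming the per-document rating can be pushed into Solr, e.g. as a numeric field or an ExternalFileField named rating) is a multiplicative boost, which keeps the reordering, sorting, and pagination on the Solr side:

  q={!boost b=sum(1,product(0.2,rating)) v=$qq}
  &qq=user query text

or, with edismax, the equivalent boost parameter:

  q=user query text&defType=edismax&boost=sum(1,product(0.2,rating))

The 0.2 weight is arbitrary; it controls how strongly ratings pull on the relevance score. An ExternalFileField is handy here because the ratings file can be updated and reloaded without reindexing documents.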
Re: Maximum Number of Records In Index
Partial updates are nothing as clever as I may have made them sound; it is just changing a record value, for example a last name from Smith to Jones. That's my partial update. No errors at all in indexing. I have not yet checked the logs, but the DIH output counts show no errors; here is an example:

  <str name="Total Requests made to DataSource">2</str>
  <str name="Total Rows Fetched">14823</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2013-02-14 07:00:30</str>
  <str name="">Indexing completed. Added/Updated: 14823 documents. Deleted 0 documents.</str>
  <str name="Committed">2013-02-14 07:19:59</str>
  <str name="Optimized">2013-02-14 07:19:59</str>
  <str name="Total Documents Processed">14823</str>
  <str name="Time taken">0:19:58.557</str>

Having analysed the SOLR index this afternoon, I realised that I actually add the date/time of when each record is indexed, so I did a quick SOLR admin count using record_date:[2000-02-14T00:00:00.000Z TO 2013-02-10T00:00:00.000Z]; this resulted in a count of 32,723 records indexed today, and when I add up all the DIHs' Added/Updated figures it comes to 35,369, weird!!!

Now for the total maths: yesterday's total index count was 13,593,885 and today it is 13,598,211, a difference of 4,326. But I do need to take record updates into account, so running the SQL from each of the DIH sources in SQL Developer purely to get counts, my counts total 31,789, which means only 3,000 to 4,000 are updates; the rest are all new. So I will definitely say that records are being deleted, so I need to check the logs as suggested. If no mention of deletions exists, my next question will be: can I get a month-by-month breakdown on a SOLR date field so I can monitor records that drop off? One field that will definitely not change is the record creation date from the source systems, which is part of the indexed record. this line ready for entering log details to see if any deletes occurred
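On the month-by-month question: range faceting on the creation-date field gives exactly that kind of breakdown without any schema change (the field name record_date is taken from the query above; adjust host, core, start, and end to your data):

  http://localhost:8983/solr/core1/select?q=*:*&rows=0&facet=true
    &facet.range=record_date
    &facet.range.start=2012-01-01T00:00:00Z
    &facet.range.end=2013-03-01T00:00:00Z
    &facet.range.gap=%2B1MONTH

Each bucket in the response counts the records whose record_date falls in that month, so a bucket whose count drops between runs points at deleted records.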
Re: MockAnalyzer in Lucene: attach stemmer or any custom filter?
MockAnalyzer is really just MockTokenizer + MockTokenFilter + ... Instead, you just define your own analyzer chain using MockTokenizer. This is the way all of Lucene's own analysis tests work, e.g.: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/test/org/apache/lucene/analysis/en/TestEnglishMinimalStemFilter.java

On Thu, Feb 14, 2013 at 7:40 AM, Dmitry Kan solrexp...@gmail.com wrote:
Hello, I asked a question on SO: http://stackoverflow.com/questions/14873207/mockanalyzer-in-lucene-attach-stemmer-or-any-custom-filter Is there a way to configure a stemmer or a custom filter with the MockAnalyzer class? Version: LUCENE_34 Dmitry
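Following the pattern in that test, a minimal sketch of an analyzer that chains a real stemmer behind MockTokenizer (this uses the Lucene 4.x-style Analyzer API; in 3.4 the closest equivalent is extending ReusableAnalyzerBase, which has a similar createComponents method):

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.MockTokenizer;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.en.PorterStemFilter;

  Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
          // MockTokenizer supplies the test-framework consistency checks;
          // the filter chain behind it is whatever you want to test.
          Tokenizer source = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
          return new TokenStreamComponents(source, new PorterStemFilter(source));
      }
  };

The point is that MockAnalyzer itself is not configurable with extra filters; you assemble the chain yourself around MockTokenizer.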
RE: Solr 4.1.0 not using solrcore.properties ?
Daniel, This bug has already been recorded and hopefully will be fixed in time for 4.2. See https://issues.apache.org/jira/browse/SOLR-4361 . James Dyer Ingram Content Group (615) 213-4311

-Original Message- From: Daniel Rijkhof [mailto:daniel.rijk...@gmail.com] Sent: Wednesday, February 13, 2013 5:47 PM To: solr-user@lucene.apache.org Subject: Re: Solr 4.1.0 not using solrcore.properties ?
James, I debugged it until I found where things go 'wrong'. Apparently the current implementation of VariableResolver does not allow the use of a period '.' in any variable/property key you want to use... It's reserved for namespaces. [...]
Re: What should focus be on hardware for solr servers?
My dual-core, HT-enabled Dell Latitude from last year has this CPU:

  model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
  bogomips   : 4988.65

An m3.xlarge reports:

  model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
  bogomips   : 4000.14

I tried running geekbench and phoronix-test-suite and failed at both... Anybody have a favorite, free, CLI benchmarking suite?

Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game

On Thu, Feb 14, 2013 at 8:10 AM, Jack Krupansky j...@basetechnology.com wrote:
That raises the question of how your average professional notebook computer (PC or Mac or Linux) compares to a garden-variety cloud server such as an Amazon EC2 m1.large (or m3.xlarge) in terms of performance, such as document ingestion rate or how many documents you can load before load and/or query performance starts to fall off the cliff. [...]
Re: What should focus be on hardware for solr servers?
Or perhaps we should develop our own, Solr-based benchmark...

Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game

On Thu, Feb 14, 2013 at 10:54 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:
My dual-core, HT-enabled Dell Latitude from last year has this CPU: model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz, bogomips: 4988.65. An m3.xlarge reports: model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz, bogomips : 4000.14. I tried running geekbench and phoronix-test-suite and failed at both... Anybody have a favorite, free, CLI benchmarking suite? [...]
RE: Implement price range filter: DataImportHandler started. Not Initialized. No commands can be run
This looks like https://issues.apache.org/jira/browse/SOLR-2115 , which was fixed for 4.0-Alpha. Basically, if you do not put a data-config.xml file in the defaults section in solrconfig.xml, or if your config file has any errors, you won't be able to use DIH unless you fix the problem and restart solr. James Dyer Ingram Content Group (615) 213-4311

-Original Message- From: PeterKerk [mailto:vettepa...@hotmail.com] Sent: Thursday, February 14, 2013 5:02 AM To: solr-user@lucene.apache.org Subject: Implement price range filter: DataImportHandler started. Not Initialized. No commands can be run
On all products I have I want to implement a price range filter. Since this price range applies to the entire population and not to a single product, my assumption was that it would not make sense to define this within the shopitem entity, but rather under the document shopitems. So that's what I did in my data-config below. [...]
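For completeness, the registration James refers to lives in solrconfig.xml and looks like this (the handler name and config file name are the conventional ones):

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>

If the config entry is missing, points at a missing file, or the file fails to parse, DIH reports "Not Initialized" until the problem is fixed and the core is restarted.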
RE: Implement price range filter: DataImportHandler started. Not Initialized. No commands can be run
Ok, but I restarted solr several times and the issue still occurs. So my guess is that the entity I added contains errors:

<entity name=&quot;pricerange&quot; query=&quot;With Categorized as (Select CASE When prijs &amp;lt;= 1000 Then '&lt;10' When prijs &amp;gt; 1000 and prijs &amp;lt;= 2500 Then '[10-25]' When prijs &amp;gt; 2500 and prijs &amp;lt;= 5000 Then '[25-50]' Else '&gt;50' END as PriceCategory From products) Select PriceCategory, Count(*) as Cnt From Categorized Group By PriceCategory&quot;>
</entity>

Or are you saying that this code is correct and that the 4.0-Alpha release will resolve my issue? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Implement-price-range-filter-DataImportHandler-started-Not-Initialized-No-commands-can-be-run-tp4040418p4040483.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: compare two shards.
I do a brute-force regression test where I read all the documents from shard 1 and compare them to documents in shard 2. I had to have all the fields stored to do that, but in my case that doesn't change the size of the index much. So, in other words, I do a search for a page's worth of documents sorted by the same thing and compare them, then get the next page and do the same. On Tue, Feb 12, 2013 at 4:20 AM, stockii stock.jo...@googlemail.com wrote: hello. i want to compare two shards with each other, because these shards should have the same index. but this isn't so =( so i want to find the documents that are missing from one of my two shards. my ideas: - distributed shard request on my nodes and fire a facet search on my unique-field. but the result of the facet component isn't reversable =( - grouping. but it's not working correctly i think; no groups of the same uniquekey in the resultset. does anyone have better ideas? -- View this message in context: http://lucene.472066.n3.nabble.com/compare-two-shards-tp4039887.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: What should focus be on hardware for solr servers?
Just using a single CPU (log processing with Python), my MacBook Pro (2GHz Intel Core i7) is twice as fast as an m2.xlarge EC2 instance. Laptop disks are slower than the EC2 disks. EC2 is for quantity, not quality. wunder On Feb 14, 2013, at 5:10 AM, Jack Krupansky wrote: That raises the question of how your average professional notebook computer (PC or Mac or Linux) compares to a garden-variety cloud server such as an Amazon EC2 m1.large (or m3.xlarge) in terms of performance such as document ingestion rate or how many documents you can load before load and/or query performance starts to fall off the cliff. Anybody have any numbers? I mean, is a MacBook Pro half of an EC2 m1.large? Twice? Less? More? Any rough feel? (With all the usual caveats that it all depends and your mileage will vary.) But the intent would be for a similar workload on both (like loading the wikipedia dump.) -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Thursday, February 14, 2013 7:31 AM To: solr-user@lucene.apache.org Subject: Re: What should focus be on hardware for solr servers? One data point: I can comfortably index and search the Wikipedia dump (11M articles, 5M with text) on my Macbook Pro. Admittedly not heavy-duty queries, but Erick On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro m...@mshapiro.net wrote: Excellent, thank you very much for the reply! On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Matthew Shapiro [m...@mshapiro.net] wrote: Sorry, I should clarify our current statistics. First of all I meant 183k documents (not 183, woops). Around 100k of those are full fledged html articles (not web pages but articles in our CMS with html content inside of them), If an article is around 10-30 pages (or the equivalent), this is still a small corpus. the rest of the data are more like key/value data records with a lot of attached meta data for searching. If the amount of unique categories (model, author, playtime, lix, favorite_band, year...) in the meta data is in the lower hundreds, you should be fine. Also, what I meant by search without a search term is that probably 80% (hard to confirm due to the lack of stats given by the GSA) of our searches are done on pure metadata clauses without any searching through the content itself, That clarifies a lot, thanks. So we have roughly speaking 4000*5 queries/day ~= 14 queries/minute. Guessing wildly that your peak time traffic is about 5 times that, we end up with about 1 query/second. That is a very light load for the Solr installation we're discussing. so for example give me documents that have a content type of video, that are marked for client X, have a category of Y or Z, and was published to platform A, ordered by date published. That is a near-trivial query and you should get a reply very fast on modest hardware. The searches that use a search term are more like use the same query from the example as before, but find me all the documents that have the string My Video in it's title and description. Unless you experiment with fuzzy matches and phrase slop, this should also be fast. Ignoring analyzers, there is practically no difference between a meta data field and a larger content field in Solr. Your current search (guessing here) iterates all terms in the content fields and take a comparatively large penalty when a large document is encountered. The inversion of index in Solr means that the search terms are looked up in a dictionary and refers to the documents they belong to. 
The penalty for having thousands or millions of terms as compared to tens or hundreds in a field in an inverted index is very small. We're still in any random machine you've got available-land so I second Michael's suggestion. Regards, Toke Eskildsen -- Walter Underwood wun...@wunderwood.org
RE: Implement price range filter: DataImportHandler started. Not Initialized. No commands can be run
No, you still have to fix problems with data-config.xml. It's just that prior to 4.0-alpha, if you started solr with a problem in the config, you had no way to fix it and refresh without restarting solr (or at least doing a core reload). With 4.0, you can fix your config file and just retry. I think the problem might be the escaped quotes and ampersands. Change it to...

<entity name="pricerange" query="With Categorized as (Select CASE When prijs &lt;= 1000 Then '&lt;10' When prijs &gt; 1000 and prijs &lt;= 2500 Then '[10-25]' When prijs &gt; 2500 and prijs &lt;= 5000 Then '[25-50]' Else '&gt;50' END as PriceCategory From products) Select PriceCategory, Count(*) as Cnt From Categorized Group By PriceCategory">
</entity>

James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: PeterKerk [mailto:vettepa...@hotmail.com] Sent: Thursday, February 14, 2013 10:01 AM To: solr-user@lucene.apache.org Subject: RE: Implement price range filter: DataImportHandler started. Not Initialized. No commands can be run Ok, but I restarted solr several times and the issue still occurs. So my guess is that the entity I added contains errors:

<entity name=&quot;pricerange&quot; query=&quot;With Categorized as (Select CASE When prijs &amp;lt;= 1000 Then '&lt;10' When prijs &amp;gt; 1000 and prijs &amp;lt;= 2500 Then '[10-25]' When prijs &amp;gt; 2500 and prijs &amp;lt;= 5000 Then '[25-50]' Else '&gt;50' END as PriceCategory From products) Select PriceCategory, Count(*) as Cnt From Categorized Group By PriceCategory&quot;>
</entity>

Or are you saying that this code is correct and that the 4.0-Alpha release will resolve my issue? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Implement-price-range-filter-DataImportHandler-started-Not-Initialized-No-commands-can-be-run-tp4040418p4040483.html Sent from the Solr - User mailing list archive at Nabble.com.
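The general rule here: inside a data-config query attribute the SQL must be XML-escaped exactly once (< as &lt;, > as &gt;, & as &amp;), while the attribute's surrounding quotes stay plain. A minimal illustration with invented names:

<!-- raw SQL: select * from products where prijs <= 1000 -->
<entity name="cheap" query="select * from products where prijs &lt;= 1000" />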
Re: What should focus be on hardware for solr servers?
Just for sake of comparison, http://www.ec2instances.info/ At the low end, EC2 CPUs come in 1, 2, 2.5, and 3.25 unit sizes. A m2.xlarge uses 3.25 unit CPUs, so one would have to step up to the high storage, high IO, or cluster compute nodes to do better than that at single threaded tasks. Good thing Solr isn't single threaded, or my company would be bankrupt! :) Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Thu, Feb 14, 2013 at 11:24 AM, Walter Underwood wun...@wunderwood.org wrote: Just using a single CPU (log processing with Python), my MacBook Pro (2GHz Intel Core i7) is twice as fast as an m2.xlarge EC2 instance. Laptop disks are slower than the EC2 disks. EC2 is for quantity, not quality. wunder On Feb 14, 2013, at 5:10 AM, Jack Krupansky wrote: That raises the question of how your average professional notebook computer (PC or Mac or Linux) compares to a garden-variety cloud server such as an Amazon EC2 m1.large (or m3.xlarge) in terms of performance such as document ingestion rate or how many documents you can load before load and/or query performance starts to fall off the cliff. Anybody have any numbers? I mean, is a MacBook Pro half of an EC2 m1.large? Twice? Less? More? Any rough feel? (With all the usual caveats that it all depends and your mileage will vary.) But the intent would be for a similar workload on both (like loading the wikipedia dump.) -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Thursday, February 14, 2013 7:31 AM To: solr-user@lucene.apache.org Subject: Re: What should focus be on hardware for solr servers? One data point: I can comfortably index and search the Wikipedia dump (11M articles, 5M with text) on my Macbook Pro. Admittedly not heavy-duty queries, but Erick On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro m...@mshapiro.net wrote: Excellent, thank you very much for the reply! On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Matthew Shapiro [m...@mshapiro.net] wrote: Sorry, I should clarify our current statistics. First of all I meant 183k documents (not 183, woops). Around 100k of those are full fledged html articles (not web pages but articles in our CMS with html content inside of them), If an article is around 10-30 pages (or the equivalent), this is still a small corpus. the rest of the data are more like key/value data records with a lot of attached meta data for searching. If the amount of unique categories (model, author, playtime, lix, favorite_band, year...) in the meta data is in the lower hundreds, you should be fine. Also, what I meant by search without a search term is that probably 80% (hard to confirm due to the lack of stats given by the GSA) of our searches are done on pure metadata clauses without any searching through the content itself, That clarifies a lot, thanks. So we have roughly speaking 4000*5 queries/day ~= 14 queries/minute. Guessing wildly that your peak time traffic is about 5 times that, we end up with about 1 query/second. That is a very light load for the Solr installation we're discussing. so for example give me documents that have a content type of video, that are marked for client X, have a category of Y or Z, and was published to platform A, ordered by date published. That is a near-trivial query and you should get a reply very fast on modest hardware. 
The searches that use a search term are more like use the same query from the example as before, but find me all the documents that have the string My Video in it's title and description. Unless you experiment with fuzzy matches and phrase slop, this should also be fast. Ignoring analyzers, there is practically no difference between a meta data field and a larger content field in Solr. Your current search (guessing here) iterates all terms in the content fields and take a comparatively large penalty when a large document is encountered. The inversion of index in Solr means that the search terms are looked up in a dictionary and refers to the documents they belong to. The penalty for having thousands or millions of terms as compared to tens or hundreds in a field in an inverted index is very small. We're still in any random machine you've got available-land so I second Michael's suggestion. Regards, Toke Eskildsen -- Walter Underwood wun...@wunderwood.org
Re: What should focus be on hardware for solr servers?
On Feb 14, 2013, at 11:24 AM, Walter Underwood wun...@wunderwood.org wrote: Laptop disks are slower than the EC2 disks. My laptop disk is an SSD.
Re: compare two shards.
If you can spare the load of a long request, I'd do an unsorted query for everything, non-paged. I'd dump that into a line-per-row format and use something like Apache Hive to do the analysis. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Tue, Feb 12, 2013 at 4:20 AM, stockii stock.jo...@googlemail.com wrote: hello. i want to compare two shards with each other, because these shards should have the same index. but this isn't so =( so i want to find the documents that are missing from one of my two shards. my ideas: - distributed shard request on my nodes and fire a facet search on my unique-field. but the result of the facet component isn't reversable =( - grouping. but it's not working correctly i think; no groups of the same uniquekey in the resultset. does anyone have better ideas? -- View this message in context: http://lucene.472066.n3.nabble.com/compare-two-shards-tp4039887.html Sent from the Solr - User mailing list archive at Nabble.com.
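If Hadoop/Hive is overkill, the same diff can be done with a short script. A rough sketch, assuming Python with the requests library, a uniqueKey field named id, and id sets small enough to hold in memory (note that paging with start gets slow on very large result sets):

import requests

def dump_ids(base_url):
    # Page through the whole core, collecting every uniqueKey value.
    ids, start, rows = set(), 0, 10000
    while True:
        params = {"q": "*:*", "fl": "id", "sort": "id asc",
                  "start": start, "rows": rows, "wt": "json"}
        docs = requests.get(base_url + "/select", params=params).json()["response"]["docs"]
        if not docs:
            return ids
        ids.update(d["id"] for d in docs)
        start += rows

a = dump_ids("http://host1:8983/solr/shard1")
b = dump_ids("http://host2:8983/solr/shard2")
print("only in shard1:", sorted(a - b))
print("only in shard2:", sorted(b - a))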
Re: Multi Core / On demand loading
Thanks, We run SOLR 4.0 in production. Yesterday, I ported our configuration to 4.1 on my local workstation. I just looked at the SOLR-4400 fix versions and as per the info, I might wait till 4.2 before porting. -- View this message in context: http://lucene.472066.n3.nabble.com/Multi-Core-On-demand-loading-tp4040341p4040498.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Implement price range filter: DataImportHandler started. Not Initialized. No commands can be run
Ok, something went wrong with posting the code, since I did not escape the quotes and ampersands. I tried your code, but no luck. Here's the original query I'm trying to execute. What characters do I need to escape? I thought only the < and & characters? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Implement-price-range-filter-DataImportHandler-started-Not-Initialized-No-commands-can-be-run-tp4040418p4040499.html Sent from the Solr - User mailing list archive at Nabble.com.
How to define a lowercase fieldtype without tokenizer
Hi, I don't want the field to be tokenized because Solr doesn't support sorting on a tokenized field. In order to do case-insensitive sorting I need to copy a field to a lowercased but not tokenized field. How do I define this? I tried the below, but it says I need to specify a tokenizer or a class for the analyzer.

<fieldType name="text_lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-define-a-lowercase-fieldtype-without-tokenizer-tp4040500.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Implement price range filter: DataImportHandler started. Not Initialized. No commands can be run
Hi Peter, Your original query didn't make it to the mailing list. You're experiencing a long-standing nabble bug: nabble eats code. (I've told them about it a couple of times, but the problem persists, so I guess they're not interested in fixing it.) My suggestion: don't use nabble for posting to mailing lists. Or put code snippets up on a third-party text sharing facility, e.g. pastebin, github gist, etc. Steve On Feb 14, 2013, at 12:10 PM, PeterKerk vettepa...@hotmail.com wrote: Ok, something went wrong with posting the code,since I did not escape the quotes and ampersands. I tried your code, but nu luck. Here's the original query I'm trying to execute. What characters do I need to escape? I thought only the and characters? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Implement-price-range-filter-DataImportHandler-started-Not-Initialized-No-commands-can-be-run-tp4040418p4040499.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to define a lowercase fieldtype without tokenizer
You can use a KeywordTokenizerFactory, which will tokenise into a single term, and then do your lowercasing. Does that get you what you want? Upayavira On Thu, Feb 14, 2013, at 05:11 PM, Bing Hua wrote: Hi, I don't want the field to be tokenized because Solr doesn't support sorting on a tokenized field. In order to do case-insensitive sorting I need to copy a field to a lowercased but not tokenized field. How do I define this? I tried the below, but it says I need to specify a tokenizer or a class for the analyzer.

<fieldType name="text_lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-define-a-lowercase-fieldtype-without-tokenizer-tp4040500.html Sent from the Solr - User mailing list archive at Nabble.com.
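Spelled out, that suggestion comes to something like this (an untested sketch; pair it with a copyField from your display field into a field of this type, and sort on that):

<fieldType name="text_lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>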
Re: How to define a lowercase fieldtype without tokenizer
Works perfectly, thank you. I didn't know before that this tokenizer does nothing :) -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-define-a-lowercase-fieldtype-without-tokenizer-tp4040500p4040507.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Combining Solr score with customized user ratings for a document
Start by looking at Solr's external file field and http://www.linkedin.com/profile/view?id=18807864&trk=tab_pro On Thu, Feb 14, 2013 at 6:24 AM, Á_o chachime...@yahoo.es wrote: Well, thinking a bit more, the second solution is not practical. If Solr retrieves, say, 1.000 documents, I would have to navigate through ALL (maybe less with some reasonable upper limit) of them to recalculate the scores and reorder them according to the new score, although the Web App is going to show just the first 20. In other words, I would lose the benefits of Solr's (well, and most DBs') row/offset feature to retrieve information in batches rather than the whole set of results, which may not be seen by the user at all. I'm now wondering if a custom implementation of a ValueSource + a FunctionQuery is a solution to my problem... Any hint? Thanks! Álvaro -- View this message in context: http://lucene.472066.n3.nabble.com/Combining-Solr-score-with-customized-user-ratings-for-a-document-tp4040200p4040444.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Combining Solr score with customized user ratings for a document
Oops - that's definitely not the link I meant to give ;-) Here's the link from slideshare: http://www.slideshare.net/thelabdude/boosting-documents-in-solr-lucene-revolution-2011 In there we used Mahout to calculate recommendation scores and then loaded them using external file field. Cheers, Tim On Thu, Feb 14, 2013 at 11:25 AM, Timothy Potter thelabd...@gmail.com wrote: Start by looking at Solr's external file field and http://www.linkedin.com/profile/view?id=18807864&trk=tab_pro On Thu, Feb 14, 2013 at 6:24 AM, Á_o chachime...@yahoo.es wrote: Well, thinking a bit more, the second solution is not practical. If Solr retrieves, say, 1.000 documents, I would have to navigate through ALL (maybe less with some reasonable upper limit) of them to recalculate the scores and reorder them according to the new score, although the Web App is going to show just the first 20. In other words, I would lose the benefits of Solr's (well, and most DBs') row/offset feature to retrieve information in batches rather than the whole set of results, which may not be seen by the user at all. I'm now wondering if a custom implementation of a ValueSource + a FunctionQuery is a solution to my problem... Any hint? Thanks! Álvaro -- View this message in context: http://lucene.472066.n3.nabble.com/Combining-Solr-score-with-customized-user-ratings-for-a-document-tp4040200p4040444.html Sent from the Solr - User mailing list archive at Nabble.com.
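For the archive, the moving parts of the external file field approach are roughly these; the field and file names below are invented for illustration. In schema.xml:

<fieldType name="ratingFile" class="solr.ExternalFileField" keyField="id" defVal="0" valType="pfloat" />
<field name="user_rating" type="ratingFile" indexed="false" stored="false" />

Then a file named external_user_rating in the index data directory, with one uniqueKey=value line per document:

doc1=4.5
doc2=1.0

New values are picked up when a new searcher opens, and at query time the scores feed a function query, e.g. q={!boost b=field(user_rating)}name:video, so updating ratings never requires a reindex.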
RE: Can't determine Sort Order: 'prijs ASC', pos=5
: I think the order needs to be in lowercase. Try asc instead of ASC. Should be trivial to support uppercase ASC and DESC as well, not sure why no one thought of adding that before... https://issues.apache.org/jira/browse/SOLR-4458 ...patches welcome -Hoss
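In other words, on releases without that patch only the lowercase form parses:

sort=prijs asc   (works)
sort=prijs ASC   (fails with the Can't determine Sort Order error)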
RE: What should focus be on hardware for solr servers?
Steve Rowe [sar...@gmail.com] wrote: On Feb 14, 2013, at 11:24 AM, Walter Underwood wun...@wunderwood.org wrote: Laptop disks are slower than the EC2 disks. My laptop disk is an SSD. So it's not a disk? ...Sorry, couldn't resist. Unfortunately Amazon only has two SSD-backed solutions and they are #3 and #2 in terms of cost/hour (http://www.ec2instances.info/). To make matters worse, one of them has only 240GB of storage, which leaves the $3.10/hour for 2TB of SSD as the only choice right now. At Berlin Buzzwords 2012 there was a very interesting talk about indexing 24 billion tweets, with the clear conclusion that it was a lot cheaper to buy your own hardware (with SSDs) instead of going with Amazon. At that point in time, for that kind of corpus, yadda yadda. There's a recording at http://2012.berlinbuzzwords.de/sessions/you-know-search-querying-24-billion-records-900ms Regards, Toke Eskildsen
fastest way to rebuild Solr index
I have a few Solr indexes, each with 20-200 million documents, which were indexed by querying multiple PostgreSQL databases. If I rebuild the indexes the same way, it will take a few months, because the PostgreSQL queries are slow. Now, I need to make the following changes to all indexes. 1. delete a couple of fields from the Solr index 2. add a couple of new fields 3. change the type of one field from string to int Luckily, all fields were indexed and stored. My plan is to query an old index, get the values for all fields, and then add them into the new index. Any faster ways to build the new indexes in my case? Thanks, Ming
Re: fastest way to rebuild Solr index
On 2/14/2013 12:46 PM, Mingfeng Yang wrote: I have a few Solr indexes, each with 20-200 million documents, which were indexed by querying multiple PostgreSQL databases. If I rebuild the indexes the same way, it will take a few months, because the PostgreSQL queries are slow. Now, I need to make the following changes to all indexes. 1. delete a couple of fields from the Solr index 2. add a couple of new fields 3. change the type of one field from string to int Luckily, all fields were indexed and stored. My plan is to query an old index, get the values for all fields, and then add them into the new index. Using the DataImportHandler with SolrEntityProcessor is probably your best bet. I believe you would want to avoid updating the source index while using this. http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor Thanks, Shawn
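A minimal data-config.xml for that route might look like the sketch below (the URL and field list are placeholders; at 20-200 million documents per index you would probably also want to tune rows and rebuild one core at a time):

<dataConfig>
  <document>
    <entity name="reindex"
            processor="SolrEntityProcessor"
            url="http://oldhost:8983/solr/oldcore"
            query="*:*"
            rows="1000"
            fl="id,field1,field2" />
  </document>
</dataConfig>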
Re: fastest way to rebuild Solr index
Shawn, Awesome. Exactly what I am looking for. Thanks! Ming On Thu, Feb 14, 2013 at 12:00 PM, Shawn Heisey s...@elyograg.org wrote: On 2/14/2013 12:46 PM, Mingfeng Yang wrote: I have a few Solr indexes, each with 20-200 million documents, which were indexed by querying multiple PostgreSQL databases. If I rebuild the indexes the same way, it will take a few months, because the PostgreSQL queries are slow. Now, I need to make the following changes to all indexes. 1. delete a couple of fields from the Solr index 2. add a couple of new fields 3. change the type of one field from string to int Luckily, all fields were indexed and stored. My plan is to query an old index, get the values for all fields, and then add them into the new index. Using the DataImportHandler with SolrEntityProcessor is probably your best bet. I believe you would want to avoid updating the source index while using this. http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor Thanks, Shawn
Re: long QTime for big index
Just to close this discussion: we solved the problem by splitting the index. It turned out that distributed search with 12 cores is faster than searching two cores. All queries, tomcat configuration, and jvm configuration remain the same. Now queries are served in milliseconds. On Thu, Jan 31, 2013 at 9:34 PM, Mou [via Lucene] ml-node+s472066n4037870...@n3.nabble.com wrote: Thank you again. Unfortunately the index files will not fit in RAM. I have to try using the document cache. I am also moving my index to SSD again; we took our index off when the fusion IO cards failed twice during indexing and the index was corrupted. Now with the bios upgrade and new driver, it is supposed to be more reliable. Also I am going to look into the client app to verify that it is making proper query requests. Surprisingly, when I used a much lower value than the default for defaultconnectionperhost and maxconnectionperhost in solrmeter, it performs very well; the same queries return in less than one sec. I am not sure yet, need to run solrmeter with different heap sizes, with cache and without cache, etc. -- View this message in context: http://lucene.472066.n3.nabble.com/long-QTime-for-big-index-tp4037635p4040535.html Sent from the Solr - User mailing list archive at Nabble.com.
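For anyone reading along: mechanically, distributed search just means one core fans the request out via the shards parameter, along these lines (host and core names invented):

http://host1:8983/solr/core1/select?q=foo&shards=host1:8983/solr/core1,host1:8983/solr/core2,host2:8983/solr/core3

Each listed core returns its top matches and the coordinating core merges them, which is part of why twelve small indexes can answer faster than two large ones.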
Re: long QTime for big index
Hi, I am curious how many linux boxes you have and how many cores in each of them. It was my understanding that solr puts into memory all documents found for a keyword, not the whole index. So why would it be faster with more cores, when the number of documents selected from many separate cores is the same as from one core? Thanks. Alex. -Original Message- From: Mou mouna...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Thu, Feb 14, 2013 2:35 pm Subject: Re: long QTime for big index Just to close this discussion: we solved the problem by splitting the index. It turned out that distributed search with 12 cores is faster than searching two cores. All queries, tomcat configuration, and jvm configuration remain the same. Now queries are served in milliseconds. On Thu, Jan 31, 2013 at 9:34 PM, Mou [via Lucene] ml-node+s472066n4037870...@n3.nabble.com wrote: Thank you again. Unfortunately the index files will not fit in RAM. I have to try using the document cache. I am also moving my index to SSD again; we took our index off when the fusion IO cards failed twice during indexing and the index was corrupted. Now with the bios upgrade and new driver, it is supposed to be more reliable. Also I am going to look into the client app to verify that it is making proper query requests. Surprisingly, when I used a much lower value than the default for defaultconnectionperhost and maxconnectionperhost in solrmeter, it performs very well; the same queries return in less than one sec. I am not sure yet, need to run solrmeter with different heap sizes, with cache and without cache, etc. -- View this message in context: http://lucene.472066.n3.nabble.com/long-QTime-for-big-index-tp4037635p4040535.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 3.3.0 - Random CPU problem
I took your advice and waited for the servers to go down, then:

[ec2-user@zuk-solr-slave-02 ~]$ ps -wwwf -p 10131
UID        PID  PPID  C STIME TTY      TIME     CMD
tomcat   10131     1 17 23:00 ?        00:03:13 /usr/sbin/sshd

This doesn't say much :( What should I do now? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-3-3-0-Random-CPU-problem-tp4039969p4040548.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: long QTime for big index
We have two boxes; they are really nice servers: 32-core cpu, 192 G memory, with both RAID arrays and fusion IO cards. But each of them is running two instances of Solr, one for indexing and the other for searching. The search index is on the fusion IO card. Each instance has 11 cores and a small core for making indexing almost realtime. We have around 300 million documents and 250G on disk. They are all metadata. Search queries are very diverse and they do not repeat very frequently, 40-60 qps. Before, we had two cores of 125 G each on disk, and solr was taking a long time to get results from those two cores. CPU use was 90%. We never had a problem with indexing. 50% of all our docs get updated every day, so a very high indexing rate. On Thu, Feb 14, 2013 at 4:20 PM, alxsss [via Lucene] ml-node+s472066n4040545...@n3.nabble.com wrote: Hi, I am curious how many linux boxes you have and how many cores in each of them. It was my understanding that solr puts into memory all documents found for a keyword, not the whole index. So why would it be faster with more cores, when the number of documents selected from many separate cores is the same as from one core? Thanks. Alex. -Original Message- From: Mou [hidden email] To: solr-user [hidden email] Sent: Thu, Feb 14, 2013 2:35 pm Subject: Re: long QTime for big index Just to close this discussion: we solved the problem by splitting the index. It turned out that distributed search with 12 cores is faster than searching two cores. All queries, tomcat configuration, and jvm configuration remain the same. Now queries are served in milliseconds. On Thu, Jan 31, 2013 at 9:34 PM, Mou [via Lucene] [hidden email] wrote: Thank you again. Unfortunately the index files will not fit in RAM. I have to try using the document cache. I am also moving my index to SSD again; we took our index off when the fusion IO cards failed twice during indexing and the index was corrupted. Now with the bios upgrade and new driver, it is supposed to be more reliable. Also I am going to look into the client app to verify that it is making proper query requests. Surprisingly, when I used a much lower value than the default for defaultconnectionperhost and maxconnectionperhost in solrmeter, it performs very well; the same queries return in less than one sec. I am not sure yet, need to run solrmeter with different heap sizes, with cache and without cache, etc. -- View this message in context: http://lucene.472066.n3.nabble.com/long-QTime-for-big-index-tp4037635p4040535.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://lucene.472066.n3.nabble.com/long-QTime-for-big-index-tp4037635p4040549.html Sent from the Solr - User mailing list archive at Nabble.com.
Query question
Howdy, I have a straight-forward index that contains a name field. I am currently taking a string of text, tokenizing it into individual strings and making a query out of them all against the name field. Note that the name field is split up by a whitespace tokenizer and a lower case filter during indexing. My query is working fine but I want to boost the score when multiple terms match. So for example if I had an entry in my index that was originally Valley Fair Mall and the string I was using to search was I'm shopping at Valley Fair mall my query is currently being chopped into: name:i'm~ name:shopping~ name:at~ name:valley~ name:fair~ name:mall~ Note that I use OR by default. So as I said, the search result I want is the one with the highest score, but I was hoping to find a way to boost the score based on the number of terms it finds (or matches well) so that I can differentiate between a close match and nowhere near. Any suggestions? Regards, T -- View this message in context: http://lucene.472066.n3.nabble.com/Query-question-tp4040559.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query question
Use the edismax query parser and set the PF, PF2, and PF3 parameters so that adjacent pairs and triples of query terms will get phrase boosted. See: http://wiki.apache.org/solr/ExtendedDisMax#pf_.28Phrase_Fields.29 http://wiki.apache.org/solr/ExtendedDisMax#pf2_.28Phrase_bigram_fields.29 -- Jack Krupansky -Original Message- From: dm_tim Sent: Thursday, February 14, 2013 8:00 PM To: solr-user@lucene.apache.org Subject: Query question Howdy, I have a straight-forward index that contains a name field. I am currently taking a string of text, tokenizing it into individual strings and making a query out of them all against the name field. Note that the name field is split up by a whitespace tokenizer and a lower case filter during indexing. My query is working fine but I want to boost the score when multiple terms match. So for example if I had an entry in my index that was originally Valley Fair Mall and the string I was using to search was I'm shopping at Valley Fair mall my query is currently being chopped into: name:i'm~ name:shopping~ name:at~ name:valley~ name:fair~ name:mall~ Note that I use OR by default. So as I said, the search result I want is the one with the highest score, but I was hoping to find a way to boost the score based on the number of terms it finds (or matches well) so that I can differentiate between a close match and nowhere near. Any suggestions? Regards, T -- View this message in context: http://lucene.472066.n3.nabble.com/Query-question-tp4040559.html Sent from the Solr - User mailing list archive at Nabble.com.
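Concretely, the request might look something like this (URL-encoding aside; the parameter values are only an illustration of the pf knobs, not a tuned setup):

q=I'm shopping at Valley Fair mall&defType=edismax&qf=name&pf=name&pf2=name&pf3=name

qf scores the individual terms, while pf, pf2, and pf3 re-rank documents where the whole phrase, adjacent pairs, or adjacent triples of the query terms appear in name, so an entry like Valley Fair Mall rises above documents matching only one stray term.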
Fetching the date based on lastupdate
Hi, I have a column called 'lastUpdate' in my solr index which contains the last-updated date. Now I want to fetch the last 24 lastUpdate dates from that column. How to do this? Querying the solr server with the following URL fetches me the result: http://localhost/solr/MC_10701_catalogEntry/select?q=lastUpdate:{* TO NOW}&sort=lastUpdate desc This URL fetches the lastUpdate dates in descending order. Now I want only the last 24 records to be fetched. Is there any function in solr to do this? Please help me. Thanks in advance. -- View this message in context: http://lucene.472066.n3.nabble.com/Fetching-the-date-based-on-lastupdate-tp4040564.html Sent from the Solr - User mailing list archive at Nabble.com.
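If last 24 means the 24 most recent documents, no special function is needed; the standard rows (and start) parameters do it. A sketch against the URL above:

http://localhost/solr/MC_10701_catalogEntry/select?q=lastUpdate:{* TO NOW}&sort=lastUpdate desc&rows=24&start=0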
Re: What should focus be on hardware for solr servers?
You could run Lucene benchmark stuff and compare. Or look at ActionGenerator from Sematext on Github which you could also use for performance testing and comparing. Otis Solr ElasticSearch Support http://sematext.com/ On Feb 14, 2013 10:56 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Or perhaps we should develop our own, Solr-based benchmark... Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Thu, Feb 14, 2013 at 10:54 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: My dual-core, HT-enabled Dell Latitude from last year has this CPU: model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz bogomips: 4988.65 An m3.xlarge reports: model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz bogomips : 4000.14 I tried running geekbench and phoronx-test-suite and failed at both... Anybody have a favorite, free, CLI benchmarking suite? Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Thu, Feb 14, 2013 at 8:10 AM, Jack Krupansky j...@basetechnology.com wrote: That raises the question of how your average professional notebook computer (PC or Mac or Linux) compares to a garden-variety cloud server such as an Amazon EC2 m1.large (or m3.xlarge) in terms of performance such as document ingestion rate or how many documents you can load before load and/or query performance starts to fall off the cliff. Anybody have any numbers? I mean, is a MacBook Pro half of an EC2 m1.large? Twice? Less? More? Any rough feel? (With all the usual caveats that it all depends and your mileage will vary.) But the intent would be for a similar workload on both (like loading the wikipedia dump.) -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Thursday, February 14, 2013 7:31 AM To: solr-user@lucene.apache.org Subject: Re: What should focus be on hardware for solr servers? One data point: I can comfortably index and search the Wikipedia dump (11M articles, 5M with text) on my Macbook Pro. Admittedly not heavy-duty queries, but Erick On Wed, Feb 13, 2013 at 4:01 PM, Matthew Shapiro m...@mshapiro.net wrote: Excellent, thank you very much for the reply! On Wed, Feb 13, 2013 at 2:08 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Matthew Shapiro [m...@mshapiro.net] wrote: Sorry, I should clarify our current statistics. First of all I meant 183k documents (not 183, woops). Around 100k of those are full fledged html articles (not web pages but articles in our CMS with html content inside of them), If an article is around 10-30 pages (or the equivalent), this is still a small corpus. the rest of the data are more like key/value data records with a lot of attached meta data for searching. If the amount of unique categories (model, author, playtime, lix, favorite_band, year...) in the meta data is in the lower hundreds, you should be fine. Also, what I meant by search without a search term is that probably 80% (hard to confirm due to the lack of stats given by the GSA) of our searches are done on pure metadata clauses without any searching through the content itself, That clarifies a lot, thanks. So we have roughly speaking 4000*5 queries/day ~= 14 queries/minute. Guessing wildly that your peak time traffic is about 5 times that, we end up with about 1 query/second. That is a very light load for the Solr installation we're discussing. 
so for example give me documents that have a content type of video, that are marked for client X, have a category of Y or Z, and was published to platform A, ordered by date published. That is a near-trivial query and you should get a reply very fast on modest hardware. The searches that use a search term are more like use the same query from the example as before, but find me all the documents that have the string My Video in it's title and description. Unless you experiment with fuzzy matches and phrase slop, this should also be fast. Ignoring analyzers, there is practically no difference between a meta data field and a larger content field in Solr. Your current search (guessing here) iterates all terms in the content fields and take a comparatively large penalty when a large document is encountered. The inversion of index in Solr means that the search terms are looked up in a dictionary and refers to the documents they belong to. The penalty for having thousands or millions of terms as compared to tens or hundreds in a field in an
suggestions w.r.t Issue with Collections API in 4.1
*1. empty Zookeeper* *2. empty index directories for solr* *3. empty solr.xml*

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" hostPort="${jetty.port:}" hostContext="solr">
  </cores>
</solr>

*3.1 upload / link cfg in zookeeper for the test collection* *4. start 4 solr servers on different machines* *5. Access server*: I see There are no SolrCores running — for the current functionality we require at least one SolrCore, sorry :) and that's ok *6. CREATE collection* http://hostname:15000/solr/admin/collections?action=CREATE&name=test&numShards=1&replicationFactor=4 this creates one core on each server with one shard, named - test_shard1_replica1 - test_shard1_replica2 - test_shard1_replica3 - test_shard1_replica4 and persists it in solr.xml on each server. *but why are these cores not started?* And even on server reboot, even though solr.xml says loadOnStartup=true, I still see the ERROR on the web admin UI: There are no SolrCores running — for the current functionality we require at least one SolrCore, sorry :) I did try this once successfully and I think I am missing something now. I cannot see any errors in the log that are severe. -- Anirudha P. Jadhav
How to make this work with SOLR ( LUCENE-2899 : Add OpenNLP Analysis capabilities as a module)
I'm trying to explore Parts-Of-Speech tagging with SOLR. Firstly, am I right in assuming that OpenNLP integration is the right direction in which to proceed? With respect to getting OpenNLP to work with SOLR ( http://wiki.apache.org/solr/OpenNLP ), I tried following the instructions, only to be faced with an error complaining that OpenNLPTokenizerFactory cannot be found. Upon researching the error, I came across the issue https://issues.apache.org/jira/browse/LUCENE-2899 , which indicates that integration is not yet complete and the OpenNLP functionality is only available via a patch (I'm running SOLR 4.1 locally). I tried patching my SOLR 4.1 source, as well as a freshly downloaded SOLR trunk, to no avail. I guess I just need some tips on how and what to patch. I tried to patch the base directory as well as the lucene directory. If there's something I need to hack in the patch, do let me know. Thanks

vinayb@blackbox ~/Downloads/solr-4.1.0/lucene $ pwd
/home/vinayb/Downloads/solr-4.1.0/lucene
vinayb@blackbox ~/Downloads/solr-4.1.0/lucene $ ls
analysis BUILD.txt codecs demo highlighter JRE_VERSION_MIGRATION.txt LUCENE-2899.patch misc queries sandbox suggest tools backwards build.xml common-build.xml facet ivy-settings.xml licenses memory module-build.xml queryparser site SYSTEM_REQUIREMENTS.txt benchmark CHANGES.txt core grouping join LICENSE.txt MIGRATE.txt NOTICE.txt README.txt spatial test-framework
vinayb@blackbox ~/Downloads/solr-4.1.0/lucene $ patch -p0 -i LUCENE-2899.patch --dry-run
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git dev-tools/eclipse/dot.classpath dev-tools/eclipse/dot.classpath
|index 1d2abc1..575b4f0 100644
|--- dev-tools/eclipse/dot.classpath
|+++ dev-tools/eclipse/dot.classpath
--------------------------
File to patch:
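One thing that stands out in the transcript: the patch's paths (dev-tools/..., lucene/...) are relative to the top of a full checkout, but patch was run from inside lucene/, where no dev-tools directory exists. A guess at the fix, assuming the patch otherwise matches this tree (it was generated against trunk, so a 4.1 source release, which may not even ship dev-tools, could still reject some hunks):

cd ~/Downloads/solr-4.1.0        # top of the source tree, not lucene/
patch -p0 --dry-run -i lucene/LUCENE-2899.patch
patch -p0 -i lucene/LUCENE-2899.patch   # apply for real once the dry run is clean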
Re: suggestions w.r.t Issue with Collections API in 4.1
I don't know - by chance, I'm actually doing about the same sequence of events right now with Solr 4.1, and the cores are running fine… What do the logs say? - Mark On Feb 14, 2013, at 10:18 PM, Anirudha Jadhav aniru...@nyu.edu wrote: *1. empty Zookeeper* *2. empty index directories for solr* *3. empty solr.xml*

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" hostPort="${jetty.port:}" hostContext="solr">
  </cores>
</solr>

*3.1 upload / link cfg in zookeeper for the test collection* *4. start 4 solr servers on different machines* *5. Access server*: I see There are no SolrCores running — for the current functionality we require at least one SolrCore, sorry :) and that's ok *6. CREATE collection* http://hostname:15000/solr/admin/collections?action=CREATE&name=test&numShards=1&replicationFactor=4 this creates one core on each server with one shard, named - test_shard1_replica1 - test_shard1_replica2 - test_shard1_replica3 - test_shard1_replica4 and persists it in solr.xml on each server. *but why are these cores not started?* And even on server reboot, even though solr.xml says loadOnStartup=true, I still see the ERROR on the web admin UI: There are no SolrCores running — for the current functionality we require at least one SolrCore, sorry :) I did try this once successfully and I think I am missing something now. I cannot see any errors in the log that are severe. -- Anirudha P. Jadhav