gui for solr index
Is there a standard solution in Apache Solr (from trunk) for the following: a GUI to view the Solr index?
Re: how to index data in solr from database automatically
How about having a delta-import and a cron job to trigger the post? -- Anshum Gupta http://ai-cafe.blogspot.com On Fri, Jun 24, 2011 at 11:13 AM, Romi romijain3...@gmail.com wrote: I have a MySQL database for my application. I implemented Solr search and used the DataImportHandler (DIH) to index data from the database into Solr. My question is: is there any way that, when the database gets updated, my Solr indexes automatically get updated with the new data added to the database? That is, I should not have to run the indexing process manually every time the database tables change. If yes, please tell me how I can achieve this. - Thanks Regards Romi
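A minimal sketch of the suggested setup, assuming DIH is registered at /dataimport and a deltaQuery is defined in data-config.xml (the host, core path, and schedule here are illustrative):

    # crontab entry: trigger a DIH delta-import every 10 minutes
    */10 * * * * curl -s "http://localhost:8983/solr/dataimport?command=delta-import&clean=false" > /dev/null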
Re: gui for solr index
Please use Luke, the Lucene index toolbox: www.getopt.org/luke/ On 6/24/2011 1:29 PM, Алексей Цой wrote: is there a standard solution in Apache Solr (from trunk) for the following: a GUI to view the Solr index?
Re: Re: DIH Scheduling
On Thu, Jun 23, 2011 at 9:13 PM, simon mtnes...@gmail.com wrote: The Wiki page describes a design for a scheduler, which has not been committed to Solr yet (I checked). I did see a patch the other day (see https://issues.apache.org/jira/browse/SOLR-2305) but it didn't look well tested. I think that you're basically stuck with something like cron at this time. If your application is written in Java, take a look at the Quartz scheduler - http://www.quartz-scheduler.org/ It was considered and decided against. -Simon -- Noble Paul
Re: how to index data in solr from database automatically
Yeah, I am using data-import to get data from the database and indexing it, but what is cron? Can you please provide a link for it? - Thanks Regards Romi
Re: Updating the data-config file
Ahh! That's interesting! I understand what you mean. Since RSS and Atom feeds have the same structure, parsing them would be the same, but I can do the same for each of the different URLs. These URLs can be obtained from a db, a file or through the request parameters, right? Exactly. You can register multiple dataSources with different names, and then in each entity you can select the appropriate data source with the dataSource=... attribute. For a db, data-config.xml would be something like:

    <dataSource type="HttpDataSource" name="http"/>
    <dataSource type="JdbcDataSource" name="db" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/mydb" batchSize="-1"/>
    <entity name="urls" dataSource="db" query="SELECT url FROM urls">
      <entity name="slashdot" dataSource="http" pk="link" url="${urls.url}"
              processor="XPathEntityProcessor" forEach="/RDF/channel | /RDF/item"
              transformer="DateFormatTransformer"
Re: how to index data in solr from database automatically
Cron is a time-based job scheduler in Unix-like computer operating systems. en.wikipedia.org/wiki/Cron Pranav Prakash - temet nosce - Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Fri, Jun 24, 2011 at 12:26, Romi romijain3...@gmail.com wrote: Yeah, I am using data-import to get data from the database and indexing it, but what is cron? Can you please provide a link for it? - Thanks Regards Romi
Re: Query time noun, verb boosting
2011/6/23 Anshum ansh...@gmail.com Pooja, You could use UIMA (or any other) Parts of Speech tagger. You can read a little more about it here: http://uima.apache.org/downloads/sandbox/hmmTaggerUsersGuide/hmmTaggerUsersGuide.html#sandbox.tagger.annotatorDescriptor This would help you annotate and segregate nouns from verbs in the input. You could then aptly form the query. Perhaps this would take some effort, but I'm assuming it'd work reasonably well. I've done this recently using the UIMA POS tagger and other annotators within a TokenFilter to add a TypeAttribute and PayloadAttribute to each token, and eventually filter/boost when searching. Regards, Tommaso -- Anshum Gupta http://ai-cafe.blogspot.com On Thu, Jun 23, 2011 at 11:18 AM, Pooja Verlani pooja.verl...@gmail.com wrote: Hi, Say for example, with a query like "manmohan singh dancing", I prefer to make a compulsory condition on nouns to be searched, but any verb isn't important to me: I prefer to extract results for "manmohan singh" and not for "dancing". If I can extract the noun/verb, or can get to know that in my index I have a concept of "manmohan singh" (or an identity, if not a concept), I would like to define rules for doing a strict (compulsory) match of the noun (concept) and a loose match (non-compulsory boosting) for the verb. Basically, I want to avoid getting zero results from a compulsory match of all 3 tokens of the query (in this case "manmohan singh dancing"); instead I want to do a compulsory match on "manmohan singh", since that exists in my index, while "dancing" shouldn't be a compulsory match, so that I get a non-zero number of results. Hope this explains. Any suggestions? Regards, Pooja On Thu, Jun 23, 2011 at 11:07 AM, Anshum ansh...@gmail.com wrote: What would you mean by 'noun or some concept'? It would be better if you could give a rather concrete example. For detecting parts of speech you could use a lot of libraries, but I didn't get the part about boosting terms from the index. -- Anshum Gupta http://ai-cafe.blogspot.com On Thu, Jun 23, 2011 at 11:02 AM, Pooja Verlani pooja.verl...@gmail.com wrote: Hi, At query time, I want to build the Lucene query such that it boosts only the noun from the query, or some concept existing in the index. Are there any possibilities or any possible ideas that can be worked around? Regards, Pooja
multicore and replication cause OOM
Hi, I have a Solr instance with 7 cores (~150MB each). All cores replicate at the same time from a Solr master instance. Every time the replication happens I get an OOM after experiencing long response times. This Solr used to have 4 cores before, and I never got an OOM with that configuration (replication occurs on a daily basis). My questions are: could the 3 new cores be the cause of the OOM? Does Solr require considerable extra heap for performing the replication? Should I avoid replicating all the cores at the same time? I'm using Solr 1.4 with the following memory configuration: -Xms512m -Xmx512m -XX:NewSize=128M -XX:MaxNewSize=128M Appreciate any help. Regards, Esteban
Re: Understanding query explain information
Is it possible that synonyms are being added (synonym expansion), or at least changing the field length? I've seen this before. Check exactly what terms have been added. On 23 June 2011 22:50, Alexander Ramos Jardim alexander.ramos.jar...@gmail.com wrote: Yes, I am using synonyms at index time. 2011/6/22 lee carroll lee.a.carr...@googlemail.com Hi, are you using synonyms? On 22 June 2011 10:30, Alexander Ramos Jardim alexander.ramos.jar...@gmail.com wrote: Hi guys, I have some doubts about how to correctly understand the debugQuery output. I have a field named itemName in my index. This is a text field, just that. When I query a simple ?q=itemName:iPad, I end up with the following query result. I am simply trying to understand why these strings generated such scores; as far as I can understand, the only difference between them is the field norms, as all the other factors stay the same. Now, how do I get these field norm values? Field norm is the result of this formula, right: 1/sqrt(terms), where terms is the number of terms in my field after it is indexed? Well, if this is true, the field norm for my first document should be 0.5 (1/sqrt(4)), as "Livro - IPAD - O Guia do Profissional" ends up with the tokens livro|ipad|guia|profissional. What am I forgetting to take into account?

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">3</int>
        <lst name="params">
          <str name="debugQuery">on</str>
          <str name="start">0</str>
          <str name="rows">10</str>
          <arr name="indent"><str>on</str><str>on</str></arr>
          <str name="fl">itemName,score</str>
          <str name="version">2.2</str>
          <str name="q">itemName:ipad</str>
        </lst>
      </lst>
      <result name="response" numFound="161" start="0" maxScore="3.6808658">
        <doc><float name="score">3.6808658</float><str name="itemName">Livro - IPAD - O Guia do Profissional</str></doc>
        <doc><float name="score">3.1550279</float><str name="itemName">Leitor de Cartão para Ipad - Mobimax</str></doc>
        <doc><float name="score">3.1550279</float><str name="itemName">Sleeve para iPad</str></doc>
        <doc><float name="score">3.1550279</float><str name="itemName">Sleeve de Neoprene para iPad</str></doc>
        <doc><float name="score">3.1550279</float><str name="itemName">Carregador de parede para iPad</str></doc>
        <doc><float name="score">2.6291897</float><str name="itemName">Case Envelope para iPad - Black - Built NY</str></doc>
        <doc><float name="score">2.6291897</float><str name="itemName">Case Protetora p/ IPad de Silicone Duo - Browm - Iskin</str></doc>
        <doc><float name="score">2.6291897</float><str name="itemName">Case Protetora p/ IPad de Silicone Duo - Clear - Iskin</str></doc>
        <doc><float name="score">2.6291897</float><str name="itemName">Case p/ iPad Sleeve - Black - Built NY</str></doc>
        <doc><float name="score">2.6291897</float><str name="itemName">Bolsa de Proteção p/ iPad Preta - Geonav</str></doc>
      </result>
      <lst name="debug">
        <str name="rawquerystring">itemName:ipad</str>
        <str name="querystring">itemName:ipad</str>
        <str name="parsedquery">itemName:ipad</str>
        <str name="parsedquery_toString">itemName:ipad</str>
        <lst name="explain">
          <str name="7369507">3.6808658 = (MATCH) fieldWeight(itemName:ipad in 102507), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.4375 = fieldNorm(field=itemName, doc=102507)</str>
          <str name="739">3.1550279 = (MATCH) fieldWeight(itemName:ipad in 226401), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.375 = fieldNorm(field=itemName, doc=226401)</str>
          <str name="7356941">3.1550279 = (MATCH) fieldWeight(itemName:ipad in 226409), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.375 = fieldNorm(field=itemName, doc=226409)</str>
          <str name="7356931">3.1550279 = (MATCH) fieldWeight(itemName:ipad in 226447), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.375 = fieldNorm(field=itemName, doc=226447)</str>
          <str name="7360321">3.1550279 = (MATCH) fieldWeight(itemName:ipad in 226583), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.375 = fieldNorm(field=itemName, doc=226583)</str>
          <str name="7428354">2.6291897 = (MATCH) fieldWeight(itemName:ipad in 223178), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.3125 = fieldNorm(field=itemName, doc=223178)</str>
          <str name="7366074">2.6291897 = (MATCH) fieldWeight(itemName:ipad in 223196), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.3125 = fieldNorm(field=itemName, doc=223196)</str>
          <str name="7366068">2.6291897 = (MATCH) fieldWeight(itemName:ipad in 223831), product of: 1.0 = tf(termFreq(itemName:ipad)=1)
Re: Garbage Collection: I have given bad advice in the past!
If possible, can you please share some details of your setup, like the number of shards, how big they are size/doc_count wise, and what the user load per second is. On Fri, Jun 24, 2011 at 1:39 AM, Shawn Heisey s...@elyograg.org wrote: In the past I have told people on this list and in the IRC channel #solr what I use for Java GC settings. A couple of days ago, I cleaned up my testing methodology to more closely mimic real production queries, and discovered that my GC settings were woefully inadequate. Here's what I was using on a virtual machine with 9GB of RAM. I've been using this for several months, and chose it because I had read several things praising it. I should have done more research. -Xms512M -Xmx2048M -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode On my backup servers, I am in the process of getting 3.2.0 ready to replace our 1.4.1 index. I ran into a situation where committing a delta-import of only a few thousand records took longer than 3 minutes (the Perl LWP default timeout) on every shard, where normally in production on 1.4.1 it only takes a few seconds. This was shortly after I had hit the distributed index pretty hard with my improved benchmarking. Using jstat, I found that while under benchmarking load, the system was spending 10-15% of its time doing garbage collection, and that most of the garbage collections were from the young generation. First I tried increasing the young generation size with the -XX:NewSize=1024M parameter. This helped on the total GC count, but didn't really help with how much time was spent doing them. A good command to see these statistics on Linux, and an Oracle link explaining what it all means: jstat -gc -t `pgrep java` 5000 http://download.oracle.com/javase/6/docs/technotes/tools/share/jstat.html I've learned that Solr will keep most of its data in the young generation (eden), unless that memory pool is too small, in which case it will move data to the tenured generation. The key to good performance seems to be creating a large enough young generation. You do need to have a good chunk of tenured space available, unless the Solr instance has no index itself and exists only to distribute queries to shards living on other Solr instances; in that case, it hardly uses the tenured generation. It turns out that CMSIncrementalMode causes more young generation collections and makes them take longer, which is exactly what Solr does NOT need. After messing around with it for quite a while, I came up with the following settings, which included an increase in heap size: -Xms3072M -Xmx3072M -XX:NewSize=1536M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled With these settings, it spends very little time doing garbage collections. One of my shards has been up for nearly 24 hours, has been hit with the benchmarking script repeatedly, and it has only done 62 young generation collections and zero full collections, with 6.8 seconds total GC time. I am thinking of increasing NewSize yet again, because the tenured generation (1.5GB in size) is only one third utilized after nearly 24 hours. My settings will probably not work for everyone, but I hope this post will make it easier for others to find the right solution for themselves. Thanks, Shawn -- Regards, Dmitry Kan
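For reference, a minimal sketch of applying those final settings when starting Solr under the bundled Jetty (the start.jar invocation is illustrative; adjust for your container):

    java -Xms3072M -Xmx3072M -XX:NewSize=1536M \
         -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
         -jar start.jar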
Re: testing subscription.
passed. On Thu, Jun 23, 2011 at 10:38 PM, Esteban Donato esteban.don...@gmail.com wrote: -- Regards, Dmitry Kan
Question about optimization
Hi, I saw this in the Solr wiki: "An un-optimized index is going to be *at least* 10% slower for un-cached queries." Is this still true? I read somewhere that recent versions of Lucene are less sensitive to un-optimized indexes than older ones... With 50,000 new (or updated) documents coming into my index every day, would a once-a-day optimization be sufficient? Thanks in advance, Marc.
Re: multicore and replication cause OOM
On Fri, Jun 24, 2011 at 1:41 PM, Esteban Donato esteban.don...@gmail.com wrote: I have a Solr instance with 7 cores (~150MB each). All cores replicate at the same time from a Solr master instance. Every time the replication happens I get an OOM after experiencing long response times. This Solr used to have 4 cores before, and I never got an OOM with that configuration (replication occurs on a daily basis). My question is: could the 3 new cores be the cause of the OOM? Does Solr require considerable extra heap for performing the replication? Yes and no. Replication itself does not consume a lot of heap (I'd guess a couple of MB per ongoing replication). However, when the searchers are re-opened on the newly installed index, auto-warming can cause memory usage to double for a core. Should I avoid replicating all the cores at the same time? You should try that, especially if you are so constrained for heap space. I'm using Solr 1.4 with the following mem configuration: -Xms512m -Xmx512m -XX:NewSize=128M -XX:MaxNewSize=128M That seems to be a small amount of RAM for indexing/querying seven 150MB indexes in parallel. -- Regards, Shalin Shekhar Mangar.
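A sketch of one way to stagger the replication, assuming automatic polling is disabled on the slave (core names and times are illustrative; fetchindex is a standard replication handler command):

    # crontab on the slave: pull each core's index at offset times instead of all at once
    0  3 * * * curl -s "http://localhost:8983/solr/core1/replication?command=fetchindex"
    20 3 * * * curl -s "http://localhost:8983/solr/core2/replication?command=fetchindex"
    40 3 * * * curl -s "http://localhost:8983/solr/core3/replication?command=fetchindex"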
Query may only contain [a-z][0-9]
Hello, Is it possible to configure Solr so that only numbers and letters are accepted ([a-z][0-9])? When a user enters a term like + or - I get some Solr errors. How can I exclude these characters?
Re: Weird issue with solr and jconsole/jmx
I just encountered the same bug - JMX-registered beans don't survive Solr core reloads. I believe the reason is that when you do a core reload: * when the new core is created, it overwrites/over-registers beans in the registry (in the MBeanServer) * when the new core is ready, in the core register phase, CoreContainer closes the old core, which results in unregistering its JMX beans. As a result, after a core reload there is only one bean left in the registry, id=org.apache.solr.search.SolrIndexSearcher,type=Searcher@33099cc main, because this is the only new (dynamically named) bean that is created by the new core and not un-registered in oldCore.close. I'll try to reproduce that in a test and file a bug in Jira. On Tue, Mar 16, 2010 at 4:25 AM, Andrew Greenburg agreenb...@gmail.com wrote: On Tue, Mar 9, 2010 at 7:44 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I connected to one of my solr instances with Jconsole today and : noticed that most of the mbeans under the solr hierarchy are missing. : The only thing there was a Searcher, which I had no trouble seeing : attributes for, but the rest of the statistics beans were missing. : They all show up just fine on the stats.jsp page. : : In the past this always worked fine. I did have the core reload due to : config file changes this morning. Could that have caused this? possibly... reloading the core actually causes a whole new SolrCore object (with its own registry of SolrInfoMBeans) to be created and then swapped in place of the previous core ... so perhaps you are still looking at the stats of the old core which is no longer in use (and hasn't been garbage collected because the JMX manager still had a reference to it for you? ... I'm guessing at this point). did disconnecting from jconsole and reconnecting show you the correct stats? Disconnecting and reconnecting didn't help. The queryCache and documentCache and some others started showing up after I did a commit and opened a new searcher, but the whole tree never did fill in. I'm guessing that the request handler stats stayed associated with the old, no-longer-visible core in JMX, since new instances weren't created when the core reloaded. Does that make sense? The stats on the web stats page continued to be fresh.
Re: how to index data in solr from database automatically
Would you please tell me how I can use cron to auto-index my database tables in Solr? - Thanks Regards Romi
Re: Query may only contain [a-z][0-9]
Probably the best place to do this is in the application layer. Also, if the problem is with parsing errors, have you tried the dismax or edismax query parsers? On Fri, Jun 24, 2011 at 7:15 AM, roySolr royrutten1...@gmail.com wrote: Hello, Is it possible to configure Solr so that only numbers and letters are accepted ([a-z][0-9])? When a user enters a term like + or - I get some Solr errors. How can I exclude these characters?
Re: Query may only contain [a-z][0-9]
Yes, I use the dismax handler, but I will fix this in my application layer. Thanks
Advice wanted on approach/architecture
Hi List, I'm looking into some options on what technology to adopt for building a specific logfile search solution. At first glance it looks like Solr is the tool I'm looking for. I intend to write a web-based front end for end users. What would be a possible approach to tackle the following requirements? In other words, how could these requirements be translated into Solr on a high level? I'm not asking for solutions, just pointers, approaches, tips, Solr features to look at, possible pitfalls, ... - A query results in a set of results. - Individual records from this query should have the ability to be marked so that (although they match the query) those specific records no longer show up when the same query is rerun. - I don't want to delete data from the db/index. - I want to avoid having my application take care of excluding parts of the returned data by keeping track of which record ids to exclude. - A query should exclude the records which have a match in a possibly large, growing list of regexes. Thanks! Jelle
intersecting map extent with solr spatial documents
The following describes a Solr query filter that determines whether axis-aligned geographic bounding boxes intersect. It is used to determine which documents in a Solr repository containing spatial data are relevant to a given rectangular geographic region that corresponds to a displayed map. I haven't seen this described before. I thought it might be useful to others, and I might get some pointers on how to improve it. OpenGeoPortal (http://geoportal-demo.atech.tufts.edu/) is a web application supporting the rapid discovery of GIS layers. It uses Solr to combine spatial, keyword, date and GIS-datatype based searching. As a user manipulates its map, OpenGeoPortal automatically computes and displays relevant search results. This spatial searching requires the application to determine which GIS layers are relevant given the current extent of the map. Each Solr document includes spatial information about a single GIS layer. Specifically, it contains the center of the layer (in degrees latitude and longitude, stored as tdoubles) as well as the half-width and half-height of the layer (in degrees, stored as tdoubles). These values are precomputed from the bounding boxes of the layers during ingest. To identify relevant layers, our search algorithm looks for a separating axis (http://en.wikipedia.org/wiki/Separating_axis_theorem) between the current bounds of the map and the bounds of each layer. If a horizontal or vertical separating axis exists, then the layer does not contain any information in the geographic area defined by the map. If neither separating axis exists, then the layer intersects the map and is included in the result set. Identifying whether separating axes exist is relatively straightforward given two axis-aligned bounding boxes. In our case, one bounding box is defined by the map's current extent and the other bounding box by a GIS layer. To determine whether a vertical separating axis exists, one must determine whether the difference between the center longitude of the map and the center longitude of the layer is greater than the sum of the half-width of the map and the half-width of the layer. If so, a vertical separating axis exists. If not, a vertical separating axis does not exist. (See http://www.gamasutra.com/view/feature/3383/simple_intersection_tests_for_games.php?page=3 for a diagram.) Similarly, the presence of a horizontal separating axis can be computed using center latitudes and half-heights. It is possible to generate a Solr filter query that selects the layers that have neither a horizontal nor a vertical separating axis with respect to a specific map. Naturally, this query is somewhat complicated. The query essentially counts the number of separating axes and, using !frange, eliminates layers that have a separating axis. In the following example, the map was centered on latitude 42.3, longitude -71.0 and had a width and height of 0.3 degrees. The schema defines the fields CenterX, CenterY, HalfWidth and HalfHeight.

    fq={!frange l=1 u=2} map(sum(
      map(sub(abs(sub(-71.0,CenterX)),sum(0.3,HalfWidth)),0,360,1,0),
      map(sub(abs(sub(42.3,CenterY)),sum(0.3,HalfHeight)),0,90,1,0)),
      0,0,1,0)

The clauses that check for a separating axis (e.g., sub(abs(sub(-71.0,CenterX)),sum(0.3,HalfWidth)) and sub(abs(sub(42.3,CenterY)),sum(0.3,HalfHeight))) return a positive number if a separating axis exists. Using a map function, this value is mapped to 1 if the separating axis exists and 0 if it does not. The Solr query checks for both separating axes and computes the number of such axes using sum.
The total number of separating axes (which is 0, 1 or 2) is then mapped to the values 0 and 1. This final map returns 1 if there are no separating axes (that is, the bounding boxes intersect) or 0 if there is at least one separating axis (that is, the bounding boxes do not intersect). The outermost clause applies frange to eliminate those layers that do not intersect the current map. Ranking the layers that intersect the map is a separate issue. This is done with several query clauses. One clause determines how the area of the map compares to the area of the layer. The other determines how the center of the map compares to the center of the layer. These clauses are used in conjunction with keyword-based queries and date-based filters to create search results based on spatial, keyword and temporal constraints.
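To make the arithmetic concrete, here is a hypothetical worked example (the layer values are invented for illustration): given the map above (center longitude -71.0, half-width 0.3), a layer with CenterX=-70.5 and HalfWidth=0.1 yields abs(sub(-71.0,CenterX)) = 0.5, which exceeds sum(0.3,HalfWidth) = 0.4; the subtraction is positive, the inner map yields 1, a vertical separating axis exists, and the layer is filtered out. A layer with CenterX=-70.9 instead yields a difference of 0.1, the subtraction is negative, the map yields 0, and no vertical separating axis exists on that dimension.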
Re: how to index data in solr from database automatically
First write a script in Python (or Java or PHP or any language) which reads the data from the database and indexes it into Solr. Then set up this script as a cron job to run automatically at a certain interval. On 24 June 2011 17:23, Romi romijain3...@gmail.com wrote: Would you please tell me how I can use cron to auto-index my database tables in Solr? - Thanks Regards Romi -- Thanks and Regards Mohammad Shariq
Re: Query may only contain [a-z][0-9]
I think another alternative is to use a phrase query and then a PatternReplaceFilterFactory at query time to remove the unwanted characters. I don't know if phrase query behavior meets your requirements, though. On Fri, Jun 24, 2011 at 9:39 AM, roySolr royrutten1...@gmail.com wrote: Yes, I use the dismax handler, but I will fix this in my application layer. Thanks
Re: how to index data in solr from database automatically
Why don't you use the DataImportHandler? We use DIH; we have a wget-based bash script that is run by cron about every 2 minutes, and DIH is called in delta-import mode. The bash script works this way: 1) first call wget on the DIH status: /dataimport?command=status 2) analyze the wget DIH status 2.1) if the status is *busy*, do nothing and exit (because DIH is already running) 2.2) if the status is *idle*, do /dataimport?command=delta-import&clean=false 3) exit On 24/06/11 15:20, Mohammad Shariq wrote: First write a script in Python (or Java or PHP or any language) which reads the data from the database and indexes it into Solr. Then set up this script as a cron job to run automatically at a certain interval. -- Renato Eschini Inera srl http://www.inera.it
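A minimal sketch of such a script (the host and core path are illustrative; the grep test assumes the stock DIH status XML, which reports the status as idle or busy):

    #!/bin/bash
    # run from cron every ~2 minutes: start a delta-import only if DIH is idle
    DIH="http://localhost:8983/solr/dataimport"
    if wget -q -O - "$DIH?command=status" | grep -q '>idle<'; then
        wget -q -O /dev/null "$DIH?command=delta-import&clean=false"
    fi
    # if the status is busy, fall through and exit without doing anything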
Re: Query may only contain [a-z][0-9]
You should escape those characters: http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Escaping%20Special%20Characters On 6/24/11 3:15 AM, roySolr wrote: Hello, Is it possible to configure Solr so that only numbers and letters are accepted ([a-z][0-9])? When a user enters a term like + or - I get some Solr errors. How can I exclude these characters?
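For example, following that page, a user query for 1+1 would be sent to Solr as 1\+1, with a backslash before each special character.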
Query Results Differ
Hi, I am trying to understand why the two queries return different results. To me they look similar; can someone help me understand the difference in the results? Query 1: facet=true&q=time&fq=supplierid:1001&start=0&rows=10&sort=published_on desc Query 2: facet=true&q=time&fq=supplierid:1001+published_on:[* TO NOW]&start=0&rows=10&sort=published_on desc The first query returns only 44 rows while the second one returns 200,000 rows. When I don't have the filter for published_on, I am assuming that Solr should return all the results with supplier id 1001, so Query 1 should have returned more results (or at least the same number of results) than the second query. Thanks.
Re: Query Results Differ
+ is an urlencoded whitespace, so your filter query says either supplierid or published_on. What you could do is: 1) use a second fq= param 2) combine them both into one like this: fq=foo+%2Bbar (%2B is an urlencoded + character) HTH, Regards Stefan On Fri, Jun 24, 2011 at 4:27 PM, jyn7 jyotsna.namb...@gmail.com wrote: Hi, I am trying to understand why the two queries return different results. To me they look similar; can someone help me understand the difference in the results? Query 1: facet=true&q=time&fq=supplierid:1001&start=0&rows=10&sort=published_on desc Query 2: facet=true&q=time&fq=supplierid:1001+published_on:[* TO NOW]&start=0&rows=10&sort=published_on desc The first query returns only 44 rows while the second one returns 200,000 rows. When I don't have the filter for published_on, I am assuming that Solr should return all the results with supplier id 1001, so Query 1 should have returned more results (or at least the same number of results) than the second query. Thanks.
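Applied to the queries above, the two corrected forms would look like this (a sketch of Stefan's two options; in a real request the spaces in the range would also be urlencoded):

    fq=supplierid:1001&fq=published_on:[* TO NOW]          (two fq params, implicitly ANDed)
    fq=%2Bsupplierid:1001+%2Bpublished_on:[* TO NOW]       (one fq requiring both clauses)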
Re: Garbage Collection: I have given bad advice in the past!
On 6/24/2011 2:19 AM, Dmitry Kan wrote: If possible, can you please share some details of your setup, like the number of shards, how big they are size/doc_count wise, and what the user load per second is. Each full chain (there are two) consists of two servers with 2 quad-core processors and 32GB of RAM. There are 9 VMs contained on those two servers. Six of them house large shards (9GB of RAM each) with about 9.5 million rows each, taking up about 17.5GB of disk space. One of them houses a small shard (3GB RAM) that contains the newest data, usually about 1GB and 400,000 rows. There is a VM (512MB) for running haproxy and a VM (3GB) with a Solr instance that serves as a broker - no index, one core with the shards parameter in solrconfig.xml. The small shard is updated every two minutes. Every ten minutes, deletes are run against all shards. Once an hour, the small shard is optimized. Once a night, data older than 7 days is distributed among the large shards, deleted from the small shard, and one large shard is optimized. Normally data is replicated between the two chains, but right now the primary chain is running 1.4.1 and the backup chain is running 3.2.0. According to Solr stats, the average queries per second in production is well below 1. I don't know what it is during the day when it peaks ... but it's certainly not very large. We do maintain statistics on every search in a database; I just haven't worked out yet how to turn that into usable numbers. The usual statistical functions don't seem to be enough, so I'll probably have to write something myself. If anyone knows an easy way to turn a series of timestamps and QTimes into per-second statistics on arbitrary timeframes (hourly, daily, a 10 second span, etc.), I'm all ears. On my newly tuned 3.2.0 index, I can get near 100 queries per second if I run the benchmarking script a few times in a row. It uses 8 threads, each pounding out 1024 queries as fast as they can. Running it against the old index with the old GC settings, I can only get about 25 queries per second. Both of these numbers are well above what I really need. If I ever need more performance, I can increase the system memory so more of the index fits into RAM, which would also let me increase the Java heap size. I actually hope one day to add servers, decrease the number of large shards, and run without virtualization ... but the funding just isn't there. Shawn
Re: Updating the data-config file
Thanks. I will look into this and see how it goes.
Do unused indexes affect performance?
Hi, As a proof of concept I have imported around ~11 million documents into a Solr index. My schema file has multiple fields defined:

    <dynamicField name="*_id" type="text" indexed="true" stored="true"/>
    <dynamicField name="*_start" type="tdate" indexed="true" stored="true"/>
    <dynamicField name="*_end" type="tdate" indexed="true" stored="true"/>
    <dynamicField name="*" type="string" indexed="true" stored="true"/>

The above are the most important for my question. The average document has around 40 attributes. Each document has:
* a minimum of 2 tdate fields (max of 10)
* a minimum of 2 *_id fields, each containing a space-delimited list of ids (i.e. 4de5656 q23ew9h)
The final dynamicField causes all fields within a document to be indexed. This was done firstly to show the flexibility of Solr and also because I did not know what fields we would use to query/filter on. The total size of my index is ~18GB. However... we now know the fields we will be querying on. I have 3 questions: 1) Do unused indexes on the same dynamicField affect Solr's performance? Our query will always be (type:book book_id:*). Will the presence of 4 million documents (type:location store_id:*) affect Solr's performance? The answer sounds obviously yes, but it may not be the case. 2) Do unused dynamicField indexes affect Solr's performance? All documents have an attribute version which is indexed as text yet never used in any queries. Does its existence (in 11 million documents) affect performance? 3) How does one improve query times against an index? Once an index is built, is there a method to optimise the query analyzers, or a method of removing unused indexes without rebuilding the entire index? The latter is a very important one. We want to replace the current schema with a more restrictive version. Most importantly,

    <dynamicField name="*" type="string" indexed="true" stored="true"/>

becomes

    <dynamicField name="*" type="string" indexed="false" stored="true"/>

But this change alone does not cause the index to shrink. It would be lovely if there was a method to re-analyze an index post-import. More than happy to be referred to related documentation. I have read and considered http://wiki.apache.org/solr/SolrPerformanceFactors and http://wiki.apache.org/lucene-java/ImproveSearchingSpeed but there may be some fluid knowledge held here which is undocumented. Thank you in advance for any answers.
Re: Query Results Differ
So if I use a second fq parameter, will Solr apply an AND on both the fq parameters? I have multiple indexed values, so when I search for q=time, does Solr return results with "time" in any of the indexed values? Sorry for the silly questions.
Re: Query Results Differ
On Fri, Jun 24, 2011 at 5:11 PM, jyn7 jyotsna.namb...@gmail.com wrote: So if I use a second fq parameter, will SOLR apply an AND on both the fq parameters? Yes :) On Fri, Jun 24, 2011 at 5:11 PM, jyn7 jyotsna.namb...@gmail.com wrote: I have multiple indexed values, so when I search for q=time, does SOLR return results with Time in any of the indexed values ? Sorry for the silly questions No. Read here http://wiki.apache.org/solr/SchemaXml#The_Default_Search_Field and afterwards here http://wiki.apache.org/solr/SchemaXml#Copy_Fields Regards Stefan
Call indexer after action on website
People can add advertisements on my website. What I do now is run a scheduled task on my Windows server every night at 3AM. But I want to do a delta-import as soon as the user saves a new advertisement on my website. Now, from the server, doing the delta-import is as easy as calling: http://localhost:8983/solr/dataimport?command=delta-import But, as you can see, that is on localhost, which I can't call from my frontend website. How can I do a delta-import after a visitor action on the front end?
Re: Call indexer after action on website
On 6/24/2011 10:55 AM, PeterKerk wrote: Now, from the server doing the delta import is as easy as calling: http://localhost:8983/solr/dataimport?command=delta-import But, as you can see, that is from localhost, which I cant call from my frontend website. How can I do a delta import after a visitor action on the front? The port number 8983 suggests that you are using the included Jetty. Unless you have taken steps in the container configuration to lock it down so only localhost has access, it should be accessible from anywhere that can reach it, so your application code running on the webserver can just request the following URL, which it could even do with the IP address instead of my example hostname: http://host.example.com:8983/solr/dataimport?command=delta-import It can even check for an error or success status using a similar URL: http://host.example.com:8983/solr/dataimport Thanks, Shawn
Re: multiple spatial values
Yonik Seeley-2-2 wrote: On Tue, Sep 21, 2010 at 12:12 PM, dan sutton danbsut...@gmail.com wrote: I was looking at the LatLonType and how it might represent multiple lon/lat values ... it looks to me like the lat would go in {latlongfield}_0_LatLon and the long in {latlongfield}_1_LatLon ... how then, if we have multiple lat/long points for a doc, when filtering for example, do we choose the correct points? e.g. if thinking in cartesian coords and we have P1(3,4), P2(6,7) ... x is stored with 3,6 and y with 4,7 ... then how does it ensure we're not erroneously picking (3,7) or (6,4) whilst filtering with the spatial query? That's why it's a single-valued field only for now... don't we have to store both values together? what am i missing here? The problem is that we don't have a way to query both values together, so we must index them separately. The basic LatLonType uses numeric queries on the lat and lon fields separately. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8 I have in my index two different fields like you say, Yonik (location_1, location_2), but the problem is when I want to filter results that have d=50 for location_1 and d=50 for location_2. I really don't know how to build the query... For example, this works perfectly: q={!geofilt}&sfield=location_1&pt=36.62288966,-6.23211272&d=25 but how do I add the sfield location_2? I tried nested queries but it doesn't work. Is it possible to do from the URL?
Re: multiple spatial values
On Fri, Jun 24, 2011 at 2:11 PM, marthinal jm.rodriguez.ve...@gmail.com wrote: Yonik Seeley-2-2 wrote: On Tue, Sep 21, 2010 at 12:12 PM, dan sutton danbsut...@gmail.com wrote: I was looking at the LatLonType and how it might represent multiple lon/lat values ... it looks to me like the lat would go in {latlongfield}_0_LatLon and the long in {latlongfield}_1_LatLon ... how then, if we have multiple lat/long points for a doc, when filtering for example, do we choose the correct points? e.g. if thinking in cartesian coords and we have P1(3,4), P2(6,7) ... x is stored with 3,6 and y with 4,7 ... then how does it ensure we're not erroneously picking (3,7) or (6,4) whilst filtering with the spatial query? That's why it's a single-valued field only for now... don't we have to store both values together? what am i missing here? The problem is that we don't have a way to query both values together, so we must index them separately. The basic LatLonType uses numeric queries on the lat and lon fields separately. -Yonik I have in my index two different fields like you say, Yonik (location_1, location_2), but the problem is when I want to filter results that have d=50 for location_1 and d=50 for location_2. I really don't know how to build the query... For example, this works perfectly: q={!geofilt}&sfield=location_1&pt=36.62288966,-6.23211272&d=25 but how do I add the sfield location_2? I tried nested queries but it doesn't work. Is it possible to do from the URL? sfield, pt and d can all be specified directly in the spatial functions/filters too, and that will override the global params. Unfortunately one must currently use Lucene query syntax to do an OR. It just makes it look a bit messier. q=_query_:"{!geofilt}" _query_:"{!geofilt sfield=location_2}" -Yonik http://www.lucidimagination.com
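For instance, a sketch of the fully inlined form (reusing the point from the question and a d of 50; the parameter values are illustrative):

    q=_query_:"{!geofilt sfield=location_1 pt=36.62288966,-6.23211272 d=50}" _query_:"{!geofilt sfield=location_2 pt=36.62288966,-6.23211272 d=50}"

With the default OR operator this matches documents within d of either point; prefixing each clause with + would require both.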
RE: Garbage Collection: I have given bad advice in the past!
Hi Shawn, Thanks for sharing this information. I also found that in our use case, for some reason the default settings for the concurrent garbage collector seem to size the young generation way too small (At least for heap sizes of 1GB or larger.) Can you also let us know what version of the JVM you are using? Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search
Re: Garbage Collection: I have given bad advice in the past!
On 6/24/2011 12:53 PM, Burton-West, Tom wrote: Thanks for sharing this information. I also found that in our use case, for some reason, the default settings for the concurrent garbage collector seem to size the young generation way too small (at least for heap sizes of 1GB or larger). Can you also let us know what version of the JVM you are using? Sure. This is running under CentOS 5.6, with the epel, rpmforge, and jpackage repositories added. [root@idxst0-b ~]# java -version java version 1.6.0_25 Java(TM) SE Runtime Environment (build 1.6.0_25-b06) Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode) I used java-1.6.0-sun-1.6.0.25-1.0.cf.nosrc.rpm to make the java RPMs, as explained here: http://www.city-fan.org/tips/SunJava6OnFedora Looks like I can go to 1.6.0.26 now. Thanks, Shawn P.S. Tom, thanks for all the good info on the HathiTrust blog.
Re: Query Results Differ
Thanks Stefan.
Re: intersecting map extent with solr spatial documents
Very cool! What you've essentially described is a way of indexing and searching lat-lon box shapes, and the cool thing is that you were able to do this without custom coding / hacking of Solr. Sweet! I do have some observations about this approach: 1. It doesn't support a variable number of shapes per document. (LatLonType doesn't either, by the way.) 2. The use of function queries on CenterX, CenterY, HalfWidth, and HalfHeight means that all these values (just the distinct ones) will be put into RAM in Lucene's FieldCache. Not a big deal, but something to be noted. 3. The function query is going to be evaluated on every document matching the keyword search. That will probably perform okay; I'm not so sure for large indexes with a *:* query. Again, nice job. Could you please share an example of your ranking query? ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
Reject URL requests unless from localhost for dataimport
Hi all, My Solr server is currently set up at www.mysite.com:8983/solr. I would like to keep this for the time being, but I would like to restrict users from going to www.mysite.com:8983/solr/dataimport. In that case, I would only want to be able to do localhost:8983/solr/dataimport. Is this possible? If so, where should I look for a guide? Thanks, Brian Lamb
Re: Reject URL requests unless from localhost for dataimport
Firewall? It's easy to set up and the most low-level option. You can also use a proxy, or perhaps manage it in your servlet container. Hi all, My Solr server is currently set up at www.mysite.com:8983/solr. I would like to keep this for the time being, but I would like to restrict users from going to www.mysite.com:8983/solr/dataimport. In that case, I would only want to be able to do localhost:8983/solr/dataimport. Is this possible? If so, where should I look for a guide? Thanks, Brian Lamb
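A sketch of the firewall route on Linux (the iptables rules are illustrative; note a firewall works at the port level, so this locks down all of Solr on 8983 from non-local clients - path-level filtering of just /dataimport needs a reverse proxy or a servlet-container security constraint):

    # allow port 8983 from localhost only, drop everything else
    iptables -A INPUT -p tcp --dport 8983 -s 127.0.0.1 -j ACCEPT
    iptables -A INPUT -p tcp --dport 8983 -j DROP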
Solr integration with Oracle Coherence caching
Is it possible? If so, then how? Any steps would be good! By the way, I have Java versions of both available for integration; I just need to push the plug in!