Re: Faceting Question
Seems like pivot faceting is what you're looking for ( http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting ). Note: it currently does not work in distributed mode - see https://issues.apache.org/jira/browse/SOLR-2894

On Thu, Nov 15, 2012 at 7:46 AM, Jamie Johnson jej2...@gmail.com wrote:

Sorry, some more info. I have a field to store the source and another for the date. I currently use faceting to get a temporal distribution across all sources. What is the best way to get a temporal distribution per source? Is the only option to execute one query for the list of sources and then another query for each source?

On Wednesday, November 14, 2012, Jamie Johnson jej2...@gmail.com wrote:

I've recently been asked to display a temporal facet broken down by source, so source 1 has the following temporal distribution, source 2 has the following temporal distribution, etc. I was wondering what the best way to accomplish this is. My current thought is that I'd need to execute a completely separate query for each - is this right? Could field aliasing somehow be used to execute this in a single request to Solr? Any thoughts would really be appreciated.
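For reference, a pivot facet request for this case could look like the following (the source and date field names are assumptions - substitute whatever your schema defines). A single request returns nested date counts per source:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.pivot=source,date

Each bucket in the facet_pivot section of the response holds one source value with its own nested list of date values and counts.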
Re: Faceting Facets
http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting

On Mon, Sep 3, 2012 at 6:38 PM, Dotan Cohen dotanco...@gmail.com wrote:

Is there any way to nest facet searches in Solr? Specifically, I have a User field and a DateTime field. I need to know how many documents match each User for each one-hour period in the past 24 hours. That is, 16 Users * 24 time periods = 384 values to return. I could run 16 queries and facet on DateTime, or 24 queries and facet on User. However, if there is a way to facet the facets, then I would love to know. Thanks!

-- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Query Time problem on Big Index Solr 3.5
1. Use filter queries

> Here is an example query - is there anything incorrect, or anything I can change? http://xxx:8893/solr/candidate/select/?q=+(IdCandidateStatus:2)+(IdCobranded:3)+(IdLocation1:12))+(LastLoginDate:[2011-08-26T00:00:00Z TO 2012-08-28T00:00:00Z])

What is the logic here? Are you AND-ing these boolean clauses? If yes, then I would change the queries to

http://xxx:8893/solr/candidate/select/?q=*:*&fq=IdCandidateStatus:2&fq=IdCobranded:3&fq=IdLocation1:12&fq=LastLoginDate:[2011-08-26T00:00:00Z TO 2012-08-28T00:00:00Z]

i.e. move the clauses into fq (filter query) parameters:
* It should be faster, as it seems you don't need scoring here. Sort by id/date instead.
* The fq's are cached separately, thus increasing the cache hit rate.

2. Do not optimize your index

> I have a master and 6 slaves; they are synchronized every 10 minutes. And the index is always optimized.

DO NOT optimize your index (unless you re-create the whole index completely every 10 minutes). It basically kills the idea of replication: after every optimize command the slaves download the whole index.
Re: Java class [B has no public instance field or method named split.
http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5

On Sat, Sep 1, 2012 at 2:17 AM, Cirelli, Stephen J. stephen.j.cire...@saic.com wrote:

Anyone know why I'm getting this exception? I'm following the example here http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer but I get the error below. The field type in my schema.xml is string; text doesn't work either. Why would I get an error that there's no split method on a string?

Caused by: sun.org.mozilla.javascript.internal.EvaluatorException: Java class [B has no public instance field or method named split. (Unknown source#52)

Here's the JS:

function parseAttachments(row) {
  var mainDelim = '(|)',
      subDelim = '-|-',
      attRow = [ // This must be in the order it was concatenated in the query.
        { index:0, field:'attachmentFileName', arr: new java.util.ArrayList() },
        { index:1, field:'attachmentSize', arr: new java.util.ArrayList() },
        { index:2, field:'attachmentMIMEType', arr: new java.util.ArrayList() },
        { index:3, field:'attachmentExtractedText', arr: new java.util.ArrayList() },
        { index:4, field:'attachmentLink', arr: new java.util.ArrayList() }
      ];
  var allAttachments = row.get('attachments').split(mainDelim);
  for (var i=0, l=allAttachments.length; i<l; i++) {
    var attachment = allAttachments[i].split(subDelim);
    for (var j=0, jl=attRow.length; j<jl; j++) {
      var itm = attachment[j], arr = attRow[j].arr;
      arr.add(itm);
    }
  }
  for (var j=0, jl=attRow.length; j<jl; j++) {
    var itm = attRow[j];
    row.put(itm.field, itm.arr);
  }
  row.remove('attachments');
  return row;
}
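If the linked FAQ entry applies, the usual fix is to tell the JDBC data source to convert column types, so the blob column arrives in the script as a String instead of a byte array ([B). A minimal sketch - the driver/url/credentials are placeholders for your own settings:

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/mydb" user="user" password="pass"
            convertType="true"/>

With convertType="true", row.get('attachments') should return a String, which does have a split method.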
Re: Sharing and performance testing question.
> Any tips on load testing Solr? Ideally we would like caching to not affect the results as much as possible.

1. The Siege tool. This is probably the simplest option. You can generate a urls.txt file and pass it to the tool. You should also capture server performance (CPU, memory, qps, etc.) using tools like New Relic, Zabbix, etc.

2. SolrMeter: http://code.google.com/p/solrmeter/

3. The Solr benchmark module (not committed yet). It lets you run complex benchmarks using different algorithms:
* https://issues.apache.org/jira/browse/SOLR-2646
* http://searchhub.org/dev/2011/07/11/benchmarking-the-new-solr-near-realtime-improvements/
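For option 1, a typical siege invocation might look like this (the file name and numbers are just an assumption - tune concurrency and duration to your setup):

siege -f urls.txt -c 25 -t 10M -i

-f points at the file with one query URL per line, -c sets the number of concurrent users, -t the duration (10 minutes here), and -i picks URLs from the file at random, which helps keep cache effects down.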
Re: Injest pauses
Hey Brad,

> This leads me to believe that a single merge thread is blocking indexing from occurring. When this happens our producers, which distribute their updates amongst all the shards, pile up on this shard and wait.

Which version of Solr are you using? Have you tried the 4.0 beta?
* http://searchhub.org/dev/2011/04/09/solr-dev-diary-solr-and-near-real-time-search/
* https://issues.apache.org/jira/browse/SOLR-2565

Alexey
Re: LateBinding
http://searchhub.org/dev/2012/02/22/custom-security-filtering-in-solr/ - see the section about PostFilter.

On Wed, Aug 29, 2012 at 4:43 PM, johannes.schwendin...@blum.com wrote:

Hello, has anyone ever implemented the security feature called late binding? I am trying this, but I am very new to Solr and would be very glad to get some hints. Regards, Johannes
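In outline, a post filter is a query that is not cached as a bitset but instead gets to veto each matching document at collection time - which is what late binding needs, since ACLs are checked per user at query time. A minimal sketch (isAuthorized is a placeholder for your own ACL lookup, not part of Solr):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class LateBindingFilter extends ExtendedQueryBase implements PostFilter {
  @Override
  public boolean getCache() { return false; } // results depend on the current user - never cache

  @Override
  public int getCost() { return Math.max(super.getCost(), 100); } // cost >= 100 marks it as a post filter

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        if (isAuthorized(doc)) super.collect(doc); // pass through only permitted docs
      }
    };
  }

  private boolean isAuthorized(int doc) {
    return true; // placeholder: call your security service / ACL cache here
  }
}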
Re: Injest pauses
Could you take a jstack dump when it's happening and post it here?

> Interestingly it is not pausing during every commit, so at least a portion of the time the async commit code is working. Trying to track down the case where a wait would still be issued.

-----Original Message-----
From: Voth, Brad (GE Corporate)
Sent: Wednesday, August 29, 2012 12:32 PM
To: solr-user@lucene.apache.org
Subject: RE: Injest pauses

Thanks, I'll continue with my testing and tracking down the block.

-----Original Message-----
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Wednesday, August 29, 2012 12:28 PM
To: solr-user@lucene.apache.org
Subject: Re: Injest pauses

On Wed, Aug 29, 2012 at 11:58 AM, Voth, Brad (GE Corporate) brad.v...@ge.com wrote:

Anyone know the actual status of SOLR-2565? It looks to be marked as resolved in 4.*, but I am still seeing long pauses during commits using 4.*

SOLR-2565 is definitely committed - adds are no longer blocked by commits (at least at the Solr level).

-Yonik http://lucidworks.com
Re: Indexing and querying BLOBS stored in Mysql
I would recommend creating a simple data import config to test Tika parsing of the large BLOBs, i.e. remove unrelated entities, remove all the configuration for delta imports, and keep just the entity that retrieves the blobs and the entity that parses the binary content (fieldReader/TikaEntityProcessor). Some comments:

1. Maybe you are running a delta import and there are no new records in the database?

2. deltaQuery should only return ids and no other columns/data, because you don't use them in deltaImportQuery (see dataimporter.delta.id).

3. Not all entities have HTMLStripTransformer in their transformer list, yet they use stripHTML in fields. TemplateTransformer is declared but not used at all.

<entity name="aitiologikes_ektheseis" dataSource="db" transformer="HTMLStripTransformer"
        query="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(body,' ',title) AS content from aitiologikes_ektheseis where type = 'text'"
        deltaImportQuery="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(body,' ',title) AS content from aitiologikes_ektheseis where type = 'text' and id='${dataimporter.delta.id}'"
        deltaQuery="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT(body,' ',title) AS content from aitiologikes_ektheseis where type = 'text' and last_modified > '${dataimporter.last_index_time}'">
  <field column="id" name="ida" />
  <field column="solr_id" name="solr_id" />
  <field column="title" name="title" stripHTML="true" />
  <field column="grid_title" name="grid_title" stripHTML="true" />
  <field column="model" name="model" stripHTML="true" />
  <field column="type" name="type" stripHTML="true" />
  <field column="url" name="url" stripHTML="true" />
  <field column="last_modified" name="last_modified" stripHTML="true" />
  <field column="search_tag" name="search_tag" stripHTML="true" />
  <field column="content" name="content" stripHTML="true" />
</entity>

<entity name="aitiologikes_ektheseis_bin" transformer="TemplateTransformer" dataSource="db"
        query="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS text from aitiologikes_ektheseis where type = 'bin'"
        deltaImportQuery="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS text from aitiologikes_ektheseis where type = 'bin' and id='${dataimporter.delta.id}'"
        deltaQuery="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS text from aitiologikes_ektheseis where type = 'bin' and last_modified > '${dataimporter.last_index_time}'">
  <field column="id" name="ida" />
  <field column="solr_id" name="solr_id" />
  <field column="title" name="title" stripHTML="true" />
  <field column="grid_title" name="grid_title" stripHTML="true" />
  <field column="model" name="model" stripHTML="true" />
  <field column="type" name="type" stripHTML="true" />
  <field column="url" name="url" stripHTML="true" />
  <field column="last_modified" name="last_modified" stripHTML="true" />
  <field column="search_tag" name="search_tag" stripHTML="true" />
  <entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="aitiologikes_ektheseis_bin.text" format="text">
    <field column="text" name="contentbin" stripHTML="true" />
  </entity>
</entity>
...
</document>
</dataConfig>

*A portion from schema.xml (the fieldType and field definitions):*

<fieldType name="text_ktimatologio" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
    <filter class="solr.GreekLowerCaseFilterFactory"/>
    <filter class="solr.GreekStemFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter
Re: MySQL Exception: Communications link failure WITH DataImportHandler
My memory is vague, but I think I've seen something similar with older versions of Solr. Is it possible that you have a large database import and a big segment merge happens in the middle, blocking the DIH indexing process (and the reading of records from the database as well)? That would mean long inactivity in the communication with the db server, and a timeout as a result. If this is the case, then you can either increase the timeout limit on the db server (I don't remember the actual parameter) or upgrade Solr to a newer version that doesn't have such long pauses (4.0 beta?).

On Thu, Aug 16, 2012 at 12:37 PM, Jienan Duan jnd...@gmail.com wrote:

Hi all: I have resolved this problem by configuring a JNDI datasource in Tomcat. But I still want to find out why it throws an exception in DIH when I configure the datasource in data-configure.xml rather than as a JNDI resource. Regards.

2012/8/16 Jienan Duan jnd...@gmail.com

Hi all: I'm using DataImportHandler to load data from MySQL. It works fine on my development machine and in the online environment. But I get an exception in the test environment:

Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:406)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1074)
at com.mysql.jdbc.MysqlIO.init(MysqlIO.java:343)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2132)
... 26 more
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at java.net.Socket.connect(Socket.java:478)
at java.net.Socket.init(Socket.java:375)
at java.net.Socket.init(Socket.java:218)
at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:253)
at com.mysql.jdbc.MysqlIO.init(MysqlIO.java:292)
... 27 more

This makes me confused, because the test env and online env are almost the same: Tomcat runs on a Linux server with JDK 6, MySQL 5 runs on another. I even wrote a simple JDBC test class and it works; a JSP file with JDBC code also works. Only DataImportHandler fails. I've been reading the Solr source code and found that Solr seems to have its own ClassLoader. I'm not sure if it goes wrong with Tomcat on some specific configuration. Does anyone know how to fix this problem? Thank you very much. Best Regards. Jienan Duan

-- 不走弯路,就是捷径。 ("Not taking detours is the shortcut.") http://www.jnan.org/
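If it turns out to be the server-side timeout, on MySQL the relevant variables are most likely wait_timeout and net_write_timeout (an assumption - verify against your server version). For example:

-- inspect the current values
SHOW VARIABLES LIKE '%timeout%';
-- raise the limits for long-running imports (values are in seconds)
SET GLOBAL net_write_timeout = 600;
SET GLOBAL wait_timeout = 28800;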
Re: Custom Geocoder with Solr and Autosuggest
> My first decision was to divide Solr into two cores, since I am already using Solr as my search server. One core would be for the main search of the site and one for the geocoding.

Correct. And you can even use that location index/collection to extract locations from non-structured documents - i.e. if you don't have a separate field with geographical names in your corpus (or the location data is just not good enough compared to what can be mined from the documents).

> My second decision is to store the name data in a normalised state; some examples are shown below: London, England / England / Swindon, Wiltshire, England

Yes, and you can add postcodes/outcodes there as well. I would also add an additional type field (region/county/town/postcode/outcode).

> The third decision was to return "autosuggest" results; for example, when the user types "Lond" I would like to suggest "London, England". For this to work I think it makes sense to return up to 5 results via JSON, based on relevancy, and have these displayed under the search box.

Yeah, and you might want to boost cities more than towns (I'm sure there are plenty of ambiguous names), use some kind of GeoIP service, additional scoring factors, etc.

> My fourth decision is that when the user actually hits the "search" button on the location field, Solr is queried again and returns the most relevant result, including the co-ordinates which are stored.

You can also have special logic to decide whether you want to use spatial search at all, or whether a simple textual match would be better. I.e. you have England in your example: it doesn't sound practical to return coordinates and use spatial search for that case, right?

HTH, Alexey
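As an illustration, the autosuggest call from the GUI could be as simple as the following (the core and field names here are assumptions - e.g. name_prefix being an EdgeNGram-analyzed copy of the normalised name):

http://localhost:8983/solr/locations/select?q=name_prefix:Lond&rows=5&wt=json&fl=name,type,coords

name is the display string ("London, England"), type the region/county/town/postcode class used for boosting, and coords the stored lat/long to use once the user hits search.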
Re: Solr Index linear growth - Performance degradation.
> 10K queries

How do you generate these queries? I.e. is this a single- or multi-threaded application? Can you provide the full queries you send to the Solr servers and the solrconfig request handler configuration? Do you use function queries, grouping, faceting, etc.?

On Tue, Aug 14, 2012 at 10:31 AM, feroz_kh feroz.kh2...@gmail.com wrote:

It's 7,200,000 hits == the number of documents found by all 10K queries. We have the RHEL Tikanga version.
Re: Running out of memory
> It would be vastly preferable if Solr could just exit when it gets a memory error, because we have it running under daemontools, and that would cause an automatic restart.

-XX:OnOutOfMemoryError="cmd args; cmd args"
Run user-defined commands when an OutOfMemoryError is first thrown.

> Does Solr require the entire index to fit in memory at all times?

No. But it's hard to say much about your particular problem without additional information. How often do you commit? Do you use faceting? Do you sort by Solr fields, and if yes, what are those fields? You should also check the caches.
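For the daemontools setup described above, something like this should do (kill is just one possible command; %p expands to the JVM's own pid):

java -XX:OnOutOfMemoryError="kill -9 %p" -jar start.jar

The JVM dies on the first OutOfMemoryError and daemontools restarts the process automatically.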
Re: Is this too much time for full Data Import?
9M docs * 15 queries - that's a lot of queries (~400 QPS). I would try to reduce the number of queries:

1. Rewrite your main (root) query to select all possible data:
* use SQL joins instead of DIH nested entities;
* select the data from the 1-N related tables (tags, authors, etc.) in the main query using the GROUP_CONCAT aggregate function (that's MySQL-specific, but there are similar functions for other RDBMS-es) and then split the concatenated data in a DIH transformer - see the sketch at the end of this message.

2. Identify small tables in nested entities and cache them completely with CachedSqlEntityProcessor.

On Wed, Aug 8, 2012 at 10:35 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Hello, does your indexer utilize CPU/IO? Check it with iostat/vmstat. If it doesn't, take several thread dumps with the jvisualvm sampler or jstack, and try to understand what blocks your threads from making progress. It might be that you need to speed up your SQL data consumption; to do this, you can enable threads in DIH (only in 3.6.1) or move from N+1 SQL queries to a select-all/cache approach: http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor and https://issues.apache.org/jira/browse/SOLR-2382 Good luck

On Wed, Aug 8, 2012 at 9:16 AM, Pranav Prakash pra...@gmail.com wrote:

Folks, my full data import takes ~80 hrs. It has around ~9M documents and ~15 SQL queries for each document. The database servers are different from the Solr servers. Each document goes through an update processor chain which (a) calculates a signature of the document using SignatureUpdateProcessorFactory and (b) finds terms with term frequency > 2, using a custom processor. The index size is ~480 GiB. I want to know if the amount of time taken is too large compared to the document count. How do I benchmark the stats, and what are some of the ways I can improve this? I believe there are some optimizations I could do at the update processor level as well. What would be a good way to get dirty on this?

*Pranav Prakash* temet nosce

-- Sincerely yours, Mikhail Khludnev, Tech Lead, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
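To illustrate point 1, the GROUP_CONCAT idea looks roughly like this (all table and column names here are invented - adapt to your schema):

SELECT d.id, d.title,
       GROUP_CONCAT(DISTINCT t.name SEPARATOR '|') AS tags,
       GROUP_CONCAT(DISTINCT a.name SEPARATOR '|') AS authors
FROM document d
LEFT JOIN document_tag t ON t.doc_id = d.id
LEFT JOIN document_author a ON a.doc_id = d.id
GROUP BY d.id, d.title;

This yields one result-set row per document instead of one extra query per document per related table; the tags/authors strings are then split back into multivalued fields in a transformer.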
Re: Large RDBMS dataset
> The problem is that for each record in fd, Solr makes three distinct SELECTs on the other three tables. Of course, this is absolutely inefficient.

You can also try to use GROUP_CONCAT (it's a MySQL function, but maybe there's something similar in MS SQL) to select all the nested 1-N entities in a single result set, as strings joined with some separator, and then split them into multivalued fields in a post-processing phase (using the RegexTransformer or similar) - see the sketch below.
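A sketch of the splitting side in DIH (entity and column names are assumptions; splitBy comes from RegexTransformer):

<entity name="fd" transformer="RegexTransformer"
        query="SELECT f.id, GROUP_CONCAT(t.tag SEPARATOR '|') AS tags FROM fd f LEFT JOIN tags t ON t.fd_id = f.id GROUP BY f.id">
  <field column="tags" splitBy="\|"/>
</entity>

Each concatenated string is split on the separator and lands in a multivalued Solr field.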
Re: a question on jmx solr exposure
Which Solr version do you use? Maybe it has something to do with the default collection? I see a separate jmx domain for every collection, i.e.

solr/collection1
solr/collection2
solr/collection3
...

On Wed, Dec 21, 2011 at 1:56 PM, Dmitry Kan dmitry@gmail.com wrote:

Hello list, this might not be the right place to ask jmx-specific questions, but I decided to try, as we are polling Solr statistics through jmx. We currently have two Solr cores with different schemas, A and B, run under the same Tomcat instance. The question is: which stats is jconsole going to see under solr/? Judging from the numbers (e.g. numDocs of the searcher), jconsole sees the stats of A. Where do the stats of B go? Or will the first activated core capture the jmx pipe and not let B's stats through?

-- Regards, Dmitry Kan
Re: Decimal Mapping problem
Try casting the MySQL decimal data type to a string, i.e.

CAST( IF(drt.discount IS NULL,'0',(drt.discount/100)) AS CHAR ) as discount

(or CAST ... AS TEXT).

On Mon, Dec 19, 2011 at 1:24 PM, Niels Stevens ni...@kabisa.nl wrote:

Hey everybody, I'm having an issue importing decimal numbers from my MySQL DB into Solr. Is there anybody with some advice? I'll try to explain my problem. According to my findings, I think the lack of an explicit mapping for the decimal type in schema.xml is causing the issues I'm experiencing. The decimal numbers I'm trying to import look like this:

0.075000
7.50
2.25

but after the import the equivalent Solr field contains values like these:

[B@1413d20
[B@11c86ff
[B@1e2fd0d

The import statement for this particular field looks like: IF(drt.discount IS NULL,'0',(drt.discount/100)). I thought that using MySQL's ROUND function to keep 3 digits after the dot, in conjunction with an explicit field mapping in schema.xml, could solve this issue. Is there someone who has had similar problems with decimal fields, or anybody with an expert view on this? Thanks a lot in advance. Regards, Niels Stevens
Re: Solr 3.3: DIH configuration for Oracle
Why do you need to collect both primary keys T1_ID_RECORD and T2_ID_RECORD in your delta query? Isn't the T2_ID_RECORD primary key value enough to get all data from both tables? (You have a 1-N relation between table1 and table2, right?)

On Thu, Aug 11, 2011 at 12:52 AM, Eugeny Balakhonov c0f...@gmail.com wrote:

Hello, all! I want to create a good DIH configuration for my Oracle database with delta support. Unfortunately I am not able to do it well, as DIH has some strange restrictions. I want to explain the problem with a simple example; in reality my database has a very complex structure. Initial conditions: two tables with the following simple structure:

Table1
- ID_RECORD (primary key)
- DATA_FIELD1
- ..
- DATA_FIELD2
- LAST_CHANGE_TIME

Table2
- ID_RECORD (primary key)
- PARENT_ID_RECORD (foreign key to Table1.ID_RECORD)
- DATA_FIELD1
- ..
- DATA_FIELD2
- LAST_CHANGE_TIME

For performance reasons it is necessary to select from these tables with one request (via an inner join). My db-data-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource jndiName="jdbc/DB1" type="JdbcDataSource" user="" password=""/>
  <document>
    <entity name="ent" pk="T1_ID_RECORD, T2_ID_RECORD"
      query="select * from TABLE1 t1 inner join TABLE2 t2 on t1.ID_RECORD = t2.PARENT_ID_RECORD"
      deltaQuery="select t1.ID_RECORD T1_ID_RECORD, t1.ID_RECORD T2_ID_RECORD from TABLE1 t1 inner join TABLE2 t2 on t1.ID_RECORD = t2.PARENT_ID_RECORD where TABLE1.LAST_CHANGE_TIME > to_date('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS') or TABLE2.LAST_CHANGE_TIME > to_date('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')"
      deltaImportQuery="select * from TABLE1 t1 inner join TABLE2 t2 on t1.ID_RECORD = t2.PARENT_ID_RECORD where t1.ID_RECORD = ${dataimporter.delta.T1_ID_RECORD} and t2.ID_RECORD = ${dataimporter.delta.T2_ID_RECORD}"/>
  </document>
</dataConfig>

As a result I have the following error:

java.lang.IllegalArgumentException: deltaQuery has no column to resolve to declared primary key pk='T1_ID_RECORD, T2_ID_RECORD'

I have analyzed the DIH source code. I found that in the DocBuilder class the collectDelta() method works with the value of the entity attribute pk as a simple string. But in my case this is an array with two values: T1_ID_RECORD, T2_ID_RECORD. What do I do wrong? Thanks, Eugeny
Re: Weird issue with solr and jconsole/jmx
I just encountered the same bug - JMX-registered beans don't survive Solr core reloads. I believe the reason is that when you do a core reload:
* when the new core is created, it overwrites/over-registers the beans in the registry (in the MBean server);
* when the new core is ready, in the core-register phase CoreContainer closes the old core, which results in the JMX beans being unregistered.

As a result, after a core reload there's only one bean left in the registry: id=org.apache.solr.search.SolrIndexSearcher,type=Searcher@33099cc main. That's because this is the only new (dynamically named) bean that is created by the new core and not un-registered by oldCore.close(). I'll try to reproduce this in a test and file a bug in Jira.

On Tue, Mar 16, 2010 at 4:25 AM, Andrew Greenburg agreenb...@gmail.com wrote:

On Tue, Mar 9, 2010 at 7:44 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: I connected to one of my solr instances with Jconsole today and
: noticed that most of the mbeans under the solr hierarchy are missing.
: The only thing there was a Searcher, which I had no trouble seeing
: attributes for, but the rest of the statistics beans were missing.
: They all show up just fine on the stats.jsp page.
:
: In the past this always worked fine. I did have the core reload due to
: config file changes this morning. Could that have caused this?

possibly... reloading the core actually causes a whole new SolrCore object (with its own registry of SolrInfoMBeans) to be created and then swapped in place of the previous core ... so perhaps you are still looking at the stats of the old core which is no longer in use (and hasn't been garbage collected because the JMX manager still had a reference to it for you? ... I'm guessing at this point). Did disconnecting from jconsole and reconnecting show you the correct stats?

Disconnecting and reconnecting didn't help. The queryCache and documentCache and some others started showing up after I did a commit and opened a new searcher, but the whole tree never did fill in. I'm guessing that the request handler stats stayed associated with the old, no longer visible core in JMX, since new instances weren't created when the core reloaded. Does that make sense? The stats on the web stats page continued to be fresh.
Re: Solr and Tag Cloud
Suppose you have a multivalued field "tag" attached to every document in your corpus. Then you can build a tag cloud relevant to the whole data set, or to a specific query, by retrieving facets on the tag field for *:* or any other query. You'll get a list of popular tag values relevant to this query, with occurrence counts. If you want to build a tag cloud over generic analyzed text fields, you can still do it the same way, but note that you may hit performance/memory problems if you have a significant data set and huge text fields. You should probably use stop words to filter out popular generic terms.

On Sat, Jun 18, 2011 at 8:12 AM, Jamie Johnson jej2...@gmail.com wrote:

Does anyone have details of how to generate a tag cloud of popular terms across an entire data set and then also across a query?
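A sketch of the facet request behind this (the tag field name is an assumption):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=tag&facet.limit=100&facet.mincount=2

Replace q=*:* with any user query to get the per-query cloud; the returned counts drive the tag font sizes.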
Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)
> Do you mean that we keep the current index as it is and have a separate core which has only the user-id/product-id relation, and while querying, do a join between the two cores based on the user-id?

Exactly. You can index the user-id/product-id relation either into the same core or into a different core on the same Solr instance.

> This would involve us indexing/deleting the product as and when the user subscription for a product changes. This would involve some amount of latency if the indexing (we have a queue system for indexing across the various instances) or deletion is delayed.

Right, but I'm not sure it's possible to achieve good performance while requiring zero latency.

> If we want to go ahead with this solution: we are currently using Solr 1.3, so is this functionality available as a patch for Solr 1.3?

No. AFAIK it's in trunk only.

> Would it be possible to do this with a separate index instead of a core? Then I could create only one index common to all our instances and use that instance to do the join.

No, I don't think that's possible with the join feature. I guess that would require a network request per search request, and the number of mapped ids could be huge, so it could affect performance significantly.

> You'll need to be a bit careful using joins, as the performance hit can be significant if you have lots of cross-referencing to do, which I believe you would, given your scenario.

As far as I understand, the join query builds a bitset filter which can be cached in the filterCache, etc. The only performance impact I can think of is that the user-product relations table could be too big to fit into a single instance.
Re: Complex situation
Am I right that you are only interested in results/facets for the current season? If so, then you can index the start/end dates as separate number fields and build your search filters like this:

fq=+start_date_month:[* TO 6] +start_date_day:[* TO 17] +end_date_month:[* TO 6] +end_date_day:[16 TO *]

where 6/16 is the current month/day.

On Thu, Jun 16, 2011 at 5:20 PM, roySolr royrutten1...@gmail.com wrote:

Hello, first I will try to explain the situation: I have some companies with opening hours. Some companies have multiple seasons with different opening hours. I will show some example data:

Companyid  Startdate(d-m)  Enddate(d-m)  Openinghours_end
1          01-01           01-04         17:00
1          01-04           01-08         18:00
1          01-08           31-12         17:30
2          01-01           31-12         20:00
3          01-01           01-06         17:00
3          01-06           31-12         18:00

What I want is some facets on the left side of my page. They have to look like this:

Closing today at: 17:00 (23), 18:00 (2), 20:00 (1)

So I need NOW to know which opening hours (seasons) I need in my facet results. How should my index look? Can anybody help me with how to store this data in the Solr index?
Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)
> So a search for a product, once the user logs in, should cover only the products that he has access to. It will translate to something like this (the product ids are obtained from the db for a particular user and can run into n values):

> search term&fq=product_id:(100 10001 ... n)

> But we are currently running into a "too many boolean clauses" error. We are not able to tie users to roles either, as each user is essentially anyone who comes to the site and purchases a product.

I'm wondering if the new trunk Solr join functionality can help here:
* http://wiki.apache.org/solr/Join

In theory you can index your products (product_id, ...) and the user-product many-to-many relation (user_product_id, user_id) into the same/different cores and then do a join, like

q=search terms&fq={!join from=product_id to=user_product_id}user_id:10101

But I haven't tried that, so I'm just speculating.
Re: Strange behavior
Have you stopped Solr before manually copying the data? That way you can be sure that the index is the same and you didn't have any new docs in flight.

2011/6/14 Denis Kuzmenok forward...@ukr.net:

What should I provide? The OS is the same, the environment is the same, Solr is completely copied, searches work - except that one, and that is strange..

> I think you will need to provide more information than this, no-one on this list is omniscient AFAIK. François

On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote:

Hi. I've debugged search on the test machine. After copying the entire Solr directory to the production server, I've noticed that one query (SDR S70EE K) does match on the test server, and does not on production. How can that be?
Re: Updating only one indexed field for all documents quickly.
> ... with the integer field.

If you just want to influence the score, then plain external file fields should work for you.

> Is this an appropriate solution, given our use case?

Yes, check out ExternalFileField:
* http://search.lucidimagination.com/search/document/CDRG_ch04_4.4.4
* http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
* http://www.slideshare.net/greggdonovan/solr-lucene-etsy-by-gregg-donovan/28
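A minimal schema.xml sketch (field and file names are assumptions):

<fieldType name="externalPopularity" class="solr.ExternalFileField" keyField="id" defVal="0" valType="pfloat"/>
<field name="popularity" type="externalPopularity"/>

The values live in a file like external_popularity in the index data directory, one key=value pair per line:

doc1=3.5
doc2=1.0

The field can then be used in function queries (e.g. a dismax bf=popularity) to influence the score, and the file can be swapped out without reindexing.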
Re: URGENT HELP: Improving Solr indexing time
> <str name="Total Requests made to DataSource">16276</str>
> ... so I am doing a delta import of around 500,000 rows at a time.

http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
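The approach behind that link, roughly: run the delta as a plain full-import with a single parameterized query, instead of one deltaImportQuery per changed row (the table/column names below are assumptions):

<entity name="item" pk="id"
        query="select * from item
               where '${dataimporter.request.clean}' != 'false'
               or last_modified &gt; '${dataimporter.last_index_time}'"/>

Trigger it with command=full-import&clean=false, and one SQL statement covers all changed rows.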
Re: Need query help
See the Tagging and excluding Filters section:
* http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters

2011/6/6 Denis Kuzmenok forward...@ukr.net:

For now I have a collection with: id (int), price (double) multivalued, brand_id (int), filters (string) multivalued. I need to get the available brand_id, filters and price values, plus the list of ids, for the current query. For example, now I'm doing queries with facet.field=brand_id/filters/price:

1) to get the current id list: (brand_id:100 OR brand_id:150) AND (filters:p1s100 OR filters:p4s20)
2) to get the available filters for the selected properties (same properties but other values): (brand_id:100 OR brand_id:150) AND (filters:p1s* OR filters:p4s*)
3) to get the available brand_id (if any are selected; if none, take them from the 1st query's results): (filters:p1s100 OR filters:p4s20)
4) another request to get the available prices, if any are selected

Is there any way to simplify this task? Data needed:
1) ids for the selected filters, price, brand_id
2) available filters, price, brand_id for the selected values
3) other values for the selected properties (if any are chosen)
4) other brand_id values besides the selected brand_id
5) other prices besides the selected price

Will appreciate any help or thoughts! Cheers, Denis Kuzmenok
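With tagging/excluding, queries 1-4 collapse into a single request along these lines (filter values taken from the question above):

q=*:*
&fq={!tag=br}brand_id:(100 OR 150)
&fq={!tag=flt}filters:(p1s100 OR p4s20)
&facet=true
&facet.field={!ex=br}brand_id
&facet.field={!ex=flt}filters
&facet.field={!ex=br,flt}price

Each facet.field excludes the filter(s) on its own field, so the counts show the alternatives that would be available if that filter were lifted.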
Re: Solr memory consumption
> Commits are divided into 2 groups: often but small (last changed info), ...

1) Make sure the commits aren't too frequent and that you don't have a commit-overlapping problem: http://wiki.apache.org/solr/FAQ#What_does_.22PERFORMANCE_WARNING:_Overlapping_onDeckSearchers.3DX.22_mean_in_my_logs.3F
2) You may also try to limit the cache sizes and check if that helps.
3) If it doesn't help, then try to monitor your app using jconsole:
* trigger the garbage collector and see if it frees some memory;
* browse the Solr JMX attributes and see if there are any hints regarding Solr cache usage, etc.
4) Try running jmap -heap <pid> and jmap -histo <pid> and see if there are any hints there.
5) If none of the above helps, then you probably need to examine your memory usage with a Java profiler (like the YourKit profiler).

> Size: 4 databases about 1G (sum), 1 database (with n-grams) of 21G.. I don't know any other way to search for product names except n-grams =\

Isn't a standard text field with solr.WordDelimiterFilterFactory and generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" during indexing good enough? You might also want to limit the min and max n-gram sizes, just to reduce your index size.
Re: Solr memory consumption
Hey Denis,

* How big is your index in terms of number of documents and index size?
* Is it a production system with many search requests?
* Is there any pattern to the OOM errors? I.e. right after you start your Solr app, after some search activity, after specific Solr queries, etc.?
* What are 1) your cache settings, 2) facet and sort-by fields, 3) commit frequency and warmup queries, etc.?

Generally you might want to connect to your JVM using the jconsole tool and monitor your heap usage (and other JVM/Solr numbers):
* http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
* http://wiki.apache.org/solr/SolrJmx#Remote_Connection_to_Solr_JMX

HTH, Alexey

2011/6/1 Denis Kuzmenok forward...@ukr.net:

There were no parameters at all, and Java hit out-of-memory almost every day; then I tried to add parameters but nothing changed. Xms/Xmx did not solve the problem either. Now I'm trying MaxPermSize, because it's the last thing I haven't tried yet :(

Wednesday, June 1, 2011, 9:00:56 PM, you wrote:

Could be related to your crazy high MaxPermSize, like Marcus said. I'm no JVM tuning expert either. Few people are; it's confusing. So if you don't understand it either, why are you trying to throw in very non-standard parameters you don't understand? Just start with whatever the Solr example jetty has, and only change things if you have a reason to (that you understand).

On 6/1/2011 1:19 PM, Denis Kuzmenok wrote:

Overall memory on the server is 24G, plus 24G of swap. Most of the time the swap is free and not used at all; that's why "no free swap" sounds strange to me..
Re: DIH render html entities
Maybe HTMLStripTransformer is what you are looking for:
* http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer

On Tue, May 31, 2011 at 5:35 PM, Erick Erickson erickerick...@gmail.com wrote:

Convert them to what? Individual fields in your docs? Text? If the former, you might get some joy from the XPathEntityProcessor. If you want to just strip the markup and index all the content, you might get some joy from the various *html* analyzers listed here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Best, Erick

On Fri, May 27, 2011 at 5:19 AM, anass talby anass.ta...@gmail.com wrote:

Sorry, my question was not clear. When I get data from the database, some fields contain HTML special chars, and what I want to do is just convert them automatically.

On Fri, May 27, 2011 at 1:00 PM, Gora Mohanty g...@mimirtech.com wrote:

On Fri, May 27, 2011 at 3:50 PM, anass talby anass.ta...@gmail.com wrote:

Is there any way to render html entities in DIH for a specific field? [...]

This does not make too much sense: what do you mean by rendering HTML entities? DIH just indexes, so where would it render HTML to, even if it could? Please take a look at http://wiki.apache.org/solr/UsingMailingLists

Regards, Gora

-- Anass
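A sketch of how it plugs into a DIH entity (the entity/column names are assumptions):

<entity name="doc" transformer="HTMLStripTransformer" query="select id, body from docs">
  <field column="body" stripHTML="true"/>
</entity>

stripHTML="true" removes markup from the column's value before it is indexed; verify against your data that it also handles the particular entities you are seeing.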
Re: Better Spellcheck
> I've tried to use a spellcheck dictionary built from my own content, but my content ends up having a lot of misspelled words, so the spellcheck ends up being less than effective.

You can try the sp.dictionary.threshold parameter to solve this problem:
* http://wiki.apache.org/solr/SpellCheckerRequestHandler#sp.dictionary.threshold

> It also misses phrases. When someone searches for Untied States I would hope the spellcheck would suggest United States, but it just recognizes that untied is a valid word and doesn't suggest anything.

So you are really asking about an auto-suggest component, not spellcheck, right? These are two different use cases. If you want auto-suggest and you have some search logs for your system, then you can probably use the following solution:
* http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

If you don't have a significant search log history and want to populate your auto-suggest dictionary from the index or some text file, check:
* http://wiki.apache.org/solr/Suggester
Re: Documents update
> Will it be slow if there are 3-5 million key/value rows?

AFAIK it shouldn't affect search time significantly, as Solr caches the file in memory after you reload the Solr core / issue a commit. But obviously you need more memory, and the commit/reload will take more time.
Re: Indexing 20M documents from MySQL with DIH
{quote}
... Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
... 22 more
Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED' org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago. ...
{quote}

It could probably be because of autocommit / segment merging. You could try to disable autocommit / increase mergeFactor.

{quote}
I've used Sphinx in the past, which uses multiple queries to pull out a subset of records ranged on the primary key. Does Solr offer functionality similar to this? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection.
{quote}

I was thinking about a hackish solution to paginate the results:

<entity name="pages" query="SELECT id FROM generate_series( (SELECT count(*) FROM source_table) / 1000 )">
  <entity name="records" query="SELECT * FROM source_table LIMIT 1000 OFFSET ${pages.id}*1000"/>
</entity>

or something along those lines (you'd need to calculate the offset in the pages query). But unfortunately MySQL does not provide a generate_series function (it's a PostgreSQL function, though there are similar solutions for Oracle and MSSQL).

On Mon, Apr 25, 2011 at 3:59 AM, Scott Bigelow eph...@gmail.com wrote:

Thank you everyone for your help. I ended up getting the index to work using the exact same config file on a (substantially) larger instance.

On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson erickerick...@gmail.com wrote:

{{{A custom indexer, so that's a fairly common practice? So when you are dealing with these large indexes, do you try not to fully rebuild them when you can? It's not a nightly thing, but something to do in case of a disaster? Is there a difference in the performance of an index that was built all at once vs. one that has had delta inserts and updates applied over a period of months?}}}

Is it a common practice? Like all of this, it depends. It's certainly easier to let DIH do the work. Sometimes DIH doesn't have all the capabilities necessary. Or, as Chris said, in the case where you already have a system built up, it's easier to just grab the output from that and send it to Solr, perhaps with SolrJ, and not use DIH. Some people are just more comfortable with their own code...

Do you try not to fully rebuild? It depends on how painful a full rebuild is. Some people just like the simplicity of starting over every day/week/month. But you *have* to be able to rebuild your index in case of disaster, and a periodic full rebuild certainly keeps that process up to date.

Is there a difference... delta inserts... updates... applied over months? Not if you do an optimize. When a document is deleted (or updated), it's only marked as deleted. The associated data is still in the index. Optimize will reclaim that space and compact the segments, perhaps down to one. But there's no real operational difference between a newly-rebuilt index and one that's been optimized.
If you don't delete/update, there's not much reason to optimize, either. I'll leave the DIH questions to others.

Best, Erick

On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow eph...@gmail.com wrote:

Thanks for the e-mail. I probably should have provided more details, but I was more interested in making sure I was approaching the problem correctly (using DIH, with one big SELECT statement for millions of rows) instead of solving this specific problem. Here's a partial stack trace from this specific problem:

... Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
... 22 more
Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED' org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago. ...

A custom indexer, so that's a fairly common practice? So when you are dealing with these large indexes, do you try not to fully rebuild them when you can? It's not a
Re: Solr performance issue
> Btw, I am monitoring output via jconsole with 8GB of RAM and it still goes to 8GB every 20 seconds or so, GC runs, and it falls back down to 1GB.

Hmm, the JVM eating 8GB in 20 seconds sounds like a lot. Do you return all results (ids) for your queries? Any tricky faceting/sorting/function queries?
Re: Custom scoring for searhing geographic objects
Hi Pavel,

I had a similar problem several years ago - I had to find geographical locations in textual descriptions, geocode these objects to lat/long during the indexing process, and allow users to filter/sort search results to specific geographical areas. An important issue was that there were several types of geographical objects - street < town < region < country. The idea was to geocode to the most narrow geographical area possible. The relevance logic in this case could be stated as "find the most narrow result that is uniquely identified by your text or search query". So I came up with a custom algorithm that was quite good in terms of performance and precision/recall. Here's a simple description:

* Intersect all text/search-query terms with a locations dictionary to keep only the geo terms.
* Search your locations Lucene index and filter only street objects (the most narrow areas). Due to the tf*idf formula you'll get the most relevant results. Then post-process the top N (3/5/10) results and verify that they are matches indeed. I intersected the search terms with each result's terms and made another Lucene search to verify whether those terms uniquely identify the match. If they do, return the matching street. If there's no match, proceed with the same algorithm for towns, then regions, then countries.

HTH, Alexey

On Wed, Dec 15, 2010 at 6:28 PM, Pavel Minchenkov char...@gmail.com wrote:

Hi, please give me advice on how to create custom scoring. I need documents to be ordered depending on how popular each term in the document is (popular = how many times it appears in the index) and on the length of the document (fewer terms - higher in the search results). For example, the index contains the following data:

ID | SEARCH_FIELD
---|----------------------------------------
1  | Russia
2  | Russia, Moscow
3  | Russia, Volgograd
4  | Russia, Ivanovo
5  | Russia, Ivanovo, Altayskaya street 45
6  | Russia, Moscow, Kremlin
7  | Russia, Moscow, Altayskaya street
8  | Russia, Moscow, Altayskaya street 15
9  | Russia, Moscow, Altayskaya street 15/26

And I should get the following results:

Query      | Document result set
-----------|--------------------
Russia     | 1,2,4,3,6,7,8,9,5
Moscow     | 2,6,7,8,9
Ivanovo    | 4,5
Altayskaya | 7,8,9,5

In fact, it is a search for geographic objects (cities, streets, houses). At the same time, only part of the address may be given, and the most relevant results should appear first. Thanks.

-- Pavel Minchenkov
Re: Dataimport performance
> With subquery and with left join: 320k in 6 min 30 sec.

That's ~820 records per second, which is _really_ impressive considering the fact that DIH performs a separate SQL query for every record in your case.

> So there's one track entity with an artist sub-entity. My (admittedly rather limited) experience has been that sub-entities, where you have to run a separate query for every row in the parent entity, really slow down data import.

Sub-entities do slow down data import. You can avoid a separate query for every row by using CachedSqlEntityProcessor. There are a couple of options: 1) you can load all sub-entity data into memory, or 2) you can reduce the number of SQL queries by caching sub-entity data per id (see the sketch at the end of this message). There's no silver bullet, and each option has its own pros and cons. Also, Ephraim proposed a really neat solution with GROUP_CONCAT, but I'm not sure that all RDBMS-es support it.

2010/12/15 Robert Gründler rob...@dubture.com:

I've benchmarked the import already with 500k records, one time without the artists subquery, and one time without the join in the main query:

Without subquery: 500k in 3 min 30 sec
Without join and without subquery: 500k in 2 min 30 sec
With subquery and with left join: 320k in 6 min 30 sec

So the joins/subqueries are definitely a bottleneck. How exactly did you implement the custom data import? In our case, we need to de-normalize the relations of the SQL data for the index, so I fear I can't really get rid of the join/subquery.

-robert

On Dec 15, 2010, at 15:43, Tim Heckman wrote:

2010/12/15 Robert Gründler rob...@dubture.com:

The data-config.xml looks like this (only 1 entity):

<entity name="track" transformer="TemplateTransformer"
        query="select t.id as id, t.title as title, l.title as label from track t left join label l on (l.id = t.label_id) where t.deleted = 0">
  <field column="title" name="title_t" />
  <field column="label" name="label_t" />
  <field column="id" name="sf_meta_id" />
  <field column="metaclass" template="Track" name="sf_meta_class"/>
  <field column="metaid" template="${track.id}" name="sf_meta_id"/>
  <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>
  <entity name="artists" query="select a.name as artist from artist a left join track_artist ta on (ta.artist_id = a.id) where ta.track_id=${track.id}">
    <field column="artist" name="artists_t" />
  </entity>
</entity>

So there's one track entity with an artist sub-entity. My (admittedly rather limited) experience has been that sub-entities, where you have to run a separate query for every row in the parent entity, really slow down data import. For my own purposes, I wrote a custom data import using SolrJ to improve the performance (from 3 hours to 10 minutes). Just as a test, how long does it take if you comment out the artists entity?
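A sketch of the caching option, applied to the artists sub-entity from this thread (untested; the where attribute maps the cached column to the parent row's key):

<entity name="artists" processor="CachedSqlEntityProcessor"
        query="select ta.track_id as track_id, a.name as artist from artist a left join track_artist ta on (ta.artist_id = a.id)"
        where="track_id=track.id">
  <field column="artist" name="artists_t" />
</entity>

The full artist/track mapping is read once and cached, so DIH does an in-memory lookup per track instead of an SQL query per track.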
Re: my index has 500 million docs, how to improve solr search performance?
How much memory do you allocate to the JVMs? Considering you have 10 JVMs per server (10*N), you might not have enough memory left for the OS file system cache (you need to keep some memory free for that).

> all indexes size is about 100G

Is this per server or the whole size?

On Mon, Nov 15, 2010 at 8:35 AM, lu.rongbin lu.rong...@goodhope.net wrote:

In addition, my index has only two stored fields, id and price; the other fields are indexed. I increased the document and query caches. The EC2 m2.4xlarge instance has 8 cores and 68G of memory. All indexes' size is about 100G.
Re: Newbie: Indexing unrelated MySQL tables
> I figured I would create three entities and relevant schema.xml entries in this way:
> dataimport.xml:
> <entity name="Users" query="select id,firstname,lastname from user"/>
> <entity name="Artwork" query="select id,user,name,description from artwork"/>
> <entity name="Jobs" query="select id,company,position,location,description from jobs"/>

That's correct. You can list several entities under the document element. You can index them separately using the entity parameter (i.e. add entity=Users to your full-import HTTP request). Do not forget to add clean=false so you won't delete previously indexed documents. Or you can index all entities in one request (the default).

> schema.xml:
> <field name="id" type="int" indexed="true" stored="true" required="true"/>
> <field name="firstname" type="string" indexed="true" stored="true"/>
> <field name="lastname" type="string" indexed="true" stored="true"/>
> <field name="user" type="int" indexed="true" stored="true"/>
> <field name="name" type="string" indexed="true" stored="true"/>
> <field name="description" type="text" indexed="true" stored="false"/>
> <field name="company" type="string" indexed="true" stored="true"/>
> <field name="position" type="string" indexed="true" stored="true"/>
> <field name="location" type="string" indexed="true" stored="false"/>

Why do you use the string type for textual fields (description, company, name, firstname, lastname, etc.)? Is it intentional, for filtering/faceting on these fields? You could also add a default searchable multivalued field (type=text) and copyField instructions to copy all textual content into it ( http://wiki.apache.org/solr/SchemaXml#Copy_Fields ) - see the sketch at the end of this message. That way you will be able to search the default field for terms from all fields (firstname, lastname, name, description, company, position, location, etc.). You would probably also want to add a field type=user/artwork/job; you will then be able to facet/filter on it and provide a better search experience.

> This obviously does not work as I want. I only get results from the users table, and I cannot get results from either artwork or jobs.

Are you sure that this is because the indexing isn't working? How do you search for your data? Which query parser (standard/dismax), etc.?

> I have found out that the possible solution is in putting field tags in the entity tag and somehow aliasing column names for Solr, but the logic behind this is completely alien to me and the blind tests I tried did not yield anything.

You don't need to list your fields explicitly in the field declarations. BTW, what database do you use? Oracle has an issue with upper-casing column names that could be a problem.

> My logic says that the id field is getting replaced by the id field of other entities and indexes are being overwritten.

Are your ids unique across different objects? I.e. is there any job with the same id as a user? If so, then you would probably want to prefix your ids like:

<entity name="Users" query="select ('user_' || id) as id, firstname, lastname from user"/>
<entity name="Artwork" query="select ('artwork_' || id) as id, user, name, description from artwork"/>

> But if I aliased all id fields in all entities into something else, such as user_id and job_id, I couldn't figure out what to put in the primaryKey configuration in schema.xml, because I have three different id fields from three different tables that are all primary keys in the database!

You can still create separate id fields if you need to search for different objects by id, and not mess with prefixed ids. But it's not required.

HTH, Alexey
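The catch-all field mentioned above would look something like this in schema.xml (source field names taken from the question; the "text" destination field name is an assumption):

<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="firstname" dest="text"/>
<copyField source="lastname" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="company" dest="text"/>
<copyField source="position" dest="text"/>
<copyField source="location" dest="text"/>

With <defaultSearchField>text</defaultSearchField>, a bare query then matches terms from any of the source fields.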
Re: Query performance very slow even after autowarming
* Do you use EdgeNGramFilterFactory in the index analyzer only? Or do you use it on the query side as well?
* What if you create an additional field first_letter (string) and put the first character/characters (multivalued?) there in your external processing code? Then, during search, you can filter all documents that start with the letter "a" using an fq=first_letter:a filter query. Would that solve your performance problems?
* It makes sense to specify what you are trying to achieve; then probably more people can help you with it.

On Fri, Dec 3, 2010 at 10:47 AM, johnnyisrael johnnyi.john...@gmail.com wrote:

Hi, I am using EdgeNGramFilterFactory on Solr 1.4.1 for my indexing:

<filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1" />

Each document has about 5 fields in it, and only one field is indexed with EdgeNGramFilterFactory. I have about 1.4 million documents in my index now, and my index size is approx 296MB. I made the field that is indexed with EdgeNGramFilterFactory the default search field. All my query responses are very slow; some of them take more than 10 seconds to respond, and queries with single letters are still very slow:

/select/?q=m

So I tried query warming as follows:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">a</str></lst>
    <lst><str name="q">b</str></lst>
    <lst><str name="q">c</str></lst>
    <lst><str name="q">d</str></lst>
    <lst><str name="q">e</str></lst>
    <lst><str name="q">f</str></lst>
    <lst><str name="q">g</str></lst>
    <lst><str name="q">h</str></lst>
    <lst><str name="q">i</str></lst>
    <lst><str name="q">j</str></lst>
    <lst><str name="q">k</str></lst>
    <lst><str name="q">l</str></lst>
    <lst><str name="q">m</str></lst>
    <lst><str name="q">n</str></lst>
    <lst><str name="q">o</str></lst>
    <lst><str name="q">p</str></lst>
    <lst><str name="q">q</str></lst>
    <lst><str name="q">r</str></lst>
    <lst><str name="q">s</str></lst>
    <lst><str name="q">t</str></lst>
    <lst><str name="q">u</str></lst>
    <lst><str name="q">v</str></lst>
    <lst><str name="q">w</str></lst>
    <lst><str name="q">x</str></lst>
    <lst><str name="q">y</str></lst>
    <lst><str name="q">z</str></lst>
  </arr>
</listener>

The same was done for firstSearcher as well. My cache settings are as follows:

<filterCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="4096"/>
<queryResultCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="1024"/>
<documentCache class="solr.LRUCache" size="16384" initialSize="16384" />

Still, after query warming, a few single-character searches take up to 3 seconds to respond. Am I doing anything wrong in my cache or autowarm settings, or am I missing anything here?

Thanks, Johnny
Re: dataimports response returns before done?
After issuing a dataimport, I've noticed Solr returns a response prior to finishing the import. Is this correct? Is there any way I can make Solr not return until it finishes?

Yes, you can add synchronous=true to your request. But be aware that it could take a long time and you may see an HTTP timeout exception.

If not, how do I ping for the status whether it finished or not?

See command=status.

On Fri, Dec 3, 2010 at 8:55 PM, Tri Nguyen tringuye...@yahoo.com wrote: Hi, After issuing a dataimport, I've noticed Solr returns a response prior to finishing the import. Is this correct? Is there any way I can make Solr not return until it finishes? If not, how do I ping for the status whether it finished or not? thanks, tri
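The two styles side by side (host and core are placeholders):

  Block until the import finishes (risks an HTTP timeout on long imports):
    /dataimport?command=full-import&synchronous=true

  Fire and forget, then poll until the response shows <str name="status">idle</str>:
    /dataimport?command=full-import
    /dataimport?command=status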
Re: Syncing 'delta-import' with 'select' query
Hey Juan,

It seems that DataImportHandler is not the right tool for your scenario, and you'd be better off using the Solr XML update protocol:

* http://wiki.apache.org/solr/UpdateXmlMessages

You can still work around your outdated GUI view problem by calling DIH synchronously, adding synchronous=true to your request. But it won't solve the problem of two parallel requests from two users to a single DIH request handler, because DIH doesn't support that: if the previous request is still running, it bounces the second request.

HTH, Alex

On Fri, Dec 3, 2010 at 10:33 PM, Juan Manuel Alvarez naici...@gmail.com wrote: Hello everyone! I would like to ask you a question about DIH. I am using a database and DIH to sync against Solr, and a GUI to display and operate on the items retrieved from Solr. When I change the state of an item through the GUI, the following happens: a. The item is updated in the DB. b. A delta-import command is fired to sync the DB with Solr. c. The GUI is refreshed by making a query to Solr. My problem comes between (b) and (c). The delta-import operation is executed in a new thread, so my call returns immediately, refreshing the GUI before the Solr index is updated and causing the item state in the GUI to be outdated. I had two ideas so far: 1. Querying the status of the DIH after the delta-import operation and not returning until it is idle. The problem I see with this is that if other users execute delta-imports, the status will be busy until all operations are finished. 2. Use Zoie. The first problem is that configuring it is not as straightforward as it seems, so I don't want to spend more time trying it until I am sure that it will solve my issue. On the other hand, I think I may suffer the same problem, since the delta-import still fires in another thread, so I can't be sure it will be called fast enough. Am I pointing in the right direction, or is there another way to achieve my goal? Thanks in advance! Juan M.
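For reference, a minimal sketch of that XML update protocol (field names are illustrative): the GUI backend posts the changed document and the commit itself, so it knows exactly when the index reflects the change instead of guessing at DIH's thread status.

  POST /solr/update
    <add>
      <doc>
        <field name="id">item_42</field>
        <field name="state">done</field>
      </doc>
    </add>

  POST /solr/update
    <commit/>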
Re: DIH - rdbms to index confusion
I have a table that contains the data values I want to return when someone makes a search. This table has, in addition to the data values, 3 ids (FKs) pointing to the data/info that I want the users to be able to search on (while also returning the data values). The general rdbms query would be something like:

  select f.value, g.gar_name, c.cat_name
  from foo f, gar g, cat c, dub d
  where g.id = f.gar_id and c.id = f.cat_id and d.id = f.dub_id

You can put this general rdbms query as-is into a single DIH entity; there is no need to split it. You would only want to split it if your main table had a one-to-many relationship with the other tables, so that you couldn't retrieve all the data with a single result-set row per Solr document.
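A sketch of that single-entity configuration (untested; assumes the Solr field names match the column aliases):

  <document>
    <entity name="foo"
            query="select f.value, g.gar_name, c.cat_name
                   from foo f, gar g, cat c, dub d
                   where g.id = f.gar_id and c.id = f.cat_id and d.id = f.dub_id"/>
  </document>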
Re: Syncing 'delta-import' with 'select' query
When you say "two parallel requests from two users to a single DIH request handler", what do you mean by request handler?

I mean DIH.

Are you referring to the HTTP request? Would that mean that if I make the request from different HTTP sessions it would work?

No. It means that when two users simultaneously change two objects in the UI, you have two HTTP requests to DIH to pull changes from the db into the Solr index. If the second request comes in while the first is not fully processed, the second request will be rejected. As a result your index would be outdated (without the latest update) until the next update.
Re: using DIH with mets/alto file sets
The idea is to create a full text index of the alto content, accompanied by the author/title info from the mets file for purposes of results display.

- Then you need to list only alto files in your landscapes entity (fileName="^ID.{3}-ALTO\d{3}.xml$" or something like that), because you don't want to index every mets file as a separate Solr document, right?
- Also it seems you might want to add a regex transformer that extracts the ID from the alto file name:

  <field column="metsId" regex="ID(.{3})-ALTO\d{3}.xml" sourceColName="${landscapes.fileAbsolutePath}"/> (or fileAbsolutePath)

- And finally add a nested entity to process the mets file for every alto record:

  <entity name="landscapes" ...>
    <entity name="sample" ...>
      <entity name="metsProcessor"
              url="${landscapes.fileAbsolutePath}../ID${sample.metsId}-mets.xml"
              processor="XPathEntityProcessor" forEach="/mets"
              transformer="TemplateTransformer,RegexTransformer,LogTransformer">

and extract the mets elements/attributes and index them as separate fields.

P.S. I haven't tried a similar scenario, so I'm just speculating.

On Fri, Nov 19, 2010 at 12:09 AM, Fred Gilmore fgilm...@mail.utexas.edu wrote: mets/alto is an xml standard for describing physical objects. In this case, we're describing books. The mets file holds the metadata (author, title, etc.), the alto file is the physical description (words on the page, formatting of the page). So it's a one (mets) to many (alto) relationship. The directory structure:

  /our/collection/IDxxx/: IDxxx-mets.xml ALTO/
  /our/collection/IDxxx/ALTO/: IDxxx-ALTO001.xml IDxxx-ALTO002.xml

i.e. an xml file per scanned book page. Beyond the ID number as part of the file names, the mets file contains no reference to the alto children. The alto children do contain a reference to the jpg page scan, which is labelled with the ID number as part of the name. The idea is to create a full text index of the alto content, accompanied by the author/title info from the mets file for purposes of results display. The first try at this is attempting a recursive FileDataSource approach. It was relatively easy to create a content field which holds the text of the page (each word is actually an attribute of a separate tag), but I'm having difficulty determining how I'm going to conditionally add the author and title data from the METS file to the rows created with the ALTO content field. It'll involve regexing out the ID number associated with both the mets and alto filenames for starters, but even at that, I don't see how to keep it straight, since it's not one mets = one alto and it's also not a static string for the entire index. Thanks for any hints you can provide.
Fred University of Texas at Austin == data-config.xml thus far: dataConfig dataSource type=FileDataSource / document entity name=landscapes rootEntity=false processor=FileListEntityProcessor fileName=.xml$ recursive=true baseDir=/home/utlol/htdocs/lib-landscapes-new/publications/ entity name=sample rootEntity=true stream=true pk=filename url=${landscapes.fileAbsolutePath} processor=XPathEntityProcessor forEach=/mets | /alto transformer=TemplateTransformer,RegexTransformer,LogTransformer logTemplate= processing ${landscapes.fileAbsolutePath} logLevel=info !-- use system filename for getting OCLC number -- !-- we need it both for linking to results and for referencing the METS file -- field column=fileAbsPath template=${landscapes.fileAbsolutePath} / field column=title xpath=/mets/dmdSec/mdWrap/xmlData/mods/titleInfo/title / !-- field column=author xpath=/mets/dmdSec/mdWrap/xmlData/mods/na...@id='MODSMD_PRINT_N1']/namepa...@type='given'] / -- field column=filename xpath=/alto/Description/sourceImageInformation/fileName / field column=content xpath=/alto/Layout/Page/PrintSpace/TextBlock/TextLine/String/@CONTENT / /entity /entity /document /dataConfig == METS example: ?xml version=1.0 encoding=UTF-8? mets xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance; xmlns=http://www.loc.gov/METS/; xsi:schemaLocation=http://www.loc.gov/METS/ http://schema.ccs-gmbh.com/docworks/version20/mets-docworks.xsd; xmlns:MODS=http://www.loc.gov/mods/v3; xmlns:mix=http://www.loc.gov/mix/; xmlns:xlink=http://www.w3.org/1999/xlink; TYPE=METAe_Monograph LABEL=ENVIRONMENTAL GEOLOGIC ATLAS OF THE TEXAS COASTAL ZONE- Kingsville Area metsHdr CREATEDATE=2010-05-06T11:21:18 LASTMODDATE=2010-05-06T11:21:18 agent ROLE=CREATOR TYPE=OTHER OTHERTYPE=SOFTWARE nameCCS docWORKS/METAe Version 6.3-0/name notedocWORKS-ID: 1677/note /agent /metsHdr dmdSec ID=MODSMD_PRINT mdWrap MIMETYPE=text/xml MDTYPE=MODS LABEL=Bibliographic meta-data of the printed version xmlData MODS:mods MODS:titleInfo ID=MODSMD_PRINT_TI1 xml:lang=en MODS:titleENVIRONMENTAL GEOLOGIC ATLAS OF THE TEXAS COASTAL ZONE- Kingsville Area/MODS:title /MODS:titleInfo MODS:name ID=MODSMD_PRINT_N1
Re: Basic Solr Configurations and best practice
1- How to combine data from DIH and content extracted from a file system document into one document in the index?

http://wiki.apache.org/solr/TikaEntityProcessor

You can have one SQL entity that retrieves the metadata from the database and a nested entity that parses the binary file into additional fields of the same document.

2- Should I move the per-user permissions into a separate index? What technique to implement?

I would start with keeping the permissions in the same index as the actual content.

On Tue, Nov 23, 2010 at 11:35 AM, Darx Oman darxo...@gmail.com wrote: Hi guys, I'm kind of new to Solr and I'm wondering how to configure Solr to best fulfill my requirements. The requirements are as follows: I have 2 data sources: a database and file system documents. Every document in the file system has related information stored in the database. Both the file content and the related database fields must be indexed. Along with the DB data are per-user permissions for every document. I'm using DIH for the DB and Tika for the file system. The document contents almost never change, while the DB data, especially the permissions, change very frequently. The total number of documents is roughly around 2M and each document is about 500KB. 1- How to combine data from DIH and content extracted from a file system document into one document in the index? 2- Should I move the per-user permissions into a separate index? What technique to implement?
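A sketch of that nested layout (untested; data source names, the file_path column, and field names are placeholders):

  <dataSource name="db" type="JdbcDataSource" driver="..." url="..." user="..." password="..."/>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <entity name="meta" dataSource="db"
            query="select id, file_path, permissions from docs">
      <entity name="tika" dataSource="bin" processor="TikaEntityProcessor"
              url="${meta.file_path}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>

Each database row becomes one Solr document carrying both the DB fields and the Tika-extracted text.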
Re: DIH delta, deltaQuery
Are you sure that it's deltaQuery that's taking a minute? It only retrieves the ids of updated records, and then deltaImportQuery is executed N times, once per id. You might want to try the following technique: http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

On Wed, Nov 24, 2010 at 3:06 PM, stockii st...@shopgate.com wrote: Hello. I wonder why this deltaQuery takes over a minute:

  deltaQuery="SELECT id FROM sessions
              WHERE created BETWEEN DATE_ADD(NOW(), INTERVAL -1 HOUR) AND NOW()
              OR modified BETWEEN '${dataimporter.sessions.last_index_time}' AND DATE_ADD(NOW(), INTERVAL -1 HOUR)"

The database has only 700 entries and the comparison with modified takes so long!? When I remove the modified comparison it's fast. When I run this query directly against my MySQL database it needs 0.0014 seconds... why is it so slow? -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-delta-deltaQuery-tp1960246p1960246.html Sent from the Solr - User mailing list archive at Nabble.com.
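The delta-via-full-import technique from that wiki page, sketched against this sessions table (untested):

  <entity name="sessions" pk="id"
          query="SELECT * FROM sessions
                 WHERE '${dataimporter.request.clean}' != 'false'
                 OR modified &gt; '${dataimporter.last_index_time}'">

Then run /dataimport?command=full-import&clean=false for the delta case: one query in total, instead of one deltaImportQuery round-trip per changed id.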
Re: Searching with wrong keyboard layout or using translit
Another approach for this problem is to use another Solr core for storing users queries for auto complete functionality ( see http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ ) and index not only user_query field, but also transliterated and diff_layout versions and use dismax query parser to search suggestions in all fields. This solution is only viable if you have huge log of user queries ( which I believe google does ). HTH, Alex 2010/10/29 Alexander Kanarsky kanarsky2...@gmail.com: Pavel, it depends on size of your documents corpus, complexity and types of the queries you plan to use etc. I would recommend you to search for the discussions on synonyms expansion in Lucene (index time vs. query time tradeoffs etc.) since your problem is quite similar to that (think Moskva vs. Moskwa). Unless you have a small corpus, I would go with the second approach and expand the terms during the query time. However, the first approach might be useful, too: say, you may want to boost the score for the documents that naturally contain the word 'Moskva', so such a documents will be at the top of the result list. Having both forms indexed will allow you to achieve this easily by utilizing Solr's dismax query (to boost the results from the field with the original terms): http://localhost:8983/solr/select/?q=MoskvadefType=dismaxqf=text^10.0+text_translit^0.1 ('text' field has the original Cyrillic tokens, 'text_translit' is for transliterated ones) -Alexander 2010/10/28 Pavel Minchenkov char...@gmail.com: Alexander, Thanks, What variat has better performance? 2010/10/28 Alexander Kanarsky kanarsky2...@gmail.com Pavel, I think there is no single way to implement this. Some ideas that might be helpful: 1. Consider adding additional terms while indexing. This assumes conversion of Russian text to both translit and wrong keyboard forms and index converted terms along with original terms (i.e. your Analyzer/Filter should produce Moskva and Vjcrdf for term Москва). You may re-use the same field (if you plan for a simple term queries) or create a separate fields for the generated terms (better for phrase, proximity queries etc. since it keeps the original text positional info). Then the query could use any of these forms to fetch the document. If you use separate fields, you'll need to expand/create your query to search for them, of course. 2. If you have to index just an original Russian text, you might generate all term forms while analyzing the query, then you could treat the converted terms as a synonyms and use the combination of TermQuery for all term forms or the MultiPhraseQuery for the phrases. For Solr in this case you probably will need to add a custom filter similar to SynonymFilter. Hope this helps, -Alexander On Wed, Oct 27, 2010 at 1:31 PM, Pavel Minchenkov char...@gmail.com wrote: Hi, When I'm trying to search Google with wrong keyboard layout -- it corrects my query, example: http://www.google.ru/search?q=vjcrdf (I typed word Moscow in Russian but in English keyboard layout). http://www.google.ru/search?q=vjcrdfAlso, when I'm searching using translit, It does the same: http://www.google.ru/search?q=moskva What is the right way to implement this feature in Solr? -- Pavel Minchenkov -- Pavel Minchenkov
Re: problem on running fullimport
Caused by: java.sql.SQLException: Illegal value for setFetchSize(). Try to add batchSize=-1 to your data source declaration http://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F On Fri, Oct 15, 2010 at 3:42 PM, swapnil dubey swapnil.du...@gmail.com wrote: Hi, I am using the full import option with the data-config file as mentioned below dataConfig dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver url=jdbc:mysql:///xxx user=xxx password=xx / document entity name=yyy query=select studentName from test1 field column=studentName name=studentName / /entity /document /dataConfig on running the full-import option I am getting the error mentioned below.I had already included the dataimport.properties file in my conf file.help me to get the issue resolved response - lst name=responseHeader int name=status0/int int name=QTime334/int /lst - lst name=initArgs - lst name=defaults str name=configdata-config.xml/str /lst /lst str name=commandfull-import/str str name=modedebug/str null name=documents/ - lst name=verbose-output - lst name=entity:test1 - lst name=document#1 str name=queryselect studentName from test1/str - str name=EXCEPTION org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select studentName from test1 Processing Document # 1 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:253) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39) at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:184) at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58) at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389) at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:203) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405) at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211) at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139) at org.mortbay.jetty.Server.handle(Server.java:285) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226) at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442) Caused by: java.sql.SQLException: Illegal value for setFetchSize(). at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1075) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:989) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:984) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:929) at
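For reference, the fix suggested at the top of this thread as a config sketch: with batchSize="-1", DIH passes Integer.MIN_VALUE to setFetchSize(), which tells the MySQL driver to stream rows one by one instead of rejecting the fetch size.

  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql:///xxx" user="xxx" password="xx"
              batchSize="-1"/>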
Re: DataImportHandler dynamic fields clarification
Harry, could you please file a jira for this and I'll address this in a patch. I fixed related issue (SOLR-2102) and I think it's pretty similar. Interesting, I was under the impression that case does not matter. From http://wiki.apache.org/solr/DataImportHandler#A_shorter_data-config : It is possible to totally avoid the field entries in entities if the names of the fields are same (case does not matter) as those in Solr schema Yeah, case does not matter only for explicit mapping of sql columns to Solr fields. The reason is that DIH populates hash map for case insensitive match only for explicit mappings. You can also workaround this upper case column names in Oracle using the following SQL clause: = data-config.xml entity name=item query=select column_1 as quote;column_1quote;, column_100 as quote;column_100quote; from wide_table /entity schema.xml dynamicField name=column_* type=string indexed=true stored=true multiValued=true / = HTH, Alexey On Thu, Sep 30, 2010 at 9:10 PM, harrysmith harrysmith...@gmail.com wrote: Two things, one are your DB column uppercase as this would effect the out. Interesting, I was under the impression that case does not matter. From http://wiki.apache.org/solr/DataImportHandler#A_shorter_data-config : It is possible to totally avoid the field entries in entities if the names of the fields are same (case does not matter) as those in Solr schema I confirmed that matching the schema.xml field case to the database table is needed for dynamic fields, and the wiki statement above is incorrect, or at the very least confusing, possibly a bug. My database is Oracle 10g and the column names have been created in all uppercase in the database. In Oracle: Table name: wide_table Column names: COLUMN_1 ... COLUMN_100 (yes, uppercase) Please see following scenarios and results I found: data-config.xml entity name=item query=select column_1,column_100 from wide_table field column=column_100 name=id/ /entity schema.xml dynamicField name=column_* type=string indexed=true stored=true multiValued=true / Result: Nothing Imported = data-config.xml entity name=item query=select COLUMN_1,COLUMN_100 from wide_table field column=column_100 name=id/ /entity schema.xml dynamicField name=column_* type=string indexed=true stored=true multiValued=true / Result: Note query column names changed to uppercase. Nothing Imported = data-config.xml entity name=item query=select column_1,column_100 from wide_table field column=COLUMN_100 name=id/ /entity schema.xml dynamicField name=column_* type=string indexed=true stored=true multiValued=true / Result: Note ONLY the field entry was changed to caps All records imported, with only COLUMN_100 id field. data-config.xml entity name=item query=select column_1,column_100 from wide_table field column=COLUMN_100 name=id/ /entity schema.xml dynamicField name=COLUMN_* type=string indexed=true stored=true multiValued=true / Result: Note BOTH the field entry was changed to caps in data-config.xml, and the dynamicField wildcard in schema.xml All records imported, with all fields specified. This is the behavior desired. = Second what does your db-data-config.xml look like The relevant data-config.xml is as follows: document name= entity name=item query=select COLUMN_1,COLUMN_100 from wide_table field column=COLUMN_100 name=id/ /entity /document Ideally, I would rather have the query be 'select * from wide_table with the fields being dynamically matched by the column name from the dynamicField wildcard from the schema.xml. 
  <dynamicField name="COLUMN_*" type="string" indexed="true" stored="true"/>

-- View this message in context: http://lucene.472066.n3.nabble.com/DataImportHandler-dynamic-fields-clarification-tp1606159p1609578.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Delta Import with something other than Date
Can you provide a sample of passing the parameter via URL? And how would using it look in the data-config.xml?

http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters
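From that wiki page, a minimal sketch (the lastMod parameter name is illustrative). Pass the value on the request:

  /dataimport?command=full-import&lastMod=2010-09-01T00:00:00Z

and reference it in data-config.xml via ${dataimporter.request.*}:

  <entity name="item"
          query="select * from item where modified &gt; '${dataimporter.request.lastMod}'"/>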
Re: Solr is indexing jdbc properties
http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5

Try to add the convertType attribute to the dataSource declaration, i.e.

  <dataSource type="JdbcDataSource" name="mssqlDatasource"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://{db.host}:1433/{db};instance=SQLEXPRESS"
              user="{username}" password="{password}"
              convertType="true"/>

HTH, Alex

On Mon, Sep 6, 2010 at 5:49 PM, savvas.andreas savvas.andreas.moysi...@googlemail.com wrote: Hello, I am trying to index some data stored in an SQL Server database through DIH. My setup in data-config.xml is the following:

  <dataConfig>
    <dataSource type="JdbcDataSource" name="mssqlDatasource"
                driver="net.sourceforge.jtds.jdbc.Driver"
                url="jdbc:jtds:sqlserver://{db.host}:1433/{db};instance=SQLEXPRESS"
                user="{username}" password="{password}"/>
    <document>
      <entity name="id" dataSource="mssqlDatasource" query="select id, title from WORK">
        <field column="id" name="id"/>
        <field column="title" name="title"/>
      </entity>
    </document>
  </dataConfig>

However, when I run the indexer (invoking http://127.0.0.1:8983/solr/admin/dataimport.jsp?handler=/dataimport) I get all the rows in my index but with incorrect data indexed. More specifically, by examining the top 10 terms for the title field I get:

  term                            frequency
  impl                            1241371
  jdbc                            1241371
  net                             1241371
  sourceforg                      1241371
  jtds                            1241371
  clob                            1241371
  netsourceforgejtdsjdbcclobimpl  1186981
  c                               185070
  a                               179901
  e                               160759

which is clearly wrong... Does anybody know why Solr is indexing the JDBC properties instead of the actual data? Any pointers would be much appreciated. Thank you very much. -- Savvas -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-is-indexing-jdbc-properties-tp1426473p1426473.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH: Rows fetch OK, Total Documents Failed??
Do you have any required fields or uniqueKey in your schema.xml? Do you provide values for all these fields? AFAIU you don't need commonField attribute for id and title fields. I don't think that's your problem but anyway... On Sat, Jul 31, 2010 at 11:29 AM, scr...@asia.com wrote: Hi, I'm a bit lost with this, i'm trying to import a new XML via DIH, all row are fetched but no ducument are indexed? I don't find any log or error? Any ideas? Here is the STATUS: str name=commandstatus/str str name=statusidle/str str name=importResponse/ lst name=statusMessages str name=Total Requests made to DataSource1/str str name=Total Rows Fetched7554/str str name=Total Documents Skipped0/str str name=Full Dump Started2010-07-31 10:14:33/str str name=Total Documents Processed0/str str name=Total Documents Failed7554/str str name=Time taken 0:0:4.720/str /lst My xml file looks like this: ?xml version=1.0 encoding=UTF-8? products product titleMoniteur VG1930wm 19 LCD Viewsonic/title urlhttp://x.com/abc?a(12073231)p(2822679)prod(89042332277)ttid(5)url(http%3A%2F%2Fwww.ffdsssd.com%2Fproductinformation%2F%7E66297%7E%2Fproduct.htm%26sender%3D2003)/url contentMoniteur VG1930wm 19 LCD Viewsonic VG1930WM/content price247.57/price categoryEcrans/category /product etc... and my dataconfig: dataConfig dataSource type=URLDataSource / document entity name=products url=file:///home/john/Desktop/src.xml processor=XPathEntityProcessor forEach=/products/product transformer=DateFormatTransformer field column=id xpath=/products/product/url commonField=true / field column=title xpath=/products/product/title commonField=true / field column=category xpath=/products/product/category / field column=content xpath=/products/product/content / field column=price xpath=/products/product/price / /entity /document /dataConfig
Re: Implementing lookups while importing data
We are currently doing this via a JOIN on the numeric field, between the main data table and the lookup table, but this dramatically slows down indexing.

I believe a SQL JOIN is the fastest and easiest way in your case (compared with a nested entity, even using CachedSqlEntityProcessor). You probably don't have proper indexes in your database; check the SQL query plan.
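A quick way to check (MySQL syntax; table and column names are illustrative):

  EXPLAIN SELECT d.id, d.title, l.label
  FROM main_data d JOIN lookup l ON l.code = d.code;

  -- if the plan shows a full scan of lookup, add an index on the join key:
  CREATE INDEX idx_lookup_code ON lookup (code);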
Re: DIH and multivariable fields problems
Have others successfully imported dynamic multivalued fields in a child entity using the DataImportHandler via the child entity returning multiple records through a RDBMS? Yes, it's working ok with static fields. I didn't even know that it's possible to use variables in field names ( dynamic names ) in DIH configuration. This use case is quite unusual. This is increasingly more looking like a bug. To recap, I am trying to use the DIH to import multivalued dynamic fields and using a variable to name that field. I'm not an expert in DIH source code but it seems there's special processing of dynamic fields that prevents handling field type (and multivalued attribute). Specifically there's conditional jump (continue) over field type detection code in case of dynamic field name ( see DataImporter:initEntity ). I guess the reason of such behavior is that you can't determine field type based on dynamic field name (${variable}_s) at that time (configuration parsing). I'm wondering if it's possible to determine field types at runtime (when actual field title_s name is resolved). I encountered similar problem with implicit sql_column - solr_field mapping using SqlEntityProcessor, i.e. when you select some columns and do not explicitly list all these columns as fields entries in your configuration. In this case field type detection doesn't work either. I think that moving type detection process into runtime would solve that problem also. Am i missing something obvious that prevents us from doing field type detection at runtime? Alex On Tue, Aug 10, 2010 at 4:20 AM, harrysmith harrysmith...@gmail.com wrote: This is increasingly more looking like a bug. To recap, I am trying to use the DIH to import multivalued dynamic fields and using a variable to name that field. Upon further testing, the multivalued import works fine with a static/constant name, but only keeps the first record when naming the field dynamically. See below for relevant snips. From schema.xml : dynamicField name=*_s type=string indexed=true stored=true multiValued=true / From data-config.xml : entity name=terms query=select distinct CORE_DESC_TERM from metadata where item_id=${item.DIVID_PK} entity name=metadata query=select * from metadata where item_id=${item.DIVID_PK} AND core_desc_term='${terms.CORE_DESC_TERM}' field name=metadata_record_s column=TEXT_VALUE / /entity /entity Produces the following, note that there are 3 records that should be returned and are correctly done, with the field name being a constant. - result name=response numFound=1 start=0 - doc str name=id9892962/str - arr name=metadata_record_s strrecord 1/str strrecord 2/str strrecord 3/str strPolygraph Newsletter Title/str /arr - arr name=title strPolygraph Newsletter Title/str /arr /doc /result === Now, changing the field name to a variable..., note only the first record is retained for the 'Relation_s' field -- there should be 3 records. field name=metadata_record_s column=TEXT_VALUE / becomes field name=${terms.CORE_DESC_TERM}_s column=TEXT_VALUE / produces the following: - result name=response numFound=1 start=0 - doc - arr name=Relation_s strrecord 1/str /arr - arr name=Title_s strPolygraph Newsletter Title/str /arr str name=id9892962/str - arr name=title strPolygraph Newsletter Title/str /arr /doc /result Only the first record is retained. There was also another post (which recieved no replies) in the archive that reported the same issue. The DIH debug logs do show 3 records correctly being returned, so somehow these are not getting added. 
-- View this message in context: http://lucene.472066.n3.nabble.com/DIH-and-multivariable-fields-problems-tp1032893p1065244.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: commit is taking very very long time
I am not sure why some commits take a very long time.

Hmm... because it merges index segments. How large is your index?

Also, is there a way to reduce the time it takes?

You can disable commit in the DIH call and use autoCommit instead. It's something of a hack, because you postpone the commit operation and make it asynchronous. Another option is to set optimize=false in the DIH call (it's true by default). You can also try to increase the mergeFactor parameter, but that would affect search performance.
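Sketches of the two knobs mentioned (the autoCommit thresholds are illustrative):

  solrconfig.xml, inside <updateHandler>:
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>60000</maxTime>
    </autoCommit>

  DIH call without commit/optimize:
    /dataimport?command=full-import&commit=false&optimize=false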
Re: 2 solr dataImport requests on a single core at the same time
Having multiple request handlers will not degrade performance. IMO you shouldn't worry unless you have hundreds of them.
Re: Performance issues when querying on large documents
Do you use highlighting? ( http://wiki.apache.org/solr/HighlightingParameters ) Try to disable it and compare performance. On Fri, Jul 23, 2010 at 10:52 PM, ahammad ahmed.ham...@gmail.com wrote: Hello, I have an index with lots of different types of documents. One of those types basically contains extracts of PDF docs. Some of those PDFs can have 1000+ pages, so there would be a lot of stuff to search through. I am experiencing really terrible performance when querying. My whole index has about 270k documents, but less than 1000 of those are the PDF extracts. The slow querying occurs when I search only on those PDF extracts (by specifying filters), and return 100 results. The 100 results definitely adds to the issue, but even cutting that down can be slow. Is there a way to improve querying with such large results? To give an idea, querying for a single word can take a little over a minute, which isn't really viable for an application that revolves around searching. For now, I have limited the results to 20, which makes the query execute in roughly 10-15 seconds. However, I would like to have the option of returning 100 results. Thanks a lot. -- View this message in context: http://lucene.472066.n3.nabble.com/Performance-issues-when-querying-on-large-documents-tp990590p990590.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: 2 solr dataImport requests on a single core at the same time
DataImportHandler does not support parallel execution of several requests. You should either send your requests sequentially, or register several DIH handlers in solrconfig.xml and use them in parallel.

On Thu, Jul 22, 2010 at 11:20 AM, kishan mklpra...@gmail.com wrote: please help me -- View this message in context: http://lucene.472066.n3.nabble.com/2-solr-dataImport-requests-on-a-single-core-at-the-same-time-tp978649p986351.html Sent from the Solr - User mailing list archive at Nabble.com.
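A sketch of registering two independent DIH instances (handler names and config files are illustrative):

  <requestHandler name="/dataimport1" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults"><str name="config">data-config1.xml</str></lst>
  </requestHandler>
  <requestHandler name="/dataimport2" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults"><str name="config">data-config2.xml</str></lst>
  </requestHandler>

Each handler keeps its own import state, so /dataimport1 and /dataimport2 can run at the same time.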
Re: Adding new elements to index
1) Shouldn't you put your entity elements under document tag, i.e. dataConfig dataSource ... / dataSource ... / document name=docs entity ../entity entity ../entity /document /dataConfig 2) What happens if you try to run full-import with explicitly specified entity GET parameter? command=full-importentity=carrers command=full-importentity=hidrants On Wed, Jul 7, 2010 at 11:15 AM, Xavier Rodriguez xee...@gmail.com wrote: Thanks for the quick reply! In fact it was a typo, the 200 rows I got were from postgres. I tried to say that the full-import was omitting the 100 oracle rows. When I run the full import, I run it as a single job, using the url command=full-import. I've tried to clear the index both using the clean command and manually deleting it, but when I run the full-import, the number of indexed documents are the documents coming from postgres. To be sure that the id field is unique, i get the id by assigning a letter before the id value. When indexed, the id looks like s_123, and that's the id 123 for an entity identified as s. Other entities use different prefixes, but never s. I used DIH to index the data. My configuration is the folllowing: File db-data-config.xml dataSource type=JdbcDataSource name=ds_ora driver=oracle.jdbc.OracleDriver url=jdbc:oracle:thin:@xxx.xxx.xxx.xxx:1521:SID user=user password=password / dataSource type=JdbcDataSource name=ds_pg driver=org.postgresql.Driver url=jdbc:postgresql://xxx.xxx.xxx.yyy:5432/sid user=user password=password / entity name=carrers dataSource=ds_ora query=select 's_'||id as id_carrer,'a' as tooltip from imi_carrers field column=id_carrer name=identificador / field column=tooltip name=Nom / /entity entity name=hidrants dataSource=ds_pg query=select 'h_'||id as id_hidrant, parc as tooltip from hidrants field column=id_hidrant name=identificador / field column=tooltip name=Nom / /entity -- In that configuration, all the fields coming from ds_pg are indexed, and the fields coming from ds_ora are not indexed. As I've said, the strange behaviour for me is that no error is logged in tomcat, the number of documents created is the number of rows returned by hidrants, while the number of rows returned is the sum of the rows from hidrants and carrers. Thanks in advance. Xavi. On 7 July 2010 02:46, Erick Erickson erickerick...@gmail.com wrote: first do you have a unique key defined in your schema.xml? If you do, some of those 300 rows could be replacing earlier rows. You say: if I have 200 rows indexed from postgres and 100 rows from Oracle, the full-import process only indexes 200 documents from oracle, although it shows clearly that the query retruned 300 rows. Which really looks like a typo, if you have 100 rows from Oracle how did you get 200 rows from Oracle? Are you perhaps doing this in two different jobs and deleting the first import before running the second? And if this is irrelevant, could you provide more details like how you're indexing things (I'm assuming DIH, but you don't state that anywhere). If it *is* DIH, providing that configuration would help. Best Erick On Tue, Jul 6, 2010 at 11:19 AM, Xavier Rodriguez xee...@gmail.com wrote: Hi, I have a SOLR installed on a Tomcat application server. This solr instance has some data indexed from a postgres database. Now I need to add some entities from an Oracle database. When I run the full-import command, the documents indexed are only documents from postgres. 
In fact, if I have 200 rows indexed from postgres and 100 rows from Oracle, the full-import process only indexes 200 documents from oracle, although it shows clearly that the query retruned 300 rows. I'm not doing a delta-import, simply a full import. I've tried to clean the index, reload the configuration, and manually remove dataimport.properties because it's the only metadata i found. Is there any other file to check or modify just to get all 300 rows indexed? Of course, I tried to find one of that oracle fields, with no results. Thanks a lot, Xavier Rodriguez.
Re: Data Import Handler Rich Format Documents
Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr Version: 1.4.0 and getting the following error: java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource It seems that DIH-Tika integration is not a part of Solr 1.4.0/1.4.1 release. You should use trunk / nightly builds. https://issues.apache.org/jira/browse/SOLR-1583 My data-config.xml looks like this: dataConfig dataSource type=JdbcDataSource driver=oracle.jdbc.driver.OracleDriver url=jdbc:oracle:thin:@whatever:12345:whatever user=me name=ds-db password=secret/ dataSource type=BinURLDataSource name=ds-url/ document entity name=my_database dataSource=ds-db query=select * from my_database where rownum lt;=2 field column=CONTENT_ID name=content_id/ field column=CMS_TITLE name=cms_title/ field column=FORM_TITLE name=form_title/ field column=FILE_SIZE name=file_size/ field column=KEYWORDS name=keywords/ field column=DESCRIPTION name=description/ field column=CONTENT_URL name=content_url/ /entity entity name=my_database_url dataSource=ds-url query=select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}' entity processor=TikaEntityProcessor dataSource=ds-url format=text url=http://www.mysite.com/${my_database.content_url}; field column=text/ /entity /entity /document /dataConfig I added the entity name=my_database_url section to an existing (working) database entity to be able to have Tika index the content pointed to by the content_url. Is there anything obviously wrong with what I've tried so far? I think you should move Tika entity into my_database entity and simplify the whole configuration entity name=my_database dataSource=ds-db query=select * from my_database where rownum lt;=2 ... field column=CONTENT_URL name=content_url/ entity processor=TikaEntityProcessor dataSource=ds-url format=text url=http://www.mysite.com/${my_database.content_url}; field column=text/ /entity /entity
Re: solr data config questions
Hi, You can add additional commentreplyjoin entity to story entity, i.e. entity name=story ... ... entity name=commenttable ... ... entity name=replytable ... ... /entity /entity entity name=commentreplyjoin query=select concat(comment_id, ',', replier_id) as commentreply from commenttable left join replytable on replytable.comment_id=commenttable.comment_id where commenttable.story_id=${story.story_id}' field name=commentreply column=commentreply / /entity /entity Thus, you will have multivalued field commentreply that contains list of related comment_id, reply_id (comment_id, if you don't have any related replies for this entry) pairs. You can retrieve all values of that field and process on a client and build complex data structure. HTH, Alex On Mon, Jun 28, 2010 at 8:19 PM, Peng, Wei wei.p...@xerox.com wrote: Hi All, I am a new user of Solr. We are now trying to enable searching on Digg dataset. It has story_id as the primary key and comment_id are the comment id which commented story_id, so story_id and comment_id is one-to-many relationship. These comment_ids can be replied by some repliers, so comment_id and repliers are one-to-many relationship. The problem is that within a single returned document the search results shows an array of comment_ids and an array of repliers without knowing which repliers replied which comment. For example: now we got comment_id:[c1,c,2...,cn], repliers:[r1,r2,r3rm]. Can we get something like comment_id:[c1,c,2...,cn], repliers:[{r1,r2},{},r3{rm-1,rm}] so that {r1,r2} is corresponding to c1? Our current data-config is attached: dataConfig dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver autoreconnect=true netTimeoutForStreamingResults=1200 url=jdbc:mysql://localhost/diggdataset batchSize=-1 user=root password= / document entity name=story pk=story_id query=select * from story deltaImportQuery=select * from story where ID=='${dataimporter.delta.story_id}' deltaQuery=select story_id from story where last_modified '${dataimporter.last_index_time}' field column=link name=link / field column=title name=title / field column=description name=story_content / field column=digg name=positiveness / field column=comment name=spreading_number / field column=user_id name=author / field column=profile_view name=user_popularity / field column=topic name=topic / field column=timestamp name=timestamp / entity name=dugg_list pk=story_id query=select * from dugg_list where story_id='${story.story_id}' deltaQuery=select SID from dugg_list where last_modified '${dataimporter.last_index_time}' parentDeltaQuery=select story_id from story where story_id=${dugg_list.story_id} field name=viewer column=dugger / /entity entity name=commenttable pk=comment_id query=select * from commenttable where story_id='${story.story_id}' deltaQuery=select SID from commenttable where last_modified '${dataimporter.last_index_time}' parentDeltaQuery=select story_id from story where story_id=${commenttable.story_id} field name=comment_id column=comment_id / field name=spreading_user column=replier / field name=comment_positiveness column=up / field name=comment_negativeness column=down / field name=user_comment column=content / field name=user_comment_timestamp column=timestamp / entity name=replytable query=select * from replytable where comment_id='${commenttable.comment_id}' deltaQuery=select SID from replytable where last_modified '${dataimporter.last_index_time}' parentDeltaQuery=select comment_id from commenttable where comment_id=${replytable.comment_id} field name=replier_id 
column=replier_id / field name=reply_content column=content / field name=reply_positiveness column=up / field name=reply_negativeness column=down / field name=reply_timestamp column=timestamp / /entity /entity /entity /document /dataConfig Please help me on this. Many thanks Vivian
Re: DIH and denormalizing
It seems that ${ncdat.feature} is not being set. Try ${dataTable.feature} instead.

On Tue, Jun 29, 2010 at 1:22 AM, Shawn Heisey s...@elyograg.org wrote: I am trying to do some denormalizing with DIH from a MySQL source. Here's part of my data-config.xml:

  <entity name="dataTable" pk="did"
          query="SELECT *, FROM_UNIXTIME(post_date) as pd FROM ncdat
                 WHERE did &gt; ${dataimporter.request.minDid}
                 AND did &lt;= ${dataimporter.request.maxDid}
                 AND (did % ${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})">
    <entity name="ncdat_wt"
            query="SELECT webtable as wt FROM ncdat_wt WHERE featurecode='${ncdat.feature}'"/>
  </entity>

The relationship between features in ncdat and webtable in ncdat_wt (via featurecode) will be many-many. The wt field in schema.xml is set up as multivalued. It seems that ${ncdat.feature} is not being set. I saw a query happening on the server and it was SELECT webtable as wt FROM ncdat_wt WHERE featurecode='' - that last part is an empty string with single quotes around it. From what I can tell, there are no entries in ncdat where feature is blank. I've tried this with both a 1.5-dev checked out months ago (which we are using in production) and a 3.1-dev checked out today. Am I doing something wrong? Thanks, Shawn
Re: dataimport.properties is not updated on delta-import
Please note that Oracle (or the Oracle JDBC driver) converts column names to upper case even though you state them in lower case. If this is the case, try to rewrite your query in the following form:

  select id as "id", name as "name" from table

On Thursday, June 24, 2010, warb w...@mail.com wrote: Hello again! Upon further investigation it seems that something is amiss with delta-import after all: the delta-import does not actually import anything (I thought it did when I ran it previously, but I am not sure that was the case any longer). It does complete successfully as seen from the front-end (dataimport?command=delta-import). Also in the logs it is stated that the import was successful (INFO: Delta Import completed successfully), but there are exceptions pertaining to some documents. The exception message is that the id field is missing (org.apache.solr.common.SolrException: Document [null] missing required field: id). Now, I have checked the column names in the table, the data-config.xml file and the schema.xml file, and they all have the column/field names written in lowercase and are even named exactly the same. Does Solr roll back delta-imports if one or more of the documents failed? -- View this message in context: http://lucene.472066.n3.nabble.com/dataimport-properties-is-not-updated-on-delta-import-tp916753p919609.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Data Import Handler Rich Format Documents
You are right. It seems TikaEntityProcessor is exactly the tool you need in this case. Alex

On Sat, Jun 19, 2010 at 2:59 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : I think you can use existing ExtractingRequestHandler to do the job, : i.e. add child entity to your DIH metadata -- why would you do this instead of using the TikaEntityProcessor as I already suggested in my earlier mail? -Hoss
Re: Data Import Handler Rich Format Documents
I think you can use existing ExtractingRequestHandler to do the job, i.e. add child entity to your DIH metadata dataSource type=JdbcDataSource name=db ... / dataSource type=URLDataSource name=solr / entity name=metadata query=select id, title, url from metadata dataSource=db entity processor=PlainTextEntityProcessor name=content url=http://localhost:8983/solr/update/extract?extractOnly=truewt=xmlindent=onstream.url=${metadata.url}; dataSource=solr field column=plainText name=content/ /entity /entity That's not working example, just basic idea, you still need to uri_escape ${metadata.url} reference probably using some transformer (regexp, javascript?) and extract file content from ERH xml response using xpath and probably do some html stripping. HTH, Alex On Fri, Jun 18, 2010 at 4:51 PM, Tod listac...@gmail.com wrote: I have a database containing Metadata from a content management system. Part of that data includes a URL pointing to the actual published document which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc. I'm already indexing the Metadata and that provides a lot of value. The customer however would like that the content pointed to by the URL also be indexed for more discrete searching. This article at Lucid: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Searching-rich-format-documents-stored-DBMS describes the process of coding a custom transformer. A separate article I've read implies Nutch could be used to provide this functionality too. What would be the best and most efficient way to accomplish what I'm trying to do? I have a feeling the Lucid article might be dated and there might ways to do this now without any coding and maybe without even needing to use Nutch. I'm using the current release version of Solr. Thanks in advance. - Tod
Re: Solr DataConfig / DIH Question
There is a 1-[0,1] relationship between Person and Address with address_id being the nullable foreign key.

I think you should be good with a single query/entity then (no need for nested entities):

  <entity name="person"
          query="select person.id, person.name, person.address_id, address.zipcode
                 from person left join address on address.id = person.address_id"/>

On Sunday, June 13, 2010, Holmes, Charles V. chol...@mitre.org wrote: I'm putting together an entity. A simplified version of the database schema is below. There is a 1-[0,1] relationship between Person and Address, with address_id being the nullable foreign key. If it makes any difference, I'm using SQL Server 2005 on the backend.

  Person [id (pk), name, address_id (fk)]
  Address [id (pk), zipcode]

My data config looks like the one below. This naturally fails when the address_id is null, since the query ends up being "select * from user.address where id = ".

  <entity name="person" query="select * from user.person">
    <entity name="address" query="select * from user.address where id = ${person.address_id}"/>
  </entity>

I've worked around it by using a config like this one. However, this makes the queries quite complex for some of my larger joins.

  <entity name="person" query="select * from user.person">
    <entity name="address" query="select * from user.address where id = (select address_id from user.person where id = ${person.id})"/>
  </entity>

Is there a cleaner / better way of handling these types of relationships? I've also tried to specify a default in the Solr schema, but that seems to only work after all the data is indexed, which makes sense but surprised me initially. BTW, thanks for the great DIH tutorial on the wiki! Thanks! Charles
Re: multiValued using
Hi Alberto,

You can add a child entity which returns multiple records, i.e.

  <entity name="root" query="select id, title from titles">
    <entity name="child" query="select value from multivalued where title_id='${root.id}'"/>
  </entity>

HTH, Alex

2010/6/7 Alberto García Sola alberto...@gmail.com: Hello, this is my first message to this list. I was wondering whether it is possible to use multiValued fields when using MySQL (or any SQL database engine) through DataImportHandler. I've tried using a query which returns something like this:

  1 - title1 - multivalue1-1
  1 - title1 - multivalue1-2
  1 - title1 - multivalue1-3
  2 - title2 - multivalue2-1
  2 - title2 - multivalue2-2

and using the first column as ID. But that only returns me the first occurrence rather than transforming them into multiValued fields. Is there a way to deal with multiValued in databases? NOTE: The way I work with multivalues is using foreign keys and relating them in the query, so that the query gives me the results the way I have shown. Regards, Alberto.
Re: Importing large datasets
What's the relation between items and item_descriptions table? I.e. is there only one item_descriptions record for every id? If 1-1 then you can merge all your data into single database and use the following query entity name=item dataSource=single_datasource query=select * from items inner join item_descriptions on item_descriptions.id=items.id /entity HTH, Alex On Thu, Jun 3, 2010 at 6:34 AM, Blargy zman...@hotmail.com wrote: Erik Hatcher-4 wrote: One thing that might help indexing speed - create a *single* SQL query to grab all the data you need without using DIH's sub-entities, at least the non-cached ones. Erik On Jun 2, 2010, at 12:21 PM, Blargy wrote: As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). Also wanted to add that our main entity (item) consists of 5 sub- entities (ie, joins). 2 of those 5 are fairly small so I am using CachedSqlEntityProcessor for them but the other 3 (which includes item_description) are normal. All the entites minus the item_description connect to datasource1. They currently point to one physical machine although we do have a pool of 3 DB's that could be used if it helps. The other entity, item_description uses a datasource2 which has a pool of 2 DB's that could potentially be used. Not sure if that would help or not. I might as well that the item description will have indexed, stored and term vectors set to true. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html Sent from the Solr - User mailing list archive at Nabble.com. I can't find any example of creating a massive sql query. Any out there? Will batching still work with this massive query? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866506.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: indexer threading?
Hi Brian, I was testing indexing performance on a high cpu box recently and came to the same issue. I tried different indexing methods ( xml, CSVRequestHandler and Solrj + BinaryRequestWriter with multiple threads ). The last method is the fastest indeed. I believe that multiple threads approach gives you better performance if you have complex text analysis. I had very simple analysis - WhitespaceTokenizer only and performance boost with increasing threads was not very impressive ( but still ). I guess that in case of simple text analysis overall performance comes to synchronization issues. I tried to profile application during indexing phase for CPU times and monitors and it seems that most of blocking is on the following methods: - DocumentsWriter.doBalanceRAM - DocumentsWriter.getThreadState - SolrIndexWriter.ensureOpen I don't know the guts of Solr/Lucene in such details so can't make any conclusions. Are there any configuration techniques to improve indexing performance in multiple threads scenario? Alex On Mon, Apr 26, 2010 at 6:52 PM, Wawok, Brian brian.wa...@cmegroup.com wrote: Hi, I was wondering about how the multi-threading of the indexer works? I am using SolrJ to stream documents to a server. As I add more threads on the client side, I slowly see both speed and CPU usage go up on the indexer side. Once I hit about 4 threads, my indexer is at 100% cpu usage (of 1 CPU on a 4-way box), and will not do any more work. It is pretty fast, doing something like 75k lines of text per second.. but I would really like to use all 4 CPUs on the indexer. Is the just a limitation of Solr, or is this a limitation of using SolrJ and document streaming? Thanks, Brian
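For the SolrJ-with-multiple-threads method mentioned above, a minimal sketch using the 1.4-era StreamingUpdateSolrServer, which buffers documents and feeds the server from a pool of sender threads (URL, queue size, thread count, and field names are illustrative):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class IndexerSketch {
      public static void main(String[] args) throws Exception {
          // queue up to 20000 docs, send them from 4 threads
          SolrServer server =
              new StreamingUpdateSolrServer("http://localhost:8983/solr", 20000, 4);
          for (int i = 0; i < 1000000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", i);
              doc.addField("text", "example line of text " + i);
              server.add(doc);  // non-blocking while the queue has room
          }
          server.commit();
      }
  }

As the thread notes, client-side threads only help up to the point where the server's own analysis and writer synchronization become the bottleneck.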
Re: Short Question: Fills this entity multiValued Fields (DIH)?
Have a look at these two lines:

  <entity name="feature" query="select description from feature where item_id='${item.ID}'">
    <field name="features" column="description"/>

If there is more than one description per item_ID, does the features field get multiple values if it is defined as multiValued="true"?

Correct.
Re: SOLR-1316 How To Implement this autosuggest component ???
You should add this component (suggest or spellcheck, depending on how you name it) to a request handler, i.e. add

<requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults"/>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

And then you can hit the following URL and get your suggestions:

http://localhost:8983/solr/suggest/?spellcheck=true&spellcheck.dictionary=suggest&spellcheck.build=true&spellcheck.extendedResults=true&spellcheck.count=10&q=prefix

On Wed, Mar 24, 2010 at 8:09 PM, stocki st...@shopgate.com wrote:
hey. i got it =) i checked out with lucene and did the build of solr, with ant -verbose example. now, when i put this line into solrconfig:

<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>

no exception occurs =) juhu. but how does this component work?? sorry for a new stupid question ^^

stocki wrote:
okay, thx. so i checked out, but i cannot do a build. i got 100 errors ...

D:\cygwin\home\stock\trunk_\solr\common-build.xml:424: The following error occurred while executing this line:
D:\cygwin\home\stock\trunk_\solr\common-build.xml:281: The following error occurred while executing this line:
D:\cygwin\home\stock\trunk_\solr\contrib\clustering\build.xml:69: The following error occurred while executing this line:
D:\cygwin\home\stock\trunk_\solr\build.xml:155: The following error occurred while executing this line:
D:\cygwin\home\stock\trunk_\solr\common-build.xml:221: Compile failed; see the compiler error output for details.

Lance Norskog-2 wrote:
You need 'ant' to do builds. At the top level, do:

ant clean
ant example

These will build everything and set up the example/ directory. After that, run:

ant test-core

to run all of the unit tests and make sure that the build works. If the autosuggest patch has a test, this will check that the patch went in correctly. Lance

On Tue, Mar 23, 2010 at 7:42 AM, stocki st...@shopgate.com wrote:
okay, i did this.. but one file is not updated correctly: Index: trunk/src/java/org/apache/solr/util/HighFrequencyDictionary.java (from the suggest.patch). i checked it out from eclipse, applied the patch, made a new solr.war ... is that the right way?? i thought that by making a war i didn't need to make a build. how do i make a build?

Alexey-34 wrote:
Error loading class 'org.apache.solr.spelling.suggest.Suggester'
Are you sure you applied the patch correctly? See http://wiki.apache.org/solr/HowToContribute#Working_With_Patches Checkout the Solr trunk source code ( http://svn.apache.org/repos/asf/lucene/solr/trunk ), apply the patch, verify that everything went smoothly, build Solr and use the built version for your tests.

On Mon, Mar 22, 2010 at 9:42 PM, stocki st...@shopgate.com wrote:
i patched a nightly build of solr. the patch runs, the classes are in the correct folder, but when i replace spellcheck with this spellcheck like in the comments, solr cannot find the classes =(

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
    <str name="field">text</str>
    <str name="sourceLocation">american-english</str>
  </lst>
</searchComponent>

SCHWERWIEGEND: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.spelling.suggest.Suggester'

why is it so?? i think no one has as much trouble getting a patch to run as me =( :D

Andrzej Bialecki wrote:
On 2010-03-19 13:03, stocki wrote:
hello.. i try to implement the autosuggest component from this link: http://issues.apache.org/jira/browse/SOLR-1316 but i have no idea how to do this!?? can anyone give me some tips?

Please follow the instructions outlined in the JIRA issue, in the comment that shows fragments of XML config files. -- Best regards, Andrzej Bialecki ( http://www.sigram.com, info at sigram dot com )
Re: SOLR-1316 How To Implement this autosuggest component ???
Error loading class 'org.apache.solr.spelling.suggest.Suggester'

Are you sure you applied the patch correctly? See http://wiki.apache.org/solr/HowToContribute#Working_With_Patches Checkout the Solr trunk source code ( http://svn.apache.org/repos/asf/lucene/solr/trunk ), apply the patch, verify that everything went smoothly, build Solr and use the built version for your tests.

On Mon, Mar 22, 2010 at 9:42 PM, stocki st...@shopgate.com wrote:
i patched a nightly build of solr. the patch runs, the classes are in the correct folder, but when i replace spellcheck with this spellcheck like in the comments, solr cannot find the classes =(

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
    <str name="field">text</str>
    <str name="sourceLocation">american-english</str>
  </lst>
</searchComponent>

SCHWERWIEGEND: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.spelling.suggest.Suggester'

why is it so?? i think no one has as much trouble getting a patch to run as me =( :D

Andrzej Bialecki wrote:
On 2010-03-19 13:03, stocki wrote:
hello.. i try to implement the autosuggest component from this link: http://issues.apache.org/jira/browse/SOLR-1316 but i have no idea how to do this!?? can anyone give me some tips?

Please follow the instructions outlined in the JIRA issue, in the comment that shows fragments of XML config files. -- Best regards, Andrzej Bialecki ( http://www.sigram.com, info at sigram dot com )
Re: Term Highlighting without storing text in index
Hey Dominique,

See http://www.lucidimagination.com/search/document/5ea8054ed8348e6f/highlight_arbitrary_text#3799814845ebf002

Although it might not be a good solution for huge texts or for wildcard/phrase queries. See also http://issues.apache.org/jira/browse/SOLR-1397

On Mon, Mar 15, 2010 at 4:09 PM, dbejean dominique.bej...@eolya.fr wrote:
Hello, just in order to be able to show term highlighting in my results list, I store all the indexed data in the Lucene index, and so it is very huge (108Gb). Is there any possibility to do it another way? Now or in the future, is it possible that Solr could use a 3rd-party tool such as ehcache in order to store the content of the indexed documents outside of the Lucene index? Thank you, Dominique
Re: implementing profanity detector
- A TokenFilter would allow me to tap into the existing analysis pipeline so I get the tokens for free, but I can't access the document.

https://issues.apache.org/jira/browse/SOLR-1536

On Fri, Jan 29, 2010 at 12:46 AM, Mike Perham mper...@onespot.com wrote:
We'd like to implement a profanity detector for documents during indexing. That is, given a file of profane words, we'd like to be able to mark a document as safe or not safe if it contains any of those words, so that we can have something similar to Google's safe search. I'm trying to figure out how best to implement this with Solr 1.4:
- An UpdateRequestProcessor would allow me to dynamically populate a safe boolean field, but it requires me to pull out the content, tokenize it and run each token through my set of profanities, essentially running the analysis pipeline again. That's a lot of overhead AFAIK.
- A TokenFilter would allow me to tap into the existing analysis pipeline so I get the tokens for free, but I can't access the document.
Any suggestions on how best to implement this? Thanks in advance, mike
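For reference, a minimal sketch of the UpdateRequestProcessor route discussed above. This is an illustration, not code from the thread: the class name and the field names "content" and "safe" are assumptions, and it re-tokenizes naively with a regex instead of reusing the field's analysis chain (accepting the overhead Mike mentions):

package my.example;

import java.io.IOException;
import java.util.Set;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Hypothetical processor: flags a document as unsafe if any token
// of its "content" field appears in the profanity set.
public class ProfanityFlagProcessor extends UpdateRequestProcessor {
    private final Set<String> profanities; // lower-cased profane words

    public ProfanityFlagProcessor(Set<String> profanities, UpdateRequestProcessor next) {
        super(next);
        this.profanities = profanities;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        boolean safe = true;
        Object content = doc.getFieldValue("content"); // assumed field name
        if (content != null) {
            // naive tokenization; a TokenFilter would get these tokens for free
            for (String token : content.toString().toLowerCase().split("\\W+")) {
                if (profanities.contains(token)) { safe = false; break; }
            }
        }
        doc.setField("safe", safe); // assumed field name
        super.processAdd(cmd); // pass the document down the chain
    }
}

A real deployment would wrap this in an UpdateRequestProcessorFactory and register it in an updateRequestProcessorChain in solrconfig.xml.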
DataImportHandler - case sensitivity of column names
I encountered a problem with Oracle converting column names to upper case. As a result, the SolrInputDocument is created with field names in upper case and a Document [null] missing required field: id exception is thrown ( although the ID field is defined ). I do not specify field elements explicitly. I know that I can rewrite all my queries into the select id as id, body as body from document format, but is there any other workaround for this? A case-insensitive option or something? Here's my data-config:

<dataConfig>
  <dataSource convertType="true" driver="oracle.jdbc.driver.OracleDriver"
              password="oracle" url="jdbc:oracle:thin:@localhost:1521:xe" user="SYSTEM"/>
  <document name="items">
    <entity name="root" pk="id" preImportDeleteQuery="db:db1"
            query="select id, body from document" transformer="TemplateTransformer">
      <entity name="nested1" query="select category from document_category where doc_id='${root.id}'"/>
      <entity name="nested2" query="select tag from document_tag where doc_id='${root.id}'"/>
      <field column="db" template="db1"/>
    </entity>
  </document>
</dataConfig>

Alexey
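One note on the aliasing workaround mentioned above: since Oracle folds unquoted identifiers to upper case, the aliases would presumably need double quotes to actually come back lower case (this is standard Oracle behavior; the exact quoting may have been lost in the archive):

-- quoted aliases preserve the lower-case names the schema expects
select id as "id", body as "body" from document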
Re: Indexing an oracle warehouse table
What would be the right way to point out which field contains the term searched for?

I would use highlighting for all of these fields and then post-process the Solr response to check the highlighting tags. But I usually don't have so many fields, and I don't know whether it's possible to configure Solr to highlight fields using '*' the way dynamic fields are declared.

On Wed, Feb 3, 2010 at 2:43 AM, caman aboxfortheotherst...@gmail.com wrote:
Thanks all. I am on track. Another question: what would be the right way to point out which field contains the term searched for? E.g. if I search for SOLR and the term exists in field788 for a document, how do I pinpoint which field has the term? I copied all the fields into a field called 'body', which makes searching easier, but it would be nice to show the field which has that exact term. thanks

caman wrote:
Hello all, hope someone can point me in the right direction. I am trying to index an Oracle warehouse table (TableA) with 850 columns. About 800 of the fields are CLOBs and are good candidates for full-text searching. I also have a few columns with relational links to other tables. I am clear on how to create a root entity and then pull data from other relational links as child entities. Most columns in TableA are named field1, field2, ..., field800. Now my question is how to organize the schema efficiently. First option: if my query is 'select * from TableA', do I define <field name="attr1" column="FIELD1"/> for each of those 800 columns? Seems cumbersome. Maybe I can write a script to generate the XML instead of handwriting it in both data-config.xml and schema.xml. Or: don't define any <field name="attr1" column="FIELD1"/>, so that the column in Solr will be the same as in the database table. But the questions then are: 1) How do I define a unique field in this scenario? 2) How to copy all the text fields to a common field for easy searching? Any help is appreciated. Please feel free to suggest any alternative way. Thanks
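As an illustration of the post-processing approach, the request could look like the following (hl.fl accepts globs in current Solr releases, so field* should expand across the dynamic fields; verify against your version):

http://localhost:8983/solr/select?q=SOLR&hl=true&hl.fl=field*

Each document's entry in the highlighting section of the response then lists only the fieldN keys that actually contained the term.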
DataImportHandler - convertType attribute
Hello, I encountered the blob indexing problem and found the convertType solution in the FAQ: http://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5

I was wondering why it is not enabled by default and found the following comment in the mailing list ( http://www.lucidimagination.com/search/document/169e6cc87dad5e67/dataimporthandler_and_blobs#169e6cc87dad5e67 ):

We used to attempt type conversion from the SQL type to the field's given type. We found that it was error prone and switched to using ResultSet#getObject for all columns (making the old behavior a configurable option – convertType in JdbcDataSource).

Why is it error prone? Is it safe enough to enable convertType for all JDBC data sources by default? What are the side effects? Thanks in advance, Alex
Re: Indexing an oracle warehouse table
Don't define any <field name="attr1" column="FIELD1"/> so that the column in Solr will be the same as in the database table.

Correct. You can define a dynamic field:

<dynamicField name="field*" type="text" indexed="true" stored="true"/>

( see http://wiki.apache.org/solr/SchemaXml#Dynamic_fields )

1) How do I define a unique field in this scenario?

You can create a primary key in the database or generate it directly in Solr ( see the UUID techniques at http://wiki.apache.org/solr/UniqueKey ).

2) How to copy all the text fields to a common field for easy searching?

<copyField source="field*" dest="field"/>

( see http://wiki.apache.org/solr/SchemaXml#Copy_Fields )

On Tue, Feb 2, 2010 at 4:22 AM, caman aboxfortheotherst...@gmail.com wrote:
Hello all, hope someone can point me in the right direction. I am trying to index an Oracle warehouse table (TableA) with 850 columns. About 800 of the fields are CLOBs and are good candidates for full-text searching. I also have a few columns with relational links to other tables. I am clear on how to create a root entity and then pull data from other relational links as child entities. Most columns in TableA are named field1, field2, ..., field800. Now my question is how to organize the schema efficiently. First option: if my query is 'select * from TableA', do I define <field name="attr1" column="FIELD1"/> for each of those 800 columns? Seems cumbersome. Maybe I can write a script to generate the XML instead of handwriting it in both data-config.xml and schema.xml. Or: don't define any <field name="attr1" column="FIELD1"/>, so that the column in Solr will be the same as in the database table. But the questions then are: 1) How do I define a unique field in this scenario? 2) How to copy all the text fields to a common field for easy searching? Any help is appreciated. Please feel free to suggest any alternative way. Thanks
Re: DataImportHandler - synchronous execution
Hi, I created JIRA issue SOLR-1721 and attached a simple patch ( no documentation ) for this. HIH, Alex

2010/1/13 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com:
it can be added

On Tue, Jan 12, 2010 at 10:18 PM, Alexey Serba ase...@gmail.com wrote:
Hi, I found that there's no explicit option to run DataImportHandler in a synchronous mode. I need that option to run DIH from SolrJ ( EmbeddedSolrServer ) in the same thread. Currently I pass a dummy stream to DIH as a workaround for this, but I think it makes sense to add a specific option for that. Any objections? Alex

-- Noble Paul | Systems Architect | AOL | http://aol.com
DataImportHandler - synchronous execution
Hi, I found that there's no explicit option to run DataImportHandler in a synchronous mode. I need that option to run DIH from SolrJ ( EmbeddedSolrServer ) in the same thread. Currently I pass a dummy stream to DIH as a workaround for this, but I think it makes sense to add a specific option for that. Any objections? Alex
Re: Adaptive search?
You can add click counts to your index as an additional field and boost results based on that value. http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_change_the_score_of_a_document_based_on_the_.2Avalue.2A_of_a_field_.28say.2C_.22popularity.22.29

You can keep some kind of buffer for clicks and update the click count field for documents in the index periodically. If you don't want to update whole documents in the index, then you should probably look at ExternalFileField, or at Lucene's ParallelReader as a custom Solr IndexReader, but that is complex low-level Lucene stuff and requires some hacking. Alex

On Thu, Dec 17, 2009 at 6:46 PM, Siddhant Goel siddhantg...@gmail.com wrote:
Let's say we have a search engine (a simple front end - a web app kind of thing - responsible for querying Solr and then displaying the results in a human-readable form) based on Solr. If a user searches for something, gets quite a few search results, and then clicks on one such result - is there any mechanism by which we can notify Solr to boost the score/relevance of that particular result in future searches? If not, then any pointers on how to go about doing that would be very helpful. Thanks,

On Thu, Dec 17, 2009 at 7:50 PM, Paul Libbrecht p...@activemath.org wrote:
What can it mean to adapt to user clicks? Quite many things, in my head. Do you maybe have a citation that inspires you here? paul

On 17 Dec 2009, at 13:52, Siddhant Goel wrote:
Does Solr provide adaptive searching? Can it adapt to user clicks within the search results it provides? Or does that have to be done externally?

-- Siddhant
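To make the boosting half concrete, a dismax request could fold the stored click count into the score with a boost function. This assumes a dismax handler is configured (as in the stock example solrconfig); click_count is an assumed field name, and log(sum(...,1)) is just one way to damp large counts:

http://localhost:8983/solr/select?qt=dismax&q=some+query&qf=title^2+body&bf=log(sum(click_count,1))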
Re: preserve relational structure in solr?
http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example

See the full import example; it has 1-n and n-n relationships.

On Mon, Dec 14, 2009 at 4:34 PM, Faire Mii faire@gmail.com wrote:
I was able to import data through the Solr DIH. In my db I have 3 tables:

threads: id
tags: id
thread_tag_map: thread_id, tag_id

I want to import the many-to-many relationship (which thread has which tags) into my Solr index. How should the query look? I have tried the following code without result:

<entity name="thread_tags" query="select * from threads, tags, thread_tag_map where thread_tag_map.thread_id = threads.id AND thread_tag_map.tag_id = tags.id">
</entity>

Is this the right way to go? I thought that with this query each document would consist of a thread and all the tags related to it, and that I could then query for a specific thread by tag name. thanks!
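For comparison, a sketch of the nested-entity layout the wiki's full import example uses, adapted to the three tables above (anything beyond the listed column names is an assumption):

<entity name="thread" pk="id" query="select id from threads">
  <entity name="tag"
          query="select tags.id as tag from tags, thread_tag_map
                 where thread_tag_map.thread_id = '${thread.id}'
                   and thread_tag_map.tag_id = tags.id"/>
</entity>

Each thread document then carries a multivalued tag field with all of its tags, so a filter like fq=tag:someTag returns the threads labeled with that tag.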
Re: Similar documents from multiple cores with different schemas
Or maybe it's possible to tweak MoreLikeThis just to return the fields and terms that could be used for a search on the other core?

Exactly. See the mlt.interestingTerms parameter of MoreLikeThisHandler: http://wiki.apache.org/solr/MoreLikeThisHandler You can get the interesting terms and build a query (with N optional clauses + boosts) against the second core yourself. HIH, Alex

On Mon, Nov 9, 2009 at 6:25 PM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote:
Hi all, my search for postings answering the following question hasn't produced any helpful hints so far. Maybe someone can point me in the right direction?

Situation: I have two cores with slightly different schemas. Slightly means that some fields appear in both cores, but some are required in one core and optional in the other. Then there are fields that appear only in one core. (I don't want to put them in one index right now, because of the fields that might be required for only one type but not the other. But it's certainly an option.)

Question: Is there a way to get similar contents from core B when the input (seed) for the comparison is a document from core A?

MoreLikeThis: I was searching for MoreLikeThis, multiple schemas etc. As these are cores with different schemas, the posts on distributed search/sharding in combination with MoreLikeThis are not helpful. But maybe there is some other functionality that I am not aware of? Some similarity search? Or maybe it's possible to tweak MoreLikeThis just to return the fields and terms that could be used for a search on the other core?

Thanks for any input! Chantal
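For example, a request along these lines (handler path and field names assumed) returns the extracted terms and their boosts without the matching documents themselves:

http://localhost:8983/solr/coreA/mlt?q=id:123&mlt.fl=body&mlt.interestingTerms=details&rows=0

The terms from the interestingTerms section of the response can then be OR-ed, with their boosts, into a query sent to core B.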
Re: sanitizing/filtering query string for security
I added some kind of pre- and post-processing of Solr results for this, i.e. if I find a field name specified in the query string in the form fieldname:term, then I pass the query string to the standard request handler; otherwise I use DisMaxRequestHandler ( DisMaxRequestHandler doesn't break on the query, at least I haven't seen it yet ). If the standard request handler throws an error ( invalid field, too many clauses, etc ), then I pass the original query to the DisMax request handler. Alex

On Mon, Nov 9, 2009 at 10:05 PM, michael8 mich...@saracatech.com wrote:
Hi Julian, I saw your post on exactly the question I have. I'm curious if you got any response directly, or have figured out a way to do this by now that you could share? I'm in the same situation, trying to 'sanitize' the query string coming in before handing it to Solr. I do see that characters like : could break the query, but am curious if anyone has come up with a general solution, as I think this must be a fairly common problem for any Solr deployment to tackle. Thanks, Michael

Julian Davchev wrote:
Hi, is there anything special that can be done to sanitize user input before it is passed as a query to Solr? Not allowing * and ? as the first char is the only thing I can think of right now. Anything else it should somehow handle? I am not able to find any relevant document.
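If the client is SolrJ, one option (assuming your SolrJ version ships this helper) is to escape Lucene query syntax characters wholesale before building the query string:

import org.apache.solr.client.solrj.util.ClientUtils;

// neutralizes query syntax characters ( :, *, ?, (, ), [, ], etc. )
// so user input cannot change the structure of the query
String safe = ClientUtils.escapeQueryChars(userInput);

Escaping everything trades expressiveness for safety - users lose fielded search and wildcards - which is often acceptable for a public-facing search box.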
Re: sanitizing/filtering query string for security
BTW, I have not used the DisMax handler yet, but does it handle *:* properly?

See the q.alt DisMax parameter: http://wiki.apache.org/solr/DisMaxRequestHandler#q.alt You can specify q.alt=*:* and q as an empty string to get all results.

do you care if users issue this query

I allow users to issue an empty search and get all results with all facets, etc. It's a nice navigation UI, btw.

Basically given my UI, I'm trying to *hide* the total count from users searching for *everything*

If you don't specify the q.alt parameter, then Solr returns zero results for an empty search; *:* won't work either.

though this syntax has helped me debug/monitor the state of my search doc pool size

See q.alt. Alex

On Tue, Nov 10, 2009 at 12:59 AM, michael8 mich...@saracatech.com wrote:
Sounds like a nice approach. BTW, I have not used the DisMax handler yet, but does it handle *:* properly? IOW, do you care if users issue this query, or does DisMax treat this query string differently than the standard request handler? Basically, given my UI, I'm trying to *hide* the total count from users searching for *everything*, though this syntax has helped me debug/monitor the state of my search doc pool size. Thanks, Michael

Alexey-34 wrote:
I added some kind of pre- and post-processing of Solr results for this, i.e. if I find a field name specified in the query string in the form fieldname:term, then I pass the query string to the standard request handler; otherwise I use DisMaxRequestHandler ( DisMaxRequestHandler doesn't break on the query, at least I haven't seen it yet ). If the standard request handler throws an error ( invalid field, too many clauses, etc ), then I pass the original query to the DisMax request handler. Alex

On Mon, Nov 9, 2009 at 10:05 PM, michael8 mich...@saracatech.com wrote:
Hi Julian, I saw your post on exactly the question I have. I'm curious if you got any response directly, or have figured out a way to do this by now that you could share? I'm in the same situation, trying to 'sanitize' the query string coming in before handing it to Solr. I do see that characters like : could break the query, but am curious if anyone has come up with a general solution, as I think this must be a fairly common problem for any Solr deployment to tackle. Thanks, Michael

Julian Davchev wrote:
Hi, is there anything special that can be done to sanitize user input before it is passed as a query to Solr? Not allowing * and ? as the first char is the only thing I can think of right now. Anything else it should somehow handle? I am not able to find any relevant document.
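For what it's worth, the empty-search setup described above would look something like this (handler name taken from the stock example config; the facet field is assumed):

http://localhost:8983/solr/select?qt=dismax&q=&q.alt=*:*&facet=true&facet.field=category

With an empty q, DisMax falls back to the q.alt query, which matches all documents, so the full facet counts come back for the navigation UI.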
Re: MoreLikeThis and filtering/restricting on target fields
Hi Cody,

I have tried using MLT as a search component so that it has access to filter queries (via fq), but I cannot seem to get it to give me any data other than more of the same; that is, I can get a ton of Articles back but not other content types.

A filter query ( fq ) should work; for example, add fq=type_s:BlogPost OR type_s:Community

http://localhost:9007/solr/mlt?q=id:WikiArticle:948&mlt.fl=body_t&mlt.qf=body_t^1.0&fq=type_s:BlogPost OR type_s:Community

Alex

On Fri, Nov 6, 2009 at 1:44 AM, Cody Caughlan tool...@gmail.com wrote:
I am trying to use MoreLikeThis (both the component and handler, trying combinations) and I would like to give it an input document reference which has a source field to analyze, and then get back other documents which have a given field that is used by MLT. My dataset is composed of documents like:

# Doc 1
id: Article:99
type_s: Article
body_t: the body of the article...

# Doc 2
id: Article:646
type_s: Article
body_t: another article...

# Doc 3
id: Community:44
type_s: Community
description_t: description of this community...

# Doc 4
id: Community:34874
type_s: Community
description_t: another description

# Doc 5
id: BlogPost:2384
type_s: BlogPost
body_t: contents of some blog post

So I would like to say: given an article (e.g. id:Article:99, which has a field body_t that should be analyzed), give me related Communities, and search on description_t for the analysis. When I run a basic query like (using raw URL values for clarity, but they are encoded in reality):

http://localhost:9007/solr/mlt?q=id:WikiArticle:948&mlt.fl=body_t

then I get back a ton of other articles. Which is fine if my target type was Article. So how can I say search on field A for your analysis of the input document, but for related terms use field B, filtered by type_s? It seems that I can really only specify one field via mlt.fl. I have tried using MLT as a search component so that it has access to filter queries (via fq), but I cannot seem to get it to give me any data other than more of the same; that is, I can get a ton of Articles back but not other content types. Am I just trying to do too much? Thanks /Cody
Re: Dismax and Standard Queries together
Hi Ram,

You can add another field, total ( a catchall field ), and copy all other fields into it ( using the copyField directive, http://wiki.apache.org/solr/SchemaXml#Copy_Fields ), then use this field in the DisMax qf parameter, for example:

qf=business_name^2.0 category_name^1.0 sub_category_name^1.0 total^0.0

and mm=100%

This requires every search keyword to occur in some field of your document, but you can still control the relevance of returned results via the boosts in the qf parameter. HIH, Alex

On Tue, Nov 3, 2009 at 12:02 AM, ram_sj rpachaiyap...@gmail.com wrote:
Hi, I have three fields, business_name, category_name and sub_category_name, in my solrconfig file. My query = pet clinic. Example sub_category_names: Veterinarians, Kennels, Veterinary Clinics Hospitals, Pet Grooming, Pet Stores, Clinics. My ideal requirement is dismax searching: a. dismax over three or two fields, b. followed by a Boolean match over any one of the fields being acceptable. I played around with the minimum match attributes, but they don't seem to be helpful; I guess dismax requires at least two fields. The nested queries take only one qf field, so they don't help much either. Any suggestions will be helpful. Thanks, Ram
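A sketch of the catchall wiring in schema.xml to go with the qf example above (the field type text is assumed from the stock schema):

<field name="total" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="business_name" dest="total"/>
<copyField source="category_name" dest="total"/>
<copyField source="sub_category_name" dest="total"/>

The total^0.0 entry in qf lets the catchall field satisfy mm=100% without contributing to the ranking itself.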
Re: adding and updating a lot of documents to Solr, metadata extraction etc
Hi Eugene,

- ability to iterate over all documents returned in a search, as Lucene provides within a HitCollector instance. We would need to extract and aggregate various fields stored in the index, to group results and aggregate them in some way. Also I did not find any way in the tutorial to access the search results with all fields to be processed by our application.

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr

Check out faceted search; you can probably achieve your goal by using the facet component. There's also the field collapsing patch: http://wiki.apache.org/solr/FieldCollapsing

Alex
Re: Solr Cell on web-based files?
e.g. (doesn't work) curl http://localhost:8983/solr/update/extract?extractOnly=true --data-binary @http://myweb.com/mylocalfile.htm -H 'Content-type:text/html'

You might try remote streaming with Solr (see http://wiki.apache.org/solr/SolrConfigXml).

Yes. curl example:

curl 'http://localhost:8080/solr/main_index/extract/?extractOnly=true&indent=on&resource.name=lecture12&stream.url=http%3A//myweb.com/lecture12.ppt'

It works great for me. Alex
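Note that remote streaming is disabled by default; it is switched on in solrconfig.xml along these lines (the upload limit shown is illustrative):

<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>

Since this lets clients make Solr fetch arbitrary URLs, it is best enabled only on trusted networks.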
Re: yellow pages navigation kind menu. how to take every 100th row from resultset
It seems that you need faceted search: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr

On Fri, Oct 2, 2009 at 3:35 PM, Julian Davchev j...@drun.net wrote:
Hi, long story short: how can I take every 100th row from a Solr result set? What would the syntax for this be?

Long story: currently I have lots of, say, documents (articles) indexed. They all have a title field with a corresponding value: atitle, btitle, ..., *title. How do I build a menu so I can search over those? I cannot just hardcode A, B, C, D (meaning all starting with A, all starting with B, etc.), because there are unicode characters and the English alphabet will just not cut it... So my idea is to make ranges like [atitle - mtitle][mtitle - ltitle] etc. (based on the actual title names I have). The question is how to figure out what those atitle-mtitle boundaries are (like getting every 100th record from a Solr query). Two solutions I found:
1. Get all the data and do it server side (a huge load, as we are talking about thousands of records).
2. Use Solr sort plus start and make N calls until the resulting rows are under 100. But this will mean quite a load as well, as there are lots of records.
Any pointers? Thanks
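To sketch how faceting covers this: index the leading character of each title into its own field at index time (title_first_char is an assumed name, populated by your indexing code), then ask for counts only:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=title_first_char&facet.sort=lex

facet.sort=lex (renamed index in later releases) returns the values alphabetically, and the per-character counts tell you where to cut the ranges without ever paging through the result set for every 100th document.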
Re: Keepwords Schema
Probably you want to use:

- a multivalued field 'authors':

<add>
  <doc>
    <field name="filename">login.php</field>
    <field name="authors">alex</field>
    <field name="authors">brian</field>
    ...
  </doc>
</add>

- return facets for this field
- you can filter unwanted authors either during the indexing process or by post-processing the returned search results

On Fri, Oct 2, 2009 at 4:35 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
On Thu, Oct 1, 2009 at 7:37 PM, matrix_psj matrix_...@hotmail.com wrote:
An example: my schema is about web files. Part of the syntax is a text field of authors that have worked on each file, e.g.

<file>
  <filename>login.php</filename>
  <lastModDate>2009-01-01</lastModDate>
  <authors>alex, brian, carl carlington, dave alpha, eddie, dave beta</authors>
</file>

When I perform a search and get 20 web files back, I would like a facet of the individual authors, but only if their name appears in a public_authors.txt file. So if the public_authors.txt file contained: Anna, Bob, Carl Carlington, Dave Alpha, Elvis, Eddie, the facet returned would be:

Carl Carlington
Dave Alpha
Eddie

Not sure if that makes sense? If it does, could someone explain the schema fieldtype declarations that would bring back this sort of result.

If I'm understanding you correctly - you want to facet on a field (with facet=true&facet.field=authors) but you want to show only certain whitelisted facet values in the response. If that is correct, then you can remove the authors which are not in the whitelist at indexing time. You can do this by adding KeepWordFilterFactory to your field type:

<filter class="solr.KeepWordFilterFactory" words="author_whitelist.txt" ignoreCase="true"/>

-- Regards, Shalin Shekhar Mangar.
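A sketch of a complete field type for this, splitting the comma-separated author list into whole-name tokens before the whitelist filter (the type name and tokenizer pattern are assumptions):

<fieldType name="authors_facet" class="solr.TextField">
  <analyzer>
    <!-- split on commas so multi-word names stay single tokens -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="author_whitelist.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>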
Re: Disabling tf (term frequency) during indexing and/or scoring
Hi Aaron,

You can override the default Lucene Similarity and disable the tf and lengthNorm factors in the scoring formula ( see http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html and http://lucene.apache.org/java/2_4_1/api/index.html ). You need to:

1) Compile the following class and put it into the Solr WEB-INF/classes:

package my.example;

import org.apache.lucene.search.DefaultSimilarity;

public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

    // a norm of 1.0 for any non-empty field removes the length penalty
    public float lengthNorm(String fieldName, int numTerms) {
        return numTerms > 0 ? 1.0f : 0.0f;
    }

    // flat tf: a term either occurs (1.0) or it doesn't (0.0)
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}

2) Add <similarity class="my.example.NoLengthNormAndTfSimilarity"/> to your schema.xml ( see http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca ).

HIH, Alex

On Mon, Sep 14, 2009 at 9:50 PM, Aaron McKee ucbmc...@gmail.com wrote:
Hello, let me preface this by admitting that I'm still fairly new to Lucene and Solr, so I apologize if any of this sounds naive; I'm open to thinking about my problem differently. I'm currently responsible for a rather large dataset of business records that I'm trying to build a Lucene/Solr infrastructure around, to replace an in-house solution that we've been using for a few years. These records are sourced from multiple providers and there's often a fair bit of overlap in the business coverage. I have a set of fuzzy correlation libraries that I use to identify these documents, and I ultimately create a super-record that includes metadata from each of the providers. Given the nature of things, these providers often have slight variations in wording or spelling in the overlapping fields (it's amazing how many ways people find to refer to the same business or address). I'd like to capture these variations, as they facilitate searching, but TF considerations are currently borking field scoring here. For example, taking business names into consideration, I have a Solr schema similar to:

<field name="name_provider1" type="string" indexed="false" stored="false" multiValued="true"/>
...
<field name="name_providerN" type="string" indexed="false" stored="false" multiValued="true"/>
<field name="nameNorm" type="text" indexed="true" stored="false" multiValued="true" omitNorms="true"/>

<copyField source="name_provider1" dest="nameNorm"/>
...
<copyField source="name_providerN" dest="nameNorm"/>

For any given business record, there may be 1..N business names present in the nameNorm field (some with naming variations, some identical). With TF enabled, however, I'm getting different match scores on this field simply based on how many providers contributed to the record, which is not meaningful to me. For example, a record containing the value foo bar twice in nameNorm (separated by the positionIncrementGap) necessarily scores higher than a record containing foo bar just once. Although I wouldn't mind TF data being considered within each discrete field value, I need to find a way to prevent score inflation based simply on the number of contributing providers. Looking at the mailing list archive and searching around, it sounds like the omitTf boolean in Lucene used to function somewhat in this manner, but has since taken on a broader interpretation (and name) that now also disables positional and payload data. Unfortunately, phrase support for fields like this is absolutely essential. So what's the best way to address a need like this? I guess I don't mind whether this is handled at index time or search time, but I'm not sure what I may need to override or whether there's some existing provision I should take advantage of. Thank you for any help you may have. Best regards, Aaron
Re: do NOT want to stem plurals for a particular field, or words
You can enable/disable stemming per field type in schema.xml by removing the stemming filters from the type definition. Basically, copy your preferred type, rename it to something like 'text_nostem', remove the stemming filter from the type, and use the 'text_nostem' type for your field. Plus, you can search both fields, text_stemmed and text_exact, using the DisMax handler and boost the text_exact matches. Thus if you search for 'articles' you'll get all results with 'articles' and 'article', but the exact matches will be on top.
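A minimal sketch of such a type, assuming a typical analyzer chain; copy your real type's chain and drop only the stemmer:

<fieldType name="text_nostem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- no stemming filter (e.g. solr.SnowballPorterFilterFactory) here -->
  </analyzer>
</fieldType>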
Re: query too long / has-many relation
Is there a way to configure Solr to accept POST queries (instead of GET only)? Or: is there some other way to make Solr accept queries longer than 2,000 characters? (Up to 10,000 would be nice.)

Solr accepts POST queries by default. I switched to POST for exactly the same reason. I use Solr 1.4 ( trunk version ) though.

I have a Solr 1.3 index (served by Tomcat) of People, containing id, name, address, description etc. This works fine. Now I want to store and retrieve Events (time, location, person), so each person has 0 or more events. As I understood it, there is no way to model a has-many relation in Solr (at least not between two structures with more than 1 property), so I decided to store the Events in a separate MySQL table. An example of a query I would like to do is: give me all people that will have an Event at location x in the coming month, that have ... in their description. I do this in two steps now: first I query the MySQL table, then I build a Solr query with a big OR of all the ids. The problem is that this can generate long (too long) query strings.

Another option would be to put all your event objects (time, location, person_id, description) into the Solr index ( normalization ). Then you can issue the Solr query give me all events at location x in the coming month that have smth in their description and ask Solr to return facet values for the field person_id. Solr will return all distinct values of the field person_id that match the query, with counts. Then you can take the list of related person_ids and load all persons from the MySQL database using SQL with an IN () clause.
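An example of sending the same query by POST with curl (URL and parameters assumed); the query moves into the request body, so URL length limits no longer apply:

curl http://localhost:8983/solr/select --data-urlencode 'q=id:(1 OR 2 OR 3)' --data-urlencode 'rows=100'

The servlet container may still cap POST body sizes (e.g. Tomcat's maxPostSize), but the defaults are far above 10,000 characters.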
Re: query too long / has-many relation
Is there a way to configure Solr to accept POST queries (instead of GET only)? Or: is there some other way to make Solr accept queries longer than 2,000 characters? (Up to 10,000 would be nice.)

Solr accepts POST queries by default. I switched to POST for exactly the same reason. I use Solr 1.4 ( trunk version ) though. Don't forget to increase maxBooleanClauses in solrconfig.xml: http://wiki.apache.org/solr/SolrConfigXml#head-69ecb985108d73a2f659f2387d916064a2cf63d1
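The corresponding solrconfig.xml setting, with an illustrative value (it caps how many OR-ed clauses a single Boolean query may contain; the default is 1024):

<maxBooleanClauses>10240</maxBooleanClauses>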
Re: query too long / has-many relation
But apart from that, everything works fine now (10,000 OR clauses take 10 seconds).

Not fast. I would recommend denormalizing your data: put everything into the Solr index and use Solr faceting ( http://wiki.apache.org/solr/SolrFacetingOverview ) to get the relevant persons ( see my previous message ).
Re: DisMax - fetching dynamic fields
My bad! Please disregard this post. Alex

On Tue, Aug 4, 2009 at 9:21 PM, Alexey Serba ase...@gmail.com wrote:
Solr 1.4 built from trunk revision 790594 ( 02 Jul 2009 )

On Tue, Aug 4, 2009 at 9:19 PM, Alexey Serba ase...@gmail.com wrote:
Hi everybody, I have a couple of dynamic fields in my schema, e.g. rating_* and popularity_*. The problem I have is that if I try to specify the existing fields rating_1 and popularity_1 in the fl parameter, the DisMax handler just ignores them, whereas the StandardRequestHandler works fine. Any clues what's wrong? Thanks in advance, Alex
DisMax - fetching dynamic fields
Hi everybody, I have a couple of dynamic fields in my schema, e.g. rating_* and popularity_*. The problem I have is that if I try to specify the existing fields rating_1 and popularity_1 in the fl parameter, the DisMax handler just ignores them, whereas the StandardRequestHandler works fine. Any clues what's wrong? Thanks in advance, Alex
Re: DisMax - fetching dynamic fields
Solr 1.4 built from trunk revision 790594 ( 02 Jul 2009 )

On Tue, Aug 4, 2009 at 9:19 PM, Alexey Serba ase...@gmail.com wrote:
Hi everybody, I have a couple of dynamic fields in my schema, e.g. rating_* and popularity_*. The problem I have is that if I try to specify the existing fields rating_1 and popularity_1 in the fl parameter, the DisMax handler just ignores them, whereas the StandardRequestHandler works fine. Any clues what's wrong? Thanks in advance, Alex