Re: number of replicas in Cloud
Hi Anshum,

I'm using Solr 4.4. Is there a problem with using a replicationFactor of 2?

On Thu, Sep 12, 2013 at 11:20 AM, Anshum Gupta ans...@anshumgupta.net wrote:

Prasi, a replicationFactor of 2 is what you want. However, as of the current releases, this is not persisted.

On Thu, Sep 12, 2013 at 11:17 AM, Prasi S prasi1...@gmail.com wrote:

Hi, I want to set up SolrCloud with 2 shards and 1 replica for each shard:

    MyCollection: shard1, shard2
                  shard1-replica, shard2-replica

In this case I would use numShards=2. For the replication factor, should I give replicationFactor=1 or replicationFactor=2? Please advise.

Thanks, Prasi

--
Anshum Gupta
http://www.anshumgupta.net
Re: No or limited use of FieldCache
Thanks, guys. Now I know a little more about DocValues and realize that they will do the job wrt FieldCache.

Regards, Per Steffensen

On 9/12/13 3:11 AM, Otis Gospodnetic wrote:

Per, check zee Wiki, there is a page describing docvalues. We used them successfully in a Solr-for-analytics scenario.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 11, 2013 9:15 AM, Michael Sokolov msoko...@safaribooksonline.com wrote:

On 09/11/2013 08:40 AM, Per Steffensen wrote:

The reason I mention sort is that in my project, half a year ago, we dealt with the FieldCache OOM problem when doing sort requests. We basically just reject sort requests unless they hit below X documents - in case they do, we just find them without sorting and sort them ourselves afterwards.

Currently our problem is that we have to do a group/distinct (in SQL language) query, and we have found that we can do what we want using grouping (http://wiki.apache.org/solr/FieldCollapsing) or faceting - either will work for us. The problem is that they both use FieldCache, and we know that using FieldCache will lead to OOM exceptions with the amount of data each of our Solr nodes administrates. This time we really have no option of just limiting usage as we did with sort. Therefore we need a group/distinct functionality that works even on huge data amounts (and an algorithm using FieldCache will not).

I believe setting facet.method=enum will actually make facet not use the FieldCache. Is that true? Is it a bad idea?

I do not know much about DocValues, but I do not believe that you will avoid FieldCache by using DocValues? Please elaborate, or point to documentation where I will be able to read that I am wrong. Thanks!

There is Simon Willnauer's presentation
http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
and this blog post
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
and this one that shows some performance comparisons:
http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
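On the facet.method=enum question: enum faceting walks the indexed terms and intersects with the filterCache rather than populating the FieldCache, so it does sidestep the FieldCache (at the cost of one filter per term). A minimal SolrJ sketch of trying it - the URL and the field name "category" are made-up placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class EnumFacetSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("*:*");
            q.setFacet(true);
            q.addFacetField("category"); // hypothetical field
            // enum enumerates terms and uses the filterCache, not the FieldCache
            q.set("facet.method", "enum");
            QueryResponse rsp = server.query(q);
            System.out.println(rsp.getFacetField("category").getValues());
            server.shutdown();
        }
    }

Whether enum is a good idea depends on the field: it tends to work well for low-cardinality fields and poorly for fields with millions of unique terms.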
Re: SolrCloud 4.x hangs under high update volume
Thanks Erick!

Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 patch. I think that is a very, very useful patch, by the way. SOLR-5232 seems promising as well.

I see your point on the more-shards idea; this is obviously a global/instance-level lock. If I really had to, I suppose I could run more Solr instances to reduce locking then? Currently I have 2 cores per instance, and I could go 1-to-1 to simplify things.

The good news is we seem to be more stable since changing to a bigger client-to-Solr batch size and fewer client threads updating.

Cheers, Tim

On 11/09/13 04:19 AM, Erick Erickson wrote:

If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent copy of the 4x branch. By recent I mean like today; it looks like Mark applied this early this morning. But several reports indicate that this will solve your problem. I would expect that increasing the number of shards would make the problem worse, not better. There's also SOLR-5232...

Best, Erick

On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt t...@elementspace.com wrote:

Hey guys,

Based on my understanding of the problem we are encountering, I feel we've been able to reduce the likelihood of this issue by making the following changes to our app's usage of SolrCloud:

1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is that increasing the batch size reduces the likelihood of this issue happening.

2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and have previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud.

3) Fewer threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing app (from 10 threads).

To be clear, the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand. If we happen to encounter issues with the above 3 changes, the next steps (I could use some advice on) are:

1) Increase the number of shards (2x) - the theory here is that this reduces the locking on shards because there are more shards. Am I onto something here, or will this not help at all?

2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go direct to what we need to update, this will reduce concurrency in SolrCloud a bit. Thoughts?

Thanks all!

Cheers, Tim

On 6 September 2013 14:47, Tim Vaillancourt t...@elementspace.com wrote:

Enjoy your trip, Mark! Thanks again for the help!

Tim

On 6 September 2013 14:18, Mark Miller markrmil...@gmail.com wrote:

Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening.

Mark

Sent from my iPhone

On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt t...@elementspace.com wrote:

Hey Mark,

The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less. During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for the JVM, but I strangely see no "OutOfMemory: cannot open native thread" errors that always follow this. Weird! We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking, isn't quite pinned (very beefy Dell R720s - 16-core Xeons, whole index is in 128GB RAM, 6xRAID10 15k).

More on resources: our disk I/O seemed to spike about 2x during the crash (about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN severity unless something breaks).

Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not):

    java.lang.IllegalStateException
        at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
        at org.eclipse.jetty.server.Response.sendError(Response.java:325)
        at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
        at ...
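For reference, the CloudSolrServer-plus-batching combination discussed above looks roughly like this in SolrJ 4.x. A sketch only; the ZooKeeper address, collection name, and field values are made up:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchedCloudUpdates {
        public static void main(String[] args) throws Exception {
            // Routes requests using cluster state from ZooKeeper
            // instead of a load-balancer VIP.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 200; i++) { // batch of 200, as in the thread
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                batch.add(doc);
            }
            server.add(batch); // one request per batch, not per document
            server.commit();
            server.shutdown();
        }
    }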
DataImportHandler oddity
I'm trying to index a view in an Oracle database, and have come across some strange behaviour: all the VARCHAR2 fields are being returned as empty strings; this also applies to a datetime field converted to a string via TO_CHAR, and the url field built by concatenating two constant strings and a numeric field converted via TO_CHAR. If I cast the columns to CHAR(N), I get values back, but this is not an acceptable workaround (the maximum length of CHAR(N) is less than that of VARCHAR2(N), and the result is padded to the specified length).

Note that this query works as it should in sqldeveloper, and also in some code that uses the .NET sqlclient API.

The query I'm using is

    select 'APPLICATION' as sourceid,
           'http://app.company.com' || '/app/report.aspx?trsid=' || to_char(incident_no) as URL,
           incident_no, trans_date, location, responsible_unit,
           process_eng, product_eng, case_title, case_description,
           index_lob, investigated, investigated_eng,
           to_char(modified_date, 'YYYY-MM-DDTHH24:MI:SSZ') as modified_date
    from synx.dw_fast
    where (investigated 3)

while the view is

    INCIDENT_NO         NUMBER(38)
    TRANS_DATE          VARCHAR2(8)
    LOCATION            VARCHAR2(4000)
    RESPONSIBLE_UNIT    VARCHAR2(4000)
    PROCESS_ENG         VARCHAR2(4000)
    PROCESS_NO          VARCHAR2(4000)
    PRODUCT_ENG         VARCHAR2(4000)
    PRODUCT_NO          VARCHAR2(4000)
    CASE_TITLE          VARCHAR2(4000)
    CASE_DESCRIPTION    VARCHAR2(4000)
    INDEX_LOB           CLOB
    INVESTIGATED        NUMBER(38)
    INVESTIGATED_ENG    VARCHAR2(254)
    INVESTIGATED_NO     VARCHAR2(254)
    MODIFIED_DATE       DATE
Storing/indexing speed drops quickly
Hi,

SolrCloud 4.0: 6 machines, quad-core, 8GB RAM, 1T disk, one Solr node on each, one collection across the 6 nodes, 4 shards per node. Storing/indexing from 100 threads on external machines, each thread one doc at a time, at full speed (they always have a new doc to store/index). See the attached images:

* iowait.png: measured I/O wait on the Solr machines
* doccount.png: measured number of docs in the Solr collection

Starting from an empty collection, things are fine wrt storing/indexing speed for the first two to three hours (100M docs per hour); then the speed drops dramatically to a level that is unacceptable for us (max 10M per hour). At the same time as the speed goes down, we see I/O wait increase dramatically. I am not 100% sure, but a quick investigation has shown that this is due to almost constant merging.

What to do about this problem? I know that you can play around with mergeFactor and the commit rate, but earlier tests showed that this does not really do the job - it might postpone the point where the problem occurs, but basically it is just a matter of time before merging exhausts the system. Is there a way to totally avoid merging and keep indexing speed at a high level, while still making sure that searches will perform fairly well when the data amounts become big? (I guess without merging you end up with lots and lots of small files, and I guess this is not good for search response time.)

Regards, Per Steffensen
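For reference, merging is controlled by the merge policy; in Solr 4.x the equivalent settings normally live under indexConfig in solrconfig.xml. As an illustration of the knobs involved (a sketch of the Lucene 4.x API, not a recommendation for these particular values):

    import org.apache.lucene.index.TieredMergePolicy;

    public class MergeTuningSketch {
        public static void main(String[] args) {
            TieredMergePolicy mp = new TieredMergePolicy();
            // More segments allowed per tier => fewer, later merges
            // (faster indexing, more segments for searches to visit).
            mp.setSegmentsPerTier(20.0);
            // How many segments get merged in one go.
            mp.setMaxMergeAtOnce(20);
            // Segments larger than this are left alone.
            mp.setMaxMergedSegmentMB(5 * 1024);
            System.out.println(mp);
        }
    }

Raising these defers merge work at the cost of more segments per search; it cannot eliminate merging entirely.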
create a core with explicit node_name
Hi Solr users,

I want to create a core with a node_name through the API CloudSolrServer.query(SolrParams params). For example:

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("qt", "/admin/cores");
    params.set("action", "CREATE");
    params.set("name", newcore.getName());
    params.set("shard", newcore.getShardname());
    params.set("collection.configName", newcore.getCollectionconfigname());
    params.set("schema", newcore.getSchemaXMLFilename());
    params.set("config", newcore.getSolrConfigFilename());
    params.set("coreNodeName", newcore.getCorenodename());
    params.set("node_name", "10.7.23.124:8080_solr");
    params.set("collection", newcore.getCollectionname());

The newcore object encapsulates the creation properties for the core. This does not seem to work: the core was created on another node. Do I need to send the params directly to the specific web server - here 10.7.23.124 - instead of using CloudSolrServer.query(SolrParams params)?

regards
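CloudSolrServer load-balances requests across the live nodes, so a CoreAdmin CREATE sent through it can land on any node. A sketch of addressing the target node's CoreAdmin handler directly with HttpSolrServer; the host follows the example above and the names are made up:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class CreateCoreOnNode {
        public static void main(String[] args) throws Exception {
            // Talk to the node that should host the core.
            HttpSolrServer node = new HttpSolrServer("http://10.7.23.124:8080/solr");
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "CREATE");
            params.set("name", "newcore");             // hypothetical core name
            params.set("collection", "mycollection"); // hypothetical collection
            params.set("shard", "shard1");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/admin/cores"); // CoreAdmin handler path
            System.out.println(node.request(req));
            node.shutdown();
        }
    }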
Re: charset encoding
No, Jetty - and yes, for Tomcat I've seen a couple of answers.

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

Using Tomcat by any chance? The ML archive has the solution. May be on the Wiki, too.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote:

I'm using Solr 4.3.1 with Tika to index HTML pages. The HTML files are ISO-8859-1 (ANSI) encoded, and the meta content-encoding tag says so as well. The server HTTP header says it's UTF-8, and Firefox WebDeveloper agrees. When I index a page with special chars like ä, ö, ü, Solr outputs completely foreign characters for them - not the usual wrong chars with ¼ or the flag in them - so it seems that it's not simply the normal UTF-8/ISO-8859-1 discrepancy. Has anyone got an idea what's wrong?
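A quick way to check what the bytes really are before Tika/Solr get involved is to decode the same file both ways and see which rendering shows ä, ö, ü correctly. A throwaway sketch; the file name is made up:

    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CharsetCheck {
        public static void main(String[] args) throws Exception {
            byte[] raw = Files.readAllBytes(Paths.get("page.html")); // hypothetical file
            // If the file really is ISO-8859-1, the first line prints the
            // umlauts correctly and the second shows mojibake - and vice versa.
            System.out.println(new String(raw, Charset.forName("ISO-8859-1")));
            System.out.println(new String(raw, Charset.forName("UTF-8")));
        }
    }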
Re: ReplicationFactor for solrcloud
Hi Aditya,

You need to start another 6 instances (9 instances in total) to achieve this. The first 3 instances, as you mention, are already assigned to the 3 shards. The next 3 will become their replicas, followed by the next 3 as the next set of replicas. You could create two copies each of the example folder and start each one on a different Jetty port. See:
http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster

Regards, Aloke

On 9/12/13, Aditya Sakhuja aditya.sakh...@gmail.com wrote:

Hi - I am trying to set up 3 shards and 3 replicas for my SolrCloud deployment with 3 servers, specifying replicationFactor=3 and numShards=3 when starting the first node. I see each of the servers allocated to 1 shard each; however, I do not see 3 replicas allocated on each node. I specifically need to have 3 replicas across 3 servers with 3 shards. Can anyone think of a reason not to have this configuration?

--
Regards, -Aditya Sakhuja
Re: number of replicas in Cloud
Can you specify what you mean by 'problem'? I don't think there should be any issues with that. I hope this is what you followed in your attempt so far:
http://wiki.apache.org/solr/SolrCloud#Example_B:_Simple_two_shard_cluster_with_shard_replicas

On Thu, Sep 12, 2013 at 11:31 AM, Prasi S prasi1...@gmail.com wrote:

Hi Anshum, I'm using Solr 4.4. Is there a problem with using a replicationFactor of 2?

[snip]

--
Anshum Gupta
http://www.anshumgupta.net
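For what it's worth, on 4.x the same layout can also be created in one call through the Collections API, instead of relying on node start-up order and the numShards system property. A hedged SolrJ sketch; the host and collection name are made up:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class CreateCollectionSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "CREATE");
            params.set("name", "mycollection"); // hypothetical
            params.set("numShards", 2);
            // replicationFactor counts all copies: 2 => a leader plus one
            // replica per shard, i.e. the layout Prasi describes.
            params.set("replicationFactor", 2);
            QueryRequest req = new QueryRequest(params);
            req.setPath("/admin/collections");
            System.out.println(server.request(req));
            server.shutdown();
        }
    }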
Re: DataImportHandler oddity
Followup: I just tried modifying the select with

    select CAST('APPLICATION' as varchar2(100)) as sourceid, ...

and that caused the sourceid field to be empty. CASTing to char(100) gave me the expected value ('APPLICATION', right-padded to 100 characters). Meanwhile, Google gave me this: http://bugs.caucho.com/view.php?id=4224 (via http://forum.caucho.com/showthread.php?t=27574).

On Thu, Sep 12, 2013 at 8:25 AM, Raymond Wiker rwi...@gmail.com wrote:

I'm trying to index a view in an Oracle database, and have come across some strange behaviour: all the VARCHAR2 fields are being returned as empty strings...

[snip]
Re: charset encoding
Could it have something to do with the meta encoding tag being ISO-8859-1 while the HTTP header says UTF-8, so Firefox interprets it as UTF-8?

On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote:

No, Jetty - and yes, for Tomcat I've seen a couple of answers.

[snip]
Re: DataImportHandler oddity
This is probably a bug with the Oracle thin JDBC driver. Google found a similar issue:
http://stackoverflow.com/questions/4168494/resultset-getstring-on-varchar2-column-returns-empty-string

I don't think this is specific to DataImportHandler.

On Thu, Sep 12, 2013 at 12:43 PM, Raymond Wiker rwi...@gmail.com wrote:

Followup: I just tried modifying the select with CAST('APPLICATION' as varchar2(100)) as sourceid, and that caused the sourceid field to be empty...

[snip]

--
Regards, Shalin Shekhar Mangar.
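One way to take DataImportHandler out of the equation is to run the same statement through plain JDBC with the same driver jar. A minimal sketch; the connection URL and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class OracleVarchar2Check {
        public static void main(String[] args) throws Exception {
            Class.forName("oracle.jdbc.OracleDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("select 'APPLICATION' as sourceid from dual");
            while (rs.next()) {
                // On an affected driver/database combination this prints []
                // instead of [APPLICATION].
                System.out.println("[" + rs.getString("sourceid") + "]");
            }
            conn.close();
        }
    }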
Re: Storing/indexing speed drops quickly
Maybe the fact that we are never, ever going to delete or update documents can be used for something. If we delete, we will delete entire collections.

Regards, Per Steffensen

On 9/12/13 8:25 AM, Per Steffensen wrote:

Hi,

SolrCloud 4.0: 6 machines, quad-core, 8GB RAM, 1T disk, one Solr node on each, one collection across the 6 nodes, 4 shards per node...

[snip]
Re: Storing/indexing speed drops quickly
It seems the attachments didn't make it through to this mailing list:

https://dl.dropboxusercontent.com/u/25718039/doccount.png
https://dl.dropboxusercontent.com/u/25718039/iowait.png

On 9/12/13 8:25 AM, Per Steffensen wrote:

Hi,

SolrCloud 4.0: 6 machines, quad-core, 8GB RAM, 1T disk, one Solr node on each, one collection across the 6 nodes, 4 shards per node...

[snip]
Re: Regarding improving performance of the solr
Hi,

I tried to reindex Solr, and I get a regular-expression problem. The steps I followed are:

I started Solr with java -jar start.jar, then deleted the existing index:

    http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
    http://localhost:8983/solr/update?stream.body=<commit/>

I stopped the Solr server, then changed the indexed and stored attributes to false for some of the fields in schema.xml:

    <fields>
      <field name="id" type="string" indexed="true" stored="true" required="true"/>
      <field name="title" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="revision" type="sint" indexed="false" stored="false"/>
      <field name="user" type="string" indexed="false" stored="false"/>
      <field name="userId" type="int" indexed="false" stored="false"/>
      <field name="text" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="pagerank" type="text_general" indexed="true" stored="false"/>
      <field name="anchor_text" type="text_general" indexed="true" stored="false" multiValued="true" compressed="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="freebase" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="timestamp" type="date" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="titleText" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
      <field name="category" type="string" indexed="true" stored="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <copyField source="title" dest="titleText"/>

My data-config.xml:

    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <document>
        <entity name="page" processor="XPathEntityProcessor" stream="true" forEach="/mediawiki/page/"
                url="/home/prabu/wikipedia_full_indexed_dump.xml"
                transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
          <field column="id" xpath="/mediawiki/page/id" stripHTML="true"/>
          <field column="title" xpath="/mediawiki/page/title" stripHTML="true"/>
          <field column="category" xpath="/mediawiki/page/category" stripHTML="true"/>
          <field column="revision" xpath="/mediawiki/page/revision/id" stripHTML="true"/>
          <field column="user" xpath="/mediawiki/page/revision/contributor/username" stripHTML="true"/>
          <field column="userId" xpath="/mediawiki/page/revision/contributor/id" stripHTML="true"/>
          <field column="text" xpath="/mediawiki/page/revision/text" stripHTML="true"/>
          <field column="freebase" xpath="/mediawiki/page/freebase" stripHTML="true"/>
          <field column="pagerank" xpath="/mediawiki/page/pagerank" stripHTML="true"/>
          <field column="anchor_text" xpath="/mediawiki/page/anchor_text/" stripHTML="true"/>
          <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
          <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
          <field column="category" regex="((\[\[.*Category:.*\]\]\W?)+)" sourceColName="text" stripHTML="true"/>
          <field column="$skipDoc" regex="^Template:.*" replaceWith="true" sourceColName="title"/>
        </entity>
      </document>
    </dataConfig>

I tried http://localhost:8983/solr/dataimport?command=full-import. At 50,000 documents, I get an error related to a regular expression:

    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)

I do not know how to proceed. Please help me out.

Thanks and Regards, Prabu

On Wed, Sep 11, 2013 at 11:31 AM, Erick Erickson erickerick...@gmail.com wrote:

Be a little careful when extrapolating from disk to memory. Any fields where you've set stored="true" will put data in segment files with the extensions .fdt and .fdx; these are the compressed verbatim copy of the data for stored fields and have very little impact on the memory required for searching. I've seen indexes where 75% of the data is stored and indexes where 5% of the data is stored. Summary of file extensions here:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html

Best, Erick

On Wed, Sep 11, 2013 at 2:57 AM, prabu palanisamy
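For what it's worth, the repeating Pattern$Loop/Pattern$GroupTail frames in that trace are the usual signature of catastrophic backtracking, and the nested quantifiers in ((\[\[.*Category:.*\]\]\W?)+) are a plausible trigger on long article text. A sketch for testing a stricter variant outside Solr before changing data-config.xml - the alternative pattern is an assumption to validate against your data, not a drop-in equivalent:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexBacktrackCheck {
        public static void main(String[] args) {
            // Original: nested greedy quantifiers can backtrack exponentially
            // on inputs that almost-but-not-quite match.
            Pattern risky = Pattern.compile("((\\[\\[.*Category:.*\\]\\]\\W?)+)");
            // Stricter: forbid ']' inside the brackets, match one link at a time.
            Pattern safer = Pattern.compile("\\[\\[[^\\]]*Category:[^\\]]*\\]\\]");

            String sample = "intro text [[Category:Example]] more text";
            System.out.println(risky.matcher(sample).find()); // true, but slow on bad inputs
            Matcher m = safer.matcher(sample);
            while (m.find()) {
                System.out.println(m.group()); // [[Category:Example]]
            }
        }
    }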
Not able to deploy SOLR after applying OpenNLP patch
Hi,

My question is related to OpenNLP integration with Solr. I have successfully applied the OpenNLP LUCENE-2899-x.patch to the latest Solr branch_4x checkout from here:
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x

I am also able to compile the source code, generate all related binaries, and create the war file. But I am facing issues while deploying Solr. Here is the error:

    Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text_opennlp: Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.OpenNLPTokenizerFactory'
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:467)
        ... 15 more
    Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.OpenNLPTokenizerFactory'
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
        at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
        ... 16 more
    Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.OpenNLPTokenizerFactory'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:449)
        at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:543)
        at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
        at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
        ... 20 more
    Caused by: java.lang.ClassNotFoundException: solr.OpenNLPTokenizerFactory
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:789)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:433)
        ... 24 more
    4446 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer - null:org.apache.solr.common.SolrException: Unable to create core: collection1
        at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:931)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:563)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:244)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:236)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

Please help me on this. Waiting for your reply. Thanks in advance.
SolrCloud behave differently on server and local
Hi all,

I am trying SolrCloud on my server. The server is a virtual machine. I have followed the SolrCloud wiki, http://wiki.apache.org/solr/SolrCloud. When I run SolrCloud it fails, but if I try it on my local machine, it runs successfully. Why does Solr behave differently on the server and locally? My solr.log is as follows:

    INFO  - 2013-09-12 14:50:13.389; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init() done
    ERROR - 2013-09-12 14:50:13.433; org.apache.solr.core.CoreContainer; CoreContainer was not shutdown prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!! instance=1423856966
    INFO  - 2013-09-12 14:50:13.483; org.eclipse.jetty.server.AbstractConnector; Started SocketConnector@0.0.0.0:8983
    INFO  - 2013-09-12 14:57:01.776; org.eclipse.jetty.server.Server; jetty-8.1.10.v20130312
    INFO  - 2013-09-12 14:57:01.838; org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor /opt/Applications/solr-4.4.0/example/contexts at interval 0
    INFO  - 2013-09-12 14:57:01.846; org.eclipse.jetty.deploy.DeploymentManager; Deployable added: /opt/Applications/solr-4.4.0/example/contexts/solr-jetty-context.xml
    INFO  - 2013-09-12 14:57:02.549; org.eclipse.jetty.webapp.StandardDescriptorProcessor; NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
    INFO  - 2013-09-12 14:57:02.656; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init()
    INFO  - 2013-09-12 14:57:02.797; org.apache.solr.core.SolrResourceLoader; JNDI not configured for solr (NoInitialContextEx)
    INFO  - 2013-09-12 14:57:02.799; org.apache.solr.core.SolrResourceLoader; solr home defaulted to 'solr/' (could not find system property or JNDI)
    INFO  - 2013-09-12 14:57:02.801; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: 'solr/'
    INFO  - 2013-09-12 14:57:02.917; org.apache.solr.core.ConfigSolr; Loading container configuration from /opt/Applications/solr-4.4.0/example/solr/solr.xml
    ERROR - 2013-09-12 14:57:03.072; org.apache.solr.servlet.SolrDispatchFilter; Could not start Solr. Check solr/home property and the logs
    ERROR - 2013-09-12 14:57:03.098; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: Could not load SOLR configuration
        at org.apache.solr.core.ConfigSolr.fromFile(ConfigSolr.java:65)
        at org.apache.solr.core.ConfigSolr.fromSolrHome(ConfigSolr.java:89)
        at org.apache.solr.core.CoreContainer.init(CoreContainer.java:139)
        at org.apache.solr.core.CoreContainer.init(CoreContainer.java:129)
        at org.apache.solr.servlet.SolrDispatchFilter.createCoreContainer(SolrDispatchFilter.java:139)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:122)
        at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:119)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:719)
        at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:265)
        at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1252)
        at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:710)
        at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:494)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:39)
        at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:186)
        at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:494)
        at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:141)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:145)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:56)
        at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:609)
        at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:540)
        at org.eclipse.jetty.util.Scanner.scan(Scanner.java:403)
        at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:337)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:121)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:555)
        at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:230)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:81)
        at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:58)
        at ...
Re: SolrCloud 4.x hangs under high update volume
Fewer client threads updating makes sense, and going to 1 core also seems like it might help. But it's all a crap-shoot unless the underlying cause gets fixed up. Both would improve things, but you'll still hit the problem sometime, probably when doing a demo for your boss ;).

Adrien has branched the code for Solr 4.5 in preparation for a release candidate tentatively scheduled for next week. You might just start working with that branch if you can, rather than apply individual patches... I suspect there'll be a couple more changes to this code (looks like Shikhar already raised an issue, for instance) before 4.5 is finally cut...

FWIW, Erick

On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt t...@elementspace.com wrote:

Thanks Erick! Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 patch...

[snip]
Re: No or limited use of FieldCache
Per:

One thing I'll be curious about. From my reading of DocValues, it uses little or no heap. But it _will_ use memory from the OS, if I followed Simon's slides correctly. So I wonder if you'll hit swapping issues... which are better than OOMs, certainly...

Thanks, Erick

On Thu, Sep 12, 2013 at 2:07 AM, Per Steffensen st...@designware.dk wrote:

Thanks, guys. Now I know a little more about DocValues and realize that they will do the job wrt FieldCache.

Regards, Per Steffensen

[snip]
Re: Help in resolving the below retrieval issue
Hi,

I am also seeing this issue when the search query is something like "how are you?" (quotes for clarity). The query parser splits it into the tokens below:

    +text:whats +text:your +text:raashee?

However, when I remove the ? from the search query ("how are you"), I get the results. Is ? a special character? Should it be escaped as well?

On Wed, Sep 11, 2013 at 1:50 AM, Jack Krupansky j...@basetechnology.com wrote:

Removing stray hyphens (embedded hyphens, like CD-ROM, are okay) or escaping them with a backslash looks like your best bet. There's no query parser option to disable the hyphen as an exclusion operator, although an upgrade to a modern Solr should fix the problem.

-- Jack Krupansky

-----Original Message-----
From: Prathik Puthran
Sent: Tuesday, September 10, 2013 4:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Help in resolving the below retrieval issue

I'm using Solr 3.4. This bug is causing the 2nd term, i.e. "kumar", to be treated as an exclusion operator? Is it possible to configure the query parser to not treat the '-' as an exclusion operator? If not, is the only way to remove the '-' from the query string?

Thanks, Prathik

On Tue, Sep 10, 2013 at 10:36 PM, Jack Krupansky j...@basetechnology.com wrote:

What release of Solr are you using? It appears that the hyphen is being treated as an exclusion operator even though it is followed by a space. Solr 4.4 doesn't appear to do that, but maybe earlier releases had a problem. In any case, be careful with a leading hyphen in queries, since it does mean "exclude documents that contain the following term". Or, just escape any leading hyphen with a backslash.

-- Jack Krupansky

-----Original Message-----
From: Prathik Puthran
Sent: Tuesday, September 10, 2013 11:47 AM
To: d...@lucene.apache.org ; solr-user@lucene.apache.org
Subject: Re: Help in resolving the below retrieval issue

Thanks Erick for the response. I tried to debug the query. Below is the response in the debug node:

    <str name="rawquerystring">Rahul - kumar</str>
    <str name="querystring">Rahul - kumar</str>
    <str name="parsedquery">+text:Rahul -text:kumar</str>
    <str name="parsedquery_toString">+text:Rahul -text:kumar</str>
    <lst name="explain"/>
    <str name="QParser">LuceneQParser</str>
    <arr name="filter_queries"><str>Rahul - kumar</str></arr>
    <arr name="parsed_filter_queries"><str>+text:rahul -text:kumar</str></arr>

Does it mean the query parser has parsed it into the tokens "Rahul -" and "kumar"? Even if this was the case, Solr should be able to retrieve the documents, because I have indexed all the documents based on n-grams as well.

Thanks, Prathik

On Tue, Sep 10, 2013 at 7:09 PM, Erick Erickson erickerick...@gmail.com wrote:

Try adding debug=query to the URL. What I think you'll find is that you're running into a common issue: the difference between query parsing and analysis. When you submit anything with whitespace in it, the query parser will break it up _before_ it gets to the analysis part; you should see something in the debug portion of the query like field:rahul field:kumar and possibly even field:-. These are searched as separate tokens. By specifying KeywordTokenizer, at index time you'll have exactly one token, "rahul-kumar", in the index, which will not match any of the separated tokens.

Try escaping the spaces with a backslash. You could also try quoting the input, although that has some phrase implications. Do you really want this search to fail on just searching "rahul", though? Perhaps KeywordTokenizer isn't best here; it depends upon your use case...

Best, Erick

On Tue, Sep 10, 2013 at 8:10 AM, Prathik Puthran prathik.puthra...@gmail.com wrote:

Hi,

I am facing the below issue where Solr is not retrieving the indexed word for some cases. This happens whenever the indexed word has the string " - " (quotes for clarity) as a substring, i.e. a word prefix followed by a space, followed by '-', again followed by a space, and then the rest of the word suffix. When I search with the search query being the exact string, Solr returns no results.

Example: Indexed word -- "Rahul - kumar" (quotes for clarity)

If I search with the search query as below, Solr gives no results:
Search query -- "Rahul - kumar" (quotes for clarity)

However, the below search query returns the results:
Search query -- "Rahul kumar"

Can you please let me know what I am doing wrong here and what I should do to ensure the first query, i.e. "Rahul - kumar", returns the documents indexed using it. Below are the analyzers I am using.

Index time analyzer components:

    1) <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^A-Za-z0-9 ])" replacement=""/>
    2) <tokenizer class="solr.KeywordTokenizerFactory"/>
    3) <filter class="solr.LowerCaseFilterFactory"/>
    4) <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
    5) <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="front"/>
    6)
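If user-entered text should always be treated literally, SolrJ's ClientUtils.escapeQueryChars handles the escaping; it backslash-escapes the query-syntax characters discussed in this thread, including -, ?, * and whitespace. A small sketch:

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeSketch {
        public static void main(String[] args) {
            // Prints: Rahul\ \-\ kumar
            System.out.println(ClientUtils.escapeQueryChars("Rahul - kumar"));
            // Prints: how\ are\ you\?
            System.out.println(ClientUtils.escapeQueryChars("how are you?"));
        }
    }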
Re: ReplicationFactor for solrcloud
You must specify maxShardsPerNode=3 for this to happen. By default maxShardsPerNode is 1, so only one shard is created per node.

On Thu, Sep 12, 2013 at 3:19 AM, Aditya Sakhuja aditya.sakh...@gmail.com wrote:

Hi - I am trying to set up 3 shards and 3 replicas for my SolrCloud deployment with 3 servers, specifying replicationFactor=3 and numShards=3 when starting the first node...

[snip]

--
Regards, Shalin Shekhar Mangar.
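As a sketch of the same point through the Collections API (hypothetical host and collection name):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class CreateWithMaxShardsPerNode {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "CREATE");
            params.set("name", "mycollection"); // hypothetical
            params.set("numShards", 3);
            params.set("replicationFactor", 3);
            // 3 shards x 3 copies = 9 cores over 3 nodes => 3 cores per node,
            // which the default maxShardsPerNode=1 would refuse to place.
            params.set("maxShardsPerNode", 3);
            QueryRequest req = new QueryRequest(params);
            req.setPath("/admin/collections");
            System.out.println(server.request(req));
            server.shutdown();
        }
    }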
Re: DataImportHandler oddity
That sounds reasonable. I've done some more digging, and found that the database instance in this case is an _OLD_ version of Oracle: 9.2.0.8.0. I also tried using the OCI driver (version 12), which refuses to even talk to this database. I have three other databases running on more recent versions of Oracle, and all three have worked fine with DataImportHandler.

On Thu, Sep 12, 2013 at 9:48 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

This is probably a bug with the Oracle thin JDBC driver...

[snip]
Re: DataImportHandler oddity
Thanks. It'd be great if you could update this thread if you ever find a workaround. We will document it on the DataImportHandlerFaq wiki page:
http://wiki.apache.org/solr/DataImportHandlerFaq

On Thu, Sep 12, 2013 at 4:56 PM, Raymond Wiker rwi...@gmail.com wrote:

That sounds reasonable. I've done some more digging, and found that the database instance in this case is an _OLD_ version of Oracle: 9.2.0.8.0...

[snip]

--
Regards, Shalin Shekhar Mangar.
Re: No or limited use of FieldCache
Yes, thanks. Actually, some months back I made a PoC of a FieldCache that could expand beyond the heap. Basically, imagine a FieldCache with room for unlimited data arrays that, behind the scenes, goes to memory-mapped files when there is no more room on the heap. I never finished it, and it might be kind of pointless, because you would just be reading the data from the Lucene indices and writing it to memory-mapped files in order to use it - it is better to just use the data in the Lucene indices directly. But it had some nice features. That solution would also have the running-out-of-swap-space problem, though.

Regards, Per Steffensen

On 9/12/13 12:48 PM, Erick Erickson wrote:

Per: One thing I'll be curious about. From my reading of DocValues, it uses little or no heap. But it _will_ use memory from the OS if I followed Simon's slides correctly. So I wonder if you'll hit swapping issues... which are better than OOMs, certainly...

Thanks, Erick
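For comparison, Lucene can already serve index data straight from memory-mapped files via MMapDirectory: the OS pages index files in and out on demand, and nothing beyond small bookkeeping structures lives on the Java heap. A minimal sketch against the Lucene 4.x API; the index path is made up:

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.MMapDirectory;

    public class MMapReaderSketch {
        public static void main(String[] args) throws Exception {
            MMapDirectory dir = new MMapDirectory(new File("/path/to/index")); // hypothetical
            DirectoryReader reader = DirectoryReader.open(dir);
            System.out.println("maxDoc=" + reader.maxDoc());
            reader.close();
            dir.close();
        }
    }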
Re: Help in resolving the below retrieval issue
Question mark and asterisk are wildcard characters, so if you want them to be treated as punctuation, either enclose the terms in quotes or escape the characters. Wildcard characters suppress the execution of some token filters if they are not able to cope with wildcards. -- Jack Krupansky -Original Message- From: Prathik Puthran Sent: Thursday, September 12, 2013 7:01 AM To: solr-user@lucene.apache.org Subject: Re: Help in resolving the below retrieval issue Hi, I am also seeing this issue when the search query is something like how are you? (Quotes for clarity). The query parser splits it to the below tokens: +text:whats +text:your +text:raashee? However when I remove the ? from the search query how are you I get the results. Is ? a special character? Should it be escaped as well? On Wed, Sep 11, 2013 at 1:50 AM, Jack Krupansky j...@basetechnology.comwrote: Removing stray hyphens (embedded hyphens, like CD-ROM, are okay) or escaping them with backslash looks like your best bests. There's no query parser option to disable the hyphen as an exlusion operator, although an upgrade to a modern Solr should fix the problem. -- Jack Krupansky -Original Message- From: Prathik Puthran Sent: Tuesday, September 10, 2013 4:13 PM To: solr-user@lucene.apache.org Subject: Re: Help in resolving the below retrieval issue I'm using Solr 3.4. This bug is causing the 2nd term i.e. kumar to be treated as an exclusion operator? Is it possible to configure the query parser to not treat the '-' as exclusion operator ? If not the only way is to remove the '-' from the query string? Thanks, Prathik On Tue, Sep 10, 2013 at 10:36 PM, Jack Krupansky j...@basetechnology.com **wrote: What release of Solr are you using? It appears that the hyphen is being treated as an exclusion operator even though it is followed by a space. Solr 4.4 doesn't appear to do that, but maybe earlier releases had a problem. In any case, be careful with leading hyphen in queries since it does mean exclude documents that contain the following term. Or, just escape any leading hyphen with a backslash. -- Jack Krupansky -Original Message- From: Prathik Puthran Sent: Tuesday, September 10, 2013 11:47 AM To: d...@lucene.apache.org ; solr-user@lucene.apache.org Subject: Re: Help in resolving the below retrieval issue Thanks Erick for the response. I tried to debug the query. Below is the response in the debug node str name=rawquerystringRahul - kumar/strstr name=querystringRahul - kumar/strstr name=parsedquery+text:Rahul -text:kumar/strstr name=parsedquery_toString+text:Rahul -text:kumar/strlst name=explain/str name=QParserLuceneQParser/strarr name=filter_queriesstrRahul - kumar/str/arrarr name=parsed_filter_queriesstr+text:rahul -text:kumar/str/arr Does it mean the query parser has parsed it to tokens Rahul - and kumar? Even if this was the case solr should be able to retrieve the documents because I have indexed all the documents based on n-grams as well. Thanks, Prathik On Tue, Sep 10, 2013 at 7:09 PM, Erick Erickson erickerick...@gmail.com * *wrote: Try adding debug=query to the url. What I think you'll find is that you're running into a common issue, the difference between query parsing and analysis. when you submit anything with whitespace in it, the query parser will break it up _before_ it gets to the analysis part, you should see something in the debug portion of the query like field:rahul field:kumar and possibly even field:- These are searched as separate tokens. 
By specifying KeywordTokenizer, at index time you'll have exactly one token, rahul-kumar, in the index, which will not match any of the separated tokens. Try escaping the spaces with a backslash. You could also try quoting the input, although that has some phrase implications. Do you really want this search to fail on just searching rahul, though? Perhaps KeywordTokenizer isn't best here; it depends upon your use-case... Best, Erick On Tue, Sep 10, 2013 at 8:10 AM, Prathik Puthran prathik.puthra...@gmail.com wrote: Hi, I am facing the below issue wherein Solr is not retrieving the indexed word in some cases. This happens whenever the indexed word has the string " - " as a substring, i.e. a word prefix followed by a space, which is followed by '-', again followed by a space, followed by the rest of the word. When I search with the search query being the exact string, Solr returns no results. Example: Indexed word -- Rahul - kumar (quotes for clarity). If I search with the search query below, Solr gives no results: Search query -- Rahul - kumar (quotes for clarity). However, the search query below returns the results: Search query -- Rahul kumar. Can you please let me know what I am doing wrong here, and what I should do to ensure the first query, i.e. Rahul - kumar, returns the documents indexed with it? Below are the analyzers I am using: Index time analyzer components: 1)
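Tying the advice above together: the safest client-side fix is to escape every Lucene query operator (the hyphen, ?, *, and even whitespace) before building the query. A small SolrJ sketch, assuming the field name text from the debug output above:

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeQuery {
        public static void main(String[] args) {
            String raw = "Rahul - kumar";
            // escapeQueryChars backslash-escapes Lucene operators and whitespace,
            // so '-' and '?' reach the analyzer as literal characters instead of
            // being interpreted by the query parser.
            String escaped = ClientUtils.escapeQueryChars(raw);
            System.out.println("text:" + escaped); // text:Rahul\ \-\ kumar
        }
    }

With a KeywordTokenizer field, the escaped string survives query parsing as a single token and can match the single indexed token.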
Re: No or limited use of FieldCache
On Thu, 2013-09-12 at 14:48 +0200, Per Steffensen wrote: Actually, some months back I made a PoC of a FieldCache that could expand beyond the heap. Basically, imagine a FieldCache with room for unlimited data-arrays that, behind the scenes, spills to memory-mapped files when there is no more room on the heap. That sounds a lot like disk-based DocValues. [...] That solution will also have the running-out-of-swap-space problem, though. Not really. Memory mapping works like the disk cache: there is no requirement that a certain amount of physical memory be available; it just takes what it can get. If there is not a lot of physical memory, it will require a lot of storage access, but it will not over-allocate swap space. It seems that different setups vary quite a lot in this area, and some systems are prone to aggressive use of the swap file, which can severely harm the responsiveness of applications whose data has been swapped out. However, this should still not result in any OOMs, as the system can always discard some of the memory-mapped data if it needs more physical memory. - Toke Eskildsen, State and University Library, Denmark
Re: No or limited use of FieldCache
On 9/12/13 3:28 PM, Toke Eskildsen wrote: On Thu, 2013-09-12 at 14:48 +0200, Per Steffensen wrote: Actually, some months back I made a PoC of a FieldCache that could expand beyond the heap. Basically, imagine a FieldCache with room for unlimited data-arrays that, behind the scenes, spills to memory-mapped files when there is no more room on the heap. That sounds a lot like disk-based DocValues. He he. That solution will also have the running-out-of-swap-space problem, though. Not really. Memory mapping works like the disk cache: there is no requirement that a certain amount of physical memory be available; it just takes what it can get. If there is not a lot of physical memory, it will require a lot of storage access, but it will not over-allocate swap space. That was also my impression, but during the work I experienced some problems around swap space. I do not remember exactly what I saw, or exactly how I concluded that everything in memory-mapped files actually has to fit in physical memory + swap. I might very well have been wrong in that conclusion. It seems that different setups vary quite a lot in this area, and some systems are prone to aggressive use of the swap file, which can severely harm the responsiveness of applications whose data has been swapped out. However, this should still not result in any OOMs, as the system can always discard some of the memory-mapped data if it needs more physical memory. I saw no OOMs. - Toke Eskildsen, State and University Library, Denmark
Facet counting empty as well.. how to prevent this?
Hi, I have a small issue here: my facet settings are returning counts for the empty string, i.e. when the actual field was empty. Here are the facet settings:
<str name="facet.sort">count</str>
<str name="facet.limit">6</str>
<str name="facet.mincount">1</str>
<str name="facet.missing">false</str>
and this is the part of the result I don't want:
<int name="">4</int>
(that is coming because the query results had 4 rows with no value in the field whose facet counts are being computed). Everything else is working just fine. -- Regards, Raheel Hasan
Re: SolrCloud behave differently on server and local
My problem is solved. My server's default Java version was 1.5; I upgraded the Java version. 2013/9/12 cihat güzel c.guzel@gmail.com Hi all. I am trying SolrCloud on my server. The server is a virtual machine. I have followed the SolrCloud wiki http://wiki.apache.org/solr/SolrCloud . When I run SolrCloud, it fails. But if I try on my local machine, it runs successfully. Why does Solr behave differently on the server and locally? My solr.log is as follows:
INFO  - 2013-09-12 14:50:13.389; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init() done
ERROR - 2013-09-12 14:50:13.433; org.apache.solr.core.CoreContainer; CoreContainer was not shutdown prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!! instance=1423856966
INFO  - 2013-09-12 14:50:13.483; org.eclipse.jetty.server.AbstractConnector; Started SocketConnector@0.0.0.0:8983
INFO  - 2013-09-12 14:57:01.776; org.eclipse.jetty.server.Server; jetty-8.1.10.v20130312
INFO  - 2013-09-12 14:57:01.838; org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor /opt/Applications/solr-4.4.0/example/contexts at interval 0
INFO  - 2013-09-12 14:57:01.846; org.eclipse.jetty.deploy.DeploymentManager; Deployable added: /opt/Applications/solr-4.4.0/example/contexts/solr-jetty-context.xml
INFO  - 2013-09-12 14:57:02.549; org.eclipse.jetty.webapp.StandardDescriptorProcessor; NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
INFO  - 2013-09-12 14:57:02.656; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init()
INFO  - 2013-09-12 14:57:02.797; org.apache.solr.core.SolrResourceLoader; JNDI not configured for solr (NoInitialContextEx)
INFO  - 2013-09-12 14:57:02.799; org.apache.solr.core.SolrResourceLoader; solr home defaulted to 'solr/' (could not find system property or JNDI)
INFO  - 2013-09-12 14:57:02.801; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: 'solr/'
INFO  - 2013-09-12 14:57:02.917; org.apache.solr.core.ConfigSolr; Loading container configuration from /opt/Applications/solr-4.4.0/example/solr/solr.xml
ERROR - 2013-09-12 14:57:03.072; org.apache.solr.servlet.SolrDispatchFilter; Could not start Solr.
Check solr/home property and the logs
ERROR - 2013-09-12 14:57:03.098; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: Could not load SOLR configuration
    at org.apache.solr.core.ConfigSolr.fromFile(ConfigSolr.java:65)
    at org.apache.solr.core.ConfigSolr.fromSolrHome(ConfigSolr.java:89)
    at org.apache.solr.core.CoreContainer.init(CoreContainer.java:139)
    at org.apache.solr.core.CoreContainer.init(CoreContainer.java:129)
    at org.apache.solr.servlet.SolrDispatchFilter.createCoreContainer(SolrDispatchFilter.java:139)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:122)
    at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:119)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:719)
    at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:265)
    at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1252)
    at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:710)
    at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:494)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:39)
    at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:186)
    at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:494)
    at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:141)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:145)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:56)
    at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:609)
    at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:540)
    at org.eclipse.jetty.util.Scanner.scan(Scanner.java:403)
    at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:337)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:121)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:555)
    at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:230)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at
Re: Storing/indexing speed drops quickly
On 9/12/2013 2:14 AM, Per Steffensen wrote: Starting from an empty collection, things are fine wrt storing/indexing speed for the first two-three hours (100M docs per hour); then speed goes down dramatically, to a level that is, for us, unacceptable (max 10M per hour). At the same time as speed goes down, we see that I/O wait increases dramatically. I am not 100% sure, but a quick investigation has shown that this is due to almost constant merging. While constant merging is contributing to the slowdown, I would guess that your index is simply too big for the amount of RAM that you have. Let's ignore for a minute that you're distributed and just concentrate on one machine. After three hours of indexing, you have nearly 300 million documents. If you have a replicationFactor of 1, that's still 50 million documents per machine. If your replicationFactor is 2, you've got 100 million documents per machine. Let's focus on the smaller number for a minute. 50 million documents in an index, even if they are small documents, is probably going to result in an index size of at least 20GB, and quite possibly larger. In order to make Solr function with that many documents, I would guess that you have a heap that's at least 4GB in size. With only 8GB on the machine, this doesn't leave much RAM for the OS disk cache. If we assume that you have 4GB left for caching, then I would expect to see problems about the time your per-machine indexes hit 15GB in size. If you are making it beyond that with a total of 300 million documents, then I am impressed. Two things are going to happen when you have enough documents: 1) You are going to fill up your Java heap, and Java will need to do frequent collections to free up enough RAM for normal operation. When this problem gets bad enough, the frequent collections will be *full* GCs, which are REALLY slow. 2) The index will be so big that the OS disk cache cannot effectively cache it. I suspect that the latter is more of the problem, but both might be happening at nearly the same time. When dealing with an index of this size, you want as much RAM as you can possibly afford. I don't think I would try what you are doing without at least 64GB per machine, and I would probably use at least an 8GB heap on each one, quite possibly larger. With a heap that large, extreme GC tuning becomes a necessity. To cut down on the amount of merging, I go with a fairly large mergeFactor, but mergeFactor is basically deprecated in favor of TieredMergePolicy; there's a new way to configure it now. Here are the indexConfig settings that I use on my dev server:
<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxThreadCount">1</int>
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
  <ramBufferSizeMB>48</ramBufferSizeMB>
  <infoStream file="INFOSTREAM-${solr.core.name}.txt">false</infoStream>
</indexConfig>
Thanks, Shawn
RE: Solr cloud shard goes down after SocketException in another shard
Neoman, Make sure that solr08-prod (or the elected leader at any time) isn't doing a stop-the-world garbage collection that takes long enough that the ZooKeeper connection times out. I've seen that in my cluster when I didn't have parallel GC enabled and my zkClientTimeout in solr.xml was too low. Thanks, Greg -Original Message- From: neoman [mailto:harira...@gmail.com] Sent: Thursday, September 12, 2013 9:19 AM To: solr-user@lucene.apache.org Subject: Solr cloud shard goes down after SocketException in another shard Exception in shard1 (solr01-prod) primary:
09/12/13 13:56:46:635|http-bio-8080-exec-66|ERROR|apache.solr.servlet.SolrDispatchFilter|null:ClientAbortException: java.net.SocketException: Broken pipe
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:406)
    at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:342)
    at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:431)
    at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:419)
    at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:91)
    at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:214)
    at org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:95)
    at org.apache.solr.common.util.JavaBinCodec.writeStr(JavaBinCodec.java:470)
    at org.apache.solr.common.util.JavaBinCodec.writePrimitive(JavaBinCodec.java:545)
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:232)
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:149)
    at org.apache.solr.common.util.JavaBinCodec.writeSolrDocument(JavaBinCodec.java:320)
    at org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:257)
    at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:149)
    at org.apache.solr.common.util.JavaBinCodec.writeArray(JavaBinCodec.java:427)
    at org.apache.solr.common.util.JavaBinCodec.writeSolrDocumentList(JavaBinCodec.java:356)
Exception in shard1 (solr08-prod) secondary:
09/12/13 13:56:46:729|http-bio-8080-exec-50|ERROR|apache.solr.core.SolrCore|org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr08-prod:8080/solr/aq-core), but locally we don't think so. Request came from http://solr03-prod.phneaz:8080/solr/aq-core/
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
Our configuration: Solr 4.4, Tomcat 7, 3 shards. Thanks for your help -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-cloud-shard-goes-down-after-SocketException-in-another-shard-tp4089576.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr cloud shard goes down after SocketException in another shard
Thanks, Greg. Currently we have 60 seconds (we reduced it recently). I may have to reduce it again. Can you please share your timeout value? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-cloud-shard-goes-down-after-SocketException-in-another-shard-tp4089576p4089582.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr cloud shard goes down after SocketException in another shard
Neoman, I've got ours set at 45 seconds:
<int name="zkClientTimeout">${zkClientTimeout:45000}</int>
-Original Message- From: neoman [mailto:harira...@gmail.com] Sent: Thursday, September 12, 2013 9:33 AM To: solr-user@lucene.apache.org Subject: Re: Solr cloud shard goes down after SocketException in another shard Thanks, Greg. Currently we have 60 seconds (we reduced it recently). I may have to reduce it again. Can you please share your timeout value? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-cloud-shard-goes-down-after-SocketException-in-another-shard-tp4089576p4089582.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Grouping by field substring?
Hi Jack, On Sep 11, 2013, at 5:34pm, Jack Krupansky wrote: Do a copyField to another field, with a limit of 8 characters, and then use that other field. Thanks - I should have included a few more details in my original question. The issue is that I've got an index with 200M records, of which about 50M have a unique value for this prefix (which is 32 characters long). So adding another indexed field would be significant, which is why I was hoping there was a way to do it via grouping/collapsing at query time. Or is that just not possible? Thanks, -- Ken -Original Message- From: Ken Krugler Sent: Wednesday, September 11, 2013 8:24 PM To: solr-user@lucene.apache.org Subject: Grouping by field substring? Hi all, Assuming I want to use the first N characters of a specific field for grouping results, is such a thing possible out-of-the-box? If not, then what would the next best option be? E.g. a custom function query? Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
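For reference, Jack's copyField suggestion would look something like the sketch below in schema.xml. The field names are hypothetical; maxChars is what truncates the copied value to the first 8 characters (Ken's case would use 32):

    <field name="url_prefix" type="string" indexed="true" stored="false"/>
    <copyField source="original_url" dest="url_prefix" maxChars="8"/>

A query could then group on the truncated field with group=true&group.field=url_prefix. As Ken notes, this trades query-time flexibility for extra index size, which is the concern with 200M records.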
Re: Facet counting empty as well.. how to prevent this?
On 9/12/2013 7:54 AM, Raheel Hasan wrote: I have a small issue here: my facet settings are returning counts for the empty string, i.e. when the actual field was empty. Here are the facet settings:
<str name="facet.sort">count</str>
<str name="facet.limit">6</str>
<str name="facet.mincount">1</str>
<str name="facet.missing">false</str>
and this is the part of the result I don't want:
<int name="">4</int>
The facet.missing parameter has to do with whether or not to display counts for documents that have no value at all for that field. Even though it might seem wrong, the empty string is a valid value, so you can't fix this with faceting parameters. If you don't want that to be in your index, then you can add the LengthFilterFactory to your analyzer to remove terms with a length of less than 1. You might also check whether the field definition in your schema has a default value set to the empty string. If you are using DocValues (Solr 4.2 and later), then the indexed terms aren't used for facets, and it won't matter what you do to your analysis chain. With DocValues, Solr basically uses a value equivalent to the stored value. To get rid of the empty string with DocValues, you'll need to either change your indexing process so it doesn't send empty strings, or use a custom UpdateProcessor to change the data before it gets indexed. Thanks, Shawn
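A minimal sketch of the LengthFilterFactory approach Shawn mentions, assuming a hypothetical field type (the min/max values are illustrative; min=1 is what drops zero-length terms):

    <fieldType name="string_nonempty" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- drop zero-length tokens so "" never becomes an indexed term -->
        <filter class="solr.LengthFilterFactory" min="1" max="256"/>
      </analyzer>
    </fieldType>

As Shawn says, this only helps when facets come from indexed terms; with DocValues the empty string has to be removed before indexing.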
Re: Facet counting empty as well.. how to prevent this?
ok, so I got the idea... I will pull 7 facet values instead and remove the empty one... But there must be some setting that can be done in the facet configuration to ignore a certain value if we want to. On Thu, Sep 12, 2013 at 7:44 PM, Shawn Heisey s...@elyograg.org wrote: On 9/12/2013 7:54 AM, Raheel Hasan wrote: I have a small issue here: my facet settings are returning counts for the empty string, i.e. when the actual field was empty. Here are the facet settings:
<str name="facet.sort">count</str>
<str name="facet.limit">6</str>
<str name="facet.mincount">1</str>
<str name="facet.missing">false</str>
and this is the part of the result I don't want:
<int name="">4</int>
The facet.missing parameter has to do with whether or not to display counts for documents that have no value at all for that field. Even though it might seem wrong, the empty string is a valid value, so you can't fix this with faceting parameters. If you don't want that to be in your index, then you can add the LengthFilterFactory to your analyzer to remove terms with a length of less than 1. You might also check whether the field definition in your schema has a default value set to the empty string. If you are using DocValues (Solr 4.2 and later), then the indexed terms aren't used for facets, and it won't matter what you do to your analysis chain. With DocValues, Solr basically uses a value equivalent to the stored value. To get rid of the empty string with DocValues, you'll need to either change your indexing process so it doesn't send empty strings, or use a custom UpdateProcessor to change the data before it gets indexed. Thanks, Shawn -- Regards, Raheel Hasan
Re: Get the commit time of a document in Solr
Slow down, back up, and now tell us what problem (if any!) you are really trying to solve. Don't leap to a proposed solution before you clearly state the problem to be solved. First, why do you think there is any problem at all? Or, what are you really trying to achieve? -- Jack Krupansky -Original Message- From: phanichaitanya Sent: Thursday, September 12, 2013 1:04 PM To: solr-user@lucene.apache.org Subject: Re: Get the commit time of a document in Solr So, now I want to know when that document becomes searchable, or when it is committed. I have the following scenario: 1) Indexing starts at, say, 9:00 AM - with the above additions to the schema.xml I'll know the indexed time of each document I send to Solr via the update handler. Say 9:01, 9:02 and so on... Let's say I send a document every second between 9:00 and 9:30 AM, which makes 30*60 = 1800 docs. 2) Now at 9:30 AM I issue a hard commit, and now I'll be able to search these 1800 documents, which is fine. 3) Now I want to know that I can search these 1800 documents only at >= 9:30 AM but not < 9:30 AM, as I did not do a hard commit before 9:30 AM. In order to know that, is there a way in Solr, rather than having some application keep track of the documents it sends to Solr between any two commits? The reason I'm asking is, if there are, say, two parallel processes indexing to the same index and one process issues a commit, then whatever documents process two had indexed up to that point in time would also be committed, right? Now if I keep track of commit times in each process, it doesn't reflect the true commit times, as they are intertwined. - Phani Chaitanya -- View this message in context: http://lucene.472066.n3.nabble.com/Get-the-commit-time-of-a-document-in-Solr-tp4089624p4089638.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: charset encoding
It was the HTTP header; as soon as I force an iso-8859-1 header it works. On 12. Sep 2013, at 9:44 AM, Andreas Owen wrote: could it have something to do with the meta encoding tag being iso-8859-1 while the HTTP header says utf-8, so Firefox interprets it as utf-8? On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote: no, Jetty; and yes, for Tomcat I've seen a couple of answers. On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote: Using Tomcat by any chance? The ML archive has the solution. May be on the Wiki, too. Otis Solr ElasticSearch Support http://sematext.com/ On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote: I'm using Solr 4.3.1 with Tika to index HTML pages. The HTML files are iso-8859-1 (ANSI) encoded, and the meta tag content-encoding says so as well. The server HTTP header says it's utf-8, and Firefox WebDeveloper agrees. When I index a page with special chars like ä, ö, ü, Solr outputs completely foreign signs, not the normal wrong chars with 1/4 or the flag in it. So it seems that it's not simply the normal utf-8/iso-8859-1 discrepancy. Has anyone got an idea what's wrong?
Re: Get the commit time of a document in Solr
Hi Jack, Sorry, I was not clear earlier. What I'm trying to achieve is: I want to know when a document is committed (hard commit). There can be a lot of time lapse (1 hour or more) between the time you indexed a document and the time you issue a commit, in my case. Now, I want to know exactly when a document is committed. In my previous example all 1800 docs are committed at 9:30 AM, and I want to know that time for those 1800 docs. In another batch it'll be some other time. The use-case is that I have more than one process sending update requests to Solr, each of those processes has a separate commit step, and I want to know the commit time of the documents that were committed when I gave a commit request. I hope I'm clear now - please let me know if I'm not. - Phani Chaitanya -- View this message in context: http://lucene.472066.n3.nabble.com/Get-the-commit-time-of-a-document-in-Solr-tp4089624p4089662.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud 4.x hangs under high update volume
Lol, at breaking during a demo - always the way it is! :) I agree, we are just tip-toeing around the issue, but waiting for 4.5 is definitely an option if we get by for now in testing; patched Solr versions seem to make people uneasy sometimes :). Seeing there seems to be some danger to SOLR-5216 (in some ways it blows up worse due to fewer limitations on threads), I'm guessing only SOLR-5232 and SOLR-4816 are making it into 4.5? I feel those 2 in combination will make a world of difference! Thanks so much again guys! Tim On 12 September 2013 03:43, Erick Erickson erickerick...@gmail.com wrote: Fewer client threads updating makes sense, and going to 1 core also seems like it might help. But it's all a crap-shoot unless the underlying cause gets fixed up. Both would improve things, but you'll still hit the problem sometime, probably when doing a demo for your boss ;). Adrien has branched the code for SOLR 4.5 in preparation for a release candidate tentatively scheduled for next week. You might just start working with that branch if you can rather than apply individual patches... I suspect there'll be a couple more changes to this code (looks like Shikhar already raised an issue, for instance) before 4.5 is finally cut... FWIW, Erick On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt t...@elementspace.com wrote: Thanks Erick! Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 patch. I think that is a very, very useful patch, by the way. SOLR-5232 seems promising as well. I see your point on the more-shards idea; this is obviously a global/instance-level lock. If I really had to, I suppose I could run more Solr instances to reduce locking then? Currently I have 2 cores per instance and I could go 1-to-1 to simplify things. The good news is we seem to be more stable since changing to a bigger client-solr batch size and fewer client threads updating. Cheers, Tim On 11/09/13 04:19 AM, Erick Erickson wrote: If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent copy of the 4x branch. By recent, I mean like today; it looks like Mark applied this early this morning. But several reports indicate that this will solve your problem. I would expect that increasing the number of shards would make the problem worse, not better. There's also SOLR-5232... Best Erick On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey guys, Based on my understanding of the problem we are encountering, I feel we've been able to reduce the likelihood of this issue by making the following changes to our app's usage of SolrCloud: 1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is increasing the batch size reduces the likelihood of this issue happening. 2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and have previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud. 3) Fewer threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing app (from 10 threads). To be clear, the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand. If we happen to encounter issues with the above 3 changes, the next steps (I could use some advice on) are: 1) Increase the number of shards (2x) - the theory here is this reduces the locking on shards because there are more shards. Am I onto something here, or will this not help at all? 2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go direct to what we need to update, this will reduce concurrency in SolrCloud a bit. Thoughts? Thanks all! Cheers, Tim On 6 September 2013 14:47, Tim Vaillancourt t...@elementspace.com wrote: Enjoy your trip, Mark! Thanks again for the help! Tim On 6 September 2013 14:18, Mark Miller markrmil...@gmail.com wrote: Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening. Mark Sent from my iPhone On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey Mark, The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less. During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for
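For readers following along, a minimal SolrJ sketch of the CloudSolrServer approach Tim is considering (the ZooKeeper addresses, collection name, and document are hypothetical; in SolrJ 4.x the class is CloudSolrServer, later renamed CloudSolrClient):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudUpdate {
        public static void main(String[] args) throws Exception {
            // ZooKeeper-aware client: routes each document to its shard leader
            // directly, instead of bouncing through a load-balancer VIP.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }

Going through the leader directly is what reduces the cross-node forwarding (and thus the lock contention) discussed in this thread.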
Re: SolrCloud 4.x hangs under high update volume
Right, I don't see SOLR-5232 making 4.5, unfortunately. It could perhaps make a 4.5.1 - it does resolve a critical issue - but 4.5 is in motion and SOLR-5232 is not quite ready - we need some testing. - Mark On Sep 12, 2013, at 2:12 PM, Erick Erickson erickerick...@gmail.com wrote: My take on it is this, assuming I'm reading this right: 1 SOLR-5216 - probably not going anywhere, 5232 will take care of it. 2 SOLR-5232 - expected to fix the underlying issue no matter whether you're using CloudSolrServer from SolrJ or sending lots of updates from lots of clients. 3 SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the meantime. I don't quite know whether SOLR-5232 will make it into 4.5 or not; it hasn't been committed anywhere yet. The Solr 4.5 release is imminent, RC0 is looking like it'll be ready to cut next week, so it might not be included. Best, Erick On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt t...@elementspace.com wrote: Lol, at breaking during a demo - always the way it is! :) I agree, we are just tip-toeing around the issue, but waiting for 4.5 is definitely an option if we get by for now in testing; patched Solr versions seem to make people uneasy sometimes :). Seeing there seems to be some danger to SOLR-5216 (in some ways it blows up worse due to fewer limitations on threads), I'm guessing only SOLR-5232 and SOLR-4816 are making it into 4.5? I feel those 2 in combination will make a world of difference! Thanks so much again guys! Tim On 12 September 2013 03:43, Erick Erickson erickerick...@gmail.com wrote: Fewer client threads updating makes sense, and going to 1 core also seems like it might help. But it's all a crap-shoot unless the underlying cause gets fixed up. Both would improve things, but you'll still hit the problem sometime, probably when doing a demo for your boss ;). Adrien has branched the code for SOLR 4.5 in preparation for a release candidate tentatively scheduled for next week. You might just start working with that branch if you can rather than apply individual patches... I suspect there'll be a couple more changes to this code (looks like Shikhar already raised an issue, for instance) before 4.5 is finally cut... FWIW, Erick On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt t...@elementspace.com wrote: Thanks Erick! Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 patch. I think that is a very, very useful patch, by the way. SOLR-5232 seems promising as well. I see your point on the more-shards idea; this is obviously a global/instance-level lock. If I really had to, I suppose I could run more Solr instances to reduce locking then? Currently I have 2 cores per instance and I could go 1-to-1 to simplify things. The good news is we seem to be more stable since changing to a bigger client-solr batch size and fewer client threads updating. Cheers, Tim On 11/09/13 04:19 AM, Erick Erickson wrote: If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent copy of the 4x branch. By recent, I mean like today; it looks like Mark applied this early this morning. But several reports indicate that this will solve your problem. I would expect that increasing the number of shards would make the problem worse, not better. There's also SOLR-5232... Best Erick On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey guys, Based on my understanding of the problem we are encountering, I feel we've been able to reduce the likelihood of this issue by making the following changes to our app's usage of SolrCloud: 1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is increasing the batch size reduces the likelihood of this issue happening. 2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and have previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud. 3) Fewer threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing app (from 10 threads). To be clear, the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand. If we happen to encounter issues with the above 3 changes, the next steps (I could use some advice on) are: 1) Increase the number of shards (2x) - the theory here is this reduces the locking on shards because there are more shards. Am I onto something here, or will this not help at all? 2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go direct to what we need to update, this will reduce concurrency in
Re: Get the commit time of a document in Solr
Sorry, but all you've done is reshuffle your previous statements, without telling us about the actual problem that you are trying to solve! Repeating myself: you, the application developer, can send a hard commit any time you want to assure that documents are searchable. Maybe not every millisecond, but, say, once a second with a soft commit and once a minute for a hard commit, using commitWithin to minimize commits when multiple processes are indexing data. AFAICT, no application should ever have to care when a document is actually committed - and you have control with commit, anyway. You, the application developer, can tune the commit interval to balance searchability and overall efficiency. There shouldn't be any problem there, given the variety of commit methods that Solr supports, but you have to make the choices. So, what's the problem you are trying to solve? You still haven't articulated it. It sounds as if you are trying to solve a non-problem. But we can't be sure, since you haven't articulated what the actual problem (if any) really is. -- Jack Krupansky -Original Message- From: phanichaitanya Sent: Thursday, September 12, 2013 1:42 PM To: solr-user@lucene.apache.org Subject: Re: Get the commit time of a document in Solr Hi Jack, Sorry, I was not clear earlier. What I'm trying to achieve is: I want to know when a document is committed (hard commit). There can be a lot of time lapse (1 hour or more) between the time you indexed a document and the time you issue a commit, in my case. Now, I want to know exactly when a document is committed. In my previous example all 1800 docs are committed at 9:30 AM, and I want to know that time for those 1800 docs. In another batch it'll be some other time. The use-case is that I have more than one process sending update requests to Solr, each of those processes has a separate commit step, and I want to know the commit time of the documents that were committed when I gave a commit request. I hope I'm clear now - please let me know if I'm not. - Phani Chaitanya -- View this message in context: http://lucene.472066.n3.nabble.com/Get-the-commit-time-of-a-document-in-Solr-tp4089624p4089662.html Sent from the Solr - User mailing list archive at Nabble.com.
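As an illustration of the commit cadence Jack describes (soft commit about once a second, hard commit about once a minute), a solrconfig.xml sketch might look like the following; the interval values are examples, not settings from this thread:

    <autoCommit>
      <maxTime>60000</maxTime>        <!-- hard commit every 60s, flushes to stable storage -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>1000</maxTime>         <!-- soft commit every 1s, makes docs searchable -->
    </autoSoftCommit>

With this split, the application never has to track commit times itself: visibility is bounded by the soft-commit interval and durability by the hard-commit interval.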
Re: Get the commit time of a document in Solr
On Sep 12, 2013, at 20:55, phanichaitanya pvempaty@gmail.com wrote: Apologies again. But here is another try: I want to make sure that documents that are indexed are committed in, say, an hour. I agree that passing commitWithin params and the like will make sure of that, based on the time configuration we set. But I want to make sure that the document is really committed within whatever time we set using commitWithin. It's a question asking for proof that Solr commits within that time if we add the commitWithin parameter to the configuration. That is about the commitWithin parameter option that you suggested. Now, is there a way to explicitly get all the documents that are committed when a hard commit request is issued? This might not make sense, but that is the question we are pondering. If you have a timestamp field that defaults to NOW, you could do queries for a single document (q=*:* with rows=1), ranked by descending timestamp. If you're feeding constantly, and run these queries regularly, you should be able to get some sort of feel for the latency in the system.
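A sketch of that timestamp approach (the field name is hypothetical; default="NOW" stamps each document at index time, which approximates when the indexing happened, not the commit itself):

    <!-- in schema.xml -->
    <field name="index_time" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

A latency probe could then repeatedly run a query such as:

    /select?q=*:*&sort=index_time+desc&rows=1&fl=index_time

Comparing the newest visible index_time against the wall clock gives a rough feel for how far behind the searchable view is.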
Re: Get the commit time of a document in Solr
On 9/12/2013 12:55 PM, phanichaitanya wrote: I want to make sure that documents that are indexed are committed in, say, an hour. I agree that passing commitWithin params and the like will make sure of that, based on the time configuration we set. But I want to make sure that the document is really committed within whatever time we set using commitWithin. It's a question asking for proof that Solr commits within that time if we add the commitWithin parameter to the configuration. That is about the commitWithin parameter option that you suggested. Now, is there a way to explicitly get all the documents that are committed when a hard commit request is issued? This might not make sense, but that is the question we are pondering. If these are ongoing requirements that you need to meet with every commit or with a large subset of commits, then I don't think there is any way to do it without writing custom plugins for Solr. If you are just trying to prove to someone that Solr is doing what you say it is, then you can do some simple testing: send an update request with as many documents as you want to test, and include commit=true on the request. If you are planning to use commitWithin, also include softCommit=true, because commitWithin is a soft commit. Time how long it takes for the update request to complete. That's approximately how long it will take for a real update/commit to happen. There will be some extra time for the indexing itself, but unless the document count is absolutely enormous, it shouldn't matter too much. If you want to test just the commit time, then (after making sure nothing else is sending updates or commits) send the update without any commit parameters, then send a commit request by itself and time how long the commit request takes. With enough RAM for proper OS disk caching, commits should be very fast even on an index with 10 million documents. Here is a wiki page that has a small amount of discussion about slow commits: http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_commits Thanks, Shawn
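Concretely, Shawn's two timing tests might look like this against a local Solr; the URL, core name, and test document are placeholders:

    # 1) Time an update that commits (as a soft commit) in the same request:
    time curl 'http://localhost:8983/solr/collection1/update?commit=true&softCommit=true' \
         -H 'Content-Type: application/json' -d '[{"id":"test-1"}]'

    # 2) Time a commit by itself, after sending uncommitted updates:
    time curl 'http://localhost:8983/solr/collection1/update?commit=true'

The elapsed time of the first request approximates the worst-case delay between sending a document and it becoming searchable; the second isolates the commit cost alone.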
Re: charset encoding
On 9/12/2013 11:17 AM, Andreas Owen wrote: it was the http-header, as soon as i force a iso-8859-1 header it worked Glad you found a workaround! If you are in a situation where you cannot control the header of the request or modify the content itself to include charset information, or there's some reason you would rather not take that route, there will be another way with the next Solr release. https://issues.apache.org/jira/browse/SOLR-5082 Solr 4.5 will support an ie (input encoding) parameter for the update request so you can inform Solr what charset encoding to expect. The release process for Solr 4.5 has been started, it usually takes 2-3 weeks to complete. Thanks, Shawn
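Once 4.5 is out, the parameter Shawn mentions could be used roughly like this (a sketch only; the file name and update path are placeholders):

    curl 'http://localhost:8983/solr/update?ie=ISO-8859-1&commit=true' \
         -H 'Content-Type: text/xml' --data-binary @docs-latin1.xml

Until then, setting the charset explicitly in the Content-Type header (e.g. 'Content-Type: text/xml; charset=ISO-8859-1') is the usual way to tell Solr how the request body is encoded.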
Re: SolrCloud 4.x hangs under high update volume
That makes sense, thanks Erick and Mark for your help! :) I'll see if I can find a place to assist with the testing of SOLR-5232. Cheers, Tim On 12 September 2013 11:16, Mark Miller markrmil...@gmail.com wrote: Right, I don't see SOLR-5232 making 4.5 unfortunately. It could perhaps make a 4.5.1 - it does resolve a critical issue - but 4.5 is in motion and SOLR-5232 is not quite ready - we need some testing. - Mark On Sep 12, 2013, at 2:12 PM, Erick Erickson erickerick...@gmail.com wrote: My take on it is this, assuming I'm reading this right: 1) SOLR-5216 - probably not going anywhere; 5232 will take care of it. 2) SOLR-5232 - expected to fix the underlying issue no matter whether you're using CloudSolrServer from SolrJ or sending lots of updates from lots of clients. 3) SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the meantime. I don't quite know whether SOLR-5232 will make it into 4.5 or not; it hasn't been committed anywhere yet. The Solr 4.5 release is imminent - RC0 looks like it'll be ready to cut next week - so it might not be included. Best, Erick On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt t...@elementspace.com wrote: Lol, at breaking during a demo - always the way it is! :) I agree, we are just tip-toeing around the issue, but waiting for 4.5 is definitely an option if we get by for now in testing; patched Solr versions seem to make people uneasy sometimes :). Seeing as there seems to be some danger to SOLR-5216 (in some ways it blows up worse due to fewer limitations on threads), I'm guessing only SOLR-5232 and SOLR-4816 are making it into 4.5? I feel those two in combination will make a world of difference! Thanks so much again, guys! Tim On 12 September 2013 03:43, Erick Erickson erickerick...@gmail.com wrote: Fewer client threads updating makes sense, and going to 1 core also seems like it might help. But it's all a crap-shoot unless the underlying cause gets fixed up. Both would improve things, but you'll still hit the problem sometime, probably when doing a demo for your boss ;). Adrien has branched the code for Solr 4.5 in preparation for a release candidate tentatively scheduled for next week. You might just start working with that branch if you can, rather than apply individual patches... I suspect there'll be a couple more changes to this code (looks like Shikhar already raised an issue, for instance) before 4.5 is finally cut... FWIW, Erick
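As a footnote for readers of this thread: a minimal sketch of the batched CloudSolrServer approach discussed above, against the 4.x SolrJ API. The ZooKeeper addresses, collection name, and field names are placeholders, and the batch size of 200 merely echoes the size Tim reported; this is an illustration, not the poster's actual code.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedCloudUpdater {
    public static void main(String[] args) throws Exception {
        // ZooKeeper-aware client; with SOLR-4816 applied it can route each
        // document directly to the correct shard leader instead of relaying
        // through whichever node a load balancer happens to pick.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 200; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title", "example title " + i);
            batch.add(doc);
        }
        server.add(batch);  // one HTTP request per batch, not per document
        server.commit();    // or rely on autoCommit / commitWithin instead
        server.shutdown();
    }
}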
Re: Some highlighted snippets aren't being returned
maxAnalyzedChars did it! I wasn't setting that param, and I'm working with some very long documents. I also made the hl.fl param formatting change that you suggested, Aloke. Thanks again! - Eric On Sep 11, 2013, at 3:10 AM, Eric O'Hanlon elo2...@columbia.edu wrote: Thank you, Aloke and Bryan! I'll give this a try and I'll report back on what happens! - Eric On Sep 9, 2013, at 2:32 AM, Aloke Ghoshal alghos...@gmail.com wrote: Hi Eric, As Bryan suggests, you should look at appropriately setting up the fragSize and maxAnalyzedChars for long documents. One issue I find with your search request is that in trying to highlight across three separate fields, you have added each of them as a separate request param: hl.fl=contents&hl.fl=title&hl.fl=original_url The way to do it would be (http://wiki.apache.org/solr/HighlightingParameters#hl.fl) to pass them as values to one comma (or space) separated field: hl.fl=contents,title,original_url Regards, Aloke On 9/9/13, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote: Eric, Your example document is quite long. Are you setting hl.maxAnalyzedChars? If you don't, the highlighter you appear to be using will not look past the first 51,200 characters of the document for snippet candidates. http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars -- Bryan -Original Message- From: Eric O'Hanlon [mailto:elo2...@columbia.edu] Sent: Sunday, September 08, 2013 2:01 PM To: solr-user@lucene.apache.org Subject: Re: Some highlighted snippets aren't being returned Hi again Everyone, I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts. Thanks, Eric On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote: Hi Everyone, I'm facing an issue in which my solr query is returning highlighted snippets for some, but not all, results. For reference, I'm searching through an index that contains web crawls of human-rights-related websites. I'm running solr as a webapp under Tomcat and I've included the query's solr params from the Tomcat log: ... webapp=/solr-4.2 path=/select params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8 status=0 QTime=108 ... For the query above (which can be simplified to say: find all documents that contain the word "unangan" and return facets, highlights, etc.), I get five search results. Only three of these are returning highlighted snippets.
Here's the highlighting portion of the solr response (note: printed in ruby notation because I'm receiving this response in a Rails app): highlighting=> {20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf=> {}, 20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf=> {}, 20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf=> {}, 20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf=> {contents=> [...actual snippet is returned here...]}, 20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf=> {contents=> [...actual snippet is returned here...]}, 20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999=> {contents=> [...actual snippet is returned here...]}, 20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw=> {contents=> [...actual snippet is returned here...]}, 20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf=> {}} I have eight (as opposed to five) results above because I'm also doing a grouped query, grouping by a field called original_url, and this leads to five grouped results. I've confirmed that my highlight-lacking results DO contain the word "unangan", as expected, and this term appears in a text field that's indexed and stored, and is searched in all text searches. For example, one
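Putting Bryan's and Aloke's advice together, the relevant corner of the request would look something like the sketch below. The collection name and the maxAnalyzedChars value are illustrative, not from the thread; 1000000 simply needs to exceed the longest document you want snippets from.

http://localhost:8983/solr/collection1/select?q=Unangan&defType=edismax&qf=text&hl=true&hl.fl=contents,title,original_url&hl.fragsize=600&hl.maxAnalyzedChars=1000000&wt=ruby

Without hl.maxAnalyzedChars, the standard highlighter stops scanning for snippet candidates after the first 51,200 characters, which is why only the shorter documents were returning highlights.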
Re: Get the commit time of a document in Solr
On 9/12/2013 11:04 AM, phanichaitanya wrote: So, now I want to know when that document becomes searchable or when it is committed. I have the following scenario: 1) Indexing starts at say 9:00 AM - with the above additions to the schema.xml I'll know the indexed time of each document I send to Solr via the update handler. Say 9:01, 9:02 and so on ... let's say I send a document every second between 9:00 and 9:30 AM, which makes 30*60 = 1800 docs. 2) Now at 9:30 AM, I issue a hard commit and I'll be able to search these 1800 documents, which is fine. 3) Now I want to know that I can search these 1800 documents only at or after 9:30 AM, but not before 9:30 AM, as I did not do a hard commit before 9:30 AM. In order to know that, is there a way in Solr, rather than some application keeping track of the documents it sends to Solr between any two commits? The reason I'm asking is: if there are, say, two parallel processes indexing to the same index and one process issues a commit, then whatever documents process two indexed up to that point in time would also be committed, right? Now if I keep track of commit times in each process, it doesn't reflect the true commit times, as they are intertwined. From what I understand, if you use the default of NOW for a field in your schema, then all documents indexed in that request will have the timestamp of the time that indexing started. Assuming what I understand is the way it actually works, if you want the time to reflect anything even close to commit time, then you will need to send very small batches and you will need to commit after every batch. If you are indexing very quickly, you'll probably want those commits to be soft commits. You'll also want to have an autoCommit set up to do hard commits less frequently with openSearcher=false, or you'll run into the problem described at the link below. There is a good autoCommit example there: http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_startup I've heard (but have not tested) that with the NOW default, large imports with the dataimporthandler will all have the timestamp of when the DIH request started, no matter what you do with autoCommit or autoSoftCommit. Thanks, Shawn
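The autoCommit setup Shawn refers to looks roughly like this in solrconfig.xml; the intervals are illustrative values, not numbers from this thread:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes segments to disk and truncates the transaction
       log; openSearcher=false means it does not change search visibility -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: cheap, makes newly indexed documents searchable -->
  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>
</updateHandler>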
Re: Get the commit time of a document in Solr
Yes, the document will be searchable after it is committed. Although you can also do auto commits and commitWithin, which do not guarantee immediate visibility of index changes, you can do a hard commit any time you want to make a document searchable. -- Jack Krupansky -Original Message- From: phanichaitanya Sent: Thursday, September 12, 2013 12:07 PM To: solr-user@lucene.apache.org Subject: Get the commit time of a document in Solr
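Purely as a sketch of the commitWithin option Jack mentions (the 60000 ms bound and the document are placeholders): posting

<add commitWithin="60000">
  <doc>
    <field name="id">doc-1</field>
  </doc>
</add>

to /update asks Solr to make the document searchable within 60 seconds even if no explicit commit follows; the SolrJ equivalent is server.add(doc, 60000). Note that commitWithin is only an upper bound on visibility, which is exactly why it doesn't answer the original question of precisely when a given document was committed.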
Get the commit time of a document in Solr
I'd like to know when a document is committed in Solr vs. the indexed time. For indexed time, I can add a field as: <field name="indexed_time" type="date" default="NOW" indexed="true" stored="true" />. If I have, say, 10 million docs indexed, I want to know the actual commit time of each document, which is what makes it searchable. The problem is to find the time when a document becomes searchable, which will be after it is committed (I don't want to do any soft commits). If there is a way to know this, please let me know, as I'd like to learn more details based on it.
Re: Regarding improving performance of the solr
Hi Prabu, It's difficult to tell what's going wrong without the full exception stack trace, including what the exception is. If you can provide the specific input that triggers the exception, that might also help. Steve On Sep 12, 2013, at 4:14 AM, prabu palanisamy pr...@serendio.com wrote: Hi, I tried to reindex Solr and I get a regular expression problem. The steps I followed are: I started the server with java -jar start.jar, then deleted the index:

http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
http://localhost:8983/solr/update?stream.body=<commit/>

I stopped the Solr server and changed indexed and stored to false for some of the fields in schema.xml:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="revision" type="sint" indexed="false" stored="false"/>
  <field name="user" type="string" indexed="false" stored="false"/>
  <field name="userId" type="int" indexed="false" stored="false"/>
  <field name="text" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="pagerank" type="text_general" indexed="true" stored="false"/>
  <field name="anchor_text" type="text_general" indexed="true" stored="false" multiValued="true" compressed="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="freebase" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="titleText" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="category" type="string" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="title" dest="titleText"/>

My data-config.xml:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="page" processor="XPathEntityProcessor" stream="true" forEach="/mediawiki/page/" url="/home/prabu/wikipedia_full_indexed_dump.xml" transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
      <field column="id" xpath="/mediawiki/page/id" stripHTML="true"/>
      <field column="title" xpath="/mediawiki/page/title" stripHTML="true"/>
      <field column="category" xpath="/mediawiki/page/category" stripHTML="true"/>
      <field column="revision" xpath="/mediawiki/page/revision/id" stripHTML="true"/>
      <field column="user" xpath="/mediawiki/page/revision/contributor/username" stripHTML="true"/>
      <field column="userId" xpath="/mediawiki/page/revision/contributor/id" stripHTML="true"/>
      <field column="text" xpath="/mediawiki/page/revision/text" stripHTML="true"/>
      <field column="freebase" xpath="/mediawiki/page/freebase" stripHTML="true"/>
      <field column="pagerank" xpath="/mediawiki/page/pagerank" stripHTML="true"/>
      <field column="anchor_text" xpath="/mediawiki/page/anchor_text" stripHTML="true"/>
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
      <field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
      <field column="category" regex="((\[\[.*Category:.*\]\]\W?)+)" sourceColName="text" stripHTML="true"/>
      <field column="$skipDoc" regex="^Template:.*" replaceWith="true" sourceColName="title"/>
    </entity>
  </document>
</dataConfig>

I tried http://localhost:8983/solr/dataimport?command=full-import. At around 50,000 documents, I get an error related to the regular expression.
at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078) at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345) at java.util.regex.Pattern$Branch.match(Pattern.java:4114) at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168) at java.util.regex.Pattern$Loop.match(Pattern.java:4295) at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227) at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078) at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345) at java.util.regex.Pattern$Branch.match(Pattern.java:4114) I do not know how to proceed. Please help me out. Thanks and Regards Prabu On Wed, Sep 11, 2013 at 11:31 AM, Erick Erickson erickerick...@gmail.com wrote: Be a little careful when extrapolating from disk to memory. Any fields where you've set stored=true will put data in segment files with extensions .fdt and .fdx. These are the compressed verbatim copy of the data for stored fields and have very little
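An aside for later readers: the repeated Pattern$Loop/Pattern$Branch frames above are the classic signature of catastrophic backtracking (typically surfacing as a StackOverflowError) when a nested quantifier like ((\[\[.*Category:.*\]\]\W?)+) runs over very long article text. Purely as an illustration of the technique, not a verified fix for this data: bounding the unbounded .* with a character class that cannot cross ']' tames the backtracking, under the assumption that category link text never contains ']'.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryRegex {
    public static void main(String[] args) {
        // Original pattern: a greedy .* nested inside a repeated group
        // backtracks explosively on long inputs with many "[[...]]" runs.
        // Pattern risky = Pattern.compile("((\\[\\[.*Category:.*\\]\\]\\W?)+)");

        // Bounded alternative (assumption: no ']' inside the link text).
        Pattern safer = Pattern.compile("((\\[\\[[^\\]]*Category:[^\\]]*\\]\\]\\W?)+)");

        String text = "intro [[Category:History]] [[Category:Asia]] tail";
        Matcher m = safer.matcher(text);
        if (m.find()) {
            System.out.println(m.group(1)); // the matched run of category links
        }
    }
}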
Unable to connect to http://localhost:8983/solr/
Hi, I just have this issue that came out of nowhere. Everything was fine until all of a sudden the browser can't connect to this Solr. Here is the Solr log: INFO - 2013-09-12 20:07:58.142; org.eclipse.jetty.server.Server; jetty-8.1.8.v20121106 INFO - 2013-09-12 20:07:58.179; org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor E:\Projects\G1\A1\trunk\solr_root\solrization\contexts at interval 0 INFO - 2013-09-12 20:07:58.191; org.eclipse.jetty.deploy.DeploymentManager; Deployable added: E:\Projects\G1\A1\trunk\solr_root\solrization\contexts\solr-jetty-context.xml INFO - 2013-09-12 20:07:59.159; org.eclipse.jetty.webapp.StandardDescriptorProcessor; NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet INFO - 2013-09-12 20:07:59.189; org.eclipse.jetty.server.handler.ContextHandler; started o.e.j.w.WebAppContext{/solr,file:/E:/Projects/G1/A1/trunk/solr_root/solrization/solr-webapp/webapp/},E:\Projects\G1\A1\trunk\solr_root\solrization/webapps/solr.war INFO - 2013-09-12 20:07:59.190; org.eclipse.jetty.server.handler.ContextHandler; started o.e.j.w.WebAppContext{/solr,file:/E:/Projects/G1/A1/trunk/solr_root/solrization/solr-webapp/webapp/},E:\Projects\G1\A1\trunk\solr_root\solrization/webapps/solr.war INFO - 2013-09-12 20:07:59.206; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init() INFO - 2013-09-12 20:07:59.231; org.apache.solr.core.SolrResourceLoader; JNDI not configured for solr (NoInitialContextEx) INFO - 2013-09-12 20:07:59.231; org.apache.solr.core.SolrResourceLoader; solr home defaulted to 'solr/' (could not find system property or JNDI) INFO - 2013-09-12 20:07:59.241; org.apache.solr.core.CoreContainer$Initializer; looking for solr config file: E:\Projects\G1\A1\trunk\solr_root\solrization\solr\solr.xml INFO - 2013-09-12 20:07:59.244; org.apache.solr.core.CoreContainer; New CoreContainer 24012447 INFO - 2013-09-12 20:07:59.244; org.apache.solr.core.CoreContainer; Loading CoreContainer using Solr Home: 'solr/' INFO - 2013-09-12 20:07:59.245; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: 'solr/' INFO - 2013-09-12 20:07:59.483; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting socketTimeout to: 0 INFO - 2013-09-12 20:07:59.484; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting urlScheme to: http:// INFO - 2013-09-12 20:07:59.485; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting connTimeout to: 0 INFO - 2013-09-12 20:07:59.486; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maxConnectionsPerHost to: 20 INFO - 2013-09-12 20:07:59.487; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting corePoolSize to: 0 INFO - 2013-09-12 20:07:59.488; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maximumPoolSize to: 2147483647 INFO - 2013-09-12 20:07:59.489; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maxThreadIdleTime to: 5 INFO - 2013-09-12 20:07:59.490; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting sizeOfQueue to: -1 INFO - 2013-09-12 20:07:59.490; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting fairnessPolicy to: false INFO - 2013-09-12 20:07:59.498; org.apache.solr.client.solrj.impl.HttpClientUtil; Creating new http client, config:maxConnectionsPerHost=20&maxConnections=1&socketTimeout=0&connTimeout=0&retry=false INFO - 2013-09-12 20:07:59.671; org.apache.solr.core.CoreContainer; Registering Log Listener INFO - 2013-09-12 20:07:59.689;
org.apache.solr.core.CoreContainer; Creating SolrCore 'A1' using instanceDir: solr\A1 INFO - 2013-09-12 20:07:59.690; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: 'solr\A1\' INFO - 2013-09-12 20:07:59.724; org.apache.solr.core.SolrConfig; Adding specified lib dirs to ClassLoader INFO - 2013-09-12 20:07:59.726; org.apache.solr.core.SolrResourceLoader; Adding 'file:/E:/Projects/G1/A1/trunk/solr_root/solrization/lib/mysql-connector-java-5.1.25-bin.jar' to classloader INFO - 2013-09-12 20:07:59.727; org.apache.solr.core.SolrResourceLoader; Adding 'file:/E:/Projects/G1/A1/trunk/solr_root/contrib/dataimporthandler/lib/activation-1.1.jar' to classloader INFO - 2013-09-12 20:07:59.727; org.apache.solr.core.SolrResourceLoader; Adding 'file:/E:/Projects/G1/A1/trunk/solr_root/contrib/dataimporthandler/lib/mail-1.4.1.jar' to classloader INFO - 2013-09-12 20:07:59.728; org.apache.solr.core.SolrResourceLoader; Adding 'file:/E:/Projects/G1/A1/trunk/solr_root/dist/solr-dataimporthandler-4.3.0.jar' to classloader INFO - 2013-09-12 20:07:59.729; org.apache.solr.core.SolrResourceLoader; Adding 'file:/E:/Projects/G1/A1/trunk/solr_root/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.3.0.jar' to classloader INFO - 2013-09-12 20:07:59.729; org.apache.solr.core.SolrResourceLoader; Adding
Stop filter changes in Solr >= 4.4
While attempting to upgrade from Solr 4.3.0 to Solr 4.4.0 I ran into this exception: java.lang.IllegalArgumentException: enablePositionIncrements=false is not supported anymore as of Lucene 4.4 as it can create broken token streams, which led me to https://issues.apache.org/jira/browse/LUCENE-4963. I need to be able to match queries irrespective of intervening stopwords (which used to work with enablePositionIncrements=true). For instance: "foo of the bar" would find documents matching "foo bar", "foo of bar", and "foo of the bar". With this option deprecated in 4.4.0 I'm not clear on how to maintain the same functionality. The package javadoc adds: If the selected analyzer filters the stop words "is" and "the", then for a document containing the string "blue is the sky", only the tokens "blue", "sky" are indexed, with position(sky) = 3 + position(blue). Now, a phrase query "blue is the sky" would find that document, because the same analyzer filters the same stop words from that query. But the phrase query "blue sky" would not find that document because the position increment between "blue" and "sky" is only 1. If this behavior does not fit the application needs, the query parser needs to be configured to not take position increments into account when generating phrase queries. But there's no mention of how to actually configure the query parser to do this. Does anyone know how to deal with this issue as Solr moves toward 5.0? Crossposted from stackoverflow: http://stackoverflow.com/questions/18668376/solr-4-4-stopfilterfactory-and-enablepositionincrements
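For what it's worth, the "configure the query parser" hint in that javadoc maps to a Lucene-level switch; Solr's stock query parsers don't expose it as a request parameter, so the following is only a raw-Lucene sketch of what the javadoc means (field name and analyzer choice are illustrative):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class PositionInsensitivePhrases {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer's default stop set removes "is" and "the",
        // leaving position gaps in the indexed token stream.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
        QueryParser qp = new QueryParser(Version.LUCENE_44, "contents", analyzer);
        // Ignore position increments (the holes left by stopwords) when
        // building phrase queries, so "blue sky" can match "blue is the sky".
        qp.setEnablePositionIncrements(false);
        Query q = qp.parse("\"blue sky\"");
        System.out.println(q); // phrase query with consecutive positions
    }
}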
Re: Get the commit time of a document in Solr
Solr admin exposes the time of the last commit. You can use that. Otis Solr ElasticSearch Support http://sematext.com/ On Sep 12, 2013 3:22 PM, phanichaitanya pvempaty@gmail.com wrote: Apologies again, but here is another try: I want to make sure that documents that are indexed are committed within, say, an hour. I agree that passing commitWithin params and the like will ensure that, based on the time configurations we set. But I want to verify that a document is really committed within whatever time we set using commitWithin - essentially asking for proof that Solr commits within that time when we add the commitWithin parameter. That covers the commitWithin option you suggested. Now, is there a way to explicitly get all the documents that are committed when a hard commit request is issued? This might not make sense, but that question has us puzzled. - Phani Chaitanya
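For later readers: the last-commit time Otis mentions is also visible outside the admin UI via the Luke request handler; the core name here is a placeholder, and the exact response layout should be verified against your version:

curl "http://localhost:8983/solr/collection1/admin/luke?numTerms=0&wt=json"

The index section of the response includes a lastModified value, i.e. the time of the last commit that changed the index. Note this is per index, not per document, so it still doesn't identify which documents a particular hard commit made visible.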
Solr 4.5 spatial search - distance and score
I'm trying to get the score by using a custom boost and also get the distance. I found David's code* to get it using Intersects, which I want to replace with {!geofilt} or geodist(). *David's code: https://issues.apache.org/jira/browse/SOLR-4255 He told me geodist() will be available again for this kind of field, which is a geohash type. So I'd like to know how it can be done today on 4.4 with {!geofilt}, and how it will be done on 4.5 using geodist(). Thanks in advance.
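For orientation, the kind of request sketched on SOLR-4255 looks roughly like the following; the field name, point, and distance are placeholders, and since the feature was still settling at the time of this thread, the exact parameter names and the release they land in should be double-checked against the issue:

q={!geofilt score=distance sfield=geo pt=-33.86,151.21 d=100}&fl=id,score&sort=score asc

The idea is that with score=distance on a recursive-prefix-tree (geohash-based) field, the score of each match is its distance from pt, so returning score in fl stands in for the old geodist() value.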