Re: How to re-index Solr get term frequency within documents
I use Nutch as the input data source for my Solr, so I cannot re-run all the Nutch jobs to generate the data for Solr again; it would take very long to generate that much data. I was hoping there would be an easier way inside Solr to just re-index all the existing data. Thanks, Tony

On Tue, Jul 2, 2013 at 1:37 AM, Jack Krupansky j...@basetechnology.com wrote:

Or, go with a commercial product that has a single-click Solr re-index capability, such as: 1. DataStax Enterprise - data is stored in Cassandra and reindexed into Solr from there. 2. LucidWorks Search - data sources are declared so that the package can automatically re-crawl the data sources. But, yeah, as Otis says, re-index is really just a euphemism for deleting your Solr data directory and indexing from scratch from the original data sources. -- Jack Krupansky

-----Original Message----- From: Otis Gospodnetic Sent: Monday, July 01, 2013 2:26 PM To: solr-user@lucene.apache.org Subject: Re: How to re-index Solr get term frequency within documents

If all your fields are stored, you can do it with the SolrEntityProcessor (http://search-lucene.com/?q=solrentityprocessor). Otherwise, just reindex the same way you indexed in the first place. *Always* be ready to reindex from scratch. Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm

On Mon, Jul 1, 2013 at 1:29 PM, Tony Mullins tonymullins...@gmail.com wrote:

Thanks Jack, it worked. Could you please provide some info on how to re-index existing data in Solr after changing the schema.xml? Thanks, Tony

On Mon, Jul 1, 2013 at 8:21 PM, Jack Krupansky j...@basetechnology.com wrote:

You can write any function query in the field list of the fl parameter. Sounds like you want termfreq: termfreq(field_arg,term), for example fl=id,a,b,c,termfreq(a,xyz) -- Jack Krupansky

-----Original Message----- From: Tony Mullins Sent: Monday, July 01, 2013 10:47 AM To: solr-user@lucene.apache.org Subject: How to re-index Solr get term frequency within documents

Hi, I am using Solr 4.3.0. If I change my Solr schema.xml, do I need to re-index my Solr? And if yes, how? My second question: I need to find the frequency of a term per document across all documents of the search result. My field is

<field name="CommentX" type="text_general" stored="true" indexed="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

and I am trying this query:

http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions=true&tv.offsets=true

It is just returning the result set, with no info on my searched term's (iphone) frequency in each document. How can I make Solr return the frequency of the searched term per document in the result set? Thanks, Tony.
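A minimal data-config sketch of the SolrEntityProcessor route Otis mentions, assuming all fields are stored; the source URL, core name, and row size below are placeholders rather than values from this thread:

    <dataConfig>
      <document>
        <entity name="reindex"
                processor="SolrEntityProcessor"
                url="http://localhost:8080/solr/oldcore"
                query="*:*"
                rows="500"/>
      </document>
    </dataConfig>

Running /dataimport?command=full-import against the new core with this config pulls every stored document out of the old index and re-indexes it under the new schema.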
Re: Unique key error while indexing pdf files
Can you please suggest a way (with an example) of assigning this unique key to a pdf file?
Re: Unique key error while indexing pdf files
Okay. Can you please suggest a way (with an example) of assigning this unique key to a pdf file? Say, a unique number to each pdf file. How do I achieve this?
Re: Solr indexer and Hadoop
Michael, I understand from your post that I can use the current storage without moving it into Hadoop. I already have the storage mounted via NFS. Does your map function read from the mounted storage directly? If possible, can you please illustrate more on that? Thanks, Engy
Solr - Delta Query Via Full Import
I am using DIH to fetch rows from a DB into Solr. I have many 1:n relations and I can handle them only if I use caching (super fast). Therefore I am adding the following attributes to my inner entity: processor="CachedSqlEntityProcessor" cacheKey="..." cacheLookup="...". Everything works great and fast (first the n-side tables are queried, then the main entity).

Now I want to configure the delta import, and it is not actually working. I know that by standard (http://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example) I need to define the following attributes:

1. query - the initial query
2. deltaQuery - the rows that were changed
3. deltaImportQuery - fetch the data that was changed
4. parentDeltaQuery - the keys of the parent entity that has changed rows in the current entity

(2-4 are only used in delta imports.) And I have seen a hack on that same wiki page that lets you do a delta query via full import: instead of adding query, deltaImportQuery and deltaQuery, I can just add query and call full-import instead of delta-import.

Problem: only the first query (main entity) is executed when I run the full import without clean. Here is the relevant part of my data-config.xml (I have left deltaImportQuery in, though I call only full-import):

<entity name="PackageVersion" pk="PackageVersionId"
        query="select ... from [dbo].[Package] Package
               inner join [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId
               where '${dataimporter.request.clean}' != 'false'
                  OR Package.LastModificationTime > '${dataimporter.last_index_time}'
                  OR PackageVersion.Timestamp > '${dataimporter.last_index_time}'"
        deltaImportQuery="select ... from [dbo].[Package] Package
               inner join [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId
               where '${dataimporter.request.clean}' != 'false'
                  OR Package.LastModificationTime > '${dataimporter.last_index_time}'
                  OR PackageVersion.Timestamp > '${dataimporter.last_index_time}'
                  and ID=='${dih.delta.id}'">
  <entity name="PackageTag" pk="ResourceId"
          processor="CachedSqlEntityProcessor"
          cacheKey="ResourceId" cacheLookup="PackageVersion.PackageId"
          query="SELECT ResourceId,[Text] PackageTag from [dbo].[Tag] Tag
                 where '${dataimporter.request.clean}' = 'true'
                    OR Tag.TimeStamp > '${dataimporter.last_index_time}'"
          parentDeltaQuery="select PackageVersion.PackageVersionId from [dbo].[Package] Package
                 inner join [dbo].[PackageVersion] PackageVersion ON Package.Id = PackageVersion.PackageId
                 where Package.Id=${PackageTag.ResourceId}"/>
</entity>
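For reference, the delta-via-full-import hack on that wiki page boils down to a single-entity query of this shape (a sketch; the item table and last_modified column are the wiki's illustrative names, not from this schema):

    <entity name="item" pk="ID"
            query="SELECT * FROM item
                   WHERE '${dataimporter.request.clean}' != 'false'
                      OR last_modified > '${dataimporter.last_index_time}'"/>

The delta run is then just /dataimport?command=full-import&clean=false, which makes the last_index_time clause kick in instead of a full re-read.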
Re: Unique key error while indexing pdf files
We can't tell you what the id of your own document should be. Isn't there anything which is unique about your pdf files? How about the file name or the absolute path?

On Tue, Jul 2, 2013 at 11:33 AM, archit2112 archit2...@gmail.com wrote:

Okay. Can you please suggest a way (with an example) of assigning this unique key to a pdf file? Say, a unique number to each pdf file. How do I achieve this?

-- Regards, Shalin Shekhar Mangar.
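One possible way to wire that up in the data-config from the earlier thread (a sketch, not tested against this setup): FileListEntityProcessor exposes an implicit fileAbsolutePath column, which can be mapped onto the id field in the outer entity:

    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="C:\Users\aroraarc\Desktop\Impdo"
            fileName=".*pdf" recursive="true">
      <field column="fileAbsolutePath" name="id"/>
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>

With rootEntity="false" on the outer entity, each emitted document should pick up the parent row's fields, so every PDF gets its absolute path as its unique key.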
Re: Unique key error while indexing pdf files
Yes. The absolute path is unique.
Removal of unique key - Query Elevation Component
I want to index pdf files in Solr 4.3.0 using the data import handler. I have done the following.

My request handler:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

My data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="C:\Users\aroraarc\Desktop\Impdo"
            fileName=".*pdf" recursive="true">
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Now when I tried to index the documents I got the following error:

org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id

Because I don't want any uniqueKey in my case, I disabled it as follows. In solrconfig.xml I commented out:

<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <!-- pick a fieldType to analyze queries -->
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

In schema.xml I commented out <uniqueKey>id</uniqueKey> and added:

<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
<field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>

and in elevate.xml I made the following changes:

<elevate>
  <query text="foo bar">
    <doc id="4602376f-9741-407b-896e-645ec3ead457"/>
  </query>
</elevate>

When I do this the indexing takes place, but the indexed docs contain only author, author_s and id fields. The documents should contain author, text, title and id fields (as defined in my data-config.xml). Please help me out. Am I doing anything wrong? And where did this author_s field come from?

<doc>
  <str name="author">arora arc</str>
  <str name="author_s">arora arc</str>
  <str name="id">4f65332d-49d9-497a-b88b-881da618f571</str>
</doc>
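If the goal is simply an auto-generated id, another option in Solr 4.x (a sketch, not taken from this thread; whether DIH picks up the chain via update.chain should be verified on your version) is an update processor chain with UUIDUpdateProcessorFactory:

    <updateRequestProcessorChain name="uuid">
      <processor class="solr.UUIDUpdateProcessorFactory">
        <str name="fieldName">id</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

and then <str name="update.chain">uuid</str> in the /dataimport handler defaults, so any document arriving without an id gets one generated.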
Re: Removal of unique key - Query Elevation Component
My guess is that you have a copyField element which copies the author into an author_s field.

On Tue, Jul 2, 2013 at 2:14 PM, archit2112 archit2...@gmail.com wrote: [...]

-- Regards, Shalin Shekhar Mangar.
Re: Solr indexer and Hadoop
If you can upload your data to hdfs you can use this patch to build the solr indexes: https://issues.apache.org/jira/browse/SOLR-1301
Re: Removal of unique key - Query Elevation Component
Thanks! The author_s issue has been resolved. Why are the other fields not getting indexed?
Re: Unique key error while indexing pdf files
Yes. The absolute path is unique. How do I implement it? Can you please explain?
need distance in miles not in kilometers
Hi, I am using Solr 4.2 and my results are coming through properly, but now I want the distance in miles and I am getting the distance in kilometres. Can anyone tell me how to get the distance in miles?

Example query:

q=*:*&fq={!geofilt}&sfield=latlng&pt=18.9322453,72.8264378001&d=60&fl=_dist_:geodist()&sort=geodist() desc

URL: http://wiki.apache.org/solr/SpatialSearch

Thanks in advance. Regards, Irshad
Re: OOM killer script woes
On looking at the code in SolrDispatchFilter, is this intentional or not? I think I remember Mark Miller mentioning that in an OOM case the best course of action is basically to kill the process; there is very little Solr can do once it has run out of memory. Yet it seems that Solr catches the OOM itself and just logs it as an error, rather than letting it propagate back up to the JVM. We have also seen OOMs in IndexWriter, which has specific code to handle OOM cases and seems to fall back to the transaction log (but fails committing anything). I understand the logic of that, but in reality I've seen the tlog get corrupted in this case, so we still need to be monitoring the system and forcibly kill the process.

On 27 June 2013 00:03, Timothy Potter thelabd...@gmail.com wrote:

Thanks for the feedback Daniel ... For now, I've opted to just kill the JVM with System.exit(1) in the SolrDispatchFilter code and will restart it with a Linux supervisor. Not elegant, but the alternative of having a zombie Solr instance walking around my cluster is much worse ;-) Will try to dig into the code that is trapping this error, but for now I've lost too many hours on this problem. Cheers, Tim

On Wed, Jun 26, 2013 at 2:43 PM, Daniel Collins danwcoll...@gmail.com wrote:

Ooh, I guess Jetty is trapping that java.lang.OutOfMemoryError and throwing/packaging it as a java.lang.RuntimeException. The -XX option assumes that the application doesn't handle the Errors, so that they would reach the JVM and thus invoke the handler. Since Jetty has an exception handler that is dealing with anything (including Errors), they never reach the JVM, hence no handler. Not much we can do short of not using Jetty? That's a pain, I'd just written a nice OOM handler too!

On 26 June 2013 20:37, Timothy Potter thelabd...@gmail.com wrote:

A little more to this ...
Just on the chance this was a weird Jetty issue or something, I tried with the latest Jetty 9 and the problem still occurs :-( This is on Java 7 on Debian:

java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)

Here is an example stack trace from the log:

2013-06-26 19:31:33,801 [qtp632640515-62] ERROR solr.servlet.SolrDispatchFilter Q:22 - null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:445)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.OutOfMemoryError: Java heap space

On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter thelabd...@gmail.com wrote:

Recently upgraded to 4.3.1, but this problem has persisted for a while now ... I'm using the following configuration when starting Jetty:

-XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p

If an OOM is triggered during Solr web app initialization (such as by me lowering -Xmx to a value that is too low to initialize Solr with), then the script gets called and does what I expect! However, once the Solr webapp initializes
Aggregate TermFrequency on Result Grouping / Field Collapsing
Hi, is it possible to perform an aggregated termfreq(field,term) on Result Grouping? I am trying to get the total count of a term's appearances in a document, and then want to aggregate that count by grouping the documents on one of my fields. Like this:

http://localhost:8080/solr/collection1/select?q=iphone&wt=json&indent=true&group=true&group.field=url&fl=freq:termfreq(CommentX,'iphone')

The problem is that it returns only the top-level result (doc) in each group, and thus the term frequency of that one doc. How can I make it sum the termfreq() of all the documents per group? Thanks, Tony
undefined field http:// while searching query
Hi, I am using Solr 3.3. After indexing, I am querying with the command below:

http://localhost:8080/solr/select/?q=(http://www.google.co.in)

and getting the error below:

SEVERE: org.apache.solr.common.SolrException: undefined field http://
    at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1254)
    at org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getAnalyzer(IndexSchema.java:410)
    at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.reusableTokenStream(IndexSchema.java:385)
    at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:574)
    at org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:158)
    at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1421)
    at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1309)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
    at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
    at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
    at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1226)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206)
    at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
    at org.apache.solr.search.QParser.getQuery(QParser.java:142)
    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:81)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
    at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:257)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
    at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1764)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

Can you please assist me with this? Thanks in advance. Aniljayanti.
Solr 4.3 Pivot Performance Issue
Hi there, I noticed with the upgrade from Solr 4.0 to Solr 4.3 that we had a degradation of queries that use pivot fields. Has someone else noticed it too? Thanks
Re: No date.gap on pivoted facets
On Sun, Jun 30, 2013 at 5:33 PM, Jack Krupansky j...@basetechnology.com wrote:

> Sorry, but Solr pivot faceting is based solely on field facets, not range (or date) facets.

Thank you. I tried adding that information to the SimpleFacetParameters wiki page, but that page seems to be defined as an Immutable Page.

> You can approximate date gaps by making a copy of your raw date field and then manually gap (truncate) the date values so that their discrete values correspond to your date gap.

Thank you, this is what I have done.

> In the next release of my book I have a script for a StatelessScriptUpdateProcessor (with examples) that supports truncation of dates to a desired resolution, copying or modifying the input date as desired.

Terrific, I anticipate the release. Next release? Did I miss the release? http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957/

-- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Spell check in SOLR
Hi, how can I configure Solr to provide corrections for misspelled words? If the query string is in the dictionary, Solr should not return any suggestions; but if the query string is not in the dictionary, Solr should return all possible corrected words in the dictionary which most likely could be the query string. Thanks, Prathik
RE: undefined field http:// while searching query
Colons need to be escaped.

Cheers

-----Original message----- From: aniljayanti aniljaya...@yahoo.co.in Sent: Tuesday 2nd July 2013 12:35 To: solr-user@lucene.apache.org Subject: undefined field http:// while searching query [...]
parent import query doesn't run
I have a 1:n relation between my main entity (PackageVersion) and its tags in my DB. I add a new tag to the DB at the timestamp and run the delta-import command. The delta select retrieves the row, but I don't see any other SQL. Here are my data-config.xml configurations:

<entity name="PackageVersion" pk="PackageVersionId"
        query="select ... from [dbo].[Package] Package
               inner join [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId"
        deltaQuery="select PackageVersion.Id PackageVersionId from [dbo].[Package] Package
               inner join [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId
               where Package.LastModificationTime > '${dataimporter.last_index_time}'
                  OR PackageVersion.Timestamp > '${dataimporter.last_index_time}'"
        deltaImportQuery="select ... from [dbo].[Package] Package
               inner join [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId
               where PackageVersionId=='${dih.delta.id}'">
  <entity name="PackageTag" pk="ResourceId"
          processor="CachedSqlEntityProcessor"
          cacheKey="ResourceId" cacheLookup="PackageVersion.PackageId"
          query="SELECT ResourceId,[Text] PackageTag from [dbo].[Tag] Tag"
          deltaQuery="SELECT ResourceId,[Text] PackageTag from [dbo].[Tag] Tag
                 where Tag.TimeStamp > '${dataimporter.last_index_time}'"
          parentDeltaQuery="select PackageVersion.PackageVersionId from [dbo].[Package]
                 where Package.Id=${PackageVersion.PackageVersionId}"/>
</entity>
Re: undefined field http:// while searching query
Presuming that this uses the standard Lucene query parser syntax, you have asked to query the field called http, searching for the value //www.google.co.in. See http://wiki.apache.org/solr/SolrQuerySyntax for more details, but you probably want to escape the colon at least: http\://www.google.co.in

On 2 July 2013 07:34, aniljayanti aniljaya...@yahoo.co.in wrote: [...]
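Two concrete request shapes that avoid that parse error, sketched with an assumed default field (df=content is an illustration, not from the original post); either escape the colon or quote the whole URL as a phrase, remembering that the backslash and quotes themselves need URL-encoding (%5C, %22) when sent over HTTP:

    http://localhost:8080/solr/select/?q=http\://www.google.co.in&df=content
    http://localhost:8080/solr/select/?q="http://www.google.co.in"&df=content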
Re: Stemming query in Solr
Somehow we're mis-communicating here. Forget expansion, it's all about base forms. <G>

bq: What I cannot figure out is how is this going to help me in instructing Solr to execute the query for the different grammatical variations of the input search term stem

You don't. You search the stemmed input against the stemmed field (this happens automatically per field). So you get hits on burn, burns, burned, burning when searching for burning, because both the query and index process are working with burn. Note that the _stored_ values that get returned with the fields are all the originals, so you see burns, burning, etc. Your query searches against one or the other field depending on whether you have the exact match checkbox checked or not. You can even do a variant of searching on _both_ with a high boost on the exact_match field, which would _tend_ to sort the documents with exact matches to the top of the list.

Best Erick

On Mon, Jul 1, 2013 at 9:12 AM, snkar soumya@zoho.com wrote:

I was just wondering if another solution might work. If we are able to extract the stem of the input search term (maybe using a C#-based stemmer, some open source implementation of the Porter algorithm) for cases where the stemming option is selected, and submit the query to Solr as a multiple-character wildcard query with respect to the stem, it should return all the different variations of the stemmed word. Example: Search term: burning. Stem: burn. Modified query: burn*. Results: burn, burning, burns, burnt, etc. I am sure this is not the proper way of executing stemming by expansion, but this might just get the job done. What do you think? Trying to think of a test case where this will fail.

On Mon, 01 Jul 2013 03:42:34 -0700 Erick Erickson [via Lucene] <ml-node+s472066n4074311...@n3.nabble.com> wrote:

bq: But looks like it is executing the search for an exact text based match with the stem burn.

Right. You need to appreciate index-time as opposed to query-time stemming. Your field definition has both turned on. The admin/analysis page will help here <G>... At index time, the terms are stemmed, and _only_ the reduced term is put in the index. At query time, the same thing happens and _only_ the reduced term is searched for. By stemming at index time, you lose the original form of the word; it's just gone, and nothing about checking/unchecking the stem bits will recover it. So the general solution is to index the field twice, once with stemming and once without, in order to have the ability to do both stemmed and exact matches. I think I saw a clever approach to doing this involving a custom filter but can't find it now. As I recall, it indexed the un-stemmed version like a synonym with some kind of marker to indicate exact match when necessary.

Best Erick

On Mon, Jul 1, 2013 at 5:15 AM, snkar [hidden email] wrote:

> Hi Erick,
>
> Thanks for the reply.
> Here is what the situation is:
>
> Relevant portion of Solr schema:
>
> <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
> <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
> <field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
> <copyField source="Content" dest="ContentSearch"/>
> <copyField source="Content" dest="ContentSearchStemming"/>
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <fieldType name="text_stem" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> When I am indexing a document, the content gets stored as-is in the Content field and gets copied over to ContentSearch and ContentSearchStemming for text-based search and stemming search respectively. So the ContentSearchStemming field does store the stem/reduced form of the terms. I have checked this with Luke as well as the Admin Schema Browser --> Term Info. In the Admin
Re: documentCache not used in 4.3.1?
This takes some significant custom code, but... one strategy is to keep your commits relatively lengthy (depends on the ingest rate) and keep a sidecar index, either a small core or a RAMDirectory. Then at search time you somehow combine the two results. The somehow is a bit tricky since the scores may not be comparable. If you're sorting it's trivial, but what you describe doesn't sound like it's sorted as opposed to score. Or more accurately, it sounds like you're sorting by score. But none of that is worthwhile if you're getting good enough results as it stands.

Best Erick

On Mon, Jul 1, 2013 at 12:28 PM, Daniel Collins danwcoll...@gmail.com wrote:

Regrettably, visibility is key for us :( Documents must be searchable as soon as they have been indexed (or as near as we can make it). Our old search system didn't do relevance sort, it was time-ordered (so it had a much simpler job), but it did have sub-second latency, and that is what is expected for its replacement (I know Solr doesn't like < 1s currently, but we live in hope!). I tried explaining that by doing a relevance sort we are searching 100% of the collection, instead of the ~10%-20% a time-ordered sort did (it effectively sharded by date and only searched as far back as it needed to fill a page of results), but that tends to get blank looks from business. :) One of life's little challenges.

On 1 July 2013 11:10, Erick Erickson erickerick...@gmail.com wrote:

Daniel: Soft commits invalidate the top-level caches, which include things like filterCache, queryResultCache etc. Various segment-level caches are NOT invalidated, but you really don't have a lot of control from the Solr level over those anyway. But yeah, the tension between caching a bunch of stuff for query speedups and NRT is still with us. Soft commits are much less expensive than hard commits, but not being able to use the caches as much is the price. You're right that with such frequent autocommits, autowarming probably is not worth the effort. The question I always ask is whether 1 second is really necessary. Or, more accurately, worth the price. Often it's not, and lengthening it out significantly may be an option, but that's a discussion for you to have with your product manager <G>. I have seen configurations that have a more frequent hard commit (openSearcher=false) than soft commit. The mantra is: soft commits are about visibility, hard commits are about durability. FWIW, Erick

On Mon, Jul 1, 2013 at 3:40 AM, Daniel Collins danwcoll...@gmail.com wrote:

We see similar results; again we softCommit every 1s (trying to get as NRT as we can), and we very rarely get any hits in our caches. As an unscheduled test last week, we did shut down indexing and noticed about an 80% hit rate in caches (and average query time dropped from ~1s to 100ms!), so I think we are in the same position as you. I appreciate that with such a frequent soft commit the caches get invalidated, but I was expecting cache warming to help, though it doesn't appear to. We *don't* currently run a warming query; my impression of NRT was that it was better not to, as otherwise you spend more time warming the searcher and caches, and by the time you've done all that, the searcher is invalidated anyway!

On 30 June 2013 01:58, Tim Vaillancourt t...@elementspace.com wrote:

That's a good idea, I'll try that next week. Thanks! Tim

On 29/06/13 12:39 PM, Erick Erickson wrote:

Tim: Yeah, this doesn't make much sense to me either since, as you say, you should be seeing some metrics upon occasion. But do note that the underlying cache only gets filled when getting documents to return in query results; since there's no autowarming going on, it may come and go. But you can test this pretty quickly by lengthening your autocommit interval or just not indexing anything for a while, then running a bunch of queries and looking at your cache stats. That'll at least tell you whether it works at all. You'll have to have hard commits turned off (or openSearcher set to 'false') for that check too. Best Erick

On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim tvaillanco...@ea.com wrote:

Yes, we are softCommit'ing every 1000ms, but that should be enough time to see metrics though, right? For example, I still get non-cumulative metrics from the other caches (which are also throw-away). I've also curl/sampled enough that I probably should have seen a value by now. If anyone else can reproduce this on 4.3.1 I will feel less crazy :). Cheers, Tim

-----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Saturday, June 29, 2013 10:13 AM To:
Re: Converting nested data model to solr schema
As you see it, does SOLR-3076 fix my problem? Is the SOLR-3076 fix getting into Solr 4.4?

Mikhail Khludnev wrote:

> On Mon, Jul 1, 2013 at 5:56 PM, adfel70 wrote:
>> This requires me to override the solr document distribution mechanism. I fear that with this solution I may lose some of solr cloud's capabilities.
>
> It's not clear whether you are aware of http://searchhub.org/2013/06/13/solr-cloud-document-routing/, but what you did doesn't sound scary to me. If it works, it should be fine. I'm not aware of any capabilities that you are going to lose. Obviously SOLR-3076 provides astonishing query-time performance, by offloading the actual join work to index time. Check it out if your current approach turns slow.
>
> -- Sincerely yours, Mikhail Khludnev, Principal Engineer, Grid Dynamics http://www.griddynamics.com
Re: Solr 4.3 Pivot Performance Issue
What is the nature of your degradation?

-- Jack Krupansky

-----Original Message----- From: solrUserJM Sent: Tuesday, July 02, 2013 4:22 AM To: solr-user@lucene.apache.org Subject: Solr 4.3 Pivot Performance Issue [...]
Re: need distance in miles not in kilometers
Simply multiply by the number of miles per kilometer, 0.621371:

fl=_dist_:mul(geodist(),0.621371)

-- Jack Krupansky

-----Original Message----- From: irshad siddiqui Sent: Tuesday, July 02, 2013 5:19 AM To: solr-user@lucene.apache.org Subject: need distance in miles not in kilometers [...]
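Put together with the geofilt query from the original post, the whole request might look like this (a sketch; note that the d radius in geofilt is still expressed in kilometers):

    q=*:*&fq={!geofilt}&sfield=latlng&pt=18.9322453,72.8264378001&d=60&fl=_dist_:mul(geodist(),0.621371)&sort=geodist() desc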
Re: need distance in miles not in kilometers
Jack, thanks for your response. In the case of frange we do not want to multiply separately for the conversion, so in that case is there any way to convert it into miles? My query:

http://localhost:8983/solr/select?q=name:shop&fl=name,shopLocation,shopMaxDeliveryDistance,geodist(shopLocation,0.0,0.0)&sort=geodist(shopLocation,0.0,0.0) asc&fq={!frange u=0}sub(geodist(shopLocation,0.0,0.0),shopMaxDeliveryDistance)

I want the result in miles.

On Tue, Jul 2, 2013 at 6:11 PM, Jack Krupansky j...@basetechnology.com wrote:

> Simply multiply by the number of miles per kilometer, 0.621371: fl=_dist_:mul(geodist(),0.621371) -- Jack Krupansky
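Function queries compose, so the same conversion can be applied inside the frange itself (a sketch, assuming shopMaxDeliveryDistance is stored in miles while geodist() returns kilometers):

    fq={!frange u=0}sub(mul(geodist(shopLocation,0.0,0.0),0.621371),shopMaxDeliveryDistance)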
Re: documentCache not used in 4.3.1?
Cheers, it's certainly something we might end up exploring.

On 2 July 2013 12:41, Erick Erickson erickerick...@gmail.com wrote: [...]
Re: Converting nested data model to solr schema
It sounds like 4.4 will have an RC next week, so the prospects for block join in 4.4 are kind of dim. I mean, such a significant feature should have more than a few days to bake before getting released. But... who knows what Yonik has planned!

-- Jack Krupansky

-----Original Message----- From: adfel70 Sent: Tuesday, July 02, 2013 7:41 AM To: solr-user@lucene.apache.org Subject: Re: Converting nested data model to solr schema [...]
Re: Converting nested data model to solr schema
I'm not familiar with block join in Lucene. I've read a bit, and I just want to make sure: do you think that when this ticket is released, it will solve the current problem of SolrCloud joins? Also, can you elaborate a bit on your solution?

Jack Krupansky-2 wrote: [...]
Re: Spell check in SOLR
See http://wiki.apache.org/solr/SpellCheckComponent

On Tue, Jul 2, 2013 at 4:14 PM, Prathik Puthran prathik.puthra...@gmail.com wrote: [...]

-- Regards, Shalin Shekhar Mangar.
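The usual starting point from that wiki page, sketched for a schema with a text field (the field, fieldType, and handler names here are assumptions): a DirectSolrSpellChecker by default only generates suggestions for terms that do not exist in the index, which matches the behaviour asked for:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">text_general</str>
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">text</str>
        <str name="classname">solr.DirectSolrSpellChecker</str>
      </lst>
    </searchComponent>

    <requestHandler name="/spell" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="df">text</str>
        <str name="spellcheck">true</str>
        <str name="spellcheck.count">10</str>
      </lst>
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>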
DIH: HTMLStripTransformer in sub-entities?
Solr 4.1.0.

We've been using the DIH to pull data in from a MySQL database for quite some time now. We're now wanting to strip all the HTML content out of many fields using the HTMLStripTransformer (http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer). Unfortunately, while it seems to be working fine for top-level entities, we can't seem to get it to work for sub-entities (not the exact schema, reduced for example purposes):

<entity name="blocks" dataSource="database" transformer="HTMLStripTransformer"
        query="SELECT id AS blockId, name AS blockTitle, content AS content FROM engagement_block">
  <field column="content" stripHTML="true"/>   <!-- THIS WORKS! -->
  <entity name="blockReplies" dataSource="database" transformer="HTMLStripTransformer"
          query="SELECT br.other_content AS replyContent FROM block_reply">
    <field column="other_content" stripHTML="true"/>   <!-- THIS DOESN'T WORK! -->
  </entity>
</entity>

We've tried several different permutations of putting the sub-entity column in different nest levels of the XML to no avail. I'm curious whether we're trying something that is just not supported or whether we are just trying the wrong things. Thanks, Andy Pickler
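One detail worth checking against this config (an observation about DIH in general, not a verified fix for this schema): the column attribute of a field element has to match the column name as it appears in the result set, and the inner SQL aliases other_content to replyContent, so the transformer would be looking for a column that never arrives. A sketch of the aligned version:

    <field column="replyContent" stripHTML="true"/>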
Solr - working with delta import and cache
I have two entities in a 1:n relation - PackageVersion and Tag. I have configured DIH to use CachedSqlEntityProcessor and everything works as planned. First, the Tag entity is selected using the query attribute, then the main entity. Ultra fast. Now I am adding the delta import. Everything runs and loads, but too slow. Looking at the db profiler output I see:
1. The delta query of the inner entities runs first - which is good.
2. The delta query of the main entity runs later - which is still good.
3. The deltaImportQuery of the main entity runs as a single select per ID - this could be improved by using one "where ... in (...)" over all the results. Is that possible?
4. All of the query attributes of the other tables are running now. This is bad. (In real life I have more than one table in a 1:n connection.) For instance, I get a lot of runs of "select ResourceId,[Text] PackageTag from [dbo].[Tag] Tag Where ResourceType = 0". Because it is from the query attribute, there is no where clause using the ids.
a. How can I fix it?
b. Can I translate the import query to use "where in"?
c. There is no real order for all the selects when requesting deltaImport. Is it possible to implement the caching also when updating delta?
Here is my configuration:

<entity name="PackageVersion" pk="PackageVersionId"
        query="select * from [dbo].[Package] Package inner join [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId"
        deltaQuery="select PackageVersion.Id PackageVersionId from [dbo].[Package] Package inner join [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId where Package.LastModificationTime > '${dataimporter.last_index_time}' OR PackageVersion.Timestamp > '${dih.last_index_time}'"
        deltaImportQuery="select * from [dbo].[Package] Package inner join [dbo].[PackageVersion] PackageVersion on Package.Id = PackageVersion.PackageId Where PackageVersion.Id='${dih.delta.PackageVersionId}'">
  <entity name="PackageTag" pk="ResourceId" processor="CachedSqlEntityProcessor"
          cacheKey="ResourceId" cacheLookup="PackageVersion.PackageId"
          query="select ResourceId,[Text] PackageTag from [dbo].[Tag] Tag Where ResourceType = 0"
          deltaQuery="select ResourceId,[Text] PackageTag from [dbo].[Tag] Tag Where ResourceType = 0 and Tag.TimeStamp > '${dih.last_index_time}'"
          parentDeltaQuery="select PackageVersion.PackageVersionId from [dbo].[Package] where Package.Id=${PackageTag.ResourceId}"/>
</entity>
Solr cloud date based partitioning
Hi guys! I have a simple use case to implement but I have a problem with date based partitioning... Here are some rules: 1. At the beginning I have to create a huge index (10GB) based on one db table. 2. Every day I have to update this index. 3. 99.999% of queries are based on a date field (*data from last 2 months*). So my idea was to create partitions by month and provide the month-based partitions in the query, like in the example in the documentation: http://localhost:8983/solr/collection1/select?shards=shard_200812,shard_200912,shard_201001 I would provide shards only from the last 2 months to gain nice performance. The questions are: how can I create month-based partitions? Is it possible to create a new shard on each new month and update delta data only to this shard? Examples are very welcome. I read the documentation a few times and can't find answers... Thanks! Kowish -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-cloud-date-based-paritioning-tp4074729.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - working with delta import and cache
BTW: Just found out that a delta import is only supported by the SqlEntityProcessor. Does it matter that I defined processor=CachedSqlEntityProcessor? On Tue, Jul 2, 2013 at 5:58 PM, Mysurf Mail stammail...@gmail.com wrote: [...]
How to disable debug in Solrj
Hi, I'm running the jetty start.jar and I'm indexing documents with Solrj's HttpSolrServer object:

SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
server.add(docs);
server.commit();

This leads to TONS of debug information (i.e. logs at level DEBUG), on both server and client sides (but much more on the client side). I've read and tried the methods suggested in: http://wiki.apache.org/solr/SolrLogging#Customizing_Logging http://wiki.apache.org/solr/LoggingInDefaultJettySetup but nothing changed. How can I lower the logging level to INFO or WARN? Thanks, Scott.
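If it helps: assuming the client's classpath uses the slf4j-jdk14 binding (java.util.logging), the client-side flood can be raised programmatically before the server object is created; with a log4j binding, the equivalent lines go in log4j.properties instead. A minimal sketch:

import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietSolrj {
    // Keep strong references so the JUL LogManager doesn't garbage-collect the settings.
    private static final Logger SOLR = Logger.getLogger("org.apache.solr");
    private static final Logger HTTP = Logger.getLogger("org.apache.http");

    public static void quiet() {
        SOLR.setLevel(Level.WARNING);
        HTTP.setLevel(Level.WARNING);  // HttpClient wire logging is the usual DEBUG flood
    }
}

A logging.properties file passed via -Djava.util.logging.config.file achieves the same thing without code.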
Re: Solr cloud date based partitioning
On 2 July 2013 20:05, kowish.adamosh kowish.adam...@gmail.com wrote: Hi guys! I have simple use case to implement but I have problem with date based partitioning... Here are some rules: 1. At the beginning I have to create huge index (10GB) based on one db table. 2. Every day I have to update this index. 3. 99,999% are queries based on date field (*data from last 2 months*). [...] Before you start complicating things, have you measured the performance of having everything in one shard? It is quite likely that a 10GB index would have adequate performance on reasonable hardware. Your mileage may vary, but I would try to measure the performance from a single index first. Regards, Gora
Re: Using per-segment FieldCache or DocValues in custom component?
Where do you get the docid from? Usually it's best to just look at the whole algorithm, e.g. docids come from per-segment readers by default anyway, so ideally you want to access any per-document things from that same segment reader. As far as supporting docvalues, the FieldCache API passes through to docvalues transparently if it's enabled for the field. On Mon, Jul 1, 2013 at 4:55 PM, Michael Ryan mr...@moreover.com wrote: I have some custom code that uses the top-level FieldCache (e.g., FieldCache.DEFAULT.getLongs(reader, "foobar", false)). I'd like to redesign this to use the per-segment FieldCaches so that re-opening a Searcher is fast(er). In most cases, I've got a docId and I want to get the value for a particular single-valued field for that doc. Is there a good place to look to see example code of per-segment FieldCache use? I've been looking at PerSegmentSingleValuedFaceting, but hoping there might be something less confusing :) Also thinking DocValues might be a better way to go for me... is there any documentation or example code for that? -Michael
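A hedged sketch of the per-segment pattern described above, for Lucene 4.2+ (the field name is illustrative; in practice you would keep working against the leaf and docBase you already have rather than resolving a top-level id each time):

import java.util.List;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;

public class PerSegmentLookup {
    // Resolve a top-level docid to its segment, then read from that segment's cache entry.
    public static long getLong(IndexSearcher searcher, int topLevelDocId) throws Exception {
        List<AtomicReaderContext> leaves = searcher.getIndexReader().leaves();
        int idx = ReaderUtil.subIndex(topLevelDocId, leaves);
        AtomicReaderContext leaf = leaves.get(idx);
        FieldCache.Longs values = FieldCache.DEFAULT.getLongs(leaf.reader(), "foobar", false);
        return values.get(topLevelDocId - leaf.docBase);  // convert to a per-segment docid
    }
}

Because each entry is keyed on the segment reader, reopening a searcher only populates caches for the new segments instead of rebuilding one top-level array.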
Re: DIH: HTMLStripTransformer in sub-entities?
On 2 July 2013 20:29, Andy Pickler andy.pick...@gmail.com wrote: Solr 4.1.0 We've been using the DIH to pull data in from a MySQL database for quite some time now. We're now wanting to strip all the HTML content out of many fields using the HTMLStripTransformer ( http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer). Unfortunately, while it seems to be working fine for top-level entities, we can't seem to get it to work for sub-entities: (not exact schema, reduced for example purposes) Please do not do that. This DIH configuration file does not make sense (please see comments below), and we are left guessing in the dark. If the file is too large, you can share it on something like pastebin.com

<entity name="blocks" dataSource="database" transformer="HTMLStripTransformer"
        query="SELECT id AS blockId, name AS blockTitle, content AS content FROM engagement_block">
  <field column="content" stripHTML="true"/>  <!-- THIS WORKS! -->
  <entity name="blockReplies" dataSource="database" transformer="HTMLStripTransformer"
          query="SELECT br.other_content AS replyContent FROM block_reply">
    <field column="other_content" stripHTML="true"/>  <!-- THIS DOESN'T WORK! -->
  </entity>
</entity>

[...] (a) You SELECT replyContent, but the column attribute in the field is named other_content. Nothing should be getting indexed into the field. (b) Why are your entities nested if the inner entity has no relationship to the outer one? Regards, Gora
Re: DIH: HTMLStripTransformer in sub-entities?
Thanks for the quick reply. Unfortunately, I don't believe my company would want me sharing our exact production schema in a public forum, although I realize it makes it harder to diagnose the problem. The sub-entity is a multi-valued field that indeed does have a relationship to the outer entity. I just left off the 'where' clause from the sub-entity, as I didn't believe it was helpful in the context of this problem. We use the convention of... SELECT dbColumnName AS solrFieldName ...so that we can relate the database column name to what we want it to be named in the Solr index. I don't think any of this helps you identify my problem, but I tried to address your questions. Thanks, Andy On Tue, Jul 2, 2013 at 9:14 AM, Gora Mohanty g...@mimirtech.com wrote: [...]
Re: Solr indexer and Hadoop
Yes, I've read directly from NFS. Consider the case where your mapper takes as input a list of the file paths to operate on. Your mapper would load each file one by one using standard java.io.* calls, build a SolrInputDocument out of each one, and submit it to a SolrServer implementation stored as a member field in the mapper during the setup call. Something like this: https://gist.github.com/mdellabitta/5910253 I literally wrote that in the git editor just now, so I don't even know if it compiles, but you can get the idea. Note that the NFS mount has to be live on all of the task nodes. Also, if the number of lines in the input file is small enough, Hadoop might not split it enough for you, so you should use NLineInputFormat. And you should definitely tune the number of running tasks to make sure that you don't destroy your Solr box with lots of traffic. I've used the patch that Anatoli mentions as well, and that does work. Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jul 2, 2013 at 3:17 AM, engy.morsy engy.mo...@bibalex.org wrote: Michael, I understand from your post that I can use the current storage without moving it into Hadoop. I already have the storage mounted via NFS. Does your map function read from the mounted storage directly? If possible, can you please illustrate more on that. Thanks Engy -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4074604.html Sent from the Solr - User mailing list archive at Nabble.com.
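To make that description concrete, here is a hedged sketch of such a mapper. It is not Michael's actual gist; the Solr URL, core and field names are illustrative assumptions.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class NfsIndexingMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    private SolrServer solr;

    @Override
    protected void setup(Context context) {
        // One client per task; created once in setup, as described above.
        solr = new HttpSolrServer("http://solrhost:8983/solr/collection1");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input line is a file path on the NFS mount, visible on every task node.
        String path = line.toString().trim();
        String body = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", path);   // field names are assumptions
        doc.addField("text", body);
        try {
            solr.add(doc);          // batching and commit policy omitted for brevity
        } catch (SolrServerException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) {
        solr.shutdown();            // release the HTTP connection pool
    }
}

Driven by NLineInputFormat over a text file of paths, each mapper gets a fixed number of files, which is also the knob for keeping the indexing traffic to the Solr box under control.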
Re: Newbie SolR - Need advice
Start with the Solr Tutorial. http://lucene.apache.org/solr/tutorial.html -- Jack Krupansky -Original Message- From: fabio1605 Sent: Tuesday, July 02, 2013 11:16 AM To: solr-user@lucene.apache.org Subject: Newbie SolR - Need advice Hi we have a MSSQL Server which is just getting far too large now and performance is dying! The majority of our webservers are mainly doing search functions, so I thought it may be best to move to SolR. But I know very little about it! My questions are! Does SolR run as a bolt-on to MSSQL - as in, the data is still in MSSQL and SolR is just the search bit in between? I'm really struggling to understand the point of SOLR etc, so if someone could point me to a Dummies website I'd appreciate it! Google is throwing too much confusion at me! -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: OOM killer script woes
Please file a JIRA issue so that we can address this. - Mark On Jul 2, 2013, at 6:20 AM, Daniel Collins danwcoll...@gmail.com wrote: On looking at the code in SolrDispatchFilter, is this intentional or not? I think I remember Mark Miller mentioning that in an OOM case, the best course of action is basically to kill the process; there is very little Solr can do once it has run out of memory. Yet it seems that Solr catches the OOM itself and just logs it as an error, rather than letting it go back up to the JVM. We have also seen OOMs in IndexWriter, and that has specific code to handle OOM cases, and seems to fall back to the transaction log (but fail committing anything). I understand the logic of that, but in reality I've seen that the tlog can get corrupted in this case, so we still need to be monitoring the system and forcibly kill the process. On 27 June 2013 00:03, Timothy Potter thelabd...@gmail.com wrote: Thanks for the feedback Daniel ... For now, I've opted to just kill the JVM with System.exit(1) in the SolrDispatchFilter code and will restart it with a Linux supervisor. Not elegant, but the alternative of having a zombie Solr instance walking around my cluster is much worse ;-) Will try to dig into the code that is trapping this error, but for now I've lost too many hours on this problem. Cheers, Tim On Wed, Jun 26, 2013 at 2:43 PM, Daniel Collins danwcoll...@gmail.com wrote: Ooh, I guess Jetty is trapping that java.lang.OutOfMemoryError, and throwing it/packaging it as a java.lang.RuntimeException. The -XX option assumes that the application doesn't handle the Errors and so they would reach the JVM and thus invoke the handler. Since Jetty has an exception handler that is dealing with anything (including Errors), they never reach the JVM, hence no handler. Not much we can do short of not using Jetty? That's a pain, I'd just written a nice OOM handler too! On 26 June 2013 20:37, Timothy Potter thelabd...@gmail.com wrote: A little more to this ... 
Just on the chance this was a weird Jetty issue or something, I tried with the latest Jetty 9 and the problem still occurs :-( This is on Java 7 on debian: java version 1.7.0_21 Java(TM) SE Runtime Environment (build 1.7.0_21-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) Here is an example stack trace from the log:

2013-06-26 19:31:33,801 [qtp632640515-62] ERROR solr.servlet.SolrDispatchFilter Q:22 - null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:445)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.OutOfMemoryError: Java heap space

On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter thelabd...@gmail.com wrote: Recently upgraded to 4.3.1 but this problem has persisted for a while now ... I'm using the following configuration when starting Jetty: -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p If an OOM is triggered during Solr web app initialization (such as by me lowering -Xmx to a value that is too low to initialize Solr with), then the
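Since Jetty swallows the Error before -XX:OnOutOfMemoryError can ever see it, the workaround Tim describes amounts to something like the hedged sketch below (Runtime.halt rather than System.exit, so shutdown hooks can't themselves OOM on the way down; where to call it is your choice, e.g. a catch-all in a dispatch filter).

public final class OomGuard {
    private OomGuard() {}

    // Call from catch blocks that might otherwise swallow Errors.
    public static void dieOnOom(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof OutOfMemoryError) {
                // Hard-exit so an external supervisor can restart the process.
                Runtime.getRuntime().halt(1);
            }
        }
    }
}

Walking the cause chain matters here because, as Daniel observed, the OutOfMemoryError arrives wrapped inside a RuntimeException.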
Re: Unique key error while indexing pdf files
See http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor The implicit fields generated by the FileListEntityProcessor are fileDir, file, fileAbsolutePath, fileSize, fileLastModified, and these are available for use within the entity. On Tue, Jul 2, 2013 at 2:47 PM, archit2112 archit2...@gmail.com wrote: Yes. The absolute path is unique. How do I implement it? Can you please explain? -- View this message in context: http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074638.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Shalin Shekhar Mangar.
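For example, assuming the schema's uniqueKey field is named id (check your own schema.xml for the actual name), mapping the implicit absolute path onto it inside the entity is a single field element: <field column="fileAbsolutePath" name="id"/>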
RE: Newbie SolR - Need advice
Hi Fabio, Like Jack says, try the tutorial. But to answer your question, SOLR isn't a bolt-on to SQLServer or any other DB. It's a fantastically fast indexing/searching tool. You'll need to use the DataImportHandler (see the tutorial) to import your data from the DB into the indices that SOLR uses. Once in there, you'll have more power and flexibility than SQLServer would ever give you! Haven't tried SOLR on Windows (I guess your environment) but I'm sure it'll work using Jetty or Tomcat as web container. Stick with it. The ride can be bumpy but the experience is sensational! DQ -Original Message- From: fabio1605 [mailto:fabio.to...@btinternet.com] Sent: 02 July 2013 16:16 To: solr-user@lucene.apache.org Subject: Newbie SolR - Need advice [...]
RE: Newbie SolR - Need advice
Thanks guys. So SolR is actually a database replacement for mssql... Am I right? We have a lot of perl scripts that contain lots of sql insert queries, etc. How do we query the SolR database from scripts? I know I have a lot to learn still, so excuse my ignorance. Also... what is mongo and how does it compare? I just don't understand how in 10 years of Web development I have never heard of SolR till last week. Sent from Samsung Mobile Original message From: David Quarterman [via Lucene] ml-node+s472066n4074772...@n3.nabble.com Date: 02/07/2013 16:57 (GMT+00:00) To: fabio1605 fabio.to...@btinternet.com Subject: RE: Newbie SolR - Need advice [...] -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746p4074782.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr cloud date based partitioning
Hi, There is nothing automatic that I know of that will create shards (or maybe you mean SolrCloud Collections?) every month. You can do that in your application, though: just create the Collection via the API. You can make use of aliases to have something like a last2months alias point to your last 2 Collections. You would shift this alias every month after you create your new Collection. Of course, right after the shift, you would really be searching only 1 month's worth of data, so you may want to allow searching across the last 3 Collections instead, optionally enforcing/limiting the query to the last 2 months based on document date and a range query. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 2, 2013 at 10:35 AM, kowish.adamosh kowish.adam...@gmail.com wrote: [...]
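A rough sketch of that monthly rollover via the Collections API (collection and alias names are illustrative, and CREATEALIAS requires a Solr version that already has collection aliasing):

http://host:8983/solr/admin/collections?action=CREATE&name=logs_201307&numShards=1&replicationFactor=2
http://host:8983/solr/admin/collections?action=CREATEALIAS&name=last2months&collections=logs_201306,logs_201307

Queries then go to /solr/last2months/select, so the application only ever touches the alias and never the per-month collection names.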
Re: How to re-index Solr get term frequency within documents
Hi Tony, There is, you can do it with that SolrEntityProcessor I pointed out, if you have all your fields stored in Solr. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 2, 2013 at 2:00 AM, Tony Mullins tonymullins...@gmail.com wrote: [...]
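For reference, a minimal DIH config using SolrEntityProcessor to copy documents out of an existing core looks roughly like this (the source URL and row size are illustrative, and it only works if every field you need is stored):

<dataConfig>
  <document>
    <entity name="reindex" processor="SolrEntityProcessor"
            url="http://localhost:8080/solr/oldcore"
            query="*:*" rows="500"/>
  </document>
</dataConfig>

You then run a full-import on the target core, against its new schema, the same way as any other DIH import.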
Re: Newbie SolR - Need advice
Hi Fabio, No, Solr isn't a database replacement for MS SQL. Solr is built on top of Lucene, which is a search engine library for text searches. Solr in itself is not a replacement for any database as it does not support any relational db features; however, as Jack and David mentioned, it's a fully optimised search engine platform that can provide all search related features like faceting, highlighting etc. Solr does not have a *database*. It stores the data in binary files called indexes http://lucene.apache.org/core/3_0_3/fileformats.html. These indexes are populated with the data from the database. Solr provides inbuilt functionality through the DataImportHandler component to get the data and generate indexes. When you say your web servers are mainly doing search functions, do you mean it is a text search and you use queries with clauses such as 'like', 'in' etc. (in addition to multiple joins) to get the results? Does the web application need faceting? If yes, then solr can be your friend to get it through. Do remember that it always takes some time to get new concepts from understanding through to implementation. As David mentioned already, it *is* going to be a bumpy ride at the start but *definitely* a sensational one. Good Luck, Sandeep On 2 July 2013 17:09, fabio1605 fabio.to...@btinternet.com wrote: [...] -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746p4074782.html Sent from the Solr - User mailing list archive at Nabble.com.
set-based and other less common approaches to search
Let's say I wanted to ask solr to find me any document that contains at least 100 out of some 300 search terms I give it. Can Solr do this out of the box? If not, what kind of customization would it require? Now let's say I want to further have the option to request that those terms a) must show up within the same column of an excel spreadsheet, or b) are exact matches (i.e. match on 'search', but not 'searched'), or c) occur in the exact order that I specified, or d) occur contiguously and without any words in between, or e) are made up of non-word elements such as 92228345 or SJA12334. Can solr do any of these out of the box? If not, which of these tasks is relatively easy to do with some custom code, and which is not?
Re: set-based and other less common approaches to search
Hi, Solr can do all of these. There are phrase queries, queries where you specify a field, the mm param for min should match, etc. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 2, 2013 at 12:36 PM, gilawem mewa...@gmail.com wrote: [...]
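To sketch what that maps onto concretely (handler path and field contents are illustrative): the at-least-100-of-300 case is the eDismax mm parameter, e.g.

http://localhost:8983/solr/select?defType=edismax&q=term1 term2 ... term300&mm=100

Cases (c) and (d) are phrase queries: q="red widget assembly" requires the terms contiguous and in order, while "red widget assembly"~5 relaxes that with up to 5 positions of slop. Exact matching as in (b) is really a schema decision - search against a field whose analyzer does not stem, such as a copyField of type string or a text type without a stemming filter - and codes like SJA12334 in (e) likewise come down to choosing a tokenizer that keeps them intact.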
Tomcat Solr Server startup fails with FileNotFoundException
Hi All, I am a newbie to solr. I've accidentally deleted indexed files (manually, using rm -rf) on the server from the solr index folder. From then on, whenever I start my server it fails to start with an FNF exception. How can this be fixed quickly? I'd appreciate it if anyone can suggest a quick fix for this. INFO: created /elevate: solr.SearchHandler Jul 1, 2013 8:17:40 PM org.apache.solr.common.SolrException log SEVERE: java.lang.RuntimeException: java.io.FileNotFoundException: /solr/index/_bbx.fnm (No such file or directory) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1099) I am seeing the below exception as well. Can you please help me with these 2 exceptions? Let me know if you need any other details on this. 2013-07-01 20:18:00 TaskUtils$LoggingErrorHandler [ERROR] Unexpected error occurred in scheduled task. org.apache.solr.common.SolrException: Internal Server Error Internal Server Error request: http://servername:8080/solr/admin/ping?wt=javabin&version=2 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:432) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246) -- Thanks and Regards, Murthy P D N S.
Re: Newbie SolR - Need advice
Hi Ok, I'm even more confused now... Sorry for even more stupid questions. So if it's not a database replacement, where do we keep the database then? We have a website that is a documentation website that stores documents. It has over 130 million records in a table and 50 million in 2 others, plus lots of little tables. Most searches are like searching on references or for customer information etc. However, with so much information stored, ms sql is starting to get slower. We have approx 100 tables across 4 different databases. So this is why I started to look at SolR. Q1: if we used SolR, would we still use sql as well as SolR, or does SolR become sql (speaking theoretically)? Q2: if so... how do we move all the data across to SolR? Q3: is SolR useful for what we need, or is sql the better option based on our circumstances? 50 percent of our load is from a website... 50 percent is from scripts adding the information to the site etc. Sorry for the silly questions, I'm just getting really confused now. Sent from Samsung Mobile Original message From: Sandeep Mestry [via Lucene] ml-node+s472066n4074795...@n3.nabble.com Date: 02/07/2013 17:29 (GMT+00:00) To: fabio1605 fabio.to...@btinternet.com Subject: Re: Newbie SolR - Need advice [...]
Re: Newbie SolR - Need advice
Consider DataStax Enterprise - it combines Cassandra for NoSql data storage with Solr for indexing - fully integrated. http://www.datastax.com/ -- Jack Krupansky -Original Message- From: fabio1605 Sent: Tuesday, July 02, 2013 12:44 PM To: solr-user@lucene.apache.org Subject: Re: Newbie SolR - Need advice [...]
Re: Newbie SolR - Need advice
On 7/2/2013 10:09 AM, fabio1605 wrote: [...] Solr is not really a database. Solr 4.x has a lot of features that make it function well in some limited NoSQL roles, but it's a search engine, not a database. It is a good idea to use the stored setting on your Solr fields only for those fields that are required to fully display a search result listing, then use your database as the canonical data store for displaying full information for a single search result when the user clicks on it. Aside from letting you know that it's not a good idea to give Microsoft your money, I can't really say anything bad about MSSQL. If it's working for you and your performance (aside from search) is good, there's no real reason to move away from it as a data repository. MongoDB is a NoSQL database. That would be a candidate for replacing MSSQL. Whether or not it could actually replace it depends on your data model. Thanks, Shawn
Re: Newbie SolR - Need advice
Solr is not a database and it does not handle SQL queries. --wunder On Jul 2, 2013, at 9:09 AM, fabio1605 wrote: [...] -- Walter Underwood wun...@wunderwood.org
Re: Newbie SolR - Need advice
Arrfh, I see... So SolR is the search engine for a datastore? Is that what mongo is... a datastore bit? Sent from Samsung Mobile Original message From: Jack Krupansky-2 [via Lucene] ml-node+s472066n4074809...@n3.nabble.com Date: 02/07/2013 17:51 (GMT+00:00) To: fabio1605 fabio.to...@btinternet.com Subject: Re: Newbie SolR - Need advice [...]
Re: Tomcat Solr Server startup fails with FileNotFoundException
On 7/2/2013 9:39 AM, Murthy Perla wrote: I am newbie to solr. I've accidentally deleted indexed files(manually using rm -rf command) on server from solr index folder. Then on when ever I start my server its failing to start with FNF exception. How can this be fixed quickly? I believe this happens when you delete files in the index directory but don't delete the index directory itself. Try removing the entire directory. Thanks, Shawn
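Given the path in the stack trace above, that would be something like rm -rf /solr/index (the directory itself, not just its contents) followed by a restart; Solr should then create a fresh, empty index directory on startup. That assumes /solr/index really is the core's index dir - check the dataDir setting in solrconfig.xml if unsure, and remember the deleted documents are gone until you reindex from the original source.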
RE: Newbie SolR - Need advice
Don’t worry Fabio - nobody knows everything (apart from Hossman). Following on from Sandeep, to use SOLR, you extract the data from your MSSQL DB using the DataImportHandler and you can then query it, facet it, pivot it to your heart's content. And fast! You can use almost anything to build the SOLR queries - Java PHP being probably most popular. There is a library for Perl I think but never tried it. So, you keep your mssql database, you just don't use it for searches - that'll relieve some of the load. Searches then all go through SOLR its Lucene indexes. If your various tables need SQL joins, you specify those in the DataImportHandler (DIH) config. That way, when SOLR indexes everything, it indexes the data the way you want to see it. DIH handles the data export from mssql - SOLR and it's not too difficult to set up. You imply you're adding (inserting) data. How much, how often? DIH has a delta import feature so you can add data on the fly to SOLR's indexes. Much of it come down to the data model you have. My advice would be try it and see. You will be pleasantly surprised! -Original Message- From: fabio1605 [mailto:fabio.to...@btinternet.com] Sent: 02 July 2013 17:10 To: solr-user@lucene.apache.org Subject: RE: Newbie SolR - Need advice Thanks guys So SolR is actually a database replacement for mssql... Am I right We have a lot of perl scripts that contains lots of sql insert queries. Etc How do we query the SolR database from scripts I know I have a lot to learn still so excuse my ignorance. Also... What is mongo and how does it compare I just don't understand how in 10years of Web development I have never heard of SolR till last week Sent from Samsung Mobile Original message From: David Quarterman [via Lucene] ml-node+s472066n4074772...@n3.nabble.com Date: 02/07/2013 16:57 (GMT+00:00) To: fabio1605 fabio.to...@btinternet.com Subject: RE: Newbie SolR - Need advice Hi Fabio, Like Jack says, try the tutorial. But to answer your question, SOLR isn't a bolt on to SQLServer or any other DB. It's a fantastically fast indexing/searching tool. You'll need to use the DataImportHandler (see the tutorial) to import your data from the DB into the indices that SOLR uses. Once in there, you'll have more power flexibility than SQLServer would ever give you! Haven't tried SOLR on Windows (I guess your environment) but I'm sure it'll work using Jetty or Tomcat as web container. Stick with it. The ride can be bumpy but the experience is sensational! DQ -Original Message- From: fabio1605 [mailto:[hidden email]] Sent: 02 July 2013 16:16 To: [hidden email] Subject: Newbie SolR - Need advice Hi we have a MSSQL Server which is just getting far to large now and performance is dying! the majority of our webservers mainly are doing search function so i thought it may be best to move to SolR But i know very little about it! My questions are! Does SolR Run as a bolt on to MSSQL - as in the data is still in MSSQL and SolR is just the search bit between? Im really struggling to understand the point of SOLR etc so if someone could point me to a Dummies website id apprecaite it! google is throwing to much confusion at me! -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746.html Sent from the Solr - User mailing list archive at Nabble.com. If you reply to this email, your message will be added to the discussion below: http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746p4074772.html To unsubscribe from Newbie SolR - Need advice, click here. 
Re: Solr large boolean filter
Hello @, This thread 'kicked' me into finishing some long-past task of sending/receiving a large boolean (bitset) filter. We have been using bitsets with solr before, but now I sat down and wrote it as a qparser. The use cases, as you have discussed, are: - the necessity to send a long list of ids as a query (where it is not possible to do it the 'normal' way) - or filtering ACLs It works in the following way: - the external application constructs a bitset and sends it as a query to solr (q or fq, depends on your needs) - solr unpacks the bitset (translating bits into lucene ids, if necessary) and wraps this into a query which then has the easy job of 'filtering' wanted/unwanted items. Therefore it is good only if you can search against something that is indexed as an integer (id's often are). A simple benchmark shows acceptable performance: to send the bitset (randomly populated, 10M, with 4M bits set), it takes 110ms (25+64+20); to decode this string (resulting byte size 1.5Mb!) it takes ~90ms (5+14+68ms). But I haven't tested latency of sending it over the network and the query performance; since the query is very similar to MatchAllDocs, it is probably very fast (and I know that sending many Mbs to Solr is fast as well). I know this is not exactly a 'standard' solution, and it is probably not something you want to see with hundreds of millions of docs, but people seem to be doing 'not the right thing' all the time ;) So if you think this is something useful for the community, please let me know. If somebody would be willing to test it, I can file a JIRA ticket. Thanks! Roman The code, if no JIRA is needed, can be found here: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java

839ms. run
154ms. Building random bitset indexSize=1000 fill=0.5 -- Size=15054208,cardinality=3934477 highestBit=999
25ms. Converting bitset to byte array -- resulting array length=125
20ms. Encoding byte array into base64 -- resulting array length=168 ratio=1.344
62ms. Compressing byte array with GZIP -- resulting array length=1218602 ratio=0.9748816
20ms. Encoding gzipped byte array into base64 -- resulting string length=1624804 ratio=1.2998432
5ms. Decoding gzipped byte array from base64
14ms. Uncompressing decoded byte array
68ms. Converting from byte array to bitset
743ms. running

On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson erickerick...@gmail.com wrote: Not necessarily. If the auth tokens are available on some other system (DB, LDAP, whatever), one could get them in the PostFilter and cache them somewhere since, presumably, they wouldn't be changing all that often. Or use a UserCache and get notified whenever a new searcher was opened and regenerate or purge the cache. Of course you're right if the post filter does NOT have access to the source of truth for the user's privileges. FWIW, Erick On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, The unfortunate thing about this is that you still have to *pass* that filter from the client to the server every time you want to use that filter. If that filter is big/long, passing that in all the time has some price that could be eliminated by using server-side named filters. Otis -- Solr & ElasticSearch Support http://sematext.com/ On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson erickerick...@gmail.com wrote: You might consider post filters.
The idea is to write a custom filter that gets applied after all other filters etc. One use-case here is exactly ACL lists, and it can be quite helpful if you're not doing *:* type queries. Best Erick On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Btw. ElasticSearch has a nice feature here. Not sure what it's called, but I call it named filter. http://www.elasticsearch.org/blog/terms-filter-lookup/ Maybe that's what OP was after? Otis -- Solr & ElasticSearch Support http://sematext.com/ On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov ivkus...@gmail.com wrote: So I'm using query like http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29 If the IDs are purely numeric, I wonder if the better way is to send a bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000 is included. Even using URL-encoding rules, you can fit at least 65 sequential ID flags per character and I am sure there are more efficient encoding schemes for long empty sequences. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
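As a rough illustration of the client side of the bitset idea discussed in this thread, along the lines Roman benchmarks above (BitSet -> bytes -> GZIP -> base64), here is a minimal sketch using only the JDK. This is an illustration, not the code from the linked plugin; the decode on the Solr side is the mirror image:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.BitSet;
import java.util.zip.GZIPOutputStream;
import javax.xml.bind.DatatypeConverter;

public class BitSetEncoder {
    public static String encode(BitSet bits) throws IOException {
        byte[] raw = bits.toByteArray();                 // BitSet -> byte array (Java 7+)
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(bos);
        gzip.write(raw);                                 // compress the raw bytes
        gzip.close();                                    // close() writes the GZIP trailer
        return DatatypeConverter.printBase64Binary(bos.toByteArray()); // -> base64 string for the q/fq value
    }

    public static void main(String[] args) throws IOException {
        BitSet bits = new BitSet();
        bits.set(1); bits.set(5); bits.set(42);          // flag the ids to keep
        System.out.println(encode(bits));
    }
}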
Re: Concurrent Modification Exception
Anyone, any suggestions or pointers for this issue?
Re: set-based and other less common approaches to search
Thanks. So following up on a) below, could I set up and query Solr, without any customization of code, to match 10 of my given 20 terms, but only if it finds those 10 terms in an xls document under a column that is named MyID or My ID or My I.D.? If so, what would that query look like? On Jul 2, 2013, at 12:38 PM, Otis Gospodnetic wrote: Hi, Solr can do all of these. There are phrase queries, queries where you specify a field, the mm param for min-should-match, etc. Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 2, 2013 at 12:36 PM, gilawem mewa...@gmail.com wrote: Let's say I wanted to ask solr to find me any document that contains at least 100 out of some 300 search terms I give it. Can Solr do this out of the box? If not, what kind of customization would it require? Now let's say I want to further have the option to request that those terms a) must show up within the same column of an excel spreadsheet, or b) are exact matches (i.e. 'search' matches, but 'searched' does not), or c) occur in the exact order that I specified, or d) occur contiguously and without any words in between, or e) are made up of non-word elements such as 92228345 or SJA12334. Can solr do any of these out of the box? If not, what of these tasks is relatively easy to do with some custom code, and what is not?
Re: Solr large boolean filter
Wrong link to the parser, should be: https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla roman.ch...@gmail.com wrote: [...]
How to show just the parent domains from results in Solr
hi All, I've indexed documents in my Solr 4.0 index, with fields like URL, page_content etc. Now when I run a search query against the page_content, I get a lot of urls. Say I have in total 15 URL domains, and under these 15 domains I've all the pages indexed in SOLR. Is there a way in which I can just get the parent URLs for search results instead of getting all the urls? For example, say searching for abc returns: www.aa.com/11.html www.aa.com/12.html www.aa.com/13.html www.bb.com/15.html www.bb.com/18.html I want the results to be like this: www.aa.com www.bb.com Is there a way in SOLR through which I can achieve this? I've tried FieldCollapsing [ https://wiki.apache.org/solr/FieldCollapsing ] but either it's not the right solution or I'm not able to use it properly. Could someone help me find the solution to the above problem. Thanks in advance. Regards, KK
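One common way to get this with FieldCollapsing is to index the domain part of the URL into its own string field and group on it; the field name and URL below are assumptions, not KK's schema:

<field name="domain" type="string" indexed="true" stored="true"/>

http://localhost:8080/solr/select?q=abc&df=page_content&group=true&group.field=domain&group.limit=1

Each group then corresponds to one parent domain; group.limit controls how many pages come back per domain, and adding group.ngroups=true reports the number of distinct domains matched.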
Re: Solr cloud date based partitioning
Thanks! I have a very limited response time (max 100ms), therefore sharding is a must. The data also tends to grow to tens of gigs. Is there any way to create a new logical shard at runtime? I want to logically partition my data by date. I'm still wondering how the example from the documentation is implemented: 'Query specific shard ids of the (implicit) collection. In this example, the user has partitioned the index by date, creating a new shard every month: http://localhost:8983/solr/collection1/select?shards=shard_200812,shard_200912,shard_201001' Even in the first full load I don't know how to do it... In all the examples I can see that data are distributed physically by uniqueId % coreNum. Are there some examples of a custom (i.e. date based) sharding strategy? I can see that in JIRA: https://issues.apache.org/jira/browse/SOLR-2592 there is something that may help, but I can't find anything in the documentation. Thanks for help! Kowish
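For what the quoted documentation example implies in practice: the monthly shards there are plain cores that the user creates and names themselves (there is no automatic date router in this version), for example via the CoreAdmin API. The host, names and paths below are illustrative:

http://localhost:8983/solr/admin/cores?action=CREATE&name=shard_201307&instanceDir=shards/shard_201307

The indexing application then sends each document to the core matching its date, and queries list the relevant cores in the shards parameter as shown in the documentation example above.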
Re: Solr large boolean filter
Hello Roman, Don't you consider passing the long id sequence as the body and accessing it internally in solr as a content stream? It makes the base64 compression unnecessary. AFAIK url length is limited somehow, anyway. On Tue, Jul 2, 2013 at 9:32 PM, Roman Chyla roman.ch...@gmail.com wrote: [...]
Re: set-based and other less common approaches to search
Try the dismax query parser, specifying the mm and qf parameters. On Tue, Jul 2, 2013 at 9:31 PM, gilawem mewa...@gmail.com wrote: [...] -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
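To make that concrete against the earlier question (the field name and the terms are placeholders), a request that requires at least 10 of 20 optional terms to match could look like:

http://localhost:8983/solr/select?defType=dismax&qf=content&mm=10&q=term1+term2+term3+...+term20

Here mm=10 is the min-should-match constraint and qf lists the fields searched. The per-column restriction from the original question would additionally require indexing each spreadsheet column into its own field so that qf can target it.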
Re: Converting nested data model to solr schema
during indexing, the whole block (doc and its attachments) goes into a particular shard; then it can be queried per every shard and the results are merged. btw, do you feel any problem with your current approach - query time joins and out-of-the-box shard routing? On Tue, Jul 2, 2013 at 5:19 PM, adfel70 adfe...@gmail.com wrote: I'm not familiar with block join in lucene. I've read a bit, and I just want to make sure - do you think that when this ticket is released, it will solve the current problem of solr cloud joins? Also, can you elaborate a bit about your solution? Jack Krupansky-2 wrote: It sounds like 4.4 will have an RC next week, so the prospects for block join in 4.4 are kind of dim. I mean, such a significant feature should have more than a few days to bake before getting released. But... who knows what Yonik has planned! -- Jack Krupansky -Original Message- From: adfel70 Sent: Tuesday, July 02, 2013 7:41 AM To: solr-user@lucene.apache.org Subject: Re: Converting nested data model to solr schema As you see it, does SOLR-3076 fix my problem? Is the SOLR-3076 fix getting into solr 4.4? Mikhail Khludnev wrote: On Mon, Jul 1, 2013 at 5:56 PM, adfel70 wrote: This requires me to override the solr document distribution mechanism. I fear that with this solution I may lose some of solr cloud's capabilities. It's not clear whether you are aware of http://searchhub.org/2013/06/13/solr-cloud-document-routing/, but what you did doesn't sound scary to me. If it works, it should be fine. I'm not aware of any capabilities that you are going to lose. Obviously SOLR-3076 provides astonishing query time performance, by offloading the actual join work to index time. Check it if your current approach turns slow. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
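For context, the 'whole block' Mikhail mentions is Lucene's atomic block indexing, which SOLR-3076 builds on: children and the parent are added together and stay contiguous in the index. A minimal Lucene-level sketch with invented field names, not the patch itself:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class BlockIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43)));
        List<Document> block = new ArrayList<Document>();
        Document attachment = new Document();                  // children go first...
        attachment.add(new StringField("type", "attachment", Store.YES));
        block.add(attachment);
        Document doc = new Document();                         // ...the parent goes last
        doc.add(new StringField("type", "doc", Store.YES));
        block.add(doc);
        writer.addDocuments(block);  // added atomically; the block stays contiguous, so it lands in one shard's index
        writer.close();
    }
}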
copyField and storage requirements
Newbie question: We have the following fields defined in the schema:

<field name="content" type="text_general" indexed="true" stored="false"/>
<field name="teaser" type="text_general" indexed="false" stored="true"/>
<copyField source="content" dest="teaser" maxChars="80"/>

The content field is about 500KB of data. My question is whether Solr stores the entire contents of that 500KB content field? We want to minimize the stored data in the Solr index, that is why we added the copyField teaser. Thanks Saqib
Request to Edit Solr Wiki
Hi I'd like to contribute to some of the pages in the Solr Wiki at wiki.apache.org/solr My username is VivekShivaprabhu (alias: vivekrs) Please do the needful. Thanks in advance! -Vivek R S
Re: Request to Edit Solr Wiki
Done, added VivekShivaprabhu to the Solr contributors' group. Let us know if you need the alias instead. And thanks for helping with the Wiki! Erick On Tue, Jul 2, 2013 at 1:42 PM, Vivek Shivaprabhu vivekrs@gmail.com wrote: Hi I'd like to contribute to some of the pages in the Solr Wiki at wiki.apache.org/solr My username is VivekShivaprabhu (alias: vivekrs) Please do the needful. Thanks in advance! -Vivek R S
Re: Two instances of solr - the same datadir?
As I discovered, it is not good to use the 'native' locktype in this scenario - actually there is a note in the solrconfig.xml which says the same. When a core is reloaded and solr tries to grab the lock, it will fail - even if the instance is configured to be read-only. So I am using the 'single' lock for the readers and 'native' for the writer, which seems to work OK. roman On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com wrote: I have auto commit after 40k recs/1800secs. But I only tested with manual commit; I don't see why it should work differently. Roman On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote: If it makes you feel better, I also considered this approach when I was in the same situation with a separate indexer and searcher on one physical linux machine. My main concern was re-using the FS cache between both instances - if I replicated to myself there would be two independent copies of the index, FS-cached separately. I like the suggestion of using autoCommit to reload the index. If I'm reading that right, you'd set an autoCommit on 'zero docs changing', or just 'every N seconds'? Did that work? Best of luck! Tim On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote: So here it is for the record how I am solving it right now: Write-master is started with: -Dmontysolr.warming.enabled=false -Dmontysolr.write.master=true -Dmontysolr.read.master=http://localhost:5005 Read-master is started with: -Dmontysolr.warming.enabled=true -Dmontysolr.write.master=false solrconfig.xml changes:

1. all index changing components have this bit, enable="${montysolr.master:true}" - ie.

<updateHandler class="solr.DirectUpdateHandler2" enable="${montysolr.master:true}">

2. for cache warming de/activation:

<listener event="newSearcher" class="solr.QuerySenderListener" enable="${montysolr.enable.warming:true}">...

3. to trigger refresh of the read-only-master (from write-master):

<listener event="postCommit" class="solr.RunExecutableListener" enable="${montysolr.master:true}">
  <str name="exe">curl</str>
  <str name="dir">.</str>
  <bool name="wait">false</bool>
  <arr name="args"><str>${montysolr.read.master:http://localhost}/solr/admin/cores?wt=json&amp;action=RELOAD&amp;core=collection1</str></arr>
</listener>

This works. I still don't like the reload of the whole core, but it seems like the easiest thing to do now. -- roman On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Peter, Thank you, I am glad to read that this usecase is not alien. I'd like to make the second instance (searcher) completely read-only, so I have disabled all the components that can write. (being lazy ;)) I'll probably use http://wiki.apache.org/solr/CollectionDistribution to call the curl after commit, or write some IndexReaderFactory that checks for changes. The problem with calling the 'core reload' is that it seems a lot of work for just opening a new searcher - eeekkk... somewhere I read that it is cheap to reload a core, but re-opening the index searchers must be definitely cheaper... roman On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge peter.stu...@gmail.com wrote: Hi, We use this very same scenario to great effect - 2 instances using the same dataDir with many cores - 1 is a writer (no caching), the other is a searcher (lots of caching). To get the searcher to see the index changes from the writer, you need the searcher to do an empty commit - i.e. you invoke a commit with 0 documents. This will refresh the caches (including autowarming), [re]build the relevant searchers etc.
and make any index changes visible to the RO instance. Also, make sure to use <lockType>native</lockType> in solrconfig.xml to ensure the two instances don't try to commit at the same time. There are several ways to trigger a commit: Call commit() periodically within your own code. Use autoCommit in solrconfig.xml. Use an RPC/IPC mechanism between the 2 instance processes to tell the searcher the index has changed, then call commit when called (more complex coding, but good if the index changes on an ad-hoc basis). Note, doing things this way isn't really suitable for an NRT environment. HTH, Peter On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com wrote: Replication is fine, I am going to use it, but I wanted it for instances *distributed* across several (physical) machines - but here I have one physical machine, it has many cores. I want to run 2 instances of solr because I think it has these benefits: 1) I can give less RAM to the writer (4GB), and use more RAM for the searcher (28GB) 2) I can deactivate warming for the writer and keep it for the searcher
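For reference, the empty commit Peter describes can be issued from anywhere with a plain HTTP call against the searcher instance - host, port and core name here are assumptions:

curl 'http://localhost:8983/solr/collection1/update?commit=true'

No documents are sent; the commit simply causes the searcher instance to re-open its searchers and caches against the index the writer last committed.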
Re: copyField and storage requirements
On 7/2/2013 12:22 PM, Ali, Saqib wrote: Newbie question: We have the following fields defined in the schema:

<field name="content" type="text_general" indexed="true" stored="false"/>
<field name="teaser" type="text_general" indexed="false" stored="true"/>
<copyField source="content" dest="teaser" maxChars="80"/>

The content field is about 500KB of data. My question is whether Solr stores the entire contents of that 500KB content field? We want to minimize the stored data in the Solr index, that is why we added the copyField teaser. With that config, the entire 500KB will not be _stored_ ... but it will affect the index size because you are indexing it. Exactly to what degree depends on the definition of the text_general type. Thanks, Shawn
Re: Solr large boolean filter
Hello Mikhail, Yes, GET is limited, but POST is not - so I just wanted it to work the same way in both cases. But I am not sure I understand your question completely. Could you elaborate on the parameters/body part? Is there no need for encoding of binary data inside the body? Or do you mean it is treated as a string? Or is it just a bytestream while the other parameters are seen as strings? On a general note: my main concern was to send many ids fast. If we use ints (32bit), one can fit ~250K in one MB; with a bitset, 32 times more. But certainly, if the bitset is sparse or the collection of ids is just 'a few thousands', a stream of ints/longs will be smaller and better to use. roman On Tue, Jul 2, 2013 at 2:00 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: [...]
Re: Two instances of solr - the same datadir?
Hmmm, a single lock sounds dangerous. It probably works ok because you've been [un]lucky. For example, even with a RO instance, you still need to do a commit in order to reload caches/changes from the other instance. What happens if this commit gets called in the middle of the other instance's commit? I've not tested this scenario, but it's very possible that with a 'single' lock the results are indeterminate. If the 'single' lock mechanism is making assumptions, e.g. that no other process will interfere, and then one does, the Lucene index could very well get corrupted. For the error you're seeing using 'native': we use the native lockType for both write and RO instances, and it works fine - no contention. Which version of Solr are you using? Perhaps there's been a change in behaviour? Peter On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com wrote: [...]
Re: Two instances of solr - the same datadir?
Interesting, we are running 4.0 - and solr will refuse to start (or reload) the core. But from looking at the code I am not seeing that it does any writing - though I should dig more... Are you sure it needs to do writing? Because I am not calling commits; in fact I have deactivated *all* components that write into the index, so unless there is something deep inside which automatically calls commit, it should never happen. roman On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com wrote: [...]
Filter cache pollution during sharded edismax queries
Hi all, After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio had dropped significantly. Previously it was at 95+%, but now it's 50%. I enabled recording 100 entries for debugging, and in looking at them it seems that edismax (and faceting) is creating entries for me. This is in a sharded setup, so it's a distributed search. If I do a search for the string 'bogus text' using edismax on two fields, I get an entry in each of the shard's filter caches that looks like: item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2): Is this expected? I have a similar situation happening during faceted search, even though my fields are single-value/untokenized strings, and I'm not using the enum facet method. But I'll get many, many entries in the filterCache for facet values, and they all look like item_facet field:facet value: The net result of the above is that even with a very big filterCache size of 2K, the hit ratio is still only 60%. Thanks for any insights, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr
Re: Replicating files containing external file fields
Jack and Erick, Thanks for your replies. I am able to replicate ext file fields by specifying the relative paths for each individual file. confFiles in solrconfig.xml is really long now with lots of ../ and I have 5 ext file field files. Would be really nice if wild-cards were supported here :-). About the reloadCache on slave: following http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes I set up listeners to reload the ext file fields after commits. Since the slave replicationHandler issues a commit after it replicates the files (as mentioned in https://wiki.apache.org/solr/SolrReplication#How_does_the_slave_replicate.3F), I believe the ext file fields get reloaded into the slave cache after replication. This is exactly what I was looking for. On Fri, Jun 28, 2013 at 5:08 PM, Jack Krupansky j...@basetechnology.com wrote: Yes, you need to list that EFF file in the confFiles list - only those listed files will be replicated.

<str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml,/var/solr-data/List/external_*</str>

Oops... sorry, no wildcards... you must list the individual files. Technically, the path is supposed to be relative to the Solr collection conf directory, so you MAY have to put lots of ../ in the path to get to the files, like: ../../../../solr-data/List/external_1 for each file. (This is what Erick was referring to.) Sorry, I don't have the answer to the reload question at the tip of my tongue. -- Jack Krupansky -Original Message- From: Arun Rangarajan Sent: Friday, June 28, 2013 7:42 PM To: solr-user@lucene.apache.org Subject: Re: Replicating files containing external file fields Jack, Here is the ReplicationHandler definition from solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://${master.ip}:${master.port}/solr/${solr.core.name}/replication</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>

The confFiles are under the dir /var/solr/application-cores/List/conf and the external file fields are like /var/solr-data/List/external_*. Should I add /var/solr-data/List/external_* to confFiles like this?

<str name="confFiles">solrconfig.xml,data-config.xml,schema.xml,stopwords.txt,synonyms.txt,elevate.xml,/var/solr-data/List/external_*</str>

Also, can you tell me when (or whether) I need to do reloadCache on the slave after the ext file fields are replicated? Thx. On Fri, Jun 28, 2013 at 10:13 AM, Jack Krupansky j...@basetechnology.com wrote: Show us your confFiles directive. Maybe there is some subtle error in the file name. -- Jack Krupansky -Original Message- From: Arun Rangarajan Sent: Friday, June 28, 2013 1:06 PM To: solr-user@lucene.apache.org Subject: Re: Replicating files containing external file fields Erick, Thx for your reply. The external file field files are already under the dataDir specified in solrconfig.xml. They are not getting replicated. (Solr version 4.2.1.)
On Thu, Jun 27, 2013 at 10:50 AM, Erick Erickson erickerick...@gmail.com wrote: Haven't tried this, but I _think_ you can use the confFiles trick with relative paths, see: http://wiki.apache.org/solr/SolrReplication Or just put your EFF files in the data dir? Best Erick On Wed, Jun 26, 2013 at 9:01 PM, Arun Rangarajan arunrangara...@gmail.com wrote: From https://wiki.apache.org/solr/SolrReplication I understand that the index dir and any files under the conf dir can be replicated to slaves. I want to know if there is any way the files under the data dir containing external file fields can be replicated. These are not replicated by default. Currently we are running the ext file field reload script on both the master and the slave and then running reloadCache on each server once they are loaded.
Re: Solr large boolean filter
Roman, It's covered in http://wiki.apache.org/solr/ContentStream | For POST requests where the content-type is not application/x-www-form-urlencoded, the raw POST body is passed as a stream. So, there is no need for encoding of binary data inside the body. Regarding encoding, I have positive experience passing such ids encoded by vInt, but they need to be presorted. On Tue, Jul 2, 2013 at 10:46 PM, Roman Chyla roman.ch...@gmail.com wrote: [...]
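A sketch of what that looks like from the client side - the {!bitset} parser name is a placeholder for whatever the plugin registers, but the content-stream mechanics are as the wiki describes: any non-form content type makes the raw body available to the handler as a stream, no base64 step needed:

curl 'http://localhost:8983/solr/select?q={!bitset}' \
     -H 'Content-Type: application/octet-stream' \
     --data-binary @ids.bin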
Re: DIH: HTMLStripTransformer in sub-entities?
On 2 July 2013 20:55, Andy Pickler andy.pick...@gmail.com wrote: Thanks for the quick reply. Unfortunately, I don't believe my company would want me sharing our exact production schema in a public forum, although I realize it makes it harder to diagnose the problem. The sub-entity is a multi-valued field that indeed does have a relationship to the outer entity. I just left off the 'where' clause from the sub-entity, as I didn't believe it was helpful in the context of this problem. We use the convention of... SELECT dbColumnName AS solrFieldName ...so that we can relate the database column name to what we want it to be named in the Solr index. I don't think any of this helps you identify my problem, but I tried to address your questions. Um, with all due respect, I do not then know how to address your issues in a public forum. Maybe you are then better off hiring someone to handle your specific problems, after signing an NDA or whatever it takes from your side: Please see http://wiki.apache.org/solr/Support Regards, Gora
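For the archive, the documented way to attach HTMLStripTransformer to a DIH sub-entity looks like the sketch below; the entity, table and field names are invented for illustration, not Andy's production schema:

<entity name="parent" query="SELECT id, title FROM parent_table">
  <field column="id" name="id"/>
  <field column="title" name="title"/>
  <entity name="comments" transformer="HTMLStripTransformer"
          query="SELECT body FROM comments WHERE parent_id = '${parent.id}'">
    <!-- stripHTML="true" asks the transformer to remove HTML markup from this column -->
    <field column="body" name="comment" stripHTML="true"/>
  </entity>
</entity>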
Re: Converting nested data model to solr schema
My current solution is overriding the out-of-the-box shard routing, and forcing each document and its attachments to go into a specific shard. But this is so I can support the query-time joins (because joins are only performed between documents in the same shard). I'm a bit concerned by this approach only because it forces me to override out-of-the-box solr behavior. I didn't implement the whole thing yet, so I can't say anything about performance. You're saying that your block-join solution does the same thing at index time (putting a document and its attachments in the same shard), but at query time it doesn't require an explicit join? If you could add an example of what you'll index, and how you'll query, it would be very helpful. Also, if this ticket is going to get into one of the next releases, and it solves the join problem, it seems that it's worth waiting for. Mikhail Khludnev wrote: [...]
Re: Solr cloud date based partitioning
On 2 July 2013 22:35, kowish.adamosh kowish.adam...@gmail.com wrote: Thanks! I have a very limited response time (max 100ms), therefore sharding is a must. Really? Sharding is a must, without any measurements to validate that assertion? I am not sure what advice to give you if you seem determined to ignore any, but as a touch point: in the days of Solr 1.4 (performance has much improved since then), out of the box we used to get an average time of well under 100ms for queries with 50 simultaneous users on an index with *everything* stored, and an index size of 80 GB. This was admittedly non-scientific, as the cache enters significantly into the equation, but I will urge you again: try measuring things before adding bells and whistles. Regards, Gora
Access to Solr Wiki
Hi, May I please be added to the list of editors to the Solr Wiki as I see that some earlier changes seem to have gone missing. My user name is GoraMohanty Thanks. Regards, Gora
How to query Solr for empty field or specific value
Hello, I'm using Solr 4.2 and am trying to get a specific value (blue) or null field (no color) returned by my filter query. My results should yield 3 documents (if I execute the two separate filters in different queries, I get 2 hits for one query and 1 for the other). I've tried this (blue or no color set): select?q=*:*&fq=(-color:[* TO *] OR color:blue) When that returned zero hits, I added a new field called color.not_null and am setting it only if a color is defined (thinking there was a problem with using the same field name). select?q=*:*&fq=(-color.not_null:[* TO *] OR color:blue) That too yielded zero results. Again, executing them separately does return hits (3). Does anyone see what I might be doing wrong? Thanks in advance, Kristian
RE: Newbie SolR - Need advice
So, you keep your mssql database, you just don't use it for searches - that'll relieve some of the load. Searches then all go through SOLR & its Lucene indexes. If your various tables need SQL joins, you specify those in the DataImportHandler (DIH) config. That way, when SOLR indexes everything, it indexes the data the way you want to see it. -- So by this you mean we keep mssql as we do!! But we run the website's searches through SOLR. SOLR will then handle the indexing and retrieval of data from its own indexes, and will make its own calls to our MSSQL server when required (i.e. updating/adding to indexes). Am I on the right track there now? So MSSQL becomes the datastore, SOLR becomes the search engine...
Re: copyField and storage requirements
Thanks Shawn. Here is the text_general type definition. We would like to bring the storage requirement down to a minimum for those 500KB content documents. We just need basic full-text search. Thanks!!! :)

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

On Tue, Jul 2, 2013 at 11:35 AM, Shawn Heisey s...@elyograg.org wrote: On 7/2/2013 12:22 PM, Ali, Saqib wrote: Newbie question: We have the following fields defined in the schema:

<field name="content" type="text_general" indexed="true" stored="false"/>
<field name="teaser" type="text_general" indexed="false" stored="true"/>
<copyField source="content" dest="teaser" maxChars="80"/>

the content field is about 500KB of data. My question is whether Solr stores the entire contents of that 500KB content field? We want to minimize the stored data in the Solr index; that is why we added the copyField teaser. With that config, the entire 500KB will not be _stored_ ... but it will affect the index size because you are indexing it. Exactly to what degree depends on the definition of the text_general type. Thanks, Shawn
Re: How to query Solr for empty field or specific value
Better to define color.not_null as a boolean field and always initialize it as either true or false. But, even without that, you need to write a pure negative query or clause as (*:* -term). So: select?q=*:*&fq=((*:* -color:[* TO *]) OR color:blue) and select?q=*:*&fq=((*:* -color.not_null:[* TO *]) OR color:blue) -- Jack Krupansky -Original Message- From: Van Tassell, Kristian Sent: Tuesday, July 02, 2013 3:47 PM To: solr-user@lucene.apache.org Subject: How to query Solr for empty field or specific value [...]
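For reference, the same pure-negative filter from SolrJ - a minimal sketch, where the core URL and field names are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EmptyOrBlueFilter {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        // A purely negative clause needs an explicit *:* to subtract from,
        // otherwise it matches nothing inside the OR.
        q.addFilterQuery("((*:* -color:[* TO *]) OR color:blue)");
        QueryResponse rsp = solr.query(q);
        System.out.println("Hits: " + rsp.getResults().getNumFound());
    }
}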
Re: Solr cloud date based partitioning
Sure, I'll measure the results and come back if they are unsatisfactory. Thanks very much for the advice. Out of curiosity: is there any way to partition shards (logical and physical) by a specified value of a specified field? Kowish
Re: How to show just the parent domains from results in Solr
Re-index your data with a separate field for the domain name, then either manually populate it or use an update processor to extract the domain name and store it in the desired field. You can then group by that field. The URL Classify update processor can do the trick. Or maybe a custom script with the Stateless Script update processor. My book has examples for URL Classify. -- Jack Krupansky -Original Message- From: A Geek Sent: Tuesday, July 02, 2013 1:47 PM To: solr user Subject: How to show just the parent domains from results in Solr hi All, I've indexed documents in my Solr 4.0 index, with fields like URL, page_content etc. Now when I run a search query against the page_content, I get a lot of URLs. Say I have 15 URL domains in total, and under these 15 domains I have all the pages indexed in SOLR. Is there a way in which I can just get the parent URLs for search results instead of getting all the URLs? For example, say searching for abc returns: www.aa.com/11.html www.aa.com/12.html www.aa.com/13.html www.bb.com/15.html www.bb.com/18.html I want the results to be like this: www.aa.com www.bb.com Is there a way in SOLR through which I can achieve this? I've tried FieldCollapsing [ https://wiki.apache.org/solr/FieldCollapsing ] but either it's not the right solution or I'm not able to use it properly. Could someone help me find the solution to the above problem? Thanks in advance. Regards, KK
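Once such a domain field exists, the grouped query itself is small. A minimal SolrJ sketch, assuming a hypothetical string field named domain populated at index time:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.Group;
import org.apache.solr.client.solrj.response.GroupCommand;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupByDomain {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("abc");
        // Field collapsing: one group per domain value, so each parent
        // domain appears once no matter how many of its pages matched.
        q.set("group", true);
        q.set("group.field", "domain");
        q.set("group.limit", 1);
        QueryResponse rsp = solr.query(q);
        for (GroupCommand cmd : rsp.getGroupResponse().getValues()) {
            for (Group g : cmd.getValues()) {
                System.out.println(g.getGroupValue());
            }
        }
    }
}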
RE: How to query Solr for empty field or specific value
Thank you! -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Tuesday, July 02, 2013 3:05 PM To: solr-user@lucene.apache.org Subject: Re: How to query Solr for empty field or specific value [...]
What are the options for obtaining IDF at interactive speeds?
Hi, I'm using SolrJ to run a query, with the goal of obtaining: (1) the retrieved documents, (2) the TF of each term in each document, (3) the IDF of each term in the set of retrieved documents (TF/IDF would be fine too) ...all at interactive speeds, or < 10s per query. This is a demo, so if all else fails I can adjust the corpus, but I'd rather, y'know, actually do it. (1) and (2) are working; I completed the patch posted in the following issue: https://issues.apache.org/jira/browse/SOLR-949 and am just setting tv=true&tv.tf=true for my query. This way I get the documents and the tf information all in one go. With (3) I'm running into trouble. I have found 2 ways to do it so far: Option A: set tv.df=true or tv.tf_idf=true for my query, and get the idf information along with the documents and tf information. Since each term may appear in multiple documents, this means retrieving idf information for each term about 20 times, and it takes over a minute to do. Option B: After I've gathered the tf information, run through the list of terms used across the set of retrieved documents, and for each term, run a query like: {!func}idf(text,'the_term')&defType=func&fl=score&rows=1 ...while this retrieves idf information only once for each term, the added latency of doing that many queries piles up to almost two minutes on my current corpus. Is there anything I didn't think of -- a way to construct a query to get idf information for a set of terms all in one go, outside the bounds of what terms happen to be in a document? Failing that, does anyone have a sense of how far I'd have to scale down a corpus to approach interactive speeds, if I want this sort of data? Katie
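One avenue that may be worth trying (a sketch, untested on this corpus): Solr 4.x accepts function queries as aliased pseudo-fields in fl, so all the idf() lookups for one result set can be batched into a single request. Assuming the text field from Option B and a hypothetical term list gathered from the term-vector pass:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class BatchedIdf {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        String[] terms = {"apple", "banana", "cherry"};  // hypothetical terms from the TF pass
        // Alias each idf() function query as a pseudo-field:
        // fl=id,t0:idf(text,'apple'),t1:idf(text,'banana'),...
        StringBuilder fl = new StringBuilder("id");
        for (int i = 0; i < terms.length; i++) {
            fl.append(",t").append(i).append(":idf(text,'").append(terms[i]).append("')");
        }
        SolrQuery q = new SolrQuery("*:*");
        q.setFields(fl.toString());
        q.setRows(1);  // idf is index-wide, so one document carries all the values
        QueryResponse rsp = solr.query(q);
        SolrDocument doc = rsp.getResults().get(0);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " idf=" + doc.getFieldValue("t" + i));
        }
    }
}

This trades N function-query requests for one request with a long fl; whether that gets under the 10s target would still need measuring.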
Re: Two instances of solr - the same datadir?
The RO instance commit isn't (or shouldn't be) doing any real writing, just an empty commit to force new searchers, autowarm/refresh caches etc. Admittedly, we do all this on 3.6, so 4.0 could have different behaviour in this area. As long as you don't have autocommit in solrconfig.xml, there wouldn't be any commits 'behind the scenes' (we do all our commits via a local solrj client so it can be fully managed). The only caveat might be NRT/soft commits, but I'm not too familiar with this in 4.0. In any case, your RO instance must be getting updated somehow, otherwise how would it know your write instance made any changes? Perhaps your write instance notifies the RO instance externally from Solr? (a perfectly valid approach, and one that would allow a 'single' lock to work without contention) On Tue, Jul 2, 2013 at 7:59 PM, Roman Chyla roman.ch...@gmail.com wrote: Interesting, we are running 4.0 - and solr will refuse to start (or reload) the core. But from looking at the code I am not seeing that it is doing any writing - but I should dig more... Are you sure it needs to do writing? Because I am not calling commits; in fact I have deactivated *all* components that write into the index, so unless there is something deep inside which automatically calls commit, it should never happen. roman On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com wrote: Hmmm, a single lock sounds dangerous. It probably works ok because you've been [un]lucky. For example, even with a RO instance, you still need to do a commit in order to reload caches/changes from the other instance. What happens if this commit gets called in the middle of the other instance's commit? I've not tested this scenario, but it's very possible that with a 'single' lock the results are indeterminate. If the 'single' lock mechanism is making assumptions, e.g. that no other process will interfere, and then one does, the Lucene index could very well get corrupted. For the error you're seeing using 'native': we use the native lockType for both write and RO instances, and it works fine - no contention. Which version of Solr are you using? Perhaps there's been a change in behaviour? Peter On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com wrote: as i discovered, it is not good to use the 'native' locktype in this scenario; actually there is a note in the solrconfig.xml which says the same. When a core is reloaded and solr tries to grab the lock, it will fail - even if the instance is configured to be read-only. So i am using the 'single' lock for the readers and 'native' for the writer, which seems to work OK roman On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com wrote: I have auto commit after 40k recs/1800secs. But I only tested with manual commit, and I don't see why it should work differently. Roman On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote: If it makes you feel better, I also considered this approach when I was in the same situation with a separate indexer and searcher on one physical linux machine. My main concern was re-using the FS cache between both instances - if I replicated to myself there would be two independent copies of the index, FS-cached separately. I like the suggestion of using autoCommit to reload the index. If I'm reading that right, you'd set an autoCommit on 'zero docs changing', or just 'every N seconds'? Did that work? Best of luck!
Tim On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote: So here it is, for the record, how I am solving it right now: Write-master is started with: -Dmontysolr.warming.enabled=false -Dmontysolr.write.master=true -Dmontysolr.read.master=http://localhost:5005 Read-master is started with: -Dmontysolr.warming.enabled=true -Dmontysolr.write.master=false solrconfig.xml changes: 1. all index-changing components have this bit, enable="${montysolr.master:true}" - ie.

<updateHandler class="solr.DirectUpdateHandler2" enable="${montysolr.master:true}">

2. for cache warming de/activation:

<listener event="newSearcher" class="solr.QuerySenderListener" enable="${montysolr.enable.warming:true}">...

3. to trigger refresh of the read-only master (from the write-master):

<listener event="postCommit" class="solr.RunExecutableListener" enable="${montysolr.master:true}">
  <str name="exe">curl</str>
  <str name="dir">.</str>
  <bool name="wait">false</bool>
  <arr name="args"><str>${montysolr.read.master:http://localhost}/solr/admin/cores?wt=json&amp;action=RELOAD&amp;core=collection1</str></arr>
</listener>

This works, I still don't like the reload of the
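For what it's worth, the 'empty commit' Peter describes can also be issued from a local SolrJ client - a minimal sketch, assuming the read-only instance lives at the placeholder URL above:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ForceNewSearcher {
    public static void main(String[] args) throws Exception {
        HttpSolrServer ro = new HttpSolrServer("http://localhost:5005/solr/collection1");
        // No documents are added; this just opens a new searcher so the RO
        // instance picks up segments written by the other instance.
        ro.commit(false, true);  // waitFlush=false, waitSearcher=true
    }
}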
Partial Matching in both query and field
Given a string of 123456 and a search query 923459, what should the schema look like to consider this a match, because at least 4 consecutive characters in the query match 4 consecutive characters in the data? I'm trying an NGramFilterFactory on the index side and an NGramTokenizerFactory on the query side in the schema, but that's not working. I believe the problem is that 'field:923459' is parsed as 'field:"9234 2345 3459"' instead of 'field:9234 field:2345 field:3459'. James Bathgate | Sr. Developer Toll Free (888) 643-9043 x610 - Fax (719) 358-2027 4291 Austin Bluffs Pkwy #206 | Colorado Springs, CO 80918 www.searchspring.net
Re: Newbie SolR - Need advice
Hi Fabio, Yes, you're on the right track. I'd like to now direct you to the first reply from Jack, to go through the Solr tutorial. Even with Solr, it will take some time to learn the various bits and pieces about designing fields, their field types, server configuration, etc., and then to tune the results to match the results that you're currently getting from the database. There is lots of info available for Solr on the web, and do check Lucidworks' Solr Reference Guide. http://docs.lucidworks.com/display/solr/Apache+Solr+Reference+Guide Best of Solr Luck! Sandeep On 2 July 2013 20:47, fabio1605 fabio.to...@btinternet.com wrote: [...]
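If DIH ever feels limiting, the same split - MSSQL as datastore, Solr as search engine - can also be driven by a small JDBC-plus-SolrJ indexer. A rough sketch, with entirely hypothetical table, column, and field names (and assuming the MSSQL JDBC driver is on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MssqlToSolr {
    public static void main(String[] args) throws Exception {
        Connection db = DriverManager.getConnection(
            "jdbc:sqlserver://dbhost;databaseName=shop", "user", "pass");
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        Statement st = db.createStatement();
        // The SQL join is done once at index time, so searches never touch MSSQL.
        ResultSet rs = st.executeQuery(
            "SELECT p.id, p.name, c.label AS category "
          + "FROM products p JOIN categories c ON p.category_id = c.id");
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("name", rs.getString("name"));
            doc.addField("category", rs.getString("category"));
            solr.add(doc);
        }
        solr.commit();
        db.close();
    }
}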
Re: Partial Matching in both query and field
You will need to set q.op to OR, and... use a field type that has the autoGeneratePhraseQueries attribute set to false. -- Jack Krupansky -Original Message- From: James Bathgate Sent: Tuesday, July 02, 2013 5:10 PM To: solr-user@lucene.apache.org Subject: Partial Matching in both query and field [...]
Re: Partial Matching in both query and field
Jack, I've already tried that; here's my query:

<str name="debugQuery">on</str>
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">0_extrafield1_n:20454</str>
<str name="q.op">OR</str>
<str name="rows">10</str>
<str name="version">2.2</str>

Here's the parsed query:

<str name="parsedquery_toString">0_extrafield1_n:"2o45 o454 2o454"</str>

Here are the applicable lines from schema.xml:

<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="0" replacement="o" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="1|l" replacement="i" replace="all"/>
    <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="16"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="16"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[^A-Za-z0-9]+" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="0" replacement="o" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="1|l" replacement="i" replace="all"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<dynamicField name="*_n" type="ngram" indexed="true" stored="true" autoGeneratePhraseQueries="false"/>

James On Tue, Jul 2, 2013 at 2:22 PM, Jack Krupansky j...@basetechnology.com wrote: [...]
Re: Two instances of solr - the same datadir?
Wouldn't it be better to do a RELOAD? http://wiki.apache.org/solr/CoreAdmin#RELOAD Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com On Tue, Jul 2, 2013 at 5:05 PM, Peter Sturge peter.stu...@gmail.com wrote: [...]
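The RELOAD can also be triggered from SolrJ rather than the RunExecutableListener/curl route shown earlier - a minimal sketch, reusing the placeholder http://localhost:5005 read-master from Roman's config:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class ReloadReadOnlyCore {
    public static void main(String[] args) throws Exception {
        // Point at the Solr root (not a core) for core-admin requests.
        HttpSolrServer admin = new HttpSolrServer("http://localhost:5005/solr");
        // Equivalent to /solr/admin/cores?action=RELOAD&core=collection1
        CoreAdminRequest.reloadCore("collection1", admin);
    }
}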
Re: Partial Matching in both query and field
Ahhh... you put autoGeneratePhraseQueries="false" on the field - but it needs to be on the field type, i.e. as an attribute of the <fieldType name="ngram" ...> element, not of the <dynamicField>. You can see from the parsed query that it generated the phrase. -- Jack Krupansky -Original Message- From: James Bathgate Sent: Tuesday, July 02, 2013 5:35 PM To: solr-user@lucene.apache.org Subject: Re: Partial Matching in both query and field [...]
Re: Access to Solr Wiki
I've added GoraMohanty to the Solr wiki's ContributorsGroup page. - Steve On Jul 2, 2013, at 3:25 PM, Gora Mohanty g...@mimirtech.com wrote: [...]