Append children documents for nested document
Hi,

I have these nested documents in Solr:

    curl http://localhost:8983/solr/update/json?commit=true -H 'Content-type:application/json' -d '
    [
      {
        "id": "chapter1",
        "content_type": "chapter",
        "_childDocuments_": [
          { "id": "1-1", "text": "xxx" },
          { "id": "1-2", "text": "yyy" }
        ]
      }
    ]'

Then I would like to use atomic updates to add one more child document under the parent document id:chapter1, like:

    curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
    [
      {
        "id": "chapter1",
        "_childDocuments_": {
          "add": { "id": "1-3", "text": "zzz" }
        }
      }
    ]'

It doesn't work, and Solr returns:

    {"responseHeader":{"status":400,"QTime":0},"error":{"msg":"Expected: ARRAY_START but got OBJECT_START at [58]","code":400}}

How can I add child documents to a specific parent document?

Thanks,
Brad

--
View this message in context: http://lucene.472066.n3.nabble.com/Append-children-documents-for-nested-document-tp4157087.html
Sent from the Solr - User mailing list archive at Nabble.com.
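A note on the error above: the JSON update handler expects `_childDocuments_` to be an array (hence "Expected: ARRAY_START but got OBJECT_START"), so the atomic-update-style object is rejected. Moreover, in Solr 4.x atomic updates do not appear to extend to block-joined child documents at all; as far as we know, the usual workaround is to reindex the whole block with the new child included, along these lines (field values taken from the question):

```json
[
  {
    "id": "chapter1",
    "content_type": "chapter",
    "_childDocuments_": [
      { "id": "1-1", "text": "xxx" },
      { "id": "1-2", "text": "yyy" },
      { "id": "1-3", "text": "zzz" }
    ]
  }
]
```

This payload would be posted to the same /update/json?commit=true endpoint used for the original insert, replacing the existing block.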
Re: Edismax mm and efficiency
Indeed: https://issues.apache.org/jira/browse/LUCENE-4571. My feeling is that it gives a significant gain with high mm values.

On Fri, Sep 5, 2014 at 3:01 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> Are there any speed advantages to using "mm"? I can imagine pruning the set of matching documents early, which could help, but is that (or something else) done?
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: indexing unique keys
Hello,

You are asking without giving context. What's the size of the sets, the desired TPS, the key length, and even the values? It's hard to answer definitively. This is not the primary use case for Lucene, and it adds some unnecessary overhead. However, the community has collected a few workarounds for this kind of problem. On the other hand, as far as I know, executing queries like WHERE x IN (1, ..., 2324) is not a piece of cake for SQL servers either.

You can follow the link at https://plus.google.com/u/0/+MichaelMcCandless/posts/8VNydNi3wvK to find a relevant benchmark. It might help you get at least rough estimates for the Lucene solution.

On Thu, Sep 4, 2014 at 5:53 PM, Mark , N <nipen.m...@gmail.com> wrote:
> I have a use-case where we want to store unique keys (hashes) which would be used to compare against another set of keys (hashes). For example:
>
> Index set = { h1, h2, h3, h4 }
> Comparison set = { h1, h2 }
> Result set = h1, h2
>
> Would it be an advantage to store the index set in Solr instead of in a traditional database?
>
> Thanks in advance,
> *Nipen Mark*

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
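The hash comparison described in the question is, at its core, a set intersection; a minimal Python sketch of the expected semantics (the hash values are the placeholders from the example):

```python
# Index set: the hashes already indexed (whether in Solr or a database).
index_set = {"h1", "h2", "h3", "h4"}

# Comparison set: the incoming hashes to match against the index.
comparison_set = {"h1", "h2"}

# The result is the intersection of the two sets.
result = index_set & comparison_set
print(sorted(result))  # ['h1', 'h2']
```

In Solr terms this would translate to a boolean OR query over the key field, one clause per comparison hash, which is where query-size limits and per-clause overhead enter the trade-off.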
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
Hi Shawn,

Thanks for your reply.

The memory setting of my Solr box is 12G physical memory, 4G for Java (-Xmx4096m). The index size is around 4G in Solr 4.9; I think it was over 6G in Solr 4.0.

I do think the Java RAM size is one of the reasons for this slowness. I'm doing one big commit, and when the ingestion process is 50% finished I can see the Solr server already using over 90% of full memory. I'll try to assign more RAM to the Solr JVM. But from your experience, does 4G sound like a good number for the Java heap size in my scenario? Is there any way to reduce memory usage during index time? (One thing I know of is doing a few commits instead of one big commit.) My concern is that, given I have 12G in total, if I assign too much to the Solr server I may not have enough left for the OS to cache the Solr index files.

I had a look at the Solr config file but couldn't find anything obviously wrong. Just wondering which parts of that config file would impact the index time?

Thanks,
Ryan

> One possible source of problems with that particular upgrade is the fact that stored field compression was added in 4.1, and termvector compression was added in 4.2. They are on by default and cannot be turned off. The compression is typically fast, but with very large documents like yours, it might result in pretty major computational overhead. It can also require additional java heap, which ties into what follows:
>
> Another problem might be RAM-related. If your java heap is very large, or just a little bit too small, there can be major performance issues from garbage collection. Based on the fact that the earlier version performed well, a too-small heap is more likely than a very large heap.
>
> If your index size is such that it can't be effectively cached by the amount of total RAM on the machine (minus the java heap assigned to Solr), that can cause performance problems. Your index size is likely to be several gigabytes, and might even reach double-digit gigabytes. Can you relate those numbers -- index size, java heap size, and total system RAM? If you can, it would also be a good idea to share your solrconfig.xml.
>
> Here's a wiki page that goes into more detail about possible performance issues. It doesn't mention the possible compression problem:
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
RE: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
Hi Erick,

As Ryan Ernst noticed, those big fields (e.g. majorTextSignalStem) are not stored. There are a few stored fields in my schema, but they are very small fields, basically the name or id of that document. I tried turning them off (only storing the id field) and that didn't make any difference.

Thanks,
Ryan

> Ryan:
>
> As it happens, there's a discussion on the dev list about this. If at all possible, could you try a brief experiment? Turn off all the storage, i.e. set stored=false on all fields. It's a lot to ask, but it'd help the discussion. Or join the discussion at https://issues.apache.org/jira/browse/LUCENE-5914.
>
> Best,
> Erick

> From: Li, Ryan
> Sent: Friday, September 05, 2014 3:28 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
>
> [earlier message quoted in full, trimmed]
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
Hi guys,

Just an update: I've tried Solr 4.10 (same code as for Solr 4.9), and it has the same indexing speed as 4.0. The only problem left now is that Solr 4.10 takes more memory than 4.0, so I'm trying to figure out the best number for the Java heap size.

I think that proves there is a performance issue in Solr 4.9 when indexing big documents (even ones just over 1 MB).

Thanks,
Ryan
FAST-like document vector data structures in Solr?
Hello all,

as the migration from FAST to Solr is a relevant topic for several of our customers, there is one issue that does not seem to be addressed by Lucene/Solr: document vectors, FAST-style.

These document vectors are used to form metrics of similarity, i.e., they may be used as a semantic fingerprint of documents to define similarity relations. I can think of several ways of approximating a mapping of this mechanism to Solr, but there are always drawbacks, mostly performance-wise.

Has anybody else encountered and possibly approached this challenge so far? Is there anything in the roadmap of Solr that has not revealed itself to me, addressing this issue?

Your input is greatly appreciated!

Cheers,
--Jürgen
SolrJ 4.10.0 errors
Hi,

I have upgraded from Solr 4.9 to 4.10 and the server side seems fine, but the client is reporting the following exception:

    org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: solr_host.somedomain
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:562)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at ... (company's related packages)
    Caused by: org.apache.http.NoHttpResponseException: solr_host.somedomain failed to respond
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:161)
        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:153)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)
        ... 9 more

To test, I downgraded the client to 4.9 and the error is gone.

Best regards,
Guido.
Re: FAST-like document vector data structures in Solr?
Hi,

Something like this?
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component

And just to show some impressive search functionality of the wiki ;) :
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors

Cheers,
Jim

2014-09-05 9:44 GMT+02:00 Jürgen Wagner (DVT) <juergen.wag...@devoteam.com>:
> [original question quoted in full, trimmed]
Re: SolrJ 4.10.0 errors
Sorry, I didn't give enough information, so I'm adding to it. The SolrJ client is in our webapp and the documents are getting indexed properly into Solr. The only problem we are seeing is that with SolrJ 4.10, once the Solr server's response comes back, the SolrJ client doesn't seem to know what to do with that response and reports the exception I mentioned. I then downgraded the SolrJ client to 4.9 and the exception is now gone. I'm using the following relevant libraries:

Java 7u67 64-bit, on both the webapp client side and Jetty's side
HTTP client/mime 4.3.5
HTTP core 4.3.2

Here is a listing of my modified Solr war lib folder; I usually don't stay with the standard jars because I believe most of them are out of date if you are running JDK 7u55+:

antlr-runtime-3.5.jar
asm-4.2.jar
asm-commons-4.2.jar
commons-cli-1.2.jar
commons-codec-1.9.jar
commons-configuration-1.9.jar
commons-fileupload-1.3.1.jar
commons-io-2.4.jar
commons-lang-2.6.jar
concurrentlinkedhashmap-lru-1.4.jar
dom4j-1.6.1.jar
guava-18.0.jar
hadoop-annotations-2.2.0.jar
hadoop-auth-2.2.0.jar
hadoop-common-2.2.0.jar
hadoop-hdfs-2.2.0.jar
hppc-0.5.2.jar
httpclient-4.3.5.jar
httpcore-4.3.2.jar
httpmime-4.3.5.jar
joda-time-2.2.jar
lucene-analyzers-common-4.10.0.jar
lucene-analyzers-kuromoji-4.10.0.jar
lucene-analyzers-phonetic-4.10.0.jar
lucene-codecs-4.10.0.jar
lucene-core-4.10.0.jar
lucene-expressions-4.10.0.jar
lucene-grouping-4.10.0.jar
lucene-highlighter-4.10.0.jar
lucene-join-4.10.0.jar
lucene-memory-4.10.0.jar
lucene-misc-4.10.0.jar
lucene-queries-4.10.0.jar
lucene-queryparser-4.10.0.jar
lucene-spatial-4.10.0.jar
lucene-suggest-4.10.0.jar
noggit-0.5.jar
org.restlet-2.1.1.jar
org.restlet.ext.servlet-2.1.1.jar
protobuf-java-2.6.0.jar
solr-core-4.10.0.jar
solr-solrj-4.10.0.jar
spatial4j-0.4.1.jar
wstx-asl-3.2.7.jar
zookeeper-3.4.6.jar

Best regards,
Guido.
On 05/09/14 09:42, Guido Medina wrote:
> [original message with full stack trace quoted in full, trimmed]
Re: FAST-like document vector data structures in Solr?
Hello Jim,

yes, I am aware of the TermVector and MoreLikeThis stuff. I am presently mapping docvectors to these mechanisms and creating term vectors myself from third-party text mining components. However, it's not quite like the FAST docvectors. Particularly, the performance of MoreLikeThis queries based on TermVectors is suboptimal on large document sets, so more efficient support for such retrievals in the Lucene kernel would be preferred.

Cheers,
--Jürgen

On 05.09.2014 10:55, jim ferenczi wrote:
> [previous reply and original question quoted in full, trimmed]

--
Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center Intelligence, Senior Cloud Consultant
Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
Why do one big commit? You could do hard commits along the way but keep the searcher open and not see the changes until the end. Obviously a separate issue from the memory consumption discussion, but I thought I'd add it anyway.

Regards,
Alex

On 05/09/2014 3:30 am, "Li, Ryan" <ryan...@sensis.com.au> wrote:
> [earlier message quoted in full, trimmed]
statuscode list
Hi,

If I'm correct, you will get statuscode=0 in the response if you use XML messages for updating the Solr index. Is there a list of the other possible status codes you can receive in case anything fails, and of what these error codes mean?

THNX,
Jan.
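For what it's worth, a non-zero status in the update response generally matches the HTTP error code Solr returns (e.g. 400 for a bad request, 500 for a server error), with the details in the accompanying error message rather than in an enumerated status list. A small Python sketch of extracting the status from an XML update response (the sample response below is made up):

```python
import xml.etree.ElementTree as ET

# A typical XML update response from Solr (sample, abbreviated).
response = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">12</int>
  </lst>
</response>"""

root = ET.fromstring(response)
# status 0 means success; on failure Solr reports the HTTP error code here
# and includes a human-readable message elsewhere in the response.
status = int(root.find("./lst[@name='responseHeader']/int[@name='status']").text)
print(status)  # 0
```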
Re: Solr API for getting shard's leader/replica status
Thanks for the comments!! I found out the solution for getting the replica's state. Here's the piece of code:

    while (iter.hasNext()) {
        Slice slice = iter.next();
        for (Replica replica : slice.getReplicas()) {
            System.out.println("replica state for " + replica.getStr("core") + " : " + replica.getStr("state"));
            System.out.println(slice.getName());
            System.out.println(slice.getState());
        }
    }

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-API-for-getting-shard-s-leader-replica-status-tp4156902p4157108.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
On Fri, Sep 5, 2014 at 3:22 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> Why do one big commit? You could do hard commits along the way but keep the searcher open and not see the changes until the end.

Alexandre, I don't think that can happen; the next search picks up the new searcher.

Ryan, the commit schedule is usually dictated by application requirements, i.e. when updates need to become visible. Memory consumption is governed by ramBufferSizeMB and maxIndexingThreads. Exceeding the buffer causes a flush to disk, but does not trigger a commit.

> Obviously a separate issue from the memory consumption discussion, but I thought I'd add it anyway.
>
> Regards,
> Alex
>
> On 05/09/2014 3:30 am, "Li, Ryan" <ryan...@sensis.com.au> wrote:
> > [earlier message quoted in full, trimmed]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: FAST-like document vector data structures in Solr?
For reference:

"Item Similarity Vector Reference

This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property. The value is a string formatted according to the following format:

    [string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of term,weight expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases. The weight is a float value between 0 and 1, where 1 indicates the highest relevance. The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight."

See: http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx

-- Jack Krupansky

From: Jürgen Wagner (DVT)
Sent: Friday, September 5, 2014 7:03 AM
To: solr-user@lucene.apache.org
Subject: Re: FAST-like document vector data structures in Solr?

> [previous reply and original question quoted in full, trimmed]
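Given the format quoted above, a docvector string is straightforward to handle outside the engine. A hedged Python sketch that parses the [term,weight] pairs and computes cosine similarity between two items (the terms and weights are invented, and terms containing commas or brackets are not handled):

```python
import math
import re

def parse_docvector(s):
    """Parse a FAST-style docvector string '[term,weight][term,weight]...'
    into a dict mapping term -> float weight."""
    return {term: float(weight)
            for term, weight in re.findall(r"\[([^,\]]+),([^\]]+)\]", s)}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values())) *
            math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

doc1 = parse_docvector("[solr,0.9][search,0.7][lucene,0.5]")
doc2 = parse_docvector("[solr,0.8][indexing,0.6][lucene,0.4]")
print(round(cosine_similarity(doc1, doc2), 3))  # 0.686
```

Doing this client-side over a result page is cheap; the hard part, as discussed in this thread, is making the engine itself retrieve by such a similarity reference efficiently.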
Re: FAST-like document vector data structures in Solr?
Jürgen,
I can't quite get it. Can you tell us more about this feature, or point to the docs? Thanks.

On Fri, Sep 5, 2014 at 11:44 AM, Jürgen Wagner (DVT) <juergen.wag...@devoteam.com> wrote:
> [original question quoted in full, trimmed]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
How to implement multilingual word components fields schema?
Hello.

We have documents with multilingual words consisting of parts in different languages, and search queries of the same complexity. It is a worldwide-used online application, so users generate content in all possible world languages. For example:

言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers. Luckily, there exist ICUTokenizer and ICUFoldingFilter for that. But then it requires stemming and lemmatization. How do we implement a schema with universal stemming/lemmatization, which would probably utilize the ICU-generated token script attribute?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

By the way, I have already examined the Basistech schema of their commercial plugins, and it defines the tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii.
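For concreteness, the script-aware universal chain mentioned above (ICUTokenizer plus ICUFoldingFilter, both shipped in Solr's analysis-extras module) would be declared roughly like this in schema.xml; the stemming/lemmatization step is deliberately absent, which is exactly the open question:

```xml
<fieldType name="text_universal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware tokenization across scripts (requires the analysis-extras ICU jars) -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Case folding plus accent/diacritic normalization for all languages -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

The field type name here is invented; any per-script stemming filter would have to be inserted after the folding filter.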
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
On Fri, Sep 5, 2014 at 9:55 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Why do one big commit? You could do hard commits along the way but keep searcher open and not see the changes until the end. Alexandre, I don't think that can happen; the next search picks up the new searcher. Why not? Isn't that what the Solr example configuration is doing at: https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/solr/collection1/conf/solrconfig.xml#L386 ? Hard commit does not reopen the searcher. The soft commit does (further down), but that can be disabled to get the effect I am proposing. What am I missing? Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
Re: How to implement multilingual word components fields schema?
It comes down to how you personally want to value compromises between conflicting requirements, such as relative weighting of false positives and false negatives. Provide a few use cases that illustrate the boundary cases that you care most about. For example field values that have snippets in one language embedded within larger values in a different language. And, whether your fields are always long or sometimes short - the former can work well for language detection, but not the latter, unless all fields of a given document are always in the same language. Otherwise simply index the same source text in multiple fields, one for each language. You can then do a dismax query on that set of fields. -- Jack Krupansky -Original Message- From: Ilia Sretenskii Sent: Friday, September 5, 2014 10:06 AM To: solr-user@lucene.apache.org Subject: How to implement multilingual word components fields schema? Hello. We have documents with multilingual words which consist of different languages parts and seach queries of the same complexity, and it is a worldwide used online application, so users generate content in all the possible world languages. For example: 言語-aware Løgismose-alike ຄໍາຮ້ອງສະຫມັກ-dependent So I guess our schema requires a single field with universal analyzers. Luckily, there exist ICUTokenizer and ICUFoldingFilter for that. But then it requires stemming and lemmatization. How to implement a schema with universal stemming/lemmatization which would probably utilize the ICU generated token script attribute? http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html By the way, I have already examined the Basistech schema of their commercial plugins and it defines tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts. Please advise how to address this task. Sincerely, Ilia Sretenskii.
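Jack's multiple-fields approach might be sketched like this in schema.xml (the field names, source field, and per-language analyzer types are illustrative, not from the thread): copy the same source text into one field per language, then query across all of them with dismax/edismax:

```xml
<!-- Sketch: the same source text analyzed once per language -->
<field name="text_src" type="string"  indexed="false" stored="true"/>
<field name="text_en"  type="text_en" indexed="true"  stored="false"/>
<field name="text_ja"  type="text_ja" indexed="true"  stored="false"/>
<copyField source="text_src" dest="text_en"/>
<copyField source="text_src" dest="text_ja"/>
```

A matching query would then be something along the lines of `q=...&defType=edismax&qf=text_en text_ja`, letting the best-scoring per-language analysis win.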
Is there any sentence tokenizers in sold 4.9.0?
Hi, I was looking out the options for sentence tokenizers default in solr but could not find it. Does any one used? Integrated from any other language tokenizers to solr. Example python etc.. Please let me know. Thanks and regards, Sandeep
Re: FAST-like document vector data structures in Solr?
Thanks for posting this. I was just about to send off a message of similar content :-) Important to add: - In FAST ESP, you could have more than one such docvector associated with a document, in order to reflect different metrics. - Term weights in docvectors are document-relative, not absolute. - Processing is done in the search processor (close to the index), not in the QR server (providing transformations on the result list). This docvector could be used for unsupervised clustering, related-to/similarity search, tag clouds or more weird stuff like identifying experts on topics contained in a particular document. With Solr, it seems I have to handcraft the term vectors to reflect the right weights, to approximate the effect of FAST docvectors, e.g., by normalizing them to [0...1). Processing performance would still be different from the classical FAST docvectors. The space consumption may become ugly for a 200+ GB range shard; however, FAST has also been quite generous with disk space anyway. So, the interesting question is whether there is a more canonical way of handling this in Solr/Lucene, or if something like it is planned for 5.0+. Best regards, --Jürgen On 05.09.2014 16:02, Jack Krupansky wrote: For reference: “Item Similarity Vector Reference This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property. The value is a string formatted according to the following format: [string1,weight1][string2,weight2]...[stringN,weightN] When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of term,weight expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). 
Terms can be single words or phrases. The weight is a float value between 0 and 1, where 1 indicates the highest relevance. The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.” See: http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx -- Jack Krupansky
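To get a feel for what approximating this in application code involves, here is a small sketch (plain Python, not a Solr or FAST API) that parses the `[term,weight][term,weight]...` docvector string format quoted above and compares two vectors by cosine similarity:

```python
import math
import re

def parse_docvector(s):
    """Parse a FAST-style docvector string '[t1,w1][t2,w2]...' into a term->weight dict."""
    return {term: float(weight)
            for term, weight in re.findall(r"\[([^,\]]+),([^\]]+)\]", s)}

def cosine(a, b):
    """Cosine similarity between two sparse term->weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc1 = parse_docvector("[airline,0.9][crash,0.7][blackbox,0.4]")
doc2 = parse_docvector("[airline,0.8][crash,0.6]")
print(round(cosine(doc1, doc2), 3))
```

This only illustrates the data model; doing the same thing index-side at scale (the "search processor" placement Jürgen describes) is precisely the part Solr does not provide out of the box.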
Re: FAST-like document vector data structures in Solr?
Sounds like a great feature to add to Solr, especially if it would facilitate more automatic relevancy enhancement. LucidWorks Search has a feature called unsupervised feedback that does that but something like a docvector might make it a more realistic default. -- Jack Krupansky -Original Message- From: Jürgen Wagner (DVT) Sent: Friday, September 5, 2014 10:29 AM To: solr-user@lucene.apache.org Subject: Re: FAST-like document vector data structures in Solr? Thanks for posting this. I was just about to send off a message of similar content :-) Important to add: - In FAST ESP, you could have more than one such docvector associated with a document, in order to reflect different metrics. - Term weights in docvectors are document-relative, not absolute. - Processing is done in the search processor (close to the index), not in the QR server (providing transformations on the result list). This docvector could be used for unsupervised clustering, related-to/similarity search, tag clouds or more weird stuff like identifying experts on topics contained in a particular document. With Solr, it seems I have to handcraft the term vectors to reflect the right weights, to approximate the effect of FAST docvectors, e.g., by normalizing them to [0...1). Processing performance would still be different from the classical FAST docvectors. The space consumption may become ugly for a 200+ GB range shard; however, FAST has also been quite generous with disk space anyway. So, the interesting question is whether there is a more canonical way of handling this in Solr/Lucene, or if something like it is planned for 5.0+. Best regards, --Jürgen On 05.09.2014 16:02, Jack Krupansky wrote: For reference: “Item Similarity Vector Reference This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property. 
The value is a string formatted according to the following format: [string1,weight1][string2,weight2]...[stringN,weightN] When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of term,weight expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases. The weight is a float value between 0 and 1, where 1 indicates the highest relevance. The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.” See: http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx -- Jack Krupansky
Re: Is there any sentence tokenizers in sold 4.9.0?
Sorry for typo it is solr 4.9.0 instead of sold 4.9.0 On Sep 5, 2014 7:48 PM, Sandeep B A belgavi.sand...@gmail.com wrote: Hi, I was looking out the options for sentence tokenizers default in solr but could not find it. Does any one used? Integrated from any other language tokenizers to solr. Example python etc.. Please let me know. Thanks and regards, Sandeep
Re: Query ReRanking question
Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date beforehand then the relevancy is lost. So I want to get the Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask why would I want to do that? Let's take an example about the Malaysian airline crash. Several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the recent developments on top, i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time. Hope I am clear. Thanks for your help. Thanks Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. Try something like this: q=foo&sort=date+desc&rq={!rerank reRankDocs=1000 reRankQuery=$myquery}&myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field ? Can somebody help me understand how to write such a query ? Thanks Ravi Kiran Bhaskar
RE: Query ReRanking question
Hi - You can already achieve this by boosting on the document's recency. The result set won't be exactly ordered by date but you will get the most relevant and recent documents on top. Markus -Original message- From:Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com Sent: Friday 5th September 2014 18:06 To: solr-user@lucene.apache.org mailto:solr-user@lucene.apache.org Subject: Re: Query ReRanking question Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date before hand then the relevancy is lost. So I want to get Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask Why would I want to do that ?? Lets take a example about Malaysian airline crash. several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the the recent developments on the top i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time Hope i am clear. Thanks for your help. Thanks Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com mailto:joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. Try something like this: q=foosort=date+descrq={!rerank reRandDocs=1000 reRankQuery=$myquery}myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field ? Can somebody help me understand how to write such a query ? Thanks Ravi Kiran Bhaskar
Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9
Alexandre: It Depends (tm) of course. It all hinges on the setting in autocommit, whether openSearcher is true or false. In the former case, you, well, open a new searcher. In the latter you don't. I agree, though, this is all tangential to the memory consumption issue since the RAM buffer will be flushed regardless of these settings. FWIW, Erick On Fri, Sep 5, 2014 at 7:11 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: On Fri, Sep 5, 2014 at 9:55 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Why do one big commit? You could do hard commits along the way but keep searcher open and not see the changes until the end. Alexandre, I don't think it's can happen in solr-user list, next search pickups the new searcher. Why not? Isn't that what the Solr example configuration doing at: https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/solr/collection1/conf/solrconfig.xml#L386 ? Hard commit does not reopen the searcher. The soft commit does (further down), but that can be disabled to get the effect I am proposing. What am I missing? Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
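The setting Erick is referring to is the openSearcher flag on autoCommit; a solrconfig.xml sketch (the interval values are illustrative):

```xml
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher> <!-- flush the RAM buffer to disk, keep the current searcher -->
</autoCommit>
<autoSoftCommit>
  <maxTime>-1</maxTime> <!-- disabled: no new searcher until an explicit commit at the end -->
</autoSoftCommit>
```

With openSearcher=false, hard commits bound memory use and transaction-log size during a long indexing run without making the in-flight changes visible.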
Re: Query ReRanking question
OK, why can't you switch the clauses from Joel's suggestion? Something like: q=Malaysia plane crash&rq={!rerank reRankDocs=1000 reRankQuery=$myquery}&myquery=*:*&sort=date+desc (haven't tried this yet, but you get the idea). Best, Erick On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - You can already achieve this by boosting on the document's recency. The result set won't be exactly ordered by date but you will get the most relevant and recent documents on top. Markus -Original message- From:Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com Sent: Friday 5th September 2014 18:06 To: solr-user@lucene.apache.org mailto:solr-user@lucene.apache.org Subject: Re: Query ReRanking question Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date before hand then the relevancy is lost. So I want to get Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask Why would I want to do that ?? Lets take a example about Malaysian airline crash. several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the the recent developments on the top i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time Hope i am clear. Thanks for your help. Thanks Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com mailto:joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. 
Try something like this: q=foosort=date+descrq={!rerank reRandDocs=1000 reRankQuery=$myquery}myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field ? Can somebody help me understand how to write such a query ? Thanks Ravi Kiran Bhaskar
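Spelled out with explicit parameter separators (the archive tends to swallow the ampersands) and with the parameter name as documented (reRankDocs, not "reRandDocs" as typed in the quoted examples), the suggestion is a request along these lines — collection name and reRankWeight are illustrative, and per Erick's own caveat this is untested:

```
http://localhost:8983/solr/collection1/select
  ?q=malaysia airline crash blackbox
  &rq={!rerank reRankQuery=$myquery reRankDocs=1000 reRankWeight=3}
  &myquery=*:*
  &sort=date desc
```

Whether this actually keeps relevance ahead of the date ordering is exactly the point debated in the rest of the thread.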
Re: Query ReRanking question
Boosting on recency is probably a better approach. A fixed re-ranking horizon will always be a compromise, a guess at the precision of the query. It will give poor results for queries that are more or less specific than the assumption. Think of the recency boost as a tie-breaker. When documents are similar in relevance, show the most recent. This can work over a wide range of queries. For “malaysian airlines crash”, there are two sets of relevant documents, one set on MH 370 starting six months ago, and one set on MH 17, two months ago. But four hours ago, The Guardian published a “six months on” article on MH 370. A recency boost will handle that complexity. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ On Sep 5, 2014, at 10:23 AM, Erick Erickson erickerick...@gmail.com wrote: OK, why can't you switch the clauses from Joel's suggestion? Something like: q=Malaysia plane crashrq={!rerank reRankDocs=1000 reRankQuery=$myquery}myquery=*:*sort=date+desc (haven't tried this yet, but you get the idea). Best, Erick On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - You can already achieve this by boosting on the document's recency. The result set won't be exactly ordered by date but you will get the most relevant and recent documents on top. Markus -Original message- From:Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com Sent: Friday 5th September 2014 18:06 To: solr-user@lucene.apache.org mailto:solr-user@lucene.apache.org Subject: Re: Query ReRanking question Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date before hand then the relevancy is lost. So I want to get Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask Why would I want to do that ?? Lets take a example about Malaysian airline crash. 
several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the the recent developments on the top i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time Hope i am clear. Thanks for your help. Thanks Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com mailto:joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. Try something like this: q=foosort=date+descrq={!rerank reRandDocs=1000 reRankQuery=$myquery}myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field ? Can somebody help me understand how to write such a query ? Thanks Ravi Kiran Bhaskar
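The recency boost Walter and Markus describe is commonly implemented with a reciprocal function of document age; a request sketch (the field name `publish_date` and the constants are illustrative — 3.16e-11 is roughly 1/(one year in milliseconds)):

```
q=malaysia airline crash blackbox
&defType=edismax
&boost=recip(ms(NOW/HOUR,publish_date),3.16e-11,1,1)
```

Since recip(x,m,a,b) computes a/(m*x+b), a document published now gets a boost near 1 and a year-old document about 0.5, so recency acts as a smooth tie-breaker rather than a hard sort.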
Re: Edismax mm and efficiency
Great! We have some very long queries, where students paste entire homework problems. One of them was 1051 words. Many of them are over 100 words. This could help. In the Jira discussion, I saw some comments about handling the most sparse lists first. We did something like that in the Infoseek Ultra engine about twenty years ago. Short termlists (documents matching a term) were processed first, which kept the in-memory lists of matching docs small. It also allowed early short-circuiting for no-hits queries. What would be a high mm value, 75%? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ On Sep 4, 2014, at 11:52 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: indeed https://issues.apache.org/jira/browse/LUCENE-4571 my feeling is it gives a significant gain in mm high values. On Fri, Sep 5, 2014 at 3:01 AM, Walter Underwood wun...@wunderwood.org wrote: Are there any speed advantages to using “mm”? I can imagine pruning the set of matching documents early, which could help, but is that (or something else) done? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: SolrJ 4.10.0 errors
On 9/5/2014 3:50 AM, Guido Medina wrote: Sorry I didn't give enough information, so I'm adding to it: the SolrJ client is on our webapp and the documents are getting indexed properly into Solr. The only problem we are seeing is that with SolrJ 4.10, once the Solr server response comes back, the SolrJ client doesn't seem to know what to do with the response and reports the exception I mentioned. I then downgraded the SolrJ client to 4.9 and the exception is now gone. I'm using the following relevant libraries: Java 7u67 64 bits on both the webapp client side and Jetty's side, HTTP client/mime 4.3.5, HTTP core 4.3.2. Here is a list of my Solr war modified lib folder; I usually don't stay with the standard jars because I believe most of them are out of date if you are running a JDK 7u55+: You're in uncharted territory if you're going to modify the jars included with Solr itself. We do upgrade these from time to time, and usually it's completely harmless, but we also run all the tests when we do it, to make sure that nothing will get broken. Some of the components are on specific versions because upgrading them isn't as simple as simply changing the jar. What happens if you return Solr to what's in the release war? Thanks, Shawn
RE: How to implement multilingual word components fields schema?
Agree with the approach Jack suggested to use same source text in multiple fields for each language and then doing a dismax query. Would love to hear if it works for you? Thanks, Susheel -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Friday, September 05, 2014 10:21 AM To: solr-user@lucene.apache.org Subject: Re: How to implement multilingual word components fields schema? It comes down to how you personally want to value compromises between conflicting requirements, such as relative weighting of false positives and false negatives. Provide a few use cases that illustrate the boundary cases that you care most about. For example field values that have snippets in one language embedded within larger values in a different language. And, whether your fields are always long or sometimes short - the former can work well for language detection, but not the latter, unless all fields of a given document are always in the same language. Otherwise simply index the same source text in multiple fields, one for each language. You can then do a dismax query on that set of fields. -- Jack Krupansky -Original Message- From: Ilia Sretenskii Sent: Friday, September 5, 2014 10:06 AM To: solr-user@lucene.apache.org Subject: How to implement multilingual word components fields schema? Hello. We have documents with multilingual words which consist of different languages parts and seach queries of the same complexity, and it is a worldwide used online application, so users generate content in all the possible world languages. For example: 言語-aware Løgismose-alike ຄໍາຮ້ອງສະຫມັກ-dependent So I guess our schema requires a single field with universal analyzers. Luckily, there exist ICUTokenizer and ICUFoldingFilter for that. But then it requires stemming and lemmatization. How to implement a schema with universal stemming/lemmatization which would probably utilize the ICU generated token script attribute? 
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html By the way, I have already examined the Basistech schema of their commercial plugins and it defines tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts. Please advise how to address this task. Sincerely, Ilia Sretenskii. This e-mail message may contain confidential or legally privileged information and is intended only for the use of the intended recipient(s). Any unauthorized disclosure, dissemination, distribution, copying or the taking of any action in reliance on the information herein is prohibited. E-mails are not secure and cannot be guaranteed to be error free as they can be intercepted, amended, or contain viruses. Anyone who communicates with us by e-mail is deemed to have accepted these risks. The Digital Group is not responsible for errors or omissions in this message and denies any responsibility for any damage arising from the use of e-mail. Any opinion defamatory or deemed to be defamatory or any material which could be reasonably branded to be a species of plagiarism and other statements contained in this message and any attachment are solely those of the author and do not necessarily represent those of the company.
Re: How to implement multilingual word components fields schema?
Hi Ilia, I don't know if it would be helpful but below I've listed some academic papers on this issue of how best to deal with mixed language/mixed script queries and documents. They are probably taking a more complex approach than you will want to use, but perhaps they will help to think about the various ways of approaching the problem. We haven't tackled this problem yet. We have over 200 languages. Currently we are using the ICUTokenizer and ICUFolding filter but don't do any stemming due to a concern with overstemming (we have very high recall, so don't want to hurt precision by stemming) and the difficulty of correct language identification of short queries. If you have languages where there is only one language per script however, you might be able to do much more. I'm not sure if I'm remembering correctly but I believe some of the stemmers such as the Greek stemmer will pass through any strings that don't contain characters in the Greek script. So it might be possible to at least do stemming on some of your languages/scripts. I'll be very interested to learn what approach you end up using. Tom -- Some papers: Mohammed Mustafa, Izzedin Osman, and Hussein Suleman. 2011. Indexing and weighting of multilingual and mixed documents. In *Proceedings of the South African Institute of Computer Scientists and Information Technologists Conference on Knowledge, Innovation and Leadership in a Diverse, Multidisciplinary Environment* (SAICSIT '11). ACM, New York, NY, USA, 161-170. DOI=10.1145/2072221.2072240 http://doi.acm.org/10.1145/2072221.2072240 That paper and some others are here: http://www.husseinsspace.com/research/students/mohammedmustafaali.html There is also some code from this article: Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo Rosso. 2014. Query expansion for mixed-script information retrieval. In *Proceedings of the 37th international ACM SIGIR conference on Research development in information retrieval* (SIGIR '14). 
ACM, New York, NY, USA, 677-686. DOI=10.1145/2600428.2609622 http://doi.acm.org/10.1145/2600428.2609622 Code: http://users.dsic.upv.es/~pgupta/mixed-script-ir.html Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-scale-search On Fri, Sep 5, 2014 at 10:06 AM, Ilia Sretenskii sreten...@multivi.ru wrote: Hello. We have documents with multilingual words which consist of different languages parts and seach queries of the same complexity, and it is a worldwide used online application, so users generate content in all the possible world languages. For example: 言語-aware Løgismose-alike ຄໍາຮ້ອງສະຫມັກ-dependent So I guess our schema requires a single field with universal analyzers. Luckily, there exist ICUTokenizer and ICUFoldingFilter for that. But then it requires stemming and lemmatization. How to implement a schema with universal stemming/lemmatization which would probably utilize the ICU generated token script attribute? http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html By the way, I have already examined the Basistech schema of their commercial plugins and it defines tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts. Please advise how to address this task. Sincerely, Ilia Sretenskii.
RE: Is there any sentence tokenizers in sold 4.9.0?
There is SmartChineseSentenceTokenizerFactory or SentenceTokenizer, which is being deprecated and replaced with HMMChineseTokenizer. I am not aware of other tokenizers, but you may either build your own similar to SentenceTokenizer, or employ an external sentence detector/recognizer and build a Solr tokenizer on top of it. I don't know how complex your use case is, but I would suggest looking at SentenceTokenizer and creating a similar tokenizer. Thanks, Susheel -Original Message- From: Sandeep B A [mailto:belgavi.sand...@gmail.com] Sent: Friday, September 05, 2014 10:40 AM To: solr-user@lucene.apache.org Subject: Re: Is there any sentence tokenizers in sold 4.9.0? Sorry for typo it is solr 4.9.0 instead of sold 4.9.0 On Sep 5, 2014 7:48 PM, Sandeep B A belgavi.sand...@gmail.com wrote: Hi, I was looking out the options for sentence tokenizers default in solr but could not find it. Does any one used? Integrated from any other language tokenizers to solr. Example python etc.. Please let me know. Thanks and regards, Sandeep
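Since the original question mentions doing this from Python, another pragmatic option is to split sentences outside Solr before indexing, e.g. one Solr document per sentence. A minimal sketch (a naive regex splitter, not a linguistic one — a trained model such as NLTK's punkt would handle abbreviations far better):

```python
import re

def split_sentences(text):
    """Naive split: break after . ! or ? when followed by whitespace and a capital letter."""
    return [s.strip()
            for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
            if s.strip()]

doc = "Solr is a search server. It builds on Lucene! Does it split sentences? Not by default."
for sentence in split_sentences(doc):
    print(sentence)
```

Each resulting sentence can then be posted to Solr as its own document or multivalued field entry, sidestepping the need for a custom Lucene tokenizer.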
Re: Query ReRanking question
Erick, I believe when you apply sort this way it runs the query and sort first and then tries to rerank, so it has basically already lost the true relevancy because the sort takes precedence. Am I making sense? Ravi Kiran Bhaskar On Fri, Sep 5, 2014 at 1:23 PM, Erick Erickson erickerick...@gmail.com wrote: OK, why can't you switch the clauses from Joel's suggestion? Something like: q=Malaysia plane crashrq={!rerank reRankDocs=1000 reRankQuery=$myquery}myquery=*:*sort=date+desc (haven't tried this yet, but you get the idea). Best, Erick On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi - You can already achieve this by boosting on the document's recency. The result set won't be exactly ordered by date but you will get the most relevant and recent documents on top. Markus -Original message- From:Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com Sent: Friday 5th September 2014 18:06 To: solr-user@lucene.apache.org mailto:solr-user@lucene.apache.org Subject: Re: Query ReRanking question Thank you very much for responding. I want to do exactly the opposite of what you said. I want to sort the relevant docs in reverse chronology. If you sort by date before hand then the relevancy is lost. So I want to get Top N relevant results and then rerank those Top N to achieve relevant reverse chronological results. If you ask Why would I want to do that ?? Lets take a example about Malaysian airline crash. several articles might have been published over a period of time. When I search for - malaysia airline crash blackbox - I would want to see relevant results but would also like to see the the recent developments on the top i.e. effectively a reverse chronological order within the relevant results, like telling a story over a period of time Hope i am clear. Thanks for your help. 
Thanks Ravi Kiran Bhaskar On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein joels...@gmail.com mailto:joels...@gmail.com wrote: If you want the main query to be sorted by date then the top N docs reranked by a query, that should work. Try something like this: q=foosort=date+descrq={!rerank reRandDocs=1000 reRankQuery=$myquery}myquery=blah Joel Bernstein Search Engineer at Heliosearch On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr ravis...@gmail.com mailto:ravis...@gmail.com wrote: Can the ReRanking API be used to sort within docs retrieved by a date field ? Can somebody help me understand how to write such a query ? Thanks Ravi Kiran Bhaskar
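The parameter strings in this thread lose their separators in the archive; written out with explicit & separators and proper URL encoding, the rerank request looks like the sketch below. The host, collection name, and the relevance query are illustrative assumptions, not from the thread.

```python
from urllib.parse import urlencode

# Build the rerank request discussed above: a main query sorted by
# date, with the top 1000 docs rescored by a relevance query passed
# via parameter substitution ($myquery).
params = {
    "q": "foo",
    "sort": "date desc",
    "rq": "{!rerank reRankDocs=1000 reRankQuery=$myquery}",
    "myquery": "malaysia airline crash blackbox",
    "wt": "json",
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(url)
```

urlencode handles the characters ({, !, $, =) that make these queries awkward to paste into curl by hand.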
Re: Query ReRanking question
Walter, thank you for the valuable insight. The problem I am facing is that between the term frequencies, mm, date boost, and stemming, the results can become very inconsistent. Look at the following examples.

Here the chronology is all over the place because of what I mentioned above:
http://www.washingtonpost.com/pb/newssearch/?query=malaysian+airline+crash

Now take the instance of an old topic/news item that was covered a while ago for a period of time but not actively updated recently. In this case the date boosting predominantly takes over because of common terms, and we get a rash of irrelevant content:
http://www.washingtonpost.com/pb/newssearch/?query=faces+of+the+fallen

This has become such a balancing act, and hence I was looking to see if reranking might help.

Thanks
Ravi Kiran Bhaskar

On Fri, Sep 5, 2014 at 1:32 PM, Walter Underwood wun...@wunderwood.org wrote:

Boosting on recency is probably a better approach. A fixed re-ranking horizon will always be a compromise, a guess at the precision of the query. It will give poor results for queries that are more or less specific than the assumption.

Think of the recency boost as a tie-breaker. When documents are similar in relevance, show the most recent. This can work over a wide range of queries.

For "malaysian airlines crash", there are two sets of relevant documents, one set on MH 370 starting six months ago, and one set on MH 17, two months ago. But four hours ago, The Guardian published a "six months on" article on MH 370. A recency boost will handle that complexity.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Sep 5, 2014, at 10:23 AM, Erick Erickson erickerick...@gmail.com wrote:

OK, why can't you switch the clauses from Joel's suggestion? Something like:

q=Malaysia plane crash&rq={!rerank reRankDocs=1000 reRankQuery=$myquery}&myquery=*:*&sort=date+desc

(haven't tried this yet, but you get the idea).

Best,
Erick
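The recency boost Walter and Markus recommend is commonly expressed with edismax's multiplicative boost and Solr's recip/ms function queries. A minimal sketch, assuming illustrative field names (publish_date, title, body); the constant 3.16e-11 is roughly 1 divided by the milliseconds in a year, so a year-old document's boost is about half that of a brand-new one.

```python
from urllib.parse import urlencode

# Multiplicative recency boost: recip(x,m,a,b) = a / (m*x + b),
# where x = ms(NOW, publish_date) is document age in milliseconds.
# With m = 3.16e-11, the boost decays to ~0.5 after one year,
# acting as the relevance "tie-breaker" described above.
params = {
    "defType": "edismax",
    "q": "malaysia airline crash blackbox",
    "qf": "title body",
    "boost": "recip(ms(NOW,publish_date),3.16e-11,1,1)",
    "wt": "json",
}
query_string = urlencode(params)
print(query_string)
```

Because the boost is multiplicative rather than a hard sort, relevance still dominates; recency only reorders documents of similar score.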
How to solve?
We have a core where each document is a person. We want to boost based on sweater color, but if the person has sweaters in their closet from the same manufacturer, we want to boost even more by adding those values together.

Peter Smit - Sweater: Blue = 1 : Nike, Sweater: Red = 2 : Nike, Sweater: Blue = 1 : Polo
Tony S - Sweater: Red = 2 : Nike
Bill O - Sweater: Red = 2 : Polo, Blue = 1 : Polo

Scores:
Peter Smit - 1 + 2 = 3
Tony S - 2
Bill O - 2 + 1

I thought about using payloads:

sweaters_payload
Blue: Nike: 1
Red: Nike: 2
Blue: Polo: 1

How do I query this?

http://localhost:8983/solr/persons?q=*:*&sort=??

Ideas?

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076