Personalized Search
Has anybody done personalized search with Solr? I'm thinking of including fields such as bought or like per member/visitor via dynamic fields to a product search schema. Another option is to have a multi-value field that can contain user IDs. What are the possible performance issues with this setup? Looking forward to your ideas. Rih
RE: how to achieve filters
Hi All,

I am getting an error in Solr: Error loading class 'Solr.TrieField'. I have added the following to the types section of the schema file:

  <fieldType name="tint" class="solr.TrieField" omitNorms="true"/>

and in the custom fields of the schema I have added:

  <field name="bitrate" type="tint" indexed="true" stored="true"/>

I am using Solr version 1.3. Can't I handle the filter (in my example, bitrate) with sint?

Thanks,
Prakash

Earlier in the thread, Prakash asked:

I am using a dismax query to fetch docs from Solr, where I have set some boost on each field. If I search for the query "Rock" I get the following docs with the boost values I specified:

  <doc>
    <float name="score">19.494072</float>
    <int name="bitrate">120</int>
    <str name="content">mp3</str>
    <str name="genre">Rock</str>
    <str name="id">1</str>
    <str name="name">st name 1</str>
  </doc>
  <doc>
    <float name="score">19.494052</float>
    <int name="bitrate">248</int>
    <str name="content">aac+</str>
    <str name="genre">Rock</str>
    <str name="id">2</str>
    <str name="name">st name 2</str>
  </doc>
  <doc>
    <float name="score">19.494042</float>
    <int name="bitrate">127</int>
    <str name="content">aac+</str>
    <str name="genre">Rock</str>
    <str name="id">3</str>
    <str name="name">st name 3</str>
  </doc>
  <doc>
    <float name="score">19.494032</float>
    <int name="bitrate">256</int>
    <str name="content">mp3</str>
    <str name="genre">Rock</str>
    <str name="id">4</str>
    <str name="name">st name 5</str>
  </doc>

I am looking for something like the below. What is the best way to achieve it?

1. query=rock where content=mp3: it should return only the first and last docs, where content is mp3.
2. query=rock where bitrate<128: it should return only the first and third docs, where bitrate is below 128.

Ahmet replied: With filter queries. Assuming that content is string typed:

1. q=rock&fq={!field f=content}mp3
2. q=rock&fq:bitrate:[* TO 128] -- for this, the bitrate field must be tint type.

Prakash followed up: Thanks much Ahmet. Yep, content is string and bitrate is int. I am digging more now. Can we combine both the scenarios, say if I want only mp3 from 0 to 128?

Ahmet: bitrate should be trie-based (tint), not int, for range queries to work correctly. You can append as many filter queries (fq) as you want: q=rock&fq={!field f=content}mp3&fq=bitrate:[* TO 128]

Prakash: Hey, q=rock&fq:bitrate:[* TO 128] (bitrate is int) also returns docs with more than 128 bitrate. Is there something I am doing wrong?
Re: Personalized Search
Hi Rih,

Are you going to include one or two common fields ("bought" or "like") per member/visitor, or a unique field per member/visitor? If one or two common fields are included, there will not be any impact on performance. If you want to include a unique field per user, you need to consider a multi-valued field instead; otherwise you will certainly hit the wall.

Regards,
Aditya
www.findbestopensource.com
Re: Personalized Search
Another approach would be to do query-time boosts of 'my' items, under the assumption that the count is limited:

- keep the Solr index independent of bought/like
- have a db table with user prefs on a per-item basis
- at query time, specify boosts for 'my items'

We are planning to do this in the context of document management, where documents in 'my (used/favorited) folders' provide a boost factor to the results.
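As a rough sketch of how such query-time boosts could be expressed (the item IDs here are hypothetical, looked up from the user's preference table before issuing the query): with the dismax handler, per-user boosts can be appended as bq (boost query) parameters without filtering anything out:

```text
q=rock&qt=dismax&bq=id:(12 34 56)^10
```

Documents whose id matches one of the user's items get their score boosted; everything else still matches and ranks normally.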
RE: how to achieve filters
I am getting an error in Solr: Error loading class 'Solr.TrieField' ... I am using Solr version 1.3. Can't I handle the filter (in my example, bitrate) with sint?

Sure you can handle it. If you are using Solr 1.3.0 you need to use the sint type, since the trie-based field types are not available until Solr 1.4.
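For reference, a sketch of the two field type declarations involved, modeled on the stock example schemas (treat the exact attribute values as assumptions): the sortable int available in Solr 1.3 versus the trie-based int that ships with Solr 1.4:

```xml
<!-- Solr 1.3: sortable int; values are encoded so that range queries order correctly -->
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
<field name="bitrate" type="sint" indexed="true" stored="true" default="0"/>

<!-- Solr 1.4 and later: trie-based int with faster range queries -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
```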
RE: how to achieve filters
Hey Ahmet,

I have added:

  <field name="bitrate" type="sint" indexed="true" stored="true" default="0"/>

And the request I am passing is:

/solr/select?indent=on&version=2.2&q=rock&fq={!field%20f=content}mp3&fq:bitrate:[* TO 127]&start=0&rows=10&fl=*%2Cscore&qt=dismax&wt=standard&explainOther=&hl.fl=

Still I am seeing documents above bitrate 127.

Regards,
Prakash
RE: how to achieve filters
And the request I am passing is /solr/select?...&fq:bitrate:[* TO 127]&... Still I am seeing documents above bitrate 127.

There is a typo: instead of fq: there should be fq=

fq=bitrate:[* TO 127]
RE: how to achieve filters
Oops, my bad. Thanks much!
Statistics exposed as JSON
Are the Solr 1.4 statistics like #docs, #docsPending etc. exposed in JSON format?

Andreas
solr caches from external caching system like memcached
Hi,

Is it possible to back Solr's caches (the query cache, filter cache, and document cache) with an external caching system like memcached? It has several advantages, such as centralized caching and reduced JVM garbage-collection pause times, since we could assign less memory to the JVM.

Thanks,
Bharath
Re: index merge
Hi All,

The problem is resolved. It was purely due to the filesystem: my filesystem was 32-bit, running on a 64-bit OS. I changed to a 64-bit filesystem and all works as expected.

Uma
--
View this message in context: http://lucene.472066.n3.nabble.com/index-merge-tp472904p832053.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Personalized Search
Hi dc,

- at query time, specify boosts for 'my items'

Do you mean something like a document boost, or do you want to include something like OR myItemId:100^100? Can you tell us how you would specify document boosts at query time? Or are you querying something like a boolean field (i.e. isFavorite:true^10) or a numeric field?

Kind regards,
Mitch
Re: Personalized Search
Mitch is right: what you're looking for here is a recommendation engine, if I understand your question properly. And yes, Mahout should work, though the Taste recommendation engine it supports is pretty new. Sean Owen and Robin Anil have a "Mahout in Action" book that's in early release via Manning, and it has lots of good information about Mahout recommender systems.

Assuming you have a list of recommendations for a given user, based on their past behavior and the recommendation engine, then you could use this to adjust search results. I'm waiting for Hoss to jump in here on how best to handle that :)

-- Ken Krugler, http://bixolabs.com
Machine utilization while indexing
Hi,

I have a question about how I can get Solr to index quicker than it does at the moment.

I have to index (and re-index) some 3-5 million documents. These documents are preprocessed by a Java application that effectively combines multiple database tables with each other to form the SolrInputDocument. What I'm seeing, however, is that the queue of documents that are ready to be sent to the Solr server exceeds my preset limit, telling me that Solr somehow can't process the documents fast enough. (I have created my own queue in front of SolrJ's StreamingUpdateSolrServer, as it would not process the documents fast enough, causing OutOfMemoryExceptions due to the large number of documents building up in its queue.)

I have an index that for 95% consists of IDs (Long). We don't do any analysis on the fields that are being indexed. The schema is rather straightforward; most fields look like:

  <fieldType name="long" class="solr.LongField" omitNorms="true"/>
  <field name="objectId" type="long" stored="true" indexed="true" required="true"/>
  <field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

The relevant solrconfig.xml:

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>100</mergeFactor>
    <RAMBufferSizeMB>256</RAMBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>

The machines I'm testing on have an Intel Core 2 Quad Q9550 @ 2.83GHz with 4GB of RAM, running on Linux with Java 1.6.0_17, Tomcat 6, and Solr 1.4.

What I'm seeing is that the network almost never reaches more than 10% of the 1Gb/s connection, that the CPU utilization is always below 25% (1 core is used, not the others), and I don't see heavy disk IO. Also, while indexing, the memory consumption is: Free memory: 212.15 MB; Total memory: 509.12 MB; Max memory: 2730.68 MB. And in the beginning (with an empty index) I get 2ms per insert, but this slows to 18-19ms per insert.

Are there any tips/tricks I can use to speed up my indexing? I have a feeling that my machine is capable of doing more (using more CPUs), I just can't figure out how.

Thijs
seemingly impossible query
Hey everyone, I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries. My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated to them. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N Ids as input, N docs returned). I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. This could work OK since some of these ids should repeat quite often. Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas? Thanks, -Kallin Nagelberg
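A JDK-only sketch of the plan described above (one rows=1 query per id, memoized in a small LRU cache; the class name is invented, and fetchLatestDoc is a hypothetical stand-in for the real Solr call):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LatestDocCache {
    private final int capacity;
    private final Map<String, String> cache;

    public LatestDocCache(final int capacity) {
        this.capacity = capacity;
        // Access-ordered LinkedHashMap that evicts the least recently used
        // entry once the cache grows past capacity.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > LatestDocCache.this.capacity;
            }
        };
    }

    // Return the most recent document for this id, querying Solr only on a miss.
    public String latestDocFor(String id) {
        String doc = cache.get(id);
        if (doc == null) {
            doc = fetchLatestDoc(id);
            cache.put(id, doc);
        }
        return doc;
    }

    // Hypothetical stand-in: the real version would issue something like
    // q=listOfIds:<id> sorted by date descending with rows=1.
    protected String fetchLatestDoc(String id) {
        return "doc-for-" + id;
    }
}
```

Since some ids repeat often, the cache should absorb a good share of the 100 lookups per request; how well depends entirely on the id distribution.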
RE: Machine utilization while indexing
How about throwing a BlockingQueue (http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html) between your document creator and the Solr server? Give it a size of 10,000 or something, with one thread trying to feed it and one thread waiting for it to get near full, then draining it. Take the drained results and add them to the server (maybe try not using StreamingUpdateSolrServer). Something like that worked well for me with about 5,000,000 documents of ~5k each, taking about 8 hours.

-Kallin Nagelberg
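A minimal sketch of the queue-between-producer-and-feeder pattern described above, using only the JDK (the class and method names are invented for illustration, and the actual Solr call is stubbed out):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BatchingIndexer {
    // Bounded queue between the document producer and the Solr feeder thread.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

    // Producer side: put() blocks when the queue is full, applying back-pressure
    // instead of letting documents pile up and exhaust the heap.
    public void enqueue(String doc) throws InterruptedException {
        queue.put(doc);
    }

    // Consumer side: drain up to batchSize documents in one call and hand them
    // to Solr as a single batch. The Solr call itself is stubbed out here.
    public int drainAndIndex(int batchSize) {
        List<String> batch = new ArrayList<String>(batchSize);
        queue.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            // In real code: solrServer.add(batch) plus a periodic commit().
        }
        return batch.size();
    }
}
```

The bounded put() is what prevents the OutOfMemoryExceptions mentioned earlier, and drainTo() amortizes the per-request overhead by sending documents in batches.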
Solr Delta Queries
I have an indexed_timestamp field in my index, which lets me know when a document was indexed:

  <field name="indexed_timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

For some reason, when doing delta indexing via DIH, this field is not being updated. Are timestamp fields updated during delta updates?

Kind regards,
Vladimir Sutskever
Investment Bank - Technology
JPMorgan Chase, Inc.
RE: Machine utilization while indexing
It takes that long to do indexing? I'm HOPING to have a site that has low tens of millions of documents, up to billions. Sounds to me like I will DEFINITELY need a cloud account at indexing time.

For the original author of this thread, that's what I'd recommend:
1/ Optimize as best as you can on one machine.
2/ Set up an Amazon EC2 (Elastic Compute Cloud) account. Spawn/shard the indexing over to 5-10 machines during indexing. Combine the index, then shut down the EC2 instances.

You could probably get it down to half an hour, without impacting your current queries.

Dennis Gearon
Non-English query via Solr Example Admin corrupts text
Hi guys/gals,

I am using apache-solr-1.4.0.war deployed to GlassFish v3 on my development machine, which is Ubuntu 9.10 64-bit. I am using SolrJ 1.4 with a CommonsHttpSolrServer connection to that Solr instance (http://localhost:8080/apache-solr-1.4.0) during my development. To simplify things, however, I have found that I can duplicate my issue directly from the Solr example admin page, so for ease of confirmation I will use that for this example.

I deployed the apache-solr-1.4.0/dist/apache-solr-1.4.0.war file to my GlassFish v3 application server. It deploys successfully. I access http://localhost:8080/apache-solr-1.4.0/admin/form.jsp and enter into the Solr/Lucene Statement textarea this word: numéro (note the é). When I check the server.log file, I see this:

INFO: [] webapp=/apache-solr-1.4.0 path=/select params={indent=on&version=2.2&q=numéro&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl.fl=} hits=0 status=0 QTime=16

As well, the output from the admin system shows the same incorrect decoding. In my SolrJ-based application, I have a test case which queries for numéro and succeeds if I use the embedded server but fails if I use CommonsHttpSolrServer. I don't want to use embedded for a number of reasons, including that it's not recommended (http://wiki.apache.org/solr/EmbeddedSolr).

I am sorry if you've dealt with this issue in the past; I've spent a few hours googling for "solr utf-8 query" and "glassfishv3 utf-8 uri" plus other permutations/combinations, but there was so much chaff that I couldn't find anything useful after scouring it for a few hours. I can't decide whether it's a GlassFish issue or not, so I am not sure where to direct my energy. Any tips or advice are appreciated!

Thanks in advance,
Tim Gilbert
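A common culprit for this symptom is the servlet container decoding the query-string bytes as ISO-8859-1 rather than UTF-8 before Solr ever sees them. As a hedged illustration only (Tomcat is shown because its setting is widely documented; GlassFish v3 exposes an analogous URI-encoding property on its HTTP listener), the container-side fix in Tomcat's server.xml looks like:

```xml
<!-- Tomcat server.xml: decode request URIs (including query strings) as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"/>
```

Checking whether the same query works through a container with this setting applied would help separate a GlassFish configuration issue from a Solr one.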
Re: Machine utilization while indexing
I already have a BlockingQueue in place (that's my custom queue), and luckily I'm indexing faster than what you're doing: currently it takes about 2 hours to index the 5M documents I'm talking about. But I still feel as if my machine is under-utilized.

Thijs
Re: Machine utilization while indexing
Why would I need faster hardware if my current hardware isn't reaching its max capacity? I'm already using different machines for querying and indexing, so while indexing, queries aren't affected. Pulling an optimized snapshot isn't even noticeable on the query machines.

Thijs
RE: Machine utilization while indexing
Well to be fair I'm indexing on a modest virtualized machine with only 2 gigs ram, and a doc size of 5-10k maybe substantially larger than what you have. They could be substantially smaller too. As another point of reference my index ends up being about 20Gigs with the 5 million docs. I should also point out I only need to do this once.. I'm not constantly reindexing everything. My indexed documents rarely change, and when they do we have a process that selectively updates those few that need it. Combine that with a constant trickle of new documents and indexing performance isn't much of a concern. You should be able to experiment with a small subset of your documents to speedily test new schemas, etc. In my case I selected a representative sample and store them in my project for unit testing. -Kallin Nagelberg -Original Message- From: Dennis Gearon [mailto:gear...@sbcglobal.net] Sent: Thursday, May 20, 2010 11:25 AM To: solr-user@lucene.apache.org Subject: RE: Machine utilization while indexing It takes that long to do indexing? I'm HOPING to have a site that has low 10's of millions of documents to billions. Sounds to me like I will DEFINITELY need a cloud account at indexing time. For the original author of this thread, that's what I'd recommend. 1/ Optimize as best as you can on one machine. 2/ Set up an Amazon EC (Elastic Cloud) account. Spawn/shard the indexing over to 5-10 machines during indexing. Combine the index, shut down the EC instances. Probably could get it down to 1/2 hour, without impacting your current queries. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. 
Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Thu, 5/20/10, Nagelberg, Kallin knagelb...@globeandmail.com wrote: From: Nagelberg, Kallin knagelb...@globeandmail.com Subject: RE: Machine utilization while indexing To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org Date: Thursday, May 20, 2010, 8:16 AM How about throwing a blockingqueue, http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html, between your document-creator and solrserver? Give it a size of 10,000 or something, with one thread trying to feed it, and one thread waiting for it to get near full then draining it. Take the drained results and add them to the server (maybe try not using streamingsolrserver). Something like that worked well for me with about 5,000,000 documents each ~5k taking about 8 hours. -Kallin Nagelberg -Original Message- From: Thijs [mailto:vonk.th...@gmail.com] Sent: Thursday, May 20, 2010 11:02 AM To: solr-user@lucene.apache.org Subject: Machine utilization while indexing Hi. I have a question about how I can get solr to index quicker then it does at the moment. I have to index (and re-index) some 3-5 million documents. These documents are preprocessed by a java application that effectively combines multiple database tables with each-other to form the SolrInputDocument. What I'm seeing however is that the queue of documents that are ready to be send to the solr server exceeds my preset limit. Telling me that Solr somehow can't process the documents fast enough. (I have created my own queue in front of Solrj.StreamingUpdateSolrServer as it would not process the documents fast enough causing OutOfMemoryExceptions due to the large amount of documents building up in it's queue) I have an index that for 95% consist of ID's (Long). We don't do any analysis on the fields that are being indexed. The schema is rather straight forward. 
Most fields look like:

<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true" required="true"/>
<field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

The relevant solrconfig.xml:

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>100</mergeFactor>
  <RAMBufferSizeMB>256</RAMBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
</indexDefaults>

The machines I'm testing on have an Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz with 4 GB of RAM, running on Linux, Java version 1.6.0_17, Tomcat 6 and Solr version 1.4. What I'm seeing is that the network almost never reaches more than 10% of the 1 Gb/s connection, that the CPU utilization is always below 25% (1 core is used, not the others), and I don't see heavy disk I/O. Also, while indexing, the memory consumption is: Free memory: 212.15 MB, Total memory: 509.12 MB, Max memory: 2730.68 MB. And in the beginning (with an empty index) I get 2ms per insert, but this slows to 18-19ms per insert. Are there any tips/tricks I can use to speed up my indexing? Because I have a feeling that my machine is capable of doing more (using more CPUs). I just can't figure out how. Thijs
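Kallin's suggestion in this thread — a bounded BlockingQueue between the document-creator thread and the Solr client, drained in batches — can be sketched roughly as below. This is a hypothetical stand-in: `String` plays the role of SolrInputDocument, and a real indexer would hand each completed batch to `solrServer.add(batch)` where the comment indicates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BatchIndexer {
    // Drain 'total' documents from the queue in batches of 'batchSize'.
    // In a real indexer each completed batch would go to solrServer.add(batch).
    // If the thread is interrupted, the batches drained so far are returned.
    static List<List<String>> drainInBatches(BlockingQueue<String> queue,
                                             int batchSize, int total) {
        List<List<String>> batches = new ArrayList<>();
        List<String> batch = new ArrayList<>(batchSize);
        try {
            for (int i = 0; i < total; i++) {
                batch.add(queue.take()); // blocks until the producer offers a document
                if (batch.size() == batchSize) {
                    batches.add(batch);  // hand off a full batch here
                    batch = new ArrayList<>(batchSize);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (!batch.isEmpty()) batches.add(batch);
        return batches;
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10000); // bounded, as suggested
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 25; i++) queue.add("doc-" + i); // document-creator thread
        });
        producer.start();
        List<List<String>> batches = drainInBatches(queue, 10, 25);
        producer.join();
        System.out.println(batches.size()); // three batches: 10 + 10 + 5
    }
}
```

The bounded capacity is what prevents the OutOfMemoryExceptions Thijs describes: `queue.add` (or better, `put`) applies backpressure to the producer once the consumer falls behind.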
RE: Machine utilization while indexing
You're sure it's not blocking on indexing I/O? If not, then I guess it must be a thread waiting unnecessarily in Solr or your loading program. To get my loader running at full speed I hooked it up to JProfiler's thread views to see where the stalls were and optimized from there. -Kallin Nagelberg

-----Original Message-----
From: Thijs [mailto:vonk.th...@gmail.com]
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: Re: Machine utilization while indexing

I already have a BlockingQueue in place (that's my custom queue) and luckily I'm indexing faster than what you're doing. Currently it takes about 2 hours to index the 5M documents I'm talking about. But I still feel as if my machine is underutilized. Thijs
RE: Machine utilization while indexing
Here is a good article from IBM, with code, on how to do hybrid/cloud computing: http://www.ibm.com/developerworks/library/x-cloudpt1/

Dennis Gearon
Re: Non-English query via Solr Example Admin corrupts text
In my SolrJ-using application, I have a test case which queries for "numéro" and succeeds if I use Embedded and fails if I use CommonsHttpSolrServer. I don't want to use embedded for a number of reasons, including that it's not recommended (http://wiki.apache.org/solr/EmbeddedSolr). I am sorry if you'd dealt with this issue in the past; I've spent a few hours googling for "solr utf-8 query" and "glassfishv3 utf-8 uri" plus other permutations/combinations, but there were seemingly endless amounts of chaff and I couldn't find anything useful after scouring it for a few hours. I can't decide whether it's a glassfish issue or not, so I am not sure where to direct my energy. Any tips or advice are appreciated!

I have never used glassfish, but I am pretty sure it is a glassfish issue. The same thing happens in Tomcat if you don't set URIEncoding="UTF-8".
http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
http://forums.java.net/jive/thread.jspa?threadID=38020
http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding
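For Tomcat, the URIEncoding fix Ahmet mentions goes on the HTTP connector in server.xml. A minimal sketch — the port and other attributes shown are just placeholders for whatever your install already uses:

```xml
<!-- conf/server.xml: tell Tomcat to decode GET-request URI parameters as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           URIEncoding="UTF-8"/>
```

Without this, Tomcat decodes query-string bytes as ISO-8859-1, which is exactly the kind of corruption of multi-byte characters like "é" described above.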
Re: seemingly impossible query
Ok, I think I understand. What's impossible about this? If you have a single field called id that is multivalued, then you can retrieve the documents with something like: id:1 OR id:2 OR id:56 ... id:100, then add limit 100. There's probably a more succinct way to do this, but I'll leave that to the experts. If you also only want the documents within a certain time, then you also create a time field and use a conjunction, (id:0 ...) AND time:NOW-1H, or something similar to this. Check the query syntax wiki for specifics. Darren

Hey everyone, I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries. My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated to them. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N ids as input, N docs returned). I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. This could work OK since some of these ids should repeat quite often. Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas? Thanks, -Kallin Nagelberg
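Darren's big OR clause is easy to generate from the incoming id list; a small sketch, where the field name and ids are purely illustrative:

```java
import java.util.List;
import java.util.stream.Collectors;

public class IdQueryBuilder {
    // Join up to ~100 ids into a single Lucene-syntax OR clause,
    // suitable for the q (or fq) parameter.
    static String orQuery(String field, List<String> ids) {
        return ids.stream()
                  .map(id -> field + ":" + id)
                  .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        System.out.println(orQuery("listOfIds", List.of("1", "2", "56")));
        // listOfIds:1 OR listOfIds:2 OR listOfIds:56
    }
}
```

As the follow-ups note, this alone returns up to 100 matching documents, not necessarily one per id — that is the crux of the rest of the thread.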
RE: seemingly impossible query
Thanks Darren, The problem with that is that it may not return one document per id, which is what I need. IE, I could give 100 ids in that OR query and retrieve 100 documents, all containing just 1 of the IDs. -Kallin Nagelberg
RE: seemingly impossible query
I see. Well, now you're asking Solr to ignore its prime directive of returning hits that match a query. Hehe. I'm not sure if Solr has a unique attribute. But this sounds, to me, like you will have to filter the results yourself. But at least you hit Solr only once before doing so. Good luck!
Re: Non-English query via Solr Example Admin corrupts text
I have had the same issue within Tomcat. Further to what Ahmet wrote, I recommend plugging a filter into your Solr context that forces requests and responses to be encoded in UTF-8.

-- Abdelhamid ABID
Software Engineer - J2EE / WEB
Re: seemingly impossible query
Would each id need to return a different doc? If not, you could probably use FieldCollapsing: http://wiki.apache.org/solr/FieldCollapsing
I.e.:
- collapse on listOfIds (see the wiki entry for syntax)
- constrain the field to only return the ids you want, e.g.: q=listOfIds:10 OR listOfIds:5 ... OR listOfIds:56

Geert-Jan
RE: seemingly impossible query
Yeah, I need something like: (id:1 AND maxhits:1) OR (id:2 AND maxhits:1)... something crazy like that. I'm not sure how I can hit Solr once. If I do try to do them all in one big OR query then I'm probably not going to get a hit for each ID. I would need to request probably 1000 documents to find all 100, and even then there's no guarantee and no way of knowing how deep to go. -Kallin Nagelberg
RE: seemingly impossible query
The problem here, I think, is that you only want 1 of many _results_ for a particular ID. How would Solr know which result you want to keep, and which to throw away? However... you can do this in two queries if you want. Have a separate Solr document with unique ID equal to the listOfIds value as they are indexed (one for each unique id, then). On that _id document_, store a field pointing to the ID of the real document you want, as they are indexed. Each time the _id document_ is rewritten with a document id, it overwrites any prior data for that unique _id document_. Now, you first query the _id documents_ using the 100 ids you receive. Each has a reference to a _single_ real document. Then you retrieve the document field of each of those to write a single query that gets all the last-indexed real documents for those ids. It would work.
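At index time, Darren's two-query scheme might look like the following. The field names ('latestDocId') and the 'listid-' prefix are made up for illustration; the only real requirement is that the id document's uniqueKey is derived from the listOfIds value:

```xml
<!-- one "id document" per unique listId; re-posting a doc with the same
     uniqueKey replaces the old one, so each id document always points at
     the most recently indexed real document for that id -->
<add>
  <doc>
    <field name="id">listid-56</field>
    <field name="latestDocId">doc-9042</field>
  </doc>
</add>
```

The first query fetches up to 100 of these id documents, the second fetches the real documents by the collected latestDocId values — two round trips total, independent of how many ids repeat.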
Re: seemingly impossible query
Hi Kallin, again, please look at FieldCollapsing (http://wiki.apache.org/solr/FieldCollapsing); that should do the trick. Basically: first you constrain the field 'listOfIds' to only contain docs that contain any of the (up to) 100 random ids, as you know how to do. Next, in the same query, specify to collapse on the field 'listOfIds'. Basically:

q=listOfIds:1 OR listOfIds:10 OR listOfIds:24&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

This would return the top-matching doc for each id left in listOfIds. Since you constrained this field by the ids specified, you are left with 1 matching doc for each id. Again, it is not guaranteed that all docs returned are different. Since you didn't specify that as a requirement, I think this will suffice. Cheers, Geert-Jan
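Put together, Geert-Jan's suggestion is a single request along these lines. The host/port are placeholders, and the collapse.* parameters come from the field-collapsing patch (SOLR-236) described on that wiki page rather than stock Solr 1.4, so check the syntax against the patch version you actually apply:

```
http://localhost:8983/solr/select?q=listOfIds:1+OR+listOfIds:10+OR+listOfIds:24&collapse.field=listOfIds&collapse.threshold=1&collapse.type=normal
```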
Solr highlighter and custom queries?
Hi all! I'm trying to do some simple highlighting, but I cannot seem to figure out how to make it work. I'm using my own QueryParser, which generates custom-made queries, and I would like Solr to be able to highlight them. I've tried many options in the highlighter but cannot get any snippets to show. However, if I change the QueryParser to the default Solr parser, it works. Is there a place in the config or in the query parser where I can specify how Solr should highlight my custom queries? I checked a bit in the source code, and in the WeightedSpanTermExtractor class, in the method extract(Query query, Map terms), there is a huge list of instanceofs that check which type of query we are attempting to match. Is that the only place where the conversion from query to highlighting happens? If so, it looks pretty hard-coded and would not work with any queries other than the ones included in Lucene. I guess there must be a good reason for this, but is there any other way of making the highlighter work without having to hard-code all the possible queries in a big if / instanceof chain? If we could somehow reuse the code contained in each query to find possible matches, it would avoid having to recode the same logic elsewhere. But as I said, there must be a good reason for doing it the way it's already coded. Any ideas on how to work this out with the existing code base would be greatly appreciated :) Daniel Shane
RE: seemingly impossible query
Thanks, I'm going to take a look at FieldCollapsing as it seems like it should do the trick! -Kallin Nagelberg
Debugging - DIH Delta Queries
Hi All, How can I see all of the queries sent to my DB during a Delta Import? It seems like my documents are not being updated via delta import. When I use Solr's DataImport Handler Console - with delta-import selected - I see <lst name="entity:getall"><lst name="document#1"/></lst> <lst name="entity:getall"><lst name="document#1"/></lst> <lst name="entity:getall"><lst name="document#1"/></lst> But that's not very helpful - I want to see the exact queries. Thank You
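One common way to surface the exact SQL (assuming the java.util.logging setup that the Solr 1.4 example ships with - adjust for your container) is to raise the log level for the dataimport package; the JDBC data source logs the statements it runs at a fine-grained level. A hypothetical logging.properties fragment:

```
# logging.properties fragment (illustrative; names depend on your logging setup)
org.apache.solr.handler.dataimport.level = FINE
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = FINE
```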
Re: Non-English query via Solr Example Admin corrupts text
: I am using apache-solr-1.4.0.war deployed to glassfishv3 on my ... : INFO: [] webapp=/apache-solr-1.4.0 path=/select : params={indent=on&version=2.2&q=numéro&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl.fl=} : hits=0 status=0 QTime=16 ... : In my SolrJ using application, I have a test case which queries for : numéro and succeeds if I use Embedded and fails if I use : CommonsHttpSolrServer... I don't want to use embedded for a number of ... : I am sorry if you'd dealt with this issue in the past, I've spent a few : hours googling for solr utf-8 query and glassfishv3 utf-8 uri plus other : permutations/combinations but there were seemingly endless amounts of : chaff that I couldn't find anything useful after scouring it for a few : hours. I can't decide whether it's a glassfish issue or not so I am not : sure where to direct my energy. Any tips or advice are appreciated! I suspect if you switched to using POST instead of GET your problem would go away -- this stems from ambiguity in the way HTTP servers/browsers deal with encoding UTF8 in URLs. a quick search for glassfish url encoding turns up this thread... http://forums.java.net/jive/thread.jspa?threadID=38020 which references... http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding ...it looks like you want to modify the default-charset attribute of the parameter-encoding -Hoss
Re: Machine utilization while indexing
I'm really only guessing here, but based on your description of what you are doing it sounds like you only have one thread streaming documents to solr (via a single StreamingUpdateSolrServer instance which creates a single HTTP connection) Have you at all attempted to have parallel threads in your client initiate parallel connections to Solr via multiple instances of StreamingUpdateSolrServer objects? -Hoss
RE: Non-English query via Solr Example Admin corrupts text
Chris, You are the best. Switching to POST solved the problem. I hadn't noticed that option earlier but after finding: https://issues.apache.org/jira/browse/SOLR-612 I found the option in the code. Thank you, you just made my day. Secondly, in an effort to narrow down whether this was a glassfish issue or not, here is what I found. Starting with glassfishv3 (I think) UTF-8 is the default for URI. You can see this by going to the admin site, clicking on Network Config | Network Listeners | then select the listener. Select the tab HTTP and about half way down, you will see URI Encoding: UTF-8. HOWEVER, that doesn't appear to be correct because following Abdelhamid Abid's advice, I deployed Solr to Tomcat, then followed the directions here: http://wiki.apache.org/solr/SolrTomcat to force tomcat to UTF-8 for URI. Then I deployed Solr to tomcat, and using CommonsHttpSolrServer, connected to that tomcat served instance. It worked - first time. So, it appears that there is a problem with glassfishv3 and UTF-8 URIs for at least the apache-solr-1.4.0.war. I wonder if I added that sun-web.xml file into the war to force UTF-8 whether it might work... not sure. However, the workaround is to change the method to POST as Chris suggested. You can do that in Solrj here: server.query(solrQuery, METHOD.POST); and it works as you'd expect. Thanks for the advice/tips, Tim -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Thursday, May 20, 2010 2:41 PM To: solr-user@lucene.apache.org Subject: Re: Non-English query via Solr Example Admin corrupts text : I am using apache-solr-1.4.0.war deployed to glassfishv3 on my ... : INFO: [] webapp=/apache-solr-1.4.0 path=/select : params={indent=on&version=2.2&q=numéro&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl.fl=} : hits=0 status=0 QTime=16 ... : In my SolrJ using application, I have a test case which queries for : numéro and succeeds if I use Embedded and fails if I use : CommonsHttpSolrServer...
I don't want to use embedded for a number of ... : I am sorry if you'd dealt with this issue in the past, I've spent a few : hours googling for solr utf-8 query and glassfishv3 utf-8 uri plus other : permutations/combinations but there were seemingly endless amounts of : chaff that I couldn't find anything useful after scouring it for a few : hours. I can't decide whether it's a glassfish issue or not so I am not : sure where to direct my energy. Any tips or advice are appreciated! I suspect if you switched to using POST instead of GET your problem would go away -- this stems from ambiguity in the way HTTP servers/browsers deal with encoding UTF8 in URLs. a quick search for glassfish url encoding turns up this thread... http://forums.java.net/jive/thread.jspa?threadID=38020 which references... http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding ...it looks like you want to modify the default-charset attribute of the parameter-encoding -Hoss
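The GET ambiguity Hoss describes is easy to reproduce in a couple of lines of stand-alone Java: the same term percent-encodes differently depending on which charset is assumed for the URL, so unless the container's listener charset matches what the client sent, the query term arrives corrupted.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The same query term produces two different URLs depending on
        // the charset assumed when encoding/decoding it:
        System.out.println(URLEncoder.encode("numéro", "UTF-8"));      // num%C3%A9ro
        System.out.println(URLEncoder.encode("numéro", "ISO-8859-1")); // num%E9ro
    }
}
```

POST sidesteps the issue because the body's charset travels in the Content-Type header instead of being guessed from the URL.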
RE: Machine utilization while indexing
StreamingUpdateSolrServer already has multiple threads and uses multiple connections under the covers. At least the api says ' Uses an internal MultiThreadedHttpConnectionManager to manage http connections'. The constructor allows you to specify the number of threads used, http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html#StreamingUpdateSolrServer(java.lang.String, int, int) . -Kallin Nagelberg -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Thursday, May 20, 2010 3:14 PM To: solr-user@lucene.apache.org Subject: Re: Machine utilization while indexing I'm really only guessing here, but based on your description of what you are doing it sounds like you only have one thread streaming documents to solr (via a single StreamingUpdateSolrServer instance which creates a single HTTP connection) Have you at all attempted to have parallel threads in your client initiate parallel connections to Solr via multiple instances of StreamingUpdateSolrServer objects?) -Hoss
RE: Non-English query via Solr Example Admin corrupts text
: Starting with glassfishv3 (I think) UTF-8 is the default for URI. You : can see this by going to the admin site, clicking on Network Config | : Network Listeners | then select the listener. Select the tab HTTP and : about half way down, you will see URI Encoding: UTF-8. : : HOWEVER, that doesn't appear to be correct because following Abdelhamid ... I know nothing about glassfish, but according to that forum URL i mentioned before, the URI Encoding option in glassfish explicitly (and evidently contentiously) does not apply to the query args -- only the path, hence the two different config options mentioned in the FAQ... : http://forums.java.net/jive/thread.jspa?threadID=38020 ... : http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding -Hoss
Re: Subclassing DIH
: I am trying to subclass DIH to add I am having a hard time trying to get : access to the current Solr Context. How is this possible? I don't think DIH was particularly designed to be subclassed (i'm surprised it's not final) ... it was built with the assumption that people would write plugins (transformers, datasources, etc...) If you elaborate a little bit more on what you hope to achieve by subclassing, people can provide more insight into the best way to go about it... http://people.apache.org/~hossman/#xyproblem XY Problem Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss
RE: Machine utilization while indexing
: StreamingUpdateSolrServer already has multiple threads and uses multiple : connections under the covers. At least the api says ' Uses an internal Hmmm... i think one of us misunderstands the point behind StreamingUpdateSolrServer and its internal threads/queues. (it's very possible that it's me) my understanding is that this allows it to manage the batching of multiple operations for you, reusing connections as it goes -- so the queueSize is how many individual requests it buffers before sending the batch to Solr, and the threadCount controls how many batches it can send in parallel (in the event that one thread is still waiting for the response when the queue next fills up) But if you are only using a single thread to feed SolrRequests to a single instance of StreamingUpdateSolrServer then there can still be lots of opportunities for Solr itself to be idle -- as i said, it's not clear to me if you are using multiple threads to write to your StreamingUpdateSolrServer ... even if you reuse the same StreamingUpdateSolrServer instance, multiple threads in your client code may increase the throughput (assuming that at the moment the threads in StreamingUpdateSolrServer are largely idle) But as i said ... this is all mostly a guess. I'm not intimately familiar with solrj. -Hoss
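The multi-threaded feeding pattern Hoss suggests can be sketched without a live Solr instance; here an AtomicInteger stands in for the server.add(doc) call, so this only shows the client-side fan-out, not SolrJ API (the class and names are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelFeedDemo {
    // Stand-in for a call like server.add(makeDoc(id)); the point is only
    // the pattern: several client threads pushing documents concurrently
    // so the indexer is never waiting on a single feeder.
    static final AtomicInteger sent = new AtomicInteger();

    static void addDoc(int id) {
        sent.incrementAndGet(); // a real client would send the doc here
    }

    public static void main(String[] args) throws InterruptedException {
        int docs = 10_000, feederThreads = 4;
        ExecutorService pool = Executors.newFixedThreadPool(feederThreads);
        for (int i = 0; i < docs; i++) {
            final int id = i;
            pool.submit(() -> addDoc(id));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("sent " + sent.get() + " docs");
    }
}
```

Whether the shared client instance is thread-safe enough for this is exactly the open question in the thread; measuring throughput with 1 vs. N feeder threads would settle it.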
RE: seemingly impossible query
Yeah this looks perfect. Too bad it's not in 1.4, I guess I can build from trunk and patch it. This is probably a stupid question but is there any feeling as to when 1.5 might come out? Thanks, -Kallin Nagelberg -Original Message- From: Geert-Jan Brits [mailto:gbr...@gmail.com] Sent: Thursday, May 20, 2010 1:03 PM To: solr-user@lucene.apache.org Subject: Re: seemingly impossible query Hi Kallin, again please look at FieldCollapsing <http://wiki.apache.org/solr/FieldCollapsing>, that should do the trick. basically: first you constrain the field 'listOfIds' to only contain docs that contain any of the (up to) 100 random ids as you know how to do. Next, in the same query, specify to collapse on field 'listOfIds'. basically: q=listOfIds:1 OR listOfIds:10 OR listOfIds:24&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal this would return the top-matching doc for each id left in listOfIds. Since you constrained this field by the ids specified you are left with 1 matching doc for each id. Again it is not guaranteed that all docs returned are different. Since you didn't specify this as a requirement I think this will suffice. Cheers, Geert-Jan 2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com Yeah I need something like: (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that.. I'm not sure how I can hit solr once. If I do try and do them all in one big OR query then I'm probably not going to get a hit for each ID. I would need to request probably 1000 documents to find all 100 and even then there's no guarantee and no way of knowing how deep to go. -Kallin Nagelberg -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Thursday, May 20, 2010 12:27 PM To: solr-user@lucene.apache.org Subject: RE: seemingly impossible query I see. Well, now you're asking Solr to ignore its prime directive of returning hits that match a query. Hehe. I'm not sure if Solr has a unique attribute.
But this sounds, to me, like you will have to filter the results yourself. But at least you hit Solr only once before doing so. Good luck! Thanks Darren, The problem with that is that it may not return one document per id, which is what I need. IE, I could give 100 ids in that OR query and retrieve 100 documents, all containing just 1 of the IDs. -Kallin Nagelberg -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Thursday, May 20, 2010 12:21 PM To: solr-user@lucene.apache.org Subject: Re: seemingly impossible query Ok. I think I understand. What's impossible about this? If you have a single field named id that is multivalued then you can retrieve the documents with something like: id:1 OR id:2 OR id:56 ... id:100 then add limit 100. There's probably a more succinct way to do this, but I'll leave that to the experts. If you also only want the documents within a certain time, then you also create a time field and use a conjunction (id:0 ...) AND time:NOW-1H or something similar to this. Check the query syntax wiki for specifics. Darren Hey everyone, I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries. My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated to them. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N Ids as input, N docs returned). I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. This could work OK since some of these ids should repeat quite often. Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas? Thanks, -Kallin Nagelberg
Re: Subclassing DIH
Ok to further explain myself. Well first off I was experiencing a StackOverflowError during my delta-imports after doing a full-import. The strange thing was, it only happened sometimes. Thread is here: http://lucene.472066.n3.nabble.com/StackOverflowError-during-Delta-Import-td811053.html#a824780 I never did find a good solution to that bug however I did come up with a workaround. I noticed if I removed my deletedPkQuery then the delta-import would work as expected. Obviously I still have the need to delete items out of the index during indexing so I wanted to subclass the DataImportHandler to first update all documents then I would delete all the documents that my deletedPkQuery would have deleted. I can actually accomplish the above behavior using the onImportEnd EventListener however I lose the ability to know how many documents were actually deleted since my manual deletion of documents doesn't get picked up in the data importer cumulativeStatistics. My hope was that I could subclass DIH and massage the cumulativeStatistics after my manual deletion of documents. FYI my manual deletion is accomplished by sending a deleteById query to an instance of CommonsHttpSolrServer that I create from the current context of the EventListener. Side question: How can I retrieve the # of items actually removed from the index after a deleteById query??? Thoughts on the process? There just has to be an easier way. -- View this message in context: http://lucene.472066.n3.nabble.com/Subclassing-DIH-tp830954p832684.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Non-English query via Solr Example Admin corrupts text
I wanted to improve the documentation in the solr wiki by adding in my findings. However, when I try to log in and create a new account, I receive this error message: You are not allowed to do newaccount on this page. Login and try again. Does anyone know how I can get permission to add a page to the documentation? Tim -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Thursday, May 20, 2010 3:21 PM To: solr-user@lucene.apache.org Subject: RE: Non-English query via Solr Example Admin corrupts text : Starting with glassfishv3 (I think) UTF-8 is the default for URI. You : can see this by going to the admin site, clicking on Network Config | : Network Listeners | then select the listener. Select the tab HTTP and : about half way down, you will see URI Encoding: UTF-8. : : HOWEVER, that doesn't appear to be correct because following Abdelhamid ... I know nothing about glassfish, but according to that forum URL i mentioned before, the URI Encoding option in glassfish explicitly (and evidently contentiously) does not apply to the query args -- only the path, hence the two different config options mentioned in the FAQ... : http://forums.java.net/jive/thread.jspa?threadID=38020 ... : http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding -Hoss
Re: Solr highlighter and custom queries?
Actually, it's not as much a Solr problem as a Lucene one; as it turns out, the WeightedSpanTermExtractor is in Lucene and not Solr. Why they decided to only highlight queries that are in Lucene I don't know, but what I did to solve this problem was simply to make my queries extend a Lucene query instead of just Query. So I decided to extend a BooleanQuery, which is the closest fit to what mine actually does. This makes the highlighting do something even though it's not perfect. Daniel Shane
Endeca vs Solr?
First of all, I'd like to apologize in advance for being a pretty raw newbie when it comes to search technologies, so please bear with me! The situation: My company has a system that moderates 15 character free form text fields. We have a dictionary of words in our database that are banned due to various legal reasons (profanity, copyright issues, etc). Our system does an initial check when the user is entering their choices for these fields and auto-rejects anything that matches or comes close to matching (based on phonetics, purposefully misspelled, etc) anything on our banned list. Once the order is placed, the system checks again to see if the fields are exact matches to anything on our auto-approve list (held in same database as previous list) and passes those on through. Items that do not match either list are moved to a review queue where a customer service rep manually reviews the items. During the review the CSR can add a word to either list, which will prevent future orders using the newly added value from needing to be reviewed. My question: Currently our system simply holds all the words in a hash map in memory, but we're worried about scalability. I've been asked to try and find out more about Solr and how it compares to Endeca, which another of our departments uses but I'm not very familiar with. I've been reading the wiki and other articles I've found online, but it seems like there's a lot of overlap of features between Solr and Endeca; the main difference just seems to be cost. Endeca also seems to have better support of real time searches, and has a stricter sorting algorithm. On the other hand, it sounds like Solr re-indexes quickly enough for my purposes, and its sorting algorithm can be tweaked to match what I need. Are there any other technical differences between the two if used in the scenario I described above? Also, are there any important hardware footprint differences?
I'm no admin, but I believe our system runs on JBoss on a Solaris box last I checked. Any help or insight you guys can provide would help greatly. Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Endeca-vs-Solr-tp832826p832826.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Shard - Strange results
I know this post is old but did you ever get a resolution to this problem? I am running into the exact same issue. I even switched my id from text to string and reindexed as that was the last suggestion and still no resolution. --Tony -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Shard-Strange-results-tp496373p832844.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Shard - Strange results
So are we the only ones who never got sharding working with multi-cores? Bummer... Hopefully someone else will chime in with an answer. --Tony -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Shard-Strange-results-tp496373p832863.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Endeca vs Solr?
Hello kkieser. I've used both and my name may have come up in your searches. For your system, I would definitely not use Endeca as it's too complicated for the relatively simple needs that you have. You asked if there are technical differences and of course being two different systems, the answer is yes -- but both can fit your needs. I'm not quite convinced that either would be worthwhile for what you describe over something more home-grown with a database. I could see you re-using Lucene's analysis package to tokenize and then process each token, matching it against a hashtable. By the way, Solr is going to use a Hashtable as well at either index or query time to handle synonyms. Your scenario does not suggest that this list would be so large as to be concerning. Of course if you want other features in Solr like highlighting and faceting and the other goodies, then it's clearly worthwhile. ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Endeca-vs-Solr-tp832826p832972.html Sent from the Solr - User mailing list archive at Nabble.com.
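David's home-grown suggestion (tokenize the field, then look each token up in an in-memory set) can be sketched in a few lines of Java. The naive split below stands in for a real Lucene analysis chain with lowercasing and phonetic filters, and the class and method names are made up for illustration:

```java
import java.util.Locale;
import java.util.Set;

public class BannedWordChecker {
    // In-memory banned list; the thread's real system would populate this
    // from the database table of banned words.
    private final Set<String> banned;

    BannedWordChecker(Set<String> banned) {
        this.banned = banned;
    }

    // Lowercase and split on non-word characters, then check each token.
    // A production version would run the same analyzer used at index time
    // (phonetic/misspelling filters) instead of this simple split.
    boolean containsBannedWord(String field) {
        for (String token : field.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (banned.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

Usage: new BannedWordChecker(Set.of("badword")).containsBannedWord("My BadWord here!") returns true, while clean text returns false.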
Re: Non-English query via Solr Example Admin corrupts text
<rant_by_HTTP_Verb_Nazi> Using POST totally violates the access model for an entity in the HTTP Verb model. Basically: GET=READ POST=CREATE PUT=MODIFY DELETE=(drum roll please)DELETE Granted, the whole web uses POST for modify, but let's not make the situation worse by using it for everything. </rant_by_HTTP_Verb_Nazi> Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Thu, 5/20/10, Chris Hostetter hossman_luc...@fucit.org wrote: From: Chris Hostetter hossman_luc...@fucit.org Subject: Re: Non-English query via Solr Example Admin corrupts text To: solr-user@lucene.apache.org Date: Thursday, May 20, 2010, 11:40 AM : I am using apache-solr-1.4.0.war deployed to glassfishv3 on my ... : INFO: [] webapp=/apache-solr-1.4.0 path=/select : params={indent=on&version=2.2&q=numéro&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl.fl=} : hits=0 status=0 QTime=16 ... : In my SolrJ using application, I have a test case which queries for : numéro and succeeds if I use Embedded and fails if I use : CommonsHttpSolrServer... I don't want to use embedded for a number of ... : I am sorry if you'd dealt with this issue in the past, I've spent a few : hours googling for solr utf-8 query and glassfishv3 utf-8 uri plus other : permutations/combinations but there were seemingly endless amounts of : chaff that I couldn't find anything useful after scouring it for a few : hours. I can't decide whether it's a glassfish issue or not so I am not : sure where to direct my energy. Any tips or advice are appreciated! I suspect if you switched to using POST instead of GET your problem would go away -- this stems from ambiguity in the way HTTP servers/browsers deal with encoding UTF8 in URLs. a quick search for glassfish url encoding turns up this thread... http://forums.java.net/jive/thread.jspa?threadID=38020 which references...
http://wiki.glassfish.java.net/Wiki.jsp?page=FaqHttpRequestParameterEncoding ...it looks like you want to modify the default-charset attribute of the parameter-encoding -Hoss
Re: Endeca vs Solr?
Thanks for your response David! At the moment we have over 40,000 words on our banned list, and only recently added the white list, so we anticipate this number to jump quite quickly. I've heard Solr can handle up to around 2 million records before slowing down so I'm not too worried about hitting that limit. Our database implementation has already started slowing down and is causing complaints from the CSRs. This system is used on a public facing website that gets quite a lot of traffic, which is why we're looking into swapping from having the full database hashmap in memory to something with an efficient index that can handle both the high traffic of users creating designs as well as the CSRs reviewing the ones that aren't auto approved or auto rejected. -- View this message in context: http://lucene.472066.n3.nabble.com/Endeca-vs-Solr-tp832826p833016.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Endeca vs Solr?
kkieser, It just occurred to me that Solr might actually fit the bill. Your scenario is definitely not a typical use of Solr, but a novel use of Solr I am about to describe could totally get what you want. A Solr index is composed of documents which are typically similar to a user document or database record or something like that. But in your case, the document would be one word that's either one of your good words or bad words. You could have a boolean indicating which type, and you could index it several ways including phonetically. When you want to compare a document to see if it matches any words, you use Solr's More-Like-This feature, configured appropriately, to tell you what matching documents (e.g. naughty words) get matched. You could even facet on the naughty boolean to know how many of each. What I described is definitely not a task for Endeca. ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Endeca-vs-Solr-tp832826p833019.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: jmx issue with solr
http://wiki.apache.org/solr/SolrJmx#Remote_Connection_to_Solr_JMX Ask the wiki! On Wed, May 19, 2010 at 6:19 AM, Na_D nabam...@zaloni.com wrote: Thanks for the info , using the above properties solved the issue . -- View this message in context: http://lucene.472066.n3.nabble.com/jmx-issue-with-solr-tp828478p829057.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
How real-time are Solr/Lucene queries?
Hello Solr, Solr looks like an excellent API and it's nice to have a tutorial that makes it easy to discover the basics of what Solr does, I'm impressed. I can see plenty of potential uses of Solr/Lucene and I'm interested now in just how real-time the queries made to an index can be? For example, in my application I have time ordered data being processed by a paint method in real-time. Each piece of data is identified and its associated renderer is invoked. The Java2D renderer would then look up any layout and style values it requires to render the current data it has received from the layout and style indexes. What I'm wondering is if this lookup, which would be a Lucene search, will be fast enough? Would it be best to make Lucene queries for the relevant layout and style values required by the renderers ahead of rendering time and have the query results placed into the most performant collection (map/array) so renderer lookup would be as fast as possible? Or can Lucene handle many individual lookup queries fast enough so rendering is quick? Best regards from Canada, Thom
Special Circumstances for embedded Solr
Hi all, We'd started using embedded Solr back in 2007, via a patched version of the in-progress 1.3 code base. I recently was reading http://wiki.apache.org/solr/EmbeddedSolr, and wondered about the paragraph that said: The simplest, safest, way to use Solr is via Solr's standard HTTP interfaces. Embedding Solr is less flexible, harder to support, not as well tested, and should be reserved for special circumstances. Given the current state of SolrJ, and the expected roadmap for Solr in general, what would be some guidelines for special circumstances that warrant the use of SolrJ? I know what ours were back in 2007 - namely: - we had multiple indexes, but didn't want to run multiple webapps (now handled by multi-core) - we needed efficient generation of updated indexes, without generating lots of HTTP traffic (now handled by DIH, maybe with specific extensions?) - we wanted tighter coupling of the front-end API with the back-end Solr search system, since this was an integrated system in the hands of customers - no just restart the webapp container option if anything got wedged (might still be an issue?) Any other commonly compelling reasons to use SolrJ? Thanks, -- Ken Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: How real-time are Solr/Lucene queries?
Solr is a very good engine, but it is not real-time. You can turn off the caches and reduce the delays, but it is fundamentally not real-time. I work at MarkLogic, and we have a real-time transactional search engine (and repository). If you are curious, contact me directly. I do like Solr for lots of applications -- I chose it when I was at Netflix. wunder On May 20, 2010, at 7:22 PM, Thomas J. Buhr wrote: Hello Solr, Solr looks like an excellent API and it's nice to have a tutorial that makes it easy to discover the basics of what Solr does, I'm impressed. I can see plenty of potential uses of Solr/Lucene and I'm interested now in just how real-time the queries made to an index can be? For example, in my application I have time ordered data being processed by a paint method in real-time. Each piece of data is identified and its associated renderer is invoked. The Java2D renderer would then look up any layout and style values it requires to render the current data it has received from the layout and style indexes. What I'm wondering is if this lookup, which would be a Lucene search, will be fast enough? Would it be best to make Lucene queries for the relevant layout and style values required by the renderers ahead of rendering time and have the query results placed into the most performant collection (map/array) so renderer lookup would be as fast as possible? Or can Lucene handle many individual lookup queries fast enough so rendering is quick? Best regards from Canada, Thom
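The pre-fetch idea from the original question (run the comparatively slow index queries ahead of rendering time, then give the paint loop an O(1) map lookup) might look like this sketch; all names here are hypothetical, and the preload map stands in for results batched out of the style/layout index:

```java
import java.util.HashMap;
import java.util.Map;

public class StyleCache {
    // Styles keyed by data id, filled once before rendering starts so the
    // paint loop never issues a search query itself.
    private final Map<String, String> styleById = new HashMap<>();

    // Called ahead of rendering time with the results of a batch query
    // against the style index.
    void preload(Map<String, String> queryResults) {
        styleById.putAll(queryResults);
    }

    // O(1) lookup for the renderer's inner loop, with a fallback style
    // for data ids the batch query didn't cover.
    String styleFor(String dataId) {
        return styleById.getOrDefault(dataId, "default-style");
    }
}
```

The trade-off is staleness: if styles change mid-render, the cache has to be refreshed, which is exactly the real-time gap Walter describes.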