Indexing strategies?
Hi, I'm facing a dilemma in choosing an indexing strategy. My application architecture is:
- I have a listing table in my DB
- For each listing, I make 3 calls to a URL datasource in a different system

I have 200k records. Indexing 25 docs takes 1 minute, so 200k might take more than 100 hours :-( I know there are a lot of factors to consider, from the network to the DB. I'm looking for different strategies we could use to index:
- Can we run multiple data import handlers? One data-config for the first 100k and a second one for the other 100k?
- Would it be possible to write a Java service using SolrJ and make multi-threaded calls to Solr to index?
- The URL datasources I'm using actually reside in the MSSQL database of a different system. Could I speed up indexing by using a JDBCDataSource that queries the DB directly instead of going through the API URL datasource?

Are there any other strategies we could use? Thank you,
Phonetic search on multiple fields
Hi, I am a beginner with Solr, and I am trying to implement phonetic search in my application. My fieldType definitions in schema.xml are:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="0" generateWordParts="1" stemEnglishPossessive="0" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

And:

<fieldType name="text_general_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="0" generateWordParts="1" stemEnglishPossessive="0" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

And the field definitions:

<field name="fname" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="fname_copy" type="text_general_phonetic" indexed="true" stored="true" required="false"/>
<copyField source="fname" dest="fname_copy"/>

When I search for "stephen" it gives me "stephen", but searching "stifn" (which should match phonetically) does not work... Also, how can I use a phonetic filter with the DoubleMetaphone encoder? Please help me. Thanks in advance.
Indexing in the case of entire shard failure
We have a system with 2 shards, where every shard has a leader and one replica. During indexing, one of the shards (both leader and replica) was shut down. We got two types of HTTP responses: Service Unavailable and OK. From this we concluded that the shard which was still up kept on indexing. Is this behavior correct? We expected that if an entire shard fails, the system stops indexing completely.
Re: Indexing in the case of entire shard failure
Note: In SolrCloud terminology, a leader is also a replica. IOW, you have two replicas, one of which (and it can vary over time) is elected as leader for that shard. The other shards remain capable of indexing even if one shard becomes unavailable. That is expected - and desired - behavior in a fault-tolerant, fully-distributed system. Your application can/should make its own decision as to what it will do if an indexing operation cannot be serviced.

-- Jack Krupansky

-----Original Message----- From: elmerfudd Sent: Wednesday, February 12, 2014 7:54 AM To: solr-user@lucene.apache.org Subject: Indexing in the case of entire shard failure [...]
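For illustration, a minimal SolrJ sketch of that application-side decision; the retry queue and helper method are assumptions for the example, not anything prescribed by Solr:

import java.io.IOException;
import java.util.Queue;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical policy: if an add cannot be serviced (e.g. the target shard is
// entirely down), park the document locally and retry once the shard recovers.
void addOrQueue(SolrServer solr, SolrInputDocument doc, Queue<SolrInputDocument> retryQueue) {
    try {
        solr.add(doc);
    } catch (SolrServerException | IOException e) {
        retryQueue.offer(doc);  // application-level decision, not Solr's
    }
}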
Searching phonetic by DoubleMetaphone soundex encoder
Hi, I am using Solr to search for phonetically equivalent strings. My schema contains:

<fieldType name="text_general_doubleMetaphone" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

and the fields are:

<field name="fname" type="text_general" indexed="true" stored="true" required="false"/>
<field name="fname_sound" type="text_general_doubleMetaphone" indexed="true" stored="true" required="false"/>
<copyField source="fname" dest="fname_sound"/>

It works when I search "stfn" ===> stephen, stephn. But I am also expecting "stephn" ===> stephen. How will I get this result? Am I doing something wrong? Thanks in advance.
Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
I am running a very simple performance experiment where I post 2000 documents to my application, which in turn persists them to a relational DB and sends them to Solr for indexing (synchronously, in the same request). I am testing 3 use cases:

1. No indexing at all - ~45 sec to post 2000 documents
2. Indexing included - commit after each add - ~8 minutes (!) to post and index 2000 documents
3. Indexing included - commitWithin 1ms - ~55 seconds (!) to post and index 2000 documents

The 3rd result does not make any sense; I would expect the behavior to be similar to the one in point 2. At first I thought that the documents were not really committed, but I could actually see them being added by executing some queries during the experiment (via the Solr web UI). I am worried that I am missing something very big. The code I use for point 2:

SolrInputDocument doc = ... // get doc
SolrServer solrConnection = ... // get connection
solrConnection.add(doc);
solrConnection.commit();

Whereas the code for point 3:

SolrInputDocument doc = ... // get doc
SolrServer solrConnection = ... // get connection
solrConnection.add(doc, 1); // according to the API documentation, there is no need to explicitly call commit with this API

Is it possible that committing after each add will degrade performance by a factor of 40?
Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
Doing a standard commit after every document is a Solr anti-pattern. commitWithin is a "near-realtime" commit in recent versions of Solr and not a standard commit. https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching

- Mark

http://about.me/markrmiller

On Feb 12, 2014, at 9:52 AM, Pisarev, Vitaliy <vitaliy.pisa...@hp.com> wrote: [...]
Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
Yes, committing after each document will greatly degrade performance. I typically use autoCommit and autoSoftCommit to set the time interval between commits, but commitWithin should have a similar effect. I often see performance of 2000+ docs per second on the load when using auto commits. When explicitly committing after each document, your commits will happen far too frequently, overworking the indexing process.

Joel Bernstein
Search Engineer at Heliosearch

On Wed, Feb 12, 2014 at 9:52 AM, Pisarev, Vitaliy <vitaliy.pisa...@hp.com> wrote: [...]
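For reference, a minimal solrconfig.xml sketch of the autoCommit/autoSoftCommit combination described above; the intervals are illustrative assumptions and should be tuned to your own visibility and durability needs:

<!-- hard commit: flush to disk periodically, without opening a new searcher -->
<autoCommit>
  <maxTime>600000</maxTime>          <!-- every 10 minutes (illustrative) -->
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- soft commit: make newly added documents visible to searches -->
<autoSoftCommit>
  <maxTime>1000</maxTime>            <!-- every second (illustrative) -->
</autoSoftCommit>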
Newb - Search not returning any results
I set up a Solr core and populated it with documents, but I am not able to get any results when attempting to search the documents. A generic search (q=*) returns all documents (and the fields/values within those documents); however, when I try to search using specific criteria I get no results back.

I have the following setup in my schema.xml:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="mailingcity" type="string" indexed="true" stored="true" multiValued="false"/>

I run the following query against my Solr instance:

http://{domain}:8983/solr/MIM/select?q=*&rows=2&fl=id+mailingcity&wt=json&indent=true&debugQuery=true

I get the following results back:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "debugQuery": "true",
      "fl": "id mailingcity",
      "indent": "true",
      "q": "*",
      "wt": "json",
      "rows": "2"
    }
  },
  "response": { "numFound": 18024, "start": 0, "docs": [
      { "id": "214530123", "mailingcity": "redford" },
      { "id": "204686608", "mailingcity": "detroit" }
    ]
  },
  "debug": {
    "rawquerystring": "*",
    "querystring": "*",
    "parsedquery": "MatchAllDocsQuery(*:*)",
    "parsedquery_toString": "*:*",
    "explain": {
      "214530123": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n  1.0 = queryNorm\n",
      "204686608": "\n1.0 = (MATCH) MatchAllDocsQuery, product of:\n  1.0 = queryNorm\n"
    },
    "QParser": "LuceneQParser",
    ...
  }
}

However, when I try to search specifically where mailingcity=redford I don't get any results back. See the following query/results.

Query:

http://{domain}:8983/solr/MIM/select?q=mailingcity=redford&rows=2&fl=id,mailingcity&wt=json&indent=true&debugQuery=true

Results:

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "debugQuery": "true",
      "fl": "id,mailingcity",
      "indent": "true",
      "q": "mailingcity=redford",
      "_": "1392218691700",
      "wt": "json",
      "rows": "2"
    }
  },
  "response": { "numFound": 0, "start": 0, "docs": [] },
  "debug": {
    "rawquerystring": "mailingcity=redford",
    "querystring": "mailingcity=redford",
    "parsedquery": "text:mailingcity text:redford",
    "parsedquery_toString": "text:mailingcity text:redford",
    "explain": {},
    "QParser": "LuceneQParser",
    ...
  }
}

If anyone can provide some info on why this is happening and how to solve it, it would be appreciated.

Thank You
Lee
RE: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
I absolutely agree, and I even read the NRT page before posting this question. The thing that baffles me is this: doing a commit after each add kills the performance. On the other hand, when I use commitWithin and specify an (absurd) 1ms delay, I expect this behavior to be equivalent to making a commit, from a functional perspective. Seeing that there is no magic in the world, I am trying to understand what price I am actually paying when using the commitWithin feature: on the one hand it commits almost immediately, on the other hand, it performs wonderfully. Where is the catch?

-----Original Message----- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Wednesday, February 12, 2014 To: solr-user Subject: Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something [...]
Re: Importing database DIH
On 12 February 2014 20:53, Maheedhar Kolla <maheedhar.ko...@gmail.com> wrote:

> Hi, I need help with importing data through DIH (using solr-3.6.1, tomcat6). I see the following error when I try to do a full-import from my local MySQL table:
> [...]
> <str name="">Indexing failed. Rolled back all changes.</str>
> [...]
> I did search for ways to solve this problem and did create the file dataimport.properties, but no success.

You do not have to create dataimport.properties. Look in the Tomcat logs for more details on the error, and post the relevant sections here if you cannot make sense of it. My guess would be that your database credentials are incorrect, or that the SELECT is failing. Try logging into mysql from an admin tool with those credentials, and running the SELECT manually.

Regards, Gora
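A quick way to do that sanity check against the data-config posted in the original message (DBNAME, USER, PWD and TABLENAME are its placeholders, and the column names are assumed from its field mappings, so substitute your real values):

mysql -h localhost -P 3306 -u USER -p DBNAME
mysql> SELECT id, content, title FROM TABLENAME LIMIT 5;

If either the login or the SELECT fails here, DIH will fail the same way.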
Question about how to upload XML by using SolrJ Client Java Code
I was trying to use the SolrJ client to import XML data to a Solr server. I read the SolrJ wiki, which says "SolrJ lets you upload content in XML and Binary format". I realized there is an XML parser in Solr (we can use a data update handler in the default Solr admin UI, under Solr Core > Dataimport). So I was wondering how to directly use the Solr XML parser to upload XML using SolrJ Java code. I could use another open-source XML parser, but I really want to know if there is a way to call Solr's parser library. Would you mind sending me a simple code example if possible? Really appreciated. Thanks in advance.

Solr 4.6.1
Importing database DIH
Hi, I need help with importing data through DIH (using solr-3.6.1, tomcat6). I see the following error when I try to do a full-import from my local MySQL table ( http:/s/solr//dataimport?command=full-import ):

<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">0</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Skipped">0</str>
<str name="">Indexing failed. Rolled back all changes.</str>

I did search for ways to solve this problem and did create the file dataimport.properties, but no success. Any help would be appreciated.

cheers, Kolla

PS: When I check the admin panel statistics for the /dataimport handler, I see the following:

Status: IDLE
Documents Processed: 0
Requests made to DataSource: 0
Rows Fetched: 0
Documents Deleted: 0
Documents Skipped: 0
Total Documents Processed: 0
Total Requests made to DataSource: 0
Total Rows Fetched: 0
Total Documents Deleted: 0
Total Documents Skipped: 0
handlerStart: 1391612468278
requests: 5
errors: 0
timeouts: 0
totalTime: 28
avgTimePerRequest: 5.6

Also, here is my data-config file:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/DBNAME" user="USER" password="PWD"/>
  <document>
    <entity name="docs" query="select * from TABLENAME">
      <field column="id" name="id"/>
      <field column="content" name="text"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>

-- Cheers, Kolla
RE: Importing database DIH
It can be anything from wrong credentials, to a missing driver in the classpath, to a malformed connection string, etc. What does the Solr log say?

-----Original Message----- From: Maheedhar Kolla [mailto:maheedhar.ko...@gmail.com] Sent: Wednesday, February 12, 2014 To: solr-user@lucene.apache.org Subject: Importing database DIH [...]
Re: Newb - Search not returning any results
On 12 February 2014 20:57, leevduhl <ld...@corp.realcomp.com> wrote:

> [...]
> However, when I try to search specifically where mailingcity=redford I don't get any results back. See the following query/results.
> Query: http://{domain}:8983/solr/MIM/select?q=mailingcity=redford&rows=2&fl=id,mailingcity&wt=json&indent=true&debugQuery=true

Please start by reading https://wiki.apache.org/solr/SolrQuerySyntax

The argument to 'q' above should be mailingcity:redford. The debug section in the results even tells you that, as the parsed query becomes "text:mailingcity text:redford", which means that it is searching the default full-text search field for the strings "mailingcity" and/or "redford".

Regards, Gora
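In other words, replacing '=' with ':' in the query parameter is the whole fix; the corrected request would look like this (same URL as above, only q changed):

http://{domain}:8983/solr/MIM/select?q=mailingcity:redford&rows=2&fl=id,mailingcity&wt=json&indent=true&debugQuery=true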
Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
Cross-posting my answer from SO:

According to this wiki: https://wiki.apache.org/solr/NearRealtimeSearch the commitWithin is a soft commit by default. Soft commits are very efficient in terms of making the added documents immediately searchable. But! They are not on the disk yet. That means the documents are being committed into RAM. In this setup you would use the updateLog to make the Solr instance crash-tolerant.

What you do in point 2 is a hard commit, i.e. flushing the added documents to disk. Doing this after each document add is very expensive. So instead, post a bunch of documents and issue a hard commit, or even have your autoCommit set to some reasonable value, like 10 min or 1 hour (depends on your user expectations).

On Wed, Feb 12, 2014 at 5:28 PM, Pisarev, Vitaliy <vitaliy.pisa...@hp.com> wrote: [...]

-- Dmitry
Blog: http://dmitrykan.blogspot.com Twitter: twitter.com/dmitrykan
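"Post a bunch of documents and issue a hard commit" in SolrJ terms might look like the following sketch; the batch size is an illustrative assumption:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

// Batch the adds and do a single hard commit at the end of the run,
// instead of committing per document.
void indexBatched(SolrServer solr, List<SolrInputDocument> docs) throws Exception {
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (SolrInputDocument doc : docs) {
        batch.add(doc);
        if (batch.size() == 500) {   // illustrative batch size
            solr.add(batch);
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {
        solr.add(batch);
    }
    solr.commit();                   // one hard commit for the whole run
}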
Re: Indexing strategies?
I'd seriously consider a SolrJ program that pulled the necessary data from two of your systems, held it in a cache, and then pulled the data from your main system and enriched it with the cached data. Or export your information from your remote systems and import it into a single system where you could do joins. I believe DIH has some caching ability too that you might consider.

Your basic problem is an inefficient data model where you have to query these different systems on a row-by-row basis; that's where I'd concentrate my energies.

Best, Erick

On Wed, Feb 12, 2014 at 2:09 AM, manju16832003 <manju16832...@gmail.com> wrote: [...]
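On the multi-threaded SolrJ question: a minimal sketch using ConcurrentUpdateSolrServer, which buffers adds in a queue and streams them to Solr from background threads. The URL, queue size, thread count and the buildDocsFromDbAndCache() helper are all assumptions for the example:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

void reindex() throws Exception {
    // queue of up to 1000 buffered docs, drained by 4 background threads
    ConcurrentUpdateSolrServer solr =
        new ConcurrentUpdateSolrServer("http://localhost:8983/solr/listings", 1000, 4);
    for (SolrInputDocument doc : buildDocsFromDbAndCache()) { // hypothetical helper
        solr.add(doc);
    }
    solr.blockUntilFinished(); // wait for the background queue to drain
    solr.commit();
    solr.shutdown();
}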
Re: Phonetic search on multiple fields
First, why are you talking about DoubleMetaphone when your fieldType uses BeiderMorseFilterFactory?

Which points up a basic issue you need to wrap your head around or you'll be endlessly confused. At least I was... Your analysis chains _must_ do compatible things at index and query time. The fieldType you're using for phonetic searching does not, since it doesn't use the BeiderMorseFilterFactory at query time. So the actual values in your index are whatever the Beider... factory produces, but the terms searched are NOT transformed by that factory.

Say you index the term Erick. Your index may have (and I don't remember what the actual output of Beider is) something totally transformed like MNUA. But your query does NOT do the transformation, so the query is looking for Erick. Obviously it isn't found.

I _strongly_ advise you to take some time to get familiar with the admin/analysis page; that'll shed light on a _lot_ of analysis issues.

Best, Erick

On Wed, Feb 12, 2014 at 4:26 AM, Navaa <navnath.thomb...@xtremumsolutions.com> wrote: [...]
Re: Importing database DIH
Thanks for the comments/advice. I did mess with the drivers (by deliberately moving the libs) and it did fail as it is supposed to.

When I looked into catalina.out, I realized that the problem lay with the data directory being owned by root instead of tomcat6. I changed it so that tomcat6 can write to the data directory, and now I see a different error (after processing 100+ rows). But I am happy that the initial problem is gone. I should be able to fix these minor ones.

Thanks for the advice again.

Cheers, Kolla

PS: I think I was looking into the wrong logs before :/

On Wed, Feb 12, 2014 at 10:31 AM, Pisarev, Vitaliy <vitaliy.pisa...@hp.com> wrote: [...]
Re: Question about how to upload XML by using SolrJ Client Java Code
Hmmm, before going there let's be sure you're trying to do what you think you are. Solr does _not_ index arbitrary XML. There is a very specific format of XML that describes Solr documents that _can_ be indexed. But random XML is not supported. See the documents in example/exampledocs for the XML form of Solr docs.

So if you have arbitrary XML, you need to parse it and then construct Solr documents. One way would be to use SolrJ: parse the docs using your favorite Java parser and construct SolrInputDocuments, which you then add to the index using one of the SolrServer classes (e.g. CloudSolrServer). There really is no Solr XML parser that I know of; Solr just uses one of the standard XML parsers (e.g. SAX)...

Best, Erick

On Wed, Feb 12, 2014 at 7:21 AM, Eric_Peng <sagittariuse...@gmail.com> wrote: [...]
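A minimal sketch of that parse-and-construct approach; the input file name, element names and field mappings here are made-up assumptions, since they depend entirely on your own XML:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Parse arbitrary XML (hypothetical <record><id>..</id><title>..</title></record>
// elements) and turn each record into a SolrInputDocument.
void importXml() throws Exception {
    Document xml = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new File("records.xml"));
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    NodeList records = xml.getElementsByTagName("record");
    for (int i = 0; i < records.getLength(); i++) {
        Element r = (Element) records.item(i);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", r.getElementsByTagName("id").item(0).getTextContent());
        doc.addField("title", r.getElementsByTagName("title").item(0).getTextContent());
        solr.add(doc);
    }
    solr.commit();
}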
Re: Question about how to upload XML by using SolrJ Client Java Code
On 2/12/2014 8:21 AM, Eric_Peng wrote: [...]

When the docs say that SolrJ lets you upload data in XML and binary format, what they actually mean is that SolrJ will create an update request that is formatted using XML, not that it will let you send arbitrary XML data. It is referring to the specific XML format shown here:

http://wiki.apache.org/solr/UpdateXmlMessages#add.2Freplace_documents

As for an XML parser ... SolrJ's XMLResponseParser is a class that accepts XML *responses* from Solr and translates them into the Java response object. There is also BinaryResponseParser.

The only things that I am aware of in Solr that will deal with XML as the data source are the XPathEntityProcessor in the dataimport handler and the ExtractingRequestHandler, which uses Apache Tika. Both of these are actually contrib modules -- jar files for these features are in the download, but not built into Solr or SolrJ.

If you are using the extracting request handler, you could probably use the DirectXmlRequest object, where 'xml' is a String with the XML in it:

DirectXmlRequest req = new DirectXmlRequest("/update/extract", xml);
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("someParam", "someValue");
req.setParams(params);
NamedList<Object> response = solrServer.request(req);

I hope that you are right and there actually is an XML parser built into SolrJ. We would both learn something.

Thanks, Shawn
Re: Newb - Search not returning any results
Thanks, the syntax correction solved the problem. I actually thought I had tried that before I posted.

Thanks
Lee
Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
Here's some additional background that may shed light on the performance:

http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best, Erick

On Wed, Feb 12, 2014 at 7:40 AM, Dmitry Kan <solrexp...@gmail.com> wrote: [...]
Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
The explicit commit will cause your app to be delayed until that commit completes, and then Solr will be idle until that request completion makes its way back to your app and you submit another request which finds its way to Solr - maybe a few ms, including network latency. That interval of time could well be more than enough for the short-interval autoCommit or commitWithin to run in the background, in parallel with the request returning to your app and your app submitting the subsequent request. The magic of asynchronous operation in a parallel and distributed computing environment, coupled with multi-core processors and parallel threads.

-- Jack Krupansky

-----Original Message----- From: Pisarev, Vitaliy Sent: Wednesday, February 12, 2014 10:28 AM To: solr-user@lucene.apache.org Subject: RE: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something [...]
Re: Question about how to upload XML by using SolrJ Client Java Code
Thanks a lot, learned a lot from it.
Re: Question about how to upload XML by using SolrJ Client Java Code
Thank you so much Erick, I will try to write my own XML parser.
Re: Using numeric ranges in Solr query
Hello!

Just specify the left boundary as well, like: price:[900 TO 1000]

-- Regards, Rafał Kuć
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

> When user enters a price in the price field, for example 1000 USD, I want to fetch all items with a price around 1000 USD. [...]
Re: Question about how to upload XML by using SolrJ Client Java Code
There is also an XSLT update handler option to transform raw XML to Solr XML on the fly. If anybody here has used it, feel free to chime in.

See:
http://wiki.apache.org/solr/XsltUpdateRequestHandler
and
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-UsingXSLTtoTransformXMLIndexUpdates

-- Jack Krupansky

-----Original Message----- From: Eric_Peng Sent: Wednesday, February 12, 2014 11:42 AM To: solr-user@lucene.apache.org Subject: Re: Question about how to upload XML by using SolrJ Client Java Code [...]
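Going by the second reference above, the usage looks roughly like this: put a stylesheet in the core's conf/xslt/ directory and name it in the tr parameter of the update request (the stylesheet and data file names here are assumptions):

curl "http://localhost:8983/solr/update?commit=true&tr=updateXml.xsl" \
  -H "Content-Type: text/xml" --data-binary @mydata.xml

The stylesheet's job is to transform your raw XML into the <add><doc>...</doc></add> update format before Solr indexes it.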
Using numeric ranges in Solr query
When a user enters a price in the price field, for example 1000 USD, I want to fetch all items with a price around 1000 USD. I found in the documentation that I can use price:[* TO 1000]. That will get all items from 1 to 1000 USD. But I want to get results where the price is between 900 and 1000 USD. Any help is appreciated. Thank you.
Re: Using numeric ranges in Solr query
Is price a float/double field?

price:[99.5 TO 100.5] -- price near 100
price:[900 TO 1000] or price:[899.5 TO 1000.5]

-- Jack Krupansky

-----Original Message----- From: jay67 Sent: Wednesday, February 12, 2014 12:03 PM To: solr-user@lucene.apache.org Subject: Using numeric ranges in Solr query [...]
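If the field isn't numeric yet, range queries like these need a numeric (trie) field type; a sketch of a suitable schema.xml definition, assuming the stock example schema's tfloat type is available:

<field name="price" type="tfloat" indexed="true" stored="true"/>

With a string field, [900 TO 1000] would be compared lexicographically and give surprising results.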
Re: Set up embedded Solr container and cores programmatically to read their configs from the classpath
Hi Robert,

I don't think this is possible at the moment, but I hope to get https://issues.apache.org/jira/browse/SOLR-4478 in for Lucene/Solr 4.7, which should allow you to inject your own SolrResourceLoader implementation for core creation (it sounds as though you want to wrap the core's loader in a ClasspathResourceLoader). You could try applying that patch to your setup and see if that helps you out.

Alan Woodward
www.flax.co.uk

On 11 Feb 2014, at 10:41, Robert Krüger wrote:

Hi, I have an application with an embedded Solr instance (and I want to keep it embedded), and so far I have been setting up my Solr installation programmatically, using folder paths to specify where the specific container or core configs are. I have used the CoreContainer methods createAndLoad and create with File arguments, and this works fine.

However, now I want to change this so that all configuration files are loaded from certain locations using the classloader, but I have not been able to get this to work. E.g. I want to have my solr config located in the classpath at my/base/package/solr/conf and the core configs at my/base/package/solr/cores/core1/conf, my/base/package/solr/cores/core2/conf, etc. Is this possible at all? Looking through the source code it seems that specifying classpath resources in such a qualified way is not supported, but I may be wrong.

I could get this to work for the container by supplying my own implementation of SolrResourceLoader that allows a base path to be specified for the resources to be loaded (I first thought that would happen already when specifying instanceDir accordingly, but looking at the code it does not: for resources loaded through the classloader, instanceDir is not prepended). However, then I am stuck with the loading of the cores' resources, as the respective code (see org.apache.solr.core.CoreContainer#createFromLocal) instantiates a SolrResourceLoader internally.

Thanks for any help with this (be it a clarification that it is not possible).

Robert
RE: Solr4 performance
Does Solr4 load the entire index into a memory-mapped file? What is the eviction policy of this memory-mapped file? Can we control it?

_____
From: Joshi, Shital [Tech] Sent: Wednesday, February 05, 2014 12:00 PM To: 'solr-user@lucene.apache.org' Subject: Solr4 performance

Hi,

We have a SolrCloud cluster (5 shards and 2 replicas) on 10 dynamic compute boxes (cloud). We're using local disk (/local/data) to store Solr index files. All hosts have 60GB RAM, and the Solr4 JVMs are running with a max 30GB heap size. So far we have 470 million documents. We are using custom sharding, and all shards have ~9-10 million documents. We have a GUI sending queries to this cloud, and the GUI has a 30-second timeout. Lately we're getting many timeouts on the GUI, and upon checking we found that all timeouts are happening on 2 hosts. The admin GUI for one of the hosts shows 96% physical memory usage, but the other host looks perfectly fine. The two hosts serve different shards. Would increasing the RAM of these two hosts make these timeouts go away? What else can we check?

Many Thanks!
Re: Solr4 performance
No, Solr doesn't load the entire index in memory. I think you'll find Uwe's blog most helpful on this matter:

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

On Thu, Feb 13, 2014 at 12:27 AM, Joshi, Shital <shital.jo...@gs.com> wrote: [...]

-- Regards, Shalin Shekhar Mangar.
Re: Solr4 performance
Shital,

Take a look at http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html as it's a pretty decent explanation of memory-mapped files. I don't believe that the default configuration for Solr is to use MMapDirectory, but even if it does, my understanding is that the entire file won't be forcibly cached by Solr. The OS's filesystem cache controls what's actually in RAM, and the eviction process will depend on the OS.

Thanks, Greg

On Feb 12, 2014, at 12:57 PM, Joshi, Shital <shital.jo...@gs.com> wrote: [...]
RE: Indexing spatial fields into SolrCloud (HTTP)
Hi David, I finally got back to this again, after getting sidetracked for a couple of weeks. I implemented things in accordance with my understanding of what you wrote below. Using SolrJ, the code to index the spatial field is as follows, private void addSpatialField(double lat, double lon, SolrInputDocument document) { StringBuilder sb = new StringBuilder(); sb.append(lat).append(,).append(lon); document.addField(location, sb.toString()); } Using Solr 4.3.1 and spatial4j 0.3, I am getting the following error in the solr logs: 1518436 [qtp1529118084-24] ERROR org.apache.solr.core.SolrCore â org.apache.solr.common.SolrException: com.spatial4j.core.exception.InvalidShapeException: Unable to read: Pt(x=-72.544123,y=41.85) at org.apache.solr.schema.AbstractSpatialFieldType.parseShape(AbstractSpatialFieldType.java:144) at org.apache.solr.schema.AbstractSpatialFieldType.createFields(AbstractSpatialFieldType.java:118) spatial4j 0.3 is looking for something like POINT but Solr is converting my lat,long to Pt(x=-72.544123,y=41.85). Version mismatch? Thanks in advance for your help! Jim Beale From: Smiley, David W. [mailto:dsmi...@mitre.org] Sent: Monday, January 13, 2014 11:30 AM To: Beale, Jim (US-KOP); solr-user@lucene.apache.org Subject: Re: Indexing spatial fields into SolrCloud (HTTP) Hello Jim, By the way, using GeohashPrefixTree.getMaxLevelsPossible() is usually an extreme choice. Instead you probably want to choose only as many levels needed for your distance tolerance. See SpatialPrefixTreeFactory which you can use outright or borrow the code it uses. Looking at your code, I see you are coding against Solr directly in Java instead of where most people do this as an HTTP web service. But it's unclear in what context your code runs because you are getting into the guts of things that you normally don't have to do, even if your using SolrJ or writing some sort of UpdateRequestProcessor. Simply configure the field type in schema.xml appropriately, and then to index a point simply give Solr a string for the field in latitude, longitude format. I don't know why you are using field.tokenStream(analyzer) for the field value - that is clearly wrong and the cause of the error. I think your confusion more has to do with differences in coding to Lucene versus Solr; this being an actual spatial concern. You referenced SpatialDemoUpdateProcessorFactory so I see you have looked at SolrSpatialSandbox on GitHub. That particular URP should get some warnings added to it in the code to suggest that you probably should do what it does. If you look at the solrconfig.xml that configures it, there is a warning as follows: !-- spatial Only needed for an SpatialDemoUpdateProcessorFactory which copies spatial objects from one field to other spatial fields in object form to avoid redundant/inefficient string to spatial object de-serialization. -- Even if you have a similar circumstance, you're code doesn't quite look like this URP. You shouldn't need to reference the SpatialStrategy, for example. ~ David From: Beale, Jim (US-KOP) jim.be...@hibu.commailto:jim.be...@hibu.com Date: Friday, January 10, 2014 at 12:15 PM To: solr-user@lucene.apache.orgmailto:solr-user@lucene.apache.org solr-user@lucene.apache.orgmailto:solr-user@lucene.apache.org Cc: Smiley, David W. dsmi...@mitre.orgmailto:dsmi...@mitre.org Subject: Indexing spatial fields into SolrCloud (HTTP) I am porting an application from Lucene to Solr which makes use of spatial4j for distance searches. 
The Lucene version works correctly but I am having a problem getting the Solr version to work in the same way.

Lucene version:

    SpatialContext geoSpatialCtx = SpatialContext.GEO;
    geoSpatialStrategy = new RecursivePrefixTreeStrategy(new GeohashPrefixTree(
        geoSpatialCtx, GeohashPrefixTree.getMaxLevelsPossible()), DocumentFieldNames.LOCATION);

    Point point = geoSpatialCtx.makePoint(lon, lat);
    for (IndexableField field : geoSpatialStrategy.createIndexableFields(point)) {
        document.add(field);
    }
    // Store the field
    document.add(new StoredField(geoSpatialStrategy.getFieldName(), geoSpatialCtx.toString(point)));

Solr version:

    Point point = geoSpatialCtx.makePoint(lon, lat);
    for (IndexableField field : geoSpatialStrategy.createIndexableFields(point)) {
        try {
            solrDocument.addField(field.name(), field.tokenStream(analyzer));
        } catch (IOException e) {
            LOGGER.error("Failed to add geo field to Solr index", e);
        }
    }
    // Store the field
    solrDocument.addField(geoSpatialStrategy.getFieldName(), geoSpatialCtx.toString(point));

The server-side error is as follows: Caused by: com.spatial4j.core.exception.InvalidShapeException: Unable to read:
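[For comparison with the Lucene setup above: the pure-Solr equivalent of all this is normally just schema configuration. A typical Solr 4.x definition would look roughly like the following; the field and type names, and the distErrPct/maxDistErr values, are illustrative assumptions, not taken from this thread.]

    <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees"/>
    <field name="location" type="location_rpt" indexed="true" stored="true"/>

Documents then supply the value as a plain "lat,lon" string, exactly as in the SolrJ snippet above, and Solr builds the prefix-tree terms itself.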
filtering/faceting by a big list IDs
Hi all, I am running a Solr application and I need to implement a feature that requires faceting and filtering on a large list of IDs. The IDs are stored outside of Solr and are specific to the currently logged-on user. An example of this is the articles/tweets the user has read in the last few weeks. Note that the IDs here are the real document IDs and not Lucene internal docids. So the question is: what would be the best way to implement this in Solr? The list could be as large as tens of thousands of IDs. The obvious way of rewriting the Solr query to add the ID list as "facet.query" and "fq" doesn't seem to be the best way because: a) the query would be very long, and b) it would surely exceed the default limit of 1024 Boolean clauses, and I am sure the limit is there for a reason. I had a similar problem before, but back then I was using Lucene directly, and the way I solved it was to use a MultiTermQuery to retrieve the internal docids from the ID list and then apply the resulting DocSet to counting and filtering. It worked reasonably for lists of size ~10K, and with proper caching it was working OK. My current application is so invested in Solr that going back to Lucene is not an option anymore. All advice/suggestions are welcome. Thanks, Tri
Re: Solr4 performance
On 2/12/2014 12:07 PM, Greg Walters wrote:
Take a look at http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html as it's a pretty decent explanation of memory mapped files. I don't believe that the default configuration for solr is to use MMapDirectory but even if it does my understanding is that the entire file won't be forcibly cached by solr. The OS's filesystem cache should control what's actually in ram and the eviction process will depend on the OS.

I only have a little bit to add. Here's the first thing that Uwe's blog post (linked above) says: "Since version 3.1, Apache Lucene and Solr use MMapDirectory by default on 64bit Windows and Solaris systems; since version 3.3 also for 64bit Linux systems." The default in Solr 4.x is NRTCachingDirectory, which uses MMapDirectory by default under the hood. A summary of all this that should be relevant to the original question: it's the *operating system* that handles memory mapping, including any caching that happens. Assuming that you don't have a badly configured virtual machine setup, I'm fairly sure that only real memory gets used, never swap space on the disk. If something else on the system makes a memory allocation, the operating system will instantly give up memory used for caching and mapping. One of the strengths of mmap is that it can't exceed available resources unless it's used incorrectly. Thanks, Shawn
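[For reference, the stock Solr 4.x solrconfig.xml declares the directory factory roughly like this; NRTCachingDirectoryFactory delegates to MMapDirectory on 64-bit platforms, and as Shawn says it is the OS, not Solr, that decides what actually stays resident:]

    <directoryFactory name="DirectoryFactory"
                      class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>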
Re: filtering/faceting by a big list IDs
Tri, You will most likely need to implement a custom QParserPlugin to efficiently handle what you described. Inside of this QParserPlugin you could create the logic that would bring in your outside list of IDs and build a DocSet that could be applied to the fq and the facet.query. I haven't attempted to use a QParserPlugin with a facet.query, but in theory it should work. With the filter query you also have the option of implementing your Query as a PostFilter. PostFilter logic is applied at collect time, so the logic only needs to be applied to the documents that match the query. In many cases this can be faster, especially when result sets are relatively small but the index is large. Joel Bernstein Search Engineer at Heliosearch

On Wed, Feb 12, 2014 at 2:12 PM, Tri Cao tm...@me.com wrote: [snip]
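[As an illustration of what Joel describes, here is a minimal, untested sketch of such a plugin against the Solr/Lucene 4.x APIs. Everything specific here is an assumption for demonstration: the class and parser names, the "f" local param, and passing the IDs comma-separated in the query string - a real plugin would fetch them from the external per-user store instead.]

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.TermsFilter;
    import org.apache.lucene.search.Query;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.QParser;
    import org.apache.solr.search.QParserPlugin;
    import org.apache.solr.search.SolrConstantScoreQuery;
    import org.apache.solr.search.SyntaxError;

    public class UserIdsQParserPlugin extends QParserPlugin {
        @Override
        public void init(NamedList args) {}

        @Override
        public QParser createParser(String qstr, SolrParams localParams,
                                    SolrParams params, SolrQueryRequest req) {
            return new QParser(qstr, localParams, params, req) {
                @Override
                public Query parse() throws SyntaxError {
                    // Field holding the external document IDs (assumed default: "id").
                    String field = localParams.get("f", "id");
                    // Here the IDs arrive comma-separated for brevity; fetch them
                    // from the external user store in a real implementation.
                    List<Term> terms = new ArrayList<Term>();
                    for (String id : qstr.split(",")) {
                        terms.add(new Term(field, id.trim()));
                    }
                    // A single TermsFilter sidesteps the 1024 Boolean clause limit.
                    return new SolrConstantScoreQuery(new TermsFilter(terms));
                }
            };
        }
    }

It would be registered in solrconfig.xml with something like <queryParser name="userids" class="com.example.UserIdsQParserPlugin"/> and invoked as fq={!userids f=id}12,34,56. For very large lists, implementing PostFilter on top of this, as Joel notes, avoids materializing the full DocSet up front.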
Re: Indexing spatial fields into SolrCloud (HTTP)
That’s pretty weird. It appears that somehow a Spatial4j Point class is having its toString() called on it (which looks like Pt(x=-72.544123,y=41.85)) and then Spatial4j is trying to parse this, which isn’t a valid format — the toString is more for debug-ability. Your SolrJ code looks totally fine. Perhaps you’ve got some funky UpdateRequestProcessor from experimentation you’ve done that’s parsing then toString’ing it? And also, your stack trace should have more to it than what you present here. It may be on the server side versus the HTTP error page, which can get abbreviated. ~ David

From: Beale, Jim (US-KOP) jim.be...@hibu.com Date: Wednesday, February 12, 2014 at 2:05 PM To: Smiley, David W. dsmi...@mitre.org, solr-user@lucene.apache.org Subject: RE: Indexing spatial fields into SolrCloud (HTTP) [snip]
Re: Solr4 performance
And perhaps one other, but very pertinent, recommendation: allocate only as little heap as is necessary. By allocating more, you are working against the OS caching. Knowing how much is enough is a bit tricky, though. Best, roman

On Wed, Feb 12, 2014 at 2:56 PM, Shawn Heisey s...@elyograg.org wrote: [snip]
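[As a concrete illustration of Roman's point: on the 60GB machines from this thread, a start-up line along these lines leaves most of the RAM to the OS page cache. The 8GB figure is purely an assumption and has to be tuned against your real working set.]

    java -Xms8g -Xmx8g -jar start.jar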
Re: Searching phonetic by DoubleMetaphone soundex encoder
Navaa, you need query expansion for that. E.g. if your query goes through dismax, you need to add the two field names to the qf parameter. The nice thing is that qf can be:

    text^3.0 text.stemmed^2 text.phonetic^1

and thus exact matches are preferred to stemmed or phonetic matches. This is configured in solrconfig.xml. It's quite common to create your own query component to do more than just dismax for this. Hope it helps. paul

On 12 February 2014 at 14:22, Navaa navnath.thomb...@xtremumsolutions.com wrote:

Hi, I am using Solr for searching phonetically equivalent strings. My schema contains...

    <fieldType name="text_general_doubleMetaphone" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

and the fields are

    <field name="fname" type="text_general" indexed="true" stored="true" required="false"/>
    <field name="fname_sound" type="text_general_doubleMetaphone" indexed="true" stored="true" required="false"/>
    <copyField source="fname" dest="fname_sound"/>

It works when I search stfn === stephen, stephn. But I am also expecting stephn = stephen. How will I get this result? Am I doing something wrong? Thanks in advance

-- View this message in context: http://lucene.472066.n3.nabble.com/Searching-phonetic-by-DoubleMetaphone-soundex-encoder-tp4116885.html Sent from the Solr - User mailing list archive at Nabble.com.
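[A minimal sketch of what Paul describes, in solrconfig.xml terms, using the field names from Navaa's schema above; the handler name and the boost weights are assumptions to be tuned:]

    <requestHandler name="/phonetic" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">fname^3.0 fname_sound^1.0</str>
      </lst>
    </requestHandler>

With both fields in qf, a query like stephn matches fname_sound phonetically while an exact spelling still scores higher via fname.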
Re: Optimize Index in solr 4.6
On 2/6/2014 4:00 AM, Shawn Heisey wrote: I would not recommend it, but if you know for sure that your infrastructure can handle it, then you should be able to optimize them all at once by sending parallel optimize requests with distrib=false directly to the Solr cores that hold the shard replicas, not the collection. Followup on this thread: Evidence now suggests (thank you, Yago!) that sending an optimize request with distrib=false might *NOT* optimize just the core that receives the request. I can confirm that this is the case on a SolrCloud 4.2.1 setup with one shard and replicationFactor=2. It optimized that core, then when that was finished, optimized the other replica. I would have already filed an issue in Jira, except that I do not currently have any way to test this on 4.6.1, so I do not know if this is still the way it works. Also, I do not have a distributed SolrCloud index available. I will be looking into writing a unit test, but my grasp of SolrCloud tests is very weak. Thanks, Shawn
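[For reference, the kind of per-core request Shawn describes looks like this; the host and core names are hypothetical. Per the finding above, the distrib=false flag may not actually be honored for optimize on the affected versions, so verify the behavior on your own cluster first:]

    curl "http://host1:8983/solr/collection1_shard1_replica1/update?optimize=true&distrib=false"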
RE: Indexing spatial fields into SolrCloud (HTTP)
Hi David, You wrote: "Perhaps you’ve got some funky UpdateRequestProcessor from experimentation you’ve done that’s parsing then toString’ing it?" No, nothing at all. The update processing is straight out-of-the-box Solr. "And also, your stack trace should have more to it than what you present here." I trimmed the stack trace because it seemed like TMI, but here it is for completeness' sake:

1518436 [qtp1529118084-24] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: com.spatial4j.core.exception.InvalidShapeException: Unable to read: Pt(x=-72.544123,y=41.85)
at org.apache.solr.schema.AbstractSpatialFieldType.parseShape(AbstractSpatialFieldType.java:144)
at org.apache.solr.schema.AbstractSpatialFieldType.createFields(AbstractSpatialFieldType.java:118)
at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:186)
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:257)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:199)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:504)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:640)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:396)
at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
…..

"Your SolrJ code looks totally fine." Finally, I changed the code to the following:

    private void addSpatialLcnFields(double lat, double lon, SolrInputDocument document) {
        Point point = geoSpatialCtx.makePoint(lon, lat);
        document.addField(geoSpatialStrategy.getFieldName(), geoSpatialCtx.toString(point));
    }

and at least it isn’t throwing exceptions now. I’m not sure what is going into the index yet; I’ll have to wait for it to finish. :( Does that code seem correct? I want to avoid the deprecated API, but so far I haven’t found any alternatives which work. Thanks, Jim Beale

From: Smiley, David W. [mailto:dsmi...@mitre.org] Sent: Wednesday, February 12, 2014 3:07 PM To: Beale, Jim (US-KOP); solr-user@lucene.apache.org Subject: Re: Indexing spatial fields into SolrCloud (HTTP) [snip]
Re: Indexing spatial fields into SolrCloud (HTTP)
Your new code should also work, and should be equivalent. The longer stack trace you have is of the wrapping SolrException, which wraps another exception — InvalidShapeException. You should also see the stack trace of InvalidShapeException, which should originate out of Spatial4j. ~ David

From: Beale, Jim (US-KOP) jim.be...@hibu.com Date: Wednesday, February 12, 2014 at 5:21 PM To: Smiley, David W. dsmi...@mitre.org, solr-user@lucene.apache.org Subject: RE: Indexing spatial fields into SolrCloud (HTTP) [snip]
Weird issue with q.op=AND
Hi, I'm facing a weird problem while using the q.op=AND condition. It looks like it gets into some conflict if I use multiple filtering conditions in appends. It works as long as I have one filtering condition in appends:

    <lst name="appends">
      <str name="fq">Source:"TestHelp"</str>
    </lst>

Now, the moment I add an additional parameter, search stops returning any result:

    <lst name="appends">
      <str name="fq">Source:"TestHelp" | Source:"TestHelp2"</str>
    </lst>

If I remove q.op=AND from the request handler, I get results back. Data is present for both Source values I'm using, so it's not a filtering issue. Even a blank query fails to return data. Here's my request handler:

    <requestHandler name="/testhandler" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <float name="tie">0.01</float>
        <str name="wt">velocity</str>
        <str name="v.template">browse</str>
        <str name="v.contentType">text/html;charset=UTF-8</str>
        <str name="v.layout">layout</str>
        <str name="v.channel">testhandler</str>
        <str name="defType">edismax</str>
        <str name="q.op">AND</str>
        <str name="q.alt">*:*</str>
        <str name="rows">15</str>
        <str name="fl">id,url,Source2,text</str>
        <str name="qf">text^1.5 title^2</str>
        <str name="bq">Source:TestHelp^3 Source:TestHelp2^0.85</str>
        <str name="bf">recip(ms(NOW/DAY,PublishDate),3.16e-11,1,1)^2.0</str>
        <str name="df">text</str>
        <!-- facets -->
        <str name="facet">on</str>
        <str name="facet.mincount">1</str>
        <str name="facet.limit">100</str>
        <str name="facet.field">language</str>
        <str name="facet.field">Source</str>
        <!-- Highlighting defaults -->
        <str name="hl">true</str>
        <str name="hl.fl">text title</str>
        <str name="f.text.hl.fragsize">250</str>
        <str name="f.text.hl.alternateField">ShortDesc</str>
        <!-- Spell check settings -->
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">default</str>
        <str name="spellcheck.collate">true</str>
        <str name="spellcheck.onlyMorePopular">false</str>
        <str name="spellcheck.extendedResults">false</str>
        <str name="spellcheck.count">1</str>
        <!-- Shard Tolerant -->
        <str name="shards.tolerant">true</str>
      </lst>
      <lst name="appends">
        <str name="fq">Source:"TestHelp" | Source2:"TestHelp2"</str>
      </lst>
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>

Not sure what's going wrong. I'm using a SolrCloud environment with 2 shards, each having a replica. Any pointers will be appreciated. Thanks, Shamik
Re: Weird issue with q.op=AND
On 2/12/2014 3:32 PM, Shamik Bandopadhyay wrote:
Hi, I'm facing a weird problem while using the q.op=AND condition. It looks like it gets into some conflict if I use multiple filtering conditions in appends. It works as long as I have one filtering condition in appends:
    <lst name="appends">
      <str name="fq">Source:"TestHelp"</str>
    </lst>
Now, the moment I add an additional parameter, search stops returning any result:
    <lst name="appends">
      <str name="fq">Source:"TestHelp" | Source:"TestHelp2"</str>
    </lst>
If I remove q.op=AND from the request handler, I get results back. Data is present for both Source values I'm using, so it's not a filtering issue. Even a blank query fails to return data.

I'm pretty sure that's not valid Solr query syntax for what you're trying to do. Try this instead, although with these specific examples I would leave the quotes out of the fq value:

    <lst name="appends">
      <str name="fq">Source:(TestHelp OR TestHelp2)</str>
    </lst>

It's pretty much accidental (a result of the query analysis chain) that it was working when you didn't have q.op=AND. You can verify what I'm saying by looking at the parsed query after adding debugQuery=true as a query option. Thanks, Shawn
Re: Weird issue with q.op=AND
Thanks a lot Shawn. Changing the appends filtering based on your suggestion worked. The part which confused me big time is the syntax I've been using so far without an issue (barring the q.op part):

    <lst name="appends">
      <str name="fq">Source:"TestHelp" | Source:"downloads" | -AccessMode:"internal" | -workflowparentid:[* TO *]</str>
    </lst>

This has been working as expected and applies the filter correctly. Just curious: if it's an invalid syntax, how is Solr handling this? -- View this message in context: http://lucene.472066.n3.nabble.com/Weird-issue-with-q-op-AND-tp4117013p4117022.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Weird issue with q.op=AND
On 2/12/2014 4:58 PM, shamik wrote:
Thanks a lot Shawn. Changing the appends filtering based on your suggestion worked. The part which confused me big time is the syntax I've been using so far without an issue (barring the q.op part):
    <lst name="appends">
      <str name="fq">Source:"TestHelp" | Source:"downloads" | -AccessMode:"internal" | -workflowparentid:[* TO *]</str>
    </lst>
This has been working as expected and applies the filter correctly. Just curious: if it's an invalid syntax, how is Solr handling this?

Honestly, I can't really say what's going on here. After I got this, I tried some example queries like that and they do seem to be parsed right. You could try turning on debugQuery for the query that doesn't work and see if you can spot what the problem is. I had never seen a query syntax with | in it before. The other syntax is a little more explicit, though. Thanks, Shawn
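[For anyone following along, the debug output Shawn mentions can be pulled with a request along these lines - the host, collection, and handler names follow the examples in this thread; the parsed filters appear under parsed_filter_queries in the debug section of the response:]

    curl "http://localhost:8983/solr/collection1/testhandler?q=*:*&debugQuery=true&wt=xml"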
change character correspondence in icu lib
Hello, I use icu4j-49.1.jar and lucene-analyzers-icu-4.6-SNAPSHOT.jar for one of the fields, in the form

    <filter class="solr.ICUFoldingFilterFactory"/>

I need to change the letter that one of the accented characters is folded to. I made changes to the file lucene/analysis/icu/src/data/utr30/DiacriticFolding.txt, recompiled Solr and Lucene, and replaced the above jars with the new ones, but there is no change in the indexing and parsing of keywords. Any ideas where the appropriate change must be made? Thanks. Alex.
Re: change character correspondence in icu lib
Not a direct answer, but the usual next question is: are you absolutely sure you are using the right jars? Try renaming them and restarting Solr. If it complains, you got the right ones. If not... Also, unzip those jars and see if your file made it all the way through the build pipeline. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Feb 13, 2014 at 8:12 AM, alx...@aim.com wrote: [snip]
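[Both checks Alexandre suggests can be scripted; the utr30.nrm name below is an assumption about the 4.x build, which compiles the utr30 text rules into a binary data file inside the analyzers-icu jar:]

    # Rename the jar; if Solr still starts and analyzes text, it was loading another copy:
    mv lucene-analyzers-icu-4.6-SNAPSHOT.jar lucene-analyzers-icu-4.6-SNAPSHOT.jar.bak

    # Check that the rebuilt jar actually contains freshly compiled folding data:
    unzip -l lucene-analyzers-icu-4.6-SNAPSHOT.jar | grep -i utr30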
Re: Weird issue with q.op=AND
Thanks, I'll take a look at the debug data. -- View this message in context: http://lucene.472066.n3.nabble.com/Weird-issue-with-q-op-AND-tp4117013p4117047.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing strategies?
Hi Erick, Thank you very much, those are valuable suggestions :-). I will give them a try. Appreciate your time. -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-strategies-tp4116852p4117050.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Weird issue with q.op=AND
Did you mean to use || for the OR operator? A single | is not treated as an operator - it will be treated as a term and sent through normal term analysis. -- Jack Krupansky

-----Original Message----- From: Shamik Bandopadhyay Sent: Wednesday, February 12, 2014 5:32 PM To: solr-user@lucene.apache.org Subject: Weird issue with q.op=AND [snip]
Limit amount of search result
Dear all gurus, I would like to limit the number of search results per shop. Let's say I have many shops selling shirts. When I search for "white shirt", I want a maximum number of results per shop (e.g. 5). The result should look like this...
- Shop A
- Shop A
- Shop B
- Shop B
- Shop B
- Shop B
- Shop B
- Shop C
- Shop C
- Shop C
Any suggestion would be much appreciated. Thank you very much, Chun. -- View this message in context: http://lucene.472066.n3.nabble.com/Limit-amount-of-search-result-tp4117062.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Limit amount of search result
Chun, Have you looked at the Grouping / Field Collapsing feature in Solr? https://wiki.apache.org/solr/FieldCollapsing If shop is one of your fields, you can use field collapsing on that field with a maximum of n results to return per field value (or group). Sameer. -- www.measuredsearch.com tw: measuredsearch

On Wednesday, February 12, 2014, rachun rachun.c...@gmail.com wrote: [snip]

-- Sameer Maggon Founder, Measured Search m: 310.344.7266 tw: @measuredsearch w: http://www.measuredsearch.com
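[As a concrete example of the request Sameer describes, assuming the shop name is indexed in a field called shop - adjust the host, collection, and field names to your setup:]

    http://localhost:8983/solr/collection1/select?q=white+shirt&group=true&group.field=shop&group.limit=5

Adding group.main=true flattens the grouped response back into an ordinary result list, which matches the layout in the original question.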
Re: Join Scoring
Re-posting... Thanks, Anand

On 2/12/2014 10:55 AM, anand chandak wrote:
Thanks David, really helpful response. You mentioned that if we have to add scoring support in Solr, then a possible approach would be to add a custom QueryParser, which might use Lucene's JOIN module. I have tried this approach and it makes things slow, because I believe it is doing more searches. Curious: would it be possible instead to enhance Solr's existing JoinQParserPlugin and add the scoring support in the same class? Do you think that is feasible and recommended? If yes, what would it take (high level) in terms of code changes - any pointers? Thanks, Anand

On 2/12/2014 10:31 AM, David Smiley (@MITRE.org) wrote:
Hi Anand. Solr's JOIN query, {!join}, constant-scores. It's simpler, faster, and more memory efficient (particularly the worst-case memory use) to implement the JOIN query without scoring, so that's why. Of course, you might want it to score and pay whatever penalty is involved. For that you'll need to write a Solr QueryParser that might use Lucene's join module, which has scoring variants. I've taken this approach before. You asked a specific question about the purpose of JoinScorer when it doesn't actually score. Lucene's Query produces a Weight, which in turn produces a Scorer that is a DocIdSetIterator plus it returns a score. So Queries have to have a Scorer to match any document, even if the score is always 1. Solr does indeed have a lot of caching; that may be in play here when comparing against a quick attempt at using Lucene directly. In particular, the matching documents are likely to end up in Solr's DocumentCache. Returning the stored fields that come back in search results is one of the more expensive things Lucene/Solr does. I also think you noted that the fields on documents from the "from" side of the query are not available to be returned in search results, just the "to" side. Yup; that's true. To remedy this, you might write a Solr SearchComponent that adds fields from the "from" side. That could be tricky to do; it would probably need to re-run the from-side query, but filtered to the matching top-N documents being returned. ~ David

anand chandak wrote:
Resending, if somebody can please respond. Thanks, Anand

On 2/5/2014 6:26 PM, anand chandak wrote:
Hi, Having a question on join scoring: why doesn't the Solr join query return scores? Looking at the code, I see there's a JoinScorer defined in the JoinQParserPlugin class. If it's not used for scoring, where is it actually used? Also, to evaluate the performance of the Solr join plugin vs Lucene's JoinUtil, I ran the same join query against the same data set and schema, and in the results I am always seeing the QTime for Solr much lower than Lucene's. What is the reason behind this? Solr doesn't return scores - could that cause so much of a difference? My guess is Solr has a very sophisticated caching mechanism and that might be coming into play; is that true? Or is there a difference in the way the JOIN happens in the two approaches? If I understand correctly, both implementations use a two-pass approach: first collect all the terms from the fromField, then return all documents that have matching terms in the toField. If somebody can throw some light on this, I would highly appreciate it. Thanks, Anand

- Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Join-Scoring-tp4115539p4116818.html Sent from the Solr - User mailing list archive at Nabble.com.
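[For readers landing on this thread, the constant-scoring join under discussion is invoked like this; the field names below are placeholders for whatever links your two document types, not anything from this schema:]

    q={!join from=parent_id to=id}type:comment

This runs type:comment, collects the parent_id values of the matching documents, and returns the documents whose id matches - which is why, as David notes, only fields from the "to" side are available in the results.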
Re: Searching phonetic by DoubleMetaphone soundex encoder
Navaa, You need the query to be sent to the two fields. In dismax, this is easy. Paul

On 12 February 2014 at 14:22:33 CET, Navaa navnath.thomb...@xtremumsolutions.com wrote: [snip]

-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
APACHE SOLR: Pass a file as query parameter and then parse each line to form a criteria
Hi, I am new to Solr and I need help with the following. PROBLEM: I have a huge file of values, one per line, and I want to use it as an inclusion or exclusion list in the query, i.e. with each line ORed together (line1 OR line2 OR ...). How can this be achieved in Solr? Is there a custom implementation that I would need to write? Also, would it help to implement a custom filter? Thank You, Rajeev Nadgauda. -- View this message in context: http://lucene.472066.n3.nabble.com/APACHE-SOLR-Pass-a-file-as-query-parameter-and-then-parse-each-line-to-form-a-criteria-tp4117066.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr delta indexing approach
Hi, I am working on a prototype where I have a content source, I am indexing all its documents, and I store the index in Solr. Now I have a pre-condition that my content source is ever-changing, meaning there is always new content added to it. I have read that Solr re-indexes the full source every time it is asked to index, but this may lead to under-utilization of resources, as the same documents get re-indexed again and again. Is there any approach to handle such scenarios? E.g. I have 10,000 documents in my source which have been indexed by Solr till today, but the next day my source has 11,000 documents. I want to index only the new 1,000 documents, not all 11,000. Can anybody suggest something for this? Thanks in advance. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-delta-indexing-approach-tp4117068.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr delta indexing approach
You have read that Solr needs to reindex a full source. That's correct (unless you use atomic updates). But - and this is the important point - it is per document. So, once you have indexed your 10,000 documents, you don't need to worry about them until they change. Just go ahead and index only your additional documents. I am assuming your source system can figure out which ones are new (timestamp, etc). Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Feb 13, 2014 at 12:15 PM, lalitjangra lalit.j.jan...@gmail.com wrote: [snip]
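[A minimal SolrJ 4.x sketch of the pattern Alexandre describes; the core name is an assumption, and the caller is responsible for selecting only the documents the source reports as new or changed since the last run (e.g. by modification timestamp):]

    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DeltaPush {
        // Send only the new/changed documents; any document sharing a uniqueKey
        // with an existing Solr document simply overwrites the old version.
        public static void pushDelta(List<SolrInputDocument> changedDocs) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            solr.add(changedDocs);
            solr.commit();   // new and changed docs become searchable; untouched docs were never resent
            solr.shutdown();
        }
    }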
Re: Searching phonetic by DoubleMetaphone soundex encoder
hi, Thanks for your reply. I'm a beginner with Solr; kindly elaborate in more detail, because my solrconfig.xml contains:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">5</int>
        <str name="df">name</str>
      </lst>
    </requestHandler>
    <requestHandler name="standard" class="solr.StandardRequestHandler" default="true"/>
    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler"/>
    <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers"/>
    <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler"/>
    <requestHandler name="/get" class="solr.RealTimeGetHandler">
      <lst name="defaults">
        <str name="omitHeader">true</str>
        <str name="wt">json</str>
        <str name="indent">true</str>
      </lst>
    </requestHandler>

Where can I add this qf parameter for those two fields? Hope you will understand the scenario. Thanks in advance -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-phonetic-by-DoubleMetaphone-soundex-encoder-tp4116885p4117073.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr delta indexing approach
Thanks Alex, Yes, my source system maintains the creation and last-modification time of each document. As per your inputs, can I assume that the next time Solr starts indexing, it scans all the documents present in the source but only picks for indexing those which are either new or have been updated since the last successful indexing? How does Solr do this - in short, what is Solr's strategy for indexing? I would definitely like to know more about it; if you can share your thoughts on the same, it would be great. Regards, Lalit. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-delta-indexing-approach-tp4117068p4117077.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr delta indexing approach
I had this problem when I started to look at Solr as an index for a file server. What I ended up doing was writing a Perl script that did this:
- Scan the whole filesystem and create an XML file that is submitted to Solr for indexing. As this might be some 600,000 files, I break it down into chunks of N files (N = 200 currently).
- At the end of a successful scan, write out the time it started to a file.
- Next time you run the script, the script looks for the start time file. It reads that in and checks every file in the system:
=> If it has a mod_time greater than the begin_time, it has changed since we last updated it, so reindex it.
=> If it doesn't, just update the last_seen timestamp in Solr (a field we created) so we know it's still there.
We're doing that and it's indexing just fine.

-----Original Message----- From: lalitjangra [mailto:lalit.j.jan...@gmail.com] Sent: Thursday, 13 February 2014 4:45 PM To: solr-user@lucene.apache.org Subject: Re: Solr delta indexing approach [snip]
Re: Solr delta indexing approach
I'd start by doing the Solr tutorial. It will explain a lot of things. But in summary, you can send data to Solr (best option) or you can pull it using DataImportHandler. Take your pick, do the tutorial, maybe read some books. Then come back with specific questions about where you're stuck. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Feb 13, 2014 at 12:45 PM, lalitjangra lalit.j.jan...@gmail.com wrote: [snip]
Re: Solr delta indexing approach
Why write a Perl script for that?

    touch new_timestamp
    find . -newer timestamp | script-to-submit
    mv new_timestamp timestamp

Neither approach deals with deleted files. To do this correctly, you need lists of all the files in the index with their timestamps, and of all the files in the repository. Then you need to difference them to find the deleted ones, the new ones, and the ones that have changed. You might even want to track links and symlinks to get dupes and canonical paths. wunder

On Feb 12, 2014, at 10:00 PM, Sadler, Anthony anthony.sad...@yrgrp.com wrote: [snip]

-- Walter Underwood wun...@wunderwood.org
Re: Solr delta indexing approach
Thanks all. I am following a couple of articles on the same. I am sending data to Solr instead of using DIH and am able to successfully index data in Solr. My concern is how to ensure that, out of all the data items, only new or updated data is indexed each time. Is this something available out of the box in Solr, or do we need to build it? Regards. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-delta-indexing-approach-tp4117068p4117087.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Integrating Oauth2 with Solr MailEntityProcessor
Hi again, Anybody interested in this feature for the Solr MailEntityProcessor? WDYT? Thanks, Dileepa

On Thu, Jan 30, 2014 at 11:00 AM, Dileepa Jayakody dileepajayak...@gmail.com wrote:
Hi All, I think OAuth2 integration is a valid use case for Solr when it comes to importing data from user accounts like email, social networks, enterprise stores, etc. Do you think OAuth2 integration in Solr would be a useful feature? If so, I would like to start working on this. I feel this could also be a good project for GSoC 2014. Thanks, Dileepa

On Wed, Jan 29, 2014 at 3:57 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:
Hi All, I'm doing a research project on Email Reputation Analysis, and for this project I'm planning to use the Apache Solr, Tika, and Mahout projects to analyse, store, and query the reputation of emails and correspondents. For indexing emails in Solr I'm going to use the MailEntityProcessor [1]. But I see that it requires the user to provide their email credentials to the DIH, which is a security risk. Also, I feel the current MailEntityProcessor doesn't allow importing data from multiple mailboxes. What do you think of integrating an authorization mechanism like OAuth2 in Solr? I'd appreciate your ideas on using this for indexing multiple mailboxes without requiring users to give their usernames and passwords.

    <document>
      <entity processor="MailEntityProcessor" user="someb...@gmail.com" password="something"
              host="imap.gmail.com" protocol="imaps" folders="x,y,z"/>
    </document>

Regards, Dileepa [1] http://wiki.apache.org/solr/MailEntityProcessor
RE: Solr delta indexing approach
At the risk of derailing the thread: we do a lot more in the script than is mentioned here. We pull out parts of the path and mangle them (for example, turn them into a UNC path for users to use, or pull out a client name or job number using a known folder structure). As for deleted files, here's how the script works in totality:
- The script runs the first time, finds every file, and puts it into the formerly empty Solr DB. For every file found, it sets date_last_seen = current_time. It writes out the last_begin_time file and ends.
- A secondary function of the script runs sometime after, looking for any file with date_last_seen < last_begin_time. Nothing is found this time around.
- The script runs the next time, sees that there is a last_begin_time file, and reads in that time. The script then runs in a delta mode, looking for all files modified later than last_begin_time. If it finds them, it re-indexes them and their contents. All other files, whose mod_time is less than last_begin_time, merely have their date_last_seen updated. The script ends and writes out the last_begin_time file. At this point, any files that were deleted between the first and second run were not updated, so their date_last_seen is different from all the others. This gives me something to look for.
- The secondary function of the script runs sometime after, looking for any file with date_last_seen < last_begin_time. This time around, some files are found. These files have their isDeleted field in Solr set to 1.
Hopefully that makes a bit more sense.

-----Original Message----- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Thursday, 13 February 2014 5:26 PM To: solr-user@lucene.apache.org Subject: Re: Solr delta indexing approach [snip]