Re: DIH import from MySQL results in garbage text for special chars
Here is the output of SHOW VARIABLES. I have also compared the hex values, and they differ between MySQL and Solr.

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

*Pranav Prakash* temet nosce

On Wed, Sep 26, 2012 at 6:45 PM, Gora Mohanty g...@mimirtech.com wrote:

On 21 September 2012 11:19, Pranav Prakash pra...@gmail.com wrote: I am seeing the garbage text in the browser, in the Luke Index Toolbox, and everywhere else. My servlet container is Jetty, the one that ships out of the box. Many other special characters are indexed and stored properly; only a few characters cause trouble.

Could you double-check the encoding on the MySQL side? What is the output of

mysql> SHOW VARIABLES LIKE 'character\_set\_%';

Regards, Gora
Re: DIH import from MySQL results in garbage text for special chars
I looked at the hex codes of the text. The hex code in MySQL differs from the one stored in the index, and the one in the index is longer. This leads me to believe that something in between is corrupting the data.

*Pranav Prakash* temet nosce

On Fri, Sep 21, 2012 at 11:19 AM, Pranav Prakash pra...@gmail.com wrote: I am seeing the garbage text in the browser, in the Luke Index Toolbox, and everywhere else. My servlet container is Jetty, the one that ships out of the box. Many other special characters are indexed and stored properly; only a few characters cause trouble.

*Pranav Prakash* temet nosce

On Fri, Sep 14, 2012 at 6:36 PM, Erick Erickson erickerick...@gmail.com wrote: Is your _browser_ set to handle the appropriate character set? Or whatever you're using to inspect your data? How about your servlet container? Best Erick

On Mon, Sep 10, 2012 at 7:47 AM, Pranav Prakash pra...@gmail.com wrote: Hi Folks, I am attempting to import documents into Solr from MySQL using DIH. One of the fields contains the text “Future of Mobile Value Added Services (VAS) in Australia”. Notice the characters “ and ”. On import, it gets stored as “Future of Mobile Value Added Services (VAS) in Australiaâ€�. The datasource config explicitly specifies UTF-8, as follows:

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/ohapp_devel" user="username" useUnicode="true" characterEncoding="UTF-8" password="password" zeroDateTimeBehavior="convertToNull" name="app" />

A plain SQL SELECT statement on the MySQL console returns the correct text. I even tried a scriptTransformer to get rid of these characters, but it did not help in my case.
// Replace every match of `pattern` in `source` with `replacement`.
function gsub(source, pattern, replacement) {
  var match, result;
  if (!((pattern != null) && (replacement != null))) {
    return source;
  }
  result = '';
  while (source.length > 0) {
    if ((match = source.match(pattern))) {
      result += source.slice(0, match.index);
      result += replacement;
      source = source.slice(match.index + match[0].length);
    } else {
      result += source;
      source = '';
    }
  }
  return result;
}

// Map the byte sequences of curly quotes, dashes and ellipses (and their
// double-encoded forms) to plain ASCII.
function fixQuotes(c) {
  c = gsub(c, /\342\200(?:\234|\235)/, '"');
  c = gsub(c, /\342\200(?:\230|\231)/, "'");
  c = gsub(c, /\342\200\223/, "-");
  c = gsub(c, /\342\200\246/, "...");
  c = gsub(c, /\303\242\342\202\254\342\204\242/, "'");
  c = gsub(c, /\303\242\342\202\254\302\235/, '"');
  c = gsub(c, /\303\242\342\202\254\305\223/, '"');
  c = gsub(c, /\303\242\342\202\254/, '-');
  c = gsub(c, /\342\202\254\313\234/, '"');
  c = gsub(c, /“/, '"');
  return c;
}

// DIH script transformer entry point: clean the listed fields of each row.
function cleanFields(row) {
  var fieldsToClean = ['title', 'description'];
  for (var i = 0; i < fieldsToClean.length; i++) {
    var old_text = String(row.get(fieldsToClean[i]));
    row.put(fieldsToClean[i], fixQuotes(old_text));
  }
  return row;
}

My understanding is that this must be a very common problem. It also occurs with people's names that contain these characters. What is the appropriate way to get the correct text indexed and searchable?
The field type in which this is stored is as follows:

<fieldType name="text_commongrams" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_en.txt" ignoreCase="true" />
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt" ignoreCase="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1" />
  </analyzer>
</fieldType>

*Pranav Prakash* temet nosce
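One possible explanation for the longer hex value in the index is double encoding: UTF-8 bytes being read as if they were single-byte latin1 text and then encoded to UTF-8 again. The following is a minimal Python sketch of that effect, not a diagnosis of this particular setup. It assumes that MySQL's latin1 behaves like Windows-1252 (`cp1252`), and the function names `garble` and `repair` are made up for the illustration.

```python
# Illustration only: reproduces the kind of garbling reported in this thread.
# Assumption: UTF-8 bytes are being decoded as Windows-1252 ("latin1" in MySQL).


def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as cp1252."""
    # Bytes with no cp1252 mapping (such as 0x9D) become U+FFFD.
    return text.encode("utf-8").decode("cp1252", errors="replace")


def repair(text: str) -> str:
    """Undo the mistake; works only if no byte was lost to U+FFFD."""
    return text.encode("cp1252").decode("utf-8")


left, right = "\u201c", "\u201d"  # the curly quotes from the title

print(ascii(garble(left)))   # three characters instead of one
print(ascii(garble(right)))  # the last byte has no mapping and is replaced
print(len(left.encode("utf-8")), len(garble(left).encode("utf-8")))
print(repair(garble(left)) == left)
```

The opening quote can be recovered with `repair`, but the closing quote cannot once its last byte has been replaced, which is why cleaning up after the import is less reliable than making the connection encoding match the table encoding.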
Re: DIH import from MySQL results in garbage text for special chars
I am seeing the garbage text in the browser, in the Luke Index Toolbox, and everywhere else. My servlet container is Jetty, the one that ships out of the box. Many other special characters are indexed and stored properly; only a few characters cause trouble.

*Pranav Prakash* temet nosce

On Fri, Sep 14, 2012 at 6:36 PM, Erick Erickson erickerick...@gmail.com wrote: Is your _browser_ set to handle the appropriate character set? Or whatever you're using to inspect your data? How about your servlet container? Best Erick

On Mon, Sep 10, 2012 at 7:47 AM, Pranav Prakash pra...@gmail.com wrote: Hi Folks, I am attempting to import documents into Solr from MySQL using DIH. One of the fields contains the text “Future of Mobile Value Added Services (VAS) in Australia”. Notice the characters “ and ”. On import, it gets stored as “Future of Mobile Value Added Services (VAS) in Australiaâ€�. The datasource config explicitly specifies UTF-8, as follows:

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/ohapp_devel" user="username" useUnicode="true" characterEncoding="UTF-8" password="password" zeroDateTimeBehavior="convertToNull" name="app" />

A plain SQL SELECT statement on the MySQL console returns the correct text. I even tried a scriptTransformer to get rid of these characters, but it did not help in my case.
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
I am experiencing a similar encoding problem. In my case, characters such as " (double quote) are also garbled. I believe this is because the encoding of my MySQL table is latin1 while UTF-8 is specified in the JDBC configuration. Is there a way to specify the latin1 charset in JDBC? That would probably resolve this.

*Pranav Prakash* temet nosce

On Sat, Sep 8, 2012 at 3:16 AM, Shawn Heisey s...@elyograg.org wrote:

On 9/6/2012 6:54 PM, kiran chitturi wrote: The error I am getting is 'org.apache.solr.common.SolrException: Invalid Date String: '1345743552'. I think it was being saved as a string in the DB, so I will use the DateFormatTransformer.

To go along with all the other replies that you have gotten: I import from MySQL with a unix format date field. It's a bigint, not a string, but a quick test on MySQL 5.1 shows that the function works with strings too. This is how my SELECT handles that field. I have MySQL convert it before it gets to Solr:

from_unixtime(`d`.`post_date`) AS `pd`

When it comes to the character set issues, this is how I have defined the driver in the dataimport config. The character set in the database is utf8.

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" encoding="UTF-8" url="jdbc:mysql://${dataimporter.request.dbHost}:3306/${dataimporter.request.dbSchema}?zeroDateTimeBehavior=convertToNull" batchSize="-1" user="removed" password="removed"/>

Thanks, Shawn
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
The character is actually “ and not ".

*Pranav Prakash* temet nosce

On Mon, Sep 10, 2012 at 2:45 PM, Pranav Prakash pra...@gmail.com wrote: I am experiencing a similar encoding problem. In my case, characters such as " (double quote) are also garbled. I believe this is because the encoding of my MySQL table is latin1 while UTF-8 is specified in the JDBC configuration. Is there a way to specify the latin1 charset in JDBC? That would probably resolve this.

*Pranav Prakash* temet nosce

On Sat, Sep 8, 2012 at 3:16 AM, Shawn Heisey s...@elyograg.org wrote:

On 9/6/2012 6:54 PM, kiran chitturi wrote: The error I am getting is 'org.apache.solr.common.SolrException: Invalid Date String: '1345743552'. I think it was being saved as a string in the DB, so I will use the DateFormatTransformer.

To go along with all the other replies that you have gotten: I import from MySQL with a unix format date field. It's a bigint, not a string, but a quick test on MySQL 5.1 shows that the function works with strings too. This is how my SELECT handles that field. I have MySQL convert it before it gets to Solr:

from_unixtime(`d`.`post_date`) AS `pd`

When it comes to the character set issues, this is how I have defined the driver in the dataimport config. The character set in the database is utf8.

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" encoding="UTF-8" url="jdbc:mysql://${dataimporter.request.dbHost}:3306/${dataimporter.request.dbSchema}?zeroDateTimeBehavior=convertToNull" batchSize="-1" user="removed" password="removed"/>

Thanks, Shawn
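For readers who want to check what the two date inputs named in this thread should become, here is a small Python sketch. It is only an illustration of the conversion (the thread itself does it in SQL with from_unixtime or in DIH with a transformer); the helper names are made up, and the target layout is the ISO 8601 UTC form that Solr date fields expect.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# ISO 8601 in UTC with a trailing Z, the form Solr date fields accept.
SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def solr_date_from_unix(value) -> str:
    """Convert a unix timestamp (int or numeric string) to a Solr date string."""
    moment = datetime.fromtimestamp(int(value), tz=timezone.utc)
    return moment.strftime(SOLR_FORMAT)


def solr_date_from_rfc2822(value: str) -> str:
    """Convert a date such as 'Thu, 06 Sep 2012 22:32:33 +0000'."""
    moment = parsedate_to_datetime(value)
    return moment.astimezone(timezone.utc).strftime(SOLR_FORMAT)


print(solr_date_from_unix("1345743552"))
print(solr_date_from_rfc2822("Thu, 06 Sep 2012 22:32:33 +0000"))
```

The first call uses the value from the quoted error message and yields 2012-08-23T17:39:12Z; the second uses the date from the subject line and yields 2012-09-06T22:32:33Z.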
Exact match on few fields, fuzzy on others
Hi Folks, I am using Solr 3.4 and my document schema has the attributes title, transcript and author_name. At present, I use DisMax to search for a user query across transcript. I would also like to do an exact search on author_name, so that for the query Albert Einstein I get all the documents that contain Albert or Einstein in transcript, and also those documents whose author_name is exactly 'Albert Einstein'. Can we do this with the dismax query parser? The schema for both fields is below:

<fieldType name="text_commongrams" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_en.txt" ignoreCase="true" />
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt" ignoreCase="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1" />
  </analyzer>
</fieldType>

<fieldType name="text_standard" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt" ignoreCase="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

<field name="title" type="text_commongrams" indexed="true" stored="true" multiValued="false" />
<field name="author_name" type="text_standard" indexed="true" stored="false" />

-- *Pranav Prakash* temet nosce
Re: DIH XML configs for multi environment
Jerry, glad it worked for you. I will do the same thing. This seems easier for me, as I have a Solr start shell script that sets the JVM params for master/slave, Xmx and so on according to the environment. Setting a JDBC connect URL in the start script is more convenient than changing the configs.

*Pranav Prakash* temet nosce

On Tue, Jul 24, 2012 at 1:17 AM, jerry.min...@gmail.com jerry.min...@gmail.com wrote: Pranav, Sorry, I should have checked my response a little better, as I misspelled your name and mentioned that I tried what Markus suggested, then described something totally different. I didn't try using the property mechanism as Markus suggested, as I am not using a solr.xml file. What you mentioned in your post on Wed, Jul 18, 2012 at 3:46 PM will work, as I have done it successfully. That is, I created a JVM variable to contain the connect URLs for each of my environments and use one of those to set the URL parameter of the dataSource entity in my data config files. Best, Jerry

On Mon, Jul 23, 2012 at 3:34 PM, jerry.min...@gmail.com jerry.min...@gmail.com wrote: Pranay, I tried two similar approaches to resolve this in my system, which is Solr 4.0 running in Tomcat 7.x on Ubuntu 9.10. My preference was to use an alias for each of my database environments as a JVM parameter, because it makes more sense to me that the database connection be stored in the data config file rather than in a Tomcat configuration or startup file. Because of this preference, I first attempted the following:

1. Set a JVM environment variable 'solr.dbEnv' to represent the database environment that should be accessed. For example, in my dev environment, the JVM environment variable was set as -Dsolr.dbEnv=dev.
2. In the data config file I had 3 data sources. Each data source had a name that matched one of the database environment aliases.
3. In the entity of my data config file, the dataSource parameter was set as follows: dataSource=${solr.dbEnv}.

Unfortunately, this fails to work. Setting the dataSource parameter in the data config file does not override the default. The default appears to be the first data source defined in the data config file.

Second, I tried what Markus suggested. That is, I created a JVM variable to contain the connect URLs for each of my environments. I use that variable to set the URL parameter of the dataSource entity in the data config file. This works well. Best, Jerry Mindek

Unfortunately, the first option did not work. It seemed as though

On Wed, Jul 18, 2012 at 3:46 PM, Pranav Prakash pra...@gmail.com wrote: That approach would work for core-dependent parameters. In my case, the params are environment dependent. I think a simpler approach would be to pass the url param as JVM options and have these XMLs read it from there. I haven't tried it yet.

*Pranav Prakash* temet nosce

On Tue, Jul 17, 2012 at 5:09 PM, Markus Klose m...@shi-gmbh.com wrote: Hi, there is one more approach, using the property mechanism. You could specify the datasource like this:

<dataSource name="database" driver="${sqlDriver}" url="${sqlURL}"/>

And you can specify the properties in the solr.xml in your core configuration like this:

<core instanceDir="core1" name="core1"> <property name="sqlURL" value="jdbc:hsqldb:/temp/example/ex"/> </core>

Best regards from Augsburg, Markus Klose, SHI Elektronische Medien GmbH. Address: Curt-Frenzel-Str. 12, 86167 Augsburg. Tel.: 0821 7482633 26. Tel.: 0821 7482633 0 (switchboard). Mobile: 0176 56516869. Fax: 0821 7482633 29. E-Mail: markus.kl...@shi-gmbh.com. Internet: http://www.shi-gmbh.com. Register court: Augsburg HRB 17382. Managing director: Peter Spiske. VAT ID: DE 182167335

-Original Message- From: Rahul Warawdekar [mailto:rahul.warawde...@gmail.com] Sent: Wednesday, 11 July 2012 11:21 To: solr-user@lucene.apache.org Subject: Re: DIH XML configs for multi environment

http://wiki.eclipse.org/Jetty/Howto/Configure_JNDI_Datasource
http://docs.codehaus.org/display/JETTY/DataSource+Examples

On Wed, Jul 11, 2012 at 2:30 PM, Pranav Prakash pra...@gmail.com wrote: That's cool. Is there something similar for Jetty as well? We use Jetty!

*Pranav Prakash* temet nosce

On Wed, Jul 11, 2012 at 1:49 PM, Rahul Warawdekar rahul.warawde...@gmail.com wrote: Hi Pranav, If you are using Tomcat to host Solr, you can define your data source in the context.xml file under the Tomcat configuration. You have to refer to this datasource by the same name in all 3 environments from the DIH data-config.xml. The context.xml file will vary across the 3 environments, with different credentials for dev, stag and prod. E.g. the DIH data-config.xml will refer to the datasource as listed below: dataSource jndiName=java:comp/env
Re: can solr admin tab statistics be customized... how can this be achived.
You can check out the Solr source code, patch the admin JSP files, and run the result as your custom Solr instance.

*Pranav Prakash* temet nosce

On Fri, Jul 20, 2012 at 12:14 PM, yayati yayatirajpa...@gmail.com wrote: Hi, I want to compute my own stats in addition to the Solr default stats. How can I enhance the statistics in Solr? Solr computes stats cumulatively; is there any way to get per-instant stats? Thanks, waiting for good replies.
How To apply transformation in DIH for multivalued numeric field?
I have a multivalued integer field and a multivalued string field defined in my schema as

<field name="community_tag_ids" type="integer" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="community_tags" type="text" indexed="true" termVectors="true" stored="true" multiValued="true" omitNorms="true" />

The DIH entity and field definitions for them are as follows:

<entity name="document" dataSource="app" onError="skip" transformer="RegexTransformer" query="...">
  <entity name="community_tags" transformer="RegexTransformer" query="SELECT group_concat(a.id SEPARATOR ',') AS community_tag_ids, group_concat(a.title SEPARATOR ',') AS community_tags FROM tags a JOIN tag_dets b ON a.id = b.tag_id WHERE b.doc_id = ${document.id}">
    <field column="community_tag_ids" name="community_tag_ids"/>
    <field column="community_tags" splitBy="," />
  </entity>
</entity>

The value of the field community_tags comes through correctly as an array of strings. However, the value of the field community_tag_ids is not correct:

<arr name="community_tag_ids"> <int>[B@390c0a18</int> </arr>

I tried chaining NumberFormatTransformer with formatStyle=number, but that throws DataImportHandlerException: Failed to apply NumberFormat on column. Could this be due to NULL values from the database, or because the value itself is malformed? How do we handle NULL in this case?

*Pranav Prakash* temet nosce
Re: DIH XML configs for multi environment
That approach would work for core-dependent parameters. In my case, the params are environment dependent. I think a simpler approach would be to pass the url param as JVM options and have these XMLs read it from there. I haven't tried it yet.

*Pranav Prakash* temet nosce

On Tue, Jul 17, 2012 at 5:09 PM, Markus Klose m...@shi-gmbh.com wrote: Hi, there is one more approach, using the property mechanism. You could specify the datasource like this:

<dataSource name="database" driver="${sqlDriver}" url="${sqlURL}"/>

And you can specify the properties in the solr.xml in your core configuration like this:

<core instanceDir="core1" name="core1"> <property name="sqlURL" value="jdbc:hsqldb:/temp/example/ex"/> </core>

Best regards from Augsburg, Markus Klose, SHI Elektronische Medien GmbH. Address: Curt-Frenzel-Str. 12, 86167 Augsburg. Tel.: 0821 7482633 26. Tel.: 0821 7482633 0 (switchboard). Mobile: 0176 56516869. Fax: 0821 7482633 29. E-Mail: markus.kl...@shi-gmbh.com. Internet: http://www.shi-gmbh.com. Register court: Augsburg HRB 17382. Managing director: Peter Spiske. VAT ID: DE 182167335

-Original Message- From: Rahul Warawdekar [mailto:rahul.warawde...@gmail.com] Sent: Wednesday, 11 July 2012 11:21 To: solr-user@lucene.apache.org Subject: Re: DIH XML configs for multi environment

http://wiki.eclipse.org/Jetty/Howto/Configure_JNDI_Datasource
http://docs.codehaus.org/display/JETTY/DataSource+Examples

On Wed, Jul 11, 2012 at 2:30 PM, Pranav Prakash pra...@gmail.com wrote: That's cool. Is there something similar for Jetty as well? We use Jetty!

*Pranav Prakash* temet nosce

On Wed, Jul 11, 2012 at 1:49 PM, Rahul Warawdekar rahul.warawde...@gmail.com wrote: Hi Pranav, If you are using Tomcat to host Solr, you can define your data source in the context.xml file under the Tomcat configuration. You have to refer to this datasource by the same name in all 3 environments from the DIH data-config.xml. The context.xml file will vary across the 3 environments, with different credentials for dev, stag and prod. E.g. the DIH data-config.xml will refer to the datasource as listed below:

<dataSource jndiName="java:comp/env/YOUR_DATASOURCE_NAME" type="JdbcDataSource" readOnly="true" />

The context.xml file, which is located under the /TOMCAT_HOME/conf folder, will have the resource entry as follows:

<Resource name="YOUR_DATASOURCE_NAME" auth="Container" type="" username="X" password="X" driverClassName="" url="" maxActive="8" />

On Wed, Jul 11, 2012 at 1:31 PM, Pranav Prakash pra...@gmail.com wrote: The DIH XML config file requires a dataSource to be specified. In my case, and possibly for many others, the logon credentials as well as the MySQL server paths differ by environment (dev, stag, prod). I don't want to end up with three different DIH config files, three different handlers and so on. What is a good way to deal with this?

*Pranav Prakash* temet nosce

-- Thanks and Regards Rahul A. Warawdekar

-- Thanks and Regards Rahul A. Warawdekar
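The idea discussed in this thread, one shared config with a `${...}` placeholder that each environment fills in from its own start script, can be illustrated with a short Python sketch. This only imitates the substitution step so the effect is visible; it is not how Solr implements it. The property name `db.url`, the URLs and the `resolve` helper are all hypothetical.

```python
import re

# Hypothetical data-config fragment; `db.url` is an illustrative property name.
CONFIG_TEMPLATE = (
    '<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" '
    'url="${db.url}" name="app" />'
)

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(template: str, properties: dict) -> str:
    """Replace each ${name} with its value; unknown names are left untouched."""
    return PLACEHOLDER.sub(
        lambda m: properties.get(m.group(1), m.group(0)), template
    )


# One property set per environment, as a start script might pass with -D flags.
ENVIRONMENTS = {
    "dev": {"db.url": "jdbc:mysql://localhost/app_dev"},
    "prod": {"db.url": "jdbc:mysql://db.example.com/app_prod"},
}

for name, props in ENVIRONMENTS.items():
    print(name, resolve(CONFIG_TEMPLATE, props))
```

The point of the design is that only the property values differ between dev, stag and prod, while the config file itself stays identical everywhere.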
Re: How To apply transformation in DIH for multivalued numeric field?
I had tried splitBy on the numeric field as well, but that did not work for me either. However, once I got rid of group_concat, everything worked. Thanks a lot! I really had a difficult time understanding this behavior.

*Pranav Prakash* temet nosce

On Thu, Jul 19, 2012 at 1:34 AM, Dyer, James james.d...@ingrambook.com wrote: Don't you want to specify splitBy for the integer field too? Actually though, you shouldn't need to use GROUP_CONCAT and RegexTransformer at all. DIH is designed to handle one-to-many relations between parent and child entities by populating all the child fields as multi-valued automatically. I guess your approach leads to a lot fewer rows getting sent from your db to Solr though. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311

-Original Message- From: Pranav Prakash [mailto:pra...@gmail.com] Sent: Wednesday, July 18, 2012 2:38 PM To: solr-user@lucene.apache.org Subject: How To apply transformation in DIH for multivalued numeric field?

I have a multivalued integer field and a multivalued string field defined in my schema as

<field name="community_tag_ids" type="integer" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="community_tags" type="text" indexed="true" termVectors="true" stored="true" multiValued="true" omitNorms="true" />

The DIH entity and field definitions for them are as follows:

<entity name="document" dataSource="app" onError="skip" transformer="RegexTransformer" query="...">
  <entity name="community_tags" transformer="RegexTransformer" query="SELECT group_concat(a.id SEPARATOR ',') AS community_tag_ids, group_concat(a.title SEPARATOR ',') AS community_tags FROM tags a JOIN tag_dets b ON a.id = b.tag_id WHERE b.doc_id = ${document.id}">
    <field column="community_tag_ids" name="community_tag_ids"/>
    <field column="community_tags" splitBy="," />
  </entity>
</entity>

The value of the field community_tags comes through correctly as an array of strings. However, the value of the field community_tag_ids is not correct:

<arr name="community_tag_ids"> <int>[B@390c0a18</int> </arr>

I tried chaining NumberFormatTransformer with formatStyle=number, but that throws DataImportHandlerException: Failed to apply NumberFormat on column. Could this be due to NULL values from the database, or because the value itself is malformed? How do we handle NULL in this case?

*Pranav Prakash* temet nosce
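A value such as `[B@390c0a18` is what Java prints for a byte array, which suggests the concatenated ids reached DIH as raw bytes rather than as a string. The Python sketch below shows the transformation one would want in that situation, including the NULL case asked about above. It is an illustration of the logic only, not DIH code, and `split_ids` is a made-up helper name.

```python
def split_ids(raw):
    """Turn a GROUP_CONCAT result such as b"12,15,20" into a list of ints."""
    if raw is None:  # SQL NULL: the document has no tags
        return []
    if isinstance(raw, (bytes, bytearray)):
        raw = raw.decode("ascii")  # the driver handed back raw bytes
    # Skip empty pieces so "" and trailing commas do not raise ValueError.
    return [int(part) for part in raw.split(",") if part.strip()]


print(split_ids(b"12,15,20"))
print(split_ids("7"))
print(split_ids(None))
```

Dropping GROUP_CONCAT, as the thread ends up doing, avoids the problem altogether because each id then arrives as its own row and needs no splitting.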
DIH XML configs for multi environment
The DIH XML config file requires a dataSource to be specified. In my case, and possibly for many others, the logon credentials as well as the MySQL server paths differ by environment (dev, stag, prod). I don't want to end up with three different DIH config files, three different handlers and so on. What is a good way to deal with this?

*Pranav Prakash* temet nosce
Re: DIH XML configs for multi environment
That's cool. Is there something similar for Jetty as well? We use Jetty!

*Pranav Prakash* temet nosce

On Wed, Jul 11, 2012 at 1:49 PM, Rahul Warawdekar rahul.warawde...@gmail.com wrote: Hi Pranav, If you are using Tomcat to host Solr, you can define your data source in the context.xml file under the Tomcat configuration. You have to refer to this datasource by the same name in all 3 environments from the DIH data-config.xml. The context.xml file will vary across the 3 environments, with different credentials for dev, stag and prod. E.g. the DIH data-config.xml will refer to the datasource as listed below:

<dataSource jndiName="java:comp/env/YOUR_DATASOURCE_NAME" type="JdbcDataSource" readOnly="true" />

The context.xml file, which is located under the /TOMCAT_HOME/conf folder, will have the resource entry as follows:

<Resource name="YOUR_DATASOURCE_NAME" auth="Container" type="" username="X" password="X" driverClassName="" url="" maxActive="8" />

On Wed, Jul 11, 2012 at 1:31 PM, Pranav Prakash pra...@gmail.com wrote: The DIH XML config file requires a dataSource to be specified. In my case, and possibly for many others, the logon credentials as well as the MySQL server paths differ by environment (dev, stag, prod). I don't want to end up with three different DIH config files, three different handlers and so on. What is a good way to deal with this?

*Pranav Prakash* temet nosce

-- Thanks and Regards Rahul A. Warawdekar
Top 5 high freq words - UpdateProcessorChain or DIH Script?
Hi, I want to store the top 5 highest-frequency non-stopword terms of each document. I use DIH to import data. I see two approaches:

1. Use a DIH JavaScript transformer to find the top 5 words by frequency and put them in a copy field. The copy field will then stem them and remove stop words through its analyzer chain.
2. Write a custom function for the same and add it to the UpdateRequestProcessor chain.

Which of the two would be better suited? I find the first approach rather simple, but the issue is that I won't have access to stop words, synonyms and so on at DIH time. In the second approach, if I add the function to the UpdateRequestProcessor chain and insert it after StopFilterFactory and RemoveDuplicatesTokenFilterFactory, would that be a good way of doing this?

-- *Pranav Prakash* temet nosce
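Whichever place the logic ends up in, the counting step itself is small. Below is a minimal Python sketch of it, assuming a simple lowercase word tokenizer and a tiny made-up stopword set; a real setup would read the same stopwords_en.txt that the field type uses, and neither the tokenizer nor the `top_words` helper comes from Solr.

```python
import re
from collections import Counter

# Hypothetical stopword set for the illustration only.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}


def top_words(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens, most frequent first."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(token for token in tokens if token not in stopwords)
    return [word for word, _ in counts.most_common(n)]


sample = "Solr solr SOLR index index search the the the the"
print(top_words(sample))
```

Note that the stopword filtering has to happen before the counting, otherwise "the" would win in the sample above; this is the reason the question of where stopwords are available matters for choosing between the two approaches.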
Deduplication in MLT
I have an implementation of Deduplication as mentioned at http://wiki.apache.org/solr/Deduplication. It is helpful in grouping search results. I would like to achieve the same functionality in my MLT queries, where the result set should include grouped documents. What is a good way to do the same? *Pranav Prakash* temet nosce
Typical Cache Values
Judging by their hit ratios, my caches seem to be performing poorly. Here are the numbers. What are typical values in your production setups? What can be done to improve the ratios?

stat                  queryResultCache  documentCache  fieldValueCache  filterCache
lookups                        3234602       17647360                0     30059278
hits                               496       11935609                0     28813869
hitratio                          0.00           0.67             0.00         0.95
inserts                        3234239        5711851                0      1245744
evictions                      3230143        5707755                0      1245232
size                              4096           4096                0          512
warmupTime                        8886              0                0        28005
cumulative_lookups             3465734       19009142                0     32155745
cumulative_hits                    526       12813630                0     30845811
cumulative_hitratio               0.00           0.67             0.00         0.95
cumulative_inserts             3465208        6195512                0      1309934
cumulative_evictions           3457151        6187460                0      1309245

*Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
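The hit ratios above are simply hits divided by lookups. As a quick check, here is a small Python sketch that recomputes them from the raw counters. The ratios printed in the stats are consistent with truncation to two decimals rather than rounding (0.676 appears as 0.67), so the sketch truncates too; that is an observation from these numbers, not a statement about how Solr formats them, and `hit_ratio` is a made-up helper name.

```python
import math


def hit_ratio(hits: int, lookups: int) -> float:
    """Hits divided by lookups, truncated to two decimals; 0.0 if unused."""
    if lookups == 0:
        return 0.0
    return math.floor(100 * hits / lookups) / 100


# (hits, lookups) pairs copied from the stats above.
caches = {
    "queryResultCache": (496, 3234602),
    "documentCache": (11935609, 17647360),
    "fieldValueCache": (0, 0),
    "filterCache": (28813869, 30059278),
}

for name, (hits, lookups) in caches.items():
    print(name, hit_ratio(hits, lookups))
```

For queryResultCache this makes the scale of the problem clear: 496 hits in about 3.2 million lookups is roughly 0.015%, and nearly every insert is followed by an eviction.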
Re: Typical Cache Values
> This is not unusual, but there's also not much reason to give this much memory in your case. This is the cache that is hit when a user pages through a result set. Your numbers would seem to indicate one of two things: 1) your window is smaller than 2 pages (see queryResultWindowSize in solrconfig.xml), or 2) your users are rarely going to the next page. This cache isn't doing you much good, but then it's also not using that much in the way of resources.

True. Although the queryResultWindowSize is 30, I will be reducing it to 4 or so. And yes, we have observed that most people don't go beyond the first page.

documentCache
lookups : 17647360
hits : 11935609
hitratio : 0.67
inserts : 5711851
evictions : 5707755
size : 4096
warmupTime : 0
cumulative_lookups : 19009142
cumulative_hits : 12813630
cumulative_hitratio : 0.67
cumulative_inserts : 6195512
cumulative_evictions : 6187460

> Again, this is actually quite reasonable. This cache is used to hold document data, and often doesn't have a great hit ratio. It is necessary though; it saves quite a few disk seeks when servicing a single query.

fieldValueCache
lookups : 0
hits : 0
hitratio : 0.00
inserts : 0
evictions : 0
size : 0
warmupTime : 0
cumulative_lookups : 0
cumulative_hits : 0
cumulative_hitratio : 0.00
cumulative_inserts : 0
cumulative_evictions : 0

> Not doing much in the way of faceting, are you?

No. We don't facet results.

filterCache
lookups : 30059278
hits : 28813869
hitratio : 0.95
inserts : 1245744
evictions : 1245232
size : 512
warmupTime : 28005
cumulative_lookups : 32155745
cumulative_hits : 30845811
cumulative_hitratio : 0.95
cumulative_inserts : 1309934
cumulative_evictions : 1309245

> Not a bad hit ratio here; this is where fq filters are stored. One caution: it is better to break out your filter queries into small chunks where possible. Rather than write fq=field1:val1 AND field2:val2, it's better to write fq=field1:val1&fq=field2:val2. Think of this cache as a map with the query as the key. If you write the fq the first way above, subsequent fqs for either half won't use the cache.

That was great advice. We do use the former approach, but going forward we will stick to the latter one.

Thanks, Pranav
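The "map with the query as the key" point about fq can be made concrete with a toy model. This is only a sketch: the class name ToyFilterCache and the request lists are made up for illustration, and Solr's real filterCache stores document sets rather than strings.

```python
# Toy model of a filter cache: a map from the exact fq string to a cached entry.
# Illustrative only; it is not how Solr implements its filterCache.

class ToyFilterCache:
    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, fq):
        # A lookup only hits when the exact same fq string was seen before.
        if fq in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[fq] = f"docset({fq})"
        return self.entries[fq]


def run(requests):
    """Replay a list of requests (each a list of fq strings); return (hits, misses)."""
    cache = ToyFilterCache()
    for fqs in requests:
        for fq in fqs:
            cache.lookup(fq)
    return cache.hits, cache.misses


# First request filters on both fields, later requests filter on one field each.
combined = [["field1:val1 AND field2:val2"], ["field1:val1"], ["field2:val2"]]
split = [["field1:val1", "field2:val2"], ["field1:val1"], ["field2:val2"]]

combined_result = run(combined)
split_result = run(split)
print("combined fq (hits, misses):", combined_result)
print("split fq    (hits, misses):", split_result)
```

With the combined form, the later single-field filters never find a cached entry; with the split form, both of them do.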
Something like featured results in solr response?
Hi, I believe there is a feature in Solr which allows returning a set of featured documents for a query. I read about it a couple of months back, and now that I have decided to work on it, I can't find the reference. Here is the description: for a search keyword, apart from the results generated by Solr (which are based on relevancy score), there is another set of documents which always comes up. It is very similar to the sponsored results feature of Google. Can you point me to the appropriate resources? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Re: Something like featured results in solr response?
Thanks a lot :-) This is exactly what I had read back then. However, going through it now, it seems that every time a document needs to be elevated, it has to be added to the config file, which means that Solr has to be restarted. This does not make a lot of sense for a production environment, where Solr restarts are as infrequent as config changes. What would be a sound way to implement this?

*Pranav Prakash* temet nosce

2012/1/30 Rafał Kuć r@solr.pl wrote:
> Hello! Please look at http://wiki.apache.org/solr/QueryElevationComponent.
> -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Re: Something like featured results in solr response?
Wow, this looks interesting.

*Pranav Prakash* temet nosce

On Mon, Jan 30, 2012 at 21:16, Erick Erickson erickerick...@gmail.com wrote:
> There's the tricky line: "If the file exists in the /conf/ directory it will be loaded once at start-up. If it exists in the data directory, it will be reloaded for each IndexReader." on the page: http://wiki.apache.org/solr/QueryElevationComponent
> Which basically means that if your config file is in the right directory, it'll be reloaded whenever the index changes, i.e. when a replication happens in a master/slave setup or when a commit happens on a single machine used for both indexing and searching.
> Best
> Erick
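Following Erick's pointer, one way to avoid restarts might be to regenerate the elevation file in the data directory and let the next commit pick it up. The sketch below assumes the file is named elevate.xml and uses the elevate/query/doc layout shown on the QueryElevationComponent wiki page; the helper names, the query text, and the document ids are hypothetical, and a temporary directory stands in for the real Solr data directory.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def build_elevate_xml(elevations):
    """Build elevate.xml content; elevations maps query text -> ordered doc ids."""
    root = ET.Element("elevate")
    for query_text, doc_ids in elevations.items():
        query = ET.SubElement(root, "query", {"text": query_text})
        for doc_id in doc_ids:
            ET.SubElement(query, "doc", {"id": doc_id})
    return ET.tostring(root, encoding="unicode")


def write_elevate_file(data_dir, elevations):
    """Write elevate.xml into data_dir, replacing any previous version in one step."""
    path = os.path.join(data_dir, "elevate.xml")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(build_elevate_xml(elevations))
    os.replace(tmp_path, path)  # avoids exposing a half-written file
    return path


# Assumption: a temp directory stands in for the core's data directory.
data_dir = tempfile.mkdtemp()
elevate_path = write_elevate_file(data_dir, {"mobile services": ["doc-42", "doc-7"]})
with open(elevate_path, encoding="utf-8") as handle:
    print(handle.read())
```

According to the wiki text quoted above, a file in the data directory is reloaded for each new IndexReader, so a commit (or a replication on a slave) after rewriting the file should be enough; that behaviour is worth verifying on the Solr version in use.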
Re: Highlighting uses lots of memory and eventually slows down Solr
No response yet, so I am bumping this up. *Pranav Prakash* temet nosce
Highlighting uses lots of memory and eventually slows down Solr
Hi Group,

I would like to have highlighting for search, and I have the fields indexed with the following schema (Solr 3.4):

    <fieldType name="text_commongrams" class="solr.TextField">
      <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.CommonGramsFilterFactory" words="stopwords_en.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" words="stopwords_en.txt" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
      </analyzer>
    </fieldType>
    <field name="transcript" type="text_commongrams" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
    <dynamicField name="*_en" type="text_commongrams" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

And the following config:

    <highlighting>
      <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
        <lst name="defaults">
          <int name="hl.fragsize">100</int>
        </lst>
      </fragmenter>
      <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
        <lst name="defaults">
          <int name="hl.fragsize">20</int>
          <float name="hl.regex.slop">0.5</float>
          <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
        </lst>
      </fragmenter>
      <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
        <lst name="defaults">
          <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
          <str name="hl.simple.post"><![CDATA[</strong>]]></str>
        </lst>
      </formatter>
    </highlighting>

The problem is that when I turn on highlighting, I face memory issues. The memory usage on the system goes higher and higher until it consumes all the memory (I don't receive OOM errors; there is always around 300 MB of free memory). The total memory I have is 48GiB. My index size is 138GiB and there are about 10m documents in the index.

I also get the following warning, but I am not sure how to act on it:

    WARNING: Deprecated syntax found. <highlighting/> should move to <searchComponent/>

My Solr log with highlighting turned on looks something like this:

    [core0] webapp=/solr path=/select params={mm=390%25&qf=title^2&hl.simple.pre=<strong>&hl.fl=title,transcript,transcript_en&wt=ruby&hl=true&rows=12&defType=dismax&fl=id,title,description&debugQuery=false&start=0&q=asdfghjkl&bf=recip(ms(NOW,created_at),1.88e-11,1,1)&hl.simple.post=</strong>&ps=50}

Any help on this would be greatly appreciated. Thanks in advance!

*Pranav Prakash* temet nosce
How to programmatically check if the index is optimized or not?
Hi, After the commit, my optimize usually takes 20 minutes. I need to know programmatically whether the optimization has completed. Is there an API call through which I can check the status of the optimization? *Pranav Prakash* temet nosce
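One possibility, offered as a sketch rather than a confirmed answer from this thread: as far as I know, the Luke request handler (/admin/luke) in Solr 1.4/3.x reports an "optimized" flag in its index section. The base URL, the handler path for your core, and the presence of that key are all assumptions to verify on your installation; the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def is_optimized(luke_response):
    """Return the 'optimized' flag from a parsed Luke response, or None if absent."""
    return luke_response.get("index", {}).get("optimized")


def fetch_luke(base_url="http://localhost:8983/solr"):
    """Fetch index info from the Luke handler (hypothetical URL; adjust per core)."""
    with urlopen(f"{base_url}/admin/luke?numTerms=0&wt=json") as response:
        return json.load(response)


# Offline example with a hand-written response shaped like the assumed Luke output.
sample_response = {"index": {"numDocs": 10, "maxDoc": 12, "optimized": False}}
print("optimized:", is_optimized(sample_response))
```

A polling loop would call fetch_luke() every few seconds and stop once is_optimized() returns True; returning None lets the caller notice when the key is missing instead of silently treating it as "not optimized".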
Re: Painfully slow indexing
Hey guys, thanks for your responses, but I still haven't seen much improvement.

*Are you posting through HTTP/SolrJ?*
I am using the RSolr gem, which internally uses the Ruby HTTP library to POST documents to Solr.

*Your script time 'T' includes the time between sending the POST request and fetching the successful response, right?*
Correct. It also includes the time taken to convert all those documents from a Ruby Hash to XML.

*generate the ready-for-indexing XML documents on a file system*
Alain, I have around 6m documents to index. Do you mean that I should convert all of them into one XML file and then index it?

*are you calling commit after your batches or do an optimize by any chance?*
I am not optimizing, but I am performing an autocommit every 10 docs.

*Pranav Prakash* temet nosce

On Fri, Oct 21, 2011 at 16:32, Simon Willnauer simon.willna...@googlemail.com wrote:
> hey, are you calling commit after your batches or do an optimize by any chance? I would suggest you stream your documents to Solr and commit only if you really need to. Set your RAM buffer to something between 256 and 320 MB and remove the maxBufferedDocs setting completely. You can also experiment with your merge settings a little; 10 merging threads seems like a lot. I know you have lots of CPU, but IO will be the bottleneck here.
> simon
Painfully slow indexing
Hi guys,

I have set up a Solr instance, and upon attempting to index documents, the whole process is painfully slow. I will try to put as much info as I can in this mail. Please feel free to ask me anything else that might be required.

I am sending documents in batches not exceeding 2,000. The size of each batch varies but is usually around 10-15MiB. My indexing script tells me that Solr took T seconds to add N documents of size S. For the same data, the Solr log add QTime is QT. Some sample data:

    N         | S                | T      | QT (ms)
    ----------+------------------+--------+--------
    390 docs  | 3,478,804 bytes  | 14.5s  | 2297
    852 docs  | 6,039,535 bytes  | 25.3s  | 4237
    1345 docs | 11,147,512 bytes | 47s    | 8543
    1147 docs | 9,457,717 bytes  | 44s    | 2297
    1096 docs | 13,058,204 bytes | 54.3s  | 8782

The time T includes the time to convert an array of Hash objects into XML, POST it to Solr, and receive the acknowledgement from Solr. Clearly, there is a huge difference between T and QT. After a lot of effort, I have no clue why these times do not match.

The server has 16 cores and 48GiB RAM. JVM options are -Xms5000M -Xmx5000M -XX:+UseParNewGC

I believe my indexing is getting slow. The relevant portions of my configuration are below. On a related note, every document has one dynamic field. At this rate, it takes me ~30hrs to do a full index of my database. I would really appreciate the community's help in getting this indexing faster.

    <indexDefaults>
      <useCompoundFile>false</useCompoundFile>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxMergeCount">10</int>
        <int name="maxThreadCount">10</int>
      </mergeScheduler>
      <ramBufferSizeMB>2048</ramBufferSizeMB>
      <maxMergeDocs>2147483647</maxMergeDocs>
      <maxFieldLength>300</maxFieldLength>
      <writeLockTimeout>1000</writeLockTimeout>
      <maxBufferedDocs>5</maxBufferedDocs>
      <termIndexInterval>256</termIndexInterval>
      <mergeFactor>10</mergeFactor>
      <useCompoundFile>false</useCompoundFile>
      <!--
      <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
        <int name="maxMergeAtOnceExplicit">19</int>
        <int name="segmentsPerTier">9</int>
      </mergePolicy>
      -->
    </indexDefaults>
    <mainIndex>
      <unlockOnStartup>true</unlockOnStartup>
      <reopenReaders>true</reopenReaders>
      <deletionPolicy class="solr.SolrDeletionPolicy">
        <str name="maxCommitsToKeep">1</str>
        <str name="maxOptimizedCommitsToKeep">0</str>
      </deletionPolicy>
      <infoStream file="INFOSTREAM.txt">false</infoStream>
    </mainIndex>
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>10</maxDocs>
      </autoCommit>
    </updateHandler>

*Pranav Prakash* temet nosce
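To put the gap between T and QT into numbers, the five sample batches from this message can be totalled. The sketch below only does arithmetic on the posted figures; it relies on Solr reporting QTime in milliseconds and says nothing about where the remaining time goes.

```python
# (docs, size in bytes, client-side time T in seconds, Solr QTime in milliseconds)
samples = [
    (390, 3_478_804, 14.5, 2297),
    (852, 6_039_535, 25.3, 4237),
    (1345, 11_147_512, 47.0, 8543),
    (1147, 9_457_717, 44.0, 2297),
    (1096, 13_058_204, 54.3, 8782),
]

total_docs = sum(docs for docs, _, _, _ in samples)
total_t = sum(t for _, _, t, _ in samples)                    # seconds seen by the script
total_qt = sum(qt for _, _, _, qt in samples) / 1000.0        # seconds spent inside Solr's add

solr_share = total_qt / total_t        # fraction of T accounted for by QTime
docs_per_second = total_docs / total_t

print(f"docs: {total_docs}, T: {total_t:.1f}s, QTime: {total_qt:.1f}s")
print(f"share of T spent inside Solr: {solr_share:.1%}")
print(f"end-to-end rate: {docs_per_second:.1f} docs/s")
```

For these samples, QTime covers only about 14% of the measured time, so most of T is spent outside Solr's add handling (building the XML on the client, transferring it, and so on), which is where tuning effort would have to start.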
How to achieve Indexing @ 270GiB/hr
Greetings,

While going through the article "265% indexing speedup with Lucene's concurrent flushing" (http://java.dzone.com/news/265-indexing-speedup-lucenes?mz=33057-solr_lucene), I was stunned by how much indexing speed could be increased. I'd like input from everyone here on how to achieve this speed.

As far as I understand, there are two broad ways of feeding data to Solr:

1. Using DataImportHandler
2. Using HTTP to POST docs to Solr

The speeds the article describes seem too much to expect from the second approach. Or is it possible using multiple instances feeding docs to Solr?

My current setup does the following:

1. Execute SQL queries to create the set of documents that need to be fed.
2. Go through the columns one by one, create XML for them, and send it to Solr in batches of at most 500 docs.

Even when using DataImportHandler, in what ways could this be optimized? If I am able to solve the problem of indexing data in our current setup, my life would become a lot easier.

*Pranav Prakash* temet nosce
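The "multiple instances feeding docs to Solr" idea can be sketched on the client side as several workers posting batches concurrently. Everything below is an assumption for illustration: post_batch is a hypothetical stand-in that only counts documents (a real version would POST the batch to Solr's update handler), and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(docs, size):
    """Yield consecutive slices of docs with at most `size` items each."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def post_batch(batch):
    """Hypothetical sender: a real one would POST the batch to Solr's /update."""
    return len(batch)


def index_in_parallel(docs, batch_size=500, workers=4, sender=post_batch):
    """Send batches from several threads and return the number of docs sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sender, batches(docs, batch_size)))


docs = [{"id": i} for i in range(1750)]
batch_sizes = [len(b) for b in batches(docs, 500)]
indexed = index_in_parallel(docs)
print("batch sizes:", batch_sizes)
print("documents sent:", indexed)
```

Whether this helps depends on where the time is going: parallel senders mainly hide client-side conversion and network latency, and they will not help if the server is already bound by disk IO or frequent commits.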
Suggestions on how to perform infrastructure migration from 1.4 to 3.4?
Hi List, Our production search infrastructure consists of 1 indexing master and 2 identical serving slaves. They are all Solr 1.4 machines. Apart from this, we have 1 machine on Solr 3.4, which we have benchmarked against our production setup (for performance and relevancy), and we would like to upgrade our production setup to it. Something like this has not been done before in our organization. I'd like to hear opinions from the community on how this migration can be performed. Will there be any downtime, and if so, for how many hours? What are some of the common issues that might come up? *Pranav Prakash* temet nosce
Can't use ms() function on non-numeric legacy date field
Hi, I have been trying to boost my recent documents using what is described here: http://wiki.apache.org/solr/FunctionQuery#Date_Boosting

My date field looks like:

    <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
    <field name="created_at" type="date" indexed="true" stored="true" omitNorms="true"/>

However, upon trying ms(NOW, created_at) it shows the error:

    Can't use ms() function on non-numeric legacy date field created_at

*Pranav Prakash* temet nosce
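As far as I know, ms() only accepts numerically indexed (Trie-based) date fields such as solr.TrieDateField, so the legacy solr.DateField type above would have to be changed and the data reindexed; please verify this against your Solr version. Independently of the field type, the shape of the date boost itself can be checked offline. The sketch below evaluates recip(x, m, a, b) = a / (m*x + b); the constant 3.16e-11 is the value I recall the wiki's date-boosting example using and is an assumption here.

```python
def recip(x, m, a, b):
    """Solr-style reciprocal function: a / (m * x + b)."""
    return a / (m * x + b)


MS_PER_DAY = 24 * 60 * 60 * 1000
MS_PER_YEAR = 365 * MS_PER_DAY

# Assumed multiplier from the wiki example: roughly 1 / (milliseconds in a year).
M = 3.16e-11

boost_new = recip(0, M, 1, 1)                      # document created right now
boost_one_year = recip(MS_PER_YEAR, M, 1, 1)       # document one year old
boost_five_years = recip(5 * MS_PER_YEAR, M, 1, 1)

for label, value in [("new", boost_new), ("1 year", boost_one_year), ("5 years", boost_five_years)]:
    print(f"{label:>8}: {value:.3f}")
```

With these constants a brand-new document gets a boost of 1.0, a one-year-old document about half that, and older documents progressively less, which is the decay the boost function is meant to produce.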
Re: StopWords coming in Top 10 terms despite using StopFilterFactory
> You've got CommonGramsFilterFactory and StopFilterFactory both using stopwords.txt, which is a confusing configuration. Normally you'd want one or the other, not both ... but if you did legitimately have both, you'd want them to each use a different wordlist.

Maybe I am wrong, but my intention in using both of them is this: first, I want to use phrase queries, so I used CommonGramsFilterFactory. Secondly, I don't want those stopwords in my index, so I have used StopFilterFactory to remove them.

> The commongrams filter turns each found occurrence of a word in the file into two tokens - one prepended with the token before it, one appended with the token after it. If it's the first or last term in a field, it only produces one token. When it gets to the stopfilter, the combined terms no longer match what's in stopwords.txt, so no action is taken. If I had to guess, what you are seeing in the top 10 terms is the concatenation of your most common stopword with another word. If it were English, I would guess that to be of_the or something similar. If my guess is wrong, then I'm not sure what's going on, and some cut/paste of what you're actually seeing might be in order.

    term | frequency
    -----+----------
    to   | 26164
    and  | 25804
    the  | 25566
    of   | 25022
    a    | 24918
    in   | 24590
    for  | 23646
    n    | 23588
    with | 23055
    is   | 22510

> Did you delete and do a full reindex after you changed your schema?

Yup, I did that a couple of times.

> Thanks, Shawn

*Pranav Prakash* temet nosce
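Shawn's explanation of why combined terms slip past the stop filter can be played through with a simplified model. This is a sketch under assumptions: the two functions only imitate the idea of CommonGramsFilter followed by StopFilter (plain token lists, no position information), the stopword set is made up, and the real filters may differ in detail.

```python
STOPWORDS = {"of", "the", "in"}


def common_grams(tokens, common):
    """Keep each token and add a joined bigram wherever either neighbour is a common word."""
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if i + 1 < len(tokens) and (token in common or tokens[i + 1] in common):
            out.append(f"{token}_{tokens[i + 1]}")
    return out


def stop_filter(tokens, stopwords):
    """Drop tokens that exactly match a stopword; joined grams never match."""
    return [token for token in tokens if token not in stopwords]


tokens = "future of the mobile".split()
grams = common_grams(tokens, STOPWORDS)
kept = stop_filter(grams, STOPWORDS)
print("after common grams:", grams)
print("after stop filter: ", kept)
```

In this model the bare stopwords are removed but terms like of_the survive, which matches Shawn's guess about what could show up among the top terms. The list posted in this thread contains bare stopwords instead, so the model does not explain that observation.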
StopWords coming in Top 10 terms despite using StopFilterFactory
Hi List, I included StopFilterFactory and I can see it taking action in the Analyzer Interface. However, when I go to Schema Analyzer, I see those stop words in the top 10 terms. Is this normal?

    <fieldType name="text_commongrams" class="solr.TextField">
      <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
      </analyzer>
    </fieldType>

*Pranav Prakash* temet nosce
Re: java.io.CharConversionException While Indexing in Solr 3.4
I managed to resolve this issue. It turns out that the problem was a faulty XML file being generated by the ruby-solr gem. I had to install libxml-ruby and rsolr, and I now use the rsolr gem instead of ruby-solr. Also, if you face this kind of issue, the test-utf8.sh file included in exampledocs is a good way to test Solr's behavior with UTF-8 characters. Great work, Solr team, and special thanks to Erik Hatcher. *Pranav Prakash* temet nosce
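Since the root cause here was malformed XML produced on the client, a quick check of the payload before posting can locate the bad bytes. The sketch below is a generic validity check, not part of Solr or of any gem mentioned in this thread; the function name and the sample payload are made up.

```python
def find_invalid_utf8(data):
    """Return the byte offset where UTF-8 decoding first fails, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None


# 0xC3 starts a two-byte sequence, so the next byte must be a continuation byte
# (0x80-0xBF). Here it is followed by 's' (0x73), similar to the reported error.
broken_payload = b"<field name=\"title\">caf\xc3s</field>"
valid_payload = "<field name=\"title\">café</field>".encode("utf-8")

offset = find_invalid_utf8(broken_payload)
print("first invalid sequence starts at byte:", offset)
print("byte after it:", hex(broken_payload[offset + 1]))
print("valid payload check:", find_invalid_utf8(valid_payload))
```

Running such a check over each generated batch makes it possible to log the offending document on the client instead of digging through the server-side stack trace.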
Re: Stemming and other tokenizers
I have a similar use case, but slightly more flexible and straightforward. In my case, I have a field 'language' which stores 'en', 'es' or whatever the language of the document is. The field 'transcript' then stores the actual content, which is in the language given in the language field. Following up on the conversation, is this how I am supposed to proceed?

1. Create one field type in my schema per supported language. This would cause me to create ~30 fields.
2. Since I already know the language of my content, I can skip SOLR-1979 (which is expected in Solr 3.5).

The point where I am unclear is: how do I specify, at index time, that a certain field should be used for a certain language?

*Pranav Prakash* temet nosce

On Mon, Sep 12, 2011 at 20:55, Jan Høydahl jan@cominvent.com wrote:
> Hi, Do they? Can you explain the layout of the documents? There are two ways to handle multilingual docs. If all your docs have both an English and a Norwegian version, you may either split these into two separate documents, each with the language field filled by LangId - which then also lets you filter by language. Or you may assign a title_en and title_no to the same document (expand with more fields if you have more languages per document), and keep it as one document. Your client will then be adapted to search the language(s) that the user wants. If one document has multiple languages within the same field, e.g. body, say one paragraph of English and the next is Norwegian, then we currently do not have any capability in Solr to apply different analysis (tokenization, stemming etc.) to each paragraph.
> -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com

On 12. sep. 2011, at 11:37, Manish Bafna wrote:
> What if a single document has multiple languages?

On Mon, Sep 12, 2011 at 2:23 PM, Jan Høydahl jan@cominvent.com wrote:
> Hi, Everybody else uses a dedicated field per language, so why can't you? Please explain your use case, and perhaps we can better understand what you're trying to do. Do you always know the query language in advance?
> -- Jan Høydahl

On 12. sep. 2011, at 08:28, Patrick Sauts wrote:
> I can't create one field per language, that is the problem, but I'll dig into it following your indications. I'll let you know what I come up with. Patrick.

2011/9/11 Jan Høydahl jan@cominvent.com wrote:
> Hi, You'll not be able to detect the language and change the stemmer on the same field in one go. You need to create one fieldType in your schema per language you want to use, and then use LanguageIdentification (SOLR-1979) to do the magic of detecting the language and renaming the field. If you set langid.override=false, langid.map=true and populate your language field with the known language, you will probably get the desired effect.
> -- Jan Høydahl

On 10. sep. 2011, at 03:24, Patrick Sauts wrote:
> Hello, I want to implement some kind of AutoStemming that will detect the language of a field based on a tag at the start of this field, like #en#. My field is stored on disk, but I don't want this tag to be stored. Is there a way to avoid storing this tag? As I understand it, all the filters and tokenizers act only on the indexed field and not on the stored one. Am I wrong? Is it possible to write such a filter? Patrick.
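When the language is already known, as in Pranav's case, one option is to pick the target field on the client before the document is sent. The sketch below assumes the schema defines per-language fields named transcript_<lang> (for example through dynamic fields such as *_en) plus a generic transcript field as fallback; the function name, field names, and the supported-language set are hypothetical.

```python
SUPPORTED_LANGUAGES = {"en", "es", "no"}


def to_solr_doc(record, supported=SUPPORTED_LANGUAGES, fallback_field="transcript"):
    """Map a source record to a Solr document, routing text to a per-language field."""
    language = record["language"]
    # Assumption: fields like transcript_en exist in the schema with a matching analyzer.
    field = f"transcript_{language}" if language in supported else fallback_field
    return {"id": record["id"], "language": language, field: record["transcript"]}


english = to_solr_doc({"id": "1", "language": "en", "transcript": "Future of mobile services"})
unknown = to_solr_doc({"id": "2", "language": "xx", "transcript": "..."})
print(english)
print(unknown)
```

At query time the client would then search the field matching the user's language, in line with Jan's description of adapting the client to the language(s) the user wants.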
java.io.CharConversionException While Indexing in Solr 3.4
Hi List, I tried Solr 3.4.0 today and while indexing I got the error java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66611, byte #65289) My earlier version was Solr 1.4 and this same document went into index successfully. Looking around, I see issue https://issues.apache.org/jira/browse/SOLR-2381 which seems to fix the issue. I thought this patch is already applied to Solr 3.4.0. Is there something I am missing? Is there anything else I need to mention? Logs/ My document details etc.? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Re: java.io.CharConversionException While Indexing in Solr 3.4
Just in case someone might be interested, here is the log:

SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289)
    at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289)
    at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:313)
    at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:204)
    at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
    at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
    at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
    at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
    at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
    at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
    at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
    ... 26 more

Also, is there a setting so I can change the level of backtrace? This would be helpful in showing the complete stack instead of "... 26 more". *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Mon, Sep 19, 2011 at 14:16, Pranav Prakash pra...@gmail.com wrote: Hi List, I tried Solr 3.4.0 today and while indexing I got the error java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66611, byte #65289) My earlier version was Solr 1.4 and this same document went into index successfully. Looking around, I see issue https://issues.apache.org/jira/browse/SOLR-2381 which seems to fix the issue. I thought this patch is already applied to Solr 3.4.0. Is there something I am missing? Is there anything else I need to mention? Logs/ My document details etc.? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
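One way to narrow this down before posting to Solr is to scan the raw update payload for byte sequences that are not valid UTF-8. The following is a minimal sketch, not anything from Solr or Woodstox; the function name `find_invalid_utf8` is made up for illustration. Note that it reports the offset where the bad sequence starts, whereas the Woodstox message names the offending "middle" byte, so the two offsets can differ by a byte or two.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) for each position where UTF-8 decoding fails."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the buffer is valid UTF-8
        except UnicodeDecodeError as err:
            bad = pos + err.start  # err.start is relative to the slice
            problems.append((bad, data[bad]))
            pos = bad + 1  # resume right after the offending byte
    return problems


if __name__ == "__main__":
    # 0xC3 opens a two-byte sequence, but 0x73 ("s") is not a continuation byte.
    sample = b"caf\xc3s"
    for offset, value in find_invalid_utf8(sample):
        print(f"invalid sequence starts at byte {offset}: 0x{value:02x}")
```

Running this over the file that is being posted (read in binary mode) should point at the region around byte #65289 mentioned in the exception, assuming the payload on disk is the same one Solr received.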
How To Implement Sweet Spot Similarity?
I was wondering if there is *any* article on the web that provides me with implementation details and some sort of analysis on Sweet Spot Similarity? Google shows me all the JIRA commits and comments but no article about actual implementation. What are the various configs that could be done. What are the good approaches for figuring out sweet spots? Can a combination of multiple Similarity Classes be used? Any information would be so appreciated. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case
Solr 3.3. has a feature Grouping. Is it practically same as deduplication? Here is my use case for duplicates removal - We have many documents with similar (upto 99%) content. Upon some search queries, almost all of them come up on first page results. Of all these documents, essentially one is original and the other are duplicates. We are able to find the original content on a basis of number of factors - who uploaded it, when, how many viral shares.It is also possible that the duplicates are uploaded earlier (and hence exist in search index) while the original is uploaded later (and gets added later to index). AFAIK, Deduplication targets index time. Is there a means I can specify the original which should be returned and the duplicates which could be removed from coming up.? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
OOM due to JRE Issue (LUCENE-1566)
Hi, This has probably been discussed a long time back, but I got this error recently on one of my production slaves. SEVERE: java.lang.OutOfMemoryError: OutOfMemoryError likely caused by the Sun VM Bug described in https://issues.apache.org/jira/browse/LUCENE-1566; try calling FSDirectory.setReadChunkSize with a a value smaller than the current chunk size (2147483647) I am currently using Solr 1.4. Going through the JIRA issue comments, I found that this patch applies to 2.9 or above. We are also planning an upgrade to Solr 3.3. Is this patch included in 3.3, so that I don't have to apply it manually? What are the other workarounds for the problem? Thanks in advance. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Re: OOM due to JRE Issue (LUCENE-1566)
AFAIK, Solr 1.4 is on Lucene 2.9.1, so this patch is already applied to the version you are using. Maybe you can provide the stacktrace and more details about your problem and report back? Unfortunately, I have only this much information with me. However, the following are my specifications, in case they are helpful: /usr/bin/java -d64 -Xms5000M -Xmx5000M -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$GC_LOGFILE -XX:+CMSPermGenSweepingEnabled -Dsolr.solr.home=multicore -Denable.slave=true -jar start.jar 32GiB RAM Any thoughts? Will a switch to ConcurrentGC help in any way?
Re: Is optimize needed on slaves if it replicates from optimized master?
That is not true. Replication is roughly a copy of the diff between the master and the slave's index. In my case, during replication entire index is copied from master to slave, during which the size of index goes a little over double. Then it shrinks to its original size. Am I doing something wrong? How can I get the master to serve only delta index instead of serving whole index and the slaves merging the new and old index? *Pranav Prakash*
How come this query string starts with wildcard?
While going through my Solr error logs, I found that a user had fired a query - jawapan ujian bulanan thn 4 (bahasa melayu). This was converted to the following for autosuggest purposes - jawapan?ujian?bulanan?thn?4?(bahasa?melayu)* by the JavaScript code. Solr threw the exception Cannot parse 'jawapan?ujian?bulanan?thn?4?(bahasa?melayu)*': '*' or '?' not allowed as first character in WildcardQuery How come this query string begins with a wildcard character? When I changed the query to remove the brackets, everything went smoothly. There were no results, probably because my search index didn't have any matching documents. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
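A likely explanation, offered as a guess rather than a confirmed diagnosis: the query parser treats the parentheses as grouping syntax, so (bahasa?melayu) becomes a sub-query and the trailing * is left as a separate term that consists only of a wildcard. Escaping the query-syntax characters in the user input before gluing on the wildcards avoids that. The sketch below uses hypothetical helper names (`escape_term`, `build_prefix_query`) and escapes the same characters that Lucene's Java `QueryParser.escape` targets; whether an escaped parenthesis then matches anything depends on how the field is analyzed.

```python
import re

# Characters with special meaning in the classic Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\&|])')


def escape_term(text):
    """Prefix every query-syntax character with a backslash."""
    return _SPECIAL.sub(r"\\\1", text)


def build_prefix_query(user_input):
    """Mimic the autosuggest rewrite: join the words with '?' and append '*'."""
    terms = [escape_term(term) for term in user_input.split()]
    if not terms:
        return ""
    return "?".join(terms) + "*"


if __name__ == "__main__":
    print(build_prefix_query("jawapan ujian bulanan thn 4 (bahasa melayu)"))
```

The same escaping would have to happen in the JavaScript that builds the autosuggest query; the Python here only illustrates the idea.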
Re: Is optimize needed on slaves if it replicates from optimized master?
Very well explained. Thanks. Yes, we do optimize Index before replication. I am not particularly worried about disk space usage. I was more curious of that behavior. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Wed, Aug 10, 2011 at 19:55, Erick Erickson erickerick...@gmail.comwrote: This is expected behavior. You might be optimizing your index on the master after every set of changes, in which case the entire index is copied. During this period, the space on disk will at least double, there's no way around that. If you do NOT optimize, then the slave will only copy changed segments instead of the entire index. Optimizing isn't usually necessary except periodically (daily, perhaps weekly, perhaps never actually). All that said, depending on how merging happens, you will always have the possibility of the entire index being copied sometimes because you'll happen to hit a merge that merges all segments into one. There are some advanced options that can control some parts of merging, but you need to get to the bottom of why the whole index is getting copied every time before you go there. I'd bet you're issuing an optimize. Best Erick On Wed, Aug 10, 2011 at 5:30 AM, Pranav Prakash pra...@gmail.com wrote: That is not true. Replication is roughly a copy of the diff between the master and the slave's index. In my case, during replication entire index is copied from master to slave, during which the size of index goes a little over double. Then it shrinks to its original size. Am I doing something wrong? How can I get the master to serve only delta index instead of serving whole index and the slaves merging the new and old index? *Pranav Prakash*
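Erick's explanation can be pictured with a toy model: the slave fetches only the index files it does not already have, and an optimize rewrites everything into new files. This is a simplified sketch, not the actual ReplicationHandler logic, and the segment file names are invented for illustration.

```python
def files_to_copy(master_files, slave_files):
    """Files present on the master that the slave does not have yet."""
    return sorted(set(master_files) - set(slave_files))


if __name__ == "__main__":
    slave = {"_1.cfs", "_2.cfs"}

    # A normal commit adds a new segment next to the old ones.
    after_commit = {"_1.cfs", "_2.cfs", "_3.cfs"}
    print(files_to_copy(after_commit, slave))

    # An optimize merges everything into one brand-new segment,
    # so nothing on the slave can be reused.
    after_optimize = {"_4.cfs"}
    print(files_to_copy(after_optimize, slave))
```

In the optimize case the slave has to hold the old and the new files at the same time until the switch-over, which matches the temporary doubling of disk usage described above.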
Re: Solr 3.3 crashes after ~18 hours?
What do you mean by "it just crashes"? Does the process stop executing? Does it take too long to respond, which might result in lots of 503s in your application? Does the system run out of resources? Are you indexing and serving from the same server? It happened to us once that Solr was performing a commit and then an optimize while the load from the app server was at its peak. This caused slow responses from the search server, which caused requests to stack up at the app server and produce 503s. Could you check whether you have similar symptoms? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Tue, Aug 2, 2011 at 15:31, alexander sulz a.s...@digiconcept.net wrote: Hello folks, I'm using the latest stable Solr release - 3.3 and I encounter strange phenomena with it. After about 19 hours it just crashes, but I can't find anything in the logs, no exceptions, no warnings, no suspicious info entries.. I have an index-job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize-job is done twice a day at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong or where to look for further debug info? regards and thank you alex
Re: PivotFaceting in solr 3.3
From what I know, this is a feature in Solr 4.0 marked as SOLR-792 in JIRA. Is this what you are looking for ? https://issues.apache.org/jira/browse/SOLR-792 *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Wed, Aug 3, 2011 at 10:16, Isha Garg isha.g...@orkash.com wrote: Hi All! Can anyone tell which patch should I apply to solr 3.3 to enable pivot faceting in it. Thanks in advance! Isha garg
Re: Solr Incremental Indexing
There could be multiple ways of getting this done, and the exact one depends a lot on factors like: what system are you using? How quickly does a change have to be reflected in the index? How is the indexing/replication done? Usually, in cases where the tolerance is about 6 hours (i.e. your DB change may not be reflected in the Solr index for as long as 6 hours), you can set up a cron job to be triggered every 6 hours. It will pick up all the changes made in that interval, update the index and commit. In cases with a more real-time requirement, there could be a trigger in the application (and not at the DB level), which would fork a process to update Solr about the change by means of a delayed task. If using this approach, it is suggested to use autocommit every N documents; N could be anything, depending on your app. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Sun, Jul 31, 2011 at 02:32, Alexei Martchenko ale...@superdownloads.com.br wrote: I always have a field in my databases called datelastmodified, so whenever I update that record, I set it to getdate() (an MSSQL function) and then get all the latest records ordered by that field. 2011/7/29 Mohammed Lateef Hussain mohammedlateefh...@gmail.com Hi, I need some help with a Solr incremental indexing approach. I have built my Solr index using the SolrJ API and now want to update the index whenever any changes have been made in the database. My requirement is not to use DB triggers to call any update events. I want to update my index on the fly whenever my application updates any record in the database. Note: my indexing logic to get the required data from the DB is somewhat complex and involves many tables. Please suggest how I can proceed here. Thanks Lateef -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
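The cron-based variant boils down to two steps: pick the rows modified since the previous run, then send them to Solr in batches with a commit at the end. Here is a rough sketch of just the selection and batching logic, assuming each row carries a `modified` timestamp (like the datelastmodified column mentioned above). The row layout and function names are made up, and the actual Solr update call is left out on purpose.

```python
from datetime import datetime


def changed_since(rows, last_run):
    """Rows whose 'modified' timestamp is newer than the previous run."""
    return [row for row in rows if row["modified"] > last_run]


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    rows = [
        {"id": 1, "modified": datetime(2011, 7, 29, 8, 0)},
        {"id": 2, "modified": datetime(2011, 7, 30, 9, 30)},
        {"id": 3, "modified": datetime(2011, 7, 31, 1, 15)},
    ]
    last_run = datetime(2011, 7, 30, 0, 0)
    for batch in batches(changed_since(rows, last_run), size=100):
        # Placeholder: send `batch` to Solr here, then commit once all batches are sent.
        print([row["id"] for row in batch])
```

The script would store the start time of the current run somewhere durable and use it as `last_run` next time, so that rows changed while the job is running are not missed.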
Re: Index
Every indexed document has to have a unique ID associated with it. You may do a search by ID, something like http://localhost:/solr/select?q=id:X If you see a result, then the document has been indexed and is searchable. You might also want to check Luke (http://code.google.com/p/luke) to gain more insight about the index. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Fri, Jul 29, 2011 at 03:40, GAURAV PAREEK gauravpareek2...@gmail.com wrote: Yes Nick, you are correct: how can I check whether it has been indexed by Solr and is searchable? On Fri, Jul 29, 2011 at 3:27 AM, Nicholas Chase nch...@earthlink.net wrote: Do you mean, how can you check whether it has been indexed by solr, and is searchable? Nick On 7/28/2011 5:45 PM, GAURAV PAREEK wrote: Hi All, How can we check whether a particular file is not indexed in Solr? Regards, Gaurav
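If the check has to be scripted, the lookup URL can be built with proper URL encoding. A small sketch, assuming the unique key field is called `id` as in the example above; the base URL is a parameter because host, port and core path depend on the installation, and `id_lookup_url` is just an illustrative name.

```python
from urllib.parse import urlencode


def id_lookup_url(solr_base_url, doc_id, id_field="id"):
    """Build a /select URL that searches for a single document by its unique key."""
    query = urlencode({"q": f"{id_field}:{doc_id}", "rows": 1})
    return f"{solr_base_url.rstrip('/')}/select?{query}"


if __name__ == "__main__":
    # "http://solr.example/solr" is a placeholder, not a real endpoint.
    print(id_lookup_url("http://solr.example/solr", "X"))
```

Fetching that URL and looking at numFound in the response tells whether the document is searchable; an ID containing query-syntax characters would need escaping or quoting first.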
Re: Dealing with keyword stuffing
Cool. So I used SweetSpotSimilarity with default params and I see some improvements. However, I could still see some of the 'stuffed' documents coming up in the results. I feel that SweetSpotSimilarity alone is not enough. Going through http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf I figure that there are other things, pivoted length normalization and term frequency normalization, that need fine tuning too. Should I create a custom Similarity class that overrides all the default behavior? I guess that should help me get more relevant results. Where should I start with it? Please do not assume I know the less obvious things, I am still learning !! :-) *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Thu, Jul 28, 2011 at 17:03, Gora Mohanty g...@mimirtech.com wrote: On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash pra...@gmail.com wrote: [...] I am not sure how to use SweetSpotSimilarity. I am googling on this, but any useful insights are so much appreciated. Replace the existing DefaultSimilarity class in schema.xml (look towards the bottom of the file) with the SweetSpotSimilarity class, e.g., have a line like: <similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/> Regards, Gora
Re: Dealing with keyword stuffing
On Thu, Jul 28, 2011 at 08:31, Chris Hostetter hossman_luc...@fucit.org wrote: : Presumably, they are doing this by increasing tf (term frequency), : i.e., by repeating keywords multiple times. If so, you can use a custom : similarity class that caps term frequency, and/or ensures that the scoring : increases less than linearly with tf. Please see In some cases, yes, they are repeating keywords multiple times, stuffing different combinations - Solr, Solr Lucene, Solr Search, Solr Apache, Solr Guide. in particular, using something like SweetSpotSimilarity tuned to know what values make sense for good content in your domain can be useful because it can actually penalize documents that are too short/long or have term freqs that are outside of a reasonable expected range. I am not a Solr expert, but I was thinking in this direction. The ratio of tokens/total_length would be nearer to 1 for a stuffed document, while it would be nearer to 0 for a bogus document. Somewhere between the two lie the documents that are more likely to be meaningful. I am not sure how to use SweetSpotSimilarity. I am googling on this, but any useful insights are so much appreciated.
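To make the "cap term frequency" idea concrete: Lucene's DefaultSimilarity uses the square root of the raw term frequency as the tf factor, so repeating a keyword keeps paying off, just more slowly. A capped variant stops rewarding repetition beyond some point. The sketch below only illustrates the arithmetic; the cap value of 3.0 is an arbitrary assumption, and this is not the formula SweetSpotSimilarity itself uses.

```python
import math


def default_tf(freq):
    """tf factor as in Lucene's DefaultSimilarity: square root of the raw count."""
    return math.sqrt(freq)


def capped_tf(freq, cap=3.0):
    """Same curve, but repetitions beyond the cap no longer raise the score."""
    return min(math.sqrt(freq), cap)


if __name__ == "__main__":
    for freq in (1, 4, 9, 100, 400):
        print(freq, default_tf(freq), capped_tf(freq))
```

In a real setup the equivalent logic would live in a Java Similarity subclass overriding tf(), registered in schema.xml in the same way as SweetSpotSimilarity.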
Custom Handler support in Solr-ruby
Hi, I found solr-ruby gem (http://wiki.apache.org/solr/solr-ruby) really inflexible in terms of specifying handler. The Solr::Request::Select class defines handler as select and all other classes inherit from this class. And since the methods in Solr::Connection use one of the classes from Solr::Request, I don't see a direct way to use a custom handler (which I have made for MoreLikeThis). Currently, the approach I am using is to create the query URL, do a CURL, parse the response and return it. Even if I'd to extend the classes, I'd end up making a new Solr::Request::CustomSelect which will be similar to Solr::Request::Select except for the flexibility for the user to provide handler, defaulted by 'select'. Then creating different classes each for DisMax and all, which will be derived from Solr::Request::CustomSelect. Isn't this too much of an overhead? Or am I missing something? Also, where can I file bugs to solr-ruby? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Index Version and Epoch Time?
Hi, I am not sure what is the index number value? It looks like an epoch time, but in my case, this points to one month back. However, i can see documents which were added last week, to be in the index. Even after I did a commit, the index number did not change? Isn't it supposed to change on every commit? If not, is there a way to look into the last index time? Also, this page http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a Replication Dashboard. How is this dashboard invoked? Is there any URL which needs to be called? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Re: Removing duplicate documents from search results
I found the deduplication feature really useful, although I have not yet started to work on it, as there is some other low-hanging fruit I have to pick first. Will share my thoughts soon. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny 2011/6/28 François Schiettecatte fschietteca...@gmail.com Maybe there is a way to get Solr to reject documents that already exist in the index, but I doubt it; maybe someone else can chime in here. You could do a search for each document prior to indexing it to see if it is already in the index, but that is probably non-optimal. Maybe it is easiest to check if the document exists in your Riak repository: if not, add it and index it, and drop it if it already exists. François On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote: I am making the hash from the URL, but I can't use this as the uniqueKey because I am using a UUID as the uniqueKey. Since I am using Solr as the index engine only and Riak (key-value storage) as the storage engine, I don't want to overwrite on duplicates. I just need to discard the duplicates. 2011/6/28 François Schiettecatte fschietteca...@gmail.com Create a hash from the url and use that as the unique key, md5 or sha1 would probably be good enough. Cheers François On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote: I also have the problem of duplicate docs. I am indexing news articles. Every news article has a source URL. If two news articles have the same URL, only one needs to be indexed, so duplicates should be removed at index time. On 23 June 2011 21:24, simon mtnes...@gmail.com wrote: have you checked out the deduplication process that's available at indexing time ? This includes a fuzzy hash algorithm . http://wiki.apache.org/solr/Deduplication -Simon On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash pra...@gmail.com wrote: This approach would definitely work if the two documents are *Exactly* the same. But this is very fragile.
Even if one extra space has been added, the whole hash would change. What I am really looking for is some %age similarity between documents, and remove those documents which are more than 95% similar. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Thu, Jun 23, 2011 at 15:16, Omri Cohen o...@yotpo.com wrote: What you need to do is calculate some hash (using any message digest algorithm you want, md5, sha-1 and so on), then do some reading on Solr's field collapsing capabilities. It should not be too complicated. *Omri Cohen* Co-founder @ yotpo.com | o...@yotpo.com -- Forwarded message -- From: Pranav Prakash pra...@gmail.com Date: Thu, Jun 23, 2011 at 12:26 PM Subject: Removing duplicate documents from search results To: solr-user@lucene.apache.org How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost similar (people submitting same stuff multiple times, sometimes different people submitting same stuff).
Now when a search is performed for a keyword, in the top N results, quite frequently, the same document comes up multiple times. I want to remove those duplicate (or possible duplicate) documents. Very similar to what Google does when they say "In order to show you most relevant result, duplicates have been removed." How can I achieve this functionality using Solr? Does Solr have a built-in feature or plugin which could help me with it? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny -- Thanks and Regards Mohammad Shariq -- Thanks and Regards Mohammad Shariq
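For the "same source URL means same article" case discussed in this thread, the hash suggested by François can be computed on the client and used to discard duplicates before anything is sent to Solr or Riak. A minimal sketch, assuming each article is a dict with a `url` key; that layout and the function names are illustrative only.

```python
import hashlib


def url_signature(url):
    """MD5 hex digest of the URL, with surrounding whitespace removed."""
    return hashlib.md5(url.strip().encode("utf-8")).hexdigest()


def drop_duplicate_urls(articles):
    """Keep only the first article seen for each URL signature."""
    seen = set()
    unique = []
    for article in articles:
        signature = url_signature(article["url"])
        if signature in seen:
            continue  # same URL already accepted, discard this one
        seen.add(signature)
        unique.append(article)
    return unique


if __name__ == "__main__":
    articles = [
        {"url": "http://example.com/news/1", "title": "First"},
        {"url": "http://example.com/news/1 ", "title": "First again"},
        {"url": "http://example.com/news/2", "title": "Second"},
    ]
    print([a["title"] for a in drop_duplicate_urls(articles)])
```

The `seen` set only covers one run. Across runs the signature would have to be looked up in the existing store (for example as an extra indexed field), which is the per-document check François describes.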
Re: Index Version and Epoch Time?
Hi, I am facing multiple issues with Solr and I am not sure what happens in each case. I am quite new to Solr and there are some scenarios I'd like to discuss with you. We have a huge volume of documents to be indexed, somewhere around 5 million. We have a full indexer script which essentially picks up all the documents from the database and updates them into Solr, and an incremental script which adds new documents to Solr. The relevant areas of my config file look like this:

<unlockOnStartup>false</unlockOnStartup>

<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- Keep only optimized commit points -->
  <str name="keepOptimizedOnly">false</str>
  <!-- The maximum number of commit points to be kept -->
  <str name="maxCommitsToKeep">1</str>
</deletionPolicy>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10</maxDocs>
  </autoCommit>
</updateHandler>

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://hostname:port/solr/core0/replication</str>
  </lst>
</requestHandler>

Sometimes the full indexer script breaks while adding documents to Solr. The script adds the documents and then commits the operation. So, when the script breaks, we have a lot of data which has been updated but not committed. Next, the incremental index script executes, figures out all the new entries and adds them to Solr. It works successfully and commits the operation. - Will the commit by the incremental indexer script also commit the previously uncommitted changes made by the full indexer script before it broke?
Sometimes, during execution, Solr's average response time (avg response time for the last 10 requests, read from the log file) goes as high as 9000ms (I am still unclear why; any ideas how to start hunting for the problem?), so the watchdog process restarts Solr (because it causes requests to pile up in the queue at the application server, which causes the app server to crash). On my local environment, I performed the same experiment by adding docs to Solr, killing the process and restarting it. I found that the uncommitted changes were applied and searchable. However, the updates were uncommitted. Could you explain how this happens, or is there a configuration that can be adjusted for this? Also, what would the index state be if, after restarting Solr, a commit is applied or a commit is not applied? I'd be happy to provide any other information that might be needed. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Tue, Jun 28, 2011 at 20:55, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Tue, Jun 28, 2011 at 4:18 PM, Pranav Prakash pra...@gmail.com wrote: I am not sure what is the index number value? It looks like an epoch time, but in my case, this points to one month back. However, i can see documents which were added last week, to be in the index. The index version shown on the dashboard is the time at which the most recent index segment was created. I'm not sure why it has a value older than a month if a commit has happened after that time. Even after I did a commit, the index number did not change? Isn't it supposed to change on every commit? If not, is there a way to look into the last index time? Yeah, it changes after every commit which added/deleted a document. Also, this page http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a Replication Dashboard. How is this dashboard invoked? Is there any URL which needs to be called?
If you have configured replication correctly, the admin dashboard should show a Replication link right next to the Schema Browser link. The path should be /admin/replication/index.jsp -- Regards, Shalin Shekhar Mangar.
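To check what date an index version corresponds to, it can be converted like any other epoch value. This sketch assumes the version is expressed in milliseconds since the Unix epoch, which is what the 13-digit values on the dashboard suggest; if that assumption is wrong for a given index, the result will be meaningless. The function name is illustrative.

```python
from datetime import datetime, timezone


def index_version_to_utc(index_version_ms):
    """Interpret an index version as milliseconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(index_version_ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Paste the index version shown on the replication dashboard here.
    print(index_version_to_utc(1309219200000).isoformat())
```

Per Shalin's description, the value reflects when the most recent segment was created, so it is best read as a rough marker rather than an exact "last indexed" time.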
Re: how to index data in solr form database automatically
Cron is a time-based job scheduler in Unix-like computer operating systems. en.wikipedia.org/wiki/Cron *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Fri, Jun 24, 2011 at 12:26, Romi romijain3...@gmail.com wrote: Yeah i am using data-import to get data from database and indexing it. but what is cron can you please provide a link for it - Thanks Regards Romi -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-index-data-in-solr-form-database-automatically-tp3102893p3103072.html Sent from the Solr - User mailing list archive at Nabble.com.
Removing duplicate documents from search results
How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost identical (people submitting the same stuff multiple times, sometimes different people submitting the same stuff). Now when a search is performed for a keyword, in the top N results, quite frequently, the same document comes up multiple times. I want to remove those duplicate (or possible duplicate) documents. Very similar to what Google does when they say "In order to show you most relevant result, duplicates have been removed." How can I achieve this functionality using Solr? Does Solr have a built-in feature or plugin which could help me with it? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Re: Removing duplicate documents from search results
This approach would definitely work if the two documents are *Exactly* the same. But this is very fragile. Even if one extra space has been added, the whole hash would change. What I am really looking for is some %age similarity between documents, and remove those documents which are more than 95% similar. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Thu, Jun 23, 2011 at 15:16, Omri Cohen o...@yotpo.com wrote: What you need to do is calculate some hash (using any message digest algorithm you want, md5, sha-1 and so on), then do some reading on Solr's field collapsing capabilities. It should not be too complicated. *Omri Cohen* Co-founder @ yotpo.com | o...@yotpo.com -- Forwarded message -- From: Pranav Prakash pra...@gmail.com Date: Thu, Jun 23, 2011 at 12:26 PM Subject: Removing duplicate documents from search results To: solr-user@lucene.apache.org How can I remove very similar documents from search results?
My scenario is that there are documents in the index which are almost identical (people submitting the same stuff multiple times, sometimes different people submitting the same stuff). Now when a search is performed for a keyword, in the top N results, quite frequently, the same document comes up multiple times. I want to remove those duplicate (or possible duplicate) documents. Very similar to what Google does when they say "In order to show you most relevant result, duplicates have been removed." How can I achieve this functionality using Solr? Does Solr have a built-in feature or plugin which could help me with it? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
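The "more than 95% similar" requirement can be approximated with word shingles and Jaccard similarity, which, unlike an exact hash, is not thrown off by extra whitespace or a few changed words. This is a generic sketch of that technique, not what Solr's deduplication component does internally; the shingle size of 3 and the 0.95 threshold are assumptions to tune.

```python
def shingles(text, k=3):
    """Set of k-word windows over the lowercased, whitespace-normalized text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.95):
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    spaced = "the quick  brown fox jumps over the lazy dog"
    print(similarity(original, spaced), is_near_duplicate(original, spaced))
```

Comparing every pair of documents this way does not scale to a large index. The fuzzy signature mentioned on the Deduplication wiki page is the index-time alternative worth evaluating first.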
Questions about Solr MLTHandler, performance, Indexes
Hi folks,

I am new to Solr and am using it for a web application. I have been experimenting with it and have a few questions that I was unable to resolve by searching Google. Our portal allows users to upload content, and the fields we use are title, description, transcript and tags. Each piece of content also has counters (hits, downloads, favorites) and an automatically calculated rating. We have a master/slave configuration (1 master, 2 slaves).

Solr version: 1.4.0
Java version: 1.6.0_16, Java(TM) SE Runtime Environment (build 1.6.0_16-b01), Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
Hardware: 32 GiB RAM and 8 cores
Index size: ~100 GiB

One of my use cases is to find related documents given a document ID. I have been using MoreLikeThisHandler to generate related documents, using a DisMax query. Now I have to filter certain content out of the results Solr gives me. So if, for a document ID X, Solr returns a list of 20 related documents, I want to apply a filter so that these 20 documents do not contain blacklisted words. This is fairly straightforward in a direct query using the NOT operator. How can similar behavior be implemented in MoreLikeThisHandler?

Every week we perform a full index of all the documents, plus a nightly incremental indexing. This is done by a script which reads data from MySQL and sends the updates to Solr. Sometimes the script fails after updating 60% of the documents, and no commit has been performed at that stage. When the next cron job runs, it adds some more documents and commits them. Will this commit include the last uncommitted updates as well as the current ones? Are those uncommitted changes (which are stored in a temp file) deleted after some time? Is there a way to clean up uncommitted changes?

Of late, Solr has started to perform slowly. When Solr is started it is quick and responds to requests in ~100 ms. Gradually (very gradually) it reaches a point where the average response time of the last 10 queries goes beyond 5000 ms, and that is when requests start to pile up. As I am composing this mail, an optimize command is being executed, which I hope will help, but to what extent I will need to see.

Finally, what happens if the schemas of the master and the slave are different (there is a field in the master which does not exist in the slave)? I thought that replication would show me some kind of error, but it completed successfully.

Thanks,
Pranav
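For the first question, one option that may apply is to pass the blacklist as negative filter queries (`fq`), on the understanding that MoreLikeThisHandler honours the common filtering parameters; that should be checked against the 1.4.0 installation rather than taken from this sketch. The Python below only builds the request URL. The handler path `/mlt`, the base URL, the `mlt.fl` field list and the catch-all field name `text` are assumptions for illustration, not values taken from the message.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, blacklist, field="text", rows=20):
    """Build a MoreLikeThis request URL with one negative filter per word.

    Assumptions (not from the original message): the handler is registered
    at /mlt, the unique key field is `id`, and `field` is a searchable
    field that covers the text to be screened.
    """
    params = [
        ("q", f"id:{doc_id}"),
        ("mlt.fl", "title,description,transcript,tags"),
        ("rows", str(rows)),
    ]
    for word in blacklist:
        # Escape embedded double quotes so the phrase stays well-formed.
        safe = word.replace('"', '\\"')
        # Each fq is a pure negative query: exclude documents containing the word.
        params.append(("fq", f'-{field}:"{safe}"'))
    return f"{base_url}/mlt?{urlencode(params)}"


if __name__ == "__main__":
    print(build_mlt_url("http://localhost:8983/solr", "X", ["spam", "casino"]))
```

Using a separate `fq` per blacklisted word keeps each filter independently cacheable, at the cost of a longer URL when the blacklist is large.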