from:"Pranav Prakash"

Re: DIH import from MySQL results in garbage text for special chars

2012-09-27 Thread Pranav Prakash

The output of SHOW VARIABLES is below. I have also compared the hex values
of the text, and they differ between MySQL and Solr.

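+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+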
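
One plausible reading of these settings (an assumption, not something confirmed in this thread) is that UTF-8 bytes are being interpreted as latin1 somewhere between MySQL and Solr. MySQL's latin1 is essentially Windows-1252, so the sketch below uses Python's cp1252 codec to reproduce the garbage text from the original report. The function name is made up for the illustration.

```python
# Sketch: reproduce the mojibake by decoding UTF-8 bytes as Windows-1252.
# Assumption: MySQL's "latin1" behaves like Python's cp1252 codec here.

def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    # errors="replace" yields U+FFFD for byte 0x9D, which cp1252 leaves undefined.
    return text.encode("utf-8").decode("cp1252", errors="replace")

garbled_open = misread_utf8_as_cp1252("\u201c")   # left double quotation mark
garbled_close = misread_utf8_as_cp1252("\u201d")  # right double quotation mark

# ascii() keeps the output readable on terminals with any encoding.
print(ascii(garbled_open), ascii(garbled_close))
```

The opening quote comes out as three visible characters and the closing quote ends in a replacement character, which is the same shape as the text stored in the index.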



*Pranav Prakash*

temet nosce



On Wed, Sep 26, 2012 at 6:45 PM, Gora Mohanty g...@mimirtech.com wrote:

 On 21 September 2012 11:19, Pranav Prakash pra...@gmail.com wrote:

  I am seeing the garbage text in browser, Luke Index Toolbox and
 everywhere
  it is the same. My servlet container is Jetty which is the out-of-box
 one.
  Many other special chars are getting indexed and stored properly, only
 few
  characters causes pain.
 

 Could you double-check the encoding on the mysql side?
 What is the output of

 mysql> SHOW VARIABLES LIKE 'character\_set\_%';

 Regards,
 Gora

Re: DIH import from MySQL results in garbage text for special chars

2012-09-26 Thread Pranav Prakash

I looked at the hex codes of the text. The hex code in MySQL is different
from the one stored in the index.

The hex code in the index is longer than the hex code in MySQL, which leads
me to believe that something in between is mangling the text.
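
A longer hex string in the index is what double encoding would produce. The sketch below is an illustration under the assumption that the UTF-8 bytes were misread as cp1252 and then encoded as UTF-8 a second time; it is not taken from the actual data.

```python
# Sketch: compare the bytes of a correctly encoded quote with a double-encoded one.
original = "\u201c"                      # left double quotation mark
utf8_hex = original.encode("utf-8").hex()

# Assumed failure mode: UTF-8 bytes misread as cp1252, then encoded as UTF-8 again.
double_encoded = original.encode("utf-8").decode("cp1252").encode("utf-8")
double_hex = double_encoded.hex()

# Undoing the assumed misread recovers the original character.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")

print(utf8_hex, double_hex)
```

Under that assumption the three bytes grow to seven: c3 a2 e2 82 ac c5 93. Written in octal, that is the sequence \303\242\342\202\254\305\223 that one of the patterns in the quoted scriptTransformer targets.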

*Pranav Prakash*

temet nosce



On Fri, Sep 21, 2012 at 11:19 AM, Pranav Prakash pra...@gmail.com wrote:

 I am seeing the garbage text in browser, Luke Index Toolbox and everywhere
 it is the same. My servlet container is Jetty which is the out-of-box one.
 Many other special chars are getting indexed and stored properly, only few
 characters causes pain.

 *Pranav Prakash*

 temet nosce




 On Fri, Sep 14, 2012 at 6:36 PM, Erick Erickson erickerick...@gmail.com wrote:

 Is your _browser_ set to handle the appropriate character set? Or whatever
 you're using to inspect your data? How about your servlet container?



 Best
 Erick

 On Mon, Sep 10, 2012 at 7:47 AM, Pranav Prakash pra...@gmail.com wrote:
  Hi Folks,
 
  I am attempting to import documents to Solr from MySQL using DIH. One of
  the fields contains the text “Future of Mobile Value Added Services (VAS)
  in Australia”. Notice the characters “ and ”.
 
  When I am importing, it gets stored as - â€œFuture of Mobile Value Added
  Services (VAS) in Australiaâ€�.
 
  The datasource config clearly mentions use of UTF-8 as follows:
 
  <dataSource type="JdbcDataSource"
    driver="com.mysql.jdbc.Driver"
    url="jdbc:mysql://localhost/ohapp_devel"
    user="username"
    useUnicode="true"
    characterEncoding="UTF-8"
    password="password"
    zeroDateTimeBehavior="convertToNull"
    name="app" />
 
 
  A plain SQL SELECT statement on the MySQL console gives the correct text. I
  even tried the following ScriptTransformer to get rid of these characters,
  but it was of no particular use in my case.
 
  // Replace every match of pattern in source with replacement.
  function gsub(source, pattern, replacement) {
    var match, result;
    if (!((pattern != null) && (replacement != null))) {
      return source;
    }
    result = '';
    while (source.length > 0) {
      if ((match = source.match(pattern))) {
        result += source.slice(0, match.index);
        result += replacement;
        source = source.slice(match.index + match[0].length);
      } else {
        result += source;
        source = '';
      }
    }
    return result;
  }
 
  // Map curly quotes, dashes, ellipses and their garbled forms to plain ASCII.
  function fixQuotes(c) {
    c = gsub(c, /\342\200(?:\234|\235)/, '');
    c = gsub(c, /\342\200(?:\230|\231)/, "'");
    c = gsub(c, /\342\200\223/, "-");
    c = gsub(c, /\342\200\246/, "...");
    c = gsub(c, /\303\242\342\202\254\342\204\242/, "'");
    c = gsub(c, /\303\242\342\202\254\302\235/, '');
    c = gsub(c, /\303\242\342\202\254\305\223/, '');
    c = gsub(c, /\303\242\342\202\254/, '-');
    c = gsub(c, /\342\202\254\313\234/, '');
    c = gsub(c, /“/, '');
    return c;
  }
 
  // DIH entry point: clean the listed fields of each row.
  function cleanFields(row) {
    var fieldsToClean = ['title', 'description'];
    for (var i = 0; i < fieldsToClean.length; i++) {
      var old_text = String(row.get(fieldsToClean[i]));
      row.put(fieldsToClean[i], fixQuotes(old_text));
    }
    return row;
  }
 
  My understanding is that this must be a very common problem. It also
  occurs with human names that contain these characters. What is an
  appropriate way to get the correct text indexed and searchable? The
  fieldType where this is stored is as follows:
 
  <fieldType name="text_commongrams" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory" />
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      <filter class="solr.TrimFilterFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.SnowballPorterFilterFactory"
        language="English"
        protected="protwords.txt" />
      <filter class="solr.SynonymFilterFactory"
        synonyms="synonyms.txt"
        ignoreCase="true"
        expand="true" />
      <filter class="solr.CommonGramsFilterFactory"
        words="stopwords_en.txt"
        ignoreCase="true" />
      <filter class="solr.StopFilterFactory"
        words="stopwords_en.txt"
        ignoreCase="true" />
      <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="0"
        preserveOriginal="1" />
    </analyzer>
  </fieldType>
 
 
  *Pranav Prakash*
 
  temet nosce

Re: DIH import from MySQL results in garbage text for special chars

2012-09-20 Thread Pranav Prakash

I am seeing the garbage text in the browser, in the Luke Index Toolbox, and
everywhere else. My servlet container is the out-of-the-box Jetty. Many
other special characters are indexed and stored properly; only a few
characters cause trouble.

*Pranav Prakash*

temet nosce



On Fri, Sep 14, 2012 at 6:36 PM, Erick Erickson erickerick...@gmail.com wrote:

 Is your _browser_ set to handle the appropriate character set? Or whatever
 you're using to inspect your data? How about your servlet container?



 Best
 Erick


Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0

2012-09-10 Thread Pranav Prakash

I am experiencing a similar problem related to encoding. In my case, a
character like " (double quote) is also garbled.

I believe this is because the encoding of my MySQL table is latin1 while
the JDBC connection specifies UTF-8. Is there a way to specify the latin1
charset in JDBC? That would probably resolve this.


*Pranav Prakash*

temet nosce



On Sat, Sep 8, 2012 at 3:16 AM, Shawn Heisey s...@elyograg.org wrote:

 On 9/6/2012 6:54 PM, kiran chitturi wrote:

 The error I am getting is 'org.apache.solr.common.SolrException:
 Invalid Date String: '1345743552'.

   I think it was being saved as a string in the DB, so I will use the
 DateFormatTransformer.


 To go along with all the other replies that you have gotten:  I import
 from MySQL with a unix format date field.  It's a bigint, not a string, but
 a quick test on MySQL 5.1 shows that the function works with strings too.
  This is how my SELECT handles that field - I have MySQL convert it before
 it gets to Solr:

 from_unixtime(`d`.`post_date`) AS `pd`

 When it comes to the character set issues, this is how I have defined the
 driver in the dataimport config.  The character set in the database is utf8.

   <dataSource type="JdbcDataSource"
     driver="com.mysql.jdbc.Driver"
     encoding="UTF-8"
     url="jdbc:mysql://${dataimporter.request.dbHost}:3306/${dataimporter.request.dbSchema}?zeroDateTimeBehavior=convertToNull"
     batchSize="-1"
     user="removed"
     password="removed"/>

 Thanks,
 Shawn
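
The two date shapes discussed in this thread can also be normalized before they reach Solr. The sketch below is a client-side illustration in Python, not part of DIH. It converts a Unix timestamp and an RFC 2822 style date into the ISO 8601 UTC form that Solr date fields expect. The function names are invented for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Solr date fields use this UTC form, e.g. 2012-09-06T22:32:33Z.
SOLR_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def unix_to_solr(value) -> str:
    """Convert a Unix timestamp (int or numeric string) to a Solr date string."""
    moment = datetime.fromtimestamp(int(value), tz=timezone.utc)
    return moment.strftime(SOLR_DATE_FORMAT)

def rfc2822_to_solr(value: str) -> str:
    """Convert a date like 'Thu, 06 Sep 2012 22:32:33 +0000' to a Solr date string."""
    moment = parsedate_to_datetime(value).astimezone(timezone.utc)
    return moment.strftime(SOLR_DATE_FORMAT)

print(unix_to_solr("1345743552"))
print(rfc2822_to_solr("Thu, 06 Sep 2012 22:32:33 +0000"))
```

Doing the conversion in the SELECT, as with from_unixtime above, keeps the logic in one place; a helper like this only makes sense when the data passes through a client script first.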

Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0

2012-09-10 Thread Pranav Prakash

The character is actually “ and not ".


*Pranav Prakash*

temet nosce




Exact match on few fields, fuzzy on others

2012-08-01 Thread Pranav Prakash

Hi Folks,

I am using Solr 3.4, and my document schema has the attributes title,
transcript, and author_name. Presently, I am using DisMax to search for a user
query across transcript. I would also like to do an exact match on
author_name, so that for the query "Albert Einstein" I get all the documents
that contain Albert or Einstein in transcript, as well as those documents
whose author_name is exactly 'Albert Einstein'.

Can we do this with the DisMax query parser? The schema for both fields is
below:

<fieldType name="text_commongrams" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English"
      protected="protwords.txt" />
    <filter class="solr.SynonymFilterFactory"
      synonyms="synonyms.txt"
      ignoreCase="true"
      expand="true" />
    <filter class="solr.CommonGramsFilterFactory"
      words="stopwords_en.txt"
      ignoreCase="true" />
    <filter class="solr.StopFilterFactory"
      words="stopwords_en.txt"
      ignoreCase="true" />
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="1"
      catenateNumbers="1"
      catenateAll="0"
      preserveOriginal="1" />
  </analyzer>
</fieldType>
<fieldType name="text_standard" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory"
      words="stopwords_en.txt"
      ignoreCase="true" />
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="1"
      catenateNumbers="1"
      catenateAll="0"
      preserveOriginal="1" />
    <filter class="solr.SynonymFilterFactory"
      synonyms="synonyms.txt"
      ignoreCase="true"
      expand="false" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English"
      protected="protwords.txt" />
  </analyzer>
</fieldType>

<field name="title" type="text_commongrams" indexed="true"
  stored="true" multiValued="false" />
<field name="author_name" type="text_standard" indexed="true"
  stored="false" />


--
*Pranav Prakash*

temet nosce

Re: DIH XML configs for multi environment

2012-07-25 Thread Pranav Prakash

Jerry,

Glad it worked for you. I will do the same thing. This seems easier
for me, as I have a Solr start shell script that sets the JVM params for
master/slave, Xmx, and so on according to the environment. Setting a JDBC
connect URL in the start script is more convenient than changing the configs.
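
As a rough illustration of that idea, the sketch below builds the JVM option for each environment. It is written in Python rather than shell, and the property name `solr.jdbc.url`, the host names, and the database names are placeholders invented for the example; the thread does not give the real ones.

```python
# Sketch: choose a JDBC URL per environment and expose it as a JVM system property.
# All names below (property name, hosts, databases) are hypothetical placeholders.
JDBC_URLS = {
    "dev": "jdbc:mysql://localhost/app_dev",
    "stag": "jdbc:mysql://stag-db.example.com/app_stag",
    "prod": "jdbc:mysql://prod-db.example.com/app_prod",
}

def jvm_option(env: str, prop: str = "solr.jdbc.url") -> str:
    """Return the -D option that a start script would pass to the JVM."""
    if env not in JDBC_URLS:
        raise ValueError(f"unknown environment: {env}")
    return f"-D{prop}={JDBC_URLS[env]}"

def start_command(env: str) -> list:
    """Assemble a java command line; start.jar stands in for the Jetty launcher."""
    return ["java", jvm_option(env), "-jar", "start.jar"]

print(" ".join(start_command("dev")))
```

The data config would then refer to the property by name in the dataSource url attribute, which is the arrangement Jerry reports working.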

*Pranav Prakash*

temet nosce



On Tue, Jul 24, 2012 at 1:17 AM, jerry.min...@gmail.com 
jerry.min...@gmail.com wrote:

 Pranav,

 Sorry, I should have checked my response a little better as I
 misspelled your name and, mentioned that I tried what Marcus suggested
 then described something totally different.
 I didn't try using the property mechanism as Marcus suggested as I am
 not using a solr.xml file.

 What you mentioned in your post on Wed, Jul 18, 2012 at 3:46 PM will
 work, as I have done it successfully.
 That is, I created a JVM variable to contain the connect URL for each
 of my environments and used it to set the url parameter of the
 dataSource element in my data config files.

 Best,
 Jerry


 On Mon, Jul 23, 2012 at 3:34 PM, jerry.min...@gmail.com
 jerry.min...@gmail.com wrote:
  Pranay,
 
  I tried two similar approaches to resolve this in my system which is
  Solr 4.0 running in Tomcat 7.x on Ubuntu 9.10.
 
  My preference was to use an alias for each of my database environments
  as a JVM parameter because it makes more sense to me that the database
  connection be stored in the data config file rather than in a Tomcat
  configuration or startup file.
  Because of preference, I first attempted the following:
  1. Set a JVM environment variable 'solr.dbEnv' to represent the
  database environment that should be accessed. For example, in my dev
  environment, the JVM environment variable was set as -Dsolr.dbEnv=dev.
  2. In the data config file I had 3 data sources. Each data source had
  a name that matched one of the database environment aliases.
  3. In the entity of my data config file, the dataSource parameter was set
  as follows: dataSource="${solr.dbEnv}".
 
  Unfortunately, this fails to work. Setting dataSource parameter in
  the data config file does not override the default. The default
  appears to be the first data source defined in the data config file.
 
  Second, I tried what Marcus suggested.
 
  That is, I created a JVM variable to contain the connect URLs for each
  of my environments.
  I use that variable to set the URL parameter of the dataSource entity
  in the data config file.
 
  This works well.
 
 
  Best,
  Jerry Mindek
 
  Unfortunately, the first option did not work. It seemed as though
  On Wed, Jul 18, 2012 at 3:46 PM, Pranav Prakash pra...@gmail.com
 wrote:
  That approach would work for core dependent parameters. In my case, the
  params are environment dependent. I think a simpler approach would be to
  pass the url param as JVM options, and these XMLs get it from there.
 
  I haven't tried it yet.
 
  *Pranav Prakash*
 
  temet nosce
 
 
 
  On Tue, Jul 17, 2012 at 5:09 PM, Markus Klose m...@shi-gmbh.com wrote:
 
  Hi
 
  There is one more approach using the property mechanism.
 
  You could specify the datasource like this:
  <dataSource name="database" driver="${sqlDriver}" url="${sqlURL}"/>
 
   And you can specify the properties in the solr.xml in your core
  configuration like this:
 
  <core instanceDir="core1" name="core1">
  <property name="sqlURL" value="jdbc:hsqldb:/temp/example/ex"/>
  
  </core>
 
 
  Best regards from Augsburg
 
  Markus Klose
  SHI Elektronische Medien GmbH
 
 
  Address: Curt-Frenzel-Str. 12, 86167 Augsburg
 
  Tel.:   0821 7482633 26
  Tel.:   0821 7482633 0 (switchboard)
  Mobile: 0176 56516869
  Fax:   0821 7482633 29
 
  E-Mail: markus.kl...@shi-gmbh.com
  Internet: http://www.shi-gmbh.com
 
  Commercial register: Augsburg HRB 17382
  Managing director: Peter Spiske
  VAT ID: DE 182167335
 
 
  -Original Message-
  From: Rahul Warawdekar [mailto:rahul.warawde...@gmail.com]
  Sent: Wednesday, July 11, 2012 11:21
  To: solr-user@lucene.apache.org
  Subject: Re: DIH XML configs for multi environment
 
  http://wiki.eclipse.org/Jetty/Howto/Configure_JNDI_Datasource
  http://docs.codehaus.org/display/JETTY/DataSource+Examples
 
 
  On Wed, Jul 11, 2012 at 2:30 PM, Pranav Prakash pra...@gmail.com
 wrote:
 
   That's cool. Is there something similar for Jetty as well? We use
 Jetty!
  
   *Pranav Prakash*
  
   temet nosce
  
  
  
   On Wed, Jul 11, 2012 at 1:49 PM, Rahul Warawdekar 
   rahul.warawde...@gmail.com wrote:
  
Hi Pranav,
   
If you are using Tomcat to host Solr, you can define your data
source in context.xml file under tomcat configuration.
You have to refer to this datasource with the same name in all the
 3
environments from DIH data-config.xml.
This context.xml file will vary across 3 environments having
different credentials for dev, stag and prod.
   
eg
DIH data-config.xml will refer to the datasource as listed below
dataSource jndiName=java:comp/env

Re: can solr admin tab statistics be customized... how can this be achived.

2012-07-23 Thread Pranav Prakash

You can check out the Solr source code, patch the admin JSP files, and
use the result as your custom Solr instance.


*Pranav Prakash*

temet nosce



On Fri, Jul 20, 2012 at 12:14 PM, yayati yayatirajpa...@gmail.com wrote:



 Hi,

 I want to compute my own stats in addition to Solr's default stats. How can I
 enhance the statistics in Solr? Solr computes stats cumulatively; is there
 any way to get per-instant stats?

 Thanks, waiting for good replies.






How To apply transformation in DIH for multivalued numeric field?

2012-07-18 Thread Pranav Prakash

I have a multivalued integer field and a multivalued string field defined
in my schema as

<field name="community_tag_ids"
  type="integer"
  indexed="true"
  stored="true"
  multiValued="true"
  omitNorms="true" />
<field name="community_tags"
  type="text"
  indexed="true"
  termVectors="true"
  stored="true"
  multiValued="true"
  omitNorms="true" />


The DIH entity and field definitions for these are as follows:

<entity name="document"
  dataSource="app"
  onError="skip"
  transformer="RegexTransformer"
  query="...">

  <entity name="community_tags"
    transformer="RegexTransformer"
    query="SELECT
      group_concat(a.id SEPARATOR ',') AS community_tag_ids,
      group_concat(a.title SEPARATOR ',') AS community_tags
      FROM tags a JOIN tag_dets b ON a.id = b.tag_id
      WHERE b.doc_id = ${document.id}">
    <field column="community_tag_ids" name="community_tag_ids"/>
    <field column="community_tags" splitBy="," />
  </entity>

</entity>

The value of the community_tags field comes through correctly as an array of
strings. However, the value of the community_tag_ids field is wrong:

<arr name="community_tag_ids">
  <int>[B@390c0a18</int>
</arr>

I tried chaining NumberFormatTransformer with formatStyle="number", but that
throws DataImportHandlerException: Failed to apply NumberFormat on column.
Could it be due to NULL values from the database, or because the value is
malformed? How do we handle NULL in this case?
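
To make the intended result concrete, here is a sketch, in Python rather than a DIH transformer, of what splitting the concatenated ids should produce, including the NULL case. The function name is invented for the example. As an aside, `[B@390c0a18` is how Java prints a byte array, which suggests (an assumption, not confirmed in this thread) that the concatenated ids arrived as bytes rather than as a string, so the sketch accepts both.

```python
from typing import List, Optional, Union

def split_tag_ids(value: Optional[Union[str, bytes]], sep: str = ",") -> List[int]:
    """Turn a GROUP_CONCAT-style value such as '3,17,42' into a list of ints."""
    if value is None:             # SQL NULL: the document has no tags
        return []
    if isinstance(value, bytes):  # binary result: decode before splitting
        value = value.decode("utf-8")
    # Skip empty pieces so an empty string yields an empty list.
    return [int(part) for part in value.split(sep) if part.strip()]

print(split_tag_ids("3,17,42"))
print(split_tag_ids(None))
```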


*Pranav Prakash*

temet nosce

Re: DIH XML configs for multi environment

2012-07-18 Thread Pranav Prakash

That approach would work for core-dependent parameters. In my case, the
params are environment-dependent. I think a simpler approach would be to
pass the URL param as a JVM option and have these XMLs read it from there.

I haven't tried it yet.

*Pranav Prakash*

temet nosce



On Tue, Jul 17, 2012 at 5:09 PM, Markus Klose m...@shi-gmbh.com wrote:

 Hi

 There is one more approach using the property mechanism.

 You could specify the datasource like this:
 <dataSource name="database" driver="${sqlDriver}" url="${sqlURL}"/>

  And you can specify the properties in the solr.xml in your core
 configuration like this:

 <core instanceDir="core1" name="core1">
 <property name="sqlURL" value="jdbc:hsqldb:/temp/example/ex"/>
 
 </core>


 Best regards from Augsburg

 Markus Klose
 SHI Elektronische Medien GmbH


 Address: Curt-Frenzel-Str. 12, 86167 Augsburg

 Tel.:   0821 7482633 26
 Tel.:   0821 7482633 0 (switchboard)
 Mobile: 0176 56516869
 Fax:   0821 7482633 29

 E-Mail: markus.kl...@shi-gmbh.com
 Internet: http://www.shi-gmbh.com

 Commercial register: Augsburg HRB 17382
 Managing director: Peter Spiske
 VAT ID: DE 182167335


 -Original Message-
 From: Rahul Warawdekar [mailto:rahul.warawde...@gmail.com]
 Sent: Wednesday, July 11, 2012 11:21
 To: solr-user@lucene.apache.org
 Subject: Re: DIH XML configs for multi environment

 http://wiki.eclipse.org/Jetty/Howto/Configure_JNDI_Datasource
 http://docs.codehaus.org/display/JETTY/DataSource+Examples


 On Wed, Jul 11, 2012 at 2:30 PM, Pranav Prakash pra...@gmail.com wrote:

  That's cool. Is there something similar for Jetty as well? We use Jetty!
 
  *Pranav Prakash*
 
  temet nosce
 
 
 
  On Wed, Jul 11, 2012 at 1:49 PM, Rahul Warawdekar 
  rahul.warawde...@gmail.com wrote:
 
   Hi Pranav,
  
   If you are using Tomcat to host Solr, you can define your data
   source in context.xml file under tomcat configuration.
   You have to refer to this datasource with the same name in all the 3
   environments from DIH data-config.xml.
   This context.xml file will vary across 3 environments having
   different credentials for dev, stag and prod.
  
   eg
   DIH data-config.xml will refer to the datasource as listed below
   <dataSource jndiName="java:comp/env/YOUR_DATASOURCE_NAME"
   type="JdbcDataSource" readOnly="true" />
  
   context.xml file which is located under /TOMCAT_HOME/conf folder
   will have the resource entry as follows
   <Resource name="YOUR_DATASOURCE_NAME" auth="Container"
   type="" username="X" password="X"
   driverClassName=""
   url=""
   maxActive="8"
   />
  
   On Wed, Jul 11, 2012 at 1:31 PM, Pranav Prakash pra...@gmail.com
  wrote:
  
    The DIH XML config file has to specify a dataSource. In my case, and
    possibly in many others, the logon credentials as well as the MySQL
    server paths differ between environments (dev, stag, prod). I don't
    want to end up with three different DIH config files, three different
    handlers, and so on.
   
What is a good way to deal with this?
   
   
*Pranav Prakash*
   
temet nosce
   
  
  
  
   --
   Thanks and Regards
   Rahul A. Warawdekar
  
 



 --
 Thanks and Regards
 Rahul A. Warawdekar

Re: How To apply transformation in DIH for multivalued numeric field?

2012-07-18 Thread Pranav Prakash

I had tried splitBy for the numeric field, but that did not work for me
either. However, once I got rid of group_concat it was all good to go.

Thanks a lot!! I really had a difficult time understanding this behavior.


*Pranav Prakash*

temet nosce



On Thu, Jul 19, 2012 at 1:34 AM, Dyer, James james.d...@ingrambook.com wrote:

 Don't you want to specify splitBy for the integer field too?

 Actually though, you shouldn't need to use GROUP_CONCAT and
 RegexTransformer at all.  DIH is designed to handle one-to-many relations
 between parent and child entities by populating all the child fields as
 multi-valued automatically.  I guess your approach leads to a lot fewer
 rows getting sent from your db to Solr though.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311



DIH XML configs for multi environment

2012-07-11 Thread Pranav Prakash

The DIH XML config file has to specify a dataSource. In my case, and
possibly in many others, the logon credentials as well as the MySQL server
paths differ between environments (dev, stag, prod). I don't want to
end up with three different DIH config files, three different
handlers, and so on.

What is a good way to deal with this?


*Pranav Prakash*

temet nosce

Re: DIH XML configs for multi environment

2012-07-11 Thread Pranav Prakash

That's cool. Is there something similar for Jetty as well? We use Jetty!

*Pranav Prakash*

temet nosce



On Wed, Jul 11, 2012 at 1:49 PM, Rahul Warawdekar 
rahul.warawde...@gmail.com wrote:

 Hi Pranav,

 If you are using Tomcat to host Solr, you can define your data source in
 context.xml file under tomcat configuration.
 You have to refer to this datasource with the same name in all the 3
 environments from DIH data-config.xml.
 This context.xml file will vary across 3 environments having different
 credentials for dev, stag and prod.

 eg
 DIH data-config.xml will refer to the datasource as listed below
 <dataSource jndiName="java:comp/env/YOUR_DATASOURCE_NAME"
 type="JdbcDataSource" readOnly="true" />

 context.xml file which is located under /TOMCAT_HOME/conf folder will
 have the resource entry as follows
   <Resource name="YOUR_DATASOURCE_NAME" auth="Container"
 type="" username="X" password="X"
 driverClassName=""
 url=""
 maxActive="8"
 />

 On Wed, Jul 11, 2012 at 1:31 PM, Pranav Prakash pra...@gmail.com wrote:

  The DIH XML config file has to be specified dataSource. In my case, and
  possibly with many others, the logon credentials as well as mysql server
  paths would differ based on environments (dev, stag, prod). I don't want
 to
  end up coming with three different DIH config files, three different
  handlers and so on.
 
  What is a good way to deal with this?
 
 
  *Pranav Prakash*
 
  temet nosce
 



 --
 Thanks and Regards
 Rahul A. Warawdekar

Top 5 high freq words - UpdateProcessorChain or DIH Script?

2012-07-08 Thread Pranav Prakash

Hi,

I want to store the top 5 high-frequency non-stopword terms. I use DIH to
import data. I see two approaches:

   1. Use DIH JavaScript to find the top 5 most frequent words and put them
   in a copy field. The copy field will then stem them and remove stop words
   based on the appropriate tokenizers.
   2. Write a custom function for the same and add it to the
   UpdateRequestProcessor chain.

Which of the two would be better suited? I find the first approach rather
simple, but the issue is that I won't have access to stop words, synonyms,
etc. at DIH time.

In the second approach, if I add it to the UpdateRequestProcessor chain and
insert the function after StopFilterFactory and
RemoveDuplicatesTokenFilterFactory, would that be a good way of doing this?
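
Whichever place the logic ends up in, the counting step itself is small. The sketch below shows it in Python for clarity, under simplifying assumptions: a regex stands in for the tokenizer, the stop-word list is passed in by the caller (a stand-in for stopwords_en.txt), and no stemming or synonym handling is done. The function name is made up for the example.

```python
import re
from collections import Counter
from typing import Iterable, List

def top_terms(text: str, stopwords: Iterable[str], n: int = 5) -> List[str]:
    """Return the n most frequent lowercase words in text, skipping stop words."""
    stop = {word.lower() for word in stopwords}
    # Crude tokenizer: runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(word for word in words if word not in stop)
    # most_common sorts by count; equal counts keep first-seen order.
    return [word for word, _ in counts.most_common(n)]

print(top_terms("the cat and the dog and the cat", ["the", "and"], n=2))
```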

--
*Pranav Prakash*

temet nosce

Deduplication in MLT

2012-06-12 Thread Pranav Prakash

I have an implementation of Deduplication as mentioned at
http://wiki.apache.org/solr/Deduplication. It is helpful in grouping search
results. I would like to achieve the same functionality in my MLT queries,
where the result set should include grouped documents. What is a good way
to do the same?


*Pranav Prakash*

temet nosce

Typical Cache Values

2012-02-07 Thread Pranav Prakash

The hit ratios of some of my caches seem to be pretty low. Here they
are. What are typical values in your production setups? What are some of
the things that can be done to improve the ratios?

queryResultCache

lookups : 3234602
hits : 496
hitratio : 0.00
inserts : 3234239
evictions : 3230143
size : 4096
warmupTime : 8886
cumulative_lookups : 3465734
cumulative_hits : 526
cumulative_hitratio : 0.00
cumulative_inserts : 3465208
cumulative_evictions : 3457151


documentCache

lookups : 17647360
hits : 11935609
hitratio : 0.67
inserts : 5711851
evictions : 5707755
size : 4096
warmupTime : 0
cumulative_lookups : 19009142
cumulative_hits : 12813630
cumulative_hitratio : 0.67
cumulative_inserts : 6195512
cumulative_evictions : 6187460


fieldValueCache

lookups : 0
hits : 0
hitratio : 0.00
inserts : 0
evictions : 0
size : 0
warmupTime : 0
cumulative_lookups : 0
cumulative_hits : 0
cumulative_hitratio : 0.00
cumulative_inserts : 0
cumulative_evictions : 0


filterCache

lookups : 30059278
hits : 28813869
hitratio : 0.95
inserts : 1245744
evictions : 1245232
size : 512
warmupTime : 28005
cumulative_lookups : 32155745
cumulative_hits : 30845811
cumulative_hitratio : 0.95
cumulative_inserts : 1309934
cumulative_evictions : 1309245
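
As a rough cross-check of the figures above, the hit ratio is simply hits
divided by lookups. A minimal Python sketch using the counters posted in this
message (the helper name hit_ratio is only for illustration):

```python
# Counters copied from the stats posted above: (lookups, hits).
CACHES = {
    "queryResultCache": (3234602, 496),
    "documentCache": (17647360, 11935609),
    "fieldValueCache": (0, 0),
    "filterCache": (30059278, 28813869),
}


def hit_ratio(lookups: int, hits: int) -> float:
    """Return hits / lookups, or 0.0 when the cache was never consulted."""
    return hits / lookups if lookups else 0.0


if __name__ == "__main__":
    for name, (lookups, hits) in CACHES.items():
        print(f"{name}: {hit_ratio(lookups, hits):.4f}")
```

The evictions column is worth reading alongside the ratio: for the
queryResultCache nearly every insert is later evicted, so almost nothing
stays around long enough to be reused.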




*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Re: Typical Cache Values

2012-02-07 Thread Pranav Prakash


 This is not unusual, but there's also not much reason to give this much
 memory in your case. This is the cache that is hit when a user pages
 through result set. Your numbers would seem to indicate one of two things:
 1) your window is smaller than 2 pages, see queryResultWindowSize in
 solrconfig.xml,
 or
 2) your users are rarely going to the next page.

 this cache isn't doing you much good, but then it's also not using that
 much in the way of resources.



True. Although queryResultWindowSize is 30, I will be reducing it
to 4 or so. And yes, we have observed that most people don't go beyond
the first page.



  documentCache
 
  lookups : 17647360
  hits : 11935609
  hitratio : 0.67
  inserts : 5711851
  evictions : 5707755
  size : 4096
  warmupTime : 0
  cumulative_lookups : 19009142
  cumulative_hits : 12813630
  cumulative_hitratio : 0.67
  cumulative_inserts : 6195512
  cumulative_evictions : 6187460
 

 Again, this is actually quite reasonable. This cache
 is used to hold document data, and often doesn't have
 a great hit ratio. It is necessary though, it saves quite
 a bit of disk seeks when servicing a single query.

 
  fieldValueCache
 
  lookups : 0
  hits : 0
  hitratio : 0.00
  inserts : 0
  evictions : 0
  size : 0
  warmupTime : 0
  cumulative_lookups : 0
  cumulative_hits : 0
  cumulative_hitratio : 0.00
  cumulative_inserts : 0
  cumulative_evictions : 0
 

 Not doing much in the way of faceting, are you?


No. We don't facet results


 
  filterCache
 
  lookups : 30059278
  hits : 28813869
  hitratio : 0.95
  inserts : 1245744
  evictions : 1245232
  size : 512
  warmupTime : 28005
  cumulative_lookups : 32155745
  cumulative_hits : 30845811
  cumulative_hitratio : 0.95
  cumulative_inserts : 1309934
  cumulative_evictions : 1309245
 
 

 Not a bad hit ratio here, this is where
 fq filters are stored. One caution here;
 it is better to break out your filter
 queries where possible into small chunks.
 Rather than write fq=field1:val1 AND field2:val2,
 it's better to write fq=field1:val1&fq=field2:val2
 Think of this cache as a map with the query
 as the key. If you write the fq the first way above,
 subsequent fqs for either half won't use the cache.
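
To illustrate the point about cache keys, here is a toy Python model of a
cache keyed by the exact filter query string. It is not Solr's
implementation; ToyFilterCache and the empty placeholder doc-id sets are
made up for the sketch.

```python
class ToyFilterCache:
    """Toy stand-in for a cache keyed by the exact fq string (not Solr code)."""

    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, fq: str) -> frozenset:
        if fq in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            # Placeholder for the set of matching doc ids Solr would compute.
            self.entries[fq] = frozenset()
        return self.entries[fq]


def run(requests):
    """Feed a list of requests (each a list of fq strings) through one cache."""
    cache = ToyFilterCache()
    for fqs in requests:
        for fq in fqs:
            cache.lookup(fq)
    return cache.hits, cache.misses


# One combined fq per request: every distinct combination is a new key.
combined = [["field1:val1 AND field2:val2"],
            ["field1:val1 AND field2:val3"],
            ["field1:val1"]]
# Separate fqs: field1:val1 is cached once and then reused.
separate = [["field1:val1", "field2:val2"],
            ["field1:val1", "field2:val3"],
            ["field1:val1"]]

if __name__ == "__main__":
    print("combined (hits, misses):", run(combined))
    print("separate (hits, misses):", run(separate))
```

In this toy run the combined form never hits, while the separate form reuses
the field1:val1 entry on every later request.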


That was great advice. We do use the former approach, but going forward we
will stick to the latter one.

Thanks,

Pranav

Something like featured results in solr response?

2012-01-30 Thread Pranav Prakash

Hi,

I believe there is a feature in Solr that returns a set of featured
documents for a query. I read about it a couple of months back, and now
that I have decided to work on it, I somehow can't find its reference.

Here is the description - For a search keyword, apart from the results
generated by Solr (which is based on relevancy, score), there is another
set of documents which just comes up. It is very much similar to the
sponsored results feature of Google.

Can you guys point me to the appropriate resources for the same?


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Re: Something like featured results in solr response?

2012-01-30 Thread Pranav Prakash

Thanks a lot :-) This is exactly what I had read back then. However, going
through it now, it seems that every time a document needs to be elevated, it
has to be in the config file, which means that Solr has to be restarted.
This does not make a lot of sense for a production environment, where Solr
restarts are as infrequent as config changes.

What could be a sound way to implement this?

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


2012/1/30 Rafał Kuć r@solr.pl

 Hello!

 Please look at http://wiki.apache.org/solr/QueryElevationComponent.

 --
 Regards,
  Rafał Kuć
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch

  Hi,

  I believe, there is a feature in Solr, which allows to return a set of
  featured documents for a query. I did read it couple of months back,
 and
  now when I have decided to work on it, I somehow can't find it's
 reference.

  Here is the description - For a search keyword, apart from the results
  generated by Solr (which is based on relevancy, score), there is another
  set of documents which just comes up. It is very much similar to the
  sponsored results feature of Google.

  Can you guys point me to the appropriate resources for the same?


  *Pranav Prakash*

  temet nosce

  Twitter http://twitter.com/pranavprakash | Blog 
 http://blog.myblive.com |
  Google http://www.google.com/profiles/pranny

Re: Something like featured results in solr response?

2012-01-30 Thread Pranav Prakash

Wow, this looks interesting.

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Mon, Jan 30, 2012 at 21:16, Erick Erickson erickerick...@gmail.com wrote:

 There's the tricky line:
 If the file exists in the /conf/ directory it will be loaded once at
 start-up. If it exists in the data directory, it will be reloaded for
 each IndexReader.

 on the page: http://wiki.apache.org/solr/QueryElevationComponent

 Which basically means that if your config file is in the right directory,
 it'll be reloaded whenever the index changes, i.e. when a replication
 happens in a master/slave setup or when a commit happens on
 a single machine used for both indexing  and searching.

 Best
 Erick

 On Mon, Jan 30, 2012 at 8:31 AM, Pranav Prakash pra...@gmail.com wrote:
  Thanks a lot :-) This is exactly what I had read back then. However,
 going
  through it now, it seems that everytime a document needs to be elevated,
 it
  has to be in the config file. Which means that Solr should be restarted.
  This does not make a lot of sense for a production environment, where
 Solr
  restarts are as infrequent as config changes.
 
  What could be a sound way to implement this?
 
  *Pranav Prakash*
 
  temet nosce
 
  Twitter http://twitter.com/pranavprakash | Blog 
 http://blog.myblive.com |
  Google http://www.google.com/profiles/pranny
 
 
  2012/1/30 Rafał Kuć r@solr.pl
 
  Hello!
 
  Please look at http://wiki.apache.org/solr/QueryElevationComponent.
 
  --
  Regards,
   Rafał Kuć
   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 
   Hi,
 
   I believe, there is a feature in Solr, which allows to return a set of
   featured documents for a query. I did read it couple of months back,
  and
   now when I have decided to work on it, I somehow can't find it's
  reference.
 
   Here is the description - For a search keyword, apart from the results
   generated by Solr (which is based on relevancy, score), there is
 another
   set of documents which just comes up. It is very much similar to the
   sponsored results feature of Google.
 
   Can you guys point me to the appropriate resources for the same?
 
 
   *Pranav Prakash*
 
   temet nosce
 
   Twitter http://twitter.com/pranavprakash | Blog 
  http://blog.myblive.com |
   Google http://www.google.com/profiles/pranny

Re: Highlighting uses lots of memory and eventually slows down Solr

2011-12-19 Thread Pranav Prakash

No response! Bumping it up.

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Fri, Dec 9, 2011 at 14:11, Pranav Prakash pra...@gmail.com wrote:

 Hi Group,

 I would like to have highlighting for search and I have the fields indexed
 with the following schema (Solr 3.4)

 <fieldType name="text_commongrams" class="solr.TextField">
   <analyzer>
     <charFilter class="solr.HTMLStripCharFilterFactory"/>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     <filter class="solr.TrimFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"
             protected="protwords.txt"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.CommonGramsFilterFactory" words="stopwords_en.txt"
             ignoreCase="true"/>
     <filter class="solr.StopFilterFactory" words="stopwords_en.txt"
             ignoreCase="true"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
             catenateAll="0" preserveOriginal="1"/>
   </analyzer>
 </fieldType>

 <field name="transcript" type="text_commongrams" indexed="true"
        stored="true" termVectors="true" termPositions="true"
        termOffsets="true"/>

 <dynamicField name="*_en" type="text_commongrams" indexed="true"
               stored="true" termVectors="true" termPositions="true"
               termOffsets="true"/>

 And the following config

 <highlighting>
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter"
               default="true">
     <lst name="defaults">
       <int name="hl.fragsize">100</int>
     </lst>
   </fragmenter>
   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
     <lst name="defaults">
       <int name="hl.fragsize">20</int>
       <float name="hl.regex.slop">0.5</float>
       <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
     </lst>
   </fragmenter>
   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter"
              default="true">
     <lst name="defaults">
       <str name="hl.simple.pre">
         <![CDATA[ <strong> ]]>
       </str>
       <str name="hl.simple.post">
         <![CDATA[ </strong> ]]>
       </str>
     </lst>
   </formatter>
 </highlighting>

 The problem is that when I turn on highlighting, I face memory issues. The
 memory usage on the system goes higher and higher until it consumes all the
 memory (I don't receive OOM errors; there is always about 300 MB of free
 memory). The total memory I have is 48GiB. My index size is 138GiB and
 there are about 10m documents in the index.

 I also get the following warning, but I am not sure how to address it.

 WARNING: Deprecated syntax found. <highlighting/> should move to
 <searchComponent/>

 My Solr log with highlighting turned on looks something like this

  [core0] webapp=/solr path=/select
 params={mm=390%25&qf=title^2&hl.simple.pre=<strong>&hl.fl=title,transcript,transcript_en&wt=ruby&hl=true&rows=12&defType=dismax&fl=id,title,description&debugQuery=false&start=0&q=asdfghjkl&bf=recip(ms(NOW,created_at),1.88e-11,1,1)&hl.simple.post=</strong>&ps=50}

 Any help on this would be greatly appreciated. Thanks in advance !!

 *Pranav Prakash*

 temet nosce

 Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
 Google http://www.google.com/profiles/pranny

Highlighting uses lots of memory and eventually slows down Solr

2011-12-09 Thread Pranav Prakash

Hi Group,

I would like to have highlighting for search and I have the fields indexed
with the following schema (Solr 3.4)

<fieldType name="text_commongrams" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_en.txt"
            ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt"
            ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" preserveOriginal="1"/>
  </analyzer>
</fieldType>

<field name="transcript" type="text_commongrams" indexed="true"
       stored="true" termVectors="true" termPositions="true"
       termOffsets="true"/>

<dynamicField name="*_en" type="text_commongrams" indexed="true"
              stored="true" termVectors="true" termPositions="true"
              termOffsets="true"/>

And the following config

<highlighting>
  <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter"
              default="true">
    <lst name="defaults">
      <int name="hl.fragsize">100</int>
    </lst>
  </fragmenter>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">20</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
    </lst>
  </fragmenter>
  <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter"
             default="true">
    <lst name="defaults">
      <str name="hl.simple.pre">
        <![CDATA[ <strong> ]]>
      </str>
      <str name="hl.simple.post">
        <![CDATA[ </strong> ]]>
      </str>
    </lst>
  </formatter>
</highlighting>

The problem is that when I turn on highlighting, I face memory issues. The
memory usage on the system goes higher and higher until it consumes all the
memory (I don't receive OOM errors; there is always about 300 MB of free
memory). The total memory I have is 48GiB. My index size is 138GiB and
there are about 10m documents in the index.

I also get the following warning, but I am not sure how to address it.

WARNING: Deprecated syntax found. <highlighting/> should move to
<searchComponent/>

My Solr log with highlighting turned on looks something like this

[core0] webapp=/solr path=/select
params={mm=390%25&qf=title^2&hl.simple.pre=<strong>&hl.fl=title,transcript,transcript_en&wt=ruby&hl=true&rows=12&defType=dismax&fl=id,title,description&debugQuery=false&start=0&q=asdfghjkl&bf=recip(ms(NOW,created_at),1.88e-11,1,1)&hl.simple.post=</strong>&ps=50}

Any help on this would be greatly appreciated. Thanks in advance !!

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Howto Programatically check if the index is optimized or not?

2011-11-15 Thread Pranav Prakash

Hi,

After the commit, my optimize usually takes 20 minutes. I need to know
programmatically whether the optimization has completed or not. Is
there an API call through which I can check the status of the optimization?
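
One possible approach, sketched in Python: poll the Luke request handler and
look at the index section of its response. This assumes the handler is
enabled at /admin/luke and that its response carries an "optimized" flag,
which I believe is the case for 1.4/3.x but should be verified against your
version. The core URL below is hypothetical.

```python
import json
import urllib.request


def index_is_optimized(luke_response: dict):
    """Return the 'optimized' flag from a parsed Luke response, or None if absent."""
    return luke_response.get("index", {}).get("optimized")


def fetch_luke(base_url: str) -> dict:
    """Fetch index-level info as JSON (numTerms=0 skips per-field term stats)."""
    url = f"{base_url}/admin/luke?wt=json&numTerms=0"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical core URL; adjust to your deployment.
    info = fetch_luke("http://localhost:8983/solr/core0")
    print("optimized:", index_is_optimized(info))
```

A None result means the flag was not present in the response, in which case
this check does not apply to that Solr version.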


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Re: Painfully slow indexing

2011-10-24 Thread Pranav Prakash

Hey guys,

Your responses are welcome, but I still haven't gained much improvement.

*Are you posting through HTTP/SOLRJ?*
I am using the RSolr gem, which internally uses the Ruby HTTP lib to POST
documents to Solr.

*Your script time 'T' includes time between sending POST request -to-
the response fetched after successful response right??*
Correct. It also includes the time taken to convert all those documents from
a Ruby Hash to XML.


 *generate the ready-for-indexing XML documents on a file system*
Alain, I have somewhere around 6m documents to index. Do you mean that I
should convert all of them into one XML file and then index it?

*are you calling commit after your batches or do an optimize by any chance?*
I am not optimizing, but I am performing an autocommit every 10 docs.

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Fri, Oct 21, 2011 at 16:32, Simon Willnauer 
simon.willna...@googlemail.com wrote:

 On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash pra...@gmail.com wrote:
  Hi guys,
 
  I have set up a Solr instance and upon attempting to index document, the
  whole process is painfully slow. I will try to put as much info as I can
 in
  this mail. Pl. feel free to ask me anything else that might be required.
 
  I am sending documents in batches not exceeding 2,000. The size of each
 of
  them depends but usually is around 10-15MiB. My indexing script tells me
  that Solr took T seconds to add N documents of size S. For the same data,
  the Solr Log add QTime is QT. Some of the sample data are:
 
      N     |        S         |   T    |  QT
  ----------+------------------+--------+------
   390 docs |  3,478,804 Bytes | 14.5s  | 2297
   852 docs |  6,039,535 Bytes | 25.3s  | 4237
  1345 docs | 11,147,512 Bytes | 47s    | 8543
  1147 docs |  9,457,717 Bytes | 44s    | 2297
  1096 docs | 13,058,204 Bytes | 54.3s  | 8782
 
  The time T includes the time of converting an array of Hash objects into
  XML, POSTing it to Solr and response acknowledged from Solr. Clearly,
 there
  is a huge difference between both the time T and QT. After a lot of
 efforts,
  I have no clue why these times do not match.
 
  The Server has 16 cores, 48GiB RAM. JVM options are -Xms5000M -Xmx5000M
  -XX:+UseParNewGC
 
  I believe my Indexing is getting slow. Relevant portion from my schema
 file
  are as follows. On a related note, every document has one dynamic field.
  Based on this rate, it takes me ~30hrs to do a full index of my database.
  I would really appreciate kindness of community in order to get this
  indexing faster.
 
  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">10</int>
      <int name="maxThreadCount">10</int>
    </mergeScheduler>
    <ramBufferSizeMB>2048</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>300</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <maxBufferedDocs>5</maxBufferedDocs>
    <termIndexInterval>256</termIndexInterval>
    <mergeFactor>10</mergeFactor>
    <useCompoundFile>false</useCompoundFile>
    <!-- <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnceExplicit">19</int>
      <int name="segmentsPerTier">9</int>
    </mergePolicy> -->
  </indexDefaults>

  <mainIndex>
    <unlockOnStartup>true</unlockOnStartup>
    <reopenReaders>true</reopenReaders>
    <deletionPolicy class="solr.SolrDeletionPolicy">
      <str name="maxCommitsToKeep">1</str>
      <str name="maxOptimizedCommitsToKeep">0</str>
    </deletionPolicy>
    <infoStream file="INFOSTREAM.txt">false</infoStream>
  </mainIndex>

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10</maxDocs>
    </autoCommit>
  </updateHandler>
 
 
  *Pranav Prakash*
 
  temet nosce
 
  Twitter http://twitter.com/pranavprakash | Blog 
 http://blog.myblive.com |
  Google http://www.google.com/profiles/pranny
 

 hey,

 are you calling commit after your batches or do an optimize by any chance?

 I would suggest you to stream your documents to solr and try to commit
 only if you really need to. Set your RAM Buffer to something between
 256 and 320 MB and remove the maxBufferedDocs setting completely. You
 can also experiment with your merge settings a little and 10 merging
 threads seem to be a lot. I know you have lots of CPU but IO will be
 the bottleneck here.

 simon

Painfully slow indexing

2011-10-19 Thread Pranav Prakash

Hi guys,

I have set up a Solr instance, and when I attempt to index documents, the
whole process is painfully slow. I will try to put as much info as I can in
this mail. Please feel free to ask me for anything else that might be required.

I am sending documents in batches not exceeding 2,000. The size of each of
them depends but usually is around 10-15MiB. My indexing script tells me
that Solr took T seconds to add N documents of size S. For the same data,
the Solr Log add QTime is QT. Some of the sample data are:

    N     |        S         |   T    |  QT
----------+------------------+--------+------
 390 docs |  3,478,804 Bytes | 14.5s  | 2297
 852 docs |  6,039,535 Bytes | 25.3s  | 4237
1345 docs | 11,147,512 Bytes | 47s    | 8543
1147 docs |  9,457,717 Bytes | 44s    | 2297
1096 docs | 13,058,204 Bytes | 54.3s  | 8782

The time T includes the time taken to convert an array of Hash objects into
XML, POST it to Solr and receive the acknowledgement from Solr. Clearly,
there is a huge difference between T and QT. After a lot of effort,
I have no clue why these times do not match.
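
Since QTime is reported in milliseconds, the gap can be made explicit by
converting QT to seconds and subtracting it from T. A small Python sketch
over the sample rows above; what it calls "overhead" is everything outside
Solr's own request handling (Hash-to-XML conversion, HTTP transfer and so
on), which is an interpretation of the numbers rather than a measurement.

```python
# (docs, bytes, T in seconds, QTime in milliseconds), copied from the table above.
SAMPLES = [
    (390, 3_478_804, 14.5, 2297),
    (852, 6_039_535, 25.3, 4237),
    (1345, 11_147_512, 47.0, 8543),
    (1147, 9_457_717, 44.0, 2297),
    (1096, 13_058_204, 54.3, 8782),
]


def overhead_seconds(t_seconds: float, qtime_ms: int) -> float:
    """Wall-clock time not accounted for by Solr's QTime."""
    return t_seconds - qtime_ms / 1000.0


def solr_share(t_seconds: float, qtime_ms: int) -> float:
    """Fraction of the wall-clock time that QTime accounts for."""
    return (qtime_ms / 1000.0) / t_seconds


if __name__ == "__main__":
    for docs, size, t, qt in SAMPLES:
        print(f"{docs:5d} docs  overhead={overhead_seconds(t, qt):6.2f}s  "
              f"solr share={solr_share(t, qt):5.1%}  docs/s={docs / t:6.1f}")
```

For the first row this gives about 12.2 s outside Solr against 2.3 s inside
it, which suggests looking at the client side before tuning the index
settings.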

The Server has 16 cores, 48GiB RAM. JVM options are -Xms5000M -Xmx5000M
-XX:+UseParNewGC

I believe my indexing is getting slow. The relevant portions of my config
are as follows. On a related note, every document has one dynamic field.
At this rate, it takes me ~30hrs to do a full index of my database.
I would really appreciate the community's help in getting this
indexing faster.

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">10</int>
    <int name="maxThreadCount">10</int>
  </mergeScheduler>
  <ramBufferSizeMB>2048</ramBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>300</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <maxBufferedDocs>5</maxBufferedDocs>
  <termIndexInterval>256</termIndexInterval>
  <mergeFactor>10</mergeFactor>
  <useCompoundFile>false</useCompoundFile>
  <!-- <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnceExplicit">19</int>
    <int name="segmentsPerTier">9</int>
  </mergePolicy> -->
</indexDefaults>

<mainIndex>
  <unlockOnStartup>true</unlockOnStartup>
  <reopenReaders>true</reopenReaders>
  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="maxCommitsToKeep">1</str>
    <str name="maxOptimizedCommitsToKeep">0</str>
  </deletionPolicy>
  <infoStream file="INFOSTREAM.txt">false</infoStream>
</mainIndex>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10</maxDocs>
  </autoCommit>
</updateHandler>


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

How to achieve Indexing @ 270GiB/hr

2011-10-04 Thread Pranav Prakash

Greetings,

While going through the article "265% indexing speedup with Lucene's
concurrent flushing"
(http://java.dzone.com/news/265-indexing-speedup-lucenes?mz=33057-solr_lucene),
I was stunned by the endless possibilities in which indexing speed could be
increased.

I'd like to take inputs from everyone over here as to how to achieve this
speed. As far as I understand there are two broad ways of feeding data to
Solr -

   1. Using DataImportHandler
   2. Using HTTP to POST docs to Solr.

The indexing speeds the article describes seem too much to expect from
the second approach. Or is it possible with multiple instances
feeding docs to Solr?

My current setup does the following -

   1. Execute SQL queries to create database of documents that needs to be
   fed.
   2. Go through the columns one by one, and create XMLs for them and send
   it over to Solr in batches of max 500 docs.
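
The batching step above, combined with the idea of several feeders, can be
sketched in Python as follows. This is only an outline under assumptions: the
send callable stands in for whatever performs the HTTP POST to Solr's update
handler, and the worker count of 4 is arbitrary.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def batches(docs, size=500):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def to_add_xml(batch) -> bytes:
    """Serialize a list of {field: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for doc in batch:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


def index_all(docs, send, workers=4, size=500):
    """Build and send batches from several threads; `send` does the POST."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: send(to_add_xml(b)), batches(docs, size)))


if __name__ == "__main__":
    docs = [{"id": i, "title": f"doc {i}"} for i in range(1200)]
    # Stand-in sender that only reports the payload size in bytes.
    print(index_all(docs, send=len))
```

Whether parallel feeding helps depends on where the time goes; if the SQL
queries or the XML generation dominate, those are the parts to parallelize.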


Even if using DataImportHandler what are the ways this could be optimized?
If I am able to solve the problem of indexing data in our current setup, my
life would become a lot easier.


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Suggestions on how to perform infrastructure migration from 1.4 to 3.4?

2011-09-30 Thread Pranav Prakash

Hi List,

Our production search infrastructure consists of 1 indexing master and 2
identical serving slaves. They are all Solr 1.4 beasts. Apart from this,
we have 1 beast on Solr 3.4, which we have benchmarked against our
production setup (for performance and relevancy), and we would like to
upgrade production to it. Something like this has not happened before in
our organization. I'd like to know the community's opinions on how this
migration can be performed. Will there be any downtime, and if so, for how
many hours? What are some of the common issues that might come up?

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Can't use ms() function on non-numeric legacy date field

2011-09-27 Thread Pranav Prakash

Hi, I had been trying to boost my recent documents, using what is described
here http://wiki.apache.org/solr/FunctionQuery#Date_Boosting

My date field looks like

<fieldType name="date" class="solr.DateField" sortMissingLast="true"
           omitNorms="true"/>
<field name="created_at" type="date" indexed="true" stored="true"
       omitNorms="true"/>

However, upon trying to do ms(NOW, created_at) it shows the error
Can't use ms() function on non-numeric legacy date field created_at
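
For reference, the date boost on that wiki page is
recip(ms(NOW,mydatefield),3.16e-11,1,1), where recip(x,m,a,b) evaluates to
a/(m*x+b). A short Python sketch of how such a boost decays with document
age; the one-year constant assumes a 365.25-day year, and the function names
are only for illustration.

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # assumes a 365.25-day year


def recip(x: float, m: float, a: float, b: float) -> float:
    """Mirror of Solr's recip(x, m, a, b) function: a / (m * x + b)."""
    return a / (m * x + b)


def date_boost(age_ms: float, m: float = 3.16e-11) -> float:
    """Boost for a document that is `age_ms` milliseconds old."""
    return recip(age_ms, m, 1.0, 1.0)


if __name__ == "__main__":
    for years in (0, 0.5, 1, 2, 5):
        print(f"{years:>4} yr -> boost {date_boost(years * MS_PER_YEAR):.3f}")
```

With m = 3.16e-11 a one-year-old document gets roughly half the boost of a
brand-new one, since m is about 1 divided by the number of milliseconds in a
year.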
*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Re: StopWords coming in Top 10 terms despite using StopFilterFactory

2011-09-23 Thread Pranav Prakash

 You've got CommonGramsFilterFactory and StopFilterFactory both using
 stopwords.txt, which is a confusing configuration.  Normally you'd want one
 or the other, not both ... but if you did legitimately have both, you'd want
 them to each use a different wordlist.


Maybe I am wrong, but my intention in using both of them is this: first, I
want to support phrase queries, so I used CommonGramsFilterFactory. Second, I
don't want those stopwords in my index, so I used StopFilterFactory to remove
them.




 The commongrams filter turns each found occurrence of a word in the file
 into two tokens - one prepended with the token before it, one appended with
 the token after it.  If it's the first or last term in a field, it only
 produces one token.  When it gets to the stopfilter, the combined terms no
 longer match what's in stopwords.txt, so no action is taken.
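
 The interaction described above can be modelled with a few lines of Python.
 This is a simplified sketch, not the Lucene code: it assumes bigrams are
 joined with an underscore and that the single tokens are kept alongside
 them, which is my understanding of the index-time filter. The function
 names are made up for the example.

```python
def common_grams(tokens, common_words, sep="_"):
    """Emit each token, plus a bigram whenever it or its successor is common."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common_words or nxt in common_words:
                out.append(f"{tok}{sep}{nxt}")
    return out


def stop_filter(tokens, stop_words):
    """Drop tokens that exactly match a stop word."""
    return [t for t in tokens if t not in stop_words]


if __name__ == "__main__":
    words = {"of", "the"}
    grams = common_grams(["state", "of", "the", "art"], words)
    print(grams)
    print(stop_filter(grams, words))
```

 In this model the stop filter removes the bare "of" and "the" but leaves
 combined tokens such as of_the untouched, because they no longer match any
 entry in the word list.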

 If I had to guess, what you are seeing in the top 10 terms is the
 concatenation of your most common stopword with another word.  If it were
 English, I would guess that to be of_the or something similar.  If my
 guess is wrong, then I'm not sure what's going on, and some cut/paste of
 what you're actually seeing might be in order.


term   frequency
to     26164
and    25804
the    25566
of     25022
a      24918
in     24590
for    23646
n      23588
with   23055
is     22510



  Did you do delete and do a full reindex after you changed your schema?


Yup I did that a couple of times



 Thanks,
 Shawn


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com/
 | Google http://www.google.com/profiles/pranny

StopWords coming in Top 10 terms despite using StopFilterFactory

2011-09-22 Thread Pranav Prakash

Hi List,

I included StopFilterFactory and I can see it taking effect in the Analyzer
Interface. However, when I go to Schema Analyzer, I see those stop words in
the top 10 terms. Is this normal?

<fieldType name="text_commongrams" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" preserveOriginal="1"/>
  </analyzer>
</fieldType>


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Re: java.io.CharConversionException While Indexing in Solr 3.4

2011-09-20 Thread Pranav Prakash

I managed to resolve this issue. Turns out that the issue was because of a
faulty XML file being generated by ruby-solr gem. I had to install
libxml-ruby, rsolr and I used rsolr gem instead of ruby-solr.

Also, if you face this kind of issue, the test-utf8.sh file included in
exampledocs is a good file to test Solr's behavior towards UTF-8 chars.
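
A quick client-side check can also catch this kind of payload before it is
sent. The Python sketch below is my own illustration and is not part of
test-utf8.sh: it tries to decode the bytes as UTF-8 and reports where
decoding first fails.

```python
def first_invalid_utf8(payload: bytes):
    """Return (byte_offset, offending_byte) for the first invalid sequence, else None."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, payload[err.start]
    return None


if __name__ == "__main__":
    good = "<field name=\"title\">caf\u00e9</field>".encode("utf-8")
    # 0xC3 starts a two-byte sequence, but 0x73 ('s') is not a valid second byte.
    bad = b"<field name=\"title\">caf\xc3s</field>"
    print(first_invalid_utf8(good))
    print(first_invalid_utf8(bad))
```

Note that Python reports the lead byte of the broken sequence, whereas the
Solr log above names the unexpected middle byte (0x73), so the offsets may
differ by one.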

Great work, Solr team, and special thanks to Erik Hatcher.

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Mon, Sep 19, 2011 at 15:54, Pranav Prakash pra...@gmail.com wrote:


 Just in case someone is interested, here is the log

 SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289)
     at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
     at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
     at org.mortbay.jetty.Server.handle(Server.java:326)
     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
     at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
 Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289)
     at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:313)
     at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:204)
     at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
     at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
     at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
     at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
     at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
     at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
     at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
     ... 26 more


 Also, is there a setting so I can change the level of backtrace? This would
 be helpful in showing the complete stack instead of 26 more ...

 *Pranav Prakash*

 temet nosce

 Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
 Google http://www.google.com/profiles/pranny


 On Mon, Sep 19, 2011 at 14:16, Pranav Prakash pra...@gmail.com wrote:


 Hi List,

 I tried Solr 3.4.0 today and while indexing I got the error
 java.lang.RuntimeException: [was class java.io.CharConversionException]
 Invalid UTF-8 middle byte 0x73 (at char #66611, byte #65289)

 My earlier version was Solr 1.4 and this same document went into index
 successfully. Looking around, I see issue
 https://issues.apache.org/jira/browse/SOLR-2381 which seems to fix the
 issue. I thought this patch is already applied to Solr 3.4.0. Is there
 something I am missing?

 Is there anything else I need to mention? Logs/ My document details etc.?

 *Pranav Prakash*

 temet nosce

 Twitter http://twitter.com

Re: Stemming and other tokenizers

2011-09-20 Thread Pranav Prakash

I have a similar use case, but slightly more flexible and straightforward.
In my case, I have a field 'language' which stores 'en', 'es' or whatever
the language of the document is. The field 'transcript' then stores the
actual content, which is in the language given by the 'language' field.
Following up on the conversation, is this how I am supposed to proceed?

   1. Create one field type in my schema per supported language. This would
   cause me to create ~30 fields.
   2. Since I already know the language of my content, I can skip SOLR-1979
   (which is expected in Solr 3.5).

The point where I am unclear is: how do I specify, at index time, that a
certain field should be used for a certain language?
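
One possible approach, sketched in Python under assumptions: the indexing client picks the target field name from the already-known language code, and each such field is bound to its own field type in schema.xml. The `transcript_<lang>` naming, the fallback field `transcript_general` and the helper name are all hypothetical, not something from this thread.

```python
# Hypothetical convention: one field per language (transcript_en, transcript_es, ...),
# each bound in schema.xml to a field type with that language's analyzer.
SUPPORTED_LANGUAGES = {"en", "es", "no"}  # illustrative subset, not the full ~30


def build_solr_doc(doc_id, language, transcript):
    """Return a dict for the indexer, routing the text by its known language."""
    if language in SUPPORTED_LANGUAGES:
        target_field = "transcript_" + language
    else:
        # Assumed fallback field with a language-neutral analyzer.
        target_field = "transcript_general"
    return {"id": doc_id, "language": language, target_field: transcript}
```

At query time the client would then search the field that matches the user's language, which fits the "dedicated field per language" advice quoted below.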

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Mon, Sep 12, 2011 at 20:55, Jan Høydahl jan@cominvent.com wrote:

 Hi,

 Do they? Can you explain the layout of the documents?

 There are two ways to handle multi lingual docs. If all your docs have both
 an English and a Norwegian version, you may either split these into two
 separate documents, each with the language field filled by LangId - which
 then also lets you filter by language. Or you may assign a title_en and
 title_no to the same document (expand with more fields if you have more
 languages per document), and keep it as one document. Your client will then
 be adapted to search the language(s) that the user wants.

 If one document has multiple languages within the same field, e.g. body,
 say one paragraph of English and the next is Norwegian, then we currently do
 not have any capability in Solr to apply different analysis (tokenization,
 stemming etc) to each paragraph.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com

 On 12. sep. 2011, at 11:37, Manish Bafna wrote:

  What if a single document has multiple languages?
 
  On Mon, Sep 12, 2011 at 2:23 PM, Jan Høydahl jan@cominvent.com
 wrote:
 
  Hi
 
  Everybody else use dedicated field per language, so why can't you?
  Please explain your use case, and perhaps we can better help understand
  what you're trying to do.
  Do you always know the query language in advance?
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
  Solr Training - www.solrtraining.com
 
  On 12. sep. 2011, at 08:28, Patrick Sauts wrote:
 
  I can't create one field per language, that is the problem but I'll dig
  into
  it following your indications.
  I let you know what I could come out with.
 
  Patrick.
 
  2011/9/11 Jan Høydahl jan@cominvent.com
 
  Hi,
 
  You'll not be able to detect language and change stemmer on the same
  field
  in one go. You need to create one fieldType in your schema per
 language
  you
  want to use, and then use LanguageIdentification (SOLR-1979) to do the
  magic
  of detecting language and renaming the field. If you set
  langid.override=false, langid.map=true and populate your language
  field
  with the known language, you will probably get the desired effect.
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
  Solr Training - www.solrtraining.com
 
  On 10. sep. 2011, at 03:24, Patrick Sauts wrote:
 
  Hello,
 
 
 
  I want to implement some kind of AutoStemming that will detect the
  language
  of a field based on a tag at the start of this field like #en# my
 field
  is
  stored on disc but I don't want this tag to be stored. Is there a way
  to
  avoid this field to be stored ?
 
  To me all the filters and the tokenizers interact only with the
 indexed
  field and not the stored one.
 
  Am I wrong ?
 
  Is it possible to you to do such a filter.
 
 
 
  Patrick.

java.io.CharConversionException While Indexing in Solr 3.4

2011-09-19 Thread Pranav Prakash

Hi List,

I tried Solr 3.4.0 today and while indexing I got the error
java.lang.RuntimeException: [was class java.io.CharConversionException]
Invalid UTF-8 middle byte 0x73 (at char #66611, byte #65289)

My earlier version was Solr 1.4 and this same document went into index
successfully. Looking around, I see issue
https://issues.apache.org/jira/browse/SOLR-2381 which seems to fix the
issue. I thought this patch is already applied to Solr 3.4.0. Is there
something I am missing?

Is there anything else I need to mention? Logs/ My document details etc.?
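
One way to narrow this down, sketched in Python under assumptions: scan the exact bytes that are posted to Solr and report the first sequence that is not valid UTF-8. A frequent cause of a "middle byte" complaint is Latin-1 text sent as UTF-8 (for example 0xE9 followed by the letter s, 0x73), but whether that is what happened here is not confirmed. The helper name is made up for illustration.

```python
def first_invalid_utf8(data):
    """Return (byte_offset, nearby_bytes) for the first invalid UTF-8
    sequence in `data`, or None if the whole buffer decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Show a small window starting at the offending byte.
        return err.start, data[err.start:err.start + 4]
    return None
```

Running this over the document body before posting it would show whether the offending bytes come from the source data or are introduced later in the pipeline.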

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Re: java.io.CharConversionException While Indexing in Solr 3.4

2011-09-19 Thread Pranav Prakash

In case someone is interested, here is the log:

SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289)
 at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
 at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
 at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
 at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
 at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
 at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
 at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
 at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
 at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
 at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
 at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:326)
 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
 at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
 at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
 at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289)
 at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:313)
 at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:204)
 at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
 at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
 at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
 at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
 at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
 at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
 at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
 at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
 ... 26 more


Also, is there a setting so I can change the level of backtrace? This would
be helpful in showing the complete stack instead of 26 more ...

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Mon, Sep 19, 2011 at 14:16, Pranav Prakash pra...@gmail.com wrote:


 Hi List,

 I tried Solr 3.4.0 today and while indexing I got the error
 java.lang.RuntimeException: [was class java.io.CharConversionException]
 Invalid UTF-8 middle byte 0x73 (at char #66611, byte #65289)

 My earlier version was Solr 1.4 and this same document went into index
 successfully. Looking around, I see issue
 https://issues.apache.org/jira/browse/SOLR-2381 which seems to fix the
 issue. I thought this patch is already applied to Solr 3.4.0. Is there
 something I am missing?

 Is there anything else I need to mention? Logs/ My document details etc.?

 *Pranav Prakash*

 temet nosce

 Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
 Google http://www.google.com/profiles/pranny

How To Implement Sweet Spot Similarity?

2011-09-16 Thread Pranav Prakash

I was wondering if there is *any* article on the web that provides
implementation details and some sort of analysis of Sweet Spot Similarity.
Google shows me all the JIRA commits and comments, but no article about the
actual implementation. What are the various configs that can be set? What
are good approaches for figuring out sweet spots? Can a combination of
multiple Similarity classes be used?

Any information would be much appreciated.
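
For orientation, here is a Python sketch of the length normalization that SweetSpotSimilarity is built around: documents whose term count falls inside a [min, max] plateau get a norm of 1.0, and the norm falls off outside it. This is a from-memory transcription of the formula, not copied from the Lucene source, so please verify it against the version you run. The parameter names and the values in the usage note are made up.

```python
import math


def sweet_spot_length_norm(num_terms, ln_min, ln_max, steepness):
    """Length norm with a flat 'sweet spot' between ln_min and ln_max terms."""
    # Zero inside the plateau, grows with distance outside it.
    distance = abs(num_terms - ln_min) + abs(num_terms - ln_max) - (ln_max - ln_min)
    return 1.0 / math.sqrt(steepness * distance + 1.0)
```

With an assumed plateau of 100 to 200 terms, a 150-term document scores 1.0, while a 300-term document is penalized. Finding good plateau bounds would mean looking at the length distribution of documents you consider good.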

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case

2011-08-30 Thread Pranav Prakash

Solr 3.3 has a Grouping feature. Is it practically the same as deduplication?

Here is my use case for removing duplicates:

We have many documents with similar (up to 99%) content. For some search
queries, almost all of them come up on the first page of results. Of all these
documents, essentially one is the original and the others are duplicates. We
are able to identify the original on the basis of a number of factors: who
uploaded it, when, and how many viral shares it has. It is also possible that
the duplicates were uploaded earlier (and hence already exist in the search
index) while the original is uploaded later (and gets added to the index later).

AFAIK, Deduplication targets index time. Is there a way I can specify the
original, which should be returned, and the duplicates, which should be kept
from coming up?
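
A sketch of how grouping could be used at query time for this, assuming every document carries a field identifying its cluster of near-duplicates and a numeric field ranking how likely it is to be the original. Both field names (`dup_cluster_id`, `originality_score`) are hypothetical and would have to be computed and indexed by the application.

```python
from urllib.parse import urlencode


def grouped_query_string(user_query, group_field="dup_cluster_id",
                         within_group_sort="originality_score desc"):
    """Build a query string that returns one document per duplicate cluster."""
    params = [
        ("q", user_query),
        ("group", "true"),                   # turn on result grouping
        ("group.field", group_field),        # one group per duplicate cluster
        ("group.limit", "1"),                # keep only the top document per group
        ("group.sort", within_group_sort),   # which document is "top" inside a group
    ]
    return urlencode(params)
```

The design choice here is to leave all copies in the index and collapse them at search time, so a late-arriving original can still win over duplicates that were indexed earlier.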


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

OOM due to JRE Issue (LUCENE-1566)

2011-08-16 Thread Pranav Prakash

Hi,

This has probably been discussed a long time back, but I got this error
recently on one of my production slaves.

SEVERE: java.lang.OutOfMemoryError: OutOfMemoryError likely caused by the
Sun VM Bug described in https://issues.apache.org/jira/browse/LUCENE-1566;
try calling FSDirectory.setReadChunkSize with a a value smaller than the
current chunk size (2147483647)

I am currently using Solr 1.4. Going through the JIRA issue comments, I found
that this patch applies to 2.9 or above. We are also planning an upgrade to
Solr 3.3. Is this patch included in 3.3, so that I don't have to apply it
manually?

What are the other workarounds for the problem?

Thanks in advance.

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Re: OOM due to JRE Issue (LUCENE-1566)

2011-08-16 Thread Pranav Prakash



 AFAIK, solr 1.4 is on Lucene 2.9.1 so this patch is already applied to
 the version you are using.
 maybe you can provide the stacktrace and more details about your
 problem and report back?


Unfortunately, this is all the information I have. However, here are my
specifications, in case they are helpful:

/usr/bin/java -d64 -Xms5000M -Xmx5000M -XX:+UseParallelGC -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$GC_LOGFILE
-XX:+CMSPermGenSweepingEnabled -Dsolr.solr.home=multicore
 -Denable.slave=true -jar start.jar

32GiB RAM


Any thoughts? Will a switch to ConcurrentGC help in any way?

Re: Is optimize needed on slaves if it replicates from optimized master?

2011-08-10 Thread Pranav Prakash

That is not true. Replication is roughly a copy of the diff between the
 master and the slave's index.


In my case, during replication entire index is copied from master to slave,
during which the size of index goes a little over double. Then it shrinks to
its original size. Am I doing something wrong? How can I get the master to
serve only delta index instead of serving whole index and the slaves merging
the new and old index?

*Pranav Prakash*

How come this query string starts with wildcard?

2011-08-10 Thread Pranav Prakash

While going through my Solr error logs, I found that a user had fired the
query - jawapan ujian bulanan thn 4 (bahasa melayu). The JavaScript code
converted it to the following for autosuggest purposes -
jawapan?ujian?bulanan?thn?4?(bahasa?melayu)*. Solr threw the exception

Cannot parse 'jawapan?ujian?bulanan?thn?4?(bahasa?melayu)*': '*' or
'?' not allowed as first character in WildcardQuery

How come this query string begins with a wildcard character?

When I changed the query to remove the brackets, everything went smoothly.
There were no results, probably because my search index didn't have
any matching documents.
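
A likely explanation, inferred from how the Lucene query parser treats parentheses rather than confirmed in this thread: the closing bracket ends a grouped clause, so the trailing * is parsed as a separate term that consists only of a wildcard. A minimal sketch of escaping the user's input before adding the wildcards follows; the function names and the exact character list are assumptions, so check the list against the query parser syntax for your version.

```python
# Characters treated as special by the Lucene query syntax (assumed list).
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&')


def escape_query_chars(text):
    """Backslash-escape query syntax characters in user-supplied text."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def autosuggest_query(user_input):
    """Join the escaped words with '?' and append '*', as the JavaScript did."""
    words = user_input.split()
    return "?".join(escape_query_chars(word) for word in words) + "*"
```

Escaping before joining keeps the ? and * added by the autosuggest code as real wildcards, while brackets typed by the user stay literal.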


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Re: Is optimize needed on slaves if it replicates from optimized master?

2011-08-10 Thread Pranav Prakash

Very well explained. Thanks. Yes, we do optimize Index before replication. I
am not particularly worried about disk space usage. I was more curious of
that behavior.

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Wed, Aug 10, 2011 at 19:55, Erick Erickson erickerick...@gmail.com wrote:

 This is expected behavior. You might be optimizing
 your index on the master after every set of changes,
 in which case the entire index is copied. During this
 period, the space on disk will at least double, there's no
 way around that.

 If you do NOT optimize, then the slave will only copy changed
 segments instead of the entire index. Optimizing isn't
 usually necessary except periodically (daily, perhaps weekly,
 perhaps never actually).

 All that said, depending on how merging happens, you will always
 have the possibility of the entire index being copied sometimes
 because you'll happen to hit a merge that merges all segments
 into one.

 There are some advanced options that can control some parts
 of merging, but you need to get to the bottom of why the whole
 index is getting copied every time before you go there. I'd bet
 you're issuing an optimize.

 Best
 Erick

 On Wed, Aug 10, 2011 at 5:30 AM, Pranav Prakash pra...@gmail.com wrote:
  That is not true. Replication is roughly a copy of the diff between the
  master and the slave's index.
 
 
  In my case, during replication entire index is copied from master to
 slave,
  during which the size of index goes a little over double. Then it shrinks
 to
  its original size. Am I doing something wrong? How can I get the master
 to
  serve only delta index instead of serving whole index and the slaves
 merging
  the new and old index?
 
  *Pranav Prakash*

Re: Solr 3.3 crashes after ~18 hours?

2011-08-02 Thread Pranav Prakash

What do you mean by it just crashes? Does the process stop executing? Does
it take too long to respond, which might result in lots of 503s in your
application? Does the system run out of resources?

Are you indexing and serving from the same server? It happened to us once
that Solr was performing a commit and then an optimize while the load from the
app server was at its peak. This caused slow responses from the search server,
which caused requests to stack up at the app server, resulting in 503s. Could
you check whether you have similar symptoms?

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Tue, Aug 2, 2011 at 15:31, alexander sulz a.s...@digiconcept.net wrote:

 Hello folks,

 I'm using the latest stable Solr release - 3.3 and I encounter strange
 phenomena with it.
 After about 19 hours it just crashes, but I can't find anything in the
 logs, no exceptions, no warnings,
 no suspicious info entries..

 I have an index-job running from 6am to 8pm every 10 minutes. After each
 job there is a commit.
 An optimize-job is done twice a day at 12:15pm and 9:15pm.

 Does anyone have an idea what could possibly be wrong or where to look for
 further debug info?

 regards and thank you
  alex

Re: PivotFaceting in solr 3.3

2011-08-02 Thread Pranav Prakash

From what I know, this is a feature in Solr 4.0 marked as SOLR-792 in JIRA.
Is this what you are looking for ?

https://issues.apache.org/jira/browse/SOLR-792


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Wed, Aug 3, 2011 at 10:16, Isha Garg isha.g...@orkash.com wrote:

 Hi All!

  Can anyone tell which patch should I apply to solr 3.3 to enable pivot
 faceting in it.

 Thanks in advance!
 Isha garg

Re: Solr Incremental Indexing

2011-07-31 Thread Pranav Prakash

There could be multiple ways of getting this done, and the exact one depends
a lot on factors like: what system are you using? How quickly does a change
have to be reflected in the index? How is the indexing/replication
done?

Usually, in cases where the tolerance is about 6 hrs (i.e. a DB change may
not be reflected in the Solr index for as long as 6 hrs), you can set up a
cron job to be triggered every 6 hrs. It will find all the changes made since
the previous run, update the index and commit.

In cases with a more real-time requirement, there could be a trigger in
the application (and not at the DB level), which would fork a process to
update Solr about the change by means of a delayed task. If using this
approach, it is suggested to use autocommit every N documents; N could be
anything, depending on your app.
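
The periodic-job approach can be sketched roughly as follows, assuming each record carries a last-modified timestamp (the `datelastmodified` name is borrowed from the reply quoted below) and that the job remembers the timestamp it reached on its previous run. The function name is made up for illustration.

```python
def select_changed(records, last_run):
    """Pick the records modified after `last_run`.

    `records` is an iterable of dicts with 'id' and 'datelastmodified' keys.
    Returns (changed_records, new_watermark); the watermark is stored by the
    job and passed back in as `last_run` on the next invocation.
    """
    changed = [r for r in records if r["datelastmodified"] > last_run]
    # If nothing changed, keep the old watermark.
    new_watermark = max((r["datelastmodified"] for r in changed), default=last_run)
    return changed, new_watermark
```

The changed records would then be converted to Solr documents, sent to the index and committed once at the end of the run.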


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Sun, Jul 31, 2011 at 02:32, Alexei Martchenko 
ale...@superdownloads.com.br wrote:

 I always have a field in my databases called datelastmodified, so whenever
 I
 update that record, i set it to getdate() - mssql func - and then get all
 latest records order by that field.

 2011/7/29 Mohammed Lateef Hussain mohammedlateefh...@gmail.com

  Hi
 
  Need some help in Solr incremental indexing approch.
 
  I have built my Solr index using SolrJ API and now want to update the
 index
  whenever any changes has been made in
  database. My requirement is not to use DB triggers to call any update
  events.
 
  I want to update my index on the fly whenever my application updates any
  record in database.
 
  Note: My indexing logic to get the required data from DB is some what
  complex and involves many tables.
 
  Please suggest me how can I proceed here.
 
  Thanks
  Lateef
 



 --

 *Alexei Martchenko* | *CEO* | Superdownloads
 ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
 5083.1018/5080.3535/5080.3533

Re: Index

2011-07-29 Thread Pranav Prakash

Every indexed document has to have a unique ID associated with it. You may
do a search by ID, something like

http://localhost:/solr/select?q=id:X

If you see a result, then the document has been indexed and is searchable.

You might also want to check Luke (http://code.google.com/p/luke) to gain
more insight about the index.
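
The same check can be scripted. The sketch below assumes the unique key field is called `id`, that the JSON response writer (wt=json) is enabled, and that the base URL of your Solr instance is passed in by the caller. The helper names are illustrative, and IDs containing double quotes would need escaping first.

```python
import json
from urllib.parse import urlencode


def build_id_query_url(base_url, doc_id):
    """Build a select URL asking only whether the given ID matches anything."""
    params = {"q": 'id:"%s"' % doc_id, "wt": "json", "rows": 0}
    return base_url + "/select?" + urlencode(params)


def is_indexed(response_body):
    """Return True if the Solr JSON response reports at least one match."""
    return json.loads(response_body)["response"]["numFound"] > 0
```

Fetch the URL with any HTTP client and pass the response body to is_indexed(). Keep in mind that a document becomes searchable only after a commit.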

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Fri, Jul 29, 2011 at 03:40, GAURAV PAREEK gauravpareek2...@gmail.com wrote:

 Yes NICK you are correct ?
 how can you check whether it has been indexed by solr, and is searchable?

 On Fri, Jul 29, 2011 at 3:27 AM, Nicholas Chase nch...@earthlink.net
 wrote:

  Do you mean, how can you check whether it has been indexed by solr, and
 is
  searchable?
 
    Nick
 
 
  On 7/28/2011 5:45 PM, GAURAV PAREEK wrote:
 
  Hi All,
 
  How we can check the particular;ar file is not INDEX in solr ?
 
  Regards,
  Gaurav

Re: Dealing with keyword stuffing

2011-07-29 Thread Pranav Prakash

Cool. I used SweetSpotSimilarity with default params and I see some
improvements. However, I can still see some of the 'stuffed' documents
coming up in the results. I feel that SweetSpotSimilarity alone is not
enough. Going through
http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf I gather
that there are other things - pivoted length normalization and term
frequency normalization - that need fine-tuning too.

Should I create a custom Similarity class that overrides all the default
behavior? I guess that should help me get more relevant results. Where
should I start? Please do not assume I know the less obvious things; I
am still learning !! :-)

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Thu, Jul 28, 2011 at 17:03, Gora Mohanty g...@mimirtech.com wrote:

 On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash pra...@gmail.com wrote:
 [...]
  I am not sure how to use SweetSpotSimilarity. I am googling on this, but
  any useful insights are so much appreciated.

 Replace the existing DefaultSimilarity class in schema.xml (look towards
 the bottom of the file) with the SweetSpotSimilarity class, e.g., have a
 line like:
  <similarity class="org.apache.lucene.search.SweetSpotSimilarity"/>

 Regards,
 Gora

Re: Dealing with keyword stuffing

2011-07-28 Thread Pranav Prakash

On Thu, Jul 28, 2011 at 08:31, Chris Hostetter hossman_luc...@fucit.org wrote:


 : Presumably, they are doing this by increasing tf (term frequency),
 : i.e., by repeating keywords multiple times. If so, you can use a custom
 : similarity class that caps term frequency, and/or ensures that the scoring
 : increases less than linearly with tf. Please see


In some cases, yes they are repeating keywords multiple times. Stuffing
different combinations - Solr, Solr Lucene, Solr Search, Solr Apache, Solr
Guide.
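
To illustrate the capped term frequency idea from the quote above: the sketch below assumes the default tf factor is the square root of the raw frequency and simply stops rewarding repetitions beyond a cap. The cap value and the function name are made up. In Solr this logic would live in a custom Similarity class written in Java; the Python here only shows the arithmetic.

```python
import math


def capped_tf(freq, cap=5):
    """Sub-linear tf factor that ignores repetitions beyond `cap`."""
    return math.sqrt(min(freq, cap))
```

With this factor, a document repeating a keyword 100 times gets the same tf contribution as one repeating it 5 times, so stuffing stops paying off.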



 In particular, using something like SweetSpotSimilarity tuned to know what
 values make sense for good content in your domain can be useful because
 it can actually penalize documents that are too short/long or have term
 freqs that are outside of a reasonable expected range.


I am not a Solr expert, but I was thinking in this direction. The ratio of
tokens/total_length would be nearer to 1 for a stuffed document, while it
would be nearer to 0 for a bogus document. Somewhere between the two lie the
documents that are more likely to be meaningful. I am not sure how to use
SweetSpotSimilarity. I am googling this, but any useful insights would be
much appreciated.

Custom Handler support in Solr-ruby

2011-06-28 Thread Pranav Prakash

Hi,

I found the solr-ruby gem (http://wiki.apache.org/solr/solr-ruby) really
inflexible in terms of specifying the handler. The Solr::Request::Select class
defines the handler as select, and all other classes inherit from this class.
And since the methods in Solr::Connection use one of the classes from
Solr::Request, I don't see a direct way to use a custom handler (which I
have made for MoreLikeThis). Currently, the approach I am using is to create
the query URL, do a CURL, parse the response and return it.

Even if I were to extend the classes, I'd end up making a new
Solr::Request::CustomSelect, which would be similar to Solr::Request::Select
except that the user can provide the handler, defaulting to 'select'. I would
then have to create different classes for DisMax and the rest, each derived
from Solr::Request::CustomSelect. Isn't this too much of an overhead? Or am I
missing something?

Also, where can I file bugs to solr-ruby?


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Index Version and Epoch Time?

2011-06-28 Thread Pranav Prakash

Hi,

I am not sure what the index number value is. It looks like an epoch time,
but in my case it points to one month back. However, I can see documents
that were added last week in the index.

Even after I did a commit, the index number did not change. Isn't it
supposed to change on every commit? If not, is there a way to look up the
last index time?

Also, this page
http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a
Replication Dashboard. How is this dashboard invoked? Is there any URL which
needs to be called?


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Re: Removing duplicate documents from search results

2011-06-28 Thread Pranav Prakash

I found the deduplication thing really useful. Although I have not yet
started to work on it, as there are some other low hanging fruits I've to
capture. Will share my thoughts soon.

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

2011/6/28 François Schiettecatte fschietteca...@gmail.com

Maybe there is a way to get Solr to reject documents that already exist in
the index, but I doubt it; maybe someone else can chime in here. You
could do a search for each document prior to indexing it to see if it is
already in the index, but that is probably non-optimal. Maybe it is easiest to
check if the document exists in your Riak repository: if not, add it and index
it, and drop it if it already exists.

François

On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:

I am making the hash from the URL, but I can't use this as the UniqueKey
because I am using a UUID as the UniqueKey.
Since I am using Solr as the index engine only and Riak (key-value
storage) as the storage engine, I don't want to overwrite on duplicates.
I just need to discard the duplicates.

2011/6/28 François Schiettecatte fschietteca...@gmail.com

Create a hash from the url and use that as the unique key, md5 or sha1
would probably be good enough.
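
A minimal sketch of that suggestion, assuming the URL is trimmed before hashing and the hex digest is stored as the key (or in a separate signature field). The helper name is made up, and any further URL normalization is left to the application.

```python
import hashlib


def url_key(url):
    """Derive a stable key from a document's source URL."""
    normalized = url.strip()  # assumed minimal normalization
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

Two articles with the same source URL then map to the same key, so the second one can either overwrite the first or be detected and discarded before indexing.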

Cheers

François

On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:

I also have the problem of duplicate docs.
I am indexing news articles. Every news article has a source URL.
If two news articles have the same URL, only one needs to be indexed,
i.e. duplicates should be removed at index time.

On 23 June 2011 21:24, simon mtnes...@gmail.com wrote:

have you checked out the deduplication process that's available at
indexing time ? This includes a fuzzy hash algorithm .

http://wiki.apache.org/solr/Deduplication

-Simon

On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash pra...@gmail.com
wrote:
This approach would definitely work if the two documents are *exactly* the
same. But this is very fragile. Even if one extra space has been added, the
whole hash would change. What I am really looking for is some %age
similarity between documents, and to remove those documents which are more
than 95% similar.
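
One common way to get such a percentage, sketched here as an illustration and not as what Solr's deduplication does internally, is to compare the sets of word shingles of two documents with Jaccard similarity. The function names and the shingle size are assumptions.

```python
def shingles(text, size=3):
    """Return the set of `size`-word shingles of a text, ignoring case and spacing."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Share of shingles the two texts have in common, between 0.0 and 1.0."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    return len(set_a & set_b) / len(set_a | set_b)
```

Because the text is split on whitespace first, an extra space no longer changes the result. Documents scoring above a chosen threshold (say 0.95) against an already-indexed document could be flagged as duplicates. Comparing every pair is expensive, so this only fits small candidate sets.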

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog
http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

On Thu, Jun 23, 2011 at 15:16, Omri Cohen o...@yotpo.com wrote:

What you need to do is calculate some hash (using any message digest
algorithm you want: md5, sha-1 and so on), then do some reading on Solr's
field collapse capabilities. It should not be too complicated.

*Omri Cohen*

Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 |
+972-3-6036295

My profiles: [image: LinkedIn] http://www.linkedin.com/in/omric
[image:
Twitter] http://www.twitter.com/omricohe [image:
WordPress]http://omricohen.me
Please consider your environmental responsibility. Before printing
this
e-mail message, ask yourself whether you really need a hard copy.
IMPORTANT: The contents of this email and any attachments are
confidential.
They are intended for the named recipient(s) only. If you have
received
this
email by mistake, please notify the sender immediately and do not
disclose
the contents to anyone or make copies thereof.

-- Forwarded message --
From: Pranav Prakash pra...@gmail.com
Date: Thu, Jun 23, 2011 at 12:26 PM
Subject: Removing duplicate documents from search results
To: solr-user@lucene.apache.org

How can I remove very similar documents from search results?

My scenario is that there are documents in the index which are almost
similar (people submitting the same stuff multiple times, sometimes different
people submitting the same stuff). Now when a search is performed for a
keyword, the same document quite frequently comes up multiple times in the
top N results. I want to remove those duplicate (or possible duplicate)
documents, very similar to what Google does when they say In order to show
you most relevant result, duplicates have been removed. How can I achieve
this functionality using Solr? Does Solr have a built-in feature or plugin
which could help me with it?

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog
http://blog.myblive.com

|
Google http://www.google.com/profiles/pranny

--
Thanks and Regards
Mohammad Shariq

Re: Index Version and Epoch Time?

2011-06-28 Thread Pranav Prakash

Hi,

I am facing multiple issues with Solr and I am not sure what happens in each
case. I am quite new to Solr and there are some scenarios I'd like to
discuss with you.

We have a huge volume of documents to be indexed, somewhere about 5 million.
We have a full indexer script, which essentially picks up all the documents
from the database and updates them in Solr, and an incremental script, which
adds new documents to Solr. The relevant areas of my config file go like this:

<unlockOnStartup>false</unlockOnStartup>
<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- Keep only optimized commit points -->
  <str name="keepOptimizedOnly">false</str>
  <!-- The maximum number of commit points to be kept -->
  <str name="maxCommitsToKeep">1</str>
</deletionPolicy>
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10</maxDocs>
  </autoCommit>
</updateHandler>
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://hostname:port/solr/core0/replication</str>
  </lst>
</requestHandler>

Sometimes the full indexer script breaks while adding documents to
Solr. The script adds the documents and then commits the operation. So, when
the script breaks, we have a large amount of data which has been updated but
not committed. Next, the incremental indexer script executes, figures out all
the new entries, and adds them to Solr. It works successfully and commits the
operation.

   - Will the commit by incremental indexer script also commit the
   previously uncommitted changes made by full indexer script before it broke?
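
The add-then-commit sequence described above can be sketched as follows. This
is only an illustration under stated assumptions: the helper names and the
`post` callable are hypothetical, and the rollback step is not something the
scripts in this thread are known to do. A commit in Solr is not scoped to the
client that sent the documents, so a later commit also makes earlier pending
adds visible. Note too that the config above sets autoCommit maxDocs to 10, so
most pending documents would already have been committed automatically.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build a Solr <add> XML message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def commit_message():
    """XML message that makes all pending changes visible."""
    return "<commit/>"


def rollback_message():
    """XML message that discards changes made since the last commit."""
    return "<rollback/>"


def index_batch(docs, post):
    """Send docs, then commit; ask Solr to roll back if anything fails.

    `post` is a hypothetical callable that POSTs one XML string to the
    update handler (e.g. http://hostname:port/solr/core0/update) and
    raises an exception on failure.
    """
    try:
        post(add_message(docs))
        post(commit_message())
        return True
    except Exception:
        # Rollback only discards what has not been (auto)committed yet;
        # if Solr itself is unreachable this call will raise as well.
        post(rollback_message())
        return False
```

With a wrapper like this, a failed full-index run would leave nothing pending
for the incremental script's commit to pick up, apart from whatever autoCommit
has already flushed.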

Sometimes, during execution, Solr's average response time (avg response time
for the last 10 requests, read from the log file) goes as high as 9000 ms (I am
still unclear why; any ideas on how to start hunting for the problem?), so the
watchdog process restarts Solr (because it causes requests to pile up
at the application server, which causes the app server to crash). In my local
environment, I performed the same experiment by adding docs to Solr, killing
the process and restarting it. I found that the changes were
applied and searchable, even though I had never committed them. Could you
explain how this happens, or is there a configuration that can
be adjusted for this? Also, what would the index state be if, after
restarting Solr, a commit is applied or is not applied?
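
For reference, the watchdog rule described above (average response time over
the last 10 requests) could look roughly like the sketch below. The class
name, the 9000 ms threshold and the choice to wait for a full window are
assumptions for illustration; the actual watchdog is not shown in this thread.

```python
from collections import deque


class ResponseTimeWatchdog:
    """Track the average response time over the most recent requests."""

    def __init__(self, window=10, threshold_ms=9000):
        # deque with maxlen drops the oldest sample automatically
        self.times = deque(maxlen=window)
        self.threshold_ms = threshold_ms

    def record(self, elapsed_ms):
        """Add one response time (in milliseconds) parsed from the log."""
        self.times.append(elapsed_ms)

    def average(self):
        """Mean of the samples currently in the window (0.0 if empty)."""
        return sum(self.times) / len(self.times) if self.times else 0.0

    def should_restart(self):
        # Judge only once the window is full, so a single slow request
        # right after startup does not trigger a restart.
        full = len(self.times) == self.times.maxlen
        return full and self.average() > self.threshold_ms
```

Logging the average alongside each restart decision would also give a
starting point for finding out when the slowdown begins.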

I'd be happy to provide any other information that might be needed.

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Tue, Jun 28, 2011 at 20:55, Shalin Shekhar Mangar shalinman...@gmail.com
 wrote:

 On Tue, Jun 28, 2011 at 4:18 PM, Pranav Prakash pra...@gmail.com wrote:

 
  I am not sure what is the index number value? It looks like an epoch time,
  but in my case, this points to one month back. However, i can see documents
  which were added last week, to be in the index.
 

 The index version shown on the dashboard is the time at which the most
 recent index segment was created. I'm not sure why it has a value older than
 a month if a commit has happened after that time.

 
  Even after I did a commit, the index number did not change? Isn't it
  supposed to change on every commit? If not, is there a way to look into the
  last index time?
 

 Yeah, it changes after every commit which added/deleted a document.


  Also, this page
  http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a
  Replication Dashboard. How is this dashboard invoked? Is there any URL which
  needs to be called?
 
 
 If you have configured replication correctly, the admin dashboard should
 show a Replication link right next to the Schema Browser link. The path
 should be /admin/replication/index.jsp

 --
 Regards,
 Shalin Shekhar Mangar.

Re: how to index data in solr form database automatically

2011-06-24 Thread Pranav Prakash

Cron is a time-based job scheduler in Unix-like computer operating systems.
en.wikipedia.org/wiki/Cron
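
As a rough sketch of how cron and the DataImportHandler fit together, a small
script can build the import URL and request it, and cron can run that script
on a schedule. The base URL, the /dataimport handler path and the function
names are assumptions here; the handler path is whatever is registered in
solrconfig.xml.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base_url, command="delta-import", handler="/dataimport"):
    """Build the DataImportHandler URL for the given command."""
    query = urlencode({"command": command})
    return f"{base_url.rstrip('/')}{handler}?{query}"


def trigger_import(base_url, command="delta-import"):
    """Request the import URL and return the HTTP status code."""
    with urlopen(dataimport_url(base_url, command)) as response:
        return response.status
```

A script that calls trigger_import("http://localhost:8983/solr") could then be
scheduled with a crontab entry, for example one that runs it every night.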

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Fri, Jun 24, 2011 at 12:26, Romi romijain3...@gmail.com wrote:

 Yeah i am using data-import to get data from database and indexing it. but
 what is cron can you please provide a link for it

 -
 Thanks & Regards
 Romi
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-index-data-in-solr-form-database-automatically-tp3102893p3103072.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Removing duplicate documents from search results

2011-06-23 Thread Pranav Prakash

How can I remove very similar documents from search results?

My scenario is that there are documents in the index which are almost
identical (people submitting the same content multiple times, sometimes
different people submitting the same content). Now when a search is performed
for a keyword, the same document quite frequently comes up multiple times in
the top N results. I want to remove those duplicate (or possible duplicate)
documents, very similar to what Google does when they say "In order to show
you most relevant result, duplicates have been removed." How can I achieve this
functionality using Solr? Does Solr have a built-in feature or plugin which
could help me with it?


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Re: Removing duplicate documents from search results

2011-06-23 Thread Pranav Prakash

This approach would definitely work if the two documents are *exactly* the
same, but it is very fragile. Even if one extra space has been added, the
whole hash would change. What I am really looking for is some percentage
similarity between documents, so that I can remove those documents which are
more than 95% similar.
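
To illustrate the difference, the sketch below contrasts an exact digest with
a simple similarity score based on word shingles and the Jaccard index. This
is a swapped-in technique chosen for illustration, not something Solr does out
of the box as far as this thread establishes; the function names, the shingle
size of 3 and the 0.95 threshold are assumptions.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def digest(text):
    """Exact MD5 fingerprint of the raw text: any change alters it."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word sequences from the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.95):
    """True when the two texts are more than `threshold` similar."""
    return similarity(a, b) > threshold
```

An extra space changes digest() but leaves similarity() at 1.0. Comparing
every pair of documents this way does not scale to millions of documents, so
in practice some fuzzy signature computed at index time would be needed.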

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

On Thu, Jun 23, 2011 at 15:16, Omri Cohen o...@yotpo.com wrote:

What you need to do, is to calculate some HASH (using any message digest
algorithm you want, md5, sha-1 and so on), then do some reading on solr
field collapse capabilities. Should not be too complicated..

*Omri Cohen*

Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295

My profiles: [image: LinkedIn] http://www.linkedin.com/in/omric [image:
Twitter] http://www.twitter.com/omricohe [image:
WordPress]http://omricohen.me

-- Forwarded message --
From: Pranav Prakash pra...@gmail.com
Date: Thu, Jun 23, 2011 at 12:26 PM
Subject: Removing duplicate documents from search results
To: solr-user@lucene.apache.org

How can I remove very similar documents from search results?

My scenario is that there are documents in the index which are almost
similar (people submitting same stuff multiple times, sometimes different
people submitting same stuff). Now when a search is performed for keyword,
in the top N results, quite frequently, same document comes up multiple
times. I want to remove those duplicate (or possible duplicate) documents.
Very similar to what Google does when they say In order to show you most
relevant result, duplicates have been removed. How can I achieve this
functionality using Solr? Does Solr has an implied or plugin which could
help me with it?

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny

Questions about Solr MLTHandler, performance, Indexes

2011-06-20 Thread Pranav Prakash

Hi folks,

I am new to Solr, and using it for a web application. I have been
experimenting with it and have a couple of questions which I was unable to
resolve through Google. Our portal allows users to upload content, and the
fields we use are title, description, transcript and tags. Each piece of
content also has counts of hits, downloads and favorites, and an automatically
calculated rating. We have a master/slave configuration (1 master, 2 slaves).

Solr version: 1.4.0
Java version 1.6.0_16
Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
32GiB RAM and 8 Core
Index Size: ~100 GiB


One of my use cases is to find related documents given a document ID. I
have been using MoreLikeThisHandler to generate related documents, using a
DisMax query. Now, I have to filter out certain content from the results
Solr gives me. So, if for a document ID X Solr returns a list of 20
related documents, I want to apply a filter so that these 20 documents do
not contain blacklisted words. This is fairly straightforward in a
direct query using the NOT operator. How can I implement similar
behavior in MoreLikeThisHandler?
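
One possible direction, sketched under assumptions, is to send a negative
filter query (fq) along with the MoreLikeThis request. Whether the handler in
Solr 1.4.0 applies fq to the related documents should be verified before
relying on it. The /mlt handler path, the `id` unique key, the catch-all
`text` field and the function name are all hypothetical.

```python
from urllib.parse import urlencode


def mlt_url(base_url, doc_id, blacklist, rows=20):
    """Build a MoreLikeThis request URL that excludes blacklisted words."""
    # Negative filter query: drop any candidate containing a listed word
    excluded = " OR ".join(blacklist)
    params = [
        ("q", f"id:{doc_id}"),
        ("mlt.fl", "title,description,transcript,tags"),
        ("fq", f"-text:({excluded})"),
        ("rows", rows),
    ]
    return f"{base_url.rstrip('/')}/mlt?{urlencode(params)}"
```

If fq turns out not to be honored, the same exclusion can be applied by
requesting more than 20 rows and filtering the list in the application.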

Every week, we perform a full index of all the documents, along with
a nightly incremental indexing. This is done by a script which reads data
from MySQL and updates it in Solr. Sometimes the script
fails after updating 60% of the documents, before a commit has been performed.
The next cron job then executes, adds some more documents and commits
them. Will this commit include the last uncommitted updates as well as the
current update? Are those uncommitted changes (which are stored
in a temp file) deleted after some time? Is there a way to clean uncommitted
changes?

Of late, Solr has started to perform slowly. When Solr is started it is
quick and responds to requests in ~100 ms. Gradually (very gradually) it
reaches a point where the avg response time of the last 10 queries goes beyond
5000 ms, and that is when requests start to pile up. As I am composing this
mail, an optimize command is being executed, which I hope will help, but to
what extent, I will need to see.

Finally, what happens if the schemas of master and slave are different (there
exists a field in master which does not exist in slave)? I thought that
replication would show me some kind of error, but it went on successfully.

Thanks,

Pranav
