omitTermFreq only?

2011-07-13 Thread Jibo John
Hello,

I was wondering if there is a way we can omit only the Term Frequency in Solr?

omitTermFreqAndPositions=true wouldn't work for us, since we need the positions
for supporting phrase queries.

Thanks,
-Jibo



Re: omitTermFreq only?

2011-07-13 Thread Jibo John
Sorry, I should have made the objectives clear. The goal is to reduce the index
size by avoiding the term frequency data stored in the index (in the .frq
segment files).

After exploring a bit more, I realized that LUCENE-2048 now allows
omitPositions. Similarly, I'm looking for an omitFrequency option.

Thanks,
-Jibo


On Jul 13, 2011, at 1:34 PM, Markus Jelsma wrote:

 A dirty hack is to return 1.0f for each tf > 0. Just a couple of lines of
 code for a custom similarity class.
 
 Hello,
 
 I was wondering if there is a way we can omit only the Term Frequency in
 solr?
 
 omitTermFreqAndPositions =true wouldn't work for us since we need the
 positions for supporting phrase queries.
 
 Thanks,
 -Jibo
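Markus's hack can be sketched as follows. This is an illustrative, self-contained stand-in: in a real Lucene 2.x/3.x setup the method would override DefaultSimilarity.tf(float) in a custom Similarity class, which is assumed here rather than shown with the actual Lucene dependency.

```java
// Illustrative sketch of the "dirty hack" above: clamp the term-frequency
// factor so any tf > 0 scores as if the term occurred exactly once.
public class FlatTfSimilarity {
    public static float tf(float freq) {
        // DefaultSimilarity returns sqrt(freq); flattening it to 1.0f removes
        // term frequency's influence on ranking, while positions (and hence
        // phrase queries) keep working, since only scoring is affected.
        return freq > 0 ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        System.out.println(tf(7f)); // same factor no matter how often the term occurs
    }
}
```

Note that this changes ranking only; the raw frequencies are still written to the .frq files, so it does not by itself address the index-size goal in the follow-up message.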



FieldType for storing date

2010-09-24 Thread Jibo John
Hello,

I was wondering what would be the best FieldType for storing a date with
millisecond precision that would allow me to sort and run range queries
against this field.
We would like to achieve the best query performance, minimal heap
(fieldcache) requirements, good indexing throughput, and minimal index size,
in that order.

To give you some background, we have a production system that runs in a
multicore setup, each core with a maximum index size of 6G, and the search
and indexing operations occur against the same cores. We store the date with
minute precision (format yymmddhhmm), and we use TrieIntField with
precisionStep=1. This works well; however, as a next step, we want to store
the date with millisecond precision with minimal architectural changes.

We could probably use TrieLongField; however, as we understand it, this
doubles the heap requirements for the fieldcache. I was wondering if there
is a clever way of achieving this without adding to the heap.

Appreciate your input.

Thanks,
-Jibo
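The doubling mentioned above can be sanity-checked with simple arithmetic. This is an illustrative sketch, not Solr code: the Lucene FieldCache holds roughly one primitive per document for a numeric sort field, so an int-based minute-precision field costs 4 bytes per document and a long-based millisecond field costs 8.

```java
// Back-of-the-envelope sketch (illustrative, not Solr API): FieldCache heap
// cost for a numeric sort field is roughly one primitive per document.
public class FieldCacheHeap {
    public static long bytesForInts(long numDocs)  { return numDocs * 4; } // e.g. TrieIntField
    public static long bytesForLongs(long numDocs) { return numDocs * 8; } // e.g. TrieLongField

    public static void main(String[] args) {
        long docs = 50_000_000L; // hypothetical corpus size
        System.out.println("int  cache: ~" + bytesForInts(docs) / (1 << 20) + " MB");
        System.out.println("long cache: ~" + bytesForLongs(docs) / (1 << 20) + " MB");
    }
}
```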


parsedquery becomes PhraseQuery

2009-12-16 Thread Jibo John
Hello,

I have a question on how Solr determines whether the q value needs to be
analyzed as a regular query or as a phrase query.

Let's say I have a text 'jibojohn info disk/1.0'.

If I query for 'jibojohn info', I get the results. The query is parsed as:

  <str name="rawquerystring">jibojohn info</str>
  <str name="querystring">jibojohn info</str>
  <str name="parsedquery">+data:jibojohn +data:info</str>
  <str name="parsedquery_toString">+data:jibojohn +data:info</str>

However, if I query for 'disk/1.0', I get nothing. The query is parsed as:

  <str name="rawquerystring">disk/1.0</str>
  <str name="querystring">disk/1.0</str>
  <str name="parsedquery">PhraseQuery(data:"disk 1 0")</str>
  <str name="parsedquery_toString">data:"disk 1 0"</str>

I was expecting this to be treated as a regular query instead of a phrase
query, and I was wondering why.

Appreciate your input.

-Jibo
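A likely explanation, sketched below with a stand-in tokenizer (not the actual Solr analysis chain): the field's analyzer splits 'disk/1.0' on punctuation into several tokens, and when a single query term analyzes to multiple tokens the Lucene query parser builds a PhraseQuery from them, which matches only if the same adjacent tokens were produced at index time.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for analyzer behaviour on punctuation (illustrative only): one
// query term that splits into several tokens is what makes the query parser
// emit a PhraseQuery instead of a TermQuery.
public class QueryTokenSplit {
    public static List<String> tokenize(String term) {
        return Arrays.asList(term.split("[/.]")); // split on '/' and '.'
    }

    public static void main(String[] args) {
        System.out.println(tokenize("disk/1.0")); // three tokens -> PhraseQuery "disk 1 0"
        System.out.println(tokenize("jibojohn")); // one token -> plain TermQuery
    }
}
```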






Invoke expungeDeletes using SolrJ's SolrServer.commit()

2009-10-02 Thread Jibo John

Hello,

I know I can invoke expungeDeletes using the update handler (curl update -F
stream.body='<commit expungeDeletes="true"/>'), however, I was wondering if
it is possible to invoke it using SolrJ.


It looks like, currently, there are no SolrServer.commit(..) methods  
that I can use for this purpose.


Any input will be helpful.


Thanks,
-Jibo




Re: Invoke expungeDeletes using SolrJ's SolrServer.commit()

2009-10-02 Thread Jibo John

Created jira issue https://issues.apache.org/jira/browse/SOLR-1487

Thanks,
-Jibo

On Oct 2, 2009, at 2:17 PM, Shalin Shekhar Mangar wrote:


On Sat, Oct 3, 2009 at 1:35 AM, Jibo John jiboj...@mac.com wrote:


Hello,

I know I can invoke expungeDeletes using the update handler (curl update -F
stream.body='<commit expungeDeletes="true"/>'), however, I was wondering if
it is possible to invoke it using SolrJ.

It looks like, currently, there are no SolrServer.commit(..) methods that I
can use for this purpose.

Any input will be helpful.



You are right. Please create an issue. We need this in 1.4.

--
Regards,
Shalin Shekhar Mangar.
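Until the commit(expungeDeletes) overload tracked by SOLR-1487 exists, one workaround is to POST the same commit XML that the curl example sends. The sketch below only builds that request body; the HTTP call itself (endpoint URL, client library) is omitted because it would be deployment-specific.

```java
// Minimal sketch: construct the commit message the update handler expects.
// Sending it (e.g. with an HTTP client against /update) is left out here.
public class ExpungeCommitBody {
    public static String body(boolean expungeDeletes) {
        return "<commit expungeDeletes=\"" + expungeDeletes + "\"/>";
    }

    public static void main(String[] args) {
        System.out.println(body(true)); // <commit expungeDeletes="true"/>
    }
}
```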




Re: Problem changing the default MergePolicy/Scheduler

2009-09-28 Thread Jibo John



On Sep 27, 2009, at 9:42 PM, Shalin Shekhar Mangar wrote:


On Mon, Sep 28, 2009 at 2:59 AM, Jibo John jiboj...@mac.com wrote:

Additionally, I get the same exception even if I declare the mergePolicy
in the mainIndex.

<mainIndex>
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <boolean name="calibrateSizeByDeletes">true</boolean>
  </mergePolicy>
</mainIndex>



That should be <bool> instead of <boolean>.



Yeah, that was it. Thank you very much.

Thanks,
-Jibo




--
Regards,
Shalin Shekhar Mangar.
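Putting the correction together, the working snippet (reconstructed from this exchange) would read:

```xml
<mainIndex>
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <bool name="calibrateSizeByDeletes">true</bool>
  </mergePolicy>
</mainIndex>
```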




Re: Problem changing the default MergePolicy/Scheduler

2009-09-27 Thread Jibo John
Thanks for this. I've updated trunk/ and rebuilt solr.war; however, I'm
running into another issue.


Sep 27, 2009 1:55:44 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.IllegalArgumentException
	at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:592)
	at org.apache.solr.util.SolrPluginUtils.invokeSetters(SolrPluginUtils.java:989)
	at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:87)
	at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:185)
	at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
	at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)


Also, the log file has a bunch of these lines:

Sep 27, 2009 1:55:56 PM org.apache.solr.update.SolrIndexWriter finalize
SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates  
a bug -- POSSIBLE RESOURCE LEAK!!!



Here is the snippet from my solrconfig.xml


<indexDefaults>
  <!-- Values here affect all index writers and act as a default unless overridden. -->
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>6</mergeFactor>
  <ramBufferSizeMB>500</ramBufferSizeMB>
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <boolean name="calibrateSizeByDeletes">true</boolean>
  </mergePolicy>
  <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
  <lockType>single</lockType>
</indexDefaults>




Thanks,
-Jibo



On Sep 27, 2009, at 1:43 PM, Shalin Shekhar Mangar wrote:


On Mon, Sep 28, 2009 at 1:18 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:


On Sat, Sep 26, 2009 at 7:13 AM, Jibo John jiboj...@mac.com wrote:


Hello,

It looks like solr is not allowing me to change the default
MergePolicy/Scheduler classes.

Even if I change the default MergePolicy/Scheduler (LogByteSizeMergePolicy
and ConcurrentMergeScheduler) defined in solrconfig.xml to a different one
(LogDocMergePolicy and SerialMergeScheduler), my profiler shows the default
classes are still being loaded.

Also, if I use the default LogByteSizeMergePolicy, I can't seem to override
'calibrateSizeByDeletes' to 'true' in solrconfig using the new syntax that
was introduced this week (SOLR-1447).

I'm using the version checked out from trunk yesterday.

Any pointers will be helpful.


Specifying mergePolicy and mergeScheduler in indexDefaults does not work in
trunk. If you specify them in the mainIndex section, it will work. I'll
give a patch with a fix.



This is fixed in trunk now. Thanks!

--
Regards,
Shalin Shekhar Mangar.




Re: Problem changing the default MergePolicy/Scheduler

2009-09-27 Thread Jibo John
Additionally, I get the same exception even if I declare the mergePolicy
in the mainIndex.

<mainIndex>
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <boolean name="calibrateSizeByDeletes">true</boolean>
  </mergePolicy>
</mainIndex>

Thanks,
-Jibo








Problem changing the default MergePolicy/Scheduler

2009-09-25 Thread Jibo John

Hello,

It looks like Solr is not allowing me to change the default
MergePolicy/Scheduler classes.


Even if I change the default MergePolicy/Scheduler (LogByteSizeMergePolicy
and ConcurrentMergeScheduler) defined in solrconfig.xml to a different one
(LogDocMergePolicy and SerialMergeScheduler), my profiler shows the default
classes are still being loaded.


Also, if I use the default LogByteSizeMergePolicy, I can't seem to override
'calibrateSizeByDeletes' to 'true' in solrconfig using the new syntax that
was introduced this week (SOLR-1447).


I'm using the version checked out from trunk yesterday.

Any pointers will be helpful.

Thanks,
-Jibo


How to leverage the LogMergePolicy calibrateSizeByDeletes patch in Solr ?

2009-09-17 Thread Jibo John

Hello,

Came across a Lucene patch (http://issues.apache.org/jira/browse/LUCENE-1634)
that would consider the number of deleted documents as a criterion when
deciding which segments to merge.


Since we expect to have very frequent deletes, we hope this would help
reclaim the space consumed by the deleted documents much more efficiently.


Currently, we can specify a merge policy in solrconfig.xml like this:

<!-- <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy> -->

However, by default, calibrateSizeByDeletes = false in LogMergePolicy.

I was wondering if there is a way I can modify calibrateSizeByDeletes just
by configuration?


Thanks,
-Jibo







Re: Solr 1.4 Replication scheme

2009-08-14 Thread Jibo John
Slightly off topic: one question on the index file transfer mechanism used
in the new 1.4 replication scheme.
Is my understanding correct that the transfer is over HTTP? (vs. rsync in
the script-based snappuller)


Thanks,
-Jibo


On Aug 14, 2009, at 6:34 AM, Yonik Seeley wrote:


Longer term, it might be nice to enable clients to specify what
version of the index they were searching against.  This could be used
to prevent consistency issues across different slaves, even if they
commit at different times.  It could also be used in distributed
search to make sure the index didn't change between phases.

-Yonik
http://www.lucidimagination.com



2009/8/14 Noble Paul നോബിള്‍  नोब्ळ्  
noble.p...@corp.aol.com:
On Fri, Aug 14, 2009 at 2:28 PM, KaktuChakarabati jimmoe...@gmail.com wrote:


Hey Noble,
you are right in that this will solve the problem, however it implicitly
assumes that commits to the master are infrequent enough (so that most
polling operations yield no update and only every few polls lead to an
actual commit).
This is a relatively safe assumption in most cases, but one that couples
the master update policy with the performance of the slaves - if the master
gets updated (and committed to) frequently, slaves might face a commit on
every 1-2 polls, much more than is feasible given new-searcher warmup times.
In effect, what this comes down to, it seems, is that I must make the
master commit frequency the same as I'd want the slaves to use - and this
is markedly different from the previous behaviour, with which I could have
the master get updated (+committed to) at one rate and slaves committing
those updates at a different rate.
I see the argument. But isn't it better to keep both the master and slave
as consistent as possible? There is no use in committing on the master if
you do not plan to search on those docs. So the best thing to do is to
commit only as frequently as you wish to commit on a slave.

On a different track, if we can have an option of disabling commit after
replication, is it worth it? So the user can trigger a commit explicitly.




Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:


usually the pollInterval is kept to a small value like 10 secs. There is no
harm in polling more frequently. This can ensure that the replication
happens at almost the same time.




On Fri, Aug 14, 2009 at 1:58 PM, KaktuChakarabati jimmoe...@gmail.com wrote:


Hey Shalin,
thanks for your prompt reply.
To clarity:
With the old script-based replication, I would snappull every x minutes
(say, on the order of 5 minutes).
Assuming no index optimize occurred (I optimize 1-2 times a day, so we can
disregard it for the sake of argument), the snappull would take a few
seconds to run on each iteration.
I then have a crontab on all slaves that runs snapinstall at a fixed time,
let's say every 15 minutes from the start of a round hour, inclusive (slave
machine times are synced, e.g. via ntp), so that essentially all slaves
will begin a snapinstall at exactly the same time - assuming uniform load
and the fact they all have at this point in time the same snapshot, since
I snappull frequently - this leads to a fairly synchronized replication
across the board.

With the new replication, however, it seems that by binding the pulling and
installing together, as well as specifying the timing in deltas only (as
opposed to absolute-time based, like in crontab), we've essentially made it
impossible to effectively keep multiple slaves up to date and synchronized;
e.g. if we set the poll interval to 15 minutes, a slight offset in the
startup times of the slaves (which can very much be the case for arbitrary
resets/maintenance operations) can lead to deviations in snappull(+install)
times. This in turn is further made worse by the fact that the pollInterval
is then computed based on the offset of when the last commit *finished* -
and this number seems to have a higher variance, e.g. due to warmup, which
might differ across machines based on the queries they've handled
previously.

To summarize, it seems to me like it might be beneficial to introduce a
second parameter that acts more like a crontab time-based tableau, insofar
as it can enable a user to specify when an actual commit should occur - so
then we can have the pollInterval set to a low value (e.g. 60 seconds) but
specify to only perform a commit at the 0, 15, 30, and 45-minute marks of
every hour. This makes the commit times on the slaves fairly deterministic.


Does this make sense, or am I missing something with the current in-process
replication?

Thanks,
-Chak


Shalin Shekhar Mangar wrote:


On Fri, Aug 14, 2009 at 8:39 AM, KaktuChakarabati jimmoe...@gmail.com wrote:



In the old replication, I could snappull with multiple slaves
asynchronously but perform the snapinstall on each at the same time (+-
epsilon seconds), so that way production load-balanced query serving will
always be consistent.
consistent.

With the new system 
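The fixed wall-clock schedule proposed above - poll frequently, but only commit at the 0/15/30/45-minute marks - comes down to a small piece of schedule arithmetic. This sketch is illustrative only and is not part of Solr's replication handler:

```java
// Illustrative sketch of an absolute-time commit schedule: every slave
// polls often, but a commit is only triggered at fixed quarter-hour marks,
// so commit times line up across slaves regardless of when each started.
public class QuarterHourSchedule {
    // Minutes to wait until the next 0/15/30/45 boundary.
    public static int minutesToNextBoundary(int minutePastHour) {
        int rem = minutePastHour % 15;
        return rem == 0 ? 0 : 15 - rem;
    }

    public static void main(String[] args) {
        System.out.println(minutesToNextBoundary(7));  // a slave at :07 waits 8 minutes
        System.out.println(minutesToNextBoundary(30)); // a slave at :30 commits now
    }
}
```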

Re: Storing string field in solr.ExternalFieldFile type

2009-07-23 Thread Jibo John

Thanks for the response, Erick.

We have seen that the size of the index has a direct impact on search
speed, especially when the index size is in GBs, so we are trying all
possible ways to keep the index size as low as we can.

We thought the solr.ExternalFileField type would help keep the index size
low by storing a text field outside of the index.

Here's what we were planning: initially, all the fields except the
solr.ExternalFileField field will be queried and displayed to the end user.
There will be subsequent calls from the UI to pull the
solr.ExternalFileField field, which will be loaded in a lazy manner.

However, we realized that solr.ExternalFileField only supports the float
type, whereas the data that we're planning to keep as an external field is
a string type.


Thanks,
-Jibo



On Jul 22, 2009, at 1:46 PM, Erick Erickson wrote:


 Hoping the experts chime in if I'm wrong, but...
 As far as I know, while storing a field increases the size of an index, it
 doesn't have much impact on the search speed. Which you could pretty
 easily test by creating the index both ways, firing off some timing
 queries, and comparing. Although it would be time consuming...

 I believe there's some info on the Lucene Wiki about this, but my memory
 isn't what it used to be.

Erick


On Tue, Jul 21, 2009 at 2:42 PM, Jibo John jiboj...@mac.com wrote:


We're in the process of building a log searcher application.

In order to reduce the index size to improve the query performance, we're
exploring the possibility of having:

1. One field for each log line with 'indexed=true & stored=false' that
will be used for searching
2. Another field for each log line of type solr.ExternalFileField that
will be used only for display purposes.

We realized that currently solr.ExternalFileField supports only the float
type.

Is there a way we can override this to support the string type? Any issues
with this approach?

Any ideas are welcome.


Thanks,
-Jibo







Re: Storing string field in solr.ExternalFieldFile type

2009-07-23 Thread Jibo John

Thanks for the quick response, Otis.

We have been able to achieve a ratio of 2 with different settings; however,
considering the huge volume of data that we need to deal with - 600 GB of
data per day, which we need to keep in the index for 3 days - we're looking
at all possible ways to reduce the index size further.
We will definitely keep exploring the straightforward things and see if we
can find a better setting.



Thanks,
-Jibo

On Jul 23, 2009, at 9:49 AM, Otis Gospodnetic wrote:

I'm not sure if there is a lot of benefit from storing the literal  
values in that external file vs. directly in the index.  There are a  
number of things one should look at first, as far as performance is  
concerned - JVM settings, cache sizes, analysis, etc.


For example, I have one index here that is 9 times the size of the original
data because of how its fields are analyzed. I can change one
analysis-level setting and make that ratio go down to 2. So I'd look at
other, more straightforward things first. There is a Wiki page, either on
the Solr or Lucene Wiki, dedicated to various search performance tricks.


Otis
--
Sematext is hiring: http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR












Storing string field in solr.ExternalFieldFile type

2009-07-21 Thread Jibo John

We're in the process of building a log searcher application.

In order to reduce the index size to improve the query performance,  
we're exploring the possibility of having:


 1. One field for each log line with 'indexed=true & stored=false' that
will be used for searching
 2. Another field for each log line of type solr.ExternalFileField that
will be used only for display purposes.


We realized that currently solr.ExternalFileField supports only the float
type.

Is there a way we can override this to support the string type? Any issues
with this approach?


Any ideas are welcome.


Thanks,
-Jibo
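One way to prototype the string-valued variant asked about above: keep the display-only text in a docKey=value sidecar file (the same line shape ExternalFileField uses for its float values) and look strings up at render time. The file layout and lookup step below are assumptions for illustration, not Solr API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: parse "docKey=value" lines into a map so a UI layer
// can fetch the raw log line for a document without it living in the index.
public class ExternalStringLookup {
    public static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> byKey = new HashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq > 0) { // skip malformed lines with no key
                byKey.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return byKey;
    }

    public static void main(String[] args) {
        Map<String, String> m = parse(Arrays.asList(
                "doc1=full raw log line for doc1",
                "doc2=another line stored outside the index"));
        System.out.println(m.get("doc1"));
    }
}
```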