RE: highlighting performance poor with *.tar, *.gz files

2011-11-25 Thread Shyam Bhaskaran
Hi Eric,

Thanks for the response.

I am already using termVectors with offsets & positions enabled, i.e. the field
is defined with termVectors="true", termPositions="true" and termOffsets="true".
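
For reference, a field configured that way in schema.xml would look roughly like
this (the field name and type here are placeholders, not from the original mail):

    <field name="content" type="text" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true"/>
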
I am indexing FAQ content. Some of these FAQs have attachments linked to them,
and these attachments include PDF, DOC, *.TAR and *.GZIP files that contain
additional information related to the FAQ; all of this content is indexed.
But while searching and highlighting, it is observed that for archived files
like *.gz, *.tar, *.zip the search performance degrades, and using the debug
flag I can see that most of the time is spent highlighting these *.gz, *.tar,
*.zip archived files.

What could be the reason behind it? Is it because these files are unzipped and
then highlighted from the index at display time?

Is the highlighting dependent on file size? What I mean is: if the file size is
larger, does search performance degrade because of the highlighting?

I have tried reducing the maxAnalyzedChars value from 5MB to 1MB but still do
not see any significant improvement in search and highlighting for these
kinds of files.
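
For reference, this limit can also be set per request via the
hl.maxAnalyzedChars parameter; a sketch, with hypothetical query and field
names:

    http://localhost:8983/solr/select?q=foo&hl=true&hl.fl=content&hl.maxAnalyzedChars=1048576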

Let me know if you can suggest any workaround for improving highlighting and
search performance for these kinds of files, or for files with a large file
size in general.


Thanks
Shyam

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Saturday, November 26, 2011 8:57 AM
To: solr-user@lucene.apache.org
Subject: Re: highlighting performance poor with *.tar, *.gz files

Highlighting is dependent on the size of the
data being fed through the highlighter. Unless you have
termVectors & offsets & positions enabled, the text
must be re-analyzed, see:
http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=%28termvector%29%7C%28retrieve%29%7C%28contents%29

But highlighting compressed files seems like an odd
use-case, what is the business reason you need to do this?

Best
Erick

On Thu, Nov 24, 2011 at 10:28 AM, Shyam Bhaskaran
 wrote:
> Hi,
>
> It is observed that highlighting of search results takes too much time,
> especially when highlighting terms for archived files like *.gz, *.tar, *.zip.
> What could be the reason behind it? Is it because these files are unzipped
> and then highlighted from the index at display time?
> Or is it dependent on the size of the file? Is there any way to improve
> search & highlighter performance for these kinds of archived files
> (*.tar, *.zip etc.)?
>
> Let me know if there is any workaround for improving highlighting and
> search performance for these kinds of files.
>
> -Shyam
>


Re: inconsistent JVM crash with version 4.0-SNAPSHOT

2011-11-25 Thread Erick Erickson
Don't know if it's this particular issue, but have you seen:
https://issues.apache.org/jira/browse/LUCENE-3588

Best
Erick

On Fri, Nov 25, 2011 at 4:59 PM, Justin Caratzas
 wrote:
> Lasse Aagren  writes:
>
>> Hi,
>>
>> We are running Solr-Lucene 4.0-SNAPSHOT (1199777M - hudson - 2011-11-09
>> 14:58:50) on several servers running:
>>
>> 64bit Debian Squeeze (6.0.3)
>> OpenJDK6 (b18-1.8.9-0.1~squeeze1)
>> Tomcat 6.0.28 (6.0.28-9+squeeze1)
>>
>> Some of the servers have 48G RAM, in which case Java has 16G (-Xmx16g),
>> and some of the servers have 96G RAM, in which case Java has 48G
>> (-Xmx48G).
>>
>> We are seeing some inconsistent crashes of tomcat's JVM under different 
>> Solr/Lucene operations/circumstances. Sadly we can't replicate it.
>>
>> It doesn't happen often, but often enough that we can't rely on it in 
>> production.
>>
>> When it happens, something like the following appears in the logs:
>>
>> ==
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  SIGSEGV (0xb) at pc=0x7f6c318d0902, pid=16516, tid=139772378892032
>> #
>> # JRE version: 6.0_18-b18
>> # Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 )
>> # Derivative: IcedTea6 1.8.9
>> # Distribution: Debian GNU/Linux 6.0.2 (squeeze), package 
>> 6b18-1.8.9-0.1~squeeze1
>> # Problematic frame:
>> # j  
>> org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(Lorg/apache/lucene/index/IndexReader$AtomicReaderContext;Lorg/apache/lucene/util/Bits;)Lorg/apache/lucene/search/DocIdSet;+193
>> #
>> # An error report file with more information is saved as:
>> # /tmp/hs_err_pid16516.log
>> #
>> # If you would like to submit a bug report, please include
>> # instructions how to reproduce the bug and visit:
>> #   http://icedtea.classpath.org/bugzilla
>> #
>> ==
>>
>> Every time it happens the problematic frame is:
>>
>> Problematic frame:
>> # j  
>> org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(Lorg/apache/lucene/index/IndexReader$AtomicReaderContext;Lorg/apache/lucene/util/Bits;
>> )Lorg/apache/lucene/search/DocIdSet;+193
>>
>> And /tmp/hs_err_pid16516.log is attached to this mail.
>>
>> Has anyone seen this before?
>>
>> Please don't hesitate to ask for further specification about our setup.
>>
>> Best regards,
>
> I seem to remember that a recent Java release fixed seemingly random
> SIGSEGVs that caused Solr/Lucene to crash non-deterministically.
>
> http://lucene.apache.org/solr/#26+October+2011+-+Java+7u1+fixes+index+corruption+and+crash+bugs+in+Apache+Lucene+Core+and+Apache+Solr
>
> Hopefully this will provide you with some answers. If not, please let
> the list know.
>
> justin
>
>


Re: remove answers with identical scores

2011-11-25 Thread Erick Erickson
Have you considered removing them at index time? See:
http://wiki.apache.org/solr/Deduplication
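
For reference, that wiki page configures deduplication as an update processor
chain in solrconfig.xml, roughly like this (the signature field and source
fields below are placeholders):

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">id</str>
        <bool name="overwriteDupes">true</bool>
        <str name="fields">name,features,cat</str>
        <str name="signatureClass">solr.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

For the "nearly identical" case, TextProfileSignature is the fuzzier
alternative to Lookup3Signature.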

Best
Erick

On Fri, Nov 25, 2011 at 3:13 PM, Ted Dunning  wrote:
> See http://en.wikipedia.org/wiki/Locality-sensitive_hashing
>
> The obvious thought that I had just after hitting send was that you could
> put the LSH signatures on the documents.  That would let you do the scan at
> low volume and using LSH would make the duplicate scan almost as fast as
> your score scan idea.
>
> Whether Solr will do this for you is really neither here nor there.  Solr
> does an awful lot of stuff for an awful lot of people who find it very
> congenial.  They probably don't have lots of duplicate documents.  If you
> really think that this capability is core, then you can contribute an
> implementation to Solr and all will be made whole.  In the short-term, I
> would recommend you prototype independently.
>
> On Fri, Nov 25, 2011 at 4:47 AM, Fred Zimmerman wrote:
>
>> thanks.  i did consider postprocessing and may wind up doing that, i was
>> hoping there was a way to have Solr do it for me! That I have to ask this
>> question is probably not a good sign, but what is LSH clustering?
>>
>> On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning 
>> wrote:
>>
>> > You can do that pretty easily by just retrieving extra documents and post
>> > processing the results list.
>> >
>> > You are likely to have a significant number of apparent duplicates this
>> > way.
>> >
>> > To really get rid of duplicates in results, it might be better to remove
>> > them from the corpus by deploying something like LSH clustering.
>> >
>> > On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman > > >wrote:
>> >
>> > > I have a corpus that has a lot of identical or nearly identical
>> > documents.
>> > > I'd like to return only the unique ones (excluding the "nearly
>> identical"
>> > > which are redirects).  I notice that all the identical/nearly
>> identicals
>> > > have identical Solr scores. How can I tell Solr to  throw out all the
>> > > successive documents in an answer set that have identical scores?
>> > >
>> > > doc 1 score 5.0
>> > > doc 2  score 5.0
>> > > doc 3 score 5.0
>> > > doc 4 score 4.9
>> > >
>> > > skip docs 2 and 3
>> > >
>> > > bring back 10 docs with unique scores
>> > >
>> >
>>
>


Re: Index a null text field

2011-11-25 Thread Erick Erickson
Are you committing after the run?

Best
Erick

On Fri, Nov 25, 2011 at 1:32 PM, Young, Cody  wrote:
> I don't see anything wrong so far other than a typo here (missing a p in
> the second price):
> <field column="..." name="lastbid_rice" />
>
>  Can you see if there are any warnings in the log about documents not
> being able to be created?
>
> Also, you should have a field type definition for text in your schema.
> It will look something like
>
>    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      ...
>
> Can you send the full field type definition along as well?
>
> You can also try running a query like:
> ?q=keyword_stock:[* TO *]
> That will return any documents where keyword_stock is populated.
>
> Thanks,
> Cody
>
> -Original Message-
> From: jawedshamshedi [mailto:jawedshamsh...@gmail.com]
> Sent: Thursday, November 24, 2011 9:42 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Index a null text field
>
> Hi Cody,
>
> Thanks for the reply.
>
> Please find the details of what I am doing.
>
> Yes, I am using dataimport handler and the code snippet of it from
> solrconfig.xml is given below.
>
> <requestHandler name="/dataimport"
>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">data-config.xml</str>
>   </lst>
> </requestHandler>
>
> The data-config.xml is given below.
>
> <dataConfig>
>   <dataSource driver="com.mysql.jdbc.Driver"
>       url="jdbc:mysql://localhost/database?zeroDateTimeBehavior=convertToNull"
>       user="username" password="password"/>
>   <document>
>     <entity name="..." query="...">
>       <field column="..." name="start_bidprice" />
>       <field column="..." name="..." />
>       <field column="..." name="..." />
>       <field column="..." name="lastbid_rice" />
>       <field column="..." name="..." />
>     </entity>
>   </document>
> </dataConfig>
>
> schema.xml
>
>
> <fields>
>   <field name="un_id" type="..." indexed="true" stored="true" required="true" />
>   <field name="..." type="..." indexed="true" stored="true" />
>   <field name="..." type="..." indexed="true" stored="true" />
>   <field name="..." type="..." indexed="true" stored="true" />
>   <field name="..." type="..." indexed="true" stored="true" />
>   <field name="..." type="..." indexed="true" stored="true" />
> </fields>
>
> <uniqueKey>un_id</uniqueKey>
>
> <defaultSearchField>ST_Name</defaultSearchField>
>
> The data types in MySQL are given below.
>
> keyword     text
> start_bidprice  float(12,2)
> end_date    datetime
> start_bidprice  float(12,2)
> start_date      datetime
>
>
> For some fields that are simple floats, the index is being created. I
> also added zeroDateTimeBehavior=convertToNull to the data-config.xml's URL,
> but to no avail.
>
> Please help. Thanks in advance.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Index-a-null-text-field-tp3533636p353
> 5376.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solrQueryParser defaultOperator

2011-11-25 Thread Erick Erickson
Please review: http://wiki.apache.org/solr/UsingMailingLists

you're asking us to figure out what you've done. In particular,
are you using either dismax or edismax? They don't respect
the defaultOperator. Use the mm param to get this kind
of behavior.
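
With (e)dismax, requiring all query terms corresponds to mm=100%; a sketch
(query and handler values are examples, not from the original mail):

    ?q=bakery california&defType=edismax&mm=100%

or, in the handler's defaults in solrconfig.xml:

    <str name="mm">100%</str>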

Best
Erick

On Thu, Nov 24, 2011 at 6:33 PM, toto  wrote:
> Hi,
> I installed Apache Solr and integrated it with a Drupal website. Everything
> works perfectly. The default search operator is OR, so I changed it in my
> schema.xml as:
>
> <solrQueryParser defaultOperator="AND"/>
>
> But it seems it's not working. For example, when I search "bakery california",
> Solr returns all the results containing "bakery" OR "california".
>
> Is there any solution to fix it?
>
> Thanks
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solrQueryParser-defaultOperator-tp3534984p3534984.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: WordDelimiterFilter MultiPhraseQuery case insesitive Issue

2011-11-25 Thread Erick Erickson
Have you looked at the admin/analysis page? That's invaluable
for answering this kind of question.

Best
Erick

On Thu, Nov 24, 2011 at 2:30 PM, Uomesh  wrote:
> Hi,
>
> I tried with preserveOriginal="1" and reindexed too, but still no result.
>
> Thanks,
> Umesh
>
> On Wed, Nov 23, 2011 at 5:33 PM, Shawn Heisey-4 [via Lucene] <
> ml-node+s472066n3532405...@n3.nabble.com> wrote:
>
>> On 11/23/2011 2:54 PM, Uomesh wrote:
>>
>> > Hi,
>> >
>> > case insensitive search is not working if I use WordDelimiterFilter
>> > splitOnCaseChange="1"
>> >
>> > I am searching for word norton and here is result
>> >
>> > norton: returns result
>> > Norton: returns result
>> > but
>> > nOrton: no results
>> >
>> > I want nOrton to return results too. Please help; below is my field type.
>>
>> Try adding preserveOriginal="1" to your WDF options.  You may not need
>> to actually reindex before you see results, but it would be a good idea
>> to reindex.  This will result in an increase in your index size.
>>
>> Thanks,
>> Shawn
>>
>>
>>
>> --
>>  If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://lucene.472066.n3.nabble.com/WordDelimiterFilter-MultiPhraseQuery-case-insesitive-Issue-tp3532209p3532405.html
>>  To unsubscribe from WordDelimiterFilter MultiPhraseQuery case insesitive
>> Issue, click 
>> here
>> .
>> NAML
>>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/WordDelimiterFilter-MultiPhraseQuery-case-insesitive-Issue-tp3532209p3534518.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Incorrect Search results

2011-11-25 Thread Erick Erickson
Please review: http://wiki.apache.org/solr/UsingMailingLists

You have given us virtually no information that would allow
us to help...

Best
Erick

On Thu, Nov 24, 2011 at 1:57 PM, GAURAV PAREEK
 wrote:
> I am searching some of the keywords but I am not getting the correct results.
>
> According to my understanding, *X* should give results equal to or more than X.
>
> *But I am getting fewer results with *X*.*
>
> Regards,
> Gaurav
>


Re: highlighting performance poor with *.tar, *.gz files

2011-11-25 Thread Erick Erickson
Highlighting is dependent on the size of the
data being fed through the highlighter. Unless you have
termVectors & offsets & positions enabled, the text
must be re-analyzed, see:
http://wiki.apache.org/solr/FieldOptionsByUseCase?highlight=%28termvector%29%7C%28retrieve%29%7C%28contents%29

But highlighting compressed files seems like an odd
use-case, what is the business reason you need to do this?

Best
Erick

On Thu, Nov 24, 2011 at 10:28 AM, Shyam Bhaskaran
 wrote:
> Hi,
>
> It is observed that highlighting of search results takes too much time,
> especially when highlighting terms for archived files like *.gz, *.tar, *.zip.
> What could be the reason behind it? Is it because these files are unzipped
> and then highlighted from the index at display time?
> Or is it dependent on the size of the file? Is there any way to improve
> search & highlighter performance for these kinds of archived files
> (*.tar, *.zip etc.)?
>
> Let me know if there is any workaround for improving highlighting and
> search performance for these kinds of files.
>
> -Shyam
>


Re: Query a field with no value or a particular value.

2011-11-25 Thread Erick Erickson
You just need two clauses, something like
q=field:yes (*:* -field:[* TO *])

fq could work here too.
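
Spelled out as a full filter query (field name as in your example):

    ?q=*:*&fq=field:yes OR (*:* -field:[* TO *])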


Best
Erick


On Fri, Nov 25, 2011 at 10:06 AM, Phil Hoy  wrote:
> Hi,
>
> Thanks for getting back to me, and sorry the default q value was *:* so I 
> omitted it from the example.
>
> I do not have a problem getting the null values so q=*:*&fq=-field:[* TO *] 
> indeed works but I also need the docs with a specific value e.g. 
> fq=field:yes. Is this possible?
>
> Phil
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 25 November 2011 13:59
> To: solr-user@lucene.apache.org
> Subject: Re: Query a field with no value or a particular value.
>
> You haven't specified any "q" clause, just an "fq" clause. Try
> q=*:* -field:[* TO *]
> or
> q=*:*&fq=-field:[* TO *]
>
> BTW, the logic of field:yes -field:[* TO *] makes no sense
> You're saying "find me all the fields containing the value "yes" and
> remove from that set all the fields containing any value at all"
>
> Best
> Erick
>
> On Fri, Nov 25, 2011 at 7:28 AM, Phil Hoy  wrote:
>> Hi,
>>
>> Is it possible to constrain the results of a query to return docs were a 
>> field contains no value or a particular value?
>>
>> I tried  ?fq=(field:yes OR -field:[* TO *]) but I get no results even though 
>> queries with either ?fq=field:yes or ?fq=-field:[* TO *]) do return results.
>>
>>
>> Phil
>>
>
> __
> This email has been scanned by the brightsolid Email Security System. Powered 
> by MessageLabs
> __
>


Re: trouble with CollationKeyFilter

2011-11-25 Thread Erick Erickson
It's checked in, SOLR-2438. Although it's getting some surgery so you
can expect it to morph a bit.

Erick

On Wed, Nov 23, 2011 at 11:22 PM, Michael Sokolov  wrote:
> Thanks for confirming that, and laying out the options, Robert.
>
> -Mike
>
> On 11/23/2011 9:03 PM, Robert Muir wrote:
>>
>> hi,
>>
>> locale sensitive range queries don't work with these filters, only sort,
>> although erick erickson has a patch that will enable this (the lowercasing
>> wildcards patch, then you could add this filter to your multiterm chain).
>>
>> separately locale range queries and sort both work easily on trunk (with
>> binary terms)... just use collationfield or icucollationfield if you are
>> able to use trunk...
>>
>> otherwise for 3.x I think that patch is pretty close any day now, so we
>> can
>> add an example for localized range queries that makes use of it.
>>
>> On Nov 23, 2011 4:39 PM, "Michael Sokolov"  wrote:
>>>
>>> I'm using CollationKeyFilter to sort my documents using the Unicode root
>>
>> collation, and my documents do appear to be getting sorted correctly, but
>> I'm getting weird results when performing range filtering using the sort
>> key field.  For example:
>>>
>>> ifp_sortkey_ls:["youth culture" TO "youth culture"]
>>>
>>> and
>>>
>>> ifp_sortkey_ls:{"youth culture" TO "youth culture"}
>>>
>>> both return 0 hits
>>>
>>> but
>>>
>>> ifp_sortkey_ls:"youth culture"
>>>
>>> returns 1 hit
>>>
>>> It seems as if any query using the ifp_sortkey_ls:[A to B] syntax is
>>
>> acting as if the terms A, B are greater than all documents whose sortkeys
>> start with an A-Z character, but less than a few documents that have greek
>> letters as their first characters of their sortkeys.
>>>
>>> the analysis chain for ifp_sortkey_ls is:
>>>
>>> <fieldtype name="..." class="solr.TextField"
>>>     positionIncrementGap="100" omitNorms="true"
>>>     omitTermFreqAndPositions="true">
>>>   <analyzer>
>>>     <tokenizer class="..."/>
>>>     <filter class="solr.CollationKeyFilterFactory"
>>>             language=""
>>>             strength="primary"
>>>             />
>>>   </analyzer>
>>> </fieldtype>
>>>
>>> Does anyone have any idea what might be going on here?
>>>
>
>


Re: inconsistent JVM crash with version 4.0-SNAPSHOT

2011-11-25 Thread Justin Caratzas
Lasse Aagren  writes:

> Hi,
>
> We are running Solr-Lucene 4.0-SNAPSHOT (1199777M - hudson - 2011-11-09
> 14:58:50) on several servers running:
>
> 64bit Debian Squeeze (6.0.3)
> OpenJDK6 (b18-1.8.9-0.1~squeeze1)
> Tomcat 6.0.28 (6.0.28-9+squeeze1)
>
> Some of the servers have 48G RAM, in which case Java has 16G (-Xmx16g), and
> some of the servers have 96G RAM, in which case Java has 48G (-Xmx48G).
>
> We are seeing some inconsistent crashes of tomcat's JVM under different 
> Solr/Lucene operations/circumstances. Sadly we can't replicate it. 
>
> It doesn't happen often, but often enough that we can't rely on it in 
> production.
>
> When it happens, something like the following appears in the logs:
>
> ==
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f6c318d0902, pid=16516, tid=139772378892032
> #
> # JRE version: 6.0_18-b18
> # Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 )
> # Derivative: IcedTea6 1.8.9
> # Distribution: Debian GNU/Linux 6.0.2 (squeeze), package 
> 6b18-1.8.9-0.1~squeeze1
> # Problematic frame:
> # j  
> org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(Lorg/apache/lucene/index/IndexReader$AtomicReaderContext;Lorg/apache/lucene/util/Bits;)Lorg/apache/lucene/search/DocIdSet;+193
> #
> # An error report file with more information is saved as:
> # /tmp/hs_err_pid16516.log
> #
> # If you would like to submit a bug report, please include
> # instructions how to reproduce the bug and visit:
> #   http://icedtea.classpath.org/bugzilla
> #
> ==
>
> Every time it happens the problematic frame is:
>
> Problematic frame:
> # j  
> org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(Lorg/apache/lucene/index/IndexReader$AtomicReaderContext;Lorg/apache/lucene/util/Bits;
> )Lorg/apache/lucene/search/DocIdSet;+193
>
> And /tmp/hs_err_pid16516.log is attached to this mail.
>
> Has anyone seen this before? 
>
> Please don't hesitate to ask for further specification about our setup.
>
> Best regards,

I seem to remember that a recent Java release fixed seemingly random
SIGSEGVs that caused Solr/Lucene to crash non-deterministically.

http://lucene.apache.org/solr/#26+October+2011+-+Java+7u1+fixes+index+corruption+and+crash+bugs+in+Apache+Lucene+Core+and+Apache+Solr

Hopefully this will provide you with some answers. If not, please let
the list know.

justin



Re: remove answers with identical scores

2011-11-25 Thread Ted Dunning
See http://en.wikipedia.org/wiki/Locality-sensitive_hashing

The obvious thought that I had just after hitting send was that you could
put the LSH signatures on the documents.  That would let you do the scan at
low volume and using LSH would make the duplicate scan almost as fast as
your score scan idea.

Whether Solr will do this for you is really neither here nor there.  Solr
does an awful lot of stuff for an awful lot of people who find it very
congenial.  They probably don't have lots of duplicate documents.  If you
really think that this capability is core, then you can contribute an
implementation to Solr and all will be made whole.  In the short-term, I
would recommend you prototype independently.

On Fri, Nov 25, 2011 at 4:47 AM, Fred Zimmerman wrote:

> thanks.  i did consider postprocessing and may wind up doing that, i was
> hoping there was a way to have Solr do it for me! That I have to ask this
> question is probably not a good sign, but what is LSH clustering?
>
> On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning 
> wrote:
>
> > You can do that pretty easily by just retrieving extra documents and post
> > processing the results list.
> >
> > You are likely to have a significant number of apparent duplicates this
> > way.
> >
> > To really get rid of duplicates in results, it might be better to remove
> > them from the corpus by deploying something like LSH clustering.
> >
> > On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman  > >wrote:
> >
> > > I have a corpus that has a lot of identical or nearly identical
> > documents.
> > > I'd like to return only the unique ones (excluding the "nearly
> identical"
> > > which are redirects).  I notice that all the identical/nearly
> identicals
> > > have identical Solr scores. How can I tell Solr to  throw out all the
> > > successive documents in an answer set that have identical scores?
> > >
> > > doc 1 score 5.0
> > > doc 2  score 5.0
> > > doc 3 score 5.0
> > > doc 4 score 4.9
> > >
> > > skip docs 2 and 3
> > >
> > > bring back 10 docs with unique scores
> > >
> >
>


Re: How many defaultsearchfields we can have in one schema.xml file?

2011-11-25 Thread Lee Carroll
Only one field can be the default. Use copyField to copy the fields
you need to search into a single catch-all field, and set that field to be
the default, as sketched below. That might be OK depending upon your
circumstances.
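
A minimal sketch of that setup in schema.xml (field names are placeholders):

    <field name="all_text" type="text" indexed="true" stored="false" multiValued="true"/>
    <copyField source="*" dest="all_text"/>
    <defaultSearchField>all_text</defaultSearchField>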

On 25 November 2011 12:46, kiran.bodigam  wrote:
> In my schema I have defined the below tag for indexing the fields, because in
> my use case every field except the uniqueKey needs to be indexed as-is
> (with the same datatype):
>
> <dynamicField name="*" type="..." indexed="true" stored="true" multiValued="true" />
>
> Here I would like to search all of them without a field name; unfortunately I
> can't put all of them in the <defaultSearchField> option because they are
> dynamic fields. How can I make all of them the default search? Please suggest.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-many-defaultsearchfields-we-can-have-in-one-schema-xml-file-tp3536020p3536020.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Huge Performance: Solr distributed search

2011-11-25 Thread Mikhail Garber
In general terms, when your Java heap is this large, it is beneficial to
set -Xms and -Xmx to the same size.
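
With the settings quoted below, that would mean something like (a sketch):

    -Xms12G -Xmx12G

instead of -Xms3G with -Xmx12G, so the heap never has to grow at runtime.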

On Wed, Nov 23, 2011 at 5:12 AM, Artem Lokotosh  wrote:
> Hi!
>
> * Data:
> - Solr 3.4;
> - 30 shards ~ 13GB, 27-29M docs each shard.
>
> * Machine parameters (Ubuntu 10.04 LTS):
> user@Solr:~$ uname -a
> Linux Solr 2.6.32-31-server #61-Ubuntu SMP Fri Apr 8 19:44:42 UTC 2011
> x86_64 GNU/Linux
> user@Solr:~$ cat /proc/cpuinfo
> processor       : 0 - 3
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 44
> model name      : Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz
> stepping        : 2
> cpu MHz         : 3458.000
> cache size      : 12288 KB
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 11
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
> tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1
> sse4_2 popcnt aes hypervisor lahf_lm ida arat
> bogomips        : 6916.00
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 40 bits physical, 48 bits virtual
> power management:
> user@Solr:~$ cat /proc/meminfo
> MemTotal:       16992680 kB
> MemFree:          110424 kB
> Buffers:            9976 kB
> Cached:         11588380 kB
> SwapCached:        41952 kB
> Active:          9860764 kB
> Inactive:        6198668 kB
> Active(anon):    4062144 kB
> Inactive(anon):   398972 kB
> Active(file):    5798620 kB
> Inactive(file):  5799696 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> SwapTotal:      46873592 kB
> SwapFree:       46810712 kB
> Dirty:                36 kB
> Writeback:             0 kB
> AnonPages:       4424756 kB
> Mapped:           940660 kB
> Shmem:                40 kB
> Slab:             362344 kB
> SReclaimable:     350372 kB
> SUnreclaim:        11972 kB
> KernelStack:        2488 kB
> PageTables:        68568 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    55369932 kB
> Committed_AS:    5740556 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:      350532 kB
> VmallocChunk:   34359384964 kB
> HardwareCorrupted:     0 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:       10240 kB
> DirectMap2M:    17299456 kB
>
> - Apache Tomcat 6.0.32:
> 
> -XX:+DisableExplicitGC
> -XX:PermSize=512M
> -XX:MaxPermSize=512M
> -Xmx12G
> -Xms3G
> -XX:NewSize=128M
> -XX:MaxNewSize=128M
> -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC
> -XX:+CMSClassUnloadingEnabled
> -XX:CMSInitiatingOccupancyFraction=50
> -XX:GCTimeRatio=9
> -XX:MinHeapFreeRatio=25
> -XX:MaxHeapFreeRatio=25
> -verbose:gc
> -XX:+PrintGCTimeStamps
> -Xloggc:/opt/search/tomcat/logs/gc.log
>
> Our search setup is:
> - 5 servers with configuration above;
> - one tomcat6 application on each server with 6 solr applications.
>
> - Full addresses are:
> 1) 
> http://192.168.1.85:8080/solr1,http://192.168.1.85:8080/solr2,...,http://192.168.1.85:8080/solr6
> 2) 
> http://192.168.1.86:8080/solr7,http://192.168.1.86:8080/solr8,...,http://192.168.1.86:8080/solr12
> ...
> 5) 
> http://192.168.1.89:8080/solr25,http://192.168.1.89:8080/solr26,...,http://192.168.1.89:8080/solr30
> - At another server there is an additional "common" application with
> a shards parameter:
>
> <requestHandler name="..." class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <str name="shards">192.168.1.85:8080/solr1,192.168.1.85:8080/solr2,...,192.168.1.89:8080/solr30</str>
>     <int name="rows">10</int>
>   </lst>
> </requestHandler>
> - schema and solrconfig are identical for all shards; for the first shard,
> see the attachment;
> - these servers handle search only; indexing happens on another server
> (shards optimized to 2 segments are replicated with ssh/rsync scripts).
>
> So now the major problem is very poor distributed search performance.
> Take a look at these logs, for example:
> This is on 30 shards:
> INFO: [] webapp=/solr
> path=/select/params={fl=*,score&ident=true&start=0&q=(barium)&rows=2000}
> status=0 QTime=40712
> INFO: [] webapp=/solr
> path=/select/params={fl=*,score&ident=true&start=0&q=(pittances)&rows=2000}
> status=0 QTime=36097
> INFO: [] webapp=/solr
> path=/select/params={fl=*,score&ident=true&start=0&q=(reliability)&rows=2000}
> status=0 QTime=75756
> INFO: [] webapp=/solr
> path=/select/params={fl=*,score&ident=true&start=0&q=(blessing's)&rows=2000}
> status=0 QTime=30342
> INFO: [] webapp=/solr
> path=/select/params={fl=*,score&ident=true&start=0&q=(reiterated)&rows=2000}
> status=0 QTime=55690
>
> Sometimes QTime is more than 15. But when we run identical queries
> on one shard separately, QTime is between 200 and 1500.
> Is distributed Solr search really this slow, or is our architecture
> suboptimal? Or do we maybe need to use some third-party application?
> Thanks for any replies.
>
> --
> Best regards,
> Artem
>


RE: Index a null text field

2011-11-25 Thread Young, Cody
I don't see anything wrong so far other than a typo here (missing a p in
the second price):

<field column="..." name="lastbid_rice" />
 Can you see if there are any warnings in the log about documents not
being able to be created?

Also, you should have a field type definition for text in your schema.
It will look something like:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  ...
</fieldType>
Can you send the full field type definition along as well?

You can also try running a query like: 
?q=keyword_stock:[* TO *]
That will return any documents where keyword_stock is populated.

Thanks,
Cody

-Original Message-
From: jawedshamshedi [mailto:jawedshamsh...@gmail.com] 
Sent: Thursday, November 24, 2011 9:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Index a null text field

Hi Cody,

Thanks for the reply.

Please find the details of what I am doing.

Yes, I am using dataimport handler and the code snippet of it from
solrconfig.xml is given below.

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

The data-config.xml is given below.

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
      url="jdbc:mysql://localhost/database?zeroDateTimeBehavior=convertToNull"
      user="username" password="password"/>
  <document>
    <entity name="..." query="...">
      <field column="..." name="start_bidprice" />
      <field column="..." name="..." />
      <field column="..." name="..." />
      <field column="..." name="lastbid_rice" />
      <field column="..." name="..." />
    </entity>
  </document>
</dataConfig>
schema.xml


<fields>
  <field name="un_id" type="..." indexed="true" stored="true" required="true" />
  <field name="..." type="..." indexed="true" stored="true" />
  <field name="..." type="..." indexed="true" stored="true" />
  <field name="..." type="..." indexed="true" stored="true" />
  <field name="..." type="..." indexed="true" stored="true" />
  <field name="..." type="..." indexed="true" stored="true" />
</fields>

<uniqueKey>un_id</uniqueKey>

<defaultSearchField>ST_Name</defaultSearchField>

The data types in MySQL are given below.

keyword         text
start_bidprice  float(12,2)
end_date        datetime
start_bidprice  float(12,2)
start_date      datetime


For some fields that are simple floats, the index is being created. I
also added zeroDateTimeBehavior=convertToNull to the data-config.xml's URL,
but to no avail.

Please help. Thanks in advance.



--
View this message in context:
http://lucene.472066.n3.nabble.com/Index-a-null-text-field-tp3533636p353
5376.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: trouble with CollationKeyFilter

2011-11-25 Thread Robert Muir
On Wed, Nov 23, 2011 at 11:22 PM, Michael Sokolov  wrote:
> Thanks for confirming that, and laying out the options, Robert.
>

FYI: Erick committed the multiterm stuff, so I opened an issue for
this: https://issues.apache.org/jira/browse/SOLR-2919

-- 
lucidimagination.com


Re: Boosted documents not appearing higher than less-boosted ones for equal relevancy.

2011-11-25 Thread Tomás Fernández Löbbe
I don't think there is a way of seeing the "boosts" from the index, as
those are encoded as "norms" (together with length normalization). You can
see the norms with Luke if you want to, and in the debugQuery output the
index-time boost should be represented in the "fieldNorm" section. (If you
click "view source" you'll see the explain section of the debugQuery
indented, which is much easier to read.)

In the Similarity javadoc (
http://lucene.apache.org/java/3_1_0/api/all/org/apache/lucene/search/Similarity.html
for Lucene/Solr 3.1) you can see how the norm is calculated. In your debugQuery
I can see that all the "fieldNorm" values are 1.5, and I'm not sure why that
is happening for you.
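
From that javadoc, the index-time norm is roughly (paraphrasing, not the
exact notation):

    norm(t,d) = doc.getBoost() * lengthNorm(field) * product of f.getBoost()
                (over all instances f of the field in d)

encoded into a single byte, so the document boost should show up multiplied
into the fieldNorm value.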

On Fri, Nov 25, 2011 at 1:08 PM, Andrew Ingram <
andrew.ing...@tangentlabs.co.uk> wrote:

> Hi all,
>
> I have 4 products, let's call them p1, p2, p3 and p4. At the point of
> indexing I'm boosting each document as follows (using <doc boost="...">):
>
> p1 = 2.3434156476491901
> p2 = 2.1894875146124502
> p3 = 2.51677824126855
> p4 = 2.2773491010634999
>
> (Note: scores may not be identical to what is currently indexed, because I
> can't figure out how to get this information from Solr, these values are
> simply illustrating what is being fed into the index)
>
> When I'm performing a search query, they are all being given an equal
> score of 23.54723 for one example case (see debugQuery details below). As
> far as I can tell the boost I've provided isn't contributing to the score,
> but across my overall index the boosting is successfully promoting more
> popular products over less popular ones (the boost is calculated based on a
> number of factors such as popularity).
>
> So my question is: why are these 4 products all being given the same
> score? Is the document boosting not being considered correctly?
>
> Additionally I'm sorting by "can_purchase+desc,+score+desc", where
> can_purchase is a boolean field.
>
> I would greatly appreciate any help with this.
>
> Regards,
> Andrew Ingram
>
> > <lst name="debug">
> > <str name="rawquerystring">(text:jeffrey AND text:archer)</str>
> > <str name="querystring">(text:jeffrey AND text:archer)</str>
> > <str name="parsedquery">+(text:JFR text:jeffrey) +(text:ARXR text:archer)</str>
> > <str name="parsedquery_toString">+(text:JFR text:jeffrey) +(text:ARXR text:archer)</str>
> > <lst name="explain">
> >
> > ... (other results) ...
> >
> > 
> > 23.54723 = (MATCH) sum of: 9.63586 = (MATCH) sum of: 4.285661 = (MATCH)
> weight(text:JFR in 1494239), product of: 0.42661786 =
> queryWeight(text:JFR), product of: 6.6971116 = idf(docFreq=49173,
> maxDocs=14654117) 0.06370177 = queryNorm 10.045668 = (MATCH)
> fieldWeight(text:JFR in 1494239), product of: 1.0 =
> tf(termFreq(text:JFR)=1) 6.6971116 = idf(docFreq=49173, maxDocs=14654117)
> 1.5 = fieldNorm(field=text, doc=1494239) 5.3501997 = (MATCH)
> weight(text:jeffrey in 1494239), product of: 0.47666705 =
> queryWeight(text:jeffrey), product of: 7.482791 = idf(docFreq=22413,
> maxDocs=14654117) 0.06370177 = queryNorm 11.224186 = (MATCH)
> fieldWeight(text:jeffrey in 1494239), product of: 1.0 =
> tf(termFreq(text:jeffrey)=1) 7.482791 = idf(docFreq=22413,
> maxDocs=14654117) 1.5 = fieldNorm(field=text, doc=1494239) 13.91137 =
> (MATCH) sum of: 6.4868336 = (MATCH) weight(text:ARXR in 1494239), product
> of: 0.52486366 = queryWeight(text:ARXR), product of: 8.239388 =
> idf(docFreq=10517, maxDocs=14654117) 0.06370177 = queryNorm 12.359083 =
> (MATCH) fieldWeight(text:ARXR in 1494239), product of: 1.0 =
> tf(termFreq(text:ARXR)=1) 8.239388 = idf(docFreq=10517, maxDocs=14654117)
> 1.5 = fieldNorm(field=text, doc=1494239) 7.4245367 = (MATCH)
> weight(text:archer in 1494239), product of: 0.56151944 =
> queryWeight(text:archer), product of: 8.814816 = idf(docFreq=5915,
> maxDocs=14654117) 0.06370177 = queryNorm 13.25 = (MATCH)
> fieldWeight(text:archer in 1494239), product of: 1.0 =
> tf(termFreq(text:archer)=1) 8.814816 = idf(docFreq=5915, maxDocs=14654117)
> 1.5 = fieldNorm(field=text, doc=1494239)
> > 
> > 
> > 23.54723 = (MATCH) sum of: 9.63586 = (MATCH) sum of: 4.285661 = (MATCH)
> weight(text:JFR in 1526040), product of: 0.42661786 =
> queryWeight(text:JFR), product of: 6.6971116 = idf(docFreq=49173,
> maxDocs=14654117) 0.06370177 = queryNorm 10.045668 = (MATCH)
> fieldWeight(text:JFR in 1526040), product of: 1.0 =
> tf(termFreq(text:JFR)=1) 6.6971116 = idf(docFreq=49173, maxDocs=14654117)
> 1.5 = fieldNorm(field=text, doc=1526040) 5.3501997 = (MATCH)
> weight(text:jeffrey in 1526040), product of: 0.47666705 =
> queryWeight(text:jeffrey), product of: 7.482791 = idf(docFreq=22413,
> maxDocs=14654117) 0.06370177 = queryNorm 11.224186 = (MATCH)
> fieldWeight(text:jeffrey in 1526040), product of: 1.0 =
> tf(termFreq(text:jeffrey)=1) 7.482791 = idf(docFreq=22413,
> maxDocs=14654117) 1.5 = fieldNorm(field=text, doc=1526040) 13.91137 =
> (MATCH) sum of: 6.4868336 = (MATCH) weight(text:ARXR in 1526040), product
> of: 0.52486366 = queryWeight(text:ARXR), product of: 8.239388 =
> idf(docFreq=10517, maxDocs=14654117) 0.06370177 = queryNorm 12.359083 =
> (MATCH) fieldWeight(text:ARXR in 1526040), product of: 1.0 =
> tf(termFreq(text:ARXR)=1) 

Re: Unable to index documents using DataImportHandler with MSSQL

2011-11-25 Thread Ian Grainger
Update on this: I've established:
* It's not a problem in the DB (I can index from this DB into a Solr
instance on another server)
* It's not Tomcat (I get the same problem in Jetty)
* It's not the schema (I have simplified it to one field)

That leaves SolrConfig.xml and data-config.

The only thing changed in SolrConfig.xml is adding:

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">D:/Software/Solr/example/solr/conf/data-config.xml</str>
  </lst>
</requestHandler>

And data-config.xml is pretty much as attached - except simpler.

Any help or any advice on how to diagnose would be appreciated!


On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger  wrote:
> Hi I have copied my Solr config from a working Windows server to a new
> one, and it can't seem to run an import.
>
> They're both using win server 2008 and SQL 2008R2. This is the data
> importer config
>
>    <dataConfig>
>      <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>            url="jdbc:sqlserver://localhost;databaseName=DB"
>            user="Solr"
>            password="pwd"/>
>      <document>
>        <entity name="data"
>            query="EXEC SOLR_COMPANY_SEARCH_DATA"
>            deltaImportQuery="SELECT * FROM Company_Search_Data WHERE
> [key]='${dataimporter.delta.key}'"
>            deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt
> > '${dataimporter.last_index_time}'">
>          <field column="..." name="WorkDesc_Comments_Split" />
>          <field column="..." />
>        </entity>
>      </document>
>    </dataConfig>
>
> I can use MS SQL Profiler to watch the Solr user log in successfully,
> but then nothing. It doesn't seem to even try and execute the stored
> procedure. Any ideas why this would work on one server and not on
> another?
>
> FTR the only thing in the tomcat catalina log is:
>
>    org.apache.solr.handler.dataimport.JdbcDataSource$1 call
>    INFO: Creating a connection for entity data with URL:
> jdbc:sqlserver://localhost;databaseName=CATLive
>
> --
> Ian
>
> i...@isfluent.com
> +44 (0)1223 257903
>



-- 
Ian

i...@isfluent.com
+44 (0)1223 257903


RE: XML Manager for Solr

2011-11-25 Thread Steven A Rowe
Hi Stephane,

Do you know about Solr's DataImportHandler, aka DIH?: 
http://wiki.apache.org/solr/DataImportHandler
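
For XML files specifically, DIH can pull fields straight out of the documents
with XPathEntityProcessor; a minimal sketch of a data-config.xml, where all
paths, entity names and XPaths are placeholders:

    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <document>
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="/path/to/xml" fileName=".*\.xml" rootEntity="false">
          <entity name="record" processor="XPathEntityProcessor"
                  url="${files.fileAbsolutePath}" forEach="/records/record">
            <field column="id"    xpath="/records/record/id"/>
            <field column="title" xpath="/records/record/title"/>
          </entity>
        </entity>
      </document>
    </dataConfig>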

Steve

> -Original Message-
> From: KabooHahahein [mailto:stele...@hotmail.com]
> Sent: Friday, November 25, 2011 10:33 AM
> To: solr-user@lucene.apache.org
> Subject: XML Manager for Solr
> 
> Hi,
> 
> I am new to Solr, and from what I understand, Solr indexes an XML database
> into its own format in order to enter the data into the search engine.
> 
> I am currently trying to find an XML solution for management of these XML
> files. My database will include multiple XML files, and I'd like to be
> able
> to manually edit some entries before allowing solr to parse the XML files.
> I
> would rather use a third party tool than code my own XML manager by hand.
> 
> What does everyone recommend?
> Thanks,
> Stephane
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/XML-
> Manager-for-Solr-tp3536383p3536383.html
> Sent from the Solr - User mailing list archive at Nabble.com.


XML Manager for Solr

2011-11-25 Thread KabooHahahein
Hi,

I am new to Solr, and from what I understand, Solr indexes an XML database
into its own format in order to enter the data into the search engine.

I am currently trying to find an XML solution for management of these XML
files. My database will include multiple XML files, and I'd like to be able
to manually edit some entries before allowing solr to parse the XML files. I
would rather use a third party tool than code my own XML manager by hand.

What does everyone recommend?
Thanks,
Stephane

--
View this message in context: 
http://lucene.472066.n3.nabble.com/XML-Manager-for-Solr-tp3536383p3536383.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Huge Performance: Solr distributed search

2011-11-25 Thread Artem Lokotosh

On 11/25/2011 3:13 AM, Mark Miller wrote:


When you search each shard, are you positive that you are using all of the
same parameters? You are sure you are hitting request handlers that are
configured exactly the same and sending exactly the same queries?

In my experience, the overhead for distrib search is usually very low.

What types of queries are you trying?


I'm using simple queries like this:

http://192.168.1.90:9090/solr/select/?fl=*,score&start=0&q=(superstar)&qt=requestShards&rows=2000

The requestShards handler is defined as:

<requestHandler name="requestShards" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="shards">
      192.168.1.85:8080/solr1,192.168.1.85:8080/solr2,...,192.168.1.85:8080/solr6,
      192.168.1.86:8080/solr7,192.168.1.86:8080/solr8,...,192.168.1.86:8080/solr12,
      ...,
      192.168.1.89:8080/solr25,192.168.1.89:8080/solr26,...,192.168.1.89:8080/solr30
    </str>
    <int name="rows">10</int>
  </lst>
</requestHandler>

--
Best regards,
Artem Lokotosh  mailto:arco...@gmail.com




Boosted documents not appearing higher than less-boosted ones for equal relevancy.

2011-11-25 Thread Andrew Ingram
Hi all,

I have 4 products, let's call them p1, p2, p3 and p4. At the point of indexing
I'm boosting each document as follows (using <doc boost="...">):

p1 = 2.3434156476491901
p2 = 2.1894875146124502
p3 = 2.51677824126855
p4 = 2.2773491010634999

(Note: scores may not be identical to what is currently indexed, because I
can't figure out how to get this information from Solr, these values are simply 
illustrating what is being fed into the index)

When I'm performing a search query, they are all being given an equal score of 
23.54723 for one example case (see debugQuery details below). As far as I can
tell the boost I've provided isn't contributing to the score, but across my
overall index the boosting is successfully promoting more popular products over 
less popular ones (the boost is calculated based on a number of factors such as 
popularity).

So my question is: why are these 4 products all being given the same score? Is
the document boosting not being considered correctly?

Additionally I'm sorting by "can_purchase+desc,+score+desc", where can_purchase 
is a boolean field.

I would greatly appreciate any help with this.

Regards,
Andrew Ingram

> <lst name="debug">
> <str name="rawquerystring">(text:jeffrey AND text:archer)</str>
> <str name="querystring">(text:jeffrey AND text:archer)</str>
> <str name="parsedquery">+(text:JFR text:jeffrey) +(text:ARXR text:archer)</str>
> <str name="parsedquery_toString">+(text:JFR text:jeffrey) +(text:ARXR text:archer)</str>
> <lst name="explain">
> 
> ... (other results) ...
> 
> 
> 23.54723 = (MATCH) sum of: 9.63586 = (MATCH) sum of: 4.285661 = (MATCH) 
> weight(text:JFR in 1494239), product of: 0.42661786 = queryWeight(text:JFR), 
> product of: 6.6971116 = idf(docFreq=49173, maxDocs=14654117) 0.06370177 = 
> queryNorm 10.045668 = (MATCH) fieldWeight(text:JFR in 1494239), product of: 
> 1.0 = tf(termFreq(text:JFR)=1) 6.6971116 = idf(docFreq=49173, 
> maxDocs=14654117) 1.5 = fieldNorm(field=text, doc=1494239) 5.3501997 = 
> (MATCH) weight(text:jeffrey in 1494239), product of: 0.47666705 = 
> queryWeight(text:jeffrey), product of: 7.482791 = idf(docFreq=22413, 
> maxDocs=14654117) 0.06370177 = queryNorm 11.224186 = (MATCH) 
> fieldWeight(text:jeffrey in 1494239), product of: 1.0 = 
> tf(termFreq(text:jeffrey)=1) 7.482791 = idf(docFreq=22413, maxDocs=14654117) 
> 1.5 = fieldNorm(field=text, doc=1494239) 13.91137 = (MATCH) sum of: 6.4868336 
> = (MATCH) weight(text:ARXR in 1494239), product of: 0.52486366 = 
> queryWeight(text:ARXR), product of: 8.239388 = idf(docFreq=10517, 
> maxDocs=14654117) 0.06370177 = queryNorm 12.359083 = (MATCH) 
> fieldWeight(text:ARXR in 1494239), product of: 1.0 = 
> tf(termFreq(text:ARXR)=1) 8.239388 = idf(docFreq=10517, maxDocs=14654117) 1.5 
> = fieldNorm(field=text, doc=1494239) 7.4245367 = (MATCH) weight(text:archer 
> in 1494239), product of: 0.56151944 = queryWeight(text:archer), product of: 
> 8.814816 = idf(docFreq=5915, maxDocs=14654117) 0.06370177 = queryNorm 
> 13.25 = (MATCH) fieldWeight(text:archer in 1494239), product of: 1.0 = 
> tf(termFreq(text:archer)=1) 8.814816 = idf(docFreq=5915, maxDocs=14654117) 
> 1.5 = fieldNorm(field=text, doc=1494239)
> 
> 
> 23.54723 = (MATCH) sum of: 9.63586 = (MATCH) sum of: 4.285661 = (MATCH) 
> weight(text:JFR in 1526040), product of: 0.42661786 = queryWeight(text:JFR), 
> product of: 6.6971116 = idf(docFreq=49173, maxDocs=14654117) 0.06370177 = 
> queryNorm 10.045668 = (MATCH) fieldWeight(text:JFR in 1526040), product of: 
> 1.0 = tf(termFreq(text:JFR)=1) 6.6971116 = idf(docFreq=49173, 
> maxDocs=14654117) 1.5 = fieldNorm(field=text, doc=1526040) 5.3501997 = 
> (MATCH) weight(text:jeffrey in 1526040), product of: 0.47666705 = 
> queryWeight(text:jeffrey), product of: 7.482791 = idf(docFreq=22413, 
> maxDocs=14654117) 0.06370177 = queryNorm 11.224186 = (MATCH) 
> fieldWeight(text:jeffrey in 1526040), product of: 1.0 = 
> tf(termFreq(text:jeffrey)=1) 7.482791 = idf(docFreq=22413, maxDocs=14654117) 
> 1.5 = fieldNorm(field=text, doc=1526040) 13.91137 = (MATCH) sum of: 6.4868336 
> = (MATCH) weight(text:ARXR in 1526040), product of: 0.52486366 = 
> queryWeight(text:ARXR), product of: 8.239388 = idf(docFreq=10517, 
> maxDocs=14654117) 0.06370177 = queryNorm 12.359083 = (MATCH) 
> fieldWeight(text:ARXR in 1526040), product of: 1.0 = 
> tf(termFreq(text:ARXR)=1) 8.239388 = idf(docFreq=10517, maxDocs=14654117) 1.5 
> = fieldNorm(field=text, doc=1526040) 7.4245367 = (MATCH) weight(text:archer 
> in 1526040), product of: 0.56151944 = queryWeight(text:archer), product of: 
> 8.814816 = idf(docFreq=5915, maxDocs=14654117) 0.06370177 = queryNorm 
> 13.25 = (MATCH) fieldWeight(text:archer in 1526040), product of: 1.0 = 
> tf(termFreq(text:archer)=1) 8.814816 = idf(docFreq=5915, maxDocs=14654117) 
> 1.5 = fieldNorm(field=text, doc=1526040)
> 
> 
> 23.54723 = (MATCH) sum of: 9.63586 = (MATCH) sum of: 4.285661 = (MATCH) 
> weight(text:JFR in 1562638), product of: 0.42661786 = queryWeight(text:JFR), 
> product of: 6.6971116 = idf(docFreq=49173, maxDocs=14654117) 0.06370177 = 
> queryNorm 10.045668 = (MATCH) fieldWeight(text:JFR in 1562638), product of: 
> 1.0 = tf(termFr

RE: Sort question

2011-11-25 Thread Phil Hoy
You might be able to sort with the map function, mapping the 0-100 range to a
large sentinel so it sorts last: q=*:*&sort=map(price,0,100,999999999) asc, price asc.

Phil

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 25 November 2011 13:49
To: solr-user@lucene.apache.org
Subject: Re: Sort question

Not that I know of. You could conceivably do some
work at index time to create a field that would sort
in that order by doing some sort of mapping from
these values into a field that sorts the way you
want, or you might be able to write a plugin.

Best
Erick

On Wed, Nov 23, 2011 at 3:29 AM, vraa  wrote:
> Hi
>
> I have a query where i sort by a column "price". This field can contain the
> following values
>
> 10
> 75000
> 15
> 1
> 225000
> 50
> 40
>
> I want to sort these values so that values between 0 and 100 always come
> last.
>
> Eg sorting by price asc should look like this:
> 75000
> 10
> 15
> 225000
> 1
> 40
> 50
>
> Is this possible?
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Sort-question-tp3530070p3530070.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

__
This email has been scanned by the brightsolid Email Security System. Powered 
by MessageLabs
__


RE: Query a field with no value or a particular value.

2011-11-25 Thread Phil Hoy
Hi,

Thanks for getting back to me, and sorry the default q value was *:* so I 
omitted it from the example.

I do not have a problem getting the null values so q=*:*&fq=-field:[* TO *] 
indeed works but I also need the docs with a specific value e.g. fq=field:yes. 
Is this possible?

Phil

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 25 November 2011 13:59
To: solr-user@lucene.apache.org
Subject: Re: Query a field with no value or a particular value.

You haven't specified any "q" clause, just an "fq" clause. Try
q=*:* -field:[* TO *]
or
q=*:*&fq=-field:[* TO *]

BTW, the logic of field:yes -field:[* TO *] makes no sense
You're saying "find me all the fields containing the value "yes" and
remove from that set all the fields containing any value at all"

Best
Erick

On Fri, Nov 25, 2011 at 7:28 AM, Phil Hoy  wrote:
> Hi,
>
> Is it possible to constrain the results of a query to return docs were a 
> field contains no value or a particular value?
>
> I tried  ?fq=(field:yes OR -field:[* TO *]) but I get no results even though 
> queries with either ?fq=field:yes or ?fq=-field:[* TO *]) do return results.
>
>
> Phil
>

__
This email has been scanned by the brightsolid Email Security System. Powered 
by MessageLabs
__


Re: Separate ACL and document index

2011-11-25 Thread Erick Erickson
There's another approach that *may* help, see:
https://issues.apache.org/jira/browse/SOLR-2429

This is probably suitable if you don't have a zillion results
to sort through. The idea here is that you can specify a
filter query that only executes after all the other parts
of a query are done, i.e. it is only evaluated for documents
that have already been selected and passed through the lower-
cost filter queries etc. You could create a custom component
that calculates whether a user has access to the docs
on the fly.

Yet another approach is to define the problem away. If you can
define a reasonably small number of *groups* (where small
might be 100s), index group membership on each doc, and
grant/deny access based on group membership, then a user's
access can be controlled without re-indexing the doc. You do
have to get the list of groups the user belongs to from some
external source and use *that* as your filter.
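
At query time that filter would look something like (field and group names
hypothetical):

    fq=acl_groups:(sales OR engineering OR managers)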

But the ACL problem is yucky when it gets very complex.

Best
Erick


On Wed, Nov 23, 2011 at 9:49 PM, Floyd Wu  wrote:
> Thank you for sharing. My current solution is similar to 2),
> but my problem is that the ACL is early-binding (meaning I build the index
> with the ACL embedded in the document index). I don't want to rebuild the
> full index (a Lucene/Solr document with PDF content and ACL) when the front
> end changes only the permission settings.
>
> It seems solution 2) has the same problem.
>
> Floyd
>
>
> 2011/11/24 Robert Stewart :
>> I have used two different ways:
>>
>> 1) Store mapping from users to documents in some external database
>> such as MySQL.  At search time, lookup mapping for user to some unique
>> doc ID or some group ID, and then build query or doc set which you can
>> cache in SOLR process for some period.  Then use that as a filter in
>> your search.  This is more involved approach but better if you have
>> lots of ACLs per user, but it is non-trivial to implement it well.  I
>> used this in a system with over 100 million docs, and approx. 20,000
>> ACLs per user.  The ACL mapped user to a set of group IDs, and each
>> group could have 10,000+ documents.
>>
>> 2) Generate a query filter that you pass to SOLR as part of the
>> search.  Potentially it could be a pretty large query if user has
>> granular ACL over may documents or groups.  I've seen it work ok with
>> up to 1000 or so ACLs per user query.  So you build that filter query
>> from the client using some external database to lookup user ACLs
>> before sending request to SOLR.
>>
>> Bob
>>
>>
>> On Tue, Nov 22, 2011 at 10:48 PM, Floyd Wu  wrote:
>>> Hi there,
>>>
>>> Is it possible to separate ACL index and document index and achieve to
>>> search by user role in SOLR?
>>>
>>> Currently my implementation is to index ACL with document, but the
>>> document itself change frequently. I have to perform rebuild index
>>> every time when ACL change. It's heavy for whole system due to
>>> document are so many and content are huge.
>>>
>>> Do you guys have any solution to solve this problem. I've been read
>>> mailing list for a while. Seem there is not suitable solution for me.
>>>
>>> I want user searches result only for him according to his role but I
>>> don't want to re-index document every time when document's ACL change.
>>>
>>> To my knowledge, is this possible to perform a join like database to
>>> achieve this? How and possible?
>>>
>>> Thanks
>>>
>>> Floyd
>>>
>>
>


Re: Synonyms 1 fetching 2001, how to avoid

2011-11-25 Thread Erick Erickson
Please review:
http://wiki.apache.org/solr/UsingMailingLists

You haven't shown the relevant parts of your configs.
You haven't shown the queries you're using, with &debugQuery=on
You haven't shown the input
You haven't explained why you think synonyms have anything
to do with the problem.

So it's really hard to say much of anything.

Best
Erick

On Wed, Nov 23, 2011 at 6:30 PM, RaviWhy  wrote:
> Hi,
>
> I am searching on movie titles, with a synonyms text file containing the mapping 1,one.
>
> With this, when I am searching for '1'  I am expecting '1 in kind' but I am
> getting results which have titles like "2001: My year" .
>
> I am using a query-time analyzer with
>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>         ignoreCase="true" expand="true" />
>
> I am going to try expand=false. But is there anything else I need to look at?
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Synonyms-1-fetching-2001-how-to-avoid-tp3532398p3532398.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: strange behavior of scores and term proximity use

2011-11-25 Thread Erick Erickson
You might try with a less "fraught" search phrase;
"to be or not to be" is a classic query that may be all
stop words.

Otherwise, I'm clueless.

On Wed, Nov 23, 2011 at 3:15 PM, Ariel Zerbib  wrote:
> I tested with the version 4.0-2011-11-04_09-29-42.
>
> Ariel
>
>
> 2011/11/17 Erick Erickson 
>
>> Hmmm, I'm not seeing similar behavior on a trunk from today, when did
>> you get your copy?
>>
>> Erick
>>
>> On Wed, Nov 16, 2011 at 2:06 PM, Ariel Zerbib 
>> wrote:
>> > Hi,
>> >
>> > For this term proximity query: ab_main_title_l0:"to be or not to be"~1000
>> >
>> >
>> http://localhost:/solr/select?q=ab_main_title_l0%3A%22og54ct8n+to+be+or+not+to+be+5w8ojsx2%22~1000&sort=score+desc&start=0&rows=3&fl=ab_main_title_l0%2Cscore%2Cid&debugQuery=true
>> >
>> > The first three results are the following:
>> >
>> > <response>
>> > <lst name="responseHeader">
>> >   <int name="status">0</int>
>> >   <int name="QTime">5</int>
>> > </lst>
>> > <result name="response" numFound="..." start="0">
>> >   <doc>
>> >     <str name="id">2315190010001021</str>
>> >     <arr name="ab_main_title_l0">
>> >       <str>og54ct8n To be or not to be a Jew. 5w8ojsx2</str>
>> >     </arr>
>> >     <float name="score">3.0814114</float>
>> >   </doc>
>> >   <doc>
>> >     <str name="id">2313006480001021</str>
>> >     <arr name="ab_main_title_l0">
>> >       <str>og54ct8n To be or not to be 5w8ojsx2</str>
>> >     </arr>
>> >     <float name="score">3.0814114</float>
>> >   </doc>
>> >   <doc>
>> >     <str name="id">2356410250001021</str>
>> >     <arr name="ab_main_title_l0">
>> >       <str>og54ct8n Rumspringa : to be or not to be Amish / 5w8ojsx2</str>
>> >     </arr>
>> >     <float name="score">3.0814114</float>
>> >   </doc>
>> > </result>
>> > <lst name="debug">
>> >   <str name="rawquerystring">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
>> >   <str name="querystring">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
>> >   <str name="parsedquery">PhraseQuery(ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000)</str>
>> >   <str name="parsedquery_toString">ab_main_title_l0:"og54ct8n to be or not to be 5w8ojsx2"~1000</str>
>> >   <lst name="explain">
>> >     <str name="...">
>> > 5.337161 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be
>> > 5w8ojsx2"~1000 in 378403) [DefaultSimilarity], result of:
>> >  5.337161 = fieldWeight in 378403, product of:
>> >    0.57735026 = tf(freq=0.3334), with freq of:
>> >      0.3334 = phraseFreq=0.3334
>> >    29.581549 = idf(), sum of:
>> >      1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
>> >      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
>> >      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
>> >      4.3826413 = idf(docFreq=112108, maxDocs=3301436)
>> >      6.3982043 = idf(docFreq=14937, maxDocs=3301436)
>> >      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
>> >      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
>> >      1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
>> >    0.3125 = fieldNorm(doc=378403)
>> > 
>> >    
>> > 9.244234 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be
>> > 5w8ojsx2"~1000 in 482807) [DefaultSimilarity], result of:
>> >  9.244234 = fieldWeight in 482807, product of:
>> >    1.0 = tf(freq=1.0), with freq of:
>> >      1.0 = phraseFreq=1.0
>> >    29.581549 = idf(), sum of:
>> >      1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
>> >      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
>> >      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
>> >      4.3826413 = idf(docFreq=112108, maxDocs=3301436)
>> >      6.3982043 = idf(docFreq=14937, maxDocs=3301436)
>> >      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
>> >      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
>> >      1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
>> >    0.3125 = fieldNorm(doc=482807)
>> > 
>> >    
>> > 5.337161 = (MATCH) weight(ab_main_title_l0:"og54ct8n to be or not to be
>> > 5w8ojsx2"~1000 in 1317563) [DefaultSimilarity], result of:
>> >  5.337161 = fieldWeight in 1317563, product of:
>> >    0.57735026 = tf(freq=0.3334), with freq of:
>> >      0.3334 = phraseFreq=0.3334
>> >    29.581549 = idf(), sum of:
>> >      1.0012436 = idf(docFreq=3297332, maxDocs=3301436)
>> >      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
>> >      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
>> >      4.3826413 = idf(docFreq=112108, maxDocs=3301436)
>> >      6.3982043 = idf(docFreq=14937, maxDocs=3301436)
>> >      3.0405464 = idf(docFreq=429046, maxDocs=3301436)
>> >      5.3583193 = idf(docFreq=42257, maxDocs=3301436)
>> >      1.0017256 = idf(docFreq=3295743, maxDocs=3301436)
>> >    0.3125 = fieldNorm(doc=1317563)
>> > 
>> > 
>> >
>> > The version used is a 4.0 October snapshot.
>> >
>> > I have 2 questions about the result:
>> > - Why are the debug scores and the scores in the result different?
>> > - What is the expected behavior of this kind of term proximity query?
>> >   The debug scores seem to be well ordered, but the result scores
>> > seem to be wrong.
>> >
>> >
>> > Thanks,
>> > Ariel
>> >
>>
>
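For reference, each explain entry above composes the score as tf * idf * fieldNorm.
The exact-phrase match (the second explain) works out as:

  9.244234 = 1.0 (tf, phraseFreq=1.0) * 29.581549 (sum of per-term idf) * 0.3125 (fieldNorm)

and each sloppy match as 0.57735026 * 29.581549 * 0.3125 = 5.337161.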


Re: Solr dismax scoring and weight

2011-11-25 Thread Erick Erickson
No, I mean the number that's used to hold the length of the field is a byte,
but that it's not just a simple byte. It's encoded to handle very long
fields in that byte, but there's some loss of precision. For instance,
and I'm pulling numbers out of thin air here, fields of 1-25 terms may
collapse to the same length value. Same with 26-100 etc. But I really don't
know the details of what the buckets are.
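To make the bucketing concrete, here is a minimal sketch (the class name and
term counts are illustrative; it exercises Lucene's SmallFloat, which is what
DefaultSimilarity's lengthNorm of 1/sqrt(numTerms) is encoded through):

import org.apache.lucene.util.SmallFloat;

public class NormPrecisionDemo {
    public static void main(String[] args) {
        // lengthNorm = 1/sqrt(numTerms) is squeezed into one byte
        // (3-bit mantissa, 5-bit exponent), so nearby lengths collide.
        for (int terms : new int[] {1, 5, 10, 25, 26, 50, 100, 101}) {
            float norm = (float) (1.0 / Math.sqrt(terms));
            byte encoded = SmallFloat.floatToByte315(norm);
            float decoded = SmallFloat.byte315ToFloat(encoded);
            System.out.printf("terms=%3d norm=%.5f byte=%d decoded=%.5f%n",
                    terms, norm, encoded, decoded);
        }
    }
}

Running it should show runs of consecutive term counts decoding to the same
norm value, which is exactly the bucket effect described above.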

Best
Erick

On Wed, Nov 23, 2011 at 2:47 PM, darul  wrote:
> Thanks a lot Erick for this explanation. Do you mean words are stored in
> bytes, that's it ?
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-dismax-scoring-and-weight-tp3490096p3531917.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr Search for misspelled search term

2011-11-25 Thread Erick Erickson
Did you turn it on? In the defaults section, something like:
<str name="spellcheck">on</str>

BTW, I would NOT do the spellcheck.build=true on every
request; this will rebuild your dictionary every time, which
is a definite performance problem!
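A common alternative (a sketch; the spellchecker name and field are assumptions
based on the sc: query below) is to rebuild the dictionary on commit instead,
and drop spellcheck.build=true from the request URL:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">sc</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>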

Best
Erick

On Wed, Nov 23, 2011 at 7:32 AM, meghana  wrote:
>
> I have configured the spellchecker component in my solrconfig;
> below is the configuration:
>
> <requestHandler name="/spellcheck" class="solr.SearchHandler" lazy="true">
>   <lst name="defaults">
>     <str name="spellcheck.onlyMorePopular">false</str>
>     <str name="spellcheck.extendedResults">false</str>
>     <str name="spellcheck.count">1</str>
>   </lst>
>   <arr name="last-components">
>     <str>spellcheck</str>
>   </arr>
> </requestHandler>
>
> Using the above configuration, it works with this URL:
> http://192.168.1.59:8080/solr/core0/spellcheck?q=sc:directry&spellcheck=true&spellcheck.build=true
>
> But when I set the same config in my standard request handler, it doesn't
> work. Below is the config setting for that:
>
> <requestHandler name="standard" class="solr.SearchHandler" default="true">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <str name="spellcheck.onlyMorePopular">false</str>
>     <str name="spellcheck.extendedResults">false</str>
>     <str name="spellcheck.count">1</str>
>   </lst>
>   <arr name="last-components">
>     <str>spellcheck</str>
>   </arr>
> </requestHandler>
>
> Then it's not working with this URL:
> http://192.168.1.59:8080/solr/core0/select?q=sc:directry&spellcheck=true&spellcheck.build=true.
>
> Anybody have any idea?
> neuron005 wrote
>>
>> Do you mean stemming?
>> For misspelled words you will have to edit your dictionary
>> (stopwords.txt), I think, where you can set solutions for misspelled words!
>> Hope so :)
>>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Search-for-misspelled-search-term-tp3529961p3530526.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: date range in solr 3.1

2011-11-25 Thread Erick Erickson
I think you're asking for something like:
fq=date:[NOW/DAY-5DAYS TO NOW/DAY+1DAY]?
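(NOW/DAY rounds down to midnight, so on 2011-11-25 this expands to roughly
fq=date:[2011-11-20T00:00:00Z TO 2011-11-26T00:00:00Z], i.e. the five days
before today plus all of today.)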

Best
Erick

On Wed, Nov 23, 2011 at 6:29 AM, do3do3  wrote:
> What I got is the count for this period, but I want to get only those
> results. What is the query to get that, something like
> fq=source:"news"?
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/date-range-in-solr-3-1-tp3527498p3530424.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Query a field with no value or a particular value.

2011-11-25 Thread Erick Erickson
You haven't specified any "q" clause, just an "fq" clause. Try
q=*:* -field:[* TO *]
or
q=*:*&fq=-field:[* TO *]

BTW, the logic of field:yes -field:[* TO *] makes no sense.
You're saying "find me all the docs containing the value "yes" and
remove from that set all the docs containing any value at all".
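If you want both conditions in a single fq, the usual trick (a sketch, using
the field name from the question) is to anchor the pure-negative clause with
the match-all query, since Lucene can't evaluate a negative-only clause inside
an OR:

fq=field:yes OR (*:* -field:[* TO *])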

Best
Erick

On Fri, Nov 25, 2011 at 7:28 AM, Phil Hoy  wrote:
> Hi,
>
> Is it possible to constrain the results of a query to return docs where a
> field contains no value or a particular value?
>
> I tried  ?fq=(field:yes OR -field:[* TO *]) but I get no results even though 
> queries with either ?fq=field:yes or ?fq=-field:[* TO *]) do return results.
>
>
> Phil
>


Re: Sort question

2011-11-25 Thread Erick Erickson
Not that I know of. You could conceivably do some
work at index time to create a field that sorts
in that order, by mapping these values into a
companion field that sorts the way you want, or
you might be able to write a plugin; see the sketch below.
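As a sketch of that index-time mapping (the field names, offset, and SolrJ
usage are illustrative assumptions, not a tested recipe):

import org.apache.solr.common.SolrInputDocument;

public class PriceSortField {
    // Push prices in [0, 100] past any realistic price so that they
    // sort last when sorting ascending on the companion field.
    static long sortablePrice(long price) {
        return (price >= 0 && price <= 100) ? price + 1000000000L : price;
    }

    public static void main(String[] args) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("price", 50L);                      // real value, for display
        doc.addField("price_sort", sortablePrice(50L));  // 1000000050, for sorting
    }
}

Then sort on price_sort asc instead of price asc.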

Best
Erick

On Wed, Nov 23, 2011 at 3:29 AM, vraa  wrote:
> Hi
>
> I have a query where i sort by a column "price". This field can contain the
> following values
>
> 10
> 75000
> 15
> 1
> 225000
> 50
> 40
>
> I want to sort these values so that values between 0 and 100 always come
> last.
>
> E.g. sorting by price asc should look like this:
> 75000
> 10
> 15
> 225000
> 1
> 40
> 50
>
> Is this possible?
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Sort-question-tp3530070p3530070.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Can files be faceted based on their size ?

2011-11-25 Thread Erick Erickson
Well, you can try adding a <copyField/> directive to put it into
a numeric field
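For example, something like this in schema.xml (a sketch; the destination
field name is an assumption):

<field name="fileSize" type="string" indexed="true" stored="true"/>
<field name="fileSize_l" type="tlong" indexed="true" stored="false"/>
<copyField source="fileSize" dest="fileSize_l"/>

and then run facet.range against fileSize_l instead of the string field.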

But you need to provide significantly more details. From what
you've said there's not enough information to say much besides
"it should work".

Perhaps you should review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Wed, Nov 23, 2011 at 1:35 AM, neuron005  wrote:
> Thanks for replying
> I tried using "Trie" types for faceting but that did not solve the
> problem. If I use Trie types (e.g. I used tlong), it shows a "schema
> mismatch error", since in the FileListEntityProcessor API fileSize has been
> defined as type string. That means we cannot apply facet.range on fileSize.
> Am I right?
> Thanks
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Can-files-be-faceted-based-on-their-size-tp3518393p3529923.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Faceting is not Using Field Value Cache . . ?

2011-11-25 Thread Erick Erickson
In addition to Samuel's comment, the filterCache is also used under
certain circumstances.
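Roughly (the field names below are illustrative): facet.method=fc on a
tokenized or multi-valued field goes through the fieldValueCache, fc on a
single-valued untokenized field uses the Lucene FieldCache (which never shows
up in Solr's cache stats), and facet.method=enum runs one filter per term
through the filterCache:

.../select?q=*:*&facet=true&facet.field=tags&facet.method=fc
.../select?q=*:*&facet=true&facet.field=category&facet.method=enum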

Best
Erick

2011/11/22 Samuel García Martínez :
> AFAIK, FieldValueCache is only used for faceting on tokenized fields.
> Maybe, are you getting confused with FieldCache (
> http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/FieldCache.html)?
> This is used for common facets (using facet.method=fc and not tokenized
> fields).
>
> Does this make any sense to you?
>
> On Tue, Nov 22, 2011 at 7:21 PM, CRB wrote:
>
>>
>> Seeing something odd going on with faceting . . . we execute facets with
>> every query and yet the fieldValueCache is not being used:
>>
>>        name:      fieldValueCache
>> class:      org.apache.solr.search.FastLRUCache
>> version:      1.0
>> description:      Concurrent LRU Cache(maxSize=1, initialSize=10,
>> minSize=9000, acceptableSize=9500, cleanupThread=false)
>> stats:     lookups : 0
>> hits : 0
>> hitratio : 0.00
>> inserts : 0
>> evictions : 0
>> size : 0
>> warmupTime : 0
>> cumulative_lookups : 0
>> cumulative_hits : 0
>> cumulative_hitratio : 0.00
>> cumulative_inserts : 0
>> cumulative_evictions : 0
>>
>> I was under the impression the fieldValueCache  was an implicit cache (if
>> you don't define it, it will still exist).
>>
>> We are running Solr v3.3 (and NOT using {!cache=false}).
>>
>> Thoughts?
>>
>
>
>
> --
> Un saludo,
> Samuel García.
>


Re: remove answers with identical scores

2011-11-25 Thread Fred Zimmerman
Thanks. I did consider post-processing and may wind up doing that; I was
hoping there was a way to have Solr do it for me! That I have to ask this
question is probably not a good sign, but what is LSH clustering?

On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning  wrote:

> You can do that pretty easily by just retrieving extra documents and post
> processing the results list.
>
> You are likely to have a significant number of apparent duplicates this
> way.
>
> To really get rid of duplicates in results, it might be better to remove
> them from the corpus by deploying something like LSH clustering.
>
> On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman  >wrote:
>
> > I have a corpus that has a lot of identical or nearly identical
> documents.
> > I'd like to return only the unique ones (excluding the "nearly identical"
> > which are redirects).  I notice that all the identical/nearly identicals
> > have identical Solr scores. How can I tell Solr to  throw out all the
> > successive documents in an answer set that have identical scores?
> >
> > doc 1 score 5.0
> > doc 2  score 5.0
> > doc 3 score 5.0
> > doc 4 score 4.9
> >
> > skip docs 2 and 3
> >
> > bring back 10 docs with unique scores
> >
>


How many defaultsearchfields we can have in one schema.xml file?

2011-11-25 Thread kiran.bodigam
In my schema I have defined the tag below for indexing the fields because, in
my use case, every field except the uniqueKey needs to be indexed as it is
(with the same datatype):


Here I would like to search all of them without a field name; unfortunately I
can't put all of them in the <defaultSearchField> option because they are
dynamic fields. How can I make all of them default-search fields? Please
suggest.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-many-defaultsearchfields-we-can-have-in-one-schema-xml-file-tp3536020p3536020.html
Sent from the Solr - User mailing list archive at Nabble.com.


Unable to index documents using DataImportHandler with MSSQL

2011-11-25 Thread Ian Grainger
Hi, I have copied my Solr config from a working Windows server to a new
one, and it can't seem to run an import.

They're both using win server 2008 and SQL 2008R2. This is the data
importer config


  
  

  
  

  


I can use MS SQL Profiler to watch the Solr user log in successfully,
but then nothing. It doesn't seem to even try and execute the stored
procedure. Any ideas why this would be working one server and not on
another?

FTR the only thing in the tomcat catalina log is:

org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity data with URL:
jdbc:sqlserver://localhost;databaseName=CATLive

-- 
Ian

i...@isfluent.com
+44 (0)1223 257903


Query a field with no value or a particular value.

2011-11-25 Thread Phil Hoy
Hi,

Is it possible to constrain the results of a query to return docs where a field
contains no value or a particular value?

I tried  ?fq=(field:yes OR -field:[* TO *]) but I get no results even though 
queries with either ?fq=field:yes or ?fq=-field:[* TO *]) do return results.


Phil


Re: Efficient title sorting on large result sets.

2011-11-25 Thread Andrew Ingram

On 21 Nov 2011, at 23:17, Chris Hostetter wrote:

> 
> : The way that I've solved this in the past is to make a field
> : specifically for sorting and then truncate the string to a small number
> : of characters and sort on that. You have to accept that in some cases
> 
> Something to consider is the ICUCollationKeyFilterFactory.  As noted on 
> the wiki...
> 
>   This filter works like CollationKeyFilterFactory, except it uses ICU
>   for collation. This makes smaller and faster sort keys, ...
> 
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUCollationKeyFilterFactory
> 
> 
> -Hoss

Thanks for your help. So it seems that accurate string sorting over a large
result set is always going to be problematic. My preferred solution is to not
expose sorting functionality until the number of results is sufficiently small
(e.g. less than 1000). I'll feed this back to the powers that be.
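For reference, a fieldType sketch along the lines Hoss suggests (the attribute
values are assumptions; it needs the ICU analysis jars on the classpath):

<fieldType name="alphaSort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUCollationKeyFilterFactory" locale="en" strength="primary"/>
  </analyzer>
</fieldType>

Index the title into a field of this type and sort on that field.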

Regards,
Andrew Ingram


Re: Clustering and FieldType

2011-11-25 Thread Stanislaw Osinski
Hi,

You're right -- currently Carrot2 clustering ignores the Solr analysis
chain and uses its own pipeline. It is possible to integrate with Solr's
analysis components to some extent, see the discussion here:
https://issues.apache.org/jira/browse/SOLR-2917.

Staszek


> > Hi
> > Trying to use carrot2 for clustering search results. I have it set up
> except it seems to treat the field as regular text instead of applying some
> custom filters I have.
> >
> > So my schema says something like
> > <field name="title" type="ic_text" ... omitNorms="true"/>
> > <field name="content" type="ic_text" ... compressed="true"/>
> >
> > ic_text is our internal fieldtype with some custom analysers that strip
> out certain special characters from the text.
> >
> > My solrconfig has something like this setup in our default search
> > handler:
> > <bool name="clustering">true</bool>
> > <str name="clustering.engine">default</str>
> > <bool name="clustering.results">true</bool>
> > <str name="carrot.title">title</str>
> > <str name="carrot.snippet">content</str>
> >
> > In my search results, I see clusters but the labels on these clusters
> have the special characters in them - which means that the clustering must
> be running on raw text and not on the "ic_text" field.
> > Can someone let me know if this is the default setup and if there is a
> way to fix this ?
> > Thanks !
> > Geetu
> >
>


Re: remove answers with identical scores

2011-11-25 Thread Ted Dunning
You can do that pretty easily by just retrieving extra documents and post
processing the results list.

You are likely to have a significant number of apparent duplicates this
way.
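A minimal client-side sketch of that (SolrJ; the URL, over-fetch count, and
score handling are assumptions):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class UniqueScoreSearch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("your query");
        query.setRows(50);            // over-fetch so de-duping still leaves 10
        query.setIncludeScore(true);  // ask Solr to return the score field
        QueryResponse rsp = server.query(query);

        Set<Float> seenScores = new HashSet<Float>();
        List<SolrDocument> unique = new ArrayList<SolrDocument>();
        for (SolrDocument doc : rsp.getResults()) {
            Float score = (Float) doc.getFieldValue("score");
            if (seenScores.add(score)) {          // keep first doc per score
                unique.add(doc);
                if (unique.size() == 10) break;   // 10 docs with unique scores
            }
        }
    }
}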

To really get rid of duplicates in results, it might be better to remove
them from the corpus by deploying something like LSH clustering.

On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman wrote:

> I have a corpus that has a lot of identical or nearly identical documents.
> I'd like to return only the unique ones (excluding the "nearly identical"
> which are redirects).  I notice that all the identical/nearly identicals
> have identical Solr scores. How can I tell Solr to  throw out all the
> successive documents in an answer set that have identical scores?
>
> doc 1 score 5.0
> doc 2  score 5.0
> doc 3 score 5.0
> doc 4 score 4.9
>
> skip docs 2 and 3
>
> bring back 10 docs with unique scores
>


Re: Huge Performance: Solr distributed search

2011-11-25 Thread Dmitry Kan
Approximately 45,000,000 documents per shard; Tomcat; caching was tweaked in
solrconfig and each shard given 12GB of RAM max.



<filterCache class="solr.FastLRUCache" size="1200" initialSize="1200"
             autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="..." initialSize="..."
                  autowarmCount="..."/>
<documentCache class="solr.LRUCache" size="..." initialSize="..."/>
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>50</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

In your case I would first check if the network throughput is a bottleneck.

It would be nice if you could check the timestamps of completing a request on
each of the shards and the arrival time (via some HTTP sniffer) at the frontend
SOLR servers. Then you will see whether it is the frontend taking so much time
or a network issue.
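One quick sanity check (host names are placeholders): issue the same query
directly against a single shard and then through the aggregator, and compare
the QTime reported in each response:

http://shard1:8080/solr/select?q=test
http://frontend:8080/solr/select?q=test&shards=shard1:8080/solr,shard2:8080/solr

If the per-shard QTimes are small but the aggregated request is slow, the time
is going into the merge step or the network.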

Are your shards, btw, well balanced?

On Thu, Nov 24, 2011 at 7:06 PM, Artem Lokotosh  wrote:

> >> Can you merge, e.g. 3 shards together or is it much effort for your
> >> team?
> > Yes, we can merge. We'll try to do this and review how it works.
>
> Merging did not help :( I've tried to merge two shards into one, and three
> shards into one, but the results are similar to those of the first
> configuration with 30 shards. And this solution has one big minus: the
> optimization process may take more time.
>
> >> In our setup we currently have 16 shards with ~30GB each, but we
> >> rarely search in all of them at once.
>
> How many documents per shard in your setup? Any difference between
> Tomcat, Jetty or others?
> Have you configured your servlet container more specifically than the
> default configuration?
>
>
> On Wed, Nov 23, 2011 at 4:38 PM, Artem Lokotosh  wrote:
> >> Is this log from the frontend SOLR (aggregator) or from a shard?
> > from aggregator
> >
> >> Can you merge, e.g. 3 shards together or is it much effort for your
> team?
> > Yes, we can merge. We'll try to do this and review how it will works
> > Thanks, Dmitry
> >
> > Any another ideas?
> >
> > On Wed, Nov 23, 2011 at 4:01 PM, Dmitry Kan 
> wrote:
> >> Hello,
> >>
> >> Is this log from the frontend SOLR (aggregator) or from a shard?
> >> Can you merge, e.g. 3 shards together or is it much effort for your
> team?
> >>
> >> In our setup we currently have 16 shards with ~30GB each, but we rarely
> >> search in all of them at once.
> >>
> >> Best,
> >> Dmitry
> >>
> >> On Wed, Nov 23, 2011 at 3:12 PM, Artem Lokotosh 
> wrote:
> >>
> > --
> > Best regards,
> > Artem Lokotosh  mailto:arco...@gmail.com
> >
>
> --
> Best regards,
> Artem Lokotosh  mailto:arco...@gmail.com
>



-- 
Regards,

Dmitry Kan