Dismax: Impossible to search for a _phrase_ in tokenized and untokenized fields at the same time

2009-10-10 Thread Alex Baranov

Hello,

It seems to me that there is no way to use the dismax handler to search
both tokenized and untokenized fields at the same time when searching for a
phrase.

Consider the following example. I have two fields in the index: product_name
and product_name_un. The schema looks like this:

<fieldType name="string_ignore_case" class="solr.TextField"
    positionIncrementGap="100" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_no_stopwords_en" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<field name="product_name" type="text_no_stopwords_en" indexed="true" stored="true"/>
<field name="product_name_un" type="string_ignore_case" indexed="true" stored="true"/>

<copyField source="product_name" dest="product_name_un"/>

I'm using dismax to search in both of them at the same time:
defType=dismax&qf=product_name product_name_un^2.0 (this is done to bring
to the top of the results the products whose name _equals_ the entered
criteria).

1. When I search for a phrase (two or more keywords), e.g. blue car, the
input string is tokenized, and even though I have product_name_un=blue car
in the index, the product_name_un^2.0 part of the dismax config has no
effect.
2. When I enter "blue car" (in quotes) the string is not tokenized and the
product_name_un^2.0 part works, but nothing can be found in the
product_name field.

That is, there is no way to have a proper search against the two fields at
the same time. The workaround I found is using the bq parameter to specify a
boost query against the product_name_un field. But I don't think this should
be the only solution.
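
For illustration, the bq workaround might look like this (the boost value
is arbitrary):

defType=dismax&qf=product_name&bq=product_name_un:"blue car"^2.0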


Another note, related to that: when I set product_name_un as the default
search field and query with ../select/?q=blue car&rows=10..., I get empty
results despite the fact that I have the value blue car in that field in the
index. I have to use quotes again to fix that... Shouldn't Solr determine
the field type and apply the corresponding analyzers/tokenizers/etc.?




Re: Dismax: Impossible to search for a _phrase_ in tokenized and untokenized fields at the same time

2009-10-10 Thread Yonik Seeley
On Sat, Oct 10, 2009 at 6:34 AM, Alex Baranov alex.barano...@gmail.com wrote:

 [...]

 I'm using dismax to search in both of them at the same time:
 defType=dismax&qf=product_name product_name_un^2.0 (this is done to bring
 to the top of the results the products whose name _equals_ the entered
 criteria).

 1. When I search for a phrase (two or more keywords), e.g. blue car, the
 input string is tokenized, and even though I have product_name_un=blue car
 in the index, the product_name_un^2.0 part of the dismax config has no
 effect.

Hmmm, right.  This is due to the fact that the Lucene query parser
(still actually used in dismax) breaks things up by whitespace
*before* analysis (so the analyzer for the untokenized field never
sees the two tokens together).

 2. When I enter "blue car" (in quotes) the string is not tokenized and the
 product_name_un^2.0 part works, but nothing can be found in the
 product_name field.

Using explicit quotes will make a phrase query, so blue and car must
appear right next to each other in product_name.
If it's OK to require both blue and car in product_name, then you can
just set a slop for explicit phrase queries with the qs parameter.
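
For illustration (the qs value is arbitrary), a request along these lines
would let the quoted phrase match product_name with some slop while still
matching the untokenized field exactly:

/select?defType=dismax&qf=product_name+product_name_un^2.0&qs=5&q="blue car"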

-Yonik
http://www.lucidimagination.com








Re: Dismax: Impossible to search for a _phrase_ in tokenized and untokenized fields at the same time

2009-10-10 Thread Alex Baranov
I guess this is a bug that should be filed in JIRA (if it is not there
already). Should I add it?


 Hmmm, right.  This is due to the fact that the Lucene query parser
 (still actually used in dismax) breaks things up by whitespace
 *before* analysis (so the analyzer for the untokenized field never
 sees the two tokens together).


Is there a way to tell the Lucene parser not to break things up by
whitespace? Should one use some whitespace code instead of an actual space?

I think what we need here is some kind of special quoting which would tell
Solr not to use the Lucene query parser at all (this might be very useful
for situations like this, where the search is applied to the default field,
i.e. when the field is not specified).

 If it's OK to require both blue and car in product_name, then you can
 just set a slop for explicit phrase queries with the qs parameter.


Unfortunately that doesn't work for me, but thanks for the suggestion.

Alex Baranov.




Re: DIH and EmbeddedSolr

2009-10-10 Thread rohan rai
ModifiableSolrParams p = new ModifiableSolrParams();
p.add("qt", "/dataimport");
p.add("command", "full-import");
server.query(p, METHOD.POST);

I do this, but it starts giving me this exception:

SEVERE: Full Import failed
java.util.concurrent.RejectedExecutionException
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1760)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
    at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:216)
    at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:366)
    at org.apache.solr.update.DirectUpdateHandler2$CommitTracker.scheduleCommitWithin(DirectUpdateHandler2.java:466)
    at org.apache.solr.update.DirectUpdateHandler2.deleteByQuery(DirectUpdateHandler2.java:322)
    at org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.handler.dataimport.SolrWriter.doDeleteAll(SolrWriter.java:192)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:332)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)




2009/10/10 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 you may need to extend a SolrRequest and set the appropriate path
 (/dataimport) and other params; then you can invoke the request method.

 On Sat, Oct 10, 2009 at 11:07 AM, rohan rai hiroha...@gmail.com wrote:
  The configuration is not an issue...
  But how do I invoke it...

  I have only known a URL way to invoke it and thus import the data into
  the index...
  like http://localhost:8983/solr/db/dataimport?command=full-import
  But with embedded I haven't been able to figure it out

  Regards
  Rohan
  2009/10/10 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com
 
  I guess it should be possible... what are the problems you encounter?
 
  On Sat, Oct 10, 2009 at 10:56 AM, rohan rai hiroha...@gmail.com
 wrote:
   Have been unable to use DIH for Embedded Solr
  
   Is there a way??
  
   Regards
   Rohan
  
 
 
 
  --
  -
  Noble Paul | Principal Engineer| AOL | http://aol.com
 
 



 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com
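
A minimal SolrJ sketch of the suggestion quoted above, reusing QueryRequest
rather than writing a new SolrRequest subclass (this assumes
SolrRequest#setPath is available in your SolrJ version, and that the path
matches the handler registered in solrconfig.xml):

import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

// Sketch: point a stock QueryRequest at the DIH handler instead of /select.
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("command", "full-import");
QueryRequest req = new QueryRequest(params);
req.setPath("/dataimport");  // route to the DataImportHandler
req.process(server);         // server is the EmbeddedSolrServer instance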



Solr 1.4 Release Party

2009-10-10 Thread Israel Ekpo
I can't wait...

-- 
Good Enough is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.


Re: DIH and EmbeddedSolr

2009-10-10 Thread rohan rai
This is pretty unstable... does anyone have any clue? Sometimes it even
creates the index, sometimes it does not.

But every time I do get this exception.

Regards
Rohan




Re: Question regarding proximity search

2009-10-10 Thread AHMET ARSLAN

 Hi,
 I would appreciate it if someone could throw some light on the following
 point regarding proximity search.
 I have a search box, and if a user comes and types in honda car WITHOUT
 any double quotes, I want to get all documents with matches, and they
 should also be ranked based on proximity, i.e. the nearer the two terms
 are, the higher the rank.
 From the admin it looks like in order to test proximity I always have to
 give the words in double quotes with a slop value:
 http://localhost:8983/solr/select/?q="honda+car"~12&version=2.2&start=0&rows=10&indent=on

 Hence it looks like, from the admin point of view, in order to do
 proximity I always have to give the query in double quotes.

 My question is: in order to do a proximity search, do we always have to
 pass the query as a phrase, i.e. in double quotes?

Yes, if you are using LuceneQParserPlugin.
 
 The next question is that I thought using the dismax handler I could do a
 search on a field and specify the ps value in order to define proximity.

 This is the query I am giving, and I get back no results. Any advice on
 where I am going wrong?

 http://localhost:8983/solr/proxTest/?q="honda car"

Can you try http://localhost:8983/solr/proxTest/?q=honda+car instead?
You don't need quotes in dismax.
You can append &debugQuery=true to the URL to see what's going on.
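
For proximity-weighted ranking of an unquoted query, the dismax pf/ps
parameters may also help; a sketch of such a request (field names are
placeholders) would be:

http://localhost:8983/solr/select/?defType=dismax&q=honda+car&qf=text&pf=text&ps=12&debugQuery=true

Here qf matches the individual terms and pf adds a phrase boost with slop
ps, so documents where the terms appear closer together score higher.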

Hope this helps.




  


Customizing solr search: SpanQueries (revisited)

2009-10-10 Thread seanoc5

Hi all,
I am trying to use SpanQueries to save *all* hits for a custom query type
(e.g. defType=fooSpanQuery), along with token positions. I have this working
in straight Lucene, so my challenge is to implement it half-intelligently in
Solr. At the moment, I can't figure out where and how to customize the
'inner' search process.

So far, I have my own SpanQParser and SpanQParserPlugin, which
successfully return a hard-coded span query (but this is not critical for my
current challenge, I believe).

I also have managed to configure Solr to call my custom
SpanQueryComponent, which I believe is the focus of my challenge. At this
initial stage, I have simply extended QueryComponent and overridden
QueryComponent.process() while I try to find my way through the code
:-).

So, with all that setup, can someone point me in the right direction for
custom processing of a query (or just the query results)? A few differences
for my use-case are:
-- I want to save every hit along with position information. I believe this
means I want to use SpanQueries (like I have in Lucene), but perhaps there
are other options.
-- I do not need to build much in the way of a response. This is an
automated analysis, so no user will see the Solr results. I will save them
to a database, but for simplicity just a
log.info("Score: {}, Term: {}, TokenNumber: {}, ...")
would be great at the moment.
-- I will always process every span, even those with a near-zero 'score'.

I think I want to focus on SpanQParser.process(), probably overriding
the functionality in (SolrIndexSearcher) searcher.search(result, cmd),
which seems to just call
getDocListC(qr, cmd); // ?? is this my main focus point??

Does this seem like a reasonable approach? If so, how do I do it? I
think I'm missing something obvious; perhaps there is an easy way to extend
SolrIndexSearcher in solrconfig.xml so that my custom SpanQueryComponent
calls a custom IndexSearcher where I simply override getDocListC()?
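
For reference, a bare-Lucene sketch of the span iteration the component
would ultimately need to drive (the field, term, and reader here are
placeholders; in Solr the reader would come from the SolrIndexSearcher):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

// Sketch: enumerate every span hit with positions, bypassing TopDocs collection.
SpanTermQuery query = new SpanTermQuery(new Term("text", "foo"));
Spans spans = query.getSpans(reader);
while (spans.next()) {
    System.out.println("doc=" + spans.doc()
        + " start=" + spans.start() + " end=" + spans.end());
}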

And for extra-karma-credit: any thoughts on performance gains (or losses?)
if I basically drop most of the advanced optimizations like TopDocsCollector
and such? If I have thousands of queries and want to save *every* span for
each query, is there likely to be significant overhead from the
optimizations that are intended to let users 'page' through windows of
hits?

Also, thanks to Grant for replying to my previous inquiry
(http://osdir.com/ml/solr-dev.lucene.apache.org/2009-05/msg00010.html). This
email is partly me trying to implement his suggestion, and partly just
trying to understand basic Solr customization. I tried sending out a
previous draft of this message yesterday, but haven't seen it on the lists,
so my apologies if this becomes a duplicate post.
Thank you,

Sean



http replication transfer speed

2009-10-10 Thread Mark Miller
Anyone know why you would see a transfer speed of just 10-20 MB/s over a
gigabit network connection?

Even with standard drives, I would expect to see at least around 40 MB/s.
Has anyone seen over 10-20 using replication?

Any ideas on what the bottleneck could be? I think even a standard
drive can do writes at a bit over 40 MB/s, and certainly reads over that.

Thoughts?

-- 
- Mark

http://www.lucidimagination.com





Optimize on slaves?

2009-10-10 Thread Matthew Painter
Hi,
 
Simple question! I have a nightly cron job that sends the optimize command
to Solr on our master instance. Is this also required on replicated Solr
slaves to optimise their indexes?
 
Thanks,
Matt



Re: Optimize on slaves?

2009-10-10 Thread Walter Underwood

No. The slaves will copy the current index, optimized or not. --wunder





RE: Optimize on slaves?

2009-10-10 Thread Matthew Painter
My apologies; I've just found the answer (that optimisation should be done
on the master server only).





Re: http replication transfer speed

2009-10-10 Thread Mark Miller



On a drive that can do 40+ MB/s, could query load knock its writes down to
that?


- Mark

http://www.lucidimagination.com (mobile)






Tips on speeding up indexing needed...

2009-10-10 Thread William Pierce

Folks:

I have a corpus of approx 6M documents, each approx 4K bytes.
Currently, the way indexing is set up, I read documents from a database and
issue Solr post requests in batches (batches are sized so that the
maxPostSize of Tomcat, which is set to 2 MB, is respected). This means that
in each batch we write approx 600 or so documents to Solr. What I am seeing
is that I am able to push about 2500 docs per minute, or approx 40 or so per
second.


I saw in Erik's talk on Friday that speeds of 250 docs/sec to 25000 docs/sec
have been achieved. Needless to say, I am sure that performance numbers vary
widely and depend on the domain, machine configuration, etc.


I am running on Windows 2003 Server, with 4 GB RAM and a dual-core Xeon.

Any tips on what I can do to speed this up?

Thanks,

Bill 



Re: Tips on speeding up indexing needed...

2009-10-10 Thread William Pierce
Oh, and one more thing... for historical reasons our apps run on Microsoft
technologies, so using SolrJ would be next to impossible at the present
time.


Thanks in advance for your help!

-- Bill

--
From: William Pierce evalsi...@hotmail.com
Sent: Saturday, October 10, 2009 5:47 PM
To: solr-user@lucene.apache.org
Subject: Tips on speeding up indexing needed...


Folks:

I have a corpus of approx 6 M documents each of approx 4K bytes. 
Currently, the way indexing is set up I read documents from a database and 
issue solr post requests in batches (batches are set up so that the 
maxPostSize of tomcat which is set to 2MB is adhered to).  This means that 
in each batch we write approx 600 or so documents to SOLR.  What I am 
seeing is that I am able to push about 2500 docs per minute or approx 40 
or so per second.


I saw in Erik's talk on Friday that speeds of 250 docs/sec to 25000 
docs/sec have been achieved.  Needless to say I am sure that performance 
numbers vary widely and are dependent on the domain, machine 
configurations, etc.


I am running on Windows 2003 server, with 4 GB RAM, dual core xeon.

Any tips on what I can do to speed this up?

Thanks,

Bill



Re: Tips on speeding up indexing needed...

2009-10-10 Thread Lance Norskog
A few things off the bat:
1) Do not commit until the end.
2) Use the DataImportHandler - it runs inside Solr and reads the
database directly. This cuts out the HTTP transfer and XML translation
overheads. (A skeletal configuration sketch follows below.)
3) Examine your schema. Some of the text analyzers are quite slow.
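
For reference, a skeletal data-config.xml for such an import (the driver
class, connection string, and column names are placeholders, not a working
configuration):

<dataConfig>
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=corpus"
              user="..." password="..."/>
  <document>
    <entity name="doc" query="SELECT id, title, body FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="text"/>
    </entity>
  </document>
</dataConfig>

It is registered through a requestHandler of class
org.apache.solr.handler.dataimport.DataImportHandler in solrconfig.xml.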

Solr tips:
http://wiki.apache.org/solr/SolrPerformanceFactors

Lucene tips:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

And, what you don't want to hear: for jobs like this, Solr/Lucene is
disk-bound. The Windows NTFS file system is much slower than what is
available for Linux or the Mac, and numbers like those are from such
machines.

Good luck!

Lance Norskog







-- 
Lance Norskog
goks...@gmail.com


Re: Facets with an IDF concept

2009-10-10 Thread Lance Norskog
In Solr a facet is assigned one number: the number of documents in
which it appears, and the facets are sorted by that number. Would your
use case be solved with a second number derived from the relevance of
the associated documents? For example:

   facet relevance = count * sum(scores of documents), with
coefficients for each input?

To do this, for each document counted by the facet, you then have to
find that document in the result list and pull the score. This would
be much slower than the current count-the-documents algorithm. But
if you have limited the document list via a filter, this could still be
fast enough for interactive use.
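
A hypothetical sketch of that computation against Solr's DocSet API (the
hitScores map would have to be captured while collecting the query results;
Solr does not hand it to you today):

import java.util.Map;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSet;

// Sketch: score-weighted facet relevance = count * sum(scores of hits
// that also contain the facet term).
static float facetRelevance(DocSet facetDocs, Map<Integer, Float> hitScores) {
    float sum = 0f;
    int count = 0;
    for (DocIterator it = facetDocs.iterator(); it.hasNext(); ) {
        int doc = it.nextDoc();
        Float score = hitScores.get(doc);  // null if doc is not in the result set
        if (score != null) { sum += score; count++; }
    }
    return count * sum;
}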

If I wanted to make a tag cloud, this is how I would do it.

On Fri, Oct 9, 2009 at 3:58 PM, Asif Rahman a...@newscred.com wrote:
 Hi Wojtek:

 Sorry for the late, late reply.  I haven't implemented this yet, but it is
 on the (long) list of my todos.  Have you made any progress?

 Asif

 On Thu, Aug 13, 2009 at 5:42 PM, wojtekpia wojte...@hotmail.com wrote:


 Hi Asif,

 Did you end up implementing this as a custom sort order for facets? I'm
 facing a similar problem, but not related to time. Given 2 terms:
 A: appears twice in half the search results
 B: appears once in every search result
 I think term A is more interesting. Using facets sorted by frequency,
 term
 B is more important (since it shows up first). To me, terms that appear in
 all documents aren't really that interesting. I'm thinking of using a
 combination of document count (in the result set, not globally) and term
 frequency (in the result set, not globally) to come up with a facet sort
 order.

 Wojtek




 --
 Asif Rahman
 Lead Engineer - NewsCred
 a...@newscred.com
 http://platform.newscred.com




-- 
Lance Norskog
goks...@gmail.com


Re: Is negative boost possible?

2009-10-10 Thread ragi

If you don't want to do a pure negative query and just want to boost a few
documents down based on a matching criterion, try using the linear function
(one of the functions available as a boost function) with a negative m
(slope). We could solve our problem this way.
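
For example (the field name and coefficients are illustrative only):

defType=dismax&q=...&bf=linear(daysSinceUpdate,-1,100)

Here linear(x,m,c) computes m*x + c, so the negative slope subtracts more
from the score as daysSinceUpdate grows, pushing stale documents down
without needing a pure negative query.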



Marc Sturlese wrote:
 
 
 :the only way to negative boost is to positively boost the inverse...
 :
 :(*:* -field1:value_to_penalize)^10
 
  This will do the job as well, as bq supports pure negative queries (at least
  in trunk):
 bq=-field1:value_to_penalize^10
 
 http://wiki.apache.org/solr/SolrRelevancyFAQ#head-76e53db8c5fd31133dc3566318d1aad2bb23e07e
 
 
 hossman wrote:
 
 
 : Use decimal figure less than 1, e.g. 0.5, to express less importance.
 
  but that's still a positive boost ... it still increases the scores of
  documents that match.
 
 the only way to negative boost is to positively boost the inverse...
 
  (*:* -field1:value_to_penalize)^10
 
 :  I am looking for a way to assign negative boost to a term in Solr
 query.
 :  Our use scenario is that we want to boost matching documents that are
 :  updated recently and penalize those that have not been updated for a
 long
 :  time.  There are other terms in the query that would affect the
 scores as
 :  well.  For example we construct a query similar to this:
 :  
 :  *:* field1:value1^2  field2:value2^2 lastUpdateTime:[NOW/DAY-90DAYS
 TO *]^5
 :  lastUpdateTime:[* TO NOW/DAY-365DAYS]^-3
 :  
 :  I notice it's not possible to simply use a negative boosting factor
 in the
 :  query.  Is there any way to achieve such result?
 :  
 :  Regards,
 :  Shi Quan He
 :  
 :
 
 
 
 -Hoss
 
 
 
 
 




Re: Problems with WordDelimiterFilterFactory

2009-10-10 Thread Shalin Shekhar Mangar
On Fri, Oct 9, 2009 at 3:33 AM, Patrick Jungermann 
patrick.jungerm...@googlemail.com wrote:

 Hi Bern,

 the problem is the character sequence --. A query is not allowed to
 have consecutive minus characters. Remove one minus character and the
 query will be parsed without problems.


Or you could escape the hyphen character. If you are using SolrJ, use
ClientUtils.escapeQueryChars on the query string.
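
For example (hypothetical input):

import org.apache.solr.client.solrj.util.ClientUtils;

String escaped = ClientUtils.escapeQueryChars("foo--bar");
// special query characters, including '-', come back backslash-escaped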

-- 
Regards,
Shalin Shekhar Mangar.


Re: Default query parameter for one core

2009-10-10 Thread Shalin Shekhar Mangar
On Fri, Oct 9, 2009 at 7:56 PM, Michael solrco...@gmail.com wrote:

 Hm... still no success.  Can anyone point me to a doc that explains
 how to define and reference core properties?  I've had no luck
 searching Google.

 Shalin, I gave an identical <property name="shardsParam" .../> tag to
 each of my cores, and referenced ${solr.core.shardsParam} (with no
 default specified via a colon) in solrconfig.xml.  I get an error on
 startup:


I should have mentioned it earlier, but the property name in your case would
be just ${shardsParam}. The solr.core prefix is only for automatically
added properties such as name, instanceDir, dataDir, configName, and
schemaName.
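
In the spirit of that note, a per-core property might look like this in
solr.xml (the shard URLs are placeholders):

<core name="core0" instanceDir="core0">
  <property name="shardsParam" value="host1:8983/solr,host2:8983/solr"/>
</core>

with the shared solrconfig.xml referencing it as:

<str name="shards">${shardsParam}</str>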

-- 
Regards,
Shalin Shekhar Mangar.


Re: Default query parameter for one core

2009-10-10 Thread Shalin Shekhar Mangar
On Fri, Oct 9, 2009 at 9:39 PM, Michael solrco...@gmail.com wrote:

 For posterity...

 After reading through http://wiki.apache.org/solr/SolrConfigXml and
 http://wiki.apache.org/solr/CoreAdmin and
 http://issues.apache.org/jira/browse/SOLR-646, I think there's no way
 for me to make only one core specify shards=foo, short of duplicating
 my solrconfig.xml for that core and adding one line:

 - I can't use a variable like ${shardsParam} in a single shared
 solrconfig.xml, because the line
    <str name="shards">${shardsParam}</str>
 has to be in there, and that forces a (possibly empty) shards
 parameter onto cores that *don't* need one, causing a
 NullPointerException.


Well, we can fix the NPE :)

Please raise an issue.


 - I can't suck in just that one str line via a SOLR-646-style import,
 like
    # solrconfig.xml
    <requestHandler ...>
      <lst name="defaults">
        <import file="${shardspec_file}"/>
      </lst>
    </requestHandler>
    # solr.xml
    <core name="core0"><property name="shardspec_file" value="some_file"/>...
    <core name="core1"><property name="shardspec_file" value="/dev/null"/>...
 because SOLR-646's import feature got cut.

 So I think my best bet is to make two mostly-identical
 solrconfig.xmls, and point core0 to the one specifying a shards=
 parameter:
    <core name="core0" config="core0_solrconfig.xml"/>

 I don't like the duplication of config, but at least it accomplishes my
 goal!


There is another way too. Each plugin in Solr now supports a configuration
attribute named enable which can be true or false. You can control the
value (true/false) through a variable, so you can duplicate just the handler
instead of the complete solrconfig.xml.
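
For example (the variable name is illustrative), the duplicated handler
could be declared once and switched per core:

<requestHandler name="/distrib" class="solr.SearchHandler"
                enable="${useShardsHandler:false}">
  <lst name="defaults">
    <str name="shards">host1:8983/solr,host2:8983/solr</str>
  </lst>
</requestHandler>

with useShardsHandler=true set via a <property> only on the core that
needs it.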

-- 
Regards,
Shalin Shekhar Mangar.


Re: Slave re-replication of index over and over

2009-10-10 Thread Shalin Shekhar Mangar
On Fri, Oct 9, 2009 at 9:49 PM, Moshe Cohen mos...@gmail.com wrote:

 Hi,
 I am using Solr 1.4 (July 23rd nightly build) with a master-slave setup.
 I have twice encountered an occurrence of the slave recreating the indexes
 over and over again.
 I couldn't find any pointers in the log.
 Any help would be appreciated.


I vaguely remember a bug which caused the slave to loop. Can you upgrade to
the latest nightly and see if that solves the problem?

-- 
Regards,
Shalin Shekhar Mangar.