Solr Optimization Fail

2011-12-16 Thread Rajani Maski
Hi,

 When we do an optimize, it actually reduces the data size, right?

I have an index of size 6 GB (5 million documents). The index was created
with a commit for every document.

Now I was trying to do the optimization with the HTTP optimize command. When I
did that, the data size became 12 GB. Why might this have happened?
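
For reference, an explicit optimize like the one described above is usually issued against the update handler roughly like this (host, port and core path are placeholders):

  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<optimize/>'

or, equivalently, as a request parameter:

  curl 'http://localhost:8983/solr/update?optimize=true'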

Can anyone please suggest a fix for it?

Thanks
Rajani


disable stemming on query parser.

2011-12-16 Thread meghana
Hi All, 

I am using stemming in my Solr, but I don't want to apply stemming to every
search request. I am thinking of disabling stemming for one specific query
parser; can I do this?

Any help much appreciated.
Thanks in Advance




RE: Solr Optimization Fail

2011-12-16 Thread Juan Pablo Mora
Maybe you are generating a snapshot of your index attached to the optimize?
Look for post-commit or post-optimize events in your solrconfig.xml.


From: Rajani Maski [rajinima...@gmail.com]
Sent: Friday, 16 December 2011 11:11
To: solr-user@lucene.apache.org
Subject: Solr Optimization Fail

Hi,

 When we do optimize, it actually reduces the data size right?

I have index of size 6gb(5 million documents). Index is already created
with commits for every 1 documents.

Now I was trying to do optimization with  http optimize command.   When i
did that,  data size became - 12gb.  Why this might have happened?

And can anyone please suggest me fix for it?

Thanks
Rajani


Re: Solr Optimization Fail

2011-12-16 Thread Rajani Maski
These parameters are commented out in my solrconfig.xml.

See the parameters below.

<!-- The RunExecutableListener executes an external command from a
  hook such as postCommit or postOptimize.
     exe - the name of the executable to run
     dir - dir to use as the current working directory. default=.
     wait - the calling thread waits until the executable returns. default=true
     args - the arguments to pass to the program.  default=nothing
     env - environment variables to set.  default=nothing
  -->
<!-- A postCommit event is fired after every commit or optimize command
<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">solr/bin/snapshooter</str>
  <str name="dir">.</str>
  <bool name="wait">true</bool>
  <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
  <arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>
-->
<!-- A postOptimize event is fired only after every optimize command
<listener event="postOptimize" class="solr.RunExecutableListener">
  <str name="exe">snapshooter</str>
  <str name="dir">solr/bin</str>
  <bool name="wait">true</bool>
</listener>
-->


When I do an optimize on an index of size 400 MB, it reduces the size of the data
folder to 200 MB. But when the data is huge, it doubles it.
Why is that so?

Should optimization actually reduce the size of the data, or does it
just improve search query performance?






On Fri, Dec 16, 2011 at 5:40 PM, Juan Pablo Mora jua...@informa.es wrote:

 Maybe you are generating a snapshot of your index attached to the optimize
 ???
 Look for post-commit or post-optimize events in your solr-config.xml

 
 De: Rajani Maski [rajinima...@gmail.com]
 Enviado el: viernes, 16 de diciembre de 2011 11:11
 Para: solr-user@lucene.apache.org
 Asunto: Solr Optimization Fail

 Hi,

  When we do optimize, it actually reduces the data size right?

 I have index of size 6gb(5 million documents). Index is already created
 with commits for every 1 documents.

 Now I was trying to do optimization with  http optimize command.   When i
 did that,  data size became - 12gb.  Why this might have happened?

 And can anyone please suggest me fix for it?

 Thanks
 Rajani



full-data import suddenly stopped working. Total Rows Fetched remains 0

2011-12-16 Thread PeterKerk
My full-data import stopped working all of a sudden. As far as I know, I have not
made any changes that would cause this.

The response is:
<response>
<script/>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">wedding-data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
<str name="Time Elapsed">0:6:4.112</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">0</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2011-12-16 13:12:29</str>
</lst>
<str name="WARNING">
This response format is experimental. It is likely to change in the future.
</str>
</response>

No matter how often I refresh this page, it stays like this; the only thing
changing is Time Elapsed.

Here's the log:

Dec 16, 2011 1:20:04 PM org.apache.solr.handler.dataimport.DataImporter
doFullIm
port
INFO: Starting Full Import
Dec 16, 2011 1:20:04 PM org.apache.solr.handler.dataimport.SolrWriter
readIndexe
rProperties
INFO: Read dataimport.properties
Dec 16, 2011 1:20:04 PM org.apache.solr.update.DirectUpdateHandler2
deleteAll
INFO: [cam] REMOVING ALL DOCUMENTS FROM INDEX
Dec 16, 2011 1:20:04 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
commit{dir=C:\My
Dropbox\inetpub\apache-solr-4.0-2010-10-12_08-05-48\exa
mple\example-DIH\solr\cam\data\index,segFN=segments_jb,version=1286962723772,gen
eration=695,filenames=[_iv.prx, _iv.frq, segments_jb, _iv.tis, _iv.nrm,
_iv.fdt,
 _iv.fdx, _iv.fnm, _iv.tii]
Dec 16, 2011 1:20:04 PM org.apache.solr.core.SolrDeletionPolicy
updateCommits
INFO: newest commit = 1286962723772
Dec 16, 2011 1:20:04 PM org.apache.solr.handler.dataimport.JdbcDataSource$1
call

INFO: Creating a connection for entity camera with URL:
jdbc:sqlserver://localho
st:1433;databaseName=tt
Dec 16, 2011 1:20:05 PM org.apache.solr.core.SolrCore execute
INFO: [cam] webapp=/solr path=/dataimport params={command=full-import}
status=0
QTime=0
Dec 16, 2011 1:20:06 PM org.apache.solr.core.SolrCore execute
INFO: [cam] webapp=/solr path=/dataimport params={command=full-import}
status=0
QTime=0
Dec 16, 2011 1:20:07 PM org.apache.solr.core.SolrCore execute
INFO: [cam] webapp=/solr path=/dataimport params={command=full-import}
status=0
QTime=0
Dec 16, 2011 1:20:09 PM org.apache.solr.core.SolrCore execute
INFO: [cam] webapp=/solr path=/dataimport params={command=full-import}
status=0
QTime=0
Dec 16, 2011 1:20:09 PM org.apache.solr.core.SolrCore execute
INFO: [cam] webapp=/solr path=/dataimport params={command=full-import}
status=0
QTime=0
Dec 16, 2011 1:20:10 PM org.apache.solr.core.SolrCore execute
INFO: [cam] webapp=/solr path=/dataimport params={command=full-import}
status=0
QTime=0
Dec 16, 2011 1:20:10 PM org.apache.solr.core.SolrCore execute
INFO: [cam] webapp=/solr path=/dataimport params={command=full-import}
status=0
QTime=0
Dec 16, 2011 1:20:11 PM org.apache.solr.core.SolrCore execute
INFO: [cam] webapp=/solr path=/dataimport params={command=full-import}
status=0
QTime=0

Once again, this always used to work. I have no idea why it now doesn't,
since I see no error whatsoever.



Re: Solr Optimization Fail

2011-12-16 Thread Tomás Fernández Löbbe
Are you on Windows? There is a JVM bug that makes Solr keep the old files,
even if they are not used anymore. The files are going to be eventually
removed, but if you want them out of there immediately try optimizing
twice, the second optimize doesn't do much but it will remove the old files.
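
A minimal sketch of that workaround, with a placeholder host and core path; the second call has little left to merge, but it lets Lucene delete the old files:

  curl 'http://localhost:8983/solr/update?optimize=true'
  curl 'http://localhost:8983/solr/update?optimize=true'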

On Fri, Dec 16, 2011 at 9:10 AM, Juan Pablo Mora jua...@informa.es wrote:

 Maybe you are generating a snapshot of your index attached to the optimize
 ???
 Look for post-commit or post-optimize events in your solr-config.xml

 
 De: Rajani Maski [rajinima...@gmail.com]
 Enviado el: viernes, 16 de diciembre de 2011 11:11
 Para: solr-user@lucene.apache.org
 Asunto: Solr Optimization Fail

 Hi,

  When we do optimize, it actually reduces the data size right?

 I have index of size 6gb(5 million documents). Index is already created
 with commits for every 1 documents.

 Now I was trying to do optimization with  http optimize command.   When i
 did that,  data size became - 12gb.  Why this might have happened?

 And can anyone please suggest me fix for it?

 Thanks
 Rajani



Re: Solr Optimization Fail

2011-12-16 Thread Rajani Maski
Oh yes, on Windows, using Java 1.6 and Solr 1.4.1.

Ok let me try that one...

Thank you so much.

Regards,
Rajani



2011/12/16 Tomás Fernández Löbbe tomasflo...@gmail.com

 Are you on Windows? There is a JVM bug that makes Solr keep the old files,
 even if they are not used anymore. The files are going to be eventually
 removed, but if you want them out of there immediately try optimizing
 twice, the second optimize doesn't do much but it will remove the old
 files.

 On Fri, Dec 16, 2011 at 9:10 AM, Juan Pablo Mora jua...@informa.es
 wrote:

  Maybe you are generating a snapshot of your index attached to the
 optimize
  ???
  Look for post-commit or post-optimize events in your solr-config.xml
 
  
  De: Rajani Maski [rajinima...@gmail.com]
  Enviado el: viernes, 16 de diciembre de 2011 11:11
  Para: solr-user@lucene.apache.org
  Asunto: Solr Optimization Fail
 
  Hi,
 
   When we do optimize, it actually reduces the data size right?
 
  I have index of size 6gb(5 million documents). Index is already created
  with commits for every 1 documents.
 
  Now I was trying to do optimization with  http optimize command.   When i
  did that,  data size became - 12gb.  Why this might have happened?
 
  And can anyone please suggest me fix for it?
 
  Thanks
  Rajani
 



How to disable Auto Commit and Auto optimize operation after addition of few documents through dataimport handler

2011-12-16 Thread mechravi25
Hi,

I would like to know how we can disable the commit and optimize operations
that are called by default after documents are added through the dataimport
handler.

In our application, the master Solr instance is used for indexing and the
slave Solr is used for user search requests, so replication has to happen at a
regular interval. The master Solr has around 1.4 million documents (size: 2.7
GB). We have frequent addition/deletion of documents in the master Solr. After
each addition/deletion, commit and optimize operations are called by default,
which tends to be costly. This also makes replication take longer. So my
thought is that the commit operation should be performed only after a certain
number of documents have been added, and the optimize operation should be
performed only once a day, or run manually.
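
For what it's worth, a minimal sketch of the kind of autoCommit setting that can be tuned (or left commented out) in solrconfig.xml; the thresholds below are illustrative only:

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- commit automatically after 10,000 added docs or 60 seconds, whichever comes first -->
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>60000</maxTime>
    </autoCommit>
  </updateHandler>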

Please let me know how to customize the settings for the commit and optimize
operations in solrconfig.xml. Is there any documentation regarding this? Any
pointers would be of great help. Thanks in advance.

Thanks & Regards,
Sivaganesh




SolrCloud Cores

2011-12-16 Thread Jamie Johnson
What is the most appropriate way to configure Solr when deploying in a
cloud environment?  Should the core name on all instances be the
collection name, or is it more appropriate that each shard be a
separate core, or should each Solr instance be a separate core (i.e.
master1 and master1-replica are two separate cores)?


Re: How to disable Auto Commit and Auto optimize operation after addition of few documents through dataimport handler

2011-12-16 Thread Shawn Heisey

On 12/16/2011 5:57 AM, mechravi25 wrote:

I would like to know how can we disable the commit and optimize operation is
called by deafult after addition of few documents through dataimport
handlers.


Add this to the URL you use to call the handler:

commit=false&optimize=false
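
For example, assuming the DIH handler is registered at /dataimport (host and path below are placeholders), the full-import call becomes:

  http://localhost:8983/solr/dataimport?command=full-import&commit=false&optimize=false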

Thanks,
Shawn



Lock obtain timed out

2011-12-16 Thread Eric Tang
Hi,

I'm doing a lot of reads and writes to a single Solr server (on the
order of 50 per second), and have around 300,000 documents in the
index.

Now every 5 minutes I get this exception:
SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
timed out: NativeFSLock@./solr/data/index/write.lock

And I have to restart my solr process.

I've done some googling; some people have suggested raising the Linux
open-file limit or changing the merge factor, but that didn't work.
Does anyone have insights into this?


Thanks,
Eric


Re: disable stemming on query parser.

2011-12-16 Thread Dmitry Kan
You can disable stemming by using a copy field. Define one field for your
input data on which stemming will be done, and another field (the copy
field) on which stemming will not be done. Then on the client you
can decide which field to search against.
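
A minimal schema.xml sketch of that approach; the type, field and filter names below are illustrative, not taken from the original schema:

  <fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="body"       type="text_stemmed"   indexed="true" stored="true"/>
  <field name="body_exact" type="text_unstemmed" indexed="true" stored="false"/>
  <copyField source="body" dest="body_exact"/>

Queries that must not be stemmed then target body_exact (e.g. qf=body_exact with dismax/edismax), while everything else keeps searching body.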

Dmitry

On Fri, Dec 16, 2011 at 2:00 PM, meghana meghana.rav...@amultek.com wrote:

 Hi All,

 I am using Stemming in my solr , but i don't want to apply stemming always
 for each search request. i am thinking of to disable stemming on one
 specific query parser , can i do this?

 Any help much appreciated.
 Thanks in Advance


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/disable-stemming-on-query-parser-tp3591420p3591420.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Lock obtain timed out

2011-12-16 Thread Otis Gospodnetic
Hi Eric,

And you are using the latest version of Solr, 3.5.0?
What is the timeout in solrconfig.xml?
How many CPU cores does the machine have and how many concurrent indexer 
threads do you have running?

Otis 

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Eric Tang eric.x.t...@gmail.com
To: solr-user@lucene.apache.org 
Sent: Friday, December 16, 2011 10:08 AM
Subject: Lock obtain timed out
 
Hi,

I'm doing a lot reads and writes into a single solr server (on the
magnitude of 50ish per second), and have around 300,000 documents in the
index.

Now every 5 minutes I get this exception:
SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
timed out: NativeFSLock@./solr/data/index/write.lock

And I have to restart my solr process.

I've done some googling, some people have suggested raising the limit for
linux file open #, or changing the merge factor, but that didn't work.
Does anyone have insights into this?


Thanks,
Eric




Re: Replication file become very very big

2011-12-16 Thread Otis Gospodnetic
Hi,

Hm, I don't know what this could be caused by.  But if you want to get rid of 
it, remove that Linux server from the load balancer pool, stop Solr, remove 
the index, and restart Solr.  Then force replication and put the server back in 
the load balancer pool.
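
A sketch of the "force replication" step, using the standard replication handler (slave host, port and core path are placeholders):

  curl 'http://linux-slave:8080/solr/replication?command=fetchindex'

command=details against the same handler shows the index version, generation and size the slave currently has.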

If you use SPM (see link in my signature below) you will see how your indices 
grow (and shrink!) over time and will catch this problem when it happens next 
time by looking at the graph that shows info about your index - size on FS, # 
of segments, documents, etc.

Otis 

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html




 From: ZiLi dangld...@163.com
To: solr-user@lucene.apache.org 
Cc: dangld...@163.com 
Sent: Thursday, December 15, 2011 9:28 PM
Subject: Replication file become very very big
 
Hi all,
    I have a very strange problem.
    We use a Windows server as master, serving 5 Windows slaves and 3 Linux slaves.
    It has worked normally for 2 months. But today we found that one of the Linux 
slaves' index files has become very, very big (150 GB! The others are 300 MB). And we can't 
find the index folder under the data folder. There are just four entries: index.20111203090855 
(150 GB), index.properties, replication.properties and spellchecker. By the way, 
although this index is 150 GB, the service is normal and queries are very fast.
    Also, our Linux slaves poll the index from the server every 40 minutes, 
and every 15 minutes our program updates the server's Solr index.
   We disabled autoCommit in solrconfig.xml. Could this have caused the problem via 
some big transaction?
    Any suggestion will be appreciated.





Re: Core overhead

2011-12-16 Thread Otis Gospodnetic
Hi,

I used to think this, too, but have learned this not to be entirely true.  We 
had a customer with a query rate of a few hundred QPS and 32 or 64 GB RAM 
(don't recall which any more) and a pretty large JVM heap.  Most queries were 
very fast, but once in a while a query would be very slow.  GC, we thought!  So 
the initial thinking was that it must be that big heap of theirs.  But long 
story short, instead of making the heap smaller we just tuned the JVM and took 
care of those slow queries.  Using SPM (link in sig) and seeing GC info 
(collection counts, times, heap size, etc.) was invaluable!

Otis


Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html - FREE!




 From: Robert Stewart bstewart...@gmail.com
To: solr-user@lucene.apache.org 
Sent: Thursday, December 15, 2011 2:16 PM
Subject: Re: Core overhead
 
One other thing I did not mention is GC pauses.  If you have smaller
heap sizes, you would have fewer very long GC pauses, so that can be an
advantage of having many cores (if the cores are distributed into separate
SOLR instances, as separate processes).  I think you can expect a 1
second pause for each GB of heap size in the worst case.



On Thu, Dec 15, 2011 at 2:14 PM, Robert Stewart bstewart...@gmail.com wrote:
 It is true number of terms may be much more than N/10 (or even N for
 each core), but it is the number of docs per term that will really
 matter.  So you can have N terms in each core but each term has 1/10
 number of docs on avg.




 2011/12/15 Yury Kats yuryk...@yahoo.com:
 On 12/15/2011 1:07 PM, Robert Stewart wrote:

 I think overall memory usage would be close to the same.

 Is this really so? I suspect that the consumed memory is in direct
 proportion to the number of terms in the index. I also suspect that
 if I divided 1 core with N terms into 10 smaller cores, each smaller
 core would have much more than N/10 terms. Let's say I'm indexing
 English texts, it's likely that all smaller cores would have almost
 the same number of terms, close to the original N. Not so?




Re: Lock obtain timed out

2011-12-16 Thread Eric Tang
Hi Otis,

I'm using 3.2 because I can't get velocity to run on 3.5.

I've changed my writeLockTimeout from 1000 to 1, and my
commitLockTimeout from 1 to 5
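
For reference, those two settings normally live in the indexDefaults (and mainIndex) section of solrconfig.xml; the values below are only a sketch, not a recommendation:

  <indexDefaults>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
  </indexDefaults>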

Running on a large ec2 box, which has 2 virtual cores.  I don't know how to
find out the # of concurrent indexer threads.  Is that the same as
maxWarmingSearchers?  If that's the case I've changed it from 2 to 5.  I
have about 12 processes running concurrently to read/write to solr at the
moment, but this is just a test and I'm planning to up this number to 50 -
100.

Thanks,
Eric



On Fri, Dec 16, 2011 at 10:14 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Hi Eric,

 And you are using the latest version of Solr, 3.5.0?
 What is the timeout in solrconfig.xml?
 How many CPU cores does the machine have and how many concurrent indexer
 threads do you have running?

 Otis
 
 Performance Monitoring SaaS for Solr -
 http://sematext.com/spm/solr-performance-monitoring/index.html



 
  From: Eric Tang eric.x.t...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Friday, December 16, 2011 10:08 AM
 Subject: Lock obtain timed out
 
 Hi,
 
 I'm doing a lot reads and writes into a single solr server (on the
 magnitude of 50ish per second), and have around 300,000 documents in the
 index.
 
 Now every 5 minutes I get this exception:
 SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
 timed out: NativeFSLock@./solr/data/index/write.lock
 
 And I have to restart my solr process.
 
 I've done some googling, some people have suggested raising the limit for
 linux file open #, or changing the merge factor, but that didn't work.
 Does anyone have insights into this?
 
 
 Thanks,
 Eric
 
 
 



Re: Core overhead

2011-12-16 Thread Otis Gospodnetic
Hi Yury,

Not sure if this was already covered in this thread, but with N smaller cores 
on a single N-CPU-core box you could run N queries in parallel over smaller 
indices, which may be faster than a single query going against a single big 
index, depending on how many concurrent query requests the box is handling 
(i.e. how busy or idle the CPU cores are).

Otis


Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Yury Kats yuryk...@yahoo.com
To: solr-user@lucene.apache.org 
Sent: Thursday, December 15, 2011 12:58 PM
Subject: Core overhead
 
Does anybody have an idea, or better yet, measured data,
to see what the overhead of a core is, both in memory and speed?

For example, what would be the difference between having 1 core
with 100M documents versus having 10 cores with 10M documents?




Re: Lock obtain timed out

2011-12-16 Thread Otis Gospodnetic
Hi,



I'm using 3.2 because I can't get velocity to run on 3.5.


Maybe this is worth asking about in a separate thread or maybe you already 
did that.

I've changed my writeLockTimeout from 1000 to 1, and my
commitLockTimeout from 1 to 5

Running on a large ec2 box, which has 2 virtual cores.  I don't know how to

Note: *2* *virtual* cores.

find out the # of concurrent indexer threads.  Is that the same as
maxWarmingSearchers?  If that's the case I've changed it from 2 to 5.  I

2 is better than 5 here

have about 12 processes running concurrently to read/write to solr at the
moment, but this is just a test and I'm planning to up this number to 50 -
100.


Some of these processes are writing to Solr (indexing), others are reading from 
it (searching).
Having more than 1-2 indexing processes on an EC2 box with just 2 *virtual* 
cores will be suboptimal.
Does the error go away if you change your application to have just 1 indexing 
thread?

Otis
Performance Monitoring SaaS for Solr 
- http://sematext.com/spm/solr-performance-monitoring/index.html






On Fri, Dec 16, 2011 at 10:14 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Hi Eric,

 And you are using the latest version of Solr, 3.5.0?
 What is the timeout in solrconfig.xml?
 How many CPU cores does the machine have and how many concurrent indexer
 threads do you have running?

 Otis
 
 Performance Monitoring SaaS for Solr -
 http://sematext.com/spm/solr-performance-monitoring/index.html



 
  From: Eric Tang eric.x.t...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Friday, December 16, 2011 10:08 AM
 Subject: Lock obtain timed out
 
 Hi,
 
 I'm doing a lot reads and writes into a single solr server (on the
 magnitude of 50ish per second), and have around 300,000 documents in the
 index.
 
 Now every 5 minutes I get this exception:
 SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
 timed out: NativeFSLock@./solr/data/index/write.lock
 
 And I have to restart my solr process.
 
 I've done some googling, some people have suggested raising the limit for
 linux file open #, or changing the merge factor, but that didn't work.
 Does anyone have insights into this?
 
 
 Thanks,
 Eric
 
 
 



 


Re: how to setup to archive expired documents?

2011-12-16 Thread Otis Gospodnetic
Hi,

We've done a fair number of such things over the years. :)
If daily shards don't work for you, why not weekly or monthly?
Have a look at Zoie's Hourglass concept/code.
Some Solr alternatives are currently better suited to handle this sort of 
setup...

Otis 

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



- Original Message -
 From: Robert Stewart bstewart...@gmail.com
 To: solr-user@lucene.apache.org
 Cc: 
 Sent: Thursday, December 15, 2011 12:55 PM
 Subject: Re: how to setup to archive expired documents?
 
 I think managing 100 cores will be too much headache.  Also
 performance of querying 100 cores will not be good (need
 page_number*page_size from 100 cores, and then merge).
 
 I think having around 10 SOLR instances, each one about 10M docs.
 Always search all 10 nodes.  Index using some hash(doc) to distribute
 new docs among nodes.  Run some nightly/weekly job to delete old docs
 and force merge (optimize) to some min/max number of segments.  I
 think that will work ok, but not sure about how to handle
 replication/failover so each node is redundant.  If we use SOLR
 replication it will have problems with replication after optimize for
 large indexes.  Seems to take a long time to move 10M doc index from
 master to slave (around 100GB in our case).  Doing it once per week is
 probably ok.
 
 
 
 2011/12/15 Avni, Itamar itamar.a...@verint.com:
  What about managing a core for each day?
 
  This way the deletion/archive is very simple. No holes in the 
 index (which is often when deleting document by document).
  The index done against core [today-0].
  The query is done against cores [today-0],[today-1]...[today-99]. Quite a 
 headache.
 
  Itamar
 
  -Original Message-
  From: Robert Stewart [mailto:bstewart...@gmail.com]
  Sent: Thursday, 15 December 2011 16:54
  To: solr-user@lucene.apache.org
  Subject: how to setup to archive expired documents?
 
  We have a large (100M) index where we add about 1M new docs per day.
  We want to keep index at a constant size so the oldest ones are removed 
 and/or archived each day (so index contains around 100 days of data).  What 
 is 
 the best way to do this?  We still want to keep older data in some archive 
 index, not just delete it (so is it possible to export older segments, etc. 
 into 
 some other index?).  If we have some daily job to delete old data, I assume 
 we'd need to optimize the index to actually remove and free space, but that 
 will require very large (and slow) replication after optimize which will 
 probably not work out well for so large an index.  Is there some way to shard 
 the data or other best practice?
 
  Thanks
  Bob
 



Re: Solr AutoComplete - Address Search

2011-12-16 Thread Vijay Sampath
Just to add to it, I'm using the Suggester component to implement autocomplete: 
http://wiki.apache.org/solr/Suggester
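
For anyone following along, a rough sketch of the setup from that wiki page; the component name, handler path and source field are placeholders:

  <searchComponent class="solr.SpellCheckComponent" name="suggest">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
      <str name="field">name</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

  <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.count">5</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>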



Re: Poor performance on distributed search

2011-12-16 Thread Erick Erickson
The thing that jumps out at me is rows=2000. If your documentCache in
solrconfig.xml still has the defaults, it only holds 512 entries. So you're running
all over your disk gathering up the fields to return, especially since
you also specified fl=*,score. And if you have large stored fields, you're
doing an awful lot of disk reading.
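
For reference, that cache is the documentCache entry in solrconfig.xml; the stock example ships with roughly the following (these are the defaults being referred to, tune to taste):

  <documentCache class="solr.LRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>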

As simple tests to see if this is on the right track, try these, singly
and in combination:

1) try with rows=10
2) try with fl=id (assuming id is your uniqueKey)

Best
Erick

On Thu, Dec 15, 2011 at 5:00 PM, ku3ia dem...@gmail.com wrote:
 Hi, all!

 I have a problem with distributed search. I downloaded one shard from my
 production. It has:
 * ~29M docs
 * 11 fields
 * ~105M terms
 * size of shard is: 13GB
 On production there are near 30 the same shards. I split this shard to 4
 more smaller shards, so now I have:
 small shard1:
 docs: 6.2M
 terms: 27.2M
 size: 2.89GB
 small shard2:
 docs: 6.3M
 terms: 28.7M
 size: 2.98GB
 small shard3:
 docs: 7.9M
 terms: 32.8M
 size: 3.60GB
 small shard4:
 docs: 8.2M
 terms: 32.6M
 size: 3.70GB

 My machine confguration:
 ABIT AX-78
 AMD Athlon 64 X2 5200+
 DDR2 Kingston 2x2G+2x1G = 6G
 WDC WD2500JS (System here)
 WDC WD20EARS (6 partitions = 30 GB for shards at begin of drive, and other
 empty, all partitions are well aligned)
 GNU/Linux Debian Squeeze
 Tomcat 6.0.32 with JAVA_OPTS:
 JAVA_OPTS=$JAVA_OPTS -XX:+DisableExplicitGC -server \
    -XX:PermSize=512M -XX:MaxPermSize=512M -Xmx4096M -Xms4096M
 -XX:NewSize=128M -XX:MaxNewSize=128M \
    -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
    -XX:CMSInitiatingOccupancyFraction=50 -XX:GCTimeRatio=9
 -XX:MinHeapFreeRatio=25 -XX:MaxHeapFreeRatio=25 \
    -verbose:gc -XX:+PrintGCTimeStamps -Xloggc:$CATALINA_HOME/logs/gc.log
 Solr 3.5

 I configured 4 cores and start Tomcat. I write a bash script. It's runing
 during 300 seconds and sending every 6 seconds queries like
 http://127.0.0.1:8080/solr/shard1/select/?ident=true&q=(assistants)&rows=2000&start=0&fl=*,score&qt=requestShards
 where qt=requestShards is my 4 shards. After test I have the results:

 Elapsed time: 299 secs
 --- solr ---
 Queries processed: 21  this is full response file
 Queries cancelled: 29  this is number of killed curls
 Average QTime is: 59645.6 ms
 Average RTime is: 59.7619 sec(s)  this is average time difference between
 start and stop the curl. There is a part of script:
 # dcs=`date +%s`
 # curl ${url} -s -H 'Content-type:text/xml; charset=utf-8' 
 ${F_DATADIR}/$dest.fdata
 # dce=`date +%s`
 # dcd=$(echo $dce - $dcs | bc)
 Size of data-dir is: 3346766 bytes  this is response dir size

 I'm using nmon to to monitor R/W disk speed, and I was surprised that read
 speed of my shards volumes WDC20EAR's drive was nearly 3 MB/s when script is
 working. After this I run benchmark test from disk utility. Here is results:
 Minimum read rate: 53.2MB/s
 Maximum Read rate: 126.4 MB/s
 Average Read rate: 95.8 MB/s

 But from the other side I tested queries like
 http://127.0.0.1:8080/solr/shard1/select/?ident=true&q=(assistants)&rows=2000&start=0&fl=*,score
 results is:
 Elapsed time: 299 secs
 --- solr ---
 Queries processed: 50
 Queries cancelled: 0
 Average QTime is: 139.76 ms
 Average RTime is: 2.2 sec(s)
 Size of data-dir is: 6819259 bytes

 and quesries like
 http://127.0.0.1:8080/solr/shard1/select/?ident=true&q=(assistants)&rows=2000&start=0&fl=*,score&shards=127.0.0.1:8080/solr/shard1
 and result is:
 Elapsed time: 299 secs
 --- solr ---
 Queries processed: 49
 Queries cancelled: 1
 Average QTime is: 1878.37 ms
 Average RTime is: 1.95918 sec(s)
 Size of data-dir is: 4274099 bytes
 So we see the results are the same.

 My big question is: why is so slow drive read speed when Solr is working?
 Thanks for any replies

 P.S. And maybe my general problem is too much terms in shard, for example,
 query
 http://127.0.0.1:8080/solr/shard1/terms?terms.fl=field1
 shows:
 <lst name="field1">
 <int name="a">58641</int>
 <int name="the">45022</int>
 <int name="i">36339</int>
 <int name="s">35637</int>
 <int name="d">34247</int>
 <int name="m">33869</int>
 <int name="b">28961</int>
 <int name="r">28147</int>
 <int name="e">27654</int>
 <int name="n">26940</int>
 </lst>

 Thanks.

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Poor-performance-on-distributed-search-tp3590028p3590028.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to setup to archive expired documents?

2011-12-16 Thread Robert Stewart
We actually have a system that uses weekly shards but that is all .NET 
(Lucene.NET) and has lots of code to manage adding new indexes.  We want to 
move to SOLR for performance and maintenance reasons.  

So if we use some sort of weekly or daily sharding, there needs to be some 
mechanism in place to dynamically add the new shard when the current one fills 
up.  (Which would also ideally know where to put the new shards on what server, 
etc.) Since SOLR does not implement that I was thinking of just having a static 
set of shards.  


On Dec 16, 2011, at 10:54 AM, Otis Gospodnetic wrote:

 Hi,
 
 We've done a fair number of such things over the years. :)
 If daily shards don't work for you, why not weekly or monthly?
 Have a look at Zoie's Hourglass concept/code.
 Some Solr alternatives are currently better suited to handle this sort of 
 setup...
 
 Otis 
 
 Performance Monitoring SaaS for Solr - 
 http://sematext.com/spm/solr-performance-monitoring/index.html
 
 
 
 - Original Message -
 From: Robert Stewart bstewart...@gmail.com
 To: solr-user@lucene.apache.org
 Cc: 
 Sent: Thursday, December 15, 2011 12:55 PM
 Subject: Re: how to setup to archive expired documents?
 
 I think managing 100 cores will be too much headache.  Also
 performance of querying 100 cores will not be good (need
 page_number*page_size from 100 cores, and then merge).
 
 I think having around 10 SOLR instances, each one about 10M docs.
 Always search all 10 nodes.  Index using some hash(doc) to distribute
 new docs among nodes.  Run some nightly/weekly job to delete old docs
 and force merge (optimize) to some min/max number of segments.  I
 think that will work ok, but not sure about how to handle
 replication/failover so each node is redundant.  If we use SOLR
 replication it will have problems with replication after optimize for
 large indexes.  Seems to take a long time to move 10M doc index from
 master to slave (around 100GB in our case).  Doing it once per week is
 probably ok.
 
 
 
 2011/12/15 Avni, Itamar itamar.a...@verint.com:
 What about managing a core for each day?
 
 This way the deletion/archive is very simple. No holes in the 
 index (which is often when deleting document by document).
 The index done against core [today-0].
 The query is done against cores [today-0],[today-1]...[today-99]. Quite a 
 headache.
 
 Itamar
 
 -Original Message-
 From: Robert Stewart [mailto:bstewart...@gmail.com]
  Sent: Thursday, 15 December 2011 16:54
 To: solr-user@lucene.apache.org
 Subject: how to setup to archive expired documents?
 
 We have a large (100M) index where we add about 1M new docs per day.
 We want to keep index at a constant size so the oldest ones are removed 
 and/or archived each day (so index contains around 100 days of data).  What 
 is 
 the best way to do this?  We still want to keep older data in some archive 
 index, not just delete it (so is it possible to export older segments, etc. 
 into 
 some other index?).  If we have some daily job to delete old data, I assume 
 we'd need to optimize the index to actually remove and free space, but that 
 will require very large (and slow) replication after optimize which will 
 probably not work out well for so large an index.  Is there some way to 
 shard 
 the data or other best practice?
 
 Thanks
 Bob
 
 



Re: edismax doesn't obey 'pf' parameter

2011-12-16 Thread Erick Erickson
A side note: specifying qt and defType on the same query is probably
not what you
intend. I'd just omit the qt bit since you're essentially passing all
the info you intend
explicitly...

I see the same behavior when I specify a non-tokenized field in 3.5

But I don't think this is a bug since it doesn't make sense to specify a phrase
field on a non-tokenized field since there's always exactly one token
at position
0. The whole idea of phrases is that multiple tokens must appear within
the slop.

Best
Erick

On Thu, Dec 15, 2011 at 5:46 PM, entdeveloper
cameron.develo...@gmail.com wrote:
 I'm observing strange results with both the correct and incorrect behavior
 happening depending on which field I put in the 'pf' param. I wouldn't think
 this should be analyzer specific, but is it?

 If I try:
 http://localhost:8080/solr/collection1/select?qt=%2Fsearch&q=mickey%20mouse&debugQuery=on&defType=edismax&pf=blah_exact&qf=blah

 It looks correct:
 <str name="rawquerystring">mickey mouse</str>
 <str name="querystring">mickey mouse</str>
 <str name="parsedquery">+((DisjunctionMaxQuery((blah:mickey))
 DisjunctionMaxQuery((blah:mouse)))~2)
 DisjunctionMaxQuery((blah_exact:mickey mouse))</str>
 <str name="parsedquery_toString">+(((blah:mickey) (blah:mouse))~2)
 (blah_exact:mickey mouse)</str>

 However, If I put in the field I want, for some reason that phrase portion
 of the query just completely drops off:
 http://localhost:8080/solr/collection1/select?qt=%2Fsearch&q=mickey%20mouse&debugQuery=on&defType=edismax&pf=name_exact&qf=name

 Results:
 <str name="rawquerystring">mickey mouse</str>
 <str name="querystring">mickey mouse</str>
 <str name="parsedquery">+((DisjunctionMaxQuery((name:mickey))
 DisjunctionMaxQuery((name:mouse)))~2) ()</str>
 <str name="parsedquery_toString">+(((name:mickey) (name:mouse))~2) ()</str>

 The name_exact field's analyzer uses KeywordTokenizer, but again, I think
 this query is being formed too early in the process for that to matter at
 this point

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/edismax-doesn-t-obey-pf-parameter-tp3589763p3590153.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Version Upgrade issue

2011-12-16 Thread Erick Erickson
Please start another thread and provide some details, there's not enough
information here to say anything. You might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Thu, Dec 15, 2011 at 11:50 PM, Pawan Darira pawan.dar...@gmail.com wrote:
 Thanks. I re-started from scratch and at least things have started working
 now. I upgraded by deploying 3.2 war in my jboss. Also, did conf changes as
 mentioned in CHANGES.txt

 It did expected to have a separate libdirectory which was not required in
 1.4.

 New problem is that it's taking very long to build indexes more than an
 hour. it took only 10 minutes in 1.4. Can u please guide regarding this.

 Should i attach my solrconfig.xml for reference

 On Wed, Dec 7, 2011 at 8:22 PM, Erick Erickson erickerick...@gmail.comwrote:

 How did you upgrade? What steps did you follow? Do you have
 any custom code? Any additional lib entries in your
 solrconfig.xml?

 These details help us diagnose your problem, but it's almost certainly
 that you have a mixture of jar files lying around your machine in
 a place you don't expect.

 Best
 Erick

 On Wed, Dec 7, 2011 at 1:28 AM, Pawan Darira pawan.dar...@gmail.com
 wrote:
  I checked that. there are only latest jars. I am not able to figure out
 the
  issue.
 
  On Tue, Dec 6, 2011 at 6:57 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  Looks like you must have a mix of old and new jars.
 
  On Tuesday, December 6, 2011, Pawan Darira pawan.dar...@gmail.com
 wrote:
   Hi
  
   I am trying to upgrade my SOLR version from 1.4 to 3.2. but it's
 giving
  me
   below exception. I have checked solr home path  it is correct..
 Please
  help
  
   SEVERE: Could not start Solr. Check solr/home property
   java.lang.NoSuchMethodError:
  
 
 
 org.apache.solr.common.SolrException.logOnce(Lorg/slf4j/Logger;Ljava/lang/String;Ljava/lang/Throwable;)V
          at
 org.apache.solr.core.CoreContainer.load(CoreContainer.java:321)
          at
 org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
          at
  
 
 
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
          at
  
 
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
          at
  
 
 
 org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
          at
  
 
 
 org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
          at
  
 
 
 org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:108)
          at
  
 
 
 org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3720)
          at
  
 org.apache.catalina.core.StandardContext.start(StandardContext.java:4358)
          at
  
 
 
 org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
          at
  
 org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:732)
          at
   org.apache.catalina.core.StandardHost.addChild(StandardHost.java:553)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at
  
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at
  
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:585)
          at
  
 
 
 org.apache.tomcat.util.modeler.BaseModelMBean.invoke(BaseModelMBean.java:297)
          at
  
 org.jboss.mx.server.RawDynamicInvoker.invoke(RawDynamicInvoker.java:164)
          at
   org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:659)
          at
  
 org.apache.catalina.core.StandardContext.init(StandardContext.java:5300)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at
  
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at
  
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:585)
          at
  
 
 
 org.apache.tomcat.util.modeler.BaseModelMBean.invoke(BaseModelMBean.java:297)
          at
  
 org.jboss.mx.server.RawDynamicInvoker.invoke(RawDynamicInvoker.java:164)
          at
   org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:659)
          at
  
 
 
 org.jboss.web.tomcat.service.TomcatDeployer.performDeployInternal(TomcatDeployer.java:301)
          at
  
 
 
 org.jboss.web.tomcat.service.TomcatDeployer.performDeploy(TomcatDeployer.java:104)
          at
   org.jboss.web.AbstractWebDeployer.start(AbstractWebDeployer.java:375)
          at org.jboss.web.WebModule.startModule(WebModule.java:83)
  
 
  --
  - Mark
 
  http://www.lucidimagination.com
 




 --
 Thanks,
 Pawan


Re: edismax doesn't obey 'pf' parameter

2011-12-16 Thread Erick Erickson
That was a little confusing!

 there's always exactly one token at position 0.

Of course. What I meant to say was that there is
always exactly one token in a non-tokenized
field and its offset is always exactly 0. There
will never be tokens at position 1.

So asking to match phrases, which is based on
term positions, is basically a no-op.
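
To make the "non-tokenized" case concrete, a hypothetical field type like the one below keeps the entire input as a single token, so there is never a second position for a phrase (pf) match to line up against:

  <!-- the whole input, e.g. "mickey mouse", becomes one token -->
  <fieldType name="string_exact" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>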

Hope that makes more sense
Erick

On Fri, Dec 16, 2011 at 11:44 AM, Erick Erickson
erickerick...@gmail.com wrote:
 A side note: specifying qt and defType on the same query is probably
 not what you
 intend. I'd just omit the qt bit since you're essentially passing all
 the info you intend
 explicitly...

 I see the same behavior when I specify a non-tokenized field in 3.5

 But I don't think this is a bug since it doesn't make sense to specify a 
 phrase
 field on a non-tokenized field since there's always exactly one token
 at position
 0. The whole idea of phrases is that multiple tokens must appear within
 the slop.

 Best
 Erick

 On Thu, Dec 15, 2011 at 5:46 PM, entdeveloper
 cameron.develo...@gmail.com wrote:
 I'm observing strange results with both the correct and incorrect behavior
 happening depending on which field I put in the 'pf' param. I wouldn't think
 this should be analyzer specific, but is it?

 If I try:
 http://localhost:8080/solr/collection1/select?qt=%2Fsearchq=mickey%20mousedebugQuery=ondefType=edismaxpf=blah_exactqf=blah

 It looks correct:
 str name=rawquerystringmickey mouse/str
 str name=querystringmickey mouse/str
 str name=parsedquery+((DisjunctionMaxQuery((blah:mickey))
 DisjunctionMaxQuery((blah:mouse)))~2)
 DisjunctionMaxQuery((blah_exact:mickey mouse))/str
 str name=parsedquery_toString+(((blah:mickey) (blah:mouse))~2)
 (blah_exact:mickey mouse)/str

 However, If I put in the field I want, for some reason that phrase portion
 of the query just completely drops off:
 http://localhost:8080/solr/collection1/select?qt=%2Fsearchq=mickey%20mousedebugQuery=ondefType=edismaxpf=name_exactqf=name

 Results:
 str name=rawquerystringmickey mouse/str
 str name=querystringmickey mouse/str
 str name=parsedquery+((DisjunctionMaxQuery((name:mickey))
 DisjunctionMaxQuery((name:mouse)))~2) ()/str
 str name=parsedquery_toString+(((name:mickey) (name:mouse))~2) ()/str

 The name_exact field's analyzer uses KeywordTokenizer, but again, I think
 this query is being formed too early in the process for that to matter at
 this point

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/edismax-doesn-t-obey-pf-parameter-tp3589763p3590153.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Announcement of Soldash - a dashboard for multiple Solr instances

2011-12-16 Thread Otis Gospodnetic
Nice!

May be good to upload some screenshots there...

Otis 

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



- Original Message -
 From: Alexander Valet | edelight alexander.va...@edelight.de
 To: solr-user@lucene.apache.org
 Cc: 
 Sent: Thursday, December 15, 2011 9:50 AM
 Subject: Announcement of Soldash - a dashboard for multiple Solr instances
 
 We use Solr quite a bit at edelight -- and love it. However, we encountered 
 one 
 minor peeve: although each individual
 Solr server has its own dashboard, there's no easy way of getting a complete 
 overview of an entire Solr cluster and the
 status of its nodes.
 
 Over the last weeks our own Aengus Walton developed Soldash, a dashboard for 
 your entire Solr cluster.
 
 Although still in its infancy, Soldash gives you an overview of:
 
     - your Solr servers
     - what version of Solr they're running
     - what index version they have, and whether slaves are in sync with their 
 master
 
 as well as allowing you to:
 
     - turn polling and replication on or off
     - force an index fetch on a slave
     - display a file list of the current index
     - backup the index
     - reload the index
 
 It is worth noting that due to the set-up of our own environment, Soldash has 
 been programmed to automatically presume all Solr instances have the same 
 cores. 
 This may change in future releases, depending on community reaction.
 
 The project is open-source and hopefully some of you shall find this tool 
 useful 
 in day-to-day administration of Solr.
 
 The newest version (0.2.2) can be downloaded at:
 https://github.com/edelight/soldash/tags
 
 Instructions on how to configure Soldash can be found at the project's 
 homepage on github:
 https://github.com/edelight/soldash
 
 Feedback and suggestions are very welcome!
 
 
 
 --
 edelight GmbH, Wilhelmstr. 4a, 70182 Stuttgart
 
 Fon: +49 (0)711-912590-14 | Fax: +49 (0)711-912590-99
 
 Geschäftsführer: Peter Ambrozy, Tassilo Bestler
 Amtsgericht Stuttgart, HRB 722861
 Ust.-IdNr. DE814842587
 
 
 This email is confidential. If you are not the intended recipient, you must 
 not 
 copy, disclose or use its contents. If you have received it in error, please 
 inform us immediately by return email and delete the document.



Re: Core overhead

2011-12-16 Thread Jason Rutherglen
Wow the shameless plugging of product (footer) has hit a new low Otis.

On Fri, Dec 16, 2011 at 7:32 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hi Yury,

 Not sure if this was already covered in this thread, but with N smaller cores 
 on a single N-CPU-core box you could run N queries in parallel over smaller 
 indices, which may be faster than a single query going against a single big 
 index, depending on how many concurrent query requests the box is handling 
 (i.e. how busy or idle the CPU cores are).

 Otis
 

 Performance Monitoring SaaS for Solr - 
 http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Yury Kats yuryk...@yahoo.com
To: solr-user@lucene.apache.org
Sent: Thursday, December 15, 2011 12:58 PM
Subject: Core overhead

Does anybody have an idea, or better yet, measured data,
to see what the overhead of a core is, both in memory and speed?

For example, what would be the difference between having 1 core
with 100M documents versus having 10 cores with 10M documents?





updates to runbot.sh script

2011-12-16 Thread Christopher Gross
http://wiki.apache.org/nutch/Crawl

This script no longer works.  See:
echo - Index (Step 5 of $steps) -
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
crawl/segments/*

The index call doesn't exist, so what does this line get replaced
with?  Is there an updated runbot.sh script?  Has anyone created a new
one that will work?  I've done some changes on it, but I just don't
know what to do for this part.

Thanks!

-- Chris


Re: Poor performance on distributed search

2011-12-16 Thread ku3ia
Hi, Erick, thanks for your reply

Yeah, you are right - the document cache is at its default, but I tried decreasing
and increasing the values and didn't get the desired result.

I tried the tests. Here are results:

1) try with rows=10
successfully started at 19:48:34
Queries interval is: 10 queries per minute
http://127.0.0.1:8080/solr/shard1/select/?ident=true&q=(gulping)&rows=10&start=0&fl=*,score&qt=requestShards
...
http://127.0.0.1:8080/solr/shard1/select/?ident=true&q=(tabors)&rows=10&start=0&fl=*,score&qt=requestShards
utility successfully stopped at 19:53:33
Elapsed time: 299 secs
--- solr ---
Queries processed: 50
Queries cancelled: 0
Average QTime is: 764 ms
Average RTime is: 0.68 sec(s)
Size of data-dir is: 235784 bytes

2) try with fl=id (assuming id is your uniqueKey)
successfully started at 19:56:23
Queries interval is: 10 queries per minute
http://127.0.0.1:8080/solr/shard1/select/?ident=true&q=(psyche's)&rows=2000&start=0&fl=RecordID&qt=requestShards
...
http://127.0.0.1:8080/solr/shard1/select/?ident=true&q=(betook)&rows=2000&start=0&fl=RecordID&qt=requestShards
utility successfully stopped at 20:01:24
Elapsed time: 301 secs
--- solr ---
Queries processed: 15
Queries cancelled: 35
Average QTime is: 52775.7 ms
Average RTime is: 53.2667 sec(s)
Size of data-dir is: 212978 bytes

In the first test, disk usage reported by nmon is ~30-40%; in the second, 100%. Drive
read speed starts at 3-5 MB/s and falls to 500-700 KB/s in both tests.

Have you any ideas?



Re: Core overhead

2011-12-16 Thread Ted Dunning
I thought it was slightly clumsy, but it was informative.  It seemed like a
fine thing to say.  Effectively it was "I/we have developed a tool that
will help you solve your problem."  That is responsive to the OP and it is
clear that it is a commercial deal.

On Fri, Dec 16, 2011 at 10:02 AM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 Wow the shameless plugging of product (footer) has hit a new low Otis.

 On Fri, Dec 16, 2011 at 7:32 AM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
  Hi Yury,
 
  Not sure if this was already covered in this thread, but with N smaller
 cores on a single N-CPU-core box you could run N queries in parallel over
 smaller indices, which may be faster than a single query going against a
 single big index, depending on how many concurrent query requests the box
 is handling (i.e. how busy or idle the CPU cores are).
 
  Otis
  
 
  Performance Monitoring SaaS for Solr -
 http://sematext.com/spm/solr-performance-monitoring/index.html
 
 
 
 
  From: Yury Kats yuryk...@yahoo.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, December 15, 2011 12:58 PM
 Subject: Core overhead
 
 Does anybody have an idea, or better yet, measured data,
 to see what the overhead of a core is, both in memory and speed?
 
 For example, what would be the difference between having 1 core
 with 100M documents versus having 10 cores with 10M documents?
 
 
 



Re: updates to runbot.sh script

2011-12-16 Thread Chris Hostetter

: http://wiki.apache.org/nutch/Crawl
: 
: This script no longer works.  See:

If you have a question about something on the nutch wiki, or included in 
the nutch release, i would suggest you email the nutch user list.


-Hoss


Re: updates to runbot.sh script

2011-12-16 Thread Christopher Gross
Ha, sorry Hoss.  Thought I hit user@nutch; Gmail did the replace and I
wasn't paying attention.

-- Chris



On Fri, Dec 16, 2011 at 2:46 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : http://wiki.apache.org/nutch/Crawl
 :
 : This script no longer works.  See:

 If you have a question about something on the nutch wiki, or included in
 the nutch release, i would suggest you email the nutch user list.


 -Hoss


Re: Core overhead

2011-12-16 Thread Jason Rutherglen
Ted,

"...- FREE!" is stupid idiot spam.  It's annoying and not suitable.

On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 I thought it was slightly clumsy, but it was informative.  It seemed like a
 fine thing to say.  Effectively it was I/we have developed a tool that
 will help you solve your problem.  That is responsive to the OP and it is
 clear that it is a commercial deal.

 On Fri, Dec 16, 2011 at 10:02 AM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 Wow the shameless plugging of product (footer) has hit a new low Otis.

 On Fri, Dec 16, 2011 at 7:32 AM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
  Hi Yury,
 
  Not sure if this was already covered in this thread, but with N smaller
 cores on a single N-CPU-core box you could run N queries in parallel over
 smaller indices, which may be faster than a single query going against a
 single big index, depending on how many concurrent query requests the box
 is handling (i.e. how busy or idle the CPU cores are).
 
  Otis
  
 
  Performance Monitoring SaaS for Solr -
 http://sematext.com/spm/solr-performance-monitoring/index.html
 
 
 
 
  From: Yury Kats yuryk...@yahoo.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, December 15, 2011 12:58 PM
 Subject: Core overhead
 
 Does anybody have an idea, or better yet, measured data,
 to see what the overhead of a core is, both in memory and speed?
 
 For example, what would be the difference between having 1 core
 with 100M documents versus having 10 cores with 10M documents?
 
 
 



Re: Core overhead

2011-12-16 Thread Ted Dunning
Sounds like we disagree.

On Fri, Dec 16, 2011 at 11:56 AM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 Ted,

 ...- FREE! is stupid idiot spam.  It's annoying and not suitable.

 On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  I thought it was slightly clumsy, but it was informative.  It seemed
 like a
  fine thing to say.  Effectively it was I/we have developed a tool that
  will help you solve your problem.  That is responsive to the OP and it
 is
  clear that it is a commercial deal.
 
  On Fri, Dec 16, 2011 at 10:02 AM, Jason Rutherglen 
  jason.rutherg...@gmail.com wrote:
 
  Wow the shameless plugging of product (footer) has hit a new low Otis.
 
  On Fri, Dec 16, 2011 at 7:32 AM, Otis Gospodnetic
  otis_gospodne...@yahoo.com wrote:
   Hi Yury,
  
   Not sure if this was already covered in this thread, but with N
 smaller
  cores on a single N-CPU-core box you could run N queries in parallel
 over
  smaller indices, which may be faster than a single query going against a
  single big index, depending on how many concurrent query requests the
 box
  is handling (i.e. how busy or idle the CPU cores are).
  
   Otis
   
  
   Performance Monitoring SaaS for Solr -
  http://sematext.com/spm/solr-performance-monitoring/index.html
  
  
  
  
   From: Yury Kats yuryk...@yahoo.com
  To: solr-user@lucene.apache.org
  Sent: Thursday, December 15, 2011 12:58 PM
  Subject: Core overhead
  
  Does anybody have an idea, or better yet, measured data,
  to see what the overhead of a core is, both in memory and speed?
  
  For example, what would be the difference between having 1 core
  with 100M documents versus having 10 cores with 10M documents?
  
  
  
 



Re: Poor performance on distributed search

2011-12-16 Thread Erick Erickson
OK, so your speed differences are pretty much dependent upon whether you specify
rows=2000 or rows=10, right? Why do you need 2,000 rows?

Or is the root question why there's such a difference when you specify
qt=requestShards? In which case I'm curious to see that request
handler definition...

Best
Erick

On Fri, Dec 16, 2011 at 1:38 PM, ku3ia dem...@gmail.com wrote:
 Hi, Erick, thanks for your reply

 Yeah, you are right - the document cache is at its defaults, but I tried decreasing and
 increasing the values and didn't get the desired result.

 I tried the tests. Here are results:

 Test 1: rows=10
 successfully started at 19:48:34
 Queries interval is: 10 queries per minute
 http://127.0.0.1:8080/solr/shard1/select/?ident=trueq=(gulping)rows=10start=0fl=*,scoreqt=requestShards
 ...
 http://127.0.0.1:8080/solr/shard1/select/?ident=trueq=(tabors)rows=10start=0fl=*,scoreqt=requestShards
 utility successfully stopped at 19:53:33
 Elapsed time: 299 secs
 --- solr ---
 Queries processed: 50
 Queries cancelled: 0
 Average QTime is: 764 ms
 Average RTime is: 0.68 sec(s)
 Size of data-dir is: 235784 bytes

 Test 2: rows=2000, fl=RecordID (the uniqueKey)
 successfully started at 19:56:23
 Queries interval is: 10 queries per minute
 http://127.0.0.1:8080/solr/shard1/select/?ident=trueq=(psyche's)rows=2000start=0fl=RecordIDqt=requestShards
 ...
 http://127.0.0.1:8080/solr/shard1/select/?ident=trueq=(betook)rows=2000start=0fl=RecordIDqt=requestShards
 utility successfully stopped at 20:01:24
 Elapsed time: 301 secs
 --- solr ---
 Queries processed: 15
 Queries cancelled: 35
 Average QTime is: 52775.7 ms
 Average RTime is: 53.2667 sec(s)
 Size of data-dir is: 212978 bytes

 In the first test, disk usage reported by nmon is ~30-40%; in the second it is 100%. Drive
 read speed starts at 3-5 MB/s and falls to 500-700 KB/s in both tests.

 Do you have any ideas?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Poor-performance-on-distributed-search-tp3590028p3592364.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Possible to facet across two indices, or document types in single index?

2011-12-16 Thread Chris Hostetter

: Chris, you replied:
: 
:  : But there is a workaround:
:  : 1) Do a normal query without facets (you only need to request doc ids
:  : at this point)
:  : 2) Collect all the IDs of the documents returned
:  : 3) Do a second query for all fields and facets, adding a filter to
:  : restrict result to those IDs collected in step 2.

FYI: that was actually Erick's suggestion, I just pointed out that #3 
wasn't necessary if you *only* care about the docs on page #1 ... but 
given that in your situation you really need data from two different 
collections, it's a much different problem.

: When the initial search query comes in, I can do 1-3 above as you 
: describe.  I have fewer than 200K documents in the index. Given the 
: generalness of the search terms, let's say I get 7500 document IDs back 
: per 1 and 2.  It sounds like I need to create a filter query which 
: includes all 7500 IDs, and issue the 2nd query (in my case to another 
: core) and have it facet on the additional field(s) I'm interested in.  
: I don't need to return results from this, just get the facet 
: values/counts.

so far so good -- what you are describing is exactly what Join does (or 
specifically: what it was designed to do except for the annoying bug in how 
it parses the query) except that you are choosing to ignore the results 
and only look at the facet counts.

: Step 4 for me is to search the first index again, to obtain the 
: requested number of rows of results, return the appropriate fields, and 
: calculate facets for that content.  I can then merge the facet results 
: of both indexes, and the client is none the wiser.

here's where you've lost me...

how are you going to merge the facet counts from the two cores?  you 
could just lump them all together (fieldA1 and fieldA2 from coreA, 
in a map with fieldB1 and fieldB2 from coreB) but they are counting 
completely different things from completely different cores -- if your main 
result set is from coreA, but you also show these facet counts based on 
the join against coreB, the constraint counts for values from fieldB2 
aren't going to mean much relative to the results you return.

I mean: consider a concrete example of having a books core and an 
authors core - where every book has a field identifying the author by id.

if a user searches for authors who live in oregon, and then you get that 
list of 98 authors, and join them against the books core and facet on 
genre you can return some data like this...

  Genre:
   Biography: 1023
   Romance: 854
   Mystery: 674
   ...

...but that doesn't really tell you anything about the author documents 
you are returning, does it?  you know that some subset of those 98 authors 
wrote a total of 854 romance novels, but is that actually useful in 
some way?  I suspect what you really want is to know the number of 
*authors* who have written books in each of those genres -- and nothing 
you've described so far will get you that.  (once again, we're back to the 
issue of denormalizing)
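
(To make that concrete as a query -- just a sketch, assuming a trunk/4.0-dev 
build where the join parser supports a fromIndex param, and hypothetical 
id/author_id/state/genre field names:

  http://localhost:8983/solr/books/select?q={!join fromIndex=authors from=id to=author_id}state:oregon&rows=0&facet=true&facet.field=genre

...that gives you genre counts over the joined *book* documents -- the 
"854 romance novels" kind of number -- not counts of distinct authors.)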

Setting aside that issue for a moment...

: A couple questions though (aren't there always? :))  Is this very 
: efficient?  Beyond building the string of 7500 IDs within my app, can 
: Solr swallow that okay?  I'm using SolrJ, javabin format, so hopefully 
: there is not a URL length issue (between my app and Solr)?  I'm guessing 
: javabin uses HTTP POST.

efficient is vague... it can be done, but there's a lot of data going 
over the wire.  it would probably be more efficient to do this server side in 
a custom request handler (similar to how Join works)
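
(For reference, a rough client-side sketch of that two-pass approach in 
SolrJ 3.x -- the core URLs and the join_id/someFacetField names are made up, 
and as noted above a server-side component would avoid shipping all those 
IDs over the wire:

// Untested sketch of the client-side "two pass" workaround described above.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TwoPassFacetSketch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer coreA = new CommonsHttpSolrServer("http://localhost:8983/solr/coreA");
    CommonsHttpSolrServer coreB = new CommonsHttpSolrServer("http://localhost:8983/solr/coreB");

    // pass 1: fetch only the ids of the matching docs from coreA
    SolrQuery q1 = new SolrQuery("fieldA1:something");
    q1.addField("id");
    q1.setRows(7500);
    QueryResponse r1 = coreA.query(q1, SolrRequest.METHOD.POST);

    // build one big OR filter from the collected ids
    // (string ids would need quoting/escaping)
    StringBuilder fq = new StringBuilder("join_id:(");
    boolean first = true;
    for (SolrDocument doc : r1.getResults()) {
      if (!first) fq.append(" OR ");
      fq.append(doc.getFieldValue("id"));
      first = false;
    }
    fq.append(")");

    // pass 2: facet-only request against coreB restricted to those ids
    SolrQuery q2 = new SolrQuery("*:*");
    q2.setRows(0);
    q2.addFilterQuery(fq.toString());
    q2.setFacet(true);
    q2.addFacetField("someFacetField");
    QueryResponse r2 = coreB.query(q2, SolrRequest.METHOD.POST);
    System.out.println(r2.getFacetField("someFacetField").getValues());
  }
}

A filter with several thousand OR clauses may also require raising 
maxBooleanClauses in solrconfig.xml.)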

: What is a reasonable way for the facets derived from the 2nd index to be 
: used for narrowing like those in the main content index? That is, 
: pinning down facet values from the second index is not going to affect 
: the results (document IDs) from searching the first index.  Perhaps that 

Now we're back to the problem I mentioned before, except you're 
describing it at the moment when a person attempts to filter on a facet 
constraint -- but as I've pointed out, you already have to deal with this 
just to generate the list of facet constraints and their counts.


-Hoss


Re: SolrCloud Cores

2011-12-16 Thread Mark Miller
On Fri, Dec 16, 2011 at 8:14 AM, Jamie Johnson jej2...@gmail.com wrote:

 What is the most appropriate way to configure Solr when deploying in a
 cloud environment?  Should the core name on all instances be the
 collection name or is it more appropriate that each shard be a
 separate core, or should each solr instance be a separate core (i.e.
 master1, master1-replica are 2 separate cores)?


At this point, it's probably best/easiest to name them after the collection.
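
(For example -- just a sketch, not official guidance -- each node's solr.xml 
could simply declare a core named after the collection, say "collection1":

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="collection1" instanceDir="collection1" />
    </cores>
  </solr>
)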


-- 
- Mark

http://www.lucidimagination.com


Re: Core overhead

2011-12-16 Thread Jason Rutherglen
Ted,

The list would be unreadable if everyone spammed at the bottom of their
email like Otis'.  It's just bad form.

Jason

On Fri, Dec 16, 2011 at 12:00 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Sounds like we disagree.

 On Fri, Dec 16, 2011 at 11:56 AM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 Ted,

 ...- FREE! is stupid idiot spam.  It's annoying and not suitable.

 On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  I thought it was slightly clumsy, but it was informative.  It seemed
 like a
  fine thing to say.  Effectively it was I/we have developed a tool that
  will help you solve your problem.  That is responsive to the OP and it
 is
  clear that it is a commercial deal.
 
  On Fri, Dec 16, 2011 at 10:02 AM, Jason Rutherglen 
  jason.rutherg...@gmail.com wrote:
 
  Wow the shameless plugging of product (footer) has hit a new low Otis.
 
  On Fri, Dec 16, 2011 at 7:32 AM, Otis Gospodnetic
  otis_gospodne...@yahoo.com wrote:
   Hi Yury,
  
   Not sure if this was already covered in this thread, but with N
 smaller
  cores on a single N-CPU-core box you could run N queries in parallel
 over
  smaller indices, which may be faster than a single query going against a
  single big index, depending on how many concurrent query requests the
 box
  is handling (i.e. how busy or idle the CPU cores are).
  
   Otis
   
  
   Performance Monitoring SaaS for Solr -
  http://sematext.com/spm/solr-performance-monitoring/index.html
  
  
  
  
   From: Yury Kats yuryk...@yahoo.com
  To: solr-user@lucene.apache.org
  Sent: Thursday, December 15, 2011 12:58 PM
  Subject: Core overhead
  
  Does anybody have an idea, or better yet, measured data,
  to see what the overhead of a core is, both in memory and speed?
  
  For example, what would be the difference between having 1 core
  with 100M documents versus having 10 cores with 10M documents?
  
  
  
 



Re: Core overhead

2011-12-16 Thread Ted Dunning
We still disagree.

On Fri, Dec 16, 2011 at 12:29 PM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 Ted,

 The list would be unreadable if everyone spammed at the bottom their
 email like Otis'.  It's just bad form.

 Jason

 On Fri, Dec 16, 2011 at 12:00 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  Sounds like we disagree.
 
  On Fri, Dec 16, 2011 at 11:56 AM, Jason Rutherglen 
  jason.rutherg...@gmail.com wrote:
 
  Ted,
 
  ...- FREE! is stupid idiot spam.  It's annoying and not suitable.
 
  On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
   I thought it was slightly clumsy, but it was informative.  It seemed
  like a
   fine thing to say.  Effectively it was I/we have developed a tool
 that
   will help you solve your problem.  That is responsive to the OP and
 it
  is
   clear that it is a commercial deal.
  
   On Fri, Dec 16, 2011 at 10:02 AM, Jason Rutherglen 
   jason.rutherg...@gmail.com wrote:
  
   Wow the shameless plugging of product (footer) has hit a new low
 Otis.
  
   On Fri, Dec 16, 2011 at 7:32 AM, Otis Gospodnetic
   otis_gospodne...@yahoo.com wrote:
Hi Yury,
   
Not sure if this was already covered in this thread, but with N
  smaller
   cores on a single N-CPU-core box you could run N queries in parallel
  over
   smaller indices, which may be faster than a single query going
 against a
   single big index, depending on how many concurrent query requests the
  box
   is handling (i.e. how busy or idle the CPU cores are).
   
Otis

   
Performance Monitoring SaaS for Solr -
   http://sematext.com/spm/solr-performance-monitoring/index.html
   
   
   
   
From: Yury Kats yuryk...@yahoo.com
   To: solr-user@lucene.apache.org
   Sent: Thursday, December 15, 2011 12:58 PM
   Subject: Core overhead
   
   Does anybody have an idea, or better yet, measured data,
   to see what the overhead of a core is, both in memory and speed?
   
   For example, what would be the difference between having 1 core
   with 100M documents versus having 10 cores with 10M documents?
   
   
   
  
 



Re: Poor performance on distributed search

2011-12-16 Thread ku3ia
 OK, so your speed differences are pretty much dependent upon whether you
specify 
 rows=2000 or rows=10, right? Why do you need 2,000 rows? 
Yes, the big difference is 10 vs. 2K records. The limit of 2K rows was set by
my manager and I can't decrease it. It is the minimum row count needed to process
the data.

 Or is the root question why there's such a difference when you specify 
 qt=requestShards? In which case I'm curious to see that request 
 handler definition... 

  <requestHandler name="requestShards" class="solr.SearchHandler" default="false">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="shards">127.0.0.1:8080/solr/shard1,127.0.0.1:8080/solr/shard2,127.0.0.1:8080/solr/shard3,127.0.0.1:8080/solr/shard4</str>
    </lst>
  </requestHandler>

This request handler is defined in shard1's solrconfig.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Poor-performance-on-distributed-search-tp3590028p3592734.html
Sent from the Solr - User mailing list archive at Nabble.com.


Retrieving Documents

2011-12-16 Thread Dan McGinn-Combs
I've been doing a fair amount of reading and experimenting with Solr
lately. I find that it does a good job of indexing very structured
documents. However, the application I have in mind is built around
long EPUB documents.

Of course, I found the Extract components useful for indexing the
EPUBs. However, I would like to be able to

* Size the highlighted portion of text around the query terms
(i.e. show 20 or 30 words), and

* Retrieve a location within the document so I can display that page
from the EPUB.

What is common practice for these? I notice that if I have a list of
(short) text segments in fields, they are stored without too much fuss
and are retrievable. However, I'm talking about a field of potentially
hundreds of words.

Thanks for any pointers,
Dan

-- 
Dan McGinn-Combs
dgco...@gmail.com
Peachtree City, Georgia USA


Call RequestHandler from QueryComponent

2011-12-16 Thread Vazquez, Maria (STM)
Hi!

I have a solrconfig.xml like:

 <requestHandler name="/ABC" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="echoParams">all</str>
     <int name="start">0</int>
     <int name="rows">10</int>
     <str name="wt">ABC</str>
     <str name="sort">score desc,rating asc</str>
     <str name="fq">CUSTOM FQ</str>
     <str name="version">2.2</str>
     <str name="fl">CUSTOM FL</str>
   </lst>
   <arr name="components">
     <str>validate</str>
     <str>CUSTOM ABC QUERY COMPONENT</str>
     <str>stats</str>
     <str>debug</str>
   </arr>
 </requestHandler>

 <requestHandler name="/XYZ" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="echoParams">all</str>
     <int name="start">0</int>
     <int name="rows">1</int>
     <str name="wt">XYZ</str>
     <str name="sort">score desc</str>
     <str name="fl">CUSTOM FL</str>
     <str name="version">2.2</str>

     <str name="defType">edismax</str>
     <float name="tie">1</float>
     <str name="qf">CUSTOM QF</str>
     <str name="qs">0</str>
     <str name="mm">1</str>
     <str name="q.alt">*:*</str>
   </lst>
   <arr name="components">
     <str>validate</str>
     <str>CUSTOM XYZ QUERY COMPONENT</str>
     <str>stats</str>
     <str>debug</str>
   </arr>
 </requestHandler>

In ABC QUERY COMPONENT, I customize prepare() and process(). In its process() I 
want to call the /XYZ request handler and include those results in the results 
for ABC. Is that possible?
I know the org.apache.solr.spelling.SpellCheckCollator calls a QueryComponent 
and invokes prepare and process on it, but I want to invoke the request handler 
directly. It’d be silly to use SolrJ since both handlers are in the same core.

Any suggestions?

Thanks!
Maria


Re: r1201855 broke stats.facet on long fields

2011-12-16 Thread Chris Hostetter

Wow ... either I'm a huge idiot and everyone has just been really polite 
about it in most threads, or something about this thread in particular 
made me really stupid.

(Luis: I'm sorry for all the things I have said so far in this email 
thread that were a complete waste of your time - hopefully this email 
will make up for it)

Idiocy #1...

:   Solr can
:  not reasonably compute stats on a multivalued field
: 
: Wasn't that added here?
: https://issues.apache.org/jira/browse/SOLR-1380

Yes, correct.  I didn't realize that functionality had ever been added, 
but it was and it does still work just fine in Solr 3.5 (you can see this 
in any of the StatsComponentTest methods that call 
doTestMVFieldStatisticsResult)

Idiocy #2...

 Subject : Re: r1201855 broke stats.facet on long fields

...in spite of this subject, and multiple references to stats.facet in 
Luis's original email, I completely overlooked the entire crux of Luis's 
problem.  I thought the issue was that he couldn't get *stats* on a 
multi-valued field; I didn't realize that it was the stats.facet param 
that had started failing for him in Solr 3.5

I believe that the intention of the code Luis quoted, which was committed 
as part of SOLR-1023 in r1201855, was actually to pre-emptively avoid the 
problems mentioned in SOLR-1782 (which Luis actually mentioned, and I 
*still* didn't realize this was about stats.facet - Idiocy #3) ...

  if (facetFieldType.isTokenized() || facetFieldType.isMultiValued()) {
 throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,

...given the way the stats faceting code works, that sanity check does 
make sense, and seems like a good idea.  but the crux of the issue in 
Luis's case... 

 <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
  omitNorms="true" positionIncrementGap="0" />
 <field name="ts" type="long" indexed="true" stored="true"
  required="true" />

...seems to be that it's the isTokenized() test that's failing (and *not* the 
isMultiValued() check that I immediately assumed - Idiocy #4), because 
TrieField.isTokenized() is hardcoded to return true.

I *think* TrieField.isTokenized should be changed to depend on the value 
of the precisionStep, but I'm not sure what all the ramifications of 
that are just yet -- but I've opened SOLR-2976 to look into it.
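
(In the meantime, one possible workaround -- an untested sketch, assuming 
your schema has the stock "string" (solr.StrField) type and using a made-up 
ts_str field -- is to copy the long value into a plain string field and 
point stats.facet at that instead:

  <field name="ts_str" type="string" indexed="true" stored="false"/>
  <copyField source="ts" dest="ts_str"/>

...and then query with something like 
stats=true&stats.field=price&stats.facet=ts_str, where "price" stands in 
for whichever field you're actually computing stats on.)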


-Hoss


Re: Call RequestHandler from QueryComponent

2011-12-16 Thread Chris Hostetter

Maria: sending the same email 4 times in less than 48 hours isn't really a 
good way to encourage people to help you -- it just means more total mail 
people have to wade through, which slows them down and makes them less 
likely to want to help.

: In ABC QUERY COMPONENT, I customize prepare() and process(). In its
: process() I want to call the /XYZ request handler and include those results
: in the results for ABC. Is that possible?

Certainly -- you can execute any Java code you want in a custom component. 
Take a look at how SolrDispatchFilter executes the original request on the 
SolrCore; you can do something similar in your custom component (but 
you'll want to use a LocalSolrQueryRequest that you populate with params 
-- see the TestHarness for an example) and then take whatever data you 
want out of the inner SolrQueryResponse you get back and add it directly 
to the outer SolrQueryResponse.

One thing you might have to watch out for is ensuring that the same 
SolrIndexSearcher used in the outer request is also the one used in the 
inner request -- the consistency is crucial to ensuring any DocList 
you copy is meaningful -- but I'm not sure if you can do that easily with 
LocalSolrQueryRequest; you might need to tweak it.
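
Something along these lines is probably the shape of it -- a rough, untested 
sketch against the 3.x APIs, with the handler name taken from your config 
and the "xyzResults" response key made up:

// Rough sketch of a custom component that invokes the /XYZ handler in-process.
// Untested; error handling, distributed search and the searcher-consistency
// caveat above are glossed over.
import java.io.IOException;

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrRequestHandler;
import org.apache.solr.response.SolrQueryResponse;

public class AbcQueryComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // normal prepare() work for the ABC component goes here
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    SolrCore core = rb.req.getCore();
    SolrRequestHandler xyz = core.getRequestHandler("/XYZ");

    // build the params for the inner request (reusing the outer q here)
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", rb.req.getParams().get("q", "*:*"));

    SolrQueryRequest innerReq = new LocalSolrQueryRequest(core, params);
    try {
      SolrQueryResponse innerRsp = new SolrQueryResponse();
      core.execute(xyz, innerReq, innerRsp);
      // copy whatever you need out of the inner response into the outer one
      rb.rsp.add("xyzResults", innerRsp.getValues().get("response"));
    } finally {
      innerReq.close();
    }
  }

  public String getDescription() { return "ABC component that also runs /XYZ"; }
  public String getSource() { return null; }
  public String getSourceId() { return null; }
  public String getVersion() { return "1.0"; }
}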

-Hoss


Re: NRT or similar for Solr 3.5?

2011-12-16 Thread Steven Ou
Hey Vikram,

I finally got around to getting Solr-RA installed but I'm having trouble
getting the NRT to work. Could you help me out?

I added these four lines immediately after <config> in solrconfig.xml:

  <realtime visible="200">true</realtime>

  <library>rankingalgorithm</library>

  <realtime visible="200" facet="true">true</realtime>

  <library>rankingalgorithm</library>

Is that correct? I also read something about disabling caching, so I took
out the queryResultCache. Is that right?

What else do I need to do to get NRT working? Do I need to switch some
engine to Solr-RA? If so, how do I do that? Are there other caches I need
to disable?

Any help appreciated. Thanks.
--
Steven Ou | 歐偉凡

*ravn.com* | Chief Technology Officer
steve...@gmail.com | +1 909-569-9880


2011/12/12 vikram kamath kmar...@gmail.com

 @Steven .. try some alternate email address (besides Google/Yahoo) and
 check your spam.


 Regards
 Vikram Kamath



 2011/12/13 Steven Ou steve...@gmail.com

  Yeah, running Chrome on OSX and doesn't do anything.
 
  Just switched to Firefox and it works. *But*, also don't seem to be
  receiving confirmation email.
  --
  Steven Ou | 歐偉凡
 
  *ravn.com* | Chief Technology Officer
  steve...@gmail.com | +1 909-569-9880
 
 
  2011/12/12 vikram kamath kmar...@gmail.com
 
   The Onclick handler does not seem to be called on google chrome (Ubuntu
  ).
  
   Also , I dont seem to receive the email with the confirmation link on
   registering (I have checked my spam)
  
  
  
  
   Regards
   Vikram Kamath
  
  
  
   2011/12/12 Nagendra Nagarajayya nnagaraja...@transaxtions.com
  
Steven:
   
There is an onclick handler that allows you to download the src. BTW,
  an
early access Solr 3.5 with RankingAlgorithm 1.3 (NRT) release is
available for download. So please give it a try.
   
Regards,
   
- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org
   
   
On 12/10/2011 11:18 PM, Steven Ou wrote:
 All the links on the download section link to
   http://solr-ra.tgels.org/#
 --
 Steven Ou | 歐偉凡

 *ravn.com* | Chief Technology Officer
 steve...@gmail.com | +1 909-569-9880


 2011/12/11 Nagendra Nagarajayya nnagaraja...@transaxtions.com

 Steven:

 Not sure why you had problems, #downloads (
 http://solr-ra.tgels.org/#downloads ) should point you to the
   downloads
 section showing the different versions available for download ?
  Please
 share if this is not so ( there were downloads yesterday with no
problems )

 Regarding NRT, you can switch between RA and Lucene at query level
  or
   at
 config level; in the current version with RA, NRT is in effect
 while
 with lucene, it is not, you can get more information from here:

 http://solr-ra.tgels.org/papers/Solr34_with_RankingAlgorithm13.pdf

 Solr 3.5 with RankingAlgorithm 1.3 should be available next week.

 Regards,

 - Nagendra Nagarajayya
 http://solr-ra.tgels.org
 http://rankingalgorithm.tgels.org

 On 12/9/2011 4:49 PM, Steven Ou wrote:
 Hey Nagendra,

 I took a look and Solr-RA looks promising - but:

- I could not figure out how to download it. It seems like all
  the
download links just point to #
- I wasn't looking for another ranking algorithm, so would it
 be
possible for me to use NRT but *not* RA (i.e. just use the
  normal
 Lucene
library)?

 --
 Steven Ou | 歐偉凡

 *ravn.com* | Chief Technology Officer
 steve...@gmail.com | +1 909-569-9880


 On Sat, Dec 10, 2011 at 5:13 AM, Nagendra Nagarajayya 
 nnagaraja...@transaxtions.com wrote:

 Steven:

 Please take a look at Solr  with RankingAlgorithm. It offers NRT
 functionality. You can set your autoCommit to about 15 mins. You
  can
get
 more information from here:

  http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search_ver_3.x

 Regards,

 - Nagendra Nagarajayya
 http://solr-ra.tgels.org
 http://rankingalgorithm.tgels.org


 On 12/8/2011 9:30 PM, Steven Ou wrote:

 Hi guys,

 I'm looking for NRT functionality or similar in Solr 3.5. Is
 that
 possible?
 From what I understand there's NRT in Solr 4, but I can't
 figure
   out
 whether or not 3.5 can do it as well?

 If not, is it feasible to use an autoCommit every 1000ms? We
  don't
 currently process *that* much data so I wonder if it's OK to
 just
 commit
 very often? Obviously not scalable on a large scale, but it is
feasible
 for
 

Re: Core overhead

2011-12-16 Thread Chris Hostetter

: The list would be unreadable if everyone spammed at the bottom their
: email like Otis'.  It's just bad form.

If you'd like to debate project policy on what is/isn't acceptable on any 
of the Lucene mailing lists, please start a new thread on general@lucene 
(the list that exists precisely for the purpose of discussing meta-issues 
related to the Project/Community) instead of spamming the substantial 
solr-user@lucene subscriber base, who probably subscribed to this list 
because they were interested in getting emails about using Solr, not 
debating email etiquette.



-Hoss


Call RequestHandler from QueryComponent

2011-12-16 Thread marita
Hi!

I have a solrconfig.xml like:

 <requestHandler name="/ABC" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="echoParams">all</str>
     <int name="start">0</int>
     <int name="rows">10</int>
     <str name="wt">ABC</str>
     <str name="sort">score desc,rating asc</str>
     <str name="fq">CUSTOM FQ</str>
     <str name="version">2.2</str>
     <str name="fl">CUSTOM FL</str>
   </lst>
   <arr name="components">
     <str>validate</str>
     <str>CUSTOM ABC QUERY COMPONENT</str>
     <str>stats</str>
     <str>debug</str>
   </arr>
 </requestHandler>

 <requestHandler name="/XYZ" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="echoParams">all</str>
     <int name="start">0</int>
     <int name="rows">1</int>
     <str name="wt">XYZ</str>
     <str name="sort">score desc</str>
     <str name="fl">CUSTOM FL</str>
     <str name="version">2.2</str>

     <str name="defType">edismax</str>
     <float name="tie">1</float>
     <str name="qf">CUSTOM QF</str>
     <str name="qs">0</str>
     <str name="mm">1</str>
     <str name="q.alt">*:*</str>
   </lst>
   <arr name="components">
     <str>validate</str>
     <str>CUSTOM XYZ QUERY COMPONENT</str>
     <str>stats</str>
     <str>debug</str>
   </arr>
 </requestHandler>

In ABC QUERY COMPONENT, I customize prepare() and process(). In its
process() I want to call the /XYZ request handler and include those results
in the results for ABC. Is that possible?
I know the org.apache.solr.spelling.SpellCheckCollator calls a
QueryComponent and invokes prepare and process on it, but I want to invoke
the request handler directly. It’d be silly to use SolrJ since both
handlers are in the same core.

Any suggestions?

Thanks!
Maria


Re: how to setup to archive expired documents?

2011-12-16 Thread Chris Hostetter

: So if we use some sort of weekly or daily sharding, there needs to be 
: some mechanism in place to dynamically add the new shard when the 
: current one fills up.  (Which would also ideally know where to put the 
: new shards on what server, etc.) Since SOLR does not implement that I 
: was thinking of just having a static set of shards.

You may want to consider taking a look at the ongoing work on improving 
Solr Cloud -- particularly the distributed indexing and shard failure 
logic.

My understanding (from past discussions, this may have changed w/o me 
realizing it) is that doc-shard mapping will nominally be a simple hash 
function on the uniqueKey, but that a plugin could customize that so that 
documents are sharded by date -- and then when you only need to query 
recent docs you could query those shards explicitly.  (no idea if things 
are stable enough yet for such a plugin to be written -- but the sooner 
someone tries to tackle it to solve their use case, the sooner people 
will be confident that the API is stable)

https://wiki.apache.org/solr/SolrCloud
https://issues.apache.org/jira/browse/SOLR-2358
https://svn.apache.org/viewvc/lucene/dev/branches/solrcloud/
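
(Purely to illustrate the kind of date-based mapping such a plugin might 
implement -- this is a standalone sketch, not an existing Solr API:

// Standalone illustration only: map a document timestamp to a weekly shard
// name such as "shard-2011-w51".  A real plugin would hook this logic into
// whatever doc->shard mapping extension point SolrCloud ends up exposing.
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateShardRouter {
  public static String shardFor(Date timestamp) {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-'w'ww", Locale.US);
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    return "shard-" + fmt.format(timestamp);
  }

  public static void main(String[] args) {
    System.out.println(shardFor(new Date())); // e.g. shard-2011-w51
  }
}
)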

-Hoss


Re: Poor performance on distributed search

2011-12-16 Thread Erick Erickson
Right, are you falling afoul of the recursive shard thing? That is,
if your shards parameter points back to itself. As far as I understand, the
shards parameter in your request handler shouldn't point back
to itself.

But I'm guessing here.
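
(If that is what's happening, one common arrangement -- sketch only, handler 
name made up -- is to put the aggregating handler on a core that isn't one 
of the data shards, e.g. a thin "aggregator" core whose handler just fans 
out to all four shards:

  <requestHandler name="/distrib" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="shards">127.0.0.1:8080/solr/shard1,127.0.0.1:8080/solr/shard2,127.0.0.1:8080/solr/shard3,127.0.0.1:8080/solr/shard4</str>
    </lst>
  </requestHandler>

...so that no shard's own config ever has to reference itself.)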

Best
Erick

On Fri, Dec 16, 2011 at 4:27 PM, ku3ia dem...@gmail.com wrote:
 OK, so your speed differences are pretty much dependent upon whether you
 specify
 rows=2000 or rows=10, right? Why do you need 2,000 rows?
 Yes, big difference is 10 v. 2K records. Limit of 2K rows is setted by
 manager and I can't decrease it. It is a minimum row count needed to process
 data.

 Or is the root question why there's such a difference when you specify
 qt=requestShards? In which case I'm curious to see that request
 handler definition...

  <requestHandler name="requestShards" class="solr.SearchHandler" default="false">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="shards">127.0.0.1:8080/solr/shard1,127.0.0.1:8080/solr/shard2,127.0.0.1:8080/solr/shard3,127.0.0.1:8080/solr/shard4</str>
     </lst>
   </requestHandler>

 This request handler is defined at shard1's solrconfig.

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Poor-performance-on-distributed-search-tp3590028p3592734.html
 Sent from the Solr - User mailing list archive at Nabble.com.


RE: Call RequestHandler from QueryComponent

2011-12-16 Thread Vazquez, Maria (STM)
I am very very sorry. My mail client was not working from work and it looked 
like it was not being delivered, that's why I tried a few times. Sorry 
everybody!

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Friday, December 16, 2011 3:23 PM
To: solr-user@lucene.apache.org
Subject: Re: Call RequestHandler from QueryComponent


Maria: sending the same email 4 times in less the 48 hours isn't really a 
good way to encourange people to help you -- it just means more total mail 
people have to wade thorugh which slows them down and makes them less 
likeely to want to help.

: In ABC QUERY COMPONENT, I customize prepare() and process(). In its
: process() I want to call the /XYZ request handler and include those results
: in the results for ABC. Is that possible?

certianly -- you can execute any java code you wnat in a custom component, 
take a look at how SolrDispatchFilter exeuts the original request on the 
SolrCore, you can do something similar in your custom component (but 
you'll want to use a LocalSolrQueryRequest that you populate with params 
-- see the TestHarness for an example) and then take whatever data you 
want out of the inner SolrQueryResponse you get back and add it directly 
to the outer SolrQueryResponse.

One thing you might have to watch out for is ensuring that the same 
SolrIndexSearcher used in the outer request is also the one used in the 
inner request -- the consistency is crucial to ensuring any DocList 
you copy is meaninful -- but i'm not sure if you can do that easily with 
LocalSolrQueryRequest, you might need to tweak it.

-Hoss


Looking for a good Text on Solr

2011-12-16 Thread Shiv Deepak
I am looking for a good book to read from and get a better understanding of 
solr.

 On Amazon, all the books on Solr have average ratings (which I suppose means no one 
tried them or bothered to post a review), but this one: Solr 1.4 Enterprise 
Search Server by David Smiley and Eric Pugh has a pretty decent review. But the 
current version of Solr is 3.5, so should I proceed with David Smiley's book, or 
is there a better text available?

Thanks,
Shiv Deepak

Re: Looking for a good Text on Solr

2011-12-16 Thread Brendan Grainger
There is an update to that book for Solr 3:

http://www.packtpub.com/apache-solr-3-enterprise-search-server/book

I actually bought it recently, but haven't looked at it yet.

Good luck.
Brendan

On Dec 16, 2011, at 9:01 PM, Shiv Deepak wrote:

 I am looking for a good book to read from and get a better understanding of 
 solr.
 
 On amazon, all the books on Solr have average rating (which I supposed no one 
 tried them or bothered to post a review) but this one: Solr 1.4 Enterprise 
 Search Server by David Smiley, Eric Pugh has a pretty decent review. But the 
 current version of Solr is 3.5, so should I proceed with David Smiley's book 
 or is there a better text available.
 
 Thanks,
 Shiv Deepak



Re: Looking for a good Text on Solr

2011-12-16 Thread Hector Castro
Hi Shiv, 

For me, a combination of the following has helped me learn a lot about Solr in 
a short period of time:

* Apache Solr 3 Enterprise Search Server: 
http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
* Solr Wiki: http://wiki.apache.org/solr/
* Pretty much every single post on this blog: 
http://www.hathitrust.org/blogs/large-scale-search

Hope this helps,

-- 
Hector


On Friday, December 16, 2011 at 9:01 PM, Shiv Deepak wrote:

 I am looking for a good book to read from and get a better understanding of 
 solr.
 
 On amazon, all the books on Solr have average rating (which I supposed no one 
 tried them or bothered to post a review) but this one: Solr 1.4 Enterprise 
 Search Server by David Smiley, Eric Pugh has a pretty decent review. But the 
 current version of Solr is 3.5, so should I proceed with David Smiley's book 
 or is there a better text available.
 
 Thanks,
 Shiv Deepak
 
 




Re: Looking for a good Text on Solr

2011-12-16 Thread Shiv Deepak
Hey Brendan, Hey Hector,

 That was very helpful. :)

Thanks,
Shiv Deepak

On 17-Dec-2011, at 07:52 , Hector Castro wrote:

 Hi Shiv, 
 
 For me, a combination of the following has helped me learn a lot about Solr 
 in a short period of time:
 
 * Apache Solr 3 Enterprise Search Server: 
 http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
 * Solr Wiki: http://wiki.apache.org/solr/
 * Pretty much every single post on this blog: 
 http://www.hathitrust.org/blogs/large-scale-search
 
 Hope this helps,
 
 -- 
 Hector
 
 
 On Friday, December 16, 2011 at 9:01 PM, Shiv Deepak wrote:
 
 I am looking for a good book to read from and get a better understanding of 
 solr.
 
 On amazon, all the books on Solr have average rating (which I supposed no 
 one tried them or bothered to post a review) but this one: Solr 1.4 
 Enterprise Search Server by David Smiley, Eric Pugh has a pretty decent 
 review. But the current version of Solr is 3.5, so should I proceed with 
 David Smiley's book or is there a better text available.
 
 Thanks,
 Shiv Deepak
 
 
 
 



Re: Retrieving Documents

2011-12-16 Thread Otis Gospodnetic
Hi Dan,

1) Are you looking 
for http://wiki.apache.org/solr/HighlightingParameters#hl.fragsize ?

2) Hundreds of words in a field should not be a problem for highlighting.  But 
it sounds like this long field may contain content that corresponds to N 
different pages in a publication and you would like to inform the searcher 
which page the match was on, and not just that a match was somewhere in that 
big piece of text.  One way to deal with that is to break your document into N 
smaller documents - one document for each page.
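
(For instance -- the "body" field name here is hypothetical -- a request 
along the lines of:

  http://localhost:8983/solr/select?q=body:whale&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=200

should do it; hl.fragsize is measured in characters, so ~200 characters is 
roughly the 20-30 words you mentioned.)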

Otis


Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Dan McGinn-Combs dgco...@gmail.com
To: solr-user@lucene.apache.org 
Sent: Friday, December 16, 2011 4:33 PM
Subject: Retrieving Documents
 
I've been doing a fair amount of reading and experimenting with Solr
lately. I find that it does a good job of indexing very structured
documents. However, the application I have in mind is build around
long EPUB documents.

Of course, I found the Extract components useful for indexing the
EPUBs. However, I would like to be able to

* Size the highlight portion of text around the query parameters
(i.e. show 20 or 30 words) and

* Retrieve a location within the document so I can display that page
from the EPUB.

What is common practice for these? I notice that if I have a list of
(short) text segments in fields, they are stored without too much fuss
and are retrievable. However, I'm talking about a field of potentially
hundreds of words.

Thanks for any pointers,
Dan

-- 
Dan McGinn-Combs
dgco...@gmail.com
Peachtree City, Georgia USA