RE: How to query for similar documents before indexing

2010-05-11 Thread Matthieu Labour
Hi Markus

Thank you for your answer

Here is a use case where I think it would be nice to know there is a dup before 
I insert it. 

Let's say I create a summary out of the document and I only index the summary, 
storing the document itself on a separate device (S3, Cassandra, etc.). 
Then I would need addDocument on the summary to fail because it detected a 
duplicate, so that I don't need to store the document.
When you write:
On the other hand, you can also have a manual process that finds
duplicates based on that signature and gather that information yourself
as long as such a feature isn't there.

Can you explain more about what you have in mind?

Thank you for your help!

matt

--- On Mon, 5/10/10, Markus Jelsma markus.jel...@buyways.nl wrote:

From: Markus Jelsma markus.jel...@buyways.nl
Subject: RE: How to query for similar documents before indexing
To: solr-user@lucene.apache.org
Date: Monday, May 10, 2010, 5:07 PM

Hi Matthieu,

On the top of the wiki page you can see it's in 1.4 already. As far as I know 
the API doesn't return information on found duplicates in its response header; 
the wiki isn't clear on that subject. I, at least, never saw any other response 
than an error or the usual status code and QTime.
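
For illustration, a successful add typically returns only something like this 
(a representative response; note there is nothing in it about duplicates):

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">12</int>
  </lst>
</response>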

Perhaps it would be a nice feature. On the other hand, you can also have a 
manual process that finds duplicates based on that signature and gathers that 
information yourself as long as such a feature isn't there.
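
One way to read that suggestion, as a minimal SolrJ sketch: facet on the field 
the dedupe chain writes its hash into and treat any value occurring more than 
once as a duplicate group. The field name "signature" and the URL here are 
illustrative assumptions, not prescribed by Solr:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FindDuplicateSignatures {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Facet on the signature field written by the dedupe chain;
    // any value with a count above 1 marks a group of duplicates.
    SolrQuery q = new SolrQuery("*:*")
        .setRows(0)
        .setFacet(true)
        .addFacetField("signature")
        .setFacetMinCount(2);
    QueryResponse rsp = server.query(q);
    for (FacetField.Count c : rsp.getFacetField("signature").getValues()) {
      System.out.println("signature " + c.getName() + " occurs "
          + c.getCount() + " times");
    }
  }
}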

Cheers,


 
-Original message-
From: Matthieu Labour matthieu_lab...@yahoo.com
Sent: Mon 10-05-2010 23:30
To: solr-user@lucene.apache.org; 
Subject: RE: How to query for similar documents before indexing

Markus
Thank you for your response
That would be great if the index has the option to prevent duplicates from 
entering the index. But is it going to be a silent action? Or will the add 
method return that it failed indexing because it detected a duplicate?
Is it committed to 1.4 already?
Cheers
matt


--- On Mon, 5/10/10, Markus Jelsma markus.jel...@buyways.nl wrote:

From: Markus Jelsma markus.jel...@buyways.nl
Subject: RE: How to query for similar documents before indexing
To: solr-user@lucene.apache.org
Date: Monday, May 10, 2010, 4:11 PM

Hi,

Deduplication [1] is what you're looking for. It can utilize different analyzers 
that will add one or more signatures or hashes to your document depending on 
exact or partial matches for configurable fields. Based on that, it should be 
able to prevent new documents from entering the index. 

The first part works very well but I have some issues with removing those 
documents, on which I also need to check with the community tomorrow back at 
work ;-)

[1]: http://wiki.apache.org/solr/Deduplication
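
For reference, a minimal sketch of the dedupe update processor chain from that 
wiki page; the signature field name and the fields list are illustrative:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

With overwriteDupes set to true, a newer document replaces older ones carrying 
the same signature instead of piling up alongside them.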

Cheers,


 
-Original message-
From: Matthieu Labour matthieu_lab...@yahoo.com
Sent: Mon 10-05-2010 22:41
To: solr-user@lucene.apache.org; 
Subject: How to query for similar documents before indexing

Hi

I want to implement the following logic:

Before I index a new document into the index, I want to check if there are 
already documents in the index with similar content to the content of the 
document about to be inserted. If the request returns 1 or more documents, then 
I don't want to insert the document.

What is the best way to achieve the above functionality?

I read about fuzzy searches in Lucene. But can I really build a request such as 
mydoc.title:wordexample~ AND mydoc.content:(all the content words)~0.9?

Thank you for your help


How to query for similar documents before indexing

2010-05-10 Thread Matthieu Labour
Hi

I want to implement the following logic:

Before I index a new document into the index, I want to check if there are 
already documents in the index with similar content to the content of the 
document about to be inserted. If the request returns 1 or more documents, then 
I don't want to insert the document.

What is the best way to achieve the above functionality?

I read about fuzzy searches in Lucene. But can I really build a request such as 
mydoc.title:wordexample~ AND mydoc.content:(all the content words)~0.9?

Thank you for your help


RE: How to query for similar documents before indexing

2010-05-10 Thread Matthieu Labour
Markus
Thank you for your response
That would be great if the index has the option to prevent duplicates from 
entering the index. But is it going to be a silent action? Or will the add 
method return that it failed indexing because it detected a duplicate?
Is it committed to 1.4 already?
Cheers
matt


--- On Mon, 5/10/10, Markus Jelsma markus.jel...@buyways.nl wrote:

From: Markus Jelsma markus.jel...@buyways.nl
Subject: RE: How to query for similar documents before indexing
To: solr-user@lucene.apache.org
Date: Monday, May 10, 2010, 4:11 PM

Hi,

Deduplication [1] is what you're looking for. It can utilize different analyzers 
that will add one or more signatures or hashes to your document depending on 
exact or partial matches for configurable fields. Based on that, it should be 
able to prevent new documents from entering the index. 

The first part works very well but I have some issues with removing those 
documents, on which I also need to check with the community tomorrow back at 
work ;-)

[1]: http://wiki.apache.org/solr/Deduplication

Cheers,


 
-Original message-
From: Matthieu Labour matthieu_lab...@yahoo.com
Sent: Mon 10-05-2010 22:41
To: solr-user@lucene.apache.org; 
Subject: How to query for similar documents before indexing

Hi

I want to implement the following logic:

Before I index a new document into the index, I want to check if there are 
already documents in the index with similar content to the content of the 
document about to be inserted. If the request returns 1 or more documents, then 
I don't want to insert the document.

What is the best way to achieve the above functionality?

I read about fuzzy searches in Lucene. But can I really build a request such as 
mydoc.title:wordexample~ AND mydoc.content:(all the content words)~0.9?

Thank you for your help


Re: replication issue

2010-03-04 Thread Matthieu Labour
Hi

I just want to post a follow up on the replication issue I encountered

I have a master on which many document updates (delete and add) are happening

There is one slave replicating from the master. There are only search requests 
hitting the slave. 

I can see the size of the downloaded data increasing on the slave in the 
index.X directory.

And then I see the following error in the log file

[2010-03-02 21:24:40] [pool-3-thread-1] ERROR(ReplicationHandler.java:266) - 
SnapPull failed org.apache.solr.common.SolrException: Unable to download 
_7h0y.fdx completely. 

The entire index.X directory gets deleted and no data gets merged into the index 
even though some data was already downloaded... as if all the downloaded files 
were part of the same transaction.

_7h0y.fdx is no longer on the master

Increasing commitReserveDuration to 1 hour allowed enough time for the data to 
be downloaded on the slave without any deletion happening on the master. 
Therefore I didn't see the SolrException in the slave log files and the 
replication worked
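
For anyone hitting the same SnapPull error, this is roughly where that knob 
lives in the master's solrconfig.xml (a sketch; the hh:mm:ss value matches the 
1-hour reservation described above, and replicateAfter is illustrative):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <!-- keep commit points (and their files) reserved for one hour so a
         slow slave can finish downloading before the master deletes them -->
    <str name="commitReserveDuration">01:00:00</str>
  </lst>
</requestHandler>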

Thank you

--- On Tue, 3/2/10, Matthieu Labour matthieu_lab...@yahoo.com wrote:

From: Matthieu Labour matthieu_lab...@yahoo.com
Subject: Re: replication issue
To: solr-user@lucene.apache.org
Date: Tuesday, March 2, 2010, 4:59 PM

Otis
Thank you for your response. I apologize for not being specific enough
-- yes, it happened over & over.
-- apache-solr-1.4.0
-- I restarted the indexing+replication from scratch. Before I did that, I 
backed up the master index directory. I don't see _7h0y.fdx in it.
What could have possibly happened?



--- On Tue, 3/2/10, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

From: Otis Gospodnetic otis_gospodne...@yahoo.com
Subject: Re: replication issue
To: solr-user@lucene.apache.org
Date: Tuesday, March 2, 2010, 4:40 PM

Hi Matthieu,

Does this happen over and over?
Is this with Solr 1.4 or some other version?
Is there anything unusual about _7h0y.fdx?
Does _7h0y.fdx still exist on the master when the replication fails?
...

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
 From: Matthieu Labour matthieu_lab...@yahoo.com
 To: solr-user@lucene.apache.org
 Sent: Tue, March 2, 2010 4:35:46 PM
 Subject: Re: replication issue
 
 The replication does not work for me
 
 
 I have a big master solr and I want to start replicating it. I can see that the slave is downloading data from the master... I see a directory index.20100302093000 gets created in data/ next to index... I can see its size growing but then the directory gets deleted
 
 Here is the complete trace (I added a couple of LOG messages and compiled solr)
 
 [2010-03-02 21:24:00] [pool-3-thread-1] DEBUG(MultiThreadedHttpConnectionManager.java:961) - Notifying no-one, there are no waiting threads
 [2010-03-02 21:24:00] [pool-3-thread-1] INFO (SnapPuller.java:278) - Number of files in latest index in master: 163
 [2010-03-02 21:24:00] [pool-3-thread-1] DEBUG(SnapPuller.java:536) - downloadIndexFiles(downloadCompleteIndex=false,tmpIdxDir=../solr/data/index.20100302092400,latestVersion=1266003907838)
 [2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(SnapPuller.java:541) - --localIndexFile=/opt/solr_env/solr/data/index/_7h0y.fdx
 [2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(SnapPuller.java:900) - fetchFile()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter PostMethod.addParameter(String, String)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - enter EntityEnclosingMethod.clearRequestBody()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter PostMethod.addParameter(String, String)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - enter EntityEnclosingMethod.clearRequestBody()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter PostMethod.addParameter(String, String)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - enter EntityEnclosingMethod.clearRequestBody()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter PostMethod.addParameter(String, String)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - enter EntityEnclosingMethod.clearRequestBody()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter PostMethod.addParameter(String, String)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - enter EntityEnclosingMethod.clearRequestBody()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpClient.java:321) - enter HttpClient.executeMethod(HttpMethod)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpClient.java:374) - enter HttpClient.executeMethod(HostConfiguration,HttpMethod,HttpState)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE

Re: replication issue

2010-03-02 Thread Matthieu Labour
Hi Paul
Thank you for your answer
I did put all the directory structure on /raid ... /raid/solr_env/solr ..., 
/raid/solr_env/jetty ...
And it still didn't work even after I applied patch SOLR-1736
I am investigating if this is because tempDir and the data dir are not on the same 
partition
matt

--- On Mon, 3/1/10, Noble Paul നോബിള്‍  नोब्ळ् noble.p...@gmail.com wrote:

From: Noble Paul നോബിള്‍  नोब्ळ् noble.p...@gmail.com
Subject: Re: replication issue
To: solr-user@lucene.apache.org
Date: Monday, March 1, 2010, 10:30 PM

The data/index.20100226063400 dir is a temporary dir and is created in
the same dir where the index dir is located.

I'm wondering if the symlink is causing the problem. Why don't you set
the data dir as /raid/data instead of /solr/data

On Sat, Feb 27, 2010 at 12:13 AM, Matthieu Labour
matthieu_lab...@yahoo.com wrote:
 Hi

 I am still having issues with the replication and wonder if things are 
 working properly

 So I have 1 master and 1 slave

 On the slave, I deleted the data/index directory and 
 data/replication.properties file and restarted solr.

 When slave is pulling data from master, I can see that the size of data 
 directory is growing

 r...@slr8:/raid/data# du -sh
 3.7M    .
 r...@slr8:/raid/data# du -sh
 4.7M    .

 and I can see that the data/replication.properties file got created and also a 
 directory data/index.20100226063400

 soon after, index.20100226063400 disappears and the size of data/index is back 
 to 12K

 r...@slr8:/raid/data/index# du -sh
 12K    .

 And when I look at the number of documents via the admin interface, I still 
 see 0 documents so I feel something is wrong

 One more thing, I have a symlink for /solr/data -> /raid/data

 Thank you for your help !

 matt

-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


Re: replication issue

2010-03-02 Thread Matthieu Labour
I think this issue is not related to patch SOLR-1736

Here is the error I get ... Thank you for any help


[2010-03-02 19:07:26] [pool-3-thread-1] ERROR(ReplicationHandler.java:266) - 
SnapPull failed
org.apache.solr.common.SolrException: Unable to download _7bre.fdt completely. Downloaded 0!=15591
    at org.apache.solr.handler.SnapPuller$FileFetcher.cleanup(SnapPuller.java:1036)
    at org.apache.solr.handler.SnapPuller$FileFetcher.fetchFile(SnapPuller.java:916)
    at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:541)
    at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:294)
    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:264)
    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:280)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:135)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:65)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:146)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:170)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
    at java.lang.Thread.run(Thread.java:595)


--- On Tue, 3/2/10, Matthieu Labour matthieu_lab...@yahoo.com wrote:

From: Matthieu Labour matthieu_lab...@yahoo.com
Subject: Re: replication issue
To: solr-user@lucene.apache.org
Date: Tuesday, March 2, 2010, 11:23 AM

Hi Paul
Thank you for your answer
I did put all the directory structure on /raid ... /raid/solr_env/solr ..., 
/raid/solr_env/jetty ...
And it still didn't work even after I applied patch SOLR-1736
I am investigating if this is because tempDir and the data dir are not on the same 
partition
matt

--- On Mon, 3/1/10, Noble Paul നോബിള്‍  नोब्ळ् noble.p...@gmail.com wrote:

From: Noble Paul നോബിള്‍  नोब्ळ् noble.p...@gmail.com
Subject: Re: replication issue
To: solr-user@lucene.apache.org
Date: Monday, March 1, 2010, 10:30 PM

The data/index.20100226063400 dir is a temporary dir and is created in
the same dir where the index dir is located.

I'm wondering if the symlink is causing the problem. Why don't you set
the data dir as /raid/data instead of /solr/data

On Sat, Feb 27, 2010 at 12:13 AM, Matthieu Labour
matthieu_lab...@yahoo.com wrote:
 Hi

 I am still having issues with the replication and wonder if things are 
 working properly

 So I have 1 master and 1 slave

 On the slave, I deleted the data/index directory and 
 data/replication.properties file and restarted solr.

 When slave is pulling data from master, I can see that the size of data 
 directory is growing

 r...@slr8:/raid/data# du -sh
 3.7M    .
 r...@slr8:/raid/data# du -sh
 4.7M    .

 and I can see that the data/replication.properties file got created and also a 
 directory data/index.20100226063400

 soon after, index.20100226063400 disappears and the size of data/index is back 
 to 12K

 r...@slr8:/raid/data/index# du -sh
 12K    .

 And when I look at the number of documents via the admin interface, I still 
 see 0 documents so I feel something is wrong

 One more thing, I have a symlink for /solr/data -> /raid/data

 Thank you for your help !

 matt

-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


Re: replication issue

2010-03-02 Thread Matthieu Labour
The replication does not work for me


I have a big master solr and I want to start replicating it. I can see that the 
slave is downloading data from the master... I see a directory 
index.20100302093000 gets created in data/ next to index... I can see its size 
growing but then the directory gets deleted

Here is the complete trace (I added a couple of LOG messages and compiled solr)

[2010-03-02 21:24:00] [pool-3-thread-1] 
DEBUG(MultiThreadedHttpConnectionManager.java:961) - Notifying no-one, there 
are no waiting threads
[2010-03-02 21:24:00] [pool-3-thread-1] INFO (SnapPuller.java:278) - Number of 
files in latest index in master: 163
[2010-03-02 21:24:00] [pool-3-thread-1] DEBUG(SnapPuller.java:536) - 
downloadIndexFiles(downloadCompleteIndex=false,tmpIdxDir=../solr/data/index.20100302092400,latestVersion=1266003907838)
[2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(SnapPuller.java:541) - 
--localIndexFile=/opt/solr_env/solr/data/index/_7h0y.fdx
[2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(SnapPuller.java:900) - fetchFile()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter 
PostMethod.addParameter(String, String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - 
enter EntityEnclosingMethod.clearRequestBody()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter 
PostMethod.addParameter(String, String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - 
enter EntityEnclosingMethod.clearRequestBody()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter 
PostMethod.addParameter(String, String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - 
enter EntityEnclosingMethod.clearRequestBody()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter 
PostMethod.addParameter(String, String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - 
enter EntityEnclosingMethod.clearRequestBody()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter 
PostMethod.addParameter(String, String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - 
enter EntityEnclosingMethod.clearRequestBody()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpClient.java:321) - enter 
HttpClient.executeMethod(HttpMethod)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpClient.java:374) - enter 
HttpClient.executeMethod(HostConfiguration,HttpMethod,HttpState)
[2010-03-02 21:24:40] [pool-3-thread-1] 
TRACE(MultiThreadedHttpConnectionManager.java:405) - enter 
HttpConnectionManager.getConnectionWithTimeout(HostConfiguration, long)
[2010-03-02 21:24:40] [pool-3-thread-1] 
DEBUG(MultiThreadedHttpConnectionManager.java:412) - 
HttpConnectionManager.getConnection:  config = 
HostConfiguration[host=http://myserver.com:8983], timeout = 0
[2010-03-02 21:24:40] [pool-3-thread-1] 
TRACE(MultiThreadedHttpConnectionManager.java:805) - enter 
HttpConnectionManager.ConnectionPool.getHostPool(HostConfiguration)
[2010-03-02 21:24:40] [pool-3-thread-1] 
TRACE(MultiThreadedHttpConnectionManager.java:805) - enter 
HttpConnectionManager.ConnectionPool.getHostPool(HostConfiguration)
[2010-03-02 21:24:40] [pool-3-thread-1] 
DEBUG(MultiThreadedHttpConnectionManager.java:839) - Getting free connection, 
hostConfig=HostConfiguration[host=http://myserver.com:8983]
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodDirector.java:379) - 
Attempt number 1 to process request
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:1079) - enter 
HttpMethodBase.execute(HttpState, HttpConnection)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:2057) - enter 
HttpMethodBase.writeRequest(HttpState, HttpConnection)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:2212) - enter 
HttpMethodBase.writeRequestLine(HttpState, HttpConnection)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:1496) - enter 
HttpMethodBase.generateRequestLine(HttpConnection, String, String, String, 
String)
[2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(Wire.java:70) -  POST 
/solr/replication HTTP/1.1[\r][\n]
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpConnection.java:1032) - enter 
HttpConnection.print(String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpConnection.java:942) - enter 
HttpConnection.write(byte[])
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpConnection.java:963) - enter 
HttpConnection.write(byte[], int, int)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:2175) - enter 
HttpMethodBase.writeRequestHeaders(HttpState,HttpConnection)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:370) - 
enter EntityEnclosingMethod.addRequestHeaders(HttpState, HttpConnection)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(ExpectContinueMethod.java:183) - 
enter ExpectContinueMethod.addRequestHeaders(HttpState, HttpConnection)

Re: replication issue

2010-03-02 Thread Matthieu Labour
One more piece of information

I deleted the index on the master and I restarted the master and restarted the 
slave and now the replication works

Would it be possible that the replication doesn't work well when started against 
an already existing big index?

Thank you

--- On Tue, 3/2/10, Matthieu Labour matthieu_lab...@yahoo.com wrote:

From: Matthieu Labour matthieu_lab...@yahoo.com
Subject: Re: replication issue
To: solr-user@lucene.apache.org
Date: Tuesday, March 2, 2010, 3:35 PM

The replication does not work for me


I have a big master solr and I want to start replicating it. I can see that the 
slave is downloading data from the master... I see a directory 
index.20100302093000 gets created in data/ next to index... I can see its size 
growing but then the directory gets deleted

Here is the complete trace (I added a couple of LOG messages and compiled solr)

[2010-03-02 21:24:00] [pool-3-thread-1] 
DEBUG(MultiThreadedHttpConnectionManager.java:961) - Notifying no-one, there 
are no waiting threads
[2010-03-02 21:24:00] [pool-3-thread-1] INFO (SnapPuller.java:278) - Number of 
files in latest index in master: 163
[2010-03-02 21:24:00] [pool-3-thread-1] DEBUG(SnapPuller.java:536) - 
downloadIndexFiles(downloadCompleteIndex=false,tmpIdxDir=../solr/data/index.20100302092400,latestVersion=1266003907838)
[2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(SnapPuller.java:541) - 
--localIndexFile=/opt/solr_env/solr/data/index/_7h0y.fdx
[2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(SnapPuller.java:900) - fetchFile()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter 
PostMethod.addParameter(String, String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - 
enter EntityEnclosingMethod.clearRequestBody()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter 
PostMethod.addParameter(String, String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - 
enter EntityEnclosingMethod.clearRequestBody()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter 
PostMethod.addParameter(String, String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - 
enter EntityEnclosingMethod.clearRequestBody()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter 
PostMethod.addParameter(String, String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - 
enter EntityEnclosingMethod.clearRequestBody()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter 
PostMethod.addParameter(String, String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - 
enter EntityEnclosingMethod.clearRequestBody()
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpClient.java:321) - enter 
HttpClient.executeMethod(HttpMethod)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpClient.java:374) - enter 
HttpClient.executeMethod(HostConfiguration,HttpMethod,HttpState)
[2010-03-02 21:24:40] [pool-3-thread-1] 
TRACE(MultiThreadedHttpConnectionManager.java:405) - enter 
HttpConnectionManager.getConnectionWithTimeout(HostConfiguration, long)
[2010-03-02 21:24:40] [pool-3-thread-1] 
DEBUG(MultiThreadedHttpConnectionManager.java:412) - 
HttpConnectionManager.getConnection:  config = 
HostConfiguration[host=http://myserver.com:8983], timeout = 0
[2010-03-02 21:24:40] [pool-3-thread-1] 
TRACE(MultiThreadedHttpConnectionManager.java:805) - enter 
HttpConnectionManager.ConnectionPool.getHostPool(HostConfiguration)
[2010-03-02 21:24:40] [pool-3-thread-1] 
TRACE(MultiThreadedHttpConnectionManager.java:805) - enter 
HttpConnectionManager.ConnectionPool.getHostPool(HostConfiguration)
[2010-03-02 21:24:40] [pool-3-thread-1] 
DEBUG(MultiThreadedHttpConnectionManager.java:839) - Getting free connection, 
hostConfig=HostConfiguration[host=http://myserver.com:8983]
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodDirector.java:379) - 
Attempt number 1 to process request
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:1079) - enter 
HttpMethodBase.execute(HttpState, HttpConnection)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:2057) - enter 
HttpMethodBase.writeRequest(HttpState, HttpConnection)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:2212) - enter 
HttpMethodBase.writeRequestLine(HttpState, HttpConnection)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:1496) - enter 
HttpMethodBase.generateRequestLine(HttpConnection, String, String, String, 
String)
[2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(Wire.java:70) -  POST 
/solr/replication HTTP/1.1[\r][\n]
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpConnection.java:1032) - enter 
HttpConnection.print(String)
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpConnection.java:942) - enter 
HttpConnection.write(byte[])
[2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpConnection.java:963) - enter

Re: replication issue

2010-03-02 Thread Matthieu Labour
Otis
Thank you for your response. I apologize for not being specific enough
-- yes, it happened over & over.
-- apache-solr-1.4.0
-- I restarted the indexing+replication from scratch. Before I did that, I 
backed up the master index directory. I don't see _7h0y.fdx in it.
What could have possibly happened?



--- On Tue, 3/2/10, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

From: Otis Gospodnetic otis_gospodne...@yahoo.com
Subject: Re: replication issue
To: solr-user@lucene.apache.org
Date: Tuesday, March 2, 2010, 4:40 PM

Hi Matthieu,

Does this happen over and over?
Is this with Solr 1.4 or some other version?
Is there anything unusual about _7h0y.fdx?
Does _7h0y.fdx still exist on the master when the replication fails?
...

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
 From: Matthieu Labour matthieu_lab...@yahoo.com
 To: solr-user@lucene.apache.org
 Sent: Tue, March 2, 2010 4:35:46 PM
 Subject: Re: replication issue
 
 The replication does not work for me
 
 
 I have a big master solr and I want to start replicating it. I can see that the slave is downloading data from the master... I see a directory index.20100302093000 gets created in data/ next to index... I can see its size growing but then the directory gets deleted
 
 Here is the complete trace (I added a couple of LOG messages and compiled solr)
 
 [2010-03-02 21:24:00] [pool-3-thread-1] DEBUG(MultiThreadedHttpConnectionManager.java:961) - Notifying no-one, there are no waiting threads
 [2010-03-02 21:24:00] [pool-3-thread-1] INFO (SnapPuller.java:278) - Number of files in latest index in master: 163
 [2010-03-02 21:24:00] [pool-3-thread-1] DEBUG(SnapPuller.java:536) - downloadIndexFiles(downloadCompleteIndex=false,tmpIdxDir=../solr/data/index.20100302092400,latestVersion=1266003907838)
 [2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(SnapPuller.java:541) - --localIndexFile=/opt/solr_env/solr/data/index/_7h0y.fdx
 [2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(SnapPuller.java:900) - fetchFile()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter PostMethod.addParameter(String, String)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - enter EntityEnclosingMethod.clearRequestBody()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter PostMethod.addParameter(String, String)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - enter EntityEnclosingMethod.clearRequestBody()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter PostMethod.addParameter(String, String)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - enter EntityEnclosingMethod.clearRequestBody()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter PostMethod.addParameter(String, String)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - enter EntityEnclosingMethod.clearRequestBody()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(PostMethod.java:265) - enter PostMethod.addParameter(String, String)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(EntityEnclosingMethod.java:150) - enter EntityEnclosingMethod.clearRequestBody()
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpClient.java:321) - enter HttpClient.executeMethod(HttpMethod)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpClient.java:374) - enter HttpClient.executeMethod(HostConfiguration,HttpMethod,HttpState)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(MultiThreadedHttpConnectionManager.java:405) - enter HttpConnectionManager.getConnectionWithTimeout(HostConfiguration, long)
 [2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(MultiThreadedHttpConnectionManager.java:412) - HttpConnectionManager.getConnection:  config = HostConfiguration[host=http://myserver.com:8983], timeout = 0
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(MultiThreadedHttpConnectionManager.java:805) - enter HttpConnectionManager.ConnectionPool.getHostPool(HostConfiguration)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(MultiThreadedHttpConnectionManager.java:805) - enter HttpConnectionManager.ConnectionPool.getHostPool(HostConfiguration)
 [2010-03-02 21:24:40] [pool-3-thread-1] DEBUG(MultiThreadedHttpConnectionManager.java:839) - Getting free connection, hostConfig=HostConfiguration[host=http://myserver.com:8983]
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodDirector.java:379) - Attempt number 1 to process request
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:1079) - enter HttpMethodBase.execute(HttpState, HttpConnection)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE(HttpMethodBase.java:2057) - enter HttpMethodBase.writeRequest(HttpState, HttpConnection)
 [2010-03-02 21:24:40] [pool-3-thread-1] TRACE

Re: replication issue

2010-03-01 Thread Matthieu Labour
This replication does not work well. The temp directory and /data/index are on 
different devices/disks

I see the following message

[2010-03-02 01:22:07] [pool-3-thread-1] ERROR(ReplicationHandler.java:266) - 
SnapPull failed 

And yet I applied the patch SOLR-1736 

I'll unit test patch SOLR-1736 and see what tmpIdxDir gets picked up...

What would be cool is the ability to set up the solr temp dir via a config file 
so that it can live in the same partition as the data directory

Thank you

--- On Fri, 2/26/10, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

From: Shalin Shekhar Mangar shalinman...@gmail.com
Subject: Re: replication issue
To: solr-user@lucene.apache.org
Date: Friday, February 26, 2010, 2:06 PM

On Sat, Feb 27, 2010 at 12:13 AM, Matthieu Labour matthieu_lab...@yahoo.com
 wrote:

 Hi

 I am still having issues with the replication and wonder if things are
 working properly

 So I have 1 master and 1 slave

 On the slave, I deleted the data/index directory and
 data/replication.properties file and restarted solr.

 When slave is pulling data from master, I can see that the size of data
 directory is growing

 r...@slr8:/raid/data# du -sh
 3.7M    .
 r...@slr8:/raid/data# du -sh
 4.7M    .

 and I can see that the data/replication.properties file got created and also a
 directory data/index.20100226063400

 soon after, index.20100226063400 disappears and the size of data/index is
 back to 12K

 r...@slr8:/raid/data/index# du -sh
 12K    .

 And when I look at the number of documents via the admin interface, I
 still see 0 documents so I feel something is wrong

 One more thing, I have a symlink for /solr/data -> /raid/data


The ReplicationHandler moves files out of the temp directory into the index
directory. Java's File#renameTo can fail if the source and target
directories are on different partitions/disks. Is that the case here? I
believe SOLR-1736 fixes this issue in trunk but that was implemented after
the 1.4 release.
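
A small standalone Java sketch of the failure mode Shalin describes, with the 
usual copy-then-delete fallback; the paths in main are illustrative:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class MoveAcrossPartitions {
  // File#renameTo is cheap and atomic but may simply return false when
  // source and target sit on different partitions; the fallback is an
  // explicit copy followed by a delete of the source.
  static void move(File src, File dst) throws IOException {
    if (src.renameTo(dst)) {
      return;  // same partition: the rename worked
    }
    FileInputStream in = new FileInputStream(src);
    FileOutputStream out = new FileOutputStream(dst);
    try {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) > 0) {
        out.write(buf, 0, n);
      }
    } finally {
      in.close();
      out.close();
    }
    if (!src.delete()) {
      throw new IOException("could not delete " + src);
    }
  }

  public static void main(String[] args) throws IOException {
    // e.g. temp dir on one disk, index dir on another
    move(new File("/tmp/index.20100226063400/_1.fdx"),
         new File("/raid/data/index/_1.fdx"));
  }
}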

-- 
Regards,
Shalin Shekhar Mangar.


replication. when the slave goes down...

2010-02-26 Thread Matthieu Labour
Hi
I have 2 solr machines: 1 master, 1 slave replicating the index from the master.
The machine on which the slave is running went down while the replication was 
running.
I suppose the index must be corrupted. Can I safely remove the index on the 
slave, restart it, and have the slave start the replication over from scratch?
Thank you
Thank you


replication issue

2010-02-26 Thread Matthieu Labour
Hi

I am still having issues with the replication and wonder if things are working 
properly

So I have 1 master and 1 slave

On the slave, I deleted the data/index directory and 
data/replication.properties file and restarted solr.

When slave is pulling data from master, I can see that the size of data 
directory is growing

r...@slr8:/raid/data# du -sh
3.7M    .
r...@slr8:/raid/data# du -sh
4.7M    .

and I can see that the data/replication.properties file got created and also a 
directory data/index.20100226063400

soon after, index.20100226063400 disappears and the size of data/index is back to 
12K

r...@slr8:/raid/data/index# du -sh
12K    .

And when I look at the number of documents via the admin interface, I still 
see 0 documents so I feel something is wrong

One more thing, I have a symlink for /solr/data -> /raid/data

Thank you for your help !

matt


Re: replication issue

2010-02-26 Thread Matthieu Labour
Shalin

Thank you so much for your answer
This is the case here.
How can I find out which temp directory Solr replication is using?
Do you have a way to set up the source (the temp directory used by solr) and 
target directory via the solr config file so that they live on the same partition?
Thank you
matt


--- On Fri, 2/26/10, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

From: Shalin Shekhar Mangar shalinman...@gmail.com
Subject: Re: replication issue
To: solr-user@lucene.apache.org
Date: Friday, February 26, 2010, 2:06 PM

On Sat, Feb 27, 2010 at 12:13 AM, Matthieu Labour matthieu_lab...@yahoo.com
 wrote:

 Hi

 I am still having issues with the replication and wonder if things are
 working properly

 So I have 1 master and 1 slave

 On the slave, I deleted the data/index directory and
 data/replication.properties file and restarted solr.

 When slave is pulling data from master, I can see that the size of data
 directory is growing

 r...@slr8:/raid/data# du -sh
 3.7M    .
 r...@slr8:/raid/data# du -sh
 4.7M    .

 and I can see that the data/replication.properties file got created and also a
 directory data/index.20100226063400

 soon after, index.20100226063400 disappears and the size of data/index is
 back to 12K

 r...@slr8:/raid/data/index# du -sh
 12K    .

 And when I look at the number of documents via the admin interface, I
 still see 0 documents so I feel something is wrong

 One more thing, I have a symlink for /solr/data -> /raid/data


The ReplicationHandler moves files out of the temp directory into the index
directory. Java's File#renameTo can fail if the source and target
directories are on different partitions/disks. Is that the case here? I
believe SOLR-1736 fixes this issue in trunk but that was implemented after
the 1.4 release.

-- 
Regards,
Shalin Shekhar Mangar.


expire/delete documents

2010-02-12 Thread Matthieu Labour
Hi
Is there a way for solr or lucene to expire documents based on a field in a 
document? Let's say that I have a createTime field whose type is date, can I 
set a policy in schema.xml for solr to delete the documents older than X days?
Thank you
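
As far as I know there is no schema-level expiry policy in Solr 1.4; a common 
workaround is a scheduled delete-by-query on that date field. A minimal SolrJ 
sketch, where the URL, field name, and 30-day window are illustrative:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ExpireOldDocuments {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Run periodically (e.g. from cron): delete everything whose createTime
    // is older than 30 days, using Solr's date math, then commit.
    server.deleteByQuery("createTime:[* TO NOW-30DAYS]");
    server.commit();
  }
}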


query on not stored field

2010-02-01 Thread Matthieu Labour
Hi

on the following field

<fields name="status">
[...]
<field name="message" index="analyzed" store="yes" default="true"/>
[...]
</fields>

the following query works

{!lucene q.op=AND} [...] AND (status.messageSTRING_ANALYZED_NO_US:(some 
keywords) AND [...]

I was wondering if the query syntax above works as well when the store property 
of the field is set to no. 

<fields name="status">

[...]

<field name="message" index="analyzed" store="no" default="true"/>

[...]

</fields>


I have tried it and it seems to work. I would appreciate if someone could 
confirm!

Thank you


Re: query on not stored field

2010-02-01 Thread Matthieu Labour
Koji, Eric
Thank you for your reply
One more question:
What about a field that is both indexed=false and stored=false ... does it have 
an impact on solr, meaning is it being ignored by solr/lucene? Is it as if the 
field was not being passed at all?
Thank you!


--- On Mon, 2/1/10, Erik Hatcher erik.hatc...@gmail.com wrote:

From: Erik Hatcher erik.hatc...@gmail.com
Subject: Re: query on not stored field
To: solr-user@lucene.apache.org
Date: Monday, February 1, 2010, 6:32 PM

First of all, the schema snippets you provided aren't right.  It's 
indexed=true, not index=analyzed.  And it's stored, not store.

But, to answer your question, the stored nature of the field has nothing 
whatsoever to do with its searchability.  Stored only affects whether you can 
get that value back in the documents returned from a search, or not.
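
In other words, with the corrected attribute names the field would be declared 
along these lines (a sketch; the type name is an illustrative assumption) and 
still be fully searchable:

<field name="message" type="text" indexed="true" stored="false"/>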

    Erik


On Feb 1, 2010, at 7:12 PM, Matthieu Labour wrote:

 Hi
 
 on the following field
 
 <fields name="status">
 [...]
 <field name="message" index="analyzed" store="yes" default="true"/>
 [...]
 </fields>
 
 the following query works
 
 {!lucene q.op=AND} [...] AND (status.messageSTRING_ANALYZED_NO_US:(some 
 keywords) AND [...]
 
 I was wondering If the query syntax above works as well if the store property 
 of the field is set to NO.
 
 <fields name="status">
 
 [...]
 
 <field name="message" index="analyzed" store="no" default="true"/>
 
 [...]
 
 </fields>
 
 
 I have tried it and it seems to work. I would appreciate if someone could 
 confirm!
 
 Thank you
 

Re: Multiple Cores Vs. Single Core for the following use case

2010-01-28 Thread Matthieu Labour


Thanks a lot everybody for the responses ... I am going to do some 
practical/empirical testing and will report back
matt

--- On Wed, 1/27/10, Tom Hill solr-l...@worldware.com wrote:

From: Tom Hill solr-l...@worldware.com
Subject: Re: Multiple Cores Vs. Single Core for the following use case
To: solr-user@lucene.apache.org
Date: Wednesday, January 27, 2010, 2:47 PM

Hi -

I'd probably go with a single core on this one, just for ease of operations.

But here are some thoughts:

One advantage I can see to multiple cores, though, would be better idf
calculations. With individual cores, each user only sees the idf for his own
documents. With a single core, the idf will be across all documents. In
theory, better relevance.

While multi-core will use more ram to start with, and I would expect it to
use more disk (term dictionary per core). Filters would add to the memory
footprint of the multiple core setup.

However, if you only end up sorting/faceting on some of the cores, your
memory use with multiple cores may actually be less. With multiple cores,
each field cache only covers one user's docs. With single core, you have one
field cache entry per doc in the whole corpus. Depending on usage patterns,
index sizes, etc, this could be a significant amount of memory.

Tom


On Wed, Jan 27, 2010 at 11:38 AM, Amit Nithian anith...@gmail.com wrote:

 It sounds to me that multiple cores won't scale... wouldn't you have to
 create multiple configurations for each core, and does the ranking function
 change per user?

 I would imagine that the filter method would work better.. the caching is
 there and as mentioned earlier would be fast for multiple searches. If you
 have searches for the same user, then add that to your warming queries list
 so that on server startup, the cache will be warm for certain users that
 you
 know tend to do a lot of searches. This can be known empirically or by log
 mining.

 I haven't used multiple cores but I suspect that having that many
 configuration files parsed and loaded in memory can't be good for memory
 usage over filter caching.

 Just my 2 cents
 Amit

 On Wed, Jan 27, 2010 at 8:58 AM, Matthieu Labour
 matthieu_lab...@yahoo.comwrote:

  Thanks Didier for your response
  And in your opinion, this should be as fast as if I would getCore(userId)
  -- provided that the core is already open -- and then search for Paris
 ?
  matt
 
  --- On Wed, 1/27/10, didier deshommes dfdes...@gmail.com wrote:
 
  From: didier deshommes dfdes...@gmail.com
  Subject: Re: Multiple Cores Vs. Single Core for the following use case
  To: solr-user@lucene.apache.org
  Date: Wednesday, January 27, 2010, 10:52 AM
 
  On Wed, Jan 27, 2010 at 9:48 AM, Matthieu Labour
  matthieu_lab...@yahoo.com wrote:
   What I am trying to understand is the search/filter algorithm. If I
 have
  1 core with all documents and I  search for Paris for userId=123, is
  lucene going to first search for all Paris documents and then apply a
 filter
  on the userId ? If this is the case, then I am better off having a
 specific
  index for the user=123 because this will be faster
 
  If you want to apply the filter to userid first, use filter queries
  (http://wiki.apache.org/solr/CommonQueryParameters#fq). This will
  filter by userid first then search for Paris.
 
  didier
 
  
  
  
  
  
   --- On Wed, 1/27/10, Marc Sturlese marc.sturl...@gmail.com wrote:
  
   From: Marc Sturlese marc.sturl...@gmail.com
   Subject: Re: Multiple Cores Vs. Single Core for the following use case
   To: solr-user@lucene.apache.org
   Date: Wednesday, January 27, 2010, 2:22 AM
  
  
   In case you are going to use core per user take a look to this patch:
   http://wiki.apache.org/solr/LotsOfCores
  
   Trey-13 wrote:
  
   Hi Matt,
  
   In most cases you are going to be better off going with the userid
  method
   unless you have a very small number of users and a very large number
 of
   docs/user. The userid method will likely be much easier to manage, as
  you
   won't have to spin up a new core every time you add a new user.  I
 would
   start here and see if the performance is good enough for your
  requirements
   before you start worrying about it not being efficient.
  
   That being said, I really don't have any idea what your data looks
 like.
   How many users do you have?  How many documents per user?  Are any
   documents
   shared by multiple users?
  
   -Trey
  
  
  
   On Tue, Jan 26, 2010 at 7:27 PM, Matthieu Labour
   matthieu_lab...@yahoo.comwrote:
  
   Hi
  
  
  
   Shall I set up Multiple Core or Single core for the following use
 case:
  
  
  
   I have X number of users.
  
  
  
   When I do a search, I always know for which user I am doing a search
  
  
  
   Shall I set up X cores, 1 for each user ? Or shall I set up 1 core
 and
   add
   a userId field to each document?
  
  
  
   If I choose the 1 core solution then I am concerned with performance.
   Let's say I search for NewYork ... If lucene returns all New York

Re: Multiple Cores Vs. Single Core for the following use case

2010-01-27 Thread Matthieu Labour
@Marc: Thank you marc. This is a logic we had to implement in the client 
application. Will look into applying the patch to replace our own grown logic

@Trey: I have 1000 users per machine. 1 core / user. Each core is 35000 
documents. Documents are small...each core goes from 100MB to 1.3GB at most. 
There are 7 types of documents.
What I am trying to understand is the search/filter algorithm. If I have 1 core 
with all documents and I search for Paris for userId=123, is lucene going 
to first search for all Paris documents and then apply a filter on the userId? 
If this is the case, then I am better off having a specific index for the 
user=123 because this will be faster 

--- On Wed, 1/27/10, Marc Sturlese marc.sturl...@gmail.com wrote:

From: Marc Sturlese marc.sturl...@gmail.com
Subject: Re: Multiple Cores Vs. Single Core for the following use case
To: solr-user@lucene.apache.org
Date: Wednesday, January 27, 2010, 2:22 AM


In case you are going to use core per user take a look to this patch:
http://wiki.apache.org/solr/LotsOfCores

Trey-13 wrote:
 
 Hi Matt,
 
 In most cases you are going to be better off going with the userid method
 unless you have a very small number of users and a very large number of
 docs/user. The userid method will likely be much easier to manage, as you
 won't have to spin up a new core every time you add a new user.  I would
 start here and see if the performance is good enough for your requirements
 before you start worrying about it not being efficient.
 
 That being said, I really don't have any idea what your data looks like.
 How many users do you have?  How many documents per user?  Are any
 documents
 shared by multiple users?
 
 -Trey
 
 
 
 On Tue, Jan 26, 2010 at 7:27 PM, Matthieu Labour
 matthieu_lab...@yahoo.comwrote:
 
 Hi



 Shall I set up Multiple Core or Single core for the following use case:



 I have X number of users.



 When I do a search, I always know for which user I am doing a search



 Shall I set up X cores, 1 for each user ? Or shall I set up 1 core and
 add
 a userId field to each document?



 If I choose the 1 core solution then I am concerned with performance.
 Let's say I search for NewYork ... If lucene returns all New York
 matches for all users and then filters based on the userId, then this
 is going to be less efficient than if I have sharded per user and send
 the request for New York to the user's core



 Thank you for your help



 matt







 
 

-- 
View this message in context: 
http://old.nabble.com/Multiple-Cores-Vs.-Single-Core-for-the-following-use-case-tp27332288p27335403.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Multiple Cores Vs. Single Core for the following use case

2010-01-27 Thread Matthieu Labour
Thanks Didier for your response
And in your opinion, this should be as fast as if I would getCore(userId) -- 
provided that the core is already open -- and then search for Paris?
matt

--- On Wed, 1/27/10, didier deshommes dfdes...@gmail.com wrote:

From: didier deshommes dfdes...@gmail.com
Subject: Re: Multiple Cores Vs. Single Core for the following use case
To: solr-user@lucene.apache.org
Date: Wednesday, January 27, 2010, 10:52 AM

On Wed, Jan 27, 2010 at 9:48 AM, Matthieu Labour
matthieu_lab...@yahoo.com wrote:
 What I am trying to understand is the search/filter algorithm. If I have 1 
 core with all documents and I search for Paris for userId=123, is lucene 
 going to first search for all Paris documents and then apply a filter on the 
 userId? If this is the case, then I am better off having a specific index 
 for the user=123 because this will be faster

If you want to apply the filter to userid first, use filter queries
(http://wiki.apache.org/solr/CommonQueryParameters#fq). This will
filter by userid first then search for Paris.
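
A minimal SolrJ sketch of that pattern (the URL and field name are 
illustrative): the fq clause's bitset is kept in Solr's filterCache, so repeat 
searches for the same user reuse the cached filter.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PerUserSearch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    // q scores only "Paris"; fq restricts the result to the user's documents
    // without contributing to scoring, and is cached independently of q.
    SolrQuery q = new SolrQuery("Paris").addFilterQuery("userId:123");
    System.out.println(server.query(q).getResults().getNumFound() + " hits");
  }
}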

didier

 --- On Wed, 1/27/10, Marc Sturlese marc.sturl...@gmail.com wrote:

 From: Marc Sturlese marc.sturl...@gmail.com
 Subject: Re: Multiple Cores Vs. Single Core for the following use case
 To: solr-user@lucene.apache.org
 Date: Wednesday, January 27, 2010, 2:22 AM


 In case you are going to use core per user take a look to this patch:
 http://wiki.apache.org/solr/LotsOfCores

 Trey-13 wrote:

 Hi Matt,

 In most cases you are going to be better off going with the userid method
 unless you have a very small number of users and a very large number of
 docs/user. The userid method will likely be much easier to manage, as you
 won't have to spin up a new core every time you add a new user.  I would
 start here and see if the performance is good enough for your requirements
 before you start worrying about it not being efficient.

 That being said, I really don't have any idea what your data looks like.
 How many users do you have?  How many documents per user?  Are any
 documents
 shared by multiple users?

 -Trey



 On Tue, Jan 26, 2010 at 7:27 PM, Matthieu Labour
 matthieu_lab...@yahoo.comwrote:

 Hi



 Shall I set up Multiple Core or Single core for the following use case:



 I have X number of users.



 When I do a search, I always know for which user I am doing a search



 Shall I set up X cores, 1 for each user ? Or shall I set up 1 core and
 add
 a userId field to each document?



 If I choose the 1 core solution then I am concerned with performance.
 Let's say I search for NewYork ... If lucene returns all New York
 matches for all users and then filters based on the userId, then this
 is going to be less efficient than if I have sharded per user and send
 the request for New York to the user's core



 Thank you for your help



 matt










 --
 View this message in context: 
 http://old.nabble.com/Multiple-Cores-Vs.-Single-Core-for-the-following-use-case-tp27332288p27335403.html
 Sent from the Solr - User mailing list archive at Nabble.com.


solr1.5

2010-01-26 Thread Matthieu Labour
Hi
quick question:
Is there any release date scheduled for solr 1.5 with all the wonderful
patches (StreamingUpdateSolrServer etc ...)?
Thank you !


replication setup

2010-01-26 Thread Matthieu Labour
Hi

I have set up replication following the wiki.

I downloaded the latest apache-solr-1.4 release and exploded it in 2 different 
directories.
I modified both solrconfig.xml files, for the master & the slave, as described 
on the wiki page.
In both directories, I started solr from the example directory.
example on the master:
java -Dsolr.solr.home=multicore -Djetty.host=0.0.0.0 -Djetty.port=8983 
-DSTOP.PORT=8078 -DSTOP.KEY=stop.now -jar start.jar

and on the slave
java -Dsolr.solr.home=multicore -Djetty.host=0.0.0.0 -Djetty.port=8982 
-DSTOP.PORT=8077 -DSTOP.KEY=stop.now -jar start.jar

I can see core0 and core1 when I open the solr url.
However, I don't see a replication link, and
the url <solr url>/replication returns a 404 error.

I must be doing something wrong. I would appreciate any help !

thanks a lot

matt
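
For comparison, the minimal handler definitions from the wiki (host and poll 
interval are illustrative). Note that request handlers are registered per core, 
so with multicore the handler answers at <solr url>/<core name>/replication, 
e.g. /solr/core0/replication, which may be why the bare /replication url 404s:

<!-- master solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- slave solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://localhost:8983/solr/core0/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>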



Multiple Cores Vs. Single Core for the following use case

2010-01-26 Thread Matthieu Labour
Hi

Shall I set up Multiple Core or Single core for the following use case:

I have X number of users.

When I do a search, I always know for which user I am doing a search.

Shall I set up X cores, 1 for each user? Or shall I set up 1 core and add a 
userId field to each document?

If I choose the 1 core solution then I am concerned with performance.
Let's say I search for NewYork ... If lucene returns all New York
matches for all users and then filters based on the userId, then this
is going to be less efficient than if I have sharded per user and send
the request for New York to the user's core

Thank you for your help

matt


Re: performance issue

2010-01-22 Thread Matthieu Labour
Hi

Thank you for your reponse

Which version of solr?
I inherited the project so not exactly sure ... in CHANGES.txt it says
Apache Solr Version 1.4-dev
$Id: CHANGES.txt 793090 2009-07-10 19:40:33Z yonik $

What garbage collection parameters?
ulimit -n 10 ; nohup java -server -XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode  -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:-TraceClassUnloading
-XX:+UseParNewGC -XX:ParallelGCThreads=4 -Xmx5000m
-Dsolr.solr.home=/opt/solr_env/index
-Djava.util.logging.config.file=/opt/solr_env/index/logging.properties
-Djetty.host=0.0.0.0 -DSTOP.PORT=8079 -DSTOP.KEY=stop.now
-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=3500
-Dcom.sun.management.jmxremote.ssl=false -jar start.jar > solr.log &

What version of java?
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_18-b02, mixed mode). I also
tried with 1.6 but it didn't change anything. Changing -Xmx from 5000 to 3500 
causes the problem to happen more quickly.

The machine is an xlarge machine on amazon

7 GB of memory
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: c1.xlarge


Thank you for your help


matt


On Thu, Jan 21, 2010 at 11:57 PM, Lance Norskog goks...@gmail.com wrote:

 Which version of Solr? Java? What garbage collection parameters?

 On Thu, Jan 21, 2010 at 1:03 PM, Matthieu Labour matth...@strateer.com
 wrote:
  Hi
 
  I have been requested to look at a solr instance that has been patched
 with
  our own home grown patch to be able to handle 1000 cores on a solr
 instance
 
  The solr instance doesn't perform well. Within 12 hours, I can see the
  garbage collection taking a lot of time and query & update requests are
  timing out (see below)
 
  [Full GC [PSYoungGen: 673152K->98800K(933888K)] [PSOldGen:
  2389375K->2389375K(2389376K)] 3062527K->2488176K(3323264K) [PSPermGen:
  23681K->23681K(23744K)], 4.0807080 secs] [Times: user=4.08 sys=0.00,
  real=4.08 secs]
 
  org.apache.solr.client.solrj.SolrServerException: java.net.SocketTimeoutException: Read timed out
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
      at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
      at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 
 
  I used yourkit to track down eventual memory leaks but didn't succeed in
  finding one
 
  The biggest objects using up the memory seem to be org.apache.lucene.Term
  and org.apache.lucene.TermInfo
 
  The total size of the data directory in index is 46G with a typical big
 core
  being 10 documents and size of 103M
 
  There are lots of search requests and indexing happening
 
  I am posting to the mailing list hoping to hear that we must be doing
  something completely wrong because it doesn't seem to me that we are pushing
  the limit. I would appreciate any tips as where to look at etc... to
  troubleshoot and solve the issue
 
  Thank you for your help !
 
  matt
 



 --
 Lance Norskog
 goks...@gmail.com

CoreContainer / getCore and create ?

2010-01-22 Thread Matthieu Labour
Hi

Would it make sense to modify/add a method to CoreContainer that creates a
core if the core doesn't exist?

something like

public SolrCore getCore(String name) {
  synchronized (cores) {
    SolrCore core = cores.get(name);
    if (core != null) {
      core.open();  // increment the ref count while still synchronized
    } else {
      core = create(name);
      cores.put(name, core);
    }
    return core;
  }
}
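
Callers would then balance the reference count themselves, along these lines 
(a sketch, assuming the container variable holds the CoreContainer):

SolrCore core = container.getCore("user123");  // opened, or created on demand
try {
  // ... run requests against the core ...
} finally {
  core.close();  // release the reference taken by getCore()/open()
}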


My apologies if this is already documented...

matt


performance issue

2010-01-21 Thread Matthieu Labour
Hi

I have been requested to look at a solr instance that has been patched with
our own home grown patch to be able to handle 1000 cores on a solr instance

The solr instance doesn't perform well. Within 12 hours, I can see the
garbage collection taking a lot of time and query & update requests are
timing out (see below)

[Full GC [PSYoungGen: 673152K->98800K(933888K)] [PSOldGen:
2389375K->2389375K(2389376K)] 3062527K->2488176K(3323264K) [PSPermGen:
23681K->23681K(23744K)], 4.0807080 secs] [Times: user=4.08 sys=0.00,
real=4.08 secs]

org.apache.solr.client.solrj.SolrServerException: java.net.SocketTimeoutException: Read timed out
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)


I used yourkit to track down eventual memory leaks but didn't succeed in
finding one

The biggest objects using up the memory seem to be org.apache.lucene.Term
and org.apache.lucene.TermInfo

The total size of the data directory in index is 46G with a typical big core
being 10 documents and size of 103M

There are lots of search requests and indexing happening

I am posting to the mailing list hoping to hear that we must be doing
something completely wrong because it doesn't seem to me that we are pushing
the limit. I would appreciate any tips as where to look at etc... to
troubleshoot and solve the issue

Thank you for your help !

matt


Fwd: performance issue

2010-01-21 Thread Matthieu Labour
Hi

I have been requested to look at a solr instance that has been patched with
our own home grown patch to be able to handle 1000 cores on a solr instance

The solr instance doesn't perform well. Within 12 hours, I can see the
garbage collection taking a lot of time and query & update requests are
timing out (see below)

[Full GC [PSYoungGen: 673152K->98800K(933888K)] [PSOldGen:
2389375K->2389375K(2389376K)] 3062527K->2488176K(3323264K) [PSPermGen:
23681K->23681K(23744K)], 4.0807080 secs] [Times: user=4.08 sys=0.00,
real=4.08 secs]

org.apache.solr.client.solrj.SolrServerException: java.net.SocketTimeoutException: Read timed out
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)


I used yourkit to track down eventual memory leaks but didn't succeed in
finding one

The biggest objects using up the memory seem to be org.apache.lucene.Term
and org.apache.lucene.TermInfo

The total size of the data directory in index is 46G with a typical big core
being 10 documents and size of 103M

There are lots of search requests and indexing happening

I am posting to the mailing list hoping to hear that we must be doing
something completely wrong because it doesn't seem to me that we are pushing
the limit. I would appreciate any tips as where to look at etc... to
troubleshoot and solve the issue

Thank you for your help !

matt


solr perf

2009-12-20 Thread Matthieu Labour
Hi
I have a solr instance in which I created 700 cores, 1 core per user of my
application.
The total size of the data indexed on disk is 35GB, with solr cores going
from 100KB and a few documents to 1.2GB and 50,000 documents.
Searching seems very slow and indexing as well.
This is running on an EC2 xtra large instance (6CPU, 15GB Memory, Raid0 disk).
I would appreciate it if anybody has some tips, articles etc... as to what to do
to understand and improve performance.
Thank you


Re: solr core size on disk

2009-12-17 Thread Matthieu Labour
Paul
Thank you for your reply
I did du -sh in /solr_env/index/data
and it shows
36G
It is distributed among 700 cores with most of them being 150M
Is that a big index that should be sharded?



2009/12/17 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 look at the index dir and see the size of the files. It is typically
 in $SOLR_HOME/data/index

 On Thu, Dec 17, 2009 at 2:56 AM, Matthieu Labour matth...@kikin.com
 wrote:
  Hi
  I am new to solr. Here is my question:
  How to find out the size of a solr core on disk ?
  Thank you
  matt
 



 --
 -
 Noble Paul | Systems Architect| AOL | http://aol.com



solr core size on disk

2009-12-16 Thread Matthieu Labour
Hi
I am new to solr. Here is my question:
How to find out the size of a solr core on disk ?
Thank you
matt