Solr performance with commitWithin seems too good to be true. I am afraid I am missing something

2014-02-12 Thread Pisarev, Vitaliy
I am running a very simple performance experiment where I post 2000 documents 
to my application, which in turn persists them to a relational DB and sends them 
to Solr for indexing (synchronously, in the same request).
I am testing 3 use cases:

  1.  No indexing at all - ~45 seconds to post 2000 documents
  2.  Indexing included, commit after each add - ~8 minutes (!) to post and 
index 2000 documents
  3.  Indexing included, commitWithin 1 ms - ~55 seconds (!) to post and index 
2000 documents
The 3rd result does not make any sense; I would expect the behavior to be 
similar to that in point 2. At first I thought that the documents were not 
really committed, but I could actually see them being added by executing some 
queries during the experiment (via the Solr web UI).
I am worried that I am missing something very big. The code I use for point 2:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc);
solrConnection.commit();
Whereas the code for point 3:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc, 1); // According to API documentation I understand 
there is no need to explicitly call commit with this API
Is it possible that committing after each add will degrade performance by a 
factor of 40?
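
For reference, here is a minimal, self-contained sketch of the two indexing 
variants above (this is an assumption of how the experiment loop might look, 
not the original code; SolrJ 4.x API, placeholder URL and fields):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitExperiment {
    public static void main(String[] args) throws Exception {
        // Placeholder Solr URL - adjust for the actual deployment.
        SolrServer solrConnection = new HttpSolrServer("http://localhost:8983/solr");

        for (int i = 0; i < 2000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i)); // hypothetical field

            // Variant 2: hard commit after every add (the ~8 minute case).
            // solrConnection.add(doc);
            // solrConnection.commit();

            // Variant 3: ask Solr to commit within 1 ms of the add (the ~55 second case).
            solrConnection.add(doc, 1);
        }
        solrConnection.shutdown();
    }
}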



RE: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something

2014-02-12 Thread Pisarev, Vitaliy
I absolutely agree and I even read the NRT page before posting this question.

The thing that baffles me is this:

Doing a commit after each add kills performance.
On the other hand, when I use commitWithin and specify an (absurdly low) 1 ms 
delay, I expect that behavior to be equivalent to making a commit, from a 
functional perspective.

Seeing that there is no magic in the world, I am trying to understand what price 
I am actually paying when using the commitWithin feature: on the one hand it 
commits almost immediately, and on the other hand it performs wonderfully. 
Where is the catch?


-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Wednesday, 12 February 2014 17:00
To: solr-user
Subject: Re: Solr performance with commitWithin seems too good to be true. I am 
afraid I am missing something

Doing a standard commit after every document is a Solr anti-pattern.

commitWithin is a “near-realtime” commit in recent versions of Solr and not a 
standard commit.

https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching
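
Roughly speaking, that page describes the split between hard commits (segments 
written durably to disk, expensive searcher reopen) and soft commits (cheap 
visibility), and in these versions commitWithin is executed as a soft commit. A 
sketch of the corresponding solrconfig.xml settings (values are illustrative 
only, not recommendations):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: writes index segments durably to disk;
       with openSearcher=false it does not pay the searcher-warming cost. -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: makes new documents visible cheaply, without an fsync. -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>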

- Mark

http://about.me/markrmiller




RE: Importing database DIH

2014-02-12 Thread Pisarev, Vitaliy
It can be anything from wrong credentials to a missing driver on the classpath 
to a malformed connection string, etc.

What does the Solr log say? 

-Original Message-
From: Maheedhar Kolla [mailto:maheedhar.ko...@gmail.com] 
Sent: Wednesday, 12 February 2014 17:23
To: solr-user@lucene.apache.org
Subject: Importing database DIH

Hi ,


I need help with importing data through DIH (using solr-3.6.1, tomcat6).

I see the following error when I try to do a full-import from my local MySQL 
table (http:/s/solr//dataimport?command=full-import).

<snip>
..
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">0</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Skipped">0</str>
<str name="">Indexing failed. Rolled back all changes.</str>
</snip>

I did search for ways to solve this problem and did create the file 
dataimport.properties, but with no success.

Any help would be appreciated.


cheers,
Kolla

PS: When I check the admin panel statistics for the /dataimport handler, I 
see the following:

Status : IDLE
Documents Processed : 0
Requests made to DataSource : 0
Rows Fetched : 0
Documents Deleted : 0
Documents Skipped : 0
Total Documents Processed : 0
Total Requests made to DataSource : 0
Total Rows Fetched : 0
Total Documents Deleted : 0
Total Documents Skipped : 0
handlerStart : 1391612468278
requests : 5
errors : 0
timeouts : 0
totalTime : 28
avgTimePerRequest : 5.6


Also, here is my data-config file.
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/DBNAME"
              user="USER" password="PWD"/>
  <document>
    <entity name="docs" query="select * from TABLENAME">
      <field column="id" name="id"/>
      <field column="content" name="text"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>
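
For completeness, the /dataimport handler also has to be registered in 
solrconfig.xml and pointed at this file - a sketch, assuming the file above is 
saved as conf/data-config.xml (the file name is an assumption):

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Path is relative to the core's conf directory (assumed name). -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

The DIH jar (apache-solr-dataimporthandler-3.6.1.jar) and the MySQL JDBC driver 
jar also need to be on Solr's classpath, e.g. in the core's lib directory - a 
missing driver is one of the causes mentioned in the reply above.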





--
Cheers,
Kolla


Deciding how to correctly use Solr multicore

2014-02-09 Thread Pisarev, Vitaliy
Hello!

We are evaluating Solr usage in our organization and have come to the point 
where we are past the functional tests and are now looking at choosing the best 
deployment topology.
Here are some details about the structure of the problem: the application deals 
with storing and retrieving artifacts of various types. The artifacts are stored 
in Projects. Each project can have hundreds of thousands of artifacts (total 
across all types) and our largest customers have hundreds of projects (~300-800), 
though the vast majority have tens of projects (~30-100).

Core granularity
In terms of core granularity, it seems to me that a core per project is 
sensible, as pushing everything into a single core will probably be too much. The 
entities themselves will have a special type field for distinction.
Moreover, not all of the projects may be active at a given time, so this allows 
their indexes to remain latent on disk.
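
As an illustration of that layout, a per-project core could be created when the 
feature is enabled for a project; a rough sketch using SolrJ's CoreAdmin API 
(core name, instance directory, and URL are all made up, and the instance 
directory is assumed to already contain a conf/ set):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class ProjectCores {
    public static void createProjectCore(String projectId) throws Exception {
        // CoreAdmin requests go to the Solr root URL, not to a specific core.
        SolrServer adminServer = new HttpSolrServer("http://localhost:8983/solr");

        // One core per project; the instanceDir must already exist on the node.
        CoreAdminRequest.createCore("project_" + projectId,
                                    "project_" + projectId,
                                    adminServer);
    }
}

For the "remain latent on disk" part, newer Solr versions can also mark cores as 
transient / loadOnStartup=false in solr.xml, so that rarely used cores stay 
unloaded until they are queried.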


Availability and synchronization
Our application is deployed on-premise at our customers' sites - we cannot go too 
crazy as to the amount of extra resources we demand from them, e.g. dedicated 
indexing servers. We pretty much need to make do with what is already there.

For now, we are planning to use the DIH to maintain the index. Each node in the 
application cluster will have its own local index. When a project is created (or 
the feature is enabled on an existing project), a core is created for it on 
each one of the nodes, a full import is executed, and then a delta import is 
scheduled to run on each one of the nodes. This gives us simplicity, but I am 
wondering about the performance and memory consumption costs. Also, I am 
wondering whether we should use replication for this purpose instead. The 
requirement is for the index to be updated once every 30 seconds - are delta 
imports designed for this?
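
For the delta part, DIH typically keys off a modification timestamp; a sketch of 
what such an entity might look like (table and column names are invented for 
illustration):

<entity name="docs"
        query="select * from TABLENAME"
        deltaQuery="select id from TABLENAME
                    where last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from TABLENAME where id = '${dataimporter.delta.id}'">
  <field column="id" name="id"/>
</entity>

DIH itself does not schedule imports, so the 30-second cadence would have to come 
from an external scheduler (cron, Quartz, etc.) hitting 
/dataimport?command=delta-import; how well that holds up depends mostly on how 
selective the deltaQuery is.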

I understand that this is a very complex problem in general. I have tried to 
highlight the most significant aspects and would appreciate some initial 
guidance. Note that we are planning to execute performance and stress testing 
no matter what, but the assumption is that the topology of the solution can be 
predetermined from the existing data.