Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Kevin Holmes
I inherited an existing (working) solr indexing script that runs like
this:

Python script queries the mysql DB then calls bash script

Bash script performs a curl POST submit to solr

We're injecting about 1000 records / minute (constantly), frequently
pushing the edge of our CPU / RAM limitations.

I'm in the process of building a Perl script to use DBI and
lwp::simple::post that will perform this all from a single script
(instead of 3).

Two specific questions

1: Does anyone have a clever (or better) way to perform this process
efficiently?

2: Is there a way to inject into solr without using POST / curl / http?

Admittedly, I'm no solr expert - I'm starting from someone else's setup,
trying to reverse-engineer my way out.  Any input would be greatly
appreciated.



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Clay Webster
Condensing the loader into a single executable sounds right if
you have performance problems. ;-)

You could also try adding multiple docs in a single post if you
notice your problems are with tcp setup time, though if you're
doing localhost connections that should be minimal.
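
As a rough illustration of "multiple docs in a single post", the sketch below builds one Solr XML `<add>` message holding several documents, so a whole batch goes out in one request. The function name `build_add_xml` and the field names `id` and `title` are hypothetical; only the `<add>/<doc>/<field>` layout is taken from Solr's XML update format.

```python
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(docs):
    """Return one Solr <add> message containing every document in docs."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            # Escape names and values so '&', '<' and quotes cannot break the XML.
            parts.append(
                "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


# Hypothetical records; in practice these would come from the database query.
batch = [{"id": 1, "title": "first"}, {"id": 2, "title": "a & b"}]
payload = build_add_xml(batch)
```

The resulting `payload` string can then be sent as the body of a single POST instead of one POST per record.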

If you're already local to the solr server, you might check out the
CSV slurper. http://wiki.apache.org/solr/UpdateCSV  It's a little
specialized.

And then, of course, there's the question of whether you're doing
full re-indexing or incremental indexing of changes.

--cw






RE: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread David Whalen
What we're looking for is a way to inject *without* using
curl, or wget, or any other http-based communication.  We'd
like for the HTTP daemon to only handle search requests, not
indexing requests on top of them.

Plus, I have to believe there's a faster way to get documents
into solr/lucene than using curl

_
david whalen
senior applications developer
eNR Services, Inc.
[EMAIL PROTECTED]
203-849-7240
  



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Tobin Cataldo
(re)building the index separately (ie. on a different computer) and then 
replacing the active index may be an option.


David Whalen wrote:

What we're looking for is a way to inject *without* using
curl, or wget, or any other http-based communication.  We'd
like for the HTTP daemon to only handle search requests, not
indexing requests on top of them.

Plus, I have to believe there's a faster way to get documents
into solr/lucene than using curl


Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Brian Whitman


On Aug 9, 2007, at 11:12 AM, Kevin Holmes wrote:




2: Is there a way to inject into solr without using POST / curl /  
http?




Check http://wiki.apache.org/solr/EmbeddedSolr

There are examples in Java and Cocoa that use the DirectSolrConnection
class to query and update solr w/o a web server. The Cocoa case uses
JNI.

-b



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Clay Webster
If the problem is contention between search and indexing, separate them
via a query-slave and an index-master.

--cw

On 8/9/07, David Whalen [EMAIL PROTECTED] wrote:

 What we're looking for is a way to inject *without* using
 curl, or wget, or any other http-based communication.  We'd
 like for the HTTP daemon to only handle search requests, not
 indexing requests on top of them.

 Plus, I have to believe there's a faster way to get documents
 into solr/lucene than using curl




Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Yonik Seeley
On 8/9/07, David Whalen [EMAIL PROTECTED] wrote:
 Plus, I have to believe there's a faster way to get documents
 into solr/lucene than using curl

One issue with HTTP is latency.  You can get around that by adding
multiple documents per request, or by using multiple threads
concurrently.
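
A minimal sketch of combining both ideas, batching plus a few concurrent senders, might look like the following. Everything here is an assumption for illustration: `send_batch` stands in for whatever function actually posts a batch to Solr, and the batch size and worker count are arbitrary starting points, not tuned values.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_concurrently(docs, send_batch, batch_size=100, workers=4):
    """Send batches through send_batch on a small thread pool.

    Returns the number of batches that were sent.
    """
    batches = list(chunked(docs, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every batch and re-raises any worker exception.
        list(pool.map(send_batch, batches))
    return len(batches)
```

With several requests in flight, the latency of one request overlaps with the others, which is the point of the suggestion above.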

You can also bypass HTTP by using something like the CSV loader (very
lightweight) and specifying a local file (via the stream.file parameter).
http://wiki.apache.org/solr/UpdateCSV
I doubt you will see much of a difference between reading locally vs
streaming over HTTP, but it might be interesting to see the exact
overhead.
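
To make the stream.file idea concrete, here is a small sketch that serializes rows as CSV and builds the request URL that points the CSV loader at a local file. The base URL, the `/update/csv` path, and the file location are assumptions for illustration; check the UpdateCSV wiki page for the handler path and any settings (such as enabling remote streaming) that your Solr version requires.

```python
import csv
import io
from urllib.parse import urlencode


def rows_to_csv(fieldnames, rows):
    """Serialize rows as CSV text with a header line of field names."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(fieldnames)
    writer.writerows(rows)
    return buf.getvalue()


def csv_update_url(base_url, csv_path):
    """Build a URL asking the CSV loader to read csv_path from the server's disk."""
    return base_url + "/update/csv?" + urlencode({"stream.file": csv_path})


# Hypothetical values: adjust host, port, and path to the actual setup.
csv_text = rows_to_csv(["id", "title"], [[1, "a, b"], [2, "plain"]])
url = csv_update_url("http://localhost:8983/solr", "/tmp/docs.csv")
```

The request that triggers the load is still a single small HTTP call, but the document data itself is read from the local file rather than sent in the request body.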

-Yonik


Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Yonik Seeley
On 8/9/07, Siegfried Goeschl [EMAIL PROTECTED] wrote:
 +) my colleague just finished a database import service running within
 the servlet container to avoid writing out the data to the file system
 and transmitting it over HTTP.

Most people doing this read data out of the database and construct the
XML in-memory for sending to Solr... one definitely doesn't want to
write intermediate stuff to the filesystem (unless perhaps it's a CSV
dump).
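
A sketch of the "read from the database, build in memory" pattern, with no intermediate files, could look like this. It uses an in-memory sqlite3 database purely as a stand-in for the MySQL source mentioned in the thread; the table and column names are hypothetical.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of dict rows straight from the cursor, batch_size at a time."""
    columns = [col[0] for col in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield [dict(zip(columns, row)) for row in rows]


# Stand-in database; a real loader would connect to the production DB instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO records (title) VALUES (?)",
    [("doc %d" % i,) for i in range(5)],
)
cur = conn.execute("SELECT id, title FROM records ORDER BY id")
batches = list(iter_batches(cur, 2))
```

Each batch is a list of plain dicts that can be turned into XML or CSV in memory and sent on, so nothing touches the filesystem between the database and Solr.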

 +) I think there was some discussion regarding a generic database
 importer, but nothing I'm aware of

Absolutely a needed feature... it's in the queue:
https://issues.apache.org/jira/browse/SOLR-103

But there will always be more complex cases, pulling from multiple
data sources, doing some merging and munging, etc.  The easiest way to
handle many of those would probably be via a scripting language that
does the app-specific merging+munging and then uses a Solr client
(which constructs in-memory CSV or XML and sends to Solr).

-Yonik


RE: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Kevin Holmes
Is this a native feature, or do we need to get creative with scp from
one server to the other?


If it's a contention between search and indexing, separate  them
via a query-slave and an index-master.

--cw


Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Yonik Seeley
On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote:
 Python script queries the mysql DB then calls bash script

 Bash script performs a curl POST submit to solr

For the most up-to-date solr client for python, check out
https://issues.apache.org/jira/browse/SOLR-216

-Yonik


RE: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Lance Norskog
Jython is a Python interpreter implemented in Java. (I have a lot of Python
code.)

Total throughput in the servlet is very sensitive to the total number of
servlet sockets available vs. the number of CPUs.

The different analysers have very different performance.

You might leave some data in the DB, instead of storing it all in the index.

Underlying this all, you have a sneaky network performance problem. Your
successive posts do not reuse a TCP socket. Obvious: re-opening a new socket
each post takes time. Not obvious: your server has sockets building up in
TIME_WAIT state.  (This means the sockets are shutting down. Having both
ends agree to close the connection is metaphysically difficult. The TCP/IP
spec even has a bug in this area.) Sockets building up can cause TCP resources
to run low or even run out. Your kernel configuration may be weak in this
area.
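
One way to avoid opening a new socket per post is to keep a single HTTP connection open and send every request over it. The sketch below does that with Python's standard http.client; the host, port, and `/solr/update` path in the usage note are assumptions, and reuse only works if the server keeps the connection alive.

```python
import http.client


def open_connection(host="localhost", port=8983):
    """Open one connection to be shared by all posts (host and port are assumptions)."""
    return http.client.HTTPConnection(host, port, timeout=30)


def post_all(conn, path, bodies, content_type="text/xml; charset=utf-8"):
    """POST each body over the same connection; return the list of status codes."""
    statuses = []
    for body in bodies:
        conn.request(
            "POST",
            path,
            body=body.encode("utf-8"),
            headers={"Content-Type": content_type},
        )
        response = conn.getresponse()
        response.read()  # drain the response so the socket can be reused
        statuses.append(response.status)
    return statuses
```

A loader would call `conn = open_connection()` once, pass it to `post_all(conn, "/solr/update", payloads)`, and close it at the end, rather than spawning curl for every record.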

Lance




Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Norberto Meijome
On Thu, 9 Aug 2007 15:23:03 -0700
Lance Norskog [EMAIL PROTECTED] wrote:

 Underlying this all, you have a sneaky network performance problem. Your
 successive posts do not reuse a TCP socket. Obvious: re-opening a new socket
 each post takes time. Not obvious: your server has sockets building up in
 TIME_WAIT state.  (This means the sockets are shutting down. Having both
 ends agree to close the connection is metaphysically difficult. The TCP/IP
 spec even has a bug in this area.) Sockets building up can use TCP resources
 to run low or may run out. Your kernel configuration may be weak in this
 area.

Good point. And, putting my pedantic hat on here, it may not necessarily be
'kernel configuration' but the network stack - not sure what OS the OP is using.
B
_
{Beto|Norberto|Numard} Meijome

All parts should go together without forcing. You must remember that the parts 
you are reassembling were disassembled by you.
 Therefore, if you can't get them together again, there must be a reason. 
 By all means, do not use hammer.
   IBM maintenance manual, 1975

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.