Re: Interesting stuff; Solr as a syslog store.

2010-02-13 Thread Kay Kay

Thanks Antonio for sharing this.

I believe this could be one of the interesting case studies for Solr In 
Action. If you are interested in sharing a bit more, I am sure the 
authors would be very interested for upcoming revisions.


--
 K K.


On 02/12/2010 06:02 PM, Antonio Lobato wrote:
Hey everyone, I don't actually have a question, but I just thought I'd 
share something really cool that I did with Solr for our company.


We run a good number of servers, well into the several hundreds, and 
naturally we need a way to centralize all of the system logs.  For a 
while we used a commercial solution to centralize and search our logs, 
but they wanted to charge us tens of thousands of dollars for just one 
more gigabyte/day of indexed data.  So I said forget it, I'll write my 
own solution!


We already use Solr for some of our other backend search systems, 
so I came up with the idea of indexing all of our logs in Solr.  I wrote a 
daemon in Perl that listens on the syslog port, and pointed every 
single system's syslog to forward to this one server.  The daemon 
parses each message into fields such as date/time, host, program, pid, 
and text, and writes it to a Solr indexing server.  I then wrote a cool 
JavaScript/AJAX web front end for Solr searching, and bam.  Real-time 
searching of all of our syslogs from a web interface, for no cost!
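
For reference, a minimal SolrJ sketch of the indexing half described above. 
This is not Antonio's actual code: the field names and the simplified syslog 
pattern are assumptions, and a real daemon would batch adds and commit 
periodically.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical indexer for one already-received syslog line; the field names
// (timestamp, host, program, pid, text) mirror the fields described above and
// must match whatever the Solr schema defines.
public class SyslogIndexer {

  // "<PRI>MMM dd HH:mm:ss host program[pid]: text" - a simplified pattern;
  // real syslog traffic needs more robust parsing.
  private static final Pattern LINE = Pattern.compile(
      "^<\\d+>(\\w{3} +\\d+ [\\d:]+) (\\S+) ([^\\[:]+)(?:\\[(\\d+)\\])?: (.*)$");

  private final SolrServer solr;

  public SyslogIndexer(String solrUrl) throws Exception {
    this.solr = new CommonsHttpSolrServer(solrUrl);
  }

  public void index(String rawLine) throws Exception {
    Matcher m = LINE.matcher(rawLine);
    if (!m.matches()) return;              // skip lines we cannot parse
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("timestamp", m.group(1));
    doc.addField("host", m.group(2));
    doc.addField("program", m.group(3));
    if (m.group(4) != null) doc.addField("pid", m.group(4));
    doc.addField("text", m.group(5));
    solr.add(doc);                         // commit separately, in batches
  }
}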


Just thought this would be a neat story to share with you all.  I've 
really grown to love Solr, it's something else!


Thanks,
-Antonio




Re: Searching .msg files

2009-12-14 Thread Kay Kay
I remember seeing a similar thread on the Lucene user mailing list; you 
can check its archives.


Regarding strategies - there are two of them:

* you can create an index per user, store the email content involving 
that user in it, and use it for search.

(or)

* you can have one large index, with the To/Cc/Bcc addresses as fields, 
and have every search by a given user go through an initial 
filter pass on this index.


Solr can, of course, index a variety of content (see the Tika project) and 
is not restricted to XML at all.


You would need to weigh the pros / cons of each of them depending on 
the corpus of data you are talking about and the usage / performance 
expectations of the search.
Once you identify the appropriate strategy, you can define the Solr 
schema for those fields and use it.
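
As a rough illustration of the second (single-index) strategy with SolrJ: the 
field names to/cc/bcc, the URL, and the example user are assumptions, not an 
existing schema; the filter query is the "initial filter pass" mentioned above.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MailSearch {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    String user = "user@example.com";      // the searching user (placeholder)
    SolrQuery q = new SolrQuery("subject:report OR body:report");
    // Restrict every query to mails the user was addressed on.
    q.addFilterQuery("to:\"" + user + "\" OR cc:\"" + user + "\" OR bcc:\"" + user + "\"");

    QueryResponse rsp = solr.query(q);
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}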





Abhishek Srivastava wrote:

Hello Everyone,

In my company, we store a lot of old emails (.msg files) in a database (done
for the purpose of legal compliance).

The users have been asking us to give search functionality on the old
emails.

One of the primary requirements is that when people search, they should only
be able to search their own emails (emails in which they were on the to,
cc or bcc list).

How can solr be used?

From what I know about this product, it only searches XML content...
so I will have to extract the body of the email and convert it to XML, right?

How will I limit the search results to only those emails where the user who
is searching was in the to, cc or bcc list?

Please recommend an approach for providing a solution to our
requirement.

  




Re: latency in solr response is observed after index is updated

2009-12-01 Thread Kay Kay
What is the average doc size?  What autoCommit frequency is 
set in solrconfig.xml?


Another place to look is the field cache size and the nature of the 
warmup queries run after a new searcher is created (which happens on 
every commit).
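
One client-side way to see the effect of warming, as a sketch only: fire a few 
representative queries at a slave right after snapinstaller runs, before it 
takes live traffic. This is not a substitute for configuring warmup queries in 
solrconfig.xml; the URL and queries below are made-up placeholders.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SlaveWarmer {
  public static void main(String[] args) throws Exception {
    SolrServer slave = new CommonsHttpSolrServer("http://slave-host:8983/solr");
    // Representative queries touching the fields/sorts/facets used in
    // production, so the caches get populated before the slave goes back
    // behind the load balancer.
    String[] warmup = { "*:*", "category:books", "title:solr" };
    for (String q : warmup) {
      slave.query(new SolrQuery(q).setRows(10));
    }
  }
}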




Bharath Venkatesh wrote:

Hi Kalidoss,
  
   I am not aware of using solr-config for committing the document, 
but I have described below how we update and commit documents:
 
curl http://solr_url/update --data-binary @feeds.xml -H 
'Content-type:text/xml; charset=utf-8'
curl http://solr_url/update --data-binary '<commit/>' -H 
'Content-type:text/xml; charset=utf-8'


where feeds.xml contains the document in XML format

we have master and slave replication for the Solr servers.

updates happen on the master; snappuller and snapinstaller are run on 
the slaves periodically

queries don't go to the master, only to the slaves

is there anything that can be said from the above information?

Thanks,
Bharath



-Original Message-
From: kalidoss [mailto:kalidoss.muthuramalin...@sifycorp.com]
Sent: Tue 12/1/2009 2:38 PM
To: solr-user@lucene.apache.org
Subject: Re: latency in solr response  is observed  after index is updated
 
Are you using solr-config for committing the document?


bharath venkatesh wrote:
  

Hi,

We are observing latency (sometimes huge latency, up to 10-20 secs) 
in Solr responses after the index is updated. What is the reason for this 
latency and how can it be minimized?

Note: our index size is pretty large.

Any help would be appreciated, as we are largely affected by it.

Thanks in advance.
Bharath












Multiple DisMax Queries spanning across multiple fields

2009-09-23 Thread Kay Kay
For a particular requirement we have, we need to run a query that is a 
combination of multiple dismax queries behind the scenes (using a Solr 
1.4 nightly).


The DisMaxQParser, org.apache.solr.search.DisMaxQParser (details at 
http://wiki.apache.org/solr/DisMaxRequestHandler), takes in the qf 
parameter, applies the parser to q, and computes relevance based on 
that.


We need the final query to be a combination of { (q = keywords, 
qf = map of field weights), (q1, qf1), (q2, qf2), etc. }, with the 
individual queries combined by a boolean AND.


Creating a custom QParser works right away, as below.



public class MultiTermDisMaxQParser extends DisMaxQParser
{
  ..
  ..
  ..


 @Override
 public Query parse() throws ParseException
 {
   BooleanQuery finalQuery = new BooleanQuery(true);

   Query superQuery = super.parse(); // Handles {  (q, qf) combination  }.
...
   ...
   // finalQuery adds superQuery with a weight.

   return finalQuery;
 }

}


Curious to see whether there is an alternate way to implement this, or 
any other suggestions for the problem itself.
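
One alternate approach, assuming Solr 1.4's nested-query support (the _query_ 
pseudo-field with {!dismax} local params), is to build the combined query on 
the client and let the standard lucene parser AND the dismax subqueries 
together. The field names, boosts, and keywords below are made up for 
illustration.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class MultiDisMaxClient {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Each clause is its own dismax query with its own qf weights; the
    // surrounding lucene parser combines them with a boolean AND.
    String q = "_query_:\"{!dismax qf='title^2 body'}solr replication\""
             + " AND _query_:\"{!dismax qf='tags^3'}search\"";

    SolrQuery query = new SolrQuery(q);
    System.out.println(solr.query(query).getResults().getNumFound());
  }
}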







Re: Flipping data dirs for an (/multiple) SolrCore without affecting search / IndexReaders

2009-01-08 Thread Kay Kay

Chris Hostetter wrote:

: We have an architecture where we want to flip the solr data.dir (massive
: dataset) while running and serving search requests with minimal downtime.
...

: 1) What is the fastest / best possible way to get step 1 done ,through a
: pluggable architecture.
: 
: Currently - I have written a request handler as follows, that takes care of

: creating the core. What is the best way to change dataDir (got as input from
: SolrQueryRequest) before creating SolrCore-s.

you shouldn't need any custom plugin code to achieve this ... what you 
describe sounds like exactly what the SWAP command on the CoreAdmin 
handler was designed for -- CREATE your new core (using the new data dir), 
warm it however you want (either via the solrconfig.xml or by explicitly 
hitting it with queries), and once it's warmed up send the SWAP command to 
replace the existing core with the name you want to use.
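
(For reference, a rough sketch of those two CoreAdmin calls over plain HTTP. 
The host, port, core names, and directories are made-up placeholders, and the 
CoreAdmin handler path is assumed to be the usual /admin/cores from solr.xml.)

import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;

public class CoreSwap {

  // Issue a CoreAdmin command and drain the response.
  static void coreAdmin(String query) throws Exception {
    URL url = new URL("http://localhost:8983/solr/admin/cores?" + query);
    InputStream in = url.openStream();
    while (in.read() != -1) { /* ignore body; a non-2xx status throws */ }
    in.close();
  }

  public static void main(String[] args) throws Exception {
    String dataDir = URLEncoder.encode("/indexes/new-dataset", "UTF-8");
    // 1. create a core over the new data directory
    coreAdmin("action=CREATE&name=live-new&instanceDir=live&dataDir=" + dataDir);
    // 2. ...warm "live-new" here with representative queries...
    // 3. swap it with the core the clients are pointed at
    coreAdmin("action=SWAP&core=live&other=live-new");
  }
}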


: 2) When a close() happens on an existing SolrCore - what happens when there is
: a long running IndexReader query on that SolrCore . Is that terminated
: abruptly / would the close wait until the IndexReaders completes the Query.

any existing uses of the Core will continue to finish (i'm not sure of the 
timeline of your question, but i'm guessing this was before the recent jira 
issue about the close() method and ref counts where this was better 
explained, correct?)




-Hoss


  
Thanks Hoss for the explanation regarding changing the data directory. 
Yes - this was before the jira issue discussions about the close() method. 


Approximate release date for 1.4

2008-12-18 Thread Kay Kay
Just curious whether we have an approximate target release date for 1.4, 
or a list of milestones / feature sets for it.


Re: Nightly build - 2008-12-17.tgz - build error - java.lang.NoClassDefFoundError: org/mozilla/javascript/tools/shell/Main

2008-12-17 Thread Kay Kay

Thanks Toby.

Alternatively:
Under contrib/javascript/build.xml, in the dist target, I removed the 
dependency on 'docs' to circumvent the problem.


But maybe it would be great to have js.jar from the Rhino library 
distributed (if there are no license conflicts) to circumvent this.



Toby Cole wrote:
I came across this earlier too; I just deleted the contrib/javascript 
directory.
Of course, if you need the javascript library then you'll have to get it 
building.


Sorry, probably not that helpful. :)
Toby.

On 17 Dec 2008, at 17:03, Kay Kay wrote:


I downloaded the latest .tgz and ran

$ ant dist


docs:

  [mkdir] Created dir: 
/opt/src/apache-solr-nightly/contrib/javascript/dist/doc
   [java] Exception in thread "main" java.lang.NoClassDefFoundError: 
org/mozilla/javascript/tools/shell/Main

   [java] at JsRun.main(Unknown Source)
   [java] Caused by: java.lang.ClassNotFoundException: 
org.mozilla.javascript.tools.shell.Main

   [java] at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
   [java] at java.security.AccessController.doPrivileged(Native 
Method)
   [java] at 
java.net.URLClassLoader.findClass(URLClassLoader.java:188)

   [java] at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
   [java] at 
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

   [java] at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
   [java] at 
java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)

   [java] ... 1 more

BUILD FAILED
/opt/src/apache-solr-nightly/common-build.xml:335: The following 
error occurred while executing this line:
/opt/src/apache-solr-nightly/common-build.xml:212: The following 
error occurred while executing this line:
/opt/src/apache-solr-nightly/contrib/javascript/build.xml:74: Java 
returned: 1



and came across the above mentioned error.

The class seems to be from the Rhino (Mozilla JS) library. Is it 
supposed to be packaged by default, or is there a license restriction 
that prevents it from being so?




Toby Cole
Software Engineer

Semantico
Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE
T: +44 (0)1273 358 238
F: +44 (0)1273 723 232
E: toby.c...@semantico.com
W: www.semantico.com






Nightly build - 2008-12-17.tgz - build error - java.lang.NoClassDefFoundError: org/mozilla/javascript/tools/shell/Main

2008-12-17 Thread Kay Kay

I downloaded the latest .tgz and ran

$ ant dist


docs:

   [mkdir] Created dir: 
/opt/src/apache-solr-nightly/contrib/javascript/dist/doc
[java] Exception in thread "main" java.lang.NoClassDefFoundError: 
org/mozilla/javascript/tools/shell/Main

[java] at JsRun.main(Unknown Source)
[java] Caused by: java.lang.ClassNotFoundException: 
org.mozilla.javascript.tools.shell.Main

[java] at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
[java] at java.security.AccessController.doPrivileged(Native 
Method)
[java] at 
java.net.URLClassLoader.findClass(URLClassLoader.java:188)

[java] at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
[java] at 
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)

[java] at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
[java] at 
java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)

[java] ... 1 more

BUILD FAILED
/opt/src/apache-solr-nightly/common-build.xml:335: The following error 
occurred while executing this line:
/opt/src/apache-solr-nightly/common-build.xml:212: The following error 
occurred while executing this line:
/opt/src/apache-solr-nightly/contrib/javascript/build.xml:74: Java 
returned: 1



and came across the above mentioned error.

The class seems to be from the Rhino (Mozilla JS) library. Is it 
supposed to be packaged by default, or is there a license restriction that 
prevents it from being so?




Flipping data dirs for an (/multiple) SolrCore without affecting search / IndexReaders

2008-12-16 Thread Kay Kay
We have an architecture where we want to flip the solr data.dir (massive 
dataset) while running and serving search requests with minimal downtime.


Some additional requirements.

* While ideally we want the Solr search clients to continue to serve 
from the indices as soon as possible, the overriding requirement is that 
the downtime for the search Solr instances should be as little as 
possible.  So when a new set of (Lucene) indices comes in, the 
algorithm we are experimenting with is:


- create a new SolrCore instance with the revised data directory.
- warm up the SolrCore instance with some test queries.
- register the new SolrCore instance with the same name as the old one, 
so that all new queries from the clients go to the new SolrCore instance.
- as part of register(String, SolrCore, boolean), the third parameter, 
when set to false, closes the core connection.


  I am trying to understand more about the first and the fourth (last) 
steps.


1) What is the fastest / best possible way to get step 1 done, through a 
pluggable architecture?


Currently, I have written a request handler as follows that takes care 
of creating the core. What is the best way to change dataDir (obtained as 
input from the SolrQueryRequest) before creating the SolrCore?


public class CustomRequestHandler extends RequestHandlerBase implements 
SolrCoreAware

{
 private CoreDescriptor coreDescriptor;

 private String coreName;

 @Override
 public void handleRequestBody(SolrQueryRequest req,
   SolrQueryResponse rsp) throws Exception
 {
   CoreContainer container = this.coreDescriptor.getCoreContainer();
   // TODO: Parse XML to extract data
   // container.reload(this.coreName);

   // or
   // 2.
   // TODO: Set the new configuration for the data directory / before 
creating the new core.

   SolrCore newCore = container.create(this.coreDescriptor);
   container.register(this.coreName, newCore, false);
 }


 @Override
 public void inform(SolrCore core)
 {
   coreDescriptor = core.getCoreDescriptor();
   coreName = core.getName();
 }
}


2) When a close() happens on an existing SolrCore, what happens when 
there is a long-running IndexReader query on that SolrCore? Is it 
terminated abruptly, or would the close wait until the IndexReader 
completes the query?




* The same process is potentially repeated for multiple SolrCores as 
well, with additional closeHooks that might do some heavy I/O tasks 
(talking over the network, etc.).
Right now these long-running processes are done in an independent 
thread so that they do not block SolrCore.close() with the current 
nightly builds.






Solrj client - CommonsHttpSolrServer - getting solr.solr.home

2008-12-16 Thread Kay Kay

I am reading the wiki here at - http://wiki.apache.org/solr/Solrj .

Is there a request handler (maybe some admin handler) already 
present that can retrieve solr.solr.home for a given 
CommonsHttpSolrServer instance (i.e., for a given Solr endpoint) through 
an API?




Solrj - SolrQuery - specifying SolrCore - when the Solr Server has multiple cores

2008-12-15 Thread Kay Kay

Hi -
 I am looking at the article here with a brief introduction to SolrJ:
http://www.ibm.com/developerworks/library/j-solr-update/index.html?ca=dgr-jw17SolrS_Tact=105AGX59S_CMP=GRsitejw17#solrj


 In case we have multiple SolrCores in the server application (since 
1.3), how do I specify, as part of SolrQuery, which core should be used 
for a given query? I am trying to dig the information out of the code. 
Meanwhile, if someone is aware of this, please suggest some pointers.






Re: Solrj - SolrQuery - specifying SolrCore - when the Solr Server has multiple cores

2008-12-15 Thread Kay Kay

Thanks Yonik for the clarification.

Yonik Seeley wrote:

A solr core is like a separate solr server... so create a new
CommonsHttpSolrServer that points at the core.
You probably want to create and reuse a single HttpClient instance for
the best efficiency.

-Yonik
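
(A minimal sketch of Yonik's suggestion, with made-up host and core names; the 
single HttpClient is shared across the per-core server instances.)

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PerCoreClients {
  public static void main(String[] args) throws Exception {
    // One thread-safe HttpClient reused by every per-core server instance.
    HttpClient httpClient = new HttpClient(new MultiThreadedHttpConnectionManager());

    CommonsHttpSolrServer core0 =
        new CommonsHttpSolrServer("http://localhost:8983/solr/core0", httpClient);
    CommonsHttpSolrServer core1 =
        new CommonsHttpSolrServer("http://localhost:8983/solr/core1", httpClient);

    // The core is chosen by which server object you query,
    // not by a parameter on SolrQuery.
    System.out.println(core0.query(new SolrQuery("*:*")).getResults().getNumFound());
    System.out.println(core1.query(new SolrQuery("*:*")).getResults().getNumFound());
  }
}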

On Mon, Dec 15, 2008 at 11:06 AM, Kay Kay kaykay.uni...@gmail.com wrote:
  

Hi -
 I am looking at the  article here with a brief introduction to SolrJ .
http://www.ibm.com/developerworks/library/j-solr-update/index.html?ca=dgr-jw17SolrS_Tact=105AGX59S_CMP=GRsitejw17#solrj
.

 In case we have multiple SolrCores in the server application - (since 1.3)
- how do I specify as part of SolrQuery as to which core needs to be used
for the given query. I am trying to dig out the information from the code.
Meanwhile, if someone is aware of the same - please suggest some pointers.



  




Re: Stopping / Starting IndexReaders in Solr 1.3+

2008-12-13 Thread Kay Kay

Erik Hatcher wrote:
Maybe the PingRequestHandler can help? It can check for the existence 
of a file (see solrconfig.xml for healthcheck) and return an error if 
it is not there. This wouldn't prevent Solr from responding to 
requests, but if a client used that information to determine whether 
to make requests or not it'd do the trick.


Thanks Erik.
Checking for a configuration file is helpful for signalling whether clients 
should be making requests. But my intention is to stop serving clients 
altogether and then restart / warm up the IndexReaders from scratch.




Erik


On Dec 13, 2008, at 12:54 AM, Kay Kay wrote:

For a particular application of ours - we need to suspend the Solr 
server from doing any query operation ( IndexReader-s) for sometime, 
and then after sometime in the near future ( in minutes ) - 
reinitialize / warm IndexReaders once again and get moving.


It is a little bit different from optimize  since this server is 
only supposed to read the data and not add create segments . But we 
want to suspend it as an initial test case for one of our load 
balancers.
(Restarting Solr is an option though we want to get to that as a last 
resort ).







Re: How can i indexing MS-Outlook files?

2008-12-13 Thread Kay Kay
You can check out the format of the MS-Outlook files. If they happen to 
be plain text, maybe a little bit of parsing to remove the protocol 
headers would be needed.


Otherwise, you can check with the Thunderbird / OpenOffice teams to see how 
they parse the data when they import from MS-Outlook (if they support 
that, that is).


RaghavPrabhu wrote:

Hi Folks,

 I want to index MS-Outlook mails in my data directory. How can I perform 
this?
 Please help me and give a solution as soon as possible.



Thanks in advance
Prabhu.K
  




Solr - DataImportHandler - Large Dataset results ?

2008-12-12 Thread Kay Kay
As per the example in the wiki - http://wiki.apache.org/solr/DataImportHandler  
- I am seeing the following fragment. 

<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" 
user="sa" />
<document name="products">
    <entity name="item" query="select * from item">
        <field column="ID" name="id" />
        <field column="NAME" name="name" />
          ..
    </entity>
</document>
</dataSource>

My scaled-down application looks very similar to this, but my resultset is so 
big that it cannot possibly fit in main memory. 

So I was planning to split this single query into multiple subqueries, with 
another conditional based on the id (id > 0 and id < 100, say). 

I am curious whether there is any way to specify another conditional clause, 
(<splitData column="id" batch="1" />, where the column is supposed to be 
an integer value), and internally the implementation could actually generate 
the subqueries: 

i) get the min and max of the numeric column, and send queries to the database 
based on the batch size 

ii) add documents for each batch and close the resultset 

This might end up putting more load on the database (but at least the dataset 
would fit in main memory). 

Let me know if anyone else has run into similar issues and how they were 
addressed. 
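
(A rough JDBC sketch of steps (i) and (ii) above. The item table, id column, 
and connection URL come from the wiki fragment; the batch size and the 
addDocument placeholder are assumptions about however the documents actually 
get built and added.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RangeBatcher {
  public static void main(String[] args) throws Exception {
    Class.forName("org.hsqldb.jdbcDriver");
    Connection conn =
        DriverManager.getConnection("jdbc:hsqldb:/temp/example/ex", "sa", "");
    Statement stmt = conn.createStatement();

    // (i) find the bounds of the numeric split column
    ResultSet bounds = stmt.executeQuery("SELECT MIN(id), MAX(id) FROM item");
    bounds.next();
    long min = bounds.getLong(1), max = bounds.getLong(2);
    bounds.close();

    long batch = 10000;   // assumed batch size
    for (long lo = min; lo <= max; lo += batch) {
      long hi = lo + batch;
      // (ii) fetch one id range, build/add documents, then close the resultset
      ResultSet rs = stmt.executeQuery(
          "SELECT * FROM item WHERE id >= " + lo + " AND id < " + hi);
      while (rs.next()) {
        // addDocument(rs);  // placeholder: turn the row into a SolrInputDocument
      }
      rs.close();
    }
    stmt.close();
    conn.close();
  }
}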


  

Re: Solr - DataImportHandler - Large Dataset results ?

2008-12-12 Thread Kay Kay
I am using MySQL. I believe MySQL (since version 5) supports streaming. 

More about streaming: can we assume that when the database driver supports 
streaming, the resultset iterator is a forward-only iterator? 

If, say, the streaming size is 10K records and we are trying to retrieve a 
total of 100K records, what exactly happens when the threshold is reached 
(say, when the first 10K records have been retrieved)? 

Is the previous set of records thrown away and replaced in memory by the new 
batch of records? 



--- On Fri, 12/12/08, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
From: Shalin Shekhar Mangar shalinman...@gmail.com
Subject: Re: Solr - DataImportHandler - Large Dataset results ?
To: solr-user@lucene.apache.org
Date: Friday, December 12, 2008, 9:41 PM

DataImportHandler is designed to stream rows one by one to create Solr
documents. As long as your database driver supports streaming, you should be
fine. Which database are you using?

On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay kaykay.uni...@yahoo.com wrote:

 As per the example in the wiki -
 http://wiki.apache.org/solr/DataImportHandler  - I am seeing the following
 fragment.

 <dataSource driver="org.hsqldb.jdbcDriver"
 url="jdbc:hsqldb:/temp/example/ex" user="sa" />
    <document name="products">
        <entity name="item" query="select * from item">
            <field column="ID" name="id" />
            <field column="NAME" name="name" />
              ..
        </entity>
 </document>
 </dataSource>

 My scaled-down application looks very similar along these lines but where
 my resultset is so big that it cannot fit within main memory by any
chance.

 So I was planning to split this single query into multiple subqueries -
 with another conditional based on the id (id > 0 and id < 100, say).

 I am curious if there is any way to specify another conditional clause,
 (<splitData column="id" batch="1" />, where the column is supposed to
 be an integer value) - and internally, the implementation could actually
 generate the subqueries -

 i) get the min , max of the numeric column , and send queries to the
 database based on the batch size

 ii) Add Documents for each batch and close the resultset .

 This might end up putting more load on the database (but at least the
 dataset would fit in the main memory ).

 Let me know if anyone else had run into similar issues and how this was
 encountered.







-- 
Regards,
Shalin Shekhar Mangar.



  

Re: Solr - DataImportHandler - Large Dataset results ?

2008-12-12 Thread Kay Kay
Thanks Bryan. 

That clarifies a lot. 

But even with streaming, retrieving one document at a time and adding it to 
the IndexWriter seems to make the process serial. 

So maybe the DataImportHandler could be optimized to retrieve a batch of 
results from the query and add the Documents in separate threads from an 
Executor pool (and make this number configurable, or maybe retrieve it from 
the system as the number of physical cores to exploit maximum parallelism), 
since that seems like a bottleneck. 

Any comments on this? 



--- On Fri, 12/12/08, Bryan Talbot btal...@aeriagames.com wrote:
From: Bryan Talbot btal...@aeriagames.com
Subject: Re: Solr - DataImportHandler - Large Dataset results ?
To: solr-user@lucene.apache.org
Date: Friday, December 12, 2008, 5:26 PM

It only supports streaming if properly enabled which is completely lame:
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html

 By default, ResultSets are completely retrieved and stored in memory. In most
cases this is the most efficient way to operate, and due to the design of the
MySQL network protocol is easier to implement. If you are working with
ResultSets that have a large number of rows or large values, and can not
allocate heap space in your JVM for the memory required, you can tell the driver
to stream the results back one row at a time.

To enable this functionality, you need to create a Statement instance in the
following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
  java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch size of
Integer.MIN_VALUE serves as a signal to the driver to stream result sets
row-by-row. After this any result sets created with the statement will be
retrieved row-by-row.



-Bryan




On Dec 12, 2008, at Dec 12, 2:15 PM, Kay Kay wrote:

 I am using MySQL. I believe (since MySQL 5) supports streaming.
 
 On more about streaming - can we assume that when the database driver
supports streaming , the resultset iterator is a forward directional iterator.
 
 If , say the streaming size is 10K records and we are trying to retrieve a
total of 100K records - what exactly happens when the threshold is reached ,
(say , the first 10K records were retrieved ).
 
 Are the previous set of records thrown away and replaced in memory by the
new batch of records.
 
 
 
 --- On Fri, 12/12/08, Shalin Shekhar Mangar shalinman...@gmail.com
wrote:
 From: Shalin Shekhar Mangar shalinman...@gmail.com
 Subject: Re: Solr - DataImportHandler - Large Dataset results ?
 To: solr-user@lucene.apache.org
 Date: Friday, December 12, 2008, 9:41 PM
 
 DataImportHandler is designed to stream rows one by one to create Solr
 documents. As long as your database driver supports streaming, you should
be
 fine. Which database are you using?
 
 On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay kaykay.uni...@yahoo.com
wrote:
 
 As per the example in the wiki -
 http://wiki.apache.org/solr/DataImportHandler  - I am seeing the
following
 fragment.
 
 <dataSource driver="org.hsqldb.jdbcDriver"
 url="jdbc:hsqldb:/temp/example/ex" user="sa" />
   <document name="products">
   <entity name="item" query="select * from item">
   <field column="ID" name="id" />
   <field column="NAME" name="name" />
 ..
   </entity>
 </document>
 </dataSource>
 
 My scaled-down application looks very similar along these lines but
where
 my resultset is so big that it cannot fit within main memory by any
 chance.
 
 So I was planning to split this single query into multiple subqueries -
 with another conditional based on the id (id > 0 and id < 100, say).
 
 I am curious if there is any way to specify another conditional clause,
 (<splitData column="id" batch="1" />,
 where the column is supposed to
 be an integer value) - and internally, the implementation could actually
 generate the subqueries -
 
 i) get the min , max of the numeric column , and send queries to the
 database based on the batch size
 
 ii) Add Documents for each batch and close the resultset .
 
 This might end up putting more load on the database (but at least the
 dataset would fit in the main memory ).
 
 Let me know if anyone else had run into similar issues and how this
was
 encountered.
 
 
 
 
 
 
 
 --Regards,
 Shalin Shekhar Mangar.
 
 
 




  

Re: Solr - DataImportHandler - Large Dataset results ?

2008-12-12 Thread Kay Kay

Thanks Shalin for the clarification.

The case of Lucene taking more time to index the documents than 
DataImportHandler takes to create the input is definitely intuitive.


But I am just curious about the underlying architecture on which the test 
was run. Was this performed on a multi-core machine? If so, how many 
cores were there, and what architecture were they? It might be useful 
to know more about them to understand the results better and see 
where they could be improved.


As for the query:

select * from table LIMIT 0, 5000

how database / vendor / driver neutral is this statement? I believe 
MySQL supports it, but I am just curious how generic this statement 
is going to be.





Shalin Shekhar Mangar wrote:

On Sat, Dec 13, 2008 at 4:51 AM, Kay Kay kaykay.uni...@yahoo.com wrote:

  

Thanks Bryan .

That clarifies a lot.

But even with streaming - retrieving one document at a time and adding to
the IndexWriter seems to making it more serializable .




We have experimented with making DataImportHandler multi-threaded in the
past. We found that the improvement was very small (5-10%) because, with
databases on the local network, the bottleneck is Lucene's ability to index
documents rather than DIH's ability to create documents. Since that made the
implementation much more complex, we did not go with it.


  

So - may be the DataImportHandler could be optimized to retrieve a bunch of
results from the query and add the Documents in a separate thread , from a
Executor pool (and make this number configurable / may be retrieved from the
System as the number of physical cores to exploit maximum parallelism )
since that seems like a bottleneck.




For now, you can try creating multiple root entities with LIMIT clause to
fetch rows in batches.

For example:
<entity name="first" query="select * from table LIMIT 0, 5000">

</entity>
<entity name="second" query="select * from table LIMIT 5000, 1">
...
</entity>

and so on.

An alternate solution would be to use request parameters as variables in the
LIMIT clause and call DIH full import with different start and offset.

For example:
<entity name="x" query="select * from x LIMIT
${dataimporter.request.startAt}, ${dataimporter.request.count}">
...
</entity>

Then call:
http://host:port/solr/dataimport?command=full-import&startAt=0&count=5000
Wait for it to complete import (you'll have to monitor the output to figure
out when the import ends), and then call:
http://host:port/solr/dataimport?command=full-import&startAt=5000&count=1
and so on. Note, start and rows are parameters used by DIH, so don't use
these parameter names.

I guess this will be more complex than using multiple root entities.


  

Any comments on the same.




A workaround for the streaming bug with MySql JDBC driver is detailed here:
http://wiki.apache.org/solr/DataImportHandlerFaq

If you try any of these tricks, do let us know if it improves the
performance. If there is something which gives a lot of improvement, we can
figure out ways to implement them inside DataImportHandler itself.

  




Stopping / Starting IndexReaders in Solr 1.3+

2008-12-12 Thread Kay Kay
For a particular application of ours, we need to suspend the Solr 
server from doing any query operations (IndexReaders) for some time, and 
then, some time in the near future (in minutes), reinitialize / warm the 
IndexReaders once again and get moving.


It is a little bit different from optimize, since this server is only 
supposed to read the data and not add / create segments. But we want to 
suspend it as an initial test case for one of our load balancers. 

(Restarting Solr is an option, though we want to keep that as a last 
resort.)


Re: Solr - DataImportHandler - Large Dataset results ?

2008-12-12 Thread Kay Kay

Shalin Shekhar Mangar wrote:

On Sat, Dec 13, 2008 at 11:03 AM, Kay Kay kaykay.uni...@gmail.com wrote:

  

Thanks Shalin for the clarification.

The case about Lucene taking more time to index the Document when compared
to DataImportHandler creating the input is definitely intuitive.

But just curious about the underlying architecture on which the test was
being run. Was this performed on a multi-core machine . If so - how many
cores were there ? What architecture would they be ?  It might be useful to
know more about them to understand more about the results and see where they
could be improved.




This was with 4 CPU 64-bit Xeon dual core boxes with 6GB dedicated to the
JVM. IIRC, dataset was 3 million documents joining 3 tables from MySQL
(index size on disk 1.3 gigs). Both Solr and MySql boxes were same
configuration and running on a gigabit network. This was done a long time
back so these may not be the exact values but should be pretty close.

  

Thanks for the detailed configuration on which the tests were performed.
Our current architecture also looks more or less very similar to the same.
  

As about the query -

select * from table LIMIT 0, 5000

how database / vendor / driver neutral is this statement . I believe mysql
supports this. But I am just curious how generic is this statement going to
be .




This is for MySql. I believe we are discussing these workarounds only
because MySQL driver does not support batch streaming. It fetches rows
either one-by-one or all-at-once. You probably wouldn't need these tricks
for other databases.

  
True - currently playing around with MySQL. But I was trying to 
understand more about how the Statement object gets created (in 
the case of a platform / vendor specific query like this). Are we going 
through JPA internally in Solr to create the Statements for the queries? 
Where can I look in the Solr source code to understand more about 
this?