Re: Newbie Design Questions

2009-01-21 Thread Gunaranjan Chandraraju

Hi

Yes, the XML is inside the DB in a CLOB. I would love to use XPath
inside the SqlEntityProcessor as it will save me tons of trouble with
file dumping (given that I am not able to post it).  This is how I set up
my DIH for DB import.



driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@X"  
user="abc" password="***" batchSize="100"/>

   
  transformer="ClobTransformer"  
 query="select xml_col from xml_table where xml_col IS  
NOT NULL" >   


   dataSource="null"  

   name="record"
   processor="XPathEntityProcessor"
   stream="false"
   url="${item.xml_col}"
forEach="/record">

  
  
  

  .. and so on



 
   



The problem with this is that it always fails with the error below.  I can
see that the earlier SQL entity extraction and CLOB transformation is
working, as the values show up in the debug JSP (verbose mode with
dataimport.jsp).  However, no records are extracted for the entity.  When I
check the catalina.out file, it shows me the following error for the entity
name="record" (the XPath entity above).


java.lang.NullPointerException at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)


I don't have the whole stack trace right now.  If you need it I would  
be happy to recreate and post it.


Regards,
Guna

On Jan 21, 2009, at 8:22 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



On Thu, Jan 22, 2009 at 7:02 AM, Gunaranjan Chandraraju
 wrote:

Thanks

Yes, the source of data is a DB.  However, the XML is also posted on
updates via a publish framework.  So I can just plug in an adapter here to
listen for changes and post to Solr.  I was trying to use the
XPathEntityProcessor inside the SqlEntityProcessor and this did not work
(using 1.3 - I did see support in 1.4).  That is not a show stopper for me;
I can just post them via the framework and use files for the first-time load.

XPathEntityProcessor works inside SqlEntityProcessor only if a DB
field contains XML.

However, you can have a separate entity (at the root) to read from the DB
for deltas.
Anyway, if your current solution works, stick with it.


I have seen a couple of answers on backups for crash scenarios.  Just
wanted to confirm - if I replace the index with the backed-up files,
can I simply start Solr up again and reindex the documents changed since
the last backup? Am I right?  The slaves will also automatically adjust
to this.

Yes, you can replace an archived index and Solr should work just fine,
but the docs added since the last snapshot was taken will be missing
(of course :) )


Thanks
Guna


On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
 wrote:


Hi All
We are considering SOLR for a large database of XMLs.  I have  
some newbie
questions - if there is a place I can go read about them do let  
me know

and
I will go read up :)

1. Currently we are able to pull the XMLs from a file system using
FileDataSource.  The DIH is convenient since I can map my XML  
fields

using
the XPathProcessor. This works for an initial load.  However
after the
initial load, we would like to 'post' changed xmls to SOLR  
whenever the

XML
is updated in a separate system.  I know we can post xmls with  
'add'

however
I was not sure how to do this and maintain the DIH mapping I use in
data-config.xml?  I don't want to save the file to the disk and  
then call
the DIH - would prefer to directly post it.  Do I need to use  
solrj for

this?


What is the source of your new data? is it a DB?



2.  If my solr schema.xml changes then do I HAVE to reindex all  
the old
documents?  Suppose in future we have newer XML documents that  
contain a

new
additional xml field.The old documents that are already  
indexed don't
have this field and (so) I don't need search on them with this  
field.
However the new ones need to be search-able on this new field. 
Can I
just add this new field to the SOLR schema, restart the servers  
just post

the new documents or do I need to reindex everything?

3. Can I backup the index directory.  So that in case of a disk  
crash - I

can restore this directory and bring solr up. I realize that any
documents
indexed after this backup would be lost - I can however keep  
track of

these
outside and simply re-index documents 'newer' than that backup  
date.

This
question is really important to me in the context of using a Master
Server
with replicated index.  I would like to run this backup for the  
'Master'.


the snapshot script can be used to take backups on commit.
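
For reference, the stock example solrconfig.xml wires the snapshot script
up to commits with a postCommit listener, roughly like this (exact paths
depend on the install):

   <listener event="postCommit" class="solr.RunExecutableListener">
     <str name="exe">solr/bin/snapshooter</str>
     <str name="dir">.</str>
     <bool name="wait">true</bool>
   </listener>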


4.  In general what happens when the solr application is  
bounced?  Is the

index affected (anything maintained in memory)?

Regards
Guna





--
--Noble Paul







--
--Noble Paul




Re: Using Threading while Indexing.

2009-01-21 Thread Chris Hostetter

: I was trying to index three sets of document having 2000 articles using 
: three threads of embedded solr server. But while indexing, giving me 
: exception ?org.apache.lucene.store.LockObtainFailedException: Lock 

Something doesn't sound right here ... I'm no expert on embedding Solr, but I
think perhaps you aren't embedding Solr the recommended way.  If you were,
then there would be only one SolrCore for your index, and only one
IndexWriter -- all of your threads would then interact with this one
(fully thread safe) SolrCore.

It sounds like you are constructing separate objects (I forget which one
it is you construct when embedding) in each thread and winding up with
multiple SolrCores all competing for write access to the same index.
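
For what it's worth, a rough sketch of the shared-instance pattern from the
SolrJ wiki (written against the Solr 1.3 CoreContainer API, and assuming the
solr.solr.home system property points at your Solr home); the single server
object is what every indexing thread should share:

   import org.apache.solr.client.solrj.SolrServer;
   import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
   import org.apache.solr.core.CoreContainer;

   public class SharedEmbeddedServer {
       // one CoreContainer / EmbeddedSolrServer per JVM; all threads reuse it
       public static SolrServer create() throws Exception {
           CoreContainer.Initializer initializer = new CoreContainer.Initializer();
           CoreContainer container = initializer.initialize();
           return new EmbeddedSolrServer(container, "");  // "" = the default core
       }
   }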


-Hoss



Re: Question about query sintax

2009-01-21 Thread Chris Hostetter

: If I query for 'ferrar*' on my index, I will get 'ferrari' and 'red ferrari'
: as a result. And that's fine. But if I try to query for 'red ferrar*', I
: have to put it between double quotes as I want to grant that it will be used
: as only one term, but the '*' is being ignored, as I don't get any result.
: What should be the apropriate query for it?

When you add the double quotes you tell Solr that the * should now be
treated as a literal, and it's no longer a special character.  It is
possible to have query structures like what you are interested in, but I
don't think it's possible to express them using the Lucene query syntax.
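
For the record, outside of the query syntax this sort of "phrase ending in a
prefix" can be built in Lucene Java code with MultiPhraseQuery. A rough,
untested sketch against the Lucene 2.x API bundled with Solr 1.3 (the field
name "description" is just for illustration):

   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.index.TermEnum;
   import org.apache.lucene.search.MultiPhraseQuery;

   public class PrefixPhrase {
     public static MultiPhraseQuery redFerrariQuery(IndexReader reader) throws IOException {
       MultiPhraseQuery q = new MultiPhraseQuery();
       q.add(new Term("description", "red"));              // literal first position
       List<Term> expansions = new ArrayList<Term>();       // all terms matching ferrar*
       TermEnum te = reader.terms(new Term("description", "ferrar"));
       try {
         do {
           Term t = te.term();
           if (t == null || !t.field().equals("description")
               || !t.text().startsWith("ferrar")) break;
           expansions.add(t);
         } while (te.next());
       } finally {
         te.close();
       }
       q.add(expansions.toArray(new Term[expansions.size()]));  // second position: any expansion
       return q;
     }
   }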


-Hoss



Re: How to get XML response from CommonsHttpSolrServer through QueryResponse?

2009-01-21 Thread Chris Hostetter

: Because I used server.setParser(new XMLResponseParser()), I get the 
: wt=xml parameter in the responseHeader, but the format of the 
: responseHeader is clearly no XML at all. I expect Solr does output XML, 
: but that the QueryResponse, when I print its contents, formats this as 
: the string above.

What you are seeing is the result of calling toString on the
QueryResponse object generated by the parser -- the XML is gone at this
point; the XMLResponseParser processed it to generate that object.

That's really the main value-add of using the SolrServer/ResponseParser
APIs ... if you want the raw XML response there's almost no value in using
those classes at all -- just use HttpClient directly.
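
A minimal sketch of the direct approach with Commons HttpClient 3.x (URL and
parameters are illustrative):

   import org.apache.commons.httpclient.HttpClient;
   import org.apache.commons.httpclient.methods.GetMethod;

   public class RawXmlQuery {
     public static String query(String q) throws Exception {
       HttpClient client = new HttpClient();
       GetMethod get = new GetMethod("http://localhost:8983/solr/select?wt=xml&q="
           + java.net.URLEncoder.encode(q, "UTF-8"));
       try {
         client.executeMethod(get);
         return get.getResponseBodyAsString();   // the raw XML, untouched
       } finally {
         get.releaseConnection();
       }
     }
   }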




-Hoss



Re: Issue with dismaxrequestHandler for date fields

2009-01-21 Thread Chris Hostetter

: Still search on any field (?q=searchTerm) gives following error
: "The request sent by the client was syntactically incorrect (Invalid Date
: String:'searchTerm')."

because "searchTerm" isn't a valid date string 

: Is this valid to define *_dt (i.e. date fields ) in solrConfig.xml ?

If you really wanted to do a dismax search over some date fields, you
could -- but only with date input.

But you don't want to do a dismax query over date fields, based on your
original question...

: > 
: > productPublicationDate_product_dt^1.0
: > productPublicationDate_product_dt[NOW-45DAYS TO NOW]^1.0
: > 

...it seems that what you really want is to have a query clause matching 
docs from the last 45 days -- independent of what the searchTerm was.  so 
do that with either an "fq" (filter query) or a bq (boost query) 
(depending on your goal)
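
For example (using the field from the config quoted above; the first form
filters the result set, the second only boosts docs from the last 45 days):

   q=searchTerm&fq=productPublicationDate_product_dt:[NOW-45DAYS TO NOW]
   q=searchTerm&bq=productPublicationDate_product_dt:[NOW-45DAYS TO NOW]^1.0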

http://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341




-Hoss



Re: Passing analyzer to the queryparser plugin

2009-01-21 Thread Chris Hostetter

: Is there a way to pass the analyzer to the query parser plugin

Solr uses a variant of the PerFieldAnalyzer -- you specify in the
schema.xml what analyzer you want to use for each field.
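
For example, a minimal schema.xml sketch (the type and field names here are
made up for illustration):

   <fieldType name="text_std" class="solr.TextField">
     <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
   </fieldType>
   ...
   <field name="title" type="text_std" indexed="true" stored="true"/>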

If you have some sort of *really* exotic situation, you can always design
a custom QParser that looks at some query params to do something really
interesting (it's the parser that decides how to use the analyzer).

If you could explain what it is you are trying to do, we might be able to
help you.

PS: please don't post duplicate copies of your questions to
gene...@lucene ... solr-user is the appropriate place for questions like
this.



-Hoss



Re: What can be the reason for stopping solr work after some time?

2009-01-21 Thread Chris Hostetter

: i'm newbie with solr. We have installed with together with ezfind from
: EZ Publish web sites and it is working. But in one of the servers we
: have this kind of problem. It works for example for 3 hours, and then in
: one moment it stop to work, searching and indexing does not work.

It's pretty hard to make any sort of guess as to what your problem might
be without more information.  Is your Java process still running? Does it
respond to any HTTP requests (i.e. do the admin pages work)? What do the
logs say?


-Hoss



Re: Newbie Design Questions

2009-01-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Thu, Jan 22, 2009 at 7:02 AM, Gunaranjan Chandraraju
 wrote:
> Thanks
>
> Yes the source of data is a DB.  However the xml is also posted on updates
> via publish framework.  So I can just plug in an adapter hear to listen for
> changes and post to SOLR.  I was trying to use the XPathProcessor inside the
> SQLEntityProcessor and this did not work (using 1.3 - I did see support in
> 1.4).  That is not a show stopper for me and I can just post them via the
> framework and use files for the first time load.
XPathEntityProcessor works inside SqlEntityProcessor only if a DB
field contains XML.

However, you can have a separate entity (at the root) to read from the DB
for deltas.
Anyway, if your current solution works, stick with it.
>
> Have a seen a couple of answers on the backup for crash scenarios.  just
> wanted to confirm - if I replace the index with the backup'ed files then I
> can simple start the up solr again and reindex the documents changed since
> last backup? Am I right?  The slaves will also automatically adjust to this.
Yes, you can replace an archived index and Solr should work just fine,
but the docs added since the last snapshot was taken will be missing
(of course :) )
>
> THanks
> Guna
>
>
> On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>> On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
>>  wrote:
>>>
>>> Hi All
>>> We are considering SOLR for a large database of XMLs.  I have some newbie
>>> questions - if there is a place I can go read about them do let me know
>>> and
>>> I will go read up :)
>>>
>>> 1. Currently we are able to pull the XMLs from a file systems using
>>> FileDataSource.  The DIH is convenient since I can map my XML fields
>>> using
>>> the XPathProcessor. This works for an initial load.However after the
>>> initial load, we would like to 'post' changed xmls to SOLR whenever the
>>> XML
>>> is updated in a separate system.  I know we can post xmls with 'add'
>>> however
>>> I was not sure how to do this and maintain the DIH mapping I use in
>>> data-config.xml?  I don't want to save the file to the disk and then call
>>> the DIH - would prefer to directly post it.  Do I need to use solrj for
>>> this?
>>
>> What is the source of your new data? is it a DB?
>>
>>>
>>> 2.  If my solr schema.xml changes then do I HAVE to reindex all the old
>>> documents?  Suppose in future we have newer XML documents that contain a
>>> new
>>> additional xml field.The old documents that are already indexed don't
>>> have this field and (so) I don't need search on them with this field.
>>> However the new ones need to be search-able on this new field.Can I
>>> just add this new field to the SOLR schema, restart the servers just post
>>> the new new documents or do I need to reindex everything?
>>>
>>> 3. Can I backup the index directory.  So that in case of a disk crash - I
>>> can restore this directory and bring solr up. I realize that any
>>> documents
>>> indexed after this backup would be lost - I can however keep track of
>>> these
>>> outside and simply re-index documents 'newer' than that backup date.
>>>  This
>>> question is really important to me in the context of using a Master
>>> Server
>>> with replicated index.  I would like to run this backup for the 'Master'.
>>
>> the snapshot script is can be used to take backups on commit.
>>>
>>> 4.  In general what happens when the solr application is bounced?  Is the
>>> index affected (anything maintained in memory)?
>>>
>>> Regards
>>> Guna
>>>
>>
>>
>>
>> --
>> --Noble Paul
>
>



-- 
--Noble Paul


Re: solr-duplicate post management

2009-01-21 Thread Chris Hostetter

: what i need is to log the existing urlid and the new urlid (of course both will
: not be the same), when an .xml file with the same id (unique field) is posted.
: 
: I want to make this by modifying the solr source. Which file do i need to
: modify so that i could get the above details in the log?
: 
: I tried with DirectUpdateHandler2.java (which removes the duplicate
: entries), but my efforts were in vain.

DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's 
IndexWriter.updateDocument method when you have a uniqueKey and you aren't 
allowing duplicates -- this method doesn't give you any way to access the 
old document(s) that had that existing key.

The easiest way to make a change like the one you are interested in might be
an UpdateProcessor that does a lookup/search for the uniqueKey of each
document about to be added to see if it already exists.  That's probably
about as efficient as you can get, and would be nicely encapsulated.
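
A rough, untested sketch of that idea against the 1.3/1.4 update-processor API
(the println is just a stand-in for real logging, it assumes the uniqueKey is a
plain string field, and the factory would still need to be registered in an
updateRequestProcessorChain in solrconfig.xml):

   import java.io.IOException;
   import org.apache.lucene.index.Term;
   import org.apache.solr.request.SolrQueryRequest;
   import org.apache.solr.request.SolrQueryResponse;
   import org.apache.solr.schema.SchemaField;
   import org.apache.solr.update.AddUpdateCommand;
   import org.apache.solr.update.processor.UpdateRequestProcessor;
   import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

   public class LogDuplicateKeyProcessorFactory extends UpdateRequestProcessorFactory {
     public UpdateRequestProcessor getInstance(final SolrQueryRequest req,
         SolrQueryResponse rsp, UpdateRequestProcessor next) {
       return new UpdateRequestProcessor(next) {
         public void processAdd(AddUpdateCommand cmd) throws IOException {
           SchemaField keyField = req.getSchema().getUniqueKeyField();
           String key = String.valueOf(cmd.solrDoc.getFieldValue(keyField.getName()));
           // getFirstMatch returns -1 when no indexed doc carries this term
           int docId = req.getSearcher().getFirstMatch(new Term(keyField.getName(), key));
           if (docId != -1) {
             // a document with this uniqueKey already exists -- log both ids here
             System.out.println("duplicate uniqueKey posted: " + key);
           }
           super.processAdd(cmd);   // let the normal add (and overwrite) proceed
         }
       };
     }
   }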

You might also want to take a look at SOLR-799, where some work is being 
done to create UpdateProcessors that can do "near duplicate" detection...

http://wiki.apache.org/solr/Deduplication
https://issues.apache.org/jira/browse/SOLR-799






-Hoss



Re: Newbie Design Questions

2009-01-21 Thread Gunaranjan Chandraraju

Thanks

Yes, the source of data is a DB.  However, the XML is also posted on
updates via a publish framework.  So I can just plug in an adapter here
to listen for changes and post to Solr.  I was trying to use the
XPathEntityProcessor inside the SqlEntityProcessor and this did not work
(using 1.3 - I did see support in 1.4).  That is not a show stopper
for me; I can just post them via the framework and use files for
the first-time load.


I have seen a couple of answers on backups for crash scenarios.  Just
wanted to confirm - if I replace the index with the backed-up files,
can I simply start Solr up again and reindex the documents changed
since the last backup? Am I right?  The slaves will also
automatically adjust to this.


Thanks
Guna


On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju
 wrote:

Hi All
We are considering SOLR for a large database of XMLs.  I have some  
newbie
questions - if there is a place I can go read about them do let me  
know and

I will go read up :)

1. Currently we are able to pull the XMLs from a file system using
FileDataSource.  The DIH is convenient since I can map my XML  
fields using
the XPathProcessor. This works for an initial load.  However
after the
initial load, we would like to 'post' changed xmls to SOLR whenever  
the XML
is updated in a separate system.  I know we can post xmls with  
'add' however

I was not sure how to do this and maintain the DIH mapping I use in
data-config.xml?  I don't want to save the file to the disk and  
then call
the DIH - would prefer to directly post it.  Do I need to use solrj  
for

this?


What is the source of your new data? is it a DB?



2.  If my solr schema.xml changes then do I HAVE to reindex all the  
old
documents?  Suppose in future we have newer XML documents that  
contain a new
additional xml field.The old documents that are already indexed  
don't

have this field and (so) I don't need search on them with this field.
However the new ones need to be search-able on this new field. 
Can I
just add this new field to the SOLR schema, restart the servers  
just post

the new documents or do I need to reindex everything?

3. Can I backup the index directory.  So that in case of a disk  
crash - I
can restore this directory and bring solr up. I realize that any  
documents
indexed after this backup would be lost - I can however keep track  
of these
outside and simply re-index documents 'newer' than that backup  
date.  This
question is really important to me in the context of using a Master  
Server
with replicated index.  I would like to run this backup for the  
'Master'.

the snapshot script can be used to take backups on commit.


4.  In general what happens when the solr application is bounced?   
Is the

index affected (anything maintained in memory)?

Regards
Guna





--
--Noble Paul




Re: Newbie Design Questions

2009-01-21 Thread Gunaranjan Chandraraju

Hi Grant
Thanks for the reply.  My response below.


The data is stored as XMLs.  Each record/entity corresponds to an  
XML.  The XML is of the form


   <record>
      ...
   </record>


I have currently put it in a schema.xml and DIH handler as follows

schema.xml

   <field name="..." type="..." indexed="..." stored="..."/>
   ...

data-import.xml

   <dataConfig>
     <dataSource type="FileDataSource"/>
     <document>
       <entity name="record" processor="XPathEntityProcessor"
               url="..." forEach="/record">
         <field column="..." xpath="/record/..."/>
         .. and so on
       </entity>
     </document>
   </dataConfig>


I don't need all the fields in the XML indexed or stored. I just  
include the ones I need in the schema.xml and data-import.xml


Architecturally these XMLs are created, updated and stored in a  
separate system.  Currently I am dumping the files in a directory and  
invoking the DIH.


Actually we have a publishing channel that publishes each XML whenever
it's updated or created.  I'd really like to tap into this channel and
directly post the XML to Solr instead of saving it to a file and then
invoking DIH.  I'd also like to do it leveraging definitions like those in
the data-config XML, so that every time I can just post the original
XML and the XPath configuration takes care of extracting the relevant
fields.


I did take a look at cell in the link below.  It seems to be only for  
1.4 and currently 1.3 is the stable release.



Guna
On Jan 20, 2009, at 7:50 PM, Grant Ingersoll wrote:



On Jan 20, 2009, at 6:45 PM, Gunaranjan Chandraraju wrote:


Hi All
We are considering SOLR for a large database of XMLs.  I have some  
newbie questions - if there is a place I can go read about them do  
let me know and I will go read up :)


1. Currently we are able to pull the XMLs from a file systems using  
FileDataSource.  The DIH is convenient since I can map my XML  
fields using the XPathProcessor. This works for an initial load. 
However after the initial load, we would like to 'post' changed  
xmls to SOLR whenever the XML is updated in a separate system.  I  
know we can post xmls with 'add' however I was not sure how to do  
this and maintain the DIH mapping I use in data-config.xml?  I  
don't want to save the file to the disk and then call the DIH -  
would prefer to directly post it.  Do I need to use solrj for this?


You can likely use SolrJ, but then you probably need to parse the  
XML an extra time.  You may also be able to use Solr Cell, which is  
the Tika integration such that you can send the XML straight to Solr  
and have it deal with it.  See http://wiki.apache.org/solr/ExtractingRequestHandler 
  Solr Cell is a push technology, whereas DIH is a pull technology.


I don't know how compatible this would be w/ DIH.  Ideally, in the  
future, they will cooperate as much as possible, but we are not  
there yet.


As for your initial load, what if you ran a one-time XSLT pass
over all the files and transformed them to SolrXML and then just
posted them the normal way?  Then, going forward, any new files
could just be written out as SolrXML as well.
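
A rough sketch of that one-time transform (element and field names are
placeholders, and the URL assumes a default single-core install):

   <!-- to-solr.xsl: turn a source <record> document into Solr add XML -->
   <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:template match="/record">
       <add>
         <doc>
           <field name="id"><xsl:value-of select="id"/></field>
           <field name="title"><xsl:value-of select="title"/></field>
           <!-- ... one <field> per schema field ... -->
         </doc>
       </add>
     </xsl:template>
   </xsl:stylesheet>

and then something like:

   xsltproc to-solr.xsl record.xml | \
     curl http://localhost:8983/solr/update -H 'Content-type: text/xml' --data-binary @-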


If you can give some more info about your content, I think it would  
be helpful.





2.  If my solr schema.xml changes then do I HAVE to reindex all the  
old documents?  Suppose in future we have newer XML documents that  
contain a new additional xml field.The old documents that are  
already indexed don't have this field and (so) I don't need search  
on them with this field.  However the new ones need to be searchable
on this new field.  Can I just add this new field to the
SOLR schema, restart the servers, and just post the new documents, or
do I need to reindex everything?


Yes, you should be able to add new fields w/o problems.  Where you  
can run into problems is renaming, removing, etc.





3. Can I backup the index directory.  So that in case of a disk  
crash - I can restore this directory and bring solr up. I realize  
that any documents indexed after this backup would be lost - I can  
however keep track of these outside and simply re-index documents  
'newer' than that backup date.  This question is really important  
to me in the context of using a Master Server with replicated  
index.  I would like to run this backup for the 'Master'.


Yes, just use the master/slave replication approach for doing backups.




4.  In general what happens when the solr application is bounced?   
Is the index affected (anything maintained in memory)?


I would recommend doing a commit before bouncing and letting all  
indexing operations complete.  Worst case, assuming you are using  
Solr 1.3 or later, is that you may lose what is in memory.


-Grant

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


Re: numFound problem

2009-01-21 Thread Koji Sekiguchi

Ron Chan wrote:
I'm using out of the box Solr 1.3 that I had just downloaded, so I guess it is the StandardAnalyzer 

  

It seems WordDelimiterFilter worked for you.
Go to Admin console, click analysis, then give:

Field name: text

Field value (Index): SD/DDeck
verbose output: checked
highlight matched: checked

Field value (Query): SD DDeck
verbose output: checked

then click analyze.

regards,

Koji

but shouldn't the returned docs equal numFound? 



- Original Message - 
From: "Erick Erickson"  
To: solr-user@lucene.apache.org 
Sent: Wednesday, 21 January, 2009 20:49:56 GMT +00:00 GMT Britain, Ireland, Portugal 
Subject: Re: numFound problem 

It depends (tm). What analyzer are you using when indexing? 

I'd expect (though I haven't checked) that StandardAnalyzer 
would break SD/DDeck into two tokens, SD and DDeck which 
corresponds nicely with what you're reporting. 

Other analyzers and/or filters are easy to specify 

I'd recommend getting a copy of Luke and examining your 
index to see what's actually in it 

Best 
Erick 

On Wed, Jan 21, 2009 at 3:43 PM, Ron Chan  wrote: 

  
I have a test search which I know should return 34 docs and it does 

however, numFound says 40 

with debug enabled, I can see the 40 it has found 

my search looks for "SD DDeck" in the description 

34 of them had "SD DDeck" with 6 of them having "SD/DDeck" 

now, I can probably work round it if had returned me the 40 docs but the 
problem is it returns 34 docs but gives me a numFound of 40 

is this expected behavior? 







  




RE: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Feak, Todd
A ballpark calculation would be:

   Collected Amount (from GC logging) / # of Requests

The GC logging can tell you how much it collected each time, no need to
try and snapshot before and after heap sizes. However (big caveat here),
this is a ballpark figure. The garbage collector is not guaranteed to
collect everything, every time. It can stop collecting depending on how
much time it spent. It may only collect from certain sections within
memory (Eden, survivor, tenured), etc.

This may still be enough to make broad comparisons to see if you've
decreased the overall garbage/request (via cache changes), but it will
be quite a rough estimate.
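
For reference, on a Sun/HotSpot JVM the per-collection numbers can be captured
with flags along these lines (the log file name is arbitrary):

   java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log ...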

-Todd

-Original Message-
From: wojtekpia [mailto:wojte...@hotmail.com] 
Sent: Wednesday, January 21, 2009 3:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Performance "dead-zone" due to garbage collection


(Thanks for the responses)

My filterCache hit rate is ~60% (so I'll try making it bigger), and I am
CPU
bound. 

How do I measure the size of my per-request garbage? Is it (total heap
size
before collection - total heap size after collection) / # of requests to
cause a collection?

I'll try your suggestions and post back any useful results.

-- 
View this message in context:
http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collect
ion-tp21588427p21593661.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread wojtekpia

(Thanks for the responses)

My filterCache hit rate is ~60% (so I'll try making it bigger), and I am CPU
bound. 

How do I measure the size of my per-request garbage? Is it (total heap size
before collection - total heap size after collection) / # of requests to
cause a collection?

I'll try your suggestions and post back any useful results.

-- 
View this message in context: 
http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21593661.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: XMLResponsWriter or PHPResponseWriter, who is faster?

2009-01-21 Thread Marc Sturlese

After some tests with System.currentTimeMillis I have seen that the difference
is almost negligible ... but the PHP response was a little bit faster...

Marc Sturlese wrote:
> 
> Hey there, I am using Solr as a back end and I don't mind how I get back
> the results. Which is faster for Solr when creating the response:
> XMLResponseWriter or PHPResponseWriter?
> For my front end it is faster to process the response created by
> PHPResponseWriter, but I would not like to gain speed parsing the
> response only to lose it in the creation.
> Thanks in advance
> 
> 

-- 
View this message in context: 
http://www.nabble.com/XMLResponsWriter-or-PHPResponseWriter%2C-who-is-faster--tp21582204p21593352.html
Sent from the Solr - User mailing list archive at Nabble.com.



Suppressing logging for /admin/ping requests

2009-01-21 Thread Todd Breiholz
Is there any way to suppress the logging of the /admin/ping requests? We have
HAProxy configured to do health checks to this URI every couple of seconds
and it is really cluttering our logs. I'd still like to see the logging from
the other requestHandlers.
Thanks!

Todd


Re: numFound problem

2009-01-21 Thread Erick Erickson
Oops, missed that. I'll have to defer to folks with more
SOLR experience than I have, I've pretty much worked
in Lucene.

Best
Erick

On Wed, Jan 21, 2009 at 3:57 PM, Ron Chan  wrote:

> I'm using out of the box Solr 1.3 that I had just downloaded, so I guess it
> is the StandardAnalyzer
>
> but shouldn't the returned docs equal numFound?
>
>
> - Original Message -
> From: "Erick Erickson" 
> To: solr-user@lucene.apache.org
> Sent: Wednesday, 21 January, 2009 20:49:56 GMT +00:00 GMT Britain, Ireland,
> Portugal
> Subject: Re: numFound problem
>
> It depends (tm). What analyzer are you using when indexing?
>
> I'd expect (though I haven't checked) that StandardAnalyzer
> would break SD/DDeck into two tokens, SD and DDeck which
> corresponds nicely with what you're reporting.
>
> Other analyzers and/or filters are easy to specify
>
> I'd recommend getting a copy of Luke and examining your
> index to see what's actually in it
>
> Best
> Erick
>
> On Wed, Jan 21, 2009 at 3:43 PM, Ron Chan  wrote:
>
> > I have a test search which I know should return 34 docs and it does
> >
> > however, numFound says 40
> >
> > with debug enabled, I can see the 40 it has found
> >
> > my search looks for "SD DDeck" in the description
> >
> > 34 of them had "SD DDeck" with 6 of them having "SD/DDeck"
> >
> > now, I can probably work round it if had returned me the 40 docs but the
> > problem is it returns 34 docs but gives me a numFound of 40
> >
> > is this expected behavior?
> >
> >
> >
>


Re: numFound problem

2009-01-21 Thread Ron Chan
I'm using out of the box Solr 1.3 that I had just downloaded, so I guess it is 
the StandardAnalyzer 

but shouldn't the returned docs equal numFound? 


- Original Message - 
From: "Erick Erickson"  
To: solr-user@lucene.apache.org 
Sent: Wednesday, 21 January, 2009 20:49:56 GMT +00:00 GMT Britain, Ireland, 
Portugal 
Subject: Re: numFound problem 

It depends (tm). What analyzer are you using when indexing? 

I'd expect (though I haven't checked) that StandardAnalyzer 
would break SD/DDeck into two tokens, SD and DDeck which 
corresponds nicely with what you're reporting. 

Other analyzers and/or filters are easy to specify 

I'd recommend getting a copy of Luke and examining your 
index to see what's actually in it 

Best 
Erick 

On Wed, Jan 21, 2009 at 3:43 PM, Ron Chan  wrote: 

> I have a test search which I know should return 34 docs and it does 
> 
> however, numFound says 40 
> 
> with debug enabled, I can see the 40 it has found 
> 
> my search looks for "SD DDeck" in the description 
> 
> 34 of them had "SD DDeck" with 6 of them having "SD/DDeck" 
> 
> now, I can probably work round it if had returned me the 40 docs but the 
> problem is it returns 34 docs but gives me a numFound of 40 
> 
> is this expected behavior? 
> 
> 
> 


Re: numFound problem

2009-01-21 Thread Erick Erickson
It depends (tm). What analyzer are you using when indexing?

I'd expect (though I haven't checked) that StandardAnalyzer
would break SD/DDeck into two tokens, SD and DDeck which
corresponds nicely with what you're reporting.

Other analyzers and/or filters are easy to specify

I'd recommend getting a copy of Luke and examining your
index to see what's actually in it

Best
Erick

On Wed, Jan 21, 2009 at 3:43 PM, Ron Chan  wrote:

> I have a test search which I know should return 34 docs and it does
>
> however, numFound says 40
>
> with debug enabled, I can see the 40 it has found
>
> my search looks for "SD DDeck" in the description
>
> 34 of them had "SD DDeck" with 6 of them having "SD/DDeck"
>
> now, I can probably work round it if had returned me the 40 docs but the
> problem is it returns 34 docs but gives me a numFound of 40
>
> is this expected behavior?
>
>
>


Re: storing complex types in a multiValued field

2009-01-21 Thread Chris Hostetter

: > I guess most people store it as a simple string "key(separator)value". Is

or use dynamic fields to put the "key" into the field name...

: > >   <field name="..." type="..." ... multiValued="true" />

...could be...

   <dynamicField name="..." type="..." ... multiValued="true" />

...then index the value into the per-key field.

If you omitNorms the overhead of having many fields should be low -
although I'm not 100% certain how it compares with having a single field
and encoding the key/value in the field value.


-Hoss



numFound problem

2009-01-21 Thread Ron Chan
I have a test search which I know should return 34 docs and it does 

however, numFound says 40 

with debug enabled, I can see the 40 it has found 

my search looks for "SD DDeck" in the description 

34 of them had "SD DDeck" with 6 of them having "SD/DDeck" 

now, I can probably work around it if it had returned me the 40 docs, but the
problem is it returns 34 docs but gives me a numFound of 40

is this expected behavior? 




Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Walter Underwood
Have you tried different sizes for the nursery? It should be several
times larger than the per-request garbage.

Also, check your cache sizes. Objects evicted from the cache are
almost always tenured, so those will add to the time needed for
a full GC.

Guess who was tuning GC for a week or two in December ...

wunder

On 1/21/09 12:15 PM, "Feak, Todd"  wrote:

> From a high level view, there is a certain amount of garbage collection
> that must occur. That garbage is generated per request, through a
> variety of means (buffers, request, response, cache expulsion). The only
> thing that JVM parameters can address is *when* that collection occurs.
> 
> It can occur often in small chunks, or rarely in large chunks (or
> anywhere in between). If you are CPU bound (which it sounds like you may
> be), then you really have a decision to make. Do you want an overall
> drop in performance, as more time is spent garbage collecting, OR do you
> want spikes in garbage collection that are more rare, but have a
> stronger impact. Realistically it becomes a question of one or the
> other. You *must* pay the cost of garbage collection at some point in
> time.
> 
> It is possible that increasing cache size will decrease overall garbage
> collection, as the churn caused by caused by cache misses creates
> additional garbage. Decreasing the churn could decrease garbage. BUT,
> this really depends on your cache hit rates. If they are pretty high
> (>90%) then it's probably not much of a factor. However, if you are in
> the 50%-60% range, larger caches may help you in a number of ways.
> 
> -Todd Feak
> 
> -Original Message-
> From: wojtekpia [mailto:wojte...@hotmail.com]
> Sent: Wednesday, January 21, 2009 11:14 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Performance "dead-zone" due to garbage collection
> 
> 
> I'm using a recent version of Sun's JVM (6 update 7) and am using the
> concurrent generational collector. I've tried several other collectors,
> none
> seemed to help the situation.
> 
> I've tried reducing my heap allocation. The search performance got worse
> as
> I reduced the heap. I didn't monitor the garbage collector in those
> tests,
> but I imagine that it would've gotten better. (As a side note, I do lots
> of
> faceting and sorting, I have 10M records in this index, with an
> approximate
> index file size of 10GB).
> 
> This index is on a single machine, in a single Solr core. Would
> splitting it
> across multiple Solr cores on a single machine help? I'd like to find
> the
> limit of this machine before spreading the data to more machines.
> 
> Thanks,
> 
> Wojtek



RE: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Feak, Todd
From a high-level view, there is a certain amount of garbage collection
that must occur. That garbage is generated per request, through a
variety of means (buffers, request, response, cache expulsion). The only
thing that JVM parameters can address is *when* that collection occurs. 

It can occur often in small chunks, or rarely in large chunks (or
anywhere in between). If you are CPU bound (which it sounds like you may
be), then you really have a decision to make. Do you want an overall
drop in performance, as more time is spent garbage collecting, OR do you
want spikes in garbage collection that are more rare, but have a
stronger impact. Realistically it becomes a question of one or the
other. You *must* pay the cost of garbage collection at some point in
time.

It is possible that increasing cache size will decrease overall garbage
collection, as the churn caused by cache misses creates
additional garbage. Decreasing the churn could decrease garbage. BUT,
this really depends on your cache hit rates. If they are pretty high
(>90%) then it's probably not much of a factor. However, if you are in
the 50%-60% range, larger caches may help you in a number of ways.

-Todd Feak

-Original Message-
From: wojtekpia [mailto:wojte...@hotmail.com] 
Sent: Wednesday, January 21, 2009 11:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Performance "dead-zone" due to garbage collection


I'm using a recent version of Sun's JVM (6 update 7) and am using the
concurrent generational collector. I've tried several other collectors,
none
seemed to help the situation.

I've tried reducing my heap allocation. The search performance got worse
as
I reduced the heap. I didn't monitor the garbage collector in those
tests,
but I imagine that it would've gotten better. (As a side note, I do lots
of
faceting and sorting, I have 10M records in this index, with an
approximate
index file size of 10GB).

This index is on a single machine, in a single Solr core. Would
splitting it
across multiple Solr cores on a single machine help? I'd like to find
the
limit of this machine before spreading the data to more machines.

Thanks,

Wojtek
-- 
View this message in context:
http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collect
ion-tp21588427p21590150.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Problem with WT parameter when upgrading from Solr1.2 to solr1.3

2009-01-21 Thread Chris Hostetter

: Right, that's probably the crux of it - distributed search required
: some extensions to response writers... things like handling
: SolrDocument and SolrDocumentList.

Grrr... that's right, I forgot that there wasn't any way to make
SolrDocumentList implement DocList ... and I don't think this caveat got
documented anywhere.

I'm going to poke around and see if I can find some good places to point
this out.


-Hoss



RE: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Feak, Todd
The large drop in old generation from 27GB->6GB indicates that things
are getting into your old generation prematurely. They really don't need
to get there at all, and should be collected sooner (more frequently).

Look into increasing young generation sizes via JVM parameters. Also
look into concurrent collection.

You could even consider decreasing your JVM max memory. Obviously you
aren't using it all; decreasing it will force the JVM to do more
frequent (and therefore smaller) collections. Your average collection
time may go up, but the individual pauses will be smaller.

Great details on memory tuning on Sun JDKs here 

http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

There are other articles for 1.6 and 1.4 as well.
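
Purely as an illustration of the knobs being discussed (HotSpot flag names;
the sizes are placeholders that would need tuning against the real workload):

   java -Xms16g -Xmx16g -Xmn2g -XX:SurvivorRatio=8 \
        -XX:+UseParNewGC -XX:+UseConcMarkSweepGC ...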

-Todd

-Original Message-
From: wojtekpia [mailto:wojte...@hotmail.com] 
Sent: Wednesday, January 21, 2009 9:49 AM
To: solr-user@lucene.apache.org
Subject: Performance "dead-zone" due to garbage collection


I'm intermittently experiencing severe performance drops due to Java
garbage
collection. I'm allocating a lot of RAM to my Java process (27GB of the
32GB
physically available). Under heavy load, the performance drops
approximately
every 10 minutes, and the drop lasts for 30-40 seconds. This coincides
with
the size of the old generation heap dropping from ~27GB to ~6GB. 

Is there a way to reduce the impact of garbage collection? A couple
ideas
we've come up with (but haven't tried yet) are: increasing the minimum
heap
size, more frequent (but hopefully less costly) garbage collection.

Thanks,

Wojtek

-- 
View this message in context:
http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collect
ion-tp21588427p21588427.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Querying back with top few results in the same XMLWriter!

2009-01-21 Thread Chris Hostetter

: I am using a ranking algorithm by modifying the XMLWriter to use a
: formulation which takes the top 3 results and query with the 3 results and
: now presents the result with as function of the results from these 3
: queries. Can anyone reply if I can take the top 3results and query with them
: in the same reponsewriter?
:  Or is there any functionality provided by solr in either 1.2 or 1.3
: version.

I'm not sure why you would do this in XMLWriter -- this is the type of 
logic that would make more sense in a RequestHandler or SearchComponent.

in fact: this is very similar to what the MoreLikeThisComponent does -- 
except it sounds like you want to create a single query based on multiple 
documents (MLT creates a seperate query for each document)


Take a look at the SearchComponent API -- use MLT as an example and I
think you'll see a relatively easy way to accomplish your goal.


-Hoss



Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Alexander Ramos Jardim
I would say that adding more Solr instances, each one with its own data
directory, could help if you can qualify your docs in such a way that you
can put "A" type docs in index "A", "B" type docs in index "B", and so on.

2009/1/21 wojtekpia 

>
> I'm using a recent version of Sun's JVM (6 update 7) and am using the
> concurrent generational collector. I've tried several other collectors,
> none
> seemed to help the situation.
>
> I've tried reducing my heap allocation. The search performance got worse as
> I reduced the heap. I didn't monitor the garbage collector in those tests,
> but I imagine that it would've gotten better. (As a side note, I do lots of
> faceting and sorting, I have 10M records in this index, with an approximate
> index file size of 10GB).
>
> This index is on a single machine, in a single Solr core. Would splitting
> it
> across multiple Solr cores on a single machine help? I'd like to find the
> limit of this machine before spreading the data to more machines.
>
> Thanks,
>
> Wojtek
> --
> View this message in context:
> http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21590150.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Alexander Ramos Jardim


Re: Sizing a Linux box for Solr?

2009-01-21 Thread Erick Erickson
One other useful piece of information would be how big you
expect your indexes to be. Which you should be able to estimate
quite easily by indexing, say, 20,000 documents from the
relevant databases.

Of particular interest will be the delta between the size of the
index at, say, 10,000 documents and 20,000, since size is
related to the number of unique terms per field and once you
get past a certain number of terms, virtually every new term will
already be in your index.
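
(Purely illustrative numbers: if the index were 150 MB at 10,000 docs and
230 MB at 20,000, the marginal cost would be roughly 8 KB per document, so
5 million similar documents would land somewhere around 40 GB plus the
fixed overhead of the term dictionary.)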

Also, I think that the relevant metric is the size for *unstored*
data, since storing the fields isn't particularly relevant to search
response time (although it can *certainly* be relevant to
*total* time if you assemble a lot of stored fields to return).

If you're new to Lucene, the difference between stored and
indexed is a bit confusing, so if the above is gibberish, you'd
be well served by understanding the distinction before you go
too far.

Best
Erick

On Wed, Jan 21, 2009 at 1:04 PM, Thomas Dowling wrote:

> On 01/21/2009 12:25 PM, Matthew Runo wrote:
> > At a certain level it will become better to have multiple smaller boxes
> > rather than one huge one. I've found that even an old P4 with 2 gigs of
> > ram has decent response time on our 150,000 item index with only a few
> > users - but it quickly goes downhill if we get more than 5 or 6. How
> > many documents are you going to be storing in your index? How much of
> > them will be "stored" versus "indexed"? Will you be faceting on the
> > results?
>
> Thanks for the tip on multiple boxes.  We'll be hosting about 20
> databases total.  A couple of them are in the 10- to 20-million record
> range and a couple more are in the 5- to 10-million range.  It's highly
> structured data and I anticipate a lot of faceting and indexing almost
> all the fields.
>
> >
> > In general, I'd recommend a 64 bit processor with enough ram to store
> > your index in ram - but that might not be possible with "millions" of
> > records. Our 150,000 item index is about a gig and a half when optimized
> > but yours will likely be different depending on how much you store.
> > Faceting takes more memory than pure searching as well.
> >
>
> This is very helpful.  Thanks again.
>
>
> --
> Thomas Dowling
>


Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread wojtekpia

I'm using a recent version of Sun's JVM (6 update 7) and am using the
concurrent generational collector. I've tried several other collectors, none
seemed to help the situation.

I've tried reducing my heap allocation. The search performance got worse as
I reduced the heap. I didn't monitor the garbage collector in those tests,
but I imagine that it would've gotten better. (As a side note, I do lots of
faceting and sorting, I have 10M records in this index, with an approximate
index file size of 10GB).

This index is on a single machine, in a single Solr core. Would splitting it
across multiple Solr cores on a single machine help? I'd like to find the
limit of this machine before spreading the data to more machines.

Thanks,

Wojtek
-- 
View this message in context: 
http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21590150.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Fergus McMenemie
>Hi Fergus,
>
>It seems a field it is expecting is missing from the XML.

You mean there is some field in the document we are indexing
that is missing?

>
>sourceColName="*fileAbsePath*"/>
>
>I guess "fileAbsePath" is a typo? Can you check if that is the cause?
Well spotted. I had made a mess of sanitizing the config file I sent
to you. I will in future make sure the stuff I am messing with matches
what I send to the list. However there is no typo in the underlying file;
at least not on that line:-) 


>
>
>On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie  wrote:
>
>> Shalin
>>
>> Downloaded the nightly for 21 Jan and tried DIH again. It's better but
>> still broken. Dozens of embedded tags are stripped from documents,
>> but it now fails every few documents for no reason I can see. Manually
>> removing embedded tags causes a given problem document to be indexed,
>> only to have it fail on one of the next few documents. I think the
>> problem is still in stripHTML.
>>
>> Here is the traceback.
>>
>> Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
>> INFO: Server startup in 3377 ms
>> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter
>> readIndexerProperties
>> INFO: Read dataimport.properties
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
>> INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import}
>> status=0 QTime=13
>> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter
>> doFullImport
>> INFO: Starting Full Import
>> Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2
>> deleteAll
>> INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
>> INFO: SolrDeletionPolicy.onInit: commits:num=2
>>
>>  
>> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
>>
>>  
>> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy
>> updateCommits
>> INFO: last commit = 1232539612131
>> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder
>> buildDocument
>> SEVERE: Exception while processing: jc document : null
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing
>> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
>> Processing Document # 9
>>at
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>>at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>>at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>> at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>>at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>>at
>> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>>at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>>... 9 more
>> Caused by: java.util.NoSuchElementException
>>at
>> com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>>at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>>at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>>at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>>at
>> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>>... 10 more
>> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter
>> doFullImport
>> SEVERE: Full Import failed
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing
>> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
>> Processing Document # 9
>>at
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>> 

Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Alexander Ramos Jardim
How many boxes are running your index? If it is just one, maybe distributing
your index will get you better performance during garbage collection.

2009/1/21 wojtekpia 

>
> I'm intermittently experiencing severe performance drops due to Java
> garbage
> collection. I'm allocating a lot of RAM to my Java process (27GB of the
> 32GB
> physically available). Under heavy load, the performance drops
> approximately
> every 10 minutes, and the drop lasts for 30-40 seconds. This coincides with
> the size of the old generation heap dropping from ~27GB to ~6GB.
>
> Is there a way to reduce the impact of garbage collection? A couple ideas
> we've come up with (but haven't tried yet) are: increasing the minimum heap
> size, more frequent (but hopefully less costly) garbage collection.
>
> Thanks,
>
> Wojtek
>
> --
> View this message in context:
> http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21588427.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Alexander Ramos Jardim


Re: Sizing a Linux box for Solr?

2009-01-21 Thread Alexander Ramos Jardim
Definitely you will want to have more than one box for your index.

You can take a look at distributed search and multicore at the wiki.


2009/1/21 Thomas Dowling 

> On 01/21/2009 12:25 PM, Matthew Runo wrote:
> > At a certain level it will become better to have multiple smaller boxes
> > rather than one huge one. I've found that even an old P4 with 2 gigs of
> > ram has decent response time on our 150,000 item index with only a few
> > users - but it quickly goes downhill if we get more than 5 or 6. How
> > many documents are you going to be storing in your index? How much of
> > them will be "stored" versus "indexed"? Will you be faceting on the
> > results?
>
> Thanks for the tip on multiple boxes.  We'll be hosting about 20
> databases total.  A couple of them are in the 10- to 20-million record
> range and a couple more are in the 5- to 10-million range.  It's highly
> structured data and I anticipate a lot of faceting and indexing almost
> all the fields.
>
> >
> > In general, I'd recommend a 64 bit processor with enough ram to store
> > your index in ram - but that might not be possible with "millions" of
> > records. Our 150,000 item index is about a gig and a half when optimized
> > but yours will likely be different depending on how much you store.
> > Faceting takes more memory than pure searching as well.
> >
>
> This is very helpful.  Thanks again.
>
>
> --
> Thomas Dowling
>



-- 
Alexander Ramos Jardim


Incorrect Scoring

2009-01-21 Thread Jeff Newburn
Can someone please make sense of why the following occurs in our system.
The first item barely matches but scores higher than the second one, which
matches all over the place.  The second one is a MUCH better match but has a
worse score. These are in the same query results.  All I can see are the
norms, but I don't know how to fix that.

Parsed Query Info
 +((DisjunctionMaxQuery((realBrandName:brown |
subCategory:brown^20.0 | productDescription:brown | width:brown |
personality:brown^10.0 | brandName:brown | productType:brown^8.0 |
productId:brown^10.0 | size:brown^1.2 | category:brown^10.0 | price:brown |
productNameSearch:brown | heelHeight:brown | color:brown^10.0 |
attrs:brown^5.0 | expandedGender:brown^0.5)~0.01)
DisjunctionMaxQuery((realBrandName:shoe | subCategory:shoe^20.0 |
productDescription:shoe | width:shoes | personality:shoe^10.0 |
brandName:shoe | productType:shoe^8.0 | productId:shoes^10.0 |
size:shoes^1.2 | category:shoe^10.0 | price:shoes | productNameSearch:shoe |
heelHeight:shoes | color:shoe^10.0 | attrs:shoe^5.0 |
expandedGender:shoes^0.5)~0.01))~2)
DisjunctionMaxQuery((realBrandName:"brown shoe"~1^10.0 | category:"brown
shoe"~1^10.0 | productNameSearch:"brown shoe"~1 | productDescription:"brown
shoe"~1^2.0 | subCategory:"brown shoe"~1^20.0 | personality:"brown
shoe"~1^2.0 | brandName:"brown shoe"~1^10.0 | productType:"brown
shoe"~1^8.0)~0.01)
 +(((realBrandName:brown |
subCategory:brown^20.0 | productDescription:brown | width:brown |
personality:brown^10.0 | brandName:brown | productType:brown^8.0 |
productId:brown^10.0 | size:brown^1.2 | category:brown^10.0 | price:brown |
productNameSearch:brown | heelHeight:brown | color:brown^10.0 |
attrs:brown^5.0 | expandedGender:brown^0.5)~0.01 (realBrandName:shoe |
subCategory:shoe^20.0 | productDescription:shoe | width:shoes |
personality:shoe^10.0 | brandName:shoe | productType:shoe^8.0 |
productId:shoes^10.0 | size:shoes^1.2 | category:shoe^10.0 | price:shoes |
productNameSearch:shoe | heelHeight:shoes | color:shoe^10.0 | attrs:shoe^5.0
| expandedGender:shoes^0.5)~0.01)~2) (realBrandName:"brown shoe"~1^10.0 |
category:"brown shoe"~1^10.0 | productNameSearch:"brown shoe"~1 |
productDescription:"brown shoe"~1^2.0 | subCategory:"brown shoe"~1^20.0 |
personality:"brown shoe"~1^2.0 | brandName:"brown shoe"~1^10.0 |
productType:"brown shoe"~1^8.0)~0.01


DebugQuery Info

  
0.45851633 = (MATCH) sum of:
  0.45851633 = (MATCH) sum of:
0.19769925 = (MATCH) max plus 0.01 times others of:
  0.19769925 = (MATCH) weight(color:brown^10.0 in 1407), product of:
0.06819186 = queryWeight(color:brown^10.0), product of:
  10.0 = boost
  2.8991618 = idf(docFreq=19348, numDocs=129257)
  0.0023521234 = queryNorm
2.8991618 = (MATCH) fieldWeight(color:brown in 1407), product of:
  1.0 = tf(termFreq(color:brown)=1)
  2.8991618 = idf(docFreq=19348, numDocs=129257)
  1.0 = fieldNorm(field=color, doc=1407)
0.26081708 = (MATCH) max plus 0.01 times others of:
  0.26081708 = (MATCH) weight(subCategory:shoe^20.0 in 1407), product
of:
0.14011127 = queryWeight(subCategory:shoe^20.0), product of:
  20.0 = boost
  2.9783995 = idf(docFreq=17874, numDocs=129257)
  0.0023521234 = queryNorm
1.8614997 = (MATCH) fieldWeight(subCategory:shoe in 1407), product
of:
  1.0 = tf(termFreq(subCategory:shoe)=1)
  2.9783995 = idf(docFreq=17874, numDocs=129257)
  0.625 = fieldNorm(field=subCategory, doc=1407)


  
0.4086538 = (MATCH) sum of:
  0.4086538 = (MATCH) sum of:
0.19769925 = (MATCH) max plus 0.01 times others of:
  0.19769925 = (MATCH) weight(color:brown^10.0 in 75829), product of:
0.06819186 = queryWeight(color:brown^10.0), product of:
  10.0 = boost
  2.8991618 = idf(docFreq=19348, numDocs=129257)
  0.0023521234 = queryNorm
2.8991618 = (MATCH) fieldWeight(color:brown in 75829), product of:
  1.0 = tf(termFreq(color:brown)=1)
  2.8991618 = idf(docFreq=19348, numDocs=129257)
  1.0 = fieldNorm(field=color, doc=75829)
0.21095455 = (MATCH) max plus 0.01 times others of:
  0.20865366 = (MATCH) weight(subCategory:shoe^20.0 in 75829), product
of:
0.14011127 = queryWeight(subCategory:shoe^20.0), product of:
  20.0 = boost
  2.9783995 = idf(docFreq=17874, numDocs=129257)
  0.0023521234 = queryNorm
1.4891998 = (MATCH) fieldWeight(subCategory:shoe in 75829), product
of:
  1.0 = tf(termFreq(subCategory:shoe)=1)
  2.9783995 = idf(docFreq=17874, numDocs=129257)
  0.5 = fieldNorm(field=subCategory, doc=75829)
  0.028179625 = (MATCH) weight(productType:shoe^8.0 in 75829), product
of:
0.029127462 = queryWeight(productType:shoe^8.0), product of:
  8.0 = boost
  1.5479344 = idf(docFreq=74728, numDocs=129257)
  0.0023521234 = queryNorm
0.967459 = (MATCH) fieldWeight(productType:sho

Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Walter Underwood
What JVM and garbage collector settings? We are using the IBM JVM with
their concurrent generational collector. I would strongly recommend
trying a similar collector on your JVM. Hint: how much memory is in
use after a full GC? That is a good approximation to the working set.
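
As a rough illustration of that check, a minimal self-contained Java sketch (System.gc() is only a request, so treat the number as an approximation of the working set):

public class HeapAfterGc {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // ask for a full collection; the JVM may honour it lazily
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        System.out.println("Heap in use after GC: ~" + usedMb + " MB");
    }
}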

27GB is a very, very large heap. Is that really being used or is it
just filling up with garbage which makes the collections really long?

We run with a 4GB heap and really only need that to handle indexing
or starting new searchers. Searching only needs a 2GB heap for us.
Our full GC pauses for under a half second. Way longer than I'd like,
but that's Java (I still miss Python sometimes).

wunder

On 1/21/09 9:49 AM, "wojtekpia"  wrote:

> 
> I'm intermittently experiencing severe performance drops due to Java garbage
> collection. I'm allocating a lot of RAM to my Java process (27GB of the 32GB
> physically available). Under heavy load, the performance drops approximately
> every 10 minutes, and the drop lasts for 30-40 seconds. This coincides with
> the size of the old generation heap dropping from ~27GB to ~6GB.
> 
> Is there a way to reduce the impact of garbage collection? A couple ideas
> we've come up with (but haven't tried yet) are: increasing the minimum heap
> size, more frequent (but hopefully less costly) garbage collection.
> 
> Thanks,
> 
> Wojtek



Re: Word Delimiter struggles

2009-01-21 Thread Shalin Shekhar Mangar
On Mon, Jan 19, 2009 at 9:42 PM, David Shettler  wrote:

> Thank you Shalin, I'm in the process of implementing your suggestion,
> and it works marvelously.  Had to upgrade to solr 1.3, and had to hack
> up acts_as_solr to function correctly.
>
> Is there a way to receive a search for a given field, and have solr
> know to automatically check the two fields?  I suppose not.


If you use DisMax (defType=dismax) instead of the standard handler, the qf
parameter can be used to specify all the fields you want to search for the
given query.

http://wiki.apache.org/solr/DisMaxRequestHandler
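
For example, a minimal SolrJ sketch of such a request (the field names and boosts here are placeholders, not taken from your schema):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DismaxSearch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("open source");
        q.set("defType", "dismax");            // use the DisMax query parser
        q.set("qf", "title^2.0 description");  // placeholder fields to search across
        QueryResponse rsp = server.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}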

-- 
Regards,
Shalin Shekhar Mangar.


Re: Query Performance while updating the index

2009-01-21 Thread oleg_gnatovskiy

What exactly does Solr do when it receives a new index? How does it keep
serving while performing the updates? It seems that the part that causes the
slowdown is this transition.




Otis Gospodnetic wrote:
> 
> This is an old and long thread, and I no longer recall what the specific
> suggestions were.
> My guess is this has to do with the OS cache of your index files.  When
> you make the large index update, that OS cache is useless (old files are
> gone, new ones are in) and the OS cache has to get re-warmed and this takes
> time.
> 
> Are you optimizing your index before the update?  Do you *really* need to
> do that?
> How large is your update, what makes it big, and could you make it
> smaller?
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> - Original Message 
>> From: oleg_gnatovskiy 
>> To: solr-user@lucene.apache.org
>> Sent: Tuesday, January 20, 2009 6:19:46 PM
>> Subject: Re: Query Performance while updating the index
>> 
>> 
>> Hello again. It seems that we are still having these problems. Queries
>> take
>> as long as 20 minutes to get back to their average response time after a
>> large index update, so it doesn't seem like the problem is the 12 second
>> autowarm time. Are there any more suggestions for things we can try?
>> Taking
>> our servers out of the loop for as long as 20 minutes is a bit of a
>> hassle,
>> and a risk.
>> -- 
>> View this message in context: 
>> http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21573927.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21588779.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: problem with DIH and MySQL

2009-01-21 Thread Shalin Shekhar Mangar
I guess Noble meant the Solr log.

On Tue, Jan 20, 2009 at 9:29 PM, Nick Friedrich <
nick.friedr...@student.uni-magdeburg.de> wrote:

> no, there are no exceptions
> but I have to admit that I'm not sure what you mean by console
>
>
> Quoting Noble Paul നോബിള്‍ नोब्ळ्:
>
>  it got rolled back
>> any exceptions on solr console?
>>
>>
>> --
>> --Noble Paul
>>
>>
>
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: Sizing a Linux box for Solr?

2009-01-21 Thread Thomas Dowling
On 01/21/2009 12:25 PM, Matthew Runo wrote:
> At a certain level it will become better to have multiple smaller boxes
> rather than one huge one. I've found that even an old P4 with 2 gigs of
> ram has decent response time on our 150,000 item index with only a few
> users - but it quickly goes downhill if we get more than 5 or 6. How
> many documents are you going to be storing in your index? How much of
> them will be "stored" versus "indexed"? Will you be faceting on the
> results?

Thanks for the tip on multiple boxes.  We'll be hosting about 20
databases total.  A couple of them are in the 10- to 20-million record
range and a couple more are in the 5- to 10-million range.  It's highly
structured data and I anticipate a lot of faceting and indexing almost
all the fields.

> 
> In general, I'd recommend a 64 bit processor with enough ram to store
> your index in ram - but that might not be possible with "millions" of
> records. Our 150,000 item index is about a gig and a half when optimized
> but yours will likely be different depending on how much you store.
> Faceting takes more memory than pure searching as well.
> 

This is very helpful.  Thanks again.


-- 
Thomas Dowling


Re: Performance Hit for Zero Record Dataimport

2009-01-21 Thread wojtekpia

Created SOLR-974: https://issues.apache.org/jira/browse/SOLR-974

-- 
View this message in context: 
http://www.nabble.com/Performance-Hit-for-Zero-Record-Dataimport-tp21572935p21588634.html
Sent from the Solr - User mailing list archive at Nabble.com.



Performance "dead-zone" due to garbage collection

2009-01-21 Thread wojtekpia

I'm intermittently experiencing severe performance drops due to Java garbage
collection. I'm allocating a lot of RAM to my Java process (27GB of the 32GB
physically available). Under heavy load, the performance drops approximately
every 10 minutes, and the drop lasts for 30-40 seconds. This coincides with
the size of the old generation heap dropping from ~27GB to ~6GB. 

Is there a way to reduce the impact of garbage collection? A couple ideas
we've come up with (but haven't tried yet) are: increasing the minimum heap
size, more frequent (but hopefully less costly) garbage collection.

Thanks,

Wojtek

-- 
View this message in context: 
http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21588427.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Shalin Shekhar Mangar
Hi Fergus,

It seems a field it expects is missing from the XML.




I guess "fileAbsePath" is a typo? Can you check if that is the cause?


On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie  wrote:

> Shalin
>
> Downloaded the nightly for 21 Jan and tried DIH again. It's better but
> still broken. Dozens of embedded tags are stripped from documents,
> but it now fails every few documents for no reason I can see. Manually
> removing embedded tags causes a given problem document to be indexed,
> only to have it fail on one of the next few documents. I think the
> problem is still in stripHTML.
>
> Here is the traceback.
>
> Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
> INFO: Server startup in 3377 ms
> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter
> readIndexerProperties
> INFO: Read dataimport.properties
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
> INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import}
> status=0 QTime=13
> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter
> doFullImport
> INFO: Starting Full Import
> Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2
> deleteAll
> INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
> INFO: SolrDeletionPolicy.onInit: commits:num=2
>
>  
> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
>
>  
> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy
> updateCommits
> INFO: last commit = 1232539612131
> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder
> buildDocument
> SEVERE: Exception while processing: jc document : null
> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing
> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
> Processing Document # 9
>at
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
> at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>at
> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>... 9 more
> Caused by: java.util.NoSuchElementException
>at
> com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>at
> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>at
> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>at
> org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>at
> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>... 10 more
> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter
> doFullImport
> SEVERE: Full Import failed
> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing
> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
> Processing Document # 9
>at
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
> at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>at
> org.apache.solr.h

Re: DIH XPathEntityProcessor fails with docs containing

2009-01-21 Thread Shalin Shekhar Mangar
On Wed, Jan 21, 2009 at 6:05 PM, Fergus McMenemie  wrote:

>
> After looking looking at http://issues.apache.org/jira/browse/SOLR-964,
> where
> it seems this issue has been addressed, I had another go at indexing
> documents
> containing DOCTYPE. It failed as follows.
>
>
That patch has not been committed to the trunk yet. I'll take it up.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Performance Hit for Zero Record Dataimport

2009-01-21 Thread Shalin Shekhar Mangar
Yes please. Even though the fix is small, it is important enough to be
mentioned in the release notes.

On Wed, Jan 21, 2009 at 11:05 PM, wojtekpia  wrote:

>
> Thanks Shalin, a short circuit would definitely solve it. Should I open a
> JIRA issue?
>
>
> Shalin Shekhar Mangar wrote:
> >
> > I guess Data Import Handler still calls commit even if there were no
> > documents created. We can add a short circuit in the code to make sure
> > that
> > does not happen.
> >
>
> --
> View this message in context:
> http://www.nabble.com/Performance-Hit-for-Zero-Record-Dataimport-tp21572935p21588124.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: Performance Hit for Zero Record Dataimport

2009-01-21 Thread wojtekpia

Thanks Shalin, a short circuit would definitely solve it. Should I open a
JIRA issue? 


Shalin Shekhar Mangar wrote:
> 
> I guess Data Import Handler still calls commit even if there were no
> documents created. We can add a short circuit in the code to make sure
> that
> does not happen.
> 

-- 
View this message in context: 
http://www.nabble.com/Performance-Hit-for-Zero-Record-Dataimport-tp21572935p21588124.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Sizing a Linux box for Solr?

2009-01-21 Thread Matthew Runo
At a certain level it will become better to have multiple smaller  
boxes rather than one huge one. I've found that even an old P4 with 2  
gigs of ram has decent response time on our 150,000 item index with  
only a few users - but it quickly goes downhill if we get more than 5  
or 6. How many documents are you going to be storing in your index?  
How many of them will be "stored" versus "indexed"? Will you be  
faceting on the results?


In general, I'd recommend a 64 bit processor with enough ram to store  
your index in ram - but that might not be possible with "millions" of  
records. Our 150,000 item index is about a gig and a half when  
optimized but yours will likely be different depending on how much you  
store. Faceting takes more memory than pure searching as well.


I'm sure that we could work out some better suggestions with more  
information about your use case.


http://www.nabble.com/Solr---User-f14480.html is a great place to go  
for searching the solr user list.


-Matthew

On Jan 21, 2009, at 8:55 AM, Thomas Dowling wrote:


Is there a useful guide somewhere that suggests system configurations
for machines that will support multiple large-ish Solr indexes?  I'm
working on a group of library databases (journal article citations +
abstracts, mostly), and need to provide some sort of helpful  
information

to our hardware people.  Other than "lots", is there an answer for "We
have X millions of records, of Y average size, with Z peak  
simultaneous
users, so the memory needed for reasonable search performance is  
_"?
Or is the limiting factor on search performance going to be  
something else?


[Standard caveat: I did try checking the solr-user archives, but was
hampered by the fact that there's no search function.  The cobbler's
children go barefoot.]


--
Thomas Dowling
Ohio Library and Information Network
tdowl...@ohiolink.edu





Sizing a Linux box for Solr?

2009-01-21 Thread Thomas Dowling
Is there a useful guide somewhere that suggests system configurations
for machines that will support multiple large-ish Solr indexes?  I'm
working on a group of library databases (journal article citations +
abstracts, mostly), and need to provide some sort of helpful information
to our hardware people.  Other than "lots", is there an answer for "We
have X millions of records, of Y average size, with Z peak simultaneous
users, so the memory needed for reasonable search performance is _"?
 Or is the limiting factor on search performance going to be something else?

[Standard caveat: I did try checking the solr-user archives, but was
hampered by the fact that there's no search function.  The cobbler's
children go barefoot.]


-- 
Thomas Dowling
Ohio Library and Information Network
tdowl...@ohiolink.edu


Words that need protection from stemming, i.e., protwords.txt

2009-01-21 Thread David Woodward
Hi.

Any good protwords.txt out there?

In a fairly standard solr analyzer chain, we use the English Porter analyzer 
like so:



For most purposes the porter does just fine, but occasionally words come along 
that really don't work out too well, e.g.,

"maine" is stemmed to "main" - clearly goofing up precision about "Maine" 
without doing much good for variants of "main".

So - I have an entry for my protwords.txt. What else should go in there?

Thanks for your ideas,

Dave Woodward



Re: XMLResponsWriter or PHPResponseWriter, who is faster?

2009-01-21 Thread Marc Sturlese

I have been doing some testing (with System.currentTimeMillis) and the
difference is barely noticeable, though the PHPResponseWriter is a bit faster. I just
would like to be sure I am right. Does anybody know for sure?
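
For what it's worth, a crude client-side comparison can be sketched like this (assuming a local Solr instance with the PHP writer registered under wt=php; repeat the calls many times and discard the warm-up runs):

import java.io.InputStream;
import java.net.URL;

public class WriterTiming {

    // Time one request end to end: Solr builds the response with the given
    // writer and the client drains it over the wire.
    static long time(String wt) throws Exception {
        long start = System.currentTimeMillis();
        URL url = new URL("http://localhost:8983/solr/select?q=solr&rows=100&wt=" + wt);
        InputStream in = url.openStream();
        byte[] buf = new byte[8192];
        while (in.read(buf) != -1) {
            // just drain the stream
        }
        in.close();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("wt=xml : " + time("xml") + " ms");
        System.out.println("wt=php : " + time("php") + " ms");
    }
}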


Marc Sturlese wrote:
> 
> Hey there, I am using Solr as the back end and I don't mind in which format I get back
> the results. Which is faster for Solr to create the response:
> XMLResponseWriter or PHPResponseWriter?
> For my front end it is faster to process the response created by
> PHPResponseWriter, but I would not like to gain speed parsing the
> response only to lose it in the creation.
> Thanks in advance
> 
> 

-- 
View this message in context: 
http://www.nabble.com/XMLResponsWriter-or-PHPResponseWriter%2C-who-is-faster--tp21582204p21582667.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to schedule delta-import and auto commit

2009-01-21 Thread Manupriya

Hi Shalin,

I have not faced any memory problems as of now. But I had previously asked a
question regarding caching and memory
(http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html)-
 

--
So can I safely assume that we will not face any memory issue due to caching
even if we do not send commit that frequently? (If we won't send commit, then
a new searcher won't be initialized. So I can assume that the current searcher
will correctly manage cache without any memory issues.) 

Thanks, 
Manu
-

For which I got the following answer - 

No, 

You can't assume that. You have to set a good autoCommit value for your 
solrconfig.xml, so you don't run out of memory from not committing to Solr 
often, depending on your environment, memory share, doc size and update 
frequency. 
--

But my understanding is that the <autoCommit> tag works only if there is some
update in the index.

So I wanted to understand whether, if there are no updates, caching will create
some problems with memory.

Thanks,
Manu


Shalin Shekhar Mangar wrote:
> 
> On Wed, Jan 21, 2009 at 4:31 PM, Manupriya
> wrote:
> 
>>
>> 2. I had asked previously regarding caching and memory
>> management(
>> http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html
>> ).
>> So how do I schedule auto-commit for my Solr server.
>>
>> As per my understanding, the <autoCommit> tag in solrconfig.xml will call
>> commit
>> only if there has been an update. Right? So in case no document has been
>> added/updated, how can I call auto commit?
>> Note: My only purpose to call commit without document change is to close
>> current Searcher and open a new Searcher. This is for better memory
>> management with caching.
> 
> 
> This confuses me. Why do you think Solr is mis-managing the memory? What
> are
> the problems you are encountering?
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-schedule-delta-import-and-auto-commit-tp21580961p21582357.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: DIH XPathEntityProcessor fails with docs containing

2009-01-21 Thread Fergus McMenemie
Hello,

After looking looking at http://issues.apache.org/jira/browse/SOLR-964, where
it seems this issue has been addressed, I had another go at indexing documents
containing DOCTYPE. It failed as follows.

This was using the nightly build from 21-jan 2009.

The comments section within JIRA suggested my initial message had been replied
to twice; I somehow missed the replies in my inbox!

Regards Fergus.

Jan 21, 2009 12:15:21 PM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
INFO: Starting Full Import
Jan 21, 2009 12:15:21 PM org.apache.solr.core.SolrCore execute
INFO: [jdocs] webapp=/solr path=/dataimport params={command=show-config} 
status=0 QTime=0 
Jan 21, 2009 12:15:22 PM org.apache.solr.handler.dataimport.DocBuilder 
buildDocument
SEVERE: Exception while processing: jc document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed 
for xml, url:/Volumes/spare/ts/j/dtd/jxml/data/news/f/f2008/frp70450.xmlrows 
processed :0 Processing Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
at 
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:180)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1325)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
at 
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
at 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
at 
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:613)
Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: 
(was java.io.FileNotFoundException) /../config/jml-delivery-norm-2.1.dtd (No 
such file or directory)
 at [row,col {unknown-source}]: [3,81]
at 
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
... 27 more
Caused by: com.ctc.wstx.exc.WstxParsingException: (was 
java.io.FileNotFoundException) /../config/jml-delivery-norm-2.1.dtd (No such 
file or directory)
 at [row,col {unknown-source}]: [3,81]
at 
com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
at 
com.ctc.wstx.sr.ValidatingStreamReader.findDtdExtSubset(ValidatingStreamReader.java:475)
at 
com.ctc.wstx.sr.ValidatingStreamReader.finishDTD(ValidatingStreamReader.java:358)
at 
com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3351)
at 
com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1988)
at com.ctc

XMLResponsWriter or PHPResponseWriter, who is faster?

2009-01-21 Thread Marc Sturlese

Hey there, I am using Solr as the back end and I don't mind in which format I get back the
results. Which is faster for Solr to create the response:
XMLResponseWriter or PHPResponseWriter?
For my front end it is faster to process the response created by
PHPResponseWriter, but I would not like to gain speed parsing the response
only to lose it in the creation.
Thanks in advance

-- 
View this message in context: 
http://www.nabble.com/XMLResponsWriter-or-PHPResponseWriter%2C-who-is-faster--tp21582204p21582204.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to schedule delta-import and auto commit

2009-01-21 Thread Shalin Shekhar Mangar
On Wed, Jan 21, 2009 at 4:31 PM, Manupriya wrote:

>
> 2. I had asked previously regarding caching and memory
> management(
> http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html
> ).
> So how do I schedule auto-commit for my Solr server.
>
> As per my understanding, the <autoCommit> tag in solrconfig.xml will call
> commit
> only if there has been an update. Right? So in case no document has been
> added/updated, how can I call auto commit?
> Note: My only purpose to call commit without document change is to close
> current Searcher and open a new Searcher. This is for better memory
> management with caching.


This confuses me. Why do you think Solr is mis-managing the memory? What are
the problems you are encountering?

-- 
Regards,
Shalin Shekhar Mangar.


Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Fergus McMenemie
Shalin

Downloaded the nightly for 21 Jan and tried DIH again. It's better but
still broken. Dozens of embedded tags are stripped from documents,
but it now fails every few documents for no reason I can see. Manually
removing embedded tags causes a given problem document to be indexed,
only to have it fail on one of the next few documents. I think the
problem is still in stripHTML.

Here is the traceback.

Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 3377 ms
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter 
readIndexerProperties
INFO: Read dataimport.properties
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} 
status=0 QTime=13 
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
INFO: Starting Full Import
Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2

commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]

commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1232539612131
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder 
buildDocument
SEVERE: Exception while processing: jc document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed 
for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 
Processing Document # 9
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
at 
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
... 9 more
Caused by: java.util.NoSuchElementException
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
at 
org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
at 
org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
at 
org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
at 
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
... 10 more
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed 
for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 
Processing Document # 9
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java

Re: Error, when i update the rich text documents such as .doc, .ppt files.

2009-01-21 Thread matthieuL

Hi

Did you resolve the problem? Because I have the same problem.

Thanks

-- 
View this message in context: 
http://www.nabble.com/Error%2C-when-i-update-the-rich-text-documents-such-as-.doc%2C-.ppt-files.-tp20934026p21581483.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to schedule delta-import and auto commit

2009-01-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Wed, Jan 21, 2009 at 4:31 PM, Manupriya  wrote:
>
> Hi,
>
> Our Solr server is a standalone server and some web applications send HTTP
> query to search and get back the results.
>
> Now I have following two requirements -
>
> 1. We want to schedule 'delta-import' at a specified time, so that we don't
> have to explicitly send an HTTP request for delta-import.
> http://wiki.apache.org/solr/DataImportHandler mentions 'Schedule full
> imports and delta imports' but there is no detail. Even
> http://www.ibm.com/developerworks/library/j-solr-update/index.html mentions
> 'scheduler' but again there is no detail.
There is no feature in Solr to schedule commands at specific intervals;
you have to do it externally. If you are using Linux, you can
set up a cron job that invokes curl at predetermined intervals.
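
If cron is not available, the same can be done from any small external Java process; a sketch, assuming the default /solr/dataimport path and a 30-minute interval picked purely as an example:

import java.io.InputStream;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DeltaImportScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    // Trigger DIH's delta-import command over HTTP
                    URL url = new URL(
                        "http://localhost:8983/solr/dataimport?command=delta-import");
                    InputStream in = url.openStream();
                    in.close(); // the DIH status response is not needed here
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, 0, 30, TimeUnit.MINUTES);
    }
}
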
>
> 2. I had asked previously regarding caching and memory
> management(http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html).
> So how do I schedule auto-commit for my Solr server.
>
> As per my understanding, the <autoCommit> tag in solrconfig.xml will call commit
> only if there has been an update. Right? So in case no document has been
> added/updated, how can I call auto commit?
> Note: My only purpose to call commit without document change is to close
> current Searcher and open a new Searcher. This is for better memory
> management with caching.
>
> Please let me know if there are any resources I can refer to for these.
>
> Thanks,
> Manu
>
> --
> View this message in context: 
> http://www.nabble.com/How-to-schedule-delta-import-and-auto-commit-tp21580961p21580961.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
--Noble Paul


How to schedule delta-import and auto commit

2009-01-21 Thread Manupriya

Hi,

Our Solr server is a standalone server and some web applications send HTTP
query to search and get back the results.

Now I have following two requirements - 

1. We want to schedule 'delta-import' at a specified time, so that we don't
have to explicitly send an HTTP request for delta-import. 
http://wiki.apache.org/solr/DataImportHandler mentions 'Schedule full
imports and delta imports' but there is no detail. Even
http://www.ibm.com/developerworks/library/j-solr-update/index.html mentions
'scheduler' but again there is no detail. 

2. I had asked previously regarding caching and memory
management(http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html).
So how do I schedule auto-commit for my Solr server.

As per my understanding, the <autoCommit> tag in solrconfig.xml will call commit
only if there has been an update. Right? So in case no document has been
added/updated, how can I call auto commit? 
Note: My only purpose to call commit without document change is to close
current Searcher and open a new Searcher. This is for better memory
management with caching.
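
For reference, an explicit commit can be sent from outside even when no documents have changed -- a minimal SolrJ 1.3 sketch (the URL is a placeholder; whether this is a good idea for memory management is what Shalin questions above):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ExternalCommit {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // A commit closes the current searcher and opens (and warms) a new one
        server.commit();
    }
}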

Please let me know if there are any resources I can refer to for these.

Thanks,
Manu

-- 
View this message in context: 
http://www.nabble.com/How-to-schedule-delta-import-and-auto-commit-tp21580961p21580961.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Problem in Date Unmarshalling from NamedListCodec.

2009-01-21 Thread Luca Molteni
I've solved the problem.

It was a time zone problem. :)
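
Concretely, the same instant formatted in UTC versus the local Europe/Rome zone shows the one-hour shift. A small sketch, using 1199142000000 ms, i.e. the 2007-12-31T23:00:00Z that the admin console reported:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeZoneCheck {
    public static void main(String[] args) {
        // 2007-12-31T23:00:00Z expressed as milliseconds since the epoch
        Date d = new Date(1199142000000L);

        SimpleDateFormat utc = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        utc.setTimeZone(TimeZone.getTimeZone("UTC"));

        SimpleDateFormat rome = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss zzz");
        rome.setTimeZone(TimeZone.getTimeZone("Europe/Rome"));

        System.out.println(utc.format(d));   // 2007-12-31T23:00:00Z (what Solr stores/returns)
        System.out.println(rome.format(d));  // 2008-01-01 00:00:00 CET (the local rendering)
    }
}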

L.M.



2009/1/21 Luca Molteni :
> Hello list,
>
> Using SolrJ with Solr 1.3 stable, NamedListCodec's readVal
> method (line 161) unmarshals the number
>
> 119914200
>
> as a date (1 January 2008),
>
> When executing the same query with the Solr administration console,
> I get a different date value:
>
> 2007-12-31T23:00:00Z
>
> It seems like there is a one-hour difference between the two.
>
> At first, I thought about a local time zone issue (I'm in Milan, Italy), but
> I've made some tries, and using the Date and Calendar constructors
> with the right locale gives me the first of January.
>
> Could it be that the date gets marshalled in the wrong way?
>
> Thank you very much.
>
> L.M.
>


Re: SOLR Problem with special chars

2009-01-21 Thread Kraus, Ralf | pixelhouse GmbH

Otis Gospodnetic schrieb:

now it works :

   positionIncrementGap="100">

   
   
   words="stopwords.txt"/>

   
   max="50" />

   
   
   language="German" />
   protected="protwords.txt" />

   

   
   
   synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
   words="stopwords.txt"/>

   
   
   
   language="German" />
   protected="protwords.txt" />

   
   


Greets,

Ralf


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Wed, Jan 21, 2009 at 3:42 PM, Jaco  wrote:
> Thanks for the fast replies!
>
> It appears that I made a (probably classic) error... I didn't make the
> change to solrconfig.xml to include the <deletionPolicy> when applying the
> upgrade. I have included it now, but the slave is not cleaning up. Will this be
> done at some point automatically? Can I trigger this?
Unfortunately, no.
Lucene is supposed to clean up these old commit points automatically
after each commit. Even if the <deletionPolicy> is not specified, the
default is supposed to take effect.
>
> User access rights for the user are OK; this user is allowed to do anything
> in the Solr data directory (Tomcat service is running from SYSTEM account
> (Windows)).
>
> Thanks, regards,
>
> Jaco.
>
>
> 2009/1/21 Shalin Shekhar Mangar 
>
>> Hi,
>>
>> There shouldn't be so many files on the slave. Since the empty index.x
>> folders are not getting deleted, is it possible that the Solr process user does
>> not have enough privileges to delete files/folders?
>>
>> Also, have you made any changes to the IndexDeletionPolicy configuration?
>>
>> On Wed, Jan 21, 2009 at 2:15 PM, Jaco  wrote:
>>
>> > Hi,
>> >
>> > I'm running Solr nightly build of 20.12.2008, with patch as discussed on
>> > http://markmail.org/message/yq2ram4f3jblermd, using Solr replication.
>> >
>> > On various systems running, I see that the disk space consumed on the
>> slave
>> > is much higher than on the master. One example:
>> > - Master: 30 GB in 138 files
>> > - Slave: 152 GB in 3,941 files
>> >
>> > Can anybody tell me what to do to prevent this from happening, and how to
>> > clean up the slave? Also, there are quite some empty index.xxx
>> > directories sitting in the slaves data dir. Can these be safely removed?
>> >
>> > Thanks a lot in advance, bye,
>> >
>> > Jaco.
>> >
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>



-- 
--Noble Paul


Problem in Date Unmarshalling from NamedListCodec.

2009-01-21 Thread Luca Molteni
Hello list,

Using SolrJ with Solr 1.3 stable, NamedListCodec's readVal
method (line 161) unmarshals the number

119914200

as a date (1 January 2008),

When executing the same query with the Solr administration console,
I get a different date value:

2007-12-31T23:00:00Z

It seems like there is a one-hour difference between the two.

At first, I thought about a local time zone issue (I'm in Milan, Italy), but
I've made some tries, and using the Date and Calendar constructors
with the right locale gives me the first of January.

Could it be that the date gets marshalled in the wrong way?

Thank you very much.

L.M.


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Jaco
Thanks for the fast replies!

It appears that I made a (probably classic) error... I didn't make the
change to solrconfig.xml to include the <deletionPolicy> when applying the
upgrade. I have included it now, but the slave is not cleaning up. Will this be
done at some point automatically? Can I trigger this?

User access rights for the user are OK; this user is allowed to do anything
in the Solr data directory (Tomcat service is running from SYSTEM account
(Windows)).

Thanks, regards,

Jaco.


2009/1/21 Shalin Shekhar Mangar 

> Hi,
>
> There shouldn't be so many files on the slave. Since the empty index.x
> folders are not getting deleted, is it possible that the Solr process user does
> not have enough privileges to delete files/folders?
>
> Also, have you made any changes to the IndexDeletionPolicy configuration?
>
> On Wed, Jan 21, 2009 at 2:15 PM, Jaco  wrote:
>
> > Hi,
> >
> > I'm running Solr nightly build of 20.12.2008, with patch as discussed on
> > http://markmail.org/message/yq2ram4f3jblermd, using Solr replication.
> >
> > On various systems running, I see that the disk space consumed on the
> slave
> > is much higher than on the master. One example:
> > - Master: 30 GB in 138 files
> > - Slave: 152 GB in 3,941 files
> >
> > Can anybody tell me what to do to prevent this from happening, and how to
> > clean up the slave? Also, there are quite some empty index.xxx
> > directories sitting in the slaves data dir. Can these be safely removed?
> >
> > Thanks a lot in advance, bye,
> >
> > Jaco.
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: SOLR Problem with special chars

2009-01-21 Thread Kraus, Ralf | pixelhouse GmbH

Otis Gospodnetic schrieb:

Ralf,

Can you paste the part of your schema.xml where you defined the relevant field?

Otis


Sure !

   positionIncrementGap="100">

   
   
   
   language="German" />

   
  
   

   
   
   language="German" />

   
  
   


Greets


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Shalin Shekhar Mangar
Hi,

There shouldn't be so many files on the slave. Since the empty index.x
folders are not getting deleted, is it possible that the Solr process user does
not have enough privileges to delete files/folders?

Also, have you made any changes to the IndexDeletionPolicy configuration?

On Wed, Jan 21, 2009 at 2:15 PM, Jaco  wrote:

> Hi,
>
> I'm running Solr nightly build of 20.12.2008, with patch as discussed on
> http://markmail.org/message/yq2ram4f3jblermd, using Solr replication.
>
> On various systems running, I see that the disk space consumed on the slave
> is much higher than on the master. One example:
> - Master: 30 GB in 138 files
> - Slave: 152 GB in 3,941 files
>
> Can anybody tell me what to do to prevent this from happening, and how to
> clean up the slave? Also, there are quite some empty index.xxx
> directories sitting in the slaves data dir. Can these be safely removed?
>
> Thanks a lot in advance, bye,
>
> Jaco.
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
The index.xxx directories are supposed to be deleted (automatically);
you can safely delete them.

But, I am wondering why the index files in the slave did not get
deleted. By default the deletionPolicy is KeepOnlyLastCommit.



On Wed, Jan 21, 2009 at 2:15 PM, Jaco  wrote:
> Hi,
>
> I'm running Solr nightly build of 20.12.2008, with patch as discussed on
> http://markmail.org/message/yq2ram4f3jblermd, using Solr replication.
>
> On various systems running, I see that the disk space consumed on the slave
> is much higher than on the master. One example:
> - Master: 30 GB in 138 files
> - Slave: 152 GB in 3,941 files
>
> Can anybody tell me what to do to prevent this from happening, and how to
> clean up the slave? Also, there are quite some empty index.xxx
> directories sitting in the slaves data dir. Can these be safely removed?
>
> Thanks a lot in advance, bye,
>
> Jaco.
>



-- 
--Noble Paul


Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Rafał Kuć
Hello,

> Hi,

> I'm running Solr nightly build of 20.12.2008, with patch as discussed on
> http://markmail.org/message/yq2ram4f3jblermd, using Solr replication.

> On various systems running, I see that the disk space consumed on the slave
> is much higher than on the master. One example:
> - Master: 30 GB in 138 files
> - Slave: 152 GB in 3,941 files

> Can anybody tell me what to do to prevent this from happening, and how to
> clean up the slave? Also, there are quite some empty index.xxx
> directories sitting in the slaves data dir. Can these be safely removed?

> Thanks a lot in advance, bye,

> Jaco.

Slaves use much more disk space after some time because they keep snapshots
of the index you pull from the master. Look at the snapcleaner script;
you can use it to automatically clean the data directory.

I hope that helps.


-- 
Regards,
 Rafał Kuć



Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Jaco
Hi,

I'm running Solr nightly build of 20.12.2008, with patch as discussed on
http://markmail.org/message/yq2ram4f3jblermd, using Solr replication.

On various systems running, I see that the disk space consumed on the slave
is much higher than on the master. One example:
- Master: 30 GB in 138 files
- Slave: 152 GB in 3,941 files

Can anybody tell me what to do to prevent this from happening, and how to
clean up the slave? Also, there are quite some empty index.xxx
directories sitting in the slaves data dir. Can these be safely removed?

Thanks a lot in advance, bye,

Jaco.