Re: Newbie Design Questions

2009-01-21 Thread Gunaranjan Chandraraju
Hi, yes, the XML is inside the DB in a CLOB. Would love to use XPath inside SQLEntityProcessor as it will save me tons of trouble with file-dumping (given that I am not able to post it). This is how I set up my DIH for DB import: driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracl
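
A sketch of the kind of DIH configuration being discussed: a JDBC data source reads the CLOB column, and a FieldReaderDataSource (added after 1.3) feeds its content to an XPathEntityProcessor. The connection URL, table, column, and XPath expressions here are illustrative placeholders, not taken from the original message:

```xml
<dataConfig>
  <!-- JDBC source for the rows; connection details are placeholders -->
  <dataSource name="db" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@//dbhost:1521/ORCL"
              user="user" password="pass"/>
  <!-- reads XML from a field of the outer entity instead of a file/URL -->
  <dataSource name="fld" type="FieldReaderDataSource"/>
  <document>
    <entity name="rec" dataSource="db"
            query="select id, xml_clob from records">
      <entity name="x" dataSource="fld" processor="XPathEntityProcessor"
              dataField="rec.xml_clob" forEach="/record">
        <field column="title" xpath="/record/title"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```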

Re: Using Threading while Indexing.

2009-01-21 Thread Chris Hostetter
: I was trying to index three sets of documents having 2000 articles using : three threads of embedded solr server. But while indexing, it gives me the : exception "org.apache.lucene.store.LockObtainFailedException: Lock something doesn't sound right here ... i'm no expert on embedding solr, i think

Re: Question about query syntax

2009-01-21 Thread Chris Hostetter
: If I query for 'ferrar*' on my index, I will get 'ferrari' and 'red ferrari' : as a result. And that's fine. But if I try to query for 'red ferrar*', I : have to put it between double quotes as I want to grant that it will be used : as only one term, but the '*' is being ignored, as I don't get

Re: How to get XML response from CommonsHttpSolrServer through QueryResponse?

2009-01-21 Thread Chris Hostetter
: Because I used server.setParser(new XMLResponseParser()), I get the : wt=xml parameter in the responseHeader, but the format of the : responseHeader is clearly no XML at all. I expect Solr does output XML, : but that the QueryResponse, when I print its contents, formats this as : the string

Re: Issue with dismaxrequestHandler for date fields

2009-01-21 Thread Chris Hostetter
: Still search on any field (?q=searchTerm) gives following error : "The request sent by the client was syntactically incorrect (Invalid Date : String:'searchTerm')." because "searchTerm" isn't a valid date string : Is this valid to define *_dt (i.e. date fields ) in solrConfig.xml ? if you re
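
The usual fix for this class of error is to keep date fields out of the dismax qf list, so a free-text term is never parsed as a Date. A minimal sketch (handler and field names are illustrative):

```xml
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- only text fields here: a bare ?q=searchTerm is not a valid
         date string, so *_dt fields must not appear in qf -->
    <str name="qf">title^2.0 description</str>
  </lst>
</requestHandler>
```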

Re: Passing analyzer to the queryparser plugin

2009-01-21 Thread Chris Hostetter
: Is there a way to pass the analyzer to the query parser plugin Solr uses a variant of the PerFieldAnalzyer -- you specify in the schema.xml what analyzer you want to use for each field. if you have some sort of *really* exotic situation, you can always design a custom QParser that looks at s
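
In other words, the analyzer is chosen per field type in schema.xml rather than passed to the parser. A minimal sketch (type and field names are illustrative):

```xml
<fieldType name="text_basic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- every query against "title" now goes through the analyzer above -->
<field name="title" type="text_basic" indexed="true" stored="true"/>
```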

Re: What can be the reason for stopping solr work after some time?

2009-01-21 Thread Chris Hostetter
: i'm a newbie with solr. We have installed it together with ezfind from : EZ Publish web sites and it is working. But on one of the servers we : have this kind of problem. It works for example for 3 hours, and then at : one moment it stops working; searching and indexing do not work. it's prett

Re: Newbie Design Questions

2009-01-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Thu, Jan 22, 2009 at 7:02 AM, Gunaranjan Chandraraju wrote: > Thanks > > Yes the source of data is a DB. However the xml is also posted on updates > via publish framework. So I can just plug in an adapter here to listen for > changes and post to SOLR. I was trying to use the XPathProcessor i

Re: solr-duplicate post management

2009-01-21 Thread Chris Hostetter
: what i need is, to log the existing urlid and new urlid (of course both will : not be the same), when a .xml file of the same id (unique field) is posted. : : I want to make this by modifying the solr source. Which file do i need to : modify so that i could get the above details in the log? : : I tried with

Re: Newbie Design Questions

2009-01-21 Thread Gunaranjan Chandraraju
Thanks Yes the source of data is a DB. However the xml is also posted on updates via publish framework. So I can just plug in an adapter here to listen for changes and post to SOLR. I was trying to use the XPathProcessor inside the SQLEntityProcessor and this did not work (using 1.3 -

Re: Newbie Design Questions

2009-01-21 Thread Gunaranjan Chandraraju
Hi Grant Thanks for the reply. My response below. The data is stored as XMLs. Each record/entity corresponds to an XML. The XML is of the form ... I have currently put it in a schema.xml and DIH handler as follows schema.xml data-import.xml

Re: numFound problem

2009-01-21 Thread Koji Sekiguchi
Ron Chan wrote: I'm using out of the box Solr 1.3 that I had just downloaded, so I guess it is the StandardAnalyzer It seems WordDelimiterFilter worked for you. Go to Admin console, click analysis, then give: Field name: text Field value (Index): SD/DDeck verbose output: checked highlight

RE: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Feak, Todd
A ballpark calculation would be Collected Amount (From GC logging)/ # of Requests. The GC logging can tell you how much it collected each time, no need to try and snapshot before and after heap sizes. However (big caveat here), this is a ballpark figure. The garbage collector is not guaranteed t

Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread wojtekpia
(Thanks for the responses) My filterCache hit rate is ~60% (so I'll try making it bigger), and I am CPU bound. How do I measure the size of my per-request garbage? Is it (total heap size before collection - total heap size after collection) / # of requests to cause a collection? I'll try your
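
The ballpark calculation being asked about can be sketched as follows; the figures are made up for illustration:

```python
def garbage_per_request(collected_bytes, num_requests):
    """Ballpark per-request garbage: bytes the collector reclaimed in one
    GC cycle (taken from GC logging) divided by the requests served since
    the previous cycle. No before/after heap snapshots needed."""
    return collected_bytes / num_requests

# e.g. a collection that reclaimed 512 MB after 2000 requests:
per_request = garbage_per_request(512 * 1024 * 1024, 2000)  # ~262 KB/request
```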

Re: XMLResponseWriter or PHPResponseWriter, which is faster?

2009-01-21 Thread Marc Sturlese
After some tests with System.currentTimeMillis I have seen that the difference is almost unappreciable ... but PHPResponseWriter was a little bit faster... Marc Sturlese wrote: > > Hey there, I am using Solr as backend and I don't mind how I get back > the results. Which is faster for Solr to create the r

Suppressing logging for /admin/ping requests

2009-01-21 Thread Todd Breiholz
Is there anyway to suppress the logging of the /admin/ping requests? We have HAProxy configured to do health checks to this URI every couple of seconds and it is really cluttering our logs. I'd still like to see the logging from the other requestHandlers. Thanks! Todd

Re: numFound problem

2009-01-21 Thread Erick Erickson
Oops, missed that. I'll have to defer to folks with more SOLR experience than I have, I've pretty much worked in Lucene. Best Erick On Wed, Jan 21, 2009 at 3:57 PM, Ron Chan wrote: > I'm using out of the box Solr 1.3 that I had just downloaded, so I guess it > is the StandardAnalyzer > > bu

Re: numFound problem

2009-01-21 Thread Ron Chan
I'm using out of the box Solr 1.3 that I had just downloaded, so I guess it is the StandardAnalyzer but shouldn't the returned docs equal numFound? - Original Message - From: "Erick Erickson" To: solr-user@lucene.apache.org Sent: Wednesday, 21 January, 2009 20:49:56 GMT +00:00 GMT

Re: numFound problem

2009-01-21 Thread Erick Erickson
It depends (tm). What analyzer are you using when indexing? I'd expect (though I haven't checked) that StandardAnalyzer would break SD/DDeck into two tokens, SD and DDeck, which corresponds nicely with what you're reporting. Other analyzers and/or filters are easy to specify. I'd recommend get
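
The behaviour described, where "SD/DDeck" matches a search for "SD DDeck", can be sketched with a crude stand-in for that kind of punctuation-splitting tokenizer (an illustration only, not Lucene's actual implementation):

```python
import re

def split_on_punctuation(text):
    """Crude stand-in for an analyzer that breaks words on punctuation:
    "SD/DDeck" yields the same two tokens as "SD DDeck"."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

print(split_on_punctuation("SD/DDeck"))  # ['SD', 'DDeck']
print(split_on_punctuation("SD DDeck"))  # ['SD', 'DDeck']
```

Since both strings index to the same tokens, the six "SD/DDeck" docs match the query alongside the 34 literal "SD DDeck" docs.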

Re: storing complex types in a multiValued field

2009-01-21 Thread Chris Hostetter
: > I guess most people store it as a simple string "key(separator)value". Is or use dynamic fields to put the "key" into the field name... : > > > multiValued="true" /> ...could be... ...then index value if you omitNorms, the overhead of having many fields should be low - although i'm no
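
The dynamic-field approach mentioned here can be sketched like this (the `kv_` prefix is illustrative):

```xml
<!-- any field named kv_<key> is accepted without being declared
     explicitly; omitNorms keeps the per-field overhead low -->
<dynamicField name="kv_*" type="string" indexed="true" stored="true"
              multiValued="true" omitNorms="true"/>
```

A pair like color=red is then indexed as field `kv_color` with value `red`, instead of a single field holding "color:red".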

numFound problem

2009-01-21 Thread Ron Chan
I have a test search which I know should return 34 docs, and it does; however, numFound says 40. With debug enabled, I can see the 40 it has found. My search looks for "SD DDeck" in the description; 34 of them had "SD DDeck", with 6 of them having "SD/DDeck". Now, I can probably work around it if

Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Walter Underwood
Have you tried different sizes for the nursery? It should be several times larger than the per-request garbage. Also, check your cache sizes. Objects evicted from the cache are almost always tenured, so those will add to the time needed for a full GC. Guess who was tuning GC for a week or two in

RE: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Feak, Todd
From a high-level view, there is a certain amount of garbage collection that must occur. That garbage is generated per request, through a variety of means (buffers, request, response, cache expulsion). The only thing that JVM parameters can address is *when* that collection occurs. It can occur

Re: Problem with WT parameter when upgrading from Solr1.2 to solr1.3

2009-01-21 Thread Chris Hostetter
: Right, that's probably the crux of it - distributed search required : some extensions to response writers... things like handling : SolrDocument and SolrDocumentList. Grrr... that's right, i forgot that there wasn't any way to make SolrDocumentList implement DocList ... and i don't think this

RE: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Feak, Todd
The large drop in old generation from 27GB->6GB indicates that things are getting into your old generation prematurely. They really don't need to get there at all, and should be collected sooner (more frequently). Look into increasing young generation sizes via JVM parameters. Also look into concu

Re: Querying back with top few results in the same XMLWriter!

2009-01-21 Thread Chris Hostetter
: I am using a ranking algorithm by modifying the XMLWriter to use a : formulation which takes the top 3 results and query with the 3 results and : now presents the result with as function of the results from these 3 : queries. Can anyone reply if I can take the top 3results and query with them :

Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Alexander Ramos Jardim
I would say that running more Solr instances, each one with its own data directory, could help if you can classify your docs in such a way that you can put "A"-type docs in index "A", "B"-type docs in index "B", and so on. 2009/1/21 wojtekpia > > I'm using a recent version of Sun's JVM (6 update

Re: Sizing a Linux box for Solr?

2009-01-21 Thread Erick Erickson
One other useful piece of information would be how big you expect your indexes to be. Which you should be able to estimate quite easily by indexing, say, 20,000 documents from the relevant databases. Of particular interest will be the delta between the size of the index at, say, 10,000 documents a
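
The delta-based estimate suggested here can be sketched as simple linear extrapolation; the sizes below are made up for illustration:

```python
def projected_index_size_mb(size_at_10k, size_at_20k, total_docs):
    """Extrapolate from the marginal size of the second 10,000 docs:
    the delta matters more than the absolute size, since the first
    batch also carries fixed per-index overhead."""
    per_doc = (size_at_20k - size_at_10k) / 10_000
    return size_at_20k + per_doc * (total_docs - 20_000)

# e.g. 100 MB at 10k docs, 180 MB at 20k docs, projected to 1M docs:
estimate = projected_index_size_mb(100, 180, 1_000_000)  # ~8020 MB
```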

Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread wojtekpia
I'm using a recent version of Sun's JVM (6 update 7) and am using the concurrent generational collector. I've tried several other collectors, none seemed to help the situation. I've tried reducing my heap allocation. The search performance got worse as I reduced the heap. I didn't monitor the gar

Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Fergus McMenemie
>Hi Fergus, > >It seems a field it is expecting is missing from the XML. You mean there is some field in the document we are indexing that is missing? > >sourceColName="*fileAbsePath*"/> > >I guess "fileAbsePath" is a typo? Can you check if that is the cause? Well spotted. I had made a mess of sa

Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Alexander Ramos Jardim
How many boxes running your index? If it is just one, maybe distributing your index will get you a better performance during garbage collection. 2009/1/21 wojtekpia > > I'm intermittently experiencing severe performance drops due to Java > garbage > collection. I'm allocating a lot of RAM to my

Re: Sizing a Linux box for Solr?

2009-01-21 Thread Alexander Ramos Jardim
Definitely you will want to have more than one box for your index. You can take a look at distributed search and multicore at the wiki. 2009/1/21 Thomas Dowling > On 01/21/2009 12:25 PM, Matthew Runo wrote: > > At a certain level it will become better to have multiple smaller boxes > > rather

Incorrect Scoring

2009-01-21 Thread Jeff Newburn
Can someone please make sense of why the following occurs in our system. The first item barely matches but scores higher than the second one that matches all over the place. The second one is a MUCH better match but has a worse score. These are in the same query results. All I can see are the nor

Re: Performance "dead-zone" due to garbage collection

2009-01-21 Thread Walter Underwood
What JVM and garbage collector setting? We are using the IBM JVM with their concurrent generational collector. I would strongly recommend trying a similar collector on your JVM. Hint: how much memory is in use after a full GC? That is a good approximation to the working set. 27GB is a very, very l

Re: Word Delimiter struggles

2009-01-21 Thread Shalin Shekhar Mangar
On Mon, Jan 19, 2009 at 9:42 PM, David Shettler wrote: > Thank you Shalin, I'm in the process of implementing your suggestion, > and it works marvelously. Had to upgrade to solr 1.3, and had to hack > up acts_as_solr to function correctly. > > Is there a way to receive a search for a given field

Re: Query Performance while updating the index

2009-01-21 Thread oleg_gnatovskiy
What exactly does Solr do when it receives a new Index? How does it keep serving while performing the updates? It seems that the part that causes the slowdown is this transition. Otis Gospodnetic wrote: > > This is an old and long thread, and I no longer recall what the specific > suggestions

Re: problem with DIH and MySQL

2009-01-21 Thread Shalin Shekhar Mangar
I guess Noble meant the Solr log. On Tue, Jan 20, 2009 at 9:29 PM, Nick Friedrich < nick.friedr...@student.uni-magdeburg.de> wrote: > no, there are no exceptions > but I have to admit, that I'm not sure what you mean with console > > > Zitat von Noble Paul നോബിള്‍ नोब्ळ् : > > it got rolled bac

Re: Sizing a Linux box for Solr?

2009-01-21 Thread Thomas Dowling
On 01/21/2009 12:25 PM, Matthew Runo wrote: > At a certain level it will become better to have multiple smaller boxes > rather than one huge one. I've found that even an old P4 with 2 gigs of > ram has decent response time on our 150,000 item index with only a few > users - but it quickly goes down

Re: Performance Hit for Zero Record Dataimport

2009-01-21 Thread wojtekpia
Created SOLR 974: https://issues.apache.org/jira/browse/SOLR-974 -- View this message in context: http://www.nabble.com/Performance-Hit-for-Zero-Record-Dataimport-tp21572935p21588634.html Sent from the Solr - User mailing list archive at Nabble.com.

Performance "dead-zone" due to garbage collection

2009-01-21 Thread wojtekpia
I'm intermittently experiencing severe performance drops due to Java garbage collection. I'm allocating a lot of RAM to my Java process (27GB of the 32GB physically available). Under heavy load, the performance drops approximately every 10 minutes, and the drop lasts for 30-40 seconds. This coinci

Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Shalin Shekhar Mangar
Hi Fergus, It seems a field it is expecting is missing from the XML. I guess "fileAbsePath" is a typo? Can you check if that is the cause? On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie wrote: > Shalin > > Downloaded nightly for 21jan and tried DIH again. Its better but > still broken.

Re: DIH XPathEntityProcessor fails with docs containing

2009-01-21 Thread Shalin Shekhar Mangar
On Wed, Jan 21, 2009 at 6:05 PM, Fergus McMenemie wrote: > > After looking at http://issues.apache.org/jira/browse/SOLR-964, > where > it seems this issue has been addressed, I had another go at indexing > documents > containing DOCTYPE. It failed as follows. > > That patch has not been c

Re: Performance Hit for Zero Record Dataimport

2009-01-21 Thread Shalin Shekhar Mangar
Yes please. Even though the fix is small, it is important enough to be mentioned in the release notes. On Wed, Jan 21, 2009 at 11:05 PM, wojtekpia wrote: > > Thanks Shalin, a short circuit would definitely solve it. Should I open a > JIRA issue? > > > Shalin Shekhar Mangar wrote: > > > > I guess

Re: Performance Hit for Zero Record Dataimport

2009-01-21 Thread wojtekpia
Thanks Shalin, a short circuit would definitely solve it. Should I open a JIRA issue? Shalin Shekhar Mangar wrote: > > I guess Data Import Handler still calls commit even if there were no > documents created. We can add a short circuit in the code to make sure > that > does not happen. > --

Re: Sizing a Linux box for Solr?

2009-01-21 Thread Matthew Runo
At a certain level it will become better to have multiple smaller boxes rather than one huge one. I've found that even an old P4 with 2 gigs of ram has decent response time on our 150,000 item index with only a few users - but it quickly goes downhill if we get more than 5 or 6. How many do

Sizing a Linux box for Solr?

2009-01-21 Thread Thomas Dowling
Is there a useful guide somewhere that suggests system configurations for machines that will support multiple large-ish Solr indexes? I'm working on a group of library databases (journal article citations + abstracts, mostly), and need to provide some sort of helpful information to our hardware pe

Words that need protection from stemming, i.e., protwords.txt

2009-01-21 Thread David Woodward
Hi. Any good protwords.txt out there? In a fairly standard solr analyzer chain, we use the English Porter analyzer like so: For most purposes the porter does just fine, but occasionally words come along that really don't work out too well, e.g., "maine" is stemmed to "main" - clearly goofing
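
For reference, the stemmer factories of that era take a `protected` attribute pointing at exactly such a file; a sketch (the filter class name may differ across Solr versions):

```xml
<!-- words listed in protwords.txt (one per line, e.g. "maine")
     bypass the stemmer entirely -->
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
```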

Re: XMLResponseWriter or PHPResponseWriter, which is faster?

2009-01-21 Thread Marc Sturlese
I have been doing some testing (with System.currentTimeMillis) and the difference is almost unappreciable, but PHPResponseWriter is a bit faster; I just would like to be sure I am right. Does anybody know it for sure? Marc Sturlese wrote: > > Hey there, I am using Solr as backend and I don't mind how

Re: How to schedule delta-import and auto commit

2009-01-21 Thread Manupriya
Hi Shalin, I have not faced any memory problems as of now. But I had perviously asked a question regarding caching and memory (http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html)- --

Re: DIH XPathEntityProcessor fails with docs containing

2009-01-21 Thread Fergus McMenemie
Hello, After looking at http://issues.apache.org/jira/browse/SOLR-964, where it seems this issue has been addressed, I had another go at indexing documents containing DOCTYPE. It failed as follows. This was using the nightly build from 21-jan 2009. The comments section within jira sugges

XMLResponseWriter or PHPResponseWriter, which is faster?

2009-01-21 Thread Marc Sturlese
Hey there, I am using Solr as backend and I don't mind how I get back the results. Which is faster for Solr to create the response, using XMLResponseWriter or PHPResponseWriter? For my front end it is faster to process the response created by PHPResponseWriter, but I would not like to improve speed p

Re: How to schedule delta-import and auto commit

2009-01-21 Thread Shalin Shekhar Mangar
On Wed, Jan 21, 2009 at 4:31 PM, Manupriya wrote: > > 2. I had asked peviously regarding caching and memory > management( > http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html > ). > So how do I schedule auto-commit for my Solr server. > >

Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Fergus McMenemie
Shalin Downloaded nightly for 21jan and tried DIH again. It's better but still broken. Dozens of embedded tags are stripped from documents, but it now fails every few documents for no reason I can see. Manually removing embedded tags causes a given problem document to be indexed, only to have it fai

Re: Error, when i update the rich text documents such as .doc, .ppt files.

2009-01-21 Thread matthieuL
Hi, did you resolve the problem? Because I have the same problem. Thanks -- View this message in context: http://www.nabble.com/Error%2C-when-i-update-the-rich-text-documents-such-as-.doc%2C-.ppt-files.-tp20934026p21581483.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to schedule delta-import and auto commit

2009-01-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Wed, Jan 21, 2009 at 4:31 PM, Manupriya wrote: > > Hi, > > Our Solr server is a standalone server and some web applications send HTTP > query to search and get back the results. > > Now I have following two requirements - > > 1. we want to schedule 'delta-import' at a specified time. So that we

How to schedule delta-import and auto commit

2009-01-21 Thread Manupriya
Hi, Our Solr server is a standalone server and some web applications send HTTP queries to search and get back the results. Now I have the following two requirements - 1. We want to schedule 'delta-import' at a specified time, so that we don't have to explicitly send an HTTP request for delta-import.
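
For requirement 1, the common approach is an external scheduler hitting DIH's HTTP command; a sketch of a crontab entry (the host, port, and handler path are assumptions from a default setup):

```shell
# crontab: run a DIH delta-import every night at 02:00;
# DIH commits by default when the import finishes
0 2 * * * curl -s 'http://localhost:8983/solr/dataimport?command=delta-import'
```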

Re: Problem in Date Unmarshalling from NamedListCodec.

2009-01-21 Thread Luca Molteni
I've solved the problem. It was a time zone problem. :) L.M. 2009/1/21 Luca Molteni : > Hello list, > > Using SolrJ with Solr 1.3 stable, namedlistcodec unmarshal in readVal > method (line 161) the number > > 119914200 > > as a date (1 January 2008), > > While executing the same query with

Re: SOLR Problem with special chars

2009-01-21 Thread Kraus, Ralf | pixelhouse GmbH
Otis Gospodnetic schrieb: now it works : positionIncrementGap="100"> words="stopwords.txt"/> max="50" /> language="German" /> protected="protwords.txt

Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Wed, Jan 21, 2009 at 3:42 PM, Jaco wrote: > Thanks for the fast replies! > > It appears that I made a (probably classical) error... I didn't make the > change to solrconfig.xml to include the when applying the > upgrade. I include this now, but the slave is not cleaning up. Will this be > done

Problem in Date Unmarshalling from NamedListCodec.

2009-01-21 Thread Luca Molteni
Hello list, Using SolrJ with Solr 1.3 stable, NamedListCodec unmarshals, in the readVal method (line 161), the number 119914200 as a date (1 January 2008), while executing the same query with the solr administration console gives me a different date value: 2007-12-31T23:00:00Z It seems like
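
The symptom matches a pure formatting difference: one epoch instant rendered in UTC versus a local zone one hour ahead. A sketch with a hypothetical epoch value (not the truncated number from the message):

```python
from datetime import datetime, timezone, timedelta

epoch_ms = 1_199_142_000_000  # hypothetical instant for illustration
utc = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
cet = utc.astimezone(timezone(timedelta(hours=1)))  # e.g. a UTC+1 client

print(utc.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2007-12-31T23:00:00Z
print(cet.strftime("%Y-%m-%d %H:%M"))      # 2008-01-01 00:00
```

Same instant, two renderings: Solr's admin console prints UTC with a trailing "Z", while a client formatting in its local zone sees the next day.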

Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Jaco
Thanks for the fast replies! It appears that I made a (probably classical) error... I didn't make the change to solrconfig.xml to include the when applying the upgrade. I include this now, but the slave is not cleaning up. Will this be done at some point automatically? Can I trigger this? User a

Re: SOLR Problem with special chars

2009-01-21 Thread Kraus, Ralf | pixelhouse GmbH
Otis Gospodnetic schrieb: Ralf, Can you paste the part of your schema.xml where you defined the relevant field? Otis Sure ! positionIncrementGap="100"> language="German" />

Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Shalin Shekhar Mangar
Hi, There shouldn't be so many files on the slave. Since the empty index.x folders are not getting deleted, is it possible that the Solr process user does not have enough privileges to delete files/folders? Also, have you made any changes to the IndexDeletionPolicy configuration? On Wed, Jan 21, 2009

Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
the index.xxx directories are supposed to be deleted (automatically). you can safely delete them. But, I am wondering why the index files in the slave did not get deleted. By default the deletionPolicy is KeepOnlyLastCommit. On Wed, Jan 21, 2009 at 2:15 PM, Jaco wrote: > Hi, > > I'm running So

Re: Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Rafał Kuć
Hello, > Hi, > I'm running Solr nightly build of 20.12.2008, with patch as discussed on > http://markmail.org/message/yq2ram4f3jblermd, using Solr replication. > On various systems running, I see that the disk space consumed on the slave > is much higher than on the master. One example: > - Mast

Solr Replication: disk space consumed on slave much higher than on master

2009-01-21 Thread Jaco
Hi, I'm running Solr nightly build of 20.12.2008, with patch as discussed on http://markmail.org/message/yq2ram4f3jblermd, using Solr replication. On various systems running, I see that the disk space consumed on the slave is much higher than on the master. One example: - Master: 30 GB in 138 fil