RE: Boost on newer documents

2010-11-30 Thread jan.kurella
You could also put a short representation of the data (I suggest days since 01.01.2010) as payload and calculate the boost with the payload function of the similarity. -Original Message- From: ext Jason Brown [mailto:jason.br...@sjp.co.uk] Sent: Monday, 29 November 2010 17:28 To:
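To make the "days since 01.01.2010" idea concrete, here is a minimal sketch of how such a value could be computed before it is stored as a payload. The function names and the half-life decay are assumptions made for illustration; the message does not say which boost formula was used, and nothing below is Solr or Lucene API.

```python
from datetime import date

# Reference date suggested in the message (01.01.2010).
EPOCH = date(2010, 1, 1)


def days_since_epoch(published: date) -> int:
    """Number of days between the reference date and the publication date."""
    return (published - EPOCH).days


def recency_boost(published: date, today: date, half_life_days: float = 365.0) -> float:
    """Hypothetical boost: 1.0 for a document published today,
    halving for every half_life_days of age."""
    age = days_since_epoch(today) - days_since_epoch(published)
    return 0.5 ** (max(age, 0) / half_life_days)


if __name__ == "__main__":
    print(days_since_epoch(date(2010, 11, 30)))
    print(recency_boost(date(2009, 11, 30), date(2010, 11, 30)))
```

A small integer like this fits into a few bytes, which is presumably why a compact day count is suggested rather than a full timestamp.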

Re: search strangeness

2010-11-30 Thread ramzesua
Here is the result with debugQuery for the term annual: <result name="response" numFound="0" start="0"/> <lst name="debug"> <str name="rawquerystring">annual</str> <str name="querystring">annual</str> <str name="parsedquery">text:year text:twelve-month text:onceayear text:yearbook</str> <str name="parsedquery_toString">text:year

Re: search strangeness

2010-11-30 Thread ramzesua
I found the problem: solr.EnglishPorterFilterFactory in the <analyzer type="query"> section produces that parsedquery. -- View this message in context: http://lucene.472066.n3.nabble.com/search-strangeness-tp1986895p1991321.html Sent from the Solr - User mailing list archive at Nabble.com.

RE: Good example of multiple tokenizers for a single field

2010-11-30 Thread jan.kurella
We had the same problem for our fields and we wrote a Tokenizer using the icu4j library. It breaks tokens at script changes and deals with them according to the script and the configured BreakIterators. This works out very well, as we also add the script information to the token so later filter

Re: Large Hdd-Space using during commit/optimize

2010-11-30 Thread stockii
aha aha :D hm i don't know. we import in steps of 2 million because we think that solr locks our database and we want better control of the import ... -- View this message in context: http://lucene.472066.n3.nabble.com/Large-Hdd-Space-using-during-commit-optimize-tp1985807p1991392.html Sent from

Re: question about Solr SignatureUpdateProcessorFactory

2010-11-30 Thread Bernd Fehling
As mentioned, in the typical case it's important that the field names be included in the signature, but i imagine there would be cases where you wouldn't want them included (like a simple concat Signature for building basic composite keys) I think the Signature API could definitely be

Re: Preventing index segment corruption when windows crashes

2010-11-30 Thread Peter Sturge
The index itself isn't corrupt - just one of the segment files. This means you can read the index (less the offending segment(s)), but once this happens it's no longer possible to access the documents that were in that segment (they're gone forever), nor write/commit to the index (depending on the

Re: Boost on newer documents

2010-11-30 Thread Savvas-Andreas Moysidis
hi, I might not understand your case right but can you not add an extra publishedDate field and then specify a secondary (after relevance) sort by that? On 30 November 2010 08:05, jan.kure...@nokia.com wrote: You could also put a short representation of the data (I suggest days since

Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-11-30 Thread Michael McCandless
Hmm this is in fact a regression. TopFieldCollector expects (but does not verify) that numHits is > 0. I guess to fix this we could fix TopFieldCollector.create to return a NullCollector when numHits is 0. But: why is your app doing this? Ie, if numHits (rows) is 0, the only useful thing you

RE: Boost on newer documents

2010-11-30 Thread Jason Brown
Hi - you do understand my case - we tried what you suggested but as the relevancy is very precise we couldn't get it to do a dual-sort. I like the idea of using one of the dismax parameters (bf) to in effect increase the boost on a newer document. Thanks for all replies, most useful.

Re: Boost on newer documents

2010-11-30 Thread Savvas-Andreas Moysidis
ahhh I see..good point..yes, for a high number of unique scores the secondary sort won't have any effect.. On 30 November 2010 09:32, Jason Brown jason.br...@sjp.co.uk wrote: Hi - you do understand my case - we tried what you suggested but as the relevancy is very precise we couldn't get it

Creating Email Token Filter

2010-11-30 Thread Greg Smith
Hi, I have written a plugin to filter on email types and keep those tokens, however when I run it in the analysis in the admin it all works fine. But when I use the data import handler to import the data and set the field type it doesn't remove the other tokens and keeps the field in the

Re: Creating Email Token Filter

2010-11-30 Thread Bernd Fehling
On 30.11.2010 10:56, Greg Smith wrote: Hi, I have written a plugin to filter on email types and keep those tokens, however when I run it in the analysis in the admin it all works fine. But when I use the data import handler to import the data and set the field type it doesn't remove

RE: BasicHelloRequestHandler plugin - class path changed

2010-11-30 Thread Hong-Thai Nguyen
Hi, I found the problem: the package of the class is different in 1.4.1. From: import org.apache.solr.response.SolrQueryResponse; To: import org.apache.solr.request.SolrQueryResponse; Best, --- Hong-Thai -Original Message- From: Hong-Thai Nguyen

Re: Termvector based result grouping / field collapsing?

2010-11-30 Thread Grant Ingersoll
On Nov 29, 2010, at 5:17 PM, Shawn Heisey wrote: I was just in a meeting where we discussed customer feedback on our website. One thing that the users would like to see is galleries where photos that are part of a set are grouped together under a single result. This is basically field

Return Lucene DocId in Solr Results

2010-11-30 Thread Lohrenz, Steven
Hi, I was wondering how I would go about getting the lucene docid included in the results from a solr query? I've built a QueryParser to query another solr instance and join the results of the two instances through the use of a Filter. The Filter needs the lucene docid to work. This is

Re: SOLR for Log analysis feasibility

2010-11-30 Thread Peter Karich
Take a look at this: http://vimeo.com/16102543 For that amount of data it isn't that easy :-) We are looking into building a reporting feature and investigating solutions which will allow us to search through our logs for downloads, searches and view history. Each log item is relatively

Failover setup (is this a bad idea)

2010-11-30 Thread Keith Pope
Hi, I have a Windows cluster that I would like to install Solr onto; there are two nodes that provide basic failover. I was thinking of this setup: Tomcat installed as a Windows service, and two Solr instances sharing the same index. The second instance would take over when the first fails, so you

Re: SOLR for Log analysis feasibility

2010-11-30 Thread Stefan Matheis
i know, it's not solr .. but perhaps you should have a look at it: http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/ On Tue, Nov 30, 2010 at 12:58 PM, Peter Karich peat...@yahoo.de wrote: take a look into this: http://vimeo.com/16102543 for that amount of

Re: SOLR for Log analysis feasibility

2010-11-30 Thread Peter Sturge
We do a lot of precisely this sort of thing. Ours is a commercial product (Honeycomb Lexicon) that extracts behavioural information from logs, events and network data (don't worry, I'm not pushing this on you!) - only to say that there are a lot of considerations beyond base Solr when it comes to

Re: Large Hdd-Space using during commit/optimize

2010-11-30 Thread Erick Erickson
Solr doesn't lock anything as far as I know, it just executes the query you specify. The query you specify may well do bad things to your database, but that's not Solr's fault. What happens if you simply try executing the query outside Solr? Do you see the same locking behavior? You might want to

Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-11-30 Thread Martin Grotzke
On Tue, Nov 30, 2010 at 10:29 AM, Michael McCandless luc...@mikemccandless.com wrote: Hmm this is in fact a regression. TopFieldCollector expects (but does not verify) that numHits is > 0. I guess to fix this we could fix TopFieldCollector.create to return a NullCollector when numHits is 0.

Re: Creating Email Token Filter

2010-11-30 Thread Greg Smith
Bernd, Looking at the results returned in the search results, the field is populated with all of the information regardless of whether there was an email contained in the contents. Would the way the analysers and tokens are handled be different if using a copy field? Thanks On 30 November 2010

Re: Creating Email Token Filter

2010-11-30 Thread Erick Erickson
See below. If this still doesn't make sense, could you show us some examples? Best Erick On Tue, Nov 30, 2010 at 8:33 AM, Greg Smith audi...@gmail.com wrote: Bernd, Looking at the results returned in the search results the field is populated with all of the information regardless of whether

Best practice for Delta every 2 Minutes.

2010-11-30 Thread stockii
Hello. The index is about 28 million documents large. When I start a delta-import, it looks at modified, but the delta import takes too long: Solr needs over an hour for the delta. That's my query: all sessions from the last hour, and everything that changed, should be updated. I think it's normal that Solr needs a long time for

Re: Best practice for Delta every 2 Minutes.

2010-11-30 Thread Erick Erickson
Please provide more data. Specifically: how many documents are updated? Have you tried running this query without Solr? In other words have you investigated whether the speed issue is simply your SQL executing slowly? Why are you selecting the last 10 hours' data when all you want is

Re: Best practice for Delta every 2 Minutes.

2010-11-30 Thread stockii
Every day ~30,000 documents and every hour ~1,200. Multiple threads with DIH? How does that work? -- View this message in context: http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992767.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Best practice for Delta every 2 Minutes.

2010-11-30 Thread stockii
how do you think is the deltaQuery better ? XD -- View this message in context: http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992774.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-11-30 Thread Yonik Seeley
On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke martin.grot...@googlemail.com wrote: Still I'm wondering, why this issue does not occur with the plain example solr setup with 2 indexed docs. Any explanation? It's an old option you have in your solrconfig.xml that causes a different code path to

Re: Best practice for Delta every 2 Minutes.

2010-11-30 Thread stockii
i copied the wrong query, hence the 10 hours ;) i didn't test the query with 28 million records, but with a few million it works fine. ... before i used DIH, i used php and imported documents directly into solr. but i want to use DIH because of the better performance, i think ... grml ... --

Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Jacob Elder
+1 That's exactly what we need, too. On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey elyog...@elyograg.org wrote: On 11/29/2010 3:15 PM, Jacob Elder wrote: I am looking for a clear example of using more than one tokenizer for a source single field. My application has a single body field which

QueryNorm and FieldNorm

2010-11-30 Thread Gastone Penzo
Hello, can someone explain the difference between queryNorm and fieldNorm in debugQuery? Why does the queryNorm go down if I push one bf boost up? I made some modifications; before, the situation was different. Why? Thanks -- Gastone Penzo
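As background for this question, the sketch below assumes Lucene's DefaultSimilarity, where queryNorm is computed as 1 / sqrt(sumOfSquaredWeights) over the weights of the query's clauses. The clause weights used here are made-up numbers, not values from the poster's debugQuery output. fieldNorm, by contrast, is an index-time value per field and document and is not affected by query boosts.

```python
import math


def query_norm(clause_weights):
    """1 / sqrt(sum of squared clause weights), mirroring the
    DefaultSimilarity formula assumed in the text above."""
    sum_of_squared_weights = sum(w * w for w in clause_weights)
    return 1.0 / math.sqrt(sum_of_squared_weights)


if __name__ == "__main__":
    # Hypothetical weights: a term clause (3.0) and a boost function clause.
    print(query_norm([3.0, 4.0]))   # boost function weighted 4.0
    print(query_norm([3.0, 8.0]))   # same query with the boost doubled
```

Under that assumption, raising one clause's boost raises the sum of squared weights, so queryNorm drops. Because queryNorm is applied to every clause of the same query, it does not change the ranking of documents relative to each other.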

Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Jacob Elder
Right. CJK doesn't tend to have a lot of whitespace to begin with. In the past, we were using a patched version of StandardTokenizer which treated @twitteruser and #hashtag better, but this became a release engineering nightmare so we switched to Whitespace. Perhaps I could rephrase the question

RE: Return Lucene DocId in Solr Results

2010-11-30 Thread Lohrenz, Steven
Hmm, I found some similar queries on stackoverflow and they did not recommend exposing the lucene docId. So, I guess my question becomes: What is the best way, from within my custom QParser, to take a list of solr primary keys (that were retrieved from elsewhere) and turn them into docIds? I

Re: Failover setup (is this a bad idea)

2010-11-30 Thread Jayendra Patil
Rather, have a Master and multiple Slaves, with the master used only for writes and the slaves used for reads. Master-to-Slave replication is easily configurable. Two Solr instances sharing the same index, with both writing to it, is not at all a good idea. Regards, Jayendra On

PermGen per Solr core?

2010-11-30 Thread Andrew Davidoff
Hi, I am running multiple Solr cores (solr-tomcat 1.4.0+ds1-1ubuntu1) under Tomcat (6.0.24-2ubuntu1.4) on Ubuntu 10.04.1. I have a master server where all Solr writes go, and a slave server that replicates all cores from the master, and accepts all read-only queries. After maxing out PermGen

Re: Best practice for Delta every 2 Minutes.

2010-11-30 Thread Erick Erickson
I don't know, you'll have to debug it to see if it's the thing that takes so long. Solr should be able to handle 1,200 updates in a very short time unless there's something else going on, like you're committing after every update or something. This may help you track down performance with DIH

Re: Large Hdd-Space using during commit/optimize

2010-11-30 Thread stockii
okay. the query kills the database, because no index of modified is set ... -- View this message in context: http://lucene.472066.n3.nabble.com/Large-Hdd-Space-using-during-commit-optimize-tp1985807p1993750.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Dinamically change master

2010-11-30 Thread Ken Krugler
Hi Tommaso, On Nov 30, 2010, at 7:41am, Tommaso Teofili wrote: Hi all, in a replication environment if the host where the master is running goes down for some reason, is there a way to communicate to the slaves to point to a different (backup) master without manually changing

how to set maxFieldLength to unlimitd

2010-11-30 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
I need to index and search some PDF files which are very big (around 1000 pages each). How can I set maxFieldLength to unlimited? Thanks so much for your help in advance, Xiaohui

Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-11-30 Thread Martin Grotzke
On Tue, Nov 30, 2010 at 3:09 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke martin.grot...@googlemail.com wrote: Still I'm wondering, why this issue does not occur with the plain example solr setup with 2 indexed docs. Any explanation? It's

Re: how to set maxFieldLength to unlimitd

2010-11-30 Thread Erick Erickson
Set the maxFieldLength value in solrconfig.xml to, say, 2147483647 Also, see this thread for a common gotcha: http://lucene.472066.n3.nabble.com/Solr-ignoring-maxFieldLength-td473263.html , it appears you can just comment out the one in the mainIndex section. Best Erick On Tue, Nov 30, 2010 at

Re: Dinamically change master

2010-11-30 Thread Tommaso Teofili
Hi, Thanks Jacob and Ken for your replies. I am not able to change the project architecture to add Lucandra even if it looks like a nice solution. Going the VIP way can definitely be an option even if I'd be more keen to solve that inside Solr. I am thinking to try and play with Collection Distribution

Need info on CachedSqlEntityProcessor

2010-11-30 Thread bbarani
Hi, I am using cached SQL entity processor in my data config, please find below the structure of my data config file. entity name=object query=select * from x where objecttype=''test1' entity name=objectproperty query=select * from y processor=CachedSqlEntityProcessor

RE: how to set maxFieldLength to unlimitd

2010-11-30 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your help! Xiaohui -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, November 30, 2010 2:01 PM To: solr-user@lucene.apache.org Subject: Re: how to set maxFieldLength to unlimitd Set the maxFieldLength value in solrconfig.xml to,

Very slow sorting, even on small result sets

2010-11-30 Thread Simon Wistow
We've got a largish corpus (~94 million documents). We'd like to be able to sort on one of the string fields. However this takes an incredibly long time. A warming query for that field takes about ~20 minutes. However most of the time the result sets are small since we use filters heavily -

Re: Bad file descriptor Errors

2010-11-30 Thread John Williams
Bump. Anyone? -J On Nov 29, 2010, at 3:17 PM, John Williams wrote: Recently, we have started to get Bad file descriptor errors in one of our Solr instances. This instance is a searcher and its index is stored on a local SSD. The master however has its index stored on NFS, which seems to

RE: how to set maxFieldLength to unlimitd

2010-11-30 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
I set maxFieldLength to 2147483647, restarted tomcat and re-indexed the pdf files again. I also commented out the one in the mainIndex section. Unfortunately the files are still cut off if the file size is more than 20MB. Any suggestions? I really appreciate your help! Xiaohui

distributed architecture

2010-11-30 Thread Cinquini, Luca (3880)
Hi, I'd like to know if anybody has suggestions/opinions on what is currently the best architecture for a distributed search system using Solr. The use case is that of a system composed of N indexes, each hosted on a separate machine, each index containing unique content. Options that

shutdown.sh does not kill the tomcat process running solr./?

2010-11-30 Thread Robert Petersen
Greetings, we're wondering why we can issue the command to shutdown tomcat/solr but the process remains visible in memory (by using the top command) and we have to manually kill the PID for it to release its memory before we can (re)start tomcat/solr? Anybody have any ideas? The process is using

entire farm fails at the same time with OOM issues

2010-11-30 Thread Robert Petersen
Greetings, we are running one master and four slaves of our multicore solr setup. We just served searches for our catalog of 8 million products with this farm during black Friday and cyber Monday, our busiest days of the year, and the servers did not break a sweat! Index size is about 28GB.

Re: entire farm fails at the same time with OOM issues

2010-11-30 Thread Ken Krugler
Hi Robert, I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=<path to where you want the file to go>, so then you have something to look at versus a Gedankenexperiment :) -- Ken On Nov 30, 2010, at 3:04pm, Robert Petersen wrote: Greetings, we are
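A minimal sketch of how the two flags could be assembled for a Tomcat launch. It assumes the flags are passed through the CATALINA_OPTS environment variable that Tomcat's startup scripts read; the dump directory and the existing -Xmx value are hypothetical, and the helper names are made up for illustration.

```python
def heap_dump_opts(dump_path: str) -> str:
    """Return the two HotSpot flags named in the message as one string."""
    flags = [
        "-XX:+HeapDumpOnOutOfMemoryError",
        f"-XX:HeapDumpPath={dump_path}",
    ]
    return " ".join(flags)


def tomcat_env(base_env: dict, dump_path: str) -> dict:
    """Copy base_env and append the heap dump flags to CATALINA_OPTS
    (assumption: Tomcat is started via scripts that honour this variable)."""
    env = dict(base_env)
    existing = env.get("CATALINA_OPTS", "")
    env["CATALINA_OPTS"] = (existing + " " + heap_dump_opts(dump_path)).strip()
    return env


if __name__ == "__main__":
    # "/var/tmp/solr-dumps" is a hypothetical directory with enough free space.
    print(tomcat_env({"CATALINA_OPTS": "-Xmx4g"}, "/var/tmp/solr-dumps")["CATALINA_OPTS"])
```

The resulting .hprof file can then be opened in a heap analyzer to see which objects dominated the heap at the moment of the OutOfMemoryError.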

Re: Large Hdd-Space using during commit/optimize

2010-11-30 Thread Upayavira
I don't know who you are replying to here, but... There's nothing to stop you doing:
* import 2m docs
* sleep 2 days
* import 2m docs
* sleep 2 days
* repeat above until done
* commit
There's no reason why you should commit regularly. If you need to slow down for your DB, do, but that
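The import/sleep/commit pattern described in this message can be sketched roughly as follows. The add_docs and commit callables are placeholders for whatever client code actually talks to Solr; they are assumptions for illustration, not a real client API.

```python
import time


def import_in_batches(batches, add_docs, commit, pause_seconds=0.0, sleep=time.sleep):
    """Send every batch, pausing between batches to spare the database,
    and commit exactly once at the end."""
    total = 0
    for i, batch in enumerate(batches):
        if i > 0 and pause_seconds > 0:
            sleep(pause_seconds)      # give the source database a rest
        add_docs(batch)               # hypothetical: posts the docs to Solr
        total += len(batch)
    commit()                          # single commit after all batches
    return total


if __name__ == "__main__":
    log = []
    n = import_in_batches(
        [[1, 2], [3]],
        add_docs=lambda b: log.append(("add", len(b))),
        commit=lambda: log.append(("commit",)),
    )
    print(n, log)
```

Passing sleep as a parameter keeps the sketch easy to try out without actually waiting; in a real import the default time.sleep would be used.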

Re: Preventing index segment corruption when windows crashes

2010-11-30 Thread Peter Sturge
After a recent Windows 7 crash (:-\), upon restart, Solr starts giving LockObtainFailedException errors: (excerpt) 30-Nov-2010 23:10:51 org.apache.solr.common.SolrException log SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:

Re: Dinamically change master

2010-11-30 Thread Upayavira
Hi Tommaso, I believe you can tell each server to act as a master (which means it can have its indexes pulled from it). You can then include the master hostname in the URL that triggers a replication process. Thus, if you triggered replication from outside solr, you'd have control over which
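A rough sketch of what such a trigger URL could look like. It assumes the Solr 1.4 ReplicationHandler is registered at /replication on both servers and that its fetchindex command accepts a masterUrl parameter; the host names are hypothetical.

```python
from urllib.parse import urlencode


def fetchindex_url(slave_base: str, master_base: str) -> str:
    """Build a URL that asks the slave to pull its index from the given master.
    Both arguments are Solr base URLs such as http://host:8983/solr."""
    params = urlencode({
        "command": "fetchindex",
        "masterUrl": master_base.rstrip("/") + "/replication",
    })
    return slave_base.rstrip("/") + "/replication?" + params


if __name__ == "__main__":
    # slave1 and master2 are made-up host names.
    print(fetchindex_url("http://slave1:8983/solr", "http://master2:8983/solr"))
```

An external monitor that notices the primary master is down could request this URL on each slave with the backup master's address, which keeps the failover decision outside Solr.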

Re: distributed architecture

2010-11-30 Thread Upayavira
I cannot say how mature the code for B) is, but it is not yet included in a release. If you want the ability to distribute content across multiple nodes (due to volume) and want resilience, then use both. I've had one setup where we have two master servers, each with four cores. Then we have two

Re: distributed architecture

2010-11-30 Thread Shawn Heisey
On 11/30/2010 2:27 PM, Cinquini, Luca (3880) wrote: Hi, I'd like to know if anybody has suggestions/opinions on what is currently the best architecture for a distributed search system using Solr. The use case is that of a system composed of N indexes, each hosted on a separate machine,

RE: entire farm fails at the same time with OOM issues

2010-11-30 Thread Robert Petersen
What would I do with the heap dump though? Run one of those java heap analyzers looking for memory leaks or something? I have no experience with those. I saw there was a bug fix in solr 1.4.1 for a 100 byte memory leak occurring on each commit, but it would take thousands of commits to make that

Re: shutdown.sh does not kill the tomcat process running solr./?

2010-11-30 Thread Li Li
1. Make sure the port in <Server port="8005" shutdown="SHUTDOWN"> is not already in use. 2. Run ./bin/shutdown.sh, then tail -f logs/xxx to see what the server is doing. If you have just fed data or modified the index and haven't flushed/committed, it will do some work when shutting down. 2010/12/1 Robert Petersen rober...@buy.com:

Re: entire farm fails at the same time with OOM issues

2010-11-30 Thread Yonik Seeley
On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen rober...@buy.com wrote: My question is this.  Why in the world would all of my slaves, after running fine for some days, suddenly all at the exact same minute experience OOM heap errors and go dead? If there is no change in query traffic when

Re: Best practice for Delta every 2 Minutes.

2010-11-30 Thread Li Li
You may implement your own MergePolicy to keep one large index and merge all the other small ones, or simply set the merge factor to 2 and keep the largest index from being merged by setting maxMergeDocs to less than the number of docs in the largest one. So there is one large index and a small one. When adding a few docs, they

Re: shutdown.sh does not kill the tomcat process running solr./?

2010-11-30 Thread Shawn Heisey
On 11/30/2010 3:49 PM, Robert Petersen wrote: That raises another question: top can show only 20 GB free out of 64 but the tomcat/solr process only shows its using half of that. What is using the rest? The numbers don't add up... Chances are that it's your operating system disk cache.

RE: distributed architecture

2010-11-30 Thread Jayant Das
Hi, A diagram will be very much appreciated. Thanks, Jayant From: u...@odoko.co.uk To: solr-user@lucene.apache.org Subject: Re: distributed architecture Date: Wed, 1 Dec 2010 00:39:40 + I cannot say how mature the code for B) is, but it is not yet included in a release. If you

Re: Dinamically change master

2010-11-30 Thread Tommaso Teofili
Hi Upayavira, this is a good start for solving my problem, can you please tell me what such a replication URL looks like? Thanks, Tommaso 2010/12/1 Upayavira u...@odoko.co.uk Hi Tommaso, I believe you can tell each server to act as a master (which means it can have its indexes pulled from

ArrayIndexOutOfBoundsException in sort

2010-11-30 Thread Jerry Li
Hi team, my Solr version is 1.4. There is an ArrayIndexOutOfBoundsException when I sort on one field. The following is my code and log info; any help will be appreciated. Code: SolrQuery query = new SolrQuery(); query.setSortField("author", ORDER.desc);

Twitter Search + big Hadoop, Dec. 8th at Seattle Scalability Meetup

2010-11-30 Thread Bradford Stephens
Greetings, The Seattle Scalability Meetup isn't slacking for the holidays. We've got an awesome lineup for Wed, December 8 at 7pm: http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/ -Jake Mannix from Twitter will talk about the Twitter Search infrastructure (with distributed Lucene)

Re: ArrayIndexOutOfBoundsException in sort

2010-11-30 Thread Gora Mohanty
On Wed, Dec 1, 2010 at 10:56 AM, Jerry Li zongjie...@gmail.com wrote: Hi team My solr version is 1.4 There is an ArrayIndexOutOfBoundsException when i sort one field and the following is my code and log info, any help will be appreciated. Code:        SolrQuery query = new SolrQuery();  

Re: distributed architecture

2010-11-30 Thread Dennis Gearon
Wow, would you put a diagram somewhere up on the Solr site? Or, here, and I will put it somewhere there. And, what is a VIP? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’

Re: Basic Solr Configurations and best practice

2010-11-30 Thread Lance Norskog
Solr 4- You mean the Solr 'trunk' source or the Solr 1.4.1 release? The 1.4.1 release does not have the TikaEntityProcessor, only the /extract code. The Solr 3.x branch and the trunk have the TikaEP. I use the 3.x branch and, well, the TikaEP has a few problems but can be hacked around.