Re: Facet on TrieDateField field without including date

2012-02-15 Thread Ted Dunning
Use multiple fields and you get what you want. The extra fields are going to cost very little and will have a big positive impact. On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson jej2...@gmail.com wrote: I think it would if I indexed the time information separately. Which was my original
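The multiple-fields approach can be sketched as a small helper that derives extra facetable fields from one timestamp at index time (the `_dt`/`_i` field names below are assumptions, not the poster's actual schema):

```python
from datetime import datetime

def date_part_fields(timestamp):
    """Split an ISO timestamp into extra integer fields so that, e.g.,
    hour-of-day can be faceted on independently of the full date.
    Field names are hypothetical dynamic-field-style suffixes."""
    dt = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return {
        "created_dt": timestamp,     # the original full-precision date field
        "created_year_i": dt.year,   # extra fields, cheap to index
        "created_month_i": dt.month,
        "created_hour_i": dt.hour,
    }
```

Faceting on `created_hour_i` then gives an hour-of-day breakdown without involving the date at all.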

Re: Need help with graphing function (MATH)

2012-02-14 Thread Ted Dunning
In general this kind of function is very easy to construct using sums of basic sigmoidal functions. The logistic and probit functions are commonly used for this. Sent from my iPhone On Feb 14, 2012, at 10:05, Mark static.void@gmail.com wrote: Thanks I'll have a look at this. I should
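The sum-of-sigmoids construction mentioned here is straightforward; a minimal sketch using the logistic function (parameter names are illustrative):

```python
import math

def logistic(x, midpoint=0.0, steepness=1.0):
    """Basic sigmoid: rises smoothly from 0 to 1 around `midpoint`."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))

def plateau(x, rise_at, fall_at, steepness=1.0):
    """Difference of two logistics: a smooth bump that is near 1 between
    `rise_at` and `fall_at` and near 0 outside. Summing several such
    terms builds more complex shapes."""
    return logistic(x, rise_at, steepness) - logistic(x, fall_at, steepness)
```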

Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Ted Dunning
This is true with Lucene as it stands. It would be much faster if there were a specialized in-memory index such as is typically used with high performance search engines. On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog goks...@gmail.com wrote: Experience has shown that it is much faster to run

Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Ted Dunning
Add this as well: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.5030 On Wed, Feb 8, 2012 at 1:56 AM, Andrzej Bialecki a...@getopt.org wrote: On 08/02/2012 09:17, Ted Dunning wrote: This is true with Lucene as it stands. It would be much faster if there were a specialized

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-23 Thread Ted Dunning
point is that mixing all the *results* of the analysis chains for multiple languages into a single field will likely result in interesting behavior. Not to say it won't be satisfactory in your situation, but there are edge cases. Best Erick On Fri, Jan 20, 2012 at 9:15 AM, Ted Dunning

Re: How to accelerate your Solr-Lucene application by 4x

2012-01-20 Thread Ted Dunning
you have. Does this clarify things? Was I able to answer your question? Best regards, Peter -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Friday, January 20, 2012 2:42 AM To: solr-user@lucene.apache.org Subject: Re: How to accelerate your

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
Write a tokenizer that does language ID and then picks which tokenizer to use. Then record the language in the language id field. What is there to elaborate? On Fri, Jan 20, 2012 at 1:58 AM, nibing nibing_...@hotmail.com wrote: But then there occurs a problem of using analyzer in indexing. I

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
Message - From: nibing nibing_...@hotmail.com To: solr-user@lucene.apache.org Cc: Sent: Friday, January 20, 2012 1:51 AM Subject: RE: Tika0.10 language identifier in Solr3.5.0 Hi, Ted Dunning, Thank you for your reply. I can understand your point on putting a language_s field

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
I think you misunderstood what I am suggesting. I am suggesting an analyzer that detects the language and then does the right thing according to the language it finds. As such, it would tokenize and stem English according to English rules, German by German rules and would probably do a sliding

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
AS - www.cominvent.com Solr Training - www.solrtraining.com On 20. jan. 2012, at 18:15, Ted Dunning wrote: I think you misunderstood what I am suggesting. I am suggesting an analyzer that detects the language and then does the right thing according to the language it finds

Re: How to accelerate your Solr-Lucene application by 4x

2012-01-19 Thread Ted Dunning
Peter, My guess is that if you had said something along the lines of "We have developed some SSD support software that makes SOLR work better. I would like to open a conversation here (link to external discussion)" that would have been reasonably well received. One of the things that makes SPAM

Re: using solr for time series data

2012-01-19 Thread Ted Dunning
Take a look at OpenTSDB. You might want to use that as is, or steal some of the concepts. The major idea to snitch is the idea of using a single row of the database (document in Lucene or Solr) to hold many data points. Thus, you could consider having documents with the following fields: key:
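The many-points-per-row idea can be sketched as bucketing samples into one document per series and time bucket; the field names and one-hour bucket size below are assumptions for illustration, not OpenTSDB's actual schema:

```python
def bucket_points(series, points, bucket_secs=3600):
    """Group (epoch_secs, value) samples into one document per
    series + time bucket, storing offsets within the bucket.
    One document then carries many data points instead of one."""
    docs = {}
    for t, v in points:
        base = t - (t % bucket_secs)
        doc = docs.setdefault((series, base), {
            "series_s": series,   # hypothetical field names
            "base_l": base,
            "offsets": [],
            "values": [],
        })
        doc["offsets"].append(t - base)
        doc["values"].append(v)
    return list(docs.values())
```

Packing an hour of samples into one document cuts the document count (and per-document index overhead) by orders of magnitude.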

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-19 Thread Ted Dunning
Normally this is done by putting a field on each document rather than separating the documents into separate corpora. Keeping them together makes the final search faster. At query time, you can add all of the language keys that you think are relevant based on your language id applied to the
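The query-time side of this can be sketched as building a filter over the plausible language keys; the `lang_s` field name is an assumption:

```python
def language_filter(detected_langs, field="lang_s"):
    """Build a filter-query clause restricting results to the languages
    that query-side language ID considers plausible. Field name is a
    hypothetical single-valued string field set at index time."""
    return field + ":(" + " OR ".join(sorted(detected_langs)) + ")"
```

Keeping all languages in one index and filtering this way avoids maintaining separate corpora per language.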

Re: How to accelerate your Solr-Lucene application by 4x

2012-01-19 Thread Ted Dunning
Actually, for search applications there is a reasonable amount of evidence that holding the index in RAM is actually more cost effective than SSD's because the throughput is enough faster to make up for the price differential. There are several papers out of UMass that describe this trade-off,

Re: How to accelerate your Solr-Lucene application by 4x

2012-01-18 Thread Ted Dunning
On Thu, Jan 19, 2012 at 1:40 AM, Darren Govoni dar...@ontrenet.com wrote: And to be honest, many people on this list are professionals who not only build their own solutions, but also buy tools and tech. I don't see what the big deal is if some clever company has something of imminent value

Re: Relevancy and random sorting

2012-01-11 Thread Ted Dunning
I think the OP meant to use random order in the case of score ties. On Wed, Jan 11, 2012 at 9:31 PM, Erick Erickson erickerick...@gmail.comwrote: Alexandre: Have you thought about grouping? If you can analyze the incoming documents and include a field such that similar documents map to the

Re: Stemming numbers

2012-01-10 Thread Ted Dunning
On Tue, Jan 10, 2012 at 5:32 PM, Tanner Postert tanner.post...@gmail.comwrote: We've had some issues with people searching for a document with the search term '200 movies'. The document is actually titled 'two hundred movies'. Do we need to add every number to our synonyms dictionary to

Re: Stemming numbers

2012-01-10 Thread Ted Dunning
than trying to engineer all possible rewrites by hand. On Tue, Jan 10, 2012 at 10:21 PM, Tanner Postert tanner.post...@gmail.comwrote: You mention that is one way to do it is there another i'm not seeing? On Jan 10, 2012, at 4:34 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Jan 10
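Generating the rewrites programmatically, rather than by hand, can be sketched as a toy number-word parser run at analysis time (a minimal sketch, not exhaustive English number parsing):

```python
WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
         "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
SCALES = {"hundred": 100, "thousand": 1000}

def words_to_number(tokens):
    """Parse a run of English number words into an integer, e.g.
    ['two', 'hundred'] -> 200, so 'two hundred movies' and '200 movies'
    can be indexed to the same token. Returns None on non-number input."""
    total, current = 0, 0
    for tok in tokens:
        if tok in WORDS:
            current += WORDS[tok]
        elif tok in SCALES:
            current = max(current, 1) * SCALES[tok]
            if SCALES[tok] >= 1000:
                total += current
                current = 0
        else:
            return None
    return total + current
```

A filter built on this would emit the digit form alongside the word form, covering every number without an enumerated synonyms dictionary.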

Re: complex keywords, hierarchical data, Solr representation problem

2012-01-08 Thread Ted Dunning
Option 3 is preferable because you can use phrase queries to get interesting results, as in "color light beige" or "color light". Normalizing is bad in this kind of environment. On Sun, Jan 8, 2012 at 11:35 AM, jimmy jimmyt...@bobmail.info wrote: ... First Table KEYWORDS: keyword_id, keyword 1,
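The denormalized option can be sketched as flattening the keyword hierarchy into phrase-searchable strings in a single field (the nested-dict input shape is an assumption about the poster's data):

```python
def flatten_keywords(tree, prefix=()):
    """Flatten a keyword hierarchy into full-path strings, e.g.
    {'color': {'light': ['beige']}} -> ['color light beige'].
    Indexed into one field, phrase queries then match any path prefix."""
    out = []
    for key, val in tree.items():
        path = prefix + (key,)
        if isinstance(val, dict):
            out.extend(flatten_keywords(val, path))
        else:
            for leaf in val:
                out.append(" ".join(path + (leaf,)))
    return out
```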

Re: stopwords as privacy measure

2012-01-08 Thread Ted Dunning
On Sun, Jan 8, 2012 at 3:33 PM, Michael Lissner mliss...@michaeljaylissner.com wrote: I have a unique use case where I have words in my corpus that users shouldn't ever be allowed to search for. My theory is that if I add these to the stopwords list, that should do the trick. That should do

Re: Solr Distributed Search vs Hadoop

2011-12-28 Thread Ted Dunning
This copying is a bit overstated here because of the way that small segments are merged into larger segments. Those larger segments are then copied much less often than the smaller ones. While you can wind up with lots of copying in certain extreme cases, it is quite rare. In particular, if you

Re: Hardware resource indication

2011-12-22 Thread Ted Dunning
On Thu, Dec 22, 2011 at 7:02 AM, Zoran | Bax-shop.nl zoran.bi...@bax-shop.nl wrote: Hello, What are (ballpark figure) the hardware requirements (diskspace, memory) SOLR will use in this case: * Heavy Dutch traffic webshop, 30.000 - 50.000 visitors a day Unique users doesn't much

Re: Solr Distributed Search vs Hadoop

2011-12-20 Thread Ted Dunning
You didn't mention how big your data is or how you create it. Hadoop would mostly be used in the preparation of the data or the off-line creation of indexes. On Tue, Dec 20, 2011 at 12:28 PM, Alireza Salimi alireza.sal...@gmail.comwrote: Hi, I have a basic question, let's say we're going to

Re: Solr Distributed Search vs Hadoop

2011-12-20 Thread Ted Dunning
of users, and a rough estimation for each user's data would be something around 5 MB. The other problem is that those data will be changed very often. I hope I answered your question. Thanks On Tue, Dec 20, 2011 at 4:00 PM, Ted Dunning ted.dunn...@gmail.com wrote: You didn't mention how

Re: Core overhead

2011-12-16 Thread Ted Dunning
I thought it was slightly clumsy, but it was informative. It seemed like a fine thing to say. Effectively it was I/we have developed a tool that will help you solve your problem. That is responsive to the OP and it is clear that it is a commercial deal. On Fri, Dec 16, 2011 at 10:02 AM, Jason

Re: Core overhead

2011-12-16 Thread Ted Dunning
Sounds like we disagree. On Fri, Dec 16, 2011 at 11:56 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Ted, ...- FREE! is stupid idiot spam. It's annoying and not suitable. On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning ted.dunn...@gmail.com wrote: I thought it was slightly

Re: Core overhead

2011-12-16 Thread Ted Dunning
We still disagree. On Fri, Dec 16, 2011 at 12:29 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Ted, The list would be unreadable if everyone spammed at the bottom their email like Otis'. It's just bad form. Jason On Fri, Dec 16, 2011 at 12:00 PM, Ted Dunning ted.dunn

Re: Core overhead

2011-12-15 Thread Ted Dunning
Here is a talk I did on this topic at HPTS a few years ago. On Thu, Dec 15, 2011 at 4:28 PM, Robert Petersen rober...@buy.com wrote: I see there is a lot of discussions about micro-sharding, I'll have to read them. I'm on an older version of solr and just use master index replicating out to

Re: Micro-Sharding

2011-12-05 Thread Ted Dunning
On Mon, Dec 5, 2011 at 3:28 PM, Shawn Heisey s...@elyograg.org wrote: On 12/4/2011 12:41 AM, Ted Dunning wrote: Read the papers I referred to. They describe how to search fairly enormous corpus with an 8GB in-memory index (and no disk cache at all). They would seem to indicate moving

Re: SolR for time-series data

2011-12-04 Thread Ted Dunning
SAX is attractive, but I have found it lacking in practice. My primary issue is that in order to get sufficient recall for practical matching problems, I had to do enough query expansion that the speed advantage of inverted indexes went away. The OP was asking for blob storage, however, and I

Re: Micro-Sharding

2011-12-03 Thread Ted Dunning
On Sat, Dec 3, 2011 at 10:54 AM, Shawn Heisey s...@elyograg.org wrote: In another thread, something was said that sparked my interest: On 12/1/2011 7:17 PM, Ted Dunning wrote: Of course, resharding is almost never necessary if you use micro-shards. Micro-shards are shards small enough

Re: Micro-Sharding

2011-12-03 Thread Ted Dunning
On Sat, Dec 3, 2011 at 6:36 PM, Shawn Heisey s...@elyograg.org wrote: On 12/3/2011 2:25 PM, Ted Dunning wrote: Things have changed since I last did this sort of thing seriously. My guess is that this is a relatively small amount of memory to devote to search. It used to be that the only way

Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
Of course, resharding is almost never necessary if you use micro-shards. Micro-shards are shards small enough that you can fit 20 or more on a node. If you have that many on each node, then adding a new node consists of moving some shards to the new machine rather than moving lots of little
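The rebalancing move described here can be sketched as a greedy reassignment of micro-shards onto a newly added node (a sketch under the assumption that shards are freely movable units; this is not Solr's own rebalancer):

```python
def rebalance(assignment, new_node):
    """Greedy micro-shard rebalance: add an empty node, then move shards
    from over-full nodes onto it until counts are roughly even.
    Returns the list of (shard, from_node, to_node) moves."""
    assignment[new_node] = []
    target = sum(len(s) for s in assignment.values()) // len(assignment)
    moves = []
    for node, shards in assignment.items():
        while (node != new_node and len(shards) > target
               and len(assignment[new_node]) < target):
            shard = shards.pop()
            assignment[new_node].append(shard)
            moves.append((shard, node, new_node))
    return moves
```

With 20 micro-shards per node, adding a node moves only a fraction of each existing node's shards; no shard is ever split or re-indexed.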

Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
Well, this goes both ways. It is not that unusual to take a node down for maintenance of some kind or even to have a node failure. In that case, it is very nice to have the load from the lost node be spread fairly evenly across the remaining cluster. Regarding the cost of having several

Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
while its 'busy' splitting). - Mark On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote: Of course, resharding is almost never necessary if you use micro-shards. Micro-shards are shards small enough that you can fit 20 or more on a node. If you have that many on each node, then adding

Re: remove answers with identical scores

2011-11-25 Thread Ted Dunning
You can do that pretty easily by just retrieving extra documents and post processing the results list. You are likely to have a significant number of apparent duplicates this way. To really get rid of duplicates in results, it might be better to remove them from the corpus by deploying something

Re: remove answers with identical scores

2011-11-25 Thread Ted Dunning
Solr do it for me! that I have to ask this question is probably not a good sign, but what is LSH clustering? On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning ted.dunn...@gmail.com wrote: You can do that pretty easily by just retrieving extra documents and post processing the results list
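LSH (locality-sensitive hashing) for near-duplicate detection can be sketched with MinHash signatures, one common LSH scheme (a minimal illustration, not a production deduplicator):

```python
import hashlib

def minhash_signature(tokens, num_hashes=16):
    """MinHash: for each of `num_hashes` seeded hash functions, keep the
    minimum hash value over the document's tokens. Near-duplicate
    documents get mostly identical signatures, so they can be bucketed
    together and removed from the corpus before indexing."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5((str(seed) + tok).encode()).hexdigest(), 16)
            for tok in set(tokens)))
    return sig

def similarity(a, b):
    """Fraction of matching signature positions, an estimate of the
    Jaccard similarity of the two token sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```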

Re: how to achieve google.com like results for phrase queries

2011-11-05 Thread Ted Dunning
Google achieves their results by using data not found in the web pages themselves. This additional data critically includes link text, but also is derived from behavioral information. On Sat, Nov 5, 2011 at 5:07 PM, alx...@aim.com wrote: Hi Erick, The term newspaper latimes is not found

Re: Query time help

2011-10-30 Thread Ted Dunning
That sounds like Nagle's algorithm. http://en.wikipedia.org/wiki/Nagle's_algorithm#Interactions_with_real-time_systems On Sun, Oct 30, 2011 at 2:01 PM, dar...@ontrenet.com wrote: Another interesting note. When I use the Solr Admin screen to perform the same query, it doesn't take as long.
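If Nagle's algorithm is indeed the culprit, the usual remedy on the client side is disabling it with `TCP_NODELAY`; a minimal sketch (whether the poster's HTTP client exposes this is an assumption):

```python
import socket

def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled, so small
    request payloads (like short Solr queries) are sent immediately
    rather than buffered waiting for coalescing or an ACK."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return s
```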

Re: Search calendar availability

2011-10-27 Thread Ted Dunning
On Thu, Oct 27, 2011 at 7:13 AM, Anatoli Matuskova anatoli.matusk...@gmail.com wrote: I don't like the idea of indexing a doc per each value, the dataset can grow a lot. What does a lot mean? How high is the sky? A million people with 3 year schedules is a billion tiny documents. That
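The back-of-envelope arithmetic behind "a billion tiny documents" checks out with one document per person per day:

```python
def schedule_doc_count(people, years, slots_per_day=1):
    """Rough count of documents for one-document-per-availability-slot
    calendar indexing. slots_per_day=1 models daily granularity."""
    return people * years * 365 * slots_per_day
```

A million people over three years at daily granularity is already about 1.1 billion documents, before any finer slot granularity multiplies it further.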