Use multiple fields and you get what you want. The extra fields are going
to cost very little and will have a big positive impact.
On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson jej2...@gmail.com wrote:
I think it would if I indexed the time information separately. Which
was my original
In general this kind of function is very easy to construct using sums of basic
sigmoidal functions. The logistic and probit functions are commonly used for
this.
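To illustrate the suggestion above (a sketch of my own, not from the thread): a recency-boost function built as a sum of logistic sigmoids, with hand-picked midpoints and steepness values that are purely illustrative.

```python
import math

def logistic(x, midpoint, steepness):
    """Standard logistic sigmoid centered at `midpoint`."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))

def recency_boost(age_days):
    """Sum of two sigmoids: a strong step for very fresh documents and a
    smaller, softer step for documents under roughly a month old.
    The weights (0.5, 0.25) and shapes are illustrative assumptions."""
    return (1.0
            + 0.5 * logistic(-age_days, -1.0, 2.0)
            + 0.25 * logistic(-age_days, -30.0, 0.2))
```

Because each sigmoid contributes one smooth "step", stacking a few of them lets you shape almost any monotone decay curve without hand-tuning a single formula.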
Sent from my iPhone
On Feb 14, 2012, at 10:05, Mark static.void@gmail.com wrote:
Thanks I'll have a look at this. I should
This is true with Lucene as it stands. It would be much faster if there
were a specialized in-memory index such as is typically used with high
performance search engines.
On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog goks...@gmail.com wrote:
Experience has shown that it is much faster to run
Add this as well:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.5030
On Wed, Feb 8, 2012 at 1:56 AM, Andrzej Bialecki a...@getopt.org wrote:
On 08/02/2012 09:17, Ted Dunning wrote:
This is true with Lucene as it stands. It would be much faster if there
were a specialized
point is that mixing all the *results* of the
analysis chains for multiple languages into a single field
will likely result in interesting behavior. Not to say it won't
be satisfactory in your situation, but there are edge cases.
Best
Erick
On Fri, Jan 20, 2012 at 9:15 AM, Ted Dunning
you have.
Does this clarify things? Was I able to answer your question?
Best regards,
Peter
-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Friday, January 20, 2012 2:42 AM
To: solr-user@lucene.apache.org
Subject: Re: How to accelerate your
Write a tokenizer that does language ID and then picks which tokenizer to
use. Then record the language in the language id field.
What is there to elaborate?
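A toy sketch of the dispatching idea (my illustration; the detector and tokenization rules below are stand-ins, not real Solr/Lucene or Tika APIs):

```python
# Hypothetical language-dispatching "analyzer": detect the language first,
# then tokenize accordingly and record the language for a language id field.

def detect_language(text):
    # Toy detector: a real system would use Tika or a langdetect library.
    german_markers = {"der", "die", "das", "und", "ist"}
    words = text.lower().split()
    hits = sum(1 for w in words if w in german_markers)
    return "de" if hits >= 2 else "en"

def analyze(text):
    lang = detect_language(text)
    # A real analyzer would stem per-language here; this just lowercases.
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    return {"language": lang, "tokens": tokens}
```

The key point is the shape: one entry point, language detection up front, and the detected language both drives the token pipeline and gets stored alongside the tokens.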
On Fri, Jan 20, 2012 at 1:58 AM, nibing nibing_...@hotmail.com wrote:
But then there occurs a problem of using analyzer in indexing. I
Message -
From: nibing nibing_...@hotmail.com
To: solr-user@lucene.apache.org
Cc:
Sent: Friday, January 20, 2012 1:51 AM
Subject: RE: Tika0.10 language identifier in Solr3.5.0
Hi, Ted Dunning,
Thank you for your reply. I can understand your point on putting a
language_s field
I think you misunderstood what I am suggesting.
I am suggesting an analyzer that detects the language and then does the
right thing according to the language it finds. As such, it would
tokenize and stem English according to English rules, German by German
rules and would probably do a sliding
AS - www.cominvent.com
Solr Training - www.solrtraining.com
On 20. jan. 2012, at 18:15, Ted Dunning wrote:
I think you misunderstood what I am suggesting.
I am suggesting an analyzer that detects the language and then does the
right thing according to the language it finds
Peter,
My guess is that if you had said something along the lines of "We have
developed some SSD support software that makes SOLR work better. I would
like to open a conversation here (link to external discussion)", that would
have been reasonably well received. One of the things that makes SPAM
Take a look at openTSDB.
You might want to use that as is, or steal some of the concepts. The major
idea to snitch is the idea of using a single row of the database (a document
in Lucene or Solr) to hold many data points.
Thus, you could consider having documents with the following fields:
key:
Normally this is done by putting a field on each document rather than
separating the documents into separate corpora. Keeping them together
makes the final search faster.
At query time, you can add all of the language keys that you think are
relevant based on your language id applied to the
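A sketch of that query-time side (my own illustration; the field names `text_en`, `text_de` and the `language` field are assumptions, not from the thread):

```python
def build_query(user_query, likely_langs):
    """Search per-language analyzed fields plus a language filter,
    rather than routing to separate corpora."""
    fields = " OR ".join(f"text_{lang}:({user_query})" for lang in likely_langs)
    flt = " OR ".join(f"language:{lang}" for lang in likely_langs)
    return f"({fields})", f"({flt})"
```

With everything in one index, adding or dropping a language at query time is just adding or dropping clauses; no cross-corpus merging is needed.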
Actually, for search applications there is a reasonable amount of evidence
that holding the index in RAM is actually more cost effective than SSD's
because the throughput is enough faster to make up for the price
differential. There are several papers out of UMass that describe this
trade-off,
On Thu, Jan 19, 2012 at 1:40 AM, Darren Govoni dar...@ontrenet.com wrote:
And to be honest, many people on this list are professionals who not only
build their own solutions, but also buy tools and tech.
I don't see what the big deal is if some clever company has something of
imminent value
I think the OP meant to use random order in the case of score ties.
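Random tie-breaking of equal scores can be sketched like this (my illustration, not code from the thread):

```python
import random

def rank_with_random_ties(results, seed=None):
    """Order by descending score, breaking exact score ties randomly.
    Pass a seed for reproducible ordering across requests."""
    rng = random.Random(seed)
    return sorted(results, key=lambda doc: (-doc["score"], rng.random()))
```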
On Wed, Jan 11, 2012 at 9:31 PM, Erick Erickson erickerick...@gmail.com wrote:
Alexandre:
Have you thought about grouping? If you can analyze the incoming
documents and include a field such that similar documents map
to the
On Tue, Jan 10, 2012 at 5:32 PM, Tanner Postert tanner.post...@gmail.com wrote:
We've had some issues with people searching for a document with the
search term '200 movies'. The document is actually titled 'two hundred
movies'.
Do we need to add every number to our synonyms dictionary to
than trying to engineer all possible
rewrites by hand.
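One mechanical alternative to hand-maintained number synonyms (a sketch of my own; a real deployment might use a token filter or the ICU rule-based number formatter instead of these hand-written tables):

```python
# Rewrite digit tokens into word forms at analysis time, so '200 movies'
# and 'two hundred movies' analyze to the same tokens. Handles 0-999 only.

ONES = ["zero","one","two","three","four","five","six","seven","eight","nine"]
TEENS = ["ten","eleven","twelve","thirteen","fourteen","fifteen",
         "sixteen","seventeen","eighteen","nineteen"]
TENS = ["","","twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety"]

def number_to_words(n):
    if n < 10: return ONES[n]
    if n < 20: return TEENS[n - 10]
    if n < 100:
        t = TENS[n // 10]
        return t if n % 10 == 0 else f"{t} {ONES[n % 10]}"
    if n < 1000:
        head = f"{ONES[n // 100]} hundred"
        rest = n % 100
        return head if rest == 0 else f"{head} {number_to_words(rest)}"
    raise ValueError("sketch handles 0-999 only")

def expand_tokens(text):
    return " ".join(number_to_words(int(t)) if t.isdigit() else t
                    for t in text.split())
```

Applied at both index and query time, this removes the need to enumerate every number in a synonyms dictionary.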
On Tue, Jan 10, 2012 at 10:21 PM, Tanner Postert tanner.post...@gmail.com wrote:
You mention that is one way to do it. Is there another I'm not seeing?
On Jan 10, 2012, at 4:34 PM, Ted Dunning ted.dunn...@gmail.com wrote:
On Tue, Jan 10
Option 3 is preferable because you can use phrase queries to get
interesting results, as in "color light beige" or "color light".
Normalizing is bad in this kind of environment.
On Sun, Jan 8, 2012 at 11:35 AM, jimmy jimmyt...@bobmail.info wrote:
...
First Table KEYWORDS:
keyword_id, keyword
1,
On Sun, Jan 8, 2012 at 3:33 PM, Michael Lissner
mliss...@michaeljaylissner.com wrote:
I have a unique use case where I have words in my corpus that users
shouldn't ever be allowed to search for. My theory is that if I add these
to the stopwords list, that should do the trick.
That should do
This copying is a bit overstated here because of the way that small
segments are merged into larger segments. Those larger segments are then
copied much less often than the smaller ones.
While you can wind up with lots of copying in certain extreme cases, it is
quite rare. In particular, if you
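A back-of-envelope sketch of why the copying is amortized (my own illustration, not from the thread): with a merge factor m, each document is rewritten roughly once per merge level, i.e. about log_m of the ratio of total size to flush size, rather than once per merge event.

```python
import math

def copies_per_doc(total_docs, buffer_docs, merge_factor):
    """Approximate number of times a document is rewritten by merging:
    one copy per merge level, with levels ~ log_m(total / buffer)."""
    levels = math.log(total_docs / buffer_docs, merge_factor)
    return max(1, math.ceil(levels))
```

So ten million documents flushed in 1,000-document segments with a merge factor of 10 cost only about four rewrites per document over the index's whole life.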
On Thu, Dec 22, 2011 at 7:02 AM, Zoran | Bax-shop.nl
zoran.bi...@bax-shop.nl wrote:
Hello,
What are (ballpark figure) the hardware requirements (disk space, memory)
SOLR will use in this case:
* Heavy Dutch traffic webshop, 30.000 - 50.000 visitors a day
Unique users doesn't much
You didn't mention how big your data is or how you create it.
Hadoop would mostly be used in the preparation of the data or the off-line
creation of indexes.
On Tue, Dec 20, 2011 at 12:28 PM, Alireza Salimi alireza.sal...@gmail.com wrote:
Hi,
I have a basic question, let's say we're going to
of users,
and a rough estimation for each user's data would be something around
5 MB.
The other problem is that those data will be changed very often.
I hope I answered your question.
Thanks
On Tue, Dec 20, 2011 at 4:00 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
You didn't mention how
I thought it was slightly clumsy, but it was informative. It seemed like a
fine thing to say. Effectively it was "I/we have developed a tool that
will help you solve your problem." That is responsive to the OP and it is
clear that it is a commercial deal.
On Fri, Dec 16, 2011 at 10:02 AM, Jason
Sounds like we disagree.
On Fri, Dec 16, 2011 at 11:56 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
Ted,
...- FREE! is stupid idiot spam. It's annoying and not suitable.
On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
I thought it was slightly
We still disagree.
On Fri, Dec 16, 2011 at 12:29 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
Ted,
The list would be unreadable if everyone spammed at the bottom of their
email like Otis'. It's just bad form.
Jason
On Fri, Dec 16, 2011 at 12:00 PM, Ted Dunning ted.dunn
Here is a talk I did on this topic at HPTS a few years ago.
On Thu, Dec 15, 2011 at 4:28 PM, Robert Petersen rober...@buy.com wrote:
I see there is a lot of discussions about micro-sharding, I'll have to
read them. I'm on an older version of solr and just use master index
replicating out to
On Mon, Dec 5, 2011 at 3:28 PM, Shawn Heisey s...@elyograg.org wrote:
On 12/4/2011 12:41 AM, Ted Dunning wrote:
Read the papers I referred to. They describe how to search a fairly
enormous corpus with an 8GB in-memory index (and no disk cache at all).
They would seem to indicate moving
Sax is attractive, but I have found it lacking in practice. My primary
issue is that in order to get sufficient recall for practical matching
problems, I had to do enough query expansion that the speed advantage of
inverted indexes went away.
The OP was asking for blob storage, however, and I
On Sat, Dec 3, 2011 at 10:54 AM, Shawn Heisey s...@elyograg.org wrote:
In another thread, something was said that sparked my interest:
On 12/1/2011 7:17 PM, Ted Dunning wrote:
Of course, resharding is almost never necessary if you use micro-shards.
Micro-shards are shards small enough
On Sat, Dec 3, 2011 at 6:36 PM, Shawn Heisey s...@elyograg.org wrote:
On 12/3/2011 2:25 PM, Ted Dunning wrote:
Things have changed since I last did this sort of thing seriously. My
guess is that this is a relatively small amount of memory to devote to
search. It used to be that the only way
Of course, resharding is almost never necessary if you use micro-shards.
Micro-shards are shards small enough that you can fit 20 or more on a
node. If you have that many on each node, then adding a new node consists
of moving some shards to the new machine rather than moving lots of little
Well, this goes both ways.
It is not that unusual to take a node down for maintenance of some kind or
even to have a node failure. In that case, it is very nice to have the
load from the lost node be spread fairly evenly across the remaining
cluster.
Regarding the cost of having several
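The load-spreading arithmetic above can be sketched like this (my own illustration, with the shard counts as example assumptions):

```python
def load_increase_after_failure(nodes, shards_per_node):
    """With micro-shards spread evenly, a lost node's shards are
    redistributed across the survivors, so each survivor's load grows
    by roughly 1/(nodes - 1)."""
    survivors = nodes - 1
    per_survivor = shards_per_node / survivors
    return per_survivor / shards_per_node  # fractional load increase
```

With 21 nodes at 20 micro-shards each, losing one node raises every surviving node's load by only about 5%, instead of doubling the load on a single replica.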
while its 'busy' splitting).
- Mark
On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:
Of course, resharding is almost never necessary if you use micro-shards.
Micro-shards are shards small enough that you can fit 20 or more on a
node. If you have that many on each node, then adding
You can do that pretty easily by just retrieving extra documents and post
processing the results list.
You are likely to have a significant number of apparent duplicates this
way.
To really get rid of duplicates in results, it might be better to remove
them from the corpus by deploying something
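The retrieve-extra-then-filter approach can be sketched as follows (my illustration; the `sig` field standing in for a content hash or LSH signature is an assumption):

```python
def dedup_results(results, key, wanted):
    """Retrieve more documents than needed, then keep only the first
    occurrence of each duplicate-signature key until `wanted` remain."""
    seen, out = set(), []
    for doc in results:
        k = doc[key]
        if k in seen:
            continue
        seen.add(k)
        out.append(doc)
        if len(out) == wanted:
            break
    return out
```

The weakness the thread points out is visible here: if the extra documents you fetched are mostly duplicates of each other, you can still come up short of `wanted`, which is why removing duplicates from the corpus itself is more reliable.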
Solr do it for me! The fact that I have to ask this
question is probably not a good sign, but what is LSH clustering?
On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
You can do that pretty easily by just retrieving extra documents and post
processing the results list
Google achieves their results by using data not found in the web pages
themselves. This additional data critically includes link text, but also
is derived from behavioral information.
On Sat, Nov 5, 2011 at 5:07 PM, alx...@aim.com wrote:
Hi Erick,
The term newspaper latimes is not found
That sounds like Nagle's algorithm.
http://en.wikipedia.org/wiki/Nagle's_algorithm#Interactions_with_real-time_systems
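If Nagle's algorithm is the culprit, the standard fix on the client side is the `TCP_NODELAY` socket option (a minimal sketch, not taken from any Solr client):

```python
import socket

def connect_no_nagle(host, port):
    """Open a TCP connection with Nagle's algorithm disabled, so small
    request writes go out immediately instead of being coalesced while
    waiting for ACKs."""
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

Most HTTP client libraries set this for you, which would explain why the same query is fast from the Solr admin screen but slow from a hand-rolled client.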
On Sun, Oct 30, 2011 at 2:01 PM, dar...@ontrenet.com wrote:
Another interesting note. When I use the Solr Admin screen to perform the
same query, it doesn't take as long.
On Thu, Oct 27, 2011 at 7:13 AM, Anatoli Matuskova
anatoli.matusk...@gmail.com wrote:
I don't like the idea of indexing a doc per each value, the dataset can
grow
a lot.
What does a lot mean? How high is the sky?
A million people with 3 year schedules is a billion tiny documents.
That