One Solr core has essentially one index in it (not just one 'field',
but one indexed collection of documents). There are weird hacks, like I
believe the spellcheck component kind of creates its own sub-indexes;
not sure how it does that.
You can have more than one core in a single Solr instance.
I might consider what Erick suggested to actually be 'normalization'
rather than de-normalization! It's just that in Solr you only get one
'table'.
Here's yet another approach, which will have its own trade-offs:
Keep the document as it is, representing a donor. But in addition to
indexing
everyone who donated more than 500 in 2006 with a
range query: fq=combo_field:[2007-0500 TO *]
I think.
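If I'm reading the combo_field idea right, a minimal sketch (the year/amount encoding is my assumption, with amounts zero-padded so lexical order matches numeric order):

  combo_field: 2006-0250   (a 250 donation in 2006)
  combo_field: 2007-0500   (a 500 donation in 2007)
  fq=combo_field:[2007-0500 TO *]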
Jonathan Rochkind wrote:
I might consider what Erick suggested to actually be 'normalization'
rather than de-normalization! It's just that in Solr you only get one
'table'.
Here's yet
Maybe you are looking for the 'bq' (boost query) parameter in dismax?
http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29
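For example, something along these lines (field names and boost invented):

  q=ipod&defType=dismax&qf=name description&bq=inStock:true^5.0

The bq clause boosts matching documents without restricting the result set.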
From: Ravi Kiran [ravi.bhas...@gmail.com]
Sent: Wednesday, September 15, 2010 10:02 PM
To:
CharFilters go before Tokenizers which go before (token) Filters.
Token filters (called just 'filter' in the config) operate on tokens, so they
need to go after the tokenizer. WhitespaceTokenizer is a tokenizer.
PatternReplaceFilterFactory is a token filter.
What you probably want instead is
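Presumably something like this, with the char filter running before the tokenizer (the pattern is just a placeholder):

  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="your-pattern" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>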
Shawn Heisey wrote:
The one called PatternReplaceFilterFactory (no Char) has been around
forever. It is not mentioned on the Wiki page about analyzers. The one
called PatternReplaceCharFilterFactory is only available from svn.
This seems to be true, which I hadn't realized either. The
The stats component will give you the maximum value within one field:
http://wiki.apache.org/solr/StatsComponent
You're going to have to compute the max amongst several fields
client-side, having StatsComponent return the max for each field, and
then just max-ing them client side. Not hard.
be in the
realm of trying to use boost functions to do it, which is likely possible.
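A sketch with invented field names -- ask for stats on each field in one request, then take the max of the per-field maxes in your client:

  stats=true&stats.field=amount_2006&stats.field=amount_2007&rows=0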
Jonathan Rochkind wrote:
The stats component will give you the maximum value within one field:
http://wiki.apache.org/solr/StatsComponent
You're going to have to compute the max amongst several fields
client
Faceting on a multi-value field?
I wonder if your positionIncrementGap for your field definition in your
schema is 256. I am not sure what it defaults to. But it seems possible
that, if it's 256, it could lead to what you observed. Try explicitly
defining it to be really, really big maybe? I'm not
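e.g. (the value is arbitrary, just much larger than any document's token count):

  <fieldType name="text" class="solr.TextField" positionIncrementGap="10000">
    ...
  </fieldType>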
Why would you want to do that, instead of just using another tokenizer
and a lowercase filter? It's more confusing, less DRY code to leave them
separate -- the LowerCaseTokenizerFactory combines the two anyway because
someone decided it was such a common use case that it was worth it for
the
:10 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
How about patching the LetterTokenizer to be capable of tokenizing how you
want, which can then be combined with a LowerCaseFilter (or not) as desired.
Or indeed creating a new tokenizer to do exactly what you want, possibly
(but one that doesn't
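For comparison, the combined tokenizer and the two-step chain, which as far as I know produce identical tokens:

  <!-- combined -->
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>

  <!-- equivalent two-step chain -->
  <tokenizer class="solr.LetterTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>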
I've been thinking about this too, and haven't come up with any GREAT way. But
there are several possible ways, that will do different things, good or bad,
depending on the nature of your data and exactly what you want to do. So here
are some ideas I've been thinking about, but not a ready
MoreLikeThis is intended to be run at query time. For what reasons are you
thinking you want to (re-)index each document based on the results of
MoreLikeThis? You're right that that's not what the component is intended for.
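For reference, run at query time it looks something like this (field names invented):

  /solr/select?q=id:12345&mlt=true&mlt.fl=title,body&mlt.count=5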
Jonathan
From: Savannah
Those two queries might NOT always be 'the same', depending on how you
have your Solr request handler set up.
For instance, if you have dismax with a ps boost, then 'two one' may end
up with different relevancy scores than 'one two', because the query as
a phrase will be used for boosting, and
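For concreteness, a setup where word order matters (parameters invented):

  defType=dismax&qf=title text&pf=title&ps=0&q=one two

Here 'one two' gets the implicit phrase boost on title; 'two one' does not.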
I'll guess he means client/server.
HTTP is a client/server protocol, isn't it?
I just found out about 'invariants', and I found out about another thing
too: appends. (I don't think either of these are actually documented
anywhere?).
I think maybe appends, rather than invariants, with the fq you always
want to be there, might be exactly what you want?
I actually
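For reference, appends in solrconfig.xml looks something like this (the fq value is just an example):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="appends">
      <str name="fq">status:public</str>
    </lst>
  </requestHandler>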
Just throwing it out there, I'd consider a different approach for an
actual real app, although it might not be easier to get up quickly. (For
quickly, yeah, I'd just store it as a string, more on that at bottom).
If none of your dates have times, they're all just full days, I'm not
sure you
I'm really thinking, once you convert to YYYY-MM-DD anyway, you might be
better off just sticking this in a string field, rather than using a
date field at all. The extra precision in the date field is going to
make things confusing later, I predict. Especially for a quick and dirty
prototype,
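e.g., with the dates normalized and stored in a plain string field (field name invented), lexical range queries line up with chronological order:

  date_s:[2010-01-01 TO 2010-12-31]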
http://wiki.apache.org/solr/SolrSecurity#Path_Based_Authentication
-Original message-
From: Jonathan Rochkind rochk...@jhu.edu
Sent: Wed 08-09-2010 19:19
To: solr-user@lucene.apache.org; markus.jel...@buyways.nl;
Subject: Re: Invariants on a specific fq value
I just found out about 'invariants', and I found
how Solr-savvy you are, so pardon if this is something you already know. But
lots of people trip up over the string field type, which is NOT tokenized.
You usually want text unless it's some sort of ID. So it might be worth
it to do some searching earlier rather than later <g>
Why
If there is no default or request-provided value, will the appends
still be used? I suspect so, but let us know, perhaps by adding it to
the wiki page!
Markus Jelsma wrote:
Sounds great! I'll be very sure to put it to the test tomorrow and perhaps add
documentation on these types to the
the 'tint' field mentioned below apply?
Dennis Gearon
Signature Warning
EARTH has a Right To Life,
otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php
--- On Wed, 9/8/10, Jonathan Rochkind rochk...@jhu.edu wrote:
From: Jonathan Rochkind rochk...@jhu.edu
Subject: Re: How to import data with a different date format
To: solr-user@lucene.apache.org solr-user
Of course you can store whatever you want in a solr index. And if you
store an integer as a Solr 1.4 int type, you can certainly query for
all documents with a value greater than some specified integer in that field.
You can't use SQL to query Solr though.
I'm not sure what you're really asking?
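If the question is how to express 'greater than N' without SQL, a standard range query does it (field name invented):

  q=quantity:[100 TO *]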
You _could_ use SolrJ with EmbeddedSolrServer. But personally I wouldn't
unless there's a reason to. There's no automatic reason not to use the
ordinary Solr HTTP api, even for an in-house application which is not a web
application. Unless you have a real reason to use embedded solr, I'd use
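A minimal SolrJ-over-HTTP sketch (Solr 1.4-era API; the URL and query are placeholders):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class SolrQuickQuery {
    public static void main(String[] args) throws Exception {
      // talks to a normal HTTP Solr rather than an embedded one
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      QueryResponse rsp = server.query(new SolrQuery("*:*"));
      System.out.println("hits: " + rsp.getResults().getNumFound());
    }
  }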
Not necessarily definitive, but filters and tokenizers can be found here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Not sure if that's all of the analyzers (which I think is the generic name for
both tokenizers and filters) that come with Solr, but I believe it's at least
What matters isn't how many documents have a value, so much as how many unique
values there are in the field total. If there aren't that many, faceting can be
done fairly quickly and fairly efficiently.
Otherwise, the only thing I can think of is experimenting with the two
different facet
for.
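If the 'two different facet' approaches cut off above are the two facet methods (my reading), that would mean experimenting with:

  facet.method=enum
  facet.method=fc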
From: Ron Mayer [r...@0ape.com]
Sent: Monday, September 06, 2010 8:27 PM
To: solr-user@lucene.apache.org
Subject: Re: Many sparse facets?
Jonathan Rochkind wrote:
What matters isn't how many documents have a value, so much
as how many unique values
The number of results if no filters from that facet were used is simply the
total result count of the current search -- the search returned by that
query already is the result set you'd get with no filters from that facet
applied. The facet values are 'drill downs' into the current search. So simply
Is the OS disk cache something you configure, or something the OS just does
automatically based on available free RAM? Or does it depend on the exact OS?
Thinking about the OS disk cache is new to me. Thanks for any tips.
From: Shawn Heisey
I've run into this before too. Both the dismax and solr-lucene _query
parsers_ will tokenize a query on whitespace _before_ they pass the
query to any field analyzers.
There are some reasons for this, lots of things wouldn't work if they
didn't do this.
But it makes your approach kind of
A common way is to make a facet string of categoryId-2_name_imageurl.
Then in your UI display the categoryId part of the facet.
I've been thinking about doing something like this for the same purposes. Will
having an extra long facet string like that have any effect on faceting
performance?
What analyzers are on your field?
From: j [jta...@gmail.com]
Sent: Saturday, August 07, 2010 1:17 PM
To: solr-user@lucene.apache.org
Subject: dismax debugging hyphens dashes
How does one debug index vs. dismax query parser?
I have a solr instance with 1
Mickael Magniez wrote:
Thanks for your response.
Unfortunately, I don't think it'll be enough. In fact, I have many other
products than shoes in my index, with many other facets fields.
I simplified my schema : in reality facets are dynamic fields.
You could change the way you do
This is tricky. You could try doing something with the ShingleFilter
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory)
at _query time_ to turn the user's query 'i have a swollen foot' into:
i, i have, i have a, i have a swollen, have, have a,
have a
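A sketch of a query-side shingle setup (the sizes are my guess):

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
  </analyzer>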
Eloi Rocha wrote:
Hi everybody,
I would like to know if it makes sense to use Solr in the following
scenario:
- search for large amount of data (like 1000, 1, 10 registers)
- each register contains four or five fields (strings and integers)
- every time will request for entire
Frederico Azeiteiro wrote:
But it is unusual to use both leading and trailing * operator. Why are
you doing this?
Yes I know, but I have a few queries that need this. I'll try the
ReversedWildcardFilterFactory.
ReverseWildcardFilter will help leading wildcard, but will not
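For reference, the index-side analyzer would gain something like this (withOriginal keeps the forward tokens too):

  <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"/>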
Chris Hostetter wrote:
Honestly: if you have a really small cardinality for these numeric
values (ie: small enough to return every value on every request) perhaps
you should use faceting to find the min/max values (with facet.mincount=1)
instead of stats?
Thanks for the tips and info.
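Hoss's suggestion would look something like this (field name invented); with facet.limit=-1 every value comes back in index order, so the first and last are the min and max:

  q=*:*&rows=0&facet=true&facet.field=price&facet.mincount=1&facet.limit=-1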
Thanks Hoss, the problem was transient, I believe that my index had
become corrupted (changed the schema but hadn't fully deleted all
documents that had been using the previous version of the schema), my
fault.
qf needs to have spaces in it, unfortunately the local query parser can not
deal with that, as Erik Hatcher mentioned some months ago.
By local query parser, you mean what I call the LocalParams stuff (for lack
of being sure of the proper term)? You can put spaces in there, you just need
to
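Presumably by quoting, e.g. (field names invented):

  q={!dismax qf='title^5 body^2'}some query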
I therefore wrote an implementation of SolrSpellChecker that wraps Jazzy,
the Java aspell library. I also extended the SpellCheckComponent to take
the
matrix of suggested words and query the corpus to find the first
combination
of suggestions which returned a match. This works well for my use
At first I was thinking the TermsComponent might give you this, but
oddly it seems not to.
http://wiki.apache.org/solr/TermsComponent
I thought I asked a variation of this before, but I don't see it on the
list, apologies if this is a duplicate, but I have new questions.
So I need to find the min and max value of a result set. Which can be
several million documents. One way to do this is the StatsComponent.
One problem is
Is there a way to tell Solr to only return a specific set of facet values? I
feel like the facet query must be able to do this, but I'm not really
understanding the facet query. In my specific case, I'd like to only see
facet
values for the same values I pass in as query filters, i.e. if I
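facet.query can indeed do this; something like (field and values invented):

  facet=true&facet.query=color:red&facet.query=color:blue

Each facet.query returns its own count, independent of any facet.field.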
I am now occasionally getting a Java 'GC overhead limit exceeded' error
in my Solr. This may or may not be related to recently adding much
better (and more) warming queries.
I can get it when trying a 'commit', after deleting all documents in my
index, or in other cases.
Anyone run into
Short answer: GC overhead limit exceeded means out of memory.
Aha, thanks. So the answer is just raise your Xmx/heap size, you need more
memory to do what you're doing, yeah?
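e.g., with the example Jetty setup, something like:

  java -Xmx1024m -jar start.jar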
Jonathan
Man, what types of fields is StatsComponent actually known to work with?
With an sint, it seems to have trouble if there are any documents with null
values for the field. It appears to decide that a null/empty/blank value is
-1325166535, and is thus the minimum value.
At least if I'm
britske wrote:
*If* you could query on internal docid (I'm not sure that it's available
out-of-the-box, or if you can at all)
your original problem, quoted below, could imo be simplified to asking for
the last docid inserted (that match the other criteria from your use-case)
and in the next
Perhaps completely unnecessary when you have a controlled domain, but I
meant to use ids for places instead of names, because names will quickly
become ambiguous, e.g.: there are numerous different places over the world
called washington, etc.
This is related to something I've been thinking
I am keeping the id-to-name lookups in SOLR though, I just use some
lookup fields where I put id and name into one field, separated by
some fixed delimiter, e.g.
134982__Some name I am going to lookup later
The separator here would be two underscores (__).
So I can query for that lookup
and a typical query would be:
fl=id,type,timestamp,score&start=0&q=Coca+Cola+pepsi+-dr+pepper&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)
rows=2000
My understanding is that this is essentially what the solr 1.4 trie date fields
are made for, I'd use them, should speed
and a typical query would be:
fl=id,type,timestamp,score&start=0&q=Coca+Cola+pepsi+-dr+pepper&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)
rows=2000
On top of using trie dates, you might consider separating the timestamp portion
and the type portion of the fq into separate
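i.e., presumably something like:

  fq=timestamp:[2010-07-07T00:00:00Z TO NOW]&fq=type:x OR type:y

so each filter is cached (and reused) independently in the filterCache.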
I think the Synonym filter should actually do exactly what you want, no?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
Hmm, maybe not exactly what you want as you describe it. It comes close,
maybe good enough. Do you REALLY need to support I Business M
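For reference, a synonym setup looks something like this (entries invented) -- a line in synonyms.txt:

  IBM, I.B.M., International Business Machines

and the filter in the analyzer chain:

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>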
Paul Dlug wrote:
On Thu, Jul 22, 2010 at 4:01 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
The synonym approach won't work as I need to provide them in a file.
The variants may be more dynamic and not known in advance, the process
creating the documents to index does have that logic
Chris Hostetter wrote:
computing the number: in some algorithms it's relatively cheap (on a
single server) but in others it's more expensive than computing the facet
counts being returned (consider the case where we are sorting in term
order - once we have collected counts for ${facet.limit}
I would like get the total count of the facet.field response values
I'm pretty sure there's no way to get Solr to do that -- other than not setting
a facet.limit, getting every value back in the response, and counting them
yourself (not feasible for very large counts). I've looked at trying
When I use the stats component on a field that has no values in the
result set (ie, stats.missing == rowCount), I'd expect that 'min' and
'max' would be blank.
Instead, they seem to be the largest and smallest possible double values:
min = 1.7976931348623157E308 (Double.MAX_VALUE), max = 4.9E-324 (Double.MIN_VALUE).
Is this
1) While doing a dismax query, I specify the query in double quotes for
exact match. This works fine but I don't get any partial matches in search
result.
Rather than specify your query in quotes for 'exact' matches, I was suggesting
configuring the analyzers differently for your fields
I needed to get counts based on GRPID clubbed with GRPNAME, not different sets
Perhaps using facet.query to write your own sub queries that will collect
whatever you want?
Exactly. The weight is a weight of a given tag for specific document, not
weight of the field as in weighted search. So one document may have tag1
with weight of 0.1, and another may have the same tag1 with weight=0.8.
I've never used it, but I think this is the use case that the Solr feature
'the' sounds like it might be a stopword. Are you using stopwords in any
of your fields covered by the dismax search? But not in some of the
other fields covered by dismax? the combination of dismax and stopwords
can result in unexpected behavior if you aren't careful.
I wrote about this a bit
Shawn Heisey wrote:
[* TO NOW-2YEARS]^1.0
I also seem to remember seeing something about how to do less than in
range queries as well as the less than or equal to implied by the
above, but I cannot find it now.
Ranges with square brackets [] are inclusive. Ranges with curly braces {} are exclusive.
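e.g. (field invented):

  price:[10 TO 100]   matches 10 and 100
  price:{10 TO 100}   excludes both endpoints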
i'm hoping that -- faceting simply calculates+returns the counts for docs that
have the field present while results may still contain documents that don't
have the facet field (i.e. the field faceted on)?
Yes, that's exactly what happens. You can use facet.missing to get a count for
documents
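e.g. (field invented):

  facet=true&facet.field=category&facet.missing=true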
I think you're going to have trouble doing this with separate cores.
With separate cores, you'll need to issue two queries to Solr, one for
each core. And then to intermingle results from the different cores like
that, it's going to require difficult (esp. to do at all efficiently)
client side
So I want to provide some range facets with an integer (probably tint,
that is trie field with non-0 precision) solr field.
It's clear enough how to do this, along the lines of facet.query=[1 TO
100]&facet.query=[101 TO 200]&facet.query=[201 TO 300]
etc.
The issue is that I'd like to
So I will have a solr field that contains years, ie, 1990, 2010,
maybe even 1492, 1209 and 907/0907.
I will be doing range limits over this field. Ie, [1950 TO 1975] or
what have you. The data represents publication dates of books on a
large library shelves; there will be around 3 million
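For reference, the tint type from the Solr 1.4 example schema:

  <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>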
So using LocalParams with dollar-sign references to other parameters.
In LocalParams in general, you can use single-quotes for values that
have spaces in them:
{!dismax qf='field^5 field2^10'} => no problem
And even if the value does not have spaces, you can use single quotes
too, why
The SolrRelevancyFAQ does suggest that both index-time and search-time boosting
can be used to boost the score of newer documents, but doesn't suggest what
reasons/contexts one might choose one vs the other. It only provides an
example of search-time boost though, so it doesn't answer the
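For reference, the search-time example there is along these lines (the field name is a placeholder):

  bf=recip(ms(NOW,mydate_dt),3.16e-11,1,1)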
Using solr-lucene query parser, is there a difference between using
AND and using + in queries like this:
1) q=some_field:(one AND two AND "some phrase")
2) q=some_field:(+one +two +"some phrase")
Are those always exactly identical in all respects, or are there any
differences in terms
You CAN easily turn spellchecking on or off, or set the spellcheck dictionary,
in request parameters. So there's really no need, that I can think of, to try
to actually add or remove the spellcheck component in request parameters; you
could just leave it turned off in your default parameters,
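then turn it on per request with something like:

  spellcheck=true&spellcheck.dictionary=default&spellcheck.count=5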
with that. Or if you give a single
real example as a general pattern, perhaps we could help figure out
the simplest way to avoid most of the escaping.
-Yonik
http://www.lucidimagination.com
On Tue, Jun 1, 2010 at 6:21 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
I am just trying to figure it out
Thanks, the pointer to that documentation page (which somehow I had
missed), as well as Chris's response is very helpful.
The one thing I'm still not sure about, which I might be able to figure
it out through trial-and-error reverse engineering, is escaping issues
when you combine nested
If you can give a specific example, we might be able to suggest easier
ways to achieve it rather than going escape crazy :-)
-Yonik
http://www.lucidimagination.com
On Tue, Jun 1, 2010 at 5:06 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
Thanks, the pointer to that documentation page
One way to do it would be to use dismax request handler at query time, with a
pf parameter on the same field(s) as your qf parameter, but with a big boost on
the pf. http://wiki.apache.org/solr/DisMaxRequestHandler
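e.g. (fields and boosts invented):

  defType=dismax&qf=title text&pf=title^100 text^100&q=tiger woods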
I'm not sure why you're getting matches for 'tigers' and 'woods' on 'tiger
woods'
So I'm trying to wrap my head around nested querries. Also that thing
that isn't a nested query, but is similar, which I think is called
LocalParams syntax, like:
q={!dismax qf=$something}cat dog
(All my examples are not URL-encoded for clarity, of course they'd have
to be before sending to
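A concrete version with the referenced parameter filled in (names invented):

  q={!dismax qf=$my_qf}cat dog&my_qf=title^5 text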