Re: Simple Filter Query (fq) Use Case Question

2010-09-16 Thread Jonathan Rochkind
One solr core has essentially one index in it. (not only one 'field', but one indexed collection of documents) There are weird hacks, like I believe the spellcheck component kind of creates its own sub-indexes, not sure how it does that. You can have more than one core in a single solr

Re: Simple Filter Query (fq) Use Case Question

2010-09-15 Thread Jonathan Rochkind
I might consider what Erick suggested to actually be 'normalization' rather than de-normalization! It's just that in Solr you only get one 'table'. Here's yet another approach, which will have its own trade-offs: Keep the document as it is, representing a donor. But in addition to indexing

Re: Simple Filter Query (fq) Use Case Question

2010-09-15 Thread Jonathan Rochkind
everyone who donated more than 500 in 2006 with a range query: fq=combo_field:[2007-0500 TO *] I think. Jonathan Rochkind wrote: I might consider what Erick suggested to actually be 'normalization' rather than de-normalization! It's just that in Solr you only get one 'table'. Here's yet

RE: Boosting specific field value

2010-09-15 Thread Jonathan Rochkind
Maybe you are looking for the 'bq' (boost query) parameter in dismax? http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29 From: Ravi Kiran [ravi.bhas...@gmail.com] Sent: Wednesday, September 15, 2010 10:02 PM To:

RE: order of analyzers, tokeinizers and filters

2010-09-14 Thread Jonathan Rochkind
CharFilters go before Tokenizers which go before (token) Filters. Token filters (called just filter in the config) operate on tokens, so need to go after the tokenizer. WhitespaceTokenizer is a tokenizer. PatternReplaceFilterFactory is a token filter. What you probably want instead is
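
The ordering described above can be illustrated with a toy pipeline. This is a minimal sketch in plain Python, not Solr code: the three functions and the hyphen-replacement rule are hypothetical stand-ins for a char filter, a tokenizer, and a token filter.

```python
def char_filter(text):
    # Char filter stage: rewrites the raw character stream before tokenizing.
    # Replacing hyphens with spaces is a made-up rule for illustration.
    return text.replace("-", " ")


def tokenizer(text):
    # Tokenizer stage: splits the character stream into tokens on whitespace.
    return text.split()


def lowercase_filter(tokens):
    # Token filter stage: can only operate on tokens that already exist.
    return [token.lower() for token in tokens]


def analyze(text):
    # Order matters: char filters, then the tokenizer, then token filters.
    return lowercase_filter(tokenizer(char_filter(text)))


tokens = analyze("Wi-Fi Router")
print(tokens)
```

Because the token filter receives a list of tokens, it cannot run before the tokenizer has produced them, which is the point the message makes.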

Re: PatternReplaceCharFilterFactory?

2010-09-14 Thread Jonathan Rochkind
Shawn Heisey wrote: The one called PatternReplaceFilterFactory (no Char) has been around forever. It is not mentioned on the Wiki page about analyzers. The one called PatternReplaceCharFilterFactory is only available from svn. This seems to be true, which I hadn't realized either. The

Re: Returning max value of fields within documents

2010-09-14 Thread Jonathan Rochkind
The stats component will give you the maximum value within one field: http://wiki.apache.org/solr/StatsComponent You're going to have to compute the max amongst several fields client-side, having StatsComponent return the max for each field, and then just max-ing them client side. Not hard.
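
A minimal sketch of the client-side step, assuming the response was requested as JSON and parsed into a dict shaped like `stats -> stats_fields -> <field> -> max`. The field names `price_a` and `price_b` and the sample numbers are hypothetical.

```python
def max_across_fields(stats_response, fields):
    """Return the largest per-field max reported in the stats section, or None."""
    per_field = stats_response["stats"]["stats_fields"]
    maxima = [
        per_field[name]["max"]
        for name in fields
        # Skip fields that are absent or have no max reported.
        if per_field.get(name) is not None and per_field[name].get("max") is not None
    ]
    return max(maxima) if maxima else None


# Hypothetical, already-parsed response fragment.
sample = {
    "stats": {
        "stats_fields": {
            "price_a": {"min": 1.0, "max": 40.0},
            "price_b": {"min": 2.0, "max": 75.5},
        }
    }
}

overall_max = max_across_fields(sample, ["price_a", "price_b"])
print(overall_max)
```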

Re: Returning max value of fields within documents

2010-09-14 Thread Jonathan Rochkind
be in the realm of trying to use boost functions to do it, which is likely possible. Jonathan Rochkind wrote: The stats component will give you the maximum value within one field: http://wiki.apache.org/solr/StatsComponent You're going to have to compute the max amongst several fields client

Re: Facet Field Value truncation

2010-09-14 Thread Jonathan Rochkind
Faceting on a multi-value field? I wonder if your positionIncrementGap for your field definition in your schema is 256. I am not sure what it defaults to. But it seems possible if it's 256 it could lead to what you observed. Try explicitly defining it to be really really big maybe? I'm not

Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?

2010-09-14 Thread Jonathan Rochkind
Why would you want to do that, instead of just using another tokenizer and a lowercasefilter? It's more confusing less DRY code to leave them separate -- the LowerCaseTokenizerFactory combines anyway because someone decided it was such a common use case that it was worth it for the

Re: LowerCaseTokenizerFactory - Tokenizer Options? Why does it behave this way?

2010-09-14 Thread Jonathan Rochkind
:10 PM, Jonathan Rochkind rochk...@jhu.edu wrote: How about patching the LetterTokenizer to be capable of tokenizing how you want, which can then be combined with a LowerCaseFilter (or not) as desired. Or indeed creating a new tokenizer to do exactly what you want, possibly (but one that doesn't

RE: Autocomplete with Filter Query

2010-09-10 Thread Jonathan Rochkind
I've been thinking about this too, and haven't come up with any GREAT way. But there are several possible ways, that will do different things, good or bad, depending on the nature of your data and exactly what you want to do. So here are some ideas I've been thinking about, but not a ready

RE: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Jonathan Rochkind
More like this is intended to be run at query time. For what reasons are you thinking you want to (re-)index each document based on the results of MoreLikeThis? You're right that that's not what the component is intended for. Jonathan From: Savannah

Re: how to normalize a query

2010-09-09 Thread Jonathan Rochkind
Those two queries might NOT always be 'the same', depending on how you have your Solr request handler set up. For instance, if you have dismax with a ps boost, then 'two one' may end up with different relevancy scores than 'one two', because the query as a phrase will be used for boosting, and

RE: Solr, c/s type ?

2010-09-08 Thread Jonathan Rochkind
I'll guess he means client/server. HTTP is a client/server protocol, isn't it?

Re: Invariants on a specific fq value

2010-09-08 Thread Jonathan Rochkind
I just found out about 'invariants', and I found out about another thing too: appends. (I don't think either of these are actually documented anywhere?). I think maybe appends rather than invariants, with your fq you want always to be there might be exactly what you want? I actually

Re: How to import data with a different date format

2010-09-08 Thread Jonathan Rochkind
Just throwing it out there, I'd consider a different approach for an actual real app, although it might not be easier to get up quickly. (For quickly, yeah, I'd just store it as a string, more on that at bottom). If none of your dates have times, they're all just full days, I'm not sure you

Re: How to import data with a different date format

2010-09-08 Thread Jonathan Rochkind
I'm really thinking, once you convert to YYYY-MM-DD anyway, you might be better off just sticking this in a string field, rather than using a date field at all. The extra precision in the date field is going to make things confusing later, I predict. Especially for a quick and dirty prototype,

Re: Invariants on a specific fq value

2010-09-08 Thread Jonathan Rochkind
/solr/SolrSecurity#Path_Based_Authentication -Original message- From: Jonathan Rochkind rochk...@jhu.edu Sent: Wed 08-09-2010 19:19 To: solr-user@lucene.apache.org; markus.jel...@buyways.nl; Subject: Re: Invariants on a specific fq value I just found out about 'invariants', and I found

Re: How to import data with a different date format

2010-09-08 Thread Jonathan Rochkind
how SOLR-savvy you are, so pardon if this is something you already know. But lots of people trip up over the string field type, which is NOT tokenized. You usually want text unless it's some sort of ID So it might be worth it to do some searching earlier rather than later G Why

Re: Invariants on a specific fq value

2010-09-08 Thread Jonathan Rochkind
If there is no default or request-provided value, will the appends still be used? I suspect so, but let us know, perhaps by adding it to the wiki page! Markus Jelsma wrote: Sounds great! I'll be very sure to put it to the test tomorrow and perhaps add documentation on these types to the

Re: How to import data with a different date format

2010-09-08 Thread Jonathan Rochkind
the 'tint' field mentioned below apply? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 9/8/10, Jonathan Rochkind rochk...@jhu.edu wrote: From: Jonathan

Re: How to import data with a different date format

2010-09-08 Thread Jonathan Rochkind
, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 9/8/10, Jonathan Rochkind rochk...@jhu.edu wrote: From: Jonathan Rochkind rochk...@jhu.edu Subject: Re: How to import data with a different date format To: solr-user@lucene.apache.org solr-user

Re: Delta Import with something other than Date

2010-09-08 Thread Jonathan Rochkind
Of course you can store whatever you want in a solr index. And if you store an integer as a Solr 1.4 int type, you can certainly query for all documents that have greater than some specified integer in a field. You can't use SQL to query Solr though. I'm not sure what you're really asking?

RE: Solr, c/s type ?

2010-09-08 Thread Jonathan Rochkind
You _could_ use SolrJ with EmbeddedSolrServer. But personally I wouldn't unless there's a reason to. There's no automatic reason not to use the ordinary Solr HTTP api, even for an in-house application which is not a web application. Unless you have a real reason to use embedded solr, I'd use

RE: list of filters/factories/Input handlers/blah blah

2010-09-07 Thread Jonathan Rochkind
Not necessarily definitive, but filters and tokenizers can be found here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Not sure if that's all of the analyzers (which I think is the generic name for both tokenizers and filters) that come with Solr, but I believe it's at least

RE: Many sparse facets?

2010-09-06 Thread Jonathan Rochkind
What matters isn't how many documents have a value, so much as how many unique values there are in the field total. If there aren't that many, faceting can be done fairly quickly and fairly efficiently. Otherwise, the only thing I can think of is experimenting with the two different facet

RE: Many sparse facets?

2010-09-06 Thread Jonathan Rochkind
for. From: Ron Mayer [r...@0ape.com] Sent: Monday, September 06, 2010 8:27 PM To: solr-user@lucene.apache.org Subject: Re: Many sparse facets? Jonathan Rochkind wrote: What matters isn't how many documents have a value, so much as how many unique values

RE: Show a facet filter All

2010-09-05 Thread Jonathan Rochkind
The number of results if no filters from that facet were used is simply the total result count of the current search -- the search being returned by that query already is the results if no filters from that facet were used. The facet values are 'drill downs' into the current search. So simply

RE: Solr crawls during replication

2010-09-03 Thread Jonathan Rochkind
Is the OS disk cache something you configure, or something the OS just does automatically based on available free RAM? Or does it depend on the exact OS? Thinking about the OS disk cache is new to me. Thanks for any tips. From: Shawn Heisey

Re: shingles work in analyzer but not real data

2010-09-02 Thread Jonathan Rochkind
I've run into this before too. Both the dismax and solr-lucene _query parsers_ will tokenize a query on whitespace _before_ they pass the query to any field analyzers. There are some reasons for this, lots of things wouldn't work if they didn't do this. But it makes your approach kind of

RE: facets - id and display value

2010-08-20 Thread Jonathan Rochkind
A common way is to make a facet string of categoryId-2_name_imageurl. Then in your UI display the categoryId part of the facet. I've been thinking about doing something like this for the same purposes. Will having an extra long facet string like that have any effect on faceting performance?

RE: dismax debugging hyphens dashes

2010-08-07 Thread Jonathan Rochkind
What analyzers are on your field? From: j [jta...@gmail.com] Sent: Saturday, August 07, 2010 1:17 PM To: solr-user@lucene.apache.org Subject: dismax debugging hyphens dashes How does one debug index vs. dismax query parser? I have a solr instance with 1

Re: No group by? looking for an alternative.

2010-08-05 Thread Jonathan Rochkind
Mickael Magniez wrote: Thanks for your response. Unfortunately, I don't think it'll be enough. In fact, I have many other products than shoes in my index, with many other facets fields. I simplified my schema : in reality facets are dynamic fields. You could change the way you do

Re: anti-words - exact match

2010-08-05 Thread Jonathan Rochkind
This is tricky. You could try doing something with the ShingleFilter (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory) at _query time_ to turn the user's query: i have a swollen foot into: i, i have, i have a, i have a swollen, have, have a, have a
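
The expansion listed above can be imitated in a few lines of Python. This is only a sketch of the idea, not the ShingleFilter itself: the function name and the `max_size=4` limit are assumptions, chosen because the listed output stops at four-word shingles.

```python
def shingles(query, max_size=4):
    """Return every run of 1..max_size consecutive words, grouped by start word."""
    words = query.split()
    result = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            # Only emit shingles that fit inside the query.
            if start + size <= len(words):
                result.append(" ".join(words[start:start + size]))
    return result


expanded = shingles("i have a swollen foot")
print(expanded)
```

The real filter emits tokens with position information rather than a flat list, so treat this as a way to see which word sequences would be produced.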

Re: Process entire result set

2010-08-05 Thread Jonathan Rochkind
Eloi Rocha wrote: Hi everybody, I would like to know if it makes sense to use Solr in the following scenario: - search for large amount of data (like 1000, 1, 10 registers) - each register contains four or five fields (strings and integers) - every time will request for entire

Re: wildcard and proximity searches

2010-08-03 Thread Jonathan Rochkind
Frederico Azeiteiro wrote: But it is unusual to use both leading and trailing * operator. Why are you doing this? Yes I know, but I have a few queries that need this. I'll try the ReversedWildcardFilterFactory. ReversedWildcardFilter will help with leading wildcards, but will not

Re: min/max, StatsComponent, performance

2010-08-03 Thread Jonathan Rochkind
Chris Hostetter wrote: Honestly: if you have a really small cardinality for these numeric values (ie: small enough to return every value on every request) perhaps you should use faceting to find the min/max values (with facet.mincount=1) instead of stats? Thanks for the tips and info.
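
A minimal sketch of the faceting alternative on the client side, assuming a JSON response whose `facet_fields` entry is the flat alternating list `[value, count, value, count, ...]`, that every value was returned, and that values arrive as strings holding integers. The field name `year` and the sample counts are hypothetical.

```python
def min_max_from_facet(facet_counts, field):
    """Derive (min, max) from facet values that have at least one document."""
    flat = facet_counts["facet_fields"][field]
    # Values sit at even positions, their counts at the following odd positions.
    values = [int(value) for value, count in zip(flat[0::2], flat[1::2]) if count > 0]
    if not values:
        return None
    return min(values), max(values)


# Hypothetical, already-parsed facet_counts section.
sample = {"facet_fields": {"year": ["1990", 12, "2010", 3, "1492", 1]}}

bounds = min_max_from_facet(sample, "year")
print(bounds)
```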

Re: StatsComponent and sint?

2010-08-03 Thread Jonathan Rochkind
Thanks Hoss, the problem was transient, I believe that my index had become corrupted (changed the schema but hadn't fully deleted all documents that had been using the previous version of the schema), my fault.

RE: Boosting DisMax queries with !boost component

2010-08-01 Thread Jonathan Rochkind
qf needs to have spaces in it, unfortunately the local query parser can not deal with that, as Erik Hatcher mentioned some months ago. By local query parser, you mean what I call the LocalParams stuff (for lack of being sure of the proper term)? You can put spaces in there, you just need to

Re: Spellchecking and frequency

2010-07-28 Thread Jonathan Rochkind
I therefore wrote an implementation of SolrSpellChecker that wraps jazzy, the java aspell library. I also extended the SpellCheckComponent to take the matrix of suggested words and query the corpus to find the first combination of suggestions which returned a match. This works well for my use

Re: Total number of terms in an index?

2010-07-28 Thread Jonathan Rochkind
At first I was thinking the TermsComponent might give you this, but oddly it seems not to. http://wiki.apache.org/solr/TermsComponent

min/max, StatsComponent, performance

2010-07-27 Thread Jonathan Rochkind
I thought I asked a variation of this before, but I don't see it on the list, apologies if this is a duplicate, but I have new questions. So I need to find the min and max value of a result set. Which can be several million documents. One way to do this is the StatsComponent. One problem is

RE: How to 'filter' facet results

2010-07-27 Thread Jonathan Rochkind
Is there a way to tell Solr to only return a specific set of facet values? I feel like the facet query must be able to do this, but I'm not really understanding the facet query. In my specific case, I'd like to only see facet values for the same values I pass in as query filters, i.e. if I

java GC overhead limit exceeded

2010-07-26 Thread Jonathan Rochkind
I am now occasionally getting a Java GC overhead limit exceeded error in my Solr. This may or may not be related to recently adding much better (and more) warming queries. I can get it when trying a 'commit', after deleting all documents in my index, or in other cases. Anyone run into

RE: java GC overhead limit exceeded

2010-07-26 Thread Jonathan Rochkind
Short answer: GC overhead limit exceeded means out of memory. Aha, thanks. So the answer is just raise your Xmx/heap size, you need more memory to do what you're doing, yeah? Jonathan

StatsComponent and sint?

2010-07-26 Thread Jonathan Rochkind
Man, what types of fields is StatsComponent actually known to work with? With an sint, it seems to have trouble if there are any documents with null values for the field. It appears to decide that a null/empty/blank value is -1325166535, and is thus the minimum value. At least if I'm

RE: filter query on timestamp slowing query???

2010-07-25 Thread Jonathan Rochkind
britske wrote: *If* you could query on internal docid (I'm not sure that it's available out-of-the-box, or if you can at all) your original problem, quoted below, could imo be simplified to asking for the last docid inserted (that match the other criteria from your use-case) and in the next

RE: Tree Faceting in Solr 1.4

2010-07-24 Thread Jonathan Rochkind
Perhaps completely unnecessary when you have a controlled domain, but I meant to use ids for places instead of names, because names will quickly become ambiguous, e.g.: there are numerous different places over the world called washington, etc. This is related to something I've been thinking

RE: Tree Faceting in Solr 1.4

2010-07-24 Thread Jonathan Rochkind
I am keeping the id-to-name lookups in SOLR though, I just use some lookup fields where I put id and name into one field, separated by some fixed delimiter, e.g. 134982__Some name I am going to lookup later The separator here would be two underscores (__). So I can query for that lookup
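
The delimiter scheme described above is easy to round-trip on the client. A minimal sketch in Python; the helper names are hypothetical, and splitting only on the first delimiter is a design choice so that names containing two underscores survive.

```python
DELIMITER = "__"  # the fixed separator mentioned in the message


def make_lookup(place_id, name):
    # Build the combined value stored in the lookup field.
    return f"{place_id}{DELIMITER}{name}"


def parse_lookup(value):
    # Split on the first delimiter only, so the name may itself contain "__".
    place_id, name = value.split(DELIMITER, 1)
    return place_id, name


combined = make_lookup(134982, "Some name")
print(parse_lookup(combined))
```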

RE: filter query on timestamp slowing query???

2010-07-23 Thread Jonathan Rochkind
and a typical query would be: fl=id,type,timestamp,score&start=0&q=Coca+Cola+pepsi+-dr+pepper&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)&rows=2000 My understanding is that this is essentially what the solr 1.4 trie date fields are made for, I'd use them, should speed

RE: filter query on timestamp slowing query???

2010-07-23 Thread Jonathan Rochkind
and a typical query would be: fl=id,type,timestamp,score&start=0&q=Coca+Cola+pepsi+-dr+pepper&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)&rows=2000 On top of using trie dates, you might consider separating the timestamp portion and the type portion of the fq into separate
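
A sketch of what splitting that filter might look like, assuming the truncated suggestion means sending two independent `fq` parameters, which Solr then caches and reuses separately. It uses only the Python standard library, and the parameter values are taken from the query quoted above.

```python
from urllib.parse import parse_qs, urlencode

# A list of pairs (rather than a dict) lets the fq key repeat.
params = [
    ("q", "Coca Cola pepsi -dr pepper"),
    ("fl", "id,type,timestamp,score"),
    ("start", "0"),
    ("rows", "2000"),
    ("fq", "timestamp:[2010-07-07T00:00:00Z TO NOW]"),
    ("fq", "type:x OR type:y"),
]

query_string = urlencode(params)
print(query_string)
```

Keeping the type restriction in its own `fq` means it can be reused across requests even when the timestamp range changes.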

Re: Providing token variants at index time

2010-07-22 Thread Jonathan Rochkind
I think the Synonym filter should actually do exactly what you want, no? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory Hmm, maybe not exactly what you want as you describe it. It comes close, maybe good enough. Do you REALLY need to support I Business M

Re: Providing token variants at index time

2010-07-22 Thread Jonathan Rochkind
Paul Dlug wrote: On Thu, Jul 22, 2010 at 4:01 PM, Jonathan Rochkind rochk...@jhu.edu wrote: The synonym approach won't work as I need to provide them in a file. The variants may be more dynamic and not known in advance, the process creating the documents to index does have that logic

Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-22 Thread Jonathan Rochkind
Chris Hostetter wrote: computing the number: in some algorithms it's relatively cheap (on a single server) but in others it's more expensive than computing the facet counts being returned (consider the case where we are sorting in term order - once we have collected counts for ${facet.limit}

RE: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-19 Thread Jonathan Rochkind
I would like to get the total count of the facet.field response values I'm pretty sure there's no way to get Solr to do that -- other than not setting a facet.limit, getting every value back in the response, and counting them yourself (not feasible for very large counts). I've looked at trying

stats on a field with no values

2010-07-19 Thread Jonathan Rochkind
When I use the stats component on a field that has no values in the result set (ie, stats.missing == rowCount), I'd expect that 'min' and 'max' would be blank. Instead, they seem to be the smallest and largest float values or something, min = 1.7976931348623157E308, max = 4.9E-324 . Is this
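
The two numbers quoted above are the largest finite double and the smallest positive double, which suggests (as a guess, not something the thread confirms) that they are untouched starting values. A minimal client-side guard, assuming the per-field stats dict carries a `count` key; the helper name is hypothetical.

```python
import sys

DOUBLE_MAX = sys.float_info.max          # 1.7976931348623157e308
DOUBLE_MIN_POSITIVE = float("4.9E-324")  # smallest positive double


def clean_min_max(field_stats):
    """Return (min, max), or (None, None) when no document had a value."""
    if field_stats is None or field_stats.get("count", 0) == 0:
        return None, None
    return field_stats["min"], field_stats["max"]


# Hypothetical stats entry for a field with no values in the result set.
empty = {"min": DOUBLE_MAX, "max": DOUBLE_MIN_POSITIVE, "count": 0, "missing": 10}
print(clean_min_max(empty))
```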

RE: Get only partial match results

2010-07-17 Thread Jonathan Rochkind
1) While doing a dismax query, I specify the query in double quotes for exact match. This works fine but I don't get any partial matches in search result. Rather than specify your query in quotes for 'exact' matches, I was suggesting configuring the analyzers differently for your fields

RE: Getting facets count on multiple fields by doing a Group By

2010-07-17 Thread Jonathan Rochkind
I needed to get counts based GRPID clubbed with GRPNAME not different sets Perhaps using facet.query to write your own sub queries that will collect whatever you want?

RE: documents with known relevancy

2010-07-16 Thread Jonathan Rochkind
Exactly. The weight is a weight of a given tag for specific document, not weight of the field as in weighted search. So one document may have tag1 with weight of 0.1, and another may have the same tag1 with weight=0.8. I've never used it, but I think this is the use case that the Solr feature

Re: Strange the when search with dismax

2010-07-14 Thread Jonathan Rochkind
'the' sounds like it might be a stopword. Are you using stopwords in any of your fields covered by the dismax search? But not in some of the other fields covered by dismax? The combination of dismax and stopwords can result in unexpected behavior if you aren't careful. I wrote about this a bit

Re: date boosting and dismax

2010-07-14 Thread Jonathan Rochkind
Shawn Heisey wrote: [* TO NOW-2YEARS]^1.0 I also seem to remember seeing something about how to do less than in range queries as well as the less than or equal to implied by the above, but I cannot find it now. Ranges with square brackets [] are inclusive. Ranges with curly braces {} are

RE: faceting over field not in all documents

2010-07-13 Thread Jonathan Rochkind
i'm hoping that -- faceting simply calculates+returns the counts for docs that have the field present while results may still contain documents that don't have the facet field (i.e. the field faceted on)? Yes, that's exactly what happens. You can use facet.missing to get a count for documents

Re: Get only partial match results

2010-07-13 Thread Jonathan Rochkind
I think you're going to have trouble doing this with separate cores. With separate cores, you'll need to issue two queries to solr, one for each core. And then to intermingle results from the different cores like that, it's going to require difficult (esp to do at all efficiently) client side

range faceting with integers

2010-07-12 Thread Jonathan Rochkind
So I want to provide some range facets with an integer (probably tint, that is trie field with non-0 precision) solr field. It's clear enough how to do this, along the lines of facet.query=[1 TO 100]&facet.query=[101 TO 200]&facet.query=[201 TO 300] etc. The issue is that I'd like to
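
A sketch of generating those bucket queries rather than writing them by hand. The helper and the field name `pub_year` are hypothetical; the message omits the field prefix, which is added here on the assumption that each `facet.query` should target one field explicitly.

```python
from urllib.parse import urlencode


def range_facet_params(field, start, width, buckets):
    """Build one facet.query per fixed-width, inclusive range."""
    params = [("facet", "true")]
    for index in range(buckets):
        low = start + index * width
        high = low + width - 1  # inclusive upper bound, so buckets do not overlap
        params.append(("facet.query", f"{field}:[{low} TO {high}]"))
    return params


params = range_facet_params("pub_year", 1, 100, 3)
print(urlencode(params))
```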

year range field, proper data type?

2010-07-07 Thread Jonathan Rochkind
So I will have a solr field that contains years, ie, 1990, 2010, maybe even 1492, 1209 and 907/0907. I will be doing range limits over this field. Ie, [1950 TO 1975] or what have you. The data represents publication dates of books on a large library's shelves; there will be around 3 million

LocalParams, quotes, bug?

2010-06-16 Thread Jonathan Rochkind
So using LocalParams with dollar-sign references to other parameters. In LocalParams in general, you can use single-quotes for values that have spaces in them: {!dismax qf='field^5 field2^10'}= no problem And even if the value does not have spaces, you can use single quotes too, why

RE: Index-time vs. search-time boosting performance

2010-06-04 Thread Jonathan Rochkind
The SolrRelevancyFAQ does suggest that both index-time and search-time boosting can be used to boost the score of newer documents, but doesn't suggest what reasons/contexts one might choose one vs the other. It only provides an example of search-time boost though, so it doesn't answer the

solr-lucene AND vs +

2010-06-03 Thread Jonathan Rochkind
Using solr-lucene query parser, is there a difference between using AND and using + in queries like this: 1) q= some_field:( one AND two AND some phrase) 2) q= some_field:(+one +two +some phrase) Are those always exactly identical in all respects, or are there any differences in terms

RE: Array of arguments in URL?

2010-06-02 Thread Jonathan Rochkind
You CAN easily turn spellchecking on or off, or set the spellcheck dictionary, in request parameters. So there's really no need, that I can think of, to try to actually add or remove the spellcheck component in request parameters; you could just leave it turned off in your default parameters,

Re: nested querries, and LocalParams syntax

2010-06-02 Thread Jonathan Rochkind
with that. Or if you give a single real example as a general pattern, perhaps we could help figure out the simplest way to avoid most of the escaping. -Yonik http://www.lucidimagination.com On Tue, Jun 1, 2010 at 6:21 PM, Jonathan Rochkind rochk...@jhu.edu wrote: I am just trying to figure it out

Re: nested querries, and LocalParams syntax

2010-06-01 Thread Jonathan Rochkind
Thanks, the pointer to that documentation page (which somehow I had missed), as well as Chris's response is very helpful. The one thing I'm still not sure about, which I might be able to figure out through trial-and-error reverse engineering, is escaping issues when you combine nested

Re: nested querries, and LocalParams syntax

2010-06-01 Thread Jonathan Rochkind
. If you can give a specific example, we might be able to suggest easier ways to achieve it rather than going escape crazy :-) -Yonik http://www.lucidimagination.com On Tue, Jun 1, 2010 at 5:06 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Thanks, the pointer to that documentation page

RE: Query related question

2010-06-01 Thread Jonathan Rochkind
One way to do it would be to use dismax request handler at query time, with a pf parameter on the same field(s) as your qf parameter, but with a big boost on the pf. http://wiki.apache.org/solr/DisMaxRequestHandler I'm not sure why you're getting matches for tigers and woods on tiger woods

nested querries, and LocalParams syntax

2010-05-26 Thread Jonathan Rochkind
So I'm trying to wrap my head around nested queries. Also that thing that isn't a nested query, but is similar, which I think is called LocalParams syntax, like: q={!dismax qf=$something}cat dog (All my examples are not URL-encoded for clarity, of course they'd have to be before sending to
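
A minimal sketch of the URL-encoding step the message mentions, using Python's standard library. The host, path, and the value given to the `something` parameter (`title^5 body^10`) are hypothetical placeholders.

```python
from urllib.parse import parse_qs, urlencode

params = {
    # LocalParams prefix; $something refers to the request parameter below.
    "q": "{!dismax qf=$something}cat dog",
    # Hypothetical field list with boosts.
    "something": "title^5 body^10",
}

# Hypothetical local Solr URL; urlencode escapes braces, spaces, $ and ^.
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

Decoding the query string again returns the original unencoded values, which is a quick way to check that nothing was mangled on the way out.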
