Re: solr as nosql - pulling all docs vs deep paging limitations

2013-12-18 Thread Jens Grivolla
You can do range queries without an upper bound, sort on the key field, 
and just limit the number of results. Then you look at the last result 
of each batch to obtain the new lower bound.
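
A minimal sketch of that loop in Python with the requests library (the 
core URL and field names are assumptions; sku is the uniqueKey from the 
use case quoted below):

import requests

SOLR_URL = "http://localhost:8983/solr/products/select"  # hypothetical core
PAGE_SIZE = 500

def dump_all_docs():
    """Walk the whole index in sku order, without deep paging."""
    lower_bound = None
    while True:
        if lower_bound is None:
            q = "sku:[* TO *]"                   # first page: no lower bound
        else:
            # {x TO *] is an exclusive lower bound, so the last sku
            # already seen is not returned again (assumes sku values
            # need no query escaping)
            q = "sku:{%s TO *]" % lower_bound
        resp = requests.get(SOLR_URL, params={
            "q": q,
            "sort": "sku asc",   # needed so the last doc really is the max
            "rows": PAGE_SIZE,
            "fl": "sku,name",    # whatever fields the export needs
            "wt": "json",
        }).json()
        docs = resp["response"]["docs"]
        if not docs:
            break
        for doc in docs:
            yield doc
        lower_bound = docs[-1]["sku"]  # becomes the next lower bound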


-- Jens


On 17/12/13 20:23, Petersen, Robert wrote:

My use case is basically to do a dump of all contents of the index with no 
ordering needed.  It's actually to be a product data export for third parties.  
Unique key is product sku.  I could take the min sku and range query up to the 
max sku, but the skus are not contiguous (some get turned off, and only some 
are valid for export), so each range would return a different number of 
products. That may or may not be acceptable, and I might be able to kind of 
hide it with some code.

-Original Message-
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Tuesday, December 17, 2013 10:41 AM
To: solr-user
Subject: Re: solr as nosql - pulling all docs vs deep paging limitations

Hoss,

What about SELECT * FROM ... WHERE ... style (mis)use of Solr? I'm sure 
you've been asked about that many times.
What if the client doesn't need results ranked at all, but just wants an 
unordered filtered result set, like they are used to in an RDBMS?
Do you feel that will never be considered a reasonable use case for Solr, 
or is there a well-known approach for dealing with it?


On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:



: Then I remembered we currently don't allow deep paging in our current
: search indexes as performance declines the deeper you go.  Is this still
: the case?

Coincidentally, I'm working on a new cursor-based API to make this much
more feasible as we speak...

https://issues.apache.org/jira/browse/SOLR-5463

I did some simple perf testing of the strawman approach and posted the
results last week...


http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

...current iterations on the patch are to eliminate the strawman code
to improve performance even more and beef up the test cases.

: If so, is there another approach to make all the data in a collection
: easily available for retrieval?  The only thing I can think of is to
 ...
: Then I was thinking we could have a field with an incrementing numeric
: value which could be used to perform range queries as a substitute for
: paging through everything.  Ie queries like 'IncrementalField:[1 TO
: 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to
: maintain as we update the index unless we reindex the entire collection
: every time we update any docs at all.

As I mentioned in the blog above, as long as you have a uniqueKey
field that supports range queries, bulk exporting of all documents is
fairly trivial: sort on your uniqueKey field and use an fq that also
filters on your uniqueKey field, modifying the fq each time to change
the lower bound to match the highest ID you got on the previous page.
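
The difference from the plain range-query loop is that q holds the actual 
query and only the fq moves. Roughly like this sketch (Python with 
requests; the URL and the id uniqueKey field are assumptions):

import requests

SOLR_URL = "http://localhost:8983/solr/collection1/select"  # hypothetical

def export_matching(query):
    """All docs matching `query`, paged via an fq on the uniqueKey."""
    last_id = None
    while True:
        params = {"q": query, "sort": "id asc", "rows": 500, "wt": "json"}
        if last_id is not None:
            # only documents after the last id seen on the previous page
            params["fq"] = "id:{%s TO *]" % last_id
        docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
        if not docs:
            break
        for doc in docs:
            yield doc
        last_id = docs[-1]["id"]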

This approach works really well in simple cases where you want to
fetch all documents matching a query and then process/sort them by
some other criteria on the client -- but it's not viable if it's
important to you that the documents come back from solr in score order
before your client gets them because you want to stop fetching once
some criteria is met in your client.  Example: you have billions of
documents matching a query, you want to fetch all sorted by score desc
and crunch them on your client to compute some stats, and once your
client side stat crunching tells you you have enough results (which
might be after the 1000th result, or might be after the millionth result) then 
you want to stop.

SOLR-5463 will help even in that latter case.  The bulk of the patch
should be easy to use in the next day or so (having other people try it
out and test it in their applications would be *very* helpful) and
hopefully show up in Solr 4.7.
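
For reference, this is roughly how the cursor described in SOLR-5463 and 
the blog post is meant to be used; the cursorMark/nextCursorMark parameter 
names are taken from the issue, everything else is a placeholder:

import requests

SOLR_URL = "http://localhost:8983/solr/collection1/select"  # placeholder

def cursor_iterate(query="*:*"):
    """Iterate over a full result set using the SOLR-5463 cursor."""
    cursor = "*"  # initial cursor mark
    while True:
        resp = requests.get(SOLR_URL, params={
            "q": query,
            "sort": "score desc, id asc",  # sort must include the uniqueKey
            "rows": 500,
            "cursorMark": cursor,
            "wt": "json",
        }).json()
        for doc in resp["response"]["docs"]:
            yield doc
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:  # cursor no longer advances: done
            break
        cursor = next_cursor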

-Hoss
http://www.lucidworks.com/





--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
  mkhlud...@griddynamics.com







Re: Querying a transitive closure?

2013-03-28 Thread Jens Grivolla
Exactly, you should usually design your schema to fit your queries, and 
if you need to retrieve all ancestors then you should index all 
ancestors so you can query for them easily.


If that doesn't work for you then either Solr is not the right tool for 
the job, or you need to rethink your schema.


The description of doing lookups within a tree structure doesn't sound 
at all like what you would use a text retrieval engine for, so you might 
want to rethink why you want to use Solr for this. But if that 
transitive closure is something you can calculate at indexing time 
then the correct solution is the one Upayavira provided.
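
For illustration, a minimal sketch of that indexing-time approach, with 
made-up core and field names (Python, requests):

import json
import requests

SOLR = "http://localhost:8983/solr/taxonomy"  # hypothetical core

# At indexing time, store each node's full set of ancestors (the
# transitive closure, computed outside Solr) in a multivalued field.
docs = [
    {"id": "M", "ancestors": ["K", "D", "B", "root"]},
    {"id": "K", "ancestors": ["D", "B", "root"]},
]
requests.post(SOLR + "/update?commit=true", data=json.dumps(docs),
              headers={"Content-Type": "application/json"})

# isA?(M, B) then becomes a single lookup instead of a recursive ascent:
resp = requests.get(SOLR + "/select", params={
    "q": "id:M AND ancestors:B",
    "rows": 0,  # only the count is needed
    "wt": "json",
}).json()
print(resp["response"]["numFound"] > 0)  # True if M isA B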


If you want people to be able to help you, you need to actually describe 
your problem (i.e. what is my data, and what are my queries) instead of 
diving into technical details like reducing HTTP roundtrips. My guess 
is that if you need to reduce HTTP roundtrips you're probably doing it 
wrong.


HTH,
Jens

On 03/28/2013 08:15 AM, Upayavira wrote:

Why don't you index all ancestor classes with the document, as a
multivalued field, then you could get it in one hit. Am I missing
something?

Upayavira

On Thu, Mar 28, 2013, at 01:59 AM, Jack Park wrote:

Hi Otis,
That's essentially the answer I was looking for: each shard (are we
talking master + replicas?) has the plug-in custom query handler.  I
need to build it to find out.

What I mean is that there is a taxonomy, say one with a single root
for sake of illustration, which grows all the classes, subclasses, and
instances. If I have an object that is somewhere in that taxonomy,
then it has a zigzag chain of parents up that tree (I've seen that
called a transitive closure). If class B is way up that tree from M,
there's no telling how many queries it will take to find it.  Hmmm...
recursive ascent, I suppose.

Many thanks
Jack

On Wed, Mar 27, 2013 at 6:52 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:

Hi Jack,

I don't fully understand the exact taxonomy structure and your needs,
but in terms of reducing the number of HTTP round trips, you can do it
by writing a custom SearchComponent that, upon getting the initial
request, does everything locally, meaning that it talks to the
local/specified shard before returning to the caller.  In a SolrCloud
setup with N shards, each of these N shards could be queried in such a
way in parallel, running the query/queries on its local shard.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Wed, Mar 27, 2013 at 3:11 PM, Jack Park jackp...@topicquests.org wrote:

Hi Otis,

I fully expect to grow to SolrCloud -- many shards. For now, it's
solo. But, my thinking relates to cloud. I look for ways to reduce the
number of HTTP round trips through SolrJ. Maybe you have some ideas?

Thanks
Jack

On Wed, Mar 27, 2013 at 10:04 AM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:

Hi Jack,

Is this really about HTTP and Solr vs. SolrCloud or more whether
Solr(Cloud) is the right tool for the job and if so how to structure
the schema and queries to make such lookups efficient?

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Wed, Mar 27, 2013 at 12:53 PM, Jack Park jackp...@topicquests.org wrote:

This is a question about isA?

We want to know if M isA B, i.e. isA?(M, B).

For some M, one might be able to look into M to see its type or which
class(es) it is a subClass of. We're talking taxonomic queries
now.
But, for some M, one might need to ripple up the transitive closure,
looking at all the super classes, etc, recursively.

It seems unreasonable to do that over HTTP; it seems more reasonable
to grab a core and write a custom isA query handler. But, how do you
do that in a SolrCloud?

Really curious...

Many thanks in advance for ideas.
Jack







Re: SOLR 4.2 SNAPSHOT There exists no core with name x

2013-03-04 Thread Jens Grivolla

On 03/01/2013 07:46 PM, Neal Ensor wrote:

Again, it appears to work on Safari fine hitting the same container,
so must be something Chrome-specific (perhaps something I have
disabled?)


This sounds like it might just be a browser cache issue (if you used 
Chrome to access the same URL previously with the old Solr version 
installed). It might just not be refreshing everything.


Jens



Re: SOLR 4.2 SNAPSHOT There exists no core with name x

2013-03-04 Thread Jens Grivolla
Yes, we've had quite a few surprises with outdated information (and 
mixtures of old and new information) in the admin UI, so I'd definitely 
be in favor of getting rid of caching.


Jens

On 03/04/2013 04:03 PM, Stefan Matheis wrote:

Thanks Jens! Didn't think about caching .. :/

Perhaps we should change the requests, in line with 
https://issues.apache.org/jira/browse/SOLR-4311, to avoid any caching in the UI? 
It may result in a few more (real) requests, but I guess that would be okay?

Stefan


On Monday, March 4, 2013 at 2:21 PM, Neal Ensor wrote:


Actually, just updated Chrome this morning, and it all appears to
work. Flushed cache as well, so could be part of that. All's well
that ends well I suppose.

neal

On Mon, Mar 4, 2013 at 4:44 AM, Jens Grivolla j+...@grivolla.net wrote:

On 03/01/2013 07:46 PM, Neal Ensor wrote:


Again, it appears to work on Safari fine hitting the same container,
so must be something Chrome-specific (perhaps something I have
disabled?)





This sounds like it might just be a browser cache issue (if you used Chrome
to access the same URL previously with the old Solr version installed). It
might just not be refreshing everything.

Jens









Re: configuring schema to match database

2013-01-14 Thread Jens Grivolla

On 01/11/2013 06:14 PM, Gora Mohanty wrote:

On 11 January 2013 22:30, Jens Grivolla j+...@grivolla.net wrote:
[...]

Actually, that is what you would get when doing a join in an RDBMS, the 
cross-product of your tables. This is NOT AT ALL what you typically do in Solr.

Best start the other way around, think of Solr as a retrieval system, not a 
storage system. What are your queries? What do you want to find, and what 
criteria do you use to search for it?

[...]

Um, he did describe his desired queries, and there was a reason
that I proposed the above schema design.


He said he wants queries such as "users who have taken courseA and are 
fluent in english", which is exactly one case I was describing.



UserA has taken courseA, courseB and courseC, and has writingskill good /
verbalskill good for english, and writingskill excellent / verbalskill
excellent for spanish. UserB has taken courseA, courseF, courseG and
courseH, and has writingskill fluent / verbalskill fluent for english,
and writingskill good / verbalskill good for italian.


Unless the index is becoming huge, I feel that it is better to
flatten everything out rather than combine fields, and
post-process the results.


Then please show me the query to find users that are fluent in spanish 
and english. Bonus points if you manage to not retrieve the same user 
several times. (Hint: your schema stores only one language skill per row.)


Regards,
Jens



Re: configuring schema to match database

2013-01-14 Thread Jens Grivolla

On 01/14/2013 12:50 PM, Gora Mohanty wrote:

On 14 January 2013 16:59, Jens Grivolla j+...@grivolla.net wrote:
[...]

Then please show me the query to find users that are fluent in spanish and
english. Bonus points if you manage to not retrieve the same user several
times. (Hint: your schema stores only one language skill per row.)


Doh! You are right, of course. Brainfart from my side.


Ok, I was starting to wonder if I was the one missing something. 
Re-reading what I wrote, I see I may have sounded a bit rude; that was 
not my intention, sorry.


Best,
Jens




Re: configuring schema to match database

2013-01-11 Thread Jens Grivolla

On 01/11/2013 05:23 PM, Gora Mohanty wrote:

You are still thinking of Solr as a RDBMS, where you should not
be. In your case, it is easiest to flatten out the data. This increases
the size of the index, but that should not really be of concern. As
your courses and languages tables are connected only to user, the
schema that I described earlier should suffice. To extend my
earlier example, given:
* userA with courses c1, c2, c3, and languages l1, l2
* userB with c2, c3, and l2
you should flatten it such that you get the following Solr documents
userA <c1 name> <c1 startdate> ... <l1> <l1 writing skill> ...
userA <c1 name> <c1 startdate> ... <l2> <l2 writing skill> ...
userA <c2 name> <c2 startdate> ... <l1> <l1 writing skill> ...

userB <c2 name> <c2 startdate> ... <l2> <l2 writing skill> ...
userB <c3 name> <c3 startdate> ... <l2> <l2 writing skill> ...
i.e., a total of 3 courses x 2 languages = 6 documents for
userA, and 2 courses x 1 language = 2 documents for userB


Actually, that is what you would get when doing a join in an RDBMS, the 
cross-product of your tables. This is NOT AT ALL what you typically do 
in Solr.


Best start the other way around, think of Solr as a retrieval system, 
not a storage system. What are your queries? What do you want to find, 
and what criteria do you use to search for it?


If your intention is to find users that match certain criteria, each 
entry should be a user (with ALL associated information, e.g. all 
courses, all language skills, etc.), if you want to retrieve courses, 
each entry should be a course.


Let's say you want to find users who have certain language skills; you 
would have a schema that describes a user:

- user id
- user name
- languages
- ...

In languages, you could store tokens like en|reading|high or 
es|writing|low, etc. It could be a multivalued field, or just have 
everything separated by spaces with a tokenizer that splits on whitespace.


Now you can query:

- language:es* -- return all users with some spanish skills
- language:en|writing|high -- return all users with high english writing skills
- +(language:es* language:fr*) +language:en|writing|high -- return users with high english writing skills and some knowledge of french or spanish


If you want to avoid wildcard queries (which are more costly) you can just 
add the plain tokens en, es, etc. to your field, so language:es will match 
anybody with spanish skills.
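
As a concrete sketch (Python with requests; the core URL, the multivalued 
language field and the example skill tokens are all made up for 
illustration):

import json
import requests

SOLR = "http://localhost:8983/solr/users"  # hypothetical core

# One token per skill, plus the bare language code so that
# language:es matches without needing a wildcard query.
users = [
    {"id": "userA", "language": ["en", "en|writing|good", "en|verbal|good",
                                 "es", "es|writing|high", "es|verbal|high"]},
    {"id": "userB", "language": ["en", "en|writing|high",
                                 "it", "it|writing|good"]},
]
requests.post(SOLR + "/update?commit=true", data=json.dumps(users),
              headers={"Content-Type": "application/json"})

# All users with high english writing skills:
resp = requests.get(SOLR + "/select", params={
    "q": "language:en|writing|high", "wt": "json"}).json()
print([d["id"] for d in resp["response"]["docs"]])  # -> ['userB']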


Best,
Jens



Re: Multicore solr

2012-05-23 Thread Jens Grivolla

So are you even doing text search in Solr at all, or just using it as a
key-value store?

If the latter, do you have your schema configured so
that only the search_id field is indexed (with a keyword tokenizer) and 
everything else only stored? Also, are you sure that Solr is the best 
option as a key-value store?


Jens

On 05/23/2012 04:34 AM, Amit Jha wrote:

Hi,

Thanks for your advice. It is basically a meta-search application.
Users can perform a search on N data sources at a time. We broadcast a
parallel search to each selected data source and write the data to Solr
using a custom-built API (the API and Solr are deployed on separate
machines; the API's job is to perform the parallel search and write the
data to Solr). The API tells the application that some results are
available, and the application then fires a search query to display the
results (the query would be q=unique_search_id). Meanwhile the API keeps
writing data to Solr, and the user can search again to view all results.

In the current scenario we are using a single Solr server on which we
perform real-time indexing and search. Performing these operations on a
single Solr instance makes the process slow as the index size increases.

So we are planning to use multi-core Solr, where each user will have
their own core. All cores will have the same schema.

Please suggest if this approach has any issues.

Rgds AJ

On 22-May-2012, at 20:14, Sohail Aboobaker sabooba...@gmail.com
wrote:


It would help if you provided your use case. What are you indexing
for each user, and why would you need a separate core for indexing
each user? How do you decide the schema for each user? It might be
best to describe your use case and desired results. People on the
list will be able to advise on the best approach.

Sohail







Re: Wildcard-Search Solr 3.5.0

2012-05-23 Thread Jens Grivolla
Maybe a filter like ISOLatin1AccentFilter that doesn't get applied when 
using wildcards? How do the terms actually appear in the index?


Jens

On 05/23/2012 01:19 PM, spr...@gmx.eu wrote:

No one has an idea?

Thx.



The text may contain "FooBar".

When I do a wildcard search like this: Foo* - no hits.
When I do a wildcard search like this: foo* - doc is found.


Please see http://wiki.apache.org/solr/MultitermQueryAnalysis



Well, it works in 3.6. With one exception: if I use German umlauts it does
not work anymore.

Text: Bär

Bä* - no hits
Bär - hits

What can I do in this case?

Thank you







Re: mysolr python client

2011-12-01 Thread Jens Grivolla

On 11/30/2011 05:40 PM, Marco Martinez wrote:

For anyone interested, recently I've been using a new Solr client for
Python. It's easy and pretty well documented. If you're interested, its site
is: http://mysolr.redtuna.org/


Do you know what advantages it has over pysolr or solrpy? On the page it 
only says that mysolr "was born to be a fast and easy-to-use client for 
Apache Solr's API" and "because existing Python clients didn't fulfill 
these conditions".


Thanks,
Jens



Re: MoreLikeThis and two field in mlt.fl

2010-11-25 Thread Jens Grivolla

On 11/25/2010 10:06 AM, Damien Fontaine wrote:

I have a problem with MoreLikeThis on Solr 1.4.1. I can't put two field
on mlt.fl.
Example : text and title, only text is in interestingTerms


It should work. My guess is that the terms from the title simply don't 
make the cut due to mlt.mintf, which is often set so that only terms 
appearing multiple times are considered.
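
A quick way to check is to lower mlt.mintf and look at the interesting 
terms. A sketch with Python and requests (the handler path and document 
id are made up):

import requests

# MoreLikeThisHandler on a hypothetical core
MLT_URL = "http://localhost:8983/solr/mlt"

resp = requests.get(MLT_URL, params={
    "q": "id:some-doc",                 # the source document
    "mlt.fl": "text,title",             # consider terms from both fields
    "mlt.mintf": 1,                     # keep terms that appear only once
    "mlt.interestingTerms": "details",  # return the terms with their boosts
    "wt": "json",
}).json()

# If title terms show up here now, mlt.mintf was filtering them out.
print(resp["interestingTerms"])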


HTH,
Jens