Re: Fuzzy searching documents over multiple fields using Solr

2013-05-09 Thread Geert-Jan Brits
I didn't mention it but I'd like individual fields to contribute to the
overall score on a continuum instead of 1 (match) and 0 (no match), which
will lead to more fine-grained scoring.

A contrived example: all other things equal a tv of 40 inch should score
higher than a 38 inch tv when searching for a 42 inch tv.
This would be based on some distance modeling on the 'size' field (e.g.:
score(42,40) = 0.6 and score(42,38) = 0.4).
Other qualitative fields may be modeled in the same way: (e.g: restaurants
with field 'price' with values: 'budget','mid-range', 'expensive', ...)

Any way to incorporate this?
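[Editorial note: not in the original thread, but for illustration — Solr function queries can express such a continuum directly, e.g. a boost function like bf=recip(abs(sub(size,42)),1,10,10); the field name `size` and the constants here are assumptions. The Java below just mirrors that recip curve so the scoring behavior is easy to see:

```java
// Mirrors Solr's recip(abs(sub(size,target)), m, a, b) = a / (m*|size-target| + b).
// Field name "size" and the constants m=1, a=10, b=10 are illustrative assumptions.
public class SizeCloseness {

    // Returns a score in (0, 1]: highest at an exact match, decaying with distance.
    static double score(double target, double value, double m, double a, double b) {
        return a / (m * Math.abs(value - target) + b);
    }

    // Convenience overload: m=1, a=10, b=10, so score(target, target) == 1.0.
    static double score(double target, double value) {
        return score(target, value, 1.0, 10.0, 10.0);
    }

    public static void main(String[] args) {
        System.out.println(score(42, 42)); // exact match scores 1.0
        System.out.println(score(42, 40)); // closer value => higher score (~0.83)
        System.out.println(score(42, 38)); // farther value => lower score (~0.71)
    }
}
```

A 40-inch tv thus outscores a 38-inch one for a 42-inch query, on a continuum rather than 0/1.]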



2013/5/9 Jack Krupansky j...@basetechnology.com

 A simple OR boolean query will boost documents that have more matches.
 You can also selectively boost individual OR terms to control importance.
 And use AND for the required terms, like tv.
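[Editorial note: a sketch of the kind of query Jack describes — field names and boost values are illustrative, not from the thread. The + makes the tv term required, while the optional, individually boosted size terms differentiate the score:

```
q=+type:tv (size:42^2.0 size:40^1.5 size:38)
```

Documents matching more (and more heavily boosted) optional terms rank higher.]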

 -- Jack Krupansky
 -Original Message- From: britske
 Sent: Thursday, May 09, 2013 11:21 AM
 To: solr-user@lucene.apache.org
 Subject: Fuzzy searching documents over multiple fields using Solr


 Not sure if this has ever come up (or perhaps even implemented without me
 knowing) , but I'm interested in doing Fuzzy search over multiple fields
 using Solr.

 What I mean is the ability to return documents based on some 'distance
 calculation' without documents having to match 100% to the query.

 Usecase: a user is searching for a tv with a couple of filters selected. No
 tv matches all filters. How to come up with a bunch of suggestions that
 match the selected filters as closely as possible? The hard part is to
 determine what 'closely' means in this context, etc.

 This relates to (approximate) nearest neighbor, Kd-trees, etc. Has anyone
 ever tried to do something similar? any plugins, etc? or reasons
 Solr/Lucene
 would/wouldn't be the correct system to build on?

 Thanks



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Fuzzy-searching-documents-over-multiple-fields-using-Solr-tp4061867.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: modeling prices based on daterange using multipoints

2012-12-12 Thread Geert-Jan Brits
2012/12/12 David Smiley (@MITRE.org) dsmi...@mitre.org

 britske wrote
  Hi David,
 
  Yeah interesting (as well as problematic as far is implementing) use-case
  indeed :)
 
  1. You mention there are no special caches / memory requirements inherent
  in this. For a given user-query this would mean all hotels would have to
  search for all point.x each time, right? What would be a good plugin-point
  to build in some custom cached filter code for this (perhaps using the Solr
  Filter cache)? As I see it, determining all hotels that have a particular
  point.x value is probably: A) pretty costly to do on each user query; B)
  static and can be cached easily without a lot of memory (relatively
  speaking), i.e.: 20.000 filters (representing all of the 20.000 different
  point.x, that is, <date, duration, nr persons, roomtype> combos) with a
  bitset per filter representing ids of hotels that have the said point.x.

 I think you're over-thinking the complexity of this query.  I bet it's
 faster than you think and even then putting this in a filter query 'fq' is
 going to be cached by Solr any way, making it lightning fast at subsequent
 queries.


Ah! Didn't realize such a spatial query could be dropped in a FQ. Nice,
that solves this part indeed.


  britske wrote
  2. I'm not sure I explained C. (sorting) well, since I believe you're
  talking about implementing custom code to sort multiple point.y's per
  hotel, correct? That's not what I need. Instead, for every user-query at
  most 1 point ever matches. I.e.: a hotel has a price for a particular
  <date, duration, nrpersons, roomtype>-combo (P.x) or it hasn't.
 
  Say a user queries for the <date, duration, nrpersons, roomtype>-combo:
  21 dec 2012, 3 days, 2 persons, double. This might be encoded into a
  value, say: 12345.
  Now, for the hotels that do match that query (i.e.: those hotels that have
  a point P for which P.x=12345) I want to sort those hotels on P.y (the
  price for the requested P.x)

 Ah; ok.  But still, my first suggestion is still what I think you could do
 except that the algorithm is simpler -- return the first matching 'y' in
 the
 document where the point matches the query.  Alternatively, if you're
 confident the number of matching documents (hotels) is going to be
 small-ish, say less than a couple hundred, then you could simply sort it
 client-side.  You'd have to get back all the values, or maybe write a
 DocTransformer to find the specific one.

 ~ David


Writing something similar to ShapeFieldCacheDistanceValueSource, being a
valueSource, would enable me to expose it by name to the frontend?
What I'm saying is: let's say I want to call this implementation
'pricesort' and chain it with other sorts, like: 'sort=pricesort asc,
popularity desc, name asc'. Or use it by name in a functionquery. That
would be possible right?
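[Editorial note: for reference, a custom ValueSource is typically exposed to the sort/function-query syntax by registering a ValueSourceParser under a name in solrconfig.xml; the class name below is hypothetical:

```
<!-- solrconfig.xml: class name is hypothetical -->
<valueSourceParser name="pricesort"
                   class="com.example.PriceSortValueSourceParser"/>
```

after which something like sort=pricesort() asc, popularity desc, name asc should be accepted by the sort-by-function syntax (Solr 3.1+), and pricesort() becomes usable inside other function queries.]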

Geert-Jan



 -
  Author:
 http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/modeling-prices-based-on-daterange-using-multipoints-tp4026011p4026256.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: social123 Data Appending Service

2012-01-26 Thread Geert-Jan Brits
No thanks, not sure which site you're talking about btw.
But anyway, no thanks


On 26 January 2012 at 19:41, Aaron Biddar aaron.bid...@social123.com wrote:

 Hi there-

 I was on your site today and was not sure who to reach out to.  My Company,
 Social123, provides Social Data Appending for companies that provide
 lists.  In a nutshell, we add Facebook, LinkedIn and Twitter contact
 information to your current lists. Its a great way to easily offer a new
 service or add on to your current offerings.  Providing social media
 contact information to your customers will allow them to interact with
 their customers on a whole new level.

 If you are the right person to speak with, please let me know your
 availability for a quick 5-minute demo or check out our tour at
 www.social123.com.  If you are not the right person, would you mind
 passing
 this e-mail along?

 Thanks in advance.

 --
 Aaron Biddar
 Founder, CEO
 aaron.bid...@social123.com
 www.social123.com
 78 Alexander St. #K  Charleston SC 29403
 M  678 925 3556   P 800.505.7295 ex101



Re: multiple dateranges/timeslots per doc: modeling openinghours.

2011-10-03 Thread Geert-Jan Brits
Interesting! Reading your previous blogposts, I gather that the to-be-posted
'implementation approaches' post includes a way of making the SpanQueries
available within Solr?
Also, would (numeric) RangeQueries be possible with your approach, as Hoss
suggests?

Looking forward to that 'implementation post'
Cheers,
Geert-Jan

On 1 October 2011 at 19:57, Mikhail Khludnev mkhlud...@griddynamics.com
wrote:

 I agree about SpanQueries. It's a viable measure against false-positive
 matches on multivalue fields.
  we've implemented this approach some time ago. Pls find details at

 http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html

 and

 http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html
 we are going to publish a third post, about implementation approaches.

 --
 Mikhail Khludnev


 On Sat, Oct 1, 2011 at 6:25 AM, Chris Hostetter hossman_luc...@fucit.org
 wrote:

 
  : Another, faulty, option would be to model opening/closing hours in 2
  : multivalued date-fields, i.e: open, close. and insert open/close for
 each
  : day, e.g:
  :
  : open: 2011-11-08:1800 - close: 2011-11-09:0300
  : open: 2011-11-09:1700 - close: 2011-11-10:0500
  : open: 2011-11-10:1700 - close: 2011-11-11:0300
  :
  : And queries would be of the form:
  :
  : 'open < now AND close > now+3h'
  :
  : But since there is no way to indicate that 'open' and 'close' are
  pairwise
  : related I will get a lot of false positives, e.g the above document
 would
  be
  : returned for:
 
  This isn't possible out of the box, but the general idea of position
  linked queries is possible using the same approach as the
  FieldMaskingSpanQuery...
 
 
 
 https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html
  https://issues.apache.org/jira/browse/LUCENE-1494
 
  ..implementing something like this that would work with
  (Numeric)RangeQueries however would require some additional work, but it
  should certainly be doable -- i've suggested this before but no one has
  taken me up on it...
  http://markmail.org/search/?q=hoss+FieldMaskingSpanQuery
 
  If we take it as a given that you can do multiple ranges at the same
  position, then you can imagine supporting all of your regular hours
  using just two fields (open and close) by encoding the day+time of
  each range of open hours into them -- even if a store is open for
 multiple
  sets of ranges per day (ie: closed for siesta)...
 
   open: mon_12_30, tue_12_30, wed_07_30, wed_3_30, ...
   close: mon_20_00, tue_20_30, wed_12_30, wed_22_30, ...
 
  then asking for stores open now and for the next 3 hours on wed at
  2:13PM becomes a query for...
 
  sameposition(open:[* TO wed_14_13], close:[wed_17_13 TO *])
 
  For the special case part of your problem when there are certain dates
  that a store will be open atypical hours, i *think* that could be solved
  using some special docs and the new join QParser in a filter query...
 
 https://wiki.apache.org/solr/Join
 
  imagine you have your regular docs with all the normal data about a
  store, and the open/close fields i describe above.  but in addition to
  those, for any store that you know is closed on dec 25 or only open
  12:00-15:00 on Jan 01 you add an additional small doc encapsulating
  the information about the stores closures on that special date - so that
  each special case would be it's own doc, even if one store had 5 days
  where there was a special case...
 
   specialdoc1:
 store_id: 42
 special_date: Dec-25
 status: closed
   specialdoc2:
 store_id: 42
 special_date: Jan-01
 status: irregular
 open: 09_30
 close: 13_00
 
  then when you are executing your query, you use an fq to constrain to
  stores that are (normally) open right now (like i mentioned above) and
 you
  use another fq to find all docs *except* those resulting from a join
  against these special case docs based on the current date.
 
  so if your query is open now and for the next 3 hours and now ==
  sunday, 2011-12-25 @ 10:17AM your query would be something like...
 
  q=...user input...
  time=sameposition(open:[* TO sun_10_17], close:[sun_13_17 TO *])
  fq={!v=time}
  fq={!join from=store_id to=unique_key v=$vv}
  vv=-(+special_date:Dec-25 +(status:closed OR _query_:{v=$time}))
 
  That join based approach for dealing with the special dates should work
  regardless of whether someone implements a way to do pair wise
  sameposition() rangequeries ... so if you can live w/o the multiple
  open/close pairs per day, you can just use the one field per day of the
  week type approach you mentioned combined with the join for special
  case days of the year and everything you need should already work w/o any
  code (on trunk).
 
  (disclaimer: obviously i haven't tested that query, the exact syntax may
  be off but the principle for modeling the special docs and using
  them in a join should work)
 
 
  -Hoss
 




Re: multiple dateranges/timeslots per doc: modeling openinghours.

2011-10-03 Thread Geert-Jan Brits
Thanks Hoss for that in-depth walkthrough.

I like your solution of using (something akin to) FieldMaskingSpanQuery
(https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html).
Conceptually
the Join-approach looks like it would work on paper, although I'm not a
big fan of introducing a lot of complexity to the frontend / querying part
of the solution.

As an alternative, what about using your fieldMaskingSpanQuery-approach
solely (without the JOIN-approach)  and encode open/close on a per day
basis?
I didn't mention it, but I 'only' need 100 days of data, which would lead to
100 open and 100 close values, not counting the pois with multiple
openinghours per day which are pretty rare.
The index is rebuilt each night, refreshing the date-data.

I'm not sure what the performance implications would be like, but somehow
that feels doable. Perhaps it even offsets the extra time needed for doing
the Joins, only 1 way to find out I guess.
Disadvantage would be fewer cache-hits when using FQ.

Data then becomes:

open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...

Notice the 20111021_26_30, which indicates closing at 2:30 AM the next day,
which would work (in contrast to encoding it as 20111022_02_30)
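[Editorial note: a minimal sketch of this encoding, assuming the yyyyMMdd_HH_mm shape from the example above. Closing times after midnight get hours >= 24 on the opening date, so open/close values stay pairwise comparable as plain strings. The class and method names are mine, not from the thread:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Encodes open/close slots as sortable yyyyMMdd_HH_mm strings, per the example
// in the mail (e.g. 20111021_26_30 == 02:30 on the next day, kept on the
// opening date so string order matches chronological order within a pair).
public class HoursEncoder {

    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // hour may exceed 23 for a close time that falls on the following day.
    static String encode(LocalDate openingDay, int hour, int minute) {
        return String.format("%s_%02d_%02d", openingDay.format(DAY), hour, minute);
    }

    // Convenience: a close time of 'hour:minute' on the day after openingDay
    // is encoded by adding 24 to the hour (02:30 -> 20111021_26_30).
    static String encodeCloseNextDay(LocalDate openingDay, int hour, int minute) {
        return encode(openingDay, hour + 24, minute);
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2011, 10, 21);
        System.out.println(encode(day, 12, 30));            // 20111021_12_30
        System.out.println(encodeCloseNextDay(day, 2, 30)); // 20111021_26_30
    }
}
```

Because the open value lexically precedes its close value, range queries over the two fields behave consistently for after-midnight closings.]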

Alternatively, how would you compare your suggested approach with the
approach by David Smiley using either SOLR-2155 (Geohash prefix query
filter) or LSP:
https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244.
That would work right now, and the LSP-approach seems pretty elegant to me.
FQ-style caching is probably not possible though.

Geert-Jan

On 1 October 2011 at 04:25, Chris Hostetter hossman_luc...@fucit.org wrote:


 : Another, faulty, option would be to model opening/closing hours in 2
 : multivalued date-fields, i.e: open, close. and insert open/close for each
 : day, e.g:
 :
 : open: 2011-11-08:1800 - close: 2011-11-09:0300
 : open: 2011-11-09:1700 - close: 2011-11-10:0500
 : open: 2011-11-10:1700 - close: 2011-11-11:0300
 :
 : And queries would be of the form:
 :
 : 'open < now AND close > now+3h'
 :
 : But since there is no way to indicate that 'open' and 'close' are
 pairwise
 : related I will get a lot of false positives, e.g the above document would
 be
 : returned for:

 This isn't possible out of the box, but the general idea of position
 linked queries is possible using the same approach as the
 FieldMaskingSpanQuery...


 https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html
 https://issues.apache.org/jira/browse/LUCENE-1494

 ..implementing something like this that would work with
 (Numeric)RangeQueries however would require some additional work, but it
 should certainly be doable -- i've suggested this before but no one has
 taken me up on it...
 http://markmail.org/search/?q=hoss+FieldMaskingSpanQuery

 If we take it as a given that you can do multiple ranges at the same
 position, then you can imagine supporting all of your regular hours
 using just two fields (open and close) by encoding the day+time of
 each range of open hours into them -- even if a store is open for multiple
 sets of ranges per day (ie: closed for siesta)...

  open: mon_12_30, tue_12_30, wed_07_30, wed_3_30, ...
  close: mon_20_00, tue_20_30, wed_12_30, wed_22_30, ...

 then asking for stores open now and for the next 3 hours on wed at
 2:13PM becomes a query for...

 sameposition(open:[* TO wed_14_13], close:[wed_17_13 TO *])

 For the special case part of your problem when there are certain dates
 that a store will be open atypical hours, i *think* that could be solved
 using some special docs and the new join QParser in a filter query...

https://wiki.apache.org/solr/Join

 imagine you have your regular docs with all the normal data about a
 store, and the open/close fields i describe above.  but in addition to
 those, for any store that you know is closed on dec 25 or only open
 12:00-15:00 on Jan 01 you add an additional small doc encapsulating
 the information about the stores closures on that special date - so that
 each special case would be it's own doc, even if one store had 5 days
 where there was a special case...

  specialdoc1:
store_id: 42
special_date: Dec-25
status: closed
  specialdoc2:
store_id: 42
special_date: Jan-01
status: irregular
open: 09_30
close: 13_00

 then when you are executing your query, you use an fq to constrain to
 stores that are (normally) open right now (like i mentioned above) and you
 use another fq to find all docs *except* those resulting from a join
 against these special case docs based on the current date.

 so if your query is open now and for the next 3 hours and now ==
 sunday, 2011-12-25 @ 10:17AM your query would be something like...

 q=...user input...
 time=sameposition(open:[* TO sun_10_17], close:[sun_13_17 TO *])
 fq={!v=time}
 

Re: copyField destination does not exist

2011-03-28 Thread Geert-Jan Brits
The error is saying you have a copyField directive in schema.xml that wants
to copy the value of a field to the destination field 'text', which doesn't
exist (which indeed is the case given your supplied fields). Search your
schema.xml for 'copyField'. There's probably something configured related to
copyField functionality that you don't want. Perhaps you uncommented the
copyField portion of schema.xml by accident?
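[Editorial note: for reference, the offending directive will have roughly this shape — the source field name here is illustrative:

```
<!-- schema.xml: copies 'phrase' into a catch-all 'text' field; this fails
     at startup if no <field name="text" .../> is declared -->
<copyField source="phrase" dest="text"/>
```

Either remove the directive or declare the destination field.]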

hth,
Geert-Jan

2011/3/28 Merlin Morgenstern merli...@fastmail.fm

 Hi there,

 I am trying to get solr indexing mysql tables. Seems like I have
 misconfigured schema.xml:

 HTTP ERROR: 500

 Severe errors in solr configuration.

 -
 org.apache.solr.common.SolrException: copyField destination :'text' does
 not exist
at

  org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:685)


 My config looks like this:

  <fields>
    <field name="id" type="string" indexed="true" stored="true"
           required="true"/>
    <field name="phrase" type="text" indexed="true" stored="true"
           required="true"/>
    <field name="country" type="text" indexed="true" stored="true"
           required="true"/>
  </fields>

  <uniqueKey>id</uniqueKey>
  <!-- field for the QueryParser to use when an explicit fieldname is
       absent -->
  <defaultSearchField>phrase</defaultSearchField>


 What is wrong within this config? The type should be OK.

 --
 http://www.fastmail.fm - Choose from over 50 domains or use your own




Re: working with collection : Where is default schema.xml

2011-03-22 Thread Geert-Jan Brits
Changing the default schema.xml to what you want is the way to go for most
of us.
It's a good learning experience as well, since it contains a lot of
documentation about the options that may be of interest to you.

Cheers,
Geert-Jan

2011/3/22 geag34 sac@gmail.com

 Ok thank.

 It is my fault. I have created collection with a lucidimagination perl
 script.

 I will erase the schema.xml.

 Thanks

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/working-with-collection-Where-is-default-schema-xml-tp2700455p2712496.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Adding the suggest component

2011-03-18 Thread Geert-Jan Brits
 2011-03-18 14:11:02.284:INFO::Started SocketConnector@0.0.0.0:8983
Solr started on port 8983

instead of this:
 http://localhost/solr/admin/

try this instead:
http://localhost:8983/solr/admin/

Cheers,
Geert-Jan



2011/3/18 Brian Lamb brian.l...@journalexperts.com

 That does seem like a better solution. I downloaded a recent version and
 there were the following files/folders:

 build.xml
 dev-tools
 LICENSE.txt
 lucene
 NOTICE.txt
 README.txt
 solr

 So I did cp -r solr/* /path/to/solr/stuff/ and started solr. I didn't get
 any error message but I only got the following messages:

 2011-03-18 14:11:02.016:INFO::Logging to STDERR via
 org.mortbay.log.StdErrLog
 2011-03-18 14:11:02.240:INFO::jetty-6.1-SNAPSHOT
 2011-03-18 14:11:02.284:INFO::Started SocketConnector@0.0.0.0:8983

 Where as before I got a bunch of messages indicating various libraries had
 been loaded. Additionally, when I go to http://localhost/solr/admin/, I
 get
 the following message:

 HTTP ERROR: 404

 Problem accessing /solr/admin. Reason:

NOT_FOUND

 What did I do incorrectly?

 Thanks,

 Brian Lamb


 On Fri, Mar 18, 2011 at 9:04 AM, Erick Erickson erickerick...@gmail.com
 wrote:

  What do you mean you copied the contents...to the right place? If you
  checked out trunk and copied the files into 1.4.1, you have mixed source
  files between disparate versions. All bets are off.
 
  Or do you mean jar files? or???
 
  I'd build the source you checked out (at the Solr level) and use that
  rather
  than try to mix-n-match.
 
  BTW, if you're just starting (as in not in production), you may want to
  consider
  using 3.1, as it's being released even as we speak and has many
  improvements
  over 1.4. You can get a nightly build from here:
  https://builds.apache.org/hudson/view/S-Z/view/Solr/
 
  Best
  Erick
 
  On Thu, Mar 17, 2011 at 3:36 PM, Brian Lamb
  brian.l...@journalexperts.com wrote:
   Hi all,
  
   When I installed Solr, I downloaded the most recent version (1.4.1) I
   believe. I wanted to implement the Suggester (
   http://wiki.apache.org/solr/Suggester). I copied and pasted the
  information
   there into my solrconfig.xml file but I'm getting the following error:
  
   Error loading class 'org.apache.solr.spelling.suggest.Suggester'
  
   I read up on this error and found that I needed to checkout a newer
  version
   from SVN. I checked out a full version and copied the contents of
   src/java/org/apache/spelling/suggest to the same location on my set up.
   However, I am still receiving this error.
  
   Did I not put the files in the right place? What am I doing
 incorrectly?
  
   Thanks,
  
   Brian Lamb
  
 



Re: Solr query POST and not in GET

2011-03-15 Thread Geert-Jan Brits
Yes it's possible.
Assuming your using SolrJ as a client-library:

set:
QueryRequest req = new QueryRequest(params); // params = your SolrQuery/SolrParams
req.setMethod(SolrRequest.METHOD.POST);

Any other client-library should have a similar method.
hth,
Geert-Jan


2011/3/15 Gastone Penzo gastone.pe...@gmail.com

 Hi,
 is possible to change Solr sending query method from get to post?
 because my query has a lot of OR..OR..OR and the log says to me Request URI
 too large
 Where can i change it??
 thanx




 --
 Gastone Penzo

 www.solr-italia.it
 The first italian blog about SOLR



Re: Solr Query

2011-03-15 Thread Geert-Jan Brits
 But it returns all results with MSRP = 1 and doesn't consider the 2nd query
at all.

I believe you mean: 'it returns all results with RetailPriceCodeID = 1 while
ignoring the 2nd query?'

If so, please check that your default operator is set to AND in your schema
config.
Other than that, your syntax seems correct.
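[Editorial note: for reference, the default operator is set in schema.xml — the stock schema ships with it set to OR, which makes space-separated clauses optional rather than required:

```
<!-- schema.xml -->
<solrQueryParser defaultOperator="AND"/>
```

With explicit + on both clauses, as in the query above, both should be required regardless of this setting.]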

Hth,
Geert-Jan


2011/3/15 Vishal Patel lin...@gmail.com

 I am a bit new for Solr.

 I am running below query in query browser admin interface

 +RetailPriceCodeID:1 +MSRP:[16001.00 TO 32000.00]

 I think it should return only results with RetailPriceCode = 1 and MSRP
 between 16001 and 32000.

 But it returns all results with MSRP = 1 and doesn't consider the 2nd query
 at all.

 Am i doing something wrong here? Please help



Re: Solr and Permissions

2011-03-12 Thread Geert-Jan Brits
Ahh yes, sorry about that. I assumed ExternalFileField would work for
filtering as well. Note to self: never assume
Geert-Jan

2011/3/12 Koji Sekiguchi k...@r.email.ne.jp

 (11/03/12 10:28), go canal wrote:

 Looking at the API doc, it seems that only floating value is currently
 supported, is it true?


 Right. And it is just for changing score by using float values in the file,
 so it cannot be used for filtering.

 Koji
 --
 http://www.rondhuit.com/en/



Re: Getting Category ID (primary key)

2011-03-11 Thread Geert-Jan Brits
If it works, is performant and not too messy, it's a good way :-). You can
also consider just faceting on id, and using the id to fetch the category
name through sql / nosql.
That way your logic is separated from your presentation, which makes
extending (think internationalizing, etc.) easier. Not sure if that's
appropriate for your 'category' field, but anyway.

I believe you were asking this because you already had 2 multivalued fields,
'id' and 'category', which you wanted to reuse for this particular use-case.
In short: you can't link a particular value in a multivalued field (e.g.:
'id') to a particular value in another multivalued field (e.g.: 'category'),
so just give up this route and go with what you had, or use the suggestion
above.

hth,
Geert-Jan



2011/3/11 Prav Buz buz.p...@gmail.com

 Hi,
 Thanks Erik, yes that's what I've done for now, but was wondering if it's
 the best way :)

 thanks

 Praveen

 On Fri, Mar 11, 2011 at 6:06 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Thinking out loud here, but would it work to just have ugly
  categories? Instead of splitting them up, just encode them like
  1|a
  2|b
  3|c
 
  or some such. Then split them  back up again and display
  the name to the user and use the ID in the URL
 
  Best
  Erick
 
  On Fri, Mar 11, 2011 at 4:17 AM, Prav Buz buz.p...@gmail.com wrote:
   Hi,
  
   Yes I already have different fields for category and category Id , and
  they
   are in same order when retrieved from solr
  
   for eg:
   IDs
   1
   3
   4
   5
   names
   a
   b
   c
   d
   e
  
   id 1 is of name a and id 5 is of name e. but when I sort the category
  names
   , looses this order as they are not related in any manner in the solr
  docs.
  
  
   Thanks
  
   Praveen
  
   On Fri, Mar 11, 2011 at 2:35 PM, Gora Mohanty g...@mimirtech.com
  wrote:
  
   On Fri, Mar 11, 2011 at 2:32 PM, Prav Buz buz.p...@gmail.com wrote:
   [...]
I need to show a facets on Category and then I need the category id
 in
   the
href link. For this what I 'm trying to do is create a field which
  will
store ID|Category in the schema and split it in the UI.
Also I have Category and category id 's indexed .
   [...]
  
   Why not have two different fields for category, and for category ID?
  
   Regards,
   Gora
  
  
 



Re: Solr and Permissions

2011-03-11 Thread Geert-Jan Brits
About the 'having to reindex when permissions change'-problem:

have a look at ExternalFileField
(http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html),
which enables you to reload a file without having to reindex all the
documents.

Thinking out loud: multivalued field 'roles' of type ExternalFileField.
- assign each person 1 or multiple roles.
- each document has multiple roles assigned to it (which are entitled to
view it)

Not sure if it (the ExternalFileField approach) scales though.
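[Editorial note: for reference, the external file itself is a plain keyField=value listing placed in the data directory and named external_<fieldname>; the doc keys and values below are invented for illustration:

```
# data/external_roles -- one keyField=floatValue pair per line
doc1=1.0
doc2=0.0
doc3=1.0
```

Note that (as the follow-up in this thread points out) the values are floats used for scoring, which is why the filtering idea above doesn't pan out.]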

Geert-Jan


2011/3/11 Bill Bell billnb...@gmail.com

 Why not just add a security field in Solr and use fq to limit to the users
 permissions?

 Bill Bell
 Sent from mobile


 On Mar 11, 2011, at 10:27 AM, Walter Underwood wun...@wunderwood.org
 wrote:

  On Mar 10, 2011, at 10:48 PM, go canal wrote:
 
  But in the real world, any content management system needs full text
 search; so the
  question is how to support search with permission control.
 
  I have yet to see a Search Engine that provides some sort of Content
 Management
  features like we are discussing here (Solr, Elastic Search ?)
 
 
  It isn't free, but MarkLogic can do this. It is an XML database with
 security support and search. Changing permissions is an update transaction,
 not a reload. Permissions can be part of a search, just like any other
 constraint.
 
  The search is not the usual crappy search you get in a database.
 MarkLogic is built with search engine technology, so the search is fast and
 good.
 
  We do offer a community license for personal, not-for-profit use. See
 details here:
 
  http://developer.marklogic.com/licensing
 
  wunder
  --
  Walter Underwood
  Lead Engineer, MarkLogic
 



Re: Solr

2011-03-10 Thread Geert-Jan Brits
Start by reading  http://wiki.apache.org/solr/FrontPage and the provided
links (introduction, tutorial, etc. )

2011/3/10 yazhini.k vini yazhini@gmail.com

 Hi ,

 I need notes and details about Solr, because I am now working with Solr, so
 I need help.


 Regards ,

 Yazhini . K
  NCSI ,
  M.Sc ( Software Engineering ) .



Re: how would you design schema?

2011-03-09 Thread Geert-Jan Brits
Would having a solr-document represent a 'product purchase per account'
solve your problem?
You could then easily link the date of purchase to the document as well as
the account-number.

e.g:
fields: orderid (key), productid, product-characteristics,
order-characteristics (including date of purchase).

or, in case multiple products can share an orderid:
fields: concat(orderid, productid) (key), orderid, productid,
product-characteristics, order-characteristics (including date of
purchase).
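[Editorial note: a minimal sketch of the suggested schema, one document per purchase; all field names are illustrative, not prescribed by the thread:

```
<!-- schema.xml excerpt: one document per 'product purchase per account' -->
<fields>
  <field name="orderid"   type="string" indexed="true" stored="true"
         required="true"/>
  <field name="accountid" type="string" indexed="true" stored="true"/>
  <field name="productid" type="string" indexed="true" stored="true"/>
  <field name="purchased" type="date"   indexed="true" stored="true"/>
</fields>
<uniqueKey>orderid</uniqueKey>
```

A query by account and purchase window then becomes e.g.
q=accountid:1234&fq=purchased:[2011-01-01T00:00:00Z TO 2011-03-01T00:00:00Z].]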

The difference to your setup (i.e: one document per account) is that the
suggested setup above may return multiple documents when you search by
account-nr, which may or may not be what you're after.

hth,
Geert-Jan

2011/3/9 dan whelan d...@adicio.com

 Hi,

 I'm investigating how to set up a schema like this:

 I want to index accounts and the products purchased (multiValued) by that
 account but I also need the ability to search by the date the product was
 purchased.

 It would be easy if the purchase date wasn't part of the requirements.

 How would the schema be designed? Is there a better approach?

 Thanks,

 Dan




Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Geert-Jan Brits
Hi Dominique,

This looks nice.
In the past, I've been interested in (semi-)automatically inducing a
scheme/wrapper from a set of example webpages (often called 'wrapper
induction' in the scientific field).
This would allow for fast scheme-creation which could be used as a basis for
extraction.

Lately I've been looking for crawlers that incorporate this technology, but
without success.
Any plans on incorporating this?

Cheers,
Geert-Jan

2011/3/2 Dominique Bejean dominique.bej...@eolya.fr

 Rosa,

 In the pipeline, there is a stage that extract the text from the original
 document (PDF, HTML, ...).
 It is possible to plug scripts (Java 6 compliant) in order to keep only
 relevant parts of the document.
 See
 http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage

 Dominique

 On 02/03/11 09:36, Rosa (Anuncios) wrote:

  Nice job!

 It would be good to be able to extract specific data from a given page via
 XPATH though.

 Regards,


 On 02/03/2011 01:25, Dominique Bejean wrote:

 Hi,

 I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java Web
 Crawler. It includes :

   * a crawler
   * a document processing pipeline
   * a solr indexer

 The crawler has a web administration in order to manage web sites to be
 crawled. Each web site crawl is configured with a lot of possible parameters
 (not all mandatory):

   * number of simultaneous items crawled by site
   * recrawl period rules based on item type (html, PDF, …)
   * item type inclusion / exclusion rules
   * item path inclusion / exclusion / strategy rules
   * max depth
   * web site authentication
   * language
   * country
   * tags
   * collections
   * ...

 The pipeline includes various ready-to-use stages (text extraction,
 language detection, Solr ready to index xml writer, ...).

 All is very configurable and extendible either by scripting or java
 coding.

 With scripting technology, you can help the crawler to handle javascript
 links or help the pipeline to extract relevant title and cleanup the html
 pages (remove menus, header, footers, ..)

 With java coding, you can develop your own pipeline stage

 The Crawl Anywhere web site provides good explanations and screen shots.
 All is documented in a wiki.

 The current version is 1.1.4. You can download and try it out from here :
 www.crawl-anywhere.com


 Regards

 Dominique







Re: Efficient boolean query

2011-03-02 Thread Geert-Jan Brits
If you often query X as part of several other queries (e.g: X  | X AND Y |
 X AND Z)
you might consider putting X in a filter query (
http://wiki.apache.org/solr/CommonQueryParameters#fq)

leading to:
q=*:*&fq=X
q=Y&fq=X
q=Z&fq=X

Filter queries are cached separately, which means that after the first query
involving X, the document set for X should be returned quickly.
So your FIRST query will probably still be in the 'few seconds' range, but
all following queries involving X will return much quicker.
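As a rough sketch of what those three requests look like when built programmatically (plain Python; the localhost URL and handler path are assumptions, not from the thread):

```python
from urllib.parse import urlencode

# Base URL is an assumption; adjust host/core to your setup.
SOLR = "http://localhost:8983/solr/select"

def solr_url(q, fq):
    # Putting the expensive clause X in fq lets Solr cache its
    # document set in the filterCache, independently of q.
    return SOLR + "?" + urlencode({"q": q, "fq": fq})

print(solr_url("*:*", "X"))  # first request: warms the filter cache for X
print(solr_url("Y", "X"))    # reuses the cached doc set for X
print(solr_url("Z", "X"))
```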

hth,
Geert-Jan

2011/3/2 Ofer Fort ofer...@gmail.com

 Hey all,
 I have an index with a lot of documents with the term X and no documents
 with the term Y.
 If I query for X it takes a few seconds and returns the results.
 If I query for Y it takes a millisecond and returns an empty set.
 If I query for Y AND X it takes a few seconds and returns an empty set.

 I'm guessing that it evaluates both X and Y and only then tries to intersect
 them?

 Am I wrong? Is there another way to run this query more efficiently?

 thanks for any input



Re: Problem with sorting using functions.

2011-02-28 Thread Geert-Jan Brits
sort by functionquery is only available from solr 3.1 (from :
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function)


2011/2/28 John Sherwood j...@storecrowd.com

 This works:
 /select/?q=*:*&sort=price desc

 This throws a 400 error:
 /select/?q=*:*&sort=sum(1,1) desc

 Missing sort order.

 I'm using 1.4.2.  I've tried all sorts of different numbers, functions, and
 fields but nothing seems to change that error.  Any ideas?



Re: Sort Stability With Date Boosting and Rounding

2011-02-22 Thread Geert-Jan Brits
You could always use a secondary sort as a tie-breaker, i.e: something
unique like 'documentid' or something. That would ensure a stable sort.
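The effect is easy to see in miniature with plain Python (this illustrates stable tie-breaking, not Solr internals; scores and doc ids are made up):

```python
# Each hit is (score, docid). Two scores tie at 0.8.
hits = [(0.8, "doc42"), (0.9, "doc7"), (0.8, "doc13")]

# Sorting by score alone leaves the relative order of the two 0.8
# docs dependent on input order, so it can change between requests.
by_score = sorted(hits, key=lambda h: -h[0])

# Adding a unique docid as a secondary key makes the ordering
# deterministic, which keeps paging consistent.
by_score_then_id = sorted(hits, key=lambda h: (-h[0], h[1]))

print(by_score_then_id)
# [(0.9, 'doc7'), (0.8, 'doc13'), (0.8, 'doc42')]
```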

2011/2/23 Stephen Duncan Jr stephen.dun...@gmail.com

 I'm trying to use

 http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
 as
 a bf parameter to my dismax handler.  The problem is, the value of NOW can
 cause documents in a similar range (date value within a few seconds of each
 other) to sometimes round to be equal, and sometimes not, changing their
 sort order (when equal, falling back to a secondary sort).  This, in turn,
 screws up paging.

 The problem is that score is rounded to a lower level of precision than
 what
 the suggested formula produces as a difference between two values within
 seconds of each other.  It seems to me if I could round the value to
 minutes
 or hours, where the difference will be large enough to not be rounded-out,
 then I wouldn't have problems with order changing on me.  But it's not
 legal
 syntax to specify something like:
 recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)
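For intuition, here is a small Python sketch of that recip boost with made-up timestamps; it shows how boosts for documents seconds apart differ only in far decimal places (so score rounding can flip their tie status), while rounding the age down to whole hours makes them identical:

```python
def recip(x, m=3.16e-11, a=1.0, b=1.0):
    # Solr's recip(x,m,a,b) = a / (m*x + b)
    return a / (m * x + b)

HOUR = 3_600_000  # milliseconds

now = 1_298_000_000_000   # hypothetical "NOW" in epoch ms
d1 = now - 5_000          # doc dated 5 seconds ago
d2 = now - 8_000          # doc dated 8 seconds ago

# Raw boosts: both are ~1.0 and differ only around the 7th decimal.
b1, b2 = recip(now - d1), recip(now - d2)

# Rounding the age down to whole hours gives near-simultaneous
# docs exactly the same boost, so their tie is stable.
r1 = recip((now - d1) // HOUR * HOUR)
r2 = recip((now - d2) // HOUR * HOUR)

print(b1, b2, r1, r2)
```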

 Is this a problem anyone has faced and solved?  Anyone have suggested
 solutions, other than indexing a copy of the date field that's rounded to
 the hour?

 --
 Stephen Duncan Jr
 www.stephenduncanjr.com



Re: Index Not Matching

2011-02-03 Thread Geert-Jan Brits
Make sure your index is completely commited.

curl 'http://localhost:8983/solr/update?commit=true'

http://wiki.apache.org/solr/UpdateXmlMessages#A.22commit.22_and_.22optimize.22

for an overview:
http://lucene.apache.org/solr/tutorial.html

hth,
Geert-Jan
http://techgurulive.com/2010/11/22/apache-solr-commit-and-optimize/

2011/2/3 Esclusa, Will william.escl...@bonton.com

 Both the application and the SOLR gui match (with the incorrect number
 of course :-) )

  At first I thought it could be a schema problem, but we went through it
  with a fine-tooth comb and compared it to the one in our stage environment.
  What is really weird is that I grabbed one of the product IDs that are
  not showing up in SOLR from the DB, searched for it through the SOLR GUI, and it
  found it.

 -Original Message-
 From: Savvas-Andreas Moysidis
 [mailto:savvas.andreas.moysi...@googlemail.com]
 Sent: Thursday, February 03, 2011 4:57 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Index Not Matching

 that's odd..are you viewing the results through your application or the
 admin console? if you aren't, I'd suggest you use the admin console just
 to
 eliminate the possibility of an application bug.
 We had a similar problem in the past and turned out to be a mixup of our
 dev/test instances..

 On 3 February 2011 21:41, Esclusa, Will william.escl...@bonton.com
 wrote:

  Hello Saavs,
 
  I am 100% sure we are not updating the DB after we index the data. We
  are specifying the same fields on both queries. Our prod boxes do not
  have access to QA or DEV, so I would expect a connection error when
  indexing if this is the case. No connection errors in the logs.
 
 
 
  -Original Message-
  From: Savvas-Andreas Moysidis
  [mailto:savvas.andreas.moysi...@googlemail.com]
  Sent: Thursday, February 03, 2011 4:26 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Index Not Matching
 
  Hello,
 
  Are you definitely positive your database isn't updated after you
 index
  your
  data? Are you querying against the same field(s) specifying the same
  criteria both in Solr and in the database?
  Any chance you might be pointing to a dev/test instance of Solr ?
 
  Regards,
  - Savvas
 
  On 3 February 2011 20:17, Esclusa, Will william.escl...@bonton.com
  wrote:
 
   Greetings!
  
  
  
    My organization is new to SOLR, so please bear with me.  At times,
 we
   experience an out of sync condition between SOLR index files and our
   Database. We resolved that by clearing the index file and performing
 a
  full
   crawl of the database. Last time we noticed an out of sync
 condition,
  we
   went through our procedure of deleting and crawling, but this time
 it
  did
   not fix it.
  
  
  
   For example, search for swim on the DB and we get 440 products, but
  yet
   SOLR states we have 214 products. Has anyone experience anything
 like
  this?
   Does anyone have any suggestions on a trace we can turn on? Again,
 we
  are
   new to SOLR so any help you can provide is greatly appreciated.
  
  
  
   Thanks!
  
  
  
   Will
  
  
  
  
 



Re: Faceting Question

2011-01-24 Thread Geert-Jan Brits
 fq={!tag=tag1}tags:( |1003| |1007|) AND tags:(
|10015|)&version=2.2&start=0&rows=10&indent=on&facet=on&facet.field={!ex=tag1}category&facet.field=capacity&facet.field=brand

I'm just guessing here, but perhaps {!tag=tag1} is only tagging the 'tags:(
|1003| |1007|)' part. If so, {!ex=tag1} would only exclude 'tags:( |1003|
|1007|)' but it wouldn't exclude 'tags:(
|10015|)'.

I believe this would 100% explain what you're seeing.

Assuming my guess is correct, you could try a couple of things (none of
which I'm absolutely certain will work, but you could try them out easily):
1. put the fq in quotes: fq={!tag=tag1}"tags:( |1003| |1007|) AND tags:(|10015|)"
 -- this might instruct {!tag=tag1} to tag the whole fq-filter.
2. make multiple fq's, and exclude them all (not sure if you can exclude
multiple tags): fq={!tag=tag1}tags:( |1003| |1007|)&fq={!tag=tag2}tags:(
|10015|)&facet.field={!ex=tag1,tag2}category...
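For what it's worth, that second option can be sketched client-side like this (Python; building the request with one fq per tag -- whether {!ex=tag1,tag2} accepts multiple tags is exactly the open question, so verify against your Solr version):

```python
from urllib.parse import urlencode

# Sketch of option 2: one fq per filter, each with its own tag,
# and a facet.field that tries to exclude both tags at once.
params = [
    ("q", "*:*"),
    ("fq", "{!tag=tag1}tags:( |1003| |1007|)"),
    ("fq", "{!tag=tag2}tags:( |10015|)"),
    ("facet", "on"),
    ("facet.field", "{!ex=tag1,tag2}category"),
    ("facet.field", "capacity"),
    ("facet.field", "brand"),
]
query_string = urlencode(params)
print(query_string)
```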

hth,
Geert-Jan

2011/1/24 beaviebugeater mbro...@cox.net


 I am attempting to do facets on products similar to how hayneedle does it
 on
 their online stores (they do NOT use Solr).   See:
 http://www.clockstyle.com/wall-clocks/antiqued/1359+1429+4294885075.cfm

 So simple example, my left nav might contain categories and 2 attributes,
 brand and capacity:

 Categories
 - Cat1 (23) selected
 - Cat2 (16)
 - Cat3 (5)

 Brand
 -Brand1 (18)
 -Brand2 (10)
 -Brand3 (0)

 Capacity
 -Capacity1 (14)
 -Capacity2 (9)


 Each category or attribute value is represented with a checkbox and can be
 selected or deselected.

 The initial entry into this page has one category selected.  Other
 categories can be selected which might change the number of products
 related
 to each attribute value.  The number of products in each category never
 changes.

 I should also be able to select one or more attribute.

 Logically this would look something like:

 (Cat1 Or Cat2) AND (Value1 OR Value2) AND (Value4)

 Behind the scenes I have each category and attribute value represented by a
 tag, which is just a numeric value.  So I search on the tags field only
 and then facet on category, brand and capacity fields which are stored
 separately.

 My current Solr query ends up looking something like:

  fq={!tag=tag1}tags:( |1003| |1007|) AND tags:(

  |10015|)&version=2.2&start=0&rows=10&indent=on&facet=on&facet.field={!ex=tag1}category&facet.field=capacity&facet.field=brand

 This shows 2 categories being selected (1003 and 1007) and one attribute
 value (10015).

 This partially works - the categories work fine.   The problem is, if I
 select, say a brand attribute (as in the above example the 10015 tag) it
 does filter to the selected categories AND the selected attribute BUT I'm
 not able to broaden the search by selecting another attribute value.

 I want to display of products to be filtered to what I select, but I want
 to
 be able to broaden the filter without having to back up.

 I feel like I'm close but still missing something.  Is there a way to
 specify 2 tags that should be excluded from facet fields?

 I hope this example makes sense.

 Any help greatly appreciated.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Faceting-Question-tp2320542p2320542.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: one last questoni on dynamic fields

2011-01-23 Thread Geert-Jan Brits
Yep you can. Although I'm not sure you can use a wildcard prefix (perhaps
you can, I'm just not sure). I always use wildcard suffixes.

Cheers,
Geert-Jan

2011/1/23 Dennis Gearon gear...@sbcglobal.net

 Is it possible to use ONE definition of a dynamic field type for inserting
 mulitple dynamic fields of that type with different names? Or do I need a
 seperate dynamic field definition for each eventual field?

 Can I do this?
 in schema.xml
  <field name="ALL_OTHER_STANDARD_FIELDS" type="OTHER_TYPES"
 indexed="SOME_TIMES" stored="USUALLY"/>
  <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
  .
  .
 /in schema.xml


 and then doing for insert
 <add>
 <doc>
  <field name="ALL_OTHER_STANDARD_FIELDS">all their values</field>
  <field name="customA_i">9802490824908</field>
  <field name="customB_i">9809084</field>
  <field name="customC_i">09845970011</field>
  <field name="customD_i">09874523459870</field>
 </doc>
 </add>

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
 better
 idea to learn from others' mistakes, so you do not have to make them
 yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




Re: Search on two core and two schema

2011-01-18 Thread Geert-Jan Brits
"Schemas are very differents, i can't group them."

In contrast to what you're saying above, you may rethink the option of
combining both type of documents in a single core.
It's a perfectly valid approach to combine heteregenous documents in a
single core in Solr. (and use a specific field -say 'type'-  to distinguish
between them when needed)
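A tiny sketch of that pattern (plain Python standing in for the index; the discriminator field name 'doctype' is my invention, any otherwise-unused field name works):

```python
# One index holding both document shapes; a discriminator field
# keeps them apart. Other field names mirror the two schemas above.
docs = [
    {"doctype": "article", "id": "1", "title": "titre 1",
     "UUID_location": "jp-01"},
    {"doctype": "taxon", "UUID_location": "jp-01",
     "label": "Japan", "hierarchy": "asia/japan"},
]

# The Solr equivalent of this filter would be fq=doctype:taxon
taxons = [d for d in docs if d["doctype"] == "taxon"]
print(taxons[0]["label"])  # Japan
```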

Geert-Jan

2011/1/18 Jonathan Rochkind rochk...@jhu.edu

 Solr can't do that. Two cores are two separate cores; you have to do two
 separate queries, and get two separate result sets.

 Solr is not an rdbms.


 On 1/18/2011 12:24 PM, Damien Fontaine wrote:

 I want execute this query :

 Schema 1 :
 <field name="id" type="string" indexed="true" stored="true"
 required="true" />
 <field name="title" type="string" indexed="true" stored="true"
 required="true" />
 <field name="UUID_location" type="string" indexed="true" stored="true"
 required="true" />

 Schema 2 :
 <field name="UUID_location" type="string" indexed="true" stored="true"
 required="true" />
 <field name="label" type="string" indexed="true" stored="true"
 required="true" />
 <field name="type" type="string" indexed="true" stored="true"
 required="true" />

 Query :

 select?facet=true&fl=title&q=title:*&facet.field=UUID_location&rows=10&qt=standard

 Result :

 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">0</int>
 <lst name="params">
 <str name="facet">true</str>
 <str name="fl">title</str>
 <str name="q">title:*</str>
 <str name="facet.field">UUID_location</str>
 <str name="qt">standard</str>
 </lst>
 </lst>
 <result name="response" numFound="1889" start="0">
 <doc>
 <str name="title">titre 1</str>
 </doc>
 <doc>
 <str name="title">Titre 2</str>
 </doc>
 </result>
 <lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
 <lst name="UUID_location">
 <int name="Japan">998</int>
 <int name="China">891</int>
 </lst>
 </lst>
 <lst name="facet_dates"/>
 </lst>
 </response>

  On 18/01/2011 17:55, Stefan Matheis wrote:

  Okay .. and .. now .. you're trying to do what? perhaps you could give us
  an
  example, w/ real data .. sample queries -> results.
  because actually i cannot imagine what you want to achieve, sorry

 On Tue, Jan 18, 2011 at 5:24 PM, Damien Fontainedfonta...@rosebud.fr
 wrote:

  On my first schema, there are informations about a document like title,
 lead, text etc and many UUID(each UUID is a taxon's ID)
 My second schema contains my taxonomies with auto-complete and facets.

  On 18/01/2011 17:06, Stefan Matheis wrote:

   Search on two cores but combine the results afterwards to present them
 in

 one group, or what exactly are you trying to do Damien?

 On Tue, Jan 18, 2011 at 5:04 PM, Damien Fontainedfonta...@rosebud.fr

 wrote:

   Hi,

 I would like make a search on two core with differents schemas.

 Sample :

 Schema Core1
   - ID
   - Label
   - IDTaxon
 ...

 Schema Core2
   - IDTaxon
   - Label
   - Hierarchy
 ...

 Schemas are very differents, i can't group them. Have you an idea to
 realize this search ?

 Thanks,

 Damien







Re: Sub query using SOLR?

2011-01-05 Thread Geert-Jan Brits
Bbarani probably wanted to be able to create the query without having to
prefetch the ids at the client side first.
But I agree, this is the only stable solution I can think of (excluding
possible patches)
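Grijesh's two-step approach looks roughly like this on the client side (Python sketch; the ids and field names are taken from his example below):

```python
# Step 1 (assumed): ids collected client-side from the first
# query's results.
related_ids = ["id1", "id2", "id3"]

# Step 2: OR them into a filter query for the second query.
fq = "related_id:(" + " OR ".join(related_ids) + ")"
q = 'type:IT AND manager_12:dave'

print(fq)  # related_id:(id1 OR id2 OR id3)
```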

Geert-Jan

2011/1/5 Grijesh.singh pintu.grij...@gmail.com


 Why thinking so complex, just use the result of the first query as a filter
 for your second query,
 like
 fq=related_id:(id1 OR id2 OR id3)&q="type:IT AND
 manager_12:dave"

 something like that

 -
 Grijesh
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Sub-query-using-SOLR-tp2193251p2197490.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Consequences for using multivalued on all fields

2010-12-21 Thread Geert-Jan Brits
You should be aware that the behavior of sorting on a multi-valued field is
undefined. After all, which of the multiple values should be used for
sorting?
So if you need sorting on the field, you shouldn't make it multi-valued.
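The ambiguity is easy to demonstrate in a few lines of Python (made-up docs): sorting the same multi-valued field by its minimum versus its maximum value yields two different document orders.

```python
# Two docs with a multi-valued numeric field.
docs = {"a": [3, 9], "b": [5, 6]}

# Which of the multiple values should drive the sort? Choosing the
# minimum vs the maximum gives contradictory answers -- exactly the
# ambiguity a single-valued sort field avoids.
by_min = sorted(docs, key=lambda d: min(docs[d]))
by_max = sorted(docs, key=lambda d: max(docs[d]))

print(by_min, by_max)  # ['a', 'b'] ['b', 'a']
```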

Geert-Jan

2010/12/21 J.J. Larrea j...@panix.com

 Someone please correct me if I am wrong, but as far as I am aware index
 format is identical in either case.

 One benefit of allowing one to specify a field as single-valued is similar
 to specifying that a field is required: Providing a safeguard that index
 data conforms to requirements.  So making all fields multivalued forgoes
 that integrity check for fields which by definition should be singular.

 Also depending on the response writer and for the XMLResponseWriter the
 requested response version (see
 http://wiki.apache.org/solr/XMLResponseFormat) the multi-valued setting
 can determine whether the document values returned from a query will be
 scalars (eg. str name=year2010/str) or arrays of scalars (arr
 name=yearstr2010/str/arr), regardless of how many values are
 actually stored.

 But the most significant gotcha of not specifying the actual arity (1 or N)
 arises if any of those fields is used for field-faceting: By default the
 field-faceting logic chooses a different algorithm depending on whether the
 field is multi-valued, and the default choice for multi-valued is only
 appropriate for a small set of enumerated values since it creates a filter
 query for each value in the set. And this can have a profound effect on Solr
 memory utilization. So if you are not relying on the field arity setting to
 select the algorithm, you or your users might need to specify it explicitly
 with the f.field.facet.method argument; see
 http://wiki.apache.org/solr/SolrFacetingOverview for more info.

 So while all-multivalued isn't a showstopper, if it were up to me I'd want
 to give users the option to specify arity and whether the field is required.

 - J.J.

 At 2:13 PM +0100 12/21/10, Tim Terlegård wrote:
 In our application we use dynamic fields and there can be about 50 of
 them and there can be up to 100 million documents.
 
 Are there any disadvantages having multivalued=true on all fields in
 the schema? An admin of the application can specify dynamic fields and
 if they should be indexed or stored. Question is if we gain anything
 by letting them to choose multivalued as well or if it just adds
 complexity to the user interface?
 
 Thanks,
 Tim




Re: Search based on images

2010-12-11 Thread Geert-Jan Brits
Well-known algorithms for detecting 'highly descriptive features'  in images
that can cope with scaling and rotation (up to a certain degree of course)
are
SIFT and SURF (SURF is generally considered the more mature of the two
afaik)

http://en.wikipedia.org/wiki/Scale-invariant_feature_transform
http://en.wikipedia.org/wiki/SURF

That link comes with links to the
original papers as well as a list of open-source implementations, e.g:
http://code.google.com/p/javasurf/

I don't have experience with the
open-source code myself, and you probably have to build a similarity
measure on top of the lower-level methods that implement these algorithms.
So this is perhaps a more 'down in the trenches' approach, but at least it
should give you some solid background on how this is done.

Geert-Jan

2010/12/11 Dennis Gearon gear...@sbcglobal.net

 Tried one, of Perry Mason's secretary when she was young (and HOOOT),
 Barbara Hale.
 http://www.skylighters.org/ggparade/index8.html

 Didn't find it. 1.8 billion images indexed is probably a DROP in the bucket
 of
 what's out there.

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
 better
 idea to learn from others' mistakes, so you do not have to make them
 yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.



 - Original Message 
 From: Dennis Gearon gear...@sbcglobal.net
 To: solr-user@lucene.apache.org
 Sent: Fri, December 10, 2010 9:24:53 PM
 Subject: Re: Search based on images

 Threre is actually some image recognition search engine software  somewhere
 I
 heard about. Take a picture of something, say a poster,  upload it, and it
 will
 adjust for some lighting/angle/distortion, and  try to find it on the web
 somewhere.

 You hear about crazy stuff like this at dev camps. Basically, handme downs
 from
 Homeland Security and the military ;-)
 Dennis Gearon





Re: finding exact case insensitive matches on single and multiword values

2010-12-03 Thread Geert-Jan Brits
when you went from StrField to TextField in your config you enabled
tokenizing (which I believe splits on spaces by default),
which is why you see separate 'words' / terms in the debugQuery explanation.

I believe you want to keep your old StrField config and try quoting:

fq=city:"den+haag" or fq=city:"den haag"

Concerning the lower-casing: wouldn't it be easiest to do that at the
client? (I'm not sure at the moment how to do lowercasing with a StrField.)
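Combining the two suggestions (keep the StrField, lowercase at the client, quote the multi-word value), a client-side helper might look like this (Python sketch; the helper name is my own):

```python
def city_filter(raw):
    # Lowercase client-side (a StrField does no analysis of its own)
    # and quote so the multi-word value matches as a single term.
    return 'city:"%s"' % raw.strip().lower()

print(city_filter("Den Haag"))  # city:"den haag"
```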

Geert-jan


2010/12/3 PeterKerk vettepa...@hotmail.com



 You are right, this is what I see when I append the debug query (very very
 useful btw!!!) in old situation:
  <arr name="parsed_filter_queries">
  <str>city:den title:haag</str>
  <str>PhraseQuery(themes:hotel en restaur)</str>
  </arr>



 I then changed the schema.xml to:

  <fieldType name="myField" class="solr.TextField" sortMissingLast="true"
  omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  </fieldType>

  <field name="city" type="myField" indexed="true" stored="true"/> <!-- used
  to be string -->


 I then tried adding parentheses:

  http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den+haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city
  also tried (without +):
  http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den
  haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city

 Then I get:

  <arr name="parsed_filter_queries">
  <str>city:den city:haag</str>
  </arr>

 And still 0 results

  But as you can see the query is split up into 2 separate words; I don't
  think
  that is what I need?


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/finding-exact-case-insensitive-matches-on-single-and-multiword-values-tp2012207p2012509.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
if first is selected in the user interface and we have 10 price ranges
query would be 120 cluases (12 months * 10 price ranges)

What would you intend to do with the returned facet-results in this
situation? I doubt you want to display 12 categories (1 for each month) ?

When a user hasn't selected a date, perhaps it would be more useful to show
the cheapest fare regardless of month and facet on that?

This would involve introducing 2 new fields:
FareDateDontCareStandard, FareDateDontCareFirst

Populate these fields on indexing time, by calculating the cheapest fares
over all months.

This then results in every query having to support at most 20 price ranges
(10 for normal and 10 for first class)
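Index-time computation of those two fields could look like this (Python sketch with made-up fares; the field names follow the proposal above):

```python
# Made-up per-month cheapest fares for one city document.
fares_standard = {"jan": 120.0, "feb": 95.0, "mar": 110.0}
fares_first = {"jan": 240.0, "feb": 210.0}

doc = {
    "Name": "paris",
    # The two proposed don't-care fields, computed once at index
    # time as the minimum over all available months:
    "FareDateDontCareStandard": min(fares_standard.values()),
    "FareDateDontCareFirst": min(fares_first.values()),
}
print(doc["FareDateDontCareStandard"])  # 95.0
```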

HTH,
Geert-Jan



2010/12/1 lee carroll lee.a.carr...@googlemail.com

 Hi Erick,
 so if i understand you we could do something like:

 if Jan is selected in the user interface and we have 10 price ranges

 query would be 20 cluases in the query (10 * 2 fare clases)

 if first is selected in the user interface and we have 10 price ranges
 query would be 120 cluases (12 months * 10 price ranges)

 if first and jan selected with 10 price ranges
 query would be 10 cluases

 if we required facets to be returned for all price combinations we'd need
 to
 supply
 240 cluases

 the user interface would also need to collate the individual fields into
  meaningful aggregates for the user (ie numbers by month, numbers by fare
 class)

 have I understood or missed the point (i usually have)




 On 1 December 2010 15:00, Erick Erickson erickerick...@gmail.com wrote:

  I'd think that facet.query would work for you, something like:
  facet=true&facet.query=FareJanStandard:[price1 TO
  price2]&facet.query=fareJanStandard:[price2 TO price3]
  You can string as many facet.query clauses as you want, across as many
  fields as you want, they're all
  independent and will get their own sections in the response.
 
  Best
  Erick
 
  On Wed, Dec 1, 2010 at 4:55 AM, lee carroll 
 lee.a.carr...@googlemail.com
  wrote:
 
   Hi
  
   I've built a schema for a proof of concept and it is all working fairly
   fine, niave maybe but fine.
   However I think we might run into trouble in the future if we ever use
   facets.
  
   The data models train destination city routes from a origin city:
   Doc:City
  Name: cityname [uniq key]
  CityType: city type values [nine possible values so good for
 faceting]
  ... [other city attricbutes which relate directy to the doc unique
  key]
   all have limited vocab so good for faceting
  FareJanStandard:cheapest standard fare in january(float value)
  FareJanFirst:cheapest first class fare in january(float value)
  FareFebStandard:cheapest standard fare in feb(float value)
  FareFebFirst:cheapest first fare in feb(float value)
  . etc
  
   The question is how would i best facet fare price? The desire is to
  return
  
   number of citys with jan prices in a set of ranges
   etc
   number of citys with first prices in a set of ranges
   etc
  
   install is 1.4.1 running in weblogic
  
   Any ideas ?
  
  
  
   Lee C
  
 



Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
Ok longer answer than anticipated (and good conceptual practice ;-)

Yeah I believe that would work if I understand correctly that:

'in Jan [9]
in feb [10]
in march [1]'

has nothing to do with pricing, but only with availability?

If so you could seperate it out as two seperate issues:

1. ) showing pricing (based on context)
2. ) showing availabilities (based on context)

For 1.)  you get 39 pricefields ([jan,feb,..,dec,dc] * [standard,first,dc])
note: 'dc' indicates 'don't care'.

depending on the context you query the correct pricefield to populate the
price facet-values.
for discussion let's call the fields: _p[fare][date].
In other words, the price field for no preference at all would become: _pdcdc


For 2.) define a multivalued field 'FaresPerDate 'which indicate
availability, which is used to display:

A)
Standard fares [10]
First fares [3]

B)
in Jan [9]
in feb [10]
in march [1]

A) depends on your selection (or dont caring) about a month
B) vice versa depends on your selection (or dont caring)  about a fare type

given all possible date values: [jan,feb,..dec,dontcare]
given all possible fare values:[standard,first,dontcare]

FaresPerDate consists of multiple values per document where each value
indicates the availability of a combination of 'fare' and 'date':
(standardJan,firstJan,DCJan,...,standardDec,firstDec,DCDec,standardDC,firstDC,DCDC)
Note that the nr of possible values = 39.
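Generating those values is straightforward (Python sketch; the capitalization scheme here is my own, any consistent one works):

```python
from itertools import product

# 12 months plus a don't-care value, and 2 fare classes plus
# don't-care: 13 * 3 = 39 combinations.
dates = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "DC"]
fares = ["standard", "first", "DC"]

# One availability token per (fare, date) combination; a document
# gets a token if at least one matching fare exists for it.
tokens = [f + d for f, d in product(fares, dates)]
print(len(tokens))  # 39
```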

Example:
1. ) the user hasn't selected any preference:

q=*:*&facet.field=FaresPerDate&facet.query=_pdcdc:[0 TO
20]&facet.query=_pdcdc:[20 TO 40], etc.

in the client you have to make sure to select the correct values of
'FaresPerDate' for display:
in this case:

Standard fares [10] -- FaresPerDate.standardDC
First fares [3] -- FaresPerDate.firstDC

in Jan [9] - FaresPerDate.DCJan
in feb [10] - FaresPerDate.DCFeb
in march [1]- FaresPerDate.DCMarch

2) the user has selected January
q=*:*&facet.field=FaresPerDate&fq=FaresPerDate:DCJan&facet.query=_pDCJan:[0
TO 20]&facet.query=_pDCJan:[20 TO 40]

Standard fares [10] -- FaresPerDate.standardJan
First fares [3] -- FaresPerDate.firstJan

in Jan [9] - FaresPerDate.DCJan
in feb [10] - FaresPerDate.DCFeb
in march [1]- FaresPerDate.DCMarch

Hope that helps,
Geert-Jan


2010/12/1 lee carroll lee.a.carr...@googlemail.com

 Sorry Geert, missed off the price value bit from the user interface, so we'd
 display

 Facet price
 Standard fares [10]
 First fares [3]

 When traveling
 in Jan [9]
 in feb [10]
 in march [1]

 Fare Price
 0 - 25 :  [20]
 25 - 50: [10]
 50 - 100 [2]

 cheers lee c


 On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com
 wrote:

  Geert
 
  The UI would be something like:
  user selections
  for the facet price
  max price: £100
  fare class: any
 
  city attributes facet
  cityattribute1 etc: xxx
 
  results displayed something like
 
  Facet price
  Standard fares [10]
  First fares [3]
  in Jan [9]
  in feb [10]
  in march [1]
  etc
  is this compatible with your approach ?
 
  Erick the price is an interval scale ie a fare can be any value (not
 high,
  low, medium etc)
 
  How sensible would the following approach be
  index city docs with fields only related to the city unique key
  in the same index also index fare docs which would be something like:
  Fare:
  cityID: xxx
  Fareclass:standard
  FareMonth: Jan
  FarePrice: 100
 
  the query would be something like:
   q=FarePrice:[* TO 100] FareMonth:Jan&fl=cityID
   returning facets for FareClass and FareMonth. Hold on, this will not facet
   city docs correctly. Sorry, that's not going to work.
 
 
 
 
 
 
 
 
  On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Hmmm, that's getting to be a pretty clunky query sure enough. Now you're
  going to
  have to ensure that HTTP requests that long get through and stuff like
  that
 
  I'm reaching a bit here, but you can facet on a tokenized field.
 Although
  that's not
  often done there's no prohibition against it.
 
  So, what if you had just one field for each city that contained some
  abstract
  information about your fares etc. Something like
  janstdfareclass1 jancheapfareclass3 febstdfareclass6
 
  Now just facet on that field? Not #values# in that field, just the field
  itself. You'd then have to make those into human-readable text, but that
  would considerably simplify your query. Probably only works if your user
  is
  selecting from pre-defined ranges, if they expect to put in arbitrary
  ranges
  this scheme probably wouldn't work...
 
  Best
  Erick
 
  On Wed, Dec 1, 2010 at 10:22 AM, lee carroll
  lee.a.carr...@googlemail.comwrote:
 
   Hi Erick,
   so if i understand you we could do something like:
  
   if Jan is selected in the user interface and we have 10 price ranges
  
   query would be 20 cluases in the query (10 * 2 fare clases)
  
   if first is selected in the user interface and we have 10 price ranges
   query would be 120 cluases (12 months * 10 price ranges)
  
   if first and jan selected 

Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
Also, filtering and sorting on price can be done as well. Just be sure to
use the correct price- field.
Geert-Jan

2010/12/1 Geert-Jan Brits gbr...@gmail.com

 Ok longer answer than anticipated (and good conceptual practice ;-)

 Yeah I believe that would work if I understand correctly that:

 'in Jan [9]
 in feb [10]
 in march [1]'

 has nothing to do with pricing, but only with availability?

 If so you could seperate it out as two seperate issues:

 1. ) showing pricing (based on context)
 2. ) showing availabilities (based on context)

 For 1.)  you get 39 pricefields ([jan,feb,..,dec,dc] *
 [standard,first,dc])
 note: 'dc' indicates 'don't care.

 depending on the context you query the correct pricefield to populate the
 price facet-values.
 for discussion lets call the fields: _p[fare][date].
 In other words, the price field for no preference at all would become:
 _pdcdc


 For 2.) define a multivalued field 'FaresPerDate 'which indicate
 availability, which is used to display:

 A)
 Standard fares [10]
 First fares [3]

 B)
 in Jan [9]
 in feb [10]
 in march [1]

 A) depends on your selection (or dont caring) about a month
 B) vice versa depends on your selection (or dont caring)  about a fare type

 given all possible date values: [jan,feb,..dec,dontcare]
 given all possible fare values:[standard,first,dontcare]

 FaresPerDate consists of multiple values per document where each value
 indicates the availability of a combination of 'fare' and 'date':

 (standardJan,firstJan,DCJan,...,standardDec,firstDec,DCDec,standardDC,firstDC,DCDC)
 Note that the nr of possible values = 39.

 Example:
 1. ) the user hasn't selected any preference:

 q=*:*&facet.field=FaresPerDate&facet.query=_pdcdc:[0 TO
 20]&facet.query=_pdcdc:[20 TO 40], etc.

 in the client you have to make sure to select the correct values of
 'FaresPerDate' for display:
 in this case:

 Standard fares [10] -- FaresPerDate.standardDC
 First fares [3] -- FaresPerDate.firstDC

 in Jan [9] - FaresPerDate.DCJan
 in feb [10] - FaresPerDate.DCFeb
 in march [1]- FaresPerDate.DCMarch

 2) the user has selected January
 q=*:*&facet.field=FaresPerDate&fq=FaresPerDate:DCJan&facet.query=_pDCJan:[0
 TO 20]&facet.query=_pDCJan:[20 TO 40]

 Standard fares [10] -- FaresPerDate.standardJan
 First fares [3] -- FaresPerDate.firstJan

 in Jan [9] - FaresPerDate.DCJan
 in feb [10] - FaresPerDate.DCFeb
 in march [1]- FaresPerDate.DCMarch

 Hope that helps,
 Geert-Jan


 2010/12/1 lee carroll lee.a.carr...@googlemail.com

 Sorry Geert, missed off the price value bit from the user interface, so we'd
 display

 Facet price
 Standard fares [10]
 First fares [3]

 When traveling
 in Jan [9]
 in feb [10]
 in march [1]

 Fare Price
 0 - 25 :  [20]
 25 - 50: [10]
 50 - 100 [2]

 cheers lee c


 On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com
 wrote:

  Geert
 
  The UI would be something like:
  user selections
  for the facet price
  max price: £100
  fare class: any
 
  city attributes facet
  cityattribute1 etc: xxx
 
  results displayed something like
 
  Facet price
  Standard fares [10]
  First fares [3]
  in Jan [9]
  in feb [10]
  in march [1]
  etc
  is this compatible with your approach ?
 
  Erick, the price is an interval scale, i.e. a fare can be any value (not
 high,
  low, medium etc)
 
  How sensible would the following approach be
  index city docs with fields only related to the city unique key
  in the same index also index fare docs which would be something like:
  Fare:
  cityID: xxx
  Fareclass:standard
  FareMonth: Jan
  FarePrice: 100
 
  the query would be something like:
  q=FarePrice:[* TO 100] FareMonth:Jan&fl=cityID
  returning facets for FareClass and FareMonth. Hold on, this will not
 facet
  city docs correctly. Sorry, that's not going to work.
 
 
 
 
 
 
 
 
  On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Hmmm, that's getting to be a pretty clunky query sure enough. Now
 you're
  going to
  have to ensure that HTTP requests that long get through and stuff like
  that
 
  I'm reaching a bit here, but you can facet on a tokenized field.
 Although
  that's not
  often done there's no prohibition against it.
 
  So, what if you had just one field for each city that contained some
  abstract
  information about your fares etc. Something like
  janstdfareclass1 jancheapfareclass3 febstdfareclass6
 
  Now just facet on that field? Not #values# in that field, just the
 field
  itself. You'd then have to make those into human-readable text, but
 that
  would considerably simplify your query. Probably only works if your
 user
  is
  selecting from pre-defined ranges, if they expect to put in arbitrary
  ranges
  this scheme probably wouldn't work...
 
  Best
  Erick
 
  On Wed, Dec 1, 2010 at 10:22 AM, lee carroll
  lee.a.carr...@googlemail.comwrote:
 
   Hi Erick,
   so if i understand you we could do something like:
  
   if Jan is selected in the user interface and we have 10 price ranges
  
   query

Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
Indeed, selecting the best price for January OR April OR November and
sorting on it isn't possible with this solution (if that's what you mean).
However, any combination of selecting 1 month and/or 1 price-range and/or 1
fare-type IS possible.

2010/12/1 lee carroll lee.a.carr...@googlemail.com

 Hi Geert,

 Ok I think I follow. the magic is in the multi-valued field.

  The only danger would be complexity if we allow users to multi-select
  months/prices/fare classes. For example they can search for first prices in
  jan, april and november. I think what you describe is possible in this case,
  just complicated. I'll see if I can hack some facets into the prototype
  tomorrow. Thanks for your help

 Lee C

 On 1 December 2010 17:57, Geert-Jan Brits gbr...@gmail.com wrote:

  Ok longer answer than anticipated (and good conceptual practice ;-)
 
  Yeah I believe that would work if I understand correctly that:
 
  'in Jan [9]
  in feb [10]
  in march [1]'
 
  has nothing to do with pricing, but only with availability?
 
  If so you could separate it out as two separate issues:
 
  1. ) showing pricing (based on context)
  2. ) showing availabilities (based on context)
 
  For 1.)  you get 39 pricefields ([jan,feb,..,dec,dc] *
 [standard,first,dc])
  note: 'dc' indicates 'don't care'.
 
  depending on the context you query the correct pricefield to populate the
  price facet-values.
   for discussion let's call the fields: _p[fare][date].
   In other words the price field for no preference at all would become:
  _pdcdc
 
 
   For 2.) define a multivalued field 'FaresPerDate', which indicates
   availability and is used to display:
 
  A)
  Standard fares [10]
  First fares [3]
 
  B)
  in Jan [9]
  in feb [10]
  in march [1]
 
   A) depends on your selection (or 'don't care') about a month
   B) vice versa, depends on your selection (or 'don't care') about a fare
  type
 
  given all possible date values: [jan,feb,..dec,dontcare]
  given all possible fare values:[standard,first,dontcare]
 
  FaresPerDate consists of multiple values per document where each value
  indicates the availability of a combination of 'fare' and 'date':
 
 
  (standardJan,firstJan,DCJan,...,standardDec,firstDec,DCDec,standardDC,firstDC,DCDC)
  Note that the nr of possible values = 39.
 
  Example:
  1. ) the user hasn't selected any preference:
 
   q=*:*&facet.field=FaresPerDate&facet.query=_pdcdc:[0 TO
   20]&facet.query=_pdcdc:[20 TO 40], etc.
 
  in the client you have to make sure to select the correct values of
  'FaresPerDate' for display:
  in this case:
 
  Standard fares [10] -- FaresPerDate.standardDC
  First fares [3] -- FaresPerDate.firstDC
 
  in Jan [9] - FaresPerDate.DCJan
  in feb [10] - FaresPerDate.DCFeb
  in march [1]- FaresPerDate.DCMarch
 
  2) the user has selected January
 
  q=*:*&facet.field=FaresPerDate&fq=FaresPerDate:DCJan&facet.query=_pDCJan:[0
   TO 20]&facet.query=_pDCJan:[20 TO 40]
 
  Standard fares [10] -- FaresPerDate.standardJan
  First fares [3] -- FaresPerDate.firstJan
 
  in Jan [9] - FaresPerDate.DCJan
  in feb [10] - FaresPerDate.DCFeb
  in march [1]- FaresPerDate.DCMarch
 
  Hope that helps,
  Geert-Jan
 
 
  2010/12/1 lee carroll lee.a.carr...@googlemail.com
 
   Sorry Geert, missed off the price value bit from the user interface so
 we'd
   display
  
   Facet price
   Standard fares [10]
   First fares [3]
  
   When traveling
   in Jan [9]
   in feb [10]
   in march [1]
  
   Fare Price
   0 - 25 :  [20]
   25 - 50: [10]
   50 - 100 [2]
  
   cheers lee c
  
  
   On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com
   wrote:
  
Geert
   
The UI would be something like:
user selections
for the facet price
max price: £100
fare class: any
   
city attributes facet
cityattribute1 etc: xxx
   
results displayed something like
   
Facet price
Standard fares [10]
First fares [3]
in Jan [9]
in feb [10]
in march [1]
etc
is this compatible with your approach ?
   
 Erick, the price is an interval scale, i.e. a fare can be any value (not
   high,
low, medium etc)
   
How sensible would the following approach be
index city docs with fields only related to the city unique key
in the same index also index fare docs which would be something like:
Fare:
cityID: xxx
Fareclass:standard
FareMonth: Jan
FarePrice: 100
   
the query would be something like:
 q=FarePrice:[* TO 100] FareMonth:Jan&fl=cityID
 returning facets for FareClass and FareMonth. Hold on, this will not
   facet
 city docs correctly. Sorry, that's not going to work.
   
   
   
   
   
   
   
   
On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com
   wrote:
   
Hmmm, that's getting to be a pretty clunky query sure enough. Now
  you're
going to
 have to ensure that HTTP requests that long get through and stuff
 like
that
   
I'm reaching a bit here, but you can facet on a tokenized field.
   Although

Re: Is this sort order possible in a single query?

2010-11-24 Thread Geert-Jan Brits
You could do it with sorting on a functionquery (which is supported from
solr 1.5)
http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
Consider the search:
http://localhost:8093/solr/select?q=author:'j.k.rowling'

sorting like you specified would involve:

1. introducing an extra field: 'author_exact' of type 'string' which takes
care of the exact matching. (You can populate it by defining it as a
copyfield of Author so your indexing-code doesn't change)
2. set sortMissingLast=true for 'num_copies' and 'num_comments'
like: <fieldType
name="num_copies" sortMissingLast="true" ...>

this makes sure that documents which don't have the value set end up at the
end of the sort when sorted on that particular field.

3. construct a functionquery that scores either 0 (no match)  or x (not sure
what x is (1?) , but it should always be the same for all exact matches )

This gives

http://localhost:8093/solr/select?q=author:'j.k.rowling'&sort=query({!dismax qf=author_exact
v='j.k.rowling'}) desc

which scores all exact matches before all partial matches.

4. now just concatenate the other sorts giving:

http://localhost:8093/solr/select?q=author:'j.k.rowling'&sort=query({!dismax qf=author_exact
v='j.k.rowling'}) desc, num_copies desc, num_comments desc

That should do it.

Please note that 'num_copies' and 'num_comments' still kick in to break the
tie for documents that exactly match on 'author_exact'. I assume this is
ok.

I can't see a way to do it without functionqueries at the moment, which
doesn't mean there isn't any.
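For illustration, the four steps above could be wired together on the
client like this (a sketch in Python; the handler URL, field names and
dismax sort clause come from the example above, while the helper function
itself is hypothetical):

```python
# Sketch: build the query string for "exact author matches first, then
# num_copies, then num_comments", using sort-by-function as described.
from urllib.parse import urlencode

def build_params(author):
    sort_clauses = [
        # function query: nonzero for exact matches on author_exact:
        "query({{!dismax qf=author_exact v='{}'}}) desc".format(author),
        "num_copies desc",     # tie-break on copies sold
        "num_comments desc",   # then on number of comments
    ]
    return urlencode({"q": "author:'{}'".format(author),
                      "sort": ", ".join(sort_clauses)})

params = build_params("j.k.rowling")
```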

Hope that helps,

Geert-Jan











2010/11/24 Robert Gründler rob...@dubture.com

 Hi,

 we have a requirement for one of our search results which has a quite
 complex sorting strategy. Let me explain the document first, using an
 example:

 The document is a book. It has several indexed text fields: Title, Author,
 Distributor. It has two integer columns, where one reflects the number of
 sold copies (num_copies), and the other reflects
 the number of comments on the website (num_comments).

 The Requirement for the relevancy looks like this:

 * Documents which have exact matches in the Author field, should be
 ranked highest, disregarding their values in num_copies and num_comments
 fields
 * After the exact matches, the sorting should be based on the value in the
 field num_copies, but only for documents, where this field is set
 * After the num_copies matches, the sorting should be based on
 num_comments

 I'm wondering is this kind of sort order can be implemented in a single
 query, or if i need to break it down into several queries and merge the
 results on application level.

 -robert





Re: How to get facet counts without fields that are constrained by themselves?

2010-11-24 Thread Geert-Jan Brits
http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
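As a sketch of what that page describes: tag each filter query, then
exclude that tag when faceting on the same field, so the facet counts
for a field ignore that field's own constraint (the field names and
filter below are made-up examples):

```python
# Sketch of the tag/exclude pattern: fq={!tag=...} marks a filter, and
# facet.field={!ex=...} excludes it when counting that field's facets.
fields = ["price", "brand"]
selected = {"price": "price:[0 TO 100]"}  # user's current filters

params = [("q", "*:*"), ("facet", "true")]
for f, flt in selected.items():
    # tag the filter with the field's name:
    params.append(("fq", "{{!tag={0}}}{1}".format(f, flt)))
for f in fields:
    # exclude the filter tagged with this field's own name:
    params.append(("facet.field", "{{!ex={0}}}{1}".format(f, f)))
```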

2010/11/24 Petrov Sergey geoco...@yandex.ua

  I need to retrieve the result of a query and facet counts for all searchable
  document fields. I can't get correct results when facet counts are
  calculated for a field that is in the search query. Facet counts are calculated to
  match the whole query, but for this field I need to get values that are
  constrained by all query params except the query on the current field (so facet
  values must be constrained by all query values except the field
  itself).
  The variant of performing one full query plus as many queries as the
  count of search fields gives me what I need, but I think there must be
  a better way to solve this problem.
  P.S. Sorry for my English.



Re: Is this sort order possible in a single query?

2010-11-24 Thread Geert-Jan Brits
hmm, sorry about that. I haven't used the 'sort by functionquery'-option
myself, but I remembered it existed.
Indeed solr 1.5 was never released (as you've read in the link you pointed
out)

the relevant JIRA issue: https://issues.apache.org/jira/browse/SOLR-1297

There's some recent
activity and a final post suggesting the patch works (presumably under
either 3.1 and/or 4.x).
Neither branch is released at the moment, although 3.1 should be
pretty close (and perhaps stable enough). I'm just not sure.

Your best bet is to start a new thread asking which branch to patch
SOLR-1297 (https://issues.apache.org/jira/browse/SOLR-1297) against, and asking the
subjective 'is it stable enough?'.

Hope that helps some,
Geert-Jan


2010/11/24 Robert Gründler rob...@dubture.com

 thanks a lot for the explanation. I'm a little confused about solr 1.5,
 especially
 after finding this wiki page:

 http://wiki.apache.org/solr/Solr1.5

 Is there a stable build available for version 1.5, so i can test your
 suggestion
 using functionquery?


 -robert



 On Nov 24, 2010, at 1:53 PM, Geert-Jan Brits wrote:

  You could do it with sorting on a functionquery (which is supported from
  solr 1.5)
  http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
  Consider the search:
  http://localhost:8093/solr/select?q=author:'j.k.rowling'
 
  sorting like you specified would involve:
 
  1. introducing an extra field: 'author_exact' of type 'string' which
 takes
  care of the exact matching. (You can populate it by defining it as a
  copyfield of Author so your indexing-code doesn't change)
  2. set sortMissingLast=true for 'num_copies' and 'num_comments'
  like: <fieldType
  name="num_copies" sortMissingLast="true" ...>
 
  this makes sure that documents which don't have the value set end up at
 the
  end of the sort when sorted on that particular field.
 
  3. construct a functionquery that scores either 0 (no match)  or x (not
 sure
  what x is (1?) , but it should always be the same for all exact matches )
 
  This gives
 
 
  http://localhost:8093/solr/select?q=author:'j.k.rowling'&sort=query({!dismax qf=author_exact
  v='j.k.rowling'}) desc
 
  which scores all exact matches before all partial matches.
 
  4. now just concatenate the other sorts giving:
 
 
  http://localhost:8093/solr/select?q=author:'j.k.rowling'&sort=query({!dismax qf=author_exact
  v='j.k.rowling'}) desc, num_copies desc, num_comments desc
 
  That should do it.
 
  Please note that 'num_copies' and 'num_comments' still kick in to break
 the
  tie for documents that exactly match on 'author_exact'. I assume this is
  ok.
 
  I can't see a way to do it without functionqueries at the moment, which
  doesn't mean there isn't any.
 
  Hope that helps,
 
  Geert-Jan
 
 
 
 
 
 
 
 
 
 
 
  2010/11/24 Robert Gründler rob...@dubture.com
 
  Hi,
 
  we have a requirement for one of our search results which has a quite
  complex sorting strategy. Let me explain the document first, using an
  example:
 
  The document is a book. It has several indexed text fields: Title,
 Author,
  Distributor. It has two integer columns, where one reflects the number
 of
  sold copies (num_copies), and the other reflects
  the number of comments on the website (num_comments).
 
  The Requirement for the relevancy looks like this:
 
  * Documents which have exact matches in the Author field, should be
  ranked highest, disregarding their values in num_copies and
 num_comments
  fields
  * After the exact matches, the sorting should be based on the value in
 the
  field num_copies, but only for documents, where this field is set
  * After the num_copies matches, the sorting should be based on
  num_comments
 
  I'm wondering is this kind of sort order can be implemented in a single
  query, or if i need to break it down into several queries and merge the
  results on application level.
 
  -robert
 
 
 




Re: SOLR and secure content

2010-11-23 Thread Geert-Jan Brits
 When making a query these fields should be required. Is it possible to
configure handlers on the solr server so that these fields are required with
each type of query? So for adding documents, deleting and querying?

have a look at 'invariants' (and 'appends') in the example solrconfig.
They can be defined per requesthandler and do exactly what you describe (at
least for the search-side of things)
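For illustration, such a handler could look roughly like this in
solrconfig.xml. This is a sketch under assumptions: the handler name
and the siteHash field are made up for the example, not taken from the
thread:

```xml
<!-- Sketch (solrconfig.xml): a search handler that always adds a
     filter on the website hash, on top of whatever the client sends.
     Handler name and field name (siteHash) are assumptions. -->
<requestHandler name="/site1search" class="solr.SearchHandler">
  <!-- "appends" adds this fq to every request in addition to client
       params; use "invariants" instead for params that clients must
       not be able to override at all. -->
  <lst name="appends">
    <str name="fq">siteHash:abc123</str>
  </lst>
</requestHandler>
```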

Cheers,
Geert-Jan

2010/11/23 Jos Janssen j...@websdesign.nl


 Hi everyone,

 This is how we think we should set it up.

 Situation:
 - Multiple websites indexed on 1 solr server
 - Results should be seperated for each website
 - Search results should be filtered on group access

 Solution i think is possible with solr:
  - Solr server should only be accessed through an API which we will write in
  PHP.
  - Solr server authentication will be defined through IP address on the server
  side, and username and password will be sent through the API for each different
  website.
  - Extra document fields in Solr will contain:
  1. Website hash to identify and filter results for each different website
  (website authentication)
  2. list of groups who can access the document (group authentication)

  When making a query these fields should be required. Is it possible to
  configure handlers on the solr server so that these fields are required
  with
  each type of query? So for adding documents, deleting and querying?

  Am I correct? Any further advice is welcome.

  regards,

 Jos



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SOLR-and-secure-content-tp1945028p1953071.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to Facet on a price range

2010-11-10 Thread Geert-Jan Brits
Ah I see: like you said it's part of the facet range implementation.
Frontend is already working, just need the 'update-on-slide' behavior.

Thanks
Geert-Jan

2010/11/10 gwk g...@eyefi.nl

 On 11/9/2010 7:32 PM, Geert-Jan Brits wrote:

 when you drag the sliders , an update of how many results would match is
 immediately shown. I really like this. How did you do this? IS this
 out-of-the-box available with the suggested Facet_by_range patch?


 Hi,

  With the range facets you get the facet counts for every discrete step of
  the slider. These values are requested in the AJAX request whenever the search
  criteria change, and when someone uses the sliders we simply check the range
  that is selected and add the discrete values of that range to get the
  expected amount of results. So yes, it is available, but as Solr is just the
  search backend, the frontend stuff you'll have to write yourself.
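A minimal sketch of that client-side step (Python; the step size and
bucket counts are made-up example data, not real facet output):

```python
# Sketch: given facet counts for every discrete slider step, sum the
# counts inside the currently selected range to show the expected
# number of results without issuing a new request.
step = 25
counts = {0: 20, 25: 10, 50: 2, 75: 0}  # bucket lower bound -> count

def expected_matches(lo, hi):
    """Sum bucket counts for buckets fully inside [lo, hi]."""
    return sum(c for start, c in counts.items()
               if start >= lo and start + step <= hi)

assert expected_matches(0, 50) == 30  # buckets 0-25 and 25-50
```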

 Regards,

 gwk



Re: Facet showing MORE results than expected when its selected?

2010-11-10 Thread Geert-Jan Brits
Another option: assuming themes_raw is of type 'string' (couldn't get that
nugget of info for 100%), it could be that you're seeing a difference in the nr
of results between the 110 for fq:themes_raw and the 321 from your db because
fieldtype 'string' (thus themes_raw) is case-sensitive while (depending on
your db setup) querying your db is case-insensitive, which could explain the
larger nr of hits for your db as well.

Cheers,
Geert-Jan


2010/11/10 Jonathan Rochkind rochk...@jhu.edu

 I've had that sort of thing happen from 'corrupting' my index, by changing
 my schema.xml without re-indexing.

 If you change field types or other things in schema.xml, you need to
 reindex all your data. (You can add brand new fields or types without having
 to re-index, but most other changes will require a re-index).

 Could that be it?


 PeterKerk wrote:

 LOL, very clever indeed ;)

 The thing is: when I select the amount of records matching the theme
 'Hotel
 en Restaurant' in my db, I end up with 321 records. So that is correct. I
  don't know where the 370 is coming from.

 Now when I change the query to this: fq=themes_raw:Hotel en Restaurant I
 end up with 110 records...(another number even :s)

  What I did notice is that this only happens on multi-word facets, Hotel
  en
  Restaurant being a 3-word facet. The facets work correctly on a facet named
  Cafe, so I suspect it has something to do with the tokenization.

 As you can see, I'm using text and string.
  For completeness I'm posting the definition of those in my schema.xml as well:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>





Re: How to Facet on a price range

2010-11-09 Thread Geert-Jan Brits
Just to add to this: if you want to allow the user more choice in his option
to select ranges, perhaps by using a 2-sided javascript slider for the
price range (à la kayak.com), it may be very worthwhile to discretize the
allowed values for the slider (e.g. steps of 5 dollars). Most js-slider
implementations allow for this easily.

This has the advantages of:
- having far fewer possible facetqueries and thus a far greater chance of
these facetqueries hitting the cache.
- a better user-experience, although that's debatable.

just to be clear: for this the Solr side would still use:
facet=on&facet.query=price:[50
TO *]&facet.query=price:[* TO 100], and not the optimized pre-computed
variant suggested above.
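A sketch of that discretization (Python; the $5 step size is the example
value from above, and the 'price' field name is just illustrative):

```python
# Sketch: snap a 2-sided price slider to $5 steps and emit one
# facet.query per step, so the set of possible facet queries is small
# and cache-friendly.
STEP = 5

def snap(value):
    """Round a slider value to the nearest $5 step."""
    return STEP * round(value / STEP)

def facet_queries(max_price):
    return ["price:[{} TO {}]".format(lo, lo + STEP)
            for lo in range(0, snap(max_price), STEP)]

assert facet_queries(12) == ["price:[0 TO 5]", "price:[5 TO 10]"]
```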

Geert-Jan

2010/11/9 jayant jayan...@hotmail.com


 That was very well thought of and a clever solution. Thanks.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to Facet on a price range

2010-11-09 Thread Geert-Jan Brits
@ http://www.mysecondhome.co.uk/search.html
--
when you drag the sliders , an update of how many results would match is
immediately shown. I really like this. How did you do this? IS this
out-of-the-box available with the suggested Facet_by_range patch?

Thanks,
Geert-Jan

2010/11/9 gwk g...@eyefi.nl

 Hi,

 Instead of all the facet queries, you can also make use of range facets (
 http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range), which
 is in trunk afaik, it should also be patchable into older versions of Solr,
 although that should not be necessary.

 We make use of it (http://www.mysecondhome.co.uk/search.html) to create
 the nice sliders Geert-Jan describes. We've also used it to add the
 sparklines above the sliders which give a nice indication of how the current
 selection is spread out.

 Regards,

 gwk


 On 11/9/2010 3:33 PM, Geert-Jan Brits wrote:

  Just to add to this: if you want to allow the user more choice in his
  option
  to select ranges, perhaps by using a 2-sided javascript slider for the
  price range (à la kayak.com), it may be very worthwhile to discretize the
  allowed values for the slider (e.g. steps of 5 dollars). Most js-slider
  implementations allow for this easily.

 This has the advantages of:
 - having far fewer possible facetqueries and thus a far greater chance of
 these facetqueries hitting the cache.
 - a better user-experience, although that's debatable.

  just to be clear: for this the Solr side would still use:
  facet=on&facet.query=price:[50
  TO *]&facet.query=price:[* TO 100], and not the optimized pre-computed
  variant suggested above.

 Geert-Jan

 2010/11/9 jayantjayan...@hotmail.com

  That was very well thought of and a clever solution. Thanks.
 --
 View this message in context:

 http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: dynamic stop words?

2010-10-09 Thread Geert-Jan Brits
That might work, although depending on your use-case it might be hard to
have a good controlled vocab on citynames (hotel metropole bruxelles, hotel
metropole brussels, hotel metropole brussel, etc.)  Also 'hotel paris
bruxelles' stinks...

given your example:

 Doc 1
 name = Holiday  Inn
 city = Denver

 Doc 2
 name = Holiday Inn,  Denver
 city = Denver

 q=name:(Holiday Inn, Denver)

turning it upside down, perhaps an alternative would be to query on:
q=name:Holiday Inn+city:Denver

and configure field 'name' in such a way that doc1 and doc2 score the same.
I believe that must be possible, just not sure how to configure it exactly at
the moment.

Of course, it depends on your scenario if you have enough knowlegde on the
clientside to transform:
q=name:(Holiday Inn, Denver)  to   q=name:Holiday Inn+city:Denver
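A minimal sketch of such a client-side rewrite (Python; the city list
below stands in for a real controlled vocabulary, and the output query
syntax mirrors the example above):

```python
# Sketch: if a known city name appears in the free-text query, move it
# out of the name clause and into the city clause.
CITIES = {"denver", "brussels", "paris"}

def rewrite(query):
    words = query.replace(",", " ").split()
    cities = [w for w in words if w.lower() in CITIES]
    name = " ".join(w for w in words if w.lower() not in CITIES)
    if cities:
        return "name:({}) AND city:{}".format(name, cities[0])
    return "name:({})".format(name)

assert rewrite("Holiday Inn, Denver") == "name:(Holiday Inn) AND city:Denver"
```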

Hth,
Geert-Jan

2010/10/9 Otis Gospodnetic otis_gospodne...@yahoo.com

 Matt,

 The first thing that came to my mind is that this might be interesting to
 try
 with a dictionary (of city names) if this example is not a made-up one.


 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
  From: Matt Mitchell goodie...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Fri, October 8, 2010 11:22:36 AM
  Subject: dynamic stop words?
 
   Is it possible to have certain query terms not affect score, if that
   same query term is present in a field? For example, I have an index of
   hotels. Each hotel has a name and city. If the name of a hotel has the
   name of the city in its name field, I want to completely ignore
   that and not have it influence score.
 
  Example:
 
  Doc 1
  name = Holiday  Inn
  city = Denver
 
  Doc 2
  name = Holiday Inn,  Denver
  city = Denver
 
  q=name:(Holiday Inn, Denver)
 
  I'd  like those docs to have the same score in the response. I don't
  want Doc2 to  have a higher score, just because it has all of the query
  terms.
 
  Is  this possible without using stop words? I hope this makes  sense!
 
  Thanks,
  Matt
 



Re: Is there a way to fetch the complete list of data from a particular column in SOLR document?

2010-09-09 Thread Geert-Jan Brits
You're right for the general case. I should have added that our setup is
perhaps a little bit out of the ordinary in that we send explicit commits to
solr as part of our indexing app.
Once a commit has finished we're sure all docs until then are present in
solr. For us it's much more difficult to do it the way you suggested because we
index into several embedded solr shards, etc. It can be done, it's just not
convenient. But for the general case I admit querying all ids as a
post-process is probably the more elegant and robust way.

2010/9/9 Scott K s...@skister.com

 But how do you know when the document actually makes it to solr,
 especially if you are using commitWithin and not explicitly calling
 commit.

 One solution is to have a status field in the database such as
 0 - unindexed
 1 - indexing
 2 - committed / verified

 And have a separate process query solr for documents in the indexing
 state and set them to committed if they are queryable in solr.

 On Tue, Sep 7, 2010 at 14:26, Geert-Jan Brits gbr...@gmail.com wrote:
 Please let me know if there are any other ideas / suggestions to
 implement
  this.
 
   Your indexing program should really take care of this IMHO. Each time
 your
  indexer inserts a document to Solr, flag the corresponding entity in your
  RDBMS, each time you delete, remove the flag. You should implement this
 as a
  transaction to make sure all is still fine in the unlikely event of a
 crash
  midway.
 
  2010/9/7 bbarani bbar...@gmail.com
 
 
  Hi,
 
  I am trying to get complete list of unique document ID and compare it
 with
  that of back end to make sure that both back end and SOLR documents are
 in
  sync.
 
  Is there a way to fetch the complete list of data from a particular
 column
  in SOLR document?
 
  Once I get the list, I can easily compare it against the DB and delete
 the
  orphan documents..
 
  Please let me know if there are any other ideas / suggestions to
 implement
  this.
 
  Thanks,
  Barani
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Is-there-a-way-to-fetch-the-complete-list-of-data-from-a-particular-column-in-SOLR-document-tp1435586p1435586.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 



Re: Is there a way to fetch the complete list of data from a particular column in SOLR document?

2010-09-07 Thread Geert-Jan Brits
Please let me know if there are any other ideas / suggestions to implement
this.

Your indexing program should really take care of this IMHO. Each time your
indexer inserts a document to Solr, flag the corresponding entity in your
RDBMS, each time you delete, remove the flag. You should implement this as a
transaction to make sure all is still fine in the unlikely event of a crash
midway.
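A sketch of that bookkeeping (Python, with SQLite as a stand-in RDBMS;
send_to_solr is a hypothetical placeholder for the real indexing call —
the point is only that the flag update and the index call succeed or
fail together):

```python
# Sketch: flag each row when it is sent to Solr, inside a transaction,
# so a crash midway leaves the flag unset and the doc gets retried.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, indexed INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO docs (id) VALUES (?)", [(1,), (2,)])

def send_to_solr(doc_id):
    pass  # placeholder: POST the document to Solr here

def index_doc(doc_id):
    try:
        send_to_solr(doc_id)
        conn.execute("UPDATE docs SET indexed = 1 WHERE id = ?", (doc_id,))
        conn.commit()
    except Exception:
        conn.rollback()  # crash midway: flag stays 0, doc is retried later
        raise

index_doc(1)
unindexed = [r[0] for r in conn.execute("SELECT id FROM docs WHERE indexed = 0")]
```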

2010/9/7 bbarani bbar...@gmail.com


 Hi,

 I am trying to get complete list of unique document ID and compare it with
 that of back end to make sure that both back end and SOLR documents are in
 sync.

 Is there a way to fetch the complete list of data from a particular column
 in SOLR document?

 Once I get the list, I can easily compare it against the DB and delete the
 orphan documents..

 Please let me know if there are any other ideas / suggestions to implement
 this.

 Thanks,
 Barani
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Is-there-a-way-to-fetch-the-complete-list-of-data-from-a-particular-column-in-SOLR-document-tp1435586p1435586.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: High - Low field value?

2010-09-01 Thread Geert-Jan Brits
StatsComponent is exactly what you're looking for.

http://wiki.apache.org/solr/StatsComponent
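A sketch of requesting and reading those stats (Python; the response
fragment below is hand-made example data in the shape documented on
that wiki page, not live output):

```python
# Sketch: ask the StatsComponent for min/max of the price field within
# the current filters, then read them from the JSON response.
from urllib.parse import urlencode

params = urlencode({"q": "type:house", "rows": 0,
                    "stats": "true", "stats.field": "price"})

# Example response fragment (shape per the StatsComponent wiki page):
response = {"stats": {"stats_fields": {"price": {"min": 150000.0,
                                                 "max": 900000.0}}}}
price_stats = response["stats"]["stats_fields"]["price"]
lo, hi = price_stats["min"], price_stats["max"]  # range boundaries
```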

Cheers,
Geert-Jan

2010/9/1 kenf_nc ken.fos...@realestate.com


 I want to do range facets on a couple fields, a Price field in particular.
 But Price is relative to the product type. Books, Automobiles and Houses
 are
 vastly different price ranges, and withing Houses there may be a regional
 difference (price range in San Francisco is different than Columbus, OH for
 example).

 If I do Filter Query on type, so I'm not mixing books with houses, is there
 a quick way in a query to get the High and Low value for a given field? I
 would need those to build my range boundaries more efficiently.

 Ideally it would be a function of the query, so regionality could be taken
 into account. It's not a search score, or a facet, it's more a function. I
 know query functions exist, but haven't had to use them yet and the 'max'
 function doesn't look like what I need.  Any suggestions?
 Thanks.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/High-Low-field-value-tp1402568p1402568.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: questions about synonyms

2010-08-31 Thread Geert-Jan Brits
concerning:
 . I got a very big text file of synonyms. How I can use it? Do I need to
index this text file first?

have you seen
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter ?

Cheers,
Geert-Jan

2010/8/31 Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov

 Hello,



 I have an couple of questions about synonyms.



 1. I got a very big text file of synonyms. How I can use it? Do I need to
 index this text file first?



 2. Is there a way to do synonyms' highlight in search result?



 3. Does anyone use WordNet to solr?





 Thanks so much in advance,




Re: solr working...

2010-08-26 Thread Geert-Jan Brits
Check out Drew Farris' explanation of remote debugging Solr with Eclipse,
posted a couple of days ago:
http://lucene.472066.n3.nabble.com/How-to-Debug-Sol-Code-in-Eclipse-td1262050.html
Geert-Jan

2010/8/26 Michael Griffiths mgriffi...@am-ind.com

 Take a look at the code? It _is_ open source. Open it up in Eclipse and
 debug it.

 -Original Message-
 From: satya swaroop [mailto:sswaro...@gmail.com]
 Sent: Thursday, August 26, 2010 8:24 AM
 To: solr-user@lucene.apache.org
 Subject: Re: solr working...

 Hi Peter,
    I am already working with Solr and it is working well. But I want
 to understand the code and know where the actual work is going on: how
 indexing is done, how the requests are parsed, and how responses are
 produced. To understand the code, I asked how to start.

 Regards,
 satya



Re: Solr search speed very low

2010-08-25 Thread Geert-Jan Brits
have a look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters to
see how that works.
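If leading/trailing wildcards stay too slow, one common alternative (a sketch, not something tested against this schema; the type name and gram sizes are assumptions) is to index substrings with NGramFilterFactory, so plain term queries match inside words — at the cost of a much bigger index:

```xml
<fieldType name="text_substring" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index all 3..15 character substrings of each token -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```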

2010/8/25 Marco Martinez mmarti...@paradigmatecnologico.com

 You should use the tokenizer solr.WhitespaceTokenizerFactory in your field
 type to get your terms indexed. Once you have indexed the data, you don't
 need to use the * in your queries; that is a heavy query for Solr.

 Marco Martínez Bautista
 http://www.paradigmatecnologico.com
 Avenida de Europa, 26. Ática 5. 3ª Planta
 28224 Pozuelo de Alarcón
 Tel.: 91 352 59 42


 2010/8/25 Andrey Sapegin andrey.sape...@unister-gmbh.de

  Dear ladies and gentlemen.
 
  I'm a newbie with Solr and I didn't find an answer in the wiki, so I'm
  writing here.
 
  I'm analysing Solr performance and have 1 problem. *Search time is about
  7-10 seconds per query.*
 
  I have a *.csv 5Gb-database with about 15 fields and 1 key field (record
  number). I uploaded it to Solr without any problem using curl. This
 database
  contains information about books and I'm interested in keyword search
 using
  one of the fields (not a key field). I mean that if I search, for
 example,
  for the word Hello, I expect a response with sentences containing Hello:
  Hello all
  Hello World
  I say Hello to all
  etc.
 
  I tested it from console using time command and curl:
 
  /usr/bin/time -o test_results/time_solr -a curl 
 
 http://localhost:8983/solr/select/?q=itemname:*$query*&version=2.2&start=0&rows=10&indent=on
 
  -6 21  test_results/response_solr
 
  So, my query is *itemname:*$query**. 'Itemname' is the name of the field.
  $query is a bash variable containing only 1 word. All works fine.
  *But unfortunately, search time is about 7-10 seconds per query.* For
  example, Sphinx spent only about 0.3 second per query.
  If I use only $query, without stars (*), I receive answer pretty fast,
 but
  only exact matches.
  And I want to see any sentence containing my $query in the response.
 Thats
  why I'm using stars.
 
  NOW THE QUESTION.
  Is my query syntax correct (*field:*word**) for keyword search? Why is the
  response time so big? Can I reduce the search time?
 
  Thank You in advance,
  Kind Regards,
 
  Andrey Sapegin,
  Software Developer,
 
  Unister GmbH
  Barfußgässchen 11 | 04109 Leipzig
 
  andrey.sape...@unister-gmbh.de mailto:%20andreas.b...@unister-gmbh.de
  www.unister.de http://www.unister.de
 
 



Re: How to Debug Sol-Code in Eclipse ?!

2010-08-22 Thread Geert-Jan Brits
1. Download the Solr libs and import them into your project.
2. Download the Solr source code of the same version and attach it to the
libraries. (I haven't got Eclipse open, but it is something like project -
settings - JRE/libraries?)
3. Write a small program yourself which calls EmbeddedSolrServer and
step through/debug the source code from there. It works just like it is your
own source code.

HTH,
Geert-Jan

2010/8/22 stockii st...@shopgate.com


 thx for your reply.

 i don't want to test my own classes in unit tests. i'm trying to understand
 how solr works, because i'm writing a little text about solr and lucene. so
 i want to go through the code, step by step, and find out in which places
 solr is using lucene.

 when i can debug the code it's easier ;-)
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-Debug-Sol-Code-in-Eclipse-tp1262050p1274285.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: how to support implicit trailing wildcards

2010-08-10 Thread Geert-Jan Brits
you could satisfy this by making 2 fields:
1. exactmatch
2. wildcardmatch

use copyField in your schema to copy 1 -> 2.

q=exactmatch:mount+wildcardmatch:mount*&q.op=OR
this would score exact matches above (solely) wildcard matches
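a schema sketch of that setup (the field and type names here are assumptions):

```xml
<field name="exactmatch" type="text" indexed="true" stored="false"/>
<field name="wildcardmatch" type="text" indexed="true" stored="false"/>
<!-- everything sent to exactmatch is also indexed into wildcardmatch -->
<copyField source="exactmatch" dest="wildcardmatch"/>
```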

Geert-Jan

2010/8/10 yandong yao yydz...@gmail.com

 Hi Bastian,

 Sorry for not making it clear: I also want exact matches to have a higher
 score than wildcard matches. That means: if searching for 'mount', documents
 with 'mount' will have a higher score than documents with 'mountain', while
 'mount*' seems to treat 'mount' and 'mountain' as the same.

 besides, I also want the query to be processed with the analyzer, while from

 http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
 ,
 Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer.
 The
 rationale is that if I search 'mounted', I also want documents with 'mount'
 to match.

 So seems built-in wildcard search could not satisfy my requirements if i
 understand correctly.

 Thanks very much!


 2010/8/9 Bastian Spitzer bspit...@magix.net

  Wildcard-Search is already built in, just use:
 
  ?q=umoun*
  ?q=mounta*
 
  -Ursprüngliche Nachricht-
  Von: yandong yao [mailto:yydz...@gmail.com]
  Gesendet: Montag, 9. August 2010 15:57
  An: solr-user@lucene.apache.org
  Betreff: how to support implicit trailing wildcards
 
  Hi everyone,
 
 
  How to support 'implicit trailing wildcard *' using Solr, eg: using
 Google
  to search 'umoun', 'umount' will be matched , search 'mounta', 'mountain'
  will be matched.
 
  From my point of view, there are several ways, both with disadvantages:
 
  1) Using EdgeNGramFilterFactory, thus 'umount' will be indexed with 'u',
  'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the
 index
  size increases dramatically, b) will matches even has no relationship,
 such
  as such 'mount' will match 'mountain' also.
 
  2) Using two pass searching: first pass searches term dictionary through
  TermsComponent using given keyword, then using the first matched term
 from
  term dictionary to search again. eg: when user enter 'umoun',
 TermsComponent
  will match 'umount', then use 'umount' to search. The disadvantage are:
 a)
  need to parse query string so that could recognize meta keywords such as
  'AND', 'OR', '+', '-', '' (this makes more complex as I am using PHP
  client), b) The returned hit counts is not for original search string,
 thus
  will influence other components such as auto-suggest component based on
 user
  search history and hit counts.
 
  3) Write custom SearchComponent, while have no idea where/how to start
  with.
 
  Is there any other way in Solr to do this, any feedback/suggestion are
  welcome!
 
  Thanks very much in advance!
 



Re: How do i update some document when i use sharding indexs?

2010-08-09 Thread Geert-Jan Brits
I'm not sure if Solr has some built-in support for sharding functions, but
you should generally use some hashing algorithm to split the indices and use
the same hash algorithm to locate which shard contains a document.
http://en.wikipedia.org/wiki/Hash_function

Without employing any domain knowledge (of documents you possibly want to
group together on a single shard for performance) you could build a very
simple (crude) hash function by md5-hashing the unique keys of your
documents, taking the first 3 chars (should be precise enough, so load is
pretty much balanced), calculating a nr from the chars (256 * first char +
16 * 2nd char + 3rd char), and taking that nr modulo 20. That should give
you a nr in [0,20) which is the shard index.

use the same algorithm to determine which shard contains the document that
you want to change.
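A minimal sketch of such a crude hash function (Python here just for illustration; the shard count and key format are assumptions):

```python
import hashlib

def shard_for(unique_key: str, num_shards: int = 20) -> int:
    # md5 the unique key, take the first 3 hex chars (which is exactly
    # 256 * c1 + 16 * c2 + c3), and reduce modulo the number of shards.
    prefix = hashlib.md5(unique_key.encode("utf-8")).hexdigest()[:3]
    return int(prefix, 16) % num_shards

# The same function must be used both when splitting the index and when
# locating the shard that holds a document you want to update.
print(shard_for("607136"))
```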

Geert-Jan


2010/8/9 lu.rongbin lu.rong...@goodhope.net


My index has 76 million documents. I split it into 20 indexes because the
 size of the index is 33G. I deploy 20 shards for search response performance
 on ec2's 20 instances. But when I want to update some doc, it means I must
 traverse each index to find which shard index the document is in, and then
 update the doc. It's crazy! How can I do this?
thanks.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-do-i-update-some-document-when-i-use-sharding-indexs-tp1053509p1053509.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How do i update some document when i use sharding indexs?

2010-08-09 Thread Geert-Jan Brits
Just to be completely clear: the program that splits your index in 20 shards
should employ this algo as well.


2010/8/9 Geert-Jan Brits gbr...@gmail.com

 I'm not sure if Solr has some build-in support for sharding-functions, but
 you should generally use some hashing-algorithm to split the indices and use
 the same hash-algorithm to locate which shard contains a document.
 http://en.wikipedia.org/wiki/Hash_function

 Without employing any domain knowledge (of documents you possible want to
 group toegether on a single shard for performance) you could build a very
 simple (crude) hash-function by md5-hashing the unique-keys of your
 documents, taking the first 3 chars (should be precise enough, so load is
 pretty much balanced), calculate a nr from the chars (256 * first char + 16
 * 2nd char + 3rd char), and take that nr modulo 20. That should give you a
 nr in [0,20) which is the shard-index.

 use the same algorithm to determine which shard contains the document that
 you want to change.

 Geert-Jan


 2010/8/9 lu.rongbin lu.rong...@goodhope.net


My index has 76 million documents, I split it to 20 indexs because the
 size of index is 33G. I deploy 20 shards for search response performence
 on
 ec2's 20 instances.But when i wan't to update some doc, it means i must
 traversal each index , and find the document is in which shard index, and
 update the doc? It's crazy! How can i do?
thanks.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-do-i-update-some-document-when-i-use-sharding-indexs-tp1053509p1053509.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: XML Format

2010-08-06 Thread Geert-Jan Brits
at first glance I see no difference between the 2 documents.
Perhaps you can illustrate which fields are not in the resultset that you
want to be there?

also use the 'fl' param to describe which fields should be output in your
results.
Of course, you have to first make sure the fields you want outputted are
stored to begin with.

http://wiki.apache.org/solr/CommonQueryParameters#fl

2010/8/6 twojah e...@tokobagus.com


 can somebody help me please
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/XML-Format-tp1024608p1028456.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: how to take a value from the query result

2010-08-05 Thread Geert-Jan Brits
you should parse the XML and extract the value. Lots of libraries
undoubtedly exist for PHP to help you with that (I don't know PHP).

Moreover, if all you want from the result is AUC_CAT you should consider
using the fl param like:
http://172.16.17.126:8983/search/select/?q=AUC_ID:607136&fl=AUC_CAT

to return a document of the form:

<doc>
  <int name="AUC_CAT">576</int>
</doc>

which is more efficient.
Still you have to parse the doc as XML though.
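A sketch of that parsing step (shown in Python rather than PHP, since I don't know PHP; the trimmed response below mirrors the one in the question):

```python
import xml.etree.ElementTree as ET

# A trimmed Solr XML response, as returned for q=AUC_ID:607136&fl=AUC_CAT.
response = """<response>
  <result name="response" numFound="1" start="0">
    <doc><int name="AUC_CAT">576</int></doc>
  </result>
</response>"""

root = ET.fromstring(response)
# find the <int> element whose name attribute is AUC_CAT
auc_cat = int(root.find(".//int[@name='AUC_CAT']").text)
print(auc_cat)  # 576
```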




2010/8/5 twojah e...@tokobagus.com


 this is my query in browser navigation toolbar
 http://172.16.17.126:8983/search/select/?q=AUC_ID:607136

 and this is the result in browser page:
 ...
 <doc>
 <int name="AP_AUC_PHOTO_AVAIL">1</int>
 <double name="AUC_AD_PRICE">1.0</double>
 <int name="AUC_CAT">576</int>
 <int name="AUC_CLIENT_ID">27017</int>
 <str name="AUC_DESCR_SHORT">Bracket Ceiling untuk semua merk projector,
 panjang 60-90 cm  Bahan Besi Cat Hitam = 325rb Bahan Sta</str>
 <str name="AUC_HTML_DIR_NL">/aksesoris-batere-dan-tripod/update-bracket-projector-dan-lcd-plasma-tv-607136.html</str>
 <int name="AUC_ID">607136</int>
 <str name="AUC_ISNEGO">Nego</str>
 <int name="AUC_LOCATION">7</int>
 <str name="AUC_PHOTO">270/27017/bracket_lcd_plasma_3a-1274291780.JPG</str>
 <str name="AUC_START">2010-05-19 17:56:45</str>
 <str name="AUC_TITLE">[UPDATE] BRACKET Projector dan LCD/PLASMA TV</str>
 <int name="AUC_TYPE">21</int>
 <int name="PRO_BACKGROUND">0</int>
 <int name="PRO_BOLD">0</int>
 <int name="PRO_COLOR">0</int>
 <int name="PRO_GALLERY">0</int>
 <int name="PRO_LINK">0</int>
 <int name="PRO_SPONSOR">0</int>
 <int name="cat_id_sub">0</int>
 <int name="sectioncode">28</int>
 </doc>

 I want to get the AUC_CAT value (576) and use it in my PHP; how can I get
 that value?
 please help
 thanks before
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-take-a-value-from-the-query-result-tp1025119p1025119.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: No group by? looking for an alternative.

2010-08-05 Thread Geert-Jan Brits
If I understand correctly:
1. products have different product variants ( in case of shoes a combination
of color and size + some other fields).
2. Each product is shown once in the result set. (so no multiple product
variants of the same product are shown)

This would solve that IMO:

1. create 1 document per product (so not a document per product-variant)
2. create a multivalued field on which to facet, containing all combinations
of: size - color - any other field - yet another field
3. make sure to include combinations in which the user is indifferent to a
particular filter, i.e.: don't care about size (dc) + red -> dc-red
4. filtering on that combination would give you all the products that
satisfy the product-variant constraints (size, color, etc.) + the extra
product constraints ('converse')
5. on the detail page show all available product-variants not filtered by
the constraints specified. This would likely be something outside of Solr (a
simple SQL select on a single product)
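A sketch of steps 2 and 3, generating the multivalued combination tokens (the sizes, colors and the 'dc' don't-care marker are made-up values for illustration):

```python
from itertools import product

# Hypothetical variant values for one product; "dc" marks "don't care".
sizes = ["dc", "40", "41", "42"]
colors = ["dc", "red", "black"]

# Every size/color combination becomes one token in a multivalued field,
# so fq=variant:42-red (or fq=variant:dc-red) selects matching products.
variant_tokens = ["-".join(combo) for combo in product(sizes, colors)]
print(variant_tokens)
```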

hope that helps,
Geert-Jan

2010/8/5 Mickael Magniez mickaelmagn...@gmail.com


 I've got only one document per shoe, whatever its size or color.

 My first try was to create one document per model/size/color, but when I
 search for 'converse' for example, the same shoe is retrieved several
 times, and I want to show only one record for each model. But I don't
 succeed in grouping results by shoe model.

 If you look at

 http://www.amazon.com/s/ref=nb_sb_noss?url=node%3D679255011field-keywords=Converse+All+Star+Leather+Hi+Chuck+Taylor+x=0y=0ih=1_0_0_0_0_0_0_0_0_0.4136_1fsc=-1
 amazon for Converse All Star Leather Hi Chuck Taylor  .
 They show the shoe only one time, but if you go to the product details, it
 exists in several colors and sizes. Now if you filter on color, there are
 fewer sizes available.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/No-group-by-looking-for-an-alternative-tp1022738p1026618.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Best solution to avoiding multiple query requests

2010-08-04 Thread Geert-Jan Brits
Field Collapsing (currently as patch) is exactly what you're looking for
imo.

http://wiki.apache.org/solr/FieldCollapsing

Geert-Jan


2010/8/4 Ken Krugler kkrugler_li...@transpac.com

 Hi all,

 I've got a situation where the key result from an initial search request
 (let's say for dog) is the list of values from a faceted field, sorted by
 hit count.

 For the top 10 of these faceted field values, I need to get the top hit for
 the target request (dog) restricted to that value for the faceted field.

 Currently this is 11 total requests, of which the 10 requests following the
 initial query can be made in parallel. But that's still a lot of requests.

 So my questions are:

 1. Is there any magic query to handle this with Solr as-is?

 2. if not, is the best solution to create my own request handler?

 3. And in that case, any input/tips on developing this type of custom
 request handler?

 Thanks,

 -- Ken


 
 Ken Krugler
 +1 530-210-6378
 http://bixolabs.com
 e l a s t i c   w e b   m i n i n g







Re: Best solution to avoiding multiple query requests

2010-08-04 Thread Geert-Jan Brits
If I understand correctly: you want to sort your collapsed results by 'nr of
collapsed results'/hits.

It seems this can't be done out-of-the-box using this patch. (I'm not
entirely sure; at least it doesn't follow from the wiki page. Perhaps it's
best to check the JIRA issues to make sure this isn't already available now
but just not updated on the wiki.)

Also I found a blogpost (from the patch creator afaik) with in the comments
someone with the same issue + some pointers.
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/
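If the facet counts are fetched first, the re-ordering itself is trivial client-side; a sketch with made-up data following Ken's example (facet values A-D, one top hit per value):

```python
# facet_counts: overall hit count per faceted field value;
# top_hit_by_value: the top hit for "dog" restricted to each value.
facet_counts = {"C": 10, "D": 8, "A": 2, "B": 1}
top_hit_by_value = {"A": "dog training", "B": "super dog",
                    "C": "dog walking", "D": "hot dog"}

# order values by their overall hit count, then pair with their top hit
ranked_values = sorted(facet_counts, key=facet_counts.get, reverse=True)
results = [(v, top_hit_by_value[v]) for v in ranked_values[:2]]
print(results)  # [('C', 'dog walking'), ('D', 'hot dog')]
```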

hope that helps,
Geert-jan

2010/8/4 Ken Krugler kkrugler_li...@transpac.com

 Hi Geert-Jan,


 On Aug 4, 2010, at 5:30am, Geert-Jan Brits wrote:

  Field Collapsing (currently as patch) is exactly what you're looking for
 imo.

 http://wiki.apache.org/solr/FieldCollapsing


 Thanks for the ref, good stuff.

 I think it's close, but if I understand this correctly, then I could get
 (using just top two, versus top 10 for simplicity) results that looked like

 dog training (faceted field value A)
 super dog (faceted field value B)

 but if the actual faceted field value/hit counts were:

 C (10)
 D (8)
 A (2)
 B (1)

 Then what I'd want is the top hit for dog AND facet field:C, followed by
 dog AND facet field:D.

 Using field collapsing would improve the probability that if I asked for the
 top 100 hits, I'd find entries for each of my top N faceted field values.

 Thanks again,

 -- Ken


  I've got a situation where the key result from an initial search request
 (let's say for dog) is the list of values from a faceted field, sorted
 by
 hit count.

 For the top 10 of these faceted field values, I need to get the top hit
 for
 the target request (dog) restricted to that value for the faceted
 field.

 Currently this is 11 total requests, of which the 10 requests following
 the
 initial query can be made in parallel. But that's still a lot of
 requests.

 So my questions are:

 1. Is there any magic query to handle this with Solr as-is?

 2. if not, is the best solution to create my own request handler?

 3. And in that case, any input/tips on developing this type of custom
 request handler?

 Thanks,

 -- Ken


 
 Ken Krugler
 +1 530-210-6378
 http://bixolabs.com
 e l a s t i c   w e b   m i n i n g







Re: Quering the database

2010-08-03 Thread Geert-Jan Brits
No. Solr is really flexible and allows for a lot of complex querying
out-of-the-box.
Really the Wiki is your best friend here.

http://wiki.apache.org/solr/
perhaps start with:
1. http://lucene.apache.org/solr/tutorial.html
2. http://wiki.apache.org/solr/SolrQuerySyntax
3. http://wiki.apache.org/solr/QueryParametersIndex (list of some standard
parameters with link to their function/use)
-- especially look at the 'fq' param, which is another way to limit your
result set.

and just browse the wiki starting from the homepage for the rest. It should
pretty quickly give you an overview of what's possible.

cheers,
Geert-Jan



2010/8/3 Hando420 hando...@gmail.com


 Thanks a lot to all, now it's clear the problem was in the schema. One more
 thing I would like to know: if the user queries for something, does it have
 to always be like q=field:monitor, where field is defined in the schema and
 monitor is just text in a column?

 Hando
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Quering-the-database-tp1015636p1018268.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Quering the database

2010-08-02 Thread Geert-Jan Brits
you should (as per the example) define the field as text in your solr-schema
not in your RDB.
something like: <field name="field_1" type="text" indexed="true"
stored="true" required="true"/>

then search like: q=field_1:monitors

the example schema illustrates a lot of the possibilities on how you to
define fields and what is all means.
Moreover have a look at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Geert-Jan

2010/8/2 Hando420 hando...@gmail.com


 Thank you for your reply. Still the problem persists, even when I tested
 with a simple example by defining a column of type text as varchar in the
 database and using the default id in schema.xml, which is set to string.
 The row is fetched and the document created, but searching doesn't give
 any results for the content in the column.

 Best Regards,
 Hando
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Quering-the-database-tp1015636p1015890.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-29 Thread Geert-Jan Brits
I can interpret your question in 2 different ways:
1. Do you want to index several heterogeneous documents all coming from
different tables? So documents of type tableA are created and indexed
alongside documents of type tableB, tableC, etc.
2. Do you want to combine unrelated data from 15 tables to form some kind of
logical solr-document as your basis for indexing?

I assume you mean nr 1.
This can be done, and is done quite regularly. And you're right that this
creates a lot of empty slots for fields that only exist for documents
created from tableA and not tableB, etc. This in itself is not a problem. In
this case I would advise you to create an extra field: 'type' (per the above
example with values: (table)A, (table)B, etc. ) So you can distinguish the
different types of documents that you have created (and filter on them) .

If you meant nr 2, which I believe you didn't: it's logically impossible to
create/imagine a logical solr-document composed of combining unrelated
data. You should really think about what you're trying to achieve (what is
it that I want to index, what do I expect to do with it, etc. )  If you did
mean this, please show an example of what you want to achieve.

HTH,
Geert-Jan


2010/7/29 S Ahmed sahmed1...@gmail.com

 I understand (and its straightforward) when you want to create a index for
 something simple like Products.

 But how do you go about creating a Solr index when you have data coming
 from
 10-15 database tables, and the tables have unrelated data?

 The issue is then you would have many 'columns' in your index, and they
 will
 be NULL for much of the data since you are trying to shove 15 db tables
 into
 a single Solr/Lucene index.


 This must be a common problem, what are the potential solutions?



Re: 2 type of docs in same schema?

2010-07-26 Thread Geert-Jan Brits
You can easily have different types of documents in 1 core:

1. define searchquery as a field (just as the others in your schema)
2. define type as a field (this allows you to decide which type of documents
to search for, e.g: type_normal or type_search)

now searching on regular docs becomes:
q=title:some+title&fq=type:type_normal

and searching for searchqueries becomes (I think this is what you want):
q=searchquery:bmw+car&fq=type:type_search

Geert-Jan

2010/7/26 scr...@asia.com




  I need you expertise on this one...

 We would like to index every search query that is passed in our solr engine
 (same core)

 Our docs format are like this (already in our schema):
 title
 content
 price
 category
 etc...

 Now how to add search queries as a field in our schema? Know that the
 search queries won't have all the field above?
 For example:
 q=bmw car
 q=car wheels
 q=moto honda
 etc...

 Should we run an other core that only index search queries? or is there a
 way to do this with same instance and same core?

 Thanks for your help





Re: 2 type of docs in same schema?

2010-07-26 Thread Geert-Jan Brits
I still assume that what you mean by search queries data is just some
other form of document (in this case containing 1 search request per
document).
I'm not sure what you intend to do with that actually, but yes, indexing
stays the same (you probably want to mark the 'type' field as required so
you don't forget to include it in your indexing program).

2010/7/26 scr...@asia.com


  Thanks for your answer! That's great.

 Now, to index search queries data, is there something special to do? Or
 does it stay as usual?








 -Original Message-
 From: Geert-Jan Brits gbr...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Mon, Jul 26, 2010 4:57 pm
 Subject: Re: 2 type of docs in same schema?


 You can easily have different types of documents in 1 core:

 1. define searchquery as a field (just as the others in your schema)
 2. define type as a field (this allows you to decide which type of
 documents
 to search for, e.g: type_normal or type_search)

 now searching on regular docs becomes:
 q=title:some+title&fq=type:type_normal

 and searching for searchqueries becomes (I think this is what you want):
 q=searchquery:bmw+car&fq=type:type_search

 Geert-Jan

 2010/7/26 scr...@asia.com

 
 
 
   I need you expertise on this one...
 
  We would like to index every search query that is passed in our solr
 engine
  (same core)
 
  Our docs format are like this (already in our schema):
  title
  content
  price
  category
  etc...
 
  Now how to add search queries as a field in our schema? Know that the
  search queries won't have all the field above?
  For example:
  q=bmw car
  q=car wheels
  q=moto honda
  etc...
 
  Should we run an other core that only index search queries? or is there a
  way to do this with same instance and same core?
 
  Thanks for your help
 
 
 





Re: Which is a good XPath generator?

2010-07-25 Thread Geert-Jan Brits
I am assuming (like Li, I think) that you want to induce a structure/schema
from an html-example so you can use that schema to extract data from
similarly html-structured pages.

Another term often used in literature for that is Wrapper Induction.
Besides the DOM, CSS classes often give good distinction and they are often
more stable under small redesigns.

Besides Li's suggestions, have a look at this thread for an open source
python implementation (I have never tested it):
http://www.holovaty.com/writing/templatemaker/
also make sure to read all the comments for links to other products, etc.

HTH,
Geert-Jan



2010/7/25 Li Li fancye...@gmail.com

 it's not a related topic in solr. maybe you should read some papers
 about wrapper generation or automatic web data extraction. If you
 want to generate xpath, you could possibly read liubing's papers such
 as Structured Data Extraction from the Web based on Partial Tree
 Alignment. Besides dom tree, visual clues also may be used. But none
 of them will be perfect solution because of the diversity of web
 pages.

 2010/7/25 Savannah Beckett savannah_becket...@yahoo.com:
  Hi,
I am looking for a XPath generator that can generate xpath by picking a
  specific tag inside a html.  Do you know a good xpath generator?  If
 possible,
  free xpath generator would be great.
  Thanks.
 
 
 



Re: Tree Faceting in Solr 1.4

2010-07-24 Thread Geert-Jan Brits
Perhaps completely unnecessary when you have a controlled domain, but I
meant to use ids for places instead of names, because names will quickly
become ambiguous, e.g.: there are numerous different places over the world
called washington, etc.

2010/7/24 SR r.steve@gmail.com

 Hi Geert-Jan,

 What did you mean by this:

  Also, just a suggestion, consider using id's instead of names for
 filtering;

 Thanks,
 -S


Re: Tree Faceting in Solr 1.4

2010-07-24 Thread Geert-Jan Brits
I believe we use an in-process weakhashmap to store the id-name
relationship. It's not that we're talking billions of values here.
For anything more mem-intensive we use no-sql (tokyo tyrant through
memcached protocol at the moment)

2010/7/24 Jonathan Rochkind rochk...@jhu.edu

  Perhaps completely unnecessary when you have a controlled domain, but I
  meant to use ids for places instead of names, because names will quickly
  become ambiguous, e.g.: there are numerous different places over the
 world
  called washington, etc.

 This is related to something I've been thinking about. Okay, say you use
 ID's instead of names. Now, you've got to translate those ID's to names
 before you display them, of course.

 One way to do that would be to keep the id-to-name lookup in some non-solr
 store (rdbms, or non-sql store)

 Is that what you'd do? Is there any non-crazy way to do that without an
 external store, just with solr?  Any way to do it with term payloads?
 Anything else?

 Jonathan


Re: Tree Faceting in Solr 1.4

2010-07-23 Thread Geert-Jan Brits
If I am doing
facet=on  facet.field={!ex=State}State  fq={!tag=State}State:Karnataka

All it gives me is Facets on state excluding only that filter query.. But i
was not able to do same on third level ..Like  facet.field= Give me the
counts of  cities also in state Karantaka..
Let me know solution for this...

This looks like regular faceting to me.

1. Showing citycounts given state
facet=on&fq=State:Karnataka&facet.field=city

2. showing statecounts given country (similar to 1)
facet=on&fq=Country:India&facet.field=state

3. showing city and state counts given country:
facet=on&fq=Country:India&facet.field=state&facet.field=city

4. showing city counts given state + all other states not filtered by
current state (
http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
)
facet=on&fq={!tag=State}state:Karnataka&facet.field={!ex=State}state&facet.field=city

5. showing state + city counts given country + all other countries not
filtered by current country (similar to 4)
facet=on&fq={!tag=country}country:India&facet.field={!ex=country}country&facet.field=city&facet.field=state

etc.
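Since the local-params prefixes contain characters like '{', '!' and '=', they need URL-encoding when sent over HTTP; a sketch of building such a query string (field names from the examples above):

```python
from urllib.parse import urlencode

# Repeated ("facet.field", ...) tuples yield repeated facet.field parameters;
# urlencode percent-encodes the {!tag=...}/{!ex=...} local-params syntax.
params = [
    ("q", "*:*"),
    ("facet", "on"),
    ("fq", "{!tag=State}State:Karnataka"),
    ("facet.field", "{!ex=State}State"),
    ("facet.field", "City"),
]
query_string = urlencode(params)
print(query_string)
```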

This has nothing to do with hierarchical faceting as described in SOLR-792,
btw, although I understand the possible confusion, as country > state > city
can obviously be seen as some sort of hierarchy. The first part of your
question seemed to be more about hierarchical faceting as per SOLR-792, but
I couldn't quite distill a question from that part.

Also, just a suggestion: consider using ids instead of names for filtering;
you will get burned sooner or later otherwise.

HTH,

Geert-Jan



2010/7/23 rajini maski rajinima...@gmail.com

 I am also looking out for the same feature in Solr and am very keen to know
 whether it supports this feature of tree faceting... Or are we forced to
 index in a tree faceting format like

 1/2/3/4
 1/2/3
 1/2
 1

 In case of multilevel faceting it will give only a 2-level tree facet, is
 what I found.

 If i give query as : country India and state Karnataka and city
 bangalore...All what i want is a facet count  1) for condition above. 2)
 The
 number of states in that Country 3) the number of cities in that state ...

 Like = Country: India ,State:Karnataka , City: Bangalore 1

 State:Karnataka
  Kerla
  Tamilnadu
  Andra Pradesh...and so on

 City:  Mysore
  Hubli
  Mangalore
  Coorg and so on...


 If I am doing
 facet=on  facet.field={!ex=State}State  fq={!tag=State}State:Karnataka

 All it gives me is Facets on state excluding only that filter query.. But i
 was not able to do same on third level ..Like  facet.field= Give me the
 counts of  cities also in state Karantaka..
 Let me know solution for this...

 Regards,
 Rajani Maski





 On Thu, Jul 22, 2010 at 10:13 PM, Eric Grobler impalah...@googlemail.com
 wrote:

  Thank you for the link.
 
  I was not aware of the multifaceting syntax - this will enable me to run
 1
  less query on the main page!
 
  However this is not a tree faceting feature.
 
  Thanks
  Eric
 
 
 
 
  On Thu, Jul 22, 2010 at 4:51 PM, SR r.steve@gmail.com wrote:
 
   Perhaps the following article can help:
  
 
 http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html
  
   -S
  
  
   On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote:
  
Hi Solr Community
   
If I have:
COUNTRY CITY
Germany Berlin
Germany Hamburg
Spain   Madrid
   
Can I do faceting like:
Germany
 Berlin
 Hamburg
Spain
 Madrid
   
I tried to apply SOLR-792 to the current trunk but it does not seem
 to
  be
compatible.
Maybe there is a similar feature existing in the latest builds?
   
 Thanks & Regards
Eric
  
  
 



Re: help with a schema design problem

2010-07-23 Thread Geert-Jan Brits
With the usecase you specified it should work to just index each row, as
you described in your initial post, as a separate document.
This way p_value and p_type both get single-valued and you get a correct
combination of p_value and p_type.

However, this may not go so well with other use-cases you have in mind,
e.g. requiring that multiple results with the same document id are never
returned.



2010/7/23 Pramod Goyal pramod.go...@gmail.com

 I want to do that. But if i understand correctly in solr it would store the
 field like this:

  p_value: Pramod, Raj
  p_type:  Client, Supplier

 When i search
 p_value:Pramod AND p_type:Supplier

 it would give me result as document 1. Which is incorrect, since in
 document
 1 Pramod is a Client and not a Supplier.




 On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin 
 knagelb...@globeandmail.com wrote:

  I think you just want something like:
 
  p_value:Pramod AND p_type:Supplier
 
  no?
  -Kallin Nagelberg
 
  -Original Message-
  From: Pramod Goyal [mailto:pramod.go...@gmail.com]
  Sent: Friday, July 23, 2010 2:17 PM
  To: solr-user@lucene.apache.org
  Subject: help with a schema design problem
 
  Hi,
 
  Lets say i have table with 3 columns document id Party Value and Party
  Type.
  In this table i have 3 rows. 1st row Document id: 1 Party Value: Pramod
  Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party Type:
  Supplier. 3rd row Document id:2 Party Value: Pramod Party Type: Supplier.
  Now in this table if i use SQL its easy for me find all document with
 Party
  Value as Pramod and Party Type as Client.
 
  I need to design solr schema so that i can do the same in Solr. If i
 create
  2 fields in solr schema Party value and Party type both of them multi
  valued
  and try to query +Pramod +Supplier then solr will return me the first
  document, even though in the first document Pramod is a client and not a
  supplier
  Thanks,
  Pramod Goyal
 



Re: help with a schema design problem

2010-07-23 Thread Geert-Jan Brits
 Is there any way in solr to say p_value[someIndex]=pramod
And p_type[someIndex]=client.
No, I'm 99% sure there is not.

 One way would be to define a single field in the schema as p_value_type =
"client pramod", i.e. combine the values from both fields and store them in a
single field.
yep, for the use-case you mentioned that would definitely work. Multivalued
of course, so it can contain "Supplier Raj" as well.
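To make the cross-matching problem and the combined-field workaround concrete, here is a small simulation (editor's sketch; field names follow the thread, and the matching logic is a simplification of how independent query clauses hit multi-valued fields):

```python
# Editor's sketch (not from the thread itself): why two parallel
# multi-valued fields lose the value/type pairing, and how the combined
# p_value_type field suggested above restores it.

# Doc 1 as two multi-valued fields: the pairing is lost.
doc1 = {"p_value": ["Pramod", "Raj"], "p_type": ["Client", "Supplier"]}

def matches_fields(doc, value, ptype):
    # Simulates +p_value:<value> +p_type:<ptype>: each clause is checked
    # against the whole field, independently of which row it came from.
    return value in doc["p_value"] and ptype in doc["p_type"]

# Workaround: one multi-valued field holding "<type> <value>" pairs.
doc1_combined = {"p_value_type": ["Client Pramod", "Supplier Raj"]}

def matches_combined(doc, value, ptype):
    return f"{ptype} {value}" in doc["p_value_type"]
```

With the two parallel fields, `matches_fields(doc1, "Pramod", "Supplier")` incorrectly matches; with the combined field it does not.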


2010/7/23 Pramod Goyal pramod.go...@gmail.com

In my case the document id is the unique key( each row is not a unique
 document ) . So a single document has multiple Party Value and Party Type.
  Hence i need to define both Party value and Party type as multi-valued. Is
 there any way in solr to say p_value[someIndex]=pramod And
 p_type[someIndex]=client.
Is there any other way i can design my schema ? I have some solutions
 but none seems to be a good solution. One way would be to define a single
 field in the schema as p_value_type = client pramod i.e. combine the
 value
 from both the field and store it in a single field.


 On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits gbr...@gmail.com
 wrote:

  With the usecase you specified it should work to just index each Row as
  you described in your initial post to be a seperate document.
  This way p_value and p_type all get singlevalued and you get a correct
  combination of p_value and p_type.
 
  However, this may not go so well with other use-cases you have in mind,
  e.g.: requiring that no multiple results are returned with the same
  document
  id.
 
 
 
  2010/7/23 Pramod Goyal pramod.go...@gmail.com
 
   I want to do that. But if i understand correctly in solr it would store
  the
   field like this:
  
   p_value: Pramod  Raj
   p_type:  Client Supplier
  
   When i search
   p_value:Pramod AND p_type:Supplier
  
   it would give me result as document 1. Which is incorrect, since in
   document
   1 Pramod is a Client and not a Supplier.
  
  
  
  
   On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin 
   knagelb...@globeandmail.com wrote:
  
I think you just want something like:
   
p_value:Pramod AND p_type:Supplier
   
no?
-Kallin Nagelberg
   
-Original Message-
From: Pramod Goyal [mailto:pramod.go...@gmail.com]
Sent: Friday, July 23, 2010 2:17 PM
To: solr-user@lucene.apache.org
Subject: help with a schema design problem
   
Hi,
   
Lets say i have table with 3 columns document id Party Value and
 Party
Type.
In this table i have 3 rows. 1st row Document id: 1 Party Value:
 Pramod
Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party
  Type:
Supplier. 3rd row Document id:2 Party Value: Pramod Party Type:
  Supplier.
Now in this table if i use SQL its easy for me find all document with
   Party
Value as Pramod and Party Type as Client.
   
I need to design solr schema so that i can do the same in Solr. If i
   create
2 fields in solr schema Party value and Party type both of them multi
valued
and try to query +Pramod +Supplier then solr will return me the first
document, even though in the first document Pramod is a client and
 not
  a
supplier
Thanks,
Pramod Goyal
   
  
 



Re: filter query on timestamp slowing query???

2010-07-23 Thread Geert-Jan Brits
just wanted to mention a possible other route, which might be entirely
hypothetical :-)

*If* you could query on internal docid (I'm not sure that it's available
out-of-the-box, or if you can at all)
your original problem, quoted below, could imo be simplified to asking for
the last docid inserted (that match the other criteria from your use-case)
and in the next call filter from that docid forward.
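The docid-forward idea above amounts to keeping a high-water mark between polls (editor's illustration only; whether Solr exposes the internal docid this way is exactly the open question of this message):

```python
# High-water-mark polling: keep only docs beyond the last id seen, and
# advance the mark for the next call.
def new_docs(docs, last_seen):
    fresh = [d for d in docs if d["docid"] > last_seen]
    new_mark = max((d["docid"] for d in fresh), default=last_seen)
    return fresh, new_mark

docs = [{"docid": i} for i in range(10)]
fresh, mark = new_docs(docs, last_seen=6)
```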

"Every 30 minutes, i ask the index what are the documents that were added to
it, since the last time i queried it, that match a certain criteria.
From time to time, once a week or so, i ask the index for ALL the documents
that match that criteria. (i also do this for not only one query, but
several)
This is why i need the timestamp filter."

Again, I'm not entirely sure that querying / filtering on internal docids is
possible (perhaps someone can comment), but if it is, it would perhaps be
more performant.
Big IF, I know.

Geert-Jan

2010/7/23 Chris Hostetter hossman_luc...@fucit.org

 : On top of using trie dates, you might consider separating the timestamp
  : portion and the type portion of the fq into separate fq parameters --
  : that will allow them to be stored in the filter cache separately. So
 : for instance, if you include type:x OR type:y in queries a lot, but
 : with different date ranges, then when you make a new query, the set for
 : type:x OR type:y can be pulled from the filter cache and intersected

 definitely ... that's the one big thing that jumped out at me once you
 showed us *how* you were constructing these queries.



 -Hoss




Re: help with a schema design problem

2010-07-23 Thread Geert-Jan Brits
Multiple rows in the OP's example are combined to form one solr document (e.g.
rows 1 and 2 both have documentid=1).
Because of this combining, it would match p_value from row 1 with p_type from
row 2 (or vice versa).


2010/7/23 Nagelberg, Kallin knagelb...@globeandmail.com

When i search
p_value:Pramod AND p_type:Supplier
   
it would give me result as document 1. Which is incorrect, since in
document
1 Pramod is a Client and not a Supplier.

 Would it? I would expect it to give you nothing.

 -Kal



 -Original Message-
 From: Geert-Jan Brits [mailto:gbr...@gmail.com]
 Sent: Friday, July 23, 2010 5:05 PM
 To: solr-user@lucene.apache.org
 Subject: Re: help with a schema design problem

  Is there any way in solr to say p_value[someIndex]=pramod
 And p_type[someIndex]=client.
 No, I'm 99% sure there is not.

   One way would be to define a single field in the schema as p_value_type =
  "client pramod", i.e. combine the values from both fields and store them in
  a
  single field.
  yep, for the use-case you mentioned that would definitely work. Multivalued
  of course, so it can contain "Supplier Raj" as well.


 2010/7/23 Pramod Goyal pramod.go...@gmail.com

 In my case the document id is the unique key( each row is not a unique
  document ) . So a single document has multiple Party Value and Party
 Type.
   Hence i need to define both Party value and Party type as multi-valued.
 Is
  there any way in solr to say p_value[someIndex]=pramod And
  p_type[someIndex]=client.
 Is there any other way i can design my schema ? I have some solutions
  but none seems to be a good solution. One way would be to define a single
  field in the schema as p_value_type = client pramod i.e. combine the
  value
  from both the field and store it in a single field.
 
 
  On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits gbr...@gmail.com
  wrote:
 
   With the usecase you specified it should work to just index each Row
 as
   you described in your initial post to be a seperate document.
   This way p_value and p_type all get singlevalued and you get a correct
   combination of p_value and p_type.
  
   However, this may not go so well with other use-cases you have in mind,
   e.g.: requiring that no multiple results are returned with the same
   document
   id.
  
  
  
   2010/7/23 Pramod Goyal pramod.go...@gmail.com
  
I want to do that. But if i understand correctly in solr it would
 store
   the
field like this:
   
p_value: Pramod  Raj
p_type:  Client Supplier
   
When i search
p_value:Pramod AND p_type:Supplier
   
it would give me result as document 1. Which is incorrect, since in
document
1 Pramod is a Client and not a Supplier.
   
   
   
   
On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin 
knagelb...@globeandmail.com wrote:
   
 I think you just want something like:

 p_value:Pramod AND p_type:Supplier

 no?
 -Kallin Nagelberg

 -Original Message-
 From: Pramod Goyal [mailto:pramod.go...@gmail.com]
 Sent: Friday, July 23, 2010 2:17 PM
 To: solr-user@lucene.apache.org
 Subject: help with a schema design problem

 Hi,

 Lets say i have table with 3 columns document id Party Value and
  Party
 Type.
 In this table i have 3 rows. 1st row Document id: 1 Party Value:
  Pramod
 Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party
   Type:
 Supplier. 3rd row Document id:2 Party Value: Pramod Party Type:
   Supplier.
 Now in this table if i use SQL its easy for me find all document
 with
Party
 Value as Pramod and Party Type as Client.

 I need to design solr schema so that i can do the same in Solr. If
 i
create
 2 fields in solr schema Party value and Party type both of them
 multi
 valued
 and try to query +Pramod +Supplier then solr will return me the
 first
 document, even though in the first document Pramod is a client and
  not
   a
 supplier
 Thanks,
 Pramod Goyal

   
  
 



Re: indexing best practices

2010-07-18 Thread Geert-Jan Brits
Have you read:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

To be short: there are only guidelines (see links), no definitive answers.
If you followed the guidelines for improving indexing speed on a single box
and after having tested various settings indexing is still too slow, you may
want to test this scenario:
1. index to several boxes/shards (using round robin or something).
2. copy all created indexes to one box.
3. use IndexWriter.addIndexes to merge the indexes.

Doing 1/2/3 on SSDs is of course going to boost performance a lot as well
(on large indexes; small ones may fit in the disk cache entirely).
Hope that helps a bit,
Geert-Jan
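Step 1 above (spreading documents over several boxes round robin) can be sketched like this (editor's illustration; the shard URLs are hypothetical and the actual posting/merging steps are left out):

```python
# Round-robin assignment of docs to indexing shards.
from itertools import cycle

SHARDS = [
    "http://shard1:8983/solr/update",   # hypothetical boxes
    "http://shard2:8983/solr/update",
    "http://shard3:8983/solr/update",
]

def assign(docs, shards=SHARDS):
    """Pair each doc with the next shard in round-robin order."""
    ring = cycle(shards)
    return [(next(ring), doc) for doc in docs]

batch = assign([{"id": i} for i in range(6)])
```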

2010/7/18 kenf_nc ken.fos...@realestate.com


 No one has done performance analysis? Or has a link to anywhere where it's
 been done?

 basically fastest way to get documents into Solr. So many options
 available,
 what's the fastest:
 1) file import (xml, csv)  vs  DIH  vs POSTing
 2) number of concurrent clients   1   vs 10 vs 100 ...is there a
 diminishing
 returns number?

 I have 16 million small (8 to 10 fields, no large text fields) docs that
 get
 updated monthly and 2.5 million largish (20 to 30 fields, a couple html
 text
 fields) that get updated monthly. It currently takes about 20 hours to do a
 full import. I would like to cut that down as much as possible.
 Thanks,
 Ken
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/indexing-best-practices-tp973274p976313.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Re: How to speed up solr search speed

2010-07-17 Thread Geert-Jan Brits
My query string is always simple like "design", "principle of design",
"tom".
EG:
URL:
http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on

IMO, indeed with these types of simple searches caching (and thus RAM usage)
can not be fully exploited, i.e. there isn't really anything to cache (no
sort-ordering or faceting (Lucene fieldcache), no documentsets from
filtering/faceting (Solr filtercache)).

The only thing that helps you here would be a big solr querycache, depending
on how often queries are repeated.
Just execute the same query twice; the second time you should see a fast
response (say < 20ms): that's the querycache (and thus RAM) working for
you.
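What the querycache buys you can be illustrated with a dict-backed simulation (editor's sketch; Solr's actual queryResultCache is configured in solrconfig.xml, not in client code):

```python
# A repeat of the exact same query is answered from the cache; any new
# query pays the full search cost.
cache = {}

def search(q, execute):
    if q in cache:
        return cache[q], "hit"
    result = execute(q)
    cache[q] = result
    return result, "miss"

_, first = search("design", lambda q: ["product_208619"])
_, second = search("design", lambda q: ["product_208619"])
```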

"Now the issue I found is search with fq argument looks slow down the
search."

This doesn't align with your previous statement that you only use search
with a q-param (e.g.
http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on
)
For your own sake, explain what you're trying to do, otherwise we really are
guessing in the dark.

Anyway the fq-param lets you cache (using the Solr filtercache) individual
documentsets that can be used to efficiently intersect your resultset.
Also, the first time, caches should be warmed (i.e. the fq-query should be
executed and the results saved to cache, since there isn't anything there
yet). Only the second time would you start seeing improvements.

For instance:
http://localhost:7550/solr/select/?q=design&fq=doctype:pdf&version=2.2&start=0&rows=10&indent=on

would only show documents containing "design" when doctype=pdf. (Again, this
is just an example where I'm assuming that you have defined a field
'doctype'.)
Since the nr of values of doctype would be pretty low and it would be used
independently of other queries, this would be an excellent candidate for the
fq-param.

http://wiki.apache.org/solr/CommonQueryParameters#fq
This was a longer reply than I wanted to. Really think about your use-cases
first, then present some real examples of what you want to achieve and then
we can help you in a more useful manner.

Cheers,
Geert-Jan

2010/7/17 marship mars...@126.com

 Hi. Peter and All.
 I merged my indexes today. Now each index stores 10M document. Now I only
 have 10 solr cores.
 And I used

 java -Xmx1g -jar -server start.jar
 to start the jetty server.

 At first I deployed them all on one server. The search speed is about 3s.
 Then I noticed from cmd output when the search starts, 4 of 10's QTime only cost
 about 10ms-500ms. The other 5 cost more, up to 2-3s. Then I put 6 on the web
 server, 4 on another (DB, high load most of the time). Then the search speed went
 down to about 1s most of the time.
 Now most search takes about 1s. That's great.

 I watched the jetty output on cmd windows on web server, now when each
 search start, I saw 2 of 6 costs 60ms-80ms. The another 4 cost 170ms -
 700ms.  I do believe the bottleneck is still the hard disk. But at least,
 the search speed at the moment is acceptable. Maybe i should try memdisk to
 see if that help.


 And for -Xmx1g, actually I only see jetty consume about 150M memory,
 considering the index is now 10x bigger, I don't think that works. I googled:
 -Xmx is to enlarge the heap size. Not sure whether that can help search. I still
 have 3.5G memory free on the server.

 Now the issue I found is search with fq argument looks slow down the
 search.

 Thanks All for your help and suggestions.
 Thanks.
 Regards.
 Scott


 On 2010-07-17 03:36:19, Peter Karich peat...@yahoo.de wrote:
   Each solr(jetty) instance on consume 40M-60M memory.
 
  java -Xmx1024M -jar start.jar
 
 That's a good suggestion!
 Please, double check that you are using the -server version of the jvm
 and the latest 1.6.0_20 or so.
 
 Additionally you can start jvisualvm (shipped with the jdk) and hook
 into jetty/tomcat easily to see the current CPU and memory load.
 
  But I have 70 solr cores
 
 if you ask me: I would reduce them to 10-15 or even less and increase
 the RAM.
 try out tomcat too
 
   solr distributed search's speed is decided by the slowest one.
 
 so, try to reduce the cores
 
 Regards,
 Peter.
 
   you mentioned that you have a lot of mem free, but your jetty containers
   only using between 40-60M mem.
 
  probably stating the obvious, but have you increased the -Xmx param like
 for
  instance:
  java -Xmx1024M -jar start.jar
 
  that way you're configuring the container to use a maximum of 1024 MB
 ram
  instead of the standard which is much lower (I'm not sure what exactly
 but
  it could well be 64MB for non -server, aligning with what you're seeing)
 
  Geert-Jan
 
  2010/7/16 marship mars...@126.com
 
 
  Hi Tom Burton-West.
 
   Sorry looks my email ISP filtered out your replies. I checked web
 version
  of mailing list and saw your reply.
 
   My query string is always simple like 

Re: Re:Re: How to speed up solr search speed

2010-07-16 Thread Geert-Jan Brits
you mentioned that you have a lot of mem free, but your jetty containers
are only using between 40-60M mem.

probably stating the obvious, but have you increased the -Xmx param like for
instance:
java -Xmx1024M -jar start.jar

that way you're configuring the container to use a maximum of 1024 MB ram
instead of the standard which is much lower (I'm not sure what exactly but
it could well be 64MB for non -server, aligning with what you're seeing)

Geert-Jan

2010/7/16 marship mars...@126.com

 Hi Tom Burton-West.

  Sorry looks my email ISP filtered out your replies. I checked web version
 of mailing list and saw your reply.

  My query string is always simple like "design", "principle of design",
 "tom"



 EG:

 URL:
 http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on

 Response:

 <response>
 <lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">16</int>
 <lst name="params">
 <str name="indent">on</str>
 <str name="start">0</str>
 <str name="q">design</str>
 <str name="version">2.2</str>
 <str name="rows">10</str>
 </lst>
 </lst>
 <result name="response" numFound="5981" start="0">
 <doc>
 <str name="id">product_208619</str>
 </doc>





 EG:
 http://localhost:7550/solr/select/?q=Principle&version=2.2&start=0&rows=10&indent=on

 <response>
 <lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">94</int>
 <lst name="params">
 <str name="indent">on</str>
 <str name="start">0</str>
 <str name="q">Principle</str>
 <str name="version">2.2</str>
 <str name="rows">10</str>
 </lst>
 </lst>
 <result name="response" numFound="104" start="0">
 <doc>
 <str name="id">product_56926</str>
 </doc>



 As I am querying over single core and other cores are not querying at same
 time. The QTime looks good.

 But when I query the distributed node: (For this case, 6422ms is still a
 not bad one. Many cost ~20s)

 URL:
 http://localhost:7499/solr/select/?q=the+first+world+war&version=2.2&start=0&rows=10&indent=on&debugQuery=true

 Response:

 <response>
 <lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">6422</int>
 <lst name="params">
 <str name="debugQuery">true</str>
 <str name="indent">on</str>
 <str name="start">0</str>
 <str name="q">the first world war</str>
 <str name="version">2.2</str>
 <str name="rows">10</str>
 </lst>
 </lst>
 <result name="response" numFound="4231" start="0">



 Actually I am thinking about and testing a solution: as I believe the bottleneck
 is the harddisk and all our indexes add up to about 10-15G, what about I just
 add another 16G memory to my server, then use MemDisk to map a memory disk
 and put all my indexes into it? Then each time solr/jetty needs to load the
 index from harddisk, it is loading from memory instead. This should give solr the
 most throughput and avoid the harddisk access delay. I am testing 

 But if there are way to make solr use better use our limited resource to
 avoid adding new ones. that would be great.








Re: How I can use score value for my function

2010-06-29 Thread Geert-Jan Brits
It's possible using functionqueries. See this link.

http://wiki.apache.org/solr/FunctionQuery#query

2010/6/29 MitchK mitc...@web.de


 Ramzesua,

 this is not possible, because Solr does not know what is the resulting
 score
 at query-time (as far as I know).
 The score will be computed, when every hit from every field is combined by
 the scorer.
 Furthermore I have shown you an alternative in the other threads. It does
 not do exactly what you are describing, but works without a problem.

 Regards
 - Mitch
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-I-can-use-score-value-for-my-function-tp899662p930646.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Setting many properties for a multivalued field. Schema.xml ? External file?

2010-06-26 Thread Geert-Jan Brits
You can treat dynamic fields like any other field, so you can facet, sort,
filter, etc on these fields (afaik)

I believe the confusion arises because the usecase for dynamic fields
sometimes seems to be ill-understood, i.e. being able to use them to do some
kind of wildcard search, e.g. searching for a value in any of the dynamic
fields at once like pic_url_*. This however is NOT possible.

As far as your question goes:

"Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc w/o
pic.
To the best of my knowledge, everyone is saying that faceting cannot be
done on dynamic fields (only on definitive field names). Thus, I tried the
following and it's working: I assume that the stored pictures have a
sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the index, it
means that the underlying doc has at least one picture:
...facet=on&facet.field=pic_url_1&facet.mincount=1&fq=pic_url_1:*
While this is working fine, I'm wondering whether there's a cleaner way to
do the same thing without assuming that pictures have a sequential number."

If I understand your question correctly: faceting on docs with and without
pics could of course be done like you mention; however, it would be more
efficient to have an extra field defined: hasAtLeastOnePic with values (0 |
1), and
use that to facet / filter on.

you can extend this to NrOfPics [0,N)  if you need to filter / facet on docs
with a certain nr of pics.
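Deriving such helper fields at index time could look like this (editor's sketch; hasAtLeastOnePic / NrOfPics are the field names suggested above, and the doc layout is illustrative):

```python
# Derive filter/facet-friendly helper fields from the raw picture list
# before sending the doc to Solr.
def enrich(doc):
    pics = doc.get("pictures", [])
    doc["hasAtLeastOnePic"] = 1 if pics else 0
    doc["NrOfPics"] = len(pics)
    return doc

d = enrich({"id": "42", "pictures": ["a.jpg", "b.jpg"]})
```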

also I wondered what else you wanted to do with this pic-related info. Do
you want to search on pic-description / pic-caption for instance? In that
case the dynamic-fields approach may not be what you want: how would you
know in which dynamic field to search for a particular term? Would it be
pic_desc_1, or pic_desc_x? Of course you could OR over all dynamic fields,
but you'd need to know an upperbound for the nr of pics, and it
really doesn't feel right, to me at least.

If you need search on pic_description for instance, but don't mind what pic
matches, you could create a single field pic_description and put in the
concat of all pic-descriptions and search on that, or just make it a
multi-valued field.

If you don't need search at all on these fields, the best thing imo is to
store all pic-related info of all pics together by concatenating them with
some delimiter which you know how to separate at the client-side.
That, or just store it in an external RDB, since solr is then just sitting on
the data and not doing anything intelligent with it.

I assume btw that you don't want to sort/ facet on pic-desc / pic_caption/
pic_url either ( I have a hard time thinking of a useful usecase for that)

HTH,

Geert-Jan



2010/6/26 Saïd Radhouani r.steve@gmail.com

 Thanks so much Otis. This is working great.

 Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc w/o
 pic

 To the best of my knowledge, everyone is saying that faceting cannot be
 done on dynamic fields (only on definitive field names). Thus, I tried the
 following and it's working: I assume that the stored pictures have a
 sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the index, it
 means that the underlying doc has at least one picture:

  ...facet=on&facet.field=pic_url_1&facet.mincount=1&fq=pic_url_1:*

 While this is working fine, I'm wondering whether there's a cleaner way to
 do the same thing without assuming that pictures have a sequential number.

 Also, do you have any documentation about handling Dynamic Fields using
 SolrJ. So far, I found only issues about that on JIRA, but no documentation.

 Thanks a lot.

 -Saïd

 On Jun 26, 2010, at 1:18 AM, Otis Gospodnetic wrote:

  Saïd,
 
  Dynamic fields could help here, for example imagine a doc with:
  id
  pic_url_*
  pic_caption_*
  pic_description_*
 
  See http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
 
  So, for you:
 
   <dynamicField name="pic_url_*" type="string" indexed="true"
   stored="true"/>
   <dynamicField name="pic_caption_*" type="text" indexed="true"
   stored="true"/>
   <dynamicField name="pic_description_*" type="text" indexed="true"
   stored="true"/>
 
  Then you can add docs with unlimited number of
 pic_(url|caption|description)_* fields, e.g.
 
  id
  pic_url_1
  pic_caption_1
  pic_description_1
 
  id
  pic_url_2
  pic_caption_2
  pic_description_2
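Building a document with the numbered dynamic fields Otis describes could be sketched as follows (editor's illustration; the actual submission to Solr via SolrJ or another client is omitted):

```python
# Expand a list of pictures into numbered dynamic fields
# (pic_url_1, pic_caption_1, ..., pic_url_2, ...).
def doc_with_pics(doc_id, pics):
    doc = {"id": doc_id}
    for i, pic in enumerate(pics, start=1):
        doc[f"pic_url_{i}"] = pic["url"]
        doc[f"pic_caption_{i}"] = pic["caption"]
        doc[f"pic_description_{i}"] = pic["description"]
    return doc

d = doc_with_pics("1", [
    {"url": "u1", "caption": "c1", "description": "d1"},
    {"url": "u2", "caption": "c2", "description": "d2"},
])
```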
 
 
  Otis
  
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/
 
 
 
  - Original Message 
  From: Saïd Radhouani r.steve@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Fri, June 25, 2010 6:01:13 PM
  Subject: Setting many properties for a multivalued field. Schema.xml ?
 External file?
 
  Hi,
 
  I'm trying to index data containing a multivalued field picture,
  that has three properties: url, caption and description:
 
   <picture>
     <url/>
     <caption/>
     <description/>
   </picture>
  Thus, each
  indexed document might have many pictures, each of them has a url, a
 caption,
  and a description.
 
  I wonder whether it's 

Re: Setting many properties for a multivalued field. Schema.xml ? External file?

2010-06-26 Thread Geert-Jan Brits
If I understand your suggestion correctly, you said that there's NO need to
have many Dynamic Fields; instead, we can have one definitive field name,
which can store a long string (concatenation of information about tens of
pictures), e.g., using - and % delimiters:
pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%...
I don't clearly see the reason of doing this. Is there a gain in terms of
performance? Or does this make programming on the client-side easier? Or
something else?

I think you should ask the exact opposite question. If you don't do anything
with these fields that Solr is particularly good at (searching / filtering
/ faceting / sorting), why go through the trouble of creating dynamic fields?
(More fields means more overhead / tracking cost no matter how you look at
it.)

Moreover, indeed from a client view it's easier the way I suggested, since
otherwise you:
- would have to ask (through SolrJ) to include all dynamic fields to be
returned in the fl-param (
http://wiki.apache.org/solr/CommonQueryParameters#fl). This is difficult,
because a priori you don't know how many dynamic fields to query. So in
other words, you can't just ask Solr (through SolrJ, like you asked) to
return all dynamic fields beginning with pic_*. (afaik)
- your client iterate code (looping the pics) is a bit more involved.

HTH, Cheers,

Geert-Jan

2010/6/26 Saïd Radhouani r.steve@gmail.com

 Thanks Geert-Jan for the detailed answer. Actually, I don't search at all
 on these fields. I'm only filtering (w/ vs w/ pic) and sorting (based on the
 number of pictures). Thus, your suggestion of adding an extra field NrOfPics
 [0,N] would be the best solution.

 Regarding the other suggestion:

  If you dont need search at all on these fields, the best thing imo is to
  store all pic-related info of all pics together by concatenating them
 with
  some delimiter which you know how to seperate at the client-side.
  That or just store it in an external RDB since solr is just sitting on
 the
  data and not doing anything intelligent with it.

 If I understand your suggestion correctly, you said that there's NO need to
 have many Dynamic Fields; instead, we can have one definitive field name,
 which can store a long string (concatenation of information about tens of
 pictures), e.g., using - and % delimiters:
 pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%...

 I don't clearly see the reason of doing this. Is there a gain in terms of
 performance? Or does this make programming on the client-side easier? Or
 something else?


 My other question was: in case we use Dynamic Fields, is there a
 documentation about using SolrJ for this purpose?

 Thanks
 -Saïd

 On Jun 26, 2010, at 12:29 PM, Geert-Jan Brits wrote:

  You can treat dynamic fields like any other field, so you can facet,
 sort,
  filter, etc on these fields (afaik)
 
  I believe the confusion arises that sometimes the usecase for dynamic
 fields
  seems to be ill-understood, i.e: to be able to use them to do some kind
 of
  wildcard search, e.g: search for a value in any of the dynamic fields at
  once like pic_url_*. This however is NOT possible.
 
  As far as your question goes:
 
  Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc
 w/o
  pic
  To the best of my knowledge, everyone is saying that faceting cannot be
  done on dynamic fields (only on definitive field names). Thus, I tried
 the
  following and it's working: I assume that the stored  pictures have a
  sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the index,
 it
  means that the underlying doc has at least one picture:
  ...facet=onfacet.field=pic_url_1facet.mincount=1fq=pic_url_1:*
  While this is working fine, I'm wondering whether there's a cleaner way
 to
  do the same thing without assuming that pictures have a sequential
 number.
 
  If I understand your question correctly: faceting on docs with and
 without
  pics could ofcourse by done like you mention, however it  would be more
  efficient to have an extra field defined:  hasAtLestOnePic with values (0
 |
  1)
  use that to facet / filter on.
 
  you can extend this to NrOfPics [0,N)  if you need to filter / facet on
 docs
  with a certain nr of pics.
 
  also I wondered what else you wanted to do with this pic-related info. Do
  you want to search on pic-description / pic-caption for instance? In that
  case the dynamic-fields approach may not be what you want: how would you
  know in which dynamic-field to search for a particular term? Would if be
  pic_desc_1 , or pic_desc_x?  Of couse you could OR over all dynamic
 fields,
  but you need to know how many pics an upperbound for the nr of pics and
 it
  really doesn't feel right, to me at least.
 
  If you need search on pic_description for instance, but don't mind what
 pic
  matches, you could create a single field

Re: Setting many properties for a multivalued field. Schema.xml ? External file?

2010-06-26 Thread Geert-Jan Brits
btw, be careful with your delimiters: pic_url may possibly contain a '-',
etc.

2010/6/26 Geert-Jan Brits gbr...@gmail.com

 If I understand your suggestion correctly, you said that there's NO need
 to have many Dynamic Fields; instead, we can have one definitive field name,
 which can store a long string (concatenation of information about tens of
 pictures), e.g., using - and % delimiters:
 pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%...
 I don't clearly see the reason of doing this. Is there a gain in terms of
 performance? Or does this make programming on the client-side easier? Or
 something else?

 I think you should ask the exact opposite question. If you don't do
 anything with these fields which Solr is particularly good at (searching /
 filtering / faceting / sorting) why go through the trouble of creating
 dynamic fields? (more fields means more overhead / tracking cost no matter
 how you look at it)

 Moreover, indeed from a client-view it's easier the way I suggested, since
 otherwise you:
 - would have to ask (through SolrJ) to include all dynamic fields to be
 returned in the fl parameter (
 http://wiki.apache.org/solr/CommonQueryParameters#fl). This is difficult,
 because a priori you don't know how many dynamic fields to query. So in
 other words you can't just ask Solr (through SolrJ, like you asked) to just
 return all dynamic fields beginning with pic_*. (afaik)
 - your client iteration code (looping the pics) is a bit more involved.
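 A minimal sketch of that client-side unpacking (hypothetical values; it
 assumes the '-' and '%' delimiters never occur inside the values
 themselves, per the caveat above):

```java
import java.util.ArrayList;
import java.util.List;

public class PicFieldParser {
    // One pic record per '%'-separated chunk; url/caption/description
    // are split on '-'. Assumes neither delimiter appears inside a value.
    public static List<String[]> parse(String stored) {
        List<String[]> pics = new ArrayList<>();
        for (String chunk : stored.split("%")) {
            pics.add(chunk.split("-", 3)); // [url, caption, description]
        }
        return pics;
    }
}
```

 e.g. parse("url1-cap1-desc1%url2-cap2-desc2") yields two 3-element records.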

 HTH, Cheers,

 Geert-Jan

 2010/6/26 Saïd Radhouani r.steve@gmail.com

 Thanks Geert-Jan for the detailed answer. Actually, I don't search at all
 on these fields. I'm only filtering (w/ vs w/o pic) and sorting (based on the
 number of pictures). Thus, your suggestion of adding an extra field NrOfPics
 [0,N] would be the best solution.

 Regarding the other suggestion:

  If you don't need search at all on these fields, the best thing imo is to
  store all pic-related info of all pics together by concatenating them with
  some delimiter which you know how to separate at the client-side.
  That or just store it in an external RDB since solr is just sitting on
 the
  data and not doing anything intelligent with it.

 If I understand your suggestion correctly, you said that there's NO need
 to have many Dynamic Fields; instead, we can have one definitive field name,
 which can store a long string (concatenation of information about tens of
 pictures), e.g., using - and % delimiters:
 pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%...

 I don't clearly see the reason of doing this. Is there a gain in terms of
 performance? Or does this make programming on the client-side easier? Or
 something else?


 My other question was: in case we use Dynamic Fields, is there a
 documentation about using SolrJ for this purpose?

 Thanks
 -Saïd

 On Jun 26, 2010, at 12:29 PM, Geert-Jan Brits wrote:

  You can treat dynamic fields like any other field, so you can facet,
 sort,
  filter, etc on these fields (afaik)
 
  I believe the confusion arises that sometimes the usecase for dynamic
 fields
  seems to be ill-understood, i.e: to be able to use them to do some kind
 of
  wildcard search, e.g: search for a value in any of the dynamic fields at
  once like pic_url_*. This however is NOT possible.
 
  As far as your question goes:
 
  Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc
 w/o
  pic
  To the best of my knowledge, everyone is saying that faceting cannot be
  done on dynamic fields (only on definitive field names). Thus, I tried
 the
  following and it's working: I assume that the stored  pictures have a
  sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the
 index, it
  means that the underlying doc has at least one picture:
   ...facet=on&facet.field=pic_url_1&facet.mincount=1&fq=pic_url_1:*
  While this is working fine, I'm wondering whether there's a cleaner way
 to
  do the same thing without assuming that pictures have a sequential
 number.
 
   If I understand your question correctly: faceting on docs with and without
   pics could of course be done like you mention, however it would be more
   efficient to have an extra field defined: hasAtLeastOnePic with values
   (0 | 1); use that to facet / filter on.
 
  you can extend this to NrOfPics [0,N)  if you need to filter / facet on
 docs
  with a certain nr of pics.
 
  also I wondered what else you wanted to do with this pic-related info.
 Do
  you want to search on pic-description / pic-caption for instance? In
 that
  case the dynamic-fields approach may not be what you want: how would you
   know in which dynamic-field to search for a particular term? Would it be
   pic_desc_1, or pic_desc_x? Of course you could OR over all dynamic fields,
   but you'd need to know an upper bound for the nr of pics and it
   really doesn't feel right

Re: Searching across multiple repeating fields

2010-06-22 Thread Geert-Jan Brits
Perhaps my answer is useless, because I don't have an answer to your direct
question, but:
You *might* want to consider if your concept of a solr-document is on the
correct granular level, i.e:

your problem as posted could be tackled (afaik) by defining a document as a
'sub-event' with only 1 daterange.
So each event-doc you have now is replaced by several sub-event docs in
this proposed situation.

Additionally each sub-event doc gets an additional field 'parent-eventid'
which maps to something like an event-id (which you're probably using) .
So several sub-event docs can point to the same event-id.

Lastly, all sub-event docs belonging to a particular event carry all the
other fields that you may have stored in that particular event-doc.

Now you can query for events based on date-ranges like you envisioned, but
instead of returning events you return sub-event docs. However, since all
data of the original event (except the multiple dateranges) is available in
the sub-event doc this shouldn't really bother the client. If you need to
display all dates of an event (the only info missing from the returned
solr-doc) you could easily store it in an RDB and fetch it using the defined
parent-eventid.
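A sketch of that flattening step at indexing time (the field layout is
hypothetical; each flat record would become one sub-event solr-doc carrying
the parent event id plus the shared event fields):

```java
import java.util.ArrayList;
import java.util.List;

public class SubEventFlattener {
    // One flat record per date range: [parentEventId, start, end, title].
    // Each record is indexed as its own sub-event document.
    public static List<String[]> flatten(String eventId, String title,
                                         List<int[]> dateRanges) {
        List<String[]> docs = new ArrayList<>();
        for (int[] range : dateRanges) {
            docs.add(new String[] {eventId, String.valueOf(range[0]),
                                   String.valueOf(range[1]), title});
        }
        return docs;
    }
}
```

An event with three dateranges thus becomes three sub-event docs, all
pointing back at the same parent-eventid.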

The only caveat I see, is that possibly multiple sub-events with the same
'parent-eventid' might get returned for a particular query.
This however depends on the type of queries you envision. i.e:
1)  If you always issue queries with date-filters, and *assuming* that
sub-events of a particular event don't temporally overlap, you will never
get multiple sub-events returned.
2)  if 1)  doesn't hold and assuming you *do* mind multiple sub-events of
the same actual event, you could try to use Field Collapsing on
'parent-eventid' to only return the first sub-event per parent-eventid that
matches the rest of your query. (Note however, that Field Collapsing is a
patch at the moment. http://wiki.apache.org/solr/FieldCollapsing)

Not sure if this helped you at all, but at the very least it was a nice
conceptual exercise ;-)

Cheers,
Geert-Jan


2010/6/22 Mark Allan mark.al...@ed.ac.uk

 Hi all,

 Firstly, I apologise for the length of this email but I need to describe
 properly what I'm doing before I get to the problem!

 I'm working on a project just now which requires the ability to store and
 search on temporal coverage data - ie. a field which specifies a date range
 during which a certain event took place.

 I hunted around for a few days and couldn't find anything which seemed to
 fit, so I had a go at writing my own field type based on solr.PointType.
  It's used as follows:
  schema.xml
    <fieldType name="temporal" class="solr.TemporalCoverage" dimension="2" subFieldSuffix="_i"/>
    <field name="daterange" type="temporal" indexed="true" stored="true" multiValued="true"/>
  data.xml
    <add>
    <doc>
    ...
    <field name="daterange">1940,1945</field>
    </doc>
    </add>

 Internally, this gets stored as:
    <arr name="daterange"><str>1940,1945</str></arr>
    <int name="daterange_0_i">1940</int>
    <int name="daterange_1_i">1945</int>

 In due course, I'll declare the subfields as a proper date type, but in the
 meantime, this works absolutely fine.  I can search for an individual date
 and Solr will check (queryDate > daterange_0 AND queryDate < daterange_1)
 and the correct documents are returned.  My code also allows the user to
 input a date range in the query but I won't complicate matters with that
 just now!

 The problem arises when a document has more than one daterange field
 (imagine a news broadcast which covers a variety of topics and hence time
 periods).

 A document with two daterange fields
    <doc>
    ...
    <field name="daterange">19820402,19820614</field>
    <field name="daterange">1990,2000</field>
    </doc>
 gets stored internally as
    <arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
    <arr name="daterange_0_i"><int>19820402</int><int>1990</int></arr>
    <arr name="daterange_1_i"><int>19820614</int><int>2000</int></arr>

 In this situation, searching for 1985 should yield zero results as it is
 contained within neither daterange, however, the above document is returned
 in the result set.  What Solr is doing is checking that the queryDate (1985)
 is greater than *any* of the values in daterange_0 AND queryDate is less
 than *any* of the values in daterange_1.

 How can I get Solr to respect the positions of each item in the daterange_0
 and _1 arrays?  Ideally I'd like the search to use the following logic, thus
 preventing the above document from being returned in a search for 1985:
    (queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR
 (queryDate > daterange_0[1] AND queryDate < daterange_1[1])
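 For illustration, the pairwise check described above can be sketched
 client-side (hypothetical method; this is exactly the logic that Solr's
 independent per-array comparison fails to apply):

```java
public class DateRangeCheck {
    // starts[i] and ends[i] form one daterange pair; the query date must
    // fall strictly inside at least one pair for the document to match.
    public static boolean matches(int queryDate, int[] starts, int[] ends) {
        for (int i = 0; i < starts.length; i++) {
            if (queryDate > starts[i] && queryDate < ends[i]) {
                return true;
            }
        }
        return false;
    }
}
```

 With pairs (1940,1945) and (1990,2000), a query for 1985 matches neither
 pair even though it is greater than one start and less than one end.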

 Someone else had a very similar problem recently on the mailing list with a
 multiValued PointType field but the thread went cold without a final
 solution.

 While I could filter the results when they get back to my application
 layer, it seems like it's not really the right 

Re: Sort facet Field by name

2010-06-21 Thread Geert-Jan Brits
facet.sort=false

http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort

2010/6/21 Ankit Bhatnagar abhatna...@vantage.com

 Hi All,
 I couldn't really figure out if we a have option for sorting the facet
 field by name in ascending/descending.

 Any clues?

 Thanks
 Ankit



Re: custom scorer in Solr

2010-06-14 Thread Geert-Jan Brits
First of all,

Do you expect every query to return results for all 4 buckets?
i.o.w: say you make a SortField that sorts for score 4 first, then 3, 2, 1.
When displaying the first 10 results, is it ok that these documents
potentially all have score 4, and thus only bucket 1 is filled?

If so, I can think of the following out-of-the-box option (which I'm
not sure performs well enough, but you can easily test it on your data):

following your example create 4 fields:
1. categoryExact - configure analyzers so that only full matches score,
others don't
2. categoryPartial - configure so that full and partial matches score (likely
you have already configured this)
3. nameExact - like 1
4. namePartial - like 2

configure copyFields: 1 --> 2 and 3 --> 4
this way your indexing client can stay the same as it likely is at the
moment.


Now you have 4 fields whose scores you have to combine at search-time so
that the eventual scores are in [1,4].
Out-of-the-box you can do this with functionqueries.

http://wiki.apache.org/solr/FunctionQuery

I don't have time to write it down exactly, but for each field:
- calc the score of each field (use the Query functionquery, nr 16 in the
wiki). If score > 0 use the map function to map it to respectively
4, 3, 2, 1.

now for each document you have potentially multiple scores for instance: 4
and 2 if your doc matches exact and partial on category.
- use the max functionquery to only return the highest score -- 4 in this
case.

You have to find out for yourself if this performs though.
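As a sketch of that map/max combination, here is the equivalent computation
in plain Java (client-side simulation only; in Solr it would be expressed
with the map and max function queries, and the bucket values 4..1 are the
ones from Tom's requirements):

```java
public class BucketScore {
    // Raw per-field relevance scores (0 means no match on that field).
    // Any positive score is mapped to its bucket value; only the highest
    // bucket is kept, so exact-category + partial-name yields 4, not 5.
    public static int score(float catExact, float nameExact,
                            float catPartial, float namePartial) {
        int s = 0;
        if (catExact > 0)    s = Math.max(s, 4);
        if (nameExact > 0)   s = Math.max(s, 3);
        if (catPartial > 0)  s = Math.max(s, 2);
        if (namePartial > 0) s = Math.max(s, 1);
        return s;
    }
}
```

This mirrors the restaurant example: a doc matching category exactly and
name partially ends up with 4 instead of the additive 5.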

Hope that helps,
Geert-Jan


2010/6/14 Fornoville, Tom tom.fornovi...@truvo.com

 I've been investigating this further and I might have found another path
 to consider.

 Would it be possible to create a custom implementation of a SortField,
 comparable to the RandomSortField, to tackle the problem?


 I know it is not your standard question but would really appreciate all
 feedback and suggestions on this because this is the issue that will
 make or break the acceptance of Solr for this client.

 Thanks,
 Tom

 -Original Message-
 From: Fornoville, Tom
 Sent: woensdag 9 juni 2010 15:35
 To: solr-user@lucene.apache.org
 Subject: custom scorer in Solr

 Hi all,



 We are currently working on a proof-of-concept for a client using Solr
 and have been able to configure all the features they want except the
 scoring.



 Problem is that they want scores that make results fall in buckets:

 *   Bucket 1: exact match on category (score = 4)
 *   Bucket 2: exact match on name (score = 3)
 *   Bucket 3: partial match on category (score = 2)
 *   Bucket 4: partial match on name (score = 1)



 First thing we did was develop a custom similarity class that would
 return the correct score depending on the field and an exact or partial
 match.



 The only problem now is that when a document matches on both the
 category and name the scores are added together.

 Example: searching for restaurant returns documents in the category
 restaurant that also have the word restaurant in their name and thus get
 a score of 5 (4+1) but they should only get 4.



 I assume for this to work we would need to develop a custom Scorer class
 but we have no clue on how to incorporate this in Solr.

 Maybe there is even a simpler solution that we don't know about.



 All suggestions welcome!



 Thanks,

 Tom




Re: custom scorer in Solr

2010-06-14 Thread Geert-Jan Brits
Just to be clear,
this is for the use-case in which it is ok that potentially only 1 bucket
gets filled.

2010/6/14 Geert-Jan Brits gbr...@gmail.com

 First of all,

 Do you expect every query to return results for all 4 buckets?
 i.o.w: say you make a SortField that sorts for score 4 first, then 3, 2,
 1.
 When displaying the first 10 results, is it ok that these documents
 potentially all have score 4, and thus only bucket 1 is filled?

 If so, I can think of the following out-of-the-box option (which I'm
 not sure performs well enough, but you can easily test it on your data):

 following your example create 4 fields:
 1. categoryExact - configure analyzers so that only full matches score,
 others don't
 2. categoryPartial - configure so that full and partial matches score
 (likely you have already configured this)
 3. nameExact - like 1
 4. namePartial - like 2

 configure copyFields: 1 --> 2 and 3 --> 4
 this way your indexing client can stay the same as it likely is at the
 moment.


 Now you have 4 fields whose scores you have to combine at search-time so
 that the eventual scores are in [1,4].
 Out-of-the-box you can do this with functionqueries.

 http://wiki.apache.org/solr/FunctionQuery

 I don't have time to write it down exactly, but for each field:
 - calc the score of each field (use the Query functionquery, nr 16 in the
 wiki). If score > 0 use the map function to map it to respectively
 4, 3, 2, 1.

 now for each document you have potentially multiple scores for instance: 4
 and 2 if your doc matches exact and partial on category.
 - use the max functionquery to only return the highest score -- 4 in this
 case.

 You have to find out for yourself if this performs though.

 Hope that helps,
 Geert-Jan


 2010/6/14 Fornoville, Tom tom.fornovi...@truvo.com

 I've been investigating this further and I might have found another path
 to consider.

 Would it be possible to create a custom implementation of a SortField,
 comparable to the RandomSortField, to tackle the problem?


 I know it is not your standard question but would really appreciate all
 feedback and suggestions on this because this is the issue that will
 make or break the acceptance of Solr for this client.

 Thanks,
 Tom

 -Original Message-
 From: Fornoville, Tom
 Sent: woensdag 9 juni 2010 15:35
 To: solr-user@lucene.apache.org
 Subject: custom scorer in Solr

 Hi all,



 We are currently working on a proof-of-concept for a client using Solr
 and have been able to configure all the features they want except the
 scoring.



 Problem is that they want scores that make results fall in buckets:

 *   Bucket 1: exact match on category (score = 4)
 *   Bucket 2: exact match on name (score = 3)
 *   Bucket 3: partial match on category (score = 2)
 *   Bucket 4: partial match on name (score = 1)



 First thing we did was develop a custom similarity class that would
 return the correct score depending on the field and an exact or partial
 match.



 The only problem now is that when a document matches on both the
 category and name the scores are added together.

 Example: searching for restaurant returns documents in the category
 restaurant that also have the word restaurant in their name and thus get
 a score of 5 (4+1) but they should only get 4.



 I assume for this to work we would need to develop a custom Scorer class
 but we have no clue on how to incorporate this in Solr.

 Maybe there is even a simpler solution that we don't know about.



 All suggestions welcome!



 Thanks,

 Tom





Re: Tips on recursive xml-parsing in dataConfig

2010-06-08 Thread Geert-Jan Brits
my bad, it looks like XPathEntityProcessor doesn't support relative xpaths.

However, I quickly looked at the Slashdot example (which is pretty good
actually) at http://wiki.apache.org/solr/DataImportHandler.
From that I infer that you use only 1 entity per xml-doc, and within that
entity use multiple field declarations with xpath-attributes to extract
the values you want.
So even though your xml-document is nested (like most xmls are) your
field-declarations are not.

I think your best bet is to read the slashdot example and go from there.

For now, I'm not entirely sure what you want a solr-document to be in your
example. i.e:
- 1 solr-document per 1 xml-document (as supplied)
- or 1 solr-doc per CHAP, per PARA, or per SUB?

Once you know that, perhaps coming up with a decent pointer is easier.

HTH,
Geert-Jan



2010/6/8 Tor Henning Ueland tor.henn...@gmail.com

 I have tried both to change the datasource per child node to use the
 parent node's name, and tried making the Xpaths relative, both
 causing either exceptions telling that Xpath must start with /, or
 nullpointer exceptions (nsfgrantsdir document : null).

 Best regards

 On Mon, Jun 7, 2010 at 4:12 PM, Geert-Jan Brits gbr...@gmail.com wrote:
  I'm guessing (I'm not familiar with the xml dataimport handler, but I am
  pretty familiar with Xpath)
  that your problem lies in having absolute xpath-queries, instead of
 relative
  xpath queries to your parent node.
 
  e.g: /DOK/TEKST/KAP is absolute ( the prefixed '/' tells it to be). Try
  'KAP' instead.
  The same for all xpaths deeper in the tree.
 
  Geert-Jan
 
  2010/6/7 Tor Henning Ueland tor.henn...@gmail.com
 
  Hi,
 
  I am doing some testing of dataimport to Solr from XML-documents with
  many children in the children. To parse the children some levels
  down using Xpath goes fine, but the speed is very slow. (~1 minute per
  document, on a quad Xeon server). When i do the same using the format
  solr wants it, the parsing time is 0.02 seconds per document.
 
  I have published a quick example here:
  http://pastebin.com/adhcEvRx
 
  My question is:
 
  I hope that i have done something wrong in the child-parsing  (as you
  can see, it goes down quite a few levels). Can anybody point me in the
  right direction so i can speed up the process?  I have been looking
  around for some examples, but nobody gives examples of such deep data
  indexing.
 
  PS: I know there are some bugs in the Xpath naming etc, but it is just
  a rough example :)
 
  --
   Best regards
  Tor Henning Ueland
 
 



 --
 Mvh
 Tor Henning Ueland



Re: Tips on recursive xml-parsing in dataConfig

2010-06-07 Thread Geert-Jan Brits
I'm guessing (I'm not familiar with the xml dataimport handler, but I am
pretty familiar with Xpath)
that your problem lies in having absolute xpath-queries, instead of relative
xpath queries to your parent node.

e.g: /DOK/TEKST/KAP is absolute ( the prefixed '/' tells it to be). Try
'KAP' instead.
The same for all xpaths deeper in the tree.

Geert-Jan

2010/6/7 Tor Henning Ueland tor.henn...@gmail.com

 Hi,

 I am doing some testing of dataimport to Solr from XML-documents with
 many children in the children. To parse the children some levels
 down using Xpath goes fine, but the speed is very slow. (~1 minute per
 document, on a quad Xeon server). When i do the same using the format
 solr wants it, the parsing time is 0.02 seconds per document.

 I have published a quick example here:
 http://pastebin.com/adhcEvRx

 My question is:

 I hope that i have done something wrong in the child-parsing  (as you
 can see, it goes down quite a few levels). Can anybody point me in the
 right direction so i can speed up the process?  I have been looking
 around for some examples, but nobody gives examples of such deep data
 indexing.

 PS: I know there are some bugs in the Xpath naming etc, but it is just
 a rough example :)

 --
 Best regards
 Tor Henning Ueland



Re: exclude docs with null field

2010-06-04 Thread Geert-Jan Brits
Additionally, I should have mentioned that you can instead do:
fq=field_3:[* TO *], which uses the filtercache.

The method presented by Chris will probably outperform the above method but
only on the first request, from then on the filtercache takes over.
From a performance standpoint it's probably not worth going the 'default
value for null-approach' imho.
It IS useful however if you want to be able to query on docs with a
null-value (instead of excluding them)


2010/6/4 bluestar sea...@butterflycluster.net

 nice one! thanks.

 
  i could be wrong but it seems this
  way has a performance hit?
 
  or i am missing something?
 
  Did you read Chris's message in http://search-lucene.com/m/1o5mEk8DjX1/
  He proposes alternative (more efficient) way other than [* TO *]
 
 
 
 





Re: MultiValue Exclusion

2010-06-04 Thread Geert-Jan Brits
I guess the following works.

A. similar to your option 2, but using the filtercache
fq=-item_id:001 -item_id:002

B. similar to your option 3, but using the filtercache
fq=-users_excluded_field:userid

the advantage being that the filter is cached independently from the rest of
the query so it can be reused efficiently.

adv A over B: the 'muted news items' can be queried dynamically, i.e. they
aren't set in stone at index time.
B will probably perform a little bit better the first time (when not
cached), but I'm not sure.

hope that helps,
Geert-Jan


2010/6/4 homerlex homerlex.nab...@gmail.com


 How would you model this?

 We have a table of news items that people can view in their news stream and
 comment on.  Users have the ability to mute item so they never see them
 in
 their feed or search results.

 From what I can see there are a couple ways to accomplish this.

 1 - Post process the results and do not render any muted news items.  The
 downside of the pagination become problematic.  Its possible we may forgo
 pagination because of this but for now assume that pagination is a
 requirement.

 2 - Whenever we query for a given user we append a clause that excludes all
 muted items.  I assume in Solr we'd need to do something like -item_id(1
 AND
 2 AND 3).  Obviously this doesn't scale very well.

 3 - Have a multi-valued property in the index that contains all ids of
 users
 who have muted the item.  Being new to Solr I don't even know how (or if
 its
 possible) to run a query that says user id is not in this multivalued property.
 Can this even be done (sample query please)?  Again, I know this doesn't
 scale very well.

 Any other suggestions?

 Thanks in advance for the help.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/MultiValue-Exclusion-tp870173p870173.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Regarding Facet Date query using SolrJ -- Not getting any examples to start with.

2010-06-02 Thread Geert-Jan Brits
Hi Ninad,

SolrQuery q = new SolrQuery();
q.setQuery("*:*");
q.setFacet(true);
q.set("facet.date", "pub");
q.set("facet.date.start", "2000-01-01T00:00:00Z");
... etc.

basically you can completely build your entire query with the 'raw' set (and
add) methods.
The specific methods are just helpers.

So this is the same as above:

SolrQuery q = new SolrQuery();
q.set("q", "*:*");
q.set("facet", "true");
q.set("facet.date", "pub");
q.set("facet.date.start", "2000-01-01T00:00:00Z");
... etc.
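For illustration, those raw parameters are nothing more than URL query
parameters; a sketch of assembling the query string directly (values copied
from the question's URL; the helper class name is made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FacetDateUrl {
    // Build the same request as the SolrJ snippet, as a raw query string.
    public static String build() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", "*:*");
        p.put("facet", "true");
        p.put("facet.date", "pub");
        p.put("facet.date.start", "2000-01-01T00:00:00Z");
        p.put("facet.date.end", "2010-01-01T00:00:00Z");
        p.put("facet.date.gap", "%2B1YEAR"); // URL-encoded "+1YEAR"
        return p.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&", "?", ""));
    }
}
```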


Geert-Jan

2010/6/2 Ninad Raut hbase.user.ni...@gmail.com

 Hi,

 I want to hit the query given below :


 ?q=*:*&facet=true&facet.date=pub&facet.date.start=2000-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1YEAR

 using SolrJ. I am browsing the net but not getting any clues about how
 I should approach it. How can the SolrJ API be used to create the above
 mentioned query?

 Regards,
 Ninad R



Re: Interleaving the results

2010-06-01 Thread Geert-Jan Brits
Indeed, it's just a matter of ordering the results on the client-side IFF I
infer correctly from your description that you are guaranteed to get results
from enough different customers from Solr in the first place to do the
interleaving that you describe. (In general this is a pretty big IF).

So assuming that's the case, you just make sure to return the customerid as
part of the solr-result (make sure the customerid is stored) (or get the
customerid through other means, e.g. look it up in a db based on the id of
the doc returned).
Finally, simply code the interleaving (for example: throw the results in
something like Map<customerid, List<docid>> and iterate the map, so you get
the first element of each list, then the 2nd, etc...)
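That Map-and-iterate idea can be sketched as a simple round-robin
(hypothetical customer/doc ids; assumes the map's iteration order is the
order you want customers to appear in, e.g. a LinkedHashMap):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Interleaver {
    // Round-robin over per-customer result lists: first doc of each
    // customer, then the second of each, and so on, until all lists
    // are exhausted.
    public static List<String> interleave(Map<String, List<String>> byCustomer) {
        List<String> out = new ArrayList<>();
        int i = 0;
        boolean added = true;
        while (added) {
            added = false;
            for (List<String> docs : byCustomer.values()) {
                if (i < docs.size()) {
                    out.add(docs.get(i));
                    added = true;
                }
            }
            i++;
        }
        return out;
    }
}
```

With {c1: [a1, a2], c2: [b1]} this yields [a1, b1, a2], so no customer
repeats before every other customer has appeared once.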



2010/6/1 NarasimhaRaju rajux...@yahoo.com

 Can some body throw some ideas, on how to achieve (interleaving) from with
 in the application especially in a distributed setup?


  “ There are only 10 types of people in this world:-
 Those who understand binary and those who don’t “


 Regards,
 P.N.Raju,




 
 From: Lance Norskog goks...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Sat, May 29, 2010 3:04:46 AM
 Subject: Re: Interleaving the results

 There is no interleaving tool. There is a random number tool. You will
 have to achieve this in your application.

 On Fri, May 28, 2010 at 8:23 AM, NarasimhaRaju rajux...@yahoo.com wrote:
  Hi,
  how to achieve custom ordering of the documents when there is a general
 query?
 
  Usecase:
  Interleave documents from different customers one after the other.
 
  Example:
  Say i have 10 documents in the index belonging to 3 customers
 (customer_id field in the index ) and using query *:*
  so all the documents in the results score the same.
  but i want the results to be interleaved
   one document from each customer should appear before a document from
  the same customer repeats?
 
  is there a way to achieve this ?
 
 
  Thanks in advance
 
  R.
 
 
 
 



 --
 Lance Norskog
 goks...@gmail.com







Re: Sites with Innovative Presentation of Tags and Facets

2010-05-31 Thread Geert-Jan Brits
NP ;-) .

Just to explain:

With tooltips I meant JS tooltips (not the native webbrowser tooltips);
since sliders require JS anyway, presenting additional info in a JS tooltip
on drag doesn't limit the nr of people able to view it.

I think this is ok from a usability standpoint since I don't consider the
'nr of items left' info 100% essential (after all lots of sites do well
without it at the moment).
Call it graceful degradation ;-)

As for mobile, I never realized that 'hover' is an issue on mobile, but
drag is supported on mobile touch displays...

Moreover, having a navigationally complex site like kayak.com /
tripadvisor.com work well on mobile (from a usability perspective) is
pretty much a utopia anyway.
For these types of sites, specialized mobile sites (or apps as is the case
for the above brands) are the way to go in my opinion.

Geert-Jan


2010/5/28 Mark Bennett mbenn...@ideaeng.com

 Haha!  Important tooltips are now deprecated in Web Applications.

 This is nothing official, of course.

 But it's being advised to avoid important UI tasks that require cursor
 tracking, mouse-over, hovering, etc. in web applications.

 Why?  Many touch-centric mobile devices don't support hover.  For me, I'm
 used to my laptop where the touch pad or stylus *is* able to measure the
 pressure.  But the finger-based touch devices generally can't differentiate
 it, I guess.

 They *can* tell one gesture from another, but only by looking at the timing
 and shape.  And hapless hover ain't one of them.

 With that said, I'm still a fan of Tool Tips in desktop IDE's like Eclipse,
 or even on Web applications when I'm on a desktop.

 I guess the point is that, if it's a really important thing, then you need
 to expose it in another way on mobile.

 Just passing this on, please don't shoot the messenger.  ;-)

 Mark

 --
 Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
 Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


 On Thu, May 27, 2010 at 2:55 PM, Geert-Jan Brits gbr...@gmail.com wrote:

  Perhaps you could show the 'nr of items left' as a tooltip of sorts when
  the
  user actually drags the slider.
   If the user doesn't drag (or hovers over) the slider, 'nr of items left'
   isn't shown.
 
  Moreover, initially a slider doesn't limit the results so 'nr of items
  left'
  shown for the slider would be the same as the overall number of items
 left
  (thereby being redundant)
 
   I must say I haven't seen this implemented but it would be rather easy
   to adapt a slider implementation to show the nr on drag / hover.  (they
   exist for jQuery, script.aculo.us and a bunch of other libs)
 
  Geert-Jan
 
  2010/5/27 Lukas Kahwe Smith m...@pooteeweet.org
 
  
   On 27.05.2010, at 23:32, Geert-Jan Brits wrote:
  
Something like sliders perhaps?
Of course only numerical ranges can be put into sliders. (or a
 concept
   that
    may be logically presented as some sort of ordering, such as bad, hmm,
 hmm,
good, great
   
Use Solr's Statscomponent to show the min and max values
   
Have a look at tripadvisor.com for good uses/implementation of
 sliders
(price, and reviewscore are presented as sliders)
my 2c: try to make the possible input values discrete (like at
   tripadvisor)
which gives a better user experience and limits the potential nr of
   queries
(cache-wise advantage)
  
  
    yeah i have been pondering something similar. but i now realized that
    this way the user doesn't get an overview of the distribution without
    actually applying the filter. that being said, it would be nice to
    display 3 numbers with the sliders: the count of items that were
    filtered out on the lower and upper boundaries as well as the number
    of items still left (*).
  
   aside from this i just put a little tweak to my facetting online:
   http://search.un-informed.org/search?q=malariatm=anys=Search
  
    if you deselect any of the checkboxes, it updates the counts. however i
    display both the count without and with those additional checkbox
    filters applied (actually i only display two numbers if they are not
    the same):
   http://screencast.com/t/MWUzYWZkY2Yt
  
   regards,
   Lukas Kahwe Smith
   m...@pooteeweet.org
  
   (*) if anyone has a slider that can do the above i would love to
  integrate
   that and replace the adoption year checkboxes with that
 



Re: Sites with Innovative Presentation of Tags and Facets

2010-05-31 Thread Geert-Jan Brits
Interesting..

say you have a double slider with a discrete range (like tripadvisor et.al.)
perhaps it would be a good guideline to use these discrete points for the
quantum interval for the sparkline as well?

Of course it then becomes the question which discrete values to use for the
slider. I tend to follow what tripadvisor does for its price-slider:
set a cap for the max price, and set a fixed interval ($25) for the discrete
steps. (of course there are edge cases like when no product hits the maximum
capped price)

I have also seen non-linear steps implemented, but I guess this doesn't go
well with the notion of sparklines.


Anyway, from a implementation standpoint it would be enough for Solr to
return the 'nr of items' per interval. From that, it would be easy to
calculate on the application-side the 'nr of items' for each possible
slider-combination.

getting these values from solr would require (staying with the
price-example):
- a new discretised price field. And doing a facet.field.
- the (continuous) price field already present, and doing 50 facet queries (if
you have 50 steps)
- another more elegant way ;-) . Perhaps an addition to statscomponent that
returns all counts within a discrete (to be specified) step?  Would this
slow the statscomponent-code down a lot, or is the info already (almost)
present in statscomponent for doing things such as calculating stddev / means,
etc?
- something I'm completely missing...
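To illustrate the application-side half of this: once per-interval counts are back from Solr (however they were obtained), a prefix sum answers any slider combination with no further queries. The interval counts and the $25 step below are made-up numbers, purely for illustration.

```python
from itertools import accumulate

# Hypothetical facet counts per $25 price interval: [0-25), [25-50), ...
interval_counts = [4, 10, 25, 7, 3]
step = 25

# Prefix sums let us answer any [lo, hi) slider combination in O(1).
prefix = [0] + list(accumulate(interval_counts))

def items_in_range(lo, hi):
    """Count of items between slider stops lo and hi (multiples of step)."""
    return prefix[hi // step] - prefix[lo // step]

print(items_in_range(25, 100))  # intervals [25-50), [50-75), [75-100) -> 42
```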




2010/5/28 Chris Hostetter hossman_luc...@fucit.org


 : Perhaps you could show the 'nr of items left' as a tooltip of sorts when
 the
 : user actually drags the slider.

 Years ago, when we were first working on building Solr, a coworker of mine
 suggested using double bar sliders (ie: pick a range using a min and a
 max) for all numeric facets and putting sparklines above them to give
 the user a visual indication of the spread of documents across the
 numeric spectrum.

 it was a little more complicated than anything we needed -- and seemed
 like a real pain in the ass to implement.  i still don't know of anyone
 doing anything like that, but it's definitely an interesting idea.

 The hard part is really just deciding what quantum interval you want
 to use along the x axis to decide how to count the docs for the y axis.

 http://en.wikipedia.org/wiki/Sparkline
 http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR


 -Hoss




Re: Sites with Innovative Presentation of Tags and Facets

2010-05-31 Thread Geert-Jan Brits
May I ask how you implemented getting the facet counts for each interval? Do
you use a facet-query per interval?
And perhaps for inspiration a link to the site you implemented this ..

Thanks,
Geert-Jan

I love the idea of a sparkline at range-sliders. I think if I have time, I
 might add them to the range sliders on our site. I already have all the data
 since I show the count for a range while the user is dragging by storing the
 facet counts for each interval in javascript.



Re: Sites with Innovative Presentation of Tags and Facets

2010-05-27 Thread Geert-Jan Brits
Something like sliders perhaps?
Of course only numerical ranges can be put into sliders. (or a concept that
may be logically presented as some sort of ordering, such as bad, hmm,
good, great)

Use Solr's Statscomponent to show the min and max values

Have a look at tripadvisor.com for good uses/implementation of sliders
(price, and reviewscore are presented as sliders)
my 2c: try to make the possible input values discrete (like at tripadvisor)
which gives a better user experience and limits the potential nr of queries
(cache-wise advantage)
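As a rough sketch of the StatsComponent suggestion above (the field name price is an assumption), the request for the slider's min/max bounds could be built like this:

```python
from urllib.parse import urlencode

# Hypothetical Solr request asking StatsComponent for stats (incl. min/max)
# of a 'price' field, to be used as the slider's bounds.
params = {
    "q": "*:*",
    "rows": 0,            # we only need the stats, not documents
    "stats": "true",
    "stats.field": "price",
}
print("/select?" + urlencode(params))
```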

Cheers,
Geert-Jan

2010/5/27 Mark Bennett mbenn...@ideaeng.com

 I'm a big fan of plain old text facets (or tags), displayed in some logical
 order, perhaps with a bit of indenting to help convey context. But as you
 may have noticed, I don't rule the world.  :-)

  Suppose you took the opposite approach, rendering facets in non-traditional
 ways, that were still functional, and not ugly.

  Are there any public sites that come to mind that are displaying facets,
 tags, clusters, taxonomies or other navigators in really innovative ways?
  And what you liked / didn't like?

 Right now I'm just looking for examples of what's been tried.  I suppose
 even bad examples might be educational.

 My future ideal wish list:
 * Stays out of the way (of casual users)
 * Looks clean and cool (to the power users)
I'm thinking for example a light gray chevron  that casual users
 don't notice,
but when you click on it, cool things come up?
 * Probably that does not require Flash or SilverLight (just to avoid the
 whole platform wars)
I guess that means Ajax or HTML5
 * And since I'm doing pie in the sky, can be made to look good on desktops
 and mobile

 Some examples to get the ball rolling:

 StackOverflow, Flickr and YouTube, Clusty(now Yippy) are all nice, but a
 bit
 pedestrian for my mission today.
 (grokker was cool too)

 Lucid has done a nice job with Facets and Solr:
 http://www.lucidimagination.com/search/
 And although I really like it, it's not a flashy enough specimen for what
 I'm hunting today.
 (and they should thread the actual results list)

 I did some mockups of 2.0 style search navigators a couple years back:

 http://www.ideaeng.com/tabId/98/itemId/115/Search-20-in-the-Enterprise-Moving-Beyond-Singl.aspx
 Though these were intentionally NOT derived from specific web sites.

 Digg has done some cool stuff, for example:
 http://labs.digg.com/365/
 http://labs.digg.com/arc/
 http://labs.digg.com/stack/
 But for what I'm after, these are a bit too far off of the searching for
 something in particular track.

 Google Image Swirl and Similar Images are interesting, but for images.
 Lots of other cool stuff at labs.google.com

 Amazon, NewEgg, etc are all fine, but again text based.

 TouchGraph has some cool stuff, though very non-linear (many others on this
 theme)
 http://www.touchgraph.com/TGGoogleBrowser.html
 http://www.touchgraph.com/navigator.html


 Cool articles on the subject: (some examples now offline)
 http://www.cs.umd.edu/class/spring2005/cmsc838s/viz4all/viz4all_a.html



 --
 Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
 Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513



Re: Sites with Innovative Presentation of Tags and Facets

2010-05-27 Thread Geert-Jan Brits
Perhaps you could show the 'nr of items left' as a tooltip of sorts when the
user actually drags the slider.
If the user doesn't drag (or hovers over ) the slider 'nr of items left'
isn't shown.

Moreover, initially a slider doesn't limit the results so 'nr of items left'
shown for the slider would be the same as the overall number of items left
(thereby being redundant)

I must say I haven't seen this been implemented but it would be rather easy
to adapt a slider implementation, to show the nr on drag/hover.  (they exist
for jquery, scriptaculous and a bunch of other libs)

Geert-Jan

2010/5/27 Lukas Kahwe Smith m...@pooteeweet.org


 On 27.05.2010, at 23:32, Geert-Jan Brits wrote:

  Something like sliders perhaps?
  Of course only numerical ranges can be put into sliders. (or a concept
 that
  may be logically presented as some sort of ordering, such as bad, hmm,
  good, great)
 
  Use Solr's Statscomponent to show the min and max values
 
  Have a look at tripadvisor.com for good uses/implementation of sliders
  (price, and reviewscore are presented as sliders)
  my 2c: try to make the possible input values discrete (like at
 tripadvisor)
  which gives a better user experience and limits the potential nr of
 queries
  (cache-wise advantage)


 yeah i have been pondering something similar. but i now realized that this
  way the user doesn't get an overview of the distribution without actually
 applying the filter. that being said, it would be nice to display 3 numbers
  with the sliders, the count of items that were filtered out on the lower and
 upper boundaries as well as the number of items still left (*).

 aside from this i just put a little tweak to my facetting online:
  http://search.un-informed.org/search?q=malaria&tm=any&s=Search

 if you deselect any of the checkboxes, it updates the counts. however i
 display both the count without and with those additional checkbox filters
 applied (actually i only display two numbers of they are not the same):
 http://screencast.com/t/MWUzYWZkY2Yt

 regards,
 Lukas Kahwe Smith
 m...@pooteeweet.org

 (*) if anyone has a slider that can do the above i would love to integrate
 that and replace the adoption year checkboxes with that


Re: Personalized Search

2010-05-21 Thread Geert-Jan Brits
Just want to throw this in: If you're worried about scaling, etc. you could
take a look at item-based collaborative filtering instead of user based.
i.e:
DO NIGHTLY/ BATCH:
- calculate the similarity between items based on their properties

DO ON EACH REQUEST
- have a user store/update its interest as a vector of item-properties. How
to update this based on click / browse behavior is the interesting thing and
depends a lot on your environment.
- Next is to recommend 'neighboring' items that are close to the defined
'interest-vector'.

The code is similar to user-based collab. filtering, but scaling is independent
of the number of users.

other advantages:
- new items/ products can be recommended as soon as they are added to the
catalog (no need for users to express interest in them before the item can
be suggested)

disadvantage:
- top-N results tend to be less dynamic than when using user-based collab.
filtering.
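To make the item-based idea above concrete, here is a toy sketch (item names, property vectors and the cosine measure are illustrative assumptions, not a Mahout or Solr API): similarities come from item properties only, so the per-request work does not depend on the number of users.

```python
import math

# Toy item-property vectors (hypothetical TV catalog); in practice these
# similarities would be precomputed in a nightly batch.
items = {
    "tv_a": {"size": 42, "hd": 1},
    "tv_b": {"size": 40, "hd": 1},
    "tv_c": {"size": 20, "hd": 0},
}

def cosine(u, v):
    """Cosine similarity between two sparse property vectors."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(interest_vector, n=2):
    """Rank catalog items by similarity to the user's interest vector."""
    scored = [(cosine(interest_vector, props), item) for item, props in items.items()]
    return [item for _, item in sorted(scored, reverse=True)[:n]]

# A user whose clicks suggest interest in large HD TVs:
print(recommend({"size": 41, "hd": 1}))
```

How the interest vector is updated from click/browse behavior is, as noted above, the application-specific part.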

Of course, this doesn't touch on how to integrate this with Solr. Perhaps
some combination with Mahout is indeed the best solution. I haven't given
this much thought yet I must say.
For info on Mahout Taste (+ an explanation on item-based filtering vs.
user-based filtering) see:
http://lucene.apache.org/mahout/taste.html

Cheers,
Geert-Jan

2010/5/21 Rih tanrihae...@gmail.com

 
  - keep the SOLR index independent of bought/like

 - have a db table with user prefs on a per item basis


 I have the same idea this far.

 at query time, specify boosts for 'my items' items


 I believe this works if you want to sort results by faved/not faved. But
 how
 does it scale if users already favorited/liked hundreds of items? The query
 can be quite long.

 Looking forward to your idea.



 On Thu, May 20, 2010 at 6:37 PM, dc tech dctech1...@gmail.com wrote:

  Another approach would be to do query time boosts of 'my' items under
  the assumption that count is limited:
  - keep the SOLR index independent of bought/like
  - have a db table with user prefs on a per item basis
  - at query time, specify boosts for 'my items' items
 
  We are planning to do this in the context of document management where
  documents in 'my (used/favorited ) folders' provide a boost factor
  to the results.
 
 
 
  On 5/20/10, findbestopensource findbestopensou...@gmail.com wrote:
   Hi Rih,
  
   Are you going to include either of the two fields bought or like per
   member/visitor OR a unique field per member/visitor?
  
   If only one or two common fields are included then there will not be
 any
   impact on performance. If you want to include a unique field then you
 need
  to
   consider multi value field otherwise you certainly hit the wall.
  
   Regards
   Aditya
   www.findbestopensource.com
  
  
  
  
   On Thu, May 20, 2010 at 12:13 PM, Rih tanrihae...@gmail.com wrote:
  
   Has anybody done personalized search with Solr? I'm thinking of
  including
   fields such as bought or like per member/visitor via dynamic
 fields
  to
   a
   product search schema. Another option is to have a multi-value field
  that
   can contain user IDs. What are the possible performance issues with
 this
   setup?
  
   Looking forward to your ideas.
  
   Rih
  
  
 
  --
  Sent from my mobile device
 



Re: seemingly impossible query

2010-05-20 Thread Geert-Jan Brits
Would each Id need to return a different doc?

If not:
you could probably use FieldCollapsing:
http://wiki.apache.org/solr/FieldCollapsing
i.e.: - collapse on listOfIds (see wiki entry for syntax)
- constrain the field to only return the id's you want, e.g.:
q=listOfIds:10 OR listOfIds:5 ... OR listOfIds:56

Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

 Thanks Darren,

 The problem with that is that it may not return one document per id, which
 is what I need.  IE, I could give 100 ids in that OR query and retrieve 100
 documents, all containing just 1 of the IDs.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:21 PM
 To: solr-user@lucene.apache.org
 Subject: Re: seemingly impossible query

 Ok. I think I understand. What's impossible about this?

 If you have a single field name called id that is multivalued
 then you can retrieve the documents with something like:

 id:1 OR id:2 OR id:56 ... id:100

 then add limit 100.

 There's probably a more succinct way to do this, but I'll leave that to
 the experts.

 If you also only want the documents within a certain time, then you also
 create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
 or something similar to this. Check the query syntax wiki for specifics.

 Darren


  Hey everyone,
 
  I've recently been given a requirement that is giving me some trouble. I
  need to retrieve up to 100 documents, but I can't see a way to do it
  without making 100 different queries.
 
  My schema has a multi-valued field like 'listOfIds'. Each document has
  between 0 and N of these ids associated to them.
 
  My input is up to 100 of these ids at random, and I need to retrieve the
  most recent document for each id (N Ids as input, N docs returned). I'm
  currently planning on doing a single query for each id, requesting 1 row,
  and caching the result. This could work OK since some of these ids should
  repeat quite often. Of course I would prefer to find a way to do this in
  Solr, but I'm not sure it's capable.
 
  Any ideas?
 
  Thanks,
  -Kallin Nagelberg
 




Re: seemingly impossible query

2010-05-20 Thread Geert-Jan Brits
Hi Kallin,

again please look at FieldCollapsing
(http://wiki.apache.org/solr/FieldCollapsing),
that should do the trick.
basically: first you constrain the field: 'listOfIds' to only contain docs
that contain any of the (up to) 100 random ids as you know how to do

Next, in the same query, specify to collapse on field 'listOfIds '
basically:
q=listOfIds:1 OR listOfIds:10 OR listOfIds:24
&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

this would return the top-matching doc for each id left in listOfIds. Since
you constrained this field by the ids specified you are left with 1 matching
doc for each id.

Again it is not guaranteed that all docs returned are different. Since you
didn't specify this as a requirement I think this will suffice.
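For illustration, that one-shot collapse query could be assembled like this (the timestamp sort field is an assumption standing in for "most recent"; the collapse parameter names follow the patch syntax shown in this thread):

```python
from urllib.parse import urlencode

# Hypothetical list of ids the caller wants one (most recent) document each for.
ids = [1, 10, 24]

params = {
    # Constrain to documents carrying any of the requested ids:
    "q": " OR ".join("listOfIds:%d" % i for i in ids),
    # Collapse so only the top document per id value remains:
    "collapse.field": "listOfIds",
    "collapse.threshold": 1,
    "collapse.type": "normal",
    "sort": "timestamp desc",  # assumption: a 'timestamp' field gives "most recent"
}
print("/select?" + urlencode(params))
```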

Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

 Yeah I need something like:
 (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that..

 I'm not sure how I can hit solr once. If I do try and do them all in one
 big OR query then I'm probably not going to get a hit for each ID. I would
 need to request probably 1000 documents to find all 100 and even then
 there's no guarantee and no way of knowing how deep to go.

 -Kallin Nagelberg

 -Original Message-
 From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
 Sent: Thursday, May 20, 2010 12:27 PM
 To: solr-user@lucene.apache.org
 Subject: RE: seemingly impossible query

 I see. Well, now you're asking Solr to ignore its prime directive of
 returning hits that match a query. Hehe.

 I'm not sure if Solr has a unique attribute.

 But this sounds, to me, like you will have to filter the results yourself.
 But at least you hit Solr only once before doing so.

 Good luck!

  Thanks Darren,
 
  The problem with that is that it may not return one document per id,
 which
  is what I need.  IE, I could give 100 ids in that OR query and retrieve
  100 documents, all containing just 1 of the IDs.
 
  -Kallin Nagelberg
 
  -Original Message-
  From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
  Sent: Thursday, May 20, 2010 12:21 PM
  To: solr-user@lucene.apache.org
  Subject: Re: seemingly impossible query
 
  Ok. I think I understand. What's impossible about this?
 
  If you have a single field name called id that is multivalued
  then you can retrieve the documents with something like:
 
  id:1 OR id:2 OR id:56 ... id:100
 
  then add limit 100.
 
  There's probably a more succinct way to do this, but I'll leave that to
  the experts.
 
  If you also only want the documents within a certain time, then you also
  create a time field and use a conjunction (id:0 ...) AND time:NOW-1H
  or something similar to this. Check the query syntax wiki for specifics.
 
  Darren
 
 
  Hey everyone,
 
  I've recently been given a requirement that is giving me some trouble. I
  need to retrieve up to 100 documents, but I can't see a way to do it
  without making 100 different queries.
 
  My schema has a multi-valued field like 'listOfIds'. Each document has
  between 0 and N of these ids associated to them.
 
  My input is up to 100 of these ids at random, and I need to retrieve the
  most recent document for each id (N Ids as input, N docs returned). I'm
  currently planning on doing a single query for each id, requesting 1
  row,
  and caching the result. This could work OK since some of these ids
  should
  repeat quite often. Of course I would prefer to find a way to do this in
  Solr, but I'm not sure it's capable.
 
  Any ideas?
 
  Thanks,
  -Kallin Nagelberg
 
 
 




Re: limit rows by field

2010-04-13 Thread Geert-Jan Brits
I believe you're talking about FieldCollapsing.
It's available as a patch, although I'm not sure how well it applies to the
current trunk.

for more info check out:
http://wiki.apache.org/solr/FieldCollapsing

Geert-Jan

2010/4/13 Felix Zimmermann feliz...@gmx.de

 Hi,

 for a preview of results, I need to display up to 3 documents per
 category. Is it possible to limit the number of rows of solr response by
 field-values? What I mean is:

 rows: 9
 -(sub)rows of field:cat1 : 3
 -(sub)rows of field:cat2 : 3
 -(sub)rows of field:cat3 : 3

 If not, is there a workaround or do I have to send three queries?

 Thanks!
 felix




Re: Impossible Boost Query?

2010-03-25 Thread Geert-Jan Brits
Have a look at functionqueries.

http://wiki.apache.org/solr/FunctionQuery

You could for instance use your
regular score and multiply it with RandomValueSource bound between 1.0 and
1.1 for example.
This would at least break ties in a possibly natural looking manner.  (btw:
this would still influence all documents however)
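To illustrate the tie-breaking effect in plain Python (not the Solr function-query syntax): multiplying each score by a random factor in [1.0, 1.1] shuffles near-ties while leaving clearly separated scores in their original order.

```python
import random

random.seed(7)  # deterministic for the example

def jitter(score):
    # Multiply by a random factor in [1.0, 1.1]: exact ties break randomly,
    # while scores more than ~10% apart keep their relative order.
    return score * random.uniform(1.0, 1.1)

docs = {"a": 5.0, "b": 5.0, "c": 1.0}
ranked = sorted(docs, key=lambda d: jitter(docs[d]), reverse=True)
print(ranked)  # 'c' is always last; the 'a'/'b' order depends on the seed
```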

//Geert-Jan

2010/3/26 Blargy zman...@hotmail.com


 Ok so this is basically just a random sort.

  Is there any way I can get this to randomly sort documents that are closely
  related, and not the rest of the results?
 --
 View this message in context:
 http://n3.nabble.com/Impossible-Boost-Query-tp472080p580214.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Multi Select Facets through Java API

2010-03-22 Thread Geert-Jan Brits
something like this?

q=mainquery&fq={!tag=carfq}cars:corvette OR
cars:camaro&facet=on&facet.field={!ex=carfq key=carfacet}cars

-the facet: carfacet is independent of the filter query that filters on cars.
-you construct the filter query (fq={!tag=carfq}cars:corvette OR
cars:camaro) yourself in your application layer.
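For illustration, building that request in the application layer might look like this (field, tag and key names taken from the example above; the selected values are placeholders):

```python
from urllib.parse import urlencode

# Selected values for the multi-select 'cars' facet (hypothetical).
selected = ["corvette", "camaro"]

params = [
    ("q", "mainquery"),
    # Tag the filter so the facet below can exclude it:
    ("fq", "{!tag=carfq}" + " OR ".join("cars:%s" % v for v in selected)),
    ("facet", "on"),
    # Compute the 'cars' facet as if the carfq filter were not applied:
    ("facet.field", "{!ex=carfq key=carfacet}cars"),
]
print("/select?" + urlencode(params))
```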

perhaps a disadvantage is that you get a lot of different filter queries
which are all independently cached... I don't see any other way at the
moment though..

Geert-Jan



2010/3/22 homerlex nab...@mlecza.newnetco.com


 bump - anyone?
 --
 View this message in context:
 http://old.nabble.com/Multi-Select-Facets-through-Java-API-tp27951014p27986301.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Will Solr fit our needs?

2010-03-17 Thread Geert-Jan Brits
If you don't plan on filtering/sorting and/or faceting on fast-changing
fields it would be better to store them outside of solr/lucene in my
opinion.

If you must: for indexing-performance reasons you will probably end up with
maintaining separate indices (1 for slow-changing/static fields and 1 for
fast-changing fields).
You frequently commit the fast-changing -index to incorporate the changes
in current_price. Afterwards you have 2 options I believe:

1. use ParallelReader to query the separate indices directly. Afaik, this is
not (completely) integrated in Solr... I wouldn't recommend it.
2. after you commit the fast-changing-index, merge with the static-index.
You're left with 1 fresh index, which you can push to your slave-servers.
(all this in regular interverals)

Disadvantages:
- In any way, you must be very careful with maintaining multiple parallel
indexes with the purpose of treating them as one. For instance document
inserts must be done exactly in the same order, otherwise the indices go
'out-of-sync' and are unusable.
- higher maintenance
- there is always a time-window in which the current_price values are stale.
If that's within reqs that's ok.

The other path, which I recommend, would be to store the current_price
outside of solr (like you're currently doing) but instead of using a
relational db, try looking into persistent key-value stores. Many of them
exist and a lot of progress has been made in the last couple of years. For
simple key-lookups (what you need as far as I can tell) they really blow
every relational db out of the water (considering the same hardware of
course)

We're currently using Tokyo Cabinet with the server-frontend Tokyo Tyrant
and seeing almost a 5x increased in lookup performance compared to our
previous kv-store MemcacheDB which is based on BerkeleyDB. MemcacheDB was
already several times faster than our mysql-setup (although not optimally
tuned) .

to sum things up: use the best tools for what they were meant to do.

- index/search -- solr/ lucene without a doubt.

- kv-lookup -- consensus is still forming, and a lot of players (with a lot
of different types of functionality) but if all you need is simple
key-value-lookup, I would go for Tokyo Cabinet (TC) / Tyrant at the moment.
 Please note that TC and competitors aren't just some code/ hobby projects
but are usually born out of a real need at huge websites / social networks
such as TC which is born from mixi  (big social network in Japan) . So at
least you're in good company..

for kv-stores I would suggest to begin your research at:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
(beginning
2009)
http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores (half
2009)
and get a feel of the kv-playing field.

Hope this (pretty long) post helps,
Geert-Jan


2010/3/17 Krzysztof Grodzicki krzysztof.grodzi...@iterate.pl

 Hi Moritz,

 You can take a look on the project ZOIE -
 http://code.google.com/p/zoie/. I think it's that what are you looking
 for.

 br
 Krzysztof

 On Wed, Mar 17, 2010 at 9:49 AM, Moritz Mädler m...@moritz-maedler.de
 wrote:
  Hi List,
 
  we are running a marketplace which has comparable functionality to
 ebay (auctions, fixed-price items etc).
  The items are placed on the market by users who want to sell their goods.
 
  Currently we are using Sphinx as an indexing engine, but, as Sphinx
 returns only document ids we have to make a
  database-query to fetch the data to display. This massively decreases
 performance as we have to do two requests to
  display data.
 
  I heard that Solr is able to return a complete dataset and we hope a
 switch to Solr can boost performance.
  A critical question is left and i was not able to find a solution for it
 in the docs: Is it possible to update attributes directly in the
  index?
  An example for better illustration:
  We have an index which holds all the auctions (containing auctionid,
 auction title) with its current prices(field: current_price). When a user
 places a new bid,
  is it possible to update the attribute 'current_price' directly in the
 index so that we can fetch the current_price from Solr and not from the
 database?
 
  I hope you understood my problem. It would be kind if someone can point
 me to the right direction.
 
  Thanks alot!
 
  Moritz



Re: Implementing hierarchical facet

2010-03-03 Thread Geert-Jan Brits
you could always define 1 dynamicfield and encode the hierarchy level in the
fieldname:

dynamicField name=_loc_hier_* type=string stored=false indexed=true
omitNorms=true/
using:
facet=on&facet.field={!key=Location}_loc_hier_city&fq=_loc_hier_country:somecountryid
...
adding cityarea later for instance would be as simple as:
facet=on&facet.field={!key=Location}_loc_hier_cityarea&fq=_loc_hier_city:somecityid
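A small sketch of building those parameters for an arbitrary level (the level and value names are placeholders; the field-name prefix follows the dynamicField pattern above):

```python
# Encode the hierarchy level in the dynamic-field name, so adding a new level
# (e.g. 'cityarea') needs no schema change beyond the one dynamicField.
def facet_params(child_level, parent_level=None, parent_value=None):
    """Facet on the child level, optionally filtered by a selected parent."""
    params = [
        ("facet", "on"),
        ("facet.field", "{!key=Location}_loc_hier_%s" % child_level),
    ]
    if parent_level is not None:
        params.append(("fq", "_loc_hier_%s:%s" % (parent_level, parent_value)))
    return params

print(facet_params("city", "country", "somecountryid"))
```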

Cheers,
Geert-Jan


2010/3/3 Andy angelf...@yahoo.com

 Thanks. I didn't know about the {!key=Location} trick.

 Thanks everyone for your help. From what I could gather, there're 3
 approaches:

 1) SOLR-64
 Pros:
 - can have arbitrary levels of hierarchy without modifying schema
 Cons:
 - each combination of all the levels in the hierarchy will result in a
 separate filter cache. This number could be huge, which would lead to poor
 performance

 2) SOLR-792
 Pros:
 - each level of the hierarchy separately results in filter cache. Much
 smaller number of filter cache. Better performance.
 Cons:
 - Only 2 levels are supported

 3) Separate fields for each hierarchy levels
 Pros:
 - same as SOLR-792. Good performance
 Cons:
 - can only handle a fixed number of levels in the hierarchy. Adding any
 levels beyond that requires schema modification

 Does that sound right?

 Option 3 is probably the best match for my use case. Is there any trick to
 make it able to deal with arbitrary number of levels?

 Thanks.

 --- On Tue, 3/2/10, Geert-Jan Brits gbr...@gmail.com wrote:

 From: Geert-Jan Brits gbr...@gmail.com
 Subject: Re: Implementing hierarchical facet
 To: solr-user@lucene.apache.org
 Date: Tuesday, March 2, 2010, 8:02 PM

 Using Solr 1.4: even less changes to the frontend:

 facet=on&facet.field={!key=Location}countryid
 ...
 facet=on&facet.field={!key=Location}cityid&fq=countryid:somecountryid
 etc.

 will consistently render the resulting facet under the name Location .


 2010/3/3 Geert-Jan Brits gbr...@gmail.com

  If it's a requirement to let Solr handle the facet-hierarchy please
  disregard this post, but
  an alternative would be to have your App control when to ask for which
  'facet-level' (e.g: country, state, city) in the hierarchy.
 
  as follows,
 
  each doc has 3 separate fields (indexed=true, stored=false):
  - countryid
  - stateid
  - cityid
 
  facet on country:
  facet=on&facet.field=countryid
 
  facet on state ( country selected. functionally you probably don't want
 to
  show states without the user having selected a country anyway)
  facet=on&facet.field=stateid&fq=countryid:somecountryid
 
  facet on city (state selected, same functional analogy as above)
  facet=on&facet.field=cityid&fq=stateid:somestateid
 
  or
 
  facet on city (countryselected, same functional analogy as above)
  facet=on&facet.field=cityid&fq=countryid:somecountryid
 
  grab the resulting facet and drop it under Location
 
  pros:
  - reusing fq's (good performance; I've never used hierarchical facets,
 but
  would be surprised if they gave a (major) speed advantage over this method)
  - flexible (you get multiple hierarchies: country -- state -- city and
  country -- city)
 
  cons:
  - a little more application logic
 
  Hope that helps,
  Geert-Jan
 
 
 
 
 
  2010/3/2 Andy angelf...@yahoo.com
 
  I read that a simple way to implement hierarchical facet is to
 concatenate
  strings with a separator. Something like level1>level2>level3 with >
 as
  the separator.
 
  A problem with this approach is that the number of facet values will
  greatly increase.
 
  For example I have a facet Location with the hierarchy
  country>state>city. Using the above approach every single city will lead
 to
  a separate facet value. With tens of thousands of cities in the world
 the
  response from Solr will be huge. And then on the client side I'd have to
  loop through all the facet values and combine those with the same
 country
  into a single value.
 
  Ideally Solr would be aware of the hierarchy structure and send back
  responses accordingly. So at level 1 Solr will send back facet values
 based
  on country (100 or so values). Level 2 the facet values will be based on
 the
  states within the selected country (a few dozen values). Next level will
 be
  cities within that state. and so on.
 
  Is it possible to implement hierarchical facet this way using Solr?
 
 
 
 
 
 
 







Re: Implementing hierarchical facet

2010-03-02 Thread Geert-Jan Brits
If it's a requirement to let Solr handle the facet-hierarchy please
disregard this post, but
an alternative would be to have your App control when to ask for which
'facet-level' (e.g: country, state, city) in the hierarchy.

as follows,

each doc has 3 separate fields (indexed=true, stored=false):
- countryid
- stateid
- cityid

facet on country:
facet=on&facet.field=countryid

facet on state ( country selected. functionally you probably don't want to
show states without the user having selected a country anyway)
facet=on&facet.field=stateid&fq=countryid:somecountryid

facet on city (state selected, same functional analogy as above)
facet=on&facet.field=cityid&fq=stateid:somestateid

or

facet on city (countryselected, same functional analogy as above)
facet=on&facet.field=cityid&fq=countryid:somecountryid

grab the resulting facet and drop it under Location

pros:
- reusing fq's (good performance; I've never used hierarchical facets, but
would be surprised if they gave a (major) speed advantage over this method)
- flexible (you get multiple hierarchies: country -- state -- city and
country -- city)

cons:
- a little more application logic
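The "app controls the drill-down" idea can be sketched as a small helper (a hypothetical function; field names follow the example above, and the linear country -> state -> city order is one of the possible hierarchies):

```python
# Drill-down order for the location hierarchy (field names are assumptions).
LEVELS = ["countryid", "stateid", "cityid"]

def next_facet(selections):
    """Given what the user has picked so far, return (facet field, fq list)."""
    fqs = ["%s:%s" % (lvl, selections[lvl]) for lvl in LEVELS if lvl in selections]
    for lvl in LEVELS:
        if lvl not in selections:
            return lvl, fqs
    return None, fqs  # every level selected; nothing left to facet on

print(next_facet({"countryid": "somecountryid"}))
```

The country -> city shortcut mentioned above would just use a different LEVELS list.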

Hope that helps,
Geert-Jan





2010/3/2 Andy angelf...@yahoo.com

 I read that a simple way to implement hierarchical facet is to concatenate
 strings with a separator. Something like level1>level2>level3 with > as
 the separator.

 A problem with this approach is that the number of facet values will
 greatly increase.

 For example I have a facet Location with the hierarchy
 country>state>city. Using the above approach every single city will lead to
 a separate facet value. With tens of thousands of cities in the world the
 response from Solr will be huge. And then on the client side I'd have to
 loop through all the facet values and combine those with the same country
 into a single value.

 Ideally Solr would be aware of the hierarchy structure and send back
 responses accordingly. So at level 1 Solr will send back facet values based
 on country (100 or so values). Level 2 the facet values will be based on the
 states within the selected country (a few dozen values). Next level will be
 cities within that state. and so on.

 Is it possible to implement hierarchical facet this way using Solr?






Re: Implementing hierarchical facet

2010-03-02 Thread Geert-Jan Brits
Using Solr 1.4: even less changes to the frontend:

facet=on&facet.field={!key=Location}countryid
...
facet=on&facet.field={!key=Location}cityid&fq=countryid:somecountryid
etc.

will consistently render the resulting facet under the name Location .


2010/3/3 Geert-Jan Brits gbr...@gmail.com

 If it's a requirement to let Solr handle the facet-hierarchy please
 disregard this post, but
 an alternative would be to have your App control when to ask for which
 'facet-level' (e.g: country, state, city) in the hierarchy.

 as follows,

  each doc has 3 separate fields (indexed=true, stored=false):
 - countryid
 - stateid
 - cityid

 facet on country:
  facet=on&facet.field=countryid

 facet on state ( country selected. functionally you probably don't want to
 show states without the user having selected a country anyway)
  facet=on&facet.field=stateid&fq=countryid:somecountryid

 facet on city (state selected, same functional analogy as above)
  facet=on&facet.field=cityid&fq=stateid:somestateid

 or

 facet on city (countryselected, same functional analogy as above)
  facet=on&facet.field=cityid&fq=countryid:somecountryid

  grab the resulting facet and drop it under Location

 pros:
  - reusing fq's (good performance; I've never used hierarchical facets, but
  would be surprised if they gave a (major) speed advantage over this method)
 - flexible (you get multiple hierarchies: country -- state -- city and
 country -- city)

 cons:
 - a little more application logic

 Hope that helps,
 Geert-Jan





 2010/3/2 Andy angelf...@yahoo.com

 I read that a simple way to implement hierarchical facet is to concatenate
  strings with a separator. Something like level1>level2>level3 with > as
 the separator.

 A problem with this approach is that the number of facet values will
 greatly increase.

 For example I have a facet Location with the hierarchy
  country>state>city. Using the above approach every single city will lead to
 a separate facet value. With tens of thousands of cities in the world the
 response from Solr will be huge. And then on the client side I'd have to
 loop through all the facet values and combine those with the same country
 into a single value.

 Ideally Solr would be aware of the hierarchy structure and send back
 responses accordingly. So at level 1 Solr will send back facet values based
 on country (100 or so values). Level 2 the facet values will be based on the
 states within the selected country (a few dozen values). Next level will be
 cities within that state. and so on.

 Is it possible to implement hierarchical facet this way using Solr?








