Sorting by value of field

2011-06-29 Thread Judioo
Hi

Say I have a field type in multiple documents which can be either
type:bike
type:boat
type:car
type:van


and I want to order a search to give me documents in the following order

type:car
type:van
type:boat
type:bike

Is there a way I can do this just using the sort method?

Thanks


Re: Sorting by value of field

2011-06-29 Thread Judioo
Thanks,
Yes, this is the workaround I am currently using.
Still wondering if the sort method can be used alone.




On 29 June 2011 18:34, Michael Ryan mr...@moreover.com wrote:

 You could try adding a new int field (like typeSort) that has the desired
 sort values. So when adding a document with type:car, also add typeSort:1;
 when adding type:van, also add typeSort:2; etc. Then you could do
 sort=typeSort asc to get them in your desired order.

 I think this is also possible with custom function queries, but I've never
 done that.

 -Michael
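
For illustration, a minimal sketch of the typeSort approach described above; the
field name and sort values are assumptions rather than something taken from a
real schema:

  <!-- schema.xml: a companion integer field used only for ordering -->
  <field name="typeSort" type="int" indexed="true" stored="false"/>

  <!-- at index time, add the rank that matches the desired order -->
  <doc>
    <field name="type">car</field>
    <field name="typeSort">1</field>
  </doc>

  <!-- at query time -->
  /solr/select?q=*:*&sort=typeSort asc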



Replication without configs

2011-06-27 Thread Judioo
I have replicated a Solr instance without configs, as the slave has
its own config.


The replication has failed. My plan was to use replication to remove
the indexes I no longer wish to use which is why the slave has a
different schema.xml file.

Does anyone know why the replication has failed?
Thanks

Error below:

HTTP ERROR 500
Problem accessing /solr/select/. Reason:
null

java.lang.NullPointerException
at org.apache.solr.response.XMLWriter.writePrim(XMLWriter.java:828)
at org.apache.solr.response.XMLWriter.writeStr(XMLWriter.java:686)
at org.apache.solr.schema.StrField.write(StrField.java:49)
at org.apache.solr.schema.SchemaField.write(SchemaField.java:124)
at org.apache.solr.response.XMLWriter.writeDoc(XMLWriter.java:373)
at org.apache.solr.response.XMLWriter$3.writeDocs(XMLWriter.java:545)
at org.apache.solr.response.XMLWriter.writeDocuments(XMLWriter.java:482)
at org.apache.solr.response.XMLWriter.writeDocList(XMLWriter.java:519)
at org.apache.solr.response.XMLWriter.writeVal(XMLWriter.java:582)
at org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:131)
at org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:35)
at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:343)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:265)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
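
For reference, a minimal sketch of index-only replication (no confFiles entry,
so schema.xml and solrconfig.xml stay local to each side); the host name and
poll interval are placeholders, not values from the failing setup:

  <!-- solrconfig.xml on the master -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
    </lst>
  </requestHandler>

  <!-- solrconfig.xml on the slave -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>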


Do unused indexes affect performance?

2011-06-24 Thread Judioo
Hi,

As a proof of concept I have imported around 11 million documents into a Solr
index. My schema file has multiple fields defined:

<dynamicField name="*_id"    type="text"   indexed="true"  stored="true"/>
<dynamicField name="*_start" type="tdate"  indexed="true"  stored="true"/>
<dynamicField name="*_end"   type="tdate"  indexed="true"  stored="true"/>

<dynamicField name="*"       type="string" indexed="true"  stored="true"/>

The above are the most important for my question.

The average document has around 40 attributes. Each document has:

* a minimum of 2 tdate fields ( max of 10 )
* a minimum of 2 *_id fields, each containing a space-delimited list of ids
(e.g. 4de5656 q23ew9h)

The final dynamicField causes all fields within a document to be indexed.
This was done firstly to show the flexibility of Solr, and also because I did not
know which fields we would use to query / filter on. The total size of my
index is ~18GB.

However... we now know the fields we will be querying on.

I have 3 questions

1) Do unused indexes on the same dynamicField affect Solr's performance?
Our query will always be (type:book book_id:*). Will the presence of 4
million documents (type:location store_id:*) affect Solr's performance?
It sounds like an obvious yes, but that may not be the case.

2) Do unused dynamicField indexes affect Solr's performance?
All documents have an attribute version which is indexed as text, yet this
is never used in any queries. Does its existence ( in 11 million documents )
affect performance?

3) How does one improve query times against an index?
Once an index is built, is there a method to optimise the query analyzers, or
a method of removing unused indexes without rebuilding the entire index?

The latter is a very important one. We want to replace the current schema
with a more restrictive version. Most importantly

   <dynamicField name="*" type="string" indexed="true" stored="true" />

becomes

   <dynamicField name="*" type="string" indexed="false" stored="true" />


But this change alone does not cause the index to shrink. It would be lovely
if there was a method to re-analyze an index post import.

More than happy to be referred to related documentation.

I have read and considered
http://wiki.apache.org/solr/SolrPerformanceFactors
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed


But there may be some fluid knowledge held here which is undocumented.

Thank you in advance for any answers.
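
On question 3: since every field above is stored, one possible workaround (a
sketch of the usual re-index-from-stored-fields approach, not a built-in
re-analyze) is to page through the stored documents and re-post them to a core
running the new schema; the paths and page size below are just the defaults:

  /solr/select?q=*:*&start=0&rows=1000&fl=*&wt=json   (pull a page of stored documents)
  /solr/update?commit=true                            (re-post them as <add><doc>...</doc></add> against the new schema)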


Boost Strangeness

2011-06-18 Thread Judioo
WONDERFUL!
Just reporting back.
This document is ACE

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

For explaining what the filters are and how to affect the analyzer.

Erick, your statement "First, boosting isn't absolute" played on my mind, so
I continued to investigate boosting.

I found this document that ( at last ) explains the dismax logic

http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

The reason why I was not getting the order I required was due to:
A) my boost values were too close together.
B) similar ids in a document affected the score.


It seems that if a partial match is made, the product ( a % of the
total boost ) contributes to the document's score.
This meant that one type of document in the index had a higher
aggregate score because it had all but one of the boosted
fields ( it does not have parent_id ), and those fields were
populated with content that was *very* similar to the requested id.

for example

required id = b011mg62
X_id = b011mgsf

Due to the partial matching and the closeness of the boost ranges, this
type of document always acquired a higher score than another document
with just one matching field ( i.e. the id field ).

My solution was to increase the boosts on the fields I wanted to *really* count:

id^10 parent_id^5000 brand_container_id^500

As a result, even if there are similar matches in any field, the id and
parent_id matches should always receive a higher boost.


This was also useful
http://stackoverflow.com/questions/2179497/adding-date-boosting-to-complex-solr-queries


Thanks for the help!
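
For anyone comparing two documents' scores the same way, the explainOther
parameter Erick pointed to (his reply is quoted further down this digest) can be
combined with debugQuery; a sketch reusing the qf from this thread, with spaces
URL-encoded in practice:

  /solr/select?q=b007vty6&defType=dismax
    &qf=id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1
    &debugQuery=on&explainOther=id:b007vty6&fl=id,score&wt=json&indent=on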


Re: Boost Strangeness

2011-06-16 Thread Judioo
: {
   - time: 0
}
- -
org.apache.solr.handler.component.DebugComponent: {
   - time: 18
}
 }
  }
   }

}


On 15 June 2011 13:16, Erick Erickson erickerick...@gmail.com wrote:

 First off, you didn't violate the group's etiquette. In fact, yours was
 one of the better first posts in terms of providing enough information
 for us to actually help!

 A very useful page is the admin/analysis page, to see how the
 analysis chain works. For instance, if you haven't changed the
 field type (i.e. <fieldType name="text">), you will see that your input is
 being broken up by WordDelimiterFilterFactory. Be sure to check
 the verbose checkbox and enter text in both the query and
 index boxes!

 Here's an invaluable page, though do note that it's not exhaustive:

 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


 But on to your problem:

 First, boosting isn't absolute, boosting terms just tends to
 bubble things up, you have to experiment with various weights

 To get the full comparison for both documents you're curious about,
 try using explainOther. see:

 http://wiki.apache.org/solr/SolrRelevancyFAQ#Why_doesn.27t_document_id:juggernaut_appear_in_the_top_10_results_for_my_query

 If you use that against the two docs in question, you should
 see (although it's a hard read!) the reason the docs got
 their relative scores.

 Finally, your next e-mail hints at what's happening. If you're
 putting multiple tokens in some of these fields, the length
 normalization may be causing the matches to score lower. You can
 try disabling those calculations (omitNorms=true in your field
 definition).
 See:

 http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

 String types accept spaces just fine, but you might want to define
 the fields with 'multiValued=true ' and index each as a separate
 field (note that won't work with a field that's also your uniqueKey).

 Best
 Erick

 On Wed, Jun 15, 2011 at 7:16 AM, Judioo cont...@judioo.com wrote:
<dynamicField name="*_id" type="text" indexed="true" stored="true"/>
 
  so all attributes except 'id' are of type text.
 
  I didn't know that about the string type. So is my problem as described
  ( that partial matches are contributing to the calculation ), and does
  defining the field type as string solve this problem?
 
  Or is my understanding completely incorrect?
 
  Thanks in advance
 
  On 15 June 2011 12:08, Ahmet Arslan iori...@yahoo.com wrote:
 
  
 
 /solr/select/?q=b007vty6defType=dismaxqf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1debugQuery=onfl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,scorewt=jsonindent=on
  
  
   same result ( just higher scores ). It's almost as if
   partial matches on
   brand|series_container_id and id are being considered in
   the 1st document.
   Surely this can't be right / expected?
 
  What is your fieldType definition? Don't you think it is better to use
  string type which is not tokenized?
 
 



Boost Strangeness

2011-06-15 Thread Judioo
Hi

I'm confused about exactly how boosts affect relevancy scores.

Apologies if I am violating this group's etiquette, but I could not find
Solr's pastebin anywhere.

I have 2 document types but want to return any documents where the requested
ID appears. The ID appears in multiple attributes but I want to boost
results based on which attribute contains the ID.

so my query is

q=id:b007vty6 parent_id:b007vty6 brand_container_id:b007vty6
series_container_id:b007vty6 subseries_container_id:b007vty6
clip_container_id:b007vty6 clip_episode_id:b007vty6

and I use qf to boost fields

qf=id^10 parent_id^9 brand_container_id^8 series_container_id^8
subseries_container_id^8 clip_container_id^1 clip_episode_id^1


I expect any document with the following id:b007vty6 to be returned 1st (
with the highest score ) yet this is not the case. Can anyone explain why
this is? Could it be that


extra info below:

complete URL

/solr/select/?q=id:b007vty6%20parent_id:b007vty6%20brand_container_id:b007vty6%20series_container_id:b007vty6%20subseries_container_id:b007vty6%20clip_container_id:b007vty6%20clip_episode_id:b007vty6start=0rows=10wt=jsonindent=ondebugQuery=onfl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,scoreqf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1

results

{

   - -
   responseHeader: {
  - status: 0
  - QTime: 12
  - -
  params: {
 - debugQuery: on
 - fl:
 
id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score
 - indent: on
 - start: 0
 - q: id:b007vty6 parent_id:b007vty6 brand_container_id:b007vty6
 series_container_id:b007vty6 subseries_container_id:b007vty6
 clip_container_id:b007vty6 clip_episode_id:b007vty6
 - qf: id^10 parent_id^9 brand_container_id^8 series_container_id^8
 subseries_container_id^8 clip_container_id^1 clip_episode_id^1
 - wt: json
 - rows: 10
  }
   }
   - -
   response: {
  - numFound: 2
  - start: 0
  - maxScore: 1.5543144
  - -
  docs: [
 - -
 {
- series_container_id: b007vm94
- id: b007vsvm
- brand_container_id: b007hhk5
- subseries_container_id: b007vty6
- clip_episode_id: 
- score: 1.5543144
 }
 - -
 {
- parent_id: b007vm94
- id: b007vty6
- score: 0.3014368
 }
  ]
   }
   - -
   debug: {
  - rawquerystring: id:b007vty6 parent_id:b007vty6
  brand_container_id:b007vty6 series_container_id:b007vty6
  subseries_container_id:b007vty6 clip_container_id:b007vty6
  clip_episode_id:b007vty6
  - querystring: id:b007vty6 parent_id:b007vty6
  brand_container_id:b007vty6 series_container_id:b007vty6
  subseries_container_id:b007vty6 clip_container_id:b007vty6
  clip_episode_id:b007vty6
  - parsedquery: id:b007vty6 PhraseQuery(parent_id:b 007 vty 6)
  PhraseQuery(brand_container_id:b 007 vty 6)
  PhraseQuery(series_container_id:b 007 vty 6)
  PhraseQuery(subseries_container_id:b 007 vty 6)
  PhraseQuery(clip_container_id:b 007 vty 6)
PhraseQuery(clip_episode_id:b
  007 vty 6)
  - parsedquery_toString: id:b007vty6 parent_id:b 007 vty 6
  brand_container_id:b 007 vty 6 series_container_id:b 007 vty 6
  subseries_container_id:b 007 vty 6 clip_container_id:b 007 vty 6
  clip_episode_id:b 007 vty 6
  - -
  explain: {
 - b007vsvm:  1.5543144 = (MATCH) product of: 10.8802 = (MATCH) sum
 of: 10.8802 = (MATCH) weight(subseries_container_id:b 007
vty 6 in 39526),
 product of: 0.43911988 =
queryWeight(subseries_container_id:b 007 vty 6),
 product of: 49.55458 = idf(subseries_container_id: b=547
007=31 vty=1 6=87)
 0.008861338 = queryNorm 24.77729 =
fieldWeight(subseries_container_id:b 007
 vty 6 in 39526), product of: 1.0 = tf(phraseFreq=1.0) 49.55458 =
 idf(subseries_container_id: b=547 007=31 vty=1 6=87) 0.5 =
 fieldNorm(field=subseries_container_id, doc=39526) 0.14285715
= coord(1/7) 
 - b007vty6:  0.3014368 = (MATCH) product of: 2.1100576 = (MATCH)
 sum of: 2.1100576 = (MATCH) weight(id:b007vty6 in 39512), product of:
 0.13674039 = queryWeight(id:b007vty6), product of: 15.431123 =
 idf(docFreq=1, maxDocs=3701577) 0.008861338 = queryNorm
15.431123 = (MATCH)
 fieldWeight(id:b007vty6 in 39512), product of: 1.0 =
 tf(termFreq(id:b007vty6)=1) 15.431123 = idf(docFreq=1,
maxDocs=3701577) 1.0
 = fieldNorm(field=id, doc=39512) 0.14285715 = coord(1/7) 
  }
  - QParser: LuceneQParser
  - -
  timing: {
 - time: 12
 - -
 prepare: {
- time: 3
- -
 

Re: Boost Strangeness

2011-06-15 Thread Judioo
Apologies
I have tried that method as well.

/solr/select/?q=b007vty6defType=dismaxqf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1debugQuery=onfl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,scorewt=jsonindent=on


same result ( just higher scores ). It's almost as if  partial matches on
brand|series_container_id and id are being considered in the 1st document.
Surely this can't be right / expected?

{

   - -
   responseHeader: {
  - status: 0
  - QTime: 13
  - -
  params: {
 - debugQuery: on
 - fl:
 
id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score
 - indent: on
 - q: b007vty6
 - qf: id^10 parent_id^9 brand_container_id^8 series_container_id^8
 subseries_container_id^8 clip_container_id^1 clip_episode_id^1
 - wt: json
 - defType: dismax
  }
   }
   - -
   response: {
  - numFound: 2
  - start: 0
  - maxScore: 21.138214
  - -
  docs: [
 - -
 {
- series_container_id: b007vm94
- id: b007vsvm
- brand_container_id: b007hhk5
- subseries_container_id: b007vty6
- clip_episode_id: 
- score: 21.138214
 }
 - -
 {
- parent_id: b007vm94
- id: b007vty6
- score: 5.1243143
 }
  ]
   }
   - -
   debug: {
  - rawquerystring: b007vty6
  - querystring: b007vty6
  - parsedquery: +DisjunctionMaxQuery((id:b007vty6^10.0 |
  clip_episode_id:b 007 vty 6 | subseries_container_id:b 007
vty 6^8.0 |
  series_container_id:b 007 vty 6^8.0 | clip_container_id:b 007 vty 6 |
  brand_container_id:b 007 vty 6^8.0 | parent_id:b 007 vty 6^9.0)) ()
  - parsedquery_toString: +(id:b007vty6^10.0 | clip_episode_id:b 007
  vty 6 | subseries_container_id:b 007 vty 6^8.0 |
series_container_id:b
  007 vty 6^8.0 | clip_container_id:b 007 vty 6 |
brand_container_id:b 007
  vty 6^8.0 | parent_id:b 007 vty 6^9.0) ()
  - -
  explain: {
 - b007vsvm:  21.138214 = (MATCH) sum of: 21.138214 = (MATCH) max
 of: 21.138214 = (MATCH) weight(subseries_container_id:b 007
vty 6^8.0 in
 39526), product of: 0.85312855 =
queryWeight(subseries_container_id:b 007
 vty 6^8.0), product of: 8.0 = boost 49.55458 =
idf(subseries_container_id:
 b=547 007=31 vty=1 6=87) 0.0021519922 = queryNorm 24.77729 =
 fieldWeight(subseries_container_id:b 007 vty 6 in 39526),
product of: 1.0
 = tf(phraseFreq=1.0) 49.55458 = idf(subseries_container_id:
b=547 007=31
 vty=1 6=87) 0.5 = fieldNorm(field=subseries_container_id, doc=39526) 
 - b007vty6:  5.1243143 = (MATCH) sum of: 5.1243143 = (MATCH) max
 of: 5.1243143 = (MATCH) weight(id:b007vty6^10.0 in 39512), product of:
 0.33207658 = queryWeight(id:b007vty6^10.0), product of: 10.0 = boost
 15.431123 = idf(docFreq=1, maxDocs=3701577) 0.0021519922 = queryNorm
 15.431123 = (MATCH) fieldWeight(id:b007vty6 in 39512),
product of: 1.0 =
 tf(termFreq(id:b007vty6)=1) 15.431123 = idf(docFreq=1,
maxDocs=3701577) 1.0
 = fieldNorm(field=id, doc=39512) 
  }
  - QParser: DisMaxQParser
  - altquerystring: null
  - boostfuncs: null
  - -
  timing: {
 - time: 13
 - -
 prepare: {
- time: 3
- -
org.apache.solr.handler.component.QueryComponent: {
   - time: 3
}
- -
org.apache.solr.handler.component.FacetComponent: {
   - time: 0
}
- -
org.apache.solr.handler.component.MoreLikeThisComponent: {
   - time: 0
}
- -
org.apache.solr.handler.component.HighlightComponent: {
   - time: 0
}
- -
org.apache.solr.handler.component.StatsComponent: {
   - time: 0
}
- -
org.apache.solr.handler.component.DebugComponent: {
   - time: 0
}
 }
 - -
 process: {
- time: 10
- -
org.apache.solr.handler.component.QueryComponent: {
   - time: 0
}
- -
org.apache.solr.handler.component.FacetComponent: {
   - time: 0
}
- -
org.apache.solr.handler.component.MoreLikeThisComponent: {
   - time: 0
}
- -
org.apache.solr.handler.component.HighlightComponent: {
   - time: 0
}
- -
org.apache.solr.handler.component.StatsComponent: {
   - time: 0

Re: Boost Strangeness

2011-06-15 Thread Judioo
   <dynamicField name="*_id" type="text" indexed="true" stored="true"/>

so all attributes except 'id' are of type text.

I didn't know that about the string type. So is my problem as described
( that partial matches are contributing to the calculation ), and does defining
the field type as string solve this problem?

Or is my understanding completely incorrect?

Thanks in advance

On 15 June 2011 12:08, Ahmet Arslan iori...@yahoo.com wrote:

 
 /solr/select/?q=b007vty6defType=dismaxqf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1debugQuery=onfl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,scorewt=jsonindent=on
 
 
  same result ( just higher scores ). It's almost as if
  partial matches on
  brand|series_container_id and id are being considered in
  the 1st document.
  Surely this can't be right / expected?

 What is your fieldType definition? Don't you think it is better to use
 string type which is not tokenized?



Re: Boost Strangeness

2011-06-15 Thread Judioo
String also does not seem to accept spaces. Currently the _id fields can
contain multiple ids ( used as an alternative to multiValued ). This is why I
used the text type.

On 15 June 2011 12:16, Judioo cont...@judioo.com wrote:

<dynamicField name="*_id" type="text" indexed="true" stored="true"/>

 so all attributes except 'id' are of type text.

 I didn't know that about the string type. So is my problem as described
 ( that partial matches are contributing to the calculation ), and does defining
 the field type as string solve this problem?

 Or is my understanding completely incorrect?

 Thanks in advance


 On 15 June 2011 12:08, Ahmet Arslan iori...@yahoo.com wrote:

 
 /solr/select/?q=b007vty6defType=dismaxqf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1debugQuery=onfl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,scorewt=jsonindent=on
 
 
  same result ( just higher scores ). It's almost as if
  partial matches on
  brand|series_container_id and id are being considered in
  the 1st document.
  Surely this can't be right / expected?

 What is your fieldType definition? Don't you think it is better to use
 string type which is not tokenized?
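
Combining Ahmet's and Erick's suggestions from this thread, a sketch of how the
_id fields could be declared so that whole ids are matched rather than word
parts; multiValued lets each id go in as its own value instead of a
space-separated string, and omitNorms just switches off length normalization:

  <dynamicField name="*_id" type="string" indexed="true" stored="true"
                multiValued="true" omitNorms="true"/>

  <!-- index each id as a separate value -->
  <field name="clip_episode_id">b007vty6</field>
  <field name="clip_episode_id">b011mgsf</field>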





Pattern: Is there a method of resolving multivalued date ranges into a single document?

2011-06-11 Thread Judioo
Hi All,

Question on best methods again :)

I have the following type of document.

<film>
  <title>Tron</title>
  <times>
    <time start='2010-09-23T12:00:00Z' end='2010-09-23T14:30:00Z' theater_id='445632'/>
    <time start='2010-09-23T15:00:00Z' end='2010-09-23T17:30:00Z' theater_id='445633'/>
    <time start='2010-09-23T18:00:00Z' end='2010-09-23T20:30:00Z' theater_id='445634'/>
  </times>
  .
</film>

where theater_id identifies the place where the film is showing. Each theater
is stored in another document. I want to store the timings in the same
document as the film details, so that I can perform a range search like

( type:film AND start:[ NOW TO * ] AND end:[NOW TO *] )

i.e. give me all the films that are scheduled to start in the future.

I was hoping I could submit a document like the following:

<doc>
  <field name="id">12345-67890-12345</field>
  <field name="title">Tron</field>
  <field name="445632_start">2010-09-23T12:00:00Z</field>
  <field name="445632_end">2010-09-23T14:30:00Z</field>
  <field name="445633_start">2010-09-23T15:00:00Z</field>
  <field name="445633_end">2010-09-23T17:30:00Z</field>
  <field name="445634_start">2010-09-23T18:00:00Z</field>
  <field name="445634_end">2010-09-23T20:30:00Z</field>
  ...
</doc>


My assumption is that I could then perform a wildcard date range search like

( type:film AND *_start:[ NOW TO * ] AND *_end:[NOW TO *] )

using the field name ( theater_id + _start|_end ) as an indicator of the
theater. However, I do not think date range queries support this.

Can ANYONE suggest a method to accomplish this with examples?


Thank you in advance.
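
One common way to keep each start/end pair together (a sketch of the usual
flattening pattern, with field names assumed rather than taken from a reply) is
one document per showing:

  <add>
    <doc>
      <field name="id">12345-67890-12345_445632</field>
      <field name="type">showing</field>
      <field name="film_id">12345-67890-12345</field>
      <field name="title">Tron</field>
      <field name="theater_id">445632</field>
      <field name="start">2010-09-23T12:00:00Z</field>
      <field name="end">2010-09-23T14:30:00Z</field>
    </doc>
  </add>

  <!-- showings that are scheduled to start in the future -->
  /solr/select?q=type:showing AND start:[NOW TO *]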


Re: Solr Indexing Patterns

2011-06-09 Thread Judioo
Very informative links and statement, Jonathan. Thank you.



On 6 June 2011 20:55, Jonathan Rochkind rochk...@jhu.edu wrote:

 This is a start, for many common best practices:

 http://wiki.apache.org/solr/SolrRelevancyFAQ

 Many of the questions in there have an answer that involves de-normalizing.
 As an example. It may be that even if your specific problem isn't in there,
  I myself anyway found reading through there gave me a general sense of
 common patterns in Solr.

 ( It's certainly true that some things are hard to do in Solr.  It turns
 out that an RDBMS is a remarkably flexible thing -- but when it doesn't do
 something you need well, and you turn to a specialized tool instead like
 Solr, you certainly give up some things

 One of the biggest areas of limitation involves hieararchical or
 relationship data, definitely. There are a variety of features, some more
 fully baked than others, some not yet in a Solr release, meant to provide
 tools to get at different aspects of this. Including pivot facetting,
  join (https://issues.apache.org/jira/browse/SOLR-2272), and
 field-collapsing.  Each, IMO, is trying to deal with different aspects of
 dealing with hieararchical or multi-class data, or data that is entities
 with relationships. ).


 On 6/6/2011 3:43 PM, Judioo wrote:

 I do think that Solr would be better served if there was a *best practice
 section *of the site.

 Looking at the majority of emails to this list they resolve around how do
 I
 do X?.

 Seems like tutorials with real world examples would serve Solr no end of
 good.

 I still do not have an example of the best method to approach my problem,
 although Erick has  help me understand the limitations of Solr.

 Just thought I'd say.






 On 6 June 2011 20:26, Judioocont...@judioo.com  wrote:

  Thanks


 On 6 June 2011 19:32, Erick Ericksonerickerick...@gmail.com  wrote:

  #Everybody# (including me) who has any RDBMS background
 doesn't want to flatten data, but that's usually the way to go in
 Solr.

 Part of whether it's a good idea or not depends on how big the index
 gets, and unfortunately the only way to figure that out is to test.

 But that's the first approach I'd try.

 Good luck!
 Erick

 On Mon, Jun 6, 2011 at 11:42 AM, Judioocont...@judioo.com  wrote:

 On 5 June 2011 14:42, Erick Ericksonerickerick...@gmail.com  wrote:

  See: http://wiki.apache.org/solr/SchemaXml

 By adding ' multiValued=true ' to the field, you can add
 the same field multiple times in a doc, something like

 <add>
 <doc>
   <field name="mv">value1</field>
   <field name="mv">value2</field>
 </doc>
 </add>

 I can't see how that would work as one would need to associate the

 right

 start / end dates and price.
 As I understand using multivalued and thus flattening the  discounts

 would

 result in:

 {
name:The Book,
price:$9.99,
price:$3.00,
price:$4.00,synopsis:thanksgiving special,
starts:11-24-2011,
starts:10-10-2011,
ends:11-25-2011,
ends:10-11-2011,
synopsis:Canadian thanksgiving special,
  },

 How does one differentiate the different offers?



  But there's no real ability  in Solr to store sub documents,
 so you'd have to get creative in how you encoded the discounts...

  This is what I'm asking :)
 What is the best / recommended / known patterns for doing this?



  But I suspect a better approach would be to store each discount as
 a separate document. If you're in the trunk version, you could then
 group results by, say, ISBN and get responses grouped together...

  This is an option but seems sub optimal. So say I store the discounts
 in
 multiple documents with ISDN as an attribute and also store the title

 again

 with ISDN as an attribute.

 To get
 all books currently discounted

 requires 2 request

 * get all discounts currently active
 * get all books  using ISDN retrieved from above search

 Not that bad. However what happens when I want
 all books that are currently on discount in the horror genre

 containing

 the word 'elm' in the title.

 The only way I can see in catering for the above search is to duplicate

 all

 searchable fields in my book document in my discount document.

 Coming

 from a RDBM background this seems wrong.

 Is this the correct approach to take?



  Best
 Erick

 On Sat, Jun 4, 2011 at 1:42 AM, Judioocont...@judioo.com  wrote:

 Hi,
 Discounts can change daily. Also there can be a lot of them (over

 time

 and

 in a given time period ).

 Could you give an example of what you mean buy multi-valuing the

 field.

 Thanks

 On 3 June 2011 14:29, Erick Ericksonerickerick...@gmail.com

 wrote:

  How often are the discounts changed? Because you can simply
 re-index the book information with a multiValued discounts field
 and get something similar to your example (wt=json)


 Best
 Erick

 On Fri, Jun 3, 2011 at 8:38 AM, Judioocont...@judioo.com  wrote:

 What is the best practice method to index the following in Solr:

 I'm attempting to use solr for a book store site

Re: Solr Indexing Patterns

2011-06-06 Thread Judioo
On 5 June 2011 14:42, Erick Erickson erickerick...@gmail.com wrote:

 See: http://wiki.apache.org/solr/SchemaXml

 By adding ' multiValued=true ' to the field, you can add
 the same field multiple times in a doc, something like

 <add>
 <doc>
   <field name="mv">value1</field>
   <field name="mv">value2</field>
 </doc>
 </add>

 I can't see how that would work, as one would need to associate the right
start / end dates and price.
As I understand it, using multiValued and thus flattening the discounts would
result in:

{
   "name": "The Book",
   "price": "$9.99",
   "price": "$3.00",
   "price": "$4.00",
   "synopsis": "thanksgiving special",
   "starts": "11-24-2011",
   "starts": "10-10-2011",
   "ends": "11-25-2011",
   "ends": "10-11-2011",
   "synopsis": "Canadian thanksgiving special"
},

How does one differentiate the different offers?



 But there's no real ability  in Solr to store sub documents,
 so you'd have to get creative in how you encoded the discounts...


This is what I'm asking :)
What are the best / recommended / known patterns for doing this?




 But I suspect a better approach would be to store each discount as
 a separate document. If you're in the trunk version, you could then
 group results by, say, ISBN and get responses grouped together...


This is an option but seems suboptimal. So say I store the discounts in
multiple documents with the ISBN as an attribute, and also store the title again
with the ISBN as an attribute.

To get
all books currently discounted

requires 2 requests:

* get all discounts currently active
* get all books using the ISBNs retrieved from the above search

Not that bad. However, what happens when I want
all books that are currently on discount in the horror genre containing
the word 'elm' in the title?

The only way I can see of catering for the above search is to duplicate all
searchable fields from my book document in my discount document. Coming
from an RDBMS background this seems wrong.

Is this the correct approach to take?




 Best
 Erick

 On Sat, Jun 4, 2011 at 1:42 AM, Judioo cont...@judioo.com wrote:
  Hi,
  Discounts can change daily. Also there can be a lot of them (over time
 and
  in a given time period ).
 
  Could you give an example of what you mean buy multi-valuing the field.
 
  Thanks
 
  On 3 June 2011 14:29, Erick Erickson erickerick...@gmail.com wrote:
 
  How often are the discounts changed? Because you can simply
  re-index the book information with a multiValued discounts field
  and get something similar to your example (wt=json)
 
 
  Best
  Erick
 
  On Fri, Jun 3, 2011 at 8:38 AM, Judioo cont...@judioo.com wrote:
   What is the best practice method to index the following in Solr:
  
   I'm attempting to use solr for a book store site.
  
   Each book will have a price but on occasions this will be discounted.
 The
   discounted price exists for a defined time period but there may be
 many
   discount periods. Each discount will have a brief synopsis, start and
 end
   time.
  
   A subset of the desired output would be as follows:
  
   ...
   response:{numFound:1,start:0,docs:[
{
  name:The Book,
  price:$9.99,
  discounts:[
  {
   price:$3.00,
   synopsis:thanksgiving special,
   starts:11-24-2011,
   ends:11-25-2011,
  },
  {
   price:$4.00,
   synopsis:Canadian thanksgiving special,
   starts:10-10-2011,
   ends:10-11-2011,
  },
   ]
},
.
  
   A requirement is to be able to search for just discounted
 publications. I
   think I could use date faceting for this ( return publications that
 are
   within a discount window ). When a discount search is performed no
   publications that are not currently discounted will be returned.
  
   My question are:
  
 - Does solr support this type of sub documents
  
   In the above example the discounts are the sub documents. I know solr
 is
  not
   a relational DB but I would like to store and index the above
  representation
   in a single document if possible.
  
 - what is the best method to approach the above
  
   I can see in many examples the authors tend to denormalize to solve
  similar
   problems. This suggest that for each discount I am required to
 duplicate
  the
   book data or form a document
   association
 http://stackoverflow.com/questions/2689399/solr-associations
  .
   Which method would you advise?
  
   It would be nice if solr could return a response structured as above.
  
   Much Thanks
  
 
 



Re: Solr Indexing Patterns

2011-06-06 Thread Judioo
Thanks

On 6 June 2011 19:32, Erick Erickson erickerick...@gmail.com wrote:

 #Everybody# (including me) who has any RDBMS background
 doesn't want to flatten data, but that's usually the way to go in
 Solr.

 Part of whether it's a good idea or not depends on how big the index
 gets, and unfortunately the only way to figure that out is to test.

 But that's the first approach I'd try.

 Good luck!
 Erick

 On Mon, Jun 6, 2011 at 11:42 AM, Judioo cont...@judioo.com wrote:
  On 5 June 2011 14:42, Erick Erickson erickerick...@gmail.com wrote:
 
  See: http://wiki.apache.org/solr/SchemaXml
 
  By adding ' multiValued=true ' to the field, you can add
  the same field multiple times in a doc, something like
 
  <add>
  <doc>
    <field name="mv">value1</field>
    <field name="mv">value2</field>
  </doc>
  </add>
 
  I can't see how that would work as one would need to associate the right
  start / end dates and price.
  As I understand using multivalued and thus flattening the  discounts
 would
  result in:
 
  {
 name:The Book,
 price:$9.99,
 price:$3.00,
 price:$4.00,synopsis:thanksgiving special,
 starts:11-24-2011,
 starts:10-10-2011,
 ends:11-25-2011,
 ends:10-11-2011,
 synopsis:Canadian thanksgiving special,
   },
 
  How does one differentiate the different offers?
 
 
 
  But there's no real ability  in Solr to store sub documents,
  so you'd have to get creative in how you encoded the discounts...
 
 
  This is what I'm asking :)
  What is the best / recommended / known patterns for doing this?
 
 
 
 
  But I suspect a better approach would be to store each discount as
  a separate document. If you're in the trunk version, you could then
  group results by, say, ISBN and get responses grouped together...
 
 
  This is an option but seems sub optimal. So say I store the discounts in
  multiple documents with ISDN as an attribute and also store the title
 again
  with ISDN as an attribute.
 
  To get
  all books currently discounted
 
  requires 2 request
 
  * get all discounts currently active
  * get all books  using ISDN retrieved from above search
 
  Not that bad. However what happens when I want
  all books that are currently on discount in the horror genre
 containing
  the word 'elm' in the title.
 
  The only way I can see in catering for the above search is to duplicate
 all
  searchable fields in my book document in my discount document. Coming
  from a RDBM background this seems wrong.
 
  Is this the correct approach to take?
 
 
 
 
  Best
  Erick
 
  On Sat, Jun 4, 2011 at 1:42 AM, Judioo cont...@judioo.com wrote:
   Hi,
   Discounts can change daily. Also there can be a lot of them (over time
  and
   in a given time period ).
  
   Could you give an example of what you mean buy multi-valuing the
 field.
  
   Thanks
  
   On 3 June 2011 14:29, Erick Erickson erickerick...@gmail.com wrote:
  
   How often are the discounts changed? Because you can simply
   re-index the book information with a multiValued discounts field
   and get something similar to your example (wt=json)
  
  
   Best
   Erick
  
   On Fri, Jun 3, 2011 at 8:38 AM, Judioo cont...@judioo.com wrote:
What is the best practice method to index the following in Solr:
   
I'm attempting to use solr for a book store site.
   
Each book will have a price but on occasions this will be
 discounted.
  The
discounted price exists for a defined time period but there may be
  many
discount periods. Each discount will have a brief synopsis, start
 and
  end
time.
   
A subset of the desired output would be as follows:
   
...
response:{numFound:1,start:0,docs:[
 {
   name:The Book,
   price:$9.99,
   discounts:[
   {
price:$3.00,
synopsis:thanksgiving special,
starts:11-24-2011,
ends:11-25-2011,
   },
   {
price:$4.00,
synopsis:Canadian thanksgiving special,
starts:10-10-2011,
ends:10-11-2011,
   },
]
 },
 .
   
A requirement is to be able to search for just discounted
  publications. I
think I could use date faceting for this ( return publications that
  are
within a discount window ). When a discount search is performed no
publications that are not currently discounted will be returned.
   
My question are:
   
  - Does solr support this type of sub documents
   
In the above example the discounts are the sub documents. I know
 solr
  is
   not
a relational DB but I would like to store and index the above
   representation
in a single document if possible.
   
  - what is the best method to approach the above
   
I can see in many examples the authors tend to denormalize to solve
   similar
problems. This suggest that for each discount I am required to
  duplicate
   the
book data or form a document
association
  http://stackoverflow.com/questions

Re: Solr Indexing Patterns

2011-06-06 Thread Judioo
I do think that Solr would be better served if there was a *best practice
section* of the site.

Looking at the majority of emails to this list, they revolve around "how do I
do X?".

Seems like tutorials with real-world examples would serve Solr no end of
good.

I still do not have an example of the best method to approach my problem,
although Erick has helped me understand the limitations of Solr.

Just thought I'd say.






On 6 June 2011 20:26, Judioo cont...@judioo.com wrote:

 Thanks


 On 6 June 2011 19:32, Erick Erickson erickerick...@gmail.com wrote:

 #Everybody# (including me) who has any RDBMS background
 doesn't want to flatten data, but that's usually the way to go in
 Solr.

 Part of whether it's a good idea or not depends on how big the index
 gets, and unfortunately the only way to figure that out is to test.

 But that's the first approach I'd try.

 Good luck!
 Erick

 On Mon, Jun 6, 2011 at 11:42 AM, Judioo cont...@judioo.com wrote:
  On 5 June 2011 14:42, Erick Erickson erickerick...@gmail.com wrote:
 
  See: http://wiki.apache.org/solr/SchemaXml
 
  By adding ' multiValued=true ' to the field, you can add
  the same field multiple times in a doc, something like
 
  <add>
  <doc>
    <field name="mv">value1</field>
    <field name="mv">value2</field>
  </doc>
  </add>
 
  I can't see how that would work as one would need to associate the
 right
  start / end dates and price.
  As I understand using multivalued and thus flattening the  discounts
 would
  result in:
 
  {
 name:The Book,
 price:$9.99,
 price:$3.00,
 price:$4.00,synopsis:thanksgiving special,
 starts:11-24-2011,
 starts:10-10-2011,
 ends:11-25-2011,
 ends:10-11-2011,
 synopsis:Canadian thanksgiving special,
   },
 
  How does one differentiate the different offers?
 
 
 
  But there's no real ability  in Solr to store sub documents,
  so you'd have to get creative in how you encoded the discounts...
 
 
  This is what I'm asking :)
  What is the best / recommended / known patterns for doing this?
 
 
 
 
  But I suspect a better approach would be to store each discount as
  a separate document. If you're in the trunk version, you could then
  group results by, say, ISBN and get responses grouped together...
 
 
  This is an option but seems sub optimal. So say I store the discounts in
  multiple documents with ISDN as an attribute and also store the title
 again
  with ISDN as an attribute.
 
  To get
  all books currently discounted
 
  requires 2 request
 
  * get all discounts currently active
  * get all books  using ISDN retrieved from above search
 
  Not that bad. However what happens when I want
  all books that are currently on discount in the horror genre
 containing
  the word 'elm' in the title.
 
  The only way I can see in catering for the above search is to duplicate
 all
  searchable fields in my book document in my discount document.
 Coming
  from a RDBM background this seems wrong.
 
  Is this the correct approach to take?
 
 
 
 
  Best
  Erick
 
  On Sat, Jun 4, 2011 at 1:42 AM, Judioo cont...@judioo.com wrote:
   Hi,
   Discounts can change daily. Also there can be a lot of them (over
 time
  and
   in a given time period ).
  
   Could you give an example of what you mean buy multi-valuing the
 field.
  
   Thanks
  
   On 3 June 2011 14:29, Erick Erickson erickerick...@gmail.com
 wrote:
  
   How often are the discounts changed? Because you can simply
   re-index the book information with a multiValued discounts field
   and get something similar to your example (wt=json)
  
  
   Best
   Erick
  
   On Fri, Jun 3, 2011 at 8:38 AM, Judioo cont...@judioo.com wrote:
What is the best practice method to index the following in Solr:
   
I'm attempting to use solr for a book store site.
   
Each book will have a price but on occasions this will be
 discounted.
  The
discounted price exists for a defined time period but there may be
  many
discount periods. Each discount will have a brief synopsis, start
 and
  end
time.
   
A subset of the desired output would be as follows:
   
...
response:{numFound:1,start:0,docs:[
 {
   name:The Book,
   price:$9.99,
   discounts:[
   {
price:$3.00,
synopsis:thanksgiving special,
starts:11-24-2011,
ends:11-25-2011,
   },
   {
price:$4.00,
synopsis:Canadian thanksgiving special,
starts:10-10-2011,
ends:10-11-2011,
   },
]
 },
 .
   
A requirement is to be able to search for just discounted
  publications. I
think I could use date faceting for this ( return publications
 that
  are
within a discount window ). When a discount search is performed no
publications that are not currently discounted will be returned.
   
My question are:
   
  - Does solr support this type of sub documents
   
In the above example

Solr Indexing Patterns

2011-06-03 Thread Judioo
What is the best practice method to index the following in Solr:

I'm attempting to use solr for a book store site.

Each book will have a price but on occasions this will be discounted. The
discounted price exists for a defined time period but there may be many
discount periods. Each discount will have a brief synopsis, start and end
time.

A subset of the desired output would be as follows:

...
"response": {"numFound": 1, "start": 0, "docs": [
  {
    "name": "The Book",
    "price": "$9.99",
    "discounts": [
      {
        "price": "$3.00",
        "synopsis": "thanksgiving special",
        "starts": "11-24-2011",
        "ends": "11-25-2011"
      },
      {
        "price": "$4.00",
        "synopsis": "Canadian thanksgiving special",
        "starts": "10-10-2011",
        "ends": "10-11-2011"
      }
    ]
  },
  .

A requirement is to be able to search for just discounted publications. I
think I could use date faceting for this ( return publications that are
within a discount window ). When a discount search is performed no
publications that are not currently discounted will be returned.

My questions are:

   - Does Solr support this type of sub-document?

In the above example the discounts are the sub-documents. I know Solr is not
a relational DB, but I would like to store and index the above representation
in a single document if possible.

   - What is the best method to approach the above?

I can see in many examples that authors tend to denormalize to solve similar
problems. This suggests that for each discount I am required to duplicate the
book data, or form a document association
(http://stackoverflow.com/questions/2689399/solr-associations).
Which method would you advise?

It would be nice if solr could return a response structured as above.

Much Thanks
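
For the approach suggested in the replies above (each discount as its own
document, grouped by ISBN), a sketch of what the documents and a grouped query
could look like; the field names and ISBN are made up, and the grouping
parameters are the ones from the trunk field-collapsing work:

  <add>
    <doc>
      <field name="doc_type">book</field>
      <field name="isbn">9780000000001</field>
      <field name="name">The Book</field>
      <field name="price">$9.99</field>
      <field name="genre">horror</field>
    </doc>
    <doc>
      <field name="doc_type">discount</field>
      <field name="isbn">9780000000001</field>
      <field name="price">$3.00</field>
      <field name="synopsis">thanksgiving special</field>
      <field name="starts">2011-11-24T00:00:00Z</field>
      <field name="ends">2011-11-25T00:00:00Z</field>
    </doc>
  </add>

  <!-- currently discounted, grouped by ISBN -->
  /solr/select?q=doc_type:discount AND starts:[* TO NOW] AND ends:[NOW TO *]&group=true&group.field=isbn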


Storing, indexing and searching XML documents in Solr

2011-05-18 Thread Judioo
Hi,
I'm new to solr so apologies if the solution is already documented.
I have installed and populated a solr index using the examples as a template
with a version of the data below.

I have XML in the form of

<entity>
  <resource>
    <guid>123898-2092099098982</guid>
    <media_format>Blu-Ray</media_format>
    <updated>2011-05-05T11:25:35+0500</updated>
  </resource>
  <price currency="usd">3.99</price>
  <discounts>
    <discount type="percentage" rate="30"
      start="2011-05-03T00:00:00" end="2011-05-10T00:00:00" />
    <discount type="decimal" amount="1.99" coupon="1" />
    .
  </discounts>
  <aspect_ratio>16:9</aspect_ratio>
  <duration>1620</duration>
  <categories>
    <category id="drama" />
    <category id="horror" />
  </categories>
  <rating>
    <rate id="D1">contains some scenes which some viewers may find upsetting</rate>
  </rating>
  ...
  <media_type>Video</media_type>
</entity>


Can I populate Solr directly with this document (as I believe MarkLogic does)?

If yes:
Can I search on any attribute (i.e. find all records where
/entity/resource/media_format equals Blu-Ray)?

If no:
What is the best practice for importing the attributes above into Solr (i.e.
patterns for subdividing / flattening the document)?
Does Solr support attached documents, and if so is this advised (how does it
affect performance)?

Any help is greatly appreciated. Pointers to documentation that address my
issues is even more helpful.

Thanks again


OJ


Re: Storing, indexing and searching XML documents in Solr

2011-05-18 Thread Judioo
The data is being imported directly from mysql. The document is however
indeed a good starting place.
Thanks

2011/5/18 Yury Kats yuryk...@yahoo.com

 On 5/18/2011 4:19 PM, Judioo wrote:

  Any help is greatly appreciated. Pointers to documentation that address
 my
  issues is even more helpful.

 I think this would be a good start:

 http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource



Re: Storing, indexing and searching XML documents in Solr

2011-05-18 Thread Judioo
Great document. I can see how to import the data directly from the database.
However, it seems as though I need to write XPaths in the config to extract
the fields that I wish to transform into a Solr document.

So it seems that there is no way of storing the document structure in Solr
as is?


2011/5/18 Yury Kats yuryk...@yahoo.com

 On 5/18/2011 4:19 PM, Judioo wrote:

  Any help is greatly appreciated. Pointers to documentation that address
 my
  issues is even more helpful.

 I think this would be a good start:

 http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource
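
For the XML above, a sketch of what a DataImportHandler config using the
XPathEntityProcessor could look like; the file name is an assumption and only a
couple of the fields are shown:

  <dataConfig>
    <dataSource type="FileDataSource"/>
    <document>
      <entity name="video" processor="XPathEntityProcessor"
              url="entity.xml" forEach="/entity">
        <field column="guid"         xpath="/entity/resource/guid"/>
        <field column="media_format" xpath="/entity/resource/media_format"/>
        <field column="media_type"   xpath="/entity/media_type"/>
      </entity>
    </document>
  </dataConfig>

The original nesting is not preserved; each xpath flattens one value into a
plain Solr field, which is what makes a direct query on media_format possible.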