Is Solr ready for nested document importing and querying?

2015-08-26 Thread Rafael
Hi, I'm using Solr and I'm starting to index my database. I work for a book
seller, but we have a lot of different publications (i.e. different
editions from different publishers) for the same book, and I was wondering
if it would be wise to model this schema using a hierarchical approach
(with nested docs). For example:

{
  "title": "The Hobbit",
  "author": "J. R. R. Tolkien",
  "publications": [{
      "isbn": 9780007591855,
      "price": 0.99,
      "pages": 200
    }, {
      "isbn": 9780007497904,
      "price": 4.00,
      "pages": 230
    }
  ]
}

And, another question: how can I achieve this with the data-import-handler? I
found this: https://issues.apache.org/jira/browse/SOLR-5147 (I'm using Solr
5.3) and I was able to index the data, but I cannot retrieve the
publication values nested inside a book.
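For what it's worth: once the publications are indexed as block-joined child documents, they can usually be pulled back with the block-join parent parser plus the [child] doc transformer. A rough sketch only - it assumes a marker field such as content_type on the parent documents, which is not in the example above:

  q={!parent which=content_type:book}isbn:9780007591855
  fl=title,author,[child parentFilter=content_type:book limit=10]

The first line selects parent books whose children match, and the [child] transformer re-attaches the matching publication documents to each result.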

What do you think, guys? Or is it better to forget about nested documents
and go back to the old-fashioned denormalized approach?

Thanks.

[]'s
Rafael


Data Import Handler use of JNDI decayed

2015-08-26 Thread Davis, Daniel (NIH/NLM) [C]
NLM tends to be rather security-conscious. Nothing appears terribly wrong,
but the layout of Solr doesn't include Jetty's start.ini or jetty.xml, so
it will have to be the detailed way:
https://wiki.eclipse.org/Jetty/Feature/JNDI#Detailed_Setup

Once I've figured it out, I'll request wiki edit permissions to add it in.
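For anyone who finds this thread later, the detailed setup boils down to declaring the datasource in Jetty's XML config and referencing it from DIH by JNDI name. A rough sketch only, assuming the jetty-plus/jetty-jndi jars are on the classpath; the names, driver and credentials below are made up:

  <!-- in jetty.xml (or a context XML) -->
  <New id="nlmds" class="org.eclipse.jetty.plus.jndi.Resource">
    <Arg></Arg>
    <Arg>jdbc/nlmds</Arg>
    <Arg>
      <New class="org.apache.commons.dbcp.BasicDataSource">
        <Set name="driverClassName">org.postgresql.Driver</Set>
        <Set name="url">jdbc:postgresql://dbhost:5432/nlm</Set>
        <Set name="username">solr</Set>
        <Set name="password">secret</Set>
      </New>
    </Arg>
  </New>

  <!-- in DIH data-config.xml -->
  <dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/nlmds"/>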

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH



Re: Search opening hours

2015-08-26 Thread Darren Spehr
So thanks to the tireless efforts of David Smiley and the devs at Vivid
Solutions (not to mention the various contributors that help power Solr and
Lucene) spatial search is awesome, efficient and easy.  The biggest
roadblock I've run into is not having the JTS (Java Topology Suite) JAR
where Solr can find it. It doesn't ship with Solr OOB so you have to either
add it to one of the dynamic directories, or bundle it with the WAR (I
think pre-5.0). The link above has most of what you need to index data and
issue queries. I'd also suggest the sections on spatial search in Solr In
Action (Grainger, Potter) - they add a few more use cases that I've found
interesting. Finally, the aging wiki has some good info too:

http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4

Basically indexing spatial data is as easy as anything else: define the
field in schema.xml, create the data and push it in. Now the data
in this case are boxes or polygons (effectively the same here) and come in
a specific format known as WKT, or Well-Known Text
(https://en.wikipedia.org/wiki/Well-known_text). I'd say unless you're
aiming at an advanced use case, set the max distance error (maxDistErr) on
the field config a little higher than normal - precision isn't really a
requirement here and good unit tests would alert you to any unforeseen
issues. Then for the query side of the world you just ask for point
inclusion like:

q=+polygon:"Contains(POINT(my_long my_lat))"

Please note that WKT reverses the order of lat/lng because it uses
euclidean geometry heuristics (so X=longitude and Y=latitude). Can't tell
you how many times my brain hurt thanks to this idiom combined with janky
client logic :) Anyway, that's about it - let me know if you have any other
questions.
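To make that concrete, a sketch of the field type and a point-inclusion filter (the distErrPct/maxDistErr values are only illustrative; the spatialContextFactory attribute is what pulls in JTS):

  <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
             spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
             distErrPct="0.025" maxDistErr="0.001" units="degrees"/>
  <field name="polygon" type="location_rpt" indexed="true" stored="true" multiValued="true"/>

  fq=polygon:"Contains(POINT(-73.98 40.75))"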


On Wed, Aug 26, 2015 at 1:56 PM, O. Klein kl...@octoweb.nl wrote:

 Darren,

 This sounds like solution I'm looking for. Especially nice fix for the
 Sunday-Monday problem.

 Never worked with spatial search before, so any pointers are welcome.

 Will start working on this solution.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225443.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Darren


Re: StrDocValues

2015-08-26 Thread Mikhail Khludnev
Hello Jamie,

Check here
https://github.com/apache/lucene-solr/blob/7f721a1f9323a85ce2b5b35e12b4788c31271b69/lucene/sandbox/src/java/org/apache/lucene/search/DocValuesRangeQuery.java#L185
Note that SortedSet works there even if the actual field is multiValued=false.


On Wed, Aug 26, 2015 at 8:48 PM, Jamie Johnson jej2...@gmail.com wrote:

 Are there any example implementation showing how StrDocValues works?  I am
 not sure if this is the right place or not, but I was thinking about having
 some document level doc value that I'd like to read in a function query to
 impact if the document is returned or not.  Am I barking up the right tree
 looking at this or is there another method to supporting this?




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Connect and sync two solr server

2015-08-26 Thread Erick Erickson
From the description, this is straightforward SolrCloud where you
have replicas on separate machines, see:
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud

A different way of accomplishing this would be the master/slave style, see:
https://cwiki.apache.org/confluence/display/solr/Index+Replication
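For reference, the master/slave route is mostly just the replication handler configured in solrconfig.xml on both sides; a rough sketch, with host/core names and the poll interval as placeholders:

  <!-- master -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
    </lst>
  </requestHandler>

  <!-- slave -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/core1</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>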

Best,
Erick

On Wed, Aug 26, 2015 at 6:55 AM, shahper shahper.ja...@techblue.co.uk wrote:
 Hi,

 I want to connect two SolrCloud servers and sync their indexes with each other,
 so that if one server is down we can work with the other, and whenever I update
 or add to the index on one server the other also gets updated.

 shahper











StrDocValues

2015-08-26 Thread Jamie Johnson
Are there any example implementations showing how StrDocValues works?  I am
not sure if this is the right place or not, but I was thinking about having
some document-level doc value that I'd like to read in a function query to
impact whether the document is returned or not.  Am I barking up the right tree
looking at this, or is there another method to support this?


Securing Solr 5.3 with Basic Authentication

2015-08-26 Thread Gofio Code
With version 5.3, Solr has full-featured authentication and authorization
plugins that use Basic authentication and “permission rules”, which are
completely driven from ZooKeeper.

So I have tried that without success, following the info in
https://cwiki.apache.org/confluence/display/solr/Securing+Solr and
http://lucidworks.com/blog/securing-solr-basic-auth-permission-rules:

I followed these steps:

1) Set up a ZooKeeper ensemble (3 nodes).

2) Upload the file security.json to ZooKeeper.

I used this command to upload the file: zkcli.bat -zkhost localhost:2181
-cmd putfile /security.json security.json

Content of the file security.json:
{
"authentication":{
   "class":"solr.BasicAuthPlugin",
   "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
},
"authorization":{
   "class":"solr.RuleBasedAuthorizationPlugin",
   "user-role":{"solr":"admin"},
   "permissions":[{"name":"security-edit",
      "role":"admin"}]
}}

I also tried with this security.json content:

{"authentication":{"class":"solr.BasicAuthPlugin"},"authorization":{"class":"solr.RuleBasedAuthorizationPlugin"}}


3) I started Solr 5.3.0 in cloud mode (and 'bootstrap'):

I used this command:
./solr start -c -z localhost:2181,localhost:2182,localhost:2183 -s
../server/solrcloud_test
-Dbootstrap_confdir=../server/solrcloud_test/configsets/basic_configs/conf
-Dcollection.configName=c_test_cfg -f


However, I can access http://localhost:8983/solr directly and the
browser doesn't ask me for credentials. In the Solr Admin UI I can see
/security.json (with the correct content) and even c_test_cfg under
/configs.

I can see this in the log when solr starts:

955  INFO  (main) [   ] o.a.s.c.CoreContainer Security conf doesn't exist.
Skipping setup for authorization module.
955  INFO  (main) [   ] o.a.s.c.CoreContainer No authentication plugin used.

Can anybody tell me what I'm doing wrong??
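A quick way to check whether the plugins were actually picked up: Solr will echo the current security configuration back from these endpoints, so if they come back empty the running nodes never read the znode you uploaded.

  curl http://localhost:8983/solr/admin/authentication
  curl http://localhost:8983/solr/admin/authorization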


Re: Search opening hours

2015-08-26 Thread O. Klein
Darren,

This sounds like the solution I'm looking for. Especially a nice fix for the
Sunday-Monday problem.

Never worked with spatial search before, so any pointers are welcome. 

Will start working on this solution.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225443.html
Sent from the Solr - User mailing list archive at Nabble.com.


IOException, ConnectionTimeout Error while searching

2015-08-26 Thread Nitin Solanki
Hello,
I indexed 2 million documents and, after completing indexing, I
tried searching. It throws an IOException and Connection Timeout error.


 error:{
msg:org.apache.solr.client.solrj.SolrServerException:
IOException occured when talking to server at:
http://192.168.1.25:8983/solr/col_ner_shard1_replica1;,
trace:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: IOException occured
when talking to server at:
http://192.168.1.25:8983/solr/col_ner_shard1_replica1\n\tat
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:337)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)\n\tat
org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat


Re: Behavior of grouping on a field with same value spread across shards.

2015-08-26 Thread Erick Erickson
That should be the case.

Best,
Erick

On Tue, Aug 25, 2015 at 8:55 PM, Modassar Ather modather1...@gmail.com wrote:
 Thanks Erick,

 I saw the link. So is it that the grouping functionality works fine in
 distributed search except the two cases mentioned in the link?

 Regards,
 Modassar

 On Tue, Aug 25, 2015 at 10:40 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 That's not really the case. Perhaps you're confusing
 group.ngroups and group.facet with just grouping?

 See the ref guide:

 https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats

 Best,
 Erick

 On Tue, Aug 25, 2015 at 4:51 AM, Modassar Ather modather1...@gmail.com
 wrote:
  Hi,
 
  As per my understanding, to group on a field all documents with the same
  value in the field have to be in the same shard.
 
  Can we group by a field where the documents with the same value in that
  field will be distributed across shards?
  Please let me know what are the limitations, feature not available or
  performance issues for such fields?
 
  Thanks,
  Modassar



Solr 5.2.1 versus Solr 4.7.0 performance

2015-08-26 Thread Esther Goldbraich
Hello,
We have benchmarked a set of queries on Solr 4.7.0 and 5.2.1 (with the same
data and the same solrconfig.xml) and saw better query performance on Solr 4.7.0
(5-15% better than 5.2.1, with the exception of a 100% improvement for one of
the queries), using the same JVM (IBM 1.7) and JVM params.
The index size is ~500G, spread over 64 shards, with replication factor 2.
Do you know of any config / setup change for Solr 5.2.1 that can
improve the performance? Any idea what causes this behavior?
Thank you,
Esther




Re: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-26 Thread Erick Erickson
Sure, I think it's fine to raise a JIRA, especially if you can include
a patch, even a preliminary one to solicit feedback... which I'll
leave to people who are more familiar with that code...

I'm not sure how generally useful this would be, and if it comes
at a cost to normal searching there's sure to be lively discussion.

Best
Erick
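For readers skimming the archive, the delimited-payload chain discussed further down the thread looks roughly like this in schema.xml. A sketch only; the delimiter is whatever you choose, and "identity" is the encoder mentioned later for non-float payloads:

  <fieldType name="payloads" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- a token written as term^AUTH gets AUTH attached as that token's payload -->
      <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="^" encoder="identity"/>
    </analyzer>
  </fieldType>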

On Tue, Aug 25, 2015 at 7:50 PM, Jamie Johnson jej2...@gmail.com wrote:
 Looks like I have something basic working for Trie fields.  I am doing
 exactly what I said in my previous email, so good news there.  I think this
 is a big step as there are only a few field types left that I need to
 support, those being date (should be similar to Trie) and Spatial fields,
 which at a glance looked like it provided a way to provide the token stream
 through an extension.  Definitely need to look more though.

 All of this said though, is this really the right way to get payloads into
 these types of fields?  Should a jira feature request be added for this?
 On Aug 25, 2015 8:13 PM, Jamie Johnson jej2...@gmail.com wrote:

 Right, I had assumed (obviously here is my problem) that I'd be able to
 specify payloads for the field regardless of the field type.  Looking at
 TrieField that is certainly non-trivial.  After a bit of digging it appears
 that if I wanted to do something here I'd need to build a new TrieField,
 override createField and provide a Field that would return something like
 NumericTokenStream but also provide the payloads.  Like you said sounds
 interesting to say the least...

 Were payloads not really intended to be used for these types of fields
 from a Lucene perspective?


 On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 Well, you're going down a path that hasn't been trodden before ;).

 If you can treat your primitive types as text types you might get
 some traction, but that makes a lot of operations like numeric
 comparison difficult.

 H. another idea from left field. For single-valued types,
 what about a sidecar field that has the auth token? And even
 for a multiValued field, two parallel fields are guaranteed to
 maintain order so perhaps you could do something here. Yes,
 I'm waving my hands a LOT here.

 I suspect that trying to have a custom type that incorporates
 payloads for, say, trie fields will be interesting to say the least.
 Numeric types are packed to save storage etc. so it'll be
 an adventure..

 Best,
 Erick

 On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson jej2...@gmail.com wrote:
  We were originally using this approach, i.e. run things through the
  KeywordTokenizer - DelimitedPayloadFilter - WordDelimiterFilter.
 Again
  this works fine for text, though I had wanted to use the
 StandardTokenizer
  in the chain.  Is there an equivalent filter that does what the
  StandardTokenizer does?
 
  All of this said this doesn't address the issue of the primitive field
  types, which at this point is the bigger issue.  Given this use case
 should
  there be another way to provide payloads?
 
  My current thinking is that I will need to provide custom
 implementations
  for all of the field types I would like to support payloads on which
 will
  essentially be copies of the standard versions with some extra sugar
 to
  read/write the payloads (I don't see a way to wrap/delegate these at
 this
  point because AttributeSource has the attribute retrieval related
 methods
  as final so I can't simply wrap another tokenizer and return my added
  attributes + the wrapped attributes).  I know my use case is a bit
 strange,
  but I had not expected to need to do this given that Lucene/Solr
 supports
  payloads on these field types, they just aren't exposed.
 
  As always I appreciate any ideas if I'm barking up the wrong tree here.
 
  On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma 
 markus.jel...@openindex.io
  wrote:
 
  Well, if i remember correctly (i have no testing facility at hand)
  WordDelimiterFilter maintains payloads on emitted sub terms. So if you
 use
  a KeywordTokenizer, input 'some text^PAYLOAD', and have a
  DelimitedPayloadFilter, the entire string gets a payload. You can then
  split that string up again in individual tokens. It is possible to
 abuse
  WordDelimiterFilter for it because it has a types parameter that you
 can
  use to split it on whitespace if its input is not trimmed. Otherwise
 you
  can use any other character instead of a space as your input.
 
  This is a crazy idea, but it might work.
 
  -Original message-
   From:Jamie Johnson jej2...@gmail.com
   Sent: Tuesday 25th August 2015 19:37
   To: solr-user@lucene.apache.org
   Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
  
   To be clear, we are using payloads as a way to attach authorizations
 to
   individual tokens within Solr.  The payloads are normal Solr Payloads
   though we are not using floats, we are using the identity payload
 encoder
   

Re: how to index document with multiple words (phrases) and words permutation?

2015-08-26 Thread afrooz
Simon, thanks a lot. That is a great tool. I am trying to use it.
Great solution.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-index-document-with-multiple-words-phrases-and-words-permutation-tp4224919p4225425.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search opening hours

2015-08-26 Thread Darren Spehr
Sorry - didn't finish my thought. I need to address querying :) So using
the above to define what's in the index, your queries for a day/time become
a CONTAINS operation against the field. Let's say that the field is defined
as a location_rpt using JTS and its spatial context factory (which supports
polygons) - oh, and it would need to be multi-valued. Querying the field
would require first translating "now" or "in an hour" or "Monday at 9am" to
a geocode, then hitting the index with a CONTAINS request per the docs:

https://cwiki.apache.org/confluence/display/solr/Spatial+Search


On Wed, Aug 26, 2015 at 11:23 AM, Darren Spehr darre...@gmail.com wrote:

 Sure - and sorry for its density. I reread it and thought the same ;)

 So imagine a polygon of say 1/2 mile width (I made that up) that stretches
 around the equator. Let's call this a week's timeline and subdivide it into
 7 blocks, one for each day. For the sake of simplicity assume it's a line
 (which I forget but is supported in Solr as an infinitely small polygon)
 starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for
 Sunday at 11:59 PM. By subdivide you can think of it either radially or by
 longitude, but you have 360 degrees to divide into 7, which means that
 every hour is represented by a range of roughly 2.143 degrees (360/7/24).
 These regions represent each day and hour (or less), and the region
 boundaries represent midnight for the day before.

 Now for indexing - your open hours then become a combination of these
 subdivisions. If you're open 24x7 then the whole polygon is indexed. If
 you're only open on Monday from 9-5 then only the polygon between
 (0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you
 can index any combination of times this way.

 So now the varsity question is how to do this with a fluctuating calendar?
 I think this example can be extended to include searching against any given
 day of the week in a year, or years. Just imagine a translation layer that
 adjusts the latitude N or S by some amount to represent which day in which
 year you're looking for. Make sense?
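Working the arithmetic through with a made-up field name: at roughly 2.143 degrees per hour, Monday 9:00-17:00 is the segment from longitude -160.71 to -143.57, and "now = Monday 11:23am" lands at about -155.61, so the indexed shape and the query probe would look something like:

  hours_geo: "LINESTRING(-160.71 0, -143.57 0)"
  fq=hours_geo:"Intersects(POINT(-155.61 0))"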

 On Wed, Aug 26, 2015 at 10:50 AM, Upayavira u...@odoko.co.uk wrote:

 delightfully dense = really intriguing, but I couldn't quite
 understand it - really hoping for more info

 On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote:
  Darren,
 
  That was delightfully dense. Do you think you could unpack it a bit
  more? Possibly some sample (pseudo) queries?
 
  Upayavira
 
  On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote:
   If you wanted to try a spatial approach that blended times like above,
   you
   could try a polygon of minimum width that spans the globe - this is
   literally using spatial search (geocodes) against time. So in this
   scenario
   you logically subdivide the polygon into 7 distinct regions (for days)
   and
   then within this you can defined, like a timeline, what open and
 closed
   means. The problem of 3AM is taken care of because of it's continuous
   nature - ie one day is adjacent to the next, with Sunday and Monday
   backing
   up to each other. Just a thought.
  
   On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote:
  
   
   
On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
 Those options don't fix my problem with closing times the next
 morning,
 or is
 there a way to do this?
   
Use the spatial model, and a time window of a week. There are 10,080
minutes in a week, so you could use that as your scale.
   
Assuming the week starts at 00:00 Monday morning, you might index
 Monday
9:00-23:00 as  540:1380
   
Tuesday 9am-Wednesday 1am would be 1980:2940
   
You convert your NOW time into a minutes since Monday 00:00 and
 do a
spatial search within that time.
   
If it is now Monday, 11:23am, that would be 11*60+23=683, so you
 would
do a search for 683:683.
   
If you have a shop that is open over Sunday night to Monday, you
 just
list it as open until Sunday 23:59 and open again Monday 00:00.
   
Would that do it?
   
Upayavira
   
  
  
  
   --
   Darren




 --
 Darren




-- 
Darren


Re: Exact substring search with ngrams

2015-08-26 Thread Upayavira
The analysis tab does not support multi-valued fields. It only analyses a
single field value.

On Wed, Aug 26, 2015, at 05:05 PM, Erick Erickson wrote:
 bq: my dog
 has fleas
 I wouldn't  want some variant of og ha to match,
 
 Here's where the mysterious positionIncrementGap comes in. If you
 make this field multiValued,  and index this like this:
 <doc>
   <field name="blah">my dog</field>
   <field name="blah">has fleas</field>
 </doc>

 or equivalently in SolrJ just
 doc.addField("blah", "my dog");
 doc.addField("blah", "has fleas");
 
 then the position of dog will be 2 and the position of has will be
 102 assuming
 the positionIncrementGap is the default 100. N.B. I'm not sure you'll
 see this in the
 admin/analysis page or not.
 
 Anyway, now your example won't match across the two parts unless
 you specify a slop up in the 101 range.
 
 Best,
 Erick
 
 On Wed, Aug 26, 2015 at 2:19 AM, Christian Ramseyer r...@networkz.ch
 wrote:
  On 26/08/15 00:24, Erick Erickson wrote:
  Hmmm, this sounds like a nonsensical question, but what do you mean
  by arbitrary substring?
 
  Because if your substrings consist of whole _tokens_, then ngramming
  is totally unnecessary (and gets in the way). Phrase queries with no slop
  fulfill this requirement.
 
  But let's assume you need to march within tokens, i.e. if the doc
  contains my dog has fleas, you need to match input like as fle, in this
  case ngramming is an option.
 
  Yeah the as fle-thing is exactly what I want to achieve.
 
 
  You have substantially different index and query time chains. The result 
  is that
  the offsets for all the grams at index time are the same in the quick 
  experiment
  I tried, all were 1. But at query time, each gram had an incremented 
  position.
 
  I'd start by using the query time analysis chain for indexing also. Next, 
  I'd
  try enclosing multiple words in double quotes at query time and go from 
  there.
  What you have now is an anti-pattern in that having substantially
  different index
  and query time analysis chains is not something that's likely to be very
  predictable unless you know _exactly_ what the consequences are.
 
  The admin/analysis page is your friend, in this case check the
  verbose checkbox
  to see what I mean.
 
  Hmm interesting. I had the additional \R tokenizer in the index chain
  because the the document can be multiple lines (but the search text is
  always a single line) and if the document was
 
  my dog
  has fleas
 
  I wouldn't want some variant of og ha to match, but I didn't realize
  it didn't give me any positions like you noticed.
 
  I'll try to experiment some more, thanks for the hints!
 
  Chris
 
 
  Best,
  Erick
 
  On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer r...@networkz.ch 
  wrote:
  Hi
 
  I'm trying to build an index for technical documents that basically
  works like grep, i.e. the user gives an arbitray substring somewhere
  in a line of a document and the exact matches will be returned. I
  specifically want no stemming etc. and keep all whitespace, parentheses
  etc. because they might be significant. The only normalization is that
  the search should be case-insensitvie.
 
  I tried to achieve this by tokenizing on line breaks, and then building
  trigrams of the individual lines:
 
  fieldType name=configtext_trigram class=solr.TextField 
 
  analyzer type=index
 
  tokenizer class=solr.PatternTokenizerFactory
  pattern=\R group=-1/
 
  filter class=solr.NGramFilterFactory
  minGramSize=3 maxGramSize=3/
  filter class=solr.LowerCaseFilterFactory/
 
  /analyzer
 
  analyzer type=query
 
  tokenizer class=solr.NGramTokenizerFactory
  minGramSize=3 maxGramSize=3/
  filter class=solr.LowerCaseFilterFactory/
 
  /analyzer
  /fieldType
 
  Then in the search, I use the edismax parser with mm=100%, so given the
  documents
 
 
  {id:test1,content:
  encryption
  10.0.100.22
  description
  }
 
  {id:test2,content:
  10.100.0.22
  description
  }
 
  and the query content:encryption, this will turn into
 
  parsedquery_toString:
 
  +((content:enc content:ncr content:cry content:ryp
  content:ypt content:pti content:tio content:ion)~8),
 
  and return only the first document. All fine and dandy. But I have a
  problem with possible false positives. If the search is e.g.
 
  content:.100.22
 
  then the generated query will be
 
  parsedquery_toString:
  +((content:.10 content:100 content:00. content:0.2 content:.22)~5),
 
  and because all of tokens are also generated for document test2 in the
  proximity of 5, both documents will wrongly be returned.
 
  So somehow I'd need to express the query content:.10 content:100
  content:00. content:0.2 content:.22 with *the tokens exactly in this
  order and nothing in between*. Is this somehow possible, maybe by using
  the termvectors/termpositions stuff? Or am I trying to do something
  that's fundamentally impossible? Other good ideas how to 

Re: re:New Solr installation fails to create core

2015-08-26 Thread deviantcode
Hi Scott,
How about, having logged in as a privileged user, running create_core as
solr? Something like this on a Red Hat env:

sudo -u solr ./bin/solr create_core -c demo

KR
Henry



--
View this message in context: 
http://lucene.472066.n3.nabble.com/re-New-Solr-installation-fails-to-create-core-tp4221768p4225361.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Exact substring search with ngrams

2015-08-26 Thread Erick Erickson
bq: my dog
has fleas
I wouldn't want some variant of "og ha" to match.

Here's where the mysterious positionIncrementGap comes in. If you
make this field multiValued,  and index this like this:
<doc>
  <field name="blah">my dog</field>
  <field name="blah">has fleas</field>
</doc>

or equivalently in SolrJ just
doc.addField("blah", "my dog");
doc.addField("blah", "has fleas");

then the position of dog will be 2 and the position of has will be
102 assuming
the positionIncrementGap is the default 100. N.B. I'm not sure you'll
see this in the
admin/analysis page or not.

Anyway, now your example won't match across the two parts unless
you specify a slop up in the 101 range.

Best,
Erick

On Wed, Aug 26, 2015 at 2:19 AM, Christian Ramseyer r...@networkz.ch wrote:
 On 26/08/15 00:24, Erick Erickson wrote:
 Hmmm, this sounds like a nonsensical question, but what do you mean
 by arbitrary substring?

 Because if your substrings consist of whole _tokens_, then ngramming
 is totally unnecessary (and gets in the way). Phrase queries with no slop
 fulfill this requirement.

 But let's assume you need to match within tokens, i.e. if the doc
 contains my dog has fleas, you need to match input like as fle, in this
 case ngramming is an option.

 Yeah the as fle-thing is exactly what I want to achieve.


 You have substantially different index and query time chains. The result is 
 that
 the offsets for all the grams at index time are the same in the quick 
 experiment
 I tried, all were 1. But at query time, each gram had an incremented 
 position.

 I'd start by using the query time analysis chain for indexing also. Next, I'd
 try enclosing multiple words in double quotes at query time and go from 
 there.
 What you have now is an anti-pattern in that having substantially
 different index
 and query time analysis chains is not something that's likely to be very
 predictable unless you know _exactly_ what the consequences are.

 The admin/analysis page is your friend, in this case check the
 verbose checkbox
 to see what I mean.

 Hmm interesting. I had the additional \R tokenizer in the index chain
 because the document can be multiple lines (but the search text is
 always a single line) and if the document was

 my dog
 has fleas

 I wouldn't want some variant of og ha to match, but I didn't realize
 it didn't give me any positions like you noticed.

 I'll try to experiment some more, thanks for the hints!

 Chris


 Best,
 Erick

 On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer r...@networkz.ch wrote:
 Hi

 I'm trying to build an index for technical documents that basically
 works like grep, i.e. the user gives an arbitrary substring somewhere
 in a line of a document and the exact matches will be returned. I
 specifically want no stemming etc. and keep all whitespace, parentheses
 etc. because they might be significant. The only normalization is that
 the search should be case-insensitive.

 I tried to achieve this by tokenizing on line breaks, and then building
 trigrams of the individual lines:

 <fieldType name="configtext_trigram" class="solr.TextField">

   <analyzer type="index">

     <tokenizer class="solr.PatternTokenizerFactory"
                pattern="\R" group="-1"/>

     <filter class="solr.NGramFilterFactory"
             minGramSize="3" maxGramSize="3"/>
     <filter class="solr.LowerCaseFilterFactory"/>

   </analyzer>

   <analyzer type="query">

     <tokenizer class="solr.NGramTokenizerFactory"
                minGramSize="3" maxGramSize="3"/>
     <filter class="solr.LowerCaseFilterFactory"/>

   </analyzer>
 </fieldType>

 Then in the search, I use the edismax parser with mm=100%, so given the
 documents


 {"id":"test1","content":"
 encryption
 10.0.100.22
 description
 "}

 {"id":"test2","content":"
 10.100.0.22
 description
 "}

 and the query content:encryption, this will turn into

 parsedquery_toString:

 +((content:enc content:ncr content:cry content:ryp
 content:ypt content:pti content:tio content:ion)~8),

 and return only the first document. All fine and dandy. But I have a
 problem with possible false positives. If the search is e.g.

 content:.100.22

 then the generated query will be

 parsedquery_toString:
 +((content:.10 content:100 content:00. content:0.2 content:.22)~5),

 and because all of tokens are also generated for document test2 in the
 proximity of 5, both documents will wrongly be returned.

 So somehow I'd need to express the query content:.10 content:100
 content:00. content:0.2 content:.22 with *the tokens exactly in this
 order and nothing in between*. Is this somehow possible, maybe by using
 the termvectors/termpositions stuff? Or am I trying to do something
 that's fundamentally impossible? Other good ideas how to achieve this
 kind of behaviour?

 Thanks
 Christian






Re: New Solr installation fails to create collection/core

2015-08-26 Thread Erick Erickson
Deviantcode, did you look at the referenced JIRA:

https://issues.apache.org/jira/browse/SOLR-7826

Or is that irrelevant?

Best,
Erick

On Wed, Aug 26, 2015 at 1:58 AM, deviantcode hnoclel...@gmail.com wrote:
 I run into this exact problem trying out the latest solr, [5.3.0], @Scott,
 how did you fix it?
 KR
 Henry



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/re-New-Solr-installation-fails-to-create-core-tp4221768p4225350.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: StrDocValues

2015-08-26 Thread Jamie Johnson
I think I found it.  {!boost..} gave me what I was looking for and then a
custom collector filtered out anything that I didn't want to show.
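For the archives: {!boost} wraps another query and multiplies its score by a function, along these lines (the popularity field is just a stand-in):

  q={!boost b=sum(popularity,1)}text:solr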

On Wed, Aug 26, 2015 at 1:48 PM, Jamie Johnson jej2...@gmail.com wrote:

 Are there any example implementation showing how StrDocValues works?  I am
 not sure if this is the right place or not, but I was thinking about having
 some document level doc value that I'd like to read in a function query to
 impact if the document is returned or not.  Am I barking up the right tree
 looking at this or is there another method to supporting this?



Re: StrDocValues

2015-08-26 Thread Jamie Johnson
I don't see it explicitly mentioned, but does the boost only get applied to
the final documents/score that matched the provided query or is it called
for each field that matched?  I'm assuming only once per document that
matched the main query, is that right?

On Wed, Aug 26, 2015 at 5:35 PM, Jamie Johnson jej2...@gmail.com wrote:

 I think I found it.  {!boost..} gave me what i was looking for and then a
 custom collector filtered out anything that I didn't want to show.

 On Wed, Aug 26, 2015 at 1:48 PM, Jamie Johnson jej2...@gmail.com wrote:

 Are there any example implementation showing how StrDocValues works?  I
 am not sure if this is the right place or not, but I was thinking about
 having some document level doc value that I'd like to read in a function
 query to impact if the document is returned or not.  Am I barking up the
 right tree looking at this or is there another method to supporting this?





Re: find documents based on specific term frequency

2015-08-26 Thread Chris Hostetter

: Is there a way to search for documents that have a word appearing more 
: than a certain number of times? For example, I want to find documents 
: that only have more than 10 instances of the word genetics …

Try...

q=text:genetics&fq={!frange+incl=false+l=10}termfreq('text','genetics')

Note: the q=text:genetics isn't necessary -- you could do any query and
then filter on the numeric function range of the termfreq() function, or
use that {!frange} as your main query (in which case all matching docs will
have identical scores).  I just included that in the example to show how
you can search & sort by the normal style scoring (which takes into
account full TF-IDF and length normalization) while filtering on the TF
using a function query.

You can also request the termfreq() as a pseudo-field for each doc in the
results, and parameterize the details to eliminate redundancy in
the request params...


...fq={!frange+incl=false+l=10+v=$tf}&fl=*,$tf&tf=termfreq('text','genetics')

Is the same as...

...fq={!frange+incl=false+l=10}termfreq('text','genetics')&fl=*,termfreq('text','genetics')


A big caveat to this however is that the termfreq function operates on the
*RAW* underlying term values -- no query-time analyzer is used -- so
if you do stemming or lowercasing in your index analyzer, you have to
pass the stemmed/lowercased values to the function.  (Although I just filed
SOLR-7981 since it occurs to me we can make this automatic in the future
with a new function argument.)

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser
https://cwiki.apache.org/confluence/display/solr/Function+Queries



-Hoss
http://www.lucidworks.com/

find documents based on specific term frequency

2015-08-26 Thread Tang, Rebecca
Hi there,

We have an index built on Solr 5.0.  We received a user question:
"Is there a way to search for documents that have a word appearing more than a
certain number of times? For example, I want to find documents that only have
more than 10 instances of the word genetics …"

I'm not sure if it's possible to do this with Solr.  Does anyone know?


Rebecca Tang
Applications Developer, UCSF CKM
Industry Documents Digital Libraries
E: rebecca.t...@ucsf.edu



Re: best way for adding a new field to all indexed documents...

2015-08-26 Thread Mikhail Khludnev
Sadly, it's always a problem
http://searchivarius.org/blog/how-rename-fields-solr


On Wed, Aug 26, 2015 at 11:20 AM, Roxana Danger 
roxana.dan...@reedonline.co.uk wrote:

 Hello,
I have a index created with solr, and I would like to add a new
 field to all the documents of the index. I suppose I could a) use an
 updateRequestHandler or b) create another index importing the data from the
 initial index and the data of my new field. Which could be the best
 approach? Will the background processing be re-indexing the documents?
Thank you very much in advance,
 Roxana

 --
 Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street, London,
 WC2B 5LX Tel: 020 7067 4568 [image: reed.co.uk] http://www.reed.co.uk/
 The
 UK's #1 job site. http://www.reed.co.uk/ [image: Follow us on Twitter]
 https://twitter.com/reedcouk
 https://www.linkedin.com/company/reed.co.uk [image:
 Like us on Facebook] https://www.facebook.com/reedcouk/
 https://plus.google.com/u/0/+reedcouk/posts It's time to Love Mondays »
 http://www.reed.co.uk/lovemondays




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Search opening hours

2015-08-26 Thread Stefan Matheis
Have a look at the links that Alexandre mentioned. A somewhat non-obvious
solution, because you'd probably not think about spatial features while dealing
with opening times - but it's worth having a look.

-Stefan 


On Wednesday, August 26, 2015 at 10:16 AM, O. Klein wrote:

 Thank you for responding.
 
 Yonik's solution is what I had in mind. Was hoping for something more
 elegant, as he said, but it will work.
 
 The thing I haven't figured out is how to deal with closing times early
 morning next day.
 
 So it's 22:00 now and opening hours are 20:00 to 03:00
 
 Can this be done with either or both approaches?
 
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225339.html
 Sent from the Solr - User mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 




Re: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-26 Thread Jamie Johnson
Thanks again Erick, I created
https://issues.apache.org/jira/browse/SOLR-7975, though I didn't attach s
patch because my current implementation is not useful generally right now,
it meets my use case but likely would not meet others.  I will try to look
about generalizing this to allow something custom to be plugged in.
On Aug 26, 2015 2:46 AM, Erick Erickson erickerick...@gmail.com wrote:

 Sure, I think it's fine to raise a JIRA, especially if you can include
 a patch, even a preliminary one to solicit feedback... which I'll
 leave to people who are more familiar with that code...

 I'm not sure how generally useful this would be, and if it comes
 at a cost to normal searching there's sure to be lively discussion.

 Best
 Erick

 On Tue, Aug 25, 2015 at 7:50 PM, Jamie Johnson jej2...@gmail.com wrote:
  Looks like I have something basic working for Trie fields.  I am doing
  exactly what I said in my previous email, so good news there.  I think
 this
  is a big step as there are only a few field types left that I need to
  support, those being date (should be similar to Trie) and Spatial fields,
  which at a glance looked like it provided a way to provide the token
 stream
  through an extension.  Definitely need to look more though.
 
  All of this said though, is this really the right way to get payloads
 into
  these types of fields?  Should a jira feature request be added for this?
  On Aug 25, 2015 8:13 PM, Jamie Johnson jej2...@gmail.com wrote:
 
  Right, I had assumed (obviously here is my problem) that I'd be able to
  specify payloads for the field regardless of the field type.  Looking at
  TrieField that is certainly non-trivial.  After a bit of digging it
 appears
  that if I wanted to do something here I'd need to build a new TrieField,
  override createField and provide a Field that would return something
 like
  NumericTokenStream but also provide the payloads.  Like you said sounds
  interesting to say the least...
 
  Were payloads not really intended to be used for these types of fields
  from a Lucene perspective?
 
 
  On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
  Well, you're going down a path that hasn't been trodden before ;).
 
  If you can treat your primitive types as text types you might get
  some traction, but that makes a lot of operations like numeric
  comparison difficult.
 
  H. another idea from left field. For single-valued types,
  what about a sidecar field that has the auth token? And even
  for a multiValued field, two parallel fields are guaranteed to
  maintain order so perhaps you could do something here. Yes,
  I'm waving my hands a LOT here.
 
  I suspect that trying to have a custom type that incorporates
  payloads for, say, trie fields will be interesting to say the least.
  Numeric types are packed to save storage etc. so it'll be
  an adventure..
 
  Best,
  Erick
 
  On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson jej2...@gmail.com
 wrote:
   We were originally using this approach, i.e. run things through the
   KeywordTokenizer - DelimitedPayloadFilter - WordDelimiterFilter.
  Again
   this works fine for text, though I had wanted to use the
  StandardTokenizer
   in the chain.  Is there an equivalent filter that does what the
   StandardTokenizer does?
  
   All of this said this doesn't address the issue of the primitive
 field
   types, which at this point is the bigger issue.  Given this use case
  should
   there be another way to provide payloads?
  
   My current thinking is that I will need to provide custom
  implementations
   for all of the field types I would like to support payloads on which
  will
   essentially be copies of the standard versions with some extra
 sugar
  to
   read/write the payloads (I don't see a way to wrap/delegate these at
  this
   point because AttributeSource has the attribute retrieval related
  methods
   as final so I can't simply wrap another tokenizer and return my added
   attributes + the wrapped attributes).  I know my use case is a bit
  strange,
   but I had not expected to need to do this given that Lucene/Solr
  supports
   payloads on these field types, they just aren't exposed.
  
   As always I appreciate any ideas if I'm barking up the wrong tree
 here.
  
   On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma 
  markus.jel...@openindex.io
   wrote:
  
   Well, if i remember correctly (i have no testing facility at hand)
   WordDelimiterFilter maintains payloads on emitted sub terms. So if
 you
  use
   a KeywordTokenizer, input 'some text^PAYLOAD', and have a
   DelimitedPayloadFilter, the entire string gets a payload. You can
 then
   split that string up again in individual tokens. It is possible to
  abuse
   WordDelimiterFilter for it because it has a types parameter that you
  can
   use to split it on whitespace if its input is not trimmed. Otherwise
  you
   can use any other character instead of a space as your input.
  
   This is a crazy idea, 

Re: Search opening hours

2015-08-26 Thread Upayavira


On Tue, Aug 25, 2015, at 10:54 PM, Yonik Seeley wrote:
 On Tue, Aug 25, 2015 at 5:02 PM, O. Klein kl...@octoweb.nl wrote:
  I'm trying to find the best way to search for stores that are open NOW.
 
 It's probably not the *best* way, but assuming it's currently 4:10pm,
 you could do
 
 +open:[* TO 1610] +close:[1610 TO *]
 
 And to account for days of the week have different fields for each day
 openM, closeM, openT, closeT, etc...  not super elegant, but seems to
 get the job done.

So, the basic question is what does "now" mean? If it is 5:29pm and a
shop closes at 5:30pm, does that count as open? If you want to query
a single time within a range, then Yonik's approach will work
(although I'd use open0 to open6 for the days of the week).

If you want to find a range within another range, then use what
Alexandre suggested - spatial search functionality. For example, you
could say, is the shop open for 10 minutes either side of now. Of
course, you could use spatial for a time within a range, and it might be
a little more elegant because you can use a multivalued field to specify
the open/close ranges for your store.

Upayavira


Re: Please answer my question on StackOverflow ... Best approach to guarantee commits in SOLR

2015-08-26 Thread Charlie Hull

On 25/08/2015 13:21, Simer P wrote:

http://stackoverflow.com/questions/32138845/what-is-the-best-approach-to-guarantee-commits-in-apache-solr
.

*Question:* How can I get guaranteed commits with Apache Solr where
persisting data to disk and visibility are both equally important?

*Background:* We have a website which requires high-end search
functionality for machine learning and also requires guaranteed commits for
financial transactions. We just want to use Solr as our only datastore to keep
things simple and *do not* want to use another database on the side.

I can't seem to find any answer to this question. The simplest solution for
a financial transaction seems to be to periodically query Solr for the
record after it has been persisted, but this can mean a longer wait time - or is
there a better solution?

Can anyone please suggest a solution for achieving guaranteed commits
with Solr?

Firstly, if you're asking here, you're likely to be answered here, not 
on Stack Overflow.


A search engine is not a database. Although both Solr and Elasticsearch 
are often used as primary stores with varying degrees of success, they 
are after all search engines and designed for this use.


Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Behavior of grouping on a field with same value spread across shards.

2015-08-26 Thread Modassar Ather
Thanks Erick.

On Wed, Aug 26, 2015 at 12:11 PM, Erick Erickson erickerick...@gmail.com
wrote:

 That should be the case.

 Best,
 Erick

 On Tue, Aug 25, 2015 at 8:55 PM, Modassar Ather modather1...@gmail.com
 wrote:
  Thanks Erick,
 
  I saw the link. So is it that the grouping functionality works fine in
  distributed search except the two cases mentioned in the link?
 
  Regards,
  Modassar
 
  On Tue, Aug 25, 2015 at 10:40 PM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
  That's not really the case. Perhaps you're confusing
  group.ngroups and group.facet with just grouping?
 
  See the ref guide:
 
 
 https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats
 
  Best,
  Erick
 
  On Tue, Aug 25, 2015 at 4:51 AM, Modassar Ather modather1...@gmail.com
 
  wrote:
   Hi,
  
   As per my understanding, to group on a field all documents with the
 same
   value in the field have to be in the same shard.
  
   Can we group by a field where the documents with the same value in
 that
   field will be distributed across shards?
   Please let me know what are the limitations, feature not available or
   performance issues for such fields?
  
   Thanks,
   Modassar
 



best way for adding a new field to all indexed documents...

2015-08-26 Thread Roxana Danger
Hello,
   I have an index created with Solr, and I would like to add a new
field to all the documents in the index. I suppose I could a) use an
updateRequestHandler or b) create another index, importing the data from the
initial index plus the data for my new field. Which would be the best
approach? Will the background processing be re-indexing the documents?
   Thank you very much in advance,
Roxana

-- 
Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street, London,
WC2B 5LX Tel: 020 7067 4568 [image: reed.co.uk] http://www.reed.co.uk/ The
UK's #1 job site. http://www.reed.co.uk/ [image: Follow us on Twitter]
https://twitter.com/reedcouk
https://www.linkedin.com/company/reed.co.uk [image:
Like us on Facebook] https://www.facebook.com/reedcouk/
https://plus.google.com/u/0/+reedcouk/posts It's time to Love Mondays »
http://www.reed.co.uk/lovemondays


Re: New Solr installation fails to create collection/core

2015-08-26 Thread deviantcode
I ran into this exact problem trying out the latest Solr (5.3.0). @Scott,
how did you fix it?
KR
Henry



--
View this message in context: 
http://lucene.472066.n3.nabble.com/re-New-Solr-installation-fails-to-create-core-tp4221768p4225350.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Hash of solr documents

2015-08-26 Thread david . davila
Yes, it's an XY problem :)

We are running our first tests of splitting our shard (Solr 5.1).

The problem we have is this: the number of documents indexed in the new
shards is lower than in the original one (19814 and 19653, vs 61100), and
always the same. We have no idea why Solr is doing this. A problem with
some documents, with the segment?

A long time after we changed from standalone Solr to SolrCloud, we found
that the router parameter in clusterstate.json was incorrect, because we
wanted to have compositeId and it was set as explicit. The solution
was to delete clusterstate.json and restart Solr. And we are thinking
that maybe the problem with the SPLIT is related to that: some documents
are stored with the hash value and others are not, and SPLIT needs that to
distribute them. But I know that this likely has nothing to do with the
SPLIT problem, it's only an idea.

This is the log, all seem to be normal:

INFO  - 2015-08-26 09:13:47.654; org.apache.solr.handler.admin.CoreAdminHandler; Invoked split action for core: buscon
INFO  - 2015-08-26 09:13:47.656; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO  - 2015-08-26 09:13:47.656; org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes. Skipping IW.commit.
INFO  - 2015-08-26 09:13:47.657; org.apache.solr.core.SolrCore; SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
INFO  - 2015-08-26 09:13:47.657; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO  - 2015-08-26 09:13:47.658; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partitions=2 segments=1
INFO  - 2015-08-26 09:13:47.922; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #0 partitionCount=2 range=0-3fff
INFO  - 2015-08-26 09:13:47.922; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #0 partitionCount=2 range=0-3fff segment #0 segmentCount=1
INFO  - 2015-08-26 09:22:19.533; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #1 partitionCount=2 range=4000-7fff
INFO  - 2015-08-26 09:22:19.536; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #1 partitionCount=2 range=4000-7fff segment #0 segmentCount=1
INFO  - 2015-08-26 09:30:44.141; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={targetCore=buscon_shard2_0_replica1&targetCore=buscon_shard2_1_replica1&action=SPLIT&core=buscon&wt=javabin&qt=/admin/cores&version=2} status=0 QTime=1016486
INFO  - 2015-08-26 09:30:44.387; org.apache.solr.handler.admin.CoreAdminHandler; Applying buffered updates on core: buscon_shard2_0_replica1
INFO  - 2015-08-26 09:30:44.387; org.apache.solr.handler.admin.CoreAdminHandler; No buffered updates available. core=buscon_shard2_0_replica1
INFO  - 2015-08-26 09:30:44.388; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={name=buscon_shard2_0_replica1&action=REQUESTAPPLYUPDATES&wt=javabin&qt=/admin/cores&version=2} status=0 QTime=2
INFO  - 2015-08-26 09:30:44.441; org.apache.solr.handler.admin.CoreAdminHandler; Applying buffered updates on core: buscon_shard2_1_replica1
INFO  - 2015-08-26 09:30:44.441; org.apache.solr.handler.admin.CoreAdminHandler; No buffered updates available. core=buscon_shard2_1_replica1
INFO  - 2015-08-26 09:30:44.441; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={name=buscon_shard2_1_replica1&action=REQUESTAPPLYUPDATES&wt=javabin&qt=/admin/cores&version=2} status=0 QTime=0
INFO  - 2015-08-26 09:30:44.743; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4)




Thanks,

David
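For reference, the collection-level form of the split, which derives the sub-ranges from the collection's hash ranges, looks like this (collection and shard names taken from the log above):

  http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=buscon&shard=shard2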



From:    Anshum Gupta ans...@anshumgupta.net
To:      solr-user@lucene.apache.org
Date:    26/08/2015 10:27
Subject: Re: Hash of solr documents



Hi David,

The route key itself is indexed, but not the hash value. Why do you need 
to
know and display the hash value? This seems like an XY problem to me:
http://people.apache.org/~hossman/#xyproblem

On Wed, Aug 26, 2015 at 1:17 AM, david.dav...@correo.aeat.es wrote:

 Hi,

 I have read in one post in the Internet that the hash Solr Cloud
 calculates over the key field to send each document to a different shard
 is indexed. Is this true? If true, is there any way to show this hash 
for
 each document?

 Thanks,

 David




-- 
Anshum Gupta



Re: Exact substring search with ngrams

2015-08-26 Thread Christian Ramseyer
On 26/08/15 00:24, Erick Erickson wrote:
 Hmmm, this sounds like a nonsensical question, but what do you mean
 by arbitrary substring?
 
 Because if your substrings consist of whole _tokens_, then ngramming
 is totally unnecessary (and gets in the way). Phrase queries with no slop
 fulfill this requirement.
 
 But let's assume you need to match within tokens, i.e. if the doc
 contains my dog has fleas, you need to match input like as fle, in this
 case ngramming is an option.

Yeah the as fle-thing is exactly what I want to achieve.

 
 You have substantially different index and query time chains. The result is 
 that
 the offsets for all the grams at index time are the same in the quick 
 experiment
 I tried, all were 1. But at query time, each gram had an incremented position.
 
 I'd start by using the query time analysis chain for indexing also. Next, I'd
 try enclosing multiple words in double quotes at query time and go from there.
 What you have now is an anti-pattern in that having substantially
 different index
 and query time analysis chains is not something that's likely to be very
 predictable unless you know _exactly_ what the consequences are.
 
 The admin/analysis page is your friend, in this case check the
 verbose checkbox
 to see what I mean.

Hmm interesting. I had the additional \R tokenizer in the index chain
because the document can be multiple lines (but the search text is
always a single line) and if the document was

my dog
has fleas

I wouldn't want some variant of og ha to match, but I didn't realize
it didn't give me any positions like you noticed.

I'll try to experiment some more, thanks for the hints!

Chris

 
 Best,
 Erick
 
 On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer r...@networkz.ch wrote:
 Hi

 I'm trying to build an index for technical documents that basically
 works like grep, i.e. the user gives an arbitrary substring somewhere
 in a line of a document and the exact matches will be returned. I
 specifically want no stemming etc. and keep all whitespace, parentheses
 etc. because they might be significant. The only normalization is that
 the search should be case-insensitive.

 I tried to achieve this by tokenizing on line breaks, and then building
 trigrams of the individual lines:

 <fieldType name="configtext_trigram" class="solr.TextField">

   <analyzer type="index">

     <tokenizer class="solr.PatternTokenizerFactory"
                pattern="\R" group="-1"/>

     <filter class="solr.NGramFilterFactory"
             minGramSize="3" maxGramSize="3"/>
     <filter class="solr.LowerCaseFilterFactory"/>

   </analyzer>

   <analyzer type="query">

     <tokenizer class="solr.NGramTokenizerFactory"
                minGramSize="3" maxGramSize="3"/>
     <filter class="solr.LowerCaseFilterFactory"/>

   </analyzer>
 </fieldType>

 Then in the search, I use the edismax parser with mm=100%, so given the
 documents


 {"id":"test1","content":"
 encryption
 10.0.100.22
 description
 "}

 {"id":"test2","content":"
 10.100.0.22
 description
 "}

 and the query content:encryption, this will turn into

 parsedquery_toString:

 +((content:enc content:ncr content:cry content:ryp
 content:ypt content:pti content:tio content:ion)~8),

 and return only the first document. All fine and dandy. But I have a
 problem with possible false positives. If the search is e.g.

 content:.100.22

 then the generated query will be

 parsedquery_toString:
 +((content:.10 content:100 content:00. content:0.2 content:.22)~5),

 and because all of tokens are also generated for document test2 in the
 proximity of 5, both documents will wrongly be returned.

 So somehow I'd need to express the query content:.10 content:100
 content:00. content:0.2 content:.22 with *the tokens exactly in this
 order and nothing in between*. Is this somehow possible, maybe by using
 the termvectors/termpositions stuff? Or am I trying to do something
 that's fundamentally impossible? Other good ideas how to achieve this
 kind of behaviour?

 Thanks
 Christian






Re: Search opening hours

2015-08-26 Thread O. Klein
Thank you for responding.

Yonik's solution is what I had in mind. Was hoping for something more
elegant, as he said, but it will work.

The thing I haven't figured out is how to deal with closing times in the
early morning of the next day.

So it's 22:00 now and opening hours are 20:00 to 03:00

Can this be done with either or both approaches?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225339.html
Sent from the Solr - User mailing list archive at Nabble.com.


Hash of solr documents

2015-08-26 Thread david . davila
Hi,

I have read in a post on the Internet that the hash SolrCloud
calculates over the key field to route each document to a shard
is indexed. Is this true? If it is, is there any way to show this hash for
each document?

Thanks,

David

Re: Search opening hours

2015-08-26 Thread O. Klein
Those options don't fix my problem with closing times the next morning, or is
there a way to do this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225354.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search opening hours

2015-08-26 Thread Upayavira


On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
 Those options don't fix my problem with closing times the next morning,
 or is
 there a way to do this?

Use the spatial model, and a time window of a week. There are 10,080
minutes in a week, so you could use that as your scale.

Assuming the week starts at 00:00 Monday morning, you might index Monday
9:00-23:00 as  540:1380

Tuesday 9am-Wednesday 1am would be 1980:2940

You convert your NOW time into minutes since Monday 00:00 and do a
spatial search within that time.

If it is now Monday, 11:23am, that would be 11*60+23=683, so you would
do a search for 683:683.

If you have a shop that is open over Sunday night to Monday, you just
list it as open until Sunday 23:59 and open again Monday 00:00.
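
In Solr terms that might look roughly like the following (a sketch only -
the field names are made up and the exact attribute syntax varies between
Solr versions):

  <fieldType name="minutesOfWeek" class="solr.SpatialRecursivePrefixTreeFieldType"
             geo="false" worldBounds="ENVELOPE(0, 10080, 10080, -10080)"
             distErrPct="0" maxDistErr="1"/>
  <field name="open_hours" type="minutesOfWeek" indexed="true" stored="true"
         multiValued="true"/>

Index Monday 9:00-23:00 as a thin box on the x axis:

  open_hours: ENVELOPE(540, 1380, 1, -1)

and query with the current minute-of-week, e.g. Monday 11:23 = 683:

  fq=open_hours:"Intersects(ENVELOPE(683, 683, 1, -1))"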

Would that do it?

Upayavira


Re: Hash of solr documents

2015-08-26 Thread Anshum Gupta
Hi David,

The route key itself is indexed, but not the hash value. Why do you need to
know and display the hash value? This seems like an XY problem to me:
http://people.apache.org/~hossman/#xyproblem
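
(If the underlying goal is just to see which shard each document ended up
on, something like the [shard] document transformer on a distributed query
may already be enough - a sketch, with host and collection names made up:

  http://localhost:8983/solr/collection1/select?q=*:*&fl=id,[shard]&wt=json

That returns, per document, the shard it was served from.)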

On Wed, Aug 26, 2015 at 1:17 AM, david.dav...@correo.aeat.es wrote:

 Hi,

 I have read in one post in the Internet that the hash Solr Cloud
 calculates over the key field to send each document to a different shard
 is indexed. Is this true? If true, is there any way to show this hash for
 each document?

 Thanks,

 David




-- 
Anshum Gupta


Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Toke Eskildsen
On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote:

 Now I've tried to increase the carrot.fragSize to 75 and
 carrot.summarySnippets to 2, and set the carrot.produceSummary to
 true. With this setting, I'm mostly able to get the cluster results
 back within 2 to 3 seconds when I set rows=200. I'm still trying out
 to see if the cluster labels are ok, but in theory do you think this
 is a suitable setting to attempt to improve the clustering results and
 at the same time improve the performance?

I don't know - the quality/performance point as well as which knobs to
tweak is extremely dependent on your corpus and your hardware. A person
with better understanding of carrot might be able to do better sanity
checking, but I am not at all at that level.

Related, it seems to me that the question of how to tweak the clustering
has little to do with Solr and a lot to do with carrot (assuming here
that carrot is the bottleneck). You might have more success asking in a
carrot forum?


- Toke Eskildsen, State and University Library, Denmark





Re: splitting shards on 4.7.2 with custom plugins

2015-08-26 Thread Jeff Courtade
Hi,


So I got the shards to split, but they are very unbalanced.


 7204922 total docs on the original collection

shard1_0 numdocs 3661699

shard1_1 numdocs 3543132

shard2_0 numdocs 0

shard2_1 numdocs 0

Any ideas?

This is what I had to do to get this to split with the custom libs:

I got shard1 to split successfully and it created replicas on the other
servers in the cloud for the new shard/shards.


This is the gist of it.


When you split a shard, Solr creates 2 new cores.

When creating a core, it uses the solr/solr.xml settings for the classpath,
etc.

This is why searches etc. work fine and can find the opa plugins, but when we
called shard split it could not.


I had to move the custom jars outside of the collection directory and add
this to solr/solr.xml on the 4 nodes.


info here  https://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond



<solr>


  <str name="sharedLib">${sharedLib:../lib}</str>


When you restart, you can see it in the log loading the jars from the new
location.



INFO  - 2015-08-25 23:40:52.297; org.apache.solr.core.CoreContainer;
loading shared library: /opt/solr/solr-4.7.2/solr01/solr/../lib

INFO  - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/commons-pool-1.6.jar' to
classloader

INFO  - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/query-processing-language-0.2-SNAPSHOT.jar'
to classloader

INFO  - 2015-08-25 23:40:52.299; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-continuation-8.1.10.v20130312.jar'
to classloader

INFO  - 2015-08-25 23:40:52.301; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/groovy-all-2.0.4.jar' to
classloader

INFO  - 2015-08-25 23:40:52.302; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/qpl-solr472-0.2-SNAPSHOT.jar'
to classloader

INFO  - 2015-08-25 23:40:52.302; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-jmx-8.1.10.v20130312.jar' to
classloader

INFO  - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-deploy-8.1.10.v20130312.jar' to
classloader

INFO  - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/ext/' to classloader

INFO  - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-xml-8.1.10.v20130312.jar' to
classloader

So I then ran the split and checked on it in the morning:

http://dj01.aws.narasearch.us:8981/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1


it succeeded and created replicas.

ls /opt/solr/solr-4.7.2/solr0*/solr/

/opt/solr/solr-4.7.2/solr01/solr/:
bin  collection1_shard1_0_replica1  README.txt  zoo.cfg
collection1  collection1_shard1_1_replica1  solr.xml

/opt/solr/solr-4.7.2/solr02/solr/:
bin  collection1  README.txt  solr.xml  zoo.cfg

/opt/solr/solr-4.7.2/solr03/solr/:
bin  collection1  collection1_shard1_0_replica2  README.txt  solr.xml
 zoo.cfg

/opt/solr/solr-4.7.2/solr04/solr/:
bin  collection1  collection1_shard1_1_replica2  README.txt  solr.xml
 zoo.cfg


and it actually distributed it

[root@dj01 solr]# du -sh *
4.0K    bin
41G     collection1
18G     collection1_shard1_0_replica1
16G     collection1_shard1_1_replica1
4.0K    README.txt
4.0K    solr.xml
4.0K    zoo.cfg
[root@dj01 solr]# du -sh
/opt/solr/solr-4.7.2/solr04/solr/collection1_shard1_1_replica2
16G /opt/solr/solr-4.7.2/solr04/solr/collection1_shard1_1_replica2
[root@dj01 solr]# du -sh
/opt/solr/solr-4.7.2/solr03/solr/collection1_shard1_0_replica2
18G /opt/solr/solr-4.7.2/solr03/solr/collection1_shard1_0_replica2


Jeff Courtade
M: 240.507.6116
On Aug 25, 2015 11:09 PM, Anshum Gupta ans...@anshumgupta.net wrote:

 Can you elaborate a bit more on the setup, what do the custom plugins do,
 what error do you get ? It seems like a classloader/classpath issue to me
 which doesn't really relate to Shard splitting.


 On Tue, Aug 25, 2015 at 7:59 PM, Jeff Courtade courtadej...@gmail.com
 wrote:

  I am getting failures when trying to split shards on Solr 4.7.2 with
  custom plugins.
 
  It fails regularly; it cannot find the jar files for the plugins when
  creating the new cores/shards.
 
  Ideas?
 
  --
  Thanks,
 
  Jeff Courtade
  M: 240.507.6116
 



 --
 Anshum Gupta



Re: Search opening hours

2015-08-26 Thread Upayavira
Darren,

That was delightfully dense. Do you think you could unpack it a bit
more? Possibly some sample (pseudo) queries?

Upayavira 

On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote:
 If you wanted to try a spatial approach that blended times like above,
 you
 could try a polygon of minimum width that spans the globe - this is
 literally using spatial search (geocodes) against time. So in this
 scenario
 you logically subdivide the polygon into 7 distinct regions (for days)
 and
 then within this you can defined, like a timeline, what open and closed
 means. The problem of 3AM is taken care of because of it's continuous
 nature - ie one day is adjacent to the next, with Sunday and Monday
 backing
 up to each other. Just a thought.
 
 On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote:
 
 
 
  On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
   Those options don't fix my problem with closing times the next morning,
   or is
   there a way to do this?
 
  Use the spatial model, and a time window of a week. There are 10,080
  minutes in a week, so you could use that as your scale.
 
  Assuming the week starts at 00:00 Monday morning, you might index Monday
  9:00-23:00 as  540:1380
 
  Tuesday 9am-Wednesday 1am would be 1980:2940
 
  You convert your NOW time into a minutes since Monday 00:00 and do a
  spatial search within that time.
 
  If it is now Monday, 11:23am, that would be 11*60+23=683, so you would
  do a search for 683:683.
 
  If you have a shop that is open over Sunday night to Monday, you just
  list it as open until Sunday 23:59 and open again Monday 00:00.
 
  Would that do it?
 
  Upayavira
 
 
 
 
 -- 
 Darren


Re: Search opening hours

2015-08-26 Thread Upayavira
delightfully dense = really intriguing, but I couldn't quite
understand it - really hoping for more info

On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote:
 Darren,
 
 That was delightfully dense. Do you think you could unpack it a bit
 more? Possibly some sample (pseudo) queries?
 
 Upayavira 
 
 On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote:
  If you wanted to try a spatial approach that blended times like above,
  you
  could try a polygon of minimum width that spans the globe - this is
  literally using spatial search (geocodes) against time. So in this
  scenario
  you logically subdivide the polygon into 7 distinct regions (for days)
  and
  then within this you can defined, like a timeline, what open and closed
  means. The problem of 3AM is taken care of because of it's continuous
  nature - ie one day is adjacent to the next, with Sunday and Monday
  backing
  up to each other. Just a thought.
  
  On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote:
  
  
  
   On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
Those options don't fix my problem with closing times the next morning,
or is
there a way to do this?
  
   Use the spatial model, and a time window of a week. There are 10,080
   minutes in a week, so you could use that as your scale.
  
   Assuming the week starts at 00:00 Monday morning, you might index Monday
   9:00-23:00 as  540:1380
  
   Tuesday 9am-Wednesday 1am would be 1980:2940
  
   You convert your NOW time into a minutes since Monday 00:00 and do a
   spatial search within that time.
  
   If it is now Monday, 11:23am, that would be 11*60+23=683, so you would
   do a search for 683:683.
  
   If you have a shop that is open over Sunday night to Monday, you just
   list it as open until Sunday 23:59 and open again Monday 00:00.
  
   Would that do it?
  
   Upayavira
  
  
  
  
  -- 
  Darren


Re: Search opening hours

2015-08-26 Thread Darren Spehr
If you wanted to try a spatial approach that blended times like above, you
could try a polygon of minimum width that spans the globe - this is
literally using spatial search (geocodes) against time. So in this scenario
you logically subdivide the polygon into 7 distinct regions (for days) and
then within this you can define, like a timeline, what open and closed
mean. The problem of 3AM is taken care of because of its continuous
nature - i.e. one day is adjacent to the next, with Sunday and Monday backing
up to each other. Just a thought.

On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote:



 On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
  Those options don't fix my problem with closing times the next morning,
  or is
  there a way to do this?

 Use the spatial model, and a time window of a week. There are 10,080
 minutes in a week, so you could use that as your scale.

 Assuming the week starts at 00:00 Monday morning, you might index Monday
 9:00-23:00 as  540:1380

 Tuesday 9am-Wednesday 1am would be 1980:2940

 You convert your NOW time into a minutes since Monday 00:00 and do a
 spatial search within that time.

 If it is now Monday, 11:23am, that would be 11*60+23=683, so you would
 do a search for 683:683.

 If you have a shop that is open over Sunday night to Monday, you just
 list it as open until Sunday 23:59 and open again Monday 00:00.

 Would that do it?

 Upayavira




-- 
Darren


Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Zheng Lin Edwin Yeo
Hi Toke,

Thank you for the link.

I'm using Solr 5.2.1, but I think the carrot2 bundled with it will be a
slightly older version, as I'm using the latest carrot2-workbench-3.10.3,
which was only released recently. I've changed all the settings like fragSize
and desiredClusterCountBase to be the same on both sides, and I'm now able to
get very similar cluster results.

Now I've tried to increase the carrot.fragSize to 75 and
carrot.summarySnippets to 2, and set the carrot.produceSummary to true.
With this setting, I'm mostly able to get the cluster results back within 2
to 3 seconds when I set rows=200. I'm still trying out to see if the
cluster labels are ok, but in theory do you think this is a suitable
setting to attempt to improve the clustering results and at the same time
improve the performance?

Regards,
Edwin



On 26 August 2015 at 13:58, Toke Eskildsen t...@statsbiblioteket.dk wrote:

 On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote:
  I'm currently trying out on the Carrot2 Workbench and get it to call Solr
  to see how they did the clustering. Although it still takes some time to
 do
  the clustering, but the results of the cluster is much better than mine.
 I
  think its probably due to the different settings like the fragSize and
  desiredCluserCountBase?

 Either that or the carrot bundled with Solr is an older version.

  By the way, the link on the clustering example
  https://cwiki.apache.org/confluence/display/solr/Result is not working
 as
  it says 'Page Not Found'.

 That is because it is too long for a single line. Try copy-pasting it:

 https://cwiki.apache.org/confluence/display/solr/Result
 +Clustering#ResultClustering-Configuration

 - Toke Eskildsen, State and University Library, Denmark





Re: splitting shards on 4.7.2 with custom plugins

2015-08-26 Thread Jeff Courtade
I'm looking at the clusterstate.json to see why it is doing this. I really
don't understand it though...

{collection1:{
shards:{
  shard1:{
range:8000-,
state:active,
replicas:{
  core_node1:{
state:active,
base_url:http://10.135.2.153:8981/solr;,
core:collection1,
node_name:10.135.2.153:8981_solr,
leader:true},
  core_node10:{
state:active,
base_url:http://10.135.2.153:8982/solr;,
core:collection1,
node_name:10.135.2.153:8982_solr}}},
  shard2:{
range:0-7fff,
state:inactive,
replicas:{
  core_node9:{
state:active,
base_url:http://10.135.2.153:8984/solr;,
core:collection1,
node_name:10.135.2.153:8984_solr,
leader:true},
  core_node11:{
state:active,
base_url:http://10.135.2.153:8983/solr;,
core:collection1,
node_name:10.135.2.153:8983_solr}}},
  shard1_1:{
range:null,
state:active,
parent:null,
replicas:{
  core_node6:{
state:active,
base_url:http://10.135.2.153:8981/solr;,
core:collection1_shard1_1_replica1,
node_name:10.135.2.153:8981_solr,
leader:true},
  core_node8:{
state:active,
base_url:http://10.135.2.153:8984/solr;,
core:collection1_shard1_1_replica2,
node_name:10.135.2.153:8984_solr}}},
  shard1_0:{
range:null,
state:active,
parent:null,
replicas:{
  core_node5:{
state:active,
base_url:http://10.135.2.153:8981/solr;,
core:collection1_shard1_0_replica1,
node_name:10.135.2.153:8981_solr,
leader:true},
  core_node7:{
state:active,
base_url:http://10.135.2.153:8983/solr;,
core:collection1_shard1_0_replica2,
node_name:10.135.2.153:8983_solr}}},
  shard2_0:{
range:0-3fff,
state:active,
replicas:{
  core_node13:{
state:active,
base_url:http://10.135.2.153:8984/solr;,
core:collection1_shard2_0_replica1,
node_name:10.135.2.153:8984_solr,
leader:true},
  core_node14:{
state:active,
base_url:http://10.135.2.153:8982/solr;,
core:collection1_shard2_0_replica2,
node_name:10.135.2.153:8982_solr}}},
  shard2_1:{
range:4000-7fff,
state:active,
replicas:{
  core_node12:{
state:active,
base_url:http://10.135.2.153:8984/solr;,
core:collection1_shard2_1_replica1,
node_name:10.135.2.153:8984_solr,
leader:true},
  core_node15:{
state:active,
base_url:http://10.135.2.153:8981/solr;,
core:collection1_shard2_1_replica2,
node_name:10.135.2.153:8981_solr,
maxShardsPerNode:1,
router:{name:compositeId},
replicationFactor:1,
autoCreated:true}}


--
Thanks,

Jeff Courtade
M: 240.507.6116

On Wed, Aug 26, 2015 at 8:44 AM, Jeff Courtade courtadej...@gmail.com
wrote:

 Hi,


 So i got the shards too split. But they are very unbalanced.


  7204922 total docs on the original collection

 shard1_0 numdocs 3661699

 shard1_1 numdocs 3543132

 shard2_0 numdocs 0

 shard2_1 numdcs 0

 Any ideas?

 This is what i had to do to get this to split with the custom libs

 I got shard1 to split successfully and it created replicas on the other
 servers in the cloud for the new shard/shards.


 This is the jist of it.


 When you split a shard solr creates a 2 new cores.

 When creating a core it uses the solr/solr.xml settings for classpath
 etc

 This is why searches etc work fine and can find the opa plugins but when
 we called shardsplit it could not.


 I had to move the custom jars outside of the collection directory and add
 this to solr/solr.xml on the 4 nodes.


 info here  https://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond



 solr


 str name=sharedLib${sharedLib:../lib}/str


 when you restart you can see it in the log loading the jars form the new
 location.



 INFO  - 2015-08-25 23:40:52.297; org.apache.solr.core.CoreContainer;
 loading shared library: /opt/solr/solr-4.7.2/solr01/solr/../lib

 INFO  - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader;
 Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/commons-pool-1.6.jar' to
 classloader

 INFO  - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader;
 Adding
 'file:/opt/solr/solr-4.7.2/solr01/lib/query-processing-language-0.2-SNAPSHOT.jar'
 to classloader

 INFO  - 2015-08-25 23:40:52.299; org.apache.solr.core.SolrResourceLoader;
 Adding
 

Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Zheng Lin Edwin Yeo
Thanks for your recommendation Toke.

Will try to ask in the carrot forum.

Regards,
Edwin

On 26 August 2015 at 18:45, Toke Eskildsen t...@statsbiblioteket.dk wrote:

 On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote:

  Now I've tried to increase the carrot.fragSize to 75 and
  carrot.summarySnippets to 2, and set the carrot.produceSummary to
  true. With this setting, I'm mostly able to get the cluster results
  back within 2 to 3 seconds when I set rows=200. I'm still trying out
  to see if the cluster labels are ok, but in theory do you think this
  is a suitable setting to attempt to improve the clustering results and
  at the same time improve the performance?

 I don't know - the quality/performance point as well as which knobs to
 tweak is extremely dependent on your corpus and your hardware. A person
 with better understanding of carrot might be able to do better sanity
 checking, but I am not at all at that level.

 Related, it seems to me that the question of how to tweak the clustering
 has little to do with Solr and a lot to do with carrot (assuming here
 that carrot is the bottleneck). You might have more success asking in a
 carrot forum?


 - Toke Eskildsen, State and University Library, Denmark






Re: Search opening hours

2015-08-26 Thread Darren Spehr
Sure - and sorry for its density. I reread it and thought the same ;)

So imagine a polygon of say 1/2 mile width (I made that up) that stretches
around the equator. Let's call this a week's timeline and subdivide it into
7 blocks, one for each day. For the sake of simplicity assume it's a line
(which I forget but is supported in Solr as an infinitely small polygon)
starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for
Sunday at 11:59 PM. By subdivide you can think of it either radially or by
longitude, but you have 360 degrees to divide into 7, which means that
every hour is represented by a range of roughly 2.143 degrees (360/7/24).
These regions represent each day and hour (or less), and the region
boundaries represent midnight for the day before.

Now for indexing - your open hours then become a combination of these
subdivisions. If you're open 24x7 then the whole polygon is indexed. If
you're only open on Monday from 9-5 then only the polygon between
(0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you
can index any combination of times this way.
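
To make that concrete, that Monday 9-5 entry could be indexed as a thin shape
along the week line, something like this (a sketch - the field name is made
up, and the exact shape support depends on the spatial field type and whether
JTS is on the classpath):

  open_week: ENVELOPE(-160.71, -143.57, 0.01, -0.01)

and a "what is open now" lookup then becomes a tiny-box intersection query at
the longitude corresponding to the current day and hour, e.g.

  fq=open_week:"Intersects(ENVELOPE(-160.0, -160.0, 0, 0))"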

So now the varsity question is how to do this with a fluctuating calendar?
I think this example can be extended to include searching against any given
day of the week in a year, or years. Just imagine a translation layer that
adjusts the latitude N or S by some amount to represent which day in which
year you're looking for. Make sense?

On Wed, Aug 26, 2015 at 10:50 AM, Upayavira u...@odoko.co.uk wrote:

 delightfully dense = really intriguing, but I couldn't quite
 understand it - really hoping for more info

 On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote:
  Darren,
 
  That was delightfully dense. Do you think you could unpack it a bit
  more? Possibly some sample (pseudo) queries?
 
  Upayavira
 
  On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote:
   If you wanted to try a spatial approach that blended times like above,
   you
   could try a polygon of minimum width that spans the globe - this is
   literally using spatial search (geocodes) against time. So in this
   scenario
   you logically subdivide the polygon into 7 distinct regions (for days)
   and
   then within this you can defined, like a timeline, what open and closed
   means. The problem of 3AM is taken care of because of it's continuous
   nature - ie one day is adjacent to the next, with Sunday and Monday
   backing
   up to each other. Just a thought.
  
   On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote:
  
   
   
On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
 Those options don't fix my problem with closing times the next
 morning,
 or is
 there a way to do this?
   
Use the spatial model, and a time window of a week. There are 10,080
minutes in a week, so you could use that as your scale.
   
Assuming the week starts at 00:00 Monday morning, you might index
 Monday
9:00-23:00 as  540:1380
   
Tuesday 9am-Wednesday 1am would be 1980:2940
   
You convert your NOW time into a minutes since Monday 00:00 and do
 a
spatial search within that time.
   
If it is now Monday, 11:23am, that would be 11*60+23=683, so you
 would
do a search for 683:683.
   
If you have a shop that is open over Sunday night to Monday, you just
list it as open until Sunday 23:59 and open again Monday 00:00.
   
Would that do it?
   
Upayavira
   
  
  
  
   --
   Darren




-- 
Darren


Connect and sync two solr server

2015-08-26 Thread shahper

Hi,

I want to connect two SolrCloud servers and sync their indexes with each
other, so that if any server is down we can work with the other, and whenever
I update or add to the index on any server the other also gets updated.


shahper


Re: StrDocValues

2015-08-26 Thread Yonik Seeley
On Wed, Aug 26, 2015 at 6:20 PM, Jamie Johnson jej2...@gmail.com wrote:
 I don't see it explicitly mentioned, but does the boost only get applied to
 the final documents/score that matched the provided query or is it called
 for each field that matched?  I'm assuming only once per document that
 matched the main query, is that right?

Correct.

-Yonik


Re: Solr 5.2.1 versus Solr 4.7.0 performance

2015-08-26 Thread Shawn Heisey
On 8/26/2015 1:11 AM, Esther Goldbraich wrote:
 We have benchmarked a set of queries on Solr 4.7.0 and 5.2.1 (with same 
 data, same solrconfig.xml) and saw better query performance on Solr 4.7.0 
 (5-15% better than 5.2.1, with the exception of a 100% improvement for one of 
 the queries).
 Using same JVM (IBM 1.7) and JVM params.
 Index's size is ~500G, spread over 64 shards, with replication factor 2.
 Do you know about any config / setup change for Solr 5.2.1 that can 
 improve the performance? Any idea what causes this behavior?

I have little experience comparing the performance of different
versions, but I have a general sense that OS disk caching becomes
increasingly important to Solr's performance as time goes on.  What this
means in real terms is that if you have enough memory for adequate OS
disk caching, using a later version of Solr will probably yield better
performance, but if you don't have enough memory, you might actually see
*worse* performance.

A question that might become important later, but doesn't really affect
the immediate things I'm thinking about: what GC tuning options are you
using?

How much RAM do you have in each machine, and how big is Solr's heap? 
How much index data actually lives on each server?  Be sure to count all
replicas on each machine.

https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Thanks,
Shawn



Re: Lucene/Solr 5.0 and custom FieldCahe implementation

2015-08-26 Thread Jamie Johnson
Sorry to poke this again, but I'm not following the last comment about how I
could go about extending the SolrIndexSearcher and having the extension
used.  Is there an example of this?  Again, thanks.

Jamie
On Aug 25, 2015 7:18 AM, Jamie Johnson jej2...@gmail.com wrote:

 I had seen this as well. If I overrode this by extending
 SolrIndexSearcher, how do I have my extension used?  I didn't see a way that
 it could be plugged in.
 On Aug 25, 2015 7:15 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 On Tue, Aug 25, 2015 at 2:03 PM, Jamie Johnson jej2...@gmail.com wrote:

  Thanks Mikhail.  If I'm reading the SimpleFacets class correctly, it
  delegates to DocValuesFacets when the facet method is FC, which is what
  used to be FieldCache I believe.  DocValuesFacets either uses DocValues or
  builds them using the UninvertingReader.
 

 Ah.. got it. Thanks for the reminder about these details. It seems like even
 docValues=true doesn't help with your custom implementation.


 
  I am not seeing a clean extension point to add a custom
 UninvertingReader
  to Solr, would the only way be to copy the FacetComponent and
 SimpleFacets
  and modify as needed?
 
 Sadly, yes. There is no proper extension point. Also, consider overriding
 SolrIndexSearcher.wrapReader(SolrCore, DirectoryReader), where the
 particular UninvertingReader is created; there you can pass your own one,
 which refers to the custom FieldCache.


  On Aug 25, 2015 12:42 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com
  wrote:
 
   Hello Jamie,
   I don't understand how it could choose DocValuesFacets (it occurs on
   docValues=true) field, but then switches to
 UninvertingReader/FieldCache
   which means docValues=false. If you can provide more details it would
 be
   great.
   Beside of that, I suppose you can only implement and inject your own
   UninvertingReader, I don't think there is an extension point for this.
  It's
   too specific requirement.
  
   On Tue, Aug 25, 2015 at 3:50 AM, Jamie Johnson jej2...@gmail.com
  wrote:
  
as mentioned in a previous email I have a need to provide security
   controls
at the term level.  I know that Lucene/Solr doesn't support this so
 I
  had
baked something onto a 4.x baseline that was sufficient for my use
  cases.
I am now looking to move that implementation to 5.x and am running
 into
   an
issue around faceting.  Previously we were able to provide a custom
  cache
implementation that would create separate cache entries given a
   particular
set of security controls, but in Solr 5 some faceting is delegated
 to
DocValuesFacets which delegates to UninvertingReader in my case (we
 are
   not
storing DocValues).  The issue I am running into is that before 5.x
 I
  had
the ability to influence the FieldCache that was used at the Solr
 level
   to
also include a security token into the key so each cache entry was
  scoped
to a particular level.  With the current implementation the
 FieldCache
seems to be an internal detail that I can't influence in anyway.  Is
  this
correct?  I had noticed this Jira ticket
https://issues.apache.org/jira/browse/LUCENE-5427, is there any
  movement
on
this?  Is there another way to influence the information that is put
  into
these caches?  As always thanks in advance for any suggestions.
   
-Jamie
   
  
  
  
   --
   Sincerely yours
   Mikhail Khludnev
   Principal Engineer,
   Grid Dynamics
  
   http://www.griddynamics.com
   mkhlud...@griddynamics.com
  
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com