bin/solr start - long response on screen

2017-02-21 Thread Uchit Patel

Hi,
I have upgraded SOLR to 6.4.0 from 5.1.0.
When I start my SOLR I get the following:
-bash-3.2$ bin/solr start
Archiving 1 old GC log files to /opt/wml/solr-6.4.0/server/logs/archived
Archiving 1 console log files to /opt/wml/solr-6.4.0/server/logs/archived
Rotating solr logs, keeping a max of 9 generations
Waiting up to 180 seconds to see Solr running on port 8983
lsof: unsupported TCP/TPI info selection: C
lsof: unsupported TCP/TPI info selection: P
lsof: unsupported TCP/TPI info selection: :
lsof: unsupported TCP/TPI info selection: L
lsof: unsupported TCP/TPI info selection: I
lsof: unsupported TCP/TPI info selection: S
lsof: unsupported TCP/TPI info selection: T
lsof: unsupported TCP/TPI info selection: E
lsof: unsupported TCP/TPI info selection: N
lsof 4.78
 latest revision: ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/
 latest FAQ: ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/FAQ
 latest man page: ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/lsof_man
usage: [-?abhlnNoOPRstUvVX] [+|-c c] [+|-d s] [+D D] [+|-f[gG]] [+|-e s] [-F [f]]
 [-g [s]] [-i [i]] [+|-L [l]] [+m [m]] [+|-M] [-o [o]] [-p s] [+|-r [t]] [-S [t]]
 [-T [t]] [-u s] [+|-w] [-x [fl]] [-Z [Z]] [--] [names]
Use the ``-h'' option to get more help information
SOLR does start, though. Is it OK, or is something wrong?

Thanks.
Regards,
Uchit Patel
Waste Management Inc.


Query complexity scorer.

2017-02-21 Thread Modassar Ather
Hi,

I am trying to estimate the likely complexity of a query heuristically (or based
on learning) and assign it a score before it is actually sent to Solr for
execution.
The query may contain wildcards, complex phrases, and phrases with wildcards.

The approach is to assign a number to each part of a query and then compute an
accumulated, normalized score. It could be extended further to score queries
based on their patterns.
The scorer is unaware of Solr, the index size, and the actual cost of finding a
match, so a simple-looking query may be scored as less complex yet take more
time in Solr than a query that looks more complex.
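
A minimal sketch of the kind of scorer I have in mind (the feature weights and
the normalization constant are placeholders that would have to be tuned against
observed response times; nothing here comes from Solr itself):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryComplexityScorer {

    // Assumed weights per query feature -- to be tuned, or learned from query logs.
    private static final Map<Pattern, Double> FEATURE_WEIGHTS = new LinkedHashMap<>();
    static {
        FEATURE_WEIGHTS.put(Pattern.compile("[*?]"), 2.0);            // wildcard characters
        FEATURE_WEIGHTS.put(Pattern.compile("\"[^\"]+\"~\\d+"), 3.0); // sloppy/complex phrases
        FEATURE_WEIGHTS.put(Pattern.compile("\"[^\"]+\""), 1.5);      // phrases (a sloppy phrase also counts here)
        FEATURE_WEIGHTS.put(Pattern.compile("\\b(OR|AND)\\b"), 0.5);  // boolean clauses
    }

    /** Returns a score in [0,1); higher means "probably more expensive". */
    public double score(String query) {
        double raw = 0.0;
        for (Map.Entry<Pattern, Double> e : FEATURE_WEIGHTS.entrySet()) {
            Matcher m = e.getKey().matcher(query);
            while (m.find()) {
                raw += e.getValue();
            }
        }
        return raw / (raw + 10.0); // simple squashing so long queries do not grow without bound
    }
}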

Kindly share your suggestions and inputs on the parameters to consider and
how it can be implemented.

Note: I am using SpanQueryParser (
https://issues.apache.org/jira/browse/LUCENE-5205) for phrases/complex
phrases.

Thanks,
Modassar


Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-21 Thread Hendrik Haddorp

Hi Erick,

in the non-HDFS case that sounds logical, but in the HDFS case all the 
index data is in the shared HDFS file system. Even the transaction logs 
should be in there. So the node that once had the replica should not 
really have more information than any other node, especially if 
legacyCloud is set to false so that ZooKeeper is the source of truth.


regards,
Hendrik

On 22.02.2017 02:28, Erick Erickson wrote:

Hendrik:

bq: Not really sure why one replica needs to be up though.

I didn't write the code so I'm guessing a bit, but consider the
situation where you have no replicas for a shard up and add a new one.
Eventually it could become the leader but there would have been no
chance for it to check if its version of the index was up to date.
But since it would be the leader, when other replicas for that shard
_do_ come on line they'd replicate the index down from the newly added
replica, possibly using very old data.

FWIW,
Erick

On Tue, Feb 21, 2017 at 1:12 PM, Hendrik Haddorp
 wrote:

Hi,

I had opened SOLR-10092 (https://issues.apache.org/jira/browse/SOLR-10092)
for this a while ago. I was now able to get this feature working with a very
small code change. After a few seconds Solr reassigns the replica to a
different Solr instance as long as one replica is still up. Not really sure
why one replica needs to be up though. I added the patch based on Solr 6.3
to the bug report. Would be great if it could be merged soon.

regards,
Hendrik

On 19.01.2017 17:08, Hendrik Haddorp wrote:

HDFS is like a shared filesystem so every Solr Cloud instance can access
the data using the same path or URL. The clusterstate.json looks like this:

"shards":{"shard1":{
    "range":"8000-7fff",
    "state":"active",
    "replicas":{
      "core_node1":{
        "core":"test1.collection-0_shard1_replica1",
        "dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
        "base_url":"http://slave3:9000/solr",
        "node_name":"slave3:9000_solr",
        "state":"active",
        "ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
      "core_node2":{
        "core":"test1.collection-0_shard1_replica2",
        "dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
        "base_url":"http://slave2:9000/solr",
        "node_name":"slave2:9000_solr",
        "state":"active",
        "ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
        "leader":"true"},
      "core_node3":{
        "core":"test1.collection-0_shard1_replica3",
        "dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
        "base_url":"http://slave4:9005/solr",
        "node_name":"slave4:9005_solr",
        "state":"active",
        "ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"

So every replica is always assigned to one node and this is being stored
in ZK, pretty much the same as for non-HDFS setups. But as the data is not
stored locally but on the network, and as the path does not contain any node
information, you can of course easily move the work to a different Solr
node. You should just need to update the owner of the replica in ZK and you
should basically be done, I assume. That's why the documentation states that
an advantage of using HDFS is that a failing node can be replaced by a
different one. The Overseer just has to move the ownership of the replica,
which seems like what the code is trying to do. There just seems to be a bug
in the code so that the core does not get created on the target node.

Each data directory also contains a lock file. The documentation states
that one should use the HdfsLockFactory, which unfortunately can easily lead
to SOLR-8335, which hopefully will be fixed by SOLR-8169. A manual cleanup
is however also easily done but seems to require a node restart to take
effect. But I'm also only recently playing around with all this ;-)

regards,
Hendrik

On 19.01.2017 16:40, Shawn Heisey wrote:

On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:

Given that the data is on HDFS it shouldn't matter if any active
replica is left as the data does not need to get transferred from
another instance but the new core will just take over the existing
data. Thus a replication factor of 1 should also work just in that
case the shard would be down until the new core is up. Anyhow, it
looks like the above call fails to set the shard id, I guess, or
some code is doing the wrong check.

I know very little about how SolrCloud interacts with HDFS, so although
I'm reasonably certain about what comes below, I could be wrong.

I have not ever heard of SolrCloud being able to automatically take over
an existing index directory when it creates a replica, or even share
index directories unless the admin fools it into doing so without its
knowledge.  Sharing an index directory for replicas with SolrCloud would
NOT work correctly.  Solr must be able to update all replicas
independently, which means that each of them will lock its index
directory and write to it.

Re: Arabic words search in solr

2017-02-21 Thread mohanmca01
Hi Steve,

As per your suggestion I added the ICU folding filter and re-indexed the entire
Solr data, but I am still unable to find the expected results which I
highlighted earlier.
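
For reference, this is roughly how an ICU folding filter gets wired into a field
type (the type name below is only an example, and the ICU factories need the
analysis-extras contrib jars on the classpath):

<fieldType name="text_ar_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>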

I have attached an Excel sheet with examples of Arabic names for your
investigation and for reproducing the issue.
Arabic_Characters2.xlsx
  

thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4321582.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Migrate Documents to Another Collection

2017-02-21 Thread Piyush Kunal
I have also noticed this issue; it happens while creating the collated
result, mostly due to a large version mismatch between the server and the client.

The best idea would be to use the same server and client version.

Or else switch off collation (you can still keep spellcheck on) and do the
collation (it is nothing but concatenating the spellcheck suggestions) in your
application itself, as sketched below.
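
A rough SolrJ sketch of that second option (assuming a matched, recent
SolrJ/Solr 6.x pair; the URL and query text are placeholders, and the request
handler is assumed to have the spellcheck component configured):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class ClientSideCollation {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build()) {

            SolrQuery q = new SolrQuery("mispeled wrds");
            q.set("spellcheck", true);           // keep spellcheck on
            q.set("spellcheck.collate", false);  // but skip server-side collation
            QueryResponse rsp = solr.query(q);

            // Naive client-side "collation": concatenate the top alternative per token.
            StringBuilder collated = new StringBuilder();
            SpellCheckResponse sc = rsp.getSpellCheckResponse();
            if (sc != null) {
                for (SpellCheckResponse.Suggestion s : sc.getSuggestions()) {
                    String best = s.getAlternatives().isEmpty()
                            ? s.getToken() : s.getAlternatives().get(0);
                    collated.append(best).append(' ');
                }
            }
            System.out.println("Did you mean: " + collated.toString().trim());
        }
    }
}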

On 14-Feb-2017 7:34 am, "alias" <524839...@qq.com> wrote:

Hi, I use SolrJ 5.5.0 to query Solr 3.6 and it reported the following error:
java.lang.ClassCastException: java.lang.Boolean cannot be cast to org.apache.solr.common.util.NamedList
    at org.apache.solr.client.solrj.response.SpellCheckResponse.<init>(SpellCheckResponse.java:47)
    at org.apache.solr.client.solrj.response.QueryResponse.extractSpellCheckInfo(QueryResponse.java:179)
    at org.apache.solr.client.solrj.response.QueryResponse.setResponse(QueryResponse.java:153)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
    at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942)
    at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:957)
    at com.vip.vipme.demo.utils.SolrTest.testCategoryIdPC(SolrTest.java:66)
    at com.vip.vipme.demo.SolrjServlet1.doGet(SolrjServlet1.java:33)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:362)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)


Setting query.set("spellcheck", Boolean.FALSE) can work around this
problem,
but I would like to know the specific reason for it.


Thanks


Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Walter Underwood
Reindexing is exactly why you want the Single Source of Truth to be in a 
repository outside of Solr.

For our slowly-changing data sets, we have an intermediate JSONL batch. That is 
created from the source repositories and saved in Amazon S3. Then we load it 
into Solr nightly. That allows us to reload whenever we need to, like loading 
prod data in test or moving search to a different Amazon region.
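
A stripped-down sketch of that kind of loader (not our production code; the
collection URL, the file path, the field layout, and the use of Jackson for
parsing the JSONL are all assumptions):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class JsonlLoader {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/listings").build()) {
            List<String> lines = Files.readAllLines(Paths.get("/data/batches/listings.jsonl"));
            for (String line : lines) {
                if (line.trim().isEmpty()) continue;
                Map<String, Object> fields = mapper.readValue(line, Map.class);
                SolrInputDocument doc = new SolrInputDocument();
                fields.forEach(doc::addField);  // one JSON object per line -> one Solr document
                solr.add(doc);
            }
            solr.commit();  // batch reload: commit once at the end
        }
    }
}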

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 21, 2017, at 7:34 PM, Erick Erickson  wrote:
> 
> Dave:
> 
> Oh, I agree that a DB is a perfectly valid place to store the data and
> you're absolutely right that it allows better interaction than flat
> files; you can ask questions of an RDBMS that you can't easily ask the
> disk ;). Storing to disk is an alternative if you're unwilling to deal
> with a DB is all.
> 
> But the main point is you'll change your schema sometime and have to
> re-index. Having the data you're indexing stored locally in whatever
> form will allow much faster turn-around rather than re-crawling. Of
> course it'll result in out of date data so you'll have to refresh
> somehow sometime.
> 
> Erick
> 
> On Tue, Feb 21, 2017 at 6:07 PM, Dave  wrote:
>> Ha I think I went to one of your training seminars in NYC maybe 4 years ago 
>> Eric. I'm going to have to respectfully disagree about the rdbms.  It's such 
>> a well know data format that you could hire a high school programmer to help 
>> with the db end if you knew how to flatten it to solr. Besides it's easy to 
>> visualize and interact with the data before it goes to solr. A Json/Nosql 
>> format would work just as well, but I really think a database has its place 
>> in a scenario like this
>> 
>>> On Feb 21, 2017, at 8:20 PM, Erick Erickson  wrote:
>>> 
>>> I'll add that I _guarantee_ you'll want to re-index the data as you
>>> change your schema
>>> and the like. You'll be able to do that much more quickly if the data
>>> is stored locally somehow.
>>> 
>>> A RDBMS is not necessary however. You could simply store the data on
>>> disk in some format
>>> you could re-read and send to Solr.
>>> 
>>> Best,
>>> Erick
>>> 
 On Tue, Feb 21, 2017 at 5:17 PM, Dave  wrote:
 B is a better option long term. Solr is meant for retrieving flat data, 
 fast, not hierarchical. That's what a database is for and trust me you 
 would rather have a real database on the end point.  Each tool has a 
 purpose, solr can never replace a relational database, and a relational 
 database could not replace solr. Start with the slow model (database) for 
 control/display and enhance with the fast model (solr) for retrieval/search
 
 
 
> On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
> 
> To learn how to properly use Solr, I'm building a little experimental
> project with it to search for used car listings.
> 
> Car listings appear on a variety of different places ... central places
> Craigslist and also many many individual Used Car dealership websites.
> 
> I am wondering, should I:
> 
> (a) deploy a Solr search engine and build individual indexers for every
> type of web site I want to find listings on?
> 
> or
> 
> (b) build my own database to store car listings, and then build services
> that scrape data from different sites and feed entries into the database;
> then point my Solr search to my database, one simple source of listings?
> 
> My concerns are:
> 
> With (a) ... I have to be smart enough to understand all those different
> data sources and remove/update listings when they change; while this be
> harder to do with custom Solr indexers than writing something from 
> scratch?
> 
> With (b) ... I'm maintaining a huge database of all my listings which 
> seems
> redundant; google doesn't make a *copy* of everything on the internet, it
> just knows it's there.  Is maintaining my own database a bad design?
> 
> Thanks for reading!



Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Erick Erickson
Dave:

Oh, I agree that a DB is a perfectly valid place to store the data and
you're absolutely right that it allows better interaction than flat
files; you can ask questions of an RDBMS that you can't easily ask the
disk ;). Storing to disk is an alternative if you're unwilling to deal
with a DB is all.

But the main point is you'll change your schema sometime and have to
re-index. Having the data you're indexing stored locally in whatever
form will allow much faster turn-around rather than re-crawling. Of
course it'll result in out of date data so you'll have to refresh
somehow sometime.

Erick

On Tue, Feb 21, 2017 at 6:07 PM, Dave  wrote:
> Ha I think I went to one of your training seminars in NYC maybe 4 years ago 
> Eric. I'm going to have to respectfully disagree about the rdbms.  It's such 
> a well know data format that you could hire a high school programmer to help 
> with the db end if you knew how to flatten it to solr. Besides it's easy to 
> visualize and interact with the data before it goes to solr. A Json/Nosql 
> format would work just as well, but I really think a database has its place 
> in a scenario like this
>
>> On Feb 21, 2017, at 8:20 PM, Erick Erickson  wrote:
>>
>> I'll add that I _guarantee_ you'll want to re-index the data as you
>> change your schema
>> and the like. You'll be able to do that much more quickly if the data
>> is stored locally somehow.
>>
>> A RDBMS is not necessary however. You could simply store the data on
>> disk in some format
>> you could re-read and send to Solr.
>>
>> Best,
>> Erick
>>
>>> On Tue, Feb 21, 2017 at 5:17 PM, Dave  wrote:
>>> B is a better option long term. Solr is meant for retrieving flat data, 
>>> fast, not hierarchical. That's what a database is for and trust me you 
>>> would rather have a real database on the end point.  Each tool has a 
>>> purpose, solr can never replace a relational database, and a relational 
>>> database could not replace solr. Start with the slow model (database) for 
>>> control/display and enhance with the fast model (solr) for retrieval/search
>>>
>>>
>>>
 On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:

 To learn how to properly use Solr, I'm building a little experimental
 project with it to search for used car listings.

 Car listings appear on a variety of different places ... central places
 Craigslist and also many many individual Used Car dealership websites.

 I am wondering, should I:

 (a) deploy a Solr search engine and build individual indexers for every
 type of web site I want to find listings on?

 or

 (b) build my own database to store car listings, and then build services
 that scrape data from different sites and feed entries into the database;
 then point my Solr search to my database, one simple source of listings?

 My concerns are:

 With (a) ... I have to be smart enough to understand all those different
 data sources and remove/update listings when they change; while this be
 harder to do with custom Solr indexers than writing something from scratch?

 With (b) ... I'm maintaining a huge database of all my listings which seems
 redundant; google doesn't make a *copy* of everything on the internet, it
 just knows it's there.  Is maintaining my own database a bad design?

 Thanks for reading!


Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Dave
Ha, I think I went to one of your training seminars in NYC maybe 4 years ago,
Erick. I'm going to have to respectfully disagree about the RDBMS. It's such a
well-known data format that you could hire a high school programmer to help with
the DB end if you knew how to flatten it for Solr. Besides, it's easy to
visualize and interact with the data before it goes to Solr. A JSON/NoSQL
format would work just as well, but I really think a database has its place in
a scenario like this.

> On Feb 21, 2017, at 8:20 PM, Erick Erickson  wrote:
> 
> I'll add that I _guarantee_ you'll want to re-index the data as you
> change your schema
> and the like. You'll be able to do that much more quickly if the data
> is stored locally somehow.
> 
> A RDBMS is not necessary however. You could simply store the data on
> disk in some format
> you could re-read and send to Solr.
> 
> Best,
> Erick
> 
>> On Tue, Feb 21, 2017 at 5:17 PM, Dave  wrote:
>> B is a better option long term. Solr is meant for retrieving flat data, 
>> fast, not hierarchical. That's what a database is for and trust me you would 
>> rather have a real database on the end point.  Each tool has a purpose, solr 
>> can never replace a relational database, and a relational database could not 
>> replace solr. Start with the slow model (database) for control/display and 
>> enhance with the fast model (solr) for retrieval/search
>> 
>> 
>> 
>>> On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
>>> 
>>> To learn how to properly use Solr, I'm building a little experimental
>>> project with it to search for used car listings.
>>> 
>>> Car listings appear on a variety of different places ... central places
>>> Craigslist and also many many individual Used Car dealership websites.
>>> 
>>> I am wondering, should I:
>>> 
>>> (a) deploy a Solr search engine and build individual indexers for every
>>> type of web site I want to find listings on?
>>> 
>>> or
>>> 
>>> (b) build my own database to store car listings, and then build services
>>> that scrape data from different sites and feed entries into the database;
>>> then point my Solr search to my database, one simple source of listings?
>>> 
>>> My concerns are:
>>> 
>>> With (a) ... I have to be smart enough to understand all those different
>>> data sources and remove/update listings when they change; while this be
>>> harder to do with custom Solr indexers than writing something from scratch?
>>> 
>>> With (b) ... I'm maintaining a huge database of all my listings which seems
>>> redundant; google doesn't make a *copy* of everything on the internet, it
>>> just knows it's there.  Is maintaining my own database a bad design?
>>> 
>>> Thanks for reading!


Re: java.util.concurrent.TimeoutException: Idle timeout expired: 50001/50000 ms

2017-02-21 Thread Sadheera Vithanage
Cool, Thank you very much Erick and Walter.

On Wed, Feb 22, 2017 at 12:32 PM, Walter Underwood 
wrote:

> I’ve run with 8GB for years for moderate data sets (250K to 15M docs).
> Faceting can need more space.
>
> Make -Xms equal to -Xmx. The heap will grow to the max size regardless and
> you’ll get pauses while it grows. Starting at the max will avoid that pain.
>
> Solr uses lots and lots of short-lived allocations. Unless it goes into
> cache, everything allocated for a single request is garbage afterwards. You
> want to slow down the growth of tenured (old) space, so that it only
> includes cache ejections.
>
> I run with 8GB heap and 2GB of new/eden space. That makes my friend who
> works on the Go garbage collector cringe, but it works for Solr. I don’t
> fool around with all the ratio options. Just set the sizes and sleep well
> at night.
>
> Watch the sawtooth of the old space under load. The highest of the minimum
> allocated old space plus the eden space is your smallest working set. Add a
> bit of breathing space above that. Not tons, because more old space garbage
> means longer collections.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Feb 21, 2017, at 5:18 PM, Erick Erickson 
> wrote:
> >
> > Solr is very memory-intensive. 1g is still a very small heap. For any
> > sizeable data store people often run with at least 4G, often 8G or
> > more. If you facet or group or sort on fields that are _not_
> > docValues="true" fields you'll use up a lot of JVM memory. The
> > filterCache uses up maxDoc/8 bytes for every entry etc.
> >
> > I guess my point is that you shouldn't be surprised if 1G is too
> > small. I'd start with 4-8G and then reduce it after you get some
> > experience with your data and queries and now much memory they
> > require.
> >
> > Best,
> > Erick
> >
> > On Tue, Feb 21, 2017 at 3:07 PM, Sadheera Vithanage 
> wrote:
> >> Thanks Eric, It looked like the garbage collection was blocking the
> other
> >> processes.
> >>
> >> I updated the SOLR_JAVA_MEM="-Xms1g -Xmx4g" as it was the default before
> >> and looked like the garbage collection was triggered too frequent.
> >>
> >> Lets see how it goes now.
> >>
> >> Thanks again for the support.
> >>
> >> On Mon, Feb 20, 2017 at 11:50 AM, Erick Erickson <
> erickerick...@gmail.com>
> >> wrote:
> >>
> >>> The first place to look for something like his is garbage collection.
> >>> Are you hitting any really long stop-the-world GC pauses?
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Sun, Feb 19, 2017 at 2:21 PM, Sadheera Vithanage <
> sadhee...@gmail.com>
> >>> wrote:
>  Hi Experts,
> 
>  I have a solr cloud node (Just 1 node for now with a zookeeper
> running on
>  the same machine) running on ubuntu and It has been running without
> >>> issues
>  for a while.
> 
>  This morning I noticed below error in the error log.
> 
> 
>  *2017-02-19 20:27:54.724 ERROR (qtp97730845-4968) [   ]
>  o.a.s.s.HttpSolrCall null:java.io.IOException:
>  java.util.concurrent.TimeoutException: Idle timeout expired:
> >>> 50001/5 ms*
>  * at
>  org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(
> >>> SharedBlockingCallback.java:226)*
>  * at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:164)*
>  * at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:530)*
>  * at
>  org.apache.commons.io.output.ProxyOutputStream.write(
> >>> ProxyOutputStream.java:55)*
>  * at
>  org.apache.solr.response.QueryResponseWriterUtil$1.
> >>> write(QueryResponseWriterUtil.java:54)*
>  * at java.io.OutputStream.write(OutputStream.java:116)*
>  * at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)*
>  * at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)*
>  * at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)*
>  * at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)*
>  * at org.apache.solr.util.FastWriter.flush(FastWriter.java:140)*
>  * at org.apache.solr.util.FastWriter.write(FastWriter.java:54)*
>  * at
>  org.apache.solr.response.JSONWriter.writeStr(
> >>> JSONResponseWriter.java:454)*
>  * at
>  org.apache.solr.response.TextResponseWriter.writeVal(
> >>> TextResponseWriter.java:128)*
>  * at
>  org.apache.solr.response.JSONWriter.writeSolrDocument(
> >>> JSONResponseWriter.java:346)*
>  * at
>  org.apache.solr.response.TextResponseWriter.writeSolrDocumentList(
> >>> TextResponseWriter.java:239)*
>  * at
>  org.apache.solr.response.TextResponseWriter.writeVal(
> >>> TextResponseWriter.java:163)*
>  * at
>  org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(
> >>> JSONResponseWriter.java:184)*
>  * at
>  org.apache.solr.response.JSONWriter.writeNamedList(
> >>> JSONResponseWriter.java:300)*
>  * at

Re: java.util.concurrent.TimeoutException: Idle timeout expired: 50001/50000 ms

2017-02-21 Thread Walter Underwood
I’ve run with 8GB for years for moderate data sets (250K to 15M docs). Faceting 
can need more space.

Make -Xms equal to -Xmx. The heap will grow to the max size regardless and 
you’ll get pauses while it grows. Starting at the max will avoid that pain.

Solr uses lots and lots of short-lived allocations. Unless it goes into cache, 
everything allocated for a single request is garbage afterwards. You want to 
slow down the growth of tenured (old) space, so that it only includes cache 
ejections.

I run with 8GB heap and 2GB of new/eden space. That makes my friend who works 
on the Go garbage collector cringe, but it works for Solr. I don’t fool around 
with all the ratio options. Just set the sizes and sleep well at night.

Watch the sawtooth of the old space under load. The highest of the minimum 
allocated old space plus the eden space is your smallest working set. Add a bit 
of breathing space above that. Not tons, because more old space garbage means 
longer collections.
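
In solr.in.sh terms, that sizing looks roughly like this (illustrative values,
not a drop-in recommendation; adjust to your own working set):

SOLR_JAVA_MEM="-Xms8g -Xmx8g"             # equal min and max heap, so the heap never has to grow
GC_TUNE="-XX:+UseConcMarkSweepGC -Xmn2g"  # CMS (what the Solr 6 start script defaults to) plus a 2GB new/eden space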

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 21, 2017, at 5:18 PM, Erick Erickson  wrote:
> 
> Solr is very memory-intensive. 1g is still a very small heap. For any
> sizeable data store people often run with at least 4G, often 8G or
> more. If you facet or group or sort on fields that are _not_
> docValues="true" fields you'll use up a lot of JVM memory. The
> filterCache uses up maxDoc/8 bytes for every entry etc.
> 
> I guess my point is that you shouldn't be surprised if 1G is too
> small. I'd start with 4-8G and then reduce it after you get some
> experience with your data and queries and now much memory they
> require.
> 
> Best,
> Erick
> 
> On Tue, Feb 21, 2017 at 3:07 PM, Sadheera Vithanage  
> wrote:
>> Thanks Eric, It looked like the garbage collection was blocking the other
>> processes.
>> 
>> I updated the SOLR_JAVA_MEM="-Xms1g -Xmx4g" as it was the default before
>> and looked like the garbage collection was triggered too frequent.
>> 
>> Lets see how it goes now.
>> 
>> Thanks again for the support.
>> 
>> On Mon, Feb 20, 2017 at 11:50 AM, Erick Erickson 
>> wrote:
>> 
>>> The first place to look for something like his is garbage collection.
>>> Are you hitting any really long stop-the-world GC pauses?
>>> 
>>> Best,
>>> Erick
>>> 
>>> On Sun, Feb 19, 2017 at 2:21 PM, Sadheera Vithanage 
>>> wrote:
 Hi Experts,
 
 I have a solr cloud node (Just 1 node for now with a zookeeper running on
 the same machine) running on ubuntu and It has been running without
>>> issues
 for a while.
 
 This morning I noticed below error in the error log.
 
 
 *2017-02-19 20:27:54.724 ERROR (qtp97730845-4968) [   ]
 o.a.s.s.HttpSolrCall null:java.io.IOException:
 java.util.concurrent.TimeoutException: Idle timeout expired:
>>> 50001/5 ms*
 * at
 org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(
>>> SharedBlockingCallback.java:226)*
 * at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:164)*
 * at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:530)*
 * at
 org.apache.commons.io.output.ProxyOutputStream.write(
>>> ProxyOutputStream.java:55)*
 * at
 org.apache.solr.response.QueryResponseWriterUtil$1.
>>> write(QueryResponseWriterUtil.java:54)*
 * at java.io.OutputStream.write(OutputStream.java:116)*
 * at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)*
 * at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)*
 * at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)*
 * at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)*
 * at org.apache.solr.util.FastWriter.flush(FastWriter.java:140)*
 * at org.apache.solr.util.FastWriter.write(FastWriter.java:54)*
 * at
 org.apache.solr.response.JSONWriter.writeStr(
>>> JSONResponseWriter.java:454)*
 * at
 org.apache.solr.response.TextResponseWriter.writeVal(
>>> TextResponseWriter.java:128)*
 * at
 org.apache.solr.response.JSONWriter.writeSolrDocument(
>>> JSONResponseWriter.java:346)*
 * at
 org.apache.solr.response.TextResponseWriter.writeSolrDocumentList(
>>> TextResponseWriter.java:239)*
 * at
 org.apache.solr.response.TextResponseWriter.writeVal(
>>> TextResponseWriter.java:163)*
 * at
 org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(
>>> JSONResponseWriter.java:184)*
 * at
 org.apache.solr.response.JSONWriter.writeNamedList(
>>> JSONResponseWriter.java:300)*
 * at
 org.apache.solr.response.JSONWriter.writeResponse(
>>> JSONResponseWriter.java:96)*
 * at
 org.apache.solr.response.JSONResponseWriter.write(
>>> JSONResponseWriter.java:55)*
 * at
 org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(
>>> QueryResponseWriterUtil.java:65)*
 * at
 

Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-21 Thread Erick Erickson
Hendrik:

bq: Not really sure why one replica needs to be up though.

I didn't write the code so I'm guessing a bit, but consider the
situation where you have no replicas for a shard up and add a new one.
Eventually it could become the leader but there would have been no
chance for it to check if its version of the index was up to date.
But since it would be the leader, when other replicas for that shard
_do_ come on line they'd replicate the index down from the newly added
replica, possibly using very old data.

FWIW,
Erick

On Tue, Feb 21, 2017 at 1:12 PM, Hendrik Haddorp
 wrote:
> Hi,
>
> I had opened SOLR-10092 (https://issues.apache.org/jira/browse/SOLR-10092)
> for this a while ago. I was now able to gt this feature working with a very
> small code change. After a few seconds Solr reassigns the replica to a
> different Solr instance as long as one replica is still up. Not really sure
> why one replica needs to be up though. I added the patch based on Solr 6.3
> to the bug report. Would be great if it could be merged soon.
>
> regards,
> Hendrik
>
> On 19.01.2017 17:08, Hendrik Haddorp wrote:
>>
>> HDFS is like a shared filesystem so every Solr Cloud instance can access
>> the data using the same path or URL. The clusterstate.json looks like this:
>>
>> "shards":{"shard1":{
>> "range":"8000-7fff",
>> "state":"active",
>> "replicas":{
>>   "core_node1":{
>> "core":"test1.collection-0_shard1_replica1",
>> "dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
>> "base_url":"http://slave3:9000/solr;,
>> "node_name":"slave3:9000_solr",
>> "state":"active",
>>
>> "ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
>>   "core_node2":{
>> "core":"test1.collection-0_shard1_replica2",
>> "dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
>> "base_url":"http://slave2:9000/solr;,
>> "node_name":"slave2:9000_solr",
>> "state":"active",
>>
>> "ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
>> "leader":"true"},
>>   "core_node3":{
>> "core":"test1.collection-0_shard1_replica3",
>> "dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
>> "base_url":"http://slave4:9005/solr;,
>> "node_name":"slave4:9005_solr",
>> "state":"active",
>>
>> "ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"
>>
>> So every replica is always assigned to one node and this is being stored
>> in ZK, pretty much the same as for none HDFS setups. Just as the data is not
>> stored locally but on the network and as the path does not contain any node
>> information you can of course easily take over the work to a different Solr
>> node. You should just need to update the owner of the replica in ZK and you
>> should basically be done, I assume. That's why the documentation states that
>> an advantage of using HDFS is that a failing node can be replaced by a
>> different one. The Overseer just has to move the ownership of the replica,
>> which seems like what the code is trying to do. There just seems to be a bug
>> in the code so that the core does not get created on the target node.
>>
>> Each data directory also contains a lock file. The documentation states
>> that one should use the HdfsLockFactory, which unfortunately can easily lead
>> to SOLR-8335, which hopefully will be fixed by SOLR-8169. A manual cleanup
>> is however also easily done but seems to require a node restart to take
>> effect. But I'm also only recently playing around with all this ;-)
>>
>> regards,
>> Hendrik
>>
>> On 19.01.2017 16:40, Shawn Heisey wrote:
>>>
>>> On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:

 Given that the data is on HDFS it shouldn't matter if any active
 replica is left as the data does not need to get transferred from
 another instance but the new core will just take over the existing
 data. Thus a replication factor of 1 should also work just in that
 case the shard would be down until the new core is up. Anyhow, it
 looks like the above call is missing to set the shard id I guess or
 some code is checking wrongly.
>>>
>>> I know very little about how SolrCloud interacts with HDFS, so although
>>> I'm reasonably certain about what comes below, I could be wrong.
>>>
>>> I have not ever heard of SolrCloud being able to automatically take over
>>> an existing index directory when it creates a replica, or even share
>>> index directories unless the admin fools it into doing so without its
>>> knowledge.  Sharing an index directory for replicas with SolrCloud would
>>> NOT work correctly.  Solr must be able to update all replicas
>>> independently, which means that each of them will lock its index
>>> directory and write to it.
>>>

Re: How to figure out whether stopwords are being indexed or not

2017-02-21 Thread Erick Erickson
Attach &debug=query to your query and look at the parsed query that's returned.
That'll tell you what was searched at least.

You can also use the TermsComponent to examine terms in a field directly.
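
A minimal SolrJ sketch of both checks (the URL and field name follow the query
below; the /terms handler has to be backed by the TermsComponent in
solrconfig.xml):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class StopwordCheck {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8081/solr/collection1").build()) {

            // 1) Ask Solr how the query was actually parsed.
            SolrQuery q = new SolrQuery("Description_note:and");
            q.set("debug", "query");
            QueryResponse rsp = solr.query(q);
            System.out.println("parsed: " + rsp.getDebugMap().get("parsedquery"));

            // 2) Look at the indexed terms themselves via the TermsComponent.
            SolrQuery terms = new SolrQuery();
            terms.setRequestHandler("/terms");
            terms.set("terms", true);
            terms.set("terms.fl", "Description_note");
            terms.set("terms.prefix", "and");
            QueryResponse termsRsp = solr.query(terms);
            System.out.println(termsRsp.getTermsResponse().getTermMap());
        }
    }
}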

Best,
Erick

On Tue, Feb 21, 2017 at 2:52 PM, Pratik Patel  wrote:
> I have a field type in the schema which has a stopwords list applied.
> I have verified that the path of the stopwords file is correct and that it is
> loaded fine in the Solr admin UI. When I analyse these fields using the
> "Analysis" tab of the Solr admin UI, I can see that stopwords are being filtered
> out. However, when I query with some of these stopwords, I do get results
> back, which makes me think that the stopwords are probably being indexed.
>
> For example, when I run the following query, I do get back results. I have the
> word "and" in the stopwords list, so I expect no results for this query.
>
> http://localhost:8081/solr/collection1/select?fq=Description_note:*%20and%20*=on=*:*=100=0=json
>
> Does this mean that the word "and" is being indexed and the stopwords are not
> being used?
>
> Following is the field type of field Description_note :
>
>
>  positionIncrementGap="100" omitNorms="true">
>   
>   
> 
> 
>  pattern="((?m)[a-z]+)'s" replacement="$1s" />
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
> 
>  words="stopwords.txt" />
>   
>   
>   
> 
> 
>  pattern="((?m)[a-z]+)'s" replacement="$1s" />
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
> 
>  words="stopwords.txt" />
>   
> 


Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Walter Underwood
Awesome advice. flat=fast in Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 21, 2017, at 5:17 PM, Dave  wrote:
> 
> B is a better option long term. Solr is meant for retrieving flat data, fast, 
> not hierarchical. That's what a database is for and trust me you would rather 
> have a real database on the end point.  Each tool has a purpose, solr can 
> never replace a relational database, and a relational database could not 
> replace solr. Start with the slow model (database) for control/display and 
> enhance with the fast model (solr) for retrieval/search 
> 
> 
> 
>> On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
>> 
>> To learn how to properly use Solr, I'm building a little experimental
>> project with it to search for used car listings.
>> 
>> Car listings appear on a variety of different places ... central places
>> Craigslist and also many many individual Used Car dealership websites.
>> 
>> I am wondering, should I:
>> 
>> (a) deploy a Solr search engine and build individual indexers for every
>> type of web site I want to find listings on?
>> 
>> or
>> 
>> (b) build my own database to store car listings, and then build services
>> that scrape data from different sites and feed entries into the database;
>> then point my Solr search to my database, one simple source of listings?
>> 
>> My concerns are:
>> 
>> With (a) ... I have to be smart enough to understand all those different
>> data sources and remove/update listings when they change; while this be
>> harder to do with custom Solr indexers than writing something from scratch?
>> 
>> With (b) ... I'm maintaining a huge database of all my listings which seems
>> redundant; google doesn't make a *copy* of everything on the internet, it
>> just knows it's there.  Is maintaining my own database a bad design?
>> 
>> Thanks for reading!



Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Erick Erickson
I'll add that I _guarantee_ you'll want to re-index the data as you
change your schema
and the like. You'll be able to do that much more quickly if the data
is stored locally somehow.

A RDBMS is not necessary however. You could simply store the data on
disk in some format
you could re-read and send to Solr.

Best,
Erick

On Tue, Feb 21, 2017 at 5:17 PM, Dave  wrote:
> B is a better option long term. Solr is meant for retrieving flat data, fast, 
> not hierarchical. That's what a database is for and trust me you would rather 
> have a real database on the end point.  Each tool has a purpose, solr can 
> never replace a relational database, and a relational database could not 
> replace solr. Start with the slow model (database) for control/display and 
> enhance with the fast model (solr) for retrieval/search
>
>
>
>> On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
>>
>> To learn how to properly use Solr, I'm building a little experimental
>> project with it to search for used car listings.
>>
>> Car listings appear on a variety of different places ... central places
>> Craigslist and also many many individual Used Car dealership websites.
>>
>> I am wondering, should I:
>>
>> (a) deploy a Solr search engine and build individual indexers for every
>> type of web site I want to find listings on?
>>
>> or
>>
>> (b) build my own database to store car listings, and then build services
>> that scrape data from different sites and feed entries into the database;
>> then point my Solr search to my database, one simple source of listings?
>>
>> My concerns are:
>>
>> With (a) ... I have to be smart enough to understand all those different
>> data sources and remove/update listings when they change; while this be
>> harder to do with custom Solr indexers than writing something from scratch?
>>
>> With (b) ... I'm maintaining a huge database of all my listings which seems
>> redundant; google doesn't make a *copy* of everything on the internet, it
>> just knows it's there.  Is maintaining my own database a bad design?
>>
>> Thanks for reading!


Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Robert Hume
Thanks for that!  I was thinking (B) too, but wanted guidance that I'm
using the tool correctly.

Am still interested in hearing opinions from others, thanks!

rh

On Tue, Feb 21, 2017 at 8:17 PM, Dave  wrote:

> B is a better option long term. Solr is meant for retrieving flat data,
> fast, not hierarchical. That's what a database is for and trust me you
> would rather have a real database on the end point.  Each tool has a
> purpose, solr can never replace a relational database, and a relational
> database could not replace solr. Start with the slow model (database) for
> control/display and enhance with the fast model (solr) for retrieval/search
>
>
>
> > On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
> >
> > To learn how to properly use Solr, I'm building a little experimental
> > project with it to search for used car listings.
> >
> > Car listings appear on a variety of different places ... central places
> > Craigslist and also many many individual Used Car dealership websites.
> >
> > I am wondering, should I:
> >
> > (a) deploy a Solr search engine and build individual indexers for every
> > type of web site I want to find listings on?
> >
> > or
> >
> > (b) build my own database to store car listings, and then build services
> > that scrape data from different sites and feed entries into the database;
> > then point my Solr search to my database, one simple source of listings?
> >
> > My concerns are:
> >
> > With (a) ... I have to be smart enough to understand all those different
> > data sources and remove/update listings when they change; while this be
> > harder to do with custom Solr indexers than writing something from
> scratch?
> >
> > With (b) ... I'm maintaining a huge database of all my listings which
> seems
> > redundant; google doesn't make a *copy* of everything on the internet, it
> > just knows it's there.  Is maintaining my own database a bad design?
> >
> > Thanks for reading!
>


Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread David Hastings
And not to sound redundant, but if you ever need help, database programmers are
a dime a dozen; good luck finding Solr developers who are available freelance
for a price you're willing to pay. If you can do the Solr side, anyone else who
does web dev can do the SQL.

> On Feb 21, 2017, at 8:17 PM, Dave  wrote:
> 
> B is a better option long term. Solr is meant for retrieving flat data, fast, 
> not hierarchical. That's what a database is for and trust me you would rather 
> have a real database on the end point.  Each tool has a purpose, solr can 
> never replace a relational database, and a relational database could not 
> replace solr. Start with the slow model (database) for control/display and 
> enhance with the fast model (solr) for retrieval/search 
> 
> 
> 
>> On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
>> 
>> To learn how to properly use Solr, I'm building a little experimental
>> project with it to search for used car listings.
>> 
>> Car listings appear on a variety of different places ... central places
>> Craigslist and also many many individual Used Car dealership websites.
>> 
>> I am wondering, should I:
>> 
>> (a) deploy a Solr search engine and build individual indexers for every
>> type of web site I want to find listings on?
>> 
>> or
>> 
>> (b) build my own database to store car listings, and then build services
>> that scrape data from different sites and feed entries into the database;
>> then point my Solr search to my database, one simple source of listings?
>> 
>> My concerns are:
>> 
>> With (a) ... I have to be smart enough to understand all those different
>> data sources and remove/update listings when they change; while this be
>> harder to do with custom Solr indexers than writing something from scratch?
>> 
>> With (b) ... I'm maintaining a huge database of all my listings which seems
>> redundant; google doesn't make a *copy* of everything on the internet, it
>> just knows it's there.  Is maintaining my own database a bad design?
>> 
>> Thanks for reading!


Re: java.util.concurrent.TimeoutException: Idle timeout expired: 50001/50000 ms

2017-02-21 Thread Erick Erickson
Solr is very memory-intensive. 1g is still a very small heap. For any
sizeable data store people often run with at least 4G, often 8G or
more. If you facet or group or sort on fields that are _not_
docValues="true" fields you'll use up a lot of JVM memory. The
filterCache uses up maxDoc/8 bytes for every entry etc.
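
To put a number on that (assuming, say, a maxDoc of 50 million): 50,000,000 / 8
= 6.25 MB per filterCache entry, so a filterCache sized at 512 entries can by
itself pin roughly 3 GB of heap.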

I guess my point is that you shouldn't be surprised if 1G is too
small. I'd start with 4-8G and then reduce it after you get some
experience with your data and queries and how much memory they
require.

Best,
Erick

On Tue, Feb 21, 2017 at 3:07 PM, Sadheera Vithanage  wrote:
> Thanks Eric, It looked like the garbage collection was blocking the other
> processes.
>
> I updated the SOLR_JAVA_MEM="-Xms1g -Xmx4g" as it was the default before
> and looked like the garbage collection was triggered too frequent.
>
> Lets see how it goes now.
>
> Thanks again for the support.
>
> On Mon, Feb 20, 2017 at 11:50 AM, Erick Erickson 
> wrote:
>
>> The first place to look for something like his is garbage collection.
>> Are you hitting any really long stop-the-world GC pauses?
>>
>> Best,
>> Erick
>>
>> On Sun, Feb 19, 2017 at 2:21 PM, Sadheera Vithanage 
>> wrote:
>> > Hi Experts,
>> >
>> > I have a solr cloud node (Just 1 node for now with a zookeeper running on
>> > the same machine) running on ubuntu and It has been running without
>> issues
>> > for a while.
>> >
>> > This morning I noticed below error in the error log.
>> >
>> >
>> > *2017-02-19 20:27:54.724 ERROR (qtp97730845-4968) [   ]
>> > o.a.s.s.HttpSolrCall null:java.io.IOException:
>> > java.util.concurrent.TimeoutException: Idle timeout expired:
>> 50001/5 ms*
>> > * at
>> > org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(
>> SharedBlockingCallback.java:226)*
>> > * at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:164)*
>> > * at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:530)*
>> > * at
>> > org.apache.commons.io.output.ProxyOutputStream.write(
>> ProxyOutputStream.java:55)*
>> > * at
>> > org.apache.solr.response.QueryResponseWriterUtil$1.
>> write(QueryResponseWriterUtil.java:54)*
>> > * at java.io.OutputStream.write(OutputStream.java:116)*
>> > * at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)*
>> > * at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)*
>> > * at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)*
>> > * at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)*
>> > * at org.apache.solr.util.FastWriter.flush(FastWriter.java:140)*
>> > * at org.apache.solr.util.FastWriter.write(FastWriter.java:54)*
>> > * at
>> > org.apache.solr.response.JSONWriter.writeStr(
>> JSONResponseWriter.java:454)*
>> > * at
>> > org.apache.solr.response.TextResponseWriter.writeVal(
>> TextResponseWriter.java:128)*
>> > * at
>> > org.apache.solr.response.JSONWriter.writeSolrDocument(
>> JSONResponseWriter.java:346)*
>> > * at
>> > org.apache.solr.response.TextResponseWriter.writeSolrDocumentList(
>> TextResponseWriter.java:239)*
>> > * at
>> > org.apache.solr.response.TextResponseWriter.writeVal(
>> TextResponseWriter.java:163)*
>> > * at
>> > org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(
>> JSONResponseWriter.java:184)*
>> > * at
>> > org.apache.solr.response.JSONWriter.writeNamedList(
>> JSONResponseWriter.java:300)*
>> > * at
>> > org.apache.solr.response.JSONWriter.writeResponse(
>> JSONResponseWriter.java:96)*
>> > * at
>> > org.apache.solr.response.JSONResponseWriter.write(
>> JSONResponseWriter.java:55)*
>> > * at
>> > org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(
>> QueryResponseWriterUtil.java:65)*
>> > * at
>> > org.apache.solr.servlet.HttpSolrCall.writeResponse(
>> HttpSolrCall.java:728)*
>> > * at
>> > org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(
>> HttpSolrCall.java:667)*
>> > * at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:441)*
>> > * at
>> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(
>> SolrDispatchFilter.java:303)*
>> > * at
>> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(
>> SolrDispatchFilter.java:254)*
>> > * at
>> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.
>> doFilter(ServletHandler.java:1668)*
>> > * at
>> > org.eclipse.jetty.servlet.ServletHandler.doHandle(
>> ServletHandler.java:581)*
>> > * at
>> > org.eclipse.jetty.server.handler.ScopedHandler.handle(
>> ScopedHandler.java:143)*
>> > * at
>> > org.eclipse.jetty.security.SecurityHandler.handle(
>> SecurityHandler.java:548)*
>> > * at
>> > org.eclipse.jetty.server.session.SessionHandler.
>> doHandle(SessionHandler.java:226)*
>> > * at
>> > org.eclipse.jetty.server.handler.ContextHandler.
>> doHandle(ContextHandler.java:1160)*
>> > * at
>> > org.eclipse.jetty.servlet.ServletHandler.doScope(
>> ServletHandler.java:511)*
>> > * at
>> > org.eclipse.jetty.server.session.SessionHandler.
>> doScope(SessionHandler.java:185)*
>> > * at

Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Dave
B is a better option long term. Solr is meant for retrieving flat data, fast, 
not hierarchical. That's what a database is for and trust me you would rather 
have a real database on the end point.  Each tool has a purpose, solr can never 
replace a relational database, and a relational database could not replace 
solr. Start with the slow model (database) for control/display and enhance with 
the fast model (solr) for retrieval/search 



> On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
> 
> To learn how to properly use Solr, I'm building a little experimental
> project with it to search for used car listings.
> 
> Car listings appear on a variety of different places ... central places
> Craigslist and also many many individual Used Car dealership websites.
> 
> I am wondering, should I:
> 
> (a) deploy a Solr search engine and build individual indexers for every
> type of web site I want to find listings on?
> 
> or
> 
> (b) build my own database to store car listings, and then build services
> that scrape data from different sites and feed entries into the database;
> then point my Solr search to my database, one simple source of listings?
> 
> My concerns are:
> 
> With (a) ... I have to be smart enough to understand all those different
> data sources and remove/update listings when they change; while this be
> harder to do with custom Solr indexers than writing something from scratch?
> 
> With (b) ... I'm maintaining a huge database of all my listings which seems
> redundant; google doesn't make a *copy* of everything on the internet, it
> just knows it's there.  Is maintaining my own database a bad design?
> 
> Thanks for reading!


Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Robert Hume
To learn how to properly use Solr, I'm building a little experimental
project with it to search for used car listings.

Car listings appear in a variety of different places ... central places like
Craigslist and also many, many individual used-car dealership websites.

I am wondering, should I:

(a) deploy a Solr search engine and build individual indexers for every
type of web site I want to find listings on?

or

(b) build my own database to store car listings, and then build services
that scrape data from different sites and feed entries into the database;
then point my Solr search to my database, one simple source of listings?

My concerns are:

With (a) ... I have to be smart enough to understand all those different
data sources and remove/update listings when they change; will this be
harder to do with custom Solr indexers than writing something from scratch?

With (b) ... I'm maintaining a huge database of all my listings which seems
redundant; google doesn't make a *copy* of everything on the internet, it
just knows it's there.  Is maintaining my own database a bad design?

Thanks for reading!


Re: CPU Intensive Scoring Alternatives

2017-02-21 Thread Fuad Efendi
Walter, I use BM25, which is the default for Solr 6.3, and in the Solr logs I
clearly saw a correlation between the number of hits and response times; it is
almost linear, even on an underloaded system.

With “solrmeter” at 10 requests per second, CPU goes to 400% on a 12-core
hyperthreaded machine, and with 20 requests per second it goes to 1100%.
No issues with GC. Java 8u121 from Oracle, 64-bit. 20 requests per second
(to Solr 6) — kidding? I never expected that for the simplest queries.

Doug, I have never been able to make the “mm” parameter work for me; I cannot
understand how it works. I use eDisMax and a few “text_general” fields, with
the default Solr operator “OR” and the default “mm” (which should be “1” for
“OR”).
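
For what it's worth, a bare-bones version of the AND-first/OR-fallback idea Doug
describes below might look like this in SolrJ (collection and field names are
placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AndFirstSearch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {

            SolrQuery q = new SolrQuery("Michael The Jackson");
            q.set("defType", "edismax");
            q.set("qf", "title description");  // whatever your text_general-typed fields are
            q.set("mm", "100%");               // first pass: require every clause (effectively AND)
            QueryResponse rsp = solr.query(q);

            if (rsp.getResults().getNumFound() == 0) {
                q.set("mm", "1");              // fallback: loosen to an OR-like match
                rsp = solr.query(q);
            }
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}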




From: Walter Underwood  
Reply: solr-user@lucene.apache.org 

Date: February 21, 2017 at 5:24:23 PM
To: solr-user@lucene.apache.org 

Subject:  Re: CPU Intensive Scoring Alternatives

300 ms seems pretty good for 200 million documents. Is that average?
Median? 95th percentile?

Why are you sure it is because the huge number of hits? That would be
unusual. The size of the posting lists is a more common cause.

Why do you think it is caused by tf.idf? That should be faster than BM25.

Does host have enough RAM to hold most or all of the index in file buffers?

What are the hit rates on your caches?

Are you using fuzzy matches? N-gram prefix matching? Phrase matching?
Shingles?

What version of Java are you running? What garbage collector?

wunder
Walter Underwood
wun...@wunderwood.org 
http://observer.wunderwood.org/ (my blog)


> On Feb 21, 2017, at 10:42 AM, Doug Turnbull <
dturnb...@opensourceconnections.com > wrote:
>
> With that many documents, why not start with an AND search and reissue an
> OR query if there's no results? My strategy is to prefer an AND for large
> collections (or a higher mm than 1) and prefer closer to an OR for
smaller
> collections.
>
> -Doug
>
> On Tue, Feb 21, 2017 at 1:39 PM Fuad Efendi  wrote:
>
>> Thank you Ahmet, I will try it; sounds reasonable
>>
>>
>> From: Ahmet Arslan  
>> Reply: solr-user@lucene.apache.org  <
solr-user@lucene.apache.org >
>> >,
Ahmet Arslan >
>> >
>> Date: February 21, 2017 at 3:02:11 AM
>> To: solr-user@lucene.apache.org  <
solr-user@lucene.apache.org >
>> >
>> Subject: Re: CPU Intensive Scoring Alternatives
>>
>> Hi,
>>
>> New default similarity is BM25.
>> May be explicitly set similarity to tf-idf and see how it goes?
>>
>> Ahmet
>>
>>
>> On Tuesday, February 21, 2017 4:28 AM, Fuad Efendi > wrote:
>> Hello,
>>
>>
>> Default TF-IDF performs poorly with the indexed 200 millions documents.
>> Query "Michael Jackson" may run 300ms, and "Michael The Jackson" over 3
>> seconds. eDisMax. Because default operator "OR" and stopword "The" we
have
>> 50-70 millions documents as a query result, and scoring is CPU
intensive.
>> What to do? Our typical queries return over million documents, and
response
>> times of simple queries ranges from 50 milliseconds to 5-10 seconds
>> depending on result set.
>>
>> This was just an exaggerated example with stopword “the”, but even
simplest
>> query “Michael Jackson” runs 300ms instead of 3ms just because huge
number
>> of hits and TF-IDF calculations. Solr 6.3.
>>
>>
>> Thanks,
>>
>> --
>>
>> Fuad Efendi
>>
>> (416) 993-2060
>>
>> http://www.tokenizer.ca 
>> Search Relevancy, Recommender Systems
>>


Re: java.util.concurrent.TimeoutException: Idle timeout expired: 50001/50000 ms

2017-02-21 Thread Sadheera Vithanage
Thanks Erick, it looked like the garbage collection was blocking the other
processes.

I updated SOLR_JAVA_MEM to "-Xms1g -Xmx4g", as it was at the default before,
and it looked like garbage collection was being triggered too frequently.

Let's see how it goes now.

Thanks again for the support.

On Mon, Feb 20, 2017 at 11:50 AM, Erick Erickson 
wrote:

> The first place to look for something like his is garbage collection.
> Are you hitting any really long stop-the-world GC pauses?
>
> Best,
> Erick
>
> On Sun, Feb 19, 2017 at 2:21 PM, Sadheera Vithanage 
> wrote:
> > Hi Experts,
> >
> > I have a solr cloud node (Just 1 node for now with a zookeeper running on
> > the same machine) running on ubuntu and It has been running without
> issues
> > for a while.
> >
> > This morning I noticed below error in the error log.
> >
> >
> > *2017-02-19 20:27:54.724 ERROR (qtp97730845-4968) [   ]
> > o.a.s.s.HttpSolrCall null:java.io.IOException:
> > java.util.concurrent.TimeoutException: Idle timeout expired:
> 50001/50000 ms*
> > * at
> > org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(
> SharedBlockingCallback.java:226)*
> > * at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:164)*
> > * at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:530)*
> > * at
> > org.apache.commons.io.output.ProxyOutputStream.write(
> ProxyOutputStream.java:55)*
> > * at
> > org.apache.solr.response.QueryResponseWriterUtil$1.
> write(QueryResponseWriterUtil.java:54)*
> > * at java.io.OutputStream.write(OutputStream.java:116)*
> > * at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)*
> > * at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)*
> > * at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)*
> > * at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)*
> > * at org.apache.solr.util.FastWriter.flush(FastWriter.java:140)*
> > * at org.apache.solr.util.FastWriter.write(FastWriter.java:54)*
> > * at
> > org.apache.solr.response.JSONWriter.writeStr(
> JSONResponseWriter.java:454)*
> > * at
> > org.apache.solr.response.TextResponseWriter.writeVal(
> TextResponseWriter.java:128)*
> > * at
> > org.apache.solr.response.JSONWriter.writeSolrDocument(
> JSONResponseWriter.java:346)*
> > * at
> > org.apache.solr.response.TextResponseWriter.writeSolrDocumentList(
> TextResponseWriter.java:239)*
> > * at
> > org.apache.solr.response.TextResponseWriter.writeVal(
> TextResponseWriter.java:163)*
> > * at
> > org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(
> JSONResponseWriter.java:184)*
> > * at
> > org.apache.solr.response.JSONWriter.writeNamedList(
> JSONResponseWriter.java:300)*
> > * at
> > org.apache.solr.response.JSONWriter.writeResponse(
> JSONResponseWriter.java:96)*
> > * at
> > org.apache.solr.response.JSONResponseWriter.write(
> JSONResponseWriter.java:55)*
> > * at
> > org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(
> QueryResponseWriterUtil.java:65)*
> > * at
> > org.apache.solr.servlet.HttpSolrCall.writeResponse(
> HttpSolrCall.java:728)*
> > * at
> > org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(
> HttpSolrCall.java:667)*
> > * at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:441)*
> > * at
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> SolrDispatchFilter.java:303)*
> > * at
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> SolrDispatchFilter.java:254)*
> > * at
> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.
> doFilter(ServletHandler.java:1668)*
> > * at
> > org.eclipse.jetty.servlet.ServletHandler.doHandle(
> ServletHandler.java:581)*
> > * at
> > org.eclipse.jetty.server.handler.ScopedHandler.handle(
> ScopedHandler.java:143)*
> > * at
> > org.eclipse.jetty.security.SecurityHandler.handle(
> SecurityHandler.java:548)*
> > * at
> > org.eclipse.jetty.server.session.SessionHandler.
> doHandle(SessionHandler.java:226)*
> > * at
> > org.eclipse.jetty.server.handler.ContextHandler.
> doHandle(ContextHandler.java:1160)*
> > * at
> > org.eclipse.jetty.servlet.ServletHandler.doScope(
> ServletHandler.java:511)*
> > * at
> > org.eclipse.jetty.server.session.SessionHandler.
> doScope(SessionHandler.java:185)*
> > * at
> > org.eclipse.jetty.server.handler.ContextHandler.
> doScope(ContextHandler.java:1092)*
> > * at
> > org.eclipse.jetty.server.handler.ScopedHandler.handle(
> ScopedHandler.java:141)*
> > * at
> > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(
> ContextHandlerCollection.java:213)*
> > * at
> > org.eclipse.jetty.server.handler.HandlerCollection.
> handle(HandlerCollection.java:119)*
> > * at
> > org.eclipse.jetty.server.handler.HandlerWrapper.handle(
> HandlerWrapper.java:134)*
> > * at org.eclipse.jetty.server.Server.handle(Server.java:518)*
> > * at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)*
> > * at
> > org.eclipse.jetty.server.HttpConnection.onFillable(
> HttpConnection.java:244)*
> > * at

How to figure out whether stopwords are being indexed or not

2017-02-21 Thread Pratik Patel
I have a field type in the schema which has a stopwords list applied to it.
I have verified that the path of the stopwords file is correct and that it is
being loaded fine in the Solr admin UI. When I analyse these fields using the
"Analysis" tab of the Solr admin UI, I can see that stopwords are being
filtered out. However, when I query with some of these stopwords, I do get
results back, which makes me think that the stopwords are probably being indexed.

For example, when I run the following query, I do get results back. I have the
word "and" in the stopwords list, so I expect no results for this query.

http://localhost:8081/solr/collection1/select?fq=Description_note:*%20and%20*&indent=on&q=*:*&rows=100&start=0&wt=json

Does this mean that the "and" word is being indexed and stopwords are not
being used?
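
One way to check directly whether "and" actually made it into the index for this field is a prefix facet on the indexed terms (just a sketch, reusing the core and field from the query above):

http://localhost:8081/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=Description_note&facet.prefix=and&facet.limit=5&wt=json

If the term "and" comes back with a non-zero count, it really was indexed.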

Following is the field type of field Description_note :



  
  






  
  
  






  



Re: CPU Intensive Scoring Alternatives

2017-02-21 Thread Walter Underwood
300 ms seems pretty good for 200 million documents. Is that average? Median? 
95th percentile?

Why are you sure it is because the huge number of hits? That would be unusual. 
The size of the posting lists is a more common cause.

Why do you think it is caused by tf.idf? That should be faster than BM25.

Does host have enough RAM to hold most or all of the index in file buffers?

What are the hit rates on your caches?

Are you using fuzzy matches? N-gram prefix matching? Phrase matching? Shingles?

What version of Java are you running? What garbage collector?

wunder
Walter Underwood
wun...@wunderwood.org 
http://observer.wunderwood.org/  (my blog)


> On Feb 21, 2017, at 10:42 AM, Doug Turnbull 
>  > wrote:
> 
> With that many documents, why not start with an AND search and reissue an
> OR query if there's no results? My strategy is to prefer an AND for large
> collections (or a higher mm than 1) and prefer closer to an OR for smaller
> collections.
> 
> -Doug
> 
> On Tue, Feb 21, 2017 at 1:39 PM Fuad Efendi  > wrote:
> 
>> Thank you Ahmet, I will try it; sounds reasonable
>> 
>> 
>> From: Ahmet Arslan > > > >
>> Reply: solr-user@lucene.apache.org  
>> >
>> >, Ahmet 
>> Arslan >
>> >
>> Date: February 21, 2017 at 3:02:11 AM
>> To: solr-user@lucene.apache.org  
>> >
>> >
>> Subject:  Re: CPU Intensive Scoring Alternatives
>> 
>> Hi,
>> 
>> New default similarity is BM25.
>> May be explicitly set similarity to tf-idf and see how it goes?
>> 
>> Ahmet
>> 
>> 
>> On Tuesday, February 21, 2017 4:28 AM, Fuad Efendi > > wrote:
>> Hello,
>> 
>> 
>> Default TF-IDF performs poorly with the indexed 200 millions documents.
>> Query "Michael Jackson" may run 300ms, and "Michael The Jackson" over 3
>> seconds. eDisMax. Because default operator "OR" and stopword "The" we have
>> 50-70 millions documents as a query result, and scoring is CPU intensive.
>> What to do? Our typical queries return over million documents, and response
>> times of simple queries ranges from 50 milliseconds to 5-10 seconds
>> depending on result set.
>> 
>> This was just an exaggerated example with stopword “the”, but even simplest
>> query “Michael Jackson” runs 300ms instead of 3ms just because huge number
>> of hits and TF-IDF calculations. Solr 6.3.
>> 
>> 
>> Thanks,
>> 
>> --
>> 
>> Fuad Efendi
>> 
>> (416) 993-2060
>> 
>> http://www.tokenizer.ca 
>> Search Relevancy, Recommender Systems
>> 



Re: Arabic words search in solr

2017-02-21 Thread Steve Rowe
Hi Mohan,

It looks to me like the example query should match, since the analyzed query 
terms look like a subset of the analyzed document terms.

Did you re-index your documents after you changed your schema?  If not, then 
the indexed documents won’t have the same terms as the ones you see on the 
Admin UI Analysis pane.

If you have re-indexed, and are still not getting matches you expect, please 
include textual examples of the remaining problems, so that I can copy/paste to 
reproduce the problem - I can’t copy/paste Arabic from images you pointed to.

--
Steve
www.lucidworks.com

> On Feb 21, 2017, at 1:28 AM, mohanmca01  wrote:
> 
> Hi Steve,
> 
> I changed the ICU folding filter order and re-indexed the entire Arabic content. But
> the problem is still present. I am not able to get the expected result.
> 
> I attached screen shot for your references.
>  
>  
>  
> 
> Kindly check and let me know.
> 
> Thanks
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4321397.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-02-21 Thread Hendrik Haddorp

Hi,

I had opened SOLR-10092 
(https://issues.apache.org/jira/browse/SOLR-10092) for this a while ago. 
I was now able to get this feature working with a very small code change. 
After a few seconds Solr reassigns the replica to a different Solr 
instance as long as one replica is still up. Not really sure why one 
replica needs to be up though. I added the patch based on Solr 6.3 to 
the bug report. Would be great if it could be merged soon.
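
For anyone trying to reproduce this: the behaviour under discussion applies to collections that have autoAddReplicas enabled; a minimal sketch of such a CREATE call (host, collection name, config name and sizing are placeholders):

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=test1.collection-0&collection.configName=myconf&numShards=1&replicationFactor=3&autoAddReplicas=true'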


regards,
Hendrik

On 19.01.2017 17:08, Hendrik Haddorp wrote:
HDFS is like a shared filesystem so every Solr Cloud instance can 
access the data using the same path or URL. The clusterstate.json 
looks like this:


"shards":{"shard1":{
"range":"8000-7fff",
"state":"active",
"replicas":{
  "core_node1":{
"core":"test1.collection-0_shard1_replica1",
"dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
"base_url":"http://slave3:9000/solr;,
"node_name":"slave3:9000_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"}, 


  "core_node2":{
"core":"test1.collection-0_shard1_replica2",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
"base_url":"http://slave2:9000/solr;,
"node_name":"slave2:9000_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog", 


"leader":"true"},
  "core_node3":{
"core":"test1.collection-0_shard1_replica3",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
"base_url":"http://slave4:9005/solr;,
"node_name":"slave4:9005_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog" 



So every replica is always assigned to one node and this is being 
stored in ZK, pretty much the same as for none HDFS setups. Just as 
the data is not stored locally but on the network and as the path does 
not contain any node information you can of course easily take over 
the work to a different Solr node. You should just need to update the 
owner of the replica in ZK and you should basically be done, I assume. 
That's why the documentation states that an advantage of using HDFS is 
that a failing node can be replaced by a different one. The Overseer 
just has to move the ownership of the replica, which seems like what 
the code is trying to do. There just seems to be a bug in the code so 
that the core does not get created on the target node.


Each data directory also contains a lock file. The documentation 
states that one should use the HdfsLockFactory, which unfortunately 
can easily lead to SOLR-8335, which hopefully will be fixed by 
SOLR-8169. A manual cleanup is however also easily done but seems to 
require a node restart to take effect. But I'm also only recently 
playing around with all this ;-)


regards,
Hendrik

On 19.01.2017 16:40, Shawn Heisey wrote:

On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:

Given that the data is on HDFS it shouldn't matter if any active
replica is left as the data does not need to get transferred from
another instance but the new core will just take over the existing
data. Thus a replication factor of 1 should also work just in that
case the shard would be down until the new core is up. Anyhow, it
looks like the above call is missing to set the shard id I guess or
some code is checking wrongly.

I know very little about how SolrCloud interacts with HDFS, so although
I'm reasonably certain about what comes below, I could be wrong.

I have not ever heard of SolrCloud being able to automatically take over
an existing index directory when it creates a replica, or even share
index directories unless the admin fools it into doing so without its
knowledge.  Sharing an index directory for replicas with SolrCloud would
NOT work correctly.  Solr must be able to update all replicas
independently, which means that each of them will lock its index
directory and write to it.

It is my understanding (from reading messages on mailing lists) that
when using HDFS, Solr replicas are all separate and consume additional
disk space, just like on a regular filesystem.

I found the code that generates the "No shard id" exception, but my
knowledge of how the zookeeper code in Solr works is not deep enough to
understand what it means or how to fix it.

Thanks,
Shawn







Re: Fwd: Solr dynamic field blowing up the index size

2017-02-21 Thread Pratik Patel
I think I have found something concrete. Reading up more on the .nvd file
extension, I found that it is used to store length and boost factors for
documents and fields. These are normalization files. Normalization on a field
is controlled by the omitNorms attribute; if omitNorms=true then the field
will not be normalized. I explicitly added omitNorms=true to the field type
text_general and re-indexed the data. Now my index size is much smaller. I
haven't verified this with the complete data set yet, but I can see that the
index size is reduced. We have a large data set and it takes about 5-6 hours
to index it completely, so I'll index the whole data set overnight to confirm
the fix.
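
For reference, here is a sketch of what that change looks like on the type (the analyzer chain below is trimmed down for illustration, not the full one from our schema):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>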

But now I am curious about the omitNorms attribute. What is the default value
of omitNorms for the field type "text_general"? The documentation says that
omitNorms=true for primitive field types like string, int etc., but I don't
know what the default value is for "text_general".

I never had omitNorms set explicitly on the text_general field type or on any
of the fields of that type. Has the default value of omitNorms been changed
from Solr 5.0.0 to 6.4.1?

Any clarification on this would be really helpful.

I am posting some relevant links here for someone who might face similar
issue in future.

http://apprize.info/php/solr_4/2.html
http://stackoverflow.com/questions/18694242/what-is-omitnorms-and-version-field-in-solr-schema
https://lucidworks.com/2009/09/02/scaling-lucene-and-solr/#d0e71

Thanks,
Pratik

On Tue, Feb 21, 2017 at 12:03 PM, Pratik Patel  wrote:

> I am using the schema from solr 5 which does not have any field with
> docValues enabled.In fact to ensure that everything is same as solr 5
> (except the breaking changes) I am using the solrconfig.xml also from solr
> 5 with schemaFactory set as classicSchemaFactory to use schema.xml from
> solr 5.
>
>
> On Tue, Feb 21, 2017 at 11:33 AM, Alexandre Rafalovitch <
> arafa...@gmail.com> wrote:
>
>> Did you reuse the schema or rebuilt it on top of the latest examples?
>> Because the latest example schema enabled docValues for strings on the
>> fieldType level.
>>
>> I would do a diff of the schemas to see what changed. If they look
>> very different and you are looking for tools to normalize/extract
>> elements from schemas, you may find my latest Revolution presentation
>> useful for that:
>> https://www.slideshare.net/arafalov/rebuilding-solr-6-exampl
>> es-layer-by-layer-lucenesolrrevolution-2016
>> (e.g. slide 20). There is also the video there at the end.
>>
>> Regards,
>>Alex.
>> 
>> http://www.solr-start.com/ - Resources for Solr users, new and
>> experienced
>>
>>
>> On 21 February 2017 at 11:18, Mike Thomsen 
>> wrote:
>> > Correct me if I'm wrong, but heavy use of doc values should actually
>> blow
>> > up the size of your index considerably if they are in fields that get
>> sent
>> > a lot of data.
>> >
>> > On Tue, Feb 21, 2017 at 10:50 AM, Pratik Patel 
>> wrote:
>> >
>> >> Thanks for the reply. I can see that in solr 6, more than 50% of the
>> index
>> >> directory is occupied by ".nvd" file extension. It is something
>> related to
>> >> norms and doc values.
>> >>
>> >> On Tue, Feb 21, 2017 at 10:27 AM, Alexandre Rafalovitch <
>> >> arafa...@gmail.com>
>> >> wrote:
>> >>
>> >> > Did you look in the data directories to check what index file
>> extensions
>> >> > contribute most to the difference? That could give a hint.
>> >> >
>> >> > Regards,
>> >> > Alex
>> >> >
>> >> > On 21 Feb 2017 9:47 AM, "Pratik Patel"  wrote:
>> >> >
>> >> > > Here is the same question in stackOverflow for better format.
>> >> > >
>> >> > > http://stackoverflow.com/questions/42370231/solr-
>> >> > > dynamic-field-blowing-up-the-index-size
>> >> > >
>> >> > > Recently, I upgraded from solr 5.0 to solr 6.4.1. I can run my app
>> fine
>> >> > but
>> >> > > the problem is that index size with solr 6 is way too large. In
>> solr 5,
>> >> > > index size was about 15GB and in solr 6, for the same data, the
>> index
>> >> > size
>> >> > > is 300GB! I am not able to understand what contributes to such huge
>> >> > > difference in solr 6.
>> >> > >
>> >> > > I have been able to identify a field which is blowing up the size
>> of
>> >> > index.
>> >> > > It is as follows.
>> >> > >
>> >> > > > >> > > stored="true" multiValued="true"  />
>> >> > >
>> >> > > > >> > > stored="false" multiValued="true"  />
>> >> > > 
>> >> > >
>> >> > > When this field is commented out, the index size reduces to less
>> than
>> >> > 10GB.
>> >> > >
>> >> > > This field is of type text_general. Following is the definition of
>> this
>> >> > > type.
>> >> > >
>> >> > > > >> > > positionIncrementGap="100">
>> >> > >   
>> >> > > 
>> >> > > 
>> >> > > 
>> >> > > > >> > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> >> > > > >> > > protected="protwords.txt" 

Re: Facet query - exlude main query

2017-02-21 Thread Jacques du Rand
Oh right right ! Sorry late night :)
Thank You Chris


On 21 February 2017 at 20:30, Chris Hostetter 
wrote:

>
> : Maybe I'm doing something wrong ?
> : /select?q.op=OR=2={!tag=mq}nissan=name%20name_
> raw=json=0=20={!ex=tag_mq}feature_s_1_make
>
> that url still contains "ex=tag_mq" .. which is looking for a query with a
> tag named "tag_mq" .. in your q param you are using a tag named "mq"
>
> Use "facet.field={!ex=mq}feature_s_1_make" as i mentioned before
>
>
> >>> You are attempting to exclude tags named "tag_make", "tag_model", and
> >>> "tag_mq" -- but the name of the tag you are using in the query is "mq"
> >>>
> >>> if you include "mq" in the list of excluded tags (the "ex" local
> >>> param) it should work fine...
> >>>
> >>> facet.field={!ex=mq}feature_s_1_make"
>
>
>
>
> -Hoss
> http://www.lucidworks.com/
>



-- 
Jacques du Rand
Senior R  Programmer

T: +27214688017
F: +27862160617
E: jacq...@pricecheck.co.za



Re: CPU Intensive Scoring Alternatives

2017-02-21 Thread Doug Turnbull
With that many documents, why not start with an AND search and reissue an
OR query if there's no results? My strategy is to prefer an AND for large
collections (or a higher mm than 1) and prefer closer to an OR for smaller
collections.
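
As a sketch, that two-pass idea can be driven purely by edismax's mm parameter (host, collection and field below are placeholders):

# pass 1: require every term (behaves like AND)
curl 'http://localhost:8983/solr/collection1/select?defType=edismax&qf=name&q=michael+jackson&mm=100%25&rows=10&wt=json'

# pass 2: only if pass 1 returns numFound == 0, relax to match-any (behaves like OR)
curl 'http://localhost:8983/solr/collection1/select?defType=edismax&qf=name&q=michael+jackson&mm=1&rows=10&wt=json'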

-Doug

On Tue, Feb 21, 2017 at 1:39 PM Fuad Efendi  wrote:

> Thank you Ahmet, I will try it; sounds reasonable
>
>
> From: Ahmet Arslan  
> Reply: solr-user@lucene.apache.org 
> , Ahmet Arslan 
> 
> Date: February 21, 2017 at 3:02:11 AM
> To: solr-user@lucene.apache.org 
> 
> Subject:  Re: CPU Intensive Scoring Alternatives
>
> Hi,
>
> New default similarity is BM25.
> May be explicitly set similarity to tf-idf and see how it goes?
>
> Ahmet
>
>
> On Tuesday, February 21, 2017 4:28 AM, Fuad Efendi  wrote:
> Hello,
>
>
> Default TF-IDF performs poorly with the indexed 200 millions documents.
> Query "Michael Jackson" may run 300ms, and "Michael The Jackson" over 3
> seconds. eDisMax. Because default operator "OR" and stopword "The" we have
> 50-70 millions documents as a query result, and scoring is CPU intensive.
> What to do? Our typical queries return over million documents, and response
> times of simple queries ranges from 50 milliseconds to 5-10 seconds
> depending on result set.
>
> This was just an exaggerated example with stopword “the”, but even simplest
> query “Michael Jackson” runs 300ms instead of 3ms just because huge number
> of hits and TF-IDF calculations. Solr 6.3.
>
>
> Thanks,
>
> --
>
> Fuad Efendi
>
> (416) 993-2060
>
> http://www.tokenizer.ca
> Search Relevancy, Recommender Systems
>


Re: CPU Intensive Scoring Alternatives

2017-02-21 Thread Fuad Efendi
Thank you Ahmet, I will try it; sounds reasonable


From: Ahmet Arslan  
Reply: solr-user@lucene.apache.org 
, Ahmet Arslan 

Date: February 21, 2017 at 3:02:11 AM
To: solr-user@lucene.apache.org 

Subject:  Re: CPU Intensive Scoring Alternatives

Hi,

New default similarity is BM25.
May be explicitly set similarity to tf-idf and see how it goes?

Ahmet


On Tuesday, February 21, 2017 4:28 AM, Fuad Efendi  wrote:
Hello,


Default TF-IDF performs poorly with the indexed 200 millions documents.
Query "Michael Jackson" may run 300ms, and "Michael The Jackson" over 3
seconds. eDisMax. Because default operator "OR" and stopword "The" we have
50-70 millions documents as a query result, and scoring is CPU intensive.
What to do? Our typical queries return over million documents, and response
times of simple queries ranges from 50 milliseconds to 5-10 seconds
depending on result set.

This was just an exaggerated example with stopword “the”, but even simplest
query “Michael Jackson” runs 300ms instead of 3ms just because huge number
of hits and TF-IDF calculations. Solr 6.3.


Thanks,

-- 

Fuad Efendi

(416) 993-2060

http://www.tokenizer.ca
Search Relevancy, Recommender Systems


Re: Facet query - exlude main query

2017-02-21 Thread Chris Hostetter

: Maybe I'm doing something wrong ?
: 
/select?q.op=OR&mm=2&q={!tag=mq}nissan&qf=name%20name_raw&wt=json&start=0&rows=20&facet.field={!ex=tag_mq}feature_s_1_make

that url still contains "ex=tag_mq" .. which is looking for a query with a 
tag named "tag_mq" .. in your q param you are using a tag named "mq"

Use "facet.field={!ex=mq}feature_s_1_make" as i mentioned before


>>> You are attempting to exclude tags named "tag_make", "tag_model", and 
>>> "tag_mq" -- but the name of the tag you are using in the query is "mq"
>>>
>>> if you include "mq" in the list of excluded tags (the "ex" local 
>>> param) it should work fine...
>>> 
>>> facet.field={!ex=mq}feature_s_1_make"




-Hoss
http://www.lucidworks.com/


Re: Facet query - exlude main query

2017-02-21 Thread Jacques du Rand
Maybe I'm doing something wrong ?
/select?q.op=OR&mm=2&q={!tag=mq}nissan&qf=name%20name_raw&wt=json&start=0&rows=20&facet.field={!ex=tag_mq}feature_s_1_make

Still only getting ONE facet value ?


{
status: 0,
QTime: 5,
params: {
mm: "2",
facet.field: [
"{!ex=tag_mq}feature_s_1_make",
"{!ex=tag_model}feature_s_2_model"
],
qs: "5",
components: [
"query",
"stats",
"facet",
"debug"
],
df: "text",
ps: "5",
echoParams: "all",
start: "0",
q.op: "OR",
rows: "20",
q: "{!tag=mq}nissan",
tie: "0.1",
facet.limit: "200",
defType: "edismax",
qf: "name name_raw",
facet.method: "fcs",
facet.threads: "6",
facet.mincount: "1",
pf3: "",
wt: "json",
facet: "true",
tr: "example.xsl"
}
},
response: {
numFound: 55,
start: 0,
docs: []
},
facet_counts: {
facet_queries: { },
facet_fields: {
feature_s_1_make: [
"nissan",
55
],

On 21 February 2017 at 19:55, Jacques du Rand 
wrote:

> Sure..
>
> So this is for a car search-application
>
>
> If i change:
>
> q={!tag=mq}nissan
>
> to
>
> q=*:*
>
> I get all the makes.
>
> VAR ECHO:
>
> responseHeader: {
> status: 0,
> QTime: 4,
> params: {
> mm: "2",
> facet.field: [
> "{!ex=tag_make,tag_model,tag_mq}feature_s_1_make",
> "{!ex=tag_model}feature_s_2_model"
>
> ],
> qs: "5",
> components: [
> "query",
> "stats",
> "facet",
> "debug"
> ],
> df: "text",
> ps: "5",
> echoParams: "all",
> start: "0",
> q.op: "OR",
> rows: "20",
> q: "{!tag=mq}nissan",
> tie: "0.1",
> facet.limit: "200",
> defType: "edismax",
> qf: "name name_raw ",
> facet.method: "fcs",
> facet.threads: "6",
> facet.mincount: "1",
> pf3: "",
> wt: "json",
> facet: "true",
> tr: "example.xsl"
>
>
> --->RESPONSE - expecting ALL car makes...
> facet_fields: {
> feature_s_1_make: [
> "nissan",
> 55
> ],
>
>
> On 21 February 2017 at 19:36, Chris Hostetter 
> wrote:
>
>>
>> :  Solr3.1  Starting with Solr
>> 3.1, the
>>
>> : primary relevance query (i.e. the one normally specified by the *q*
>> parameter)
>> : may also be excluded.
>> :
>> : But doesnt show me how to exlude it ??
>>
>> same tag + ex local params, as you had in your example...
>>
>> : 2. q={!tag=mq}blah={!ex=tag_man,mq}manufacturer
>> ...
>> : But  I still only get the facets WITH the main query  of "blah"
>>
>> Can you give us a more more complete picture of what your data & full
>> request & response looks like? ... it's working fine for me with Solr 6
>> and the techproducts sample data (see below).
>>
>> https://wiki.apache.org/solr/UsingMailingLists
>>
>>
>> $ curl 'http://localhost:8983/solr/techproducts/query?rows=0=%7B
>> !tag=y%7dinStock:true=%7B!tag=x%7dname:ipod=true
>> et.field=inStock=%7b!key=exclude_x+ex=x%7dinStoc
>> k=%7b!key=exclude_xy+ex=x,y%7dinStock'
>> {
>>   "responseHeader":{
>> "status":0,
>> "QTime":2,
>> "params":{
>>   "q":"{!tag=x}name:ipod",
>>   "facet.field":["inStock",
>> "{!key=exclude_x ex=x}inStock",
>> "{!key=exclude_xy ex=x,y}inStock"],
>>   "fq":"{!tag=y}inStock:true",
>>   "rows":"0",
>>   "facet":"true"}},
>>   "response":{"numFound":1,"start":0,"docs":[]
>>   },
>>   "facet_counts":{
>> "facet_queries":{},
>> "facet_fields":{
>>   "inStock":[
>> "true",1,
>> "false",0],
>>   "exclude_x":[
>> "true",17,
>> "false",0],
>>   "exclude_xy":[
>> "true",17,
>> "false",4]},
>> "facet_ranges":{},
>> "facet_intervals":{},
>> "facet_heatmaps":{}}}
>>
>>
>>
>>
>>
>> -Hoss
>> http://www.lucidworks.com/
>>
>
>
>
> --
> Jacques du Rand
> Senior R  Programmer
>
> T: +27214688017
> F: +27862160617
> E: jacq...@pricecheck.co.za
> 
>



-- 
Jacques du Rand
Senior R  Programmer

T: +27214688017
F: +27862160617
E: jacq...@pricecheck.co.za



Re: Facet query - exlude main query

2017-02-21 Thread Chris Hostetter

: facet.field: [
: "{!ex=tag_make,tag_model,tag_mq}feature_s_1_make",
: "{!ex=tag_model}feature_s_2_model"
...
: q: "{!tag=mq}nissan",

You are attempting to exclude tags named "tag_make", "tag_model", and 
"tag_mq" -- but the name of the tag you are using in the query is "mq"

if you include "mq" in the list of excluded tags (the "ex" local param) it 
should work fine...

facet.field={!ex=mq}feature_s_1_make"
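
Putting that together with the other params echoed in your request, a sketch of the full URL:

/select?q={!tag=mq}nissan&defType=edismax&mm=2&qf=name%20name_raw&wt=json&start=0&rows=20&facet=true&facet.field={!ex=mq}feature_s_1_make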



-Hoss
http://www.lucidworks.com/


Re: Facet query - exlude main query

2017-02-21 Thread Jacques du Rand
Sure..

So this is for a car search-application


If i change:

q={!tag=mq}nissan

to

q=*:*

I get all the makes.

VAR ECHO:

responseHeader: {
status: 0,
QTime: 4,
params: {
mm: "2",
facet.field: [
"{!ex=tag_make,tag_model,tag_mq}feature_s_1_make",
"{!ex=tag_model}feature_s_2_model"

],
qs: "5",
components: [
"query",
"stats",
"facet",
"debug"
],
df: "text",
ps: "5",
echoParams: "all",
start: "0",
q.op: "OR",
rows: "20",
q: "{!tag=mq}nissan",
tie: "0.1",
facet.limit: "200",
defType: "edismax",
qf: "name name_raw ",
facet.method: "fcs",
facet.threads: "6",
facet.mincount: "1",
pf3: "",
wt: "json",
facet: "true",
tr: "example.xsl"


--->RESPONSE - expecting ALL car makes...
facet_fields: {
feature_s_1_make: [
"nissan",
55
],


On 21 February 2017 at 19:36, Chris Hostetter 
wrote:

>
> :  Solr3.1  Starting with Solr 3.1,
> the
> : primary relevance query (i.e. the one normally specified by the *q*
> parameter)
> : may also be excluded.
> :
> : But doesnt show me how to exlude it ??
>
> same tag + ex local params, as you had in your example...
>
> : 2. q={!tag=mq}blah={!ex=tag_man,mq}manufacturer
> ...
> : But  I still only get the facets WITH the main query  of "blah"
>
> Can you give us a more more complete picture of what your data & full
> request & response looks like? ... it's working fine for me with Solr 6
> and the techproducts sample data (see below).
>
> https://wiki.apache.org/solr/UsingMailingLists
>
>
> $ curl 'http://localhost:8983/solr/techproducts/query?rows=0=%
> 7B!tag=y%7dinStock:true=%7B!tag=x%7dname:ipod=true&
> facet.field=inStock=%7b!key=exclude_x+ex=x%
> 7dinStock=%7b!key=exclude_xy+ex=x,y%7dinStock'
> {
>   "responseHeader":{
> "status":0,
> "QTime":2,
> "params":{
>   "q":"{!tag=x}name:ipod",
>   "facet.field":["inStock",
> "{!key=exclude_x ex=x}inStock",
> "{!key=exclude_xy ex=x,y}inStock"],
>   "fq":"{!tag=y}inStock:true",
>   "rows":"0",
>   "facet":"true"}},
>   "response":{"numFound":1,"start":0,"docs":[]
>   },
>   "facet_counts":{
> "facet_queries":{},
> "facet_fields":{
>   "inStock":[
> "true",1,
> "false",0],
>   "exclude_x":[
> "true",17,
> "false",0],
>   "exclude_xy":[
> "true",17,
> "false",4]},
> "facet_ranges":{},
> "facet_intervals":{},
> "facet_heatmaps":{}}}
>
>
>
>
>
> -Hoss
> http://www.lucidworks.com/
>



-- 
Jacques du Rand
Senior R  Programmer

T: +27214688017
F: +27862160617
E: jacq...@pricecheck.co.za



Re: Facet query - exlude main query

2017-02-21 Thread Chris Hostetter

:  Solr3.1  Starting with Solr 3.1, the
: primary relevance query (i.e. the one normally specified by the *q* parameter)
: may also be excluded.
: 
: But doesnt show me how to exlude it ??

same tag + ex local params, as you had in your example...

: 2. q={!tag=mq}blah={!ex=tag_man,mq}manufacturer
...
: But  I still only get the facets WITH the main query  of "blah"

Can you give us a more more complete picture of what your data & full 
request & response looks like? ... it's working fine for me with Solr 6 
and the techproducts sample data (see below).

https://wiki.apache.org/solr/UsingMailingLists


$ curl 
'http://localhost:8983/solr/techproducts/query?rows=0&fq=%7B!tag=y%7dinStock:true&q=%7B!tag=x%7dname:ipod&facet=true&facet.field=inStock&facet.field=%7b!key=exclude_x+ex=x%7dinStock&facet.field=%7b!key=exclude_xy+ex=x,y%7dinStock'
{
  "responseHeader":{
"status":0,
"QTime":2,
"params":{
  "q":"{!tag=x}name:ipod",
  "facet.field":["inStock",
"{!key=exclude_x ex=x}inStock",
"{!key=exclude_xy ex=x,y}inStock"],
  "fq":"{!tag=y}inStock:true",
  "rows":"0",
  "facet":"true"}},
  "response":{"numFound":1,"start":0,"docs":[]
  },
  "facet_counts":{
"facet_queries":{},
"facet_fields":{
  "inStock":[
"true",1,
"false",0],
  "exclude_x":[
"true",17,
"false",0],
  "exclude_xy":[
"true",17,
"false",4]},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}}





-Hoss
http://www.lucidworks.com/


Re: Fwd: Solr dynamic field blowing up the index size

2017-02-21 Thread Pratik Patel
I am using the schema from Solr 5, which does not have any field with
docValues enabled. In fact, to ensure that everything is the same as in Solr 5
(except for the breaking changes), I am also using the solrconfig.xml from
Solr 5, with schemaFactory set as classicSchemaFactory to use the schema.xml
from Solr 5.

On Tue, Feb 21, 2017 at 11:33 AM, Alexandre Rafalovitch 
wrote:

> Did you reuse the schema or rebuilt it on top of the latest examples?
> Because the latest example schema enabled docValues for strings on the
> fieldType level.
>
> I would do a diff of the schemas to see what changed. If they look
> very different and you are looking for tools to normalize/extract
> elements from schemas, you may find my latest Revolution presentation
> useful for that:
> https://www.slideshare.net/arafalov/rebuilding-solr-6-
> examples-layer-by-layer-lucenesolrrevolution-2016
> (e.g. slide 20). There is also the video there at the end.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 21 February 2017 at 11:18, Mike Thomsen  wrote:
> > Correct me if I'm wrong, but heavy use of doc values should actually blow
> > up the size of your index considerably if they are in fields that get
> sent
> > a lot of data.
> >
> > On Tue, Feb 21, 2017 at 10:50 AM, Pratik Patel 
> wrote:
> >
> >> Thanks for the reply. I can see that in solr 6, more than 50% of the
> index
> >> directory is occupied by ".nvd" file extension. It is something related
> to
> >> norms and doc values.
> >>
> >> On Tue, Feb 21, 2017 at 10:27 AM, Alexandre Rafalovitch <
> >> arafa...@gmail.com>
> >> wrote:
> >>
> >> > Did you look in the data directories to check what index file
> extensions
> >> > contribute most to the difference? That could give a hint.
> >> >
> >> > Regards,
> >> > Alex
> >> >
> >> > On 21 Feb 2017 9:47 AM, "Pratik Patel"  wrote:
> >> >
> >> > > Here is the same question in stackOverflow for better format.
> >> > >
> >> > > http://stackoverflow.com/questions/42370231/solr-
> >> > > dynamic-field-blowing-up-the-index-size
> >> > >
> >> > > Recently, I upgraded from solr 5.0 to solr 6.4.1. I can run my app
> fine
> >> > but
> >> > > the problem is that index size with solr 6 is way too large. In
> solr 5,
> >> > > index size was about 15GB and in solr 6, for the same data, the
> index
> >> > size
> >> > > is 300GB! I am not able to understand what contributes to such huge
> >> > > difference in solr 6.
> >> > >
> >> > > I have been able to identify a field which is blowing up the size of
> >> > index.
> >> > > It is as follows.
> >> > >
> >> > >  >> > > stored="true" multiValued="true"  />
> >> > >
> >> > >  >> > > stored="false" multiValued="true"  />
> >> > > 
> >> > >
> >> > > When this field is commented out, the index size reduces to less
> than
> >> > 10GB.
> >> > >
> >> > > This field is of type text_general. Following is the definition of
> this
> >> > > type.
> >> > >
> >> > >  >> > > positionIncrementGap="100">
> >> > >   
> >> > > 
> >> > > 
> >> > > 
> >> > >  >> > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> >> > >  >> > > protected="protwords.txt" generateWordParts="1"
> >> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >> > > catenateAll="0" splitOnCaseChange="0"/>
> >> > > 
> >> > >  >> > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
> >> > > solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> >> > > />
> >> > >   
> >> > >   
> >> > > 
> >> > > 
> >> > > 
> >> > >  >> > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> >> > >  >> > > protected="protwords.txt" generateWordParts="1"
> >> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> >> > > catenateAll="0" splitOnCaseChange="0"/>
> >> > > 
> >> > >  >> > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
> >> > > solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> >> > > />
> >> > >   
> >> > >   
> >> > >
> >> > > Few things which I did to debug this issue:
> >> > >
> >> > >- I have ensured that field type definition is same as what I was
> >> > using
> >> > >in solr 5 and it is also valid in version 6. This field type
> >> > considers a
> >> > >list of "stopwords" to be ignored during indexing. I have
> supplied
> >> the
> >> > > same
> >> > >list of stopwords which we were using in solr 5. I have verified
> >> that
> >> > > path
> >> > >of this file is correct and it is being loaded fine in solr admin
> >> UI.
> >> > > When
> >> > >I analyse these fields using "Analysis" tab of the solr admin
> UI, I
> >> > can
> >> > > see
> >> > >that stopwords are being filtered out. However, when I query with
> >> some
> >> > > of
> >> > >these stopwords, I do get the results back which makes me think
> that
> >> > >  

Re: Fwd: Solr dynamic field blowing up the index size

2017-02-21 Thread Alexandre Rafalovitch
Did you reuse the schema or rebuilt it on top of the latest examples?
Because the latest example schema enabled docValues for strings on the
fieldType level.

I would do a diff of the schemas to see what changed. If they look
very different and you are looking for tools to normalize/extract
elements from schemas, you may find my latest Revolution presentation
useful for that:
https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016
(e.g. slide 20). There is also the video there at the end.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 21 February 2017 at 11:18, Mike Thomsen  wrote:
> Correct me if I'm wrong, but heavy use of doc values should actually blow
> up the size of your index considerably if they are in fields that get sent
> a lot of data.
>
> On Tue, Feb 21, 2017 at 10:50 AM, Pratik Patel  wrote:
>
>> Thanks for the reply. I can see that in solr 6, more than 50% of the index
>> directory is occupied by ".nvd" file extension. It is something related to
>> norms and doc values.
>>
>> On Tue, Feb 21, 2017 at 10:27 AM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> wrote:
>>
>> > Did you look in the data directories to check what index file extensions
>> > contribute most to the difference? That could give a hint.
>> >
>> > Regards,
>> > Alex
>> >
>> > On 21 Feb 2017 9:47 AM, "Pratik Patel"  wrote:
>> >
>> > > Here is the same question in stackOverflow for better format.
>> > >
>> > > http://stackoverflow.com/questions/42370231/solr-
>> > > dynamic-field-blowing-up-the-index-size
>> > >
>> > > Recently, I upgraded from solr 5.0 to solr 6.4.1. I can run my app fine
>> > but
>> > > the problem is that index size with solr 6 is way too large. In solr 5,
>> > > index size was about 15GB and in solr 6, for the same data, the index
>> > size
>> > > is 300GB! I am not able to understand what contributes to such huge
>> > > difference in solr 6.
>> > >
>> > > I have been able to identify a field which is blowing up the size of
>> > index.
>> > > It is as follows.
>> > >
>> > > > > > stored="true" multiValued="true"  />
>> > >
>> > > > > > stored="false" multiValued="true"  />
>> > > 
>> > >
>> > > When this field is commented out, the index size reduces to less than
>> > 10GB.
>> > >
>> > > This field is of type text_general. Following is the definition of this
>> > > type.
>> > >
>> > > > > > positionIncrementGap="100">
>> > >   
>> > > 
>> > > 
>> > > 
>> > > > > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> > > > > > protected="protwords.txt" generateWordParts="1"
>> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> > > catenateAll="0" splitOnCaseChange="0"/>
>> > > 
>> > > > > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
>> > > solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
>> > > />
>> > >   
>> > >   
>> > > 
>> > > 
>> > > 
>> > > > > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> > > > > > protected="protwords.txt" generateWordParts="1"
>> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> > > catenateAll="0" splitOnCaseChange="0"/>
>> > > 
>> > > > > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
>> > > solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
>> > > />
>> > >   
>> > >   
>> > >
>> > > Few things which I did to debug this issue:
>> > >
>> > >- I have ensured that field type definition is same as what I was
>> > using
>> > >in solr 5 and it is also valid in version 6. This field type
>> > considers a
>> > >list of "stopwords" to be ignored during indexing. I have supplied
>> the
>> > > same
>> > >list of stopwords which we were using in solr 5. I have verified
>> that
>> > > path
>> > >of this file is correct and it is being loaded fine in solr admin
>> UI.
>> > > When
>> > >I analyse these fields using "Analysis" tab of the solr admin UI, I
>> > can
>> > > see
>> > >that stopwords are being filtered out. However, when I query with
>> some
>> > > of
>> > >these stopwords, I do get the results back which makes me think that
>> > >probably stopwords are being indexed.
>> > >
>> > > Any idea what could increase the size of index by so much in solr 6?
>> > >
>> >
>>


Re: Fwd: Solr dynamic field blowing up the index size

2017-02-21 Thread Mike Thomsen
Correct me if I'm wrong, but heavy use of doc values should actually blow
up the size of your index considerably if they are in fields that get sent
a lot of data.

On Tue, Feb 21, 2017 at 10:50 AM, Pratik Patel  wrote:

> Thanks for the reply. I can see that in solr 6, more than 50% of the index
> directory is occupied by ".nvd" file extension. It is something related to
> norms and doc values.
>
> On Tue, Feb 21, 2017 at 10:27 AM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> wrote:
>
> > Did you look in the data directories to check what index file extensions
> > contribute most to the difference? That could give a hint.
> >
> > Regards,
> > Alex
> >
> > On 21 Feb 2017 9:47 AM, "Pratik Patel"  wrote:
> >
> > > Here is the same question in stackOverflow for better format.
> > >
> > > http://stackoverflow.com/questions/42370231/solr-
> > > dynamic-field-blowing-up-the-index-size
> > >
> > > Recently, I upgraded from solr 5.0 to solr 6.4.1. I can run my app fine
> > but
> > > the problem is that index size with solr 6 is way too large. In solr 5,
> > > index size was about 15GB and in solr 6, for the same data, the index
> > size
> > > is 300GB! I am not able to understand what contributes to such huge
> > > difference in solr 6.
> > >
> > > I have been able to identify a field which is blowing up the size of
> > index.
> > > It is as follows.
> > >
> > >  > > stored="true" multiValued="true"  />
> > >
> > >  > > stored="false" multiValued="true"  />
> > > 
> > >
> > > When this field is commented out, the index size reduces to less than
> > 10GB.
> > >
> > > This field is of type text_general. Following is the definition of this
> > > type.
> > >
> > >  > > positionIncrementGap="100">
> > >   
> > > 
> > > 
> > > 
> > >  > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> > >  > > protected="protwords.txt" generateWordParts="1"
> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > > catenateAll="0" splitOnCaseChange="0"/>
> > > 
> > >  > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
> > > solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> > > />
> > >   
> > >   
> > > 
> > > 
> > > 
> > >  > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> > >  > > protected="protwords.txt" generateWordParts="1"
> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > > catenateAll="0" splitOnCaseChange="0"/>
> > > 
> > >  > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
> > > solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> > > />
> > >   
> > >   
> > >
> > > Few things which I did to debug this issue:
> > >
> > >- I have ensured that field type definition is same as what I was
> > using
> > >in solr 5 and it is also valid in version 6. This field type
> > considers a
> > >list of "stopwords" to be ignored during indexing. I have supplied
> the
> > > same
> > >list of stopwords which we were using in solr 5. I have verified
> that
> > > path
> > >of this file is correct and it is being loaded fine in solr admin
> UI.
> > > When
> > >I analyse these fields using "Analysis" tab of the solr admin UI, I
> > can
> > > see
> > >that stopwords are being filtered out. However, when I query with
> some
> > > of
> > >these stopwords, I do get the results back which makes me think that
> > >probably stopwords are being indexed.
> > >
> > > Any idea what could increase the size of index by so much in solr 6?
> > >
> >
>


Re: Fwd: Solr dynamic field blowing up the index size

2017-02-21 Thread Pratik Patel
Thanks for the reply. I can see that in solr 6, more than 50% of the index
directory is occupied by ".nvd" file extension. It is something related to
norms and doc values.
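
A quick way to see this (the path is a placeholder for the core's index directory):

du -ch server/solr/collection1/data/index/*.nvd | tail -1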

On Tue, Feb 21, 2017 at 10:27 AM, Alexandre Rafalovitch 
wrote:

> Did you look in the data directories to check what index file extensions
> contribute most to the difference? That could give a hint.
>
> Regards,
> Alex
>
> On 21 Feb 2017 9:47 AM, "Pratik Patel"  wrote:
>
> > Here is the same question in stackOverflow for better format.
> >
> > http://stackoverflow.com/questions/42370231/solr-
> > dynamic-field-blowing-up-the-index-size
> >
> > Recently, I upgraded from solr 5.0 to solr 6.4.1. I can run my app fine
> but
> > the problem is that index size with solr 6 is way too large. In solr 5,
> > index size was about 15GB and in solr 6, for the same data, the index
> size
> > is 300GB! I am not able to understand what contributes to such huge
> > difference in solr 6.
> >
> > I have been able to identify a field which is blowing up the size of
> index.
> > It is as follows.
> >
> >  > stored="true" multiValued="true"  />
> >
> >  > stored="false" multiValued="true"  />
> > 
> >
> > When this field is commented out, the index size reduces to less than
> 10GB.
> >
> > This field is of type text_general. Following is the definition of this
> > type.
> >
> >  > positionIncrementGap="100">
> >   
> > 
> > 
> > 
> >  > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> >  > protected="protwords.txt" generateWordParts="1"
> > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > catenateAll="0" splitOnCaseChange="0"/>
> > 
> >  > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
> > solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> > />
> >   
> >   
> > 
> > 
> > 
> >  > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> >  > protected="protwords.txt" generateWordParts="1"
> > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > catenateAll="0" splitOnCaseChange="0"/>
> > 
> >  > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
> > solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> > />
> >   
> >   
> >
> > Few things which I did to debug this issue:
> >
> >- I have ensured that field type definition is same as what I was
> using
> >in solr 5 and it is also valid in version 6. This field type
> considers a
> >list of "stopwords" to be ignored during indexing. I have supplied the
> > same
> >list of stopwords which we were using in solr 5. I have verified that
> > path
> >of this file is correct and it is being loaded fine in solr admin UI.
> > When
> >I analyse these fields using "Analysis" tab of the solr admin UI, I
> can
> > see
> >that stopwords are being filtered out. However, when I query with some
> > of
> >these stopwords, I do get the results back which makes me think that
> >probably stopwords are being indexed.
> >
> > Any idea what could increase the size of index by so much in solr 6?
> >
>


Re: Solr: Return field names that contain search term

2017-02-21 Thread sunayana1991
Hi rahul,

I am working on a similar scenario and was wondering if you found a way to
resolve this.

Thanks in advance!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Return-field-names-that-contain-search-term-tp3329993p4321399.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Fwd: Solr dynamic field blowing up the index size

2017-02-21 Thread Alexandre Rafalovitch
Did you look in the data directories to check what index file extensions
contribute most to the difference? That could give a hint.

Regards,
Alex

On 21 Feb 2017 9:47 AM, "Pratik Patel"  wrote:

> Here is the same question in stackOverflow for better format.
>
> http://stackoverflow.com/questions/42370231/solr-
> dynamic-field-blowing-up-the-index-size
>
> Recently, I upgraded from solr 5.0 to solr 6.4.1. I can run my app fine but
> the problem is that index size with solr 6 is way too large. In solr 5,
> index size was about 15GB and in solr 6, for the same data, the index size
> is 300GB! I am not able to understand what contributes to such huge
> difference in solr 6.
>
> I have been able to identify a field which is blowing up the size of index.
> It is as follows.
>
>  stored="true" multiValued="true"  />
>
>  stored="false" multiValued="true"  />
> 
>
> When this field is commented out, the index size reduces to less than 10GB.
>
> This field is of type text_general. Following is the definition of this
> type.
>
>  positionIncrementGap="100">
>   
> 
> 
> 
>  pattern="((?m)[a-z]+)'s" replacement="$1s" />
>  protected="protwords.txt" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="0"/>
> 
>  words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
> solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> />
>   
>   
> 
> 
> 
>  pattern="((?m)[a-z]+)'s" replacement="$1s" />
>  protected="protwords.txt" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="0"/>
> 
>  words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
> solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> />
>   
>   
>
> Few things which I did to debug this issue:
>
>- I have ensured that field type definition is same as what I was using
>in solr 5 and it is also valid in version 6. This field type considers a
>list of "stopwords" to be ignored during indexing. I have supplied the
> same
>list of stopwords which we were using in solr 5. I have verified that
> path
>of this file is correct and it is being loaded fine in solr admin UI.
> When
>I analyse these fields using "Analysis" tab of the solr admin UI, I can
> see
>that stopwords are being filtered out. However, when I query with some
> of
>these stopwords, I do get the results back which makes me think that
>probably stopwords are being indexed.
>
> Any idea what could increase the size of index by so much in solr 6?
>


Fwd: Solr dynamic field blowing up the index size

2017-02-21 Thread Pratik Patel
Here is the same question in stackOverflow for better format.

http://stackoverflow.com/questions/42370231/solr-
dynamic-field-blowing-up-the-index-size

Recently, I upgraded from solr 5.0 to solr 6.4.1. I can run my app fine but
the problem is that index size with solr 6 is way too large. In solr 5,
index size was about 15GB and in solr 6, for the same data, the index size
is 300GB! I am not able to understand what contributes to such huge
difference in solr 6.

I have been able to identify a field which is blowing up the size of index.
It is as follows.






When this field is commented out, the index size reduces to less than 10GB.

This field is of type text_general. Following is the definition of this
type.


  







  
  







  
  

Few things which I did to debug this issue:

   - I have ensured that field type definition is same as what I was using
   in solr 5 and it is also valid in version 6. This field type considers a
   list of "stopwords" to be ignored during indexing. I have supplied the same
   list of stopwords which we were using in solr 5. I have verified that path
   of this file is correct and it is being loaded fine in solr admin UI. When
   I analyse these fields using "Analysis" tab of the solr admin UI, I can see
   that stopwords are being filtered out. However, when I query with some of
   these stopwords, I do get the results back which makes me think that
   probably stopwords are being indexed.

Any idea what could increase the size of index by so much in solr 6?


Facet query - exlude main query

2017-02-21 Thread Jacques du Rand
HI Guys
The Wiki says

 Solr3.1  Starting with Solr 3.1, the
primary relevance query (i.e. the one normally specified by the *q* parameter)
may also be excluded.

But it doesn't show me how to exclude it??

i've tried:
1 .={!ex=tag_man,mainquery}manufacturer

2. q={!tag=mq}blah={!ex=tag_man,mq}manufacturer

3. ...=manufacturer:*


But  I still only get the facets WITH the main query  of "blah"

Any Ideas ?

-- 
Jacques du Rand
Senior R  Programmer

T: +27214688017
F: +27862160617
E: jacq...@pricecheck.co.za



Re: CPU Intensive Scoring Alternatives

2017-02-21 Thread Ahmet Arslan
Hi,

New default similarity is BM25. 
May be explicitly set similarity to tf-idf and see how it goes?
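
A minimal sketch of what that schema change could look like in Solr 6.x (this sets classic TF-IDF globally; it can also be set per fieldType):

<similarity class="solr.ClassicSimilarityFactory"/>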

Ahmet


On Tuesday, February 21, 2017 4:28 AM, Fuad Efendi  wrote:
Hello,


Default TF-IDF performs poorly with 200 million indexed documents. The query
"Michael Jackson" may run 300ms, and "Michael The Jackson" over 3 seconds.
eDisMax. Because of the default operator "OR" and the stopword "The", we get
50-70 million documents as the query result, and scoring is CPU intensive.
What to do? Our typical queries return over a million documents, and response
times of simple queries range from 50 milliseconds to 5-10 seconds depending
on the result set.

This was just an exaggerated example with the stopword "the", but even the
simplest query "Michael Jackson" runs 300ms instead of 3ms just because of the
huge number of hits and TF-IDF calculations. Solr 6.3.


Thanks,

--

Fuad Efendi

(416) 993-2060

http://www.tokenizer.ca
Search Relevancy, Recommender Systems