Re: Relevancy : Keyword stuffing

2015-03-16 Thread Mihran Shahinian
Thank you Markus and Chris, for pointers.
For SweetSpotSimilarity I am thinking perhaps a set of closed ranges
exposed via similarity config is easier to maintain as data changes than
making adjustments to fit a
function. Another piece of info that would've been handy is the average
position plus the position info for the first few occurrences of each term.
This would allow perhaps higher boosting for term occurrences earlier in the
doc. In my case the extra keywords are towards the end of the doc, but that
info does not seem to be propagated into the scorer.
Thanks again,
Mihran



On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 You should start by checking out the SweetSpotSimilarity .. it was
 heavily designed around the idea of dealing with things like excessively
 verbose titles, and keyword stuffing in summary text ... so you can
 configure your expectation for what a normal length doc is, and they
 will be penalized for being longer than that.  similarly you can say what
 a 'reasonable' tf is, and docs that exceed that wouldn't get added boost
 (which in conjunction with the lengthNorm penalty penalizes docs that
 stuff keywords)


 https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html


 https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg

 https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg


 -Hoss
 http://www.lucidworks.com/



Re: Nginx proxy for Solritas

2015-03-16 Thread Erik Hatcher
Have a look at the requests being made to Solr while using /browse (without 
nginx) and that will show you what resources need to be accessible.


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com http://www.lucidworks.com/




 On Mar 16, 2015, at 4:42 PM, LongY zhangyulin8...@hotmail.com wrote:
 
 Thank you for the reply.
 
 I also thought the relevant resources (CSS, images, JavaScript) need to 
 be accessible for Nginx. 
 
 I copied the velocity folder to solr-webapp/webapp folder. It didn't work.
 
 So how to allow /browse resource accessible by the Nginx rule?
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193347p4193352.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Nginx proxy for Solritas

2015-03-16 Thread LongY
Thanks to Erik and Shawn, I figured out the solution.

* place the main.css from the velocity folder into
/usr/share/nginx/html/solr/collection1/admin/file/
* don't forget to change the permission of main.css with sudo chmod 755
main.css
* add main.css to the configuration file of Nginx:
server {
    listen 80 default_server;
    listen [::]:80 default_server ipv6only=on;
    index main.css;
    server_name localhost;
    location ~* /solr/\w+/browse {
        proxy_pass http://localhost:8983;
        allow 127.0.0.1;
        deny all;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
    }
}
That will work.
Also /var/log/nginx/error.log is good for debugging.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193347p4193415.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: indexing db records via SolrJ

2015-03-16 Thread sreedevi s
Hi,
I had checked this post. I don't know whether this is possible, but my query
is whether I can use the DIH configuration for indexing via SolrJ.

Best Regards,
Sreedevi S

On Mon, Mar 16, 2015 at 4:17 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Hello,

 Did you see the great post http://lucidworks.com/blog/indexing-with-solrj/
 ?

 On Mon, Mar 16, 2015 at 1:30 PM, sreedevi s sreedevi.payik...@gmail.com
 wrote:

  Hi,
 
  I am a beginner in Solr. I have a scenario, where I need to index data
 from
  my MySQL db and need to query them. I have figured out to provide my db
  data import configs using DIH. I also know to query my index via SolrJ.
 
  How can I do indexing via SolrJ client for my db as well other than
 reading
  the db records into documents one by one?
 
  This question is in point whether is there any way I can make use of my
  configuration files and achieve the same. We need to use java APIs, so
 all
  indexing and querying can be done only via SolrJ.
  Best Regards,
  Sreedevi S
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: indexing db records via SolrJ

2015-03-16 Thread Mikhail Khludnev
Hello,

Did you see the great post http://lucidworks.com/blog/indexing-with-solrj/
?

On Mon, Mar 16, 2015 at 1:30 PM, sreedevi s sreedevi.payik...@gmail.com
wrote:

 Hi,

 I am a beginner in Solr. I have a scenario, where I need to index data from
 my MySQL db and need to query them. I have figured out to provide my db
 data import configs using DIH. I also know to query my index via SolrJ.

 How can I do indexing via SolrJ client for my db as well other than reading
 the db records into documents one by one?

 This question is in point whether is there any way I can make use of my
 configuration files and achieve the same. We need to use java APIs, so all
 indexing and querying can be done only via SolrJ.
 Best Regards,
 Sreedevi S




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Solr returns incorrect results after sorting

2015-03-16 Thread david.w.smi...@gmail.com
I noticed you have an ‘&’ immediately preceding the geodist() asc at the
very end of the query/URL; that’s supposed to be a comma since group.sort
is a comma-delimited list of sorts.

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Mon, Mar 16, 2015 at 7:51 AM, kumarraj rajitpro2...@gmail.com wrote:

 Hi,

 I am using group.sort to internally sort the values first based on
 store(using function),then stock and finally distance and sort the output
 results based on price, but solr does not return the correct results after
 sorting.
 Below is the  sample query:

 q=*:*&start=0&rows=200&sort=pricecommon_double desc&d=321&spatial=true&sfield=store_location&fl=geodist(),*&pt=37.1037311,-76.5104751
 &group.ngroups=true&group.limit=1&group.facet=true&group.field=code_string&group=true&group.sort=max(if(exists(query({!v='storeName_string:212'})),2,0),if(exists(query({!v='storeName_string:203'})),1,0)) desc,inStock_boolean desc&geodist() asc


 I am expecting all the docs to be sorted by price from high to low after
 grouping,  but i see the records not matching the order, Do you see any
 issues with the query or having functions in group.sort is not supported in
 solr?




 Regards,
 Raj



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: indexing db records via SolrJ

2015-03-16 Thread Shawn Heisey
On 3/16/2015 7:15 AM, sreedevi s wrote:
 I had checked this post.I dont know whether this is possible but my query
 is whether I can use the configuration for DIH for indexing via SolrJ

You can use SolrJ for accessing DIH.  I have code that does this, but
only for full index rebuilds.

It won't be particularly obvious how to do it.  Writing code that can
interpret DIH status and know when it finishes, succeeds, or fails is
very tricky because DIH only uses human-readable status info, not
machine-readable, and the info is not very consistent.

I can't just share my code, because it's extremely convoluted ... but
the general gist is to create a SolrQuery object, use setRequestHandler
to set the handler to /dataimport or whatever your DIH handler is, and
set the other parameters on the request like command to full-import
and so on.
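
A minimal sketch of that approach, assuming SolrJ 5.x (HttpSolrClient; on 4.x
the class is HttpSolrServer), a hypothetical core at
http://localhost:8983/solr/collection1, and a handler registered at /dataimport:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DihTrigger {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery();
            q.setRequestHandler("/dataimport");  // point the request at the DIH handler
            q.set("command", "full-import");     // kick off a full rebuild
            q.set("clean", "true");
            q.set("commit", "true");
            QueryResponse rsp = solr.query(q);
            // Status is human-readable only, as noted above; poll it separately
            // with command=status and parse it yourself.
            System.out.println(rsp);
            solr.close();
        }
    }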

Thanks,
Shawn



Re: [Poll]: User need for Solr security

2015-03-16 Thread Ahmet Arslan
Hi John,

ManifoldCF in Action book is publicly available to anyone : 
https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/

For solr integration please see :
https://svn.apache.org/repos/asf/manifoldcf/integration/solr-5.x/trunk/README.txt

Ahmet

On Friday, March 13, 2015 2:50 AM, johnmu...@aol.com johnmu...@aol.com 
wrote:



I would love to see record level (or even field level) restricted access in 
Solr / Lucene.

This should be group level, LDAP like or some rule base (which can be dynamic). 
 If the solution means having a second core, so be it.

The following is the closest I found: 
https://wiki.apache.org/solr/SolrSecurity#Document_Level_Security but I cannot 
use Manifold CF (Connector Framework).  Does anyone know how Manifold does it?

- MJ


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Thursday, March 12, 2015 6:51 PM
To: solr-user@lucene.apache.org
Subject: RE: [Poll]: User need for Solr security

Jan - we don't really need any security for our products, nor for most clients. 
However, one client does deal with very sensitive data so we proposed to 
encrypt the transfer of data and the data on disk through a Lucene Directory. 
It won't fill all gaps but it would adhere to such a client's guidelines. 

I think many approaches of security in Solr/Lucene would find advocates, be it 
index encryption or authentication/authorization or transport security, which 
is now possible. I understand the reluctance of the PMC, and i agree with it, 
but some users would definitely benefit and it would certainly make 
Solr/Lucene the search platform to use for some enterprises.

Markus 

-Original message-
 From:Henrique O. Santos hensan...@gmail.com
 Sent: Thursday 12th March 2015 23:43
 To: solr-user@lucene.apache.org
 Subject: Re: [Poll]: User need for Solr security
 
 Hi,
 
 I’m currently working with indexes that need document level security. Based 
 on the user logged in, query results would omit documents that this user 
 doesn’t have access to, with LDAP integration and such.
 
 I think that would be nice to have on a future Solr release.
 
 Henrique.
 
  On Mar 12, 2015, at 7:32 AM, Jan Høydahl jan@cominvent.com wrote:
  
  Hi,
  
  Securing various Solr APIs has once again surfaced as a discussion 
   in the developer list. See e.g. SOLR-7236. Would be useful to get some 
  feedback from Solr users about needs in the field.
  
  Please reply to this email and let us know what security aspect(s) would be 
  most important for your company to see supported in a future version of 
  Solr.
  Examples: Local user management, AD/LDAP integration, SSL, 
  authenticated login to Admin UI, authorization for Admin APIs, e.g. 
  admin user vs read-only user etc
  
  --
  Jan Høydahl, search solution architect Cominvent AS - 
  www.cominvent.com
  
 



indexing db records via SolrJ

2015-03-16 Thread sreedevi s
Hi,

I am a beginner in Solr. I have a scenario, where I need to index data from
my MySQL db and need to query them. I have figured out to provide my db
data import configs using DIH. I also know to query my index via SolrJ.

How can I do indexing via the SolrJ client for my db, other than reading
the db records into documents one by one?

My question is whether there is any way I can make use of my
configuration files and achieve the same. We need to use Java APIs, so all
indexing and querying can be done only via SolrJ.
Best Regards,
Sreedevi S


[ANNOUNCE] Luke 4.10.4 released

2015-03-16 Thread Dmitry Kan
Hello,

Luke 4.10.4 has been released. Download it here:

https://github.com/DmitryKey/luke/releases/tag/luke-4.10.4

The release has been tested against the solr-4.10.4 based index.

Changes:
Trivial pom upgrade to lucene 4.10.4.
Got rid of the index version warning on the index summary tab.
Luke is now distributed as a tar.gz with the luke binary and a launcher
script.


There is currently a version of Luke atop Apache Pivot cooking in its own branch. You
can try it out already for some basic index loading and search operations:

https://github.com/DmitryKey/luke/tree/pivot-luke

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Solr returns incorrect results after sorting

2015-03-16 Thread kumarraj
Hi,

I am using group.sort to internally sort the values first based on
store (using a function), then stock and finally distance, and to sort the
output results based on price, but Solr does not return the correct results
after sorting.
Below is the  sample query: 

q=*:*&start=0&rows=200&sort=pricecommon_double desc&d=321&spatial=true&sfield=store_location&fl=geodist(),*&pt=37.1037311,-76.5104751
&group.ngroups=true&group.limit=1&group.facet=true&group.field=code_string&group=true&group.sort=max(if(exists(query({!v='storeName_string:212'})),2,0),if(exists(query({!v='storeName_string:203'})),1,0)) desc,inStock_boolean desc&geodist() asc


I am expecting all the docs to be sorted by price from high to low after
grouping, but I see the records not matching that order. Do you see any
issues with the query, or are functions in group.sort not supported in
Solr?




Regards,
Raj



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-returns-incorrect-results-after-sorting-tp4193266.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Deleted Docs Issue

2015-03-16 Thread Erick Erickson
bq: If this operation is continuously done I would end up with a large set of
deleted docs which will affect the performance of the queries I hit on this
solr.

No, you won't. They'll be merged away as background segments are merged.
Here's a great visualization of the process; the third one down is the
default TieredMergePolicy.

In general, even in the case of replacing all the docs, you'll have 10% of your
corpus be deleted docs. The % of deleted docs in a segment weighs quite
heavily when it comes to the decision of which segment to merge (note that
merging purges the deleted docs).

Also in general, the results of small tests like this simply do not generalize.
i.e. the number of deleted docs in a 200 doc sample size can't be
extrapolated to a reasonable-sized corpus.

Finally, I don't know if this is something temporary, but the implication of
"If total commit operations I hit are 20" is that you're committing after every
batch of docs is sent to Solr. You should not do this; let your autocommit
settings handle this.

Here's Mike's blog:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Best,
Erick

On Mon, Mar 16, 2015 at 8:51 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 3/16/2015 9:11 AM, vicky desai wrote:
 I am having an issue with my solr setup. In my solr config I have set
 following property
 *<mergeFactor>10</mergeFactor>*

 The mergeFactor setting is deprecated ... but you are setting it to the
 default value of 10 anyway, so that's not really a big deal.  It's
 possible that mergeFactor will no longer work in 5.0, but I'm not sure
 on that.  You should instead use the settings specific to the merge
 policy, which normally is TieredMergePolicy.

 Note that when mergeFactor is 10, you *will* end up with more than 10
 segments in your index.  There are multiple merge tiers, each one can
 have up to 10 segments before it is merged.

 Now consider following situation. I have* 200* documents in my index. I need
 to update all the 200 docs
 If total commit operations I hit are* 20* i.e I update batches of 10 docs
 merging is done after every 10th update and so the max Segment Count I can
 have is 10 which is fine. However even when merging happens deleted docs are
 not cleared and I end up with 100 deleted docs in index.

 If this operation is continuously done I would end up with a large set of
 deleted docs which will affect the performance of the queries I hit on this
 solr.

 Because there are multiple merge tiers and you cannot easily
 pre-determine which segments will be chosen for a particular merge, the
 merge behavior may not be exactly what you expect.

 The only guaranteed way to get rid of your deleted docs is to do an
 optimize operation, which forces a merge of the entire index down to a
 single segment.  This gets rid of all deleted docs in those segments.
 If you index more data while you are doing the optimize, then you may
 end up with additional deleted docs.

 Thanks,
 Shawn



Re: Solr tlog and soft commit

2015-03-16 Thread vidit.asthana
Can someone please reply to these questions? 

Thanks in advance.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-tlog-and-soft-commit-tp4193105p4193311.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Whole RAM consumed while Indexing.

2015-03-16 Thread Erick Erickson
First start by lengthening your soft and hard commit intervals
substantially. Start with 60000 (60 seconds) and work backwards I'd say.

Ramkumar has tuned the heck out of his installation to get the commit
intervals to be that short ;).

I'm betting that you'll see your RAM usage go way down, but that' s a
guess until you test.

Best,
Erick

On Sun, Mar 15, 2015 at 10:56 PM, Nitin Solanki nitinml...@gmail.com wrote:
 Hi Erick,
 You are right. The **overlapping searchers
 warning messages** are coming in the logs.
 The **numDocs numbers** are changing when documents are added at the time of
 indexing.
 Any help?

 On Sat, Mar 14, 2015 at 11:24 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 First, the soft commit interval is very short. Very, very, very, very
 short. 300ms is
 just short of insane unless it's a typo ;).

 Here's a long background:

 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

 But the short form is that you're opening searchers every 300 ms. The
 hard commit is better,
 but every 3 seconds is still far too short IMO. I'd start with soft
 commits of 60000 and hard
 commits of 60000 (60 seconds), meaning that you're going to have to
 wait 1 minute for
 docs to show up unless you explicitly commit.

 You're throwing away all the caches configured in solrconfig.xml more
 than 3 times a second,
 executing autowarming, etc, etc, etc

 Changing these to longer intervals might cure the problem, but if not
 then, as Hoss would
 say, details matter. I suspect you're also seeing overlapping
 searchers warning messages
 in your log, and it's _possible_ that what's happening is that you're
 just exceeding the
 max warming searchers and never opening a new searcher with the
 newly-indexed documents.
 But that's a total shot in the dark.

 How are you looking for docs (and not finding them)? Does the numDocs
 number in
 the solr admin screen change?


 Best,
 Erick

 On Thu, Mar 12, 2015 at 10:27 PM, Nitin Solanki nitinml...@gmail.com
 wrote:
  Hi Alexandre,
 
 
  *Hard Commit* is :
 
   <autoCommit>
     <maxTime>${solr.autoCommit.maxTime:3000}</maxTime>
     <openSearcher>false</openSearcher>
   </autoCommit>
 
  *Soft Commit* is :
 
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:300}</maxTime>
  </autoSoftCommit>
 
  And I am committing 2 documents each time.
  Is this a good config for committing?
  Or am I doing something wrong?
 
 
  On Fri, Mar 13, 2015 at 8:52 AM, Alexandre Rafalovitch 
 arafa...@gmail.com
  wrote:
 
  What's your commit strategy? Explicit commits? Soft commits/hard
  commits (in solrconfig.xml)?
 
  Regards,
 Alex.
  
  Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
  http://www.solr-start.com/
 
 
  On 12 March 2015 at 23:19, Nitin Solanki nitinml...@gmail.com wrote:
   Hello,
 I have written a python script to do 2 documents
 indexing
   each time on Solr. I have 28 GB RAM with 8 CPU.
   When I started indexing, at that time 15 GB RAM was freed. While
  indexing,
   all RAM is consumed but **not** a single document is indexed. Why so?
   And it throws *HTTPError: HTTP Error 503: Service Unavailable* in
 python
   script.
   I think it is due to heavy load on Zookeeper by which all nodes went
  down.
   I am not sure about that. Any help please..
   Or anything else is happening..
   And how to overcome this issue.
   Please assist me towards right path.
   Thanks..
  
   Warm Regards,
   Nitin Solanki
 



Relevancy : Keyword stuffing

2015-03-16 Thread Mihran Shahinian
Hi all,
I have a use case where the data is generated by SEO-minded authors and
more often than not
they perfectly guess the synonym expansions for the document titles, skewing
results in their favor.
At the moment I don't have an offline processing infrastructure to detect
these (I can't punish these docs either... just have to level the playing
field).
I am experimenting with taking the max of the term scores, cutting off
scores after a certain number of terms, etc., but would appreciate any hints if
anyone has experience dealing with a similar use case in Solr.

Much appreciated,
Mihran


Re: indexing db records via SolrJ

2015-03-16 Thread Hal Roberts
We import anywhere from five to fifty million small documents a day from 
a postgres database.  I wrestled to get the DIH stuff to work for us for 
about a year and was much happier when I ditched that approach and 
switched to writing the few hundred lines of relatively simple code to 
handle directly the logic of what gets updated and how it gets queried 
from postgres ourselves.


The DIH stuff is great for lots of cases, but if you are getting to the 
point of trying to hack its undocumented internals, I suspect you are 
better off spending a day or two of your time just writing all of the 
update logic yourself.


We found a relatively simple combination of postgres triggers, export to 
csv based on those triggers, and then just calling update/csv to work 
best for us.
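
A minimal sketch of that last step, posting an exported CSV file to the
/update/csv handler with SolrJ; the file path and core name are hypothetical
and SolrJ 5.x is assumed:

    import java.io.File;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class CsvUpdateExample {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/collection1");
            // Stream the CSV produced by the trigger/export step to /update/csv.
            ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/csv");
            req.addFile(new File("/tmp/export.csv"), "text/csv");
            req.setParam("commit", "true");
            solr.request(req);
            solr.close();
        }
    }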


-hal

On 3/16/15 9:59 AM, Shawn Heisey wrote:

On 3/16/2015 7:15 AM, sreedevi s wrote:

I had checked this post.I dont know whether this is possible but my query
is whether I can use the configuration for DIH for indexing via SolrJ


You can use SolrJ for accessing DIH.  I have code that does this, but
only for full index rebuilds.

It won't be particularly obvious how to do it.  Writing code that can
intepret DIH status and know when it finishes, succeeds, or fails is
very tricky because DIH only uses human-readable status info, not
machine-readable, and the info is not very consistent.

I can't just share my code, because it's extremely convoluted ... but
the general gist is to create a SolrQuery object, use setRequestHandler
to set the handler to /dataimport or whatever your DIH handler is, and
set the other parameters on the request like command to full-import
and so on.

Thanks,
Shawn



--
Hal Roberts
Fellow
Berkman Center for Internet  Society
Harvard University


Solr Deleted Docs Issue

2015-03-16 Thread vicky desai
Hi,

I am having an issue with my solr setup. In my solr config I have set
following property
*<mergeFactor>10</mergeFactor>*

Now consider the following situation. I have *200* documents in my index, and I
need to update all 200 docs.
If the total commit operations I hit are *20*, i.e. I update in batches of 10
docs, merging is done after every 10th update, so the max segment count I can
have is 10, which is fine. However, even when merging happens, deleted docs are
not cleared and I end up with 100 deleted docs in the index.

If this operation is done continuously, I would end up with a large set of
deleted docs, which will affect the performance of the queries I hit on this
Solr.

Can anyone please help me figure out if I have missed a config or if this is
expected behaviour?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Deleted-Docs-Issue-tp4193292.html
Sent from the Solr - User mailing list archive at Nabble.com.


thresholdTokenFrequency changes suggestion frequency..

2015-03-16 Thread Nitin Solanki
Hi,
  I am not getting why the suggestion frequency varies from the
original frequency.
Example - I have a word *who* whose original frequency is *100*, but
when I get a suggestion for it, the suggestion frequency changes to *50*.

I think it is happening because of *thresholdTokenFrequency*.
When I set the value of thresholdTokenFrequency to *0.1*, it gives one
frequency for the 'who' suggestion, while when I set the value of
thresholdTokenFrequency to *0.0001*, it gives a different
frequency. Why so? I am not getting the logic behind this.

As we know suggestion frequency is same as the index original frequency -

*The spellcheck.extendedResults=true parameter provides frequency of each
original term in the index (origFreq) as well as the frequency of each
suggestion in the index (frequency).*


Re: Solr Deleted Docs Issue

2015-03-16 Thread Shawn Heisey
On 3/16/2015 9:11 AM, vicky desai wrote:
 I am having an issue with my solr setup. In my solr config I have set
 following property
 *<mergeFactor>10</mergeFactor>*

The mergeFactor setting is deprecated ... but you are setting it to the
default value of 10 anyway, so that's not really a big deal.  It's
possible that mergeFactor will no longer work in 5.0, but I'm not sure
on that.  You should instead use the settings specific to the merge
policy, which normally is TieredMergePolicy.

Note that when mergeFactor is 10, you *will* end up with more than 10
segments in your index.  There are multiple merge tiers, each one can
have up to 10 segments before it is merged.

 Now consider following situation. I have* 200* documents in my index. I need
 to update all the 200 docs
 If total commit operations I hit are* 20* i.e I update batches of 10 docs
 merging is done after every 10th update and so the max Segment Count I can
 have is 10 which is fine. However even when merging happens deleted docs are
 not cleared and I end up with 100 deleted docs in index. 

 If this operation is continuously done I would end up with a large set of
 deleted docs which will affect the performance of the queries I hit on this
 solr.

Because there are multiple merge tiers and you cannot easily
pre-determine which segments will be chosen for a particular merge, the
merge behavior may not be exactly what you expect.

The only guaranteed way to get rid of your deleted docs is to do an
optimize operation, which forces a merge of the entire index down to a
single segment.  This gets rid of all deleted docs in those segments. 
If you index more data while you are doing the optimize, then you may
end up with additional deleted docs.
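
A minimal sketch of triggering that optimize from SolrJ, assuming SolrJ 5.x and
a hypothetical core at http://localhost:8983/solr/collection1:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class OptimizeExample {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/collection1");
            // optimize(waitFlush, waitSearcher, maxSegments):
            // force-merge the whole index down to one segment, purging deleted docs.
            solr.optimize(true, true, 1);
            solr.close();
        }
    }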

Thanks,
Shawn



RE: Relevancy : Keyword stuffing

2015-03-16 Thread Markus Jelsma
Hello - setting (e)dismax's tie breaker to 0 or much lower than the default would 
`solve` this for now.
Markus 
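
A minimal sketch of what that looks like from SolrJ; the query string, field
list, and boosts below are purely hypothetical:

    import org.apache.solr.client.solrj.SolrQuery;

    public class TieBreakerExample {
        public static void main(String[] args) {
            SolrQuery q = new SolrQuery("running shoes");
            q.set("defType", "edismax");
            q.set("qf", "title^2 summary");  // hypothetical fields
            // tie=0.0 scores each term by its single best field only, instead of
            // also adding contributions from the other matching fields.
            q.set("tie", "0.0");
            System.out.println(q);
        }
    }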
 
-Original message-
 From:Mihran Shahinian slowmih...@gmail.com
 Sent: Monday 16th March 2015 16:29
 To: solr-user@lucene.apache.org
 Subject: Relevancy : Keyword stuffing
 
 Hi all,
 I have a use case where the data is generated by SEO minded authors and
 more often than not
 they perfectly guess the synonym expansions for the document titles skewing
 results in their favor.
 At the moment I don't have an offline processing infrastructure to detect
 these (I can't punish these docs either... just have to level the playing
 field).
 I am experimenting with taking the max of the term scores, cutting off
 scores after certain number of terms,etc but would appreciate any hints if
 anyone has experience dealing with a similar use case in solr.
 
 Much appreciated,
 Mihran
 


maxQueryFrequency v/s thresholdTokenFrequency

2015-03-16 Thread Nitin Solanki
Hello Everyone,
 Can anybody please explain to me what the
difference is between maxQueryFrequency and thresholdTokenFrequency?
I got the link -
http://wiki.apache.org/solr/SpellCheckComponent#thresholdTokenFrequency but am
unable to understand it.
I am very much confused about both of them.
Your help is appreciated.


Warm Regards,
Nitin


RE: Relevancy : Keyword stuffing

2015-03-16 Thread Markus Jelsma
Hello - Chris' suggestion is indeed a good one but it can be tricky to properly 
configure the parameters. Regarding position information, you can override 
dismax to have it use SpanFirstQuery. It allows for setting strict boundaries 
from the front of the document to a given position. You can also override 
SpanFirstQuery to incorporate a gradient, to decrease boosting as distance from 
the front increases.
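
For reference, a bare Lucene sketch of the SpanFirstQuery mechanism mentioned
above; the field name and position limit are made up, and wiring it into a
dismax override is not shown:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanFirstQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class SpanFirstExample {
        public static void main(String[] args) {
            SpanTermQuery term = new SpanTermQuery(new Term("body", "solr"));
            // Matches "solr" only when it occurs within the first 50 positions
            // of the "body" field, i.e. near the front of the document.
            SpanFirstQuery nearFront = new SpanFirstQuery(term, 50);
            System.out.println(nearFront);
        }
    }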

I don't know how you ingest document bodies, but if they are unstructured HTML, 
you may want to install proper main content extraction if you haven't already. 
Having decent control over HTML is a powerful tool.

You may also want to look at Lucene's BM25 implementation. It is simple to set 
up and easier to control. It isn't as rough a tool as TF-IDF with regard to 
length normalization. Plus it allows you to smooth TF, which in your case 
should also help.

If you like to scrutinize SSS and get some proper results, you are more than 
welcome to share them here :)

Markus
 
-Original message-
 From:Mihran Shahinian slowmih...@gmail.com
 Sent: Monday 16th March 2015 22:41
 To: solr-user@lucene.apache.org
 Subject: Re: Relevancy : Keyword stuffing
 
 Thank you Markus and Chris, for pointers.
 For SweetSpotSimilarity I am thinking perhaps a set of closed ranges
 exposed via similarity config is easier to maintain as data changes than
 making adjustments to fit a
 function. Another piece of info would've been handy is to know the average
 position info + position info for the first few occurrences for each term.
 This would allow
 perhaps higher boosting for term occurrences earlier in the doc. In my case
 extra keywords are towards the end of the doc,but that info does not seem
 to be propagated into scorer.
 Thanks again,
 Mihran
 
 
 
 On Mon, Mar 16, 2015 at 1:52 PM, Chris Hostetter hossman_luc...@fucit.org
 wrote:
 
 
  You should start by checking out the SweetSpotSimilarity .. it was
  heavily designed around the idea of dealing with things like excessively
  verbose titles, and keyword stuffing in summary text ... so you can
  configure your expectation for what a normal length doc is, and they
  will be penalized for being longer than that.  similarly you can say what
  a 'reasonable' tf is, and docs that exceed that wouldn't get added boost
  (which in conjunction with the lengthNorm penalty penalizes docs that
  stuff keywords)
 
 
  https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html
 
 
  https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
 
  https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
 
 
  -Hoss
  http://www.lucidworks.com/
 
 


discrepancy between LuceneQParser and ExtendedDismaxQParser

2015-03-16 Thread Arsen
Hello,

Found a discrepancy between LuceneQParser and ExtendedDismaxQParser when 
executing the following query:
((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451

When executing it through the Solr Admin panel and placing the query in the q 
field, I am getting the following debug output for LuceneQParser:
--
debug: {
rawquerystring: ((*:* AND -area) OR area:[100 TO 300]) AND 
objectId:40105451,
querystring: ((*:* AND -area) OR area:[100 TO 300]) AND 
objectId:40105451,
parsedquery: +((+MatchAllDocsQuery(*:*) -text:area) area:[100 TO 300]) 
+objectId:40105451,
parsedquery_toString: +((+*:* -text:area) area:[100 TO 300]) +objectId: 
\u0001\u\u\u\u\u\u0013\u000fkk,
explain: {
  40105451: \n14.3511 = (MATCH) sum of:\n  0.034590416 = (MATCH) product 
of:\n0.06918083 = (MATCH) sum of:\n  0.06918083 = (MATCH) sum of:\n 
   0.06918083 = (MATCH) MatchAllDocsQuery, product of:\n  0.06918083 = 
queryNorm\n0.5 = coord(1/2)\n  14.316509 = (MATCH) weight(objectId: 
\u0001\u\u\u\u\u\u0013\u000fkk in 1109978) 
[DefaultSimilarity], result of:\n14.316509 = score(doc=1109978,freq=1.0), 
product of:\n  0.9952025 = queryWeight, product of:\n14.385524 = 
idf(docFreq=1, maxDocs=1301035)\n0.06918083 = queryNorm\n  
14.385524 = fieldWeight in 1109978, product of:\n1.0 = tf(freq=1.0), 
with freq of:\n  1.0 = termFreq=1.0\n14.385524 = idf(docFreq=1, 
maxDocs=1301035)\n1.0 = fieldNorm(doc=1109978)\n
},
--
So, one object is found, which is expected.

For ExtendedDismaxQParser (the only difference is that the edismax checkbox is checked) I am 
seeing this output
--
debug: {
rawquerystring: ((*:* AND -area) OR area:[100 TO 300]) AND 
objectId:40105451,
querystring: ((*:* AND -area) OR area:[100 TO 300]) AND 
objectId:40105451,
parsedquery: (+(+((+DisjunctionMaxQuery((text:*\\:*)) 
-DisjunctionMaxQuery((text:area))) area:[100 TO 300]) 
+objectId:40105451))/no_coord,
parsedquery_toString: +(+((+(text:*\\:*) -(text:area)) area:[100 TO 
300]) +objectId: \u0001\u\u\u\u\u\u0013\u000fkk),
explain: {},
--
oops, no objects found!

I hastened to file https://issues.apache.org/jira/browse/SOLR-7249 (sorry, my 
bad).
You may refer to it for additional info (not going to duplicate it here).

Thanks

-- 
Best regards,
 Arsen  mailto:barracuda...@mail.ru



Re: Nginx proxy for Solritas

2015-03-16 Thread Shawn Heisey
On 3/16/2015 2:42 PM, LongY wrote:
 Thank you for the reply.

 I also thought the relevant resources (CSS, images, JavaScript) need to 
 be accessible for Nginx. 

 I copied the velocity folder to solr-webapp/webapp folder. It didn't work.

 So how to allow /browse resource accessible by the Nginx rule?

The /browse handler causes your browser to make requests directly to
Solr on handlers other than /browse.  You must figure out what those
requests are and allow them in the proxy configuration.  I do not know
whether they are relative URLs ... I would not be terribly surprised to
learn that they have port 8983 in them rather than the port 80 on your
proxy.  Hopefully that's not the case, or you'll really have problems
making it work on port 80.

I've never spent any real time with the /browse handler.  Requiring
direct access to Solr is completely unacceptable for us.

Thanks,
Shawn



Re: Nginx proxy for Solritas

2015-03-16 Thread LongY
Thank you for the reply.

I also thought the relevant resources (CSS, images, JavaScript) need to 
be accessible through Nginx. 

I copied the velocity folder to the solr-webapp/webapp folder. It didn't work.

So how do I make the /browse resources accessible through the Nginx rule?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193347p4193352.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: discrepancy between LuceneQParser and ExtendedDismaxQParser

2015-03-16 Thread Jack Krupansky
There was a Solr release with a bug that required that you put a space
between the left parenthesis and the *:*. The edismax parsed query here
indicates that the *:* has not parsed properly.

You have area, but in your jira you had a range query.

-- Jack Krupansky

On Mon, Mar 16, 2015 at 6:42 PM, Arsen barracuda...@mail.ru wrote:

 Hello,

 Found discrepancy between LuceneQParser and ExtendedDismaxQParser when
 executing following query:
 ((*:* AND -area) OR area:[100 TO 300]) AND objectId:40105451

 When executing it through Solr Admin panel and placing query in q field
 I having following debug output for LuceneQParser
 --
 debug: {
 rawquerystring: ((*:* AND -area) OR area:[100 TO 300]) AND
 objectId:40105451,
 querystring: ((*:* AND -area) OR area:[100 TO 300]) AND
 objectId:40105451,
 parsedquery: +((+MatchAllDocsQuery(*:*) -text:area) area:[100 TO
 300]) +objectId:40105451,
 parsedquery_toString: +((+*:* -text:area) area:[100 TO 300])
 +objectId: \u0001\u\u\u\u\u\u0013\u000fkk,
 explain: {
   40105451: \n14.3511 = (MATCH) sum of:\n  0.034590416 = (MATCH)
 product of:\n0.06918083 = (MATCH) sum of:\n  0.06918083 = (MATCH)
 sum of:\n0.06918083 = (MATCH) MatchAllDocsQuery, product of:\n
 0.06918083 = queryNorm\n0.5 = coord(1/2)\n  14.316509 = (MATCH)
 weight(objectId: \u0001\u\u\u\u\u\u0013\u000fkk in
 1109978) [DefaultSimilarity], result of:\n14.316509 =
 score(doc=1109978,freq=1.0), product of:\n  0.9952025 = queryWeight,
 product of:\n14.385524 = idf(docFreq=1, maxDocs=1301035)\n
 0.06918083 = queryNorm\n  14.385524 = fieldWeight in 1109978, product
 of:\n1.0 = tf(freq=1.0), with freq of:\n  1.0 =
 termFreq=1.0\n14.385524 = idf(docFreq=1, maxDocs=1301035)\n
 1.0 = fieldNorm(doc=1109978)\n
 },
 --
 So, one object found which is expectable

 For ExtendedDismaxQParser (only difference is checkbox edismax checked)
 I am seeing this output
 --
 debug: {
 rawquerystring: ((*:* AND -area) OR area:[100 TO 300]) AND
 objectId:40105451,
 querystring: ((*:* AND -area) OR area:[100 TO 300]) AND
 objectId:40105451,
 parsedquery: (+(+((+DisjunctionMaxQuery((text:*\\:*))
 -DisjunctionMaxQuery((text:area))) area:[100 TO 300])
 +objectId:40105451))/no_coord,
 parsedquery_toString: +(+((+(text:*\\:*) -(text:area)) area:[100 TO
 300]) +objectId: \u0001\u\u\u\u\u\u0013\u000fkk),
 explain: {},
 --
 oops, no objects found!

 I hastened to fill https://issues.apache.org/jira/browse/SOLR-7249
 (sorry, my bad)
 You may refer to it for additional info (not going to duplicate it here)

 Thanks

 --
 Best regards,
  Arsen  mailto:barracuda...@mail.ru




Re: Whole RAM consumed while Indexing.

2015-03-16 Thread Ramkumar R. Aiyengar
Yes, and doing so is painful and takes lots of people and hardware
resources to get there for large amounts of data and queries :)

As Erick says, work backwards from 60s and first establish how high the
commit interval can be to satisfy your use case..
On 16 Mar 2015 16:04, Erick Erickson erickerick...@gmail.com wrote:

 First start by lengthening your soft and hard commit intervals
 substantially. Start with 60000 (60 seconds) and work backwards I'd say.

 Ramkumar has tuned the heck out of his installation to get the commit
 intervals to be that short ;).

 I'm betting that you'll see your RAM usage go way down, but that' s a
 guess until you test.

 Best,
 Erick

 On Sun, Mar 15, 2015 at 10:56 PM, Nitin Solanki nitinml...@gmail.com
 wrote:
  Hi Erick,
  You are right. The **overlapping searchers
  warning messages** are coming in the logs.
  The **numDocs numbers** are changing when documents are added at the time of
  indexing.
  Any help?
 
  On Sat, Mar 14, 2015 at 11:24 PM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
  First, the soft commit interval is very short. Very, very, very, very
  short. 300ms is
  just short of insane unless it's a typo ;).
 
  Here's a long background:
 
 
 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
 
  But the short form is that you're opening searchers every 300 ms. The
  hard commit is better,
  but every 3 seconds is still far too short IMO. I'd start with soft
  commits of 60000 and hard
  commits of 60000 (60 seconds), meaning that you're going to have to
  wait 1 minute for
  docs to show up unless you explicitly commit.
 
  You're throwing away all the caches configured in solrconfig.xml more
  than 3 times a second,
  executing autowarming, etc, etc, etc
 
  Changing these to longer intervals might cure the problem, but if not
  then, as Hoss would
  say, details matter. I suspect you're also seeing overlapping
  searchers warning messages
  in your log, and it's _possible_ that what's happening is that you're
  just exceeding the
  max warming searchers and never opening a new searcher with the
  newly-indexed documents.
  But that's a total shot in the dark.
 
  How are you looking for docs (and not finding them)? Does the numDocs
  number in
  the solr admin screen change?
 
 
  Best,
  Erick
 
  On Thu, Mar 12, 2015 at 10:27 PM, Nitin Solanki nitinml...@gmail.com
  wrote:
   Hi Alexandre,
  
  
   *Hard Commit* is :
  
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:3000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
  
   *Soft Commit* is :
  
   <autoSoftCommit>
     <maxTime>${solr.autoSoftCommit.maxTime:300}</maxTime>
   </autoSoftCommit>
  
   And I am committing 2 documents each time.
   Is this a good config for committing?
   Or am I doing something wrong?
  
  
   On Fri, Mar 13, 2015 at 8:52 AM, Alexandre Rafalovitch 
  arafa...@gmail.com
   wrote:
  
   What's your commit strategy? Explicit commits? Soft commits/hard
   commits (in solrconfig.xml)?
  
   Regards,
  Alex.
   
   Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
   http://www.solr-start.com/
  
  
   On 12 March 2015 at 23:19, Nitin Solanki nitinml...@gmail.com
 wrote:
Hello,
  I have written a python script to do 2 documents
  indexing
each time on Solr. I have 28 GB RAM with 8 CPU.
When I started indexing, at that time 15 GB RAM was freed. While
   indexing,
all RAM is consumed but **not** a single document is indexed. Why
 so?
And it throws *HTTPError: HTTP Error 503: Service Unavailable* in
  python
script.
I think it is due to heavy load on Zookeeper by which all nodes
 went
   down.
I am not sure about that. Any help please..
Or anything else is happening..
And how to overcome this issue.
Please assist me towards right path.
Thanks..
   
Warm Regards,
Nitin Solanki
  
 



Re: indexing db records via SolrJ

2015-03-16 Thread mike st. john
Take a look at some of the integrations people are using with Apache Storm;
we do something similar on a larger scale, having created a pgsql spout
and a Solr indexing bolt.


-msj

On Mon, Mar 16, 2015 at 11:08 AM, Hal Roberts 
hrobe...@cyber.law.harvard.edu wrote:

 We import anywhere from five to fifty million small documents a day from a
 postgres database.  I wrestled to get the DIH stuff to work for us for
 about a year and was much happier when I ditched that approach and switched
 to writing the few hundred lines of relatively simple code to handle
 directly the logic of what gets updated and how it gets queried from
 postgres ourselves.

 The DIH stuff is great for lots of cases, but if you are getting to the
 point of trying to hack its undocumented internals, I suspect you are
 better off spending a day or two of your time just writing all of the
 update logic yourself.

 We found a relatively simple combination of postgres triggers, export to
 csv based on those triggers, and then just calling update/csv to work best
 for us.

 -hal


 On 3/16/15 9:59 AM, Shawn Heisey wrote:

 On 3/16/2015 7:15 AM, sreedevi s wrote:

 I had checked this post.I dont know whether this is possible but my query
 is whether I can use the configuration for DIH for indexing via SolrJ


 You can use SolrJ for accessing DIH.  I have code that does this, but
 only for full index rebuilds.

 It won't be particularly obvious how to do it.  Writing code that can
 intepret DIH status and know when it finishes, succeeds, or fails is
 very tricky because DIH only uses human-readable status info, not
 machine-readable, and the info is not very consistent.

 I can't just share my code, because it's extremely convoluted ... but
 the general gist is to create a SolrQuery object, use setRequestHandler
 to set the handler to /dataimport or whatever your DIH handler is, and
 set the other parameters on the request like command to full-import
 and so on.

 Thanks,
 Shawn


 --
 Hal Roberts
 Fellow
 Berkman Center for Internet  Society
 Harvard University



Re: indexing db records via SolrJ

2015-03-16 Thread Jean-Sebastien Vachon
Do you have any references to such integrations (Solr + Storm)?

Thanks


From: mike st. john mstj...@gmail.com
Sent: Monday, March 16, 2015 2:39 PM
To: solr-user@lucene.apache.org
Subject: Re: indexing db records via SolrJ

Take a look at some of the integrations people are using with apache storm,
  we do something similar on a larger scale , having created a pgsql spout
and having a solr indexing bolt.


-msj

On Mon, Mar 16, 2015 at 11:08 AM, Hal Roberts 
hrobe...@cyber.law.harvard.edu wrote:

 We import anywhere from five to fifty million small documents a day from a
 postgres database.  I wrestled to get the DIH stuff to work for us for
 about a year and was much happier when I ditched that approach and switched
 to writing the few hundred lines of relatively simple code to handle
 directly the logic of what gets updated and how it gets queried from
 postgres ourselves.

 The DIH stuff is great for lots of cases, but if you are getting to the
 point of trying to hack its undocumented internals, I suspect you are
 better off spending a day or two of your time just writing all of the
 update logic yourself.

 We found a relatively simple combination of postgres triggers, export to
 csv based on those triggers, and then just calling update/csv to work best
 for us.

 -hal


 On 3/16/15 9:59 AM, Shawn Heisey wrote:

 On 3/16/2015 7:15 AM, sreedevi s wrote:

 I had checked this post.I dont know whether this is possible but my query
 is whether I can use the configuration for DIH for indexing via SolrJ


 You can use SolrJ for accessing DIH.  I have code that does this, but
 only for full index rebuilds.

 It won't be particularly obvious how to do it.  Writing code that can
 intepret DIH status and know when it finishes, succeeds, or fails is
 very tricky because DIH only uses human-readable status info, not
 machine-readable, and the info is not very consistent.

 I can't just share my code, because it's extremely convoluted ... but
 the general gist is to create a SolrQuery object, use setRequestHandler
 to set the handler to /dataimport or whatever your DIH handler is, and
 set the other parameters on the request like command to full-import
 and so on.

 Thanks,
 Shawn


 --
 Hal Roberts
 Fellow
 Berkman Center for Internet  Society
 Harvard University



Re: Data Import Handler - reading GET

2015-03-16 Thread Alexandre Rafalovitch
Have you tried? As ${dih.request.foo}?

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 16 March 2015 at 14:51, Kiran J kiranjuni...@gmail.com wrote:
 Hi,

 In data import handler, I can read the clean query parameter using
 ${dih.request.clean} and pass it on to the queries. Is it possible to read
 any query parameter from the URL ? for eg ${foo} ?

 Thanks


Data Import Handler - reading GET

2015-03-16 Thread Kiran J
Hi,

In data import handler, I can read the clean query parameter using
${dih.request.clean} and pass it on to the queries. Is it possible to read
any query parameter from the URL ? for eg ${foo} ?

Thanks


Re: Relevancy : Keyword stuffing

2015-03-16 Thread Chris Hostetter

You should start by checking out the SweetSpotSimilarity .. it was 
heavily designed around the idea of dealing with things like excessively 
verbose titles, and keyword stuffing in summary text ... so you can 
configure your expectation for what a normal length doc is, and they 
will be penalized for being longer than that.  similarly you can say what 
a 'reasonable' tf is, and docs that exceed that wouldn't get added boost 
(which in conjunction with the lengthNorm penalty penalizes docs that 
stuff keywords)

https://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/search/similarities/SweetSpotSimilarityFactory.html

https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.computeLengthNorm.svg
https://lucene.apache.org/core/5_0_0/misc/org/apache/lucene/misc/doc-files/ss.hyperbolicTf.svg
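
As a rough sketch of the knobs behind that factory, here is the Lucene misc
API used directly (assumed Lucene 4.x/5.x); the concrete numbers are made-up
placeholders, not recommended values:

    import org.apache.lucene.misc.SweetSpotSimilarity;

    public class SweetSpotSketch {
        public static void main(String[] args) {
            SweetSpotSimilarity sim = new SweetSpotSimilarity();
            // Docs between 1 and 50 terms get the full length norm;
            // anything longer than the "sweet spot" is increasingly penalized.
            sim.setLengthNormFactors(1, 50, 0.5f, true);
            // tf flattens out hyperbolically, so repeating a term far beyond a
            // "reasonable" tf adds almost no extra boost.
            sim.setHyperbolicTfFactors(1.0f, 2.0f, Math.E, 10.0f);
            // The configured instance would then be plugged in via the
            // similarity factory in schema.xml or set on an IndexSearcher.
            System.out.println(sim);
        }
    }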


-Hoss
http://www.lucidworks.com/


Re: [Poll]: User need for Solr security

2015-03-16 Thread Jan Høydahl
Hi,

We tend to recommend ManifoldCF for document level security since that is 
exactly what it is built for. So I doubt we'll see that as a built in feature 
in Solr.
However, the Solr integration is really not that advanced, and I also see 
customers implementing similar logic themselves with success.
On the document feeding side you need to add a few more fields to all your 
documents, typically include_acl and exclude_acl. Populate those fields
with data from LDAP about who (what groups) have access to that document and 
who not. If it is open information, index a special token "open" in the include 
field.
Then, assuming your search client application has authenticated a user, you 
would construct a filter with this user's groups, e.g. 
  fq=include_acl:(groupA OR open)&fq=-exclude_acl:(groupA)
The filter would be constructed either in your application or in a Solr search 
component or query parser.
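
A minimal sketch of building that filter on the client side with SolrJ,
assuming the include_acl/exclude_acl field names above and a group list
already resolved from LDAP:

    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;

    public class AclFilterExample {
        static SolrQuery secured(String userQuery, List<String> groups) {
            String joined = String.join(" OR ", groups);  // e.g. "groupA OR groupB"
            SolrQuery q = new SolrQuery(userQuery);
            // Only docs the user's groups (or everyone) may see...
            q.addFilterQuery("include_acl:(" + joined + " OR open)");
            // ...and never docs any of those groups are explicitly denied.
            q.addFilterQuery("-exclude_acl:(" + joined + ")");
            return q;
        }
    }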

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

 13. mar. 2015 kl. 01.48 skrev johnmu...@aol.com:
 
 I would love to see record level (or even field level) restricted access in 
 Solr / Lucene.
 
 This should be group level, LDAP like or some rule base (which can be 
 dynamic).  If the solution means having a second core, so be it.
 
 The following is the closest I found: 
 https://wiki.apache.org/solr/SolrSecurity#Document_Level_Security but I 
 cannot use Manifold CF (Connector Framework).  Does anyone know how Manifold 
 does it?
 
 - MJ
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
 Sent: Thursday, March 12, 2015 6:51 PM
 To: solr-user@lucene.apache.org
 Subject: RE: [Poll]: User need for Solr security
 
 Jan - we don't really need any security for our products, nor for most 
 clients. However, one client does deal with very sensitive data so we 
 proposed to encrypt the transfer of data and the data on disk through a 
 Lucene Directory. It won't fill all gaps but it would adhere to such a 
 client's guidelines. 
 
 I think many approaches of security in Solr/Lucene would find advocates, be 
 it index encryption or authentication/authorization or transport security, 
 which is now possible. I understand the reluctance of the PMC, and i agree 
 with it, but some users would definitely benefit and it would certainly 
 make Solr/Lucene the search platform to use for some enterprises.
 
 Markus 
 
 -Original message-
 From:Henrique O. Santos hensan...@gmail.com
 Sent: Thursday 12th March 2015 23:43
 To: solr-user@lucene.apache.org
 Subject: Re: [Poll]: User need for Solr security
 
 Hi,
 
 I’m currently working with indexes that need document level security. Based 
 on the user logged in, query results would omit documents that this user 
 doesn’t have access to, with LDAP integration and such.
 
 I think that would be nice to have on a future Solr release.
 
 Henrique.
 
 On Mar 12, 2015, at 7:32 AM, Jan Høydahl jan@cominvent.com wrote:
 
 Hi,
 
 Securing various Solr APIs has once again surfaced as a discussion 
 in the developer list. See e.g. SOLR-7236. Would be useful to get some 
 feedback from Solr users about needs in the field.
 
 Please reply to this email and let us know what security aspect(s) would be 
 most important for your company to see supported in a future version of 
 Solr.
 Examples: Local user management, AD/LDAP integration, SSL, 
 authenticated login to Admin UI, authorization for Admin APIs, e.g. 
 admin user vs read-only user etc
 
 --
 Jan Høydahl, search solution architect Cominvent AS - 
 www.cominvent.com
 
 
 
 



Re: Nginx proxy for Solritas

2015-03-16 Thread Erik Hatcher
The links to the screenshots aren’t working for me.  I’m not sure what the 
issue is - but do be aware that /browse with its out of the box templates does 
refer to resources (CSS, images, JavaScript) that aren’t under /browse, so 
you’ll need to allow those to be accessible as well with different rules.


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com http://www.lucidworks.com/




 On Mar 16, 2015, at 3:39 PM, LongY zhangyulin8...@hotmail.com wrote:
 
 Dear Community Members,
 
 I have searched over the forum and googled a lot, still didn't find the
 solution. Finally got me here for help.
 
 I am implementing a Nginx reverse proxy for Solritas
 (VelocityResponseWriter) of the example included in Solr.
 . Nginx listens on port 80, and solr runs on port 8983. This is my Nginx
 configuration file (It only permits localhost
 to access the browse request handler).
 
 *location ~* /solr/\w+/browse {
   proxy_pass  http://localhost:8983;
 
allow   127.0.0.1;
deny    all;
 
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
 
}*
 
 when I input http://localhost/solr/collection1/browse in the browser address
 bar. 
 The output I got is this. 
 http://lucene.472066.n3.nabble.com/file/n4193346/left.png 
 The supposed output should be like this 
 http://lucene.472066.n3.nabble.com/file/n4193346/right.png 
 
 I tested the Admin page with this Nginx configuration file with some minor
 modifications, it worked well,
 but when used in velocity templates, it did not render the output properly.
 
 Any input is welcome.
 Thank you.
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193346.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Nginx proxy for Solritas

2015-03-16 Thread LongY
Dear Community Members,

I have searched over the forum and googled a lot, but still didn't find the
solution, which finally got me here for help.
 
I am implementing an Nginx reverse proxy for Solritas
(VelocityResponseWriter) for the example included in Solr.
Nginx listens on port 80, and Solr runs on port 8983. This is my Nginx
configuration file (it only permits localhost
to access the browse request handler).

*location ~* /solr/\w+/browse {
   proxy_pass  http://localhost:8983;

allow   127.0.0.1;
deny    all;

proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
  
}*

when I input http://localhost/solr/collection1/browse in the browser address
bar. 
 The output I got is this. 
http://lucene.472066.n3.nabble.com/file/n4193346/left.png 
The supposed output should be like this 
http://lucene.472066.n3.nabble.com/file/n4193346/right.png 

I tested the Admin page with this Nginx configuration file with some minor
modifications, and it worked well,
but when used with the velocity templates, it did not render the output properly.
 
Any input is welcome.
Thank you.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nginx-proxy-for-Solritas-tp4193346.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr 4.7.2 mergeFactor/ Merge policy issue

2015-03-16 Thread Dmitry Kan
Hi,

I can confirm similar behaviour, but for solr 4.3.1. We use default values
for merge related settings. Even though mergeFactor=10
by default, there are 13 segments in one core and 30 segments in another. I
am not sure it proves there is a bug in the merging, because it depends on
the TieredMergePolicy. Relevant discussion from the past:
http://lucene.472066.n3.nabble.com/TieredMergePolicy-reclaimDeletesWeight-td4071487.html
Apart from other policy parameters you could play with ReclaimDeletesWeight,
in case you'd like to affect merging of the segments with deletes in them.
See
http://stackoverflow.com/questions/18361300/informations-about-tieredmergepolicy


Regarding your attachment: I believe it got cut by the mailing list system,
could you share it via a file sharing system?

On Sat, Mar 14, 2015 at 7:36 AM, Summer Shire shiresum...@gmail.com wrote:

 Hi All,

 Did anyone get a chance to look at my config and the InfoStream File ?

 I am very curious to see what you think

 thanks,
 Summer

  On Mar 6, 2015, at 5:20 PM, Summer Shire shiresum...@gmail.com wrote:
 
  Hi All,
 
  Here’s more update on where I am at with this.
  I enabled infoStream logging and quickly figured that I need to get rid
 of maxBufferedDocs. So Erick you
  were absolutely right on that.
  I increased my ramBufferSize to 100MB
  and reduced maxMergeAtOnce to 3 and segmentsPerTier to 3 as well.
  My config looks like this
 
  <indexConfig>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>100</ramBufferSizeMB>

    <!-- <maxMergeSizeForForcedMerge>9223372036854775807</maxMergeSizeForForcedMerge> -->
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">3</int>
      <int name="segmentsPerTier">3</int>
    </mergePolicy>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
    <infoStream file="/tmp/INFOSTREAM.txt">true</infoStream>
  </indexConfig>
 
  I am attaching a sample infostream log file.
  In the infoStream logs, though, you can see how the segments keep on adding up,
  and it shows (just an example):
  allowedSegmentCount=10 vs count=9 (eligible count=9) tooBigCount=0
 
  I looked at TieredMergePolicy.java to see how allowedSegmentCount is
 getting calculated
  // Compute max allowed segs in the index
 long levelSize = minSegmentBytes;
 long bytesLeft = totIndexBytes;
 double allowedSegCount = 0;
 while(true) {
   final double segCountLevel = bytesLeft / (double) levelSize;
    if (segCountLevel < segsPerTier) {
 allowedSegCount += Math.ceil(segCountLevel);
 break;
   }
   allowedSegCount += segsPerTier;
   bytesLeft -= segsPerTier * levelSize;
   levelSize *= maxMergeAtOnce;
 }
 int allowedSegCountInt = (int) allowedSegCount;
  and the minSegmentBytes is calculated as follows
   // Compute total index bytes & print details about the index
 long totIndexBytes = 0;
 long minSegmentBytes = Long.MAX_VALUE;
 for(SegmentInfoPerCommit info : infosSorted) {
   final long segBytes = size(info);
   if (verbose()) {
  String extra = merging.contains(info) ? " [merging]" : "";
  if (segBytes >= maxMergedSegmentBytes/2.0) {
    extra += " [skip: too large]";
  } else if (segBytes < floorSegmentBytes) {
    extra += " [floored]";
  }
  message("  seg=" + writer.get().segString(info) + " size=" +
  String.format(Locale.ROOT, "%.3f", segBytes/1024/1024.) + " MB" + extra);
   }
 
   minSegmentBytes = Math.min(segBytes, minSegmentBytes);
   // Accum total byte size
   totIndexBytes += segBytes;
 }
 
 
  any input is welcome.
 
  myinfoLog.rtf
 
 
  thanks,
  Summer
 
 
  On Mar 5, 2015, at 8:11 AM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  I would, BTW, either just get rid of the maxBufferedDocs altogether
 or
  make it much higher, i.e. 10. I don't think this is really your
  problem, but you're creating a lot of segments here.
 
  But I'm kind of at a loss as to what would be different about your
 setup.
  Is there _any_ chance that you have some secondary process looking at
  your index that's maintaining open searchers? Any custom code that's
  perhaps failing to close searchers? Is this a Unix or Windows system?
 
  And just to be really clear, you're _only_ seeing more segments being
  added, right? If you're only counting files in the index directory, it's
  _possible_ that merging is happening, you're just seeing new files take
  the place of old ones.
 
  Best,
  Erick
 
  On Wed, Mar 4, 2015 at 7:12 PM, Shawn Heisey apa...@elyograg.org
 wrote:
  On 3/4/2015 4:12 PM, Erick Erickson wrote:
  I _think_, but don't know for sure, that the merging stuff doesn't get
  triggered until you commit, it doesn't just happen.
 
  Shot in the dark...
 
  I believe that new segments are created when the indexing buffer
  (ramBufferSizeMB) fills up, even without commits.  I'm pretty sure that
  anytime a new segment is created, the merge policy is