Re: Re: How to properly use Levenstein distance with ~ in Java

2014-10-23 Thread karsten-solr
Hi Aleksander,
 
The fuzzy search operator '~' is not supported by the dismax query parser (defType=dismax):
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
 
You are using the spellcheck SearchComponent; it does not change the query
results.
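
For comparison, the standard lucene query parser does parse the operator. A
minimal SolrJ sketch (untested; the field name and the existing solrServer are
placeholders):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.response.QueryResponse;

  SolrQuery query = new SolrQuery("name:levenstein~2"); // fuzzy, max edit distance 2
  query.set("defType", "lucene");                       // standard parser, unlike dismax
  QueryResponse rsp = solrServer.query(query);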
 

By the way: it looks like you are using the path /select with qt=dismax. This would
normally throw an exception.
Is there a tag
  <requestHandler name="/dismax" ...>
inside your solrconfig.xml?
 
Best regards
 
  Karsten
 
P.S. in Context: 
http://lucene.472066.n3.nabble.com/How-to-properly-use-Levenstein-distance-with-in-Java-td4164793.html
 

 On 20 October 2014 11:13, Aleksander Sadecki wrote:

 Ok, thank you for your response. But why I cannot use '~'?


Re: unstable results on refresh

2014-10-23 Thread Giovanni Bricconi
My user interface shows some boxes that describe result categories. After
half a day of small updates and deletes I noticed, across various queries, that
the boxes started swapping while browsing.
Admittedly I relied too much on getting the same results on each call; for now
I'm keeping the category order in the request parameters to avoid the blinking
effect while browsing.

The optimize process is really slow, so I can't use it. Since I already have many
other parameters that must be carried along with the request to keep the
navigation consistent, I would like to understand whether there is a
setup that can limit the idf drift and keep it low enough.

I tried with

<indexConfig>
  <mergeFactor>5</mergeFactor>
</indexConfig>

in solrconfig.xml, but this morning /solr/admin/cores?action=STATUS still
reports more than ten segments for all cores of the shard. (I'm
sure I reloaded each core after changing the value.)

Now I'm trying expungeDeletes called from SolrJ, but I still don't see
the segment count decrease:

UpdateRequest commitRequest = new UpdateRequest();
// setAction(action, waitFlush, waitSearcher, maxSegments, softCommit, expungeDeletes)
commitRequest.setAction(ACTION.COMMIT, true, true, 10, false, true);
commitRequest.process(solrServer);



2014-10-22 15:48 GMT+02:00 Erick Erickson erickerick...@gmail.com:

 I would rather ask whether such small differences matter enough to
 do this. Is this something users will _ever_ notice? Optimization
 is quite a heavyweight operation, and is generally not recommended
 on indexes that change often, and 5 minutes is certainly below
 the recommendation for optimizing.

 There is/has been work done on distributed IDF that should address this
 (I think), but I don't know its current status.

 But other than in a test setup, is it worth the effort?

 Best,
 Erick

 On Wed, Oct 22, 2014 at 3:54 AM, Giovanni Bricconi
 giovanni.bricc...@banzai.it wrote:
  I have made some small patch to the application to make this problem less
  visible, and I'm trying to perform the optimize once per hour, yesterday
 it
  took 5 minutes to perform it, this morning 15 minutes. Today I will
 collect
  some statistics but the publication process sends documents every 5
  minutes, and I think the optimize is taking too much time.
 
  I have no default mergeFactor configured for this collection, do you
 think
  that setting it to a small value could improve the situation? If I have
  understood well having to merge segments will keep similar stats on all
  nodes. It's ok to have the indexing process a little bit slower.
 
 
  2014-10-21 18:44 GMT+02:00 Erick Erickson erickerick...@gmail.com:
 
  Giovanni:
 
  To see how this happens, consider a shard with a leader and two
  followers. Assume your autocommit interval is 60 seconds on each.
 
  This interval can expire at slightly different wall clock times.
  Even if the servers started perfectly in synch, they can get slightly
  out of sync. So, you index a bunch of docs and these replicas close
  the current segment and re-open a new segment with slightly different
  contents.
 
  Now docs come in that replace older docs. The tf/idf statistics
  _include_ deleted document data (which is purged on optimize). Given
  that doc X an be in different segments (or, more accurately, segments
  that get merged at different times on different machines), replica 1
  may have slightly different stats than replica 2, thus computing
  slightly different scores.
 
  Optimizing purges all data related to deleted documents, so it all
  regularizes itself on optimize.
 
  Best,
  Erick
 
  On Tue, Oct 21, 2014 at 11:08 AM, Giovanni Bricconi
  giovanni.bricc...@banzai.it wrote:
   I noticed again the problem, now I was able to collect some data. in
 my
   paste http://pastebin.com/nVwf327c you can see the result of the same
  query
   issued twice, the 2nd and 3rd group are swapped.
  
   I pasted also the clusterstate and the core state for each core.
  
   The logs didn't show any problem related to indexing, only some
   malformed queries.
  
   After doing an optimize the problem disappeared.
  
   So, is the problem related to documents that were deleted from the
   index?
  
   The optimization took 5 minutes to complete
  
   2014-10-21 11:41 GMT+02:00 Giovanni Bricconi 
  giovanni.bricc...@banzai.it:
  
   Nice!
   I will monitor the index and try this if the problem comes back.
   Actually the problem was due to small differences in score, so I
 think
  the
   problem has the same origin
  
   2014-10-21 8:10 GMT+02:00 lboutros boutr...@gmail.com:
  
   Hi Giovanni,
  
   we had this problem as well.
   The cause was that the different nodes have slightly different idf
  values.
  
   We solved this problem by doing an optimize operation which really
  remove
   suppressed data.
  
   Ludovic.
  
  
  
   -
   Jouve
   France.

Re: StatelessScriptUpdateProcessorFactory Access to Solr Core/schema/analyzer etc

2014-10-23 Thread Erik Hatcher

On Oct 22, 2014, at 3:27 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/22/2014 11:50 AM, Tom LAMPERT wrote:
 I am attempting to create a script (JavaScript) using the 
 StatelessScriptUpdateProcessorFactory feature of Solr, but I am blocked on 
 how to access the current core instance (ultimately to access its schema)? 
 In the wikipedia example the input document is accessible using doc = 
 cmd.solrDoc but no other information is given. The aim of the script is to 
 apply any filters/tokenisers to the input fields before solr indexes them so 
 that the stored values are those after processing, not the original data. 
 Any tips would be gratefully received as I cannot find any info on the API 
 for this framework...
 
 I would guess that you'd need to be writing Java code to have this kind
 of detail, not javascript.  The info in the other replies you received
 is talking about Java code.
 
 Javascript would not be able to execute the analysis on the input anyway
 -- that's all Java as well.

Ummm… see slides 10 and 11 here: 
http://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks

So yes, you can do analysis tricks in an update script.  And it’s incredibly 
useful and powerful!  :)
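
To spell out the gist: grab the schema's analyzer for a field and run the raw
value through it. A rough Java sketch of the 4.x call chain (the update script
does the same thing through the objects bound into it; the field name and
rawValue here are placeholders, untested):

  import java.io.IOException;
  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.solr.request.SolrQueryRequest;

  static String analyzeField(SolrQueryRequest req, String field, String rawValue)
      throws IOException {
    // the analyzer the schema would apply to this field at index time
    Analyzer analyzer = req.getCore().getLatestSchema().getFieldType(field).getAnalyzer();
    TokenStream ts = analyzer.tokenStream(field, new StringReader(rawValue));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    StringBuilder analyzed = new StringBuilder();
    ts.reset();
    while (ts.incrementToken()) {
      analyzed.append(term).append(' ');   // collect the post-analysis tokens
    }
    ts.end();
    ts.close();
    return analyzed.toString().trim();     // what you would put back as the stored value
  }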

Erik



Analytics component

2014-10-23 Thread nabil Kouici
Hi All,

I'm trying to use Solr for some analytic functions (percentile, median...). I 
checked out the trunk branch of Solr, which contains the analytics component 
implementation. I rebuilt Solr, but unfortunately this component wasn't picked 
up and no lib was generated in /contrib/analytics.

Do you have any idea how to get it compiled? Otherwise, any idea how to get 
these analytics in Solr?

Regards,
Nabil. 

Re: Difference between unloading of cores with LotsOfCores and unloading a core with CoreAdmin

2014-10-23 Thread Erick Erickson
Memory should eventually be returned when a core is unloaded. There's
a very small amount of overhead for keeping a list of all the cores
and their locations, but this shouldn't increase with time unless
you're adding more cores.

Do note that the transient cache size is fixed, but may be exceeded. A
core that is aged out of the cache is held open long enough to serve any
outstanding requests, but it _should_ have its memory reclaimed
eventually.

Of course there's always the possibility of some memory being kept
inadvertently; I'd consider that a bug, so if you can pin down how this
happens, perhaps with a test case, that would be great. Dumping the
memory would help show what's kept, if anything actually is.

Best,
Erick

On Wed, Oct 22, 2014 at 12:33 PM, Xiaolu Zhao xiaolu.z...@oracle.com wrote:
 Hi Erick,

 Thanks a lot for your explanation.

 Last time, when I try out LotsOfCores, I find JVM memory usage will increase
 as the total number of cores grows, though the transient cache size is
 fixed. Finally, JVM will run out of memory when I have thousands of cores.
 Does it mean other currently unloaded cores will consume memory? Or swapping
 among loaded/unloaded cores will consume memory?

 Best,
 Xiaolu

 On 10/22/2014 12:23 PM, Erick Erickson wrote:

 The difference here is that the LotsOfCores is intended to cache open
 cores and thus limit the number of currently loaded cores. However,
 cores not currently loaded are available for use; the next request
 that needs that core will cause it to be loaded (or reloaded).

 The admin/core/UNLOAD command, on the other hand, is designed to
 _permanently_ remove the core from Solr. Or at least have it become
 unavailable until another explicit admin/core command is executed to
 bring it back. There is nothing automatic about this.

 Another way of looking at it is that LotsOfCores is used in a
 situation where you don't know what requests are coming in, but you
 _can_ predict that not many will be used at once. So if I have 500
 cores, and my expectation is that only 20 of them are used at once,
 there's no good in having the 480 other cores loaded all the time.
 When a query comes in for one of the currently-unloaded cores (call it
 core21), that core is loaded (perhaps displacing one of the
 currently-loaded cores) and the request is served.

 If core21 above had been unloaded with the core/admin command, then a
 request directed to it would return an error instead.

 Best,
 Erick

 On Wed, Oct 22, 2014 at 12:11 PM, Xiaolu Zhao xiaolu.z...@oracle.com
 wrote:

 Hi All,

 I am confused about the difference between unloading of cores with
 LotsOfCores and unloading a core with CoreAdmin.

  From my understanding of LotsOfCores, if one core is removed from
 transient
 cache, it is pending to close, it means close all resources allocated by
 the
 core if it is no longer in use, e.g. searcher, updateHandler... While for
 unloading a core with CoreAdmin, this core needs to be removed from the
 cores list, either ordinary cores list or transient cores list, and cores
 locator will delete it. If this core is loaded but not pending to close,
 it
 will be close.

 Also, one more interesting thing is if I unload a core with CoreAdmin,
 core.properties will be renamed core.properties.unloaded. Then this
 core
 cannot be found in the Solr API, and STATUS url won't return its status
 as
 well. But with LotsOfCores, a core not in the transient cache will still
 have core.properties and could be found through STATUS url, though it
 is
 marked with isLoaded=false.

 Could anyone tell me the underlying mechanism for these two cases? Why
 LotsOfCores could realize frequent unloading/loading of cores? Do cores
 not
 in the transient cores still consume JVM memory, while unloaded cores
 with
 CoreAdmin not?

 Thanks,
 Xiaolu




Re: How to properly use Levenstein distance with ~ in Java

2014-10-23 Thread Walter Underwood
We’re reimplementing fuzzy support in edismax on Solr 4.x right now. See: 
https://issues.apache.org/jira/browse/SOLR-629

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Oct 22, 2014, at 11:08 PM, karsten-s...@gmx.de wrote:

 Hi Aleksander,
  
 The fuzzy search operator '~' is not supported by the dismax query parser (defType=dismax):
 https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
  
 You are using the spellcheck SearchComponent; it does not change the query 
 results.
  
 
 By the way: it looks like you are using the path /select with qt=dismax. This would 
 normally throw an exception.
 Is there a tag
   <requestHandler name="/dismax" ...>
 inside your solrconfig.xml ? 
  
 Best regards
  
   Karsten
  
 P.S. in Context: 
 http://lucene.472066.n3.nabble.com/How-to-properly-use-Levenstein-distance-with-in-Java-td4164793.html
  
 
 On 20 October 2014 11:13, Aleksander Sadecki wrote:
 
 Ok, thank you for your response. But why I cannot use '~'?



Re: unstable results on refresh

2014-10-23 Thread Shawn Heisey
On 10/23/2014 2:44 AM, Giovanni Bricconi wrote:
 My user interface shows some boxes to describe results categories. After
 half a day of small updates and delete I noticed with various queries that
 the boxes started swapping while browsing.
 For sure I relied too much in getting the same results on each call, now
 I'm keeping the categories order in request parameters to avoid the blink
 effect while browsing.
 
 The optimize process is really slow, and I can't use it. Since I have many
 other parameters that should be carried along the request to make sure that
 the navigation is consistent, I would like to understand if is there a
 setup that can limit the idf change and keep it low enough
 
 I tried with
 
 <indexConfig>
   <mergeFactor>5</mergeFactor>
 </indexConfig>
 In solrconfig but this morning /solr/admin/cores?action=STATUS still
 reports a number of segments above ten for all cores of the shard. (I'm
 sure I have reloaded each core after changing the value)
 
 Now I'm trying with expungeDeletes called from solrj, but still I don't see
 the segment count decrease

It's completely normal to have more segments than the mergeFactor.
Think about this scenario with a mergeFactor of 5:

You index five segments.  They get merged to one segment.  Let's say
that this happens a total of four times, so you've indexed a total of 20
segments and merging has reduced that to four larger segments.  Let's
say that you now index four more segments.  You'll be completely stable
with eight segments.  If you index another one, that will result in a
fifth larger segment.  This sets conditions up just right for another
merge -- to one even larger segment.  This represents three levels of
merging, and there can be even more levels, each of which can have four
segments and remain stable.  Starting at the last state I described, if
you then indexed 24 more segments, you'd have a stable index with a
total of nine segments - four of them would be normal sized, four of
them would be about five times normal size, and the first one would be
about 25 times normal size.
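
Spelled out as a running tally (same mergeFactor=5 assumption):

  index 20 small segments  -> merged into 4 larger segments (5 small -> 1 larger, four times)
  index 4 more small       -> 4 larger + 4 small = 8 segments, stable
  index 1 more small       -> 5 small merge into a 5th larger, and the 5 larger merge into 1 even larger
  index 24 more small      -> 1 even-larger + 4 larger + 4 small = 9 segments, stable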

The Solr default for the merge policy in all recent versions is
TieredMergePolicy, and this can make things slightly more complicated
than I've described, because it can merge *any* segments, not just those
indexed sequentially, and I believe that it can delay merging until the
right number of segments with suitable characteristics appear.

I've got merge settings equivalent to a mergeFactor of 35, but I
regularly see the segment count approach 100, and there's absolutely
nothing wrong with my merging.

If I understand it correctly, expungeDeletes will not decrease the
segment count.  It will simply rewrite segments that have deleted
documents so there are none.  I'm not 100% sure that I know exactly what
expungeDeletes does, though.

Thanks,
Shawn



Re: StatelessScriptUpdateProcessorFactory Access to Solr Core/schema/analyzer etc

2014-10-23 Thread Shawn Heisey
On 10/23/2014 2:47 AM, Erik Hatcher wrote:
 Ummm… see slides 10 and 11 here: 
 http://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks
 
 So yes, you can do analysis tricks in an update script.  And it’s incredibly 
 useful and powerful!  :)

That's pretty amazing.  I would not have imagined that this kind of
crossover would be possible.  Thanks for the info!

Shawn



QueryAutoStopWordAnalyzer

2014-10-23 Thread Bernd Fehling
I just located the QueryAutoStopWordAnalyzer in Lucene.
Has anyone managed to use it with Solr?

I could imagine using it as a language-independent search cleanup
for the text_all field.

Can it be used with Solr right out of the box, or do I have to
write a wrapper or factory?
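
For reference, at the Lucene level it wraps an existing analyzer and inspects an
IndexReader to pick very common terms as query-time stopwords, roughly like this
(a sketch from memory for Lucene 4.x, untested; the path, Version constant and
40% threshold are placeholders):

  import java.io.File;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
  import org.apache.lucene.analysis.query.QueryAutoStopWordAnalyzer;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
  Analyzer auto = new QueryAutoStopWordAnalyzer(
      Version.LUCENE_47,                         // use the constant for your Lucene release
      new WhitespaceAnalyzer(Version.LUCENE_47), // the delegate analyzer being wrapped
      reader,
      0.4f);                                     // terms in >40% of docs become stopwords

i.e. it needs an IndexReader (a built index) at construction time.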

Regards
Bernd


Re: Analytics component

2014-10-23 Thread Jorge Luis Betancourt González
I believe some of the statistics functions that you're trying to use are 
present in facets. 

- Original Message -
From: nabil Kouici koui...@yahoo.fr
To: solr-user@lucene.apache.org
Sent: Thursday, October 23, 2014 5:57:27 AM
Subject: Analytics component 

Hi All,

I'm trying to use Solr for some analytic functions (percentile, median...). I 
checked out the trunk branch of Solr, which contains the analytics component 
implementation. I rebuilt Solr, but unfortunately this component wasn't picked 
up and no lib was generated in /contrib/analytics.

Do you have any idea how to get it compiled? Otherwise, any idea how to get 
these analytics in Solr?

Regards,
Nabil.


Re: Difference between unloading of cores with LotsOfCores and unloading a core with CoreAdmin

2014-10-23 Thread Xiaolu Zhao

Hi Erick,

Actually we are adding more cores. In this case, we set 
transientCacheSize=500 and create 16,000 cores in total, each with 10k 
log entries.


During the process, we can easily see JVM memory usage increase 
as the total number of cores grows. It runs out of memory when the total 
number of cores reaches 5,400.


Then we restart Solr and continue creating and loading cores. JVM memory 
usage rises to over 7GB (max: 8GB) but does not exceed the maximum. The 
process becomes very slow at that point; we believe garbage collection may 
take place and cost some time.


What about resource usage for LotsOfCores (loaded/unloaded), e.g. the 
searcher? Are all the resources allocated by a core closed once it is 
unloaded? And how long does it take an unloaded core to get loaded 
first when we issue a query to it?


We ran a test to look into the processing time for unloaded cores. 
In this case, we have 100 cores: 1-50 with 100M, 51-55 with 1M, 56-60 
with 10M, 61-70 with 100K, 71-100 with 10K. We then query unloaded cores 
with different data sizes to get the processing time for each group. The 
query is the same for all: select?q=*.


Collection Name       Total Time(ms)   QTime(ms)   Processing Time(ms)
collection71(10K)             418             1                  417
collection72(10K)             413             0                  413
collection61(100K)            439             2                  437
collection62(100K)            424             1                  423
collection51(1M)              527             5                  522
collection52(1M)              538             5                  533
collection56(10M)             560            33                  527
collection57(10M)             553            33                  520
collection3(100M)            5971           322                 5649
collection4(100M)            6052           327                 5725


Based on the table above, we can see an ascending trend with larger 
data, but there is a big gap between 10M and 100M.


Thanks,
Xiaolu

On 10/23/2014 9:51 AM, Erick Erickson wrote:

Memory should eventually be returned when a core is unloaded. There's
a very small amount of overhead for keeping a list of all the cores
and their locations, but this shouldn't increase with time unless
you're adding more cores.

Do note that the transient cache size is fixed, but may be exceeded. A
core is held open when it gets reclaimed long enough to serve any
outstanding requests, but it _should_ have the memory reclaimed
eventually.

Of course there's always the possibility of some memory being kept
inadvertently, I'd consider that a  bug so if you can define how this
happens, perhaps with a test case that would be great. Dumping the
memory would help see what's kept if anything actually is.

Best,
Erick

On Wed, Oct 22, 2014 at 12:33 PM, Xiaolu Zhao xiaolu.z...@oracle.com wrote:

Hi Erick,

Thanks a lot for your explanation.

Last time, when I try out LotsOfCores, I find JVM memory usage will increase
as the total number of cores grows, though the transient cache size is
fixed. Finally, JVM will run out of memory when I have thousands of cores.
Does it mean other currently unloaded cores will consume memory? Or swapping
among loaded/unloaded cores will consume memory?

Best,
Xiaolu

On 10/22/2014 12:23 PM, Erick Erickson wrote:

The difference here is that the LotsOfCores is intended to cache open
cores and thus limit the number of currently loaded cores. However,
cores not currently loaded are available for use; the next request
that needs that core will cause it to be loaded (or reloaded).

The admin/core/UNLOAD command, on the other hand, is designed to
_permanently_ remove the core from Solr. Or at least have it become
unavailable until another explicit admin/core command is executed to
bring it back. There is nothing automatic about this.

Another way of looking at it is that LotsOfCores is used in a
situation where you don't know what requests are coming in, but you
_can_ predict that not many will be used at once. So if I have 500
cores, and my expectation is that only 20 of them are used at once,
there's no good in having the 480 other cores loaded all the time.
When a query comes in for one of the currently-unloaded cores (call it
core21), that core is loaded (perhaps displacing one of the
currently-loaded cores) and the request is served.

If core21 above had been unloaded with the core/admin command, then a
request directed to it would return an error instead.

Best,
Erick

On Wed, Oct 22, 2014 at 12:11 PM, Xiaolu Zhao xiaolu.z...@oracle.com
wrote:

Hi All,

I am confused about the difference between unloading of cores with
LotsOfCores and unloading a core with CoreAdmin.

  From my understanding of LotsOfCores, if one core is removed from
transient
cache, it is pending to close, it means close all resources allocated by
the
core if it is no longer in use, e.g. searcher, updateHandler... While for
unloading a core with CoreAdmin, this 

Re: QueryAutoStopWordAnalyzer

2014-10-23 Thread Alexandre Rafalovitch
How is this different from using StopFilterFactory in Solr:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/core/StopFilterFactory.html
?

Lucene wraps analyzers, Solr has a chain instead (though analyzers
are supported as well).

You just configure the chain. Writing a factory for when one analyzer
wraps another would be just duplication of the chain code.

What am I missing?

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 23 October 2014 10:31, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
 I just located the QueryAutoStopWordAnalyzer in lucene.
 Has anyone managed to use it for solr?

 Could imagine to have a language independent search clean up
 for the text_all field.

 Can it be used for solr right out of the box or do I have to
 write a wrapper or factory?

 Regards
 Bernd


Re: How to properly use Levenstein distance with ~ in Java

2014-10-23 Thread Alexandre Rafalovitch
The last real update on that is 2.5 years old. Is there a more recent
update? I am interested in this topic as well.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 23 October 2014 10:10, Walter Underwood wun...@wunderwood.org wrote:
 We’re reimplementing fuzzy support in edismax on Solr 4.x right now. See: 
 https://issues.apache.org/jira/browse/SOLR-629

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/

 On Oct 22, 2014, at 11:08 PM, karsten-s...@gmx.de wrote:

 Hi Aleksander,

 The fuzzy search operator '~' is not supported by the dismax query parser (defType=dismax):
 https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser

 You are using the spellcheck SearchComponent; it does not change the query 
 results.


 By the way: it looks like you are using the path /select with qt=dismax. This would 
 normally throw an exception.
 Is there a tag
   <requestHandler name="/dismax" ...>
 inside your solrconfig.xml ?

 Best regards

   Karsten

 P.S. in Context: 
 http://lucene.472066.n3.nabble.com/How-to-properly-use-Levenstein-distance-with-in-Java-td4164793.html


 On 20 October 2014 11:13, Aleksander Sadecki wrote:

 Ok, thank you for your response. But why I cannot use '~'?



update external file

2014-10-23 Thread Michael Sokolov
I've been looking at ExternalFileField to handle popularity boosting, 
since Solr's updatable docvalues (SOLR-5944) aren't quite there yet. My 
question is whether there is any support for uploading the external file 
via Solr, or whether people do that some other (external, I guess) way?


-Mike


Re: update external file

2014-10-23 Thread Ramzi Alqrainy
Of course there is support for uploading the external file via Solr; you
can find more details in the links below:

https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes

http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/schema/ExternalFileField.html



--
View this message in context: 
http://lucene.472066.n3.nabble.com/update-external-file-tp4165563p4165565.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: update external file

2014-10-23 Thread Michael Sokolov
Thanks for the links, Ramzi.  I had already read the wiki page, which 
merely talks about how to reload the file into memory once it has been 
updated on disk. It doesn't mention any support for uploading that I can 
see.  Did I miss it?


-Mike


On 10/23/14 1:36 PM, Ramzi Alqrainy wrote:

Of course, there is a support for uploading the external file via Solr, you
can find more details in below links

https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes

http://lucene.apache.org/solr/4_10_0/solr-core/org/apache/solr/schema/ExternalFileField.html



--
View this message in context: 
http://lucene.472066.n3.nabble.com/update-external-file-tp4165563p4165565.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: update external file

2014-10-23 Thread Ramzi Alqrainy
I hope I understand your question correctly; I had the same problem. This is
what I did:

1. Create a file:
solr_home/PROJECT/multicore/core1/data/external_popularProducts.txt

The file should contain values like this:
uniqueID_in_core=count

Example:
873728721=19
842728342=20

2. Update schema.xml, add this under <types> ... </types>:
<fieldType name="popularProductsFile" keyField="key" defVal="0"
stored="true" indexed="true" class="solr.ExternalFileField" valType="float"
/>

Here, key is the column name for the primaryID of solr core.
Add this under <fields> ... </fields>:
<field name="popularProducts" type="popularProductsFile" indexed="true"
stored="true" />

3. Reload the core. 
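
4. The values can then be used in function queries, for example as a
multiplicative boost (query and field names here are just illustrative):

q=phone&defType=edismax&boost=log(sum(popularProducts,1))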



--
View this message in context: 
http://lucene.472066.n3.nabble.com/update-external-file-tp4165563p4165572.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: update external file

2014-10-23 Thread Markus Jelsma
You either need to upload them and issue the reload command, or download them 
from the machine, and then issue the reload command. There is no REST support 
for it (yet) like the synonym filter, or was it stop filter?
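
(The reload being the usual core admin call, something like
http://localhost:8983/solr/admin/cores?action=RELOAD&core=core1, with host, port
and core name whatever applies.)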

MArkus 
 
-Original message-
 From:Michael Sokolov msoko...@safaribooksonline.com
 Sent: Thursday 23rd October 2014 19:19
 To: solr-user solr-user@lucene.apache.org
 Subject: update external file
 
 I've been looking at ExternalFileField to handle popularity boosting.  
 Since Solr updatable docvalues (SOLR-5944) isn't quite there yet.  My 
 question is whether there is any support for uploading the external file 
 via Solr, or if people do that some other (external, I guess) way?
 
 -Mike
 


Re: Difference between unloading of cores with LotsOfCores and unloading a core with CoreAdmin

2014-10-23 Thread Erick Erickson
bq: ..allocated by the core close for unloaded cores? And how about
the processing time for unloaded cores to get it loaded first if we
issue a query to it?

Well, all resources are supposed to be returned to the system. Even
500 cores open at one time is a lot though.

My theory is this has nothing to do with transient or non-transient
cores. What's happening here is that you simply are opening too many
cores (eventually) for the memory you're allocating. Plus, various
caches get filled up at different times depending on the query. Also,
if you have, say, 1,000 simultaneous queries outstanding to 1,000
different cores, _all_ 1,000 will be loaded in memory at the same time
(I'm simplifying a bit here). After 500 of the queries have been
satisfied, the number should drop back.

So here's what I'd do to test if there's really a memory leak or
you're just being too ambitious: Drop the transient cache size to,
say, 100 (or 50 or 10). You'll also have to take some care not to
flood the system with lots of queries to lots of different cores, but
you should vary the cores to cycle through them all. If your process
still shows memory creeping, you'll need to take some memory snapshots
so we can analyze what's going on.
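
For reference, the knobs involved here are the per-core properties and the cache
size (names as documented for the 4.x LotsOfCores feature; 100 is just the
example value above):

  # core.properties for each transient core
  transient=true
  loadOnStartup=false

  # solr.xml: transientCacheSize="100" as an attribute on <cores> in the
  # old-style format, or <int name="transientCacheSize">100</int> in the new style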

And by mixing very different numbers of documents in your various
cores, you're introducing another variable that will make
apples-to-apples comparisons difficult.

The model the LotsOfCores stuff was built to deal with is having 100's
to 1,000's of cores, but not very many of them active at once.
Consider a situation where each e-mail user has their own core. A user
searches old e-mails only very rarely, so having 10,000 cores on a
machine, only, say, 10-20 may be active at once. You never know which
ones, of course. Eventually all of them will be used but rarely very
many simultaneously.

So you may be hitting an edge case if you are continually firing
queries at different cores. Loading a core is expensive, all the
underlying caches will be warmed, firstSearcher queries will be fired,
etc. And on only 8G of memory for 500 active cores, it's not
surprising that you're blowing up memory IMO.

Best,
Erick

On Thu, Oct 23, 2014 at 11:28 AM, Xiaolu Zhao xiaolu.z...@oracle.com wrote:
 Hi Erick,

 Actually we are adding more cores. In this case, we set
 transientCacheSize=500, create 16,000 cores in total, each with 10k log
 entries.

 During the process, we could easily see JVM memory usage will increase as
 the total number of cores grows. It runs out of memory when the total number
 of cores reaches 5,400.

 Then we restart Solr, continue creating and loading cores. JVM memory usage
 will rise to over 7GB (Max: 8GB), but not exceed the maximum. The process
 could be very slow then, we believe garbage collection may take place and
 cost some time.

 How about the resources usage for LotsOfCores (loaded/unloaded), e.g.
 searcher? Are all resources allocated by the core close for unloaded cores?
 And how about the processing time for unloaded cores to get it loaded first
 if we issue a query to it?

 We do the testing to look into the processing time for unloaded cores. In
 this case, we have 100 cores, 1-50 with 100M, 51-55 with 1M, 56-60 with 10M,
 61-70 with 100K, 71-100 with 10K. Then we could do query to unloaded cores
 with different data size to get the processing time for each group. Here,
 this query is for all: select?q=*.

 Collection Name       Total Time(ms)   QTime(ms)   Processing Time(ms)
 collection71(10K)             418             1                  417
 collection72(10K)             413             0                  413
 collection61(100K)            439             2                  437
 collection62(100K)            424             1                  423
 collection51(1M)              527             5                  522
 collection52(1M)              538             5                  533
 collection56(10M)             560            33                  527
 collection57(10M)             553            33                  520
 collection3(100M)            5971           322                 5649
 collection4(100M)            6052           327                 5725


 Based on the table above, we could see an ascending trend with larger data.
 But there is a big gap between 10M and 100M.

 Thanks,
 Xiaolu


 On 10/23/2014 9:51 AM, Erick Erickson wrote:

 Memory should eventually be returned when a core is unloaded. There's
 a very small amount of overhead for keeping a list of all the cores
 and their locations, but this shouldn't increase with time unless
 you're adding more cores.

 Do note that the transient cache size is fixed, but may be exceeded. A
 core is held open when it gets reclaimed long enough to serve any
 outstanding requests, but it _should_ have the memory reclaimed
 eventually.

 Of course there's always the possibility of some memory being kept
 inadvertently, I'd consider that a  bug so if you can define how this
 happens, perhaps with a test case that would be great. Dumping the
 memory would help see what's kept if anything actually is.

 Best,
 Erick

 On Wed, Oct 22, 2014 at 12:33 PM, Xiaolu Zhao xiaolu.z...@oracle.com
 wrote:

 Hi Erick,

 Thanks a lot for your explanation.

 Last time, when I try out 

RE: update external file

2014-10-23 Thread Ramzi Alqrainy
Right, there is no REST support for it like there is for the synonym filter (or
was it the stop filter?).



--
View this message in context: 
http://lucene.472066.n3.nabble.com/update-external-file-tp4165563p4165577.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: update external file

2014-10-23 Thread Michael Sokolov

That's what I thought; thanks, Markus.

On 10/23/14 2:19 PM, Markus Jelsma wrote:

You either need to upload them and issue the reload command, or download them 
from the machine, and then issue the reload command. There is no REST support 
for it (yet) like the synonym filter, or was it stop filter?

MArkus
  
-Original message-

From:Michael Sokolov msoko...@safaribooksonline.com
Sent: Thursday 23rd October 2014 19:19
To: solr-user solr-user@lucene.apache.org
Subject: update external file

I've been looking at ExternalFileField to handle popularity boosting.
Since Solr updatable docvalues (SOLR-5944) isn't quite there yet.  My
question is whether there is any support for uploading the external file
via Solr, or if people do that some other (external, I guess) way?

-Mike





recip function error

2014-10-23 Thread eShard
Good evening,
I'm using solr 4.0 Final.
I tried using this function
boost=recip(ms(NOW/HOUR,startdatez,3.16e-11.0,0.08,0.05))
but it fails with this error:
org.apache.lucene.queryparser.classic.ParseException: Expected ')' at
position 29 in 'recip(ms(NOW/HOUR,startdatez,3.16e-11.0,0.08,0.05))'

I applied this patch https://issues.apache.org/jira/browse/SOLR-3522 
Rebuilt and redeployed AND I get the exact same error.
I only copied over the new jars and war file. None of the other libraries
seemed to have changed.
the patch is in solr core so I figured I was safe.

Does anyone know how to fix this?

Thanks,





--
View this message in context: 
http://lucene.472066.n3.nabble.com/recip-function-error-tp4165600.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: How to properly use Levenstein distance with ~ in Java

2014-10-23 Thread Will Martin
In terms of recent work with edit distance (specifically Levenshtein), given your 
expressed interest you might find this paper provocative.

We measure the keyword similarity between two strings
by lemmatizing them, removing stopwords, and computing
the cosine similarity. We then include the keyword similarity
between the query and the input question, the keyword
similarity between the query and the returned evidence, and
an indicator feature for whether the query involves a join.
The evidence features compute KB-specific properties... We compute the join-key
string similarity measured using the Levenshtein distance.


http://dx.doi.org/10.1145/2623330.2623677

re
will


-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Thursday, October 23, 2014 12:05 PM
To: solr-user
Subject: Re: How to properly use Levenstein distance with ~ in Java

The last real update on that is 2.5 years old. Is there more recent update? I 
am interested in this topic as well.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and 
newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers 
community: https://www.linkedin.com/groups?gid=6713853


On 23 October 2014 10:10, Walter Underwood wun...@wunderwood.org wrote:
 We’re reimplementing fuzzy support in edismax on Solr 4.x right now. 
 See: https://issues.apache.org/jira/browse/SOLR-629

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/

 On Oct 22, 2014, at 11:08 PM, karsten-s...@gmx.de wrote:

 Hi Aleksander,

 The fuzzy search operator '~' is not supported by the dismax query parser (defType=dismax):
 https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser

 You are using the spellcheck SearchComponent; it does not change the query 
 results.


 By the way: it looks like you are using the path /select with qt=dismax. This would 
 normally throw an exception.
 Is there a tag
   <requestHandler name="/dismax" ...>
 inside your solrconfig.xml ?

 Best regards

   Karsten

 P.S. in Context: 
 http://lucene.472066.n3.nabble.com/How-to-properly-use-Levenstein-distance-with-in-Java-td4164793.html


 On 20 October 2014 11:13, Aleksander Sadecki wrote:

 Ok, thank you for your response. But why I cannot use '~'?




Re: recip function error

2014-10-23 Thread Shawn Heisey
On 10/23/2014 3:09 PM, eShard wrote:
 Good evening,
 I'm using solr 4.0 Final.
 I tried using this function
 boost=recip(ms(NOW/HOUR,startdatez,3.16e-11.0,0.08,0.05))
 but it fails with this error:
 org.apache.lucene.queryparser.classic.ParseException: Expected ')' at
 position 29 in 'recip(ms(NOW/HOUR,startdatez,3.16e-11.0,0.08,0.05))'

 I applied this patch https://issues.apache.org/jira/browse/SOLR-3522 
 Rebuilt and redeployed AND I get the exact same error.
 I only copied over the new jars and war file. Non of the other libraries
 seemed to have changed.
 the patch is in solr core so I figured I was safe.

 Does anyone know how to fix this?

The Solr version you are running is more than two years old.  There have
been MANY new releases and MANY problems fixed since July 2012.

I have been using the recip function in a similar manner without any
problem on Solr versions starting at 4.2.1, up through 4.9.1, but I
have never used 4.0.

boost=min(recip(abs(ms(NOW/HOUR,pd)),1.92901e-10,1.5,1.5),0.85)

Upgrading is strongly advised.  The current Solr version is 4.10.1,
released less than a month ago.

Thanks,
Shawn



Re: Analytics component

2014-10-23 Thread nabil Kouici
Thank you for this reply. Yes, but many analytics functions are not available, 
like percentile, median, standard deviation...

Regards,
Nabil 

 On Thursday, 23 October 2014 at 16:34, Jorge Luis Betancourt González 
jlbetanco...@uci.cu wrote:
   

 I believe some of the statistics functions that you're trying to use are 
present in facets. 

- Original Message -
From: nabil Kouici koui...@yahoo.fr
To: solr-user@lucene.apache.org
Sent: Thursday, October 23, 2014 5:57:27 AM
Subject: Analytics component 

Hi All,

I'm trying to use Solr for some analytic functions (percentile, median...). I 
checked out the trunk branch of Solr, which contains the analytics component 
implementation. I rebuilt Solr, but unfortunately this component wasn't picked 
up and no lib was generated in /contrib/analytics.

Do you have any idea how to get it compiled? Otherwise, any idea how to get 
these analytics in Solr?

Regards,
Nabil.


   

Re: recip function error

2014-10-23 Thread eShard
Thanks, we're planning on going to 4.10.1 in a few months.
I discovered that recip only works with dismax; I use edismax by default.
Does anyone know why I can't use recip with edismax?

I hope this is fixed in 4.10.1...


Thanks,



--
View this message in context: 
http://lucene.472066.n3.nabble.com/recip-function-error-tp4165600p4165613.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: recip function error

2014-10-23 Thread Chris Hostetter

: I tried using this function
: boost=recip(ms(NOW/HOUR,startdatez,3.16e-11.0,0.08,0.05))
: but it fails with this error:
: org.apache.lucene.queryparser.classic.ParseException: Expected ')' at
: position 29 in 'recip(ms(NOW/HOUR,startdatez,3.16e-11.0,0.08,0.05))'

look very carefully at your input, and at the error message.

you are only passing *1* argument to the recip() function -- the output of 
the ms() function.

you are passing *5* arguments to the ms() function -- it supports a max of 
2.

which is why at the 29th character of your input, after the second 
argument to your ms() function, it's complaining that it's expecting a ) 
character -- not more arguments.
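
in other words, move the closing paren for ms() up to right after its second
argument (and drop the stray ".0" from the exponent); a corrected form would
look something like:

  boost=recip(ms(NOW/HOUR,startdatez),3.16e-11,0.08,0.05)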


-Hoss
http://www.lucidworks.com/


Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-23 Thread S.L
Shawn ,

Just wanted to follow up , I still face this issue of inconsistent search
results on Solr Cloud 4.1.0.1 , upon further looking into logs , I found
out a few exceptions , what was obvious was zkConnection time out issues
and other exceptions , please take a look .

*Logs*

/opt/tomcat1/logs/catalina.out:103651230 [http-bio-8081-exec-206] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651579 [http-bio-8081-exec-206] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651586 [http-bio-8081-exec-206] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651592 [http-bio-8081-exec-206] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651600 [http-bio-8081-exec-206] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651611 [http-bio-8081-exec-203] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
471640118 [localhost-startStop-1-EventThread] INFO
org.apache.solr.common.cloud.ConnectionManager  – Watcher
org.apache.solr.common.cloud.ConnectionManager@2a7dcd74
name:ZooKeeperConnection Watcher:server1.mydomain.com:2181,
server2.mydomain.com:2181,server3.mydomain.com:2181 got event WatchedEvent
state:Disconnected type:None path:null path:null type:None
471640120 [localhost-startStop-1-EventThread] INFO
org.apache.solr.common.cloud.ConnectionManager  – zkClient has disconnected
471642457 [zkCallback-2-thread-8] INFO
org.apache.solr.cloud.DistributedQueue  – LatchChildWatcher fired on path:
null state: Expired type None
471642458 [localhost-startStop-1-EventThread] INFO
org.apache.solr.common.cloud.ConnectionManager  – 

Re: recip function error

2014-10-23 Thread Michael Sokolov

3.16e-11.0 looks fishy to me



On 10/23/14 5:09 PM, eShard wrote:

Good evening,
I'm using solr 4.0 Final.
I tried using this function
boost=recip(ms(NOW/HOUR,startdatez,3.16e-11.0,0.08,0.05))
but it fails with this error:
org.apache.lucene.queryparser.classic.ParseException: Expected ')' at
position 29 in 'recip(ms(NOW/HOUR,startdatez,3.16e-11.0,0.08,0.05))'

I applied this patch https://issues.apache.org/jira/browse/SOLR-3522
Rebuilt and redeployed AND I get the exact same error.
I only copied over the new jars and war file. Non of the other libraries
seemed to have changed.
the patch is in solr core so I figured I was safe.

Does anyone know how to fix this?

Thanks,





--
View this message in context: 
http://lucene.472066.n3.nabble.com/recip-function-error-tp4165600.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: recip function error

2014-10-23 Thread Yonik Seeley
On Thu, Oct 23, 2014 at 7:47 PM, Michael Sokolov
msoko...@safaribooksonline.com wrote:
 3.16e-11.0 looks fishy to me

Indeed... looks like it should be 3.16e-11
Standard scientific notation shouldn't have decimal points in the
exponent.  Not sure if that causes Java problems or not though...

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data