Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Ravi Solr
Erick, as per your advice I used cursorMarks (see the code below). It was
slightly better, but Solr throws exceptions randomly. Please look at the
code and stack trace below.

2015-09-26 01:00:45 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 500/1453133
2015-09-26 01:00:49 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1000/1453133
2015-09-26 01:00:54 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 1500/1452592
2015-09-26 01:00:58 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 2000/1452095
2015-09-26 01:01:03 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 2500/1451675
2015-09-26 01:01:10 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 3000/1450924
2015-09-26 01:01:15 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 3500/1450445
2015-09-26 01:01:19 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 4000/1449997
2015-09-26 01:01:24 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 4500/1449692
2015-09-26 01:01:28 INFO  [a.b.c.AdhocCorrectUUID] - Indexed 5000/1449201
2015-09-26 01:01:28 ERROR [a.b.c.AdhocCorrectUUID] - Error indexing
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://xx.xx.xx.xx:/solr/collection1: missing content stream
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:560)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:234)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:226)
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:376)
        at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:328)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1085)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:856)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:799)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
        at a.b.c.AdhocCorrectUUID.processDocs(AdhocCorrectUUID.java:97)
        at a.b.c.AdhocCorrectUUID.main(AdhocCorrectUUID.java:37)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.simontuffs.onejar.Boot.run(Boot.java:306)
        at com.simontuffs.onejar.Boot.main(Boot.java:159)
2015-09-26 01:01:28 INFO  [a.b.c.AdhocCorrectUUID] - FINISHED !!!


CODE

protected static void processDocs() {

    try {
        CloudSolrClient client = new CloudSolrClient("zk1:,zk2:,zk3.com:");
        client.setDefaultCollection("collection1");

        boolean done = false;
        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        Integer count = 0;

        while (!done) {
            SolrQuery q = new SolrQuery("*:*")
                    .setRows(500)
                    .addSort("publishtime", ORDER.desc)
                    .addSort("uniqueId", ORDER.desc)
                    .setFields(new String[]{"uniqueId", "uuid"});
            q.addFilterQuery(new String[]{"uuid:[* TO *]", "uuid:sun.org.mozilla*"});
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);

            QueryResponse resp = client.query(q);
            String nextCursorMark = resp.getNextCursorMark();

            SolrDocumentList docList = resp.getResults();

            List<SolrInputDocument> inList = new ArrayList<SolrInputDocument>();
            for (SolrDocument doc : docList) {

                SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);

                // This is my system's id
                String uniqueId = (String) iDoc.getFieldValue("uniqueId");

                /*
                 * This is another system's unique id, which is what I want to correct.
                 * It was messed up by a script transformer in the DIH import via
                 * SolrEntityProcessor, e.g.
                 * sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
                 */
                String uuid = (String) iDoc.getFieldValue("uuid");
                String sanitizedUUID =
                        uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");
                Map<String, Object> fieldModifier = new HashMap<String, Object>(1);
                fieldModifier.put("set", sanitizedUUID);
                iDoc.setField("uuid", fieldModifier);   // atomic update: set the cleaned value

                inList.add(iDoc);
            }
            client.add(inList);

            count = count + docList.size();
            log.info("Indexed " + count + "/" + docList.getNumFound());

            // standard cursorMark termination: stop when the cursor no longer advances
            if (cursorMark.equals(nextCursorMark)) {
                done = true;
            }
            cursorMark = nextCursorMark;
        }
        client.close();
    } catch (Exception e) {
        log.error("Error indexing", e);
    }
}

Re: firstSearcher cache warming with own QuerySenderListener

2015-09-25 Thread Erick Erickson
That's what the firstSearcher event in solrconfig.xml is for, exactly the
case of autowarming Solr when it's just been started. The queries you put
in that event are fired only when the server starts.

So I'd just put my queries there, and you do not have to put a zillion
queries in. Start with one that mentions all the facets you intend to
use, sorts by the various sort fields you use, and perhaps (if you have any
_very_ common filter queries) includes those too; a minimal sketch is shown below.
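A sketch of such a firstSearcher event in solrconfig.xml (the query, sort,
filter and facet names below are placeholders, not taken from this thread):

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- placeholder warming query: exercise the sorts, facets and common
           filter queries your application actually uses -->
      <str name="q">*:*</str>
      <str name="sort">publishtime desc</str>
      <str name="fq">category:news</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
    </lst>
  </arr>
</listener>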

Then analyze the queries that are still slow when issued the first time
after startup and add what you suspect are the relevant bits to the
firstSearcher query (or queries).

I suggest this as the much easier thing to do, and would focus effort on why
you are shutting down your Solr servers often enough that anyone notices.

Best,
Erick



On Fri, Sep 25, 2015 at 8:31 AM, Christian Reuschling <
christian.reuschl...@gmail.com> wrote:

> Hey all,
>
> we want to avoid cold start performance issues when the caches are cleared
> after a server restart.
>
> For this, we have written a SearchComponent that saves least recently used
> queries. These are
> written to a file inside a closeHook of a SolrCoreAware at server shutdown.
>
> The plan is to perform these queries at server startup to warm up the
> caches. For this, we have
> written a derivative of the QuerySenderListener and configured it as
> firstSearcher listener in
> solrconfig.xml. The only difference to the origin QuerySenderListener is
> that it gets it's queries
> from the formerly dumped lru queries rather than getting them from the
> config file.
>
> It seems that everything is called correctly, and we have the impression
> that the query response
> times for the dumped queries are sometimes slightly better than without
> this warming.
>
> Nevertheless, there is still a huge difference against the times when we
> manually perform the same
> queries once, e.g. from a browser. If we do this, the second time we
> perform these queries they
> respond much faster (up to 10 times) than the response times after the
> implemented warming.
>
> It seems that not all caches are warmed up during our warming. And because
> of these huge
> differences, I doubt we missed something.
>
> The index has about 25M documents, and is splitted into two shards in a
> cloud configuration, both
> shards are on the same server instance for now, for testing purposes.
>
> Does anybody have an idea? I tried to disable lazy field loading as a
> potential issue, but with no
> success.
>
>
> Cheers,
>
> Christian
>
>


Expensive GC Remark Phase for JNI Weak Reference

2015-09-25 Thread Keith L
Using:
- JDK 1.8u40
- UseG1GC, ParallelRefProcEnabled, Xmx12g,Xms12g
- Solr 4.10.4


When using G1GC we are seeing very high processing times in the GC Remark
phase during reference processing. Originally we saw high times during
WeakReference processing, but adding the "-XX:+ParallelRefProcEnabled" flag did
away with this. Now we frequently see very high times for JNI Weak Reference
processing (90+ seconds!). I've only noticed this being an issue during STW
processing in the Remark phase after an initial-mark phase has executed.
Reference processing during young or mixed collections never seems to be as
big an issue (milliseconds to 1-2s max). Currently JNI Weak References are not
processed in parallel in HotSpot, so adding the parallel flag has no effect
there.  (https://bugs.openjdk.java.net/browse/JDK-8072498)

Over the lifetime of the application we see a similar pattern after days of
uptime: the amount of time taken for Remark gradually increases from
milliseconds to minutes. If a Full GC is issued, the reference-processing
times reset back down to milliseconds.

Likely related: in a heap dump of a running application we noticed hundreds of
thousands of unreferenced DirectByteBuffers, and hundreds of thousands more of
them being referenced from Lucene410DocValuesProducer.

*Application Details:*
- Heavy indexing:
  - ~4000 bulk updates per minute, ~20 documents each update or 80k
documents per minute
- ~100 fields per document, mostly small strings, TrieInt, TrieDouble
- Usage of docValues (1-5 fields per document)
- Some multi-valued TrieInt (precisionStep=0) fields could potentially have
hundreds of values
- 1s auto soft commit openSearcher=true, 15s auto hard commit
openSearcher=false
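
For reference, that commit policy corresponds roughly to a solrconfig.xml block
like the one below (a sketch of the settings described above, not the poster's
actual config):

<autoCommit>
  <maxTime>15000</maxTime>            <!-- 15s auto hard commit -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>             <!-- 1s auto soft commit, opens a new searcher -->
</autoSoftCommit>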


*Questions*:
1) What is causing the JNI Weak References?
  - Is it from nio? Usage of MMaps, DirectBuffers, etc?
2) Why does it become worse during the lifetime of the application?
  - Is there a leak?
  - Are we getting into a situation where we begin promoting searchers, along
with their IndexReaders/Writers, too early? I.e., do we have a lot of
expensive objects moving to the tenured generation just before they become
unreferenced?
  - Tune G1NewSizePercent?
3) Are the JNI Weak References being visited but not cleaned during GC?
  - Why not? tune G1HeapWastePercent?


As of right now we are experimenting with different G1GC flags, however
without understanding the root cause it sometimes requires days between
experiments for the problem to present itself.


We have not tuned any of the options above that disable the adaptive sizing
policy; some options we are actively testing are:

- Lowering InitiatingHeapOccupancyPercent
 - To force dereferenced objects in tenured to be cleaned earlier.

- Increasing MaxGCPauseMillis
 - To encourage policy for the young generation size to be larger
(discouraging premature promotion).


Example of remark phases below (full gc log can be provided if required)

One *example log file* (however, we have seen worse), using the flags listed
at the beginning:

1348278.630: [GC pause (G1 Evacuation Pause) (young) (initial-mark)
Desired survivor size 41943040 bytes, new threshold 15 (max 15)
- age   1:   36138128 bytes,   36138128 total
1348278.676: [SoftReference, 0 refs, 0.0086523 secs]1348278.684: [WeakReference, 7128 refs, 0.0047846 secs]1348278.689: [FinalReference, 414 refs, 0.0063207 secs]1348278.695: [PhantomReference, 0 refs, 376 refs, 0.0129979 secs]1348278.709: [JNI Weak Reference, 0.1018689 secs], 0.1955299 secs]
   [Parallel Time: 39.7 ms, GC Workers: 23]
      [GC Worker Start (ms): Min: 1348278633.2, Avg: 1348278633.7, Max: 1348278634.2, Diff: 1.0]
      [Ext Root Scanning (ms): Min: 11.2, Avg: 11.7, Max: 14.9, Diff: 3.7, Sum: 269.2]
      [Update RS (ms): Min: 9.9, Avg: 13.3, Max: 14.0, Diff: 4.1, Sum: 304.8]
         [Processed Buffers: Min: 23, Avg: 66.9, Max: 117, Diff: 94, Sum: 1538]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.6]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 11.8, Avg: 12.4, Max: 12.9, Diff: 1.1, Sum: 286.3]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
      [GC Worker Other (ms): Min: 0.1, Avg: 0.3, Max: 0.7, Diff: 0.6, Sum: 6.2]
      [GC Worker Total (ms): Min: 37.0, Avg: 37.7, Max: 38.6, Diff: 1.6, Sum: 867.4]
      [GC Worker End (ms): Min: 1348278671.2, Avg: 1348278671.4, Max: 1348278671.8, Diff: 0.6]
   [Code Root Fixup: 0.4 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 2.8 ms]
   [Other: 152.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 137.2 ms]
      [Ref Enq: 2.6 ms]
      [Redirty Cards: 3.2 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.6 ms]
   [Eden: 532.0M(532.0M)->0.0B(532.0M) Survivors: 80.0M->80.0M Heap: 6225.6M(12.0G)->5701.0M(12.0G)]
 [Times: user=0.80 sys=0.04, real=0.20 secs]
1348278.827: [GC concurrent-root-region-scan-start]
1348278.828: Total time 

Re: How to know index file in OS Cache

2015-09-25 Thread Gili Nachum
Gonna try Mikhail's suggestion, but just for fun you can also empirically
"test" how much of a file is in the OS cache with:
time cat  > /dev/null

The faster it completes, the more blocks are cached. You can take a baseline
after manually purging the cache - I don't recall the command. Note that
running the command by itself encourages the OS to cache the file.
On Sep 25, 2015 12:39, "Aman Tandon"  wrote:

> Awesome thank you Mikhail. This is what I was looking for.
>
> This was just a random question poped up in my mind. So I just asked this
> on the group.
>
> With Regards
> Aman Tandon
>
> On Fri, Sep 25, 2015 at 2:49 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> > What about Linux:
> > $less /proc//maps
> > $pmap 
> >
> > On Fri, Sep 25, 2015 at 10:57 AM, Markus Jelsma <
> > markus.jel...@openindex.io>
> > wrote:
> >
> > > Hello - as far as i remember, you don't. A file itself is not the unit
> to
> > > cache, but blocks are.
> > > Markus
> > >
> > >
> > > -Original message-
> > > > From:Aman Tandon 
> > > > Sent: Friday 25th September 2015 5:56
> > > > To: solr-user@lucene.apache.org
> > > > Subject: How to know index file in OS Cache
> > > >
> > > > Hi,
> > > >
> > > > Is there any way to know that the index file/s is present in the OS
> > cache
> > > > or RAM. I want to check if the index is present in the RAM or in OS
> > cache
> > > > and which files are not in either of them.
> > > >
> > > > With Regards
> > > > Aman Tandon
> > > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > 
> > 
> >
>


Re: Help for Highlights

2015-09-25 Thread Erick Erickson
You're only returning the "submissaoid" and "tituloprojeto" fields (along
with score), and dismax is probably searching across other fields (I can't
tell from the fragment; it'll be the parameters set up in solrconfig.xml for
the select handler). Add debug=all to the query and you'll see all the
fields dismax is searching over.

When you specify hl.fl=*, it's saying in effect "any field that is
specified in the fl list should be highlighted if there's a match". So a
simple test would be to specify fl=*. Although do note that if the match is
on a field that is not stored, you'll see nothing.

Best,
Erick

2015-09-25 7:28 GMT-07:00 Leandro Henrique :

> Dear Colleagues of Solr-list,
>
> I am using Solr 5.0 at work to index a textual base of approximately
> 3500 documents. The documents are stored in XML files. Almost everything is
> right and functioning normally ... except the highlight functionality.
>
> This feature is not working well! After a search, Solr presents the
> results, but there are matched documents that have no highlights. I do
> not understand: how can a document be found for which there is no highlight?
>
> Here is an example:
>
> => Search for "rabanete" (in Portuguese):
>
> => URL search:
> http://localhost:8983/solr/baseprojetos/select?q=rabanete&sort=score+desc&rows=5&fl=tituloprojeto%2Csubmissaoid%2Cscore&wt=json&indent=true&defType=dismax&hl=true&hl.fl=*&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E&hl.usePhraseHighlighter=true&hl.highlightMultiTerm=true
>
> => Results (JSON):
> **
>  "responseHeader":{ "status":0, "QTime":146, "params":{ "hl":"true",
> "indent":"true", "fl":"tituloprojeto,submissaoid,score",
> "hl.usePhraseHighlighter":"true", "sort":"score desc", "rows":"5",
> "hl.simple.pre":"<em>", "q":"rabanete", "defType":"dismax",
> "hl.simple.post":"</em>", "hl.fl":"*", "wt":"json",
> "hl.highlightMultiTerm":"true"}},
> "response":{"numFound":5,"start":0,"maxScore":0.4094792,"docs":[
>
> { "submissaoid":"22920", "tituloprojeto":"AVALIAÇÃO DE DISPONIBILIDADE DE
> METAIS PESADOS PARA PLANTAS CULTIVADAS EM UM SOLO TRATADO COM FONTES
> ALTERNATIVAS DE POTÁSSIO", "score":0.4094792},
>
> { "submissaoid":"34721", "tituloprojeto":"Aperfeiçoamento do processo de
> produção e definição de parâmetros ideais para produção de conservas de
> brotos de soja a partir da cultivar BRS 216", "score":0.24568753},
>
> { "submissaoid":"204661", "tituloprojeto":"Transferência de tecnologias de
> cobertura vegetal na cultura dos citros e sua contribuição para a
> agricultura conservacionista.", "score":0.08686366},
>
> { "submissaoid":"204607", "tituloprojeto":"DESENVOLVIMENTO DE
> INSTRUMENTAÇÃO, MÉTODOS E PROCESSOS PARA AVALIAÇÃO E USO SEGURO DE
> RESÍDUOS", "score":0.057909105},
>
> { "submissaoid":"210515", "tituloprojeto":"Projeto Xisto Agrícola -
> Pesquisa e desenvolvimento do potencial de uso do xisto e seus coprodutos
> na agricultura", "score":0.057909105}]
>
> "highlighting":{
> "22920":{"objetivogeral":[" presente Projeto tem como objetivo geral
> estudar a disponibilidade de metais pesados provenientes de quatro fontes
> alternativas de potássio, para a alface, soja e rabanete"]},
> "34721":{"resumoprojeto":[" o tamanho necessário para serem consumidos,
> sendo fontes ricas em minerais, vitaminas, proteínas e com baixa caloria. O
> \"feijão moyashi\", também conhecido como feijão mungo é a espécie mais
> utilizada para a produção de brotos no Brasil. Mais de 30 espécies de
> plantas, principalmente de olerícolas (brócolis, rabanete"]},
> "204661":{},
> "204607":{"descricaoatividade":[" de massa seca da parte aérea e raízes e
> na produtividade de hortaliças.Os experimentos serão realizados na Estação
> Experimental da Embrapa Clima Temperado, num Argissolo Vermelho,
> utilizando-se espécies de hortaliças cujo órgão de consumo são as folhas
> (alface), as raízes (rabanete"]},
> "210515":{"descricaoatividade":[" plástica em março de 2012. O uso de
> cobertura plástica nos canteiros foi para evitar possíveis perdas dos
> tratamentos aplicados por lixiviação. As espécies de hortaliças avaliadas
> neste estudo são rabanete"]}}}
>
> **
>
> See the document with ID = 204661 does not highlight but was found with
> the third score!!!
>
> Where am I going wrong? Which configuration is wrong? Can anyone help me?
>
> Thanks in advance!
> Leandro.


Re: faceting is unusable slow since upgrade to 5.3.0

2015-09-25 Thread Alessandro Benedetti
Clear! Now I understand the current situation.
I hope the issue will be fixed soon and that the talk is recorded.
Good luck!

Cheers

2015-09-25 15:22 GMT+01:00 Yonik Seeley :

> On Fri, Sep 25, 2015 at 5:07 AM, Alessandro Benedetti
>  wrote:
> >> There is an undocumented "method" parameter - I need to enable that to
> >> allow switching between the docvalues approach and the UnInvertedField
> >> approach.
> >
> > Only to clarify, please correct me Yonik if my understanding is wrong or outdated:
> > To calculate facets, without going into the algorithm details, there are 2
> > approaches available:
> > Term Enum (good for a limited number of unique values in your field) and Fc
> > (FieldCache), good for a lot of unique values, but not for big fields.
> >
> > For the FC approach,
> >  - storing the DocValues for the field would transparently use them (with
> > the known benefit, at the cost of disk space for the docValues data structures)
> >  - without the DocValues, the algorithm will un-invert the index at
> > runtime, using the field cache to store the results
>
> Yeah, that's right so far.
> We should add a switch though for the method of uninversion...
> UnInvertedField (for indexes that change less frequently) vs DocValues
> (i.e. if you didn't index with DocValues, UnInvertedReader will
> uninvert to an in-memory structure that looks like DocValues).
>
> > So , from your quote, Term Enum will not be supported by Json Faceting ?
>
> We can, it just hasn't been a priority yet.
>
> Anyway, I'm going to step away from email and
> https://issues.apache.org/jira/browse/SOLR-8096 for a couple of days.
> I need to go focus on putting some slides together for
> Strata/HadoopWorld next week. I'll be talking about the new facet
> module / json facets there.
>
> -Yonik
>
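
The facet module being discussed here is driven by the json.facet request
parameter; a tiny SolrJ sketch of a terms facet request (the field and label
names are placeholders, not from this thread):

SolrQuery q = new SolrQuery("*:*");
q.setRows(0);
// lenient JSON is accepted by the facet module
q.add("json.facet", "{categories:{type:terms, field:cat, limit:10}}");
QueryResponse rsp = client.query(q);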



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Help on autocomplete / suggester

2015-09-25 Thread Alessandro Benedetti
Hi Andrea,
really curious, I found the province where I was born on the Solr mailing
list :)

Apart from that: based on your requirements, it's not possible to use any of
the built-in suggesters. You should definitely design a new Solr collection
(core) for your requirements.
It would be quite easy to provide those services through a new, specific Solr
core; a rough sketch of the kind of analysis such a core could use follows.
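
For example, such a core could use an edge-n-gram analyzed field for the
prefix case, roughly like the sketch below (the field type name and gram
sizes are assumptions; a sibling field built with solr.NGramFilterFactory
would cover the infix case, and the fields can be weighted differently via
edismax qf/boosts):

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- edge n-grams give prefix matches, e.g. "Vi" -> Viterbo, Vicenza -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>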

Cheers

2015-09-25 13:08 GMT+01:00 Andrea Gazzarini :

> Sorry, in the first point I meant "prefix_search"
>
> Best,
> Andrea
>
>
> On 09/24/2015 11:18 AM, Andrea Gazzarini wrote:
>
>> Hi guys,
>> as part of a customer requirement, I need to provide an autocomplete /
>> suggester feature. For that reason I started looking at the Suggester
>> Component.
>>
>> The target Solr version is not yet determined: I mean, there's another
>> project in production, of the same customer, which is using Solr 4.7.1 (no
>> SolrCloud, just a master with two slaves) so I guess they will extend those
>> instances with additional cores, but I'm not sure about that, maybe they
>> would like to migrate towards a new version  / new architecture.
>>
>> Anyway, after reading some info [1]  [2]  [3] about the Suggester, and
>> after trying a bit with some sample data, I'm not sure if that fits my
>> needs, because the proposed suggestions must follow these criteria:
>>
>>   * suffix search: Vi = *Vi*terbo, *Vi*cenza, *Vi*llanova (max priority)
>>   * infix search: Vi = A*vi*gliano, Tar*vi*sio (medium priority)
>>   * fuzzy (phonetic?) search: Vitr= Viterbo, Vitorchiano (lowest
>> priority, this requirement could be even removed)
>>
>>   * everything could be constrained by one or more filter queries
>>   * each suggestion could contain (depending on the use case) up to
>> five additional attributes (other than the suggestion itself), so
>> the payload provided by the Suggester couldn't be enough (or it
>> would require a custom encoding of such data in that field)
>>   * in a couple of scenarios, the search needs to be executed on
>> several fields, with different boosts (e.g. description, address,
>> code) and the corresponding suggestions come from another field
>> (e.g. name)
>>   * I don't have any incremental / delta indexing issue, the whole
>> dataset is not huge, a couple of millions of database records,
>> with a low grow rate, and I can recreate everything from scratch
>> using the DIH
>>
>> Do you think this is something for the built-in Suggester? Or is this
>> something that it's better to implement with a RequestHandler with
>> something like (e)dismax and ngramming?
>>
>> Many thanks in advance
>> Andrea
>>
>> [1] https://cwiki.apache.org/confluence/display/solr/Suggester
>> [2] http://lucidworks.com/blog/solr-suggester/
>> [3] http://alexbenedetti.blogspot.it/2015/07/solr-you-complete-me.html
>>
>>
>>
>>
>


-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


firstSearcher cache warming with own QuerySenderListener

2015-09-25 Thread Christian Reuschling
Hey all,

we want to avoid cold start performance issues when the caches are cleared 
after a server restart.

For this, we have written a SearchComponent that saves least recently used 
queries. These are
written to a file inside a closeHook of a SolrCoreAware at server shutdown.

The plan is to perform these queries at server startup to warm up the caches. 
For this, we have
written a derivative of the QuerySenderListener and configured it as 
firstSearcher listener in
solrconfig.xml. The only difference from the original QuerySenderListener is
that it gets its queries from the previously dumped LRU queries rather than
from the config file.

It seems that everything is called correctly, and we have the impression that 
the query response
times for the dumped queries are sometimes slightly better than without this 
warming.

Nevertheless, there is still a huge difference compared to the times when we 
manually perform the same
queries once, e.g. from a browser. If we do this, the second time we perform 
these queries they
respond much faster (up to 10 times) than the response times after the 
implemented warming.

It seems that not all caches are warmed up during our warming. And because of 
these huge
differences, I doubt we missed something.

The index has about 25M documents and is split into two shards in a cloud 
configuration, both
shards are on the same server instance for now, for testing purposes.

Does anybody have an idea? I tried to disable lazy field loading as a potential 
issue, but with no
success.


Cheers,

Christian



Re: Different ports for search and upload request

2015-09-25 Thread Uwe Reh

On 25.09.2015 at 00:05, Siddhartha Singh Sandhu wrote:

*Never did this. *But how about this crazy idea:
Take an Amazon EFS and share it between 2 EC2. 


I think you are on the right track. IMHO this requirement should be 
solved externally.


Option 1:
Hide your Solr node behind an HTTP proxy that publishes the APIs/handlers 
on different ports. Or publish only request handlers like 'select' 
and 'get', and let your update process use the full API.


Option 2: Use replication. Update the master and send your queries to the 
slave.
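
For option 2, the classic master/slave setup is just the ReplicationHandler
configured on both sides; a rough sketch (host, core name and poll interval
are placeholders):

<!-- master solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- slave solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/collection1/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>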


Uwe



Using a plugin to filter in schema.xml

2015-09-25 Thread Siddhartha Singh Sandhu
Hi,

I wanted to use the twitter-text library's GitHub implementation to filter
the tokens (hashtags) in my text. I know I can also use the pattern-matching
tokenizer, but I would trust Twitter's library more than my own regex to do
the job for me. I wanted to use it in unison with
solr.WhitespaceTokenizerFactory to get the tokens.

I need help understanding how I can do that. Do I have to refactor the
Twitter Java library to "extends TokenFilterFactory", or can I use it the
way it is?

Regards,

Sid.


Re: recovering mode loop

2015-09-25 Thread Erick Erickson
On a quick look at the replica jstack (the leader didn't come through in
text form) there's nothing that jumps out.

I _have_ seen lots and lots of updates coming through one at a time do some
weird things with replicas going in and out of recovery, so that's a good
intuition to follow up on.

Wish I could point to a "smoking gun".

Erick

On Fri, Sep 25, 2015 at 1:07 AM, Lorenzo Fundaró <
lorenzo.fund...@dawandamail.com> wrote:

> I think the attachment was stripped off from the mail :( .
> here's a public link.
>
>
> https://drive.google.com/file/d/0B_z8xmsby0uxRDZEeWpLcnR2b3M/view?usp=sharing
>
> On 25 September 2015 at 09:59, Lorenzo Fundaró <
> lorenzo.fund...@dawandamail.com> wrote:
>
> > These are the last logs I've got, even with a higher zkClientTimeout of 30s.
> >
> > this is a replica:
> >
> > 9/24/2015, 8:14:46 PM WARN RecoveryStrategy Stopping recovery for core=dawanda coreNodeName=core_node6
> > 9/24/2015, 8:14:56 PM WARN RecoveryStrategy Stopping recovery for core=dawanda coreNodeName=core_node6
> > 9/24/2015, 8:16:34 PM WARN RecoveryStrategy Stopping recovery for core=dawanda coreNodeName=core_node6
> > 9/24/2015, 8:16:37 PM WARN PeerSync PeerSync: core=dawanda url=http://solr6.dawanda.services:8983/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
> > 9/24/2015, 8:16:40 PM ERROR ReplicationHandler SnapPull failed: org.apache.solr.common.SolrException: Index fetch failed:
> > 9/24/2015, 8:16:40 PM ERROR RecoveryStrategy Error while trying to recover: org.apache.solr.common.SolrException: Replication for recovery failed.
> > 9/24/2015, 8:16:40 PM ERROR RecoveryStrategy Recovery failed - trying again... (0) core=dawanda
> > 9/24/2015, 8:16:55 PM ERROR SolrCore null:org.apache.lucene.store.AlreadyClosedException: Already closed: MMapIndexInput(path="/srv/loveos/solr/server/solr/dawanda/data/index.20150924174642739/_ulvt_Lucene50_0.tim")
> > 9/24/2015, 8:16:59 PM WARN SolrCore [dawanda] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> > 9/24/2015, 8:17:53 PM WARN UpdateLog Starting log replay tlog{file=/srv/loveos/solr/server/solr/dawanda/data/tlog/tlog.0024343 refcount=2} active=true starting pos=3393565
> > 9/24/2015, 8:18:47 PM WARN RecoveryStrategy Stopping recovery for core=dawanda coreNodeName=core_node6
> > 9/24/2015, 8:18:57 PM WARN UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=2556 deletes=2455 deleteByQuery=0 errors=0 positionOfStart=3393565}
> > 9/24/2015, 8:18:57 PM WARN RecoveryStrategy Stopping recovery for core=dawanda coreNodeName=core_node6
> > 9/24/2015, 8:19:07 PM WARN RecoveryStrategy Stopping recovery for core=dawanda coreNodeName=core_node6
> > 9/24/2015, 8:19:17 PM WARN RecoveryStrategy Stopping recovery for core=dawanda coreNodeName=core_node6
> > 9/24/2015, 8:19:27 PM WARN RecoveryStrategy Stopping recovery for core=dawanda coreNodeName=core_node6
> > 9/24/2015, 8:19:37 PM WARN RecoveryStrategy Stopping recovery for core=dawanda coreNodeName=core_node6
> > 9/24/2015, 8:19:47 PM WARN RecoveryStrategy Stopping recovery for core=dawanda coreNodeName=core_node6
> >
> > and *again an overlapping searcher*.
> >
> > this is the leader:
> >
> > 9/24/2015, 8:18:56 PM WARN DistributedUpdateProcessor Error sending update to http://solr6.dawanda.services:8983/solr
> > 9/24/2015, 8:18:56 PM ERROR StreamingSolrClients error
> > 9/24/2015, 8:18:56 PM WARN DistributedUpdateProcessor Error sending update to http://solr6.dawanda.services:8983/solr
> > 9/24/2015, 8:18:56 PM WARN ZkController Leader is publishing core=dawanda coreNodeName=core_node6 state=down on behalf of un-reachable replica http://solr6.dawanda.services:8983/solr/dawanda/; forcePublishState? false
> > 9/24/2015, 8:18:56 PM ERROR DistributedUpdateProcessor Setting up to try to start recovery on replica http://solr6.dawanda.services:8983/solr/dawanda/ after: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
> > 9/24/2015, 8:18:57 PM ERROR StreamingSolrClients error
> > 9/24/2015, 8:18:57 PM WARN DistributedUpdateProcessor Error sending update to http://solr6.dawanda.services:8983/solr
> > 9/24/2015, 8:18:57 PM WARN ZkController Leader is publishing core=dawanda coreNodeName=core_node6 state=down on behalf of un-reachable replica http://solr6.dawanda.services:8983/solr/dawanda/; forcePublishState? false
> > 9/24/2015, 8:18:57 PM ERROR DistributedUpdateProcessor Setting up to try to start recovery on replica http://solr6.dawanda.services:8983/solr/dawanda/ after: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
> >
> >
> > I am doing JMX monitoring, and right when two replicas went into recovering
> > and down respectively, I checked the threads on the leader and there were
> > at least 30 updatesExecutor threads running.
> >
> > I did a check on one of the replicas' GC logs, and with GCViewer I get this for
> >
> > -* 

Re: Help for Highlights

2015-09-25 Thread Erick Erickson
Glad to help!

Erick

2015-09-25 10:05 GMT-07:00 Leandro Henrique :

> Hello Erick,
>
> Very, very, very thanks! The highlights "null" was fields with stored
> parameter setted to "false".
>
> Thanks again!
>
> Leandro.
>
> > Date: Fri, 25 Sep 2015 09:14:16 -0700
> > Subject: Re: Help for Highlights
> > From: erickerick...@gmail.com
> > To: solr-user@lucene.apache.org
> >
> > You're only returning the "submissaoid" and "tituloprojeto"  fields
> (along
> > with score), and dismax is probably searching across other fields (I
> can't
> > tell from the fragment, it'll be the parameters set up in solrconfig.xml,
> > the select handler). Add =all to the query and you'll see all the
> > fields dismax is searching over.
> >
> > When you specify hl.fl=*, it's saying in effect "any field that is
> > specified in the fl list should be highlighted if there's a match". So a
> > simple test would be to specify fl=*. Although do note that if the match
> is
> > on a field that is not stored, you'll see nothing.
> >
> > Best,
> > Erick
> >
> > 2015-09-25 7:28 GMT-07:00 Leandro Henrique :
> >
> > > Dear Colleagues of Solr-list,
> > >
> > > I am using the Solr 5.0 on my work to index textual base of
> approximately
> > > 3500 documents. The documents are stored in XML files. Almost
> everything is
> > > right and functioning normally ... unless the highlight functionality.
> > >
> > > This feature is not working well! After a survey any, Solr presents the
> > > findings, but there are documents matched that do not have highlights.
> I do
> > > not understand: How a document is found that there is no highlight for
> him?
> > >
> > > Here is an example:
> > >
> > > => Search for "rabanete" (in Portuguese):
> > >
> > > => URL search:
> > >
> http://localhost:8983/solr/baseprojetos/select?q=rabanete=score+desc=5=tituloprojeto%2Csubmissaoid%2Cscore=json=true=dismax=true=*=%3Cem%3E=%3C%2Fem%3E=true=true
> > >
> > > => Results (JSON):
> > > **
> > >  "responseHeader":{ "status":0, "QTime":146, "params":{ "hl":"true",
> > > "indent":"true", "fl":"tituloprojeto,submissaoid,score",
> > > "hl.usePhraseHighlighter":"true", "sort":"score desc", "rows":"5",
> > > "hl.simple.pre":"", "q":"rabanete", "defType":"dismax",
> > > "hl.simple.post":"", "hl.fl":"*", "wt":"json",
> > > "hl.highlightMultiTerm":"true"}},
> > > "response":{"numFound":5,"start":0,"maxScore":0.4094792,"docs":[
> > >
> > > { "submissaoid":"22920", "tituloprojeto":"AVALIAÇÃO DE DISPONIBILIDADE
> DE
> > > METAIS PESADOS PARA PLANTAS CULTIVADAS EM UM SOLO TRATADO COM FONTES
> > > ALTERNATIVAS DE POTÁSSIO", "score":0.4094792},
> > >
> > > { "submissaoid":"34721", "tituloprojeto":"Aperfeiçoamento do processo
> de
> > > produção e definição de parâmetros ideais para produção de conservas de
> > > brotos de soja a partir da cultivar BRS 216", "score":0.24568753},
> > >
> > > { "submissaoid":"204661", "tituloprojeto":"Transferência de
> tecnologias de
> > > cobertura vegetal na cultura dos citros e sua contribuição para a
> > > agricultura conservacionista.", "score":0.08686366},
> > >
> > > { "submissaoid":"204607", "tituloprojeto":"DESENVOLVIMENTO DE
> > > INSTRUMENTAÇÃO, MÉTODOS E PROCESSOS PARA AVALIAÇÃO E USO SEGURO DE
> > > RESÍDUOS", "score":0.057909105},
> > >
> > > { "submissaoid":"210515", "tituloprojeto":"Projeto Xisto Agrícola -
> > > Pesquisa e desenvolvimento do potencial de uso do xisto e seus
> coprodutos
> > > na agricultura", "score":0.057909105}]
> > >
> > > "highlighting":{
> > > "22920":{"objetivogeral":[" presente Projeto tem como objetivo geral
> > > estudar a disponibilidade de metais pesados provenientes de quatro
> fontes
> > > alternativas de potássio, para a alface, soja e rabanete"]},
> > > "34721":{"resumoprojeto":[" o tamanho necessário para serem consumidos,
> > > sendo fontes ricas em minerais, vitaminas, proteínas e com baixa
> caloria. O
> > > \"feijão moyashi\", também conhecido como feijão mungo é a espécie mais
> > > utilizada para a produção de brotos no Brasil. Mais de 30 espécies de
> > > plantas, principalmente de olerícolas (brócolis, rabanete"]},
> > > "204661":{},
> > > "204607":{"descricaoatividade":[" de massa seca da parte aérea e
> raízes e
> > > na produtividade de hortaliças.Os experimentos serão realizados na
> Estação
> > > Experimental da Embrapa Clima Temperado, num Argissolo Vermelho,
> > > utilizando-se espécies de hortaliças cujo órgão de consumo são as
> folhas
> > > (alface), as raízes (rabanete"]},
> > > "210515":{descricaoatividade":[" plástica em março de 2012. O uso de
> > > cobertura plástica nos canteiros foi para evitar possíveis perdas dos
> > > tratamentos aplicados por lixiviação. As espécies de hortaliças
> avaliadas
> > > neste estudo são rabanete"]}}}
> > >
> > > **
> > >
> > > See the document with ID = 204661 does not highlight but was found with
> > > the third score!!!
> > >
> > > Where am I going wrong? Which configuration is wrong? Can 

Re: How to know index file in OS Cache

2015-09-25 Thread Jeff Wartes


I’ve been relying on this:
https://code.google.com/archive/p/linux-ftools/


fincore will tell you what percentage of a given file is in cache, and
fadvise can suggest to the OS that a file be cached.

All of the solr start scripts at my company first call fadvise
(FADV_WILLNEED) on all the files in the index directories. It works great
if you’re on a linux system.
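
If you want to probe from the JVM side instead, Java's MappedByteBuffer offers
a rough equivalent; a small sketch (the file path is a placeholder, and
isLoaded() is only a best-effort hint):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class IndexFileResidency {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);   // pass a segment file from the index directory
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // isLoaded() tells whether the mapped content is likely resident in memory
            System.out.println(file + " likely resident: " + buffer.isLoaded());
            // load() asks the OS to page the file in, roughly like fadvise(WILLNEED)
            buffer.load();
        }
    }
}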



On 9/25/15, 8:41 AM, "Gili Nachum"  wrote:

>Gonna try Mikhail suggestion, but just for fun you can also empirically
>"test" for how much of a file is in the oshr...@matrix.co.il cache with:
>time cat  > /dev/null
>
>The faster it completes the more blocks are cached you can take a baseline
>after manually purging of cache - don't recall the command. Note that
>running the command by itself encourages to cache the file.
>On Sep 25, 2015 12:39, "Aman Tandon"  wrote:
>
>> Awesome thank you Mikhail. This is what I was looking for.
>>
>> This was just a random question poped up in my mind. So I just asked
>>this
>> on the group.
>>
>> With Regards
>> Aman Tandon
>>
>> On Fri, Sep 25, 2015 at 2:49 PM, Mikhail Khludnev <
>> mkhlud...@griddynamics.com> wrote:
>>
>> > What about Linux:
>> > $less /proc//maps
>> > $pmap 
>> >
>> > On Fri, Sep 25, 2015 at 10:57 AM, Markus Jelsma <
>> > markus.jel...@openindex.io>
>> > wrote:
>> >
>> > > Hello - as far as i remember, you don't. A file itself is not the
>>unit
>> to
>> > > cache, but blocks are.
>> > > Markus
>> > >
>> > >
>> > > -Original message-
>> > > > From:Aman Tandon 
>> > > > Sent: Friday 25th September 2015 5:56
>> > > > To: solr-user@lucene.apache.org
>> > > > Subject: How to know index file in OS Cache
>> > > >
>> > > > Hi,
>> > > >
>> > > > Is there any way to know that the index file/s is present in the
>>OS
>> > cache
>> > > > or RAM. I want to check if the index is present in the RAM or in
>>OS
>> > cache
>> > > > and which files are not in either of them.
>> > > >
>> > > > With Regards
>> > > > Aman Tandon
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Sincerely yours
>> > Mikhail Khludnev
>> > Principal Engineer,
>> > Grid Dynamics
>> >
>> > 
>> > 
>> >
>>



Re: [Open source] SolrCloud High Availability (HAFT) Library - Bloomreach

2015-09-25 Thread Shawn Heisey
On 9/25/2015 11:30 AM, Nitin Sharma wrote:
> It would be great if we can link this in the solrcloud contributions wiki.
> Can you give us access to that?

Just create an account on the wiki, tell us the username, and you'll be
added quickly to the group that allows editing.

https://wiki.apache.org/solr/FrontPage?action=login

Thanks,
Shawn



Re: Help on autocomplete / suggester

2015-09-25 Thread Andrea Gazzarini
Hi Alessandro,
Yes, I read a lot of posts from you about the Suggester component,
including your blog, so the province name was just to catch your attention
:D...just kidding, I'm living there.

Many many thanks

Best,
Andrea
On 25 Sep 2015 17:12, "Alessandro Benedetti" 
wrote:

> Hi Andrea,
> really curious I found the province where I was born in the Solr mailing
> list :)
>
> Apart that , based on your requirements, it's not possible to use any
> suggester.
> You should definitely design a new Solr collection(core) for your
> requirements.
> Would be quite easy to provide those services through a new specific Solr
> core.
>
> Cheers
>
> 2015-09-25 13:08 GMT+01:00 Andrea Gazzarini :
>
> > Sorry, in the first point I meant "prefix_search"
> >
> > Best,
> > Andrea
> >
> >
> > On 09/24/2015 11:18 AM, Andrea Gazzarini wrote:
> >
> >> Hi guys,
> >> as part of a customer requirement, I need to provide an autocomplete /
> >> suggester feature. For that reason I started looking at the Suggester
> >> Component.
> >>
> >> The target Solr version is not yet determined: I mean, there's another
> >> project in production, of the same customer, which is using Solr 4.7.1
> (no
> >> SolrCloud, just a master with two slaves) so I guess they will extend
> those
> >> instances with additional cores, but I'm not sure about that, maybe they
> >> would like to migrate towards a new version  / new architecture.
> >>
> >> Anyway, after reading some info [1]  [2]  [3] about the Suggester, and
> >> after trying a bit with some sample data, I'm not sure if that fits my
> >> needs, because the proposed suggestions must follow these criteria:
> >>
> >>   * suffix search: Vi = *Vi*terbo, *Vi*cenza, *Vi*llanova (max priority)
> >>   * infix search: Vi = A*vi*gliano, Tar*vi*sio (medium priority)
> >>   * fuzzy (phonetic?) search: Vitr= Viterbo, Vitorchiano (lowest
> >> priority, this requirement could be even removed)
> >>
> >>   * everything could be constrained by one or more filter queries
> >>   * each suggestion could contain (depending on the use case) up to
> >> five additional attributes (other than the suggestion itself), so
> >> the payload provided by the Suggester couldn't be enough (or it
> >> would require a custom encoding of such data in that field)
> >>   * in a couple of scenarios, the search needs to be executed on
> >> several fields, with different boosts (e.g. description, address,
> >> code) and the corresponding suggestions come from another field
> >> (e.g. name)
> >>   * I don't have any incremental / delta indexing issue, the whole
> >> dataset is not huge, a couple of millions of database records,
> >> with a low grow rate, and I can recreate everything from scratch
> >> using the DIH
> >>
> >> Do you think this is something for the built-in Suggester? Or is this
> >> something that it's better to implement with a RequestHandler with
> >> something like (e)dismax and ngramming?
> >>
> >> Many thanks in advance
> >> Andrea
> >>
> >> [1] https://cwiki.apache.org/confluence/display/solr/Suggester
> >> [2] http://lucidworks.com/blog/solr-suggester/
> >> [3] http://alexbenedetti.blogspot.it/2015/07/solr-you-complete-me.html
> >>
> >>
> >>
> >>
> >
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


bulk reindexing 5.3.0 issue

2015-09-25 Thread Ravi Solr
I have been trying to re-index the docs (about 1.5 million), as one of the
fields needed part of its string value removed (it was accidentally
introduced). I was issuing a query for 100 docs, getting 4 fields and updating
each doc (atomic update with "set") via the CloudSolrClient in batches.
However, from time to time the query returns 0 results, which exits the
re-indexing program.

I can't understand why the cloud returns 0 results when there are 1.4x
million docs which have the "accidental" string in them.

Thanks

Ravi Kiran Bhaskar


Re: [Open source] SolrCloud High Availability (HAFT) Library - Bloomreach

2015-09-25 Thread Shawn Heisey
On 9/25/2015 12:00 PM, Nitin Sharma wrote:
>  My user name is nitin.sharma.  Does this give edit access to the
> confluence page as well?

You are added as a contributor on the Solr wiki.

Only Apache committers for the Solr project have access to edit the
confluence wiki.  This is because the wiki is used to produce the Apache
Reference Guide, which is released as official documentation.

You are welcome to comment on the confluence wiki if you find something
missing or incorrect, and it will be given full consideration.

Thanks,
Shawn



Re: Using a plugin to filter in schema.xml

2015-09-25 Thread Siddhartha Singh Sandhu
I need a go-to reference for writing the custom tokenizer. Any suggestions?

On Fri, Sep 25, 2015 at 2:36 PM, Siddhartha Singh Sandhu <
sandhus...@gmail.com> wrote:

> For sure.
>
> On Fri, Sep 25, 2015 at 1:13 PM, Alexandre Rafalovitch  > wrote:
>
>> I think (I lost the library link) you would need to build a bridge by
>> doing a custom Analyzer or Tokenizer and then using the library under
>> the covers. Would be a nice contribution to open-source if you managed
>> to achieve that.
>>
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 25 September 2015 at 12:58, Siddhartha Singh Sandhu
>>  wrote:
>> > Hi,
>> >
>> > I wanted to use the twitter-text libraries github implementation to
>> filter
>> > the tokens(hashtags) in my text. I know I can use the Pattern Matching
>> > tokenizer also, but would trust twitter's library more then my own
>> regex to
>> > do the job for me. I wanted to use it in unison with
>> > the solr.WhitespaceTokenizerFactory to get the tokens.
>> >
>> > Need help in understanding on how can I do that. Do I have to refactor
>> the
>> > twitter Java library to "extends TokenFilterFactory" or can I use it the
>> > way it is.
>> >
>> > Regards,
>> >
>> > Sid.
>>
>
>


Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Walter Underwood
Sure.

1. Delete all the docs (no commit).
2. Add all the docs (no commit).
3. Commit.
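
In SolrJ that pattern is roughly the sketch below (the client setup and the
document source are assumptions, not something from this thread):

import java.io.IOException;
import java.util.List;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FullReindex {
    // The collection is assumed to be set as the client's default collection.
    public static void reindex(CloudSolrClient client,
                               Iterable<List<SolrInputDocument>> batches)
            throws SolrServerException, IOException {
        client.deleteByQuery("*:*");      // 1. delete all docs (no commit)
        for (List<SolrInputDocument> batch : batches) {
            client.add(batch);            // 2. re-add all docs (no commit)
        }
        client.commit();                  // 3. a single commit at the end exposes the rebuilt index
    }
}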

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 25, 2015, at 2:17 PM, Ravi Solr  wrote:
> 
> I have been trying to re-index the docs (about 1.5 million) as one of the
> field needed part of string value removed (accidentally introduced). I was
> issuing a query for 100 docs getting 4 fields and updating the doc  (atomic
> update with "set") via the CloudSolrClient in batches, However from time to
> time the query returns 0 results, which exits the re-indexing program.
> 
> I cant understand as to why the cloud returns 0 results when there are 1.4x
> million docs which have the "accidental" string in them.
> 
> Is there another way to do bulk massive updates ?
> 
> Thanks
> 
> Ravi Kiran Bhaskar



Re: firstSearcher cache warming with own QuerySenderListener

2015-09-25 Thread Walter Underwood
Right.

I chose the twenty most frequent terms from our documents and use those for 
cache warming. The list of most frequent terms is pretty stable in most 
collections.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 25, 2015, at 8:38 AM, Erick Erickson  wrote:
> 
> That's what the firstSearcher event in solrconfig.xml is for, exactly the
> case of autowarming Solr when it's just been started. The queries you put
> in that event are fired only when the server starts.
> 
> So I'd just put my queries there. And you do not have to put a zillion
> queries here. Start with one that mentions all the facets you intend to
> use, sorts by all the various sort fields you use, perhaps (if you have any
> _very_ common filter queries) put those in too.
> 
> Then analyze the queries that are still slow when issued the first time
> after startup and add what you suspect are the relevant bits to the
> firstSearcher query (or queries).
> 
> I suggest that this is a much easier thing to do, and focus efforts on why
> you are shutting down your Solr servers often enough that anyone notices..
> 
> Best,
> Erick
> 
> 
> 
> On Fri, Sep 25, 2015 at 8:31 AM, Christian Reuschling <
> christian.reuschl...@gmail.com> wrote:
> 
>> Hey all,
>> 
>> we want to avoid cold start performance issues when the caches are cleared
>> after a server restart.
>> 
>> For this, we have written a SearchComponent that saves least recently used
>> queries. These are
>> written to a file inside a closeHook of a SolrCoreAware at server shutdown.
>> 
>> The plan is to perform these queries at server startup to warm up the
>> caches. For this, we have
>> written a derivative of the QuerySenderListener and configured it as
>> firstSearcher listener in
>> solrconfig.xml. The only difference to the origin QuerySenderListener is
>> that it gets it's queries
>> from the formerly dumped lru queries rather than getting them from the
>> config file.
>> 
>> It seems that everything is called correctly, and we have the impression
>> that the query response
>> times for the dumped queries are sometimes slightly better than without
>> this warming.
>> 
>> Nevertheless, there is still a huge difference against the times when we
>> manually perform the same
>> queries once, e.g. from a browser. If we do this, the second time we
>> perform these queries they
>> respond much faster (up to 10 times) than the response times after the
>> implemented warming.
>> 
>> It seems that not all caches are warmed up during our warming. And because
>> of these huge
>> differences, I doubt we missed something.
>> 
>> The index has about 25M documents, and is splitted into two shards in a
>> cloud configuration, both
>> shards are on the same server instance for now, for testing purposes.
>> 
>> Does anybody have an idea? I tried to disable lazy field loading as a
>> potential issue, but with no
>> success.
>> 
>> 
>> Cheers,
>> 
>> Christian
>> 
>> 



Re: [Open source] SolrCloud High Availability (HAFT) Library - Bloomreach

2015-09-25 Thread Nitin Sharma
Hi Shawn,

 My user name is nitin.sharma.  Does this give edit access to the
confluence page as well?

Thanks,
Nitin

On Fri, Sep 25, 2015 at 10:44 AM, Shawn Heisey  wrote:

> On 9/25/2015 11:30 AM, Nitin Sharma wrote:
> > It would be great if we can link this in the solrcloud contributions
> wiki.
> > Can you give us access to that?
>
> Just create an account on the wiki, tell us the username, and you'll be
> added quickly to the group that allows editing.
>
> https://wiki.apache.org/solr/FrontPage?action=login
>
> Thanks,
> Shawn
>
>


-- 
-- Nitin


Re: Using a plugin to filter in schema.xml

2015-09-25 Thread Siddhartha Singh Sandhu
For sure.

On Fri, Sep 25, 2015 at 1:13 PM, Alexandre Rafalovitch 
wrote:

> I think (I lost the library link) you would need to build a bridge by
> doing a custom Analyzer or Tokenizer and then using the library under
> the covers. Would be a nice contribution to open-source if you managed
> to achieve that.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 25 September 2015 at 12:58, Siddhartha Singh Sandhu
>  wrote:
> > Hi,
> >
> > I wanted to use the twitter-text libraries github implementation to
> filter
> > the tokens(hashtags) in my text. I know I can use the Pattern Matching
> > tokenizer also, but would trust twitter's library more then my own regex
> to
> > do the job for me. I wanted to use it in unison with
> > the solr.WhitespaceTokenizerFactory to get the tokens.
> >
> > Need help in understanding on how can I do that. Do I have to refactor
> the
> > twitter Java library to "extends TokenFilterFactory" or can I use it the
> > way it is.
> >
> > Regards,
> >
> > Sid.
>


RE: Help for Highlights

2015-09-25 Thread Leandro Henrique
Hello Erick,

Very, very many thanks! The "null" highlights were fields with the stored
parameter set to "false".

Thanks again!

Leandro.

> Date: Fri, 25 Sep 2015 09:14:16 -0700
> Subject: Re: Help for Highlights
> From: erickerick...@gmail.com
> To: solr-user@lucene.apache.org
> 
> You're only returning the "submissaoid" and "tituloprojeto"  fields (along
> with score), and dismax is probably searching across other fields (I can't
> tell from the fragment, it'll be the parameters set up in solrconfig.xml,
> the select handler). Add =all to the query and you'll see all the
> fields dismax is searching over.
> 
> When you specify hl.fl=*, it's saying in effect "any field that is
> specified in the fl list should be highlighted if there's a match". So a
> simple test would be to specify fl=*. Although do note that if the match is
> on a field that is not stored, you'll see nothing.
> 
> Best,
> Erick
> 
> 2015-09-25 7:28 GMT-07:00 Leandro Henrique :
> 
> > Dear Colleagues of Solr-list,
> >
> > I am using the Solr 5.0 on my work to index textual base of approximately
> > 3500 documents. The documents are stored in XML files. Almost everything is
> > right and functioning normally ... unless the highlight functionality.
> >
> > This feature is not working well! After a survey any, Solr presents the
> > findings, but there are documents matched that do not have highlights. I do
> > not understand: How a document is found that there is no highlight for him?
> >
> > Here is an example:
> >
> > => Search for "rabanete" (in Portuguese):
> >
> > => URL search:
> > http://localhost:8983/solr/baseprojetos/select?q=rabanete=score+desc=5=tituloprojeto%2Csubmissaoid%2Cscore=json=true=dismax=true=*=%3Cem%3E=%3C%2Fem%3E=true=true
> >
> > => Results (JSON):
> > **
> >  "responseHeader":{ "status":0, "QTime":146, "params":{ "hl":"true",
> > "indent":"true", "fl":"tituloprojeto,submissaoid,score",
> > "hl.usePhraseHighlighter":"true", "sort":"score desc", "rows":"5",
> > "hl.simple.pre":"", "q":"rabanete", "defType":"dismax",
> > "hl.simple.post":"", "hl.fl":"*", "wt":"json",
> > "hl.highlightMultiTerm":"true"}},
> > "response":{"numFound":5,"start":0,"maxScore":0.4094792,"docs":[
> >
> > { "submissaoid":"22920", "tituloprojeto":"AVALIAÇÃO DE DISPONIBILIDADE DE
> > METAIS PESADOS PARA PLANTAS CULTIVADAS EM UM SOLO TRATADO COM FONTES
> > ALTERNATIVAS DE POTÁSSIO", "score":0.4094792},
> >
> > { "submissaoid":"34721", "tituloprojeto":"Aperfeiçoamento do processo de
> > produção e definição de parâmetros ideais para produção de conservas de
> > brotos de soja a partir da cultivar BRS 216", "score":0.24568753},
> >
> > { "submissaoid":"204661", "tituloprojeto":"Transferência de tecnologias de
> > cobertura vegetal na cultura dos citros e sua contribuição para a
> > agricultura conservacionista.", "score":0.08686366},
> >
> > { "submissaoid":"204607", "tituloprojeto":"DESENVOLVIMENTO DE
> > INSTRUMENTAÇÃO, MÉTODOS E PROCESSOS PARA AVALIAÇÃO E USO SEGURO DE
> > RESÍDUOS", "score":0.057909105},
> >
> > { "submissaoid":"210515", "tituloprojeto":"Projeto Xisto Agrícola -
> > Pesquisa e desenvolvimento do potencial de uso do xisto e seus coprodutos
> > na agricultura", "score":0.057909105}]
> >
> > "highlighting":{
> > "22920":{"objetivogeral":[" presente Projeto tem como objetivo geral
> > estudar a disponibilidade de metais pesados provenientes de quatro fontes
> > alternativas de potássio, para a alface, soja e rabanete"]},
> > "34721":{"resumoprojeto":[" o tamanho necessário para serem consumidos,
> > sendo fontes ricas em minerais, vitaminas, proteínas e com baixa caloria. O
> > \"feijão moyashi\", também conhecido como feijão mungo é a espécie mais
> > utilizada para a produção de brotos no Brasil. Mais de 30 espécies de
> > plantas, principalmente de olerícolas (brócolis, rabanete"]},
> > "204661":{},
> > "204607":{"descricaoatividade":[" de massa seca da parte aérea e raízes e
> > na produtividade de hortaliças.Os experimentos serão realizados na Estação
> > Experimental da Embrapa Clima Temperado, num Argissolo Vermelho,
> > utilizando-se espécies de hortaliças cujo órgão de consumo são as folhas
> > (alface), as raízes (rabanete"]},
> > "210515":{descricaoatividade":[" plástica em março de 2012. O uso de
> > cobertura plástica nos canteiros foi para evitar possíveis perdas dos
> > tratamentos aplicados por lixiviação. As espécies de hortaliças avaliadas
> > neste estudo são rabanete"]}}}
> >
> > **
> >
> > See the document with ID = 204661 does not highlight but was found with
> > the third score!!!
> >
> > Where am I going wrong? Which configuration is wrong? Can anyone help me?
> >
> > Thanks in advance!
> > Leandro.
  

Re: Expensive GC Remark Phase for JNI Weak Reference

2015-09-25 Thread Shawn Heisey
On 9/25/2015 8:53 AM, Keith L wrote:
> Using:
> - JDK 1.8u40
> - UseG1GC, ParallelRefProcEnabled, Xmx12g,Xms12g
> - Solr 4.10.4

My own testing has not been extremely rigorous, and I have not spent a
lot of time looking at the fine details in the GC logs.  The details of
your message that I omitted here indicate that you have looked much
closer than I have.

> *Application Details:*
> - Heavy indexing:
>   - ~4000 bulk updates per minute, ~20 documents each update or 80k
> documents per minute
> - ~100 fields per document, mostly small strings, TrieInt, TrieDouble
> - Usage of docValues (1-5 fields per document)
> - Some fields multi-value TrieInt (precisionStep=0) fields could
> potentially have hundreds of values
> - 1s auto soft commit openSearcher=true, 15s auto hard commit
> openSearcher=false

Heavy indexing creates a lot of garbage, so it keeps GC algorithms busy.
 It is a good way to find GC problems.

> *Questions*:
> 1) What is causing the JNI Weak References?
>   - Is it from nio? Usage of MMaps, DirectBuffers, etc?
> 2) Why does it become worse during the lifetime of the application?
>   - Is there a leak?
>   - Are we getting into a situation where we begin promoting Searchers
> along with their IndexReaders/Writers too early? i.e. do we have a lot of
> expensive objects moving to the tenured generation just before they
> become dereferenced?
>   - Tune G1NewSizePercent?
> 3) Are the JNI Weak References being visited but not cleaned during GC?
>   - Why not? tune G1HeapWastePercent?


I don't have any specific answers for you, but I can point you at the
results of my own work on GC tuning.

https://wiki.apache.org/solr/ShawnHeisey

There are both G1 and CMS settings there. For G1, the exact size to use
for the G1HeapRegionSize parameter will depend on how many documents
you've got in each index. There are a few paragraphs about it on the page.

The start script for Solr 5.x, if left with its default settings,
includes GC tuning that's very similar to the CMS settings on my wiki page.

My initial testing on 5.2.1 has shown that general performance is a
little bit higher than the 4.9 version.  There is one nagging problem
that I know about, which will hopefully be addressed before the 5.4
release.  There are at least two issues with that problem as a root cause:

https://issues.apache.org/jira/browse/SOLR-8088
https://issues.apache.org/jira/browse/SOLR-8096

If you are not using grouping or facets, then you wouldn't need to worry
about that problem.

I am finding a nice rabbit hole to get lost in starting at this page and
diving into the "Related issues" links at the bottom:

http://www.evanjones.ca/jvm-mmap-pause.html

Thanks,
Shawn



Dataimporthandler sql query don't run

2015-09-25 Thread Jens Mayer
Hello everyone,
I need to run the following query to import my index from an H2 database:

but if I start a full-import nothing happens. The last information in my log
file is the following:
[25.09.2015 20:06:24.418 INFO  commitScheduler-11-thread-1 o.a.s.u.DirectUpdateHandler2.commit:548] start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
[25.09.2015 20:06:24.418 INFO  commitScheduler-11-thread-1 o.a.s.u.DirectUpdateHandler2.commit:588] No uncommitted changes. Skipping IW.commit.
[25.09.2015 20:06:24.419 INFO  commitScheduler-11-thread-1 o.a.s.u.DirectUpdateHandler2.commit:637] end_commit_flush
If I run the SQL against H2 directly I receive the expected results. The
strange thing is that if I run the following query instead, everything goes
right and my data is indexed. I'm confused why the first statement fails and
the second one succeeds:

Does anyone have an idea why the first statement fails?
Thanks for your help

Re: Using a plugin to filter in schema.xml

2015-09-25 Thread Alexandre Rafalovitch
I think (I lost the library link) you would need to build a bridge by
writing a custom Analyzer or Tokenizer and then using the library under
the covers. It would be a nice contribution to open source if you managed
to achieve that.
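
As a very rough illustration (untested, and the com.twitter.Extractor calls
are an assumption about the twitter-text Java API), such a bridge could be a
TokenFilterFactory along these lines:

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.TokenFilterFactory;

/**
 * Hypothetical bridge: keeps only the tokens that twitter-text recognizes as
 * hashtags. Assumes a WhitespaceTokenizer upstream so tokens still carry their
 * leading '#', and assumes twitter-text exposes com.twitter.Extractor with an
 * extractHashtags(String) method - adjust to the actual API you end up using.
 */
public class HashtagKeepFilterFactory extends TokenFilterFactory {

    public HashtagKeepFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new HashtagKeepFilter(input);
    }

    static final class HashtagKeepFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final com.twitter.Extractor extractor = new com.twitter.Extractor();

        HashtagKeepFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            // pass through only the tokens twitter-text classifies as hashtags
            while (input.incrementToken()) {
                if (!extractor.extractHashtags(termAtt.toString()).isEmpty()) {
                    return true;
                }
            }
            return false;
        }
    }
}

You would then declare the factory after solr.WhitespaceTokenizerFactory in
the field type's analyzer chain.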

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 September 2015 at 12:58, Siddhartha Singh Sandhu
 wrote:
> Hi,
>
> I wanted to use the twitter-text library's GitHub implementation to filter
> the tokens (hashtags) in my text. I know I could use the pattern-matching
> tokenizer as well, but I would trust twitter's library more than my own regex
> to do the job for me. I wanted to use it in unison with
> solr.WhitespaceTokenizerFactory to get the tokens.
>
> I need help understanding how I can do that. Do I have to refactor the
> twitter Java library to "extends TokenFilterFactory", or can I use it the
> way it is?
>
> Regards,
>
> Sid.


[Open source] SolrCloud High Availability (HAFT) Library - Bloomreach

2015-09-25 Thread Nitin Sharma
Hi all,

 We are glad to announce that we have open sourced the SolrCloud HAFT
library  under the Apache
License Version 2.0.

HAFT is a High Availability and Fault Tolerance framework for SolrCloud. It
was built from the ground up at Bloomreach to provide high availability
operations for SolrCloud. It was covered in great detail in our talk
last year at the Lucene Revolution conference. We are very happy to contribute
this back to the community.

The current open source version offers the following functionality:

   1. Cloning/Backing up SolrCloud Data across clusters
   2. Backing up Zookeeper configs


We are actively working on open sourcing more functionality for the library
in the coming months. We will be monitoring GitHub frequently to
address any bugs/feature requests :)

It would be great if we could link this on the SolrCloud contributions wiki.
Can you give us access to that?

Let us know if you have questions/comments.

Thanks,
Nitin


Re: How to know index file in OS Cache

2015-09-25 Thread Edward Ribeiro
You can use pcstat ( https://github.com/tobert/pcstat ) to get page cache
statistics for files. I have used this app in the past to see which Lucene
index files were in the Linux page cache, and how much of each.

Edward


On Fri, Sep 25, 2015 at 2:22 PM, Jeff Wartes  wrote:

>
>
> I’ve been relying on this:
> https://code.google.com/archive/p/linux-ftools/
>
>
> fincore will tell you what percentage of a given file is in cache, and
> fadvise can suggest to the OS that a file be cached.
>
> All of the solr start scripts at my company first call fadvise
> (FADV_WILLNEED) on all the files in the index directories. It works great
> if you’re on a linux system.
>
>
>
> On 9/25/15, 8:41 AM, "Gili Nachum"  wrote:
>
> >Gonna try Mikhail's suggestion, but just for fun you can also empirically
> >"test" how much of a file is in the OS cache with:
> >time cat <file> > /dev/null
> >
> >The faster it completes, the more blocks are cached. You can take a baseline
> >after manually purging the cache - I don't recall the command. Note that
> >running the command by itself encourages the OS to cache the file.
> >On Sep 25, 2015 12:39, "Aman Tandon"  wrote:
> >
> >> Awesome thank you Mikhail. This is what I was looking for.
> >>
> >> This was just a random question poped up in my mind. So I just asked
> >>this
> >> on the group.
> >>
> >> With Regards
> >> Aman Tandon
> >>
> >> On Fri, Sep 25, 2015 at 2:49 PM, Mikhail Khludnev <
> >> mkhlud...@griddynamics.com> wrote:
> >>
> >> > What about Linux:
> >> > $less /proc/<pid>/maps
> >> > $pmap <pid>
> >> >
> >> > On Fri, Sep 25, 2015 at 10:57 AM, Markus Jelsma <
> >> > markus.jel...@openindex.io>
> >> > wrote:
> >> >
> >> > > Hello - as far as i remember, you don't. A file itself is not the
> >>unit
> >> to
> >> > > cache, but blocks are.
> >> > > Markus
> >> > >
> >> > >
> >> > > -Original message-
> >> > > > From:Aman Tandon 
> >> > > > Sent: Friday 25th September 2015 5:56
> >> > > > To: solr-user@lucene.apache.org
> >> > > > Subject: How to know index file in OS Cache
> >> > > >
> >> > > > Hi,
> >> > > >
> >> > > > Is there any way to know that the index file/s is present in the
> >>OS
> >> > cache
> >> > > > or RAM. I want to check if the index is present in the RAM or in
> >>OS
> >> > cache
> >> > > > and which files are not in either of them.
> >> > > >
> >> > > > With Regards
> >> > > > Aman Tandon
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Sincerely yours
> >> > Mikhail Khludnev
> >> > Principal Engineer,
> >> > Grid Dynamics
> >> >
> >> > 
> >> > 
> >> >
> >>
>
>


Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Walter Underwood
Sorry, I did not mean to be rude. The original question did not say that you 
don’t have the docs outside of Solr. Some people jump to the advanced features 
and miss the simple ones.

It might be faster to fetch all the docs from Solr and save them in files. Then 
modify them. Then reload all of them. No guarantee, but it is worth a try.

Good luck.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 25, 2015, at 2:59 PM, Ravi Solr  wrote:
> 
> Walter, Not in a mood for banter right now Its 6:00pm on a friday and
> Iam stuck here trying to figure reindexing issues :-)
> I dont have source of docs so I have to query the SOLR, modify and put it
> back and that is seeming to be quite a task in 5.3.0, I did reindex several
> times with 4.7.2 in a master slave env without any issue. Since then we
> have moved to cloud and it has been a pain all day.
> 
> Thanks
> 
> Ravi Kiran Bhaskar
> 
> On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood 
> wrote:
> 
>> Sure.
>> 
>> 1. Delete all the docs (no commit).
>> 2. Add all the docs (no commit).
>> 3. Commit.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Sep 25, 2015, at 2:17 PM, Ravi Solr  wrote:
>>> 
>>> I have been trying to re-index the docs (about 1.5 million) as one of the
>>> field needed part of string value removed (accidentally introduced). I
>> was
>>> issuing a query for 100 docs getting 4 fields and updating the doc
>> (atomic
>>> update with "set") via the CloudSolrClient in batches, However from time
>> to
>>> time the query returns 0 results, which exits the re-indexing program.
>>> 
>>> I cant understand as to why the cloud returns 0 results when there are
>> 1.4x
>>> million docs which have the "accidental" string in them.
>>> 
>>> Is there another way to do bulk massive updates ?
>>> 
>>> Thanks
>>> 
>>> Ravi Kiran Bhaskar
>> 
>> 



Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Ravi Solr
No problem Walter, it's all fun. Was just wondering if there was some other
good way that I did not know of, that's all 

Thanks

Ravi Kiran Bhaskar

On Friday, September 25, 2015, Walter Underwood 
wrote:

> Sorry, I did not mean to be rude. The original question did not say that
> you don’t have the docs outside of Solr. Some people jump to the advanced
> features and miss the simple ones.
>
> It might be faster to fetch all the docs from Solr and save them in files.
> Then modify them. Then reload all of them. No guarantee, but it is worth a
> try.
>
> Good luck.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org 
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Sep 25, 2015, at 2:59 PM, Ravi Solr  > wrote:
> >
> > Walter, Not in a mood for banter right now Its 6:00pm on a friday and
> > Iam stuck here trying to figure reindexing issues :-)
> > I dont have source of docs so I have to query the SOLR, modify and put it
> > back and that is seeming to be quite a task in 5.3.0, I did reindex
> several
> > times with 4.7.2 in a master slave env without any issue. Since then we
> > have moved to cloud and it has been a pain all day.
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood  >
> > wrote:
> >
> >> Sure.
> >>
> >> 1. Delete all the docs (no commit).
> >> 2. Add all the docs (no commit).
> >> 3. Commit.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org 
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr  > wrote:
> >>>
> >>> I have been trying to re-index the docs (about 1.5 million) as one of
> the
> >>> field needed part of string value removed (accidentally introduced). I
> >> was
> >>> issuing a query for 100 docs getting 4 fields and updating the doc
> >> (atomic
> >>> update with "set") via the CloudSolrClient in batches, However from
> time
> >> to
> >>> time the query returns 0 results, which exits the re-indexing program.
> >>>
> >>> I cant understand as to why the cloud returns 0 results when there are
> >> 1.4x
> >>> million docs which have the "accidental" string in them.
> >>>
> >>> Is there another way to do bulk massive updates ?
> >>>
> >>> Thanks
> >>>
> >>> Ravi Kiran Bhaskar
> >>
> >>
>
>


Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Erick Erickson
How are you querying Solr? You say you query for 100 docs,
update then get the next set. What are you using for a marker?
If you're using the start parameter and somehow a commit is
creeping in, things might be weird, especially if you're using any
of the internal Lucene doc IDs. If you're absolutely sure no commits
are taking place even that should be OK.

The "deep paging" stuff could be helpful here, see:
https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
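
As a rough sketch of what that looks like in SolrJ (the zkHost string,
collection and field names below are placeholders; the sort just has to end
on the uniqueKey field):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorWalk {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("collection1");

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            SolrQuery q = new SolrQuery("*:*")
                    .setRows(1000)
                    .addSort("id", ORDER.asc);   // must end on the uniqueKey field
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);

            QueryResponse resp = client.query(q);
            String nextCursorMark = resp.getNextCursorMark();

            // ... process resp.getResults() here ...

            // when the cursor stops advancing, every matching doc has been seen
            done = cursorMark.equals(nextCursorMark);
            cursorMark = nextCursorMark;
        }
        client.close();
    }
}

Because the cursor is keyed on sort values rather than a numeric offset, it
avoids the page-shifting problems that start/rows paging has when the index
changes underneath it.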

Best,
Erick

On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr  wrote:
> No problem Walter, it's all fun. Was just wondering if there was some other
> good way that I did not know of, that's all 
>
> Thanks
>
> Ravi Kiran Bhaskar
>
> On Friday, September 25, 2015, Walter Underwood 
> wrote:
>
>> Sorry, I did not mean to be rude. The original question did not say that
>> you don’t have the docs outside of Solr. Some people jump to the advanced
>> features and miss the simple ones.
>>
>> It might be faster to fetch all the docs from Solr and save them in files.
>> Then modify them. Then reload all of them. No guarantee, but it is worth a
>> try.
>>
>> Good luck.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org 
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Sep 25, 2015, at 2:59 PM, Ravi Solr > > wrote:
>> >
>> > Walter, Not in a mood for banter right now Its 6:00pm on a friday and
>> > Iam stuck here trying to figure reindexing issues :-)
>> > I dont have source of docs so I have to query the SOLR, modify and put it
>> > back and that is seeming to be quite a task in 5.3.0, I did reindex
>> several
>> > times with 4.7.2 in a master slave env without any issue. Since then we
>> > have moved to cloud and it has been a pain all day.
>> >
>> > Thanks
>> >
>> > Ravi Kiran Bhaskar
>> >
>> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood > >
>> > wrote:
>> >
>> >> Sure.
>> >>
>> >> 1. Delete all the docs (no commit).
>> >> 2. Add all the docs (no commit).
>> >> 3. Commit.
>> >>
>> >> wunder
>> >> Walter Underwood
>> >> wun...@wunderwood.org 
>> >> http://observer.wunderwood.org/  (my blog)
>> >>
>> >>
>> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr > > wrote:
>> >>>
>> >>> I have been trying to re-index the docs (about 1.5 million) as one of
>> the
>> >>> field needed part of string value removed (accidentally introduced). I
>> >> was
>> >>> issuing a query for 100 docs getting 4 fields and updating the doc
>> >> (atomic
>> >>> update with "set") via the CloudSolrClient in batches, However from
>> time
>> >> to
>> >>> time the query returns 0 results, which exits the re-indexing program.
>> >>>
>> >>> I cant understand as to why the cloud returns 0 results when there are
>> >> 1.4x
>> >>> million docs which have the "accidental" string in them.
>> >>>
>> >>> Is there another way to do bulk massive updates ?
>> >>>
>> >>> Thanks
>> >>>
>> >>> Ravi Kiran Bhaskar
>> >>
>> >>
>>
>>


Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Ravi Solr
Walter, not in a mood for banter right now. It's 6:00pm on a Friday and
I am stuck here trying to figure out reindexing issues :-)
I don't have the source of the docs, so I have to query Solr, modify, and put it
back, and that is proving to be quite a task in 5.3.0. I did reindex several
times with 4.7.2 in a master-slave env without any issue. Since then we
have moved to cloud and it has been a pain all day.

Thanks

Ravi Kiran Bhaskar

On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood 
wrote:

> Sure.
>
> 1. Delete all the docs (no commit).
> 2. Add all the docs (no commit).
> 3. Commit.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Sep 25, 2015, at 2:17 PM, Ravi Solr  wrote:
> >
> > I have been trying to re-index the docs (about 1.5 million) as one of the
> > field needed part of string value removed (accidentally introduced). I
> was
> > issuing a query for 100 docs getting 4 fields and updating the doc
> (atomic
> > update with "set") via the CloudSolrClient in batches, However from time
> to
> > time the query returns 0 results, which exits the re-indexing program.
> >
> > I cant understand as to why the cloud returns 0 results when there are
> 1.4x
> > million docs which have the "accidental" string in them.
> >
> > Is there another way to do bulk massive updates ?
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
>
>


Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Ravi Solr
Thanks for responding Erick. I set the "start" to zero and "rows" always to
100. I create a CloudSolrClient instance and use it to both query as well as
index. But I do sleep for 5 secs just to allow for any auto commits.

So query --> client.add(100 docs) --> wait --> query again

But the weird thing I noticed was that after 8 or 9 batches, i.e. 800/900
docs, the "query again" returns zero docs, causing my while loop to
exit... so I was trying to see if I was doing the right thing or if there is
an alternate way to do heavy indexing.

Thanks

Ravi Kiran Bhaskar



On Friday, September 25, 2015, Erick Erickson 
wrote:

> How are you querying Solr? You say you query for 100 docs,
> update then get the next set. What are you using for a marker?
> If you're using the start parameter, and somehow a commit is
> creeping in things might be weird, especially if you're using any
> of the internal Lucene doc IDs. If you're absolutely sure no commits
> are taking place even that should be OK.
>
> The "deep paging" stuff could be helpful here, see:
>
> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>
> Best,
> Erick
>
> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr  > wrote:
> > No problem Walter, it's all fun. Was just wondering if there was some
> other
> > good way that I did not know of, that's all 
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
> > On Friday, September 25, 2015, Walter Underwood  >
> > wrote:
> >
> >> Sorry, I did not mean to be rude. The original question did not say that
> >> you don’t have the docs outside of Solr. Some people jump to the
> advanced
> >> features and miss the simple ones.
> >>
> >> It might be faster to fetch all the docs from Solr and save them in
> files.
> >> Then modify them. Then reload all of them. No guarantee, but it is
> worth a
> >> try.
> >>
> >> Good luck.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org  
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr  
> >> > wrote:
> >> >
> >> > Walter, Not in a mood for banter right now Its 6:00pm on a friday
> and
> >> > Iam stuck here trying to figure reindexing issues :-)
> >> > I dont have source of docs so I have to query the SOLR, modify and
> put it
> >> > back and that is seeming to be quite a task in 5.3.0, I did reindex
> >> several
> >> > times with 4.7.2 in a master slave env without any issue. Since then
> we
> >> > have moved to cloud and it has been a pain all day.
> >> >
> >> > Thanks
> >> >
> >> > Ravi Kiran Bhaskar
> >> >
> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <
> wun...@wunderwood.org 
> >> >
> >> > wrote:
> >> >
> >> >> Sure.
> >> >>
> >> >> 1. Delete all the docs (no commit).
> >> >> 2. Add all the docs (no commit).
> >> >> 3. Commit.
> >> >>
> >> >> wunder
> >> >> Walter Underwood
> >> >> wun...@wunderwood.org  
> >> >> http://observer.wunderwood.org/  (my blog)
> >> >>
> >> >>
> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr  
> >> > wrote:
> >> >>>
> >> >>> I have been trying to re-index the docs (about 1.5 million) as one
> of
> >> the
> >> >>> field needed part of string value removed (accidentally
> introduced). I
> >> >> was
> >> >>> issuing a query for 100 docs getting 4 fields and updating the doc
> >> >> (atomic
> >> >>> update with "set") via the CloudSolrClient in batches, However from
> >> time
> >> >> to
> >> >>> time the query returns 0 results, which exits the re-indexing
> program.
> >> >>>
> >> >>> I cant understand as to why the cloud returns 0 results when there
> are
> >> >> 1.4x
> >> >>> million docs which have the "accidental" string in them.
> >> >>>
> >> >>> Is there another way to do bulk massive updates ?
> >> >>>
> >> >>> Thanks
> >> >>>
> >> >>> Ravi Kiran Bhaskar
> >> >>
> >> >>
> >>
> >>
>


Re: [Open source] SolrCloud High Availability (HAFT) Library - Bloomreach

2015-09-25 Thread Nitin Sharma
Thanks.

On Fri, Sep 25, 2015 at 2:35 PM, Shawn Heisey  wrote:

> On 9/25/2015 12:00 PM, Nitin Sharma wrote:
> >  My user name is nitin.sharma.  Does this give edit access to the
> > confluence page as well?
>
> You are added as a contributor on the Solr wiki.
>
> Only Apache committers for the Solr project have access to edit the
> confluence wiki.  This is because the wiki is used to produce the Apache
> Reference Guide, which is released as official documentation.
>
> You are welcome to comment on the confluence wiki if you find something
> missing or incorrect, and it will be given full consideration.
>
> Thanks,
> Shawn
>
>


-- 
-- Nitin


Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Erick Erickson
Wait, query again how? You've got to have something that keeps you
from getting the same 100 docs back so you have to be sorting somehow.
Or you have a high water mark. Or something. Waiting 5 seconds for any
commit also doesn't really make sense to me. I mean how do you know

1> that you're going to get a commit (did you explicitly send one from
the client?).
2> all autowarming will be complete by the time the next query hits?

Let's see the query you fire. There has to be some kind of marker that
you're using to know when you've gotten through the entire set.

And I would use much larger batches, I usually update in batches of
1,000 (excepting if these are very large docs of course). I suspect
you're spending a lot more time sleeping than you need to. I wouldn't
sleep at all in fact. This is one (rare) case I might consider
committing from the client. If you specify the wait-for-searcher param
(server.commit(true, true)), then it doesn't return until a new
searcher is completely opened, so your previous updates will be
reflected in your next search.

Actually, what I'd really do is
1> turn off all auto commits
2> go ahead and query/change/update. But the query bits would be using
the cursormark.
3> do NOT commit
4> issue a commit when you were all done.

I bet you'd get through your update a lot faster that way.
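
A minimal sketch of that pattern (class and method names made up; the batches
themselves would come from a cursorMark query loop as above):

import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkUpdater {
    /**
     * Sends pre-built batches (ideally ~1000 docs each) with no intermediate
     * commits and no sleeping, then opens a new searcher exactly once.
     */
    static void pushAll(CloudSolrClient client, List<List<SolrInputDocument>> batches)
            throws Exception {
        for (List<SolrInputDocument> batch : batches) {
            client.add(batch);   // indexed, but not visible until the commit below
        }
        // waitFlush=true, waitSearcher=true: returns once the new searcher is
        // registered, so the updates are visible to the very next query
        client.commit(true, true);
    }
}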

Best,
Erick

On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr  wrote:
> Thanks for responding Erick. I set the "start" to zero and "rows" always to
> 100. I create CloudSolrClient instance and use it to both query as well as
> index. But I do sleep for 5 secs just to allow for any auto commits.
>
> So query --> client.add(100 docs) --> wait --> query again
>
> But the weird thing I noticed was that after 8 or 9 batches I.e 800/900
> docs the "query again" returns zero docs causing my while loop to
> exist...so was trying to see if I was doing the right thing or if there is
> an alternate way to do heavy indexing.
>
> Thanks
>
> Ravi Kiran Bhaskar
>
>
>
> On Friday, September 25, 2015, Erick Erickson 
> wrote:
>
>> How are you querying Solr? You say you query for 100 docs,
>> update then get the next set. What are you using for a marker?
>> If you're using the start parameter, and somehow a commit is
>> creeping in things might be weird, especially if you're using any
>> of the internal Lucene doc IDs. If you're absolutely sure no commits
>> are taking place even that should be OK.
>>
>> The "deep paging" stuff could be helpful here, see:
>>
>> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>
>> Best,
>> Erick
>>
>> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr > > wrote:
>> > No problem Walter, it's all fun. Was just wondering if there was some
>> other
>> > good way that I did not know of, that's all 
>> >
>> > Thanks
>> >
>> > Ravi Kiran Bhaskar
>> >
>> > On Friday, September 25, 2015, Walter Underwood > >
>> > wrote:
>> >
>> >> Sorry, I did not mean to be rude. The original question did not say that
>> >> you don’t have the docs outside of Solr. Some people jump to the
>> advanced
>> >> features and miss the simple ones.
>> >>
>> >> It might be faster to fetch all the docs from Solr and save them in
>> files.
>> >> Then modify them. Then reload all of them. No guarantee, but it is
>> worth a
>> >> try.
>> >>
>> >> Good luck.
>> >>
>> >> wunder
>> >> Walter Underwood
>> >> wun...@wunderwood.org  
>> >> http://observer.wunderwood.org/  (my blog)
>> >>
>> >>
>> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr > 
>> >> > wrote:
>> >> >
>> >> > Walter, Not in a mood for banter right now Its 6:00pm on a friday
>> and
>> >> > Iam stuck here trying to figure reindexing issues :-)
>> >> > I dont have source of docs so I have to query the SOLR, modify and
>> put it
>> >> > back and that is seeming to be quite a task in 5.3.0, I did reindex
>> >> several
>> >> > times with 4.7.2 in a master slave env without any issue. Since then
>> we
>> >> > have moved to cloud and it has been a pain all day.
>> >> >
>> >> > Thanks
>> >> >
>> >> > Ravi Kiran Bhaskar
>> >> >
>> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <
>> wun...@wunderwood.org 
>> >> >
>> >> > wrote:
>> >> >
>> >> >> Sure.
>> >> >>
>> >> >> 1. Delete all the docs (no commit).
>> >> >> 2. Add all the docs (no commit).
>> >> >> 3. Commit.
>> >> >>
>> >> >> wunder
>> >> >> Walter Underwood
>> >> >> wun...@wunderwood.org  
>> >> >> http://observer.wunderwood.org/  (my blog)
>> >> >>
>> >> >>
>> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr > 
>> >> > wrote:
>> >> >>>
>> >> >>> I have been trying to re-index the docs (about 1.5 million) as one
>> of
>> >> the
>> >> >>> field needed part of string value 

Re: bulk reindexing 5.3.0 issue

2015-09-25 Thread Ravi Solr
Thank you for taking the time to help me out. Yes, I was not using cursorMark; I
will try that next. This is what I was doing - it's a bit shabby coding, but
what can I say, my brain was fried :-) FYI, this is a side process just to
correct a messed-up string. The actual indexing process was running the whole
time, as our business owners are a bit petulant about stopping indexing. My
autocommit conf and code are given below; as you can see, autocommit should
fire every 100 docs anyway.


   100
   12



3

  

private static void processDocs() {

    try {
        CloudSolrClient client = new
                CloudSolrClient("zk1:,zk2:,zk3.com:");
        client.setDefaultCollection("collection1");

        //First initialize docs
        SolrDocumentList docList = getDocs(client, 100);
        Long count = 0L;

        while (docList != null && docList.size() > 0) {

            List<SolrInputDocument> inList = new ArrayList<SolrInputDocument>();
            for (SolrDocument doc : docList) {

                SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);

                //This is my SOLR's unique id
                String uniqueId = (String) iDoc.getFieldValue("uniqueId");

                /*
                 * This is another system's id, which is what I want to correct.
                 * It was messed up because of the script transformer in the DIH
                 * import via SolrEntityProcessor, e.g.
                 * sun.org.mozilla.javascript.internal.NativeString:9cdef726-05dd-40b7-b1b2-c9bbce96741f
                 */
                String uuid = (String) iDoc.getFieldValue("uuid");
                String sanitizedUUID =
                        uuid.replace("sun.org.mozilla.javascript.internal.NativeString:", "");

                //Atomic update: "set" the cleaned value on the uuid field
                Map<String, Object> fieldModifier = new HashMap<String, Object>(1);
                fieldModifier.put("set", sanitizedUUID);
                iDoc.setField("uuid", fieldModifier);

                inList.add(iDoc);
                log.info("added " + uniqueId);
            }

            client.add(inList);

            count = count + docList.size();
            log.info("Indexed " + count + "/" + docList.getNumFound());

            Thread.sleep(5000);

            docList = getDocs(client, docList.size());
            log.info("Got Docs- " + docList.getNumFound());
        }

    } catch (Exception e) {
        log.error("Error indexing ", e);
    }
}

private static SolrDocumentList getDocs(CloudSolrClient client, Integer rows) {

    SolrQuery q = new SolrQuery("*:*");
    q.setSort("publishtime", ORDER.desc);
    q.setStart(0);
    q.setRows(rows);
    q.addFilterQuery(new String[] {"uuid:[* TO *]", "uuid:sun.org.mozilla*"});
    q.setFields(new String[] {"uniqueId", "uuid"});
    SolrDocumentList docList = null;
    QueryResponse resp;
    try {
        resp = client.query(q);
        docList = resp.getResults();
    } catch (Exception e) {
        log.error("Error querying " + q.toString(), e);
    }
    return docList;
}


Thanks

Ravi Kiran Bhaskar

On Fri, Sep 25, 2015 at 10:58 PM, Erick Erickson 
wrote:

> Wait, query again how? You've got to have something that keeps you
> from getting the same 100 docs back so you have to be sorting somehow.
> Or you have a high water mark. Or something. Waiting 5 seconds for any
> commit also doesn't really make sense to me. I mean how do you know
>
> 1> that you're going to get a commit (did you explicitly send one from
> the client?).
> 2> all autowarming will be complete by the time the next query hits?
>
> Let's see the query you fire. There has to be some kind of marker that
> you're using to know when you've gotten through the entire set.
>
> And I would use much larger batches, I usually update in batches of
> 1,000 (excepting if these are very large docs of course). I suspect
> you're spending a lot more time sleeping than you need to. I wouldn't
> sleep at all in fact. This is one (rare) case I might consider
> committing from the client. If you specify the wait for searcher param
> (server.commit(true, true), then it doesn't return until a new
> searcher is completely opened so your previous updates will be
> reflected in your next search.
>
> Actually, what I'd really do is
> 1> turn off all auto commits
> 2> go ahead and query/change/update. But the query bits would be using
> the cursormark.
> 3> do NOT commit
> 4> issue a commit when you were all done.
>
> I bet you'd get through your update a lot faster that way.
>
> Best,
> Erick
>
> On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr  wrote:
> > Thanks for responding Erick. I set the "start" to zero and "rows" always
> to
> > 100. I create CloudSolrClient instance and use it to both query as well
> as
> > index. But I do sleep for 5 secs just to allow for any auto 

RE: How to know index file in OS Cache

2015-09-25 Thread Markus Jelsma
Hello - as far as i remember, you don't. A file itself is not the unit to 
cache, but blocks are.
Markus
 
 
-Original message-
> From:Aman Tandon 
> Sent: Friday 25th September 2015 5:56
> To: solr-user@lucene.apache.org
> Subject: How to know index file in OS Cache
> 
> Hi,
> 
> Is there any way to know that the index file/s is present in the OS cache
> or RAM. I want to check if the index is present in the RAM or in OS cache
> and which files are not in either of them.
> 
> With Regards
> Aman Tandon
> 


Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-25 Thread Charlie Hull

On 23/09/2015 16:23, Alexandre Rafalovitch wrote:

You may find the following articles interesting:
http://discovery-grindstone.blogspot.ca/2014/01/searching-in-solr-analyzing-results-and.html
( a whole epic journey)
https://dzone.com/articles/indexing-chinese-solr


The latter article is great and we drew on it when helping a recent 
client with Chinese indexing. However, if you do use Paoding bear in 
mind that it has few if any tests and all the comments are in Chinese. 
We found a problem with it recently (it breaks the Lucene highlighters) 
and have submitted a patch: 
http://git.oschina.net/zhzhenqin/paoding-analysis/issues/1


Cheers

Charlie


Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 September 2015 at 10:41, Zheng Lin Edwin Yeo  wrote:

Hi,

Would like to check, will StandardTokenizerFactory works well for indexing
both English and Chinese (Bilingual) documents, or do we need tokenizers
that are customised for chinese (Eg: HMMChineseTokenizerFactory)?


Regards,
Edwin



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: faceting is unusable slow since upgrade to 5.3.0

2015-09-25 Thread Alessandro Benedetti
   There is an undocumented "method" parameter - I need to enable that to

> allow switching between the docvalues approach and the UnInvertedField
> approach.
>

Just to clarify - please correct me, Yonik, if my understanding is wrong or
outdated:
To calculate facets, without going into the algorithm details, there are 2
approaches available:
Term Enum (good for a limited number of unique values in your field) and FC
(FieldCache), good for a lot of unique values, but not for big fields.

For the FC approach,
 - storing DocValues for the field would transparently use them (with the
known benefit, at the cost of disk space for the docValues data
structures)
 - without DocValues, the algorithm will un-invert the index at
runtime, using the field cache to store the results

So, from your quote, Term Enum will not be supported by JSON Faceting?
DocValues usage will not happen automatically?

Cheers

>
> --
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: How to know index file in OS Cache

2015-09-25 Thread Aman Tandon
okay thanks Markus :)

With Regards
Aman Tandon

On Fri, Sep 25, 2015 at 12:27 PM, Markus Jelsma 
wrote:

> Hello - as far as i remember, you don't. A file itself is not the unit to
> cache, but blocks are.
> Markus
>
>
> -Original message-
> > From:Aman Tandon 
> > Sent: Friday 25th September 2015 5:56
> > To: solr-user@lucene.apache.org
> > Subject: How to know index file in OS Cache
> >
> > Hi,
> >
> > Is there any way to know that the index file/s is present in the OS cache
> > or RAM. I want to check if the index is present in the RAM or in OS cache
> > and which files are not in either of them.
> >
> > With Regards
> > Aman Tandon
> >
>


Re: recovering mode loop

2015-09-25 Thread Lorenzo Fundaró
I think the attachment was stripped from the mail :(
Here's a public link:

https://drive.google.com/file/d/0B_z8xmsby0uxRDZEeWpLcnR2b3M/view?usp=sharing

On 25 September 2015 at 09:59, Lorenzo Fundaró <
lorenzo.fund...@dawandamail.com> wrote:

> These are the last logs I've got, even with a higher zkClientTimeout of 30s.
>
> this is a replica:
>
> 9/24/2015, 8:14:46 PM WARN  RecoveryStrategy  Stopping recovery for core=dawanda coreNodeName=core_node6
> 9/24/2015, 8:14:56 PM WARN  RecoveryStrategy  Stopping recovery for core=dawanda coreNodeName=core_node6
> 9/24/2015, 8:16:34 PM WARN  RecoveryStrategy  Stopping recovery for core=dawanda coreNodeName=core_node6
> 9/24/2015, 8:16:37 PM WARN  PeerSync  PeerSync: core=dawanda url=http://solr6.dawanda.services:8983/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
> 9/24/2015, 8:16:40 PM ERROR ReplicationHandler  SnapPull failed: org.apache.solr.common.SolrException: Index fetch failed
> 9/24/2015, 8:16:40 PM ERROR RecoveryStrategy  Error while trying to recover: org.apache.solr.common.SolrException: Replication for recovery failed.
> 9/24/2015, 8:16:40 PM ERROR RecoveryStrategy  Recovery failed - trying again... (0) core=dawanda
> 9/24/2015, 8:16:55 PM ERROR SolrCore  null:org.apache.lucene.store.AlreadyClosedException: Already closed: MMapIndexInput(path="/srv/loveos/solr/server/solr/dawanda/data/index.20150924174642739/_ulvt_Lucene50_0.tim")
> 9/24/2015, 8:16:59 PM WARN  SolrCore  [dawanda] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> 9/24/2015, 8:17:53 PM WARN  UpdateLog  Starting log replay tlog{file=/srv/loveos/solr/server/solr/dawanda/data/tlog/tlog.0024343 refcount=2} active=true starting pos=3393565
> 9/24/2015, 8:18:47 PM WARN  RecoveryStrategy  Stopping recovery for core=dawanda coreNodeName=core_node6
> 9/24/2015, 8:18:57 PM WARN  UpdateLog  Log replay finished. recoveryInfo=RecoveryInfo{adds=2556 deletes=2455 deleteByQuery=0 errors=0 positionOfStart=3393565}
> 9/24/2015, 8:18:57 PM WARN  RecoveryStrategy  Stopping recovery for core=dawanda coreNodeName=core_node6
> 9/24/2015, 8:19:07 PM WARN  RecoveryStrategy  Stopping recovery for core=dawanda coreNodeName=core_node6
> 9/24/2015, 8:19:17 PM WARN  RecoveryStrategy  Stopping recovery for core=dawanda coreNodeName=core_node6
> 9/24/2015, 8:19:27 PM WARN  RecoveryStrategy  Stopping recovery for core=dawanda coreNodeName=core_node6
> 9/24/2015, 8:19:37 PM WARN  RecoveryStrategy  Stopping recovery for core=dawanda coreNodeName=core_node6
> 9/24/2015, 8:19:47 PM WARN  RecoveryStrategy  Stopping recovery for core=dawanda coreNodeName=core_node6
>
> and *again an overlapping searcher*.
>
> this is the leader:
>
> 9/24/2015, 8:18:56 PM WARN  DistributedUpdateProcessor  Error sending update to http://solr6.dawanda.services:8983/solr
> 9/24/2015, 8:18:56 PM ERROR StreamingSolrClients  error
> 9/24/2015, 8:18:56 PM WARN  DistributedUpdateProcessor  Error sending update to http://solr6.dawanda.services:8983/solr
> 9/24/2015, 8:18:56 PM WARN  ZkController  Leader is publishing core=dawanda coreNodeName=core_node6 state=down on behalf of un-reachable replica http://solr6.dawanda.services:8983/solr/dawanda/; forcePublishState? false
> 9/24/2015, 8:18:56 PM ERROR DistributedUpdateProcessor  Setting up to try to start recovery on replica http://solr6.dawanda.services:8983/solr/dawanda/ after: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
> 9/24/2015, 8:18:57 PM ERROR StreamingSolrClients  error
> 9/24/2015, 8:18:57 PM WARN  DistributedUpdateProcessor  Error sending update to http://solr6.dawanda.services:8983/solr
> 9/24/2015, 8:18:57 PM WARN  ZkController  Leader is publishing core=dawanda coreNodeName=core_node6 state=down on behalf of un-reachable replica http://solr6.dawanda.services:8983/solr/dawanda/; forcePublishState? false
> 9/24/2015, 8:18:57 PM ERROR DistributedUpdateProcessor  Setting up to try to start recovery on replica http://solr6.dawanda.services:8983/solr/dawanda/ after: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
>
>
> I am doing JMX monitoring, and right when two replicas went into recovering
> and down respectively, I checked the threads on the leader and there were
> at least 30 updatesExecutor threads running.
>
> I did a check on one of the replicas' GC logs, and with GCViewer I get this
> for
>
> -* the leader*: http://snag.gy/UZyBu.jpg  , http://snag.gy/azrqH.jpg ,
> http://snag.gy/vef8D.jpg , http://snag.gy/mRVF0.jpg (on the cpu usage of
> this one, the graph going up and down from 20% to 90% cpu usage lasts for
> time that what is shown on the image, around 15 to 20 min)
> - *one replica: *http://snag.gy/PmW4t.jpg , http://snag.gy/2fpUK.jpg ,
> http://snag.gy/1Z5S2.jpg , http://snag.gy/AdlZw.jpg ,
> http://snag.gy/W4Kko.jpg ,
>
> As shown in the article about updates you sent me, at the time of the
> incident yesterday (one three replicas went down and only the leader was
> up) I saw a 

Re: More Like This on numeric fields - BF accepted by MLT handler

2015-09-25 Thread Upayavira
Alessandro,

I'd suggest you review the code of the MoreLikeThisHandler. It is a
little knotty, but it would be worth your while understanding what is
going on there.

Basically, there are three phases:

phase #1: parse the source document into a list of terms (avoided if
term vectors enabled and source doc is in index)
phase #2: calculate a score for each of these terms and select the n
highest scoring ones (default 25)
phase #3: build and execute a boolean query using these 25 terms

Phase #2 uses a TF/IDF like approach to calculate the scores for those
"interesting terms".

Once you understand what MLT is doing, you will probably not find it so
hard to create your own version which is better suited to your own
use-case.

Of course, this would probably be better constructed as a QueryParser
rather than a request handler, but that's a detail.
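
For reference, a minimal sketch of driving the MLT handler from SolrJ and
asking it to show the terms it selected (URL, field name and seed document id
are placeholders, and a MoreLikeThisHandler is assumed to be registered at
/mlt):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MltExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/hotels");

        SolrQuery q = new SolrQuery("id:hotel-123");    // the seed document
        q.setRequestHandler("/mlt");                    // MoreLikeThisHandler
        q.set("mlt.fl", "description");                 // fields used in phases 1 and 2
        q.set("mlt.mintf", 1);
        q.set("mlt.mindf", 1);
        q.set("mlt.interestingTerms", "details");       // expose the selected terms and boosts
        q.set("rows", 10);

        QueryResponse resp = client.query(q);
        System.out.println(resp.getResults());          // docs matched by the phase-3 query
        client.close();
    }
}

Looking at the interestingTerms output is a quick way to see what phase #2
actually picked before deciding where a custom version should differ.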

Upayavira

On Fri, Sep 25, 2015, at 11:08 AM, Alessandro Benedetti wrote:
> Hi guys,
> was just investigating a little bit in how to include numeric fields in
> the
> MLT calculations.
> 
> As we know, we are currently building a smart lucene query based on the
> document in input ( the one to search for similar ones) and run this
> query
> to obtain the similar docs.
> Because the MLT is currently built on TF/IDF , it is mainly thought for
> textual fields.
> What about we want to include a numeric factor  in the similarity
> calculus ?
> 
> e.g.
> Solr Document ( Hotel)
> mlt.fl=description,stars,trip_advisor_rating
> 
> To find the similarity based not only on the description, but also on the
> numeric fields ( stars and rating) .
> 
> The first thought I had , is to add a support for boosting functions.
> In this way we are more flexible and we can add how many functions we
> want.
> 
> For example adding :
> bf=div(1,dist(2,seedDocumentRatingA,seedDocumentRatingB,ratingA,ratingB))
> 
> Also other kind of functions can be applied.
> What do you think ? Do you have any alternative ideas ?
> 
> Cheers
> -- 
> --
> 
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England


Re: Help on autocomplete / suggester

2015-09-25 Thread Andrea Gazzarini

Sorry, in the first point I meant "prefix_search"

Best,
Andrea

On 09/24/2015 11:18 AM, Andrea Gazzarini wrote:

Hi guys,
as part of a customer requirement, I need to provide an autocomplete / 
suggester feature. For that reason I started looking at the Suggester 
Component.


The target Solr version is not yet determined: I mean, there's another 
project in production, of the same customer, which is using Solr 4.7.1 
(no SolrCloud, just a master with two slaves) so I guess they will 
extend those instances with additional cores, but I'm not sure about 
that, maybe they would like to migrate towards a new version  / new 
architecture.


Anyway, after reading some info [1]  [2]  [3] about the Suggester, and 
after trying a bit with some sample data, I'm not sure if that fits my 
needs, because the proposed suggestions must follow these criteria:


  * suffix search: Vi = *Vi*terbo, *Vi*cenza, *Vi*llanova (max priority)
  * infix search: Vi = A*vi*gliano, Tar*vi*sio (medium priority)
  * fuzzy (phonetic?) search: Vitr= Viterbo, Vitorchiano (lowest
priority, this requirement could be even removed)

  * everything could be constrained by one or more filter queries
  * each suggestion could contain (depending on the use case) up to
five additional attributes (other than the suggestion itself), so
the payload provided by the Suggester couldn't be enough (or it
would require a custom encoding of such data in that field)
  * in a couple of scenarios, the search needs to be executed on
several fields, with different boosts (e.g. description, address,
code) and the corresponding suggestions come from another field
(e.g. name)
  * I don't have any incremental / delta indexing issue, the whole
dataset is not huge, a couple of millions of database records,
with a low grow rate, and I can recreate everything from scratch
using the DIH

Do you think this is something for the built-in Suggester? Or is this
something that would be better implemented with a RequestHandler using
something like (e)dismax and ngramming?


Many thanks in advance
Andrea

[1] https://cwiki.apache.org/confluence/display/solr/Suggester
[2] http://lucidworks.com/blog/solr-suggester/
[3] http://alexbenedetti.blogspot.it/2015/07/solr-you-complete-me.html







Re: Autowarm and filtercache invalidation

2015-09-25 Thread Shawn Heisey
On 9/24/2015 3:11 PM, Jeff Wartes wrote:
> Answering my own question: Looks like the default filterCache regenerator
> uses the old cache to re-executes queries in the context of the new
> searcher and does nothing with the old cache value.
> 
> So, the new searcher’s cache contents will be consistent with that
> searcher’s view, regardless of whether it was populated via autowarm.

That is how cache warming works in general.  The entries in the old
cache contain the query that was used to produce the cache entry.
During warming, the same query is executed on the new searcher to build
a new cache entry.

Thanks,
Shawn



Re: faceting is unusable slow since upgrade to 5.3.0

2015-09-25 Thread Uwe Reh

Am 25.09.2015 um 05:16 schrieb Yonik Seeley:

I did some performance benchmarks and opened an issue.  It's bad.
https://issues.apache.org/jira/browse/SOLR-8096


Hi Yonik,
thanks a lot for your investigation.
Using the JSON Facet API is fast and seems to be a usable workaround for
new applications, but it is not really a quick patch for our production
environment.
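
For anyone looking at that workaround, a minimal sketch of sending a JSON
Facet request from SolrJ (core URL and field name are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class JsonFacetExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        // the facet module reads this parameter instead of facet.field
        q.add("json.facet", "{subjects:{type:terms, field:subject, limit:20}}");

        QueryResponse resp = client.query(q);
        // the facet module's output appears under the top-level "facets" key
        System.out.println(resp.getResponse().get("facets"));
        client.close();
    }
}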


What's your assessment of Bill's question? Is there a chance to get
the fieldValueCache back?


I would like to have it back in 5.x, even if marked as deprecated. This
would help with migration.


Uwe



Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-25 Thread Zheng Lin Edwin Yeo
Hi Charlie,

Thanks for your comment. I faced the compatibility issues with Paoding when
I tried it in Solr 5.1.0 and Solr 5.2.1, and I found out that the code was
optimised for Solr 3.6.

Which version of Solr are you using when you tried on the Paoding?

Regards,
Edwin


On 25 September 2015 at 16:43, Charlie Hull  wrote:

> On 23/09/2015 16:23, Alexandre Rafalovitch wrote:
>
>> You may find the following articles interesting:
>>
>> http://discovery-grindstone.blogspot.ca/2014/01/searching-in-solr-analyzing-results-and.html
>> ( a whole epic journey)
>> https://dzone.com/articles/indexing-chinese-solr
>>
>
> The latter article is great and we drew on it when helping a recent client
> with Chinese indexing. However, if you do use Paoding bear in mind that it
> has few if any tests and all the comments are in Chinese. We found a
> problem with it recently (it breaks the Lucene highlighters) and have
> submitted a patch:
> http://git.oschina.net/zhzhenqin/paoding-analysis/issues/1
>
> Cheers
>
> Charlie
>
>
>> Regards,
>> Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 23 September 2015 at 10:41, Zheng Lin Edwin Yeo 
>> wrote:
>>
>>> Hi,
>>>
>>> Would like to check, will StandardTokenizerFactory works well for
>>> indexing
>>> both English and Chinese (Bilingual) documents, or do we need tokenizers
>>> that are customised for chinese (Eg: HMMChineseTokenizerFactory)?
>>>
>>>
>>> Regards,
>>> Edwin
>>>
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>


Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-25 Thread Charlie Hull

On 25/09/2015 11:43, Zheng Lin Edwin Yeo wrote:

Hi Charlie,

Thanks for your comment. I faced the compatibility issues with Paoding when
I tried it in Solr 5.1.0 and Solr 5.2.1, and I found out that the code was
optimised for Solr 3.6.

Which version of Solr are you using when you tried on the Paoding?


Solr v4.6 I believe.

Charlie


Regards,
Edwin


On 25 September 2015 at 16:43, Charlie Hull  wrote:


On 23/09/2015 16:23, Alexandre Rafalovitch wrote:


You may find the following articles interesting:

http://discovery-grindstone.blogspot.ca/2014/01/searching-in-solr-analyzing-results-and.html
( a whole epic journey)
https://dzone.com/articles/indexing-chinese-solr



The latter article is great and we drew on it when helping a recent client
with Chinese indexing. However, if you do use Paoding bear in mind that it
has few if any tests and all the comments are in Chinese. We found a
problem with it recently (it breaks the Lucene highlighters) and have
submitted a patch:
http://git.oschina.net/zhzhenqin/paoding-analysis/issues/1

Cheers

Charlie



Regards,
 Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 September 2015 at 10:41, Zheng Lin Edwin Yeo 
wrote:


Hi,

Would like to check, will StandardTokenizerFactory works well for
indexing
both English and Chinese (Bilingual) documents, or do we need tokenizers
that are customised for chinese (Eg: HMMChineseTokenizerFactory)?


Regards,
Edwin





--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Different ports for search and upload request

2015-09-25 Thread Alexandre Rafalovitch
How about you do indexing on a completely different node and then swap
the index into production using Solr aggregate aliases?
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CreateormodifyanAliasforaCollection

The problem here is that deleting existing content is harder, so it is
more suitable for things like rolling log collections that are one-way
only.
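
A sketch of the swap step with SolrJ (zkHost, alias and collection names are
made up, and the setter names are assumed from the 5.x
CollectionAdminRequest.CreateAlias class - the plain Collections API URL works
just as well):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SwapAlias {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");

        // repoint the "search" alias at the freshly built collection;
        // queries keep using the alias name and never see the rebuild
        CollectionAdminRequest.CreateAlias alias = new CollectionAdminRequest.CreateAlias();
        alias.setAliasName("search");
        alias.setAliasedCollections("products_20150926");
        alias.process(client);

        client.close();
    }
}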

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 24 September 2015 at 18:05, Siddhartha Singh Sandhu
 wrote:
> Thank you so much.
>
> Safe to ignore the following (not a query):
>
> *Never did this.* But how about this crazy idea:
>
> Take an Amazon EFS volume and share it between two EC2 instances. Use one EC2
> endpoint to update the index on EFS while the other reads from it. This way
> each EC2 instance can use its own compute and not share its resources amongst
> Solr threads.
>
> Regards,
> Sid.
>
> On Thu, Sep 24, 2015 at 5:17 PM, Shawn Heisey  wrote:
>
>> On 9/24/2015 2:01 PM, Siddhartha Singh Sandhu wrote:
>> > I wanted to know if we can configure different ports as end points for
>> > uploading and searching API. Also, if someone could point me in the right
>> > direction.
>>
>> From our perspective, no.
>>
>> I have no idea whether it is possible at all ... it might be something
>> that a servlet container expert could figure out, or it might require
>> code changes to Solr itself.
>>
>> You probably need another mailing list specifically for the container.
>> For virtually all 5.x installs, the container is Jetty.  In earlier
>> versions, it could be any container.
>>
>> Another possibility would be putting an intelligent proxy in front of
>> Solr and having it only accept certain handler paths on certain ports,
>> then forward them to the common port on the Solr server.
>>
>> If you did manage to do this, it would require custom client code.  None
>> of the Solr clients for programming languages have a facility for
>> separate ports.
>>
>> Thanks,
>> Shawn
>>
>>


Re: How to know index file in OS Cache

2015-09-25 Thread Aman Tandon
Awesome thank you Mikhail. This is what I was looking for.

This was just a random question poped up in my mind. So I just asked this
on the group.

With Regards
Aman Tandon

On Fri, Sep 25, 2015 at 2:49 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> What about Linux:
> $less /proc/<pid>/maps
> $pmap <pid>
>
> On Fri, Sep 25, 2015 at 10:57 AM, Markus Jelsma <
> markus.jel...@openindex.io>
> wrote:
>
> > Hello - as far as i remember, you don't. A file itself is not the unit to
> > cache, but blocks are.
> > Markus
> >
> >
> > -Original message-
> > > From:Aman Tandon 
> > > Sent: Friday 25th September 2015 5:56
> > > To: solr-user@lucene.apache.org
> > > Subject: How to know index file in OS Cache
> > >
> > > Hi,
> > >
> > > Is there any way to know that the index file/s is present in the OS
> cache
> > > or RAM. I want to check if the index is present in the RAM or in OS
> cache
> > > and which files are not in either of them.
> > >
> > > With Regards
> > > Aman Tandon
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


More Like This on numeric fields - BF accepted by MLT handler

2015-09-25 Thread Alessandro Benedetti
Hi guys,
I was just investigating a little bit how to include numeric fields in the
MLT calculations.

As we know, we currently build a smart Lucene query based on the input
document (the one to search for similar ones) and run this query
to obtain the similar docs.
Because MLT is currently built on TF/IDF, it is mainly intended for
textual fields.
What if we want to include a numeric factor in the similarity calculation?

e.g.
Solr Document ( Hotel)
mlt.fl=description,stars,trip_advisor_rating

To find the similarity based not only on the description, but also on the
numeric fields ( stars and rating) .

The first thought I had is to add support for boosting functions.
This way we are more flexible and can add as many functions as we want.

For example adding :
bf=div(1,dist(2,seedDocumentRatingA,seedDocumentRatingB,ratingA,ratingB))

Also other kind of functions can be applied.
What do you think ? Do you have any alternative ideas ?

Cheers
-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: How to know index file in OS Cache

2015-09-25 Thread Mikhail Khludnev
What about Linux:
$less /proc/<pid>/maps
$pmap <pid>

On Fri, Sep 25, 2015 at 10:57 AM, Markus Jelsma 
wrote:

> Hello - as far as i remember, you don't. A file itself is not the unit to
> cache, but blocks are.
> Markus
>
>
> -Original message-
> > From:Aman Tandon 
> > Sent: Friday 25th September 2015 5:56
> > To: solr-user@lucene.apache.org
> > Subject: How to know index file in OS Cache
> >
> > Hi,
> >
> > Is there any way to know that the index file/s is present in the OS cache
> > or RAM. I want to check if the index is present in the RAM or in OS cache
> > and which files are not in either of them.
> >
> > With Regards
> > Aman Tandon
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: faceting is unusable slow since upgrade to 5.3.0

2015-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2015 at 6:33 AM, Uwe Reh  wrote:
> Am 25.09.2015 um 05:16 schrieb Yonik Seeley:
>>
>> I did some performance benchmarks and opened an issue.  It's bad.
>> https://issues.apache.org/jira/browse/SOLR-8096
>
>
> Hi Yonik,
> thanks a lot for your investigation.
> Using the JSON Facet API is fast and seems to be a usable workaround for new
> applications. But not really, as fast patch to our production environment.

Single-valued fields were likely also impacted (but probably not to
the extent that multi-valued fields were).
Are you faceting on any of those?

> What' your assessment about Bill's question? Is there a chance to get the
> fieldValueCache back?

Unclear. If you look at
https://issues.apache.org/jira/browse/SOLR-8096
you see:
"I was always in favour of removing those top-level facetting
algorithms. So they still have my strong +1."
Which means that it could be vetoed.

-Yonik


Re: faceting is unusable slow since upgrade to 5.3.0

2015-09-25 Thread Yonik Seeley
On Fri, Sep 25, 2015 at 5:07 AM, Alessandro Benedetti
 wrote:
>There is an undocumented "method" parameter - I need to enable that to
>
>> allow switching between the docvalues approach and the UnInvertedField
>> approach.
>>
>
> Only to clarify, please correct me Yonik if my understanding is wrong or
> outdated :
> To calculate facets, without going into the algorithm details there are 2
> approaches available :
> Term Enum ( good for limited number of unique values for your field) and Fc
> ( FieldCache) good for a lot of unique values, but not for big fields.
>
> For the FC approach,
>  - storing the DocValues for the field would transparently use them ( with
> the known benefit at the cost of disk space for the docValues data
> structures)
>  - without the DocValues , there algorithm will un-invert the index at
> runtime using the field cache to store the results

Yeah, that's right so far.
We should add a switch though for the method of uninversion...
UnInvertedField (for indexes that change less frequently) vs DocValues
(i.e. if you didn't index with DocValues, UnInvertedReader will
uninvert to an in-memory structure that looks like DocValues).

> So , from your quote, Term Enum will not be supported by Json Faceting ?

We can, it just hasn't been a priority yet.

Anyway, I'm going to step away from email and
https://issues.apache.org/jira/browse/SOLR-8096 for a couple of days.
I need to go focus on putting some slides together for
Strata/HadoopWorld next week. I'll be talking about the new facet
module / json facets there.

-Yonik


Help for Highlights

2015-09-25 Thread Leandro Henrique
Dear Colleagues of Solr-list,

I am using Solr 5.0 at work to index a textual base of approximately 3500
documents. The documents are stored in XML files. Almost everything is right
and functioning normally ... except the highlight functionality.

This feature is not working well! For any search, Solr presents the
results, but there are matched documents that have no highlights. I do not
understand: how can a document be found when there is no highlight for it?

Here is an example:

=> Search for "rabanete" (in Portuguese):

=> URL search:
http://localhost:8983/solr/baseprojetos/select?q=rabanete&sort=score+desc&rows=5&fl=tituloprojeto%2Csubmissaoid%2Cscore&wt=json&indent=true&defType=dismax&hl=true&hl.fl=*&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E&hl.usePhraseHighlighter=true&hl.highlightMultiTerm=true

=> Results (JSON):
**
 "responseHeader":{ "status":0, "QTime":146, "params":{ "hl":"true", 
"indent":"true", "fl":"tituloprojeto,submissaoid,score", 
"hl.usePhraseHighlighter":"true", "sort":"score desc", "rows":"5", 
"hl.simple.pre":"", "q":"rabanete", "defType":"dismax", 
"hl.simple.post":"", "hl.fl":"*", "wt":"json", 
"hl.highlightMultiTerm":"true"}}, 
"response":{"numFound":5,"start":0,"maxScore":0.4094792,"docs":[

{ "submissaoid":"22920", "tituloprojeto":"AVALIAÇÃO DE DISPONIBILIDADE DE 
METAIS PESADOS PARA PLANTAS CULTIVADAS EM UM SOLO TRATADO COM FONTES 
ALTERNATIVAS DE POTÁSSIO", "score":0.4094792},

{ "submissaoid":"34721", "tituloprojeto":"Aperfeiçoamento do processo de 
produção e definição de parâmetros ideais para produção de conservas de brotos 
de soja a partir da cultivar BRS 216", "score":0.24568753},

{ "submissaoid":"204661", "tituloprojeto":"Transferência de tecnologias de 
cobertura vegetal na cultura dos citros e sua contribuição para a agricultura 
conservacionista.", "score":0.08686366},

{ "submissaoid":"204607", "tituloprojeto":"DESENVOLVIMENTO DE INSTRUMENTAÇÃO, 
MÉTODOS E PROCESSOS PARA AVALIAÇÃO E USO SEGURO DE RESÍDUOS", 
"score":0.057909105},

{ "submissaoid":"210515", "tituloprojeto":"Projeto Xisto Agrícola - Pesquisa e 
desenvolvimento do potencial de uso do xisto e seus coprodutos na agricultura", 
"score":0.057909105}]

"highlighting":{
"22920":{"objetivogeral":[" presente Projeto tem como objetivo geral estudar a 
disponibilidade de metais pesados provenientes de quatro fontes alternativas de 
potássio, para a alface, soja e rabanete"]},
"34721":{"resumoprojeto":[" o tamanho necessário para serem consumidos, sendo 
fontes ricas em minerais, vitaminas, proteínas e com baixa caloria. O \"feijão 
moyashi\", também conhecido como feijão mungo é a espécie mais utilizada para a 
produção de brotos no Brasil. Mais de 30 espécies de plantas, principalmente de 
olerícolas (brócolis, rabanete"]},
"204661":{},
"204607":{"descricaoatividade":[" de massa seca da parte aérea e raízes e na 
produtividade de hortaliças.Os experimentos serão realizados na Estação 
Experimental da Embrapa Clima Temperado, num Argissolo Vermelho, utilizando-se 
espécies de hortaliças cujo órgão de consumo são as folhas (alface), as raízes 
(rabanete"]},
"210515":{descricaoatividade":[" plástica em março de 2012. O uso de cobertura 
plástica nos canteiros foi para evitar possíveis perdas dos tratamentos 
aplicados por lixiviação. As espécies de hortaliças avaliadas neste estudo são 
rabanete"]}}}

**

Note that the document with ID = 204661 has no highlight, yet it was returned with the
third-highest score!!!

Where am I going wrong? Which configuration is wrong? Can anyone help me?

Thanks in advance!
Leandro.