Re: docValue vs. analyzer

2018-04-26 Thread Uwe Reh

Hi Erick,

Thank you for the hint about SortableTextField. This seems to be really
the type I was looking for.


UpdateProcessors could be a workaround, but I don't like them. For me
they are neither fish nor fowl (neither internal nor external).


Uwe



On 19.04.2018 at 18:38, Erick Erickson wrote:

I haven't poked into the details, but (recently, very recently, 7.3)
there's a SortableTextField that may be useful in this situation.
Otherwise you could use a FieldMutatingUpdateProcessorFactory or
perhaps a ScriptUpdateProcessor to manipulate the fields on the way
in. Not quite sure how you could get synonyms to work in those
situations though.

Best,
Erick
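
A minimal sketch of the SortableTextField idea (Solr 7.3+). Field, type and file names are illustrative, and note that SortableTextField derives its docValues from the original input, not from the analyzed tokens, so whether the synonym mapping ends up in the docValues needs testing:

<fieldType name="groupKey" class="solr.SortableTextField" docValues="true">
   <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" synonyms="group-synonyms.txt"/>
   </analyzer>
</fieldType>
<field name="groupId" type="groupKey" indexed="true" stored="false"/>
<copyField source="id" dest="groupId"/>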


Weird behavior of own tokenizer

2018-04-26 Thread Uwe Reh

Hi

I'm trying to write my own tokenizer for Solr 7.

Doing this, everything seems to be fine:
- the tokenizer compiles
- the tokenizer is instantiated fine by its factory
- the tokenizer seems to do its work when tested with the GUI
  "../solr/#/collection/analysis"

BUT
- the expected result isn't visible in the document.

Sure, I got something wrong. But I have no idea what.

Any hints are appreciated.
Uwe


###
# Snippet schema.xml
###

[The XML of this snippet was stripped by the mail archive.]
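Presumably it declared a fieldType wired to the tokenizer's factory, roughly like this (a hedged sketch; type, field and factory names are assumptions, the package is taken from the log below):

<fieldType name="myTokenizedType" class="solr.TextField">
   <analyzer>
      <tokenizer class="de.hebis.solr.analysis.MyTokenizerFactory"/>
   </analyzer>
</fieldType>
<field name="myField" type="myTokenizedType" indexed="true" stored="true"/>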

###
# minimized example:
# Just replace everything with the constant string "substitute"
###

import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyTokenizer extends Tokenizer {
   private static final Logger LOG = LoggerFactory.getLogger(MyTokenizer.class);
   protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
   private boolean done = false;

   public MyTokenizer() {
      super();
   }

   @Override
   public boolean incrementToken() throws IOException {
      if (done) return false;
      charTermAttribute.setEmpty();
      String toReplace = getStartOfChallenge();
      LOG.info("Input: " + toReplace + " replaced.");
      charTermAttribute.append("substitute");
      done = true;
      return true;
   }

   @Override
   public void reset() throws IOException {
      super.reset();
      done = false;
   }

   /* Read some chars from 'input' */
   private String getStartOfChallenge() {
      char[] buffer = new char[200];
      int inputLength = -1;
      try {
         inputLength = input.read(buffer, 0, 200);
      } catch (IOException e) {
         throw new RuntimeException(e);
      }
      if (inputLength == -1) {
         LOG.warn("No input");
         return null;
      }
      return new String(buffer, 0, inputLength);
   }
}


###
# Snippet solr.log
# The input was "ReplaceMe"
###

de.hebis.solr.analysis.MyTokenizer.incrementToken(): Input: ReplaceMe replaced.






docValue vs. analyzer

2018-04-19 Thread Uwe Reh

Hi,

I'm stuck in a dead end.

My task is to map individual IDs in order to group them.

So far, so simple:
* copyField 'id' -> 'groupId'
* use a SynonymFilter on 'groupId'

Now, I had the idea to improve the performance of grouping with 'docValues'.

Unfortunately, this leads to a contradiction:
* docValues are not allowed on TextFields
* analyzers are not allowed on StrFields.

Is there a way to resolve this contradiction within Solr, without the
need for external preprocessing?


Regards
Uwe

PS.
Yes, a token stream for a StrField isn't a great idea.
But having CharFilters would be nice.


Re: CVE-2017-12629 which versions are vulnerable?

2017-10-16 Thread Uwe Reh

Sorry,

I missed the post from Florian Gleixner:
>Re: Several critical vulnerabilities discovered in Apache Solr (XXE & RCE)


On 16.10.2017 at 16:52, Uwe Reh wrote:

Hi,

I'm still using V4.10. Is this version also vulnerable to
http://openwall.com/lists/oss-security/2017/10/13/1 ?


Uwe


CVE-2017-12629 which versions are vulnerable?

2017-10-16 Thread Uwe Reh

Hi,

I'm still using V4.10. Is this version also vulnerable to
http://openwall.com/lists/oss-security/2017/10/13/1 ?


Uwe


Re: CDCR (Solr6.x) does not start

2016-07-08 Thread Uwe Reh

Hi Renaud,

thank you for your response.

You asked for some further information:

1. Log messages at the source cluster:
As mentioned in my addendum "CDCR (Solr6.x) does not start (logfile)", I
changed the log level for all handlers to TRACE and got three messages
for each shard, caused by "Action LASTPROCESSEDVERSION sent to non-leader
replica ..".

For me this looks like the blocker.

2.

Replication should start even if no commit has been sent to the source cluster.

Thanks for the clarification. It helps me to understand.

3.

The empty queue seems to indicate there is an issue, and that cdcr was unable
to instantiate the replicator for the target cluster.
Just to be sure, your source cluster has 4 shards, but no replicas? If it has
replicas, can you ensure that you execute these commands on the shard leader?

At the beginning I tried to replicate 4 shards with a replication
factor of 3. Later on I simplified the environment by omitting the
replicas (replication factor = 1).

Do you think having no replicas could be the reason for the log messages above?


Regards
Uwe





On 05.07.2016 at 14:55, Renaud Delbru wrote:

Hi Uwe,

At first look, your configuration seems correct,
see my comments below.

On 28/06/16 15:36, Uwe Reh wrote:

9. Start CDCR
http://SOURCE:s_port/solr/scoll/cdcr?action=start&wt=json

{"responseHeader":{"status":0,"QTime":13},"status":["process","started","buffer","enabled"]}



! (not even a single query to the target's zookeeper ??)


Indeed, you should have observed a communication between the source
cluster and the target zookeeper. Do you see any errors in the log of
the source cluster ? Or a log message such as:
"Unable to instantiate the log reader for target collection ..."



10. Enter some test data into the SOURCE

11. Explicit commit in SOURCE
http://SOURCE:s_port/solr/scoll/update?commit=true&waitSearcher=true
!! (at least now there should be some traffic, or?)


Replication should start even if no commit has been sent to the source
cluster.



12. Check errors and queues
http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=queues&wt=json


{"responseHeader":{"status":0,"QTime":0},"queues":[],"tlogTotalSize":135,"tlogTotalCount":1,"updateLogSynchronizer":"stopped"}



http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=errors&wt=json


{"responseHeader":{"status":0,"QTime":0},"errors":[]}

! Why is the element 'queues' empty?


The empty queue seems to indicate there is an issue, and that cdcr was
unable to instantiate the replicator for the target cluster.
Just to be sure, your source cluster has 4 shards, but no replicas? If
it has replicas, can you ensure that you execute these commands on the
shard leader?

Kind Regards


Re: CDCR (Solr6.x) does not start (logfile)

2016-06-29 Thread Uwe Reh

Hi,

trying to get more information, I restarted the SOURCE node and watched
the log.

For each shard I got the following triple:


WARN  org.apache.solr.handler.CdcrRequestHandler - Action LASTPROCESSEDVERSION sent to non-leader replica @ scoll:shard1
ERROR org.apache.solr.handler.RequestHandlerBase - org.apache.solr.common.SolrException: Action LASTPROCESSEDVERSION sent to non-leader replica
WARN  org.apache.solr.handler.CdcrUpdateLogSynchronizer - Caught unexpected exception
   org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://SOURCE:s_port/solr/scoll_shard1_replica1: Action LASTPROCESSEDVERSION sent to non-leader replica


Could this be the reason why there is no further action?
The SOURCE cloud has just replication factor '1', so
'scoll_shard1_replica1' should always be the leader, right?


Regards
Uwe




CDCR (Solr6.x) does not start

2016-06-28 Thread Uwe Reh

Hi,

I'm trying to get CDCR to run, but I can't even trigger any
communication between SOURCE and TARGET.
It seems to be a small but grave misunderstanding. I've tested a lot of
variants, but by now I'm blind on this point.

If anyone could give me a hint, I would appreciate it.

Uwe


Test setting:
Two nearly identical hosts (open solaris) with:
- a minimal zookeeper ensemble (one local installation (not embedded),
listening on port 2181)

- a minimal cloud (one node, one empty collection, 4 shards)
Initially both installations differ only in solrconfig.xml (snippets below).
The TCP traffic was observed with 'snoop' (tcpdump). There are no packet
filters or other firewalls between both machines.



Testprocess:

1. Start node for TARGET

2. Create TARGET collection 'tcoll'
http://TARGET:t_port/solr/admin/collections?action=CREATE&name=tcoll&numShards=4&replicationFactor=1&maxShardsPerNode=4&collection.configName=cdcr

3. Get status
http://TARGET:t_port/solr/tcoll/cdcr?action=status&wt=json

{"responseHeader":{"status":0,"QTime":0},"status":["process","stopped","buffer","enabled"]}


4. Disable buffer
http://TARGET:t_port/solr/tcoll/cdcr?action=disablebuffer&wt=json

{"responseHeader":{"status":0,"QTime":12},"status":["process","stopped","buffer","disabled"]}


6. Start node for SOURCE
(as expected, no TCP between both hosts)

7. Create SOURCE collection 'scoll'
http://SOURCE:s_port/solr/admin/collections?action=CREATE&name=scoll&numShards=4&replicationFactor=1&maxShardsPerNode=4&collection.configName=cdcr
(no TCP between both hosts)

8. Get status
http://SOURCE:s_port/solr/scoll/cdcr?action=status&wt=json

{"responseHeader":{"status":0,"QTime":13},"status":["process","stopped","buffer","enabled"]}

(as expected, no TCP between both hosts)

9. Start CDCR
http://SOURCE:s_port/solr/scoll/cdcr?action=start&wt=json

{"responseHeader":{"status":0,"QTime":13},"status":["process","started","buffer","enabled"]}

! (not even a single query to the target's zookeeper ??)

10. Enter some test data into the SOURCE

11. Explicit commit in SOURCE
http://SOURCE:s_port/solr/scoll/update?commit=true&waitSearcher=true
!! (at least now there should be some traffic, or?)

12. Check errors and queues
http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=queues&wt=json

{"responseHeader":{"status":0,"QTime":0},"queues":[],"tlogTotalSize":135,"tlogTotalCount":1,"updateLogSynchronizer":"stopped"}

http://SOURCE:s_port/solr/scoll_shard1_replica1/cdcr?action=errors&wt=json

{"responseHeader":{"status":0,"QTime":0},"errors":[]}

! Why is the element 'queues' empty?


 where is my stupid bug 





# solrconfig Source
# (the XML tags were stripped by the mail archive; reconstructed from the residue)

<updateHandler class="solr.DirectUpdateHandler2">
   <updateLog class="solr.CdcrUpdateLog">
      <str name="dir">${solr.ulog.dir:}</str>
   </updateLog>
</updateHandler>

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
   <lst name="replica">
      <str name="zkHost">TARGET:2181</str>
      <str name="source">scoll</str>
      <str name="target">tcoll</str>
   </lst>
   <lst name="replicator">
      <str name="threadPoolSize">1</str>
   </lst>
</requestHandler>


# solrconfig Target
# (the XML tags were stripped by the mail archive; reconstructed from the residue)

<updateHandler class="solr.DirectUpdateHandler2">
   <updateLog class="solr.CdcrUpdateLog">
      <str name="dir">${solr.ulog.dir:}</str>
   </updateLog>
   <autoCommit>
      <maxDocs>${solr.autoCommit.maxdocs:1000}</maxDocs>
      <maxTime>${solr.autoCommit.maxTime:300}</maxTime>
      <openSearcher>true</openSearcher>
   </autoCommit>
   <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:60}</maxTime>
   </autoSoftCommit>
</updateHandler>

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
   <lst name="buffer">
      <str name="defaultState">disabled</str>
   </lst>
</requestHandler>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
   <lst name="defaults">
      <str name="update.chain">cdcr-processor-chain</str>
   </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-processor-chain">
   <processor class="solr.CdcrUpdateProcessorFactory"/>
   <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

##
# EOF
#


Re: Solr 6 CDCR does not work

2016-06-07 Thread Uwe Reh

Hi Adam,

maybe it's my poor English, but I'm confused.
I've taken Renaud's quote as a hint to activate autocommit on the
target cluster, or at least to do frequent manual commits, to see the
replicated documents.

Now you wrote that disabling autocommit helps.

Could you please clarify this point?

Regards
Uwe


On 01.06.2016 at 12:28, Adam Majid Sanjaya wrote:

disable autocommit on the target

It worked!
thanks

2016-05-30 15:40 GMT+07:00 Renaud Delbru :


Hi Adam,
...
Also, do you have an autocommit configured on the target ? CDCR does not
replicate commit, and therefore you have to send a commit command on the
target to ensure that the latest replicated documents are visible.
...
--
Renaud Delbru



relaxed vs. improved validation in solr.TrieDateField

2016-04-29 Thread Uwe Reh

Hi,

doing some migration tests (4.10 to 6.0) I noticed an improved
validation in TrieDateField.
Syntactically correct but impossible dates are rejected now. (Stack trace
at the end of the mail.)


Examples:
- '1997-02-29T00:00:00Z'
- '2006-06-31T00:00:00Z'
- '2000-00-00T00:00:00Z'
The first two dates are formally OK, but these days do not exist. The
third date is more suspicious, but was also accepted by Solr 4.10.


I appreciate this improvement in principle, but I have to respect the 
original data. The dates might be intentionally wrong.


Is there an easy way to get the weaker validation back?

Regards
Uwe



Invalid Date in Date Math String:'1997-02-29T00:00:00Z'
at org.apache.solr.util.DateMathParser.parseMath(DateMathParser.java:254)
at org.apache.solr.schema.TrieField.createField(TrieField.java:726)
at org.apache.solr.schema.TrieField.createFields(TrieField.java:763)
at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:47)
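
For illustration only (this is plain-Java behavior, not necessarily what Solr does internally): the weaker 4.10-style validation corresponds to lenient calendar parsing, which silently rolls impossible dates over, while strict parsing rejects them.

import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class DateLeniencyDemo {
   public static void main(String[] args) throws Exception {
      SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
      f.setTimeZone(TimeZone.getTimeZone("UTC"));

      f.setLenient(true);   // 4.10-like: '1997-02-29' rolls over to 1997-03-01
      System.out.println(f.parse("1997-02-29T00:00:00Z"));

      f.setLenient(false);  // 6.0-like: '1997-02-29' throws a ParseException
      System.out.println(f.parse("1997-02-29T00:00:00Z"));
   }
}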




Re: faceting is unusably slow since upgrade to 5.3.0

2015-10-08 Thread Uwe Reh

Sorry for the delay. I had an ugly flu.

SOLR-7730 seems to work fine. Using docValues with Solr 
5.4.0-2015-09-29_08-29-55 1705813 makes my faceted queries fast again. 
(90ms vs. 2ms) :-)


Thanks
Uwe



On 27.09.2015 at 20:32, Mikhail Khludnev wrote:

On Sun, Sep 27, 2015 at 2:00 PM, Uwe Reh <r...@hebis.uni-frankfurt.de> wrote:


When 5.4 with SOLR-7730 will be released, I will start to use docValues.
Going this way, seems more straight forward to me.



Sure. Given your answers, docValues facets have a really good chance to
perform well in your index after SOLR-7730. It's really interesting to see
performance numbers on early 5.4 builds:
https://builds.apache.org/view/All/job/Solr-Artifacts-5.x/lastSuccessfulBuild/artifact/solr/package/





Re: Scramble data

2015-10-08 Thread Uwe Reh

Hi,

my suggestions are probably too simple, because they are not a real
protection of privacy. But maybe one fits your needs.


Most simple:
Declare your 'hidden' fields just as "indexed=true stored=false"; the
data will be used for searching, but the fields are not listed in the
query response.
Cons: The terms of the fields can still be examined by advanced users.
For example, they could use the field as a facet.


Very simple:
Use a PhoneticFilter for indexing and searching. The encoding
"ColognePhonetic" generates a numeric hash for each term. The name
"Breschnew" will be saved as "17863".
Cons: Phonetic similarities will lead to false hits. This hashing is
really only scrambling and not appropriate as a security feature.


Simple:
Declare a special SearchHandler in your solrconfig.xml and define an
invariant fieldList parameter. This should contain just the public
subset of your fields.

Cons: I'm not really sure about this.
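
A minimal sketch of such a handler (handler and field names are illustrative):

<requestHandler name="/public" class="solr.SearchHandler">
   <lst name="invariants">
      <str name="fl">id,title,year</str>
   </lst>
</requestHandler>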

Still quite simple:
Write your own Filter which generates real cryptographic hashes.
Cons: If the entropy of your data is poor, you may need additional
tricks like padding the data. This filter may slow down your system.
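
A hedged sketch of such a filter (class name illustrative; salted SHA-256 as the hash):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class HashingTokenFilter extends TokenFilter {
   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
   private final byte[] salt;

   public HashingTokenFilter(TokenStream input, byte[] salt) {
      super(input);
      this.salt = salt;
   }

   @Override
   public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) return false;
      try {
         MessageDigest md = MessageDigest.getInstance("SHA-256");
         md.update(salt); // a fixed salt; low-entropy data may need more tricks (padding etc.)
         md.update(termAtt.toString().getBytes(StandardCharsets.UTF_8));
         StringBuilder hex = new StringBuilder();
         for (byte b : md.digest()) hex.append(String.format("%02x", b));
         termAtt.setEmpty().append(hex); // replace the clear-text term with its hash
      } catch (NoSuchAlgorithmException e) {
         throw new RuntimeException(e);
      }
      return true;
   }
}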



Last but not least, be aware that searching could be a way to restore
hidden information. If a query for "billionaire" gets just one hit,
it's obvious that "billionaire" is an attribute of the document, even
if it is not listed in the result.


Uwe


Re: faceting is unusably slow since upgrade to 5.3.0

2015-09-27 Thread Uwe Reh

Hi Mikhail,

Is this what you've requested?

lookups: 34084
hits: 34067
hitratio: 1
inserts: 34
evictions: 0
...
item_author_facet: 
{field=author_facet,memSize=104189615,tindexSize=789195,time=16901,phase1=16534,nTerms=3989851,bigTerms=0,termInstances=16214154,uses=4065}
item_topic_facet: 
{field=topic_facet,memSize=103817915,tindexSize=112199,time=8912,phase1=8496,nTerms=525261,bigTerms=0,termInstances=11050466,uses=1510}
item_material_access: 
{field=material_access,memSize=4532,tindexSize=46,time=1820,phase1=1820,nTerms=2,bigTerms=2,termInstances=0,uses=3406}
(The fields 'author_facet' and 'topic_facet' have a lot of unique
entries; 'material_access' has only two values, 'online' vs. 'print'.)


Apart from "*:*", queries with more than maxdoc/2 hits happen very,
very rarely. Typical requests result in less than 1% of maxdoc.


Here is a typical example, searching for "Goethe" in the portfolio of the
University Library Frankfurt/Main:

> https://hds.hebis.de/ubffm/Search/Results?lookfor=goethe=new

The request yields over 31,000 results (~0.2% of maxdocs). The majority
are books about Goethe; 'just' 5700 books are by him. The facet helps
to detect professionals.


As Walter Underwood wrote, in a technical sense faceting on authors
isn't a good idea. In the worst case, the relation of book to author is
n:n. Nevertheless, thanks to authority files (which are used intensively
in Germany), the 'author' facet is often helpful.


Uwe


On 26.09.2015 at 14:08, Mikhail Khludnev wrote:

Uwe,
Would you mind providing a few details about your case?
I wonder about the number of bigTerms and other stats as well for the 'author' field
(and other most expensive facets). It looks like log rows:

Sep 13, 2011 2:51:53 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field
{*field=nomejornal*,memSize=827108,tindexSize=40,time=16,phase1=4,*nTerms=15,bigTerms=0*,termInstances=750,uses=0}

Those heavy requests, do they find more than half of the docs, e.g. hits > maxdoc/2?


Thanks for your input!


On Thu, Sep 24, 2015 at 11:38 AM, Uwe Reh <r...@hebis.uni-frankfurt.de>
wrote:


On 22.09.2015 at 18:10, Walter Underwood wrote:


Faceting on an author field is almost always a bad idea. Or at least a
slow, expensive idea.



Hi Wunder,
In a technical context, the 'author' facet may be suboptimal. In our
business (library services) it's a core feature.
Yes, the facet is expensive, but thanks to the fieldValueCache (4.10)
it is sufficiently fast.

uwe









Re: faceting is unusably slow since upgrade to 5.3.0

2015-09-27 Thread Uwe Reh

Hi Mikhail,

thanks for the hint, and "no", it wasn't obvious to me. :-)
But I think for us it's better to remain at 4.10.3 and observe the
evolution of SOLR-8096. When 5.4 with SOLR-7730 is released, I will
start to use docValues. Going this way seems more straightforward to me.


Uwe

On 27.09.2015 at 00:20, Mikhail Khludnev wrote:

Uwe,

As a workaround, can you add facet.threads=Ncores to count fields in
parallel?
Also, setting the fcs method for single-value fields runs per-segment
faceting in parallel.
Of course, fields which have a small number of terms benefit from the
enum method.
Excuse me if it's obvious.
https://cwiki.apache.org/confluence/display/solr/Faceting
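
Combined, such a request might look like this (host and field names are illustrative):

http://host:8983/solr/hebis/select?q=*:*&facet=true&facet.field=author_facet&facet.field=material_access&facet.threads=4&f.material_access.facet.method=enum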





Re: Different ports for search and upload request

2015-09-25 Thread Uwe Reh

On 25.09.2015 at 00:05, Siddhartha Singh Sandhu wrote:

Never did this. But how about this crazy idea:
take an Amazon EFS and share it between 2 EC2 instances.


I think you are on the right way. Imho this requirement should be
solved externally.


Option 1:
Hide your Solr node behind an HTTP proxy which publishes the APIs/handlers
on different ports. Or publish only the request handlers like 'select'
and 'get', and let your update process use the full API.


Option 2: Use replication. Update the master and send your queries to the
slave.


Uwe



Re: faceting is unusably slow since upgrade to 5.3.0

2015-09-25 Thread Uwe Reh

On 25.09.2015 at 05:16, Yonik Seeley wrote:

I did some performance benchmarks and opened an issue.  It's bad.
https://issues.apache.org/jira/browse/SOLR-8096


Hi Yonik,
thanks a lot for your investigation.
Using the JSON Facet API is fast and seems to be a usable workaround for
new applications, but not really a fast patch for our production
environment.
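
For reference, the workaround looks roughly like this (host illustrative, field name from our schema):

http://host:8983/solr/hebis/select?q=*:*&rows=0&json.facet={authors:{type:terms,field:author_facet,limit:30}}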


What's your assessment of Bill's question? Is there a chance to get
the fieldValueCache back?


I would like to have it back in 5.x, even marked as deprecated. This 
would help to migrate.


Uwe



Re: faceting is unusably slow since upgrade to 5.3.0

2015-09-24 Thread Uwe Reh

On 23.09.2015 at 10:02, Mikhail Khludnev wrote:

...
Accelerating non-DV facets is not so clear so far. Please show profiler
snapshot for non-DV facets if you wish to go this way.


Hi,

attached is a VisualVM profile of several runs of a simplified query (just
one facet):

http://xyz/solr/hebis/select/?q=*:*&facet=true&start=1&rows=30&facet.field=author_facet&debug=true


The average "QTime" for the query is ~5 seconds:


  5254.0
  0.0
  5253.0
  0.0
  0.0
  0.0
  0.0



The profile was made with Solr 5.3 running a 4.10 index with no
'docValues' at all in the schema. (A native 5.3 index with docValues is
still building.)


For me it's surprising that a lot of "docValues" methods can be found in the
profile.


Uwe

PS.
Meanwhile I tried 5.1 and got the same behavior.
"Hot Spots - Method";"Self Time [%]";"Self Time";"Self Time (CPU)";"Total 
Time";"Total Time (CPU)";"Samples"
"sun.nio.ch.ServerSocketChannelImpl.accept()";"29.757507";"911411.696 
ms";"227852.922 ms";"911411.696 ms";"227852.922 ms";"4"
"sun.nio.ch.SelectorImpl.select()";"29.751842";"911238.171 ms";"911238.171 
ms";"911238.171 ms";"911238.171 ms";"6"
"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos()";"12.056136";"369254.84
 ms";"0.0 ms";"369254.84 ms";"0.0 ms";"73"
"java.lang.Object.wait()";"7.439377";"227852.924 ms";"0.0 ms";"227852.924 
ms";"0.0 ms";"1"
"java.net.ServerSocket.accept()";"7.439377";"227852.924 ms";"0.0 
ms";"227852.924 ms";"0.0 ms";"1"
"java.util.HashMap.put()";"3.8150647";"116847.636 ms";"116847.636 
ms";"116847.636 ms";"116847.636 ms";"873"
"java.util.TreeMap.put()";"2.946289";"90238.818 ms";"90238.818 ms";"90238.818 
ms";"90238.818 ms";"180"
"org.apache.lucene.index.FieldInfos$Builder.addOrUpdateInternal()";"2.1034875";"64425.528
 ms";"64425.528 ms";"183450.033 ms";"183450.033 ms";"113"
"java.util.Collections$UnmodifiableCollection$1.next()";"0.8864094";"27148.909 
ms";"27148.909 ms";"27148.909 ms";"27148.909 ms";"41"
"java.util.TreeMap$EntryIterator.next()";"0.81940365";"25096.661 ms";"25096.661 
ms";"25096.661 ms";"25096.661 ms";"26"
"java.util.HashMap.get()";"0.66768044";"20449.689 ms";"20449.689 ms";"20449.689 
ms";"20449.689 ms";"159"
"org.apache.solr.request.DocValuesFacets.accumMultiSeg()";"0.42119572";"12900.365
 ms";"12900.365 ms";"32423.444 ms";"32423.444 ms";"23"
"org.apache.lucene.util.packed.MonotonicLongValues.get()";"0.37381834";"11449.292
 ms";"11449.292 ms";"11449.292 ms";"11449.292 ms";"73"
"java.util.AbstractCollection.toArray()";"0.3550354";"10874.009 ms";"10874.009 
ms";"10874.009 ms";"10874.009 ms";"63"
"org.apache.lucene.index.FieldInfos.()";"0.319384";"9782.08 ms";"9782.08 
ms";"150232.207 ms";"150232.207 ms";"69"
"org.apache.lucene.uninverting.DocTermOrds$Iterator.read()";"0.26374063";"8077.837
 ms";"8077.837 ms";"8077.837 ms";"8077.837 ms";"64"
"java.util.Collections.max()";"0.21143816";"6475.919 ms";"6475.919 
ms";"6475.919 ms";"6475.919 ms";"46"
"org.apache.solr.request.DocValuesFacets.getCounts()";"0.090463296";"2770.706 
ms";"2770.706 ms";"410211.805 ms";"410211.805 ms";"60"
"org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll()";"0.05521328";"1691.07
 ms";"1691.07 ms";"1691.07 ms";"1691.07 ms";"2"
"java.lang.System.identityHashCode[native]()";"0.031022375";"950.152 
ms";"950.152 ms";"950.152 ms";"950.152 ms";"3"
"org.apache.solr.util.LongPriorityQueue.downHeap()";"0.026107715";"799.626 
ms";"799.626 ms";"799.626 ms";"799.626 ms";"9"
"java.util.Collections$UnmodifiableCollection$1.hasNext()";"0.020632554";"631.933
 ms";"631.933 ms";"631.933 ms";"631.933 ms";"6"
"org.apache.lucene.index.FieldInfo.()";"0.011944577";"365.838 
ms";"365.838 ms";"365.838 ms";"365.838 ms";"4"
"java.util.WeakHashMap.put()";"0.011552288";"353.823 ms";"353.823 ms";"353.823 
ms";"353.823 ms";"2"
"org.apache.lucene.index.FieldInfos$Builder.add()";"0.010934878";"334.913 
ms";"334.913 ms";"211565.8 ms";"211565.8 ms";"181"
"org.eclipse.jetty.server.HttpOutput.write()";"0.010440102";"319.759 
ms";"319.759 ms";"482.602 ms";"482.602 ms";"9"
"java.util.WeakHashMap.get()";"0.010077655";"308.658 ms";"308.658 ms";"308.658 
ms";"308.658 ms";"2"
"org.apache.lucene.util.LongValues.get()";"0.010070211";"308.43 ms";"308.43 
ms";"11757.722 ms";"11757.722 ms";"74"
"org.apache.lucene.util.fst.FST.findTargetArc()";"0.00995512";"304.905 
ms";"304.905 ms";"304.905 ms";"304.905 ms";"2"
"org.apache.lucene.uninverting.DocTermOrds.uninvert()";"0.008576673";"262.686 
ms";"262.686 ms";"262.686 ms";"262.686 ms";"1"
"org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter()";"0.0037003446";"113.334
 ms";"113.334 ms";"341247.51 ms";"341247.51 ms";"58"
"java.lang.String.getBytes()";"0.0032916984";"100.818 ms";"100.818 ms";"100.818 
ms";"100.818 ms";"1"
"java.nio.DirectByteBuffer.get()";"0.0032914046";"100.809 ms";"100.809 
ms";"100.809 ms";"100.809 ms";"1"
"org.eclipse.jetty.http.DateGenerator.doFormatDate()";"0.003270182";"100.159 
ms";"100.159 ms";"100.159 ms";"100.159 ms";"1"

Re: faceting is unusably slow since upgrade to 5.3.0

2015-09-24 Thread Uwe Reh

On 22.09.2015 at 18:10, Walter Underwood wrote:

Faceting on an author field is almost always a bad idea. Or at least a slow, 
expensive idea.


Hi Wunder,
In a technical context, the 'author' facet may be suboptimal. In our
business (library services) it's a core feature.
Yes, the facet is expensive, but thanks to the fieldValueCache (4.10)
it is sufficiently fast.


uwe



Re: faceting is unusably slow since upgrade to 5.3.0

2015-09-22 Thread Uwe Reh

On 22.09.2015 at 02:12, Joel Bernstein wrote:

Have you looked at your Solr instance with a cpu profiler like YourKit? It
would be useful to see the hotspots which should be really obvious with 20
second response times.


No, until now I have done no profiling. I thought the unused
fieldValueCache was a clear indicator of my faulty operation.
Because we are a public service, I cannot use YourKit (not the license
itself, the local expenses for licensing are the blocker). I will try to
detect the hotspots with VisualVM.



Also are you running in distributed mode or on a single Solr instance?

Just a single instance.

Thanks for the attention
Uwe



Re: faceting is unusably slow since upgrade to 5.3.0

2015-09-22 Thread Uwe Reh

The exact version as shown by the UI is:
- solr-impl   5.3.0 1696229 - noble - 2015-08-17 17:10:43
- lucene-impl 5.3.0 1696229 - noble - 2015-08-17 16:59:03

Unfortunately my skills in debugging are limited, so I'm not sure what
a 'deeper caller stack' is.
Did you mean the attached snapshot from VisualVM, a stack trace like the
one below, or something else? Please give me a hint.


uwe


"qtp1734853116-68" #68 prio=5 os_prio=64 tid=0x117fd800 nid=0x77 
runnable [0xfd7f991fc000]
   java.lang.Thread.State: RUNNABLE
at java.util.HashMap.resize(HashMap.java:734)
at java.util.HashMap.putVal(HashMap.java:662)
at java.util.HashMap.put(HashMap.java:611)
at 
org.apache.lucene.index.FieldInfos$Builder.addOrUpdateInternal(FieldInfos.java:344)
at org.apache.lucene.index.FieldInfos$Builder.add(FieldInfos.java:366)
at org.apache.lucene.index.FieldInfos$Builder.add(FieldInfos.java:304)
at 
org.apache.lucene.index.MultiFields.getMergedFieldInfos(MultiFields.java:245)
at 
org.apache.lucene.index.SlowCompositeReaderWrapper.getFieldInfos(SlowCompositeReaderWrapper.java:237)
at 
org.apache.lucene.index.SlowCompositeReaderWrapper.getSortedSetDocValues(SlowCompositeReaderWrapper.java:174)
at 
org.apache.solr.request.DocValuesFacets.getCounts(DocValuesFacets.java:72)
at 
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:492)
at 
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:385)
at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:628)
at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:619)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.request.SimpleFacets$2.execute(SimpleFacets.java:573)
at 
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:644)
at 
org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:294)
at 
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:256)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:285)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)




On 22.09.2015 at 12:56, Mikhail Khludnev wrote:

It's quite strange:
https://issues.apache.org/jira/browse/SOLR-7730 significantly optimized DV
facets at 5.3.0, exactly by avoiding the FieldInfos merge.
Would you mind providing a deeper caller stack for
org.apache.lucene.index.MultiFields.getMergedFieldInfos()?
Or the time spent in SlowCompositeReaderWrapper, DocValuesFacets,
MultiDocValues and their hot methods.
Which version are you on exactly? And how do you know that?
Thanks





Re: faceting is unusably slow since upgrade to 5.3.0

2015-09-22 Thread Uwe Reh

here is my try to detect some hot spots with VisualVM.

Environment:
A newly started node with ~15 repetitions of the query:

http://yxz/solr/hebis/select/?q=darwin&facet=true&start=1&rows=30&facet.field=material_access&facet.field=department_3&facet.field=rvk_facet&facet.field=author_facet&facet.field=material_brief&facet.field=language&facet.prefix=&facet.sort=count&echoParams=all&debug=true


Ordered by self time the top methods are:

org.eclipse.jetty.util.BlockingArrayQueue.poll()                  260s(self), 260s(total)
org.apache.lucene.index.FieldInfos.<init>()                        90s(self),  90s(total)
org.apache.lucene.index.FieldInfos$FieldNumbers.addOrGet()         60s(self),  60s(total)
org.apache.lucene.index.FieldInfos$Builder.addOrUpdateInternal()   51s(self), 121s(total)
org.apache.lucene.index.FieldInfos$Builder.finish()                13s(self), 102s(total)
org.apache.lucene.index.FieldInfos$Builder.fieldInfo()              9s(self),   9s(total)
org.apache.lucene.index.FieldInfos$Builder.add()                    4s(self), 126s(total)
org.apache.lucene.index.MultiFields.getMergedFieldInfos()           1s(self), 229s(total)
... less than 1000 ms


Ordered by total time the top (non http/jetty) methods are:

jetty ...                                                            231s(total)
org.apache.solr.handler.component.SearchHandler.handleRequestBody() 231s(total)
org.apache.solr.request.SimpleFacets.*                              230s(total)
org.apache.solr.handler.component.FacetComponent.*                  230s(total)
org.apache.lucene.index.*                                           125s(total)
org.apache.lucene.search.*                                          0.3s(total)
... less than 300 ms




Re: faceting is unusably slow since upgrade to 5.3.0 (missing attachment)

2015-09-22 Thread Uwe Reh
 

virtualvm_snapshot_solr5.3_facetting.csv
Description: MS-Excel spreadsheet


faceting is unusably slow since upgrade to 5.3.0

2015-09-21 Thread Uwe Reh

Hi,

our bibliographic index (~20M entries) runs fine with Solr 4.10.3.
With Solr 5.3, faceted searching is constantly, incredibly slow
(~20 seconds).

Output of 'debugQuery' (timing values; the component labels were stripped by the mail archive):
17705.0
2.0
17590.0 !!
111.0


The 'fieldValueCache' seems to be unused (no inserts nor lookups) in 
Solr 5.3. In Solr 4.10 the 'fieldValueCache' is in heavy use with a 
cumulative_hitratio of 1.


- the behavior is the same running Solr 5.3 on a copy of the old index
(luceneMatch=4.6) or on a newly built index

- using 'facet.method=enum' makes no remarkable difference
- declaring 'docValues' (with reindexing) makes no remarkable difference
- 'softCommit' isn't used

My environment is:
  OS: Solaris 5.11 on AMD64
  JDK: 1.8.0_25 and 1.8.0_60 (same behavior)
  JavaOpts: -Xmx10g -XX:+UseG1GC -XX:+AggressiveOpts
-XX:+UseLargePages -XX:LargePageSizeInBytes=2m


Any help/advice is welcome
Uwe


Re: faceting is unusably slow since upgrade to 5.3.0

2015-09-21 Thread Uwe Reh

On 21.09.2015 at 15:16, Shalin Shekhar Mangar wrote:

Can you post your complete facet request as well as the schema
definition of the field on which you are faceting?



Query:

http://yxz/solr/hebis/select/?q=darwin&facet=true&start=1&rows=30&facet.field=material_access&facet.field=department_3&facet.field=rvk_facet&facet.field=author_facet&facet.field=material_brief&facet.field=language&facet.prefix=&facet.sort=count&echoParams=all&debug=true




Schema (with docValue):

[The field and fieldType definitions were stripped by the mail archive;
the facet fields were declared with docValues="true".]


Schema (w/o docValue):

[The same definitions without docValues.]

solrconfig:

(The XML tags were stripped by the mail archive; the handler structure and
element names are reconstructed around the surviving values.)

<requestHandler name="/select" class="solr.SearchHandler">
   <lst name="defaults">
      <int name="rows">10</int>
      <str name="df">allfields</str>
      <str name="echoParams">none</str>
   </lst>
   <arr name="components">
      <str>query</str>
      <str>facet</str>
      <str>stats</str>
      <str>debug</str>
      <str>elevator</str>
   </arr>
</requestHandler>





Re: Understanding the Debug explanations for Query Result Scoring/Ranking

2014-07-24 Thread Uwe Reh

Hi,

to get an idea of the meaning of all these numbers, have a look at
http://explain.solr.pl. I like this tool, it's great.


Uwe

On 25.07.2014 00:45, O. Olson wrote:

Hi,

If you add debug=true to the Solr request (and wt=xml if your
current output is not XML), you would get a node in the resulting XML that
is named "debug". There is a child node to this called "explain",
which has a list showing why the results are ranked in a particular order.
I'm curious if there is some documentation on understanding these
numbers/results.

I am new to Solr, so I apologize that I may be using the wrong terms to
describe my problem. I am also aware of
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
though I have not completely understood it.

My problem is trying to understand something like this:

1.5797625 = (MATCH) sum of: 0.4717142 = (MATCH) weight(text:televis in
44109) [DefaultSimilarity], result of: 0.4717142 = score(doc=44109,freq=1.0
= termFreq=1.0 ), product of: 0.71447384 = queryWeight, product of:
7.0424104 = idf(docFreq=896, maxDocs=377553) 0.10145303 = queryNorm 0.660226
= fieldWeight in 44109, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 =
termFreq=1.0 7.0424104 = idf(docFreq=896, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109) 1.1080483 = (MATCH) weight(text:tv in 44109)
[DefaultSimilarity], result of: 1.1080483 = score(doc=44109,freq=6.0 =
termFreq=6.0 ), product of: 0.6996622 = queryWeight, product of: 6.896415 =
idf(docFreq=1037, maxDocs=377553) 0.10145303 = queryNorm 1.5836904 =
fieldWeight in 44109, product of: 2.4494898 = tf(freq=6.0), with freq of:
6.0 = termFreq=6.0 6.896415 = idf(docFreq=1037, maxDocs=377553) 0.09375 =
fieldNorm(doc=44109)

Note: I have searched for "televisions". My search field is a single
catch-all field. The Edismax parser seems to break up my search term into
"televis" and "tv".

Is there some documentation on how to understand these numbers. They do not
seem to be properly delimited. At the minimum, I can understand something
like:
1.5797625 =  0.4717142 + 1.1080483
and
0.71447384  = 7.0424104 * 0.10145303

But, I cannot understand if something like "0.10145303 = queryNorm 0.660226
= fieldWeight in 44109" is used in the calculation anywhere. Also, since
there were only two terms (televis and tv), I could use subtraction to
find out that 1.1080483 was the start of a new result.

I'd also appreciate if someone can tell me which class dumps out the above
data. If I know it, I can edit that class to make the output a bit more
understandable for me.

Thank you,
O. O.











Re: Solr Fields Multilingue

2014-06-30 Thread Uwe Reh

On 30.06.2014 16:57, benjelloun wrote:

AllChamp doesn't do any analysis or filtering. Any idea?

Example:
I search for:  AllChamp:presenton   -- num results = 0
               AllChamp:présenton   -- num results = 1


Hi Anass,

No analyzer means no modification (no ICU normalization).
copyField copies just the raw input, not the processed tokens from the
source field(s). Maybe that's your misconception.


Uwe



Re: Two solr instances access common index

2014-06-26 Thread Uwe Reh

Hi,

with the lock type 'simple' I have three instances (different JREs, a GC problem)
running on the same files.
You should use this option only for a read-only system; otherwise it's easy to
corrupt the index.

Maybe you should have a look at replication or SolrCloud.
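
A minimal sketch of the setting in question (solrconfig.xml; the default lock type is "native"):

<indexConfig>
   <lockType>${solr.lock.type:simple}</lockType>
</indexConfig>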

Uwe


On 26.06.2014 11:25, Prasi S wrote:

Hi,
Is it possible to point two solr instances to point to a common index
directory. Will this work wit changing the lock type?



Thanks,
Prasi



Re: Distributed search with Terms Component and Solr Cloud.

2014-01-24 Thread Uwe Reh

Hi Ryan,

just take a look on the thread TermsComponent/SolrCloud.
Setting your parameters as default in solrconfig.xml should help.

Uwe


On 13.01.2014 20:24, Ryan Fox wrote:

Hello,

I am running Solr 4.6.0.  I am experiencing some difficulties using the
terms component across multiple shards.  I see according to the
documentation, it should work, but I am unable to do so with solr cloud.

When I have one shard, queries using the terms component respond as I would
expect.  However, when I split my index across two shards, I get empty
results for the same query.

I am querying solr with a CloudSolrServer object.  When I manually add the
query params shards and shards.qt to my SolrQuery, I get the expected
response.  It's not ideal, but if there's a way to get a list of all shards
programmatically, I could set that parameter.


From the documentation, it appears to me the terms component should be

supported by solr cloud, but I can't find anything that explicitly says one
way or the other.  If there is a better way to do it, or perhaps something
I have misconfigured, any advice would be much appreciated.  If it's just
not possible, I will manage.  I can provide more configuration or
specifically how I am running the query if that would help.

Ryan Fox





Re: Error when creating collection in Solr 4.6

2014-01-20 Thread Uwe Reh

Hi,

I had the same problem.
In my case the error was a copy/paste typo in my solr.xml:

<str name="genericCoreNodeNames">${genericCoreNodeNames:true}</str>
 !^! Ouch!

With the type 'bool' instead of 'str' it works definitely better. ;-)

Uwe



On 28.11.2013 08:53, lansing wrote:

Thank you for your replies.
I am using the new-style discovery.
It worked after adding this setting:
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>










Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-20 Thread Uwe Reh
What about having a simple charFilter in the analyzer chain for
indexing *and* searching? E.g.

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™" replacement="&#8482;"/>

or

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>


Uwe

On 19.11.2013 23:46, Developer wrote:

I have data coming in to SOLR as below.

<field name="displayName">X™ - Black</field>

I need to store the HTML entity (decimal) equivalent value (i.e. &#8482;)
in SOLR rather than storing the original value.

Is there a way to do this?





Re: [ANNOUNCE] Apache Solr Reference Guide 4.5 Available

2013-11-19 Thread Uwe Reh

On 18.11.2013 14:39, Furkan KAMACI wrote:

Atlassian Jira has two options at default: exporting to PDF and exporting
to Word.


I see, 'Word' isn't optimal for a reference guide. But OO can handle
'doc' and has epub plugins.

Would it be possible to offer the docs also as 'doc(x)'?

barefaced
Uwe



Re: [ANNOUNCE] Apache Solr Reference Guide 4.5 Available

2013-11-19 Thread Uwe Reh

Thank you for opening the issue.

I'm not sure that my case is representative. I'm spending three hours
every day on the train (commuting to work). I like to use this time to
have a closer look into manuals. Printouts and laptops are horrible in
this situation. So there is only the alternative between my 10" tablet
and my 6" e-reader. I prefer the more handy reader.

No, I can't afford a nice Nexus7. Not now ;-)

Uwe


On 19.11.2013 17:08, Cassandra Targett wrote:

I've often thought of possibly providing the reference guide in .epub
format, but wasn't sure of general interest. I also once tried to
convert the PDF version with calibre and it was a total mess. - but
PDF is probably the least-flexible starting point for conversion.

Unfortunately, the Word export is only available on a per-page basis,
which would make it really tedious to try to make a .doc version of
the entire guide (there are ~150 pages). There are, however, options
for HTML export, which I believe could be converted to .epub - but
might take some fiddling.

I created an issue for this - for now just to track that it's
something that might be of interest - but not sure if/when I'd
personally be able to work on it:
https://issues.apache.org/jira/browse/SOLR-5467.

On Tue, Nov 19, 2013 at 6:34 AM, Uwe Reh r...@hebis.uni-frankfurt.de wrote:

On 18.11.2013 14:39, Furkan KAMACI wrote:


Atlassian Jira has two options at default: exporting to PDF and exporting
to Word.



I see, 'Word' isn't optimal for a reference guide. But OO can handle 'doc'
and has epub plugins.
Would it be possible to offer the docs also as 'doc(x)'?

barefaced
Uwe





Re: [ANNOUNCE] Apache Solr Reference Guide 4.5 Available

2013-11-18 Thread Uwe Reh
I'd like to read the guide as e-paper. Is there a way to obtain the
document in the format epub or odt?

Trying to convert the PDF with Calibre wasn't very satisfying. :-(

Uwe


On 05.10.2013 14:19, Steve Rowe wrote:

The Lucene PMC is pleased to announce the release of the Apache Solr Reference 
Guide for Solr 4.5.

This 338 page PDF serves as the definitive users manual for Solr 4.5.

The Solr Reference Guide is available for download from the Apache mirror 
network:

https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/


Steve





SolrCloud: read only node

2013-11-04 Thread Uwe Reh

Hi,

as a service provider for libraries we run a small cloud (1 collection, 1
shard, 3 replicas). To improve local reliability we want to offer
the possibility to set up own local replicas.
As far as I know, this can easily be done just by adding a new node to
the cloud. But the external node shouldn't be able to make any changes to
the index.


Is there a cheap way to restrict a node of a SolrCloud to a read-only
mode?
Or is it a better idea to do legacy replication from one node (master) to
an external slave?



Uwe


Re: SolrCloud: read only node

2013-11-04 Thread Uwe Reh

F***, this is the answer I was afraid of. ;-)
I had hoped there could be something similar to
http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html.


Nevertheless, thank you.
Uwe

On 04.11.2013 14:14, Erick Erickson wrote:

In this situation, I'd consider going with the older master/slave
setup. The problem is that in SolrCloud, you have a lot of chatter
back and forth. Presumably the connection to your local instances
is rather slow, so if you're adding data to your index, each and
every add has to be communicated individually to the remote node.

But no, there's no good way in SolrCloud to make a node read only.
Actually, that doesn't really make sense in the solr cloud world since
each node maintains its own index, does its own indexing, etc. So
each node _must_ be able to change the Solr index it uses.

FWIW,
Erick





SOLR-3076 for beginners?

2013-03-08 Thread Uwe Reh

Hi,

blockjoin seems to be a really cool feature. Unfortunately I'm too dumb to
get the patch running; I don't even know where to start. :-(

Is there anywhere an example, a howto or a cookbook, other than using
elasticsearch or bare lucene?


Uwe


Re: Nested function query must use ....

2013-02-02 Thread Uwe Reh

Hi Jack

thanks a lot for the hint.

On 02.02.2013 00:46, Jack Krupansky wrote:

I've updated the example on the Function Query wiki that you may have copied:
http://wiki.apache.org/solr/FunctionQuery#exists

Thanks again, because the wiki page was really my starting point.
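
For the record, the accepted forms look like this ('qq' is an arbitrary parameter name):

...q=*:*&fl=foo:exists(query({!v='id:3'}))
...q=*:*&fl=foo:exists(query($qq))&qq=id:3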

Uwe




Nested function query must use ....

2013-02-01 Thread Uwe Reh

Hi,

should be easy, but I'm too blind to find the correct syntax (Solr 4.1).

Problem:
I have some documents in the index which, because of their structure,
tend to get too-high scores. These documents are easy to identify, and I
want to boost the others to get a fair ranking.


Could anyone give me the correct syntax to accomplish this simplified
query?  ...q=*:*&fl=foo:exists(query(id:3))


Uwe

Example:

<response>
<lst name="responseHeader">
  <int name="status">400</int>
  <int name="QTime">2</int>
  <lst name="params">
    <str name="fl">foo:exists(query(id:3))</str>
    <str name="q">*:*</str>
  </lst>
</lst>
<lst name="error">
  <str name="msg">Error parsing fieldname: Nested function query must use $param or {!v=value} forms. got 'exists(query(id:3))'</str>
  <int name="code">400</int>
</lst>
</response>


But ...q=*:*&fl=foo:exists(id) works:

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="fl">foo:exists(id)</str>
    <str name="q">*:*</str>
    <str name="rows">1</str>
  </lst>
</lst>
<result name="response" numFound="1735" start="0">
  <doc>
    <bool name="foo">true</bool>
  </doc>
</result>
</response>




Re: Tokenized keywords

2013-01-21 Thread Uwe Reh

Hi

probably my note is nonsense. But sometimes one is blind and not able to
see simple things anymore.

Is this query what you are looking for?
 q=modified:(search+for+Laptops)&fl=original,modified

Sorry if my suggestion is too trivial.

Uwe


On 21.01.2013 09:17, Romita Saha wrote:

Hi,

I have a field defined in schema.xml named 'original'. I first copy
this field to 'modified' and apply filters on the field 'modified'.

<field name="original" type="string" indexed="true" stored="true"/>
<field name="modified" type="text_general" indexed="true" stored="true"/>

<copyField source="original" dest="modified"/>

I want to display in my response as follows:

original: Search for all the Laptops
modified: search laptop

Thanks and regards,
Romita Saha




Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?

2013-01-17 Thread Uwe Reh

Hi Mark,

one entry in my long list of self-made problems is:
"Done the commit before the ConcurrentUpdateSolrServer was finished."

Since the ConcurrentUpdateSolrServer is asynchronous, it's very easy to
create race conditions. Make sure that your program waits before it
does the commit:

if (solrserver instanceof ConcurrentUpdateSolrServer) {
   ((ConcurrentUpdateSolrServer) solrserver).blockUntilFinished();
}
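
A slightly fuller, hedged sketch of the safe ordering (SolrJ 4.x API; URL, queue size and thread count are illustrative):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CussDemo {
   public static void main(String[] args) throws Exception {
      // queue size 100, 4 background threads
      ConcurrentUpdateSolrServer cuss =
            new ConcurrentUpdateSolrServer("http://localhost:8983/solr/core1", 100, 4);
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "42");
      cuss.add(doc);              // asynchronous: queued and sent by a worker thread
      cuss.blockUntilFinished();  // wait until all queued adds have reached Solr
      cuss.commit();              // only now is the commit free of the race
      cuss.shutdown();
   }
}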


Uwe



Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?

2013-01-17 Thread Uwe Reh

Hi Shawn,

don't panic
Due to 'historical' reasons, like comparing the different subclasses of
SolrServer, I have an HttpSolrServer for queries and commits. I've never
tried to use the CUSS for anything else than adding documents.

As I wrote, it was a home-made problem and not a bug. Sometimes I hope
not to be the only dumbass; others may be caught in the same trap.


Uwe


On 17.01.2013 15:52, Shawn Heisey wrote:

If you are using the same ConcurrentUpdateSolrServer object for all
update interaction with Solr (including commits) and you still have to
do the blockUntilFinished() in your own code before you issue an
explicit commit, that sounds like a bug, and you should put all the
details in a Jira issue.




Re: Results in same or different fields

2013-01-15 Thread Uwe Reh

Hi,

maybe it helps to have a closer look at the other params of edismax:

http://wiki.apache.org/solr/ExtendedDisMax#pf_.28Phrase_Fields.29

'mm=2' would be too strong, but the usage of pf, pf2 and pf3 is likely your
solution.
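
For example (field names taken from the question below):

...&defType=edismax&qf=title description&pf=title description&pf2=title description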


uwe


On 15.01.2013 10:15, Gastone Penzo wrote:

Hi,
I'm using solr 4.0 with the edismax search handler.
I'm searching inside 3 fields with the same boost.
I'd like to have a higher score for results in the same field
than for results spread across different fields.

e.g.
qf=title,description


If "white house" is found in title, it must have a higher score
than "white" in the title field and "house" in the description field.

How is it possible?

ps. i set

omitTermFreqAndPositions=true


for all fields

thanx


Gastone Penzo





Re: theory of sets

2013-01-14 Thread Uwe Reh

On 08.01.2013 10:26, Uwe Reh wrote:

OK, OK,
I will try it again with dynamic fields.


NO!
Dynamic fields are nice, but not for my problem. :-(

I got more than 52 new fields.
I was wrong, the impact on searching is really reasonable. But have you
ever used the Admin's Schema Browser with that many fields? I suppose
never; my installation (4.1) freezes Firefox while the JS job runs
into a timeout.

Most of all I don't like it, because having that many fields 'smells' to me.

Before anyone asks the XY question:
The index is intended for a library's catalog, and the quest is "Find all
members of a series (e.g. Penguin Books paperbacks) and order them by
their sortkey". Unfortunately, titles may belong to several (sub)series
with different sortkeys.


Still seeking better approaches.
Uwe



Re: POST query with non-ASCII to solr using httpclient wont work

2013-01-14 Thread Uwe Reh

Hi Jie,

maybe there is a simple solution. When we used tomcat as servlet 
container for solr I notices similar problems. Even with the hints from 
the solr wiki about unicode and Tomcat, i wasn't able to fix this.
So we switched back to Jetty, querys like q=allfields2%3A能力 are 
reliable now.


Uwe

BTW: I have no idea at all what these Japanese signs mean. So just
let me append two of the 31 hits in our bibliographic catalog.



doc
  str name=idHEB052032124/str
  str name=raw_fullrecordalg: 5203212
001@ $0205
001A $4:13-05-97
001B $t13:12:07.000$01999:10-06-10
001D $0:99-99-99
001U $0utf8
001X $00
002@ $0Aau
003@ $0052032124
007I $0NacsisBN09679884
010@ $ajpn
011@ $a1993
013H $0z
019@ $ajp
021A $ULatn$T01$aNōryoku kaihatsu no shisutemu$hYaguchi Hajime
021A $UJpan$T01$a@能力開発のシステム$h矢口新著
028A $ULatn$T01$9165745363$8Yaguchi, Hajime
028A $UJpan$T01$d新$a矢口
033A $ULatn$T01$pTokyo$nNōryoku Kaihatsu Kōgaku Sentaa
033A $UJpan$T01$p東久留米$n能力開発工学センター
034D $a274 S.
034M $aIll.
036E $aYaguchi Hajime senshū$l2
036F $l2$9052031527$8Yaguchi Hajime senshū$x12
037B $aSysteme zur Entwicklung der Fähigkeiten
046L $aIn japan. Schr.
...
247C/01 $9102595631$8351457-2 4/457Marburg, Universität Marburg, Bibliothek 
des Japan-Zentrums (BJZ)
  /str
/doc
doc
  str name=idHEB286840723/str
  str name=raw_fullrecordalg: 28684072
001@ $03
001A $00030:04-01-12
001B $t22:29:11.000$01999:04-01-12
001C $t10:48:47.000$00030:04-01-12
001D $00030:04-01-12
001U $0utf8
001X $00
002@ $0Aau
003@ $0286840723
004A $A978-4-88319-546-6
007A $0286840723$aHEB
010@ $ajpn
011@ $a2010
021A $ULatn$T01$aShin kanzen masutā kanji nihongo nōryoku shiken ; N1$hIshii 
Reiko ...
021A $UJpan$T01$a新完全マスター漢字日本語能力試験 ; N1$h石井怜子 [ほか] 著
027A $ULatn$T01$aShin kanzen masutā kanji : nihongo nōryoku shiken ; enu ichi / 
Ishii Reiko ...
027A $UJpan$T01$a新完全マスター漢字 : 日本語能力試験 ; N1 / 石井怜子 [ほか] 著
028C $9230917593$8Ishii, Reiko
033A $ULatn$T01$pTōkyō$nSurīē nettowāku
033A $UJpan$T01$p東京$nスリーエーネットワーク
034D $aviii, 197, 21S.
034I $a26cm
044A $S4$aNihongokyōiku(Taigaikokujin)
045Z $aEI 4650
...
247C/01 $9102599157$8601220-6 30/220Frankfurt, Universität Frankfurt, 
Institut für Orientalische und Ostasiatische Philologien, Japanologie
  /str
/doc





Re: retrieving latest document **only**

2013-01-11 Thread Uwe Reh

On 10.01.2013 11:54, jmozah wrote:

I need a query that matches only the most recent ones...
Because my stats depend on it..

But I have a requirement to show **only** the latest documents and the
stats along with it..


What do you want: 'the most recent ones' or '**only** the latest'?

Perhaps a range query q=timestamp:[refdate TO NOW] will match your needs.

Uwe



Re: Hotel Searches

2013-01-09 Thread Uwe Reh

Hi,

maybe I'm thinking too simple again. Nevertheless, here is an idea to solve
the question. The basic thought is to get rid of the range query.

Have:
- a textfield 'vacant_days'. Instead of ISO dates just simple dates in
the form MMdd
- a dynamic field 'price_*'. You can add the tariff for Jan. 31st into
'price_0131'


To get the total, e.g. Feb. 1st to Feb. 3rd, you could query for the days
0201, 0202 and 0203, and calculate the sum of the corresponding
price fields:

q=vacant_days:0201 AND vacant_days:0202 AND vacant_days:0203&fl=total:sum(price_0201,price_0202,price_0203)

(not tested)

Uwe


On 09.01.2013 07:08, Harshvardhan Ojha wrote:

Hi Alex,

Thanks for your reply.
I saw prices based on daterange using multipoints . But this is not my 
problem. Instead the problem statement for me is pretty simple.

Say I have 100 documents, each having tariff as a field.
Doc1
<doc>
<double name="tariff">2400.0</double>
</doc>

Doc2
<doc>
<double name="tariff">2500.0</double>
</doc>

Now a user's search should give me a total tariff.

Desired result
<doc>
<double name="tariff">4900.0</double>
</doc>

And this could be any combination; for 100 docs it is (100*101)/2, i.e. N*(N+1)/2.

How can I get these combination of documents already indexed ?
Or is there any way to do calculations at runtime?

How can I place the constraint that if any one doc is missing in a range,
I don't get any result? (If a user asked for the hotel tariff from the 11th
to the 13th, and I don't have a tariff for the 12th, I shouldn't add up the
11th and 13th only.)

Hope I made my problem very simple.

Regards
Harshvardhan Ojha

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Tuesday, January 08, 2013 6:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Hotel Searches

Did you look at a conversation thread from 12 Dec 2012 on this list? Just go to 
the archives and search for 'hotel'. Hopefully that will give you something to 
work with.

If you have any questions after that, come back with more specifics.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. 
Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Tue, Jan 8, 2013 at 7:18 AM, Harshvardhan Ojha  
harshvardhan.o...@makemytrip.com wrote:


Sorry for that, we just spoiled that thread so posted my question in a
fresh thread.

Problem is indeed very simple.
I have solr documents, which has all the required fields(from db).
Say DOC1,DOC2,DOC3.DOCn.

Every document has 1 night tariff and I have 180 nights tariff.
So a person can search for any combination in these 180 nights.

Say a request came to me to give total tariff for 10th to 15th of jan 2013.
Now I need to get a sum of tariff field of 6 docs.

So how can I keep this data indexed, to avoid search time calculation,
and there are other dimensions of this data also beside tariff.
Hope this makes sense.

Regards
Harshvardhan Ojha

-Original Message-
From: Gora Mohanty [mailto:g...@mimirtech.com]
Sent: Tuesday, January 08, 2013 5:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Hotel Searches

On 8 January 2013 17:10, Harshvardhan Ojha 
harshvardhan.o...@makemytrip.com wrote:

Hi All,

Looking into a finding solution for Hotel searches based on the
below criteria's

[...]

Didn't you just post this on a separate thread, complete with some
nonsensical follow-up from a colleague of yours? Please do not repost
the same message over and over again.

It is not clear what you are trying to achieve.
What is the difference between a city and a hotel in your data? How is
a person represented in your documents? Is it by the ID field?

Are you looking to cache all possible combinations of ID, city, and
startdate? If so, to what end?  This smells like a XY problem:
http://people.apache.org/~hossman/#xyproblem

Regards,
Gora





Re: theory of sets

2013-01-08 Thread Uwe Reh

OK, OK,

I will try it again with dynamic fields. Maybe the problem was
something else; all statements sound reasonable.
Even Lisheng's thoughts about the impact of too many fields on memory
consumption should not be a problem for a JVM with 32G RAM and almost
no GC.


Please give me some time.
Thanks
Uwe


On 08.01.2013 00:27, Zhang, Lisheng wrote:

Hi,

Just thought of this possibility: I think dynamic field is a solr concept; on lucene
level all fields are the same, but at initial startup lucene should load all
field information into memory (not field data, but schema).

If we have too many fields (like *_my_fields, * = a1, a2, ...), does this take
too much memory and slow down performance (even if very few fields are really
used)?

Best regards, Lisheng

-Original Message-
From: Upayavira [mailto:u...@odoko.co.uk]
Sent: Monday, January 07, 2013 2:57 PM
To: solr-user@lucene.apache.org
Subject: Re: theory of sets


Dynamic fields resulted in poor response times? How many fields did each
document have? I can't see how a dynamic field should have any
difference from any other field in terms of response time.

Or are you querying across a large number of dynamic fields
concurrently? I can imagine that slowing things down.

Upayavira



Am 07.01.2013 17:40, schrieb Petersen, Robert:

Hi Uwe,

We have hundreds of dynamic fields but since most of our docs only use some of 
them it doesn't seem to be a performance drag.  They can be viewed as a sparse 
matrix of fields in your indexed docs.  Then if you make the 
sortinfo_for_groupx an int then that could be used in a function query to 
perform your sorting.  See  http://wiki.apache.org/solr/FunctionQuery






Re: fieldtype for name

2013-01-08 Thread Uwe Reh

Hi Michael,

in our index of bibliographic metadata, we see the need for at least 
three fields:
- name_facet: String as type, because the facet should represent 
the original inverted format from our data.
- name: TextField for searching. This field is heavily analyzed to match 
different orders, to match synonyms, phonetic similarity, German umlauts 
and other European stuff.
- name_lc: TextField. This field is just mapped to lower case. It's used 
to boost docs with the same style of writing as the user's input.


Uwe
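
A minimal schema.xml sketch of such a three-field setup (type and field
names are illustrative, not our production schema):

   <field name="name_facet" type="string" indexed="true" stored="true"/>
   <field name="name" type="text_name" indexed="true" stored="true"/>
   <field name="name_lc" type="text_lc" indexed="true" stored="false"/>
   <copyField source="name_facet" dest="name"/>
   <copyField source="name_facet" dest="name_lc"/>

   <fieldType name="text_lc" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
   </fieldType>

(text_name would carry the heavy analysis chain described above.)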

Am 08.01.2013 15:30, schrieb Michael Jones:

Hi,

What would be the best fieldtype for a person's name? At the moment I'm
using text_general but, if I search for bob smith, some results I get back
might be rob thomas, in that it matched 'ob'.

But I only really want results that are either

'bob smith'
'bob, smith'
'smith, bob'
'smith bob'

Thanks





Re: theory of sets (first solution)

2013-01-07 Thread Uwe Reh

Hi,

I found my own hack. It's based on a free interpretation of the function 
strdist().


Have:
- one multivalued field 'part_of'
- one unique field 'groupsort'

Index each item:
   For each group membership:
      add groupid to 'part_of'
      concat groupid and sortstring to a new string
      add this string to a csv list
   End
   add the csv list to 'groupsort'
End

Have also an own class that implements 
org.apache.lucene.search.spell.StringDistance, to generate a custom 
distance value. This class should:

- split the csv list
- find the element/string that starts with the given group id
- translate the rest (sortstring) to a float value

.../select?q=part_of:Xsort=strdist(X, groupsort, FQN) asc
FQN is the fully qualified name of the own class. (Remember to place the
jar in a 'lib' defined in solrconfig.xml or add an own 'lib' entry.)
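
A minimal sketch of such a StringDistance class (untested; it assumes the
csv entries have the form groupid + '_' + sortstring, and the class name
is illustrative):

import org.apache.lucene.search.spell.StringDistance;

/**
 * For strdist(groupId, groupsort, FQN): s1 is the group id, s2 the csv
 * list of composed entries. Returns a float derived from the sortstring
 * of the matching group, so sorting asc on strdist orders the group.
 */
public class GroupSortDistance implements StringDistance {

   @Override
   public float getDistance(String groupId, String csvList) {
      for (String entry : csvList.split(",")) {
         if (entry.startsWith(groupId + "_")) {
            return toFloat(entry.substring(groupId.length() + 1));
         }
      }
      return Float.MAX_VALUE; // not a member of this group: sort last
   }

   /** Map the first few chars of the sortstring to a float (a-z, base 27). */
   private float toFloat(String sortstring) {
      float value = 0f;
      float scale = 1f;
      for (int i = 0; i < Math.min(6, sortstring.length()); i++) {
         scale /= 27f;
         int c = Character.toLowerCase(sortstring.charAt(i)) - 'a' + 1;
         value += Math.max(0, Math.min(26, c)) * scale;
      }
      return value;
   }
}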


Uwe
(still looking for a smarter solution)



Re: custom solr sort

2013-01-07 Thread Uwe Reh

Am 06.01.2013 02:32, schrieb andy:

I want to customize solr sort and pass a solr param from the client to the solr server,


Hi Andy,

not an answer to your question, but maybe another approach to solve your 
initial problem. Instead of writing a new SearchComponent I decided to 
(mis)use the function http://wiki.apache.org/solr/FunctionQuery#strdist

'strdist' seems to have everything you need:
- a parameter 's1'
- a fieldname 's2'
- a slot to plug in your own algorithm

How to use this to sort on multivalued attributes, I've described in 
this list in the thread "theory of sets".


Uwe


Re: Sorting on multivalued fields still impossible?

2013-01-07 Thread Uwe Reh

Hi Jack,

thank you for the hint.
Since I already have a solrj client to do the preprocessing, mapping to 
sort fields isn't my problem. I will try to explain better in my reply 
to Erick.


Uwe
(Sorry late reaction)


Am 30.08.2012 16:04, schrieb Jack Krupansky:

You can also use a Field Mutating Update Processor to do a smart
copy of a multi-valued field to a sortable single-valued field.

See:
http://wiki.apache.org/solr/UpdateRequestProcessor#Field_Mutating_Update_Processors


Such as using the maximum value via MaxFieldValueUpdateProcessorFactory.

See:
http://lucene.apache.org/solr/api-4_0_0-BETA/org/apache/solr/update/processor/MaxFieldValueUpdateProcessorFactory.html


Which value of a multi-valued field do you wish to sort by?

-- Jack Krupansky
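
For reference, a minimal solrconfig.xml sketch of such a chain (chain and
field names are illustrative, not from Jack's mail):

   <updateRequestProcessorChain name="max-sort-copy">
      <processor class="solr.CloneFieldUpdateProcessorFactory">
         <str name="source">datefield</str>
         <str name="dest">datefield_max</str>
      </processor>
      <processor class="solr.MaxFieldValueUpdateProcessorFactory">
         <str name="fieldName">datefield_max</str>
      </processor>
      <processor class="solr.RunUpdateProcessorFactory"/>
   </updateRequestProcessorChain>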




Re: Sorting on multivalued fields still impossible?

2013-01-07 Thread Uwe Reh

Am 31.08.2012 13:35, schrieb Erick Erickson:

... what would the correct behavior
be for sorting on a multivalued field


Hi Erick,

in general you are right; for multivalued fields the question is which 
value is the reference. But there are thousands of cases where this 
question is implicitly answered. See my example ...sort=max(datefield) 
desc: it is obvious that the newest date should win. I see no 
reason why simple filters like max can't handle multivalued fields.

Now four months later I still wonder why there is no pluggable 
function to map multivalued fields into a single value.

eg. ...sort=sqrt(mapMultipleToOne(FQN, fieldname)) asc...

Uwe
(Sorry late reaction)




Re: Sorting on multivalued fields still impossible?

2013-01-07 Thread Uwe Reh

Hi,

Like I just wrote in my reply to the similar suggestion from Jack:
I'm not looking for a way to preprocess my data.

My question is why I need two redundant fields to sort a multivalued 
field ('date_max' and 'date_min' for 'date').

For me it's just a waste of space, poisoning the fieldcache.

There is also another class of problems where a filter function like 
'mapMultipleToOne' may be helpful. In the thread 'theory of sets' (this 
list) I described a hack with the function strdist, an own class and the 
mapping of multiple values as a csv list in a single-value field.


Uwe




Am 07.01.2013 14:54, schrieb Alexandre Rafalovitch:

If the Multiple-to-one mapping would be stable (e.g. independent of a
query), why not implement it as a custom update.chain processor with a copy
to a separate field? There is already a couple of implementations
under FieldValueMutatingUpdateProcessor (first, last, max, min).

Regards,
Alex.





Re: theory of sets

2013-01-07 Thread Uwe Reh

Hi Robi,

thank you for the contribution. It's exciting to read that your index 
isn't contaminated by the number of fields. I can't exclude other 
mistakes, but my first experience with extensive use of dynamic fields 
was very poor response times.

Even though I found another solution, I should give the straightforward 
solution a second chance.


Uwe

Am 07.01.2013 17:40, schrieb Petersen, Robert:

Hi Uwe,

We have hundreds of dynamic fields but since most of our docs only use some of 
them it doesn't seem to be a performance drag.  They can be viewed as a sparse 
matrix of fields in your indexed docs.  Then if you make the 
sortinfo_for_groupx an int then that could be used in a function query to 
perform your sorting.  See  http://wiki.apache.org/solr/FunctionQuery




Re: indexing cpu utilization

2013-01-04 Thread Uwe Reh

Hi Mark,

SOLR-3929 rocks!
A nightly build of 4.1 with maxIndexingThreads configured to 24 takes 
80% to 100% of the CPU resources :-)
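
For reference, the setting goes into the indexConfig section of
solrconfig.xml (a minimal sketch):

   <indexConfig>
      <maxIndexingThreads>24</maxIndexingThreads>
   </indexConfig>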


Thank you, Otis and Gora


mpstat 10

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  00   0   13   607  241  234   78  10021   258   87   2   0  11
  10   0   24   240   23  293   94  10931   286   86   1   0  13
  20   0   12   367  181  268   83  10241   338   89   1   0  10
  30   0   18   188   20  226   67   8651   243   87   1   0  13
  40   05   205   22  255   74  10041   310   87   1   0  12
  50   05   192   22  228   68   8840   260   89   1   0  10
  60   0   15   223   23  278   86  10451   319   87   1   0  12
  70   0   18   215   23  267   75  10451   321   85   1   0  14
  80   04   253   21  272   64  11240   284   77   1   0  22
  90   04   243   20  281   61  10830   300   79   1   0  20
 100   02   234   22  272   56  11140   376   78   1   0  21
 110   02   205   18  237   57   9640   297   82   1   0  17
 120   03   251   24  273   59  11340   323   72   1   0  27
 130   04   203   19  236   54   9120   294   82   1   0  17
 140   04   245   21  288   54  11130   309   77   1   0  22
 150   04   233   21  258   58  10630   280   80   1   0  19
 160   05   286   19  346   60  13340   425   73   1   0  26
 170   06   340   23  414   67  15140   500   67   1   0  31
 180   07   343   23  435   67  15050   482   66   2   0  32
 190   08   294   19  348   53  12850   444   70   1   0  29
 200   06   309   21  385   64  13940   514   68   1   0  31
 210   07   279   20  378   58  13330   471   69   1   0  30
 220   06   249   18  329   50  12040   469   72   1   0  27
 230   06   258   20  338   54  12730   388   70   1   0  28
 240   06   400   20  608  146  18740  1071   75   3   0  22
 250   04   375   20  550  134  17350   891   73   2   0  25
 260   08   329   19  490  103  15250   856   75   2   0  23
 270   07   341   22  489  107  16140   793   72   2   0  26
 280   05   321   18  478   98  16230   793   75   2   0  23
 290   04   283   18  399   84  13640   744   76   2   0  22
 300   05   252   16  378   86  12730   620   79   2   0  20
 310   05   277   16  447   96  14440   715   76   2   0  22




Re: indexing cpu utilization

2013-01-03 Thread Uwe Reh

Hi,
thank you for the hints.


On 3 January 2013 05:55, Mark Miller markrmil...@gmail.com wrote:

32 cores eh? You probably have to raise some limits to take advantage of
that.
32 cores isn't that much anymore. You can buy AMD servers from 
Supermicro with two sockets and 32G of RAM for less than $2,500. Systems 
with four sockets (64 cores) aren't unaffordable either.
Having some more money, one can think about the four-socket Oracle T4-4 
system (4 * 8 cores * 8 vcores = 256).



You might always want to experiment with using more merge threads? I
think the default may be 3.
I will try this. But I think Otis is right. It's rather SOLR-3929 than 
SOLR-4078.

Mark wanted to point this other issue:
https://issues.apache.org/jira/browse/SOLR-3929 though, so try that...


Am 03.01.2013 05:20, schrieb Otis Gospodnetic:

I, too, was going to point out to the number of threads, but was going to
suggest using fewer of them because the server has 32 cores and there was a
mention of 100 threads being used from the client.  Thus, my guess was that
the machine is busy juggling threads and context switching (how's vmstat 2
output, Uwe?) instead of doing the real work.

'Use more threads' vs. 'use fewer threads': it is a bit confusing. I made 
some tests with 50 to 200 threads; within this range I noticed no real 
difference. 50 threads on the client seem to trigger enough threads on the 
server to saturate the bottleneck. 200 client threads seem not to be 
destructive.


vmstat 5 on the Server with 100 threads on the client

 kthr  memorypagedisk  faults  cpu
 r b w   swap  free  re  mf pi po fr de sr cd cd s0 s4   in   sy   cs us sy id
 0 0 0 13605380 17791928 0 7 0  0  0  0  0  0  0  0  0 3791 1638 1666 26  0 73
 1 0 0 13641072 17826368 0 8 0  0  0  0  0  0  0  0  0 3540 1305 1527 25  0 74
 0 0 0 13691908 17876364 0 8 0  0  0  0  0  0  0  0 48 3935 1453 1919 26  0 73
 0 0 0 13720208 17904652 0 4 0  0  0  0  0  0  0  0  0 3964 1342 1645 25  0 74
 0 0 0 13792440 17976868 0 9 0  0  0  0  0  0  0  0  0 3891 1551 1757 26  0 74
 1 0 0 13867128 18051532 0 4 0  0  0  0  0  0  0  0  0 3871 1430 1584 26  0 74
 1 0 0 13948796 18133184 0 6 0  0  0  0  0  0  0  0  0 3079 1218 1435 25  0 74


To see what's going on I prefer 'mpstat 10' (100 client threads)

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  00   02  1389 1149   556   1580   173   62   4   0  34
  10   05   2302  3415   34   15024   16   1   0  83
  20   00   653  612   4865   180   222   68   5   0  27
  30   01311   38177021   13   1   0  87
  40   08392   45266038   17   1   0  82
  50   06763   844   18   14051   35   2   0  64
  60   04300   40467059   32   1   0  67
  70   01360   5336   12065   34   1   0  65
  80   04   1074   913   314025   19   0   0  80
  90   04704   663   17   10038   27   1   0  72
 100   01582   564   138047   34   1   0  65
 110   01363   4625   13020   14   0   0  86
 120   00322   37358030   20   0   0  80
 130   01402   48379037   25   1   0  74
 140   02773   854   18   16042   35   1   0  64
 150   02294   27263023   15   0   0  85
 160   03   1102  1002   338024   14   1   0  85
 170   03662   693   179040   27   1   0  72
 180   01421   5449   11058   32   1   0  67
 190   01541   601   13   110167   0   0  93
 200   00260   3913   110229   0   0  91
 210   00330   4636   11050   30   1   0  69
 220   03381   4448   12050   33   1   0  66
 230   03601   612   158029   18   0   0  82
 240   04   1022   923   31   10095   31   1   0  68
 250   02751   764   21   10047   36   1   0  63
 260   01681   815   19   18069   47   1   0  52
 270   01401   523   10   14025   22   1   0  77
 280   00350   38396034   24   0   0  76
 290   01310   4647   13044   31   1   0  68
 300   00320   4848   13047   37   1   0  62
 310   00260   3637   10050   32   1   0  67
No minor faults, no major faults, low crosscalls, reasonable interrupts, 
only some migrations... This seems quite good to me. Do you see a pitfall?



theory of sets

2013-01-03 Thread Uwe Reh

Hi,

I'm looking for a tricky solution to a common problem. I have to handle 
a lot of items and each could be a member of several groups.

- OK, just add a field called 'member_of'

No, that's not enough, because each group is sorted and each member has a 
sortstring for this group.
- OK, still easy: add a dynamic field 'sortinfo_for_*' and fill it for 
each group membership.


Yes, this works, but there are thousands of different groups; that many 
dynamic fields are probably a serious performance issue.

- Well ...

I'm looking for a smart way to answer the question "Find the members 
of group X and sort them by the sortstring for this group".


One idea I had was to fill the 'member_of' field with composed entries 
(groupname + '_' + sortstring). Finding the members is easy with 
wildcards, but there seems to be no way to use the sortstring as a 
boost factor.
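
For illustration, such composed entries might look like this (group names
and sortstrings are made up):

   id:        item4711
   member_of: groupX_0042
   member_of: groupY_0007

Members of group X could then be found with q=member_of:groupX_*.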


Has anybody solved this problem?
Any hints are welcome.

Uwe


Re: indexing cpu utilization

2013-01-02 Thread Uwe Reh

Hi,

while trying to optimize our indexing workflow I reached the same 
endpoint as gabriel shen described in his mail: my Solr server won't 
utilize more than 40% of the computing power.
I made some tests, but I'm not able to find the bottleneck. Could 
anybody help to solve this quest?


At first let me describe the environment:

Server:
- Two-socket Opteron (Interlagos) = 32 cores
- 64GB RAM (1600MHz)
- SATA disks: spindle and SSD
- Solaris 5.11
- JRE 1.7.0
- Solr 4.0
- Application server: Jetty
- 1Gb network interface

Client:
- same hardware as the server
- either a multi-threaded solrj client using multiple instances of 
HttpSolrServer
- or a multi-threaded solrj client using a ConcurrentUpdateSolrServer 
with 100 threads (a sketch follows below)
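
A minimal sketch of the ConcurrentUpdateSolrServer variant (URL, queue
size and document fields are illustrative):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexClient {
   public static void main(String[] args) throws Exception {
      // queue up to 10000 docs, drain them with 100 threads
      ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
            "http://solrhost:8983/solr", 10000, 100);
      for (int i = 0; i < 1000; i++) {
         SolrInputDocument doc = new SolrInputDocument();
         doc.addField("id", "doc" + i);
         doc.addField("title", "example title " + i);
         server.add(doc);
      }
      server.blockUntilFinished(); // wait until the queue is drained
      server.commit();
      server.shutdown();
   }
}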


Problem:
- 10,000,000 docs of bibliographic data (~4k each)
- with a simplified schema definition it takes 10 hours to index = 
~250 docs/second

- with the real schema.xml it takes 50 hours to index = ~50 docs/second
In both cases the client takes just 2% of the CPU resources and the 
server 35%. It's obvious that there is some optimization potential in 
the schema definition, but why does the server never use more than 40% 
of the CPU power?



Discarded possible bottlenecks:
- Ram for the JVM
Solr takes only up to 12G of heap and there is just negligible GC 
activity, so the increase from 16G to 32G of possible heap made no 
difference.

- Bandwidth of the net
The transmitted data is identical in both cases. The size of the 
transmitted data is somewhat below 50G. Since both machines have a 
dedicated 1G line to the switch, the raw transmission should not take 
much more than 10 minutes

- Performance of the client
Like above, the client is fast enough for the simplified case (10h). A 
dry run (just preprocessing, not indexing) finishes after about 75 minutes.

- Server's disk IO
The size of the simpler index is ~100G, the size of the other is ~150G. 
This makes a factor of 1.5, not 5. The difference between an SSD and a 
spinning disk is not noticeable. The output of 'iostat' and 'zpool 
iostat' is unsuspicious.

- Bad thread distribution
'mpstat' shows a well-distributed load over all CPUs and a sensible 
amount of crosscalls (fewer than ten per CPU).

- Solr update parameters (solrconfig.xml)
Inspired by 
http://www.hathitrust.org/blogs/large-scale-search/forty-days-and-forty-nights-re-indexing-7-million-books-part-1 I'm using:

<ramBufferSizeMB>256</ramBufferSizeMB>
<mergeFactor>40</mergeFactor>
<termIndexInterval>1024</termIndexInterval>
<lockType>native</lockType>
<unlockOnStartup>true</unlockOnStartup>

Any changes to these parameters made it worse.

To get an idea what's going on, I've done some statistics with VisualVM 
(see attachment).
The distribution of real and CPU time looks significant, but I'm not 
smart enough to interpret the results.
The method 
org.apache.lucene.index.ThreadAffinityDocumentsWriterThreadPool.getAndLock() 
is active 80% of the time but takes only 1% of the CPU time. On the 
other hand the method 
org.apache.commons.codec.language.bm.PhoneticEngine$PhonemeBuilder.append() 
is active 12% of the time and is always running on a CPU.


So again the question: when there are free resources in all dimensions, 
why does Solr not utilize more than 40% of the computing power?

Bandwidth of the RAM?? I can't believe this. How to verify?
???

Any hints are welcome.
Uwe








Re: indexing cpu utilization (attachement)

2013-01-02 Thread Uwe Reh

Am 02.01.2013 22:39, schrieb Uwe Reh:

To get an idea what's going on, I've done some statistics with VisualVM
(see attachment).


Merde, the list server strips attachments.
You'll find the screenshot at 
http://fantasio.rz.uni-frankfurt.de/solrtest/HotSpot.gif


uwe



Re: Where is ISOLatin1AccentFilterFactory (Solr4)?

2013-01-02 Thread Uwe Reh

Hi,

I like the best of both worlds:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>
   Mask some specials like C++ to cplusplus or C# to csharp ...

<tokenizer class="solr.ICUTokenizerFactory"/>
   Tokenize and identify on Unicode whitespace and charsets

<filter class="solr.WordDelimiterFilterFactory"/>
   Well-known splitter for composed words

<filter class="solr.ICUFoldingFilterFactory"/>
   Perfect superset of <charFilter ... ISOLatin1Accent.txt/> or the 
   ISOLatin1AccentFilterFactory, because it can handle composed and 
   decomposed accents and umlauts

<filter class="solr.CJKBigramFilterFactory"/>
   Nice workaround for the missing whitespace as word separator in these 
   languages.



Am 01.01.2013 17:48, schrieb Jack Krupansky:

Hmmm... quite some time ago I switched from ASCIIFoldingFilterFactory
to MappingCharFilterFactory, because I was told (by who I can't recall)
that the latter was better/preferred. Is there any particular reason
to favor one over the other?

-Original Message- From: Erick Erickson
ASCIIFoldingFilterFactory is preferred, does that suit your needs?




Sorting on multivalued fields still impossible?

2012-08-29 Thread Uwe Reh

Hi,
just to be sure.

There is still no way to sort by multivalued fields?
...sort=max(datefield) desc

There is no smarter option than creating additional single-valued 
fields just for sorting?

e.g. datefield_max and datefield_min

Uwe


Re: Paoding analyzer with solr for chinese

2012-08-09 Thread Uwe Reh

Hi Rajani,

I'm not really familiar with this paoding tokenizer, but it seems a bit 
old. We are using the CJKBigramFilter (as in the example of Solr 4.0 
alpha), which should be equivalent or even better, and it works.


<analyzer>
   <tokenizer class="solr.ICUTokenizerFactory"/>
   <filter class="solr.WordDelimiterFilterFactory"/>
   <filter class="solr.ICUFoldingFilterFactory"/>
   <filter class="solr.CJKBigramFilterFactory"/>
</analyzer>

Uwe



Am 09.08.2012 06:47, schrieb Rajani Maski:

Hi All,

   Any reply on this?



On Wed, Aug 8, 2012 at 3:23 PM, Rajani Maski rajinima...@gmail.com wrote:

Hi All,

As the blog post
http://java.dzone.com/articles/indexing-chinese-solr says that the
paoding analyzer is much better for Chinese text, I was trying to
implement it to get accurate results for Chinese text.

I followed the instructions specified in the sites below:
Site1

http://androidyou.blogspot.hk/2010/05/chinese-tokenizerlibrary-paoding-with.html
 Site2
http://www.opensourceconnections.com/2011/12/23/indexing-chinese-in-solr/


After indexing, when I search on the same field with the same text, I
get no search results (numFound=0).

And the Luke tool is not showing any terms for the field that is
indexed with the below field types. Can anyone comment on what is going
wrong?



Schema field types for paoding:

1) <fieldType name="paoding" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
         <tokenizer class="test.solr.PaodingTokerFactory.PaoDingTokenizerFactory"/>
      </analyzer>
   </fieldType>

The analysis page result is:
[inline image not included in the archive]

2) <fieldType name="paoding_chinese" class="solr.TextField">
      <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"/>
   </fieldType>

Analysis on the field paoding_chinese throws this error:
[inline image not included in the archive]



Thanks & Regards
Rajani







Two questions on spellchecking

2012-08-06 Thread Uwe Reh

Hi,

even though I read a lot, none of my spellchecker configurations works 
really well. I reached a dead end. Maybe someone could help to solve my 
challenges.


- How can I get case-sensitive suggestions, independent of the given 
case in the query?


- How to configure a 'did you mean' spellchecking, as discussed in 
https://issues.apache.org/jira/browse/SOLR-2585 (Context-Sensitive 
Spelling Suggestions & Collations)?



I'm using following environment:
- Solr 4.0-alpha (downloaded 25 June)
- Java 7
- schema.xml

<fieldType name="textSuggest" class="solr.TextField"
           positionIncrementGap="100">
   <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldType>

...

<field name="suggest" type="textSuggest" indexed="true" stored="true"
       required="false" multiValued="true"/>

- solrconfig.xml (suggester)

<requestHandler name="/hint"
                class="org.apache.solr.handler.component.SearchHandler">
   <lst name="defaults">
      <str name="echoParams">all</str>
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggester</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="spellcheck.count">20</str>
   </lst>
   <arr name="components">
      <str>suggester</str>
   </arr>
</requestHandler>
<searchComponent name="suggester" class="solr.SpellCheckComponent">
   <lst name="spellchecker">
      <str name="name">suggester</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
      <str name="field">suggest</str>
   </lst>
</searchComponent>

- solrconfig.xml (spellcheck)

<requestHandler name="standard" class="solr.StandardRequestHandler"
                default="true">
   <lst name="defaults">
      <str name="echoParams">all</str>
      <int name="rows">10</int>
      <str name="df">allfields</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="spellcheck.count">20</str>
   </lst>
   <arr name="last-components">
      <str>spellcheck</str>
   </arr>
</requestHandler>

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
   <str name="queryAnalyzerFieldType">textSpell</str>
   <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">suggest</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="distanceMeasure">internal</str>
      <float name="accuracy">0.1</float>
      <int name="maxEdits">2</int>
      <int name="minPrefix">1</int>
      <int name="maxInspections">5</int>
      <int name="minQueryLength">1</int>
      <float name="maxQueryFrequency">0.1</float>
      <float name="thresholdTokenFrequency">0.001</float>
   </lst>
</searchComponent>


*Suggester problem*
With this configuration the suggester is not case-sensitive, but the 
hints are all lower case.

Example: .../hint?q=da&wt=xml&spellcheck=true&spellcheck.build=true

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
   <int name="status">0</int>
   <int name="QTime">173</int>
   <lst name="params">
      <str name="spellcheck">true</str>
      <str name="echoParams">all</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.dictionary">suggester</str>
      <str name="spellcheck.count">20</str>
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="spellcheck">true</str>
      <str name="q">da</str>
      <str name="wt">xml</str>
      <str name="spellcheck.build">true</str>
   </lst>
</lst>
<str name="command">build</str>
<lst name="spellcheck">
   <lst name="suggestions">
      <lst name="da">
         <int name="numFound">20</int>
         <int name="startOffset">0</int>
         <int name="endOffset">2</int>
         <arr name="suggestion">
            <str>dat-marktspiegel spezial</str>
            <str>data structures with c++ using stl</str>
            <str>data warehouse</str>
            <str>datan, ingeborg</str>
            <str>datenbanken mit delphi</str>
            <str>datenverschlüsselung</str>
            <str>dauner, gabriele</str>
            <str>dautermann, margit</str>
            <str>david copperfield</str>
            <str>david, horst</str>
            <str>david, leo</str>
            <str>david, nicholas</str>
            <str>davis, charles t.</str>
            <str>davis, edward l</str>
            <str>davis, leslie dorfman</str>
            <str>davis, stanley m.</str>
            <str>davor kommt noch</str>
            <str>davydova, irina n.</str>
            <str>dawidowski, bernd</str>
            <str>dayan, daniel</str>
         </arr>
      </lst>
      <bool name="correctlySpelled">false</bool>
   </lst>
</lst>
</response>
Using just solr.StrField as field type, the suggestions are true to the 
original capitalization, but I get no suggestions if the query starts 
with a lower-case character.


*Spelling problem*
One of the indexed entries in the field 'suggest' is "David Copperfield" 
and I want this string as an alternative suggestion for the query "David 
opperfield".

Example: .../select?q=david+opperfield&rows=0&wt=xml&spellcheck=true

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
   <int name="status">0</int>
   <int name="QTime">15</int>
   <lst name="params">
      <str name="df">allfields</str>
      <str name="echoParams">all</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">20</str>
      <str name="spellcheck.onlyMorePopular">false</str>
      <str name="rows">0</str>
      <str name="spellcheck">true</str>
      <str name="q">david opperfield</str>

Re: Can't find org.apache.solr.client.solrj.embedded

2010-07-30 Thread Uwe Reh

Sorry,

I had inspected the ...core.jar three times without recognizing the 
package. I was really blind. =8-)


thanks
Uwe

Am 26.07.2010 20:48, schrieb Chris Hostetter:

: where is a Jar, containing org.apache.solr.client.solrj.embedded?

Classes in the embedded package are useless w/o the rest of the Solr
internal core classes, so they are included directly in the
apache-solr-core-1.4.1.jar.

-Hoss
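
For reference, a minimal sketch of using the embedded server from
apache-solr-core (the solr home path and empty core name are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedExample {
   public static void main(String[] args) throws Exception {
      // point at a local solr home (illustrative path)
      System.setProperty("solr.solr.home", "/path/to/solr/home");
      CoreContainer.Initializer initializer = new CoreContainer.Initializer();
      CoreContainer container = initializer.initialize();
      EmbeddedSolrServer server = new EmbeddedSolrServer(container, "");

      // simple smoke test: count all docs
      long n = server.query(new SolrQuery("*:*")).getResults().getNumFound();
      System.out.println(n + " docs");
      container.shutdown();
   }
}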



Can't find org.apache.solr.client.solrj.embedded

2010-07-26 Thread Uwe Reh

Hello experts,

where is a jar containing org.apache.solr.client.solrj.embedded?

I miss this package in 'apache-solr-solrj-1.4.[01].jar'.
Also I can't find any other sources than 
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/webapp/src/org/apache/solr/client/solrj/embedded/ , which do not fit Solr 1.4.


Any tips for a blind newbie?

Uwe