Re: gzip compression solr 8.4.1

2020-05-05 Thread Johannes Siegert
Hi,

We ran further tests to pin down where exactly the problem is. These are our
findings:

The Content-Length is calculated correctly; a quick test with curl confirmed
this.
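(For reference, a check of this kind can be done with curl; the URL and core
name here are hypothetical:

  curl -s -H 'Accept-Encoding: gzip' -D - -o /dev/null \
    -w 'downloaded: %{size_download} bytes\n' \
    'http://localhost:8983/solr/mycore/select?q=*:*'

This prints the response headers plus the number of body bytes actually
received, which can be compared against the Content-Length header.)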
The problem is that the stream with the gzip data is not fully consumed and
afterwards not closed.

Using the debugger with a breakpoint at
org/apache/solr/common/util/Utils.java:575 shows that the call
readFully(entity.getContent()) is never entered, most likely due to how the
gzip stream content is wrapped and extracted beforehand.

On line org/apache/solr/common/util/Utils.java:582 the consumeQuietly(entity)
call should close the stream, but it does not because of a silently swallowed
exception.

This seems to be the same issue as described in
https://issues.apache.org/jira/browse/SOLR-14457

We saw the problem happen with correct gzip responses from Jetty as well, not
only with non-gzip responses as described in the JIRA issue.
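A minimal sketch of the pattern we would expect at that point, assuming
Apache HttpClient 4.x (class and method names here are our own illustration,
not the actual Solr code):

import java.io.InputStream;
import org.apache.http.HttpEntity;
import org.apache.http.util.EntityUtils;

final class EntityDrainer {
    // Fully read and close the entity stream so the pooled HTTP connection
    // is released back to the connection manager instead of ending up in
    // CLOSE_WAIT when the server closes it.
    static void drainAndClose(HttpEntity entity) {
        try (InputStream in = entity.getContent()) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // discard; we only need to reach end-of-stream
            }
        } catch (Exception e) {
            EntityUtils.consumeQuietly(entity); // last-resort cleanup
        }
    }
}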

Best,

Johannes

On Thu, Apr 23, 2020 at 09:55, Johannes Siegert <
johannes.sieg...@offerista.com> wrote:

> Hi,
>
> we want to use gzip-compression between our application and the solr
> server.
>
> [...]

gzip compression solr 8.4.1

2020-04-23 Thread Johannes Siegert
Hi,

we want to use gzip-compression between our application and the solr server.

We use a standalone Solr server version 8.4.1 with the prepackaged Jetty as
application server.

We have enabled the jetty gzip module by adding these two files:

{path_to_solr}/server/modules/gzip.mod (see below the question)
{path_to_solr}/server/etc/jetty-gzip.xml (see below the question)

Within the application we use an HttpSolrClient that is configured with
allowCompression=true.
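A sketch of that client setup with SolrJ 8.x (the URL and class name are
hypothetical):

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SolrClientFactory {
    // Build a SolrJ client that sends Accept-Encoding: gzip and
    // transparently decompresses gzip responses.
    public static HttpSolrClient create() {
        return new HttpSolrClient.Builder("http://localhost:8983/solr/mycore")
                .allowCompression(true)
                .build();
    }
}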

After we released our application we saw the number of connections in TCP
state CLOSE_WAIT rise until the application was no longer able to open new
connections.


After a long debugging session we think the problem is that the
"Content-Length" header returned by Jetty is sometimes wrong when gzip
compression is enabled.

The solrj client uses a ContentLengthInputStream, which relies on the
"Content-Length" header to detect whether all data has been received. But the
InputStream can never be fully consumed, because the value of the
"Content-Length" header is higher than the actual content length.

Usually the method PoolingHttpClientConnectionManager.releaseConnection is
called after the InputStream has been fully consumed. This frees the
connection to be reused or closed by the application.

Due to the incorrect "Content-Length" header, the
PoolingHttpClientConnectionManager.releaseConnection method is never called
and the connection stays active. Once Jetty's connection timeout is reached,
it closes the connection from the server side and the TCP state switches to
CLOSE_WAIT. The client never closes the connection, so the number of
connections in use keeps rising.


Currently we are trying to configure the Jetty gzip module to return no
"Content-Length" header when gzip compression is used. We hope that in this
case another InputStream implementation is used, one that detects the end of
the response from the transfer encoding itself instead of from a declared
length.
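Background, as far as we understand it: an HTTP/1.1 response without a
Content-Length header is normally sent with Transfer-Encoding: chunked. Each
chunk is prefixed with its size in hex, and the body is terminated by a
zero-length chunk, so the client can detect the end of the stream without a
declared length. Roughly:

  HTTP/1.1 200 OK
  Content-Encoding: gzip
  Transfer-Encoding: chunked

  1a
  <26 bytes of gzip data>
  0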

Do you have any experience with this problem, or any suggestions for us?

Thanks,

Johannes


gzip.mod

-

DO NOT EDIT - See:
https://www.eclipse.org/jetty/documentation/current/startup-modules.html

[description]
Enable GzipHandler for dynamic gzip compression
for the entire server.

[tags]
handler

[depend]
server

[xml]
etc/jetty-gzip.xml

[ini-template]
## Minimum content length after which gzip is enabled
jetty.gzip.minGzipSize=2048

## Check whether a file with *.gz extension exists
jetty.gzip.checkGzExists=false

## Gzip compression level (-1 for default)
jetty.gzip.compressionLevel=-1

## User agents for which gzip is disabled
jetty.gzip.excludedUserAgent=.*MSIE.6\.0.*

-

jetty-gzip.xml

-


<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN"
  "http://www.eclipse.org/jetty/configure_9_3.dtd">

<Configure id="Server" class="org.eclipse.jetty.server.Server">

  <Call name="insertHandler">
    <Arg>
      <New id="GzipHandler"
           class="org.eclipse.jetty.server.handler.gzip.GzipHandler">
        <Set name="minGzipSize">
          <Property name="jetty.gzip.minGzipSize"
                    deprecated="gzip.minGzipSize" default="2048" />
        </Set>
        <Set name="checkGzExists">
          <Property name="jetty.gzip.checkGzExists"
                    deprecated="gzip.checkGzExists" default="false" />
        </Set>
        <Set name="compressionLevel">
          <Property name="jetty.gzip.compressionLevel"
                    deprecated="gzip.compressionLevel" default="-1" />
        </Set>
        <Set name="excludedAgentPatterns">
          <Array type="String">
            <Item>
              <Property name="jetty.gzip.excludedUserAgent"
                        deprecated="gzip.excludedUserAgent"
                        default=".*MSIE.6\.0.*" />
            </Item>
          </Array>
        </Set>
      </New>
    </Arg>
  </Call>

</Configure>
-


ManagedFilter for stemming

2019-07-09 Thread Johannes Siegert
Hi,

we are using the SnowballPorterFilter to stem our tokens for several
languages.

Now we want to update the list of protected words via the Solr API.

As far as I can see, there are managed solutions only for the SynonymFilter
and the StopFilter, namely the ManagedSynonymFilter and the ManagedStopFilter.
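For comparison, this is roughly how the managed stop filter is wired into a
schema (a sketch; the resource name "english" is hypothetical):

  <filter class="solr.ManagedStopFilterFactory" managed="english"/>

The word list behind it can then be read and updated at runtime via the REST
endpoint /solr/<collection>/schema/analysis/stopwords/english. There seems to
be no managed counterpart for the SnowballPorterFilter's protected-words
list.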

Do you know of any solution to this problem?

Thanks,

Johannes


Re: optimize cache-hit-ratio of filter- and query-result-cache

2015-12-01 Thread Johannes Siegert
Thanks. The statements on
http://wiki.apache.org/solr/SolrCaching#showItems are not explicit
enough to answer my question.




optimize cache-hit-ratio of filter- and query-result-cache

2015-11-30 Thread Johannes Siegert

Hi,

some of my solr indices have a low cache-hit-ratio.

1. Does sorting the parts of a single filter query have an impact on the
filter-cache and query-result-cache hit ratios?
1.1 Example: fq=field1:(2 OR 3 OR 1) vs. fq=field1:(1 OR 2 OR 3), if 1, 2, 3
are randomly ordered.
2. Does sorting the parts of the query have an impact on the
query-result-cache hit ratio?
2.1 Example: "q=abc&fq=field1:abc&sort=field1 desc&fq=field2:xyz&sort=field2
asc" vs. "q=abc&fq=field1:abc&fq=field2:xyz&sort=field1 desc&sort=field2
asc", if the query parts are randomly ordered.
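A sketch of the kind of client-side normalization this suggests (assuming the
caches key on the parsed query, so equivalent but differently ordered filters
create separate cache entries):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterQueryBuilder {
    // Emit the values of a terms filter in sorted order so that logically
    // identical filters always produce the identical fq string.
    public static String termsFilter(String field, List<String> values) {
        return field + ":(" + values.stream().sorted()
                .collect(Collectors.joining(" OR ")) + ")";
    }

    public static void main(String[] args) {
        // prints: field1:(1 OR 2 OR 3)
        System.out.println(termsFilter("field1", Arrays.asList("2", "3", "1")));
    }
}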


Thanks!

Johannes



sort by given order

2015-03-12 Thread Johannes Siegert

Hi,

I want to sort my documents in a given order. The order is defined by a
list of ids.


My current solution is:

list of ids: 15, 5, 1, 10, 3

query: q=*:*&fq=(id:((15) OR (5) OR (1) OR (10) OR
(3)))&sort=query($idqsort) desc,id asc&idqsort=id:((15^5) OR (5^4) OR
(1^3) OR (10^2) OR (3^1))&start=0&rows=5


Do you know another solution to sort by a list of ids?
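One alternative is to re-order the page on the client after retrieval (a
sketch; it assumes all requested ids fit into a single response):

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class GivenOrderSort {
    // Sort documents by the position of their id in a given id list;
    // ids missing from the list are moved to the end.
    public static <T> void sortByIdOrder(List<T> docs, List<String> idOrder,
                                         Function<T, String> idOf) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < idOrder.size(); i++) {
            rank.put(idOrder.get(i), i);
        }
        docs.sort(Comparator.<T>comparingInt(
                d -> rank.getOrDefault(idOf.apply(d), Integer.MAX_VALUE)));
    }
}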

Thanks!

Johannes


NGramTokenizer influence to length normalization?

2014-08-08 Thread Johannes Siegert

Hi,

does the NGramTokenizer have an influence on length normalization?

Thanks.

Johannes


wrong docFreq while executing query based on uniqueKey-field

2014-07-22 Thread Johannes Siegert

Hi.

My solr index (version 4.7.2) has an id field:

<field name="id" type="string" indexed="true" stored="true"/>
...
<uniqueKey>id</uniqueKey>

The index will be updated once per hour.

I use the following query to retrieve some documents:

q=id:2^2 id:1^1

I would expect document(2) to always rank before document(1). But after many
index updates, document(1) ranks before document(2).

With debug=true I could see the problem: document(1) has a docFreq=2, while
document(2) has a docFreq=1.


How can the docFreq of the uniqueKey field be higher than 1? Could anyone
explain this behavior to me?
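(A plausible explanation, not confirmed in this thread: updating a document
with the same uniqueKey deletes the old version and adds a new one, and
Lucene still counts deleted-but-not-yet-merged documents in docFreq, so after
many hourly updates the docFreq of an id term can exceed 1 until the segments
are merged.)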


Thanks!

Johannes



default query operator ignored by edismax query parser

2014-06-25 Thread Johannes Siegert

Hi,

I have defined the following edismax query parser:

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="mm">100%</str>
    <str name="defType">edismax</str>
    <float name="tie">0.01</float>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="q.op">AND</str>
    <str name="qf">field1^2.0 field2</str>
    <str name="rows">10</str>
    <str name="fl">*</str>
  </lst>
</requestHandler>


My search query looks like:

q=(word1 word2) OR (word3 word4)

Since I specified AND as default query operator, the query should match 
documents by ((word1 AND word2) OR (word3 AND word4)) but the query 
matches documents by ((word1 OR word2) OR (word3 OR word4)).


Could anyone explain the behaviour?

Thanks!

Johannes

P.S. The query q=(word1 word2) does match documents by (word1 AND word2).


Re: default query operator ignored by edismax query parser

2014-06-25 Thread Johannes Siegert

Thanks Shawn!

In this case I will use operators everywhere.

Johannes


On 6/25/2014 15:09, Shawn Heisey wrote:

On 6/25/2014 1:05 AM, Johannes Siegert wrote:

I have defined the following edismax query parser:

[...]


My search query looks like:

q=(word1 word2) OR (word3 word4)

Since I specified AND as default query operator, the query should match
documents by ((word1 AND word2) OR (word3 AND word4)) but the query
matches documents by ((word1 OR word2) OR (word3 OR word4)).

Could anyone explain the behaviour?

I believe that you are running into this bug:

https://issues.apache.org/jira/browse/SOLR-2649

It's a very old bug, coming up on three years.  The workaround is to not
use boolean operators at all, or to use operators EVERYWHERE so that
your intent is explicitly described.  It is not much of a workaround,
but it does work.
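(In other words: write q=(word1 AND word2) OR (word3 AND word4) explicitly
instead of relying on q.op.)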

Thanks,
Shawn



Bug within the solr query parser (version 4.7.1)

2014-04-15 Thread Johannes Siegert

Hi,

I have updated my solr instance from 4.5.1 to 4.7.1. Now the parsed query no
longer seems to be correct.


Query: q=*:*&fq=title:TE&debug=true

Before the update the parsed filter query was +title:te +title:t
+title:e. After the update the parsed filter query is +((title:te
title:t)/no_coord) +title:e. It seems like a bug within the query parser.


I also validated the parsed filter query with the analysis component. The
result was +title:te +title:t +title:e.


The behavior is the same for all special characters that split a word into
two parts.


I use the following WordDelimiterFilter on query side:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
  generateNumberParts="1" catenateWords="0" catenateNumbers="0"
  catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
  preserveOriginal="1"/>


Thanks.

Johannes


Additional information:

Debug before the update:

<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str>(title:((TE)))</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>+title:te +title:t +title:e</str>
  </arr>
...

Debug after the update:

<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str>(title:((TE)))</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>+((title:te title:t)/no_coord) +title:e</str>
  </arr>
...

title-field definition:

<fieldType name="text_title" class="solr.TextField"
  positionIncrementGap="100" omitNorms="true">

  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
      generateNumberParts="1" catenateWords="1" catenateNumbers="1"
      catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
      preserveOriginal="1" stemEnglishPossessive="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
      ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
      generateNumberParts="1" catenateWords="0" catenateNumbers="0"
      catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
      preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


changed query behavior

2014-04-14 Thread Johannes Siegert

Hi,

I have updated my solr instance from 4.5.1 to 4.7.1.

Now my solr query is failing some tests.

Query: q=*:*&fq=(title:((TE)))&debug=true

Before the update:

<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str>(title:((TE)))</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>+title:te +title:t +title:e</str>
  </arr>
...

After the update:

<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str>(title:((TE)))</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>+((title:te title:t)/no_coord) +title:e</str>
  </arr>
...

Before the update the query delivered only one result. Now the query
delivers three results.

Do you have any idea why the parsed filter query is +((title:te
title:t)/no_coord) +title:e instead of +title:te +title:t +title:e?


title-field definition:

<fieldType name="text_title" class="solr.TextField"
  positionIncrementGap="100" omitNorms="true">

  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
      generateNumberParts="1" catenateWords="1" catenateNumbers="1"
      catenateAll="1" splitOnCaseChange="1" splitOnNumerics="1"
      preserveOriginal="1" stemEnglishPossessive="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
      ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
      generateNumberParts="1" catenateWords="0" catenateNumbers="0"
      catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
      preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The default query operator is AND.

Thanks!

Johannes




solr-query with NOT and OR operator

2014-02-11 Thread Johannes Siegert

Hi,

my solr request contains the following filter query:

fq=((-(field1:value1)))+OR+(field2:value2)

I expect Solr to deliver documents matching ((-(field1:value1))) as well as
documents matching (field2:value2).

But Solr delivers only documents that match (field2:value2). I do receive
several documents if I request only ((-(field1:value1))).


Thanks!

Johannes


Re: solr-query with NOT and OR operator

2014-02-11 Thread Johannes Siegert

Hi Jack,

thanks!

fq=((*:* -(field1:value1)))+OR+(field2:value2)

This is the solution.

Johannes

On 2014-02-11 17:22, Jack Krupansky wrote:
With so many parentheses in there, I wonder what you are really trying
to do... Try expressing your query in simple English first so that we
can understand your goal.


But generally, a purely negative nested query must have a *:* term to 
apply the exclusion against:


fq=((*:* -(field1:value1)))+OR+(field2:value2)

-- Jack Krupansky

-Original Message- From: Johannes Siegert
Sent: Tuesday, February 11, 2014 10:57 AM
To: solr-user@lucene.apache.org
Subject: solr-query with NOT and OR operator

Hi,

my solr-request contains the following filter-query:

fq=((-(field1:value1)))+OR+(field2:value2).

I expect Solr to deliver documents matching ((-(field1:value1))) and
documents matching (field2:value2).

But Solr delivers only documents that match (field2:value2).
I do receive several documents if I request only ((-(field1:value1))).

Thanks!

Johannes


--
Johannes Siegert
Software Developer

Phone:    0351 - 418 894 -73
Fax:      0351 - 418 894 -99
E-mail:   johannes.sieg...@marktjagd.de
Xing:     https://www.xing.com/profile/Johannes_Siegert2

Website:  http://www.marktjagd.de
Blog:     http://blog.marktjagd.de
Facebook: http://www.facebook.com/marktjagd
Twitter:  http://twitter.com/Marktjagd
__

Marktjagd GmbH | Schützenplatz 14 | D - 01067 Dresden

Managing Director: Jan Großmann
Registered office: Dresden | Dresden District Court | HRB 28678



Re: high memory usage with small data set

2014-02-05 Thread Johannes Siegert

Hi Erick,

thanks for your reply.

What exactly do you mean by "Do your used entries in your caches
increase in parallel"?

I update the indices every hour and commit the changes. So a new 
searcher with empty or autowarmed caches should be created and the old 
one should be removed.


Johannes

On 1/30/2014 15:08, Erick Erickson wrote:

Do your used entries in your caches increase in parallel? This would be the case
if you aren't updating your index and would explain it. BTW, take a look at your
cache statistics (from the admin page) and look at the cache hit ratios. If they
are very small (and my guess is that with 1,500 boolean operations, you aren't
getting significant re-use) then you're just wasting space, try the cache=false
option.
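(For reference, that option can be applied per filter query as a local
param, e.g. fq={!cache=false}field1:(value1 OR value2), which keeps that
filter out of the filterCache; field and values here are hypothetical.)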

Also, how are you measuring memory? It's sometimes confusing that virtual
memory can be include, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Best,
Erick

On Wed, Jan 29, 2014 at 7:49 AM, Johannes Siegert
johannes.sieg...@marktjagd.de wrote:

Hi,

we are using Apache Solr Cloud within a production environment. When the
maximum heap space is reached, Solr access times slow down for a short time
because the garbage collector is working.

We use the following configuration:

- Apache Tomcat as webserver to run the Solr web application
- 13 indices with about 150 entries (300 MB)
- 5 servers with one replica per index (5 GB max heap-space)
- All indices have the following caches
- maximum document-cache-size is 4096 entries, all other indices have
between 64 and 1536 entries
- maximum query-cache-size is 1024 entries, all other indices have
between 64 and 768
- maximum filter-cache-size is 1536 entries, all other indices have
between 64 and 1024
- the directory-factory-implementation is NRTCachingDirectoryFactory
- the index is updated once per hour (no auto commit)
- ca. 5000 requests per hour per server
- large filter-queries (up to 15000 bytes and 1500 boolean operations)
- many facet-queries (30%)

Behaviour:

Started with 512 MB heap space. Over several days the heap usage grew until
the 5 GB was reached. At this moment the described problem occurred.
From this time on the heap-space usage is between 50 and 90 percent. No
OutOfMemoryException occurs.

Questions:


1. Why does Solr use 5 GB of RAM with this small amount of data?
2. What impact do the large filter queries have on RAM usage?

Thanks!

Johannes Siegert



high memory usage with small data set

2014-01-29 Thread Johannes Siegert

Hi,

we are using Apache Solr Cloud within a production environment. When the
maximum heap space is reached, Solr access times slow down for a short
time because the garbage collector is working.


We use the following configuration:

- Apache Tomcat as webserver to run the Solr web application
- 13 indices with about 150 entries (300 MB)
- 5 servers with one replica per index (5 GB max heap-space)
- All indices have the following caches
   - maximum document-cache-size is 4096 entries, all other indices 
have between 64 and 1536 entries
   - maximum query-cache-size is 1024 entries, all other indices have 
between 64 and 768
   - maximum filter-cache-size is 1536 entries, all other indices have
between 64 and 1024

- the directory-factory-implementation is NRTCachingDirectoryFactory
- the index is updated once per hour (no auto commit)
- ca. 5000 requests per hour per server
- large filter-queries (up to 15000 bytes and 1500 boolean operations)
- many facet-queries (30%)

Behaviour:

Started with 512 MB heap space. Over several days the heap usage grew
until the 5 GB was reached. At this moment the described problem
occurs. From this time on the heap-space usage is between 50 and 90
percent. No OutOfMemoryException occurs.


Questions:


1. Why does Solr use 5 GB of RAM with this small amount of data?
2. What impact do the large filter queries have on RAM usage?

Thanks!

Johannes Siegert