sing something else. It should not make a difference (as your
non-truncated queries are fast), but could you try to reduce the slow
request to the simplest possible? No grouping, faceting or other special
processing, just q=network se*
- Toke Eskildsen, State and University Library, Denmark
ected is going on while you test?
> How can I disable replication(as it is implicitly enabled) permanently as
> in our case we are not using it but can see warnings related to leader
> election?
If you are using spinning drives and only have 32GB of RAM in total in
each machine, you are probably st
tabilizes.
- Toke Eskildsen, State and University Library, Denmark
done #segments benchmarking for your huge datasets?
Only informally. However, the guys at UKWA run a similar scale index and
have done multiple segment-count-oriented tests. They have not published
a report, but there are measurements & graphs at
https://github.com/ukwa/shine/tree/master/pytho
tternReplaceCharFilter, matching on something
like
([^.,:!?]\p{Space}*\p{Upper})|(^\p{Upper})
and replacing with 'capital' (the regexp above probably fails - it was just
from memory).
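The intent can be sketched in Python as a stand-in for the char filter (the marker text and the simplified pattern are illustrative, not the exact filter configuration):

```python
import re

# Rough sketch of the idea: prefix upper-case letters with a 'capital'
# marker, roughly what a PatternReplaceCharFilter would do at index time.
# The pattern is simplified; the original Java regex was quoted from memory.
pattern = re.compile(r"([A-Z])")

def mark_capitals(text):
    return pattern.sub(r"capital\1", text)

print(mark_capitals("New York"))  # capitalNew capitalYork
```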
- Toke Eskildsen
VM works the same as the Oracle one in this aspect,
but for the Oracle one, it is important to set Xmx _below_ 32GB instead of at
exactly 32GB:
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
You might want to try the program at that page to check where the IBM l
th
sufficiently long pauses between index updates. Nightly index updates
with few active users at that time could be an example.
- Toke Eskildsen, State and University Library, Denmark
tarted AFTER 29 seconds. Any logic behind
> what I am seeing here?
It shows that the shard-searches themselves are not what is slowing you down.
Are the returned documents very large? Try setting fl=id,score and see if it
brings response times below 1 second.
- Toke Eskildsen
s. A manual process that requires clicking next 1000
times is a severe indicator that something can be done differently.
- Toke Eskildsen
rt parameter, the difference
is small as long as you stay below a start of 1000. 10K might also work for
you. Do your users page beyond that?
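For the users who do page deeper, Solr's cursorMark (available since 4.7) avoids the deep-paging cost of large start values; a minimal request shape, assuming `id` is the uniqueKey field, looks like:

```
q=*:*&sort=id+asc&rows=100&cursorMark=*
```

Each response carries a nextCursorMark value, which is passed as cursorMark in the following request instead of incrementing start.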
- Toke Eskildsen
ve the amount
of memory available for new processes. If you start a new and
memory-hungry process, it will take the memory from the free pool first,
then from the disk cache.
- Toke Eskildsen, State and University Library, Denmark
on the machine, only 1 CPU
is running at full tilt. There is always a bottleneck.
What might help is that the SSD (probably) does not get bogged down by
the process, so it should be much better at handling other requests
while the optimization is running.
- Toke Eskildsen, State and University L
way, operating under the assumption that the single-core facet
request for some reason acts as a distributed call, the key to avoiding the
fine-counting is to ensure that _all_ possibly relevant term counts have
been returned in the first facet phase.
Try setting both facet.mincount=0 and facet.limit=-1.
- T
the virtualized instances, we only use local SSDs to hold our index
data. That might affect the trade-off as even slight delays in IO becomes
visible, when storage access times are < 0.1ms instead of > 1ms. I suspect the
relative impact of virtualization is less with spinning drives or networ
On Wed, 2015-09-30 at 06:58 -0700, marotosg wrote:
> b) Based on full data. I would like to run queries and see if the results
> are good enough. That's the part I am not sure if makes sense or how to do
> it.
Seems like an exact match for http://quepid.com/
(I am not affil
hers. It was a bad idea for
us.
- Toke Eskildsen, State and University Library, Denmark
reaming faceting work well if one wants to export the full
result set. For top-X requests it seems that there is a lot of overhead
resolving terms that will not be used in the final result. But my
understanding of Solr streams is very shaky.
- Toke Eskildsen, State and University Library, Denmark
ng. It is
basically 'original_query AND facet_field:fine_count_term'. Quite fast for a
few terms, but if there is a need for resolving tens or hundreds of terms for a
non-trivial index, the fine-counting phase can take longer than the initial
faceting phase.
- Toke Eskildsen
(sorry for the
Thank you for the verification,
Toke Eskildsen, State and University Library, Denmark
it is substantially less than 100%, then feed Solr from more than one
thread at a time.
- Toke Eskildsen, State and University Library, Denmark
hat the CPU-cores are nicely utilized with our low queries/second
usage pattern.
- Toke Eskildsen, State and University Library, Denmark
e number of unique Terms might mean that the disk cache is
not large enough.
Blatant plug: I have spent a fair amount of time trying to make some of this
faster http://tokee.github.io/lucene-solr/
- Toke Eskildsen
ld be to use
lenient=false as default, and to allow overriding it in solrconfig.xml
for backwards compatibility.
- Toke Eskildsen, State and University Library, Denmark
ifference. Changing time zone on the machine might have triggered that,
but then we're entering random-guessing.
- Toke Eskildsen, State and University Library, Denmark
Linux, I
would suspect it to be very easy. If you use the built-in graphical file
explorer, I suspect the only way to do so is by adjusting timezone settings for
the whole system. Etc.
- Toke Eskildsen
dated. So if I check the index
> folder, it will not be accurately reflexing the last time the index files
> are updated.
Just watch index/segments.gen. That is precise as it tracks when the logical
index was last updated, whereas segment files currently being written are with
later tim
Renee Sun wrote:
> But I did a test with heavy indexing on going, and observed the index file
> in [core]/index with a latest updated timestamp keep growing for about 7
> minutes...
That is not a file, but the folder that holds the immutable segment files. What
you observe is segments being writ
it does not support multiple index threads.
- Toke Eskildsen
e that does not help if you are
already doing that.
Also sanity check that you are not doing commits all the time.
- Toke Eskildsen
).
I guess your local timezone is UTC+2 and that your country is using
daylight saving? Solr uses UTC only for timestamps, which is fairly
unambiguous. If you want the filesystem dates to match, you can
normalise them to UTC in your viewer - how to do that depends on your
system.
- Toke Eskildsen
nal hiccup, so we'll be switching
to SolrCloud at some point.
- Toke Eskildsen, State and University Library, Denmark
blems with 10M documents calls for locating the
bottlenecks, before trying to scale the problem away.
- Toke Eskildsen, State and University Library, Denmark
With a
large field this map cannot be in the fast caches. Combine this with a
gazillion references and it makes sense that JSON Facets is slower in this
scenario. A factor of 20 sounds like way too much, though. I would have expected
maybe 2.
- Toke Eskildsen
limit you requested or if it is higher
(default formula is limit * 1.5 + 10).
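Spelled out, the default per-shard over-request from that formula is:

```python
def shard_facet_limit(limit):
    # Default distributed-faceting over-request per shard, per the
    # formula quoted above: limit * 1.5 + 10.
    return int(limit * 1.5) + 10

print(shard_facet_limit(100))  # 160
```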
The rest of your questions are too far outside of my knowledge for me to
try and answer.
- Toke Eskildsen, State and University Library, Denmark
en 4 & 5 that you are observing. Are you doing faceting
as part of your test?
- Toke Eskildsen, State and University Library, Denmark
as little to do with Solr and a lot to do with carrot (assuming here
that carrot is the bottleneck). You might have more success asking in a
carrot forum?
- Toke Eskildsen, State and University Library, Denmark
s not working as
> it says 'Page Not Found'.
That is because it is too long for a single line. Try copy-pasting it:
https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration
- Toke Eskildsen, State and University Library, Denmark
ed something else?
Plain faceting perhaps? Or maybe enrichment of the documents with some
sort of entity extraction?
- Toke Eskildsen, State and University Library, Denmark
t.
It would be great if someone with a bit of time did some experiments
with Solr on this issue. Locally we side step it a bit as we are able to
get by with a 30GB heap for our largest installation and do not need
more than 10GB for the rest.
- Toke Eskildsen, State and University Library, Denmark
u are running a lot of requests in parallel. Have you considered using
a queue instead? If you currently use hundreds of parallel requests to a
single machine, chances are you will get higher throughput by limiting
that. As a bonus, it will require less heap.
- Toke Eskildsen, State and Uni
he Solr part) or the
clustering itself (the Carrot part) that is the bottleneck.
- Toke Eskildsen
ues upon first call.
> I assume my fallback is to not index with doc values, and use an uninverting
> reader to get the field data. Is there a better approach?
You could index your integers as DocValued Strings, prefixed with zeroes to
ensure same length and proper integer sort.
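A quick illustration of the zero-padding trick (the width and the sample values are made up); lexicographic order of the padded strings equals numeric order for non-negative integers that fit the width:

```python
def encode_for_sort(n, width=10):
    # Zero-pad so that string (lexicographic) sort equals numeric sort.
    # Only valid for non-negative integers of at most `width` digits.
    return str(n).zfill(width)

values = [7, 42, 5, 1000]
padded = sorted(encode_for_sort(v) for v in values)
print(padded)  # ['0000000005', '0000000007', '0000000042', '0000001000']
```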
- Toke Eskildsen
Toke Eskildsen wrote
> Use more than one cloud. Make them fully independent.
> As I suggested when you asked 4 days ago. That would
> also make it easy to scale: Just measure how much a
> single setup can take and do the math.
The goal is 250K documents/second.
I tried modifying t
Use more than one cloud. Make them fully independent. As I suggested when you
asked 4 days ago. That would also make it easy to scale: Just measure how much
a single setup can take and do the math.
- Toke Eskildsen
and the
two bitsets are merged.
Next time you use the same fq, it should be cached (if you have caching
enabled) and be a lot faster.
Also, if you ran your two tests right after each other, the second one
benefits from disk caching. If you had executed them in reverse order,
the q+fq might have
ory.
If you can paste a problematic query, it is easier to see what is
happening.
- Toke Eskildsen, State and University Library, Denmark
rement.
- Toke Eskildsen, State and University Library, Denmark
Scott Derrick wrote:
> Is there a way to get the list of terms that matched in a query response?
Add debug=query to your request:
https://wiki.apache.org/solr/CommonQueryParameters#debug
You might also want to try
http://splainer.io/
- Toke Eskildsen
nderstand
correctly), one of our 256GB machines holds 6 billion documents in 20TB of
index data. You might want to investigate that option. Some details at
https://sbdevel.wordpress.com/net-archive-search/
- Toke Eskildsen
u do with your data. Most of the time, IO is the bottleneck
for Solr and for those cases it is probably more bang-for-the-buck to buy
machines with 256GB of RAM (or maybe the 148GB you have currently) as it
minimizes the overhead per box.
- Toke Eskildsen
.
> 3) How many shards / replicas per collection should I use?
> 4) Do I need multiple Solr servers?
Not enough data about index usage to say. Between 1 and 50, not kidding.
- Toke Eskildsen
ler collections have better performance than fewer larger
collections?
> (I also have cross customers queries)
If you make independent setups, that could be solved by querying them
independently and doing the merging yourself.
- Toke Eskildsen
ging from a single-shard setup to a
multi-shard one. As always, measure.
- Toke Eskildsen
ink an all_parameters -> complete_response cache is possible?
> It could be initialized right before or during warmup and would not take to
> much memory.
Sorry, I don't know much of the mechanics of handlers in Solr and cannot
say how the in-theory-simple caching would fit.
- Toke Eskildsen, State and University Library, Denmark
If that is not the case, your best bet would probably
be to cache the match-all outside of Solr.
> My assumption is that the queryResultCache is catching such a
> MatchAllDocsQuery(*:*).
It only stores the docIDs.
I don't know why there is no all_parameters -> complete_response
ow how can I force to fetch 50 Indian & 50 Iran records using a
> single SOLR query?
q=*:*&fq=(country:india) OR (country:iran)
&group=true&group.field=country&group.limit=50
https://cwiki.apache.org/confluence/display/solr/Result+Grouping
- Toke Eskildsen, State and University Library, Denmark
, which cannot be done on String fields.
> Would you please help me to solve this problem?
With the information we have, it does not seem to be easy to solve: It seems
like you want to facet on all terms in your index. As they need to be String
(to use docValues), you would have to do all the
due to 2 analyzed-but-single-token text fields
with 10-20M values that we use for faceting.
I am not a committer and on vacation anyway, so this is just a thumbs up to the
initiative.
- Toke Eskildsen
e UnInverted structure has a speed edge due to being directly accessible as
standard on-heap memory structures.
The difference is likely to vary a great deal depending on concrete corpus &
hardware.
- Toke Eskildsen
Paden wrote:
> How would I perform a http request that would say return the documents of
> previous query but ONLY the documents where author = (author with 31
> documents)
Simplest thing is to add it as a filter query:
q=fairy+tales&fq=author:"H. C. Andersen"
- Toke Eskildsen
t people
are seeing. There might be a perfectly fine reason for those response
times, but I suggest we sanity check them: Could you show us a typical
query and tell us how many concurrent queries you normally serve?
- Toke Eskildsen, State and University Library, Denmark
cy?
I have zero experience with that: We build the shards one at a time and don't
touch them after that. 90% of our building power goes to Tika analysis, so
there hasn't been an apparent need for tuning Solr's indexing.
- Toke Eskildsen
be better.
Turning it around: To minimize the risk of occasional performance-degrading
large merges, one might want an index where all the shards are below a certain
size. Splitting larger shards into smaller ones would in that case also be an
optimization, just towards a different goal.
- Toke Eskildsen
illion documents, divided across 1000 shards.
- Toke Eskildsen
ng, would work for us.
Switching to a new controlling layer is not trivial, so the win by
better utilization during the optimization phase is not enough in itself
to pay the cost.
- Toke Eskildsen, State and University Library, Denmark
t I do not know how hard it would be to do so.
- Toke Eskildsen
On Thu, 2015-06-04 at 16:45 +0530, Midas A wrote:
> I have some indexing issue . While indexing IOwait is high in solr server
> and load also.
Might be because you commit too frequently. How often do you do that?
- Toke Eskildsen, State and University Library, Denmark
p will slowly fill up as more and more
users perform faceted queries on their content.
- Toke Eskildsen
nd even if your 1 million facet fields all had just 1 value, represented by 1
bit, it would still require 10M * 1M bits in memory, which is 10 terabits
(upwards of a terabyte) of RAM.
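Spelling out the arithmetic, one bit per document per field:

```python
docs = 10_000_000         # 10M documents
fields = 1_000_000        # 1M facet fields, 1 bit per doc per field
total_bits = docs * fields
total_tb = total_bits / 8 / 1000**4  # bits -> bytes -> terabytes
print(total_bits, total_tb)  # 10000000000000 bits, 1.25 TB
```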
- Toke Eskildsen
o with the data, how much the machine(s) will be used while
indexing and your requirements to speed.
See
https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
- Toke Eskildsen
point: Require Solr to be run as an application
instead of in a generic container) might be an idea?
- Toke Eskildsen
hits.
> Also subsequent calls are not fast:
> First call time: 297572
> Second call time (made with in 2 sec): 249287
Are you indexing while searching? Each time the index is changed, the
UnInversion will have to be re-done. facet.method=fcs seems a better
choice with an often-changing
just calculate facets of 137 records?
6½ minutes is a long time, even for a first call. Do you have tens to
hundreds of millions of documents in your index? Or do you have a
similar number of unique values in your facet?
Either way, subsequent faceting calls should be much faster and a switch
to D
judging from your previous post
"problem with facets - out of memory exception", you are doing
non-trivial faceting. Are you using DocValues, as Marc suggested?
- Toke Eskildsen, State and University Library, Denmark
s, but it seems like a lot of work
for a special case.
- Toke Eskildsen, State and University Library, Denmark
wasn't using any
> RAM... wasn't getting any requests.
No problem at all. On the contrary, thank you for closing the issue.
- Toke Eskildsen
wbacks?
Support for the Disk-format for DocValues was removed after 4.8, so you should
check if you use that: DocValuesFormat="Disk" for the field in the schema, if I
remember correctly.
- Toke Eskildsen
Do you have a large and active filter cache? Each entry is 30MB, so it
does not take many entries to fill a 8GB heap. That would match the
description of ever-running GC.
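The 30MB figure follows from a filter cache entry holding one bit per document in the index; working backwards (a sketch, assuming a plain bitset entry):

```python
entry_bytes = 30 * 1024**2   # one filterCache entry of 30MB
docs = entry_bytes * 8       # the bitset holds one bit per document
print(docs)  # 251658240, i.e. an index of roughly 250M documents
```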
- Toke Eskildsen, State and University Library, Denmark
g) would probably be a lot higher with just a single shard.
- Toke Eskildsen
n each field, it just means that multiple
fields are processed in parallel.
- Toke Eskildsen
y guess is that \u0001
matches it.
So something like
regexp="^([^\u0001]*)\u0001([^\u0001]*)\u0001([^\u0001]*)\u0001...$"?
Untested and all.
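If pre-processing outside Solr is an option, plain splitting on \u0001 is simpler than a capturing regex (the sample line is made up):

```python
# Split a \u0001-delimited record into fields; no regex needed.
line = "first\u0001second\u0001third"
fields = line.split("\u0001")
print(fields)  # ['first', 'second', 'third']
```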
But why not use the CSV import handler? That seems like the best fit.
- Toke Eskildsen
oks like DocValues now and it
seems (guessing quite a bit here) that the old 16M-limitation is gone.
- Toke Eskildsen
like this:
regex="^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),...([^,]*)$"
The match speed for 28 groups with that regexp was about 0.002ms (average over
1000 matches).
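That measurement can be reproduced with a sketch like this (the field values are made up, and absolute timings will of course vary by machine):

```python
import re
import time

# Build the 28-group pattern programmatically instead of writing it out:
# 27 groups each followed by a comma, plus a final 28th group.
pattern = re.compile(r"^" + r"([^,]*)," * 27 + r"([^,]*)$")
line = ",".join(f"value{i}" for i in range(28))

start = time.perf_counter()
for _ in range(1000):
    match = pattern.match(line)
avg_ms = (time.perf_counter() - start) / 1000 * 1000
print(len(match.groups()), f"{avg_ms:.4f} ms/match")
```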
- Toke Eskildsen
unique values per
shard for docValues. I would like to see that go away, but that's just part of
an ongoing mission to get Solr to break free from the old "2 billion should be
enough for everyone"-design.
- Toke Eskildsen
ide, there seems to be renewed interest for it.
- Toke Eskildsen
100GB index from a same-size machine.
The one hardware advice I will give is to start with SSDs and scale from there.
With present day price/performance, using spinning drives for anything
IO-intensive makes little sense.
- Toke Eskildsen
t can be accomplished by having differently
analyzed versions of the same logical field: Having a single catch-all is just
easy to do.
Another reason can be performance: fq-matching against all fields is heavier
than matching against a few fields and the catch-all.
- Toke Eskildsen
ution to work, that would be the preferable
solution.
- Toke Eskildsen, State and University Library, Denmark
ost makes a lot more sense. I will not argue against that.
- Toke Eskildsen
Jack Krupansky [jack.krupan...@gmail.com] wrote:
> Don't confuse customers and tenants.
Perhaps you could explain what you mean by multi-tenant in the context of Ian's
setup? It is not clear to me what the distinction is in this case.
- Toke Eskildsen
that update processor ...but this capability is not available out of the
> box.
I have not tried it at all, but I thought
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
was doing exactly what you describe?
- Toke Eskildsen, State and University Library, Denmark
practically no difference in speed
between page 1 and page 5,000. I say practically because on paper,
requesting page 5,000 will be a smidgen faster (there are fewer inserts
into the priority queue), but I doubt it can be measured in real world
setups.
- Toke Eskildsen
segments being immutable, the bird's eye view is that Lucene creates and
deletes large files, which makes it possible for the SSD's wear-leveler to
select the least-used flash sectors for new writes: The write pattern over time
is not too far from the one that The Tech Report tested wit
cet.mincount=1&sort=score+desc
How large is your index in bytes, how many documents does it contain and
is it single-shard or cloud? Could you paste the loglines containing
"UnInverted field", which describes the number of unique values and size
of your facet fields?
- Toke Eskildsen, State and University Library, Denmark
search wiki
> (http://wiki.apache.org/solr/DistributedSearch) it looks like Solr does
> the search and result merging (all I have to do is issue a search), is
> this correct?
Yes. From a user-perspective, searches are no different.
- Toke Eskildsen, State and University Library, Denmark
ields?
My next step would be to disable parts of the query (highlight, faceting and
collapsing one at a time) to check which part is the heaviest.
- Toke Eskildsen
From: Tang, Rebecca [rebecca.t...@ucsf.edu]
Sent: 25 February 2015 20:44
To: solr
mend running 10,000 concurrent searches as it leads to
congestion. You will probably get a higher throughput by queueing your requests
and processing them with 100 concurrent searches or so. Do test.
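Client-side, the queue can be as simple as a bounded worker pool; a minimal sketch, where the search function and the cap of 100 are placeholders for the real HTTP call and a measured concurrency level:

```python
from concurrent.futures import ThreadPoolExecutor

def search(query):
    # Placeholder for the actual HTTP request to Solr.
    return f"results for {query}"

queries = [f"query-{i}" for i in range(1_000)]

# The bounded pool doubles as the queue: at most 100 searches in flight,
# the rest wait their turn instead of congesting the server.
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(search, queries))

print(len(results))  # 1000
```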
- Toke Eskildsen
ich should tell you where the
time is spent resolving the queries. If it is IOWait, then ensure a lot of free
memory for disk cache and/or improve your storage speed (SSDs instead of
spinning drives, local storage instead of remote).
- Toke Eskildsen, State and University Library, Denmark.
a small index (in bytes) and a high query
rate, that probably won't help your throughput.
- Toke Eskildsen, State and University Library, Denmark
Solr or JVM? Can it
> only be explained by the mass indexing? What is worrisome is that the
> 4.10.2 shard reserves 8x times it uses.
If you set your Xmx to a lot less, the JVM will probably favour more
frequent garbage collections over extra heap allocation.
- Toke Eskildsen, State and University Library, Denmark