On Tue, 2011-04-12 at 11:41 +0200, Gregor Heinrich wrote:
Hi -- has there been any effort to create a numerical representation of
Lucene
indices? That is, to use the Lucene Directory backend as a large
term-document
matrix at index level. As this would require bijective mapping between
On Sat, 2011-07-09 at 05:44 +0200, Shai Erera wrote:
The taxonomy is global to the index, but I think it will be
interesting to explore per-segment taxonomy, and how it can be used to
improve indexing or search perf (hopefully both).
I have struggled with this for some time and still haven't
if I had a
place to make a public repository (which admittedly is easy enough with
GitHub et al).
- Toke Eskildsen, State and University Library, Denmark
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional
On Tue, 2012-11-13 at 19:50 +0100, Yonik Seeley wrote:
The original version of Solr (SOLAR when it was still inside CNET) did
this - a multiValued field with a single value was output as a single
value, not an array containing a single value. Some people wanted
more predictability (always an
On Wed, 2012-11-14 at 14:46 +0100, Robert Muir wrote:
On Tue, Nov 13, 2012 at 11:41 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:
Dynamically changing response formats sounds horrible.
I don't understand how this is related to my proposal to
automatically use a different data
On Sat, 2012-12-01 at 17:18 +0100, Per Steffensen wrote:
With change/merge-tracking in both systems, the important thing must be
that you do not have to throw the tracked information away before you
attempt to get your changes into the main repository.
People write commit messages in many
Mark Harwood [markharw...@yahoo.co.uk]:
Given a large range of IDs (eg your 300 million) you could constrain
the number of unique terms using a double-hashing technique e.g.
Pick a number n for the max number of unique terms you'll tolerate
e.g. 1 million and store 2 terms for every primary
From: Mark Harwood [markharw...@yahoo.co.uk]
Good point, Toke. Forgot about that. Of course doubling the number
of hash algos used to 4 increases the space massively.
Maybe your hashing-idea could work even with collisions?
Using your original two-hash suggestion, we're just about sure to get
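Mark Harwood's double-hashing idea above can be sketched outside Lucene. The Python toy below (the salted-md5 hashing, bucket count and all names are my own illustrative assumptions, not Mark's actual code) indexes two hashed terms per primary ID and requires both to match on lookup, so a false positive needs a simultaneous collision in both hash spaces:

```python
import hashlib

N = 1_000_000  # max number of unique terms we will tolerate (the "n" from the mail)

def _bucket(doc_id: str, salt: str) -> str:
    """Map an ID to one of N buckets using a salted hash."""
    digest = hashlib.md5((salt + doc_id).encode("utf-8")).hexdigest()
    return f"{salt}_{int(digest, 16) % N}"

def index_terms(doc_id: str) -> list[str]:
    """The two terms to index for a primary ID (double hashing)."""
    return [_bucket(doc_id, "h1"), _bucket(doc_id, "h2")]

def lookup_query(doc_id: str) -> list[str]:
    """A lookup must match BOTH terms; only another ID colliding in
    both the h1 and the h2 space produces a false positive."""
    return index_terms(doc_id)
```

With n = 1 million buckets per hash, the chance that a specific other ID collides in both spaces is on the order of 1/n², which is the point of storing two terms instead of one.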
On Fri, 2010-10-22 at 11:23 +0200, eks dev wrote:
Both of these solutions are just better ways to do it wrong :) The real
solution
is definitely somewhere around ParallelReader usage.
The problem with parallel is with updates of documents. The IndexWriter
takes terms and queries for
development. My gut feeling says the latter, but then again, I'm biased
by being firmly in the low-level group.
Regards,
Toke Eskildsen
of projects can benefit, but I would very much like to hear some
thoughts on this.
Thank you for listening,
Toke Eskildsen
Claudio Ranieri and I briefly discussed collator based sorting for
facets in the thread Problem with accented words sorting on the
solr-user mailing list. Here's the idea:
Solr faceting supports sorting by either count or index order. Claudio
and I both need the order to be collator-based. My
On Tue, 2012-09-11 at 17:23 +0200, Robert Muir wrote:
Just a concern where things could act a little funky:
today for example, if I set strength=primary, then it's going to fold
Test and test to the same unique term,
but under this scheme you would have bytesTest and bytestest as two terms.
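The collation-key sorting discussed above can be sketched in a few lines. This is a crude stand-in only: accent-stripping plus casefold approximates a strength=primary collator, whereas a real implementation would build keys with an ICU Collator. All names are mine:

```python
import unicodedata

def primary_key(term: str) -> str:
    """Crude stand-in for a strength=primary collation key:
    strip accents and fold case, so 'Test' and 'test' compare equal."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def sort_facet_labels(labels):
    """Sort facet labels by collation key instead of raw byte order."""
    return sorted(labels, key=primary_key)
```

Note how this exhibits exactly the concern raised above: 'Test' and 'test' get the same key for sorting purposes, yet remain two distinct terms in the index.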
On Mon, 2012-09-24 at 06:11 +0200, Robert Muir wrote:
Artifacts are here: http://s.apache.org/lusolr40rc0
Sorry to interrupt as a non-voter, but I am afraid that
https://issues.apache.org/jira/browse/SOLR-3875 might be a blocker for
4.0. Maybe a veteran could take a quick look?
- Toke Eskildsen
My low-memory sorting/faceting-hacking requires terms to be accessed by
ordinals. With Lucene 4.0 I cannot depend on TermsEnums supporting ord()
and seek(long), so the code switches to a cache that keeps track of
every X terms if they are not implemented. When the terms for an ordinal
is
,
Toke Eskildsen
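The fallback cache described above (keeping track of every X terms when ord()/seek(long) are unsupported) can be sketched like this. The classes are hypothetical Python stand-ins for a Lucene TermsEnum, not the actual hack; a real seek would of course be an index operation, not a list scan:

```python
class FakeTermsEnum:
    """Minimal stand-in for a TermsEnum without ord()/seek(long)."""
    def __init__(self, sorted_terms):
        self._terms = sorted_terms
        self._pos = -1
    def seek_exact(self, term):
        self._pos = self._terms.index(term)  # a real enum seeks efficiently
    def next(self):
        self._pos += 1
        return self._terms[self._pos]
    def term(self):
        return self._terms[self._pos]

class SampledOrdCache:
    """Cache every `rate`-th term so an ordinal can be resolved with
    one seek plus at most rate-1 next() calls."""
    def __init__(self, terms_enum, term_count, rate=128):
        self.enum = terms_enum
        self.rate = rate
        # one pass over the terms at cache-build time to collect checkpoints
        self.checkpoints = []
        for ord_ in range(term_count):
            term = terms_enum.next()
            if ord_ % rate == 0:
                self.checkpoints.append(term)
    def term_for_ord(self, ord_):
        self.enum.seek_exact(self.checkpoints[ord_ // self.rate])
        for _ in range(ord_ % self.rate):
            self.enum.next()
        return self.enum.term()
```

The memory/speed trade-off sits in `rate`: larger values mean a smaller checkpoint list but more sequential next() calls per ordinal lookup.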
the TermState could hold a reference
to the BytesRef itself, if it is needed by the implementation?
Regards,
Toke Eskildsen
2000 is strange.
Worse performance than 5000
When I ran your test (results attached), the index at 20M had 76 files
and the index at 50M had 46 files and I got the same slowdown at 20M as
you did. More segments = more merge overhead.
Thank you for sharing your test measurements,
Toke
/reviews/ssd-reliability-failure-rate,2923-3.html
It is a bit old and does not speak well for the Vertex 2 series.
So just to conclude: Lucene kills SSDs :-)
I am an accomplice to murder!? Oh Noes!
- Toke Eskildsen, happily using an old 160GB Intel X25 SSD with 11TB
written and 3 reallocated
?
- Toke Eskildsen
as the most important component
and that makes us somewhat blind to the situations where Solr is just
another cog in a complex machinery. As the choice of how Solr is
deployed is highly relevant for users and maintenance guys, hearing
their point of view is important.
- Toke Eskildsen, State
for large scale projects but
can also be used for small scale
- Toke Eskildsen, State and University Library, Denmark
-a' and
check that max user processes is sufficiently large.
If the limit is fairly low, your reboot might explain why switching to
1.7.0_10 seemed to be the solution, as you probably had fewer running
applications after reboot.
- Toke Eskildsen
line width
to be consistent.
With that in mind, I suggest that the code style recommendation is
expanded with the notion that a maximum of x characters/line should be
used, where x is something more than 80. Judging by a quick search, 120
chars seems to be a common choice.
Regards,
Toke
.
What is gained by logging queries outside of the standard logging
framework? Wouldn't it be better to create a logger with an agreed-upon
name, such as queries or interaction?
- Toke Eskildsen
Steve Rowe [sar...@gmail.com]:
From now on, only people who appear on
http://wiki.apache.org/solr/ContributorsGroup will be able to
create/modify/delete wiki pages.
TokeEskildsen would like to be added to the list and would like spammers to
suffer greatly.
at least know that I can stop searching.
- Toke Eskildsen, State and University Library, Denmark
give the same throughput with
lower memory requirements.
By the logic above, maxThreads of 100 or maybe 200 would be an
appropriate default for Jetty with Solr. So why the 10,000?
- Toke Eskildsen, State and University Library, Denmark
if a limitation on threads is on the radar for Solr 5?
Thank you,
Toke Eskildsen, State and University Library, Denmark
if there is no
real limit on burst rate.
- Toke Eskildsen
by this. Tomcat and Jetty default to allowing 200 threads.
Solr will not scale with container defaults, which is why the example
sets maxThreads to 1.
Are you talking about performance or deadlocks?
- Toke Eskildsen
Shawn Heisey [s...@elyograg.org] wrote:
On 4/27/2014 12:29 AM, Toke Eskildsen wrote:
Are you talking about performance or deadlocks?
Deadlocks. It's not a performance thing -- with only 200 threads
allowed, Jetty will refuse to start the additional threads that a large
Solr install wants
to have it as part of the Solr server
instead of outside.
Regards,
Toke Eskildsen
values or something third like I/O.
- Toke Eskildsen, State and University Library, Denmark
resources from the system while performing the search.
Can you outline what you are doing?
Related to that, why are you running 50+ shards on each machine, when
you're doing search across all shards? Why not fewer shards/machine and
less distribution overhead?
- Toke Eskildsen, State and University
a sounds
insane, but it's probably the correct mindset.
Anyway, setup accepted, problem acknowledged, your possibly re-usable solution
not understood.
- Toke Eskildsen
interesting as ID-resolving
would not take up as much of the overall processing time. But it would
make it possible to scale that number up (top-1 or above).
- Toke Eskildsen, State and University Library, Denmark
will still be able to benefit from doing the other
one.
I noticed that. Multiplying solutions are awesome.
- Toke Eskildsen, State and University Library, Denmark
the logic here: When my request is for mincount 0,
when does it ever make sense to have terms with count=0 returned from
any shard?
- Toke Eskildsen, State and University Library, Denmark
of SOLR-5894, having mincount = 1 is essential there,
but it seems like it would provide a speed-up to all distributed
faceting with a sparse result set.
Regards,
Toke Eskildsen, State and University Library, Denmark
ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley
[yo...@heliosearch.com] wrote:
On Mon, Jun 16, 2014 at 8:39 AM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:
I do not understand the logic here: When my request is for mincount 0,
when does it ever make sense to have terms
to determine with certainty, I could use a way of
performing a best-guess.
On a similar note, does Lucene have a concept of single and multi-value
stored fields or do I have to infer that by iterating all the documents
and checking each one?
- Toke Eskildsen, State and University Library, Denmark
manner.
Thanks for the pointer. As far as I can see, the demo is very explicit
about the type of DocValues being long, so no auto-guessing there. It's
a very interesting idea though, with seamless DV-enabling.
Thank you,
Toke Eskildsen, State and University Library, Denmark
On Mon, 2014-12-15 at 11:33 +0100, Michael McCandless wrote:
On Mon, Dec 15, 2014 at 4:53 AM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:
[Toke: Limit on faceting with many references]
Hmm that's probably the DocTermOrds 16 MB internal addressing limit?
Yes, we've hit that one before
for Disk. But
thanks for the suggested fix.
You could copy the code too to use newer Lucene versions…
We looked at that sometime back and the code tentacles reached too far
for us to dare grapple with.
Regards,
Toke Eskildsen, State and University Library, Denmark
FunctionValues that, unfortunately for us, are
limited to single-value. We'll have to extract the multi-values
explicitly with faceting or export, as Joel suggests, for the time
being.
- Toke Eskildsen, State and University Library, Denmark
or in the reference guide (my Google-fu
is weak).
- Toke Eskildsen, State and University Library, Denmark
=*. If a field is referenced
explicitly with fl=myfield and is DocValued but not stored, return
the DocValued value.
* State that DocValued fields, that are not stored, should be returned
with a flag: resolvedv=true
- Toke Eskildsen, State and University Library, Denmark
- much difference between 2 or 4
(or 10) segments.
- Toke Eskildsen
From: Tom Burton-West [tburt...@umich.edu]
Sent: 25 February 2015 18:11
To: dev@lucene.apache.org
Subject: Fwd: Optimize maxSegments=2 not working right with Solr 4.10.2
No replies
On Sat, 2015-04-18 at 10:07 +0300, Shai Erera wrote:
Our dev-tools/eclipse configure the project to break lines on 80
characters. Are there objections to change it to 120?
Line length was discussed back in 2013 (search for "Line length in
Lucene/Solr code") and AFAIR the conclusion was not to
contributions?
- Toke Eskildsen
this sound reasonable? Should I open a JIRA? Attempt a patch?
- Toke Eskildsen
-is-it-worth-reusing-arrays-in-java
In the case of an update-tracked structure, the cost of zeroing is
linear in the number of changed values. This makes it even harder to
determine the best strategy as it will be tied to concrete index size
and query pattern.
- Toke Eskildsen, State and University
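The update-tracked clearing mentioned above can be sketched in a few lines (names and structure are my own illustration of the general technique, not code from the thread): remember which slots were touched and zero only those, so a clear costs time proportional to the changed values rather than the full capacity:

```python
class TrackedCounters:
    """Reusable counter array that clears in time proportional to the
    number of touched slots, not the total capacity."""

    def __init__(self, size):
        self.counts = [0] * size
        self.touched = []  # indices that left zero since the last clear

    def increment(self, index):
        if self.counts[index] == 0:
            self.touched.append(index)
        self.counts[index] += 1

    def clear(self):
        for index in self.touched:  # O(changed values), not O(size)
            self.counts[index] = 0
        self.touched.clear()
```

As the mail notes, whether this beats a plain full zeroing depends on how sparse the updates are for the concrete index size and query pattern.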
that it looks very promising.
- Toke Eskildsen, State and University Library, Denmark
of
resolving its ordinal, then doing a lookup in the counter structure.
Unfortunately that does not work for Numerics.
- Toke Eskildsen, State and University Library, Denmark
med filter would (guessing here) be a
matter of writing a small alba-annotated class that takes the filter-ID
as input and returns the corresponding custom-made Filter, which really
is just a list of docIDs underneath (probably represented as a bitmap).
- Toke Eskildsen, State and University
nk you for bringing it to my attention,
Toke Eskildsen
ny feedback is very welcome.
I know very little about writing plugins, so I am in no position to qualify
how much alba helps with that: From what I can see in your GitHub
repository, it seems very accessible though.
Thank
in(rows, maxDoc)
# ScoreDoc Objects temporarily , which can trigger excessive garbage
# collection.
# Alternative: Use pagination
(https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results)
- Toke Eskildsen, State and University L
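The pagination alternative referenced above can be sketched as a generic deep-paging loop in the style of Solr's cursorMark. This is a hedged illustration decoupled from HTTP: `fetch_page` is a hypothetical callback standing in for an actual Solr request, and the end-of-results convention (the cursor repeating itself) matches how Solr signals exhaustion:

```python
def iterate_all(fetch_page, page_size=100):
    """Deep paging a la Solr cursorMark: request modest pages, pass the
    returned cursor back in, and stop when the cursor stops changing.
    `fetch_page(cursor, rows)` must return (docs, next_cursor)."""
    cursor = "*"  # Solr's conventional start-of-results cursor mark
    while True:
        docs, next_cursor = fetch_page(cursor, page_size)
        yield from docs
        if next_cursor == cursor:  # Solr repeats the mark when done
            break
        cursor = next_cursor
```

Because each request only asks for `page_size` rows, the per-request ScoreDoc-style allocation stays small no matter how many documents are exported in total.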
d. I'll take a closer look on how the debug mechanism ties into
Solr. If sanity checking fits well, I'll try and make a proof of concept
and a JIRA.
- Toke Eskildsen, State and University Library, Denmark
Thank you for the invitation and the warm welcome.
I am a 43-year-old Danish man, with a family and a job at the Royal Danish
Library, where I have been working mostly with search-related technology for 10
years.
I have done a fair bit of Lucene/Solr hacking during the years, with focus on
page got published? I never got past the "Publish
lucene site"-page and my current sort-correction is still in staging.
Maybe someone else OK'ed the change?
Thank you,
Toke Eskildsen, Danish Royal Library
That did not take long...
The initiation rite of adding my name to the committers list went well
until it was time to publish. The Publish lucene site at
https://cms.apache.org/lucene/publish
shows nothing under "Authors:" and when I press "View Diff", the
browser waits until I close the tab. I
On Wed, 2017-02-15 at 22:37 +, Toke Eskildsen wrote:
> Jan Høydahl <jan@cominvent.com> wrote:
> > https://ci.apache.org/builders/lucene-site-production
> [...]
> have been in contact with INFRA (Gavin McDonald on the HipChat-
> channel) and he kicked something loo
Jan Høydahl wrote:
> https://ci.apache.org/builders/lucene-site-production
[...]
> Toke, could you report this to INFRA perhaps? Looks like it has been failing
> for several days...
I have been in contact with INFRA (Gavin McDonald on the HipChat-channel) and
he kicked
osed win from
pre-allocating the sentinels gets shadowed by overall processing. It only works
well when hitcount is near top-N, where "near" is one of those things that are
really hard to measure properly.
- Toke Eskildsen
icles from source X, remove if
that source is deprecated",
"type", "ImportantText",
"stored", "true",
...
},
...
}
It would be great to have the content from such a documentation field
pop up in the schema browser in the GUI.
- Toke Eskildsen, State and University Library, Denmark
scenario be supported if uninversion is removed?
- Toke Eskildsen, State and University Library, Denmark
r a way
that a random end-user can easily do faceting on analyzed terms,
leveraging all the nice built-in filters in Solr.
- Toke Eskildsen, State and University Library, Denmark
n
for 5 & 6 + master. Was that correct?
Thank you,
Toke Eskildsen
nd that I to stay clear of any 'x'-versions, should they be
created by others.
Thank you,
Toke Eskildsen, Royal Danish Library
be closed? Will an accept be
reflected at the Apache repo or should one close the pull-request
without accept, and commit the code directly to the Apache-repo (by
whatever method is easiest for transferring code between git repos)?
Thank you,
Toke Eskildsen, Royal Dani
in isolation.
Ah yes. That's me being overly cautious of (non-existing) unrelated
changes. Cherry-pick with hash is the clean way.
Thank you,
Toke Eskildsen, Royal Danish Library
branch_7x and cherry-pick the changed files, check that
everything works and commit.
My I-think-I-am-doing-the-right-thing confidence level is rising, but
I'll keep asking for sanity-checks for some time.
- Toke Eskildsen, Royal Danis
grade to new major versions.
Thanks,
Toke Eskildsen
2000+ line patch that has not been reviewed.
It seems a bit forced to add it to 7.6, but on the other hand it will
be tested thoroughly as part of the release process. What is the best
action here?
- Toke Eskildsen, Roya
Doc Values, maybe I could be explained
what the problem is or directed towards more information?
- Toke Eskildsen, Royal Danish Library
disagreement about improving docValues in the ways
> you suggest.
You are right about that. I apologize if I was being unclear: It is not the
concrete patch I am asking about, that's just how this started. I am asking for
background on why it is considered misuse to use Doc Values for docume
t; So I think as usual, "it depends".
I would like to think so, as that implies that it does make sense to consider
if changes to Doc Values codec representation causes a performance regression,
when using them to populate documents.
- Toke Eskildsen
--
it did not solve your problem.
Cc: to Kranthi as he might have mailinglist-related delivery problems.
- Toke Eskildsen, Royal Danish Library
Toke Eskildsen wrote:
> Gus Heck wrote:
>> Precommit appears to be failing related to this series of commits
> I apologize and will correct it right away.
Fixed. ant precommit now passes for me on master.
Thanks for the note Gus,
To
From: Gus Heck wrote:
> Precommit appears to be failing related to this series of commits
Thank you. I clearly did not perform this step, even if I thought I did.
I apologize and will correct it right away.
Toke Eskild
as the thread is a month old.
- Toke Eskildsen, Royal Danish Library
On Tue, 2008-04-08 at 18:48 -0500, robert engels wrote:
That is opposite of my testing:...
The 'foreach' is consistently faster. The time difference is
independent of the size of the array. From what I know about JVM
implementations, the foreach version SHOULD always be faster -
because
an easy alternative to buying more RAM would be nice. I would
like to hear if Exposed sounds like a feasible idea to the more seasoned
Lucene developers.
Regards,
Toke Eskildsen
methods both for simple Locale and for
custom sorting, so I guess it would be the same for Exposed.
Regards,
Toke Eskildsen
- to my knowledge - loads the Strings into memory.
For my quick test, this means a tripling of memory usage for the sort field
when indexing collatorKeys?
Regards,
Toke Eskildsen
vs. the
10M*log2(10M)/8 = 27MB for a compressed order array.
Still, depending on how little space a byte-array will take in flex, using the
indexed collator key approach might turn out to be the best choice in a lot of
cases as it works really well for incremental updates.
Regards,
Toke
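The back-of-the-envelope figure above checks out if "MB" is read as MiB. A tiny helper (names mine) reproduces it; note that a real packed-ints implementation would round up to whole bits per value (24 bits for 10M terms, i.e. roughly 28.6 MiB rather than the theoretical ~27.7 MiB):

```python
import math

def packed_order_array_mib(num_terms: int) -> float:
    """Theoretical size of an order array packed at log2(n) bits per
    entry, in MiB. This is the 10M*log2(10M)/8 figure from the mail."""
    bits_per_entry = math.log2(num_terms)
    total_bytes = num_terms * bits_per_entry / 8
    return total_bytes / 2**20
```

For 10 million terms, log2(10M) is about 23.25 bits per entry, which lands at roughly 27.7 MiB.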
The current subject and description of
https://issues.apache.org/jira/browse/LUCENE-2335
is obsolete due to new knowledge.
Is it possible to change it? If not, what is the policy here? To open a
new issue and close the old one?
Cc: To Michael McCandless as he is the reporter of the issue.
If
: 4.0
Environment: Fast IO when huge hierarchies are used
Reporter: Toke Eskildsen
Hierarchical faceting with slow startup, low memory overhead and fast response.
Distinguishing features as compared to SOLR-64 and SOLR-792 are
* Multiple paths per document
* Query-time
[
https://issues.apache.org/jira/browse/SOLR-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Toke Eskildsen updated SOLR-2412:
-
Attachment: SOLR-2412.patch
Alpha-level patch (aka Proof Of Concept). Works with trunk@1066767
[
https://issues.apache.org/jira/browse/SOLR-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007411#comment-13007411
]
Toke Eskildsen commented on SOLR-2412:
--
The syntax for calling is kept close to SOLR
[
https://issues.apache.org/jira/browse/SOLR-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007497#comment-13007497
]
Toke Eskildsen commented on SOLR-2403:
--
Dividing by shard count is fairly risky
[
https://issues.apache.org/jira/browse/SOLR-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007513#comment-13007513
]
Toke Eskildsen commented on SOLR-2403:
--
My first example was hills, while the second
[
https://issues.apache.org/jira/browse/SOLR-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13009627#comment-13009627
]
Toke Eskildsen commented on SOLR-2396:
--
A rough idea: It seems that ICU Collator Keys
[
https://issues.apache.org/jira/browse/SOLR-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13009665#comment-13009665
]
Toke Eskildsen commented on SOLR-2396:
--
The JavaDoc for CollationKey is very explicit
[
https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055402#comment-13055402
]
Toke Eskildsen commented on LUCENE-3079:
This is quite another design than
[
https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055480#comment-13055480
]
Toke Eskildsen commented on LUCENE-3079:
SOLR-2412/LUCENE-2369 were created
[
https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056377#comment-13056377
]
Toke Eskildsen commented on LUCENE-3079:
The patch compiles neatly against a 3x
[
https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13056517#comment-13056517
]
Toke Eskildsen commented on LUCENE-3079:
Some preliminary performance testing: I