Re: Using synonyms API

2015-04-15 Thread Yonik Seeley
I just tried this quickly on trunk and it still works.

/opt/code/lusolr_trunk$ curl
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english

{
  "responseHeader":{
    "status":0,
    "QTime":234},
  "synonymMappings":{
    "initArgs":{
      "ignoreCase":true,
      "format":"solr"},
    "initializedOn":"2015-04-14T19:39:55.157Z",
    "managedMap":{
      "GB":["GiB","Gigabyte"],
      "TV":["Television"],
      "happy":["glad","joyful"]}}}


Verify that your URL has the correct port number (your example below
doesn't), and that "default-collection" is actually the name of your
default collection (and not "collection1" which is the default for the
4x series).
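As a side note, here is a tiny Python sketch (illustrative only, not Solr code) of how a managed map with "ignoreCase":true, like the one in the response above, resolves a term to its synonyms:

```python
def lookup_synonyms(managed_map, term, ignore_case=True):
    """Resolve a term against a managed synonym map.

    With ignore_case=True, keys match case-insensitively,
    mirroring the "ignoreCase":true initArg shown above.
    """
    if ignore_case:
        folded = {k.lower(): v for k, v in managed_map.items()}
        return folded.get(term.lower(), [])
    return managed_map.get(term, [])

# The managedMap from the response above:
managed_map = {
    "GB": ["GiB", "Gigabyte"],
    "TV": ["Television"],
    "happy": ["glad", "joyful"],
}

print(lookup_synonyms(managed_map, "gb"))                      # case-insensitive hit
print(lookup_synonyms(managed_map, "gb", ignore_case=False))   # exact-case miss
```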

-Yonik


On Wed, Apr 15, 2015 at 11:11 AM, Mike Thomsen  wrote:
> We recently upgraded from 4.5.0 to 4.10.4. I tried getting a list of our
> synonyms like this:
>
> http://localhost/solr/default-collection/schema/analysis/synonyms/english
>
> I got a not found error. I found this page on new features in 4.8
>
> http://yonik.com/solr-4-8-features/
>
> Do we have to do something like this with our schema to even get the
> synonyms API working?
>
> <fieldType name="managed_en" class="solr.TextField"
> positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ManagedStopFilterFactory" managed="english" />
>     <filter class="solr.ManagedSynonymFilterFactory" managed="english" />
>   </analyzer>
> </fieldType>
>
> I wanted to ask before changing our schema.
>
> Thanks,
>
> Mike


JSON Facet & Analytics API in Solr 5.1

2015-04-14 Thread Yonik Seeley
Folks, there's a new JSON Facet API in the just released Solr 5.1
(actually, a new facet module under the covers too).

It's marked as experimental so we have time to change the API based on
your feedback.  So let us know what you like, what you would change,
what's missing, or any other ideas you may have!

I've just started the documentation for the reference guide (on our
confluence wiki), so for now the best doc is on my blog:

http://yonik.com/json-facet-api/
http://yonik.com/solr-facet-functions/
http://yonik.com/solr-subfacets/

I'll also be hanging out more on the #solr-dev IRC channel on freenode
if you want to hit me up there about any development ideas.

-Yonik


Re: sort on facet.index?

2015-04-02 Thread Yonik Seeley
On Thu, Apr 2, 2015 at 10:25 AM, Ryan Josal  wrote:
> Sorting the result set or the facets?  For the facets there is
> facet.sort=index (lexicographically) and facet.sort=count.  So maybe you
> are asking if you can sort by index, but reversed?  I don't think this is
> possible, and it's a good question.

The new facet module that will be in Solr 5.1 supports sorting both
directions on both count and index order (as well as by statistics /
bucket aggregations).
http://yonik.com/json-facet-api/

-Yonik


Re: Facet sorting algorithm for index

2015-04-02 Thread Yonik Seeley
On Thu, Apr 2, 2015 at 9:44 AM, Yago Riveiro  wrote:
> Where can I find the source code used in index sorting? I need to ensure
> that the external data has the same sorting as the facet result.

If you step over the indexed terms of a field you get them in sorted
order (hence for a single node, the sorting is done at indexing time).
Lucene index order for text will essentially be unicode code point order.
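To illustrate (a Python sketch, not Lucene code): default string comparison by raw Unicode code point, which is also what `LC_ALL=C sort` does for ASCII input, gives the same order you'd see stepping over a text field's indexed terms:

```python
terms = ["TV", "apple", "Zebra", "tv", "\u00e1pple"]

# Python compares strings by Unicode code point, so sorted() here
# models Lucene's term ordering for text fields: all uppercase ASCII
# sorts before lowercase, which sorts before accented characters.
by_codepoint = sorted(terms)
print(by_codepoint)  # ['TV', 'Zebra', 'apple', 'tv', 'ápple']
```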

-Yonik


Re: Facet sorting algorithm for index

2015-04-02 Thread Yonik Seeley
On Thu, Apr 2, 2015 at 6:36 AM, yriveiro  wrote:
> Hi,
>
> I have an external application that uses the output of a facet to join
> another dataset using the keys of the facet result.
>
> The facet query uses index sort, but at some point my application crashes
> because the order of the keys "is not correct". A unix sort over
> the keys of the result with LC_ALL=C doesn't produce the same order.
>
> I identified a case like this:
>
> 760d1f833b764591161\"84b20f28242a0
> 760d1f833b76459116184b20f2
>
> Why does the line with the '\"' come first? Is this sequence of chars the
> single character ", or is it raw, i.e. actually 2 chars?
>
> In ASCII the " has lower ord than character 8, if \" is " then this sort
> makes sense ...

How are you viewing the results?  If it's JSON, then yes the backslash
double quote would mean that there is just a literal double quote in
the string.
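A quick Python check of both points (a sketch using the two keys from the question):

```python
import json

# The two keys exactly as they appear in the JSON response body:
raw = '["760d1f833b764591161\\"84b20f28242a0", "760d1f833b76459116184b20f2"]'
keys = json.loads(raw)

# The \" in the JSON text decodes to a single literal double-quote char:
print('"' in keys[0])  # True

# In code-point order, '"' (0x22) sorts before '8' (0x38), so the
# key containing the quote correctly comes first:
print(sorted(keys) == keys)  # True
```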

-Yonik


Re: How to create a core by API?

2015-03-26 Thread Yonik Seeley
On Thu, Mar 26, 2015 at 1:45 PM, Mark E. Haase  wrote:
> I'm not saying you're wrong. The configSet parameter doesn't work at all in
> my set up, so you might be right... I'm just wondering where that's
> documented.

Trying on current trunk, I got it to work:

/opt/code/lusolr_trunk/solr$ curl -XPOST
"http://localhost:8983/solr/admin/cores?action=CREATE&name=demo3&instanceDir=demo3&configSet=basic_configs";



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">769</int>
  </lst>
  <str name="core">demo3</str>
</response>


Although I'm not thrilled with a different parameter name  for cloud
vs non-cloud.  I come from the camp that believes that overloading is
both natural and easily understood (e.g.  I don't find "foo" + "bar"
and 1.5 + 2.5 both using "+" confusing).

-Yonik


Re: schemaless slow indexing

2015-03-23 Thread Yonik Seeley
On Mon, Mar 23, 2015 at 1:54 PM, Alexandre Rafalovitch
 wrote:
> I looked at SOLR-7290, but I think the discussion should stay on the
> mailing list for at least one more iteration.
>
> My understanding that the reason copyField exists is so that a search
> actually worked out of the box. Without knowing the field names, one
> cannot say what to search.

Some points:
- Schemaless is often just to make it easier to get started.
- If one assumes a lack of knowledge of field names, that's an issue
for non-schemaless too.
- Full-text search is only one use-case that people use Solr for...
there's lots of sorting/faceting/analytics use cases.
- Bad performance by default is bad.  People tend to do benchmarks
and make sweeping conclusions based on those.


-Yonik


Re: schemaless slow indexing

2015-03-22 Thread Yonik Seeley
I took a quick look at the stock schemaless configs... unfortunately
they contain a performance trap.
There's a copyField by default that copies *all* fields to a catch-all
field called "_text".

IMO, that's not a great default.  Double the index size (well, the
"index" portion of it at least... not stored fields), and slower
indexing performance.

The other unfortunate thing is the name.  Nowhere else in Solr (that
I know of) do we have a single-underscore field name.  _text looks
more like a dynamicField pattern.  Our other fields with underscores
look like _version_ and _root_.  If we're going to start a new naming
convention (or expand the naming conventions) we need to have some
consistency and logic behind it.

-Yonik

On Sun, Mar 22, 2015 at 12:32 PM, Mike Murphy  wrote:
> I start up solr schemaless and index a bunch of data, and it takes a
> lot longer to finish indexing.
> No configuration changes, just straight schemaless.
>
> --Mike
>
> On Sun, Mar 22, 2015 at 12:27 PM, Erick Erickson
>  wrote:
>> Please review: http://wiki.apache.org/solr/UsingMailingLists
>>
>> You haven't quantified the slowdown. Or given any details on how
>> you're measuring the "slowdown". Or how you've configured your setups
>> in 4.10 and 5.0. Or... As Hossman would say, "details matter".
>>
>> Best,
>> Erick
>>
>> On Sun, Mar 22, 2015 at 8:35 AM, Mike Murphy  wrote:
>>> I'm trying out schemaless in solr 5.0, but the indexing seems quite a
>>> bit slower than it did in the past on 4.10.  Any pointers?
>>>
>>> --Mike


Re: Solr hangs / LRU operations are heavy on cpu

2015-03-20 Thread Yonik Seeley
The document cache is not really going to be taking up time here.
How many concurrent requests (threads) are you testing with here?

One thing I've seen over the years is a false sense of what is taking
up time when benchmarks with a lot of threads are used.  The reason is
that when there are a lot more threads than CPUs, it's natural for
context switches to happen where synchronizations happen.  You look at
a profiler or thread dumps, and you see a bunch of threads piled up on
synchronization.  This does not mean that removing that
synchronization will really help anything... the threads can't all run
at once.

-Yonik


On Thu, Mar 19, 2015 at 6:35 PM, Sergey Shvets  wrote:
> Hi,
>
> we have quite a problem with Solr. We are running it in a config 6x3, and
> suddenly solr started to hang, taking all the available cpu on the nodes.
>
> In the threads dump noticed things like this can eat lot of CPU time
>
>
>  - org.apache.solr.search.LRUCache.put(LRUCache.java:116)
>  - org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:705)
>  - org.apache.solr.response.BinaryResponseWriter$Resolver.writeResultsBody(BinaryResponseWriter.java:155)
>  - org.apache.solr.response.BinaryResponseWriter$Resolver.writeResults(BinaryResponseWriter.java:183)
>  - org.apache.solr.response.BinaryResponseWriter$Resolver.resolve(BinaryResponseWriter.java:88)
>  - org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:158)
>  - org.apache.solr.common.util.JavaBinCodec.writeNamedList(JavaBinCodec.java:148)
>  - org.apache.solr.common.util.JavaBinCodec.writeKnownType(JavaBinCodec.java:242)
>  - org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:153)
>  - org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:96)
>  - org.apache.solr.response.BinaryResponseWriter.write(BinaryResponseWriter.java:52)
>  - org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:758)
>  - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426)
>  - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>  - org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
>  - org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
>  - org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
>  - org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
>  - org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
>  - org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
>  - org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
>  - org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
>
>
> The cache itself is very minimalistic
>
>
> <filterCache class="solr.FastLRUCache" size="512" initialSize="512"
> autowarmCount="0"/>
> <queryResultCache class="solr.LRUCache" size="512"
> initialSize="512" autowarmCount="0"/>
> <documentCache class="solr.LRUCache" size="512" initialSize="512"
> autowarmCount="0"/>
> <fieldValueCache class="solr.FastLRUCache" size="512"
> autowarmCount="256" showItems="10" />
> <cache name="perSegFilter" class="solr.search.LRUCache" size="10"
> initialSize="0" autowarmCount="10"
> regenerator="solr.NoOpRegenerator"/>
> <enableLazyFieldLoading>true</enableLazyFieldLoading>
> <queryResultWindowSize>20</queryResultWindowSize>
> <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
>
> Solr version is 4.10.3
>
> Any of help is appreciated!
>
> sergey


Re: Facet pivot sorting while combining Stats Component With Pivots in Solr 5

2015-03-19 Thread Yonik Seeley
On Fri, Mar 13, 2015 at 1:43 PM, Dominique Bejean
 wrote:
> Thank you for the response
>
> This is something Heliosearch can do. Ionic Seeley, created a JIRA ticket
> to back port this feature to Solr 5.

Oh, I'm charged now, am I?  ;-)

It's been committed, and will be in Solr 5.1.

Here's an example of sorting the buckets by something other than count:

$ curl http://localhost:8983/solr/query -d 'q=*:*&
  json.facet={
    categories:{
      terms:{
        field : cat,
        sort : "x desc",   // can also use sort:{x:desc}
        facet:{
          x : "avg(price)",
          y : "sum(price)"
        }
      }
    }
  }
'
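Conceptually (a Python sketch, not the facet module's actual implementation), sorting buckets by an aggregation like avg(price) works like this:

```python
docs = [
    {"cat": "book",  "price": 10.0},
    {"cat": "book",  "price": 30.0},
    {"cat": "music", "price": 5.0},
    {"cat": "games", "price": 50.0},
]

# Group docs into buckets by field value.
buckets = {}
for d in docs:
    buckets.setdefault(d["cat"], []).append(d["price"])

# Compute the per-bucket aggregations x=avg(price), y=sum(price).
stats = [
    {"val": cat, "count": len(ps), "x": sum(ps) / len(ps), "y": sum(ps)}
    for cat, ps in buckets.items()
]

# sort : "x desc" -- order buckets by the avg(price) aggregation.
stats.sort(key=lambda b: b["x"], reverse=True)
print([b["val"] for b in stats])  # ['games', 'book', 'music']
```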

-Yonik


Re: Solr tlog and soft commit

2015-03-15 Thread Yonik Seeley
On Sun, Mar 15, 2015 at 12:09 PM, Erick Erickson
 wrote:
> 1> Well, probably not. Hate to be confusing here, but if your ramBufferSizeMB
> setting is exceeded, then internal buffers will be flushed to the
> currently open segment in the
> index directory.

It's even more confusing though...
if you do a few adds and then do a soft commit, a new small segment
will be created and flushed to the Directory, but not fsync'd.  But by
default, the directory we use is NRTCachingDirectory which caches
small segments in memory, so those small segments won't even get
written to disk until a hard commit forces them out of the cache.
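A rough Python model of that flow (illustrative only -- the class and method names below are made up, not Solr/Lucene APIs):

```python
class SmallSegmentCache:
    """Toy model of NRTCachingDirectory: a soft commit flushes a small
    segment into an in-memory cache (searchable, not durable); a hard
    commit pushes everything to 'disk' (a dict here) and fsyncs."""

    def __init__(self, max_cached_bytes=1024):
        self.max_cached_bytes = max_cached_bytes
        self.ram = {}    # small segments, searchable but volatile
        self.disk = {}   # durable storage

    def soft_commit(self, name, data):
        if len(data) <= self.max_cached_bytes:
            self.ram[name] = data      # small segment stays in RAM
        else:
            self.disk[name] = data     # large segment goes straight to disk

    def hard_commit(self):
        self.disk.update(self.ram)     # flush and fsync everything
        self.ram.clear()

d = SmallSegmentCache()
d.soft_commit("_0.cfs", b"x" * 100)
print(sorted(d.ram), sorted(d.disk))   # cached in RAM, nothing on disk yet
d.hard_commit()
print(sorted(d.ram), sorted(d.disk))   # now durable on disk
```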

-Yonik


> You still won't be able to search it since no commits
> happened. You
> really have little control over when this happens.
>
> And, to make it more confusing still, if your process abnormally terminates,
> these docs _still_ won't be searchable when the node comes back up until 
> they're
> replayed from the transaction log. Since the segment was never closed, the 
> docs
> are invisible. But since they were in the tlog, they'll be recovered. Unclosed
> segment files will be cleaned up though.
>
> So usually you're right, an update won't change anything in the index 
> directory.
> Except sometimes ;).
>
> The net-net here is that if you're NOT issuing any commits for a long
> time, you'll
> see the tlog grow pretty steadily, _and_ upon occasion you'll see step-wise
> jumps in the size of the index directory.
>
> 2> Nothing. This is just making stuff in the not-yet-committed state available
> for searching, all in memory.
>
> 3> Not quite sure what you're asking here. The doc will be in memory
> and the tlog,
> optionally it may have been flushed to the current index segment
> (although still not
> searchable).
>
>
> Best,
> Erick
>
> On Sun, Mar 15, 2015 at 7:11 AM, vidit.asthana  
> wrote:
>> Thanks for reply Yonik. I am very new to solrcloud and trying to understand
>> how the update requests are handled and what exactly happens at file system
>> level.
>>
>> 1. So let's say I send an update request, and I don't issue any type of
>> commit(neither hard nor soft), so will the document ever touch index
>> directory? From the blog, I understand that it gets written to tlog
>> directory.
>>
>> 2. Now if I issue a soft commit, then what will happen inside the index
>> directory?
>>
>> 3. Until I issue a soft commit, where will that document
>> reside (completely in memory)?
>>
>>
>>
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Solr-tlog-and-soft-commit-tp4193105p4193109.html
>> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr tlog and soft commit

2015-03-15 Thread Yonik Seeley
Your basic assumptions about the underlying mechanisms are incorrect.
The size of the index has nothing to do with the transaction logs...
and transaction logs are never "written to index" except in recovery.
You would see the same index size behavior w/o transaction logs, and
it has to do with some data being cached in memory on soft commits but
always being flushed to disk on hard commits.

-Yonik


On Sun, Mar 15, 2015 at 9:05 AM, vidit.asthana  wrote:
> I want to know what gets written to the index from the tlog directory
> whenever a soft commit is issued.
>
> I have a test SolrCloud setup and I can see that even if I disable the
> hardcommit, and if I only issue soft commits, then also index directory
> keeps increasing little by little, so I am presuming that something gets
> written to it.
>
> When I issue a hard commit then index directory size grows drastically - as
> expected.
>
> I have read this awesome post -
> http://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> but this doesn't explain the above mentioned behavior.
>
> Thanks in advance.
>
> Vidit
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-tlog-and-soft-commit-tp4193105.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: backport Heliosearch features to Solr

2015-03-09 Thread Yonik Seeley
Thanks everyone for voting!

Result charts (note that these auto-generated charts don't show blanks
as equivalent to "0")
https://docs.google.com/forms/d/1gaMpNpHVdquA3q75yiFhqZhAWdWB-K6N8Jh3dBbWAU8/viewanalytics

Raw results spreadsheet (correlations can be interesting), and
percentages at the bottom.
https://docs.google.com/spreadsheets/d/1uZ2qgOaKx1ZxJ_NKwj2zIAYFQ9fp8OrEPI5hqadcPeY/

-Yonik


On Sun, Mar 1, 2015 at 4:50 PM, Yonik Seeley  wrote:
> As many of you know, I've been doing some work in the experimental
> "heliosearch" fork of Solr over the past year.  I think it's time to
> bring some more of those changes back.
>
> So here's a poll: Which Heliosearch features do you think should be
> brought back to Apache Solr?
>
> http://bit.ly/1E7wi1Q
> (link to google form)
>
> -Yonik


Re: backport Heliosearch features to Solr

2015-03-01 Thread Yonik Seeley
On Sun, Mar 1, 2015 at 7:18 PM, Otis Gospodnetic
 wrote:
> Hi Yonik,
>
> Now that you joined Cloudera, why not everything?

Everything is on the table, but from a practical point of view I
wanted to verify areas of user interest/support before doing the work
to get things back.

Even when there is user support, some things may be blocked anyway
(part of the reason why I did things under a fork in the first place).
I'll do what I can though.

-Yonik


> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Sun, Mar 1, 2015 at 4:50 PM, Yonik Seeley  wrote:
>
>> As many of you know, I've been doing some work in the experimental
>> "heliosearch" fork of Solr over the past year.  I think it's time to
>> bring some more of those changes back.
>>
>> So here's a poll: Which Heliosearch features do you think should be
>> brought back to Apache Solr?
>>
>> http://bit.ly/1E7wi1Q
>> (link to google form)
>>
>> -Yonik
>>


backport Heliosearch features to Solr

2015-03-01 Thread Yonik Seeley
As many of you know, I've been doing some work in the experimental
"heliosearch" fork of Solr over the past year.  I think it's time to
bring some more of those changes back.

So here's a poll: Which Heliosearch features do you think should be
brought back to Apache Solr?

http://bit.ly/1E7wi1Q
(link to google form)

-Yonik


Re: AND query not working on stopwords as expected

2015-02-16 Thread Yonik Seeley
On Mon, Feb 16, 2015 at 4:32 PM, Arun Rangarajan
 wrote:
[...]
> This query
> q=name:of&rows=0
> gives no results as expected.
>
> However, this query:
> q=name:of AND all_class_ids:(371)&rows=0
> gives results and is equal to the same number of results as
> q=all_class_ids:(371)&rows=0
>
> This is happening only for stopwords. Why?

This is more of a full-text search thing.
Removal of stopwords is more like a "don't care, it's not important".
Hence a query for "a plane" should return all documents containing
"plane", ignoring the question of if the document contained an "a"
(which we can't tell since stopwords were removed during indexing).

Now I understand your point about consistency too.  Using the example
above, something like q=name:of should arguably match all documents
(or at least all documents with a "name" field).  It is very odd to
add an additional restriction and end up with more docs.
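A toy Python model of the behavior described above (not Solr's analysis chain, just the same idea):

```python
STOPWORDS = {"a", "an", "the", "of"}

def analyze(text):
    """Lowercase, split on whitespace, and drop stopwords --
    a toy stopword-filtering analysis chain applied at both
    index time and query time."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

docs = ["A plane landed", "The cat sat"]
index = [set(analyze(d)) for d in docs]

def matches(query):
    terms = analyze(query)
    # Every surviving query term must match; removed stopwords
    # impose no restriction at all ("don't care").
    return [i for i, doc_terms in enumerate(index)
            if all(t in doc_terms for t in terms)]

print(matches("a plane"))  # [0] -- the "a" was a don't-care
print(matches("of"))       # [0, 1] -- the whole query was stopwords
```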

-Yonik


Re: Query always fail if row value is too high

2015-02-09 Thread Yonik Seeley
Hmmm, that's interesting...
It looks like a container (jetty/tomcat or whatever) configuration
limit somewhere.  I'd only expect this error from Solr when trying to
send something really large though - notice "upload" in the error.  Is
this error message really from Solr or another piece of your system?

If this error message is from Solr, please open a JIRA issue so this
doesn't get lost.

-Yonik


On Mon, Feb 9, 2015 at 10:29 AM, yriveiro  wrote:
> I'm trying to retrieve from Solr a query in CSV format with around 500K
> records, and I always get this error:
>
> "Expected mime type application/octet-stream but got application/xml. <?xml
> version=\"1.0\" encoding=\"UTF-8\"?>\n<response>\n<lst name=\"error\"><str
> name=\"msg\">application/x-www-form-urlencoded content length (6040427
> bytes) exceeds upload limit of 2048 KB</str><int
> name=\"code\">400</int></lst>\n</response>\n"
>
> If the rows value is lower, like 5 the query doesn't fail.
>
> What I'm doing wrong?
>
>
>
> -
> Best regards
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Query-always-fail-if-row-value-is-too-high-tp4185047.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Deep paging in solr using cursorMark

2015-01-27 Thread Yonik Seeley
On Tue, Jan 27, 2015 at 3:29 AM, CKReddy Bhimavarapu
 wrote:
> Hi,
>  Using cursorMark we overcome deep paging, so far so good. As far
> as I understand, the cursorMark is unique for each query, depending on
> the sort values other than the unique id, and also on the number of rows.
>  But my concern is: if solr internally creates a different set for
> each different query's sort values, and they last forever, then
> 1. if they last forever, do they consume server ram or not?
> 2. if they occupy server ram, is there any way to destroy or clean
> them?

No, there is no server-side state cached.  Think of it as a cookie.
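A Python sketch of that cookie analogy (illustrative only -- the real cursorMark encodes the last document's sort values differently, but the stateless principle is the same):

```python
import base64
import json

# A sorted result set: score desc, then unique id asc as tie-breaker.
docs = sorted([{"id": i, "score": i % 3} for i in range(10)],
              key=lambda d: (-d["score"], d["id"]))

def next_page(cursor, rows=3):
    """Stateless deep paging: the cursor is just the last returned
    doc's sort position, encoded like a cookie.  Nothing is cached
    server-side between requests."""
    start = 0
    if cursor != "*":
        last = json.loads(base64.b64decode(cursor))
        # Resume strictly after the last (score desc, id asc) position.
        start = next(i + 1 for i, d in enumerate(docs)
                     if [-d["score"], d["id"]] == last)
    page = docs[start:start + rows]
    mark = base64.b64encode(
        json.dumps([-page[-1]["score"], page[-1]["id"]]).encode()
    ).decode() if page else cursor
    return page, mark

page1, mark = next_page("*")
page2, _ = next_page(mark)
print([d["id"] for d in page1], [d["id"] for d in page2])  # [2, 5, 8] [1, 4, 7]
```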

-Yonik


Re: leader split-brain at least once a day - need help

2015-01-08 Thread Yonik Seeley
It's worth noting that those messages alone don't necessarily signify
a problem with the system (and it wouldn't be called "split brain").
The async nature of updates (and thread scheduling) along with
stop-the-world GC pauses that can change leadership, cause these
little windows of inconsistencies that we detect and log.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy  wrote:
> Hi there,
>
> we are running a 3 server cloud serving a dozen
> single-shard/replicate-everywhere collections. The 2 biggest collections are
> ~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat
> 7.0.56, Oracle Java 1.7.0_72-b14
>
> 10 of the 12 collections (the small ones) get filled by DIH full-import once
> a day starting at 1am. The second biggest collection is updated usind DIH
> delta-import every 10 minutes, the biggest one gets bulk json updates with
> commits once in 5 minutes.
>
> On a regular basis, we have a leader information mismatch:
> org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it
> is coming from leader, but we are the leader
> or the opposite
> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
> says we are the leader, but locally we don't think so
>
> One of these pop up once a day at around 8am, making either some cores going
> to "recovery failed" state, or all cores of at least one cloud node into
> state "gone".
> This started out of the blue about 2 weeks ago, without changes to neither
> software, data, or client behaviour.
>
> Most of the time, we get things going again by restarting solr on the
> current leader node, forcing a new election - can this be triggered while
> keeping solr (and the caches) up?
> But sometimes this doesn't help, we had an incident last weekend where our
> admins didn't restart in time, creating millions of entries in
> /solr/oversser/queue, making zk close the connection, and leader re-elect
> fails. I had to flush zk, and re-upload collection config to get solr up
> again (just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
>
> We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500
> requests/s) up and running, which does not have these problems since
> upgrading to 4.10.2.
>
>
> Any hints on where to look for a solution?
>
> Kind regards
> Thomas
>
> --
> Thomas Lamy
> Cytainment AG & Co KG
> Nordkanalstrasse 52
> 20097 Hamburg
>
> Tel.: +49 (40) 23 706-747
> Fax: +49 (40) 23 706-139
> Sitz und Registergericht Hamburg
> HRA 98121
> HRB 86068
> Ust-ID: DE213009476
>


Re: Loading data to FieldValueCache

2014-12-29 Thread Yonik Seeley
On Fri, Dec 26, 2014 at 12:26 PM, Erick Erickson
 wrote:
> I don't know the complete algorithm, but if the number of docs that
> satisfy the fq is "small enough",
> then just the internal Lucene doc IDs are stored rather than a bitset.

If smaller than maxDoc/64 ids are collected, a sorted int set is used
instead of a bitset.
Also, the enum method can skip caching for the "smaller" terms:

facet.enum.cache.minDf=100
might be good for general purpose.
Or set the value really high to not use the filter cache at all.
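The sorted-int-set vs bitset tradeoff can be sketched in a few lines of Python (a toy model, not Solr's DocSet code -- the maxDoc/64 threshold is the one described above):

```python
def docset_repr(doc_ids, max_doc):
    """Pick the smaller filter-cache representation: a sorted int
    array when few docs match, a fixed-size bitset otherwise."""
    doc_ids = list(doc_ids)
    if len(doc_ids) < max_doc // 64:
        return ("sorted_ints", sorted(doc_ids))
    # One bit per document in the index, regardless of match count.
    bits = bytearray(max_doc // 8 + 1)
    for d in doc_ids:
        bits[d // 8] |= 1 << (d % 8)
    return ("bitset", bytes(bits))

max_doc = 1_000_000
kind_small, _ = docset_repr(range(100), max_doc)       # sparse result
kind_large, _ = docset_repr(range(200_000), max_doc)   # dense result
print(kind_small, kind_large)  # sorted_ints bitset
```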

-Yonik


Re: 'Illegal character in query' on Solr cloud 4.10.1

2014-12-24 Thread Yonik Seeley
On Wed, Dec 24, 2014 at 4:32 PM, Erick Erickson  wrote:
> OK, then I don't think it's a Solr problem. I think 5 of your Tomcats are
> configured in such a way that they consider ^ to be an illegal character.

Hmmm, the stack trace in SOLR-5971 shows a different user (who gets
the same error message) running in Jetty.

Without looking into it further, I thought it most likely an issue
with the proxying code.

I don't think distrib=false will prevent a node from proxying a query
to another node that can actually handle that query.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Details on why ConccurentUpdateSolrServer is reccommended for maximum index performance

2014-12-11 Thread Yonik Seeley
On Thu, Dec 11, 2014 at 11:52 AM, Alexandre Rafalovitch
 wrote:
> On 11 December 2014 at 11:40, Yonik Seeley  wrote:
>> So to Solr (server side), it looks like a single update request
>> (assuming 1 thread) with a batch of multiple documents... but it was
>> never actually "batched" on the client side.
>
> Does Solr also indexes them one-by-one as it parses them off the -
> chunked -  stream?

Yes, indexing is streaming (a document at a time is read off the
stream and then immediately indexed).

-Yonik


Re: Details on why ConccurentUpdateSolrServer is reccommended for maximum index performance

2014-12-11 Thread Yonik Seeley
On Wed, Dec 10, 2014 at 6:09 PM, Erick Erickson  wrote:
> So CUSS will do something like this:
> 1> assemble a packet for Solr
> 2> pass off the actual transmission
>  to Solr to a thread and immediately
>  go back to <1>.
>
> Basically, CUSS is doing async processing.

The more important part about what it's doing is the *streaming*.
CUSS is like batching documents without waiting for all of the
documents in the batch.
When you add a document, it immediately writes it to a stream where
solr can read it off and index it.  When you add a second document,
it's immediately written to the same stream (or at least one of the
open streams), as part of the same udpate request.  No separate HTTP
request, No separate update request.

The number of threads parameter for CUSS actually maps to the number
of open connections to Solr (and hence the number of concurrently
streaming update requests).

So to Solr (server side), it looks like a single update request
(assuming 1 thread) with a batch of multiple documents... but it was
never actually "batched" on the client side.
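A toy Python model of that streaming behavior (a queue stands in for the chunked HTTP request body; this is a sketch, not SolrJ code):

```python
import queue
import threading

def concurrent_update_stream(docs_to_add):
    """Toy model of CUSS streaming: the client thread drops documents
    onto an open stream and a server thread indexes each one as it
    arrives -- one update request, and no client-side batch ever
    materializes."""
    stream = queue.Queue()
    indexed = []

    def server():
        while True:
            doc = stream.get()
            if doc is None:           # end of the update request
                break
            indexed.append(doc)       # index immediately, one doc at a time

    t = threading.Thread(target=server)
    t.start()
    for doc in docs_to_add:
        stream.put(doc)               # add() returns right away
    stream.put(None)
    t.join()
    return indexed

print(concurrent_update_stream([{"id": 1}, {"id": 2}, {"id": 3}]))
```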

-Yonik


Re: How to stop Solr tokenising search terms with spaces

2014-12-09 Thread Yonik Seeley
On Tue, Dec 9, 2014 at 12:49 PM, Dinesh Babu  wrote:
>
> But my requirement is for A* B* to be treated as A* B*. A* OR B* won't meet my requirement.

The syntax is what it is...  With the complexphrase parser, if you
want a phrase, you need to surround the clauses with double quotes:
"A* B*"

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


> We have chosen the NGram solution and it is working for our requirement
> at the moment. Thanks for your input and help Yonik.
>
> Regards,
> Dinesh Babu.
>
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: 08 December 2014 17:58
> To: solr-user@lucene.apache.org
> Subject: Re: How to stop Solr tokenising search terms with spaces
>
> On Mon, Dec 8, 2014 at 12:01 PM, Erik Hatcher  wrote:
>> debug output tells a lot.  Looks like in the last two examples that the 
>> second part (Viewpoint*) is NOT parsed with the complex phrase parser - the 
>> whitespace thwarts it.
>
> Actually, it looks like it is, but you're not telling the complex phrase 
> parser to put the two clauses in a phrase.  You need the quotes.
>
> Even for complexphrase parser
> A* B*  is the same as A* OR B*
>
> -Yonik
> http://heliosearch.org - native code faceting, facet functions, sub-facets, 
> off-heap data


Re: How to stop Solr tokenising search terms with spaces

2014-12-08 Thread Yonik Seeley
On Mon, Dec 8, 2014 at 12:01 PM, Erik Hatcher  wrote:
> debug output tells a lot.  Looks like in the last two examples that the 
> second part (Viewpoint*) is NOT parsed with the complex phrase parser - the 
> whitespace thwarts it.

Actually, it looks like it is, but you're not telling the complex
phrase parser to put the two clauses in a phrase.  You need the
quotes.

Even for complexphrase parser
A* B*  is the same as A* OR B*

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: How to stop Solr tokenising search terms with spaces

2014-12-08 Thread Yonik Seeley
On Mon, Dec 8, 2014 at 2:50 AM, Dinesh Babu  wrote:
> I just tried  your suggestion
>
> {!complexphrase}displayName:"RVN Viewpoint users"
>
> Even the above did not work. Am I missing any configuration changes for this 
> parser to work?

What is the fieldType of displayName?
The complexphrase query parser is only for "text" fields (those that
that index each word as a separate term.)

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


> Regards,
> Dinesh Babu.
>
>
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: 07 December 2014 20:49
> To: solr-user@lucene.apache.org
> Subject: Re: How to stop Solr tokenising search terms with spaces
>
> On Sun, Dec 7, 2014 at 3:18 PM, Dinesh Babu  wrote:
>> Thanks Yonik. This does not seem to work for me. This is what I did
>>
>> 1) q=displayName:rvn* brings me two records (a) "RVN Viewpoint Users" and 
>> (b) "RVN Project Admins"
>>
>> 2) {!complexphrase}"RVN*" --> Unknown query type 
>> \"org.apache.lucene.search.PrefixQuery\" found in phrase query string 
>> \"RVN*\""
>
> Looks like you found a bug in this part... a prefix query being quoted when 
> it doesn't need to be.
>
>> 3) {!complexphrase}"RVN V*" -- Does not bring any result back.
>
> This type of query should work (and does for me).  Is it because the default 
> search field does not have these terms, and you didn't specify a different 
> field to search?
> Try this:
> {!complexphrase}displayName:"RVN V*"
>
> -Yonik
> http://heliosearch.org - native code faceting, facet functions, sub-facets, 
> off-heap data
>
> 
>


Re: How to stop Solr tokenising search terms with spaces

2014-12-07 Thread Yonik Seeley
On Sun, Dec 7, 2014 at 3:18 PM, Dinesh Babu  wrote:
> Thanks Yonik. This does not seem to work for me. This is what I did
>
> 1) q=displayName:rvn* brings me two records (a) "RVN Viewpoint Users" and (b) 
> "RVN Project Admins"
>
> 2) {!complexphrase}"RVN*" --> Unknown query type 
> \"org.apache.lucene.search.PrefixQuery\" found in phrase query string 
> \"RVN*\""

Looks like you found a bug in this part... a prefix query being quoted
when it doesn't need to be.

> 3) {!complexphrase}"RVN V*" -- Does not bring any result back.

This type of query should work (and does for me).  Is it because the
default search field does not have these terms, and you didn't specify
a different field to search?
Try this:
{!complexphrase}displayName:"RVN V*"

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: How to stop Solr tokenising search terms with spaces

2014-12-06 Thread Yonik Seeley
On Sat, Dec 6, 2014 at 7:17 PM, Dinesh Babu  wrote:
> Just curious, why solr does not provide a simple mechanism to do a phrase 
> search ?

Simple phrase queries:
q= field1:"Hanks Major"

Phrase queries with wildcards / partial matches are a different
story... they are "complex":

q={!complexphrase}"hanks ma*"

See more examples here:
http://heliosearch.org/solr-4-8-features/

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


[ANN] Heliosearch 0.09 (JSON Request API + Distrib for Facet API)

2014-12-05 Thread Yonik Seeley
http://heliosearch.org/download

Heliosearch v0.09 Features:

o Heliosearch v0.09 is based on (and contains all features of)
Lucene/Solr 4.10.2 + most of 4.10.3

o Distributed search support for the new faceted search module / JSON
Facet API: http://heliosearch.org/json-facet-api/

o Automatic conversion of legacy field/range/query facets when
facet.version=2 is passed. This includes support for the deprecated
heliosearch syntax of facet.stat=facet_function and
subfacet.parentfacet.type=facet_param.

o New JSON Request API:
http://heliosearch.org/heliosearch-solr-json-request-api/

Example:
$ curl -XGET http://localhost:8983/solr/query -d '
{
  query : "*:*",
  filter : [
"author:brandon",
"genre_s:fantasy"
  ],
  offset : 0,
  limit : 5,
  fields : ["title","author"],  // we could also use the string form
"title,author"
  sort : "sequence_i desc",

  facet : {  // the JSON Facet API is nicely integrated as well
avg_price : "avg(price)",
top_authors : {terms : author}
  }
}'

This includes "smart JSON merging" including support for a mixed
environment of normal request params and JSON objects / snippets.


-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: TrieLongField not store large longs correctly

2014-11-27 Thread Yonik Seeley
On Wed, Nov 26, 2014 at 10:38 PM, Alexandre Rafalovitch
 wrote:
> Looks like one of these:
> http://stackoverflow.com/questions/1379934/large-numbers-erroneously-rounded-in-javascript

Yeah, that's what Brendan pointed to earlier in this thread.

> In the UI code, we just seem to be using JSON object's native functions.

OH, the irony that one can't use "JavaScript Object Notation" in
"JavaScript" w/o losing information!

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data
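The rounding in this thread follows directly from IEEE-754 double precision, JavaScript's only number type. A quick illustration (in Python, whose float is the same 64-bit double) using the value from this thread:

```python
# Demonstration (not Solr code): IEEE-754 doubles carry a 53-bit mantissa,
# so integers above 2**53 cannot all be represented exactly. JavaScript
# stores every JSON number as a double, hence the admin UI rounding.
val = 20140716126615474            # the long value from this thread

as_double = float(val)             # what a JS JSON parser would store
round_tripped = int(as_double)

print(val, round_tripped)          # the round trip loses the low digits
assert round_tripped != val
assert val <= 2**63 - 1            # fits a Java long just fine...
assert val > 2**53                 # ...but exceeds the exact double range
```

The exact digits displayed can vary with the formatter, but any integer above 2**53 is at risk of silent rounding once it passes through a JavaScript number.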


Re: TrieLongField not store large longs correctly

2014-11-26 Thread Yonik Seeley
Yeah, XML was fine, JSON outside admin was fine... it's definitely
just the client (admin).
Oh, you meant the JSON formatting code in the client - yeah.
Hopefully there is a way to fix it w/o sacrificing our nice syntax
highlighting.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data

On Wed, Nov 26, 2014 at 9:41 PM, Alexandre Rafalovitch
 wrote:
> Sounds like a JSON formatting code then? What happens when the return
> format is XML?
>
> Also, what happens if the request is made with browser debug panel
> open and we can compare what is on the wire with what is in the
> browser?
>
> Regards,
>Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 26 November 2014 at 20:02, Yonik Seeley  wrote:
>> On Wed, Nov 26, 2014 at 7:57 PM, Brendan Humphreys  wrote:
>>> I'd wager this is a loss of precision caused by Javascript rounding in the
>>> admin client. More details here:
>>>
>>> http://stackoverflow.com/questions/1379934/large-numbers-erroneously-rounded-in-javascript
>>
>> Ah, indeed - I was testing directly through the address bar, and not
>> via the admin interface.
>> I just tried the admin interface at
>> http://localhost:8983/solr/#/collection1/query
>> and I do see the rounding now.
>>
>>
>> -Yonik
>> http://heliosearch.org - native code faceting, facet functions,
>> sub-facets, off-heap data


Re: TrieLongField not store large longs correctly

2014-11-26 Thread Yonik Seeley
On Wed, Nov 26, 2014 at 7:57 PM, Brendan Humphreys  wrote:
> I'd wager this is a loss of precision caused by Javascript rounding in the
> admin client. More details here:
>
> http://stackoverflow.com/questions/1379934/large-numbers-erroneously-rounded-in-javascript

Ah, indeed - I was testing directly through the address bar, and not
via the admin interface.
I just tried the admin interface at
http://localhost:8983/solr/#/collection1/query
and I do see the rounding now.


-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: TrieLongField not store large longs correctly

2014-11-26 Thread Yonik Seeley
On Wed, Nov 26, 2014 at 7:30 PM, Erick Erickson  wrote:
> Hmmm, this seems to be browser related
> because if I use curl or Safari, the return and display
> are fine.
>
> i.e.
> curl http://localhost:8983/solr/collection1/query?q=*:*
>
> displays:
>
> "eoe_tl":20140716126615472,
> "eoe_s":"20140716126615472",
>
> "eoe_tl":20140716126615474,
> "eoe_s":"20140716126615474",
>
> "eoe_tl":20140716126615476,
> "eoe_s":"20140716126615476",
>
> and Safari displays it correctly too, but
> Chrome (39.0.2171.71 (64-bit)) displays it as I posted
> last post.

I just updated chrome to the same version: "Version 39.0.2171.71
(64-bit)" on OS-X
I would have expected this request to tickle the bug, but it didn't:
http://localhost:8983/solr/collection1/query?q=*:*&fl=id,val:20140716126615474

> So not a Solr/Lucene problem but odd to say the least.

Hopefully. Still, there's a remote possibility that something different in
the browser request tickles a bug in us, but it's unlikely.
Do you see anything different if you do "view page source"?

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: TrieLongField not store large longs correctly

2014-11-26 Thread Yonik Seeley
On Wed, Nov 26, 2014 at 7:10 PM, Erick Erickson  wrote:
> This is very weird, someone want to check this out to insure that I'm
> not hallucinating?

I just tried the following in Heliosearch, since I had it open (based
on 4.10.x):

  @Test
  public void testWeird() throws Exception {
Client client = Client.localClient;
long val = 20140716126615474L;
client.add(sdoc("id", "1", "foo_tl",val), null);
client.commit();
// straight query facets
client.testJQ(params("q", "id:1")
, "response/docs/[0]/foo_tl==" + val
);
  }

Seemed to work fine - no bug.
How did you index the docs? Maybe it's a client issue or something...

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


> Because it looks like a JIRA to me.
>
> I tried this with 4.8.0 (because I had it handy) and 5x, same
> results
>
> Indexed three docs with eoe_tl and eoe_s pairs:
> eoe_tl is a tlong
> eoe_s is a string
>
> doc1 has
> eoe_tl=20140716126615472
> eoe_s=20140716126615472
>
> doc2 has
> eoe_tl=20140716126615474
> eoe_s=20140716126615474
>
> doc3 has
> eoe_tl=20140716126615476
> eoe_s=20140716126615476
>
>
> Now, I can search on these perfectly fine, I get
> 0 hits for eoe_tl: 20140716126615470
>
> and 1 hit for
> eoe_tl: 20140716126615472
>
> one hit for:
> eoe_tl:20140716126615474
>
> and one hit for
> eoe_tl:20140716126615476
>
> BUT, the display when q=*:* is:
>
> eoe_tl: 20140716126615470,
> eoe_s: "20140716126615472",
>
> eoe_tl: 20140716126615470,
> eoe_s: "20140716126615474",
>
> eoe_tl: 20140716126615476,
> eoe_s: "20140716126615476",
>
> No, that's not a typo, the number ending in 6 is displayed correctly
> but the first two tlongs end in 0.
>
> We're nowhere near overflow with this number.
>
> On Wed, Nov 26, 2014 at 3:27 PM, Jack Krupansky  
> wrote:
>> Your query has a space in it after the colon, which is not valid. Could you
>> post the actual, full query request, as well as the full query response?
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Thomas L. Redman
>> Sent: Wednesday, November 26, 2014 2:45 PM
>> To: solr-user@lucene.apache.org
>> Subject: TrieLongField not store large longs correctly
>>
>>
>> I believe I have encountered a bug in SOLR. I have a data type defined as
>> follows:
>>
>> positionIncrementGap="0"/>
>>
>> And I have a field defined like so:
>>
>> > multiValued="false" required="true" omitNorms="true" />
>>
>> I have not been able to reproduce this problem for smaller numbers, but for
>> some of the very large numbers, the value that gets stored for this “aid”
>> field is not the same as the number that gets indexed. For example,
>> 20140716126615474 is stored as 20140716126615470, or in any even, that is
>> the way it is getting reported back. When I issue a query, “aid:
>> 20140716126615474”, the value reported back for aid is 20140716126615470!
>>
>> Any suggestions?


Re: cross site scripting

2014-11-26 Thread Yonik Seeley
On Wed, Nov 26, 2014 at 11:41 AM, Lee Carroll
 wrote:
> Just out of interest, what is the use-case for a pseudo-field whose value
> is a repeat of the field name?

Not having to specify a field name for the function query:
  fl=add(x,y)
somes back as (for example)
  "add(x,y)" : 14.2

And constants can be function queries too (hence the oddity you see
using fl= w/o an alias)

But since we don't really restrict document field names, or indexed
field values, clients should be prepared to handle anything (i.e.
"bad" values for some clients could still exist even w/o
pseudo-fields).

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: cross site scripting

2014-11-26 Thread Yonik Seeley
On Wed, Nov 26, 2014 at 10:47 AM, Lee Carroll
 wrote:
> The applications using the data may write solr data to the dom. (I doubt
> they do but they could now or in the future. They have an expectation of
> trusting the data back from solr).
>
> As a straightforward attack you are right, though. But isn't it incorrect
> behavior? It shouldn't produce bogus fields and values for each record
> returned, should it?

That's actually by design (pseudo-fields).  You can also set arbitrary
output keys for other stuff like faceting.
In general, it's not possible to escape dangerous values for the
client since the number of clients is practically unlimited (i.e. we
don't know if values will be used in a SQL query, a PHP front-end, or
whatever).  All we can do is ensure that we correctly encapsulate
values and then leave the rest up to the client who knows how they
will use the values.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: cross site scripting

2014-11-26 Thread Yonik Seeley
It would have been helpful if you would have pointed out exactly what
you think the problem is.
I still don't see an issue, since it doesn't look like any
encapsulation has been broken.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Wed, Nov 26, 2014 at 9:56 AM, Lee Carroll
 wrote:
> Hi All,
> In solr 4.7 this query
> /solr/coreName/select/?q=*:*&fl=%27nasty%20value%27&rows=1&wt=json
>
>  returns
>
> {"responseHeader":{"status":0,"QTime":2},"response":{"numFound":189796,"start":0,"docs":[{"'nasty
> value'":"nasty value"}]}}
>
> This is naughty. Has this been seen before / fixed ?


Re: IndexSearcher not being closed

2014-11-20 Thread Yonik Seeley
On Wed, Nov 19, 2014 at 8:37 AM, Priya Rodrigues  wrote:
> public void setContext( TransformContext context ) {
> try {
>   IndexReader reader = qparser.getReq().getSearcher().getIndexReader();
> ->Refcount incremented

You can get a searcher from the request as many times as you like...
it's cached and the ref count is only incremented the first time (and
the corresponding decref is when the request object is closed).

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Solr JOIN: keeping permission data out of primary documents

2014-11-19 Thread Yonik Seeley
On Wed, Nov 19, 2014 at 9:22 AM, Philip Durbin
 wrote:
> On Wed, Nov 19, 2014 at 5:45 AM, Yonik Seeley  wrote:
>> On Tue, Nov 18, 2014 at 3:47 PM, Philip Durbin
>>  wrote:
>>> Solr JOINs are a way to enforce simple document security, as explained
>>> by Yonik Seeley at
>>> http://lucene.472066.n3.nabble.com/document-level-security-filter-solution-for-Solr-tp4126992p4126994.html
>>>
>>> I'm trying to tweak this pattern so that I don't have to keep the
>>> security information in each of my primary Solr documents.
>>>
>>> I just posted the gist at
>>> https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9 as an example of
>>> my working Solr JOIN based on data in `before.json` . Permissions per
>>> user are embedded in the primary documents like this:
>>>
>>> {
>>> "id": "dataset_3",
>>> "perms_ss": [
>>> "alice",
>>> "bob"
>>> ]
>>> },
>>> {
>>> "id": "dataset_4",
>>> "perms_ss": [
>>> "alice",
>>> "bob",
>>> "public"
>>> ]
>>> },
>>>
>>> User documents have been created to do the JOIN on:
>>>
>>> {
>>> "id": "alice",
>>> "groups_s": "alice"
>>> },
>>>
>>> The JOIN looks like this:
>>>
>>> {!join+from=groups_s+to=perms_ss}id:public+OR+{!join+from=groups_s+to=perms_ss}id:alice
>>
>> It would probably be faster written as a single join:
>> fq={!join+from=groups_s+to=perms_ss}id:(public alice)
>
> Hmm, I can't get the single JOIN to work on the "before" example
> (perms embedded in each primary doc) in the gist I posted so I guess
> I'll live with the slower version with "OR".
>
>> Or, if you're using Heliosearch you could cache the filters separately
>> for better hit rates on commonly used perms via the "filter" keyword:
>> fq=filter({!join+from=groups_s+to=perms_ss}id:public) OR
>> filter({!join+from=groups_s+to=perms_ss}id:alice)
>
> Getting back to my original question about keeping permission
> information out of my primary documents, I noticed that
> http://heliosearch.org describes the Pseudo-Join feature as "selects a
> set of documents based on their relationship to a **second** set of
> documents" (emphasis mine) so I assume I can't take the perms out of
> my primary Solr documents and put them in a **third** set of
> "permission assignments" documents with definition points and role
> assignees: 
> https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9#file-after-json
> . That is, the three sets of documents would be:
>
> 1. primary (datasets, with no permission info)
> 2. users
> 3. permission assignments

You should be able to chain joins to follow any number of links.
I don't quite understand how you mean to use your schema... but something like

fq={!join from=definition_point_s to=id}role_assignee_ss:alice

That's only following a single link and ignoring the group_s field, so
I'm probably missing something.

-Yonik


Re: Solr JOIN: keeping permission data out of primary documents

2014-11-19 Thread Yonik Seeley
On Tue, Nov 18, 2014 at 3:47 PM, Philip Durbin
 wrote:
> Solr JOINs are a way to enforce simple document security, as explained
> by Yonik Seeley at
> http://lucene.472066.n3.nabble.com/document-level-security-filter-solution-for-Solr-tp4126992p4126994.html
>
> I'm trying to tweak this pattern so that I don't have to keep the
> security information in each of my primary Solr documents.
>
> I just posted the gist at
> https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9 as an example of
> my working Solr JOIN based on data in `before.json` . Permissions per
> user are embedded in the primary documents like this:
>
> {
> "id": "dataset_3",
> "perms_ss": [
> "alice",
> "bob"
> ]
> },
> {
> "id": "dataset_4",
> "perms_ss": [
> "alice",
> "bob",
> "public"
> ]
> },
>
> User documents have been created to do the JOIN on:
>
> {
> "id": "alice",
> "groups_s": "alice"
> },
>
> The JOIN looks like this:
>
> {!join+from=groups_s+to=perms_ss}id:public+OR+{!join+from=groups_s+to=perms_ss}id:alice

It would probably be faster written as a single join:
fq={!join+from=groups_s+to=perms_ss}id:(public alice)

Or, if you're using Heliosearch you could cache the filters separately
for better hit rates on commonly used perms via the "filter" keyword:
fq=filter({!join+from=groups_s+to=perms_ss}id:public) OR
filter({!join+from=groups_s+to=perms_ss}id:alice)

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data
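As an illustration of what that single join does, here is a toy model (plain Python, not Solr internals; the ids and field names follow the gist above). The inner query `id:(public alice)` selects the user docs, then the join keeps any dataset whose `perms_ss` intersects the selected `groups_s` values:

```python
# Toy model of: fq={!join from=groups_s to=perms_ss}id:(public alice)
user_docs = [
    {"id": "alice", "groups_s": "alice"},
    {"id": "public", "groups_s": "public"},
]
datasets = [
    {"id": "dataset_3", "perms_ss": ["alice", "bob"]},
    {"id": "dataset_4", "perms_ss": ["alice", "bob", "public"]},
    {"id": "dataset_5", "perms_ss": ["bob"]},          # invisible to alice
]

def join_filter(from_docs, from_field, to_docs, to_field):
    """Collect 'from' values, keep 'to' docs whose field intersects them."""
    from_values = {d[from_field] for d in from_docs}
    return [d for d in to_docs if from_values & set(d[to_field])]

# the inner query id:(public alice) matches both user docs:
matched_users = [d for d in user_docs if d["id"] in ("public", "alice")]
visible = join_filter(matched_users, "groups_s", datasets, "perms_ss")
assert [d["id"] for d in visible] == ["dataset_3", "dataset_4"]
```

The single-join form computes one "from" set up front, which is why it tends to be cheaper than OR-ing two separate joins.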


Re: DocSet getting cached in filterCache for facet request with {!cache=false}

2014-11-11 Thread Yonik Seeley
On Tue, Nov 11, 2014 at 1:25 PM, Mohsin Beg Beg  wrote:
> Wiki says fq={!cache=false}*:* is ok, no?

That's for the filtering... not for the faceting.

> then how to skip the filterCache for facet.method=enum?

Specify a high minDF (the min "docfreq" or number of documents that
need to match a term before the filter cache will be used).

facet.enum.cache.minDf=1000

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data
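A hedged sketch of where that parameter fits in a request (host, collection, and field name are placeholders; with minDf set high, terms matching fewer docs than the threshold bypass the filterCache):

```python
from urllib.parse import urlencode

# Hypothetical enum-faceting request with the minDf threshold applied.
params = {
    "q": "*:*",
    "facet": "true",
    "facet.field": "category",          # placeholder field
    "facet.method": "enum",
    "facet.enum.cache.minDf": 1000,     # skip filterCache below 1000 docs
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/collection1/select?" + query_string
print(url)
assert "facet.enum.cache.minDf=1000" in query_string
```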


Re: Solr 4.10 very slow on build()

2014-11-08 Thread Yonik Seeley
Try commenting out the suggester component & handler in solrconfig.xml:
https://issues.apache.org/jira/browse/SOLR-6679

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Sat, Nov 8, 2014 at 2:03 PM, Mohsen Saboorian  wrote:
> I have a ~4GB index which takes a minute (or more) to build() when starting
> the server. I noticed that this happens when I upgrade from Solr 4.0 to 4.10.
> The index was fully rebuilt with Solr 4.10 (using DIH). How can I speed up
> startup time? Here is the slow part of the startup log:
>
> INFO  141101-23:48:18.239  Loading spell index for spellchecker: wordbreak
> INFO  141101-23:48:18.239  Loading suggester index for: mySuggester
> INFO  141101-23:48:18.239  reload()
> INFO  *141101-23:48:18.239*  build()
> INFO  *141101-23:49:15.270*  [admin] webapp=null path=/admin/cores
>   params={_=1414873135659&wt=json} status=0 QTime=11
> INFO  141101-23:49:22.503  [news] Registered new searcher Searcher@28195344[news]
>   main{StandardDirectoryReader(segments_1b6:65731:nrt _fgm(4.10.1):C244111
>   _1pw(4.10.1):C191483/156:delGen=140 _1wg(4.10.1):C174054/11:delGen=11
>   _236(4.10.1):C1920/1:delGen=1 _23h(4.10.1):C1756
>   _67x(4.10.1):C2120/144:delGen=126 _23l(4.10.1):C2185/2:delGen=2
>   _4ch(4.10.1):C784/145:delGen=126 _3b5(4.10.1):C758/80:delGen=79
>   _23q(4.10.1):C3391 _97s(4.10.1):C1218/136:delGen=127
>   _buo(4.10.1):C1096/86:delGen=84 _eh8(4.10.1):C819/73:delGen=69
>   _fg8(4.10.1):C413/94:delGen=81 _geb(4.10.1):C229/5:delGen=5
>   _g4b(4.10.1):C130/24:delGen=23 _g6c(4.10.1):C144/15:delGen=14
>   _ghj(4.10.1):C21/2:delGen=2 _gj6(4.10.1):C25/3:delGen=3
>   _gfz(4.10.1):C10/1:delGen=1 _ghe(4.10.1):C1 _gir(4.10.1):C3/2:delGen=1
>   _gis(4.10.1):C2/1:delGen=1 _gja(4.10.1):C1 _gjb(4.10.1):C2/1:delGen=1
>   _gjd(4.10.1):C1 _gjj(4.10.1):C1 _gjo(4.10.1):C1 _gjp(4.10.1):C1
>   _gjq(4.10.1):C1 _gjs(4.10.1):C1)}
> INFO  141101-23:49:22.505  Creating new IndexWriter...
> INFO  141101-23:49:22.506  Waiting until IndexWriter is unused... core=news
> INFO  141101-23:49:22.506  Closing old IndexWriter... core=news
> INFO  141101-23:49:22.650  SolrDeletionPolicy.onInit: commits: num=1
>   commit{dir=NRTCachingDirectory(MMapDirectory@/app/solr/solrhome/news/data/index
>   lockFactory=NativeFSLockFactory@/app/solr/solrhome/news/data/index;
>   maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_bpm,generation=15178}
> INFO  141101-23:49:22.650  newest commit generation = 15178
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-4-10-very-slow-on-build-tp4168368.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr slow startup

2014-11-03 Thread Yonik Seeley
One possible cause of a slow startup with the default configs:
https://issues.apache.org/jira/browse/SOLR-6679

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Mon, Nov 3, 2014 at 11:05 AM, Michal Krajňanský
 wrote:
> Dear All,
>
>
> Sorry for the possibly newbie question as I have only recently started
> experimenting with Solr and Solrcloud.
>
>
> I am trying to import an index originally created with Lucene 2.x to Solr
> 4.10. What I did was:
>
> 1. upgrade index to version 3.x with IndexUpgrader
> 2. upgrade index to version 4.x with IndexUpgrader
> 3. created a schema for Solr and used the default solrconfig (with some
> path changes)
> 4. succesfully started Solr
>
> The sizes I am speaking about are in tens of gigabytes and the startup
> times are 5~10 minutes.
>
>
> I have read here:
> https://wiki.apache.org/solr/SolrPerformanceProblems
> that it has possibly something to do with the updateHandler and enabled the
> autoCommit as suggested, however with no improvement.
>
> Such a long startup time feels odd when Lucene itself seems to load the
> same indexes in no time.
>
> I would very much appreciate any help with this issue.
>
>
> Best,
>
>
> Michal Krajnansky


Re: Solr slow start up (tlog is small)

2014-11-03 Thread Yonik Seeley
Can you tell from the logs what Solr is doing during that time?
Do you have any warming queries configured?
Also see this: https://issues.apache.org/jira/browse/SOLR-6679
  (comment out suggester related stuff if you aren't using it)

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Mon, Nov 3, 2014 at 11:03 AM, Po-Yu Chuang  wrote:
> Hi,
>
> I am using Solr 4.9 with Tomcat and it works fine except that the
> deployment of solr.war is too long. While deploying Solr, all webapps on
> Tomcat stop responding which is unacceptable. Most articles I found say
> that it might result from big transaction log because of uncommitted
> documents, but this is not my case.
>
> At first, the Solr data is 280G and the start up time is 30 minutes. Then I
> set a field to stored="false" and re-index whole data. The data size became
> 185G and the start up time reduced to 17 minutes, but it is still too long.
>
> Here are some numbers I measured:
>
> 1)
> Solr home: 280G
> tlog: 500K
> 30 min to start up
> While starting up, disk read is constantly about 50MB/s (according to
> dstat). So it seems that Solr reads 30m * 60s * 50MB/s = 90GB of data while
> starting up, which is 30% of index data size.
>
> 2)
> Solr home: 185G
> tlog: 5M
> 17 minutes to start up
> While starting up, disk read is constantly about 5MB/s (according to
> dstat). So it seems that Solr reads 17m * 60s *5MB/s = 5GB of data while
> starting up, which is about 3% of index data size.
>
> p.s. I did commit each time 1000 documents being added and did optimization
> after all documents are added.
>
> Any ideas or suggestions would be appreciated.
>
> Thanks,
> Po-Yu


Re: order of updates

2014-11-03 Thread Yonik Seeley
On Mon, Nov 3, 2014 at 8:53 AM, Matteo Grolla  wrote:
> HI,
> can anybody confirm this?
> If I add multiple documents with the same id but differing on other fields, and
> then issue a commit (no commits before this), the last added document gets
> indexed, right?

Correct.

> using solr 4 and default settings for optimistic locking.

If you haven't seen it, I did an example of that a while back:

http://heliosearch.org/solr/optimistic-concurrency/

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data
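The linked example boils down to compare-and-swap on a per-document version: an update carries the `_version_` it was read at, and the server rejects it if the document has moved on. A toy in-memory model of the idea (not Solr's implementation; class and field names are made up except `_version_`):

```python
# Toy sketch of Solr-style optimistic concurrency.
class VersionConflict(Exception):
    pass

class TinyIndex:
    def __init__(self):
        self.docs = {}      # id -> doc (each stored doc carries _version_)
        self.clock = 0      # stand-in for Solr's version counter

    def update(self, doc, expected_version=None):
        current = self.docs.get(doc["id"])
        if expected_version is not None:
            current_version = current["_version_"] if current else 0
            if current_version != expected_version:
                raise VersionConflict(
                    f"expected {expected_version}, found {current_version}")
        self.clock += 1
        self.docs[doc["id"]] = {**doc, "_version_": self.clock}
        return self.clock

index = TinyIndex()
v1 = index.update({"id": "1", "price": 10})                   # plain add
index.update({"id": "1", "price": 12}, expected_version=v1)   # succeeds
try:
    index.update({"id": "1", "price": 99}, expected_version=v1)  # stale
    conflicted = False
except VersionConflict:
    conflicted = True
assert conflicted and index.docs["1"]["price"] == 12
```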


Re: [ANN] Heliosearch 0.08 released

2014-10-28 Thread Yonik Seeley
On Tue, Oct 28, 2014 at 10:10 AM, Bernd Fehling
 wrote:
> Is the new faceted search module the cause why I don't have
> any lucene-facet-hs_0.08.jar in the binary distribution?

Solr has never used that (and Heliosearch doesn't either).   ES never
has either AFAIK.

> And what is with lucene-classification and lucene-replicator?

Ditto for these.

> How can I build from source, with solr/hs.xml?

The only thing hs.xml is used for is building the final package.
Other stuff uses the straight build.xml...
"ant test",  "ant example", etc...

There is a shell script in the solr/native directory to build the
native code libraries,
but if you aren't changing them it's easiest to just take the
solr/example/native directory from the heliosearch download.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


>
> Regards
> Bernd
>
>
> Am 27.10.2014 um 17:25 schrieb Yonik Seeley:
>> http://heliosearch.org/download
>>
>> Heliosearch v0.08 Features:
>>
>> o  Heliosearch v0.08 is based on (and contains all features of)
>> Lucene/Solr 4.10.2
>>
>> o  Streaming Aggregations over search results API:
>> http://heliosearch.org/streaming-aggregation-for-solrcloud/
>>
>> o  Optimized request logging, and added a logLimit request parameter
>> that limits the size of logged request parameters
>>
>> o  A new faceted search module to more easily support future search features
>>
>> o  A JSON Facet API to more naturally express Facet Statistics and
>> Nested Sub-Facets
>> http://heliosearch.org/json-facet-api/
>>
>> Example:
>> curl http://localhost:8983/solr/query -d 'q=*:*&
>>  json.facet={
>>categories:{
>>  terms:{// terms facet creates a bucket for each indexed term
>> in the field
>>field : cat,
>>facet:{
>>  avg_price : "avg(price)",  // average price per bucket
>>  num_manufacturers : "unique(manu)",  // number of unique
>> manufacturers per bucket
>>  my_subfacet: {terms: {...}}  // do a sub-facet for every bucket
>>}
>>  }
>>}
>>  }
>> '
>>
>> -Yonik
>> http://heliosearch.org - native code faceting, facet functions,
>> sub-facets, off-heap data
>>


[ANN] Heliosearch 0.08 released

2014-10-27 Thread Yonik Seeley
http://heliosearch.org/download

Heliosearch v0.08 Features:

o  Heliosearch v0.08 is based on (and contains all features of)
Lucene/Solr 4.10.2

o  Streaming Aggregations over search results API:
http://heliosearch.org/streaming-aggregation-for-solrcloud/

o  Optimized request logging, and added a logLimit request parameter
that limits the size of logged request parameters

o  A new faceted search module to more easily support future search features

o  A JSON Facet API to more naturally express Facet Statistics and
Nested Sub-Facets
http://heliosearch.org/json-facet-api/

Example:
curl http://localhost:8983/solr/query -d 'q=*:*&
 json.facet={
   categories:{
 terms:{// terms facet creates a bucket for each indexed term
in the field
   field : cat,
   facet:{
 avg_price : "avg(price)",  // average price per bucket
 num_manufacturers : "unique(manu)",  // number of unique
manufacturers per bucket
 my_subfacet: {terms: {...}}  // do a sub-facet for every bucket
   }
 }
   }
 }
'

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: recip function error

2014-10-23 Thread Yonik Seeley
On Thu, Oct 23, 2014 at 7:47 PM, Michael Sokolov
 wrote:
> 3.16e-11.0 looks fishy to me

Indeed... looks like it should be "3.16e-11"
Standard scientific notation shouldn't have decimal points in the
exponent.  Not sure if that causes Java problems or not though...

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: SOLR Boolean clause impact on memory/Performance

2014-10-14 Thread Yonik Seeley
A terms query will be better than a boolean query here (assuming you
don't care about scoring those terms):
http://heliosearch.org/solr-terms-query/

But you need a recent version of Solr or Heliosearch.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data

On Mon, Oct 13, 2014 at 12:10 PM, ankit gupta  wrote:
> hi,
>
> Can we quantify the impact on Solr memory usage/performance if we increase
> the number of boolean clauses? I am currently using a lot of OR clauses in
> the query (close to 10K) and can see heap size growing.
>
> Thanks,
> Ankit
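For illustration, the difference between the two query forms can be sketched as plain string construction (the field name and values below are made up; the `{!terms f=field}v1,v2,...` syntax is described in the linked post):

```python
# Hypothetical sketch: collapsing a large OR query into one {!terms} filter.
ids = [f"id_{i}" for i in range(5)]              # stand-in for ~10K values

or_query = " OR ".join(f"id:{v}" for v in ids)   # scored boolean form
terms_fq = "{!terms f=id}" + ",".join(ids)       # unscored terms-query form

print(or_query)
print(terms_fq)
assert terms_fq == "{!terms f=id}id_0,id_1,id_2,id_3,id_4"
assert or_query.count(" OR ") == len(ids) - 1
```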


Re: Payload with Local Params?

2014-10-11 Thread Yonik Seeley
On Sat, Oct 11, 2014 at 12:22 AM, William Bell  wrote:
> I want to call:
>
> http://localhost:8983/solr/collection1/query?defType=myqp&yy=electronics&q=payloads:$yy
>
> How do I pass $yy to the parser and have it use "electronics" instead
> of the literal $yy?

Solr only does parameter substitution for local param values... so for
a term query above of payload:$yy
you want q={!term f=payload v=$yy}

Heliosearch can do full request substitution, so you can do things like
q=payload:${yy}
or
q=price:[ ${low} TO ${high} ]
http://heliosearch.org/solr-query-parameter-substitution/

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data
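Python's stdlib happens to use the same `${...}` placeholder style, so the substitution described above can be illustrated with `string.Template` (the parameter names mirror the examples in this thread):

```python
from string import Template

# Sketch of Heliosearch-style ${param} request substitution.
request_params = {"yy": "electronics", "low": "10", "high": "20"}

q1 = Template("payloads:${yy}").substitute(request_params)
q2 = Template("price:[ ${low} TO ${high} ]").substitute(request_params)

assert q1 == "payloads:electronics"
assert q2 == "price:[ 10 TO 20 ]"
```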


Re: Solr Index to Helio Search

2014-10-09 Thread Yonik Seeley
Hmmm, I imagine this is due to the lucene back compat bugs that were
in 4.10, and the fact that the last release of heliosearch was
branched off of the 4x branch.

I just tried moving an index back and forth between my local
heliosearch copy and solr 4.10.1 and things worked fine.

Here's the snapshot I just tested that you can use until the next
release comes out:
https://www.dropbox.com/s/x9rs5yfousvkrnj/solr-hs_0.08snapshot.tgz?dl=0

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Thu, Oct 9, 2014 at 1:51 AM, Norgorn  wrote:
> When I try to simply copy an index from native Solr to Heliosearch, I get an
> exception:
>
> Caused by: java.lang.IllegalArgumentException: A SPI class of type
> org.apache.lu
> cene.codecs.Codec with name 'Lucene410' does not exist. You need to add the
> corr
> esponding JAR file supporting this SPI to your classpath.The current
> classpath s
> upports the following names: [Lucene40, Lucene3x, Lucene41, Lucene42,
> Lucene45,
> Lucene46, Lucene49]
>
> Is there any proper way to add index from native SOLR to Heliosearch?
>
> The problem with native Solr is that there are a lot of OOM exceptions
> (because of the large index).
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Index-to-Helio-Search-tp4163446.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: queryResultCache's size is not increasing

2014-10-07 Thread Yonik Seeley
It's your "full-import" every 5 minutes.
A queryResultCache will be invalidated by changes to the index (i.e. a
commit) and the size will drop back to 0.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data

On Tue, Oct 7, 2014 at 4:53 AM, Lee Chunki  wrote:
> Hi,
>
> I am running Solr 4.1.0 and trying to use queryResultCache
> but the "size" value on the admin page is far smaller than the number of incoming queries.
>
> I want to know why.
>
> settings and status are as follows:
>
> * setting - solrconfig.xml
> <queryResultCache size="163840"
> initialSize="163840"
> autowarmCount="10240"/>
>
> * status
> - for 13 hours
>   - # of requests : 6,711,920
>   - # of unique queries : 72,414
>   - size value at admin page : 9
>
> * extra informations
> - I want to fetch the whole result set, so we set rows to 1,000,000
> - but most results number less than 1k
> - run ‘full-import’ every five minutes
> - use  group, facet and sort when querying
>
> Thanks,
> Chunki.
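A toy model of why the reported size stays small (not Solr's code): the queryResultCache belongs to a searcher, and each commit opens a new searcher with an empty cache (autowarming, which this sketch ignores, refills only a subset).

```python
# Toy model: one cache per searcher, reset on every commit.
class Searcher:
    def __init__(self, generation):
        self.generation = generation
        self.query_result_cache = {}

    def search(self, q):
        if q not in self.query_result_cache:
            self.query_result_cache[q] = f"results({q}@gen{self.generation})"
        return self.query_result_cache[q]

def commit(old):
    return Searcher(old.generation + 1)   # new searcher, empty cache

searcher = Searcher(generation=1)
for q in ("a", "b", "c"):
    searcher.search(q)
assert len(searcher.query_result_cache) == 3

searcher = commit(searcher)               # e.g. a full-import every 5 min
assert len(searcher.query_result_cache) == 0   # size drops back to 0
```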


Re: Question about filter cache size

2014-10-03 Thread Yonik Seeley
On Fri, Oct 3, 2014 at 6:38 PM, Peter Keegan  wrote:
>> it will be cached as hidden:true and then inverted
> Inverted at query time, so for best query performance use fq=hidden:false,
> right?

Yep.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Question about filter cache size

2014-10-03 Thread Yonik Seeley
On Fri, Oct 3, 2014 at 4:35 PM, Shawn Heisey  wrote:
> On 10/3/2014 1:57 PM, Yonik Seeley wrote:
>> On Fri, Oct 3, 2014 at 3:42 PM, Peter Keegan  wrote:
>>> Say I have a boolean field named 'hidden', and less than 1% of the
>>> documents in the index have hidden=true.
>>> Do both these filter queries use the same docset cache size? :
>>> fq=hidden:false
>>> fq=!hidden:true
>>
>> Nope... !hidden:true will be smaller in the cache (it will be cached
>> as hidden:true and then inverted)
>> The downside is that you'll pay the cost of that inversion.
>
> I would think that unless it's using hashDocSet, the cached data for
> every filter would always be the same size.  The wiki says that
> hashDocSet is no longer used for filter caching as of 1.4.0.  Is that
> actually true?

Yes, SortedIntDocSet is used instead.  It stores an int per match
(i.e. 4 bytes per match).  This change was made so in-order traversal
could be done efficiently.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data
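A sketch of why sorted int doc ids make in-order traversal efficient: intersecting a cached filter with query results becomes a linear two-pointer merge over the two sorted id lists (illustrative Python, not Solr's SortedIntDocSet code):

```python
# Two-pointer intersection of sorted doc-id lists.
def intersect_sorted(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

filter_docs = [2, 5, 9, 14, 21]      # one int (4 bytes) per matching doc
query_docs = [1, 2, 3, 9, 21, 40]
assert intersect_sorted(filter_docs, query_docs) == [2, 9, 21]
```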


Re: Question about filter cache size

2014-10-03 Thread Yonik Seeley
On Fri, Oct 3, 2014 at 3:42 PM, Peter Keegan  wrote:
> Say I have a boolean field named 'hidden', and less than 1% of the
> documents in the index have hidden=true.
> Do both these filter queries use the same docset cache size? :
> fq=hidden:false
> fq=!hidden:true

Nope... !hidden:true will be smaller in the cache (it will be cached
as hidden:true and then inverted)
The downside is that you'll pay the cost of that inversion.
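As a sketch of what "cached as hidden:true and then inverted" means, with plain Python sets standing in for Lucene DocSets (these helpers are hypothetical, not Solr's actual classes):

```python
def cached_filter(cache, term_docs, term):
    # Only the (small) positive "hidden:true" set goes into the filter cache.
    if term not in cache:
        cache[term] = frozenset(term_docs[term])
    return cache[term]

def negated_filter(cache, all_docs, term_docs, term):
    # fq=!hidden:true is answered by inverting the cached positive set;
    # that inversion is the query-time cost mentioned above.
    return all_docs - cached_filter(cache, term_docs, term)

all_docs = set(range(10))
term_docs = {"hidden:true": {0, 3}}
cache = {}
assert negated_filter(cache, all_docs, term_docs, "hidden:true") == {1, 2, 4, 5, 6, 7, 8, 9}
assert cache["hidden:true"] == frozenset({0, 3})  # the small set is what gets cached
```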

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: fq syntax for requiring all multiValued field values to be within a list?

2014-09-27 Thread Yonik Seeley
Heh... very clever, Mikhail!

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Sat, Sep 27, 2014 at 4:43 PM, Mikhail Khludnev
 wrote:
> indeed!
> the exclusive range {green TO red} matches "lemon yellow"
> hence, the negation suppresses it from appearing
> fq=-color:{green TO red}
> then you need to suppress eg black and white also
> fq=-color:({* TO green} {green TO red} {red TO *})
>
> I have no control over the
>> possible values of 'color',
>
> You don't need to control the possible values; you are just suppressing any
> values besides the given green and red.
> Note that both green and red pass the negation of that disjunction of
> exclusive ranges.
>
>
> On Sun, Sep 28, 2014 at 12:15 AM, White, Bill  wrote:
>
>> OK, let me try phrasing it better.
>>
>> How do I exclude from search, any result which contains any value for
>> multivalued field 'color' which is not within a given "constraint set"
>> (e.g., "red", "green", "yellow", "burnt sienna"), given that I do not know what
>> any of the other possible values of 'color' are?
>>
>> In pseudocode:
>>
>> for all x in result.color
>> if x not in ("red","green","yellow", "burnt sienna")
>> filter out result
>>
>> I don't see how range queries would work since I have no control over the
>> possible values of 'color', e.g., there could be a valid color "lemon
>> yellow" between "green" and "red", and I don't want a result which has
>> (color: red, color: "lemon yellow")
>>
>> On Sat, Sep 27, 2014 at 4:02 PM, Mikhail Khludnev <
>> mkhlud...@griddynamics.com> wrote:
>>
>> > On Sat, Sep 27, 2014 at 11:36 PM, White, Bill  wrote:
>> >
>> > > but do NOT match ANY other color.
>> >
>> >
>> > Bill, I miss the whole picture, it's worth to rephrase the problem in one
>> > sentence.
>> > But regarding the quote above, you can try to use exclusive ranges
>> >
>> >
>> https://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Range_Searches
>> > fq=-color:({* TO green} {green TO red} {red TO *})
>> > just don't forget to build ranges alphabetically
>> >
>> > --
>> > Sincerely yours
>> > Mikhail Khludnev
>> > Principal Engineer,
>> > Grid Dynamics
>> >
>> > 
>> > 
>> >
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
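Mikhail's range trick above can be sanity-checked with a small simulation. This sketch assumes a string field whose values sort in plain lexicographic order and treats each color (including "lemon yellow") as a single untokenized term:

```python
def survives(colors, allowed=("green", "red")):
    # fq=-color:({* TO green} {green TO red} {red TO *})
    # A doc is dropped if ANY of its values lands inside one of the
    # exclusive ranges; only the endpoints themselves escape all three.
    lo, hi = sorted(allowed)  # two allowed values, as in this thread
    return not any(c < lo or lo < c < hi or c > hi for c in colors)

assert survives(["green", "red"])
assert survives(["red"])
assert survives([])                           # no value matches no range
assert not survives(["red", "lemon yellow"])  # falls in {green TO red}
assert not survives(["black"])                # falls in {* TO green}
```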


Re: fq syntax for requiring all multiValued field values to be within a list?

2014-09-27 Thread Yonik Seeley
On Sat, Sep 27, 2014 at 3:46 PM, White, Bill  wrote:
> Hmm, that won't work since color is free-form.
>
> Is there a way to invoke (via fq) a user-defined function (hopefully
> defined as part of the fq syntax, but alternatively, written in Java) and
> have it applied to the resultset?

https://wiki.apache.org/solr/SolrPlugins#QParserPlugin

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: fq syntax for requiring all multiValued field values to be within a list?

2014-09-27 Thread Yonik Seeley
On Sat, Sep 27, 2014 at 3:36 PM, White, Bill  wrote:
> Sorry, color is multivalued, so a given record might be both blue and red.
> I don't want those to show up in the results.

I think the only way currently (out of the box) is to enumerate the
other possible colors to exclude them.

color:(red yellow green)  -color:(blue cyan xxx)
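The semantics Bill is after (from his pseudocode later in the thread) can also be stated as a client-side post-filter. This is only a sketch of the desired behavior, not a Solr query:

```python
def passes(doc_colors, allowed):
    # Keep a doc only if every one of its color values is in the allowed
    # list; a doc with no colors passes vacuously.
    return all(color in allowed for color in doc_colors)

allowed = {"red", "yellow", "green"}
assert passes([], allowed)
assert passes(["red", "green"], allowed)
assert not passes(["red", "blue"], allowed)
assert not passes(["magenta"], allowed)
```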

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data



> On Sat, Sep 27, 2014 at 3:36 PM, White, Bill  wrote:
>
>> Not just that.  I'm looking for things which match either red or yellow or
>> green, but do NOT match ANY other color.  I can probably drop the
>> requirement related to having no color.
>>
>> On Sat, Sep 27, 2014 at 3:28 PM, Yonik Seeley 
>> wrote:
>>
>>> On Sat, Sep 27, 2014 at 2:52 PM, White, Bill  wrote:
>>> > Hello,
>>> >
>>> > I've attempted to figure this out from reading the documentation but
>>> > without much luck.  I looked for a comprehensive query syntax
>>> specification
>>> > (e.g., with BNF and a list of operator semantics) but I'm unable to find
>>> > such a document (does such a thing exist? or is the syntax too much of a
>>> > moving target?)
>>> >
>>> > I'm using 4.6.1, if that makes a difference, though upgrading is an
>>> option
>>> > if it necessary to make this work.
>>> >
>>> > I've got a multiValued field "color", which describes the colors of
>>> item in
>>> > the database.  Items can have zero or more colors.  What I want is to be
>>> > able to filter out all hits that contain colors not within a
>>> constraining
>>> > list, i.e., something like
>>> >
>>> > NOT (color NOT IN ("red","yellow","green")).
>>> >
>>> > So the following would be passed by the filter:
>>> > (no value for 'color')
>>> > color: red
>>> > color: red, color: green
>>> >
>>> > whereas these would be excluded:
>>> > color: red, color: blue
>>> > color: magenta
>>>
>>> You're looking for things that either match red, yellow, or green, or
>>> have no color:
>>>
>>> color:(red yellow green) OR (*:* -color:*)
>>>
>>> -Yonik
>>> http://heliosearch.org - native code faceting, facet functions,
>>> sub-facets, off-heap data
>>>
>>
>>


Re: fq syntax for requiring all multiValued field values to be within a list?

2014-09-27 Thread Yonik Seeley
On Sat, Sep 27, 2014 at 2:52 PM, White, Bill  wrote:
> Hello,
>
> I've attempted to figure this out from reading the documentation but
> without much luck.  I looked for a comprehensive query syntax specification
> (e.g., with BNF and a list of operator semantics) but I'm unable to find
> such a document (does such a thing exist? or is the syntax too much of a
> moving target?)
>
> I'm using 4.6.1, if that makes a difference, though upgrading is an option
> if it necessary to make this work.
>
> I've got a multiValued field "color", which describes the colors of item in
> the database.  Items can have zero or more colors.  What I want is to be
> able to filter out all hits that contain colors not within a constraining
> list, i.e., something like
>
> NOT (color NOT IN ("red","yellow","green")).
>
> So the following would be passed by the filter:
> (no value for 'color')
> color: red
> color: red, color: green
>
> whereas these would be excluded:
> color: red, color: blue
> color: magenta

You're looking for things that either match red, yellow, or green, or
have no color:

color:(red yellow green) OR (*:* -color:*)

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Does soft commit block on autowarming?

2014-09-24 Thread Yonik Seeley
On Wed, Sep 24, 2014 at 6:56 PM, Bruce Johnson  wrote:
> Is it reliably true that once a soft commit request returns,
> any subsequent queries will hit a new (and autowarmed) searcher?

Yes.
The default for commit and softCommit commands is waitSearcher=true,
which will not return until a new searcher is "registered".  After
that point, you're guaranteed to get the new searcher for any
requests.  Autowarming happens before searcher registration and hence
isn't an issue.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Performance of Boolean query with hundreds of OR clauses.

2014-09-07 Thread Yonik Seeley
Solr 4.10 has added a {!terms} query that should speed up these cases.

Benchmarks here:
http://heliosearch.org/solr-terms-query/
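A sketch of what the request might look like with the {!terms} parser: a single comma-separated value list instead of a thousand-clause boolean OR. The field name comes from the thread; the client-side helper and descriptor values are hypothetical:

```python
from urllib.parse import urlencode

def terms_fq(field, values):
    # {!terms f=field}v1,v2,... : one parser invocation with a
    # comma-separated list instead of a large boolean query.
    return "{!terms f=%s}%s" % (field, ",".join(values))

descriptors = ["desc1", "desc2", "desc3"]  # hypothetical descriptor strings
params = urlencode({"q": "*:*", "fq": terms_fq("MyImage", descriptors)})
assert terms_fq("MyImage", descriptors) == "{!terms f=MyImage}desc1,desc2,desc3"
```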

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data

On Tue, Aug 19, 2014 at 2:57 PM, SolrUser1543  wrote:
> I am using Solr to perform search for finding similar pictures.
>
> For this purpose, every image is indexed as a set of descriptors (a descriptor
> is a string of 6 chars).
> The number of descriptors for every image may vary (from a few to many thousands).
>
> When I want to search for a similar image, I extract the descriptors
> from it and create a query like:
> MyImage:(desc1 desc2 ... desc n)
>
> The number of descriptors in a query may also vary. Usually it is about 1000.
>
> Of course the performance of this query is very bad; it may take a few minutes
> to return.
>
> Any ideas for performance improvement?
>
> P.S. I also tried to use lire, but it does not fit my use case.


[ANN] Heliosearch 0.07 released

2014-09-07 Thread Yonik Seeley
http://heliosearch.org/download

Heliosearch v0.07 Features
  o  Heliosearch v0.07 is based on (and contains all features of)
Lucene/Solr 4.10.0
  o  An optimized Terms Query with native code performance
enhancements for efficiently matching multiple terms in a field.
  http://heliosearch.org/solr-terms-query/
  o  Native code to accelerate creation of off-heap filters.
  o  Added a off-heap buffer pool to speed allocation of temporary
memory buffers.
  o  Added ConstantScoreQuery support to lucene query syntax.
  Example: +color:blue^=1 text:shoes
  http://heliosearch.org/solr/query-syntax/#ConstantScoreQuery
  o  Added filter support to lucene query syntax. This retrieves an
off-heap filter from the filter cache, essentially like embedding “fq”
(filter queries) in a lucene query at any level. This also effectively
provides a way to “OR” various cached filters together.
  Example: description:HDTV OR filter(+promotion:tv
+promotion_date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY])
  http://heliosearch.org/solr/query-syntax/#FilterQuery
  o  Added C-style comments to lucene query syntax.
  Example: description:HDTV /* this is a comment */
  http://heliosearch.org/solr/query-syntax/#comments

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Custom stat aggregation and sorting function in solr

2014-08-17 Thread Yonik Seeley
On Sun, Aug 17, 2014 at 7:27 AM, dhimant  wrote:
> Hi Yonik,
> Thanks for the reply.
> But i want a unique function on my binary column. This column contains
> binary representation of java hashset.

Ah, got it...  hopefully it's your own binary format and not Java
serialization (the latter would be too slow).
I don't think any analytics options currently have a way to easily
plug in custom functions.

There is something for custom analytics that doesn't integrate with
any of the other analytics options:
http://heliosearch.org/solrs-new-analyticsquery-api/

I am actually working on a way to easily plug in custom analytics into
heliosearch facet functions, but it's not done yet.

Another option you can consider is using a multi-valued string field
instead of a custom binary format... it will allow a lot more
functionality out of the box.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Custom stat aggregation and sorting function in solr

2014-08-17 Thread Yonik Seeley
On Sun, Aug 17, 2014 at 2:35 AM, dhimant  wrote:
> I want to add a new stat
> function (UniqueUsers(fieldName) like add/avg function already available in
> Solr) to find the unique across searched Solr records.

Heliosearch (a solr fork) has this:
http://heliosearch.org/solr-facet-functions/

facet.stat=unique(fieldName)

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Syntax unavailable for parameter substitution Solr 3.5

2014-08-16 Thread Yonik Seeley
You can't do this with stock solr, but a generic templating ability is
now in heliosearch (a fork of solr):
http://heliosearch.org/solr-query-parameter-substitution/

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Fri, Aug 15, 2014 at 5:46 AM, deepaksshettigar
 wrote:
>
> Environment :-
> --
> Solr version 3.5
> Apache Web Server on Jboss AS 5.1.x
>
> ===
> Problem statement :-
> --
>
> I am using a single request handler to handle dynamic scenarios.
> So my UI decides at runtime which facet field (using a dynamic field of type
> String) to apply.
> E.g. depending on the current logged-in user's usergroup (employee, admin, etc.)
> I apply the facet field as
> &facet.field=platform_emp OR &facet.field=platform_admin (it needed to be
> designed this way due to functionality).
>
> Using this technique, I have many such dynamic facet fields, and the Solr
> query string became too long, resulting in HTTP 413 (request entity too
> large).
>
> Now, I am looking move these facet field declarations from the URL to the
> Search Request Handler.
>
> Is there way to have local params do this for me.
>
> =
> Workable Solution:-
> -
> I have tried local params, which work if the whole term is passed through a
> query string,
> but am stuck because the syntax does not allow any concatenation of params to
> a prefix.
>
> My Request handler looks like this -
>
>   
>  
>explicit
>edismax
>.
>.
> {!v=$role}
>
>
> If I pass &role=platform_emp or &role=platform_admin, it works for me, but I
> would like to move the prefix inside the handler, as I have more such facet
> fields to be declared dynamically, e.g.
> facet.field=share_class_emp, facet.field=share_class_admin, etc.
> However, I would like to avoid these multiple facet.field declarations
> through the URL to avoid running into HTTP 413 at runtime.
>
> ===
>
> Required Possible Solution:-
> ---
> Is there a way to have a configuration which might look like this -
>
>  
>  
>explicit
>edismax
>.
>.
>platform_
>share_class_
>{!v=$prefixPlatform$role}
>{!v=$prefixShareClass$role}
>
>
> & pass &role=emp from the URL at runtime.
>
> 
>
> Another query: is it possible to handle HTTP 413 by increasing the allowed
> HTTP request size on Apache/JBoss?
>
> -
>
> Any help will be highly appreciated.
>
> Regards
> Deepak
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Syntax-unavailable-for-parameter-substitution-Solr-3-5-tp4153197.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: content-type for json is not right

2014-08-09 Thread Yonik Seeley
It's configurable:
https://issues.apache.org/jira/browse/SOLR-1123

It has been text/plain since v1.0 by default (so it will render in
browsers) - perhaps you just never noticed?

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


On Sat, Aug 9, 2014 at 6:53 PM, William Bell  wrote:
> http://hgsolr2sl1:8983/solr/autosuggest/select?q=*%3A*&wt=json
>
> We are getting text/plain as the content-type.
>
> We want it to be application/json ?
>
> DId this change in 4.8.1?
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076


Re: Stand alone Solr - no zookeeper?

2014-08-04 Thread Yonik Seeley
On Fri, Aug 1, 2014 at 10:48 AM, Joel Cohen  wrote:
> The only thing so far that I see as a hurdle here is the data set size vs.
> heap size. If the index grows too large, then we have to increase the heap
> size, which could lead to longer GC times. Servers could pop in and out of
> the load balancer if they are unavailable for too long when a major GC
> happens.

We took this on specifically in the Heliosearch fork of Solr via
off-heap data:
http://heliosearch.org/off-heap-filters/
http://heliosearch.org/solr-off-heap-fieldcache/

If you try it out, let us know how it works!

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Solr vs ElasticSearch

2014-08-04 Thread Yonik Seeley
On Mon, Aug 4, 2014 at 2:43 AM, Alexandre Rafalovitch
 wrote:
> That resource is rather superficial. I wouldn't make a big decision based on it.

Agree.  It's also somewhat biased given the environment in which it
grew.  ES advocates were all over stuff like that, but Solr advocates
were less vocal.

Qualitatively:
Solr facets and function queries were faster for ages (no idea if they
still are or not...).
Solr's faceting took up far less memory (that's probably changed
too)... but no mention.
Solr had efficient deep paging first, but most assume it was the other
way around: https://github.com/elasticsearch/elasticsearch/issues/4940
Solr's "function queries" were far faster - I evaluated the mvel
scripting language used by ES for this stuff... it was dog slow.

Something more concrete:
Solr's faceting gives exact counts for the constraints returned, while
ES still does not (it still does a naive "sum top N from each shard".)
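The "naive sum top N from each shard" problem is easy to demonstrate with made-up shard counts (illustrative numbers only):

```python
# Two shards, faceting on "color", asking each shard only for its top 2.
shard1 = {"red": 10, "blue": 9, "green": 8}
shard2 = {"green": 10, "blue": 2, "red": 1}

def top_n(counts, n):
    # Each shard returns only its n most frequent terms.
    return dict(sorted(counts.items(), key=lambda kv: -kv[1])[:n])

naive = {}  # "sum top N from each shard"
for shard in (shard1, shard2):
    for term, count in top_n(shard, 2).items():
        naive[term] = naive.get(term, 0) + count

# green really occurs 18 times, but shard1's green=8 missed its top-2 cutoff:
assert naive["green"] == 10
assert shard1["green"] + shard2["green"] == 18
```

Solr avoids this for the constraints it returns by asking shards to refine counts for candidate terms in a second pass.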

Some things in the table are just wrong:
- Under "joins" for Solr, it says "It's not supported in distributed
search.", yet ES has the exact same limitations... joined docs must be
on the same shard (and provided that is true, joins are both supported
in Solr and ES).
- The comment for "Negative Boosting" is just wrong.  It is supported.
- "Online schema changes" is incorrect for Solr - it is supported.
- "Structured Query DSL"... yes, we've had it forever.  No it's not JSON.
- "Advanced Faceting" is simply a "no" under solr and a "yes" under
ES... this is incorrect.  The tooltip says "metrics and bucketing",
which solr has had forever (facet stats) that tons of people have used
to build BI tools.  Heliosearch adds even more of course.

There are probably things wrong on the ES side too of course.

But then at the bottom some of the things in "Thoughts..." are unfair
and biasing...
"""As Matt Weber points out below, ElasticSearch was built to be
distributed from the ground up, not tacked on as an 'afterthought'
like it was with Solr. This is totally evident when examining the
design and architecture of the 2 products, and also when browsing the
source code."""

That's from a well-known ES advocate, of course.  But software, just
like arguments, should be evaluated on its merits.


-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Search results inconsistency when using joins

2014-07-29 Thread Yonik Seeley
The join qparser has no "fq" parameter, so that is ignored.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data

On Tue, Jul 29, 2014 at 12:12 PM, heaven  wrote:
> _query_:"{!join from=profile_ids_im to=id_i v=$qTweet107001860
> fq=$fqTweet107001860}"


Re: faceting within facets

2014-07-21 Thread Yonik Seeley
On Mon, Jul 21, 2014 at 8:08 AM, David Flower  wrote:
> Is it possible to create a facet within another facet in a single query

For simple field facets, there's pivot faceting.
For more complex nested facets, there are sub-facets in heliosearch (a
solr fork):
http://heliosearch.org/solr-subfacets/

-Yonik


Re: stats.facet with multi-valued field in Solr 4.9

2014-07-21 Thread Yonik Seeley
On Mon, Jul 21, 2014 at 7:32 AM, Nico Kaiser  wrote:
> Yonik, thanks for your reply! I also found 
> https://issues.apache.org/jira/browse/SOLR-1782 which also seems to deal with 
> this, but I did not find out wether there is a workaround.
>
> For our use case the previous behaviour was ok and seemed (!) to be 
> consistent.
> However I understand that this feature had to be disabled if it was broken.
>
> Do you have an idea how to achieve the behaviour I mentioned before?

I don't think there's anything currently committed/released.

There has been work on an Analytics component that could do it.  This
hasn't been committed to Solr yet, but has been committed in
Heliosearch.  Also, Heliosearch has facet functions:
http://heliosearch.org/solr-facet-functions/

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: stats.facet with multi-valued field in Solr 4.9

2014-07-21 Thread Yonik Seeley
On Mon, Jul 21, 2014 at 7:09 AM, Nico Kaiser  wrote:
> After the upgrade to Solr 4.9 (from 3.6) this seems not to be possible 
> anymore:
>
> "Stats can only facet on single-valued fields, not: instrumentIds"

https://issues.apache.org/jira/browse/SOLR-3642

It looks like perhaps it never did work correctly.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: Performance issues with facets and filter query exclusions

2014-07-18 Thread Yonik Seeley
On Fri, Jul 18, 2014 at 2:10 PM, Hayden Muhl  wrote:
> I was doing some performance testing on facet queries and I noticed
> something odd. Most queries tended to be under 500 ms, but every so often
> the query time jumped to something like 5000 ms.
>
> q=*:*&fq={!tag=productBrandId}productBrandId:(156
> 1227)&facet.field={!ex=productBrandId}productBrandId&facet=true
>
> I noticed that the drop in performance happened any time I had a filter
> query tag match up with a facet exclusion.

Is this an actual query that took a long time, or just an example?
My guess is that "q" is actually much more expensive.

If a filter is excluded, the base DocSet for faceting must be re-computed.
This involves intersecting all the DocSets for the other filters not
excluded (which should all be cached) with the DocSet of the query
(which won't be cached and will need to be generated).  That last step
can be expensive, depending on the query.
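A rough sketch of the recomputation described above, with Python sets standing in for DocSets (this is an illustration of the cost model, not Solr's actual code path):

```python
def facet_base(run_query, cached_filters, excluded_tags):
    # When a filter is excluded via facet.field={!ex=tag}..., the base
    # DocSet for faceting is rebuilt: re-run the main query (uncached,
    # the expensive step) and intersect with the remaining cached filters.
    base = set(run_query())              # freshly generated DocSet for q
    for tag, docset in cached_filters.items():
        if tag not in excluded_tags:
            base &= docset               # cheap: these sets are cached
    return base

cached = {"productBrandId": {1, 2, 3}, "inStock": {2, 3, 4}}
run_q = lambda: {1, 2, 3, 4, 5}          # stands in for executing q
assert facet_base(run_q, cached, {"productBrandId"}) == {2, 3, 4}
assert facet_base(run_q, cached, set()) == {2, 3}
```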

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: problem with replication/solrcloud - getting 'missing required field' during update intermittently (SOLR-6251)

2014-07-17 Thread Yonik Seeley
On Wed, Jul 16, 2014 at 10:20 PM, Nathan Neulinger  wrote:
> [{"id":"4b2c4d09-31e2-4fe2-b767-3868efbdcda1","channel": {"add":
> "preet"},"channel": {"add": "adam"}}]
>
> Look at the JSON... It's trying to add two "channel" array elements...
> Should have been:
[...]
> From what I'm reading on JSON - this isn't valid syntax at all.

It is valid... repeated keys are actually allowed by the JSON spec.
How we want to handle this particular situation at the Solr level is
another question.  It's also not clear how this causes intermittent
failures.


-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: TrieDateField, precisionStep impact on sorting performance

2014-07-16 Thread Yonik Seeley
On Wed, Jul 16, 2014 at 5:51 AM, Kuehn, Dennis
 wrote:
> I'd like to sort on a TrieDateField which currently has a precisionStep value 
> of 6.
> From what I got so far, the precisionStep value only affects range query 
> performance and index size.
>
> However, the documentation for TrieDateField says:
> 'precisionStep="0" enables efficient date sorting and minimizes index size; 
> precisionStep="8" (the default) enables efficient range queries.'
>
> Does this mean sorting performance will suffer for precisionStep values other 
> than 0?

No, sorting speed is unaffected by precisionStep.  That comment looks
slightly misleading.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: [ANN] Heliosearch 0.06 released, native code faceting

2014-06-20 Thread Yonik Seeley
On Fri, Jun 20, 2014 at 12:36 PM, Floyd Wu  wrote:
> Hi Yonik, I don't understand the relationship between Solr and Heliosearch,
> since you were a committer of Solr?

Heliosearch is a Solr fork that will hopefully find its way back to
the ASF in the future.

Here's the original project announcement:
http://heliosearch.org/heliosearch-solr-evolved/

And the project FAQ:
http://heliosearch.org/heliosearch-faq/

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: [ANN] Heliosearch 0.06 released, native code faceting

2014-06-20 Thread Yonik Seeley
On Fri, Jun 20, 2014 at 11:16 AM, Floyd Wu  wrote:
> Will these awesome features be implemented in Solr soon?
> On 2014/6/20 10:43 PM, "Yonik Seeley" wrote:

Given the current makeup of the joint Lucene/Solr PMC, it's unclear.
I'm not worrying about that for now, and just pushing Heliosearch as
far and as fast as I can.
Come join us if you'd like to help!

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: [ANN] Heliosearch 0.06 released, native code faceting

2014-06-20 Thread Yonik Seeley
On Fri, Jun 20, 2014 at 10:15 AM, Yago Riveiro  wrote:
> Yonik,
>
> This native code uses in any way the docValues?

Nope... not yet.  It is something I think we should look into in the
future though.

> In the past I was forced to index a big portion of my data with docValues 
> enabled. OOM problems with large terms dictionaries and GC were my main problem.
>
> Another good optimization would be to do facet aggregations outside the heap 
> to minimize GC.

Yeah, the single-valued string faceting in Heliosearch currently does
this (the "counts" array is also off-heap).

> To ensure that facet aggregations have enough RAM we need a large heap; on 
> machines with a lot of RAM, if this aggregation were done off-heap it would 
> allow us to reduce the heap size.

Yeah, it's nice not having to worry so much about the correct heap size too.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: [ANN] Heliosearch 0.06 released, native code faceting

2014-06-20 Thread Yonik Seeley
On Fri, Jun 20, 2014 at 12:36 AM, Andy  wrote:
> Congrats! Any idea when will native faceting & off-heap fieldcache be 
> available for multivalued fields? Most of my fields are multivalued so that's 
> the big one for me.

Hopefully within the next month or so.
If anyone wants to help out, the github issue is here:
https://github.com/Heliosearch/heliosearch/issues/13

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data



> On Thursday, June 19, 2014 3:46 PM, Yonik Seeley  
> wrote:
>
>
>
> FYI, for those who want to try out the new native code faceting, this
> is the first release containing it (for single valued string fields
> only as of yet).
>
> http://heliosearch.org/download/
>
> Heliosearch v0.06
>
> Features:
> o  Heliosearch v0.06 is based on (and contains all features of)
> Lucene/Solr 4.9.0
> o  Native code faceting for single valued string fields.
> - Written in C++, statically compiled with gcc for Windows, Mac OS-X, 
> Linux
> - static compilation avoids JVM hotspot warmup period,
> mis-compilation bugs, and variations between runs
> - Improves performance over 2x
> o  Top level Off-heap fieldcache for single valued string fields in nCache.
> - Improves sorting and faceting speed
> - Reduces garbage collection overhead
> - Eliminates FieldCache “insanity” that exists in Apache Solr from
> faceting and sorting on the same field
> o  Full request Parameter substitution / macro expansion, including
> default value support.
> o  frange query now only returns documents with a value.
>  For example, in Apache Solr, {!frange l=-1 u=1 v=myfield} will
> also return documents without a value since the numeric default value
> of 0 lies within the range requested.
> o  New JSON features via Noggit upgrade, allowing optional comments
> (C/C++ and shell style), unquoted keys, and relaxed escaping that
> allows one to backslash escape any character.
>
>
> -Yonik
> http://heliosearch.org - native code faceting, facet functions,
> sub-facets, off-heap data


[ANN] Heliosearch 0.06 released, native code faceting

2014-06-19 Thread Yonik Seeley
FYI, for those who want to try out the new native code faceting, this
is the first release containing it (for single valued string fields
only as of yet).

http://heliosearch.org/download/

Heliosearch v0.06

Features:
o  Heliosearch v0.06 is based on (and contains all features of)
Lucene/Solr 4.9.0
o  Native code faceting for single valued string fields.
- Written in C++, statically compiled with gcc for Windows, Mac OS-X, Linux
- static compilation avoids JVM hotspot warmup period,
mis-compilation bugs, and variations between runs
- Improves performance over 2x
o  Top level Off-heap fieldcache for single valued string fields in nCache.
- Improves sorting and faceting speed
- Reduces garbage collection overhead
- Eliminates FieldCache “insanity” that exists in Apache Solr from
faceting and sorting on the same field
o  Full request Parameter substitution / macro expansion, including
default value support.
o  frange query now only returns documents with a value.
 For example, in Apache Solr, {!frange l=-1 u=1 v=myfield} will
also return documents without a value since the numeric default value
of 0 lies within the range requested.
o  New JSON features via Noggit upgrade, allowing optional comments
(C/C++ and shell style), unquoted keys, and relaxed escaping that
allows one to backslash escape any character.
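A toy model of the frange change described above (hypothetical field values; `None` stands in for a document with no value in the field):

```python
def frange_matches(docs, lo, hi, legacy=False):
    # legacy=True mimics Apache Solr, where a doc with no value gets the
    # numeric default of 0 and so matches any range spanning 0.
    matched = []
    for doc_id, value in docs.items():
        if value is None:
            if legacy and lo <= 0 <= hi:
                matched.append(doc_id)
        elif lo <= value <= hi:
            matched.append(doc_id)
    return matched

docs = {"a": 0.5, "b": None, "c": 3.0}
assert frange_matches(docs, -1, 1, legacy=True) == ["a", "b"]  # "b" leaks in
assert frange_matches(docs, -1, 1) == ["a"]                    # new behavior
```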


-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data


Re: ANN: Solr Next

2014-06-09 Thread Yonik Seeley
On Tue, Jan 7, 2014 at 1:53 PM, Yonik Seeley  wrote:
[...]
> Next major feature: Native Code Optimizations.
> In addition to moving more large data structures off-heap(like
> UnInvertedField?), I am planning to implement native code
> optimizations for certain hotspots.  Native code faceting would be an
> obvious first choice since it can often be a CPU bottleneck.

It's in!  Abbreviated report: 2x performance increase over stock solr
faceting (which is already fast!)
http://heliosearch.org/native-code-faceting/

-Yonik
http://heliosearch.org -- making solr shine

> Project resources:
>
> https://github.com/Heliosearch/heliosearch
>
> https://groups.google.com/forum/#!forum/heliosearch
> https://groups.google.com/forum/#!forum/heliosearch-dev
>
> Freenode IRC: #heliosearch #heliosearch-dev
>
> -Yonik


Re: Is the act of *caching* an fq very expensive? (seems to cost 4 seconds in my example)

2014-06-03 Thread Yonik Seeley
On Tue, Jun 3, 2014 at 9:48 PM, Brett Hoerner  wrote:
> Yonik, I'm familiar with your blog posts -- and thanks very much for them.
> :) Though I'm not sure what you're trying to show me with the q=*:* part? I
> was of course using q=*:* in my queries, but I assume you mean to leave off
> the text:lol bit?
>
> I've done some Cluster changes, so these are my baselines:
>
> q=*:*
> fq=created_at_tdid:[1392768004 TO 1393944400] (uncached at this point)
> ~7.5 seconds
>
> q=*:*
> fq={!cache=false}created_at_tdid:[1392768005 TO 1393944400]
> ~7.5 seconds (I guess this is what you were trying to show me?)

Correct.

> The thing is, my queries are always more "specific" than that, so given a
> string:
>
> q=*:*
> fq=text:basketball
> fq={!cache=false}created_at_tdid:[1392768007 TO 1393944400]
> ~5.2 seconds
>
> q=*:*
> fq=text:basketball
> fq={!cache=false}created_at_tdid:[1392768005 TO 1393944400]
> ~1.6 seconds
>
> Is there no hope for my first time fq searches being as fast as non-cached
> fqs?

Not really...
Think of it as an optimization in the cache=false case... we can
consider the rest of the request context (and use it to skip over a
lot of useless work).  If a filter will be cached, it must be valid
for all contexts, not just the current one (hence we can't skip any
work generating it).

An analogy... say that my job is determining if a given number is
prime.  Call it a prime-number filter.
Cached case: I figure out all prime numbers less than 1000 and save
the list (and have a very fast lookup).  Every person that comes up to
me after that and gives me a single number (or a small handful) I can
quickly answer if it's prime or not.
Uncached case: A single person walks up to me and asks if a single
number is prime.  I check that number only.

The uncached case is obviously faster the first time because much less
work is done.
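The analogy translates almost directly into code (a toy model of the trade-off, not Solr's filter cache):

```python
def is_prime(n):
    # Trial division up to sqrt(n).
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

# Cached case: pay once up front, then every future lookup is instant.
prime_cache = {n for n in range(1000) if is_prime(n)}  # the expensive part
assert 997 in prime_cache                              # cheap ever after

# Uncached case: a one-off question only does the work for that one number.
assert is_prime(997)
```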

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: Is the act of *caching* an fq very expensive? (seems to cost 4 seconds in my example)

2014-06-03 Thread Yonik Seeley
On Tue, Jun 3, 2014 at 5:19 PM, Yonik Seeley  wrote:
> So try:
>   q=*:*
>   fq=created_at_tdid:[1400544000 TO 1400630400]

vs

So try:
  q=*:*
  fq={!cache=false}created_at_tdid:[1400544000 TO 1400630400]


-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: Is the act of *caching* an fq very expensive? (seems to cost 4 seconds in my example)

2014-06-03 Thread Yonik Seeley
On Tue, Jun 3, 2014 at 4:44 PM, Brett Hoerner  wrote:
> If I run a query like this,
>
> fq=text:lol
> fq=created_at_tdid:[1400544000 TO 1400630400]
>
> It takes about 6 seconds. Following queries take only 50ms or less, as
> expected because my fqs are cached.
>
> However, if I change the query to not cache my big range query:
>
> fq=text:lol
> fq={!cache=false}created_at_tdid:[1400544000 TO 1400630400]
>
> It takes 2 seconds every time, which is a much better experience for my
> "first query for that range."
>
> What's odd to me is that I would expect both of these (first) queries to
> have to do the same amount of work, except the first one stuffs the
> resulting bitset into a map at the end... which seems to have a 4 second
> overhead?

They are not equivalent.  Caching the filter separately (so it can be
reused in any combination with other queries and filters) means that
*all* docs that match the filter are collected and cached (it's the
collection that is taking the time).

For the {!cache=false} case, Solr executes different code (it doesn't
just skip the caching step).

http://heliosearch.org/advanced-filter-caching-in-solr/

"""When a filter isn’t generated up front and cached, it’s executed in
parallel with the main query. First, the filter is asked about the
first document id that it matches. The query is then asked about the
first document that is equal to or greater than that document. The
filter is then asked about the first document that is equal to or
greater than that. The filter and the query play this game of leapfrog
until they land on the same document and it’s declared a match, after
which the document is collected and scored."""

So try:
  q=*:*
  fq=created_at_tdid:[1400544000 TO 1400630400]

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache
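The leapfrog described in the quoted paragraph can be sketched roughly as follows (Python, illustrative only; the real implementation advances Lucene DocIdSetIterators rather than walking lists):

```python
def leapfrog(query_docs, filter_docs):
    """Intersect two sorted doc-id sequences by alternately advancing
    each one to the other's current position (the 'leapfrog')."""
    matches = []
    qi = fi = 0
    while qi < len(query_docs) and fi < len(filter_docs):
        q, f = query_docs[qi], filter_docs[fi]
        if q == f:
            matches.append(q)   # both land on the same doc: collect it
            qi += 1
            fi += 1
        elif q < f:
            qi += 1             # advance the query toward the filter's doc
        else:
            fi += 1             # advance the filter toward the query's doc
    return matches

print(leapfrog([1, 4, 7, 9], [2, 4, 9, 12]))  # [4, 9]
```

Note that neither side ever materializes its full match set, which is exactly why the `{!cache=false}` path can skip so much work.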


Re: slow performance on simple filter

2014-05-31 Thread Yonik Seeley
On Sat, May 31, 2014 at 8:47 AM, mizayah  wrote:
> i show you my full query
>
> it's rly simple one
> q=*:* and fq=class_name:CdnFile
>
> debug q shows that process of q takes so long.
> single filter is critical here.

400ms is too long... something is strange.
One possibility is that the part of the index used to generate the
filter was not in OS cache and thus disk IO needed to be performed to
generate the filter.

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: Regex with local params is not working

2014-05-28 Thread Yonik Seeley
On Wed, May 28, 2014 at 1:41 AM, Lokn  wrote:
> Thanks for the reply.
> I am using edismax for the query parsing. Still it's not working.
> Instead of using local params, if I use the field directly then regex is
> working fine.

It's not for me...

This does not work:
http://localhost:8983/solr/query?defType=edismax&q=/[A-Z]olr/&debugQuery=true

But this does work:
http://localhost:8983/solr/query?defType=lucene&q=/[A-Z]olr/&debugQuery=true

edismax was developed before the lucene query parser syntax was
changed to include regex, so maybe that's the issue.
Not that "/" was a great character to use for regex... it's too widely
used in URLs, paths, etc.  I'd almost argue against following lucene
syntax in this case and enabling regex a different way.


-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: 答复: Internals about "Too many values for UnInvertedField faceting on field xxx"

2014-05-27 Thread Yonik Seeley
On Mon, May 26, 2014 at 9:21 PM, 张月祥  wrote:
> Thanks a lot.
>
>> There are only 256 byte arrays to hold all of the ord data, and the
> pointers into those arrays are only 24 bits long.  That gets you back
> to 32 bits, or 4GB of ord data max.  It's practically less since you
> only have to overflow one array before the exception is thrown.
>
> What does the ord data mean? Term Id or Term-Document Relation or 
> Document-Term Relation ?

Every document has a list of term numbers (term ords) associated with it.
The deltas between sorted term numbers are vInt encoded.

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache
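The delta + vInt idea can be sketched like this (a rough Python illustration, not Lucene's actual encoding code): sorted term ords are stored as gaps, and each gap is written 7 bits per byte with a continuation bit, so small gaps cost one byte no matter how large the ords themselves are.

```python
def vint_encode(n):
    """Encode a non-negative int as a variable-length byte string:
    7 data bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_ords(ords):
    """Delta-encode a sorted list of term ords, vInt-encoding each delta."""
    prev = 0
    encoded = b""
    for o in sorted(ords):
        encoded += vint_encode(o - prev)
        prev = o
    return encoded

# The first delta (100000) takes 3 bytes; the next two (1 and 2) take
# one byte each, since only the gaps are stored.
print(len(encode_ords([100000, 100001, 100003])))
```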


Re: Regex with local params is not working

2014-05-27 Thread Yonik Seeley
On Tue, May 27, 2014 at 4:38 AM, Lokn  wrote:
> With solr local params, the regex is not working.
> My sample query: q ={!qf=$myfield_qf}/[a-d]ad/, where I have myfield_qf
> defined in the solrconfig.xml.

add debugQuery=true to the request to see what query is actually produced.

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: 答复: Internals about "Too many values for UnInvertedField faceting on field xxx"

2014-05-25 Thread Yonik Seeley
On Sat, May 24, 2014 at 9:50 PM, 张月祥  wrote:
> Thanks for your reply. I'll try it.
>
> We're  still interested in the real limitation about  "Too many values for
> UnInvertedField faceting on field xxx" .
>
> Could anybody tell us some internals about "Too many values for
> UnInvertedField faceting on field xxx" ?

There are only 256 byte arrays to hold all of the ord data, and the
pointers into those arrays are only 24 bits long.  That gets you back
to 32 bits, or 4GB of ord data max.  It's practically less since you
only have to overflow one array before the exception is thrown.

This faceting method is best for high numbers of unique values, but a
relatively low number of unique values per document.
I've been considering making an off-heap version for Heliosearch, and
maybe bump the limits a little at the same time...

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache
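The quoted limit works out as an 8-bit array selector (256 arrays) plus a 24-bit offset into the selected array, i.e. a 32-bit combined address space. A quick sanity check of that arithmetic (illustrative Python):

```python
NUM_ARRAYS = 256                    # 2**8 byte arrays holding ord data
POINTER_BITS = 24                   # offset within one array
max_per_array = 2 ** POINTER_BITS   # 16 MiB capacity per array

total_ord_bytes = NUM_ARRAYS * max_per_array

print(total_ord_bytes == 2 ** 32)   # 4 GiB of ord data max
print(max_per_array)                # overflowing any single 16 MiB array
                                    # is enough to trigger the exception
```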


Re: Query translation of User Fields

2014-05-25 Thread Yonik Seeley
On Thu, May 22, 2014 at 10:56 AM, Jack Krupansky
 wrote:
> Hmmm... that doesn't sound like what I would have expected - I would have
> thought that Solr would throw an exception on the "user" field, rather than
> simply treat it as a text keyword.

No, I believe that's working as designed.  edismax should never throw
exceptions due to the structure of the user query.
Just because something looks like a field query (has a : in it)
doesn't mean it was intended to be.

Examples:
Terminator 2: Judgment Day
Mission: Impossible

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: How does query on AND work

2014-05-23 Thread Yonik Seeley
On Fri, May 23, 2014 at 11:37 AM, Toke Eskildsen  
wrote:
> Per Steffensen [st...@designware.dk] wrote:
>> * It IS more efficient to just use the index for the
>> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
>> part and then fetch timestamp-doc-values for those doc-ids to filter out
>> the docs that does not match the "timestamp_dlng_doc_ind_sto"-part of
>> the query.
>
> Thank you for the follow up. It sounds rather special-case though, with 
> requirement of DocValues for the range-field. Do you think this can be 
> generalized?

Maybe it already is?
http://heliosearch.org/advanced-filter-caching-in-solr/

Something like this:
 &fq={!frange cache=false cost=150 v=timestampField l=beginTime u=endTime}


-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: Extensibility and code reuse: SOLR vs Lucene

2014-05-20 Thread Yonik Seeley
On Tue, May 20, 2014 at 6:01 PM, Achim Domma  wrote:
> - I found several times code snippets like " if (collector instanceof 
> DelegatingCollector) { ((DelegatingCollector)collector).finish() } ". Such 
> code is considered bad practice in every OO language I know. Do I miss 
> something here? Is there a reason why it's solved like this?

In a single code base you would be correct (we would just add a finish
method to the base Collector class).  When you are adding additional
functionality to an existing API/code base however, this is often the
only way to do it.

What type of aggregation are you looking for?  The Heliosearch project
(a Solr fork), also has this:
http://heliosearch.org/solr-facet-functions/

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: deep paging without sorting / keep IRs open

2014-05-17 Thread Yonik Seeley
On Sat, May 17, 2014 at 10:30 AM, Yonik Seeley  wrote:
> I think searcher leases would fit the bill here?
> https://issues.apache.org/jira/browse/SOLR-2809
>
> Not yet implemented though...

FYI, I just put up a simple LeaseManager implementation on that issue.

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: deep paging without sorting / keep IRs open

2014-05-17 Thread Yonik Seeley
On Wed, May 14, 2014 at 8:34 AM, Tommaso Teofili
 wrote:
> Basically I need the ability to keep running searches against a specified
> commit point / index reader / state of the Lucene / Solr index.

I think searcher leases would fit the bill here?
https://issues.apache.org/jira/browse/SOLR-2809

Not yet implemented though...

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: Solr performance: multiValued field vs separate fields

2014-05-17 Thread Yonik Seeley
On Thu, May 15, 2014 at 10:29 AM, danny teichthal  wrote:
> I wonder about performance difference of 2 indexing options: 1- multivalued
> field 2- separate fields
>
> The case is as follows: Each document has 100 “properties”: prop1..prop100.
> The values are strings and there is no relation between different
> properties. I would like to search by exact match on several properties by
> known values (like ids). For example: search for all docs having
> prop1=”blue” and prop6=”high”
>
> I can choose to build the indexes in 1 of 2 ways: 1- the trivial way – 100
> separate fields, 1 for each property, multiValued=false. the values are
> just property values. 2- 1 field (named “properties”) multiValued=true. The
> field will have 100 values: value1=”prop1:blue”.. value6=”high” etc
>
> Is it correct to say that option1 will have much better performance in
> searching?  How about indexing performance?

For straight exact-match searching (matching properties) there should
be no difference.  A single field should be slightly faster at
indexing.

If you need fast numeric range queries, faceting, or sorting on any
properties, you would want those as separate fields.

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: Question regarding the lastest version of HeliosSearch

2014-05-16 Thread Yonik Seeley
On Thu, May 15, 2014 at 3:44 PM, Jean-Sebastien Vachon
 wrote:
> I spent some time today playing around with subfacets and facets functions 
> now available in helios search 0.05 and I have some concerns... They look 
> very promising .

Thanks, glad for the feedback!

[...]
> the response looks good except for one little thing... the mincount is not 
> respected whenever I specify the facet.stat parameter. Removing it will cause 
> the mincount to be respected but then I need this parameter.

Right, the mincount parameter is not yet implemented.   Hopefully soon!

> {
>
>   "val":1133,
>
>   "unique(job_id)":0, <== what is this?
>
>   "count":0},
>  Many zero entries following...
>
> I was wondering where the extra entries were coming from... the position_id = 
> 1133 above is not even a match for my query (its title is "Audit Consultant")
> I`ve also noticed a similar behaviour when using subfacets. It looks like the 
> number of items returned always matches the "facet.limit" parameter.
> If not enough values are present for a given entry then the bucket is filled 
> with documents not matching the original query.

Right... straight Solr faceting will do this too (unless you have a
mincount>0).  We're just looking at terms in the field and we don't
have enough context to know if some 0's make more sense than others to
return.

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: Histogram facet?

2014-05-06 Thread Yonik Seeley
On Tue, May 6, 2014 at 5:30 PM, Romain Rigaux  wrote:
> This looks nice!
>
> The only missing piece for more interactivity would be to be able to map
> multiple field values into the same bucket.
>
> e.g.
>
> http://localhost:8983/solr/query?
>q=*:*
>&facet=true
>&facet.field=*round(date, '15MINUTES')*
>&facet.stat=sum(retweetCount)
>
> This is a bit similar to SOLR-4772 for the rounding.
>
> Then we could zoom out just by changing the size of the bucket, without any
> index change, e.g.:
> http://localhost:8983/solr/query?
>q=*:*
>&facet=true
>&facet.field=*round(date, '1HOURS')*
>&facet.stat=sum(retweetCount)

For this specific example, I think "map multiple field values into the
same bucket" equates to a range facet?

facet.range=mydatefield
facet.range.start=...
facet.range.end=...
facet.range.gap=+1HOURS
facet.stat=sum(retweetCount)

And then if you need additional breakouts by time range, you can use subfacets:

subfacet.mydatefield.field=mycategoryfield

That will provide retweet counts broken out by "mycategoryfield" for
every bucket produced by the range query.

See http://heliosearch.org/solr-subfacets/

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: query(subquery, default) filters results

2014-05-06 Thread Yonik Seeley
On Tue, May 6, 2014 at 5:08 AM, Matteo Grolla  wrote:
> Hi everybody,
> I'm having troubles with the function query
>
> "query(subquery, default)"  
> http://wiki.apache.org/solr/FunctionQuery#query
>
> running this
>
> http://localhost:8983/solr/select?q=query($qq,1)&qq={!dismax qf=text}hard 
> drive

The default query syntax is lucene, so "query(..." will just be parsed as text.
Try q={!func}query($qq,1)
OR
defType=func&q=query($qq,1)

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters
+ fieldcache


Re: Histogram facet?

2014-05-06 Thread Yonik Seeley
On Mon, May 5, 2014 at 6:18 PM, Romain  wrote:
> Hi,
>
> I am trying to plot a non date field by time in order to draw an histogram
> showing its evolution during the week.
>
> For example, if I have a tweet index:
>
> Tweet:
>   date
>   retweetCount
>
> 3 tweets indexed:
> Tweet | Date  | Retweet
> A     | 01/01 | 100
> B     | 01/01 | 100
> C     | 01/02 | 100
>
> If I want to plot the number of tweets by day: easy with a date range facet:
> Day 1: 2
> Day 2: 1
>
> But now counting the number of retweet by day is not possible natively:
> Day 1: 200
> Day 2: 100

Check out "facet functions" in Heliosearch (an experimental fork of Solr):
http://heliosearch.org/solr-facet-functions/

All you would need to do is add:
facet.stat=sum(retweetCount)

-Yonik
http://heliosearch.org - solve Solr GC pauses with off-heap filters
and fieldcache
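What facet.stat=sum(retweetCount) computes per bucket can be illustrated with the sample data from the question (plain Python, not Solr code): the range facet counts tweets per day bucket, while the stat sums retweetCount over the same buckets.

```python
from collections import defaultdict

tweets = [  # (id, date, retweetCount) from the example above
    ("A", "01/01", 100),
    ("B", "01/01", 100),
    ("C", "01/02", 100),
]

counts = defaultdict(int)        # tweets per day (the date range facet)
retweet_sums = defaultdict(int)  # sum(retweetCount) per day bucket

for _id, date, retweets in tweets:
    counts[date] += 1
    retweet_sums[date] += retweets

print(dict(counts))        # {'01/01': 2, '01/02': 1}
print(dict(retweet_sums))  # {'01/01': 200, '01/02': 100}
```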


HDS 4.8.0_01 released - solr tomcat distro

2014-05-01 Thread Yonik Seeley
For those Tomcat fans out there, we've released HDS 4.8.0_01,
based on Solr 4.8.0 of course.  HDS is pretty much just Apache Solr,
with the addition of a Tomcat based server.

Download: http://heliosearch.com/heliosearch-distribution-for-solr/

HDS details:
- includes a pre-configured (threads, logging, connection settings,
message sizes, etc) and tested Tomcat based Solr server  in the
"server" directory
- start scripts can be run from anywhere, and allow passing JVM args
on command line (just like jetty, so it makes it easier to use)
- start scripts work around known JVM bugs
- start scripts allow setting port from command line, and default stop
port based off of http port to make it easy to run multiple servers on
a single box)
- the "server" directory has been kept clean but stuffing all of
tomcat under the "server/tc" directory


Getting started:
$ cd server
$ bin/startup.sh

To start on a different port (e.g. 7574):
$ cd server
$ bin/startup.sh -Dhttp.port=7574

To shut down:
$ cd server
$ bin/shutdown.sh -Dhttp.port=7574

The scripts even accept -Djetty.port=7574 to make it easier to
cut-n-paste from start examples using jetty.  The "example" directory
is still there too, so you can still run the jetty based server if you
want.


-Yonik
http://heliosearch.org - solve Solr GC pauses with off-heap filters
and fieldcache

