Re: solr.StrField or solr.StringField?

2016-05-03 Thread Jack Krupansky
Yeah, that's a typo. The same typo is in the official Solr Reference Guide:
https://cwiki.apache.org/confluence/display/solr/Putting+the+Pieces+Together

[ATTN: Solr Ref Guide Team!]


-- Jack Krupansky

On Tue, May 3, 2016 at 4:14 PM, John Bickerstaff <j...@johnbickerstaff.com>
wrote:

> I'm assuming it's another "class" or data type that someone built - but I'm
> afraid I don't know any more than that.
>
> An alternative possibility (supported by at least one of the links on that
> page you linked) is that it's just a typo -- people typing quickly and
> forgetting the exact (truncated) spelling of the field.
>
> In that case, it's talking about using it for faceting and IIRC, you want a
> non-analyzed field for that - preserve it exactly as it is for facet
> queries -- that suggests to me that the author actually meant StrField
>
> 
>
> I might want to index the same data differently in three different fields
> (perhaps using the Solr copyField
> <http://wiki.apache.org/solr/SchemaXml#Copy_Fields> directive):
>
>- For searching: Tokenized, case-folded, punctuation-stripped:
>   - schildt / herbert / wolpert / lewis / davies / p
>- For sorting: Untokenized, case-folded, punctuation-stripped:
>   - schildt herbert wolpert lewis davies p
>    - For faceting: Primary author only, using a solr.StringField:
>       - Schildt, Herbert
>
> Then when the user drills down on the "Schildt, Herbert" string I would
> reissue the query with an added fq=author:"Schildt, Herbert" parameter.
>
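
As a rough schema.xml sketch of that three-field setup (the field and type
names below are illustrative assumptions, not from any shipped schema):

    <field name="author_raw"    type="string"       indexed="false" stored="true"  multiValued="true"/>
    <field name="author_search" type="text_general" indexed="true"  stored="false" multiValued="true"/>
    <field name="author_sort"   type="name_sort"    indexed="true"  stored="false"/>
    <field name="author_facet"  type="string"       indexed="true"  stored="true"  multiValued="true"/>

    <copyField source="author_raw" dest="author_search"/>

Here "string" is the correctly spelled solr.StrField type, and "name_sort"
stands in for a solr.TextField type built on KeywordTokenizerFactory plus
LowerCaseFilterFactory so the whole value sorts as one case-folded token. The
sort value (a single flattened string) and the facet value (primary author
only) would have to be supplied by the indexing client or an update processor,
since copyField cannot reduce or reorder values.
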
> On Tue, May 3, 2016 at 2:01 PM, Steven White <swhite4...@gmail.com> wrote:
>
> > Thanks John.
> >
> > Yes, the out-of-the-box schema.xml does not have solr.StringField.
> > However, a number of Solr pages on the web mention solr.StringField [1],
> > and thus I'm not sure whether it's a typo or a real thing that is simply
> > missing from the official Solr wikis.
> >
> > Steve
> >
> > [1] https://wiki.apache.org/solr/SolrFacetingOverview,
> >
> >
> http://grokbase.com/t/lucene/solr-commits/06cw5038rk/solr-wiki-update-of-solrfacetingoverview-by-jjlarrea
> > ,
> >
> > On Tue, May 3, 2016 at 3:35 PM, John Bickerstaff <
> j...@johnbickerstaff.com
> > >
> > wrote:
> >
> > > My default schema.xml does not have an entry for solr.StringField so I
> > > can't tell you what that one does.
> > >
> > > If you look for solr.StrField in the schema.xml file, you'll get some
> > idea
> > > of how it's defined.  The default setting is for it not to be analyzed.
> > >
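
For reference, the default schema.xml of that era declares the type roughly
like this (exact attributes vary by version; only the class name matters here):

    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>

Because solr.StrField has no analyzer, the value is indexed verbatim, which is
exactly what you want for faceting and exact-match filtering.
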
> > > On Tue, May 3, 2016 at 10:16 AM, Steven White <swhite4...@gmail.com>
> > > wrote:
> > >
> > > > Hi Everyone,
> > > >
> > > > Are solr.StrField and solr.StringField the same thing?
> > > >
> > > > Thanks in advance!
> > > >
> > > > Steve
> > > >
> > >
> >
>


Re: concat 2 fields

2016-04-26 Thread Jack Krupansky
As I myself had commented on that grokbase thread so many months ago, there
are examples of how to do this in my old Solr 4.x Deep Dive book.

If you read the grokbase thread carefully, you will see that you left out
the prefix "Custom" in front of "Concat" - this is not a standard Solr
feature.

Concat simply combines multiple values for a single field into a single
value. It does that for each specified field independently. It will not
concatenate two separate fields.

What you can do is Clone your second field to the name of the first field,
which will result in two values for the first field. Then you can use
Concat to combine the two values.
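
A minimal solrconfig.xml sketch of that Clone-then-Concat approach, using the
latitude/longitude/geo_location field names from the question (the chain name
is arbitrary, and geo_location must exist in the schema, e.g. as a stored
string field):

    <updateRequestProcessorChain name="concat-latlon">
      <!-- copy both source fields into geo_location, giving it two values -->
      <processor class="solr.CloneFieldUpdateProcessorFactory">
        <str name="source">latitude</str>
        <str name="source">longitude</str>
        <str name="dest">geo_location</str>
      </processor>
      <!-- collapse the two values into one "lat,lon" string -->
      <processor class="solr.ConcatFieldUpdateProcessorFactory">
        <str name="fieldName">geo_location</str>
        <str name="delimiter">,</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

The chain then has to be referenced from the update handler (or passed as
update.chain on the request) for it to run.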



-- Jack Krupansky

On Thu, Apr 21, 2016 at 5:29 AM, vrajesh <vrajes...@gmail.com> wrote:

> I am trying to concatenate two fields to use them as one field, following
>
> http://grokbase.com/t/lucene/solr-user/138vr75hvj/concat-2-fields-in-another-field
>
> but the solution given there did not work when I tried it. Please help me
> with it.
>  I am trying to concat the latitude and longitude fields into a single
> unit using the following:
>  
>
> 
>
>  
>  i added it to solrconfig.xml.
>
>  Some of my doubts are:
>  - Should we define the destination field (geo_location) in schema.xml?
>
>  - I want to make this combined field (geo_location) a field facet, so I
> have to add   in
>
>  - Is there any specific tag in which I should add the above process script
> to make it work?
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/concat-2-fields-tp4271760.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Jack Krupansky
Or should this be rated higher for NY, since it's shorter:

* New York

Another thought on length norms: with the advent of multi-field dismax with
per-field boosting, people tend to explicitly boost the title field so that
the traditional length normalization is less relevant.


-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:39 PM, Walter Underwood <wun...@wunderwood.org>
wrote:

> Sure, here are some real world examples from my time at Netflix.
>
> Is this movie twice as much about “new york”?
>
> * New York, New York
>
> Which one of these is the best match for “blade runner”:
>
> * Blade Runner: The Final Cut
> * Blade Runner: Theatrical & Director’s Cut
> * Blade Runner: Workprint
>
> http://dvd.netflix.com/Search?v1=blade+runner
>
> At Netflix (when I was there), those were shown in popularity order with a
> boost function.
>
> And for stemming, should the movie “Saw” match “see”? Maybe not.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Apr 20, 2016, at 5:28 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
> >
> > Maybe it's a cultural difference, but I can't imagine why on a query for
> > "John", any of those titles would be treated as anything other than
> equals
> > - namely, that they are all about John. Maybe the issue is that this
> seems
> > like a contrived example, and I'm asking for a realistic example. Or,
> maybe
> > you have some rule of relevance that you haven't yet shared - and I mean
> > rule that a user would comprehend and consider valuable, not simply a
> > mechanical rule.
> >
> >
> >
> > -- Jack Krupansky
> >
> > On Wed, Apr 20, 2016 at 8:10 PM, <jimi.hulleg...@svensktnaringsliv.se>
> > wrote:
> >
> >> Ok sure, I can try and give some examples :)
> >>
> >> Lets say that we have the following documents:
> >>
> >> Id: 1
> >> Title: John Doe
> >>
> >> Id: 2
> >> Title: John Doe Jr.
> >>
> >> Id: 3
> >> Title: John Lennon: The Life
> >>
> >> Id: 4
> >> Title: John Thompson's Modern Course for the Piano: First Grade Book
> >>
> >> Id: 5
> >> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> >> Youngest Member of Jackson's Staff from John Brown's Raid to the
> Hanging of
> >> Mrs. Surratt
> >>
> >>
> >> And in general, when a search word matches the title, I would like to
> have
> >> the length of the title field influence the score, so that matching
> >> documents with shorter title get a higher score than documents with
> longer
> >> title, all else considered equal.
> >>
> >> So, when a user searches for "John", I would like the results to be
> pretty
> >> much in the order presented above. Though, it is not crucial that for
> >> example document 1 comes before document 2. But I would surely want
> >> document 1-3 to come before document 4 and 5.
> >>
> >> In my mind, the fieldNorm is a perfect solution for this. At least in
> >> theory. In practice, the encoding of the fieldNorm seems to make this
> >> function much less useful for this use case. Unless I have missed
> something.
> >>
> >> Is there another way to achieve something like this? Note that I don't
> want
> >> a general boost on documents with short titles, I only want to boost
> them
> >> if the title field actually matched the query.
> >>
> >> /Jimi
> >>
> >> 
> >> From: Jack Krupansky <jack.krupan...@gmail.com>
> >> Sent: Thursday, April 21, 2016 1:28 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Is it possible to configure a minimum field length for the
> >> fieldNorm value?
> >>
> >> I'm not sure I fully follow what distinction you're trying to focus on.
> I
> >> mean, traditionally length normalization has simply tried to
> distinguish a
> >> title field (rarely more than a dozen words) from a full body of text,
> or
> >> maybe an abstract, not things like exactly how many words were in a
> title.
> >> Or, as another example, a short newswire article of a few paragraphs
> vs. a
> >> feature-length article, paper, or even book. IOW, traditionally it was
> more
> >> of a boolean than a broad range of values. Sure, yes, you absolutely can

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Jack Krupansky
Maybe it's a cultural difference, but I can't imagine why on a query for
"John", any of those titles would be treated as anything other than equals
- namely, that they are all about John. Maybe the issue is that this seems
like a contrived example, and I'm asking for a realistic example. Or, maybe
you have some rule of relevance that you haven't yet shared - and I mean
rule that a user would comprehend and consider valuable, not simply a
mechanical rule.



-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:10 PM, <jimi.hulleg...@svensktnaringsliv.se>
wrote:

> Ok sure, I can try and give some examples :)
>
> Lets say that we have the following documents:
>
> Id: 1
> Title: John Doe
>
> Id: 2
> Title: John Doe Jr.
>
> Id: 3
> Title: John Lennon: The Life
>
> Id: 4
> Title: John Thompson's Modern Course for the Piano: First Grade Book
>
> Id: 5
> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of
> Mrs. Surratt
>
>
> And in general, when a search word matches the title, I would like to have
> the length of the title field influence the score, so that matching
> documents with shorter title get a higher score than documents with longer
> title, all else considered equal.
>
> So, when a user searches for "John", I would like the results to be pretty
> much in the order presented above. Though, it is not crucial that for
> example document 1 comes before document 2. But I would surely want
> document 1-3 to come before document 4 and 5.
>
> In my mind, the fieldNorm is a perfect solution for this. At least in
> theory. In practice, the encoding of the fieldNorm seems to make this
> function much less useful for this use case. Unless I have missed something.
>
> Is there another way to achieve something like this? Note that I don't want
> a general boost on documents with short titles, I only want to boost them
> if the title field actually matched the query.
>
> /Jimi
>
> 
> From: Jack Krupansky <jack.krupan...@gmail.com>
> Sent: Thursday, April 21, 2016 1:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> I'm not sure I fully follow what distinction you're trying to focus on. I
> mean, traditionally length normalization has simply tried to distinguish a
> title field (rarely more than a dozen words) from a full body of text, or
> maybe an abstract, not things like exactly how many words were in a title.
> Or, as another example, a short newswire article of a few paragraphs vs. a
> feature-length article, paper, or even book. IOW, traditionally it was more
> of a boolean than a broad range of values. Sure, yes, you absolutely can
> define a custom similarity with a custom norm that supports a wide range of
> lengths, but you'll have to decide what you really want  to achieve to tune
> it.
>
> Maybe you could give a couple examples of field values that you feel should
> be scored differently based on length.
>
> -- Jack Krupansky
>
> On Wed, Apr 20, 2016 at 7:17 PM, <jimi.hulleg...@svensktnaringsliv.se>
> wrote:
>
> > I am talking about the title field. And for the title field, a sweetspot
> > interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> > value that differentiates between for example 2, 3, 4 and 5 terms in the
> > title, but only very little.
> >
> > The 20% number I got by simply calculating the difference in the title
> > fieldNorm of two documents, where one title was one word longer than the
> > other title. And one fieldNorm value was 20% larger than the other as a
> > result of that. And since we use multiplicative scoring calculation, a
> 20%
> > increase in the fieldNorm results in a 20% increase in the final score.
> >
> > I'm not talking about "scores as percentages". I'm simply noting that
> this
> > minor change in the text data (adding or removing one single word) causes
> > the score to change by almost 20%. I noted this when I renamed a
> > document, removing a word from the title, and that single change caused
> the
> > document to move up several positions in the result list. We don't want
> > such minor modifications to have such big impact of the resulting score.
> >
> > I'm not sure I can agree with you that "the effect of document length
> > normalization factor is minimal". Then why does it impact our result in
> > such a big way? And as I said, we don't want to disable it completely, we
> > just want it to have a much lesser effect, even on

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Jack Krupansky
I'm not sure I fully follow what distinction you're trying to focus on. I
mean, traditionally length normalization has simply tried to distinguish a
title field (rarely more than a dozen words) from a full body of text, or
maybe an abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a
feature-length article, paper, or even book. IOW, traditionally it was more
of a boolean than a broad range of values. Sure, yes, you absolutely can
define a custom similarity with a custom norm that supports a wide range of
lengths, but you'll have to decide what you really want  to achieve to tune
it.

Maybe you could give a couple examples of field values that you feel should
be scored differently based on length.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 7:17 PM, <jimi.hulleg...@svensktnaringsliv.se>
wrote:

> I am talking about the title field. And for the title field, a sweetspot
> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> value that differentiates between for example 2, 3, 4 and 5 terms in the
> title, but only very little.
>
> The 20% number I got by simply calculating the difference in the title
> fieldNorm of two documents, where one title was one word longer than the
> other title. And one fieldNorm value was 20% larger than the other as a
> result of that. And since we use multiplicative scoring calculation, a 20%
> increase in the fieldNorm results in a 20% increase in the final score.
>
> I'm not talking about "scores as percentages". I'm simply noting that this
> minor change in the text data (adding or removing one single word) causes
> the score to change by almost 20%. I noted this when I renamed a
> document, removing a word from the title, and that single change caused the
> document to move up several positions in the result list. We don't want
> such minor modifications to have such big impact of the resulting score.
>
> I'm not sure I can agree with you that "the effect of document length
> normalization factor is minimal". Then why does it impact our result in
> such a big way? And as I said, we don't want to disable it completely, we
> just want it to have a much lesser effect, even on really short texts.
>
> /Jimi
>
> 
> From: Ahmet Arslan <iori...@yahoo.com.INVALID>
> Sent: Thursday, April 21, 2016 12:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> Hi Jimi,
>
> Please define a meaningful document-lenght range like min=1 max=50.
> By the way you need to reindex every time you change something.
>
> Regarding 20% score change, I am not sure how you calculated that number
> and I assume it is correct.
> What really matters is the relative order of documents. It doesn't mean
> anything that the addition of a word decreases the initial score by x%. Please see:
> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>
> There is an information retrieval heuristic which says that addition of a
> non-query term should decrease the score.
>
> Lucene's default document length normalization may favor short documents
> too much. But folks blend the score with other structural fields (popularity),
> or even completely bypass the relevancy score and order by price, production
> date, etc. I mean that in many use cases the effect of the document length
> normalization factor is minimal.
>
> Lucene/Solr is highly pluggable, very easy to customize.
>
> Ahmet
>
>
> On Wednesday, April 20, 2016 11:05 PM, "
> jimi.hulleg...@svensktnaringsliv.se" <jimi.hulleg...@svensktnaringsliv.se>
> wrote:
> Hi Ahmet,
>
> SweetSpotSimilarity seems quite nice. Some simple testing by throwing some
> different values at the class gives quite good results. Setting ln_min=1,
> ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or
> less what I want. At least for the title field. I'm not sure what the
> actual effect of those settings would be on longer text fields, so maybe I
> will use the SweetSpotSimilarity only for the title field to start with.
>
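
Wiring that up only for the title field would look roughly like this in
schema.xml (the fieldType name and analyzer are placeholders, and per-fieldType
similarities also need the global similarity to be solr.SchemaSimilarityFactory):

    <fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <similarity class="solr.SweetSpotSimilarityFactory">
        <int name="lengthNormMin">1</int>
        <int name="lengthNormMax">2</int>
        <float name="lengthNormSteepness">0.1</float>
        <bool name="discountOverlaps">true</bool>
      </similarity>
    </fieldType>

    <similarity class="solr.SchemaSimilarityFactory"/>

Any similarity change only takes effect for documents indexed after the change,
so a full reindex is needed to see consistent results.
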
> Of course I understand that there are many things that can be considered
> domain specific requirements, like whether to favor/punish short/medium/long
> texts, and how. I was just wondering how many actual use cases there are
> where one wants a ~20% difference in score between two documents, where
> the only difference is that one of the documents has one extra word in one
> field. (And now I'm talking about an extra word that doesn't affect
> anything else except the fieldNorm value). I for one find it hard to find
> such a use case, and would consider it a very speci

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread Jack Krupansky
FWIW, length for normalization is measured in terms (tokens), not
characters.

With TF-IDF similarity (the default before 6.0), the normalization is based
on the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115

You can always write your own custom Similarity class to override that
calculation.
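
If you go that route, the schema.xml hookup is just a similarity declaration,
either globally or per field type. The class name and init parameter below are
made-up placeholders for your own implementation, which would override
lengthNorm() to clamp the norm for short fields; nothing here is a stock Solr
setting:

    <similarity class="com.example.MinLengthSimilarityFactory">
      <int name="minTerms">5</int>
    </similarity>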

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, <jimi.hulleg...@svensktnaringsliv.se>
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation is
> quite good. But when the text is short I think that the effect is too big.
>
> I.e. with two documents that have a short text in the same field, just a few
> extra characters in one of the documents lowers the fieldNorm factor too much.
> In one test the text in document 1 is 30 characters long and has fieldNorm
> 0.4375, and in document 2 the text is 37 characters long and has fieldNorm
> 0.375. That means that the first document gets almost a 20% higher score
> simply because of the 7 character difference.
>
> What are my options if I want to change this behavior? Can I set a lower
> character limit, meaning that all fields with a length below this limit
> gets the same fieldNorm value?
>
> I know I can force fieldNorm to be 1 by setting omitNorms="true" for that
> field, but I would prefer to still have it, just limit its effect on short
> texts.
>
> Regards
> /Jimi
>
>
>


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Jack Krupansky
SolrJ does indeed provide load balancing via CloudSolrClient which
uses LBHttpSolrClient:
https://lucene.apache.org/solr/5_5_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrClient.html
https://lucene.apache.org/solr/5_5_0/solr-solrj/org/apache/solr/client/solrj/impl/LBHttpSolrClient.html

There is the separate issue of how many application clients you may have
and whether a load balancer would be in front of them.

-- Jack Krupansky

On Mon, Apr 18, 2016 at 4:34 AM, Jaroslaw Rozanski <s...@jarekrozanski.com>
wrote:

> Hi,
>
> How are you executing searches?
>
> I am asking because if you search using Solr client, for example SolrJ -
> ie. create instance of CloudSolrClient, and not directly via HTTP
> endpoint, it will provided load-balancing (last time I checked it picks
> random non-stale node).
>
>
> Thanks,
> Jarek
>
> On Mon, 18 Apr 2016, at 05:58, John Bickerstaff wrote:
> > Thanks, so on the matter of indexing -- while I could isolate a cloud
> > replica from queries by not including it in the load balancer's list...
> >
> > ... I cannot isolate any of the replicas from an indexing perspective by
> > a
> > similar strategy because the SOLR leader decides who does indexing?  Or
> > do
> > all "nodes" index the same incoming document independently?
> >
> > Now that I know I still need a load balancer, I guess I'm trying to find
> > a
> > way to keep indexing load off servers that are busy serving search
> > results...  Possibly by having one or two servers just handle indexing...
> >
> > Perhaps I'm looking in the wrong direction though -- and should just spin
> > up more replicas to handle more indexing load?
> > On Apr 17, 2016 10:46 PM, "Walter Underwood" <wun...@wunderwood.org>
> > wrote:
> >
> > No, Zookeeper is used for managing the locations of replicas and the
> > leader
> > for indexing. Queries should still be distributed with a load balancer.
> >
> > Queries do NOT go through Zookeeper.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> > > On Apr 17, 2016, at 9:35 PM, John Bickerstaff <
> j...@johnbickerstaff.com>
> > wrote:
> > >
> > > My prior use of SOLR in production was pre SOLR cloud.  We put a
> > > round-robin  load balancer in front of replicas for searching.
> > >
> > > Do I understand correctly that a load balancer is unnecessary with SOLR
> > > Cloud?  I. E. -- SOLR and Zookeeper will balance the load, regardless
> of
> > > which replica's URL is getting hit?
> > >
> > > Are there any caveats?
> > >
> > > Thanks,
>


Re: UUID processor handling of empty string

2016-04-16 Thread Jack Krupansky
Remove that line of code from your client, or... add the remove blank field
update processor as Hoss suggested. Your code is violating the contract for
the UUID update processor. An empty string is still a value, and the
presence of a value is an explicit trigger to suppress the UUID update
processor.
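
If fixing the client isn't an option, the chain Hoss described looks roughly
like this in solrconfig.xml (all of these factories ship with Solr; the chain
name is arbitrary):

    <updateRequestProcessorChain name="uuid-with-cleanup">
      <!-- strip surrounding whitespace, then drop fields that end up empty -->
      <processor class="solr.TrimFieldUpdateProcessorFactory"/>
      <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
      <!-- only now is a missing id filled in with a generated UUID -->
      <processor class="solr.UUIDUpdateProcessorFactory">
        <str name="fieldName">id</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>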

-- Jack Krupansky

On Sat, Apr 16, 2016 at 12:41 PM, Susmit Shukla <shukla.sus...@gmail.com>
wrote:

> I am seeing the UUID getting generated when I set the field as empty string
> like this - solrDoc.addField("id", ""); with solr 5.3.1 and based on the
> above schema.
> The resulting documents in the index are searchable but not sortable.
> Could someone verify whether this bug exists and file a JIRA?
>
> Thanks,
> Susmit
>
>
>
> On Sat, Apr 16, 2016 at 8:56 AM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > "UUID processor factory is generating uuid even if it is empty."
> >
> > The processor will generate the UUID only if the id field is not
> specified
> > in the input document. Empty value and value not present are not the same
> > thing.
> >
> > So, please clarify your specific situation.
> >
> >
> > -- Jack Krupansky
> >
> > On Thu, Apr 14, 2016 at 7:20 PM, Susmit Shukla <shukla.sus...@gmail.com>
> > wrote:
> >
> > > Hi Chris/Erick,
> > >
> > > Does not work in the sense that the order of documents does not change on
> > > changing sort from asc to desc.
> > > This could be just a trivial bug where UUID processor factory is
> > generating
> > > uuid even if it is empty.
> > > This is on solr 5.3.0
> > >
> > > Thanks,
> > > Susmit
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Apr 14, 2016 at 2:30 PM, Chris Hostetter <
> > hossman_luc...@fucit.org
> > > >
> > > wrote:
> > >
> > > >
> > > > I'm also confused by what exactly you mean by "doesn't work" but a
> > > general
> > > > suggestion you can try is putting the
> > > > RemoveBlankFieldUpdateProcessorFactory before your UUID Processor...
> > > >
> > > >
> > > >
> > >
> >
> https://lucene.apache.org/solr/6_0_0/solr-core/org/apache/solr/update/processor/RemoveBlankFieldUpdateProcessorFactory.html
> > > >
> > > > If you are also worried about strings that aren't exactly empty, but
> > > > consist only of whitespace, you can put
> TrimFieldUpdateProcessorFactory
> > > > before RemoveBlankFieldUpdateProcessorFactory ...
> > > >
> > > >
> > > >
> > >
> >
> https://lucene.apache.org/solr/6_0_0/solr-core/org/apache/solr/update/processor/TrimFieldUpdateProcessorFactory.html
> > > >
> > > >
> > > > : Date: Thu, 14 Apr 2016 12:30:24 -0700
> > > > : From: Erick Erickson <erickerick...@gmail.com>
> > > > : Reply-To: solr-user@lucene.apache.org
> > > > : To: solr-user <solr-user@lucene.apache.org>
> > > > : Subject: Re: UUID processor handling of empty string
> > > > :
> > > > : What do you mean "doesn't work"? An empty string is
> > > > : different than not being present. The UUID update
> > > > : processor (I'm pretty sure) only adds a field if it
> > > > : is _absent_. Specifying it as an empty string
> > > > : fails that test so no value is added.
> > > > :
> > > > : At that point, if this uuid field is also the ,
> > > > : then each doc that comes in with an empty field will replace
> > > > : the others.
> > > > :
> > > > : If it's _not_ the , the sorting will be confusing.
> > > > : All the empty string fields are equal, so the tiebreaker is
> > > > : the internal Lucene doc ID, which may change as merges
> > > > : happen. You can specify secondary sort fields to make the
> > > > : sort predictable (the  field is popular for this).
> > > > :
> > > > : Best,
> > > > : Erick
> > > > :
> > > > : On Thu, Apr 14, 2016 at 12:18 PM, Susmit Shukla <
> > > shukla.sus...@gmail.com>
> > > > wrote:
> > > > : > Hi,
> > > > : >
> > > > : > I have configured solr schema to generate unique id for a
> > collection
> > > > using
> > > > : > UUIDUpdateProcessorFactory
> > >

Re: UUID processor handling of empty string

2016-04-16 Thread Jack Krupansky
"UUID processor factory is generating uuid even if it is empty."

The processor will generate the UUID only if the id field is not specified
in the input document. Empty value and value not present are not the same
thing.

So, please clarify your specific situation.


-- Jack Krupansky

On Thu, Apr 14, 2016 at 7:20 PM, Susmit Shukla <shukla.sus...@gmail.com>
wrote:

> Hi Chris/Erick,
>
> Does not work in the sense that the order of documents does not change on
> changing sort from asc to desc.
> This could be just a trivial bug where UUID processor factory is generating
> uuid even if it is empty.
> This is on solr 5.3.0
>
> Thanks,
> Susmit
>
>
>
>
>
> On Thu, Apr 14, 2016 at 2:30 PM, Chris Hostetter <hossman_luc...@fucit.org
> >
> wrote:
>
> >
> > I'm also confused by what exactly you mean by "doesn't work" but a
> general
> > suggestion you can try is putting the
> > RemoveBlankFieldUpdateProcessorFactory before your UUID Processor...
> >
> >
> >
> https://lucene.apache.org/solr/6_0_0/solr-core/org/apache/solr/update/processor/RemoveBlankFieldUpdateProcessorFactory.html
> >
> > If you are also worried about strings that aren't exactly empty, but
> > consist only of whitespace, you can put TrimFieldUpdateProcessorFactory
> > before RemoveBlankFieldUpdateProcessorFactory ...
> >
> >
> >
> https://lucene.apache.org/solr/6_0_0/solr-core/org/apache/solr/update/processor/TrimFieldUpdateProcessorFactory.html
> >
> >
> > : Date: Thu, 14 Apr 2016 12:30:24 -0700
> > : From: Erick Erickson <erickerick...@gmail.com>
> > : Reply-To: solr-user@lucene.apache.org
> > : To: solr-user <solr-user@lucene.apache.org>
> > : Subject: Re: UUID processor handling of empty string
> > :
> > : What do you mean "doesn't work"? An empty string is
> > : different than not being present. The UUID update
> > : processor (I'm pretty sure) only adds a field if it
> > : is _absent_. Specifying it as an empty string
> > : fails that test so no value is added.
> > :
> > : At that point, if this uuid field is also the ,
> > : then each doc that comes in with an empty field will replace
> > : the others.
> > :
> > : If it's _not_ the , the sorting will be confusing.
> > : All the empty string fields are equal, so the tiebreaker is
> > : the internal Lucene doc ID, which may change as merges
> > : happen. You can specify secondary sort fields to make the
> > : sort predictable (the  field is popular for this).
> > :
> > : Best,
> > : Erick
> > :
> > : On Thu, Apr 14, 2016 at 12:18 PM, Susmit Shukla <
> shukla.sus...@gmail.com>
> > wrote:
> > : > Hi,
> > : >
> > : > I have configured solr schema to generate unique id for a collection
> > using
> > : > UUIDUpdateProcessorFactory
> > : >
> > : > I am seeing a peculiar behavior - if the unique 'id' field is
> > explicitly
> > : > set as empty string in the SolrInputDocument, the document gets
> indexed
> > : > with UUID update processor generating the id.
> > : > However, sorting does not work if uuid was generated in this way.
> Also
> > : > cursor functionality that depends on unique id sort also does not
> work.
> > : > I guess the correct behavior would be to fail the indexing if user
> > provides
> > : > an empty string for a uuid field.
> > : >
> > : > The issues do not happen if I omit the id field from the
> > SolrInputDocument .
> > : >
> > : > SolrInputDocument
> > : >
> > : > solrDoc.addField("id", "");
> > : >
> > : > ...
> > : >
> > : > I am using schema similar to below-
> > : >
> > : > 
> > : >
> > : > 
> > : >
> > : >  > required="true" />
> > : >
> > : > id
> > : >
> > : > 
> > : > 
> > : > 
> > : >   id
> > : > 
> > : > 
> > : > 
> > : >
> > : >
> > : >  
> > : >
> > : >  uuid
> > : >
> > : > 
> > : >
> > : >
> > : > Thanks,
> > : > Susmit
> > :
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
>


Re: Solr best practices for many to many relations...

2016-04-15 Thread Jack Krupansky
And it may also be that there are whole classes of user for whom
denormalization is just too heavy a cross to bear and for whom a little
extra money spent on more hardware is a great tradeoff.

And... Lucene's indexing may be superior to your average SQL database, so
that a Solr JOIN could be so much better than your average RDBMS SQL JOIN.
That would be an interesting benchmark.

-- Jack Krupansky

On Fri, Apr 15, 2016 at 11:06 AM, Joel Bernstein <joels...@gmail.com> wrote:

> I think people are going to be surprised though by the speed of the joins.
> The joins also get faster as the number of shards, replicas and worker
> nodes grow in the cluster. So we may see people building out large clusters
> and using the joins in OLTP scenarios.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 15, 2016 at 10:58 AM, Jack Krupansky <jack.krupan...@gmail.com
> >
> wrote:
>
> > And of course it depends on the specific queries, both in terms of what
> > fields will be searched and which fields need to be returned.
> >
> > Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
> > seconds for a complex query may be just fine vs. OLTP/search where under
> > 150 ms is the target. But, again, it will depend on the nature of the
> > query, the cardinality of each search field, the cross product of
> > cardinality of search fields, etc.
> >
> > -- Jack Krupansky
> >
> > On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > In general the Streaming Expression joins are designed for interactive
> > OLAP
> > > type work loads. So BI and data warehousing scenarios are the sweet
> spot.
> > > There may be scenarios where high QPS search applications will work
> with
> > > the distributed joins, particularly if the joins themselves are not
> huge.
> > > But the specific use cases need to be tested.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky <
> > jack.krupan...@gmail.com
> > > >
> > > wrote:
> > >
> > > > It will be interesting to see which use cases work best with the new
> > > > streaming JOIN vs. which will remain best with full denormalization,
> or
> > > > whether you simply have to try both and benchmark them.
> > > >
> > > > My impression had been that streaming JOIN would be ideal for bulk
> > > > operations rather than traditional-style search queries. Maybe there
> > are
> > > > three use cases: bulk read based on broad criteria, top-n relevance
> > > search
> > > > query, and specific document (or small number of documents) based on
> > > > multiple fields.
> > > >
> > > > My suspicion is that doing JOIN on five tables will likely be slower
> > than
> > > > accessing a single document of a denormalized table/index.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <joels...@gmail.com>
> > > > wrote:
> > > >
> > > > > Solr now has full distributed join capabilities as part of the
> > > Streaming
> > > > > Expression library. Keep in mind that these are distributed joins
> so
> > > they
> > > > > shuffle records to worker nodes to perform the joins. These are
> > > > comparable
> > > > > to joins done by SQL over MapReduce systems, but they are very
> > > responsive
> > > > > and can respond with sub-second response time for fairly large
> joins
> > in
> > > > > parallel mode. But these joins do lend themselves to large
> > distributed
> > > > > architectures (lots of shards and replicas). Target QPS also needs
> to
> > > be
> > > > > taken into account and tested in deciding whether these joins will
> > meet
> > > > the
> > > > > specific use case.
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > > On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dpg...@gmail.com>
> > wrote:
> > > > >
> > > > > > The Streaming API with Streaming Expressions (or Parallel SQL if
> > you
> > > > want
> > > > > > to use SQL) can give you the functionality you're looking for.
> See
> > &g

Re: Can a field be an array of fields?

2016-04-15 Thread Jack Krupansky
It all depends on what your queries look like - what input data does your
application have and what data does it need to retrieve.

My recommendation is that you store first name and last name as separate,
multivalued fields if you indeed need to query by precisely a first or last
name, but also store the full name as a separate multivalued text field. If
you want to search by only first or last name, fine. If you want to search
by full name or wildcards, etc., you can use the full name field, using
phrase query. You can use an update request processor to combine first and
last name into that third field. You could also store the full name in a
fourth field as raw JSON if you really need structure in the result. The
third field might have first and last name with a special separator such as
"|", although a simple comma is typically sufficient.


-- Jack Krupansky

On Fri, Apr 15, 2016 at 10:58 AM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:

> Short answer - JOINs, external query outside Solr, Elastic Search ;)
> Alternatives:
>   * You get back an id for each document when you query on "Nino".   You
> look up the last names in some other system that has the full list.
>   * You index the authors in another collection and use JOINs
>   * You store the author_array as formatted, escaped JSON, stored, but not
> indexed (or analyzed).   When you get the data back, you navigate the JSON
> to the author_array, get the value, and parse that value as JSON.   Now you
> have the full list.
>   * This is a sweet spot for Elastic Search, to be perfectly honest.
>
> -Original Message-
> From: Bastien Latard - MDPI AG [mailto:lat...@mdpi.com.INVALID]
> Sent: Friday, April 15, 2016 7:52 AM
> To: solr-user@lucene.apache.org
> Subject: Can a field be an array of fields?
>
> Hi everybody!
>
> I described a bit what I found in another thread, but I prefer to create
> a new thread for this specific question... It's possible to create an
> array of strings by doing (incomplete example):
> - in the data-conf.xml:
> 
>
> 
>   
>   
>   
>   
> 
>
> 
>
> *- in schema.xml:
> * required="false" multiValued="true" />
>  required="false" multiValued="true" />
>  required="false" multiValued="true" />
>  required="false" multiValued="true" />
>
> And this provides something like:
>
> "docs":[
>{
> [...]
> "given_name":["Bastien",  "Matthieu",  "Nino"],
> "last_name":["lastname1", "lastname2",
>  "lastname3",   "lastname4"],
>
> [...]
>
>
> *Note: there can be one author with only a last_name, and then we are
> unable to tell which one it is...*
>
> My goal would be to get this as a result:
>
> "docs":[
>{
> [...]
> "authors_array":
>  [
> [
> "given_name":["Bastien"],
> "last_name":["lastname1"]
>  ],
> [
> "last_name":["lastname2"]
>  ],
> [
> "given_name":["Matthieu"],
> "last_name":["lastname2"]
>  ],
> [
> "given_name":["Nino"],
> "last_name":["lastname4"]
>  ],
>  ]
> [...]
>
>
> Is there any way to do this?
> /PS: I know that I could do '//select if(a.given_name is not null,
> a.given_name ,'') as given_name, [...]//' but I would like to get an
> array.../
>
> I tried to add something like that to the schema.xml, but this doesn't
> work (well, it might be of type 'array'):
>  required="false" multiValued="true"/>
>
> Kind regards,
> Bastien Latard
> Web engineer
> --
> MDPI AG
> Postfach, CH-4005 Basel, Switzerland
> Office: Klybeckstrasse 64, CH-4057
> Tel. +41 61 683 77 35
> Fax: +41 61 302 89 18
> E-mail:
> lat...@mdpi.com
> http://www.mdpi.com/
>
>


Re: Solr best practices for many to many relations...

2016-04-15 Thread Jack Krupansky
And of course it depends on the specific queries, both in terms of what
fields will be searched and which fields need to be returned.

Yes, OLAP is the clear sweet spot, where taking 500 ms to 2 or even 20
seconds for a complex query may be just fine vs. OLTP/search where under
150 ms is the target. But, again, it will depend on the nature of the
query, the cardinality of each search field, the cross product of
cardinality of search fields, etc.

-- Jack Krupansky

On Fri, Apr 15, 2016 at 10:44 AM, Joel Bernstein <joels...@gmail.com> wrote:

> In general the Streaming Expression joins are designed for interactive OLAP
> type work loads. So BI and data warehousing scenarios are the sweet spot.
> There may be scenarios where high QPS search applications will work with
> the distributed joins, particularly if the joins themselves are not huge.
> But the specific use cases need to be tested.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 15, 2016 at 10:24 AM, Jack Krupansky <jack.krupan...@gmail.com
> >
> wrote:
>
> > It will be interesting to see which use cases work best with the new
> > streaming JOIN vs. which will remain best with full denormalization, or
> > whether you simply have to try both and benchmark them.
> >
> > My impression had been that streaming JOIN would be ideal for bulk
> > operations rather than traditional-style search queries. Maybe there are
> > three use cases: bulk read based on broad criteria, top-n relevance
> search
> > query, and specific document (or small number of documents) based on
> > multiple fields.
> >
> > My suspicion is that doing JOIN on five tables will likely be slower than
> > accessing a single document of a denormalized table/index.
> >
> > -- Jack Krupansky
> >
> > On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > Solr now has full distributed join capabilities as part of the
> Streaming
> > > Expression library. Keep in mind that these are distributed joins so
> they
> > > shuffle records to worker nodes to perform the joins. These are
> > comparable
> > > to joins done by SQL over MapReduce systems, but they are very
> responsive
> > > and can respond with sub-second response time for fairly large joins in
> > > parallel mode. But these joins do lend themselves to large distributed
> > > architectures (lots of shards and replicas). Target QPS also needs to
> be
> > > taken into account and tested in deciding whether these joins will meet
> > the
> > > specific use case.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dpg...@gmail.com> wrote:
> > >
> > > > The Streaming API with Streaming Expressions (or Parallel SQL if you
> > want
> > > > to use SQL) can give you the functionality you're looking for. See
> > > >
> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > > > and
> > > >
> > https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> > > > SQL queries coming in through the Parallel SQL Interface are
> translated
> > > > down into Streaming Expressions - if you need to do something that
> SQL
> > > > doesn't yet support you should check out the Streaming Expressions to
> > see
> > > > if it can support it.
> > > >
> > > > With these you could store your data in separate collections (or the
> > same
> > > > collection with different docType field values) and then during
> search
> > > > perform a join (inner, outer, hash) across the collections. You
> could,
> > if
> > > > you wanted, even join with data NOT in solr using the jdbc streaming
> > > > function.
> > > >
> > > > - Dennis Gove
> > > >
> > > >
> > > > On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
> > > > lat...@mdpi.com.invalid> wrote:
> > > >
> > > >> '*would I then be able to query a specific field of articles or
> other
> > > >> "table" (with the same OR BETTER performances)?*'
> > > >> -> And especially, would I be able to get only 1 article in the
> > > result...
> > > >>
> > > >> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> > > >>
> > > >> Thanks Jack.
> > 

Re: Solr best practices for many to many relations...

2016-04-15 Thread Jack Krupansky
It will be interesting to see which use cases work best with the new
streaming JOIN vs. which will remain best with full denormalization, or
whether you simply have to try both and benchmark them.

My impression had been that streaming JOIN would be ideal for bulk
operations rather than traditional-style search queries. Maybe there are
three use cases: bulk read based on broad criteria, top-n relevance search
query, and specific document (or small number of documents) based on
multiple fields.

My suspicion is that doing JOIN on five tables will likely be slower than
accessing a single document of a denormalized table/index.

-- Jack Krupansky

On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <joels...@gmail.com> wrote:

> Solr now has full distributed join capabilities as part of the Streaming
> Expression library. Keep in mind that these are distributed joins so they
> shuffle records to worker nodes to perform the joins. These are comparable
> to joins done by SQL over MapReduce systems, but they are very responsive
> and can respond with sub-second response time for fairly large joins in
> parallel mode. But these joins do lend themselves to large distributed
> architectures (lot's of shards an replicas). Target QPS also needs to be
> taken into account and tested in deciding whether these joins will meet the
> specific use case.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dpg...@gmail.com> wrote:
>
> > The Streaming API with Streaming Expressions (or Parallel SQL if you want
> > to use SQL) can give you the functionality you're looking for. See
> > https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > and
> > https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
> > SQL queries coming in through the Parallel SQL Interface are translated
> > down into Streaming Expressions - if you need to do something that SQL
> > doesn't yet support you should check out the Streaming Expressions to see
> > if it can support it.
> >
> > With these you could store your data in separate collections (or the same
> > collection with different docType field values) and then during search
> > perform a join (inner, outer, hash) across the collections. You could, if
> > you wanted, even join with data NOT in solr using the jdbc streaming
> > function.
> >
> > - Dennis Gove
> >
> >
> > On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
> > lat...@mdpi.com.invalid> wrote:
> >
> >> '*would I then be able to query a specific field of articles or other
> >> "table" (with the same OR BETTER performances)?*'
> >> -> And especially, would I be able to get only 1 article in the
> result...
> >>
> >> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
> >>
> >> Thanks Jack.
> >>
> >> I know that Solr is a search engine, but this replaces a search in my
> >> MySQL DB with this model:
> >>
> >>
> >> *My goal is to improve my environment (and my performances at the same
> >> time).*
> >>
> >> *Yes, I have a Solr data model... but atm I created 4 different indexes
> >> for "similar service usage".*
> >> *So atm, for 70 millions of documents, I am duplicating journal data and
> >> publisher data all the time in 1 index (for all articles from the same
> >> journal/pub) in order to be able to retrieve all data in 1 query...*
> >>
> >> *I found yesterday that there is the possibility to create like an array
> >> of  in the data-conf.xml.*
> >> e.g. (pseudo code - incomplete):
> >> 
> >> 
> >> 
> >> 
> >>
> >>
> >> * Would this be a good option? Is this the denormalization you were
> >> proposing? *
> >>
> >> *If yes, would I then be able to query a specific field of articles or
> >> other "table" (with the same OR BETTER performances)? If yes, I might
> >> probably merge all the different indexes together. *
> >> *I'm currently joining everything in mysql, so duplicating the fields in
> >> the solr (pseudo code):*
> >> 
> >> *So I have an index for authors query, a general one for articles (only
> >> needed info of other tables) ...*
> >>
> >> Thanks in advance for the tips. :)
> >>
> >> Kind regards,
> >> Bastien
> >>
> >> On 14/04/2016 16:23, Jack Krupansky wrote:
> >>
> >> Solr is a search engine, not a database.
> >>
> >> JOINs? Although Solr does

Re: Singular Plural Results Inconsistent - SOLR v3.6 and EnglishMinimalStemFilterFactor

2016-04-14 Thread Jack Krupansky
BTW, I did check and that stemmer code is the same today as it was in 3.x,
so there should be no change in stemmer behavior there.

-- Jack Krupansky

On Thu, Apr 14, 2016 at 3:47 PM, Sara Woodmansee <swood...@gmail.com> wrote:

> Hi Shawn,
>
> Thanks so much for the feedback. And for the heads-up regarding (the bad form
> of) starting a new discussion from an existing one. Thought removing all
> content wouldn’t track to original. (Sigh). This is what you get when you
> have photographers posting to high-end forums.
>
> Thanks Erick, regarding upgrading to v5.  We actually just removed all
> test data from the site, so we can now upload all the true, final files and
> metadata. In some ways this could be a perfect time to upgrade to v5 (if I
> can talk the developer into it) since all metadata has to be re-ingested
> anyway..
>
> All best,
> Sara
>
>
> > On Apr 14, 2016, at 3:31 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
> >
> > re: upgrading to 5.x... 5.x Solr is NOT guaranteed to
> > read 3.x indexes, you'd have to go through 4.x to do that.
> >
> > If you can re-index from scratch that would be best.
> >
> > Best,
> > Erick
> >
> >
> >> On Apr 14, 2016, at 3:29 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> >>
> >> On 4/14/2016 11:17 AM, Sara Woodmansee wrote:
> >>> I posted yesterday, however I never received my own post, so worried
> it did not go through (?)
> >>
> >> I *did* see your previous message, but couldn't immediately think of
> >> anything constructive to say.  I've had a little bit of time on my lunch
> >> break today to look deeper.
> >>
> >> EnglishMinimalStemFilter is designed to *not* aggressively stem
> >> everything it sees.  It appears that the behavior you are seeing is
> >> probably intentional with that filter.
> >>
> >> In 5.5.0 and 6.0.0, PorterStemFilter will handle words of the form you
> >> mentioned correctly.  In the screenshot below, PSF means
> >> "PorterStemFilter".  I did not check any earlier versions.  I already
> >> had these versions on my system.
> >>
> >> https://www.dropbox.com/s/ss48vinrtbgifce/stemmer-ee-es-6.0.0.png?dl=0
> >>
> >> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming
> >>
> >> That version of Solr is over four years old.  Bugs in 3.x will *not* be
> >> fixed.  Bugs in 4.x will also not be fixed.  On 5.x, only extremely
> >> major bugs are likely to get any attention, and this does not qualify as
> >> a major bug.
> >>
> >> 
> >>
> >> On another matter:
> >>
> >> http://people.apache.org/~hossman/#threadhijack
> >>
> >> You replied to a message with the subject "Solr Support for BM25F" ...
> >> so your message is showing up within that thread.
> >>
> >>
> https://www.dropbox.com/s/xi0o8z6smhd2n5d/woodmansee-thread-hijack.png?dl=0
> >>
> >> Thanks,
> >> Shawn
> >>
> >
>


Re: Singular Plural Results Inconsistent - SOLR v3.6 and EnglishMinimalStemFilterFactor

2016-04-14 Thread Jack Krupansky
Yes, this is the intended behavior. All of the Solr stemmers are based on
heuristics that are not perfect, and are not based on the real dictionary.
You can solve one problem by switching to another stemmer, but then you run
into a different problem, rinse and repeat.

The code has a specific rule that refrains from stemming a pattern that
also happens to match your specified cases:

    if (s[len-3] == 'i' || s[len-3] == 'a' || s[len-3] == 'o' || s[len-3] == 'e')
      return len;

See:
https://github.com/apache/lucene-solr/blob/branch_3x/lucene/contrib/analyzers/common/src/java/org/apache/lucene/analysis/en/EnglishMinimalStemmer.java

So, xxxies, xxxaes, xxxoes, and xxxees will all remain unstemmed. Exactly
what the rationale for that rule was is unspecified in the code - no
comments, other than to point to this research document:
https://www.researchgate.net/publication/220433848_How_effective_is_suffixing
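
Switching stemmers is just a matter of swapping the filter inside the field
type's analyzer and reindexing; a sketch with KStem, one of the stock
alternatives (the fieldType name and the rest of the chain are illustrative):

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- alternatives: solr.EnglishMinimalStemFilterFactory, solr.PorterStemFilterFactory -->
        <filter class="solr.KStemFilterFactory"/>
      </analyzer>
    </fieldType>

Each stemmer trades one set of quirks for another, so the Analysis screen in
the admin UI is the quickest way to compare them on the exact words that are
failing.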



-- Jack Krupansky

On Thu, Apr 14, 2016 at 1:17 PM, Sara Woodmansee <swood...@gmail.com> wrote:

> Hello all,
>
> I posted yesterday, however I never received my own post, so worried it
> did not go through (?) Also, I am not a coder, so apologies if not
> appropriate to post here. I honestly don't know where else to turn, and am
> determined to find a solution, as search is essential to our site.
>
> We are having a website built with a search engine based on SOLR v3.6. For
> stemming, the developer uses EnglishMinimalStemFilterFactory. They were
> previously using PorterStemFilterFactory which worked better with plural
> forms, however PorterStemFilterFactory was not working correctly with –ing
> endings. “icing” becoming "ic", for example.
>
> Most search terms work fine, but we have inconsistent results (singular vs
> plural) with terms that end in -ee, -oe, -ie, -ae,  and words that end in
> -s.  In comparison, the following work fine: words that end with -oo, -ue,
> -e, -a.
>
> The developers have been unable to find a solution ("Unfortunately we
> tried to apply all the filters for stemming but this problem is not
> resolved"), but this has to be a common issue (?) Someone surely has found
> a solution to this problem??
>
> Any suggestions greatly appreciated.
>
> Many thanks!
> Sara
> _
>
> DO NOT WORK:  Plural terms that end in -ee, -oe, -ie, -ae,  and words that
> end in -s.
>
> Examples:
>
> tree = 0 results
> trees = 21 results
>
> dungaree = 0 results
> dungarees = 1 result
>
> shoe = 0 results
> shoes = 1 result
>
> toe = 1 result
> toes = 0 results
>
> tie = 1 result
> ties = 0 results
>
> Cree = 0 results
> Crees = 1 result
>
> dais = 1 result
> daises = 0 results
>
> bias = 1 result
> biases = 0 results
>
> dress = 1 result
> dresses = 0 results
> _
>
> WORKS:  Words that end with -oo, -ue, -e, -a
>
> Examples:
>
> tide = 1 result
> tides = 1 results
>
> hue = 2 results
> hues = 2 results
>
> dakota = 1 result
> dakotas = 1 result
>
> loo = 1 result
> loos = 1 result
> _
>
>


Re: Solr best practices for many to many relations...

2016-04-14 Thread Jack Krupansky
Solr is a search engine, not a database.

JOINs? Although Solr does have some limited JOIN capabilities, they are
more for special situations, not the front-line go-to technique for data
modeling for search.

Rather, denormalization is the front-line go-to technique for data modeling
in Solr.

In any case, the first step in data modeling is always to focus on your
queries - what information will be coming into your apps and what
information will the apps want to access based on those inputs.

But wait... you say you are upgrading, which suggests that you have an
existing Solr data model, and probably queries as well. So...

1. Share at least a summary of your existing Solr data model as well as at
least a summary of the kinds of queries you perform today.
2. Tell us what exacting is driving your inquiry - are queries too slow,
too cumbersome, not sufficiently powerful, or... what exactly is the
problem you need to solve.


-- Jack Krupansky

On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:

> Hi Guys,
>
> I am upgrading from Solr 4.2 to 6.0.
> I successfully (after some time) migrated the config files and other
> parameters...
>
> Now I'm just wondering if my indexes are following the best
> practices...(and they are probably not :-) )
>
> What would be best if we have this kind of SQL data to write into Solr:
>
>
> I have several different services which need (more or less) different
> data based on these JOINs...
>
> e.g.:
> Service A needs lots of data (but not all),
> Service B needs a little data (some fields already included in A),
> Service C needs a bit more data than B (some fields already included in
> A/B)...
>
> 1. Would it be better to create one single index?
>    -> i.e.: this will duplicate journal info for every single article
>
> 2. Would it be better to create several specific indexes for each similar
>    service?
>    -> i.e.: this will use more space on the disks (and there are ~70 million
>    documents to join)
>
> 3. Would it be better to create an index per table and make a join?
>    -> if yes, how?
>
> Kind regards,
> Bastien
>
>


Re: Get number of results in filtered query

2016-04-13 Thread Jack Krupansky
If you just do a faceted query without the filter, each facet will give you
the number of results for that country and numFound will give you the
total number of results across all countries. But once you apply one or
more filters, numFound reflects only the post-filtering documents.

-- Jack Krupansky

On Wed, Apr 13, 2016 at 4:43 PM, Fundera Developer <
funderadevelo...@outlook.com> wrote:

> Hi all,
>
> we are developing a search engine in which all the possible results have
> one or more countries associated. If, apart from writing the query, the
> user selects a country, we use a filterquery to restrict the results to
> those that match the query and are associated to that country. Nothing
> spectacular so far  :-D
>
> However, we would like to show the number of results that are returned by
> the unfiltered query, since we already have the number of results
> associated to each country as we are also faceting on that field. Is it
> possible to have that number without executing the query twice?
>
> Thanks in advance!
>
>


Re: How to search for a First, Last of contact which are stored in differnet multivalued fields

2016-04-13 Thread Jack Krupansky
I was also going to point out the field masking span query, but... also
that it is at the Lucene level and not surfaced in Solr:
http://lucene.apache.org/core/6_0_0/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html

Also see:
https://issues.apache.org/jira/browse/LUCENE-1494

But no hint anywhere that I know of for how to surface this Lucene feature
in Solr.

I would suggest the workaround of using an update processor to combine the
first and last names into a single multivalued field.

-- Jack Krupansky

On Wed, Apr 13, 2016 at 4:20 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
wrote:

> Hi Thrinadh,
>
> I think you can pull something together with FieldMaskingSpanQuery
>
> http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html
>
> Ahmet
>
>
>
> On Wednesday, April 13, 2016 8:24 PM, Thrinadh Kuppili <
> thrinadh...@gmail.com> wrote:
> Hi,
>
> I have created 2 multivalued fields FirstName, Lastname
>
> In solr the values available are :
> FirstName": [ "Kim", "Jake","NATALIE", "Tammey"]
> LastName": [ "Lara", "Sharan","Taylor", "Taylor"]
>
> I am trying to search where firstName is Tammey and LastName is Taylor.
>
> I should be able to search firstname [4] and lastname [4] and get the
> record
> but currently it is searching with the firstname [4] and lastname [3] which
> shouldn't happen.
>
> Do let me know if more details are needed.
>
> Thnx
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-search-for-a-First-Last-of-contact-which-are-stored-in-differnet-multivalued-fields-tp4269901.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Release date for Solr 6.0

2016-04-07 Thread Jack Krupansky
Do you need to preserve your index data or are you able to fully re-index
all data from scratch? If the former, you will need to upgrade from 4 to 5
first, then force a full optimize to fully upgrade all index segments to 5
format, and then upgrade from 5 to 6. In fact, if you had originally
upgraded from 3 to 4, you may need to force a full optimize on 4 to assure
that any lingering old 3 format index segments are in 4 format.
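
For example (a sketch; the core name is hypothetical), a full optimize down to
a single segment can be forced with:

curl 'http://localhost:8983/solr/mycore/update?optimize=true&maxSegments=1'

After that runs on 5.x, every remaining segment has been rewritten in the 5.x
format.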



-- Jack Krupansky

On Thu, Apr 7, 2016 at 12:10 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> The release vote just passed, 6.0 should be released in a very few days.
>
> Best,
> Erick
>
> On Thu, Apr 7, 2016 at 8:48 AM, Ben Earley <bdearle...@gmail.com> wrote:
> > Hi there,
> >
> > My team has been using Solr 4 on a large distributed system and we are
> > interested in upgrading to Solr 6 when the new version is released to
> > leverage some of the new features, such as graph queries.  Is anyone able
> > to provide any insight as to the release schedule for this new version?
> >
> > Thanks,
> >
> > Ben Earley
>


Re: maxBooleanClauses in solrconfig.xml is ignored

2016-04-07 Thread Jack Krupansky
Edismax phrase-boost terms?

-- Jack Krupansky

On Thu, Apr 7, 2016 at 10:28 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/7/2016 8:05 AM, Zaccheo Bagnati wrote:
> > I'm trying to set the maxBooleanClauses parameter in solrconfig.xml to
> 1024
> > but I still have "Too many boolean clauses" error even with 513 terms
> (with
> > 512 terms it works).
> > I've read in the documentation (
> >
> https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig
> )
> > the warning that it is a global setting but I have only 1 core so there
> are
> > not conflicting definitions. I don't know how to deal with this
> > I'm using SOLR 5.5.
>
> The default value for maxBooleanClauses is 1024, so if you're getting an
> error with 513 terms, then either your query is getting parsed so there
> are more terms, or you have a config somewhere that is setting the value
> to 512.
>
> Can you add "debugQuery=true" to your query and see what you are getting
> for the parsedquery?
>
> Are you running SolrCloud?  If you are, then editing a config file is
> not enough.  You also have to upload the changes to zookeeper.
>
> Thanks,
> Shawn
>
>


Re: Can't get phrase field boosting to work using edismax

2016-04-06 Thread Jack Krupansky
I haven't traced through all the code recently, so I can't dispute Jan if
he knows a place that checks the output of the pf phrase analysis to see if
it is a single term, but... the INPUT to pf is definitely multiple clauses.
Regardless of the use of the keyword tokenizer, the query parser sees two
tokens, "some" and "words", and passes them as separate clauses to the code
I referenced above, which constructs quoted phrases and passes them through
the query parser again for the pf fields. What happens after that I cannot
say for sure.

But if the pf post-analysis processing does have this limitation that the
analysis of a multi-word phrase must be at least two terms, it should be
clearly documented. That's essentially what is at stake in this particular
issue.

Granted, that was my first thought, that the use of the keyword tokenizer
would be a no-no for a pf field, but this particular use case seems valid
to me, so we should consider whether the "multiple words analyze to one
term" use case should be supported, for precisely the use case at hand.

I can see wanting to have both a multi-term pf field combined with a
single-term pf field with the latter having a higher boost. For example, if
the input query exactly matches a product name field, as opposed to simply
matching a subset of a longer product name.
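
For example (field names assumed), the kind of setup at stake here would be
something like:

pf=title^5 exactTitle^50

where title is a normally tokenized field and exactTitle uses the keyword
tokenizer, so a query exactly matching the whole product name picks up the
larger boost on top of the ordinary phrase boost.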


-- Jack Krupansky

On Wed, Apr 6, 2016 at 5:22 AM, <jimi.hulleg...@svensktnaringsliv.se> wrote:

> OK, well I'm not sure I agree with you. First of all, you ask me to point
> my "pf" towards a tokenized field, but I already do that (the fact that all
> text is tokenized into a single token doesn't change that fact). Also, I
> don't agree with the view that a single term phrase never is
> valid/reasonable. In this specific case, with a KeywordTokenizer, I see it
> as very reasonable indeed. And I would consider a "single term keyword
> phrase" solution more logical than a workaround using special magical
> characters inserted in the text. Just my two cents... :)
>
> Oh, hang on... If a phrase is defined as multiple tokens, and pf is used
> for phrase  boosting, does that mean that even with a regular tokenizer the
> pf won't work for fields that only contain one word? For example if the
> title of one document is "John", and the user searches for 'John' (without
> any surrounding phrase-characters), will edismax not boost this document?
>
> /Jimi
>
> -Original Message-
> From: Jan Høydahl [mailto:jan@cominvent.com]
> Sent: Wednesday, April 6, 2016 10:43 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Can't get phrase field boosting to work using edismax
>
> Hi,
>
> Phrase match via “pf” requires the target field to contain a phrase. A
> phrase is defined as multiple tokens. Yours does not contain a phrase since
> you use the KeywordTokenizer, leaving only one token in the field. eDismax
> pf will thus never kick in. Please point your “pf” towards a tokenized
> field.
>
> If what you are trying to achieve is to boost only when the whole query
> exactly matches the full content of the field, then have a look at my
> solution here https://github.com/cominvent/exactmatch
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 5. apr. 2016 kl. 19.10 skrev jimi.hulleg...@svensktnaringsliv.se:
> >
> > Some more input, before I call it a day. Just for the heck of it, I
> tried changing minClauseSize to 0 using the Eclipse debugger, so that it
> didn't return null at line 1203, but instead returned the TermQuery on line
> 1205. Then everything worked exactly as it should. The matching document
> got boosted as expected. And in the explain output, this can be seen:
> >
> > [...]
> > 11.274228 = (MATCH) weight(exactTitle:some words^100.0 in 172)
> [DefaultSimilarity], result of:
> > [...]
> >
> > So. In my case, having minClauseSize=2 on line 550 (line 565 for solr
> 5.5.0) is the culprit. Is this a bug, or am I using the pf in the wrong
> way? Can someone explain why minClauseSize can't be set to 0 here? The
> comment simply states "we need at least two or there shouldn't be a boost",
> but no explaination *why* at least two is needed.
> >
> > Regards
> > /Jimi
> >
> > -Original Message-
> > From: jimi.hulleg...@svensktnaringsliv.se
> > [mailto:jimi.hulleg...@svensktnaringsliv.se]
> > Sent: Tuesday, April 5, 2016 6:51 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Can't get phrase field boosting to work using edismax
> >
> > I now used the Eclipse debugger, to try and see if I can understand what
> is happening, I it seems like the ExtendedDismaxQParser simply ignores my
> pf parameter, since it does

Re: Can't get phrase field boosting to work using edismax

2016-04-05 Thread Jack Krupansky
It looks like the code constructing the boost phrase for pf will always add
a trailing blank, which is never a problem when a normal tokenizer is used
that removes white space, but the keyword tokenizer will preserve that
extra space, which prevents an exact match.

See line 531:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java

I'd say it's a bug, though more likely a narrow use case that wasn't
considered or tested.

-- Jack Krupansky

On Tue, Apr 5, 2016 at 7:50 AM, <jimi.hulleg...@svensktnaringsliv.se> wrote:

> Hi,
>
> I'm trying to boost documents using a phrase field boosting (ie the pf
> parameter for edismax), but I can't get it to work (ie boosting documents
> where the pf field matches the query as a phrase).
>
> As far as I can tell, solr, or more specifically the edismax handler, does
> *something* when I add this parameter. I know this because the QTime
> increases from around 5-10ms to around 30-40 ms, and the score explain
> structure is *slightly* modified (though with the same final score for all
> documents). But nowhere in the explain structure can I see anything about
> the pf. And I can't understand that. Shouldn't it be included in the
> explain? If not, is there any way to force it to be included somehow?
>
> The query looks something like this:
>
>
> ?q=some+words=10=score+desc=true=objectid,exactTitle,score%2C%5Bexplain+style%3Dtext%5D=title%5E2=swedishText1%5E1=edismax=exactTitle%5E5=xml=true
>
>
> I have one document that has the title "some words", and when I do a
> simple query filter with exactTitle:"some words" I get a match for that
> document. So then I would expect that the query above would boost this
> document, and include information about this in the explain. But nothing
> like this happens, and I can't understand why.
>
> The field looks like this:
>
> <field name="exactTitle" type="..." ... required="false" multiValued="false" />
>
> And the fieldType looks like this:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.HTMLStripCharFilterFactory" />
>     <tokenizer class="solr.KeywordTokenizerFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
>
>
> I have also tried boosting this document using a boost query, ie
> bq=exactTitle:"some words", and this works as expected. The document score
> is boosted, and the explain states this very clearly, with this segment:
>
> [...]
> 9.870669 = (MATCH) weight(exactTitle:some words^5.0 in 12)
> [DefaultSimilarity], result of:
> [...]
>
> Why is this working, but q=some+words&pf=exactTitle^5 not? Shouldn't
> edismax rewrite my "pf query" into something very similar to the "bq query"?
>
> Regards
> /Jimi
>


Re: Slor 5.5.0 : SolrException: fieldType 'booleans' not found in the schema

2016-04-01 Thread Jack Krupansky
I think it's a bug...

Ah, the key clue is here:
Caused by: org.apache.solr.common.SolrException: fieldType 'booleans' not
found in the schema
at
org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$TypeMapping.populateValueClasses(AddSchemaFieldsUpdateProcessorFactory.java:247)

In fact, if we look at this solrconfig.xml in the repo:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml

We can find this update processor chain:

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">

Which has this processor:

<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
  <str name="defaultFieldType">strings</str>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">booleans</str>
  </lst>
  ...
</processor>

And there is your field type "booleans" reference.

That type used to be needed for multivalued boolean fields, but now the
dynamic pattern for *_bs is itself multivalued and simply references the
"boolean" type.

This solrconfig.xml in the repo has a similar issue:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/solr/example/files/conf/solrconfig.xml

<str name="defaultFieldType">strings</str> ... <str name="valueClass">java.lang.Boolean</str> <str name="fieldType">booleans</str>

Hmmm... or maybe the old "booleans" field type should be restored to allow
boolean fields to be multivalued?
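
Until that's sorted out, two workarounds come to mind (sketches, not tested):
either point the typeMapping at the existing single-valued type, or bring a
multivalued boolean type back into the schema:

<!-- option 1: in solrconfig.xml, map Boolean values to the existing type -->
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">boolean</str>

<!-- option 2: in the schema, restore a multivalued boolean type -->
<fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>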

So, somebody should file a Jira on this.


-- Jack Krupansky

On Fri, Apr 1, 2016 at 3:24 PM, Girish Tavag <send2mymail...@gmail.com>
wrote:

> Hi Shawn,
>
>  Finally I'm able to figure out the problem. The issue was in
> solrconfig.xml where the booleans type was referenced. I replaced booleans
> with boolean and other similar fields and it worked correctly :)
>
> Regards,
> GNT.
>


Re: Slor 5.5.0 : SolrException: fieldType 'booleans' not found in the schema

2016-03-31 Thread Jack Krupansky
Exactly which file did you copy? Please give the specific directory.

-- Jack Krupansky

On Thu, Mar 31, 2016 at 3:24 PM, Girish Tavag <send2mymail...@gmail.com>
wrote:

> Hi Binoy,
>
>  I copied the entire file schema.xml from the working example provided by
> Solr itself. I'm able to run the Solr-provided DIH example successfully. How
> could this be a problem?
>
> On Fri, Apr 1, 2016 at 12:39 AM, Binoy Dalal <binoydala...@gmail.com>
> wrote:
>
> > Somewhere in your schema you've defined a field with type as "booleans".
> > You should check if you've made a typo somewhere by adding that extra s
> > after boolean.
> > Else if it is a separate field that you're looking to add, define a new
> > fieldtype called booleans.
> >
> > All the info to help you with this can be found here:
> >
> >
> https://cwiki.apache.org/confluence/display/solr/Documents,+Fields,+and+Schema+Design
> >
> > I highly recommend that you go through the documentation before starting.
> >
> > On Fri, 1 Apr 2016, 00:34 Girish Tavag, <send2mymail...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > I am new to solr, I started using this only from today,  when I wanted
> to
> > > create dih, i'm getting the below error.
> > >
> > > SolrException: fieldType 'booleans' not found in the schema
> > >
> > > What does this mean? and How  to resolve this.
> > >
> > > Regards,
> > > GNT
> > >
> > --
> > Regards,
> > Binoy Dalal
> >
>


Re: Solr response error 403 when I try to index medium.com articles

2016-03-30 Thread Jack Krupansky
You could use the curl command to read a URL on Medium.com. That would let
you examine and control the headers to experiment.
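
For example, a sketch that fetches the page with a browser-like user-agent and
dumps just the response headers:

curl -s -o /dev/null -D - \
  -A 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/537.36' \
  'https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1'

Comparing the status you get with and without the -A header should tell you
whether the 403 is based on the user-agent.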

Google is able to index Medium.

Check the URL and make sure it's not on one of the paths disallowed by
medium.com/robots.txt (the one you gave seems fine):

User-Agent: *
Disallow: /_/
Disallow: /m/
Disallow: /me/
Disallow: /@me$
Disallow: /@me/
Disallow: /*/*/edit
Sitemap: https://medium.com/sitemap/sitemap.xml



-- Jack Krupansky

On Wed, Mar 30, 2016 at 1:05 PM, Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> 403 means "forbidden"
>
> Something about the request Solr is sending -- or something about the IP
> address Solr is connecting from when talking to medium.com -- is causing
> the medium.com web server to reject the request.
>
> This is something that servers may choose to do if they detect (via
> headers, or missing headers, or reverse ip lookup, or other
> distinctive nuances of how the connection was made) that the
> client connecting to their server isn't a "human browser" (ie: firefox,
> chrome, safari) and is a Robot that they don't want to cooperate with (ie:
> they might be happy to serve their pages to the google-bot crawler, but not
> to some third-party they've never heard of).
>
> The specifics of how/why you might get a 403 for any given url are hard to
> debug -- it might literally depend on how many requests you've sent to that
> domain in the past X hours.
>
> In general Solr's ContentStream indexing from remote hosts isn't intended
> to be a super robust solution for crawling arbitrary websites on the web
> -- if that's your goal, then i would suggest you look into running a more
> robust crawler (nutch, droids, Lucidworks Fusion, etc...) that has more
> features and debugging options (notably: rate limiting) and use that code
> to fetch the content, then push it to Solr.
>
>
> : Date: Tue, 29 Mar 2016 20:54:52 -0300
> : From: Jeferson dos Anjos <jefersonan...@packdocs.com>
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user@lucene.apache.org
> : Subject: Solr response error 403 when I try to index medium.com articles
> :
> : I'm trying to index some pages of the medium. But I get error 403. I
> : believe it is because the medium does not accept the user-agent solr. Has
> : anyone ever experienced this? You know how to change?
> :
> : I appreciate any help
> :
> : 
> : 500
> : 94
> : 
> : 
> : 
> : Server returned HTTP response code: 403 for URL:
> :
> https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> : 
> : 
> : java.io.IOException: Server returned HTTP response code: 403 for URL:
> :
> https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> : at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown
> : Source) at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> : Source) at java.lang.reflect.Constructor.newInstance(Unknown Source)
> : at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> : at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> : at java.security.AccessController.doPrivileged(Native Method) at
> : sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown
> : Source) at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown
> : Source) at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown
> : Source) at
> sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown
> : Source) at
> org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
> : at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
> : at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> : at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
> : at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
> : at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006) at
> :
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> : at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
> : at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
> : at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> : at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> : at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> : at
> org.eclipse.jetty.security.Secur

Re: Solr response error 403 when I try to index medium.com articles

2016-03-29 Thread Jack Krupansky
Medium switches from http to https, so you would need the logic for dealing
with https security handshakes.

-- Jack Krupansky

On Tue, Mar 29, 2016 at 7:54 PM, Jeferson dos Anjos <
jefersonan...@packdocs.com> wrote:

> I'm trying to index some pages of the medium. But I get error 403. I
> believe it is because the medium does not accept the user-agent solr. Has
> anyone ever experienced this? You know how to change?
>
> I appreciate any help
>
> 
> 500
> 94
> 
> 
> 
> Server returned HTTP response code: 403 for URL:
>
> https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> 
> 
> java.io.IOException: Server returned HTTP response code: 403 for URL:
>
> https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown
> Source) at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> Source) at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method) at
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown
> Source) at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown
> Source) at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown
> Source) at
> sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown
> Source) at
> org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
> at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006) at
>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:368) at
>
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
> at
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640) at
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Unknown Source) Caused by:
> java.io.IOException: Server returned HTTP response code: 403 for URL:
>
> https://me

Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Jack Krupansky
Thanks, Erick, I had forgotten about that. I did find one short reference
to it in the doc: "Be sure to run the Lucene IndexUpgrader included with
Solr 4.10 if you might still have old 3x formatted segments in your index.
Alternatively: fully optimize your index with Solr 4.10 to make sure it
consists only of one up-to-date index segment."

See:
https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+4+to+Solr+5

Note to doc guys and committers: That section needs to be replaced with
"Major Changes form Solr 5 to Solr 6".

Also, that IndexUpgrader reference doesn't link to any doc, even the Lucene Javadoc:
https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/index/IndexUpgrader.html

Feels like there should be some Solr doc as well. For example, can Solr be
running, or does it (each node if SolrCloud) need to be shut down first?
And note that it's needed for each collection. Presumably the collections
can be upgraded in parallel since they are distinct directories. It would
be nice to have a SolrIndexUpgrader to run in one shot and discover and
upgrade all Solr collections.
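
For reference, the Lucene-level invocation looks roughly like this (a sketch;
the jar names and index directory are placeholders, and presumably the core
should be offline while it runs):

java -cp lucene-core-5.5.0.jar:lucene-backward-codecs-5.5.0.jar \
  org.apache.lucene.index.IndexUpgrader -delete-prior-commits -verbose \
  /var/solr/data/mycollection_shard1_replica1/data/index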

-- Jack Krupansky

On Thu, Mar 24, 2016 at 12:16 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> There's always the IndexUpgrader, one could run the 5x version against
> a 4x index and have a 5x-compatible index that would then be readable
> by 6x OOB.
>
> A bit convoluted to be sure.
>
> Erick
>
> On Thu, Mar 24, 2016 at 8:49 AM, Yonik Seeley <ysee...@gmail.com> wrote:
> > On Thu, Mar 24, 2016 at 11:45 AM, Yonik Seeley <ysee...@gmail.com>
> wrote:
> >>> I've been led to understand that 6.X (at least the Lucene part?) won't
> >>> be backwards compatible with 4.X data. 5.5 at least works fine with
> data
> >>> files from 4.7, for instance.
> >
> > It really doesn't seem like much changed at the lucene index-format
> > level from 5 to 6...
> > it makes one wonder how much work would be involved in allowing Lucene
> > 6 to directly read a newer 4.x index... maybe it's just down to
> > version strings in the index and not much else?
> >
> > -Yonik
>


Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Jack Krupansky
Does anybody know if we have doc on the recommended process for upgrading
data after upgrading Solr? Sure the upgraded version will work fine with
that old data, but unless the data is upgraded, the user can't then upgrade
to the next major release after that. This is a case in point - the user is
on 4.x and upgrades to 5.x with that 4.x data, but will want to upgrade to
6.x shortly, but that will require the 4.x data to be rewritten
(force-merged?) to 5.x first.

-- Jack Krupansky

On Thu, Mar 24, 2016 at 11:38 AM, Bram Van Dam <bram.van...@intix.eu> wrote:

> On 23/03/16 15:50, Yonik Seeley wrote:
> > Kind of a unique situation for a dot-oh release, but from the Solr
> > perspective, 6.0 should have *fewer* bugs than 5.5 (for those features
> > in 5.5 at least)... we've been squashing a bunch of docValue related
> > issues.
>
> I've been led to understand that 6.X (at least the Lucene part?) won't
> be backwards compatible with 4.X data. 5.5 at least works fine with data
> files from 4.7, for instance. With that in mind, at least from my
> selfish perspective, applying fixes to 5.X would be much appreciated ;-)
>
>  - Bram
>
>
>


Re: Delete by query using JSON?

2016-03-22 Thread Jack Krupansky
See the correct syntax example here:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-SendingJSONUpdateCommands

Your query itself is fine.
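
For example, something like this should work (note the Content-Type header,
which is the piece most often missing):

curl -s -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/blacklight-core/update?commit=true' \
  -d '{"delete": {"query": "doctype:cres"}}'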

-- Jack Krupansky

On Tue, Mar 22, 2016 at 3:07 PM, Paul Hoffman <p...@flo.org> wrote:

> I've been struggling to find the right syntax for deleting by query
> using JSON, where the query includes an fq parameter.
>
> I know how to delete *all* documents, but how would I delete only
> documents with field doctype = "cres"?  I have tried the following along
> with a number of variations, all to no avail:
>
> $ curl -s -d @- 'http://localhost:8983/solr/blacklight-core/update?wt=json' <<EOS
> {
> "delete": { "query": "doctype:cres" }
> }
> EOS
>
> I can identify the documents like this:
>
> curl -s '
> http://localhost:8983/solr/blacklight-core/select?q=*%3A*&fq=doctype%3Acres&wt=json&fl=id
> '
>
> It seems like such a simple thing, but I haven't found any examples that
> use an fq.  Could someone post an example?
>
> Thanks in advance,
>
> Paul.
>
> --
> Paul Hoffman <p...@flo.org>
> Systems Librarian
> Fenway Libraries Online
> c/o Wentworth Institute of Technology
> 550 Huntington Ave.
> Boston, MA 02115
> (617) 442-2384 (FLO main number)
>


Re: Save Number of words in field

2016-03-21 Thread Jack Krupansky
You can write an Update Request Processor that would count the words in the
source value for a specified field and generate that count as an integer
value for another field.

My old Solr 4.x Deep Dive book has an example that uses a sequence (chain)
of existing update processors to count words in a multi-valued text field.
That's not as efficient as a custom or script update processor, but avoids
creating a custom processor.

See:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
Look for "regex-count-words".


-- Jack Krupansky

On Mon, Mar 21, 2016 at 12:15 PM, G, Rajesh <r...@cebglobal.com> wrote:

> Hi,
>
> When indexing sentences I want to store the number of words in the
> sentence in a field that I can use with other queries later for word
> count matching. Please let me know whether it is possible.
>
> Thanks
> Rajesh
>


Re: Seasonal searches in SOLR 5.x

2016-03-21 Thread Jack Krupansky
You can write an Update Request Processor which takes a pair of date field
values and creates a season code value for a separate field, which could be
multivalued for date ranges spanning seasons. Similarly you could have
another generated multivalued field which lists the months when the data
was collected. You could decide whether to store this extra info as an
alphanumeric code or as small integers (1-4 for seasons, 1-12 for months).
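
A rough sketch of that idea (not tested; the field names collection_start,
collection_end, collection_month and collection_season are hypothetical, and
the script assumes the dates arrive as ISO date strings):

// seasons.js, referenced from a solr.StatelessScriptUpdateProcessorFactory
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var start = doc.getFieldValue("collection_start");
  var end = doc.getFieldValue("collection_end");
  if (start == null || end == null) return;
  // month index counted from year 0 so ranges can cross year boundaries
  var s = new Date(String(start)), e = new Date(String(end));
  var first = s.getFullYear() * 12 + s.getMonth();
  var last = e.getFullYear() * 12 + e.getMonth();
  var seenSeason = {};
  for (var i = first; i <= last && i - first < 12; i++) {
    var month = (i % 12) + 1;                          // 1..12
    doc.addField("collection_month", month);
    var season = Math.floor((month % 12) / 3) + 1;     // 1=Dec-Feb ... 4=Sep-Nov
    if (!seenSeason[season]) {
      seenSeason[season] = true;
      doc.addField("collection_season", season);
    }
  }
}

Queries can then filter on fq=collection_season:3 (summer) or
fq=collection_month:2 (February) without caring about the year.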

-- Jack Krupansky

On Mon, Mar 21, 2016 at 1:26 PM, Ioannis Kirmitzoglou <
ioanniskirmitzog...@gmail.com> wrote:

> Hi all,
>
> I would like to implement seasonal date searches on date ranges. I’m using
> SOLR 5.4.1 and have indexed date ranges using a DateRangeField (let’s call
> this field date_ranges).
> Each document in SOLR corresponds to a biological sample and each sample
> was collected during a date range that can span from a single day to
> multiple years. For my application it makes sense to enable seasonal
> searches, ie find samples that were collected during a specific period of
> the year (e.g. summer, or February). In this type of search, the year that
> the sample was collected is not relevant, only the days of the year. I’ve
> been all over SOLR documentation and I haven’t been able to find anything
> that will enable do me that. The closest I got was a post with instructions
> on how to use a spatial field to do date searches (
> https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/).
> Using the logic in that post I was able to come up with a solution but it’s
> rather complex and needs polygon searches (which in turn means installing
> the JTS Topology suite).
> Before committing to that I would like to ask for your input and whether
> there’s an easier way to do these types of searches.
>
> Many thanks,
>
> Ioannis
>
> -
> Ioannis Kirmitzoglou, PhD
> Bioinformatician - Scientific Programmer
> Imperial College, London
> www.vectorbase.org
> www.vigilab.org
>
>


Re: Query behavior.

2016-03-19 Thread Jack Krupansky
That's what I thought you had meant before, but the Jira ticket indicates
that you are looking for some extra level of AND/MUST outside of the OR,
which is different from what you just indicated. In the ticket you say: "How
can I achieve following? "+((fl:java fl:book))"", which has an extra AND
outside of the inner sub-query, which is a little different than just "(fl:java
fl:book)". Sure, the results should be the same, but why insist on the
extra level of nested boolean query?

-- Jack Krupansky

On Thu, Mar 17, 2016 at 12:50 AM, Modassar Ather <modather1...@gmail.com>
wrote:

> What I understand by q.op is the default operator. If there is no AND/OR
> in-between the terms the default will be AND as per my setting of q.op=AND.
> But what if the query has AND/OR explicitly put in-between the query terms?
> I just think that if (A OR B) is the query then the result should be based
> on any of the terms or both of the terms and not only both of the terms.
> Please correct me if my understanding is wrong.
>
> Thanks,
> Modassar
>
> On Wed, Mar 16, 2016 at 7:34 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > Now you've confused me... Did you actually intend that q.op=AND was going
> > to perform some function in a query with only two terms and and OR
> > operator? I mean, why not just drop the q.op=AND?
> >
> > -- Jack Krupansky
> >
> > On Wed, Mar 16, 2016 at 1:31 AM, Modassar Ather <modather1...@gmail.com>
> > wrote:
> >
> > > Jack as suggested I have created following jira issue.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-8853
> > >
> > > Thanks,
> > > Modassar
> > >
> > >
> > > On Tue, Mar 15, 2016 at 8:15 PM, Jack Krupansky <
> > jack.krupan...@gmail.com>
> > > wrote:
> > >
> > > > That was precisely the point of the need for a new Jira - to answer
> > > exactly
> > > > the questions that you have posed - and that I had proposed as well.
> > > Until
> > > > some of the senior committers comment on that Jira you won't have
> > > answers.
> > > > They've painted themselves into a corner and now I am curious how
> they
> > > will
> > > > unpaint themselves out of that corner.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Tue, Mar 15, 2016 at 1:46 AM, Modassar Ather <
> > modather1...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks Jack for your response.
> > > > > The following jira bug for this issue is already present so I have
> > not
> > > > > created a new one.
> > > > > https://issues.apache.org/jira/browse/SOLR-8812
> > > > >
> > > > > Kindly help me understand that whether it is possible to achieve
> > search
> > > > on
> > > > > ORed terms as it was done in earlier Solr version.
> > > > > Is this behavior intentional or is it a bug? I need to migrate to
> > > > > Solr-5.5.0 but not doing so due to this behavior.
> > > > >
> > > > > Thanks,
> > > > > Modassar
> > > > >
> > > > >
> > > > > On Fri, Mar 11, 2016 at 3:18 AM, Jack Krupansky <
> > > > jack.krupan...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > We probably need a Jira to investigate whether this really is an
> > > > > explicitly
> > > > > > intentional feature change, or whether it really is a bug. And if
> > it
> > > > > truly
> > > > > > was intentional, how people can work around the change to get the
> > > > > desired,
> > > > > > pre-5.5 behavior. Personally, I always thought it was a mistake
> > that
> > > > q.op
> > > > > > and mm were so tightly linked in Solr even though they are
> > > independent
> > > > in
> > > > > > Lucene.
> > > > > >
> > > > > > In short, I think people want to be able to set the default
> > behavior
> > > > for
> > > > > > individual terms (MUST vs. SHOULD) if explicit operators are not
> > > used,
> > > > > and
> > > > > > that OR is an explicit operator. And that mm should control only
> > how
> > > > many
> > > > > > SHOULD terms are required (Lucene MinShouldMatch.)
> > > >

Re: Query behavior.

2016-03-19 Thread Jack Krupansky
I was just wanting to see the Jira clarified (without creating noise on the
Jira), but if others feel they understand the relevance of the outer AND/+
to the stated problem, fine. I don't think I have anything else to add to
the discussion at this stage. Now we sit and wait for some senior
committers to address the concern.

-- Jack Krupansky

On Fri, Mar 18, 2016 at 6:06 AM, Alessandro Benedetti <abenede...@apache.org
> wrote:

> I think what he tried to explain was :
> " Input query : *fl:(java OR book)*
>  Instead of having the query parser parsing :
>  *+((fl:java fl:book)~2) *( which seems what is happening right now)
> He wants the query parser to parse:
>
> +((fl:java fl:book)) ( without the mm expressed)
>
> More than the outer level of AND , I think the concern is in the absence of
> the ~2 operator ( mm=2 set automatically) .
>
> Anyway I can't reproduce the issue :(
>
> P.S. taking a brief look into the code
> : org/apache/solr/search/ExtendedDismaxQParser.java:341
> I suggest you to debug from that point as the comment says :
>
> // For correct lucene queries, turn off mm processing if there
> // were explicit operators (except for AND).
> if (query instanceof BooleanQuery) {
> query = SolrPluginUtils.setMinShouldMatch((BooleanQuery)query,
> config.minShouldMatch, config.mmAutoRelax);
> }
>
> I have no time now,
> Cheers
>
> On Fri, Mar 18, 2016 at 4:39 AM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > You still haven't explained what exactly you are trying to accomplish
> with
> > that outer level AND/+/MUST. Please be specific - why you insist on
> > "+((fl:java
> > fl:book))" rather than  "fl:java fl:book".
> >
> > -- Jack Krupansky
> >
> > On Fri, Mar 18, 2016 at 12:12 AM, Modassar Ather <modather1...@gmail.com
> >
> > wrote:
> >
> > > What I understand by "+((fl:java fl:book))" is any of the terms should
> be
> > > present in the complete query. Please correct me if I am wrong.
> > > What I want to achieve is (A OR B) where any of the term or both of the
> > > term will cause a match.
> > >
> > > Thanks,
> > > Modassar
> > >
> > > On Thu, Mar 17, 2016 at 10:32 AM, Jack Krupansky <
> > jack.krupan...@gmail.com
> > > >
> > > wrote:
> > >
> > > > That's what I thought you had meant before, but the Jira ticket
> > indicates
> > > > that you are looking for some extra level of AND/MUST outside of the
> > OR,
> > > > which is different from what you just indicated. In the ticket you
> say:
> > > > "How
> > > > can I achieve following? "+((fl:java fl:book))"", which has an extra
> > AND
> > > > outside of the inner sub-query, which is a little different than just
> > > > "(fl:java
> > > > fl:book)". Sure, the results should be the same, but why insist on
> the
> > > > extra level of nested boolean query?
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Thu, Mar 17, 2016 at 12:50 AM, Modassar Ather <
> > modather1...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > What I understand by q.op is the default operator. If there is no
> > > AND/OR
> > > > > in-between the terms the default will be AND as per my setting of
> > > > q.op=AND.
> > > > > But what if the query has AND/OR explicitly put in-between the
> query
> > > > terms?
> > > > > I just think that if (A OR B) is the query then the result should
> be
> > > > based
> > > > > on any of the term's or both of the terms and not only both of the
> > > terms.
> > > > > Please correct me if my understanding is wrong.
> > > > >
> > > > > Thanks,
> > > > > Modassar
> > > > >
> > > > > On Wed, Mar 16, 2016 at 7:34 PM, Jack Krupansky <
> > > > jack.krupan...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Now you've confused me... Did you actually intend that q.op=AND
> was
> > > > going
> > > > > > to perform some function in a query with only two terms and and
> OR
> > > > > > operator? I mean, why not just drop the q.op=AND?
> > > > > >
> > > > > > -- Jack Krupansky
> > > > > >
> > > > > > On Wed, Mar 16, 2016 at 1:31 AM, Modas

Re: Query behavior.

2016-03-19 Thread Jack Krupansky
Now you've confused me... Did you actually intend that q.op=AND was going
to perform some function in a query with only two terms and an OR
operator? I mean, why not just drop the q.op=AND?

-- Jack Krupansky

On Wed, Mar 16, 2016 at 1:31 AM, Modassar Ather <modather1...@gmail.com>
wrote:

> Jack as suggested I have created following jira issue.
>
> https://issues.apache.org/jira/browse/SOLR-8853
>
> Thanks,
> Modassar
>
>
> On Tue, Mar 15, 2016 at 8:15 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > That was precisely the point of the need for a new Jira - to answer
> exactly
> > the questions that you have posed - and that I had proposed as well.
> Until
> > some of the senior committers comment on that Jira you won't have
> answers.
> > They've painted themselves into a corner and now I am curious how they
> will
> > unpaint themselves out of that corner.
> >
> > -- Jack Krupansky
> >
> > On Tue, Mar 15, 2016 at 1:46 AM, Modassar Ather <modather1...@gmail.com>
> > wrote:
> >
> > > Thanks Jack for your response.
> > > The following jira bug for this issue is already present so I have not
> > > created a new one.
> > > https://issues.apache.org/jira/browse/SOLR-8812
> > >
> > > Kindly help me understand that whether it is possible to achieve search
> > on
> > > ORed terms as it was done in earlier Solr version.
> > > Is this behavior intentional or is it a bug? I need to migrate to
> > > Solr-5.5.0 but not doing so due to this behavior.
> > >
> > > Thanks,
> > > Modassar
> > >
> > >
> > > On Fri, Mar 11, 2016 at 3:18 AM, Jack Krupansky <
> > jack.krupan...@gmail.com>
> > > wrote:
> > >
> > > > We probably need a Jira to investigate whether this really is an
> > > explicitly
> > > > intentional feature change, or whether it really is a bug. And if it
> > > truly
> > > > was intentional, how people can work around the change to get the
> > > desired,
> > > > pre-5.5 behavior. Personally, I always thought it was a mistake that
> > q.op
> > > > and mm were so tightly linked in Solr even though they are
> independent
> > in
> > > > Lucene.
> > > >
> > > > In short, I think people want to be able to set the default behavior
> > for
> > > > individual terms (MUST vs. SHOULD) if explicit operators are not
> used,
> > > and
> > > > that OR is an explicit operator. And that mm should control only how
> > many
> > > > SHOULD terms are required (Lucene MinShouldMatch.)
> > > >
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Thu, Mar 10, 2016 at 3:41 AM, Modassar Ather <
> > modather1...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks Shawn for pointing to the jira issue. I was not sure that if
> > it
> > > is
> > > > > an expected behavior or a bug or there could have been a way to get
> > the
> > > > > desired result.
> > > > >
> > > > > Best,
> > > > > Modassar
> > > > >
> > > > > On Thu, Mar 10, 2016 at 11:32 AM, Shawn Heisey <
> apa...@elyograg.org>
> > > > > wrote:
> > > > >
> > > > > > On 3/9/2016 10:55 PM, Shawn Heisey wrote:
> > > > > > > The ~2 syntax, when not attached to a phrase query (quotes) is
> > the
> > > > way
> > > > > > > you express a fuzzy query. If it's attached to a query in
> quotes,
> > > > then
> > > > > > > it is a proximity query. I'm not sure whether it means
> something
> > > > > > > different when it's attached to a query clause in parentheses,
> > > > someone
> > > > > > > with more knowledge will need to comment.
> > > > > > 
> > > > > > > https://issues.apache.org/jira/browse/SOLR-8812
> > > > > >
> > > > > > After I read SOLR-8812 more closely, it seems that the ~2 syntax
> > with
> > > > > > parentheses is the way that the effective mm value is expressed
> > for a
> > > > > > particular query clause in the parsed query.  I've learned
> > something
> > > > new
> > > > > > today.
> > > > > >
> > > > > > Thanks,
> > > > > > Shawn
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Query behavior.

2016-03-19 Thread Jack Krupansky
You still haven't explained what exactly you are trying to accomplish with
that outer level AND/+/MUST. Please be specific - why do you insist on
"+((fl:java
fl:book))" rather than  "fl:java fl:book".

-- Jack Krupansky

On Fri, Mar 18, 2016 at 12:12 AM, Modassar Ather <modather1...@gmail.com>
wrote:

> What I understand by "+((fl:java fl:book))" is any of the terms should be
> present in the complete query. Please correct me if I am wrong.
> What I want to achieve is (A OR B) where any of the term or both of the
> term will cause a match.
>
> Thanks,
> Modassar
>
> On Thu, Mar 17, 2016 at 10:32 AM, Jack Krupansky <jack.krupan...@gmail.com
> >
> wrote:
>
> > That's what I thought you had meant before, but the Jira ticket indicates
> > that you are looking for some extra level of AND/MUST outside of the OR,
> > which is different from what you just indicated. In the ticket you say:
> > "How
> > can I achieve following? "+((fl:java fl:book))"", which has an extra AND
> > outside of the inner sub-query, which is a little different than just
> > "(fl:java
> > fl:book)". Sure, the results should be the same, but why insist on the
> > extra level of nested boolean query?
> >
> > -- Jack Krupansky
> >
> > On Thu, Mar 17, 2016 at 12:50 AM, Modassar Ather <modather1...@gmail.com
> >
> > wrote:
> >
> > > What I understand by q.op is the default operator. If there is no
> AND/OR
> > > in-between the terms the default will be AND as per my setting of
> > q.op=AND.
> > > But what if the query has AND/OR explicitly put in-between the query
> > terms?
> > > I just think that if (A OR B) is the query then the result should be
> > based
> > > on any of the term's or both of the terms and not only both of the
> terms.
> > > Please correct me if my understanding is wrong.
> > >
> > > Thanks,
> > > Modassar
> > >
> > > On Wed, Mar 16, 2016 at 7:34 PM, Jack Krupansky <
> > jack.krupan...@gmail.com>
> > > wrote:
> > >
> > > > Now you've confused me... Did you actually intend that q.op=AND was
> > going
> > > > to perform some function in a query with only two terms and and OR
> > > > operator? I mean, why not just drop the q.op=AND?
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Wed, Mar 16, 2016 at 1:31 AM, Modassar Ather <
> > modather1...@gmail.com>
> > > > wrote:
> > > >
> > > > > Jack as suggested I have created following jira issue.
> > > > >
> > > > > https://issues.apache.org/jira/browse/SOLR-8853
> > > > >
> > > > > Thanks,
> > > > > Modassar
> > > > >
> > > > >
> > > > > On Tue, Mar 15, 2016 at 8:15 PM, Jack Krupansky <
> > > > jack.krupan...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > That was precisely the point of the need for a new Jira - to
> answer
> > > > > exactly
> > > > > > the questions that you have posed - and that I had proposed as
> > well.
> > > > > Until
> > > > > > some of the senior committers comment on that Jira you won't have
> > > > > answers.
> > > > > > They've painted themselves into a corner and now I am curious how
> > > they
> > > > > will
> > > > > > unpaint themselves out of that corner.
> > > > > >
> > > > > > -- Jack Krupansky
> > > > > >
> > > > > > On Tue, Mar 15, 2016 at 1:46 AM, Modassar Ather <
> > > > modather1...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks Jack for your response.
> > > > > > > The following jira bug for this issue is already present so I
> > have
> > > > not
> > > > > > > created a new one.
> > > > > > > https://issues.apache.org/jira/browse/SOLR-8812
> > > > > > >
> > > > > > > Kindly help me understand that whether it is possible to
> achieve
> > > > search
> > > > > > on
> > > > > > > ORed terms as it was done in earlier Solr version.
> > > > > > > Is this behavior intentional or is it a bug? I need to migrate
> to
> > > > > > > Solr-5.5.0 

Re: how to update billions of docs

2016-03-19 Thread Jack Krupansky
It would be nice to have a wiki/doc for "Bulk Field Update" that listed all
of these techniques and tricks.

And, of course, it would be so much better to have an explicit Lucene
feature for this. It could work in the background like merge and process
one segment at a time as efficiently as possible.

Have several modes:

1. Set a field of all documents to an explicit value.
2. Set a field of the documents matching a query to an explicit value.
3. Increment by n.
4. Add a new field to all documents, or maybe by query.
5. Delete an existing field for all documents.
6. Delete a field value for all documents or for a specified query.


-- Jack Krupansky

On Thu, Mar 17, 2016 at 12:31 PM, Ken Krugler <kkrugler_li...@transpac.com>
wrote:

> As others noted, currently updating a field means deleting and inserting
> the entire document.
>
> Depending on how you use the field, you might be able to create another
> core/container with that one field (plus the key field), and use join
> support.
>
> Note that https://issues.apache.org/jira/browse/LUCENE-6352 is an
> improvement, which looks like it's in the 5.x code line, though I don't see
> a fix version.
>
> -- Ken
>
> > From: Mohsin Beg Beg
> > Sent: March 16, 2016 3:52:47pm PDT
> > To: solr-user@lucene.apache.org
> > Subject: how to update billions of docs
> >
> > Hi,
> >
> > I have a requirement to replace a value of a field in 100B's of docs in
> 100's of cores.
> > The field is multiValued=false docValues=true type=StrField stored=true
> indexed=true.
> >
> > Atomic Updates performance is on the order of 5K docs per sec per core
> in solr 5.3 (other fields are quite big).
> >
> > Any suggestions ?
> >
> > -Mohsin
>
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>


Re: how to update billions of docs

2016-03-19 Thread Jack Krupansky
That's another great example of a mode that Bulk Field Update (my mythical
feature) needs - switch a list of fields from stored to docvalues.

And maybe even the opposite since there are scenarios in which docValues is
worse than stored and you would only find that out after indexing...
billions of documents.

Being able to switch the indexed mode of a field (or list of fields) is also
a mode needed for bulk update (reindex).


-- Jack Krupansky

On Fri, Mar 18, 2016 at 4:12 AM, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Hi Mohsin,
> There's some work in progress for in-place updates to docValued fields,
> https://issues.apache.org/jira/browse/SOLR-5944. Can you try the latest
> patch there (or ping me if you need a git branch)?
> It would be nice to know how fast the updates go for your usecase with that
> patch. Please note that for that patch, both the version field and the
> updated field needs to have stored=false, indexed=false, docValues=true.
> Regards,
> Ishan
>
>
> On Thu, Mar 17, 2016 at 10:55 PM, Jack Krupansky <jack.krupan...@gmail.com
> >
> wrote:
>
> > It would be nice to have a wiki/doc for "Bulk Field Update" that listed
> all
> > of these techniques and tricks.
> >
> > And, of course, it would be so much better to have an explicit Lucene
> > feature for this. It could work in the background like merge and process
> > one segment at a time as efficiently as possible.
> >
> > Have several modes:
> >
> > 1. Set a field of all documents to explicit value.
> > 2. Set a field of query documents to an explicit value.
> > 3. Increment by n.
> > 4. Add new field to all document, or maybe by query.
> > 5. Delete existing field for all documents.
> > 6. Delete field value for all documents or a specified query.
> >
> >
> > -- Jack Krupansky
> >
> > On Thu, Mar 17, 2016 at 12:31 PM, Ken Krugler <
> kkrugler_li...@transpac.com
> > >
> > wrote:
> >
> > > As others noted, currently updating a field means deleting and
> inserting
> > > the entire document.
> > >
> > > Depending on how you use the field, you might be able to create another
> > > core/container with that one field (plus the key field), and use join
> > > support.
> > >
> > > Note that https://issues.apache.org/jira/browse/LUCENE-6352 is an
> > > improvement, which looks like it's in the 5.x code line, though I don't
> > see
> > > a fix version.
> > >
> > > -- Ken
> > >
> > > > From: Mohsin Beg Beg
> > > > Sent: March 16, 2016 3:52:47pm PDT
> > > > To: solr-user@lucene.apache.org
> > > > Subject: how to update billions of docs
> > > >
> > > > Hi,
> > > >
> > > > I have a requirement to replace a value of a field in 100B's of docs
> in
> > > 100's of cores.
> > > > The field is multiValued=false docValues=true type=StrField
> stored=true
> > > indexed=true.
> > > >
> > > > Atomic Updates performance is on the order of 5K docs per sec per
> core
> > > in solr 5.3 (other fields are quite big).
> > > >
> > > > Any suggestions ?
> > > >
> > > > -Mohsin
> > >
> > >
> > > --
> > > Ken Krugler
> > > +1 530-210-6378
> > > http://www.scaleunlimited.com
> > > custom big data solutions & training
> > > Hadoop, Cascading, Cassandra & Solr
> > >
> > >
> > >
> > >
> > >
> > >
> >
>


Re: Query behavior.

2016-03-15 Thread Jack Krupansky
That was precisely the point of the need for a new Jira - to answer exactly
the questions that you have posed - and that I had proposed as well. Until
some of the senior committers comment on that Jira you won't have answers.
They've painted themselves into a corner and now I am curious how they will
unpaint themselves out of that corner.

-- Jack Krupansky

On Tue, Mar 15, 2016 at 1:46 AM, Modassar Ather <modather1...@gmail.com>
wrote:

> Thanks Jack for your response.
> The following jira bug for this issue is already present so I have not
> created a new one.
> https://issues.apache.org/jira/browse/SOLR-8812
>
> Kindly help me understand that whether it is possible to achieve search on
> ORed terms as it was done in earlier Solr version.
> Is this behavior intentional or is it a bug? I need to migrate to
> Solr-5.5.0 but not doing so due to this behavior.
>
> Thanks,
> Modassar
>
>
> On Fri, Mar 11, 2016 at 3:18 AM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > We probably need a Jira to investigate whether this really is an
> explicitly
> > intentional feature change, or whether it really is a bug. And if it
> truly
> > was intentional, how people can work around the change to get the
> desired,
> > pre-5.5 behavior. Personally, I always thought it was a mistake that q.op
> > and mm were so tightly linked in Solr even though they are independent in
> > Lucene.
> >
> > In short, I think people want to be able to set the default behavior for
> > individual terms (MUST vs. SHOULD) if explicit operators are not used,
> and
> > that OR is an explicit operator. And that mm should control only how many
> > SHOULD terms are required (Lucene MinShouldMatch.)
> >
> >
> > -- Jack Krupansky
> >
> > On Thu, Mar 10, 2016 at 3:41 AM, Modassar Ather <modather1...@gmail.com>
> > wrote:
> >
> > > Thanks Shawn for pointing to the jira issue. I was not sure that if it
> is
> > > an expected behavior or a bug or there could have been a way to get the
> > > desired result.
> > >
> > > Best,
> > > Modassar
> > >
> > > On Thu, Mar 10, 2016 at 11:32 AM, Shawn Heisey <apa...@elyograg.org>
> > > wrote:
> > >
> > > > On 3/9/2016 10:55 PM, Shawn Heisey wrote:
> > > > > The ~2 syntax, when not attached to a phrase query (quotes) is the
> > way
> > > > > you express a fuzzy query. If it's attached to a query in quotes,
> > then
> > > > > it is a proximity query. I'm not sure whether it means something
> > > > > different when it's attached to a query clause in parentheses,
> > someone
> > > > > with more knowledge will need to comment.
> > > > 
> > > > > https://issues.apache.org/jira/browse/SOLR-8812
> > > >
> > > > After I read SOLR-8812 more closely, it seems that the ~2 syntax with
> > > > parentheses is the way that the effective mm value is expressed for a
> > > > particular query clause in the parsed query.  I've learned something
> > new
> > > > today.
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
> > > >
> > >
> >
>


Re: Re: Avoid Duplication of record in searching

2016-03-15 Thread Jack Krupansky
It's called "live indexing" and is in DSE 4.7:
http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/srch/srchConfIncrIndexThruPut.html


-- Jack Krupansky

On Tue, Mar 15, 2016 at 4:41 AM, <rajeshkuma...@maxval-ip.com> wrote:

> Hi Jack,
> I am using DSE Search in  Datastax DSE 4.7.3, Cassandra version -
> Cassandra 2.1.8.689
>
> Which recent version of DSE has this - "DSE Search also has a
> real-time search feature that does not require commit"?
>
>
> ---
> Thanks and regards,
> Rajesh Kumar Sountarrajan
> Software Developer - IT Team
>
> Mobile: 91 - 9600984804
> Email - rajeshkuma...@maxval-ip.com
>
>
> - Original Message - Subject: Re: Avoid Duplication of
> record in searching
> From: "Jack Krupansky" <jack.krupan...@gmail.com>
> Date: 3/14/16 9:57 pm
> To: solr-user@lucene.apache.org
>
> Are you using DSE Search or some custom integration of Solr and Cassandra?
>
>  Generally, changes in Solr are only visible after a commit operation is
>  performed, either an explicit commit or a time-based auto-commit. Recent
>  DSE Search also has a real-time search feature that does not require
> commit
>  - are you using that?
>
>  -- Jack Krupansky
>
>  On Mon, Mar 14, 2016 at 12:18 PM, <rajeshkuma...@maxval-ip.com> wrote:
>
>  > HI,
>  > I am having SOLR Search on Cassandra Table, when I do some updation in
>  > the Cassandra Table to which the SOLR is being configured he Updated
> record
>  > gets Duplicated in SOLR Search.But when we do RE-Index of the SOLR
> there we
>  > are getting unique records.
>  >
>  > We can do re-index every time via application before the search process
> is
>  > started but it will degrade the performance of the search.
>  >
>  > So Kindly anyone point me what can be done to avoid duplication if we
> make
>  > updation to the Cassandra table configured in SOLR.
>  >
>  >
>  > ---
>  > Thanks and regards,
>  > Rajesh Kumar Sountarrajan
>  > Software Developer - IT Team
>  >
>


Re: Avoid Duplication of record in searching

2016-03-14 Thread Jack Krupansky
Are you using DSE Search or some custom integration of Solr and Cassandra?

Generally, changes in Solr are only visible after a commit operation is
performed, either an explicit commit or a time-based auto-commit. Recent
DSE Search also has a real-time search feature that does not require commit
- are you using that?
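
For reference, in plain Solr the visibility rule looks like this from SolrJ (a
minimal sketch; the collection and field names are placeholders, and an
autoCommit/autoSoftCommit section in solrconfig.xml is the time-based
alternative to calling commit explicitly):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitVisibility {
      public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");
        doc.addField("title_s", "updated title");
        client.add(doc);   // the update is not searchable yet

        client.commit();   // now searches see the new version instead of the old one
        client.close();
      }
    }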

-- Jack Krupansky

On Mon, Mar 14, 2016 at 12:18 PM, <rajeshkuma...@maxval-ip.com> wrote:

> HI,
>   I have Solr search configured on a Cassandra table. When I update the
> Cassandra table that Solr is configured on, the updated record gets
> duplicated in Solr search results. But when we re-index Solr, we get
> unique records.
>
> We could re-index from the application every time before the search process
> is started, but that would degrade search performance.
>
> So kindly point me to what can be done to avoid duplication when we update
> the Cassandra table that is configured in Solr.
>
>
> ---
> Thanks and regards,
> Rajesh Kumar Sountarrajan
> Software Developer - IT Team
>


Re: Solr Queries are very slow - Suggestions needed

2016-03-13 Thread Jack Krupansky
Yeah, there's some good material there, but probably still too inaccessible
for the average "help, my queries are slow" inquiry we get so frequently on
this list.

Another useful page is:
https://wiki.apache.org/solr/SolrPerformanceProblems


-- Jack Krupansky

On Sun, Mar 13, 2016 at 2:58 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Jack:
> https://wiki.apache.org/solr/SolrPerformanceFactors
> and
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>
> are already there, we can add to them
>
> Best,
> Erick
>
> On Sun, Mar 13, 2016 at 9:18 AM, Anil <anilk...@gmail.com> wrote:
> > Thanks Toke and Jack.
> >
> > Jack,
> >
> > Yes. it is 480 million :)
> >
> > I will share the additional details soon. thanks.
> >
> >
> > Regards,
> > Anil
> >
> >
> >
> >
> >
> > On 13 March 2016 at 21:06, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
> >
> >> (We should have a wiki/doc page for the "usual list of suspects" when
> >> queries are/appear slow, rather than need to repeat the same mantra(s)
> for
> >> every inquiry on this topic.)
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Sun, Mar 13, 2016 at 11:29 AM, Toke Eskildsen <
> t...@statsbiblioteket.dk>
> >> wrote:
> >>
> >> > Anil <anilk...@gmail.com> wrote:
> >> > > i have indexed a data (commands from files) with 10 fields and 3 of
> >> them
> >> > is
> >> > > text fields. collection is created with 3 shards and 2 replicas. I
> have
> >> > > used document routing as well.
> >> >
> >> > > Currently collection holds 47,80,01,405 records.
> >> >
> >> > ...480 million, right? Funny digit grouping in India.
> >> >
> >> > > text search against text field taking around 5 sec. solr is query
> just
> >> > and
> >> > > of two terms with fl as 7 fields
> >> >
> >> > > fileId:"file unique id" AND command_text:(system login)
> >> >
> >> > While not an impressive response time, it might just be that your
> >> hardware
> >> > is not enough to handle that amount of documents. The usual culprit
> is IO
> >> > speed, so chances are you have a system with spinning drives and not
> >> enough
> >> > RAM: Switch to SSD and/or add more RAM.
> >> >
> >> > To give better advice, we need more information.
> >> >
> >> > * How large are your 3 shards in bytes?
> >> > * What storage system do you use (local SSD, local spinning drives,
> >> remote
> >> > storage...)?
> >> > * How much physical memory does your system have?
> >> > * How much memory is free for disk cache?
> >> > * How many concurrent queries do you issue?
> >> > * Do you update while you search?
> >> > * What does a full query (rows, faceting, grouping, highlighting,
> >> > everything) look like?
> >> > * How many documents does a typical query match (hitcount)?
> >> >
> >> > - Toke Eskildsen
> >> >
> >>
>


Re: Solr Queries are very slow - Suggestions needed

2016-03-13 Thread Jack Krupansky
(We should have a wiki/doc page for the "usual list of suspects" when
queries are/appear slow, rather than need to repeat the same mantra(s) for
every inquiry on this topic.)


-- Jack Krupansky

On Sun, Mar 13, 2016 at 11:29 AM, Toke Eskildsen <t...@statsbiblioteket.dk>
wrote:

> Anil <anilk...@gmail.com> wrote:
> > i have indexed a data (commands from files) with 10 fields and 3 of them
> is
> > text fields. collection is created with 3 shards and 2 replicas. I have
> > used document routing as well.
>
> > Currently collection holds 47,80,01,405 records.
>
> ...480 million, right? Funny digit grouping in India.
>
> > Text search against the text field takes around 5 sec. The Solr query is
> > just an AND of two terms, with fl listing 7 fields.
>
> > fileId:"file unique id" AND command_text:(system login)
>
> While not an impressive response time, it might just be that your hardware
> is not enough to handle that amount of documents. The usual culprit is IO
> speed, so chances are you have a system with spinning drives and not enough
> RAM: Switch to SSD and/or add more RAM.
>
> To give better advice, we need more information.
>
> * How large are your 3 shards in bytes?
> * What storage system do you use (local SSD, local spinning drives, remote
> storage...)?
> * How much physical memory does your system have?
> * How much memory is free for disk cache?
> * How many concurrent queries do you issue?
> * Do you update while you search?
> * What does a full query (rows, faceting, grouping, highlighting,
> everything) look like?
> * How many documents does a typical query match (hitcount)?
>
> - Toke Eskildsen
>


Re: Sending text into a number field

2016-03-11 Thread Jack Krupansky
It might be nice to have a specialized update processor for this common
case of wanting to specify two separate but related numeric fields using
one string. IOW, parse out two numbers and then send them to two separate
fields. Seems doable, either as a script or in Java. The script/processor
could take three field names and a flag: the raw input source, the two
destination fields, and whether the raw source field should be removed or
passed through (presumably into a text/string field.)
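
A rough sketch of what such a processor could look like in Java (this is not an
existing Solr component; the source and destination field names used here -
size_text, width_d, height_d - are placeholders, and a real version would be
registered in an updateRequestProcessorChain in solrconfig.xml):

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    // Hypothetical processor: pulls the first two numbers out of a source field
    // (e.g. "Super awesome product 10 x 20cm") and writes them into two
    // separate numeric destination fields.
    public class ParseTwoNumbersProcessorFactory extends UpdateRequestProcessorFactory {
      private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
          SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Object raw = doc.getFieldValue("size_text");   // source field (placeholder name)
            if (raw != null) {
              Matcher m = NUMBER.matcher(raw.toString());
              if (m.find()) {
                doc.setField("width_d", Double.parseDouble(m.group()));   // first number
              }
              if (m.find()) {
                doc.setField("height_d", Double.parseDouble(m.group()));  // second number
              }
              // Optionally: doc.removeField("size_text") if pass-through is not wanted.
            }
            super.processAdd(cmd);
          }
        };
      }
    }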

(If I was still updating my old Solr 4.x Deep Dive book I'd be adding that
script right now, but... that's not happening.)

-- Jack Krupansky

On Fri, Mar 11, 2016 at 11:03 AM, Alessandro Benedetti <
abenede...@apache.org> wrote:

> I agree with Upayavira,
> this is an information extraction task, you need to implement your logic to
> extract the proper numeric values from the textual field.
> Your update request processor could be as simple as you want in extracting
> the number and setting them in numeric fields.
> So this task is responsibility of a component that process the original
> field value.
> It is not responsibility of the tdouble field type.
>
> Cheers
>
> On 11 March 2016 at 15:29, Upayavira <u...@odoko.co.uk> wrote:
>
> >
> >
> > On Fri, 11 Mar 2016, at 03:19 PM, John Blythe wrote:
> > > hey all,
> > >
> > > i'm tossing a lot of mud against the wall and am wanting to see what
> > > sticks. part of that includes throwing item descriptions against some
> > > fields i've set up as doubles. the imported data is a double and some
> of
> > > the descriptions will have the related data within it (product sizes,
> > > e.g.
> > > "Super awesome product 10 x 20cm"). is there a way to throw text at a
> > > number field (tdouble) and it only analyze the numbers instead of
> > > throwing
> > > an error?
> > >
> > > thanks for any info!
> >
> > If you really must do that on the Solr side, I'd suggest you try doing
> > it in an UpdateProcessor. You can either code these in Java, or in a
> > scripting language with the StatelessScriptUpdateProcessor. You could
> > strip out all of the non-numeric characters before they get to the
> > index.
> >
> > Upayavira
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: Query behavior.

2016-03-10 Thread Jack Krupansky
We probably need a Jira to investigate whether this really is an explicitly
intentional feature change, or whether it really is a bug. And if it truly
was intentional, how people can work around the change to get the desired,
pre-5.5 behavior. Personally, I always thought it was a mistake that q.op
and mm were so tightly linked in Solr even though they are independent in
Lucene.

In short, I think people want to be able to set the default behavior for
individual terms (MUST vs. SHOULD) if explicit operators are not used, and
that OR is an explicit operator. And that mm should control only how many
SHOULD terms are required (Lucene MinShouldMatch.)
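
To make that concrete, here is a minimal SolrJ sketch of the kind of request in
question (collection, field, and terms are placeholders); per SOLR-8812, Solr
5.5 no longer parses this the way people expect once q.op=AND is set:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class QopVsMmExample {
      public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");

        // Intent: bare terms default to MUST, while mm only constrains how many
        // of the explicitly OR'ed (SHOULD) clauses have to match.
        SolrQuery q = new SolrQuery("alpha beta (gamma OR delta OR epsilon)");
        q.set("defType", "edismax");
        q.set("qf", "text");
        q.set("q.op", "AND"); // default operator for terms without an explicit operator
        q.set("mm", "2");     // minimum-should-match for the SHOULD clauses

        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
      }
    }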


-- Jack Krupansky

On Thu, Mar 10, 2016 at 3:41 AM, Modassar Ather <modather1...@gmail.com>
wrote:

> Thanks Shawn for pointing to the jira issue. I was not sure that if it is
> an expected behavior or a bug or there could have been a way to get the
> desired result.
>
> Best,
> Modassar
>
> On Thu, Mar 10, 2016 at 11:32 AM, Shawn Heisey <apa...@elyograg.org>
> wrote:
>
> > On 3/9/2016 10:55 PM, Shawn Heisey wrote:
> > > The ~2 syntax, when not attached to a phrase query (quotes) is the way
> > > you express a fuzzy query. If it's attached to a query in quotes, then
> > > it is a proximity query. I'm not sure whether it means something
> > > different when it's attached to a query clause in parentheses, someone
> > > with more knowledge will need to comment.
> > 
> > > https://issues.apache.org/jira/browse/SOLR-8812
> >
> > After I read SOLR-8812 more closely, it seems that the ~2 syntax with
> > parentheses is the way that the effective mm value is expressed for a
> > particular query clause in the parsed query.  I've learned something new
> > today.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: ngrams with position

2016-03-10 Thread Jack Krupansky
I suspect that what you really want is something analogous to PF2/PF3, but
based on the ngram terms that come out of query-time token analysis, rather
than on pairs/triples of the source terms formed before analysis and then
analyzed as phrases, which forces all of the ngrams for a PF2/PF3 phrase to be
in order rather than potentially shuffled.

Also, phrase query is an implicit AND while you may really want more of a
SpanOr query where the terms are ORed but must be within a close proximity.
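
For reference, this is how the existing pf2/pf3 boosting is typically set up (a
SolrJ sketch with placeholder collection and field names, and the example query
text from this thread); the ngram-based variant suggested above would be
analogous, but built from the analyzed ngram terms instead of whitespace-
separated word pairs/triples:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class Pf2Pf3Example {
      public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/places");

        SolrQuery q = new SolrQuery("white tiger jumping");
        q.set("defType", "edismax");
        q.set("qf", "name");
        // edismax builds phrase boosts from pairs/triples of the source terms:
        // "white tiger", "tiger jumping", "white tiger jumping"
        q.set("pf2", "name^5");
        q.set("pf3", "name^10");
        // a phrase slop relaxes the strict in-order requirement a little
        q.set("ps2", "1");
        q.set("ps3", "2");

        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
      }
    }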

-- Jack Krupansky

On Thu, Mar 10, 2016 at 6:31 AM, Alessandro Benedetti <abenede...@apache.org
> wrote:

> The reason pf2 and pf3 seems not a good solution to me is the fact that the
> edismax query parser calculate those grams on top of words shingles.
> So it takes the query in input, and produces the shingle based on the white
> space separator.
>
> i.e. if you search :
> "white tiger jumping"
>  and pf2 configured on field1.
> You are going to end up searching in field1 :
> "white tiger", "tiger jumping" .
> This is really useful in full text search oriented to phrases and partial
> phrases match.
> But it has nothing to do with the analysis type associated at query time at
> this moment.
> First it is used the query parser tokenisation to build the grams and then
> the query time analysis is applied.
> This according to my remembering,
> I will double check in the code and let you know.
>
> Cheers
>
>
> On 10 March 2016 at 11:02, elisabeth benoit <elisaelisael...@gmail.com>
> wrote:
>
> > That's the use case, yes. Find Amsterdam with Asmtreadm.
> >
> > And yes, we're only doing approximative search if we get 0 result.
> >
> > I don't quite get why pf2 pf3 not a good solution.
> >
> > We're actually testing a solution close to phonetic. Some kind of word
> > reduction.
> >
> > Thanks for the suggestion (and the link), this makes me think maybe
> > phonetic is the good solution.
> >
> > Thanks for your help,
> > Elisabeth
> >
> > 2016-03-10 11:32 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:
> >
> > >  If I followed your use case is:
> > >
> > > I type Asmtreadm and I want document matching Amsterdam ( even if the
> > edit
> > > distance is greater than 2) .
> > > First of all is something I hope you do only if you get 0 results, if
> not
> > > the overhead can be great and you are going to lose a lot of precision
> > > causing confusion in the customer.
> > >
> > > Pf2 and Pf3 is ngram of white space separated tokens, to make partial
> > > phrase query to affect the scoring.
> > > Not a good fit for your problem.
> > >
> > > More than grams, have you considered using some sort of phonetic
> > matching ?
> > > Could this help :
> > > https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
> > >
> > > Cheers
> > >
> > > On 10 March 2016 at 08:47, elisabeth benoit <elisaelisael...@gmail.com
> >
> > > wrote:
> > >
> > > > I am trying to do approximative search with solr. We've tried fuzzy
> > > search,
> > > > and spellcheck search, it's working ok but edit distance is limited
> > (to 2
> > > > for DirectSolrSpellChecker in solr 4.10.1). With fuzzy operator,
> we've
> > > had
> > > > performance issues, and I don't think you can have an edit distance
> > more
> > > > than 2.
> > > >
> > > > What we used to do with a database was more efficient: storing
> trigrams
> > > > with position, and then searching arround that position (not
> precisely
> > at
> > > > that position, since it's approximative search)
> > > >
> > > > Position is to avoid  for a trigram like ams (amsterdam) to get
> answers
> > > > where the same trigram is for instance at the end of the word. I
> would
> > > like
> > > > answers with the same relative position between trigrams to score
> > higher.
> > > > Maybe using edismax'ss pf2 and pf3 is a way to do this. I don't see
> any
> > > > other way. Please tell me if you do.
> > > >
> > > > From you're answer, I get that position is stored, but I dont
> > understand
> > > > how I can preserve relative order between trigrams, apart from using
> > pf2
> > > > pf3.
> > > >
> > > > Best regards,
> > > > Elisabeth
> > > >
> > > > 2016-03-10 0:02 GMT+01:00 Alessandro Benedetti <
> abenede...@apache.org
> &

Re: Indexing Twitter - Hypothetical

2016-03-08 Thread Jack Krupansky
You have my permission... and blessing... and... condolences!

BTW, our usual recommendation is to do a subset proof of concept to see how
all the pieces come together and then calculate the scaling from there.
IOW, go ahead and index a day, a week, a month from the firehose and see
how many nodes, RAM, and SSD that takes and scale from there, although
estimating by more than a factor of ten is problematic given nonlinear
effects.


-- Jack Krupansky

On Tue, Mar 8, 2016 at 11:50 AM, Joseph Obernberger <
joseph.obernber...@gmail.com> wrote:

> Thank you for the links and explanation.  We are using GATE (General
> Architecture for Text Engineering) and parts of the Stanford NER/Parser for
> the data that we ingest, but we do not apply it to the queries - only the
> data.  We've been concentrating on the back-end, and analytics, not so much
> what comes in for queries; something that we need to address.  For this
> hypothetical, I wanted to get ideas on what questions would need to be
> asked, and how large the system would need to be.  Thank you all very much
> for the information so far!
> Jack - I want to be a guru-level Solr expert.  :)
>
> -Joe
>
> On Sun, Mar 6, 2016 at 1:29 PM, Walter Underwood <wun...@wunderwood.org>
> wrote:
>
> > This is a very good presentation on using entity extraction in query
> > understanding. As you’ll see from the preso, it is not easy.
> >
> >
> >
> http://www.slideshare.net/dtunkelang/better-search-through-query-understanding
> > <
> >
> http://www.slideshare.net/dtunkelang/better-search-through-query-understanding
> > >
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> > > On Mar 6, 2016, at 7:27 AM, Jack Krupansky <jack.krupan...@gmail.com>
> > wrote:
> > >
> > > Back to the original question... there are two answers:
> > >
> > > 1. Yes - for guru-level Solr experts.
> > > 2. No - for anybody else.
> > >
> > > For starters, (as always), you would need to do a lot more upfront work
> > on
> > > mapping out the forms of query which will be supported. For example, is
> > > your focus on precision or recall. And, are you looking to analyze all
> > > matching tweets or just a sample. And, the load, throughput, and
> latency
> > > requirements. And, any spatial search requirements. And, any entity
> > search
> > > requirements. Without a clear view of the query requirements it simply
> > > isn't possible to even begin defining a data model. And without a data
> > > model, indexing is a fool's errand. In short, no focus, no progress.
> > >
> > > -- Jack Krupansky
> > >
> > > On Sun, Mar 6, 2016 at 7:42 AM, Susheel Kumar <susheel2...@gmail.com>
> > wrote:
> > >
> > >> Entity Recognition means you may want to recognize different entities
> > >> name/person, email, location/city/state/country etc. in your
> > >> tweets/messages with goal of  providing better relevant results to
> > users.
> > >> NER can be used at query or indexing (data enrichment) time.
> > >>
> > >> Thanks,
> > >> Susheel
> > >>
> > >> On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger <
> > >> joseph.obernber...@gmail.com> wrote:
> > >>
> > >>> Thank you all very much for all the responses so far.  I've enjoyed
> > >> reading
> > >>> them!  We have noticed that storing data inside of Solr results in
> > >>> significantly worse performance (particularly faceting); so we store
> > the
> > >>> values of all the fields elsewhere, but index all the data with Solr
> > >>> Cloud.  I think the suggestion about splitting the data up into
> blocks
> > of
> > >>> date/time is where we would be headed.  Having two Solr-Cloud
> clusters
> > -
> > >>> one to handle ~30 days of data, and one to handle historical.
> Another
> > >>> option is to use a single Solr Cloud cluster, but use multiple
> > >>> cores/collections.  Either way you'd need a job to come through and
> > clean
> > >>> up old data. The historical cluster would have much worse
> performance,
> > >>> particularly for clustering and faceting the data, but that may be
> > >>> acceptable.
> > >>> I don't know what you mean by 'entity recognition in the queries' -
> > could
> > >>> you elaborate?
> > >>&g

Re: Solr Json API How to escape space in search string

2016-03-07 Thread Jack Krupansky
Backslash in JSON just tells JSON to escape the next character, while what
you really want is to pass a backslash through to the Solr query parser,
which you can do with a double backslash.

Alternatively, you could use quotes around the string in Solr, which would
require you to escape the quotes in JSON, with a single backslash.
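
A tiny sketch that just prints the two possible request bodies (using the field
and value from the question below), so the extra layer of escaping is visible;
either string would be sent as the json parameter of the request:

    public class JsonEscapeExample {
      public static void main(String[] args) {
        // Doubled backslash in JSON, so the Solr query parser receives: string_s:new\ value
        String escapedSpace = "{\"query\":\"string_s:new\\\\ value\"}";
        // Quoted value for Solr; only the inner quotes need JSON escaping
        String quotedValue  = "{\"query\":\"string_s:\\\"new value\\\"\"}";

        System.out.println(escapedSpace); // prints {"query":"string_s:new\\ value"}
        System.out.println(quotedValue);  // prints {"query":"string_s:\"new value\""}
      }
    }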

-- Jack Krupansky

On Mon, Mar 7, 2016 at 5:49 PM, Iana Bondarska <yana2...@gmail.com> wrote:

> Hi All,
> could you please tell me if escaping special characters in search keywords
> works in json api.
> e.g. I have document
>  {
> "string_s":"new value"
> }
> And I want to query "string_s" field with keyword "new value".
> In path params api I can escape spaces in keyword as well as other special
> characters with \ .
> following query finds document:
> http://localhost:8983/solr/dynamic_fields_qa/select?q=string_s:new\ value&wt=json&indent=true
> But if I try to run same search using json api, nothing is found:
>
> http://localhost:8983/solr/dynamic_fields_qa/select?q=*:*&json={"query":"string_s:new\ value"}
>
> Best Regards,
> Iana Bondarska
>


Re: Custom field using PatternCaptureGroupFilterFactory

2016-03-07 Thread Jack Krupansky
Great. And you shouldn't need the "{1}" - the square brackets match a
single character by definition.
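
A quick way to see that outside Solr, with plain java.util.regex (the sample
inputs are arbitrary):

    import java.util.regex.Pattern;

    public class CaptureGroupCheck {
      public static void main(String[] args) {
        // The character class already matches exactly one character,
        // so the "{1}" quantifier adds nothing.
        System.out.println(Pattern.matches("[a-zA-Z0-9]", "a"));    // true
        System.out.println(Pattern.matches("[a-zA-Z0-9]{1}", "a")); // true
        System.out.println(Pattern.matches("[a-zA-Z0-9]", "ab"));   // false
      }
    }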

-- Jack Krupansky

On Mon, Mar 7, 2016 at 12:20 PM, Jay Potharaju <jspothar...@gmail.com>
wrote:

> Thanks Jack, the problem was my regex. Following regex worked.
>  "([a-zA-Z0-9]{1})" preserve_original="false"/>
> Jay
>
> On Sun, Mar 6, 2016 at 7:43 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > The filter name, "Capture Group", says it all - only pattern groups are
> > captured and you have not specified even a single group. See the example:
> >
> >
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.html
> >
> > Groups are each enclosed within parentheses, as shown in the Javadoc
> > example above.
> >
> > Since no groups were found, the filter doc applied this rule:
> > "If none of the patterns match, or if preserveOriginal is true, the
> > original token will be preserved."
> >
> >
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
> >
> > That should probably also say "or if no pattern groups match".
> >
> > To test regular expressions, try an interactive online tool, such as:
> > https://regex101.com/
> >
> > -- Jack Krupansky
> >
> > On Sun, Mar 6, 2016 at 7:51 PM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> > > I don't see the brackets that mark the group you actually want to
> > > capture. As per:
> > >
> > >
> >
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
> > >
> > > I am also not sure if you actually need "{0,1}" part.
> > >
> > > Regards,
> > >Alex.
> > > 
> > > Newsletter and resources for Solr beginners and intermediates:
> > > http://www.solr-start.com/
> > >
> > >
> > > On 7 March 2016 at 04:25, Jay Potharaju <jspothar...@gmail.com> wrote:
> > > > Hi,
> > > > I have a custom field for getting the first letter of an firstname.
> For
> > > > this I am using PatternCaptureGroupFilterFactory.
> > > > This is not working as expected, not able to parse the data and get
> the
> > > > first character for the string. Any suggestions on how to fix this?
> > > >
> > > >  
> > > >
> > > >   
> > > >
> > > > 
> > > >
> > > > 
> > > >
> > > >  pattern=
> > > > "^[a-zA-Z0-9]{0,1}" preserve_original="false"/>
> > > >
> > > >
> > > >
> > > > 
> > > >
> > > > --
> > > > Thanks
> > > > Jay
> > >
> >
>
>
>
> --
> Thanks
> Jay Potharaju
>


Re: Text search NGram

2016-03-07 Thread Jack Krupansky
Absolutely, but so what? Nothing in any Solr query is going to be based on
character position.

Also, adding and removing characters in a char filter is a really bad idea
if you might want to do highlighting since the token character position
would not line up with the original source text.

-- Jack Krupansky

On Mon, Mar 7, 2016 at 10:33 AM, G, Rajesh <r...@cebglobal.com> wrote:

> Hi Jack,
>
>
>
> Please correct me if I am wrong; I added the char filter because:
>
>
>
> In the Analysis screen [Solr UI] I provided "Microsoft office" in Field Value
> (Index), and WhitespaceTokenizerFactory produces the result below: "office"
> starts at 10. If I leave additional space, say 2 more spaces, "office" starts
> at 12. Should it not start at 10?
>
>
>
> text      | raw_bytes                    | start | end | positionLength | type | position
> microsoft | [6d 69 63 72 6f 73 6f 66 74] | 0     | 9   | 1              | word | 1
> office    | [6f 66 66 69 63 65]          | 10    | 16  | 1              | word | 2
>
> With the two extra spaces:
>
> text      | raw_bytes                    | start | end | positionLength | type | position
> microsoft | [6d 69 63 72 6f 73 6f 66 74] | 0     | 9   | 1              | word | 1
> office    | [6f 66 66 69 63 65]          | 12    | 18  | 1              | word | 2
>
>
>
>
>
>
> Thanks
>
> Rajesh
>
>
>
>
>
> Corporate Executive Board India Private Limited. Registration No:
> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building
> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India..
>
>
>
> This e-mail and/or its attachments are intended only for the use of the
> addressee(s) and may contain confidential and legally privileged
> information belonging to CEB and/or its subsidiaries, including CEB
> subsidiaries that offer SHL Talent Measurement products and services. If
> you have received this e-mail in error, please notify the sender and
> immediately, destroy all copies of this email and its attachments. The
> publication, copying, in whole or in part, or use or dissemination in any
> other way of this e-mail and attachments by anyone other than the intended
> person(s) is prohibited.
>
>
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Monday, March 7, 2016 8:24 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Text search NGram
>
>
>
> The charFilter isn't doing anything useful - the whitespace tokenizer
> will ignore extra white space anyway.
>
>
>
> -- Jack Krupansky
>
>
>
> On Mon, Mar 7, 2016 at 5:44 AM, G, Rajesh <r...@cebglobal.com r...@cebglobal.com>> wrote:
>
>
>
> > Hi Team,
>
> >
>
> > We have the blow type and we have indexed the value  "title":
>
> > "Microsoft Visual Studio 2006" and "title": "Microsoft Visual Studio
>
> > 8.0.61205.56 (2005)"
>
> >
>
> > When I search for title:(Microsoft Visual AND Studio AND 2005)  I get
>
> > Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
>
> > Microsoft Visual Studio 2006 as first record. I wanted to have
>
> > Microsoft Visual Studio 8.0.61205.56 (2005) listed first since the
>
> > user has searched for Microsoft Visual Studio 2005. Can you please help?.
>
> >
>
> > We are using NGram so it takes care of misspelled or jumbled words[it
>
> > works as expected] e.g.
>
> > searching Micrs Visual Studio will gets Microsoft Visual Studio
>
> > searching Visual Microsoft Studio will gets Microsoft Visual Studio
>
> >
>
> >   
> > positionIncrementGap="0" >
>
> > 
>
> > 
> > class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement="
> "/>
>
> > 
> > class="solr.WhitespaceTokenizerFactory"/>
>
> > 
> > class="solr.LowerCaseFilterFactory"/>
>
> > 
> > minGramSize="2" maxGramSize="800"/>
>
> > 
>
> >  
>
> > 
&

Re: Text search NGram

2016-03-07 Thread Jack Krupansky
The charFilter isn't doing anything useful - the whitespace tokenizer will
ignore extra white space anyway.

-- Jack Krupansky

On Mon, Mar 7, 2016 at 5:44 AM, G, Rajesh <r...@cebglobal.com> wrote:

> Hi Team,
>
> We have the below type and we have indexed the value  "title": "Microsoft
> Visual Studio 2006" and "title": "Microsoft Visual Studio 8.0.61205.56
> (2005)"
>
> When I search for title:(Microsoft Visual AND Studio AND 2005)  I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as first record. I wanted to have Microsoft
> Visual Studio 8.0.61205.56 (2005) listed first since the user has searched
> for Microsoft Visual Studio 2005. Can you please help?.
>
> We are using NGram so it takes care of misspelled or jumbled words [it
> works as expected],
> e.g.
> searching Micrs Visual Studio will get Microsoft Visual Studio
> searching Visual Microsoft Studio will get Microsoft Visual Studio
>
>   <fieldType name="..." class="solr.TextField" positionIncrementGap="0">
>     <analyzer type="index">
>       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="..." minGramSize="2" maxGramSize="800"/>
>     </analyzer>
>     <analyzer type="query">
>       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="..." minGramSize="2" maxGramSize="800"/>
>     </analyzer>
>   </fieldType>
>
>
>
> Corporate Executive Board India Private Limited. Registration No:
> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building
> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India..
>
>
>
> This e-mail and/or its attachments are intended only for the use of the
> addressee(s) and may contain confidential and legally privileged
> information belonging to CEB and/or its subsidiaries, including CEB
> subsidiaries that offer SHL Talent Measurement products and services. If
> you have received this e-mail in error, please notify the sender and
> immediately, destroy all copies of this email and its attachments. The
> publication, copying, in whole or in part, or use or dissemination in any
> other way of this e-mail and attachments by anyone other than the intended
> person(s) is prohibited.
>
>
>


Re: Custom field using PatternCaptureGroupFilterFactory

2016-03-06 Thread Jack Krupansky
The filter name, "Capture Group", says it all - only pattern groups are
captured and you have not specified even a single group. See the example:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.html

Groups are each enclosed within parentheses, as shown in the Javadoc
example above.

Since no groups were found, the filter doc applied this rule:
"If none of the patterns match, or if preserveOriginal is true, the
original token will be preserved."
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html

That should probably also say "or if no pattern groups match".

To test regular expressions, try an interactive online tool, such as:
https://regex101.com/

-- Jack Krupansky

On Sun, Mar 6, 2016 at 7:51 PM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> I don't see the brackets that mark the group you actually want to
> capture. As per:
>
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
>
> I am also not sure if you actually need "{0,1}" part.
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 7 March 2016 at 04:25, Jay Potharaju <jspothar...@gmail.com> wrote:
> > Hi,
> > I have a custom field for getting the first letter of an firstname. For
> > this I am using PatternCaptureGroupFilterFactory.
> > This is not working as expected, not able to parse the data and get the
> > first character for the string. Any suggestions on how to fix this?
> >
> >  
> >
> >   
> >
> > 
> >
> > 
> >
> >  > "^[a-zA-Z0-9]{0,1}" preserve_original="false"/>
> >
> >
> >
> > 
> >
> > --
> > Thanks
> > Jay
>


Re: Solr Deserialize/Read .fdt file

2016-03-06 Thread Jack Krupansky
Solr itself doesn't directly access index files - that is the
responsibility of Lucene. That's why you see "lucene" in the class names,
not "solr".

To be clear, no Solr user will ever have to read or deserialize a .fdt
file. Or any Lucene index file for that matter.

If you actually do want to work at the Lucene level (which no one here will
recommend), start with the Lucene doc:

https://lucene.apache.org/core/documentation.html
https://lucene.apache.org/core/5_5_0/index.html

For File Formats:
https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/codecs/lucene54/package-summary.html#package_description

After that you will need to become much more familiar with the Lucene (not
Solr) source code.
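
If the goal is just to see what ended up in the stored-fields files (.fdt/.fdx),
the practical route is the Lucene API rather than decoding the files by hand.
A rough sketch against a copy of the index (the path is a placeholder, and this
simple loop does not skip deleted documents):

    import java.nio.file.Paths;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class DumpStoredFields {
      public static void main(String[] args) throws Exception {
        // Point at a (copy of a) core's data/index directory - placeholder path.
        Directory dir = FSDirectory.open(Paths.get("/path/to/core/data/index"));
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          int n = Math.min(10, reader.maxDoc());
          for (int i = 0; i < n; i++) {
            Document doc = reader.document(i); // stored fields are read from .fdt/.fdx
            System.out.println(doc);
          }
        }
        dir.close();
      }
    }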

If you want to trace through the code from Solr through Lucene, I suggest
you start with Solr unit tests in Eclipse.

But none of that will be an appropriate topic for users on this (Solr) list.



-- Jack Krupansky

On Sun, Mar 6, 2016 at 3:34 PM, Bin Wang <binwang...@gmail.com> wrote:

> Hi there, I am interested in understanding all the files in the index
> folder.
>
> here <http://stackoverflow.com/questions/35830426/solr-read-index-files>
> is
> a stackoverflow question that I have tried however failed.
>
> Can anyone provide some sample code to help me get started.
>
> Best regards,
> Bin
>


Re: Indexing Twitter - Hypothetical

2016-03-06 Thread Jack Krupansky
Back to the original question... there are two answers:

1. Yes - for guru-level Solr experts.
2. No - for anybody else.

For starters, (as always), you would need to do a lot more upfront work on
mapping out the forms of query which will be supported. For example, is
your focus on precision or recall. And, are you looking to analyze all
matching tweets or just a sample. And, the load, throughput, and latency
requirements. And, any spatial search requirements. And, any entity search
requirements. Without a clear view of the query requirements it simply
isn't possible to even begin defining a data model. And without a data
model, indexing is a fool's errand. In short, no focus, no progress.

-- Jack Krupansky

On Sun, Mar 6, 2016 at 7:42 AM, Susheel Kumar <susheel2...@gmail.com> wrote:

> Entity Recognition means you may want to recognize different entities
> name/person, email, location/city/state/country etc. in your
> tweets/messages with goal of  providing better relevant results to users.
> NER can be used at query or indexing (data enrichment) time.
>
> Thanks,
> Susheel
>
> On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
> > Thank you all very much for all the responses so far.  I've enjoyed
> reading
> > them!  We have noticed that storing data inside of Solr results in
> > significantly worse performance (particularly faceting); so we store the
> > values of all the fields elsewhere, but index all the data with Solr
> > Cloud.  I think the suggestion about splitting the data up into blocks of
> > date/time is where we would be headed.  Having two Solr-Cloud clusters -
> > one to handle ~30 days of data, and one to handle historical.  Another
> > option is to use a single Solr Cloud cluster, but use multiple
> > cores/collections.  Either way you'd need a job to come through and clean
> > up old data. The historical cluster would have much worse performance,
> > particularly for clustering and faceting the data, but that may be
> > acceptable.
> > I don't know what you mean by 'entity recognition in the queries' - could
> > you elaborate?
> >
> > We would want to index and potentially facet on any of the fields - for
> > example entities_media_url, username, even background color, but we do
> not
> > know a-priori what fields will be important to users.
> > As to why we would want to make the data searchable; well - I don't make
> > the rules!  Tweets is not the only data source, but it's certainly the
> > largest that we are currently looking at handling.
> >
> > I will read up on the Berlin Buzzwords - thank you for the info!
> >
> > -Joe
> >
> >
> >
> > On Fri, Mar 4, 2016 at 9:59 AM, Jack Krupansky <jack.krupan...@gmail.com
> >
> > wrote:
> >
> > > As always, the initial question is how you intend to query the data -
> > query
> > > drives data modeling. How real-time do you need queries to be? How fast
> > do
> > > you need archive queries to be? How many fields do you need to query
> on?
> > > How much entity recognition do you need in queries?
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull <char...@flax.co.uk>
> wrote:
> > >
> > > > On 03/03/2016 19:25, Toke Eskildsen wrote:
> > > >
> > > >> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
> > > >>
> > > >>> Hi All - would it be reasonable to index the Twitter 'firehose'
> > > >>> with Solr Cloud - roughly 500-600 million docs per day indexing
> > > >>> each of the fields (about 180)?
> > > >>>
> > > >>
> > > >> Possible, yes. Reasonable? It is not going to be cheap.
> > > >>
> > > >> Twitter index the tweets themselves and have been quite open about
> > > >> how they do it. I would suggest looking for their presentations;
> > > >> slides or recordings. They have presented at Berlin Buzzwords and
> > > >> Lucene/Solr Revolution and probably elsewhere too. The gist is that
> > > >> they have done a lot of work and custom coding to handle it.
> > > >>
> > > >
> > > > As I recall they're not using Solr, but rather an in-house layer
> built
> > on
> > > > a customised version of Lucene. They're indexing around half a
> trillion
> > > > tweets.
> > > >
> > > > If the idea is to provide a searchable archive of all tweets, my
> first
> > 

Re: How to use geospatial search to find the locations within polygon

2016-03-05 Thread Jack Krupansky
The doc does indeed say "JTS... It's a JAR file that you need to put on
Solr's classpath (but not via the standard solrconfig.xml mechanisms)", but
that is a little vague and nonspecific. It should probably be a labeled
section in the doc, like "Configuring JTS for Polygon Search", and have the
spatialContextFactory property (called a "setting" for some reason there
although elsewhere in the Solr doc XML attributes are referred to as
properties) point to that section. The "old" wiki has some more info, but
whether that is sufficient to fully configure JTS is unknown to me.

-- Jack Krupansky

On Sat, Mar 5, 2016 at 11:12 AM, david.w.smi...@gmail.com <
david.w.smi...@gmail.com> wrote:

> A Java NoClassDefFoundError of something in com.vividsolutions.jts means
> you don't have JTS on your classpath.  You should put the JTS jar file in
> server/lib/.  You can download it from maven-central.  Here's a search for
> JTS with the 1.14 version:
>
> http://search.maven.org/#artifactdetails%7Ccom.vividsolutions%7Cjts-core%7C1.14.0%7Cjar
>
> p.s. Nabble.com seems increasingly glitchy. I attempted to reply earlier
> but Nabble returned a failure.
>
> On Sat, Mar 5, 2016 at 1:39 AM Pradeep Chandra [via Lucene] <
> ml-node+s472066n4261824...@n3.nabble.com> wrote:
>
> > Thank u for your reply sirNow, I gave the ending point as starting
> > point to close the polygon 
> >
> > It is showing this error:
> >
> > {"error":{"msg":"java.lang.NoClassDefFoundError:
> > com/vividsolutions/jts/geom/Lineal","trace":"java.lang.RuntimeException:
> > java.lang.NoClassDefFoundError: com/vividsolutions/jts/geom/Lineal\n\tat
> >
> org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:618)\n\tat
> > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:477)\n\tat
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)\n\tat
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)\n\tat
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)\n\tat
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)\n\tat
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)\n\tat
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)\n\tat
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)\n\tat
> >
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)\n\tat
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)\n\tat
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)\n\tat
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)\n\tat
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)\n\tat
> > org.eclipse.jetty.server.Server.handle(Server.java:499)\n\tat
> > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)\n\tat
> >
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)\n\tat
> >
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)\n\tat
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)\n\tat
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)\n\tat
> > java.lang.Thread.run(Thread.java:745)\nCaused by:
> > java.lang.NoClassDefFoundError: com/vividsolutions/jts/geom/Lineal\n\tat
> >
> com.spatial4j.core.shape.jts.JtsGeometry.(JtsGeometry.java:104)\n\tat
> >
> com.spatial4j.core.context.jts.JtsSpatialContext.makeShape(JtsSpatialContext.java:203)\n\tat
> >
> com.spatial4j.core.io.jts.JtsWktShapeParser.makeShapeFromGeometry(JtsWktShapeParser.java:252)\n\tat
> >
> com.spatial4j.core.io.jts.JtsWktShapeParser.parsePolygonShape(JtsWktShapeParser.java:133)\n\tat
> >
> com.spatial4j.core.io.jts.JtsWktShapeParser.parseShapeByType(JtsWktShapeParser.java:89)\n\tat
> >
> com.spatial4j.core.io.WktShapeParser.parseIfSupported(WktShapeParser.java:114)\n\tat
> > com.spatial4j.core.io.WktShapeParser.parse(WktShapeParser.java:86)\n\tat
> >
> com.spatial4j.core.context.SpatialContext.readS

Re: How to use geospatial search to find the locations within polygon

2016-03-04 Thread Jack Krupansky
It would be nice for the doc to say that - describe when IsWithin is and
isn't appropriate. And give some examples as well for people to copy/mimic.
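
For instance, a copy/paste-able version of the query from this thread could look
like the SolrJ sketch below (collection and field names are taken from the
question further down; per David's notes, Intersects instead of IsWithin, and
the ring closed by repeating the first vertex as the last one):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class PolygonFilterExample {
      public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/loopback");

        SolrQuery q = new SolrQuery("*:*");
        q.setFields("City");
        // Intersects is the usual point-in-polygon predicate; the ring is closed.
        q.addFilterQuery("locm_place:\"Intersects(POLYGON(("
            + "16.762467717941604 78.94775390625, "
            + "16.99375545289456 78.11279296875, "
            + "17.31917640744285 77.98095703125, "
            + "17.80099604766698 78.72802734375, "
            + "16.762467717941604 78.94775390625)))\"");

        System.out.println(client.query(q).getResults());
        client.close();
      }
    }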

-- Jack Krupansky

On Fri, Mar 4, 2016 at 10:20 AM, david.w.smi...@gmail.com <
david.w.smi...@gmail.com> wrote:

> First of all, assuming this is a standard point-in-polygon situation, use
> the Intersects predicate -- with point data it's semantically the same as
> IsWithin and Intersects is much faster.  I don't know why you used
> isDisjointTo in your 2nd example; maybe you want to find when they don't
> touch?  Any way, one problem right away I saw is that the first point in
> the polygon is not repeated in the last.  That's what the WKT spec demands.
>
>
> On Fri, Mar 4, 2016 at 1:37 AM Pradeepchandra Mulpuru <
> prade...@infologitech.in> wrote:
>
> > Hi Sir,
> >
> > I have a question on Apache Solr Spatial search. I have a json type data
> > of City, Latitude & Longitude. I indexed those fields with locm_place of
> > the type location_rpt. Now I want to give a polygon as a filter query in
> > order to get the City names located in that polygon. I don't have any
> idea
> > of doing that.
> >
> > I tried with this:
> >
> >
> >
> http://localhost:8983/solr/loopback/select?fl=City=json=*:*=locm_place
> :"IsWithin(POLYGON((16.762467717941604
> > 78.94775390625,16.99375545289456 78.11279296875%20,17.31917640744285
> > 77.98095703125,17.80099604766698 78.72802734375))) distErrPct=0"
> >
> > It is showing the result like:
> >
> >
> {"responseHeader":{"status":400,"QTime":4,"params":{"fl":"City","q":"*:*","wt":"json","fq":"locm_place:\"IsWithin(POLYGON((16.762467717941604
> 78.94775390625, 16.99375545289456 78.11279296875 , 17.31917640744285
> 77.98095703125 , 17.80099604766698 78.72802734375)))
> distErrPct=0\""}},"error":{"msg":"Couldn't parse shape
> 'POLYGON((16.762467717941604 78.94775390625, 16.99375545289456
> 78.11279296875 , 17.31917640744285 77.98095703125 , 17.80099604766698
> 78.72802734375))' because: Unknown Shape definition
> [POLYGON((16.762467717941604 78.94775390625, 16.99375545289456
> 78.11279296875 , 17.31917640744285 77.98095703125 ,
> 17.80099604...]","code":400}}
> >
> >
> > I tried with this:
> >
> >
> http://localhost:8983/solr/loopback/select?fl=City=json=*:*=geo:%22IsDisjointTo(POLYGON((16.762467717941604%2078.94775390625,%2016.99375545289456%2078.11279296875,17.31917640744285%2077.98095703125,17.80099604766698%2078.72802734375)))%22
> >
> > It is showing the result like:
> >
> >
> >
> {"responseHeader":{"status":400,"QTime":21,"params":{"fl":"City","q":"*:*","wt":"json","fq":"geo:\"IsDisjointTo(POLYGON((16.762467717941604
> 78.94775390625, 16.99375545289456 78.11279296875,17.31917640744285
> 77.98095703125,17.80099604766698
> 78.72802734375)))\""}},"error":{"msg":"Couldn't parse shape
> 'POLYGON((16.762467717941604 78.94775390625, 16.99375545289456
> 78.11279296875,17.31917640744285 77.98095703125,17.80099604766698
> 78.72802734375))' because: java.lang.IllegalArgumentException: points must
> form a closed linestring","code":400}}
> >
> >
> > Kindly tell me what I have to change/configure. I am attaching the json
> file,schema.xml and a screenshot of Solr admin total result query.
> >
> >
> > Thanks and regards,
> >
> > M Pradeep Chandra
> >
> > --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>


Re: Indexing Twitter - Hypothetical

2016-03-04 Thread Jack Krupansky
As always, the initial question is how you intend to query the data - query
drives data modeling. How real-time do you need queries to be? How fast do
you need archive queries to be? How many fields do you need to query on?
How much entity recognition do you need in queries?


-- Jack Krupansky

On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull <char...@flax.co.uk> wrote:

> On 03/03/2016 19:25, Toke Eskildsen wrote:
>
>> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
>>
>>> Hi All - would it be reasonable to index the Twitter 'firehose'
>>> with Solr Cloud - roughly 500-600 million docs per day indexing
>>> each of the fields (about 180)?
>>>
>>
>> Possible, yes. Reasonable? It is not going to be cheap.
>>
>> Twitter index the tweets themselves and have been quite open about
>> how they do it. I would suggest looking for their presentations;
>> slides or recordings. They have presented at Berlin Buzzwords and
>> Lucene/Solr Revolution and probably elsewhere too. The gist is that
>> they have done a lot of work and custom coding to handle it.
>>
>
> As I recall they're not using Solr, but rather an in-house layer built on
> a customised version of Lucene. They're indexing around half a trillion
> tweets.
>
> If the idea is to provide a searchable archive of all tweets, my first
> question would be 'why': if the idea is to monitor new tweets for
> particular patterns there are better ways to do this (Luwak for example).
>
> Charlie
>
>
>> If I were to guess at a sharded setup to handle such data, and keep
>>> 2 years worth, I would guess about 2500 shards.  Is that
>>> reasonable?
>>>
>>
>> I think you need to think well beyond standard SolrCloud setups. Even
>> if you manage to get 2500 shards running, you will want to do a lot
>> of tweaking on the way to issue queries so that each request does not
>> require all 2500 shards to be searched. Prioritizing newer material
>> and only query the older shards if there is not enough resent results
>> is an example.
>>
>> I highly doubt that a single SolrCloud is the best answer here. Maybe
>> one cloud for each month and a lot of external logic?
>>
>> - Toke Eskildsen
>>
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>


Re: Indexing Twitter - Hypothetical

2016-03-03 Thread Jack Krupansky
As always, the initial question always needs to be how you wish to query
the data - query will drive the data model. I don't  want to put words in
your mouth as to your query requirements, so... clue us in on your query
requirements.

-- Jack Krupansky

On Thu, Mar 3, 2016 at 2:25 PM, Toke Eskildsen <t...@statsbiblioteket.dk>
wrote:

> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
> > Hi All - would it be reasonable to index the Twitter 'firehose' with Solr
> > Cloud - roughly 500-600 million docs per day indexing each of the fields
> > (about 180)?
>
> Possible, yes. Reasonable? It is not going to be cheap.
>
> Twitter index the tweets themselves and have been quite open about how
> they do it. I would suggest looking for their presentations; slides or
> recordings. They have presented at Berlin Buzzwords and Lucene/Solr
> Revolution and probably elsewhere too. The gist is that they have done a
> lot of work and custom coding to handle it.
>
> > If I were to guess at a sharded setup to handle such data, and keep 2
> years
> > worth, I would guess about 2500 shards.  Is that reasonable?
>
> I think you need to think well beyond standard SolrCloud setups. Even if
> you manage to get 2500 shards running, you will want to do a lot of
> tweaking on the way to issue queries so that each request does not require
> all 2500 shards to be searched. Prioritizing newer material and only query
> the older shards if there is not enough resent results is an example.
>
> I highly doubt that a single SolrCloud is the best answer here. Maybe one
> cloud for each month and a lot of external logic?
>
> - Toke Eskildsen
>


Re: What is the best way to index 15 million documents of total size 425 GB?

2016-03-03 Thread Jack Krupansky
What does a typical document look like - number of columns, data type,
size? How much is text vs. numeric? Are there any large blobs? I mean, 15M
docs in 425GB indicates about 28K per row/document which seems rather large.

Is the PG data VARCHAR(n) or CHAR(n)? IOW, might it have lots of trailing
blanks for text columns?

As always, the very first question in Solr data modeling is always how do
you intend to query the data - queries will determine the data model.

Ultimately, the issue will not be how long it takes to index, but query
latency and query throughput.

30GB sounds way too small for a 425GB index in terms of odds of low query
latency.

-- Jack Krupansky

On Thu, Mar 3, 2016 at 12:54 PM, Aneesh Mon N <aneeshm...@gmail.com> wrote:

> Hi,
>
> We are facing a huge performance issue while indexing the data to Solr, we
> have around 15 million records in a PostgreSql database which has to be
> indexed to Solr 5.3.1 server.
> It takes around 16 hours to complete the indexing as of now.
>
> To be noted that all the fields are stored so as to support the atomic
> updates.
>
> Current approach:
> We use a ETL tool(Pentaho) to fetch the data from database in chunks of
> 1000 records, convert them into xml format and pushes to Solr. This is run
> in 10 parallel threads.
>
> System params
> Solr Version: 5.3.1
> Size on disk: 425 GB
>
> Database, ETL machine and SOLR are of 16 core and 30 GB RAM
> Database and SOLR Disk: RAID
>
> Any pointers best approaches to index these kind of data would be helpful.
>
> --
> Regards,
> Aneesh Mon N
> Chennai
> +91-8197-188-588
>


Re: FW: Difference Between Tokenizer and filter

2016-03-03 Thread Jack Krupansky
Try re-reading the doc on "Understanding Analyzers, Tokenizers, and
Filters" and then ask specific questions on specific statements made in the
doc:
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters

As far as on-disk format, a Solr user has absolutely zero reason to be
concerned about what format Lucene uses to store the index on disk. You are
certainly welcome to dive down to that level if you wish, but that is not
something worth discussing on this list. To a Solr user the index is simply
a list of terms at positions, both determined by the character filters,
tokenizer, and token filters of the analyzer. The format of that
information as stored in Lucene won't impact the behavior of your Solr app
in any way.

Again, to be clear, you need to be thoroughly familiar with that doc
section. It won't help you to try to guess questions to ask if you don't
have a basic understanding of what is stated on that doc page.

It might also help you visualize what the doc says by using the analysis
page of the Solr admin UI which will give you all the intermediate and
final results of the analysis process, the specific token/term text and
position at each step. But even that won't help if you are unable to grasp
what is stated on the basic doc page.

-- Jack Krupansky

On Thu, Mar 3, 2016 at 8:51 AM, G, Rajesh <r...@cebglobal.com> wrote:

> Hi Shawn,
>
> One last question on analyzer. If the format of the index on disk is not
> controlled by the tokenizer, or anything else in the analysis chain, then
> what does type="index" and type="query" in analyzer mean. Can you please
> help me in understanding?
>
> 
>
>  
>  
>
>  
>
>
>
> Corporate Executive Board India Private Limited. Registration No:
> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building
> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.
>
> This e-mail and/or its attachments are intended only for the use of the
> addressee(s) and may contain confidential and legally privileged
> information belonging to CEB and/or its subsidiaries, including CEB
> subsidiaries that offer SHL Talent Measurement products and services. If
> you have received this e-mail in error, please notify the sender and
> immediately, destroy all copies of this email and its attachments. The
> publication, copying, in whole or in part, or use or dissemination in any
> other way of this e-mail and attachments by anyone other than the intended
> person(s) is prohibited.
>
> -Original Message-
> From: G, Rajesh
> Sent: Thursday, March 3, 2016 6:12 PM
> To: 'solr-user@lucene.apache.org' <solr-user@lucene.apache.org>
> Subject: RE: FW: Difference Between Tokenizer and filter
>
> Thanks Shawn. This helps
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Wednesday, March 2, 2016 11:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: FW: Difference Between Tokenizer and filter
>
> On 3/2/2016 9:55 AM, G, Rajesh wrote:
> > Thanks for your email Koji. Can you please explain what is the role of
> tokenizer and filter so I can understand why I should not have two
> tokenizer in index and I should have at least one tokenizer in query?
>
> You can't have two tokenizers.  It's not allowed.
>
> The only notable difference between a Tokenizer and a Filter is that a
> Tokenizer operates on an input that's a single string, turning it into a
> token stream, and a Filter uses a token stream for both input and output.
> A CharFilter uses a single string as both input and output.
>
> An analysis chain in the Solr schema (whether it's index or query) is
> composed of zero or more CharFilter entries, exactly one Tokenizer entry,
> and zero or more Filter entries.  Alternately, you can specify an Analyzer
> class, which is a lot like a Tokenizer.  An Analyzer is effectively the
> same thing as a tokenizer combined with filters.
>
> CharFilters run before the Tokenizer, and Filters run after the
> Tokenizer.  CharFilters, Tokenizers, Filters, and Analyzers are Lucene
> concepts.
>
> > My understanding is tokenizer is used to say how the content should be
> > indexed physically in file system. Filters are used to query result
>
> The format of the index on disk is not controlled by the tokenizer, or
> anything else in the analysis chain.  It is controlled by the Lucene
> codec.  Only a very small part of the codec is configurable in Solr, but
> normally this does not need configuring.  The codec defaults are
> appropriate for the majority of use cases.
>
> Thanks,
> Shawn
>
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
The chapter seems like the optimal unit for initial searches - just combine
the page text with a line break between them or index as a multivalued
field and set the position increment gap to be 1 so that phrases work.
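
A small SolrJ sketch of that layout (field names are placeholders; the position
increment gap mentioned above would be set on the multivalued page field's type
in the schema):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexChapterExample {
      public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/books");

        // One Solr document per chapter; each printed page is one value of a
        // multivalued text field.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "book1-ch3");
        doc.addField("book_title", "Example Book");
        doc.addField("chapter_title", "Chapter 3");
        doc.addField("page_text", "...text of page 27...");
        doc.addField("page_text", "...text of page 28...");

        client.add(doc);
        client.commit();
        client.close();
      }
    }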

You could have a separate collection for pages, with each page as a Solr
document, but include the last line of text from the previous page and the
first line of text from the next page so that phrases will match across
page boundaries. Unfortunately, that may also result in false hits if the
full phrase is found on the two adopted lines. That would require some
special filtering to eliminate those false positives.

There is also the question of maximum phrase size - most phrases tend to be
reasonably short, but sometimes people may want to search for an entire
paragraph (e.g., a quote) that may span multiple lines on two adjacent
pages.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 11:30 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi,
> From the top of my head - probably does not solve problem completely, but
> may trigger brainstorming: Index chapters and include page break tokens.
> Use highlighting to return matches and make sure fragment size is large
> enough to get page break token. In such scenario you should use slop for
> phrase searches...
>
> More I write it, less I like it, but will not delete...
>
> Regards,
> Emir
>
>
> On 01.03.2016 12:56, Zaccheo Bagnati wrote:
>
>> Hi all,
>> I'm searching for ideas on how to define schema and how to perform queries
>> in this use case: we have to index books, each book is split into chapters
>> and chapters are split into pages (pages represent original page cutting
>> in
>> printed version). We should show the result grouped by books and chapters
>> (for the same book) and pages (for the same chapter). As far as I know, we
>> have 2 options:
>>
>> 1. index pages as SOLR documents. In this way we could theoretically
>> retrieve chapters (and books?)  using grouping but
>>  a. we will miss matches across two contiguous pages (page cutting is
>> only due to typographical needs so concepts could be split... as in
>> printed
>> books)
>>  b. I don't know if it is possible in SOLR to group results on two
>> different levels (books and chapters)
>>
>> 2. index chapters as SOLR documents. In this case we will have the right
>> matches but how to obtain the matching pages? (we need pages because the
>> client can only display pages)
>>
>> we have been struggling on this problem for a lot of time and we're  not
>> able to find a suitable solution so I'm looking if someone has ideas or
>> has
>> already solved a similar issue.
>> Thanks
>>
>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
Any reason not to use the simplest structure - each page is one Solr
document with a book field, a chapter field, and a page text field? You can
then use grouping to group results by book (title text) or even chapter
(title text and/or number). Maybe initially group by book and then if the
user selects a book group you can re-query with the specific book and then
group by chapter.
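For example (the collection, field, and query values are purely illustrative),
the initial query could be roughly:

/solr/books/select?q=page_text:"steam engine"&group=true&group.field=book_id&group.limit=3

and the drill-down after the user picks a book:

/solr/books/select?q=page_text:"steam engine"&fq=book_id:42&group=true&group.field=chapter_id

group.field has to be a single-valued indexed field, so book_id and chapter_id
here stand in for whatever identifiers the schema actually uses.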


-- Jack Krupansky

On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati <zacch...@gmail.com> wrote:

> Original data is quite well structured: it comes in XML with chapters and
> tags to mark the original page breaks on the paper version. In this way we
> have the possibility to restructure it almost as we want before creating
> SOLR index.
>
> Il giorno mar 1 mar 2016 alle ore 14:04 Jack Krupansky <
> jack.krupan...@gmail.com> ha scritto:
>
> > To start, what is the form of your input data - is it already divided
> into
> > chapters and pages? Or... are you starting with raw PDF files?
> >
> >
> > -- Jack Krupansky
> >
> > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati <zacch...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > > I'm searching for ideas on how to define schema and how to perform
> > queries
> > > in this use case: we have to index books, each book is split into
> > chapters
> > > and chapters are split into pages (pages represent original page
> cutting
> > in
> > > printed version). We should show the result grouped by books and
> chapters
> > > (for the same book) and pages (for the same chapter). As far as I know,
> > we
> > > have 2 options:
> > >
> > > 1. index pages as SOLR documents. In this way we could theoretically
> > > retrieve chapters (and books?)  using grouping but
> > > a. we will miss matches across two contiguous pages (page cutting
> is
> > > only due to typographical needs so concepts could be split... as in
> > printed
> > > books)
> > > b. I don't know if it is possible in SOLR to group results on two
> > > different levels (books and chapters)
> > >
> > > 2. index chapters as SOLR documents. In this case we will have the
> right
> > > matches but how to obtain the matching pages? (we need pages because
> the
> > > client can only display pages)
> > >
> > > we have been struggling on this problem for a lot of time and we're
> not
> > > able to find a suitable solution so I'm looking if someone has ideas or
> > has
> > > already solved a similar issue.
> > > Thanks
> > >
> >
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
To start, what is the form of your input data - is it already divided into
chapters and pages? Or... are you starting with raw PDF files?


-- Jack Krupansky

On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati <zacch...@gmail.com> wrote:

> Hi all,
> I'm searching for ideas on how to define schema and how to perform queries
> in this use case: we have to index books, each book is split into chapters
> and chapters are split into pages (pages represent original page cutting in
> printed version). We should show the result grouped by books and chapters
> (for the same book) and pages (for the same chapter). As far as I know, we
> have 2 options:
>
> 1. index pages as SOLR documents. In this way we could theoretically
> retrieve chapters (and books?)  using grouping but
> a. we will miss matches across two contiguous pages (page cutting is
> only due to typographical needs so concepts could be split... as in printed
> books)
> b. I don't know if it is possible in SOLR to group results on two
> different levels (books and chapters)
>
> 2. index chapters as SOLR documents. In this case we will have the right
> matches but how to obtain the matching pages? (we need pages because the
> client can only display pages)
>
> we have been struggling on this problem for a lot of time and we're  not
> able to find a suitable solution so I'm looking if someone has ideas or has
> already solved a similar issue.
> Thanks
>


Re: ExtendedDisMax configuration nowhere to be found

2016-02-29 Thread Jack Krupansky
Interesting... if I google for "edismax solr" (I usually specify a product
name when searching for doc by feature name to avoid irrelevant hits) the
old wiki comes up as #1 and new doc as #2, but if I search for "edismax"
alone (which I normally wouldn't do out of a desire to limit matches to the
desired product) the new ref guide does indeed show up as #1 and the old
wiki as #2.

I'm not enough of an SEO expert to know how to de-boost the old wiki other
than outright deletion. I'm guessing it's due to a lot of inbound links,
maybe mostly from references in old emails.

In any case, a proper tombstone is probably the best step at this point.

-- Jack Krupansky

On Mon, Feb 29, 2016 at 10:39 AM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> It is indeed a problem that the old edismax wiki is result #1 from Google.
> I find that annoying as well since I also use Google search as my first
> step in accessing doc on everything.
>
> -- Jack Krupansky
>
> On Mon, Feb 29, 2016 at 10:03 AM, <jimi.hulleg...@svensktnaringsliv.se>
> wrote:
>
>> Thanks Shawn,
>>
>> I had more or less assumed that the cwiki site was focused on the latest
>> Solr version, but never really noticed that the "reference guide" was
>> available in version-specific releases. I guess that is partly because I
>> prefer googling about a specific topic, instead of reading some reference
>> guide cover to cover. And from a google search for "edismax" (for example),
>> it's not really trivial to click one's way into a version-specific
>> reference guide on that topic. Instead, one tends to land on the wiki pages
>> (with the old wiki as the first hit, sometimes).
>>
>> Regards
>> /Jimi
>>
>> -Original Message-
>> From: Shawn Heisey [mailto:apa...@elyograg.org]
>> Sent: Monday, February 29, 2016 3:45 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: ExtendedDisMax configuration nowhere to be found
>>
>> On 2/29/2016 7:00 AM, jimi.hulleg...@svensktnaringsliv.se wrote:
>> > So, should I assume that the "Confluence wiki" is the correct place for
>> all documentation, even for solr 4.6?
>>
>> If you want documentation specifically for 4.6, there are
>> version-specific releases of the guide:
>>
>> https://archive.apache.org/dist/lucene/solr/ref-guide/
>>
>> The confluence wiki is the "live" version of the reference guide,
>> applicable to whatever version of Solr is being worked on at the moment,
>> not the released versions.  Because it's such a large documentation set and
>> Solr evolves incrementally, quite a lot of the confluence wiki is
>> applicable to older versions, but the wiki as a whole is not intended for
>> those older versions.
>>
>> The project is gearing up to begin the work on releasing version 6.0, so
>> you can expect a LOT of change activity on the confluence wiki in the near
>> future.  I have no idea how long it will take to finish 6.0.  The last two
>> major releases (4.0 and 5.0) took months, but there's strong hope on the
>> team that it will only take a few weeks this time.
>>
>> If you want to keep an eye on the pulse of the project, join the dev list.
>>
>> http://lucene.apache.org/solr/resources.html#mailing-lists
>>
>> In addition to a fair number of messages from real people, the dev list
>> receives automated email from back-end systems in the project
>> infrastructure, which creates very high traffic.  The ability to create
>> filters to move mail between folders may help you keep your sanity.
>>
>> Also listed on the link above page is the commit notification list, which
>> offers a particularly verbose look into what's happening to the project.
>>
>> Thanks,
>> Shawn
>>
>>
>


Re: ExtendedDisMax configuration nowhere to be found

2016-02-29 Thread Jack Krupansky
It is indeed a problem that the old edismax wiki is result #1 from Google.
I find that annoying as well since I also use Google search as my first
step in accessing doc on everything.

-- Jack Krupansky

On Mon, Feb 29, 2016 at 10:03 AM, <jimi.hulleg...@svensktnaringsliv.se>
wrote:

> Thanks Shawn,
>
> I had more or less assumed that the cwiki site was focused on the latest
> Solr version, but never really noticed that the "reference guide" was
> available in version-specific releases. I guess that is partly because I
> prefer googling about a specific topic, instead of reading some reference
> guide cover to cover. And from a google search for "edismax" (for example),
> it's not really trivial to click one's way into a version-specific
> reference guide on that topic. Instead, one tends to land on the wiki pages
> (with the old wiki as the first hit, sometimes).
>
> Regards
> /Jimi
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Monday, February 29, 2016 3:45 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
>
> On 2/29/2016 7:00 AM, jimi.hulleg...@svensktnaringsliv.se wrote:
> > So, should I assume that the "Confluence wiki" is the correct place for
> all documentation, even for solr 4.6?
>
> If you want documentation specifically for 4.6, there are version-specific
> releases of the guide:
>
> https://archive.apache.org/dist/lucene/solr/ref-guide/
>
> The confluence wiki is the "live" version of the reference guide,
> applicable to whatever version of Solr is being worked on at the moment,
> not the released versions.  Because it's such a large documentation set and
> Solr evolves incrementally, quite a lot of the confluence wiki is
> applicable to older versions, but the wiki as a whole is not intended for
> those older versions.
>
> The project is gearing up to begin the work on releasing version 6.0, so
> you can expect a LOT of change activity on the confluence wiki in the near
> future.  I have no idea how long it will take to finish 6.0.  The last two
> major releases (4.0 and 5.0) took months, but there's strong hope on the
> team that it will only take a few weeks this time.
>
> If you want to keep an eye on the pulse of the project, join the dev list.
>
> http://lucene.apache.org/solr/resources.html#mailing-lists
>
> In addition to a fair number of messages from real people, the dev list
> receives automated email from back-end systems in the project
> infrastructure, which creates very high traffic.  The ability to create
> filters to move mail between folders may help you keep your sanity.
>
> Also listed on the link above page is the commit notification list, which
> offers a particularly verbose look into what's happening to the project.
>
> Thanks,
> Shawn
>
>


Re: ExtendedDisMax configuration nowhere to be found

2016-02-29 Thread Jack Krupansky
There is nothing wrong with features that appear to be automagical - that
should in fact be a goal for all modern software systems. Of course, there
is no magic, it's all real logic and any magic is purely appearance - it's
just that the underlying logic may be complex and not obvious to an
uninformed observer. Deliberately hiding information from users (e.g.,
implementation details) is indeed a goal for Solr - no mere mortal should
be exposed to the intricate detail of the underlying Lucene search library
or the apparent magic of edismax. In truth, nothing is hidden - the source
code of both Solr and Lucene is readily available. But to the user it may
(and should) appear to be magical and even automagical.

OTOH, maybe some of the doc on edismax was not as clear as it could have
been, in which case it is up to you to point out which specific passage(s)
caused your difficulty. AFAICT, nothing at all was hidden - the examples in
the doc (which I pointed you to) seem very simple and direct to the point.
If you experienced them otherwise, it is up to you to point out any
problems that you had. And as I pointed out, you had started with the old
wiki when you should have started with the current Solr Reference Guide.

The old edismax wiki should in fact have a tombstone warning that indicates
that it is obsolete and redirect people to the new doc. Out of curiosity,
how did you get to that old wiki page in the first place?

-- Jack Krupansky

On Mon, Feb 29, 2016 at 3:20 AM, <jimi.hulleg...@svensktnaringsliv.se>
wrote:

> There is no need to deliberately misinterpret what I wrote. What I was
> trying to say was that "automagical" things don't belong in a professional
> environment, because it is hiding important information from people. And
> this is bad as it is, but if it on top of that is the *intended* meaning
> for things in solr to be "automagical", ie *deliberately* hiding
> information from the solr users, well that attitude is just baffling in my
> eyes. I can only hope that I misunderstood you.
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Sunday, February 28, 2016 11:44 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
>
> So, all this hard work that people have put into Solr to make it more like
> a Disney theme park is just... wasted... on you? Sigh. Okay, I guess we
> can't please everyone.
>
> -- Jack Krupansky
>
> On Sun, Feb 28, 2016 at 5:40 PM, <jimi.hulleg...@svensktnaringsliv.se>
> wrote:
>
> > I have no problem with automatic. It is "automagicall" stuff that I
> > find a bit hard to like. Ie things that are automatic, but doesn't
> > explain how and why they are automatic. But Disney Land and Disney
> > World are actually really good examples of places where the magic
> > stuff is suitable, ie in themeparks, designed mostly for kids. In the
> > grown up world of IT, most people prefer logical and documented stuff,
> not things that "just works"
> > without explaining why. No offence :)
> >
> > /Jimi
> >
> > -Original Message-
> > From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> > Sent: Sunday, February 28, 2016 11:31 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: ExtendedDisMax configuration nowhere to be found
> >
> > Yes, it absolutely is automagic - just look at those examples in the
> > Confluence ref guide. No special request handler is needed - just the
> > normal default handler. Just the defType and qf parameters are needed
> > - as shown in the wiki examples.
> >
> > It really is that simple! All you have to supply is the list of fields
> > to query (qf) and your actual query text (q).
> >
> > I know, I know... some people just can't handle automatic. (Some
> > people hate DisneyLand/World!)
> >
> > -- Jack Krupansky
> >
> > On Sun, Feb 28, 2016 at 5:16 PM, <jimi.hulleg...@svensktnaringsliv.se>
> > wrote:
> >
> > > I'm sorry, but I am still confused. I'm expecting to see some
> > >  tag somewhere. Why doesn't the documentation nor
> > > the example solrconfig.xml contain such a tag?
> > >
> > > If the edismax requestHandler is defined automatically, the
> > > documentation should explain that. Also, there should still exist
> > > some xml code that corresponds exactly to that default setup, right?
> > > That is what I'm looking for.
> > >
> > > For now, this edismax thing seems to work "automagically", and I
> > > prefer to understand why and how something works.
> > >
> > > /Jimi
> > >
> &

Re: ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread Jack Krupansky
So, all this hard work that people have put into Solr to make it more like
a Disney theme park is just... wasted... on you? Sigh. Okay, I guess we
can't please everyone.

-- Jack Krupansky

On Sun, Feb 28, 2016 at 5:40 PM, <jimi.hulleg...@svensktnaringsliv.se>
wrote:

> I have no problem with automatic. It is "automagicall" stuff that I find a
> bit hard to like. Ie things that are automatic, but doesn't explain how and
> why they are automatic. But Disney Land and Disney World are actually
> really good examples of places where the magic stuff is suitable, ie in
> themeparks, designed mostly for kids. In the grown up world of IT, most
> people prefer logical and documented stuff, not things that "just works"
> without explaining why. No offence :)
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Sunday, February 28, 2016 11:31 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
>
> Yes, it absolutely is automagic - just look at those examples in the
> Confluence ref guide. No special request handler is needed - just the
> normal default handler. Just the defType and qf parameters are needed - as
> shown in the wiki examples.
>
> It really is that simple! All you have to supply is the list of fields to
> query (qf) and your actual query text (q).
>
> I know, I know... some people just can't handle automatic. (Some people
> hate DisneyLand/World!)
>
> -- Jack Krupansky
>
> On Sun, Feb 28, 2016 at 5:16 PM, <jimi.hulleg...@svensktnaringsliv.se>
> wrote:
>
> > I'm sorry, but I am still confused. I'm expecting to see some
> >  tag somewhere. Why doesn't the documentation nor the
> > example solrconfig.xml contain such a tag?
> >
> > If the edismax requestHandler is defined automatically, the
> > documentation should explain that. Also, there should still exist some
> > xml code that corresponds exactly to that default setup, right? That
> > is what I'm looking for.
> >
> > For now, this edismax thing seems to work "automagically", and I
> > prefer to understand why and how something works.
> >
> > /Jimi
> >
> > -Original Message-
> > From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> > Sent: Sunday, February 28, 2016 10:58 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: ExtendedDisMax configuration nowhere to be found
> >
> > Consult the Confluence wiki for more recent doc:
> >
> > https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Q
> > uery+Parser
> >
> > You can specify all the parameters on your query request as in the
> > examples, or by placing the parameters in the "defaults" section for
> > your request handler in solrconfig.xml.
> >
> >
> > -- Jack Krupansky
> >
> > On Sun, Feb 28, 2016 at 2:42 PM, <jimi.hulleg...@svensktnaringsliv.se>
> > wrote:
> >
> > > Hi,
> > >
> > > I want to setup ExtendedDisMax in our solr 4.6 server, but I can't
> > > seem to find any example configuration for this. Ie the
> > > configuration needed in solrconfig.xml. In the wiki page
> > > http://wiki.apache.org/solr/ExtendedDisMax it simply says:
> > >
> > > "Extended DisMax is already configured in the example configuration,
> > > with the name edismax."
> > >
> > > But this is not true for the solrconfig.xml in our setup (it only
> > > contains an example for dismax, not edismax), and I downloaded the
> > > latest solr zip file (solr 5.5.0), and it didn't have either dismax
> > > or edismax in any of its solrconfig.xml files.
> > >
> > > Why is it so hard to find this configuration? Am I missing something
> > > obvious?
> > >
> > > Regards
> > > /Jimi
> > >
> >
>


Re: ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread Jack Krupansky
Yes, it absolutely is automagic - just look at those examples in the
Confluence ref guide. No special request handler is needed - just the
normal default handler. Just the defType and qf parameters are needed - as
shown in the wiki examples.

It really is that simple! All you have to supply is the list of fields to
query (qf) and your actual query text (q).
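For example (the field names are just placeholders), a complete edismax
request is nothing more than:

http://localhost:8983/solr/collection1/select?defType=edismax&q=some+search+terms&qf=title^2+description

No changes to solrconfig.xml are required for that to work.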

I know, I know... some people just can't handle automatic. (Some people
hate DisneyLand/World!)

-- Jack Krupansky

On Sun, Feb 28, 2016 at 5:16 PM, <jimi.hulleg...@svensktnaringsliv.se>
wrote:

> I'm sorry, but I am still confused. I'm expecting to see some
>  tag somewhere. Why doesn't the documentation nor the
> example solrconfig.xml contain such a tag?
>
> If the edismax requestHandler is defined automatically, the documentation
> should explain that. Also, there should still exist some xml code that
> corresponds exactly to that default setup, right? That is what I'm looking
> for.
>
> For now, this edismax thing seems to work "automagically", and I prefer to
> understand why and how something works.
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Sunday, February 28, 2016 10:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
>
> Consult the Confluence wiki for more recent doc:
>
> https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
>
> You can specify all the parameters on your query request as in the
> examples, or by placing the parameters in the "defaults" section for your
> request handler in solrconfig.xml.
>
>
> -- Jack Krupansky
>
> On Sun, Feb 28, 2016 at 2:42 PM, <jimi.hulleg...@svensktnaringsliv.se>
> wrote:
>
> > Hi,
> >
> > I want to setup ExtendedDisMax in our solr 4.6 server, but I can't
> > seem to find any example configuration for this. Ie the configuration
> > needed in solrconfig.xml. In the wiki page
> > http://wiki.apache.org/solr/ExtendedDisMax it simply says:
> >
> > "Extended DisMax is already configured in the example configuration,
> > with the name edismax."
> >
> > But this is not true for the solrconfig.xml in our setup (it only
> > contains an example for dismax, not edismax), and I downloaded the
> > latest solr zip file (solr 5.5.0), and it didn't have either dismax or
> > edismax in any of its solrconfig.xml files.
> >
> > Why is it so hard to find this configuration? Am I missing something
> > obvious?
> >
> > Regards
> > /Jimi
> >
>


Re: ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread Jack Krupansky
Consult the Confluence wiki for more recent doc:
https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

You can specify all the parameters on your query request as in the
examples, or by placing the parameters in the "defaults" section for your
request handler in solrconfig.xml.
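For example, something along these lines in solrconfig.xml (the handler name
and fields are illustrative) makes edismax the default for that handler:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title^2 description</str>
    <str name="rows">10</str>
  </lst>
</requestHandler>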


-- Jack Krupansky

On Sun, Feb 28, 2016 at 2:42 PM, <jimi.hulleg...@svensktnaringsliv.se>
wrote:

> Hi,
>
> I want to setup ExtendedDisMax in our solr 4.6 server, but I can't seem to
> find any example configuration for this. Ie the configuration needed in
> solrconfig.xml. In the wiki page
> http://wiki.apache.org/solr/ExtendedDisMax it simply says:
>
> "Extended DisMax is already configured in the example configuration, with
> the name edismax."
>
> But this is not true for the solrconfig.xml in our setup (it only contains
> an example for dismax, not edismax), and I downloaded the latest solr zip
> file (solr 5.5.0), and it didn't have either dismax or edismax in any of
> its solrconfig.xml files.
>
> Why is it so hard to find this configuration? Am I missing something
> obvious?
>
> Regards
> /Jimi
>


Re: Query time de-boost

2016-02-28 Thread Jack Krupansky
Thanks for clarifying - you are referring to the bq parameter, which is in
fact additive to the underlying score of the original query, whereas in the
main query, or with the bf, boost, qf, and pf parameters, the boosting is
multiplicative rather than additive.

IOW, only in the bq parameter do you need to use negative boost values - in
all the other contexts a fractional boost is sufficient.
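A rough illustration of the difference (the field and term names are made up):

bq=category:promoted^5
    adds the score of that extra clause to matching documents, so even a small
    positive factor still pushes matching documents up rather than down

boost=if(termfreq(category,'legacy'),0.1,1)
    multiplies the whole document score by 0.1 when the term is present, which
    de-boosts those documents without needing any negative boost value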

It's unfortunate that the ref guide isn't more clear about this key
distinction.

Now hopefully we (and others!) are on the same page.


-- Jack Krupansky

On Sun, Feb 28, 2016 at 3:26 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Jack,
> I think we are talking about different things: I agree that boost is
> multiplicative, and boost values less than zero will reduce score, but if
> you use such boost value in bq, it will still bust documents that are
> matching it. Simplest example is with ids. If you query:
>   q=id:a OR id:b
> both doc a and b will have same score. If you boost a^2 it will be first,
> if you boost a^0.1 it will be second. But if you use dismax's bq=id:a^0.1
> it will be first. In such case you have to use negative boost to make sure
> it is last.
>
> Are we on the same page now?
>
> Regards,
> Emir
>
>
> On 26.02.2016 16:00, Jack Krupansky wrote:
>
>> Could you share your actual numbers and test case? IOW, the document score
>> without ^0.01 and with ^0.01.
>>
>> Again, to repeat, the specific boost factor may be positive, but the
>> effect
>> of a fractional boost is to reduce, not add, to the score, so that a score
>> of 0.5 boosted by 0.1 would become 0.05. IOW, it de-boosts occurrences of
>> the term.
>>
>> The point remains that you do not need a "negative boost" to de-boost a
>> term.
>>
>>
>> -- Jack Krupansky
>>
>> On Fri, Feb 26, 2016 at 4:01 AM, Emir Arnautovic <
>> emir.arnauto...@sematext.com> wrote:
>>
>> Hi Jack,
>>> I just checked on 5.5 and 0.1 is positive boost.
>>>
>>> Regards,
>>> Emir
>>>
>>>
>>> On 26.02.2016 01:11, Jack Krupansky wrote:
>>>
>>> 0.1 is a fractional boost - all intra-query boosts are multiplicative,
>>>> not
>>>> additive, so term^0.1 reduces the term by 90%.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Wed, Feb 24, 2016 at 11:29 AM, shamik <sham...@gmail.com> wrote:
>>>>
>>>> Binoy, 0.1 is still a positive boost. With title getting the highest
>>>>
>>>>> weight,
>>>>> this won't make any difference. I've tried this as well.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>>
>>>>>
>>>>> http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259552.html
>>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>>
>>>>>
>>>>> --
>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: Solr regex documentation

2016-02-27 Thread Jack Krupansky
See:
https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/search/RegexpQuery.html
https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/util/automaton/RegExp.html

I vaguely recall a Jira about regex not working at all in Solr. I don't
recall reading about a resolution.
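For what it's worth, the standard query parser accepts a regex term between
forward slashes, e.g. (the field name is illustrative):

q=part_number:/[a-z]{2}-[0-9]+/

and, as far as I recall, the Lucene RegExp pattern is matched against the
entire term, so ^ and $ anchors are neither needed nor part of the supported
syntax - which would explain why they appear not to work.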


-- Jack Krupansky

On Sat, Feb 27, 2016 at 7:05 AM, Anil <anilk...@gmail.com> wrote:

> Hi,
>
> Can some one point me to the solr regex documentation ?
>
> i read it supports all java regex features.  i tried ^ and $ , seems it is
> not working.
>
> Thanks,
> Anil
>


Re: Query time de-boost

2016-02-26 Thread Jack Krupansky
Could you share your actual numbers and test case? IOW, the document score
without ^0.01 and with ^0.01.

Again, to repeat, the specific boost factor may be positive, but the effect
of a fractional boost is to reduce, not add, to the score, so that a score
of 0.5 boosted by 0.1 would become 0.05. IOW, it de-boosts occurrences of
the term.

The point remains that you do not need a "negative boost" to de-boost a
term.


-- Jack Krupansky

On Fri, Feb 26, 2016 at 4:01 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Jack,
> I just checked on 5.5 and 0.1 is positive boost.
>
> Regards,
> Emir
>
>
> On 26.02.2016 01:11, Jack Krupansky wrote:
>
>> 0.1 is a fractional boost - all intra-query boosts are multiplicative, not
>> additive, so term^0.1 reduces the term by 90%.
>>
>> -- Jack Krupansky
>>
>> On Wed, Feb 24, 2016 at 11:29 AM, shamik <sham...@gmail.com> wrote:
>>
>> Binoy, 0.1 is still a positive boost. With title getting the highest
>>> weight,
>>> this won't make any difference. I've tried this as well.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>>
>>> http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259552.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: Query time de-boost

2016-02-25 Thread Jack Krupansky
0.1 is a fractional boost - all intra-query boosts are multiplicative, not
additive, so term^0.1 reduces the term by 90%.

-- Jack Krupansky

On Wed, Feb 24, 2016 at 11:29 AM, shamik <sham...@gmail.com> wrote:

> Binoy, 0.1 is still a positive boost. With title getting the highest
> weight,
> this won't make any difference. I've tried this as well.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259552.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: WhitespaceTokenizerFactory and PathHierarchyTokenizerFactory

2016-02-25 Thread Jack Krupansky
You still haven't stated exactly what your query requirements are. In Solr
you should always start with an analysis of how people will expect to query
the data and then work backwards to how to store and index the data to
achieve the desired queries.

Note that the standard tokenizer will tokenize all of the elements of a
path or IP as separate terms. Ditto for a query, so you can effectively do
both keyword and phrase queries to match individual terms (e.g., path
elements) or phrases or sequences of path elements or IP address components.
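For instance (an illustrative value, not from the thread), the standard
tokenizer turns a path into its elements:

/usr/local/solr/data  ->  usr | local | solr | data

so a keyword query for "solr" matches any document containing that path
element, and a phrase query such as "local solr" matches the consecutive
elements in order.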

-- Jack Krupansky

On Thu, Feb 25, 2016 at 12:41 AM, Anil <anilk...@gmail.com> wrote:

> Sorry Jack for confusion.
>
> I have field which holds free text. text can contain path , ip or any free
> text.
>
> I would like to tokenize the text of the field using white space. if the
> text token is of path or ip pattern , it has be tockenized like path
> hierarchy way.
>
>
> Regards,
> Anil
>
> On 24 February 2016 at 21:59, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > Your statement makes no sense. Please clarify. Express your
> requirement(s)
> > in plain English first before dragging in possible solutions.
> Technically,
> > path elements can have embedded spaces.
> >
> > -- Jack Krupansky
> >
> > On Wed, Feb 24, 2016 at 6:53 AM, Anil <anilk...@gmail.com> wrote:
> >
> > > HI,
> > >
> > > i need to use both WhitespaceTokenizerFactory and
> > > PathHierarchyTokenizerFactory for use case.
> > >
> > > Solr supports only one tokenizer. is there any way we can achieve
> > > PathHierarchyTokenizerFactory  functionality with filters ?
> > >
> > > Please advice.
> > >
> > > Regards,
> > > Anil
> > >
> >
>


Re: WhitespaceTokenizerFactory and PathHierarchyTokenizerFactory

2016-02-24 Thread Jack Krupansky
Your statement makes no sense. Please clarify. Express your requirement(s)
in plain English first before dragging in possible solutions. Technically,
path elements can have embedded spaces.

-- Jack Krupansky

On Wed, Feb 24, 2016 at 6:53 AM, Anil <anilk...@gmail.com> wrote:

> HI,
>
> i need to use both WhitespaceTokenizerFactory and
> PathHierarchyTokenizerFactory for use case.
>
> Solr supports only one tokenizer. is there any way we can achieve
> PathHierarchyTokenizerFactory  functionality with filters ?
>
> Please advice.
>
> Regards,
> Anil
>


Re: Reverse Engineer Query For a Given Result Set?

2016-02-18 Thread Jack Krupansky
Out of the box? No. Could you develop one? Probably, or at least a rough
approximation, at least some of the time... but probably at a cost
significantly greater than converting queries by hand.

If it is taking you 2-4 hours per query then that suggests that the query
complexity is not amenable to any simple mechanical reverse engineering.

Which aspects of the conversion are taking you so many hours? A few examples
would be helpful.

A mechanical reverse engineering from results would likely reduce the
semantic content of the original query, so that the query may then return a
false positive or false negative as new documents are added to the index
that are no longer in the same pattern as the old results but still within
the pattern of the original Oracle query. The trick may be whether the
delta is meaningful for the actual application use case.

-- Jack Krupansky

On Thu, Feb 18, 2016 at 4:07 AM, Christian Effertz <seme...@gmail.com>
wrote:

> Hi,
>
> Can I somehow feed Solr with a result set or a list of primary keys and get
> the shortest query that leads to this result? In other terms, can I reverse
> engineer a query for a given result set?
>
> Some background why I ask this question:
> We are currently migrating a search application from Oracle Text to Solr.
> Our users have several (>30) complex queries that we need to migrate to our
> new Solr index. This can be done by hand, but is rather time consuming. To
> get an idea of how long the whole task would need, we started with a hand
> full of them. We spent ~2-4h per query to get everything right.
>
> Thank you for your input
>


Re: Negating multiple array fields

2016-02-17 Thread Jack Krupansky
I actually thought seriously about whether to mention wildcard vs. range,
but... it annoys me that the Lucene and query parser folks won't fix either
PrefixQuery or the query parsers to do the right/optimal thing for
single-asterisk query. I wrote up a Jira for it years ago, but for whatever
reason the difficulty persists. At one point one of the Lucene guys told me
that there was a filter query that could do both * and -* very efficiently,
but then later that was disputed, not to mention that filter query is now
gone. In any case, with the newer AutomatonQuery the single-asterisk
PrefixQuery case should always perform at least semi-reasonably no matter
what, especially since it is now a constant-score query, which it wasn't
many years ago.

Whether [* TO *] is actually a lot more (or less) efficient than
PrefixQuery for an empty prefix these days is... unknown to me, but I won't
give anybody grief for using it as a way of compensating for the
brain-damaged way that Lucene and Solr handle single-asterisk and negated
single-asterisk queries.


-- Jack Krupansky

On Tue, Feb 16, 2016 at 8:17 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 2/15/2016 9:22 AM, Jack Krupansky wrote:
> > I should also have noted that your full query:
> >
> > (-persons:*)AND(-places:*)AND(-orgs:*)
> >
> > can be written as:
> >
> > -persons:* -places:* -orgs:*
> >
> > Which may work as is, or can also be written as:
> >
> > *:* -persons:* -places:* -orgs:*
>
> Salman,
>
> One fact of Lucene operation is that purely negative queries do not
> work.  A negative query clause is like a subtraction.  If you make a
> query that only says "subtract these values", then you aren't going to
> get anything, because you did not start with anything.
>
> Adding the "*:*" clause at the beginning of the query says "start with
> everything."
>
> You might ask why a query of -field:value works, when I just said that
> it *won't* work.  This is because Solr has detected the problem and
> fixed it.  When the query is very simple (a single negated clause), Solr
> is able to detect the unworkable situation and implicitly add the "*:*"
> starting point, producing the expected results.  With more complex
> queries, like the one you are trying, this detection fails, and the
> query is executed as-is.
>
> Jack is an awesome member of this community.  I do not want to disparage
> him at all when I tell you that the rewritten query he provided will
> work, but is not optimal.  It can be optimized as the following:
>
> *:* -persons:[* TO *] -places:[* TO *] -orgs:[* TO *]
>
> A query clause of the format "field:*" is a wildcard query.  Behind the
> scenes, Solr will interpret this as "all possible values for field" --
> which sounds like it would be exactly what you're looking for, except
> that if there are ten million possible values in the field you're
> searching, the constructed Lucene query will quite literally include all
> ten million values.  Wildcard queries tend to use a lot of memory and
> run slowly.
>
> The [* TO *] syntax is an all-inclusive range query, which will usually
> be much faster than a wildcard query.
>
> Thanks,
> Shawn
>
>


Re: Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?

2016-02-15 Thread Jack Krupansky
Sounds a lot like multi-tenancy, where you don't want the document
frequencies of one tenant to influence the query relevancy scores for other
tenants.

No ready solution.

Although, I have thought of a simplified document scoring using just tf and
leaving out df/idf. Not as good as a tf*idf or BM25 score, but it avoids the
pollution problem.

I haven't heard of anybody in the Lucene space discussing a way to
categorize documents such that df is relative to a specified document
category and then the query specifies a document category. I support that
indexing and query of some hypothetical similarity schema could both
specify any number of document categories. But that's speculation on my
part.

-- Jack Krupansky

On Mon, Feb 15, 2016 at 6:42 PM, Chris Morley <ch...@depahelix.com> wrote:

> Hey Solr people:
>
>  Suppose that we did not want to break up our document set into separate
> indexes, but had certain cases where many versions of a document were not
> relevant for certain searches.
>
>  I guess this could be thought of as a "authorization" class of problem,
> however it is not that for us.  We have a few other fields that determine
> relevancy to the current query, based on what page the query is coming
> from.  It's kind of like authorization, but not really.
>
>  Anyway, I think the answer for how you would do it for authorization would
> solve it for our case too.
>
>  So I guess suppose you had 99 users and 100 documents and Document 1
> everybody could see it the same, but for the 99 documents, there was a
> slightly different document, and it was unique for each of 99 users, but
> not "very" unique.  Suppose for instance that the only thing different in
> the text of the 99 different documents was that it was watermarked with the
> users name.  Aren't you spamming your tf/idf at that point?  Is there a way
> around this?  Is there a way to say, hey, group these 99 documents together
> and only count 1 of them for tf/idf purposes?
>
>  When doing queries, each user would only ever see 2 documents, Document 1
> , plus whichever other document they specifically owned.
>
>  If there are web pages or book chapters I can read or re-read that address
> this class of problem, those references would be great.
>
>
>  -Chris.
>
>
>
>


Re: Negating multiple array fields

2016-02-15 Thread Jack Krupansky
I should also have noted that your full query:

(-persons:*)AND(-places:*)AND(-orgs:*)

can be written as:

-persons:* -places:* -orgs:*

Which may work as is, or can also be written as:

*:* -persons:* -places:* -orgs:*




-- Jack Krupansky

On Mon, Feb 15, 2016 at 1:57 AM, Salman Ansari <salman.rah...@gmail.com>
wrote:

> @Binoy: The query does work but for one term (-persons:[* TO *]) but it
> does not work for multiple terms such as
> http://[Myserver]/solr/[Collection]/select?q=(-persons:[* TO
> *])AND(-orgs:[*
> TO *])
> This returns zero records although I do have records that has both persons
> and orgs empty.
>
> @Jack: Replacing (-persons:*)AND(-orgs:*) with (*:* -persons:*)AND(*:*
> -orgs:*) did the trick. Thanks.
>
> Thanks you both for your comments.
>
> Salman
>
> On Sun, Feb 14, 2016 at 7:51 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > Due to a bug (or poorly designed feature), you need to explicitly
> include a
> > non-negative query term in a purely negative sub-query. Usually this
> means
> > using *:* to select all documents. Note that the use of parentheses
> > introduces a sub-query. So, (-persons:*) s.b. (*:* -persons:*).
> >
> > -- Jack Krupansky
> >
> > On Sun, Feb 14, 2016 at 8:21 AM, Salman Ansari <salman.rah...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I think what I am asking should be easy to do but for some reasons I am
> > > facing issues in making that happen. The issue is that I want
> > > include/exclude some fields from my Solr query. All the fields that I
> > need
> > > to include are multi valued int fields. When I include the fields I
> have
> > > the following query
> > >
> > > http://
> > >
> > >
> >
> [MySolrServer]/solr/[Collection]/select?q=(persons:*)AND(places:*)AND(orgs:*)
> > > This does return the desired result. However, when I negate the values
> > >
> > > http://
> > >
> > >
> >
> [MySolrServer]/solr/[Collection]/select?q=(-persons:*)AND(-places:*)AND(-orgs:*)
> > > This returns 0 documents although there are a lot of documents that
> have
> > > all those fields empty.
> > >
> > > Any ideas why this is happening?
> > >
> > > Appreciate any comments/feedback.
> > >
> > > Regards,
> > > Salman
> > >
> >
>


Re: "pf" not supported by edismax?

2016-02-14 Thread Jack Krupansky
Maybe it is ignored because the tokenized phrase produces only a single
term. In any case, it won't be a phrase. pf only does something useful for
phrases - IOW, where a PhraseQuery can be generated. A PhraseQuery for
more than a single term would never match when the field value is a single
term.
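For comparison (the field name is illustrative), against a tokenized text
field the same request would produce a real phrase clause, roughly:

pf=title  with  q=dvd bracket   ->   DisjunctionMaxQuery((title:"dvd bracket"))

and that phrase clause is what boosts documents where the terms appear
adjacent to each other.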

-- Jack Krupansky

On Mon, Feb 15, 2016 at 12:11 AM, Derek Poh <d...@globalsources.com> wrote:

> It is using KeywordTokenizerFactory. It is still consider as tokenized?
>
> Here's the field definition:
> <field name="spp_keyword_exact" type="gs_keyword_exact" multiValued="true"/>
>
> <fieldType name="gs_keyword_exact" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <!-- additional filters omitted -->
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <!-- additional filters omitted -->
>   </analyzer>
> </fieldType>
>
>
> On 2/15/2016 12:43 PM, Jack Krupansky wrote:
>
>> pf stands for phrase boosting, which implies tokenized text...
>> spp_keyword_exact sounds like it is not tokenized.
>>
>> -- Jack Krupansky
>>
>> On Sun, Feb 14, 2016 at 10:08 PM, Derek Poh <d...@globalsources.com>
>> wrote:
>>
>> Hi
>>>
>>> Correct me If I am wrong, edismax is an extension of dismax, so it will
>>> support "pf".
>>> But from my testing I noticed "pf" is not working with edismax.
>>>  From the debug information of a query using "pf" with edismax, there is
>>> no
>>> phrase match for the "pf" field "spp_keyword_exact".
>>> If I changed to dismax, it is doing a phrase match on the field.
>>>
>>> Is this normal?
>>>
>>> We are running Solr 4.10.4.
>>>
>>> Below is the queriesand their debug information.
>>>
>>> Query using "pf" with edismax and the debug statement:
>>>
>>>
>>> http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription=spp_keyword_exact=query=edismax
>>>
>>> dvd bracket
>>> dvd bracket
>>> 
>>> (+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
>>> DisjunctionMaxQuery((spp_keyword_exact:bracket))) ())/no_coord
>>> 
>>> 
>>> +((spp_keyword_exact:dvd) (spp_keyword_exact:bracket)) ()
>>> 
>>> ExtendedDismaxQParser
>>>
>>>
>>> Query using "pf" with dismax and the debug statement:
>>>
>>>
>>> http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription=spp_keyword_exact=query=dismax
>>>
>>> dvd bracket
>>> dvd bracket
>>> 
>>> (+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
>>> DisjunctionMaxQuery((spp_keyword_exact:bracket)))
>>> DisjunctionMaxQuery((spp_keyword_exact:dvd bracket)))/no_coord
>>> 
>>> 
>>> +((spp_keyword_exact:dvd) (spp_keyword_exact:bracket))
>>> (spp_keyword_exact:dvd bracket)
>>> 
>>> DisMaxQParser
>>>
>>> Derek
>>>
>>>
>>
>
>


Re: "pf" not supported by edismax?

2016-02-14 Thread Jack Krupansky
pf stands for phrase boosting, which implies tokenized text...
spp_keyword_exact sounds like it is not tokenized.

-- Jack Krupansky

On Sun, Feb 14, 2016 at 10:08 PM, Derek Poh <d...@globalsources.com> wrote:

> Hi
>
> Correct me If I am wrong, edismax is an extension of dismax, so it will
> support "pf".
> But from my testing I noticed "pf" is not working with edismax.
> From the debug information of a query using "pf" with edismax, there is no
> phrase match for the "pf" field "spp_keyword_exact".
> If I changed to dismax, it is doing a phrase match on the field.
>
> Is this normal?
>
> We are running Solr 4.10.4.
>
> Below is the queriesand their debug information.
>
> Query using "pf" with edismax and the debug statement:
>
> http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription=spp_keyword_exact=query=edismax
>
> dvd bracket
> dvd bracket
> 
> (+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
> DisjunctionMaxQuery((spp_keyword_exact:bracket))) ())/no_coord
> 
> 
> +((spp_keyword_exact:dvd) (spp_keyword_exact:bracket)) ()
> 
> ExtendedDismaxQParser
>
>
> Query using "pf" with dismax and the debug statement:
>
> http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket=spp_keyword_exact=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription=spp_keyword_exact=query=dismax
>
> dvd bracket
> dvd bracket
> 
> (+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
> DisjunctionMaxQuery((spp_keyword_exact:bracket)))
> DisjunctionMaxQuery((spp_keyword_exact:dvd bracket)))/no_coord
> 
> 
> +((spp_keyword_exact:dvd) (spp_keyword_exact:bracket))
> (spp_keyword_exact:dvd bracket)
> 
> DisMaxQParser
>
> Derek
>


Re: Negating multiple array fields

2016-02-14 Thread Jack Krupansky
Due to a bug (or poorly designed feature), you need to explicitly include a
non-negative query term in a purely negative sub-query. Usually this means
using *:* to select all documents. Note that the use of parentheses
introduces a sub-query. So, (-persons:*) s.b. (*:* -persons:*).

-- Jack Krupansky

On Sun, Feb 14, 2016 at 8:21 AM, Salman Ansari <salman.rah...@gmail.com>
wrote:

> Hi,
>
> I think what I am asking should be easy to do but for some reasons I am
> facing issues in making that happen. The issue is that I want
> include/exclude some fields from my Solr query. All the fields that I need
> to include are multi valued int fields. When I include the fields I have
> the following query
>
> http://
>
> [MySolrServer]/solr/[Collection]/select?q=(persons:*)AND(places:*)AND(orgs:*)
> This does return the desired result. However, when I negate the values
>
> http://
>
> [MySolrServer]/solr/[Collection]/select?q=(-persons:*)AND(-places:*)AND(-orgs:*)
> This returns 0 documents although there are a lot of documents that have
> all those fields empty.
>
> Any ideas why this is happening?
>
> Appreciate any comments/feedback.
>
> Regards,
> Salman
>


Re: optimize requests that fetch 1000 rows

2016-02-12 Thread Jack Krupansky
Thanks for that critical clarification. Try...

1. A different response writer to see if that impacts the clock time.
2. Selectively remove fields from the fl field list to see if some
particular field has some issue.
3. If you simply return only the ID for the document, how fast/slow is that?

How many fields are in fl?
Any function queries in fl?
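For example (the URL is illustrative; use the same q as the slow request),
comparing these two isolates the cost of fetching and serializing the stored
fields:

curl 'http://localhost:8983/solr/collection1/select?q=...&rows=1000&fl=id&wt=json'
curl 'http://localhost:8983/solr/collection1/select?q=...&rows=1000&wt=json'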


-- Jack Krupansky

On Fri, Feb 12, 2016 at 4:57 AM, Matteo Grolla <matteo.gro...@gmail.com>
wrote:

> Hi Jack,
>  tell me if I'm wrong but qtime accounts for search time excluding the
> fetch of stored fields (I have a 90ms qtime and a ~30s time to obtain the
> results on the client on a LAN infrastructure for 300kB response). debug
> explains how much of qtime is used by each search component.
> For me 90ms are ok, I wouldn't spend time trying to make them 50ms, it's
> the ~30s to obtain the response that I'd like to tackle.
>
>
> 2016-02-12 5:42 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:
>
> > Again, first things first... debugQuery=true and see which Solr search
> > components are consuming the bulk of qtime.
> >
> > -- Jack Krupansky
> >
> > On Thu, Feb 11, 2016 at 11:33 AM, Matteo Grolla <matteo.gro...@gmail.com
> >
> > wrote:
> >
> > > virtual hardware, 200ms is taken on the client until response is
> written
> > to
> > > disk
> > > qtime on solr is ~90ms
> > > not great but acceptable
> > >
> > > Is it possible that the method FilenameUtils.splitOnTokens is really so
> > > heavy when requesting a lot of rows on slow hardware?
> > >
> > > 2016-02-11 17:17 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:
> > >
> > > > Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but
> > > still
> > > > relatively bad. Even 50ms for 10 rows would be considered barely
> okay.
> > > > But... again it depends on query complexity - simple queries should
> be
> > > well
> > > > under 50 ms for decent modern hardware.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla <
> > matteo.gro...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Jack,
> > > > >   response time scale with rows. Relationship doens't seem
> linear
> > > but
> > > > > Below 400 rows times are much faster,
> > > > > I view query times from solr logs and they are fast
> > > > > the same query with rows = 1000 takes 8s
> > > > > with rows = 10 takes 0.2s
> > > > >
> > > > >
> > > > > 2016-02-11 16:22 GMT+01:00 Jack Krupansky <
> jack.krupan...@gmail.com
> > >:
> > > > >
> > > > > > Are queries scaling linearly - does a query for 100 rows take
> > 1/10th
> > > > the
> > > > > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
> > > > > >
> > > > > > Does the app need/expect exactly 1,000 documents for the query or
> > is
> > > > that
> > > > > > just what this particular query happened to return?
> > > > > >
> > > > > > What does they query look like? Is it complex or use wildcards or
> > > > > function
> > > > > > queries, or is it very simple keywords? How many operators?
> > > > > >
> > > > > > Have you used the debugQuery=true parameter to see which search
> > > > > components
> > > > > > are taking the time?
> > > > > >
> > > > > > -- Jack Krupansky
> > > > > >
> > > > > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla <
> > > > matteo.gro...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Yonic,
> > > > > > >  after the first query I find 1000 docs in the document
> > cache.
> > > > > > > I'm using curl to send the request and requesting javabin
> format
> > to
> > > > > mimic
> > > > > > > the application.
> > > > > > > gc activity is low
> > > > > > > I managed to load the entire 50GB index in the filesystem
> cache,
> > > > after
> > > > > > that
> > > > > > > queries don't cause disk activity anymore.
> > > > > > > Time improves now queries that

Re: query knowledge graph

2016-02-12 Thread Jack Krupansky
"knowledge graph" is kind of vague - what did you have in mind? An example
would help.

-- Jack Krupansky

On Fri, Feb 12, 2016 at 7:27 AM, Midas A <test.mi...@gmail.com> wrote:

>  Please suggest how to create query knowledge graph for e-commerce
> application .
>
>
> please describe in detail . our mote is to improve relevancy . we are from
> LAMP back ground .
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Is this a scenario that was working fine and suddenly deteriorated, or has
it always been slow?

-- Jack Krupansky

On Thu, Feb 11, 2016 at 4:33 AM, Matteo Grolla <matteo.gro...@gmail.com>
wrote:

> Hi,
>  I'm trying to optimize a solr application.
> The bottleneck are queries that request 1000 rows to solr.
> Unfortunately the application can't be modified at the moment, can you
> suggest me what could be done on the solr side to increase the performance?
> The bottleneck is just on fetching the results, the query executes very
> fast.
> I suggested caching .fdx and .fdt files on the file system cache.
> Anything else?
>
> Thanks
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Are queries scaling linearly - does a query for 100 rows take 1/10th the
time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?

Does the app need/expect exactly 1,000 documents for the query or is that
just what this particular query happened to return?

What does they query look like? Is it complex or use wildcards or function
queries, or is it very simple keywords? How many operators?

Have you used the debugQuery=true parameter to see which search components
are taking the time?
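For example (the URL is illustrative), debug=timing alone is enough to see the
breakdown:

curl 'http://localhost:8983/solr/collection1/select?q=...&rows=1000&debug=timing&wt=json'

The debug/timing section of the response then lists prepare and process time
for each search component (query, facet, highlight, and so on).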

-- Jack Krupansky

On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla <matteo.gro...@gmail.com>
wrote:

> Hi Yonic,
>  after the first query I find 1000 docs in the document cache.
> I'm using curl to send the request and requesting javabin format to mimic
> the application.
> gc activity is low
> I managed to load the entire 50GB index in the filesystem cache, after that
> queries don't cause disk activity anymore.
> Time improves now queries that took ~30s take <10s. But I hoped better
> I'm going to use jvisualvm's sampler to analyze where time is spent
>
>
> 2016-02-11 15:25 GMT+01:00 Yonik Seeley <ysee...@gmail.com>:
>
> > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla <matteo.gro...@gmail.com>
> > wrote:
> > > Thanks Toke, yes, they are long times, and solr qtime (to execute the
> > > query) is a fraction of a second.
> > > The response in javabin format is around 300k.
> >
> > OK, That tells us a lot.
> > And if you actually tested so that all the docs would be in the cache
> > (can you verify this by looking at the cache stats after you
> > re-execute?) then it seems like the slowness is down to any of:
> > a) serializing the response (it doesn't seem like a 300K response
> > should take *that* long to serialize)
> > b) reading/processing the response (how fast the client can do
> > something with each doc is also a factor...)
> > c) other (GC, network, etc)
> >
> > You can try taking client processing out of the equation by trying a
> > curl request.
> >
> > -Yonik
> >
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but still
relatively bad. Even 50ms for 10 rows would be considered barely okay.
But... again it depends on query complexity - simple queries should be well
under 50 ms for decent modern hardware.

-- Jack Krupansky

On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla <matteo.gro...@gmail.com>
wrote:

> Hi Jack,
>   response times scale with rows. The relationship doesn't seem linear, but
> below 400 rows times are much faster,
> I view query times from solr logs and they are fast
> the same query with rows = 1000 takes 8s
> with rows = 10 takes 0.2s
>
>
> 2016-02-11 16:22 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:
>
> > Are queries scaling linearly - does a query for 100 rows take 1/10th the
> > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
> >
> > Does the app need/expect exactly 1,000 documents for the query or is that
> > just what this particular query happened to return?
> >
> > What does the query look like? Is it complex, does it use wildcards or
> > function queries, or is it very simple keywords? How many operators?
> >
> > Have you used the debugQuery=true parameter to see which search
> components
> > are taking the time?
> >
> > -- Jack Krupansky
> >
> > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla <matteo.gro...@gmail.com>
> > wrote:
> >
> > > Hi Yonic,
> > >  after the first query I find 1000 docs in the document cache.
> > > I'm using curl to send the request and requesting javabin format to
> mimic
> > > the application.
> > > gc activity is low
> > > I managed to load the entire 50GB index in the filesystem cache, after
> > that
> > > queries don't cause disk activity anymore.
> > > Time improves now queries that took ~30s take <10s. But I hoped better
> > > I'm going to use jvisualvm's sampler to analyze where time is spent
> > >
> > >
> > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley <ysee...@gmail.com>:
> > >
> > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla <
> > matteo.gro...@gmail.com>
> > > > wrote:
> > > > > Thanks Toke, yes, they are long times, and solr qtime (to execute
> the
> > > > > query) is a fraction of a second.
> > > > > The response in javabin format is around 300k.
> > > >
> > > > OK, That tells us a lot.
> > > > And if you actually tested so that all the docs would be in the cache
> > > > (can you verify this by looking at the cache stats after you
> > > > re-execute?) then it seems like the slowness is down to any of:
> > > > a) serializing the response (it doesn't seem like a 300K response
> > > > should take *that* long to serialize)
> > > > b) reading/processing the response (how fast the client can do
> > > > something with each doc is also a factor...)
> > > > c) other (GC, network, etc)
> > > >
> > > > You can try taking client processing out of the equation by trying a
> > > > curl request.
> > > >
> > > > -Yonik
> > > >
> > >
> >
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Again, first things first... debugQuery=true and see which Solr search
components are consuming the bulk of qtime.
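
For example (collection name and query are made up - adjust to your setup),
something like this returns a "debug" section whose "timing" block shows how
long each search component spent in prepare/process:

    curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=1000&wt=json&debugQuery=true"

Look at debug > timing > process to see whether the time goes to the query
component, faceting, highlighting, etc.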

-- Jack Krupansky

On Thu, Feb 11, 2016 at 11:33 AM, Matteo Grolla <matteo.gro...@gmail.com>
wrote:

> virtual hardware, 200ms is taken on the client until response is written to
> disk
> qtime on solr is ~90ms
> not great but acceptable
>
> Is it possible that the method FilenameUtils.splitOnTokens is really so
> heavy when requesting a lot of rows on slow hardware?
>
> 2016-02-11 17:17 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:
>
> > Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but
> still
> > relatively bad. Even 50ms for 10 rows would be considered barely okay.
> > But... again it depends on query complexity - simple queries should be
> well
> > under 50 ms for decent modern hardware.
> >
> > -- Jack Krupansky
> >
> > On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla <matteo.gro...@gmail.com
> >
> > wrote:
> >
> > > Hi Jack,
> > >   response times scale with rows. The relationship doesn't seem linear, but
> > > below 400 rows times are much faster,
> > > I view query times from solr logs and they are fast
> > > the same query with rows = 1000 takes 8s
> > > with rows = 10 takes 0.2s
> > >
> > >
> > > 2016-02-11 16:22 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:
> > >
> > > > Are queries scaling linearly - does a query for 100 rows take 1/10th
> > the
> > > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
> > > >
> > > > Does the app need/expect exactly 1,000 documents for the query or is
> > that
> > > > just what this particular query happened to return?
> > > >
> > > > What does the query look like? Is it complex, does it use wildcards or
> > > > function queries, or is it very simple keywords? How many operators?
> > > >
> > > > Have you used the debugQuery=true parameter to see which search
> > > components
> > > > are taking the time?
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla <
> > matteo.gro...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Yonic,
> > > > >  after the first query I find 1000 docs in the document cache.
> > > > > I'm using curl to send the request and requesting javabin format to
> > > mimic
> > > > > the application.
> > > > > gc activity is low
> > > > > I managed to load the entire 50GB index in the filesystem cache,
> > after
> > > > that
> > > > > queries don't cause disk activity anymore.
> > > > > Time improves now queries that took ~30s take <10s. But I hoped
> > better
> > > > > I'm going to use jvisualvm's sampler to analyze where time is spent
> > > > >
> > > > >
> > > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley <ysee...@gmail.com>:
> > > > >
> > > > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla <
> > > > matteo.gro...@gmail.com>
> > > > > > wrote:
> > > > > > > Thanks Toke, yes, they are long times, and solr qtime (to
> execute
> > > the
> > > > > > > query) is a fraction of a second.
> > > > > > > The response in javabin format is around 300k.
> > > > > >
> > > > > > OK, That tells us a lot.
> > > > > > And if you actually tested so that all the docs would be in the
> > cache
> > > > > > (can you verify this by looking at the cache stats after you
> > > > > > re-execute?) then it seems like the slowness is down to any of:
> > > > > > a) serializing the response (it doesn't seem like a 300K response
> > > > > > should take *that* long to serialize)
> > > > > > b) reading/processing the response (how fast the client can do
> > > > > > something with each doc is also a factor...)
> > > > > > c) other (GC, network, etc)
> > > > > >
> > > > > > You can try taking client processing out of the equation by
> trying
> > a
> > > > > > curl request.
> > > > > >
> > > > > > -Yonik
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Need to move on SOlr cloud (help required)

2016-02-10 Thread Jack Krupansky
What exactly is your motivation? I mean, the primary benefit of SolrCloud
is better support for sharding, and you have only a single shard. If you
have no need for sharding and your master-slave replicated Solr has been
working fine, then stick with it. If only one machine is having a load
problem, then that one node should be replaced. There are indeed plenty of
good reasons to prefer SolrCloud over traditional master-slave replication,
but so far you haven't touched on any of them.

How much data (number of documents) do you have?

What is your typical query latency?


-- Jack Krupansky

On Wed, Feb 10, 2016 at 2:15 AM, kshitij tyagi <kshitij.shopcl...@gmail.com>
wrote:

> Hi,
>
> We are currently using Solr 5.2 and I need to move to a SolrCloud
> architecture.
>
> As of now we are using 5 machines :
>
> 1. I am using 1 master where we are indexing our data.
> 2. I replicate my data on other machines
>
> One or the other machine keeps on showing high load so I am planning to
> move to SolrCloud.
>
> Need help on following :
>
> 1. What should be my architecture in case of 5 machines to keep (zookeeper,
> shards, core).
>
> 2. How to add a node.
>
> 3. what are the exact steps/process I need to follow in order to change to
> solr cloud.
>
> 4. How will indexing work in SolrCloud? As of now I am using a MySQL query to
> get the data on the master and then index it (how do I need to change this
> in the case of SolrCloud?).
>
> Regards,
> Kshitij
>


Re: Solr architecture

2016-02-08 Thread Jack Krupansky
So is there any aging or TTL (in database terminology) of older docs?

And do all of your queries need to query all of the older documents all of
the time or is there a clear hierarchy of querying for aged documents, like
past 24-hours vs. past week vs. past year vs. older than a year? Sure, you
can always use a function query to boost by the inverse of document age,
but Solr would be more efficient with filter queries or separate indexes
for different time scales.
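
For example (the date field name is made up), the most recent tier could be a
simple filter query using date math, which also caches well because both ends
are rounded:

    fq=publish_date:[NOW/HOUR-24HOURS TO NOW/HOUR+1HOUR]

with wider ranges (or separate collections) for the older tiers.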

Are documents ever updated or are they write-once?

Are documents explicitly deleted?

Technically you probably could meet those specs, but... how many
organizations have the resources and the energy to do so?

As a back of the envelope calculation, if Solr gave you 100 queries per
second per node, that would mean you would need 1,200 nodes. It would also
depend on whether those queries are very narrow so that a single node can
execute them or if they require fanout to other shards and then aggregation
of results from those other shards.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Short form: You really have to prototype. Here's the long form:
>
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> I've seen between 20M and 200M docs fit on a single piece of hardware,
> so you'll absolutely have to shard.
>
> And the other thing you haven't told us is whether you plan on
> _adding_ 2B docs a day or whether that number is the total corpus size
> and you are re-indexing the 2B docs/day. IOW, if you are adding 2B
> docs/day, 30 days later do you have 2B docs or 60B docs in your
> corpus?
>
> Best,
> Erick
>
> On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar <susheel2...@gmail.com>
> wrote:
> > Also if you are expecting indexing of 2 billion docs as NRT or if it will
> > be offline (during off hours etc).  For more accurate sizing you may also
> > want to index say 10 million documents which may give you idea how much
> is
> > your index size and then use that for extrapolation to come up with
> memory
> > requirements.
> >
> > Thanks,
> > Susheel
> >
> > On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Mark,
> >> Can you give us bit more details: size of docs, query types, are docs
> >> grouped somehow, are they time sensitive, will they update or it is
> rebuild
> >> every time, etc.
> >>
> >> Thanks,
> >> Emir
> >>
> >>
> >> On 08.02.2016 16:56, Mark Robinson wrote:
> >>
> >>> Hi,
> >>> We have a requirement where we would need to index around 2 Billion
> docs
> >>> in
> >>> a day.
> >>> The queries against this indexed data set can be around 80K queries per
> >>> second during peak time and during non peak hours around 12K queries
> per
> >>> second.
> >>>
> >>> Can Solr realize this huge volumes.
> >>>
> >>> If so, assuming we have no constraints for budget what would be a
> >>> recommended Solr set up (number of shards, number of Solr instances
> >>> etc...)
> >>>
> >>> Thanks!
> >>> Mark
> >>>
> >>>
> >> --
> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> Solr & Elasticsearch Support * http://sematext.com/
> >>
> >>
>


Re: Solr architecture

2016-02-08 Thread Jack Krupansky
Oops... at 100 qps for a single node you would need 120 nodes to get to 12K
qps and 800 nodes to get 80K qps, but that is just an extremely rough
ballpark estimate, not some precise and firm number. And that's if all the
queries can be evenly distributed throughout the cluster and don't require
fanout to other shards, which effectively turns each incoming query into n
queries where n is the number of shards.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 12:07 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> So is there any aging or TTL (in database terminology) of older docs?
>
> And do all of your queries need to query all of the older documents all of
> the time or is there a clear hierarchy of querying for aged documents, like
> past 24-hours vs. past week vs. past year vs. older than a year? Sure, you
> can always use a function query to boost by the inverse of document age,
> but Solr would be more efficient with filter queries or separate indexes
> for different time scales.
>
> Are documents ever updated or are they write-once?
>
> Are documents explicitly deleted?
>
> Technically you probably could meet those specs, but... how many
> organizations have the resources and the energy to do so?
>
> As a back of the envelope calculation, if Solr gave you 100 queries per
> second per node, that would mean you would need 1,200 nodes. It would also
> depend on whether those queries are very narrow so that a single node can
> execute them or if they require fanout to other shards and then aggregation
> of results from those other shards.
>
> -- Jack Krupansky
>
> On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Short form: You really have to prototype. Here's the long form:
>>
>>
>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> I've seen between 20M and 200M docs fit on a single piece of hardware,
>> so you'll absolutely have to shard.
>>
>> And the other thing you haven't told us is whether you plan on
>> _adding_ 2B docs a day or whether that number is the total corpus size
>> and you are re-indexing the 2B docs/day. IOW, if you are adding 2B
>> docs/day, 30 days later do you have 2B docs or 60B docs in your
>> corpus?
>>
>> Best,
>> Erick
>>
>> On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar <susheel2...@gmail.com>
>> wrote:
>> > Also if you are expecting indexing of 2 billion docs as NRT or if it
>> will
>> > be offline (during off hours etc).  For more accurate sizing you may
>> also
>> > want to index say 10 million documents which may give you idea how much
>> is
>> > your index size and then use that for extrapolation to come up with
>> memory
>> > requirements.
>> >
>> > Thanks,
>> > Susheel
>> >
>> > On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <
>> > emir.arnauto...@sematext.com> wrote:
>> >
>> >> Hi Mark,
>> >> Can you give us bit more details: size of docs, query types, are docs
>> >> grouped somehow, are they time sensitive, will they update or it is
>> rebuild
>> >> every time, etc.
>> >>
>> >> Thanks,
>> >> Emir
>> >>
>> >>
>> >> On 08.02.2016 16:56, Mark Robinson wrote:
>> >>
>> >>> Hi,
>> >>> We have a requirement where we would need to index around 2 Billion
>> docs
>> >>> in
>> >>> a day.
>> >>> The queries against this indexed data set can be around 80K queries
>> per
>> >>> second during peak time and during non peak hours around 12K queries
>> per
>> >>> second.
>> >>>
>> >>> Can Solr realize this huge volumes.
>> >>>
>> >>> If so, assuming we have no constraints for budget what would be a
>> >>> recommended Solr set up (number of shards, number of Solr instances
>> >>> etc...)
>> >>>
>> >>> Thanks!
>> >>> Mark
>> >>>
>> >>>
>> >> --
>> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> >> Solr & Elasticsearch Support * http://sematext.com/
>> >>
>> >>
>>
>
>


Re: URI is too long

2016-02-06 Thread Jack Krupansky
And you're sure that you can't use the terms query parser, which was
explicitly designed for handling a very long list of terms to be implicitly
ORed?
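
For example (field name is made up), a long list of IDs becomes a single
filter instead of hundreds of OR clauses, and the terms parser is not subject
to the maxBooleanClauses limit:

    fq={!terms f=id}doc1,doc2,doc3,doc4

with the comma-separated list extended as needed (and sent as a POST body
rather than on the URL once it gets long).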

-- Jack Krupansky

On Sat, Feb 6, 2016 at 2:26 PM, Salman Ansari <salman.rah...@gmail.com>
wrote:

> It looked like there was another issue with my query. I had too many
> boolean operators (I believe the maxBooleanClauses property in solrconfig.xml).
> I just looped in batches of 1000 to get all the docs. Not sure if there is a
> better way of handling this.
>
> Regards,
> Salman
>
>
> On Wed, Feb 3, 2016 at 12:29 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
> > On 2/2/2016 1:46 PM, Salman Ansari wrote:
> > > OK then, if there is no way around this problem, can someone tell me
> the
> > > maximum size a POST body can handle in Solr?
> >
> > It is configurable in solrconfig.xml.  Look for the
> > formdataUploadLimitInKB setting in the 5.x configsets.  This setting
> > defaults to 2048, which means 2 megabytes.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: large number of fields

2016-02-05 Thread Jack Krupansky
This doesn't sound like a great use case for Solr - or any other search
engine for that matter. I'm not sure what you are really trying to
accomplish, but you are trying to put way too many balls in the air to
juggle efficiently. You really need to re-conceptualize your problem so
that it has far fewer moving parts. Sure, Solr can handle many millions or
even billions of documents, but the focus for scaling Solr is on more
documents and more nodes, not incredibly complex or large documents. The
key to effective and efficient use of Solr is that queries are "quite
short", definitely not "quite long."

That said, the starting point for any data modeling effort is to look at
the full range of desired queries and that should drive the data model. So,
give us more info on queries, in terms of plain English descriptions of
what the user is trying to achieve.


-- Jack Krupansky

On Fri, Feb 5, 2016 at 8:20 AM, Jan Verweij - Experts in search <
j...@searchxperts.nl> wrote:

> Hi,
> We store 50K products in Solr. We have 10K customers and each
> customer buys up to 10K of these products. Now we want to influence the
> results by adding a field for every customer.
> So we end up with 10K fields to influence the results based on the buying
> behavior of each customer (personal results). I don't think this is the way
> to go, so I'm looking for suggestions on how to solve this.
> One other option would be to:
>  1. create one multivalued field 'company_hitrate'
>  2. store for each company their [companyID]_[hitrate]
>
> During search, use boost fields [companyID]_50 … [companyID]_100. In this
> case the query can become quite long (51 options), but the number of
> fields is limited to 1. What kind of effect would this have on search
> performance?
> Any other suggestions?
> Jan.


Re: indexing pdf binary stored in mongodb?

2016-02-05 Thread Jack Krupansky
See if they are stored in BSON format using GridFS. If so, you can simply
use the mongofiles command to retrieve the PDF into a local file and index
that in Solr either using Solr Cell or Tika.

See:
http://blog.mongodb.org/post/183689081/storing-large-objects-and-files-in-mongodb
https://docs.mongodb.org/manual/reference/program/mongofiles/
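
A rough, untested sketch (database, file, and collection names are made up,
and it assumes the extraction handler /update/extract is enabled in
solrconfig.xml):

    # pull the PDF out of GridFS into a local file
    mongofiles --db mydb get report.pdf

    # push it through Solr Cell (Tika does the text extraction)
    curl "http://localhost:8983/solr/collection1/update/extract?literal.id=report1&commit=true" \
      -F "file=@report.pdf"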


-- Jack Krupansky

On Fri, Feb 5, 2016 at 3:13 PM, Arnett, Gabriel <gabe.arn...@moodys.com>
wrote:

> Anyone have any experience indexing pdfs stored in binary form in mongodb?
>
> .
> Gabe Arnett
> Senior Director
> Moody's Analytics
>


Re: implement exact match for one of the search fields only?

2016-02-04 Thread Jack Krupansky
The desired architecture is that you use a middle app layer that clients
send queries to and that middle app layer then constructs the formal query
and sends it on to Solr proper. This architecture also enables breaking a
user query into multiple Solr queries and then aggregating the results.
Besides, the general goal is to avoid app clients talking directly to Solr
anyway.
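
As a very rough, untested sketch (SolrJ, with the field names from this thread;
the collection URL is made up and input escaping is omitted), the middle layer
could look something like:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ProductSearchService {
      private final HttpSolrClient solr =
          new HttpSolrClient("http://localhost:8983/solr/products");

      // Clients send raw keywords; this layer builds the formal query.
      public QueryResponse search(String keywords) throws Exception {
        SolrQuery q = new SolrQuery();
        // exact match on the string field OR analyzed match on the text field
        q.setQuery("spp_keyword_exact:\"" + keywords + "\" OR P_ShortDescription:("
            + keywords + ")");
        return solr.query(q);
      }
    }

The front-end never changes - it still just sends the keywords - and the query
expansion lives in one place.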

-- Jack Krupansky

On Thu, Feb 4, 2016 at 2:57 AM, Derek Poh <d...@globalsources.com> wrote:

> Hi Erick
>
> <<
> The manual way of doing this would be to construct an elaborate query,
> like q=spp_keyword_exact:"dvd bracket" OR P_ShortDescription:(dvd bracket)
> OR NOTE: the parens are necessary or the last part of the above would
> be parsed as P_ShortDescription:dvd default_searchfield:bracket
> >>
>
> Your suggestion to construct the query like q=spp_keyword_exact:"dvd
> bracket" OR P_ShortDescription:(dvd bracket) OR does not fit into our
> current implementation.
> The front-end pages will only pass the "q=search keywords" in the query to
> solr. The list of search fields (qf) is pre-defined in solr.
>
> Do you have any alternatives to implement your suggestion without making
> changes to the front-end?
>
> On 1/29/2016 1:49 AM, Erick Erickson wrote:
>
>> bq: if you are interested phrase query, you should use String field
>>
>> If you do this, you will NOT be able to search within the string. I.e.
>> if the doc field is "my dog has fleas" you cannot match
>> "dog has" with a string-based field.
>>
>> If you want to match the _entire_ string or you want prefix-only
>> matching, then string might work, i.e. if you _only_ want to be able
>> to match
>>
>> "my dog has fleas"
>> "my dog*"
>> but not
>> "dog has fleas".
>>
>> On to the root question though.
>>
>> I really think you want to look at edismax. What you're trying to do
>> is apply the same search term to individual fields. In particular,
>> the pf parameter will automatically apply the search terms _as a phrase_
>> against the field specified, relieving you of having to enclose things
>> in quotes.
>>
>> The manual way of doing this would be to construct an elaborate query,
>> like
>> q=spp_keyword_exact:"dvd bracket" OR P_ShortDescription:(dvd bracket)
>> OR
>>
>> NOTE: the parens are necessary or the last part of the above would be
>> parsed as
>> P_ShortDescription:dvd default_searchfield:bracket
>>
>> And the =query trick will show you exactly how things are actually
>> searched, it's invaluable.
>>
>> Best,
>> Erick
>>
>> On Thu, Jan 28, 2016 at 5:08 AM, Mugeesh Husain <muge...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> if you are interested phrase query, you should use String field instead
>>> of
>>> text field in schema like as
>>>   
>>>
>>> this will solved you problem.
>>>
>>> if you are missing anything else let share
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/implement-exact-match-for-one-of-the-search-fields-only-tp4253786p4253827.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>>
>
>


Re: Error configuring UIMA

2016-02-01 Thread Jack Krupansky
Yeah, that's exactly the kind of innocent user error that UIMA simply has
no code to detect and reasonably report.

-- Jack Krupansky

On Mon, Feb 1, 2016 at 12:13 PM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> It was a stupid error: I had mistyped the logField configuration for UIMA.
>
> I wanted the error not to use the id but another field, but I mistyped it in
> solrconfig.xml and then got that error.
>
> Gian Maria.
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: lunedì 1 febbraio 2016 16:54
> To: solr-user@lucene.apache.org
> Subject: Re: Error configuring UIMA
>
> What was the specific error you had to correct? The NPE appears to be in
> exception handling code so the actual exception is not indicated in the
> stack trace.
>
> The UIMA code is rather poor in terms of failing to check and report
> missing parameters or bad parameters which in turn reference data that does
> not exist.
>
> -- Jack Krupansky
>
> On Mon, Feb 1, 2016 at 10:18 AM, alkampfer <alkamp...@nablasoft.com>
> wrote:
>
> >
> >
> > From: outlook_288fbf38c031d...@outlook.com
> > To: solr-user@lucene.apache.org
> > Cc:
> > Date: Mon, 1 Feb 2016 15:59:02 +0100
> > Subject: Error configuring UIMA
> >
> > I've solved the problem, it was caused by wrong configuration in
> > solrconfig.xml.
> >
> > Thanks.
> >
> >
> >
> > Hi,
> >
> > I’ve followed the guide
> > https://cwiki.apache.org/confluence/display/solr/UIMA+Integration to
> > set up a UIMA integration to test this feature. The doc is not updated
> > for Solr 5; I’ve followed the latest comment on that guide and made some
> > other changes, but now each request to the /update handler fails with the
> > following error.
> >
> > Does someone have a clue on what I did wrong?
> >
> > Thanks in advance.
> >  > {>   "responseHeader": {> "status": 500,> "QTime": 443>   },>
> > "error": {> "trace": "java.lang.NullPointerException\n\tat
> > org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(U
> > IMAUpdateRequestProcessor.java:105)\n\tat
> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.pro
> > cessUpdate(JsonLoader.java:143)\n\tat
> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.loa
> > d(JsonLoader.java:113)\n\tat
> > org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:76)\n\t
> > at
> > org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandl
> > er.java:98)\n\tat
> > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Con
> > tentStreamHandlerBase.java:74)\n\tat
> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandle
> > rBase.java:143)\n\tat
> > org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)\n\tat
> > org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)\n\
> > tat
> > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)\n\tat
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter
> > .java:210)\n\tat
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter
> > .java:179)\n\tat
> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletH
> > andler.java:1652)\n\tat
> > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:
> > 585)\n\tat
> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.ja
> > va:143)\n\tat
> > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java
> > :577)\n\tat
> > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandle
> > r.java:223)\n\tat
> > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandle
> > r.java:1127)\n\tat
> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:5
> > 15)\n\tat
> > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler
> > .java:185)\n\tat
> > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler
> > .java:1061)\n\tat
> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.ja
> > va:141)\n\tat
> > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(Conte
> > xtHandlerCollection.java:215)\n\tat
> > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerColle
> > ction.java:110)\n\tat
> > org.eclipse.jetty.server.handler.Handler

Re: Error in UIMA, probably opencalais,

2016-02-01 Thread Jack Krupansky
At the bottom (the fine print!) it says: lineNumber: 15; columnNumber: 7;
The element type "meta" must be terminated by the matching end-tag
"".

-- Jack Krupansky

On Mon, Feb 1, 2016 at 10:45 AM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> Hi,
>
>
>
> I’ve configured integration with UIMA but when I try to add a document I
> always got the error reported at bottom of the mail.
>
>
>
> It seems to be related to openCalais, but I’ve registered to OpenCalais
> and setup my token in solrconfig, so I wonder if anyone has some clue on
> what could be the reason of the error.
>
>
>
> I’m running this on Solr 5.3.1 instance running on linux.
>
>
>
> Gian Maria.
>
>
>
> null:org.apache.solr.common.SolrException: processing error null.
> id=doc4,  text="This is some textual content to verify UIMA integration..."
>
>  at
> org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:127)
>
>  at
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:250)
>
>  at
> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
>
>  at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
>
>  at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>
>  at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>
>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
>
>  at
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)
>
>  at
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)
>
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)
>
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
>
>  at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>
>  at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>
>  at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>
>  at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>
>  at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>
>  at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>
>  at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>
>  at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>
>  at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>
>  at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>
>  at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>
>  at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>
>  at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>
>  at org.eclipse.jetty.server.Server.handle(Server.java:499)
>
>  at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>
>  at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>
>  at
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>
>  at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>
>  at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>
>  at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
>
>  *at
> org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:208)*
>
>  at
> org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:56)
>
>  at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
>
>  at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
>
>  at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
>

Re: alternative forum for SOLR user

2016-02-01 Thread Jack Krupansky
Some people prefer to use Stack Overflow, but this mailing list is still
the definitive "forum" for Solr users.

See:
http://stackoverflow.com/questions/tagged/solr


-- Jack Krupansky

On Mon, Feb 1, 2016 at 10:58 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 2/1/2016 1:13 AM, Jean-Jacques MONOT wrote:
> > I am a newbie with SOLR and just registered to this mailing list.
> >
> > Is there an alternative forum for SOLR user ? I am using this mailing
> > list for support, but did not find "real" web forum.
>
> Are you using "forum" as a word that can include a mailing list, or are
> you talking explicitly about a website for Solr that is running forum
> software?
>
> There is at least one "forum" website that actually mirrors this mailing
> list -- posts made on the forum are sent to the mailing list, and
> vice-versa.  The example I am thinking of is Nabble.
>
> This mailing list is the primary official path to find support on Solr
> -- the list is run by the Apache Software Foundation, which owns all
> rights connected to Solr.  There is no official "forum" website for the
> project, and nothing like it is planned for the near future.  Nabble is
> a third-party website.
>
> There are some third-party systems, entirely separate from this mailing
> list, that offer community support for Solr, such as stackoverflow.
> Another possibility is the #solr IRC channel, which is not exactly an
> official resource, but is frequented by users who have an official
> connection with the project.
>
> Thanks,
> Shawn
>
>


Re: Error configuring UIMA

2016-02-01 Thread Jack Krupansky
What was the specific error you had to correct? The NPE appears to be in
exception handling code so the actual exception is not indicated in the
stack trace.

The UIMA code is rather poor in terms of failing to check and report
missing parameters or bad parameters which in turn reference data that does
not exist.

-- Jack Krupansky

On Mon, Feb 1, 2016 at 10:18 AM, alkampfer <alkamp...@nablasoft.com> wrote:

>
>
> From: outlook_288fbf38c031d...@outlook.com
> To: solr-user@lucene.apache.org
> Cc:
> Date: Mon, 1 Feb 2016 15:59:02 +0100
> Subject: Error configuring UIMA
>
> I've solved the problem, it was caused by wrong configuration in
> solrconfig.xml.
>
> Thanks.
>
>
>
> > Hi,
> >
> > I’ve followed the guide
> > https://cwiki.apache.org/confluence/display/solr/UIMA+Integration to
> > set up a UIMA integration to test this feature. The doc is not updated for
> > Solr 5; I’ve followed the latest comment on that guide and made some other
> > changes, but now each request to the /update handler fails with the following
> > error.
> >
> > Does someone have a clue on what I did wrong?
> >
> > Thanks in advance.
>  > {>   "responseHeader": {> "status": 500,> "QTime": 443>   },>
> "error": {> "trace": "java.lang.NullPointerException\n\tat
> org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:105)\n\tat
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:143)\n\tat
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:113)\n\tat
> org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:76)\n\tat
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)\n\tat
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)\n\tat
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat
> org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)\n\tat
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)\n\tat
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)\n\tat
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)\n\tat
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)\n\tat
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)\n\tat
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)\n\tat
> org.eclipse.jetty.server.Server.handle(Server.java:499)\n\tat
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)\n\tat
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)\n\tat
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)\n\tat
> java.lang.Thread.run(Thread.java:745)\n",> "code": 500>   }> }>  > --
> > Gian Maria Ricci
> > Cell: +39 320 0136949> >
>
>


Re: Determine if Merge is triggered in SOLR

2016-01-31 Thread Jack Krupansky
You would have to implement your own MergeScheduler that wraps an
existing merge scheduler and saves the merge info, and then write a
custom request handler to retrieve that saved info.

See:
https://lucene.apache.org/core/5_4_1/core/org/apache/lucene/index/MergeScheduler.html
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
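
A bare-bones, untested sketch of the wrapping part (class and counter names are
made up):

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.MergeTrigger;

    public class CountingMergeScheduler extends ConcurrentMergeScheduler {
      // bumped every time Lucene hands the scheduler new merges to run
      public static final AtomicLong MERGE_REQUESTS = new AtomicLong();

      @Override
      public synchronized void merge(IndexWriter writer, MergeTrigger trigger,
                                     boolean newMergesFound) throws IOException {
        MERGE_REQUESTS.incrementAndGet();
        super.merge(writer, trigger, newMergesFound); // normal merging still happens
      }
    }

Register it in the <indexConfig> section of solrconfig.xml (with the jar in a
lib directory Solr loads):

    <mergeScheduler class="com.example.CountingMergeScheduler"/>

and the custom request handler (not shown) would simply report MERGE_REQUESTS.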


-- Jack Krupansky

On Sun, Jan 31, 2016 at 1:59 PM, abhi Abhishek <abhi26...@gmail.com> wrote:

> Hi All,
> any suggestions/ ideas?
>
> Thanks,
> Abhishek
>
> On Tue, Jan 26, 2016 at 9:16 PM, abhi Abhishek <abhi26...@gmail.com>
> wrote:
>
> > Hi All,
> > is there a way in Solr to determine if a merge has been triggered?
> > Is there an API exposed to query this?
> >
> > If it's not available, is there a way to do the same using the Lucene jar
> > files available in the Solr libs?
> >
> > Appreciate your help.
> >
> > Best Regards,
> > Abhishek
> >
>


Re: URI is too long

2016-01-31 Thread Jack Krupansky
Or try the terms query parser that lets you eliminate all the OR operators:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser
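
And whichever parser you use, sending the parameters as a POST body avoids the
URI length limit entirely - a rough example (collection and field names are
made up):

    curl "http://localhost:8983/solr/collection1/select" \
      --data-urlencode 'q=*:*' \
      --data-urlencode 'fq={!terms f=id}1,2,3,4,5' \
      --data-urlencode 'wt=json'

The POST body size is governed by formdataUploadLimitInKB in solrconfig.xml
(2 MB by default), not by the container's URI limit.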


-- Jack Krupansky

On Sun, Jan 31, 2016 at 9:23 AM, Paul Libbrecht <p...@hoplahup.net> wrote:

> How about using POST?
>
> paul
>
> > Salman Ansari <mailto:salman.rah...@gmail.com>
> > 31 January 2016 at 15:20
> > Hi,
> >
> > I am building a long query containing multiple ORs between query terms. I
> > started to receive the following exception:
> >
> > The remote server returned an error: (414) Request-URI Too Long. Any idea
> > what is the limit of the URL in Solr? Moreover, as a solution I was
> > thinking of chunking the query into multiple requests but I was wondering
> > if anyone has a better approach?
> >
> > Regards,
> > Salman
> >
>
>

