Managed schema used with Cloudera MapreduceIndexerTool and morphlines?

2017-03-17 Thread Jay Hill
I've got a very difficult project to tackle. I've been tasked with using
schemaless mode to index json files that we receive. The structure of the
json files will always be very different as we're receiving files from
different customers totally unrelated to one another. We are attempting to
build a "one size fits all" approach to receiving documents from a wide
variety of sources and then index them into Solr.

We're running in Solr 5.3. The schemaless approach works well enough -
until it doesn't. It seems to fail on type guessing and also gets confused
indexing to different shards. If it was reliable it would be the perfect
solution for our task. But the larger the JSON file the more likely it is
to fail. At a certain size it just doesn't work.

I've been advised by some experts and committers that schemaless is a good
tool for prototyping but risky to run in production. Still, we thought we would
try it by doing offline indexing with the Cloudera MapReduceIndexerTool to build
offline indexes - while still using managed schemas. This MapReduce tool uses
morphlines, a nifty ETL library that pipes together a series of commands to
transform data. For a simple example, a JSON or CSV file can be processed and
loaded into a Solr index with a "readJson" command piped to a "loadSolr" command.

But the Kite SDK that manages the morphlines only seems to offer, as its
latest version, Solr *4.10.3*-cdh5.10.0 (their customized version of
4.10.3).

So I can't see any way to integrate schemaless (which depends on functionality
added after 4.10.3) with the morphlines.

But I thought I would ask here: Anybody had ANY experience with morphlines
to index to Solr? Any info would help me make sense of this.

Cheers to all!


Cascading failures with replicas

2017-03-17 Thread Walter Underwood
I’m running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I shut 
down Solr on one host because it got into some kind of bad, can’t-recover state 
where it was causing timeouts across the whole cluster (bug #1).

I ran a load benchmark near the capacity of the cluster. This had run fine in 
test; this run was against the prod cluster.

Solr Cloud added a replica to replace the down node. The node with two cores 
got double the traffic and started slowly flapping in and out of service. The 
95th percentile response spiked from 3 seconds to 100 seconds. At some point, 
another replica was made, with two replicas from the same shard on the same 
instance. Naturally, that was overloaded, and I killed the benchmark out of 
charity.

Bug #2 is creating a new replica when one host is down. This should be an 
option and default to “false”, because it causes the cascade.

Bug #3 is sending equal traffic to each core without considering the host. Each 
host should get equal traffic, not each core.

Bug #4 is putting two replicas from the same shard on one instance. That is 
just asking for trouble.

When it works, this cluster is awesome.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: How on EARTH do I remove 's in schema file?

2017-03-17 Thread donato
Thanks for the response, Erik!

Can you download my schema file here?  CLICK HERE  
.

I'm not too familiar with this technology yet. I tried adding that
&debug=query at the end of my URL, but nothing happened.

Thanks again for the response! All along, I just wanted queries for cat,
cats, kitten and kitties to return the same number of results - and it does
- partially because of the synonyms.txt file. 

But this apostrophe thing is killing me!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709p4325718.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: analysis matches aren't counting as matches in query

2017-03-17 Thread Erick Erickson
The most common issue here is that the query isn't being parsed like
you think it is. The simplest is that the query has spaces somewhere.
I.e. "q=f1:a b" gets parsed as

q=f1:a default_field:b

The analysis page (which I assume you're talking about) tells you what
happens _after_ the query is parsed, and the input bits are
apportioned up over the fields.

I'd start by adding &debug=query to the URL and see what the parsed
query looks like.
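For instance (collection name and values here are just illustrative), a request like

  http://localhost:8983/solr/yourcollection/select?q=*:*&fq=vendor_text:(RDTF+mentor)&debug=query&wt=json

returns a "debug" section whose "parsedquery" and "parsed_filter_queries"
entries show exactly which field each clause ended up on after parsing.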

Best,
Erick



On Fri, Mar 17, 2017 at 2:36 PM, John Blythe  wrote:
> hi all,
>
> i'm having a hard time w understanding why i'm not getting hits on a
> manufacturer field that i recently updated.
>
> i get the following results, the top row being the index analysis and the
> second the query.
>
> RDTF
> mentor
> advanced
> sterilize
>
> RDTF
> mentor
>
> advanced
> sterilize
>
> yet when the values i used above for index and query analyses are present
> in a document and query (as a filterquery) i get zero results. the rest of
> the query brings back the expected the results when i remove the
> fq=vendor_text:(foo) that produces the above output in Analysis.
>
> any clue what i'm missing?
>
> thanks!


Re: How on EARTH do I remove 's in schema file?

2017-03-17 Thread Erick Erickson
What stemmers are you using? I got the results I expected by using
EnglishPossessiveFilterFactory followed by PorterStemFilterFactory.

Or you could use Porter and remove the leftover trailing apostrophe.
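A field type along these lines (the type name and tokenizer choice are just an
example, not your actual schema) is one way to wire that up; the same analyzer
applies at index and query time, so "patrick's", "patricks" and "patrick" all
reduce to the same term:

  <fieldType name="text_en_noposs" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- strips the trailing 's before stemming -->
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>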

Best,
Erick

On Fri, Mar 17, 2017 at 5:05 PM, Erick Erickson  wrote:
> Your schema file didn't come through. Have you tried looking at the
> admin UI/Analysis page for the three values? That often tells you what
> is going on.
>
> The other thing to do is attach &debug=query to the URL. That'll show
> you how the query parsed, which is separate from the analysis bits.
>
> Best,
> Erick
>
> On Fri, Mar 17, 2017 at 3:30 PM, donato  wrote:
>> I have been racking my brain for days... I need to remove 's from, say,
>> "patrick's". If I search for "patrick" or "patricks" I get the same number of
>> results; however, if I search for "patrick's" it's a different number. I
>> just want Solr to ignore the 's. Can someone PLEASE help me? It is driving
>> me nuts. Here is my schema file...
>> Id  Name
>>
>>
>>
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709.html
>> Sent from the Solr - User mailing list archive at Nabble.com.


Re: How on EARTH do I remove 's in schema file?

2017-03-17 Thread Erick Erickson
Your schema file didn't come through. Have you tried looking at the
admin UI/Analysis page for the three values? That often tells you what
is going on.

The other thing to do is attach &debug=query to the URL. That'll show
you how the query parsed, which is separate from the analysis bits.

Best,
Erick

On Fri, Mar 17, 2017 at 3:30 PM, donato  wrote:
> I have been racking my brain for days... I need to remove 's from, say,
> "patrick's". If I search for "patrick" or "patricks" I get the same number of
> results; however, if I search for "patrick's" it's a different number. I
> just want Solr to ignore the 's. Can someone PLEASE help me? It is driving
> me nuts. Here is my schema file...
> Id  Name
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Solr Split

2017-03-17 Thread Azazel K
Hi,


We have a solr index running in 4.5.0 that we are trying to upgrade to 4.7.2 
and split the shard.


The uniqueKey is a TrieLongField, and its values are always negative:


In prod (2 shards, 1 replica for each shard)

Max : -9223372035490849922
Min : -9223372036854609508


In lab (1 shard, 1 replica): negatives between -21339 and -9223372036854687955,
plus a couple of documents with positive values.


To test out if it will work, I executed following steps in lab on new instances 
containing solr/zk cluster


1. From the old cluster(C1), zip up the index while server is not running.  We 
are not going to the source to re-index into new cluster(C2) as we don't own 
the data(yes. that's being addressed).

2. On the new cluster (C2), create a new collection.

3. Stop the Tomcat server in the new cluster (C2).

4. Overwrite the new cluster's index (C2) with the original cluster's index (C1).

5. Start the server and optimize (C2).

6. The number of docs matches the original cluster and everything seems to work. 
Total documents: 4021887 (C2 and C1).


then


1. Split the shard in new cluster(C2).  For 1.5 GB it takes around 6 minutes to 
complete.

2. Two shards are created, with hash range(0-7fff and 8000-). 
Original range was 8000-.

Num docs for hash range "0-7fff" is 4021886, and for " 8000-" 
is "3680519"


Apparently, both shards contain a lot of duplicate documents.  It's not properly 
sharded at all.  I tried this twice with the same output.  What might be the 
issue here?


Any pointers really appreciated.


Azazel


How on EARTH do I remove 's in schema file?

2017-03-17 Thread donato
I have been racking my brain for days... I need to remove 's from, say,
"patrick's". If I search for "patrick" or "patricks" I get the same number of
results; however, if I search for "patrick's" it's a different number. I
just want Solr to ignore the 's. Can someone PLEASE help me? It is driving
me nuts. Here is my schema file...

Id  Name



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-on-EARTH-do-I-remove-s-in-schema-file-tp4325709.html
Sent from the Solr - User mailing list archive at Nabble.com.

analysis matches aren't counting as matches in query

2017-03-17 Thread John Blythe
hi all,

i'm having a hard time w understanding why i'm not getting hits on a
manufacturer field that i recently updated.

i get the following results, the top row being the index analysis and the
second the query.

RDTF
mentor
advanced
sterilize

RDTF
mentor

advanced
sterilize

yet when the values i used above for index and query analyses are present
in a document and query (as a filterquery) i get zero results. the rest of
the query brings back the expected the results when i remove the
fq=vendor_text:(foo) that produces the above output in Analysis.

any clue what i'm missing?

thanks!


Re: Parallelizing post filter for better performance

2017-03-17 Thread Joel Bernstein
You'll probably get better results by trying to get more performance out of
your single-threaded postfilter. If you can post the code in your collect()
method you may get some ideas on how to improve the performance.
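For illustration only (class and method names are invented), the collect() of
the DelegatingCollector returned by a PostFilter's getFilterCollector()
usually looks something like the sketch below, and the per-document work
inside it is where the time goes:

  import java.io.IOException;
  import org.apache.solr.search.DelegatingCollector;

  public class AccessCheckCollector extends DelegatingCollector {
    @Override
    public void collect(int doc) throws IOException {
      // 'doc' is a segment-local id. Anything expensive here (stored-field reads,
      // network calls, per-doc object allocation) is multiplied by every candidate doc.
      if (passesCheck(doc)) {
        super.collect(doc);   // forward matching docs to the next collector in the chain
      }
    }

    // Hypothetical check; ideally backed by per-segment docValues or a
    // precomputed bitset rather than recomputed from scratch per document.
    private boolean passesCheck(int doc) {
      return true;
    }
  }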

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Mar 17, 2017 at 2:13 PM, Sundeep T  wrote:

> Hello,
>
> Is there a way to execute the post filter in a parallel mode so that
> multiple query results can be filtered in parallel?
>
> Right now, in our code, the post filter is becoming kind of bottleneck as
> we had to do some post processing on every returned result, and it runs
> serially in a single thread.
>
> Thanks
> Sundeep
>


Re: Parallelizing post filter for better performance

2017-03-17 Thread Mikhail Khludnev
Lucene can search segments in parallel. In Solr, you can break the index into
multiple shards/cores and do a distributed search even on a single Solr instance
(even without SolrCloud).
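For example (core names and port are placeholders), a request like

  http://localhost:8983/solr/core1/select?q=foo&shards=localhost:8983/solr/core1,localhost:8983/solr/core2

fans the query out over both cores and merges the results, so each core's
postfilter runs in parallel with the others.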

On Fri, Mar 17, 2017 at 9:13 PM, Sundeep T  wrote:

> Hello,
>
> Is there a way to execute the post filter in a parallel mode so that
> multiple query results can be filtered in parallel?
>
> Right now, in our code, the post filter is becoming kind of bottleneck as
> we had to do some post processing on every returned result, and it runs
> serially in a single thread.
>
> Thanks
> Sundeep
>



-- 
Sincerely yours
Mikhail Khludnev


Re: fq performance

2017-03-17 Thread Yonik Seeley
On Fri, Mar 17, 2017 at 2:17 PM, Shawn Heisey  wrote:
> On 3/17/2017 8:11 AM, Yonik Seeley wrote:
>> For Solr 6.4, we've managed to circumvent this for filter queries and
>> other contexts where scoring isn't needed.
>> http://yonik.com/solr-6-4/  "More efficient filter queries"
>
> Nice!
>
> If the filter looks like the following (because q.op=AND), does it still
> use TermsQuery?
>
> fq=id:(id1 OR id2 OR id3 OR ... id2000)

Yep, that works as well.  As does fq=id:id1 OR id:id2 OR id:id3 ...
Was implemented here: https://issues.apache.org/jira/browse/SOLR-9786

-Yonik


Re: SOLR Data Locality

2017-03-17 Thread Toke Eskildsen
Imad Qureshi  wrote:
> I understand that but unfortunately that's not an option right now.
> We already have 16 TB of index in HDFS.
> 
> So let me rephrase this question. How important is data locality for
> SOLR. Is performance impacted if SOLR data is on a remote node?

The short answer is yes, the long answer is 
https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Anecdotally we did some experiments prior to building our multi-TB search 
setup, where we compared local SSDs with remote (Isilon) SSDs. That setup was 
with simple searches and some faceting. I was a bit surprised that the slowdown 
was only 3x. I would expect the speed difference to be even smaller if the 
underlying storage is slow (spinning disks). Old blog post at 
https://sbdevel.wordpress.com/2013/12/06/danish-webscale/


I don't understand the expected gain of adding replicas, if the data are 
remote. Why can't the replica Solrs run on the nodes with the data? Do you have 
very CPU-intensive search?

- Toke Eskildsen


Re: fq performance

2017-03-17 Thread Shawn Heisey
On 3/17/2017 8:11 AM, Yonik Seeley wrote:
> For Solr 6.4, we've managed to circumvent this for filter queries and
> other contexts where scoring isn't needed.
> http://yonik.com/solr-6-4/  "More efficient filter queries"

Nice!

If the filter looks like the following (because q.op=AND), does it still
use TermsQuery?

fq=id:(id1 OR id2 OR id3 OR ... id2000)

Thanks,
Shawn



Parallelizing post filter for better performance

2017-03-17 Thread Sundeep T
Hello,

Is there a way to execute the post filter in a parallel mode so that
multiple query results can be filtered in parallel?

Right now, in our code, the post filter is becoming kind of bottleneck as
we had to do some post processing on every returned result, and it runs
serially in a single thread.

Thanks
Sundeep


Re: Data Import

2017-03-17 Thread Mike Thomsen
If Solr is down, then adding through SolrJ would fail as well. Kafka's new
API has some great features for this sort of thing. The new client API is
designed to be run in a long-running loop where you poll for new messages
with a defined timeout (e.g. consumer.poll(1000) for 1s).
So if Solr becomes unstable or goes down, it's easy to have the consumer
just stop and either wait until Solr comes back up or save the data to
disk/commit the Kafka offsets to ZK and stop running.
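For what it's worth, a bare-bones sketch of that loop (topic, collection, and
host names are placeholders; it assumes the pre-2.0 Kafka consumer API and a
SolrJ 6.x-style CloudSolrClient.Builder, and a real payload would be parsed
rather than dumped into one field):

  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class KafkaToSolr {
    public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.put("bootstrap.servers", "kafka1:9092");
      props.put("group.id", "solr-indexer");
      props.put("enable.auto.commit", "false");   // we commit offsets ourselves
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
           CloudSolrClient solr = new CloudSolrClient.Builder().withZkHost("zk1:2181").build()) {
        solr.setDefaultCollection("docs");
        consumer.subscribe(Collections.singletonList("doc-updates"));
        while (true) {
          ConsumerRecords<String, String> records = consumer.poll(1000);  // 1s timeout, as above
          if (records.isEmpty()) continue;
          try {
            for (ConsumerRecord<String, String> rec : records) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", rec.key());
              doc.addField("body_txt", rec.value());
              solr.add(doc);
            }
            solr.commit();
            consumer.commitSync();   // commit offsets only after Solr accepted the batch
          } catch (Exception e) {
            // Solr unstable or down: offsets were not committed, so the batch
            // will be re-polled; back off and retry instead of dropping data.
            Thread.sleep(5000);
          }
        }
      }
    }
  }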

On Fri, Mar 17, 2017 at 1:24 PM, OTH  wrote:

> Are Kafka and SQS interchangeable?  (The latter does not seem to be free.)
>
> @Wunder:
> I'm assuming, that updating to Solr would fail if Solr is unavailable not
> just if posting via say a DB trigger, but probably also if trying to post
> through SolrJ?  (Which is what I'm using for now.)  So, even if using
> SolrJ, it would be a good idea to use a queuing software?
>
> Thanks
>
> On Fri, Mar 17, 2017 at 10:12 PM, vishal jain  wrote:
>
> > Streaming the data through kafka would be a good option if near real time
> > data indexing is the key requirement.
> > In our application the RDBMS data is populated by an ETL job periodically
> > so we don't need real time data indexing for now.
> >
> > Cheers,
> > Vishal
> >
> > On Fri, Mar 17, 2017 at 10:30 PM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> > > Or set a trigger on your RDBMS's main table to put the relevant
> > > information in a different table (call it EVENTS) and have your SolrJ
> > > consult the EVENTS table periodically. Essentially you're using the
> > > EVENTS table as a queue where the trigger is the producer and the
> > > SolrJ program is the consumer.
> > >
> > > It's a polling solution though, so not event-driven. There's no
> > > mechanism that I know of have, say, your RDBMS push an event to DIH
> > > for instance.
> > >
> > > Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
> > > for this kind of problem..
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
> > >  wrote:
> > > > One assumes by hooking into the same code that updates RDBMS, as
> > > > opposed to be reverse engineering the changes from looking at the DB
> > > > content. This would be especially the case for Delete changes.
> > > >
> > > > Regards,
> > > >Alex.
> > > > 
> > > > http://www.solr-start.com/ - Resources for Solr users, new and
> > > experienced
> > > >
> > > >
> > > > On 17 March 2017 at 11:37, OTH  wrote:
> > > >>>
> > > >>> Also, solrj is good when you want your RDBMS updates make
> immediately
> > > >>> available in solr.
> > > >>
> > > >> How can SolrJ be used to make RDBMS updates immediately available?
> > > >> Thanks
> > > >>
> > > >> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
> > > sujaybawas...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Hi Vishal,
> > > >>>
> > > >>> As per my experience DIH is the best for RDBMS to solr index. DIH
> > with
> > > >>> caching has best performance. DIH nested entities allow you to
> define
> > > >>> simple queries.
> > > >>> Also, solrj is good when you want your RDBMS updates make
> immediately
> > > >>> available in solr. DIH full import can be used for index all data
> > first
> > > >>> time or restore index in case index is corrupted.
> > > >>>
> > > >>> Thanks,
> > > >>> Sujay
> > > >>>
> > > >>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> > > wrote:
> > > >>>
> > > >>> > Hi,
> > > >>> >
> > > >>> >
> > > >>> > I am new to Solr and am trying to move data from my RDBMS to
> Solr.
> > I
> > > know
> > > >>> > the available options are:
> > > >>> > 1) Post Tool
> > > >>> > 2) DIH
> > > >>> > 3) SolrJ (as ours is a J2EE application).
> > > >>> >
> > > >>> > I want to know what is the recommended way for Data import in
> > > production
> > > >>> > environment.
> > > >>> > Will sending data via SolrJ in batches be faster than posting a
> csv
> > > using
> > > >>> > POST tool?
> > > >>> >
> > > >>> >
> > > >>> > Thanks,
> > > >>> > Vishal
> > > >>> >
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Thanks,
> > > >>> Sujay P Bawaskar
> > > >>> M:+91-77091 53669
> > > >>>
> > >
> >
>


Re: Data Import

2017-03-17 Thread OTH
Are Kafka and SQS interchangeable?  (The latter does not seem to be free.)

@Wunder:
I'm assuming, that updating to Solr would fail if Solr is unavailable not
just if posting via say a DB trigger, but probably also if trying to post
through SolrJ?  (Which is what I'm using for now.)  So, even if using
SolrJ, it would be a good idea to use a queuing software?

Thanks

On Fri, Mar 17, 2017 at 10:12 PM, vishal jain  wrote:

> Streaming the data through kafka would be a good option if near real time
> data indexing is the key requirement.
> In our application the RDBMS data is populated by an ETL job periodically
> so we don't need real time data indexing for now.
>
> Cheers,
> Vishal
>
> On Fri, Mar 17, 2017 at 10:30 PM, Erick Erickson 
> wrote:
>
> > Or set a trigger on your RDBMS's main table to put the relevant
> > information in a different table (call it EVENTS) and have your SolrJ
> > consult the EVENTS table periodically. Essentially you're using the
> > EVENTS table as a queue where the trigger is the producer and the
> > SolrJ program is the consumer.
> >
> > It's a polling solution though, so not event-driven. There's no
> > mechanism that I know of have, say, your RDBMS push an event to DIH
> > for instance.
> >
> > Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
> > for this kind of problem..
> >
> > Best,
> > Erick
> >
> > On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
> >  wrote:
> > > One assumes by hooking into the same code that updates RDBMS, as
> > > opposed to be reverse engineering the changes from looking at the DB
> > > content. This would be especially the case for Delete changes.
> > >
> > > Regards,
> > >Alex.
> > > 
> > > http://www.solr-start.com/ - Resources for Solr users, new and
> > experienced
> > >
> > >
> > > On 17 March 2017 at 11:37, OTH  wrote:
> > >>>
> > >>> Also, solrj is good when you want your RDBMS updates make immediately
> > >>> available in solr.
> > >>
> > >> How can SolrJ be used to make RDBMS updates immediately available?
> > >> Thanks
> > >>
> > >> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
> > sujaybawas...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi Vishal,
> > >>>
> > >>> As per my experience DIH is the best for RDBMS to solr index. DIH
> with
> > >>> caching has best performance. DIH nested entities allow you to define
> > >>> simple queries.
> > >>> Also, solrj is good when you want your RDBMS updates make immediately
> > >>> available in solr. DIH full import can be used for index all data
> first
> > >>> time or restore index in case index is corrupted.
> > >>>
> > >>> Thanks,
> > >>> Sujay
> > >>>
> > >>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> > wrote:
> > >>>
> > >>> > Hi,
> > >>> >
> > >>> >
> > >>> > I am new to Solr and am trying to move data from my RDBMS to Solr.
> I
> > know
> > >>> > the available options are:
> > >>> > 1) Post Tool
> > >>> > 2) DIH
> > >>> > 3) SolrJ (as ours is a J2EE application).
> > >>> >
> > >>> > I want to know what is the recommended way for Data import in
> > production
> > >>> > environment.
> > >>> > Will sending data via SolrJ in batches be faster than posting a csv
> > using
> > >>> > POST tool?
> > >>> >
> > >>> >
> > >>> > Thanks,
> > >>> > Vishal
> > >>> >
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Thanks,
> > >>> Sujay P Bawaskar
> > >>> M:+91-77091 53669
> > >>>
> >
>


RE: Data Import

2017-03-17 Thread Liu, Daphne
NO, I use the free version. I have the driver from someone else. I can share it 
if you want to use Cassandra.
They have modified it for me since the free JDBC driver I found would time out 
when a document is greater than 16 MB.

Kind regards,

Daphne Liu
BI Architect - Matrix SCM

CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 
USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 / 
daphne@cevalogistics.com



-Original Message-
From: vishal jain [mailto:jain02...@gmail.com]
Sent: Friday, March 17, 2017 12:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Data Import

Hi Daphne,

Are you using DSE?


Thanks & Regards,
Vishal

On Fri, Mar 17, 2017 at 7:40 PM, Liu, Daphne 
wrote:

> I just want to share my recent project. I have successfully sent all
> our EDI documents to Cassandra 3.7 clusters using Solr 6.3 Data Import
> JDBC Cassandra connector indexing our documents.
> Since Cassandra is so fast for writing, compression rate is around 13%
> and all my documents can be kept in my Cassandra clusters' memory, we
> are very happy with the result.
>
>
> Kind regards,
>
> Daphne Liu
> BI Architect - Matrix SCM
>
> CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL
> 32256 USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 /
> daphne@cevalogistics.com
>
>
>
> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, March 17, 2017 9:54 AM
> To: solr-user 
> Subject: Re: Data Import
>
> I feel DIH is much better for prototyping, even though people do use
> it in production. If you do want to use DIH, you may benefit from
> reviewing the DIH-DB example I am currently rewriting in
> https://issues.apache.org/jira/browse/SOLR-10312 (may need to change
> luceneMatchVersion in solrconfig.xml first).
>
> CSV, etc, could be useful if you want to keep history of past imports,
> again useful during development, as you evolve schema.
>
> SolrJ may actually be easiest/best for production since you already
> have Java stack.
>
> The choice is yours in the end.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
>
>
> On 17 March 2017 at 08:56, Shawn Heisey  wrote:
> > On 3/17/2017 3:04 AM, vishal jain wrote:
> >> I am new to Solr and am trying to move data from my RDBMS to Solr.
> >> I
> know the available options are:
> >> 1) Post Tool
> >> 2) DIH
> >> 3) SolrJ (as ours is a J2EE application).
> >>
> >> I want to know what is the recommended way for Data import in
> >> production environment. Will sending data via SolrJ in batches be
> faster than posting a csv using POST tool?
> >
> > I've heard that CSV import runs EXTREMELY fast, but I have never
> > tested it.  The same threading problem that I discuss below would
> > apply to indexing this way.
> >
> > DIH is extremely powerful, but it has one glaring problem:  It's
> > single-threaded, which means that only one stream of data is going
> > into Solr, and each batch of documents to be inserted must wait for
> > the previous one to finish inserting before it can start.  I do not
> > know if DIH batches documents or sends them in one at a time.  If
> > you have a manually sharded index, you can run DIH on each shard in
> > parallel, but each one will be single-threaded.  That single thread
> > is pretty efficient, but it's still only one thread.
> >
> > Sending multiple index updates to Solr in parallel (multi-threading)
> > is how you radically speed up the Solr part of indexing.  This is
> > usually done with a custom indexing program, which might be written
> > with SolrJ or even in a completely different language.
> >
> > One thing to keep in mind with ANY indexing method:  Once the
> > situation is examined closely, most people find that it's not Solr
> > that makes their indexing slow.  The bottleneck is usually the
> > source system -- how quickly the data can be retrieved.  It usually
> > takes a lot longer to obtain the data than it does for Solr to index it.
> >
> > Thanks,
> > Shawn
> >

Re: Data Import

2017-03-17 Thread vishal jain
Streaming the data through kafka would be a good option if near real time
data indexing is the key requirement.
In our application the RDBMS data is populated by an ETL job periodically
so we don't need real time data indexing for now.

Cheers,
Vishal

On Fri, Mar 17, 2017 at 10:30 PM, Erick Erickson 
wrote:

> Or set a trigger on your RDBMS's main table to put the relevant
> information in a different table (call it EVENTS) and have your SolrJ
> consult the EVENTS table periodically. Essentially you're using the
> EVENTS table as a queue where the trigger is the producer and the
> SolrJ program is the consumer.
>
> It's a polling solution though, so not event-driven. There's no
> mechanism that I know of have, say, your RDBMS push an event to DIH
> for instance.
>
> Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
> for this kind of problem..
>
> Best,
> Erick
>
> On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
>  wrote:
> > One assumes by hooking into the same code that updates RDBMS, as
> > opposed to be reverse engineering the changes from looking at the DB
> > content. This would be especially the case for Delete changes.
> >
> > Regards,
> >Alex.
> > 
> > http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
> >
> >
> > On 17 March 2017 at 11:37, OTH  wrote:
> >>>
> >>> Also, solrj is good when you want your RDBMS updates make immediately
> >>> available in solr.
> >>
> >> How can SolrJ be used to make RDBMS updates immediately available?
> >> Thanks
> >>
> >> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
> sujaybawas...@gmail.com>
> >> wrote:
> >>
> >>> Hi Vishal,
> >>>
> >>> As per my experience DIH is the best for RDBMS to solr index. DIH with
> >>> caching has best performance. DIH nested entities allow you to define
> >>> simple queries.
> >>> Also, solrj is good when you want your RDBMS updates make immediately
> >>> available in solr. DIH full import can be used for index all data first
> >>> time or restore index in case index is corrupted.
> >>>
> >>> Thanks,
> >>> Sujay
> >>>
> >>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> >
> >>> > I am new to Solr and am trying to move data from my RDBMS to Solr. I
> know
> >>> > the available options are:
> >>> > 1) Post Tool
> >>> > 2) DIH
> >>> > 3) SolrJ (as ours is a J2EE application).
> >>> >
> >>> > I want to know what is the recommended way for Data import in
> production
> >>> > environment.
> >>> > Will sending data via SolrJ in batches be faster than posting a csv
> using
> >>> > POST tool?
> >>> >
> >>> >
> >>> > Thanks,
> >>> > Vishal
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Sujay P Bawaskar
> >>> M:+91-77091 53669
> >>>
>


Re: SOLR Data Locality

2017-03-17 Thread Imad Qureshi
Hi Mike

I understand that but unfortunately that's not an option right now. We already 
have 16 TB of index in HDFS. 

So let me rephrase this question. How important is data locality for SOLR? Is 
performance impacted if SOLR data is on a remote node?

Thanks
Imad

> On Mar 17, 2017, at 12:02 PM, Mike Thomsen  wrote:
> 
> I've only ever used the HDFS support with Cloudera's build, but my experience 
> turned me off to use HDFS. I'd much rather use the native file system over 
> HDFS.
> 
>> On Tue, Mar 14, 2017 at 10:19 AM, Muhammad Imad Qureshi 
>>  wrote:
>> We have a 30 node Hadoop cluster and each data node has a SOLR instance also 
>> running. Data is stored in HDFS. We are adding 10 nodes to the cluster. 
>> After adding nodes, we'll run HDFS balancer and also create SOLR replicas on 
>> new nodes. This will affect data locality. Does this impact how Solr works 
>> (I mean performance) if the data is on a remote node? Thanks, Imad
> 


Re: Data Import

2017-03-17 Thread Walter Underwood
That fails if Solr is not available.

To avoid dropping updates, you need some kind of persistent queue. We use 
Amazon SQS for our incremental updates.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 17, 2017, at 10:09 AM, OTH  wrote:
> 
> Could the database trigger not just post the change to solr?
> 
> On Fri, Mar 17, 2017 at 10:00 PM, Erick Erickson 
> wrote:
> 
>> Or set a trigger on your RDBMS's main table to put the relevant
>> information in a different table (call it EVENTS) and have your SolrJ
>> consult the EVENTS table periodically. Essentially you're using the
>> EVENTS table as a queue where the trigger is the producer and the
>> SolrJ program is the consumer.
>> 
>> It's a polling solution though, so not event-driven. There's no
>> mechanism that I know of have, say, your RDBMS push an event to DIH
>> for instance.
>> 
>> Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
>> for this kind of problem..
>> 
>> Best,
>> Erick
>> 
>> On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
>>  wrote:
>>> One assumes by hooking into the same code that updates RDBMS, as
>>> opposed to be reverse engineering the changes from looking at the DB
>>> content. This would be especially the case for Delete changes.
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> http://www.solr-start.com/ - Resources for Solr users, new and
>> experienced
>>> 
>>> 
>>> On 17 March 2017 at 11:37, OTH  wrote:
> 
> Also, solrj is good when you want your RDBMS updates make immediately
> available in solr.
 
 How can SolrJ be used to make RDBMS updates immediately available?
 Thanks
 
 On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
>> sujaybawas...@gmail.com>
 wrote:
 
> Hi Vishal,
> 
> As per my experience DIH is the best for RDBMS to solr index. DIH with
> caching has best performance. DIH nested entities allow you to define
> simple queries.
> Also, solrj is good when you want your RDBMS updates make immediately
> available in solr. DIH full import can be used for index all data first
> time or restore index in case index is corrupted.
> 
> Thanks,
> Sujay
> 
> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
>> wrote:
> 
>> Hi,
>> 
>> 
>> I am new to Solr and am trying to move data from my RDBMS to Solr. I
>> know
>> the available options are:
>> 1) Post Tool
>> 2) DIH
>> 3) SolrJ (as ours is a J2EE application).
>> 
>> I want to know what is the recommended way for Data import in
>> production
>> environment.
>> Will sending data via SolrJ in batches be faster than posting a csv
>> using
>> POST tool?
>> 
>> 
>> Thanks,
>> Vishal
>> 
> 
> 
> 
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669
> 
>> 



Re: Data Import

2017-03-17 Thread vishal jain
Hi Daphne,

Are you using DSE?


Thanks & Regards,
Vishal

On Fri, Mar 17, 2017 at 7:40 PM, Liu, Daphne 
wrote:

> I just want to share my recent project. I have successfully sent all our
> EDI documents to Cassandra 3.7 clusters using Solr 6.3 Data Import JDBC
> Cassandra connector indexing our documents.
> Since Cassandra is so fast for writing, compression rate is around 13% and
> all my documents can be kept in my Cassandra clusters' memory, we are very
> happy with the result.
>
>
> Kind regards,
>
> Daphne Liu
> BI Architect - Matrix SCM
>
> CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL
> 32256 USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 /
> daphne@cevalogistics.com
>
>
>
> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, March 17, 2017 9:54 AM
> To: solr-user 
> Subject: Re: Data Import
>
> I feel DIH is much better for prototyping, even though people do use it in
> production. If you do want to use DIH, you may benefit from reviewing the
> DIH-DB example I am currently rewriting in
> https://issues.apache.org/jira/browse/SOLR-10312 (may need to change
> luceneMatchVersion in solrconfig.xml first).
>
> CSV, etc, could be useful if you want to keep history of past imports,
> again useful during development, as you evolve schema.
>
> SolrJ may actually be easiest/best for production since you already have
> Java stack.
>
> The choice is yours in the end.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 17 March 2017 at 08:56, Shawn Heisey  wrote:
> > On 3/17/2017 3:04 AM, vishal jain wrote:
> >> I am new to Solr and am trying to move data from my RDBMS to Solr. I
> know the available options are:
> >> 1) Post Tool
> >> 2) DIH
> >> 3) SolrJ (as ours is a J2EE application).
> >>
> >> I want to know what is the recommended way for Data import in
> >> production environment. Will sending data via SolrJ in batches be
> faster than posting a csv using POST tool?
> >
> > I've heard that CSV import runs EXTREMELY fast, but I have never
> > tested it.  The same threading problem that I discuss below would
> > apply to indexing this way.
> >
> > DIH is extremely powerful, but it has one glaring problem:  It's
> > single-threaded, which means that only one stream of data is going
> > into Solr, and each batch of documents to be inserted must wait for
> > the previous one to finish inserting before it can start.  I do not
> > know if DIH batches documents or sends them in one at a time.  If you
> > have a manually sharded index, you can run DIH on each shard in
> > parallel, but each one will be single-threaded.  That single thread is
> > pretty efficient, but it's still only one thread.
> >
> > Sending multiple index updates to Solr in parallel (multi-threading)
> > is how you radically speed up the Solr part of indexing.  This is
> > usually done with a custom indexing program, which might be written
> > with SolrJ or even in a completely different language.
> >
> > One thing to keep in mind with ANY indexing method:  Once the
> > situation is examined closely, most people find that it's not Solr
> > that makes their indexing slow.  The bottleneck is usually the source
> > system -- how quickly the data can be retrieved.  It usually takes a
> > lot longer to obtain the data than it does for Solr to index it.
> >
> > Thanks,
> > Shawn
> >
>


Re: Data Import

2017-03-17 Thread OTH
Could the database trigger not just post the change to solr?

On Fri, Mar 17, 2017 at 10:00 PM, Erick Erickson 
wrote:

> Or set a trigger on your RDBMS's main table to put the relevant
> information in a different table (call it EVENTS) and have your SolrJ
> consult the EVENTS table periodically. Essentially you're using the
> EVENTS table as a queue where the trigger is the producer and the
> SolrJ program is the consumer.
>
> It's a polling solution though, so not event-driven. There's no
> mechanism that I know of have, say, your RDBMS push an event to DIH
> for instance.
>
> Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
> for this kind of problem..
>
> Best,
> Erick
>
> On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
>  wrote:
> > One assumes by hooking into the same code that updates RDBMS, as
> > opposed to be reverse engineering the changes from looking at the DB
> > content. This would be especially the case for Delete changes.
> >
> > Regards,
> >Alex.
> > 
> > http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
> >
> >
> > On 17 March 2017 at 11:37, OTH  wrote:
> >>>
> >>> Also, solrj is good when you want your RDBMS updates make immediately
> >>> available in solr.
> >>
> >> How can SolrJ be used to make RDBMS updates immediately available?
> >> Thanks
> >>
> >> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
> sujaybawas...@gmail.com>
> >> wrote:
> >>
> >>> Hi Vishal,
> >>>
> >>> As per my experience DIH is the best for RDBMS to solr index. DIH with
> >>> caching has best performance. DIH nested entities allow you to define
> >>> simple queries.
> >>> Also, solrj is good when you want your RDBMS updates make immediately
> >>> available in solr. DIH full import can be used for index all data first
> >>> time or restore index in case index is corrupted.
> >>>
> >>> Thanks,
> >>> Sujay
> >>>
> >>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> >
> >>> > I am new to Solr and am trying to move data from my RDBMS to Solr. I
> know
> >>> > the available options are:
> >>> > 1) Post Tool
> >>> > 2) DIH
> >>> > 3) SolrJ (as ours is a J2EE application).
> >>> >
> >>> > I want to know what is the recommended way for Data import in
> production
> >>> > environment.
> >>> > Will sending data via SolrJ in batches be faster than posting a csv
> using
> >>> > POST tool?
> >>> >
> >>> >
> >>> > Thanks,
> >>> > Vishal
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Sujay P Bawaskar
> >>> M:+91-77091 53669
> >>>
>


Re: SOLR Data Locality

2017-03-17 Thread Mike Thomsen
I've only ever used the HDFS support with Cloudera's build, but my
experience turned me off to use HDFS. I'd much rather use the native file
system over HDFS.

On Tue, Mar 14, 2017 at 10:19 AM, Muhammad Imad Qureshi <
imadgr...@yahoo.com.invalid> wrote:

> We have a 30 node Hadoop cluster and each data node has a SOLR instance
> also running. Data is stored in HDFS. We are adding 10 nodes to the
> cluster. After adding nodes, we'll run HDFS balancer and also create SOLR
> replicas on new nodes. This will affect data locality. Does this impact how
> Solr works (I mean performance) if the data is on a remote node? Thanks, Imad
>


Re: Data Import

2017-03-17 Thread Erick Erickson
Or set a trigger on your RDBMS's main table to put the relevant
information in a different table (call it EVENTS) and have your SolrJ
consult the EVENTS table periodically. Essentially you're using the
EVENTS table as a queue where the trigger is the producer and the
SolrJ program is the consumer.
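A bare-bones sketch of that consumer side (table, column, and URL names are
invented; marking rows processed and real error handling are omitted):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class EventsTablePoller {
    public static void main(String[] args) throws Exception {
      try (Connection db = DriverManager.getConnection("jdbc:mysql://dbhost/app", "user", "pass");
           HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycoll").build()) {
        while (true) {
          try (PreparedStatement ps = db.prepareStatement(
                   "SELECT id, title FROM events WHERE processed = 0 LIMIT 1000");
               ResultSet rs = ps.executeQuery()) {
            boolean any = false;
            while (rs.next()) {
              any = true;
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", rs.getLong("id"));
              doc.addField("title", rs.getString("title"));
              solr.add(doc);
              // mark the row processed (omitted) only after solr.add() succeeds
            }
            if (any) solr.commit();
          }
          Thread.sleep(10_000);   // poll interval; this is the "not event-driven" part
        }
      }
    }
  }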

It's a polling solution though, so not event-driven. There's no
mechanism that I know of have, say, your RDBMS push an event to DIH
for instance.

Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
for this kind of problem..

Best,
Erick

On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
 wrote:
> One assumes by hooking into the same code that updates RDBMS, as
> opposed to be reverse engineering the changes from looking at the DB
> content. This would be especially the case for Delete changes.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 17 March 2017 at 11:37, OTH  wrote:
>>>
>>> Also, solrj is good when you want your RDBMS updates make immediately
>>> available in solr.
>>
>> How can SolrJ be used to make RDBMS updates immediately available?
>> Thanks
>>
>> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar 
>> wrote:
>>
>>> Hi Vishal,
>>>
>>> As per my experience DIH is the best for RDBMS to solr index. DIH with
>>> caching has best performance. DIH nested entities allow you to define
>>> simple queries.
>>> Also, solrj is good when you want your RDBMS updates make immediately
>>> available in solr. DIH full import can be used for index all data first
>>> time or restore index in case index is corrupted.
>>>
>>> Thanks,
>>> Sujay
>>>
>>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain  wrote:
>>>
>>> > Hi,
>>> >
>>> >
>>> > I am new to Solr and am trying to move data from my RDBMS to Solr. I know
>>> > the available options are:
>>> > 1) Post Tool
>>> > 2) DIH
>>> > 3) SolrJ (as ours is a J2EE application).
>>> >
>>> > I want to know what is the recommended way for Data import in production
>>> > environment.
>>> > Will sending data via SolrJ in batches be faster than posting a csv using
>>> > POST tool?
>>> >
>>> >
>>> > Thanks,
>>> > Vishal
>>> >
>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Sujay P Bawaskar
>>> M:+91-77091 53669
>>>


Re: Data Import

2017-03-17 Thread vishal jain
Thanks to all of you for the valuable inputs.
Being on a J2EE platform, I also felt using SolrJ in a multi-threaded
environment would be a better choice to index RDBMS data into SolrCloud.
I will try with a scheduler triggered micro service to do the job using
SolrJ.
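A minimal sketch of that kind of multi-threaded SolrJ indexing (URL,
collection, and field names are placeholders; it assumes a SolrJ 6.x
ConcurrentUpdateSolrClient.Builder): the client buffers documents in a queue
and streams them to Solr from several background threads, which is the easy
way to get the parallelism discussed earlier in this thread.

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexer {
    public static void main(String[] args) throws Exception {
      try (ConcurrentUpdateSolrClient solr =
               new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycollection")
                   .withQueueSize(10000)    // docs buffered before the caller blocks
                   .withThreadCount(4)      // parallel streams into Solr
                   .build()) {
        for (int i = 0; i < 100000; i++) {     // stand-in for reading rows from the RDBMS
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", Integer.toString(i));
          doc.addField("title_s", "row " + i);
          solr.add(doc);                       // returns quickly; sending happens in background
        }
        solr.blockUntilFinished();             // drain the queue
        solr.commit();
      }
    }
  }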

Regards,
Vishal

On Fri, Mar 17, 2017 at 9:11 PM, Alexandre Rafalovitch 
wrote:

> One assumes by hooking into the same code that updates RDBMS, as
> opposed to be reverse engineering the changes from looking at the DB
> content. This would be especially the case for Delete changes.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 17 March 2017 at 11:37, OTH  wrote:
> >>
> >> Also, solrj is good when you want your RDBMS updates make immediately
> >> available in solr.
> >
> > How can SolrJ be used to make RDBMS updates immediately available?
> > Thanks
> >
> > On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar  >
> > wrote:
> >
> >> Hi Vishal,
> >>
> >> As per my experience DIH is the best for RDBMS to solr index. DIH with
> >> caching has best performance. DIH nested entities allow you to define
> >> simple queries.
> >> Also, solrj is good when you want your RDBMS updates make immediately
> >> available in solr. DIH full import can be used for index all data first
> >> time or restore index in case index is corrupted.
> >>
> >> Thanks,
> >> Sujay
> >>
> >> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> wrote:
> >>
> >> > Hi,
> >> >
> >> >
> >> > I am new to Solr and am trying to move data from my RDBMS to Solr. I
> know
> >> > the available options are:
> >> > 1) Post Tool
> >> > 2) DIH
> >> > 3) SolrJ (as ours is a J2EE application).
> >> >
> >> > I want to know what is the recommended way for Data import in
> production
> >> > environment.
> >> > Will sending data via SolrJ in batches be faster than posting a csv
> using
> >> > POST tool?
> >> >
> >> >
> >> > Thanks,
> >> > Vishal
> >> >
> >>
> >>
> >>
> >> --
> >> Thanks,
> >> Sujay P Bawaskar
> >> M:+91-77091 53669
> >>
>


Re: Data Import

2017-03-17 Thread Alexandre Rafalovitch
One assumes by hooking into the same code that updates the RDBMS, as
opposed to reverse engineering the changes from looking at the DB
content. This would be especially the case for Delete changes.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 17 March 2017 at 11:37, OTH  wrote:
>>
>> Also, solrj is good when you want your RDBMS updates make immediately
>> available in solr.
>
> How can SolrJ be used to make RDBMS updates immediately available?
> Thanks
>
> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar 
> wrote:
>
>> Hi Vishal,
>>
>> As per my experience DIH is the best for RDBMS to solr index. DIH with
>> caching has best performance. DIH nested entities allow you to define
>> simple queries.
>> Also, solrj is good when you want your RDBMS updates make immediately
>> available in solr. DIH full import can be used for index all data first
>> time or restore index in case index is corrupted.
>>
>> Thanks,
>> Sujay
>>
>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain  wrote:
>>
>> > Hi,
>> >
>> >
>> > I am new to Solr and am trying to move data from my RDBMS to Solr. I know
>> > the available options are:
>> > 1) Post Tool
>> > 2) DIH
>> > 3) SolrJ (as ours is a J2EE application).
>> >
>> > I want to know what is the recommended way for Data import in production
>> > environment.
>> > Will sending data via SolrJ in batches be faster than posting a csv using
>> > POST tool?
>> >
>> >
>> > Thanks,
>> > Vishal
>> >
>>
>>
>>
>> --
>> Thanks,
>> Sujay P Bawaskar
>> M:+91-77091 53669
>>


Re: Data Import

2017-03-17 Thread OTH
>
> Also, solrj is good when you want your RDBMS updates make immediately
> available in solr.

How can SolrJ be used to make RDBMS updates immediately available?
Thanks

On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar 
wrote:

> Hi Vishal,
>
> As per my experience DIH is the best for RDBMS to solr index. DIH with
> caching has best performance. DIH nested entities allow you to define
> simple queries.
> Also, solrj is good when you want your RDBMS updates make immediately
> available in solr. DIH full import can be used for index all data first
> time or restore index in case index is corrupted.
>
> Thanks,
> Sujay
>
> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain  wrote:
>
> > Hi,
> >
> >
> > I am new to Solr and am trying to move data from my RDBMS to Solr. I know
> > the available options are:
> > 1) Post Tool
> > 2) DIH
> > 3) SolrJ (as ours is a J2EE application).
> >
> > I want to know what is the recommended way for Data import in production
> > environment.
> > Will sending data via SolrJ in batches be faster than posting a csv using
> > POST tool?
> >
> >
> > Thanks,
> > Vishal
> >
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669
>


Enhanced output of SearchComponent not visible in SolrCloud

2017-03-17 Thread Markus, Sascha
Hi,
I created a search component which enriches the response for a query.

So my json result looks like

{
  "responseHeader":{...},

  "response":{"numFound":116652,"start":0,"maxScore":1.0,"docs":...},

  "facet_counts":{...},

  "facets":{...},

  "expand.entities":{... }

}

I did this using
rb.rsp.add("expand.entities", my_enrichments);

This worked fine with a standalone Solr. But in SolrCloud I see the output
(with debugQuery) in the result for one shard,
but not in the collected result.
Where is the place to collect this from the shards and add it to the result?
Any hints?
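In distributed mode each shard runs the component, and the aggregator node has
to merge the per-shard values itself, typically in finishStage(). A rough
sketch of that override inside the component (a fragment, not a complete
class, and not tested against this particular component):

  @Override
  public void finishStage(ResponseBuilder rb) {
    if (rb.stage != ResponseBuilder.STAGE_GET_FIELDS) return;  // merge once, late in the request
    NamedList<Object> merged = new NamedList<>();
    for (ShardRequest sreq : rb.finished) {
      for (ShardResponse srsp : sreq.responses) {
        NamedList<Object> shardPart = (NamedList<Object>)
            srsp.getSolrResponse().getResponse().get("expand.entities");
        if (shardPart == null) continue;
        for (int i = 0; i < shardPart.size(); i++) {
          merged.add(shardPart.getName(i), shardPart.getVal(i));
        }
      }
    }
    rb.rsp.add("expand.entities", merged);
  }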

Cheers
 Sascha


Re: Grouping and result pagination

2017-03-17 Thread Shawn Heisey
On 3/17/2017 9:07 AM, Erick Erickson wrote:
> I think the answer is that you have to co-locate the docs with the
> same value you're grouping by on the same shard whether in SolrCloud
> or not...
>
> Hmmm: from: 
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats
>
> "group.ngroups and group.facet require that all documents in each
> group must be co-located on the same shard in order for accurate
> counts to be returned."

That is not how things work right now.  The index has 170 million
documents in it, split into six large cold shards and a very small hot
shard.  The routing I'm using for the cold shards is the CRC32 hash of
the database primary key (different field than Solr's uniqueKey) run
through a mod function to determine shard number (0-5).  The hash/mod
calculation is done in the MySQL query.

Is pagination of a grouped query impossible with this index?

I suppose it's theoretically possible that I could hash the set name
instead of the DB primary key which would result in docs from a set
being co-located.  Would that help?  My worry with that approach is that
the cold shards would no longer have relatively uniform sizes.

Thanks,
Shawn



Re: Grouping and result pagination

2017-03-17 Thread Erick Erickson
I think the answer is that you have to co-locate the docs with the
same value you're grouping by on the same shard whether in SolrCloud
or not...

Hmmm: from: 
https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats

"group.ngroups and group.facet require that all documents in each
group must be co-located on the same shard in order for accurate
counts to be returned."

Best,
Erick

On Fri, Mar 17, 2017 at 8:00 AM, Shawn Heisey  wrote:
> We use pagination (start/rows) frequently with our queries.  Nothing
> unusual there.
>
> Now we have need to use grouping with a request like this, for a
> set-mode search, where only one document from each set is returned:
>
> http://idxb1.REDACTED.com:8981/solr/ncmain/lbcheck?q=*:*=true=set_name=set_lead%20desc=1=50
>
> We've worked through most of the problems encountered with this idea.
> The first page of results works perfectly.
>
> The remaining problem is that I cannot seem to paginate -- set the start
> value to 50, 100, etc.  I found some information saying that
> group.ngroups=true is required for pagination, so I added that.  I have
found that occasionally I can load page two (rows=50&start=50), but that
> *most* of the time, I can't even get page two to load, and further pages
> have never worked.  The response contains no documents.
>
> The index is distributed (sharded), but not running SolrCloud.
>
> The server where I am trying this is running a SNAPSHOT build of 4.9.  I
> haven't had an opportunity yet to try a newer version -- we don't have
> newer versions on any of the machines for this index.  I can only
> upgrade as far as 5.3, because that's as far as we can go with a
> third-party plugin we are using.
>
> I found the following issue, which says it was fixed before 4.0 was
> released:
>
> https://issues.apache.org/jira/browse/SOLR-2207
>
> Does anyone know whether pagination with grouping is expected to work,
> and if so, how to do it?
>
> Thanks,
> Shawn
>


Re: Alphanumeric sort with alphabets first

2017-03-17 Thread Erick Erickson
I would back up further and say that 2500 fields is too much from the
start. Why do you need this many fields? And you say you can sort on
any of them... for a corpus of any decent size this is going to chew
up memory like crazy. Admittedly OS memory if you use docValues but
still memory.

That said, a custom sort function is probably the way to go if you
really need to.

Best,
Erick

On Thu, Mar 16, 2017 at 9:17 PM, Srinivasan Narayanan
 wrote:
> Can someone please respond?
>
> From: Srinivasan Narayanan 
> Date: Monday, March 13, 2017 at 3:51 PM
> To: "solr-user@lucene.apache.org" 
> Subject: Alphanumeric sort with alphabets first
>
>
> Hello SOLR experts,
>
> I am new to SOLR and I am trying to do alphanumeric sort on string field(s). 
> However, in my case, alphabets should come before numbers. I also have a 
> large number of such fields (~2500), any of which can be alphanumerically 
> sorted upon at runtime. I’ve explored below concepts in SOLR to arrive at a 
> solution:
>
> 1)  Custom similarity plugin : far fetched, and probably not even 
> applicable to my usecase
>
> 2)  Analyzer/tokenizer and regex magic to left pad number parts with 0s : 
> two disadvantages – I believe this needs extra fields (copy) to be created 
> which I cannot do (2500 more fields is too much) and this will still push 
> numbers before alphabets
>
> 3)  Custom function (ValueSource) and regex magic to left pad numeric 
> tokens with 0s, and invoke function for sorting only – a bit better than the 
> previous one, but still numbers come before alphabets.
>
> 4)  Custom function (ValueSource) and regex magic to left pad numeric 
> tokens with 0s, prefix numeric tokens with tilde (~), and invoke function for 
> sorting only – this is where I stand right now. Very ugly, but it works. 
> Because tilde has a very high ASCII value, it pushes numbers behind alphabets.
> There should obviously be a better approach I am missing. Please help!


Re: fq performance

2017-03-17 Thread Erick Erickson
And to chime in.

bq: It contains information about who have access to the
documents, like field as (U1_s:true).

I wanted to make explicit the implications of Michael's response.

You are talking about different _fields_ per user or group.
Don't do this, it's horribly wasteful. Instead, as Michael suggests,
you have a single field ("access_control" in his example)
that contains the groups and users, i.e.
permissions might contain U1, G1, G4, U1000, and then form the
fq clauses as he suggests.

Also, if you're on an earlier version than 6.4 you can have massive
OR clauses by using the TermsQueryParser.
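Concretely (using the access_control field name from Michael's example), the
filter could look like either of these; the second form uses the terms query
parser and sidesteps the boolean-clause limit on older versions:

  fq=access_control:(U1 OR G1 OR G4 OR U1000)
  fq={!terms f=access_control}U1,G1,G4,U1000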

Best,
Erick



On Fri, Mar 17, 2017 at 7:11 AM, Yonik Seeley  wrote:
> On Fri, Mar 17, 2017 at 9:09 AM, Shawn Heisey  wrote:
> [...]
>> Lucene has a global configuration called "maxBooleanClauses" which
>> defaults to 1024.
>
> For Solr 6.4, we've managed to circumvent this for filter queries and
> other contexts where scoring isn't needed.
> http://yonik.com/solr-6-4/  "More efficient filter queries"
>
> -Yonik


Grouping and result pagination

2017-03-17 Thread Shawn Heisey
We use pagination (start/rows) frequently with our queries.  Nothing
unusual there.

Now we have need to use grouping with a request like this, for a
set-mode search, where only one document from each set is returned:

http://idxb1.REDACTED.com:8981/solr/ncmain/lbcheck?q=*:*=true=set_name=set_lead%20desc=1=50

We've worked through most of the problems encountered with this idea. 
The first page of results works perfectly.

The remaining problem is that I cannot seem to paginate -- set the start
value to 50, 100, etc.  I found some information saying that
group.ngroups=true is required for pagination, so I added that.  I have
found that occasionally I can load page two (rows=50&start=50), but that
*most* of the time, I can't even get page two to load, and further pages
have never worked.  The response contains no documents.

The index is distributed (sharded), but not running SolrCloud.

The server where I am trying this is running a SNAPSHOT build of 4.9.  I
haven't had an opportunity yet to try a newer version -- we don't have
newer versions on any of the machines for this index.  I can only
upgrade as far as 5.3, because that's as far as we can go with a
third-party plugin we are using.

I found the following issue, which says it was fixed before 4.0 was
released:

https://issues.apache.org/jira/browse/SOLR-2207

Does anyone know whether pagination with grouping is expected to work,
and if so, how to do it?

Thanks,
Shawn



Re: Exact match works only for some of the strings

2017-03-17 Thread Mikhail Khludnev
Hello Gintas,
From the first letter I gathered that you use a colon to separate the field name
and the text.
But here it's =, which is never advised in Lucene syntax.

On Fri, Mar 17, 2017 at 2:37 PM, Gintautas Sulskus <
gintautas.suls...@gmail.com> wrote:

> Hi,
>
> Thank you for your replies.
> Sorry, forgot to specify, I am using Solr 4.10.3 (from Cloudera CDH 5.9.0).
>
> When I search for name:Guardian I can see both  "Guardian EU-referendum"
> and "Guardian US" in the result set.
> The debugQuery results for both queries are identical
> http://pastebin.com/xr96EF0r
> Reindexing did not help.
>
>
> Cheers,
> Gintas
>
>
>
> On Thu, Mar 16, 2017 at 8:18 PM, Alvaro Cabrerizo 
> wrote:
>
> > Hello,
> >
> > I've tested on an old solr 4.3 instance and the schema and the field
> > definition are fine. I've also checked that only the
> > query nameExact:"Guardian EU-referendum" gives the result, the other one
> > you have commented (nameExact:"Guardian US") gives 0 hits. Maybe, you
> > forgot to re-index after schema modification. I mean, you indexed your
> > data, then changed the schema and then start querying using the new
> schema
> > that does not match your index.
> >
> > Hope it helps.
> >
> > On Thu, Mar 16, 2017 at 7:50 PM, Mikhail Khludnev 
> wrote:
> >
> > > You can try to check debugQuery to understand how this query is parsed:
> > > double quotes hardly compatible with KeywordTokenizer. Also you can
> check
> > > which terms are indexed in SchemaBrowser. Also, there is Analysis page
> at
> > > Solr Admin.
> > >
> > > On Thu, Mar 16, 2017 at 8:55 PM, Gintautas Sulskus <
> > > gintautas.suls...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I am trying to figure out why Solr returns an empty result when
> > searching
> > > > for the following query:
> > > >
> > > > nameExact:"Guardian EU-referendum"
> > > >
> > > >
> > > > The field definition:
> > > >
> > > >  stored="true"
> > > />
> > > >
> > > >
> > > > The type definition:
> > > >
> > > >  > > > sortMissingLast="true" omitNorms="true">
> > > >
> > > > 
> > > >
> > > > 
> > > >
> > > > 
> > > >
> > > > 
> > > >
> > > > 
> > > >
> > > > The analysis, as expected, matches the query parameter against the
> > stored
> > > > value. Please take a look at the attached image. I am using
> > > > KeywordTokenizer and LowerCaseFilter.
> > > > ​
> > > > What is more strange, the query below works just fine:
> > > >
> > > > nameExact:"Guardian US"
> > > >
> > > >
> > > > Could you please provide me with some clues on what could be wrong?
> > > >
> > > > Thanks,
> > > > Gintas
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > >
> >
>



-- 
Sincerely yours
Mikhail Khludnev


Re: managing active/passive cores in Solr and Haystack

2017-03-17 Thread Shawn Heisey
On 3/15/2017 7:55 AM, serwah sabetghadam wrote:
> Thanks Erick for the fast answer:)
>
> I knew about sharding, just as far as I know it will work on different
> servers.
> I wonder if it is possible to do something like sharding, as you mentioned, but on
> a single standalone Solr?
> Can I use the implicit routing on standalone then?

If you're running standalone (not SolrCloud), then everything having to
do with shards must be 100 percent managed by you.  There is no
routing.  There is no capability of automatically managing which
implicit shards belong to which logical index.  There's no automatic
replication of index data for redundancy.  You're in charge of
*everything* that SolrCloud would normally handle automatically.

https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
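
For example, a manual distributed query just lists the shards explicitly
(hostnames and core names below are made up):

http://host1:8983/solr/core0/select?q=test&shards=host1:8983/solr/core0,host2:8983/solr/core1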

Multiple shards can live in a single Solr instance, whether you use
SolrCloud or the old way described above.  If your query rate is very
low, this probably will perform well.  As the query rate increases, it's
best to only have one core per Solr instance.  Either way, it's
*usually* best to only have one Solr instance per machine.

Thanks,
Shawn



Re: fq performance

2017-03-17 Thread Yonik Seeley
On Fri, Mar 17, 2017 at 9:09 AM, Shawn Heisey  wrote:
[...]
> Lucene has a global configuration called "maxBooleanClauses" which
> defaults to 1024.

For Solr 6.4, we've managed to circumvent this for filter queries and
other contexts where scoring isn't needed.
http://yonik.com/solr-6-4/  "More efficient filter queries"

-Yonik


RE: Data Import

2017-03-17 Thread Liu, Daphne
I just want to share my recent project. I have successfully sent all our EDI 
documents to Cassandra 3.7 clusters, using the Solr 6.3 Data Import handler with a 
JDBC Cassandra connector to index our documents.
Cassandra is very fast for writing, the compression rate is around 13%, and all 
my documents can be kept in the Cassandra clusters' memory, so we are very happy 
with the result.


Kind regards,

Daphne Liu
BI Architect - Matrix SCM

CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 
USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 / 
daphne@cevalogistics.com



-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Friday, March 17, 2017 9:54 AM
To: solr-user 
Subject: Re: Data Import

I feel DIH is much better for prototyping, even though people do use it in 
production. If you do want to use DIH, you may benefit from reviewing the 
DIH-DB example I am currently rewriting in
https://issues.apache.org/jira/browse/SOLR-10312 (may need to change 
luceneMatchVersion in solrconfig.xml first).

CSV, etc, could be useful if you want to keep history of past imports, again 
useful during development, as you evolve schema.

SolrJ may actually be easiest/best for production since you already have Java 
stack.

The choice is yours in the end.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 17 March 2017 at 08:56, Shawn Heisey  wrote:
> On 3/17/2017 3:04 AM, vishal jain wrote:
>> I am new to Solr and am trying to move data from my RDBMS to Solr. I know 
>> the available options are:
>> 1) Post Tool
>> 2) DIH
>> 3) SolrJ (as ours is a J2EE application).
>>
>> I want to know what is the recommended way for Data import in
>> production environment. Will sending data via SolrJ in batches be faster 
>> than posting a csv using POST tool?
>
> I've heard that CSV import runs EXTREMELY fast, but I have never
> tested it.  The same threading problem that I discuss below would
> apply to indexing this way.
>
> DIH is extremely powerful, but it has one glaring problem:  It's
> single-threaded, which means that only one stream of data is going
> into Solr, and each batch of documents to be inserted must wait for
> the previous one to finish inserting before it can start.  I do not
> know if DIH batches documents or sends them in one at a time.  If you
> have a manually sharded index, you can run DIH on each shard in
> parallel, but each one will be single-threaded.  That single thread is
> pretty efficient, but it's still only one thread.
>
> Sending multiple index updates to Solr in parallel (multi-threading)
> is how you radically speed up the Solr part of indexing.  This is
> usually done with a custom indexing program, which might be written
> with SolrJ or even in a completely different language.
>
> One thing to keep in mind with ANY indexing method:  Once the
> situation is examined closely, most people find that it's not Solr
> that makes their indexing slow.  The bottleneck is usually the source
> system -- how quickly the data can be retrieved.  It usually takes a
> lot longer to obtain the data than it does for Solr to index it.
>
> Thanks,
> Shawn
>


Re: Data Import

2017-03-17 Thread Alexandre Rafalovitch
I feel DIH is much better for prototyping, even though people do use
it in production. If you do want to use DIH, you may benefit from
reviewing the DIH-DB example I am currently rewriting in
https://issues.apache.org/jira/browse/SOLR-10312 (may need to change
luceneMatchVersion in solrconfig.xml first).

CSV, etc, could be useful if you want to keep history of past imports,
again useful during development, as you evolve schema.

SolrJ may actually be easiest/best for production since you already
have Java stack.

The choice is yours in the end.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 17 March 2017 at 08:56, Shawn Heisey  wrote:
> On 3/17/2017 3:04 AM, vishal jain wrote:
>> I am new to Solr and am trying to move data from my RDBMS to Solr. I know 
>> the available options are:
>> 1) Post Tool
>> 2) DIH
>> 3) SolrJ (as ours is a J2EE application).
>>
>> I want to know what is the recommended way for Data import in production
>> environment. Will sending data via SolrJ in batches be faster than posting a 
>> csv using POST tool?
>
> I've heard that CSV import runs EXTREMELY fast, but I have never tested
> it.  The same threading problem that I discuss below would apply to
> indexing this way.
>
> DIH is extremely powerful, but it has one glaring problem:  It's
> single-threaded, which means that only one stream of data is going into
> Solr, and each batch of documents to be inserted must wait for the
> previous one to finish inserting before it can start.  I do not know if
> DIH batches documents or sends them in one at a time.  If you have a
> manually sharded index, you can run DIH on each shard in parallel, but
> each one will be single-threaded.  That single thread is pretty
> efficient, but it's still only one thread.
>
> Sending multiple index updates to Solr in parallel (multi-threading) is
> how you radically speed up the Solr part of indexing.  This is usually
> done with a custom indexing program, which might be written with SolrJ
> or even in a completely different language.
>
> One thing to keep in mind with ANY indexing method:  Once the situation
> is examined closely, most people find that it's not Solr that makes
> their indexing slow.  The bottleneck is usually the source system -- how
> quickly the data can be retrieved.  It usually takes a lot longer to
> obtain the data than it does for Solr to index it.
>
> Thanks,
> Shawn
>


Re: Get handler not working

2017-03-17 Thread Chris Ulicny
I didn't realize extra parameters were ignored on collection creation.

I believe I have all of the trace log from the get request included in the
attached document. The collection used was setup as CollectionOne
previously. One instance in cloud mode with 2 shards with
router.field=iqroutingkey.
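For reference, the get requests I'm testing look roughly like this (id and
routing key are placeholders):

http://host:8983/solr/CollectionOne/get?id=SOMEDOCID&_route_=SOMEROUTINGKEY!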

If there is any more of the log that would be useful, let me know and I can
add that to it.

Thanks,
Chris

On Thu, Mar 16, 2017 at 3:55 PM Alexandre Rafalovitch 
wrote:

Well, only router.field is the valid parameter as per
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CREATE:CreateaCollection

In the second case the parameter is ignored and the uniqueKey is used
instead, which is different for you.

But it is the first case that fails for you, so it sounds like maybe
/get handler somehow does not routed correctly. I wonder if there is
another parameter somewhere that should be set to match the field you
use, but is not.

Regards,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 16 March 2017 at 12:28, Chris Ulicny  wrote:
> I think I've figured out where the issue is, at least superficially. It's
> in what parameter is used to define the field to route on. I set up two
> collections to use the same configset but slightly altered calls to the
> Collections API.
>
> action=CREATE&name=CollectionOne&numShards=2&router.name=compositeId&*router.field*=iqroutingkey&maxShardsPerNode=2&collection.configName=RoutingTest
> action=CREATE&name=CollectionTwo&numShards=2&router.name=compositeId&*routerField*=iqroutingkey&maxShardsPerNode=2&collection.configName=RoutingTest
>
> The get handler returns null for CollectionOne (even with a _route_
> parameter), but it will return the document for CollectionTwo in any case.
> I will gather and post the trace logs when I get a chance.
>
>
>
> On Thu, Mar 16, 2017 at 10:52 AM Yonik Seeley  wrote:
>
>> Ah, yeah, if you're using a different route field it's highly likely
>> that's the issue.
>> I was always against that "feature", and this thread demonstrates part
>> of the problem (complicating clients, including us human clients
>> trying to make sense of what's going on).
>>
>> -Yonik
>>
>>
>> On Thu, Mar 16, 2017 at 10:31 AM, Chris Ulicny  wrote:
>> > Speaking of routing, I realized I completely forgot to add the routing
>> > setup to the test cloud, so it probably has something to do with the
>> issue.
>> > I'll add that in and report back.
>> >
>> > So the routing and uniqueKey setup is as follows:
>> >
>> > Schema setup:
>> > iqdocid > > multiValued="false" indexed="true" required="true" stored="true"/>
> > name="iqdocid" type="string" multiValued="false" indexed="true"
required=
>> > "true" stored="true"/>
>> >
>> > I don't think it's mentioned in the documentation about using
routerField
>> > for the compositeId router, but based on the resolution of SOLR-5017
>> > , we decided to use
the
>> > compositeId router with routerField set to 'iqroutingkey' which is
using
>> > the "!" notation. In general, the iqroutingkey field is of the form:
>> > !!
>> >
>> > Unless I misunderstood what was changed with that patch, that form
should
>> > still route appropriately, and it seems that it has distributed the
>> > documents appropriately from our basic testing.
>> >
>> > On Thu, Mar 16, 2017 at 9:42 AM David Hastings <
>> hastings.recurs...@gmail.com>
>> > wrote:
>> >
>> > i still would like to see an experiment where you change the field to
id
>> > instead of iqdocid,
>> >
>> > On Thu, Mar 16, 2017 at 9:33 AM, Yonik Seeley 
wrote:
>> >
>> >> Something to do with routing perhaps? (the mapping of ids to shards,
>> >> by default is based on hashes of the id)
>> >> -Yonik
>> >>
>> >>
>> >> On Thu, Mar 16, 2017 at 9:16 AM, Chris Ulicny 
wrote:
>> >> > iqdocid is already set to be the uniqueKey value.
>> >> >
>> >> > I tried reindexing a few documents back into the problematic cloud
and
>> > am
>> >> > getting the same behavior of no document found for get handler.
>> >> >
>> >> > I've also done some testing on standalone instances as well as some
>> > quick
>> >> > cloud setups (with embedded zk), and I cannot seem to replicate the
>> >> > problem. For each test, I used the exact same configset that is
>> causing
>> >> the
>> >> > issue for us and indexed a document from that instance as well. I
can
>> >> > provide more details if that would be useful in anyway.
>> >> >
>> >> > Standalone instance worked
>> >> > Cloud mode worked regardless of the use of the security plugin
>> >> > Cloud mode worked regardless of explicit get handler definition
>> >> > Cloud mode consistently worked with explicitly defining the get
>> handler,
>> >> > then removing it and reloading the collection
>> >> >
>> >> > The only differences that I know of between the tests and the
>> > problematic
>> >> > cloud is that solr is running as a different user and using an
>> external
>> >> > 

Re: fq performance

2017-03-17 Thread Shawn Heisey
On 3/17/2017 12:46 AM, Ganesh M wrote:
> For how many ORs can Solr return results in less than one second? Can
> I pass hundreds of OR conditions in the Solr query? Will that affect
> performance?

This is a question that's impossible to answer.  The number will vary
depending on the nature of the queries, the size and nature of the data
in the index, and the hardware resources available in the server running
Solr.

https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Another wrinkle affecting your question:

Lucene has a global configuration called "maxBooleanClauses" which
defaults to 1024.  This means that if one query with a bunch of
AND/OR/NOT clauses ends up with more than 1024 of them, and this
configuration value has not been increased, the query will simply fail
to execute.  This parameter can be increased up to a value a little
larger than two billion, but due to the global nature of the
configuration, you must increase it in *EVERY* Solr configuration, or
you may find that the ones you didn't increase it in will reset it back
down to 1024 -- and this will affect every index, because it's global.
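
For reference, the setting sits in the <query> section of solrconfig.xml,
e.g. (4096 is just an example value):

<query>
  <maxBooleanClauses>4096</maxBooleanClauses>
</query>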

Thanks,
Shawn



Re: Data Import

2017-03-17 Thread Shawn Heisey
On 3/17/2017 3:04 AM, vishal jain wrote:
> I am new to Solr and am trying to move data from my RDBMS to Solr. I know the 
> available options are:
> 1) Post Tool
> 2) DIH
> 3) SolrJ (as ours is a J2EE application).
>
> I want to know what is the recommended way for Data import in production
> environment. Will sending data via SolrJ in batches be faster than posting a 
> csv using POST tool?

I've heard that CSV import runs EXTREMELY fast, but I have never tested
it.  The same threading problem that I discuss below would apply to
indexing this way.

DIH is extremely powerful, but it has one glaring problem:  It's
single-threaded, which means that only one stream of data is going into
Solr, and each batch of documents to be inserted must wait for the
previous one to finish inserting before it can start.  I do not know if
DIH batches documents or sends them in one at a time.  If you have a
manually sharded index, you can run DIH on each shard in parallel, but
each one will be single-threaded.  That single thread is pretty
efficient, but it's still only one thread.

Sending multiple index updates to Solr in parallel (multi-threading) is
how you radically speed up the Solr part of indexing.  This is usually
done with a custom indexing program, which might be written with SolrJ
or even in a completely different language.
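
As a rough SolrJ sketch (untested here; the URL, core name, field names and
sizes are just placeholders), ConcurrentUpdateSolrClient queues documents and
sends them from several background threads:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // queue up to 10000 docs, send them with 4 parallel threads
        ConcurrentUpdateSolrClient client =
            new ConcurrentUpdateSolrClient("http://localhost:8983/solr/mycore", 10000, 4);
        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_s", "document " + i);
            client.add(doc);             // queued; sent in batches by the background threads
        }
        client.blockUntilFinished();     // wait until every queued update has been sent
        client.commit();
        client.close();
    }
}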

One thing to keep in mind with ANY indexing method:  Once the situation
is examined closely, most people find that it's not Solr that makes
their indexing slow.  The bottleneck is usually the source system -- how
quickly the data can be retrieved.  It usually takes a lot longer to
obtain the data than it does for Solr to index it.

Thanks,
Shawn



Unified highlighter and complexphrase

2017-03-17 Thread Bjarke Buur Mortensen
Hi list,
Given the text:
"Kontraktsproget vil være dansk og arbejdssproget kan være dansk, svensk,
norsk og engelsk"
and the query:
{!complexphrase df=content_da}("sve* no*")
the unified highlighter (hl.method=unified) does not return any highlights.
For reference, the original highlighter returns a snippet with the expected
highlights:
Kontraktsproget vil være dansk og arbejdssproget kan være dansk,
svensk, norsk og
Is this expected behaviour with the unified highlighter?

I have also filed this a bug report here:
https://issues.apache.org/jira/browse/SOLR-10309
but maybe some of you can help out.

Thanks in advance,
Bjarke
Senior Software Engineer, Eluence A/S


Re: Exact match works only for some of the strings

2017-03-17 Thread Gintautas Sulskus
Hi,

Thank you for your replies.
Sorry, forgot to specify, I am using Solr 4.10.3 (from Cloudera CDH 5.9.0).

When I search for name:Guardian I can see both  "Guardian EU-referendum"
and "Guardian US" in the result set.
The debugQuery results for both queries are identical
http://pastebin.com/xr96EF0r
Reindexing did not help.


Cheers,
Gintas



On Thu, Mar 16, 2017 at 8:18 PM, Alvaro Cabrerizo 
wrote:

> Hello,
>
> I've tested on an old solr 4.3 instance and the schema and the field
> definition are fine. I've also checked that only the
> query nameExact:"Guardian EU-referendum" gives the result, the other one
> you have commented (nameExact:"Guardian US") gives 0 hits. Maybe, you
> forgot to re-index after schema modification. I mean, you indexed your
> data, then changed the schema and then start querying using the new schema
> that does not match your index.
>
> Hope it helps.
>
> On Thu, Mar 16, 2017 at 7:50 PM, Mikhail Khludnev  wrote:
>
> > You can try to check debugQuery to understand how this query is parsed:
> > double quotes hardly compatible with KeywordTokenizer. Also you can check
> > which terms are indexed in SchemaBrowser. Also, there is Analysis page at
> > Solr Admin.
> >
> > On Thu, Mar 16, 2017 at 8:55 PM, Gintautas Sulskus <
> > gintautas.suls...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I am trying to figure out why Solr returns an empty result when
> searching
> > > for the following query:
> > >
> > > nameExact:"Guardian EU-referendum"
> > >
> > >
> > > The field definition:
> > >
> > >  > />
> > >
> > >
> > > The type definition:
> > >
> > >  > > sortMissingLast="true" omitNorms="true">
> > >
> > > 
> > >
> > > 
> > >
> > > 
> > >
> > > 
> > >
> > > 
> > >
> > > The analysis, as expected, matches the query parameter against the
> stored
> > > value. Please take a look at the attached image. I am using
> > > KeywordTokenizer and LowerCaseFilter.
> > > ​
> > > What is more strange, the query below works just fine:
> > >
> > > nameExact:"Guardian US"
> > >
> > >
> > > Could you please provide me with some clues on what could be wrong?
> > >
> > > Thanks,
> > > Gintas
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>


Solrcould with Haystack

2017-03-17 Thread serwah
Hi there,

Is there anyone who has used SolrCloud together with Django Haystack?
I need to use Django Haystack with distributed search, and the question
is whether SolrCloud could be a solution. I ask this as I have not
found good sources on the topic.


Best,
Serwah


Re: Data Import

2017-03-17 Thread Sujay Bawaskar
Hi Vishal,

In my experience DIH is the best option for indexing from an RDBMS into Solr.
DIH with caching has the best performance, and DIH nested entities allow you
to define simple queries.
Also, SolrJ is good when you want your RDBMS updates to be made immediately
available in Solr. A DIH full import can be used to index all data the first
time, or to restore the index in case it is corrupted.
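
As a rough sketch (driver, table, and column names are made up), a cached
nested-entity data-config might look something like:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb" user="solr" password="secret"/>
  <document>
    <entity name="item" query="SELECT id, name FROM item">
      <!-- child entity is read once, cached, and joined in memory on item.id -->
      <entity name="feature" query="SELECT item_id, description FROM feature"
              processor="SqlEntityProcessor" cacheImpl="SortedMapBackedCache"
              cacheKey="item_id" cacheLookup="item.id"/>
    </entity>
  </document>
</dataConfig>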

Thanks,
Sujay

On Fri, Mar 17, 2017 at 2:34 PM, vishal jain  wrote:

> Hi,
>
>
> I am new to Solr and am trying to move data from my RDBMS to Solr. I know
> the available options are:
> 1) Post Tool
> 2) DIH
> 3) SolrJ (as ours is a J2EE application).
>
> I want to know what is the recommended way for Data import in production
> environment.
> Will sending data via SolrJ in batches be faster than posting a csv using
> POST tool?
>
>
> Thanks,
> Vishal
>



-- 
Thanks,
Sujay P Bawaskar
M:+91-77091 53669


Data Import

2017-03-17 Thread vishal jain
Hi,


I am new to Solr and am trying to move data from my RDBMS to Solr. I know
the available options are:
1) Post Tool
2) DIH
3) SolrJ (as ours is a J2EE application).

I want to know what is the recommended way for Data import in production
environment.
Will sending data via SolrJ in batches be faster than posting a csv using
POST tool?


Thanks,
Vishal


Solr data Import

2017-03-17 Thread vishal jain
Hi,


I am new to Solr and am trying to move data from my RDBMS to Solr. I know
the available options are:
1) Post Tool
2) DIH
3) SolrJ (as ours is a J2EE application).

I want to know what is the recommended way for Data import in production
environment.
Will sending data via SolrJ in batches be faster than posting a csv using
POST tool?


Thanks,
Vishal


Re: fq performance

2017-03-17 Thread Michael Kuhlmann

Hi Ganesh,

you might want to use something like this:

fq=access_control:(g1 g2 g5 g99 ...)

Then it's only one fq filter per request. Internally it's like an OR condition, 
but in a more condensed form. I have already used this with up to 500 values 
without major performance degradation (but in that case it was the unique id 
field).

You should think a minute about your filter cache here. Since you only have one 
fq filter per request, you won't blow your cache that fast. But it depends on 
your use case whether you should cache these filters at all. When it's common 
that a single user will send several requests within one commit interval, or 
when it's likely that several users will be in the same groups, then just use 
it like that. But when it's more likely that each request belongs to a 
different user with different security settings, then you should consider 
disabling the cache for this fq filter so that your filter cache (for other 
filters you probably have) won't be polluted: 
fq={!cache=false}access_control:(g1 g2 g5 g99 ...). See 
http://yonik.com/advanced-filter-caching-in-solr/ for information on that.

-Michael



Am 17.03.2017 um 07:46 schrieb Ganesh M:

Hi Shawn / Michael,

Thanks for your replies and I guess you have got my scenarios exactly right.

Initially my document contains information about who has access to the
documents, as fields like (U1_s:true). If 100 users can access a document,
we will have 100 such fields in that document, one per user.
So when U1 wants to see all these documents, I will query for all
documents where U1_s:true.

If user U5 is added to group G1, then I have to take all the documents of
group G1 and set U5_s:true in each of those documents. For this, I have to
re-index all the documents in that group.

To avoid this, I was trying to keep group information instead of user
information, like G1_s:true, G2_s:true, in the document. For querying a
user's documents, I will first get all the groups of user U1, and then query
for all documents where G1_s:true OR G2_s:true OR G3_s:true. This way we
don't need to re-index all the documents, but while querying I need to
query with an OR of all the groups the user belongs to.

For how many ORs can Solr return results in less than one second? Can I
pass hundreds of OR conditions in the Solr query? Will that affect
performance?

Please share your valuable inputs.

On Thu, Mar 16, 2017 at 6:04 PM Shawn Heisey  wrote:


On 3/16/2017 6:02 AM, Ganesh M wrote:

We have 1 million of documents and would like to query with multiple fq

values.

We have kept the access_control ( multi value field ) which holds

information about for which group that document is accessible.

Now to get the list of all the documents of an user, we would like to

pass multiple fq values ( one for each group user belongs to )



q:somefiled:value&fq:access_control:g1&fq:access_control:g2&fq:access_control:g3&fq:access_control:g4&fq:access_control:g5...

Like this, there could be 100 groups for an user.

The correct syntax is fq=field:value -- what you have there is not going
to work.

This might not do what you expect.  Filter queries are ANDed together --
*every* filter must match, which means that if a document that you want
has only one of those values in access_control, or has 98 of them but
not all 100, then the query isn't going to match that document.  The
solution is one filter query that can match ANY of them, which also
might run faster.  I can't say whether this is a problem for you or
not.  Your data might be completely correct for matching 100 filters.

Also keep in mind that there is a limit to the size of a URL that you
can send into any webserver, including the container that runs Solr.
That default limit is 8192 bytes, and includes the "GET " or "POST " at
the beginning and the " HTTP/1.1" at the end (note the spaces).  The
filter query information for 100 of the filters you mentioned is going
to be over 2K, which will fit in the default, but if your query has more
complexity than you have mentioned here, the total URL might not fit.
There's a workaround to this -- use a POST request and put the
parameters in the request body.


If we fire query with 100 values in the fq, whats the penalty on the

performance ? Can we get the result in less than one second for 1 million
of documents.

With one million documents, each internal filter query result is 125,000
bytes -- the number of documents divided by eight.  That's 12.5 megabytes
for 100 of them.  In addition, every time a filter is run, it must
examine every document in the index to create that 125,000 byte
structure, which means that filters which *aren't* found in the
filterCache are relatively slow.  If they are found in the cache,
they're lightning fast, because the cache will contain the entire 125,000
byte bitset.

If you make your filterCache large enough, it's going to consume a LOT
of java heap memory, particularly if the index gets bigger.  The 

Re: question about function query

2017-03-17 Thread Bernd Fehling
Hi Mikhail,

thanks for your help.
After some more reading and testing I found the solution.
Just in case someone else needs it here are the results.

original query:
q=collection:ftmuenster+AND+-description:*=*
--> numFound="1877"

frange query:
q=collection:ftmuenster&fq={!frange+l=0+u=0}exists(description)
--> numFound="1877"

original query:
q=collection:ftmuenster+AND+description:*=*
--> numFound="4152"

frange query:
q=collection:ftmuenster&fq={!frange+l=1+u=1}exists(description)
--> numFound="4152"


Regards
Bernd

Am 16.03.2017 um 19:53 schrieb Mikhail Khludnev:
> Hello,
> A function query matches all docs. Use {!frange} if you want to select docs
> with some particular values.
> 
> On Thu, Mar 16, 2017 at 6:08 PM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
> 
>> I'm testing some function queries and have some questions.
>>
>> original queries:
>> 1. q=collection:ftmuenster=*
>> --> numFound="6029"
>>
>> 2. q=collection:ftmuenster+AND+-description:*=*
>> --> numFound="1877"
>>
>> 3. q=collection:ftmuenster+AND+description:*=*
>> --> numFound="4152"
>>
>> This looks good.
>>
>> But now with function query:
>>
>> q={!func}exists(description)=collection:ftmuenster=*
>> --> numFound="6029"
>>
>> I'm was hoping to get numFound=4152, why not?
>>
>> I also tried:
>> q={!func}exists(description)=collection:ftmuenster=AND=*
>> --> numFound="6029"
>>
>> What are the function queries equivalent to queries 2. and 3. above?
>>
>> Regards
>> Bernd
>>
>>


Re: fq performance

2017-03-17 Thread Ganesh M
Hi Shawn / Michael,

Thanks for your replies and I guess you have got my scenarios exactly right.

Initially my document contains information about who has access to the
documents, as fields like (U1_s:true). If 100 users can access a document,
we will have 100 such fields in that document, one per user.
So when U1 wants to see all these documents, I will query for all
documents where U1_s:true.

If user U5 is added to group G1, then I have to take all the documents of
group G1 and set U5_s:true in each of those documents. For this, I have to
re-index all the documents in that group.

To avoid this, I was trying to keep group information instead of user
information, like G1_s:true, G2_s:true, in the document. For querying a
user's documents, I will first get all the groups of user U1, and then query
for all documents where G1_s:true OR G2_s:true OR G3_s:true. This way we
don't need to re-index all the documents, but while querying I need to
query with an OR of all the groups the user belongs to.
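
In query form, that would be something like (group fields as above):

q=*:*&fq=(G1_s:true OR G2_s:true OR G3_s:true)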

For how many ORs can Solr return results in less than one second? Can I
pass hundreds of OR conditions in the Solr query? Will that affect
performance?

Please share your valuable inputs.

On Thu, Mar 16, 2017 at 6:04 PM Shawn Heisey  wrote:

> On 3/16/2017 6:02 AM, Ganesh M wrote:
> > We have 1 million of documents and would like to query with multiple fq
> values.
> >
> > We have kept the access_control ( multi value field ) which holds
> information about for which group that document is accessible.
> >
> > Now to get the list of all the documents of an user, we would like to
> pass multiple fq values ( one for each group user belongs to )
> >
> >
> q:somefiled:value&fq:access_control:g1&fq:access_control:g2&fq:access_control:g3&fq:access_control:g4&fq:access_control:g5...
> >
> > Like this, there could be 100 groups for an user.
>
> The correct syntax is fq=field:value -- what you have there is not going
> to work.
>
> This might not do what you expect.  Filter queries are ANDed together --
> *every* filter must match, which means that if a document that you want
> has only one of those values in access_control, or has 98 of them but
> not all 100, then the query isn't going to match that document.  The
> solution is one filter query that can match ANY of them, which also
> might run faster.  I can't say whether this is a problem for you or
> not.  Your data might be completely correct for matching 100 filters.
>
> Also keep in mind that there is a limit to the size of a URL that you
> can send into any webserver, including the container that runs Solr.
> That default limit is 8192 bytes, and includes the "GET " or "POST " at
> the beginning and the " HTTP/1.1" at the end (note the spaces).  The
> filter query information for 100 of the filters you mentioned is going
> to be over 2K, which will fit in the default, but if your query has more
> complexity than you have mentioned here, the total URL might not fit.
> There's a workaround to this -- use a POST request and put the
> parameters in the request body.
>
> > If we fire query with 100 values in the fq, whats the penalty on the
> performance ? Can we get the result in less than one second for 1 million
> of documents.
>
> With one million documents, each internal filter query result is 125,000
> bytes -- the number of documents divided by eight.  That's 12.5 megabytes
> for 100 of them.  In addition, every time a filter is run, it must
> examine every document in the index to create that 125,000 byte
> structure, which means that filters which *aren't* found in the
> filterCache are relatively slow.  If they are found in the cache,
> they're lightning fast, because the cache will contain the entire 125,000
> byte bitset.
>
> If you make your filterCache large enough, it's going to consume a LOT
> of java heap memory, particularly if the index gets bigger.  The nice
> thing about the filterCache is that once the cache entries exist, the
> filters are REALLY fast, and if they're all cached, you would DEFINITELY
> be able to get results in under one second.  I have no idea whether the
> same would happen when filters aren't cached.  It might.  Filters that
> do not exist in the cache will be executed in parallel, so the number of
> CPUs that you have in the machine, along with the query rate, will have
> a big impact on the overall performance of a single query with a lot of
> filters.
>
> Also related to the filterCache, keep in mind that every time a commit
> is made that opens a new searcher, the filterCache will be autowarmed.
> If the autowarmCount value for the filterCache is large, that can make
> commits take a very long time, which will cause problems if commits are
> happening frequently.  On the other hand, a very small autowarmCount can
> cause slow performance after a commit if you use a lot of filters.
>
> My reply is longer and more dense than I had anticipated.  Apologies if
> it's information overload.
>
> Thanks,
>