Slow ReadProcessor read fields Warnings - Ideas to investigate?

2019-05-22 Thread David Winter
Hello User Group,

We run Solr on HDFS and are getting a lot of the following warning:
Slow ReadProcessor read fields took 15093ms (threshold=1ms); ack: 
seqno: 3 reply: SUCCESS reply: SUCCESS reply: SUCCESS 
downstreamAckTimeNanos: 798309 flag: 0 flag: 0 flag: 0, targets: 
[DatanodeInfoWithStorage[xxx.xxx.xxx.xxx:50010,DS-xx,DISK], 
DatanodeInfoWithStorage[xxx.xxx.xxx.xxx:50010,DS-xx,DISK], 
DatanodeInfoWithStorage[xxx.xxx.xxx.xxx:50010,DS-xx,DISK]]

It started with the default threshold of 30 seconds, but even 10 seconds is
too long for a query, so we lowered the warning threshold to 10 seconds.
That resulted in a flood of warnings and uncovered slow HDFS read
performance. The HDFS statistics look quite good and stable.
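
For reference, the threshold in question seems to be the HDFS client
setting dfs.client.slow.io.warning.threshold.ms (default 30000 ms), which
can be set in hdfs-site.xml, e.g.:

  <property>
    <name>dfs.client.slow.io.warning.threshold.ms</name>
    <value>10000</value> <!-- warn when reads/acks exceed 10 seconds -->
  </property>

Lowering it only changes when the warning fires; it does not affect the
underlying read latency.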

We are not sure how to investigate the cause or what we can improve to
solve the issue.
Has anybody had similar issues?

Mit freundlichen Grüßen / Kind regards

David Winter



Re: Looking for design ideas

2018-03-18 Thread Rick Leir
Steve
Does a document have a different URL when it is in a personal DB? 

I suspect the easiest solution is to use just one index.

You can have a field containing an integer identifying the personal DB. For 
public, set this to zero. Call it DBid. Update the doc to change this and the 
URL when the user starts editing.

Then the query contains the userid, and you boost on this field. Or something 
like that.
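
A sketch of what that single-index query could look like in SolrJ, with the
hypothetical DBid field above; here a filter query enforces the rule outright
(a hard variant of the boost):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class SingleIndexSearch {
    public static void main(String[] args) throws Exception {
      HttpSolrClient solr =
          new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();
      SolrQuery q = new SolrQuery("title:contract");
      // User 42 sees public docs (DBid = 0) plus their own personal copies.
      // A checked-out doc was updated in place to DBid = 42, so its public
      // version no longer exists and needs no extra exclusion.
      q.addFilterQuery("DBid:(0 OR 42)");
      QueryResponse rsp = solr.query(q);
      System.out.println("hits: " + rsp.getResults().getNumFound());
    }
  }
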
Cheers -- Rick


On March 18, 2018 11:13:49 AM EDT, Steven White <swhite4...@gmail.com> wrote:
>Hi everyone,
>
>I have a design problem that I'm not sure how best to solve, so I
>figured I'd
>share it here and see what ideas others may have.
>
>I have a DB that holds documents (over 1 million and growing).  This is
>known as the "Public" DB that holds documents visible to all of my end
>users.
>
>My application lets users "check-out" one or more documents at a time
>off
>this "Public" DB, edit them and "check-in" back into the "Public" DB. 
>When
>a document is checked-out, it goes into a "Personal" DB for that user
>(and
>the document in the "Public" DB is flagged as such to alert other
>users.)
>The owner of this checked-out document in the "Personal" DB can make
>changes to the document and save it back into the "Personal" DB as
>often as
>he wants to.  Sometimes the document lives in the "Personal" DB for a few
>minutes before it is checked-in back into the "Public" DB and sometimes
>it
>can live in the "Personal" DB for 1 day or 1 month.  When a document is
>saved into the "Personal" DB, only the owner of that document can see
>it.
>
>Currently there are 100 users but this will grow to at least 500 or
>maybe
>even 1000.
>
>I'm looking at a solution on how to enable a full text search on those
>documents, both in the "Public" and "Personal" DB so that:
>
>1) Documents in the "Public" DB are searchable by all users.  This is
>the
>easy part.
>
>2) Documents in the "Personal" DB of each user are searchable by the
>owner
>of that "Personal" DB.  This is easy too.
>
>3) A user can search both the "Public" and "Personal" DB at any time, but
>if
>a document is in the "Personal" DB, we will not search it in the "Public" --
>i.e.: whatever is in the "Personal" DB takes precedence over what's in the
>"Public" DB.
>
>Item #3 is important and is what I'm trying to solve.  The goal is to
>give
>hits to the user on documents that they are editing (in their
>"Personal"
>DB) instead of that in the "Public".
>
>The way I'm thinking to solve this problem is to create 2 Solr indexes
>(do
>we call those "cores"?):
>
>1) The "Public" DB is indexed into the "Public" Solr index.
>
>2) The "Personal" DB is indexed into the "Personal" Solr index with a
>field
>indicating the owner of that document.
>
>With the above 2 indexes, I can now send the user's search syntax to
>both
>indexes but for the "Public", I will also send a list of IDs (those
>documents in the user's "Personal" DB) to exclude from the result set.
>This way, I let a user search both the "Public" and "Personal" DB as
>such
>the documents in the "Personal" DB are included in the search and are
>excluded from the "Public" DB.
>
>Did I make sense?  If so, is this doable?  Will ranking be affected
>given
>that I'm searching 2 indexes?
>
>Let me know what issues I might be overlooking with this solution.
>
>Thanks
>
>Steve

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Looking for design ideas

2018-03-18 Thread Rahul Singh
I’ve worked on something similar - data set was 100m documents with thousands 
of users. The ranking is relative in each index. Eg. What is #1 , #2, #3 is 
only 1,2,3 in that index.

Your challenge will be in the user interface result display: how to merge results
in a way that the relevant results are shown before the non-relevant ones.

There are numerous ways to merge — you could even retrieve, merge, index, and
retrieve from that — but computing power aside, that's not efficient.

You could consider two indexes, not as public and private, but as metadata
(data indexed only, not stored) and data (indexed / stored values). This way
you'll get your ranking without having to compromise. Once you have your doc
ids, you can retrieve from a data index / read-only Solr cluster or a scalable
persistent store (Cassandra, Mongo, etc.) that would scale way better than
Solr itself for thousands if not millions of users (please let's not start a
debate about this).

This way your users would have relevant results and fast access to the index,
and the data would be protected - if you filter by the doc owner id as an "or"
clause in addition to doc owner id = 'public'. What you lose by not getting the
document data from the initial query you can retrieve asynchronously, or maybe
"join" with another collection — which I've not done, but I know it's possible.

You may also want to consider the CQRS pattern for doc check-in / check-out
actions to keep indexing / query time scalable. It may be more work, but it's
more scalable. Go big or go home. ;)

Hope it helps

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Mar 18, 2018, 11:14 AM -0400, Steven White <swhite4...@gmail.com>, wrote:
> Hi everyone,
>
> I have a design problem that I'm not sure how best to solve, so I figured I'd
> share it here and see what ideas others may have.
>
> I have a DB that holds documents (over 1 million and growing). This is
> known as the "Public" DB that holds documents visible to all of my end
> users.
>
> My application lets users "check-out" one or more documents at a time off
> this "Public" DB, edit them and "check-in" back into the "Public" DB. When
> a document is checked-out, it goes into a "Personal" DB for that user (and
> the document in the "Public" DB is flagged as such to alert other users.)
> The owner of this checked-out document in the "Personal" DB can make
> changes to the document and save it back into the "Personal" DB as often as
> he wants to. Sometimes the document lives in the "Personal" DB for a few
> minutes before it is checked-in back into the "Public" DB and sometimes it
> can live in the "Personal" DB for 1 day or 1 month. When a document is
> saved into the "Personal" DB, only the owner of that document can see it.
>
> Currently there are 100 users but this will grow to at least 500 or maybe
> even 1000.
>
> I'm looking at a solution on how to enable a full text search on those
> documents, both in the "Public" and "Personal" DB so that:
>
> 1) Documents in the "Public" DB are searchable by all users. This is the
> easy part.
>
> 2) Documents in the "Personal" DB of each user are searchable by the owner
> of that "Personal" DB. This is easy too.
>
> 3) A user can search both the "Public" and "Personal" DB at any time, but if
> a document is in the "Personal" DB, we will not search it in the "Public" --
> i.e.: whatever is in the "Personal" DB takes precedence over what's in the
> "Public" DB.
>
> Item #3 is important and is what I'm trying to solve. The goal is to give
> hits to the user on documents that they are editing (in their "Personal"
> DB) instead of that in the "Public".
>
> The way I'm thinking to solve this problem is to create 2 Solr indexes (do
> we call those "cores"?):
>
> 1) The "Public" DB is indexed into the "Public" Solr index.
>
> 2) The "Personal" DB is indexed into the "Personal" Solr index with a field
> indicating the owner of that document.
>
> With the above 2 indexes, I can now send the user's search syntax to both
> indexes but for the "Public", I will also send a list of IDs (those
> documents in the user's "Personal" DB) to exclude from the result set.
> This way, I let a user search both the "Public" and "Personal" DB as such
> the documents in the "Personal" DB are included in the search and are
> excluded from the "Public" DB.
>
> Did I make sense? If so, is this doable? Will ranking be affected given
> that I'm searching 2 indexes?
>
> Let me know what issues I might be overlooking with this solution.
>
> Thanks
>
> Steve


Looking for design ideas

2018-03-18 Thread Steven White
Hi everyone,

I have a design problem that I'm not sure how best to solve, so I figured I'd
share it here and see what ideas others may have.

I have a DB that holds documents (over 1 million and growing).  This is
known as the "Public" DB that holds documents visible to all of my end
users.

My application lets users "check-out" one or more documents at a time off
this "Public" DB, edit them and "check-in" back into the "Public" DB.  When
a document is checked-out, it goes into a "Personal" DB for that user (and
the document in the "Public" DB is flagged as such to alert other users.)
The owner of this checked-out document in the "Personal" DB can make
changes to the document and save it back into the "Personal" DB as often as
he wants to.  Sometimes the document lives in the "Personal" DB for a few
minutes before it is checked-in back into the "Public" DB and sometimes it
can live in the "Personal" DB for 1 day or 1 month.  When a document is
saved into the "Personal" DB, only the owner of that document can see it.

Currently there are 100 users but this will grow to at least 500 or maybe
even 1000.

I'm looking at a solution on how to enable a full text search on those
documents, both in the "Public" and "Personal" DB so that:

1) Documents in the "Public" DB are searchable by all users.  This is the
easy part.

2) Documents in the "Personal" DB of each user are searchable by the owner
of that "Personal" DB.  This is easy too.

3) A user can search both the "Public" and "Personal" DB at any time, but if
a document is in the "Personal" DB, we will not search it in the "Public" --
i.e.: whatever is in the "Personal" DB takes precedence over what's in the
"Public" DB.

Item #3 is important and is what I'm trying to solve.  The goal is to give
hits to the user on documents that they are editing (in their "Personal"
DB) instead of that in the "Public".

The way I'm thinking to solve this problem is to create 2 Solr indexes (do
we call those "cores"?):

1) The "Public" DB is indexed into the "Public" Solr index.

2) The "Personal" DB is indexed into the "Personal" Solr index with a field
indicating the owner of that document.

With the above 2 indexes, I can now send the user's search syntax to both
indexes but for the "Public", I will also send a list of IDs (those
documents in the user's "Personal" DB) to exclude from the result set.
This way, I let a user search both the "Public" and "Personal" DB as such
the documents in the "Personal" DB are included in the search and are
excluded from the "Public" DB.
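
A minimal sketch of that two-index flow in SolrJ, with hypothetical core
names, field names, and IDs; the exclusion list goes in as a negated filter
query:

  import java.util.Arrays;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class TwoIndexSearch {
    public static void main(String[] args) throws Exception {
      HttpSolrClient personal =
          new HttpSolrClient.Builder("http://localhost:8983/solr/personal").build();
      HttpSolrClient pub =
          new HttpSolrClient.Builder("http://localhost:8983/solr/public").build();

      String userQuery = "title:contract";

      // 1) The user's own personal documents.
      SolrQuery pq = new SolrQuery(userQuery);
      pq.addFilterQuery("owner_id:42");   // hypothetical owner field
      QueryResponse personalHits = personal.query(pq);

      // 2) Public documents, minus the user's checked-out copies.
      //    (IDs come from the application's check-out bookkeeping.)
      List<String> checkedOut = Arrays.asList("doc17", "doc239");
      SolrQuery q = new SolrQuery(userQuery);
      if (!checkedOut.isEmpty()) {
        q.addFilterQuery("-id:(" + String.join(" OR ", checkedOut) + ")");
      }
      QueryResponse publicHits = pub.query(q);

      // The two result lists are merged in the application layer; note
      // their relevance scores are not directly comparable.
    }
  }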

Did I make sense?  If so, is this doable?  Will ranking be affected given
that I'm searching 2 indexes?

Let me know what issues I might be overlooking with this solution.

Thanks

Steve


Re: Different ideas for querying unique and non-unique records

2017-08-30 Thread Rick Leir
Susheel, Just a guess, but carrot2.org might be useful. But it might be 
overkill. Cheers -- Rick

On August 30, 2017 7:40:08 AM MDT, Susheel Kumar <susheel2...@gmail.com> wrote:
>Hello,
>
>I am looking for different ideas/suggestions to solve the use case I am
>working on.
>
>We have a couple of fields in the schema along with id: business_email and
>personal_email.  We need to return all records based on unique business
>and
>personal emails.
>
>The criterion for a unique record is that neither its business nor its
>personal email is repeated in any other record.
>The criterion for non-unique records is that if either the business or
>personal email occurs again in other records, then all of those records
>are non-unique.
>E.g., considering the documents below:
>- for unique records, only id=1 should be returned (since john.doe is
>not present in any other record's personal or business email)
>- for non-unique records, id=2,3 should be returned (since isabel.dora is
>present in multiple records; it doesn't matter whether it is in the
>business or personal email)
>
>Documents
>===
>{id:1,business_email_s:john@abc.com,personal_email_s:john@abc.com}
>{id:2,business_email_s:isabel.d...@abc.com}
>{id:3,personal_email_s:isabel.d...@abc.com}
>
>I am able to solve this using a Streaming Expressions query, but I'm not
>sure if performance will become a bottleneck, as the streaming expression
>is quite big. So I'm looking for different ideas - de-duplication,
>handling it during ingestion/pre-processing, etc. - without impacting
>performance much.
>
>Thanks,
>Susheel

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Different ideas for querying unique and non-unique records

2017-08-30 Thread Susheel Kumar
Hello,

I am looking for different ideas/suggestions to solve the use case I am
working on.

We have a couple of fields in the schema along with id: business_email and
personal_email.  We need to return all records based on unique business and
personal emails.

The criterion for a unique record is that neither its business nor its
personal email is repeated in any other record.
The criterion for non-unique records is that if either the business or
personal email occurs again in other records, then all of those records are
non-unique.
E.g., considering the documents below:
- for unique records, only id=1 should be returned (since john.doe is not
present in any other record's personal or business email)
- for non-unique records, id=2,3 should be returned (since isabel.dora is
present in multiple records; it doesn't matter whether it is in the
business or personal email)

Documents
===
{id:1,business_email_s:john@abc.com,personal_email_s:john@abc.com}
{id:2,business_email_s:isabel.d...@abc.com}
{id:3,personal_email_s:isabel.d...@abc.com}

I am able to solve this using a Streaming Expressions query, but I'm not sure
if performance will become a bottleneck, as the streaming expression is quite
big. So I'm looking for different ideas - de-duplication, handling it during
ingestion/pre-processing, etc. - without impacting performance much.
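
One lighter-weight angle, assuming both addresses are copied into a single
multiValued field at index time (say a hypothetical all_emails_ss via
copyField): a plain facet with mincount=2 surfaces every address shared by
two or more records. A SolrJ sketch under that assumption:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.FacetField;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class SharedEmails {
    public static void main(String[] args) throws Exception {
      HttpSolrClient solr =
          new HttpSolrClient.Builder("http://localhost:8983/solr/contacts").build();
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);                      // only the facet counts matter
      q.addFacetField("all_emails_ss");  // copyField target of both email fields
      q.setFacetMinCount(2);             // addresses present in 2+ records
      q.setFacetLimit(-1);               // return all such addresses
      QueryResponse rsp = solr.query(q);
      FacetField ff = rsp.getFacetField("all_emails_ss");
      for (FacetField.Count c : ff.getValues()) {
        System.out.println(c.getName() + " occurs in " + c.getCount() + " records");
      }
    }
  }

Facet counts are per document, so id=1 (which repeats its own address across
its two fields) still counts as one record and stays unique; the flagged
addresses can then drive a second query for the non-unique ids, or a flag
set at ingestion time.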

Thanks,
Susheel


Re: Ideas

2015-09-21 Thread Paul Libbrecht
Writing a query component would be pretty easy, no?
It would throw an exception if crazy numbers are requested...

I can provide a simple example of a maven project for a query component.
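
In the meantime, a minimal sketch of such a component (class name and limit
values are hypothetical, and the exact SearchComponent hooks vary slightly
across Solr versions):

  import org.apache.solr.common.SolrException;
  import org.apache.solr.common.params.CommonParams;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;

  public class PagingGuardComponent extends SearchComponent {
    private static final int MAX_START = 10000;  // hypothetical limits
    private static final int MAX_ROWS  = 100;

    @Override
    public void prepare(ResponseBuilder rb) {
      int start = rb.req.getParams().getInt(CommonParams.START, 0);
      int rows  = rb.req.getParams().getInt(CommonParams.ROWS, 10);
      if (start > MAX_START || rows > MAX_ROWS) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "start/rows exceed the allowed paging window");
      }
    }

    @Override
    public void process(ResponseBuilder rb) {
      // nothing to do; the check in prepare() is enough
    }

    @Override
    public String getDescription() {
      return "Rejects requests with excessive start/rows values";
    }
  }

Registered as a first-components entry on the /select handler in
solrconfig.xml, it would reject the crazy offsets before any searching
happens.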

Paul


William Bell wrote:
> We have some Denial of service attacks on our web site. SOLR threads are
> going crazy.
>
> Basically someone is hitting start=15 + and rows=20. The start is crazy
> large.
>
> And then they jump around. start=15 then start=213030 etc.
>
> Any ideas for how to stop this besides blocking these IPs?
>
> Sometimes it is Google doing it even though these search results are set
> with No-index and No-Follow on these pages.
>
> Thoughts? Ideas?



Re: Ideas

2015-09-21 Thread DVT
Hi Bill,
  the classical way would be to have a reverse proxy in front of the
application that catches such cases. A decent reverse proxy or even
application firewall router will allow you to define limits on bandwidth
and sessions per time unit. Some even recognize specific
denial-of-service patterns.

Of course, you could also simply limit the ranges of parameters accepted
over the Internet - unless these wild ranges may actually occur in valid
scenarios.

A bit more complex is the third alternative that requires valid sessions
and permits paging only in one or the other direction. This way, start
and offset values would not be exposed, only functions for next
page/previous page or maybe some larger steps would be supported.
Stepping to one offset would also only be permitted if you come from a
proper previous page. Initial requests (in new sessions) would have to
start at offset 1. Constraints on the parameters in subsequent requests
within a session are a bit harder to handle.

Cheers,
--Jürgen

On 21.09.2015 19:28, William Bell wrote:
> We have some Denial of service attacks on our web site. SOLR threads are
> going crazy.
>
> Basically someone is hitting start=15 + and rows=20. The start is crazy
> large.
>
> And then they jump around. start=15 then start=213030 etc.
>
> Any ideas for how to stop this besides blocking these IPs?
>
> Sometimes it is Google doing it even though these search results are set
> with No-index and No-Follow on these pages.
>
> Thoughts? Ideas?
>
> Thanks
>

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением

*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

DevoteThem GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com
<mailto:juergen.wag...@devoteam.com>, URL: www.devoteam.de
<http://www.devoteam.de/>


Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071





Re: Ideas

2015-09-21 Thread Doug Turnbull
The nginx reverse proxy we use blocks ridiculous start and rows values

https://github.com/o19s/solr_nginx

Another silly thing I've noticed is that you can pass sleep() as a function
query. It's not documented, but I think it's a big hole. I wonder if I could DoS
your Solr by sleeping and hogging all the available query threads?

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.0/org/apache/solr/search/ValueSourceParser.java#114
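
For illustration, reading the signature from that source as sleep(millis,
returnValue), something like the following (collection name hypothetical;
curl's -g keeps it from globbing the braces) would pin one query thread for
five seconds:

  curl -g "http://localhost:8983/solr/collection1/select?q={!func}sleep(5000,1)"

which is exactly the kind of request a proxy whitelist should refuse to
forward.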

On Mon, Sep 21, 2015 at 1:37 PM, Jürgen Wagner (DVT) <
juergen.wag...@devoteam.com> wrote:

> Hi Bill,
>   the classical way would be to have a reverse proxy in front of the
> application that catches such cases. A decent reverse proxy or even
> application firewall router will allow you to define limits on bandwidth
> and sessions per time unit. Some even recognize specific denial-of-service
> patterns.
>
> Of course, you could also simply limit the ranges of parameters accepted
> over the Internet - unless these wild ranges may actually occur in valid
> scenarios.
>
> A bit more complex is the third alternative that requires valid sessions
> and permits paging only in one or the other direction. This way, start and
> offset values would not be exposed, only functions for next page/previous
> page or maybe some larger steps would be supported. Stepping to one offset
> would also only be permitted if you come from a proper previous page.
> Initial requests (in new sessions) would have to start at offset 1.
> Constraints on the parameters in subsequent requests within a session are a
> bit harder to handle.
>
> Cheers,
> --Jürgen
>
> On 21.09.2015 19:28, William Bell wrote:
>
> We have some Denial of service attacks on our web site. SOLR threads are
> going crazy.
>
> Basically someone is hitting start=15 + and rows=20. The start is crazy
> large.
>
> And then they jump around. start=15 then start=213030 etc.
>
> Any ideas for how to stop this besides blocking these IPs?
>
> Sometimes it is Google doing it even though these search results are set
> with No-index and No-Follow on these pages.
>
> Thoughts? Ideas?
>
> Thanks
>
>
>
> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
> уважением
>
> *i.A. Jürgen Wagner*
> Head of Competence Center "Intelligence"
> & Senior Cloud Consultant
>
> DevoteThem GmbH, Industriestr. 3, 70565 Stuttgart, Germany
> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
> E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
> --
> Managing Board: Jürgen Hatzipantelis (CEO)
> Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
> Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
>
>
>
>


-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Ideas

2015-09-21 Thread William Bell
We have some Denial of service attacks on our web site. SOLR threads are
going crazy.

Basically someone is hitting start=15 + and rows=20. The start is crazy
large.

And then they jump around. start=15 then start=213030 etc.

Any ideas for how to stop this besides blocking these IPs?

Sometimes it is Google doing it even though these search results are set
with No-index and No-Follow on these pages.

Thoughts? Ideas?

Thanks

-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Ideas

2015-09-21 Thread Walter Underwood
I have put a limit in the front end at a couple of sites. Nobody gets more than 
50 pages of results. Show page 50 if they request beyond that.

First got hit by this at Netflix, years ago.

Solr 4 is much better about deep paging, but here at Chegg we got deep paging 
plus a stupid, long query. That was using too much CPU.

Right now, block the IPs. Those are hostile.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 21, 2015, at 10:31 AM, Paul Libbrecht <p...@hoplahup.net> wrote:
> 
> Writing a query component would be pretty easy or?
> It would throw an exception if crazy numbers are requested...
> 
> I can provide a simple example of a maven project for a query component.
> 
> Paul
> 
> 
> William Bell wrote:
>> We have some Denial of service attacks on our web site. SOLR threads are
>> going crazy.
>> 
>> Basically someone is hitting start=15 + and rows=20. The start is crazy
>> large.
>> 
>> And then they jump around. start=15 then start=213030 etc.
>> 
>> Any ideas for how to stop this besides blocking these IPs?
>> 
>> Sometimes it is Google doing it even though these search results are set
>> with No-index and No-Follow on these pages.
>> 
>> Thoughts? Ideas?
> 



Re: Ideas for debugging poor SolrCloud scalability

2014-11-07 Thread Ian Rose
Hi again, all -

Since several people were kind enough to jump in to offer advice on this
thread, I wanted to follow up in case anyone finds this useful in the
future.

*tl;dr: *Routing updates to a random Solr node (and then letting it forward
the docs to where they need to go) is very expensive, more than I
expected.  Using a smart router that uses the cluster config to route
documents directly to their shard results in (near) linear scaling for us.

*Expository version:*

We use Go on our client, for which (to my knowledge) there is no SolrCloud
router implementation.  So we started by just routing updates to a random
Solr node and letting it forward the docs to where they need to go.  My
theory was that this would lead to a constant amount of additional work
(and thus still linear scaling).  This was based on the observation that if
you send an update of K documents to a Solr node in a N node cluster, in
the worst case scenario, all K documents will need to be forwarded on to
other nodes.  Since Solr nodes have perfect knowledge of where docs belong,
each doc would only take 1 additional hop to get to its replica.  So random
routing (in the limit) imposes 1 additional network hop for each document.

In practice, however, we find that (for small networks, at least) per-node
performance falls as you add shards.  In fact, the client performance (in
writes/sec) was essentially constant no matter how many shards we added.  I
do have a working theory as to why this might be (i.e. where the flaw is in
my logic above) but as this is merely an unverified theory I don't want to
lead anyone astray by writing it up here.

However, by writing a smart router that retrieves the clusterstate.json
file from Zookeeper and uses that to perfectly route documents to their
proper shard, we were able to achieve much better scalability.  Using a
synthetic workload, we were able to achieve 141.7 writes/sec to a cluster
of size 1 and 2506 writes/sec to a cluster of size 20 (125
writes/sec/node).  So a dropoff of ~12% which is not too bad.  We are
hoping to continue our tests with larger clusters to ensure that the
per-node write performance levels off and does not continue to drop as the
cluster scales.
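
For anyone on the JVM, the leader-aware routing our Go client reimplements
is what SolrJ already ships with; a minimal sketch using the (4.x-era)
CloudSolrServer, with hypothetical ZooKeeper and collection names:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class SmartRoutedIndexer {
    public static void main(String[] args) throws Exception {
      // CloudSolrServer watches clusterstate.json in ZooKeeper and routes
      // each document directly to its shard leader, like our Go router.
      CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
      server.setDefaultCollection("mycollection");

      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 1000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("body_t", "synthetic payload " + i);
        batch.add(doc);
      }
      server.add(batch);   // split and routed per shard under the hood
      server.shutdown();
    }
  }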

I will also note that we initially had several bugs in our smart router
implementation so if you follow a similar path and see bad performance look
to your router implementation as you might not be routing correctly.  We
ended up writing a simple proxy that we ran in front of Solr to observe all
requests which helped immensely when verifying and debugging our router.
Yes tcpdump does something similar but viewing HTTP-level traffic is way
more convenient than TCP-level.  Plus Go makes little proxies like this
super easy to do.

Hope all that is useful to someone.  Thanks again to the posters above for
providing suggestions!

- Ian



On Sat, Nov 1, 2014 at 7:13 PM, Erick Erickson erickerick...@gmail.com
wrote:

 bq: but it should be more or less a constant factor no matter how many
 Solr nodes you are using, right?

 Not really. You've stated that you're not driving Solr very hard in
 your tests. Therefore you're waiting on I/O. Therefore your tests just
 aren't going to scale linearly with the number of shards. This is a
 simplification, but...

 Your network utilization is pretty much irrelevant. I send a packet
 somewhere. somewhere does some stuff and sends me back an
 acknowledgement. While I'm waiting, the network is getting no traffic,
 so... If the network traffic was in the 90% range that would be
 different, so it's a good thing to monitor.

 Really, use a leader aware client and rack enough clients together
 that you're driving Solr hard. Then double the number of shards. Then
 rack enough _more_ clients to drive Solr at the same level. In this
 case I'll go out on a limb and predict near 2x throughput increases.

 One additional note, though. When you add _replicas_ to shards expect
 to see a drop in throughput that may be quite significant, 20-40%
 anecdotally...

 Best,
 Erick

 On Sat, Nov 1, 2014 at 9:23 AM, Shawn Heisey apa...@elyograg.org wrote:
  On 11/1/2014 9:52 AM, Ian Rose wrote:
  Just to make sure I am thinking about this right: batching will
 certainly
  make a big difference in performance, but it should be more or less a
  constant factor no matter how many Solr nodes you are using, right?
 Right
  now in my load tests, I'm not actually that concerned about the absolute
  performance numbers; instead I'm just trying to figure out why relative
  performance (no matter how bad it is since I am not batching) does not
 go
  up with more Solr nodes.  Once I get that part figured out and we are
  seeing more writes per sec when we add nodes, then I'll turn on
 batching in
  the client to see what kind of additional performance gain that gets us.
 
  The basic problem I see with your methodology is that you are sending an
  update request and waiting for it to complete before sending another.
  No 

Re: Ideas for debugging poor SolrCloud scalability

2014-11-07 Thread Shawn Heisey
On 11/7/2014 7:17 AM, Ian Rose wrote:
 *tl;dr: *Routing updates to a random Solr node (and then letting it forward
 the docs to where they need to go) is very expensive, more than I
 expected.  Using a smart router that uses the cluster config to route
 documents directly to their shard results in (near) linear scaling for us.

I will admit that I do not know everything that has to happen in order
to bounce updates to the proper shard leader, but I would have expected
the overhead involved to be relatively small.

I have opened an issue so we can see whether this situation can be improved.

https://issues.apache.org/jira/browse/SOLR-6717

Thanks,
Shawn



Re: Ideas for debugging poor SolrCloud scalability

2014-11-07 Thread Erick Erickson
Ian:

Thanks much for the writeup! It's always good to have real-world documentation!

Best,
Erick

On Fri, Nov 7, 2014 at 8:26 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 11/7/2014 7:17 AM, Ian Rose wrote:
 *tl;dr: *Routing updates to a random Solr node (and then letting it forward
 the docs to where they need to go) is very expensive, more than I
 expected.  Using a smart router that uses the cluster config to route
 documents directly to their shard results in (near) linear scaling for us.

 I will admit that I do not know everything that has to happen in order
 to bounce updates to the proper shard leader, but I would have expected
 the overhead involved to be relatively small.

 I have opened an issue so we can see whether this situation can be improved.

 https://issues.apache.org/jira/browse/SOLR-6717

 Thanks,
 Shawn



Re: Ideas for debugging poor SolrCloud scalability

2014-11-01 Thread Ian Rose
Erick,

Just to make sure I am thinking about this right: batching will certainly
make a big difference in performance, but it should be more or less a
constant factor no matter how many Solr nodes you are using, right?  Right
now in my load tests, I'm not actually that concerned about the absolute
performance numbers; instead I'm just trying to figure out why relative
performance (no matter how bad it is since I am not batching) does not go
up with more Solr nodes.  Once I get that part figured out and we are
seeing more writes per sec when we add nodes, then I'll turn on batching in
the client to see what kind of additional performance gain that gets us.

Cheers,
Ian


On Fri, Oct 31, 2014 at 3:43 PM, Peter Keegan peterlkee...@gmail.com
wrote:

 Yes, I was inadvertently sending them to a replica. When I sent them to the
 leader, the leader reported (1000 adds) and the replica reported only 1 add
 per document. So, it looks like the leader forwards the batched jobs
 individually to the replicas.

 On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Internally, the docs are batched up into smaller buckets (10 as I
  remember) and forwarded to the correct shard leader. I suspect that's
  what you're seeing.
 
  Erick
 
  On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
   Regarding batch indexing:
   When I send batches of 1000 docs to a standalone Solr server, the log
  file
   reports (1000 adds) in LogUpdateProcessor. But when I send them to
 the
   leader of a replicated index, the leader log file reports much smaller
   numbers, usually (12 adds). Why do the batches appear to be broken
 up?
  
   Peter
  
   On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
  
   NP, just making sure.
  
   I suspect you'll get lots more bang for the buck, and
   results much more closely matching your expectations if
  
   1 you batch up a bunch of docs at once rather than
   sending them one at a time. That's probably the easiest
   thing to try. Sending docs one at a time is something of
   an anti-pattern. I usually start with batches of 1,000.
  
   And just to check.. You're not issuing any commits from the
   client, right? Performance will be terrible if you issue commits
   after every doc, that's totally an anti-pattern. Doubly so for
   optimizes... Since you showed us your solrconfig & autocommit
   settings I'm assuming not but want to be sure.
  
   2 use a leader-aware client. I'm totally unfamiliar with Go,
   so I have no suggestions whatsoever to offer there But you'll
   want to batch in this case too.
  
   On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com
  wrote:
Hi Erick -
   
Thanks for the detailed response and apologies for my confusing
terminology.  I should have said WPS (writes per second) instead
 of
  QPS
but I didn't want to introduce a weird new acronym since QPS is well
known.  Clearly a bad decision on my part.  To clarify: I am doing
*only* writes
(document adds).  Whenever I wrote QPS I was referring to writes.
   
It seems clear at this point that I should wrap up the code to do
  smart
routing rather than choose Solr nodes randomly.  And then see if
 that
changes things.  I must admit that although I understand that random
  node
selection will impose a performance hit, theoretically it seems to
 me
   that
the system should still scale up as you add more nodes (albeit at
  lower
absolute level of performance than if you used a smart router).
Nonetheless, I'm just theorycrafting here so the better thing to do
 is
   just
try it experimentally.  I hope to have that working today - will
  report
back on my findings.
   
Cheers,
- Ian
   
p.s. To clarify why we are rolling our own smart router code, we use
  Go
over here rather than Java.  Although if we still get bad
 performance
   with
our custom Go router I may try a pure Java load client using
CloudSolrServer to eliminate the possibility of bugs in our
   implementation.
   
   
On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson 
  erickerick...@gmail.com
   
wrote:
   
I'm really confused:
   
bq: I am not issuing any queries, only writes (document inserts)
   
bq: It's clear that once the load test client has ~40 simulated
 users
   
bq: A cluster of 3 shards over 3 Solr nodes *should* support
a higher QPS than 2 shards over 2 Solr nodes, right
   
QPS is usually used to mean Queries Per Second, which is
 different
   from
the statement that I am not issuing any queries. And what do
  the
number of users have to do with inserting documents?
   
You also state:  In many cases, CPU on the solr servers is quite
  low as
well
   
So let's talk about indexing first. Indexing should scale nearly
linearly as long as
1 you are routing your docs to the correct leader, which 

Re: Ideas for debugging poor SolrCloud scalability

2014-11-01 Thread Erick Erickson
bq: but it should be more or less a constant factor no matter how many
Solr nodes you are using, right?

Not really. You've stated that you're not driving Solr very hard in
your tests. Therefore you're waiting on I/O. Therefore your tests just
aren't going to scale linearly with the number of shards. This is a
simplification, but...

Your network utilization is pretty much irrelevant. I send a packet
somewhere. somewhere does some stuff and sends me back an
acknowledgement. While I'm waiting, the network is getting no traffic,
so... If the network traffic was in the 90% range that would be
different, so it's a good thing to monitor.

Really, use a leader aware client and rack enough clients together
that you're driving Solr hard. Then double the number of shards. Then
rack enough _more_ clients to drive Solr at the same level. In this
case I'll go out on a limb and predict near 2x throughput increases.

One additional note, though. When you add _replicas_ to shards expect
to see a drop in throughput that may be quite significant, 20-40%
anecdotally...

Best,
Erick

On Sat, Nov 1, 2014 at 9:23 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 11/1/2014 9:52 AM, Ian Rose wrote:
 Just to make sure I am thinking about this right: batching will certainly
 make a big difference in performance, but it should be more or less a
 constant factor no matter how many Solr nodes you are using, right?  Right
 now in my load tests, I'm not actually that concerned about the absolute
 performance numbers; instead I'm just trying to figure out why relative
 performance (no matter how bad it is since I am not batching) does not go
 up with more Solr nodes.  Once I get that part figured out and we are
 seeing more writes per sec when we add nodes, then I'll turn on batching in
 the client to see what kind of additional performance gain that gets us.

 The basic problem I see with your methodology is that you are sending an
 update request and waiting for it to complete before sending another.
 No matter how big the batches are, this is an inefficient use of resources.

 If you send many such requests at the same time, then they will be
 handled in parallel.  Lucene (and by extension, Solr) has the thread
 synchronization required to keep multiple simultaneous update requests
 from stomping on each other and corrupting the index.

 If you have enough CPU cores, such handling will *truly* be in parallel,
 otherwise the operating system will just take turns giving each thread
 CPU time.  This results in a pretty good facsimile of parallel
 operation, but because it splits the available CPU resources, isn't as
 fast as true parallel operation.

 Thanks,
 Shawn
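
A sketch of that many-requests-in-flight pattern on the client side, using
SolrJ's ConcurrentUpdateSolrServer (4.x era; URL, queue size, and thread
count hypothetical), which queues documents and drains them with several
background threads:

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
      // Buffers up to 10000 docs and drains the queue with 4 threads,
      // so several update requests are in flight at once.
      ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
          "http://localhost:8983/solr/collection1", 10000, 4);
      for (int i = 0; i < 100000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        server.add(doc);           // returns quickly; sending is asynchronous
      }
      server.blockUntilFinished(); // wait for the queue to drain
      server.shutdown();
    }
  }

Note it targets a single base URL, so it is not leader-aware; for SolrCloud
the routing concerns discussed earlier in the thread still apply.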



Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Ian Rose
Hi Erick -

Thanks for the detailed response and apologies for my confusing
terminology.  I should have said WPS (writes per second) instead of QPS
but I didn't want to introduce a weird new acronym since QPS is well
known.  Clearly a bad decision on my part.  To clarify: I am doing
*only* writes
(document adds).  Whenever I wrote QPS I was referring to writes.

It seems clear at this point that I should wrap up the code to do smart
routing rather than choose Solr nodes randomly.  And then see if that
changes things.  I must admit that although I understand that random node
selection will impose a performance hit, theoretically it seems to me that
the system should still scale up as you add more nodes (albeit at lower
absolute level of performance than if you used a smart router).
Nonetheless, I'm just theorycrafting here so the better thing to do is just
try it experimentally.  I hope to have that working today - will report
back on my findings.

Cheers,
- Ian

p.s. To clarify why we are rolling our own smart router code, we use Go
over here rather than Java.  Although if we still get bad performance with
our custom Go router I may try a pure Java load client using
CloudSolrServer to eliminate the possibility of bugs in our implementation.


On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com
wrote:

 I'm really confused:

 bq: I am not issuing any queries, only writes (document inserts)

 bq: It's clear that once the load test client has ~40 simulated users

 bq: A cluster of 3 shards over 3 Solr nodes *should* support
 a higher QPS than 2 shards over 2 Solr nodes, right

 QPS is usually used to mean Queries Per Second, which is different from
 the statement that I am not issuing any queries. And what do the
 number of users have to do with inserting documents?

 You also state:  In many cases, CPU on the solr servers is quite low as
 well

 So let's talk about indexing first. Indexing should scale nearly
 linearly as long as
 1 you are routing your docs to the correct leader, which happens with
 SolrJ
 and the CloudSolrServer automatically. Rather than rolling your own, I
 strongly
 suggest you try this out.
 2 you have enough clients feeding the cluster to push CPU utilization
 on them all.
 Very often slow indexing, or in your case lack of scaling is a
 result of document
 acquisition or, in your case, your doc generator is spending all it's
 time waiting for
 the individual documents to get to Solr and come back.

 bq: chooses a random solr server for each ADD request (with 1 doc per add
 request)

 Probably your culprit right there. Each and every document requires that
 you
 have to cross the network (and forward that doc to the correct leader). So
 given
 that you're not seeing high CPU utilization, I suspect that you're not
 sending
 enough docs to SolrCloud fast enough to see scaling. You need to batch up
 multiple docs, I generally send 1,000 docs at a time.

 But even if you do solve this, the inter-node routing will prevent
 linear scaling.
 When a doc (or a batch of docs) goes to a random Solr node, here's what
 happens:
 1 the docs are re-packaged into groups based on which shard they're
 destined for
 2 the sub-packets are forwarded to the leader for each shard
 3 the responses are gathered back and returned to the client.

 This set of operations will eventually degrade the scaling.

 bq:  A cluster of 3 shards over 3 Solr nodes *should* support
 a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
 behind sharding.

 If we're talking search requests, the answer is no. Sharding is
 what you do when your collection no longer fits on a single node.
 If it _does_ fit on a single node, then you'll usually get better query
 performance by adding a bunch of replicas to a single shard. When
 the number of  docs on each shard grows large enough that you
 no longer get good query performance, _then_ you shard. And
 take the query hit.

 If we're talking about inserts, then see above. I suspect your problem is
 that you're _not_ saturating the SolrCloud cluster, you're sending
 docs to Solr very inefficiently and waiting on I/O. Batching docs and
 sending them to the right leader should scale pretty linearly until you
 start saturating your network.

 Best,
 Erick

 On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose ianr...@fullstory.com wrote:
  Thanks for the suggestions so for, all.
 
  1) We are not using SolrJ on the client (not using Java at all) but I am
  working on writing a smart router so that we can always send to the
  correct node.  I am certainly curious to see how that changes things.
  Nonetheless even with the overhead of extra routing hops, the observed
  behavior (no increase in performance with more nodes) doesn't make any
  sense to me.
 
  2) Commits: we are using autoCommit with openSearcher=false
 (maxTime=6)
  and autoSoftCommit (maxTime=15000).
 
  3) Suggestions to batch documents certainly make sense for production
 code
  but 

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Erick Erickson
NP, just making sure.

I suspect you'll get lots more bang for the buck, and
results much more closely matching your expectations if

1 you batch up a bunch of docs at once rather than
sending them one at a time. That's probably the easiest
thing to try. Sending docs one at a time is something of
an anti-pattern. I usually start with batches of 1,000.

And just to check.. You're not issuing any commits from the
client, right? Performance will be terrible if you issue commits
after every doc, that's totally an anti-pattern. Doubly so for
optimizes... Since you showed us your solrconfig & autocommit
settings I'm assuming not but want to be sure.

2 use a leader-aware client. I'm totally unfamiliar with Go,
so I have no suggestions whatsoever to offer there But you'll
want to batch in this case too.

On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com wrote:
 Hi Erick -

 Thanks for the detailed response and apologies for my confusing
 terminology.  I should have said WPS (writes per second) instead of QPS
 but I didn't want to introduce a weird new acronym since QPS is well
 known.  Clearly a bad decision on my part.  To clarify: I am doing
 *only* writes
 (document adds).  Whenever I wrote QPS I was referring to writes.

 It seems clear at this point that I should wrap up the code to do smart
 routing rather than choose Solr nodes randomly.  And then see if that
 changes things.  I must admit that although I understand that random node
 selection will impose a performance hit, theoretically it seems to me that
 the system should still scale up as you add more nodes (albeit at lower
 absolute level of performance than if you used a smart router).
 Nonetheless, I'm just theorycrafting here so the better thing to do is just
 try it experimentally.  I hope to have that working today - will report
 back on my findings.

 Cheers,
 - Ian

 p.s. To clarify why we are rolling our own smart router code, we use Go
 over here rather than Java.  Although if we still get bad performance with
 our custom Go router I may try a pure Java load client using
 CloudSolrServer to eliminate the possibility of bugs in our implementation.


 On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com
 wrote:

 I'm really confused:

 bq: I am not issuing any queries, only writes (document inserts)

 bq: It's clear that once the load test client has ~40 simulated users

 bq: A cluster of 3 shards over 3 Solr nodes *should* support
 a higher QPS than 2 shards over 2 Solr nodes, right

 QPS is usually used to mean Queries Per Second, which is different from
 the statement that I am not issuing any queries. And what do the
 number of users have to do with inserting documents?

 You also state:  In many cases, CPU on the solr servers is quite low as
 well

 So let's talk about indexing first. Indexing should scale nearly
 linearly as long as
 1 you are routing your docs to the correct leader, which happens with
 SolrJ
 and the CloudSolrServer automatically. Rather than rolling your own, I
 strongly
 suggest you try this out.
 2 you have enough clients feeding the cluster to push CPU utilization
 on them all.
 Very often slow indexing, or in your case lack of scaling is a
 result of document
 acquisition or, in your case, your doc generator is spending all it's
 time waiting for
 the individual documents to get to Solr and come back.

 bq: chooses a random solr server for each ADD request (with 1 doc per add
 request)

 Probably your culprit right there. Each and every document requires that
 you
 have to cross the network (and forward that doc to the correct leader). So
 given
 that you're not seeing high CPU utilization, I suspect that you're not
 sending
 enough docs to SolrCloud fast enough to see scaling. You need to batch up
 multiple docs, I generally send 1,000 docs at a time.

 But even if you do solve this, the inter-node routing will prevent
 linear scaling.
 When a doc (or a batch of docs) goes to a random Solr node, here's what
 happens:
 1 the docs are re-packaged into groups based on which shard they're
 destined for
 2 the sub-packets are forwarded to the leader for each shard
 3 the responses are gathered back and returned to the client.

 This set of operations will eventually degrade the scaling.

 bq:  A cluster of 3 shards over 3 Solr nodes *should* support
 a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
 behind sharding.

 If we're talking search requests, the answer is no. Sharding is
 what you do when your collection no longer fits on a single node.
 If it _does_ fit on a single node, then you'll usually get better query
 performance by adding a bunch of replicas to a single shard. When
 the number of  docs on each shard grows large enough that you
 no longer get good query performance, _then_ you shard. And
 take the query hit.

 If we're talking about inserts, then see above. I suspect your problem is
 that you're _not_ saturating the SolrCloud cluster, 

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Peter Keegan
Regarding batch indexing:
When I send batches of 1000 docs to a standalone Solr server, the log file
reports (1000 adds) in LogUpdateProcessor. But when I send them to the
leader of a replicated index, the leader log file reports much smaller
numbers, usually (12 adds). Why do the batches appear to be broken up?

Peter

On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson erickerick...@gmail.com
wrote:

 NP, just making sure.

 I suspect you'll get lots more bang for the buck, and
 results much more closely matching your expectations if

 1 you batch up a bunch of docs at once rather than
 sending them one at a time. That's probably the easiest
 thing to try. Sending docs one at a time is something of
 an anti-pattern. I usually start with batches of 1,000.

 And just to check.. You're not issuing any commits from the
 client, right? Performance will be terrible if you issue commits
 after every doc, that's totally an anti-pattern. Doubly so for
 optimizes... Since you showed us your solrconfig & autocommit
 settings I'm assuming not but want to be sure.

 2 use a leader-aware client. I'm totally unfamiliar with Go,
 so I have no suggestions whatsoever to offer there But you'll
 want to batch in this case too.

 On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com wrote:
  Hi Erick -
 
  Thanks for the detailed response and apologies for my confusing
  terminology.  I should have said WPS (writes per second) instead of QPS
  but I didn't want to introduce a weird new acronym since QPS is well
  known.  Clearly a bad decision on my part.  To clarify: I am doing
  *only* writes
  (document adds).  Whenever I wrote QPS I was referring to writes.
 
  It seems clear at this point that I should wrap up the code to do smart
  routing rather than choose Solr nodes randomly.  And then see if that
  changes things.  I must admit that although I understand that random node
  selection will impose a performance hit, theoretically it seems to me
 that
  the system should still scale up as you add more nodes (albeit at lower
  absolute level of performance than if you used a smart router).
  Nonetheless, I'm just theorycrafting here so the better thing to do is
 just
  try it experimentally.  I hope to have that working today - will report
  back on my findings.
 
  Cheers,
  - Ian
 
  p.s. To clarify why we are rolling our own smart router code, we use Go
  over here rather than Java.  Although if we still get bad performance
 with
  our custom Go router I may try a pure Java load client using
  CloudSolrServer to eliminate the possibility of bugs in our
 implementation.
 
 
  On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  I'm really confused:
 
  bq: I am not issuing any queries, only writes (document inserts)
 
  bq: It's clear that once the load test client has ~40 simulated users
 
  bq: A cluster of 3 shards over 3 Solr nodes *should* support
  a higher QPS than 2 shards over 2 Solr nodes, right
 
  QPS is usually used to mean Queries Per Second, which is different
 from
  the statement that I am not issuing any queries. And what do the
  number of users have to do with inserting documents?
 
  You also state:  In many cases, CPU on the solr servers is quite low as
  well
 
  So let's talk about indexing first. Indexing should scale nearly
  linearly as long as
  1 you are routing your docs to the correct leader, which happens with
  SolrJ
  and the CloudSolrServer automatically. Rather than rolling your own, I
  strongly
  suggest you try this out.
  2 you have enough clients feeding the cluster to push CPU utilization
  on them all.
  Very often slow indexing, or in your case lack of scaling is a
  result of document
  acquisition or, in your case, your doc generator is spending all it's
  time waiting for
  the individual documents to get to Solr and come back.
 
  bq: chooses a random solr server for each ADD request (with 1 doc per
 add
  request)
 
  Probably your culprit right there. Each and every document requires that
  you
  have to cross the network (and forward that doc to the correct leader).
 So
  given
  that you're not seeing high CPU utilization, I suspect that you're not
  sending
  enough docs to SolrCloud fast enough to see scaling. You need to batch
 up
  multiple docs, I generally send 1,000 docs at a time.
 
  But even if you do solve this, the inter-node routing will prevent
  linear scaling.
  When a doc (or a batch of docs) goes to a random Solr node, here's what
  happens:
  1 the docs are re-packaged into groups based on which shard they're
  destined for
  2 the sub-packets are forwarded to the leader for each shard
  3 the responses are gathered back and returned to the client.
 
  This set of operations will eventually degrade the scaling.
 
  bq:  A cluster of 3 shards over 3 Solr nodes *should* support
  a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole
 idea
  behind sharding.
 
  If we're talking search 

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Erick Erickson
Internally, the docs are batched up into smaller buckets (10 as I
remember) and forwarded to the correct shard leader. I suspect that's
what you're seeing.

Erick

On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan peterlkee...@gmail.com wrote:
 Regarding batch indexing:
 When I send batches of 1000 docs to a standalone Solr server, the log file
 reports (1000 adds) in LogUpdateProcessor. But when I send them to the
 leader of a replicated index, the leader log file reports much smaller
 numbers, usually (12 adds). Why do the batches appear to be broken up?

 Peter

 On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson erickerick...@gmail.com
 wrote:

 NP, just making sure.

 I suspect you'll get lots more bang for the buck, and
 results much more closely matching your expectations if

 1 you batch up a bunch of docs at once rather than
 sending them one at a time. That's probably the easiest
 thing to try. Sending docs one at a time is something of
 an anti-pattern. I usually start with batches of 1,000.

 And just to check.. You're not issuing any commits from the
 client, right? Performance will be terrible if you issue commits
 after every doc, that's totally an anti-pattern. Doubly so for
  optimizes... Since you showed us your solrconfig & autocommit
 settings I'm assuming not but want to be sure.

 2 use a leader-aware client. I'm totally unfamiliar with Go,
 so I have no suggestions whatsoever to offer there But you'll
 want to batch in this case too.

 On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com wrote:
  Hi Erick -
 
  Thanks for the detailed response and apologies for my confusing
  terminology.  I should have said WPS (writes per second) instead of QPS
  but I didn't want to introduce a weird new acronym since QPS is well
  known.  Clearly a bad decision on my part.  To clarify: I am doing
  *only* writes
  (document adds).  Whenever I wrote QPS I was referring to writes.
 
  It seems clear at this point that I should wrap up the code to do smart
  routing rather than choose Solr nodes randomly.  And then see if that
  changes things.  I must admit that although I understand that random node
  selection will impose a performance hit, theoretically it seems to me
 that
  the system should still scale up as you add more nodes (albeit at lower
  absolute level of performance than if you used a smart router).
  Nonetheless, I'm just theorycrafting here so the better thing to do is
 just
  try it experimentally.  I hope to have that working today - will report
  back on my findings.
 
  Cheers,
  - Ian
 
  p.s. To clarify why we are rolling our own smart router code, we use Go
  over here rather than Java.  Although if we still get bad performance
 with
  our custom Go router I may try a pure Java load client using
  CloudSolrServer to eliminate the possibility of bugs in our
 implementation.
 
 
  On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  I'm really confused:
 
  bq: I am not issuing any queries, only writes (document inserts)
 
  bq: It's clear that once the load test client has ~40 simulated users
 
  bq: A cluster of 3 shards over 3 Solr nodes *should* support
  a higher QPS than 2 shards over 2 Solr nodes, right
 
  QPS is usually used to mean Queries Per Second, which is different
 from
  the statement that I am not issuing any queries. And what do the
  number of users have to do with inserting documents?
 
  You also state:  In many cases, CPU on the solr servers is quite low as
  well
 
  So let's talk about indexing first. Indexing should scale nearly
  linearly as long as
  1 you are routing your docs to the correct leader, which happens with
  SolrJ
  and the CloudSolrServer automatically. Rather than rolling your own, I
  strongly
  suggest you try this out.
  2 you have enough clients feeding the cluster to push CPU utilization
  on them all.
  Very often slow indexing, or in your case lack of scaling is a
  result of document
  acquisition or, in your case, your doc generator is spending all it's
  time waiting for
  the individual documents to get to Solr and come back.
 
  bq: chooses a random solr server for each ADD request (with 1 doc per
 add
  request)
 
  Probably your culprit right there. Each and every document requires that
  you
  have to cross the network (and forward that doc to the correct leader).
 So
  given
  that you're not seeing high CPU utilization, I suspect that you're not
  sending
  enough docs to SolrCloud fast enough to see scaling. You need to batch
 up
  multiple docs, I generally send 1,000 docs at a time.
 
  But even if you do solve this, the inter-node routing will prevent
  linear scaling.
  When a doc (or a batch of docs) goes to a random Solr node, here's what
  happens:
   1> the docs are re-packaged into groups based on which shard they're
   destined for
   2> the sub-packets are forwarded to the leader for each shard
   3> the responses are gathered back and returned to the client.
 
  

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Peter Keegan
Yes, I was inadvertently sending them to a replica. When I sent them to the
leader, the leader reported (1000 adds) and the replica reported only 1 add
per document. So, it looks like the leader forwards the batched jobs
individually to the replicas.

On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Internally, the docs are batched up into smaller buckets (10 as I
 remember) and forwarded to the correct shard leader. I suspect that's
 what you're seeing.

 Erick

 On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan peterlkee...@gmail.com
 wrote:
  Regarding batch indexing:
  When I send batches of 1000 docs to a standalone Solr server, the log
 file
  reports (1000 adds) in LogUpdateProcessor. But when I send them to the
  leader of a replicated index, the leader log file reports much smaller
  numbers, usually (12 adds). Why do the batches appear to be broken up?
 
  Peter
 
  On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
  NP, just making sure.
 
  I suspect you'll get lots more bang for the buck, and
  results much more closely matching your expectations if
 
   1> you batch up a bunch of docs at once rather than
  sending them one at a time. That's probably the easiest
  thing to try. Sending docs one at a time is something of
  an anti-pattern. I usually start with batches of 1,000.
 
  And just to check.. You're not issuing any commits from the
  client, right? Performance will be terrible if you issue commits
   after every doc, that's totally an anti-pattern. Doubly so for
   optimizes. Since you showed us your solrconfig autocommit
   settings I'm assuming not, but want to be sure.
 
   2> use a leader-aware client. I'm totally unfamiliar with Go,
   so I have no suggestions whatsoever to offer there. But you'll
  want to batch in this case too.
 
  On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com
 wrote:
   Hi Erick -
  
   Thanks for the detailed response and apologies for my confusing
   terminology.  I should have said WPS (writes per second) instead of
 QPS
   but I didn't want to introduce a weird new acronym since QPS is well
   known.  Clearly a bad decision on my part.  To clarify: I am doing
   *only* writes
   (document adds).  Whenever I wrote QPS I was referring to writes.
  
   It seems clear at this point that I should wrap up the code to do
 smart
   routing rather than choose Solr nodes randomly.  And then see if that
   changes things.  I must admit that although I understand that random
 node
   selection will impose a performance hit, theoretically it seems to me
  that
   the system should still scale up as you add more nodes (albeit at
 lower
   absolute level of performance than if you used a smart router).
   Nonetheless, I'm just theorycrafting here so the better thing to do is
  just
   try it experimentally.  I hope to have that working today - will
 report
   back on my findings.
  
   Cheers,
   - Ian
  
   p.s. To clarify why we are rolling our own smart router code, we use
 Go
   over here rather than Java.  Although if we still get bad performance
  with
   our custom Go router I may try a pure Java load client using
   CloudSolrServer to eliminate the possibility of bugs in our
  implementation.
  
  
   On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson 
 erickerick...@gmail.com
  
   wrote:
  
   I'm really confused:
  
   bq: I am not issuing any queries, only writes (document inserts)
  
   bq: It's clear that once the load test client has ~40 simulated users
  
   bq: A cluster of 3 shards over 3 Solr nodes *should* support
   a higher QPS than 2 shards over 2 Solr nodes, right
  
   QPS is usually used to mean Queries Per Second, which is different
  from
   the statement that I am not issuing any queries. And what do
 the
   number of users have to do with inserting documents?
  
   You also state:  In many cases, CPU on the solr servers is quite
 low as
   well
  
   So let's talk about indexing first. Indexing should scale nearly
   linearly as long as
    1> you are routing your docs to the correct leader, which happens
  with
    SolrJ
    and the CloudSolrServer automatically. Rather than rolling your own, I
    strongly
    suggest you try this out.
    2> you have enough clients feeding the cluster to push CPU
  utilization
    on them all.
    Very often slow indexing, or in your case lack of scaling, is a
    result of document
    acquisition; in your case, your doc generator is spending all its
   time waiting for
   the individual documents to get to Solr and come back.
  
   bq: chooses a random solr server for each ADD request (with 1 doc
 per
  add
   request)
  
   Probably your culprit right there. Each and every document requires
 that
   you
   have to cross the network (and forward that doc to the correct
 leader).
  So
   given
   that you're not seeing high CPU utilization, I suspect that you're
 not
   sending
   enough docs to SolrCloud fast enough to see scaling. 

Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
Howdy all -

The short version is: We are not seeing Solr Cloud performance scale (even
close to) linearly as we add nodes. Can anyone suggest good diagnostics for
finding scaling bottlenecks? Are there known 'gotchas' that make Solr Cloud
fail to scale?

In detail:

We have used Solr (in non-Cloud mode) for over a year and are now beginning
a transition to SolrCloud.  To this end I have been running some basic load
tests to figure out what kind of capacity we should expect to provision.
In short, I am seeing very poor scalability (increase in effective QPS) as
I add Solr nodes.  I'm hoping to get some ideas on where I should be
looking to debug this.  Apologies in advance for the length of this email;
I'm trying to be comprehensive and provide all relevant information.

Our setup:

1 load generating client
 - generates tiny, fake documents with unique IDs
 - performs only writes (no queries at all)
 - chooses a random solr server for each ADD request (with 1 doc per add
request)

N collections spread over K solr servers
 - every collection is sharded K times (so every solr instance has 1 shard
from every collection)
 - no replicas
 - external zookeeper server (not using zkRun)
 - autoCommit maxTime=6
 - autoSoftCommit maxTime=15000

Everything is running within a single zone on Google Compute Engine, so
high quality gigabit network links between all machines (ping times < 1ms).

My methodology is as follows.
1. Start up K solr servers.
2. Remove all existing collections.
3. Create N collections, with numShards=K for each.
4. Start load testing.  Every minute, print the number of successful
updates and the number of failed updates.
5. Keep increasing the offered load (via simulated users) until the qps
flatlines.

In brief (more detailed results at the bottom of email), I find that for
any number of nodes between 2 and 5, the QPS always caps out at ~3000.
Obviously something must be wrong here, as there should be a trend of the
QPS scaling (roughly) linearly with the number of nodes.  Or at the very
least going up at all!

So my question is what else should I be looking at here?

* CPU on the loadtest client is well under 100%
* No other obvious bottlenecks on loadtest client (running 2 clients leads
to ~1/2 qps on each)
* In many cases, CPU on the solr servers is quite low as well (e.g. with
100 users hitting 5 solr nodes, all nodes are 50% idle)
* Network bandwidth is a few MB/s, well under the gigabit capacity of our
network
* Disk bandwidth (< 2 MB/s) and iops (< 20/s) are low.

Any ideas?  Thanks very much!
- Ian


p.s. Here is my raw data broken out by number of nodes and number of
simulated users:


Num Nodes  Num Users   QPS
    1          1       1020
    1          5       3180
    1         10       3825
    1         15       3900
    1         20       4050
    1         40       4100
    2          1        472
    2          5       1790
    2         10       2290
    2         15       2850
    2         20       2900
    2         40       3210
    2         60       3200
    2         80       3210
    2        100       3180
    3          1        385
    3          5       1580
    3         10       2090
    3         15       2560
    3         20       2760
    3         25       2890
    3         80       3050
    4          1        375
    4          5       1560
    4         10       2200
    4         15       2500
    4         20       2700
    4         25       2800
    4         30       2850
    5         15       2450
    5         20       2640
    5         25       2790
    5         30       2840
    5        100       2900
    5        200       2810


Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Shawn Heisey
On 10/30/2014 2:23 PM, Ian Rose wrote:
 My methodology is as follows.
  1. Start up K solr servers.
 2. Remove all existing collections.
 3. Create N collections, with numShards=K for each.
 4. Start load testing.  Every minute, print the number of successful
 updates and the number of failed updates.
 5. Keep increasing the offered load (via simulated users) until the qps
 flatlines.

If you want to increase QPS, you should not be increasing numShards. 
You need to increase replicationFactor.  When your numShards matches the
number of servers, every single server will be doing part of the work
for every query.  If you increase replicationFactor instead, then each
server can be doing a different query in parallel.

Sharding the index is what you need to do when you need to scale the
size of the index, so each server does not get overwhelmed by dealing
with every document for every query.

Getting a high QPS with a big index requires increasing both numShards
*AND* replicationFactor.

Thanks,
Shawn
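
To make Shawn's shards-versus-replicas point concrete, both numbers are fixed
when a collection is created. A minimal SolrJ 4.x sketch against the
Collections API; the host, collection name, and sizing values are
placeholders, not recommendations:

  import java.io.IOException;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.QueryRequest;
  import org.apache.solr.common.params.ModifiableSolrParams;

  public class CreateCollection {
    public static void main(String[] args) throws SolrServerException, IOException {
      HttpSolrServer admin = new HttpSolrServer("http://solr1:8983/solr");
      ModifiableSolrParams p = new ModifiableSolrParams();
      p.set("action", "CREATE");
      p.set("name", "testcoll");         // hypothetical collection name
      p.set("numShards", 2);             // shards scale index size
      p.set("replicationFactor", 3);     // replicas scale query throughput
      QueryRequest req = new QueryRequest(p);
      req.setPath("/admin/collections"); // Collections API endpoint
      admin.request(req);
      admin.shutdown();
    }
  }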



Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose

 If you want to increase QPS, you should not be increasing numShards.
 You need to increase replicationFactor.  When your numShards matches the
 number of servers, every single server will be doing part of the work
 for every query.



I think this is true only for actual queries, right?  I am not issuing any
queries, only writes (document inserts).  In the case of writes, increasing
the number of shards should increase my throughput (in ops/sec) more or
less linearly, right?


On Thu, Oct 30, 2014 at 4:50 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/30/2014 2:23 PM, Ian Rose wrote:
  My methodology is as follows.
   1. Start up K solr servers.
  2. Remove all existing collections.
  3. Create N collections, with numShards=K for each.
  4. Start load testing.  Every minute, print the number of successful
  updates and the number of failed updates.
  5. Keep increasing the offered load (via simulated users) until the qps
  flatlines.

 If you want to increase QPS, you should not be increasing numShards.
 You need to increase replicationFactor.  When your numShards matches the
 number of servers, every single server will be doing part of the work
 for every query.  If you increase replicationFactor instead, then each
 server can be doing a different query in parallel.

 Sharding the index is what you need to do when you need to scale the
 size of the index, so each server does not get overwhelmed by dealing
 with every document for every query.

 Getting a high QPS with a big index requires increasing both numShards
 *AND* replicationFactor.

 Thanks,
 Shawn




Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Matt Hilt
If you are issuing writes to shard non-leaders, then there is a large overhead 
for the eventual redirect to the leader. I noticed a 3-5 times performance 
increase by making my write client leader aware.


On Oct 30, 2014, at 2:56 PM, Ian Rose ianr...@fullstory.com wrote:

 
 If you want to increase QPS, you should not be increasing numShards.
 You need to increase replicationFactor.  When your numShards matches the
 number of servers, every single server will be doing part of the work
 for every query.
 
 
 
 I think this is true only for actual queries, right?  I am not issuing any
 queries, only writes (document inserts).  In the case of writes, increasing
 the number of shards should increase my throughput (in ops/sec) more or
 less linearly, right?
 
 
 On Thu, Oct 30, 2014 at 4:50 PM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 10/30/2014 2:23 PM, Ian Rose wrote:
 My methodology is as follows.
  1. Start up K solr servers.
 2. Remove all existing collections.
 3. Create N collections, with numShards=K for each.
 4. Start load testing.  Every minute, print the number of successful
 updates and the number of failed updates.
 5. Keep increasing the offered load (via simulated users) until the qps
 flatlines.
 
 If you want to increase QPS, you should not be increasing numShards.
 You need to increase replicationFactor.  When your numShards matches the
 number of servers, every single server will be doing part of the work
 for every query.  If you increase replicationFactor instead, then each
 server can be doing a different query in parallel.
 
 Sharding the index is what you need to do when you need to scale the
 size of the index, so each server does not get overwhelmed by dealing
 with every document for every query.
 
 Getting a high QPS with a big index requires increasing both numShards
 *AND* replicationFactor.
 
 Thanks,
 Shawn
 
 





Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Shawn Heisey
On 10/30/2014 2:56 PM, Ian Rose wrote:
 I think this is true only for actual queries, right? I am not issuing
 any queries, only writes (document inserts). In the case of writes,
 increasing the number of shards should increase my throughput (in
 ops/sec) more or less linearly, right?

No, that won't affect indexing speed all that much.  The way to increase
indexing speed is to increase the number of processes or threads that
are indexing at the same time.  Instead of having one client sending
update requests, try five of them.  Also, index many documents with each
update request.  Sending one document at a time is very inefficient.

You didn't say how you're doing commits, but those need to be as
infrequent as you can manage.  Ideally, you would use autoCommit with
openSearcher=false on an interval of about five minutes, and send an
explicit commit (with the default openSearcher=true) after all the
indexing is done.

You may have requirements regarding document visibility that this won't
satisfy, but try to avoid doing commits with openSearcher=true (soft
commits qualify for this) extremely frequently, like once a second. 
Once a minute is much more realistic.  Opening a new searcher is an
expensive operation, especially if you have cache warming configured.

Thanks,
Shawn
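
In solrconfig.xml terms, Shawn's suggestion is roughly the following sketch
(the five-minute interval is his example, not a universal recommendation):

  <autoCommit>
    <maxTime>300000</maxTime>           <!-- about five minutes -->
    <openSearcher>false</openSearcher>  <!-- flush segments without opening a searcher -->
  </autoCommit>

with the indexing client then sending a single explicit commit (openSearcher
defaults to true) once all the indexing is done.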



Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Erick Erickson
Your indexing client, if written in SolrJ, should use CloudSolrServer
which is, in Matt's terms, leader aware. It divides up the
documents to be indexed into packets where each doc in
the packet belongs on the same shard, and then sends the packet
to the shard leader. This avoids a lot of re-routing and should
scale essentially linearly. You may have to add more clients
though, depending upon how hard the document-generator is
working.

Also, make sure that you send batches of documents as Shawn
suggests, I use 1,000 as a starting point.

Best,
Erick

On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote:
 On 10/30/2014 2:56 PM, Ian Rose wrote:
 I think this is true only for actual queries, right? I am not issuing
 any queries, only writes (document inserts). In the case of writes,
 increasing the number of shards should increase my throughput (in
 ops/sec) more or less linearly, right?

 No, that won't affect indexing speed all that much.  The way to increase
 indexing speed is to increase the number of processes or threads that
 are indexing at the same time.  Instead of having one client sending
 update requests, try five of them.  Also, index many documents with each
 update request.  Sending one document at a time is very inefficient.

 You didn't say how you're doing commits, but those need to be as
 infrequent as you can manage.  Ideally, you would use autoCommit with
 openSearcher=false on an interval of about five minutes, and send an
 explicit commit (with the default openSearcher=true) after all the
 indexing is done.

 You may have requirements regarding document visibility that this won't
 satisfy, but try to avoid doing commits with openSearcher=true (soft
 commits qualify for this) extremely frequently, like once a second.
 Once a minute is much more realistic.  Opening a new searcher is an
 expensive operation, especially if you have cache warming configured.

 Thanks,
 Shawn
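
A minimal sketch of the batched, leader-aware client Erick describes, using
CloudSolrServer from SolrJ 4.x; the ZooKeeper ensemble, collection name, and
field names below are placeholders:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
    public static void main(String[] args) throws Exception {
      CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
      server.setDefaultCollection("collection1");
      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 100000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("body_t", "tiny fake document " + i);
        batch.add(doc);
        if (batch.size() == 1000) {  // Erick's starting point: 1,000 docs per add
          server.add(batch);         // docs are grouped per shard and sent to each leader
          batch.clear();
        }
      }
      if (!batch.isEmpty()) server.add(batch);
      server.commit();               // one explicit commit at the end, no per-doc commits
      server.shutdown();
    }
  }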



Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
Thanks for the suggestions so for, all.

1) We are not using SolrJ on the client (not using Java at all) but I am
working on writing a smart router so that we can always send to the
correct node.  I am certainly curious to see how that changes things.
Nonetheless even with the overhead of extra routing hops, the observed
behavior (no increase in performance with more nodes) doesn't make any
sense to me.

2) Commits: we are using autoCommit with openSearcher=false (maxTime=6)
and autoSoftCommit (maxTime=15000).

3) Suggestions to batch documents certainly make sense for production code
but in this case I am not real concerned with absolute performance; I just
want to see the *relative* performance as we use more Solr nodes.  So I
don't think batching or not really matters.

4) No, that won't affect indexing speed all that much.  The way to
increase indexing speed is to increase the number of processes or threads
that are indexing at the same time.  Instead of having one client
sending update
requests, try five of them.

Can you elaborate on this some?  I'm worried I might be misunderstanding
something fundamental.  A cluster of 3 shards over 3 Solr nodes
*should* support
a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
behind sharding.  Regarding your comment of increase the number of
processes or threads, note that for each value of K (number of Solr nodes)
I measured with several different numbers of simulated users so that I
could find a saturation point.  For example, take a look at my data for
K=2:

Num Nodes  Num Users   QPS
    2          1        472
    2          5       1790
    2         10       2290
    2         15       2850
    2         20       2900
    2         40       3210
    2         60       3200
    2         80       3210
    2        100       3180

It's clear that once the load test client has ~40 simulated users, the Solr
cluster is saturated.  Creating more users just increases the average
request latency, such that the total QPS remained (nearly) constant.  So I
feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps.
The problem is that I am finding roughly this same max point, no matter
how many simulated users the load test client created, for any value of K
( 1).

Cheers,
- Ian


On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Your indexing client, if written in SolrJ, should use CloudSolrServer
  which is, in Matt's terms, leader aware. It divides up the
  documents to be indexed into packets where each doc in
  the packet belongs on the same shard, and then sends the packet
  to the shard leader. This avoids a lot of re-routing and should
  scale essentially linearly. You may have to add more clients
  though, depending upon how hard the document-generator is
 working.

 Also, make sure that you send batches of documents as Shawn
 suggests, I use 1,000 as a starting point.

 Best,
 Erick

 On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote:
  On 10/30/2014 2:56 PM, Ian Rose wrote:
  I think this is true only for actual queries, right? I am not issuing
  any queries, only writes (document inserts). In the case of writes,
  increasing the number of shards should increase my throughput (in
  ops/sec) more or less linearly, right?
 
  No, that won't affect indexing speed all that much.  The way to increase
  indexing speed is to increase the number of processes or threads that
  are indexing at the same time.  Instead of having one client sending
  update requests, try five of them.  Also, index many documents with each
  update request.  Sending one document at a time is very inefficient.
 
  You didn't say how you're doing commits, but those need to be as
  infrequent as you can manage.  Ideally, you would use autoCommit with
  openSearcher=false on an interval of about five minutes, and send an
  explicit commit (with the default openSearcher=true) after all the
  indexing is done.
 
  You may have requirements regarding document visibility that this won't
  satisfy, but try to avoid doing commits with openSearcher=true (soft
  commits qualify for this) extremely frequently, like once a second.
  Once a minute is much more realistic.  Opening a new searcher is an
  expensive operation, especially if you have cache warming configured.
 
  Thanks,
  Shawn
 



Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Erick Erickson
I'm really confused:

bq: I am not issuing any queries, only writes (document inserts)

bq: It's clear that once the load test client has ~40 simulated users

bq: A cluster of 3 shards over 3 Solr nodes *should* support
a higher QPS than 2 shards over 2 Solr nodes, right

QPS is usually used to mean Queries Per Second, which is different from
the statement that "I am not issuing any queries". And what do the
number of users have to do with inserting documents?

You also state: "In many cases, CPU on the solr servers is quite low as well"

So let's talk about indexing first. Indexing should scale nearly
linearly as long as
1> you are routing your docs to the correct leader, which happens with SolrJ
and the CloudSolrServer automatically. Rather than rolling your own, I strongly
suggest you try this out.
2> you have enough clients feeding the cluster to push CPU utilization
on them all.
Very often slow indexing, or in your case lack of scaling, is a
result of document
acquisition; in your case, your doc generator is spending all its
time waiting for
the individual documents to get to Solr and come back.

bq: chooses a random solr server for each ADD request (with 1 doc per add
request)

Probably your culprit right there. Each and every document requires that you
have to cross the network (and forward that doc to the correct leader). So given
that you're not seeing high CPU utilization, I suspect that you're not sending
enough docs to SolrCloud fast enough to see scaling. You need to batch up
multiple docs, I generally send 1,000 docs at a time.

But even if you do solve this, the inter-node routing will prevent
linear scaling.
When a doc (or a batch of docs) goes to a random Solr node, here's what
happens:
1> the docs are re-packaged into groups based on which shard they're
destined for
2> the sub-packets are forwarded to the leader for each shard
3> the responses are gathered back and returned to the client.

This set of operations will eventually degrade the scaling.

bq:  A cluster of 3 shards over 3 Solr nodes *should* support
a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
behind sharding.

If we're talking search requests, the answer is no. Sharding is
what you do when your collection no longer fits on a single node.
If it _does_ fit on a single node, then you'll usually get better query
performance by adding a bunch of replicas to a single shard. When
the number of  docs on each shard grows large enough that you
no longer get good query performance, _then_ you shard. And
take the query hit.

If we're talking about inserts, then see above. I suspect your problem is
that you're _not_ saturating the SolrCloud cluster, you're sending
docs to Solr very inefficiently and waiting on I/O. Batching docs and
sending them to the right leader should scale pretty linearly until you
start saturating your network.

Best,
Erick

On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose ianr...@fullstory.com wrote:
 Thanks for the suggestions so for, all.

 1) We are not using SolrJ on the client (not using Java at all) but I am
 working on writing a smart router so that we can always send to the
 correct node.  I am certainly curious to see how that changes things.
 Nonetheless even with the overhead of extra routing hops, the observed
 behavior (no increase in performance with more nodes) doesn't make any
 sense to me.

 2) Commits: we are using autoCommit with openSearcher=false (maxTime=6)
 and autoSoftCommit (maxTime=15000).

 3) Suggestions to batch documents certainly make sense for production code
 but in this case I am not real concerned with absolute performance; I just
 want to see the *relative* performance as we use more Solr nodes.  So I
 don't think batching or not really matters.

 4) No, that won't affect indexing speed all that much.  The way to
 increase indexing speed is to increase the number of processes or threads
 that are indexing at the same time.  Instead of having one client
 sending update
 requests, try five of them.

 Can you elaborate on this some?  I'm worried I might be misunderstanding
 something fundamental.  A cluster of 3 shards over 3 Solr nodes
 *should* support
 a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
 behind sharding.  Regarding your comment of increase the number of
 processes or threads, note that for each value of K (number of Solr nodes)
 I measured with several different numbers of simulated users so that I
 could find a saturation point.  For example, take a look at my data for
 K=2:

 Num Nodes  Num Users   QPS
     2          1        472
     2          5       1790
     2         10       2290
     2         15       2850
     2         20       2900
     2         40       3210
     2         60       3200
     2         80       3210
     2        100       3180

 It's clear that once the load test client has ~40 simulated users, the Solr
 cluster is saturated.  Creating more users just increases the average
 request latency, such that the total QPS remained (nearly) constant.  So I
 feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps.
 The problem is that I am finding roughly this same max 
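
A sketch of Erick's point 2> above: several concurrent feeders sharing one
CloudSolrServer, so that document acquisition on the client cannot become the
bottleneck. The thread count and batch size are illustrative, and nextDoc() is
a hypothetical stand-in for whatever produces the documents:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ParallelFeeder {
    public static void main(String[] args) throws Exception {
      final CloudSolrServer server = new CloudSolrServer("zk1:2181"); // thread-safe; share one instance
      server.setDefaultCollection("collection1");
      ExecutorService pool = Executors.newFixedThreadPool(5);
      for (int t = 0; t < 5; t++) {
        pool.submit(new Runnable() {
          public void run() {
            try {
              List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
              SolrInputDocument doc;
              while ((doc = nextDoc()) != null) {  // nextDoc() is hypothetical
                batch.add(doc);
                if (batch.size() == 1000) { server.add(batch); batch.clear(); }
              }
              if (!batch.isEmpty()) server.add(batch);
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
      }
      pool.shutdown();
    }

    static SolrInputDocument nextDoc() { return null; } // placeholder document source
  }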

Spelling suggestions--any ideas?

2014-04-17 Thread Ed Smiley
Correctly spelled words are coming back as not spelled correctly, and the 
multiple suggestions returned are just the original, correctly spelled word 
with a single oddball character appended...
--

Ed Smiley, Senior Software Architect, eBooks
ProQuest | 161 E Evelyn Ave|
Mountain View, CA 94041 | USA |
+1 650 475 8700 extension 3772
ed.smi...@proquest.com
www.proquest.comhttp://www.proquest.com/ | 
www.ebrary.comhttp://www.ebrary.com/ | www.eblib.comhttp://www.eblib.com/
ebrary and EBL, ProQuest businesses.


Need ideas to perform historical search

2013-07-18 Thread SolrLover

I am trying to implement Historical search using SOLR.

Ex:

If I search on address 800 5th Ave and provide a time range, it should list
the name of the person who was living at the address during the time period.
I am trying to figure out a way to store the data without redundancy.

I can do a join in the database to return all the names who were living in a
particular address during a particular time but I know it's difficult to do
that in SOLR and SOLR is not a database (it works best when the data is
denormalized)...

Is there any other way / idea by which I can reduce the redundancy of
creating multiple records for a particular person again and again?







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Need-ideas-to-perform-historical-search-tp4078980.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Need ideas to perform historical search

2013-07-18 Thread Alexandre Rafalovitch
Why do you care about redundancy? That's the search engine's architectural
tradeoff (as far as I understand). And, the tokens are all normalized under
the covers, so it does not take as much space as you expect.

Specifically regarding your issue, maybe you should store 'occupancy' as
the record. That's similar to what they do at Gilt:
http://www.slideshare.net/trenaman/personalized-search-on-the-largest-flash-sale-site-in-america
(slide 36+)

The other option is to use location as spans with some clever queries:
http://wiki.apache.org/solr/SpatialForTimeDurations (follow the links).

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Thu, Jul 18, 2013 at 5:58 PM, SolrLover bbar...@gmail.com wrote:


 I am trying to implement Historical search using SOLR.

 Ex:

 If I search on address 800 5th Ave and provide a time range, it should list
 the name of the person who was living at the address during the time
 period.
 I am trying to figure out a way to store the data without redundancy.

 I can do a join in the database to return all the names who were living in
 a
 particular address during a particular time but I know it's difficult to do
 that in SOLR and SOLR is not a database (it works best when the data is
 denormalized).,..

 Is there any other way / idea by which I can reduce the redundancy of
 creating multiple records for a particular person again and again?







 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Need-ideas-to-perform-historical-search-tp4078980.html
 Sent from the Solr - User mailing list archive at Nabble.com.
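
A sketch of the 'occupancy' record Alexandre suggests: one small document per
person/address/time interval instead of one ever-growing record per person.
Field names here are made up (using default-schema dynamic-field suffixes),
and the fragment assumes SolrJ's org.apache.solr.common.SolrInputDocument:

  SolrInputDocument occ = new SolrInputDocument();
  occ.addField("id", "p42_800-5th-ave_1998");       // hypothetical unique key
  occ.addField("person_s", "John Doe");
  occ.addField("address_s", "800 5th Ave");
  occ.addField("from_dt", "1998-01-01T00:00:00Z");  // start of occupancy
  occ.addField("to_dt", "2003-06-30T00:00:00Z");    // end of occupancy

Who lived at an address during a period then becomes a range-overlap query,
e.g. address_s:"800 5th Ave" AND from_dt:[* TO 2001-01-01T00:00:00Z] AND
to_dt:[2000-01-01T00:00:00Z TO *].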



ScorerDocQueue.java's downHeap showing up as frequent hotspot in profiling - ideas why?

2012-10-16 Thread Aaron Daubman
Greetings,

In a recent batch of solr 3.6.1 slow response time queries the
profiler highlighted downHeap (line 212) in ScorerDocQueue.java as
averaging more than 60ms across the 16 calls I was looking at and
showing it spiking up over 100ms - which, after looking at the code
(two int comparisons?!?) I am at a loss to explain:

Here's the source:
https://github.com/apache/lucene-solr/blob/6b8783bfa59351878c59e47deaa7739d95150a22/lucene/core/src/java/org/apache/lucene/util/ScorerDocQueue.java#L212

Here's the invocation trace of one of the many similar:
---snip---
Thread.run:722 (0ms self time, 416 ms total time)
 QueuedThreadPool$3.run:526 (0ms self time, 416 ms total time)
  QueuedThreadPool.runJob:595 (0ms self time, 416 ms total time)
   ExecutorCallback$ExecutorCallbackInvoker.run:130 (0ms self time,
416 ms total time)
ExecutorCallback$ExecutorCallbackInvoker.call:124 (0ms self time,
416 ms total time)
 AbstractConnection$1.onCompleted:63 (0ms self time, 416 ms total time)
  AbstractConnection$1.onCompleted:71 (0ms self time, 416 ms total time)
   HttpConnection.onFillable:253 (0ms self time, 416 ms total time)
HttpChannel.run:246 (0ms self time, 416 ms total time)
 Server.handle:403 (0ms self time, 416 ms total time)
  HandlerWrapper.handle:97 (0ms self time, 416 ms total time)
   IPAccessHandler.handle:204 (0ms self time, 416 ms total time)
HandlerCollection.handle:110 (0ms self time, 416 ms total time)
 ContextHandlerCollection.handle:258 (0ms self time, 416
ms total time)
  ScopedHandler.handle:136 (0ms self time, 416 ms total time)
   ContextHandler.doScope:973 (0ms self time, 416 ms total time)
SessionHandler.doScope:174 (0ms self time, 416 ms total time)
 ServletHandler.doScope:358 (0ms self time, 416 ms total time)
  ContextHandler.doHandle:1044 (0ms self time, 416 ms
total time)
   SessionHandler.doHandle:213 (0ms self time, 416 ms
total time)
SecurityHandler.handle:540 (0ms self time, 416 ms
total time)
 ScopedHandler.handle:138 (0ms self time, 416 ms total time)
  ServletHandler.doHandle:429 (0ms self time, 416
ms total time)
   ServletHandler$CachedChain.doFilter:1274 (0ms
self time, 416 ms total time)
SolrDispatchFilter.doFilter:260 (0ms self
time, 416 ms total time)
 SolrDispatchFilter.execute:365 (0ms self
time, 416 ms total time)
  SolrCore.execute:1376 (0ms self time, 416 ms
total time)
   RequestHandlerBase.handleRequest:129 (0ms
self time, 416 ms total time)
SearchHandler.handleRequestBody:186 (0ms
self time, 416 ms total time)
 QueryComponent.process:394 (0ms self
time, 416 ms total time)
  SolrIndexSearcher.search:375 (0ms self
time, 416 ms total time)
   SolrIndexSearcher.getDocListC:1176 (0ms
self time, 416 ms total time)
SolrIndexSearcher.getDocListNC:1296
(0ms self time, 416 ms total time)
 IndexSearcher.search:364 (0ms self
time, 416 ms total time)
  IndexSearcher.search:581 (0ms self
time, 416 ms total time)
   FilteredQuery$2.score:169 (0ms self
time, 416 ms total time)
BooleanScorer2.advance:320 (0ms
self time, 416 ms total time)
 ReqExclScorer.advance:112 (0ms
self time, 416 ms total time)
  DisjunctionSumScorer.advance:229
(52ms self time, 416 ms total time)

DisjunctionSumScorer.advanceAfterCurrent:171 (0ms self time, 308 ms
total time)

ScorerDocQueue.topNextAndAdjustElsePop:120 (0ms self time, 308 ms
total time)

ScorerDocQueue.checkAdjustElsePop:135 (0ms self time, 111 ms total
time)
  ScorerDocQueue.downHeap:212
(111ms self time, 111 ms total time)
---snip---

Any ideas on what is causing this seemingly inordinate amount of time
in downHeap? Is this symptomatic of anything in particular?

Thanks, as always!
 Aaron


Any ideas on Solr 4.0 Release.

2012-07-05 Thread Sohail Aboobaker
Hi,

Congratulations on the Alpha release. I am wondering, is there a ballpark for
the final release of 4.0? Is it expected in the August or Sep time frame, or is it
further away? We badly need some features included in this release. These
are around grouped facet counts. We have limited use for Solr in our
current release. In next release, we will add more features (full text
searching, location based searches etc.). I am wondering if the facet and
group counts side of things is stable in Alpha or not? I have tested with
the nightly builds before and it works fine for our scenarios.

Thanks.

Regards,
Sohail


RE: Any ideas on Solr 4.0 Release.

2012-07-05 Thread Steven A Rowe
Hi Sohail,

Some of your questions are answered here: 
http://wiki.apache.org/solr/Solr4.0. 

See Chris Hostetter's blog post for more info, particularly on questions around 
stability: 
http://www.lucidimagination.com/blog/2012/07/03/4-0-alpha-whats-in-a-name/.

Steve 

-Original Message-
From: Sohail Aboobaker [mailto:sabooba...@gmail.com] 
Sent: Thursday, July 05, 2012 5:22 AM
To: solr-user@lucene.apache.org
Subject: Any ideas on Solr 4.0 Release.

Hi,

Congratulations on the Alpha release. I am wondering, is there a ballpark for the final 
release of 4.0? Is it expected in the August or Sep time frame, or is it further 
away? We badly need some features included in this release. These are around 
grouped facet counts. We have limited use for Solr in our current release. In 
next release, we will add more features (full text searching, location based 
searches etc.). I am wondering if the facet and group counts side of things is 
stable in Alpha or not? I have tested with the nightly builds before and it 
works fine for our scenarios.

Thanks.

Regards,
Sohail


Re: Strange spikes in query response times...any ideas where else to look?

2012-06-29 Thread solr

Otis,

Thanks for the response. We'll check out that tool and see how it goes.

Regarding JMeter...you are exactly correct in that I was assuming 1  
thread = 1 query per second. I thought we had set up some sort of  
throttling mechanism to ensure that...and clearly I was mistaken. By  
the math we are getting A LOT more qps...and in a preliminary look  
those spikes look like they just might be correlated to high qps. We  
are pursuing this line and my gut tells me this *is* the problem.


Thanks for the info on the tool (which we will look at) and for the  
heads-up on the qps.



Peter Lee
ProQuest

Quoting Otis Gospodnetic otis_gospodne...@yahoo.com:


Peter,

These could be JVM, or it could be index reopening and warmup  
queries, or ...  
Grab SPM for Solr - http://sematext.com/spm - in 24-48h we'll  
release an agent that tracks and graphs errors and timings of each  
Solr search component, which may reveal interesting stuff.  In the  
mean time, look at the graph with IO as well as graph with caches.  
 That's where I'd first look for signs.


Re users/threads question - if I understand correctly, this is the  
problem:  JMeter is set up to run 15 threads from a single test  
machine...but I noticed that the JMeter report is showing close to  
47 queries per second.  It sounds like you're equating # of threads  
to QPS, which isn't right.  Imagine you had 10 threads and each  
query took 0.1 seconds (processed by a single CPU core) and the  
server had 10 CPU cores.  That would mean that your 1 thread could  
run 10 queries per second utilizing just 1 CPU core. And 10 threads  
would utilize all 10 CPU cores and would give you 10x higher  
throughput - 10x10=100 QPS.


So if you need to simulate just 2-5 QPS, just lower the number of  
threads.  What that number should be depends on query complexity and  
hw resources (cores or IO).


Otis

Performance Monitoring for Solr / ElasticSearch / HBase -  
http://sematext.com/spm 






From: s...@isshomefront.com s...@isshomefront.com
To: solr-user@lucene.apache.org
Sent: Thursday, June 28, 2012 9:20 PM
Subject: RE: Strange spikes in query response times...any ideas  
where else to look?


Michael,

Thank you for responding...and for the excellent questions.

1) We have never seen this response time spike with a  
user-interactive search. However, in the span of about 40 minutes,  
which included about 82,000 queries, we only saw a handful of  
near-equally distributed spikes. We have tried sending queries  
from the admin tool while the test was running, but given those  
odds, I'm not surprised we've never hit on one of those few  
spikes we are seeing in the test results.


2) Good point and I should have mentioned this. We are using  
multiple methods to track these response times.
  a) Looking at the catalina.out file and plotting the response  
times recorded there (I think this is logging the QTime as seen by  
Solr).
  b) Looking at what JMeter is reporting as response times. In  
general, these are very close if not identical to what is being  
seen in the Catalina.out file. I have not run a line-by-line  
comparison, but putting the query response graphs next to each  
other shows them to be nearly (or possibly exactly) the same.  
Nothing looked out of the ordinary.


3) We are using multiple threads. Before your email I was looking  
at the results, doing some math, and double checking the reports  
from JMeter. I did notice that our throughput is much higher than  
we meant for it to be. JMeter is set up to run 15 threads from a  
single test machine...but I noticed that the JMeter report is  
showing close to 47 queries per second. We are only targeting TWO  
to FIVE queries per second. This is up next on our list of things  
to look at and how to control more effectively. We do have three  
separate machines set up for JMeter testing and we are  
investigating to see if perhaps all three of these machines are  
inadvertently being launched during the test at one time and  
overwhelming the server. This *might* be one facet of the problem.  
Agreed on that.


Even as we investigate this last item regarding the number of  
users/threads, I wouldn't mind any other thoughts you or anyone  
else had to offer. We are checking on this user/threads issue and  
for the sake of anyone else you finds this discussion useful I'll  
note what we find.


Thanks again.

Peter S. Lee
ProQuest

Quoting Michael Ryan mr...@moreover.com:


A few questions...

1) Do you only see these spikes when running JMeter? I.e., do you  
ever see a spike when you manually run a query?


2) How are you measuring the response time? In my experience there  
are three different ways to measure query speed. Usually all of  
them will be approximately equal, but in some situations they can  
be quite different, and this difference can be a clue as to where  
the bottleneck is:

   1) The response time as seen by the end user (in this case

Strange spikes in query response times...any ideas where else to look?

2012-06-28 Thread solr

Greetings all,

We are working on building up a large Solr index for over 300 million  
records...and this is our first look at Solr. We are currently running  
a set of unique search queries against a single server (so no  
replication, no indexing going on at the same time, and no distributed  
search) with a set number of records (in our case, 10 million records  
in the index) for about 30 minutes, with nearly all of our searches  
being unique (I say nearly because our set of queries is unique, but  
I have not yet confirmed that JMeter is selecting these queries with  
no replacement).


We are striving for a 2 second response time on the average, and  
indeed we are pretty darned close. In fact, if you look at the average  
response time, we are well under the 2 seconds per query.  
Unfortunately, we are seeing that about once every 6 minutes or so  
(and it is not a regular event...exactly six minutes apart...it is  
about six minutes but it fluctuates) we get a single query that  
returns in something like 15 to 20 seconds.


We have been trying to identify what is causing this spike every so  
often and we are completely baffled. What we have done thus far:


1) Looked through the SAR logs and have not seen anything that  
correlates to this issue
2) Tracked the JVM statistics...especially the garbage  
collections...no correlations there either

3) Examined the queries...no pattern obvious there
4) Played with the JVM memory settings (heap settings, cache settings,  
and any other settings we could find)
5) Changed hardware (brand new 4 processor, 8 gig RAM server with a  
fresh install of Redhat 5.7 enterprise, tried on a large instance of  
AWS EC2, tried on a fresh instance of a VMWare based virtual machine  
from our own data center) and still nothing is giving us a clue as to  
what is causing these spikes

6) No correlation found between the number of hits returned and the spikes


Our data is very simple and so are the queries. The schema consists of  
40 fields, most of which are string fields, 2 of which are  
location fields, and a small handful of which are integer fields.  
All fields are indexed and all fields are stored.


Our queries are also rather simple. Many of the queries are a simple  
one-field search. The most complex query we have is a 3-field search.  
Again, no correlation has been established between the query and these  
spikes. Also, about 60% of our queries return zero hits (on the  
assumption that we want to make solr search its entire index every so  
often. 60% is more than we intended and we will fix that soon...but  
that is what is currently happening. Again, no correlation found  
between spikes and 0-hit returned queries).


For some time we were testing with 100 million records in the index  
and the aggregate data looked quite good. Most queries were returning  
in under 2 seconds. Unfortunately, it was when we looked at the  
individual data points that we found spikes every 6-8 minutes or so  
hitting sometimes as high as 150 seconds!


We have been testing with 100 million records in the index, 50 million  
records in the index, 25 million, 20 million, 15 million, and 10  
million records. As I  indicated at the start, we are now at 10  
million records with 15-20 seconds spikes.


As we have decreased the number of records in the index, the size (but  
not the frequency) of the spikes has been dropping.


My question is: Is this type of behavior normal for Solr when it is  
being overstressed? I've read of lots of people with far more  
complicated schemas running MORE than 10 million records in an index  
and never once complained about these spikes. Since I am new at this,  
I am not sure what Solr's failure mode looks like when it has too  
many records to search.


I am hoping someone looking at this note can at least give me another  
direction to look. 10 million records searched in less than 2 seconds  
most of the time is great...but those 10 and 20 seconds spikes are not  
going to go over well with our customers...and I somehow think there  
is more we should be able to do here.


Thanks.

Peter S. Lee
ProQuest



RE: Strange spikes in query response times...any ideas where else to look?

2012-06-28 Thread Michael Ryan
A few questions...

1) Do you only see these spikes when running JMeter? I.e., do you ever see a 
spike when you manually run a query?

2) How are you measuring the response time? In my experience there are three 
different ways to measure query speed. Usually all of them will be 
approximately equal, but in some situations they can be quite different, and 
this difference can be a clue as to where the bottleneck is:
  1) The response time as seen by the end user (in this case, JMeter)
  2) The response time as seen by the container (for example, in Jetty you can 
get this by enabling logLatency in jetty.xml)
  3) The QTime as returned in the Solr response

3) Are you running multiple queries concurrently, or are you just using a 
single thread in JMeter?

-Michael

-Original Message-
From: s...@isshomefront.com [mailto:s...@isshomefront.com] 
Sent: Thursday, June 28, 2012 7:56 PM
To: solr-user@lucene.apache.org
Subject: Strange spikes in query response times...any ideas where else to 
look?

Greetings all,

We are working on building up a large Solr index for over 300 million  
records...and this is our first look at Solr. We are currently running  
a set of unique search queries against a single server (so no  
replication, no indexing going on at the same time, and no distributed  
search) with a set number of records (in our case, 10 million records  
in the index) for about 30 minutes, with nearly all of our searches  
being unique (I say nearly because our set of queries is unique, but  
I have not yet confirmed that JMeter is selecting these queries with  
no replacement).

We are striving for a 2 second response time on the average, and  
indeed we are pretty darned close. In fact, if you look at the average  
response time, we are well under the 2 seconds per query.  
Unfortunately, we are seeing that about once every 6 minutes or so  
(and it is not a regular event...exactly six minutes apart...it is  
about six minutes but it fluctuates) we get a single query that  
returns in something like 15 to 20 seconds.

We have been trying to identify what is causing this spike every so  
often and we are completely baffled. What we have done thus far:

1) Looked through the SAR logs and have not seen anything that  
correlates to this issue
2) Tracked the JVM statistics...especially the garbage  
collections...no correlations there either
3) Examined the queries...no pattern obvious there
4) Played with the JVM memory settings (heap settings, cache settings,  
and any other settings we could find)
5) Changed hardware (brand new 4 processor, 8 gig RAM server with a  
fresh install of Redhat 5.7 enterprise, tried on a large instance of  
AWS EC2, tried on a fresh instance of a VMWare based virtual machine  
from our own data center) and still nothing is giving us a clue as to  
what is causing these spikes
6) No correlation found between the number of hits returned and the spikes


Our data is very simple and so are the queries. The schema consists of  
40 fields, most of which are string fields, 2 of which are  
location fields, and a small handful of which are integer fields.  
All fields are indexed and all fields are stored.

Our queries are also rather simple. Many of the queries are a simple  
one-field search. The most complex query we have is a 3-field search.  
Again, no correlation has been established between the query and these  
spikes. Also, about 60% of our queries return zero hits (on the  
assumption that we want to make solr search its entire index every so  
often). 60% is more than we intended and we will fix that soon...but  
that is what is currently happening. Again, no correlation found  
between spikes and 0-hit returned queries).

For some time we were testing with 100 million records in the index  
and the aggregate data looked quite good. Most queries were returning  
in under 2 seconds. Unfortunately, it was when we looked at the  
individual data points that we found spikes every 6-8 minutes or so  
hitting sometimes as high as 150 seconds!

We have been testing with 100 million records in the index, 50 million  
records in the index, 25 million, 20 million, 15 million, and 10  
million records. As I  indicated at the start, we are now at 10  
million records with 15-20 seconds spikes.

As we have decreased the number of records in the index, the size (but  
not the frequency) of the spikes has been dropping.

My question is: Is this type of behavior normal for Solr when it is  
being overstressed? I've read of lots of people with far more  
complicated schemas running MORE than 10 million records in an index  
and never once complained about these spikes. Since I am new at this,  
I am not sure what Solr's failure mode looks like when it has too  
many records to search.

I am hoping someone looking at this note can at least give me another  
direction to look. 10 million records searched in less than 2 seconds  
most of the time is great...but those 10 and 20 seconds

RE: Strange spikes in query response times...any ideas where else to look?

2012-06-28 Thread solr

Michael,

Thank you for responding...and for the excellent questions.

1) We have never seen this response time spike with a user-interactive  
search. However, in the span of about 40 minutes, which included about  
82,000 queries, we only saw a handful of near-equally distributed  
spikes. We have tried sending queries from the admin tool while the  
test was running, but given those odds, I'm not surprised we've never  
hit on one of those few spikes we are seeing in the test results.


2) Good point and I should have mentioned this. We are using multiple  
methods to track these response times.
  a) Looking at the catalina.out file and plotting the response times  
recorded there (I think this is logging the QTime as seen by Solr).
  b) Looking at what JMeter is reporting as response times. In  
general, these are very close if not identical to what is being seen  
in the Catalina.out file. I have not run a line-by-line comparison,  
but putting the query response graphs next to each other shows them to  
be nearly (or possibly exactly) the same. Nothing looked out of the  
ordinary.


3) We are using multiple threads. Before your email I was looking at  
the results, doing some math, and double checking the reports from  
JMeter. I did notice that our throughput is much higher than we meant  
for it to be. JMeter is set up to run 15 threads from a single test  
machine...but I noticed that the JMeter report is showing close to 47  
queries per second. We are only targeting TWO to FIVE queries per  
second. This is up next on our list of things to look at and how to  
control more effectively. We do have three separate machines set up  
for JMeter testing and we are investigating to see if perhaps all  
three of these machines are inadvertently being launched during the  
test at one time and overwhelming the server. This *might* be one  
facet of the problem. Agreed on that.


Even as we investigate this last item regarding the number of  
users/threads, I wouldn't mind any other thoughts you or anyone else  
had to offer. We are checking on this user/threads issue and for the  
sake of anyone else you finds this discussion useful I'll note what we  
find.


Thanks again.

Peter S. Lee
ProQuest

Quoting Michael Ryan mr...@moreover.com:


A few questions...

1) Do you only see these spikes when running JMeter? I.e., do you  
ever see a spike when you manually run a query?


2) How are you measuring the response time? In my experience there  
are three different ways to measure query speed. Usually all of them  
will be approximately equal, but in some situations they can be  
quite different, and this difference can be a clue as to where the  
bottleneck is:

  1) The response time as seen by the end user (in this case, JMeter)
  2) The response time as seen by the container (for example, in  
Jetty you can get this by enabling logLatency in jetty.xml)

  3) The QTime as returned in the Solr response

3) Are you running multiple queries concurrently, or are you just  
using a single thread in JMeter?


-Michael

-Original Message-
From: s...@isshomefront.com [mailto:s...@isshomefront.com]
Sent: Thursday, June 28, 2012 7:56 PM
To: solr-user@lucene.apache.org
Subject: Strange spikes in query response times...any ideas where  
else to look?


Greetings all,

We are working on building up a large Solr index for over 300 million
records...and this is our first look at Solr. We are currently running
a set of unique search queries against a single server (so no
replication, no indexing going on at the same time, and no distributed
search) with a set number of records (in our case, 10 million records
in the index) for about 30 minutes, with nearly all of our searches
being unique (I say nearly because our set of queries is unique, but
I have not yet confirmed that JMeter is selecting these queries with
no replacement).

We are striving for a 2 second response time on the average, and
indeed we are pretty darned close. In fact, if you look at the average
response time, we are well under the 2 seconds per query.
Unfortunately, we are seeing that about once every 6 minutes or so
(and it is not a regular event...exactly six minutes apart...it is
about six minutes but it fluctuates) we get a single query that
returns in something like 15 to 20 seconds

We have been trying to identify what is causing this spike every so
often and we are completely baffled. What we have done thus far:

1) Looked through the SAR logs and have not seen anything that
correlates to this issue
2) Tracked the JVM statistics...especially the garbage
collections...no correlations there either
3) Examined the queries...no pattern obvious there
4) Played with the JVM memory settings (heap settings, cache settings,
and any other settings we could find)
5) Changed hardware: Brand new 4 processor, 8 gig RAM server with a
fresh install of Redhat 5.7 enterprise, tried on a large instance of
AWS EC2, tried on a fresh instance of a VMWare

Re: Strange spikes in query response times...any ideas where else to look?

2012-06-28 Thread Otis Gospodnetic
Peter,

These could be JVM, or it could be index reopening and warmup queries, or ...  
Grab SPM for Solr - http://sematext.com/spm - in 24-48h we'll release an agent 
that tracks and graphs errors and timings of each Solr search component, which 
may reveal interesting stuff.  In the mean time, look at the graph with IO as 
well as graph with caches.  That's where I'd first look for signs.

Re users/threads question - if I understand correctly, this is the problem: 
 JMeter is set up to run 15 threads from a single test machine...but I noticed 
that the JMeter report is showing close to 47 queries per second.  It sounds 
like you re equating # of threads to QPS, which isn't right.  Imagine you had 
10 threads and each query took 0.1 seconds (processed by a single CPU core) and 
the server had 10 CPU cores.  That would mean that your 1 thread could run 10 
queries per second utilizing just 1 CPU core. And 10 threads would utilize all 
10 CPU cores and would give you 10x higher throughput - 10x10=100 QPS.

So if you need to simulate just 2-5 QPS, just lower the number of threads.  
What that number should be depends on query complexity and hw resources (cores 
or IO).

Otis

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 




 From: s...@isshomefront.com s...@isshomefront.com
To: solr-user@lucene.apache.org 
Sent: Thursday, June 28, 2012 9:20 PM
Subject: RE: Strange spikes in query response times...any ideas where else 
to look?
 
Michael,

Thank you for responding...and for the excellent questions.

1) We have never seen this response time spike with a user-interactive search. 
However, in the span of about 40 minutes, which included about 82,000 queries, 
we only saw a handful of near-equally distributed spikes. We have tried 
sending queries from the admin tool while the test was running, but given 
those odds, I'm not surprised we've never hit on one of those few spikes we 
are seeing in the test results.

2) Good point and I should have mentioned this. We are using multiple methods 
to track these response times.
  a) Looking at the catalina.out file and plotting the response times recorded 
there (I think this is logging the QTime as seen by Solr).
  b) Looking at what JMeter is reporting as response times. In general, these 
are very close if not identical to what is being seen in the Catalina.out 
file. I have not run a line-by-line comparison, but putting the query response 
graphs next to each other shows them to be nearly (or possibly exactly) the 
same. Nothing looked out of the ordinary.

3) We are using multiple threads. Before your email I was looking at the 
results, doing some math, and double checking the reports from JMeter. I did 
notice that our throughput is much higher than we meant for it to be. JMeter 
is set up to run 15 threads from a single test machine...but I noticed that 
the JMeter report is showing close to 47 queries per second. We are only 
targeting TWO to FIVE queries per second. This is up next on our list of 
things to look at and how to control more effectively. We do have three 
separate machines set up for JMeter testing and we are investigating to see if 
perhaps all three of these machines are inadvertently being launched during 
the test at one time and overwhelming the server. This *might* be one facet of 
the problem. Agreed on that.

Even as we investigate this last item regarding the number of users/threads, I 
wouldn't mind any other thoughts you or anyone else had to offer. We are 
checking on this user/threads issue, and for the sake of anyone else who finds 
this discussion useful I'll note what we find.

Thanks again.

Peter S. Lee
ProQuest

Quoting Michael Ryan mr...@moreover.com:

 A few questions...
 
 1) Do you only see these spikes when running JMeter? I.e., do you ever see a 
 spike when you manually run a query?
 
 2) How are you measuring the response time? In my experience there are three 
 different ways to measure query speed. Usually all of them will be 
 approximately equal, but in some situations they can be quite different, and 
 this difference can be a clue as to where the bottleneck is:
   1) The response time as seen by the end user (in this case, JMeter)
   2) The response time as seen by the container (for example, in Jetty you 
can get this by enabling logLatency in jetty.xml)
   3) The QTime as returned in the Solr response
 
 3) Are you running multiple queries concurrently, or are you just using a 
 single thread in JMeter?
 
 -Michael
 
 -Original Message-
 From: s...@isshomefront.com [mailto:s...@isshomefront.com]
 Sent: Thursday, June 28, 2012 7:56 PM
 To: solr-user@lucene.apache.org
 Subject: Strange spikes in query response times...any ideas where else to 
 look?
 
 Greetings all,
 
 We are working on building up a large Solr index for over 300 million
 records...and this is our first look at Solr. We are currently running
 a set of unique search

RE: ideas for indexing large amount of pdf docs

2011-08-16 Thread Rode González
Hi Jay, thanks. Great idea; in the next few days we'll try to do something like
you exposed.

best,
rode.

---
Rode González
Libnova, SL
Paseo de la Castellana, 153-Madrid
[t]91 449 08 94  [f]91 141 21 21
www.libnova.es

 -Mensaje original-
 De: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov]
 Enviado el: lunes, 15 de agosto de 2011 14:54
 Para: solr-user@lucene.apache.org
 Asunto: RE: ideas for indexing large amount of pdf docs
 
 Note on i:  Solr replication provides pretty good clustering support
 out-of-the-box, including replication of multiple cores.  Read the Wiki
 on replication (Google +solr +replication if you don't know where it
 is).
 
 In my experience, the problem with indexing PDFs is it takes a lot of
 CPU on the document parsing side (client), not on the Solr server side.
 So make sure you do that part on the client and not the server.
 
 Avoiding iii:
 
 
 Suggest that you write yourself a multi-threaded performance test so
 that you aren't guessing what your performance will be.
 
 We wrote one in Perl.  It handles an individual thread (we were testing
 inquiry), and we wrote a little batch file / shell script to start up
 the desired number of threads.
 
 The main statement in our batch file follows (the rest just sets the
 variables).  A shell script would be even easier.
 
 for /L %%i in (1,1,%THREADS%) DO start /B perl solrtest.pl -h %SOLRHOST%
 -c %COUNT% -u %1 -p %2 -r %SOLRREALM% -f %SOLRLOC%\firstsynonyms.txt -l
 %SOLRLOC%\lastsynonyms.txt -z %FUZZ%
 
 The Perl script:
 
 #!/usr/bin/perl
 
 #
 # Perl program to run a thread of solr testing
 #
 
 use Getopt::Std;              # For options processing
 use POSIX;                    # For time formatting
 use XML::Simple;              # For processing of XML config file
 use Data::Dumper;             # For debugging XML config file
 use HTTP::Request::Common;    # For HTTP request to Solr
 use HTTP::Response;
 use LWP::UserAgent;           # For HTTP request to Solr
 
 $host = "YOURHOST:8983";
 $realm = "YOUR AUTHENTICATION REALM";
 $firstlist = "firstsynonyms.txt";
 $lastlist = "lastsynonyms.txt";
 $fuzzy = "";
 
 $me = $0;
 
 sub usage() {
     print "perl $me -c iterations [-d] [-h host:port] [-u user [-p password]]\n";
     print "\t\t[-f firstnamefile] [-l lastnamefile] [-z fuzzy] [-r realm]\n";
     exit(8);
 }
 
 #
 # Process the command line options, and open the output file.
 #
 
 getopts('dc:u:p:f:l:h:r:z:') || usage();
 
 if(!$opt_c) {
     usage();
 }
 
 $count = $opt_c;
 
 if($opt_u) {
     $user = $opt_u;
 }
 
 if($opt_p) {
     $password = $opt_p;
 }
 
 if($opt_h) {
     $host = $opt_h;
 }
 
 if($opt_f) {
     $firstlist = $opt_f;
 }
 
 if($opt_l) {
     $lastlist = $opt_l;
 }
 
 if($opt_r) {
     $realm = $opt_r;
 }
 
 if($opt_z) {
     $fuzzy = "~" . $opt_z;
 }
 
 $debug = $opt_d;
 
 #
 # If the host string does not include a :, add :80
 #
 
 if($host !~ /:/) {
     $host = $host . ":80";
 }
 
 #
 # Read the lists of first and last names
 #
 
 open(SYNFILE, $firstlist) || die "Can't open first name list $firstlist\n";
 while(<SYNFILE>) {
     @newwords = split /,/;
     for($i = 0; $i <= $#newwords; ++$i) {
         $newwords[$i] =~ s/^\s+//;
         $newwords[$i] =~ s/\s+$//;
         $newwords[$i] = lc($newwords[$i]);
     }
     push @firstnames, @newwords;
 }
 close(SYNFILE);
 
 open(SYNFILE, $lastlist) || die "Can't open last name list $lastlist\n";
 while(<SYNFILE>) {
     @newwords = split /,/;
     for($i = 0; $i <= $#newwords; ++$i) {
         $newwords[$i] =~ s/^\s+//;
         $newwords[$i] =~ s/\s+$//;
         $newwords[$i] = lc($newwords[$i]);
     }
     push @lastnames, @newwords;
 }
 close(SYNFILE);
 
 print "$#firstnames First Names, $#lastnames Last Names\n";
 print "User: $user\n";
 
 my $userAgent = LWP::UserAgent->new(agent => 'solrtest.pl');
 $userAgent->credentials($host, $realm, $user, $password);
 
 $uri = "http://$host/solr/select";
 
 $starttime = time();
 
 for($c = 0; $c < $count; ++$c) {
     $fname = $firstnames[rand $#firstnames];
     $lname = $lastnames[rand $#lastnames];
     $response = $userAgent->request(
         POST $uri,
         [
             q    => "lnamesyn:$lname AND fnamesyn:$fname$fuzzy",
             rows => 25
         ]);
 
     if($debug) {
         print "Query: lnamesyn:$lname AND fnamesyn:$fname$fuzzy";
         print $response->content();
     }
     print "POST for $fname $lname completed, HTTP status=" . $response->code . "\n";
 }
 
 $elapsed = time() - $starttime;
 $average = $elapsed / $count;
 
 print "Time: $elapsed s ($average/request)\n";
 
 
 -Original Message-
 From: Rode Gonzalez (libnova) [mailto:r...@libnova.es]
 Sent: Saturday, August 13, 2011 3:50 AM
 To: solr-user@lucene.apache.org
 Subject: ideas for indexing large amount of pdf docs
 
 Hi all,
 
 I want to ask about the best way to implement

RE: ideas for indexing large amount of pdf docs

2011-08-15 Thread Jaeger, Jay - DOT
Note on i:  Solr replication provides pretty good clustering support 
out-of-the-box, including replication of multiple cores.  Read the Wiki on 
replication (Google +solr +replication if you don't know where it is).  

In my experience, the problem with indexing PDFs is it takes a lot of CPU on 
the document parsing side (client), not on the Solr server side.  So make sure 
you do that part on the client and not the server.

Avoiding iii:


Suggest that you write yourself a multi-threaded performance test so that you 
aren't guessing what your performance will be.

We wrote one in Perl.  It handles an individual thread (we were testing 
inquiry), and we wrote a little batch file / shell script to start up the 
desired number of threads.

The main statement in our batch file follows (the rest just sets the
variables).  A shell script would be even easier.

for /L %%i in (1,1,%THREADS%) DO start /B perl solrtest.pl -h %SOLRHOST% 
-c %COUNT% -u %1 -p %2 -r %SOLRREALM% -f %SOLRLOC%\firstsynonyms.txt -l 
%SOLRLOC%\lastsynonyms.txt -z %FUZZ%

The Perl script:

#!/usr/bin/perl

#
#   Perl program to run a thread of solr testing
#

use Getopt::Std;              # For options processing
use POSIX;                    # For time formatting
use XML::Simple;              # For processing of XML config file
use Data::Dumper;             # For debugging XML config file
use HTTP::Request::Common;    # For HTTP request to Solr
use HTTP::Response;
use LWP::UserAgent;           # For HTTP request to Solr

$host = "YOURHOST:8983";
$realm = "YOUR AUTHENTICATION REALM";
$firstlist = "firstsynonyms.txt";
$lastlist = "lastsynonyms.txt";
$fuzzy = "";

$me = $0;

sub usage() {
    print "perl $me -c iterations [-d] [-h host:port] [-u user [-p password]]\n";
    print "\t\t[-f firstnamefile] [-l lastnamefile] [-z fuzzy] [-r realm]\n";
    exit(8);
}

#
#   Process the command line options, and open the output file.
#

getopts('dc:u:p:f:l:h:r:z:') || usage();

if(!$opt_c) {
    usage();
}

$count = $opt_c;

if($opt_u) {
    $user = $opt_u;
}

if($opt_p) {
    $password = $opt_p;
}

if($opt_h) {
    $host = $opt_h;
}

if($opt_f) {
    $firstlist = $opt_f;
}

if($opt_l) {
    $lastlist = $opt_l;
}

if($opt_r) {
    $realm = $opt_r;
}

if($opt_z) {
    $fuzzy = "~" . $opt_z;
}

$debug = $opt_d;

#
#   If the host string does not include a :, add :80
#

if($host !~ /:/) {
    $host = $host . ":80";
}

#
#   Read the lists of first and last names
#

open(SYNFILE, $firstlist) || die "Can't open first name list $firstlist\n";
while(<SYNFILE>) {
    @newwords = split /,/;
    for($i = 0; $i <= $#newwords; ++$i) {
        $newwords[$i] =~ s/^\s+//;
        $newwords[$i] =~ s/\s+$//;
        $newwords[$i] = lc($newwords[$i]);
    }
    push @firstnames, @newwords;
}
close(SYNFILE);

open(SYNFILE, $lastlist) || die "Can't open last name list $lastlist\n";
while(<SYNFILE>) {
    @newwords = split /,/;
    for($i = 0; $i <= $#newwords; ++$i) {
        $newwords[$i] =~ s/^\s+//;
        $newwords[$i] =~ s/\s+$//;
        $newwords[$i] = lc($newwords[$i]);
    }
    push @lastnames, @newwords;
}
close(SYNFILE);

print "$#firstnames First Names, $#lastnames Last Names\n";
print "User: $user\n";

my $userAgent = LWP::UserAgent->new(agent => 'solrtest.pl');
$userAgent->credentials($host, $realm, $user, $password);

$uri = "http://$host/solr/select";

$starttime = time();

for($c = 0; $c < $count; ++$c) {
    $fname = $firstnames[rand $#firstnames];
    $lname = $lastnames[rand $#lastnames];
    $response = $userAgent->request(
        POST $uri,
        [
            q    => "lnamesyn:$lname AND fnamesyn:$fname$fuzzy",
            rows => 25
        ]);

    if($debug) {
        print "Query: lnamesyn:$lname AND fnamesyn:$fname$fuzzy";
        print $response->content();
    }
    print "POST for $fname $lname completed, HTTP status=" . $response->code . "\n";
}

$elapsed = time() - $starttime;
$average = $elapsed / $count;

print "Time: $elapsed s ($average/request)\n";


-Original Message-
From: Rode Gonzalez (libnova) [mailto:r...@libnova.es] 
Sent: Saturday, August 13, 2011 3:50 AM
To: solr-user@lucene.apache.org
Subject: ideas for indexing large amount of pdf docs

Hi all,

I want to ask about the best way to implement a solution for indexing a 
large amount of pdf documents between 10-60 MB each one. 100 to 1000 users 
connected simultaneously.

I actually have 1 core of solr 3.3.0 and it works fine for a few number of 
pdf docs but I'm afraid about the moment when we enter in production time.

some possibilities:

i. clustering. I have no experience in this, so it will be a bad idea to 
venture into this.

ii. multicore solution. make some kind of hash to choose one core at each 
query (exact queries) and thus reduce

ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
Hi all,

I want to ask about the best way to implement a solution for indexing a 
large amount of PDF documents, between 10 and 60 MB each, with 100 to 1000 
users connected simultaneously.

I currently have 1 core of Solr 3.3.0 and it works fine for a small number of 
PDF docs, but I'm worried about the moment when we enter production.

some possibilities:

i. clustering. I have no experience with this, so it may be a bad idea to 
venture into it.

ii. a multicore solution: make some kind of hash to choose one core for each 
query (exact queries) and thus reduce the size of the individual indexes to 
consult, or consult all the cores at the same time (complex queries).

iii. do nothing more and wait for the catastrophe in the response times :P


Can someone with experience help a bit to decide?

Thanks a lot in advance.


Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Erick Erickson
Yeah, parsing PDF files can be pretty resource-intensive, so one solution
is to offload it somewhere else. You can use the Tika libraries in SolrJ
to parse the PDFs on as many clients as you want, just transmitting the
results to Solr for indexing.
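
A minimal sketch of that client-side flow (assumptions, not from this thread:
a recent SolrJ with HttpSolrClient, the Tika facade class, and made-up
core/field names "core1", "id" and "text"):

import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();  // extraction runs here, on the client
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/core1").build()) {
            for (String path : args) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", path);
                // the expensive PDF parsing happens locally...
                doc.addField("text", tika.parseToString(new File(path)));
                solr.add(doc);  // ...and only extracted text goes to Solr
            }
            solr.commit();
        }
    }
}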

How are all these docs being submitted? Is this some kind of
on-the-fly indexing/searching or what? I'm mostly curious what
your projected max ingestion rate is...

Best
Erick

On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova)
r...@libnova.es wrote:
 Hi all,

 I want to ask about the best way to implement a solution for indexing a
 large amount of pdf documents between 10-60 MB each one. 100 to 1000 users
 connected simultaneously.

 I actually have 1 core of solr 3.3.0 and it works fine for a few number of
 pdf docs but I'm afraid about the moment when we enter in production time.

 some possibilities:

 i. clustering. I have no experience in this, so it will be a bad idea to
 venture into this.

 ii. multicore solution. make some kind of hash to choose one core at each
 query (exact queries) and thus reduce the size of the individual indexes to
 consult or to consult all the cores at same time (complex queries).

 iii. do nothing more and wait for the catastrophe in the response times :P


 Someone with experience can help a bit to decide?

 Thanks a lot in advance.



Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
Hi Erick, 

Our app inserts the PDFs from a backoffice site and people can 
search/consult through a front-end site. Both are written in PHP. I've 
installed a Tomcat for Solr exclusively.

The PDF docs are indexed and not stored, using the standard 
solr.extraction.ExtractingRequestHandler (solr-cell.jar and the other jars 
included in the contrib/extraction dir, you know) in an offline mode 
(summarizing: the internal users submit the docs; these docs are saved on 
the server; there is a task that takes the docs and puts them into the indexer 
through a curl utility; when the task finishes, the doc is available to the 
frontend; once more, we use curl utilities to make queries to Solr).

The problem isn't the process of indexing. The max injection rate can be 
1-60 docs at a time. The number of PDF docs can be 1000, 2000, 10,000, ... I 
don't know exactly... but a lot of them, as many books as in a library.

But no problem there; this part of the process runs offline: take a 
doc, index a doc; take another doc, index another doc, ...

The problem is the response time when the number of PDFs grows and grows... 
What is the better manner, the best way, the fantastic idea to minimize this 
time as much as possible when we enter production?

Best,

Rode.


-Original Message-
From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Date: Sat, 13 Aug 2011 12:13:27 -0400
Subject: Re: ideas for indexing large amount of pdf docs

Yeah, parsing PDF files can be pretty resource-intensive, so one solution
is to offload it somewhere else. You can use the Tika libraries in SolrJ
to parse the PDFs on as many clients as you want, just transmitting the
results to Solr for indexing.

HOw are all these docs being submitted? Is this some kind of
on-the-fly indexing/searching or what? I'm mostly curious what
your projected max ingestion rate is...

Best
Erick

On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova)
r...@libnova.es wrote:
 Hi all,

 I want to ask about the best way to implement a solution for indexing a
 large amount of pdf documents between 10-60 MB each one. 100 to 1000 users
 connected simultaneously.

 I actually have 1 core of solr 3.3.0 and it works fine for a few number of
 pdf docs but I'm afraid about the moment when we enter in production time.

 some possibilities:

 i. clustering. I have no experience in this, so it will be a bad idea to
 venture into this.

 ii. multicore solution. make some kind of hash to choose one core at each
 query (exact queries) and thus reduce the size of the individual indexes to
 consult or to consult all the cores at same time (complex queries).

 iii. do nothing more and wait for the catastrophe in the response times :P


 Someone with experience can help a bit to decide?

 Thanks a lot in advance.




Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Bill Bell
You could send PDF for processing using a queue solution like Amazon SQS. Kick 
off Amazon instances to process the queue.

Once you process with Tika to text just send the update to Solr.

Bill Bell
Sent from mobile


On Aug 13, 2011, at 10:13 AM, Erick Erickson erickerick...@gmail.com wrote:

 Yeah, parsing PDF files can be pretty resource-intensive, so one solution
 is to offload it somewhere else. You can use the Tika libraries in SolrJ
 to parse the PDFs on as many clients as you want, just transmitting the
 results to Solr for indexing.
 
 HOw are all these docs being submitted? Is this some kind of
 on-the-fly indexing/searching or what? I'm mostly curious what
 your projected max ingestion rate is...
 
 Best
 Erick
 
 On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova)
 r...@libnova.es wrote:
 Hi all,
 
 I want to ask about the best way to implement a solution for indexing a
 large amount of pdf documents between 10-60 MB each one. 100 to 1000 users
 connected simultaneously.
 
 I actually have 1 core of solr 3.3.0 and it works fine for a few number of
 pdf docs but I'm afraid about the moment when we enter in production time.
 
 some possibilities:
 
 i. clustering. I have no experience in this, so it will be a bad idea to
 venture into this.
 
 ii. multicore solution. make some kind of hash to choose one core at each
 query (exact queries) and thus reduce the size of the individual indexes to
 consult or to consult all the cores at same time (complex queries).
 
 iii. do nothing more and wait for the catastrophe in the response times :P
 
 
 Someone with experience can help a bit to decide?
 
 Thanks a lot in advance.
 


Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Erick Erickson
Ahhh, ok, my reply was irrelevant G...

Here's a good write-up on this problem:
http://www.lucidimagination.com/content/scaling-lucene-and-solr

But Solr handles millions of documents on a single server in many cases,
so waiting until the search app falls over is actually feasible.

In general, if you can get an adequate query response time from a single
machine, you just set up a master/slave architecture and add as many slaves
as you need to handle your maximum load. So scaling wide is a very quick
process. Don't go to sharding unless and until your machine can't give adequate
response times at all...

Mark's paper outlines this very well.

Best
Erick

On Sat, Aug 13, 2011 at 2:13 PM, Rode Gonzalez (libnova)
r...@libnova.es wrote:
 Hi Erick,

 Our app insert the pdf from a backoffice site and the people can
 search/consult throught a front end site. Both written in php. I've
 installed a tomcat for solr exclusivelly.

 the pdf docs are indexed and not stored using the standard
 solr.extraction.ExtractingRequestHandler (solr-cell.jar and the other jars
 included in contrib/extraction dir, you know) in an offline mode
 (summarizing: the internal users submit the docs; this docs were saved in
 the server; there is a task that take the docs and put them into the indexer
 throught a curl utility; when the task finish, the doc is available to the
 frontend; once more, we use curl utilities to make queries to solr).

 The problem isn't the process of indexing. The max injection rate can be
 1-60 docs at time. The number of pdf docs can be1000, 2000, 10.000,... i
 don't know exactly... but a lot of them,so many books in a library.

 But no problem about this, this part of the process runs offline. take a
 doc, index a doc; take another doc, index another doc, ...

 The problem is the response time when the number of pdf's grow and grow...
 How is the better manner, the best way, the fantastic idea to minimize this
 time all as possible when we entering in production time.

 Best,

 Rode.


 -Original Message-
 From: Erick Erickson erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 Date: Sat, 13 Aug 2011 12:13:27 -0400
 Subject: Re: ideas for indexing large amount of pdf docs

 Yeah, parsing PDF files can be pretty resource-intensive, so one solution
 is to offload it somewhere else. You can use the Tika libraries in SolrJ
 to parse the PDFs on as many clients as you want, just transmitting the
 results to Solr for indexing.

 HOw are all these docs being submitted? Is this some kind of
 on-the-fly indexing/searching or what? I'm mostly curious what
 your projected max ingestion rate is...

 Best
 Erick

 On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova)
 r...@libnova.es wrote:
  Hi all,

  I want to ask about the best way to implement a solution for indexing a
  large amount of pdf documents between 10-60 MB each one. 100 to 1000 users
  connected simultaneously.

  I actually have 1 core of solr 3.3.0 and it works fine for a few number of
  pdf docs but I'm afraid about the moment when we enter in production time.

  some possibilities:

  i. clustering. I have no experience in this, so it will be a bad idea to
  venture into this.

  ii. multicore solution. make some kind of hash to choose one core at each
  query (exact queries) and thus reduce the size of the individual indexes to
  consult or to consult all the cores at same time (complex queries).

  iii. do nothing more and wait for the catastrophe in the response times :P


  Someone with experience can help a bit to decide?

  Thanks a lot in advance.





Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
Thanks Erick, Bill. Your answers tell me that we're in the right way ;) I 
will study the master/slave architecture for many slaves. In the future 
perhaps we will need it =)

Best regards,
Rode.


-Original Message-
From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Date: Sat, 13 Aug 2011 15:34:19 -0400
Subject: Re: ideas for indexing large amount of pdf docs

Ahhh, ok, my reply was irrelevant G...

Here's a good write-up on this problem:
http://www.lucidimagination.com/content/scaling-lucene-and-solr

But Solr handles millions of documents on a single server in many cases,
so waiting until the search app falls over is actually feasible.

In general, if you can get an adequate query response time from a single
machine, you just set up a master/slave architecture and add as many slaves
as you need to handle your maximum load. So scaling wide is a very quick
process. Don't go to sharding unless and until your machine can't give
adequate response times at all...

Mark's paper outlines this very well.

Best
Erick

On Sat, Aug 13, 2011 at 2:13 PM, Rode Gonzalez (libnova)
r...@libnova.es wrote:
 Hi Erick,

 Our app insert the pdf from a backoffice site and the people can
 search/consult throught a front end site. Both written in php. I've
 installed a tomcat for solr exclusivelly.

 the pdf docs are indexed and not stored using the standard
 solr.extraction.ExtractingRequestHandler (solr-cell.jar and the other jars
 included in contrib/extraction dir, you know) in an offline mode
 (summarizing: the internal users submit the docs; this docs were saved in
 the server; there is a task that take the docs and put them into the indexer
 throught a curl utility; when the task finish, the doc is available to the
 frontend; once more, we use curl utilities to make queries to solr).

 The problem isn't the process of indexing. The max injection rate can be
 1-60 docs at time. The number of pdf docs can be 1000, 2000, 10.000,... i
 don't know exactly... but a lot of them, so many books in a library.

 But no problem about this, this part of the process runs offline. take a
 doc, index a doc; take another doc, index another doc, ...

 The problem is the response time when the number of pdf's grow and grow...
 How is the better manner, the best way, the fantastic idea to minimize this
 time all as possible when we entering in production time.

 Best,

 Rode.

 -Original Message-
 From: Erick Erickson erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 Date: Sat, 13 Aug 2011 12:13:27 -0400
 Subject: Re: ideas for indexing large amount of pdf docs

 Yeah, parsing PDF files can be pretty resource-intensive, so one solution
 is to offload it somewhere else. You can use the Tika libraries in SolrJ
 to parse the PDFs on as many clients as you want, just transmitting the
 results to Solr for indexing.

 HOw are all these docs being submitted? Is this some kind of
 on-the-fly indexing/searching or what? I'm mostly curious what
 your projected max ingestion rate is...

 Best
 Erick

 On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova)
 r...@libnova.es wrote:
  Hi all,

  I want to ask about the best way to implement a solution for indexing a
  large amount of pdf documents between 10-60 MB each one. 100 to 1000 users
  connected simultaneously.

  I actually have 1 core of solr 3.3.0 and it works fine for a few number of
  pdf docs but I'm afraid about the moment when we enter in production time.

  some possibilities:

  i. clustering. I have no experience in this, so it will be a bad idea to
  venture into this.

  ii. multicore solution. make some kind of hash to choose one core at each
  query (exact queries) and thus reduce the size of the individual indexes to
  consult or to consult all the cores at same time (complex queries).

  iii. do nothing more and wait for the catastrophe in the response times :P


  Someone with experience can help a bit to decide?

  Thanks a lot in advance.








ideas for versioning query?

2011-08-01 Thread Mike Sokolov
A customer has an interesting problem: some documents will have multiple 
versions. In search results, only the most recent version of a given 
document should be shown. The trick is that each user has access to a 
different set of document versions, and each user should see only the 
most recent version of a document that they have access to.


Is this something that can reasonably be solved with grouping?  In 3.x? 
I haven't followed the grouping discussions closely: would someone point 
me in the right direction please?


--
Michael Sokolov
Engineering Director
www.ifactory.com



Re: ideas for versioning query?

2011-08-01 Thread Tomás Fernández Löbbe
Hi Michael, I guess this could be solved using grouping as you said.
Documents inside a group can be sorted on a field (in your case, the version
field, see parameter group.sort), and you can show only the first one. It
will be more complex to show facets (post grouping faceting is work in
progress but still not committed to the trunk).
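
A hedged SolrJ sketch of that group.sort idea (the field names docId/version
and the ACL filter are illustrative assumptions; the group.* parameters
themselves are standard Solr grouping parameters):

import org.apache.solr.client.solrj.SolrQuery;

public class LatestVersionQuery {
    public static SolrQuery build(String userQuery, String aclFilter) {
        SolrQuery q = new SolrQuery(userQuery);
        q.addFilterQuery(aclFilter);          // only versions this user can see
        q.set("group", true);                 // collapse by logical document
        q.set("group.field", "docId");
        q.set("group.sort", "version desc");  // newest accessible version first
        q.set("group.limit", 1);              // keep just that one per group
        return q;
    }
}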

It would be easier from the Solr side if you could do something at index
time, like indicating which document is the current one and which one is
an old one (you would need to update the old document whenever a new version
is indexed).

Regards,

Tomás

On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov soko...@ifactory.com wrote:

 A customer has an interesting problem: some documents will have multiple
 versions. In search results, only the most recent version of a given
 document should be shown. The trick is that each user has access to a
 different set of document versions, and each user should see only the most
 recent version of a document that they have access to.

 Is this something that can reasonably be solved with grouping?  In 3.x? I
 haven't followed the grouping discussions closely: would someone point me in
 the right direction please?

 --
 Michael Sokolov
 Engineering Director
 www.ifactory.com




Re: ideas for versioning query?

2011-08-01 Thread Mike Sokolov
Thanks, Tomas.  Yes we are planning to keep a current flag in the most 
current document.  But there are cases where, for a given user, the most 
current document is not that one, because they only have access to some 
older documents.


I took a look at http://wiki.apache.org/solr/FieldCollapsing and it 
seems as if it will do what we need here.  My one concern is that it 
might not be efficient at computing group.ngroups for a very large 
number of groups, which we would ideally want.  Is that something I 
should be worried about?


-Mike

On 08/01/2011 10:08 AM, Tomás Fernández Löbbe wrote:

Hi Michael, I guess this could be solved using grouping as you said.
Documents inside a group can be sorted on a field (in your case, the version
field, see parameter group.sort), and you can show only the first one. It
will be more complex to show facets (post grouping faceting is work in
progress but still not committed to the trunk).

I would be easier from the Solr side if you could do something at index
time, like indicating which document is the current one and which one is
an old one (you would need to update the old document whenever a new version
is indexed).

Regards,

Tomás

On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov soko...@ifactory.com wrote:

A customer has an interesting problem: some documents will have multiple
versions. In search results, only the most recent version of a given
document should be shown. The trick is that each user has access to a
different set of document versions, and each user should see only the most
recent version of a document that they have access to.

Is this something that can reasonably be solved with grouping?  In 3.x? I
haven't followed the grouping discussions closely: would someone point me in
the right direction please?

--
Michael Sokolov
Engineering Director
www.ifactory.com


Re: ideas for versioning query?

2011-08-01 Thread Martijn v Groningen
Hi Mike, how many docs and groups do you have in your index?
I think the group.sort option fits your requirements.

If I remember correctly group.ngroups=true adds something like 30% extra time
on top of the search request with grouping,
but that was on my local test dataset (~30M docs, ~8000 groups) and my
machine. You might encounter different search times when setting
group.ngroups=true.

Martijn

2011/8/1 Mike Sokolov soko...@ifactory.com

 Thanks, Tomas.  Yes we are planning to keep a current flag in the most
 current document.  But there are cases where, for a given user, the most
 current document is not that one, because they only have access to some
 older documents.

 I took a look at http://wiki.apache.org/solr/FieldCollapsing and
  it seems as if it will do what we need here.  My one concern is that it
 might not be efficient at computing group.ngroups for a very large number of
 groups, which we would ideally want.  Is that something I should be worried
 about?

 -Mike


 On 08/01/2011 10:08 AM, Tomás Fernández Löbbe wrote:

 Hi Michael, I guess this could be solved using grouping as you said.
 Documents inside a group can be sorted on a field (in your case, the
 version
 field, see parameter group.sort), and you can show only the first one. It
 will be more complex to show facets (post grouping faceting is work in
 progress but still not committed to the trunk).

 I would be easier from the Solr side if you could do something at index
 time, like indicating which document is the current one and which one is
 an old one (you would need to update the old document whenever a new
 version
 is indexed).

 Regards,

 Tomás

 On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov soko...@ifactory.com wrote:



 A customer has an interesting problem: some documents will have multiple
 versions. In search results, only the most recent version of a given
 document should be shown. The trick is that each user has access to a
 different set of document versions, and each user should see only the
 most
 recent version of a document that they have access to.

 Is this something that can reasonably be solved with grouping?  In 3.x? I
 haven't followed the grouping discussions closely: would someone point me
 in
 the right direction please?

 --
 Michael Sokolov
 Engineering Director
 www.ifactory.com









-- 
Met vriendelijke groet,

Martijn van Groningen


Re: ideas for versioning query?

2011-08-01 Thread Mike Sokolov
I think a 30% increase is acceptable. Yes, I think we'll try it.  
Although our case is more like # groups ~  # documents / N, where N is a 
smallish number (~1-5?).  We are planning for a variety of different 
index sizes, but aiming for a sweet spot around a few M docs.


-Mike

On 08/01/2011 11:00 AM, Martijn v Groningen wrote:

Hi Mike, how many docs and groups do you have in your index?
I think the group.sort option fits your requirements.

If I remember correctly group.ngroup=true adds something like 30% extra time
on top of the search request with grouping,
but that was on my local test dataset (~30M docs, ~8000 groups)  and my
machine. You might encounter different search times when setting
group.ngroup=true.

Martijn

2011/8/1 Mike Sokolov soko...@ifactory.com

Thanks, Tomas.  Yes we are planning to keep a current flag in the most
current document.  But there are cases where, for a given user, the most
current document is not that one, because they only have access to some
older documents.

I took a look at http://wiki.apache.org/solr/FieldCollapsing and
 it seems as if it will do what we need here.  My one concern is that it
might not be efficient at computing group.ngroups for a very large number of
groups, which we would ideally want.  Is that something I should be worried
about?

-Mike


On 08/01/2011 10:08 AM, Tomás Fernández Löbbe wrote:

 

Hi Michael, I guess this could be solved using grouping as you said.
Documents inside a group can be sorted on a field (in your case, the
version
field, see parameter group.sort), and you can show only the first one. It
will be more complex to show facets (post grouping faceting is work in
progress but still not committed to the trunk).

I would be easier from the Solr side if you could do something at index
time, like indicating which document is the current one and which one is
an old one (you would need to update the old document whenever a new
version
is indexed).

Regards,

Tomás

On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov soko...@ifactory.com wrote:

A customer has an interesting problem: some documents will have multiple
versions. In search results, only the most recent version of a given
document should be shown. The trick is that each user has access to a
different set of document versions, and each user should see only the
most
recent version of a document that they have access to.

Is this something that can reasonably be solved with grouping?  In 3.x? I
haven't followed the grouping discussions closely: would someone point me
in
the right direction please?

--
Michael Sokolov
Engineering Director
www.ifactory.com


Solr just 'hangs' under load test - ideas?

2011-06-29 Thread Bob Sandiford
Hi, all.

I'm hoping someone has some thoughts here.

We're running Solr 3.1 (with the patch for SolrQueryParser.java to not do the 
getLuceneVersion() calls, but use luceneMatchVersion directly).

We're running in a Tomcat instance, 64 bit Java.  CATALINA_OPTS are: -Xmx7168m 
-Xms7168m -XX:MaxPermSize=256M

We're running 2 Solr cores, with the same schema.

We use SolrJ to run our searches from a Java app running in JBoss.

JBoss, Tomcat, and the Solr Index folders are all on the same server.

In case it's relevant, we're using JMeter as a load test harness.

We're running on Solaris, a 16 processor box with 48GB physical memory.

I've run a successful load test at a 100 user load (at that rate there are 
about 5-10 solr searches / second), and solr search responses were coming in 
under 100ms.

When I tried to ramp up, as far as I can tell, Solr is just hanging.  (We have 
some logging statements around the SolrJ calls - just before, we log how long 
our query construction takes, then we run the SolrJ query and log the search 
times.  We're getting a number of the query construction logs, but no 
corresponding search time logs).

Symptoms:
The Tomcat and JBoss processes show as well under 1% CPU, and they are still 
the top processes.  CPU states show around 99% idle.   RES usage for the two 
Java processes around 3GB each.  LWP under 120 for each.  STATE just shows as 
sleep.  JBoss is still 'alive', as I can get into a piece of software that 
talks to our JBoss app to get data.

We set things up to use log4j logging for Solr - the log isn't showing any 
errors or exceptions.

We're not indexing - just searching.

Back in January, we did load testing on a prototype, and had no problems 
(though that was Solr 1.4 at the time).  It ramped up beautifully - bottlenecks 
were our apps, not Solr.  What I'm benchmarking now is a descendant of 
that prototype - a bit more complex on searches and more fields in the 
schema, but same basic search logic as far as SolrJ usage.

Any ideas?  What else to look at?  Ringing any bells?

I can send more details if anyone wants specifics...

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.comhttp://www.sirsidynix.com/



Re: Solr just 'hangs' under load test - ideas?

2011-06-29 Thread Yonik Seeley
Can you get a thread dump to see what is hanging?

-Yonik
http://www.lucidimagination.com
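
(An aside, not from this thread: jstack <pid>, or kill -3 <pid> to dump the
stacks to stdout, is the usual route; if you can't attach a tool, the same
snapshot can also be taken in-process with plain JDK calls, as in this sketch.)

import java.util.Map;

public class StackDumper {
    public static void dump() {
        // Thread.getAllStackTraces() snapshots every live thread's stack
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            System.out.printf("\"%s\" state=%s%n",
                    e.getKey().getName(), e.getKey().getState());
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}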

On Wed, Jun 29, 2011 at 11:45 AM, Bob Sandiford
bob.sandif...@sirsidynix.com wrote:
 Hi, all.

 I'm hoping someone has some thoughts here.

 We're running Solr 3.1 (with the patch for SolrQueryParser.java to not do the 
 getLuceneVersion() calls, but use luceneMatchVersion directly).

 We're running in a Tomcat instance, 64 bit Java.  CATALINA_OPTS are: 
 -Xmx7168m -Xms7168m -XX:MaxPermSize=256M

 We're running 2 Solr cores, with the same schema.

 We use SolrJ to run our searches from a Java app running in JBoss.

 JBoss, Tomcat, and the Solr Index folders are all on the same server.

 In case it's relevant, we're using JMeter as a load test harness.

 We're running on Solaris, a 16 processor box with 48GB physical memory.

 I've run a successful load test at a 100 user load (at that rate there are 
 about 5-10 solr searches / second), and solr search responses were coming in 
 under 100ms.

 When I tried to ramp up, as far as I can tell, Solr is just hanging.  (We 
 have some logging statements around the SolrJ calls - just before, we log how 
 long our query construction takes, then we run the SolrJ query and log the 
 search times.  We're getting a number of the query construction logs, but no 
 corresponding search time logs).

 Symptoms:
 The Tomcat and JBoss processes show as well under 1% CPU, and they are still 
 the top processes.  CPU states show around 99% idle.   RES usage for the two 
 Java processes around 3GB each.  LWP under 120 for each.  STATE just shows as 
 sleep.  JBoss is still 'alive', as I can get into a piece of software that 
 talks to our JBoss app to get data.

 We set things up to use log4j logging for Solr - the log isn't showing any 
 errors or exceptions.

 We're not indexing - just searching.

 Back in January, we did load testing on a prototype, and had no problems 
 (though that was Solr 1.4 at the time).  It ramped up beautifully - bottle 
 necks were our apps, not Solr.  What I'm benchmarking now is a descendent of 
 that prototyping - a bit more complex on searches and more fields in the 
 schema, but same basic search logic as far as SolrJ usage.

 Any ideas?  What else to look at?  Ringing any bells?

 I can send more details if anyone wants specifics...

 Bob Sandiford | Lead Software Engineer | SirsiDynix
 P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
 www.sirsidynix.comhttp://www.sirsidynix.com/




RE: Solr just 'hangs' under load test - ideas?

2011-06-29 Thread Bob Sandiford
OK - I figured it out.  It's not solr at all (and I'm not really surprised).

In the prototype benchmarks, we used a different instance of tomcat than we're 
using for production load tests.  Our prototype tomcat instance had no 
maxThreads value set, so was using the default value of 200.  The production 
tomcat environment has a maxThreads value of 15 - we were just running out of 
threads and getting connection refused exceptions thrown when we ramped up the 
Solr hits past a certain level.

Thanks for considering, Yonik (and any others waiting to see any reply I 
made)...

(As others have said - this listserv is great!)

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: Wednesday, June 29, 2011 12:18 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr just 'hangs' under load test - ideas?
 
 Can you get a thread dump to see what is hanging?
 
 -Yonik
 http://www.lucidimagination.com
 
 On Wed, Jun 29, 2011 at 11:45 AM, Bob Sandiford
 bob.sandif...@sirsidynix.com wrote:
  Hi, all.
 
  I'm hoping someone has some thoughts here.
 
  We're running Solr 3.1 (with the patch for SolrQueryParser.java to
 not do the getLuceneVersion() calls, but use luceneMatchVersion
 directly).
 
  We're running in a Tomcat instance, 64 bit Java.  CATALINA_OPTS are:
 -Xmx7168m -Xms7168m -XX:MaxPermSize=256M
 
  We're running 2 Solr cores, with the same schema.
 
  We use SolrJ to run our searches from a Java app running in JBoss.
 
  JBoss, Tomcat, and the Solr Index folders are all on the same server.
 
  In case it's relevant, we're using JMeter as a load test harness.
 
  We're running on Solaris, a 16 processor box with 48GB physical
 memory.
 
  I've run a successful load test at a 100 user load (at that rate
 there are about 5-10 solr searches / second), and solr search responses
 were coming in under 100ms.
 
  When I tried to ramp up, as far as I can tell, Solr is just hanging.
  (We have some logging statements around the SolrJ calls - just before,
 we log how long our query construction takes, then we run the SolrJ
 query and log the search times.  We're getting a number of the query
 construction logs, but no corresponding search time logs).
 
  Symptoms:
  The Tomcat and JBoss processes show as well under 1% CPU, and they
 are still the top processes.  CPU states show around 99% idle.   RES
 usage for the two Java processes around 3GB each.  LWP under 120 for
 each.  STATE just shows as sleep.  JBoss is still 'alive', as I can get
 into a piece of software that talks to our JBoss app to get data.
 
  We set things up to use log4j logging for Solr - the log isn't
 showing any errors or exceptions.
 
  We're not indexing - just searching.
 
  Back in January, we did load testing on a prototype, and had no
 problems (though that was Solr 1.4 at the time).  It ramped up
 beautifully - bottle necks were our apps, not Solr.  What I'm
 benchmarking now is a descendent of that prototyping - a bit more
 complex on searches and more fields in the schema, but same basic
 search logic as far as SolrJ usage.
 
  Any ideas?  What else to look at?  Ringing any bells?
 
  I can send more details if anyone wants specifics...
 
  Bob Sandiford | Lead Software Engineer | SirsiDynix
  P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
  www.sirsidynix.comhttp://www.sirsidynix.com/
 
 




Re: Ideas on how to implement sponsored results

2008-06-04 Thread Alexander Ramos Jardim
Cuong,

I think you will need some manipulation beyond Solr queries. You should
separate the results by your site criteria after retrieving them. After
that, you could cache the results in your application and randomize the
lists every time you render a page.

I don't know if Solr has collapsing capabilities, but if it has any beyond
faceting, it would be a great boost to your work.

2008/6/3 climbingrose [EMAIL PROTECTED]:

 Hi Alexander,

 Thanks for your suggestion. I think my problem is a bit different from
 yours. We don't have any sponsored words but we have to retrieve sponsored
 results directly from the index. This is because a site can have 60,000
 products which is hard to insert/update keywords. I can live with that by
 issuing a separate query to fetch sponsored results. My problem is to
 equally distribute sponsored results between sites so that each site will
 have an opportunity to show their sponsored results no matter how many
 products they have. For example, if site A has 6 products, site B has
 only 2000 then sponsored products from site B will have a very small chance
 to be displayed.


 On Wed, Jun 4, 2008 at 2:56 AM, Alexander Ramos Jardim 
 [EMAIL PROTECTED] wrote:

  Cuong,
 
  I have implemented sponsored words for a client. I don't know if my
 working
  can help you but I will expose it and let you decide.
 
  I have an index containing products entries that I created a field called
  sponsored words. What I do is to boost this field , so when these words
 are
  matched in the query that products appear first on my result.
 
  2008/6/3 climbingrose [EMAIL PROTECTED]:
 
   Hi all,
  
   I'm trying to implement sponsored results in Solr search results
  similar
   to that of Google. We index products from various sites and would like
 to
   allow certain sites to promote their products. My approach is to query
 a
   slave instance to get sponsored results for user queries in addition to
  the
   normal search results. This part is easy. However, since the number of
   products indexed for each sites can be very different (100, 1000, 1
  or
   6 products), we need a way to fairly distribute the sponsored
 results
   among sites.
  
   My initial thought is utilising field collapsing patch to collapse the
   search results on siteId field. You can imagine that this will create a
   series of buckets of results, each bucket representing results from a
   site. After that, 2 or 3 buckets will randomly be selected from which I
   will
   randomly select one or two results from. However, since I want these
   sponsored results to be relevant to user queries, I'd like only want to
   have
   the first 30 results in each buckets.
  
   Obviously, it's desirable that if the user refreshes the page, new
   sponsored
   results will be displayed. On the other hand, I also want to have the
   advantages of Solr cache.
  
   What would be the best way to implement this functionality? Thanks.
  
   Cheers,
   Cuong
  
 
 
 
  --
  Alexander Ramos Jardim
 



 --
 Regards,

 Cuong Hoang




-- 
Alexander Ramos Jardim


Ideas on how to implement sponsored results

2008-06-03 Thread climbingrose
Hi all,

I'm trying to implement sponsored results in Solr search results similar
to that of Google. We index products from various sites and would like to
allow certain sites to promote their products. My approach is to query a
slave instance to get sponsored results for user queries in addition to the
normal search results. This part is easy. However, since the number of
products indexed for each site can be very different (100, 1,000, 10,000 or
60,000 products), we need a way to fairly distribute the sponsored results
among sites.

My initial thought is utilising the field collapsing patch to collapse the
search results on the siteId field. You can imagine that this will create a
series of buckets of results, each bucket representing results from a
site. After that, 2 or 3 buckets will randomly be selected, from which I will
randomly select one or two results. However, since I want these
sponsored results to be relevant to user queries, I'd only want to have
the first 30 results in each bucket.
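
A plain-Java sketch of that bucket selection (all names are illustrative, not
an existing API; the per-site hit lists would come from the collapsed search
results, already sorted by relevance):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class SponsoredPicker {
    public static List<String> pick(Map<String, List<String>> hitsBySite,
                                    int nSites, Random rnd) {
        List<String> siteIds = new ArrayList<>(hitsBySite.keySet());
        Collections.shuffle(siteIds, rnd);  // every site gets an equal shot
        List<String> picked = new ArrayList<>();
        for (String site : siteIds.subList(0, Math.min(nSites, siteIds.size()))) {
            List<String> hits = hitsBySite.get(site);
            if (hits.isEmpty()) continue;
            int window = Math.min(30, hits.size());  // only the 30 most relevant
            picked.add(hits.get(rnd.nextInt(window)));
        }
        return picked;
    }
}

A fresh Random per page render gives the refresh-the-page variation, while the
underlying collapsed query stays cacheable by Solr.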

Obviously, it's desirable that if the user refreshes the page, new sponsored
results will be displayed. On the other hand, I also want to have the
advantages of Solr cache.

What would be the best way to implement this functionality? Thanks.

Cheers,
Cuong


Re: Ideas on how to implement sponsored results

2008-06-03 Thread Alexander Ramos Jardim
Cuong,

I have implemented sponsored words for a client. I don't know if my work
can help you but I will expose it and let you decide.

I have an index containing product entries in which I created a field called
sponsored words. What I do is boost this field, so when these words are
matched in the query those products appear first in my results.

2008/6/3 climbingrose [EMAIL PROTECTED]:

 Hi all,

 I'm trying to implement sponsored results in Solr search results similar
 to that of Google. We index products from various sites and would like to
 allow certain sites to promote their products. My approach is to query a
 slave instance to get sponsored results for user queries in addition to the
 normal search results. This part is easy. However, since the number of
 products indexed for each sites can be very different (100, 1000, 1 or
 6 products), we need a way to fairly distribute the sponsored results
 among sites.

 My initial thought is utilising field collapsing patch to collapse the
 search results on siteId field. You can imagine that this will create a
 series of buckets of results, each bucket representing results from a
 site. After that, 2 or 3 buckets will randomly be selected from which I
 will
 randomly select one or two results from. However, since I want these
 sponsored results to be relevant to user queries, I'd like only want to
 have
 the first 30 results in each buckets.

 Obviously, it's desirable that if the user refreshes the page, new
 sponsored
 results will be displayed. On the other hand, I also want to have the
 advantages of Solr cache.

 What would be the best way to implement this functionality? Thanks.

 Cheers,
 Cuong




-- 
Alexander Ramos Jardim


Re: Ideas on how to implement sponsored results

2008-06-03 Thread climbingrose
Hi Alexander,

Thanks for your suggestion. I think my problem is a bit different from
yours. We don't have any sponsored words but we have to retrieve sponsored
results directly from the index. This is because a site can have 60,000
products, which makes it hard to insert/update keywords. I can live with that by
issuing a separate query to fetch sponsored results. My problem is to
equally distribute sponsored results between sites so that each site will
have an opportunity to show their sponsored results no matter how many
products they have. For example, if site A has 60,000 products and site B has
only 2000, then sponsored products from site B will have a very small chance
of being displayed.


On Wed, Jun 4, 2008 at 2:56 AM, Alexander Ramos Jardim 
[EMAIL PROTECTED] wrote:

 Cuong,

 I have implemented sponsored words for a client. I don't know if my working
 can help you but I will expose it and let you decide.

 I have an index containing products entries that I created a field called
 sponsored words. What I do is to boost this field , so when these words are
 matched in the query that products appear first on my result.

 2008/6/3 climbingrose [EMAIL PROTECTED]:

  Hi all,
 
  I'm trying to implement sponsored results in Solr search results
 similar
  to that of Google. We index products from various sites and would like to
  allow certain sites to promote their products. My approach is to query a
  slave instance to get sponsored results for user queries in addition to
 the
  normal search results. This part is easy. However, since the number of
  products indexed for each sites can be very different (100, 1000, 1
 or
  6 products), we need a way to fairly distribute the sponsored results
  among sites.
 
  My initial thought is utilising field collapsing patch to collapse the
  search results on siteId field. You can imagine that this will create a
  series of buckets of results, each bucket representing results from a
  site. After that, 2 or 3 buckets will randomly be selected from which I
  will
  randomly select one or two results from. However, since I want these
  sponsored results to be relevant to user queries, I'd like only want to
  have
  the first 30 results in each buckets.
 
  Obviously, it's desirable that if the user refreshes the page, new
  sponsored
  results will be displayed. On the other hand, I also want to have the
  advantages of Solr cache.
 
  What would be the best way to implement this functionality? Thanks.
 
  Cheers,
  Cuong
 



 --
 Alexander Ramos Jardim




-- 
Regards,

Cuong Hoang


JSON tokenizer? tagging ideas

2008-01-25 Thread Ryan McKinley
I've been struggling with how to get various bits of structured data 
into solr documents.  In various projects I have tried various ideas, 
but none feel great.


Take a simple example where I want a document field to be the list of 
linked data with name, ID, and path.  I have tried things like:


<doc>
  <field name="id">ID</field>
  <field name="link">IDA nameA pathA</field>
  <field name="link">IDB nameB pathB</field>
  <field name="link">IDC nameC pathC</field>
</doc>

this is ok -- when spaces are a problem, i've tokenized on \n -- but 
this feels very brittle.


I'm considering a general JSON tokenizer and want to know what you all 
think.  Consider:

<doc>
  <field name="id">ID</field>
  <field name="link">{ "id":10, "name":"nameA", "path":"/..." }</field>
  <field name="link">{ "id":11, "name":"nameB", "path":"/..." }</field>
  <field name="link">{ "id":12, "name":"nameB", "path":"/..." }</field>
</doc>

The tokenizer can make a token for each key:value pair, that is:
 id:10, name:nameA, path:/..., id:11, ...
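
A rough illustration (not noggit, and not a real Lucene Tokenizer) of the
token stream such a tokenizer might emit, one key:value token per pair:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonPairTokens {
    // crude pattern for flat {"key":value} pairs; a real version would sit
    // on a streaming JSON parser such as noggit
    private static final Pattern PAIR =
            Pattern.compile("\"?(\\w+)\"?\\s*:\\s*\"?([^\",}]+)\"?");

    public static List<String> tokens(String json) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(json);
        while (m.find()) {
            out.add(m.group(1) + ":" + m.group(2));  // e.g. id:10, name:nameA
        }
        return out;
    }
}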

Perhaps this could be part of the general 'tag' design:
http://wiki.apache.org/solr/UserTagDesign

Rather than having fixed prefixes like ~erik#lucene, we could use JSON syntax:
 { "user":"erik", "text":"lucene", "date":"20071112" }

Using noggit (http://svn.apache.org/repos/asf/labs/noggit/) the JSON 
parsing is super fast.  The prefix queries are probably slower with a 
longer string, but I guess you could just use:

 { "u":"erik", "t":"lucene", "d":"20071112" }

Thoughts?

ryan


Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Kevin Holmes
I inherited an existing (working) solr indexing script that runs like
this:

Python script queries the mysql DB then calls bash script
Bash script performs a curl POST submit to solr

We're injecting about 1000 records / minute (constantly), frequently
pushing the edge of our CPU / RAM limitations.

I'm in the process of building a Perl script to use DBI and
lwp::simple::post that will perform this all from a single script
(instead of 3).

Two specific questions

1: Does anyone have a clever (or better) way to perform this process
efficiently?

2: Is there a way to inject into solr without using POST / curl / http?

Admittedly, I'm no solr expert - I'm starting from someone else's setup,
trying to reverse-engineer my way out.  Any input would be greatly
appreciated.



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Clay Webster
Condensing the loader into a single executable sounds right if
you have performance problems. ;-)

You could also try adding multiple docs in a single post if you
notice your problems are with tcp setup time, though if you're
doing localhost connections that should be minimal.
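
A small SolrJ sketch of the multiple-docs-per-post idea (the thread itself
posts with curl; SolrJ, the URL and the batch size of 100 here are assumptions
for illustration):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchLoader {
    public static void load(List<SolrInputDocument> docs) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/core1").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (SolrInputDocument doc : docs) {
                batch.add(doc);
                if (batch.size() == 100) {  // one HTTP POST per 100 docs
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) solr.add(batch);
            solr.commit();  // commit once at the end, not per batch
        }
    }
}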

If you're already local to the solr server, you might check out the
CSV slurper. http://wiki.apache.org/solr/UpdateCSV  It's a little
specialized.

And then there's of course the question of are you doing full
re-indexing or incremental indexing of changes?

--cw




RE: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread David Whalen
What we're looking for is a way to inject *without* using
curl, or wget, or any other http-based communication.  We'd
like for the HTTP daemon to only handle search requests, not
indexing requests on top of them.

Plus, I have to believe there's a faster way to get documents
into solr/lucene than using curl

_
david whalen
senior applications developer
eNR Services, Inc.
[EMAIL PROTECTED]
203-849-7240
  



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Tobin Cataldo
(re)building the index separately (i.e., on a different computer) and then 
replacing the active index may be an option.


David Whalen wrote:

What we're looking for is a way to inject *without* using
curl, or wget, or any other http-based communication.  We'd
like for the HTTP daemon to only handle search requests, not
indexing requests on top of them.

Plus, I have to believe there's a faster way to get documents
into solr/lucene than using curl

_
david whalen
senior applications developer
eNR Services, Inc.
[EMAIL PROTECTED]
203-849-7240
  

  



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Brian Whitman


On Aug 9, 2007, at 11:12 AM, Kevin Holmes wrote:

 2: Is there a way to inject into solr without using POST / curl / http?

Check http://wiki.apache.org/solr/EmbeddedSolr

There are examples in Java and Cocoa that use the DirectSolrConnection  
class, querying and updating Solr without a web server. It uses JNI in  
the Cocoa case.

-b
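
For illustration, a minimal sketch of the embedded approach Brian
points to. The DirectSolrConnection class comes from the EmbeddedSolr
wiki page; the constructor arguments and request() signature below are
assumptions from that era's API, and the paths are hypothetical:

  // Update Solr in-process, with no HTTP daemon involved.
  import org.apache.solr.servlet.DirectSolrConnection;

  public class EmbeddedUpdateSketch {
      public static void main(String[] args) throws Exception {
          // hypothetical solr home and data directories
          DirectSolrConnection solr =
              new DirectSolrConnection("/opt/solr/home", "/opt/solr/data");
          String xml = "<add><doc><field name=\"id\">42</field></doc></add>";
          System.out.println(solr.request("/update", xml));
          solr.request("/update", "<commit/>");
      }
  }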



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Clay Webster
If it's contention between search and indexing, separate them
via a query-slave and an index-master.

--cw

On 8/9/07, David Whalen [EMAIL PROTECTED] wrote:

 What we're looking for is a way to inject *without* using
 curl, or wget, or any other http-based communication.  We'd
 like for the HTTP daemon to only handle search requests, not
 indexing requests on top of them.

 Plus, I have to believe there's a faster way to get documents
 into solr/lucene than using curl

 _
 david whalen
 senior applications developer
 eNR Services, Inc.
 [EMAIL PROTECTED]
 203-849-7240




Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Yonik Seeley
On 8/9/07, David Whalen [EMAIL PROTECTED] wrote:
 Plus, I have to believe there's a faster way to get documents
 into solr/lucene than using curl

One issue with HTTP is latency.  You can get around that by adding
multiple documents per request, or by using multiple threads
concurrently.

You can also bypass HTTP by using something like the CSV loader (very
lightweight) and specifying a local file (via the stream.file parameter).
http://wiki.apache.org/solr/UpdateCSV
I doubt you will see much of a difference between reading locally vs
streaming over HTTP, but it might be interesting to see the exact
overhead.

-Yonik
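
To make the stream.file idea concrete: a request along these lines
(the handler path follows the UpdateCSV wiki page, the file path is
hypothetical, and remote streaming has to be enabled in
solrconfig.xml) asks Solr to read the CSV straight off its local disk
instead of receiving it in the request body:

  http://localhost:8983/solr/update/csv?stream.file=/tmp/records.csv&commit=true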


Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Yonik Seeley
On 8/9/07, Siegfried Goeschl [EMAIL PROTECTED] wrote:
 +) my colleague just finished a database import service running within
 the servlet container to avoid writing out the data to the file system
 and transmitting it over HTTP.

Most people doing this read data out of the database and construct the
XML in-memory for sending to Solr... one definitely doesn't want to
write intermediate stuff to the filesystem (unless perhaps it's a CSV
dump).

 +) I think there were some discussion regarding a generic database
 importer but nothing I'm aware of

Absolutely a needed feature... it's in the queue:
https://issues.apache.org/jira/browse/SOLR-103

But there will always be more complex cases, pulling from multiple
data sources, doing some merging and munging, etc.  The easiest way to
handle many of those would probably be via a scripting language that
does the app-specific merging+munging and then uses a Solr client
(which constructs in-memory CSV or XML and sends to Solr).

-Yonik


RE: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Kevin Holmes
Is this a native feature, or do we need to get creative with scp from
one server to the other?


If it's a contention between search and indexing, separate  them
via a query-slave and an index-master.

--cw


Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Yonik Seeley
On 8/9/07, Kevin Holmes [EMAIL PROTECTED] wrote:
 Python script queries the mysql DB then calls bash script

 Bash script performs a curl POST submit to solr

For the most up-to-date solr client for python, check out
https://issues.apache.org/jira/browse/SOLR-216

-Yonik


RE: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Lance Norskog
Jython is a Python interpreter implemented in Java. (I have a lot of Python
code.)

Total throughput in the servlet is very sensitive to the total number of
servlet sockets available vs. the number of CPUs.

The different analysers have very different performance.

You might leave some data in the DB, instead of storing it all in the index.

Underlying this all, you have a sneaky network performance problem. Your
successive posts do not reuse a TCP socket. Obvious: re-opening a new socket
for each post takes time. Not obvious: your server has sockets building up in
TIME_WAIT state.  (This means the sockets are shutting down. Having both
ends agree to close the connection is metaphysically difficult. The TCP/IP
spec even has a bug in this area.) Sockets building up can cause TCP resources
to run low or run out. Your kernel configuration may be weak in this
area.

Lance
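
A sketch that combines the advice in this thread: batch many records
into each POST so the socket is reused instead of piling up in
TIME_WAIT. The URL and batch size are assumptions;
java.net.HttpURLConnection keeps connections alive as long as each
response body is fully drained:

  import java.io.InputStream;
  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class BatchPostSketch {
      static void post(String xml) throws Exception {
          URL url = new URL("http://localhost:8983/solr/update");
          HttpURLConnection con = (HttpURLConnection) url.openConnection();
          con.setRequestMethod("POST");
          con.setDoOutput(true);
          con.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
          OutputStream out = con.getOutputStream();
          out.write(xml.getBytes("UTF-8"));
          out.close();
          InputStream in = con.getInputStream();
          while (in.read() != -1) { /* drain so the socket can be reused */ }
          in.close();
      }

      public static void main(String[] args) throws Exception {
          StringBuilder batch = new StringBuilder("<add>");
          for (int i = 0; i < 1000; i++) { // roughly one minute of records
              batch.append("<doc><field name=\"id\">")
                   .append(i).append("</field></doc>");
          }
          batch.append("</add>");
          post(batch.toString());
          post("<commit/>");
      }
  }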




Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Norberto Meijome
On Thu, 9 Aug 2007 15:23:03 -0700
Lance Norskog [EMAIL PROTECTED] wrote:

 Underlying this all, you have a sneaky network performance problem. Your
 successive posts do not reuse a TCP socket. Obvious: re-opening a new socket
 for each post takes time. Not obvious: your server has sockets building up in
 TIME_WAIT state.  (This means the sockets are shutting down. Having both
 ends agree to close the connection is metaphysically difficult. The TCP/IP
 spec even has a bug in this area.) Sockets building up can cause TCP resources
 to run low or run out. Your kernel configuration may be weak in this area.

Good point. And, putting my pedantic hat on here, it may not necessarily be 
'kernel configuration' but the network stack - not sure what OS the OP is using.
B
_
{Beto|Norberto|Numard} Meijome

All parts should go together without forcing. You must remember that the parts 
you are reassembling were disassembled by you.
 Therefore, if you can't get them together again, there must be a reason. 
 By all means, do not use hammer.
   IBM maintenance manual, 1975

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-30 Thread Daniel Einspanjer

On 4/11/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: Not really.  The explain scores aren't normalized and I also couldn't
: find a way to get the explain data as anything other than a whitespace
: formatted text blob from Solr.  Keep in mind that they need confidence

the default way Solr dumps score explanations is just as plain text, but
the Explanation objects are actually fairly well structured, and easy to
walk in a custom request handler -- this would let you make direct
comparisons of the various pieces of the Explanations from doc 1 with doc
2 if you wanted.


Does anyone have any experience with examining Explanation objects in
a custom request handler?

I started this project using Solr on top of Lucene because I wanted
the flexibility it provided: the ability to have dynamic field names
so the user could configure what fields they wanted to index and how
they wanted them to be indexed (using field type configurations good
for titles or for person names or for years, etc.).

What I quickly found I could do without, though, was the HTTP overhead.
I implemented the EmbeddedSolr class found on the Solr wiki, which let
me interact with the Solr engine directly. This is important since I'm
doing thousands of queries in a batch.

I need to find out about this custom request handler thing. If anyone
has any example code, it would be greatly appreciated.

Daniel
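
For illustration, a rough sketch of walking Explanation objects the
way Hoss suggests, against the Lucene API of that era
(IndexSearcher.explain(Query, int) returns an Explanation, and
getDetails() exposes the nested pieces); the request handler plumbing
is omitted:

  import org.apache.lucene.search.Explanation;

  public class ExplainWalker {
      // Print value/description pairs instead of the flat text blob.
      static void walk(Explanation e, int depth) {
          for (int i = 0; i < depth; i++) System.out.print("  ");
          System.out.println(e.getValue() + " : " + e.getDescription());
          Explanation[] details = e.getDetails();
          if (details != null) {
              for (Explanation d : details) walk(d, depth + 1);
          }
      }

      public static void main(String[] args) {
          // Stand-in tree; in a handler you'd use searcher.explain(query, docId).
          Explanation root = new Explanation(0.8f, "product of:");
          root.addDetail(new Explanation(0.5f, "tf"));
          root.addDetail(new Explanation(1.6f, "idf"));
          walk(root, 0);
      }
  }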


Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-09 Thread Sean Timm

Yes, for good (hopefully) or bad.

-Sean

Shridhar Venkatraman wrote on 5/7/2007, 12:37 AM:

 Interesting..
 Surrogates can also bring the searcher's subjectivity (opinion and
 context) into it by the learning process?
 shridhar
  

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-06 Thread Shridhar Venkatraman

Interesting..
Surrogates can also bring the searcher's subjectivity (opinion and
context) into it by the learning process?
shridhar


Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-05 Thread Daniel Einspanjer

On 4/11/07, Chris Hostetter [EMAIL PROTECTED] wrote:



A custom Similarity class with simplified tf, idf, and queryNorm functions
might also help you get scores from the Explain method that are more
easily manageable since you'll have predictable query structures hard
coded into your application.

i.e.: run the large query once, get the results back, and for each result
look at the explanation and pull out the individual pieces of the
explanation and compare them with those of the other matches to create
your own normalization.
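
A minimal cut at the simplified Similarity described above, written
against the Lucene 2.x API of that era (tf, idf, and queryNorm were
overridable on DefaultSimilarity; treat the exact signatures as
assumptions for your version). Flattening tf and idf makes the
per-clause numbers in the Explain output far easier to compare:

  import org.apache.lucene.search.DefaultSimilarity;

  public class FlatSimilarity extends DefaultSimilarity {
      public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; } // ignore repeats
      public float idf(int docFreq, int numDocs) { return 1.0f; }    // ignore corpus stats
      public float queryNorm(float sumOfSquaredWeights) { return 1.0f; } // no rescaling
  }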



Chuck Williams mentioned a proposal he had for normalization of scores that
would give a constant score range that would allow comparison of scores.
Chuck, did you ever write any code to that end or was it just algorithmic
discussion?

Here is the point I'm at now:

I have my matching engine working.  The fields to be indexed and the queries
are defined by the user.  Hoss, I'm not sure how that affects your idea of
having a custom Similarity class since you mentioned that having predictable
query structures was important...
The user kicks off an indexing then defines the queries they want to try
matching with.  Here is an example of the query fragments I'm working with
right now:
year_str:"${Year}"^2 year_str:[${Year -1} TO ${Year +1}]
title_title_mv:"${Title}"^10 title_title_mv:${Title}^2
+(title_title_mv:"${Title}"~^5 title_title_mv:${Title}~)
director_name_mv:"${Director}"~2^10 director_name_mv:${Director}^5
director_name_mv:${Director}~.7

For each item in the source feed, the variables are interpolated (the query
term is transformed into a grouped term if there are multiple values for a
variable). That query is then run to find the overall best match.
I then determine the relevance for each query fragment.  I haven't written
any plugins for Lucene yet, so my current method of determining the
relevance is by running each query fragment by itself then iterating through
the results looking to see if the overall best match is in this result set.
If it is, I record the rank and multiply that rank (e.g. 5 out of 10) by a
configured fragment weight.
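
One way to read that rank-based scheme as a normalized score in [0,1]
(the shape of the formula is my interpretation, and the weights are
hypothetical): each fragment contributes weight * (n - rank + 1) / n
when the overall best match shows up at position rank out of n, and 0
when it is absent.

  public class FragmentRankScore {
      static double score(int[] ranks, int[] resultCounts, double[] weights) {
          double total = 0, weightSum = 0;
          for (int i = 0; i < ranks.length; i++) {
              weightSum += weights[i];
              if (ranks[i] > 0) { // 0 = best match absent from this fragment
                  total += weights[i]
                         * (resultCounts[i] - ranks[i] + 1)
                         / (double) resultCounts[i];
              }
          }
          return weightSum == 0 ? 0 : total / weightSum; // 1.0 = top hit everywhere
      }

      public static void main(String[] args) {
          // e.g. rank 5 of 10 on the title fragment, rank 1 of 10 on director
          System.out.println(score(new int[]{5, 1}, new int[]{10, 10},
                                   new double[]{2.0, 1.0})); // ~0.733
      }
  }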

Since the scores aren't normalized, I have no good way of determining a poor
overall match from a really high quality one. The overall item could be the
first item returned in each of the query fragments.

Any help here would be very appreciated. Ideally, I'm hoping that maybe
Chuck has a patch or plugin that I could use to normalize my scores such
that I could let the user do a matching run, look at the results and
determine what score threshold to set for subsequent runs.

Thanks,
Daniel


Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-05 Thread Sean Timm

It may not be easy or even possible without major changes, but having
global collection statistics would allow scores to be compared across
searchers. To do this, the master indexes would need to be able to
communicate with each other.

Another approach to merging across searchers is described here:
Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, Greg Pass, Ophir
Frieder, "Surrogate Scoring for Improved Metasearch Precision",
Proceedings of the 2005 ACM Conference on Research and Development in
Information Retrieval (SIGIR-2005), Salvador, Brazil, August 2005.

-Sean
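
To make the problem concrete: classic Lucene computes idf from each
index's own statistics, so with N the per-index document count and
df_t the per-index document frequency of term t,

  \mathrm{idf}(t) = 1 + \log\frac{N}{df_t + 1}

the same term gets a different weight on every searcher unless N and
df_t are shared globally, which is exactly the change Sean is
describing.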


Any Parm Substitution Ideas...

2007-04-10 Thread Jim Dow
I really like the flexibility of naming request handlers to append general 
constraints / filters.

Has anyone spun thoughts around something like a solr.ParmSubstHandler, or any 
way to pass maybe a special parameter like:
ps=0:discussions; ps=1:images; ps=2:false


<requestHandler name="partitioned" class="solr.ParmSubstHandler">
  <lst name="defaults">
    ...
  </lst>

  <lst name="appends">
    <str name="fq">category:[0]</str>
    <str name="fq">category:[1]</str>
    <str name="fq">isadmin:[2]</str>
  </lst>
  ...
</requestHandler>
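
As I read the proposal, a request such as the following would
substitute the positional ps values into the [0], [1], and [2] slots
before the appends are applied (the expansion shown is my guess at the
intended semantics):

  /solr/select?qt=partitioned&q=solr&ps=0:discussions&ps=1:images&ps=2:false

  effective filters:
  fq=category:discussions
  fq=category:images
  fq=isadmin:false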

This may be inappropriate for building into SOLR; I'm not sure, but I'm looking 
at techniques to round out the appends to be even more flexible.

If there is interest and it makes sense to a wider audience, maybe I should try 
my hand at it.

Thanks...Jim Dow.




Re: Any Parm Substitution Ideas...

2007-04-10 Thread Chris Hostetter

I'm not certain that i understand exactly what you are describing, but
there was some discussion a while back that may be similar...

http://issues.apache.org/jira/browse/SOLR-109

...there's not a lot in the issue itself, but the linked discussion may be
fruitful for you.

if what you are describing is the same thing then i certainly think it
would be a handy addition to SolrQueryParser and the core request
handlers.

: Has anyone spun thoughts around something like a solr.ParmSubstHandler or 
any way to pass maybe a special
: ps=0:discussions; ps=1:images; ps=2:false




-Hoss