Authentication and distributed search in 7.2.1

2018-02-28 Thread Peter Sturge
Hi,
In 7.2.1 there's the authentication module and associated security.json
file, which works well for single cores. (Note: standalone mode, no
SolrCloud)
It doesn't appear to work with distributed searches, including multi-shard
local searches, e.g. shards=localhost:8983/solr/core1,localhost:8983/solr/core2.

Even when shards is just a single core (shards=localhost:8983/solr/core1),
if the base search goes to a different core (e.g.
http://localhost:8983/solr/somecore/select?shards=localhost:8983/solr/core1..),
no error is returned and no results come back: status=0, numFound=0.
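
For reference, the queries are being issued roughly as below from SolrJ - a
minimal sketch where the core names and credentials are placeholders, and the
per-request credentials follow the Ref Guide's 'basic auth with SolrJ' approach:

// Minimal sketch (placeholder core names/credentials): a sharded query with
// per-request Basic Auth credentials attached to the request object.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedAuthQuery {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/somecore").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.set("shards", "localhost:8983/solr/core1,localhost:8983/solr/core2");

      QueryRequest req = new QueryRequest(q);
      req.setBasicAuthCredentials("solr", "SolrRocks");   // placeholder credentials

      QueryResponse rsp = req.process(client);
      System.out.println("numFound=" + rsp.getResults().getNumFound());
    }
  }
}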

Can anyone please confirm if Solr 7 authentication does/doesn't support
distributed/sharded searches?

Many thanks,
Peter


Re: security authentication API via solrj?

2018-02-26 Thread Peter Sturge
Hi,

Thanks for your response.
I've done this using the 'raw' REST style, as I'm not familiar enough with
the new solrj client.
It would be quite nice to have a native solrj class for handling security
management operations (add/delete users, roles etc.), kind of like the
CoreAdmin/CollectionAdmin/ConfigSet admin classes.
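
For anyone interested, the 'raw' call is along these lines - a minimal sketch
using plain HttpURLConnection, where the host and the user/password values are
placeholders and the set-user command is the one documented for the Basic Auth
plugin:

// Minimal sketch (placeholder host/credentials): add or update a user via the
// Basic Auth plugin's /admin/authentication endpoint using plain HTTP.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AddSolrUser {
  public static void main(String[] args) throws Exception {
    String json = "{\"set-user\": {\"newuser\":\"newpassword\"}}";
    URL url = new URL("http://localhost:8983/solr/admin/authentication");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    String creds = Base64.getEncoder()
        .encodeToString("solr:SolrRocks".getBytes(StandardCharsets.UTF_8));
    conn.setRequestProperty("Authorization", "Basic " + creds);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(json.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP " + conn.getResponseCode());   // 200 on success
  }
}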

Thanks,
Peter



On Mon, Feb 26, 2018 at 1:13 AM, Shawn Heisey <elyog...@elyograg.org> wrote:

> On 2/25/2018 1:28 PM, Peter Sturge wrote:
>
>> I was wondering if 7.2.1 solrj had native support for the
>> security/authentication endpoint? I couldn't find anything in the docs
>> about it, but maybe someone has some experience with it?
>> Note: This is about adding/deleting users on a solr instance using solrj,
>> not authenticating (that is well documented).
>>
>
> At first I was looking for how to use authentication with SolrJ. I came up
> with this:
>
> 
> Looks like this is not available when using the sugar objects like
> SolrQuery.  To use authentication, it seems you have to create the request
> objects yourself.
>
> https://lucene.apache.org/solr/guide/7_2/basic-authentication-plugin.html#using-basic-auth-with-solrj
> 
>
> Then I noticed you were talking about the actual security endpoint --
> adding users.
>
> I have been looking over the objects available in SolrJ, and I do not see
> anything useful.  It looks like you might need a new request object class,
> implemented similar to DirectXMLRequest, but using JSON and not XML.  It
> might be possible to make it an implicitly defined class rather than
> creating a whole class file.
>
> A proper sugar class for handling the security endpoint should be
> created.  I would do it, but I'm not sure how.
>
> Thanks,
> Shawn
>
>


security authentication API via solrj?

2018-02-25 Thread Peter Sturge
Hi,
I was wondering if 7.2.1 solrj had native support for the
security/authentication endpoint? I couldn't find anything in the docs
about it, but maybe someone has some experience with it?
Note: This is about adding/deleting users on a solr instance using solrj,
not authenticating (that is well documented).
Thanks,
Peter


Re: q.op in 7.2.1 solrconfig.xml

2018-02-21 Thread Peter Sturge
Hi,
Thanks for your reply.

I managed to get it working by specifying it in the requestHandler /select
section in solrconfig.xml:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="q.op">AND</str>
    </lst>
  </requestHandler>

Your other suggestions will also be useful for everyone with a similar
question. Thanks!

Peter



On Wed, Feb 21, 2018 at 11:22 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Peter,
> You should be able to set any param in several places/ways:
> * define in request handler defaults
> * define in params.json and reference using useParams in request handler
> definition
> * use initParams to define default for one or multiple handlers
>
> You can see examples in configs that are part of Solr installation.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 21 Feb 2018, at 23:27, Peter Sturge <peter.stu...@gmail.com> wrote:
> >
> > Hi,
> > I'm going through a major upgrade from 4.6 to 7.2.1 and I can see the
> > defaultOperator has now been removed.
> >
> > The docs mention it's possible to set a default value for the new q.op
> > directive in solrconfig.xml, but it doesn't say how or where.
> >
> > Does anyone have an example of specifying a default q.op parameter?
> Really
> > don't want to have to include it in *every* query, that kinda defeats its
> > purpose...
> >
> > Thanks,
> > Peter
>
>


q.op in 7.2.1 solrconfig.xml

2018-02-21 Thread Peter Sturge
Hi,
I'm going through a major upgrade from 4.6 to 7.2.1 and I can see the
defaultOperator has now been removed.

The docs mention it's possible to set a default value for the new q.op
directive in solrconfig.xml, but it doesn't say how or where.

Does anyone have an example of specifying a default q.op parameter? Really
don't want to have to include it in *every* query, that kinda defeats its
purpose...

Thanks,
Peter


Re: Java profiler?

2017-12-06 Thread Peter Sturge
Hi,
We've been using JProfiler (www.ej-technologies.com) for years now.
Without a doubt, it's the most comprehensive and useful profiler for Java.
It works very well, supports remote profiling and includes some very neat heap
walking/GC profiling.
Peter


On Tue, Dec 5, 2017 at 3:21 PM, Walter Underwood 
wrote:

> Anybody have a favorite profiler to use with Solr? I’ve been asked to look
> at why our queries are slow on a detail level.
>
> Personally, I think they are slow because they are so long, up to 40 terms.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>


Re: MongoDb vs Solr

2017-08-05 Thread Peter Sturge
*And insults are not something I'd like to see in this mailing list, at all*
+1
Everyone is entitled to their opinion..

Solr can and does work extremely well as a database - it depends on your db
requirements.
For distributed/replicated search via REST API that is read heavy, Solr is
a great choice.

If you need joins or stored procedure-like functionality, don't choose any
of the mentioned ones - stick with SQL.

Security-wise, Solr is pretty much like all db access tools - you will need
a robust front-end to keep your data secure.
It's just that with an easy-to-use API like Solr, it's easier to
accidentally 'let it run free'. If you're using Solr for db rather than
search, you will need a secure front-end.

Joy and good will to all, regardless of what tool you choose!

Peter


On Sat, Aug 5, 2017 at 5:08 PM, Walter Underwood 
wrote:

> I read the seven year old slides just now. The Guardian was using Solr to
> deliver the content. Their repository (see slide 38) is an RDBMS.
>
> https://www.slideshare.net/matwall/no-sql-at-the-guardian
>
> In slide 37, part of “Is Solr a database?”, they note “Search index not
> really persistence”. To me, that means “not a database”.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Aug 5, 2017, at 4:59 AM, Dave  wrote:
> >
> > And to add to the conversation, 7 year old blog posts are not a reason
> to make decisions for your tech stack.
> >
> > And insults are not something I'd like to see in this mailing list, at
> all, so please do not repeat any such disrespect or condescending
> statements in your contributions to the mailing list that's supposed to
> serve as a source of help, which, you asked for.
> >
> >> On Aug 5, 2017, at 7:54 AM, Dave  wrote:
> >>
> >> Also I wouldn't really recommend mongodb at all, it should only to be
> used as a fast front end to an acid compliant relational db same with
> memcahed for example. If you're going to stick to open source, as I do, you
> should use the correct tool for the job.
> >>
> >>> On Aug 5, 2017, at 7:32 AM, GW  wrote:
> >>>
> >>> Insults for Walter only.. sorry..
> >>>
>  On 5 August 2017 at 06:28, GW  wrote:
> 
>  For The Guardian, Solr is the new database | Lucidworks
>  https://lucidworks.com/2010/04/29/for-the-guardian-solr-is-the-new-database/
>  Apr 29, 2010 - For The Guardian, *Solr* is the new *database*. I
> blogged
>  a few days ago about how open search source is disrupting the
> relationship
>  between ...
> 
>  You are arrogant and probably lame as a programmer.
> 
>  All offense intended
> 
> > On 5 August 2017 at 06:23, GW  wrote:
> >
> > Watch their videos
> >
> > On 4 August 2017 at 23:26, Walter Underwood 
> > wrote:
> >
> >> MarkLogic can do many-to-many. I worked there six years ago. They
> use
> >> search engine index structure with generational updates, including
> segment
> >> level caches. With locking. Pretty good stuff.
> >>
> >> A many to many relationship is an intersection across posting lists,
> >> with transactions. Straightforward, but not easy to do it fast.
> >>
> >> The “Inside MarkLogic Server” paper does a good job of explaining
> the
> >> guts.
> >>
> >> Now, back to our regularly scheduled Solr presentations.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
> >>> On Aug 4, 2017, at 8:13 PM, David Hastings 
> >> wrote:
> >>>
> >>> Also, id love to see an example of a many to many relationship in a
> >> nosql db as you described, since that's a rdbms concept. If it
> exists in a
> >> nosql environment I would like to learn how...
> >>>
>  On Aug 4, 2017, at 10:56 PM, Dave 
> >> wrote:
> 
>  Uhm. Dude are you drinking?
> 
>  1. Lucidworks would never say that.
>  2. Maria is not a json +MySQL. Maria is a fork of the last open
> >> source version of MySQL before oracle bought them
>  3.walter is 100% correct. Solr is search. The only complex data
> >> structure it has is an array. Something like mongo can do arrays
> hashes
> >> arrays of hashes etc, it's actually json based. But it can't search
> well as
> >> a search engine can.
> 
>  There is no one tool. Use each for their own abilities.
> 
> 
> 

Re: Grouping facets: Possible to get facet results for each Group?

2015-10-15 Thread Peter Sturge
Great - can't wait to try this out! Many thanks for your help on pointing
me towards this new faceting feature.
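
For the archives, the shape I'm planning to try for the original use case
(unique 'user' values within each 'host' group) is something like the
json.facet parameter below - a sketch based on the JSON Facet API links
further down, so the exact syntax may need tweaking:

json.facet={
  hosts: {
    type: terms,
    field: host,
    limit: -1,
    facet: {
      users: { type: terms, field: user, limit: -1 }
    }
  }
}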
Thanks,
Peter


On Thu, Oct 15, 2015 at 10:04 AM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> It will not be an impediment, if you have a flat document with single
> valued field interested, you can use Pivot Facets and apply stats over the
> facets as well.
> Take a look to the modern Json faceting approach Yonik introduced.
> Since I start using it I strongly recommend it, it's amazingly clear to
> define your faceting structure, store it in a file in Json and use it at
> query time !
>
> I am a strong supporter of this approach, it is young but already powerful.
> Pretty sure it will help you.
>
> Cheers
>
> [1] http://yonik.com/json-facet-api/
> [2] http://yonik.com/solr-facet-functions/
> [3] http://yonik.com/solr-subfacets/
>
> On 14 October 2015 at 22:12, Peter Sturge <peter.stu...@gmail.com> wrote:
>
> > Yes, you are right about that - I've used pivots before and they do need
> to
> > be used judiciously.
> > Fortunately, we only ever use single-value fields, as it gives some good
> > advantages in a heavily sharded environment.
> > Our document structure is, by it's very nature always flat, so it could
> be
> > an impediment to nested facets, but I don't know enough about them to
> know
> > for sure.
> > Thanks,
> > Peter
> >
> >
> > On Wed, Oct 14, 2015 at 9:44 AM, Alessandro Benedetti <
> > benedetti.ale...@gmail.com> wrote:
> >
> > > mmm let's say that nested facets are a subset of Pivot Facets.
> > > if pivot faceting works with the classic flat document structure, the
> sub
> > > facet are working with any nested structure.
> > > So be careful about pivot faceting in a flat document with multi valued
> > > fields, because you lose the relation across the different fields
> value.
> > >
> > > Cheers
> > >
> > > On 13 October 2015 at 18:06, Peter Sturge <peter.stu...@gmail.com>
> > wrote:
> > >
> > > > Hi,
> > > > Thanks for your response.
> > > > I did have a look at pivots, and they could work in a way. We're
> still
> > on
> > > > Solr 4.3, so I'll have to wait for sub-facets - but they sure look
> > pretty
> > > > cool!
> > > > Peter
> > > >
> > > >
> > > > On Tue, Oct 13, 2015 at 12:30 PM, Alessandro Benedetti <
> > > > benedetti.ale...@gmail.com> wrote:
> > > >
> > > > > Can you model your business domain with Solr nested Docs ? In the
> > case
> > > > you
> > > > > can use Yonik article about nested facets.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On 13 October 2015 at 05:05, Alexandre Rafalovitch <
> > arafa...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Could you use the new nested facets syntax?
> > > > > > http://yonik.com/solr-subfacets/
> > > > > >
> > > > > > Regards,
> > > > > >Alex.
> > > > > > 
> > > > > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > > > > http://www.solr-start.com/
> > > > > >
> > > > > > On 11 October 2015 at 09:51, Peter Sturge <
> peter.stu...@gmail.com>
> > > > > wrote:
> > > > > > > Been trying to coerce Group faceting to give some faceting back
> > for
> > > > > each
> > > > > > > group, but maybe this use case isn't catered for in Grouping? :
> > > > > > >
> > > > > > > So the Use Case is this:
> > > > > > > Let's say I do a grouped search that returns say, 9 distinct
> > > groups,
> > > > > and
> > > > > > in
> > > > > > > these groups are various numbers of unique field values that
> need
> > > > > > faceting
> > > > > > > - but the faceting needs to be within each group:
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > --
> > > > >
> > > > > Benedetti Alessandro
> > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > >
> > > > > "Tyger, tyger burning bright
> > > > > In the forests of the night,
> > > > > What immortal hand or eye
> > > > > Could frame thy fearful symmetry?"
> > > > >
> > > > > William Blake - Songs of Experience -1794 England
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > --
> > >
> > > Benedetti Alessandro
> > > Visiting card - http://about.me/alessandro_benedetti
> > > Blog - http://alexbenedetti.blogspot.co.uk
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: Grouping facets: Possible to get facet results for each Group?

2015-10-14 Thread Peter Sturge
Yes, you are right about that - I've used pivots before and they do need to
be used judiciously.
Fortunately, we only ever use single-value fields, as it gives some good
advantages in a heavily sharded environment.
Our document structure is, by its very nature, always flat, so it could be
an impediment to nested facets, but I don't know enough about them to know
for sure.
Thanks,
Peter


On Wed, Oct 14, 2015 at 9:44 AM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> mmm let's say that nested facets are a subset of Pivot Facets.
> if pivot faceting works with the classic flat document structure, the sub
> facet are working with any nested structure.
> So be careful about pivot faceting in a flat document with multi valued
> fields, because you lose the relation across the different fields value.
>
> Cheers
>
> On 13 October 2015 at 18:06, Peter Sturge <peter.stu...@gmail.com> wrote:
>
> > Hi,
> > Thanks for your response.
> > I did have a look at pivots, and they could work in a way. We're still on
> > Solr 4.3, so I'll have to wait for sub-facets - but they sure look pretty
> > cool!
> > Peter
> >
> >
> > On Tue, Oct 13, 2015 at 12:30 PM, Alessandro Benedetti <
> > benedetti.ale...@gmail.com> wrote:
> >
> > > Can you model your business domain with Solr nested Docs ? In the case
> > you
> > > can use Yonik article about nested facets.
> > >
> > > Cheers
> > >
> > > On 13 October 2015 at 05:05, Alexandre Rafalovitch <arafa...@gmail.com
> >
> > > wrote:
> > >
> > > > Could you use the new nested facets syntax?
> > > > http://yonik.com/solr-subfacets/
> > > >
> > > > Regards,
> > > >Alex.
> > > > 
> > > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > > http://www.solr-start.com/
> > > >
> > > > On 11 October 2015 at 09:51, Peter Sturge <peter.stu...@gmail.com>
> > > wrote:
> > > > > Been trying to coerce Group faceting to give some faceting back for
> > > each
> > > > > group, but maybe this use case isn't catered for in Grouping? :
> > > > >
> > > > > So the Use Case is this:
> > > > > Let's say I do a grouped search that returns say, 9 distinct
> groups,
> > > and
> > > > in
> > > > > these groups are various numbers of unique field values that need
> > > > faceting
> > > > > - but the faceting needs to be within each group:
> > > >
> > >
> > >
> > >
> > > --
> > > --
> > >
> > > Benedetti Alessandro
> > > Visiting card - http://about.me/alessandro_benedetti
> > > Blog - http://alexbenedetti.blogspot.co.uk
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: Grouping facets: Possible to get facet results for each Group?

2015-10-13 Thread Peter Sturge
Hi,
Thanks for your response.
I did have a look at pivots, and they could work in a way. We're still on
Solr 4.3, so I'll have to wait for sub-facets - but they sure look pretty
cool!
Peter


On Tue, Oct 13, 2015 at 12:30 PM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:

> Can you model your business domain with Solr nested Docs ? In the case you
> can use Yonik article about nested facets.
>
> Cheers
>
> On 13 October 2015 at 05:05, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
> > Could you use the new nested facets syntax?
> > http://yonik.com/solr-subfacets/
> >
> > Regards,
> >Alex.
> > 
> > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > http://www.solr-start.com/
> >
> > On 11 October 2015 at 09:51, Peter Sturge <peter.stu...@gmail.com>
> wrote:
> > > Been trying to coerce Group faceting to give some faceting back for
> each
> > > group, but maybe this use case isn't catered for in Grouping? :
> > >
> > > So the Use Case is this:
> > > Let's say I do a grouped search that returns say, 9 distinct groups,
> and
> > in
> > > these groups are various numbers of unique field values that need
> > faceting
> > > - but the faceting needs to be within each group:
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Fwd: Grouping facets: Possible to get facet results for each Group?

2015-10-12 Thread Peter Sturge
Hello Solr Forum,

Been trying to coerce Group faceting to give some faceting back for each
group, but maybe this use case isn't catered for in Grouping? :

So the Use Case is this:
Let's say I do a grouped search that returns say, 9 distinct groups, and in
these groups are various numbers of unique field values that need faceting
- but the faceting needs to be within each group:


user:*&facet=true&facet.field=user&group=true&group.field=host&group.facet=true

This query gives back grouped facets for each 'host' value (i.e. the facet
counts are 'collapsed') - but the facet counts (unique values of 'user'
field) are aggregated for all the groups, not on a 'per-group' basis (i.e.
returned as 'global facets' - outside of the grouped results).
The results from the query above don't say which unique values of 'user'
are in which group. If the number of doc hits is very large (it can
easily be in the hundreds of thousands), it's not practical to iterate through
the docs looking for unique values.
This Use Case necessitates the unique values within each group, rather than
the total doc hits.

Is this possible with grouping, or in conjunction with another module?

Many thanks,
+Peter


Grouping facets: Possible to get facet results for each Group?

2015-10-11 Thread Peter Sturge
Hello Solr Forum,

Been trying to coerce Group faceting to give some faceting back for each
group, but maybe this use case isn't catered for in Grouping? :

So the Use Case is this:
Let's say I do a grouped search that returns say, 9 distinct groups, and in
these groups are various numbers of unique field values that need faceting
- but the faceting needs to be within each group:


user:*&facet=true&facet.field=user&group=true&group.field=host&group.facet=true

This query gives back grouped facets for each 'host' value (i.e. the facet
counts are 'collapsed') - but the facet counts (unique values of 'user'
field) are aggregated for all the groups, not on a 'per-group' basis (i.e.
returned as 'global facets' - outside of the grouped results).
The results from the query above don't say which unique values of 'user'
are in which group. If the number of doc hits is very large (it can
easily be in the hundreds of thousands), it's not practical to iterate through
the docs looking for unique values.
This Use Case necessitates the unique values within each group, rather than
the total doc hits.

Is this possible with grouping, or in conjunction with another module?

Many thanks,
+Peter


Re: Basic Auth (again)

2015-07-23 Thread Peter Sturge
Hi Steve,

We've not yet moved to Solr 5, but we do use Jetty 9. In any case, Basic
Auth is a Jetty thing, not a Solr thing.
We do use this mechanism to great effect to secure things like index
writers and such, and it does work well once it's setup.
Jetty, as with all containers, is a bit fussy about everything being in its
place (sorry to state the obvious :-).

I see you've got a non-global url pattern - is this definitely
correct? In 100% of cases, Solr should be the only app running, so a global
url is standard practice.
Your Jetty's got Solr security-constraint set to /db/*, but your url is
http://localhost:8983/solr/ - you'll need a corresponding servlet-mapping
entry if you want to use /db/* (and the url will change accordingly to
http://localhost:8983/db/solr/)
To simplify things - even if just to get things working initially, can you
set it to a /* url-pattern and use default-role? You can always tweak it
later on.

I take it from your url that you're not using any sharding/multi-core
stuff. If you are using multi-core, include the core name in the url (e.g.
localhost:8983/solr/mycore/select?q=*:*).

You can also set the jetty-logging.properties file as described in:
http://www.eclipse.org/jetty/documentation/9.2.7.v20150116/configuring-logging.html
.
A 404 would suggest that Solr hasn't loaded, possibly due to missing
mappings in the xml. You can run netstat -a on your Windows box to see if
Solr is listening on port 8983.

Thanks,
Peter


On Thu, Jul 23, 2015 at 9:39 PM, Steven White swhite4...@gmail.com wrote:

 Hi Petter,

 I'm on Solr 5.2.1 which comes with Jetty 9.2.  I'm setting this up on
 Windows 2012 but will need to do the same on Linux too.

 I followed the step per this link:
 https://wiki.apache.org/solr/SolrSecurity#Jetty_realm_example very much to
 the book.  Here are the changes I made:

 File: C:\Solr\solr-5.2.1\server\etc\webdefault.xml

    <security-constraint>
      <web-resource-collection>
        <web-resource-name>Solr authenticated application</web-resource-name>
        <url-pattern>/db/*</url-pattern>
      </web-resource-collection>
      <auth-constraint>
        <role-name>db-role</role-name>
      </auth-constraint>
    </security-constraint>

  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Test Realm</realm-name>
  </login-config>

 File: E:\Solr\solr-5.2.1\server\etc\jetty.xml

  <New class="org.eclipse.jetty.security.HashLoginService">
    <Set name="name">Test Realm</Set>
    <Set name="config"><SystemProperty name="jetty.home" default="."/>/etc/realm.properties</Set>
    <Set name="refreshInterval">0</Set>
    <Call name="start"/>
  </New>

 File: E:\Solr\solr-5.2.1\server\etc\realm.properties

 admin: admin, db-role

 I then restarted Solr.  After this, accessing http://localhost:8983/solr/
 gives me:

 HTTP ERROR: 404

 Problem accessing /solr/. Reason:

 Not Found
 Powered by Jetty://

 In a previous post, I asked if anyone has setup Solr 5.2.1 or any 5.x with
 Basic Auth and got it working, I have not heard back.  Either this feature
 is not tested or not in use.  If it is not in use, how do folks secure
 their Solr instance?

 Thanks

 Steve

 On Thu, Jul 23, 2015 at 2:52 PM, Peter Sturge peter.stu...@gmail.com
 wrote:

  Hi Steve,
 
  What version of Jetty are you using?
 
  Have you got a webdefault.xml in your etc folder?
  If so, does it have an entry like this:
 
   <login-config>
     <auth-method>BASIC</auth-method>
     <realm-name>Realm Name as specified in jetty.xml</realm-name>
   </login-config>
 
  It's been a few years since I set this up, but I believe you also need an
  auth-constraint in webdefault.xml - this tells jetty which apps are using
  which realms:
 
   <security-constraint>
     <web-resource-collection>
       <web-resource-name>A web application name</web-resource-name>
       <url-pattern>/*</url-pattern>
     </web-resource-collection>
     <auth-constraint>
       <role-name>default-role</role-name>
     </auth-constraint>
   </security-constraint>
 
  Your realm.properties should then have user account entries for the role
  similar to:
 
  admin: some-cred, default-role
 
 
  Hope this helps,
  Peter
 
 
  On Thu, Jul 23, 2015 at 7:41 PM, Steven White swhite4...@gmail.com
  wrote:
 
   (re-posting as new email thread to see if this will make it to the
 list)
  
  
   That didn't help.  I still get the same result and virtually no log to
  help
   me figure out where / what things are going wrong.
  
   Here is all that I see in C:\Solr\solr-5.2.1\server\logs\solr.log:
  
    INFO  - 2015-07-23 05:29:12.065; [   ] org.eclipse.jetty.util.log.Log; Logging initialized @286ms
    INFO  - 2015-07-23 05:29:12.231; [   ] org.eclipse.jetty.server.Server; jetty-9.2.10.v20150310
    WARN  - 2015-07-23 05:29:12.240; [   ] org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
    INFO  - 2015-07-23 05:29:12.255

Re: Basic Auth (again)

2015-07-23 Thread Peter Sturge
Hi Steve,

What version of Jetty are you using?

Have you got a webdefault.xml in your etc folder?
If so, does it have an entry like this:

  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Realm Name as specified in jetty.xml</realm-name>
  </login-config>

It's been a few years since I set this up, but I believe you also need an
auth-constraint in webdefault.xml - this tells jetty which apps are using
which realms:

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>A web application name</web-resource-name>
      <url-pattern>/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>default-role</role-name>
    </auth-constraint>
  </security-constraint>

Your realm.properties should then have user account entries for the role
similar to:

admin: some-cred, default-role


Hope this helps,
Peter


On Thu, Jul 23, 2015 at 7:41 PM, Steven White swhite4...@gmail.com wrote:

 (re-posting as new email thread to see if this will make it to the list)


 That didn't help.  I still get the same result and virtually no log to help
 me figure out where / what things are going wrong.

 Here is all that I see in C:\Solr\solr-5.2.1\server\logs\solr.log:

    INFO  - 2015-07-23 05:29:12.065; [   ] org.eclipse.jetty.util.log.Log; Logging initialized @286ms
    INFO  - 2015-07-23 05:29:12.231; [   ] org.eclipse.jetty.server.Server; jetty-9.2.10.v20150310
    WARN  - 2015-07-23 05:29:12.240; [   ] org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
    INFO  - 2015-07-23 05:29:12.255; [   ] org.eclipse.jetty.server.AbstractConnector; Started ServerConnector@5a5fae16{HTTP/1.1}{0.0.0.0:8983}
    INFO  - 2015-07-23 05:29:12.256; [   ] org.eclipse.jetty.server.Server; Started @478ms

 Does anyone know where / what logs I should turn on to debug this?  Should
 I be posting this issue on the Jetty mailing list?

 Steve


 On Wed, Jul 22, 2015 at 10:34 AM, Peter Sturge peter.stu...@gmail.com
  wrote:

  Try adding the start call in your jetty.xml:
  <Set name="name">Realm Name</Set>
  <Set name="config"><SystemProperty name="jetty.home" default="."/>/etc/realm.properties</Set>
  <Set name="refreshInterval">5</Set>
  <Call name="start"/>



Re: Basic auth

2015-07-22 Thread Peter Sturge
If you're using Jetty, you can use the standard realms mechanism for Basic
Auth, and it works the same on Windows or UNIX. There's plenty of docs on
the Jetty site about getting this working, although it does vary somewhat
depending on the version of Jetty you're running (N.B. I would suggest
using Jetty 9, and not 8, as 8 is missing some key authentication classes).
If, when you execute a search query to your Solr instance, you get a
username and password popup, then Jetty's auth is set up. If you don't, then
something's wrong in the Jetty config.

It's worth noting that if you're doing distributed searches, Basic Auth on
its own will not work for you. This is because Solr sends distributed
requests to remote instances on behalf of the user, and it has no knowledge
of the web container's auth mechanics. We got 'round this by customizing
Solr to receive credentials and use them for authentication to remote
instances - SOLR-1861 is an old implementation for a previous release, and
there has been some significant refactoring of SearchHandler since then,
but the concept works well for distributed queries.

Thanks,
Peter



On Wed, Jul 22, 2015 at 11:18 AM, O. Klein kl...@octoweb.nl wrote:

 Steven White wrote
  Thanks for updating the wiki page.  However, my issue remains, I cannot
  get
  Basic auth working.  Has anyone got it working, on Windows?

 Doesn't work for me on Linux either.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Basic-auth-tp4218053p4218519.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Basic auth

2015-07-22 Thread Peter Sturge
Try adding the start call in your jetty.xml:
<Set name="name">Realm Name</Set>
<Set name="config"><SystemProperty name="jetty.home" default="."/>/etc/realm.properties</Set>
<Set name="refreshInterval">5</Set>
<Call name="start"/>


On Wed, Jul 22, 2015 at 2:53 PM, O. Klein kl...@octoweb.nl wrote:

 Yeah I can't get it to work on Jetty 9 either on Linux.

 Just trying to password protect the admin pages.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Basic-auth-tp4218053p4218565.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How large is your solr index?

2015-01-07 Thread Peter Sturge
 Is there a problem with multi-valued fields and distributed queries?

 No. But there are some components that don't do the right thing in
 distributed mode, joins for instance. The list is actually quite small and
 is getting smaller all the time.

Yes, joins is the main one. There used to be some dist constraints on
grouping, but that might be from the 3.x days of field collapsing.

 Sounds like you're doing something similar to us. In some cases we have a
 hard commit every minute. Keeping the caches hot seems like a very good
 reason to send data to a specific shard. At least I'm assuming that when
you
 add documents to a single shard and commit; the other shards won't be
 impacted...

 Not true if the other shards have had any indexing activity. The commit is
 usually forwarded to all shards. If the individual index on a
 particular shard is
 unchanged then it should be a no-op though.

This is an excellent point, and well worth taking some care on.
We do it by indexing to a number of shards, and only commit to those that
actually have something to commit - although an empty commit might be a
no-op on the indexing side, it's not on the autowarming/faceting side -
care needs to be taken so that you don't hose your caches unnecessarily.
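
As a rough illustration of the 'only commit where needed' idea - a sketch with
current SolrJ class names and hypothetical core URLs, so adjust for your
version:

// Rough sketch (hypothetical core URLs, current SolrJ class names): commit only
// to the shard cores that actually received documents since the last commit.
// Clients are built per call here purely to keep the sketch short.
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SelectiveCommit {
  private final Set<String> dirtyShards = new HashSet<>();

  public void index(String shardUrl, List<SolrInputDocument> docs) throws Exception {
    try (HttpSolrClient shard = new HttpSolrClient.Builder(shardUrl).build()) {
      shard.add(docs);
      dirtyShards.add(shardUrl);   // this shard now has uncommitted changes
    }
  }

  public void commitDirty() throws Exception {
    for (String shardUrl : dirtyShards) {
      try (HttpSolrClient shard = new HttpSolrClient.Builder(shardUrl).build()) {
        shard.commit();            // clean shards are skipped, so their caches stay warm
      }
    }
    dirtyShards.clear();
  }
}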


On Wed, Jan 7, 2015 at 4:42 PM, Erick Erickson erickerick...@gmail.com
wrote:

 See below:


 On Wed, Jan 7, 2015 at 1:25 AM, Bram Van Dam bram.van...@intix.eu wrote:
  On 01/06/2015 07:54 PM, Erick Erickson wrote:
 
  Have you considered pre-supposing SolrCloud and using the SPLITSHARD
  API command?
 
 
  I think that's the direction we'll probably be going. Index size (at
 least
  for us) can be unpredictable in some cases. Some clients start out small
 and
  then grow exponentially, while others start big and then don't grow much
 at
  all. Starting with SolrCloud would at least give us that flexibility.
 
  That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a
 certain
  size, it would be better for us to simply add an extra shard, without
  splitting.
 

 True, and you can do this if you take explicit control of the document
 routing, but...
 that's quite tricky. You forever after have to send any _updates_ to the
 same
 shard you did the first time, whereas SPLITSHARD will do the right thing.

 
  On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge peter.stu...@gmail.com
  wrote:
 
  ++1 for the automagic shard creator. We've been looking into doing this
  sort of thing internally - i.e. when a shard reaches a certain size/num
  docs, it creates 'sub-shards' to which new commits are sent and queries
  to
  the 'parent' shard are included. The concept works, as long as you
 don't
  try any non-dist stuff - it's one reason why all our fields are always
  single valued.
 
 
  Is there a problem with multi-valued fields and distributed queries?

 No. But there are some components that don't do the right thing in
 distributed mode, joins for instance. The list is actually quite small and
 is getting smaller all the time.

 
  A cool side-effect of sub-sharding (for lack of a snappy term) is that
  the
  parent shard then stops suffering from auto-warming latency due to
  commits
  (we do a fair amount of committing). In theory, you could carry on
  sub-sharding until your hardware starts gasping for air.
 
 
  Sounds like you're doing something similar to us. In some cases we have a
  hard commit every minute. Keeping the caches hot seems like a very good
  reason to send data to a specific shard. At least I'm assuming that when
 you
  add documents to a single shard and commit; the other shards won't be
  impacted...

 Not true if the other shards have had any indexing activity. The commit is
 usually forwarded to all shards. If the individual index on a
 particular shard is
 unchanged then it should be a no-op though.

 But the usage pattern here is its own bit of a trap. If all your
 indexing is going
 to a single shard, then also the entire indexing _load_ is happening on
 that
 shard. So the CPU utilization will be higher on that shard than the older
 ones.
 Since distributed requests need to get a response from every shard before
 returning to the client, the response time will be bounded by the response
 from
 the slowest shard and this may actually be slower. Probably only noticeable
 when the CPU is maxed anyway though.



 
   - Bram
 



Re: How large is your solr index?

2015-01-06 Thread Peter Sturge
Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even
performs reasonably well on commodity hardware with lots of faceting and
concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but
it works.

++1 for the automagic shard creator. We've been looking into doing this
sort of thing internally - i.e. when a shard reaches a certain size/num
docs, it creates 'sub-shards' to which new commits are sent and queries to
the 'parent' shard are included. The concept works, as long as you don't
try any non-dist stuff - it's one reason why all our fields are always
single valued. There are also other implications like cleanup, deletes and
security to take into account, to name a few.
A cool side-effect of sub-sharding (for lack of a snappy term) is that the
parent shard then stops suffering from auto-warming latency due to commits
(we do a fair amount of committing). In theory, you could carry on
sub-sharding until your hardware starts gasping for air.


On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam bram.van...@intix.eu wrote:

 On 01/04/2015 02:22 AM, Jack Krupansky wrote:

 The reality doesn't seem to
 be there today. 50 to 100 million documents, yes, but beyond that takes
 some kind of heroic effort, whether a much beefier box, very careful and
 limited data modeling or limiting of query capabilities or tolerance of
 higher latency, expert tuning, etc.


 I disagree. On the scale, at least. Up until 500M Solr performs well
 (read: well enough considering the scale) in a single shard on a single box
 of commodity hardware. Without any tuning or heroic efforts. Sure, some
 queries aren't as snappy as you'd like, and sure, indexing and querying at
 the same time will be somewhat unpleasant, but it will work, and it will
 work well enough.

 Will it work for thousands of concurrent users? Of course not. Anyone who
 is after that sort of thing won't find themselves in this scenario -- they
 will throw hardware at the problem.

 There is something to be said for making sharding less painful. It would
 be nice if, for instance, Solr would automagically create a new shard once
 some magic number was reached (2B at the latest, I guess). But then that'll
 break some query features ... :-(

 The reason we're using single large instances (sometimes on beefy
 hardware) is that SolrCloud is a pain. Not just from an administrative
 point of view (though that seems to be getting better, kudos for that!),
 but mostly because some queries cannot be executed with distributed=true.
 Our users, at least, prefer a slow query over an impossible query.

 Actually, this 2B limit is a good thing. It'll help me convince
 $management to donate some of our time to Solr :-)

  - Bram



Re: Get matched Term in join query

2014-12-09 Thread Peter Sturge
Hi,

Your question is a good one - I have added an option to search through
results and filter that way, but it's not ideal, as very often there are
tens of thousands or millions of hits, with only 20 results per page returned.

I've realized I run into the classic 'Terms can't be filtered' issue. To
filter Terms would, in the worst case, mean looking up a great many items.
For now, I'm going with the TermsComponent added to the standard
SearchHandler. The drawback is that you get back all terms that match the
terms.regex, even those not necessarily in the results.
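
For reference, the wiring is just the stock TermsComponent chained onto the
handler - something like this in solrconfig.xml (the handler name here is just
the default /select one):

<searchComponent name="terms" class="solr.TermsComponent"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="last-components">
    <str>terms</str>
  </arr>
</requestHandler>

with terms=true, terms.fl and terms.regex then supplied on the request as
needed.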

Many thanks,
Peter



On Tue, Dec 9, 2014 at 7:32 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Hello Peter,

 Let's limit or just fix the problem definition. I've got that dealing with
 cross core join id mandatory. Is it right?
 Then, do you need facets (from all resultset) or just a snippets (just from
 result page)?
 09.12.2014 1:23 пользователь Peter Sturge peter.stu...@gmail.com
 написал:

  Hi Forum,
 
  Is it possible for a Solr query to return the term(s) that matched a
  particular field/query?
 
  For example, let's say there's a field like this:
  raw=This is a raw text field that happens to contain some text that's
 also
  in the action field value...
 
  And another field in a different index like this:
  action=contain
 
  And they are tokenized on whitespace.
 
  If my query is:
  q={!join from=action to=raw fromIndex=TheActionIndex}*
 
  If 'action' was in the same index, it would be ok, but
  the problem is the match in 'TheActionIndex' isn't returned as it's in a
  different index.
 
  The query returns matching raw documents, but not *which* term was
 matched
  to cause it to be returned.
  I've tried the highlighting trick, but that doesn't work here - it
 returns
  highlighting on all terms.
  It would be great to get these back as facets, but getting them back at
 all
  would be great.
 
  Is it possible to have the query return which term(s) from 'raw' actually
  matched the value in 'action'?
  Maybe an extended TermsComponent to add only matched terms to the
 response
  payload or similar?
 
  Many thanks,
  Peter
 



Get matched Term in join query

2014-12-08 Thread Peter Sturge
Hi Forum,

Is it possible for a Solr query to return the term(s) that matched a
particular field/query?

For example, let's say there's a field like this:
raw=This is a raw text field that happens to contain some text that's also
in the action field value...

And another field in a different index like this:
action=contain

And they are tokenized on whitespace.

If my query is:
q={!join from=action to=raw fromIndex=TheActionIndex}*

If 'action' was in the same index, it would be ok, but
the problem is the match in 'TheActionIndex' isn't returned as it's in a
different index.

The query returns matching raw documents, but not *which* term was matched
to cause it to be returned.
I've tried the highlighting trick, but that doesn't work here - it returns
highlighting on all terms.
It would be great to get these back as facets, but getting them back at all
would be great.

Is it possible to have the query return which term(s) from 'raw' actually
matched the value in 'action'?
Maybe an extended TermsComponent to add only matched terms to the response
payload or similar?

Many thanks,
Peter


Handling intersection facets of many values

2014-11-19 Thread Peter Sturge
Hi Solr Group,

Got an interesting use case (to me, at least), perhaps someone could give
some insight on how best to achieve this?

I've got a core that has about 7 million entries, with a field called 'addr'.
By definition, every entry has a unique 'addr' value, so there are 7 million
unique values for this field.
I then have another core with ~20 million entries. These have a field called
'dest', and there may be, say, around 800-1000 unique values for 'dest', but
there's always a value - the number of unique values varies.

So, the problem is this:
What is the best/only/most efficient way to construct a search whereby I
get back an (ideally faceted) list of values for 'dest' that occur in
'addr'?
Can I do this with just faceting (e.g. facet query or similar)? Or do I
need grouping?
Note, I don't actually need the documents themselves, only the list of
unique values that are the intersection of 'dest' and 'addr'.

Can anyone help shed some light on how best to do this?

Many thanks,
Peter


Re: Handling intersection facets of many values

2014-11-19 Thread Peter Sturge
Hi Toke,
Thanks for your input.

I guess you mean take the 1k or so values and build a boolean query from
them?
If that's not what you mean, my apologies..
I'd thought of doing that - the trouble I had was
the unique values could be 20k, or 15,167, or any arbitrary and potentially
high-ish number - it's not really known and can/will change over time. I
believe a boolean query with more than 1024 clauses can blow up the query, so
scalability is a concern.
The other issue is how this would yield the unique facet values -
e.g. dest=8.8.8.8 (17) [i.e. 8.8.8.8 is in the 'addr' list and occurs 17
times in entries with a 'dest' field] - in fact, I need the unique
value(s) ('8.8.8.8') more than I need the count ('17').

I could get the facet list of 'dest' values, then trawl through each one,
but this will be a complicated and time-consuming client-side operation.
I'm also looking at creating a custom QueryParser that would build the
relevant DocLists, then intersect them and return the values, but I
wouldn't want to reinvent the wheel if possible, given that facets already
build unique term lists, seems so close - I guess it's like taking two
facet lists (1 for addr, 1 for dest), intersecting them and returning the
result:

List 1:
a
b
c
d
e
f

List 2:
a
a
g
z
c
c
c
e

Resultant intersection:
a (2)
c (3)
e (1)


Thanks,
Peter



On Wed, Nov 19, 2014 at 7:16 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Peter Sturge [peter.stu...@gmail.com] wrote:

 [addr 7M unique, dest 1K unique]

  What is the best/only/most efficient way to consutruct a search where by
 I
  get back an (ideally faceted) list of values for 'dest' that occur in
  'addr'?

 I assume the actual values are defined by a query? As the number of
 possible values in dest is not that large, extracting those first and then
 using them as a filter when searching for addr seems like a fairly
 efficient way of solving the problem.

 - Toke Eskildsen



Re: Handling intersection facets of many values

2014-11-19 Thread Peter Sturge
Hi Toke,
Yes, the 'lots-of-booleans' thing is a bit prohibitive as it won't
realistically scale to large value sets.

I've been wrestling with joins this evening and have managed to get these
working - and it works very nicely - and across cores (although not shards
yet afaik)!

For anyone looking to do this sort of facet intersecting, here's my query:
127.0.0.1:8983/solr/net/select?q=*:*&fl=dest&fl=src&facet=true&fq={!join from=addr to=dest fromIndex=targets}*&facet.field=src&facet.field=dest&facet.mincount=1&facet.limit=-1&facet.sort=count&rows=0

Thanks,
Peter


On Wed, Nov 19, 2014 at 9:23 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Peter Sturge [peter.stu...@gmail.com] wrote:
  I guess you mean take the 1k or so values and build a boolean query from
  them?

 Not really. Let me try again:

 1) Perform a facet call with facet.limit=-1 on dest to get the relevant
 dest values.
 The result will always be 1000 values or less. Take those values and
 construct a filter query a OR b OR c.

 2) Perform a facet call on addr with the original query + the newly
 constructed filter query.
 The facet response should not contain the intersection.

 1000 is a bit close to the default limit for boolean queries, so you might
 want to raise that.

  I'm also looking at creating a custom QueryParser that would build the
  relevant DocLists, then intersect them and return the values, [...]

 You are describing a Join in Solr and that would likely solve your
 problem, but it does not work across cores. Is it possible to have both the
 addr and dest data in the same core?

 - Toke Eskildsen



Re: Facet sort descending

2013-09-10 Thread Peter Sturge
Hi,

This question could possibly be about rarest-first facet counting - i.e. return
the facet counts with the least values.
I remember doing a patch for this years ago, but then it broke when some
UninvertedField facet optimization came in around the 3.5 timeframe.
It's a neat idea, though, to have an option to show the 'rarest N' facets, not
just the 'top N'.

Thanks,
Peter



On Mon, Sep 9, 2013 at 11:43 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : Is there a plan to add a descending sort order for facet queries ?
 : Best regards Sandro

 I don't understand your question.

 if you specify multiple facet.query params, then the constraint counts are
 returned in the order they were initially specified -- there is no need
 for server side sorting, because they all come back (as opposed to
 facet.field where the number of constraints can be unbounded and you may
 request just the top X using facet.limit)

 If you are asking about facet.field and using facet.sort to specify the
 order of the constraints for each field, then no -- i don't believe anyone
 is currently working on adding options for descending sort.

 I don't think it would be hard to add if someone wanted to ... I just
 don't know that there has ever been enough demand for anyone to look into
 it.


 -Hoss



Re: Facet sort descending

2013-09-10 Thread Peter Sturge
Hi Sandro,
Ah, ok, this is quite simple then - you should be able to sort these any
way you like in your client code since the facet data is all there.
On the server-side, you can look at
https://issues.apache.org/jira/browse/SOLR-1672 - please note this is an
old patch for 1.4, so this won't work on 4.x - but it can give an idea of
how/where to do the sorting on the server-side, if you want to go down that
road.
HTH
Peter



On Tue, Sep 10, 2013 at 11:49 AM, Sandro Zbinden zbin...@imagic.ch wrote:

 Hi

 @Peter This is actually the requirement. We have. For both sort options
 (index, count) we would like to have the possibility to add the desc option.

 Instead of this result
 q=*:*&facet=true&facet.field=image_text&facet.sort=index&rows=0

 <lst name="facet_fields">
   <lst name="image_text">
     <int name="a">12</int>
     <int name="b">23</int>
     <int name="c">200</int>
   </lst>
 </lst>

 We would like to add desc to the sort option like facet.sort=index,desc
  to get the following result

 <lst name="facet_fields">
   <lst name="image_text">
     <int name="c">200</int>
     <int name="b">23</int>
     <int name="a">12</int>
   </lst>
 </lst>

 Bests Sandro


 -Ursprüngliche Nachricht-
 Von: Peter Sturge [mailto:peter.stu...@gmail.com]
 Gesendet: Dienstag, 10. September 2013 11:17
 An: solr-user@lucene.apache.org
 Betreff: Re: Facet sort descending

 Hi,

 This question could possibly be about rare idr facet counting - i.e.
 retrun the facets counts with the least values.
 I remember doing a patch for this years ago, but then it broke when some
 UninvertedField facet optimization came in around ~3.5 time.
 It's a neat idea though to have an option to show the 'rarest N' facets
 not just the 'top N'.

 Thanks,
 Peter



 On Mon, Sep 9, 2013 at 11:43 PM, Chris Hostetter
 hossman_luc...@fucit.orgwrote:

 
  : Is there a plan to add a descending sort order for facet queries ?
  : Best regards Sandro
 
  I don't understand your question.
 
  if you specify multiple facet.query params, then the constraint counts
  are returned in the order they were initially specified -- there is no
  need for server side sorting, because they all come back (as opposed
  to facet.field where the number of constraints can be unbounded and
  you may request just the top X using facet.limit)
 
  If you are asking about facet.field and using facet.sort to specify
  the order of the constraints for each field, then no -- i don't
  believe anyone is currently working on adding options for descending
 sort.
 
  I don't think it would be hard to add if someone wanted to ... I just
  don't know that there has ever been enough demand for anyone to look
  into it.
 
 
  -Hoss
 



Re: Facet field display name

2013-08-12 Thread Peter Sturge
2c worth:
We do lots of facet lookups to allow 'prettyprint' versions of facet names.
We do this on the client-side, though. The reason is that the lookups can
then be different for different locations/users etc., which makes it easy for
localization.
It's also very easy to implement such a lookup, without having to disturb
the innards of Solr...
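
As a trivial illustration, the lookup can be as simple as a per-locale map
consulted just before rendering - the field names and labels here are only
examples:

// Tiny client-side sketch: map raw facet field names to per-locale display
// names before rendering (example field/label values only).
import java.util.HashMap;
import java.util.Map;

public class FacetLabels {
  private static final Map<String, String> NORWEGIAN = new HashMap<>();
  static {
    NORWEGIAN.put("omrade", "Område");
    NORWEGIAN.put("image_text", "Bildetekst");
  }

  // Fall back to the raw field name if no pretty label is defined.
  public static String label(String facetField) {
    return NORWEGIAN.getOrDefault(facetField, facetField);
  }

  public static void main(String[] args) {
    System.out.println(label("omrade"));   // prints: Område
  }
}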



On Mon, Aug 12, 2013 at 2:25 PM, Erick Erickson erickerick...@gmail.com wrote:

 Have you seen the key parameter here:

 http://wiki.apache.org/solr/SimpleFacetParameters#key_:_Changing_the_output_key

 it allows you to label the output key anything you want, and since these
 are
 field names, this seems do-able.

 Best,
 Erick


 On Mon, Aug 12, 2013 at 4:02 AM, Aleksander Akerø aleksan...@gurusoft.no
 wrote:

  Hi
 
  I wondered if there was some way to configure a display name for facet
  fields. Either that or some way to display nordic letters without it
  messing up the faceting.
 
  Say I wanted a facet field called område (norwegian, area in
 english).
  Then I would have to create the field something like this in schema.xml:
 
  <field name="omrade" type="string" indexed="true" stored="true" required="false" />
 
  But then I would have to do a replace to show a prettier name in
  frontend. It would be preferred not to do this sort of hardcoding, as I
  would have to do this for all the facet fields.
 
 
  Either that or I could try encoding the 'å' like this:
 
  <field name="omr&#229;de" type="string" indexed="true" stored="true" required="false" />
 
  Then it will show up with a pretty name, but the faceting will fail.
 Maybe
  this is due to encoding issues, seen as the frontend is encoded with
  ISO-8859-1?
 
 
  So does anyone have a good practice for either getting this sort of
 problem
  working properly. Or a way to define an alternative display name for a
  facet field, that I could display instead of the field.name?
 
 
  *Aleksander Akerø*
  Systemkonsulent
  Mobil: 944 89 054
  E-post: aleksan...@gurusoft.no
 
  *Gurusoft AS*
  Telefon: 92 44 09 99
  Østre Kullerød
  www.gurusoft.no
 



Re: Applying Sum on Field

2013-07-11 Thread Peter Sturge
Hi,

If you mean adding up numeric values stored in fields - no, Solr doesn't do
this by default.
We had a similar requirement for this, and created a custom SearchComponent
to handle sum, average, stats etc.
There are a number of things you need to bear in mind, such as:
  * Handling errors when a query asks for sums on fields that are
non-numeric
  * Performance issues - e.g. are you willing to wait to add up 50 million
fields of stringified numbers
  * How to return result payloads in a client-friendly way
  * Be prepared to coalesce results from multi-shard/distributed queries.
It's not trivial, but it is do-able.
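
That said, depending on the exact requirements, the stock StatsComponent may
cover the simple sum-per-group case - e.g. something along these lines against
the sample below (a sketch only; I haven't tested it against this schema, and
stats.facet has its own caveats):

http://localhost:8080/solr/collection2/select?q=caffe&df=content&stats=true&stats.field=price&stats.facet=type&rows=0

which returns sum/min/max/mean for 'price', broken down by each 'type' value.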

Peter




On Thu, Jul 11, 2013 at 12:56 PM, Jamshaid Ashraf jamshaid...@gmail.com wrote:

 Hi,

 I'm a new solr user, I wanted to know is there any way to apply sum on a
 field in a result document of group query?

 Following is the query and its result set, I wanted to apply sum on 'price'
 filed grouping on type:


 *Sample input:*

 <doc>
   <str name="id">3</str>
   <str name="type">Caffe</str>
   <str name="content">Yummm  Drinking a latte at Caffe Grecco in SF's historic
 North Beach Learning text analysis with SolrInAction by Manning on my
 iPad</str>
   <long name="_version_">1440257540658036736</long>
   <int name="price">250</int>
 </doc>
 <doc>
   <str name="id">1</str>
   <str name="type">Caffe</str>
   <str name="content">Yummm  Drinking a latte at Caffe Grecco in SF's historic
 North Beach Learning text analysis with SolrInAction by Manning on my
 iPad</str>
   <long name="_version_">1440257592044552192</long>
   <int name="price">100</int>
 </doc>

 *Query:*

 http://localhost:8080/solr/collection2/select?q=caffedf=contentgroup=truegroup.field=type

 your help will be greatly appreciated!

 Regards,
 Jamshaid



Re: Two instances of solr - the same datadir?

2013-07-03 Thread Peter Sturge
You can do a reload, yes, but a commit() is considerably faster.
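
e.g. something like this against the read-only core - a minimal sketch with
current SolrJ class names and a placeholder URL (older SolrJ versions name the
client class differently):

// Minimal sketch (placeholder URL): an 'empty' commit against the read-only
// instance, just to open a new searcher over the shared index directory.
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class RefreshReadOnlyCore {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient ro =
             new HttpSolrClient.Builder("http://localhost:8984/solr/core1").build()) {
      ro.commit();   // nothing to flush; this just reopens the searcher and re-warms caches
    }
  }
}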


On Tue, Jul 2, 2013 at 10:35 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Wouldn't it be better to do a RELOAD?

 http://wiki.apache.org/solr/CoreAdmin#RELOAD

 Michael Della Bitta

 Applications Developer

 o: +1 646 532 3062  | c: +1 917 477 7906

 appinions inc.

 “The Science of Influence Marketing”

 18 East 41st Street

 New York, NY 10017

 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 w: appinions.com http://www.appinions.com/


 On Tue, Jul 2, 2013 at 5:05 PM, Peter Sturge peter.stu...@gmail.com
 wrote:

  The RO instance commit isn't (or shouldn't be) doing any real writing,
 just
  an empty commit to force new searchers, autowarm/refresh caches etc.
  Admittedly, we do all this on 3.6, so 4.0 could have different behaviour
 in
  this area.
  As long as you don't have autocommit in solrconfig.xml, there wouldn't be
  any commits 'behind the scenes' (we do all our commits via a local solrj
  client so it can be fully managed).
  The only caveat might be NRT/soft commits, but I'm not too familiar with
  this in 4.0.
  In any case, your RO instance must be getting updated somehow, otherwise
  how would it know your write instance made any changes?
  Perhaps your write instance notifies the RO instance externally from
 Solr?
  (a perfectly valid approach, and one that would allow a 'single' lock to
  work without contention)
 
 
 
  On Tue, Jul 2, 2013 at 7:59 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   Interesting, we are running 4.0 - and solr will refuse the start (or
   reload) the core. But from looking at the code I am not seeing it is
  doing
   any writing - but I should digg more...
  
   Are you sure it needs to do writing? Because I am not calling commits,
 in
   fact I have deactivated *all* components that write into index, so
 unless
   there is something deep inside, which automatically calls the commit,
 it
   should never happen.
  
   roman
  
  
   On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com
   wrote:
  
Hmmm, single lock sounds dangerous. It probably works ok because
 you've
been [un]lucky.
For example, even with a RO instance, you still need to do a commit
 in
order to reload caches/changes from the other instance.
What happens if this commit gets called in the middle of the other
instance's commit? I've not tested this scenario, but it's very
  possible
with a 'single' lock the results are indeterminate.
If the 'single' lock mechanism is making assumptions e.g. no other
   process
will interfere, and then one does, the Lucene index could very well
 get
corrupted.
   
For the error you're seeing using 'native', we use native lockType
 for
   both
write and RO instances, and it works fine - no contention.
Which version of Solr are you using? Perhaps there's been a change in
behaviour?
   
Peter
   
   
On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
   
 as i discovered, it is not good to use 'native' locktype in this
scenario,
 actually there is a note in the solrconfig.xml which says the same

 when a core is reloaded and solr tries to grab lock, it will fail -
   even
if
 the instance is configured to be read-only, so i am using 'single'
  lock
for
 the readers and 'native' for the writer, which seems to work OK

 roman


 On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com
 
wrote:

  I have auto commit after 40k RECs/1800secs. But I only tested
 with
manual
  commit, but I don't see why it should work differently.
  Roman
  On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com
   wrote:
 
  If it makes you feel better, I also considered this approach
 when
  I
was
 in
  the same situation with a separate indexer and searcher on one
Physical
  linux machine.
 
  My main concern was re-using the FS cache between both
  instances -
If
 I
  replicated to myself there would be two independent copies of
 the
index,
  FS-cached separately.
 
  I like the suggestion of using autoCommit to reload the index.
 If
   I'm
  reading that right, you'd set an autoCommit on 'zero docs
  changing',
or
  just 'every N seconds'? Did that work?
 
  Best of luck!
 
  Tim
 
 
  On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   So here it is for a record how I am solving it right now:
  
   Write-master is started with:
 -Dmontysolr.warming.enabled=false
   -Dmontysolr.write.master=true -Dmontysolr.read.master=
   http://localhost:5005
   Read-master is started with: -Dmontysolr.warming.enabled=true
   -Dmontysolr.write.master=false
  
  
   solrconfig.xml changes:
  
   1. all index

Re: Two instances of solr - the same datadir?

2013-07-02 Thread Peter Sturge
Hmmm, single lock sounds dangerous. It probably works ok because you've
been [un]lucky.
For example, even with a RO instance, you still need to do a commit in
order to reload caches/changes from the other instance.
What happens if this commit gets called in the middle of the other
instance's commit? I've not tested this scenario, but it's very possible
with a 'single' lock the results are indeterminate.
If the 'single' lock mechanism is making assumptions e.g. no other process
will interfere, and then one does, the Lucene index could very well get
corrupted.

For the error you're seeing using 'native', we use native lockType for both
write and RO instances, and it works fine - no contention.
Which version of Solr are you using? Perhaps there's been a change in
behaviour?

Peter


On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com wrote:

 as i discovered, it is not good to use 'native' locktype in this scenario,
 actually there is a note in the solrconfig.xml which says the same

 when a core is reloaded and solr tries to grab lock, it will fail - even if
 the instance is configured to be read-only, so i am using 'single' lock for
 the readers and 'native' for the writer, which seems to work OK

 roman


 On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com wrote:

  I have auto commit after 40k RECs/1800secs. But I only tested with manual
  commit, but I don't see why it should work differently.
  Roman
  On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com wrote:
 
  If it makes you feel better, I also considered this approach when I was
 in
  the same situation with a separate indexer and searcher on one Physical
  linux machine.
 
  My main concern was re-using the FS cache between both instances - If
 I
  replicated to myself there would be two independent copies of the index,
  FS-cached separately.
 
  I like the suggestion of using autoCommit to reload the index. If I'm
  reading that right, you'd set an autoCommit on 'zero docs changing', or
  just 'every N seconds'? Did that work?
 
  Best of luck!
 
  Tim
 
 
  On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:
 
   So here it is for a record how I am solving it right now:
  
   Write-master is started with: -Dmontysolr.warming.enabled=false
   -Dmontysolr.write.master=true -Dmontysolr.read.master=
   http://localhost:5005
   Read-master is started with: -Dmontysolr.warming.enabled=true
   -Dmontysolr.write.master=false
  
  
   solrconfig.xml changes:
  
   1. all index changing components have this bit,
   enable=${montysolr.master:true} - ie.
  
   updateHandler class=solr.DirectUpdateHandler2
enable=${montysolr.master:true}
  
   2. for cache warming de/activation
  
   listener event=newSearcher
 class=solr.QuerySenderListener
 enable=${montysolr.enable.warming:true}...
  
   3. to trigger refresh of the read-only-master (from write-master):
  
   listener event=postCommit
 class=solr.RunExecutableListener
 enable=${montysolr.master:true}
 str name=execurl/str
 str name=dir./str
 bool name=waitfalse/bool
 arr name=args str${montysolr.read.master:http://localhost
  
  
 
 }/solr/admin/cores?wt=jsonamp;action=RELOADamp;core=collection1/str/arr
   /listener
  
   This works, I still don't like the reload of the whole core, but it
  seems
   like the easiest thing to do now.
  
   -- roman
  
  
   On Wed, Jun 5, 2013 at 12:07 PM, Roman Chyla roman.ch...@gmail.com
   wrote:
  
Hi Peter,
   
Thank you, I am glad to read that this usecase is not alien.
   
I'd like to make the second instance (searcher) completely
 read-only,
  so
   I
have disabled all the components that can write.
   
(being lazy ;)) I'll probably use
http://wiki.apache.org/solr/CollectionDistribution to call the curl
   after
commit, or write some IndexReaderFactory that checks for changes
   
The problem with calling the 'core reload' - is that it seems lots
 of
   work
for just opening a new searcher, eeekkk...somewhere I read that it
 is
   cheap
to reload a core, but re-opening the index searches must be
 definitely
cheaper...
   
roman
   
   
On Wed, Jun 5, 2013 at 4:03 AM, Peter Sturge 
 peter.stu...@gmail.com
   wrote:
   
Hi,
We use this very same scenario to great effect - 2 instances using
  the
same
dataDir with many cores - 1 is a writer (no caching), the other is
 a
searcher (lots of caching).
To get the searcher to see the index changes from the writer, you
  need
   the
searcher to do an empty commit - i.e. you invoke a commit with 0
documents.
This will refresh the caches (including autowarming), [re]build the
relevant searchers etc. and make any index changes visible to the
 RO
instance.
Also, make sure to use lockTypenative/lockType in
 solrconfig.xml
  to
ensure the two instances don't try to commit at the same time

Re: Two instances of solr - the same datadir?

2013-07-02 Thread Peter Sturge
The RO instance commit isn't (or shouldn't be) doing any real writing, just
an empty commit to force new searchers, autowarm/refresh caches etc.
Admittedly, we do all this on 3.6, so 4.0 could have different behaviour in
this area.
As long as you don't have autocommit in solrconfig.xml, there wouldn't be
any commits 'behind the scenes' (we do all our commits via a local solrj
client so it can be fully managed).
The only caveat might be NRT/soft commits, but I'm not too familiar with
this in 4.0.
In any case, your RO instance must be getting updated somehow, otherwise
how would it know your write instance made any changes?
Perhaps your write instance notifies the RO instance externally from Solr?
(a perfectly valid approach, and one that would allow a 'single' lock to
work without contention)



On Tue, Jul 2, 2013 at 7:59 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Interesting, we are running 4.0 - and solr will refuse the start (or
 reload) the core. But from looking at the code I am not seeing it is doing
 any writing - but I should digg more...

 Are you sure it needs to do writing? Because I am not calling commits, in
 fact I have deactivated *all* components that write into index, so unless
 there is something deep inside, which automatically calls the commit, it
 should never happen.

 roman


 On Tue, Jul 2, 2013 at 2:54 PM, Peter Sturge peter.stu...@gmail.com
 wrote:

  Hmmm, single lock sounds dangerous. It probably works ok because you've
  been [un]lucky.
  For example, even with a RO instance, you still need to do a commit in
  order to reload caches/changes from the other instance.
  What happens if this commit gets called in the middle of the other
  instance's commit? I've not tested this scenario, but it's very possible
  with a 'single' lock the results are indeterminate.
  If the 'single' lock mechanism is making assumptions e.g. no other
 process
  will interfere, and then one does, the Lucene index could very well get
  corrupted.
 
  For the error you're seeing using 'native', we use native lockType for
 both
  write and RO instances, and it works fine - no contention.
  Which version of Solr are you using? Perhaps there's been a change in
  behaviour?
 
  Peter
 
 
  On Tue, Jul 2, 2013 at 7:30 PM, Roman Chyla roman.ch...@gmail.com
 wrote:
 
   as i discovered, it is not good to use 'native' locktype in this
  scenario,
   actually there is a note in the solrconfig.xml which says the same
  
   when a core is reloaded and solr tries to grab lock, it will fail -
 even
  if
   the instance is configured to be read-only, so i am using 'single' lock
  for
   the readers and 'native' for the writer, which seems to work OK
  
   roman
  
  
   On Fri, Jun 7, 2013 at 9:05 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
I have auto commit after 40k RECs/1800secs. But I only tested with
  manual
commit, but I don't see why it should work differently.
Roman
On 7 Jun 2013 20:52, Tim Vaillancourt t...@elementspace.com
 wrote:
   
If it makes you feel better, I also considered this approach when I
  was
   in
the same situation with a separate indexer and searcher on one
  Physical
linux machine.
   
My main concern was re-using the FS cache between both instances -
  If
   I
replicated to myself there would be two independent copies of the
  index,
FS-cached separately.
   
I like the suggestion of using autoCommit to reload the index. If
 I'm
reading that right, you'd set an autoCommit on 'zero docs changing',
  or
just 'every N seconds'? Did that work?
   
Best of luck!
   
Tim
   
   
On 5 June 2013 10:19, Roman Chyla roman.ch...@gmail.com wrote:
   
 So here it is for a record how I am solving it right now:

 Write-master is started with: -Dmontysolr.warming.enabled=false
 -Dmontysolr.write.master=true -Dmontysolr.read.master=
 http://localhost:5005
 Read-master is started with: -Dmontysolr.warming.enabled=true
 -Dmontysolr.write.master=false


 solrconfig.xml changes:

 1. all index changing components have this bit,
 enable=${montysolr.master:true} - ie.

 updateHandler class=solr.DirectUpdateHandler2
  enable=${montysolr.master:true}

 2. for cache warming de/activation

 listener event=newSearcher
   class=solr.QuerySenderListener
   enable=${montysolr.enable.warming:true}...

 3. to trigger refresh of the read-only-master (from write-master):

 listener event=postCommit
   class=solr.RunExecutableListener
   enable=${montysolr.master:true}
   str name=execurl/str
   str name=dir./str
   bool name=waitfalse/bool
   arr name=args str${montysolr.read.master:
  http://localhost


   
  
 
 }/solr/admin/cores?wt=jsonamp;action=RELOADamp;core=collection1/str/arr
 /listener

 This works, I still don't like the reload

Re: Improving performance to return 2000+ documents

2013-06-29 Thread Peter Sturge
Hello Utkarsh,
This may or may not be relevant for your use-case, but the way we deal with
this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time
(user selectable). We can then page the results, changing the start
parameter to return the next set. This allows us to 'retrieve' millions of
documents - we just do it at the user's leisure, rather than make them wait
for the whole lot in one go.
This works well because users very rarely want to see ALL 2000 (or whatever
number) documents at once - it's simply too much to take in at one time.
If your use-case involves an automated or offline procedure (e.g. running a
report or some data-mining op), then presumably it doesn't matter so much
that it takes a bit longer (as long as it returns in some reasonable time).
Have you looked at doing paging on the client-side? This will hugely
speed up your search time.
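
If it's useful, here's a rough SolrJ sketch of the client-side paging idea -
the URL and query are lifted from your ab test, the page size is just an
example:

[code]
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PagingSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://x.amazonaws.com:8983/solr/prodinfo");

        final int pageSize = 100;   // 100 docs per round trip instead of 2000 in one go
        SolrQuery q = new SolrQuery("allText:\"huggies diapers size 1\"");
        q.setRows(pageSize);

        int start = 0;
        long numFound;
        do {
            q.setStart(start);
            QueryResponse rsp = solr.query(q);
            numFound = rsp.getResults().getNumFound();
            // ...render/process this page of results here...
            start += rsp.getResults().size();
        } while (start < Math.min(numFound, 2000));   // stop at the top 2000
    }
}
[/code]

For paging much deeper than a few thousand documents the start offset itself
starts to cost you, but for the top 2000 it's perfectly workable.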
HTH
Peter



On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.comwrote:

 Well, depending on how many docs get served
 from the cache the time will vary. But this is
 just ugly, if you can avoid this use-case it would
 be a Good Thing.

 Problem here is that each and every shard must
 assemble the list of 2,000 documents (just ID and
 sort criteria, usually score).

 Then the node serving the original request merges
 the sub-lists to pick the top 2,000. Then the node
 sends another request to each shard to get
 the full document. Then the node merges this
 into the full list to return to the user.

 Solr really isn't built for this use-case, is it actually
 a compelling situation?

 And having your document cache set at 1M is kinda
 high if you have very big documents.

 FWIW,
 Erick


 On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com
 wrote:

  Also, I don't see a consistent response time from solr, I ran ab again
 and
  I get this:
 
  ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
 
 
 http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201rows=2000wt=json
  
 
 
  Benchmarking x.amazonaws.com (be patient)
  Completed 100 requests
  Completed 200 requests
  Completed 300 requests
  Completed 400 requests
  Completed 500 requests
  Finished 500 requests
 
 
  Server Software:
  Server Hostname:   x.amazonaws.com
  Server Port:8983
 
  Document Path:
 
 
 /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201rows=2000wt=json
  Document Length:1538537 bytes
 
  Concurrency Level:  10
  Time taken for tests:   10.858 seconds
  Complete requests:  500
  Failed requests:8
 (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
  Write errors:   0
  Total transferred:  769297992 bytes
  HTML transferred:   769268492 bytes
  Requests per second:46.05 [#/sec] (mean)
  Time per request:   217.167 [ms] (mean)
  Time per request:   21.717 [ms] (mean, across all concurrent
 requests)
  Transfer rate:  69187.90 [Kbytes/sec] received
 
  Connection Times (ms)
min  mean[+/-sd] median   max
  Connect:00   0.3  0   2
  Processing:   110  215  72.0190 497
  Waiting:   91  180  70.5152 473
  Total:112  216  72.0191 497
 
  Percentage of the requests served within a certain time (ms)
50%191
66%225
75%252
80%272
90%319
95%364
98%420
99%453
   100%497 (longest request)
 
 
  Sometimes it takes a lot of time, sometimes its pretty quick.
 
  Thanks,
  -Utkarsh
 
 
  On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com
  wrote:
 
   Hello,
  
   I have a usecase where I need to retrive top 2000 documents matching a
   query.
   What are the parameters (in query, solrconfig, schema) I shoud look at
 to
   improve this?
  
   I have 45M documents in 3node solrcloud 4.3.1 with 3 shards, with 30GB
   RAM, 8vCPU and 7GB JVM heap size.
  
   I have documentCache:
 documentCache class=solr.LRUCache  size=100
   initialSize=100   autowarmCount=0/
  
   allText is a copyField.
  
   This is the result I get:
   ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
  
 
 http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201rows=2000wt=json
   
  
   Benchmarking x.amazonaws.com (be patient)
   Completed 100 requests
   Completed 200 requests
   Completed 300 requests
   Completed 400 requests
   Completed 500 requests
   Finished 500 requests
  
  
   Server Software:
   Server Hostname:x.amazonaws.com
   Server Port:8983
  
   Document Path:
  
 
 /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201rows=2000wt=json
   Document Length:1538537 bytes
  
   Concurrency Level:  10
   Time taken for tests:   35.999 seconds
   Complete requests:  500
   Failed requests:21
  (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
   Write errors:   0
   Non-2xx responses:  2
   Total 

Re: Two instances of solr - the same datadir?

2013-06-05 Thread Peter Sturge
Hi,
We use this very same scenario to great effect - 2 instances using the same
dataDir with many cores - 1 is a writer (no caching), the other is a
searcher (lots of caching).
To get the searcher to see the index changes from the writer, you need the
searcher to do an empty commit - i.e. you invoke a commit with 0 documents.
This will refresh the caches (including autowarming), [re]build the
relevant searchers etc. and make any index changes visible to the RO
instance.
Also, make sure to use <lockType>native</lockType> in solrconfig.xml to
ensure the two instances don't try to commit at the same time.
There are several ways to trigger a commit:
Call commit() periodically within your own code.
Use autoCommit in solrconfig.xml.
Use an RPC/IPC mechanism between the 2 instance processes to tell the
searcher the index has changed, then call commit when notified (more complex
coding, but good if the index changes on an ad-hoc basis).
Note, doing things this way isn't really suitable for an NRT environment.
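
For reference, the 'empty commit' is nothing more exotic than a commit with no
pending documents - e.g. a minimal SolrJ sketch (the URL/core name is just a
placeholder; point it at the searching instance):

[code]
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class EmptyCommit {
    public static void main(String[] args) throws Exception {
        // Point this at the read-only (searching) instance, not the writer
        HttpSolrServer searcher = new HttpSolrServer("http://localhost:8983/solr/core1");

        // No documents are added - the commit just makes the searcher re-open,
        // pick up the writer's changes and re-warm its caches.
        searcher.commit();
    }
}
[/code]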

HTH,
Peter



On Tue, Jun 4, 2013 at 11:23 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Replication is fine, I am going to use it, but I wanted it for instances
 *distributed* across several (physical) machines - but here I have one
 physical machine, it has many cores. I want to run 2 instances of solr
 because I think it has these benefits:

 1) I can give less RAM to the writer (4GB), and use more RAM for the
 searcher (28GB)
 2) I can deactivate warming for the writer and keep it for the searcher
 (this considerably speeds up indexing - each time we commit, the server is
 rebuilding a citation network of 80M edges)
 3) saving disk space and better OS caching (OS should be able to use more
 RAM for the caching, which should result in faster operations - the two
 processes are accessing the same index)

 Maybe I should just forget it and go with the replication, but it doesn't
 'feel right' IFF it is on the same physical machine. And Lucene
 specifically has a method for discovering changes and re-opening the index
 (DirectoryReader.openIfChanged)

 Am I not seeing something?

 roman



 On Tue, Jun 4, 2013 at 5:30 PM, Jason Hellman 
 jhell...@innoventsolutions.com wrote:

  Roman,
 
  Could you be more specific as to why replication doesn't meet your
  requirements?  It was geared explicitly for this purpose, including the
  automatic discovery of changes to the data on the index master.
 
  Jason
 
  On Jun 4, 2013, at 1:50 PM, Roman Chyla roman.ch...@gmail.com wrote:
 
   OK, so I have verified the two instances can run alongside, sharing the
   same datadir
  
   All update handlers are unaccessible in the read-only master
  
   updateHandler class=solr.DirectUpdateHandler2
   enable=${solr.can.write:true}
  
   java -Dsolr.can.write=false .
  
   And I can reload the index manually:
  
   curl 
  
 
 http://localhost:5005/solr/admin/cores?wt=json&action=RELOAD&core=collection1
   
  
   But this is not an ideal solution; I'd like for the read-only server to
   discover index changes on its own. Any pointers?
  
   Thanks,
  
roman
  
  
   On Tue, Jun 4, 2013 at 2:01 PM, Roman Chyla roman.ch...@gmail.com
  wrote:
  
   Hello,
  
   I need your expert advice. I am thinking about running two instances
 of
   solr that share the same datadirectory. The *reason* being: indexing
   instance is constantly building cache after every commit (we have a
 big
   cache) and this slows it down. But indexing doesn't need much RAM,
 only
  the
   search does (and server has lots of CPUs)
  
   So, it is like having two solr instances
  
   1. solr-indexing-master
   2. solr-read-only-master
  
   In the solrconfig.xml I can disable update components, It should be
  fine.
   However, I don't know how to 'trigger' index re-opening on (2) after
 the
   commit happens on (1).
  
   Ideally, the second instance could monitor the disk and re-open disk
  after
   new files appear there. Do I have to implement custom
  IndexReaderFactory?
   Or something else?
  
   Please note: I know about the replication, this usecase is IMHO
 slightly
   different - in fact, write-only-master (1) is also a replication
 master
  
   Googling turned out only this
   http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/71912 -
  no
   pointers there.
  
   But If I am approaching the problem wrongly, please don't hesitate to
   're-educate' me :)
  
   Thanks!
  
roman
  
 
 



Re: Sharing index data between two Solr instances

2013-05-10 Thread Peter Sturge
Hello Milen,

We do something very similar to this, except we use separate processes on
the same machine for the writer and reader. We do this so we can tune
caches etc. to optimize for each, and still use the same index files. On MP
machines, this works very well.
If you've got 2 separate machines, I would have thought replication would
be the way to go, as it performs the necessary syncronization for you.
If you do share the same index files between 2 instances, you need to be
aware of locking/contention issues (which it sounds like you are aware),
and if they're on separate machines, you'll likely need some superfast
shared disk channel (FC SAN or similar) to keep performance up (in our
experience, Solr works best with fast local-attached storage - e.g. SSD or
15k SAS drives rather than SAN, and definitely not iSCSI or NAS). In order
for the read-only instance to take the changes made by the writing
instance, it will need to do an empty commit (i.e. no docs to commit - just
auto-warming caches, readers etc.).
For us, as our writer is constantly writing, we do a timed refresh on the
read-only instance, but for separate machines you could use an RPC
mechanism between the two instances - again though, replication already
does all this. Have you considered using replication?
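
For what it's worth, our 'timed refresh' is just a scheduled empty commit
against the read-only instance - roughly along these lines (a SolrJ sketch
only; the URL and interval are made up):

[code]
import org.apache.solr.client.solrj.impl.HttpSolrServer;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TimedRefresh {
    public static void main(String[] args) {
        // The read-only instance that needs to see the writer's changes
        final HttpSolrServer searcher = new HttpSolrServer("http://localhost:8080/solr");

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                try {
                    // Empty commit: no docs, just re-opens searchers and re-warms caches
                    searcher.commit();
                } catch (Exception e) {
                    // Log and carry on - the next refresh will try again
                    e.printStackTrace();
                }
            }
        }, 60, 60, TimeUnit.SECONDS);   // refresh every 60 seconds (tune to taste)
    }
}
[/code]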

Thanks,
Peter



On Fri, May 10, 2013 at 4:14 PM, milen.ti...@materna.de wrote:

 Hello together!

 I've been googleing on this topic but still couldn't find a definitive
 answer to my question.

 We have a setup of two machines both running Solr 4.2 within Tomcat. We
 are considering sharing the index data between both webapps. One of the
 machines will be configured to update the index periodically, the other one
 will be accessing it read-only. Using native locking on a network-mounted
 NTFS, is it possible for the reader to detect when new index data has been
 imported or do we need to signal it from the updating webapp and make a
 commit in order to open a new reader with the updated content?

 Thanks in advance!

 Milen Tilev
 Master of Science
 Softwareentwickler
 Business Unit Information
 

 MATERNA GmbH
 Information  Communications

 Voßkuhle 37
 44141 Dortmund
 Deutschland

 Telefon: +49 231 5599-8257
 Fax: +49 231 5599-98257
 E-Mail: milen.ti...@materna.demailto:milen.ti...@materna.de

 www.materna.dehttp://www.materna.de/ | Newsletter
 http://www.materna.de/newsletter | Twitter
 http://twitter.com/MATERNA_GmbH | XING
 http://www.xing.com/companies/MATERNAGMBH | Facebook
 http://www.facebook.com/maternagmbh
 

 Sitz der MATERNA GmbH: Voßkuhle 37, 44141 Dortmund
 Geschäftsführer: Dr. Winfried Materna, Helmut an de Meulen, Ralph Hartwig
 Amtsgericht Dortmund HRB 5839




Re: Scaling Solr on VMWare

2013-04-17 Thread Peter Sturge
Hi,

We have run solr in VM environments extensively (3.6 not Cloud, but the
issues will be similar).
There are some significant things to be aware of when running Solr in a
virtualized environment (these can be equally true with Hyper-V and Xen as
well):
If you're doing heavy indexing, the networking can be a real bottleneck,
depending on the environment.
If you're using a virtual cluster, and you have other VMs that use lots of
network and/or CPU (e.g. a SQL Server, email etc.), you will encounter
performance issues (note: it's generally a good idea to tie a Solr instance
to a physical machine in the cluster).
Using virtual switches can, in some instances, create network bottlenecks,
particularly with high input indexing. There are myriad scenarios for
vSwitches, so it's not practical to go into all the possible scenarios here
- but the general rule is - be careful!
CPU context switching can have a huge impact on Solr, so assigning CPUs,
cores and virtual cores needs some care to ensure there's enough CPU
resource to get the jobs done, but not so many that the VM is continually
waiting for cores to become free (VMWare will wait until all configured
core slots are free before proceeding with a request).

The above scratches the surface of running multi-threaded production
applications like Solr in a virtual environment, but hopefully it can
provide a starting point.

Thanks,
Peter



On Wed, Apr 17, 2013 at 11:56 AM, adfel70 adfe...@gmail.com wrote:

 Hi
 We are currently considering running solr cloud on vmware.
 Di you have any insights regarding the issue you encountered and generally
 regarding using virtual machines instead of physical machines for solr
 cloud?


 Frank Wennerdahl wrote
  Hi Otis and thanks for your response.
 
  We are indeed suspecting that the problem with only 2 cores being used
  might
  be caused by the virtual environment. We're hoping that someone with
  experience of running Solr on VMWare might know more about this or the
  other
  issues we have.
 
  The servlet we're running is the bundled Jetty servlet (Solr version
 4.1).
  As we have seen a higher number of CPU cores utilized when sending data
 to
  Solr locally it seems that the servlet isn't restricting the number of
  threads used.
 
  Frank
 
  -Original Message-
  From: Otis Gospodnetic [mailto:

  otis.gospodnetic@

  ]
  Sent: den 26 mars 2013 05:09
  To:

  solr-user@.apache

  Subject: Re: Scaling Solr on VMWare
 
  Hi Frank,
 
  If your servlet container had a crazy low setting for the max number of
  threads I think you would see the CPU underutilized.  But I think you
  would
  also see errors in on the client about connections being requested.
  Sounds
  like a possibly VM issue that's not Solr-specific...
 
  Otis
  --
  Solr  ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
  On Mon, Mar 25, 2013 at 1:18 PM, Frank Wennerdahl
  lt;

  frank.wennerdahl@

  gt; wrote:
  Hi.
 
 
 
  We are currently benchmarking our Solr setup and are having trouble
  with scaling hardware for a single Solr instance. We want to
  investigate how one instance scales with hardware to find the optimal
  ratio of hardware vs sharding when scaling. Our main problem is that
  we cannot identify any hardware limitations, CPU is far from maxed
  out, disk I/O is not an issue as far as we can see and there is plenty
 of
  RAM available.
 
 
 
  In short we have a couple of questions that we hope someone here could
  help us with. Detailed information about our setup, use case and
  things we've tried is provided below the questions.
 
 
 
  Questions:
 
  1.   What could cause Solr to utilize only 2 CPU cores when sending
  multiple update requests in parallel in a VMWare environment?
 
  2.   Is there a software limit on the number of CPU cores that Solr
  can
  utilize while indexing?
 
  3.   Ruling out network and disk performance, what could cause a
  decrease in indexing speed when sending data over a network as opposed
  to sending it from the local machine?
 
 
 
  We are running on three cores per Solr instance, however only one core
  receives any non-trivial load. We are using VMWare (ESX 5.0) virtual
  machines for hosting Solr and a QNAP NAS containing 12 HDDs in a RAID5
  setup for storage. Our data consists of a huge amount of small-sized
  documents.
  When indexing we are using Solr's javabin format (although not through
  Solrj, we have implemented the format in C#/.NET) and our batch size
  is currently 1000 documents. The actual size of the data varies, but
  the batches we have used range from approximately 450KB to 1050KB.
  We're sending these batches to Solr in parallel using a number of send
  threads.
 
 
 
  There are two issues that we've run into:
 
  1.   When sending data from one VM to Solr on another VM we observed
  that Solr did not seem to utilize CPU cores properly. The Solr VM had
  8 vCPUs available and we were using 4 threads sending data in
  parallel. We saw a low 

Re: Selective field level security

2012-09-17 Thread Peter Sturge
Hi,

Solr doesn't have any built-in mechanism for document/field level security
- basically it's delegated to the container to provide security, but this
of course won't apply to specific documents and/or fields.
There are a lot of ways to skin this cat, some bits of which have been
covered by your message.

The trickiest thing about this isn't so much adding the indexed
fields etc., but rather how you plan to determine who the 'searching user'
actually is.
This task can seem not too bad at first, then all sorts of worms start
streaming out of the can (e.g. how to avoid spoofing/identity theft).
Once your app is confident it has a bona-fide user, you then need a way
to map the user to a set of fields/docs/permissions etc. that he/she
can/can't look at.

There are plenty of approaches - mainly driven by:
 * where your original data lives (outside of Solr? does it still exist?
etc)
 * is there an external ACL mechanism that you can use (e.g. file system
permissions)
 * how do you manage users? (e.g. internal emplyoyees? public website
account holders? anyone?)

Two Jiras of note might help you in your quest:
SOLR-1872   (a good approach if you don't have access to the original
data at search-time)
SOLR-1895   (uses ManifoldCF - good if you have access to original data
and use its permissions - e.g. file system ACL)
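
If you do go the pure query-side route (your second option with the separate
_INTERNAL fields), the client piece is really just choosing which fields to
search based on the user's role. A rough SolrJ sketch, using the field names
from your example and assuming edismax - note this does nothing to stop a
user who can reach Solr directly, which is where the identity/ACL points
above come in:

[code]
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class FieldVisibilitySketch {

    // Build a query that only searches the fields this user is allowed to see.
    static SolrQuery buildQuery(String userInput, boolean superUser) {
        SolrQuery q = new SolrQuery(userInput);
        q.set("defType", "edismax");

        String qf = "field1 field2 field3";           // everyone can search these
        if (superUser) {
            qf += " field1_INTERNAL field2_INTERNAL"; // super users also get the private copies
        }
        q.set("qf", qf);
        return q;
    }

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        System.out.println(solr.query(buildQuery("hidden security", false)).getResults().getNumFound());
        System.out.println(solr.query(buildQuery("hidden security", true)).getResults().getNumFound());
    }
}
[/code]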

HTH,
Peter





On Mon, Sep 17, 2012 at 7:44 PM, Nalini Kartha nalinikar...@gmail.comwrote:

 Hi,

 We're trying to push some security related info into the index which will
 control which users can search certain fields and we're wondering what the
 best way to accomplish this is.

 Some records that are being indexed and searched can have certain fields
 marked as private. When a field is marked as private, some querying users
 should not see/search on it whereas some super users can.

 Here's the solutions we're considering -

- Index a separate boolean value into a new _INTERNAL field to indicate
if the corresponding field value is marked private or not and include a
filter in the query when the searching user is not a super user.

 So for eg., consider that a record can contain 3 fields - field[123] where
 field1 and field2 can be marked as private but field3 cannot.

 Record A has only field1 marked as private, record B has both field1 and
 field2 marked as private.

 When we index these records here's what we'd end up with in the index -

 Record A -
 field1:something,  field1_INTERNAL:1, field2:something,
 field2_INTERNAL:0, field3:something
 Record B -
 field1:something,  field1_INTERNAL:1, field2:something,
 field2_INTERNAL:1, field3:something

 If the searching user is NOT a super user then the query (let's say it's
 'hidden security') needs to look like this-

 ((field3:hidden) OR (field1:hidden AND field1_INTERNAL:0) OR (field2:hidden
 AND field2_INTERNAL:0)) AND ((field3:security) OR (field1:security AND
 field1_INTERNAL:0) OR (field2:security AND field2_INTERNAL:0))

 Manipulating the query this way seems painful and error prone so we're
 wondering if Solr provides anything out of the box that would help with
 this?


- Index the private values themselves into a separate _INTERNAL field
and then determine which fields to query depending on the visibility of
 the
searching user.

 So using the example from above, here's what the indexed records would look
 like -

 Record A - field1_INTERNAL:something, field2:something,
  field3:something
 Record B - field1_INTERNAL:something, field2_INTERNAL:something,
 field3:something

 If the searching user is NOT a super user then the query just needs to be
 against the regular fields whereas if the searching user IS a super user,
 the query needs to be against BOTH the regular and INTERNAL fields.

 The issue with this solution is that since the number of docs that include
 the INTERNAL fields is going to be much fewer we're wondering if relevancy
 would be messed up when we're querying both regular and internal fields for
 super users?

 Thoughts?

 Thanks,
 Nalini



Re: solr 1872

2012-07-31 Thread Peter Sturge
Hi,

The acl file usually goes in the conf folder, so if you specify different
conf folders for each core, you could have a different one for each.
The acl file can also be specified in solrconfig.xml, under the
SolrACLSecurity section:
  <str name="config-file">acl.xml</str>
If you use a different solrconfig.xml for each core, you could specify
different files that way.

Keep in mind that if you just need to control core access, you can use
jetty realms or similar acl mechanism for your container.
SolrACLSecurity is for controlling fine-grained access to data within a
core.

Thanks,
Peter



On Tue, Jul 31, 2012 at 5:50 AM, Sujatha Arun suja.a...@gmail.com wrote:

 Peter,

 In a multicore environment , where should the acl file reside , under the
 conf directory ,Can I use a acl file per core ?

 Regards
 Sujatha

 On Tue, Jul 31, 2012 at 9:15 AM, Sujatha Arun suja.a...@gmail.com wrote:

  Renamed to zip and worked fine,thanks
 
  Regards
  Sujatha
 
 
  On Tue, Jul 31, 2012 at 9:15 AM, Sujatha Arun suja.a...@gmail.com
 wrote:
 
  thanks ,was looking to the rar file for instructions on set up .
 
  Regards
  Sujatha
 
 
  On Tue, Jul 31, 2012 at 1:07 AM, Peter Sturge peter.stu...@gmail.com
 wrote:
 
  I can access the rar fine with WinRAR, so should be ok, but yes, it
 might
  be in zip format.
  In any case, better to use the slightly later version --
  SolrACLSecurity.java
  26kb 12 Apr 2010 10:35
 
  Thanks,
  Peter
 
 
 
  On Mon, Jul 30, 2012 at 7:50 PM, Sujatha Arun suja.a...@gmail.com
  wrote:
 
   I am uable to use the rar file from the site
   https://issues.apache.org/jira/browse/SOLR-1872.
  
   When I try to open it,I get the message 'SolrACLSecurity.rar is not
 RAR
   archive.
  
   Is the file there at this link?
  
   Regards
   Sujatha
  
 
 
 
 



Re: solr 1872

2012-07-30 Thread Peter Sturge
I can access the rar fine with WinRAR, so should be ok, but yes, it might
be in zip format.
In any case, better to use the slightly later version -- SolrACLSecurity.java
26kb 12 Apr 2010 10:35

Thanks,
Peter



On Mon, Jul 30, 2012 at 7:50 PM, Sujatha Arun suja.a...@gmail.com wrote:

 I am uable to use the rar file from the site
 https://issues.apache.org/jira/browse/SOLR-1872.

 When I try to open it,I get the message 'SolrACLSecurity.rar is not RAR
 archive.

 Is the file there at this link?

 Regards
 Sujatha



Re: Determining which shard is failing using partialResults / some other technique?

2012-01-15 Thread Peter Sturge
Hi,

There are a couple ways of handling this.

One is to do it from the 'client' side - i.e. do a Solr ping to each
shard beforehand to find out which/if any shards are unavailable. This
may not always work if you use forwarders/proxies etc.
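
A rough SolrJ sketch of that pre-flight check (the shard URLs are placeholders;
on 3.x SolrJ the server class is CommonsHttpSolrServer rather than
HttpSolrServer):

[code]
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

import java.util.ArrayList;
import java.util.List;

public class ShardPingCheck {
    public static void main(String[] args) throws Exception {
        String[] shardUrls = {
            "http://localhost:8983/solr/core1",
            "http://localhost:8983/solr/core2"
        };

        List<String> deadShards = new ArrayList<String>();
        for (String shardUrl : shardUrls) {
            SolrServer solr = new CommonsHttpSolrServer(shardUrl);
            try {
                // ping() goes to the /admin/ping handler; any exception or
                // non-zero status means we leave that shard out of the request
                if (solr.ping().getStatus() != 0) {
                    deadShards.add(shardUrl);
                }
            } catch (Exception e) {
                deadShards.add(shardUrl);
            }
        }
        System.out.println("Unavailable shards: " + deadShards);
    }
}
[/code]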

What we do is add the name of all failed shards to the
CommonParams.FAILED_SHARDS parameter in the response header (if
partialResults=true), by retrieving the current list (if any) and
appending:

Excerpt from SearchHandler.java : handleRequestBody():
[code]
  log.info("Waiting for shard replies...");
  // now wait for replies, but if anyone puts more requests on
  // the outgoing queue, send them out immediately (by exiting
  // this loop)
  while (rb.outgoing.size() == 0) {
    ShardResponse srsp = comm.takeCompletedOrError();
    if (srsp == null) break;  // no more requests to wait for

    // If any shard does not respond (ConnectException) we respond with
    // other shards and set partialResults to true
    for (ShardResponse shardRsp : srsp.getShardRequest().responses) {
      Throwable th = shardRsp.getException();
      if (th != null) {
        log.info("Got shard exception for: " + srsp.getShard()
            + " : " + th.getClass().getName() + " cause: " + th.getCause());
        if (th instanceof SolrServerException && th.getCause() instanceof Exception) {
          // Was there an exception and return partial results is false?
          // If so, abort everything and rethrow
          if (failOnShardFailure) {
            log.info("Not set for partial results. Aborting...");
            comm.cancelAll();
            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, th);
          }

          if (rsp.getResponseHeader().get(CommonParams.FAILED_SHARDS) == null) {
            rsp.getResponseHeader().add(CommonParams.FAILED_SHARDS,
                shardRsp.getShard() + "|" +
                (srsp.getException() != null && srsp.getException().getCause() != null ?
                    srsp.getException().getCause().getClass().getSimpleName() :
                    (th instanceof SolrServerException && th.getCause() != null ?
                        th.getCause().getClass().getSimpleName() :
                        th.getClass().getSimpleName())));
          } else {
            // Append the name of the failed shard, delimiting multiple failed shards with |
            String prslt =
                rsp.getResponseHeader().get(CommonParams.FAILED_SHARDS).toString();
            prslt += ";" + shardRsp.getShard() + "|" +
                (srsp.getException() != null && srsp.getException().getCause() != null ?
                    srsp.getException().getCause().getClass().getSimpleName() :
                    (th instanceof SolrServerException && th.getCause() != null ?
                        th.getCause().getClass().getSimpleName() :
                        th.getClass().getSimpleName()));
            rsp.getResponseHeader().remove(CommonParams.FAILED_SHARDS);
            rsp.getResponseHeader().add(CommonParams.FAILED_SHARDS, prslt);
          }
          log.error("Connection to shard [" + shardRsp.getShard()
              + "] did not succeed", th.getCause());
        } else {
          comm.cancelAll();
          if (th instanceof SolrException) {
            throw (SolrException) th;
          } else {
            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                srsp.getException());
          }
        }
      }
    }
    rb.finished.add(srsp.getShardRequest());
[/code]

[Note we also log the failure to the [local] server's log]
Your client can then extract the CommonParams.FAILED_SHARDS parameter
and display and/or process accordingly.
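
On the client, that just means reading the response header back - a rough
SolrJ sketch (the header key below is an assumption: use whatever string your
patched CommonParams.FAILED_SHARDS constant actually maps to; the URLs are
placeholders):

[code]
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FailedShardsClient {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr/core1");

        SolrQuery q = new SolrQuery("*:*");
        q.set("shards", "localhost:8983/solr/core1,localhost:8983/solr/core2");
        q.set("partialResults", "true");   // the flag described above (custom build)

        QueryResponse rsp = solr.query(q);

        // Key name is an assumption - match it to your CommonParams.FAILED_SHARDS value
        Object failed = rsp.getResponseHeader().get("failedShards");
        if (failed != null) {
            System.out.println("These shards did not respond: " + failed);
        }
    }
}
[/code]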


Re: Faceting Question

2012-01-15 Thread Peter Sturge
Hi,

It's quite coincidental that I was just about to ask this very
question to the forum experts.
I think this is the same sort of thing Jamie was asking about. (the
only difference in my question is that the values won't be known at
query time)

Is it possible to create a request that will return *multiple* facet
ranges - 1 for each value of a given field? (ideally, up to some
facet.limit)

For example: Let's say you query: user:* AND timestamp:[yesterday TO
now], with a facet field of 'user'.
Let's now say the faceting returns a count of 50, and there are 5
different values for 'user' - let's say user1, user2, user3, user4 and
user5 (50 things happened over the last 24 hours by 5 different
users).

Is it possible, in a single query, to get back 5 facet ranges over the
24hr period - one for each user? Or, do you simply have to do the
search, and then iterate through each value returned and date facet on
that?

Pivot faceting can give results for combinations of multiple facets,
but not ranges.

Thanks,
Peter




On Sun, Jan 15, 2012 at 3:30 PM, Lee Carroll
lee.a.carr...@googlemail.com wrote:
  Does
 that make more sense?

 Ah I see.

 I'm not certain but take a look at pivot faceting

 https://issues.apache.org/jira/browse/SOLR-792

 cheers lee c


Highlighting and regex

2011-11-17 Thread Peter Sturge
Hi,

Been wrestling with a question on highlighting (or not) - perhaps
someone can help?

The question is this:
Is it possible, using highlighting or perhaps another more suited
component, to return words/tokens from a stored field based on a
regular expression's capture groups?

What I was kind of thinking would happen with highlighting regex
(hl.regex.pattern) - but doesn't seem to (although I am a highlighting
novice), is that capture groups specified in a regex would be
highlighted.

For example:
1) given a field called
desc

2) with a stored value of:
the quick brown fox jumps over the lazy dog

3) specify a regex of:
   .*quick\s(\S+)\sfox.+\sthe\s(\S+)\sdog.*

4) get in the response:
  <em>brown</em> and
  <em>lazy</em>
either as highlighting or through some other means.

(I find that using hl.regex.pattern on the above yields: <em>the quick
brown fox jumps over the lazy dog</em>)

I'm guessing that I'm misinterpreting the functionality offered by
highlighting, but I couldn't find much on the subject in the way of
usage docs.

I could write a custom highlighter or SearchComponent plugin that
would do this, but is there some mechanism out there that can do this
sort of thing already?
It wouldn't necessarily have to be based on regex, but regex tends to
be the de-facto standard for doing capture group token matching (not
sure how Solr syntax would do something similar unless there were
multiples, maybe?).
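
(For clarity, this is the behaviour I'm after, expressed as plain client-side
Java against the stored value - obviously the point is to have Solr do it
server-side:)

[code]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupExample {
    public static void main(String[] args) {
        String stored = "the quick brown fox jumps over the lazy dog";
        Pattern p = Pattern.compile(".*quick\\s(\\S+)\\sfox.+\\sthe\\s(\\S+)\\sdog.*");

        Matcher m = p.matcher(stored);
        if (m.matches()) {
            // Only the capture groups come back, not the whole match
            System.out.println(m.group(1));   // brown
            System.out.println(m.group(2));   // lazy
        }
    }
}
[/code]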

Any insights greatly appreciated.

Many thanks,
Peter


Re: SSD experience

2011-08-23 Thread Peter Sturge
Just to add a few cents worth regarding SSD...

We use Vertex SSD drives for storing indexes, and wow, they really
scream compared to SATA/SAS/SAN. As we do some heavy commits, it's the
commit times where we see the biggest performance boost.
In tests, we found that locally attached 15k SAS drives are the next
best for performance. SANs can work well, but should be FibreChannel.
IP-based SANs are ok, as long as they're not heavily taxed by other,
non-Solr disk I/O.
NAS is far and away the poorest performing - not recommended for real indexes.

HTH,
Peter



On Mon, Aug 22, 2011 at 3:54 PM, Rich Cariens richcari...@gmail.com wrote:
 Ahoy ahoy!

 Does anyone have any experiences or stories they can share with the list
 about how SSDs impacted search performance for better or worse?

 I found a Lucene SSD performance benchmark
 dochttp://wiki.apache.org/lucene-java/SSD_performance?action=AttachFiledo=viewtarget=combined-disk-ssd.pdfbut
 the wiki engine is refusing to let me view the attachment (I get You
 are not allowed to do AttachFile on this page.).

 Thanks in advance!



Re: SSD experience

2011-08-23 Thread Peter Sturge
The Solr index directory lives directly on the SSD (running on Windows
- where the word symlink does not appear in any dictionary within a
100 mile radius of Redmond :-)

Currently, the main limiting factors of SSD are cost and size. SSDs
will get larger over time. Splitting indexes across multiple shards on
multiple SSDs is a wonderfully fast, if not slightly extravagant
method of getting excellent IO performance.
Regarding cost, I've seen many organizations where the use of fast
SANs costs at least the same if not more per GB of storage than SSD.
Hybrid drives can be a good cost-effective alternative as well.

Peter



On Tue, Aug 23, 2011 at 3:29 PM, Gerard Roos l...@gerardroos.nl wrote:
 Interesting. Do you make a symlink to the indexes or is the whole Solr 
 directory on SSD?

 thanks,
 Gerard

 Op 23 aug. 2011, om 12:53 heeft Peter Sturge het volgende geschreven:

 Just to add a few cents worth regarding SSD...

 We use Vertex SSD drives for storing indexes, and wow, they really
 scream compared to SATA/SAS/SAN. As we do some heavy commits, it's the
 commit times where we see the biggest performance boost.
 In tests, we found that locally attached 15k SAS drives are the next
 best for performance. SANs can work well, but should be FibreChannel.
 IP-based SANs are ok, as long they're not heavily taxed by other,
 non-Solr disk I/O.
 NAS is far and away the poorest performing - not recommended for real 
 indexes.

 HTH,
 Peter



 On Mon, Aug 22, 2011 at 3:54 PM, Rich Cariens richcari...@gmail.com wrote:
 Ahoy ahoy!

 Does anyone have any experiences or stories they can share with the list
 about how SSDs impacted search performance for better or worse?

 I found a Lucene SSD performance benchmark
 dochttp://wiki.apache.org/lucene-java/SSD_performance?action=AttachFiledo=viewtarget=combined-disk-ssd.pdfbut
 the wiki engine is refusing to let me view the attachment (I get You
 are not allowed to do AttachFile on this page.).

 Thanks in advance!







Re: SSD experience

2011-08-23 Thread Peter Sturge
Ah yes, the beautiful new links in Windows 6. These are 'symlinks' in
name only - they operate *very* differently from Linux/UNIX symlinks, and
sadly, not quite so well. NTFS is one of the best things about
Windows, but its architecture is not well suited to 'on-the-fly'
redirection, as there are many items 'in the chain' to cater for at
various points - e.g. driver stack, SID context, SACL/DACLs, DFS,
auditing etc. This makes links on NTFS much more difficult to manage,
and it is common to encounter all manner of strange behaviour when
using them.


On Tue, Aug 23, 2011 at 5:34 PM, Sanne Grinovero
sanne.grinov...@gmail.com wrote:
 Indeed I would never actually use it, but symlinks do exist on Windows.

 http://en.wikipedia.org/wiki/NTFS_symbolic_link

 Sanne

 2011/8/23 Peter Sturge peter.stu...@gmail.com:
 The Solr index directory lives directly on the SSD (running on Windows
 - where the word symlink does not appear in any dictionary within a
 100 mile radius of Redmond :-)

 Currently, the main limiting factors of SSD are cost and size. SSDs
 will get larger over time. Splitting indexes across multiple shards on
 multiple SSDs is a wonderfully fast, if not slightly extravagant
 method of getting excellent IO performance.
 Regarding cost, I've seen many organizations where the use of fast
 SANs costs at least the same if not more per GB of storage than SSD.
 Hybrid drives can be a good cost-effective alternative as well.

 Peter



 On Tue, Aug 23, 2011 at 3:29 PM, Gerard Roos l...@gerardroos.nl wrote:
 Interesting. Do you make a symlink to the indexes or is the whole Solr 
 directory on SSD?

 thanks,
 Gerard

 Op 23 aug. 2011, om 12:53 heeft Peter Sturge het volgende geschreven:

 Just to add a few cents worth regarding SSD...

 We use Vertex SSD drives for storing indexes, and wow, they really
 scream compared to SATA/SAS/SAN. As we do some heavy commits, it's the
 commit times where we see the biggest performance boost.
 In tests, we found that locally attached 15k SAS drives are the next
 best for performance. SANs can work well, but should be FibreChannel.
 IP-based SANs are ok, as long they're not heavily taxed by other,
 non-Solr disk I/O.
 NAS is far and away the poorest performing - not recommended for real 
 indexes.

 HTH,
 Peter



 On Mon, Aug 22, 2011 at 3:54 PM, Rich Cariens richcari...@gmail.com 
 wrote:
 Ahoy ahoy!

 Does anyone have any experiences or stories they can share with the list
 about how SSDs impacted search performance for better or worse?

 I found a Lucene SSD performance benchmark
 dochttp://wiki.apache.org/lucene-java/SSD_performance?action=AttachFiledo=viewtarget=combined-disk-ssd.pdfbut
 the wiki engine is refusing to let me view the attachment (I get You
 are not allowed to do AttachFile on this page.).

 Thanks in advance!









Re: exceeded limit of maxWarmingSearchers ERROR

2011-08-14 Thread Peter Sturge
It's worth noting that the fast commit rate is only an indirect part
of the issue you're seeing. The error comes from cache warming - a
consequence of committing - so it's not the fault of committing directly.
It's well worth having a good close look at exactly what your caches
are doing when they are warmed, and trying as much as possible to
remove any unneeded facet/field caching etc.
The time it takes to repopulate the caches causes the error - if it's
slower than the commit rate, you'll get into the 'try again later'
spiral.

There are a number of ways to help mitigate this - NRT is certainly the
[hopefully near] future for this. Other strategies
include distributed search/cloud/ZK - splitting the index into logical
shards, so your commits and their associated caches are smaller and
more targeted. You can also use two Solr instances - one optimized for
writes/commits, one for reads (write commits are async of the 'read'
instance), plus there are customized solutions like RankingAlgorithm,
Zoie etc.


On Sun, Aug 14, 2011 at 2:47 AM, Naveen Gupta nkgiit...@gmail.com wrote:
 Hi,

 Most of the settings are default.

 We have single node (Memory 1 GB, Index Size 4GB)

 We have a requirement where we are doing very fast commit. This is kind of
 real time requirement where we are polling many threads from third party and
 indexes into our system.

 We want these results to be available soon.

 We are committing for each user (may have 10k threads and inside that 1
 thread may have 10 messages). So overall documents per user will be having
 around .1 million (10)

 Earlier we were using commit Within  as 10 milliseconds inside the document,
 but that was slowing the indexing and we were not getting any error.

 As we removed the commit Within, indexing became very fast. But after that
 we started experiencing in the system

 As i read many forums, everybody told that this is happening because of very
 fast commit rate, but what is the solution for our problem?

 We are using CURL to post the data and commit

 Also till now we are using default solrconfig.

 Aug 14, 2011 12:12:04 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1052)
        at
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:424)
        at
 org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
        at
 org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:177)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
        at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
        at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
        at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
        at
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:662)



Re: LockObtainFailedException

2011-08-11 Thread Peter Sturge
Hi,

When you get this exception with no other error or explanation in
the logs, this is almost always because the JVM has run out of memory.
Have you checked/profiled your mem usage/GC during the stream operation?



On Thu, Aug 11, 2011 at 3:18 AM, Naveen Gupta nkgiit...@gmail.com wrote:
 Hi,

 We are doing streaming update to solr for multiple user,

 We are getting


 Aug 10, 2011 11:56:55 AM org.apache.solr.common.SolrException log

 SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
 out: NativeFSLock@/var/lib/solr/data/index/write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:84)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1097)
        at
 org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:83)
        at
 org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102)
        at
 org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:174)
        at
 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:222)
        at
 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
        at
 org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
        at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
        at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
        at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
        at
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint

 Aug 10, 2011 12:00:16 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
 out: NativeFSLock@/var/lib/solr/data/index/write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:84)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1097)
        at
 org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:83)
        at
 org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102)
        at
 org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:174)
        at
 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:222)
        at
 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
        at
 org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
        at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
        at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
        at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at
 

Re: LockObtainFailedException

2011-08-11 Thread Peter Sturge
Optimizing indexing time is a very different question.
I'm guessing the 3 mins+ time you refer to is the commit time.

There are a whole host of things to take into account regarding
indexing, like: number of segments, schema, how many fields, storing
fields, omitting norms, caching, autowarming, search activity etc. -
the list goes on...
The trouble is, you can look at 100 different Solr installations with
slow indexing, and find 200 different reasons why each is slow.

The best place to start is to get a full understanding of precisely
how your data is being stored in the index, starting with adding docs,
going through your schema, Lucene segments, solrconfig.xml etc,
looking at caches, commit triggers etc. - really getting to know how
each step is affecting performance.
Once you really have a handle on all the indexing steps, you'll be
able to spot the bottlenecks that relate to your particular
environment.

An index of 4.5GB isn't that big (but the number of documents tends to
have more of an effect than the physical size), so the bottleneck(s)
should be findable once you trace through the indexing operations.
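
On the client side, the pattern that usually helps most is batching the
adds and committing once at the end rather than per batch. A rough SolrJ
sketch of that (the URL, field names and sizes are placeholders only, not
your actual setup):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Queue up to 1000 docs and stream them with 4 background threads,
        // so the client isn't blocked on every single add.
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 15000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("body", "example content " + i);
            batch.add(doc);
            if (batch.size() == 1000) {       // send in chunks, not one-by-one
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();                      // one commit at the end, not per batch
    }
}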



On Thu, Aug 11, 2011 at 1:02 PM, Naveen Gupta nkgiit...@gmail.com wrote:
 Yes this was happening because of JVM heap size

 But the real issue is that if our index size is growing (very high)

 then indexing time is taking very long (using streaming)

 earlier for indexing 15,000 docs at a time (commit after 15000 docs) , it
 was taking 3 mins 20 secs time,

 after deleting the index data, it is taking 9 secs

 What would be approach to have better indexing performance as well as index
 size should also at the same time.

 The index size was around 4.5 GB

 Thanks
 Naveen

 On Thu, Aug 11, 2011 at 3:47 PM, Peter Sturge peter.stu...@gmail.comwrote:

 Hi,

 When you get this exception with no other error or explananation in
 the logs, this is almost always because the JVM has run out of memory.
 Have you checked/profiled your mem usage/GC during the stream operation?



 On Thu, Aug 11, 2011 at 3:18 AM, Naveen Gupta nkgiit...@gmail.com wrote:
  Hi,
 
  We are doing streaming update to solr for multiple user,
 
  We are getting
 
 
  Aug 10, 2011 11:56:55 AM org.apache.solr.common.SolrException log
 
  SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
 timed
  out: NativeFSLock@/var/lib/solr/data/index/write.lock
         at org.apache.lucene.store.Lock.obtain(Lock.java:84)
         at
 org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1097)
         at
  org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:83)
         at
 
 org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102)
         at
 
 org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:174)
         at
 
 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:222)
         at
 
 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
         at
  org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147)
         at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
         at
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
         at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
         at
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
         at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
         at
 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
         at
 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
         at
 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
         at
 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
         at
 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
         at
 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
         at
 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
         at
 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
         at
 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
         at
 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
         at org.apache.tomcat.util.net.JIoEndpoint
 
  Aug 10, 2011 12:00:16 PM org.apache.solr.common.SolrException log
  SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
 timed
  out: NativeFSLock@/var/lib/solr/data/index/write.lock
         at org.apache.lucene.store.Lock.obtain(Lock.java:84

Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-17 Thread Peter Sturge
You'll need to be a bit careful using joins, as the performance hit
can be significant if you have lots of cross-referencing to do, which
I believe you would given your scenario.

Your table could be set up to use the username as the key (for fast
lookup), then map each user to your own data class or collection or
similar to hold your other information: products, expiry etc.
By using your own data class, it's then easy to extend it later if you
want to add additional parameters (for example: HashMap<String,
MyDataClass>).

When a search comes in, the user is looked up to retrieve the data
class, then its contents (as defined by you) is examined and the query
is processed/filtered appropriately.

You'll need a bootstrap mechanism for populating the list in the first
place. One thing worth looking at is lazy loading - i.e. the first
time a user does a search (you lookup the user in the table, and it
isn't there), you load the data class (maybe from your DB, a file, or
index), then add it to the table. This is good if you have 10's of
thousands or millions of users, but only a handful are actually
searching, some perhaps very rarely.

If you do have millions of users, and your data class has heavy
requirements (e.g. many thousands of products + info etc.), you might
want to 'time-out' in-memory table entries, if the table gets really
huge - it depends on the usage of your system. (you can run a
synchronized cleanup thread to do this if you deemed it necessary).
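
A rough sketch of the kind of table I mean (all of the class and method
names below are made up, and the actual DB/file/index lookup is stubbed
out - it's just the lazy load plus the expiry check):

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Made-up holder for the per-user info (products, expiry, etc.)
class UserEntitlements {
    final Set<String> productIds;
    final long expiresAtMillis;

    UserEntitlements(Set<String> productIds, long expiresAtMillis) {
        this.productIds = productIds;
        this.expiresAtMillis = expiresAtMillis;
    }

    boolean isExpired() {
        return System.currentTimeMillis() > expiresAtMillis;
    }
}

class EntitlementTable {
    // username -> entitlements, loaded lazily the first time the user searches
    private final ConcurrentHashMap<String, UserEntitlements> byUser =
            new ConcurrentHashMap<String, UserEntitlements>();

    UserEntitlements get(String username) {
        UserEntitlements e = byUser.get(username);
        if (e == null || e.isExpired()) {
            e = loadFor(username);            // DB/file/index lookup goes here
            byUser.put(username, e);
        }
        return e;
    }

    private UserEntitlements loadFor(String username) {
        // Placeholder: fetch the user's purchased products and expiry from
        // wherever they live (DB, file, another index).
        return new UserEntitlements(Collections.<String>emptySet(), Long.MAX_VALUE);
    }
}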


On Fri, Jun 17, 2011 at 6:06 AM, Sujatha Arun suja.a...@gmail.com wrote:
 Alexey,

 Do you mean that we  have current Index as it is and have a separate core
 which  has only the user-id ,product-id relation and at while querying ,do a
 join between the two cores based on the user-id.


 This would involve us to Index/delete the product  as and when the user
 subscription for a product changes ,This would involve some amount of
 latency if the Indexing (we have a queue system for Indexing across the
 various instances) or deletion is delayed

 IF we want to go ahead with this solution ,We currently are using solr 1.3
 , so  is this functionality available as a patch for solr 1.3?Would it be
 possible to  do with a separate Index  instead of a core ,then I can create
 only one  Index common for all our instances and then use this instance to
 do the join.

 Thanks
 Sujatha

 On Thu, Jun 16, 2011 at 9:27 PM, Alexey Serba ase...@gmail.com wrote:

  So a search for a product once the user logs in and searches for only the
  products that he has access to Will translate to something like this .
 ,the
  product ids are obtained form the db  for a particular user and can run
  into  n  number.
 
  search term fq=product_id(100 10001  ..n number)
 
  but we are currently running into too many Boolean expansion error .We
 are
  not able to tie the user also into roles as each user is mainly any one
 who
  comes to site and purchases a product .

 I'm wondering if new trunk Solr join functionality can help here.

 * http://wiki.apache.org/solr/Join

 In theory you can index your products (product_id, ...) and
 user_id-product many-to-many relation (user_product_id, user_id) into
 signle/different cores and then do join, like
 f=search termsfq={!join from=product_id to=user_product_id}user_id:10101

 But I haven't tried that, so I'm just speculating.




Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-14 Thread Peter Sturge
Hi,

SOLR-1834 is good when the original documents' ACL is accessible.
SOLR-1872 is good where the usernames are persistent - neither of
these really fit your use case.
It sounds like you need more of an 'in-memory', transient access
control mechanism. Does the access have to exist beyond the user's
session (or the Solr vm session)?
Your best bet is probably something like a custom SearchComponent or
similar, that keeps track of user purchases, and either adjusts/limits
the query or the results to suit.
With your own module in the query chain, you can then decide when the
'expiry' is, and limit results accordingly.

SearchComponents are pretty easy to write and integrate. Have a look at:
   http://wiki.apache.org/solr/SearchComponent
for info on SearchComponent and its usage.
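
As a very rough sketch only (the request param name, the filter format and
the purchase lookup below are all placeholders, not working code for your
setup), such a component might start out like this:

import java.io.IOException;

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// Hypothetical component that narrows every query down to the products the
// current user still has (unexpired) access to.
public class PurchaseFilterComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // However you identify the caller - request param, container auth, etc.
        String user = rb.req.getParams().get("user");
        if (user == null) {
            return;
        }
        // Look up the user's current purchases wherever you keep them and
        // turn them into a filter, e.g. product_id:(100 101 102)
        String filter = buildProductFilter(user);
        if (filter != null) {
            ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
            params.add("fq", filter);
            rb.req.setParams(params);
        }
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // Nothing to do post-query in this sketch.
    }

    private String buildProductFilter(String user) {
        return null; // placeholder - consult your purchase/expiry store here
    }

    // SolrInfoMBean plumbing (exact methods vary a little between Solr versions)
    public String getDescription() { return "Filters results by user purchases"; }
    public String getSource() { return ""; }
    public String getSourceId() { return ""; }
    public String getVersion() { return "1.0"; }
}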




On Tue, Jun 14, 2011 at 8:18 AM, Sujatha Arun suja.a...@gmail.com wrote:
 Hello,


 Our Use Case is as follows

 Several solr webapps (one JVM) ,Each webapp catering to one client .Each
 client has their users who can purchase products from the  site .Once they
 purchase ,they have full access to the products ,other wise they can only
 view details .

 The products are not tied to the user at the document  level, simply because
 , once the purchase duration of product expires ,the user will no longer
 have access to that product.

 So a search for a product once the user logs in and searches for only the
 products that he has access to Will translate to something like this . ,the
 product ids are obtained form the db  for a particular user and can run
 into  n  number.

 search term fq=product_id(100 10001  ..n number)

 but we are currently running into too many Boolean expansion error .We are
 not able to tie the user also into roles as each user is mainly any one who
 comes to site and purchases a product .

 Given the 2 solutions above as SOLR -1872 where we have to specify the user
 in an ACL file  and
 query for allow and deny also translates to what  we are trying to do above

 In Case of SOLR 1834 ,we are required to use a crawler (APACHE manifoldCF)
 for indexing the Permissions(also the data) into the document and then
 querying on it ,this will also not work in our scenario as we have  n web
 apps having the same requirement  ,it would be tedious to set this up for
 each webapp and also the  requirement that once the user permission for a
 product is revoked ,then he should not be able to search  on the same within
 his subscribed products.

 Any pointers would be helpful and sorry about the lengthy description.

 Regards
 Sujatha



Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-14 Thread Peter Sturge
SOLR-1872 doesn't add discrete booleans to the query, it does it
programmatically, so you shouldn't see this problem. (if you have a
look at the code, you'll see how it filters queries)
I suppose you could modify SOLR-1872 to use an in-memory,
dynamically-updated user list (+ associated filters) instead of using
the acl file.
This would give you the 'changing users' and 'expiry' functionality you need.



On Tue, Jun 14, 2011 at 10:08 AM, Sujatha Arun suja.a...@gmail.com wrote:
 Thanks Peter , for your input .

 I really  would like a document and schema agnostic   solution as  in solr
 1872.

  Am I right  in my assumption that SOLR1872  is same as the solution that
 we currently have where we add a flter query of the products  to orignal
 query and hence (SOLR 1872) will also run into  TOO many boolean clause
 expanson error?

 Regards
 Sujatha


 On Tue, Jun 14, 2011 at 1:53 PM, Peter Sturge peter.stu...@gmail.comwrote:

 Hi,

 SOLR-1834 is good when the original documents' ACL is accessible.
 SOLR-1872 is good where the usernames are persistent - neither of
 these really fit your use case.
 It sounds like you need more of an 'in-memory', transient access
 control mechanism. Does the access have to exist beyond the user's
 session (or the Solr vm session)?
 Your best bet is probably something like a custom SearchComponent or
 similar, that keeps track of user purchases, and either adjusts/limits
 the query or the results to suit.
 With your own module in the query chain, you can then decide when the
 'expiry' is, and limit results accordingly.

 SearchComponent's are pretty easy to write and integrate. Have a look at:
   http://wiki.apache.org/solr/SearchComponent
 for info on SearchComponent and its usage.




 On Tue, Jun 14, 2011 at 8:18 AM, Sujatha Arun suja.a...@gmail.com wrote:
  Hello,
 
 
  Our Use Case is as follows
 
  Several solr webapps (one JVM) ,Each webapp catering to one client .Each
  client has their users who can purchase products from the  site .Once
 they
  purchase ,they have full access to the products ,other wise they can only
  view details .
 
  The products are not tied to the user at the document  level, simply
 because
  , once the purchase duration of product expires ,the user will no longer
  have access to that product.
 
  So a search for a product once the user logs in and searches for only the
  products that he has access to Will translate to something like this .
 ,the
  product ids are obtained form the db  for a particular user and can run
  into  n  number.
 
  search term fq=product_id(100 10001  ..n number)
 
  but we are currently running into too many Boolean expansion error .We
 are
  not able to tie the user also into roles as each user is mainly any one
 who
  comes to site and purchases a product .
 
  Given the 2 solutions above as SOLR -1872 where we have to specify the
 user
  in an ACL file  and
  query for allow and deny also translates to what  we are trying to do
 above
 
  In Case of SOLR 1834 ,we are required to use a crawler (APACHE
 manifoldCF)
  for indexing the Permissions(also the data) into the document and then
  querying on it ,this will also not work in our scenario as we have  n web
  apps having the same requirement  ,it would be tedious to set this up for
  each webapp and also the  requirement that once the user permission for a
  product is revoked ,then he should not be able to search  on the same
 within
  his subscribed products.
 
  Any pointers would be helpful and sorry about the lengthy description.
 
  Regards
  Sujatha
 




Re: [POLL] How do you (like to) do logging with Solr

2011-05-16 Thread Peter Sturge
 [X]  I always use the JDK logging as bundled in solr.war, that's perfect
 [ ]  I sometimes use log4j or another framework and am happy with
re-packaging solr.war
 [ ]  Give me solr.war WITHOUT an slf4j logger binding, so I can
choose at deploy time
 [ ]  Let me choose whether to bundle a binding or not at build time,
using an ANT option
 [ ]  What's wrong with the solr/example Jetty? I never run Solr elsewhere!
 [ ]  What? Solr can do logging? How cool!


Re: DIH for e-mails

2011-05-05 Thread Peter Sturge
The best way to add your own fields is to create a custom Transformer sub-class.
See:
http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FDataImportHandler

This will guide you through the steps.
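
Very roughly, the end result looks something like the class below (the
package, class name, source field and derivation rule are all just
placeholders - adapt them to whatever your rule for security_number is):

package com.example;

import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Hypothetical transformer that adds a 'security_number' field to each mail
// row before it is indexed. Reference it from the entity definition, e.g.
//   <entity ... transformer="com.example.SecurityNumberTransformer">
public class SecurityNumberTransformer extends Transformer {

    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        // Derive the new field from whatever the entity already provides -
        // this particular rule is purely illustrative.
        Object subject = row.get("subject");
        row.put("security_number",
                subject == null ? "none" : Integer.toString(Math.abs(subject.hashCode())));
        return row;
    }
}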

Peter


2011/5/5 方振鹏 michong900...@xmu.edu.cn:



 I’m using Data Import Handler for index emails.

 The problem is that I wanna add my own field such as security_number.

 Someone have any idea?

 Regards,

 Jame Bond Fang




Re: Trying to Post. Emails rejected as spam.

2011-04-07 Thread Peter Sturge
This happens almost always because you're sending from a 'free' mail
account (gmail, yahoo, hotmail, etc), and your message contains words
that spam filters don't like.
For me, it was the use of the word 'remplica' (deliberately
mis-spelled so this mail gets sent).

It can also happen from 'non-free' mail servers that have been
successfully attacked by spambots, so that filters give it a really
bad reputation score.


On Thu, Apr 7, 2011 at 8:14 PM, Parker Johnson pjoh...@yahoo.com wrote:

 Hello everyone.  Does anyone else have problems posting to the list?  My
 messages keep getting rejected with this response below.  I'll be surprised if
 this one makes it through :)

 -Park

 Sorry, we were unable to deliver your message to the following address.

 solr-user@lucene.apache.org:
 Remote  host said: 552 spam score (8.0) exceeded threshold
 (FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
  ) [BODY]

 --- Below this line is a copy of the message.



Re: Exception on distributed date facet SOLR-1709

2011-03-18 Thread Peter Sturge
Hi Viswa,

This patch was originally built for the 3x branch, and I don't see any
ported patch revision or testing for trunk. A lot has changed in
faceting from 3x to trunk, so it will likely need a bit of adjusting
to cater for these changes (e.g. deprecation of date range in favour
of range). Have you tried this patch on 3x branch?

Thanks,
Peter



On Fri, Mar 18, 2011 at 7:09 AM, Viswa S svis...@hotmail.com wrote:
 Folks,

 We are trying to do some date faceting on our distributed environment,
 applied solr-1709 on the trunk. A date facet query throws the below
 exception, I have attached the patched source for reference. Any help would
 be appreciated.

 Other Info:
 Java ver: 1_6_0_24
 Trung change list: 1022216




 SEVERE: java.lang.ClassCastException: java.util.Date cannot be cast to
 java.lang.Integer

     at
 org.apache.solr.handler.component.FacetComponent.countFacets(FacetComponent.java:294)

     at
 org.apache.solr.handler.component.FacetComponent.handleResponses(FacetComponent.java:232)

     at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:326)

     at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)

     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1325)

     at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)

     at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)

     at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)

     at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)

     at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)

     at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)

     at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)

     at
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)

     at
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)

     at
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)

     at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)

     at org.mortbay.jetty.Server.handle(Server.java:326)

     at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)

     at
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)

     at
 org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)

     at
 org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)

     at
 org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)

     at
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)

     at
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)







Re: problem using dataimporthandler

2011-03-15 Thread Peter Sturge
It could be that your original xml file was saved as unicode (with a BOM
header - FFFE or FEFF) - the xml parser will treat the BOM as content if
the underlying file handling doesn't strip it.
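
If you want to check for (and strip) a BOM before the file is parsed,
something along these lines works - just a sketch, adjust for however the
file actually gets read:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Peeks at the first bytes of a file and consumes a UTF-8 (EF BB BF) or
// UTF-16 (FE FF / FF FE) byte-order mark if one is present.
public class BomCheck {
    public static InputStream openWithoutBom(String path) throws Exception {
        PushbackInputStream in = new PushbackInputStream(new FileInputStream(path), 3);
        byte[] head = new byte[3];
        int n = in.read(head);
        int bomLen = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            bomLen = 3;                                   // UTF-8 BOM
        } else if (n >= 2 && ((head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF
                           || (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE)) {
            bomLen = 2;                                   // UTF-16 BOM
        }
        if (n > bomLen) {
            in.unread(head, bomLen, n - bomLen);          // push back everything after the BOM
        }
        return in;
    }
}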


On Tue, Mar 15, 2011 at 10:00 PM, sivaram yogendra.bopp...@gmail.com wrote:
 I got rid of the problem by just copying the other schema and config files(
 which sound like nothing to do with the error on the dataconfig file but I
 gave it a try) and it worked I don't know if I'm missing something here
 but its working now.

 Thanks,
 Ram.

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/problem-using-dataimporthandler-tp495415p2684044.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Math-generated fields during query

2011-03-10 Thread Peter Sturge
Hi Dan,

Yes, you're right - in fact that was precisely what I was thinking of
doing! Also looking at SOLR-1298 & SOLR-1566 - which would be good for
applying functions generically rather than on a per-use-case basis.

Thanks!
Peter


On Thu, Mar 10, 2011 at 3:58 PM, dan sutton danbsut...@gmail.com wrote:
 As a workaround can you not have a search component run after the
 querycomponent, and have the qty_ordered,unit_price as stored fields
 and returned with the fl parameter and have your custom component do
 the calc, unless you need to sort by this value too?

 Dan

 On Wed, Mar 9, 2011 at 10:06 PM, Peter Sturge peter.stu...@gmail.com wrote:
 Hi,

 I was wondering if it is possible during a query to create a returned
 field 'on the fly' (like function query, but for concrete values, not
 score).

 For example, if I input this query:
   q=_val_:product(15,3)fl=*,score

 For every returned document, I get score = 45.

 If I change it slightly to add *:* like this:
   q=*:* _val_:product(15,3)fl=*,score

 I get score = 32.526913.

 If I try my use case of _val_:product(qty_ordered,unit_price), I get
 varying scores depending on...well depending on something.

 I understand this is doing relevance scoring, but it doesn't seem to
 tally with the FunctionQuery Wiki
 [example at the bottom of the page]:

   q=boxname:findbox+_val_:product(product(x,y),z)fl=*,score
 ...where score will contain the resultant volume.

 Is there a trick to getting not a score, but the actual value of
 quantity*price (e.g. product(5,2.21) == 11.05)?

 Many thanks




Re: Help -DIH (mail)

2011-03-09 Thread Peter Sturge
Hi,

You've included some output in your message, so I presume something
*did* happen when you ran the 'status' command (but it might not be
what you wanted to happen :-)

If you run:
http://localhost:8983/solr/mail/dataimport?command=status

and you get something like this back:
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages"/>

It means that no full-import or delta-import has been run during the
life of the JVM Solr session.

You should try running:
   http://localhost:8983/solr/mail/dataimport?command=full-import

Then run:
   http://localhost:8983/solr/mail/dataimport?command=status

to see the status of the full-import (busy, idle, error, rolled back etc.)

You can enable java logging by editing your JRE's lib/logging.properties file.

Something like this should give you some log files:
handlers= java.util.logging.FileHandler
.level= INFO
java.util.logging.FileHandler.pattern = ./logs/mylogs%d.log
java.util.logging.FileHandler.level = INFO
java.util.logging.FileHandler.limit = 50
java.util.logging.FileHandler.count = 1
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

NOTE: Make sure the 'logs' folder exists (in your $cwd) before you
start, or you'll get an error.

HTH
Peter




On Wed, Mar 9, 2011 at 12:47 PM, Matias Alonso matiasgalo...@gmail.com wrote:
 Hi Peter,

 When I execute the commands you mentioned, nothing happend.
 Below I show you the comands executed and the answered of they.
 Sorry, but I don´t know how to enable the log; my jre is by default.
 Rememeber I´m running the example-DIH (trunk\solr\example\example-DIH\solr);
 java -Dsolr.solr.home=./example-DIH/solr/ -jar start.jar.



 Import:
 http://localhost:8983/solr/mail/dataimport?command=status
 http://localhost:8983/solr/mail/dataimport?command=full-import

 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">15</int>
   </lst>
   <lst name="initArgs">
     <lst name="defaults">
       <str name="config">data-config.xml</str>
     </lst>
   </lst>
   <str name="command">full-import</str>
   <str name="status">idle</str>
   <str name="importResponse"/>
   <lst name="statusMessages"/>
   <str name="WARNING">
   This response format is experimental.  It is likely to change in the future.
   </str>
 </response>



 Status:
 http://localhost:8983/solr/mail/dataimport?command=status
 http://localhost:8983/solr/mail/dataimport?command=full-import


 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">0</int>
   </lst>
   <lst name="initArgs">
     <lst name="defaults">
       <str name="config">data-config.xml</str>
     </lst>
   </lst>
   <str name="command">status</str>
   <str name="status">idle</str>
   <str name="importResponse"/>
   <lst name="statusMessages"/>
   <str name="WARNING">
   This response format is experimental.  It is likely to change in the future.
   </str>
 </response>




 Thank you for your help.

 Matias.






 2011/3/4 Peter Sturge peter.stu...@gmail.com

 Can you try this:

 Issue a full import command like this:

 http://localhost:8983/solr/dataimport?command=full-import
 http://localhost:8983/solr/db/dataimport?command=full-import

 (There is no core name here - if you're using a core name (db?), then add
 that in between solr/ and /dataimport)

 then, run:
 http://localhost:8983/solr/dataimport?command=status
 http://localhost:8983/solr/db/dataimport?command=full-import

 This will show the results of the previous import. Has it been rolled-back?
 If so, there might be something in the log if it's enabled (see your jre's
 lib/logging.properties file).
 (you won't see any errors unless you run the status command - that's where
 they're stored)

 HTH
 Peter




 On Sat, Mar 5, 2011 at 12:46 AM, Matias Alonso matiasgalo...@gmail.com
 wrote:

  I´m using the trunk.
 
  Thanks Peter for your preoccupation!
 
  Matias.
 
 
 
  2011/3/4 Peter Sturge peter.stu...@gmail.com
 
   Hi Matias,
  
   What version of Solr are you using? Are you running any patches (maybe
   SOLR-2245)?
  
   Thanks,
   Peter
  
  
  
   On Fri, Mar 4, 2011 at 8:25 PM, Matias Alonso matiasgalo...@gmail.com
   wrote:
  
Hi Peter,
   
From DataImportHandler Development Console I made a full-import,
 but
didn´t work.
   
Now, I execute 
http://localhost:8983/solr/mail/dataimport?command=full-import; but
nothing
happends; no index; no errors.
   
thks...
   
Matias.
   
   
   
2011/3/4 Peter Sturge peter.stu...@gmail.com
   
 Hi Mataias,



   
  
 
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimportaccesses
 the dataimport handler, but you need to tell it to do something by
 sending a command:

  
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport
 ?command=full-import
 http://localhost:8983/solr/db/dataimport?command=full-import

 If you haven't already, have a look at:



   
  
 
 http://www.lucidimagination.com

Re: Help -DIH (mail)

2011-03-09 Thread Peter Sturge
,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher warm
 INFO: autowarming result for Searcher@1cee792 main

 documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 09/03/2011 11:52:03 org.apache.solr.core.QuerySenderListener newSearcher
 INFO: QuerySenderListener sending requests to Searcher@1cee792 main
 09/03/2011 11:52:03 org.apache.solr.core.SolrCore execute
 INFO: [mail] webapp=null path=null
 params={start=0event=newSearcherq=solrrows=10} hits=0 status=0 QTime=0
 09/03/2011 11:52:03 org.apache.solr.core.SolrCore execute
 INFO: [mail] webapp=null path=null
 params={start=0event=newSearcherq=rocksrows=10} hits=0 status=0 QTime=0
 09/03/2011 11:52:03 org.apache.solr.core.SolrCore execute
 INFO: [mail] webapp=null path=null
 params={event=newSearcherq=static+newSearcher+warming+query+from+solrconfig.xml}
 hits=0 status=0 QTime=0
 09/03/2011 11:52:03 org.apache.solr.core.QuerySenderListener newSearcher
 INFO: QuerySenderListener done.
 09/03/2011 11:52:03 org.apache.solr.core.SolrCore registerSearcher
 INFO: [mail] Registered new searcher Searcher@1cee792 main
 09/03/2011 11:52:03 org.apache.solr.search.SolrIndexSearcher close
 INFO: Closing Searcher@9a18a0 main

 fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}

 filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}

 queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=5,evictions=0,size=5,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}

 documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 09/03/2011 11:52:03 org.apache.solr.handler.dataimport.SolrWriter
 readIndexerProperties
 INFO: Read dataimport.properties
 09/03/2011 11:52:03 org.apache.solr.handler.dataimport.SolrWriter persist
 INFO: Wrote last indexed time to dataimport.properties
 09/03/2011 11:52:03 org.apache.solr.update.processor.LogUpdateProcessor
 finish
 INFO: {deleteByQuery=*:*,optimize=} 0 0
 09/03/2011 11:52:03 org.apache.solr.handler.dataimport.DocBuilder execute
 INFO: Time taken = 0:0:2.359



 09/03/2011 11:54:58 org.apache.solr.core.SolrCore execute
 INFO: [mail] webapp=/solr path=/dataimport params={command=status} status=0
 QTime=0



 Thks,

 Matias.





 2011/3/9 Peter Sturge peter.stu...@gmail.com

 Hi,

 You've included some output in your message, so I presume something
 *did* happen when you ran the 'status' command (but it might not be
 what you wanted to happen :-)

 If you run:
 http://localhost:8983/solr/mail/dataimport?command=status

 and you get something like this back:
 str name=statusidle/str
 str name=importResponse/
 lst name=statusMessages/

 It means that no full-import or delta-import has been run during the
 life of the JVM Solr session.

 You should try running:
    http://localhost:8983/solr/mail/dataimport?command=full-import

 Then run:
   http://localhost:8983/solr/mail/dataimport?command=status

 to see the status of the full-import (busy, idle, error, rolled back etc.)

 You can enable java logging by editing your JRE's lib/logging.properties
 file.

 Something like this should give you some log files:
 handlers= java.util.logging.FileHandler
 .level= INFO
 java.util.logging.FileHandler.pattern = ./logs/mylogs%d.log
 java.util.logging.FileeHandler.level = INFO
 java.util.logging.FileHandler.limit = 50
 java.util.logging.FileHandler.count = 1
 java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

 NOTE: Make sure the 'logs' folder exists (in your $cwd) before you
 start, or you'll get an error.

 HTH
 Peter




 On Wed, Mar 9, 2011 at 12:47 PM, Matias Alonso matiasgalo...@gmail.com
 wrote:
  Hi Peter,
 
  When I execute the commands you mentioned, nothing happend.
  Below I show you the comands executed and the answered of they.
  Sorry, but I don´t know how to enable the log; my jre is by default.
  Rememeber I´m running the example-DIH
 (trunk\solr\example\example-DIH\solr);
  java -Dsolr.solr.home=./example-DIH/solr/ -jar start.jar.
 
 
 
  Import:
  http://localhost:8983/solr/mail/dataimport?command=status
  http://localhost:8983/solr/mail/dataimport?command=full-import
 
  response
  -
  lst name=responseHeader
  int name=status0/int
  int name=QTime15/int
  /lst
  -
  lst name=initArgs
  -
  lst

Math-generated fields during query

2011-03-09 Thread Peter Sturge
Hi,

I was wondering if it is possible during a query to create a returned
field 'on the fly' (like function query, but for concrete values, not
score).

For example, if I input this query:
   q=_val_:product(15,3)&fl=*,score

For every returned document, I get score = 45.

If I change it slightly to add *:* like this:
   q=*:* _val_:product(15,3)&fl=*,score

I get score = 32.526913.

If I try my use case of _val_:product(qty_ordered,unit_price), I get
varying scores depending on...well depending on something.

I understand this is doing relevance scoring, but it doesn't seem to
tally with the FunctionQuery Wiki
[example at the bottom of the page]:

   q=boxname:findbox+_val_:product(product(x,y),z)&fl=*,score
...where score will contain the resultant volume.

Is there a trick to getting not a score, but the actual value of
quantity*price (e.g. product(5,2.21) == 11.05)?

Many thanks


Solr chained exclusion query

2011-03-04 Thread Peter Sturge
Hello,

I've been wrestling with a query use case, perhaps someone has done this
already?
Is it possible to write a query that excludes results based on another
query?

Scenario:
I have an index that holds:
   'customer'  (textgen)
   'product'   (textgen)
   'saledate'   (date)

I'm looking to return documents for 'customer' entries who have bought a
'product' in the past, but haven't bought in, say, the last month.
(i.e. need to exclude *all* 'customer' documents who have bought 'product'
in the last month, as well as those who have never bought 'product')

A very simple query like this:
 q=products:Dog AND -(products:Dog AND saledate:[2011-01-01T00:00:00Z TO
*])
returns 'Dog' documents prior to 1 Jan, but these need to be excluded if
there are matches after 1 Jan.
I wasn't expecting the above query to do the extra exclusion - it's just to
illustrate the general problem that it operates at document level, not query
level (like a SQL subquery).
If I could pipe the results of the above to another query, that would
likely do the trick.
I've tried negative boosts, magic _query_, query() and such, but with no
luck.

Is this possible?
Any insight into how to write such a query would be much appreciated!

Thanks,
Peter


Re: Solr chained exclusion query

2011-03-04 Thread Peter Sturge
Hi,

Oh, how I wish it was as simple as that! :-)
The tricky ingredient in the use case is to exclude all documents (from any
'saledate') if there's a recent 'product' match (e.g. last month).
So, essentially you have to somehow build a query that looks at 2 different
criteria for the same field ('saledate'). This requires the criteria to be
applied at the DocSet level,
rather than on each Document (or, do them sequentially like in SOLR-2026).

I've been having a look at Karl's SOLR-2026, which looks very interesting,
but I've not got it working on trunk as yet.
The only other way I can see is to do multiple client-side round-trip
queries - using the results of the initial search as a filter for the
second.
It's a bit messy, and not a performance winner (esp w/ distributed searches
on large indexes), so hopefully a server-side solution is out there.
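
For what it's worth, the client-side round-trip version would look very
roughly like the sketch below (field names taken from the scenario above;
note the second pass still builds a big boolean exclusion, so it can run
into maxBooleanClauses just the same):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TwoPassExclusion {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Pass 1: customers who bought 'Dog' in the last month.
        SolrQuery recent = new SolrQuery("products:Dog AND saledate:[NOW-1MONTH TO *]");
        recent.setFields("customer");
        recent.setRows(1000);                         // page through properly in real code
        QueryResponse first = solr.query(recent);

        // Build an exclusion filter from the pass-1 customers.
        StringBuilder exclude = new StringBuilder();
        for (SolrDocument doc : first.getResults()) {
            if (exclude.length() > 0) {
                exclude.append(" OR ");
            }
            exclude.append("customer:\"").append(doc.getFieldValue("customer")).append("\"");
        }

        // Pass 2: older 'Dog' buyers, minus anyone who also bought recently.
        SolrQuery older = new SolrQuery("products:Dog AND saledate:[* TO NOW-1MONTH]");
        if (exclude.length() > 0) {
            older.addFilterQuery("-(" + exclude + ")");
        }
        QueryResponse second = solr.query(older);
        System.out.println("matches: " + second.getResults().getNumFound());
    }
}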

Thanks!
Peter





On Fri, Mar 4, 2011 at 2:14 PM, Savvas-Andreas Moysidis 
savvas.andreas.moysi...@googlemail.com wrote:

 Can you not calculate on the fly when the date which is one month before
 the
 current is and use that as your upper limit?

 e.g. taking today as an example your upper limit would be
 20011-02-04T00:00:00Z
 and so your query would be something like:
 q=products:Dog AND saledate:[* TO 20011-02-04T00:00:00Z]


 On 4 March 2011 11:40, Peter Sturge peter.stu...@gmail.com wrote:

  Hello,
 
  I've been wrestling with a query use case, perhaps someone has done this
  already?
  Is it possible to write a query that excludes results based on another
  query?
 
  Scenario:
  I have an index that holds:
'customer'  (textgen)
'product'   (textgen)
'saledate'   (date)
 
  I'm looking to return documents for 'customer' entries who have bought a
  'product' in the past, but haven't bought in, say, the last month.
  (i.e. need to exclude *all* 'customer' documents who have bought
 'product'
  in the last month, as well as those who have never bought 'product')
 
  A very simple query like this:
  q=products:Dog AND -(products:Dog AND saledate:[2011-01-01T00:00:00Z
 TO
  *])
  returns 'Dog' documents prior to 1 Jan, but these need to be excluded if
  there are matches after 1 Jan.
  I wasn't expecting the above query to do the extra exclusion - it's just
 to
  illustrate the general problem that it operates at document level, not
  query
  level (like a SQL subquery).
  If I could could pipe the results of the above to another query, that
 would
  likely do the trick.
  I've tried negative boosts, magic _query_, query() and such, but with no
  luck.
 
  Is this possible?
  Any insight into how to write such a query would be much appreciated!
 
  Thanks,
  Peter
 



Re: Help -DIH (mail)

2011-03-04 Thread Peter Sturge
Hi,

You need to put your password in as well. You should use protocol=imap
unless your gmail is set for imaps (I don't believe the free gmail gives you
this).

<entity name="email"
  user="u...@mydomain.com"
  password="userpwd"
  host="imap.mydomain.com"
  include=""
  exclude=""
  processor="MailEntityProcessor"
  protocol="imap"
   />

HTH
Peter



On Fri, Mar 4, 2011 at 4:42 PM, Gora Mohanty g...@mimirtech.com wrote:

 On Fri, Mar 4, 2011 at 9:20 PM, Matias Alonso matiasgalo...@gmail.com
 wrote:
  Hi everyone!
 
 
   I’m trying to index mails into solr through DHI (based on the
  “example-DIH”). For this I´m using my personal email from gmail, but I
 can´t
  index.

 Have not used the MailEntityProcessor with Gmail, but some
 points below:

  Configuration in Data-config .xml:
 
  dataConfig
 
   document
 
 entity name=email
 
   user=m...@gmail.com
 ^ I presume that you have put in your actual
email address here.
 [...]
   protocol=imap/
  ^ Shouldn't this be imaps, at least as
 per http://wiki.apache.org/solr/MailEntityProcessor

 Regards,
 Gora



Re: Help -DIH (mail)

2011-03-04 Thread Peter Sturge
Hi Matias,

Can you post your data-config.xml? (with disguised names/credentials)

Thanks,
Peter


On Fri, Mar 4, 2011 at 5:13 PM, Matias Alonso matiasgalo...@gmail.comwrote:

 Thks Peter,

 Yes, gmail gives me imaps (i understood that). So, I tried what you mention
 but I had get the original mesange I posted.

 Matias.




 2011/3/4 Peter Sturge peter.stu...@gmail.com

  Hi,
 
  You need to put your password in as well. You should use protocol=imap
  unless your gmail is set for imaps (I don't believe the free gmail gives
  you
  this).
 
 entity name=email
   user=u...@mydomain.com
   password=userpwd
   host=imap.mydomain.com
   include=
   exclude=
   processor=MailEntityProcessor
   protocol=imap
/
 
  HTH
  Peter
 
 
 
  On Fri, Mar 4, 2011 at 4:42 PM, Gora Mohanty g...@mimirtech.com wrote:
 
   On Fri, Mar 4, 2011 at 9:20 PM, Matias Alonso matiasgalo...@gmail.com
 
   wrote:
Hi everyone!
   
   
 I’m trying to index mails into solr through DHI (based on the
“example-DIH”). For this I´m using my personal email from gmail, but
 I
   can´t
index.
  
   Have not used the MailEntityProcessor with Gmail, but some
   points below:
  
Configuration in Data-config .xml:
   
dataConfig
   
 document
   
   entity name=email
   
 user=m...@gmail.com
   ^ I presume that you have put in your actual
  email address here.
   [...]
 protocol=imap/
^ Shouldn't this be imaps, at least as
   per http://wiki.apache.org/solr/MailEntityProcessor
  
   Regards,
   Gora
  
 



Re: Help -DIH (mail)

2011-03-04 Thread Peter Sturge
Hi Matias,

I haven't seen it in the posts, but I may have missed it -- what is the
import command you're sending?
Something like: http://localhost:8983/solr/db/dataimport?command=full-import

Can you also test it with deltaFetch=false. I seem to remember having some
problems with delta in the MailEntityProcessor.



On Fri, Mar 4, 2011 at 6:29 PM, Matias Alonso matiasgalo...@gmail.comwrote:

 <dataConfig>
  <document>
   <entity name="email"
    user="myem...@gmail.com"
    password="mypassword"
    host="imap.gmail.com"
    fetchMailsSince="2011-01-01 00:00:00"
    deltaFetch="true"
    include=""
    exclude=""
    recurse="false"
    folders="Recibidos,recibidos,RECIBIDOS,inbox.InBox,INBOX,Mail,MAIL,mail,CORREO,correo,Correo"
    includeContent="true"
    processAttachments="false"
    includeOtherUserFolders="false"
    includeSharedFolders="false"
    batchSize="100"
    processor="MailEntityProcessor"
    protocol="imaps" />
  </document>
 </dataConfig>

 2011/3/4 Peter Sturge peter.stu...@gmail.com

  Hi Matias,
 
  Can you post your data-config.xml? (with disquised names/credentials)
 
  Thanks,
  Peter
 
 
  On Fri, Mar 4, 2011 at 5:13 PM, Matias Alonso matiasgalo...@gmail.com
  wrote:
 
   Thks Peter,
  
   Yes, gmail gives me imaps (i understood that). So, I tried what you
  mention
   but I had get the original mesange I posted.
  
   Matias.
  
  
  
  
   2011/3/4 Peter Sturge peter.stu...@gmail.com
  
Hi,
   
You need to put your password in as well. You should use
  protocol=imap
unless your gmail is set for imaps (I don't believe the free gmail
  gives
you
this).
   
   entity name=email
 user=u...@mydomain.com
 password=userpwd
 host=imap.mydomain.com
 include=
 exclude=
 processor=MailEntityProcessor
 protocol=imap
  /
   
HTH
Peter
   
   
   
On Fri, Mar 4, 2011 at 4:42 PM, Gora Mohanty g...@mimirtech.com
  wrote:
   
 On Fri, Mar 4, 2011 at 9:20 PM, Matias Alonso 
  matiasgalo...@gmail.com
   
 wrote:
  Hi everyone!
 
 
   I’m trying to index mails into solr through DHI (based on the
  “example-DIH”). For this I´m using my personal email from gmail,
  but
   I
 can´t
  index.

 Have not used the MailEntityProcessor with Gmail, but some
 points below:

  Configuration in Data-config .xml:
 
  dataConfig
 
   document
 
 entity name=email
 
   user=m...@gmail.com
 ^ I presume that you have put in your actual
email address here.
 [...]
   protocol=imap/
  ^ Shouldn't this be imaps, at least as
 per http://wiki.apache.org/solr/MailEntityProcessor

 Regards,
 Gora

   
  
 



Re: Help -DIH (mail)

2011-03-04 Thread Peter Sturge
Hi Matias,

http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport
accesses the dataimport handler, but you need to tell it to do something by
sending a command:
http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport&command=full-import

If you haven't already, have a look at:

http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FDataImportHandler

It gives very thorough and useful advice on getting the DIH working.

Peter



On Fri, Mar 4, 2011 at 6:59 PM, Matias Alonso matiasgalo...@gmail.comwrote:

 Hi Peter,

 I test with deltaFetch=false, but doesn´t work :(
 I'm using DataImportHandler Development Console to index (
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport);
 I'm working with example-DIH.

 thks...



 2011/3/4 Peter Sturge peter.stu...@gmail.com

  Hi Matias,
 
  I haven't seen it in the posts, but I may have missed it -- what is the
  import command you're sending?
  Something like:
  http://localhost:8983/solr/db/dataimport?command=full-import
 
  Can you also test it with deltaFetch=false. I seem to remember having
  some
  problems with delta in the MailEntityProcessor.
 
 
 
  On Fri, Mar 4, 2011 at 6:29 PM, Matias Alonso matiasgalo...@gmail.com
  wrote:
 
   dataConfig
document
 entity name=email
 user=myem...@gmail.com
password=mypassword
host=imap.gmail.com
fetchMailsSince=2011-01-01 00:00:00
deltaFetch=true
include=
exclude=
recurse=false
  
  
  
 
 folders=Recibidos,recibidos,RECIBIDOS,inbox.InBox,INBOX,Mail,MAIL,mail,CORREO,correo,Correo
 includeContent=true
processAttachments=false
includeOtherUserFolders=false
includeSharedFolders=false
batchSize=100
processor=MailEntityProcessor
protocol=imaps /
/document
   /dataConfig
  
   2011/3/4 Peter Sturge peter.stu...@gmail.com
  
Hi Matias,
   
Can you post your data-config.xml? (with disquised names/credentials)
   
Thanks,
Peter
   
   
On Fri, Mar 4, 2011 at 5:13 PM, Matias Alonso 
 matiasgalo...@gmail.com
wrote:
   
 Thks Peter,

 Yes, gmail gives me imaps (i understood that). So, I tried what you
mention
 but I had get the original mesange I posted.

 Matias.




 2011/3/4 Peter Sturge peter.stu...@gmail.com

  Hi,
 
  You need to put your password in as well. You should use
protocol=imap
  unless your gmail is set for imaps (I don't believe the free
 gmail
gives
  you
  this).
 
 entity name=email
   user=u...@mydomain.com
   password=userpwd
   host=imap.mydomain.com
   include=
   exclude=
   processor=MailEntityProcessor
   protocol=imap
/
 
  HTH
  Peter
 
 
 
  On Fri, Mar 4, 2011 at 4:42 PM, Gora Mohanty g...@mimirtech.com
 
wrote:
 
   On Fri, Mar 4, 2011 at 9:20 PM, Matias Alonso 
matiasgalo...@gmail.com
 
   wrote:
Hi everyone!
   
   
 I’m trying to index mails into solr through DHI (based on
 the
“example-DIH”). For this I´m using my personal email from
  gmail,
but
 I
   can´t
index.
  
   Have not used the MailEntityProcessor with Gmail, but some
   points below:
  
Configuration in Data-config .xml:
   
dataConfig
   
 document
   
   entity name=email
   
 user=m...@gmail.com
   ^ I presume that you have put in your
 actual
  email address here.
   [...]
 protocol=imap/
^ Shouldn't this be imaps, at least as
   per http://wiki.apache.org/solr/MailEntityProcessor
  
   Regards,
   Gora
  
 

   
  
 



Re: Help -DIH (mail)

2011-03-04 Thread Peter Sturge
Hi Matias,

What version of Solr are you using? Are you running any patches (maybe
SOLR-2245)?

Thanks,
Peter



On Fri, Mar 4, 2011 at 8:25 PM, Matias Alonso matiasgalo...@gmail.comwrote:

 Hi Peter,

 From DataImportHandler Development Console I made a full-import, but
 didn´t work.

 Now, I execute 
 http://localhost:8983/solr/mail/dataimport?command=full-import; but
 nothing
 happends; no index; no errors.

 thks...

 Matias.



 2011/3/4 Peter Sturge peter.stu...@gmail.com

  Hi Mataias,
 
 
 
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimportaccesses
  the dataimport handler, but you need to tell it to do something by
  sending a command:
  http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport
  ?command=full-import
  http://localhost:8983/solr/db/dataimport?command=full-import
 
  If you haven't already, have a look at:
 
 
 
 http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FDataImportHandler
 
  It gives very thorough and useful advice on getting the DIH working.
 
  Peter
 
 
 
  On Fri, Mar 4, 2011 at 6:59 PM, Matias Alonso matiasgalo...@gmail.com
  wrote:
 
   Hi Peter,
  
   I test with deltaFetch=false, but doesn´t work :(
   I'm using DataImportHandler Development Console to index (
  
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport
  );
   I'm working with example-DIH.
  
   thks...
  
  
  
   2011/3/4 Peter Sturge peter.stu...@gmail.com
  
Hi Matias,
   
I haven't seen it in the posts, but I may have missed it -- what is
 the
import command you're sending?
Something like:
http://localhost:8983/solr/db/dataimport?command=full-import
   
Can you also test it with deltaFetch=false. I seem to remember
 having
some
problems with delta in the MailEntityProcessor.
   
   
   
On Fri, Mar 4, 2011 at 6:29 PM, Matias Alonso 
 matiasgalo...@gmail.com
wrote:
   
 dataConfig
  document
   entity name=email
   user=myem...@gmail.com
  password=mypassword
  host=imap.gmail.com
  fetchMailsSince=2011-01-01 00:00:00
  deltaFetch=true
  include=
  exclude=
  recurse=false



   
  
 
 folders=Recibidos,recibidos,RECIBIDOS,inbox.InBox,INBOX,Mail,MAIL,mail,CORREO,correo,Correo
   includeContent=true
  processAttachments=false
  includeOtherUserFolders=false
  includeSharedFolders=false
  batchSize=100
  processor=MailEntityProcessor
  protocol=imaps /
  /document
 /dataConfig

 2011/3/4 Peter Sturge peter.stu...@gmail.com

  Hi Matias,
 
  Can you post your data-config.xml? (with disquised
  names/credentials)
 
  Thanks,
  Peter
 
 
  On Fri, Mar 4, 2011 at 5:13 PM, Matias Alonso 
   matiasgalo...@gmail.com
  wrote:
 
   Thks Peter,
  
   Yes, gmail gives me imaps (i understood that). So, I tried what
  you
  mention
   but I had get the original mesange I posted.
  
   Matias.
  
  
  
  
   2011/3/4 Peter Sturge peter.stu...@gmail.com
  
Hi,
   
You need to put your password in as well. You should use
  protocol=imap
unless your gmail is set for imaps (I don't believe the free
   gmail
  gives
you
this).
   
   entity name=email
 user=u...@mydomain.com
 password=userpwd
 host=imap.mydomain.com
 include=
 exclude=
 processor=MailEntityProcessor
 protocol=imap
  /
   
HTH
Peter
   
   
   
On Fri, Mar 4, 2011 at 4:42 PM, Gora Mohanty 
  g...@mimirtech.com
   
  wrote:
   
 On Fri, Mar 4, 2011 at 9:20 PM, Matias Alonso 
  matiasgalo...@gmail.com
   
 wrote:
  Hi everyone!
 
 
   I’m trying to index mails into solr through DHI (based
 on
   the
  “example-DIH”). For this I´m using my personal email from
gmail,
  but
   I
 can´t
  index.

 Have not used the MailEntityProcessor with Gmail, but some
 points below:

  Configuration in Data-config .xml:
 
  dataConfig
 
   document
 
 entity name=email
 
   user=m...@gmail.com
 ^ I presume that you have put in your
   actual
email address here.
 [...]
   protocol=imap/
  ^ Shouldn't this be imaps, at
 least
  as
 per http://wiki.apache.org/solr/MailEntityProcessor

 Regards,
 Gora

   
  
 

   
  
 



Re: Help -DIH (mail)

2011-03-04 Thread Peter Sturge
Can you try this:

Issue a full import command like this:

http://localhost:8983/solr/dataimport?command=full-import

(There is no core name here - if you're using a core name (db?), then add
that in between solr/ and /dataimport)

then, run:
http://localhost:8983/solr/dataimport?command=status

This will show the results of the previous import. Has it been rolled-back?
If so, there might be something in the log if it's enabled (see your jre's
lib/logging.properties file).
(you won't see any errors unless you run the status command - that's where
they're stored)

HTH
Peter




On Sat, Mar 5, 2011 at 12:46 AM, Matias Alonso matiasgalo...@gmail.comwrote:

 I´m using the trunk.

 Thanks Peter for your preoccupation!

 Matias.



 2011/3/4 Peter Sturge peter.stu...@gmail.com

  Hi Matias,
 
  What version of Solr are you using? Are you running any patches (maybe
  SOLR-2245)?
 
  Thanks,
  Peter
 
 
 
  On Fri, Mar 4, 2011 at 8:25 PM, Matias Alonso matiasgalo...@gmail.com
  wrote:
 
   Hi Peter,
  
   From DataImportHandler Development Console I made a full-import, but
   didn´t work.
  
   Now, I execute 
   http://localhost:8983/solr/mail/dataimport?command=full-import; but
   nothing
   happends; no index; no errors.
  
   thks...
  
   Matias.
  
  
  
   2011/3/4 Peter Sturge peter.stu...@gmail.com
  
Hi Mataias,
   
   
   
  
 
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimportaccesses
the dataimport handler, but you need to tell it to do something by
sending a command:
   
  http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport
?command=full-import
http://localhost:8983/solr/db/dataimport?command=full-import
   
If you haven't already, have a look at:
   
   
   
  
 
 http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FDataImportHandler
   
It gives very thorough and useful advice on getting the DIH working.
   
Peter
   
   
   
On Fri, Mar 4, 2011 at 6:59 PM, Matias Alonso 
 matiasgalo...@gmail.com
wrote:
   
 Hi Peter,

 I test with deltaFetch=false, but doesn´t work :(
 I'm using DataImportHandler Development Console to index (

  
 http://localhost:8983/solr/mail/admin/dataimport.jsp?handler=/dataimport
);
 I'm working with example-DIH.

 thks...



 2011/3/4 Peter Sturge peter.stu...@gmail.com

  Hi Matias,
 
  I haven't seen it in the posts, but I may have missed it -- what
 is
   the
  import command you're sending?
  Something like:
  http://localhost:8983/solr/db/dataimport?command=full-import
 
  Can you also test it with deltaFetch=false. I seem to remember
   having
  some
  problems with delta in the MailEntityProcessor.
 
 
 
  On Fri, Mar 4, 2011 at 6:29 PM, Matias Alonso 
   matiasgalo...@gmail.com
  wrote:
 
   dataConfig
document
 entity name=email
 user=myem...@gmail.com
password=mypassword
host=imap.gmail.com
fetchMailsSince=2011-01-01 00:00:00
deltaFetch=true
include=
exclude=
recurse=false
  
  
  
 

   
  
 
 folders=Recibidos,recibidos,RECIBIDOS,inbox.InBox,INBOX,Mail,MAIL,mail,CORREO,correo,Correo
 includeContent=true
processAttachments=false
includeOtherUserFolders=false
includeSharedFolders=false
batchSize=100
processor=MailEntityProcessor
protocol=imaps /
/document
   /dataConfig
  
   2011/3/4 Peter Sturge peter.stu...@gmail.com
  
Hi Matias,
   
Can you post your data-config.xml? (with disquised
names/credentials)
   
Thanks,
Peter
   
   
On Fri, Mar 4, 2011 at 5:13 PM, Matias Alonso 
 matiasgalo...@gmail.com
wrote:
   
 Thks Peter,

 Yes, gmail gives me imaps (i understood that). So, I tried
  what
you
mention
 but I had get the original mesange I posted.

 Matias.




 2011/3/4 Peter Sturge peter.stu...@gmail.com

  Hi,
 
  You need to put your password in as well. You should use
protocol=imap
  unless your gmail is set for imaps (I don't believe the
  free
 gmail
gives
  you
  this).
 
 entity name=email
   user=u...@mydomain.com
   password=userpwd
   host=imap.mydomain.com
   include=
   exclude=
   processor=MailEntityProcessor
   protocol=imap
/
 
  HTH
  Peter

Re: Separating Index Reader and Writer

2011-02-06 Thread Peter Sturge
Hi,

We use this scenario in production, where we have one write-only Solr
instance and one read-only instance, both pointing at the same data.
We do this so we can optimize caching/etc. for each instance for
write/read. The main performance gain is in cache warming and
associated parameters.
For your Index W, it's worth turning off cache warming altogether, so
commits aren't slowed down by warming.

Peter


On Sun, Feb 6, 2011 at 3:25 PM, Isan Fulia isan.fu...@germinait.com wrote:
 Hi all,
 I have setup two indexes one for reading(R) and other for writing(W).Index R
 refers to the same data dir of W (defined in solrconfig via dataDir).
 To make sure the R index sees the indexed documents of W , i am firing an
 empty commit on R.
 With this , I am getting performance improvement as compared to using the
 same index for reading and writing .
 Can anyone help me in knowing why this performance improvement is taking
 place even though both the indexeses are pointing to the same data
 directory.

 --
 Thanks  Regards,
 Isan Fulia.



Re: Document level security

2011-01-20 Thread Peter Sturge
Hi,

One of the things about Document Security is that it never involves
just one thing. There are a lot of things to consider, and
unfortunately, they're generally non-trivial.

Deciding how to store/hold/retrieve permissions is certainly one of
those things, and you're right, you should avoid attaching permissions
to document data in the index, because if you want to change
permissions (and you will want to change them at some point), it can
be a cumbersome job, particularly if it involves millions of
documents, replication, shards etc. It's also generally a good idea
not to tie your schema to permission fields.

Another big consideration is authentication - how can you be sure the
request is coming from the user you think it is? Is there a
certificate involved? Has the user authenticated to the container? If
so, how do you get to this? and so on...

For permissions storage, there are two realistic approaches to consider:
   1. Write a SearchComponent that handles permission requests. This
typically involves storing/reading permissions in/from a file,
database or separate index (see SOLR-1872)
   2. Use an LCF module to retrieve permissions from the original
documents themselves (see SOLR-1834)
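To make option 1 a little more concrete, here's a bare-bones sketch (the
class, the 'group' field, the acl.user parameter and the permission lookup
are all hypothetical, and the SolrInfoMBean boilerplate follows the 1.4-era
API, so check your version):

    import java.io.IOException;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    // Appends an ACL filter query before the query component runs.
    // Register it in solrconfig.xml in the component list ahead of "query".
    public class AclFilterComponent extends SearchComponent {

      @Override
      public void prepare(ResponseBuilder rb) throws IOException {
        // How you identify/authenticate the caller is up to you
        String user = rb.req.getParams().get("acl.user");
        if (user == null) {
          return; // or reject the request, depending on your policy
        }
        ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
        // Restrict results to the groups this user may see
        params.add("fq", "group:(" + lookupGroupsFor(user) + ")");
        rb.req.setParams(params);
      }

      @Override
      public void process(ResponseBuilder rb) throws IOException {
        // nothing to do at process time
      }

      // Placeholder - in practice read from a file, database or separate
      // permissions index (kept outside the main index), and cache the result
      private String lookupGroupsFor(String user) {
        return "public";
      }

      @Override
      public String getDescription() { return "ACL filter component (sketch)"; }

      @Override
      public String getVersion() { return "1.0"; }

      @Override
      public String getSourceId() { return ""; }

      @Override
      public String getSource() { return ""; }
    }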

Hope this helps,
Peter



On Thu, Jan 20, 2011 at 8:44 PM, Rok Rejc rokrej...@gmail.com wrote:
 Hi all,

 I have an index containing a couple of million documents.
 Documents are grouped into groups, each group contains from 1000-2
 documents.

 The problem:
 Each group has defined permission settings. It can be viewed by public,
 viewed by registred users, or viewed by a list of users (each group has her
 own list of users).
 Said differently: I need a document security.

 What I read from the other threads it is not recommended to store
 permissions in the index. I have already all the permissions in the
 database, but I don't know how to connect the database and the index.
 I can query the database to get the groups in which the user is and after
 that do the OR query, but I am afraid that this list can be too big (100
 OR's could also exceeds maximum HTTP GET query string length).

 What are the other options? Should I write a custom collector which will
 query (and cache) the database for permissions?

 Any ideas are appreciated...

 Many thanks, Rok



Re: How to implement and a system based on IMAP auth

2010-12-13 Thread Peter Sturge
imap has no intrinsic functionality for logging in as a user then
'impersonating' someone else.
What you can do is setup your email server so that your administrator
account or similar has access to other users via shared folders (this
is supported in imap2 servers - e.g. Exchange).
This is done all the time, for example if a manager wants his/her
secretary to have access to his/her mailbox.
Of course all access in this way needs to be in line with privacy policies etc.

When you connect as, say, 'admin', you can then see the shared folders
you have access to.
These folders are accessible via imap.
This is more of an imap thing, and isn't really related to DIH/Solr per se.

For Exchange servers, have a look at:
   http://www.petri.co.il/grant_full_mailbox_rights_on_exchange_2000_2003.htm
and
   http://www.ehow.com/how_5656820_share-exchange-mailboxes.html

HTH

Peter




On Mon, Dec 13, 2010 at 2:32 PM, milomalo2...@libero.it
milomalo2...@libero.it wrote:
 Hi Guys,

 i am new in Solr world and i was trying to figure out how to implement an
 application which would be able to connect to our business mail server throug
 IMAP connection (1000 users) and to index the information related e-mail
 contents.

 I tried to use DH- import with the preconfigured imap class provided in the
 solr example but as i could see there is no way to fetch 1000 user and 
 retrieve
 information for them

 What would you suggest as first step to follow ?
 should i use SOLRJ as client in order to reach user content across imap
 connection?
 Doesn anyone had experience with that ?

 thanks in advance






Re: SOLR Thesaurus

2010-12-10 Thread Peter Sturge
Hi Lee,

Perhaps Solr's clustering component might be helpful for your use case?
http://wiki.apache.org/solr/ClusteringComponent




On Fri, Dec 10, 2010 at 9:17 AM, lee carroll
lee.a.carr...@googlemail.com wrote:
 Hi Chris,

 Its all a bit early in the morning for this mined :-)

 The question asked, in good faith, was does solr support or extend to
 implementing a thesaurus. It looks like it does not which is fine. It does
 support synonyms and synonym rings which is again fine. The ski example was
 an illustration in response to a follow up question for more explanation on
 what a thesaurus is.

 An attempt at an answer of why a thesaurus; is below.

 Use case 1: improve facets

 Motivation
 Unstructured lists of labels in facets offer very poor user experience.
 Similar to tag clouds, users find them arbitrary, without focus and often
 overwhelming. Labels in facets which are grouped in meaningful ways relevant
 to the user increase engagement, perceived relevance and user satisfaction.

 Solution
 A thesaurus of term relationships could be used to group facet labels

 Implementation
 (er completely out of my depth at this point)
 Thesaurus relationships defined in a simple text file
 term, bt=term,term nt= term, term rt=term, term, pt=term
 if a search specifies a facet to be returned the field terms are identified
 by reading the thesaurus into groups, broader terms, narrower terms, related
 terms etc
 These groups are returned as part of the response for the UI to display
 faceted labels as broader, narrower, related terms etc

 Use case 2: Increase synonym search precision

 Motivation
 Synonyms rings do not allow differences in synonym to be identified. Rarely
 are synonyms exactly equivalent. This leads to a decrease in search
 precision.

 Solution
 Boost queries based on search term thesaurus relationships

 Implementation
 (again completely  out of depth here)
 Allow terms in the index to be identified as bt , nt, .. terms of the search
 term. Allow query parser to boost terms differentially based on these
 thesaurus relationships



 As for the x and y stuff I'm not sure, like I say it's quite early in the
 morning for me. I'm sure there may well be a different way of achieving the
 above (but note it is more than a hierarchy). However the librarians have
 been doing this for 50 years now.

 Again though just to repeat this is hardly a killer for us. We've looked at
 solr for a project; created a proto type; generated tons of questions, had
 them answered in the main by the docs, some on this list and been amazed at
 the fantastic results solr has given us. In fact with a combination of
 keepwords and synonyms we have got a pretty nice simple set of facet labels
 anyway (my motivation for the original question), so our corpus at the
 moment does not really need a thesaurus! :-)

 Thanks Lee


 On 9 December 2010 23:38, Chris Hostetter hossman_luc...@fucit.org wrote:



 : a term can have a Prefered Term (PT), many Broader Terms (BT), Many
 Narrower
 : Terms (NT) Related Terms (RT) etc
         ...
 : User supplied Term is say : Ski
 :
 : Prefered term: Skiing
 : Broader terms could be : Ski and Snow Boarding, Mountain Sports, Sports
 : Narrower terms: down hill skiing, telemark, cross country
 : Related terms: boarding, snow boarding, winter holidays

 I'm still lost.

 You've described a black box with some sample input (Ski) and some
 corrisponding sample output (PT=..., BT=..., NT=..., RT=) -- but you
 haven't explained what you want to do with tht black box.  Assuming such a
 black box existed in solr what are you expecting/hoping to do with it?
 how would such a black box modify solr's user experience?  what is your
 goal?

 Smells like an XY Problem...
 http://people.apache.org/~hossman/#xyproblem

 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the
 full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341


 -Hoss




Re: How badly does NTFS file fragmentation impact search performance? 1.1X? 10X? 100X?

2010-12-08 Thread Peter Sturge
There are, as you would expect, a lot of factors that impact the
amount of fragmentation that occurs:
commit rate, mergeFactor updates/deletes vs 'new' data etc.

Having run reasonably large indexes on NTFS (25GB), we've not found
fragmentation to be much of a hindrance.
I don't have any definitive benchmark numbers, sorry, but as an index
grows to large sizes, other factors overshadow
any fragmentation hit - e.g. sharding, replication, cache warming etc.

If you're really worried that fragmentation is affecting performance,
you can move to using SSD drives, which don't suffer from
fragmentation (in fact, they must never be defragmented), and of
course they absolutely fly.

Peter


On Wed, Dec 8, 2010 at 5:59 PM, Will Milspec will.mils...@gmail.com wrote:
 Hi all,

 Pardon if this isn't the best place to post this email...maybe it belongs on
 the lucene-user list .  Also, it's basically windows-specific,so not of use
 to everyone...

 The question: does NTFS fragmentation affect search performance a little
 bit or a lot? It's obvious that fragmentation will slow things down,
 but is it a factor of 1.1, 10, or 100? (i.e. what order of magnitude)?

 As a follow up: should solr/lucene users periodically remind Windows
 sysadmins to defrag their drives ?

 On a production system, I ran the windows defrag analyzer and found heavy
 fragmentation on the lucene index.

 11,839          492 MB          \data\index\search\_6io5.cfs
 7,153           433 MB          \data\index\search\_5ld6.cfs
 6,953           661 MB          \data\index\search\_8jvj.cfs
 5,824           74 MB           \data\index\search\_5ld7.frq
 5,691           356 MB          \data\index\search\_9eev.fdt
 5,638           352 MB          \data\index\search\_8mqi.fdt
 5,629           352 MB          \data\index\search\_8jvj.fdt
 5,609           351 MB          \data\index\search\_88z8.fdt
 5,590           355 MB          \data\index\search\_96l5.fdt
 5,568           354 MB          \data\index\search\_8zjn.fdt
 5,471           342 MB          \data\index\search\_5wgo.fdt
 5,466           342 MB          \data\index\search\_5uo1.fdt
 5,450           340 MB          \data\index\search\_5hrn.fdt
 5,429           345 MB          \data\index\search\_6nyy.fdt
 5,371           353 MB          \data\index\search\_8sob.fdt

 Incidentally, we periodically experience some *very* slow searches. Out of
 curiosity, I checked for file fragmentation (using 'analyze' mode of the
 ntfs defragger)

 nota bene: Windows sysinternals has a utility Contig.exe which allows you
 to defragment individual drives/directories. We'll use that to defragment
 the index directories

 will



Re: Preventing index segment corruption when windows crashes

2010-12-02 Thread Peter Sturge
The Win7 crashes aren't from disk drivers - they come from, in this
case, a Broadcom wireless adapter driver.
The corruption comes as a result of the 'hard stop' of Windows.

I would imagine this same problem could/would occur on any OS if the
plug was pulled from the machine.

Thanks,
Peter


On Thu, Dec 2, 2010 at 4:07 AM, Lance Norskog goks...@gmail.com wrote:
 Is there any way that Windows 7 and disk drivers are not honoring the
 fsync() calls? That would cause files and/or blocks to get saved out
 of order.

 On Tue, Nov 30, 2010 at 3:24 PM, Peter Sturge peter.stu...@gmail.com wrote:
 After a recent Windows 7 crash (:-\), upon restart, Solr starts giving
 LockObtainFailedException errors: (excerpt)

   30-Nov-2010 23:10:51 org.apache.solr.common.SolrException log
   SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock
 obtain timed out:
 nativefsl...@solr\.\.\data0\index\lucene-ad25f73e3c87e6f192c4421756925f47-write.lock


 When I run CheckIndex, I get: (excerpt)

  30 of 30: name=_2fi docCount=857
    compound=false
    hasProx=true
    numFiles=8
    size (MB)=0.769
    diagnostics = {os.version=6.1, os=Windows 7, lucene.version=3.1-dev 
 ${svnver
 sion} - 2010-09-11 11:09:06, source=flush, os.arch=amd64, 
 java.version=1.6.0_18,
 java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.FAILED
    WARNING: fixIndex() would remove reference to this segment; full 
 exception:
 org.apache.lucene.index.CorruptIndexException: did not read all bytes from 
 file
 _2fi.fnm: read 1 vs size 512
        at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:367)
        at org.apache.lucene.index.FieldInfos.init(FieldInfos.java:71)
        at 
 org.apache.lucene.index.SegmentReader$CoreReaders.init(SegmentReade
 r.java:119)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:583)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:561)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:467)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:878)

 WARNING: 1 broken segments (containing 857 documents) detected


 This seems to happen every time Windows 7 crashes, and it would seem
 extraordinary bad luck for this tiny test index to be in the middle of
 a commit every time.
 (it is set to commit every 40secs, but for such a small index it only
 takes millis to complete)

 Does this seem right? I don't remember seeing so many corruptions in
 the index - maybe it is the world of Win7 dodgy drivers, but it would
 be worth investigating if there's something amiss in Solr/Lucene when
 things go down unexpectedly...

 Thanks,
 Peter


 On Tue, Nov 30, 2010 at 9:19 AM, Peter Sturge peter.stu...@gmail.com wrote:
 The index itself isn't corrupt - just one of the segment files. This
 means you can read the index (less the offending segment(s)), but once
 this happens it's no longer possible to
 access the documents that were in that segment (they're gone forever),
 nor write/commit to the index (depending on the env/request, you get
 'Error reading from index file..' and/or WriteLockError)
 (note that for my use case, documents are dynamically created so can't
 be re-indexed).

 Restarting Solr fixes the write lock errors (an indirect environmental
 symptom of the problem), and running CheckIndex -fix is the only way
 I've found to repair the index so it can be written to (rewrites the
 corrupted segment(s)).

 I guess I was wondering if there's a mechanism that would support
 something akin to a transactional rollback for segments.

 Thanks,
 Peter



 On Mon, Nov 29, 2010 at 5:33 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge peter.stu...@gmail.com 
 wrote:
 If a Solr index is running at the time of a system halt, this can
 often corrupt a segments file, requiring the index to be -fix'ed by
 rewriting the offending file.

 Really?  That shouldn't be possible (if you mean the index is truly
 corrupt - i.e. you can't open it).

 -Yonik
 http://www.lucidimagination.com






 --
 Lance Norskog
 goks...@gmail.com



Re: Preventing index segment corruption when windows crashes

2010-12-02 Thread Peter Sturge
As I'm not familiar with the syncing in Lucene, I couldn't say whether
there's a specific problem with regards Win7/2008 server etc.

Windows has long had the somewhat odd behaviour of deliberately
caching file handles after an explicit close(). This has been part of
NTFS since NT 4 days, but there may be some new behaviour introduced
in Windows 6.x (and there is a lot of new behaviour) that causes an
issue. I have also seen this problem in Windows Server 2008 (server
version of Win7 - same file system).

I'll try some further testing on previous Windows versions, but I've
not previously come across a single segment corruption on Win 2k3/XP
after hard failures. In fact, it was when I first encountered this
problem on Server 2008 that I even discovered CheckIndex existed!

I guess a good question for the community is: Has anyone else
seen/reproduced this problem on Windows 6.x (i.e. Server 2008 or
Win7)?

Mike, are there any diagnostics/config etc. that I could try to help
isolate the problem?

Many thanks,
Peter



On Thu, Dec 2, 2010 at 9:28 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Thu, Dec 2, 2010 at 4:10 AM, Peter Sturge peter.stu...@gmail.com wrote:
 The Win7 crashes aren't from disk drivers - they come from, in this
 case, a Broadcom wireless adapter driver.
 The corruption comes as a result of the 'hard stop' of Windows.

 I would imagine this same problem could/would occur on any OS if the
 plug was pulled from the machine.

 Actually, Lucene should be robust to this -- losing power, OS crash,
 hardware failure (as long as the failure doesn't flip bits), etc.
 This is because we do not delete files associated with an old commit
 point until all files referenced by the new commit point are
 successfully fsync'd.

 However it sounds like something is wrong, at least on Windows 7.

 I suspect it may be how we do the fsync -- if you look in
 FSDirectory.fsync, you'll see that we take a String fileName in.  We
 then open a new read/write RandomAccessFile, and call its
 .getFD().sync().

 I think this is potentially risky, ie, it would be better if we called
 .sync() on the original file we had opened for writing and written
 lots of data to, before closing it, instead of closing it, opening a
 new FileDescriptor, and calling sync on it.  We could conceivably take
 this approach, entirely in the Directory impl, by keeping the pool of
 file handles for write open even after .close() was called.  When a
 file is deleted we'd remove it from that pool, and when it's finally
 sync'd we'd then sync it and remove it from the pool.

 Could it be that on Windows 7 the way we fsync (opening a new
 FileDescriptor long after the first one was closed) doesn't in fact
 work?

 Mike



Re: Tuning Solr caches with high commit rates (NRT)

2010-12-02 Thread Peter Sturge
In order for the 'read-only' instance to see any new/updated
documents, it needs to do a commit (since it's read-only, it is a
commit of 0 documents).
You can do this via a client service that issues periodic commits, or
use autorefresh from within solrconfig.xml. Be careful that you don't
do anything in the read-only instance that will change the underlying
index - like optimize.
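A minimal SolrJ sketch of such a client-side refresher (the URL and interval
are just placeholders; CommonsHttpSolrServer is the 1.4-era SolrJ client):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class ReadOnlyRefresher {
      public static void main(String[] args) throws Exception {
        // Point at the read-only instance, not the write instance
        SolrServer readOnly =
            new CommonsHttpSolrServer("http://localhost:8983/solr/core0");
        while (true) {
          // Empty commit: nothing is written, but a new searcher is opened
          // so this instance sees whatever the write instance has committed
          readOnly.commit();
          Thread.sleep(30 * 1000L); // refresh interval - tune to taste
        }
      }
    }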

Peter


On Thu, Dec 2, 2010 at 12:51 PM, stockii st...@shopgate.com wrote:

 great thread and exactly my problems :D

 I set up two Solr instances, one for updating the index and another for
 searching.

 When I perform an update, the search instance doesn't get the new documents.
 When I start a commit on the searcher, it finds them. How can I tell the
 searcher not to keep looking only at the old index? Automatic refresh? XD
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Tuning-Solr-caches-with-high-commit-rates-NRT-tp1461275p2005738.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Preventing index segment corruption when windows crashes

2010-11-30 Thread Peter Sturge
The index itself isn't corrupt - just one of the segment files. This
means you can read the index (less the offending segment(s)), but once
this happens it's no longer possible to
access the documents that were in that segment (they're gone forever),
nor write/commit to the index (depending on the env/request, you get
'Error reading from index file..' and/or WriteLockError)
(note that for my use case, documents are dynamically created so can't
be re-indexed).

Restarting Solr fixes the write lock errors (an indirect environmental
symptom of the problem), and running CheckIndex -fix is the only way
I've found to repair the index so it can be written to (rewrites the
corrupted segment(s)).

I guess I was wondering if there's a mechanism that would support
something akin to a transactional rollback for segments.

Thanks,
Peter



On Mon, Nov 29, 2010 at 5:33 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge peter.stu...@gmail.com wrote:
 If a Solr index is running at the time of a system halt, this can
 often corrupt a segments file, requiring the index to be -fix'ed by
 rewriting the offending file.

 Really?  That shouldn't be possible (if you mean the index is truly
 corrupt - i.e. you can't open it).

 -Yonik
 http://www.lucidimagination.com



Re: SOLR for Log analysis feasibility

2010-11-30 Thread Peter Sturge
We do a lot of precisely this sort of thing. Ours is a commercial
product (Honeycomb Lexicon) that extracts behavioural information from
logs, events and network data (don't worry, I'm not pushing this on
you!) - only to say that there are a lot of considerations beyond base
Solr when it comes to handling log, event and other 'transient' data
streams.
Aside from the obvious issues of horizontal scaling, reliable
delivery/retry/replication etc., there are other important issues,
particularly with regards data classification, reporting engines and
numerous other items.
It's one of those things that's sounds perfectly reasonable at the
outset, but all sorts of things crop up the deeper you get into it.

Peter


On Tue, Nov 30, 2010 at 11:44 AM, phoey pho...@gmail.com wrote:

 We are looking into building a reporting feature and investigating solutions
 which will allow us to search though our logs for downloads, searches and
 view history.

 Each log item is relatively small

 download history

 <add>
        <doc>
                <field name="uuid">item123-v1</field>
                <field name="market">photography</field>
                <field name="name">item 1</field>
                <field name="userid">1</field>
                <field name="version">1</field>
                <field name="downloadType">hires</field>
                <field name="itemId">123</field>
                <field name="timestamp">2009-11-07T14:50:54Z</field>
        </doc>
 </add>

 search history

 <add>
        <doc>
                <field name="uuid">1</field>
                <field name="query">brand assets</field>
                <field name="userid">1</field>
                <field name="timestamp">2009-11-07T14:50:54Z</field>
        </doc>
 </add>

 view history

 <add>
        <doc>
                <field name="uuid">1</field>
                <field name="itemId">123</field>
                <field name="userid">1</field>
                <field name="timestamp">2009-11-07T14:50:54Z</field>
        </doc>
 </add>


 and we reckon that we could have around 10 - 30 million log records for each
 type (downloads, searches, views) so 70 million records in total but
 obviously must scale higher.

 concurrent users will be around 10 - 20 (relatively low)

 new logs will be imported as a batch overnight.

 Because we have some previous experience with SOLR and because the interface
 needs to have full-text searching and filtering we built a prototype using
 SOLR 4.0. We used the new field collapsing feature within SOLR 4.0 to
 collapse on groups of data. For example view History needs to collapse on
 itemId. Each row will then show the frequency on how many views the item has
 had. This is achieved by the number of items which have been grouped.

 The requirements for the solution are to be schemaless, to make adding new
 fields to new documents easier, and to have a powerful search interface, both
 of which SOLR can do.

 QUESTIONS

 Our prototype is working as expected but im unsure if

 1. has anyone got experience with using SOLR for log analysis.
 2. SOLR can scale but when is the limit when i should start considering
 about sharding the index. It should be fine with 100+ million records.
 3. We are using a nightly build of SOLR for the field collapsing feature.
 Would it be possible to patch SOLR 1.4.1 with the SOLR-236 patch? has anyone
 used this in production?

 thanks
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SOLR-for-Log-analysis-feasibility-tp1992202p1992202.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Preventing index segment corruption when windows crashes

2010-11-30 Thread Peter Sturge
After a recent Windows 7 crash (:-\), upon restart, Solr starts giving
LockObtainFailedException errors: (excerpt)

   30-Nov-2010 23:10:51 org.apache.solr.common.SolrException log
   SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock
obtain timed out:
nativefsl...@solr\.\.\data0\index\lucene-ad25f73e3c87e6f192c4421756925f47-write.lock


When I run CheckIndex, I get: (excerpt)

 30 of 30: name=_2fi docCount=857
   compound=false
   hasProx=true
   numFiles=8
   size (MB)=0.769
   diagnostics = {os.version=6.1, os=Windows 7, lucene.version=3.1-dev ${svnver
sion} - 2010-09-11 11:09:06, source=flush, os.arch=amd64, java.version=1.6.0_18,
java.vendor=Sun Microsystems Inc.}
   no deletions
   test: open reader.FAILED
   WARNING: fixIndex() would remove reference to this segment; full exception:
org.apache.lucene.index.CorruptIndexException: did not read all bytes from file
_2fi.fnm: read 1 vs size 512
       at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:367)
       at org.apache.lucene.index.FieldInfos.init(FieldInfos.java:71)
       at org.apache.lucene.index.SegmentReader$CoreReaders.init(SegmentReade
r.java:119)
       at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:583)
       at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:561)
       at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:467)
       at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:878)

WARNING: 1 broken segments (containing 857 documents) detected


This seems to happen every time Windows 7 crashes, and it would seem
extraordinary bad luck for this tiny test index to be in the middle of
a commit every time.
(it is set to commit every 40secs, but for such a small index it only
takes millis to complete)

Does this seem right? I don't remember seeing so many corruptions in
the index - maybe it is the world of Win7 dodgy drivers, but it would
be worth investigating if there's something amiss in Solr/Lucene when
things go down unexpectedly...

Thanks,
Peter


On Tue, Nov 30, 2010 at 9:19 AM, Peter Sturge peter.stu...@gmail.com wrote:
 The index itself isn't corrupt - just one of the segment files. This
 means you can read the index (less the offending segment(s)), but once
 this happens it's no longer possible to
 access the documents that were in that segment (they're gone forever),
 nor write/commit to the index (depending on the env/request, you get
 'Error reading from index file..' and/or WriteLockError)
 (note that for my use case, documents are dynamically created so can't
 be re-indexed).

 Restarting Solr fixes the write lock errors (an indirect environmental
 symptom of the problem), and running CheckIndex -fix is the only way
 I've found to repair the index so it can be written to (rewrites the
 corrupted segment(s)).

 I guess I was wondering if there's a mechanism that would support
 something akin to a transactional rollback for segments.

 Thanks,
 Peter



 On Mon, Nov 29, 2010 at 5:33 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge peter.stu...@gmail.com 
 wrote:
 If a Solr index is running at the time of a system halt, this can
 often corrupt a segments file, requiring the index to be -fix'ed by
 rewriting the offending file.

 Really?  That shouldn't be possible (if you mean the index is truly
 corrupt - i.e. you can't open it).

 -Yonik
 http://www.lucidimagination.com




Preventing index segment corruption when windows crashes

2010-11-29 Thread Peter Sturge
Hi,

With the advent of new windows versions, there are increasing
instances of system blue-screens, crashes, freezes and ad-hoc
failures.
If a Solr index is running at the time of a system halt, this can
often corrupt a segments file, requiring the index to be -fix'ed by
rewriting the offending file.
Aside from the vagaries of automating such fixes, depending on the
mergeFactor, this can be quite a few documents permanently lost.

Would anyone have any experience/wisdom/insight on ways to mitigate
such corruption in Lucene/Solr - e.g. applying a temp file technique
etc.; though perhaps not 'just use Linux'.. :-)
There are, of course, client-side measures that can hold some number of
pending documents until they are truly committed, but a
server-side/Lucene method would be preferable, if possible.

Thanks,
Peter


Re: SOLR and secure content

2010-11-23 Thread Peter Sturge
Yes, as mentioned in the above link, there's SOLR-1872 for maintaining
your own document-level access control. Also, if you have access to
the file system documents and want to use their existing ACL, have a
look at SOLR-1834.
Document-level access control can be a real 'can of worms', and it can
be worthwhile spending a bit of time defining exactly what you need.

Thanks,
Peter



On Mon, Nov 22, 2010 at 11:58 PM, Savvas-Andreas Moysidis
savvas.andreas.moysi...@googlemail.com wrote:
 maybe this older thread on Modeling Access Control might help:

 http://lucene.472066.n3.nabble.com/Modelling-Access-Control-td1756817.html#a1761482

 Regards,
 -- Savvas

 On 22 November 2010 18:53, Jos Janssen j...@websdesign.nl wrote:


 Hi,

 We plan to make an application layer in PHP which will communicate to the
 solr server.

 Direct calls will only be made for administration purposes only.

 regards,

 jos
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SOLR-and-secure-content-tp1945028p1947970.html
 Sent from the Solr - User mailing list archive at Nabble.com.




RE: DataImportHandlerException for custom DIH Transformer

2010-11-19 Thread Peter Sturge
Hi,

This problem is usually because your custom Transformer is in the
solr/lib folder, when it needs to be in the webapps .war file (under
WEB-INF/lib of course).
Place your custom Transformer in a .jar in your .war and you should be
good to go.
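For reference, the transformer itself is tiny - a sketch along the lines of
the wiki example (the package/class name is up to you and purely illustrative):

    import java.util.Map;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    // Trims every String value in the row; reference it from data-config.xml
    // via transformer="com.example.dih.TrimTransformer" (hypothetical name)
    public class TrimTransformer extends Transformer {

      @Override
      public Object transformRow(Map<String, Object> row, Context context) {
        for (Map.Entry<String, Object> entry : row.entrySet()) {
          Object value = entry.getValue();
          if (value instanceof String) {
            entry.setValue(((String) value).trim());
          }
        }
        return row;
      }

      // As noted later in this thread, some DIH versions reflectively look up
      // a single-argument transformRow(Map), so this overload doesn't hurt.
      public Object transformRow(Map<String, Object> row) {
        return transformRow(row, null);
      }
    }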

Thanks,
Peter



Subject:
RE: DataImportHandlerException for custom DIH Transformer
From:
Vladimir Sutskever vladimir.sutske...@...
Date:
1969-12-31 19:00

I am experiencing a similar situation?

Any comments?


-Original Message-
From: Shashikant Kore [mailto:shashik...@gmail.com]
Sent: Wednesday, September 08, 2010 2:54 AM
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandlerException for custom DIH Transformer

Resurrecting an old thread.

I faced exact problem as Tommy and the jar was in {solr.home}/lib as Noble
had suggested.

My custom transformer overrides following method as per the specification of
Transformer class.

public Object transformRow(Map<String, Object> row, Context context);

But, in the code (EntityProcessorWrapper.java), I see the following line.

  final Method meth = clazz.getMethod(TRANSFORM_ROW, Map.class);

This doesn't match the method signature in Transformer. I think this should
be

  final Method meth = clazz.getMethod(TRANSFORM_ROW, Map.class,
Context.class);

I have verified that adding a method transformRow(Map<String, Object> row)
works.

Am I missing something?

--shashi

2010/2/8 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

On Mon, Feb 8, 2010 at 9:13 AM, Tommy Chheng tommy.chh...@gmail.com wrote:

I'm having trouble making a custom DIH transformer in solr
1.4. I compiled the General TrimTransformer into a jar. (just
copy/paste

sample

code from http://wiki.apache.org/solr/DIHCustomTransformer) I
placed the jar along with the dataimporthandler jar in solr/lib (same
directory as the jetty jar)

do not keep it in solr/lib, it won't work. keep it in {solr.home}/lib

Then I added to my DIH data-config.xml file:
transformer=DateFormatTransformer, RegexTransformer,
com.chheng.dih.transformers.TrimTransformer Now I get this exception
when I try running the import.
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.NoSuchMethodException:
com.chheng.dih.transformers.TrimTransformer.transformRow(java.util.Map)
at


org.apache.solr.handler.dataimport.EntityProcessorWrapper.loadTransformers(EntityProcessorWrapper.java:120)

I noticed the exception lists
TrimTransformer.transformRow(java.util.Map) but the abstract
Transformer class defines a two parameter method:
transformRow(Map<String, Object> row, Context context)? -- Tommy
Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng
http://tommy.chheng.com

-- - Noble
Paul | Systems Architect| AOL | http://aol.com


Re: Possibilities of (near) real time search with solr

2010-11-18 Thread Peter Sturge
 Maybe I didn't fully understood what you explained: but doesn't this mean
 that you'll have one index per day?
 Or are you overwriting, via replicating, every shard and the number of shard
 is fixed?
 And why are you replicating from the local replica to the next shard? (why
 not directly from active to next shard?)

Yes, you can have one index per day (for us, our boundary is typically
1 month, so it's less of an issue).
The 'oldest' replica in the round robin is overwritten, yes. We use
fixed shard numbers, but you don't have to.
Does yours need to be once a day?
We used our own round robin code because it was pre-Solr Cloud...
I'm not too familiar with them, but I believe it's certainly worth
having a look at Solr Cloud or Katta - could be useful here in
dynamically allocating shards.

Peter



On Thu, Nov 18, 2010 at 5:41 PM, Peter Karich peat...@yahoo.de wrote:
  Hi Peter!

 * I believe the NRT patches are included in the 4.x trunk. I don't
 think there's any support as yet in 3x (uses features in Lucene 3.0).

 I'll investigate how much effort it is to update to solr4

 * For merging, I'm talking about commits/writes. If you merge while
 commits are going on, things can get a bit messy (maybe on source
 cores this is ok, but I have a feeling it's not).

 ok

 * For moving data to an 'offline' read-only core, this is the trickiest
 bit.
 We do this today by using a round-robin chain of remote shards and 2
 local cores. At the boundary time (e.g. 1 day), the 'active' core is
 replicated locally, then this local replica is replicated to the next
 shard in the chain. Once everything is complete, the local replica is
 discarded, and the 'active' core is cleaned, being careful not to
 delete any new data since the replicated commit point.

 Maybe I didn't fully understood what you explained: but doesn't this mean
 that you'll have one index per day?
 Or are you overwriting, via replicating, every shard and the number of shard
 is fixed?
 And why are you replicating from the local replica to the next shard? (why
 not directly from active to next shard?)

 Regards,
 Peter.



Re: Possibilities of (near) real time search with solr

2010-11-18 Thread Peter Sturge
 no, I only thought you use one day :-)
 so you don't or do you have 31 shards?


No, we use 1 shard per month - e.g. 7 shards will hold 7 month's of data.
It can be set to 1 day, but you would need to have a huge amount of
data in a single day to warrant doing that.



On Thu, Nov 18, 2010 at 8:20 PM, Peter Karich peat...@yahoo.de wrote:


  Does yours need to be once a day?

 no, I only thought you use one day :-)
 so you don't or do you have 31 shards?


  having a look at Solr Cloud or Katta - could be useful
  here in dynamically allocating shards.

 ah, thx! I will take a look at it (after trying solr4)!

 Regards,
 Peter.


 Maybe I didn't fully understood what you explained: but doesn't this mean
 that you'll have one index per day?
 Or are you overwriting, via replicating, every shard and the number of
 shard
 is fixed?
 And why are you replicating from the local replica to the next shard?
 (why
 not directly from active to next shard?)

 Yes, you can have one index per day (for us, our boundary is typically
 1 month, so is less of an issue).
 The 'oldest' replica in the round robin is overwritten, yes. We use
 fixed shard numbers, but you don't have to.
 Does yours need to be once a day?
 We used our own round robin code because it was pre-Solr Cloud...
 I'm not too familiar with them, but I believe it's certainly worth
 having a look at Solr Cloud or Katta - could be useful here in
 dynamically allocating shards.

 Peter



 On Thu, Nov 18, 2010 at 5:41 PM, Peter Karich peat...@yahoo.de wrote:

  Hi Peter!

 * I believe the NRT patches are included in the 4.x trunk. I don't
 think there's any support as yet in 3x (uses features in Lucene 3.0).

 I'll investigate how much effort it is to update to solr4

 * For merging, I'm talking about commits/writes. If you merge while
 commits are going on, things can get a bit messy (maybe on source
 cores this is ok, but I have a feeling it's not).

 ok

 * For moving data to an 'offline' read-only core, this is the
 trickiest
 bit.
 We do this today by using a round-robin chain of remote shards and 2
 local cores. At the boundary time (e.g. 1 day), the 'active' core is
 replicated locally, then this local replica is replicated to the next
 shard in the chain. Once everything is complete, the local replica is
 discarded, and the 'active' core is cleaned, being careful not to
 delete any new data since the replicated commit point.

 Maybe I didn't fully understood what you explained: but doesn't this mean
 that you'll have one index per day?
 Or are you overwriting, via replicating, every shard and the number of
 shard
 is fixed?
 And why are you replicating from the local replica to the next shard?
 (why
 not directly from active to next shard?)

 Regards,
 Peter.



 --
 http://jetwick.com twitter search prototype




Re: Possibilities of (near) real time search with solr

2010-11-17 Thread Peter Sturge
* I believe the NRT patches are included in the 4.x trunk. I don't
think there's any support as yet in 3x (uses features in Lucene 3.0).

* For merging, I'm talking about commits/writes. If you merge while
commits are going on, things can get a bit messy (maybe on source
cores this is ok, but I have a feeling it's not).

* For moving data to an 'offline' read-only core, this is the trickiest bit.
We do this today by using a round-robin chain of remote shards and 2
local cores. At the boundary time (e.g. 1 day), the 'active' core is
replicated locally, then this local replica is replicated to the next
shard in the chain. Once everything is complete, the local replica is
discarded, and the 'active' core is cleaned, being careful not to
delete any new data since the replicated commit point.
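(For anyone wanting to script something similar with the stock
ReplicationHandler, the pull step can be driven over HTTP roughly like this -
host and core names here are made up:

    http://localhost:8983/solr/replica-core/replication?command=fetchindex&masterUrl=http://localhost:8983/solr/active-core/replication

We use our own code for the chaining and cleanup around it.)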

It's not the easiest thing to implement, but boy it scales forever!

Peter


Re: Tuning Solr caches with high commit rates (NRT)

2010-11-16 Thread Peter Sturge
Many thanks, Peter K. for posting up on the wiki - great!

Yes, fc = field cache. Field Collapsing is something very nice indeed,
but is entirely different.

As Erik mentions in the wiki post, using per-segment faceting can be a
huge boon to performance. It does require the latest Solr trunk build
and new Lucene, though (last time I checked, this isn't in the Solr 3x
branch).

enum vs fc? This will depend a lot on what your data looks like - e.g.
lots of unique terms vs lots of the same terms.
In all the tests we've done here with 20m doc indexes (using 3x
branch), enum has always used less memory than fc (sometimes much
less), but fc is faster for searches. Again, your data experience may
vary.
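(As a reminder, facet.method can be chosen per field rather than globally,
e.g. - field names here are just examples:

    ...&facet=true&facet.field=host&f.host.facet.method=enum
                  &facet.field=severity&f.severity.facet.method=fc

so you can mix the two based on each field's term distribution.)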

The main point in this thread for NRT and faceting is to warm caches
as quickly as possible - this generally means judicious facet
selection, and for us at least, using LRUCache a.o.t. FastLRUCache for
filter caches.



On Mon, Nov 15, 2010 at 11:56 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:
 (10/11/16 8:36), Jonathan Rochkind wrote:

 In Solr 1.4, facet.method=enum DOES work on multi-valued fields, I'm
 pretty certain.

 Correct, and I didn't say that facet.method=enum doesn't work for
 multiValued/tokenized field in my previous mail.

 I think Koji's explanation is based on before Solr 1.4

 No, as facet.method had been introduced in 1.4.

 Koji
 --
 http://www.rondhuit.com/en/



Re: Possibilities of (near) real time search with solr

2010-11-16 Thread Peter Sturge
Hi Peter,

First off, many thanks for putting together the NRT Wiki page!

This may have changed recently, but the NRT stuff - e.g. per-segment
commits etc. is for the latest Solr 4 trunk only.
If your setup uses the 3x Solr code branch, then there's a bit of work
to do to move to the new version.
Some of this is due to the new 3.x Lucene, which has a lot of cool new
stuff in it, but also deprecates a lot of old stuff,
so existing SolrJ clients and custom server-side code/configuration
will need to take this into account.
We've not had the time to do this, so that's about as far as I can go
on that one for now.

We have had some very good success with distributed/shard searching -
i.e. 'new' data arrives in a relatively small index, and so can remain
fast, whilst distributed shards hold 'older' data and so can keep
their caches warm (i.e. very few/no commits). This works particularly
well for summary data (facets, filter queries etc. that sit in
caches).
Be careful about merging, as all involved cores will pause for the
merging period. Really needs to be done out-of-hours, or better still,
offline (i.e. replicate the cores, then merge, then bring them live).
The trickiest bit about the above is defining when data is deemed to
be 'old' and then moving that data in an efficient manner to a
read-only shard. Using SolrJ can help in this regard as it can offload
some of the administration from the server(s).
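(For completeness, querying across such a layout is just the standard
distributed search syntax - host and core names below are made up:

    http://localhost:8983/solr/recent/select?q=...&shards=localhost:8983/solr/recent,archive1:8983/solr/2010-10,archive2:8983/solr/2010-09

with the small 'new data' core listed alongside the older read-only shards.)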

Thanks,
Peter


On Mon, Nov 15, 2010 at 8:06 PM, Peter Karich peat...@yahoo.de wrote:
 Hi,

 I wanted to provide my indexed docs (tweets) relatively fast: so 1 to 10 sec
 or even 30 sec would be ok.

 At the moment I am using the read only core scenario described here (point
 5)*
 with a commit frequency of 180 seconds which was fine until some days. (I am
 using solr1.4.1)
 Now the time a commit takes is too high (40-80s) and too CPU-heavy because
 the index is too large (7GB).

 I thought about some possible solutions:
 1. using solr NRT patches**
 2. using shards (+ multicore) where I feed into a relatively small core and
 merge them later (every hour or so) to reduce the number of cores
 3. It would be also nice if someone could explain what and if there are
 benefits when using solr4.0 ...

 The problem for 1. is that I haven't found a guide on how to apply all the
 patches. Or is NRT not possible at the moment with solr? Does anybody have a
 link for me?

 Then I looked into solution 2. It seems to me that the CPU- and
 administration-overhead of sharding can be quite high. Any hints (I am using
 SolrJ)? E.g. I need to include the date facet patch

 Or how would you solve this?

 Regards,
 Peter.

 *
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201009.mbox/%3caanlktincgekjlbxe_bsaahlct_hlr_kwuxm5zxovt...@mail.gmail.com%3e

 **
 https://issues.apache.org/jira/browse/SOLR-1606


 --
 http://jetwick.com twitter search prototype



Re: Modelling Access Control

2010-10-24 Thread Peter Sturge
Hi,

See SOLR-1872 for a way of providing access control, whilst placing
the ACL configuration itself outside of Solr, which is generally a
good idea.
   
http://www.lucidimagination.com/search/out?u=http://issues.apache.org/jira/browse/SOLR-1872

There are a number of ways to approach Access Control, but you will
need to take a number of factors into account that aren't issues if
you're doing non-acl Solr queries.
You can use this patch to achieve authentication and authorization, or
use it as a template for similar techniques.

Peter



On Sat, Oct 23, 2010 at 9:03 AM, Paul Carey paul.p.ca...@gmail.com wrote:
 Hi

 My domain model is made of users that have access to projects which
 are composed of items. I'm hoping to use Solr and would like to make
 sure that searches only return results for items that users have
 access to.

 I've looked over some of the older posts on this mailing list about
 access control and saw a suggestion along the lines of
 acl:user_id AND (actual query).

 While this obviously works, there are a couple of niggles. Every item
 must have a list of valid user ids (typically less than 100 in my
 case). Every time a collaborator is added to or removed from a
 project, I need to update every item in that project. This will
 typically be fewer than 1000 items, so I guess is no big deal.

 I wondered if the following might be a reasonable alternative,
 assuming the number of projects to which a user has access is lower
 than a certain bound.
 (acl:project_id OR acl:project_id OR ... ) AND (actual query)

 When the numbers are small - e.g. each user has access to ~20 projects
 and each project has ~20 collaborators - is one approach preferable
 over another? And when outliers exist - e.g. a project with 2000
 collaborators, or a user with access to 2000 projects - is one
 approach more liable to fail than the other?

 Many thanks

 Paul



Spanning an index across multiple volumes

2010-10-17 Thread Peter Sturge
Is it possible to get an index to span multiple disk volumes - i.e.
when its 'primary' volume fills up (or optimize needs more room), tell
Solr/Lucene to use a secondary/tertiary/quaternary et al volume?

I've not seen any configuration that would allow this, but maybe
others have a use case for such functionality?

Thanks,
Peter


Re: Experience running Solr on ISCSI

2010-10-08 Thread Peter Sturge
Hi,

We've used iSCSI SANs with 6x1TB 15k SAS drives RAID10 in production
environments, and this works very well for both reads and writes. We
also have FibreChannel environments, and this is faster as you would
expect. It's also a lot more expensive.

The performance bottleneck will have more to do with running
virtualization rather than with iSCSI-based hardware. If you run a
physical server using iSCSI with decent disks, you should be getting
good results.

Peter




On Thu, Oct 7, 2010 at 12:45 PM, Shawn Heisey elyog...@elyograg.org wrote:
  On 10/6/2010 7:23 AM, Thijs wrote:

 Hi.

 Our hardware department is planning on moving some stuff to new machines
 (on our request)
 They are suggesting using virtualization (some CISCO solution) on those
 machines and having the 'disk' connected via ISCSI.

 Does anybody have experience running a SOLR index on a ISCSI drive?
 We have already tried with NFS but that is slowing the index process down
 too much, about 12 times slower. So NFS is a no-go. I could have known that, as
 it is mentioned in a lot of places to avoid NFS. But I can't find info about
 iSCSI

 Does anybody have experience running a SOLR index on a virtualized
 environment? Is it resistant enough that it keeps working when the
 virtualized machine is transferred to a different hardware node?

 thanks

 I've not actually used it myself, but I would not expect it to cause you any
 issues.  It should be similar to fibrechannel.  Usually fibrechannel is
 faster, unless you REALLY spend some money and get 10Gb/s ethernet hardware.
  If we assume that you'll have a fairly standard gigabit setup with only one
 port on your server, you should see potential speeds near one gigabit.  This
 is faster than the sustained rate on most single hard drives.  I was just
 reading that Seagate's 15K 600GB SAS drive is 171MB/s, which would get close
 to 1.3Gb/s, so in that case, it could overwhelm a single iSCSI port.

 With something like iSCSI or fibrechannel, you have extra points of failure,
 because you normally don't want to implement them without dedicated
 switching hardware.  The solution there is redundancy, which of course
 drives the cost up even higher.  You also usually get higher speeds because
 of load balancing across those multiple links.

 Shawn




Re: Question Related to sorting on Date

2010-09-27 Thread Peter Sturge
Hi Ahson,

You'll really want to store an additional date field (make it a
TrieDateField type) that has only the date, and in the reverse order
from how you've shown it. You can still keep the one you've got, just
use it only for 'human viewing' rather than sorting.
Something like:
20080205  if your example is 5 Feb, or 20080502 for May 2nd.

This way, the parsing is most efficient, you won't have to do any
tricky parsing at sort time, and, when your index gets large, your
sorted searches will remain fast.
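For example (field names are just illustrative, and this assumes the tdate
and tint field types from the stock example schema.xml):

    <!-- a real date type, with the time portion zeroed at index time -->
    <field name="sort_date" type="tdate" indexed="true" stored="false"/>

    <!-- or an integer in yyyyMMdd form, e.g. 20080205 -->
    <field name="sort_date_int" type="tint" indexed="true" stored="false"/>

and then sort with sort=sort_date asc (or sort=sort_date_int asc).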




On Mon, Sep 27, 2010 at 7:45 PM, Ahson Iqbal mianah...@yahoo.com wrote:
 hi all

 I have a question related to sorting of date field i have Date field  that is
 indexed like a string and look like 5/2/2008 4:33:30 PM i want  to do 
 sorting
 on this field on the basis of date, time does not  matters. any suggestion 
 how i
 could ignore the time part from this field  and just sort on the date?





Re: Solr Reporting

2010-09-23 Thread Peter Sturge
Hi,

Are you going to generate a report with 30,000 records in it? That will
be a very large report - will anyone really want to read through that?
If you want/need 'summary' reports - i.e. stats on the 30k records,
it is much more efficient to setup faceting and/or server-side
analysis to do this, rather than download
30,000 records to a client, then do statistical analysis on the result.
It will take a while to stream 30,000 records over an http connection,
and, if you're building, say, a PDF table for 30k records, that will
take some time as well.
Server-side analysis, then just sending the results, will work better, if
that fits your remit for reporting.
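(As an example, a single request along these lines - reusing a couple of the
fields from your download-history documents - returns counts and numeric
summaries without shipping the raw records:

    http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=downloadType&facet.field=market&stats=true&stats.field=version

using the standard facet and stats components.)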

Peter



On Thu, Sep 23, 2010 at 4:14 PM, Adeel Qureshi adeelmahm...@gmail.com wrote:
 Thank you for your suggestions .. makes sense and I didn't know about the
 XsltResponseWriter .. that opens up the door to all kinds of possibilities .. so
 it's great to know about that

 but before I go that route .. what about performance .. In the Solr Wiki it
 mentions that XSLT transformation isn't so bad in terms of memory usage but I
 guess it's all relative to the amount of data and obviously system resources
 ..

 my data set will be around 15000 - 30'000 records at the most .. I do have
 about 30 some fields but all fields are either small strings (less than 500
 chars) or dates, int, booleans etc .. so should I be worried about
 performance problems while doing the XSLT translations .. secondly for
 reports I'll have to request solr to send all 15000 some records at the same
 time to be entered in report output files .. is there a way to kind of
 stream that process .. well I think Solr native xml is already streamed to
 you but sounds like for the translation it will have to load the whole thing
 in RAM ..

 and again what about SolrJ .. isn't that supposed to provide better
 performance since it's in Java .. well I guess it shouldn't be much different
 since it also uses the HTTP calls to communicate to Solr ..

 Thanks for your help
 Adeel

 On Thu, Sep 23, 2010 at 7:16 AM, kenf_nc ken.fos...@realestate.com wrote:


 keep in mind that the <str name="id"> paradigm isn't completely useless,
 the str is a data type (string), it can be int, float, double, date, and
 others. So to not lose any information you may want to do something like:

 <id type="int">123</id>
 <title type="str">xyz</title>

 Which I agree makes more sense to me. The name of the field is more
 important than its datatype, but I don't want to lose track of the data
 type.

 Ken
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Solr Reporting

2010-09-23 Thread Peter Sturge
Yes, that makes sense. So, more of a bulk data export requirement.
If the excel data doesn't have to go out on the web, you could export
to a local file (using a local solrj streamer), then publish it,
which might save some external http bandwidth if that's a concern.
We do this all the time using a local solrj client, so if you've got a
big data stream (e.g. an entire core), you don't
have to send it through your outward-facing web servers. Using a
replica to retrieve/export the data might be worth considering as
well.
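A rough sketch of that kind of local export (1.4-era SolrJ; the URL, query,
field names and output file are all placeholders):

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class OrderExport {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr/orders");
        PrintWriter out = new PrintWriter(new FileWriter("orders.csv"));
        out.println("id,timestamp,userid");

        int rows = 1000;   // page size - keeps each response modest
        int start = 0;
        while (true) {
          SolrQuery q = new SolrQuery("timestamp:[NOW-6MONTHS TO NOW]");
          q.setStart(start);
          q.setRows(rows);
          QueryResponse rsp = server.query(q);
          if (rsp.getResults().isEmpty()) {
            break;
          }
          for (SolrDocument doc : rsp.getResults()) {
            out.println(doc.getFieldValue("id") + ","
                + doc.getFieldValue("timestamp") + ","
                + doc.getFieldValue("userid"));
          }
          start += rows;
        }
        out.close();
      }
    }

(Paging with start/rows like this avoids pulling the whole result set into
memory in one response.)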


On Thu, Sep 23, 2010 at 7:21 PM, Adeel Qureshi adeelmahm...@gmail.com wrote:
 Hi Peter

 I understand what you are saying but I think you are thinking more of a report
 as graph and analysis and summary kind of data .. for my reports I do need
 to include all records that qualify certain criteria .. e.g. a listing of
 all orders placed in the last 6 months .. now that could be 10,000 orders and yes
 I will probably need a report that summarizes all that data but at the same
 time .. I need all those 10,000 records to be exported in an excel file ..
 those are the reports that I am talking about ..

 and 30,000 probably is a stretch .. it might be 10-15000 at the most but I
 guess it's still the same idea .. and yes I realize that it's a lot of data to
 be transferred over http .. but that's exactly why I am asking for suggestions
 on how to do it .. I find it hard to believe that this is an unusual
 requirement .. I think most companies do reports that dump all records from
 databases in excel files ..

 so again to clarify I definitely need reports that present statistics and
 averages and yes I will be using facets and all kind of stuff there and I am
 not so concerned about those reports because like you pointed out, for those
 reports there will be very little data transfer but its the full data dump
 reports that I am trying to figure out the best way to handle.

 Thanks for your help
 Adeel



 On Thu, Sep 23, 2010 at 11:43 AM, Peter Sturge peter.stu...@gmail.comwrote:

 Hi,

 Are you going to generate a report with 30,000 records in it? That will
 be a very large report - will anyone really want to read through that?
 If you want/need 'summary' reports - i.e. stats on the 30k records,
 it is much more efficient to setup faceting and/or server-side
 analysis to do this, rather than download
 30,000 records to a client, then do statistical analysis on the result.
 It will take a while to stream 30,000 records over an http connection,
 and, if you're building, say, a PDF table for 30k records, that will
 take some time as well.
 Server-side analysis, then just sending the results, will work better, if
 that fits your remit for reporting.

 Peter



 On Thu, Sep 23, 2010 at 4:14 PM, Adeel Qureshi adeelmahm...@gmail.com
 wrote:
  Thank you for your suggestions .. makes sense and I didnt knew about the
  XsltResponseWriter .. that opens up door to all kind of possibilities
 ..so
  its great to know about that
 
  but before I go that route .. what about performance .. In Solr Wiki it
  mentions that XSLT transformation isnt so bad in terms of memory usage
 but I
  guess its all relative to the amount of data and obviously system
 resources
  ..
 
  my data set will be around 15000 - 30'000 records at the most ..I do have
  about 30 some fields but all fields are either small strings (less than
 500
  chars) or dates, int, booleans etc .. so should I be worried about
  performances problems while doing the XSLT translations .. secondly for
  reports Ill have to request solr to send all 15000 some records at the
 same
  time to be entered in report output files .. is there a way to kind of
  stream that process .. well I think Solr native xml is already streamed
 to
  you but sounds like for the translation it will have to load the whole
 thing
  in RAM ..
 
  and again what about SolrJ .. isnt that supposed to provide better
  performance since its in java .. well I guess it shouldnt be much
 different
  since it also uses the HTTP calls to communicate to Solr ..
 
  Thanks for your help
  Adeel
 
  On Thu, Sep 23, 2010 at 7:16 AM, kenf_nc ken.fos...@realestate.com
 wrote:
 
 
  keep in mind that the <str name="id"> paradigm isn't completely useless,
  the str is a data type (string), it can be int, float, double, date, and
  others. So to not lose any information you may want to do something like:

  <id type="int">123</id>
  <title type="str">xyz</title>

  Which I agree makes more sense to me. The name of the field is more
  important than its datatype, but I don't want to lose track of the data
  type.
 
  Ken
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 




Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Peter Sturge
Hi,

It's great to see such a fantastic response to this thread - NRT is
alive and well!

I'm hoping to collate this information and add it to the wiki when I
get a few free cycles (thanks Erik for the heads up).

In the meantime, I thought I'd add a few tidbits of additional
information that might prove useful:

1. The first one to note is that the techniques/setup described in
this thread don't fix the underlying potential for OutOfMemory errors
- there can always be an index large enough to ask of its JVM more
memory than is available for cache.
These techniques, however, mitigate the risk, and provide an efficient
balance between memory use and search performance.
There are some interesting discussions going on for both Lucene and
Solr regarding the '2 pounds of baloney into a 1 pound bag' issue of
unbounded caches, with a number of interesting strategies.
One strategy that I like, but haven't found in discussion lists is
auto-limiting cache size/warming based on available resources (similar
to the way file system caches use free memory). This would allow
caches to adjust to their memory environment as indexes grow.

2. A note regarding lockType in solrconfig.xml for dual Solr
instances: It's best not to use 'none' as a value for lockType - this
sets the lockType to null, and as the source comments note, this is a
recipe for disaster, so, use 'simple' instead.

3. Chris mentioned setting maxWarmingSearchers to 1 as a way of
minimizing the number of onDeckSearchers. This is a prudent move --
thanks Chris for bringing this up!
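
For anyone wiring these up, here is a rough sketch of where the settings from
points 2 and 3 sit in a 1.4/3.x-style solrconfig.xml (the values are just the
ones discussed in this thread; double-check element placement against your
own config):

    <indexDefaults>
      <!-- ... other index settings ... -->
      <lockType>simple</lockType>
    </indexDefaults>

    <query>
      <!-- ... cache settings ... -->
      <useColdSearcher>true</useColdSearcher>
      <maxWarmingSearchers>1</maxWarmingSearchers>
    </query>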

All the best,
Peter




On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich peat...@yahoo.de wrote:
 Peter Sturge,

 this was a nice hint, thanks again! If you are here in Germany anytime I
 can invite you to a beer or an apfelschorle ! :-)
 I only needed to change the lockType to none in the solrconfig.xml,
 disable the replication and set the data dir to the master data dir!

 Regards,
 Peter Karich.

 Hi Peter,

 this scenario would be really great for us - I didn't know that this is
 possible and works, so: thanks!
 At the moment we are doing similar with replicating to the readonly
 instance but
 the replication is somewhat lengthy and resource-intensive at this
 datavolume ;-)

 Regards,
 Peter.


 1. You can run multiple Solr instances in separate JVMs, with both
 having their solr.xml configured to use the same index folder.
 You need to be careful that one and only one of these instances will
 ever update the index at a time. The best way to ensure this is to use
 one for writing only,
 and the other is read-only and never writes to the index. This
 read-only instance is the one to use for tuning for high search
 performance. Even though the RO instance doesn't write to the index,
 it still needs periodic (albeit empty) commits to kick off
 autowarming/cache refresh.

 Depending on your needs, you might not need to have 2 separate
 instances. We need it because the 'write' instance is also doing a lot
 of metadata pre-write operations in the same jvm as Solr, and so has
 its own memory requirements.

 2. We use sharding all the time, and it works just fine with this
 scenario, as the RO instance is simply another shard in the pack.


 On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich peat...@yahoo.de wrote:


 Peter,

 thanks a lot for your in-depth explanations!
 Your findings will be definitely helpful for my next performance
 improvement tests :-)

 Two questions:

 1. How would I do that:



 or a local read-only instance that reads the same core as the indexing
 instance (for the latter, you'll need something that periodically 
 refreshes - i.e. runs commit()).


 2. Did you try sharding with your current setup (e.g. one big,
 nearly-static index and a tiny write+read index)?

 Regards,
 Peter.



 Hi,

 Below are some notes regarding Solr cache tuning that should prove
 useful for anyone who uses Solr with frequent commits (e.g. 5min).

 Environment:
 Solr 1.4.1 or branch_3x trunk.
 Note the 4.x trunk has lots of neat new features, so the notes here
 are likely less relevant to the 4.x environment.

 Overview:
 Our Solr environment makes extensive use of faceting, we perform
 commits every 30secs, and the indexes tend to be on the large-ish side
 (20million docs).
 Note: For our data, when we commit, we are always adding new data,
 never changing existing data.
 This type of environment can be tricky to tune, as Solr is more geared
 toward fast reads than frequent writes.

 Symptoms:
 If anyone has used faceting in searches where you are also performing
 frequent commits, you've likely encountered the dreaded OutOfMemory or
 GC Overhead Exceeded errors.
 In high commit rate environments, this is almost always due to
 multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
 finish autowarming their caches before the next commit()
 comes along and invalidates them.
 Once this starts happening on a regular basis, it is likely your
 Solr's JVM will run out of memory

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Peter Sturge
Solr 4.x has new NRT stuff included (uses latest Lucene 3.x, includes
per-segment faceting etc.). The Solr 3.x branch doesn't currently..


On Fri, Sep 17, 2010 at 8:06 PM, Andy angelf...@yahoo.com wrote:
 Does Solr use Lucene NRT?

 --- On Fri, 9/17/10, Erick Erickson erickerick...@gmail.com wrote:

 From: Erick Erickson erickerick...@gmail.com
 Subject: Re: Tuning Solr caches with high commit rates (NRT)
 To: solr-user@lucene.apache.org
 Date: Friday, September 17, 2010, 1:05 PM
 Near Real Time...

 Erick

 On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon gear...@sbcglobal.net wrote:

  BTW, what is NRT?
 
  Dennis Gearon
 
  Signature Warning
  
  EARTH has a Right To Life,
   otherwise we all die.
 
  Read 'Hot, Flat, and Crowded'
  Laugh at http://www.yert.com/film.php
 
 
Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
The balanced segment merging is a really cool idea. I'll definitely
have a look at this, thanks!

One thing I forgot to mention in the original post is we use a
mergeFactor of 25. Somewhat on the high side, so that incoming commits
aren't trying to merge new data into large segments.
25 is a good balance for us between number of files and search
performance. This LinkedIn patch could come in very handy for handling
merges.
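
For completeness, the mergeFactor is just a solrconfig.xml setting - a sketch
(the same element also exists under indexDefaults; use whichever section your
config relies on):

    <mainIndex>
      <!-- ... -->
      <mergeFactor>25</mergeFactor>
    </mainIndex>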


On Mon, Sep 13, 2010 at 2:20 AM, Lance Norskog goks...@gmail.com wrote:
 Bravo!

 Other tricks: here is a policy for deciding when to merge segments that
 attempts to balance merging with performance. It was contributed by
 LinkedIn - they also run indexing and search in the same instance (not Solr, a
 different Lucene app).

 lucene/contrib/misc/src/java/org/apache/lucene/index/BalancedSegmentMergePolicy.java

 The optimize command now includes a partial optimize option, so you can do
 larger controlled merges.
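
 A sketch of what a partial optimize looks like as a raw update command (the
 maxSegments attribute is the 1.4-era partial-optimize knob; verify the exact
 name against your version's update handler docs):

    curl http://localhost:8983/solr/update -H 'Content-type: text/xml' \
         --data-binary '<optimize maxSegments="4"/>'

 This merges down to at most 4 segments instead of forcing everything into a
 single segment.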

 Peter Sturge wrote:

 Hi,

 Below are some notes regarding Solr cache tuning that should prove
 useful for anyone who uses Solr with frequent commits (e.g. 5min).

 Environment:
 Solr 1.4.1 or branch_3x trunk.
 Note the 4.x trunk has lots of neat new features, so the notes here
 are likely less relevant to the 4.x environment.

 Overview:
 Our Solr environment makes extensive use of faceting, we perform
 commits every 30secs, and the indexes tend to be on the large-ish side
 (20million docs).
 Note: For our data, when we commit, we are always adding new data,
 never changing existing data.
 This type of environment can be tricky to tune, as Solr is more geared
 toward fast reads than frequent writes.

 Symptoms:
 If anyone has used faceting in searches where you are also performing
 frequent commits, you've likely encountered the dreaded OutOfMemory or
 GC Overhead Exceeded errors.
 In high commit rate environments, this is almost always due to
 multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
 finish autowarming their caches before the next commit()
 comes along and invalidates them.
 Once this starts happening on a regular basis, it is likely your
 Solr's JVM will run out of memory eventually, as the number of
 searchers (and their cache arrays) will keep growing until the JVM
 dies of thirst.
 To check if your Solr environment is suffering from this, turn on INFO
 level logging, and look for: 'PERFORMANCE WARNING: Overlapping
 onDeckSearchers=x'.

 In tests, we've only ever seen this problem when using faceting, and
 facet.method=fc.

 Some solutions to this are:
     Reduce the commit rate to allow searchers to fully warm before the
 next commit
     Reduce or eliminate the autowarming in caches
     Both of the above

 The trouble is, if you're doing NRT commits, you likely have a good
 reason for it, and reducing/eliminating autowarming will very
 significantly impact search performance in high commit rate
 environments.

 Solution:
 Here are some setup steps we've used that allow lots of faceting (we
 typically search with at least 20-35 different facet fields, and date
 faceting/sorting) on large indexes, and still keep decent search
 performance:

 1. Firstly, you should consider using the enum method for facet
 searches (facet.method=enum) unless you've got A LOT of memory on your
 machine. In our tests, this method uses a lot less memory and
 autowarms more quickly than fc. (Note, I've not tried the new
 segment-based 'fcs' option, as I can't find support for it in
 branch_3x - looks nice for 4.x though)
 Admittedly, for our data, enum is not quite as fast for searching as
 fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
 tradeoff.
 If you do have access to LOTS of memory, AND you can guarantee that
 the index won't grow beyond the memory capacity (i.e. you have some
 sort of deletion policy in place), fc can be a lot faster than enum
 when searching with lots of facets across many terms.

 2. Secondly, we've found that LRUCache is faster at autowarming than
 FastLRUCache - in our tests, about 20% faster. Maybe this is just our
 environment - your mileage may vary.

 So, our filterCache section in solrconfig.xml looks like this:
     <filterCache
       class="solr.LRUCache"
       size="3600"
       initialSize="1400"
       autowarmCount="3600"/>

 For a 28GB index, running in a quad-core x64 VMWare instance, 30
 warmed facet fields, Solr is running at ~4GB. Stats filterCache size
 shows usually in the region of ~2400.

 3. It's also a good idea to have some sort of
 firstSearcher/newSearcher event listener queries to allow new data to
 populate the caches.
 Of course, what you put in these is dependent on the facets you need/use.
 We've found a good combination is a firstSearcher with as many facets
 in the search as your environment can handle, then a subset of the
 most common facets for the newSearcher.

 4. We also set:
    <useColdSearcher>true</useColdSearcher>
 just in case.

 5. Another key area for search performance with high

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
1. You can run multiple Solr instances in separate JVMs, with both
having their solr.xml configured to use the same index folder.
You need to be careful that one and only one of these instances will
ever update the index at a time. The best way to ensure this is to use
one for writing only,
and the other is read-only and never writes to the index. This
read-only instance is the one to use for tuning for high search
performance. Even though the RO instance doesn't write to the index,
it still needs periodic (albeit empty) commits to kick off
autowarming/cache refresh.

Depending on your needs, you might not need to have 2 separate
instances. We need it because the 'write' instance is also doing a lot
of metadata pre-write operations in the same jvm as Solr, and so has
its own memory requirements.

2. We use sharding all the time, and it works just fine with this
scenario, as the RO instance is simply another shard in the pack.
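
As a concrete sketch, the periodic 'empty' commit against the read-only
instance can be as simple as a cron'd update call (the port/URL below are
made up - point it at whichever instance is the RO one):

    # e.g. every 30s-1min, just to open a fresh searcher on the RO instance
    curl http://localhost:8984/solr/update -H 'Content-type: text/xml' \
         --data-binary '<commit/>'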


On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich peat...@yahoo.de wrote:
 Peter,

 thanks a lot for your in-depth explanations!
 Your findings will be definitely helpful for my next performance
 improvement tests :-)

 Two questions:

 1. How would I do that:

 or a local read-only instance that reads the same core as the indexing
 instance (for the latter, you'll need something that periodically refreshes 
 - i.e. runs commit()).


 2. Did you try sharding with your current setup (e.g. one big,
 nearly-static index and a tiny write+read index)?

 Regards,
 Peter.

 Hi,

 Below are some notes regarding Solr cache tuning that should prove
 useful for anyone who uses Solr with frequent commits (e.g. 5min).

 Environment:
 Solr 1.4.1 or branch_3x trunk.
 Note the 4.x trunk has lots of neat new features, so the notes here
 are likely less relevant to the 4.x environment.

 Overview:
 Our Solr environment makes extensive use of faceting, we perform
 commits every 30secs, and the indexes tend to be on the large-ish side
 (20million docs).
 Note: For our data, when we commit, we are always adding new data,
 never changing existing data.
 This type of environment can be tricky to tune, as Solr is more geared
 toward fast reads than frequent writes.

 Symptoms:
 If anyone has used faceting in searches where you are also performing
 frequent commits, you've likely encountered the dreaded OutOfMemory or
 GC Overhead Exceeded errors.
 In high commit rate environments, this is almost always due to
 multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
 finish autowarming their caches before the next commit()
 comes along and invalidates them.
 Once this starts happening on a regular basis, it is likely your
 Solr's JVM will run out of memory eventually, as the number of
 searchers (and their cache arrays) will keep growing until the JVM
 dies of thirst.
 To check if your Solr environment is suffering from this, turn on INFO
 level logging, and look for: 'PERFORMANCE WARNING: Overlapping
 onDeckSearchers=x'.

 In tests, we've only ever seen this problem when using faceting, and
 facet.method=fc.

 Some solutions to this are:
     Reduce the commit rate to allow searchers to fully warm before the
 next commit
     Reduce or eliminate the autowarming in caches
     Both of the above

 The trouble is, if you're doing NRT commits, you likely have a good
 reason for it, and reducing/eliminating autowarming will very
 significantly impact search performance in high commit rate
 environments.

 Solution:
 Here are some setup steps we've used that allow lots of faceting (we
 typically search with at least 20-35 different facet fields, and date
 faceting/sorting) on large indexes, and still keep decent search
 performance:

 1. Firstly, you should consider using the enum method for facet
 searches (facet.method=enum) unless you've got A LOT of memory on your
 machine. In our tests, this method uses a lot less memory and
 autowarms more quickly than fc. (Note, I've not tried the new
 segment-based 'fcs' option, as I can't find support for it in
 branch_3x - looks nice for 4.x though)
 Admittedly, for our data, enum is not quite as fast for searching as
 fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
 tradeoff.
 If you do have access to LOTS of memory, AND you can guarantee that
 the index won't grow beyond the memory capacity (i.e. you have some
 sort of deletion policy in place), fc can be a lot faster than enum
 when searching with lots of facets across many terms.

 2. Secondly, we've found that LRUCache is faster at autowarming than
 FastLRUCache - in our tests, about 20% faster. Maybe this is just our
 environment - your mileage may vary.

 So, our filterCache section in solrconfig.xml looks like this:
     <filterCache
       class="solr.LRUCache"
       size="3600"
       initialSize="1400"
       autowarmCount="3600"/>

 For a 28GB index, running in a quad-core x64 VMWare instance, 30
 warmed facet fields, Solr is running at ~4GB. Stats filterCache size
 shows usually in 

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
Hi Erik,

I thought this would be good for the wiki, but I've not submitted to
the wiki before, so I thought I'd put this info out there first, then
add it if it was deemed useful.
If you could let me know the procedure for submitting, it probably
would be worth getting it into the wiki (couldn't do it straightaway,
as I have a lot of projects on at the moment). If you're able/willing
to put it on there for me, that would be very kind of you!

Thanks!
Peter


On Sun, Sep 12, 2010 at 5:43 PM, Erick Erickson erickerick...@gmail.com wrote:
 Peter:

 This kind of information is extremely useful to document, thanks! Do you
 have the time/energy to put it up on the Wiki? Anyone can edit it by
 creating
 a logon. If you don't, would it be OK if someone else did it (with
 attribution,
 of course)? I guess that by bringing it up I'm volunteering :)...

 Best
 Erick

 On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge peter.stu...@gmail.com wrote:

 Hi,

 Below are some notes regarding Solr cache tuning that should prove
 useful for anyone who uses Solr with frequent commits (e.g. 5min).

 Environment:
 Solr 1.4.1 or branch_3x trunk.
 Note the 4.x trunk has lots of neat new features, so the notes here
 are likely less relevant to the 4.x environment.

 Overview:
 Our Solr environment makes extensive use of faceting, we perform
 commits every 30secs, and the indexes tend to be on the large-ish side
 (20million docs).
 Note: For our data, when we commit, we are always adding new data,
 never changing existing data.
 This type of environment can be tricky to tune, as Solr is more geared
 toward fast reads than frequent writes.

 Symptoms:
 If anyone has used faceting in searches where you are also performing
 frequent commits, you've likely encountered the dreaded OutOfMemory or
 GC Overhead Exceeded errors.
 In high commit rate environments, this is almost always due to
 multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
 finish autowarming their caches before the next commit()
 comes along and invalidates them.
 Once this starts happening on a regular basis, it is likely your
 Solr's JVM will run out of memory eventually, as the number of
 searchers (and their cache arrays) will keep growing until the JVM
 dies of thirst.
 To check if your Solr environment is suffering from this, turn on INFO
 level logging, and look for: 'PERFORMANCE WARNING: Overlapping
 onDeckSearchers=x'.

 In tests, we've only ever seen this problem when using faceting, and
 facet.method=fc.

 Some solutions to this are:
    Reduce the commit rate to allow searchers to fully warm before the
 next commit
    Reduce or eliminate the autowarming in caches
    Both of the above

 The trouble is, if you're doing NRT commits, you likely have a good
 reason for it, and reducing/eliminating autowarming will very
 significantly impact search performance in high commit rate
 environments.

 Solution:
 Here are some setup steps we've used that allow lots of faceting (we
 typically search with at least 20-35 different facet fields, and date
 faceting/sorting) on large indexes, and still keep decent search
 performance:

 1. Firstly, you should consider using the enum method for facet
 searches (facet.method=enum) unless you've got A LOT of memory on your
 machine. In our tests, this method uses a lot less memory and
 autowarms more quickly than fc. (Note, I've not tried the new
 segment-based 'fcs' option, as I can't find support for it in
 branch_3x - looks nice for 4.x though)
 Admittedly, for our data, enum is not quite as fast for searching as
 fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
 tradeoff.
 If you do have access to LOTS of memory, AND you can guarantee that
 the index won't grow beyond the memory capacity (i.e. you have some
 sort of deletion policy in place), fc can be a lot faster than enum
 when searching with lots of facets across many terms.

 2. Secondly, we've found that LRUCache is faster at autowarming than
 FastLRUCache - in our tests, about 20% faster. Maybe this is just our
 environment - your mileage may vary.

 So, our filterCache section in solrconfig.xml looks like this:
    <filterCache
      class="solr.LRUCache"
      size="3600"
      initialSize="1400"
      autowarmCount="3600"/>

 For a 28GB index, running in a quad-core x64 VMWare instance, 30
 warmed facet fields, Solr is running at ~4GB. Stats filterCache size
 shows usually in the region of ~2400.

 3. It's also a good idea to have some sort of
 firstSearcher/newSearcher event listener queries to allow new data to
 populate the caches.
 Of course, what you put in these is dependent on the facets you need/use.
 We've found a good combination is a firstSearcher with as many facets
 in the search as your environment can handle, then a subset of the
 most common facets for the newSearcher.

 4. We also set:
   <useColdSearcher>true</useColdSearcher>
 just in case.

 5. Another key area for search performance with high commits is to use
 2 Solr

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-13 Thread Peter Sturge
Hi Dennis,

These are the Lucene file segments that hold the index data on the file system.
Have a look at: http://wiki.apache.org/solr/SolrPerformanceFactors

Peter


On Mon, Sep 13, 2010 at 7:02 AM, Dennis Gearon gear...@sbcglobal.net wrote:
 BTW, what is a segment?

 I've only heard about them in the last 2 weeks here on the list.
 Dennis Gearon

 Signature Warning
 
 EARTH has a Right To Life,
  otherwise we all die.

 Read 'Hot, Flat, and Crowded'
 Laugh at http://www.yert.com/film.php


 --- On Sun, 9/12/10, Jason Rutherglen jason.rutherg...@gmail.com wrote:

 From: Jason Rutherglen jason.rutherg...@gmail.com
 Subject: Re: Tuning Solr caches with high commit rates (NRT)
 To: solr-user@lucene.apache.org
 Date: Sunday, September 12, 2010, 7:52 PM
  Yeah there's no patch... I think Yonik can write it. :-)  Yah... The
  Lucene version shouldn't matter.  The distributed faceting theoretically
  can easily be applied to multiple segments, however the way it's written
  for me is a challenge to untangle and apply successfully to a working
  patch.  Also I don't have this as an itch to scratch at the moment.

Re: Invalid version or the data in not in 'javabin' format

2010-09-12 Thread Peter Sturge
Could be a solrj .jar version compat issue. Check that  the client and
server's solrj version jars match up.

Peter
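
If it helps narrow it down: a quick standalone check is to point a SolrJ
client at the server and switch it to the XML response parser. If that works
while the default javabin parser throws the 'Invalid version' error, the
solrj jar on the client side (Nutch's lib directory in this case) doesn't
match the server. A minimal sketch against the 1.4-era SolrJ API (the URL is
an example):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.impl.XMLResponseParser;

    public class JavabinCheck {
        public static void main(String[] args) throws Exception {
            // Point at the Solr instance Nutch indexes into (example URL)
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Use the XML wire format instead of the default javabin codec
            server.setParser(new XMLResponseParser());
            // If this succeeds where the javabin default fails, the client
            // and server solrj versions are out of step
            System.out.println(server.ping().getStatus());
        }
    }

The real fix for Nutch is to make the solrj jar shipped in Nutch's lib match
the Solr server's version.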


On Sun, Sep 12, 2010 at 1:16 PM, h00kpub...@gmail.com
h00kpub...@googlemail.com wrote:
  Hi... currently I am integrating Nutch (release 1.2) into Solr (trunk). When
 I index into Solr with Nutch, I get this exception:

 java.lang.RuntimeException: Invalid version or the data in not in 'javabin'
 format
        at
 org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
        at
 org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
        at
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
        at
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
        at
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:98)
        at
 org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
        at
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
 2010-09-12 11:44:55,101 ERROR solr.SolrIndexer - java.io.IOException: Job
 failed!

 Can you tell me what's wrong, or how I can fix this?

 best regards marcel :)






Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Peter Sturge
Hi,

Below are some notes regarding Solr cache tuning that should prove
useful for anyone who uses Solr with frequent commits (e.g. 5min).

Environment:
Solr 1.4.1 or branch_3x trunk.
Note the 4.x trunk has lots of neat new features, so the notes here
are likely less relevant to the 4.x environment.

Overview:
Our Solr environment makes extensive use of faceting, we perform
commits every 30secs, and the indexes tend to be on the large-ish side
(20million docs).
Note: For our data, when we commit, we are always adding new data,
never changing existing data.
This type of environment can be tricky to tune, as Solr is more geared
toward fast reads than frequent writes.

Symptoms:
If anyone has used faceting in searches where you are also performing
frequent commits, you've likely encountered the dreaded OutOfMemory or
GC Overhead Exceeded errors.
In high commit rate environments, this is almost always due to
multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
finish autowarming their caches before the next commit()
comes along and invalidates them.
Once this starts happening on a regular basis, it is likely your
Solr's JVM will run out of memory eventually, as the number of
searchers (and their cache arrays) will keep growing until the JVM
dies of thirst.
To check if your Solr environment is suffering from this, turn on INFO
level logging, and look for: 'PERFORMANCE WARNING: Overlapping
onDeckSearchers=x'.

In tests, we've only ever seen this problem when using faceting, and
facet.method=fc.

Some solutions to this are:
Reduce the commit rate to allow searchers to fully warm before the
next commit
Reduce or eliminate the autowarming in caches
Both of the above

The trouble is, if you're doing NRT commits, you likely have a good
reason for it, and reducing/eliminating autowarming will very
significantly impact search performance in high commit rate
environments.

Solution:
Here are some setup steps we've used that allow lots of faceting (we
typically search with at least 20-35 different facet fields, and date
faceting/sorting) on large indexes, and still keep decent search
performance:

1. Firstly, you should consider using the enum method for facet
searches (facet.method=enum) unless you've got A LOT of memory on your
machine. In our tests, this method uses a lot less memory and
autowarms more quickly than fc. (Note, I've not tried the new
segment-based 'fcs' option, as I can't find support for it in
branch_3x - looks nice for 4.x though)
Admittedly, for our data, enum is not quite as fast for searching as
fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
tradeoff.
If you do have access to LOTS of memory, AND you can guarantee that
the index won't grow beyond the memory capacity (i.e. you have some
sort of deletion policy in place), fc can be a lot faster than enum
when searching with lots of facets across many terms.
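
As an illustration, switching to the enum method is just a matter of adding
facet.method=enum to an otherwise normal facet request - a sketch with
made-up field names (facet.enum.cache.minDf is optional; it keeps very rare
terms from being cached as filters):

    http://localhost:8983/solr/select?q=*:*&rows=0&facet=true
        &facet.field=host&facet.field=user&facet.field=status
        &facet.method=enum&facet.enum.cache.minDf=25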

2. Secondly, we've found that LRUCache is faster at autowarming than
FastLRUCache - in our tests, about 20% faster. Maybe this is just our
environment - your mileage may vary.

So, our filterCache section in solrconfig.xml looks like this:
<filterCache
  class="solr.LRUCache"
  size="3600"
  initialSize="1400"
  autowarmCount="3600"/>

For a 28GB index, running in a quad-core x64 VMWare instance, 30
warmed facet fields, Solr is running at ~4GB. Stats filterCache size
shows usually in the region of ~2400.

3. It's also a good idea to have some sort of
firstSearcher/newSearcher event listener queries to allow new data to
populate the caches.
Of course, what you put in these is dependent on the facets you need/use.
We've found a good combination is a firstSearcher with as many facets
in the search as your environment can handle, then a subset of the
most common facets for the newSearcher.
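
A sketch of what such a listener can look like in solrconfig.xml
(solr.QuerySenderListener is the stock class; the facet fields below are
placeholders - substitute your own common ones, and use the same pattern for
the bigger firstSearcher query set):

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="rows">0</str>
          <str name="facet">true</str>
          <str name="facet.method">enum</str>
          <str name="facet.field">host</str>
          <str name="facet.field">user</str>
        </lst>
      </arr>
    </listener>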

4. We also set:
   <useColdSearcher>true</useColdSearcher>
just in case.

5. Another key area for search performance with high commits is to use
2 Solr instances - one for the high commit rate indexing, and one for
searching.
The read-only searching instance can be a remote replica, or a local
read-only instance that reads the same core as the indexing instance
(for the latter, you'll need something that periodically refreshes -
i.e. runs commit()).
This way, you can tune the indexing instance for writing performance
and the searching instance as above for max read performance.
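
For the shared-core variant, the key piece is simply that both instances'
configs point at the same data directory (the path below is made up), and
only the write instance ever adds documents:

    <!-- in the solrconfig.xml of BOTH the indexing and the read-only instance -->
    <dataDir>/var/data/solr/core0/data</dataDir>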

Using the setup above, we get fantastic searching speed for small
facet sets (well under 1sec), and really good searching for large
facet sets (a couple of secs depending on index size, number of
facets, unique terms etc. etc.),
even when searching against largeish indexes (20million docs).
We have yet to see any OOM or GC errors using the techniques above,
even in low memory conditions.

I hope there are people that find this useful. I know I've spent a lot
of time looking for stuff like this, so hopefullly, this will save
someone some time.


Peter


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Peter Sturge
Hi Jason,

I've tried some limited testing with the 4.x trunk using fcs, and I
must say, I really like the idea of per-segment faceting.
I was hoping to see it in 3.x, but I don't see this option in the
branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the
one to use with 3.1?
There seem to be a number of Solr issues tied to this - one of them
being Lucene-1785. Can the per-segment faceting patch work with Lucene
2.9/branch_3x?

Thanks,
Peter
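
For anyone wanting to try it on the 4.x trunk, per-segment faceting is
selected the same way as the other methods - a sketch with a made-up field
name:

    http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=host&facet.method=fcs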



On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 Peter,

 Are you using per-segment faceting, eg, SOLR-1617?  That could help
 your situation.

 On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge peter.stu...@gmail.com wrote:
 Hi,

 Below are some notes regarding Solr cache tuning that should prove
 useful for anyone who uses Solr with frequent commits (e.g. 5min).

 Environment:
 Solr 1.4.1 or branch_3x trunk.
 Note the 4.x trunk has lots of neat new features, so the notes here
 are likely less relevant to the 4.x environment.

 Overview:
 Our Solr environment makes extensive use of faceting, we perform
 commits every 30secs, and the indexes tend to be on the large-ish side
 (20million docs).
 Note: For our data, when we commit, we are always adding new data,
 never changing existing data.
 This type of environment can be tricky to tune, as Solr is more geared
 toward fast reads than frequent writes.

 Symptoms:
 If anyone has used faceting in searches where you are also performing
 frequent commits, you've likely encountered the dreaded OutOfMemory or
 GC Overhead Exceeded errors.
 In high commit rate environments, this is almost always due to
 multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
 finish autowarming their caches before the next commit()
 comes along and invalidates them.
 Once this starts happening on a regular basis, it is likely your
 Solr's JVM will run out of memory eventually, as the number of
 searchers (and their cache arrays) will keep growing until the JVM
 dies of thirst.
 To check if your Solr environment is suffering from this, turn on INFO
 level logging, and look for: 'PERFORMANCE WARNING: Overlapping
 onDeckSearchers=x'.

 In tests, we've only ever seen this problem when using faceting, and
 facet.method=fc.

 Some solutions to this are:
    Reduce the commit rate to allow searchers to fully warm before the
 next commit
    Reduce or eliminate the autowarming in caches
    Both of the above

 The trouble is, if you're doing NRT commits, you likely have a good
 reason for it, and reducing/eliminating autowarming will very
 significantly impact search performance in high commit rate
 environments.

 Solution:
 Here are some setup steps we've used that allow lots of faceting (we
 typically search with at least 20-35 different facet fields, and date
 faceting/sorting) on large indexes, and still keep decent search
 performance:

 1. Firstly, you should consider using the enum method for facet
 searches (facet.method=enum) unless you've got A LOT of memory on your
 machine. In our tests, this method uses a lot less memory and
 autowarms more quickly than fc. (Note, I've not tried the new
 segment-based 'fcs' option, as I can't find support for it in
 branch_3x - looks nice for 4.x though)
 Admittedly, for our data, enum is not quite as fast for searching as
 fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
 tradeoff.
 If you do have access to LOTS of memory, AND you can guarantee that
 the index won't grow beyond the memory capacity (i.e. you have some
 sort of deletion policy in place), fc can be a lot faster than enum
 when searching with lots of facets across many terms.

 2. Secondly, we've found that LRUCache is faster at autowarming than
 FastLRUCache - in our tests, about 20% faster. Maybe this is just our
 environment - your mileage may vary.

 So, our filterCache section in solrconfig.xml looks like this:
    <filterCache
      class="solr.LRUCache"
      size="3600"
      initialSize="1400"
      autowarmCount="3600"/>

 For a 28GB index, running in a quad-core x64 VMWare instance, 30
 warmed facet fields, Solr is running at ~4GB. Stats filterCache size
 shows usually in the region of ~2400.

 3. It's also a good idea to have some sort of
 firstSearcher/newSearcher event listener queries to allow new data to
 populate the caches.
 Of course, what you put in these is dependent on the facets you need/use.
 We've found a good combination is a firstSearcher with as many facets
 in the search as your environment can handle, then a subset of the
 most common facets for the newSearcher.

 4. We also set:
   <useColdSearcher>true</useColdSearcher>
 just in case.

 5. Another key area for search performance with high commits is to use
 2 Solr instances - one for the high commit rate indexing, and one for
 searching.
 The read-only searching instance can be a remote replica, or a local
 read-only instance that reads the same core as the indexing instance
