Re: Solr schema design: fitting time-series data

2017-01-15 Thread map reduced
I may have used the wrong terminology; by complex types I meant non-primitive
types. Multivalued can be conceptualized as a list of values, for instance
in your example myint = [32, 77], which you can analyze and
query upon. What I was trying to ask is whether a complex type can be
multi-valued, or something along those lines, such that range
queries are supported.

For instance: the rows below would have to be individual docs in Solr (to my
knowledge). If I want to range query from ts=Jan 12 to ts=Jan 15, give me the
sum of 'unique' where contentId=1, product=mobile:

contentId=1, product=mobile, ts=Jan15, total=12, unique=5
contentId=1, product=mobile, ts=Jan14, total=10, unique=3
contentId=1, product=mobile, ts=Jan13, total=15, unique=2
contentId=1, product=mobile, ts=Jan12, total=17, unique=4
..

This increases the number of documents in Solr by a lot. If only there were a
way to do something like:

{
  contentId=1
  product=mobile
  ts = [
    { time = Jan15, total = 12, unique = 15 },
    { time = Jan16, total = 10, unique = 3 },
    ..
    ..
  ]
}

Of course the above isn't allowed, but I'm looking for some way to squeeze
the timestamps into a single document so that the number of documents
doesn't grow by a lot and I am still able to range query on 'ts'.

For some combinations of fields the timestamps may go back 3-6 months!
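As a sketch of the kind of packing I mean (the field name ts_stats is hypothetical, and since Solr cannot sum inside a multivalued string field server-side, the aggregation here has to be client-side; the values sort lexicographically by their yyyymmdd prefix):

```python
# Pack each day's pre-aggregated counters into one multivalued string field,
# e.g. "20170115|total=12|unique=5". The yyyymmdd prefix sorts
# lexicographically, so date-range filtering reduces to string comparison.

def pack(day, total, unique):
    """Encode one day's counters as a sortable string."""
    return f"{day}|total={total}|unique={unique}"

def unpack(value):
    """Decode a packed value back into (day, total, unique)."""
    day, total, unique = value.split("|")
    return day, int(total.split("=")[1]), int(unique.split("=")[1])

def sum_unique(values, start, end):
    """Sum 'unique' over values whose date prefix falls in [start, end]."""
    return sum(
        unpack(v)[2] for v in values if start <= v.split("|")[0] <= end
    )

# Values as they would sit in one document's multivalued ts_stats field:
doc_ts = [
    pack("20170115", 12, 5),
    pack("20170114", 10, 3),
    pack("20170113", 15, 2),
    pack("20170112", 17, 4),
]
print(sum_unique(doc_ts, "20170112", "20170115"))  # all four days: 14
```

The trade-off is that the per-day numbers stop being queryable/facetable as real Solr fields; only the date prefix participates in matching.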

Let me know if I am still being unclear.

On Sun, Jan 15, 2017 at 8:04 PM, Erick Erickson 
wrote:

> bq: I know multivalued fields don't support complex data  types
>
> Not sure what you're talking about here. multiValued actually has
> nothing to do with data types. You can have text fields which
> are analyzed and produce multiple tokens and are multiValued.
> You can have primitive types (string, int/long/float/double,
> boolean etc) that are multiValued, or they can be single valued.
>
> All "multiValued" means is that the _input_ can have the same field
> repeated, i.e.
>
> <doc>
>   <field name="mytext">some stuff</field>
>   <field name="mytext">more stuff</field>
>   <field name="myint">77</field>
> </doc>
>
> This doc would fail if mytext or myint were multiValued=false but
> succeed if multiValued=true at index time.
>
> There are some subtleties with text (analyzed) multivalued fields having
> to do with token offsets, but that's not germane.
>
> Does that change your problem? Your document could have a dozen
> timestamps
>
> However, there isn't a good way to query across multiple multivalued fields
> in parallel. That is, a doc like
>
> myint=1
> myint=2
> myint=3
> mylong=4
> mylong=5
> mylong=6
>
> there's no good way to say "only match this document if myint=1 AND
> mylong=4 AND they_are_both_in_the_same_position".
>
> That is, asking for myint=1 AND mylong=6 would match the above. Is
> that what you're
> wondering about?
>
> --
> I expect you're really asking to do the second above, in which case you
> might
> want to look at StreamingExpressions and/or ParallelSQL in Solr 6.x
>
> Best,
> Erick
>
> On Sun, Jan 15, 2017 at 7:31 PM, map reduced  wrote:
> > Hi,
> >
> > I am trying to fit the following data in Solr to support flexible queries
> > and would like to get your input on the same. I have data about users
> say:
> >
> > contentID (assume uuid),
> > platform (eg. website, mobile etc),
> > softwareVersion (eg. sw1.1, sw2.5, ..etc),
> > regionId (eg. us144, uk123, etc..)
> > 
> >
> > and few more other such fields. This data is partially pre aggregated
> (read
> > Hadoop jobs): so let’s assume for "contentID = uuid123 and platform =
> > mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in
> > format:
> >
> > timestamp  pre-aggregated data [ uniques, total]
> >  Jan 15[ 12, 4]
> >  Jan 14[ 4, 3]
> >  Jan 13[ 8, 7]
> >  ......
> >
> > And then I also have less granular data say "contentID = uuid123 and
> > platform = mobile and softwareVersion = ANY and regionId = ANY (These
> > values will be more than above table since granularity is reduced)
> >
> > timestamp : pre-aggregated data [uniques, total]
> >  Jan 15[ 100, 40]
> >  Jan 14[ 45, 30]
> >  ...   ...
> >
> > I'll get queries like "contentID = uuid123 and platform = mobile" , give
> > sum of 'uniques' for Jan15 - Jan13 or for "contentID=uuid123 and
> > platform=mobile and softwareVersion=sw1.2", give sum of 'total' for
> Jan15 -
> > Jan01.
> >
> > I was thinking of simple schema where documents will be like (first
> example
> > above):
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform" : "mobile",
> >   "softwareVersion": "sw1.2",
> >   "regionId": "ANY",
> >   "ts" : "2017-01-15T01:01:21Z",
> >   "unique": 12,
> >   "total": 4
> > }
> >
> > second example from above:
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform" : "mobile",
> >   "softwareVersion": "ANY",
> >   "regionId": "ANY",
> >   "ts" : "2017-01-15T01:01:21Z",
> >   "unique": 100,
> >   "total": 40
> > }
> >
> > Possible optimization:
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform.mobile.softwareVersion.sw1.2.region.us12" : {
> >   "unique": 12,
>

SOLR Installation / Configuration related

2017-01-15 Thread Prasanna S. Dhakephalkar
Hi,

 

I have a standalone installation of solr 5.3.1

 

Recently I have started facing an issue: whenever the garbage collector
kicks in, and there is a request to Solr at that time,

Solr (HTTP) responds with status 0 and the request is not served; it
gets served after a few seconds.

 

The PHP library catches it as

Exception of type Apache_Solr_HttpTransportException occurred with Message:
'0' Status: Communication Error in File ..libraries/Solr/Service.php at Line
...

Any suggestion / ideas to avoid this disruption of service ?
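One client-side mitigation, while the underlying GC pauses are being tuned (heap sizing, collector choice), is to retry transient failures with backoff. A minimal sketch of the pattern, in Python only for illustration (the same shape applies to the PHP client; flaky_query is a hypothetical stand-in that fails twice as if during a GC pause):

```python
import time

def with_retries(call, attempts=3, base_delay=0.1):
    """Retry `call` on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical stand-in: fails twice (as during a GC pause), then succeeds.
state = {"calls": 0}
def flaky_query():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("status 0")
    return {"status": "OK"}

print(with_retries(flaky_query))  # {'status': 'OK'} on the third attempt
```

This only papers over the symptom; reducing GC pause length (or load-balancing across replicas) is the real fix.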

Regards,

Prasanna.

 

 

 



Re: Solr schema design: fitting time-series data

2017-01-15 Thread Erick Erickson
bq: I know multivalued fields don't support complex data  types

Not sure what you're talking about here. multiValued actually has
nothing to do with data types. You can have text fields which
are analyzed and produce multiple tokens and are multiValued.
You can have primitive types (string, int/long/float/double,
boolean etc) that are multiValued, or they can be single valued.

All "multiValued" means is that the _input_ can have the same field
repeated, i.e.

<doc>
  <field name="mytext">some stuff</field>
  <field name="mytext">more stuff</field>
  <field name="myint">77</field>
</doc>

This doc would fail if mytext or myint were multiValued=false but
succeed if multiValued=true at index time.

There are some subtleties with text (analyzed) multivalued fields having
to do with token offsets, but that's not germane.

Does that change your problem? Your document could have a dozen
timestamps

However, there isn't a good way to query across multiple multivalued fields
in parallel. That is, a doc like

myint=1
myint=2
myint=3
mylong=4
mylong=5
mylong=6

there's no good way to say "only match this document if myint=1 AND
mylong=4 AND they_are_both_in_the_same_position".

That is, asking for myint=1 AND mylong=6 would match the above. Is
that what you're
wondering about?
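(A common workaround for this positional-matching gap, sketched below and not a Solr feature per se: index the parallel values as combined pair tokens in ONE multivalued field, so "myint=1 AND mylong=4 at the same position" collapses into a single term match. The values are illustrative only.)

```python
# Combine two parallel value lists into positional pair tokens that can be
# indexed as a single multivalued string field, e.g. "1_4", "2_5", "3_6".

def pair_tokens(myints, mylongs):
    """Zip parallel values into "<int>_<long>" tokens, one per position."""
    return [f"{i}_{l}" for i, l in zip(myints, mylongs)]

tokens = pair_tokens([1, 2, 3], [4, 5, 6])
print(tokens)            # ['1_4', '2_5', '3_6']
print("1_4" in tokens)   # True  -- 1 and 4 sit at the same position
print("1_6" in tokens)   # False -- 1 and 6 are at different positions
```

The cost is that the combined field is opaque to numeric range queries; it only supports exact matches on the pairs you chose to encode.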

--
I expect you're really asking to do the second above, in which case you might
want to look at StreamingExpressions and/or ParallelSQL in Solr 6.x

Best,
Erick

On Sun, Jan 15, 2017 at 7:31 PM, map reduced  wrote:
> Hi,
>
> I am trying to fit the following data in Solr to support flexible queries
> and would like to get your input on the same. I have data about users say:
>
> contentID (assume uuid),
> platform (eg. website, mobile etc),
> softwareVersion (eg. sw1.1, sw2.5, ..etc),
> regionId (eg. us144, uk123, etc..)
> 
>
> and few more other such fields. This data is partially pre aggregated (read
> Hadoop jobs): so let’s assume for "contentID = uuid123 and platform =
> mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in
> format:
>
> timestamp  pre-aggregated data [ uniques, total]
>  Jan 15[ 12, 4]
>  Jan 14[ 4, 3]
>  Jan 13[ 8, 7]
>  ......
>
> And then I also have less granular data say "contentID = uuid123 and
> platform = mobile and softwareVersion = ANY and regionId = ANY (These
> values will be more than above table since granularity is reduced)
>
> timestamp : pre-aggregated data [uniques, total]
>  Jan 15[ 100, 40]
>  Jan 14[ 45, 30]
>  ...   ...
>
> I'll get queries like "contentID = uuid123 and platform = mobile" , give
> sum of 'uniques' for Jan15 - Jan13 or for "contentID=uuid123 and
> platform=mobile and softwareVersion=sw1.2", give sum of 'total' for Jan15 -
> Jan01.
>
> I was thinking of simple schema where documents will be like (first example
> above):
>
> {
>   "contentID": "uuid12349789",
>   "platform" : "mobile",
>   "softwareVersion": "sw1.2",
>   "regionId": "ANY",
>   "ts" : "2017-01-15T01:01:21Z",
>   "unique": 12,
>   "total": 4
> }
>
> second example from above:
>
> {
>   "contentID": "uuid12349789",
>   "platform" : "mobile",
>   "softwareVersion": "ANY",
>   "regionId": "ANY",
>   "ts" : "2017-01-15T01:01:21Z",
>   "unique": 100,
>   "total": 40
> }
>
> Possible optimization:
>
> {
>   "contentID": "uuid12349789",
>   "platform.mobile.softwareVersion.sw1.2.region.us12" : {
>   "unique": 12,
>   "total": 4
>   },
>  "platform.mobile.softwareVersion.sw1.2.region.ANY" : {
>   "unique": 100,
>   "total": 40
>   },
>   "ts" : "2017-01-15T01:01:21Z"
>   }
>
> Challenges: The number of such rows is very large and it'll grow exponentially
> with every new field. For instance, if I go with the above suggested schema,
> I'll end up storing a new document for each combination of
> contentID, platform, softwareVersion, regionId. Now if we throw in another
> field to this document, the number of combinations increases exponentially. I
> have more than a billion such combination rows already.
>
> I am hoping to find advice by experts if
>
>1. Multiple such fields can be fit in same document for different 'ts'
>such that range queries are possible on it.
>2. time range (ts) can be fit in same document as a list(?) (to reduce
>number of rows). I know multivalued fields don't support complex data
>types, but if anything else can be done with the data/schema to reduce
>query time and number of rows.
>
> The number of these rows is very large, certainly more than 1 billion (if we
> go with the schema I was suggesting). What schema would you suggest
> that'll fit the query requirements?
>
> FYI: All queries will be exact match on fields (no partial or tokenized),
> so no analysis on fields is necessary. And almost all queries are range
> queries.
>
> Thanks,
>
> KP


Solr schema design: fitting time-series data

2017-01-15 Thread map reduced
Hi,

I am trying to fit the following data in Solr to support flexible queries
and would like to get your input on the same. I have data about users say:

contentID (assume uuid),
platform (eg. website, mobile etc),
softwareVersion (eg. sw1.1, sw2.5, ..etc),
regionId (eg. us144, uk123, etc..)


and a few more such fields. This data is partially pre-aggregated (read:
Hadoop jobs): so let’s assume for "contentID = uuid123 and platform =
mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in this
format:

timestamp  pre-aggregated data [ uniques, total]
 Jan 15[ 12, 4]
 Jan 14[ 4, 3]
 Jan 13[ 8, 7]
 ......

And then I also have less granular data, say "contentID = uuid123 and
platform = mobile and softwareVersion = ANY and regionId = ANY" (these
values will be larger than the above table since granularity is reduced):

timestamp : pre-aggregated data [uniques, total]
 Jan 15[ 100, 40]
 Jan 14[ 45, 30]
 ...   ...

I'll get queries like "contentID = uuid123 and platform = mobile" , give
sum of 'uniques' for Jan15 - Jan13 or for "contentID=uuid123 and
platform=mobile and softwareVersion=sw1.2", give sum of 'total' for Jan15 -
Jan01.

I was thinking of simple schema where documents will be like (first example
above):

{
  "contentID": "uuid12349789",
  "platform" : "mobile",
  "softwareVersion": "sw1.2",
  "regionId": "ANY",
  "ts" : "2017-01-15T01:01:21Z",
  "unique": 12,
  "total": 4
}

second example from above:

{
  "contentID": "uuid12349789",
  "platform" : "mobile",
  "softwareVersion": "ANY",
  "regionId": "ANY",
  "ts" : "2017-01-15T01:01:21Z",
  "unique": 100,
  "total": 40
}

Possible optimization:

{
  "contentID": "uuid12349789",
  "platform.mobile.softwareVersion.sw1.2.region.us12" : {
  "unique": 12,
  "total": 4
  },
 "platform.mobile.softwareVersion.sw1.2.region.ANY" : {
  "unique": 100,
  "total": 40
  },
  "ts" : "2017-01-15T01:01:21Z"
  }

Challenges: The number of such rows is very large and it'll grow exponentially
with every new field. For instance, if I go with the above suggested schema,
I'll end up storing a new document for each combination of
contentID, platform, softwareVersion, regionId. Now if we throw in another
field to this document, the number of combinations increases exponentially. I
have more than a billion such combination rows already.

I am hoping for advice from experts on whether:

   1. multiple such fields can fit in the same document for different 'ts'
   values such that range queries are possible on them.
   2. the time range (ts) can fit in the same document as a list(?) (to reduce
   the number of rows). I know multivalued fields don't support complex data
   types, but perhaps something else can be done with the data/schema to
   reduce query time and the number of rows.

The number of these rows is very large, certainly more than 1 billion (if we
go with the schema I was suggesting). What schema would you suggest
that'll fit the query requirements?

FYI: All queries will be exact match on fields (no partial or tokenized),
so no analysis on fields is necessary. And almost all queries are range
queries.

Thanks,

KP


Re: Referencing a !key and !stat in facet.pivot

2017-01-15 Thread John Blythe
Appreciated. Will give it a whirl tomorrow and report back!

On Sun, Jan 15, 2017 at 12:36 PM Chris Hostetter 
wrote:

>
>
> If i'm understanding your question correctly, what you're looking for is
>
> simply...
>
>
>
> stats.field={!tag=pivot_stats}lastPrice
>
> facet.pivot={!key=pivot stats=pivot_stats}buyer,vendor
>
>
>
> ...there should only ever be one set of "{}" in the facet.pivot, defining
>
> the set of local params, and there are 2 param=values defined inside
>
> those "{}" (just like if you wanted multiple local params for the
>
> stats.field to define what stats you want to compute)
>
>
>
>
> https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries
>
>
>
>
> https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-CombiningStatsComponentWithPivots
>
>
> https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-LocalParametersforFaceting
>
>
> https://cwiki.apache.org/confluence/display/solr/The+Stats+Component#TheStatsComponent-LocalParameters
>
>
>
>
>
> : Date: Thu, 12 Jan 2017 20:44:35 -0500
>
> : From: John Blythe 
>
> : Reply-To: solr-user@lucene.apache.org
>
> : To: solr-user 
>
> : Subject: Referencing a !key and !stat in facet.pivot
>
> :
>
> : hi all
>
> :
>
> : i'm having an issue with an attempt to assign a key to a facet.pivot
> while
>
> : simultaneously referencing one of my stat fields.
>
> :
>
> : i've got something like this:
>
> :
>
> : stats.field={!tag=pivot_stats}lastPrice&
>
> : > ...
>
> : > facet.pivot={!key=pivot} {!stats=pivot_stats}buyer,vendor& ...
>
> :
>
> :
>
> : i've attempted it without a space, wrapping the entire pivot in the
> !key's
>
> : { } braces and anything else i could think of. some return errors, others
>
> : return the query results but w an empty
>
> :
>
> : "facet_counts":{
>
> : > 
>
> : > "facet_pivot":{
>
> : >   "pivot":[]}},
>
> :
>
> :
>
> : it will work if I totally remove the {!key=pivot} portion, however.
>
> :
>
> : is there any way to have both present?
>
> :
>
> : thanks!
>
> :
>
>
>
> -Hoss
>
> http://www.lucidworks.com/
>
>


Re: Error Loading Custom Codec class with Solr Codec Factory. Class cast exception

2017-01-15 Thread Chris Hostetter


: But when I try to load this codec directly via Solrconfig.xml CodecFactory
: as below.
: 
: 
: 

...there is a difference between a (lucene layer) Codec and a (solr
layer) CodecFactory.

Having the codec code in place (with the necessary SPI metadata
files) lets Solr/Lucene *read* indexes written in that codec, but in
order to create new indexes with your codec, you have to write
a concrete implementation of the CodecFactory abstract class and provide
that *Factory* class name in your <codecFactory> config line. There is
probably no CodecFactory for your DummyEncryptedLucene60Codec defined in
the patch you're trying out.



-Hoss
http://www.lucidworks.com/


Re: Referencing a !key and !stat in facet.pivot

2017-01-15 Thread Chris Hostetter

If i'm understanding your question correctly, what you're looking for is 
simply...

stats.field={!tag=pivot_stats}lastPrice
facet.pivot={!key=pivot stats=pivot_stats}buyer,vendor

...there should only ever be one set of "{}" in the facet.pivot, defining 
the set of local params, and there are 2 param=values defined inside
those "{}" (just like if you wanted multiple local params for the 
stats.field to define what stats you want to compute)
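As a sketch, here is how those two parameters assemble into one request query string (Python used only to show the exact escaping; the field names lastPrice, buyer, vendor are the ones from the thread):

```python
from urllib.parse import urlencode

# Build the stats + pivot request Hoss describes: the tag on stats.field
# is referenced by the stats local param on facet.pivot.
params = {
    "q": "*:*",
    "stats": "true",
    "stats.field": "{!tag=pivot_stats}lastPrice",
    "facet": "true",
    # ONE pair of braces holding BOTH local params (key and stats together):
    "facet.pivot": "{!key=pivot stats=pivot_stats}buyer,vendor",
}
query_string = urlencode(params)
print(query_string)
```

The response then carries the pivot under facet_pivot with the label "pivot", each bucket annotated with the lastPrice stats.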

https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries

https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-CombiningStatsComponentWithPivots
https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-LocalParametersforFaceting
https://cwiki.apache.org/confluence/display/solr/The+Stats+Component#TheStatsComponent-LocalParameters


: Date: Thu, 12 Jan 2017 20:44:35 -0500
: From: John Blythe 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user 
: Subject: Referencing a !key and !stat in facet.pivot
: 
: hi all
: 
: i'm having an issue with an attempt to assign a key to a facet.pivot while
: simultaneously referencing one of my stat fields.
: 
: i've got something like this:
: 
: stats.field={!tag=pivot_stats}lastPrice&
: > ...
: > facet.pivot={!key=pivot} {!stats=pivot_stats}buyer,vendor& ...
: 
: 
: i've attempted it without a space, wrapping the entire pivot in the !key's
: { } braces and anything else i could think of. some return errors, others
: return the query results but w an empty
: 
: "facet_counts":{
: > 
: > "facet_pivot":{
: >   "pivot":[]}},
: 
: 
: it will work if I totally remove the {!key=pivot} portion, however.
: 
: is there any way to have both present?
: 
: thanks!
: 

-Hoss
http://www.lucidworks.com/


Missing Segment File

2017-01-15 Thread Moenieb Davids
Hi All,

How does one resolve the missing segments issue:
 java.nio.file.NoSuchFileException: /pathxxx/data/index/segments_1bj

It seems to occur only on large CSV imports via DIH.














Solr - example for using percentiles

2017-01-15 Thread Vidal, Gilad
Hi,
Can you direct me to a Java example using Solr percentiles?
The following 3 examples do not seem to be working.

search.setParam("facet", true);
search.setParam("percentiles", true);
search.setParam("percentiles.field", "networkTime");
search.setParam("percentiles.requested.percentiles", "25,50,75");
search.setParam("percentiles.lower.fence", "0");
search.setParam("percentiles.upper.fence", "100");
search.setParam("percentiles.gap", "10");
search.setParam("percentiles.averages", true);


search.setParam("facet", true);
search.setParam("facets.stats.percentiles", true);
search.setParam("facets.stats.percentiles.field", "networkTime");
search.setParam("f.networkTime.stats.percentiles.requested", "25,50,75");
search.setParam("f.networkTime.stats.percentiles.lower.fence", "0");
search.setParam("f.networkTime.stats.percentiles.upper.fence", "100");
search.setParam("f.networkTime.stats.percentiles.gap", "10");
search.setParam("facets.stats.percentiles.averages", true);


search.setParam("facet", true);
search.setParam("facets.stats.percentiles", true);
search.setParam("facets.stats.percentiles.field", "networkTime");
search.setParam("facets.networkTime.stats.percentiles.requested", "25,50,75");
search.setParam("facets.networkTime.stats.percentiles.lower.fence", "0");
search.setParam("facets.networkTime.stats.percentiles.upper.fence", "100");
search.setParam("facets.networkTime.stats.percentiles.gap", "10");
search.setParam("facets.stats.percentiles.averages", true);
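For reference, the parameter names above are not ones Solr's StatsComponent recognizes; the syntax it documents (Solr 5+) passes percentiles as a local param on stats.field. A sketch of the resulting request (host and collection are hypothetical placeholders; Python used only to show the URL shape):

```python
from urllib.parse import urlencode

# Percentiles via the StatsComponent: a local param on stats.field,
# not separate "percentiles.*" parameters.
params = {
    "q": "*:*",
    "rows": "0",
    "stats": "true",
    "stats.field": "{!percentiles='25,50,75'}networkTime",
}
url = "http://localhost:8983/solr/mycollection/select?" + urlencode(params)
print(url)
```

The SolrJ equivalent would be search.setParam("stats", true) plus search.setParam("stats.field", "{!percentiles='25,50,75'}networkTime"); the values come back under stats/stats_fields/networkTime/percentiles.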

Thanks,
Gilad