Re: Subsecond queries possible?

2015-07-01 Thread Debasish Das
If you take the bitmap indices out of Sybase, then I am guessing Spark SQL
will be on par with Sybase?

On that note, are there plans to integrate the IndexedRDD ideas into Spark
SQL to build indices? Is there a JIRA tracking it?
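
For anyone curious, point lookups with the AMPLab spark-indexedrdd package
look roughly like the sketch below (based on its README for version 0.1,
where keys are Longs; treat the details as assumptions, not a tested recipe):

    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

    // Build an RDD of (Long, value) pairs, then index it by key.
    val pairs = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
    val indexed = IndexedRDD(pairs).cache()

    // Point lookup against the index instead of a full scan.
    indexed.get(42L) // => Some(0)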





Re: Subsecond queries possible?

2015-07-01 Thread Eric Pederson
I removed all of the indices from the table in IQ and the time went up to
700ms for the query on the full dataset.   The best time I've got so far
with Spark for the full dataset is 4s with a cached table and 30 cores.

However, every column in IQ is automatically indexed by default (see
http://infocenter.sybase.com/help/topic/com.sybase.infocenter.dc00170.1510/html/iqapgv1/BABHJCIC.htm),
and those default indexes can't be removed.  They aren't even listed in the
metadata.  So even though I removed all of the explicit indexes, the default
indexes are still there.

It's a baseline, but I'm really comparing apples and oranges right now.
Still, it's an interesting experiment.



-- Eric






Re: Subsecond queries possible?

2015-06-30 Thread Eric Pederson
Hi Debasish:

We have the same dataset running on Sybase IQ, and after the caches are warm
the queries come back in about 300ms.  We're looking at options to relieve
overutilization and to bring down licensing costs.  I realize that Spark may
not be the best fit for this use case, but I'm interested to see how far it
can be pushed.

Thanks for your help!


-- Eric




Re: Subsecond queries possible?

2015-06-30 Thread Michael Armbrust

 This brings up another question/issue: there doesn't seem to be a way to
 partition cached tables in the same way you can partition, say, a Hive
 table.  For example, we would like to partition the overall dataset (233M
 rows, 9.2 GB) by (product, coupon) so that when we run one of these queries
 Spark won't have to scan all the data, just the partition matching the
 query, e.g., (FNM30, 3.0).


If you order the data on the interesting column before caching, we keep
min/max statistics that let us do similar data skipping automatically.
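
A minimal sketch of that sort-then-cache pattern against a Spark 1.4-era
SQLContext (the table and column names are made up for illustration):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // assumes an existing SparkContext sc
    import sqlContext.implicits._

    // Hypothetical dataset with the (product, coupon) columns from above.
    val trades = sc.parallelize(Seq(
      ("FNM30", 3.0, 100.5), ("FNM30", 3.5, 101.2), ("FNM15", 2.5, 99.8)
    )).toDF("product", "coupon", "price")

    // Order by the filter columns before caching; the in-memory columnar
    // batches then cluster by (product, coupon), and the per-batch min/max
    // stats let the scan skip batches that can't match the predicate.
    trades.sort($"product", $"coupon").registerTempTable("trades_sorted")
    sqlContext.cacheTable("trades_sorted")

    sqlContext.sql(
      "SELECT avg(price) FROM trades_sorted " +
      "WHERE product = 'FNM30' AND coupon = 3.0").show()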


Re: Subsecond queries possible?

2015-06-30 Thread Debasish Das
I got a good runtime improvement from Hive partitioning, caching the
dataset, and increasing the cores through repartition... I think for your
case generating MySQL-style indexing will help further... it is not
supported in Spark SQL yet...
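
Roughly what I mean, as a sketch against the Spark 1.4 DataFrame API (the
paths and names are made up; assumes a SparkContext sc and a raw `trades`
DataFrame with product/coupon/price columns):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Hive-style partitioning on disk: one directory per (product, coupon),
    // so a query for a single combination reads only that directory.
    trades.write.partitionBy("product", "coupon").parquet("/data/trades_part")

    // Repartition to spread the scan over more cores, then pin in memory.
    val part = sqlContext.read.parquet("/data/trades_part").repartition(120)
    part.registerTempTable("trades_mem")
    sqlContext.cacheTable("trades_mem")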

I know the dataset might be too big for one-node MySQL, but do you have a
runtime estimate from running the same query on MySQL with appropriate
column indexing? That should give us a good baseline number...

For my case at least, I could not put the data on one-node MySQL as it was
too big...

If you can recast the problem as a document view, you can use a document
store like Solr/Elasticsearch to boost runtime... the inverted indices can
get you subsecond latencies... again, the schema design matters there, and
you might have to give up some SQL expressiveness (e.g., matching a balance
against a predefined bucket might be fine, but looking for the exact number
might be slow).
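
A sketch of that bucketing idea in Spark, before exporting documents to the
store (the accounts data and column names are hypothetical):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.floor

    val sqlContext = new SQLContext(sc) // assumes an existing SparkContext sc
    import sqlContext.implicits._

    // Hypothetical account data with a balance column.
    val accounts = sc.parallelize(Seq(("a1", 1234.56), ("a2", 8200.00)))
      .toDF("account_id", "balance")

    // Derive a coarse $1000-wide bucket. A document store indexes the
    // bucket as a plain term, so "balance in bucket" becomes a cheap
    // inverted-index lookup, while an exact-amount query would still be
    // slow without a finer index.
    val docs = accounts.withColumn(
      "balance_bucket", floor($"balance" / 1000) * 1000)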