[jira] [Commented] (IGNITE-3084) Investigate how Ignite can support Spark DataFrame

2017-01-03 Thread Valentin Kulichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15795597#comment-15795597
 ] 

Valentin Kulichenko commented on IGNITE-3084:
-

The logical plan (which is effectively an AST) is built by Spark from the API 
calls you make. Spark supports both SQL (which it parses itself) and chained 
methods like {{filter(..)}}, {{join(..)}}, etc. The logical plan is then 
converted to a physical plan, which defines how the logical plan is actually 
executed. So basically we need a strategy that generates an SQL query for 
Ignite based on the AST provided by Spark.
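
To illustrate generating SQL from an AST, here is a minimal sketch. It does not 
use Spark's real classes: {{Plan}}, {{Relation}}, {{Filter}}, {{Project}}, 
{{Join}} and {{SqlGenerator}} are hypothetical, simplified stand-ins, and 
conditions are kept as plain strings. A real implementation would have to walk 
Spark's own {{LogicalPlan}} nodes and expressions.

```scala
// Hypothetical, simplified stand-ins for logical plan nodes.
// These are NOT Spark's real classes; they only model the tree shape.
sealed trait Plan
final case class Relation(table: String) extends Plan
final case class Filter(condition: String, child: Plan) extends Plan
final case class Project(columns: Seq[String], child: Plan) extends Plan
final case class Join(left: Plan, right: Plan, on: String) extends Plan

// Walks the tree bottom-up and emits one SQL string for the whole plan.
final class SqlGenerator {
  private var aliasCount = 0

  private def nextAlias(): String = {
    aliasCount += 1
    s"t$aliasCount"
  }

  // A bare relation can appear directly in FROM;
  // any other node becomes an aliased subquery.
  private def fromItem(plan: Plan): String = plan match {
    case Relation(table) => table
    case other           => s"(${toSql(other)}) AS ${nextAlias()}"
  }

  def toSql(plan: Plan): String = plan match {
    case Relation(table) =>
      s"SELECT * FROM $table"
    case Filter(condition, child) =>
      s"SELECT * FROM ${fromItem(child)} WHERE $condition"
    case Project(columns, child) =>
      s"SELECT ${columns.mkString(", ")} FROM ${fromItem(child)}"
    case Join(left, right, on) =>
      s"SELECT * FROM ${fromItem(left)} JOIN ${fromItem(right)} ON $on"
  }
}
```

For example, 
{{new SqlGenerator().toSql(Project(Seq("name"), Filter("age > 30", Relation("Person"))))}} 
yields {{SELECT name FROM (SELECT * FROM Person WHERE age > 30) AS t1}}. The 
nested subquery is not optimal; flattening it would be a further step.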

In addition to this, MemSQL provides an option to execute an SQL query as is 
when the {{SQLContext.sql(..)}} method is called (i.e. it bypasses the Spark 
query parser/planner). I'm not sure this is really useful, because it implies 
adding another method on top of the standard API, but it's fairly easy to add, 
so it makes sense to do the same.

> Investigate how Ignite can support Spark DataFrame
> --
>
> Key: IGNITE-3084
> URL: https://issues.apache.org/jira/browse/IGNITE-3084
> Project: Ignite
>  Issue Type: Task
>  Components: Ignite RDD
>Affects Versions: 1.5.0.final
>Reporter: Vladimir Ozerov
>Assignee: Valentin Kulichenko
>  Labels: bigdata
> Fix For: 2.0
>
>
> We see increasing demand for good DataFrame support in our Spark integration. 
> We need to investigate how we could do that.
> It looks like we can investigate how MemSQL does it and take that as a 
> starting point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (IGNITE-3084) Investigate how Ignite can support Spark DataFrame

2017-01-02 Thread Vladimir Ozerov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794332#comment-15794332
 ] 

Vladimir Ozerov commented on IGNITE-3084:
-

Val,

Cool analysis! I would say that executing a query on a partition is a very 
useful feature. Not only will it help us with Spark, but it will also allow us 
to perform certain useful SQL optimizations (e.g. IGNITE-4509 and IGNITE-4510).

I am not quite sure I understand how to work with plans and strategies. Does it 
mean that we will have to analyze the SQL somehow (e.g. build an AST) to give 
correct hints to Spark?




[jira] [Commented] (IGNITE-3084) Investigate how Ignite can support Spark DataFrame

2017-01-02 Thread Valentin Kulichenko (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15794210#comment-15794210
 ] 

Valentin Kulichenko commented on IGNITE-3084:
-

I did some investigation, and here is what, in my view, needs to be done to 
support integration between Ignite and Spark DataFrame.

# Provide an implementation of {{BaseRelation}} mixed with {{PrunedFilteredScan}}. 
It should be able to execute a query based on the provided filters and selected 
fields and return an RDD that iterates through the results. Since an RDD works 
at the partition level, we will most likely need to add the ability to run an 
SQL query on a particular partition.
# Provide an implementation of {{Catalog}} to properly look up Ignite relations.
# Create an {{IgniteSQLContext}} that overrides the catalog.
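
The first step can be sketched as follows: turn the selected fields and 
pushed-down filters into a parameterized SQL query, then plan one scan per 
partition. This is only an illustration. {{SourceFilter}} and its subclasses 
are hypothetical stand-ins that mirror the shape of Spark's source filters, and 
{{ScanQuery}}, {{PartitionScan}} and {{ScanQueryBuilder}} are invented names. 
Running a query on a single partition is assumed here; it is the capability 
that would have to be added to Ignite.

```scala
// Hypothetical stand-ins mirroring the shape of pushed-down source filters.
sealed trait SourceFilter
final case class EqualTo(attribute: String, value: Any) extends SourceFilter
final case class GreaterThan(attribute: String, value: Any) extends SourceFilter
final case class LessThan(attribute: String, value: Any) extends SourceFilter

// SQL text with '?' placeholders plus the arguments to bind, in order.
final case class ScanQuery(sql: String, args: Seq[Any])

// One unit of work: the same query restricted to a single partition.
// Per-partition execution is an ASSUMED capability, not an existing API.
final case class PartitionScan(partition: Int, query: ScanQuery)

object ScanQueryBuilder {
  def build(table: String,
            requiredColumns: Seq[String],
            filters: Seq[SourceFilter]): ScanQuery = {
    // Simplification: an empty projection falls back to '*'.
    val columns =
      if (requiredColumns.isEmpty) "*" else requiredColumns.mkString(", ")

    // Each filter becomes a predicate with a placeholder and a bound value.
    val predicates: Seq[(String, Any)] = filters.map {
      case EqualTo(attr, v)     => (s"$attr = ?", v)
      case GreaterThan(attr, v) => (s"$attr > ?", v)
      case LessThan(attr, v)    => (s"$attr < ?", v)
    }

    val where =
      if (predicates.isEmpty) ""
      else predicates.map(_._1).mkString(" WHERE ", " AND ", "")

    ScanQuery(s"SELECT $columns FROM $table$where", predicates.map(_._2))
  }

  // Plans one scan per partition, matching the way an RDD computes its
  // result partition by partition.
  def perPartition(query: ScanQuery, partitionCount: Int): Seq[PartitionScan] =
    (0 until partitionCount).map(p => PartitionScan(p, query))
}
```

Binding values through placeholders rather than concatenating them into the 
SQL text avoids quoting problems for string values.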

The steps above will add a new data source to Spark. However, while Spark is 
executing a query, it generally first fetches data from the source into its own 
memory to create RDDs. This is therefore not enough for Ignite, because we 
already have the data in memory. When only Ignite data participates in the 
query, we want Spark to issue the query directly to Ignite.

To accomplish this, we can provide our own implementation of {{Strategy}}, which 
Spark uses to convert a logical plan to a physical plan. For any type of 
{{LogicalPlan}}, this custom strategy should be able to generate an SQL query 
for Ignite based on the whole plan tree. If there are non-Ignite relations in 
the plan, we should fall back to native Spark strategies (return {{Nil}} as the 
physical plan).
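
The fallback rule can be sketched like this. The types below are hypothetical 
stand-ins, not Spark's {{Strategy}}, {{LogicalPlan}} or {{SparkPlan}} classes, 
and the {{Planner}} is a simplified model that takes the first plan produced 
by any strategy, in order. It assumes the Ignite strategy is consulted before 
the default one.

```scala
// Hypothetical stand-ins for logical plan nodes; not Spark's real classes.
sealed trait LogicalNode { def children: Seq[LogicalNode] }
final case class IgniteTable(name: String) extends LogicalNode {
  def children: Seq[LogicalNode] = Nil
}
final case class OtherSource(name: String) extends LogicalNode {
  def children: Seq[LogicalNode] = Nil
}
final case class Select(condition: String, child: LogicalNode) extends LogicalNode {
  def children: Seq[LogicalNode] = Seq(child)
}
final case class JoinNode(left: LogicalNode, right: LogicalNode) extends LogicalNode {
  def children: Seq[LogicalNode] = Seq(left, right)
}

// Stand-in for a physical plan; only records which engine would run it.
final case class PhysicalPlan(description: String)

// A strategy returns zero or more candidate physical plans.
trait PlanStrategy {
  def apply(plan: LogicalNode): Seq[PhysicalPlan]
}

object IgniteStrategy extends PlanStrategy {
  // True only if every leaf relation in the tree is an Ignite table.
  private def onlyIgnite(plan: LogicalNode): Boolean = plan match {
    case _: IgniteTable => true
    case _: OtherSource => false
    case node           => node.children.forall(onlyIgnite)
  }

  // Returning Nil means "this strategy does not apply"; the planner
  // then moves on to the next strategy.
  def apply(plan: LogicalNode): Seq[PhysicalPlan] =
    if (onlyIgnite(plan)) Seq(PhysicalPlan("ignite-sql")) else Nil
}

// Models the native strategies, which can plan anything.
object DefaultStrategy extends PlanStrategy {
  def apply(plan: LogicalNode): Seq[PhysicalPlan] =
    Seq(PhysicalPlan("spark-native"))
}

object Planner {
  // Simplified: first candidate from the first strategy that yields one.
  def plan(strategies: Seq[PlanStrategy], logical: LogicalNode): PhysicalPlan =
    strategies.view
      .flatMap(s => s.apply(logical))
      .headOption
      .getOrElse(sys.error("no strategy produced a plan"))
}
```

With {{Seq(IgniteStrategy, DefaultStrategy)}}, a filter over an Ignite table is 
planned as {{ignite-sql}}, while a join between an Ignite table and any other 
source falls back to {{spark-native}}.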

{{IgniteSQLContext}} should append the custom strategy to the collection of 
Spark strategies. Here is a good example of how a custom strategy can be created 
and injected: https://gist.github.com/marmbrus/f3d121a1bc5b6d6b57b9
