Re: V2.3 Scala API to Github Links Incorrect

2018-04-15 Thread Hyukjin Kwon
[+Yuming]

Will try to have some time to take a look as well.



Re: V2.3 Scala API to Github Links Incorrect

2018-04-15 Thread Thakrar, Jayesh
Thanks Sameer!



Re: V2.3 Scala API to Github Links Incorrect

2018-04-15 Thread Sameer Agarwal
[+Hyukjin]

Thanks for flagging this Jayesh. https://github.com/apache/spark-website/pull/111 is tracking a short-term fix to the API docs and https://issues.apache.org/jira/browse/SPARK-23732 tracks the fix to the release scripts.

Regards,
Sameer




[discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-15 Thread Wenchen Fan
Hi all,

I'd like to propose an API change to the data source v2.

One design goal of data source v2 is API type safety. The FileFormat API is a bad example: it asks the implementation to return InternalRow even when the data is actually a ColumnarBatch. In data source v2 we added a type parameter to DataReader/WriterFactory and DataReader/Writer, so that a data source supporting columnar scan can return ColumnarBatch at the API level.
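
For context, here is a simplified sketch of the current shape of the scan API (from memory, so names are approximate; the real interfaces live in org.apache.spark.sql.sources.v2.reader):

import org.apache.spark.sql.vectorized.ColumnarBatch

// Simplified sketch of the current (pre-proposal) API: the output type
// lives in the factory's type parameter.
trait DataReader[T] {
  def next(): Boolean
  def get(): T
}

trait DataReaderFactory[T] extends Serializable {
  def createDataReader(): DataReader[T]
}

trait DataSourceReader // schema/planning methods elided

// Mixin that a DataSourceReader uses to switch the whole scan to
// ColumnarBatch instead of the default row-based output.
trait SupportsScanColumnarBatch extends DataSourceReader {
  def createBatchDataReaderFactories(): java.util.List[DataReaderFactory[ColumnarBatch]]
}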

However, we met some problems when migrating the streaming and file-based data sources to data source v2.

For the streaming side, we need a variant of DataReader/WriterFactory that adds streaming-specific concepts like epoch id and offset. For details please see ContinuousDataReaderFactory and
https://docs.google.com/document/d/1PJYfb68s2AG7joRWbhrgpEWhrsPqbhyRwUVl9V1wPOE/edit#

But this conflicts with the special format mixin traits like SupportsScanColumnarBatch. We have to make the streaming variant of DataReader/WriterFactory extend the original DataReader/WriterFactory and do type casts at runtime, which is unnecessary and violates type safety. The toy model below illustrates the cast.
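
Here is that toy, self-contained and with illustrative names only (not Spark's actual streaming code):

// Toy model: the format mixins hand the engine base-typed factories,
// so reaching the streaming-specific method needs a runtime downcast.
object StreamingCastSketch {
  trait DataReaderFactory[T] { def createDataReader(): T }

  // Streaming needs extra inputs (e.g. an epoch id), so its factory
  // extends the base factory whose type parameter is already fixed.
  trait ContinuousDataReaderFactory[T] extends DataReaderFactory[T] {
    def createDataReaderForEpoch(epochId: Long): T
  }

  def launchContinuousScan(factories: Seq[DataReaderFactory[String]]): Unit =
    factories.foreach { f =>
      // Unchecked downcast at runtime -- the type-unsafety in question.
      val cf = f.asInstanceOf[ContinuousDataReaderFactory[String]]
      cf.createDataReaderForEpoch(epochId = 0L)
    }
}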

On the file-based data source side, we have a problem with code duplication. Let's take the ORC data source as an example. To support both unsafe row and columnar batch scans, we need something like:
// A lot of parameters to carry to the executor side
class OrcUnsafeRowFactory(...) extends DataReaderFactory[UnsafeRow] {
  def createDataReader ...
}

class OrcColumnarBatchFactory(...) extends DataReaderFactory[ColumnarBatch] {
  def createDataReader ...
}

class OrcDataSourceReader extends DataSourceReader {
  // logic to prepare the parameters and create factories
  def createUnsafeRowFactories = ...

  // logic to prepare the parameters and create factories
  def createColumnarBatchFactories = ...
}

You can see that we have duplicated logic for preparing the parameters and defining the factories.

Here I propose to remove all the special format mixin traits and change the factory interface to:

public enum DataFormat {
  ROW,
  INTERNAL_ROW,
  UNSAFE_ROW,
  COLUMNAR_BATCH
}

interface DataReaderFactory {
  // The format this factory produces; Spark dispatches on it.
  DataFormat dataFormat();

  default DataReader<Row> createRowDataReader() {
    throw new IllegalStateException();
  }

  default DataReader<UnsafeRow> createUnsafeRowDataReader() {
    throw new IllegalStateException();
  }

  default DataReader<ColumnarBatch> createColumnarBatchDataReader() {
    throw new IllegalStateException();
  }
}

Spark will look at the dataFormat and decide which of the create-data-reader methods to call; a rough sketch of that dispatch follows.
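
A hedged sketch of the engine-side dispatch (illustrative only, not actual Spark internals; INTERNAL_ROW would presumably get a matching create method as well), written in Scala against the Java interface above:

object DispatchSketch {
  def createReader(factory: DataReaderFactory): DataReader[_] =
    factory.dataFormat() match {
      case DataFormat.ROW            => factory.createRowDataReader()
      case DataFormat.UNSAFE_ROW     => factory.createUnsafeRowDataReader()
      case DataFormat.COLUMNAR_BATCH => factory.createColumnarBatchDataReader()
      case other =>
        throw new UnsupportedOperationException(s"unsupported format: $other")
    }
}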

Now we don't have the problem on the streaming side, as these special format mixin traits go away. And the ORC data source can also be simplified to:

class OrcReaderFactory(...) extends DataReaderFactory {
  def createUnsafeRowReader ...

  def createColumnarBatchReader ...
}

class OrcDataSourceReader extends DataSourceReader {
  // logic to prepare the parameters and create factories
  def createReadFactories = ...
}

We also have a potential benefit of supporting hybrid storage data sources, which may keep real-time data in row format and historical data in columnar format. Such a source can make some of its DataReaderFactory instances output InternalRow and others output ColumnarBatch, as in the sketch below.
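
A hedged, self-contained toy of that idea (illustrative names, with the proposed enum modeled as a Scala ADT rather than the Java enum above):

object HybridSketch {
  sealed trait DataFormat
  case object INTERNAL_ROW   extends DataFormat
  case object COLUMNAR_BATCH extends DataFormat

  trait DataReaderFactory { def dataFormat: DataFormat }

  // Real-time partitions scan as rows, historical partitions as batches.
  case class RowFactory(partition: String) extends DataReaderFactory {
    val dataFormat = INTERNAL_ROW
  }
  case class BatchFactory(partition: String) extends DataReaderFactory {
    val dataFormat = COLUMNAR_BATCH
  }

  // One scan returns factories of both formats; the engine dispatches
  // per factory instead of per data source.
  def createReadFactories(realTime: Seq[String], history: Seq[String]): Seq[DataReaderFactory] =
    realTime.map(RowFactory) ++ history.map(BatchFactory)

  def main(args: Array[String]): Unit =
    createReadFactories(Seq("live-0"), Seq("hist-2016", "hist-2017")).foreach(println)
}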

Thoughts?


V2.3 Scala API to Github Links Incorrect

2018-04-15 Thread Thakrar, Jayesh
In browsing through the API docs, the links to Github source code seem to be 
pointing to a dev branch rather than the release branch.

Here's one example:
Go to the API doc page below and click on the "ProcessingTime.scala" link which 
points to Sameer's dev branch.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.ProcessingTime

https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/streaming/ProcessingTime.scala

Any chance this can be corrected please?

BTW, I know working and executing on a release is an arduous task, so thanks 
for all the effort, Sameer and the dev/release team and contributors!

Thanks,
Jayesh